OCR for your old scanner/printer’s E-mails

Yesterday I explained how to make a scrap computer do OCR on your scanner/printer’s scanned PDFs, provided you have an SMB file share (a Windows file share) that the printer can write to.

I also promised I would make the printer’s E-mail feature send E-mails whose PDF attachments have been OCR scanned.

I had earlier explained how you can make your old scanner/printer support modern SMTP servers that have TLS, by introducing a scrap computer running Postfix to forward the E-mails for you. This article depends on that, of course, as we will now let the scrap computer do the OCR part too. If you have not set that up yet, do so first before continuing here.

I looked at Xavier Mertens’s CuckooMX, and decided to massacre it until it would do what I want it to do: call ocrmypdf on the application/pdf attachments and then add the resulting PDF/A (which will have OCR text) to the E-mail.

First install some extra software: apt-get install libmime-tools-perl. It will provide you with MIME::Tools; we will use MIME::Parser and MIME::Entity.

Create a Perl script called /usr/local/bin/ocrpdf.pl (chmod 755 it) that looks like this (which is Xavier’s CuckooMX massacred and reduced to what I need – Sorry Xavier. Maybe we could try to make CuckooMX have a plugin-like infrastructure? But writing what looks suspicious to a database ain’t what I’m aiming for here):


#!/usr/bin/perl

# Copyright note

use Digest::MD5;
use File::Path qw(make_path remove_tree);
use File::Temp;
use MIME::Parser;
use Sys::Syslog;
use strict;
use warnings;

use constant EX_TEMPFAIL => 75; # Mail sent to the deferred queue (retry)
use constant EX_UNAVAILABLE => 69; # Mail bounced to the sender (undeliverable)

my $syslogProgram	= "ocrpdf";
my $sendmailPath	= "/usr/sbin/sendmail";
my $syslogFacility	= "mail";
my $outputDir		= "/var/ocrpdf";
my $ocrmypdf		= "/usr/bin/ocrmypdf";

# Create our working directory
$outputDir = $outputDir . '/' . $$;
if (! -d $outputDir && !make_path("$outputDir", { mode => 0700 })) {
  syslogOutput("mkdir($outputDir) failed: $!");
  exit EX_TEMPFAIL;
}

# Save the mail from STDIN
if (!open(OUT, ">$outputDir/content.tmp")) {
  syslogOutput("Write to \"$outputDir/content.tmp\" failed: $!");
  exit EX_TEMPFAIL;
}
while (<STDIN>) {
  print OUT $_;
}
close(OUT);

# Save the sender & recipients passed by Postfix
if (!open(OUT, ">$outputDir/args.tmp")) {
  syslogOutput("Write to \"$outputDir/args.tmp\" failed: $!");
  exit EX_TEMPFAIL;
}
foreach my $arg (@ARGV) {
  print OUT $arg . " ";
}
close(OUT);

# Extract MIME types from the message
my $parser = new MIME::Parser;
$parser->output_dir($outputDir);
my $entity = $parser->parse_open("$outputDir/content.tmp");

# Extract sender and recipient(s)
my $headers = $entity->head;
my $from = $headers->get('From');
my $to = $headers->get('To');
my $subject = $headers->get('Subject');
chomp($from);
chomp($subject);

syslogOutput("Processing mail from: $from ($subject)");

processMIMEParts($entity);
deliverMail($entity);
remove_tree($outputDir) or syslogOutput("Cannot delete \"$outputDir\": $!");

exit 0;

sub processMIMEParts
{
  my $entity = shift || return;
  for my $part ($entity->parts) {
    if($part->mime_type eq 'multipart/alternative' ||
       $part->mime_type eq 'multipart/related' ||
       $part->mime_type eq 'multipart/mixed' ||
       $part->mime_type eq 'multipart/signed' ||
       $part->mime_type eq 'multipart/report' ||
       $part->mime_type eq 'message/rfc822' ) {
         # Recursively process the message
         processMIMEParts($part);
     } else {
       if( $part->mime_type eq 'application/pdf' ) {
         my $type = lc  $part->mime_type;
         my $bh = $part->bodyhandle;
         syslogOutput("OCR for: \"" . $bh->{MB_Path} . "\" (" . $type . ") to \"" . $bh->{MB_Path} . ".ocr.pdf" . "\"" );
         # Perform the OCR scan, output to a new file
         system($ocrmypdf, $bh->{MB_Path}, $bh->{MB_Path} . ".ocr.pdf");
         # Add the new file as attachment
         $entity->attach(Path   => $bh->{MB_Path} . ".ocr.pdf",
                         Type   => "application/pdf",
                         Encoding => "base64");
      }
     }
   }
   return;
}

#
# deliverMail - Send the mail back
#
sub deliverMail {
  my $entity = shift || return;

  # Write the changed entity to a temporary file
  if (! open(FH, '>', "$outputDir/outfile.tmp")) {
    syslogOutput("deliverMail: cannot write $outputDir/outfile.tmp: $!");
    exit EX_UNAVAILABLE;
  }
  $entity->print(\*FH);
  close(FH);

  # Read saved arguments
  if (! open(IN, "<$outputDir/args.tmp")) {
    syslogOutput("deliverMail: Cannot read $outputDir/args.tmp: $!");
    exit EX_TEMPFAIL;
  }
  my $sendmailArgs = <IN>;
  close(IN);
	
  # Read mail content from temporary file of changed entity
  if (! open(IN, "<$outputDir/outfile.tmp")) {
    syslogOutput("deliverMail: Cannot read $outputDir/outfile.tmp: $!");
    exit EX_UNAVAILABLE;
  }
	
  # Spawn a sendmail process
  syslogOutput("Spawn=$sendmailPath -G -i $sendmailArgs");
  if (! open(SENDMAIL, "|$sendmailPath -G -i $sendmailArgs")) {
    syslogOutput("deliverMail: Cannot spawn: $sendmailPath $sendmailArgs: $!");
    exit EX_TEMPFAIL;
  }
  while (<IN>) {
    print SENDMAIL $_;
  }
  close(IN);
  close(SENDMAIL);
}

#
# Send Syslog message using the defined facility
#
sub syslogOutput {
  my $msg = shift or return(0);
  openlog($syslogProgram, 'pid', $syslogFacility);
  syslog('info', '%s', $msg);
  closelog();
}

Now we just do what Xavier’s CuckooMX documentation also tells you to do: add it to master.cf:

Create a UNIX user: adduser ocrpdf

Change the smtp service:

smtp      inet  n       -       -       -       -       smtpd
  -o content_filter=ocrpdf

Create a new service

ocrpdf  unix  -       n       n       -       -       pipe
  user=ocrpdf argv=/usr/local/bin/ocrpdf.pl -f ${sender} ${recipient}
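
To try the filter without walking to the printer each time, you can feed it a hand-made mail. The following is only a sketch: make_test_mail is a hypothetical helper, the boundary string and the example.org addresses are placeholders, and it assumes the base64 tool from coreutils is installed. It prints a minimal multipart/mixed message with one PDF attached.

```shell
#!/bin/sh
# Hypothetical helper (not part of ocrpdf.pl): print a minimal multipart/mixed
# mail with one PDF attachment, suitable for piping into the filter.
make_test_mail() {
  pdf=$1
  from=$2
  to=$3
  boundary="ocrpdf-test-boundary"   # placeholder boundary string
  name=$(basename "$pdf")
  printf 'From: %s\n' "$from"
  printf 'To: %s\n' "$to"
  printf 'Subject: ocrpdf test\n'
  printf 'MIME-Version: 1.0\n'
  printf 'Content-Type: multipart/mixed; boundary="%s"\n\n' "$boundary"
  # First part: a short text body
  printf '%s\n' "--$boundary"
  printf 'Content-Type: text/plain\n\nScan attached.\n'
  # Second part: the PDF, base64 encoded
  printf '%s\n' "--$boundary"
  printf 'Content-Type: application/pdf; name="%s"\n' "$name"
  printf 'Content-Transfer-Encoding: base64\n'
  printf 'Content-Disposition: attachment; filename="%s"\n\n' "$name"
  base64 "$pdf"
  # Closing boundary
  printf '%s\n' "--$boundary--"
}
```

Something like make_test_mail scan.pdf printer@example.org me@example.org | /usr/local/bin/ocrpdf.pl -f printer@example.org me@example.org, run as the ocrpdf user, should then exercise the script, assuming /var/ocrpdf is writable by that user. Note that it will really hand the result to sendmail.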

OCR for your old printer/scanner

Modern printers can do OCR on your scans. But as we talked about last time, not all printers or scanners are modern.

We have a scrap computer that is (already) catching all E-mails on a badly configured local SMTP server, to then forward them to a well configured SMTP server that has TLS. Now we also want to do OCR on the scanned PDFs.

My printer has a so-called Network Scan function that scans to an SMB file share (that’s a Windows share). The scrap computer is configured to share /var/scan using Samba as ‘share’, of course. The printer is configured to use that share. Note that for very old printers you might need this in smb.conf:

client min protocol = LANMAN1
server min protocol = LANMAN1
client lanman auth = yes
client ntlmv2 auth = no
client plaintext auth = yes
ntlm auth = yes
security = share

And of course also something like this:

[scan]
path = /var/scan
writable = yes
browsable = yes
guest ok = yes
public = yes
create mask = 0777

First install software: apt-get install ocrmypdf inotify-tools screen bash

We need a script to perform an OCR scan on a PDF. Here we’ll use it from another script that monitors /var/scan for changes. Later in another post I’ll explain how to use it from Postfix’s master.cf on the attachments of an E-mail. Here is /usr/local/bin/fixpdf.sh:

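#!/bin/sh
# Usage: fixpdf.sh FILENAME (a file name relative to /var/scan)
a=$1
TMP=$(mktemp -d -t XXXXX)
DIR=/var/scan
mkdir -p "$DIR/ocr"
cd "$DIR" || exit 1
# Modification time of the scan, in seconds since the epoch
TIMESTAMP=$(stat -c %Y "$a")
ocrmypdf --force-ocr "$a" "$TMP/OCR-$a"
mv -f "$TMP/OCR-$a" "$DIR/ocr/$TIMESTAMP-$a"
chmod 777 "$DIR/ocr/$TIMESTAMP-$a"
cd /tmp
rm -rf "$TMP"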

Note that I prepend the filename with a timestamp. That’s because my printer has no way to give the scanned files a good filename that I can use for my archiving purposes. You can of course do this differently.
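
As one example of doing it differently, here is a small sketch that builds a human-readable prefix from the file’s modification time instead of the raw number of seconds. The function name archive_name and the date format are my own choices for illustration, and it assumes GNU stat and GNU date (which understand -c %Y and -d @SECONDS).

```shell
#!/bin/sh
# Hypothetical helper: print "<YYYYMMDD-HHMMSS>-<basename>" for a scanned file,
# using the file's modification time in UTC.
archive_name() {
  file=$1
  # GNU stat: %Y is the modification time in seconds since the epoch
  ts=$(stat -c %Y "$file") || return 1
  # GNU date: -d @SECONDS converts an epoch value, -u keeps it in UTC
  stamp=$(date -u -d "@$ts" +%Y%m%d-%H%M%S) || return 1
  printf '%s-%s\n' "$stamp" "$(basename "$file")"
}
```

A sortable date in front keeps the files in scan order in any file browser, which is the main thing I want from an archive name.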

Now we want a script that monitors /var/scan and launches that fixpdf.sh script in the background each time a file is created.

My Xerox WorkCentre 7232 uses a directory called SCANFILE.LCK/ for its own file locking. When it is finished with a SCANFILE.PDF it deletes that LCK directory.

Being bad software developers, the Xerox people didn’t use a POSIX rename for SCANFILE.PDF to make the write appear atomically at the end.
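
For reference, this is roughly what the rename approach looks like when you do it yourself in shell. It is a sketch, not what any printer actually runs: atomic_put is a hypothetical helper name, and it relies on mv performing a rename when source and destination are in the same directory.

```shell
#!/bin/sh
# Hypothetical helper: copy SRC to DEST so that DEST only ever appears complete.
atomic_put() {
  src=$1
  dest=$2
  # The temporary file must live in the destination directory: a rename is
  # only atomic within one filesystem.
  tmp=$(mktemp "$(dirname "$dest")/.tmp.XXXXXX") || return 1
  if cp "$src" "$tmp"; then
    # Within one directory, mv is a rename: readers see either no DEST or
    # the complete DEST, never a half-written file.
    mv -f "$tmp" "$dest"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

A watcher then only needs to react to the final name showing up, and no lock directory is required.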

It looks like this:

inotifywait -r -m  /var/scan | 
while read file_path file_event file_name; do
echo ${file_path}${file_name} event: ${file_event}
done
Setting up watches. Beware: since -r was given, this may take a while!
Watches established.
/var/scan/ event: OPEN,ISDIR
/var/scan/ event: ACCESS,ISDIR
/var/scan/ event: ACCESS,ISDIR
/var/scan/ event: CLOSE_NOWRITE,CLOSE,ISDIR
/var/scan/ event: OPEN,ISDIR
/var/scan/ event: ACCESS,ISDIR
/var/scan/ event: ACCESS,ISDIR
/var/scan/ event: CLOSE_NOWRITE,CLOSE,ISDIR
/var/scan/XEROXSCAN003.LCK event: CREATE,ISDIR
/var/scan/XEROXSCAN003.LCK event: OPEN,ISDIR
/var/scan/XEROXSCAN003.LCK event: ACCESS,ISDIR
/var/scan/XEROXSCAN003.LCK event: CLOSE_NOWRITE,CLOSE,ISDIR
/var/scan/ event: OPEN,ISDIR
/var/scan/ event: ACCESS,ISDIR
/var/scan/ event: ACCESS,ISDIR
/var/scan/ event: CLOSE_NOWRITE,CLOSE,ISDIR
/var/scan/ event: OPEN,ISDIR
/var/scan/ event: ACCESS,ISDIR
/var/scan/ event: CLOSE_NOWRITE,CLOSE,ISDIR
/var/scan/XEROXSCAN003.PDF event: CREATE
/var/scan/XEROXSCAN003.PDF event: OPEN
/var/scan/XEROXSCAN003.PDF event: MODIFY
/var/scan/XEROXSCAN003.PDF event: MODIFY
...
/var/scan/XEROXSCAN003.PDF event: MODIFY
/var/scan/XEROXSCAN003.PDF event: MODIFY
/var/scan/XEROXSCAN003.PDF event: CLOSE_WRITE,CLOSE
/var/scan/XEROXSCAN003.PDF event: ATTRIB
/var/scan/XEROXSCAN003.LCK event: OPEN,ISDIR
/var/scan/XEROXSCAN003.LCK/ event: OPEN,ISDIR
/var/scan/XEROXSCAN003.LCK event: ACCESS,ISDIR
/var/scan/XEROXSCAN003.LCK/ event: ACCESS,ISDIR
/var/scan/XEROXSCAN003.LCK event: ACCESS,ISDIR
/var/scan/XEROXSCAN003.LCK/ event: ACCESS,ISDIR
/var/scan/XEROXSCAN003.LCK event: CLOSE_NOWRITE,CLOSE,ISDIR
/var/scan/XEROXSCAN003.LCK/ event: CLOSE_NOWRITE,CLOSE,ISDIR
/var/scan/XEROXSCAN003.LCK/ event: DELETE_SELF
/var/scan/XEROXSCAN003.LCK event: DELETE,ISDIR

The printer deleting that SCANFILE.LCK/ directory is a good moment to start our OCR script (call it for example /usr/local/bin/monitorscan.sh):

#!/bin/bash
# Watch /var/scan and start the OCR when the printer removes its lock directory
inotifywait -r -m -e DELETE,ISDIR /var/scan |
while read -r file_path file_event file_name; do
  if [ "${file_event}" = "DELETE,ISDIR" ]; then
    if [[ ${file_name} == *"LCK" ]]; then
      suffix=".LCK"
      # XEROXSCAN003.LCK becomes XEROXSCAN003.PDF
      filename=$(echo "${file_name}" | sed -e "s/$suffix$//").PDF
      /usr/local/bin/fixpdf.sh "$filename" &
    fi
  fi
done

Give both scripts 755 permissions with chmod and now you just run screen /usr/local/bin/monitorscan.sh
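
The step from the lock directory’s name to the PDF’s name does not really need sed. A minimal sketch using plain shell suffix removal follows; pdf_for_lock is a hypothetical name, and the .LCK to .PDF mapping is simply what my Xerox does.

```shell
#!/bin/sh
# Hypothetical helper: map a lock directory name to the PDF it protects,
# e.g. XEROXSCAN003.LCK -> XEROXSCAN003.PDF. Fails for any other name.
pdf_for_lock() {
  case $1 in
    *.LCK) printf '%s.PDF\n' "${1%.LCK}" ;;   # strip the suffix, add .PDF
    *) return 1 ;;
  esac
}
```

Unlike the sed expression, where the dot in .LCK matches any character, this only accepts names that literally end in .LCK.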

If your printer’s software was written by good software developers, it will do a POSIX rename. That looks like this (yes, also when done over an SMB network share):

inotifywait -r -m  /var/scan | 
while read file_path file_event file_name; do
echo ${file_path}${file_name} event: ${file_event}
done
Setting up watches. Beware: since -r was given, this may take a while!
Watches established.
/var/scan/ event: OPEN,ISDIR
/var/scan/ event: ACCESS,ISDIR
/var/scan/ event: CLOSE_NOWRITE,CLOSE,ISDIR
/var/scan/ event: OPEN,ISDIR
/var/scan/ event: ACCESS,ISDIR
/var/scan/ event: CLOSE_NOWRITE,CLOSE,ISDIR
/var/scan/ event: OPEN,ISDIR
/var/scan/ event: ACCESS,ISDIR
/var/scan/ event: CLOSE_NOWRITE,CLOSE,ISDIR
/var/scan/ event: OPEN,ISDIR
/var/scan/ event: ACCESS,ISDIR
/var/scan/ event: CLOSE_NOWRITE,CLOSE,ISDIR
/var/scan/.tmp123.GOODBRANDSCAN-123.PDF event: CREATE
/var/scan/.tmp123.GOODBRANDSCAN-123.PDF event: OPEN
/var/scan/.tmp123.GOODBRANDSCAN-123.PDF event: MODIFY
...
/var/scan/.tmp123.GOODBRANDSCAN-123.PDF event: MOVED_FROM
/var/scan/GOODBRANDSCAN-123.PDF event: MOVED_TO

That means that your parameters for inotifywait could be -r -m -e MOVED_TO and in ${file_name} you’ll have that GOODBRANDSCAN-123.PDF. This is of course better than Xerox’s way with their not-invented-here LCK things, which probably wouldn’t be necessary at all with a POSIX rename call.
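
A sketch of how the monitoring loop could look for such a printer. I have no printer like this to try it on, so treat it as untested against real hardware. renamed_pdfs is a hypothetical function that reads inotifywait’s default output lines on stdin and prints the full path of every PDF that was renamed into place. Skipping the ocr/ subdirectory is my own precaution: with -r, fixpdf.sh moving its result into /var/scan/ocr could otherwise show up as a MOVED_TO event too.

```shell
#!/bin/sh
# Hypothetical filter: reads lines of the form "dir/ EVENTS name" (inotifywait's
# default output) and prints the path of each PDF that was renamed into place.
renamed_pdfs() {
  while read -r file_path file_event file_name; do
    # Only completed renames are interesting
    case $file_event in
      MOVED_TO) ;;
      *) continue ;;
    esac
    # Assumption: skip the directory fixpdf.sh writes its own output to
    case $file_path in
      */ocr/) continue ;;
    esac
    case $file_name in
      *.PDF|*.pdf) printf '%s%s\n' "$file_path" "$file_name" ;;
    esac
  done
}

# Intended use (not run here):
#   inotifywait -r -m -e MOVED_TO /var/scan | renamed_pdfs |
#   while read -r pdf; do /usr/local/bin/fixpdf.sh "$(basename "$pdf")" & done
```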

I will document how to do this to the E-mail feature of the printer with Postfix later.

I first need a moment in my life where I actually need this hard enough that I will start figuring out how to extract certain attachment MIME parts from an E-mail with Postfix’s master.cf. I guess I will have to look into CuckooMX by Xavier Mertens for that. Update: that article is available now.
