OCR for your old scanner/printer’s E-mails

Yesterday I explained how to make a scrap computer do OCR on your scanner/printer’s scanned PDFs if you have an SMB file share (a Windows file share) that the printer writes to.

I also promised to make the printer’s E-mail feature send E-mails whose PDF attachments have been OCR scanned.

I had earlier explained how you can make your old scanner/printer support modern SMTP servers that use TLS, by introducing a scrap computer running Postfix to forward the E-mails for you. This article depends on that, of course, as we will now let the same scrap computer do the OCR part. If you have not set that up yet, do so before continuing here.

I looked at Xavier Mertens’s CuckooMX, and decided to massacre it until it would do what I want it to do: call ocrmypdf on the application/pdf attachments and then add the resulting PDF/A (which will have OCR text) to the E-mail.
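Before wiring anything into Postfix, it may help to check by hand that ocrmypdf works on one of your scans. This is only a small sketch: the helper names and scan.pdf are made up for illustration, and the output name simply mirrors the "input path plus .ocr.pdf" convention used by the filter script.

```shell
# Hypothetical helper: derive the output name the same way the filter
# script does (input path with ".ocr.pdf" appended).
ocr_output_path() {
  printf '%s.ocr.pdf\n' "$1"
}

# Hypothetical helper: OCR a single PDF by hand and print the result path.
# ocrmypdf takes the input file first and the output file second.
ocr_one() {
  in="$1"
  out="$(ocr_output_path "$in")"
  ocrmypdf "$in" "$out" && printf '%s\n' "$out"
}

# Example (scan.pdf is a placeholder for one of your own scans):
#   ocr_one scan.pdf
```

If the resulting file opens and its text can be selected, the OCR side is working and any later trouble is more likely in the mail plumbing.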

First install some extra software: apt-get install libmime-tools-perl. It will provide you with MIME::Tools, of which we will use MIME::Parser and MIME::Entity.

Create a Perl script called /usr/local/bin/ocrpdf.pl (chmod 755 it) that looks like this. It is Xavier’s CuckooMX massacred and reduced to what I need – sorry Xavier. Maybe we could try to give CuckooMX a plugin-like infrastructure? But writing what looks suspicious to a database ain’t what I’m aiming for here:


#!/usr/bin/perl

# Copyright note

use Digest::MD5;
use File::Path qw(make_path remove_tree);
use File::Temp;
use MIME::Parser;
use Sys::Syslog;
use strict;
use warnings;

use constant EX_TEMPFAIL => 75; # Mail sent to the deferred queue (retry)
use constant EX_UNAVAILABLE => 69; # Mail bounced to the sender (undeliverable)

my $syslogProgram	= "ocrpdf";
my $sendmailPath	= "/usr/sbin/sendmail";
my $syslogFacility	= "mail";
my $outputDir		= "/var/ocrpdf";
my $ocrmypdf		= "/usr/bin/ocrmypdf";

# Create our working directory
$outputDir = $outputDir . '/' . $$;
if (! -d $outputDir && !make_path("$outputDir", { mode => 0700 })) {
  syslogOutput("mkdir($outputDir) failed: $!");
  exit EX_TEMPFAIL;
}

# Save the mail from STDIN
if (!open(OUT, ">$outputDir/content.tmp")) {
  syslogOutput("Write to \"$outputDir/content.tmp\" failed: $!");
  exit EX_TEMPFAIL;
}
while (<STDIN>) {
  print OUT $_;
}
close(OUT);

# Save the sender & recipients passed by Postfix
if (!open(OUT, ">$outputDir/args.tmp")) {
  syslogOutput("Write to \"$outputDir/args.tmp\" failed: $!");
  exit EX_TEMPFAIL;
}
foreach my $arg (@ARGV) {
  print OUT $arg . " ";
}
close(OUT);

# Extract MIME types from the message
my $parser = new MIME::Parser;
$parser->output_dir($outputDir);
my $entity = $parser->parse_open("$outputDir/content.tmp");

# Extract sender and recipient(s)
my $headers = $entity->head;
my $from = $headers->get('From');
my $to = $headers->get('To');
my $subject = $headers->get('Subject');
chomp($from);
chomp($subject);

syslogOutput("Processing mail from: $from ($subject)");

processMIMEParts($entity);
deliverMail($entity);
remove_tree($outputDir) or syslogOutput("Cannot delete \"$outputDir\": $!");

exit 0;

sub processMIMEParts
{
  my $entity = shift || return;
  for my $part ($entity->parts) {
    if($part->mime_type eq 'multipart/alternative' ||
       $part->mime_type eq 'multipart/related' ||
       $part->mime_type eq 'multipart/mixed' ||
       $part->mime_type eq 'multipart/signed' ||
       $part->mime_type eq 'multipart/report' ||
       $part->mime_type eq 'message/rfc822' ) {
         # Recursively process the message
         processMIMEParts($part);
     } else {
       if( $part->mime_type eq 'application/pdf' ) {
         my $type = lc  $part->mime_type;
         my $bh = $part->bodyhandle;
         syslogOutput("OCR for: \"" . $bh->{MB_Path} . "\" (" . $type . ") to \"" . $bh->{MB_Path} . ".ocr.pdf" . "\"" );
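         # Perform the OCR scan, output to a new file
         if (system($ocrmypdf, $bh->{MB_Path}, $bh->{MB_Path} . ".ocr.pdf") != 0) {
           # OCR failed: log it and keep only the original PDF in the mail
           syslogOutput("ocrmypdf failed for \"" . $bh->{MB_Path} . "\": exit status " . ($? >> 8));
           next;
         }
         # Add the new file as attachment
         $entity->attach(Path   => $bh->{MB_Path} . ".ocr.pdf",
                         Type   => "application/pdf",
                         Encoding => "base64");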
      }
     }
   }
   return;
}

#
# deliverMail - Send the mail back
#
sub deliverMail {
  my $entity = shift || return;

  # Write the changed entity to a temporary file
  if (! open(FH, '>', "$outputDir/outfile.tmp")) {
    syslogOutput("deliverMail: cannot write $outputDir/outfile.tmp: $!");
    exit EX_UNAVAILABLE;
  }
  $entity->print(\*FH);
  close(FH);

  # Read saved arguments
  if (! open(IN, "<$outputDir/args.tmp")) {
    syslogOutput("deliverMail: Cannot read $outputDir/args.tmp: $!");
    exit EX_TEMPFAIL;
  }
  my $sendmailArgs = <IN>;
  close(IN);
	
  # Read mail content from temporary file of changed entity
  if (! open(IN, "<$outputDir/outfile.tmp")) {
    syslogOutput("deliverMail: Cannot read $outputDir/outfile.tmp: $!");
    exit EX_UNAVAILABLE;
  }
	
  # Spawn a sendmail process
  syslogOutput("Spawn=$sendmailPath -G -i $sendmailArgs");
  if (! open(SENDMAIL, "|$sendmailPath -G -i $sendmailArgs")) {
    syslogOutput("deliverMail: Cannot spawn: $sendmailPath $sendmailArgs: $!");
    exit EX_TEMPFAIL;
  }
  while (<IN>) {
    print SENDMAIL $_;
  }
  close(IN);
  close(SENDMAIL);
}

#
# Send Syslog message using the defined facility
#
sub syslogOutput {
  my $msg = shift or return(0);
  openlog($syslogProgram, 'pid', $syslogFacility);
  syslog('info', '%s', $msg);
  closelog();
}

Now we just do what Xavier’s CuckooMX documentation also tells you to do, namely add it to master.cf.

Create a UNIX user: adduser ocrpdf
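The script creates a per-process directory under /var/ocrpdf while running as that user, so the parent directory presumably has to exist and be writable by ocrpdf. A minimal sketch, assuming you run it as root and kept the default $outputDir from the script; setup_workdir is a hypothetical helper name:

```shell
# Hypothetical helper: create the filter's working directory and hand it
# to the filter user. The defaults match $outputDir in the script and the
# ocrpdf user created above; both are assumptions about your setup.
setup_workdir() {
  dir="${1:-/var/ocrpdf}"
  owner="${2:-ocrpdf}"
  mkdir -p "$dir" && chmod 0700 "$dir" && chown "$owner" "$dir"
}

# As root:
#   setup_workdir
```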

Change the smtp service:

smtp      inet  n       -       -       -       -       smtpd
  -o content_filter=ocrpdf

Create a new service:

ocrpdf  unix  -       n       n       -       -       pipe
  user=ocrpdf argv=/usr/local/bin/ocrpdf.pl -f ${sender} ${recipient}
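After editing master.cf, Postfix has to pick up the change, and the filter's syslog lines are the quickest way to see whether a scan went through it. A rough sketch, assuming a Debian-style /var/log/mail.log (your system may log elsewhere, for instance to the journal); ocrpdf_log is a hypothetical helper name:

```shell
# Hypothetical helper: show only the filter's own syslog lines. The script
# logs with program name "ocrpdf" plus its pid, so lines contain "ocrpdf[".
# The default log path is an assumption; pass another file as $1 if needed.
ocrpdf_log() {
  grep -F 'ocrpdf[' "${1:-/var/log/mail.log}"
}

# As root, after editing master.cf:
#   postfix check && postfix reload
# Then send a scan from the printer and look at the most recent entries:
#   ocrpdf_log | tail
```

A "Processing mail from: ..." line followed by an "OCR for: ..." line suggests the attachment was found and handed to ocrmypdf.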