Modern printers can do OCR on your scans. But as we discussed last time, not all printers or scanners are modern.
We have a scrap computer that is already catching all E-mails on a badly configured local SMTP server and forwarding them to a well-configured SMTP server that has TLS. Now we also want to do OCR on the scanned PDFs.
My printer has a so-called Network Scan function that scans to an SMB file share (that’s a Windows share). The scrap computer is configured to share /var/scan over Samba, of course, and the printer is configured to use that share. Note that for very old printers you might need this in smb.conf:
client min protocol = LANMAN1
server min protocol = LANMAN1
client lanman auth = yes
client ntlmv2 auth = no
client plaintext auth = yes
ntlm auth = yes
security = share
And of course also something like this:
[scan]
 path = /var/scan
 writable = yes
 browsable = yes
 guest ok = yes
 public = yes
 create mask = 0777
First install software: apt-get install ocrmypdf inotify-tools screen bash
We need a script that runs OCR on a PDF. Here we’ll call it from another script that monitors /var/scan for changes. In a later post I’ll explain how to use it from Postfix’s master.cf on the attachments of an E-mail. Here is /usr/local/bin/fixpdf.sh:
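#!/bin/sh
 # Usage: fixpdf.sh FILE.PDF (a file name relative to /var/scan)
 a=$1
 TMP=$(mktemp -d -t XXXXX)
 DIR=/var/scan
 mkdir -p "$DIR/ocr"
 cd "$DIR" || exit 1
 # Modification time in seconds since the epoch, used as filename prefix
 TIMESTAMP=$(stat -c %Y "$a")
 ocrmypdf --force-ocr "$a" "$TMP/OCR-$a"
 mv -f "$TMP/OCR-$a" "$DIR/ocr/$TIMESTAMP-$a"
 chmod 777 "$DIR/ocr/$TIMESTAMP-$a"
 cd /tmp
 rm -rf "$TMP"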
Note that I prepend a timestamp to the filename. That’s because my printer has no way to give the scanned files a good filename that I can use for my archiving purposes. You can of course do this differently.
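As one example of doing it differently, the prefix could be a readable date instead of raw epoch seconds. This is only a sketch: stamped_name is a hypothetical helper, not part of fixpdf.sh, and it assumes GNU stat and GNU date (as found on Debian-like systems).

```shell
# Hypothetical helper: print "YYYYMMDD-HHMMSS-<name>" for a given file,
# based on the file's modification time. Assumes GNU stat and GNU date.
stamped_name() {
    # %Y is the modification time in seconds since the epoch.
    ts=$(stat -c %Y "$1") || return 1
    # For GNU date, "@N" means N seconds since the epoch.
    printf '%s-%s\n' "$(date -d "@$ts" +%Y%m%d-%H%M%S)" "$(basename "$1")"
}
```

In fixpdf.sh the result could then take the place of "$TIMESTAMP-$a". The date is rendered in the local time zone of the scrap computer.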
Now we want a script that monitors /var/scan and launches that fixpdf.sh script in the background each time a file is created.
My Xerox WorkCentre 7232 uses a directory called SCANFILE.LCK/ for its own file locking. When it is finished with a SCANFILE.PDF it deletes that LCK directory.
Being bad software developers, the Xerox people didn’t use a POSIX rename on SCANFILE.PDF to make the write atomic at the end.
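For reference, this is roughly what such an atomic write looks like when done from a shell. It is a minimal sketch, not anything the printer runs: atomic_write is a hypothetical helper, and it assumes GNU mktemp.

```shell
# Hypothetical helper: write stdin to "$1" so that readers never see a
# partially written file under the final name.
atomic_write() {
    dest=$1
    dir=$(dirname "$dest")
    # The temporary file must live in the same directory (same filesystem),
    # otherwise mv falls back to copy-and-delete and is no longer atomic.
    tmp=$(mktemp "$dir/.tmp.XXXXXX") || return 1
    if cat > "$tmp"; then
        # Within one filesystem mv is a rename: the final name appears at once.
        mv -f "$tmp" "$dest"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

Something like `cat scan.pdf | atomic_write /var/scan/SCAN.PDF` would then produce a single MOVED_TO event for SCAN.PDF once the data is complete, so no lock directory is needed.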
What the printer does instead looks like this under inotifywait:
inotifywait -r -m  /var/scan | 
    while read file_path file_event file_name; do 
           echo ${file_path}${file_name} event: ${file_event}
    done
Setting up watches.  Beware: since -r was given, this may take a while!
 Watches established.
 /var/scan/ event: OPEN,ISDIR
 /var/scan/ event: ACCESS,ISDIR
 /var/scan/ event: ACCESS,ISDIR
 /var/scan/ event: CLOSE_NOWRITE,CLOSE,ISDIR
 /var/scan/ event: OPEN,ISDIR
 /var/scan/ event: ACCESS,ISDIR
 /var/scan/ event: ACCESS,ISDIR
 /var/scan/ event: CLOSE_NOWRITE,CLOSE,ISDIR
 /var/scan/XEROXSCAN003.LCK event: CREATE,ISDIR
 /var/scan/XEROXSCAN003.LCK event: OPEN,ISDIR
 /var/scan/XEROXSCAN003.LCK event: ACCESS,ISDIR
 /var/scan/XEROXSCAN003.LCK event: CLOSE_NOWRITE,CLOSE,ISDIR
 /var/scan/ event: OPEN,ISDIR
 /var/scan/ event: ACCESS,ISDIR
 /var/scan/ event: ACCESS,ISDIR
 /var/scan/ event: CLOSE_NOWRITE,CLOSE,ISDIR
 /var/scan/ event: OPEN,ISDIR
 /var/scan/ event: ACCESS,ISDIR
 /var/scan/ event: CLOSE_NOWRITE,CLOSE,ISDIR
 /var/scan/XEROXSCAN003.PDF event: CREATE
 /var/scan/XEROXSCAN003.PDF event: OPEN
 /var/scan/XEROXSCAN003.PDF event: MODIFY
 /var/scan/XEROXSCAN003.PDF event: MODIFY
 ...
 /var/scan/XEROXSCAN003.PDF event: MODIFY
 /var/scan/XEROXSCAN003.PDF event: MODIFY
 /var/scan/XEROXSCAN003.PDF event: CLOSE_WRITE,CLOSE
 /var/scan/XEROXSCAN003.PDF event: ATTRIB
 /var/scan/XEROXSCAN003.LCK event: OPEN,ISDIR
 /var/scan/XEROXSCAN003.LCK/ event: OPEN,ISDIR
 /var/scan/XEROXSCAN003.LCK event: ACCESS,ISDIR
 /var/scan/XEROXSCAN003.LCK/ event: ACCESS,ISDIR
 /var/scan/XEROXSCAN003.LCK event: ACCESS,ISDIR
 /var/scan/XEROXSCAN003.LCK/ event: ACCESS,ISDIR
 /var/scan/XEROXSCAN003.LCK event: CLOSE_NOWRITE,CLOSE,ISDIR
 /var/scan/XEROXSCAN003.LCK/ event: CLOSE_NOWRITE,CLOSE,ISDIR
 /var/scan/XEROXSCAN003.LCK/ event: DELETE_SELF
 /var/scan/XEROXSCAN003.LCK event: DELETE,ISDIR
The printer deleting that SCANFILE.LCK/ directory is a good moment to start our OCR script (call it for example /usr/local/bin/monitorscan.sh):
#!/bin/bash
 # Watch /var/scan and run fixpdf.sh when the printer removes its .LCK directory
 inotifywait -r -m -e DELETE,ISDIR /var/scan |
    while read -r file_path file_event file_name; do
       if [ "${file_event}" = "DELETE,ISDIR" ]; then
         if [[ "${file_name}" == *"LCK" ]]; then
           # XEROXSCAN003.LCK -> XEROXSCAN003.PDF
           filename=$(echo "${file_name}" | sed -e 's/\.LCK$//').PDF
           /usr/local/bin/fixpdf.sh "$filename" &
         fi
       fi
    done
Give both scripts 755 permissions with chmod, and then just run screen /usr/local/bin/monitorscan.sh
If your printer’s firmware was written by good software developers, it will do a POSIX rename. That looks like this (yes, also when done over an SMB network share):
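The lock-name-to-PDF-name mapping in monitorscan.sh can also be done without spawning sed, using plain parameter expansion. A small sketch, where lck_to_pdf is a hypothetical helper name:

```shell
# Hypothetical helper: map the lock directory name to the PDF it guards.
# ${1%.LCK} strips a trailing ".LCK" (a literal dot, not a regex wildcard).
lck_to_pdf() {
    printf '%s\n' "${1%.LCK}.PDF"
}
```

Inside the loop this would read filename=$(lck_to_pdf "${file_name}").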
inotifywait -r -m  /var/scan | 
    while read file_path file_event file_name; do 
           echo ${file_path}${file_name} event: ${file_event}
    done
Setting up watches.  Beware: since -r was given, this may take a while!
 Watches established.
 /var/scan/ event: OPEN,ISDIR
 /var/scan/ event: ACCESS,ISDIR
 /var/scan/ event: CLOSE_NOWRITE,CLOSE,ISDIR
 /var/scan/ event: OPEN,ISDIR
 /var/scan/ event: ACCESS,ISDIR
 /var/scan/ event: CLOSE_NOWRITE,CLOSE,ISDIR
 /var/scan/ event: OPEN,ISDIR
 /var/scan/ event: ACCESS,ISDIR
 /var/scan/ event: CLOSE_NOWRITE,CLOSE,ISDIR
 /var/scan/ event: OPEN,ISDIR
 /var/scan/ event: ACCESS,ISDIR
 /var/scan/ event: CLOSE_NOWRITE,CLOSE,ISDIR
 /var/scan/.tmp123.GOODBRANDSCAN-123.PDF event: CREATE
 /var/scan/.tmp123.GOODBRANDSCAN-123.PDF event: OPEN
 /var/scan/.tmp123.GOODBRANDSCAN-123.PDF event: MODIFY
 ...
 /var/scan/.tmp123.GOODBRANDSCAN-123.PDF event: MOVED_FROM
 /var/scan/GOODBRANDSCAN-123.PDF event: MOVED_TO
That means your parameters for inotifywait could be -r -m -e MOVED_TO, and ${file_name} will contain GOODBRANDSCAN-123.PDF. This is of course better than Xerox’s way with their not-invented-here LCK things, which probably wouldn’t be necessary at all with a POSIX rename call.
I will document later how to do this for the E-mail feature of the printer with Postfix.
I first need a moment in my life where I need this badly enough to start figuring out how to extract certain attachment MIME parts from an E-mail with Postfix’s master.cf. I guess I will have to look into CuckooMX by Xavier Mertens for that. Update: that article is available now.
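A monitoring script for such a printer could look roughly like the sketch below. I have no such printer to test against, so treat it as an assumption-laden outline: moved_pdfs, WATCH_DIR and the --watch switch are made up for this example. Because -r also watches /var/scan/ocr, and fixpdf.sh moves its result there (which can itself show up as MOVED_TO), events from subdirectories are skipped.

```shell
#!/bin/sh
# Sketch: react to MOVED_TO instead of the deletion of a .LCK directory.
WATCH_DIR=${WATCH_DIR:-/var/scan}

# Reads "path event name" lines (inotifywait's default output) on stdin and
# prints the name of every PDF that was renamed into WATCH_DIR itself.
moved_pdfs() {
    while read -r file_path file_event file_name; do
        # Skip subdirectories such as ocr/, where fixpdf.sh puts its results.
        [ "$file_path" = "$WATCH_DIR/" ] || continue
        case "$file_event" in
            *MOVED_TO*) ;;
            *) continue ;;
        esac
        case "$file_name" in
            *.PDF|*.pdf) printf '%s\n' "$file_name" ;;
        esac
    done
}

# Only start watching when asked to, so the function can be tried by hand.
if [ "${1:-}" = "--watch" ]; then
    inotifywait -r -m -e MOVED_TO "$WATCH_DIR" | moved_pdfs |
        while read -r pdf; do
            /usr/local/bin/fixpdf.sh "$pdf" &
        done
fi
```

Saved under a name of your choosing, it would be started with the --watch argument under screen, just like monitorscan.sh.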