OCR for your old printer/scanner

Modern printers can do OCR on your scans. But as we talked about last time, aren’t all printers or scanners modern.

We have a scrap computer that is (already) catching all E-mails on a badly configured local SMTP server, to then forward it to a well configured SMTP server that has TLS. Now we also want to do OCR on the scanned PDFs.

My printer has a so called Network Scan function that scans to a SMB file share (that’s a Windows share). The scrap computer is configured to share /var/scan using Samba as ‘share’, of course. The printer is configured to use that share. Note that you might need in smb.conf this for very old printers:

client min protocol = LANMAN1
server min protocol = LANMAN1
client lanman auth = yes
client ntlmv2 auth = no
client plaintext auth = yes
ntlm auth = yes
security = share

And of course also something like this:

[scan]
path = /var/scan
writable = yes
browsable = yes
guest ok = yes
public = yes
create mask = 0777

First install software: apt-get install ocrmypdf inotify-tools screen bash

We need a script to perform OCR scan on a PDF. We’ll here use it in another script that monitors /var/scan for changes. Later in another post I’ll explain how to use it from Postfix’s master.cf on the attachments of an E-mail. Here is /usr/local/bin/fixpdf.sh:

! /bin/sh
a=$1
TMP=`mktemp -d -t XXXXX`
DIR=/var/scan
mkdir -p $DIR/ocr
cd $DIR
TIMESTAMP=`stat -c %Y "$a"`
ocrmypdf --force-ocr "$a" "$TMP/OCR-$a"
mv -f "$TMP/OCR-$a" "$DIR/ocr/$TIMESTAMP-$a"
chmod 777 "$DIR/ocr/$TIMESTAMP-$a"
cd /tmp
rm -rf $TMP

Note that I prepend the filename with a timestamp. That’s because my printer has no way to give the scanned files a good filename that I can use for my archiving purposes. You can of course do this different.

Now we want a script that monitors /var/scan and launches that fixpdf.sh script in the background each time a file is created.

My Xerox WorkCentre 7232 uses a directory called SCANFILE.LCK/ for its own file locking. When it is finished with a SCANFILE.PDF it deletes that LCK directory.

Being bad software developers the Xerox people didn’t use a POSIX rename for SCANFILE.PDF to do an atomic write operation at the end.

It looks like this:

inotifywait -r -m  /var/scan | 
while read file_path file_event file_name; do
echo ${file_path}${file_name} event: ${file_event}
done
Setting up watches. Beware: since -r was given, this may take a while!
Watches established.
/var/scan/ event: OPEN,ISDIR
/var/scan/ event: ACCESS,ISDIR
/var/scan/ event: ACCESS,ISDIR
/var/scan/ event: CLOSE_NOWRITE,CLOSE,ISDIR
/var/scan/ event: OPEN,ISDIR
/var/scan/ event: ACCESS,ISDIR
/var/scan/ event: ACCESS,ISDIR
/var/scan/ event: CLOSE_NOWRITE,CLOSE,ISDIR
/var/scan/XEROXSCAN003.LCK event: CREATE,ISDIR
/var/scan/XEROXSCAN003.LCK event: OPEN,ISDIR
/var/scan/XEROXSCAN003.LCK event: ACCESS,ISDIR
/var/scan/XEROXSCAN003.LCK event: CLOSE_NOWRITE,CLOSE,ISDIR
/var/scan/ event: OPEN,ISDIR
/var/scan/ event: ACCESS,ISDIR
/var/scan/ event: ACCESS,ISDIR
/var/scan/ event: CLOSE_NOWRITE,CLOSE,ISDIR
/var/scan/ event: OPEN,ISDIR
/var/scan/ event: ACCESS,ISDIR
/var/scan/ event: CLOSE_NOWRITE,CLOSE,ISDIR
/var/scan/XEROXSCAN003.PDF event: CREATE
/var/scan/XEROXSCAN003.PDF event: OPEN
/var/scan/XEROXSCAN003.PDF event: MODIFY
/var/scan/XEROXSCAN003.PDF event: MODIFY
...
/var/scan/XEROXSCAN003.PDF event: MODIFY
/var/scan/XEROXSCAN003.PDF event: MODIFY
/var/scan/XEROXSCAN003.PDF event: CLOSE_WRITE,CLOSE
/var/scan/XEROXSCAN003.PDF event: ATTRIB
/var/scan/XEROXSCAN003.LCK event: OPEN,ISDIR
/var/scan/XEROXSCAN003.LCK/ event: OPEN,ISDIR
/var/scan/XEROXSCAN003.LCK event: ACCESS,ISDIR
/var/scan/XEROXSCAN003.LCK/ event: ACCESS,ISDIR
/var/scan/XEROXSCAN003.LCK event: ACCESS,ISDIR
/var/scan/XEROXSCAN003.LCK/ event: ACCESS,ISDIR
/var/scan/XEROXSCAN003.LCK event: CLOSE_NOWRITE,CLOSE,ISDIR
/var/scan/XEROXSCAN003.LCK/ event: CLOSE_NOWRITE,CLOSE,ISDIR
/var/scan/XEROXSCAN003.LCK/ event: DELETE_SELF
/var/scan/XEROXSCAN003.LCK event: DELETE,ISDIR

The printer deleting that SCANFILE.LCK/ directory is a good moment to start our OCR script (call it for example /usr/local/bin/monitorscan.sh):

! /bin/bash
inotifywait -r -m -e DELETE,ISDIR /var/scan |
while read file_path file_event file_name; do
if [ ${file_event} = "DELETE,ISDIR" ]; then
if [[ ${file_name} == *"LCK" ]]; then
suffix=".LCK"
filename=`echo ${file_name} | sed -e "s/$suffix$//"`.PDF
/usr/local/bin/fixpdf.sh $filename &
fi
fi
done

Give both scripts 755 permissions with chmod and now you just run screen /usr/local/bin/monitorscan.sh

When your printer was written by good software developers, it will do POSIX rename. That looks like this (yes, also when done over a SMB network share):

inotifywait -r -m  /var/scan | 
while read file_path file_event file_name; do
echo ${file_path}${file_name} event: ${file_event}
done
Setting up watches. Beware: since -r was given, this may take a while!
Watches established.
/var/scan/ event: OPEN,ISDIR
/var/scan/ event: ACCESS,ISDIR
/var/scan/ event: CLOSE_NOWRITE,CLOSE,ISDIR
/var/scan/ event: OPEN,ISDIR
/var/scan/ event: ACCESS,ISDIR
/var/scan/ event: CLOSE_NOWRITE,CLOSE,ISDIR
/var/scan/ event: OPEN,ISDIR
/var/scan/ event: ACCESS,ISDIR
/var/scan/ event: CLOSE_NOWRITE,CLOSE,ISDIR
/var/scan/ event: OPEN,ISDIR
/var/scan/ event: ACCESS,ISDIR
/var/scan/ event: CLOSE_NOWRITE,CLOSE,ISDIR
/var/scan/.tmp123.GOODBRANDSCAN-123.PDF event: CREATE
/var/scan/.tmp123.GOODBRANDSCAN-123.PDF event: OPEN
/var/scan/.tmp123.GOODBRANDSCAN-123.PDF event: MODIFY
...
/var/scan/.tmp123.GOODBRANDSCAN-123.PDF event: MOVED_FROM
/var/scan/GOODBRANDSCAN-123.PDF event: MOVED_TO

That means that your parameters for inotifywait could be -r -m -e MOVED_TO and in ${file_name} you’ll have that GOODBRANDSCAN-123.PDF. This is of course better than Xerox’s way with their not invented here LCK things that probably also wouldn’t be necessary with a POSIX rename call.

I will document how to do this to the E-mail feature of the printer with Postfix later.

I first need a moment in my life where I actually need this hard enough that I will start figuring out how to extract certain attachment MIME parts from an E-mail with Posix’s master.cf. I guess I will have to look into CockooMX by Xavier Mertens for that. Update: that article is available now.

.