43 Folders

Canon Pixma MP830, Omnipage and a few python scripts...

Posted by djol


Ah, paperless workflow - a topic close to my heart! How many hours have I spent setting this all up, when I probably should have just filed the original paper in my dusty filing cabinets! Ah, but so much fun was to be had…

Currently my paperless system uses a Canon Pixma MP830 multifunction thing - chosen because it does duplex scans from an auto document feeder (ADF) and was readily available here in Australia (unlike the ScanSnap, sigh). Duplex on this is fine, albeit slow, though probably nowhere near as reliable as that ‘sweet sweet ScanSnap magic’. Plus it can only easily duplex A4.

The Canon MP830 is connected to our headless Mac mini, which acts as our home server. I drop a stack of documents into the ADF of the scanner, with discrete documents separated by blank pages, and hit the ‘scan to pdf’ button. The scanner scans and dumps the resultant single pdf in a folder on the server.

Omnipage SE (included with the Canon) does a very good and very fast job of automatic OCR on the scanned pdf, including the text as a hidden layer behind the image of the scanned page. While the OCR isn’t 100% accurate, it’s fine for Spotlight searches and for copy-pasting with a quick proofread. The OCR happens automatically as part of the scan process.

On the server, an hourly cron job runs a python script that detects ‘blank’ pages within the pdf, then splits the original into separate pdfs at each of these blanks. Essentially the script just looks for pages that have no OCR’ed text, and marks these as blanks. Not perfect, as theoretically it will choke on pages with images but no text, although I have not had problems so far. (Here the not-quite-perfect OCR works to my advantage, as even an all-image page will probably have some area mistakenly detected as a random character. A blank sheet of 80 gsm paper scans as properly blank, with no OCR text.)
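The splitting rule is simple enough to sketch. This is not the actual script, just a minimal illustration of the idea: it assumes the OCR text of each page has already been pulled out into a list of strings (by whatever pdf library you prefer), and the function name split_on_blanks is made up for the example.

```python
def split_on_blanks(page_texts):
    """Group page indices into documents, treating text-free pages as separators.

    page_texts: one string per page, holding that page's OCR text layer.
    Returns a list of lists of 0-based page indices; blank pages are dropped.
    """
    documents = []
    current = []
    for index, text in enumerate(page_texts):
        if text.strip():
            # Any OCR text at all means this is a real page.
            current.append(index)
        elif current:
            # A blank page closes the document collected so far.
            documents.append(current)
            current = []
    if current:
        # The last document has no trailing blank page.
        documents.append(current)
    return documents


pages = ["Phone bill page 1", "Phone bill page 2", "", "Bank statement", "  \n", ""]
print(split_on_blanks(pages))  # [[0, 1], [3]]
```

Consecutive blanks and a trailing blank are harmless here, since a separator only closes a document that already has pages in it.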

These resultant pdfs are then moved by the script to the main ‘files’ folder, where another python script (“argh, python, is there anything ye can’t do?”) files them into subdirs based on financial year, or on filename keywords for specific things like birth certificates etc. This ‘files’ dir syncs regularly with the same dir on my powerbook using unison - thus I’m always carrying all the family’s files, and can edit and rearrange as I like, and have these changes automagically propagated back to the server when I’m back on the home network. (On wakeup the pbook pings the server to make sure it’s present, then runs unison to sync in the background. I have something similar for daily automatic wireless backups from each of our laptops to the server. Unison rocks. Really.)
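For the curious, the filing decision might look something like the sketch below. It is only an illustration under a few assumptions: the financial year is taken to be the Australian one (1 July to 30 June), and the folder label format, the KEYWORD_FOLDERS map and the function names are all invented for the example.

```python
from datetime import date

# Hypothetical keyword-to-folder map; birth certificates are the only
# keyword example mentioned in the post.
KEYWORD_FOLDERS = {"birthcert": "certificates"}


def financial_year(d):
    """Label the Australian financial year (1 July to 30 June) containing d."""
    start = d.year if d.month >= 7 else d.year - 1
    return f"FY{start}-{(start + 1) % 100:02d}"


def target_subdir(filename, doc_date):
    """Pick a subdir: a keyword match wins, otherwise fall back to the financial year."""
    name = filename.lower()
    for keyword, folder in KEYWORD_FOLDERS.items():
        if keyword in name:
            return folder
    return financial_year(doc_date)


print(target_subdir("PhoneBill.receipt.tax.pdf", date(2007, 10, 5)))  # FY2007-08
print(target_subdir("Kid.BirthCert.pdf", date(2007, 10, 5)))          # certificates
```

Checking keywords before the year keeps the special documents together no matter when they were scanned.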

Currently the filing of pdfs into year/keyword subdirs only happens once the pdfs have been manually renamed to something meaningful. I’m currently juggling a small script that will pop up a single yet-to-be-named pdf and ask for a suitable name every time my pbook wakes - trying to enforce a ‘one file at a time’ approach to this sole manual step, while lowering the barrier to actually filing this stuff meaningfully.

While I played with Yep! for a bit (and liked it), I’m trying to stick with a system using just the filesystem, Finder, Spotlight and Quicksilver. I ‘tag’ my pdfs within the filename (e.g. PhoneBill.receipt.tax.pdf) - plus include the date of the document within its filename (…051007.pdf), which the filing script then uses to change the creation date of the actual pdf file to match. Thus I don’t have to worry about external databases for metadata etc. - just filenames, filesystem time stamps and the actual textual content of the file.
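A rough sketch of the date-stamping step, with the guesses spelled out: it assumes the date is the last dot-separated field before ".pdf" and reads the six digits as DDMMYY (it could just as well be YYMMDD, so the format is a parameter). The full filename in the example is made up by joining the two fragments above. Note that os.utime only sets the access and modification times; changing the true creation date is platform-specific and is not covered here.

```python
import os
import re
import tempfile
from datetime import datetime

# Assumption: the date is the last dot-separated field before ".pdf", six digits.
DATE_FIELD = re.compile(r"\.(\d{6})\.pdf$", re.IGNORECASE)


def date_from_filename(filename, fmt="%d%m%y"):
    """Return the date embedded in the filename, or None if there is no valid one."""
    match = DATE_FIELD.search(filename)
    if not match:
        return None
    try:
        return datetime.strptime(match.group(1), fmt)
    except ValueError:
        # Six digits that do not form a real date under this format.
        return None


def stamp_file(path, fmt="%d%m%y"):
    """Set the file's access and modification times to the date in its name."""
    doc_date = date_from_filename(os.path.basename(path), fmt)
    if doc_date is None:
        return False
    timestamp = doc_date.timestamp()  # interpreted as local time
    os.utime(path, (timestamp, timestamp))
    return True


# Try it on a throwaway file with a hypothetical name.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "PhoneBill.receipt.tax.051007.pdf")
    open(path, "w").close()
    stamp_file(path)
    print(datetime.fromtimestamp(os.path.getmtime(path)).date())  # 2007-10-05
```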

Standard tools. Nice. Simple.

Yes. I am a nerd. You all understand…

Vox Pop: Workflow for the Fujitsu ScanSnap? By: Merlin Mann (22 replies) October 23, 2007 - 10:18am
 