43 Folders

“What’s 43 Folders?”
43Folders.com is Merlin Mann’s website about finding the time and attention to do your best creative work.

Vox Pop: Workflow for the Fujitsu ScanSnap?

In comments about yesterday's "Making friends with paper" post, I was reminded by 43f member Adam Hooks...

A couple months ago, on an MBW episode, Merlin, you recommended some scanner/PDF solutions and you said you would elaborate on that on 43f at some point. I thought this was related to reducing your reliance on paper. How did your scanning experiment go?

Adam remembers correctly that I purchased and preliminarily fiddled with the Fujitsu ScanSnap S500M for OS X (Info, Amazon). It's a small-footprint, high-speed document scanner that a lot of people have been talking about lately. I'd read so many reviews and blog posts about how easy it is to use that I was intoxicated by the dream of a life -- if not without paper storage -- where I could at least try to minimize my unnecessary paper clutter and start making document archiving easier and more searchable.

Given the not inconsiderable cost of the unit, I'm embarrassed to say that I got busy with other stuff and haven't yet returned to using the ScanSnap in any automated way.

Doesn't mean I'm not interested or haven't gotten started...

My initial experiences, while tentative in terms of time commitment and true workflow integration, have been very positive so far. It's easy and fast to set up the S500M and then start scanning one- or two-sided documents. The beauty part is that the included "ScanSnap Manager" app not only stores your document preferences, but directs the USB input from the ScanSnap right into the destination app of your choosing (which can, of course, be an OCR app -- that's where it gets powerful).

Initial experiments scanning directly to image-only PDFs were very positive, while scanning into "Yep" and "DevonThink Pro Office" (which has on-board OCR) seems to point even closer to the direction I eventually hope to go.

I know at least a few of you are ScanSnap studs who have come up with workflows that are really happening for you (hint: looking at you for a blog post here, Mr. Norbauer). In the absence of a more detailed report from me, I'm hoping a few of you can chime in here.

The Question to You

How are you integrating the ScanSnap (or another OS X-friendly document scanner) into your workflow? What are you using for OCR? Having particular success with ReadIris, Acrobat, DevonThink, or Yep? Any sexy Automator workflows to share?

Comment by djol:

Canon Pixma MP830, Omnipage and a few Python scripts...

Ah, paperless workflow - a topic close to my heart! How many hours have I spent setting this all up, when I probably should have just filed the original paper in my dusty filing cabinets! Ah, but so much fun was to be had...

Currently my paperless system uses a Canon Pixma MP830 multifunction thing - chosen because it does duplex scans from an auto document feeder and was readily available here in Australia (unlike the ScanSnap (sigh)). Duplex on this is fine, albeit slow, though probably nowhere near as reliable as that 'sweet sweet ScanSnap magic'. Plus, it can only easily duplex A4.

The Canon MP830 is connected to our headless Mac mini, which acts as our home server. I drop a stack of documents into the ADF of the scanner, with discrete documents separated by blank pages, and hit the 'scan to pdf' button. The scanner scans and dumps the resultant single PDF in a folder on the server.

Omnipage SE (included with the Canon) does a very good and very fast job of automatic OCR on the scanned PDF, including the text as a hidden layer behind the image of the scanned page. While the OCR isn't 100%, it's fine for Spotlight searches and copy-pasting with quick proof-reading. The OCR happens automatically as part of the scan process.

On the server, an hourly cron job runs a Python script that detects 'blank' pages within the PDF, then splits the original into separate PDFs at each of these blanks. Essentially the script just looks for pages that have no OCR'ed text, and marks these as blanks. Not perfect, as theoretically it will choke on pages with images but no text, although I have not had problems so far. (Here the not-quite-perfect OCR works to my advantage, as even an all-image page will probably have some area mistakenly detected as a random character. 80 gsm blank page = blank and no OCR text.)
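For anyone wanting to try the same trick, the splitting logic might look something like the sketch below. This is not the script described above, just a minimal illustration. It assumes the OCR text of each page has already been pulled out into a list of strings (with whatever PDF library you like), and `split_on_blanks` is a made-up name.

```python
from typing import List


def split_on_blanks(page_texts: List[str]) -> List[List[int]]:
    """Group page indices into documents, treating pages with no OCR text as separators.

    page_texts is assumed to hold the extracted text of each page, in order.
    Blank separator pages are dropped from the output.
    """
    documents: List[List[int]] = []
    current: List[int] = []
    for index, text in enumerate(page_texts):
        if text.strip():
            # Page has some OCR'ed text, so it belongs to the current document.
            current.append(index)
        elif current:
            # Blank page: close off the document collected so far.
            documents.append(current)
            current = []
    if current:
        documents.append(current)
    return documents
```

Each inner list holds the page numbers (counting from zero) for one output PDF. Runs of several blank pages in a row are treated as a single separator.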

These resultant PDFs are then moved by the script to the main 'files' folder, where another Python script ("argh, python, is there anything ye' can't do?") files them into subdirs based on financial year, or filename keywords for specific things like birth certificates etc. This 'files' dir syncs regularly with the same on my PowerBook using unison - thus I'm always carrying all the family's files, and can edit and rearrange as I like, and have these changes automagically propagated back to the server when I'm back on the home network. (On wakeup the pbook pings the server to make sure it's present, then runs unison to sync in the background. I have something similar for daily automatic wireless backups from each of our laptops to the server. Unison rocks. Really.)
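A rough sketch of how filing by financial year or keyword could work is below. Everything here is an assumption for illustration: the keyword table, the folder names, the `2007-08` label style, and the choice of a financial year starting 1 July (the Australian convention) are all guesses, not details from the script described above.

```python
from datetime import date
from pathlib import PurePosixPath

# Hypothetical keyword -> folder overrides; the real keywords are not given.
KEYWORD_FOLDERS = {
    "birthcertificate": "certificates",
    "passport": "certificates",
}


def financial_year(doc_date: date) -> str:
    """Return a label like '2007-08', assuming the year starts on 1 July."""
    start = doc_date.year if doc_date.month >= 7 else doc_date.year - 1
    return f"{start}-{(start + 1) % 100:02d}"


def destination_for(filename: str, doc_date: date) -> PurePosixPath:
    """Pick a subdirectory: keyword folder if one matches, else the financial year."""
    lowered = filename.lower()
    for keyword, folder in KEYWORD_FOLDERS.items():
        if keyword in lowered:
            return PurePosixPath(folder) / filename
    return PurePosixPath(financial_year(doc_date)) / filename
```

The function only works out where a file should go; actually moving it (for example with `shutil.move`) is left to the caller so the rule can be checked without touching any files.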

Currently the filing of PDFs into year/keyword subdirs only happens once the PDFs have been manually renamed to something meaningful. I'm currently juggling a small script that will pop up a single yet-to-be-named PDF and ask for a suitable name every time my pbook wakes - trying to enforce a 'one file at a time' approach to this sole manual step, while lowering the barrier to actually filing this stuff meaningfully.
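The 'one file at a time' idea could be sketched along these lines. This is a guess at the shape of such a script, not the one mentioned above. It assumes unnamed scans can be recognised by a `scan` filename prefix (a hypothetical convention), and it leaves out the step of actually opening the PDF on screen.

```python
import os
from typing import Callable, Optional

# Assumption: freshly scanned files keep a scanner-style prefix until renamed.
UNNAMED_PREFIX = "scan"


def next_unnamed(folder: str) -> Optional[str]:
    """Return the first (alphabetically) not-yet-renamed PDF in folder, if any."""
    candidates = sorted(
        name for name in os.listdir(folder)
        if name.lower().startswith(UNNAMED_PREFIX) and name.lower().endswith(".pdf")
    )
    return candidates[0] if candidates else None


def rename_one(folder: str, ask: Callable[[str], str] = input) -> Optional[str]:
    """Ask for a name for exactly one unnamed PDF; return the new name or None."""
    old = next_unnamed(folder)
    if old is None:
        return None
    new = ask(f"New name for {old} (blank to skip): ").strip()
    if not new:
        return None
    if not new.lower().endswith(".pdf"):
        new += ".pdf"
    target = os.path.join(folder, new)
    if os.path.exists(target):
        return None  # never overwrite an existing file
    os.rename(os.path.join(folder, old), target)
    return new
```

Passing the prompt in as `ask` keeps the function easy to try out without a terminal, and means a GUI dialog could be swapped in later.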

While I played with Yep! for a bit (and liked it), I'm trying to stick with a system using just the filesystem, Finder, Spotlight and Quicksilver. I 'tag' my PDFs within the filename (i.e. PhoneBill.receipt.tax.pdf) - plus include the date of the document within its filename (...051007.pdf), which is then used by the filing script to change the creation-date of the actual PDF file to match. Thus I don't have to worry about external databases for metadata etc. - just filenames, filesystem time stamps and the actual textual content of the file.
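As a rough illustration of the filename convention, the sketch below pulls the tags and date out of a name and stamps the file to match. Several things are assumed rather than known: that the date is the last dot-separated chunk before `.pdf`, that its six digits mean day-month-year (they could just as well be year-month-day), and the function names themselves. Note also that `os.utime` sets the modification and access times; setting a true creation date on a Mac would need a different tool.

```python
import os
import time
from datetime import datetime
from typing import List, Optional, Tuple


def parse_filename(filename: str) -> Tuple[str, List[str], Optional[datetime]]:
    """Split 'Name.tag1.tag2.DDMMYY.pdf' into (name, tags, date).

    Assumes the date, if present, is the last chunk before the extension
    and is written DDMMYY; returns None for the date otherwise.
    """
    parts = filename.split(".")
    if len(parts) > 1 and parts[-1].lower() == "pdf":
        parts = parts[:-1]
    doc_date = None
    if len(parts) > 1 and len(parts[-1]) == 6 and parts[-1].isdigit():
        try:
            doc_date = datetime.strptime(parts[-1], "%d%m%y")
            parts = parts[:-1]
        except ValueError:
            pass  # six digits that are not a valid date: treat as a tag
    return parts[0], parts[1:], doc_date


def apply_date(path: str, doc_date: datetime) -> None:
    """Set the file's access and modification times to the document date."""
    stamp = time.mktime(doc_date.timetuple())
    os.utime(path, (stamp, stamp))
```

With the day-month-year assumption, `PhoneBill.receipt.tax.051007.pdf` parses to the name `PhoneBill`, the tags `receipt` and `tax`, and 5 October 2007.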

Standard tools. Nice. Simple.

Yes. I am a nerd. You all understand...



