43 Folders


”What’s 43 Folders?”
43Folders.com is Merlin Mann’s website about finding the time and attention to do your best creative work.

Paperless on a budget

I’ve been thinking about getting rid of some old paper files I have, and I want to get them onto a PC for the lowest cost possible.

I wanted to kick this idea around a bit and see if it fights back…

The problem: I have a cheap scanner and my budget doesn’t stretch to a nice automated scanner like the ScanSnap.

The solution: scan the pages as images and OCR them. OCR isn’t perfect, as you all know, but as long as it produces usable search text I don’t care too much.

The clever bit (I think) is that I want to store them as MHT files containing both the scanned page images and the OCR’d text. MHT is the multipart MIME format that IE uses to save a complete web page as a single file.

The text will make the file searchable via Google Desktop or whatever, but the original image will still be there. I don’t need to pull the documents into other systems or anything; I just need to be able to retrieve them if necessary.
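To make the idea concrete, here is a minimal Python sketch of packing page images and their OCR text into one multipart/related MIME file, which is the general shape of an MHT file. It uses only the standard library `email` package. The function name `build_mht`, the `index.html` location, and the page layout are illustrative assumptions, and whether a particular browser or desktop search tool accepts the result has not been verified here.

```python
import html
from email.charset import QP, Charset
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mht(title, pages):
    """Return the bytes of a multipart/related document.

    pages is a list of (image_name, png_bytes, ocr_text) tuples.
    build_mht is a hypothetical helper, not part of any existing tool.
    """
    msg = MIMEMultipart("related", type="text/html")
    msg["Subject"] = title

    # One HTML page that shows every scan followed by its OCR text.
    body = ["<html><head><title>%s</title></head><body>" % html.escape(title)]
    for image_name, _png, ocr_text in pages:
        body.append('<img src="%s">' % html.escape(image_name))
        body.append("<pre>%s</pre>" % html.escape(ocr_text))
    body.append("</body></html>")

    # Quoted-printable keeps the OCR text mostly readable in the raw file,
    # instead of the base64 that utf-8 text parts get by default.
    charset = Charset("utf-8")
    charset.body_encoding = QP
    page = MIMEText("\n".join(body), "html", charset)
    page["Content-Location"] = "index.html"  # assumed name for the root part
    msg.attach(page)

    # Each scanned image becomes its own part, matched by Content-Location.
    for image_name, png_bytes, _text in pages:
        part = MIMEImage(png_bytes, _subtype="png")
        part["Content-Location"] = image_name
        msg.attach(part)

    return msg.as_bytes()
```

Writing the returned bytes to a file ending in `.mht` gives one file per document, with the searchable text and the original scans side by side.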

Sure, the files will be bigger than a PDF would have been, but who really cares these days, with storage so cheap?

If storing them like this is viable, I’d write a small app that acts as a front end to scan, OCR, and compile the files.
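A rough sketch of what the OCR part of such a front end might look like in Python. It assumes the pages are already scanned into a folder and that the open-source `tesseract` command-line tool is installed, which is a choice made here for illustration and not something named in the post. For simplicity it writes the text to a sidecar `.txt` file next to each image and leaves out the scanning and packaging steps. The names `ocr_folder` and `tesseract_ocr` are hypothetical.

```python
import subprocess
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def tesseract_ocr(image_path):
    """Run the tesseract CLI on one image and return the recognised text.

    Assumes tesseract is on the PATH; "stdout" tells it to print the text
    instead of writing an output file.
    """
    result = subprocess.run(
        ["tesseract", str(image_path), "stdout"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def ocr_folder(folder, ocr=tesseract_ocr):
    """OCR every image in folder that has no sidecar text file yet.

    Returns the list of text files written, so a rerun only handles new scans.
    """
    written = []
    for image in sorted(Path(folder).iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        sidecar = image.with_name(image.name + ".txt")
        if sidecar.exists():
            continue  # already processed on an earlier run
        sidecar.write_text(ocr(image), encoding="utf-8")
        written.append(sidecar)
    return written
```

Passing the OCR step in as a function means a different engine can be swapped in without touching the folder handling.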

Thoughts? Problems & pitfalls? Cheers, S8


Isura

Unnecessary work

This seems like unnecessary work. What kind of volume are you talking about? It would be simpler to keep all your existing paper files in a filing cabinet and use your new system only for incoming (new) paper. I think going 100% paperless is unnecessary and far too much work to be worthwhile. Kind of defeats the purpose of GTD :)

Section8

Partial solutions found

Several things have come together to give a “good-enough” solution for this task.

  1. Evernote is used to OCR and search the documents in question, so there is no need to save them as MHT.
  2. I discovered that the multifunction printer at work can scan 10 ppm single-sided.
  3. Scripting plus ImageMagick is used to combine pages for duplex and multi-page scanning.
  4. AutoHotKey is used to automate scanning and saving with the current date and time. It’s slow, but you can walk away.

400+ pages scanned and looking good for search-ability.
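For anyone wanting to try steps 3 and 4, here is a small Python sketch of the same idea. The actual scripts are not shown in this thread, so everything here is an assumption: that fronts and backs are scanned as two separate batches, that the backs come out in reverse order because the stack was flipped, and that the pages are merged with ImageMagick's `convert` command (`magick` on newer installs). The function names are hypothetical.

```python
from datetime import datetime


def interleave_duplex(fronts, backs, backs_reversed=True):
    """Merge front-side and back-side scans into reading order.

    With backs_reversed=True the back sides are assumed to have been
    scanned last page first, as happens when the whole stack is flipped.
    """
    if len(fronts) != len(backs):
        raise ValueError("need the same number of front and back scans")
    if backs_reversed:
        backs = list(reversed(backs))
    ordered = []
    for front, back in zip(fronts, backs):
        ordered.extend([front, back])
    return ordered


def timestamped_name(now=None, ext="pdf"):
    """Build a file name from the current date and time."""
    now = now or datetime.now()
    return now.strftime("scan_%Y-%m-%d_%H%M%S") + "." + ext


def convert_command(pages, output):
    """Argument list asking ImageMagick to combine pages into one file."""
    return ["convert", *pages, output]
```

The resulting list can be handed to `subprocess.run` once the page order looks right, for example `convert_command(interleave_duplex(fronts, backs), timestamped_name())`.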
