43 Folders


Paperless on a budget

I've been thinking about getting rid of some old paper files, and I want to get them onto a PC for the lowest cost possible.

I wanted to kick this idea around a bit and see if it fights back...

The problem: I have a cheap scanner, and my budget doesn't stretch to a nice automated scanner like the ScanSnap.

Solution: my idea is to scan the pages as images and OCR them. OCR isn't perfect, as you all know, but as long as it produces usable search text I don't care too much.

The clever bit (I think) is that I want to store them as MHT files containing both the scanned page images and the OCR'd text. MHT is the multipart MIME format that IE uses to save a complete web page as a single file.

The text will make the file searchable via Google Desktop or whatever, but the original image will still be there. I don't need to pull the documents into other systems or anything, just to be able to retrieve them if necessary.
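To make the idea concrete, here is a rough sketch of building such a file with Python's standard `email` package. It is only an illustration under a few assumptions: the function name `build_mht`, the `page1@scan.local` style content IDs, and the PNG image type are all made up for the example, and images are referenced with `cid:` links rather than whatever IE itself writes. The HTML part is stored as quoted-printable rather than base64 so that plain ASCII OCR text stays readable in the raw file. Whether a particular browser or desktop indexer handles the result is something you would need to test.

```python
import html
from email import charset
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mht(pages, title="Scanned document"):
    """Return the bytes of a multipart/related (MHT-style) document.

    pages: list of (png_bytes, ocr_text) tuples, one per scanned page.
    """
    msg = MIMEMultipart("related", type="text/html")
    msg["Subject"] = title

    body = ["<html><head><title>%s</title></head><body>" % html.escape(title)]
    image_parts = []
    for number, (png_bytes, ocr_text) in enumerate(pages, start=1):
        # Hypothetical content ID; the HTML refers to the image through it.
        cid = "page%d@scan.local" % number
        body.append('<img src="cid:%s" alt="page %d">' % (cid, number))
        body.append("<pre>%s</pre>" % html.escape(ocr_text))
        part = MIMEImage(png_bytes, _subtype="png")
        part.add_header("Content-ID", "<%s>" % cid)
        image_parts.append(part)
    body.append("</body></html>")

    # Quoted-printable keeps ASCII text legible in the file; the default
    # for UTF-8 would be base64, which hides the text from naive indexers.
    utf8_qp = charset.Charset("utf-8")
    utf8_qp.body_encoding = charset.QP

    # The HTML part goes first so it is the root of the document.
    msg.attach(MIMEText("\n".join(body), "html", utf8_qp))
    for part in image_parts:
        msg.attach(part)
    return msg.as_bytes()
```

Writing the returned bytes to a file with an `.mht` extension gives one file per document, with the page images and the search text side by side.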

Sure, the files will be bigger than a PDF would have been, but who really cares these days with storage so cheap?

If storing them like this is viable, I'd write a small app that acts as a front end to scan, OCR, and compile the files.

Thoughts? Problems & pitfalls? Cheers S8

partial solutions found

Several things have come together to give a "good-enough" solution for this task.

  1. Evernote is used to OCR and search the documents in question, so there's no need to save as MHT.
  2. I discovered that the multifunction printer at work can scan 10 ppm single-sided.
  3. Scripting + ImageMagick is used to combine pages for duplex/multi-page scanning.
  4. AutoHotKey is used to automate scanning and saving with the current date and time. It's slow, but you can walk away.
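For anyone wanting to try steps 3 and 4, here is a small sketch of the bookkeeping involved. It is not the actual script used. The function names are invented, and it assumes a single-sided scanner where you scan all the fronts, flip the stack, and scan the backs, so the backs arrive in reverse order. It only builds the page order, a date-and-time file name, and an argument list for ImageMagick's `convert` (called `magick` in ImageMagick 7). Running the command is left to you.

```python
from datetime import datetime


def interleave_duplex(fronts, backs, backs_reversed=True):
    """Merge front and back scans into reading order.

    backs_reversed=True assumes the stack was flipped over before the
    second pass, so the last sheet's back was scanned first.
    """
    if len(fronts) != len(backs):
        raise ValueError("need one back side per front side")
    if backs_reversed:
        backs = list(reversed(backs))
    ordered = []
    for front, back in zip(fronts, backs):
        ordered.extend([front, back])
    return ordered


def timestamped_name(when, prefix="scan", ext="pdf"):
    """Build a file name such as scan_2009-03-04_153005.pdf."""
    return "%s_%s.%s" % (prefix, when.strftime("%Y-%m-%d_%H%M%S"), ext)


def convert_command(pages, output):
    """Argument list that asks ImageMagick to combine pages into one file."""
    return ["convert", *pages, output]


if __name__ == "__main__":
    pages = interleave_duplex(["f1.png", "f2.png"], ["b2.png", "b1.png"])
    print(convert_command(pages, timestamped_name(datetime.now())))
```

The list from `convert_command` can be passed straight to `subprocess.run` once the file names point at real scans.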

400+ pages scanned so far, and searchability is looking good.

 