43 Folders

“What’s 43 Folders?”
43Folders.com is Merlin Mann’s website about finding the time and attention to do your best creative work.

Trick for analyzing Project Gutenberg texts

Hi everyone,

I've always been intrigued by Project Gutenberg's online texts, but until now I have not known how to digest the massive hunks of plain text that each file represents. It may be great to have Emerson, Thoreau's [URL="http://www.gutenberg.org/etext/205"]Walden[/URL], or Samuel Pepys' [URL="http://www.gutenberg.org/etext/4200"]Diary[/URL] in plain text on my hard drive, but there's no way I'm going to sit at my computer and read them. And you will never find me sitting on the subway, squinting at a Project Gutenberg text on my iPod's small screen. So these neglected texts have languished on my hard drive, gathering dust.

But recently, and to my great surprise and delight, I've discovered the ancient and elegant technology of the UNIX command line. And I've come to realize that a few basic commands (especially grep and tr) can make mincemeat of these massive bodies of text, offering all sorts of ways to search and analyze them.

The simplest is grep. Let's say I have the Diary of Samuel Pepys on my hard drive (pepys.txt) and I want to search it for every appearance of the word "ale." I would simply type:

grep -inw ale pepys.txt | less

-i = case insensitive
-n = print line numbers next to each result
-w = searches just for "ale" as a single word; will not include other words that contain the pattern "ale," such as "alehouse" or "hale"
less = allows one to scroll through the results one page at a time
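
To see what -w changes, here is a small sketch that runs the same search against a throwaway file. The four sample lines are invented for illustration (they are not Pepys' words), and the variable names are my own.

```shell
# Build a tiny stand-in file (made-up lines, not real Pepys text).
sample=$(mktemp)
printf '%s\n' \
  'Drank a pot of ale at noon.' \
  'Thence to the alehouse with friends.' \
  'Ale was plentiful; the weather hale.' \
  'Home and to bed.' > "$sample"

# Without -w, the pattern also matches inside "alehouse" and "hale".
# -c prints the number of matching lines instead of the lines themselves.
loose=$(grep -ic ale "$sample")

# With -w, only the standalone word "ale" matches (any case, thanks to -i).
strict=$(grep -inw ale "$sample")

echo "lines matched without -w: $loose"
echo "$strict"
rm -f "$sample"
```

The "alehouse" line drops out of the second search, while the line starting with "Ale" stays because of -i.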

The result is a printout of all the lines in the etext in which the word ale appears, together with their line numbers. If line 2555 looks interesting, I can jump to it in the text (from within less) by typing:

!less +2555 pepys.txt

I can browse the passage of the text. When I quit out of this view, I return to my search results.
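
If you would rather print a passage than browse it, one alternative (my substitution, not the less trick above) is to ask sed for a range of line numbers. A minimal sketch, assuming a stand-in file of twenty numbered lines:

```shell
# Stand-in file with numbered lines (hypothetical content).
sample=$(mktemp)
seq 1 20 | sed 's/^/line /' > "$sample"

# -n suppresses normal output; '8,12p' prints only lines 8 through 12,
# i.e. line 10 with two lines on either side.
window=$(sed -n '8,12p' "$sample")

echo "$window"
rm -f "$sample"
```

For the real file, the same idea would be something like sed -n '2550,2560p' pepys.txt, which can also be redirected into a new file.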

Let's say I want a little more context with each search. Then I would simply add the number three (3) to my search options:

grep -3inw ale pepys.txt | less

This will print out each line containing ale together with the three lines above and below it (for a total of seven lines per match).
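
In the GNU and BSD versions of grep that I know of, the bare number is shorthand for -C, and there are also -A and -B for context only after or only before a match. A rough sketch on an invented twenty-line file, with the search word planted on line 10:

```shell
# Twenty stand-in lines; only line 10 contains the word "ale".
sample=$(mktemp)
seq 1 20 | sed -e 's/^/entry /' -e '10s/$/ with ale/' > "$sample"

short=$(grep -3inw ale "$sample")      # the form used above
long=$(grep -C 3 -inw ale "$sample")   # same request spelled with -C
after=$(grep -A 2 -inw ale "$sample")  # the match plus two lines after it

echo "$short"
echo "----"
echo "$after"
rm -f "$sample"
```

With -n, the matching line is marked with a colon after its line number and the context lines with a hyphen, which makes the actual hit easy to spot.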

What about indexes? Let's say I want a complete index of all the words in Pepys' diary, together with their frequency of occurrence. For this I can type:

tr 'A-Z' 'a-z' < pepys.txt | tr -cs a-z '\012' | sort | uniq -c > pepysindex.txt

In a few seconds, I get a new text file with a list of all the words that appear in Pepys' diary together with their frequency. I can browse this for search ideas. Here's a very small excerpt of the results:

 101 alderman
  20 aldermen
   9 aldersgate
   7 aldgate
   2 aldrige
   1 aldworth
 109 ale
   1 alehoofe
  71 alehouse
   1 alehouses
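
The index comes out in alphabetical order. To see the most frequent words first, one option is to add a second, numeric sort to the end of the same pipeline. A minimal sketch on two invented sentences:

```shell
# Two made-up lines standing in for the full text.
sample=$(mktemp)
printf '%s\n' 'Ale and bread.' 'More ale, more bread, more ale, more.' > "$sample"

# Same pipeline as above, then sort by count (highest first)
# and keep only the top three entries.
top=$(tr 'A-Z' 'a-z' < "$sample" | tr -cs a-z '\012' | sort | uniq -c | sort -rn | head -n 3)

echo "$top"
rm -f "$sample"
```

On a real book, the top of such a list tends to be dominated by words like "the" and "and", so the interesting entries may sit further down.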

It's easiest to make the index command an alias (this one's in my .tcshrc file):

alias textindex "tr 'A-Z' 'a-z' | tr -cs a-z '\012' | sort | uniq -c"

Then, to create an index of Pepys' diary, I simply type:

cat pepys.txt | textindex > pepysindex.txt
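
The alias line above is tcsh syntax. For bash, zsh, or another sh-style shell, a function does the same job. This is a sketch under that assumption, using an invented one-line sample file:

```shell
# Function version of the textindex alias, for sh-compatible shells.
textindex() {
  tr 'A-Z' 'a-z' | tr -cs a-z '\012' | sort | uniq -c
}

# One made-up line standing in for the full text.
sample=$(mktemp)
printf '%s\n' 'Ale, ale and more ALE.' > "$sample"

# Redirecting the file with < avoids the extra cat process.
index=$(textindex < "$sample")

echo "$index"
rm -f "$sample"
```

The function definition would go in a startup file such as .bashrc, in the same way the alias lives in .tcshrc.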

And now, for the icing on the cake, I can combine the index alias with the grep function to create subindexes. So to get an index of every word that appears within three lines of the word "ale," I could type:

grep -3iw ale pepys.txt | textindex | less

Or, to save the subindex to a file:

grep -3iw ale pepys.txt | textindex > pepysaleindex.txt

The results suggest new Boolean searches, e.g. passages in the text where the words "ale" and "headache" occur within, say, 3 lines of each other. (Don't know if they actually do; it's just a hypothetical suggestion.)
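
One rough way to run that kind of proximity search is to pipe the context output of the first grep into a second grep. The sketch below uses ten invented diary-style lines, so the hit it finds says nothing about the real text:

```shell
# Ten made-up lines; "ale" is on lines 2 and 9, "headache" on line 4.
sample=$(mktemp)
printf '%s\n' \
  'Up betimes.' \
  'Drank much ale at the tavern.' \
  'Home late.' \
  'Woke with a headache.' \
  'To the office.' \
  'Dined alone.' \
  'Walked abroad.' \
  'Read a while.' \
  'No ale today.' \
  'So to bed.' > "$sample"

# First grep: every "ale" line plus three lines of context, with line numbers.
# Second grep: keep only the lines in that output that mention "headache".
hits=$(grep -3inw ale "$sample" | grep -iw headache)

echo "$hits"
rm -f "$sample"
```

The line number printed with each hit can then be fed to less to read the passage in full.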

Anyway, thought the academics out there might be interested in these little plain text hacks.

TOPICS: Life Hacks

mdl, this is great. You're becoming a real treasure-trove of CLI wisdom for us literary-types. Thanks!



