< Bots Should Punch Up
November Film Roundup >

@everybrendan Season Two: Last year I wrote one of my first Twitter bots, @everybrendan. Inspired by Adam's infamous @everyword, it ran for two months, announcing possible display names for Brendan's Twitter account (background), taken from Project Gutenberg texts. Then I got tired of individually downloading, preparing, and scraping the texts, so I let it lapse a year ago today, with a call for requests for a "season two" that never materialized.

Well, season two is here, and it's a doozy. I've gone through Project Gutenberg's 2010 dual-layer DVD and found about 300,000 Brendan names in about 20,000 texts, enough to last @everybrendan until the year 2031. At that point I'll get whatever future-dump contains the previous twenty years of Project Gutenberg texts and do season three, which should keep us going until the Singularity. The season two bot announces each new text with a link, so it educates even as it infuriates.

I've been wanting to do this for a while, but it's a very tedious process to handle Project Gutenberg texts in bulk. Most texts are available in a wide variety of slightly different formats. The texts present their metadata in many different ways, especially when it comes to the dividing line between the text proper and the Project Gutenberg information. Some of the metadata is missing, some of it is wrong, and there's one Project Gutenberg book that doesn't seem to be in the database at all.

I started dealing with these problems for my NaNoGenMo project and realized that it wouldn't be difficult to get something working in time for the @everybrendan anniversary. I've put the underlying class in olipy: it's effectively a parser for Gutenberg texts, and a way to iterate over a CD or DVD image full of them. It can also act as a sort of lint for missing and incorrect metadata, although I imagine Project Gutenberg doesn't want to change the contents of files that have been on the net for fifteen years, even if some of the information is wrong.

The Gutenberg iterator still needs a lot of work. It's good enough for @everybrendan, but not for my other projects that will use Gutenberg data, so I'm still working on it. My goal is to cleanly iterate over the entire 2010 DVD without any problems or missing metadata. The problems are concentrated in the earlier texts, so if I can get the 2010 DVD to work it should work going forward.

Filed under:

[Main] [Edit]

Unless otherwise noted, all content licensed by Leonard Richardson
under a Creative Commons License.