The Bayes Motel

A rainy night. A tour bus pulls up to the decrepit Bayes Motel and disgorges its textual tourists. They wait in line to be assigned a floor for the night, but at sunrise the next day, they have mysteriously vanished. The motel is quiet and nothing remains, except—there! Lurking behind the front desk, an ever-growing Bayesian corpus!

Frightening? Certainly. Cunning metaphor for Bayesian text analysis? Quite possibly. The Bayes Motel makes it easy for a Python hacker to empirically determine whether or not a given problem can be solved with Bayesian text analysis. It's a very simple CGI program in which you classify items of text by putting them into different buckets. As you use it, the program's guesses about which bucket you'll choose will start to get better—or they won't, in which case you need to try some other approach to the problem.

Download

Get The Bayes Motel here.

The Bayes Motel requires the Reverend library.

Sample application

The Bayes Motel comes with a sample application called Curses. You can see it in the example/ directory, and more detailed information on how to run it is in that directory's README file.

The goal of the Curses application is to train a Bayesian corpus to distinguish sentences that contain curse words (as seen in this list of fictional curse words) from "clean" sentences that don't contain any curse words.
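To make the idea of "training a corpus" concrete, here is a minimal sketch of a naive Bayes text classifier using only the standard library. It is not Reverend and not the code The Bayes Motel actually runs; the class name TinyBayes, its methods, and the example sentences (with "frak" and "smeg" standing in for fictional curse words) are all made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


class TinyBayes:
    """Toy multinomial naive Bayes classifier (illustration only, not Reverend)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # floor -> word -> count
        self.doc_counts = Counter()              # floor -> number of training texts

    @staticmethod
    def tokenize(text):
        # Lowercase the text and split it into simple word tokens.
        return re.findall(r"[a-z']+", text.lower())

    def train(self, floor, text):
        """Record that `text` was assigned to `floor`."""
        self.doc_counts[floor] += 1
        self.word_counts[floor].update(self.tokenize(text))

    def guess(self, text):
        """Return the most likely floor, or None if nothing has been trained."""
        if not self.doc_counts:
            return None
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_docs = sum(self.doc_counts.values())
        best_floor, best_score = None, -math.inf
        for floor, n_docs in self.doc_counts.items():
            counts = self.word_counts[floor]
            total_words = sum(counts.values())
            # Start from the log prior for this floor.
            score = math.log(n_docs / total_docs)
            for word in self.tokenize(text):
                # Add-one smoothing so unseen words never give log(0).
                score += math.log((counts[word] + 1) / (total_words + len(vocab) + 1))
            if score > best_score:
                best_floor, best_score = floor, score
        return best_floor


corpus = TinyBayes()
corpus.train("Dirty", "what the frak is this")
corpus.train("Dirty", "frak that smeg head")
corpus.train("Clean", "what a lovely day this is")
corpus.train("Clean", "that cat is asleep")
print(corpus.guess("frak this"))     # Dirty
print(corpus.guess("a lovely cat"))  # Clean
```

With only four training sentences the guesses are already sensible for this toy vocabulary; on a real problem you would need many more classifications before deciding whether the approach works.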

Curses defines a class called RandomSentenceBus, which is responsible for bringing "guests" into the motel. In this case the guests are randomly generated sentences which might or might not contain curse words. When you run RandomSentenceBus.py as a script, it will create 100 random sentences and queue them up for processing in the motel.
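For a rough feel of what a "bus" does, the sketch below generates random sentences that may or may not contain a curse word. It is a guess at the general idea, not the real RandomSentenceBus: the word lists, the function names random_sentence and fill_queue, and the use of a plain Python list as the queue are all assumptions made for this example.

```python
import random

# Hypothetical vocabularies; the real Curses application uses its own lists.
CLEAN_WORDS = ["the", "motel", "clerk", "lost", "my", "room", "key", "again"]
CURSE_WORDS = ["frak", "smeg"]  # fictional curse words chosen for this sketch


def random_sentence(rng, curse_probability=0.5, length=6):
    """Build one sentence of `length` words, sometimes swapping in a curse word."""
    words = [rng.choice(CLEAN_WORDS) for _ in range(length)]
    if rng.random() < curse_probability:
        # Replace one randomly chosen word with a curse word.
        words[rng.randrange(length)] = rng.choice(CURSE_WORDS)
    return " ".join(words).capitalize() + "."


def fill_queue(count=100, seed=None):
    """Return `count` random sentences, standing in for the motel's guest queue."""
    rng = random.Random(seed)
    return [random_sentence(rng) for _ in range(count)]


queue = fill_queue(100, seed=1)
print(len(queue))  # 100
print(queue[0])
```

Passing a seed makes the "tour bus" reproducible, which is handy when you want to compare two training sessions on the same guests.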

You can then run the motel's front desk interface by loading curses.cgi into your web browser. You'll be given a list of sentences and asked to classify each one as "clean" or "dirty".

Once you've classified your first set of sentences, Curses will start having opinions about the rest of the sentences in the queue:

Sentences will start showing up pre-categorized as "Clean" or "Dirty", and specific words in the sentences will be colored according to how often they show up in "Clean" sentences versus "Dirty" ones. A word will be green (#00FF00) if it's heavily associated with "Clean" sentences, and red (#FF0000) if it's heavily associated with "Dirty" sentences. If it's heavily associated with both kinds of sentences, it will tend towards an olive color in between green and red (#777700). I call this process "connotating" (connotation + annotating; patent not pending), and it helps you visualize how your classifications are affecting the Bayesian corpus. If you classify enough sentences correctly, you'll start seeing the "clean" words show up green, and the "dirty" words show up red.
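One plausible way to turn a word's history into a color is sketched below. The function name connotation_color and the scaling (each channel gets the word's share of appearances on that floor) are assumptions; the exact formula The Bayes Motel uses isn't shown here and evidently differs a little, since an evenly split word comes out as #808000 in this sketch rather than #777700.

```python
def connotation_color(dirty_count, clean_count):
    """Map a word's per-floor counts to a CSS hex color (red = Dirty, green = Clean)."""
    total = dirty_count + clean_count
    if total == 0:
        return "#000000"  # unseen word: no color information
    # Each channel is the fraction of the word's appearances on that floor.
    red = round(255 * dirty_count / total)
    green = round(255 * clean_count / total)
    return "#{:02X}{:02X}00".format(red, green)


print(connotation_color(5, 0))  # #FF0000
print(connotation_color(0, 5))  # #00FF00
print(connotation_color(3, 3))  # #808000
```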

Behind the Scenes

Curses defines a CurseMotel class which knows which data files to use, and which has two floors: "Dirty" and "Clean". Both its RandomSentenceBus class and its skeleton curses.cgi script are invoked with an instance of this Motel subclass.

At any time you can run the RandomSentenceBus.py script to put more randomly generated sentences into the queue. The "guests" in a motel can come from anywhere: emails, web pages, books, etc. Note that if your guests are not plain text, your Motel subclass needs to set its connotate member to False so that The Bayes Motel won't try to connotate your HTML tags or whatever.

Connotation works on any Motel with three floors or fewer. The first floor provides the red element, the second floor provides the green element, and the third floor (if present) provides the blue element. The existing implementation can be confusing for people who don't have the color wheel in their head. The obvious alternative is to pick the "winning" floor for each word and provide only (say) the red component when coloring it, letting the intensity of that one color show how closely the word is associated with its guessed floor. Anyway.
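The "winning floor" alternative could look something like the sketch below. This is not part of The Bayes Motel; the function name winner_color, the dict of per-floor counts, and the share-of-appearances intensity are all assumptions made for illustration.

```python
def winner_color(counts):
    """Pick the floor a word is most associated with and a single-channel color.

    `counts` maps floor name -> how often the word was seen on that floor.
    Returns (winning floor, hex color); only the red channel is ever used,
    and its intensity reflects how strongly the word leans toward the winner.
    """
    total = sum(counts.values())
    if total == 0:
        return None, "#000000"  # word never seen on any floor
    winner = max(counts, key=counts.get)
    intensity = round(255 * counts[winner] / total)
    return winner, "#{:02X}0000".format(intensity)


print(winner_color({"Dirty": 3, "Clean": 1}))  # ('Dirty', '#BF0000')
print(winner_color({"Dirty": 0, "Clean": 4}))  # ('Clean', '#FF0000')
```

Because the color no longer says which floor won, the floor name has to be shown alongside it, which is why the function returns both.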

Confusing Diagram

If the explanation of the Curses application left you confused as to what actually goes on in the Bayes Motel, then the following diagram will probably not be helpful at all:


This document (source) is part of Crummy, the webspace of Leonard Richardson (contact information). It was last modified on Sunday, August 07 2005, 22:54:57 Nowhere Standard Time and last built on Friday, October 31 2014, 13:00:05 Nowhere Standard Time.

Crummy is © 1996-2014 Leonard Richardson. Unless otherwise noted, all text licensed under a Creative Commons License.
