There's always room at The Bayes Motel
A rainy night. A tour bus pulls up to the decrepit Bayes Motel and disgorges its textual tourists. They wait in line to be assigned a floor for the night, but at sunrise the next day, they have mysteriously vanished. The motel is quiet and nothing remains, except—there! Lurking behind the front desk, an ever-growing Bayesian corpus!
Frightening? Certainly. Cunning metaphor for Bayesian text analysis? Quite possibly. The Bayes Motel makes it easy for a Python hacker to empirically determine whether or not a given problem can be solved with Bayesian text analysis. It's a very simple CGI program in which you classify items of text by putting them into different buckets. As you use it, the program's guesses about which bucket you'll choose will start to get better—or they won't, in which case you need to try some other approach to the problem.
Get The Bayes Motel here.
The Bayes Motel requires the Reverend library.
The Bayes Motel comes with a sample application called Curses. You can see it in the example/ directory, and more detailed information on how to run it is in that directory's README file.
The goal of the Curses application is to train a Bayesian corpus to distinguish sentences that contain curse words (as seen in this list of fictional curse words) from "clean" sentences that don't contain any curse words.
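To make the idea concrete, here is a minimal sketch of that kind of training, written as a toy naive Bayes classifier in plain Python. It is a stand-in for illustration only: the TinyBayes class, its train and guess methods, and the sample sentences are all assumptions, not the Reverend API or the Curses data.

```python
import math
from collections import Counter, defaultdict


class TinyBayes:
    """Toy naive Bayes text classifier (a stand-in, not the Reverend API)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> word -> count
        self.doc_counts = Counter()              # bucket -> sentences seen

    def train(self, bucket, text):
        """Record one sentence as belonging to the given bucket."""
        self.doc_counts[bucket] += 1
        self.word_counts[bucket].update(text.lower().split())

    def guess(self, text):
        """Return (bucket, log-score) pairs, best guess first."""
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_docs = sum(self.doc_counts.values())
        scores = []
        for bucket, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Start from the prior: how common this bucket has been so far.
            score = math.log(self.doc_counts[bucket] / total_docs)
            for word in text.lower().split():
                # Add-one smoothing so unseen words don't zero out the score.
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
            scores.append((bucket, score))
        return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical training data using fictional curse words.
corpus = TinyBayes()
corpus.train("Dirty", "what the frak is that")
corpus.train("Dirty", "smeg off you frakking goit")
corpus.train("Clean", "what a lovely day that is")
corpus.train("Clean", "the motel is quiet tonight")

best_bucket = corpus.guess("smeg off")[0][0]
```

With only four training sentences the guesses are crude, which is rather the point: the more sentences you classify, the more the word counts have to go on.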
Curses defines a class called RandomSentenceBus, which is responsible for bringing "guests" into the motel. In this case the guests are randomly generated sentences which might or might not contain curse words. When you run RandomSentenceBus.py as a script, it will create 100 random sentences and queue them up for processing in the motel.
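A bus of this kind might look something like the following sketch. The word lists, the function names, and the in-memory deque are all assumptions made for illustration; the real Curses example ships its own data and stores its queue in its own way.

```python
import random
from collections import deque

# Hypothetical word lists; the real Curses example uses its own.
CLEAN_WORDS = ["the", "motel", "is", "quiet", "tonight", "bus", "rain", "desk"]
CURSE_WORDS = ["frak", "smeg", "gorram"]


def random_sentence(rng, curse_chance=0.5, length=6):
    """Build one sentence that may or may not contain a curse word."""
    words = [rng.choice(CLEAN_WORDS) for _ in range(length)]
    if rng.random() < curse_chance:
        # Swap one randomly chosen position for a curse word.
        words[rng.randrange(length)] = rng.choice(CURSE_WORDS)
    return " ".join(words)


def fill_queue(count=100, seed=None):
    """Queue up `count` random sentences, the way one bus run would."""
    rng = random.Random(seed)
    return deque(random_sentence(rng) for _ in range(count))


queue = fill_queue(100, seed=42)
```

Passing a seed makes a run repeatable, which is handy when you want to compare how two training sessions treat the same guests.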
You can then run the motel's front desk interface by loading curses.cgi into your web browser. You'll be given a list of sentences and asked to classify each one as "clean" or "dirty".
Once you've classified your first set of sentences, Curses will start having opinions about the rest of the sentences in the queue:
Sentences will start showing up pre-categorized as "Clean" or "Dirty", and specific words in the sentences will be colored according to how often they show up in "Clean" sentences versus "Dirty" ones. A word will be green (#00FF00) if heavily associated with "Clean" sentences, and red (#FF0000) if heavily associated with "Dirty" sentences (if it's heavily associated with both kinds of sentences, it will tend towards this kind of olive color in between green and red: #777700). I call this process "connotating" (connotation+annotating; patent not pending), and it helps you visualize how your classifications are affecting the Bayesian corpus. If you classify enough sentences correctly, you'll start seeing the "clean" words show up green, and the "dirty" words show up red.
Curses defines a CurseMotel class which knows which data files to use, and which has two floors: "Dirty" and "Clean". Both the RandomSentenceBus class and the skeleton curses.cgi script are invoked with an instance of this class.
At any time you can run the RandomSentenceBus.py script to put more randomly generated sentences into the queue. The "guests" in a motel can come from anywhere: emails, web pages, books, etc. Note that if your guests are not plain text, your Motel subclass needs to set its connotate member to False so that The Bayes Motel won't try to connotate your HTML tags or whatever.
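Roughly, turning connotation off could look like the sketch below. Only the connotate member and the "Dirty" and "Clean" floors come from the description above; the stand-in Motel base class, the floors attribute, the WebPageMotel class, and its floor names are hypothetical, so check the real classes before copying any of this.

```python
class Motel:
    """Hypothetical stand-in for the Bayes Motel base class."""
    floors = []
    connotate = True  # assumed default: color individual words


class CurseMotel(Motel):
    # Plain-text guests, so connotation can stay switched on.
    floors = ["Dirty", "Clean"]


class WebPageMotel(Motel):
    # Hypothetical motel whose guests are HTML pages.
    floors = ["Keep", "Skip"]
    connotate = False  # don't try to color the markup
```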
Connotation works on any
Motel with three floors or
fewer. The first floor provides the red element, the second floor
provides the green element, and the third floor (if present) provides
the blue element. Of course, the obvious alternative implementation to
the existing one, which can be confusing for people who don't have the
color wheel in their head, is to pick the "winning" floor for each
word, and only provide (say) the red component of the color when
coloring the word. Let the intensity of that one color show how
closely it is associated with its guessed floor. Anyway.
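Both schemes can be sketched in a few lines. This assumes each word already has a per-floor strength between 0.0 and 1.0; how The Bayes Motel actually derives those numbers is not covered here, and the function names are made up for the example.

```python
def _channel(strength):
    """Clamp a strength to 0.0-1.0 and scale it to a 0-255 color channel."""
    return round(255 * max(0.0, min(1.0, strength)))


def connotation_color(strengths):
    """Mix up to three per-floor strengths into one RGB hex color.

    Floor 1 drives red, floor 2 drives green, floor 3 (if any) drives blue.
    """
    if len(strengths) > 3:
        raise ValueError("connotation needs three floors or fewer")
    channels = [0, 0, 0]
    for i, strength in enumerate(strengths):
        channels[i] = _channel(strength)
    return "#{:02X}{:02X}{:02X}".format(*channels)


def winning_floor_color(strengths):
    """Alternative: pick the winning floor and show it as red intensity only."""
    winner = max(range(len(strengths)), key=lambda i: strengths[i])
    return winner, "#{:02X}0000".format(_channel(strengths[winner]))


dirty_word = connotation_color([1.0, 0.0])      # "#FF0000"
clean_word = connotation_color([0.0, 1.0])      # "#00FF00"
mixed_word = connotation_color([0.467, 0.467])  # "#777700"
winner, shade = winning_floor_color([0.2, 0.8]) # (1, "#CC0000")
```

In the second scheme the floor has to be shown some other way, for example in a label next to the word, since the color no longer says which floor won.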
If the explanation of the Curses application left you confused as to what actually goes on in the Bayes Motel, then the following diagram will probably not be helpful at all: