Twelve Steps To Ultra Gleeperdom
by Leonard Richardson (leonardr at segfault dot org)
----------------------------------------------------

Here's how to install your own webpage recommendation engine. You'll
need MySQL, mod_python, MySQLdb, and version 0.5 of the SQLObject library
(this isn't the latest version, but it should coexist with the 0.6 tree
because they have different package names). You'll also need a more or
less dedicated machine, one on which you don't mind losing a lot of
bandwidth and processor time to the Ultra Gleeper.

1. Move this directory into your webspace. On my machine it's in
/web/leonardr/html/gleeper/.

2. Set up the schema, e.g.:

# mysql -uroot -p < schema.sql

This will create a database called "gleeper". Change the database name
in schema.sql if you want. Afterwards, add permissions to make the
database accessible to a MySQL user other than root.

3. Change gleeper/cfg.py to contain the correct username and password
to make a connection to the database. Also give it the base URL to your
installation and its location on disk.

You may also have to change the absolute image references in
static/gleeper.js. They start out as "/gleeper/static/tu.png", etc.,
but if you put the directory somewhere besides [web htdoc
root]/gleeper/ you'll need to change them. If you want to use the
recommendation RSS feed, and your RSS aggregator isn't a web-based one
running on the same domain as the Ultra Gleeper, you'll need to make
them full URLs, hostname included, so your aggregator can load the
images.

4. Set up your Apache mod_python configuration. Here's what mine looks like:

<Directory /web/leonardr/html/gleeper>
 AddHandler mod_python .py
 PythonHandler mod_python.publisher
 PythonDebug On
 PythonAutoReload On
 AuthType Basic
 AuthName "Enter your Ultra Gleeper username and password."
 Require valid-user
 PythonAuthenHandler RequestHandler
 PythonPath "sys.path + ['/web/leonardr/html/gleeper/']" 
</Directory>

5. Run the addUser.py script to create a user account for
yourself. 

Syntax: python addUser.py [desired username] [email address]

You'll be asked to enter a password. Use this password along with your
chosen username when you log in to the web interface in step 6.

6. Sign up for a Technorati API key
(http://www.technorati.com/developers/) and a Google API key
(http://www.google.com/apis/). The Ultra Gleeper uses these services
to find incoming links. Once you get the emails containing your keys,
log in to the Ultra Gleeper web interface and under the "Your Account"
tab, fill in your user account's "Technorati API key" and "Google API
key" fields.

If you don't want to get an API key, or have one and don't want to use
it on the Ultra Gleeper, just leave the corresponding field blank and
nothing will happen except your recommendations won't be as good as
they could be.

7. Use the Ultra Gleeper "Control Panel" tab to seed the dataset by
claiming your weblogs and accounts on social bookmarking services. For
example, my two claimed weblogs are http://www.crummy.com/ and
http://del.icio.us/leonardr/. 

=Special for del.icio.us accounts=

For del.icio.us accounts, you can run the claimDelicious.py script
instead of claiming through the web interface. It will claim your
del.icio.us page and also import all of your bookmarks as positive
ratings. When you run it you'll be asked for your del.icio.us
password.

Syntax: python claimDelicious.py [Ultra Gleeper username] [del.icio.us username]

Since getting all those bookmarks is a heavy-duty task on the
del.icio.us side, don't go crazy with claimDelicious.py. Specifically,
you don't need to run it more than once per user. The SyndicateFinder
will take care of keeping your links up to date.

=Special for weblog hackers=

If you can somehow finagle a list of all the links you ever posted to
your weblog, you can run the importRatings.py script on the list to
import those links as Ultra Gleeper ratings. importRatings.py takes a
list of links on standard input and expects one URL per line. If the
URL is the only thing on the line, or there's other junk on the line,
importRatings.py assumes the rating is 1 (positive), as it assumes for
anything posted to a weblog you've claimed. If the URL is followed by
a space and either "0", ".5", or "1", it will use that number as the
rating for the link.
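
The line rules above can be sketched in a few lines of Python. This is
only an illustration of the format as described here, not the actual
importRatings.py code; the function name parse_rating_line is made up,
and treating a line with anything after the rating as "junk" is an
assumption.

```python
def parse_rating_line(line):
    """Return (url, rating) for one line of input, or None for a blank line.

    Hypothetical helper: it mirrors the rules described in this README,
    not the real importRatings.py implementation.
    """
    parts = line.split()
    if not parts:
        return None
    url = parts[0]
    # Only an exact "0", ".5" or "1" directly after the URL counts as a
    # rating (assumption: nothing else may follow it). Any other trailing
    # text is treated as junk and the rating defaults to 1 (positive).
    if len(parts) == 2 and parts[1] in ("0", ".5", "1"):
        return url, float(parts[1])
    return url, 1.0
```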

Syntax: python importRatings.py [your username] < links.txt

Where links.txt looks like:

http://www.foo.com/
http://www.bar.com/
http://www.baz.com/

or:

http://www.foo.com/ 1
http://www.bar.com/ 0
http://www.baz.com/ .5
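
If your weblog archive is available as HTML, one way to produce such a
file is to pull the link targets out of it. Here's a minimal sketch in
modern Python 3 (separate from the Ultra Gleeper's own scripts). The
names LinkCollector, extract_links, and to_links_txt are made up for
this example, and it assumes you only care about absolute http/https
links.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects absolute http(s) targets of <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip relative links, mailto:, fragments, and so on.
            if name == "href" and value and value.startswith(("http://", "https://")):
                self.links.append(value)


def extract_links(html):
    """Return the absolute links in an HTML string, in order, without duplicates."""
    collector = LinkCollector()
    collector.feed(html)
    seen = set()
    unique = []
    for url in collector.links:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique


def to_links_txt(urls, rating=None):
    """Format URLs one per line, optionally followed by "0", ".5" or "1"."""
    if rating is not None and rating not in ("0", ".5", "1"):
        raise ValueError("rating must be '0', '.5' or '1'")
    suffix = "" if rating is None else " " + rating
    return "".join(url + suffix + "\n" for url in urls)
```

Calling to_links_txt(extract_links(page_html)) on each archive page and
writing the results to links.txt gives you a file you can feed to
importRatings.py. Leaving out the rating relies on the default of 1.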

8. Use the Ultra Gleeper "Control Panel" tab to further seed the
dataset with your website subscriptions. The quickest way to do this
is by uploading an OPML file exported from your RSS aggregator. If you
don't use an RSS aggregator, or you can't get hold of an OPML file,
the next quickest way is to bookmark the "Subscribe to current page"
bookmarklet, visit all the sites you check regularly, and subscribe to
each one.
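
If you want to check what's in an OPML file before uploading it, the
sketch below lists the feed URLs it contains. It assumes the usual OPML
layout, where each subscription is an <outline> element carrying an
xmlUrl attribute. The function name and the sample feeds are made up
for illustration, and this is not how the Ultra Gleeper itself reads
the file.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the common OPML subscription-list layout.
SAMPLE_OPML = """<opml version="1.1">
  <head><title>Subscriptions</title></head>
  <body>
    <outline text="Weblogs">
      <outline text="Example One" type="rss" xmlUrl="http://www.example.com/feed.rss"/>
    </outline>
    <outline text="Example Two" type="rss" xmlUrl="http://www.example.org/index.xml"/>
  </body>
</opml>"""


def feed_urls_from_opml(opml_text):
    """Return the xmlUrl of every <outline> element that has one.

    Outlines without xmlUrl (folders, for instance) are skipped, and
    nested outlines are found at any depth.
    """
    root = ET.fromstring(opml_text)
    return [o.get("xmlUrl") for o in root.iter("outline") if o.get("xmlUrl")]
```

For a file on disk, read it in binary mode and pass the contents to
feed_urls_from_opml(); if the list comes back empty, the export
probably isn't a subscription list.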

9. This one is kind of tricky. So far you've been telling the Ultra
Gleeper what you like. It also needs to know what you don't like, or
it'll think you like just about everything it finds. You probably
don't keep a weblog where you post links you don't like, or have a
separate RSS subscription list of feeds you don't want to read. So you
need to do some manual ratings. Go to the "Control Panel" tab and
bookmark the "Current page is bad" bookmarklet. Then go to some pages
you don't like and click the bookmarklet to rate them negatively.

It's not very helpful to just pick random sites that symbolize things
you hate. These need to be sites that you don't like, yet are linked
to by sites you do like. Basically you need to find the differences
between you and the people who write the weblogs you read.

The best way to find these is to go to weblogs you like and find some
links on there that you don't like. For instance, I read some weblogs
about role-playing games. RPG weblogs tend to link to comic book
weblogs in their blogrolls, but I think comic books are boring and I
don't like reading about them. Negatively rating some popular comics
weblogs as early as possible will keep me from getting so many
comics-related recommendations.
 
Don't rack your brains over this. If nothing comes to mind you can
skip this step, but you'll probably start out getting more
recommendations for sites you don't like.

10. Expand on the seed dataset by running the SyndicateFinder,
VerifyFinder, and UserGleeper several times in succession:

 for i in `seq 1 3`; do 
    python gleeper/SyndicateFinder.py;
    python gleeper/VerifyFinder.py;
    python gleeper/UserGleeper.py;
 done

You need to do this to build up a good data set. This can take a long,
long time; it can easily turn into a few hundred thousand web accesses
and take a day or two. The good news is that you can start using it
before it's totally done; see below.

11. The first time the UserGleeper starts running, recommendations
will start showing up in the "Recommendations" tab. This means you can
start using the Ultra Gleeper to get recommendations. The
recommendation quality will depend on the quality of your initial
dataset:

 * The number of links posted to your claimed weblogs; that is, the number
   of initial positive ratings the Ultra Gleeper has to work with.
 * How many web pages you rated negatively in step 9, and how well
   they fit with the rest of the dataset.
 * How many subscriptions you have.
 * How many links get posted by your subscribed weblogs per unit time.

As you use the Ultra Gleeper, or even if you just leave it to work for
a while and come back later, you'll start to get better results.

12. Set up cronjobs to run the various scripts on a regular basis. I
like to run them late at night or early in the morning, when bandwidth
is more plentiful, so everything's ready by morning. Be sure to stagger
the times for the scripts that access the outside world (that's
everything except the UserGleeper.py/GleeperReaper.py combo, and it
goes double for TechnoratiFinder.py and GoogleFinder.py) so there's not
a big Ultra Gleeper effect across the web every night when everyone's
crons kick in. Below is a slightly altered version of my crontab, which
shows what I think is the best order for the crons.

10 2 * * * python /home/leonardr/public_html/gleeper/gleeper/SyndicateFinder.py
10 3 * * * python /home/leonardr/public_html/gleeper/gleeper/TechnoratiFinder.py
15 3 * * * python /home/leonardr/public_html/gleeper/gleeper/GoogleFinder.py
25 3 * * * python /home/leonardr/public_html/gleeper/gleeper/UserGleeper.py && python /home/leonardr/public_html/gleeper/gleeper/GleeperReaper.py
30 4 * * * python /home/leonardr/public_html/gleeper/gleeper/VerifyFinder.py

All these scripts have been mentioned before except
GleeperReaper.py. That script marks currently shown pages as unshown
(so you'll get new ratings the next time you refresh the page) and
deletes pages that weren't given a rating in the UserGleeper
run. These pages are out of reach because they're too far away from a
rated page to be considered for rating themselves. Deleting them means
the various Finders won't waste time and bandwidth getting their links
or syndicating them. If they ever do come within reach they'll be put
back into the database and rated so they don't get deleted again. Note
that GleeperReaper.py assumes the UserGleeper has run since the last
time a page was added, since it believes anything not already rated is
too far removed from the starting points to be rated. That's why it's
best to run it as part of the UserGleeper.py command.

If you can afford the processor time, running the UserGleeper more
than once a day will give you better recommendations. Running it
doesn't affect your bandwidth.
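
The staggering advice in this step can also be sketched in code. This
is a minimal sketch, not part of the Ultra Gleeper: the function name,
the one-hour spacing between jobs, and the /path/to/gleeper install
path are all assumptions. It picks a random minute for each job, so
different installations don't all hit the web at the same moment, and
emits standard five-field crontab lines (minute, hour, day of month,
month, day of week) that run every night in the order recommended
above.

```python
import random


def staggered_crontab(base_dir, first_hour=2, rng=None):
    """Return nightly crontab lines, one job per hour, each at a random minute.

    base_dir is the installation directory (the one that contains the
    gleeper/ package). Pass a seeded random.Random as rng to get
    repeatable output.
    """
    if rng is None:
        rng = random.Random()
    # Same order as the crontab in this step. GleeperReaper.py runs only
    # if UserGleeper.py succeeds, hence the "&&".
    jobs = [
        ["SyndicateFinder.py"],
        ["TechnoratiFinder.py"],
        ["GoogleFinder.py"],
        ["UserGleeper.py", "GleeperReaper.py"],
        ["VerifyFinder.py"],
    ]
    lines = []
    for offset, scripts in enumerate(jobs):
        minute = rng.randrange(60)
        hour = (first_hour + offset) % 24
        command = " && ".join(
            "python %s/gleeper/%s" % (base_dir, script) for script in scripts
        )
        lines.append("%d %d * * * %s" % (minute, hour, command))
    return lines
```

Print the result of staggered_crontab("/path/to/gleeper") and paste it
into your crontab. If one of the Finders regularly takes more than an
hour on your dataset, widen the spacing by hand.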

13. If you don't want to wait for the crons to run, now's a good
time to run gleeper/TechnoratiFinder.py and gleeper/GoogleFinder.py.
That way you'll use up your API call allowance for the day and get
some new links. These scripts will use your Technorati and Google API
keys to find other pages that linked to pages you liked. Remember that
to get any new recommendations from the links they find, you need to
rerun the UserGleeper.

You should now have a fully functioning Ultra Gleeper. Have fun!