Twelve Steps To Ultra Gleeperdom
by Leonard Richardson (leonardr at segfault dot org)
----------------------------------------------------

Here's how to install your own webpage recommendation engine. You'll need mysql, mod_python, MySQLdb, and version 0.5 of the SQLObject library (this isn't the latest version, but it should cohabitate with the 0.6 tree because they have different package names). You'll also need a more or less dedicated machine, one on which you don't mind losing a lot of bandwidth and processor time to the Ultra Gleeper.

1. Move this directory into your webspace. On my machine it's in /web/leonardr/html/gleeper/.

2. Set up the schema, eg:

    # mysql -uroot -p < schema.sql

This will create a database called "gleeper". Change the database name in schema.sql if you want. Add user permissions afterwards to make it accessible by a user besides mysql's root.

3. Change gleeper/cfg.py to contain the correct username and password to make a connection to the database. Also give it the base URL to your installation and its location on disk.

You may also have to change the absolute image references in static/gleeper.js. They start out as "/gleeper/static/tu.png", etc., but if you put the directory somewhere besides [web htdoc root]/gleeper/ you'll need to change them. If you want to use the recommendation RSS feed, and you're not running a web-based RSS aggregator on the same domain as you're running the Ultra Gleeper, you'll need to make them absolute URLs to be sure your aggregator can load the images.

4. Set up your Apache mod_python configuration. Here's what mine looks like:

    AddHandler mod_python .py
    PythonHandler mod_python.publisher
    PythonDebug On
    PythonAutoReload On
    AuthType Basic
    AuthName "Enter your Ultra Gleeper username and password."
    Require valid-user
    PythonAuthenHandler RequestHandler
    PythonPath "sys.path + ['/web/leonardr/html/gleeper/']"

5. Run the addUser.py script to create a user account for yourself.
Syntax:

    python addUser.py [desired username] [email address]

You'll be asked to enter a password. Use this password along with your chosen username when you're asked to log in in step 6.

6. Log in to the Ultra Gleeper (the URL is "/ultra.py/recommendations"). Sign up for a Technorati API key (http://www.technorati.com/developers/) and a Google API key (http://www.google.com/apis/). The Ultra Gleeper uses these services to find incoming links.

Once you get the emails containing your keys, go to the Ultra Gleeper web interface, and under the "Your Account" tab, fill in your user account's "Technorati API key" and "Google API key" fields. If you don't want to get an API key, or have one and don't want to use it on the Ultra Gleeper, just leave the corresponding field blank and nothing will happen except your recommendations won't be as good as they could be.

7. Use the Ultra Gleeper "Control Panel" tab to seed the dataset by claiming your weblogs and accounts on social bookmarking services. For example, my two claimed weblogs are http://www.crummy.com/ and http://del.icio.us/leonardr/.

=Special for del.icio.us accounts=

For del.icio.us accounts, you can run the claimDelicious.py script instead of claiming through the web interface. It will claim your del.icio.us page and also import all of your bookmarks as positive ratings. When you run it you'll be asked for your del.icio.us password.

Syntax:

    python claimDelicious.py [Ultra Gleeper username] [del.icio.us username]

Since getting all those bookmarks is a heavy-duty task on the del.icio.us side, don't go crazy with claimDelicious.py. Specifically, you don't need to run it more than once per user. The SyndicateFinder will take care of keeping your links up to date.

=Special for weblog hackers=

If you can somehow finagle a list of all the links you ever posted to your weblog, you can run the importRatings.py script on the list to import those links as Ultra Gleeper ratings.
importRatings.py takes a list of links on standard input and expects one URL per line. If the URL is the only thing on the line, or there's other junk on the line, importRatings.py assumes the rating is 1 (positive), as it assumes for anything posted to a weblog you've claimed. If the URL is followed by a space and either "0", ".5", or "1", it will use that number as the rating for the link.

Syntax:

    python importRatings.py [your username] < links.txt

Where links.txt looks like:

    http://www.foo.com/
    http://www.bar.com/
    http://www.baz.com/

or:

    http://www.foo.com/ 1
    http://www.bar.com/ 0
    http://www.baz.com/ .5

8. Use the Ultra Gleeper "Control Panel" tab to further seed the dataset with your website subscriptions. The quickest way to do this is by uploading an OPML file exported from your RSS aggregator. If you don't use an RSS aggregator, or you can't get a hold of an OPML file, then the quickest way to do this is by bookmarking the "Subscribe to current page" bookmarklet, visiting all the sites you check regularly, and subscribing each one.

9. This one is kind of tricky. So far you've been telling the Ultra Gleeper what you like. It also needs to know what you don't like, or it'll think you like just about everything it finds. You probably don't keep a weblog where you post links you don't like, or have a separate RSS subscription list of feeds you don't want to read. So you need to do some manual ratings.

Go to the "Control panel" tab and bookmark the "Current page is bad" bookmarklet. Then go to some pages you don't like and click the bookmarklet to rate them negatively.

It's not very helpful to just pick random sites that symbolize things you hate. These need to be sites that you don't like, yet are linked to by sites you do like. Basically you need to find the differences between you and the people who write the weblogs you read. The best way to find these is to go to weblogs you like and find some links on there that you don't like.
For instance, I read some weblogs about role-playing games. RPG weblogs tend to link to comic book weblogs in their blogrolls, but I think comic books are boring and I don't like reading about them. Negatively rating some popular comics weblogs as early as possible will keep me from getting so many comics-related recommendations.

Don't rack your brains over this. If nothing comes to mind you can skip this step, but you'll probably start out getting more recommendations for sites you don't like.

10. Expand on the seed dataset by running the VerifyFinder, SyndicateFinder, and UserGleeper several times in succession.

    for i in `seq 1 3`; do
        python gleeper/SyndicateFinder.py
        python gleeper/VerifyFinder.py
        python gleeper/UserGleeper.py
    done

You need to do this to build up a good data set. This can take a long, long time; it can easily turn into a few hundred thousand web accesses and take a day or two. The good news is that you can start using it before it's totally done; see below.

11. The first time the UserGleeper starts running, recommendations will start showing up in the "Recommendations" tab. This means you can start using the Ultra Gleeper to get recommendations. The recommendation quality will depend on the quality of your initial dataset:

* The number of links posted to your claimed weblogs; that is, the number of initial positive ratings the Ultra Gleeper has to work with.
* How many web pages you rated negatively in step 9, and how well they fit with the rest of the dataset.
* How many subscriptions you have.
* How many links get posted by your subscribed weblogs per unit time.

As you use the Ultra Gleeper, or even if you just leave it to work for a while and come back later, you'll start to get better results.

12. Set up cronjobs to run the various scripts on a regular basis. I like to run them late at night or early in the morning, when bandwidth is more plentiful and so everything's ready in the morning.
Be sure to stagger the times for the scripts that access the outside world (everything except the UserGleeper.py/GleeperReaper.py combo; especially TechnoratiFinder.py and GoogleFinder.py) so there's not a big Ultra Gleeper effect across the web every night when everyone's crons kick in. Below is a slightly altered version of my crontab which shows what I think is the best order for the crons.

    0 2 10 * * python /home/leonardr/public_html/gleeper/gleeper/SyndicateFinder.py
    0 3 10 * * python /home/leonardr/public_html/gleeper/gleeper/TechnoratiFinder.py
    0 3 15 * * python /home/leonardr/public_html/gleeper/gleeper/GoogleFinder.py
    0 3 25 * * python /home/leonardr/public_html/gleeper/gleeper/UserGleeper.py && python /home/leonardr/public_html/gleeper/gleeper/GleeperReaper.py
    0 4 30 * * python /home/leonardr/public_html/gleeper/gleeper/VerifyFinder.py

All these scripts have been mentioned before except GleeperReaper.py. That script marks currently shown pages as unshown (so you'll get new ratings the next time you refresh the page) and deletes pages that weren't given a rating in the UserGleeper run. These pages are out of reach because they're too far away from a rated page to be considered for rating themselves. Deleting them means the various Finders won't waste time and bandwidth getting their links or syndicating them. If they ever do come within reach they'll be put back into the database and rated so they don't get deleted again.

Note that GleeperReaper.py assumes the UserGleeper has run since the last time a page was added, since it believes anything not already rated is too far removed from the starting points to be rated. That's why it's best to run it as part of the UserGleeper.py command.

If you can afford the processor time, running the UserGleeper more than once a day will give you better recommendations. Running it doesn't affect your bandwidth.

13.
If you don't want to wait for the crons to run, now's a good time to run gleeper/TechnoratiFinder.py and gleeper/GoogleFinder.py. That way you'll use up your API call allowance for the day and get some new links. These scripts will use your Technorati and Google API keys to find other pages that linked to pages you liked. Remember that to get any new recommendations from the links it finds, you need to rerun the UserGleeper.

You should now have a fully functioning Ultra Gleeper. Have fun!
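
If you'd rather not type the step 13 commands by hand, here is a minimal sketch of a helper that runs them in order: both Finders first, then the UserGleeper followed by GleeperReaper.py, mirroring the pairing in the crontab above. The names build_commands and run_all are hypothetical and not part of the Ultra Gleeper. The helper itself assumes Python 3, and it launches the scripts with whatever "python" is on your PATH, as the commands in this document do.

```python
import subprocess
from pathlib import Path

# Script names come from this document. The order follows step 13:
# the two API-based Finders first, then the UserGleeper/GleeperReaper pair.
STEP_13_SCRIPTS = [
    "TechnoratiFinder.py",
    "GoogleFinder.py",
    "UserGleeper.py",
    "GleeperReaper.py",
]


def build_commands(install_dir, python="python"):
    """Return one argv list per script, in the order they should run.

    install_dir is the directory you moved into your webspace in step 1;
    the scripts are assumed to live in its "gleeper" subdirectory.
    """
    script_dir = Path(install_dir) / "gleeper"
    return [[python, str(script_dir / name)] for name in STEP_13_SCRIPTS]


def run_all(install_dir):
    """Run each script in turn and stop at the first one that fails.

    Stopping on failure keeps GleeperReaper.py from running unless the
    UserGleeper finished, like the && in the crontab entry.
    """
    for argv in build_commands(install_dir):
        subprocess.run(argv, check=True)


# Example (the path is the author's; substitute your own):
# run_all("/web/leonardr/html/gleeper")
```

Keep in mind that the Finders use up your API call allowance for the day, so this is something to run once rather than in a loop.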