The Ultra Gleeper: A Recommendation Engine for Web Pages

by Leonard Richardson (leonardr at segfault dot org)
Paper revision 1: 02/06/2005

Introduction

Recommendation engines enjoyed a vogue in the mid-90s. They would solve the problem of information overload by matching user preferences against a large universe of data. The ultimate realization of this strategy would be a recommendation engine capable of mining that Northwest territory of data, the World Wide Web.

Recommendation engines were built and run into troubles. Seemingly insurmountable problems emerged and the flame of hype moved elsewhere. Recommendation engines for web pages were not built or successfully launched. To even attempt one would require development of a web crawler and the associated resources. Today, recommendation engines have something of the reputation of a well-meaning relative who gives you gifts you often already have or don't quite want. Most useful recommendations come from knowledgeable friends or trusted web sites.

But over the years, as people built these web sites, they came up with models and tools for solving the basic problem of finding and tracking useful web sites. The wide adoption of these strategies has not only brought down the cost of building a web page recommendation engine, it's removed some of the insurmountable problems that still plague recommendation engines for other domains. It's now possible for someone with a dedicated server to run a recommendation system for themselves and their friends. I've done it and I'll show you how to do it.

The shoulders of giants

A web page recommendation engine is now possible because a lot of the work is done elsewhere on the web and exposed to the public, because web surfers track new types of information, and because new ideas have taken root. Here are the tools and concepts used by my recommendation engine that didn't exist in the mid-90s, or weren't the juggernauts they are today:

Millions more write for the web today
Hundreds of thousands more have weblogs
RSS, RSS aggregators, and OPML: structure for reading weblogs
Google PageRank
Publicly accessible search engine APIs, in particular, the Technorati and Google web APIs
Social bookmark sites like del.icio.us

Giving away the store

In this section I have collected all the little epiphanies I had that led to the Ultra Gleeper. The rest of this paper will explain their ramifications and have a narrative and details and all the good things that go into such papers, but it will be built around a few concepts, just as the design is built around a few moments of clarity.

Every day, hundreds of thousands of people post their recommendations for web pages in machine-parseable form. Against our own better judgment, we call these people "bloggers". They can be treated as users of the recommendation engine who contribute but never make demands on the system. So can people who use social bookmarking sites.
Tens of thousands of people carefully maintain lists of the link recommenders ("bloggers") they trust and check often. These lists, and the pages linked to by them, often encompass most of these people's casual web browsing. They keep these lists as part of their use of a tool, an RSS aggregator, that helps them read all that stuff. Their subscription lists can usually be exported to machine-parseable OPML and imported into other tools. These people are our ideal users, especially if they also have weblogs or use social bookmarking sites.
You don't have to crawl the entire web to get good recommendations. You only need to crawl the part of the web that's near a user's trusted recommenders. This is still hundreds of thousands of documents, but it's within the capabilities of a single computer with a high-bandwidth connection.
A heuristic for avoiding monotonous recommendations: present tiers of recommendations instead of just the top-tier "best" results.
A heuristic for avoiding recommendations of pages already seen: punish excessive popularity.

Things that kill recommendation engines but can't kill the Ultra Gleeper

Several longstanding problems plague the field of recommendation engines, and have somewhat dampened enthusiasm for them. But a recommendation engine for web pages doesn't have the same problems as one for books or movies.

Early adopter problem

One problem is early adopters: specifically, what happens when you are one. A recommendation engine can only give good recommendations starting from a well-populated database of ratings. But when you're an early adopter, you're the poor sucker who has to populate the database so people who come along later can get the benefit. It's a Ponzi scheme in reverse.

The web itself contains the information we need to jump-start our database. It's full of people who post recommendation links on their web pages, but with other people as their audience. If we listen in as well, we can treat the web as an enormous base of "users" who provide recommendations, and we won't have to sucker anyone. More generally, we can model the links on a web page or weblog as recommendations, the way Google PageRank does. Other recommendation engines correlate your ratings against those provided by other users, but the Ultra Gleeper doesn't need any other real users at all. Correlating with other users' Ultra Gleeper ratings would be useful, but it's not neccessary and I haven't implemented it yet.

Startup costs

Even if you're not an early adopter, when you start using a recommendation engine you play out the early adopter drama in miniature. Because the recommendation engine knows nothing about you to begin with, you end up giving a lot of meaningless ratings just to calibrate the system.

If you've read this far and are still interested, though, there's a good chance you can calibrate a web page recommendation engine by giving it just a couple pieces of information. If you use an RSS reader to keep up with your web reading, you should be able to export your feed list to OPML format and give it to the Ultra Gleeper. It will then keep track of those subscriptions and use their ratings as a discounted proxy for your own ratings.

If you have our own weblog, you're one of the hundreds of thousands of people who are already driving the Ultra Gleeper's dataset. If you register your weblog with the Ultra Gleeper, you can associate the links you're already posting with your Ultra Gleeper account. The Ultra Gleeper will use your weblog's ratings as your ratings, and posting to your weblog will be the same as rating that link positively. The same is true if you use del.icio.us, some other social bookmark service, or just publish your browser bookmarks to the web. You'll get added benefits from doing what you already do.

Movies rarely recommend other movies, and songs rarely recommend other songs. Nonfiction books sometimes recommend other books in their bibliographies. But to the Ultra Gleeper, all web pages and RSS feeds are nothing but lists of recommendations to other web pages. This means that each one of your ratings has greater effect; you're ratifying the interestingness of not just that page, but, with a discount, the pages it links to, and those that link to it.

Recommendations for things you've already seen

Recommendation engines in other domains have a difficult time keeping track of which items in that domain you've encountered. Either they assume you experience the domain entirely through the recommendation engine, they ask you to manually keep it informed of the items you've experienced outside the recommendation engine, or they just accept that they'll give you recommendations for things you've already experienced.

But web pages are always consumed in a way communicable to a recommendation engine. The Ultra Gleeper's ideal users do most of their casual browsing entirely through their RSS aggregators. Since it has their RSS subscription list, by keeping up with their subscriptions as they do it can avoid most repeats. (The Ultra Gleeper does ask for ratings of links already seen, because such ratings are very useful, but they're presented separately.)

The Ultra Gleeper also uses a heuristic to lower the repeat rate by probabilistically excluding extremely popular links, as seen below. You could lower the repeat rate even more by creating a browser plugin that reported every viewed URL to your Ultra Gleeper account. However I think this would be kind of creepy (perhaps the plugin could be called the Ultra Creeper).

As other recommendation domains go online, recommendation engines immune to these problems can be constructed. For instance: many people have their entire music collections in digital form, download all their new music from online sources (legal or otherwise), and listen to music on a device capable of constructing a complete playlist and a listening history. It should be possible to create a music recommendation engine that uses that information the same way the Ultra Gleeper uses RSS subscription lists and the like. The Audioscrobbler recommendation engine takes advantage of at least some of this information.

Things that kill recommendation engines but which the Ultra Gleeper avoids

Recommendation engines have some other problems that aren't specific to the recommendation domain. The Ultra Gleeper uses heuristics to avoid some of these problems. Because the problems have nothing to do with the recommendation domains, these heuristics should be generally applicable. Of course it's possible these problems have been solved in other recommendation engines, and my perception of them is somehow misinformed. At any rate, these are problems to avoid.

Sales-driven scoring

My idea of a recommendation engine is one that surprises me with unexpected but interesting items from the recommendation domain. But most recommendation engines I've used seem timid and afraid to explore a large portion of the domain. This is just a suspicion of mine, but I think a big factor here is that many recommendation engines are used to drive sales of items in the domain. The scoring algorithm is therefore driven by a desire to maximize the money obtained from each recommendation. This can make using the engine an exercise in seeing things you already knew about, driven more by the store's needs (hoping you'll decide now is the time to buy) than yours. For instance, Amazon's recommendation engine gives you a lot of books by the authors you've already noted you like.

Another interpretation of this phenomenon is that publicly accessible commercial recommendation engines can't afford the processing time necessary to explore more of the domain for each user.

Obvious recommendations

Whatever it is, many recommendation engines don't dig deep for their recommendations. There is little surprise, and often recommendation lists are dominated by the current best-sellers or most popular items in the recommendation space. This can be a design decision to drive sales, or because the scoring algorithm believes that, all else being equal, a more popular item should be recommended above a less popular one.

Snobbishness aside, we don't need recommendation engines to tell us what's popular. We don't live in the recommendation engine, but in a world where we're constantly being told about the currently most popular items in various recommendation domains. That's part of what it means for something to be popular.

If you want best-sellers you can check the best-seller list. In my opinion, a good recommendation engine must have an element of anti-popularity. I implemented this in the Ultra Gleeper with something I call the Indie Rock Peter Principle, named after the comic strip character Indie Rock Pete, who has a kind of sadomasochistic relationship with popularity. The IRPP stipulates that beyond a certain point, additional recommendations (i.e. incoming links) decrease a page's score instead of increasing it. Not only does this clear the ground for real surprises, it acts as an additional, probabilistic filter for pages you've seen in casual browsing. Anything it misses will probably show up within a couple days in the weblogs you read.

Stuck in a rut

Just as a good recommendation engine should surprise its user, it must sometimes be surprised by the user for its results to stay relevant. A surprise happens when the user rates the page lower or higher than the recommendation engine had expected. A surprise low rating trims the recommendation engine's hypothesis tree about the user's likes; a surprise high rating opens new branches in the hypothesis tree.

The problem is that it goes against the purpose of a recommendation engine to present a recommendation it thinks will get a low rating. It's not surprising that a typical recommendation engine doesn't do this. For such a recommendation engine, therefore, all its surprises are surprise low ratings. Its hypothesis tree is trimmed closer and closer until it's stuck in a rut. This creates the phenomenon described in "My TiVo Thinks I'm Gay!". If the recommendation domain is large enough or grows fast enough, and if the rut is in some sense the "right" rut, then the engine can keep delivering good recommendations, but the scope for serendipity is reduced.

The Ultra Gleeper presents multiple tiers of recommendations to strike a balance between surprising the user and giving the user a chance to surprise it. It also displays in a separate tier links found in the user's subscriptions and presumed to have been seen already. Getting the user to rate these links is very useful, but since we know they've already seen those links we have to ask for ratings as a favor.

How It Works

The Ultra Gleeper is a set of Python scripts that write to a database, and a simple mod_python application that presents a user's view of the database in HTML or RSS format.

Backend scripts

The scripts have four tasks. First, to gather the URLs to new web pages and find new links to pages. Second, to gather information about web pages such as the page filter and the address of any associated syndication feed; to consolidate duplicate pages, to reap 404s, and to update in response to redirects. Third, for each user, to give an estimated score to each page within a certain distance of that user's subscriptions. Finally, to remove from the dataset pages too distant from a user's subscriptions to be rated.

Script Purpose

SyndicationFinder Gathering new links (from syndication feeds)

VerifyFinder Getting page information, getting new links (from screen-scraping)

TechnoratiFinder Getting new links (from Technorati web API)

GoogleFinder Getting new links (from Google link: queries)

UserGleeper Rating pages

GleeperReaper Removing pages too far away to be rated

Script	Purpose
SyndicationFinder	Gathering new links (from syndication feeds)
VerifyFinder	Getting page information, getting new links (from screen-scraping)
TechnoratiFinder	Getting new links (from Technorati web API)
GoogleFinder	Getting new links (from Google link: queries)
UserGleeper	Rating pages
GleeperReaper	Removing pages too far away to be rated

The *Finder scripts act on pages already in the database, starting from a user's weblogs and subscriptions. Our goal is to capture and score all the pages within four links of a user's subscription. This is a wide enough net to get lots of unseen pages to recommend, but not so wide that the Ultra Gleeper becomes a general web crawler and has to score the whole web.

The Ultra Gleeper uses both incoming and outgoing links in its scoring calculation. Outgoing links are easy to get: the VerifyFinder and SyndicateFinder retrieve them when they pull the page or its syndication feed. Incoming links are not easy to get, because getting them requires you have a picture of most of the web. We get most of our incoming links by using the APIs of services which do have a picture of most of the web; this is what the TechnoratiFinder and GoogleFinder do. Unfortunately for us, our access to these services is limited and their purposes may not be aligned with the Ultra Gleeper's. (See the section "Missing Technology" below for details.)

To avoid burdening the sites whose pages it crawls, the Ultra Gleeper makes all its web accesses through the Coral public web proxy. To make best use of limited bandwidth and limited access to external APIs, it keeps track of the last time an access method was used against a particular page. Each method has an associated timeout: a page's syndication feed will be retrieved at most once a day, but its Technorati Cosmos will be retrieved at most once every 60 days. For services that require an API key (Technorati and Google), users are asked to obtain and provide a key for that service. A user's key is used to explore the space near that user's subscriptions.

Recommendation engines can operate on the actual contents of items in their recommendation domain, on the metadata of those items, or on the links between items. The Ultra Gleeper scores a page considering only the scores of the pages that link to or are linked from it. Though most pages have both types of links, weblogs tend to have more outgoing than incoming links, the reverse being true of non-weblogs. A high average outgoing score indicates the page is a reliable source of links the user likes, and a high average incoming score indicates that the page is related to other well-regarded pages. It's at this point that we apply the Indie Rock Peter Principle.

You may have noticed that this algorithm looks like the Google PageRank algorithm. The major differences are the use of outgoing links to determine a page's quality, and the fact that the score for a page is different for each user.

As with the PageRank algorithm, the Ultra Gleeper scoring algorithm is iterative. When the script runs, ratings flow from rated pages to the pages one link away. To get pages four links away from a user's initial subscriptions, the script needs to be run four times.

It's also neccessary to rate each page multiple times to reach an equilibrium score. Suppose page A links to page B. When page A is scored by the Ultra Gleeper algorithm, page B's rating must be updated because its rating depends on the ratings of all pages that link to it. Once its rating has been updated, though, page A's rating must be updated again, because its score depends on the scores of the pages to which it links.

PageRank is calculated many times over until it reaches an equilibrium value. Unfortunately, unlike with PageRank, a page's Ultra Gleeper score is not solely a function of the page's place in the graph of the web. It derives from a specific user's opinions of that graph. It's as though each user had their own customized PageRank. Because we can't amortize across the user base the job of rating, and because the algorithm must be run multiple times, this script and the code it calls have been highly optimized to reduce database accesses and the number of pages to be rated on each iteration.

After being added to the database, a page gets rated the next three times UserGleeper runs. Subsequently it only gets rated when its incoming or outgoing links change or are re-rated. The cost in accuracy and code cleanliness are compromises made for optimization's sake. On my not-particularly-powerful server the Ultra Gleeper completes one rating iteration for one user in a quarter-hour to an hour, making it possible to do several runs a day.

The Front End

Once the dataset has been populated, the front end display is fairly simple. In addition to the usual account management screens (including bookmarklets for subscribing to and rating pages), there's a web-based and an RSS-based interface for recommendation. Both are HTML-based and use (but do not require) Javascript. Our goals in the display application are to show the user cool things and to collect much-needed feedback.

The web interface presents four tiers of links to be rated, with a selection chosen randomly from each tier. The tiers are: highly-scoring pages, middle-scoring pages, low-scoring pages, and pages seen elsewhere. If Javascript is enabled, the user can rate a link without reloading the entire page; when the page is reloaded, any rated links are replaced by new recommendations.

The RSS interface presents a customizable mix of recommendations from the four tiers. The scoring controls work the same way as in the web interface. The main difference is that requesting an RSS feed marks the pages returned as "seen but not rated". This prevents those recommendations from being shown again, except (if you never rate them) as links in the "already seen" tier.

Missing technology and other problems

As I write this in January 2005, there are still a boatload of problems with the implementation of the Ultra Gleeper, and one major piece of missing technology in the stack of giants on whose shoulders the Ultra Gleeper stands.

Missing Technology: Better Incoming Link APIs

The most difficult piece of information for the Ultra Gleeper to get is the list of incoming links to a particular page. It uses (or used at one time) several mechanisms for collecting this information, but all of them are limited in some way.

Method Link quality Limitations

Stumbling upon incoming links while following outgoing links Pretty good Depends on serendipity

Google Web API (link: queries) Not good: ordered by PageRank instead of recentness 1000 queries/user/day

Technorati web API (Cosmos query) Excellent 500 queries/user/day, frequently down

del.icio.us screen scraping Excellent I tried this and Joshua Schachter got mad at me

Method	Link quality	Limitations
Stumbling upon incoming links while following outgoing links	Pretty good	Depends on serendipity
Google Web API (`link:` queries)	Not good: ordered by PageRank instead of recentness	1000 queries/user/day
Technorati web API (Cosmos query)	Excellent	500 queries/user/day, frequently down
del.icio.us screen scraping	Excellent	I tried this and Joshua Schachter got mad at me

The ideal Ultra Gleeper-friendly implementation of this feature would feature links as up-to-date as Technorati's, but would have a higher daily limit or a time-based limit like Amazon's web API which allows one request per second.

Because the threshold for posting to a social bookmark site is so much lower than that for a weblog, such sites are very useful to the Ultra Gleeper. An API for del.icio.us and furl to determine who has bookmarked a particular page would greatly improve the quality of the recommendations of the Ultra Gleeper and any future web page recommendation engines.

Scoring is slow

I don't say "too slow" because the scoring algorithm is now fast enough for my personal use, which was not originally the case. Additional optimizations are possible: a smarter method of propagating the dirty bit, and grouping similar webpages so as to be able to score them in bulk.

A more fundamental problem is scalability. An Ultra Gleeper installation with n users uses less bandwidth but the same amount of processor time as n separate installations. Processor power is increasing, of course, but so are the scope of the problem and the number of people in need of a solution. User base scalability may require sacrificing recommendation quality, by clustering users' ratings together, for instance.

Too many unpleasant surprises

Over the years people have developed weblogs and social bookmarking sites, which track the links a person likes, and RSS subscriptions and start pages, which track the weblogs a person finds useful sources of links. The Ultra Gleeper exploits this to limit the number of initial ratings a user must do to start getting good results. However, no similar technologies or conventions have been developed for keeping track of which links and weblogs a person can't stand. The Ultra Gleeper must gather this information from scratch, and without it all the recommendations it gives will be skewed towards the positive.

Getting more accurate rating distribution requires getting surprise low ratings from the user; that is, the user must give low ratings to pages the Ultra Gleeper thought would get high ratings. Unfortunately, when the recommendation engine gets a surprise low rating, the user thinks "it gave me a bad link", and reduces their opinion of the recommendation engine accordingly.

For this reason I recommend in the Ultra Gleeper documentation that users volunteer some negative ratings when just starting out. I don't know how effective this will actually be, though, and I suspect that under actual usage conditions the Ultra Gleeper will generally find manually entered negative ratings more useful than manually entered positive ones.

Overzealous URL stemming

To avoid capturing a weblog's links to all its permalinks, and then recommending each page of a weblog on the basis of a positive rating for the weblog, the Ultra Gleeper disregards all links to other pages in the same domain. This works but it also loses some links between weblogs on large hosting sites like LiveJournal that don't use per-weblog vhosting. General web crawlers may have solved this problem, but I don't know how.

Newsworthiness problems

The Ultra Gleeper can't distinguish between a webpage that describes a current event, a webpage that describes an event current a year ago, and a year-old webpage that describes something still current or interesting. All three might have the same pattern of incoming links and get the same rating. The problem is not evident at first because a new user's dataset is initialized with the links currently active in the weblogs they read. Theoretically, however, usage over a period of time could lead to a situation where old links are chosen over new links with the same rating.

No language restrictability

Most people understand only a fraction of the languages used on the web. A desirable feature would be to eliminate from the recommendation domain pages known to be in an unknown language. The Ultra Gleeper determines page languages from the VerifyFinder and SyndicationFinder, but doesn't have a framework for recording user language preferences or incorporating page languages into the scoring algorithm. As it turns out, this isn't the problem I thought it would be, because links across languages are uncommon.

Spam and other attacks

An attacker might want to game a recommendation engine either to make worthless items show up as recommended (i.e. spamming it), or to prevent worthy items from being recommended (some kind of censorship?). Since the only source of negative ratings in Ultra Gleeper is the actual user, we can disregard the second type of attack and focus on spam.

Suppose you're an attacker. To make your worthless web page get recommended for a particular user, you must connect it to a number of web pages which that user has ranked positively. These pages must also be within a few links of one of that user's subscriptions. Finally, your page must be picked up by one of the finders, so that it makes it into the database.

The accepted way of connecting a well-regarded page to your worthless one is to take advantage of a site's openness by posting comment or wiki spam. Since the Ultra Gleeper considers both incoming and outgoing links, you might also think of linking your worthless page to the well-regarded pages. The latter is trivial, assuming you know which pages to hit, and no one can stop you from doing it. But incoming links are only gathered through the GoogleFinder and TechnoratiFinder, which have limited usages per day. So you have to game one of those other services as well, and then you have to hope that your worthless page is one of the few that gets returned from the service in an API query.

I couldn't find any way to exploit the special features of the Ultra Gleeper to get a better return on a spam investment. As is so often the case, the best spamming strategy turns out to be to spam everything and hope it sticks. Because the Ultra Gleeper mistrusts pages with too many or too few incoming links, a fair amount of the spam must stick, but if too much of it sticks then the Indie Rock Peter Principle will kick in. Spam that spreads too successfully will achieve the same oblivion as a too-popular Livejournal quiz.

My conclusion is that it is possible to spam the Ultra Gleeper, especially if the user's page subscriptions are close to unmaintained wikis or weblogs prone to comment spam. I haven't seen any spam yet, but I have gotten recommendations for other peoples' subscription lists and blogrolls just because we like a lot of the same weblogs. Those pages can be interesting but they're usually not, and they're topographically identical to the page I'd create if I were a spammer. If spam becomes a problem, the Ultra Gleeper will need to change its tactics. Possibilities I'm considering are the introduction of some simple content filtering, or more sophisticated link analysis: something like the Indie Rock Peter Principle for outgoing links.

The other attack I've come up with against the Ultra Gleeper is to publish an RSS feed that advertises itself as the feed of a popular page that actually has no RSS feed. If the Ultra Gleeper finds the fake feed, it'll associate it with the popular page, and ratings given to the popular page will trickle down into the pages linked to by the fake RSS feed. I'm not sure how to distinguish between that malicious case and the case of a site whose legitimate feed is provided by feedburner or some other external RSS service, but the problem doesn't seem intractable.

Conclusion

It's now possible for one person of moderate skill and resources to build their own web page recommendation engine. The domain negates many of the traditional problems with recommendation engines, and well-chosen heuristics help ameliorate other problems. The result is a recommendation engine capable of surprising its users and the start of a solution to a problem open for almost a decade.