RSS feed construction helper: This is kind of a neat tool I wrote, but I'm not sure what to call it. It's a wrapper around the famous PyRSS2Gen library. It provides a simple pickle-based backend store for state relating to an RSS feed that's screen-scraped from a web page. Some of the state it stores is redundant with the actual content of the RSS feed, but some of it is contextual information like "when was the first time a screen-scrape attempt found the item with this guid?" (I am too tired to explain why this information is useful, but trust me, it quite often is.)
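A minimal sketch of what that pickle-backed state store might look like. The class and method names here are my assumptions, not the real module's API; the point is just the first-seen-timestamp-per-guid bookkeeping:

```python
import os
import pickle
import time


class FeedState:
    """Hypothetical sketch of a pickle-backed store that remembers,
    for each item guid, the first time a scrape ever saw that item."""

    def __init__(self, path):
        self.path = path
        if os.path.exists(path):
            with open(path, "rb") as f:
                self.first_seen = pickle.load(f)
        else:
            self.first_seen = {}  # guid -> timestamp of first sighting

    def note(self, guid):
        """Record a sighting of a guid. The first sighting sets the
        timestamp; later sightings leave it alone and return it."""
        self.first_seen.setdefault(guid, time.time())
        return self.first_seen[guid]

    def save(self):
        with open(self.path, "wb") as f:
            pickle.dump(self.first_seen, f)
```

That first-seen timestamp is what lets you assign a sensible date to items on pages that don't carry dates of their own.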

The tool provides convenience functions for fetching a new version of the web page, and does the good-citizen ETag and Last-Modified thing, so all you have to do is write a hook method that scrapes the web page into a bunch of RSS items and adds them to the feed. As items go on the top of the feed, older ones automatically drop off the end.
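For concreteness, here's a sketch of the two behaviors just described: a conditional GET that sends If-None-Match and If-Modified-Since, and the new-items-on-top, old-items-off-the-end feed update. Function names, the dict-shaped items, and the size cap are my assumptions (the real module presumably works with PyRSS2Gen RSSItem objects):

```python
import urllib.request
import urllib.error

MAX_ITEMS = 20  # arbitrary cap for illustration


def fetch_if_changed(url, etag=None, last_modified=None):
    """Good-citizen conditional GET. Returns (body, etag, last_modified);
    body is None when the server answers 304 Not Modified."""
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    try:
        with urllib.request.urlopen(req) as resp:
            return (resp.read(),
                    resp.headers.get("ETag"),
                    resp.headers.get("Last-Modified"))
    except urllib.error.HTTPError as e:
        if e.code == 304:  # not modified: keep the old validators
            return None, etag, last_modified
        raise


def add_items(feed_items, new_items):
    """New items go on top of the feed; anything past MAX_ITEMS drops
    off the end. Items the feed already has (same guid) are skipped."""
    known = {item["guid"] for item in feed_items}
    fresh = [item for item in new_items if item["guid"] not in known]
    return (fresh + feed_items)[:MAX_ITEMS]
```

The guid check in `add_items` is also what keeps a scraper from putting the same items into the feed over and over again.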

In conjunction with Beautiful Soup, this makes it incredibly easy for me to screen-scrape a web page into an RSS feed and have it keep working over time. Up to this point I've been trying to run the various Syndication Automat feeds out of NewsBruiser notebooks. It's a clever idea, but cleverness is just about all it's got going for it. It's clunky and awkward, and there are some cases I just can't handle, such as the Dover page, where the items I want to RSSify don't have any dates on them. (That's why the contextual information mentioned earlier in this entry is useful, BTW.)

I thought any alternative to using NewsBruiser would be a lot of backend work, but it's not. My module's about 150 lines of code; here it is in a temporary location until I come up with a real name for it: RSSHelper.py. Here's a sample script that uses it and Beautiful Soup and ASCII, Dammit (actually the "HTML, Dammit" subset) to make a Dover automat feed.
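To give a flavor of what the scrape hook in such a script does, here's a toy stand-in built on the standard library's HTMLParser instead of Beautiful Soup (so it's self-contained). It collects every link on a page as a candidate feed item; the dict keys and the idea of using the URL as the guid are my assumptions, not the real script's:

```python
from html.parser import HTMLParser


class LinkScraper(HTMLParser):
    """Toy scrape hook: turns every <a href> on a page into a
    candidate feed item, using the link text as the title."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.items.append({"guid": self._href,
                               "title": "".join(self._text).strip(),
                               "link": self._href})
            self._href = None


def scrape(html):
    parser = LinkScraper()
    parser.feed(html)
    return parser.items
```

A real hook would be pickier about which links it keeps, but the shape is the same: page in, list of items out, and the helper module handles the rest.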

I originally spent an hour of work, ending in failure, trying to get a NewsBruiser notebook-backed feed for the Dover site. It took about five minutes to write the script linked above. And the new script actually works, instead of putting the same books into the feed over and over again.

I've got the same feeling with this as I did when I stopped writing custom parsers for screen-scraping and started using Beautiful Soup. Is this as cool as I think it is? Do things like this already exist? What should I call it?

PS: Danny deserves a midwife credit since I came up with this and wrote it while working on my second hack for the Life Hacks book.

Unless otherwise noted, all content licensed by Leonard Richardson
under a Creative Commons License.