Decklin Foster, who seems to have some connection to the Chicago cabal, has written roux. This is a screen-scraper-to-RSS-feed gateway that uses Beautiful Soup, much like Scrape 'N' Feed. But where S'N'F gives you the page and has you write a screen-scraper, roux has you define some suspiciously regexp-looking things that magically reach into the soup and extract the things you want. As Vic Fontaine would say, crazy! Decklin, how does it work?
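
For anyone wondering what "regexp-looking things that reach into the soup" could look like, here is a rough sketch of the general idea. It is not roux's actual recipe syntax, which isn't shown here, and it uses the standard library's html.parser instead of Beautiful Soup so that it runs on its own. The names PatternScraper and scrape are made up for illustration.

```python
import re
from html.parser import HTMLParser


class PatternScraper(HTMLParser):
    """Collect the text of every <tag> whose class attribute matches a regex.

    Hypothetical helper for illustration; not part of roux or Beautiful Soup.
    """

    def __init__(self, tag, class_pattern):
        super().__init__()
        self.tag = tag
        self.class_re = re.compile(class_pattern)
        self.depth = 0      # > 0 while inside a matching element
        self.current = []   # text chunks of the element being collected
        self.items = []     # finished, whitespace-normalized strings

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Count nested tags of the same name so we stop at the right close.
            if tag == self.tag:
                self.depth += 1
        elif tag == self.tag and self.class_re.search(dict(attrs).get("class") or ""):
            self.depth = 1
            self.current = []

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.items.append(" ".join("".join(self.current).split()))

    def handle_data(self, data):
        if self.depth:
            self.current.append(data)


def scrape(html, tag, class_pattern):
    """Return the text of each matching element, in document order."""
    parser = PatternScraper(tag, class_pattern)
    parser.feed(html)
    parser.close()
    return parser.items


page = (
    '<div class="entry-title">First <b>post</b></div>'
    '<div class="sidebar">nav</div>'
    '<div class="entry-body">Hello</div>'
)
print(scrape(page, "div", r"^entry-"))  # ['First post', 'Hello']
```

The point of the design is that the "recipe" is just data (a tag name plus a pattern), so adding a new site means writing a new pattern, not a new scraper function.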


Posted by Decklin Foster at Wed Jun 29 2005 14:43


More often than not, it doesn't work :)

It's really a big ol' hack, and I wanted to make use of it before I really had time to make it nice, so I threw the regex stuff in mostly as a cheap way to port over a bunch of my old custom-hacky screen-scraping recipes, mostly for, er, this (note the ones with the file:// URIs).

Getting it to really be What I Want(TM) will require some deep thought and probably making some improvements to BeautifulSoup. I haven't even read 2.0 yet... someone had to tell me it was out so I could update the Debian package. In fact I sent you a short email about a different way of approaching the nested-tags-in-the-tag-stack problem a while ago, but I don't even know if you looked at it.

(Yeah, I have big lack-of-hacking-time issues. Too much personal stuff on my plate.)

It's cool someone noticed though! Maybe one day it'll be more than marginally useful.

Posted by Decklin Foster at Wed Jun 29 2005 14:48

(Also, I really want to provide a full-text version of this that doesn't break every week).


Unless otherwise noted, all content licensed by Leonard Richardson
under a Creative Commons License.