
Rewrite Rules: It seems I am, relatively speaking, the master of Apache rewrite rules. If you're setting up NewsBruiser and you want the URLs to look a certain way, send me email letting me know what you want them to look like and I'll come up with some rewrite rules for you to try.

Note that rewriting goes both ways. If you use rewrite rules you also need to create a Python class that rewrites the URLs NewsBruiser generates, so that it outputs the nice URLs that will trigger your rewrite rules. I can help you with this too, if you want.
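To make the "both ways" part concrete, here's a rough sketch of the two directions in Python. Everything in it is made up for illustration: the URL layouts, the `NiceURLRewriter` class, and the `unrewrite` function are hypothetical, not NewsBruiser's real URLs or its real rewriter interface. In real life the inbound direction would be an Apache rewrite rule; it's written as a Python function here only so the round trip can be checked.

```python
import re

# Hypothetical URL layouts, invented for this sketch.
# They are NOT NewsBruiser's actual URL formats.
GENERATED = re.compile(
    r"^/cgi-bin/view\.cgi/(?P<notebook>\w+)/"
    r"(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})$"
)
NICE = re.compile(
    r"^/(?P<notebook>\w+)/(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})$"
)


class NiceURLRewriter:
    """Outbound direction: turn a generated URL into the nice one."""

    def rewrite(self, url):
        match = GENERATED.match(url)
        if match is None:
            return url  # Not a URL we know how to prettify; leave it alone.
        return "/{notebook}/{year}-{month}-{day}".format(**match.groupdict())


def unrewrite(url):
    """Inbound direction: stands in for the web server's rewrite rule."""
    match = NICE.match(url)
    if match is None:
        return url
    return "/cgi-bin/view.cgi/{notebook}/{year}/{month}/{day}".format(
        **match.groupdict()
    )
```

The thing to check is that the two directions are inverses: whatever nice URL the class emits, the rewrite rule has to map back to the URL the software actually answers.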

HTML As She Is Spoke: Let me toot a slightly different horn here, if I may. A while ago I put out a call for a Python library that parsed an HTML file into a DOM-like data structure. I needed something that worked on Python 1.5.2 with no external dependencies, something that either fixed or forcibly parsed bad HTML, something that made screen-scraping easy; yea, something that both sliced and diced.

You may be surprised to learn that none of my faithful readers' helpful suggestions met with the approval of my demanding eye, and so I wrote my own library, which I like to call Beautiful Soup and which I mentioned in passing yesterday. It's the HTML parser that just doesn't care. If you give it perfect HTML, it'll give you a perfect data structure, just like the big-name parsers. But other parsers know too much about HTML. They choke on or try to rewrite bad markup. They assume you care about the whole document. A pirate might make you walk the plank, but only a parser would make you walk the whole tree.

Not so with Beautiful Soup. What you wrote is what you get. If the HTML is horrible, so is the data structure you get--but you get something, and if you're screen-scraping, you don't care about the whole data structure. You're not writing a web browser. You want to grab some data and get out. Beautiful Soup hides all the tree traversal behind a couple of methods that let you slurp up all the links, all the headings of a certain class, the specific span that contains the train schedule time, whatever. It's similar in philosophy to Aaron's xmltramp.
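To show the grab-and-get-out idea without putting words in the library's mouth, here's a small sketch that uses only the standard library's `html.parser`. This is not Beautiful Soup's API; the `LinkGrabber` class, the `grab_links` function, and the sample markup are all assumptions made up for the example. It slurps up every link from some sloppy HTML and never builds or walks a tree.

```python
from html.parser import HTMLParser


class LinkGrabber(HTMLParser):
    """Collects href values from <a> tags and ignores everything else."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so "A" and "a" both land here.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def grab_links(html):
    """Return every link target in the markup, in document order."""
    grabber = LinkGrabber()
    grabber.feed(html)
    grabber.close()
    return grabber.links


# Deliberately bad markup: unclosed tags, mismatched case, an unquoted attribute.
BAD_HTML = (
    '<p>Unclosed <b>bold <a href="/schedule">trains</A> '
    "<td><a href=fares.html>fares</a>"
)

print(grab_links(BAD_HTML))
```

The markup never gets repaired or validated. The parser reports the tags it sees, the grabber keeps the ones it cares about, and the rest goes by unexamined.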

If, like Lucy Ricardo, you got some 'scrapin' to do, give it a try. I love it. But then, I love all my children. The ones I designed to go on rampages no less than the ones where it just happens.

PS: Since Beautiful Soup knows very little about HTML and is based on SGMLParser, you could probably use it on anything that looks like HTML, e.g. XML or your own domain language that has the same structure as HTML but different tags.
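Here's a sketch of why that's plausible: a parser that doesn't know which tags are "real" will happily handle made-up ones. Again, this is not Beautiful Soup itself. It uses the standard library's `html.parser` as a stand-in for SGMLParser, and the `TagTextGrabber` class and the cookbook tags are invented for the example.

```python
from html.parser import HTMLParser


class TagTextGrabber(HTMLParser):
    """Collects the text found inside every occurrence of one tag name."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted.lower()  # The parser lowercases tag names.
        self.depth = 0                # How many wanted tags are currently open.
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == self.wanted:
            if self.depth == 0:
                self.texts.append("")  # Start a fresh entry for this element.
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.wanted and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.texts[-1] += data


# A made-up domain language: HTML-shaped, but none of these are HTML tags.
COOKBOOK = (
    '<cookbook><dish price="4">pea soup</dish>'
    '<dish price="6">alphabet soup</dish></cookbook>'
)

grabber = TagTextGrabber("dish")
grabber.feed(COOKBOOK)
grabber.close()
print(grabber.texts)
```

Nothing in the class mentions HTML, so the same code works on `<dish>` as on `<span>`.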


Unless otherwise noted, all content licensed by Leonard Richardson
under a Creative Commons License.