HTML As She Is Spoke: Let me toot a slightly different horn here, if I may. A while ago I put out a call for a Python library that parsed an HTML file into a DOM-like data structure. I needed something that worked on Python 1.5.2 with no external dependencies, something that either fixed or forcibly parsed bad HTML, something that made screen-scraping easy; yea, something that both sliced and diced.

You may be surprised to learn that none of my faithful readers' helpful suggestions met with the approval of my demanding eye, and so I wrote my own library, which I like to call Beautiful Soup and which I mentioned in passing yesterday. It's the HTML parser that just doesn't care. If you give it perfect HTML, it'll give you a perfect data structure, just like the big-name parsers. But other parsers know too much about HTML. They choke on or try to rewrite bad markup. They assume you care about the whole document. A pirate might make you walk the plank, but only a parser would make you walk the whole tree.

Not so with Beautiful Soup. What you wrote is what you get. If the HTML is horrible, so is the data structure you get--but you get something, and if you're screen-scraping, you don't care about the whole data structure. You're not writing a web browser. You want to grab some data and get out. Beautiful Soup hides all the tree traversal behind a couple of methods that let you slurp up all the links, all the headings of a certain class, the specific span that contains the train schedule time, whatever. It's similar in philosophy to Aaron's xmltramp.
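
To make the "grab some data and get out" idea concrete, here is a minimal sketch in modern Python. It is not Beautiful Soup's API: it assumes Python 3, where sgmllib no longer exists, so the standard library's html.parser stands in, and LinkGrabber and grab_links are hypothetical names invented for the illustration. The point is only that the caller asks for the links and never walks a tree, even when the markup is a mess.

```python
from html.parser import HTMLParser


class LinkGrabber(HTMLParser):
    """Hypothetical helper: collects href values and ignores everything else."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs supplied by HTMLParser.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def grab_links(html):
    """Return every link target in the markup, in document order."""
    grabber = LinkGrabber()
    grabber.feed(html)
    grabber.close()
    return grabber.links


# Deliberately bad markup: unclosed tags and an unquoted attribute value.
messy = '<p><b>Unclosed bold <a href="/trains">Trains<p><a href=bus.html>Bus</a>'
print(grab_links(messy))
```

Nothing here repairs the document or complains about it; the parser reports the tags it sees and the helper keeps only the ones it was asked for.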

If, like Lucy Ricardo, you got some 'scrapin' to do, give it a try. I love it. But then, I love all my children. The ones I designed to go on rampages no less than the ones where it just happens.

PS: Since Beautiful Soup knows very little about HTML, and it's based on SGMLParser, you could probably use it on anything that looks like HTML, e.g. XML, or your own domain language that has the same structure as HTML but different tags.

Comments:

Posted by Bob Ippolito at Fri May 21 2004 09:27

Why did you need Python 1.5.2 support?

Posted by Leonard at Fri May 21 2004 09:32

I'm using it in my weblog software, NewsBruiser (http://newsbruiser.tigris.org), which I try really hard to keep 1.5.2-compatible because a lot of web hosting outfits don't have Python v2.

Posted by Paul Heymann at Fri May 21 2004 11:46

Thanks, this just saved me from having to write a few ugly regular expressions. One bug though: presumably if you feed in data that doesn't have any tags in it, you should get the data back, rather than "".

Posted by Leonard at Fri May 21 2004 11:49

Good catch. I'll fix that tonight.

Posted by Leonard at Fri May 21 2004 14:06

Actually, the fix is pretty simple, and obvious once I thought about it. We don't have any code to deal with the last piece of text, because all our code is run by callbacks triggered by the presence of a tag. Overriding SGMLParser.feed to handle the last piece of data will fix it.
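
A minimal sketch of that idea, not the actual Beautiful Soup patch. It assumes Python 3, where sgmllib is gone, so html.parser's HTMLParser stands in for SGMLParser. TextChunkParser, chunks, _pending and _flush are hypothetical names made up for the illustration. Text is buffered and only stored when a tag callback fires, which reproduces the bug; the feed override then flushes whatever is left after the last tag.

```python
from html.parser import HTMLParser


class TextChunkParser(HTMLParser):
    """Hypothetical stand-in for a parser that stores text only when a tag arrives."""

    def __init__(self):
        super().__init__()
        self.chunks = []     # tags and text, in document order
        self._pending = []   # text seen since the last tag

    def _flush(self):
        # Move any buffered text into the result list.
        if self._pending:
            self.chunks.append("".join(self._pending))
            self._pending = []

    def handle_starttag(self, tag, attrs):
        self._flush()
        self.chunks.append("<%s>" % tag)

    def handle_endtag(self, tag):
        self._flush()
        self.chunks.append("</%s>" % tag)

    def handle_data(self, data):
        # Text is only buffered here; without the feed override below,
        # text after the last tag (or text with no tags) would be lost.
        self._pending.append(data)

    def feed(self, data):
        # Assumes the whole document is passed in a single call.
        super().feed(data)
        self.close()   # make the parser hand over any text it is still holding
        self._flush()  # store the last piece of text


parser = TextChunkParser()
parser.feed("no tags here")
print(parser.chunks)
```

Because feed calls close, this sketch expects the whole document in one call; an incremental parser would do the final flush in close instead.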


Unless otherwise noted, all content licensed by Leonard Richardson
under a Creative Commons License.