
Beautiful Soup Future: I've got a chunk of unused time off at the end of the year. Among my other projects, I'm going to redo Beautiful Soup. This entry is an early attempt to spell out my rationale and my plans.

Earlier this year I quietly retired Rubyful Soup because I think _why's hpricot does a better job of being a Rubyish screen-scraping parser than RS can be. But nothing similar has happened in Python, mainly because BS is the market leader. I want to keep that going, but I also want to take advantage of the work that's gone on in this field since 2004.

So, what are the useful features of Beautiful Soup?

  1. It can build an object model out of bad HTML.
  2. It can build an object model out of bad XML, if you tell it the rules of your XML vocabulary. (This is just the general case of #1.)
  3. It can convert almost any encoding into Unicode, usually even in the absence of an explicit encoding marker or in the presence of an incorrect one.
  4. It exposes a useful API. It's easy to learn, more Pythonic than CSS selectors or XPath, and it includes most common ways of traversing the tree.

Of these, the only one I really care about is #4. If I could rid myself of the need to handle all the edge cases in #1-#3, many of which have outstripped my ability to solve them with my current tools and sanity, I'd be happy.
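To give a sense of why #3 breeds edge cases, here is a minimal sketch of encoding guessing. It is not Beautiful Soup's actual logic: the function name `to_unicode`, the `declared` parameter, and the fallback order are assumptions made up for illustration.

```python
def to_unicode(data, declared=None, fallbacks=("utf-8", "windows-1252")):
    """Decode bytes by trying the declared encoding first, then fallbacks.

    Hypothetical helper: returns (text, encoding_that_worked).
    """
    candidates = ([declared] if declared else []) + list(fallbacks)
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding; LookupError covers bogus codec names.
            continue
    # Nothing decoded cleanly, so substitute replacement characters.
    return data.decode("utf-8", errors="replace"), "utf-8"


# A page that claims to be ASCII but is really UTF-8.
text, used = to_unicode("café".encode("utf-8"), declared="ascii")
```

Even this toy version has to cope with a marker that is missing, wrong, or not a real codec name, and a first successful decode is not necessarily the right one.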

Fortunately, there's html5lib, which is supposed to be as good at parsing HTML as a web browser.

My current plan is to write something that goes on top of html5lib and gives the BS API to whatever DOM you've built. This would take care of #1 and #3. It's not clear to me how you tell html5lib the rules of your XML vocabulary; maybe it only parses valid XML. But BS is rarely used to parse invalid XML, and if I could outsource all the HTML and Unicode crap to html5lib I'd be much more inclined to hack on BS, so I think it's a fair trade.
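The idea of layering the API on top of an existing tree might look something like the sketch below. It makes several assumptions: the standard library's ElementTree stands in for whatever tree the parser builds, the class name `SoupView` is hypothetical, and only a small BS-like subset (`find`, `find_all`, `text`, attribute lookup) is shown.

```python
import xml.etree.ElementTree as ET


class SoupView:
    """Hypothetical wrapper exposing a small BS-like API over an ElementTree node."""

    def __init__(self, element):
        self.element = element

    @property
    def name(self):
        return self.element.tag

    @property
    def text(self):
        # All text inside this element, in document order.
        return "".join(self.element.itertext())

    def __getitem__(self, attr):
        # tag["href"]-style attribute access.
        return self.element.attrib[attr]

    def find_all(self, name=None, **attrs):
        """Return descendants matching the tag name and attribute values."""
        matches = []
        for el in self.element.iter(name):  # iter(None) walks every element
            if el is self.element:
                continue  # search descendants only, not the node itself
            if all(el.get(key) == value for key, value in attrs.items()):
                matches.append(SoupView(el))
        return matches

    def find(self, name=None, **attrs):
        """Return the first match, or None if nothing matches."""
        found = self.find_all(name, **attrs)
        return found[0] if found else None


# Stand-in for a tree that some parser has already built.
tree = ET.fromstring(
    '<html><body><p class="a">Hi <b>there</b></p><p>Bye</p></body></html>'
)
soup = SoupView(tree)
paragraphs = [p.text for p in soup.find_all("p")]
```

The wrapper never touches markup, only a finished tree, which is the point: the parsing problems stay with the parser.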

html5lib already has a "beautifulsoup" tree builder, which creates a tree of Beautiful Soup objects. So in theory I would just need to maintain those objects? I'll find out soon enough.

Comments:

Posted by Ian Bicking at Wed Dec 10 2008 22:40

In my tests html5lib was 3x slower than BS, which is significant. Relying on some possible future C API would detract from the pure-Python nature of BS, which is a big advantage for it. Of course, probably the Python parser could be optimized a great deal, e.g., to shortcut some of its more careful parsing when you can quickly determine it isn't necessary. Though that's kind of counter to its reference implementation status.

Posted by Leonard at Thu Dec 11 2008 09:06

My experience with bad-HTML parsers has been "fast, careful, written in Python; pick two." I suspect BS is faster than html5lib because BS's parser is mostly concerned with tag nesting rules.

Someone else is doing parallel BS work, and I don't know whether he wants it made public. Given that, I'm starting to think it might be a good idea to write a textual standard for the BS API, and then write a shell that can have different parsers plugged into it. Hopefully it would be something that can coexist with XPath and CSS selector implementations.
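A rough sketch of what such a shell could look like, using only the standard library. Everything here is an assumption for illustration: the registry `BUILDERS`, the `register_builder` decorator, the `parse` entry point, and both builders are hypothetical, and ElementTree stands in for the shared tree format.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

BUILDERS = {}  # hypothetical registry: name -> function(markup) -> ET.Element


def register_builder(name):
    def decorator(func):
        BUILDERS[name] = func
        return func
    return decorator


@register_builder("strict")
def build_strict(markup):
    # Careful but unforgiving: raises ET.ParseError on malformed markup.
    return ET.fromstring(markup)


class _LenientBuilder(HTMLParser):
    """Only tracks tag nesting; knows nothing about real HTML rules."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        attrib = {key: value or "" for key, value in attrs}
        self.stack.append(ET.SubElement(self.stack[-1], tag, attrib))

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        current = self.stack[-1]
        if len(current):  # text after a child goes in that child's tail
            current[-1].tail = (current[-1].tail or "") + data
        else:
            current.text = (current.text or "") + data


@register_builder("lenient")
def build_lenient(markup):
    builder = _LenientBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


def parse(markup, builder="lenient"):
    """Shell entry point: choose a parser by name, get the same tree type back."""
    return BUILDERS[builder](markup)


root = parse("<p>one<b>two</p><p>three", builder="lenient")
```

Because every builder returns the same kind of tree, one API layer could sit on top of any of them, and callers could trade speed for care by changing a single argument.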



Unless otherwise noted, all content licensed by Leonard Richardson
under a Creative Commons License.