Beautiful Soup Future: I've got a chunk of unused time off at the end of the year. Among my other projects, I'm going to redo Beautiful Soup. This entry is an early attempt to spell out my rationale and my plans.

Earlier this year I quietly retired Rubyful Soup because I think _why's hpricot does a better job of being a Rubyish screen-scraping parser than RS can be. But nothing similar has happened in Python, mainly because BS is the market leader. I want to keep that going, but I also want to take advantage of the work that's gone on in this field since 2004.

So, what are the useful features of Beautiful Soup?

  1. It can build an object model out of bad HTML.
  2. It can build an object model out of bad XML, if you tell it the rules of your XML vocabulary. (This is just the general case of #1.)
  3. It can convert almost any encoding into Unicode, usually in the absence of an explicit encoding marker or the presence of an incorrect one.
  4. It exposes a useful API. It's easy to learn, more Pythonic than CSS selectors or XPath, and it includes most common ways of traversing the tree.
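
To make feature #3 concrete, here is a minimal sketch of the general "try candidates in order" idea behind encoding detection. It is not Beautiful Soup's actual algorithm; the function name `to_unicode`, the `declared` parameter, and the fallback list are all assumptions made for illustration.

```python
import codecs

# Byte-order marks, longest first so UTF-32 is tried before UTF-16
# (the UTF-32 LE mark begins with the UTF-16 LE mark).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def to_unicode(data, declared=None, fallbacks=("utf-8", "windows-1252")):
    """Hypothetical helper: return (text, encoding_used) for a byte string.

    `declared` is whatever encoding the document claims, which may be
    missing, wrong, or not even a real codec name.
    """
    candidates = [enc for bom, enc in _BOMS if data.startswith(bom)]
    if declared:
        candidates.append(declared)
    candidates.extend(fallbacks)
    for enc in candidates:
        try:
            return data.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: move on to the next one.
            continue
    # Last resort: latin-1 maps every byte to a code point, so it never fails.
    return data.decode("latin-1"), "latin-1"
```

A document that declares itself ASCII but actually contains windows-1252 bytes would fall through the first two candidates and be decoded by the third. Real detectors also look inside the markup for a declared encoding and use statistical guessing, which this sketch leaves out.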

Of these, the only one I really care about is #4. If I could rid myself of the need to handle all the edge cases in #1-#3, edge cases that have often outstripped my ability to solve them with my current tools and sanity, I'd be happy.

Fortunately, there's html5lib, which is supposed to be as good at parsing HTML as a web browser.

My current plan is to write something that goes on top of html5lib and gives the BS API to whatever DOM you've built. This would take care of #1 and #3. It's not clear to me how you tell html5lib the rules of your XML vocabulary; maybe it only parses valid XML. But BS is relatively rarely used to parse invalid XML, and if I could outsource all the HTML and Unicode crap to html5lib I'd be much more inclined to hack randomly on BS. I think it's a fair trade.
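
As a rough illustration of what "the BS API on top of someone else's DOM" could look like, here is a small sketch. It uses the standard library's `xml.dom.minidom` as a stand-in for a tree built by another parser, so it only accepts well-formed input. The `SoupNode` class and its `find`/`find_all` methods are hypothetical, loosely modelled on Beautiful Soup's `find`/`findAll`, and are not code from either library.

```python
from xml.dom import minidom


class SoupNode:
    """Hypothetical wrapper giving Beautiful Soup-flavoured access to a DOM element."""

    def __init__(self, element):
        self._el = element

    @property
    def name(self):
        return self._el.tagName

    def __getitem__(self, attr):
        # node["href"] style attribute access
        if not self._el.hasAttribute(attr):
            raise KeyError(attr)
        return self._el.getAttribute(attr)

    def _descendants(self):
        # Depth-first walk in document order, yielding element nodes only.
        stack = list(reversed(self._el.childNodes))
        while stack:
            node = stack.pop()
            if node.nodeType == node.ELEMENT_NODE:
                yield node
                stack.extend(reversed(node.childNodes))

    def find_all(self, name=None, attrs=None):
        attrs = attrs or {}
        return [
            SoupNode(el)
            for el in self._descendants()
            if (name is None or el.tagName == name)
            and all(el.getAttribute(k) == v for k, v in attrs.items())
        ]

    def find(self, name=None, attrs=None):
        matches = self.find_all(name, attrs)
        return matches[0] if matches else None

    @property
    def text(self):
        # Concatenate every descendant text node in document order.
        parts = []
        stack = [self._el]
        while stack:
            node = stack.pop()
            if node.nodeType == node.TEXT_NODE:
                parts.append(node.data)
            else:
                stack.extend(reversed(node.childNodes))
        return "".join(parts)


doc = minidom.parseString(
    '<html><body><a href="/a">one</a>'
    '<p><a href="/b" class="x">two</a></p></body></html>'
)
soup = SoupNode(doc.documentElement)
hrefs = [a["href"] for a in soup.find_all("a")]
```

The point of the design is that the wrapper never parses anything itself: all the bad-markup and encoding work stays with whichever parser built the tree.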

html5lib already has a "beautifulsoup" tree builder, which creates a tree of Beautiful Soup objects. So in theory I would just need to maintain those objects? I'll find out soon enough.

Comments:

Posted by jgraham at Wed Dec 10 2008 06:07

html5lib does have a BeautifulSoup API, but it currently depends on Beautiful Soup itself being available. If you want to reimplement BeautifulSoup on top of html5lib, you'll need to implement the underlying Node objects separately from the parser and write an html5lib treebuilder that works with the BS nodes.

However, there are a couple of caveats you should consider when switching to html5lib. First, it is pretty slow. It is designed to be HTML5 compliant (although it needs some updates to match the current spec...), which means that we end up doing a lot of things carefully rather than fast. Hopefully we will eventually replace much of the core with C code so that it will be fast and correct. Second, it is pretty difficult to use html5lib for anything other than HTML. There is a "liberal xml parser" which applies some generic fixups, but it is pretty unloved and doesn't have any system for implementing per-vocabulary rules.

Anyway, once you get started on this, you should be able to find several html5lib types in #whatwg on irc.freenode.net should you have any questions or need repository access (there is also a mailing list of course).

Posted by Ian Bicking at Wed Dec 10 2008 22:40

In my tests html5lib was 3x slower than BS, which is significant. Relying on some possible future C API would detract from the pure-Python nature of BS, which is a big advantage for it. Of course, the Python parser could probably be optimized a great deal, e.g., by shortcutting some of its more careful parsing when you can quickly determine it isn't necessary. Though that's kind of counter to its reference implementation status.

Posted by Leonard at Thu Dec 11 2008 09:06

My experience with bad-HTML parsers has been "fast, careful, written in Python; pick two." I suspect BS is faster than html5lib because BS's parser is mostly concerned with tag nesting rules.
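
For readers unfamiliar with what a parser "mostly concerned with tag nesting rules" does, here is a minimal sketch built on the standard library's `html.parser`. The rule tables (`NON_NESTABLE`, `BOUNDARIES`, `VOID`) are made-up examples and are not Beautiful Soup's real tables; the sketch only shows the general technique of implicitly closing a tag when the same tag opens again.

```python
from html.parser import HTMLParser

# Hypothetical rule tables, for illustration only.
NON_NESTABLE = {"p", "li", "td", "tr", "option"}  # may not contain themselves
BOUNDARIES = {"ul", "ol", "table"}                # stop the search for an open tag here
VOID = {"br", "hr", "img", "input", "meta", "link"}  # never have an end tag


class NestingParser(HTMLParser):
    """Emit balanced start/end events using simple nesting rules."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.events = []

    def _find_open(self, name, boundaries=()):
        # Index of the innermost open `name`, or None if a boundary comes first.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == name:
                return i
            if self.open_tags[i] in boundaries:
                return None
        return None

    def _close_to(self, index):
        # Close every open tag from the innermost down to `index`.
        while len(self.open_tags) > index:
            self.events.append(("end", self.open_tags.pop()))

    def handle_starttag(self, name, attrs):
        if name in NON_NESTABLE:
            index = self._find_open(name, BOUNDARIES)
            if index is not None:
                self._close_to(index)  # e.g. a new <li> closes the previous <li>
        self.events.append(("start", name))
        if name in VOID:
            self.events.append(("end", name))
        else:
            self.open_tags.append(name)

    def handle_endtag(self, name):
        index = self._find_open(name)
        if index is not None:  # stray end tags are ignored
            self._close_to(index)

    def handle_data(self, data):
        self.events.append(("data", data))

    def close(self):
        super().close()       # flush any buffered text first
        self._close_to(0)     # then close whatever is still open


def parse_events(markup):
    parser = NestingParser()
    parser.feed(markup)
    parser.close()
    return parser.events
```

Feeding it `<p>one<p>two` yields two sibling paragraphs instead of one nested inside the other. Everything the parser knows about HTML lives in the three small tables, which is why this style is cheap but cannot match browser behaviour on harder cases.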

There is other, parallel BS work going on, and I don't know whether the person doing it wants it made public. Given that, I'm starting to think it might be a good idea to write a textual standard for the BS API, and then write a shell that can have different parsers plugged into it. Hopefully it would be something that can coexist with XPath and CSS selector implementations.
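
One way to picture the "shell with pluggable parsers" idea: the shell owns the node type and the navigation API, and each parser is just a function that turns markup into that node type. The sketch below uses only the standard library; the names `Tag`, `BUILDERS`, and `make_soup` are hypothetical and do not come from any Beautiful Soup release.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

VOID = {"br", "hr", "img", "input", "meta", "link"}


@dataclass
class Tag:
    """The shell's own node type; every builder must produce these."""
    name: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)  # Tag or str items

    def find_all(self, name):
        found = []
        for child in self.children:
            if isinstance(child, Tag):
                if child.name == name:
                    found.append(child)
                found.extend(child.find_all(name))
        return found


class _LenientBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Tag("[document]")
        self.stack = [self.root]

    def handle_starttag(self, name, attrs):
        tag = Tag(name, dict(attrs))
        self.stack[-1].children.append(tag)
        if name not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, name):
        # Close back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].name == name:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].children.append(data)


def build_lenient(markup):
    builder = _LenientBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


def build_strict_xml(markup):
    def convert(el):
        tag = Tag(el.tag, dict(el.attrib))
        if el.text:
            tag.children.append(el.text)
        for child in el:
            tag.children.append(convert(child))
            if child.tail:
                tag.children.append(child.tail)
        return tag

    # ET.fromstring raises ET.ParseError on malformed input.
    return Tag("[document]", children=[convert(ET.fromstring(markup))])


# The plug-in point: any callable mapping markup to a Tag tree can be added.
BUILDERS = {"lenient": build_lenient, "strict-xml": build_strict_xml}


def make_soup(markup, builder="lenient"):
    return BUILDERS[builder](markup)
```

With this split, `make_soup("<p>one<p>two <b>bold")` works with the lenient builder, while the same string passed with `builder="strict-xml"` raises a parse error. The navigation code in `Tag` never changes, and a written standard for the API would amount to a specification of what `Tag` must offer.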

Posted by Philip Taylor at Thu Dec 18 2008 09:57

I'm not aware of any fundamental reason for html5lib to be significantly slower than any other pure-Python angle-bracketed-markup parser. Much of the time is spent tokenising text and tag names and constructing the document tree, which any parser will have to do.

I added some simple local optimisations to html5lib in the past couple of days and made it around 20% faster (when using the HTML5 spec as input), so there seems to be plenty of scope for improving performance. I'd guess there are more expensive abstractions in the implementation than in most other parsers, since it's written to mirror the way the HTML5 spec describes the algorithm, but those could be improved. Copying ideas from faster parsers would be good.
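
To check for yourself how the time divides between tokenising and tree construction, a small harness like the following can help. It is a sketch that uses the standard library's `html.parser` as a stand-in rather than html5lib, and the class and function names are invented for the example, so the numbers it produces say nothing about html5lib itself.

```python
import timeit
from html.parser import HTMLParser


class TokenizeOnly(HTMLParser):
    """Tokenises the input but ignores every event (the default handlers do nothing)."""


class TokenizeAndBuild(HTMLParser):
    """Tokenises and also builds a minimal tree of (name, children) tuples."""

    def __init__(self):
        super().__init__()
        self.root = ("[document]", [])
        self.stack = [self.root]

    def handle_starttag(self, name, attrs):
        node = (name, [])
        self.stack[-1][1].append(node)
        self.stack.append(node)

    def handle_endtag(self, name):
        # Assumes well-formed input: every end tag closes the innermost open tag.
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        self.stack[-1][1].append(data)


def time_parser(parser_cls, markup, repeats=5):
    """Best-of-`repeats` wall-clock time in seconds for one full parse."""
    def run():
        parser = parser_cls()
        parser.feed(markup)
        parser.close()
    return min(timeit.repeat(run, number=1, repeat=repeats))


def compare(markup):
    return {
        "tokenize_only": time_parser(TokenizeOnly, markup),
        "tokenize_and_build": time_parser(TokenizeAndBuild, markup),
    }


if __name__ == "__main__":
    sample = "<div><p>Hello <b>world</b></p></div>" * 2000
    for label, seconds in compare(sample).items():
        print(f"{label}: {seconds:.4f}s")
```

Taking the minimum over several runs reduces noise from other processes. For finding hot spots inside a single parser, a profiler such as `cProfile` would be the more informative tool.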

But for now, html5lib performance is still a bit of a pain.


Unless otherwise noted, all content licensed by Leonard Richardson
under a Creative Commons License.