
A Project for 2011: Beautiful Soup 4: I'm breaking my normal rule of not announcing projects until they're done, because I think it might help some people make plans if they know about this. In 2011 I'll be coming out with a major new version of Beautiful Soup that will work with Python 3, but that won't have the problems of the failed 3.1 branch.

The story so far: the most recent release of Beautiful Soup (3.2) uses a custom parser based on Python's standard-library SGMLParser. This was a really good parser back in 2005. Here in 2011, html5lib is better at handling bad markup, and lxml and ElementTree are much faster if the markup isn't too bad. Beautiful Soup's parser is no longer a competitive advantage.

What's more, SGMLParser goes away in Python 3, and its replacement, HTMLParser, is awful at handling bad markup. I tried to switch over in early 2009 and it just didn't work for anyone. So Beautiful Soup has had the specter of death looming over it for two years.

Beautiful Soup 4 will not be a parser at all. It will be a tree-builder. You will plug a parser into Beautiful Soup, and you'll get an object tree that reflects how that parser sees a document. I have this working reasonably well for lxml and html5lib, which is why I'm comfortable announcing the project now.
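To make the tree-builder idea concrete, here is a minimal sketch using only the Python 3 standard library. It is not Beautiful Soup 4's code or API: the names `Node`, `TreeBuilder`, `StdlibParserAdapter`, and `build_tree` are all hypothetical, and the stdlib `html.parser.HTMLParser` stands in for whichever parser gets plugged in. The parser only reports events; the builder alone decides what the object tree looks like.

```python
from html.parser import HTMLParser


class Node:
    """Hypothetical tree node (not Beautiful Soup's real Tag class)."""

    def __init__(self, name, attrs=None, parent=None):
        self.name = name
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []  # mix of child Nodes and plain text strings

    def text(self):
        # Concatenate all text found beneath this node, in document order.
        return "".join(
            c.text() if isinstance(c, Node) else c for c in self.children
        )


class TreeBuilder:
    """Turns start/end/data events from any parser into a Node tree."""

    def __init__(self):
        self.root = Node("[document]")
        self.current = self.root

    def start(self, name, attrs):
        node = Node(name, attrs, parent=self.current)
        self.current.children.append(node)
        self.current = node

    def end(self, name):
        # Close the nearest open element with this name, implicitly closing
        # anything left open inside it. Stray end tags are ignored.
        node = self.current
        while node is not self.root and node.name != name:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def data(self, text):
        self.current.children.append(text)


class StdlibParserAdapter(HTMLParser):
    """Adapter that forwards the stdlib parser's events to a builder."""

    def __init__(self, builder):
        super().__init__()
        self.builder = builder

    def handle_starttag(self, tag, attrs):
        self.builder.start(tag, attrs)

    def handle_endtag(self, tag):
        self.builder.end(tag)

    def handle_data(self, data):
        self.builder.data(data)


def build_tree(markup, adapter_class=StdlibParserAdapter):
    """Parse markup with the chosen adapter and return the tree root."""
    builder = TreeBuilder()
    parser = adapter_class(builder)
    parser.feed(markup)
    parser.close()
    return builder.root


tree = build_tree("<p>Hello <b>world</p>")
paragraph = tree.children[0]
print(paragraph.name, repr(paragraph.text()))
```

Swapping parsers in this sketch would mean writing another adapter class that calls the same three builder methods. The resulting tree reflects whatever events that parser chooses to emit for bad markup, such as the unclosed `<b>` above.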

Problems this will solve:

Problems this will not solve:

I'll be spending alternate Fridays working on Beautiful Soup 4. I'll probably have a beta release in a few months. I'll be pushing my progress to this branch.

(N.B. My code-in-progress includes some code from html5lib, and I'm not sure how the licenses interact. Since html5lib is MIT-licensed, I think it's just a matter of adding some more boilerplate, so I'm not too worried about it.)


Unless otherwise noted, all content licensed by Leonard Richardson
under a Creative Commons License.