A Project for 2011: Beautiful Soup 4: I'm breaking my normal rule of not announcing projects until they're done, because I think it might help some people make plans if they know about this. In 2011 I'll be coming out with a major new version of Beautiful Soup that will work with Python 3, but that won't have the problems of the failed 3.1 branch.

The story so far: the most recent release of Beautiful Soup (3.2) uses a custom parser based on Python's standard-library SGMLParser. This was a really good parser back in 2005. Here in 2011, html5lib is better at handling bad markup, and lxml and ElementTree are much faster if the markup isn't too bad. Beautiful Soup's parser is no longer a competitive advantage.

What's more, SGMLParser goes away in Python 3, and its replacement is awful at handling bad markup. I tried to switch over in early 2009 and it just didn't work for anyone. So, Beautiful Soup has had the specter of death looming over it for two years.

Beautiful Soup 4 will not be a parser at all. It will be a tree-builder. You will plug a parser into Beautiful Soup, and you'll get an object tree that reflects how that parser sees a document. I have this working reasonably well for lxml and html5lib, which is why I'm comfortable announcing the project now.
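To make the tree-builder idea concrete, here is a minimal sketch, not Beautiful Soup 4's actual code. Every name in it (`Node`, `HTMLParserBuilder`, `ElementTreeBuilder`, `make_tree`) is hypothetical, and it substitutes two standard-library parsers for lxml and html5lib so that it runs without extra installs. Each builder turns its parser's view of the document into the same kind of object tree.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class Node:
    """Parser-independent tree node (hypothetical, for illustration only)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []  # Node instances or plain strings

    def append(self, child):
        self.children.append(child)

    def outline(self):
        """Return a nested-tuple summary of the tree, handy for comparisons."""
        return (self.name, [c.outline() if isinstance(c, Node) else c
                            for c in self.children])


class HTMLParserBuilder(HTMLParser):
    """Builds a Node tree from the events emitted by the stdlib html.parser."""

    def build(self, markup):
        self.reset()
        self.root = self.current = Node("[document]")
        self.feed(markup)
        self.close()
        return self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, parent=self.current)
        self.current.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Naive assumption: every end tag closes the current node.
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.append(data)


class ElementTreeBuilder:
    """Builds a Node tree from an ElementTree parse (strict, XML-only)."""

    def build(self, markup):
        root = Node("[document]")
        self._convert(ET.fromstring(markup), root)
        return root

    def _convert(self, element, parent):
        node = Node(element.tag, parent=parent)
        parent.append(node)
        if element.text:
            node.append(element.text)
        for child in element:
            self._convert(child, node)
            if child.tail:
                node.append(child.tail)  # text following the child element


def make_tree(markup, builder):
    """Plug in a builder; get back the tree as that parser sees it."""
    return builder.build(markup)


good = "<p>Hello <b>world</b>!</p>"
print(make_tree(good, HTMLParserBuilder()).outline())
print(make_tree(good, ElementTreeBuilder()).outline())
```

On well-formed markup the two builders should produce the same outline. On bad markup such as `<p>one<p>two` they diverge: the ElementTree builder raises `ParseError`, while the naive html.parser builder nests the second paragraph inside the first. That divergence is the point: the tree reflects whatever the plugged-in parser decided.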

Problems this will solve:

Problems this will not solve:

I'll be spending alternate Fridays working on Beautiful Soup 4. I'll probably have a beta release in a few months. I'll be pushing my progress to this branch.

(nb. My code-in-progress includes some code from html5lib, and I'm not sure how the licenses interact, but since html5lib is MIT licensed I think it's just a matter of adding some more boilerplate, so I'm not too worried about it.)

Hello, weblog. I am writing in you. I just finished "Four Kinds of Cargo", my first post-novel short story, so now I feel like I can write some other stuff. The story still needs to pass the Sumana "does this plot make basic sense" test, but I'm optimistic.

Beautiful Soup work has mostly focused on trying to get easy_install to work, working around a Launchpad bug--I'm still not totally sure if it's working. Last Friday I was in Dallas for work, so tomorrow will be a BS development day.

I seem to really like writing fiction about small businesses--"Mallory", "Awesome Dinosaurs", Constellation Games, and this new story all revolve around business partnerships. I'm just doing my part to boost the economy--did you know that stories about small businesses are responsible for most of this country's stories about job growth?

I'm still reading that space-astronomy book and I'm saving up more interesting bits to tell you. I'm also reading the first volume of Mark Twain's autobiography and it's awesome—he's just dictating random stories as they occur to him, and, like Groucho Marx, he's always in comedy mode. I don't think the printing of this autobiography will create many new famous Mark Twain quotes, if only because Mark Twain never said the existing "famous Mark Twain quotes". But this cubical advertisement for e-books is to me as America: the Book was to Sumana: every few minutes I'm disturbing her with some new weird kind of laugh.

Beautiful Soup 4 Status Report: The port is going pretty well. I've got almost all of the old test suite modernized and ported to the new framework, meaning that BS4 now works about as well as BS3.

There are a couple things I haven't figured out yet, the main one being the API for starting up a BeautifulSoup object with a given tree-builder. I'd like it to be as simple as passing a string like 'lxml' into the constructor, but I haven't thought through the details yet.
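One way the string-in-the-constructor idea could work is a registry that maps names to tree-builder classes. This is only a sketch under assumptions, since the API is undecided: `BUILDERS`, `register`, `Soup`, and `tag_names` are all made-up names, and the builders wrap standard-library parsers as stand-ins so the example runs without lxml or html5lib installed.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

BUILDERS = {}  # hypothetical registry: name -> tree-builder class


def register(name):
    """Class decorator that files a builder under a short string name."""
    def decorator(cls):
        BUILDERS[name] = cls
        return cls
    return decorator


@register("html.parser")
class StdlibHTMLBuilder(HTMLParser):
    """Stand-in builder: records start-tag names in document order."""

    def tag_names(self, markup):
        self.names = []
        self.reset()
        self.feed(markup)
        self.close()
        return self.names

    def handle_starttag(self, tag, attrs):
        self.names.append(tag)


@register("etree")
class ElementTreeBuilder:
    """Stand-in builder backed by the stdlib ElementTree (XML only)."""

    def tag_names(self, markup):
        return [el.tag for el in ET.fromstring(markup).iter()]


class Soup:
    """Accepts either a registered builder name or a builder instance."""

    def __init__(self, markup, builder="html.parser"):
        if isinstance(builder, str):
            try:
                builder = BUILDERS[builder]()
            except KeyError:
                known = ", ".join(sorted(BUILDERS))
                raise ValueError(
                    f"No tree-builder registered as {builder!r} "
                    f"(known: {known})") from None
        self.builder = builder
        self.tags = builder.tag_names(markup)


print(Soup("<a><b></b></a>").tags)           # default builder
print(Soup("<a><b/></a>", "etree").tags)     # builder chosen by name
```

Accepting either a string or an instance keeps the common case short while leaving room for a pre-configured builder. An unknown name fails immediately with a message listing what is registered, rather than failing later in the middle of a parse.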

A story I helped critique in writing group has been published: "Showoff" by L.K. Herndon. It's a story about an alien wereflamingo. If that doesn't sell you, you can't be sold.


