Beautiful Soup Future: I've got a chunk of time off at the end of the year, not having used it earlier in the year. Among my other projects I'm going to redo Beautiful Soup. This entry is an early spelling out of my rationales and my plans.

Earlier this year I quietly retired Rubyful Soup because I think _why's hpricot does a better job of being a Rubyish screen-scraping parser than RS can be. But nothing similar has happened in Python, mainly because BS is the market leader. I want to keep that going, but I also want to take advantage of the work that's gone on in this field since 2004.

So, what are the useful features of Beautiful Soup?

  1. It can build an object model out of bad HTML.
  2. It can build an object model out of bad XML, if you tell it the rules of your XML vocabulary. (This is just the general case of #1.)
  3. It can convert almost any encoding into Unicode, usually even when the explicit encoding marker is missing or incorrect.
  4. It exposes a useful API. It's easy to learn, more Pythonic than CSS selectors or XPath, and it includes most common ways of traversing the tree.

Of these, the only one I really care about is #4. If I could rid myself of the need to handle all the edge cases in #1-#3, many of which have outstripped my ability to solve them with my current tools and sanity, I'd be happy.
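To give a feel for the kind of edge case hiding in #3, here's a minimal sketch of the "guess the encoding" problem. It is not Beautiful Soup's actual code: the `to_unicode` helper, its parameters, and the choice of fallback encodings are all made up for illustration.

```python
import codecs

# Hypothetical helper, not Beautiful Soup's real implementation.
# Byte-order marks are checked first because they are the strongest hint.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def to_unicode(data, declared=None, fallbacks=("utf-8", "windows-1252")):
    """Return (text, encoding_used) for a byte string of unknown encoding."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding), encoding
    # Try the declared encoding first, but don't trust it: it may be
    # wrong, or not even the name of a real encoding.
    candidates = ([declared] if declared else []) + list(fallbacks)
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue
    # Last resort: latin-1 maps every byte to a code point, so it never fails.
    return data.decode("latin-1"), "latin-1"


# A page that claims to be UTF-8 but is really Windows-1252.
text, used = to_unicode("caf\u00e9".encode("windows-1252"), declared="utf-8")
```

Even this toy version has to deal with wrong declarations and bogus encoding names, and it still can't tell apart encodings that happen to decode without error. That's the sort of thing I'd rather not own.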

Fortunately, there's html5lib, which is supposed to be as good at parsing HTML as a web browser.

My current plan is to write something that sits on top of html5lib and gives the BS API to whatever DOM you've built. This would take care of #1 and #3. It's not clear to me how you tell html5lib the rules of your XML vocabulary; maybe it only parses valid XML. But BS is relatively rarely used to parse invalid XML. If I could outsource all the HTML and Unicode crap to html5lib, I'd be much more inclined to hack randomly on BS, so I think it's a fair trade.
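Roughly, the layering idea might look like the sketch below. Everything here is an assumption for illustration: `SoupView` is a made-up name, the method names are only BS-flavored, and the tree comes from the standard library's `xml.etree.ElementTree` parsing well-formed markup, standing in for whatever DOM a real HTML parser would hand over.

```python
import xml.etree.ElementTree as ET


class SoupView:
    """Hypothetical wrapper giving a small BS-flavored API to an ElementTree node."""

    def __init__(self, element):
        self.element = element

    @property
    def name(self):
        return self.element.tag

    @property
    def string(self):
        return self.element.text

    def __getitem__(self, attr):
        # Supports tag["href"] style attribute access.
        return self.element.attrib[attr]

    def find_all(self, name=None, **attrs):
        # Walk descendants in document order, filtering on tag name
        # and on exact attribute values.
        matches = []
        for el in self.element.iter():
            if el is self.element:
                continue  # iter() yields the node itself first; skip it
            if name is not None and el.tag != name:
                continue
            if any(el.get(key) != value for key, value in attrs.items()):
                continue
            matches.append(SoupView(el))
        return matches

    def find(self, name=None, **attrs):
        found = self.find_all(name, **attrs)
        return found[0] if found else None


# A well-formed stand-in document; in the real plan the parser supplies the tree.
root = ET.fromstring(
    '<html><body><a href="/one" id="first">One</a>'
    '<p>Text <a href="/two">Two</a></p></body></html>'
)
soup = SoupView(root)
links = [a["href"] for a in soup.find_all("a")]
first = soup.find("a", id="first")
```

The point of the sketch is that the wrapper never touches markup or bytes. It only knows how to walk a tree somebody else built, which is the part I actually want to work on.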

html5lib already has a "beautifulsoup" tree builder, which creates a tree of Beautiful Soup objects. So in theory I would just need to maintain those objects? I'll find out soon enough.


