Hopefully this will be useful to the maintainers of libraries like lxml and html5lib who currently jump through hoops to make their parsers generate a Beautiful Soup parse tree. Now you should be able to just maintain a TreeBuilder implementation. The tree builder has a very simple interface, so take a look and make sure it does what you need. I'll be writing an html5lib tree builder and packaging it and the lxml builder in Beautiful Soup for a while, but I think long-term the TreeBuilders should live with their parent projects.
Tomorrow I'll be figuring out how to package this, and trying to come up with a compatibility suite that will ensure your tree builder reacts sanely to different trees and to different Beautiful Soup setups, like SoupStrainers.
Thu Apr 09 2009 22:08:
Check out the trunk of Beautiful Soup and you'll see the future. I've created a simple interface that lets any parser build a Beautiful Soup tree. There are large built-in builders that encapsulate the old HTMLParser logic with their lists of nestable tags, and there's now a very small builder that delegates everything to lxml. In a simple performance test the lxml builder was about twice as fast as the HTMLParser builder.
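To give a rough idea of what "any parser builds the tree" means, here's a minimal sketch. It is not the actual interface in trunk: `Node`, `TreeSink`, `HTMLParserBuilder` and `build_tree` are all hypothetical names made up for illustration, and the only real library used is Python's standard `html.parser`. The assumption is that a builder's whole job is to translate its parser's events into three calls (start tag, end tag, text), while the tree-assembling logic lives in one shared place.

```python
from html.parser import HTMLParser


class Node:
    """A bare-bones tree node, standing in for a Beautiful Soup Tag (hypothetical)."""

    def __init__(self, name, attrs=None, parent=None):
        self.name = name
        self.attrs = dict(attrs or {})
        self.parent = parent
        self.children = []  # child Nodes and text strings, in document order


class TreeSink:
    """Hypothetical receiver of parse events; this is what assembles the tree."""

    def __init__(self):
        self.root = Node("[document]")
        self.current = self.root

    def start_tag(self, name, attrs):
        node = Node(name, attrs, parent=self.current)
        self.current.children.append(node)
        self.current = node

    def end_tag(self, name):
        # Walk up to the nearest open tag with this name; ignore stray end tags.
        node = self.current
        while node is not self.root and node.name != name:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def text(self, data):
        self.current.children.append(data)


class HTMLParserBuilder(HTMLParser):
    """Hypothetical builder: forwards stdlib HTMLParser events to a TreeSink."""

    def __init__(self, sink):
        super().__init__()
        self.sink = sink

    def handle_starttag(self, tag, attrs):
        self.sink.start_tag(tag, attrs)

    def handle_endtag(self, tag):
        self.sink.end_tag(tag)

    def handle_data(self, data):
        self.sink.text(data)


def build_tree(markup, builder_class=HTMLParserBuilder):
    """Parse markup with the given builder class and return the root Node."""
    sink = TreeSink()
    builder = builder_class(sink)
    builder.feed(markup)
    builder.close()
    return sink.root


root = build_tree('<p class="a">Hello <b>world</b></p>')
paragraph = root.children[0]
print(paragraph.name, paragraph.attrs, paragraph.children[0])
```

In this sketch, swapping parsers would mean writing another small class that makes the same three calls on the sink and passing it as `builder_class`; nothing else has to change. That's the shape of the idea, under the assumptions above, and the real builders have more to deal with (nesting rules, encodings and so on).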
