Check out the trunk of Beautiful Soup and you'll see the future. I've created a simple interface that lets any parser build a Beautiful Soup tree. There are large built-in builders that encapsulate the old HTMLParser logic with their lists of nestable tags, and there's now a very small builder that delegates everything to lxml. In a simple performance test, the lxml builder was about twice as fast as the HTMLParser builder.

Hopefully this will be useful to the maintainers of libraries like lxml and html5lib who currently jump through hoops to make their parsers generate a Beautiful Soup parse tree. Now you should be able to just maintain a TreeBuilder implementation. The tree builder has a very simple interface, so take a look and make sure it does what you need. I'll be writing an html5lib tree builder and packaging it and the lxml builder in Beautiful Soup for a while, but I think long-term the TreeBuilders should live with their parent projects.
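
To make the idea concrete, here is a minimal sketch of what a parser-agnostic builder interface could look like. It is not the code in the Beautiful Soup trunk: `Node`, `Tree`, `TreeBuilder`, `HTMLParserTreeBuilder`, and `build` are all hypothetical names invented for illustration, and the only real API used is `html.parser.HTMLParser` from the Python standard library. The point is the shape of the design: the tree object exposes a few event methods, and a builder's only job is to translate its parser's events into calls on them.

```python
from html.parser import HTMLParser


class Node:
    """Bare-bones tree node (hypothetical stand-in for a Beautiful Soup Tag)."""

    def __init__(self, name, attrs=None, parent=None):
        self.name = name
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []  # child Nodes and text strings, in document order


class Tree:
    """Receives parse events and assembles nodes (hypothetical stand-in for the soup)."""

    def __init__(self):
        self.root = Node("[document]")
        self.current = self.root

    def start_tag(self, name, attrs):
        node = Node(name, attrs, parent=self.current)
        self.current.children.append(node)
        self.current = node

    def end_tag(self, name):
        # Walk up to the nearest open tag with this name; ignore stray end tags.
        node = self.current
        while node is not self.root and node.name != name:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def text(self, data):
        self.current.children.append(data)


class TreeBuilder:
    """Assumed interface: a builder feeds markup to some parser and reports events."""

    def feed(self, markup, tree):
        raise NotImplementedError


class HTMLParserTreeBuilder(TreeBuilder):
    """Builder that delegates parsing to the standard library's HTMLParser."""

    def feed(self, markup, tree):
        class _Adapter(HTMLParser):
            def handle_starttag(self, tag, attrs):
                tree.start_tag(tag, attrs)

            def handle_endtag(self, tag):
                tree.end_tag(tag)

            def handle_data(self, data):
                tree.text(data)

        parser = _Adapter()
        parser.feed(markup)
        parser.close()


def build(markup, builder):
    """Build a tree from markup using whichever builder is supplied."""
    tree = Tree()
    builder.feed(markup, tree)
    return tree.root


root = build("<p>Hello <b>world</b></p>", HTMLParserTreeBuilder())
```

Under this assumed design, a builder backed by another parser would be a second small `TreeBuilder` subclass that forwards that parser's events to the same three methods. The sketch deliberately leaves out the hard parts, such as void elements like `<br>` and the nestable-tag rules mentioned above.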

Tomorrow I'll be figuring out how to package this and trying to come up with a compatibility suite that will ensure your tree builder reacts sanely to different trees and different Beautiful Soup setups like SoupStrainers.


Unless otherwise noted, all content licensed by Leonard Richardson
under a Creative Commons License.