
I got XML parsing working in Beautiful Soup 4, and then added a feature I've been wanting to add for a while. Instead of separate BeautifulSoup and BeautifulStoneSoup classes[0], in BS4 there's just BeautifulSoup. To get a tree-builder that's optimized for XML, you write BeautifulSoup(markup, "xml"). HTML is the default, but if you want to make it explicit, you write BeautifulSoup(markup, "html").

But this is just the tip of a more general feature. "html" and "xml" are just strings naming features that a tree-builder may or may not advertise support for. The tree-builders also publish other features, like "fast", "permissive", "html5", and library names like "lxml". So you can make semi-fine distinctions:

BeautifulSoup(markup, ["html", "fast"])
BeautifulSoup(markup, ["html", "permissive"])
BeautifulSoup(markup, ["html", "lxml"])

The BS constructor will try to find the best tree-builder that matches all the features you specify, and will raise an exception if it can't match them all (because you don't have lxml installed or something).
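
Here's a rough sketch of what that lookup could look like. This is not the actual BS4 code: the BUILDERS list, find_builder, and NoMatchingBuilder are names made up for illustration, the feature sets are guesses, and "best" is assumed to mean "first match in priority order".

```python
# Hypothetical sketch of feature-based tree-builder lookup; not BS4's real code.

class NoMatchingBuilder(Exception):
    """Raised when no registered builder advertises every requested feature."""


# Each entry is (builder name, advertised features), in priority order.
# The names and feature sets here are illustrative assumptions.
BUILDERS = [
    ("lxml-xml", {"lxml", "xml", "fast"}),
    ("lxml-html", {"lxml", "html", "fast"}),
    ("html5lib", {"html5lib", "html", "permissive", "html5"}),
]


def find_builder(features, builders=BUILDERS):
    """Return the name of the first builder that supports all the features."""
    if isinstance(features, str):
        features = [features]  # allow a bare string like "xml"
    wanted = set(features)
    for name, advertised in builders:
        if wanted <= advertised:  # every requested feature is advertised
            return name
    raise NoMatchingBuilder(
        "no tree-builder supports all of: %s" % sorted(wanted)
    )


print(find_builder("xml"))
print(find_builder(["html", "permissive"]))
```

In this sketch, asking for a combination nobody advertises, like ["xml", "permissive"], raises NoMatchingBuilder instead of quietly falling back to a builder that only matches some of the features.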

This is overkill right now because there are only three tree-builders (["lxml", "xml"], ["lxml", "html"], and ["html5lib"]). But it gives me an easy way to add tree-builders to the code base, and gives you an easy way to plug in additional builders, without making end-users learn where the classes are.
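
To make the plug-in idea concrete, here's a minimal sketch of a registry that a third-party builder could add itself to. Everything in it is hypothetical (TreeBuilderRegistry, register, lookup, and both builder classes), and it assumes the most recently registered builder should win when several match.

```python
# Hypothetical plug-in registry; the class and method names are assumptions.

class TreeBuilderRegistry:
    def __init__(self):
        self._builders = []  # most recently registered first

    def register(self, builder_class):
        """Add a builder class; usable as a class decorator."""
        self._builders.insert(0, builder_class)
        return builder_class

    def lookup(self, *features):
        """Return the newest builder class advertising every feature, or None."""
        wanted = set(features)
        for builder_class in self._builders:
            if wanted <= set(builder_class.features):
                return builder_class
        return None


registry = TreeBuilderRegistry()


@registry.register
class StockHTMLBuilder:
    features = ["html", "permissive"]


# A third party plugs in a builder; end users never import it by name.
@registry.register
class MyFastHTMLBuilder:
    features = ["html", "fast", "myparser"]
```

With both registered, registry.lookup("html") returns MyFastHTMLBuilder because it was registered last, while registry.lookup("html", "permissive") still finds StockHTMLBuilder. Here lookup returns None when nothing matches, on the assumption that the caller (the constructor, say) turns that into an exception.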

This is looking good enough that I can do an alpha release soon. I'm not sure why I've been putting so much work into BS, but I'm sure it has something to do with the fact that my other projects are stalled, blocked, or things I want to procrastinate on.

[0] Those little classes like ICantBelieveItsBeautifulSoup are also gone, because distinguishing between different techniques for parsing markup is now the parser's job. And those classes were kind of silly to begin with.


Unless otherwise noted, all content licensed by Leonard Richardson
under a Creative Commons License.