
Beautiful Soup 4 Status Report: What an exciting weblog I run. Actually, this update is pretty cool. I've ported all the non-XML tests for BS4, which means you should now be able to use the code for all HTML processing purposes. If you want to try it, note that the module is now called 'beautifulsoup', not 'BeautifulSoup'. I may rename it to bs4, if only because I've spent the past six years typing "from BeautifulSoup import BeautifulSoup" and I'm tired of it.
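
So the import changes like this:

# The old way (BS3):
from BeautifulSoup import BeautifulSoup

# The new way (BS4, for now):
from beautifulsoup import BeautifulSoup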

I also decided this would be a good time to run a performance test. Here's a moderately sized document:

Document is 66409 bytes
BS4 lxml time: 0.06
BS4 html5lib time: 0.25
BS3 time: 0.15

("BS3" here is the latest released version, 3.2.0.)

Pretty good! And here's a huge, complicated document:

Document is 1329825 bytes
BS4 lxml time: 12.60
BS4 html5lib time: 2.88
BS3 time: 14.11

Okay, that's kind of random. The problem is in Unicode, Dammit. It takes a long time to figure out the encoding for this particular page. This is ultimately because the document is in ISO-8859-2, but it includes a <meta> tag that claims the document is in UTF-8. I don't yet understand the problem on any deeper level than that. If it's gonna be like that, I may just stop believing anything I see in a <meta> tag.
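
(You can exercise the detection step on its own. A minimal sketch, assuming Unicode, Dammit is still exported from the top-level package the way BS3 exports it, and that BS3's originalEncoding attribute got the PEP-8 treatment:)

from beautifulsoup import UnicodeDammit  # assuming a top-level export, as in BS3

data = open('document.html', 'rb').read()  # stand-in name for the slow ISO-8859-2 page
dammit = UnicodeDammit(data)
print('Detected encoding: %s' % dammit.original_encoding)  # assuming PEP-8 rename of originalEncoding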

If you specify the encoding up front, the lxml time drops to 0.90 seconds. The html5lib tree builder doesn't have the problem because it uses html5lib's native Unicode conversion functionality instead of Unicode, Dammit.
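(BS3 spells the keyword fromEncoding; I'm assuming the PEP-8 spelling for BS4 here:)

from beautifulsoup import BeautifulSoup  # module name as of this snapshot

data = open('document.html', 'rb').read()  # the same slow page
soup = BeautifulSoup(data, 'lxml', from_encoding='iso-8859-2')  # no detection pass needed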

Incidentally, BS3 has the same problem. Specify the encoding up front, and BS3 takes about 2 seconds on this page, which makes sense--faster than html5lib but slower than lxml. I find it very annoying that I'm only discovering this problem now--I think this has wasted a lot of cumulative time over the past couple of years.
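
(If you want to reproduce these timings, the harness is nothing fancy--a minimal sketch, assuming the in-progress module name from above and that the BS4 constructor takes the tree builder's name as its second argument:)

import time
from beautifulsoup import BeautifulSoup          # BS4, module name as of this snapshot
from BeautifulSoup import BeautifulSoup as BS3   # the released 3.2.0

data = open('document.html', 'rb').read()  # hypothetical stand-in for the test page
print('Document is %d bytes' % len(data))

for builder in ('lxml', 'html5lib'):
    start = time.time()
    BeautifulSoup(data, builder)  # assuming the builder name goes here
    print('BS4 %s time: %.2f' % (builder, time.time() - start))

start = time.time()
BS3(data)
print('BS3 time: %.2f' % (time.time() - start))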

Anyway, now is a good time to start trying out BS4, if you're a fan of new things. I renamed all the major methods to be PEP-8 compliant--details are in the CHANGELOG.
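
For instance, the rename everybody will hit first:

from beautifulsoup import BeautifulSoup  # module name as of this snapshot

soup = BeautifulSoup('<a href="http://example.com/">link</a>')
links = soup.find_all('a')  # was soup.findAll('a') in BS3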

Update: The profiler shows the bottleneck is in the chardet library, specifically sbcharsetprober.py, which goes through the file character by character and crunches some numbers for each one. If it can't make a decision until late in a huge file, there's your twelve seconds. So... I have a couple of ideas, but it's not a bug in my code that I can just fix. Then again, html5lib also uses chardet without hitting this slowdown, so it must be doable.
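
(For reference, the profiling itself is a one-liner; something along these lines, reusing the stand-in filename from above:)

import cProfile
from beautifulsoup import BeautifulSoup  # module name as of this snapshot

data = open('document.html', 'rb').read()  # hypothetical stand-in for the test page
cProfile.runctx("BeautifulSoup(data, 'lxml')", globals(), locals(),
                sort='cumulative')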
