
Beautiful Soup 4 Benchmark: This is going to go into the Beautiful Soup 4 documentation, but you might find it interesting. It's my first legitimate benchmark of BS4, and the first benchmark of this stuff I've seen since Ian Bicking's excellent 2008 benchmark.

Ezio Melotti pointed me to a list of the top 10,000 domains worldwide, according to some random source. It looked legit, so I wrote a script to download the homepages of the top 200 domains as served to a desktop web browser. My dataset included many pages written in Chinese, Japanese, Russian, Portuguese, Polish, and German.
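The download step is conceptually simple; it looked something like this sketch (not the actual script, and the User-Agent string and file names are placeholders):

    import urllib.request

    # Pose as a desktop browser so each site serves its desktop homepage.
    DESKTOP_UA = ("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.7 "
                  "(KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7")

    def fetch_homepage(domain, timeout=30):
        request = urllib.request.Request(
            "http://%s/" % domain, headers={"User-Agent": DESKTOP_UA})
        return urllib.request.urlopen(request, timeout=timeout).read()

    # domains.txt holds the top 200 domains, one per line.
    with open("domains.txt") as f:
        for domain in (line.strip() for line in f if line.strip()):
            try:
                with open(domain + ".html", "wb") as out:
                    out.write(fetch_homepage(domain))
            except Exception:
                pass  # Skip domains that time out or refuse to talk.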

For every parser I was interested in, I parsed each homepage and timed the parse. This gave me 200 numbers for every parser. To reduce that to a single non-huge number I calculated a mean: how many kilobytes of real-world HTML the parser could process in a second. I also noted each parser's success rate: how many of the 200 homepages it had handled without raising an exception.
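The timing harness is equally simple. Roughly this (a sketch, where parse_with stands in for whatever call the parser under test needs):

    import glob
    import time

    def benchmark(parse_with, pattern="*.html"):
        """Return (mean KB/s, success rate) for one parser."""
        filenames = glob.glob(pattern)
        total_kb = 0.0
        total_time = 0.0
        successes = 0
        for filename in filenames:
            with open(filename, "rb") as f:
                data = f.read()
            start = time.time()
            try:
                parse_with(data)
            except Exception:
                continue  # This page counts against the success rate.
            total_time += time.time() - start
            total_kb += len(data) / 1024.0
            successes += 1
        return total_kb / total_time, float(successes) / len(filenames)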

Here are the results, ordered by their performance under Python 2.7.

                                            Python 2.7                Python 3.2
Parser                              Speed (KB/s)  Success rate  Speed (KB/s)  Success rate
Beautiful Soup 3.2 (SGMLParser)              211          100%             -             -
html5lib (BS3 treebuilder)                   253           99%             -             -
Beautiful Soup 4.0 + lxml                    255          100%          2140           96%
html5lib (lxml treebuilder)                  270           99%             -             -
Beautiful Soup 4.0 + html5lib                271           98%             -             -
Beautiful Soup 4.0 + HTMLParser              299           59%          1705           57%
html5lib (simpletree treebuilder)            332          100%             -             -
HTMLParser                                  5194           52%          3918           57%
lxml                                       17925          100%         14258           96%

Note that the "HTMLParser" tests don't actually produce anything. HTMLParser is an event-based parser, so when the HTML is parsed, nothing comes out because I didn't include any handler code. All the other tests build a parse tree in memory.
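If that sounds odd, here's what the bare-HTMLParser test amounts to (Python 3 spelling; on Python 2 the module is called HTMLParser):

    from html.parser import HTMLParser

    class DoNothingParser(HTMLParser):
        # No handle_starttag, handle_data, etc., so every event the
        # parser fires just disappears.
        pass

    parser = DoNothingParser()
    parser.feed("<p>Hello, <b>world</b>!</p>")  # Parses fine; yields nothing.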

Another thing to keep in mind about the html5lib results: html5lib is kind of the opposite of BS4. BS4 always builds a tree of Beautiful Soup objects, but you can tell it to generate that tree using html5lib, lxml, or HTMLParser. html5lib, on the other hand, always uses its own parser, but you can tell it to build a tree of lxml objects, a tree of BS3 objects, and so on.
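In code, the two kinds of pluggability look like this (treebuilder names have shifted between html5lib releases, so take the strings as illustrative):

    from bs4 import BeautifulSoup
    import html5lib

    markup = "<p>Hello</p>"

    # BS4: always a tree of Beautiful Soup objects; the parser is pluggable.
    soup = BeautifulSoup(markup, "lxml")  # or "html5lib", or "html.parser"

    # html5lib: always its own parser; the tree it builds is pluggable.
    tree = html5lib.parse(markup, treebuilder="lxml")  # or "dom", "etree"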

The big surprise for me is that on Python 2.7, lxml is the worst choice for a parser to drive BS4. It's a worse choice than html5lib! How did that happen? I have no idea. I was hoping to cash in on the lxml magic (see below), and it's not working. I need to look into this. Notice that html5lib takes a performance hit from using lxml's treebuilder. If the magic's not in the treebuilder and it's not in the parser, where is it?

Unless I can find that magic and exploit it, it remains the case that if you're paying by the minute for computer time, you should use lxml. It's written in C, and on Python 2.7 it builds a parse tree sixty times faster than BS4, three times faster than a pure-Python parser that does absolutely nothing with the data. Even on Python 3, lxml alone is seven times faster than BS4+lxml. I said stuff like this in the BS3 documentation, but I think I need to be more forceful about it in the BS4 docs.
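For reference, the fast path is short; you give up the Beautiful Soup API and work with lxml's ElementTree-style objects instead (the filename here is hypothetical):

    import lxml.html

    with open("page.html", "rb") as f:
        root = lxml.html.fromstring(f.read())
    print(root.findtext(".//title"))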

The good news is that Beautiful Soup is 6-8 times faster on Python 3 than it is on Python 2, and even at its slowest, BS4 is noticeably faster than BS3.

The big caveat is that my definition of "success" is pretty minimal. Just because the parser parsed the file without crashing doesn't mean it will give you a useful parse tree.

Another caveat: on Python 3, I couldn't get HTMLParser to take raw bytes as input, so I ran the data through UnicodeDammit first. I counted this time as part of the parse time. This probably explains HTMLParser's slower speed on Python 3 and its higher success rate.
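The pre-pass looked roughly like this (a sketch; the filename is hypothetical):

    from bs4 import UnicodeDammit

    with open("page.html", "rb") as f:
        raw = f.read()
    dammit = UnicodeDammit(raw)
    markup = dammit.unicode_markup       # decoded text HTMLParser will accept
    print(dammit.original_encoding)      # the encoding it guessed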

Update: Argh, I found out about this a year ago. The problem is that Unicode, Dammit is incredibly slow in some cases. Here are the results if I take out the prepare_markup methods in the builders for HTMLParser and lxml, and just assume everything's UTF-8:

                                            Python 2.7                Python 3.2
Parser                              Speed (KB/s)  Success rate  Speed (KB/s)  Success rate
Beautiful Soup 4.0 + lxml                   2287           96%          2600           96%
Beautiful Soup 4.0 + HTMLParser             2069           48%          1680           57%

That's more like it! The problem is that reliability suffers. Both parsers crash in the 4% of cases where the data's not UTF-8 and the real encoding is declared in a <meta> tag. And there's an unknown number of cases where the data's not UTF-8 but the conversion doesn't crash, leaving garbled data. But at least now I remember this problem.
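At the application level, the same trade-off looks like this: hand Beautiful Soup a Unicode string and the detection machinery never has anything to do. (A sketch of the shortcut, not the actual prepare_markup change; the filename is hypothetical.)

    from bs4 import BeautifulSoup

    with open("page.html", "rb") as f:
        raw = f.read()
    # Fast but fragile: raises UnicodeDecodeError on most non-UTF-8 pages,
    # and silently garbles the rest.
    soup = BeautifulSoup(raw.decode("utf-8"), "lxml")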

Also note that on Python 3.2, getting rid of Unicode, Dammit doesn't matter nearly as much. (It doesn't matter for HTMLParser at all.) Presumably Python 3.2 has better built-in support for encoding autodetection.


