
Earlier I ran some speed/accuracy tests of Beautiful Soup driven by various parsers. Python's built-in HTMLParser scored very poorly, parsing only 52% (Python 2.7.1) or 57% (3.2.2) of my test pages without raising an exception. Well, Ezio Melotti, the maintainer of HTMLParser, has been working for a while on improving HTMLParser's handling of bad HTML. Most of this code is in Python 3.2.2, so I should have been getting the benefit, but it wasn't working for me because of a semi-related bug in HTMLParser, which is fixed in the as-yet-unreleased 3.2.3.

After talking with Ezio today, I was able to monkeypatch BS4 to avoid the bug in 3.2.2. This means that on Python 3, BS4 with no external parser installed will give reliability comparable to BS4+lxml (98% versus 99%). It's still a good deal slower, though: it parses about 1300 kb of HTML per second, versus 2100 kb/second for BS4+lxml, which is roughly 40% less throughput.
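
For anyone who wants to try this kind of comparison themselves, here's roughly how the two numbers (fraction of pages parsed without an exception, and kb parsed per second) can be measured. This is a minimal sketch, not my actual test harness: `measure` and `parse_with_stdlib` are names made up for illustration, and it assumes Python 3 and that your test pages are already loaded as strings.

```python
import time
from html.parser import HTMLParser


def parse_with_stdlib(markup):
    """Feed one document through the standard library's HTMLParser."""
    parser = HTMLParser()
    parser.feed(markup)
    parser.close()


def measure(parse, pages):
    """Return (fraction parsed without an exception, kb parsed per second).

    `parse` is any callable that takes a markup string; `pages` is a list
    of markup strings. Only successfully parsed pages count toward the
    throughput figure.
    """
    if not pages:
        return 0.0, 0.0
    successes = 0
    parsed_bytes = 0
    start = time.perf_counter()
    for markup in pages:
        try:
            parse(markup)
        except Exception:
            # A page that makes the parser blow up counts as a failure.
            continue
        successes += 1
        parsed_bytes += len(markup.encode("utf-8"))
    elapsed = time.perf_counter() - start
    rate = parsed_bytes / 1024 / elapsed if elapsed > 0 else 0.0
    return successes / len(pages), rate
```

To compare Beautiful Soup configurations, you'd pass something like `lambda m: BeautifulSoup(m, "html.parser")` or `lambda m: BeautifulSoup(m, "lxml")` as `parse`, assuming bs4 (and lxml, for the second one) is installed.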


Unless otherwise noted, all content licensed by Leonard Richardson
under a Creative Commons License.