
Beautiful Soup Progress: I spent some time today trying to get BS in shape to run under Python 3. Here's the branch I'm working on.

sgmllib doesn't exist in Python 3, so I switched to HTMLParser, which has gotten a lot better at parsing bad HTML. With my hacks in place, only 3 of my unit tests pass under sgmllib but fail under HTMLParser. That's acceptable, because the switch also builds part of the framework that will let you write a plugin for lxml, html5lib (not as slow as I'd thought), or some other parser. Eventually I'll get rid of the HTMLParser plugin, or at least strip it down so that it doesn't know anything about HTML, which will make my life easier.
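
To make the plugin idea concrete, here is a minimal sketch of what such a parser boundary could look like. It is not the code in the branch: the names EventCollector, HTMLParserBuilder, and parse_with are made up for illustration, and the only real API used is the standard library parser, which lives in html.parser under Python 3.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Records start tags, end tags, and text as simple tuples."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


class HTMLParserBuilder:
    """Hypothetical plugin that wraps the standard library parser."""

    def parse(self, markup):
        collector = EventCollector()
        collector.feed(markup)
        collector.close()  # flush any buffered text
        return collector.events


def parse_with(builder, markup):
    """Front end that relies only on the builder's parse() method."""
    return builder.parse(markup)


# Deliberately bad HTML: the <b> tag is never closed.
events = parse_with(HTMLParserBuilder(), "<p>Hello <b>world</p>")
```

Under this assumption, a plugin for lxml or html5lib would be another class with the same parse() method, and the front end would not need to change.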

What's left are some minor syntax problems and some huge problems with the way strings work in Python 3 as they go in and out of encodings. At this point I need to stop hacking on BS and run some experiments until I have a good understanding of the string changes.
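
A small experiment of the kind meant here, as a sketch only. The sample string and variable names are made up for illustration. It shows the core Python 3 change: encoded data is bytes, decoded text is str, and the two no longer mix implicitly.

```python
# Encoded data, as it might arrive from a file or the network.
raw = "caf\u00e9".encode("utf-8")
assert isinstance(raw, bytes)

# Decoding turns bytes into text (str, which is Unicode in Python 3).
text = raw.decode("utf-8")
assert isinstance(text, str)

# The same text has different lengths as characters and as UTF-8 bytes.
char_count = len(text)
byte_count = len(raw)

# Mixing str and bytes raises TypeError instead of coercing silently.
try:
    text + raw
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

# Re-encoding to another encoding yields different bytes for the same text.
latin1 = text.encode("latin-1")
```

Python 2 would often coerce between the two string types without complaint, so code that decodes input and encodes output in the wrong places can appear to work there and fail outright under Python 3.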


Unless otherwise noted, all content licensed by Leonard Richardson
under a Creative Commons License.