
Beautiful Soup 4 Status Report: Yesterday I ported some more tests and added basic doctype handling to the parser plugins. The work is slowing down a little because I've reached the tests where html5lib and lxml handle the same markup differently, such as incorrectly nested tables. I'm not going to find and test every such difference, but I want to have all the old tests working, and those tests will give you an idea of how the parsers differ in common situations.

In BS3 you could choose to convert incoming entities into Unicode characters, or to leave them alone. You could also choose to convert Microsoft smart quotes into Unicode characters, XML entities, or HTML entities. In BS4 this will depend on the parser. Both lxml and html5lib convert everything to Unicode. I think this makes more sense--convert absolutely everything to Unicode, use Unicode internally, and optionally convert back to entities when writing the document out. (I'll probably have to write the "convert back to entities" part.)
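To make the "Unicode in, entities optionally out" idea concrete, here's a rough sketch using only the Python standard library. It is not Beautiful Soup code, and it's not how lxml or html5lib do it. The function names to_unicode and to_entities are made up for illustration, and the choice to leave ASCII quotes alone on output is just an assumption.

```python
from html import unescape
from html.entities import codepoint2name


def to_unicode(text):
    """Replace entity references in text with the characters they name."""
    return unescape(text)


def to_entities(text):
    """Turn &, <, > and named non-ASCII characters back into HTML entities.

    Hypothetical helper: characters with no named entity pass through as-is.
    """
    pieces = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        if name is not None and (ord(ch) > 127 or ch in "&<>"):
            pieces.append("&" + name + ";")
        else:
            pieces.append(ch)
    return "".join(pieces)


# Incoming entities become ordinary Unicode characters.
internal = to_unicode("caf&eacute; &amp; cr&egrave;me")

# Microsoft smart quotes are just Windows-1252 bytes until they're decoded.
smart = b"\x93quoted\x94".decode("windows-1252")

# On the way out, convert back to entities only if you want them.
written = to_entities(internal)
written_smart = to_entities(smart)
```

With everything held as Unicode in the middle, the entity question only comes up once, at output time, instead of being a setting the parser has to carry around.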



Unless otherwise noted, all content licensed by Leonard Richardson
under a Creative Commons License.