Beautiful Soup 4.1.0 and detwingle()

Due to the contingencies of fate I get asked a lot of questions about bad HTML. Recently I noticed a problem cropping up which I haven't seen discussed much: documents with mixed encodings. This is typically a document that claims to be UTF-8, and mostly is UTF-8, but which contains byte strings that only make sense according to some other encoding, usually Windows-1252.
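
Here's a minimal sketch of the problem, using only the Python standard library. The sample text and variable names are made up for illustration; the point is that no single codec decodes the whole byte string cleanly.

```python
# A mostly-UTF-8 document with Windows-1252 smart quotes pasted in.
# The sample strings are invented for illustration.
utf8_part = "café ".encode("utf-8")                       # b'caf\xc3\xa9 '
cp1252_part = "\u201cquoted\u201d".encode("windows-1252")  # b'\x93quoted\x94'
mixed = utf8_part + cp1252_part

# Strict UTF-8 decoding fails on the smart-quote bytes.
try:
    mixed.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lenient UTF-8 decoding keeps the accent but loses the quotes.
lossy = mixed.decode("utf-8", errors="replace")

# Windows-1252 decoding keeps the quotes but mangles the accent.
as_cp1252 = mixed.decode("windows-1252")

print(strict_ok)
print(lossy)
print(as_cp1252)
```

Treating the document as UTF-8 throws away the quotes, and treating it as Windows-1252 turns every multi-byte UTF-8 character into mojibake.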

I'll stop beating around the bush: sometimes otherwise UTF-8 documents contain Microsoft smart quotes. This isn't terribly common, but when it happens there's been no easy way to convert that document to Unicode... until now. Beautiful Soup 4.1.0, released today, adds the method UnicodeDammit.detwingle(). This method converts a mixed UTF-8/Windows-1252 document to pure UTF-8, allowing you to run it through BeautifulSoup() or UnicodeDammit() and get Unicode.
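
A minimal sketch of detwingle() in use, assuming Beautiful Soup 4.1.0 or later is installed and importable as bs4. The snowman document is an invented sample, not real-world data.

```python
from bs4 import UnicodeDammit

# Build a mixed document: UTF-8 snowmen followed by a
# Windows-1252 smart-quoted phrase (sample data for illustration).
snowmen = "\N{SNOWMAN}" * 3
quote = "\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}"
doc = snowmen.encode("utf8") + quote.encode("windows_1252")

# detwingle() takes bytes and returns bytes that are pure UTF-8.
new_doc = UnicodeDammit.detwingle(doc)

# The result now decodes cleanly as UTF-8.
print(new_doc.decode("utf8"))
```

Note that detwingle() returns a byte string, not Unicode, so the call belongs before the document is handed to BeautifulSoup() or UnicodeDammit().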

I'll let the documentation give the details. In theory I can expand detwingle() to handle other pairs of encodings, but UTF-8/Windows-1252 is the only one currently supported. I'm imagining adding support for other popular encoding pairs, maybe EUC-JP + Shift JIS. But I'm not imagining writing that code, just incorporating patches from other people.

If you're ever in this situation, try it out and let me know how it works.

Beautiful Soup 4.1.0 also includes a bunch of medium-level bug fixes, and a major refactoring of the search code that will hopefully have no effect whatsoever on the way searches work.

Unless otherwise noted, all content licensed by Leonard Richardson
under a Creative Commons License.