
Beautiful Soup 3.0 Beta: For your delectability. The major new feature is that Beautiful Soup 3.0 takes XML or HTML documents in any encoding and turns them into UTF-8; in most cases you don't have to know the current encoding. I wrote this without really knowing anything about encodings: most of the code is stolen from Mark Pilgrim's Universal Feed Parser. But I am able to write tests, and the tests work.

The other major new feature is that you can now rip out a chunk from the parse tree with the extract method. You can use the chunk and abandon the rest of the tree, or vice versa. This is especially useful because the data structures you abandoned can now be garbage-collected: in current Beautiful Soup, the whole tree stays in memory forever because every Tag and NavigableText is connected to every other Tag and NavigableText through an intricate web of lies. And by "lies", I mean "instance variables".
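
To see why cutting a chunk loose lets the rest be collected, here is a minimal sketch of a tree whose nodes hold references to both parent and children. The Node class and its attribute names are hypothetical and are not Beautiful Soup's actual internals; a real parse tree has more links to sever. The point is only that extract has to break the references in both directions.

```python
class Node:
    """Hypothetical tree node; not Beautiful Soup's actual Tag class."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []

    def append(self, child):
        """Attach child under this node and return it."""
        child.parent = self
        self.children.append(child)
        return child

    def extract(self):
        """Detach this node (and its subtree) from its parent and return it."""
        if self.parent is not None:
            self.parent.children.remove(self)
            self.parent = None
        return self


root = Node("html")
body = root.append(Node("body"))
table = body.append(Node("table"))

# After this call, nothing in the chunk refers to the rest of the tree,
# and nothing in the rest of the tree refers to the chunk.
chunk = table.extract()
```

Once the chunk is detached, dropping the last reference to root leaves the rest of the tree unreachable, so it can be reclaimed.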

There are some more new features, but I have to take a shower now and go meet Pete Peterson II for dinner. Test it out; I'll be rewriting the documentation over the next month or so, and hopefully by then I'll have gotten enough feedback to release it.



Posted by George Gesslein II at Sun Apr 23 2006 21:40

Dear Crummy,
I owe you much for your daily offering to the world.
I will repay you as soon as I break out of hell.
Q: Which Linux distro do you use?

Posted by Leonard at Sun Apr 23 2006 23:37


Posted by Fredrik at Mon Apr 24 2006 03:36

Umm. Not sure what you're saying here, but if the new Soup *returns* UTF-8-encoded 8-bit strings to the application, something's not quite right.

(the right way to do this is to use Unicode for everything at the API level. if you're concerned about memory use, use 8-bit strings for things that contain plain ASCII, and Unicode strings for everything else).

Posted by Leonard at Mon Apr 24 2006 08:28

Ok, I'll make the default toEncoding None and it'll leave it in Unicode by default. Or I could get rid of toEncoding in the Soup constructor, and make the string representation methods take an encoding. Which is better?

Posted by Leonard at Mon Apr 24 2006 08:33

I can't just make the end-user encode a Unicode representation themselves. XML and HTML documents sometimes declare what encoding they're in, and I don't want them to have to find and rewrite the declarations.

Posted by Aaron Bentley at Wed Apr 26 2006 15:07

My biggest issues with Beautiful Soup are the way it handles text areas: entity references are not translated and HTML comments are treated as text. Are these addressed in 3.0?

Posted by Leonard at Wed Apr 26 2006 19:13

Sure, if you want. What do you want to happen to entity references exactly?

Posted by anonymous at Sat Apr 29 2006 21:37

It would be nice if valid entity references were translated into the correct character. For example, '&eacute;' would be turned into u'\u00e9', and numeric references would also be handled. Any invalid references would be left verbatim.

Below is the code I'm using to do something like this.

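from htmlentitydefs import name2codepoint
def trydecode(match):
    # match.group(0) is the whole reference, e.g. '&eacute;'; strip '&' and ';'.
    try:
        return unichr(name2codepoint[match.group(0)[1:-1]])
    except KeyError:
        # Unknown entity name: leave the reference verbatim.
        return match.group(0)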
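
For comparison, here is a sketch of the same idea in Python 3 that also covers the numeric references mentioned above. The regular expression and the names ENTITY_RE, decode_entity and decode_entities are assumptions made for illustration, not anything Beautiful Soup provides; html.entities is the Python 3 home of htmlentitydefs.

```python
import re
from html.entities import name2codepoint

# Matches named (&eacute;), decimal (&#233;) and hex (&#xE9;) references.
ENTITY_RE = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def decode_entity(match):
    """Return the character for one reference, or the original text if invalid."""
    body = match.group(1)
    try:
        if body[:2] in ("#x", "#X"):
            return chr(int(body[2:], 16))
        if body.startswith("#"):
            return chr(int(body[1:]))
        return chr(name2codepoint[body])
    except (KeyError, ValueError, OverflowError):
        # Unknown name or out-of-range code point: leave it verbatim.
        return match.group(0)


def decode_entities(text):
    """Replace every valid entity reference in text with its character."""
    return ENTITY_RE.sub(decode_entity, text)
```

Calling decode_entities on a string leaves anything that is not a valid reference untouched, including bare ampersands.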

Posted by Josh Myer at Sun Apr 30 2006 04:22

You'll need to allow users to override the encoding for a given document. A ton of sites hose their encoding declaration. More significantly, depending on the MIME type given in the HTTP transaction, the default encoding for X(HT)?ML changes. This is, of course, found in a note issued by the W3C, lucidly clarifying their insanity: XHTML Media Types: text/html.

I have several very dirty reinterpretations of the W3C acronym, entirely because of inane, bizarre behaviors like this one.

(Parenthetically, I've heard from a friend at Google that they don't pay much attention to encodings, but instead do some inference and statistical analysis to divine a page's encoding. It seems the much safer way to go, if a little expensive.)

Posted by Leonard at Sun Apr 30 2006 09:15

The encoding hierarchy is: programmer-specified encoding, encoding specified in document, autodetected encoding, utf-8, windows-1252. It picks the first one that successfully converts the document into Unicode.
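
That fallback order can be sketched as a loop over candidate encodings. This is a rough Python 3 illustration, not the code in Beautiful Soup: the function name to_unicode and its parameters are hypothetical, and the declared and autodetected encodings are simply passed in rather than sniffed from the document.

```python
def to_unicode(data, override=None, declared=None, detected=None):
    """Decode bytes by trying candidate encodings in priority order.

    Returns a (text, encoding_used) pair. All parameter names are
    hypothetical; sniffing the declared or detected encoding is out of scope.
    """
    candidates = [override, declared, detected, "utf-8", "windows-1252"]
    for encoding in candidates:
        if not encoding:
            continue  # This source of information was not available.
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue
    raise ValueError("no candidate encoding could decode the document")
```

A document that wrongly declares itself as utf-8 falls through to a later candidate instead of failing outright, which is the behavior the hierarchy is meant to give.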


Unless otherwise noted, all content licensed by Leonard Richardson
under a Creative Commons License.