(10) Sun Apr 23 2006 18:58 Beautiful Soup 3.0 Beta:
For your delectability. The major new feature is that Beautiful Soup 3.0 takes XML or HTML documents in any encoding and turns them into UTF-8; in most cases you don't have to know the current encoding. I wrote this without really knowing anything about encodings: most of the code is stolen from Mark Pilgrim's Universal Feed Parser. But I am able to write tests, and the tests work.
The other major new feature is that you can now rip out a chunk from the parse tree with the extract method. You can use the chunk and abandon the rest of the tree, or vice versa. This is especially useful because the data structures you abandoned can now be garbage-collected: in current Beautiful Soup, the whole tree stays in memory forever because every Tag and NavigableText is connected to every other Tag and NavigableText through an intricate web of lies. And by "lies", I mean "instance variables".
There are some more new features, but I have to take a shower now to go and meet Pete Peterson II for dinner. Test it out; I'll be rewriting the documentation over the next month or so, and hopefully by then I'll have gotten enough feedback to release it.
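To make the extract idea concrete, here is a rough sketch of why detaching a chunk lets the rest of the tree be freed. It does not use Beautiful Soup at all: the Node class and its append and extract methods are hypothetical stand-ins, and the real Tag and NavigableText objects carry more links than the parent and children shown here.

```python
import gc
import weakref


class Node:
    """Hypothetical stand-in for a parse-tree element (not Beautiful Soup's Tag)."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []

    def append(self, child):
        # Link the child into the tree in both directions.
        child.parent = self
        self.children.append(child)
        return child

    def extract(self):
        """Detach this node (and its subtree) from its parent and return it."""
        if self.parent is not None:
            self.parent.children.remove(self)
            self.parent = None
        return self


root = Node("html")
body = root.append(Node("body"))
table = body.append(Node("table"))

# Keep the chunk, abandon the rest of the tree.
chunk = table.extract()
rest = weakref.ref(root)  # lets us observe whether the old tree is freed
del root, body

gc.collect()  # the abandoned nodes still reference each other in a cycle
print(chunk.name, chunk.parent, rest() is None)
```

Without the extract call, the chunk's parent link would keep every ancestor reachable, so holding on to one table would hold on to the whole document.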
Posted by Leonard at Sun Apr 23 2006 23:37
Umm. Not sure what you're saying here, but if the new Soup *returns* UTF-8-encoded 8-bit strings to the application, something's not quite right. (The right way to do this is to use Unicode for everything at the API level. If you're concerned about memory use, use 8-bit strings for things that contain plain ASCII, and Unicode strings for everything else.)
Posted by Leonard at Mon Apr 24 2006 08:28
Ok, I'll make the default toEncoding None and it'll leave it in Unicode by default. Or I could get rid of toEncoding in the Soup constructor, and make the string representation methods take an encoding. Which is better?
Posted by Leonard at Mon Apr 24 2006 08:33
I can't just make the end-user encode a Unicode representation themselves. XML and HTML documents sometimes declare what encoding they're in, and I don't want them to have to find and rewrite the declarations.
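The declaration problem can be sketched roughly as follows. This is not Beautiful Soup's code: the render function and the XML_DECL_ENCODING pattern are hypothetical names made up for illustration, and the sketch only handles an XML declaration, not an HTML meta tag.

```python
import re

# Matches the encoding pseudo-attribute inside an XML declaration.
XML_DECL_ENCODING = re.compile(r'(<\?xml[^>]*?encoding=)(["\'])[^"\']*\2')


def render(document, encoding):
    """Encode a Unicode document, updating its XML declaration to match."""
    updated = XML_DECL_ENCODING.sub(
        # Keep the original quote character, swap in the new encoding name.
        lambda m: m.group(1) + m.group(2) + encoding + m.group(2),
        document,
        count=1,
    )
    return updated.encode(encoding)


doc = '<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\u00e9</p>'
print(render(doc, "utf-8"))
```

If the end user called encode on the Unicode string themselves, the bytes would be UTF-8 while the declaration still claimed ISO-8859-1, which is the mismatch described above.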
My biggest issues with Beautiful Soup are the way it handles text areas: entity references are not translated and HTML comments are treated as text. Are these addressed in 3.0?
Posted by Leonard at Wed Apr 26 2006 19:13
Sure, if you want. What do you want to happen to entity references exactly?
Posted by Leonard at Sun Apr 30 2006 09:15
The encoding hierarchy is: programmer-specified encoding, encoding specified in document, autodetected encoding, utf-8, windows-1252. It picks the first one that successfully converts the document into Unicode.
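A minimal sketch of that fallback order, assuming the three optional encodings are found elsewhere and passed in as strings. The to_unicode function and its parameter names are hypothetical, not the library's API, and the final fallback is spelled windows-1252, the name Python's codec registry recognizes.

```python
def to_unicode(data, override=None, declared=None, detected=None):
    """Return (text, encoding) using the first candidate that decodes cleanly."""
    candidates = [override, declared, detected, "utf-8", "windows-1252"]
    for encoding in candidates:
        if not encoding:
            continue  # this source of information was not available
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown codec name; try the next one
    raise ValueError("no candidate encoding could decode the document")


print(to_unicode(b"caf\xe9"))
print(to_unicode(b"caf\xc3\xa9", declared="ascii"))
```

The first call falls all the way through to windows-1252 because a lone 0xE9 byte is not valid UTF-8; the second rejects the wrong declared encoding and succeeds with UTF-8.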