Use Beautiful Soup 4 instead.
The parser in version 3.1.0 of Beautiful Soup did significantly worse on real-world HTML than version 3.0.8 did. The most common problems were the incorrect handling of <script> tags, "malformed start tag" errors, and "bad end tag" errors.
To address this problem, I replaced the 3.1 series of Beautiful Soup with the 3.2 series. I then replaced Beautiful Soup 3 itself with Beautiful Soup 4, which can use any of a number of third-party parsers.
This page was originally written in March 2009, and remains up for historical purposes. The information on this page is years out of date. If you use it to make decisions, you'll probably make bad decisions. See the Beautiful Soup 4 documentation instead.
Despite the above warnings, I still encounter people who read this
page and think that Beautiful Soup is a dead project. I don't want to
remove the text of this page, because I think it's important to an
understanding of the project's history. Instead, I've taken the step
of striking out the no-longer-accurate information. Hopefully this will help.
What happened?
Beautiful Soup 3.0.8 uses SGMLParser, found in the Python standard library. It's not the best parser ever written, and I've had to add a lot of code dealing with its quirks, but it can handle most bad tags and a certain amount of <script> weirdness. It's been the basis of Beautiful Soup since the first version.
In Python 3.0, the latest version of Python, SGMLParser has been removed. The only parser that's part of the Python 3.0 standard library is HTMLParser, a simpler parser that's much worse at processing malformed HTML.
In the meantime, an excellent new HTML parsing library called html5lib has emerged. It's very good at processing malformed HTML, but it's not part of the Python standard library, and at the moment it's slower than either SGMLParser or HTMLParser.
I no longer enjoy working on Beautiful Soup, but too many people depend on it for me to let the project die just because it depends on code that's been removed from the Python standard library. So I wrote Beautiful Soup 3.1.0, which cleans up the Python 2.x code so that it can be automatically converted to Python 3.0 code. Part of this change was to switch the underlying parser from SGMLParser to HTMLParser. The errors you're seeing are caused by HTML that SGMLParser could handle but HTMLParser can't.
When Beautiful Soup was first released in May of 2004, there were no lenient HTML parsers that were easy to program, only the ones embedded in web browsers. Now there are lots of programmable lenient HTML parsers, and most of them are better than SGMLParser, let alone HTMLParser. These days, Beautiful Soup is useful for its Unicode handling and its tree-traversal methods. Parsing is no longer a competitive advantage for Beautiful Soup, and I'd be happier if I could get out of the parser business altogether.
1. In the future, Python 3.0 will be the standard version of Python. There will be no SGMLParser anymore, and versions of Beautiful Soup prior to 3.1.0 will stop working.
2. In a future version of Beautiful Soup, you'll be able to specify which parser you want to use to parse a given document. The supported parsers will probably be HTMLParser and html5lib, but you'll be able to write some code to hook up any parser to drive the Beautiful Soup tree builder, and then run BS methods on the result.
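To make the "any parser drives the tree builder" idea concrete, here is a minimal sketch using only the Python 3 standard library. It is not Beautiful Soup's actual code: `Node`, `TreeBuilder`, `HTMLParserDriver`, and `parse` are hypothetical names invented for this illustration. The point is only that the tree builder knows nothing about the parser, so a different parser could be hooked up by writing another small adapter.

```python
from html.parser import HTMLParser


class Node:
    """Hypothetical minimal tree node (not a Beautiful Soup class)."""

    def __init__(self, name, attrs=None, parent=None):
        self.name = name
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []  # child Nodes and text strings, in document order

    def find_all(self, name):
        """Return every descendant Node with the given tag name."""
        found = []
        for child in self.children:
            if isinstance(child, Node):
                if child.name == name:
                    found.append(child)
                found.extend(child.find_all(name))
        return found

    def text(self):
        """Concatenate all text below this node."""
        return "".join(
            c.text() if isinstance(c, Node) else c for c in self.children
        )


class TreeBuilder:
    """Parser-agnostic: anything that calls start/end/data can drive it."""

    def __init__(self):
        self.root = Node("[document]")
        self.current = self.root

    def start(self, name, attrs):
        node = Node(name, attrs, parent=self.current)
        self.current.children.append(node)
        self.current = node

    def end(self, name):
        # Walk up to the nearest open tag with this name, implicitly
        # closing anything left open inside it. Stray end tags are ignored.
        node = self.current
        while node is not self.root and node.name != name:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def data(self, text):
        self.current.children.append(text)


class HTMLParserDriver(HTMLParser):
    """Adapter that forwards HTMLParser events to a TreeBuilder."""

    def __init__(self, builder):
        super().__init__()
        self.builder = builder

    def handle_starttag(self, tag, attrs):
        self.builder.start(tag, attrs)

    def handle_endtag(self, tag):
        self.builder.end(tag)

    def handle_data(self, data):
        self.builder.data(data)


def parse(markup):
    """Parse markup with HTMLParser and return the root Node."""
    builder = TreeBuilder()
    driver = HTMLParserDriver(builder)
    driver.feed(markup)
    driver.close()
    return builder.root


root = parse("<p>Hello <b>world</b></p><p>Unclosed <i>tag</p>")
for p in root.find_all("p"):
    print(p.text())
```

This sketch deliberately skips the things a real tree builder has to handle, such as void elements like `<br>`, nesting rules, comments, and encoding detection.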
1. You can pretend that Beautiful Soup version 3.1.0 was never released. Version 3.0.8 still works fine on Python 2.3 through 2.6.
2. You can use html5lib. That library has a Beautiful Soup tree builder that will yield standard Beautiful Soup objects. It depends on Beautiful Soup to run.
3. You can use ElementTree. If you want the Beautiful Soup API, you can use Element Soup to feed the HTML into Beautiful Soup once ElementTree has cleaned it up.
4. You can use lxml. It doesn't have all of Beautiful Soup's tree-traversal methods, but it's a fast and easy-to-use parser.
5. You can do some of the work yourself, and send me the result. 3.1.0 includes the start of the refactoring work for making the parser pluggable, moving HTMLParser code into a parser class.
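For readers arriving here today: in Beautiful Soup 4 the parser is chosen with the second argument to the `BeautifulSoup` constructor, so trying lxml or html5lib no longer means giving up the Beautiful Soup API. The sketch below assumes the `beautifulsoup4` package is installed; `make_soup` and its `preferred` ordering are hypothetical names chosen for this example, and `lxml` and `html5lib` are only used if they happen to be installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is available.

    Returns a (soup, parser_name) pair. "html.parser" ships with Python,
    so it serves as the fallback when no third-party parser is installed.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # that parser library isn't installed; try the next
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<p>Some <b>bad<i>HTML</p>")
print("parsed with:", used)
print(soup.find("i").get_text())
```

Different parsers can build different trees from the same malformed markup, so it's worth pinning one parser explicitly once you know which one suits your documents.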
This is the point where I'm supposed to ask for money. But money won't solve this problem. I'm always happy when someone sends me donations because of Beautiful Soup, and I used to describe these donations as "supporting Beautiful Soup development." But that's not really accurate. To justify further Beautiful Soup development I need time, not money. Gifts of money feel more like thank-yous for the work I've already done.
I have a full-time programming job that pays well but isn't very flexible. If I get a $100 donation, I can't take an hour off from work and work on Beautiful Soup. If I take a day off, I'd rather spend it doing something other than programming. To make the time for Beautiful Soup development, I'd need enough money to make it my job, and that's too much money to ask for or to expect.
To summarize, Beautiful Soup is a hobby that I don't really enjoy and that's similar to the work I do all day. It's competing against other hobbies and commitments I have, hobbies and commitments that are more enjoyable and significantly different from my day job. So when I say you can do some of the work yourself, I'm not being snarky. That's a legitimate option for getting this code written faster.
Update: For easier access to the source, I've put Beautiful Soup 3.0 and trunk on Launchpad, so you can make and publish your own branches.
This document is part of Crummy, the webspace of Leonard Richardson (contact information). It was last modified on Tuesday, April 16 2013, 15:47:54 Nowhere Standard Time and last built on Monday, May 29 2023, 08:00:01 Nowhere Standard Time.