easy_install beautifulsoup4: This is an HTMLized version of an email I sent to the Beautiful Soup discussion group, about the impending release of Beautiful Soup 4.

Introduction

When Beautiful Soup was first released in 2004, the state of HTML parsing in Python was appalling. Over the past eight years, things have improved so dramatically that Beautiful Soup's HTML parser is no longer a competitive advantage. I don't want to duplicate other people's work, so I'm getting Beautiful Soup out of the parser business. Beautiful Soup's job is now to provide a Pythonic screen-scraping API on top of a data structure created by a third-party parser.

This will be Beautiful Soup 4, and I've been planning it for years. With help from Thomas Kluyver and Ezio Melotti, I've now met the three main goals of Beautiful Soup 4:

  1. Make a single codebase that works under Python 2 and Python 3.
  2. Stop using SGMLParser (removed in Python 3) and make it possible to swap out one parser for another.
  3. Support two major Python parsers (lxml and html5lib) as well as Python's (not currently very good) batteries-included parser, html.parser.

The first version of BS4 is almost ready for release, and I'd like you to test it out, if you haven't already. I still need to fix some things, in particular some performance problems. But note that even with the performance problems, BS4 is faster than BS3 across the board.

On Python 2 or Python 3 you can install the BS4 beta with this command:

easy_install beautifulsoup4

You can also get the source tarball.

The documentation has been completely rewritten. You may find the section on porting BS3 code to BS4 especially interesting.

There are three major things I'd like your feedback on before completing the release.

Hall of Fame

The BS3 documentation lists open-source projects that use Beautiful Soup. I stopped maintaining this list many years ago because there are hundreds of these projects, and since most of them are screen-scrapers, they're pretty ephemeral.

I'd like to bring this feature back as a "hall of fame", featuring applications of Beautiful Soup that grab a reader's attention. People who used Beautiful Soup in a high-profile way or to tackle a big issue. Projects that are interesting to hear about even if the software doesn't work anymore, or uses an old version of Beautiful Soup, or if Beautiful Soup was used internally and the public only saw the results.

My bias is towards projects having to do with space, science, journalism, politics and social justice. Here are some examples so you know the kind of thing I'm thinking of:

If you did anything of this sort, or know of someone who did, I'd like to hear about it.

Do you prefer lxml or html5lib?

Right now, the parser ranking goes lxml, html5lib, html.parser. I like lxml because it's incredibly fast and it can parse anything. But I'd like to see what you think of the trees it generates. Would html5lib, with its web-browser-like heuristics, be a better default?
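
To make the idea of a parser ranking concrete, here is a minimal sketch using only the standard library. It is not Beautiful Soup's actual selection code; the names PARSER_RANKING and pick_parser are hypothetical, invented for this illustration. It walks the ranking in order and returns the first parser whose module is installed.

```python
import importlib.util

# Preference order described above: lxml, then html5lib, then the
# standard library's html.parser. (Hypothetical constant name.)
PARSER_RANKING = ["lxml", "html5lib", "html.parser"]


def pick_parser(ranking=PARSER_RANKING):
    """Return the first name in `ranking` whose module can be found.

    Hypothetical helper for illustration only.
    """
    for name in ranking:
        try:
            if importlib.util.find_spec(name) is not None:
                return name
        except ModuleNotFoundError:
            # Raised for dotted names whose parent package is missing.
            continue
    raise LookupError("none of the ranked parsers is installed")


print(pick_parser())
```

Changing the default then amounts to reordering the list, which is why the question of which trees you prefer matters more than the mechanics.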

substitute_html_entities

BS3 had a number of overlapping and inconsistent ways of turning HTML/XML entities into Unicode characters, and possibly turning Microsoft smart quotes into HTML entities at the same time. In BS4, all this stuff is gone. HTML and XML entities are *always* converted into Unicode characters.
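
For readers unsure what "converted into Unicode characters" means in practice, the standard library's html.unescape performs the same kind of conversion on a plain string. This is only an illustration of the behavior, not the code BS4 uses internally.

```python
import html

# Named and numeric entities both become ordinary Unicode characters.
raw = "&ldquo;Hi&rdquo; &amp; caf&eacute; &#233;"
text = html.unescape(raw)

print(text)
```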

This is great, but there's one problem: output. If you want to turn those Unicode characters back into entities when outputting as a string, you need to call soup.encode(substitute_html_entities=True), which is a little clunky. I'm thinking of adding an output_html_entities attribute that you can set on a soup or tag to control whether this substitution happens. Do you like this idea?
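
As a rough sketch of what entity substitution on output involves, the following uses the standard library's entity table. It is an assumption-laden illustration, not BS4's implementation, and to_named_entities is a hypothetical name. Note that this table also covers "&", "<", ">" and the double quote, so those are replaced too.

```python
from html.entities import codepoint2name


def to_named_entities(text):
    """Replace each character that has a named HTML entity with that entity.

    Hypothetical helper for illustration only.
    """
    pieces = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        pieces.append(("&%s;" % name) if name else ch)
    return "".join(pieces)


print(to_named_entities("caf\u00e9 & cr\u00e8me"))
```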

I think I also need to ensure that characters like "&" and "<" are always converted to XML entities on output, even though this will hurt performance a bit.
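
The minimal escaping described here can be illustrated with the standard library's xml.sax.saxutils.escape, which replaces "&", "<" and ">" and leaves everything else alone. This is a sketch of the desired output behavior, not a claim about how BS4 does it.

```python
from xml.sax.saxutils import escape

# Only the characters that would break XML are replaced;
# non-ASCII characters such as the e-acute pass through unchanged.
escaped = escape("AT&T <b> caf\u00e9")

print(escaped)
```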

Conclusion

What you install with easy_install beautifulsoup4 is a beta release. If I hear of a problem soon, there's still time to fix it, even if it means a major change to the API. So please try it out and give me feedback.

Comments:

Posted by Geoffrey Knauth at Tue Feb 14 2012 09:17

I use Beautiful Soup to process the entries and results for the CRASH-B Sprints World Indoor Rowing Championships ( http://crash-b.org/ ) -- and for many other things too. Thank you!

Posted by Bruce Eckel at Wed Feb 15 2012 15:23

I'm using BeautifulSoup to take a book created with a remote coauthor on Google Docs and turn it into an eBook for Kindle and ePub readers. I've gone from knowing virtually nothing about BS to being a big fan; it is becoming my go-to for all problems XML/HTML.

Posted by Harry Burns at Mon Feb 20 2012 17:49

Sorry, included HTML which looks like it's been stripped -- I meant to say that HTML5 meta tags with charset attribute are not updated where meta tags with http-equiv="content-type" are.


Unless otherwise noted, all content licensed by Leonard Richardson
under a Creative Commons License.