Introduction
When Beautiful Soup was first released in 2004, the state of HTML
parsing in Python was appalling. Over the past eight years, things
have improved so dramatically that Beautiful Soup's HTML parser is no
longer a competitive advantage. I don't want to duplicate other
people's work, so I'm getting Beautiful Soup out of the parser
business. Beautiful Soup's job is now to provide a Pythonic
screen-scraping API on top of a data structure created by a
third-party parser.
This will be Beautiful Soup 4, and I've been planning it for
years. With help from Thomas Kluyver and Ezio Melotti, I've now met
the three main goals of Beautiful Soup 4:
On Python 2 or Python 3 you can install the BS4 beta with this command:
    easy_install beautifulsoup4
You can also get the source tarball.
The documentation has been completely rewritten. You may find the section on porting BS3 code to BS4 especially
interesting.
There are three major things I'd like your feedback on before
completing the release.
Hall of Fame
The BS3 documentation lists open-source projects that use Beautiful
Soup. I stopped maintaining this list many years ago because there are
hundreds of these projects, and since most of them are
screen-scrapers, they're pretty ephemeral.
I'd like to bring this feature back as a "hall of fame", featuring
applications of Beautiful Soup that grab a reader's attention. People
who used Beautiful Soup in a high-profile way or to tackle a big
issue. Projects that are interesting to hear about even if the
software doesn't work anymore, or uses an old version of Beautiful
Soup, or if Beautiful Soup was used internally and the public only saw
the results.
My bias is towards projects having to do with space, science,
journalism, politics and social justice. Here are some examples so you
know the kind of thing I'm thinking of:
If you did anything of this sort, or know of someone who did, I'd
like to hear about it.
Do you prefer lxml or html5lib?
Right now, the parser ranking goes lxml, html5lib, html.parser. I like
lxml because it's incredibly fast and it can parse anything. But I'd
like to see what you think of the trees it generates. Would html5lib,
with its web-browser-like heuristics, be a better default?
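One way to form an opinion is to feed the same invalid markup to each parser and look at the trees side by side. The sketch below assumes the bs4 package as eventually released (the BeautifulSoup constructor taking a parser name, and the FeatureNotFound exception); compare_parsers is a hypothetical helper, not part of Beautiful Soup. It skips any parser that isn't installed.

```python
# Sketch: build the same invalid markup with each parser that happens to be
# installed and collect the resulting trees for comparison.
try:
    from bs4 import BeautifulSoup, FeatureNotFound
except ImportError:  # Beautiful Soup 4 itself is not installed
    BeautifulSoup = None

PARSERS = ["lxml", "html5lib", "html.parser"]  # the ranking described above


def compare_parsers(markup, parsers=PARSERS):
    """Return {parser_name: serialized tree} for every available parser."""
    trees = {}
    if BeautifulSoup is None:
        return trees
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            continue  # this parser is not installed; skip it
        trees[name] = soup.decode()
    return trees


if __name__ == "__main__":
    # "<a></p>" has an unopened </p>, so each parser has to guess at a repair.
    for name, tree in compare_parsers("<a></p>").items():
        print(f"{name:12} {tree}")
```

Running it on a few snippets of real-world bad HTML should show whether a parser wraps fragments in html and body tags, and how it repairs mismatched tags.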
substitute_html_entities
BS3 had a number of overlapping and inconsistent ways of turning
HTML/XML entities into Unicode characters, and possibly turning
Microsoft smart quotes into HTML entities at the same time. In BS4,
all this stuff is gone. HTML and XML entities are *always* converted
into Unicode characters.
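To make the two directions concrete, here is a minimal sketch that uses only the Python 3 standard library's html module. It is not Beautiful Soup's implementation, and the function names are hypothetical; it only illustrates what "entities become Unicode on input" and "Unicode becomes entities on output" mean.

```python
import html
from html.entities import codepoint2name


def entities_to_unicode(text):
    """Replace HTML entities such as &eacute; with Unicode characters."""
    return html.unescape(text)


def unicode_to_entities(text):
    """Replace every character that has a named HTML entity with that entity.

    This includes "&", "<", ">" and the double quote as well as accented letters.
    """
    pieces = []
    for char in text:
        name = codepoint2name.get(ord(char))
        pieces.append(f"&{name};" if name else char)
    return "".join(pieces)


parsed = entities_to_unicode("caf&eacute; &amp; cr&egrave;me")
print(parsed)                       # café & crème
print(unicode_to_entities(parsed))  # caf&eacute; &amp; cr&egrave;me
```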
This is great, but there's one problem: output. If you want to turn
those Unicode characters back into entities when outputting as a
string, you need to call soup.encode(substitute_html_entities=True),
which is a little clunky. I'm thinking of adding an
output_html_entities attribute that you can set on a soup or tag to
control whether this substitution happens. Do you like this idea?
I think I also need to ensure that characters like "&" and "<" [...]
Conclusion
What you install with easy_install beautifulsoup4 is a beta
release. If I hear of a problem soon, there's still time to fix it,
even if it means a major change to the API. So please try it out and
give me feedback.
Thu Feb 02 2012 11:59 easy_install beautifulsoup4:
This is an HTMLized version of an email I sent to the Beautiful Soup discussion group, about the impending release of Beautiful Soup 4.
The first version of BS4 is almost ready for release, and I'd like you
to test it out, if you haven't already. I still need to fix some things, in
particular some performance problems. But note that even with the
performance problems, BS4 is faster than BS3 across the board.