
There are four Beautiful Soup-related tasks that ought to happen in the near future.

  1. Convert the codebase to Python 3; or rather, convert it to Python 2 that can be automatically converted to Python 3.
  2. Aaron DeVore has a number of interesting additions to the API. They need to be added.
  3. Separate the tree-walker from the tree-builder. Start getting out of the business of writing tree-builders.
  4. Simplify the API.

Aaron is doing the integration work. I'm doing the conversion. Once that's done we'll have a big soup that people can use into the distant future if they don't like what I do in steps 3 and 4, or if I give the whole thing up, which is a distant possibility.

Step 3 is still a mystery. Apparently html5lib is even slower than SGMLParser, so a generic interface is probably the way to go: maybe something with bindings to lxml and html5lib (this is why you may not like what I do in this step and may want to stick with the all-Python, not-as-slow-as-html5lib version).
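To make the separation concrete, here's a minimal sketch of what a pluggable tree-builder might look like. Everything here is hypothetical illustration, not Beautiful Soup's actual API: the `TreeBuilder` interface, the dict-based tree shape, and the stdlib `html.parser` backend are all stand-ins for what a real lxml or html5lib binding would implement.

```python
from abc import ABC, abstractmethod
from html.parser import HTMLParser

class TreeBuilder(ABC):
    """Hypothetical interface: parsing (tree-building) is separated
    from the navigation API (tree-walking)."""
    @abstractmethod
    def parse(self, markup):
        """Return a tree of dicts: {'name', 'attrs', 'children'}."""

class StdlibTreeBuilder(TreeBuilder):
    """Backend built on the stdlib parser; an lxml or html5lib
    backend would implement the same one-method interface."""
    def parse(self, markup):
        root = {'name': '[document]', 'attrs': {}, 'children': []}
        stack = [root]

        class _Parser(HTMLParser):
            def handle_starttag(self, tag, attrs):
                node = {'name': tag, 'attrs': dict(attrs), 'children': []}
                stack[-1]['children'].append(node)
                stack.append(node)
            def handle_endtag(self, tag):
                if len(stack) > 1:
                    stack.pop()
            def handle_data(self, data):
                if data.strip():
                    stack[-1]['children'].append(data)

        _Parser().feed(markup)
        return root

tree = StdlibTreeBuilder().parse('<p><b>hi</b></p>')
```

The point of the interface is that the tree-walking code only ever sees the finished tree, so swapping a slow pure-Python builder for a fast C-backed one doesn't touch the API users learn.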

What's going to happen in step 4? Ian Bicking wrote an appreciation of lxml that ties in with my Beautiful Soup angst: mostly to do with the tree-builder, but also rightly bashing BS's primitive CSS matcher. Implementing CSS selectors or XPath is another business I don't want to be in, but I wouldn't mind bundling someone else's strategies for walking the tree according to CSS selectors or XPath.
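As an example of what "someone else's strategy" looks like in practice, the stdlib's ElementTree already ships a limited XPath subset, and lxml offers full XPath and CSS selectors through the same kind of call. This isn't Beautiful Soup code; it just shows the delegation I have in mind.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<html><body><div class="post">'
    '<a href="/1">one</a><a href="/2">two</a>'
    '</div></body></html>'
)

# ElementTree's findall() understands a small XPath subset;
# a selector layer could hand matching off to an engine like
# this instead of reimplementing it.
links = doc.findall('.//div[@class="post"]/a')
hrefs = [a.get('href') for a in links]
```

Bundling means BS would only translate its own idioms into calls like `findall()`, leaving the hard matching logic to the underlying engine.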

I'm not terribly motivated to make these changes because I don't really use Beautiful Soup anymore. Partly because I don't do as much non-work programming as I did pre-Canonical, and partly because the sites I used to screen-scrape back in 2004 have wised up and developed web services or syndication feeds. Redoing the library doesn't feel like a fun use of my end-of-year vacation; I'd rather write stories.

But here's where I start when I think about the changes. For me, the core users of Beautiful Soup are the total newbies, people for whom BS is their first or second Python library. I've never made a secret of the fact that BS is slower than other parsers, and although coupling it with a C tree-builder would speed it up a lot, I'm totally serious when I say the point of BS is to save programmer time at the expense of processor time. If you need speed, you've got options. My overriding concern is people who've just realized they can get the information they need off a web site by writing a computer program.

Unfortunately, about 30% of those people have some specific need that goes beyond the simple API you see in Beautiful Soup 1.x. Over time those additional ways of searching got stuck into the API, and now you've got the Microsoft Word problem: the complexity of the API is itself costing programmer time. It needs to stop. (But after I get Aaron's additions in, so that I'll have more raw data to do my redesign with.)

So what I'd like to try is a stripped-down Beautiful Soup based on list comprehensions, with a little bit of syntactic sugar for total newbies. Matching is done by calling a factory function that returns the equivalent of a SoupStrainer. The factory functions contain most of what's currently in searchTag. Additional factory functions can implement CSS selectors or XPath, except at this point you should be able to just use a parser that has its own support for CSS selectors or XPath.
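Here's a sketch of what that comprehension-plus-factory style might feel like. The `tag` factory, the `walk` helper, and the dict-based tree are all hypothetical names for illustration; the only real claim is the shape: a factory returns a predicate (the moral equivalent of a SoupStrainer) and matching is just a list comprehension over the tree.

```python
def tag(name=None, **attrs):
    """Hypothetical factory: build a predicate that plays the
    role a SoupStrainer plays today."""
    def matches(node):
        if name is not None and node.get('name') != name:
            return False
        return all(node.get('attrs', {}).get(k) == v
                   for k, v in attrs.items())
    return matches

def walk(node):
    """Yield a node and all its descendant tag nodes."""
    yield node
    for child in node.get('children', []):
        if isinstance(child, dict):
            yield from walk(child)

tree = {'name': 'body', 'attrs': {}, 'children': [
    {'name': 'a', 'attrs': {'href': '/1'}, 'children': []},
    {'name': 'p', 'attrs': {}, 'children': [
        {'name': 'a', 'attrs': {'href': '/2'}, 'children': []}]}]}

# Matching is a plain list comprehension over the tree.
links = [n for n in walk(tree) if tag('a')(n)]
```

A total newbie would get syntactic sugar on top of this, but the sugar would desugar to exactly this comprehension, so there's only one search mechanism to maintain, and a CSS or XPath factory function slots in as just another predicate.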

Don't even get me started on what I need to do to NewsBruiser eventually...


Unless otherwise noted, all content licensed by Leonard Richardson
under a Creative Commons License.