
Hey, Beautiful Soup fans (all others can ignore this entry). Among my other projects I am designing Beautiful Soup version 2.0, which should be much more coherent and powerful, generate better parse trees, and have better Unicode support. This will come at the expense of Python 1.5 compatibility and (as always) backwards compatibility with previous versions. Thanks to several incredibly useful contributed patches, I have almost everything figured out, but I have two unresolved issues, one about nomenclature and one about operator overloading, which follow. I know there is nothing Python programmers love better than arguing about nomenclature, so have at it.

  1. Consider the following HTML:
    <p><b>Foo <i>bar</i></b> <u>baz</u></p>
    <div>Some more text</div>

    *ML is parsed by Beautiful Soup (and other parsers) into a treelike structure, like so:

    [root]
     |
     +-P
     | +-B
     | | |
     | | +-"Foo"
     | | |
     | | +-I
     | |   |
     | |   +-"bar"
     | +-U
     |   |
     |   +-"baz"
     |
     +-DIV
       |
       +-"Some more text"
    

    One of the two defining features of Beautiful Soup is that it tries harder than other parsers to build this tree even when the markup is bad, and will always give you some sort of tree, on the assumption that if you wanted a ParsingIsHardLetsGoShoppingException you would have created one yourself. But one of the things missing from Beautiful Soup v1 is (to refer to the tree above) any notion of the relationship between the P tag and the DIV tag, which I've discovered can be a very important relationship to have access to when you're screen-scraping.

    Looking at the tree it's obvious that the P tag and the DIV tag are siblings; they're right next to each other on the same level of the tree. But there's no easy way in Beautiful Soup to get from the P tag to the DIV tag. The .next member of the P tag is the B tag, because it was the thing *parsed* immediately after the P tag. You have to get the parent of the P tag (the root of the document), then get the list of its children, then go through it looking for the P tag, then see what the next thing is.

    In Beautiful Soup v2, the P tag is going to have some pointer to the DIV tag, and vice versa. This will only be useful for relatively well-formed HTML, but when you need it, you need it.

    However, I'm not sure what to call these new members. previous and next are already taken as referring to "previous/next thing parsed" and I want to leave that alone. The only other ideas I have are previousSibling and nextSibling. Do you have any other suggestions?
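
To make the difference concrete, here is a rough sketch in plain Python. It is not Beautiful Soup code: the Node class, its children list, and the next_sibling_by_search helper are made up for illustration, and previousSibling and nextSibling are only the candidate names floated above. The toy tree follows the diagram, so it ignores whitespace between tags.

```python
class Node:
    """Toy tree node for illustration; not Beautiful Soup's real class."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        # Candidate member names from the post; the final names are undecided.
        self.previousSibling = None
        self.nextSibling = None
        if parent is not None:
            if parent.children:
                # Link this node to the child that was added just before it.
                last = parent.children[-1]
                last.nextSibling = self
                self.previousSibling = last
            parent.children.append(self)


def next_sibling_by_search(node):
    """The long way round: go up to the parent, then scan its children."""
    if node.parent is None:
        return None
    siblings = node.parent.children
    for i, child in enumerate(siblings):
        if child is node:
            return siblings[i + 1] if i + 1 < len(siblings) else None
    return None


# Build the tag skeleton of the example tree (text nodes left out).
root = Node("[root]")
p = Node("p", root)
b = Node("b", p)
u = Node("u", p)
div = Node("div", root)

# Both routes lead from the P tag to the DIV tag.
print(next_sibling_by_search(p).name, p.nextSibling.name)
```

Maintaining the pointers costs a little bookkeeping every time a node is added, but after that the lookup is a single attribute access instead of a scan of the parent's children.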

  2. The other defining feature of Beautiful Soup is that it comes packaged with tree-traversal methods, inefficient to run but very efficient to not have to write yourself. There are two such methods. The main one is called fetch and it searches the tree for whatever you're looking for, returning a list of everything that matches. This is aliased (in v1 and v2) to the method call operator __call__, so in the example above you'd write soup.fetch("b") or just soup("b") to get a list containing the only B tag.

    But if you know there is only one B tag, it's a pain to call fetch and then take the first item in the resulting list. So there's a helper method called first that does it for you. In Beautiful Soup v2 I want some operator-overloading magic for the first method. I can't think of anything suitably Pythonic, though.

    Overriding __getitem__ would look OK, but I'm already using it to get a tag's attributes, i.e. a['href']. So I've got two current contenders. The first is the dot operator (__getattr__). The dot operator looks the nicest (soup.head.title is lovely) but it also happens to be used for member and method access on objects. I'd rather not have a solution that does one thing when you call .title and another thing when you call .fetch, especially when you might be parsing some XML or made-up markup language that has a "fetch" tag. It also seems like it would slow down the parsing a lot.

    The second contender is the % operator (__mod__). I can't explain why I like (soup % head) % title except that I think it's funny. Unfortunately the joke is too complicated to express in words, so you'll either get it too or you'll just have to take my word for it. On the other hand, if you took the modulus operator at its word you'd expect soup % head to give you the whole document but without the HEAD tags, which is totally not what first does. The other possible reference is to the string interpolation operator, but first is searching for a 'string', not inserting one. So it's not very intuitive unless you share my sense of programming humor.

    I may just not create any shorthand for first since it is already shorthand for fetch()[0]. It would be nice to have something clever and elegant, though, the way we use the method call operator as an alias for fetch. Again, I need ideas.
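
For anyone who wants to try the candidates side by side, here is a toy sketch. The method names fetch and first and the __call__ alias come from the description above; everything else is assumed for the sake of the example: the Tag class itself, matching on tag name only, first returning None when nothing matches, and passing the tag name to % as a string rather than a bare name.

```python
class Tag:
    """Toy tag for illustration; not Beautiful Soup's real class."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def fetch(self, name):
        """Return every descendant tag called `name`, in document order."""
        found = []
        for child in self.children:
            if child.name == name:
                found.append(child)
            found.extend(child.fetch(name))
        return found

    # soup("b") is an alias for soup.fetch("b").
    __call__ = fetch

    def first(self, name):
        """Return the first match, or None (an assumption) if there is none."""
        matches = self.fetch(name)
        return matches[0] if matches else None

    def __getattr__(self, name):
        # Contender one. Python calls __getattr__ only after normal attribute
        # lookup fails, so soup.fetch is still the method, even when the
        # document contains a <fetch> tag.
        if name.startswith("__"):
            raise AttributeError(name)
        return self.first(name)

    def __mod__(self, name):
        # Contender two: soup % "head" as shorthand for soup.first("head").
        return self.first(name)


title = Tag("title")
head = Tag("head", [title])
fetch_tag = Tag("fetch")  # a made-up markup language with a "fetch" tag
soup = Tag("[root]", [head, fetch_tag])

print(soup.head.title.name)
print(((soup % "head") % "title").name)
```

The sketch also shows the wart with the dot operator: soup.title finds a tag, soup.fetch finds a method, and the "fetch" tag is only reachable through soup.first("fetch") or soup % "fetch".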

Comments:

Posted by Ian Bicking at Wed Jan 05 2005 03:00

__getattr__ won't slow down your (non-dynamic-attribute) code, it's only called when a method isn't found. Is [0] that hard to write? Well, at least it's easy to read...

Posted by Leonard at Wed Jan 05 2005 10:34

One of 'em's not hard to write but you end up using it a lot, and the aggregate effect is ugly. I'm okay with the current technique of calling first() but I was hoping to be able to do something more elegant.



Unless otherwise noted, all content licensed by Leonard Richardson
under a Creative Commons License.