(3) Tue Jan 04 2005 16:13 PST:

Hey, Beautiful Soup fans (all others can ignore this entry). Among my other projects I am designing Beautiful Soup version 2.0, which should be much more coherent and powerful, as well as generating better parse trees and having better Unicode support. This will come at the expense of Python 1.5 compatibility and (as always) backwards compatibility with previous versions. Thanks to several incredibly useful contributed patches, I have almost everything figured out, but I have two unresolved issues about nomenclature and operator overloading, which follow. I know there is nothing Python programmers love better than arguing about nomenclature, so have at it.

*ML is parsed by Beautiful Soup (and other parsers) into a treelike structure, like so:

    <p><b>Foo <i>bar</i></b> <u>baz</u></p>
    <div>Some more text</div>

    [root]
     |
     +-P
     | +-B
     | | |
     | | +-"Foo"
     | | |
     | | +-I
     | |   |
     | |   +-"bar"
     | +-U
     |   |
     |   +-"baz"
     |
     +-DIV
       |
       +-"Some more text"

One of the two defining features of Beautiful Soup is that it tries harder than other parsers to build this tree even when the markup is bad, and will always give you some sort of tree, on the assumption that if you wanted a ParsingIsHardLetsGoShoppingException you would have created one yourself. But one of the things missing from Beautiful Soup v1 is (to refer to the tree above) any notion of the relationship between the P tag and the DIV tag, which I've discovered can be a very important relationship to have access to when you're screen-scraping.

Looking at the tree it's obvious that the P tag and the DIV tag are siblings; they're right next to each other on the same level of the tree. But there's no easy way in Beautiful Soup to get from the P tag to the DIV tag. The `.next` member of the P tag is the B tag, because it was the thing *parsed* immediately after the P tag. You have to get the parent of the P tag (the root of the document), then get the list of its children, then go through it looking for the P tag, then see what the next thing is. In Beautiful Soup v2, the P tag is going to have some pointer to the DIV tag, and vice versa. This will only be useful for relatively well-formed HTML, but when you need it, you need it.

However, I'm not sure what to call these new members. `previous` and `next` are already taken as referring to "previous/next thing parsed" and I want to leave that alone. The only other ideas I have are `previousSibling` and `nextSibling`. Do you have any other suggestions?

Beautiful Soup has a method called `fetch`, and it searches the tree for whatever you're looking for, returning a list of everything that matches. This is aliased (in v1 and v2) to the method call operator `__call__`, so in the example above you'd write `soup.fetch("b")` or just `soup("b")` to get a list containing the only B tag.

But if you know there is only one B tag, it's a pain to call `fetch` and then take the first item in the resulting list. So there's a helper method called `first` that does it for you. In Beautiful Soup v2 I want some operator-overloading magic for the `first` operator. I can't think of anything suitably Pythonic, though.

Overriding `__getitem__` would look OK, but I'm already using it to get a tag's attributes, i.e. `a['href']`. So I've got two current contenders. The first is the dot operator (`__getattr__`). The dot operator looks the nicest (`soup.head.title` is lovely) but it also happens to be used for member and method access on objects. I'd rather not have a solution that does one thing when you call `.title` and another thing when you call `.fetch`, especially when you might be parsing some XML or made-up markup language that has a "fetch" tag. It also seems like it would slow down the parsing a lot.

The second contender is the % operator (`__mod__`). I can't explain why I like `(soup % head) % title` except that I think it's funny. Unfortunately the joke is too complicated to express in words, so you'll either get it too or you'll just have to take my word for it. On the other hand, if you took the modulus operator at its word you'd expect `soup % head` to give you the whole document but without the HEAD tags, which is totally not what `first` does. The other possible reference is to the string interpolation operator, but `first` is searching for a 'string', not inserting one. So it's not very intuitive unless you share my sense of programming humor.

I may just not create any shorthand for `first`, since it is already shorthand for `fetch()[0]`. It would be nice to have something clever and elegant, though, the way we use the method call operator as an alias for `fetch`. Again, I need ideas.
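
Both open questions are easier to argue about with something runnable. What follows is a minimal sketch on a toy tree, not Beautiful Soup's code: the `Node` class, its `next_sibling` method, and the tag-only tree (text nodes are left out) are hypothetical stand-ins, and returning `None` from `first` when nothing matches is an assumption of the sketch. It shows the long way round to a sibling, `fetch` aliased to the call operator, and the two candidate spellings for `first`.

```python
class Node:
    """Hypothetical stand-in for a parsed tag; not Beautiful Soup's class."""

    def __init__(self, name, attrs=None, children=()):
        self.name = name
        self.attrs = dict(attrs or {})
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self

    def next_sibling(self):
        """The long way round: go to the parent, scan its children for
        this node, and return whatever comes right after it."""
        if self.parent is None:
            return None
        siblings = self.parent.children
        for index, node in enumerate(siblings):
            if node is self:
                if index + 1 < len(siblings):
                    return siblings[index + 1]
                return None
        return None

    def fetch(self, name):
        """Return every descendant tag with this name, in document order."""
        found = []
        for child in self.children:
            if child.name == name:
                found.append(child)
            found.extend(child.fetch(name))
        return found

    # Calling a node is an alias for fetch.
    __call__ = fetch

    def first(self, name):
        """Shorthand for fetch(name)[0]; None if nothing matches (assumption)."""
        matches = self.fetch(name)
        return matches[0] if matches else None

    def __getitem__(self, key):
        # Square brackets are already spent on tag attributes.
        return self.attrs[key]

    def __getattr__(self, name):
        # Contender one. Python only calls __getattr__ when normal lookup
        # fails, so node.fetch is always the method, never a <fetch> tag.
        if name.startswith("_"):
            raise AttributeError(name)
        return self.first(name)

    def __mod__(self, name):
        # Contender two: node % "name".
        return self.first(name)


# The tree from the example markup, tags only.
i_tag = Node("i")
b_tag = Node("b", children=[i_tag])
u_tag = Node("u")
p_tag = Node("p", children=[b_tag, u_tag])
div_tag = Node("div")
root = Node("[root]", children=[p_tag, div_tag])

# A made-up markup language that happens to have a "fetch" tag.
odd = Node("doc", children=[Node("fetch")])
```

With this toy, `p_tag.next_sibling()` is the DIV node, and `root("b")`, `root.first("b")`, `root.b` and `root % "b"` all reach the same B node, as a list in the first case. The name clash shows up on `odd`: `odd.fetch` is the bound method, so the "fetch" tag can only be reached with `odd.first("fetch")` or `odd % "fetch"`.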