NOTE: This is an archival document describing the now-obsolete 2.x version of Beautiful Soup. For the latest version, see the Beautiful Soup homepage.

How to Use Beautiful Soup

This document explains the use of Beautiful Soup: how to create a parse tree, how to navigate it, and how to search it.

Quick Start

Here's a Python session that demonstrates the basic features of Beautiful Soup.

>>> from BeautifulSoup import BeautifulSoup
>>> import re
>>>
>>> #Create the soup
... input = '''<html>
... <head><title>Page title</title></head>
... <body>
... <p id="firstpara" align="center">This is paragraph <b>one</b>.
... <p id="secondpara" align="blah">This is paragraph <b>two</b>.
... </html>'''
>>> soup = BeautifulSoup(input)
>>>
>>> #Search the soup
... titleTag = soup.html.head.title
>>> print titleTag
<title>Page title</title>
>>>
>>> print titleTag.string
Page title
>>>
>>> print len(soup('p'))
2
>>>
>>> print soup('p', {'align' : 'center'})
[<p id="firstpara" align="center">This is paragraph <b>one</b>.
</p>]
>>>
>>> print soup('p', {'align' : 'center'})[0]['id']
firstpara
>>>
>>> print soup.first('p', {'align' : re.compile('^b.*')})['id']
secondpara
>>>
>>> print soup.first('p').b.string
one
>>>
>>> print soup('p')[1].b.string
two
>>>
>>> #Modify the soup
... titleTag['id']='theTitle'
>>> titleTag.contents = ['New title.']
>>> print soup.html.head.title
<title id="theTitle">New title.</title>

Creating and Feeding the Parser

For most tasks your best bet is the BeautifulSoup parser. (See the "Choosing a parser" section below for situations in which you might use the others.) You can construct any of the parsers with no arguments, or you can pass in the text you want to parse.

If you don't pass the text to the parser constructor, you'll need to feed it into the parser afterwards, using the feed() method. Calling feed() more than once is the same as concatenating the strings together and calling feed() once.
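Here's a rough illustration of why several feed() calls amount to one. It does not use Beautiful Soup at all: it assumes Python 3's standard html.parser.HTMLParser as a stand-in, since that class also accepts text incrementally through feed(). The EventLog class and its tags and text members are hypothetical names made up for this sketch.

```python
from html.parser import HTMLParser  # Python 3 standard library; a stand-in, not Beautiful Soup


class EventLog(HTMLParser):
    """Hypothetical helper: records the tag events and the text it sees."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.tags = []   # ('start', name) and ('end', name) tuples, in order
        self.text = []   # text chunks; chunk boundaries may differ between runs

    def handle_starttag(self, tag, attrs):
        self.tags.append(('start', tag))

    def handle_endtag(self, tag):
        self.tags.append(('end', tag))

    def handle_data(self, data):
        self.text.append(data)


pieces = ['<p id="firstpara">This is para', 'graph <b>one', '</b>.</p>']

# Feed the document in three pieces...
chunked = EventLog()
for piece in pieces:
    chunked.feed(piece)
chunked.close()

# ...and as one concatenated string.
whole = EventLog()
whole.feed(''.join(pieces))
whole.close()

# The text is compared after joining, because a parser may report it in
# differently sized chunks depending on where the input was split.
same_tags = chunked.tags == whole.tags
same_text = ''.join(chunked.text) == ''.join(whole.text)
print(same_tags and same_text)   # True
```

The close() call in this stand-in plays roughly the part that done() plays in the next section: it tells the parser that no more text is coming.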

The done() method

Once you're done feeding text into the parser, call the done() method so that the parser knows to close any unclosed tags. If you pass in text to the parser constructor, you don't need to call done(); it'll happen automatically. This means that if you pass in text to the parser constructor, and then call feed(), you might not get the results you expect: there's a done() call in between.

Navigating the Parse Tree

When you feed a markup document into one of Beautiful Soup's parser classes, Beautiful Soup transforms the markup into a parse tree: a set of linked objects representing the structure of the document.

The parser object is the root of the parse tree. Below it are Tag objects and NavigableText objects. A Tag represents an SGML tag, as well as anything and everything encountered between that tag and its closing tag. A NavigableText object represents a chunk of ASCII or Unicode text. You can treat it just like a string, but it also has the navigation members so you can get to other parts of the parse tree from it.

For concreteness, here's a visual representation of the parse tree for the example HTML I introduced in the "Quick Start" section. I got this representation by calling soup.prettify(), and I'll use it throughout this section to illustrate the navigation. In this representation, a Tag that's underneath another Tag in the parse tree is displayed one level of indentation deeper than its parent.

 <html>
  <head>
   <title>Page title
   </title>
  </head>
  <body>
   <p id="firstpara" align="center">This is paragraph
    <b>one
    </b>.
   </p>
   <p id="secondpara" align="blah">This is paragraph
    <b>two
    </b>.
   </p>
  </body>
 </html>

This is saying: we've got an html Tag which contains a head Tag and a body Tag. The head Tag contains a title Tag, which contains a NavigableText object that says "Page title". The body Tag contains two p Tags, and so on.

All Tag objects have all of the members listed below (though the actual value of the member may be Null). NavigableText objects have all of them except for contents and string.

parent

In the example above, the parent of the "head" Tag is the "html" Tag. The parent of the "html" Tag is the BeautifulSoup parser object itself. The parent of the parser object is Null. By following parent you can move up the parse tree.

contents

With parent you move up the parse tree; with contents you move down it. This is a list of Tag and NavigableText objects contained within a tag. Only the top-level parser object and Tag objects have contents; NavigableText objects don't have 'em.

In the example above, the contents of the first "p" Tag is a list containing a NavigableText ("This is paragraph"), a "b" Tag, and another NavigableText ("."). The contents of the "b" Tag is a list containing a NavigableText ("one").

string

For your convenience, if a tag has only one child node, and that child node is an ASCII or Unicode string, the child node is made available as tag.string as well as tag.contents[0]. In the example above, soup.b.string is a NavigableText representing the string "one". That's the string contained in the first "b" Tag in the parse tree. soup.p.string is Null, because the first "p" Tag in the parse tree has more than one child. soup.head.string is also Null, even though the "head" Tag has only one child, because that child is a Tag (the "title" Tag), not a string.
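The rule for string can be sketched in a few lines. This is only an illustration, not Beautiful Soup's code: the TagSketch class and the string_of() function are hypothetical, plain Python strings stand in for NavigableText, and None stands in for Null.

```python
class TagSketch(object):
    """Hypothetical stand-in for a Tag: a name and a list of children."""

    def __init__(self, name, contents):
        self.name = name
        self.contents = list(contents)


def string_of(tag):
    """Return the sole child if it is a string, otherwise None.

    None stands in for Beautiful Soup's Null object in this sketch.
    """
    if len(tag.contents) == 1 and isinstance(tag.contents[0], str):
        return tag.contents[0]
    return None


b = TagSketch('b', ['one'])
p = TagSketch('p', ['This is paragraph ', b, '.'])
title = TagSketch('title', ['Page title'])
head = TagSketch('head', [title])

print(string_of(b))      # one
print(string_of(p))      # None (more than one child)
print(string_of(head))   # None (the only child is a tag, not a string)
```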

nextSibling and previousSibling

These members let you skip to the next or previous thing on the same level of the parse tree. For instance, the nextSibling of the "head" Tag is the "body" Tag, because the "body" Tag is the next thing directly beneath the "html" Tag. The nextSibling of the "body" tag is Null, because there's nothing else directly beneath the "html" Tag.

Conversely, the previousSibling of the "body" Tag is the "head" tag, and the previousSibling of the "head" Tag is Null.

Some more examples: the nextSibling of the first "p" Tag is the second "p" Tag. The previousSibling of the "b" Tag inside the second "p" Tag is the NavigableText "This is paragraph". The previousSibling of that NavigableText is Null, not anything inside the first "p" Tag.

next and previous

These members let you move through the document elements in the order they were processed by the parser, rather than in the order they appear in the tree. For instance, the next of the "head" Tag is the "title" Tag, not the "body" Tag. This is because the "title" Tag comes immediately after the "head" tag in the original document.

Where next and previous are concerned, a Tag's contents come before whatever is its nextSibling. You usually won't have to use these members, but sometimes it's the easiest way to get to something buried inside the parse tree.
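To make the difference between next and nextSibling concrete, here's a small sketch with a toy tree. It is not Beautiful Soup's implementation: the Node class, its next_sibling and next members, and the link_document_order() function are hypothetical, and the text nodes of the example document are left out for brevity.

```python
class Node(object):
    """Hypothetical stand-in for a Tag: a name plus a list of child Nodes."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.next_sibling = None   # plays the role of nextSibling
        self.next = None           # plays the role of next
        # Siblings are consecutive entries in the same children list.
        for left, right in zip(self.children, self.children[1:]):
            left.next_sibling = right


def link_document_order(root):
    """Set .next on every node by walking the tree parent-first."""
    order = []

    def visit(node):
        order.append(node)          # a node comes before its contents...
        for child in node.children:
            visit(child)            # ...and its contents before its next sibling

    visit(root)
    for current, following in zip(order, order[1:]):
        current.next = following
    return order


title = Node('title')
head = Node('head', [title])
body = Node('body', [Node('p'), Node('p')])
html = Node('html', [head, body])
link_document_order(html)

print(head.next.name)           # title
print(head.next_sibling.name)   # body
print(title.next.name)          # body
```

The last line shows the "contents come before nextSibling" rule from the other direction: once the contents of "head" are exhausted, next moves on to "body".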

Iterating over a Tag

You can iterate over the contents of a tag by treating the Tag itself as a list. for i in soup.body: is the same as for i in soup.body.contents:. Both will iterate over the direct children of the first 'body' Tag found in the parse tree. Similarly, to see how many child nodes a Tag has, you can call len(tag) instead of len(tag.contents).
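One plausible way for an object to behave like this is to delegate Python's iteration and length protocols to its contents list. The sketch below assumes exactly that; the IterableTag class is hypothetical and is not taken from the Beautiful Soup source.

```python
class IterableTag(object):
    """Hypothetical stand-in for a Tag that can be looped over directly."""

    def __init__(self, name, contents):
        self.name = name
        self.contents = list(contents)

    def __iter__(self):
        # "for child in tag" walks the direct children only.
        return iter(self.contents)

    def __len__(self):
        # len(tag) is the number of direct children.
        return len(self.contents)


body = IterableTag('body', ['first child', 'second child'])

via_tag = [child for child in body]
via_contents = [child for child in body.contents]

print(via_tag == via_contents)          # True
print(len(body) == len(body.contents))  # True
```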

Navigate the parse tree by specifying tag names

It's easy to navigate the parse tree by referencing the name of the tag you want as a member of the parser or a Tag object. We've been doing it throughout these examples. In general, calling tag.foo returns the first child of that tag (direct or recursive) that happens to be a "foo" Tag. If there aren't any "foo" Tags beneath a tag, its .foo member is Null.

You can use this to traverse the parse tree, writing code like soup.html.head.title to get the title of an HTML document.

You can also use this to quickly jump to a certain part of a parse tree. For instance, if you're not worried about "title" Tags in weird places outside of the "head" Tag, you can just use soup.title to get an HTML document's title. soup.p jumps to the first "p" Tag inside a document, wherever it is. soup.table.tr.td jumps to the first column of the first row of the first table in the document.

These members actually alias to the first() method, which is covered below in the section "Searching the Parse Tree". I mention it here because the alias makes it very easy to zoom in on an interesting part of a well-known parse tree.

soup.foo versus soup.fooTag

An alternate form of this idiom lets you access the first 'foo' Tag as .fooTag instead of .foo. For instance, soup.table.tr.td could also be expressed as soup.tableTag.trTag.tdTag, or even soup.tableTag.tr.tdTag. This is useful if you like to be more explicit about what you're doing, or if you're parsing XML whose tags contain names that conflict with Beautiful Soup methods and members.

Suppose you were parsing XML that contained tags called "parent" or "contents". soup.parent won't trigger this idiom; it tries to find the parent of the parser object (which is Null). You can't use that idiom to find the first "parent" tag in the parse tree. Instead, use soup.parentTag.
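The reason soup.parent and soup.parentTag behave differently can be shown with Python's __getattr__ hook, which runs only when ordinary attribute lookup fails. The sketch below assumes that mechanism; the NamedTag class is hypothetical, and None stands in for Null.

```python
class NamedTag(object):
    """Hypothetical stand-in for a Tag with .foo and .fooTag lookups."""

    def __init__(self, name, contents=()):
        self.name = name
        self.contents = list(contents)
        self.parent = None
        for child in self.contents:
            if isinstance(child, NamedTag):
                child.parent = self

    def first(self, name):
        """Return the first descendant tag called name, in document order."""
        for child in self.contents:
            if isinstance(child, NamedTag):
                if child.name == name:
                    return child
                found = child.first(name)
                if found is not None:
                    return found
        return None   # stands in for Null in this sketch

    def __getattr__(self, attr):
        # Reached only when normal lookup fails, so real members such as
        # .parent and .contents never get here.
        if attr.startswith('__'):
            raise AttributeError(attr)
        if attr.endswith('Tag') and len(attr) > 3:
            attr = attr[:-3]          # "parentTag" means the tag named "parent"
        return self.first(attr)


doc = NamedTag('doc', [NamedTag('parent', [NamedTag('child', ['text'])])])

print(doc.parent)            # None: the real member, not a search
print(doc.parentTag.name)    # parent
print(doc.child.name)        # child
print(doc.childTag.name)     # child
```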

The attributes of Tags

SGML tags have attributes, and so do the Tag objects created by the parser. For instance, each of the "p" Tags in the example above has an "id" attribute and an "align" attribute. You can access a Tag's attributes by treating the Tag as though it were a dictionary. soup.p['id'] retrieves the "id" attribute of the first "p" Tag. NavigableText objects don't have attributes, only Tag objects.

The Null object

When navigating or searching the parse tree, you may encounter the Null object. Null is just like Python's None, but it's easier to work with:

Try doing any of that with None and you'd get an exception.

Why is this useful? Consider a line of Beautiful Soup code like soup.head.title. If Beautiful Soup used None instead of Null, that code would only work so long as all your documents had a <head> tag containing a <title> tag. This is not a good assumption to make when you're dealing with real-world HTML.

For an ill-formed document, soup.head might be None, and then accessing the title member of None would raise an AttributeError. You'd need to check for None between getting the <head> and getting the <title>. But since Beautiful Soup actually returns Null if you ask for something that doesn't exist, and since you can access Null.title and get another Null, that code will work no matter what. You'll only have to check for Null once, at the end.
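Here's a minimal sketch of the null-object idea described above. It is not Beautiful Soup's Null: the NullType class is hypothetical, and the set of operations it absorbs (attribute access, calls, indexing, iteration, truth testing) is an assumption made for the sake of the example.

```python
class NullType(object):
    """Hypothetical null object: every operation on it yields the same object."""

    def __getattr__(self, attr):
        if attr.startswith('__'):
            raise AttributeError(attr)
        return self               # Null.anything is Null

    def __call__(self, *args, **kwargs):
        return self               # Null(...) is Null

    def __getitem__(self, key):
        return self               # Null['id'] is Null

    def __iter__(self):
        return iter(())           # looping over Null does nothing

    def __len__(self):
        return 0

    def __bool__(self):           # Python 3 truth test
        return False

    __nonzero__ = __bool__        # Python 2 truth test

    def __repr__(self):
        return 'Null'


Null = NullType()

title = Null.head.title                  # no exception, unlike None.head
print(title is Null)                     # True
print(bool(title))                       # False
print(Null.first('p')['id'] is Null)     # True

# One check at the end of the chain is enough.
if not title:
    print('no title found')
```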

Searching the Parse Tree

Beautiful Soup provides a number of methods for finding Tags and text that match criteria you specify. These methods are available only to Tag objects and to the top-level parser objects, not to NavigableText objects. The methods in the next section are also available to NavigableText objects.

fetch(name, attrs, recursive, limit)

The fetch() method traverses the tree and finds all the tags that match the criteria you gave it.

Calling fetch() on the parser object searches the entire parse tree. Calling fetch() on a Tag object searches only the contents of that Tag.

fetch() takes four arguments: name, attrs, recursive, and limit.

If you call a Tag object as though it were a function, you're actually calling that Tag's fetch() method. tag('tag1', {'attr1':'val1'}) is the same as tag.fetch('tag1', {'attr1':'val1'}), and it's a little more concise.

attrs convenience alias

When scraping HTML, the most common use of attrs is to find tags with a particular CSS class. If you pass in a string (or a list, or a regular expression, or a callable) instead of a map to attrs, fetch will assume you want to match the "class" attribute. Therefore, soup.fetch('foo', 'bar') is the same as soup.fetch('foo', {'class' : 'bar'}). Using this alias can make your code look neater.
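The alias amounts to a small normalization step. The sketch below shows one way it could be written; the normalize_attrs() function is hypothetical and is not part of Beautiful Soup.

```python
import re


def normalize_attrs(attrs):
    """Hypothetical helper: expand the shorthand into a full attribute map."""
    if attrs is None:
        return {}
    if isinstance(attrs, dict):
        return attrs              # already a map: use it unchanged
    # A string, list, regular expression or callable is taken as the
    # criterion for the "class" attribute.
    return {'class': attrs}


print(normalize_attrs('bar'))                  # {'class': 'bar'}
print(normalize_attrs({'align': 'center'}))    # {'align': 'center'}

pattern = re.compile('^b.*')
print(normalize_attrs(pattern)['class'] is pattern)   # True
```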

first(name, attrs, recursive)

The first() method traverses the tree and returns the first Tag that matches.

The arguments to first() are the same as to fetch(). It's basically a wrapper to fetch(), which returns either the first match or Null if there are no matches.

As with fetch, calling first() on the parser object searches the entire parse tree. Calling first() on a Tag object searches only the contents of that Tag.

As mentioned earlier, accessing the foo or fooTag member of the parser or a Tag object returns the first "foo" Tag in the parse tree. It's the same as calling first("foo").

fetchText(text, recursive, limit)

This is like fetch() for finding text strings (NavigableText objects) instead of Tag objects. The text argument is like the name argument of fetch(). As with name, you can pass in any of five objects:

The recursive and limit arguments work just like in fetch().

firstText(text, recursive)

As first() is to fetch(), so is firstText() to fetchText(). It calls fetchText() and returns the first item in the list, or Null if there were no matches.

Searching Inside the Parse Tree

You can do most Beautiful Soup operations with the four methods in the previous section. However, sometimes you can't use them to get directly to the Tag or NavigableText you want. For example, consider some HTML like this:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup('''<ul>
 <li>An unrelated list
</ul>

<h1>Heading</h1>
<p>This is <b>the list you want</b>:</p>
<ul>
 <li>The data you want
</ul>''')

There are a number of ways to navigate to that li tag that contains the data you want. The most obvious is this:

soup('li', limit=2)[1]

It should be equally obvious that that's not a very stable way to get that li tag. If you're only scraping this page once it doesn't matter, but if you're going to scrape it many times over a long period, such considerations become important. If the irrelevant list grows another li tag, you'll get that tag instead of the one you want, and your script will break or give the wrong data.

soup('ul', limit=2)[1].li

This is a little better, in that it can survive changes to the irrelevant list, but if the document grows another irrelevant list at the top, you'll get the first li tag of that list instead of the one you want. A more reliable way of referring to the ul tag you want would better reflect that tag's place in the structure of the document.

When you look at that HTML, you think of the list you want as 'the ul tag beneath the h1 tag'. The problem is that the tag isn't contained inside the h1 tag; it just comes after it. It's easy enough to get the h1 tag, but there's no way to get to the ul tag from there using first() and fetch(). You need to navigate to it with the next or nextSibling members:

s = soup.h1
while getattr(s, 'name', None) != 'ul':
    s = s.nextSibling
li = s.li

Or, if you think this might be more stable:

s = soup.firstText('Heading')
while getattr(s, 'name', None) != 'ul':
    s = s.next
li = s.li

Both of those examples are more trouble than you should need to go through, so the methods in this section provide a useful shorthand. These methods can be used whenever you find yourself wanting to write a while loop over one of the navigation members. Given a starting point somewhere in the tree, they navigate the tree in some way and keep track of Tag or NavigableText objects that match the criteria you specify. Instead of the loops in the example code above, you can just write this:

soup.h1.findNextSibling('ul').li

Or this:

soup.firstText('Heading').findNext('ul').li

All of these methods take the same arguments as first or fetch: an optional way of matching a tag name, an optional way of matching tag attributes, and an optional way of matching text.

There are two methods for each navigation member. The methods whose names look like fetchFoos take an optional limit, like fetch, and return a list of matches. The methods whose names look like findFoo are convenience methods which will stop searching the tree after encountering a match, and will return that match as a scalar.

These methods are available on the tree as a whole, and also on Tag and NavigableText objects. first and fetch, the methods covered in the previous section, are not available for NavigableText objects, because those methods search through a Tag's children, and NavigableText objects can't have any children.
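The relationship between the fetchFoos and findFoo methods can be sketched generically: follow one navigation member over and over, collect what matches, and stop early when a limit is reached. The code below is an illustration only. The Item class and the fetch_along() and find_along() functions are hypothetical, matching is reduced to comparing tag names, and None stands in for Null.

```python
class Item(object):
    """Hypothetical node with a name and a nextSibling link."""

    def __init__(self, name):
        self.name = name
        self.nextSibling = None


def fetch_along(start, member, name, limit=None):
    """Follow start.<member> repeatedly, collecting nodes called name."""
    matches = []
    node = getattr(start, member, None)
    while node is not None:
        if getattr(node, 'name', None) == name:
            matches.append(node)
            if limit is not None and len(matches) >= limit:
                break             # enough matches: stop walking
        node = getattr(node, member, None)
    return matches


def find_along(start, member, name):
    """Return the first match as a scalar, or None if there is none."""
    matches = fetch_along(start, member, name, limit=1)
    return matches[0] if matches else None   # None stands in for Null


# A chain of siblings: h1 -> p -> ul -> ul
h1, p, ul1, ul2 = Item('h1'), Item('p'), Item('ul'), Item('ul')
h1.nextSibling, p.nextSibling, ul1.nextSibling = p, ul1, ul2

print(find_along(h1, 'nextSibling', 'ul') is ul1)    # True
print(len(fetch_along(h1, 'nextSibling', 'ul')))     # 2
print(find_along(h1, 'nextSibling', 'table'))        # None
```

Passing a different member name ('next', 'previous', 'parent' and so on) to the same loop would give the other pairs of methods, which is why they can all share one set of arguments.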

findNextSibling(name, attrs, text) and fetchNextSiblings(name, attrs, text, limit)

These methods repeatedly follow an object's nextSibling member, gathering Tag or NavigableText objects that match the criteria you specify.

findPreviousSibling(name, attrs, text) and fetchPreviousSiblings(name, attrs, text, limit)

These methods repeatedly follow an object's previousSibling member, gathering Tag or NavigableText objects that match the criteria you specify.

findNext(name, attrs, text) and fetchNext(name, attrs, text, limit)

These methods repeatedly follow an object's next member, gathering Tag or NavigableText objects that match the criteria you specify.

findPrevious(name, attrs, text) and fetchPrevious(name, attrs, text, limit)

These methods repeatedly follow an object's previous member, gathering Tag or NavigableText objects that match the criteria you specify.

findParent(name, attrs) and fetchParents(name, attrs, limit)

These methods repeatedly follow an object's parent member, gathering Tag objects that match the criteria you specify. Since a NavigableText can have no children, you'll never get a NavigableText object while calling findParent or fetchParents. That's why these methods don't take a text argument.

Printing out the parse tree

If you need to make changes to the parse tree and print it back out, or just look at how Beautiful Soup decided to parse some bad HTML, you have a couple of options for turning the parse tree back into a string.

str() and unicode()

The parser objects, as well as each Tag and NavigableText object, can be printed out as strings. This string will have no unnecessary whitespace, and all tags will either be self-closing or have corresponding closing tags in what Beautiful Soup guesses is the right place. One useful thing you can do with this is clean up HTML into something approaching XHTML.

prettify()

The prettify() method turns the parse tree (or a portion of it) into a pretty-printed string. This is just like the regular string you'd get with print soup, except it uses whitespace to show the structure of the parse tree. Every tag will start a new line, and a tag's children will be indented one more level than its parent.

Remember from earlier examples that Beautiful Soup turned this:

<html>
<head><title>Page title</title></head>
<body>
<p id="firstpara" align="center">This is paragraph <b>one</b>.
<p id="secondpara" align="blah">This is paragraph <b>two</b>.
</html>

into this:

 <html>
  <head>
   <title>Page title
   </title>
  </head>
  <body>
   <p id="firstpara" align="center">This is paragraph
    <b>one
    </b>.
   </p>
   <p id="secondpara" align="blah">This is paragraph
    <b>two
    </b>.
   </p>
  </body>
 </html>

Choosing a parser

Beautiful Soup provides four classes that implement different parsing strategies. You'll need to choose the right one depending on your task. For most tasks you'll be able to use BeautifulSoup, but sometimes one of the other classes might make things easier for you.

BeautifulSoup

The most popular Beautiful Soup class, this class parses HTML as seen in the real world. It contains heuristics about common HTML usage and mis-usage.

Raw:

<i>This <span title="a">is<br> some <html>invalid</htl %> HTML.
<sarcasm>It's so great!</sarcasm>

Parsed with BeautifulSoup:

 <i>This
  <span title="a">is
   <br /> some
   <html>invalid HTML.
    <sarcasm>It's so great!
    </sarcasm>
   </html>
  </span>
 </i>

BeautifulStoneSoup

This class parses any XML-like language. It contains no special language- or schema-specific heuristics. If you want to define a set of self-closing tags for your XML schema, you'll need to subclass this class.

Raw:

<foo key1="value1">This is some <bar>invalid</baz> XML.

Parsed with BeautifulStoneSoup:

 <foo key1="value1">This is some
  <bar>invalid XML.
  </bar>
 </foo>

ICantBelieveItsBeautifulSoup

This is a subclass of BeautifulSoup with different heuristics. It's geared towards dealing with bizarre but valid HTML, like HTML that contains nested inline tags that don't do anything when you nest them:

Raw:

<b>This text is <b>bolded
twice</b> for some reason.

Parsed with BeautifulSoup:

 <b>This text is
 </b>
 <b>bolded twice
 </b> for some reason.

Parsed with ICantBelieveItsBeautifulSoup:

 <b>This text is
  <b>bolded twice
  </b> for some reason.
 </b>

BeautifulSOAP

This is a convenience subclass of BeautifulStoneSoup which makes it easier to deal with XML documents (like SOAP messages) that put data in tiny sub-elements when it would be more convenient to put them in attributes of the parent element.

Raw:

<foo><bar>baz</bar></foo>

Parsed with BeautifulSOAP:

 <foo bar="baz">
  <bar>baz
  </bar>
 </foo>

Advanced Topics: Building a custom parser

As befits an "advanced topics" section, I haven't written this yet.

Sanitizing Bad Data with Regexes

...

Customizing the Tag Maps

...

 

This document (source) is part of Crummy, the webspace of Leonard Richardson (contact information). It was last modified on Wednesday, May 04 2005, 19:17:58 Nowhere Daylight Time and last built on Sunday, May 28 2006, 09:00:58 Nowhere Daylight Time.

Crummy is © 1996-2006 Leonard Richardson. Unless otherwise noted, all text licensed under a Creative Commons License.
