How to Use Rubyful Soup

This document explains the use of Rubyful Soup: how to create a parse tree, how to navigate it, how to search it, and how to print it out.

Quick Start

Here's a Ruby session that demonstrates the basic features of Rubyful Soup.

require 'RubyfulSoup'
#Create the soup
input = %{<html><head><title>Page title</title></head><body>
<p id="firstpara" align="center">This is paragraph <b>one</b>.
<p id="secondpara" align="blah">This is paragraph <b>two</b>.</html>}
soup = BeautifulSoup.new(input)

#Search the soup
titleTag = soup.html.head.title
#=> <title>Page title</title>

titleTag.string
#=> "Page title"

soup.find_all('p').size
#=> 2

soup.find_all { |tag| tag['align'] == "blah" }
#=> [<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]

soup.find_all { |element| element.respond_to? :index and element.index('This is') == 0 }
#=> ["This is paragraph ", "This is paragraph "]

soup.find('p', :attrs => {'align' => 'center'})
#=> <p id="firstpara" align="center">This is paragraph <b>one</b>.
#=> </p>

soup.find_all('p', :attrs => {'align' => 'center'})[0]['id']
#=> "firstpara"

soup.find('p', :attrs => {'align' => /^b.*/})['id']
#=> "secondpara"

soup.find('p').b.string
#=> "one"

soup.find_all('p')[1].b.string
#=> "two"

#Modify the soup
titleTag['id']='theTitle'
titleTag.contents = ['New title.']
soup.html.head.title
#=> <title id="theTitle">New title.</title>

Creating and Feeding the Parser

The done method

Navigating the Parse Tree

When you feed a markup document into one of Rubyful Soup's parser classes, Rubyful Soup transforms the markup into a parse tree: a set of linked objects representing the structure of the document.

The parser object is the root of the parse tree. Below it are Tag objects and NavigableString objects. A Tag represents an SGML tag, as well as anything and everything encountered between that tag and its closing. A NavigableString object represents a chunk of ASCII or Unicode text. You can treat it just like a string, but it also has the navigation members so you can get to other parts of the parse tree from it.

For concreteness, here's a visual representation of the parse tree for the example HTML I introduced in the "Quick Start" section. I got this representation by calling soup.prettify, and I'll use it throughout this section to illustrate the navigation. In this representation, a Tag that's underneath another Tag in the parse tree is displayed with one more level of indentation than its parent.
 <html>
  <head>
   <title>Page title
   </title>
  </head>

  <body>
   <p id="firstpara" align="center">This is paragraph
    <b>one
    </b>.
   </p>
   <p id="secondpara" align="blah">This is paragraph
    <b>two
    </b>.
   </p>

  </body>
 </html>

This is saying: we've got an html Tag which contains a head Tag and a body Tag. The head tag contains a title Tag, which contains a NavigableString object that says "Page title". The body Tag contains two p Tags, and so on.

All Tag objects have all of the members listed below (though the actual value of the member may be nil). NavigableString objects have all of them except for contents and string.

parent

In the example above, the parent of the "head" Tag is the "html" Tag. The parent of the "html" Tag is the BeautifulSoup parser object itself. The parent of the parser object is nil. By following parent you can move up the parse tree.

soup.head.parent.name
# => "html"
soup.head.parent.parent.class
# => BeautifulSoup
soup.head.parent.parent.parent
# => nil
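
To make "following parent" concrete, here's a minimal sketch in plain Ruby. It is not Rubyful Soup's own code: MiniNode and ancestors_of are hypothetical names I made up for illustration, and the toy tree is built by hand rather than parsed.

```ruby
# A toy stand-in for a parse tree node (hypothetical; not a Rubyful Soup class).
MiniNode = Struct.new(:name, :parent)

# Collect the names of every ancestor by following parent until it is nil.
def ancestors_of(node)
  names = []
  current = node.parent
  while current
    names << current.name
    current = current.parent
  end
  names
end

root  = MiniNode.new('[document]', nil)   # plays the role of the parser object
html  = MiniNode.new('html', root)
head  = MiniNode.new('head', html)
title = MiniNode.new('title', head)

p ancestors_of(title)   # ["head", "html", "[document]"]
p ancestors_of(root)    # []
```

The loop stops at the node whose parent is nil, which is the same stopping condition you'd use when walking up a real parse tree.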

contents

With parent you move up the parse tree; with contents you move down it. contents is a list of the Tag and NavigableString objects contained within a tag. Only the top-level parser object and Tag objects have contents; NavigableString objects don't have 'em.

In the example above, the contents of the first "p" Tag is a list containing a NavigableString ("This is paragraph "), a "b" Tag, and another NavigableString (".\n"). The contents of the "b" Tag is a list containing a NavigableString ("one").

soup.p.contents
# => ["This is paragraph ", <b>one</b>, ".\n"]
soup.p.contents[1].contents
# => ["one"]

string

For your convenience, if a tag has only one child node, and that child node is a string, the child node is made available as tag.string as well as tag.contents[0]. In the example above, soup.b.string is a NavigableString representing the string "one". That's the string contained in the first "b" Tag in the parse tree. soup.p.string is nil, because the first "p" Tag in the parse tree has more than one child. soup.head.string is also nil, even though the "head" Tag has only one child, because that child is a Tag (the "title" Tag), not a string.

soup.b.string
# => "one"
soup.p.string
# => nil
soup.head.string
# => nil
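
The rule for string is easy to state as code. This is only a sketch under my own assumptions: string_of is a hypothetical helper, a plain Array stands in for a tag's contents, and a Ruby String stands in for a NavigableString.

```ruby
# Hypothetical helper: mimic the rule for tag.string using a plain Array
# of children, where a String stands in for a NavigableString.
def string_of(contents)
  only_child = contents.size == 1 ? contents.first : nil
  only_child.is_a?(String) ? only_child : nil
end

p string_of(['one'])                              # "one"
p string_of(['This is paragraph ', [:b], ".\n"])  # nil (more than one child)
p string_of([[:title]])                           # nil (the only child is not a string)
```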

next_sibling and previous_sibling

These members let you skip to the next or previous thing on the same level of the parse tree. For instance, the next_sibling of the "head" Tag is the "body" Tag, because the "body" Tag is the next thing directly beneath the "html" Tag. The next_sibling of the "body" tag is nil, because there's nothing else directly beneath the "html" Tag.

Conversely, the previous_sibling of the "body" Tag is the "head" tag, and the previous_sibling of the "head" Tag is nil.

soup.head.next_sibling.name
# => "body"
soup.body.next_sibling
# => nil

soup.body.previous_sibling.name
# => "head"
soup.head.previous_sibling
# => nil

Some more examples: the next_sibling of the first "p" Tag is the second "p" Tag. The previous_sibling of the "b" Tag inside the second "p" Tag is the NavigableString "This is paragraph ". The previous_sibling of that NavigableString is nil, not anything inside the first "p" Tag.

soup.p.next_sibling
# => <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
soup.find_all('p')[1].b.previous_sibling
# => "This is paragraph "
soup.find_all('p')[1].b.previous_sibling.previous_sibling
# => nil
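
One way to picture the sibling members is as a lookup in the parent's contents. The sketch below is plain Ruby and makes no claim about how Rubyful Soup stores its links; SiblingNode is a hypothetical class that exists only for this illustration.

```ruby
# Toy node (hypothetical): siblings are found through the parent's contents.
class SiblingNode
  attr_reader :name, :parent, :contents

  def initialize(name, parent = nil)
    @name = name
    @parent = parent
    @contents = []
    parent.contents << self if parent
  end

  # The node after this one in the parent's contents, or nil at the end.
  def next_sibling
    return nil unless parent
    parent.contents[position + 1]
  end

  # The node before this one, or nil if this is the first child.
  def previous_sibling
    return nil unless parent
    position.zero? ? nil : parent.contents[position - 1]
  end

  private

  # Index of this node among its parent's children (compared by identity).
  def position
    parent.contents.index { |child| child.equal?(self) }
  end
end

html = SiblingNode.new('html')
head = SiblingNode.new('head', html)
body = SiblingNode.new('body', html)

puts head.next_sibling.name          # body
puts body.next_sibling.inspect       # nil
puts body.previous_sibling.name      # head
puts head.previous_sibling.inspect   # nil
```

Note that the lookup never leaves the parent's contents, which is why a sibling search can't wander into a neighboring tag's children.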

next_parsed and previous_parsed

These members let you move through the document elements in the order they were processed by the parser, rather than in the order they appear in the tree. For instance, the next_parsed of the "head" Tag is the "title" Tag, not the "body" Tag. This is because the "title" Tag comes immediately after the "head" tag in the original document.

Where next_parsed and previous_parsed are concerned, a Tag's contents come before whatever is its next_sibling. You usually won't have to use these members, but sometimes it's the easiest way to get to something buried inside the parse tree.

soup.head.next_parsed
# => <title>Page title</title>
soup.b.previous_parsed
# => "This is paragraph "
soup.b.next_parsed
# => "one"
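
If it helps, think of next_parsed as stepping through a depth-first, document-order walk of the tree. Here's a small sketch of that ordering in plain Ruby. The nested-Array tree and the parse_order helper are my own hypothetical stand-ins, not anything Rubyful Soup provides.

```ruby
# Toy tree as nested Arrays (hypothetical): [name, child, child, ...];
# a bare String is a text node.
TREE = ['html',
        ['head', ['title', 'Page title']],
        ['body', ['p', 'This is paragraph ', ['b', 'one'], '.']]].freeze

# Return node labels in document order: a node, then its contents, then
# whatever follows it. Text nodes are shown with quotes around them.
def parse_order(node)
  return [node.inspect] if node.is_a?(String)
  name, *children = node
  [name] + children.flat_map { |child| parse_order(child) }
end

order = parse_order(TREE)
puts order.join(' -> ')
puts order[order.index('head') + 1]   # title
puts order[order.index('b') - 1]      # "This is paragraph "
puts order[order.index('b') + 1]      # "one"
```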

A Tag is an Enumerable

You can iterate over the contents of a tag by treating the Tag itself as an Enumerable. soup.body.each is the same as soup.body.contents.each. Both will iterate over the direct children of the first 'body' Tag found in the parse tree. Similarly, to see how many child nodes a Tag has, you can call tag.size instead of tag.contents.size. All the Enumerable methods will work on a Tag object, just as if you'd called them on that Tag's Array of contents.

soup.body.each { |x| puts x.name if x.is_a? Tag }
# => p
# => p
soup.body.contents.each { |x| puts x.name if x.is_a? Tag }
# => p
# => p

soup.body.size
# => 3
soup.body.contents.size
# => 3

soup.body.p.reject { |x| x.is_a? Tag }
# => ["This is paragraph ", ".\n"]
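
The usual Ruby way to get this behavior is to include Enumerable and define each in terms of the child list. The sketch below shows that pattern with a hypothetical EnumTag class; I'm assuming nothing about Rubyful Soup's internals beyond what the text above says.

```ruby
# Toy tag (hypothetical): including Enumerable and defining each in terms of
# contents gives the tag every Enumerable method for free.
class EnumTag
  include Enumerable
  attr_reader :name, :contents

  def initialize(name, contents = [])
    @name = name
    @contents = contents
  end

  # Yield each direct child, exactly as contents.each would.
  def each(&block)
    contents.each(&block)
  end

  # Enumerable does not define size, so delegate it explicitly.
  def size
    contents.size
  end
end

bold = EnumTag.new('b', ['one'])
para = EnumTag.new('p', ['This is paragraph ', bold, ".\n"])

p para.size                                          # 3
p para.reject { |x| x.is_a?(EnumTag) }               # ["This is paragraph ", ".\n"]
p para.map { |x| x.is_a?(EnumTag) ? x.name : x }     # ["This is paragraph ", "b", ".\n"]
```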

Navigate the parse tree by using tag names as members

It's easy to navigate the parse tree by referencing the name of the tag you want as a member of the parser or a Tag object. We've been doing it throughout these examples. In general, calling tag.foo returns the first child of that tag (direct or recursive) that happens to be a "foo" Tag. If there aren't any "foo" Tags beneath a tag, its .foo member is nil.

You can use this to traverse the parse tree, writing code like soup.html.head.title to get the title of an HTML document.


soup.html.head.title
# => <title>Page title</title>

You can also use this to quickly jump to a certain part of a parse tree. For instance, if you're not worried about "title" Tags in weird places outside of the "head" Tag, you can just use soup.title to get an HTML document's title. soup.p jumps to the first "p" Tag inside a document, wherever it is. soup.table.tr.td jumps to the first column of the first row of the first table in the document.


soup.title
# => <title>Page title</title>

soup.p
# => <p id="firstpara" align="center">This is paragraph <b>one</b>.
# => </p>

These members actually alias to the Tag#find method, which is covered below in the section "Searching the Parse Tree". I mention it here because the alias makes it very easy to zoom in on an interesting part of a well-known parse tree.

soup.foo versus soup.foo_tag

An alternate form of this idiom lets you access the first 'foo' Tag as .foo_tag instead of .foo. For instance, soup.table.tr.td could also be expressed as soup.table_tag.tr_tag.td_tag, or even soup.table_tag.tr.td_tag. This is useful if you like to be more explicit about what you're doing, or if you're parsing XML whose tags have names that conflict with Rubyful Soup methods and members.

soup.title_tag
# => <title>Page title</title>

Suppose you were parsing XML that contained tags called "parent" or "contents". soup.parent won't look for a tag called "parent"; it will look for the parent of the parser object (which is nil). Therefore you can't use that idiom to find the first "parent" tag in the parse tree. Instead, use soup.parent_tag.
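
Here's a sketch of why the conflict happens, written with Ruby's method_missing hook. I'm assuming a method_missing-style dispatch purely for illustration; MemberTag, first_tag, and TAG_METHOD are hypothetical names, and this is not Rubyful Soup's actual code.

```ruby
# Toy tag (hypothetical) that resolves unknown method names as tag lookups.
class MemberTag
  # Plain lowercase names, optionally followed by _tag.
  TAG_METHOD = /\A([a-z][a-z0-9]*)(_tag)?\z/

  attr_reader :name, :contents
  attr_accessor :parent

  def initialize(name, contents = [])
    @name = name
    @parent = nil
    @contents = contents
    contents.each { |child| child.parent = self if child.is_a?(MemberTag) }
  end

  # First descendant tag with the given name (depth-first), or nil.
  def first_tag(wanted)
    contents.each do |child|
      next unless child.is_a?(MemberTag)
      return child if child.name == wanted
      found = child.first_tag(wanted)
      return found if found
    end
    nil
  end

  # tag.foo and tag.foo_tag both mean "the first foo tag beneath me".
  # Real methods such as parent are found before method_missing is tried,
  # which is why a tag literally named "parent" needs the _tag form.
  def method_missing(method_name, *args)
    match = TAG_METHOD.match(method_name.to_s)
    return super unless match && args.empty?
    first_tag(match[1])
  end

  def respond_to_missing?(method_name, include_private = false)
    TAG_METHOD.match?(method_name.to_s) || super
  end
end

doc = MemberTag.new('doc', [MemberTag.new('parent', [MemberTag.new('title', ['Hi'])])])

p doc.title.name        # "title"
p doc.parent            # nil: the real parent member, not a tag lookup
p doc.parent_tag.name   # "parent"
p doc.table             # nil: no such tag beneath doc
```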

The attributes of Tags

SGML tags can have attributes, and so can the Tag objects created by the parser. For instance, each of the "p" Tags in the example above has an "id" attribute and an "align" attribute. You can access a Tag's attributes by treating the Tag as though it were a Hash. soup.p['id'] retrieves the "id" attribute of the first "p" Tag.

soup.p['id']
# => "firstpara"

NavigableString objects don't have attributes; only Tag objects do.
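
Treating a Tag like a Hash amounts to forwarding [] and []= to a Hash of attributes. Here's a minimal sketch of that idea. AttrTag is a hypothetical class for illustration only, and the way it stores attributes is my assumption rather than a description of Rubyful Soup.

```ruby
# Toy tag (hypothetical): [] and []= simply delegate to an attribute Hash.
class AttrTag
  attr_reader :name, :attrs

  def initialize(name, attrs = {})
    @name = name
    @attrs = attrs
  end

  # Read an attribute; a missing attribute comes back as nil.
  def [](key)
    attrs[key]
  end

  # Set or replace an attribute.
  def []=(key, value)
    attrs[key] = value
  end
end

para = AttrTag.new('p', 'id' => 'firstpara', 'align' => 'center')
p para['id']        # "firstpara"
p para['class']     # nil
para['class'] = 'intro'
p para['class']     # "intro"
```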

Iterating Over the Parse Tree

Rubyful Soup exposes iterator functions that you can use to perform an iteration over the parse tree. Passing a code block to these functions is like repeatedly using the corresponding navigation member and calling the code block on each new result. The objects yielded by these methods will be both Tag and NavigableString objects.

Searching the Parse Tree

Rubyful Soup provides a number of methods for finding Tags and NavigableStrings that match criteria you specify. These methods are available only to Tag objects and to the top-level parser objects, not to NavigableString objects, which are always at the leaves of the parse tree. (The methods in the next section, "Searching Inside the Parse Tree", are available to both Tag and NavigableString objects.)

There are a lot of methods, and all of them have a great deal in common, so before I actually tell you the names of the methods I'm going to talk about their arguments.

Arguments to these methods

All these methods, and the methods in the next section, take basically the same arguments: a name, zero or more from a set of four possible keyword args (passed using simulated keyword arguments), and an optional code block. All these arguments are used to narrow the search so you can get only the results you want.

Overriding the match code with a code block

If you pass a code block to one of these methods, you are overriding the entire Rubyful Soup matching process with your own code. Your code block will accept a series of Tag and NavigableString objects and will need to return true or false depending on whether it thinks each one is a "match".

A code block passed into these methods is like a code block passed into Enumerable#select or Enumerable#reject: its return value is used only to decide whether or not a Tag or NavigableString has been matched. It's not like the code block you pass into Enumerable#collect, where the return value is used as the actually returned result.
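
The difference is easy to see with plain Ruby collections. In this sketch, matching is a hypothetical helper of mine that uses the block purely as a yes/no test. It stands in for the search methods only in spirit.

```ruby
# Hypothetical helper: keep the elements for which the block is truthy,
# which is how a match block's return value gets used.
def matching(elements)
  elements.select { |element| yield(element) }
end

nodes = ['This is paragraph ', 'one', '.']

p matching(nodes) { |node| node.start_with?('This') }  # ["This is paragraph "]
p nodes.collect { |node| node.start_with?('This') }    # [true, false, false]
```

The first call hands back the matched objects themselves; the second hands back whatever the block returned.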

Search terms

Rubyful Soup provides a very flexible matching system so you don't always have to write your own code block to do the matching. At just about any place mentioned above where Rubyful Soup accepts a "search term" (for instance, as the name of a Tag, or as a piece of text to match), you can pass in any of a number of objects:

  1. A string. This will match only that specific string.

    For instance, if you wanted to get all of the "a" tags, you could call soup.find_all('a').

  2. An Array. This will match only the string values present in the Array.

    For instance, if you wanted to collect both "font" and "div" tags, you could call soup.find_all(['font', 'div']).

  3. A Hash where the keys are the string values you will accept. The values of the Hash don't matter. This is just like the Array technique, but faster.

    For instance, if you wanted to collect both "font" and "div" tags, you could call soup.find_all({'font' => nil, 'div' => nil}).

  4. A regular expression. This will match any value that matches the regular expression.

    For instance, if you wanted to get all tags whose names contained the letter "a", you could call soup.find_all(/a/)

  5. A Proc object which takes a Tag object (or, if passed as the :text argument, a NavigableString object) and returns a boolean. This object will be called once for each Tag (or NavigableString) encountered, and if it returns true then the tag is considered to match.

    For instance, if you wanted to get only tags whose 'id' attributes matched their names, you could call: soup.find_all(Proc.new { |x| x.name==x['id'] })

    The main advantage of using a Proc object over just specifying a code block is that a code block has to handle both Tag and NavigableString objects. A Proc object only has to deal with one or the other (depending on where you pass it into the method).
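
The five kinds of search term above boil down to one dispatch on the term's class. Here's a sketch of that dispatch in plain Ruby. term_matches? is a hypothetical helper, and as a simplification I pass the Proc the value being tested (a tag name here) rather than a whole Tag object.

```ruby
# Decide whether a value (a tag name, a piece of text...) satisfies a search term.
# Hypothetical helper modelled on the five kinds of term listed above.
def term_matches?(term, value)
  case term
  when String then term == value
  when Array  then term.include?(value)
  when Hash   then term.key?(value)        # only the keys matter
  when Regexp then term.match?(value)
  when Proc   then term.call(value) ? true : false
  else false
  end
end

names = %w[a div font span table]

p names.select { |n| term_matches?('a', n) }                               # ["a"]
p names.select { |n| term_matches?(%w[font div], n) }                      # ["div", "font"]
p names.select { |n| term_matches?({ 'font' => nil, 'div' => nil }, n) }   # ["div", "font"]
p names.select { |n| term_matches?(/a/, n) }                               # ["a", "span", "table"]
p names.select { |n| term_matches?(proc { |x| x.size > 4 }, n) }           # ["table"]
```

The Hash lookup is a single key test, while the Array has to be scanned element by element, which is the reason the Hash form can be faster.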

find_all(name=nil, args={}, &block)

The find_all() method uses the children or recursive_children iterator, and traverses the entire tree below the starting point (the parser, or the Tag on which you called find_all). On its travels it gathers all the Tags or NavigableStrings that match the criteria you gave it. Supported args: :attrs, :text, :limit, :recursive.

find(name=nil, args={}, &block)

This is the same as find_all(), but it has a built-in :limit of 1. Supported args: :attrs, :text, :recursive.

find_all_text(text=nil, args={}, &block)

This locates pieces of text that match the search term text. This is just like passing in a :text arg to find_all. Supported args: :limit, :recursive.

find_text(text=nil, args={}, &block)

This is the same as find_all_text, but it has a built-in :limit of 1. Supported args: :recursive.
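
To show how :limit and :recursive shape a search, and how a find-style method can be built on top of a find_all-style one, here's a sketch in plain Ruby. SearchTag and its descendants helper are hypothetical, it matches on tag name only, and it gathers every candidate before applying the limit, which a real implementation need not do.

```ruby
# Toy tag (hypothetical) with a find_all that honours :limit and :recursive.
class SearchTag
  attr_reader :name, :contents

  def initialize(name, contents = [])
    @name = name
    @contents = contents
  end

  # Child tags only, or every tag beneath this one in document order.
  def descendants(recursive = true)
    contents.grep(SearchTag).flat_map do |child|
      recursive ? [child, *child.descendants] : [child]
    end
  end

  # All matching tags, cut off after :limit results if a limit is given.
  def find_all(wanted, args = {})
    found = descendants(args.fetch(:recursive, true)).select { |tag| tag.name == wanted }
    args[:limit] ? found.first(args[:limit]) : found
  end

  # Same search with a built-in limit of 1; returns the tag itself or nil.
  def find(wanted, args = {})
    find_all(wanted, args.merge(:limit => 1)).first
  end
end

list = SearchTag.new('ul', [SearchTag.new('li', ['one']), SearchTag.new('li', ['two'])])
body = SearchTag.new('body', [list, SearchTag.new('li', ['stray'])])

p body.find_all('li').size                        # 3
p body.find_all('li', :limit => 2).size           # 2
p body.find_all('li', :recursive => false).size   # 1
p body.find('li').contents                        # ["one"]
p body.find('table')                              # nil
```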

Searching Inside the Parse Tree

You can do most Rubyful Soup operations with the four methods in the previous section. However, sometimes you can't use them to get directly to the Tag or NavigableString you want. For example, consider some HTML like this:

require 'RubyfulSoup'
soup = BeautifulSoup.new(%{<ul>

 <li>An unrelated list
</ul>

<h1>Heading</h1>
<p>This is <b>the list you want</b>:</p>

<ul>
 <li>The data you want
</ul>})

There are a number of ways to navigate to that li tag that contains the data you want. The most obvious is this:

soup.find_all('li', :limit => 2)[1]
# => <li>The data you want
# => </li>

It should be equally obvious that that's not a very stable way to get that li tag. If you're only scraping this page once it doesn't matter, but if you're going to scrape it many times over a long period, such considerations become important. If the irrelevant list grows another li tag, you'll get that tag instead of the one you want, and your script will break or give the wrong data.

soup.find_all('ul', :limit => 2)[1].li
# => <li>The data you want
# => </li>

This is a little better, in that it can survive changes to the irrelevant list, but if the document grows another irrelevant list at the top, you'll get the first li tag of that list instead of the one you want. A more reliable way of referring to the ul tag you want would better reflect that tag's place in the structure of the document.

When you look at that HTML, you think of the list you want as 'the ul tag beneath the h1 tag'. The problem is that the tag isn't contained inside the h1 tag; it just comes after it. It's easy enough to get the h1 tag, but there's no way to get to the ul tag from there using find_all.

That's because the methods covered so far go down the parse tree, using the children or recursive_children iterators. The logical relationship to use in this document is the sibling relationship between the h1 tag and the ul tag that comes after it on the same level. This relationship is captured in Rubyful Soup's find_next_sibling method:

soup.h1.find_next_sibling('ul').li
# => <li>The data you want
# => </li>
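
A sideways search like this one can be pictured as "look through the later siblings and take the first match". Here's a sketch of that idea in plain Ruby. SideNode and next_siblings are hypothetical names, the toy tree holds tags only, and the real method accepts more arguments than the single name used here.

```ruby
# Toy node (hypothetical) showing a sideways search through later siblings.
class SideNode
  attr_reader :name, :contents
  attr_accessor :parent

  def initialize(name, contents = [])
    @name = name
    @parent = nil
    @contents = contents
    contents.each { |child| child.parent = self }
  end

  # Every node after this one on the same level, in document order.
  def next_siblings
    return [] unless parent
    here = parent.contents.index { |child| child.equal?(self) }
    parent.contents[(here + 1)..-1]
  end

  # First later sibling with the given name, or nil if there is none.
  def find_next_sibling(wanted)
    next_siblings.find { |node| node.name == wanted }
  end
end

unrelated = SideNode.new('ul', [SideNode.new('li')])
heading   = SideNode.new('h1')
intro     = SideNode.new('p')
wanted    = SideNode.new('ul', [SideNode.new('li')])
SideNode.new('body', [unrelated, heading, intro, wanted])

p heading.find_next_sibling('ul').equal?(wanted)   # true
p wanted.find_next_sibling('ul')                   # nil
```

Because the search starts at the heading, adding more lists above the heading can't change which list it finds.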

As mentioned above, Rubyful Soup provides five iterators besides children and recursive_children. Each of these iterators has two methods associated with it, corresponding to find_all and find. One method gets all matching objects found through the traversal (subject to a user-specified :limit), and one does the same but has a built-in :limit of 1.

Methods that go down the parse tree imply a starting point that has children, so only Tag objects (and the parser objects themselves) have the search methods discussed so far. All search methods discussed in this section are available on Tag, NavigableString, and parser objects, because they move laterally or upwards through the parse tree.

In each of these pairs of methods, the first method accepts args :attrs, :text, :limit. The second method accepts args :attrs and :text (the :limit is always 1). The signature of each of these methods is (name=nil, args={}, &block).

Printing out the parse tree

If you need to make changes to the parse tree and print it back out, or just look at how Rubyful Soup decided to parse some bad HTML, you have a couple of options for turning the parse tree back into a string.

to_s

The parser objects, as well as each Tag and NavigableString object, can be printed out as strings. This string will have no whitespace other than that present in the original text, and all tags will either be self-closing or have corresponding closing tags inserted into what Rubyful Soup guesses is the right place. One useful thing you can do with this is clean up HTML into something approaching XHTML.

BeautifulSoup.new("<b>foo<b>bar<br><i>baz</b>").to_s
#=> "<b>foo</b><b>bar<br /><i>baz</i></b>"

to_str

All Tag objects, as well as the parser objects, implement to_str. This means you can pass them into most methods that expect Strings as input.

prettify

The prettify method turns the parse tree (or a portion of it) into a pretty-printed string. This is just like the regular string you'd get with puts soup, except it uses whitespace to show the structure of the parse tree. Every tag will start a new line, and a tag's children will be indented one more level than its parent.

Remember from earlier examples that Rubyful Soup turned this:

<html>
<head><title>Page title</title></head>
<body>
<p id="firstpara" align="center">This is paragraph <b>one</b>.

<p id="secondpara" align="blah">This is paragraph <b>two</b>.
</html>

into this:

 <html>
  <head>

   <title>Page title
   </title>
  </head>
  <body>
   <p id="firstpara" align="center">This is paragraph
    <b>one
    </b>.
   </p>

   <p id="secondpara" align="blah">This is paragraph
    <b>two
    </b>.
   </p>
  </body>
 </html>
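
The indentation rule itself is a short recursion: print a node at its depth, then print its children one level deeper. Here's a sketch of that rule in plain Ruby. The pretty helper and the nested-Array tree are hypothetical, and for simplicity it puts text on its own line, so its layout is not identical to the prettify output shown above.

```ruby
# Toy pretty-printer (hypothetical). A tag is [name, child, ...]; a String is text.
def pretty(node, depth = 0)
  indent = ' ' * depth
  return "#{indent}#{node}\n" if node.is_a?(String)
  name, *children = node
  body = children.map { |child| pretty(child, depth + 1) }.join
  "#{indent}<#{name}>\n#{body}#{indent}</#{name}>\n"
end

puts pretty(['html', ['head', ['title', 'Page title']], ['body', ['p', 'Hello']]])
```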

Choosing a parser

Rubyful Soup provides three classes that implement different parsing strategies. You'll need to choose the right one depending on your task. For most tasks you'll be able to use BeautifulSoup, but sometimes one of the other classes might make things easier for you.

BeautifulSoup

The most popular Rubyful Soup class, this class parses HTML as seen in the real world. It contains heuristics about common HTML usage and mis-usage.

Raw:

<i>This <span title="a">is<br> some <html>invalid</htl %> HTML. 
<sarcasm>It's so great!</sarcasm>

Parsed with BeautifulSoup:

 <i>This 
  <span title="a">is
   <br /> some 
   <html>invalid HTML. 
    <sarcasm>It's so great!
    </sarcasm>
   </html>

  </span>
 </i>

BeautifulStoneSoup

This class parses any XML-like language. It contains no special language- or schema-specific heuristics. If you want to define a set of self-closing tags for your XML schema, you'll need to subclass this class.

Raw:

<foo key1="value1">This is some <bar>invalid</baz> XML.

Parsed with BeautifulStoneSoup:

 <foo key1="value1">This is some 
  <bar>invalid XML.
  </bar>
 </foo>

BeautifulSOAP

This is a convenience subclass of BeautifulStoneSoup which makes it easier to deal with XML documents (like SOAP messages) that put data in tiny sub-elements when it would be more convenient to put them in attributes of the parent element.

Raw:

<foo><bar>baz</bar></foo>

Parsed with BeautifulSOAP:

 <foo bar="baz">
  <bar>baz
  </bar>
 </foo>
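
The transformation in that example can be sketched as "copy each sub-element that holds a single string into an attribute of its parent". The code below is my own plain-Ruby illustration of the idea, not BeautifulSOAP's implementation. SoapElement and promote_simple_children are hypothetical names, and leaving an existing attribute untouched is an assumption on my part.

```ruby
# Toy element (hypothetical): name, attribute Hash, and child list.
SoapElement = Struct.new(:name, :attrs, :children)

# Copy each child that holds exactly one string into an attribute of the
# parent, unless the parent already has an attribute by that name.
def promote_simple_children(element)
  element.children.each do |child|
    next unless child.is_a?(SoapElement)
    only = child.children
    if only.size == 1 && only.first.is_a?(String) && !element.attrs.key?(child.name)
      element.attrs[child.name] = only.first
    end
    promote_simple_children(child)
  end
  element
end

foo = SoapElement.new('foo', {}, [SoapElement.new('bar', {}, ['baz'])])
promote_simple_children(foo)

p foo.attrs['bar']                 # "baz"
p foo.children.first.children      # ["baz"] (the sub-element is kept as well)
```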


This document is part of Crummy, the webspace of Leonard Richardson. It was last modified on Wednesday, July 12 2006, 11:51:42 Nowhere Standard Time and last built on Thursday, July 24 2014, 02:00:04 Nowhere Standard Time.

Crummy is © 1996-2014 Leonard Richardson. Unless otherwise noted, all text licensed under a Creative Commons License.
