Here's a Ruby session that demonstrates the basic features of Rubyful Soup.
require 'RubyfulSoup'

#Create the soup
input = %{<html><head><title>Page title</title></head><body>
<p id="firstpara" align="center">This is paragraph <b>one</b>.
<p id="secondpara" align="blah">This is paragraph <b>two</b>.
</html>}
soup = BeautifulSoup.new(input)

#Search the soup
titleTag = soup.html.head.title
#=> <title>Page title</title>

titleTag.string
#=> "Page title"

soup.find_all('p').size
#=> 2

soup.find_all { |tag| tag['align'] == "blah" }
#=> [<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]

soup.find_all { |element| element.respond_to? :index and element.index('This is') == 0 }
#=> ["This is paragraph ", "This is paragraph "]

soup.find('p', :attrs => {'align' => 'center'})
#=> <p id="firstpara" align="center">This is paragraph <b>one</b>.
#=> </p>

soup.find_all('p', :attrs => {'align' => 'center'})[0]['id']
#=> "firstpara"

soup.find('p', :attrs => {'align' => /^b.*/})['id']
#=> "secondpara"

soup.find('p').b.string
#=> "one"

soup.find_all('p')[1].b.string
#=> "two"

#Modify the soup
titleTag['id'] = 'theTitle'
titleTag.contents = ['New title.']
soup.html.head.title
#=> <title id="theTitle">New title.</title>
When you feed a markup document into one of Rubyful Soup's parser classes, Rubyful Soup transforms the markup into a parse tree: a set of linked objects representing the structure of the document.
The parser object is the root of the parse tree. Below it are Tag objects and NavigableString objects. A Tag represents an SGML tag, as well as anything and everything encountered between that tag and its closing. A NavigableString object represents a chunk of ASCII or Unicode text. You can treat it just like a string, but it also has the navigation members so you can get to other parts of the parse tree from it.
For concreteness, here's a visual representation of the parse tree for the example HTML I introduced in the "Quick Start" section. I got this representation by calling soup.prettify, and I'll use it throughout this section to illustrate the navigation. In this representation, a Tag that's underneath another tag in the parse tree is displayed with one more level of indentation than its parent.
<html>
 <head>
  <title>Page title
  </title>
 </head>
 <body>
  <p id="firstpara" align="center">This is paragraph
   <b>one
   </b>.
  </p>
  <p id="secondpara" align="blah">This is paragraph
   <b>two
   </b>.
  </p>
 </body>
</html>
This is saying: we've got an html Tag which contains a head Tag and a body Tag. The head Tag contains a title Tag, which contains a NavigableString object that says "Page title". The body Tag contains two p Tags, and so on.
All Tag objects have all of the members listed below (though the actual value of the member may be nil). NavigableString objects have all of them except for contents and string.
parent
In the example above, the parent of the "head" Tag is the "html" Tag. The parent of the "html" Tag is the BeautifulSoup parser object itself. The parent of the parser object is nil. By following parent you can move up the parse tree.
soup.head.parent.name # => "html"
soup.head.parent.parent.class # => BeautifulSoup
soup.head.parent.parent.parent # => nil
contents
With parent you move up the parse tree; with contents you move down it. This is a list of Tag and NavigableString objects contained within a tag. Only the top-level parser object and Tag objects have contents; NavigableString objects don't have 'em.
In the example above, the contents of the first "p" Tag is a list containing a NavigableString ("This is paragraph "), a "b" Tag, and another NavigableString (".\n"). The contents of the "b" Tag is a list containing a NavigableString ("one").
soup.p.contents # => ["This is paragraph ", <b>one</b>, ".\n"]
soup.p.contents[1].contents # => ["one"]
string
For your convenience, if a tag has only one child node, and that child node is a string, the child node is made available as tag.string as well as tag.contents[0]. In the example above, soup.b.string is a NavigableString representing the string "one". That's the string contained in the first "b" Tag in the parse tree. soup.p.string is nil, because the first "p" Tag in the parse tree has more than one child. soup.head.string is also nil, even though the "head" Tag has only one child, because that child is a Tag (the "title" Tag), not a string.
soup.b.string # => "one"
soup.p.string # => nil
soup.head.string # => nil
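The rule behind string can be sketched in a few lines of plain Ruby. MiniTag below is a hypothetical stand-in that models only a name and a contents Array; it is not Rubyful Soup's Tag class, and it uses ordinary Strings where the library would use NavigableString objects.

```ruby
# Hypothetical stand-in for a Tag: just a name and an Array of children.
MiniTag = Struct.new(:name, :contents) do
  # Returns the only child if it is a String, otherwise nil.
  def string
    only = contents.size == 1 ? contents.first : nil
    only.is_a?(String) ? only : nil
  end
end

b     = MiniTag.new('b', ['one'])
p_tag = MiniTag.new('p', ['This is paragraph ', b, ".\n"])
title = MiniTag.new('title', ['Page title'])
head  = MiniTag.new('head', [title])

b.string      # => "one"
p_tag.string  # => nil (more than one child)
head.string   # => nil (the only child is a tag, not a string)
```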
next_sibling and previous_sibling
These members let you skip to the next or previous thing on the same level of the parse tree. For instance, the next_sibling of the "head" Tag is the "body" Tag, because the "body" Tag is the next thing directly beneath the "html" Tag. The next_sibling of the "body" Tag is nil, because there's nothing else directly beneath the "html" Tag. Conversely, the previous_sibling of the "body" Tag is the "head" Tag, and the previous_sibling of the "head" Tag is nil.
soup.head.next_sibling.name # => "body"
soup.body.next_sibling # => nil
soup.body.previous_sibling.name # => "head"
soup.head.previous_sibling # => nil
Some more examples: the next_sibling of the first "p" Tag is the second "p" Tag. The previous_sibling of the "b" Tag inside the second "p" Tag is the NavigableString "This is paragraph ". The previous_sibling of that NavigableString is nil, not anything inside the first "p" Tag.
soup.p.next_sibling # => <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>
soup.find_all('p')[1].b.previous_sibling # => "This is paragraph "
soup.find_all('p')[1].b.previous_sibling.previous_sibling # => nil
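One way to picture the sibling members: a node's siblings are its neighbours in its parent's contents, so a node with no parent, or one sitting at either end of the list, gets nil. The SketchNode class below is a hypothetical illustration of that idea, not the library's implementation, and it leaves out text nodes for brevity.

```ruby
# Hypothetical node that knows its parent and its children.
class SketchNode
  attr_reader :label, :parent, :contents

  def initialize(label, children = [])
    @label = label
    @parent = nil
    @contents = children
    children.each { |child| child.attach(self) }
  end

  # Records which node this one sits beneath.
  def attach(parent)
    @parent = parent
  end

  # The node after this one in the parent's contents, or nil.
  def next_sibling
    return nil if parent.nil?
    i = parent.contents.index { |c| c.equal?(self) }
    parent.contents[i + 1]
  end

  # The node before this one in the parent's contents, or nil.
  def previous_sibling
    return nil if parent.nil?
    i = parent.contents.index { |c| c.equal?(self) }
    i.zero? ? nil : parent.contents[i - 1]
  end
end

head = SketchNode.new('head')
body = SketchNode.new('body')
html = SketchNode.new('html', [head, body])

head.next_sibling.label      # => "body"
body.next_sibling            # => nil
body.previous_sibling.label  # => "head"
head.previous_sibling        # => nil
html.next_sibling            # => nil (no parent, so no siblings)
```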
next_parsed and previous_parsed
These members let you move through the document elements in the order they were processed by the parser, rather than in the order they appear in the tree. For instance, the next_parsed of the "head" Tag is the "title" Tag, not the "body" Tag. This is because the "title" Tag comes immediately after the "head" Tag in the original document.
Where next_parsed and previous_parsed are concerned, a Tag's contents come before whatever is its next_sibling. You usually won't have to use these members, but sometimes it's the easiest way to get to something buried inside the parse tree.
soup.head.next_parsed # => <title>Page title</title>
soup.b.previous_parsed # => "This is paragraph "
soup.b.next_parsed # => "one"
Tag is an Enumerable
You can iterate over the contents of a tag by treating the Tag itself as an Enumerable. soup.body.each is the same as soup.body.contents.each. Both will iterate over the direct children of the first 'body' Tag found in the parse tree. Similarly, to see how many child nodes a Tag has, you can call tag.size instead of tag.contents.size. All the Enumerable methods will work on a Tag object, just as if you'd called them on that Tag's Array of contents.
soup.body.each { |x| puts x.name if x.is_a? Tag }
# => p
# => p
soup.body.contents.each { |x| puts x.name if x.is_a? Tag }
# => p
# => p
soup.body.size # => 3
soup.body.contents.size # => 3
soup.body.p.reject { |x| x.is_a? Tag } # => ["This is paragraph ", ".\n"]
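In Ruby, a class gets the whole Enumerable toolkit by including the module and defining each. The sketch below shows the pattern with a hypothetical EnumerableTag class that delegates each to its contents Array; it illustrates the general mechanism and makes no claim about how the library's Tag is actually written.

```ruby
# Hypothetical Tag-like class: Enumerable methods come from defining #each.
class EnumerableTag
  include Enumerable
  attr_reader :name, :contents

  def initialize(name, contents)
    @name = name
    @contents = contents
  end

  # Delegating #each to the contents Array is all Enumerable needs.
  def each(&block)
    contents.each(&block)
  end

  # Enumerable does not supply #size, so delegate it explicitly.
  def size
    contents.size
  end
end

bold = EnumerableTag.new('b', ['one'])
para = EnumerableTag.new('p', ['This is paragraph ', bold, ".\n"])

para.size                                   # => 3
para.reject { |x| x.is_a?(EnumerableTag) }  # => ["This is paragraph ", ".\n"]
para.map { |x| x.is_a?(EnumerableTag) ? x.name : x }
# => ["This is paragraph ", "b", ".\n"]
```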
It's easy to navigate the parse tree by referencing the name of the tag you want as a member of the parser or a Tag object. We've been doing it throughout these examples. In general, calling tag.foo returns the first child of that tag (direct or recursive) that happens to be a "foo" Tag. If there aren't any "foo" Tags beneath a tag, its .foo member is nil.
You can use this to traverse the parse tree, writing code like soup.html.head.title to get the title of an HTML document.
soup.html.head.title
# => <title>Page title</title>
You can also use this to quickly jump to a certain part of a parse tree. For instance, if you're not worried about "title" Tags in weird places outside of the "head" Tag, you can just use soup.title to get an HTML document's title. soup.p jumps to the first "p" Tag inside a document, wherever it is. soup.table.tr.td jumps to the first column of the first row of the first table in the document.
soup.title
# => <title>Page title</title>
soup.p
# => <p id="firstpara" align="center">This is paragraph <b>one</b>.
# => </p>
These members actually alias to the Tag#find method, which is covered below in the section "Searching the Parse Tree". I mention it here because the alias makes it very easy to zoom in on an interesting part of a well-known parse tree.
soup.foo versus soup.foo_tag
An alternate form of this idiom lets you access the first 'foo' Tag as .foo_tag instead of .foo. For instance, soup.table.tr.td could also be expressed as soup.table_tag.tr_tag.td_tag, or even soup.table_tag.tr.td_tag. This is useful if you like to be more explicit about what you're doing, or if you're parsing XML whose tag names conflict with Rubyful Soup methods and members.
soup.title_tag # => <title>Page title</title>
Suppose you were parsing XML that contained tags called "parent" or "contents". soup.parent won't look for a tag called "parent"; it will look for the parent of the parser object (which is nil). Therefore you can't use that idiom to find the first "parent" tag in the parse tree. Instead, use soup.parent_tag.
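The name clash is easy to reproduce in plain Ruby. The sketch below assumes the shortcut is built on method_missing, which is a guess made for illustration only; NamedNode and first_named are hypothetical names. A real method such as parent is found by normal method lookup, so the name search never runs for it, while parent_tag still reaches the search.

```ruby
# Hypothetical sketch of name-based navigation; not Rubyful Soup's actual code.
class NamedNode
  attr_reader :name, :contents

  def initialize(name, contents = [])
    @name = name
    @contents = contents
  end

  # An ordinary method: normal lookup finds it before method_missing runs.
  def parent
    nil
  end

  # First descendant tag with the given name, searched in prefix order.
  def first_named(wanted)
    contents.each do |child|
      next unless child.is_a?(NamedNode)
      return child if child.name == wanted
      found = child.first_named(wanted)
      return found if found
    end
    nil
  end

  # node.foo and node.foo_tag both look for the first "foo" tag.
  def method_missing(method_name, *args)
    return super unless args.empty?
    first_named(method_name.to_s.sub(/_tag\z/, ''))
  end

  def respond_to_missing?(method_name, include_private = false)
    !first_named(method_name.to_s.sub(/_tag\z/, '')).nil? || super
  end
end

doc = NamedNode.new('doc', [NamedNode.new('item', [NamedNode.new('parent')])])

doc.item.name        # => "item"
doc.parent           # => nil (the real method, not a search for a "parent" tag)
doc.parent_tag.name  # => "parent"
```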
Tags
SGML tags can have attributes, and so can the Tag objects created by the parser. For instance, each of the "p" Tags in the example above has an "id" attribute and an "align" attribute. You can access a Tag's attributes by treating the Tag as though it were a Hash. soup.p['id'] retrieves the "id" attribute of the first "p" Tag.
soup.p['id'] # => "firstpara"
NavigableString objects don't have attributes; only Tag objects do.
Rubyful Soup exposes iterator functions that you can use to perform an iteration over the parse tree. Passing a code block to these functions is like repeatedly using the corresponding navigation member and calling the code block on each new result. The objects yielded by these methods will be both Tag and NavigableString objects.
children: This iterator works just like contents.each. It iterates over the direct children of a Tag or a parser object.
recursive_children: This iterator performs a prefix traversal over the entire parse tree (or, if you call it on a Tag object, the entire subtree). It gets every recursive child of the starting object.
next_parsed_items: This iterator moves through every subsequent element of the document in the order in which the elements were parsed. It's like reading the document, but instead of undifferentiated strings you're reading a stream of Tag and NavigableString objects.
previous_parsed_items: The same as next_parsed_items, but it moves in the opposite direction, towards the beginning of the document, in the reverse of the order in which the items were parsed.
next_siblings: This iterator works like multiple applications of the next_sibling member. It iterates over all subsequent Tags and NavigableStrings on the same level as the starting point.
previous_siblings: This iterator is the opposite of next_siblings; it works like multiple applications of the previous_sibling member. It iterates over all previous Tags and NavigableStrings on the same level as the starting point.
parents: This iterator moves up the parse tree. It's equivalent to calling the parent member until you reach the top of the tree. Usually this is a short trip compared to what you'd get by calling the other iterators.
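To make the traversal orders concrete, here is a rough sketch of three of these iterators on a hypothetical WalkNode class. The method names mirror the ones described above, but the bodies are assumptions written for illustration, and plain Strings stand in for NavigableString objects.

```ruby
# Hypothetical node used only to illustrate the iterators' traversal orders.
class WalkNode
  attr_reader :name, :contents
  attr_accessor :parent

  def initialize(name, contents = [])
    @name = name
    @contents = contents
    contents.each { |c| c.parent = self if c.is_a?(WalkNode) }
  end

  # Yields each direct child, like contents.each.
  def children(&block)
    contents.each(&block)
  end

  # Prefix traversal: each child, then that child's own subtree.
  def recursive_children(&block)
    contents.each do |child|
      block.call(child)
      child.recursive_children(&block) if child.is_a?(WalkNode)
    end
  end

  # Walks upward until the root's parent (nil) is reached.
  def parents
    node = parent
    while node
      yield node
      node = node.parent
    end
  end
end

# Shows a tag by name and a text node as itself.
def label(item)
  item.is_a?(WalkNode) ? item.name : item
end

bold = WalkNode.new('b', ['one'])
para = WalkNode.new('p', ['This is paragraph ', bold, '.'])
body = WalkNode.new('body', [para])

seen = []
body.recursive_children { |x| seen << label(x) }
seen  # => ["p", "This is paragraph ", "b", "one", "."]

ups = []
bold.parents { |x| ups << x.name }
ups   # => ["p", "body"]
```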
Rubyful Soup provides a number of methods for finding Tags and NavigableStrings that match criteria you specify. These methods are available only to Tag objects and to the top-level parser objects, not to NavigableString objects, which are always at the leaves of the parse tree. (The methods in the next section, "Searching Inside the Parse Tree", are available to both Tag and NavigableString objects.)
There are a lot of methods, and all of them have a great deal in common, so before I actually tell you the names of the methods I'm going to talk about their arguments.
All these methods, and the methods in the next section, take basically the same arguments: a name, zero or more from a set of four possible keyword args (passed using simulated keyword arguments), and an optional code block. All these arguments are used to narrow the search so you can get only the results you want.
:attrs is a Hash of attribute-value pairs. The attributes must be Strings, and the values must be search terms (see below). :attrs can also be a search term: this is the same as passing in a Hash that maps the "class" HTML attribute to the search term. This is a shortcut for searching for HTML tags belonging to certain CSS classes.
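:recursive determines whether the search covers every descendant or only the direct children (that is, whether it uses the recursive_children iterator or the children iterator). Of course, this argument is only supported by the methods that go down the parse tree, such as find_all.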
If you pass a code block to one of these methods, you are overriding the entire Rubyful Soup matching process with your own code. Your code block will accept a series of Tag and NavigableString objects and will need to return true or false depending on whether it thinks each one is a "match".
A code block passed into these methods is like a code block passed into Enumerable#reject: its return value is used to decide whether or not a Tag or NavigableString has been matched. It's not like the code block you pass into Enumerable#collect, where the return value is used as the actually returned result.
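The difference is easy to see on an ordinary Array. In this small sketch, NODES is made-up stand-in data (Strings for text nodes, Symbols for tags), not output from the library.

```ruby
# Stand-in data: Strings play the role of text nodes, Symbols the role of tags.
NODES = ['This is paragraph ', :b, ".\n"].freeze

# select and reject use the block's return value only to decide membership,
# which is how a search block's return value is treated.
NODES.select { |n| n.is_a?(String) }   # => ["This is paragraph ", ".\n"]
NODES.reject { |n| n.is_a?(String) }   # => [:b]

# collect hands back the block's return values themselves.
NODES.collect { |n| n.is_a?(String) }  # => [true, false, true]
```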
Rubyful Soup provides a very flexible matching system so you don't always have to write your own code block to do the matching. At just about any place mentioned above where Rubyful Soup accepts a "search term" (for instance, as the name of a Tag, or as a piece of text to match), you can pass in any of a number of objects:
A String. This will match only that exact string value.
For instance, if you wanted to get all of the "a" tags, you could call soup.find_all('a').
An Array. This will match only the string values present in the Array.
For instance, if you wanted to collect both "font" and "div" tags, you could call soup.find_all(['font', 'div']).
A Hash where the keys are the string values you will accept. The values of the Hash don't matter. This is just like the Array technique, but faster.
For instance, if you wanted to collect both "font" and "div" tags, you could call soup.find_all({'font' => nil, 'div' => nil}).
A Regexp. This will match any string value the regular expression matches.
For instance, if you wanted to get all tags whose names contained the letter "a", you could call soup.find_all(/a/).
A Proc object which takes a Tag object (or, if passed as the :text argument, a NavigableString object) and returns a boolean. This object will be called once for each Tag (or NavigableString) encountered, and if it returns true then the tag is considered to match.
For instance, if you wanted to get only tags whose 'id' attributes matched their names, you could call: soup.find_all(Proc.new { |x| x.name == x['id'] })
The main advantage of using a Proc object over just specifying a code block is that a code block has to handle both Tag and NavigableString objects. A Proc object only has to deal with one or the other (depending on where you pass it into the method).
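The kinds of search term described above can be summarized as a single dispatch on the term's class. The helper below, term_matches?, is a hypothetical sketch of that idea rather than the library's matching code; for simplicity it always tests the term against a plain String value, whereas a Proc search term would receive a Tag or NavigableString in the real library.

```ruby
# Hypothetical helper: decides whether a value satisfies a search term.
def term_matches?(term, value)
  case term
  when String then term == value
  when Array  then term.include?(value)
  when Hash   then term.key?(value) # only the keys matter
  when Regexp then !value.nil? && term.match?(value)
  when Proc   then term.call(value) ? true : false
  else false
  end
end

term_matches?('a', 'a')                              # => true
term_matches?(%w[font div], 'div')                   # => true
term_matches?({ 'font' => nil, 'div' => nil }, 'p')  # => false
term_matches?(/a/, 'table')                          # => true
term_matches?(proc { |v| v.size == 1 }, 'b')         # => true
```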
find_all(name=nil, args={}, &block)
The find_all() method uses the children or recursive_children iterator, and traverses the entire tree below the starting point (the parser, or the Tag on which you called find_all). On its travels it gathers all the Tags or NavigableStrings that match the criteria you gave it. Supported args: :attrs, :text, :limit, :recursive.
find(name=nil, args={}, &block)
This is the same as find_all(), but it has a built-in :limit of 1. Supported args: :attrs, :text, :recursive.
find_all_text(text=nil, args={}, &block)
This locates pieces of text that match the search term text. This is just like passing in a :text arg to find_all. Supported args: :limit, :recursive.
find_text(text=nil, args={}, &block)
This is the same as find_all_text, but it has a built-in :limit of 1. Supported args: :recursive.
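The relationship between the "all" methods and their single-result twins can be sketched as follows. FindNode, sketch_find_all and sketch_find are hypothetical names, the sketch matches on tag name only, and it assumes that :recursive defaults to true, which the text above does not state.

```ruby
# Hypothetical tag: a name plus child tags (text nodes are left out for brevity).
FindNode = Struct.new(:name, :kids)

# Collects descendants named `name`, honoring the :limit and :recursive args.
# Assumption: :recursive defaults to true.
def sketch_find_all(node, name, args = {}, results = [])
  limit     = args[:limit]
  recursive = args.fetch(:recursive, true)
  node.kids.each do |kid|
    break if limit && results.size >= limit
    results << kid if kid.name == name
    sketch_find_all(kid, name, args, results) if recursive
  end
  results
end

# Same search with a built-in :limit of 1; returns one node or nil.
def sketch_find(node, name, args = {})
  sketch_find_all(node, name, args.merge(:limit => 1)).first
end

list = FindNode.new('ul', [FindNode.new('li', []), FindNode.new('li', [])])
root = FindNode.new('root', [list, FindNode.new('li', [])])

sketch_find_all(root, 'li').size                       # => 3
sketch_find_all(root, 'li', :limit => 2).size          # => 2
sketch_find_all(root, 'li', :recursive => false).size  # => 1
sketch_find(root, 'ul').equal?(list)                   # => true
```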
You can do most Rubyful Soup operations with the four methods in the previous section. However, sometimes you can't use them to get directly to the Tag or NavigableString you want. For example, consider some HTML like this:
require 'RubyfulSoup'
soup = BeautifulSoup.new(%{<ul>
 <li>An unrelated list
</ul>

<h1>Heading</h1>
<p>This is <b>the list you want</b>:</p>
<ul>
 <li>The data you want
</ul>})
There are a number of ways to navigate to that li tag that contains the data you want. The most obvious is this:
soup.find_all('li', :limit => 2)[1]
# => <li>The data you want
# => </li>
It should be equally obvious that that's not a very stable way to get that li tag. If you're only scraping this page once it doesn't matter, but if you're going to scrape it many times over a long period, such considerations become important. If the irrelevant list grows another li tag, you'll get that tag instead of the one you want, and your script will break or give the wrong data.
soup.find_all('ul', :limit => 2)[1].li
# => <li>The data you want
# => </li>
This is a little better, in that it can survive changes to the irrelevant list, but if the document grows another irrelevant list at the top, you'll get the first li tag of that list instead of the one you want. A more reliable way of referring to the ul tag you want would better reflect that tag's place in the structure of the document.
When you look at that HTML, you think of the list you want as 'the ul tag beneath the h1 tag'. The problem is that the tag isn't contained inside the h1 tag; it just comes after it. It's easy enough to get the h1 tag, but there's no way to get to the ul tag from there using find_all.
That's because the methods covered so far go down the parse tree, using the children or recursive_children iterators. The logical relationship to use in this document is the sibling relationship between the h1 tag and the ul tag that comes after it on the same level. This relationship is captured in Rubyful Soup's find_next_sibling method:
soup.h1.find_next_sibling('ul').li
# => <li>The data you want
# => </li>
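To see why the sibling search is the more stable choice, picture the top level of the example document as a flat list of tag names. The sketch below uses a hypothetical helper, next_sibling_index, on such a list; it models the idea only and ignores the whitespace text nodes a real parse would contain.

```ruby
# Hypothetical top-level siblings of the example document, by tag name.
SIBLINGS = %w[ul h1 p ul].freeze

# Index of the first sibling named `name` after position `start`, or nil.
def next_sibling_index(siblings, start, name)
  ((start + 1)...siblings.size).find { |i| siblings[i] == name }
end

h1_at = SIBLINGS.index('h1')               # => 1
next_sibling_index(SIBLINGS, h1_at, 'ul')  # => 3

# Another unrelated list at the top shifts every index, but the answer is
# still "the first ul after the h1".
longer = %w[ul] + SIBLINGS
next_sibling_index(longer, longer.index('h1'), 'ul')  # => 4
```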
As mentioned above, Rubyful Soup provides five iterators besides children and recursive_children. Each of these iterators has two methods associated with it, corresponding to find_all and find. One method gets all matching objects found through the traversal (subject to a user-specified :limit), and one does the same but has a built-in :limit of 1.
Methods that go down the parse tree imply a starting point that has children, so only Tag objects (and the parser objects themselves) have the search methods discussed so far. All search methods discussed in this section are available on Tag, NavigableString, and parser objects, because they move laterally or upwards through the parse tree.
In each of these pairs of methods, the first method accepts args :attrs, :text, :limit. The second method accepts args :attrs and :text (the :limit is always 1). The signature of each of these methods is (name=nil, args={}, &block).
find_all_next and find_next
These methods traverse the tree from their starting point using the next_parsed_items iterator, looking for matches.
find_all_previous and find_previous
These methods traverse the tree from their starting point using the previous_parsed_items iterator, looking for matches.
find_next_siblings and find_next_sibling
These methods traverse the tree from their starting point using the next_siblings iterator, looking for matches.
find_previous_siblings and find_previous_sibling
These methods traverse the tree from their starting point using the previous_siblings iterator, looking for matches.
find_parents and find_parent
These methods traverse the tree from their starting point using the parents iterator, looking for matches.
If you need to make changes to the parse tree and print it back out, or just look at how Rubyful Soup decided to parse some bad HTML, you have a couple of options for turning the parse tree back into a string.
to_s
The parser objects, as well as each Tag and NavigableString object, can be printed out as strings. This string will have no whitespace other than that present in the original text, and all tags will either be self-closing or have corresponding closing tags inserted into what Rubyful Soup guesses is the right place. One useful thing you can do with this is clean up HTML into something approaching XHTML.
BeautifulSoup.new("<b>foo<b>bar<br><i>baz</b>").to_s #=> "<b>foo</b><b>bar<br /><i>baz</i></b>"
to_str
All Tag objects, as well as the parser objects, implement to_str. This means you can pass them into most methods that expect Strings as input.
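This works because Ruby treats to_str as an implicit conversion: core methods that insist on a String will call it on their argument. The sketch below shows the general mechanism with a hypothetical StringyTag class; it is not the library's Tag.

```ruby
# Hypothetical tag-like class that opts in to implicit String conversion.
class StringyTag
  def initialize(markup)
    @markup = markup
  end

  def to_s
    @markup
  end

  # Defining to_str tells Ruby this object may stand in for a String.
  def to_str
    to_s
  end
end

tag = StringyTag.new('<b>one</b>')

# String#+ requires a String argument; to_str makes the tag acceptable.
'Bold: ' + tag  # => "Bold: <b>one</b>"
```

An object that defines only to_s would raise a TypeError in the same expression.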
prettify
The prettify method turns the parse tree (or a portion of it) into a pretty-printed string. This is just like the regular string you'd get with puts soup, except it uses whitespace to show the structure of the parse tree. Every tag will start a new line, and a tag's children will be indented one more level than the tag itself.
Remember from earlier examples that Rubyful Soup turned this:
<html>
<head><title>Page title</title></head>
<body>
<p id="firstpara" align="center">This is paragraph <b>one</b>.
<p id="secondpara" align="blah">This is paragraph <b>two</b>.
</html>
into this:
<html>
 <head>
  <title>Page title
  </title>
 </head>
 <body>
  <p id="firstpara" align="center">This is paragraph
   <b>one
   </b>.
  </p>
  <p id="secondpara" align="blah">This is paragraph
   <b>two
   </b>.
  </p>
 </body>
</html>
Rubyful Soup provides four classes that implement different parsing strategies. You'll need to choose the right one depending on your task. For most tasks you'll be able to use BeautifulSoup, but sometimes one of the other classes might make things easier for you.
BeautifulSoup
Raw | Parsed with BeautifulSoup |
---|---|
<i>This <span title="a">is<br> some <html>invalid</htl %> HTML. <sarcasm>It's so great!</sarcasm> |
<i>This <span title="a">is <br /> some <html>invalid HTML. <sarcasm>It's so great! </sarcasm> </html> </span> </i> |
BeautifulStoneSoup
This class parses any XML-like language. It contains no special language- or schema-specific heuristics. If you want to define a set of self-closing tags for your XML schema, you'll need to subclass this class.
Raw | Parsed with BeautifulStoneSoup |
---|---|
<foo key1="value1">This is some <bar>invalid</baz> XML. |
<foo key1="value1">This is some <bar>invalid XML. </bar> </foo> |
BeautifulSOAP
This is a convenience subclass of BeautifulStoneSoup which makes it easier to deal with XML documents (like SOAP messages) that put data in tiny sub-elements when it would be more convenient to put them in attributes of the parent element.
Raw | Parsed with BeautifulSOAP |
---|---|
<foo><bar>baz</bar></foo> |
<foo bar="baz"> <bar>baz </bar> </foo> |
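The transformation in the table amounts to copying each child that holds a single string into an attribute of its parent, while leaving the child in place. Here is a rough sketch of that step on a hypothetical SoapNode struct; promote_text_children is an illustrative name, not a Rubyful Soup method.

```ruby
# Hypothetical element: a name, an attribute Hash, and an Array of children.
SoapNode = Struct.new(:name, :attrs, :kids)

# Copies each child element that holds exactly one String into the
# parent's attributes, keyed by the child's name. The child stays put.
def promote_text_children(node)
  node.kids.each do |kid|
    next unless kid.is_a?(SoapNode)
    only = kid.kids.size == 1 ? kid.kids.first : nil
    node.attrs[kid.name] = only if only.is_a?(String)
  end
  node
end

bar = SoapNode.new('bar', {}, ['baz'])
foo = SoapNode.new('foo', {}, [bar])

promote_text_children(foo)
foo.attrs['bar']  # => "baz"
foo.kids.size     # => 1 (the <bar> child is still there)
```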
This document (source) is part of Crummy, the webspace of Leonard Richardson (contact information). It was last modified on Wednesday, July 12 2006, 11:51:42 Nowhere Standard Time and last built on Friday, March 31 2023, 04:00:01 Nowhere Standard Time.