Rubyful Soup

Note: Rubyful Soup is no longer being maintained. I recommend you use hpricot instead.

Rubyful Soup is a Ruby port of the hit Python HTML/XML parser Beautiful Soup. It's designed to be a useful quick-and-dirty parser for screen-scraping, along the same lines as its parent:

  1. Rubyful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and then run away.
  2. Rubyful Soup provides a few simple methods and Ruby-like idioms for navigating and searching a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application. It's more flexible and easier to learn than XPath.

Download

The current version has the same functionality and robustness as Beautiful Soup, plus some things I haven't yet put into Beautiful Soup (see below for a list of differences). It's packaged as the rubyful_soup gem on rubyforge.org. You can install it with the gem install command, or download it manually. If you download it manually, be sure to install the htmltools gem first; that gem supplies the parser on which Rubyful Soup depends.

Documentation

Get it here.

Differences from Beautiful Soup

Good differences

  • In Rubyful Soup you can ignore all the arguments to the fetch methods, and pass in a block instead. This lets you completely customize the behavior of the fetch methods. The block will be called on every Tag and NavigableText object encountered by the iterator. The block should return true when it encounters a match, and false otherwise. The block may abort the iteration by throwing :stop_iteration.

    For instance, these two pieces of code are equivalent:

    soup.find_all('a')
    soup.find_all {|x| x.is_a? Tag and x.name == 'a' }

    As are these:

    soup.find_text('Hello')
    soup.find_text {|x| x == 'Hello' }
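The block protocol described above can be sketched without the gem. This is only an illustration of the idea, not Rubyful Soup's implementation: Node, each_item and find_all_matching are hypothetical names invented for this example, and plain Ruby strings stand in for NavigableText.

```ruby
# Hypothetical stand-in for a tag; NOT Rubyful Soup's Tag class.
Node = Struct.new(:name, :children)

# Walk the tree depth-first, handing every node and string to the block.
def each_item(item, &block)
  block.call(item)
  return unless item.is_a?(Node)
  item.children.each { |child| each_item(child, &block) }
end

# Collect every item for which the matcher block returns true.
# The matcher may throw :stop_iteration to end the search early;
# whatever was collected up to that point is still returned.
def find_all_matching(root, &matcher)
  matches = []
  catch(:stop_iteration) do
    each_item(root) { |item| matches << item if matcher.call(item) }
  end
  matches
end

doc = Node.new('body', [
  Node.new('a', ['Hello']),
  Node.new('p', ['world', Node.new('a', ['again'])]),
  Node.new('a', ['never reached'])
])

# Match tags by name, in the spirit of soup.find_all('a').
links = find_all_matching(doc) { |x| x.is_a?(Node) && x.name == 'a' }

# Match text, in the spirit of soup.find_text('Hello').
hellos = find_all_matching(doc) { |x| x == 'Hello' }

# Stop after two links by throwing :stop_iteration from the block.
seen = 0
first_two = find_all_matching(doc) do |x|
  next false unless x.is_a?(Node) && x.name == 'a'
  throw :stop_iteration if seen == 2
  seen += 1
  true
end
```

Because the block sees every item, any test that can be written in Ruby can serve as a matcher, and throw gives it a way out of the traversal without a special return value.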

Bad differences

  • Rubyful Soup is slower than Beautiful Soup. As of 1.0.4, however, it should never be unusably slow.
  • The method and member names are slightly different, out of respect for Ruby's naming conventions and reserved words.
  • You can't call a Tag as though it were a method. Ruby doesn't seem to support this. Not a big deal.
  • Most of the features of Beautiful Soup 3.0 are not yet present in Rubyful Soup. Some (the encoding autodetection) may not be present for a long time.


This document is part of Crummy, the webspace of Leonard Richardson (contact information). It was last modified on Sunday, August 24 2008, 14:50:22 Nowhere Standard Time and last built on Friday, June 02 2023, 21:00:01 Nowhere Standard Time.

Crummy is © 1996-2023 Leonard Richardson. Unless otherwise noted, all text licensed under a Creative Commons License.
