.. _manual:

Beautiful Soup Documentation
============================

.. image:: 6.1.jpg
   :align: right
   :alt: "The Fish-Footman began by producing from under his arm a great letter, nearly as large as himself."

Beautiful Soup is a Python library for pulling data out of HTML and XML
files. It works with your favorite parser to provide idiomatic ways of
navigating, searching, and modifying the parse tree. It commonly saves
programmers hours or days of work.

These instructions illustrate all major features of Beautiful Soup 4,
with examples. I show you what the library is good for, how it works,
how to use it, how to make it do what you want, and what to do when it
violates your expectations.

This document covers Beautiful Soup version 4.9.1. The examples in
this documentation should work the same way in Python 2.7 and Python
3.2.

You might be looking for the documentation for Beautiful Soup 3.
If so, you should know that Beautiful Soup 3 is no longer being
developed and that support for it will be dropped on or after December
31, 2020. If you want to learn about the differences between Beautiful
Soup 3 and Beautiful Soup 4, see `Porting code to BS4`_.

This documentation has been translated into other languages by
Beautiful Soup users:

* 这篇文档当然还有中文版.
* このページは日本語で利用できます(外部リンク)
* 이 문서는 한국어 번역도 가능합니다.
* Este documento também está disponível em Português do Brasil.
* Эта документация доступна на русском языке.

Getting help
------------

If you have questions about Beautiful Soup, or run into problems,
send mail to the discussion group. If your problem involves parsing an
HTML document, be sure to mention what the ``diagnose()`` function says
about that document.

Quick Start
===========

Here's an HTML document I'll be using as an example throughout this
document. It's part of a story from `Alice in Wonderland`::

 html_doc = """<html><head><title>The Dormouse's story</title></head>
 <body>
 <p class="title"><b>The Dormouse's story</b></p>

 <p class="story">Once upon a time there were three little sisters; and their names were
 <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>

 <p class="story">...</p>
 """

Running the "three sisters" document through Beautiful Soup gives us
a ``BeautifulSoup`` object, which represents the document as a nested
data structure::

 from bs4 import BeautifulSoup
 soup = BeautifulSoup(html_doc, 'html.parser')

 print(soup.prettify())
 # <html>
 #  <head>
 #   <title>
 #    The Dormouse's story
 #   </title>
 #  </head>
 #  <body>
 #   <p class="title">
 #    <b>
 #     The Dormouse's story
 #    </b>
 #   </p>
 #   <p class="story">
 #    Once upon a time there were three little sisters; and their names were
 #    <a class="sister" href="http://example.com/elsie" id="link1">
 #     Elsie
 #    </a>
 #    ,
 #    <a class="sister" href="http://example.com/lacie" id="link2">
 #     Lacie
 #    </a>
 #    and
 #    <a class="sister" href="http://example.com/tillie" id="link3">
 #     Tillie
 #    </a>
 #    ; and they lived at the bottom of a well.
 #   </p>
 #   <p class="story">
 #    ...
 #   </p>
 #  </body>
 # </html>

Here are some simple ways to navigate that data structure::

 soup.title
 # <title>The Dormouse's story</title>

 soup.title.name
 # u'title'

 soup.title.string
 # u'The Dormouse's story'

 soup.title.parent.name
 # u'head'

 soup.p
 # <p class="title"><b>The Dormouse's story</b></p>

 soup.p['class']
 # [u'title']

 soup.a
 # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

 soup.find_all('a')
 # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 #  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 #  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

 soup.find(id="link3")
 # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

One common task is extracting all the URLs found within a page's
``<a>`` tags::

 for link in soup.find_all('a'):
     print(link.get('href'))
 # http://example.com/elsie
 # http://example.com/lacie
 # http://example.com/tillie

Another common task is extracting all the text from a page::

 print(soup.get_text())
 # The Dormouse's story
 #
 # The Dormouse's story
 #
 # Once upon a time there were three little sisters; and their names were
 # Elsie,
 # Lacie and
 # Tillie;
 # and they lived at the bottom of a well.
 #
 # ...

Does this look like what you need? If so, read on.

Installing Beautiful Soup
=========================

If you're using a recent version of Debian or Ubuntu Linux, you can
install Beautiful Soup with the system package manager:

:kbd:`$ apt-get install python-bs4` (for Python 2)

:kbd:`$ apt-get install python3-bs4` (for Python 3)

Beautiful Soup 4 is published through PyPI, so if you can't install it
with the system packager, you can install it with ``easy_install`` or
``pip``. The package name is ``beautifulsoup4``, and the same package
works on Python 2 and Python 3. Make sure you use the right version of
``pip`` or ``easy_install`` for your Python version (these may be named
``pip3`` and ``easy_install3`` respectively if you're using Python 3).

:kbd:`$ easy_install beautifulsoup4`

:kbd:`$ pip install beautifulsoup4`

(The ``BeautifulSoup`` package is probably `not` what you want. That's
the previous major release, Beautiful Soup 3. Lots of software uses
BS3, so it's still available, but if you're writing new code you
should install ``beautifulsoup4``.)

If you don't have ``easy_install`` or ``pip`` installed, you can
download the Beautiful Soup 4 source tarball and install it with
``setup.py``.

:kbd:`$ python setup.py install`

If all else fails, the license for Beautiful Soup allows you to
package the entire library with your application.
You can download the tarball, copy its ``bs4`` directory into your
application's codebase, and use Beautiful Soup without installing it at
all.

I use Python 2.7 and Python 3.2 to develop Beautiful Soup, but it
should work with other recent versions.

Problems after installation
---------------------------

Beautiful Soup is packaged as Python 2 code. When you install it for
use with Python 3, it's automatically converted to Python 3 code. If
you don't install the package, the code won't be converted. There have
also been reports on Windows machines of the wrong version being
installed.

If you get the ``ImportError`` "No module named HTMLParser", your
problem is that you're running the Python 2 version of the code under
Python 3.

If you get the ``ImportError`` "No module named html.parser", your
problem is that you're running the Python 3 version of the code under
Python 2.

In both cases, your best bet is to completely remove the Beautiful
Soup installation from your system (including any directory created
when you unzipped the tarball) and try the installation again.

If you get the ``SyntaxError`` "Invalid syntax" on the line
``ROOT_TAG_NAME = u'[document]'``, you need to convert the Python 2
code to Python 3. You can do this either by installing the package:

:kbd:`$ python3 setup.py install`

or by manually running Python's ``2to3`` conversion script on the
``bs4`` directory:

:kbd:`$ 2to3-3.2 -w bs4`

.. _parser-installation:

Installing a parser
-------------------

Beautiful Soup supports the HTML parser included in Python's standard
library, but it also supports a number of third-party Python parsers.
One is the lxml parser. Depending on your setup, you might install
lxml with one of these commands:

:kbd:`$ apt-get install python-lxml`

:kbd:`$ easy_install lxml`

:kbd:`$ pip install lxml`

Another alternative is the pure-Python html5lib parser, which parses
HTML the way a web browser does.
Depending on your setup, you might install html5lib with one of these
commands:

:kbd:`$ apt-get install python-html5lib`

:kbd:`$ easy_install html5lib`

:kbd:`$ pip install html5lib`

This table summarizes the advantages and disadvantages of each parser library:

+----------------------+--------------------------------------------+--------------------------------+--------------------------+
| Parser               | Typical usage                              | Advantages                     | Disadvantages            |
+----------------------+--------------------------------------------+--------------------------------+--------------------------+
| Python's html.parser | ``BeautifulSoup(markup, "html.parser")``   | * Batteries included           | * Not as fast as lxml,   |
|                      |                                            | * Decent speed                 |   less lenient than      |
|                      |                                            | * Lenient (As of Python 2.7.3  |   html5lib.              |
|                      |                                            |   and 3.2.)                    |                          |
+----------------------+--------------------------------------------+--------------------------------+--------------------------+
| lxml's HTML parser   | ``BeautifulSoup(markup, "lxml")``          | * Very fast                    | * External C dependency  |
|                      |                                            | * Lenient                      |                          |
+----------------------+--------------------------------------------+--------------------------------+--------------------------+
| lxml's XML parser    | ``BeautifulSoup(markup, "lxml-xml")``      | * Very fast                    | * External C dependency  |
|                      | ``BeautifulSoup(markup, "xml")``           | * The only currently supported |                          |
|                      |                                            |   XML parser                   |                          |
+----------------------+--------------------------------------------+--------------------------------+--------------------------+
| html5lib             | ``BeautifulSoup(markup, "html5lib")``      | * Extremely lenient            | * Very slow              |
|                      |                                            | * Parses pages the same way a  | * External Python        |
|                      |                                            |   web browser does             |   dependency             |
|                      |                                            | * Creates valid HTML5          |                          |
+----------------------+--------------------------------------------+--------------------------------+--------------------------+

If you can, I recommend you install and use lxml for speed.
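
The choice among the parsers in the table can also be made at run
time. Here's a minimal sketch for Python 3, assuming you prefer lxml,
then html5lib, and fall back to the built-in parser when neither is
installed. The helper name ``pick_parser`` is hypothetical and is not
part of Beautiful Soup; it only checks which parser packages can be
imported::

 from importlib.util import find_spec

 def pick_parser(preferred=("lxml", "html5lib")):
     """Return the first name in `preferred` whose package is importable,
     or "html.parser" (always available) if none of them is."""
     for name in preferred:
         # find_spec() returns None when a top-level package is missing.
         if find_spec(name) is not None:
             return name
     return "html.parser"

 # Prints "lxml", "html5lib", or "html.parser", depending on your setup.
 print(pick_parser())

You would then pass the result as the second argument, as in
``BeautifulSoup(markup, pick_parser())``. Keep in mind that different
parsers can build different trees from an invalid document, so a
fallback like this trades consistency for convenience.
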
If you're using a very old version of Python (earlier than 2.7.3 or
3.2.2), it's `essential` that you install lxml or html5lib. Python's
built-in HTML parser is just not very good in those old versions.

Note that if a document is invalid, different parsers will generate
different Beautiful Soup trees for it. See `Differences
between parsers`_ for details.

Making the soup
===============

To parse a document, pass it into the ``BeautifulSoup``
constructor. You can pass in a string or an open filehandle::

 from bs4 import BeautifulSoup

 with open("index.html") as fp:
     soup = BeautifulSoup(fp)

 soup = BeautifulSoup("a web page")

First, the document is converted to Unicode, and HTML entities are
converted to Unicode characters::

 print(BeautifulSoup("<html><head></head><body>Sacr&eacute; bleu!</body></html>"))
 # <html><head></head><body>Sacré bleu!</body></html>

Beautiful Soup then parses the document using the best available
parser. It will use an HTML parser unless you specifically tell it to
use an XML parser. (See `Parsing XML`_.)

Kinds of objects
================

Beautiful Soup transforms a complex HTML document into a complex tree
of Python objects. But you'll only ever have to deal with about four
`kinds` of objects: ``Tag``, ``NavigableString``, ``BeautifulSoup``,
and ``Comment``.

.. _Tag:

``Tag``
-------

A ``Tag`` object corresponds to an XML or HTML tag in the original document::

 soup = BeautifulSoup('<b id="boldest">Extremely bold</b>')
 tag = soup.b
 type(tag)
 # <class 'bs4.element.Tag'>

Tags have a lot of attributes and methods, and I'll cover most of them
in `Navigating the tree`_ and `Searching the tree`_. For now, the most
important features of a tag are its name and attributes.

Name
^^^^

Every tag has a name, accessible as ``.name``::

 tag.name
 # u'b'

If you change a tag's name, the change will be reflected in any HTML
markup generated by Beautiful Soup::

 tag.name = "blockquote"
 tag
 # <blockquote id="boldest">Extremely bold</blockquote>

Attributes
^^^^^^^^^^

A tag may have any number of attributes. The tag ``<b
id="boldest">`` has an attribute "id" whose value is
"boldest". You can access a tag's attributes by treating the tag like
a dictionary::

 tag['id']
 # u'boldest'

You can access that dictionary directly as ``.attrs``::

 tag.attrs
 # {u'id': 'boldest'}

You can add, remove, and modify a tag's attributes. Again, this is
done by treating the tag as a dictionary::

 tag['id'] = 'verybold'
 tag['another-attribute'] = 1
 tag
 # <blockquote another-attribute="1" id="verybold">Extremely bold</blockquote>

 del tag['id']
 del tag['another-attribute']
 tag
 # <blockquote>Extremely bold</blockquote>

 tag['id']
 # KeyError: 'id'
 print(tag.get('id'))
 # None

.. _multivalue:

Multi-valued attributes
&&&&&&&&&&&&&&&&&&&&&&&

HTML 4 defines a few attributes that can have multiple values. HTML 5
removes a couple of them, but defines a few more. The most common
multi-valued attribute is ``class`` (that is, a tag can have more than
one CSS class). Others include ``rel``, ``rev``, ``accept-charset``,
``headers``, and ``accesskey``. Beautiful Soup presents the value(s)
of a multi-valued attribute as a list::

 css_soup = BeautifulSoup('<p class="body"></p>')
 css_soup.p['class']
 # ["body"]

 css_soup = BeautifulSoup('<p class="body strikeout"></p>')
 css_soup.p['class']
 # ["body", "strikeout"]

If an attribute `looks` like it has more than one value, but it's not
a multi-valued attribute as defined by any version of the HTML
standard, Beautiful Soup will leave the attribute alone::

 id_soup = BeautifulSoup('<p id="my id"></p>')
 id_soup.p['id']
 # 'my id'

When you turn a tag back into a string, multiple attribute values are
consolidated::

 rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>')
 rel_soup.a['rel']
 # ['index']
 rel_soup.a['rel'] = ['index', 'contents']
 print(rel_soup.p)
 # <p>Back to the <a rel="index contents">homepage</a></p>

You can disable this by passing ``multi_valued_attributes=None`` as a
keyword argument into the ``BeautifulSoup`` constructor::

 no_list_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html', multi_valued_attributes=None)
 no_list_soup.p['class']
 # u'body strikeout'

You can use ``get_attribute_list`` to get a value that's always a
list, whether or not it's a multi-valued attribute::

 id_soup.p.get_attribute_list('id')
 # ["my id"]

If you parse a document as XML, there are no multi-valued attributes::

 xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
 xml_soup.p['class']
 # u'body strikeout'

Again, you can configure this using the ``multi_valued_attributes`` argument::

 class_is_multi = {'*': 'class'}
 xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml', multi_valued_attributes=class_is_multi)
 xml_soup.p['class']
 # [u'body', u'strikeout']

You probably won't need to do this, but if you do, use the defaults as
a guide. They implement the rules described in the HTML specification::

 from bs4.builder import builder_registry
 builder_registry.lookup('html').DEFAULT_CDATA_LIST_ATTRIBUTES

``NavigableString``
-------------------

A string corresponds to a bit of text within a tag. Beautiful Soup
uses the ``NavigableString`` class to contain these bits of text::

 tag.string
 # u'Extremely bold'
 type(tag.string)
 # <class 'bs4.element.NavigableString'>

A ``NavigableString`` is just like a Python Unicode string, except
that it also supports some of the features described in `Navigating
the tree`_ and `Searching the tree`_. You can convert a
``NavigableString`` to a Unicode string with ``unicode()`` (``str()``
in Python 3)::

 unicode_string = unicode(tag.string)
 unicode_string
 # u'Extremely bold'
 type(unicode_string)
 # <type 'unicode'>

You can't edit a string in place, but you can replace one string with
another, using :ref:`replace_with()`::

 tag.string.replace_with("No longer bold")
 tag
 # <blockquote>No longer bold</blockquote>

``NavigableString`` supports most of the features described in
`Navigating the tree`_ and `Searching the tree`_, but not all of
them. In particular, since a string can't contain anything (the way a
tag may contain a string or another tag), strings don't support the
``.contents`` or ``.string`` attributes, or the ``find()`` method.

If you want to use a ``NavigableString`` outside of Beautiful Soup,
you should call ``unicode()`` on it to turn it into a normal Python
Unicode string. If you don't, your string will carry around a
reference to the entire Beautiful Soup parse tree, even when you're
done using Beautiful Soup. This is a big waste of memory.

``BeautifulSoup``
-----------------

The ``BeautifulSoup`` object represents the parsed document as a
whole. For most purposes, you can treat it as a :ref:`Tag`
object. This means it supports most of the methods described in
`Navigating the tree`_ and `Searching the tree`_.

You can also pass a ``BeautifulSoup`` object into one of the methods
defined in `Modifying the tree`_, just as you would a :ref:`Tag`. This
lets you do things like combine two parsed documents::

 doc = BeautifulSoup("<document><content/>INSERT FOOTER HERE</document>", "xml")
 footer = BeautifulSoup("<footer>Here's the footer</footer>", "xml")
 doc.find(text="INSERT FOOTER HERE").replace_with(footer)
 # u'INSERT FOOTER HERE'
 print(doc)
 # <?xml version="1.0" encoding="utf-8"?>
 # <document><content/><footer>Here's the footer</footer></document>

Since the ``BeautifulSoup`` object doesn't correspond to an actual
HTML or XML tag, it has no name and no attributes. But sometimes it's
useful to look at its ``.name``, so it's been given the special
``.name`` "[document]"::

 soup.name
 # u'[document]'

Comments and other special strings
----------------------------------

``Tag``, ``NavigableString``, and ``BeautifulSoup`` cover almost
everything you'll see in an HTML or XML file, but there are a few
leftover bits. The main one you'll probably encounter
is the comment::

 markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
 soup = BeautifulSoup(markup)
 comment = soup.b.string
 type(comment)
 # <class 'bs4.element.Comment'>

The ``Comment`` object is just a special type of ``NavigableString``::

 comment
 # u'Hey, buddy. Want to buy a used parser?'

But when it appears as part of an HTML document, a ``Comment`` is
displayed with special formatting::

 print(soup.b.prettify())
 # <b>
 #  <!--Hey, buddy. Want to buy a used parser?-->
 # </b>

Beautiful Soup also defines classes called ``Stylesheet``, ``Script``,
and ``TemplateString``, for embedded CSS stylesheets (any strings found inside a ``