Fixed a small but annoying bug that caused BS to crash when presented with HTML that contained boolean attributes.
A hybrid version that supports Python 2.4 and can be automatically converted to run under Python 3.0. There are three backwards-incompatible changes you should be aware of, but no new features or deliberate behavior changes.
The effect of this is that you can't pass an encoding to .__str__ anymore. Use encode() to get a string and decode() to get Unicode, and you'll be ready (well, readier) for Python 3.
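As a rough illustration of the encode()/decode() split, here is the same round trip in plain Python 3. It uses only the built-in string methods, not Beautiful Soup objects, and the sample text is made up.

```python
# Plain-Python illustration; no Beautiful Soup objects are involved.
text = "sacr\xe9 bleu"              # a Unicode string (str in Python 3)

encoded = text.encode("utf-8")      # encode() yields a byte string
decoded = encoded.decode("utf-8")   # decode() yields Unicode again

print(type(encoded).__name__, type(decoded).__name__)
```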
<a href="foo</a>, </a><a href="bar">baz</a> <a b="<a>">', '<a b="<a>"></a><a>"></a>
A later version of Beautiful Soup will allow you to plug in different parsers to make tradeoffs between speed and the ability to handle bad HTML.
<a href="http://crummy.com?sacré&bleu">
In Python 3, the é is always converted to \xe9 during parsing.
convertEntities, or XML/HTML entities might stick around
that aren't valid in HTML/XML). The result may not validate, but it
should be good enough to not choke a real-world XML
parser. Specifically, the output of a properly constructed soup object
should always be valid as part of an XML document, but parts
may be missing if they were missing in the original. As always, if the
input is valid XML, the output will also be valid.
fetch,
find, findText, etc.) for backwards
compatibility purposes. Those names are deprecated and if I ever do a
4.0 I will remove them. I will, I tell you!
findAll method wasn't passing
along any keyword arguments.
soup('a', class='foo') because class is a
Python keyword.
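The keyword clash is easy to reproduce without Beautiful Soup. In the sketch below, `search` is a hypothetical stand-in for a search method that accepts attribute filters as keyword arguments; it only shows why `class='foo'` fails and how a dictionary gets around it.

```python
# Hypothetical stand-in for a search method; it is not Beautiful Soup's code.
def search(name, attrs=None, **kwargs):
    filters = dict(attrs or {})
    filters.update(kwargs)
    return name, filters

# Writing class='foo' directly does not even compile.
try:
    compile("search('a', class='foo')", "<example>", "eval")
    keyword_ok = True
except SyntaxError:
    keyword_ok = False

# A dictionary sidesteps the reserved word.
via_attrs = search("a", attrs={"class": "foo"})
via_unpack = search("a", **{"class": "foo"})
```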
SoupStrainer tells it not to parse that
tag, Beautiful Soup will no longer try to rewrite the meta tag to
mention the new encoding. Basically, this makes
SoupStrainers work in real-world applications instead of
crashing the parser.
extract, replaceWith, and
insert. [Doc
reference. See also Improving
Memory Usage with extract]
Passing True as an attribute value gives you tags
that have any value for that attribute. You don't have to
create a regular expression. Passing None for an
attribute value gives you tags that don't have that attribute at all.
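The True and None rules can be sketched as a small matcher over attribute dictionaries. The function `attr_matches` and the sample data are hypothetical, and this is not how Beautiful Soup implements matching internally.

```python
import re

def attr_matches(tag_attrs, name, wanted):
    """Hypothetical matcher mirroring the True/None rules described above."""
    if wanted is None:
        return name not in tag_attrs          # attribute must be absent
    if name not in tag_attrs:
        return False
    if wanted is True:
        return True                           # any value is acceptable
    if hasattr(wanted, "search"):             # compiled regular expression
        return wanted.search(tag_attrs[name]) is not None
    return tag_attrs[name] == wanted          # plain string comparison

tags = [{"href": "a.html"}, {"name": "top"}, {"href": ""}]
with_href = [t for t in tags if attr_matches(t, "href", True)]
without_href = [t for t in tags if attr_matches(t, "href", None)]
html_links = [t for t in tags if attr_matches(t, "href", re.compile(r"\.html$"))]
```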
selfClosingTags: you don't have to subclass anymore. [Doc reference]
MinimalSoup, which has
most of BeautifulSoup's HTML-specific rules, but no tag
nesting rules. [Doc
reference]
SoupStrainer to tell Beautiful Soup to
parse only part of a document. This saves time and memory, often
making Beautiful Soup about as fast as a custom-built
SGMLParser subclass. [Doc
reference, SoupStrainer reference]
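The idea of keeping only the parts of a document you care about can be sketched with the standard library's html.parser. This is not SoupStrainer: the class name `LinkCollector` is hypothetical, and the parser still scans the whole input. It only avoids building anything for tags it was not asked about.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Hypothetical example: record 'a' tags and ignore everything else."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "a":
            self.links.append(dict(attrs))

collector = LinkCollector()
collector.feed('<p>Intro</p><a href="bar">baz</a><b>bold</b>')
collector.close()
```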
soup(args={"id" : "5"}) with
soup(id="5"). You can still use args if (for
instance) you need to find an attribute whose name clashes with the
name of an argument to findAll. [Doc reference: **kwargs attrs]
find methods and
fetch methods, there are only find methods.
Instead of a scheme where you can't remember which method finds one
element and which one finds them all, we have find and
findAll. In general, if the method name mentions
All or a plural noun (e.g. findNextSiblings),
then it finds many elements. Otherwise, it only finds one
element. [Doc
reference]
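The one-versus-many split can be sketched over a plain list. The names `find_all` and `find`, the predicate, and the `limit` parameter here are hypothetical; the sketch assumes that finding one element amounts to finding many with a limit of 1.

```python
def find_all(items, predicate, limit=None):
    """Return every matching item, stopping early once limit is reached."""
    results = []
    for item in items:
        if predicate(item):
            results.append(item)
            if limit is not None and len(results) >= limit:
                break
    return results

def find(items, predicate):
    """Return the first matching item, or None when nothing matches."""
    found = find_all(items, predicate, limit=1)
    return found[0] if found else None

def is_even(n):
    return n % 2 == 0

many = find_all(range(10), is_even)
one = find(range(10), is_even)
missing = find([1, 3, 5], is_even)
```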
avoidParserProblems is now
parserMassage.
feed
method. Pass a string or a filehandle into the soup
constructor instead of calling feed after the soup has been
created. There is still a feed method, but it's the
feed method implemented by SGMLParser and
calling it will bypass Beautiful Soup and cause problems.
NavigableText class has been renamed to
NavigableString. There is no
NavigableUnicodeString anymore, because every string
inside a Beautiful Soup parse tree is a Unicode string.
findText and fetchText are gone. Just
pass a text argument into find or
findAll.
Null was more trouble than it was worth, so I got rid
of it. Anything that used to return Null now returns
None.
NavigableString subclasses, instead of being treated as
oddly-formed data. If you parse a document that contains CDATA and
write it back out, the CDATA will still be there. [Doc reference]
BeautifulStoneSoup
which was causing parsing to be incredibly slow.
nextSibling) over and over again, looking for
Tag and NavigableText objects that match
certain criteria. The new methods are findNext,
fetchNext, findPrevious,
fetchPrevious, findNextSibling,
fetchNextSiblings, findPreviousSibling,
fetchPreviousSiblings, findParent, and
fetchParents. All of these use the same basic code used
by first and fetch, so you can pass your
weird ways of matching things into these methods.
fetch method and its derivatives now accept a
limit argument.
Tag
object as though it were a method.
done() method, which closes all of the
parser's open tags. It gets called automatically when you pass in some
text to the constructor of a parser class; otherwise you must call it
yourself.
string member of a NavigableText object
returns the NavigableText object instead of throwing an error.
tag.table.td.
tag.hidden doesn't spawn an attempt to find a tag named
'hidden'.
Beautiful Soup version 1 was very useful but also pretty stupid. I originally wrote it without noticing any of the problems inherent in trying to build a parse tree out of ambiguous HTML tags. This version solves all of those problems to my satisfaction. It also adds many new clever things to make up for the removal of the stupid things.
Tag.prettify() method.
str() on a Tag always returns a string, and
unicode() always returns Unicode. Previously it was
inconsistent.
first() or fetch() call, the tag
name or the desired value of an attribute can now be any of the
following:
This is much easier to use than SQL-style wildcards (see, regular expressions are good for something). Because of this, I took out SQL-style wildcards. I'll put them back if someone complains, but their removal simplifies the code a lot.
fetch() and first() to
search for text in the parse tree, not just tags. There are new alias
methods fetchText() and firstText() designed for this purpose. As with
searching for tags, you can pass in a string, a regular expression
object, or a method to match your text.
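Accepting a string, a regular expression object, or a callable as the match criterion can be sketched like this. The function `matches` and the sample strings are hypothetical, not taken from Beautiful Soup's source.

```python
import re

def matches(criterion, value):
    """Hypothetical dispatcher for the three kinds of criteria."""
    if callable(criterion):
        return bool(criterion(value))               # function or method
    if hasattr(criterion, "search"):
        return criterion.search(value) is not None  # compiled regex
    return criterion == value                       # exact string

texts = ["Foo", "Bar", "Foobar"]
exact = [t for t in texts if matches("Foo", t)]
by_regex = [t for t in texts if matches(re.compile("^Foo"), t)]
by_function = [t for t in texts if matches(lambda s: len(s) == 3, t)]
```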
attrs
argument of fetch() or first(), Beautiful
Soup will assume you want to match that thing against the "class"
attribute. When you're scraping well-structured HTML, this makes your
code a lot cleaner.
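One way to picture the "class" shorthand is as a normalization step on the attrs argument. The helper `normalize_attrs` is hypothetical, and it assumes that any value that is not a dictionary is meant for the "class" attribute.

```python
def normalize_attrs(attrs):
    """Hypothetical helper: turn the attrs argument into a dictionary."""
    if attrs is None:
        return {}
    if isinstance(attrs, dict):
        return dict(attrs)
    # Assumption: any non-dictionary value is meant for the "class" attribute.
    return {"class": attrs}

shorthand = normalize_attrs("headline")
explicit = normalize_attrs({"class": "headline"})
```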
fetch(). For instance, foo("bar") is a shorthand for
foo.fetch("bar"). In 2.x, you can also access a specially-named member
of a Tag object as a shorthand for first(). For instance,
foo.barTag is a shorthand for
foo.first("bar"). By chaining these shortcuts you can
traverse a tree in very little code: for header in
soup.bodyTag.pTag.tableTag('th'):
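Both shortcuts map onto ordinary Python hooks: `__call__` for the fetch shorthand and `__getattr__` for the specially-named members. The `Node` class below is a toy stand-in written for this sketch, not Beautiful Soup's Tag.

```python
class Node:
    """Toy tree node; a hypothetical stand-in, not Beautiful Soup's Tag."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def fetch(self, name):
        # Collect all descendants with the given name, depth first.
        found = []
        for child in self.children:
            if child.name == name:
                found.append(child)
            found.extend(child.fetch(name))
        return found

    def first(self, name):
        found = self.fetch(name)
        return found[0] if found else None

    def __call__(self, name):
        # node("th") is shorthand for node.fetch("th").
        return self.fetch(name)

    def __getattr__(self, attr):
        # Only reached when normal attribute lookup fails.
        if attr.endswith("Tag") and not attr.startswith("__"):
            return self.first(attr[:-3])
        raise AttributeError(attr)

table = Node("table", [Node("th"), Node("th")])
soup = Node("html", [Node("body", [Node("p", [table])])])
headers = soup.bodyTag.pTag.tableTag("th")
```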
first()
will also return Null if you ask it for a nonexistent tag. Null is an
object that's just like None, except you can do whatever you want to
it and it'll give you Null instead of throwing an error.
This lets you do tree traversals like
soup.htmlTag.headTag.titleTag without having to worry if
the intermediate stages are actually there. Previously, if there was
no 'head' tag in the document, headTag in that instance would have
been None, and accessing its 'titleTag' member would have thrown an
AttributeError. Now, you can get what you want when it exists, and get
Null when it doesn't, without having to do a lot of conditionals
checking to see if every stage is None.
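A Null object along these lines can be sketched in a few lines of Python 3. The class name `NullType` and the exact set of operations it absorbs are assumptions made for this sketch, not the original implementation.

```python
class NullType:
    """Hypothetical Null object: absorbs operations and returns itself."""

    def __getattr__(self, name):
        return self             # Null.anything is Null

    def __call__(self, *args, **kwargs):
        return self             # Null(...) is Null

    def __getitem__(self, key):
        return self             # Null[...] is Null

    def __bool__(self):
        return False            # behaves like None in conditionals

    def __repr__(self):
        return "Null"

Null = NullType()

# A chained lookup never raises AttributeError once it hits Null.
title = Null.headTag.titleTag
```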
<p><ul><li>Foo<br /><li>Bar</ul>
The first 'li' tag has a previousSibling of Null and its nextSibling is the second 'li' tag. The second 'li' tag has a nextSibling of Null and its previousSibling is the first 'li' tag. The previousSibling of the 'ul' tag is the first 'p' tag. The nextSibling of 'Foo' is the 'br' tag.
There are three changes in 2.0 that break old code:
This is the release to get if you want Python 1.5 compatibility.
This is much easier to use than SQL-style wildcards (see, regular expressions are good for something). Because of this, I no longer recommend you use SQL-style wildcards. They may go away in a future release to clean up the code.
Initial release.
This document (source) is part of Crummy, the webspace of Leonard Richardson (contact information). It was last modified on Tuesday, January 06 2009, 21:18:23 Nowhere Standard Time and last built on Saturday, February 04 2012, 02:00:07 Nowhere Standard Time.