Sun Feb 02 2025 14:34 Beautiful Soup 4.13.0:
After a beta period lasting nearly a year, I've released the biggest update to Beautiful Soup in many years. For version 4.13.0 I added type hints to the Python code, and in doing so uncovered a large number of very small inconsistencies in the code. I've fixed the inconsistencies, but the result is a larger-than-usual number of deprecations and changes that may break backwards compatibility.
The CHANGELOG for 4.13.0 is quite large so I'm writing this blog post to highlight just the most important changes, specifically the changes most likely to make you need (or want) to change your code.
Deprecations and backwards-incompatible changes
- DeprecationWarning is issued on use for every deprecated method, attribute and class from the 3.0 and 2.0 major versions of Beautiful Soup. These have been deprecated for at least ten years, but they didn't issue DeprecationWarning when you tried to use them. Now they do, and they're all going away soon.
- This version drops support for Python 3.6, which went EOL in December 2021. The minimum supported major Python version for Beautiful Soup is now Python 3.7, which went EOL in June 2023.
- The storage for a tag's attribute values now modifies incoming values
to be consistent with the HTML or XML spec. This means that if you set an
attribute value to a number, it will be converted to a string
immediately, rather than being converted when you output the document.
More importantly for backwards compatibility, setting an HTML
attribute value to True will set the attribute's value to the
appropriate string per the HTML spec. Setting an attribute value to
False or None will remove the attribute value from the tag
altogether, rather than (effectively, as before) setting the value
to the string "False" or the string "None".
This means that some programs that modify documents will generate
different output than they would in earlier versions of Beautiful Soup,
but the new documents are more likely to represent the intent behind the
modifications.
To give a specific example, if you have code that looks something like this:
checkbox1['checked'] = True
checkbox2['checked'] = False
Then a document that used to look like this (with most browsers
treating both boxes as checked):
<input type="checkbox" checked="True"/>
<input type="checkbox" checked="False"/>
Will now look like this (with browsers treating only the first box
as checked):
<input type="checkbox" checked="checked"/>
<input type="checkbox"/>
You can get the old behavior back by instantiating a TreeBuilder
with attribute_dict_class=dict, or you can customize how Beautiful Soup
treats attribute values by passing in a custom subclass of dict.
- If you pass an empty list as the attribute value when searching the
tree, you will now find all tags which have that attribute set to a value in
the empty list--that is, you will find nothing. This is consistent with other
situations where a list of acceptable values is provided. Previously, an
empty list was treated the same as None and False, and you would have
found the tags which did not have that attribute set at all.
- When using one of the find() methods or creating a SoupStrainer,
if you specify the same attribute value in attrs and the
keyword arguments, you'll end up with two different ways to match that
attribute. Previously the value in keyword arguments would override the
value in attrs.
- The 'html5' formatter is now much less aggressive about escaping
ampersands, escaping only the ampersands considered "ambiguous" by the HTML5
spec (which is almost none of them). This is the sort of change that
might break your unit test suite, but the resulting markup will be much more
readable and more HTML5-ish.
To quickly get the old behavior back, change code like this:
tag.encode(formatter='html5')
to this:
tag.encode(formatter='html5-4.12')
In the future, the 'html5' formatter may become the default HTML
formatter, which will change Beautiful Soup's default output. This
will break a lot of test suites so it's not going to happen for a
while.
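To make the new attribute rules concrete, here is a minimal sketch of a dict subclass that mimics them using only the standard library. NormalizingAttributes is a hypothetical name, and this is not Beautiful Soup's own implementation. In particular, mapping True to the attribute's own name is an assumption based on the checked="checked" example above.

```python
class NormalizingAttributes(dict):
    """Hypothetical stand-in that mimics the rules described above.

    Only item assignment is handled; dict(...) and update() bypass
    __setitem__, so a fuller version would need to cover those too.
    """

    def __setitem__(self, key, value):
        if value is False or value is None:
            # False/None: drop the attribute instead of storing "False"/"None".
            self.pop(key, None)
        elif value is True:
            # Assumption: a true boolean attribute is stored under its own name.
            super().__setitem__(key, key)
        elif isinstance(value, (int, float)):
            # Numbers become strings immediately, not at output time.
            super().__setitem__(key, str(value))
        else:
            super().__setitem__(key, value)


attrs = NormalizingAttributes()
attrs["type"] = "checkbox"
attrs["checked"] = True
attrs["tabindex"] = 3
checked_state = dict(attrs)
print(checked_state)  # {'type': 'checkbox', 'checked': 'checked', 'tabindex': '3'}

attrs["checked"] = False
print(attrs)  # {'type': 'checkbox', 'tabindex': '3'}
```

The True/False checks come before the number check because bool is a subclass of int in Python; the other way round, True would be stored as the string "True", which is exactly the old behavior.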
New features
- The online documentation now includes full API documentation generated from Python docstrings.
- The new ElementFilter class encapsulates Beautiful Soup's rules
about matching elements and deciding which parts of a document to
parse. This gives you direct access to Beautiful Soup's low-level matching API. See the documentation for details.
- The new PageElement.filter() method provides a fully general way of
finding elements in a Beautiful Soup parse tree. You can specify a
function to iterate over the tree and an ElementFilter to determine
what matches.
- The NavigableString class now has a .string property which returns the
string itself. This makes it easier to iterate over a mixed list
of Tag and NavigableString objects.
- Defined a new warning class, UnusualUsageWarning, which is a superclass
for all of the warnings issued when Beautiful Soup notices something
unusual but not guaranteed to be wrong, like markup that looks like
a URL (MarkupResemblesLocatorWarning) or XML being run through an HTML
parser (XMLParsedAsHTMLWarning).
The text of these warnings has been revamped to explain in more
detail what is going on, how to check if you've made a mistake,
and how to make the warning go away if you are acting deliberately.
If these warnings are interfering with your workflow, or simply
annoying you, you can filter all of them by filtering
UnusualUsageWarning, without worrying about losing the warnings
Beautiful Soup issues when there *definitely* is a problem you
need to correct, such as use of a deprecated method.
- Emit an UnusualUsageWarning if the user tries to search for an attribute
called _class; they probably mean class_.
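As a rough sketch of the filtering described above, the standard warnings module can silence the whole UnusualUsageWarning family while still letting DeprecationWarning through. This assumes the class can be imported from the top-level bs4 package; the fallback class is only a stand-in so the sketch also runs where 4.13.0 isn't installed, and the warning messages are made up for illustration.

```python
import warnings

try:
    # Assumption: 4.13.0 exposes the class at the top level of the package.
    from bs4 import UnusualUsageWarning
except ImportError:
    # Stand-in so the sketch still runs without Beautiful Soup 4.13.0.
    class UnusualUsageWarning(UserWarning):
        pass


def surviving_warnings():
    """Return the category names of warnings that get past the filter."""
    with warnings.catch_warnings(record=True) as caught:
        # Show everything by default...
        warnings.simplefilter("always")
        # ...except the "unusual but maybe fine" family. Filters added later
        # are checked first, so this one takes precedence.
        warnings.filterwarnings("ignore", category=UnusualUsageWarning)

        # Simulated warnings; the messages are illustrative only.
        warnings.warn("markup looks like a URL", UnusualUsageWarning)
        warnings.warn("this method is deprecated", DeprecationWarning)
    return [w.category.__name__ for w in caught]


surviving = surviving_warnings()
print(surviving)  # ['DeprecationWarning']
```

In real code you would probably call warnings.filterwarnings once at startup instead of inside a catch_warnings block; the block is used here only to keep the demonstration from changing global state.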
Mon Feb 03 2025 12:05 January Film Roundup:
- Wallace & Gromit: Vengeance Most Fowl (2024): A fun movie, but I wish they'd come up with a brand new villain. Instead they brought back Feathers McGraw, who's enjoyable, but I think this trend of creative people mining their past triumphs is getting a little worrisome. If we must mine the past for Wallace & Gromit villains, why not bring back the stop-motion monstrosities from the Peter Gabriel "Sledgehammer" music video? Just throwing out ideas here.
- Thirty-Day Princess (1934): With a title like that you'd expect this to be a Lifetime original from 2015, but with a screenplay credit for Preston Sturges you'd expect it to be better than that, and it is. A Prince and the Pauper-type screwball farce with a pretty long automat scene. Cary Grant hasn't got The Voice down yet. It's all right.
- The Last Married Couple in America (1980): George Segal and Natalie Wood are always fun but this was more successful as a time capsule than as a comedy. Also, I have to point out the shadow hypothesis that this movie never dares to mention: Jeff and Mari may actually be causing their friends' marriages to break up, draining their compatibility energy to keep their own relationship afloat. I think this is a really strong possibility, given that at the end of the movie they're able to destroy another couple's marriage just by spending five minutes alone in a room with them. The power of their love vampirism can't be overstated.
- Inspector Ike (2020): A "loving parody", as they say, of 70s detective shows, which is also a decent Columbo-style mystery. I really enjoyed this because... well, it's a goofy version of Columbo. Nuff said. Can't wait for Poker Face season 2? Check this out. I do wish Aparna Nancherla had a bigger part, but it feels like an "I'll do one day of filming as a favor" kind of part.
- One, Two, Three (1961): Rewatch with James. Still a hilarious movie, an all-time great at making me laugh. Almost 20 years after we first saw this film, Scarlett's "Well, bye!" is still a common reference in our household.
- It Happened Tomorrow (1944): The strange sequel to It Happened One Night. A fun light fantasy story, with the fantasy element being slightly more fantastic than It's A Wonderful Life. I thought the most fun part of this movie was its illustration of how an artifact from the future swiftly ceases to be interesting or valuable once your timeline catches up with the time of its production. It's not from the future anymore!
- Batman Returns (1992): I wasn't expecting this, but this is way better than the 1989 Batman, and it's completely down to Danny DeVito and Michelle Pfeiffer. It's a movie all about the villains, with Batman himself being almost absent. Batman (1989) was also all about the villain, but with an annoying amount of time dedicated to Batman himself.
The Batman Returns screenplay doesn't really need Batman at all, though the marketing surely does. This could easily be the story of Catwoman and Penguin attempting to team up and then destroying one another. When he is on screen, Michael Keaton has the same blasé attitude towards the part which he'd display (I have to admit) to greater effect in Birdman (2014).
I didn't know Christopher Walken was in this, but he Walkens his way through this movie and it's a treat. One of the most Tim Burton-compatible actors. Speaking of Burton, I've seen enough of his movies now that certain aesthetic choices are repeating. I guess I'd classify him as a wackier version of David Lynch? He really puts his id on the screen, but not in a way that's terribly hard to figure out. Spirals and stripes; the guy loves spirals and stripes.
And I just realized this while writing this Film Roundup: the scene in Wallace & Gromit: Vengeance Most Fowl where they're tracking the gnomes through the sewers is a clear reference to the part of Batman Returns where Batman is tracking the penguins through the city streets. The UI is the same and really, the penguins in this movie should've used the sewers. Good call, Feathers McGraw.
Unless otherwise noted, all content licensed by Leonard Richardson under a Creative Commons License.