bs4 package

Module contents

Beautiful Soup Elixir and Tonic - "The Screen-Scraper's Friend".

http://www.crummy.com/software/BeautifulSoup/

Beautiful Soup uses a pluggable XML or HTML parser to parse a (possibly invalid) document into a tree representation. Beautiful Soup provides methods and Pythonic idioms that make it easy to navigate, search, and modify the parse tree.

Beautiful Soup works with Python 3.7 and up. It works better if lxml and/or html5lib is installed, but they are not required.

For more than you ever wanted to know about Beautiful Soup, see the documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
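As a quick orientation, here is a minimal sketch of typical usage (the markup and variable names are invented for illustration; the stdlib "html.parser" is used so no extra libraries are required):

```python
from bs4 import BeautifulSoup

# Parse a small document with the stdlib parser; pass "lxml" or
# "html5lib" instead if those libraries are installed.
soup = BeautifulSoup(
    "<html><body><p class='title'>Hello</p><a href='/x'>link</a></body></html>",
    "html.parser",
)

print(soup.p.get_text())   # text of the first <p>
print(soup.a["href"])      # attribute access on the first <a>
```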

exception bs4.AttributeResemblesVariableWarning

Bases: UnusualUsageWarning, SyntaxWarning

The warning issued when Beautiful Soup suspects a provided attribute name may actually be the misspelled name of a Beautiful Soup variable. Generally speaking, this is only used in cases like "_class" where it's very unlikely the user would be referencing an XML attribute with that name.

MESSAGE: str = '%(original)r is an unusual attribute name and is a common misspelling for %(autocorrect)r.\n\nIf you meant %(autocorrect)r, change your code to use it, and this warning will go away.\n\nIf you really did mean to check the %(original)r attribute, this warning is spurious and can be filtered. To make it go away, run this code before creating your BeautifulSoup object:\n\n    from bs4 import AttributeResemblesVariableWarning\n    import warnings\n\n    warnings.filterwarnings("ignore", category=AttributeResemblesVariableWarning)\n'
class bs4.BeautifulSoup(markup: str | bytes | IO[str] | IO[bytes] = '', features: str | Sequence[str] | None = None, builder: TreeBuilder | Type[TreeBuilder] | None = None, parse_only: SoupStrainer | None = None, from_encoding: str | None = None, exclude_encodings: Iterable[str] | None = None, element_classes: Dict[Type[PageElement], Type[PageElement]] | None = None, **kwargs: Any)

Bases: Tag

A data structure representing a parsed HTML or XML document.

Most of the methods you'll call on a BeautifulSoup object are inherited from PageElement or Tag.

Internally, this class defines the basic interface called by the tree builders when converting an HTML/XML document into a data structure. The interface abstracts away the differences between parsers. To write a new tree builder, you'll need to understand these methods as a whole.

These methods will be called by the BeautifulSoup constructor:
  • reset()

  • feed(markup)

The tree builder may call these methods from its feed() implementation:
  • handle_starttag(name, attrs) # See note about return value

  • handle_endtag(name)

  • handle_data(data) # Appends to the current data node

  • endData(containerClass) # Ends the current data node

No matter how complicated the underlying parser is, you should be able to build a tree using 'start tag' events, 'end tag' events, 'data' events, and "done with data" events.

If you encounter an empty-element tag (aka a self-closing tag, like HTML's <br> tag), call handle_starttag and then handle_endtag.

ASCII_SPACES: str = ' \n\t\x0c\r'

A string containing all ASCII whitespace characters, used during parsing to detect data chunks that seem 'empty'.

DEFAULT_BUILDER_FEATURES: Sequence[str] = ['html', 'fast']

If the end-user gives no indication which tree builder they want, look for one with these features.

ROOT_TAG_NAME: str = '[document]'

Since BeautifulSoup subclasses Tag, it's possible to treat it as a Tag with a Tag.name. However, this name makes it clear that the BeautifulSoup object isn't a real markup tag.

contains_replacement_characters: bool

This is True if the markup that was parsed contains U+FFFD REPLACEMENT_CHARACTER characters which were not present in the original markup. These mark character sequences that could not be represented in Unicode.

copy_self() BeautifulSoup

Create a new BeautifulSoup object with the same TreeBuilder, but not associated with any markup.

This is the first step of the deepcopy process.

declared_html_encoding: str | None

The character encoding, if any, that was explicitly defined in the original document. This may or may not match BeautifulSoup.original_encoding.

decode(indent_level: int | None = None, eventual_encoding: str = 'utf-8', formatter: Formatter | str = 'minimal', iterator: Iterator[PageElement] | None = None, **kwargs: Any) str
Returns a string representation of the parse tree as a full HTML or XML document.

Parameters:
  • indent_level -- Each line of the rendering will be indented this many levels. (The formatter decides what a 'level' means, in terms of spaces or other characters output.) This is used internally in recursive calls while pretty-printing.

  • eventual_encoding -- The encoding of the final document. If this is None, the document will be a Unicode string.

  • formatter -- Either a Formatter object, or a string naming one of the standard formatters.

  • iterator -- The iterator to use when navigating over the parse tree. This is only used by Tag.decode_contents and you probably won't need to use it.

insert_after(*args: PageElement | str) List[PageElement]

This method is part of the PageElement API, but BeautifulSoup doesn't implement it because there is nothing before or after it in the parse tree.

insert_before(*args: PageElement | str) List[PageElement]

This method is part of the PageElement API, but BeautifulSoup doesn't implement it because there is nothing before or after it in the parse tree.

is_xml: bool
new_string(s: str, subclass: Type[NavigableString] | None = None) NavigableString

Create a new NavigableString associated with this BeautifulSoup object.

Parameters:
  • s -- The string content of the NavigableString

  • subclass -- The subclass of NavigableString, if any, to use. If a document is being processed, an appropriate subclass for the current location in the document will be determined automatically.

new_tag(name: str, namespace: str | None = None, nsprefix: str | None = None, attrs: Mapping[str | NamespacedAttribute, _RawAttributeValue] | None = None, sourceline: int | None = None, sourcepos: int | None = None, string: str | None = None, **kwattrs: str) Tag

Create a new Tag associated with this BeautifulSoup object.

Parameters:
  • name -- The name of the new Tag.

  • namespace -- The URI of the new Tag's XML namespace, if any.

  • nsprefix -- The prefix for the new Tag's XML namespace, if any.

  • attrs -- A dictionary of this Tag's attribute values; can be used instead of kwattrs for attributes like 'class' that are reserved words in Python.

  • sourceline -- The line number where this tag was (purportedly) found in its source document.

  • sourcepos -- The character position within sourceline where this tag was (purportedly) found.

  • string -- String content for the new Tag, if any.

  • kwattrs -- Keyword arguments for the new Tag's attribute values.
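A short sketch of new_tag() and new_string() together (the markup, URL, and attribute values here are invented for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello</p>", "html.parser")

# attrs= lets you set attributes whose names are Python reserved words,
# such as 'class'; plain keyword arguments work for everything else.
link = soup.new_tag("a", href="http://example.com/", attrs={"class": "external"})
link.string = "a link"
soup.p.append(link)

# new_string creates a NavigableString tied to this soup.
soup.p.append(soup.new_string(" and some trailing text"))
```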

original_encoding: str | None

Beautiful Soup's best guess as to the character encoding of the original document.

reset() None

Reset this object to a state as though it had never parsed any markup.

string_container(base_class: Type[NavigableString] | None = None) Type[NavigableString]

Find the class that should be instantiated to hold a given kind of string.

This may be a built-in Beautiful Soup class or a custom class passed in to the BeautifulSoup constructor.

class bs4.CData(value: str | bytes)

Bases: PreformattedString

A CDATA section.

PREFIX: str = '<![CDATA['

A string prepended to the body of the 'real' string when formatting it as part of a document, such as the '<!--' in an HTML comment.

SUFFIX: str = ']]>'

A string appended to the body of the 'real' string when formatting it as part of a document, such as the '-->' in an HTML comment.

class bs4.CSS(tag: element.Tag, api: ModuleType | None = None)

Bases: object

A proxy object against the soupsieve library, to simplify its CSS selector API.

You don't need to instantiate this class yourself; instead, use element.Tag.css.

Parameters:
  • tag -- All CSS selectors run by this object will use this as their starting point.

  • api -- An optional drop-in replacement for the soupsieve module, intended for use in unit tests.
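A brief sketch of the selector API (markup invented; Tag.select and Tag.select_one are the usual entry points and delegate to this proxy):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div id='main'><p class='title'>One</p><p>Two</p></div>",
    "html.parser",
)

# soup.css.select("p.title") is the equivalent call on the CSS proxy itself.
titles = soup.select("p.title")       # all <p class="title"> elements
first_p = soup.select_one("#main p")  # first <p> inside #main
```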

closest(select: str, namespaces: _NamespaceMapping | None = None, flags: int = 0, **kwargs: Any) element.Tag | None

Find the element.Tag closest to this one that matches the given selector.

This uses the Soup Sieve library. For more information, see that library's documentation for the soupsieve.closest() method.

Parameters:
  • selector -- A string containing a CSS selector.

  • namespaces -- A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will pass in the prefixes it encountered while parsing the document.

  • flags -- Flags to be passed into Soup Sieve's soupsieve.closest() method.

  • kwargs -- Keyword arguments to be passed into Soup Sieve's soupsieve.closest() method.

compile(select: str, namespaces: _NamespaceMapping | None = None, flags: int = 0, **kwargs: Any) SoupSieve

Pre-compile a selector and return the compiled object.

Parameters:
  • selector -- A CSS selector.

  • namespaces -- A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will use the prefixes it encountered while parsing the document.

  • flags -- Flags to be passed into Soup Sieve's soupsieve.compile() method.

  • kwargs -- Keyword arguments to be passed into Soup Sieve's soupsieve.compile() method.

Returns:

A precompiled selector object.

Return type:

soupsieve.SoupSieve

escape(ident: str) str

Escape a CSS identifier.

This is a simple wrapper around soupsieve.escape(). See the documentation for that function for more information.

filter(select: str, namespaces: _NamespaceMapping | None = None, flags: int = 0, **kwargs: Any) ResultSet[element.Tag]

Filter this element.Tag's direct children based on the given CSS selector.

This uses the Soup Sieve library. It works the same way as passing a element.Tag into that library's soupsieve.filter() method. For more information, see the documentation for soupsieve.filter().

Parameters:
  • namespaces -- A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will pass in the prefixes it encountered while parsing the document.

  • flags -- Flags to be passed into Soup Sieve's soupsieve.filter() method.

  • kwargs -- Keyword arguments to be passed into Soup Sieve's soupsieve.filter() method.

iselect(select: str, namespaces: _NamespaceMapping | None = None, limit: int = 0, flags: int = 0, **kwargs: Any) Iterator[element.Tag]

Perform a CSS selection operation on the current element.Tag.

This uses the Soup Sieve library. For more information, see that library's documentation for the soupsieve.iselect() method. It is the same as select(), but it returns a generator instead of a list.

Parameters:
  • selector -- A string containing a CSS selector.

  • namespaces -- A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will pass in the prefixes it encountered while parsing the document.

  • limit -- After finding this number of results, stop looking.

  • flags -- Flags to be passed into Soup Sieve's soupsieve.iselect() method.

  • kwargs -- Keyword arguments to be passed into Soup Sieve's soupsieve.iselect() method.

match(select: str, namespaces: Dict[str, str] | None = None, flags: int = 0, **kwargs: Any) bool

Check whether or not this element.Tag matches the given CSS selector.

This uses the Soup Sieve library. For more information, see that library's documentation for the soupsieve.match() method.

Parameters:
  • selector -- A string containing a CSS selector.

  • namespaces -- A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will pass in the prefixes it encountered while parsing the document.

  • flags -- Flags to be passed into Soup Sieve's soupsieve.match() method.

  • kwargs -- Keyword arguments to be passed into Soup Sieve's soupsieve.match() method.

select(select: str, namespaces: _NamespaceMapping | None = None, limit: int = 0, flags: int = 0, **kwargs: Any) ResultSet[element.Tag]

Perform a CSS selection operation on the current element.Tag.

This uses the Soup Sieve library. For more information, see that library's documentation for the soupsieve.select() method.

Parameters:
  • selector -- A CSS selector.

  • namespaces -- A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will pass in the prefixes it encountered while parsing the document.

  • limit -- After finding this number of results, stop looking.

  • flags -- Flags to be passed into Soup Sieve's soupsieve.select() method.

  • kwargs -- Keyword arguments to be passed into Soup Sieve's soupsieve.select() method.

select_one(select: str, namespaces: _NamespaceMapping | None = None, flags: int = 0, **kwargs: Any) element.Tag | None

Perform a CSS selection operation on the current Tag and return the first result, if any.

This uses the Soup Sieve library. For more information, see that library's documentation for the soupsieve.select_one() method.

Parameters:
  • selector -- A CSS selector.

  • namespaces -- A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will use the prefixes it encountered while parsing the document.

  • flags -- Flags to be passed into Soup Sieve's soupsieve.select_one() method.

  • kwargs -- Keyword arguments to be passed into Soup Sieve's soupsieve.select_one() method.

class bs4.Comment(value: str | bytes)

Bases: PreformattedString

An HTML comment or XML comment.

PREFIX: str = '<!--'

A string prepended to the body of the 'real' string when formatting it as part of a document, such as the '<!--' in an HTML comment.

SUFFIX: str = '-->'

A string appended to the body of the 'real' string when formatting it as part of a document, such as the '-->' in an HTML comment.
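A minimal sketch of working with comments (markup invented). The PREFIX and SUFFIX markers are stripped from the Comment's own value and re-added when the document is rendered:

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup("<p>text<!-- a hidden note --></p>", "html.parser")

# Comments are parsed into Comment objects, a NavigableString subclass,
# so they can be located with a string filter.
comment = soup.find(string=lambda s: isinstance(s, Comment))
```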

class bs4.Declaration(value: str | bytes)

Bases: PreformattedString

An XML declaration.

PREFIX: str = '<?'

A string prepended to the body of the 'real' string when formatting it as part of a document, such as the '<!--' in an HTML comment.

SUFFIX: str = '?>'

A string appended to the body of the 'real' string when formatting it as part of a document, such as the '-->' in an HTML comment.

class bs4.Doctype(value: str | bytes)

Bases: PreformattedString

A document type declaration.

PREFIX: str = '<!DOCTYPE '

A string prepended to the body of the 'real' string when formatting it as part of a document, such as the '<!--' in an HTML comment.

SUFFIX: str = '>\n'

A string appended to the body of the 'real' string when formatting it as part of a document, such as the '-->' in an HTML comment.

classmethod for_name_and_ids(name: str, pub_id: str | None, system_id: str | None) Doctype

Generate an appropriate document type declaration for a given public ID and system ID.

Parameters:
  • name -- The name of the document's root element, e.g. 'html'.

  • pub_id -- The Formal Public Identifier for this document type, e.g. '-//W3C//DTD XHTML 1.1//EN'

  • system_id -- The system identifier for this document type, e.g. 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'
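A small sketch of for_name_and_ids() (the identifiers are the XHTML 1.1 ones quoted in the parameter descriptions above):

```python
from bs4 import Doctype

# With no public or system ID, the declaration is just the root element name.
plain = Doctype.for_name_and_ids("html", None, None)

# With identifiers, a full PUBLIC declaration is generated.
xhtml = Doctype.for_name_and_ids(
    "html",
    "-//W3C//DTD XHTML 1.1//EN",
    "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd",
)
```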

class bs4.ElementFilter(match_function: Callable[[PageElement], bool] | None = None)

Bases: object

ElementFilter encapsulates the logic necessary to decide:

1. whether a PageElement (a Tag or a NavigableString) matches a user-specified query.

2. whether a given sequence of markup found during initial parsing should be turned into a PageElement at all, or simply discarded.

The base class is the simplest ElementFilter. By default, it matches everything and allows all markup to become PageElement objects. You can make it more selective by passing in a user-defined match function, or defining a subclass.

Most users of Beautiful Soup will never need to use ElementFilter, or its more capable subclass SoupStrainer. Instead, they will use methods like Tag.find(), which will convert their arguments into SoupStrainer objects and run them against the tree.

However, if you find yourself wanting to treat the arguments to Beautiful Soup's find_*() methods as first-class objects, those objects will be SoupStrainer objects. You can create them yourself and then make use of functions like ElementFilter.filter().
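Since the find_*() arguments become SoupStrainer objects, you can build one yourself. A sketch (markup invented) showing a strainer used both to restrict parsing via parse_only and as a find_all() argument:

```python
from bs4 import BeautifulSoup, SoupStrainer

only_links = SoupStrainer("a")

# parse_only discards everything except matching tags at parse time,
# which can save memory on large documents.
link_soup = BeautifulSoup(
    "<p>intro</p><a href='/one'>one</a><a href='/two'>two</a>",
    "html.parser",
    parse_only=only_links,
)

# The same strainer can stand in for the usual find_all() arguments.
hrefs = [a["href"] for a in link_soup.find_all(only_links)]
```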

allow_string_creation(string: str) bool

Based on the content of a string, see whether this ElementFilter will allow a NavigableString object based on this string to be added to the parse tree.

By default, all strings are processed into NavigableString objects. To change this, subclass ElementFilter.

Parameters:

string -- The string under consideration.

allow_tag_creation(nsprefix: str | None, name: str, attrs: _RawAttributeValues | None) bool

Based on the name and attributes of a tag, see whether this ElementFilter will allow a Tag object to even be created.

By default, all tags are parsed. To change this, subclass ElementFilter.

Parameters:
  • name -- The name of the prospective tag.

  • attrs -- The attributes of the prospective tag.

property excludes_everything: bool

Does this ElementFilter obviously exclude everything? If so, Beautiful Soup will issue a warning if you try to use it when parsing a document.

The ElementFilter might turn out to exclude everything even if this returns False, but it won't exclude everything in an obvious way.

The base ElementFilter implementation excludes things based on a match function we can't inspect, so excludes_everything is always False.

filter(generator: Iterator[PageElement]) Iterator[PageElement | Tag | NavigableString]

The most generic search method offered by Beautiful Soup.

Acts like Python's built-in filter, using ElementFilter.match as the filtering function.

find(generator: Iterator[PageElement]) PageElement | Tag | NavigableString | None

A lower-level equivalent of Tag.find().

You can pass in your own generator for iterating over PageElement objects. The first one that matches this ElementFilter will be returned.

Parameters:

generator -- A way of iterating over PageElement objects.

find_all(generator: Iterator[PageElement], limit: int | None = None) ResultSet[PageElement | Tag | NavigableString]

A lower-level equivalent of Tag.find_all().

You can pass in your own generator for iterating over PageElement objects. Only elements that match this ElementFilter will be returned in the ResultSet.

Parameters:
  • generator -- A way of iterating over PageElement objects.

  • limit -- Stop looking after finding this many results.

property includes_everything: bool

Does this ElementFilter obviously include everything? If so, the filter process can be made much faster.

The ElementFilter might turn out to include everything even if this returns False, but it won't include everything in an obvious way.

The base ElementFilter implementation includes things based on the match function, so includes_everything is only true if there is no match function.

match(element: PageElement, _known_rules: bool = False) bool

Does the given PageElement match the rules set down by this ElementFilter?

The base implementation delegates to the function passed in to the constructor.

Parameters:

_known_rules -- Defined for compatibility with SoupStrainer._match(). Used more for consistency than because we need the performance optimization.

match_function: Callable[[PageElement], bool] | None
exception bs4.FeatureNotFound

Bases: ValueError

Exception raised by the BeautifulSoup constructor if no parser with the requested features is found.

exception bs4.GuessedAtParserWarning

Bases: UserWarning

The warning issued when BeautifulSoup has to guess what parser to use -- probably because no parser was specified in the constructor.

MESSAGE: str = 'No parser was explicitly specified, so I\'m using the best available %(markup_type)s parser for this system ("%(parser)s"). This usually isn\'t a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.\n\nThe code that caused this warning is on line %(line_number)s of the file %(filename)s. To get rid of this warning, pass the additional argument \'features="%(parser)s"\' to the BeautifulSoup constructor.\n'
exception bs4.MarkupResemblesLocatorWarning

Bases: UnusualUsageWarning

The warning issued when BeautifulSoup is given 'markup' that actually looks like a resource locator -- a URL or a path to a file on disk.

FILENAME_MESSAGE: str = 'The input passed in on this line looks more like a filename than HTML or XML.\n\nIf you meant to use Beautiful Soup to parse the contents of a file on disk, then something has gone wrong. You should open the file first, using code like this:\n\n    filehandle = open(your filename)\n\nYou can then feed the open filehandle into Beautiful Soup instead of using the filename.\n\nHowever, if you want to parse some data that happens to look like a %(what)s, then nothing has gone wrong: you are using Beautiful Soup correctly, and this warning is spurious and can be filtered. To make this warning go away, run this code before calling the BeautifulSoup constructor:\n\n    from bs4 import MarkupResemblesLocatorWarning\n    import warnings\n\n    warnings.filterwarnings("ignore", category=MarkupResemblesLocatorWarning)\n    '
URL_MESSAGE: str = 'The input passed in on this line looks more like a URL than HTML or XML.\n\nIf you meant to use Beautiful Soup to parse the web page found at a certain URL, then something has gone wrong. You should use an Python package like \'requests\' to fetch the content behind the URL. Once you have the content as a string, you can feed that string into Beautiful Soup.\n\nHowever, if you want to parse some data that happens to look like a %(what)s, then nothing has gone wrong: you are using Beautiful Soup correctly, and this warning is spurious and can be filtered. To make this warning go away, run this code before calling the BeautifulSoup constructor:\n\n    from bs4 import MarkupResemblesLocatorWarning\n    import warnings\n\n    warnings.filterwarnings("ignore", category=MarkupResemblesLocatorWarning)\n    '
exception bs4.ParserRejectedMarkup(message_or_exception: str | Exception)

Bases: Exception

An Exception to be raised when the underlying parser simply refuses to parse the given markup.

class bs4.ProcessingInstruction(value: str | bytes)

Bases: PreformattedString

An SGML processing instruction.

PREFIX: str = '<?'

A string prepended to the body of the 'real' string when formatting it as part of a document, such as the '<!--' in an HTML comment.

SUFFIX: str = '>'

A string appended to the body of the 'real' string when formatting it as part of a document, such as the '-->' in an HTML comment.

class bs4.ResultSet(source: ElementFilter | None, result: Sequence[_PageElementT] = ())

Bases: Sequence[_PageElementT], Generic[_PageElementT]

A ResultSet is a sequence of PageElement objects, gathered as the result of matching an ElementFilter against a parse tree. Basically, a list of search results.
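A quick sketch (markup invented): a ResultSet behaves like an ordinary sequence of matches, while also remembering the filter that produced it in ResultSet.source:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>a</li><li>b</li></ul>", "html.parser")

# find_all returns a ResultSet; indexing, len(), and iteration all work
# as they would on a plain list of PageElement objects.
items = soup.find_all("li")
```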

result: Sequence[_PageElementT]
source: ElementFilter | None
class bs4.Script(value: str | bytes)

Bases: NavigableString

A NavigableString representing the contents of a <script> HTML tag (probably Javascript).

Used to distinguish executable code from textual content.

exception bs4.StopParsing

Bases: Exception

Exception raised by a TreeBuilder if it's unable to continue parsing.

class bs4.Stylesheet(value: str | bytes)

Bases: NavigableString

A NavigableString representing the contents of a <style> HTML tag (probably CSS).

Used to distinguish embedded stylesheets from textual content.

class bs4.Tag(parser: BeautifulSoup | None = None, builder: TreeBuilder | None = None, name: str | None = None, namespace: str | None = None, prefix: str | None = None, attrs: _RawOrProcessedAttributeValues | None = None, parent: BeautifulSoup | Tag | None = None, previous: _AtMostOneElement = None, is_xml: bool | None = None, sourceline: int | None = None, sourcepos: int | None = None, can_be_empty_element: bool | None = None, cdata_list_attributes: Dict[str, Set[str]] | None = None, preserve_whitespace_tags: Set[str] | None = None, interesting_string_types: Set[Type[NavigableString]] | None = None, namespaces: Dict[str, str] | None = None)

Bases: PageElement

An HTML or XML tag that is part of a parse tree, along with its attributes, contents, and relationships to other parts of the tree.

When Beautiful Soup parses the markup <b>penguin</b>, it will create a Tag object representing the <b> tag. You can instantiate Tag objects directly, but it's not necessary unless you're adding entirely new markup to a parsed document. Most of the constructor arguments are intended for use by the TreeBuilder that's parsing a document.

Parameters:
  • parser -- A BeautifulSoup object representing the parse tree this Tag will be part of.

  • builder -- The TreeBuilder being used to build the tree.

  • name -- The name of the tag.

  • namespace -- The URI of this tag's XML namespace, if any.

  • prefix -- The prefix for this tag's XML namespace, if any.

  • attrs -- A dictionary of attribute values.

  • parent -- The Tag to use as the parent of this Tag. May be the BeautifulSoup object itself.

  • previous -- The PageElement that was parsed immediately before parsing this tag.

  • is_xml -- If True, this is an XML tag. Otherwise, this is an HTML tag.

  • sourceline -- The line number where this tag was found in its source document.

  • sourcepos -- The character position within sourceline where this tag was found.

  • can_be_empty_element -- If True, this tag should be represented as <tag/>. If False, this tag should be represented as <tag></tag>.

  • cdata_list_attributes -- A dictionary of attributes whose values should be parsed as lists of strings if they ever show up on this tag.

  • preserve_whitespace_tags -- Names of tags whose contents should have their whitespace preserved if they are encountered inside this tag.

  • interesting_string_types -- When iterating over this tag's string contents in methods like Tag.strings or PageElement.get_text, these are the types of strings that are interesting enough to be considered. By default, NavigableString (normal strings) and CData (CDATA sections) are the only interesting string subtypes.

  • namespaces -- A dictionary mapping currently active namespace prefixes to URIs, as of the point in the parsing process when this tag was encountered. This can be used later to construct CSS selectors.
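A minimal sketch of the usual way Tag objects come into being, using the <b>penguin</b> example above:

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("<b class='bold'>penguin</b>", "html.parser")
tag = soup.b

# A Tag knows its name, attributes, contents, and place in the tree.
assert isinstance(tag, Tag)
```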

append(tag: _InsertableElement) PageElement

Appends the given PageElement to the contents of this Tag.

Parameters:

tag -- A PageElement.

Returns:

The newly appended PageElement.
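A one-line sketch (markup invented); plain strings are accepted as well as Tag objects and become the last child:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello</p>", "html.parser")

# A string appended to a Tag is converted to a NavigableString child.
soup.p.append(" world")
```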

attrs: _AttributeValues
can_be_empty_element: bool | None
cdata_list_attributes: Dict[str, Set[str]] | None
property children: Iterator[PageElement]

Iterate over all direct children of this PageElement.

clear(decompose: bool = False) None
Destroy all children of this Tag by calling PageElement.extract on them.

Parameters:

decompose -- If this is True, PageElement.decompose (a more destructive method) will be called instead of PageElement.extract.
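A short sketch (markup invented) of the default, extract-based behavior:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>one</p><p>two</p></div>", "html.parser")

# clear() extracts every child, leaving the tag empty; the children
# survive as detached elements. clear(decompose=True) destroys them instead.
soup.div.clear()
```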

contents: List[PageElement]
copy_self() Self

Create a new Tag just like this one, but with no contents and unattached to any parse tree.

This is the first step in the deepcopy process, but you can call it on its own to create a copy of a Tag without copying its contents.

property css: CSS

Return an interface to the CSS selector API.

decode(indent_level: int | None = None, eventual_encoding: _Encoding = 'utf-8', formatter: _FormatterOrName = 'minimal', iterator: Iterator[PageElement] | None = None) str

Render this Tag and its contents as a Unicode string.

Parameters:
  • indent_level -- Each line of the rendering will be indented this many levels. (The formatter decides what a 'level' means, in terms of spaces or other characters output.) This is used internally in recursive calls while pretty-printing.

  • encoding -- The encoding you intend to use when converting the string to a bytestring. decode() is not responsible for performing that encoding. This information is needed so that a real encoding can be substituted in if the document contains an encoding declaration (e.g. in a <meta> tag).

  • formatter -- Either a Formatter object, or a string naming one of the standard formatters.

  • iterator -- The iterator to use when navigating over the parse tree. This is only used by Tag.decode_contents and you probably won't need to use it.

decode_contents(indent_level: int | None = None, eventual_encoding: _Encoding = 'utf-8', formatter: _FormatterOrName = 'minimal') str

Renders the contents of this tag as a Unicode string.

Parameters:
  • indent_level -- Each line of the rendering will be indented this many levels. (The formatter decides what a 'level' means in terms of spaces or other characters output.) Used internally in recursive calls while pretty-printing.

  • eventual_encoding -- The tag is destined to be encoded into this encoding. decode_contents() is not responsible for performing that encoding. This information is needed so that a real encoding can be substituted in if the document contains an encoding declaration (e.g. in a <meta> tag).

  • formatter -- A Formatter object, or a string naming one of the standard Formatters.

property descendants: Iterator[PageElement]

Iterate over all descendants of this Tag, in document order (a depth-first, pre-order traversal).

encode(encoding: _Encoding = 'utf-8', indent_level: int | None = None, formatter: _FormatterOrName = 'minimal', errors: str = 'xmlcharrefreplace') bytes

Render this Tag and its contents as a bytestring.

Parameters:
  • encoding -- The encoding to use when converting to a bytestring. This may also affect the text of the document, specifically any encoding declarations within the document.

  • indent_level -- Each line of the rendering will be indented this many levels. (The formatter decides what a 'level' means, in terms of spaces or other characters output.) This is used internally in recursive calls while pretty-printing.

  • formatter -- Either a Formatter object, or a string naming one of the standard formatters.

  • errors -- An error handling strategy such as 'xmlcharrefreplace'. This value is passed along into str.encode() and its value should be one of the error handling constants defined by Python's codecs module.

encode_contents(indent_level: int | None = None, encoding: _Encoding = 'utf-8', formatter: _FormatterOrName = 'minimal') bytes

Renders the contents of this PageElement as a bytestring.

Parameters:
  • indent_level -- Each line of the rendering will be indented this many levels. (The formatter decides what a 'level' means, in terms of spaces or other characters output.) This is used internally in recursive calls while pretty-printing.

  • formatter -- Either a Formatter object, or a string naming one of the standard formatters.

  • encoding -- The bytestring will be in this encoding.
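A sketch contrasting the four rendering methods above (markup invented; 'é' exercises the encoding path):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>caf\u00e9</p>", "html.parser")
tag = soup.p

text = tag.decode()            # Unicode str, tag included
data = tag.encode("utf-8")     # bytes, with the 'é' UTF-8 encoded
inner = tag.decode_contents()  # Unicode str, contents only
```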

extend(tags: Iterable[_InsertableElement] | Tag) List[PageElement]

Appends one or more objects to the contents of this Tag.

Parameters:

tags -- If a list of PageElement objects is provided, they will be appended to this tag's contents, one at a time. If a single Tag is provided, its Tag.contents will be used to extend this object's Tag.contents.

Returns:

The list of PageElement objects that were appended.

find(name: _FindMethodName = None, attrs: _StrainableAttributes | None = None, recursive: bool = True, string: None = None, **kwargs: _StrainableAttribute) _AtMostOneTag
find(name: None = None, attrs: None = None, recursive: bool = True, string: _StrainableString = '') _AtMostOneNavigableString

Look in the children of this PageElement and find the first PageElement that matches the given criteria.

All find_* methods take a common set of arguments. See the online documentation for detailed explanations.

Parameters:
  • name -- A filter on tag name.

  • attrs -- Additional filters on attribute values.

  • recursive -- If this is True, find() will perform a recursive search of this Tag's children. Otherwise, only the direct children will be considered.

  • string -- A filter on the Tag.string attribute.

Kwargs:

Additional filters on attribute values.

find_all(name: _FindMethodName = None, attrs: _StrainableAttributes | None = None, recursive: bool = True, string: None = None, limit: int | None = None, _stacklevel: int = 2, **kwargs: _StrainableAttribute) _SomeTags
find_all(name: None = None, attrs: None = None, recursive: bool = True, string: _StrainableString = '', limit: int | None = None, _stacklevel: int = 2, **kwargs: _StrainableAttribute) _SomeNavigableStrings

Look in the children of this PageElement and find all PageElement objects that match the given criteria.

All find_* methods take a common set of arguments. See the online documentation for detailed explanations.

Parameters:
  • name -- A filter on tag name.

  • attrs -- Additional filters on attribute values.

  • recursive -- If this is True, find_all() will perform a recursive search of this PageElement's children. Otherwise, only the direct children will be considered.

  • limit -- Stop looking after finding this many results.

  • _stacklevel -- Used internally to improve warning messages.

Kwargs:

Additional filters on attribute values.
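A brief sketch of both methods against a made-up document. Note the trailing underscore in class_, which keeps the keyword filter from colliding with Python's reserved word:

```python
from bs4 import BeautifulSoup

html = '<div><p class="lead">One</p><p>Two</p><p>Three</p></div>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("p")                # first matching Tag, or None if absent
lead = soup.find("p", class_="lead")  # keyword filter on the class attribute
pair = soup.find_all("p", limit=2)    # stop after two matches
```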

get(key: str, default: _AttributeValue | None = None) _AttributeValue | None

Returns the value of the 'key' attribute for the tag, or the value given for 'default' if it doesn't have that attribute.

Parameters:
  • key -- The attribute to look for.

  • default -- Use this value if the attribute is not present on this Tag.

get_attribute_list(key: str, default: AttributeValueList | None = None) AttributeValueList

The same as get(), but always returns a (possibly empty) list.

Parameters:
  • key -- The attribute to look for.

  • default -- Use this value if the attribute is not present on this Tag.

Returns:

A list of strings, usually empty or containing only a single value.
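A short sketch contrasting the two lookups (attribute names and markup are invented for the example):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p id="x" class="a b">hi</p>', "html.parser")
tag = soup.p

by_key = tag.get("id")                    # "x"
missing = tag.get("data-foo", "default")  # attribute absent: the default
as_list = tag.get_attribute_list("id")    # always a list, here ["x"]
```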

has_attr(key: str) bool

Does this Tag have an attribute with the given name?

index(element: PageElement) int

Find the index of a child of this Tag (by identity, not value).

Doing this by identity avoids issues when a Tag contains two children that have string equality.

Parameters:

element -- Look for this PageElement in this object's contents.

insert(position: int, *new_children: _InsertableElement) List[PageElement]

Insert one or more new PageElements as a child of this Tag.

This works similarly to list.insert(), except you can insert multiple elements at once.

Parameters:
  • position -- The numeric position that should be occupied in this Tag's Tag.children by the first new PageElement.

  • new_children -- The PageElements to insert.

Returns:

The newly inserted PageElements.

interesting_string_types: Set[Type[NavigableString]] | None
isSelfClosing() bool

:meta private:

property is_empty_element: bool

Is this tag an empty-element tag? (aka a self-closing tag)

A tag that has contents is never an empty-element tag.

A tag that has no contents may or may not be an empty-element tag. It depends on the TreeBuilder used to create the tag. If the builder has a designated list of empty-element tags, then only a tag whose name shows up in that list is considered an empty-element tag. This is usually the case for HTML documents.

If the builder has no designated list of empty-element tags, then any tag with no contents is an empty-element tag. This is usually the case for XML documents.
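For instance, with the built-in html.parser builder, which carries HTML's designated list of empty-element tags:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>line<br></p>", "html.parser")

br_empty = soup.br.is_empty_element  # <br> is on the HTML empty-element list
p_empty = soup.p.is_empty_element    # a tag with contents is never one
```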

name: str
namespace: str | None
parser_class: type[BeautifulSoup] | None
prefix: str | None
preserve_whitespace_tags: Set[str] | None
prettify(encoding: None = None, formatter: _FormatterOrName = 'minimal') str
prettify(encoding: _Encoding, formatter: _FormatterOrName = 'minimal') bytes

Pretty-print this Tag as a string or bytestring.

Parameters:
  • encoding -- The encoding of the bytestring, or None if you want Unicode.

  • formatter -- A Formatter object, or a string naming one of the standard formatters.

Returns:

A string (if no encoding is provided) or a bytestring (otherwise).

replaceWithChildren() _OneElement

:meta private:

replace_with_children() Self

Replace this PageElement with its contents.

Returns:

This object, no longer part of the tree.

select(selector: str, namespaces: Dict[str, str] | None = None, limit: int = 0, **kwargs: Any) ResultSet[Tag]

Perform a CSS selection operation on the current element.

This uses the SoupSieve library.

Parameters:
  • selector -- A string containing a CSS selector.

  • namespaces -- A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will use the prefixes it encountered while parsing the document.

  • limit -- After finding this number of results, stop looking.

  • kwargs -- Keyword arguments to be passed into SoupSieve's soupsieve.select() method.

select_one(selector: str, namespaces: Dict[str, str] | None = None, **kwargs: Any) Tag | None

Perform a CSS selection operation on the current element.

Parameters:
  • selector -- A CSS selector.

  • namespaces -- A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will use the prefixes it encountered while parsing the document.

  • kwargs -- Keyword arguments to be passed into Soup Sieve's soupsieve.select() method.
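A hedged sketch of both methods against an invented fragment (this relies on the SoupSieve package being installed, as noted above):

```python
from bs4 import BeautifulSoup

html = '<div id="menu"><a class="nav">Home</a><a class="nav">About</a></div>'
soup = BeautifulSoup(html, "html.parser")

links = soup.select("div#menu a.nav")  # every match, as a ResultSet
home = soup.select_one("a.nav")        # first match only, or None
```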

property self_and_descendants: Iterator[PageElement]

Iterate over this Tag and its children in a breadth-first sequence.

smooth() None

Smooth out the children of this Tag by consolidating consecutive strings.

If you perform a lot of operations that modify the tree, calling this method afterwards can make pretty-printed output look more natural.
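A small sketch of the consolidation (the markup is made up):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one</p>", "html.parser")
soup.p.append(" two")          # a second, separate NavigableString child
before = len(soup.p.contents)  # two string children at this point

soup.smooth()                  # consecutive strings are merged into one
```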

sourceline: int | None
sourcepos: int | None
property string: str | None

Convenience property to get the single string within this Tag, assuming there is just one.

Returns:

If this Tag has a single child that's a NavigableString, the return value is that string. If this element has one child Tag, the return value is that child's Tag.string, recursively. If this Tag has no children, or has more than one child, the return value is None.

If this property is unexpectedly returning None for you, it's probably because your Tag has more than one thing inside it.
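A quick sketch of both outcomes with invented markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>only</b></p><div>a<b>b</b></div>", "html.parser")

single = soup.p.string   # one child Tag, so recurses to that child's string
multi = soup.div.string  # more than one child, so None
```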

property strings: Iterator[str]

Yield all strings of certain classes, possibly stripping them.

Parameters:
  • strip -- If True, all strings will be stripped before being yielded.

  • types -- A tuple of NavigableString subclasses. Any strings of a subclass not found in this list will be ignored. By default, the subclasses considered are the ones found in self.interesting_string_types. If that's not specified, only NavigableString and CData objects will be considered. That means no comments, processing instructions, etc.

unwrap() Self

Replace this PageElement with its contents.

Returns:

This object, no longer part of the tree.
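A minimal sketch of unwrap() on a made-up fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>bold</b> text</p>", "html.parser")
removed = soup.p.b.unwrap()  # the <b> tag's contents take its place
```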

class bs4.TemplateString(value: str | bytes)

Bases: NavigableString

A NavigableString representing a string found inside an HTML <template> tag embedded in a larger document.

Used to distinguish such strings from the main body of the document.

class bs4.UnicodeDammit(markup: bytes, known_definite_encodings: Iterable[str] | None = [], smart_quotes_to: Literal['ascii', 'xml', 'html'] | None = None, is_html: bool = False, exclude_encodings: Iterable[str] | None = [], user_encodings: Iterable[str] | None = None, override_encodings: Iterable[str] | None = None)

Bases: object

A class for detecting the encoding of a bytestring containing an HTML or XML document, and decoding it to Unicode. If the source encoding is windows-1252, UnicodeDammit can also replace Microsoft smart quotes with their HTML or XML equivalents.

Parameters:
  • markup -- HTML or XML markup in an unknown encoding.

  • known_definite_encodings -- When determining the encoding of markup, these encodings will be tried first, in order. In HTML terms, this corresponds to the "known definite encoding" step defined in section 13.2.3.1 of the HTML standard.

  • user_encodings -- These encodings will be tried after the known_definite_encodings have been tried and failed, and after an attempt to sniff the encoding by looking at a byte order mark has failed. In HTML terms, this corresponds to the step "user has explicitly instructed the user agent to override the document's character encoding", defined in section 13.2.3.2 of the HTML standard.

  • override_encodings -- A deprecated alias for known_definite_encodings. Any encodings here will be tried immediately after the encodings in known_definite_encodings.

  • smart_quotes_to -- By default, Microsoft smart quotes will, like all other characters, be converted to Unicode characters. Setting this to ascii will convert them to ASCII quotes instead. Setting it to xml will convert them to XML entity references, and setting it to html will convert them to HTML entity references.

  • is_html -- If True, markup is treated as an HTML document. Otherwise it's treated as an XML document.

  • exclude_encodings -- These encodings will not be considered, even if the sniffing code thinks they might make sense.
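A brief sketch of the common case, offering a known definite encoding up front (the bytestring is invented):

```python
from bs4 import UnicodeDammit

# A Latin-1 bytestring; latin-1 is tried first because we declare it.
dammit = UnicodeDammit(b"Caf\xe9", known_definite_encodings=["latin-1"])
```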

CHARSET_ALIASES: Dict[str, str]

This dictionary maps commonly seen values for "charset" in HTML meta tags to the corresponding Python codec names. It only covers values that aren't in Python's aliases and can't be determined by the heuristics in find_codec.

ENCODINGS_WITH_SMART_QUOTES: Iterable[str]

A list of encodings that tend to contain Microsoft smart quotes.

MS_CHARS: Dict[bytes, str | Tuple[str, str]]

A partial mapping of ISO-Latin-1 to HTML entities/XML numeric entities.

WINDOWS_1252_TO_UTF8: Dict[int, bytes]

A map used when removing rogue Windows-1252/ISO-8859-1 characters in otherwise UTF-8 documents.

Note that \x81, \x8d, \x8f, \x90, and \x9d are undefined in Windows-1252.

contains_replacement_characters: bool

This is True if UnicodeDammit.unicode_markup contains U+FFFD REPLACEMENT_CHARACTER characters which were not present in UnicodeDammit.markup. These mark character sequences that could not be represented in Unicode.

property declared_html_encoding: str | None

If the markup is an HTML document, returns the encoding, if any, declared inside the document.

classmethod detwingle(in_bytes: bytes, main_encoding: str = 'utf8', embedded_encoding: str = 'windows-1252') bytes

Fix characters from one encoding embedded in some other encoding.

Currently the only situation supported is Windows-1252 (or its subset ISO-8859-1), embedded in UTF-8.

Parameters:
  • in_bytes -- A bytestring that you suspect contains characters from multiple encodings. Note that this must be a bytestring. If you've already converted the document to Unicode, you're too late.

  • main_encoding -- The primary encoding of in_bytes.

  • embedded_encoding -- The encoding that was used to embed characters in the main document.

Returns:

A bytestring similar to in_bytes, in which embedded_encoding characters have been converted to their main_encoding equivalents.
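A sketch of the one supported situation, Windows-1252 smart quotes embedded in an otherwise UTF-8 document:

```python
from bs4 import UnicodeDammit

utf8_bytes = "\N{SNOWMAN}".encode("utf8")
quote = "\N{LEFT DOUBLE QUOTATION MARK}Hi\N{RIGHT DOUBLE QUOTATION MARK}"
win1252_bytes = quote.encode("windows-1252")

mixed = utf8_bytes + win1252_bytes   # not decodable as pure UTF-8
fixed = UnicodeDammit.detwingle(mixed)
decoded = fixed.decode("utf8")       # now one consistent encoding
```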

find_codec(charset: str) str | None

Look up the Python codec corresponding to a given character set.

Parameters:

charset -- The name of a character set.

Returns:

The name of a Python codec.

markup: bytes

The original markup, before it was converted to Unicode. This is not necessarily the same as what was passed in to the constructor, since any byte-order mark will be stripped.

original_encoding: str | None

Unicode, Dammit's best guess as to the original character encoding of UnicodeDammit.markup.

smart_quotes_to: str | None

The strategy used to handle Microsoft smart quotes.

tried_encodings: List[Tuple[str, str]]

The (encoding, error handling strategy) 2-tuples that were used to try and convert the markup to Unicode.

unicode_markup: str | None

The Unicode version of the markup, following conversion. This is set to None if there was simply no way to convert the bytestring to Unicode (as with binary data).

exception bs4.UnusualUsageWarning

Bases: UserWarning

A superclass for warnings issued when Beautiful Soup sees something that is typically the result of a mistake in the calling code, but might be intentional on the part of the user. If it is in fact intentional, you can filter the individual warning class to get rid of the warning. If you don't want Beautiful Soup second-guessing what you are doing, you can filter the UnusualUsageWarning class itself and get rid of these warnings entirely.

exception bs4.XMLParsedAsHTMLWarning

Bases: UnusualUsageWarning

The warning issued when an HTML parser is used to parse XML that is not (as far as we can tell) XHTML.

MESSAGE: str = 'It looks like you\'re using an HTML parser to parse an XML document.\n\nAssuming this really is an XML document, what you\'re doing might work, but you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the Python package \'lxml\' installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.\n\nIf you want or need to use an HTML parser on this document, you can make this warning go away by filtering it. To do that, run this code before calling the BeautifulSoup constructor:\n\n    from bs4 import XMLParsedAsHTMLWarning\n    import warnings\n\n    warnings.filterwarnings("ignore", category=XMLParsedAsHTMLWarning)\n'

Subpackages

Submodules

bs4.css module

Integration code for CSS selectors using Soup Sieve (pypi: soupsieve).

Acquire a CSS object through the element.Tag.css attribute of the starting point of your CSS selector, or (if you want to run a selector against the entire document) of the BeautifulSoup object itself.

The main advantage of doing this instead of using soupsieve functions is that you don't need to keep passing the element.Tag to be selected against, since the CSS object is permanently scoped to that element.Tag.

class bs4.css.CSS(tag: element.Tag, api: ModuleType | None = None)

Bases: object

A proxy object against the soupsieve library, to simplify its CSS selector API.

You don't need to instantiate this class yourself; instead, use element.Tag.css.

Parameters:
  • tag -- All CSS selectors run by this object will use this as their starting point.

  • api -- An optional drop-in replacement for the soupsieve module, intended for use in unit tests.

closest(select: str, namespaces: _NamespaceMapping | None = None, flags: int = 0, **kwargs: Any) element.Tag | None

Find the element.Tag closest to this one that matches the given selector.

This uses the Soup Sieve library. For more information, see that library's documentation for the soupsieve.closest() method.

Parameters:
  • selector -- A string containing a CSS selector.

  • namespaces -- A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will pass in the prefixes it encountered while parsing the document.

  • flags --

    Flags to be passed into Soup Sieve's soupsieve.closest() method.

  • kwargs --

    Keyword arguments to be passed into Soup Sieve's soupsieve.closest() method.

compile(select: str, namespaces: _NamespaceMapping | None = None, flags: int = 0, **kwargs: Any) SoupSieve

Pre-compile a selector and return the compiled object.

Parameters:
  • selector -- A CSS selector.

  • namespaces -- A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will use the prefixes it encountered while parsing the document.

  • flags --

    Flags to be passed into Soup Sieve's soupsieve.compile() method.

  • kwargs --

    Keyword arguments to be passed into Soup Sieve's soupsieve.compile() method.

Returns:

A precompiled selector object.

Return type:

soupsieve.SoupSieve

escape(ident: str) str

Escape a CSS identifier.

This is a simple wrapper around soupsieve.escape(). See the documentation for that function for more information.

filter(select: str, namespaces: _NamespaceMapping | None = None, flags: int = 0, **kwargs: Any) ResultSet[element.Tag]

Filter this element.Tag's direct children based on the given CSS selector.

This uses the Soup Sieve library. It works the same way as passing a element.Tag into that library's soupsieve.filter() method. For more information, see the documentation for soupsieve.filter().

Parameters:
  • namespaces -- A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will pass in the prefixes it encountered while parsing the document.

  • flags --

    Flags to be passed into Soup Sieve's soupsieve.filter() method.

  • kwargs --

    Keyword arguments to be passed into SoupSieve's soupsieve.filter() method.

iselect(select: str, namespaces: _NamespaceMapping | None = None, limit: int = 0, flags: int = 0, **kwargs: Any) Iterator[element.Tag]

Perform a CSS selection operation on the current element.Tag.

This uses the Soup Sieve library. For more information, see that library's documentation for the soupsieve.iselect() method. It is the same as select(), but it returns a generator instead of a list.

Parameters:
  • selector -- A string containing a CSS selector.

  • namespaces -- A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will pass in the prefixes it encountered while parsing the document.

  • limit -- After finding this number of results, stop looking.

  • flags --

    Flags to be passed into Soup Sieve's soupsieve.iselect() method.

  • kwargs --

    Keyword arguments to be passed into Soup Sieve's soupsieve.iselect() method.

match(select: str, namespaces: Dict[str, str] | None = None, flags: int = 0, **kwargs: Any) bool

Check whether or not this element.Tag matches the given CSS selector.

This uses the Soup Sieve library. For more information, see that library's documentation for the soupsieve.match() method.

Parameters:
  • selector -- A string containing a CSS selector.

  • namespaces -- A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will pass in the prefixes it encountered while parsing the document.

  • flags --

    Flags to be passed into Soup Sieve's soupsieve.match() method.

  • kwargs --

    Keyword arguments to be passed into SoupSieve's soupsieve.match() method.

select(select: str, namespaces: _NamespaceMapping | None = None, limit: int = 0, flags: int = 0, **kwargs: Any) ResultSet[element.Tag]

Perform a CSS selection operation on the current element.Tag.

This uses the Soup Sieve library. For more information, see that library's documentation for the soupsieve.select() method.

Parameters:
  • selector -- A CSS selector.

  • namespaces -- A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will pass in the prefixes it encountered while parsing the document.

  • limit -- After finding this number of results, stop looking.

  • flags --

    Flags to be passed into Soup Sieve's soupsieve.select() method.

  • kwargs --

    Keyword arguments to be passed into Soup Sieve's soupsieve.select() method.

select_one(select: str, namespaces: _NamespaceMapping | None = None, flags: int = 0, **kwargs: Any) element.Tag | None

Perform a CSS selection operation on the current Tag and return the first result, if any.

This uses the Soup Sieve library. For more information, see that library's documentation for the soupsieve.select_one() method.

Parameters:
  • selector -- A CSS selector.

  • namespaces -- A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will use the prefixes it encountered while parsing the document.

  • flags --

    Flags to be passed into Soup Sieve's soupsieve.select_one() method.

  • kwargs --

    Keyword arguments to be passed into Soup Sieve's soupsieve.select_one() method.

bs4.dammit module

Beautiful Soup bonus library: Unicode, Dammit

This library converts a bytestream to Unicode through any means necessary. It is heavily based on code from Mark Pilgrim's Universal Feed Parser, now maintained by Kurt McKee. It does not rewrite the body of an XML or HTML document to reflect a new encoding; that's the job of TreeBuilder.

class bs4.dammit.EncodingDetector(markup: bytes, known_definite_encodings: Iterable[str] | None = None, is_html: bool | None = False, exclude_encodings: Iterable[str] | None = None, user_encodings: Iterable[str] | None = None, override_encodings: Iterable[str] | None = None)

Bases: object

This class is capable of guessing a number of possible encodings for a bytestring.

Order of precedence:

  1. Encodings you specifically tell EncodingDetector to try first (the known_definite_encodings argument to the constructor).

  2. An encoding determined by sniffing the document's byte-order mark.

  3. Encodings you specifically tell EncodingDetector to try if byte-order mark sniffing fails (the user_encodings argument to the constructor).

  4. An encoding declared within the bytestring itself, either in an XML declaration (if the bytestring is to be interpreted as an XML document), or in a <meta> tag (if the bytestring is to be interpreted as an HTML document.)

  5. An encoding detected through textual analysis by chardet, cchardet, or a similar external library.

  6. UTF-8.

  7. Windows-1252.

Parameters:
  • markup -- Some markup in an unknown encoding.

  • known_definite_encodings --

    When determining the encoding of markup, these encodings will be tried first, in order. In HTML terms, this corresponds to the "known definite encoding" step defined in section 13.2.3.1 of the HTML standard.

  • user_encodings --

    These encodings will be tried after the known_definite_encodings have been tried and failed, and after an attempt to sniff the encoding by looking at a byte order mark has failed. In HTML terms, this corresponds to the step "user has explicitly instructed the user agent to override the document's character encoding", defined in section 13.2.3.2 of the HTML standard.

  • override_encodings -- A deprecated alias for known_definite_encodings. Any encodings here will be tried immediately after the encodings in known_definite_encodings.

  • is_html -- If True, this markup is considered to be HTML. Otherwise it's assumed to be XML.

  • exclude_encodings -- These encodings will not be tried, even if they otherwise would be.

chardet_encoding: str | None
declared_encoding: str | None
property encodings: Iterator[str]

Yield a number of encodings that might work for this markup.

Yield:

A sequence of strings. Each is the name of an encoding that might work to convert a bytestring into Unicode.

exclude_encodings: Iterable[str]
classmethod find_declared_encoding(markup: bytes | str, is_html: bool = False, search_entire_document: bool = False) str | None

Given a document, tries to find an encoding declared within the text of the document itself.

An XML encoding is declared at the beginning of the document.

An HTML encoding is declared in a <meta> tag, hopefully near the beginning of the document.

Parameters:
  • markup -- Some markup.

  • is_html -- If True, this markup is considered to be HTML. Otherwise it's assumed to be XML.

  • search_entire_document -- Since an encoding is supposed to be declared near the beginning of the document, most of the time it's only necessary to search a few kilobytes of data. Set this to True to force this method to search the entire document.

Returns:

The declared encoding, if one is found.
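A small sketch of both declaration styles (the documents are invented):

```python
from bs4.dammit import EncodingDetector

xml = b'<?xml version="1.0" encoding="utf-8"?><root/>'
html = b'<html><head><meta charset="iso-8859-1"></head></html>'

xml_enc = EncodingDetector.find_declared_encoding(xml)
html_enc = EncodingDetector.find_declared_encoding(html, is_html=True)
```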

is_html: bool
known_definite_encodings: Iterable[str]
markup: bytes
sniffed_encoding: str | None
classmethod strip_byte_order_mark(data: bytes) Tuple[bytes, str | None]

If a byte-order mark is present, strip it and return the encoding it implies.

Parameters:

data -- A bytestring that may or may not begin with a byte-order mark.

Returns:

A 2-tuple (data stripped of byte-order mark, encoding implied by byte-order mark)

user_encodings: Iterable[str]
class bs4.dammit.EntitySubstitution

Bases: object

The ability to substitute XML or HTML entities for certain characters.

AMPERSAND_OR_BRACKET: Pattern[str]

A regular expression matching an angle bracket or an ampersand.

ANY_ENTITY_RE = re.compile('&(#\\d+|#x[0-9a-fA-F]+|\\w+);', re.IGNORECASE)
BARE_AMPERSAND_OR_BRACKET: Pattern[str]

A regular expression matching an angle bracket or an ampersand that is not part of an XML or HTML entity.

CHARACTER_TO_HTML_ENTITY: Dict[str, str]

A map of Unicode strings to the corresponding named HTML entities; the inverse of HTML_ENTITY_TO_CHARACTER.

CHARACTER_TO_HTML_ENTITY_RE: Pattern[str]

A regular expression that matches any character (or, in rare cases, pair of characters) that can be replaced with a named HTML entity.

CHARACTER_TO_HTML_ENTITY_WITH_AMPERSAND_RE: Pattern[str]

A very similar regular expression to CHARACTER_TO_HTML_ENTITY_RE, but which also matches unescaped ampersands. This is used by the 'html' formatter to provide backwards-compatibility, even though the HTML5 spec allows most ampersands to go unescaped.

CHARACTER_TO_XML_ENTITY: Dict[str, str]

A map of Unicode strings to the corresponding named XML entities.

HTML_ENTITY_TO_CHARACTER: Dict[str, str]

A map of named HTML entities to the corresponding Unicode string.

classmethod quoted_attribute_value(value: str) str

Make a value into a quoted XML attribute, possibly escaping it.

Most strings will be quoted using double quotes.

Bob's Bar -> "Bob's Bar"

If a string contains double quotes, it will be quoted using single quotes.

Welcome to "my bar" -> 'Welcome to "my bar"'

If a string contains both single and double quotes, the double quotes will be escaped, and the string will be quoted using double quotes.

Welcome to "Bob's Bar" -> "Welcome to &quot;Bob's Bar&quot;"

Parameters:

value -- The XML attribute value to quote.

Returns:

The quoted value
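The three docstring cases, sketched directly:

```python
from bs4.dammit import EntitySubstitution

plain = EntitySubstitution.quoted_attribute_value("Bob's Bar")
has_double = EntitySubstitution.quoted_attribute_value('Welcome to "my bar"')
has_both = EntitySubstitution.quoted_attribute_value('Welcome to "Bob\'s Bar"')
```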

classmethod substitute_html(s: str) str

Replace certain Unicode characters with named HTML entities.

This differs from data.encode(encoding, 'xmlcharrefreplace') in that the goal is to make the result more readable (to those with ASCII displays) rather than to recover from errors. There's absolutely nothing wrong with a UTF-8 string containing a LATIN SMALL LETTER E WITH ACUTE, but replacing that character with "&eacute;" will make it more readable to some people.

Parameters:

s -- The string to be modified.

Returns:

The string with some Unicode characters replaced with HTML entities.

classmethod substitute_html5(s: str) str

Replace certain Unicode characters with named HTML entities using HTML5 rules.

Specifically, this method is much less aggressive about escaping ampersands than substitute_html. Only ambiguous ampersands are escaped, per the HTML5 standard:

"An ambiguous ampersand is a U+0026 AMPERSAND character (&) that is followed by one or more ASCII alphanumerics, followed by a U+003B SEMICOLON character (;), where these characters do not match any of the names given in the named character references section."

Unlike substitute_html5_raw, this method assumes HTML entities were converted to Unicode characters on the way in, as Beautiful Soup does. By the time Beautiful Soup does its work, the only ambiguous ampersands that need to be escaped are the ones that were escaped in the original markup when mentioning HTML entities.

Parameters:

s -- The string to be modified.

Returns:

The string with some Unicode characters replaced with HTML entities.

classmethod substitute_html5_raw(s: str) str

Replace certain Unicode characters with named HTML entities using HTML5 rules.

substitute_html5_raw is similar to substitute_html5 but it is designed for standalone use (whereas substitute_html5 is designed for use with Beautiful Soup).

Parameters:

s -- The string to be modified.

Returns:

The string with some Unicode characters replaced with HTML entities.

classmethod substitute_xml(value: str, make_quoted_attribute: bool = False) str

Replace special XML characters with named XML entities.

The less-than sign will become &lt;, the greater-than sign will become &gt;, and any ampersands will become &amp;. If you want ampersands that seem to be part of an entity definition to be left alone, use substitute_xml_containing_entities instead.

Parameters:
  • value -- A string to be substituted.

  • make_quoted_attribute -- If True, then the string will be quoted, as befits an attribute value.

Returns:

A version of value with special characters replaced with named entities.
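A short sketch of the two modes (the input strings are invented):

```python
from bs4.dammit import EntitySubstitution

escaped = EntitySubstitution.substitute_xml("1 < 2 & 4 > 3")
quoted = EntitySubstitution.substitute_xml('say "hi"', make_quoted_attribute=True)
```

Because the second input contains double quotes, the quoted attribute comes back wrapped in single quotes.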

classmethod substitute_xml_containing_entities(value: str, make_quoted_attribute: bool = False) str

Substitute XML entities for special XML characters.

Parameters:
  • value -- A string to be substituted. The less-than sign will become &lt;, the greater-than sign will become &gt;, and any ampersands that are not part of an entity definition will become &amp;.

  • make_quoted_attribute -- If True, then the string will be quoted, as befits an attribute value.

class bs4.dammit.UnicodeDammit(markup: bytes, known_definite_encodings: Iterable[str] | None = [], smart_quotes_to: Literal['ascii', 'xml', 'html'] | None = None, is_html: bool = False, exclude_encodings: Iterable[str] | None = [], user_encodings: Iterable[str] | None = None, override_encodings: Iterable[str] | None = None)

Bases: object

A class for detecting the encoding of a bytestring containing an HTML or XML document, and decoding it to Unicode. If the source encoding is windows-1252, UnicodeDammit can also replace Microsoft smart quotes with their HTML or XML equivalents.

Parameters:
  • markup -- HTML or XML markup in an unknown encoding.

  • known_definite_encodings --

    When determining the encoding of markup, these encodings will be tried first, in order. In HTML terms, this corresponds to the "known definite encoding" step defined in section 13.2.3.1 of the HTML standard.

  • user_encodings --

    These encodings will be tried after the known_definite_encodings have been tried and failed, and after an attempt to sniff the encoding by looking at a byte order mark has failed. In HTML terms, this corresponds to the step "user has explicitly instructed the user agent to override the document's character encoding", defined in section 13.2.3.2 of the HTML standard.

  • override_encodings -- A deprecated alias for known_definite_encodings. Any encodings here will be tried immediately after the encodings in known_definite_encodings.

  • smart_quotes_to -- By default, Microsoft smart quotes will, like all other characters, be converted to Unicode characters. Setting this to ascii will convert them to ASCII quotes instead. Setting it to xml will convert them to XML entity references, and setting it to html will convert them to HTML entity references.

  • is_html -- If True, markup is treated as an HTML document. Otherwise it's treated as an XML document.

  • exclude_encodings -- These encodings will not be considered, even if the sniffing code thinks they might make sense.

CHARSET_ALIASES: Dict[str, str]

This dictionary maps commonly seen values for "charset" in HTML meta tags to the corresponding Python codec names. It only covers values that aren't in Python's aliases and can't be determined by the heuristics in find_codec.

ENCODINGS_WITH_SMART_QUOTES: Iterable[str]

A list of encodings that tend to contain Microsoft smart quotes.

MS_CHARS: Dict[bytes, str | Tuple[str, str]]

A partial mapping of ISO-Latin-1 to HTML entities/XML numeric entities.

WINDOWS_1252_TO_UTF8: Dict[int, bytes]

A map used when removing rogue Windows-1252/ISO-8859-1 characters in otherwise UTF-8 documents.

Note that \x81, \x8d, \x8f, \x90, and \x9d are undefined in Windows-1252.

contains_replacement_characters: bool

This is True if UnicodeDammit.unicode_markup contains U+FFFD REPLACEMENT_CHARACTER characters which were not present in UnicodeDammit.markup. These mark character sequences that could not be represented in Unicode.

property declared_html_encoding: str | None

If the markup is an HTML document, returns the encoding, if any, declared inside the document.

classmethod detwingle(in_bytes: bytes, main_encoding: str = 'utf8', embedded_encoding: str = 'windows-1252') bytes

Fix characters from one encoding embedded in some other encoding.

Currently the only situation supported is Windows-1252 (or its subset ISO-8859-1), embedded in UTF-8.

Parameters:
  • in_bytes -- A bytestring that you suspect contains characters from multiple encodings. Note that this must be a bytestring. If you've already converted the document to Unicode, you're too late.

  • main_encoding -- The primary encoding of in_bytes.

  • embedded_encoding -- The encoding that was used to embed characters in the main document.

Returns:

A bytestring similar to in_bytes, in which embedded_encoding characters have been converted to their main_encoding equivalents.

find_codec(charset: str) str | None

Look up the Python codec corresponding to a given character set.

Parameters:

charset -- The name of a character set.

Returns:

The name of a Python codec.

markup: bytes

The original markup, before it was converted to Unicode. This is not necessarily the same as what was passed in to the constructor, since any byte-order mark will be stripped.

original_encoding: str | None

Unicode, Dammit's best guess as to the original character encoding of UnicodeDammit.markup.

smart_quotes_to: str | None

The strategy used to handle Microsoft smart quotes.

tried_encodings: List[Tuple[str, str]]

The (encoding, error handling strategy) 2-tuples that were used to try and convert the markup to Unicode.

unicode_markup: str | None

The Unicode version of the markup, following conversion. This is set to None if there was simply no way to convert the bytestring to Unicode (as with binary data).

bs4.element module

class bs4.element.AttributeDict

Bases: Dict[Any, Any]

Superclass for the dictionary used to hold a tag's attributes. You can use this, but it's just a regular dict with no special logic.

class bs4.element.AttributeValueList(iterable=(), /)

Bases: List[str]

Class for the list used to hold the values of attributes which have multiple values (such as HTML's 'class'). It's just a regular list, but you can subclass it and pass it in to the TreeBuilder constructor as attribute_value_list_class, to have your subclass instantiated instead.

class bs4.element.AttributeValueWithCharsetSubstitution

Bases: str

An abstract class standing in for a character encoding specified inside an HTML <meta> tag.

Subclasses exist for each place such a character encoding might be found: either inside the charset attribute (CharsetMetaAttributeValue) or inside the content attribute (ContentMetaAttributeValue)

This allows Beautiful Soup to replace that part of the HTML file with a different encoding when outputting a tree as a string.

substitute_encoding(eventual_encoding: str) str

Do whatever's necessary in this implementation-specific portion of an HTML document to substitute in a specific encoding.

class bs4.element.CData(value: str | bytes)

Bases: PreformattedString

A CDATA section.

PREFIX: str = '<![CDATA['

A string prepended to the body of the 'real' string when formatting it as part of a document, such as the '<!--' in an HTML comment.

SUFFIX: str = ']]>'

A string appended to the body of the 'real' string when formatting it as part of a document, such as the '-->' in an HTML comment.

next_element: _AtMostOneElement
next_sibling: _AtMostOneElement
parent: Tag | None
previous_element: _AtMostOneElement
previous_sibling: _AtMostOneElement
class bs4.element.CharsetMetaAttributeValue(original_value: str)

Bases: AttributeValueWithCharsetSubstitution

A generic stand-in for the value of a <meta> tag's charset attribute.

When Beautiful Soup parses the markup <meta charset="utf8">, the value of the charset attribute will become one of these objects.

If the document is later encoded to an encoding other than UTF-8, its <meta> tag will mention the new encoding instead of utf8.

substitute_encoding(eventual_encoding: _Encoding = 'utf-8') str

When an HTML document is being encoded to a given encoding, the value of a <meta> tag's charset becomes the name of the encoding.
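For example (a sketch using the stdlib html.parser builder; the exact serialization of the <meta> tag may differ slightly between parsers):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<meta charset="utf8">', "html.parser")

# Encoding the document to a different encoding rewrites the charset value.
print(soup.encode("latin-1"))  # output mentions charset="latin-1"
```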

class bs4.element.Comment(value: str | bytes)

Bases: PreformattedString

An HTML comment or XML comment.

PREFIX: str = '<!--'

A string prepended to the body of the 'real' string when formatting it as part of a document, such as the '<!--' in an HTML comment.

SUFFIX: str = '-->'

A string appended to the body of the 'real' string when formatting it as part of a document, such as the '-->' in an HTML comment.

next_element: _AtMostOneElement
next_sibling: _AtMostOneElement
parent: Tag | None
previous_element: _AtMostOneElement
previous_sibling: _AtMostOneElement
class bs4.element.ContentMetaAttributeValue(original_value: str)

Bases: AttributeValueWithCharsetSubstitution

A generic stand-in for the value of a <meta> tag's content attribute.

When Beautiful Soup parses the markup:

<meta http-equiv="content-type" content="text/html; charset=utf8">

The value of the content attribute will become one of these objects.

If the document is later encoded to an encoding other than UTF-8, its <meta> tag will mention the new encoding instead of utf8.

CHARSET_RE: Pattern[str] = re.compile('((^|;)\\s*charset=)([^;]*)', re.MULTILINE)

Match the 'charset' argument inside the 'content' attribute of a <meta> tag. :meta private:

substitute_encoding(eventual_encoding: _Encoding = 'utf-8') str

When an HTML document is being encoded to a given encoding, the value of the charset= in a <meta> tag's content becomes the name of the encoding.

bs4.element.DEFAULT_OUTPUT_ENCODING: str = 'utf-8'

Documents output by Beautiful Soup will be encoded with this encoding unless you specify otherwise.

class bs4.element.Declaration(value: str | bytes)

Bases: PreformattedString

An XML declaration.

PREFIX: str = '<?'

A string prepended to the body of the 'real' string when formatting it as part of a document, such as the '<!--' in an HTML comment.

SUFFIX: str = '?>'

A string appended to the body of the 'real' string when formatting it as part of a document, such as the '-->' in an HTML comment.

next_element: _AtMostOneElement
next_sibling: _AtMostOneElement
parent: Tag | None
previous_element: _AtMostOneElement
previous_sibling: _AtMostOneElement
class bs4.element.Doctype(value: str | bytes)

Bases: PreformattedString

A document type declaration.

PREFIX: str = '<!DOCTYPE '

A string prepended to the body of the 'real' string when formatting it as part of a document, such as the '<!--' in an HTML comment.

SUFFIX: str = '>\n'

A string appended to the body of the 'real' string when formatting it as part of a document, such as the '-->' in an HTML comment.

classmethod for_name_and_ids(name: str, pub_id: str | None, system_id: str | None) Doctype

Generate an appropriate document type declaration for a given public ID and system ID.

Parameters:
  • name -- The name of the document's root element, e.g. 'html'.

  • pub_id -- The Formal Public Identifier for this document type, e.g. '-//W3C//DTD XHTML 1.1//EN'

  • system_id -- The system identifier for this document type, e.g. 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'
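A sketch of building a doctype from a public and system ID (the XHTML identifiers here are just illustrative values):

```python
from bs4.element import Doctype

dt = Doctype.for_name_and_ids(
    "html",
    "-//W3C//DTD XHTML 1.1//EN",
    "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd",
)
# Produces: html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://..."
print(dt)
```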

next_element: _AtMostOneElement
next_sibling: _AtMostOneElement
parent: Tag | None
previous_element: _AtMostOneElement
previous_sibling: _AtMostOneElement
class bs4.element.HTMLAttributeDict

Bases: AttributeDict

A dictionary for holding a Tag's attributes, which processes incoming values for consistency with the HTML spec, which says 'Attribute values are a mixture of text and character references...'

Basically, this means converting common non-string values into strings, like XMLAttributeDict, though HTML also has some rules around boolean attributes that XML doesn't have.

class bs4.element.NamespacedAttribute(prefix: str | None, name: str | None = None, namespace: str | None = None)

Bases: str

A namespaced attribute (e.g. the 'xml:lang' in 'xml:lang="en"') which remembers the namespace prefix ('xml') and the name ('lang') that were used to create it.

name: str | None
namespace: str | None
prefix: str | None
class bs4.element.NavigableString(value: str | bytes)

Bases: str, PageElement

A Python string that is part of a parse tree.

When Beautiful Soup parses the markup <b>penguin</b>, it will create a NavigableString for the string "penguin".
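The penguin example above, sketched with the stdlib html.parser builder:

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<b>penguin</b>", "html.parser")
s = soup.b.string

# It behaves like a str, but also knows its place in the tree.
print(type(s).__name__)  # NavigableString
print(s, s.parent.name)  # penguin b
```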

PREFIX: str = ''

A string prepended to the body of the 'real' string when formatting it as part of a document, such as the '<!--' in an HTML comment.

SUFFIX: str = ''

A string appended to the body of the 'real' string when formatting it as part of a document, such as the '-->' in an HTML comment.

output_ready(formatter: _FormatterOrName = 'minimal') str

Run the string through the provided formatter, making it ready for output as part of an HTML or XML document.

Parameters:

formatter -- A Formatter object, or a string naming one of the standard formatters.

property strings: Iterator[str]

Yield this string, but only if it is interesting.

This is defined the way it is for compatibility with Tag.strings. See Tag for information on which strings are interesting in a given context.

Yield:

A sequence that either contains this string, or is empty.

bs4.element.PYTHON_SPECIFIC_ENCODINGS: Set[_Encoding] = {'idna', 'mbcs', 'oem', 'palmos', 'punycode', 'raw-unicode-escape', 'raw_unicode_escape', 'string-escape', 'string_escape', 'undefined', 'unicode-escape', 'unicode_escape'}

These encodings are recognized by Python (so Tag.encode could theoretically support them) but XML and HTML don't recognize them (so they should not show up in an XML or HTML document as that document's encoding).

If an XML document is encoded in one of these encodings, no encoding will be mentioned in the XML declaration. If an HTML document is encoded in one of these encodings, and the HTML document has a <meta> tag that mentions an encoding, the encoding will be given as the empty string.

Source: Python documentation, Python Specific Encodings

class bs4.element.PageElement

Bases: object

An abstract class representing a single element in the parse tree.

NavigableString, Tag, etc. are all subclasses of PageElement. For this reason you'll see a lot of methods that return PageElement, but you'll never see an actual PageElement object. For the most part you can think of PageElement as meaning "a Tag or a NavigableString."

decompose() None

Recursively destroys this PageElement and its children.

The element will be removed from the tree and wiped out; so will everything beneath it.

The behavior of a decomposed PageElement is undefined and you should never use one for anything, but if you need to check whether an element has been decomposed, you can use the PageElement.decomposed property.
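A sketch of decompose and the decomposed check (markup is an arbitrary example):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><span>gone</span></div>", "html.parser")
span = soup.span
span.decompose()

print(soup)             # <div></div>
print(span.decomposed)  # True
```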

property decomposed: bool

Check whether a PageElement has been decomposed.

extract(_self_index: int | None = None) Self

Destructively rips this element out of the tree.

Parameters:

_self_index -- The location of this element in its parent's .contents, if known. Passing this in allows for a performance optimization.

Returns:

this PageElement, no longer part of the tree.
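Unlike decompose, extract leaves the removed element intact and usable (a sketch):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>One <b>bold</b> word</p>", "html.parser")
b = soup.b.extract()  # removed from the tree, but still a live object

print(b)          # <b>bold</b>
print(b.parent)   # None
```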

find_all_next(name: _FindMethodName = None, attrs: _StrainableAttributes | None = None, string: None = None, limit: int | None = None, _stacklevel: int = 2, **kwargs: _StrainableAttribute) _SomeTags
find_all_next(name: None = None, attrs: None = None, string: _StrainableString = '', limit: int | None = None, _stacklevel: int = 2, **kwargs: _StrainableAttribute) _SomeNavigableStrings

Find all PageElement objects that match the given criteria and appear later in the document than this PageElement.

All find_* methods take a common set of arguments. See the online documentation for detailed explanations.

Parameters:
  • name -- A filter on tag name.

  • attrs -- Additional filters on attribute values.

  • string -- A filter for a NavigableString with specific text.

  • limit -- Stop looking after finding this many results.

  • _stacklevel -- Used internally to improve warning messages.

Kwargs:

Additional filters on attribute values.
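A sketch of forward document-order search (the markup is an arbitrary example):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h1>Title</h1><p>a</p><p>b</p>", "html.parser")

# Every <p> that appears after the <h1> in document order.
print([p.get_text() for p in soup.h1.find_all_next("p")])  # ['a', 'b']
```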

find_all_previous(name: _FindMethodName = None, attrs: _StrainableAttributes | None = None, string: None = None, limit: int | None = None, _stacklevel: int = 2, **kwargs: _StrainableAttribute) _SomeTags
find_all_previous(name: None = None, attrs: None = None, string: _StrainableString = '', limit: int | None = None, _stacklevel: int = 2, **kwargs: _StrainableAttribute) _SomeNavigableStrings

Look backwards in the document from this PageElement and find all PageElement that match the given criteria.

All find_* methods take a common set of arguments. See the online documentation for detailed explanations.

Parameters:
  • name -- A filter on tag name.

  • attrs -- Additional filters on attribute values.

  • string -- A filter for a NavigableString with specific text.

  • limit -- Stop looking after finding this many results.

  • _stacklevel -- Used internally to improve warning messages.

Kwargs:

Additional filters on attribute values.

find_next(name: _FindMethodName = None, attrs: _StrainableAttributes | None = None, string: None = None, **kwargs: _StrainableAttribute) _AtMostOneTag
find_next(name: None = None, attrs: None = None, string: _StrainableString = '', **kwargs: _StrainableAttribute) _AtMostOneNavigableString

Find the first PageElement that matches the given criteria and appears later in the document than this PageElement.

All find_* methods take a common set of arguments. See the online documentation for detailed explanations.

Parameters:
  • name -- A filter on tag name.

  • attrs -- Additional filters on attribute values.

  • string -- A filter for a NavigableString with specific text.

Kwargs:

Additional filters on attribute values.

find_next_sibling(name: _FindMethodName = None, attrs: _StrainableAttributes | None = None, string: None = None, **kwargs: _StrainableAttribute) _AtMostOneTag
find_next_sibling(name: None = None, attrs: None = None, string: _StrainableString = '', **kwargs: _StrainableAttribute) _AtMostOneNavigableString

Find the closest sibling to this PageElement that matches the given criteria and appears later in the document.

All find_* methods take a common set of arguments. See the online documentation for detailed explanations.

Parameters:
  • name -- A filter on tag name.

  • attrs -- Additional filters on attribute values.

  • string -- A filter for a NavigableString with specific text.

Kwargs:

Additional filters on attribute values.

find_next_siblings(name: _FindMethodName = None, attrs: _StrainableAttributes | None = None, string: None = None, limit: int | None = None, _stacklevel: int = 2, **kwargs: _StrainableAttribute) _SomeTags
find_next_siblings(name: None = None, attrs: None = None, string: _StrainableString = '', limit: int | None = None, _stacklevel: int = 2, **kwargs: _StrainableAttribute) _SomeNavigableStrings

Find all siblings of this PageElement that match the given criteria and appear later in the document.

All find_* methods take a common set of arguments. See the online documentation for detailed explanations.

Parameters:
  • name -- A filter on tag name.

  • attrs -- Additional filters on attribute values.

  • string -- A filter for a NavigableString with specific text.

  • limit -- Stop looking after finding this many results.

  • _stacklevel -- Used internally to improve warning messages.

Kwargs:

Additional filters on attribute values.

find_parent(name: _FindMethodName = None, attrs: _StrainableAttributes | None = None, **kwargs: _StrainableAttribute) _AtMostOneTag

Find the closest parent of this PageElement that matches the given criteria.

All find_* methods take a common set of arguments. See the online documentation for detailed explanations.

Parameters:
  • name -- A filter on tag name.

  • attrs -- Additional filters on attribute values.

  • self -- Whether the PageElement itself should be considered as one of its 'parents'.

Kwargs:

Additional filters on attribute values.

find_parents(name: _FindMethodName = None, attrs: _StrainableAttributes | None = None, limit: int | None = None, _stacklevel: int = 2, **kwargs: _StrainableAttribute) _SomeTags

Find all parents of this PageElement that match the given criteria.

All find_* methods take a common set of arguments. See the online documentation for detailed explanations.

Parameters:
  • name -- A filter on tag name.

  • attrs -- Additional filters on attribute values.

  • limit -- Stop looking after finding this many results.

  • _stacklevel -- Used internally to improve warning messages.

Kwargs:

Additional filters on attribute values.
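find_parent and find_parents walk upward, nearest ancestor first (a sketch; the id values are arbitrary):

```python
from bs4 import BeautifulSoup

markup = '<div id="outer"><div id="inner"><b>x</b></div></div>'
soup = BeautifulSoup(markup, "html.parser")
b = soup.b

print(b.find_parent("div")["id"])                # inner
print([d["id"] for d in b.find_parents("div")])  # ['inner', 'outer']
```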

find_previous(name: _FindMethodName = None, attrs: _StrainableAttributes | None = None, string: None = None, **kwargs: _StrainableAttribute) _AtMostOneTag
find_previous(name: None = None, attrs: None = None, string: _StrainableString = '', **kwargs: _StrainableAttribute) _AtMostOneNavigableString

Look backwards in the document from this PageElement and find the first PageElement that matches the given criteria.

All find_* methods take a common set of arguments. See the online documentation for detailed explanations.

Parameters:
  • name -- A filter on tag name.

  • attrs -- Additional filters on attribute values.

  • string -- A filter for a NavigableString with specific text.

Kwargs:

Additional filters on attribute values.

find_previous_sibling(name: _FindMethodName = None, attrs: _StrainableAttributes | None = None, string: None = None, **kwargs: _StrainableAttribute) _AtMostOneTag
find_previous_sibling(name: None = None, attrs: None = None, string: _StrainableString = '', **kwargs: _StrainableAttribute) _AtMostOneNavigableString

Returns the closest sibling to this PageElement that matches the given criteria and appears earlier in the document.

All find_* methods take a common set of arguments. See the online documentation for detailed explanations.

Parameters:
  • name -- A filter on tag name.

  • attrs -- Additional filters on attribute values.

  • string -- A filter for a NavigableString with specific text.

Kwargs:

Additional filters on attribute values.

find_previous_siblings(name: _FindMethodName = None, attrs: _StrainableAttributes | None = None, string: None = None, limit: int | None = None, _stacklevel: int = 2, **kwargs: _StrainableAttribute) _SomeTags
find_previous_siblings(name: None = None, attrs: None = None, string: _StrainableString = '', limit: int | None = None, _stacklevel: int = 2, **kwargs: _StrainableAttribute) _SomeNavigableStrings

Returns all siblings to this PageElement that match the given criteria and appear earlier in the document.

All find_* methods take a common set of arguments. See the online documentation for detailed explanations.

Parameters:
  • name -- A filter on tag name.

  • attrs -- Additional filters on attribute values.

  • string -- A filter for a NavigableString with specific text.

  • limit -- Stop looking after finding this many results.

  • _stacklevel -- Used internally to improve warning messages.

Kwargs:

Additional filters on attribute values.

format_string(s: str, formatter: _FormatterOrName | None) str

Format the given string using the given formatter.

Parameters:
  • s -- A string.

  • formatter -- A Formatter object, or a string naming one of the standard formatters.

formatter_for_name(formatter_name: _FormatterOrName | _EntitySubstitutionFunction) Formatter

Look up or create a Formatter for the given identifier, if necessary.

Parameters:

formatter -- Can be a Formatter object (used as-is), a function (used as the entity substitution hook for a bs4.formatter.XMLFormatter or bs4.formatter.HTMLFormatter), or a string (used to look up a bs4.formatter.XMLFormatter or bs4.formatter.HTMLFormatter in the appropriate registry).

getText(separator: str = '', strip: bool = False, types: Iterable[Type[NavigableString]] = ()) str

Get all child strings of this PageElement, concatenated using the given separator.

Parameters:
  • separator -- Strings will be concatenated using this separator.

  • strip -- If True, strings will be stripped before being concatenated.

  • types -- A tuple of NavigableString subclasses. Any strings of a subclass not found in this list will be ignored. Although there are exceptions, the default behavior in most cases is to consider only NavigableString and CData objects. That means no comments, processing instructions, etc.

Returns:

A string.

get_text(separator: str = '', strip: bool = False, types: Iterable[Type[NavigableString]] = ()) str

Get all child strings of this PageElement, concatenated using the given separator.

Parameters:
  • separator -- Strings will be concatenated using this separator.

  • strip -- If True, strings will be stripped before being concatenated.

  • types -- A tuple of NavigableString subclasses. Any strings of a subclass not found in this list will be ignored. Although there are exceptions, the default behavior in most cases is to consider only NavigableString and CData objects. That means no comments, processing instructions, etc.

Returns:

A string.
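A sketch of the separator and strip parameters in action:

```python
from bs4 import BeautifulSoup

soup = BeautifulSsoup = BeautifulSoup("<p> One </p><p> Two </p>", "html.parser")

# Strip each string, then join with "|".
print(soup.get_text("|", strip=True))  # One|Two
```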

hidden: bool = False

Whether or not this element is hidden from generated output. Only the BeautifulSoup object itself is hidden.

insert_after(*args: _InsertableElement) List[PageElement]

Makes the given element(s) the immediate successor of this one.

The elements will have the same PageElement.parent as this one, and the given elements will occur immediately after this one.

Parameters:

args -- One or more PageElements.

Returns:

The list of PageElements that were inserted.

insert_before(*args: _InsertableElement) List[PageElement]

Makes the given element(s) the immediate predecessor of this one.

All the elements will have the same PageElement.parent as this one, and the given elements will occur immediately before this one.

Parameters:

args -- One or more PageElements.

Returns:

The list of PageElements that were inserted.
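The two insertion methods, sketched together (plain strings are converted to NavigableStrings automatically):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>two</b></p>", "html.parser")
soup.b.insert_before("one ")
soup.b.insert_after(" three")

print(soup.p.get_text())  # one two three
```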

known_xml: bool | None = None

In general, we can't tell just by looking at an element whether it's contained in an XML document or an HTML document. But for Tag objects (q.v.) we can store this information at parse time. :meta private:

property next: _AtMostOneElement

The PageElement, if any, that was parsed just after this one.

next_element: _AtMostOneElement
property next_elements: Iterator[PageElement]

All PageElements that were parsed after this one.

next_sibling: _AtMostOneElement
property next_siblings: Iterator[PageElement]

All PageElements that are siblings of this one but were parsed later.

parent: Tag | None
property parents: Iterator[Tag]

All elements that are parents of this PageElement.

Yield:

A sequence of Tags, ending with a BeautifulSoup object.
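Walking upward from a leaf, the iteration ends with the BeautifulSoup object itself, whose name is '[document]' (a sketch):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><body><div><b>x</b></div></body></html>", "html.parser"
)
print([p.name for p in soup.b.parents])
# ['div', 'body', 'html', '[document]']
```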

property previous: _AtMostOneElement

The PageElement, if any, that was parsed just before this one.

previous_element: _AtMostOneElement
property previous_elements: Iterator[PageElement]

All PageElements that were parsed before this one.

Yield:

A sequence of PageElements.

previous_sibling: _AtMostOneElement
property previous_siblings: Iterator[PageElement]

All PageElements that are siblings of this one but were parsed earlier.

Yield:

A sequence of PageElements.

replace_with(*args: _InsertableElement) Self

Replace this PageElement with one or more other elements, keeping the rest of the tree the same.

Returns:

This PageElement, no longer part of the tree.
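A sketch of swapping one tag for another built with new_tag:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>bold</b></p>", "html.parser")
i = soup.new_tag("i")
i.string = "italic"
soup.b.replace_with(i)

print(soup)  # <p><i>italic</i></p>
```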

property self_and_next_elements: Iterator[PageElement]

This PageElement, then all PageElements that were parsed after it.

property self_and_next_siblings: Iterator[PageElement]

This PageElement, then all of its siblings that were parsed later.

property self_and_parents: Iterator[PageElement]

This element, then all of its parents.

Yield:

A sequence of PageElements, ending with a BeautifulSoup object.

property self_and_previous_elements: Iterator[PageElement]

This PageElement, then all elements that were parsed earlier.

property self_and_previous_siblings: Iterator[PageElement]

This PageElement, then all of its siblings that were parsed earlier.

setup(parent: Tag | None = None, previous_element: _AtMostOneElement = None, next_element: _AtMostOneElement = None, previous_sibling: _AtMostOneElement = None, next_sibling: _AtMostOneElement = None) None

Sets up the initial relations between this element and other elements.

Parameters:
  • parent -- The parent of this element.

  • previous_element -- The element parsed immediately before this one.

  • next_element -- The element parsed immediately after this one.

  • previous_sibling -- The most recently encountered element on the same level of the parse tree as this one.

  • next_sibling -- The next element to be encountered on the same level of the parse tree as this one.

property stripped_strings: Iterator[str]

Yield all interesting strings in this PageElement, stripping them first.

See Tag for information on which strings are considered interesting in a given context.

property text: str

Get all child strings of this PageElement, concatenated into a single string. As a property, it takes no arguments; it behaves like a call to get_text() with the default separator, strip, and types values.

wrap(wrap_inside: Tag) Tag

Wrap this PageElement inside a Tag.

Returns:

wrap_inside, occupying the position in the tree that used to be occupied by this object, and with this object now inside it.
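A sketch of wrapping a bare string in a new tag:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>text</p>", "html.parser")
soup.p.string.wrap(soup.new_tag("b"))

print(soup)  # <p><b>text</b></p>
```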

class bs4.element.PreformattedString(value: str | bytes)

Bases: NavigableString

A NavigableString not subject to the normal formatting rules.

This is an abstract class used for special kinds of strings such as comments (Comment) and CDATA blocks (CData).

PREFIX: str = ''

A string prepended to the body of the 'real' string when formatting it as part of a document, such as the '<!--' in an HTML comment.

SUFFIX: str = ''

A string appended to the body of the 'real' string when formatting it as part of a document, such as the '-->' in an HTML comment.

output_ready(formatter: _FormatterOrName | None = None) str
Make this string ready for output by adding any subclass-specific prefix or suffix.

Parameters:

formatter -- A Formatter object, or a string naming one of the standard formatters. The string will be passed into the Formatter, but only to trigger any side effects: the return value is ignored.

Returns:

The string, with any subclass-specific prefix and suffix added on.

class bs4.element.ProcessingInstruction(value: str | bytes)

Bases: PreformattedString

An SGML processing instruction.

PREFIX: str = '<?'

A string prepended to the body of the 'real' string when formatting it as part of a document, such as the '<!--' in an HTML comment.

SUFFIX: str = '>'

A string appended to the body of the 'real' string when formatting it as part of a document, such as the '-->' in an HTML comment.

next_element: _AtMostOneElement
next_sibling: _AtMostOneElement
parent: Tag | None
previous_element: _AtMostOneElement
previous_sibling: _AtMostOneElement
class bs4.element.ResultSet(source: ElementFilter | None, result: Sequence[_PageElementT] = ())

Bases: Sequence[_PageElementT], Generic[_PageElementT]

A ResultSet is a sequence of PageElement objects, gathered as the result of matching an ElementFilter against a parse tree. Basically, a list of search results.

result: Sequence[_PageElementT]
source: ElementFilter | None
class bs4.element.RubyParenthesisString(value: str | bytes)

Bases: NavigableString

A NavigableString representing the contents of an <rp> HTML tag.

class bs4.element.RubyTextString(value: str | bytes)

Bases: NavigableString

A NavigableString representing the contents of an <rt> HTML tag.

Can be used to distinguish such strings from the strings they're annotating.

class bs4.element.Script(value: str | bytes)

Bases: NavigableString

A NavigableString representing the contents of a <script> HTML tag (probably Javascript).

Used to distinguish executable code from textual content.

class bs4.element.Stylesheet(value: str | bytes)

Bases: NavigableString

A NavigableString representing the contents of a <style> HTML tag (probably CSS).

Used to distinguish embedded stylesheets from textual content.

class bs4.element.Tag(parser: BeautifulSoup | None = None, builder: TreeBuilder | None = None, name: str | None = None, namespace: str | None = None, prefix: str | None = None, attrs: _RawOrProcessedAttributeValues | None = None, parent: BeautifulSoup | Tag | None = None, previous: _AtMostOneElement = None, is_xml: bool | None = None, sourceline: int | None = None, sourcepos: int | None = None, can_be_empty_element: bool | None = None, cdata_list_attributes: Dict[str, Set[str]] | None = None, preserve_whitespace_tags: Set[str] | None = None, interesting_string_types: Set[Type[NavigableString]] | None = None, namespaces: Dict[str, str] | None = None)

Bases: PageElement

An HTML or XML tag that is part of a parse tree, along with its attributes, contents, and relationships to other parts of the tree.

When Beautiful Soup parses the markup <b>penguin</b>, it will create a Tag object representing the <b> tag. You can instantiate Tag objects directly, but it's not necessary unless you're adding entirely new markup to a parsed document. Most of the constructor arguments are intended for use by the TreeBuilder that's parsing a document.

Parameters:
  • parser -- A BeautifulSoup object representing the parse tree this Tag will be part of.

  • builder -- The TreeBuilder being used to build the tree.

  • name -- The name of the tag.

  • namespace -- The URI of this tag's XML namespace, if any.

  • prefix -- The prefix for this tag's XML namespace, if any.

  • attrs -- A dictionary of attribute values.

  • parent -- The Tag to use as the parent of this Tag. May be the BeautifulSoup object itself.

  • previous -- The PageElement that was parsed immediately before parsing this tag.

  • is_xml -- If True, this is an XML tag. Otherwise, this is an HTML tag.

  • sourceline -- The line number where this tag was found in its source document.

  • sourcepos -- The character position within sourceline where this tag was found.

  • can_be_empty_element -- If True, this tag should be represented as <tag/>. If False, this tag should be represented as <tag></tag>.

  • cdata_list_attributes -- A dictionary of attributes whose values should be parsed as lists of strings if they ever show up on this tag.

  • preserve_whitespace_tags -- Names of tags whose contents should have their whitespace preserved if they are encountered inside this tag.

  • interesting_string_types -- When iterating over this tag's string contents in methods like Tag.strings or PageElement.get_text, these are the types of strings that are interesting enough to be considered. By default, NavigableString (normal strings) and CData (CDATA sections) are the only interesting string subtypes.

  • namespaces -- A dictionary mapping currently active namespace prefixes to URIs, as of the point in the parsing process when this tag was encountered. This can be used later to construct CSS selectors.

append(tag: _InsertableElement) PageElement

Appends the given PageElement to the contents of this Tag.

Parameters:

tag -- A PageElement.

Returns:

The newly appended PageElement.
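A short illustrative sketch (using the html.parser builder; the markup is made up for the example):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello</p>", "html.parser")
tag = soup.new_tag("b")        # create a new, unattached <b> tag
tag.string = "world"
soup.p.append(tag)             # attach it as the last child of <p>
print(soup.p)                  # <p>Hello<b>world</b></p>
```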

attrs: _AttributeValues
can_be_empty_element: bool | None
cdata_list_attributes: Dict[str, Set[str]] | None
property children: Iterator[PageElement]

Iterate over all direct children of this PageElement.

clear(decompose: bool = False) None

Destroy all children of this Tag by calling PageElement.extract on them.

Parameters:

decompose -- If this is True, PageElement.decompose (a more destructive method) will be called instead of PageElement.extract.
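For example (an illustrative sketch; with decompose=False the extracted children remain usable objects):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>one</p><p>two</p></div>", "html.parser")
soup.div.clear()       # children are extracted from the tree
print(soup.div)        # <div></div>
```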

contents: List[PageElement]
copy_self() Self

Create a new Tag just like this one, but with no contents and unattached to any parse tree.

This is the first step in the deepcopy process, but you can call it on its own to create a copy of a Tag without copying its contents.

property css: CSS

Return an interface to the CSS selector API.

decode(indent_level: int | None = None, eventual_encoding: _Encoding = 'utf-8', formatter: _FormatterOrName = 'minimal', iterator: Iterator[PageElement] | None = None) str

Render this Tag and its contents as a Unicode string.

Parameters:
  • indent_level -- Each line of the rendering will be indented this many levels. (The formatter decides what a 'level' means, in terms of spaces or other characters output.) This is used internally in recursive calls while pretty-printing.

  • eventual_encoding -- The encoding you intend to use when converting the string to a bytestring. decode() is not responsible for performing that encoding. This information is needed so that a real encoding can be substituted in if the document contains an encoding declaration (e.g. in a <meta> tag).

  • formatter -- Either a Formatter object, or a string naming one of the standard formatters.

  • iterator -- The iterator to use when navigating over the parse tree. This is only used by Tag.decode_contents and you probably won't need to use it.

decode_contents(indent_level: int | None = None, eventual_encoding: _Encoding = 'utf-8', formatter: _FormatterOrName = 'minimal') str

Renders the contents of this tag as a Unicode string.

Parameters:
  • indent_level -- Each line of the rendering will be indented this many levels. (The formatter decides what a 'level' means in terms of spaces or other characters output.) Used internally in recursive calls while pretty-printing.

  • eventual_encoding -- The tag is destined to be encoded into this encoding. decode_contents() is not responsible for performing that encoding. This information is needed so that a real encoding can be substituted in if the document contains an encoding declaration (e.g. in a <meta> tag).

  • formatter -- A Formatter object, or a string naming one of the standard Formatters.

property descendants: Iterator[PageElement]

Iterate over all children of this Tag in a breadth-first sequence.

encode(encoding: _Encoding = 'utf-8', indent_level: int | None = None, formatter: _FormatterOrName = 'minimal', errors: str = 'xmlcharrefreplace') bytes

Render this Tag and its contents as a bytestring.

Parameters:
  • encoding -- The encoding to use when converting to a bytestring. This may also affect the text of the document, specifically any encoding declarations within the document.

  • indent_level -- Each line of the rendering will be indented this many levels. (The formatter decides what a 'level' means, in terms of spaces or other characters output.) This is used internally in recursive calls while pretty-printing.

  • formatter -- Either a Formatter object, or a string naming one of the standard formatters.

  • errors -- An error handling strategy such as 'xmlcharrefreplace'. This value is passed along into str.encode() and should be one of the error handling constants defined by Python's codecs module.
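A sketch of the decode()/encode() pair (markup invented for illustration). Note that with the default 'xmlcharrefreplace' error handler, characters outside the target encoding become numeric entities rather than raising an error:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>caf\u00e9</p>", "html.parser")
as_text = soup.p.decode()          # a str: '<p>café</p>'
as_bytes = soup.p.encode("ascii")  # b'<p>caf&#233;</p>' via xmlcharrefreplace
```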

encode_contents(indent_level: int | None = None, encoding: _Encoding = 'utf-8', formatter: _FormatterOrName = 'minimal') bytes

Renders the contents of this PageElement as a bytestring.

Parameters:
  • indent_level -- Each line of the rendering will be indented this many levels. (The formatter decides what a 'level' means, in terms of spaces or other characters output.) This is used internally in recursive calls while pretty-printing.

  • formatter -- Either a Formatter object, or a string naming one of the standard formatters.

  • encoding -- The bytestring will be in this encoding.

extend(tags: Iterable[_InsertableElement] | Tag) List[PageElement]

Appends one or more objects to the contents of this Tag.

Parameters:

tags -- If a list of PageElement objects is provided, they will be appended to this tag's contents, one at a time. If a single Tag is provided, its Tag.contents will be used to extend this object's Tag.contents.

Returns:

The list of PageElements that were appended.
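For example (plain Python strings in the list are converted to NavigableString objects):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p></p>", "html.parser")
soup.p.extend(["one", " ", "two"])  # append several strings at once
print(soup.p.get_text())            # one two
```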

find(name: _FindMethodName = None, attrs: _StrainableAttributes | None = None, recursive: bool = True, string: None = None, **kwargs: _StrainableAttribute) _AtMostOneTag
find(name: None = None, attrs: None = None, recursive: bool = True, string: _StrainableString = '') _AtMostOneNavigableString

Look in the children of this PageElement and find the first PageElement that matches the given criteria.

All find_* methods take a common set of arguments. See the online documentation for detailed explanations.

Parameters:
  • name -- A filter on tag name.

  • attrs -- Additional filters on attribute values.

  • recursive -- If this is True, find() will perform a recursive search of this Tag's children. Otherwise, only the direct children will be considered.

  • string -- A filter on the Tag.string attribute.

Kwargs:

Additional filters on attribute values.

find_all(name: _FindMethodName = None, attrs: _StrainableAttributes | None = None, recursive: bool = True, string: None = None, limit: int | None = None, _stacklevel: int = 2, **kwargs: _StrainableAttribute) _SomeTags
find_all(name: None = None, attrs: None = None, recursive: bool = True, string: _StrainableString = '', limit: int | None = None, _stacklevel: int = 2, **kwargs: _StrainableAttribute) _SomeNavigableStrings

Look in the children of this PageElement and find all PageElement objects that match the given criteria.

All find_* methods take a common set of arguments. See the online documentation for detailed explanations.

Parameters:
  • name -- A filter on tag name.

  • attrs -- Additional filters on attribute values.

  • recursive -- If this is True, find_all() will perform a recursive search of this PageElement's children. Otherwise, only the direct children will be considered.

  • limit -- Stop looking after finding this many results.

  • _stacklevel -- Used internally to improve warning messages.

Kwargs:

Additional filters on attribute values.
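A brief sketch of both methods (the markup and attribute names are invented for illustration; note the class_ spelling, since class is a Python reserved word):

```python
from bs4 import BeautifulSoup

markup = '<div><a id="x" class="link">one</a><a class="link">two</a></div>'
soup = BeautifulSoup(markup, "html.parser")

first = soup.find("a")                     # first matching Tag, or None
links = soup.find_all("a", class_="link")  # ResultSet of both <a> tags
byid = soup.find(id="x")                   # keyword arguments filter attributes
```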

get(key: str, default: _AttributeValue | None = None) _AttributeValue | None

Returns the value of the 'key' attribute for the tag, or the value given for 'default' if it doesn't have that attribute.

Parameters:
  • key -- The attribute to look for.

  • default -- Use this value if the attribute is not present on this Tag.

get_attribute_list(key: str, default: AttributeValueList | None = None) AttributeValueList

The same as get(), but always returns a (possibly empty) list.

Parameters:
  • key -- The attribute to look for.

  • default -- Use this value if the attribute is not present on this Tag.

Returns:

A list of strings, usually empty or containing only a single value.
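For example (illustrative markup; note that for HTML documents, class is a multi-valued attribute, so get() itself already returns a list for it):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="a b" id="x">hi</p>', "html.parser")
tag = soup.p
one = tag.get("id")                     # 'x'
fallback = tag.get("missing", "none")   # 'none'
as_list = tag.get_attribute_list("id")  # ['x']
classes = tag.get("class")              # ['a', 'b']
```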

has_attr(key: str) bool

Does this Tag have an attribute with the given name?

index(element: PageElement) int

Find the index of a child of this Tag (by identity, not value).

Doing this by identity avoids issues when a Tag contains two children that have string equality.

Parameters:

element -- Look for this PageElement in this object's contents.

insert(position: int, *new_children: _InsertableElement) List[PageElement]

Insert one or more new PageElements as a child of this Tag.

This works similarly to list.insert(), except you can insert multiple elements at once.

Parameters:
  • position -- The numeric position that should be occupied in this Tag's Tag.children by the first new PageElement.

  • new_children -- The PageElements to insert.

Returns:

The newly inserted PageElements.
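A sketch of positional insertion (markup invented for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b><b>three</b></p>", "html.parser")
middle = soup.new_tag("b")
middle.string = "two"
soup.p.insert(1, middle)   # position 1: between the two existing children
print(soup.p)              # <p><b>one</b><b>two</b><b>three</b></p>
```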

interesting_string_types: Set[Type[NavigableString]] | None
isSelfClosing() bool

Deprecated; use the is_empty_element property instead. :meta private:

property is_empty_element: bool

Is this tag an empty-element tag? (aka a self-closing tag)

A tag that has contents is never an empty-element tag.

A tag that has no contents may or may not be an empty-element tag. It depends on the TreeBuilder used to create the tag. If the builder has a designated list of empty-element tags, then only a tag whose name shows up in that list is considered an empty-element tag. This is usually the case for HTML documents.

If the builder has no designated list of empty-element tags, then any tag with no contents is an empty-element tag. This is usually the case for XML documents.

name: str
namespace: str | None
next_element: _AtMostOneElement
next_sibling: _AtMostOneElement
parent: Tag | None
parser_class: type[BeautifulSoup] | None
prefix: str | None
preserve_whitespace_tags: Set[str] | None
prettify(encoding: None = None, formatter: _FormatterOrName = 'minimal') str
prettify(encoding: _Encoding, formatter: _FormatterOrName = 'minimal') bytes

Pretty-print this Tag as a string or bytestring.

Parameters:
  • encoding -- The encoding of the bytestring, or None if you want Unicode.

  • formatter -- A Formatter object, or a string naming one of the standard formatters.

Returns:

A string (if no encoding is provided) or a bytestring (otherwise).
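For example (the exact indentation depends on the formatter; this sketch uses the defaults):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>hi</p></div>", "html.parser")
text = soup.div.prettify()          # str: one tag per line, indented
data = soup.div.prettify("utf-8")   # bytes in the requested encoding
```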

previous_element: _AtMostOneElement
previous_sibling: _AtMostOneElement
replaceWithChildren() _OneElement

Deprecated; use replace_with_children() instead. :meta private:

replace_with_children() Self

Replace this PageElement with its contents.

Returns:

This object, no longer part of the tree.

select(selector: str, namespaces: Dict[str, str] | None = None, limit: int = 0, **kwargs: Any) ResultSet[Tag]

Perform a CSS selection operation on the current element.

This uses the SoupSieve library.

Parameters:
  • selector -- A string containing a CSS selector.

  • namespaces -- A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will use the prefixes it encountered while parsing the document.

  • limit -- After finding this number of results, stop looking.

  • kwargs -- Keyword arguments to be passed into SoupSieve's soupsieve.select() method.

select_one(selector: str, namespaces: Dict[str, str] | None = None, **kwargs: Any) Tag | None

Perform a CSS selection operation on the current element.

Parameters:
  • selector -- A CSS selector.

  • namespaces -- A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will use the prefixes it encountered while parsing the document.

  • kwargs -- Keyword arguments to be passed into SoupSieve's soupsieve.select() method.
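A short sketch of both methods (illustrative markup; requires the SoupSieve package, which is a standard Beautiful Soup dependency):

```python
from bs4 import BeautifulSoup

markup = '<ul><li class="item">a</li><li class="item">b</li></ul>'
soup = BeautifulSoup(markup, "html.parser")
items = soup.select("li.item")       # ResultSet of both <li> tags
first = soup.select_one("ul > li")   # first match, or None
```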

property self_and_descendants: Iterator[PageElement]

Iterate over this Tag and its children in a breadth-first sequence.

smooth() None

Smooth out the children of this Tag by consolidating consecutive strings.

If you perform a lot of operations that modify the tree, calling this method afterwards can make pretty-printed output look more natural.
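For example (illustrative sketch showing the consolidation of adjacent strings):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>a</p>", "html.parser")
soup.p.append("b")
soup.p.append("c")
before = len(soup.p.contents)   # 3 separate NavigableStrings
soup.p.smooth()
after = len(soup.p.contents)    # 1 consolidated string, 'abc'
```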

sourceline: int | None
sourcepos: int | None
property string: str | None

Convenience property to get the single string within this Tag, assuming there is just one.

Returns:

If this Tag has a single child that's a NavigableString, the return value is that string. If this element has one child Tag, the return value is that child's Tag.string, recursively. If this Tag has no children, or has more than one child, the return value is None.

If this property is unexpectedly returning None for you, it's probably because your Tag has more than one thing inside it.
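For example (markup invented for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>only</b></p><div>one<b>two</b></div>", "html.parser")
single = soup.p.string    # 'only': one child Tag, so recurse into it
multi = soup.div.string   # None: <div> has more than one child
```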

property strings: Iterator[str]

Yield all strings of certain classes, possibly stripping them.

Parameters:
  • strip -- If True, all strings will be stripped before being yielded.

  • types -- A tuple of NavigableString subclasses. Any strings of a subclass not found in this list will be ignored. By default, the subclasses considered are the ones found in self.interesting_string_types. If that's not specified, only NavigableString and CData objects will be considered. That means no comments, processing instructions, etc.

unwrap() Self

Replace this PageElement with its contents.

Returns:

This object, no longer part of the tree.
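For example (illustrative sketch; the detached tag is returned so you can reuse or inspect it):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>I <b>really</b> mean it</p>", "html.parser")
removed = soup.p.b.unwrap()   # the now-empty <b>, detached from the tree
print(soup.p)                 # <p>I really mean it</p>
```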

class bs4.element.TemplateString(value: str | bytes)

Bases: NavigableString

A NavigableString representing a string found inside an HTML <template> tag embedded in a larger document.

Used to distinguish such strings from the main body of the document.

class bs4.element.XMLAttributeDict

Bases: AttributeDict

A dictionary for holding a Tag's attributes, which processes incoming values for consistency with the HTML spec.

class bs4.element.XMLProcessingInstruction(value: str | bytes)

Bases: ProcessingInstruction

An XML processing instruction.

PREFIX: str = '<?'

A string prepended to the body of the 'real' string when formatting it as part of a document, such as the '<!--' in an HTML comment.

SUFFIX: str = '?>'

A string appended to the body of the 'real' string when formatting it as part of a document, such as the '-->' in an HTML comment.

bs4.element.nonwhitespace_re: Pattern[str] = re.compile('\\S+')

A regular expression matching runs of non-whitespace characters; finding all matches effectively splits a string on whitespace.

bs4.filter module

class bs4.filter.AttributeValueMatchRule(string: str | bytes | None = None, pattern: _RegularExpressionProtocol | None = None, function: Callable | None = None, present: bool | None = None, exclude_everything: bool | None = None)

Bases: MatchRule

A MatchRule implementing the rules for matches against attribute value.

function: Callable[[str | None], bool] | None
class bs4.filter.ElementFilter(match_function: Callable[[PageElement], bool] | None = None)

Bases: object

ElementFilter encapsulates the logic necessary to decide:

1. whether a PageElement (a Tag or a NavigableString) matches a user-specified query.

2. whether a given sequence of markup found during initial parsing should be turned into a PageElement at all, or simply discarded.

The base class is the simplest ElementFilter. By default, it matches everything and allows all markup to become PageElement objects. You can make it more selective by passing in a user-defined match function, or defining a subclass.

Most users of Beautiful Soup will never need to use ElementFilter, or its more capable subclass SoupStrainer. Instead, they will use methods like Tag.find(), which will convert their arguments into SoupStrainer objects and run them against the tree.

However, if you find yourself wanting to treat the arguments to Beautiful Soup's find_*() methods as first-class objects, those objects will be SoupStrainer objects. You can create them yourself and then make use of functions like ElementFilter.filter().

allow_string_creation(string: str) bool

Based on the content of a string, see whether this ElementFilter will allow a NavigableString object based on this string to be added to the parse tree.

By default, all strings are processed into NavigableString objects. To change this, subclass ElementFilter.

Parameters:

string -- The string under consideration.

allow_tag_creation(nsprefix: str | None, name: str, attrs: _RawAttributeValues | None) bool

Based on the name and attributes of a tag, see whether this ElementFilter will allow a Tag object to even be created.

By default, all tags are parsed. To change this, subclass ElementFilter.

Parameters:
  • name -- The name of the prospective tag.

  • attrs -- The attributes of the prospective tag.

property excludes_everything: bool

Does this ElementFilter obviously exclude everything? If so, Beautiful Soup will issue a warning if you try to use it when parsing a document.

The ElementFilter might turn out to exclude everything even if this returns False, but it won't exclude everything in an obvious way.

The base ElementFilter implementation excludes things based on a match function we can't inspect, so excludes_everything is always false.

filter(generator: Iterator[PageElement]) Iterator[PageElement | Tag | NavigableString]

The most generic search method offered by Beautiful Soup.

Acts like Python's built-in filter, using ElementFilter.match as the filtering function.

find(generator: Iterator[PageElement]) PageElement | Tag | NavigableString | None

A lower-level equivalent of Tag.find().

You can pass in your own generator for iterating over PageElement objects. The first one that matches this ElementFilter will be returned.

Parameters:

generator -- A way of iterating over PageElement objects.

find_all(generator: Iterator[PageElement], limit: int | None = None) ResultSet[PageElement | Tag | NavigableString]

A lower-level equivalent of Tag.find_all().

You can pass in your own generator for iterating over PageElement objects. Only elements that match this ElementFilter will be returned in the ResultSet.

Parameters:
  • generator -- A way of iterating over PageElement objects.

  • limit -- Stop looking after finding this many results.

property includes_everything: bool

Does this ElementFilter obviously include everything? If so, the filter process can be made much faster.

The ElementFilter might turn out to include everything even if this returns False, but it won't include everything in an obvious way.

The base ElementFilter implementation includes things based on the match function, so includes_everything is only true if there is no match function.

match(element: PageElement, _known_rules: bool = False) bool

Does the given PageElement match the rules set down by this ElementFilter?

The base implementation delegates to the function passed in to the constructor.

Parameters:

_known_rules -- Defined for compatibility with SoupStrainer._match(). Used more for consistency than because we need the performance optimization.

match_function: Callable[[PageElement], bool] | None
class bs4.filter.MatchRule(string: str | bytes | None = None, pattern: _RegularExpressionProtocol | None = None, function: Callable | None = None, present: bool | None = None, exclude_everything: bool | None = None)

Bases: object

Each MatchRule encapsulates the logic behind a single argument passed in to one of the Beautiful Soup find* methods.

exclude_everything: bool | None
matches_string(string: str | None) bool
pattern: _RegularExpressionProtocol | None
present: bool | None
string: str | None
class bs4.filter.SoupStrainer(name: str | bytes | Pattern[str] | bool | Callable[[Tag], bool] | Iterable[str | bytes | Pattern[str] | bool | Callable[[Tag], bool]] | None = None, attrs: Dict[str, str | bytes | Pattern[str] | bool | Callable[[str | None], bool] | Iterable[str | bytes | Pattern[str] | bool | Callable[[str | None], bool]]] | None = None, string: str | bytes | Pattern[str] | bool | Callable[[str | None], bool] | Iterable[str | bytes | Pattern[str] | bool | Callable[[str | None], bool]] | None = None, **kwargs: str | bytes | Pattern[str] | bool | Callable[[str | None], bool] | Iterable[str | bytes | Pattern[str] | bool | Callable[[str | None], bool]])

Bases: ElementFilter

The ElementFilter subclass used internally by Beautiful Soup.

A SoupStrainer encapsulates the logic necessary to perform the kind of matches supported by methods such as Tag.find(). SoupStrainer objects are primarily created internally, but you can create one yourself and pass it in as parse_only to the BeautifulSoup constructor, to parse a subset of a large document.

Internally, SoupStrainer objects work by converting the constructor arguments into MatchRule objects. Incoming tags/markup are matched against those rules.

Parameters:
  • name -- One or more restrictions on the tags found in a document.

  • attrs -- A dictionary that maps attribute names to restrictions on tags that use those attributes.

  • string -- One or more restrictions on the strings found in a document.

  • kwargs -- A dictionary that maps attribute names to restrictions on tags that use those attributes. These restrictions are additive to any specified in attrs.
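For example, parsing only the <a> tags out of a document (illustrative markup; parse_only works with parsers that support it, such as html.parser):

```python
from bs4 import BeautifulSoup, SoupStrainer

only_links = SoupStrainer("a")
markup = '<div><a href="/x">x</a><p>skip</p><a href="/y">y</a></div>'
soup = BeautifulSoup(markup, "html.parser", parse_only=only_links)
hrefs = [a["href"] for a in soup.find_all("a")]  # ['/x', '/y']
```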

allow_string_creation(string: str) bool

Based on the content of a markup string, see whether this SoupStrainer will allow it to be instantiated as a NavigableString object, or whether it should be ignored.

allow_tag_creation(nsprefix: str | None, name: str, attrs: _RawAttributeValues | None) bool

Based on the name and attributes of a tag, see whether this SoupStrainer will allow a Tag object to even be created.

Parameters:
  • name -- The name of the prospective tag.

  • attrs -- The attributes of the prospective tag.

attribute_rules: Dict[str, List[AttributeValueMatchRule]]
property excludes_everything: bool

Check whether the provided rules will obviously exclude everything. (They might exclude everything even if this returns False, but not in an obvious way.)

property includes_everything: bool

Check whether the provided rules will obviously include everything. (They might include everything even if this returns False, but not in an obvious way.)

match(element: PageElement, _known_rules: bool = False) bool

Does the given PageElement match the rules set down by this SoupStrainer?

The find_* methods rely heavily on this method to find matches.

Parameters:
  • element -- A PageElement.

  • _known_rules -- Set to true in the common case where we already checked and found at least one rule in this SoupStrainer that might exclude a PageElement. Without this, we need to check .includes_everything every time, just to be safe.

Returns:

True if the element matches this SoupStrainer's rules; False otherwise.

matches_any_string_rule(string: str) bool

See whether the content of a string matches any of this SoupStrainer's string rules.

matches_tag(tag: Tag) bool

Do the rules of this SoupStrainer trigger a match against the given Tag?

If the SoupStrainer has any TagNameMatchRule, at least one must match the Tag or its Tag.name.

If there are any AttributeValueMatchRule for a given attribute, at least one of them must match the attribute value.

If there are any StringMatchRule, at least one must match, but a SoupStrainer that only contains StringMatchRule cannot match a Tag, only a NavigableString.

name_rules: List[TagNameMatchRule]
search_tag(name: str, attrs: _RawAttributeValues | None) bool

A less elegant version of allow_tag_creation. Deprecated as of 4.13.0

string_rules: List[StringMatchRule]
class bs4.filter.StringMatchRule(string: str | bytes | None = None, pattern: _RegularExpressionProtocol | None = None, function: Callable | None = None, present: bool | None = None, exclude_everything: bool | None = None)

Bases: MatchRule

A MatchRule implementing the rules for matches against a NavigableString.

function: Callable[[str], bool] | None
class bs4.filter.TagNameMatchRule(string: str | bytes | None = None, pattern: _RegularExpressionProtocol | None = None, function: Callable | None = None, present: bool | None = None, exclude_everything: bool | None = None)

Bases: MatchRule

A MatchRule implementing the rules for matches against tag name.

function: Callable[[Tag], bool] | None
matches_tag(tag: Tag) bool

bs4.formatter module

class bs4.formatter.Formatter(language: str | None = None, entity_substitution: Callable[[str], str] | None = None, void_element_close_prefix: str = '/', cdata_containing_tags: Set[str] | None = None, empty_attributes_are_booleans: bool = False, indent: int | str = 1)

Bases: EntitySubstitution

Describes a strategy to use when outputting a parse tree to a string.

Some parts of this strategy come from the distinction between HTML4, HTML5, and XML. Others are configurable by the user.

Formatters are passed in as the formatter argument to methods like bs4.element.Tag.encode. Most people won't need to think about formatters, and most people who need to think about them can pass in one of these predefined strings as formatter rather than making a new Formatter object:

For HTML documents:
  • 'html' - HTML entity substitution for generic HTML documents.

  • 'html5' - HTML entity substitution for HTML5 documents, as well as some optimizations in the way tags are rendered.

  • 'html5-4.12.0' - The version of the 'html5' formatter used prior to Beautiful Soup 4.13.0.

  • 'minimal' - Only make the substitutions necessary to guarantee valid HTML. (default)

  • None - Do not perform any substitution. This will be faster but may result in invalid markup.

For XML documents:
  • 'html' - Entity substitution for XHTML documents.

  • 'minimal' - Only make the substitutions necessary to guarantee valid XML. (default)

  • None - Do not perform any substitution. This will be faster but may result in invalid markup.
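A sketch of the named formatters in action (illustrative markup):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>2 &lt; 3 &amp; caf\u00e9</p>", "html.parser")
minimal = soup.p.decode(formatter="minimal")  # '<p>2 &lt; 3 &amp; café</p>'
html = soup.p.decode(formatter="html")        # also substitutes caf&eacute;
raw = soup.p.decode(formatter=None)           # '<p>2 < 3 & café</p>'
```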

HTML: str = 'html'

Constant name denoting HTML markup

HTML_DEFAULTS: Dict[str, Set[str]] = {'cdata_containing_tags': {'script', 'style'}}

Default values for the various constructor options when the markup language is HTML.

XML: str = 'xml'

Constant name denoting XML markup

attribute_value(value: str) str

Process the value of an attribute.

Parameters:

value -- A string.

Returns:

A string with certain characters replaced by named or numeric entities.

attributes(tag: bs4.element.Tag) Iterable[Tuple[str, _AttributeValue | None]]

Reorder a tag's attributes however you want.

By default, attributes are sorted alphabetically. This makes behavior consistent between Python 2 and Python 3, and preserves backwards compatibility with older versions of Beautiful Soup.

If empty_attributes_are_booleans is True, then attributes whose values are set to the empty string will be treated as boolean attributes.

empty_attributes_are_booleans: bool

If this is set to true by the constructor, then attributes whose values are set to the empty string will be treated as HTML boolean attributes. (Attributes whose value is None are always rendered this way.)

substitute(ns: str) str

Process a string that needs to undergo entity substitution. This may be a string encountered in an attribute value or as text.

Parameters:

ns -- A string.

Returns:

The same string but with certain characters replaced by named or numeric entities.

class bs4.formatter.HTMLFormatter(entity_substitution: Callable[[str], str] | None = None, void_element_close_prefix: str = '/', cdata_containing_tags: Set[str] | None = None, empty_attributes_are_booleans: bool = False, indent: int | str = 1)

Bases: Formatter

A generic Formatter for HTML.

REGISTRY: Dict[str | None, HTMLFormatter] = {'html': <bs4.formatter.HTMLFormatter object>, 'html5': <bs4.formatter.HTMLFormatter object>, 'html5-4.12': <bs4.formatter.HTMLFormatter object>, 'minimal': <bs4.formatter.HTMLFormatter object>, None: <bs4.formatter.HTMLFormatter object>}
class bs4.formatter.XMLFormatter(entity_substitution: Callable[[str], str] | None = None, void_element_close_prefix: str = '/', cdata_containing_tags: Set[str] | None = None, empty_attributes_are_booleans: bool = False, indent: int | str = 1)

Bases: Formatter

A generic Formatter for XML.

REGISTRY: Dict[str | None, XMLFormatter] = {'html': <bs4.formatter.XMLFormatter object>, 'minimal': <bs4.formatter.XMLFormatter object>, None: <bs4.formatter.XMLFormatter object>}

bs4._typing module

bs4._typing._AttributeValues

A dictionary of names to _AttributeValue objects. This is what a tag's attributes look like after processing.

alias of Dict[str, Union[str, AttributeValueList]]

bs4._typing._BaseStrainable

A tag name, an attribute value, or a string can be matched against a string, a bytestring, a regular expression, or a boolean.

alias of Union[str, bytes, Pattern[str], bool]

bs4._typing._BaseStrainableAttribute

A tag's attribute value can be matched either with the _BaseStrainable options, or using a function that takes that value as its sole argument.

alias of Union[str, bytes, Pattern[str], bool, Callable[[Optional[str]], bool]]

bs4._typing._BaseStrainableElement

A tag can be matched either with the _BaseStrainable options, or using a function that takes the Tag as its sole argument.

alias of Union[str, bytes, Pattern[str], bool, Callable[[Tag], bool]]

bs4._typing._Encoding

A data encoding.

bs4._typing._Encodings

One or more data encodings.

alias of Iterable[str]

bs4._typing._IncomingMarkup

The rawest form of markup: either a string, bytestring, or an open filehandle.

alias of Union[str, bytes, IO[str], IO[bytes]]

bs4._typing._InsertableElement

A number of tree manipulation methods can take either a PageElement or a normal Python string (which will be converted to a NavigableString).

alias of Union[PageElement, str]

bs4._typing._InvertedNamespaceMapping

A mapping of namespace URLs to prefixes

alias of Dict[str, str]

bs4._typing._NamespaceMapping

A mapping of prefixes to namespace URLs.

alias of Dict[str, str]

bs4._typing._NamespacePrefix

The prefix for an XML namespace.

bs4._typing._NamespaceURL

The URL of an XML namespace

bs4._typing._NullableStringMatchFunction

A function that takes a string (or None) and returns a yes-or-no answer. An AttributeValueMatchRule expects this kind of function, if you're going to pass it a function.

alias of Callable[[Optional[str]], bool]

bs4._typing._OneElement

Many Beautiful Soup methods return a PageElement or a ResultSet of PageElements. A PageElement is either a Tag or a NavigableString. These convenience aliases make it easier for IDE users to see which methods are available on the objects they're dealing with.

alias of Union[PageElement, Tag, NavigableString]

bs4._typing._PageElementMatchFunction

A function that takes a PageElement and returns a yes-or-no answer.

alias of Callable[[PageElement], bool]

bs4._typing._RawAttributeValue

The value associated with an HTML or XML attribute. This is the relatively unprocessed value Beautiful Soup expects to come from a TreeBuilder.

bs4._typing._RawAttributeValues: TypeAlias = 'Mapping[Union[str, NamespacedAttribute], _RawAttributeValue]'

A dictionary of names to _RawAttributeValue objects. This is how Beautiful Soup expects a TreeBuilder to represent a tag's attribute values.

bs4._typing._RawMarkup

Markup that is in memory but has (potentially) yet to be converted to Unicode.

alias of Union[str, bytes]

bs4._typing._RawOrProcessedAttributeValues

The methods that deal with turning _RawAttributeValue into _AttributeValue may be called several times, even after the values are already processed (e.g. when cloning a tag), so they need to be able to accommodate both possibilities.

alias of Union[Mapping[Union[str, NamespacedAttribute], _RawAttributeValue], Dict[str, Union[str, AttributeValueList]]]

class bs4._typing._RegularExpressionProtocol(*args, **kwargs)

Bases: Protocol

A protocol object which can accept either Python's built-in re.Pattern objects, or the similar Regex objects defined by the third-party regex package.

_abc_impl = <_abc._abc_data object>
_is_protocol = True
_is_runtime_protocol = True
property pattern: str
search(string: str, pos: int = Ellipsis, endpos: int = Ellipsis) Any | None
bs4._typing._StrainableAttribute

An attribute value can be matched using either a single criterion or a list of criteria.

alias of Union[str, bytes, Pattern[str], bool, Callable[[Optional[str]], bool], Iterable[Union[str, bytes, Pattern[str], bool, Callable[[Optional[str]], bool]]]]

bs4._typing._StrainableAttributes

A dictionary may be used to match against multiple attribute values at once.

alias of Dict[str, Union[str, bytes, Pattern[str], bool, Callable[[Optional[str]], bool], Iterable[Union[str, bytes, Pattern[str], bool, Callable[[Optional[str]], bool]]]]]

bs4._typing._StrainableElement

A tag can be matched using either a single criterion or a list of criteria.

alias of Union[str, bytes, Pattern[str], bool, Callable[[Tag], bool], Iterable[Union[str, bytes, Pattern[str], bool, Callable[[Tag], bool]]]]

bs4._typing._StrainableString

A string can be matched using the same techniques as an attribute value.

alias of Union[str, bytes, Pattern[str], bool, Callable[[Optional[str]], bool], Iterable[Union[str, bytes, Pattern[str], bool, Callable[[Optional[str]], bool]]]]
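A short sketch of string matching via the string argument to find_all(), which accepts the same kinds of criteria as attribute values: an exact string, a regex, a list, or a callable that receives the string (or None).

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello</p><p>Goodbye</p>", "html.parser")

# Exact string, compiled regex, and callable criteria.
print(soup.find_all(string="Hello"))
print(soup.find_all(string=re.compile("bye$")))
print(soup.find_all(string=lambda s: s is not None and len(s) > 5))
```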

bs4._typing._TagMatchFunction

A function that takes a Tag and returns a yes-or-no answer. If you pass a function to a TagNameMatchRule, it must be a function of this kind.

alias of Callable[[Tag], bool]
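Such a function can be passed directly to find_all(): it is called once per Tag and should return True for tags to keep. A minimal sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="/x">x</a><a>plain</a><b href="/y">y</b>', "html.parser"
)

# A tag-match function: receives each Tag, returns True to match it.
def link_like(tag):
    return tag.name == "a" and tag.has_attr("href")

print([t.text for t in soup.find_all(link_like)])
```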

bs4.diagnose module

Diagnostic functions, mainly for use when doing tech support.

class bs4.diagnose.AnnouncingParser(*, convert_charrefs=True)

Bases: HTMLParser

Subclass of HTMLParser that announces parse events, without doing anything else.

You can use this to get a picture of how html.parser sees a given document. The easiest way to do this is to call htmlparser_trace.

handle_charref(name: str) None
handle_comment(data: str) None
handle_data(data: str) None
handle_decl(data: str) None
handle_endtag(name: str, check_already_closed: bool = True) None
handle_entityref(name: str) None
handle_pi(data: str) None
handle_starttag(name: str, attrs: List[Tuple[str, str | None]], handle_empty_element: bool = True) None
unknown_decl(data: str) None
bs4.diagnose.benchmark_parsers(num_elements: int = 100000) None

Very basic head-to-head performance benchmark.

bs4.diagnose.diagnose(data: _IncomingMarkup) None

Diagnostic suite for isolating common problems.

Parameters:

data -- Some markup that needs to be explained.

Returns:

None; diagnostics are printed to standard output.

bs4.diagnose.htmlparser_trace(data: str) None

Print out the HTMLParser events that occur during parsing.

This lets you see how HTMLParser parses a document when no Beautiful Soup code is running.

Parameters:

data -- Some markup.
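The idea behind this trace can be sketched with the standard library alone. The class below is an illustrative stand-in (not part of bs4) that records each HTMLParser event rather than printing it, similar in spirit to AnnouncingParser:

```python
from html.parser import HTMLParser

# Records each parse event so the trace can be inspected afterwards.
class TracingParser(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, name, attrs):
        self.events.append(("start", name))

    def handle_endtag(self, name):
        self.events.append(("end", name))

    def handle_data(self, data):
        self.events.append(("data", data))

parser = TracingParser()
parser.feed("<p>Hi</p>")
print(parser.events)  # [('start', 'p'), ('data', 'Hi'), ('end', 'p')]
```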

bs4.diagnose.lxml_trace(data: _IncomingMarkup, html: bool = True, **kwargs: Any) None

Print out the lxml events that occur during parsing.

This lets you see how lxml parses a document when no Beautiful Soup code is running. You can use this to determine whether an lxml-specific problem is in Beautiful Soup's lxml tree builders or in lxml itself.

Parameters:
  • data -- Some markup.

  • html -- If True, markup will be parsed with lxml's HTML parser. If False, lxml's XML parser will be used.

bs4.diagnose.profile(num_elements: int = 100000, parser: str = 'lxml') None

Use Python's profiler on a randomly generated document.