bs4 package¶
Module contents¶
Beautiful Soup Elixir and Tonic - "The Screen-Scraper's Friend".
http://www.crummy.com/software/BeautifulSoup/
Beautiful Soup uses a pluggable XML or HTML parser to parse a (possibly invalid) document into a tree representation. Beautiful Soup provides methods and Pythonic idioms that make it easy to navigate, search, and modify the parse tree.
Beautiful Soup works with Python 3.7 and up. It works better if lxml and/or html5lib are installed, but they are not required.
For more than you ever wanted to know about Beautiful Soup, see the documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
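As a quick orientation, here is a minimal sketch of the core workflow (the markup string is invented for illustration):

```python
from bs4 import BeautifulSoup

# Parse a small document with Python's built-in parser; lxml or
# html5lib would be used instead if requested and installed.
html = "<html><body><p class='lead'>Hello, <b>world</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

lead = soup.find("p", class_="lead")   # first <p class="lead">
text = lead.get_text()                 # "Hello, world"
bold = soup.b.string                   # "world"
```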
- exception bs4.AttributeResemblesVariableWarning¶
Bases:
UnusualUsageWarning, SyntaxWarning
The warning issued when Beautiful Soup suspects a provided attribute name may actually be the misspelled name of a Beautiful Soup variable. Generally speaking, this is only used in cases like "_class" where it's very unlikely the user would be referencing an XML attribute with that name.
- MESSAGE: str = '%(original)r is an unusual attribute name and is a common misspelling for %(autocorrect)r.\n\nIf you meant %(autocorrect)r, change your code to use it, and this warning will go away.\n\nIf you really did mean to check the %(original)r attribute, this warning is spurious and can be filtered. To make it go away, run this code before creating your BeautifulSoup object:\n\n from bs4 import AttributeResemblesVariableWarning\n import warnings\n\n warnings.filterwarnings("ignore", category=AttributeResemblesVariableWarning)\n'¶
- class bs4.BeautifulSoup(markup: str | bytes | IO[str] | IO[bytes] = '', features: str | Sequence[str] | None = None, builder: TreeBuilder | Type[TreeBuilder] | None = None, parse_only: SoupStrainer | None = None, from_encoding: str | None = None, exclude_encodings: Iterable[str] | None = None, element_classes: Dict[Type[PageElement], Type[PageElement]] | None = None, **kwargs: Any)¶
Bases:
Tag
A data structure representing a parsed HTML or XML document.
Most of the methods you'll call on a BeautifulSoup object are inherited from PageElement or Tag.
Internally, this class defines the basic interface called by the tree builders when converting an HTML/XML document into a data structure. The interface abstracts away the differences between parsers. To write a new tree builder, you'll need to understand these methods as a whole.
- These methods will be called by the BeautifulSoup constructor:
reset()
feed(markup)
- The tree builder may call these methods from its feed() implementation:
handle_starttag(name, attrs) # See note about return value
handle_endtag(name)
handle_data(data) # Appends to the current data node
endData(containerClass) # Ends the current data node
No matter how complicated the underlying parser is, you should be able to build a tree using 'start tag' events, 'end tag' events, 'data' events, and "done with data" events.
If you encounter an empty-element tag (aka a self-closing tag, like HTML's <br> tag), call handle_starttag and then handle_endtag.
- ASCII_SPACES: str = ' \n\t\x0c\r'¶
A string containing all ASCII whitespace characters, used during parsing to detect data chunks that seem 'empty'.
- DEFAULT_BUILDER_FEATURES: Sequence[str] = ['html', 'fast']¶
If the end-user gives no indication which tree builder they want, look for one with these features.
- ROOT_TAG_NAME: str = '[document]'¶
Since BeautifulSoup subclasses Tag, it's possible to treat it as a Tag with a Tag.name. However, this name makes it clear the BeautifulSoup object isn't a real markup tag.
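A quick illustration of the synthetic root name (markup invented for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>hi</p>", "html.parser")
# The soup behaves like a Tag, but its name is the synthetic
# ROOT_TAG_NAME rather than the name of any real markup tag.
root_name = soup.name
```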
- contains_replacement_characters: bool¶
This is True if the markup that was parsed contains U+FFFD REPLACEMENT_CHARACTER characters which were not present in the original markup. These mark character sequences that could not be represented in Unicode.
- copy_self() BeautifulSoup ¶
Create a new BeautifulSoup object with the same TreeBuilder, but not associated with any markup.
This is the first step of the deepcopy process.
- declared_html_encoding: str | None¶
The character encoding, if any, that was explicitly defined in the original document. This may or may not match BeautifulSoup.original_encoding.
- decode(indent_level: int | None = None, eventual_encoding: str = 'utf-8', formatter: Formatter | str = 'minimal', iterator: Iterator[PageElement] | None = None, **kwargs: Any) str ¶
- Returns a string representation of the parse tree
as a full HTML or XML document.
- Parameters:
indent_level -- Each line of the rendering will be indented this many levels. (The formatter decides what a 'level' means, in terms of spaces or other characters output.) This is used internally in recursive calls while pretty-printing.
eventual_encoding -- The encoding of the final document. If this is None, the document will be a Unicode string.
formatter -- Either a Formatter object, or a string naming one of the standard formatters.
iterator -- The iterator to use when navigating over the parse tree. This is only used by Tag.decode_contents and you probably won't need to use it.
- insert_after(*args: PageElement | str) List[PageElement] ¶
This method is part of the PageElement API, but BeautifulSoup doesn't implement it because there is nothing before or after it in the parse tree.
- insert_before(*args: PageElement | str) List[PageElement] ¶
This method is part of the PageElement API, but BeautifulSoup doesn't implement it because there is nothing before or after it in the parse tree.
- new_string(s: str, subclass: Type[NavigableString] | None = None) NavigableString ¶
Create a new NavigableString associated with this BeautifulSoup object.
- Parameters:
s -- The string content of the NavigableString.
subclass -- The subclass of NavigableString, if any, to use. If a document is being processed, an appropriate subclass for the current location in the document will be determined automatically.
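For example, creating a plain string and a Comment via the subclass argument (markup invented for illustration):

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup("<p></p>", "html.parser")
plain = soup.new_string("hello")             # a NavigableString
note = soup.new_string(" a note ", Comment)  # a Comment, via `subclass`
soup.p.append(plain)
soup.p.append(note)
rendered = str(soup.p)
```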
- new_tag(name: str, namespace: str | None = None, nsprefix: str | None = None, attrs: Mapping[str | NamespacedAttribute, _RawAttributeValue] | None = None, sourceline: int | None = None, sourcepos: int | None = None, string: str | None = None, **kwattrs: str) Tag ¶
Create a new Tag associated with this BeautifulSoup object.
- Parameters:
name -- The name of the new Tag.
namespace -- The URI of the new Tag's XML namespace, if any.
nsprefix -- The prefix for the new Tag's XML namespace, if any.
attrs -- A dictionary of this Tag's attribute values; can be used instead of kwattrs for attributes like 'class' that are reserved words in Python.
sourceline -- The line number where this tag was (purportedly) found in its source document.
sourcepos -- The character position within sourceline where this tag was (purportedly) found.
string -- String content for the new Tag, if any.
kwattrs -- Keyword arguments for the new Tag's attribute values.
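A short sketch using the attrs and string arguments described above (markup invented for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>one</li></ul>", "html.parser")
# `attrs` handles names like 'class' that are reserved words in
# Python; `string` sets the new tag's text content.
li = soup.new_tag("li", attrs={"class": "extra"}, string="two")
soup.ul.append(li)
rendered = str(soup.ul)
```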
- original_encoding: str | None¶
Beautiful Soup's best guess as to the character encoding of the original document.
- string_container(base_class: Type[NavigableString] | None = None) Type[NavigableString] ¶
Find the class that should be instantiated to hold a given kind of string.
This may be a built-in Beautiful Soup class or a custom class passed in to the BeautifulSoup constructor.
- class bs4.CData(value: str | bytes)¶
Bases:
PreformattedString
- class bs4.CSS(tag: element.Tag, api: ModuleType | None = None)¶
Bases:
object
A proxy object against the soupsieve library, to simplify its CSS selector API.
You don't need to instantiate this class yourself; instead, use element.Tag.css.
- Parameters:
tag -- All CSS selectors run by this object will use this as their starting point.
api -- An optional drop-in replacement for the soupsieve module, intended for use in unit tests.
- closest(select: str, namespaces: _NamespaceMapping | None = None, flags: int = 0, **kwargs: Any) element.Tag | None ¶
Find the element.Tag closest to this one that matches the given selector.
This uses the Soup Sieve library. For more information, see that library's documentation for the soupsieve.closest() method.
- Parameters:
selector -- A string containing a CSS selector.
namespaces -- A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will pass in the prefixes it encountered while parsing the document.
flags --
Flags to be passed into Soup Sieve's soupsieve.closest() method.
kwargs --
Keyword arguments to be passed into Soup Sieve's soupsieve.closest() method.
- compile(select: str, namespaces: _NamespaceMapping | None = None, flags: int = 0, **kwargs: Any) SoupSieve ¶
Pre-compile a selector and return the compiled object.
- Parameters:
selector -- A CSS selector.
namespaces -- A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will use the prefixes it encountered while parsing the document.
flags -- Flags to be passed into Soup Sieve's soupsieve.compile() method.
kwargs --
Keyword arguments to be passed into Soup Sieve's soupsieve.compile() method.
- Returns:
A precompiled selector object.
- Return type:
soupsieve.SoupSieve
- escape(ident: str) str ¶
Escape a CSS identifier.
This is a simple wrapper around soupsieve.escape(). See the documentation for that function for more information.
- filter(select: str, namespaces: _NamespaceMapping | None = None, flags: int = 0, **kwargs: Any) ResultSet[element.Tag] ¶
Filter this element.Tag's direct children based on the given CSS selector.
This uses the Soup Sieve library. It works the same way as passing an element.Tag into that library's soupsieve.filter() method. For more information, see the documentation for soupsieve.filter().
- Parameters:
namespaces -- A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will pass in the prefixes it encountered while parsing the document.
flags --
Flags to be passed into Soup Sieve's soupsieve.filter() method.
kwargs --
Keyword arguments to be passed into SoupSieve's soupsieve.filter() method.
- iselect(select: str, namespaces: _NamespaceMapping | None = None, limit: int = 0, flags: int = 0, **kwargs: Any) Iterator[element.Tag] ¶
Perform a CSS selection operation on the current element.Tag.
This uses the Soup Sieve library. For more information, see that library's documentation for the soupsieve.iselect() method. It is the same as select(), but it returns a generator instead of a list.
- Parameters:
selector -- A string containing a CSS selector.
namespaces -- A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will pass in the prefixes it encountered while parsing the document.
limit -- After finding this number of results, stop looking.
flags --
Flags to be passed into Soup Sieve's soupsieve.iselect() method.
kwargs --
Keyword arguments to be passed into Soup Sieve's soupsieve.iselect() method.
- match(select: str, namespaces: Dict[str, str] | None = None, flags: int = 0, **kwargs: Any) bool ¶
Check whether or not this element.Tag matches the given CSS selector.
This uses the Soup Sieve library. For more information, see that library's documentation for the soupsieve.match() method.
- Parameters:
selector -- A CSS selector.
namespaces -- A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will pass in the prefixes it encountered while parsing the document.
flags --
Flags to be passed into Soup Sieve's soupsieve.match() method.
kwargs --
Keyword arguments to be passed into SoupSieve's soupsieve.match() method.
- select(select: str, namespaces: _NamespaceMapping | None = None, limit: int = 0, flags: int = 0, **kwargs: Any) ResultSet[element.Tag] ¶
Perform a CSS selection operation on the current element.Tag.
This uses the Soup Sieve library. For more information, see that library's documentation for the soupsieve.select() method.
- Parameters:
selector -- A CSS selector.
namespaces -- A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will pass in the prefixes it encountered while parsing the document.
limit -- After finding this number of results, stop looking.
flags --
Flags to be passed into Soup Sieve's soupsieve.select() method.
kwargs --
Keyword arguments to be passed into Soup Sieve's soupsieve.select() method.
- select_one(select: str, namespaces: _NamespaceMapping | None = None, flags: int = 0, **kwargs: Any) element.Tag | None ¶
Perform a CSS selection operation on the current Tag and return the first result, if any.
This uses the Soup Sieve library. For more information, see that library's documentation for the soupsieve.select_one() method.
- Parameters:
selector -- A CSS selector.
namespaces -- A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will use the prefixes it encountered while parsing the document.
flags --
Flags to be passed into Soup Sieve's soupsieve.select_one() method.
kwargs --
Keyword arguments to be passed into Soup Sieve's soupsieve.select_one() method.
- class bs4.Comment(value: str | bytes)¶
Bases:
PreformattedString
An HTML comment or XML comment.
- class bs4.Declaration(value: str | bytes)¶
Bases:
PreformattedString
An XML declaration.
- class bs4.Doctype(value: str | bytes)¶
Bases:
PreformattedString
- PREFIX: str = '<!DOCTYPE '¶
A string prepended to the body of the 'real' string when formatting it as part of a document, such as the '<!--' in an HTML comment.
- SUFFIX: str = '>\n'¶
A string appended to the body of the 'real' string when formatting it as part of a document, such as the '-->' in an HTML comment.
- classmethod for_name_and_ids(name: str, pub_id: str | None, system_id: str | None) Doctype ¶
Generate an appropriate document type declaration for a given public ID and system ID.
- Parameters:
name -- The name of the document's root element, e.g. 'html'.
pub_id -- The Formal Public Identifier for this document type, e.g. '-//W3C//DTD XHTML 1.1//EN'
system_id -- The system identifier for this document type, e.g. 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'
- class bs4.ElementFilter(match_function: Callable[[PageElement], bool] | None = None)¶
Bases:
object
ElementFilter encapsulates the logic necessary to decide:
1. whether a PageElement (a Tag or a NavigableString) matches a user-specified query.
2. whether a given sequence of markup found during initial parsing should be turned into a PageElement at all, or simply discarded.
The base class is the simplest ElementFilter. By default, it matches everything and allows all markup to become PageElement objects. You can make it more selective by passing in a user-defined match function, or defining a subclass.
Most users of Beautiful Soup will never need to use ElementFilter, or its more capable subclass SoupStrainer. Instead, they will use methods like Tag.find(), which will convert their arguments into SoupStrainer objects and run them against the tree.
However, if you find yourself wanting to treat the arguments to Beautiful Soup's find_*() methods as first-class objects, those objects will be SoupStrainer objects. You can create them yourself and then make use of functions like ElementFilter.filter().
- allow_string_creation(string: str) bool ¶
Based on the content of a string, see whether this ElementFilter will allow a NavigableString object based on this string to be added to the parse tree.
By default, all strings are processed into NavigableString objects. To change this, subclass ElementFilter.
- Parameters:
string -- The string under consideration.
- allow_tag_creation(nsprefix: str | None, name: str, attrs: _RawAttributeValues | None) bool ¶
Based on the name and attributes of a tag, see whether this ElementFilter will allow a Tag object to even be created.
By default, all tags are parsed. To change this, subclass ElementFilter.
- Parameters:
name -- The name of the prospective tag.
attrs -- The attributes of the prospective tag.
- property excludes_everything: bool¶
Does this ElementFilter obviously exclude everything? If so, Beautiful Soup will issue a warning if you try to use it when parsing a document.
The ElementFilter might turn out to exclude everything even if this returns False, but it won't exclude everything in an obvious way.
The base ElementFilter implementation excludes things based on a match function we can't inspect, so excludes_everything is always false.
- filter(generator: Iterator[PageElement]) Iterator[PageElement | Tag | NavigableString] ¶
The most generic search method offered by Beautiful Soup.
Acts like Python's built-in filter, using ElementFilter.match as the filtering function.
- find(generator: Iterator[PageElement]) PageElement | Tag | NavigableString | None ¶
A lower-level equivalent of Tag.find().
You can pass in your own generator for iterating over PageElement objects. The first one that matches this ElementFilter will be returned.
- Parameters:
generator -- A way of iterating over PageElement objects.
- find_all(generator: Iterator[PageElement], limit: int | None = None) ResultSet[PageElement | Tag | NavigableString] ¶
A lower-level equivalent of Tag.find_all().
You can pass in your own generator for iterating over PageElement objects. Only elements that match this ElementFilter will be returned in the ResultSet.
- Parameters:
generator -- A way of iterating over PageElement objects.
limit -- Stop looking after finding this many results.
- property includes_everything: bool¶
Does this ElementFilter obviously include everything? If so, the filter process can be made much faster.
The ElementFilter might turn out to include everything even if this returns False, but it won't include everything in an obvious way.
The base ElementFilter implementation includes things based on the match function, so includes_everything is only true if there is no match function.
- match(element: PageElement, _known_rules: bool = False) bool ¶
Does the given PageElement match the rules set down by this ElementFilter?
The base implementation delegates to the function passed in to the constructor.
- Parameters:
_known_rules -- Defined for compatibility with SoupStrainer._match(). Used more for consistency than because we need the performance optimization.
- match_function: Callable[[PageElement], bool] | None¶
- exception bs4.FeatureNotFound¶
Bases:
ValueError
Exception raised by the BeautifulSoup constructor if no parser with the requested features is found.
- exception bs4.GuessedAtParserWarning¶
Bases:
UserWarning
The warning issued when BeautifulSoup has to guess what parser to use -- probably because no parser was specified in the constructor.
- MESSAGE: str = 'No parser was explicitly specified, so I\'m using the best available %(markup_type)s parser for this system ("%(parser)s"). This usually isn\'t a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.\n\nThe code that caused this warning is on line %(line_number)s of the file %(filename)s. To get rid of this warning, pass the additional argument \'features="%(parser)s"\' to the BeautifulSoup constructor.\n'¶
- exception bs4.MarkupResemblesLocatorWarning¶
Bases:
UnusualUsageWarning
The warning issued when BeautifulSoup is given 'markup' that actually looks like a resource locator -- a URL or a path to a file on disk.
- FILENAME_MESSAGE: str = 'The input passed in on this line looks more like a filename than HTML or XML.\n\nIf you meant to use Beautiful Soup to parse the contents of a file on disk, then something has gone wrong. You should open the file first, using code like this:\n\n filehandle = open(your filename)\n\nYou can then feed the open filehandle into Beautiful Soup instead of using the filename.\n\nHowever, if you want to parse some data that happens to look like a %(what)s, then nothing has gone wrong: you are using Beautiful Soup correctly, and this warning is spurious and can be filtered. To make this warning go away, run this code before calling the BeautifulSoup constructor:\n\n from bs4 import MarkupResemblesLocatorWarning\n import warnings\n\n warnings.filterwarnings("ignore", category=MarkupResemblesLocatorWarning)\n '¶
- URL_MESSAGE: str = 'The input passed in on this line looks more like a URL than HTML or XML.\n\nIf you meant to use Beautiful Soup to parse the web page found at a certain URL, then something has gone wrong. You should use an Python package like \'requests\' to fetch the content behind the URL. Once you have the content as a string, you can feed that string into Beautiful Soup.\n\nHowever, if you want to parse some data that happens to look like a %(what)s, then nothing has gone wrong: you are using Beautiful Soup correctly, and this warning is spurious and can be filtered. To make this warning go away, run this code before calling the BeautifulSoup constructor:\n\n from bs4 import MarkupResemblesLocatorWarning\n import warnings\n\n warnings.filterwarnings("ignore", category=MarkupResemblesLocatorWarning)\n '¶
- exception bs4.ParserRejectedMarkup(message_or_exception: str | Exception)¶
Bases:
Exception
An Exception to be raised when the underlying parser simply refuses to parse the given markup.
- class bs4.ProcessingInstruction(value: str | bytes)¶
Bases:
PreformattedString
An SGML processing instruction.
- class bs4.ResultSet(source: ElementFilter | None, result: Sequence[_PageElementT] = ())¶
Bases:
Sequence[_PageElementT], Generic[_PageElementT]
A ResultSet is a sequence of PageElement objects, gathered as the result of matching an ElementFilter against a parse tree. Basically, a list of search results.
- result: Sequence[_PageElementT]¶
- source: ElementFilter | None¶
- class bs4.Script(value: str | bytes)¶
Bases:
NavigableString
A NavigableString representing the contents of a <script> HTML tag (probably Javascript).
Used to distinguish executable code from textual content.
- exception bs4.StopParsing¶
Bases:
Exception
Exception raised by a TreeBuilder if it's unable to continue parsing.
- class bs4.Stylesheet(value: str | bytes)¶
Bases:
NavigableString
A NavigableString representing the contents of a <style> HTML tag (probably CSS).
Used to distinguish embedded stylesheets from textual content.
- class bs4.Tag(parser: BeautifulSoup | None = None, builder: TreeBuilder | None = None, name: str | None = None, namespace: str | None = None, prefix: str | None = None, attrs: _RawOrProcessedAttributeValues | None = None, parent: BeautifulSoup | Tag | None = None, previous: _AtMostOneElement = None, is_xml: bool | None = None, sourceline: int | None = None, sourcepos: int | None = None, can_be_empty_element: bool | None = None, cdata_list_attributes: Dict[str, Set[str]] | None = None, preserve_whitespace_tags: Set[str] | None = None, interesting_string_types: Set[Type[NavigableString]] | None = None, namespaces: Dict[str, str] | None = None)¶
Bases:
PageElement
An HTML or XML tag that is part of a parse tree, along with its attributes, contents, and relationships to other parts of the tree.
When Beautiful Soup parses the markup <b>penguin</b>, it will create a Tag object representing the <b> tag. You can instantiate Tag objects directly, but it's not necessary unless you're adding entirely new markup to a parsed document. Most of the constructor arguments are intended for use by the TreeBuilder that's parsing a document.
- Parameters:
parser -- A BeautifulSoup object representing the parse tree this Tag will be part of.
builder -- The TreeBuilder being used to build the tree.
name -- The name of the tag.
namespace -- The URI of this tag's XML namespace, if any.
prefix -- The prefix for this tag's XML namespace, if any.
attrs -- A dictionary of attribute values.
parent -- The Tag to use as the parent of this Tag. May be the BeautifulSoup object itself.
previous -- The PageElement that was parsed immediately before parsing this tag.
is_xml -- If True, this is an XML tag. Otherwise, this is an HTML tag.
sourceline -- The line number where this tag was found in its source document.
sourcepos -- The character position within sourceline where this tag was found.
can_be_empty_element -- If True, this tag should be represented as <tag/>. If False, this tag should be represented as <tag></tag>.
cdata_list_attributes -- A dictionary of attributes whose values should be parsed as lists of strings if they ever show up on this tag.
preserve_whitespace_tags -- Names of tags whose contents should have their whitespace preserved if they are encountered inside this tag.
interesting_string_types -- When iterating over this tag's string contents in methods like Tag.strings or PageElement.get_text, these are the types of strings that are interesting enough to be considered. By default, NavigableString (normal strings) and CData (CDATA sections) are the only interesting string subtypes.
namespaces -- A dictionary mapping currently active namespace prefixes to URIs, as of the point in the parsing process when this tag was encountered. This can be used later to construct CSS selectors.
- append(tag: _InsertableElement) PageElement ¶
Appends the given PageElement to the contents of this Tag.
- Parameters:
tag -- A PageElement.
- Returns:
The newly appended PageElement.
- attrs: _AttributeValues¶
- property children: Iterator[PageElement]¶
Iterate over all direct children of this PageElement.
- clear(decompose: bool = False) None ¶
- Destroy all children of this Tag by calling PageElement.extract on them.
- Parameters:
decompose -- If this is True, PageElement.decompose (a more destructive method) will be called instead of PageElement.extract.
- contents: List[PageElement]¶
- copy_self() Self ¶
Create a new Tag just like this one, but with no contents and unattached to any parse tree.
This is the first step in the deepcopy process, but you can call it on its own to create a copy of a Tag without copying its contents.
- decode(indent_level: int | None = None, eventual_encoding: _Encoding = 'utf-8', formatter: _FormatterOrName = 'minimal', iterator: Iterator[PageElement] | None = None) str ¶
Render this Tag and its contents as a Unicode string.
- Parameters:
indent_level -- Each line of the rendering will be indented this many levels. (The formatter decides what a 'level' means, in terms of spaces or other characters output.) This is used internally in recursive calls while pretty-printing.
eventual_encoding -- The encoding you intend to use when converting the string to a bytestring. decode() is not responsible for performing that encoding. This information is needed so that a real encoding can be substituted in if the document contains an encoding declaration (e.g. in a <meta> tag).
formatter -- Either a Formatter object, or a string naming one of the standard formatters.
iterator -- The iterator to use when navigating over the parse tree. This is only used by Tag.decode_contents and you probably won't need to use it.
- decode_contents(indent_level: int | None = None, eventual_encoding: _Encoding = 'utf-8', formatter: _FormatterOrName = 'minimal') str ¶
Renders the contents of this tag as a Unicode string.
- Parameters:
indent_level -- Each line of the rendering will be indented this many levels. (The formatter decides what a 'level' means in terms of spaces or other characters output.) Used internally in recursive calls while pretty-printing.
eventual_encoding -- The tag is destined to be encoded into this encoding. decode_contents() is not responsible for performing that encoding. This information is needed so that a real encoding can be substituted in if the document contains an encoding declaration (e.g. in a <meta> tag).
formatter -- A Formatter object, or a string naming one of the standard Formatters.
- property descendants: Iterator[PageElement]¶
Iterate over all descendants of this Tag, in document order.
- encode(encoding: _Encoding = 'utf-8', indent_level: int | None = None, formatter: _FormatterOrName = 'minimal', errors: str = 'xmlcharrefreplace') bytes ¶
Render this Tag and its contents as a bytestring.
- Parameters:
encoding -- The encoding to use when converting to a bytestring. This may also affect the text of the document, specifically any encoding declarations within the document.
indent_level -- Each line of the rendering will be indented this many levels. (The formatter decides what a 'level' means, in terms of spaces or other characters output.) This is used internally in recursive calls while pretty-printing.
formatter -- Either a Formatter object, or a string naming one of the standard formatters.
errors -- An error handling strategy such as 'xmlcharrefreplace'. This value is passed along into str.encode() and its value should be one of the error handling constants defined by Python's codecs module.
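For example, with the default 'xmlcharrefreplace' strategy, a character that doesn't exist in the target encoding becomes a character reference (markup invented for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>caf\u00e9</p>", "html.parser")
utf8 = soup.p.encode("utf-8")    # bytes in the target encoding
ascii_ = soup.p.encode("ascii")  # unencodable chars become references
```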
- encode_contents(indent_level: int | None = None, encoding: _Encoding = 'utf-8', formatter: _FormatterOrName = 'minimal') bytes ¶
Renders the contents of this PageElement as a bytestring.
- Parameters:
indent_level -- Each line of the rendering will be indented this many levels. (The formatter decides what a 'level' means, in terms of spaces or other characters output.) This is used internally in recursive calls while pretty-printing.
formatter -- Either a Formatter object, or a string naming one of the standard formatters.
encoding -- The bytestring will be in this encoding.
- extend(tags: Iterable[_InsertableElement] | Tag) List[PageElement] ¶
Appends one or more objects to the contents of this Tag.
- Parameters:
tags -- If a list of PageElement objects is provided, they will be appended to this tag's contents, one at a time. If a single Tag is provided, its Tag.contents will be used to extend this object's Tag.contents.
- Returns:
The list of PageElements that were appended.
- find(name: _FindMethodName = None, attrs: _StrainableAttributes | None = None, recursive: bool = True, string: None = None, **kwargs: _StrainableAttribute) _AtMostOneTag ¶
- find(name: None = None, attrs: None = None, recursive: bool = True, string: _StrainableString = '') _AtMostOneNavigableString
Look in the children of this PageElement and find the first PageElement that matches the given criteria.
All find_* methods take a common set of arguments. See the online documentation for detailed explanations.
- Parameters:
name -- A filter on tag name.
attrs -- Additional filters on attribute values.
recursive -- If this is True, find() will perform a recursive search of this Tag's children. Otherwise, only the direct children will be considered.
string -- A filter on the Tag.string attribute.
- Kwargs:
Additional filters on attribute values.
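For example (markup invented for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>first</p><p id='b'>second</p>", "html.parser")
first = soup.find("p")          # first match in document order
by_id = soup.find("p", id="b")  # keyword arguments filter attributes
missing = soup.find("table")    # None when nothing matches
```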
- find_all(name: _FindMethodName = None, attrs: _StrainableAttributes | None = None, recursive: bool = True, string: None = None, limit: int | None = None, _stacklevel: int = 2, **kwargs: _StrainableAttribute) _SomeTags ¶
- find_all(name: None = None, attrs: None = None, recursive: bool = True, string: _StrainableString = '', limit: int | None = None, _stacklevel: int = 2, **kwargs: _StrainableAttribute) _SomeNavigableStrings
Look in the children of this
PageElement
and find allPageElement
objects that match the given criteria.All find_* methods take a common set of arguments. See the online documentation for detailed explanations.
- Parameters:
name -- A filter on tag name.
attrs -- Additional filters on attribute values.
recursive -- If this is True, find_all() will perform a recursive search of this PageElement's children. Otherwise, only the direct children will be considered.
limit -- Stop looking after finding this many results.
_stacklevel -- Used internally to improve warning messages.
- Kwargs:
Additional filters on attribute values.
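For example, combining a name filter, an attribute filter, and limit (markup invented for illustration):

```python
from bs4 import BeautifulSoup

html = "<a href='/one'>1</a><a href='/two' class='ext'>2</a><a href='/three'>3</a>"
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")              # every <a>
ext = soup.find_all("a", class_="ext")  # attribute filter
capped = soup.find_all("a", limit=2)    # stop after two results
```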
- get(key: str, default: _AttributeValue | None = None) _AttributeValue | None ¶
Returns the value of the 'key' attribute for the tag, or the value given for 'default' if it doesn't have that attribute.
- Parameters:
key -- The attribute to look for.
default -- Use this value if the attribute is not present on this Tag.
- get_attribute_list(key: str, default: AttributeValueList | None = None) AttributeValueList ¶
The same as get(), but always returns a (possibly empty) list.
- Parameters:
key -- The attribute to look for.
default -- Use this value if the attribute is not present on this Tag.
- Returns:
A list of strings, usually empty or containing only a single value.
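For instance (the markup here is illustrative):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="cool big" id="x">hi</p>', "html.parser")
tag = soup.p

# get() works like dict.get(): you get a default instead of a KeyError.
single = tag.get("id")               # "x"
fallback = tag.get("nope", "n/a")    # "n/a"

# get_attribute_list() always wraps the value in a list, which is handy
# for attributes like HTML's multi-valued "class".
id_list = tag.get_attribute_list("id")  # ["x"]
classes = tag.get("class")              # ["cool", "big"]
```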
- index(element: PageElement) int ¶
Find the index of a child of this Tag (by identity, not value).
Doing this by identity avoids issues when a Tag contains two children that have string equality.
- Parameters:
element -- Look for this PageElement in this object's contents.
- insert(position: int, *new_children: _InsertableElement) List[PageElement] ¶
Insert one or more new PageElements as a child of this Tag.
This works similarly to list.insert(), except you can insert multiple elements at once.
- Parameters:
position -- The numeric position that should be occupied in this Tag's Tag.children by the first new PageElement.
new_children -- The PageElements to insert.
- Returns:
The newly inserted PageElements.
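A minimal sketch of insert() placing a new child at a numeric position (a single insert here, which works across versions; the markup is illustrative):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b><b>three</b></p>", "html.parser")
p = soup.p

# Insert a string between the two <b> tags, at position 1.
p.insert(1, " and ")
```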
- interesting_string_types: Set[Type[NavigableString]] | None¶
- property is_empty_element: bool¶
Is this tag an empty-element tag? (aka a self-closing tag)
A tag that has contents is never an empty-element tag.
A tag that has no contents may or may not be an empty-element tag. It depends on the TreeBuilder used to create the tag.
If the builder has a designated list of empty-element tags, then only a tag whose name shows up in that list is considered an empty-element tag. This is usually the case for HTML documents.
If the builder has no designated list of empty-element tags, then any tag with no contents is an empty-element tag. This is usually the case for XML documents.
- parser_class: type[BeautifulSoup] | None¶
- prettify(encoding: None = None, formatter: _FormatterOrName = 'minimal') str ¶
- prettify(encoding: _Encoding, formatter: _FormatterOrName = 'minimal') bytes
Pretty-print this Tag as a string or bytestring.
- Parameters:
encoding -- The encoding of the bytestring, or None if you want Unicode.
formatter -- A Formatter object, or a string naming one of the standard formatters.
- Returns:
A string (if no encoding is provided) or a bytestring (otherwise).
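A small sketch of the two return types:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>hi</b></p>", "html.parser")

text = soup.prettify()                  # str, one tag per line
data = soup.prettify(encoding="utf-8")  # bytes in the named encoding
```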
- replaceWithChildren() _OneElement ¶
Deprecated alias for replace_with_children().
- replace_with_children() Self ¶
Replace this PageElement with its contents.
- Returns:
This object, no longer part of the tree.
- select(selector: str, namespaces: Dict[str, str] | None = None, limit: int = 0, **kwargs: Any) ResultSet[Tag] ¶
Perform a CSS selection operation on the current element.
This uses the SoupSieve library.
- Parameters:
selector -- A string containing a CSS selector.
namespaces -- A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will use the prefixes it encountered while parsing the document.
limit -- After finding this number of results, stop looking.
kwargs -- Keyword arguments to be passed into SoupSieve's soupsieve.select() method.
- select_one(selector: str, namespaces: Dict[str, str] | None = None, **kwargs: Any) Tag | None ¶
Perform a CSS selection operation on the current element.
- Parameters:
selector -- A CSS selector.
namespaces -- A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will use the prefixes it encountered while parsing the document.
kwargs -- Keyword arguments to be passed into Soup Sieve's soupsieve.select() method.
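For instance, using illustrative markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div id="menu"><a class="nav">Home</a><a class="nav">About</a></div>',
    "html.parser",
)

links = soup.select("div#menu a.nav")  # all matches, as a ResultSet
first = soup.select_one("a.nav")       # first match only
none_found = soup.select_one("table")  # no match -> None
```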
- property self_and_descendants: Iterator[PageElement]¶
Iterate over this Tag and its children in a breadth-first sequence.
- smooth() None ¶
Smooth out the children of this Tag by consolidating consecutive strings.
If you perform a lot of operations that modify the tree, calling this method afterwards can make pretty-printed output look more natural.
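A sketch of the consolidation smooth() performs:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>one</p>", "html.parser")
p = soup.p
p.append(" two")  # <p> now holds two adjacent NavigableStrings

before = len(p.contents)  # 2 separate strings
p.smooth()                # consecutive strings are merged
after = len(p.contents)   # 1 string
```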
- property string: str | None¶
Convenience property to get the single string within this Tag, assuming there is just one.
- Returns:
If this Tag has a single child that's a NavigableString, the return value is that string. If this element has one child Tag, the return value is that child's Tag.string, recursively. If this Tag has no children, or has more than one child, the return value is None.
If this property is unexpectedly returning None for you, it's probably because your Tag has more than one thing inside it.
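Both cases in miniature:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b><i>one</i></b><p>two <i>three</i></p>", "html.parser")

chain = soup.b.string  # "one" -- a single-child chain resolves recursively
multi = soup.p.string  # None  -- <p> has more than one thing inside it
```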
- property strings: Iterator[str]¶
Yield all strings of certain classes, possibly stripping them.
- Parameters:
strip -- If True, all strings will be stripped before being yielded.
types -- A tuple of NavigableString subclasses. Any strings of a subclass not found in this list will be ignored. By default, the subclasses considered are the ones found in self.interesting_string_types. If that's not specified, only NavigableString and CData objects will be considered. That means no comments, processing instructions, etc.
- unwrap() Self ¶
Replace this PageElement with its contents.
- Returns:
This object, no longer part of the tree.
- class bs4.TemplateString(value: str | bytes)¶
Bases:
NavigableString
A NavigableString representing a string found inside an HTML <template> tag embedded in a larger document.
Used to distinguish such strings from the main body of the document.
- class bs4.UnicodeDammit(markup: bytes, known_definite_encodings: Iterable[str] | None = [], smart_quotes_to: Literal['ascii', 'xml', 'html'] | None = None, is_html: bool = False, exclude_encodings: Iterable[str] | None = [], user_encodings: Iterable[str] | None = None, override_encodings: Iterable[str] | None = None)¶
Bases:
object
A class for detecting the encoding of a bytestring containing an HTML or XML document, and decoding it to Unicode. If the source encoding is windows-1252, UnicodeDammit can also replace Microsoft smart quotes with their HTML or XML equivalents.
- Parameters:
markup -- HTML or XML markup in an unknown encoding.
known_definite_encodings -- When determining the encoding of markup, these encodings will be tried first, in order. In HTML terms, this corresponds to the "known definite encoding" step defined in section 13.2.3.1 of the HTML standard.
user_encodings -- These encodings will be tried after the known_definite_encodings have been tried and failed, and after an attempt to sniff the encoding by looking at a byte order mark has failed. In HTML terms, this corresponds to the step "user has explicitly instructed the user agent to override the document's character encoding", defined in section 13.2.3.2 of the HTML standard.
override_encodings -- A deprecated alias for known_definite_encodings. Any encodings here will be tried immediately after the encodings in known_definite_encodings.
smart_quotes_to -- By default, Microsoft smart quotes will, like all other characters, be converted to Unicode characters. Setting this to ascii will convert them to ASCII quotes instead. Setting it to xml will convert them to XML entity references, and setting it to html will convert them to HTML entity references.
is_html -- If True, markup is treated as an HTML document. Otherwise it's treated as an XML document.
exclude_encodings -- These encodings will not be considered, even if the sniffing code thinks they might make sense.
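A minimal sketch of UnicodeDammit at work; the bytestring below is Latin-1-encoded French:

```python
from bs4 import UnicodeDammit

dammit = UnicodeDammit(b"Sacr\xe9 bleu!", ["latin-1", "iso-8859-1"])

converted = dammit.unicode_markup    # "Sacré bleu!"
guessed = dammit.original_encoding   # "latin-1"
```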
- CHARSET_ALIASES: Dict[str, str]¶
This dictionary maps commonly seen values for "charset" in HTML meta tags to the corresponding Python codec names. It only covers values that aren't in Python's aliases and can't be determined by the heuristics in find_codec.
- ENCODINGS_WITH_SMART_QUOTES: Iterable[str]¶
A list of encodings that tend to contain Microsoft smart quotes.
- MS_CHARS: Dict[bytes, str | Tuple[str, str]]¶
A partial mapping of ISO-Latin-1 to HTML entities/XML numeric entities.
- WINDOWS_1252_TO_UTF8: Dict[int, bytes]¶
A map used when removing rogue Windows-1252/ISO-8859-1 characters in otherwise UTF-8 documents.
Note that \x81, \x8d, \x8f, \x90, and \x9d are undefined in Windows-1252.
- contains_replacement_characters: bool¶
This is True if UnicodeDammit.unicode_markup contains U+FFFD REPLACEMENT CHARACTER characters which were not present in UnicodeDammit.markup. These mark character sequences that could not be represented in Unicode.
- property declared_html_encoding: str | None¶
If the markup is an HTML document, returns the encoding, if any, declared inside the document.
- classmethod detwingle(in_bytes: bytes, main_encoding: str = 'utf8', embedded_encoding: str = 'windows-1252') bytes ¶
Fix characters from one encoding embedded in some other encoding.
Currently the only situation supported is Windows-1252 (or its subset ISO-8859-1), embedded in UTF-8.
- Parameters:
in_bytes -- A bytestring that you suspect contains characters from multiple encodings. Note that this must be a bytestring. If you've already converted the document to Unicode, you're too late.
main_encoding -- The primary encoding of in_bytes.
embedded_encoding -- The encoding that was used to embed characters in the main document.
- Returns:
A bytestring similar to in_bytes, in which embedded_encoding characters have been converted to their main_encoding equivalents.
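The classic detwingle() scenario: a mostly UTF-8 document with Windows-1252 smart quotes pasted in.

```python
from bs4 import UnicodeDammit

snowmen = "\N{SNOWMAN}" * 3
quote = "\N{LEFT DOUBLE QUOTATION MARK}I like snowmen!\N{RIGHT DOUBLE QUOTATION MARK}"

# Two encodings mixed in one bytestring; doc.decode("utf8") would fail.
doc = snowmen.encode("utf8") + quote.encode("windows_1252")

fixed = UnicodeDammit.detwingle(doc)  # now pure UTF-8
repaired = fixed.decode("utf8")
```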
- find_codec(charset: str) str | None ¶
Look up the Python codec corresponding to a given character set.
- Parameters:
charset -- The name of a character set.
- Returns:
The name of a Python codec.
- markup: bytes¶
The original markup, before it was converted to Unicode. This is not necessarily the same as what was passed in to the constructor, since any byte-order mark will be stripped.
- original_encoding: str | None¶
Unicode, Dammit's best guess as to the original character encoding of UnicodeDammit.markup.
- exception bs4.UnusualUsageWarning¶
Bases:
UserWarning
A superclass for warnings issued when Beautiful Soup sees something that is typically the result of a mistake in the calling code, but might be intentional on the part of the user. If it is in fact intentional, you can filter the individual warning class to get rid of the warning. If you don't like Beautiful Soup second-guessing what you are doing, you can filter the UnusualUsageWarning class itself and get rid of these entirely.
- exception bs4.XMLParsedAsHTMLWarning¶
Bases:
UnusualUsageWarning
The warning issued when an HTML parser is used to parse XML that is not (as far as we can tell) XHTML.
- MESSAGE: str = 'It looks like you\'re using an HTML parser to parse an XML document.\n\nAssuming this really is an XML document, what you\'re doing might work, but you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the Python package \'lxml\' installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.\n\nIf you want or need to use an HTML parser on this document, you can make this warning go away by filtering it. To do that, run this code before calling the BeautifulSoup constructor:\n\n from bs4 import XMLParsedAsHTMLWarning\n import warnings\n\n warnings.filterwarnings("ignore", category=XMLParsedAsHTMLWarning)\n'¶
Subpackages¶
Submodules¶
bs4.css module¶
Integration code for CSS selectors using Soup Sieve (pypi: soupsieve).
Acquire a CSS object through the element.Tag.css attribute of the starting point of your CSS selector, or (if you want to run a selector against the entire document) of the BeautifulSoup object itself.
The main advantage of doing this instead of using soupsieve functions is that you don't need to keep passing the element.Tag to be selected against, since the CSS object is permanently scoped to that element.Tag.
- class bs4.css.CSS(tag: element.Tag, api: ModuleType | None = None)¶
Bases:
object
A proxy object against the soupsieve library, to simplify its CSS selector API.
You don't need to instantiate this class yourself; instead, use element.Tag.css.
- Parameters:
tag -- All CSS selectors run by this object will use this as their starting point.
api -- An optional drop-in replacement for the soupsieve module, intended for use in unit tests.
- closest(select: str, namespaces: _NamespaceMapping | None = None, flags: int = 0, **kwargs: Any) element.Tag | None ¶
Find the element.Tag closest to this one that matches the given selector.
This uses the Soup Sieve library. For more information, see that library's documentation for the soupsieve.closest() method.
- Parameters:
selector -- A string containing a CSS selector.
namespaces -- A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will pass in the prefixes it encountered while parsing the document.
flags -- Flags to be passed into Soup Sieve's soupsieve.closest() method.
kwargs -- Keyword arguments to be passed into Soup Sieve's soupsieve.closest() method.
- compile(select: str, namespaces: _NamespaceMapping | None = None, flags: int = 0, **kwargs: Any) SoupSieve ¶
Pre-compile a selector and return the compiled object.
- Parameters:
selector -- A CSS selector.
namespaces -- A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will use the prefixes it encountered while parsing the document.
flags -- Flags to be passed into Soup Sieve's soupsieve.compile() method.
kwargs -- Keyword arguments to be passed into Soup Sieve's soupsieve.compile() method.
- Returns:
A precompiled selector object.
- Return type:
soupsieve.SoupSieve
- escape(ident: str) str ¶
Escape a CSS identifier.
This is a simple wrapper around soupsieve.escape(). See the documentation for that function for more information.
- filter(select: str, namespaces: _NamespaceMapping | None = None, flags: int = 0, **kwargs: Any) ResultSet[element.Tag] ¶
Filter this element.Tag's direct children based on the given CSS selector.
This uses the Soup Sieve library. It works the same way as passing an element.Tag into that library's soupsieve.filter() method. For more information, see the documentation for soupsieve.filter().
- Parameters:
namespaces -- A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will pass in the prefixes it encountered while parsing the document.
flags -- Flags to be passed into Soup Sieve's soupsieve.filter() method.
kwargs -- Keyword arguments to be passed into Soup Sieve's soupsieve.filter() method.
- iselect(select: str, namespaces: _NamespaceMapping | None = None, limit: int = 0, flags: int = 0, **kwargs: Any) Iterator[element.Tag] ¶
Perform a CSS selection operation on the current element.Tag.
This uses the Soup Sieve library. For more information, see that library's documentation for the soupsieve.iselect() method. It is the same as select(), but it returns a generator instead of a list.
- Parameters:
selector -- A string containing a CSS selector.
namespaces -- A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will pass in the prefixes it encountered while parsing the document.
limit -- After finding this number of results, stop looking.
flags -- Flags to be passed into Soup Sieve's soupsieve.iselect() method.
kwargs -- Keyword arguments to be passed into Soup Sieve's soupsieve.iselect() method.
- match(select: str, namespaces: Dict[str, str] | None = None, flags: int = 0, **kwargs: Any) bool ¶
Check whether or not this element.Tag matches the given CSS selector.
This uses the Soup Sieve library. For more information, see that library's documentation for the soupsieve.match() method.
- Parameters:
selector -- A CSS selector.
namespaces -- A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will pass in the prefixes it encountered while parsing the document.
flags -- Flags to be passed into Soup Sieve's soupsieve.match() method.
kwargs -- Keyword arguments to be passed into Soup Sieve's soupsieve.match() method.
- select(select: str, namespaces: _NamespaceMapping | None = None, limit: int = 0, flags: int = 0, **kwargs: Any) ResultSet[element.Tag] ¶
Perform a CSS selection operation on the current element.Tag.
This uses the Soup Sieve library. For more information, see that library's documentation for the soupsieve.select() method.
- Parameters:
selector -- A CSS selector.
namespaces -- A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will pass in the prefixes it encountered while parsing the document.
limit -- After finding this number of results, stop looking.
flags -- Flags to be passed into Soup Sieve's soupsieve.select() method.
kwargs -- Keyword arguments to be passed into Soup Sieve's soupsieve.select() method.
- select_one(select: str, namespaces: _NamespaceMapping | None = None, flags: int = 0, **kwargs: Any) element.Tag | None ¶
Perform a CSS selection operation on the current Tag and return the first result, if any.
This uses the Soup Sieve library. For more information, see that library's documentation for the soupsieve.select_one() method.
- Parameters:
selector -- A CSS selector.
namespaces -- A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will use the prefixes it encountered while parsing the document.
flags -- Flags to be passed into Soup Sieve's soupsieve.select_one() method.
kwargs -- Keyword arguments to be passed into Soup Sieve's soupsieve.select_one() method.
bs4.dammit module¶
Beautiful Soup bonus library: Unicode, Dammit
This library converts a bytestream to Unicode through any means necessary. It is heavily based on code from Mark Pilgrim's Universal Feed Parser, now maintained by Kurt McKee. It does not rewrite the body of an XML or HTML document to reflect a new encoding; that's the job of TreeBuilder.
- class bs4.dammit.EncodingDetector(markup: bytes, known_definite_encodings: Iterable[str] | None = None, is_html: bool | None = False, exclude_encodings: Iterable[str] | None = None, user_encodings: Iterable[str] | None = None, override_encodings: Iterable[str] | None = None)¶
Bases:
object
This class is capable of guessing a number of possible encodings for a bytestring.
Order of precedence:
Encodings you specifically tell EncodingDetector to try first (the known_definite_encodings argument to the constructor).
An encoding determined by sniffing the document's byte-order mark.
Encodings you specifically tell EncodingDetector to try if byte-order mark sniffing fails (the user_encodings argument to the constructor).
An encoding declared within the bytestring itself, either in an XML declaration (if the bytestring is to be interpreted as an XML document), or in a <meta> tag (if the bytestring is to be interpreted as an HTML document.)
An encoding detected through textual analysis by chardet, cchardet, or a similar external library.
UTF-8.
Windows-1252.
- Parameters:
markup -- Some markup in an unknown encoding.
known_definite_encodings -- When determining the encoding of markup, these encodings will be tried first, in order. In HTML terms, this corresponds to the "known definite encoding" step defined in section 13.2.3.1 of the HTML standard.
user_encodings -- These encodings will be tried after the known_definite_encodings have been tried and failed, and after an attempt to sniff the encoding by looking at a byte order mark has failed. In HTML terms, this corresponds to the step "user has explicitly instructed the user agent to override the document's character encoding", defined in section 13.2.3.2 of the HTML standard.
override_encodings -- A deprecated alias for known_definite_encodings. Any encodings here will be tried immediately after the encodings in known_definite_encodings.
is_html -- If True, this markup is considered to be HTML. Otherwise it's assumed to be XML.
exclude_encodings -- These encodings will not be tried, even if they otherwise would be.
- property encodings: Iterator[str]¶
Yield a number of encodings that might work for this markup.
- Yield:
A sequence of strings. Each is the name of an encoding that might work to convert a bytestring into Unicode.
- classmethod find_declared_encoding(markup: bytes | str, is_html: bool = False, search_entire_document: bool = False) str | None ¶
Given a document, tries to find an encoding declared within the text of the document itself.
An XML encoding is declared at the beginning of the document.
An HTML encoding is declared in a <meta> tag, hopefully near the beginning of the document.
- Parameters:
markup -- Some markup.
is_html -- If True, this markup is considered to be HTML. Otherwise it's assumed to be XML.
search_entire_document -- Since an encoding is supposed to be declared near the beginning of the document, most of the time it's only necessary to search a few kilobytes of data. Set this to True to force this method to search the entire document.
- Returns:
The declared encoding, if one is found.
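A small sketch of both declaration styles (the documents here are illustrative):

```python
from bs4.dammit import EncodingDetector

xml = b'<?xml version="1.0" encoding="utf-8"?><doc/>'
html = b'<html><head><meta charset="utf-8"></head><body></body></html>'

# XML declarations are checked by default; pass is_html=True for <meta> tags.
declared_xml = EncodingDetector.find_declared_encoding(xml)
declared_html = EncodingDetector.find_declared_encoding(html, is_html=True)
```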
- classmethod strip_byte_order_mark(data: bytes) Tuple[bytes, str | None] ¶
If a byte-order mark is present, strip it and return the encoding it implies.
- Parameters:
data -- A bytestring that may or may not begin with a byte-order mark.
- Returns:
A 2-tuple (data stripped of byte-order mark, encoding implied by byte-order mark)
- class bs4.dammit.EntitySubstitution¶
Bases:
object
The ability to substitute XML or HTML entities for certain characters.
- ANY_ENTITY_RE = re.compile('&(#\\d+|#x[0-9a-fA-F]+|\\w+);', re.IGNORECASE)¶
- BARE_AMPERSAND_OR_BRACKET: Pattern[str]¶
A regular expression matching an angle bracket or an ampersand that is not part of an XML or HTML entity.
- CHARACTER_TO_HTML_ENTITY: Dict[str, str]¶
A map of Unicode strings to the corresponding named HTML entities; the inverse of HTML_ENTITY_TO_CHARACTER.
- CHARACTER_TO_HTML_ENTITY_RE: Pattern[str]¶
A regular expression that matches any character (or, in rare cases, pair of characters) that can be replaced with a named HTML entity.
- CHARACTER_TO_HTML_ENTITY_WITH_AMPERSAND_RE: Pattern[str]¶
A very similar regular expression to CHARACTER_TO_HTML_ENTITY_RE, but which also matches unescaped ampersands. This is used by the 'html' formatter to provide backwards compatibility, even though the HTML5 spec allows most ampersands to go unescaped.
- CHARACTER_TO_XML_ENTITY: Dict[str, str]¶
A map of Unicode strings to the corresponding named XML entities.
- HTML_ENTITY_TO_CHARACTER: Dict[str, str]¶
A map of named HTML entities to the corresponding Unicode string.
- classmethod quoted_attribute_value(value: str) str ¶
Make a value into a quoted XML attribute, possibly escaping it.
Most strings will be quoted using double quotes.
Bob's Bar -> "Bob's Bar"
If a string contains double quotes, it will be quoted using single quotes.
Welcome to "my bar" -> 'Welcome to "my bar"'
If a string contains both single and double quotes, the double quotes will be escaped, and the string will be quoted using double quotes.
Welcome to "Bob's Bar" -> "Welcome to &quot;Bob's Bar&quot;"
- Parameters:
value -- The XML attribute value to quote
- Returns:
The quoted value
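The two common cases from the description above, in miniature:

```python
from bs4.dammit import EntitySubstitution

# No double quotes in the value: quoted with double quotes.
q1 = EntitySubstitution.quoted_attribute_value("Bob's Bar")

# Value contains double quotes: quoted with single quotes instead.
q2 = EntitySubstitution.quoted_attribute_value('Welcome to "my bar"')
```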
- classmethod substitute_html(s: str) str ¶
Replace certain Unicode characters with named HTML entities.
This differs from data.encode(encoding, 'xmlcharrefreplace') in that the goal is to make the result more readable (to those with ASCII displays) rather than to recover from errors. There's absolutely nothing wrong with a UTF-8 string containing a LATIN SMALL LETTER E WITH ACUTE, but replacing that character with "&eacute;" will make it more readable to some people.
- Parameters:
s -- The string to be modified.
- Returns:
The string with some Unicode characters replaced with HTML entities.
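For instance:

```python
from bs4.dammit import EntitySubstitution

# The accented character is replaced with a named HTML entity.
readable = EntitySubstitution.substitute_html("caf\N{LATIN SMALL LETTER E WITH ACUTE}")
```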
- classmethod substitute_html5(s: str) str ¶
Replace certain Unicode characters with named HTML entities using HTML5 rules.
Specifically, this method is much less aggressive about escaping ampersands than substitute_html. Only ambiguous ampersands are escaped, per the HTML5 standard:
"An ambiguous ampersand is a U+0026 AMPERSAND character (&) that is followed by one or more ASCII alphanumerics, followed by a U+003B SEMICOLON character (;), where these characters do not match any of the names given in the named character references section."
Unlike substitute_html5_raw, this method assumes HTML entities were converted to Unicode characters on the way in, as Beautiful Soup does. By the time Beautiful Soup does its work, the only ambiguous ampersands that need to be escaped are the ones that were escaped in the original markup when mentioning HTML entities.
- Parameters:
s -- The string to be modified.
- Returns:
The string with some Unicode characters replaced with HTML entities.
- classmethod substitute_html5_raw(s: str) str ¶
Replace certain Unicode characters with named HTML entities using HTML5 rules.
substitute_html5_raw is similar to substitute_html5 but it is designed for standalone use (whereas substitute_html5 is designed for use with Beautiful Soup).
- Parameters:
s -- The string to be modified.
- Returns:
The string with some Unicode characters replaced with HTML entities.
- classmethod substitute_xml(value: str, make_quoted_attribute: bool = False) str ¶
Replace special XML characters with named XML entities.
The less-than sign will become &lt;, the greater-than sign will become &gt;, and any ampersands will become &amp;. If you want ampersands that seem to be part of an entity definition to be left alone, use substitute_xml_containing_entities instead.
- Parameters:
value -- A string to be substituted.
make_quoted_attribute -- If True, then the string will be quoted, as befits an attribute value.
- Returns:
A version of value with special characters replaced with named entities.
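A short sketch of the substitution:

```python
from bs4.dammit import EntitySubstitution

# <, > and & each become the corresponding named XML entity.
plain = EntitySubstitution.substitute_xml("3 < 5 & 5 > 3")

# With make_quoted_attribute=True the result is also quoted,
# as befits an attribute value.
attr = EntitySubstitution.substitute_xml('say "hi"', make_quoted_attribute=True)
```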
- classmethod substitute_xml_containing_entities(value: str, make_quoted_attribute: bool = False) str ¶
Substitute XML entities for special XML characters.
- Parameters:
value -- A string to be substituted. The less-than sign will become &lt;, the greater-than sign will become &gt;, and any ampersands that are not part of an entity definition will become &amp;.
make_quoted_attribute -- If True, then the string will be quoted, as befits an attribute value.
- class bs4.dammit.UnicodeDammit(markup: bytes, known_definite_encodings: Iterable[str] | None = [], smart_quotes_to: Literal['ascii', 'xml', 'html'] | None = None, is_html: bool = False, exclude_encodings: Iterable[str] | None = [], user_encodings: Iterable[str] | None = None, override_encodings: Iterable[str] | None = None)¶
Bases:
object
A class for detecting the encoding of a bytestring containing an HTML or XML document, and decoding it to Unicode. If the source encoding is windows-1252, UnicodeDammit can also replace Microsoft smart quotes with their HTML or XML equivalents.
- Parameters:
markup -- HTML or XML markup in an unknown encoding.
known_definite_encodings -- When determining the encoding of markup, these encodings will be tried first, in order. In HTML terms, this corresponds to the "known definite encoding" step defined in section 13.2.3.1 of the HTML standard.
user_encodings -- These encodings will be tried after the known_definite_encodings have been tried and failed, and after an attempt to sniff the encoding by looking at a byte order mark has failed. In HTML terms, this corresponds to the step "user has explicitly instructed the user agent to override the document's character encoding", defined in section 13.2.3.2 of the HTML standard.
override_encodings -- A deprecated alias for known_definite_encodings. Any encodings here will be tried immediately after the encodings in known_definite_encodings.
smart_quotes_to -- By default, Microsoft smart quotes will, like all other characters, be converted to Unicode characters. Setting this to ascii will convert them to ASCII quotes instead. Setting it to xml will convert them to XML entity references, and setting it to html will convert them to HTML entity references.
is_html -- If True, markup is treated as an HTML document. Otherwise it's treated as an XML document.
exclude_encodings -- These encodings will not be considered, even if the sniffing code thinks they might make sense.
- CHARSET_ALIASES: Dict[str, str]¶
This dictionary maps commonly seen values for "charset" in HTML meta tags to the corresponding Python codec names. It only covers values that aren't in Python's aliases and can't be determined by the heuristics in find_codec.
- ENCODINGS_WITH_SMART_QUOTES: Iterable[str]¶
A list of encodings that tend to contain Microsoft smart quotes.
- MS_CHARS: Dict[bytes, str | Tuple[str, str]]¶
A partial mapping of ISO-Latin-1 to HTML entities/XML numeric entities.
- WINDOWS_1252_TO_UTF8: Dict[int, bytes]¶
A map used when removing rogue Windows-1252/ISO-8859-1 characters in otherwise UTF-8 documents.
Note that \x81, \x8d, \x8f, \x90, and \x9d are undefined in Windows-1252.
- contains_replacement_characters: bool¶
This is True if UnicodeDammit.unicode_markup contains U+FFFD REPLACEMENT CHARACTER characters which were not present in UnicodeDammit.markup. These mark character sequences that could not be represented in Unicode.
- property declared_html_encoding: str | None¶
If the markup is an HTML document, returns the encoding, if any, declared inside the document.
- classmethod detwingle(in_bytes: bytes, main_encoding: str = 'utf8', embedded_encoding: str = 'windows-1252') bytes ¶
Fix characters from one encoding embedded in some other encoding.
Currently the only situation supported is Windows-1252 (or its subset ISO-8859-1), embedded in UTF-8.
- Parameters:
in_bytes -- A bytestring that you suspect contains characters from multiple encodings. Note that this must be a bytestring. If you've already converted the document to Unicode, you're too late.
main_encoding -- The primary encoding of in_bytes.
embedded_encoding -- The encoding that was used to embed characters in the main document.
- Returns:
A bytestring similar to in_bytes, in which embedded_encoding characters have been converted to their main_encoding equivalents.
- find_codec(charset: str) str | None ¶
Look up the Python codec corresponding to a given character set.
- Parameters:
charset -- The name of a character set.
- Returns:
The name of a Python codec.
- markup: bytes¶
The original markup, before it was converted to Unicode. This is not necessarily the same as what was passed in to the constructor, since any byte-order mark will be stripped.
- original_encoding: str | None¶
Unicode, Dammit's best guess as to the original character encoding of UnicodeDammit.markup.
bs4.element module¶
- class bs4.element.AttributeDict¶
Superclass for the dictionary used to hold a tag's attributes. You can use this, but it's just a regular dict with no special logic.
- class bs4.element.AttributeValueList(iterable=(), /)¶
Class for the list used to hold the values of attributes which have multiple values (such as HTML's 'class'). It's just a regular list, but you can subclass it and pass it in to the TreeBuilder constructor as attribute_value_list_class, to have your subclass instantiated instead.
- class bs4.element.AttributeValueWithCharsetSubstitution¶
Bases:
str
An abstract class standing in for a character encoding specified inside an HTML <meta> tag.
Subclasses exist for each place such a character encoding might be found: either inside the charset attribute (CharsetMetaAttributeValue) or inside the content attribute (ContentMetaAttributeValue).
This allows Beautiful Soup to replace that part of the HTML file with a different encoding when outputting a tree as a string.
- class bs4.element.CData(value: str | bytes)¶
Bases:
PreformattedString
- PREFIX: str = '<![CDATA['¶
A string prepended to the body of the 'real' string when formatting it as part of a document, such as the '<!--' in an HTML comment.
- SUFFIX: str = ']]>'¶
A string appended to the body of the 'real' string when formatting it as part of a document, such as the '-->' in an HTML comment.
- next_element: _AtMostOneElement¶
- next_sibling: _AtMostOneElement¶
- previous_element: _AtMostOneElement¶
- previous_sibling: _AtMostOneElement¶
- class bs4.element.CharsetMetaAttributeValue(original_value: str)¶
Bases:
AttributeValueWithCharsetSubstitution
A generic stand-in for the value of a <meta> tag's charset attribute.
When Beautiful Soup parses the markup <meta charset="utf8">, the value of the charset attribute will become one of these objects.
If the document is later encoded to an encoding other than UTF-8, its <meta> tag will mention the new encoding instead of utf8.
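A minimal sketch of this substitution, using the bundled html.parser builder (the markup here is illustrative):

```python
from bs4 import BeautifulSoup

# Parse markup whose <meta> tag declares utf8.
soup = BeautifulSoup(
    '<html><head><meta charset="utf8"></head><body>caf\xe9</body></html>',
    'html.parser',
)

# Encoding to a different character set rewrites the charset attribute
# so the declaration matches the actual bytes of the output.
encoded = soup.encode('iso-8859-1')
```

The resulting bytestring mentions iso-8859-1 in its <meta> tag rather than utf8.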
- class bs4.element.Comment(value: str | bytes)¶
Bases:
PreformattedString
An HTML comment or XML comment.
- PREFIX: str = '<!--'¶
A string prepended to the body of the 'real' string when formatting it as part of a document, such as the '<!--' in an HTML comment.
- SUFFIX: str = '-->'¶
A string appended to the body of the 'real' string when formatting it as part of a document, such as the '-->' in an HTML comment.
- next_element: _AtMostOneElement¶
- next_sibling: _AtMostOneElement¶
- previous_element: _AtMostOneElement¶
- previous_sibling: _AtMostOneElement¶
- class bs4.element.ContentMetaAttributeValue(original_value: str)¶
Bases:
AttributeValueWithCharsetSubstitution
A generic stand-in for the value of a <meta> tag's content attribute.
When Beautiful Soup parses the markup:
<meta http-equiv="content-type" content="text/html; charset=utf8">
the value of the content attribute will become one of these objects.
If the document is later encoded to an encoding other than UTF-8, its <meta> tag will mention the new encoding instead of utf8.
- bs4.element.DEFAULT_OUTPUT_ENCODING: str = 'utf-8'¶
Documents output by Beautiful Soup will be encoded with this encoding unless you specify otherwise.
- class bs4.element.Declaration(value: str | bytes)¶
Bases:
PreformattedString
An XML declaration.
- PREFIX: str = '<?'¶
A string prepended to the body of the 'real' string when formatting it as part of a document, such as the '<!--' in an HTML comment.
- SUFFIX: str = '?>'¶
A string appended to the body of the 'real' string when formatting it as part of a document, such as the '-->' in an HTML comment.
- next_element: _AtMostOneElement¶
- next_sibling: _AtMostOneElement¶
- previous_element: _AtMostOneElement¶
- previous_sibling: _AtMostOneElement¶
- class bs4.element.Doctype(value: str | bytes)¶
Bases:
PreformattedString
- PREFIX: str = '<!DOCTYPE '¶
A string prepended to the body of the 'real' string when formatting it as part of a document, such as the '<!--' in an HTML comment.
- SUFFIX: str = '>\n'¶
A string appended to the body of the 'real' string when formatting it as part of a document, such as the '-->' in an HTML comment.
- classmethod for_name_and_ids(name: str, pub_id: str | None, system_id: str | None) Doctype ¶
Generate an appropriate document type declaration for a given public ID and system ID.
- Parameters:
name -- The name of the document's root element, e.g. 'html'.
pub_id -- The Formal Public Identifier for this document type, e.g. '-//W3C//DTD XHTML 1.1//EN'
system_id -- The system identifier for this document type, e.g. 'http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd'
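For example, a sketch of building doctypes with and without identifiers (the identifier strings are the ones quoted above):

```python
from bs4.element import Doctype

# No identifiers: just the root element name.
simple = Doctype.for_name_and_ids('html', None, None)

# Public and system identifiers produce a PUBLIC doctype declaration.
xhtml = Doctype.for_name_and_ids(
    'html',
    '-//W3C//DTD XHTML 1.1//EN',
    'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd',
)
```

When rendered as part of a document, the PREFIX and SUFFIX shown above wrap these values in '<!DOCTYPE ' and '>\n'.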
- next_element: _AtMostOneElement¶
- next_sibling: _AtMostOneElement¶
- previous_element: _AtMostOneElement¶
- previous_sibling: _AtMostOneElement¶
- class bs4.element.HTMLAttributeDict¶
Bases:
AttributeDict
A dictionary for holding a Tag's attributes, which processes incoming values for consistency with the HTML spec, which says 'Attribute values are a mixture of text and character references...'
Basically, this means converting common non-string values into strings, like XMLAttributeDict, though HTML also has some rules around boolean attributes that XML doesn't have.
- class bs4.element.NamespacedAttribute(prefix: str | None, name: str | None = None, namespace: str | None = None)¶
Bases:
str
A namespaced attribute (e.g. the 'xml:lang' in 'xml:lang="en"') which remembers the namespace prefix ('xml') and the name ('lang') that were used to create it.
- class bs4.element.NavigableString(value: str | bytes)¶
Bases:
str
,PageElement
A Python string that is part of a parse tree.
When Beautiful Soup parses the markup <b>penguin</b>, it will create a NavigableString for the string "penguin".
A string prepended to the body of the 'real' string when formatting it as part of a document, such as the '<!--' in an HTML comment.
A string appended to the body of the 'real' string when formatting it as part of a document, such as the '-->' in an HTML comment.
Run the string through the provided formatter, making it ready for output as part of an HTML or XML document.
- Parameters:
formatter -- A
Formatter
object, or a string naming one of the standard formatters.
Yield this string, but only if it is interesting.
This is defined the way it is for compatibility with Tag.strings. See Tag for information on which strings are interesting in a given context.
- Yield:
A sequence that either contains this string, or is empty.
- bs4.element.PYTHON_SPECIFIC_ENCODINGS: Set[_Encoding] = {'idna', 'mbcs', 'oem', 'palmos', 'punycode', 'raw-unicode-escape', 'raw_unicode_escape', 'string-escape', 'string_escape', 'undefined', 'unicode-escape', 'unicode_escape'}¶
These encodings are recognized by Python (so Tag.encode could theoretically support them) but XML and HTML don't recognize them (so they should not show up in an XML or HTML document as that document's encoding).
If an XML document is encoded in one of these encodings, no encoding will be mentioned in the XML declaration. If an HTML document is encoded in one of these encodings, and the HTML document has a <meta> tag that mentions an encoding, the encoding will be given as the empty string.
Source: Python documentation, Python Specific Encodings
- class bs4.element.PageElement¶
Bases:
object
An abstract class representing a single element in the parse tree.
NavigableString, Tag, etc. are all subclasses of PageElement. For this reason you'll see a lot of methods that return PageElement, but you'll never see an actual PageElement object. For the most part you can think of PageElement as meaning "a Tag or a NavigableString."
- decompose() None ¶
Recursively destroys this PageElement and its children.
The element will be removed from the tree and wiped out; so will everything beneath it.
The behavior of a decomposed PageElement is undefined and you should never use one for anything, but if you need to check whether an element has been decomposed, you can use the PageElement.decomposed property.
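A minimal sketch using the stock html.parser builder (the markup is illustrative):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>keep <b>drop</b></p>', 'html.parser')
b = soup.b
b.decompose()          # <b> and everything inside it is destroyed

gone = soup.find('b')  # None: the tag is no longer in the tree
check = b.decomposed   # True: the safe way to test for decomposition
```

After the call, the document renders as '<p>keep </p>'.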
- extract(_self_index: int | None = None) Self ¶
Destructively rips this element out of the tree.
- Parameters:
_self_index -- The location of this element in its parent's .contents, if known. Passing this in allows for a performance optimization.
- Returns:
this
PageElement
, no longer part of the tree.
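Unlike decompose(), extract() leaves the element intact and usable on its own. A short sketch:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>one <b>two</b></p>', 'html.parser')
b = soup.b.extract()      # removed from the tree, but still a live Tag

remaining = str(soup)     # the <b> tag is gone from the document
orphan_parent = b.parent  # None: extracted elements have no parent
```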
- find_all_next(name: _FindMethodName = None, attrs: _StrainableAttributes | None = None, string: None = None, limit: int | None = None, _stacklevel: int = 2, **kwargs: _StrainableAttribute) _SomeTags ¶
- find_all_next(name: None = None, attrs: None = None, string: _StrainableString = '', limit: int | None = None, _stacklevel: int = 2, **kwargs: _StrainableAttribute) _SomeNavigableStrings
Find all PageElement objects that match the given criteria and appear later in the document than this PageElement.
All find_* methods take a common set of arguments. See the online documentation for detailed explanations.
- Parameters:
name -- A filter on tag name.
attrs -- Additional filters on attribute values.
string -- A filter for a NavigableString with specific text.
limit -- Stop looking after finding this many results.
_stacklevel -- Used internally to improve warning messages.
- Kwargs:
Additional filters on attribute values.
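A small sketch of searching forward from an element (html.parser builder, illustrative markup):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a>1</a><b>2</b><a>3</a>', 'html.parser')
first = soup.a

# Only elements *after* `first` in document order are considered,
# so the first <a> itself is not in the results.
later_a = first.find_all_next('a')
```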
- find_all_previous(name: _FindMethodName = None, attrs: _StrainableAttributes | None = None, string: None = None, limit: int | None = None, _stacklevel: int = 2, **kwargs: _StrainableAttribute) _SomeTags ¶
- find_all_previous(name: None = None, attrs: None = None, string: _StrainableString = '', limit: int | None = None, _stacklevel: int = 2, **kwargs: _StrainableAttribute) _SomeNavigableStrings
Look backwards in the document from this PageElement and find all PageElement objects that match the given criteria.
All find_* methods take a common set of arguments. See the online documentation for detailed explanations.
- Parameters:
name -- A filter on tag name.
attrs -- Additional filters on attribute values.
string -- A filter for a NavigableString with specific text.
limit -- Stop looking after finding this many results.
_stacklevel -- Used internally to improve warning messages.
- Kwargs:
Additional filters on attribute values.
- find_next(name: _FindMethodName = None, attrs: _StrainableAttributes | None = None, string: None = None, **kwargs: _StrainableAttribute) _AtMostOneTag ¶
- find_next(name: None = None, attrs: None = None, string: _StrainableString = '', **kwargs: _StrainableAttribute) _AtMostOneNavigableString
Find the first PageElement that matches the given criteria and appears later in the document than this PageElement.
All find_* methods take a common set of arguments. See the online documentation for detailed explanations.
- Parameters:
name -- A filter on tag name.
attrs -- Additional filters on attribute values.
string -- A filter for a NavigableString with specific text.
- Kwargs:
Additional filters on attribute values.
- find_next_sibling(name: _FindMethodName = None, attrs: _StrainableAttributes | None = None, string: None = None, **kwargs: _StrainableAttribute) _AtMostOneTag ¶
- find_next_sibling(name: None = None, attrs: None = None, string: _StrainableString = '', **kwargs: _StrainableAttribute) _AtMostOneNavigableString
Find the closest sibling to this PageElement that matches the given criteria and appears later in the document.
All find_* methods take a common set of arguments. See the online documentation for detailed explanations.
- Parameters:
name -- A filter on tag name.
attrs -- Additional filters on attribute values.
string -- A filter for a NavigableString with specific text.
- Kwargs:
Additional filters on attribute values.
- find_next_siblings(name: _FindMethodName = None, attrs: _StrainableAttributes | None = None, string: None = None, limit: int | None = None, _stacklevel: int = 2, **kwargs: _StrainableAttribute) _SomeTags ¶
- find_next_siblings(name: None = None, attrs: None = None, string: _StrainableString = '', limit: int | None = None, _stacklevel: int = 2, **kwargs: _StrainableAttribute) _SomeNavigableStrings
Find all siblings of this PageElement that match the given criteria and appear later in the document.
All find_* methods take a common set of arguments. See the online documentation for detailed explanations.
- Parameters:
name -- A filter on tag name.
attrs -- Additional filters on attribute values.
string -- A filter for a NavigableString with specific text.
limit -- Stop looking after finding this many results.
_stacklevel -- Used internally to improve warning messages.
- Kwargs:
Additional filters on attribute values.
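A minimal sibling-navigation sketch (html.parser builder, illustrative markup):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<ul><li>1</li><li>2</li><li>3</li></ul>', 'html.parser')
first = soup.li

rest = first.find_next_siblings('li')    # both later <li> tags
nearest = first.find_next_sibling('li')  # just the closest one
```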
- find_parent(name: _FindMethodName = None, attrs: _StrainableAttributes | None = None, **kwargs: _StrainableAttribute) _AtMostOneTag ¶
Find the closest parent of this PageElement that matches the given criteria.
All find_* methods take a common set of arguments. See the online documentation for detailed explanations.
- Parameters:
name -- A filter on tag name.
attrs -- Additional filters on attribute values.
self -- Whether the PageElement itself should be considered as one of its 'parents'.
- Kwargs:
Additional filters on attribute values.
- find_parents(name: _FindMethodName = None, attrs: _StrainableAttributes | None = None, limit: int | None = None, _stacklevel: int = 2, **kwargs: _StrainableAttribute) _SomeTags ¶
Find all parents of this PageElement that match the given criteria.
All find_* methods take a common set of arguments. See the online documentation for detailed explanations.
- Parameters:
name -- A filter on tag name.
attrs -- Additional filters on attribute values.
limit -- Stop looking after finding this many results.
_stacklevel -- Used internally to improve warning messages.
- Kwargs:
Additional filters on attribute values.
- find_previous(name: _FindMethodName = None, attrs: _StrainableAttributes | None = None, string: None = None, **kwargs: _StrainableAttribute) _AtMostOneTag ¶
- find_previous(name: None = None, attrs: None = None, string: _StrainableString = '', **kwargs: _StrainableAttribute) _AtMostOneNavigableString
Look backwards in the document from this PageElement and find the first PageElement that matches the given criteria.
All find_* methods take a common set of arguments. See the online documentation for detailed explanations.
- Parameters:
name -- A filter on tag name.
attrs -- Additional filters on attribute values.
string -- A filter for a NavigableString with specific text.
- Kwargs:
Additional filters on attribute values.
- find_previous_sibling(name: _FindMethodName = None, attrs: _StrainableAttributes | None = None, string: None = None, **kwargs: _StrainableAttribute) _AtMostOneTag ¶
- find_previous_sibling(name: None = None, attrs: None = None, string: _StrainableString = '', **kwargs: _StrainableAttribute) _AtMostOneNavigableString
Returns the closest sibling to this PageElement that matches the given criteria and appears earlier in the document.
All find_* methods take a common set of arguments. See the online documentation for detailed explanations.
- Parameters:
name -- A filter on tag name.
attrs -- Additional filters on attribute values.
string -- A filter for a NavigableString with specific text.
- Kwargs:
Additional filters on attribute values.
- find_previous_siblings(name: _FindMethodName = None, attrs: _StrainableAttributes | None = None, string: None = None, limit: int | None = None, _stacklevel: int = 2, **kwargs: _StrainableAttribute) _SomeTags ¶
- find_previous_siblings(name: None = None, attrs: None = None, string: _StrainableString = '', limit: int | None = None, _stacklevel: int = 2, **kwargs: _StrainableAttribute) _SomeNavigableStrings
Returns all siblings to this PageElement that match the given criteria and appear earlier in the document.
All find_* methods take a common set of arguments. See the online documentation for detailed explanations.
- Parameters:
name -- A filter on tag name.
attrs -- Additional filters on attribute values.
string -- A filter for a NavigableString with specific text.
limit -- Stop looking after finding this many results.
_stacklevel -- Used internally to improve warning messages.
- Kwargs:
Additional filters on attribute values.
- format_string(s: str, formatter: _FormatterOrName | None) str ¶
Format the given string using the given formatter.
- Parameters:
s -- A string.
formatter -- A Formatter object, or a string naming one of the standard formatters.
- formatter_for_name(formatter_name: _FormatterOrName | _EntitySubstitutionFunction) Formatter ¶
Look up or create a Formatter for the given identifier, if necessary.
- Parameters:
formatter -- Can be a Formatter object (used as-is), a function (used as the entity substitution hook for an bs4.formatter.XMLFormatter or bs4.formatter.HTMLFormatter), or a string (used to look up an bs4.formatter.XMLFormatter or bs4.formatter.HTMLFormatter in the appropriate registry).
- getText(separator: str = '', strip: bool = False, types: Iterable[Type[NavigableString]] = ()) str ¶
Get all child strings of this PageElement, concatenated using the given separator.
- Parameters:
separator -- Strings will be concatenated using this separator.
strip -- If True, strings will be stripped before being concatenated.
types -- A tuple of NavigableString subclasses. Any strings of a subclass not found in this list will be ignored. Although there are exceptions, the default behavior in most cases is to consider only NavigableString and CData objects. That means no comments, processing instructions, etc.
- Returns:
A string.
- get_text(separator: str = '', strip: bool = False, types: Iterable[Type[NavigableString]] = ()) str ¶
Get all child strings of this PageElement, concatenated using the given separator.
- Parameters:
separator -- Strings will be concatenated using this separator.
strip -- If True, strings will be stripped before being concatenated.
types -- A tuple of NavigableString subclasses. Any strings of a subclass not found in this list will be ignored. Although there are exceptions, the default behavior in most cases is to consider only NavigableString and CData objects. That means no comments, processing instructions, etc.
- Returns:
A string.
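A short sketch of both the separator and strip arguments (html.parser builder, illustrative markup):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello <b>world</b></p>', 'html.parser')

plain = soup.get_text()
joined = soup.get_text('|', strip=True)  # strip each string, then join
```

Without strip=True, the trailing space inside 'Hello ' would survive into the joined result.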
Whether or not this element is hidden from generated output. Only the
BeautifulSoup
object itself is hidden.
- insert_after(*args: _InsertableElement) List[PageElement] ¶
Makes the given element(s) the immediate successor of this one.
The elements will have the same PageElement.parent as this one, and the given elements will occur immediately after this one.
- Parameters:
args -- One or more PageElements.
- Returns:
The list of PageElements that were inserted.
- insert_before(*args: _InsertableElement) List[PageElement] ¶
Makes the given element(s) the immediate predecessor of this one.
All the elements will have the same PageElement.parent as this one, and the given elements will occur immediately before this one.
- Parameters:
args -- One or more PageElements.
- Returns:
The list of PageElements that were inserted.
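A minimal sketch of both methods (html.parser builder, illustrative markup); plain strings are converted to NavigableString objects automatically:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><b>two</b></p>', 'html.parser')

soup.b.insert_before('one ')   # becomes the <b> tag's predecessor
soup.b.insert_after(' three')  # becomes the <b> tag's successor
result = str(soup.p)
```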
- known_xml: bool | None = None¶
In general, we can't tell just by looking at an element whether it's contained in an XML document or an HTML document. But for Tag objects (q.v.) we can store this information at parse time.
- property next: _AtMostOneElement¶
The
PageElement
, if any, that was parsed just after this one.
- next_element: _AtMostOneElement¶
- property next_elements: Iterator[PageElement]¶
All PageElements that were parsed after this one.
- next_sibling: _AtMostOneElement¶
- property next_siblings: Iterator[PageElement]¶
All PageElements that are siblings of this one but were parsed later.
- property parents: Iterator[Tag]¶
All elements that are parents of this PageElement.
- Yield:
A sequence of Tags, ending with a BeautifulSoup object.
- property previous: _AtMostOneElement¶
The
PageElement
, if any, that was parsed just before this one.
- previous_element: _AtMostOneElement¶
- property previous_elements: Iterator[PageElement]¶
All PageElements that were parsed before this one.
- Yield:
A sequence of PageElements.
- previous_sibling: _AtMostOneElement¶
- property previous_siblings: Iterator[PageElement]¶
All PageElements that are siblings of this one but were parsed earlier.
- Yield:
A sequence of PageElements.
- replace_with(*args: _InsertableElement) Self ¶
Replace this PageElement with one or more other elements, keeping the rest of the tree the same.
- Returns:
This PageElement, no longer part of the tree.
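A short sketch (html.parser builder, illustrative markup):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><b>bold</b></p>', 'html.parser')
replacement = soup.new_tag('i')
replacement.string = 'italic'

old = soup.b.replace_with(replacement)  # returns the replaced element
result = str(soup.p)
```

The replaced <b> tag comes back detached, so it can be reused or inspected.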
- property self_and_next_elements: Iterator[PageElement]¶
This PageElement, then all PageElements that were parsed after it.
- property self_and_next_siblings: Iterator[PageElement]¶
This PageElement, then all of its siblings.
- property self_and_parents: Iterator[PageElement]¶
This element, then all of its parents.
- Yield:
A sequence of PageElements, ending with a BeautifulSoup object.
- property self_and_previous_elements: Iterator[PageElement]¶
This PageElement, then all elements that were parsed earlier.
- property self_and_previous_siblings: Iterator[PageElement]¶
This PageElement, then all of its siblings that were parsed earlier.
- setup(parent: Tag | None = None, previous_element: _AtMostOneElement = None, next_element: _AtMostOneElement = None, previous_sibling: _AtMostOneElement = None, next_sibling: _AtMostOneElement = None) None ¶
Sets up the initial relations between this element and other elements.
- Parameters:
parent -- The parent of this element.
previous_element -- The element parsed immediately before this one.
next_element -- The element parsed immediately after this one.
previous_sibling -- The most recently encountered element on the same level of the parse tree as this one.
next_sibling -- The next element to be encountered on the same level of the parse tree as this one.
- property stripped_strings: Iterator[str]¶
Yield all interesting strings in this PageElement, stripping them first.
See
Tag
for information on which strings are considered interesting in a given context.
- property text: str¶
Get all child strings of this PageElement, concatenated.
Equivalent to calling PageElement.get_text with no arguments.
- Returns:
A string.
- wrap(wrap_inside: Tag) Tag ¶
Wrap this PageElement inside a Tag.
- Returns:
wrap_inside, occupying the position in the tree that used to be occupied by this object, and with this object now inside it.
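A minimal sketch of wrapping a string in a new tag (html.parser builder, illustrative markup):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>text</p>', 'html.parser')

# Wrap the string inside a new <b> tag; wrap() returns the wrapper.
wrapper = soup.p.string.wrap(soup.new_tag('b'))
result = str(soup.p)
```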
- class bs4.element.PreformattedString(value: str | bytes)¶
Bases:
NavigableString
A NavigableString not subject to the normal formatting rules.
This is an abstract class used for special kinds of strings such as comments (Comment) and CDATA blocks (CData).
- PREFIX: str = ''¶
A string prepended to the body of the 'real' string when formatting it as part of a document, such as the '<!--' in an HTML comment.
- SUFFIX: str = ''¶
A string appended to the body of the 'real' string when formatting it as part of a document, such as the '-->' in an HTML comment.
- class bs4.element.ProcessingInstruction(value: str | bytes)¶
Bases:
PreformattedString
An SGML processing instruction.
- PREFIX: str = '<?'¶
A string prepended to the body of the 'real' string when formatting it as part of a document, such as the '<!--' in an HTML comment.
- SUFFIX: str = '>'¶
A string appended to the body of the 'real' string when formatting it as part of a document, such as the '-->' in an HTML comment.
- next_element: _AtMostOneElement¶
- next_sibling: _AtMostOneElement¶
- previous_element: _AtMostOneElement¶
- previous_sibling: _AtMostOneElement¶
- class bs4.element.ResultSet(source: ElementFilter | None, result: Sequence[_PageElementT] = ())¶
Bases:
Sequence[_PageElementT], Generic[_PageElementT]
A ResultSet is a sequence of PageElement objects, gathered as the result of matching an ElementFilter against a parse tree. Basically, a list of search results.
- result: Sequence[_PageElementT]¶
- source: ElementFilter | None¶
- class bs4.element.RubyParenthesisString(value: str | bytes)¶
Bases:
NavigableString
A NavigableString representing the contents of an <rp> HTML tag.
- class bs4.element.RubyTextString(value: str | bytes)¶
Bases:
NavigableString
A NavigableString representing the contents of an <rt> HTML tag.
Can be used to distinguish such strings from the strings they're annotating.
- class bs4.element.Script(value: str | bytes)¶
Bases:
NavigableString
A NavigableString representing the contents of a <script> HTML tag (probably JavaScript).
Used to distinguish executable code from textual content.
- class bs4.element.Stylesheet(value: str | bytes)¶
Bases:
NavigableString
A NavigableString representing the contents of a <style> HTML tag (probably CSS).
Used to distinguish embedded stylesheets from textual content.
- class bs4.element.Tag(parser: BeautifulSoup | None = None, builder: TreeBuilder | None = None, name: str | None = None, namespace: str | None = None, prefix: str | None = None, attrs: _RawOrProcessedAttributeValues | None = None, parent: BeautifulSoup | Tag | None = None, previous: _AtMostOneElement = None, is_xml: bool | None = None, sourceline: int | None = None, sourcepos: int | None = None, can_be_empty_element: bool | None = None, cdata_list_attributes: Dict[str, Set[str]] | None = None, preserve_whitespace_tags: Set[str] | None = None, interesting_string_types: Set[Type[NavigableString]] | None = None, namespaces: Dict[str, str] | None = None)¶
Bases:
PageElement
An HTML or XML tag that is part of a parse tree, along with its attributes, contents, and relationships to other parts of the tree.
When Beautiful Soup parses the markup <b>penguin</b>, it will create a Tag object representing the <b> tag. You can instantiate Tag objects directly, but it's not necessary unless you're adding entirely new markup to a parsed document. Most of the constructor arguments are intended for use by the TreeBuilder that's parsing a document.
- Parameters:
parser -- A BeautifulSoup object representing the parse tree this Tag will be part of.
builder -- The TreeBuilder being used to build the tree.
name -- The name of the tag.
namespace -- The URI of this tag's XML namespace, if any.
prefix -- The prefix for this tag's XML namespace, if any.
attrs -- A dictionary of attribute values.
parent -- The Tag to use as the parent of this Tag. May be the BeautifulSoup object itself.
previous -- The PageElement that was parsed immediately before parsing this tag.
is_xml -- If True, this is an XML tag. Otherwise, this is an HTML tag.
sourceline -- The line number where this tag was found in its source document.
sourcepos -- The character position within sourceline where this tag was found.
can_be_empty_element -- If True, this tag should be represented as <tag/>. If False, this tag should be represented as <tag></tag>.
cdata_list_attributes -- A dictionary of attributes whose values should be parsed as lists of strings if they ever show up on this tag.
preserve_whitespace_tags -- Names of tags whose contents should have their whitespace preserved if they are encountered inside this tag.
interesting_string_types -- When iterating over this tag's string contents in methods like Tag.strings or PageElement.get_text, these are the types of strings that are interesting enough to be considered. By default, NavigableString (normal strings) and CData (CDATA sections) are the only interesting string subtypes.
namespaces -- A dictionary mapping currently active namespace prefixes to URIs, as of the point in the parsing process when this tag was encountered. This can be used later to construct CSS selectors.
- append(tag: _InsertableElement) PageElement ¶
Appends the given PageElement to the contents of this Tag.
- Parameters:
tag -- A PageElement.
- Returns:
The newly appended PageElement.
- attrs: _AttributeValues¶
- property children: Iterator[PageElement]¶
Iterate over all direct children of this
PageElement
.
- clear(decompose: bool = False) None ¶
Destroy all children of this Tag by calling PageElement.extract on them.
- Parameters:
decompose -- If this is True, PageElement.decompose (a more destructive method) will be called instead of PageElement.extract.
- contents: List[PageElement]¶
- copy_self() Self ¶
Create a new Tag just like this one, but with no contents and unattached to any parse tree.
This is the first step in the deepcopy process, but you can call it on its own to create a copy of a Tag without copying its contents.
- decode(indent_level: int | None = None, eventual_encoding: _Encoding = 'utf-8', formatter: _FormatterOrName = 'minimal', iterator: Iterator[PageElement] | None = None) str ¶
Render this Tag and its contents as a Unicode string.
- Parameters:
indent_level -- Each line of the rendering will be indented this many levels. (The formatter decides what a 'level' means, in terms of spaces or other characters output.) This is used internally in recursive calls while pretty-printing.
eventual_encoding -- The encoding you intend to use when converting the string to a bytestring. decode() is not responsible for performing that encoding. This information is needed so that a real encoding can be substituted in if the document contains an encoding declaration (e.g. in a <meta> tag).
formatter -- Either a Formatter object, or a string naming one of the standard formatters.
iterator -- The iterator to use when navigating over the parse tree. This is only used by Tag.decode_contents and you probably won't need to use it.
- decode_contents(indent_level: int | None = None, eventual_encoding: _Encoding = 'utf-8', formatter: _FormatterOrName = 'minimal') str ¶
Renders the contents of this tag as a Unicode string.
- Parameters:
indent_level -- Each line of the rendering will be indented this many levels. (The formatter decides what a 'level' means in terms of spaces or other characters output.) Used internally in recursive calls while pretty-printing.
eventual_encoding -- The tag is destined to be encoded into this encoding. decode_contents() is not responsible for performing that encoding. This information is needed so that a real encoding can be substituted in if the document contains an encoding declaration (e.g. in a <meta> tag).
formatter -- A
Formatter
object, or a string naming one of the standard Formatters.
- property descendants: Iterator[PageElement]¶
Iterate over all descendants of this Tag, in document order.
- encode(encoding: _Encoding = 'utf-8', indent_level: int | None = None, formatter: _FormatterOrName = 'minimal', errors: str = 'xmlcharrefreplace') bytes ¶
Render this Tag and its contents as a bytestring.
- Parameters:
encoding -- The encoding to use when converting to a bytestring. This may also affect the text of the document, specifically any encoding declarations within the document.
indent_level -- Each line of the rendering will be indented this many levels. (The formatter decides what a 'level' means, in terms of spaces or other characters output.) This is used internally in recursive calls while pretty-printing.
formatter -- Either a Formatter object, or a string naming one of the standard formatters.
errors -- An error handling strategy such as 'xmlcharrefreplace'. This value is passed along into str.encode() and its value should be one of the error handling constants defined by Python's codecs module.
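A minimal decode()/encode() sketch (html.parser builder, illustrative markup); the default 'minimal' formatter leaves non-ASCII characters alone:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>caf\xe9</p>', 'html.parser')

as_text = soup.p.decode()          # a Unicode string
as_bytes = soup.p.encode('utf-8')  # a bytestring in the requested encoding
```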
- encode_contents(indent_level: int | None = None, encoding: _Encoding = 'utf-8', formatter: _FormatterOrName = 'minimal') bytes ¶
Renders the contents of this PageElement as a bytestring.
- Parameters:
indent_level -- Each line of the rendering will be indented this many levels. (The formatter decides what a 'level' means, in terms of spaces or other characters output.) This is used internally in recursive calls while pretty-printing.
formatter -- Either a Formatter object, or a string naming one of the standard formatters.
encoding -- The bytestring will be in this encoding.
- extend(tags: Iterable[_InsertableElement] | Tag) List[PageElement] ¶
Appends one or more objects to the contents of this Tag.
- Parameters:
tags -- If a list of PageElement objects is provided, they will be appended to this tag's contents, one at a time. If a single Tag is provided, its Tag.contents will be used to extend this object's Tag.contents.
- Returns:
The list of PageElements that were appended.
- find(name: _FindMethodName = None, attrs: _StrainableAttributes | None = None, recursive: bool = True, string: None = None, **kwargs: _StrainableAttribute) _AtMostOneTag ¶
- find(name: None = None, attrs: None = None, recursive: bool = True, string: _StrainableString = '') _AtMostOneNavigableString
Look in the children of this PageElement and find the first PageElement that matches the given criteria.
All find_* methods take a common set of arguments. See the online documentation for detailed explanations.
- Parameters:
name -- A filter on tag name.
attrs -- Additional filters on attribute values.
recursive -- If this is True, find() will perform a recursive search of this Tag's children. Otherwise, only the direct children will be considered.
string -- A filter on the
Tag.string
attribute.
- Kwargs:
Additional filters on attribute values.
- find_all(name: _FindMethodName = None, attrs: _StrainableAttributes | None = None, recursive: bool = True, string: None = None, limit: int | None = None, _stacklevel: int = 2, **kwargs: _StrainableAttribute) _SomeTags ¶
- find_all(name: None = None, attrs: None = None, recursive: bool = True, string: _StrainableString = '', limit: int | None = None, _stacklevel: int = 2, **kwargs: _StrainableAttribute) _SomeNavigableStrings
Look in the children of this PageElement and find all PageElement objects that match the given criteria.
All find_* methods take a common set of arguments. See the online documentation for detailed explanations.
- Parameters:
name -- A filter on tag name.
attrs -- Additional filters on attribute values.
recursive -- If this is True, find_all() will perform a recursive search of this PageElement's children. Otherwise, only the direct children will be considered.
string -- A filter on the Tag.string attribute.
limit -- Stop looking after finding this many results.
_stacklevel -- Used internally to improve warning messages.
- Kwargs:
Additional filters on attribute values.
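As a sketch of how these arguments combine in practice (using the built-in html.parser; the markup here is illustrative):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div><p id="a">one</p><p>two</p><span>three</span></div>',
    "html.parser",
)

# find() returns the first match; find_all() returns every match.
first_p = soup.find("p")
all_p = soup.find_all("p")

# Keyword arguments act as additional filters on attribute values.
tagged = soup.find_all("p", id="a")

# limit stops the search after that many results.
limited = soup.find_all(True, limit=2)
```

Here `first_p` is the `<p id="a">` tag, `all_p` contains both `<p>` tags, and `tagged` contains only the one whose `id` matched.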
- get(key: str, default: _AttributeValue | None = None) _AttributeValue | None ¶
Returns the value of the 'key' attribute for the tag, or the value given for 'default' if it doesn't have that attribute.
- Parameters:
key -- The attribute to look for.
default -- Use this value if the attribute is not present on this Tag.
- get_attribute_list(key: str, default: AttributeValueList | None = None) AttributeValueList ¶
The same as get(), but always returns a (possibly empty) list.
- Parameters:
key -- The attribute to look for.
default -- Use this value if the attribute is not present on this Tag.
- Returns:
A list of strings, usually empty or containing only a single value.
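A brief sketch of the difference between the two accessors (markup is illustrative; with the HTML parsers, class is treated as a multi-valued attribute):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="x y" id="para">text</p>', "html.parser")
p = soup.p

one = p.get("id")                     # behaves like dict.get(): "para"
missing = p.get("data-foo", "n/a")    # default instead of a KeyError
as_list = p.get_attribute_list("id")  # always a list: ["para"]
classes = p.get_attribute_list("class")  # multi-valued: ["x", "y"]
```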
- index(element: PageElement) int ¶
Find the index of a child of this Tag (by identity, not value).
Doing this by identity avoids issues when a Tag contains two children that have string equality.
- Parameters:
element -- Look for this PageElement in this object's contents.
- insert(position: int, *new_children: _InsertableElement) List[PageElement] ¶
Insert one or more new PageElements as a child of this Tag.
This works similarly to list.insert(), except you can insert multiple elements at once.
- Parameters:
position -- The numeric position that should be occupied in this Tag's Tag.children by the first new PageElement.
new_children -- The PageElements to insert.
- Returns:
The newly inserted PageElements.
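A minimal sketch of inserting at a position (a plain string becomes a NavigableString child; inserting several elements in one call requires a recent Beautiful Soup release):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>world</p>", "html.parser")
p = soup.p

# Insert a plain string at position 0, before the existing child.
p.insert(0, "hello ")
```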
- interesting_string_types: Set[Type[NavigableString]] | None¶
- property is_empty_element: bool¶
Is this tag an empty-element tag? (aka a self-closing tag)
A tag that has contents is never an empty-element tag.
A tag that has no contents may or may not be an empty-element tag. It depends on the TreeBuilder used to create the tag. If the builder has a designated list of empty-element tags, then only a tag whose name shows up in that list is considered an empty-element tag. This is usually the case for HTML documents.
If the builder has no designated list of empty-element tags, then any tag with no contents is an empty-element tag. This is usually the case for XML documents.
- next_element: _AtMostOneElement¶
- next_sibling: _AtMostOneElement¶
- parser_class: type[BeautifulSoup] | None¶
- prettify(encoding: None = None, formatter: _FormatterOrName = 'minimal') str ¶
- prettify(encoding: _Encoding, formatter: _FormatterOrName = 'minimal') bytes
Pretty-print this Tag as a string or bytestring.
- Parameters:
encoding -- The encoding of the bytestring, or None if you want Unicode.
formatter -- A Formatter object, or a string naming one of the standard formatters.
- Returns:
A string (if no encoding is provided) or a bytestring (otherwise).
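A short sketch of both return types (markup is illustrative):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>hi</p></div>", "html.parser")

text = soup.prettify()          # no encoding: a str, one tag per line
data = soup.prettify("utf-8")   # with an encoding: a bytestring
```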
- previous_element: _AtMostOneElement¶
- previous_sibling: _AtMostOneElement¶
- replaceWithChildren() _OneElement ¶
:meta private:
- replace_with_children() Self ¶
Replace this PageElement with its contents.
- Returns:
This object, no longer part of the tree.
- select(selector: str, namespaces: Dict[str, str] | None = None, limit: int = 0, **kwargs: Any) ResultSet[Tag] ¶
Perform a CSS selection operation on the current element.
This uses the SoupSieve library.
- Parameters:
selector -- A string containing a CSS selector.
namespaces -- A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will use the prefixes it encountered while parsing the document.
limit -- After finding this number of results, stop looking.
kwargs -- Keyword arguments to be passed into SoupSieve's soupsieve.select() method.
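A minimal sketch of CSS selection (this assumes the SoupSieve package is installed, which it normally is as a bs4 dependency; the markup is illustrative):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div id="menu"><a class="item">A</a><a class="item">B</a></div>',
    "html.parser",
)

items = soup.select("div#menu a.item")   # every match, as a ResultSet
first = soup.select_one("a.item")        # first match, or None
```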
- select_one(selector: str, namespaces: Dict[str, str] | None = None, **kwargs: Any) Tag | None ¶
Perform a CSS selection operation on the current element.
- Parameters:
selector -- A CSS selector.
namespaces -- A dictionary mapping namespace prefixes used in the CSS selector to namespace URIs. By default, Beautiful Soup will use the prefixes it encountered while parsing the document.
kwargs -- Keyword arguments to be passed into Soup Sieve's soupsieve.select() method.
- property self_and_descendants: Iterator[PageElement]¶
Iterate over this Tag and its children in a breadth-first sequence.
- smooth() None ¶
Smooth out the children of this Tag by consolidating consecutive strings.
If you perform a lot of operations that modify the tree, calling this method afterwards can make pretty-printed output look more natural.
- property string: str | None¶
Convenience property to get the single string within this Tag, assuming there is just one.
- Returns:
If this Tag has a single child that's a NavigableString, the return value is that string. If this element has one child Tag, the return value is that child's Tag.string, recursively. If this Tag has no children, or has more than one child, the return value is None.
If this property is unexpectedly returning None for you, it's probably because your Tag has more than one thing inside it.
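The three cases described above, sketched with illustrative markup:

```python
from bs4 import BeautifulSoup

markup = "<b>bold</b><p><b>nested</b></p><div>x<i>y</i></div>"
soup = BeautifulSoup(markup, "html.parser")

one_string = soup.b.string   # single NavigableString child
via_child = soup.p.string    # one child Tag: recurses into it
ambiguous = soup.div.string  # more than one child: None
```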
- property strings: Iterator[str]¶
Yield all strings of certain classes, possibly stripping them.
- Parameters:
strip -- If True, all strings will be stripped before being yielded.
types -- A tuple of NavigableString subclasses. Any strings of a subclass not found in this list will be ignored. By default, the subclasses considered are the ones found in self.interesting_string_types. If that's not specified, only NavigableString and CData objects will be considered. That means no comments, processing instructions, etc.
- unwrap() Self ¶
Replace this PageElement with its contents.
- Returns:
This object, no longer part of the tree.
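A sketch of unwrapping a tag in place (markup is illustrative):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>bold</b> text</p>", "html.parser")

# The <b> tag is replaced by its contents; the detached tag is returned.
removed = soup.b.unwrap()
```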
- class bs4.element.TemplateString(value: str | bytes)¶
Bases:
NavigableString
A NavigableString representing a string found inside an HTML <template> tag embedded in a larger document.
Used to distinguish such strings from the main body of the document.
- class bs4.element.XMLAttributeDict¶
Bases:
AttributeDict
A dictionary for holding a Tag's attributes, which processes incoming values for consistency with the HTML spec.
- class bs4.element.XMLProcessingInstruction(value: str | bytes)¶
Bases:
ProcessingInstruction
bs4.filter module¶
- class bs4.filter.AttributeValueMatchRule(string: str | bytes | None = None, pattern: _RegularExpressionProtocol | None = None, function: Callable | None = None, present: bool | None = None, exclude_everything: bool | None = None)¶
Bases:
MatchRule
A MatchRule implementing the rules for matches against attribute value.
- class bs4.filter.ElementFilter(match_function: Callable[[PageElement], bool] | None = None)¶
Bases:
object
ElementFilter encapsulates the logic necessary to decide:
1. whether a PageElement (a Tag or a NavigableString) matches a user-specified query.
2. whether a given sequence of markup found during initial parsing should be turned into a PageElement at all, or simply discarded.
The base class is the simplest ElementFilter. By default, it matches everything and allows all markup to become PageElement objects. You can make it more selective by passing in a user-defined match function, or defining a subclass.
Most users of Beautiful Soup will never need to use ElementFilter, or its more capable subclass SoupStrainer. Instead, they will use methods like Tag.find(), which will convert their arguments into SoupStrainer objects and run them against the tree.
However, if you find yourself wanting to treat the arguments to Beautiful Soup's find_*() methods as first-class objects, those objects will be SoupStrainer objects. You can create them yourself and then make use of functions like ElementFilter.filter().
.- allow_string_creation(string: str) bool ¶
Based on the content of a string, see whether this ElementFilter will allow a NavigableString object based on this string to be added to the parse tree.
By default, all strings are processed into NavigableString objects. To change this, subclass ElementFilter.
- Parameters:
string -- The string under consideration.
- allow_tag_creation(nsprefix: str | None, name: str, attrs: _RawAttributeValues | None) bool ¶
Based on the name and attributes of a tag, see whether this ElementFilter will allow a Tag object to even be created.
By default, all tags are parsed. To change this, subclass ElementFilter.
.- Parameters:
name -- The name of the prospective tag.
attrs -- The attributes of the prospective tag.
- property excludes_everything: bool¶
Does this ElementFilter obviously exclude everything? If so, Beautiful Soup will issue a warning if you try to use it when parsing a document.
The ElementFilter might turn out to exclude everything even if this returns False, but it won't exclude everything in an obvious way.
The base ElementFilter implementation excludes things based on a match function we can't inspect, so excludes_everything is always false.
- filter(generator: Iterator[PageElement]) Iterator[PageElement | Tag | NavigableString] ¶
The most generic search method offered by Beautiful Soup.
Acts like Python's built-in filter, using ElementFilter.match as the filtering function.
- find(generator: Iterator[PageElement]) PageElement | Tag | NavigableString | None ¶
A lower-level equivalent of Tag.find().
You can pass in your own generator for iterating over PageElement objects. The first one that matches this ElementFilter will be returned.
- Parameters:
generator -- A way of iterating over PageElement objects.
- find_all(generator: Iterator[PageElement], limit: int | None = None) ResultSet[PageElement | Tag | NavigableString] ¶
A lower-level equivalent of Tag.find_all().
You can pass in your own generator for iterating over PageElement objects. Only elements that match this ElementFilter will be returned in the ResultSet.
- Parameters:
generator -- A way of iterating over PageElement objects.
limit -- Stop looking after finding this many results.
- property includes_everything: bool¶
Does this ElementFilter obviously include everything? If so, the filter process can be made much faster.
The ElementFilter might turn out to include everything even if this returns False, but it won't include everything in an obvious way.
The base ElementFilter implementation includes things based on the match function, so includes_everything is only true if there is no match function.
- match(element: PageElement, _known_rules: bool = False) bool ¶
Does the given PageElement match the rules set down by this ElementFilter?
The base implementation delegates to the function passed in to the constructor.
- Parameters:
_known_rules -- Defined for compatibility with SoupStrainer._match(). Used more for consistency than because we need the performance optimization.
- match_function: Callable[[PageElement], bool] | None¶
- class bs4.filter.MatchRule(string: str | bytes | None = None, pattern: _RegularExpressionProtocol | None = None, function: Callable | None = None, present: bool | None = None, exclude_everything: bool | None = None)¶
Bases:
object
Each MatchRule encapsulates the logic behind a single argument passed in to one of the Beautiful Soup find* methods.
- pattern: _RegularExpressionProtocol | None¶
- class bs4.filter.SoupStrainer(name: str | bytes | Pattern[str] | bool | Callable[[Tag], bool] | Iterable[str | bytes | Pattern[str] | bool | Callable[[Tag], bool]] | None = None, attrs: Dict[str, str | bytes | Pattern[str] | bool | Callable[[str | None], bool] | Iterable[str | bytes | Pattern[str] | bool | Callable[[str | None], bool]]] | None = None, string: str | bytes | Pattern[str] | bool | Callable[[str | None], bool] | Iterable[str | bytes | Pattern[str] | bool | Callable[[str | None], bool]] | None = None, **kwargs: str | bytes | Pattern[str] | bool | Callable[[str | None], bool] | Iterable[str | bytes | Pattern[str] | bool | Callable[[str | None], bool]])¶
Bases:
ElementFilter
The ElementFilter subclass used internally by Beautiful Soup.
A SoupStrainer encapsulates the logic necessary to perform the kind of matches supported by methods such as Tag.find(). SoupStrainer objects are primarily created internally, but you can create one yourself and pass it in as parse_only to the BeautifulSoup constructor, to parse a subset of a large document.
Internally, SoupStrainer objects work by converting the constructor arguments into MatchRule objects. Incoming tags/markup are matched against those rules.
- Parameters:
name -- One or more restrictions on the tags found in a document.
attrs -- A dictionary that maps attribute names to restrictions on tags that use those attributes.
string -- One or more restrictions on the strings found in a document.
kwargs -- A dictionary that maps attribute names to restrictions on tags that use those attributes. These restrictions are additive to any specified in attrs.
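A sketch of the parse_only use case described above (markup is illustrative):

```python
from bs4 import BeautifulSoup, SoupStrainer

# Only <a> tags are turned into PageElements; everything else is
# discarded at parse time, which saves memory on large documents.
only_links = SoupStrainer("a")
soup = BeautifulSoup(
    '<p>intro</p><a href="/one">one</a><a href="/two">two</a>',
    "html.parser",
    parse_only=only_links,
)

links = soup.find_all("a")
```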
- allow_string_creation(string: str) bool ¶
Based on the content of a markup string, see whether this SoupStrainer will allow it to be instantiated as a NavigableString object, or whether it should be ignored.
- allow_tag_creation(nsprefix: str | None, name: str, attrs: _RawAttributeValues | None) bool ¶
Based on the name and attributes of a tag, see whether this SoupStrainer will allow a Tag object to even be created.
- Parameters:
name -- The name of the prospective tag.
attrs -- The attributes of the prospective tag.
- attribute_rules: Dict[str, List[AttributeValueMatchRule]]¶
- property excludes_everything: bool¶
Check whether the provided rules will obviously exclude everything. (They might exclude everything even if this returns False, but not in an obvious way.)
- property includes_everything: bool¶
Check whether the provided rules will obviously include everything. (They might include everything even if this returns False, but not in an obvious way.)
- match(element: PageElement, _known_rules: bool = False) bool ¶
Does the given PageElement match the rules set down by this SoupStrainer?
The find_* methods rely heavily on this method to find matches.
- Parameters:
element -- A PageElement.
_known_rules -- Set to true in the common case where we already checked and found at least one rule in this SoupStrainer that might exclude a PageElement. Without this, we need to check .includes_everything every time, just to be safe.
- Returns:
True if the element matches this SoupStrainer's rules; False otherwise.
- matches_any_string_rule(string: str) bool ¶
See whether the content of a string matches any of this SoupStrainer's string rules.
- matches_tag(tag: Tag) bool ¶
Do the rules of this SoupStrainer trigger a match against the given Tag?
If the SoupStrainer has any TagNameMatchRule, at least one must match the Tag or its Tag.name.
If there are any AttributeValueMatchRule for a given attribute, at least one of them must match the attribute value.
If there are any StringMatchRule, at least one must match, but a SoupStrainer that only contains StringMatchRule cannot match a Tag, only a NavigableString.
- name_rules: List[TagNameMatchRule]¶
- search_tag(name: str, attrs: _RawAttributeValues | None) bool ¶
A less elegant version of allow_tag_creation. Deprecated as of 4.13.0.
- string_rules: List[StringMatchRule]¶
- class bs4.filter.StringMatchRule(string: str | bytes | None = None, pattern: _RegularExpressionProtocol | None = None, function: Callable | None = None, present: bool | None = None, exclude_everything: bool | None = None)¶
Bases:
MatchRule
A MatchRule implementing the rules for matches against a NavigableString.
- class bs4.filter.TagNameMatchRule(string: str | bytes | None = None, pattern: _RegularExpressionProtocol | None = None, function: Callable | None = None, present: bool | None = None, exclude_everything: bool | None = None)¶
Bases:
MatchRule
A MatchRule implementing the rules for matches against tag name.
bs4.formatter module¶
- class bs4.formatter.Formatter(language: str | None = None, entity_substitution: Callable[[str], str] | None = None, void_element_close_prefix: str = '/', cdata_containing_tags: Set[str] | None = None, empty_attributes_are_booleans: bool = False, indent: int | str = 1)¶
Bases:
EntitySubstitution
Describes a strategy to use when outputting a parse tree to a string.
Some parts of this strategy come from the distinction between HTML4, HTML5, and XML. Others are configurable by the user.
Formatters are passed in as the formatter argument to methods like bs4.element.Tag.encode. Most people won't need to think about formatters, and most people who need to think about them can pass in one of these predefined strings as formatter rather than making a new Formatter object:
- For HTML documents:
'html' - HTML entity substitution for generic HTML documents. (default)
- 'html5' - HTML entity substitution for HTML5 documents, as
well as some optimizations in the way tags are rendered.
- 'html5-4.12.0' - The version of the 'html5' formatter used prior to
Beautiful Soup 4.13.0.
- 'minimal' - Only make the substitutions necessary to guarantee
valid HTML.
- None - Do not perform any substitution. This will be faster
but may result in invalid markup.
- For XML documents:
'html' - Entity substitution for XHTML documents.
- 'minimal' - Only make the substitutions necessary to guarantee
valid XML. (default)
- None - Do not perform any substitution. This will be faster
but may result in invalid markup.
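A sketch of how the named formatters differ on the same text (markup is illustrative; 'minimal' escapes only what validity requires, while 'html' also substitutes named entities such as &eacute;):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>café</p>", "html.parser")

minimal = soup.p.decode(formatter="minimal")  # café left as-is
html = soup.p.decode(formatter="html")        # é becomes &eacute;
```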
- HTML_DEFAULTS: Dict[str, Set[str]] = {'cdata_containing_tags': {'script', 'style'}}¶
Default values for the various constructor options when the markup language is HTML.
- attribute_value(value: str) str ¶
Process the value of an attribute.
- Parameters:
value -- A string.
- Returns:
A string with certain characters replaced by named or numeric entities.
- attributes(tag: bs4.element.Tag) Iterable[Tuple[str, _AttributeValue | None]] ¶
Reorder a tag's attributes however you want.
By default, attributes are sorted alphabetically. This makes behavior consistent between Python 2 and Python 3, and preserves backwards compatibility with older versions of Beautiful Soup.
If empty_attributes_are_booleans is True, then attributes whose values are set to the empty string will be treated as boolean attributes.
- class bs4.formatter.HTMLFormatter(entity_substitution: Callable[[str], str] | None = None, void_element_close_prefix: str = '/', cdata_containing_tags: Set[str] | None = None, empty_attributes_are_booleans: bool = False, indent: int | str = 1)¶
Bases:
Formatter
A generic Formatter for HTML.
- REGISTRY: Dict[str | None, HTMLFormatter] = {'html': <bs4.formatter.HTMLFormatter object>, 'html5': <bs4.formatter.HTMLFormatter object>, 'html5-4.12': <bs4.formatter.HTMLFormatter object>, 'minimal': <bs4.formatter.HTMLFormatter object>, None: <bs4.formatter.HTMLFormatter object>}¶
- class bs4.formatter.XMLFormatter(entity_substitution: Callable[[str], str] | None = None, void_element_close_prefix: str = '/', cdata_containing_tags: Set[str] | None = None, empty_attributes_are_booleans: bool = False, indent: int | str = 1)¶
Bases:
Formatter
A generic Formatter for XML.
- REGISTRY: Dict[str | None, XMLFormatter] = {'html': <bs4.formatter.XMLFormatter object>, 'minimal': <bs4.formatter.XMLFormatter object>, None: <bs4.formatter.XMLFormatter object>}¶
bs4._typing module¶
- bs4._typing._AttributeValues¶
A dictionary of names to _AttributeValue objects. This is what a tag's attributes look like after processing.
- bs4._typing._BaseStrainable¶
Either a tag name, an attribute value or a string can be matched against a string, bytestring, regular expression, or a boolean.
- bs4._typing._BaseStrainableAttribute¶
A tag's attribute value can be matched either with the _BaseStrainable options, or using a function that takes that value as its sole argument.
alias of Union[str, bytes, Pattern[str], bool, Callable[[Optional[str]], bool]]
- bs4._typing._BaseStrainableElement¶
A tag can be matched either with the _BaseStrainable options, or using a function that takes the Tag as its sole argument.
alias of Union[str, bytes, Pattern[str], bool, Callable[[Tag], bool]]
- bs4._typing._Encoding¶
A data encoding.
- bs4._typing._IncomingMarkup¶
The rawest form of markup: either a string, bytestring, or an open filehandle.
- bs4._typing._InsertableElement¶
A number of tree manipulation methods can take either a PageElement or a normal Python string (which will be converted to a NavigableString).
- bs4._typing._InvertedNamespaceMapping¶
A mapping of namespace URLs to prefixes.
- bs4._typing._NamespacePrefix¶
The prefix for an XML namespace.
- bs4._typing._NamespaceURL¶
The URL of an XML namespace.
- bs4._typing._NullableStringMatchFunction¶
A function that takes a string (or None) and returns a yes-or-no answer. An AttributeValueMatchRule expects this kind of function, if you're going to pass it a function.
- bs4._typing._OneElement¶
Many Beautiful Soup methods return a PageElement or a ResultSet of PageElements. A PageElement is either a Tag or a NavigableString. These convenience aliases make it easier for IDE users to see which methods are available on the objects they're dealing with.
alias of Union[PageElement, Tag, NavigableString]
- bs4._typing._PageElementMatchFunction¶
A function that takes a PageElement and returns a yes-or-no answer.
alias of Callable[[PageElement], bool]
- bs4._typing._RawAttributeValue¶
The value associated with an HTML or XML attribute. This is the relatively unprocessed value Beautiful Soup expects to come from a TreeBuilder.
- bs4._typing._RawAttributeValues: TypeAlias = 'Mapping[Union[str, NamespacedAttribute], _RawAttributeValue]'¶
A dictionary of names to _RawAttributeValue objects. This is how Beautiful Soup expects a TreeBuilder to represent a tag's attribute values.
- bs4._typing._RawMarkup¶
Markup that is in memory but has (potentially) yet to be converted to Unicode.
- bs4._typing._RawOrProcessedAttributeValues¶
The methods that deal with turning _RawAttributeValue into _AttributeValue may be called several times, even after the values are already processed (e.g. when cloning a tag), so they need to be able to accommodate both possibilities.
alias of Union[Mapping[Union[str, NamespacedAttribute], _RawAttributeValue], Dict[str, Union[str, AttributeValueList]]]
- class bs4._typing._RegularExpressionProtocol(*args, **kwargs)¶
Bases:
Protocol
A protocol object which can accept either Python's built-in re.Pattern objects, or the similar Regex objects defined by the third-party regex package.
- _is_protocol = True¶
- _is_runtime_protocol = True¶
- bs4._typing._StrainableAttribute¶
An attribute value can be matched using either a single criterion or a list of criteria.
alias of Union[str, bytes, Pattern[str], bool, Callable[[Optional[str]], bool], Iterable[Union[str, bytes, Pattern[str], bool, Callable[[Optional[str]], bool]]]]
- bs4._typing._StrainableAttributes¶
A dictionary may be used to match against multiple attribute values at once.
alias of Dict[str, Union[str, bytes, Pattern[str], bool, Callable[[Optional[str]], bool], Iterable[Union[str, bytes, Pattern[str], bool, Callable[[Optional[str]], bool]]]]]
- bs4._typing._StrainableElement¶
A tag can be matched using either a single criterion or a list of criteria.
alias of Union[str, bytes, Pattern[str], bool, Callable[[Tag], bool], Iterable[Union[str, bytes, Pattern[str], bool, Callable[[Tag], bool]]]]
- bs4._typing._StrainableString¶
A string can be matched using the same techniques as an attribute value.
alias of Union[str, bytes, Pattern[str], bool, Callable[[Optional[str]], bool], Iterable[Union[str, bytes, Pattern[str], bool, Callable[[Optional[str]], bool]]]]
- bs4._typing._TagMatchFunction¶
A function that takes a Tag and returns a yes-or-no answer. A TagNameMatchRule expects this kind of function, if you're going to pass it a function.
bs4.diagnose module¶
Diagnostic functions, mainly for use when doing tech support.
- class bs4.diagnose.AnnouncingParser(*, convert_charrefs=True)¶
Bases:
HTMLParser
Subclass of HTMLParser that announces parse events, without doing anything else.
You can use this to get a picture of how html.parser sees a given document. The easiest way to do this is to call htmlparser_trace.
- bs4.diagnose.benchmark_parsers(num_elements: int = 100000) None ¶
Very basic head-to-head performance benchmark.
- bs4.diagnose.diagnose(data: _IncomingMarkup) None ¶
Diagnostic suite for isolating common problems.
- Parameters:
data -- Some markup that needs to be explained.
- Returns:
None; diagnostics are printed to standard output.
- bs4.diagnose.htmlparser_trace(data: str) None ¶
Print out the HTMLParser events that occur during parsing.
This lets you see how HTMLParser parses a document when no Beautiful Soup code is running.
- Parameters:
data -- Some markup.
- bs4.diagnose.lxml_trace(data: _IncomingMarkup, html: bool = True, **kwargs: Any) None ¶
Print out the lxml events that occur during parsing.
This lets you see how lxml parses a document when no Beautiful Soup code is running. You can use this to determine whether an lxml-specific problem is in Beautiful Soup's lxml tree builders or in lxml itself.
- Parameters:
data -- Some markup.
html -- If True, markup will be parsed with lxml's HTML parser. If False, lxml's XML parser will be used.