bs4.builder package

Module contents

class bs4.builder.DetectsXMLParsedAsHTML

Bases: object

A mixin class for any class (a TreeBuilder, or some class used by a TreeBuilder) that's in a position to detect whether an XML document is being incorrectly parsed as HTML, and issue an appropriate warning.

This requires being able to observe an incoming processing instruction that might be an XML declaration, and also being able to observe tags as they're opened. If you can't do that for a given TreeBuilder, there's a less reliable implementation based on examining the raw markup.

LOOKS_LIKE_HTML: Pattern[str] = re.compile('<[^ +]html', re.IGNORECASE)

Regular expression for seeing if string markup has an <html> tag.

LOOKS_LIKE_HTML_B: Pattern[bytes] = re.compile(b'<[^ +]html', re.IGNORECASE)

Regular expression for seeing if byte markup has an <html> tag.

XML_PREFIX: str = '<?xml'

The start of an XML document string.

XML_PREFIX_B: bytes = b'<?xml'

The start of an XML document bytestring.

classmethod warn_if_markup_looks_like_xml(markup: _RawMarkup | None, stacklevel: int = 3) bool

Perform a check on some markup to see if it looks like XML that's not XHTML. If so, issue a warning.

This is much less reliable than doing the check while parsing, but some of the tree builders can't do that.

Parameters:
  • markup -- Some markup to check.

  • stacklevel -- The stacklevel of the code calling this function.

Returns:

True if the markup looks like non-XHTML XML, False otherwise.

class bs4.builder.HTML5TreeBuilder(multi_valued_attributes: Dict[str, Set[str]] = <object object>, preserve_whitespace_tags: Set[str] = <object object>, store_line_numbers: bool = <object object>, string_containers: Dict[str, Type[NavigableString]] = <object object>, empty_element_tags: Set[str] = <object object>, attribute_dict_class: Type[AttributeDict] = <class 'bs4.element.AttributeDict'>, attribute_value_list_class: Type[AttributeValueList] = <class 'bs4.element.AttributeValueList'>)

Bases: HTMLTreeBuilder

Use html5lib to build a tree.

Note that HTML5TreeBuilder does not support some common HTML TreeBuilder features. Some of these features could theoretically be implemented, but at the very least it's quite difficult, because html5lib moves the parse tree around as it's being built.

Specifically:

NAME: str = 'html5lib'
TRACKS_LINE_NUMBERS: bool = True

html5lib can tell us which line number and position in the original file is the source of an element.

features: Iterable[str] = ['html5lib', 'permissive', 'html5', 'html']
feed(markup: str | bytes) None

Run some incoming markup through some parsing process, populating the BeautifulSoup object in HTML5TreeBuilder.soup.

test_fragment_to_document(fragment: str) str

See TreeBuilder.

user_specified_encoding: str | None
class bs4.builder.HTMLParserTreeBuilder(parser_args: Iterable[Any] | None = None, parser_kwargs: Dict[str, Any] | None = None, **kwargs: Any)

Bases: HTMLTreeBuilder

A Beautiful Soup bs4.builder.TreeBuilder that uses html.parser.HTMLParser, the parser found in the Python standard library.

NAME: str = 'html.parser'
TRACKS_LINE_NUMBERS: bool = True

The html.parser knows which line number and position in the original file is the source of an element.
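For example (a minimal sketch, assuming Beautiful Soup 4 is installed), the source position shows up on each Tag and can be disabled with store_line_numbers=False:

```python
from bs4 import BeautifulSoup

markup = "<html>\n<body>\n<p>Hello</p>\n</body>\n</html>"

soup = BeautifulSoup(markup, "html.parser")
p = soup.find("p")
print(p.sourceline, p.sourcepos)  # 3 0: <p> opens on line 3, column 0

# Opt out of position tracking; sourceline and sourcepos become None.
soup2 = BeautifulSoup(markup, "html.parser", store_line_numbers=False)
print(soup2.find("p").sourceline)  # None
```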

features: Iterable[str] = ['html.parser', 'html', 'strict']
feed(markup: _RawMarkup) None

Run incoming markup through some parsing process.

is_xml: bool = False
parser_args: Tuple[Iterable[Any], Dict[str, Any]]
picklable: bool = True
prepare_markup(markup: _RawMarkup, user_specified_encoding: _Encoding | None = None, document_declared_encoding: _Encoding | None = None, exclude_encodings: _Encodings | None = None) Iterable[Tuple[str, _Encoding | None, _Encoding | None, bool]]

Run any preliminary steps necessary to make incoming markup acceptable to the parser.

Parameters:
  • markup -- Some markup -- probably a bytestring.

  • user_specified_encoding -- The user asked to try this encoding.

  • document_declared_encoding -- The markup itself claims to be in this encoding.

  • exclude_encodings -- The user asked _not_ to try any of these encodings.

Yield:

A series of 4-tuples: (markup, encoding, declared encoding, has undergone character replacement)

Each 4-tuple represents a strategy for parsing the document. This TreeBuilder uses Unicode, Dammit to convert the markup into Unicode, so the markup element of the tuple will always be a string.
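A sketch of calling prepare_markup directly (normally the BeautifulSoup constructor does this for you; assumes Beautiful Soup 4 is installed):

```python
from bs4.builder import HTMLParserTreeBuilder

builder = HTMLParserTreeBuilder()

# UTF-8 bytes for '<p>café</p>'.
strategies = list(builder.prepare_markup(b"<p>caf\xc3\xa9</p>"))

# Each strategy's markup element has already been decoded to str.
for markup, encoding, declared, replaced in strategies:
    print(type(markup).__name__, encoding, declared, replaced)
```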

class bs4.builder.HTMLTreeBuilder(multi_valued_attributes: Dict[str, Set[str]] = <object object>, preserve_whitespace_tags: Set[str] = <object object>, store_line_numbers: bool = <object object>, string_containers: Dict[str, Type[NavigableString]] = <object object>, empty_element_tags: Set[str] = <object object>, attribute_dict_class: Type[AttributeDict] = <class 'bs4.element.AttributeDict'>, attribute_value_list_class: Type[AttributeValueList] = <class 'bs4.element.AttributeValueList'>)

Bases: TreeBuilder

This TreeBuilder knows facts about HTML, such as which tags are treated specially by the HTML standard.

DEFAULT_BLOCK_ELEMENTS: Set[str] = {'address', 'article', 'aside', 'blockquote', 'canvas', 'dd', 'div', 'dl', 'dt', 'fieldset', 'figcaption', 'figure', 'footer', 'form', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'header', 'hr', 'li', 'main', 'nav', 'noscript', 'ol', 'output', 'p', 'pre', 'section', 'table', 'tfoot', 'ul', 'video'}

The HTML standard defines these tags as block-level elements. Beautiful Soup does not treat these elements differently from other elements, but it may do so eventually, and this information is available if you need to use it.

DEFAULT_CDATA_LIST_ATTRIBUTES: Dict[str, Set[str]] = {'*': {'accesskey', 'class', 'dropzone'}, 'a': {'rel', 'rev'}, 'area': {'rel'}, 'form': {'accept-charset'}, 'icon': {'sizes'}, 'iframe': {'sandbox'}, 'link': {'rel', 'rev'}, 'object': {'archive'}, 'output': {'for'}, 'td': {'headers'}, 'th': {'headers'}}

The HTML standard defines these attributes as containing a space-separated list of values, not a single value. That is, class="foo bar" means that the 'class' attribute has two values, 'foo' and 'bar', not the single value 'foo bar'. When we encounter one of these attributes, we will parse its value into a list of values if possible. Upon output, the list will be converted back into a string.
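For instance (assuming Beautiful Soup 4 is installed):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="foo bar" id="main">text</p>', "html.parser")

print(soup.p["class"])  # ['foo', 'bar']: 'class' is multi-valued
print(soup.p["id"])     # 'main': 'id' holds a single value

# On output, the list is serialized back into a space-separated string.
print(str(soup.p))
```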

DEFAULT_EMPTY_ELEMENT_TAGS: Set[str] | None = {'area', 'base', 'basefont', 'bgsound', 'br', 'col', 'command', 'embed', 'frame', 'hr', 'image', 'img', 'input', 'isindex', 'keygen', 'link', 'menuitem', 'meta', 'nextid', 'param', 'source', 'spacer', 'track', 'wbr'}

Some HTML tags are defined as having no contents. Beautiful Soup treats these specially.

DEFAULT_PRESERVE_WHITESPACE_TAGS: set[str] = {'pre', 'textarea'}

By default, whitespace inside these HTML tags will be preserved rather than being collapsed.
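Both behaviors can be seen with the stock html.parser builder (a sketch assuming Beautiful Soup 4 is installed): <br> serializes as a self-closing empty-element tag, while whitespace inside <pre> is left untouched:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<br><pre>  two\n  lines </pre>", "html.parser")

# <br> is in DEFAULT_EMPTY_ELEMENT_TAGS, so it serializes self-closed.
print(str(soup.br))  # '<br/>'

# The whitespace inside <pre> survives.
print(repr(soup.pre.string))
```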

DEFAULT_STRING_CONTAINERS: Dict[str, Type[bs4.element.NavigableString]] = {'rp': <class 'bs4.element.RubyParenthesisString'>, 'rt': <class 'bs4.element.RubyTextString'>, 'script': <class 'bs4.element.Script'>, 'style': <class 'bs4.element.Stylesheet'>, 'template': <class 'bs4.element.TemplateString'>}

These HTML tags need special treatment so they can be represented by a string class other than bs4.element.NavigableString.

For some of these tags, it's because the HTML standard defines an unusual content model for them. I made this list by going through the HTML spec (https://html.spec.whatwg.org/#metadata-content) and looking for "metadata content" elements that can contain strings.

The Ruby tags (<rt> and <rp>) are here despite being normal "phrasing content" tags, because the content they contain is qualitatively different from other text in the document, and it can be useful to be able to distinguish it.

TODO: Arguably <noscript> could go here but it seems qualitatively different from the other tags.
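For example (a sketch assuming Beautiful Soup 4 is installed), strings inside <script> and <style> come back as dedicated subclasses of NavigableString:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Script, Stylesheet

soup = BeautifulSoup(
    "<style>p { color: red }</style>"
    "<script>var x = 1;</script>"
    "<p>hi</p>",
    "html.parser",
)

print(type(soup.style.string).__name__)   # Stylesheet
print(type(soup.script.string).__name__)  # Script
print(type(soup.p.string).__name__)       # NavigableString
```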

class bs4.builder.LXMLTreeBuilder(parser: XMLParser | None = None, empty_element_tags: Set[str] | None = None, **kwargs: Any)

Bases: HTMLTreeBuilder, LXMLTreeBuilderForXML

ALTERNATE_NAMES: Iterable[str] = ['lxml-html']
NAME: str = 'lxml'
default_parser(encoding: _Encoding | None) _ParserOrParserClass

Find the default parser for the given encoding.

Returns:

Either a parser object or a class, which will be instantiated with default arguments.

features: Iterable[str] = ['lxml-html', 'lxml', 'html', 'fast', 'permissive']
feed(markup: _RawMarkup) None

Run incoming markup through some parsing process.

is_xml: bool = False
test_fragment_to_document(fragment: str) str

See TreeBuilder.

class bs4.builder.LXMLTreeBuilderForXML(parser: XMLParser | None = None, empty_element_tags: Set[str] | None = None, **kwargs: Any)

Bases: TreeBuilder

ALTERNATE_NAMES: Iterable[str] = ['xml']
CHUNK_SIZE: int = 512
DEFAULT_NSMAPS: _NamespaceMapping = {'xml': 'http://www.w3.org/XML/1998/namespace'}
DEFAULT_NSMAPS_INVERTED: _InvertedNamespaceMapping = {'http://www.w3.org/XML/1998/namespace': 'xml'}
DEFAULT_PARSER_CLASS

alias of XMLParser

NAME: str = 'lxml-xml'
close() None
comment(text: str | bytes) None

Handle comments as Comment objects.

data(data: str | bytes) None
default_parser(encoding: _Encoding | None) _ParserOrParserClass

Find the default parser for the given encoding.

Returns:

Either a parser object or a class, which will be instantiated with default arguments.

doctype(name: str, pubid: str, system: str) None
end(tag: str | bytes) None
features: Iterable[str] = ['lxml-xml', 'lxml', 'xml', 'fast', 'permissive']
feed(markup: _RawMarkup) None

Run incoming markup through some parsing process.

initialize_soup(soup: BeautifulSoup) None

Let the BeautifulSoup object know about the standard namespace mapping.

Parameters:

soup -- A BeautifulSoup.

is_xml: bool = True
nsmaps: List[_InvertedNamespaceMapping | None]
parser: Any
parser_for(encoding: _Encoding | None) _LXMLParser

Instantiate an appropriate parser for the given encoding.

Parameters:

encoding -- A string.

Returns:

A parser object such as an etree.XMLParser.

pi(target: str, data: str) None
prepare_markup(markup: _RawMarkup, user_specified_encoding: _Encoding | None = None, document_declared_encoding: _Encoding | None = None, exclude_encodings: _Encodings | None = None) Iterable[Tuple[str | bytes, _Encoding | None, _Encoding | None, bool]]

Run any preliminary steps necessary to make incoming markup acceptable to the parser.

lxml really wants to get a bytestring and convert it to Unicode itself. So instead of using UnicodeDammit to convert the bytestring to Unicode using different encodings, this implementation uses EncodingDetector to iterate over the encodings, and tell lxml to try to parse the document as each one in turn.

Parameters:
  • markup -- Some markup -- hopefully a bytestring.

  • user_specified_encoding -- The user asked to try this encoding.

  • document_declared_encoding -- The markup itself claims to be in this encoding.

  • exclude_encodings -- The user asked _not_ to try any of these encodings.

Yield:

A series of 4-tuples: (markup, encoding, declared encoding, has undergone character replacement)

Each 4-tuple represents a strategy for converting the document to Unicode and parsing it. Each strategy will be tried in turn.

processing_instruction_class: Type[ProcessingInstruction]
start(tag: str | bytes, attrib: Dict[str | bytes, str | bytes], nsmap: _NamespaceMapping = {}) None
test_fragment_to_document(fragment: str) str

See TreeBuilder.

exception bs4.builder.ParserRejectedMarkup(message_or_exception: str | Exception)

Bases: Exception

An Exception to be raised when the underlying parser simply refuses to parse the given markup.

class bs4.builder.TreeBuilder(multi_valued_attributes: Dict[str, Set[str]] = <object object>, preserve_whitespace_tags: Set[str] = <object object>, store_line_numbers: bool = <object object>, string_containers: Dict[str, Type[NavigableString]] = <object object>, empty_element_tags: Set[str] = <object object>, attribute_dict_class: Type[AttributeDict] = <class 'bs4.element.AttributeDict'>, attribute_value_list_class: Type[AttributeValueList] = <class 'bs4.element.AttributeValueList'>)

Bases: object

Turn a textual document into a Beautiful Soup object tree.

This is an abstract superclass which smooths out the behavior of different parser libraries into a single, unified interface.

Parameters:
  • multi_valued_attributes --

    If this is set to None, the TreeBuilder will not turn any values for attributes like 'class' into lists. Setting this to a dictionary will customize this behavior; look at bs4.builder.HTMLTreeBuilder.DEFAULT_CDATA_LIST_ATTRIBUTES for an example.

    Internally, these are called "CDATA list attributes", but that probably doesn't make sense to an end-user, so the argument name is multi_valued_attributes.

  • preserve_whitespace_tags -- A set of tags to treat the way <pre> tags are treated in HTML. Tags in this set are immune from pretty-printing; their contents will always be output as-is.

  • string_containers -- A dictionary mapping tag names to the classes that should be instantiated to contain the textual contents of those tags. The default is to use NavigableString for every tag, no matter what the name. You can override the default by changing DEFAULT_STRING_CONTAINERS.

  • store_line_numbers -- If the parser keeps track of the line numbers and positions of the original markup, that information will, by default, be stored in each corresponding bs4.element.Tag object. You can turn this off by passing store_line_numbers=False; then Tag.sourcepos and Tag.sourceline will always be None. If the parser you're using doesn't keep track of this information, then store_line_numbers is irrelevant.

  • attribute_value_list_class -- The value of a multi-valued attribute (such as HTML's 'class') will be stored in an instance of this class. The default is Beautiful Soup's built-in AttributeValueList, which is a normal Python list, and you will probably never need to change it.
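These parameters are normally passed through the BeautifulSoup constructor rather than to the TreeBuilder directly. A minimal sketch of multi_valued_attributes (assuming Beautiful Soup 4 is installed):

```python
from bs4 import BeautifulSoup

markup = '<p class="a b">text</p>'

# Default behavior: 'class' is parsed into a list of values.
print(BeautifulSoup(markup, "html.parser").p["class"])  # ['a', 'b']

# Passing multi_valued_attributes=None disables the list treatment.
soup = BeautifulSoup(markup, "html.parser", multi_valued_attributes=None)
print(soup.p["class"])  # 'a b'
```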

ALTERNATE_NAMES: Iterable[str] = []
DEFAULT_CDATA_LIST_ATTRIBUTES: Dict[str, Set[str]] = {}

A value for these tag/attribute combinations is a space- or comma-separated list of CDATA, rather than a single CDATA.

DEFAULT_EMPTY_ELEMENT_TAGS: Set[str] | None = None

By default, tags are treated as empty-element tags if they have no contents--that is, using XML rules. HTMLTreeBuilder defines a different set of DEFAULT_EMPTY_ELEMENT_TAGS based on the HTML 4 and HTML5 standards.

DEFAULT_PRESERVE_WHITESPACE_TAGS: Set[str] = {}

Whitespace should be preserved inside these tags.

DEFAULT_STRING_CONTAINERS: Dict[str, Type[bs4.element.NavigableString]] = {}

The textual contents of tags with these names should be instantiated with some class other than bs4.element.NavigableString.

NAME: str = '[Unknown tree builder]'
TRACKS_LINE_NUMBERS: bool = False

Most parsers don't keep track of line numbers.

can_be_empty_element(tag_name: str) bool

Might a tag with this name be an empty-element tag?

The final markup may or may not actually present this tag as self-closing.

For instance: an HTMLTreeBuilder does not consider a <p> tag to be an empty-element tag (it's not in HTMLTreeBuilder.empty_element_tags). This means an empty <p> tag will be presented as "<p></p>", not "<p/>" or "<p>".

The default implementation has no opinion about which tags are empty-element tags, so a tag will be presented as an empty-element tag if and only if it has no children. "<foo></foo>" will become "<foo/>", and "<foo>bar</foo>" will be left alone.

Parameters:

tag_name -- The name of a markup tag.
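For example (assuming Beautiful Soup 4 is installed), with the HTML rules loaded by HTMLParserTreeBuilder:

```python
from bs4.builder import HTMLParserTreeBuilder

builder = HTMLParserTreeBuilder()
print(builder.can_be_empty_element("br"))  # True: <br> never has contents
print(builder.can_be_empty_element("p"))   # False: <p> can hold children
```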

features: Iterable[str] = []
feed(markup: _RawMarkup) None

Run incoming markup through some parsing process.

initialize_soup(soup: BeautifulSoup) None

The BeautifulSoup object has been initialized and is now being associated with the TreeBuilder.

Parameters:

soup -- A BeautifulSoup object.

is_xml: bool = False
picklable: bool = False
reset() None

Do any work necessary to reset the underlying parser for a new document.

By default, this does nothing.

class bs4.builder.TreeBuilderRegistry

Bases: object

A way of looking up TreeBuilder subclasses by their name or by desired features.

builders: List[Type[TreeBuilder]]
builders_for_feature: Dict[str, List[Type[TreeBuilder]]]
lookup(*features: str) Type[TreeBuilder] | None

Look up a TreeBuilder subclass with the desired features.

Parameters:

features -- A list of features to look for. If none are provided, the most recently registered TreeBuilder subclass will be used.

Returns:

A TreeBuilder subclass, or None if there's no registered subclass with all the requested features.

register(treebuilder_class: type[TreeBuilder]) None

Register a treebuilder based on its advertised features.

Parameters:

treebuilder_class -- A subclass of TreeBuilder. Its TreeBuilder.features attribute should list its features.
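A minimal sketch of registering and looking up a builder with a private registry (Beautiful Soup also maintains a module-level instance, bs4.builder.builder_registry, that works the same way; assumes Beautiful Soup 4 is installed):

```python
from bs4.builder import HTMLParserTreeBuilder, TreeBuilderRegistry

registry = TreeBuilderRegistry()
registry.register(HTMLParserTreeBuilder)

# Lookup matches any advertised feature: 'html.parser', 'html', 'strict'.
print(registry.lookup("html") is HTMLParserTreeBuilder)  # True

# No registered builder has this feature, so lookup returns None.
print(registry.lookup("no-such-feature"))  # None
```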