bs4.builder package

Module contents

class bs4.builder.DetectsXMLParsedAsHTML

Bases: object

A mixin class for any class (a TreeBuilder, or some class used by a TreeBuilder) that's in a position to detect whether an XML document is being incorrectly parsed as HTML, and issue an appropriate warning.

This requires being able to observe an incoming processing instruction that might be an XML declaration, and also being able to observe tags as they're opened. If you can't do that for a given TreeBuilder, there's a less reliable implementation based on examining the raw markup.

LOOKS_LIKE_HTML: Pattern[str] = re.compile('<[^ +]html', re.IGNORECASE)

Regular expression for seeing if string markup has an <html> tag.

LOOKS_LIKE_HTML_B: Pattern[bytes] = re.compile(b'<[^ +]html', re.IGNORECASE)

Regular expression for seeing if byte markup has an <html> tag.

XML_PREFIX: str = '<?xml'

The start of an XML document string.

XML_PREFIX_B: bytes = b'<?xml'

The start of an XML document bytestring.

classmethod warn_if_markup_looks_like_xml(markup: _RawMarkup | None, stacklevel: int = 3) bool

Perform a check on some markup to see if it looks like XML that's not XHTML. If so, issue a warning.

This is much less reliable than doing the check while parsing, but some of the tree builders can't do that.

Parameters:
  • markup -- Some markup to check.

  • stacklevel -- The stacklevel of the code calling this function.

Returns:

True if the markup looks like non-XHTML XML, False otherwise.

class bs4.builder.HTML5TreeBuilder(multi_valued_attributes: Dict[str, Set[str]] = <object object>, preserve_whitespace_tags: Set[str] = <object object>, store_line_numbers: bool = <object object>, string_containers: Dict[str, Type[NavigableString]] = <object object>, empty_element_tags: Set[str] = <object object>, attribute_dict_class: Type[AttributeDict] = <class 'bs4.element.AttributeDict'>, attribute_value_list_class: Type[AttributeValueList] = <class 'bs4.element.AttributeValueList'>)

Bases: HTMLTreeBuilder

Use html5lib to build a tree.

Note that HTML5TreeBuilder does not support some common HTML TreeBuilder features. Some of these features could theoretically be implemented, but at the very least it's quite difficult, because html5lib moves the parse tree around as it's being built.

Specifically:

NAME: str = 'html5lib'
TRACKS_LINE_NUMBERS: bool = True

html5lib can tell us which line number and position in the original file is the source of an element.

features: Iterable[str] = ['html5lib', 'permissive', 'html5', 'html']
feed(markup: str | bytes) None

Run some incoming markup through some parsing process, populating the BeautifulSoup object in HTML5TreeBuilder.soup.

test_fragment_to_document(fragment: str) str

See TreeBuilder.

user_specified_encoding: str | None
class bs4.builder.HTMLParserTreeBuilder(parser_args: Iterable[Any] | None = None, parser_kwargs: Dict[str, Any] | None = None, **kwargs: Any)

Bases: HTMLTreeBuilder

A Beautiful Soup bs4.builder.TreeBuilder that uses html.parser.HTMLParser, the parser found in the Python standard library.

NAME: str = 'html.parser'
TRACKS_LINE_NUMBERS: bool = True

The html.parser knows which line number and position in the original file is the source of an element.
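For example (a minimal sketch, assuming Beautiful Soup 4 is installed), the source position shows up on each Tag and can be disabled with store_line_numbers=False:

```python
from bs4 import BeautifulSoup

markup = "<html>\n<body>\n<p>Hello</p>\n</body>\n</html>"

soup = BeautifulSoup(markup, "html.parser")
p = soup.find("p")
print(p.sourceline, p.sourcepos)  # 3 0: <p> opens on line 3, column 0

# Opt out of position tracking; sourceline and sourcepos become None.
soup2 = BeautifulSoup(markup, "html.parser", store_line_numbers=False)
print(soup2.find("p").sourceline)  # None
```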

features: Iterable[str] = ['html.parser', 'html', 'strict']
feed(markup: _RawMarkup) None

Run incoming markup through some parsing process.

is_xml: bool = False
parser_args: Tuple[Iterable[Any], Dict[str, Any]]
picklable: bool = True
prepare_markup(markup: _RawMarkup, user_specified_encoding: _Encoding | None = None, document_declared_encoding: _Encoding | None = None, exclude_encodings: _Encodings | None = None) Iterable[Tuple[str, _Encoding | None, _Encoding | None, bool]]

Run any preliminary steps necessary to make incoming markup acceptable to the parser.

Parameters:
  • markup -- Some markup -- probably a bytestring.

  • user_specified_encoding -- The user asked to try this encoding.

  • document_declared_encoding -- The markup itself claims to be in this encoding.

  • exclude_encodings -- The user asked _not_ to try any of these encodings.

Yield:

A series of 4-tuples: (markup, encoding, declared encoding, has undergone character replacement)

Each 4-tuple represents a strategy for parsing the document. This TreeBuilder uses Unicode, Dammit to convert the markup into Unicode, so the markup element of the tuple will always be a string.
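A sketch of calling prepare_markup directly (normally the BeautifulSoup constructor does this for you; assumes Beautiful Soup 4 is installed):

```python
from bs4.builder import HTMLParserTreeBuilder

builder = HTMLParserTreeBuilder()

# UTF-8 bytes for '<p>café</p>'.
strategies = list(builder.prepare_markup(b"<p>caf\xc3\xa9</p>"))

# Each strategy's markup element has already been decoded to str.
for markup, encoding, declared, replaced in strategies:
    print(type(markup).__name__, encoding, declared, replaced)
```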

class bs4.builder.HTMLTreeBuilder(multi_valued_attributes: Dict[str, Set[str]] = <object object>, preserve_whitespace_tags: Set[str] = <object object>, store_line_numbers: bool = <object object>, string_containers: Dict[str, Type[NavigableString]] = <object object>, empty_element_tags: Set[str] = <object object>, attribute_dict_class: Type[AttributeDict] = <class 'bs4.element.AttributeDict'>, attribute_value_list_class: Type[AttributeValueList] = <class 'bs4.element.AttributeValueList'>)

Bases: TreeBuilder

This TreeBuilder knows facts about HTML, such as which tags are treated specially by the HTML standard.

DEFAULT_BLOCK_ELEMENTS: Set[str] = {'address', 'article', 'aside', 'blockquote', 'canvas', 'dd', 'div', 'dl', 'dt', 'fieldset', 'figcaption', 'figure', 'footer', 'form', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'header', 'hr', 'li', 'main', 'nav', 'noscript', 'ol', 'output', 'p', 'pre', 'section', 'table', 'tfoot', 'ul', 'video'}

The HTML standard defines these tags as block-level elements. Beautiful Soup does not treat these elements differently from other elements, but it may do so eventually, and this information is available if you need to use it.

DEFAULT_CDATA_LIST_ATTRIBUTES: Dict[str, Set[str]] = {'*': {'accesskey', 'class', 'dropzone'}, 'a': {'rel', 'rev'}, 'area': {'rel'}, 'form': {'accept-charset'}, 'icon': {'sizes'}, 'iframe': {'sandbox'}, 'link': {'rel', 'rev'}, 'object': {'archive'}, 'output': {'for'}, 'td': {'headers'}, 'th': {'headers'}}

The HTML standard defines these attributes as containing a space-separated list of values, not a single value. That is, class="foo bar" means that the 'class' attribute has two values, 'foo' and 'bar', not the single value 'foo bar'. When we encounter one of these attributes, we will parse its value into a list of values if possible. Upon output, the list will be converted back into a string.
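For instance (assuming Beautiful Soup 4 is installed):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="foo bar" id="main">text</p>', "html.parser")

print(soup.p["class"])  # ['foo', 'bar']: 'class' is multi-valued
print(soup.p["id"])     # 'main': 'id' holds a single value

# On output, the list is serialized back into a space-separated string.
print(str(soup.p))
```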

DEFAULT_EMPTY_ELEMENT_TAGS: Set[str] | None = {'area', 'base', 'basefont', 'bgsound', 'br', 'col', 'command', 'embed', 'frame', 'hr', 'image', 'img', 'input', 'isindex', 'keygen', 'link', 'menuitem', 'meta', 'nextid', 'param', 'source', 'spacer', 'track', 'wbr'}

Some HTML tags are defined as having no contents. Beautiful Soup treats these specially.

DEFAULT_PRESERVE_WHITESPACE_TAGS: set[str] = {'pre', 'textarea'}

By default, whitespace inside these HTML tags will be preserved rather than being collapsed.
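Both behaviors can be seen with the stock html.parser builder (a sketch assuming Beautiful Soup 4 is installed): <br> serializes as a self-closing empty-element tag, while whitespace inside <pre> is left untouched:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<br><pre>  two\n  lines </pre>", "html.parser")

# <br> is in DEFAULT_EMPTY_ELEMENT_TAGS, so it serializes self-closed.
print(str(soup.br))  # '<br/>'

# The whitespace inside <pre> survives.
print(repr(soup.pre.string))
```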

DEFAULT_STRING_CONTAINERS: Dict[str, Type[bs4.element.NavigableString]] = {'rp': <class 'bs4.element.RubyParenthesisString'>, 'rt': <class 'bs4.element.RubyTextString'>, 'script': <class 'bs4.element.Script'>, 'style': <class 'bs4.element.Stylesheet'>, 'template': <class 'bs4.element.TemplateString'>}

These HTML tags need special treatment so they can be represented by a string class other than bs4.element.NavigableString.

For some of these tags, it's because the HTML standard defines an unusual content model for them. I made this list by going through the HTML spec (https://html.spec.whatwg.org/#metadata-content) and looking for "metadata content" elements that can contain strings.

The Ruby tags (<rt> and <rp>) are here despite being normal "phrasing content" tags, because the content they contain is qualitatively different from other text in the document, and it can be useful to be able to distinguish it.

TODO: Arguably <noscript> could go here but it seems qualitatively different from the other tags.
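For example (a sketch assuming Beautiful Soup 4 is installed), strings inside <script> and <style> come back as dedicated subclasses of NavigableString:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Script, Stylesheet

soup = BeautifulSoup(
    "<style>p { color: red }</style>"
    "<script>var x = 1;</script>"
    "<p>hi</p>",
    "html.parser",
)

print(type(soup.style.string).__name__)   # Stylesheet
print(type(soup.script.string).__name__)  # Script
print(type(soup.p.string).__name__)       # NavigableString
```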

class bs4.builder.LXMLTreeBuilder(parser: XMLParser | None = None, empty_element_tags: Set[str] | None = None, **kwargs: Any)

Bases: HTMLTreeBuilder, LXMLTreeBuilderForXML

ALTERNATE_NAMES: Iterable[str] = ['lxml-html']
NAME: str = 'lxml'
default_parser(encoding: _Encoding | None) _ParserOrParserClass

Find the default parser for the given encoding.

Returns:

Either a parser object or a class, which will be instantiated with default arguments.

features: Iterable[str] = ['lxml-html', 'lxml', 'html', 'fast', 'permissive']
feed(markup: _RawMarkup) None

Run incoming markup through some parsing process.

is_xml: bool = False
test_fragment_to_document(fragment: str) str

See TreeBuilder.

class bs4.builder.LXMLTreeBuilderForXML(parser: XMLParser | None = None, empty_element_tags: Set[str] | None = None, **kwargs: Any)

Bases: TreeBuilder

ALTERNATE_NAMES: Iterable[str] = ['xml']
CHUNK_SIZE: int = 512
DEFAULT_NSMAPS: _NamespaceMapping = {'xml': 'http://www.w3.org/XML/1998/namespace'}
DEFAULT_NSMAPS_INVERTED: _InvertedNamespaceMapping = {'http://www.w3.org/XML/1998/namespace': 'xml'}
DEFAULT_PARSER_CLASS

alias of XMLParser

NAME: str = 'lxml-xml'
close() None
comment(text: str | bytes) None

Handle comments as Comment objects.

data(data: str | bytes) None
default_parser(encoding: _Encoding | None) _ParserOrParserClass

Find the default parser for the given encoding.

Returns:

Either a parser object or a class, which will be instantiated with default arguments.

doctype(name: str, pubid: str, system: str) None
end(tag: str | bytes) None
features: Iterable[str] = ['lxml-xml', 'lxml', 'xml', 'fast', 'permissive']
feed(markup: _RawMarkup) None

Run incoming markup through some parsing process.

initialize_soup(soup: BeautifulSoup) None

Let the BeautifulSoup object know about the standard namespace mapping.

Parameters:

soup -- A BeautifulSoup.

is_xml: bool = True
nsmaps: List[_InvertedNamespaceMapping | None]
parser: Any
parser_for(encoding: _Encoding | None) _LXMLParser

Instantiate an appropriate parser for the given encoding.

Parameters:

encoding -- A string.

Returns:

A parser object such as an etree.XMLParser.

pi(target: str, data: str) None
prepare_markup(markup: _RawMarkup, user_specified_encoding: _Encoding | None = None, document_declared_encoding: _Encoding | None = None, exclude_encodings: _Encodings | None = None) Iterable[Tuple[str | bytes, _Encoding | None, _Encoding | None, bool]]

Run any preliminary steps necessary to make incoming markup acceptable to the parser.

lxml really wants to get a bytestring and convert it to Unicode itself. So instead of using UnicodeDammit to convert the bytestring to Unicode using different encodings, this implementation uses EncodingDetector to iterate over the encodings, and tell lxml to try to parse the document as each one in turn.

Parameters:
  • markup -- Some markup -- hopefully a bytestring.

  • user_specified_encoding -- The user asked to try this encoding.

  • document_declared_encoding -- The markup itself claims to be in this encoding.

  • exclude_encodings -- The user asked _not_ to try any of these encodings.

Yield:

A series of 4-tuples: (markup, encoding, declared encoding, has undergone character replacement)

Each 4-tuple represents a strategy for converting the document to Unicode and parsing it. Each strategy will be tried in turn.

processing_instruction_class: Type[ProcessingInstruction]
start(tag: str | bytes, attrib: Dict[str | bytes, str | bytes], nsmap: _NamespaceMapping = {}) None
test_fragment_to_document(fragment: str) str

See TreeBuilder.

exception bs4.builder.ParserRejectedMarkup(message_or_exception: str | Exception)

Bases: Exception

An Exception to be raised when the underlying parser simply refuses to parse the given markup.

class bs4.builder.TreeBuilder(multi_valued_attributes: Dict[str, Set[str]] = <object object>, preserve_whitespace_tags: Set[str] = <object object>, store_line_numbers: bool = <object object>, string_containers: Dict[str, Type[NavigableString]] = <object object>, empty_element_tags: Set[str] = <object object>, attribute_dict_class: Type[AttributeDict] = <class 'bs4.element.AttributeDict'>, attribute_value_list_class: Type[AttributeValueList] = <class 'bs4.element.AttributeValueList'>)

Bases: object

Turn a textual document into a Beautiful Soup object tree.

This is an abstract superclass which smooths out the behavior of different parser libraries into a single, unified interface.

Parameters:
  • multi_valued_attributes --

    If this is set to None, the TreeBuilder will not turn any values for attributes like 'class' into lists. Setting this to a dictionary will customize this behavior; look at bs4.builder.HTMLTreeBuilder.DEFAULT_CDATA_LIST_ATTRIBUTES for an example.

    Internally, these are called "CDATA list attributes", but that probably doesn't make sense to an end-user, so the argument name is multi_valued_attributes.

  • preserve_whitespace_tags -- A set of tags to treat the way <pre> tags are treated in HTML. Tags in this set are immune from pretty-printing; their contents will always be output as-is.

  • string_containers -- A dictionary mapping tag names to the classes that should be instantiated to contain the textual contents of those tags. The default is to use NavigableString for every tag, no matter what the name. You can override the default by changing DEFAULT_STRING_CONTAINERS.

  • store_line_numbers -- If the parser keeps track of the line numbers and positions of the original markup, that information will, by default, be stored in each corresponding bs4.element.Tag object. You can turn this off by passing store_line_numbers=False; then Tag.sourcepos and Tag.sourceline will always be None. If the parser you're using doesn't keep track of this information, then store_line_numbers is irrelevant.

  • attribute_value_list_class -- The value of a multi-valued attribute (such as HTML's 'class') will be stored in an instance of this class. The default is Beautiful Soup's built-in AttributeValueList, which is a normal Python list, and you will probably never need to change it.
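These parameters are normally passed through the BeautifulSoup constructor rather than to the TreeBuilder directly. A minimal sketch of multi_valued_attributes (assuming Beautiful Soup 4 is installed):

```python
from bs4 import BeautifulSoup

markup = '<p class="a b">text</p>'

# Default behavior: 'class' is parsed into a list of values.
print(BeautifulSoup(markup, "html.parser").p["class"])  # ['a', 'b']

# Passing multi_valued_attributes=None disables the list treatment.
soup = BeautifulSoup(markup, "html.parser", multi_valued_attributes=None)
print(soup.p["class"])  # 'a b'
```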

ALTERNATE_NAMES: Iterable[str] = []
DEFAULT_CDATA_LIST_ATTRIBUTES: Dict[str, Set[str]] = {}

A value for these tag/attribute combinations is a space- or comma-separated list of CDATA, rather than a single CDATA.

DEFAULT_EMPTY_ELEMENT_TAGS: Set[str] | None = None

By default, tags are treated as empty-element tags if they have no contents--that is, using XML rules. HTMLTreeBuilder defines a different set of DEFAULT_EMPTY_ELEMENT_TAGS based on the HTML 4 and HTML5 standards.

DEFAULT_PRESERVE_WHITESPACE_TAGS: Set[str] = {}

Whitespace should be preserved inside these tags.

DEFAULT_STRING_CONTAINERS: Dict[str, Type[bs4.element.NavigableString]] = {}

The textual contents of tags with these names should be instantiated with some class other than bs4.element.NavigableString.

NAME: str = '[Unknown tree builder]'
TRACKS_LINE_NUMBERS: bool = False

Most parsers don't keep track of line numbers.

can_be_empty_element(tag_name: str) bool

Might a tag with this name be an empty-element tag?

The final markup may or may not actually present this tag as self-closing.

For instance: an HTMLTreeBuilder does not consider a <p> tag to be an empty-element tag (it's not in HTMLTreeBuilder.empty_element_tags). This means an empty <p> tag will be presented as "<p></p>", not "<p/>" or "<p>".

The default implementation has no opinion about which tags are empty-element tags, so a tag will be presented as an empty-element tag if and only if it has no children. "<foo></foo>" will become "<foo/>", and "<foo>bar</foo>" will be left alone.

Parameters:

tag_name -- The name of a markup tag.
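For example (assuming Beautiful Soup 4 is installed), with the HTML rules loaded by HTMLParserTreeBuilder:

```python
from bs4.builder import HTMLParserTreeBuilder

builder = HTMLParserTreeBuilder()
print(builder.can_be_empty_element("br"))  # True: <br> never has contents
print(builder.can_be_empty_element("p"))   # False: <p> can hold children
```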

features: Iterable[str] = []
feed(markup: _RawMarkup) None

Run incoming markup through some parsing process.

initialize_soup(soup: BeautifulSoup) None

The BeautifulSoup object has been initialized and is now being associated with the TreeBuilder.

Parameters:

soup -- A BeautifulSoup object.

is_xml: bool = False
picklable: bool = False
reset() None

Do any work necessary to reset the underlying parser for a new document.

By default, this does nothing.

class bs4.builder.TreeBuilderRegistry

Bases: object

A way of looking up TreeBuilder subclasses by their name or by desired features.

builders: List[Type[TreeBuilder]]
builders_for_feature: Dict[str, List[Type[TreeBuilder]]]
lookup(*features: str) Type[TreeBuilder] | None

Look up a TreeBuilder subclass with the desired features.

Parameters:

features -- A list of features to look for. If none are provided, the most recently registered TreeBuilder subclass will be used.

Returns:

A TreeBuilder subclass, or None if there's no registered subclass with all the requested features.

register(treebuilder_class: type[TreeBuilder]) None

Register a treebuilder based on its advertised features.

Parameters:

treebuilder_class -- A subclass of TreeBuilder. Its TreeBuilder.features attribute should list its features.
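A minimal sketch of registering and looking up a builder with a private registry (Beautiful Soup also maintains a module-level instance, bs4.builder.builder_registry, that works the same way; assumes Beautiful Soup 4 is installed):

```python
from bs4.builder import HTMLParserTreeBuilder, TreeBuilderRegistry

registry = TreeBuilderRegistry()
registry.register(HTMLParserTreeBuilder)

# Lookup matches any advertised feature: 'html.parser', 'html', 'strict'.
print(registry.lookup("html") is HTMLParserTreeBuilder)  # True

# No registered builder has this feature, so lookup returns None.
print(registry.lookup("no-such-feature"))  # None
```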