
I felt listless today, so I did some Beautiful Soup work so that I wouldn't have wasted the day. I fixed the handling of CDATA sections and doctype declarations.

Fun fact: it doesn't seem to be legal to stick a CDATA section into an HTML document. (By this I mean something like "<![CDATA[foo]]>", not the contents of a <pre> tag.) My knowledge of weird HTML constructs like CDATA comes mostly from studying Python's SGMLParser, which handles CDATA sections just fine since it's an SGML parser. So I had BS3 just create objects for CDATA sections, even when they occurred in HTML documents. But the two parsers I'm using as my testbeds for BS4 basically ignore CDATA sections in HTML documents:

By default, lxml's elementtree implementation replaces CDATA sections with the actual character data. It has an option to leave the CDATA sections alone, but this only works for XML. When lxml parses HTML, it ignores CDATA sections altogether. The HTMLParser constructor has a "strip_cdata" argument, inherited from XMLParser, but setting it to False does nothing.

BS4 can't be used to parse XML yet (unless you want to parse it by HTML rules), but once I add that, I'll have the lxml elementtree builder preserve CDATA sections.
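For XML input, the default behavior and the "strip_cdata" option described above can be seen with a quick round trip. This is only a sketch: the sample document and the roundtrip() helper are made up for illustration, and it assumes lxml is installed.

```python
from lxml import etree

# Hypothetical sample document, used only for this illustration.
XML = b"<root><![CDATA[a < b]]></root>"


def roundtrip(xml, strip_cdata):
    """Parse xml with lxml's XMLParser, then serialize it back to bytes."""
    parser = etree.XMLParser(strip_cdata=strip_cdata)
    root = etree.fromstring(xml, parser)
    return etree.tostring(root)


# strip_cdata=True is lxml's default: the CDATA wrapper is dropped and
# the character data is kept as ordinary (escaped) text.
stripped = roundtrip(XML, strip_cdata=True)

# strip_cdata=False leaves the CDATA section in place when parsing XML.
preserved = roundtrip(XML, strip_cdata=False)

print(stripped)
print(preserved)
```

The first line printed should have no CDATA marker in it, and the second should still contain the "<![CDATA[" wrapper. This only covers the XML side; it doesn't try to show what happens to CDATA in HTML.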

html5lib treats CDATA sections as broken comments, so "<![CDATA[foo]]>" becomes "<!-- [CDATA[foo]] -->". The latest version of html5lib will replace a CDATA section with the character data if the CDATA section happens within an <svg> or <math> tag (see test), but this is not in any released version.

That took me the morning to figure out, so I hope it saves someone some time. But that person would have to bear a suspicious resemblance to me.


Unless otherwise noted, all content licensed by Leonard Richardson
under a Creative Commons License.