Can someone point me to an HTML parser that turns an HTML document into a nested data structure like the one I sketched out below? I'm sick of having to jump through hoops to collect the text of a link. I know there's something similar for XML because I heard about it at EuroPython. Stop me before I write my own! It must work, or be made to work, in Python 1.5.2. Offer not valid in "Mirror, Mirror" universe.

Sorry, no spectacular thing to show you today. I've got only one of the three things working that I wanted to have, and (sad to say) it's not the one that hits people in the gut and makes them think This Is Important, insofar as any of them do that.

This is an HTML document.

->

["This is an", [TAG name='a' attrs={href: 'http://www.w3.org/MarkUp/', title: "Brought to you by HTMLCorp"} child=[TAG name='b' child=["HTML"]], "document"], "."]
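For illustration, here is a minimal sketch of how a structure like the one above could be built. It assumes modern Python 3 and the standard library's `html.parser`, not Python 1.5.2. The names `Tag`, `TreeBuilder`, and `parse` are hypothetical, and the sample markup is a guess at the source reconstructed from the sketch. It also assumes reasonably well-nested markup: an unclosed tag such as a bare `<br>` would swallow everything after it.

```python
from html.parser import HTMLParser


class Tag:
    """Hypothetical node type: a tag name, its attributes, and its children."""

    def __init__(self, name, attrs):
        self.name = name
        self.attrs = dict(attrs)
        self.children = []  # strings and Tag objects, in document order

    def text(self):
        """Collect all the text beneath this tag, e.g. the text of a link."""
        return "".join(
            child if isinstance(child, str) else child.text()
            for child in self.children
        )

    def __repr__(self):
        return "[TAG name=%r attrs=%r child=%r]" % (
            self.name, self.attrs, self.children)


class TreeBuilder(HTMLParser):
    """Builds a nested list of strings and Tag objects from parser events."""

    def __init__(self):
        super().__init__()
        self.root = []
        self.stack = [self.root]  # the list currently being appended to is last

    def handle_starttag(self, name, attrs):
        tag = Tag(name, attrs)
        self.stack[-1].append(tag)
        self.stack.append(tag.children)

    def handle_endtag(self, name):
        # Naive: assumes the end tag matches the most recently opened tag.
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        self.stack[-1].append(data)


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


# Assumed source markup, reconstructed from the sketch above.
doc = parse('This is an <a href="http://www.w3.org/MarkUp/" '
            'title="Brought to you by HTMLCorp"><b>HTML</b> document</a>.')
link = doc[1]
print(link.text())  # HTML document
```

With the tree in hand, collecting the text of a link is a single call to `text()` instead of a pile of parser callbacks.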

Update, much later: I wrote Beautiful Soup because I wasn't happy with any existing parser.


Comments:

Posted by Nick Moffitt at Mon Mar 29 2004 11:43

You mean like DOM? You want a DOM for HTML, right? If you stick to XHTML, you could just use the xmllib DOM.

Python's built-in XML stuff in the 2.x series was originally a standalone module in the 1.5.x days. I think it used nested dictionaries instead of lists, though.

This potentially misguided tip was brought to you by Lambda. Lambda: The Ultimate Buzzword!

Posted by Leonard at Mon Mar 29 2004 12:22

Yeah, I want a DOM. A nested dictionary would be fine. Unfortunately I don't control the HTML, so I can't guarantee that it's XHTML. I can't think of a way of converting HTML to XHTML that's not equivalent to just building the data structure myself.

Posted by Fredrik at Mon Mar 29 2004 12:27

http://www.effbot.org/zone/element-tidylib.htm

Posted by Aaron Swartz at Mon Mar 29 2004 12:36

Perhaps the one for XML you were thinking of was my xmltramp.

That page has a link to PyMeld, which might be what you want.

Posted by Leonard at Mon Mar 29 2004 13:07

I just realized that I've already got an HTML->XHTML converter packaged with NewsBruiser: Gary Benson's HTMLTidy (I guess great minds think alike). I should be able to use that and SimpleXMLTreeBuilder to do what I want. Thanks!



Unless otherwise noted, all content licensed by Leonard Richardson
under a Creative Commons License.