Earlier this year I quietly retired Rubyful Soup because I think _why's
hpricot does a better job of being a Rubyish screen-scraping parser than RS
can be. But nothing similar has happened in Python, mainly because BS
is the market leader. I want to keep that going, but I also want to
take advantage of the work that's gone on in this field since 2004.
So, what are the useful features of Beautiful Soup?
Of these, the only one I really care about is #4. If I could rid
myself of the need to handle all the edge cases in #1-#3, many of which
have outstripped my ability to solve them with my current tools and
sanity, I'd be happy.
Fortunately, there's html5lib, which is
supposed to be as good at parsing HTML as a web browser.
My current plan is to write something that goes on top of html5lib
and gives the BS API to whatever DOM you've built. This would take
care of #1 and #3. It's not clear to me how you tell html5lib the
rules of your XML vocabulary; maybe it only parses valid XML. But BS
is relatively rarely used to parse invalid XML, and if I could
outsource all the HTML and Unicode crap to html5lib I'd be much more
inclined to hack randomly on BS. I think that's a fair trade.
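
A minimal sketch of what "the BS API on top of whatever DOM you've built" could look like. This is not the actual design: SoupView and its methods are hypothetical names, the API is a tiny subset of what Beautiful Soup offers, and a standard-library minidom tree stands in for the tree html5lib would build, so the markup here has to be well-formed.

```python
from xml.dom import minidom


class SoupView:
    """Hypothetical facade: a small BS-like API over an existing DOM node."""

    def __init__(self, node):
        self.node = node

    @property
    def name(self):
        # Only element nodes have a tag name; documents and text nodes do not.
        if self.node.nodeType == self.node.ELEMENT_NODE:
            return self.node.tagName
        return None

    def get(self, attr, default=None):
        # Attribute lookup with a default, in the style of a dict.
        node = self.node
        if node.nodeType == node.ELEMENT_NODE and node.hasAttribute(attr):
            return node.getAttribute(attr)
        return default

    @property
    def text(self):
        # Concatenate every text node underneath this node, in document order.
        parts = []

        def walk(n):
            if n.nodeType in (n.TEXT_NODE, n.CDATA_SECTION_NODE):
                parts.append(n.data)
            for child in n.childNodes:
                walk(child)

        walk(self.node)
        return "".join(parts)

    def find_all(self, name=None, attrs=None):
        # Depth-first search for descendant elements matching a tag name
        # and/or a dict of exact attribute values.
        attrs = attrs or {}
        found = []

        def matches(el):
            if name is not None and el.tagName != name:
                return False
            return all(
                el.hasAttribute(k) and el.getAttribute(k) == v
                for k, v in attrs.items()
            )

        def walk(n):
            for child in n.childNodes:
                if child.nodeType == child.ELEMENT_NODE:
                    if matches(child):
                        found.append(SoupView(child))
                    walk(child)

        walk(self.node)
        return found

    def find(self, name=None, attrs=None):
        # First match, or None when nothing matches.
        results = self.find_all(name, attrs)
        return results[0] if results else None


# The parser is the only part that would change: any library that hands
# back a DOM tree could be swapped in here without touching SoupView.
doc = minidom.parseString(
    '<html><body><p class="a">Hello <b>world</b></p><p>Bye</p></body></html>'
)
soup = SoupView(doc)
paragraphs = [p.text for p in soup.find_all("p")]
```

The point of the split is that all the parsing edge cases live in whatever produced the tree, and the facade only ever has to walk it.
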
html5lib already has a "beautifulsoup" tree builder, which creates
a tree of Beautiful Soup objects. So in theory I would just need to
maintain those objects? I'll find out soon enough.
Tue Dec 09 2008 23:13 Beautiful Soup Future:
I've got a chunk of time off at the end of the year, not having used
it earlier in the year. Among my other projects I'm going to redo
Beautiful Soup. This entry is an early spelling out of my rationales
and my plans.