News You Can Bruise for 2011 February

Fri Feb 04 2011 08:59: I've reached the end of the first volume of Mark Twain's autobiography, although I started at page 200 (this was recommended in the introduction). I had a pencil handy while I read the book, with which to mark up all the funniest and most interesting passages, and I thought I'd share a few with you.

Earlier I called this book "an advertisement for e-books", and although I was referring to its physical dimension I was more truthful than I knew, because more or less the second half of this book is scholarly notes. Even the notes are pretty interesting, but it would have been nice to have some kind of hyper-text mechanism to keep them at hand during the reading.

A few sample interesting notes:

The [Second Iowa] Regiment had been disgraced by general order for having failed to prevent vandals from stealing taxidermic specimens from McDowell College in St. Louis, which was being used as a prison.

In 1910 in "The Turning Point of My Life", Clemens recalled that Herndon told "an astonishing tale" about the "miraculous powers" of coca, instilling in him "a longing to ascend the Amazon" and "open up a trade in coca with all the world."

I'd like to buy the world some coke?

Ok, on with the fun. Quotes from the autobiography itself:

My parents removed to Missouri in the early thirties; I do not remember just when, for I was not born then, and cared nothing for such things. It was a long journey in those days, and must have been a rough and tiresome one. The home was made in the wee village of Florida, in Monroe County, and I was born there in 1835. The village contained a hundred people and I increased the population by 1 per cent. It is more than the best man in history ever did for any other town. It may not be modest in me to refer to this, but it is true. There is no record of a person doing as much—not even Shakspeare. But I did it for Florida, and it shows that I could have done it for any place—even London, I suppose.

On unconscious plagiarism: "all our phrasings are spiritualized shadows cast multitudinously from our readings."

"William Swinton was a brilliant creature, highly educated, accomplished. He was such a contrast to me that I did not know which of us most to admire, because both ends of a contrast are equally delightful to me."

"I conceived the idea of a magazine to be called The Back Number, and to contain nothing but ancient news; narratives culled from mouldy old newspapers and mouldy old books; narratives set down by eye-witnesses at the time that the episodes treated of happened."

The account of his duel in "About Dueling" is great. "I woke up Mr. Laird with some courtesies of the kind that were fashionable among newspaper editors in that region."

I also loved this section about James G. Blaine, the Continental liar from the state of Maine.

On election day we went to the polls and consummated our hellish design. At that time the voting was public. Any spectator could see how a man was voting—and straightaway this crime was known to the whole community.

I feel that this ties somehow into the "Horrid Tragedy In Private Life" saga:

Susy [Twain's daughter] and her nearest neighbor, Margaret Warner, often devised tragedies and played them in the schoolroom, with little Jean’s help—with closed doors—no admission to anybody. The chief characters were always a couple of queens, with a quarrel in stock—historical when possible, but a quarrel anyway, even if it had to be a work of the imagination. Jean always had one function—only one. She sat at a little table about a foot high and drafted death-warrants for these queens to sign. In the course of time, they completely wore out Elizabeth and Mary Queen of Scots—also all of Mrs. Clemens’s gowns that they could get hold of—for nothing charmed these monarchs like having four or five feet of gown dragging on the floor behind.

I think that's enough for now.

Wed Feb 09 2011 10:36 Disturbing Search Requests: (rot13ed to avoid giving anyone else the same idea.)

evpuneqfba gurfvf erfg jro freivpr

No, but thanks for playing!

Thu Feb 10 2011 08:38 Let's Check In With Roy's Postcards: "I think Matthew Barney made a movie with this plot." (Update: here's the front of that postcard, in case you wanted to see what it looked like.)

Fri Feb 11 2011 08:57 The Last Workshop on Theoretical Physics in the Soviet Union: Can Beatriz Gato-Rivera's paper "The Last Workshop on Theoretical Physics in the Soviet Union" live up to its awesome title? On the whole, I think it doesn't (relatively little of the narrative takes place at said workshop), but it's worth reading for the many good bits, some of which I'll extract for you:

I was allowed to keep the original key of the back door of Einstein’s house, that I saved in its way to the garbage truck... In the garden I noticed that the old door had been replaced but was still there lying on the wall with the key inside, so.....Back in Boston, Cumrun Vafa (my extra-official supervisor) was not amused when I showed him the Einstein’s key: 'Look, what you have done is, precisely, what Einstein didn’t want people to do!'

The last day of the workshop I was supposed to give a talk. Then someone told me: 'we are sorry, you cannot give your talk because the mathematicians have finished their workshop ahead of schedule and they have brought the blackboard along, we borrowed it from them'.

We took a night train and we were hosted by a female friend of his mother: the renowed mathematician Olga Ladyzhenskaya who was the leader of the 'Leningrad School of Partial Differential Equations'. When we arrived to her flat Olga received us with very low voice saying: 'Look, my niece is here in the living room. She works for the government and she is not allowed to have any contact with foreign people. So, please, rush through the living room, enter the corridor, and take the two rooms at the end'. So, we followed the instructions and ran stealthily while the niece was looking through the window giving us the back.

Natalia looked at me and said: 'You know, you are the second foreign person to enter in this flat. The first was Niels Bohr'. The surprise was enormous for me: I was the second after Niels Bohr in something!

Fri Feb 11 2011 09:20 Beautiful Soup 4 Status Report: Yesterday I ported some more tests and added basic doctype handling to the parser plugins. The work is slowing down a little because I'm porting tests where html5lib and lxml handle the same markup differently, such as incorrectly nested tables. I'm not going to find and test every such difference, but I want to have all the old tests working, and it'll give you an idea of what the differences are in common situations.

In BS3 you could choose to convert incoming entities into Unicode characters, or to leave them alone. You could also choose to convert Microsoft smart quotes into Unicode characters, XML entities, or HTML entities. In BS4 this will depend on the parser. Both lxml and html5lib convert everything to Unicode. I think this makes more sense--convert absolutely everything to Unicode, use Unicode internally, and optionally convert back to entities when writing the document out. (I'll probably have to write the "convert back to entities" part.)
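Here's a minimal sketch of that round trip, using only Python's standard library rather than anything in BS4 itself. The function names entities_to_unicode and unicode_to_entities are made up for illustration, and writing numeric character references is just one assumed way of doing the "convert back to entities" step.

```python
import html


def entities_to_unicode(markup_text):
    """Turn named and numeric entities into the Unicode characters they stand for."""
    return html.unescape(markup_text)


def unicode_to_entities(text):
    """Write text out as ASCII, using numeric character references for the rest.

    This only handles non-ASCII characters. It does not escape &, < or >.
    """
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


incoming = "caf&eacute; &ldquo;hi&rdquo;"
internal = entities_to_unicode(incoming)   # Unicode everywhere internally
outgoing = unicode_to_entities(internal)   # entities again, only at output time

print(internal)
print(outgoing)
```

Note that the output uses numeric references (&#233;) where the input used named ones (&eacute;). Getting the named entities back would need a lookup table such as html.entities.codepoint2name.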

Sun Feb 13 2011 20:18: I felt listless today, so I did some Beautiful Soup work so that I wouldn't have wasted the day. I fixed the handling of CDATA sections and doctype declarations.

Fun fact: it doesn't seem to be legal to stick a CDATA section into an HTML document. (By this I mean something like "<![CDATA[foo]]>", not the contents of a <pre> tag.) My knowledge of weird HTML constructs like CDATA comes mostly from studying Python's SGMLParser, which handles CDATA sections just fine since it's an SGML parser. So I had BS3 just create objects for CDATA sections, even when they occurred in HTML documents. But the two parsers I'm using as my testbeds for BS4 basically ignore CDATA sections in HTML documents:

By default, lxml's elementtree implementation replaces CDATA sections with the actual character data, and has an option to leave the CDATA sections alone, but this only works for XML. When parsing HTML, CDATA sections are ignored altogether. The HTMLParser constructor has a "strip_cdata" argument, inherited from XMLParser, but setting it to False does nothing.

BS4 can't be used to parse XML yet (unless you want to parse it by HTML rules), but once I add that, I'll have the lxml elementtree builder preserve CDATA sections.

html5lib treats CDATA sections as broken comments, so "<![CDATA[foo]]>" becomes "<!--[CDATA[foo]]-->". The latest version of html5lib will replace a CDATA section with the character data if the CDATA section happens within a <svg> or <math> tag (see test), but this is not in any released version.

That took me the morning to figure out, so I hope it saves someone some time. But that person would have to bear a suspicious resemblance to me.

Fri Feb 18 2011 16:22 Beautiful Soup 4 Status Report: What an exciting weblog I run. Actually this update is pretty cool. I've ported all the non-XML tests for BS4, which means you should now be able to use the code for all HTML processing purposes. If you want to try it, note that the module is now called 'beautifulsoup', not 'BeautifulSoup', so the import is "from beautifulsoup import BeautifulSoup". I may rename it to bs4 just because I've been typing "from BeautifulSoup import BeautifulSoup" for the past six years and I'm tired of it.

I also decided this would be a good time to run a performance test. Here's a moderately sized document:

Document is 66409 bytes
BS4 lxml time: 0.06
BS4 html5lib time: 0.25
BS3 time: 0.15

("BS3" here is the latest released version, 3.2.0.)

Pretty good! And here's a huge complicated document:

Document is 1329825 bytes
BS4 lxml time: 12.60
BS4 html5lib time: 2.88
BS3 time: 14.11

Okay, that's kind of random. The problem is in Unicode, Dammit. It takes a long time to figure out the encoding for this particular page. This is ultimately because the document is in ISO-8859-2, but it includes a <meta> tag that claims the document is in UTF-8. I don't yet understand the problem on any deeper level than that. If it's gonna be like that, I may just stop believing anything I see in a <meta> tag.

If you specify the encoding up front, the lxml time drops to 0.90 seconds. The html5lib tree builder doesn't have the problem because it uses html5lib's native Unicode conversion functionality instead of Unicode, Dammit.

Incidentally, BS3 has the same problem. Specify the encoding up front, and BS3 takes about 2 seconds on this page, which makes sense--faster than html5lib but slower than lxml. I find it very annoying that I'm only discovering this problem now—I think this has wasted a lot of cumulative time over the past couple years.
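To make the mismatch concrete, here's a small sketch of a page encoded in ISO-8859-2 whose <meta> tag claims UTF-8. This is not how Unicode, Dammit or chardet works. It's an assumed "try the declared encoding, then a short fallback list" strategy, and decode_markup and its fallback list are hypothetical names chosen for the example.

```python
def decode_markup(data, declared=None, fallbacks=("utf-8", "iso-8859-2", "windows-1252")):
    """Return (text, encoding_used), trying the declared encoding first.

    Each candidate is tried strictly, so an encoding that cannot represent
    the bytes is rejected instead of silently producing garbage.
    """
    candidates = ([declared] if declared else []) + list(fallbacks)
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding, try the next one
    # Last resort: keep going, marking undecodable bytes with U+FFFD.
    return data.decode("utf-8", "replace"), "utf-8"


# The bytes are really ISO-8859-2, whatever the <meta> tag says.
page = '<meta charset="utf-8"><p>Łódź</p>'.encode("iso-8859-2")

text, used = decode_markup(page, declared="utf-8")
print(used)
print(text)
```

Strict decoding fails fast here because the ISO-8859-2 bytes for "Łódź" aren't valid UTF-8. The catch is that single-byte encodings like ISO-8859-2 accept any byte sequence, so a fallback list like this can only guess, which is presumably why statistical detection gets involved at all.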

Anyway, now is a good time to start trying out BS4, if you're a fan of new things. I renamed all the major methods to be PEP-8 compliant--details are in the CHANGELOG.

Update: Profiler shows the bottleneck is in the chardet library, sbchardetprober.py, which goes through the file character by character and crunches some numbers for each character. If it can't make a decision until late in a huge file, there's your twelve seconds. So... I have a couple ideas, but it's not a bug in my code that I can just fix. But html5lib uses chardet, so it must be doable.
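For anyone who wants to hunt down a bottleneck like this, here's a rough sketch using the standard library's cProfile and pstats modules. The functions per_character_score and parse are hypothetical stand-ins for a detector that does a little arithmetic on every byte; they aren't chardet's code.

```python
import cProfile
import io
import pstats


def per_character_score(data):
    """Hypothetical stand-in for a prober that crunches numbers per byte."""
    total = 0
    for byte in data:
        total += byte % 7
    return total


def parse(data):
    """Hypothetical parse step that has to guess the encoding first."""
    return per_character_score(data)


def profile_report(func, *args, limit=5):
    """Run func under the profiler and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(limit)
    return out.getvalue()


report = profile_report(parse, b"x" * 200_000)
print(report)
```

Sorting by cumulative time puts the outermost slow call at the top, and the per-byte loop shows up right under it.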

Sat Feb 19 2011 22:01 Black Planning: I'm doing more work on Beautiful Soup, but I'll spare you the details and share with you one of my favorite Ken MacLeod quotes. This is from The Stone Canal, and it came up in brunchtime conversation with Evan:

"Because your Yank dingbat libertarian pals are right—the Western democracies are socialist! Big public sectors, big companies that plan production while officially everything's on the market... sort of black planning, like the East had a black market."

Sun Feb 20 2011 20:16: Got XML parsing working in Beautiful Soup 4, and then added a feature I've been wanting to add for a while. Instead of separate BeautifulSoup and BeautifulStoneSoup classes[0], in BS4 there's just BeautifulSoup. To get a tree-builder that's optimized for XML, you write BeautifulSoup(markup, "xml"). HTML is the default, but if you want to make it explicit, you write BeautifulSoup(markup, "html").

But this is just the tip of a general feature. "html" and "xml" are just strings, features for which a tree-builder might or might not advertise support. The tree builders also publish other features, like "fast", "permissive", "html5", and library names like "lxml". So you can make semi-fine distinctions:

BeautifulSoup(markup, ["html", "fast"])
BeautifulSoup(markup, ["html", "permissive"])
BeautifulSoup(markup, ["html", "lxml"])

The BS constructor will try to find the best tree-builder that matches all the features you specify, and will raise an exception if it can't match them all (because you don't have lxml installed or something).

This is overkill right now because there are only three tree-builders (["lxml", "xml"], ["lxml", "html"], and ["html5lib"]). But this gives me an easy way to add tree-builders to the code base, and for you to plug in additional builders, without making end-users learn where the classes are.
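The lookup described above can be sketched in a few lines. This is an illustration and not BS4's actual code: BuilderRegistry, FeatureNotFound, and the builder names are hypothetical, and which builder advertises "fast" or "permissive" is my assumption for the example.

```python
class FeatureNotFound(Exception):
    """Raised when no registered tree-builder offers every requested feature."""


class BuilderRegistry:
    def __init__(self):
        # (name, features) pairs, kept in priority order: first match wins.
        self._builders = []

    def register(self, name, features):
        self._builders.append((name, frozenset(features)))

    def lookup(self, features):
        """Return the first builder that advertises all the requested features."""
        if isinstance(features, str):
            features = [features]  # allow lookup("xml") as well as a list
        wanted = frozenset(features)
        for name, offered in self._builders:
            if wanted <= offered:
                return name
        raise FeatureNotFound("no tree-builder offers all of: %s" % sorted(wanted))


registry = BuilderRegistry()
# Feature assignments below are assumptions made for this sketch.
registry.register("lxml-xml", ["lxml", "xml", "fast"])
registry.register("lxml-html", ["lxml", "html", "fast"])
registry.register("html5lib", ["html5lib", "html", "permissive", "html5"])

print(registry.lookup(["html", "fast"]))
print(registry.lookup(["html", "permissive"]))
print(registry.lookup("xml"))
```

Because the registry is an ordered list, "best" here just means "registered first". A real implementation would also skip registering a builder whose library isn't installed, which is how a request for "lxml" could end in an exception.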

This is looking good enough that I can do an alpha release soon. I'm not sure why I've been putting so much work into BS, but I'm sure it has something to do with the fact that my other projects are stalled, blocked, or things I want to procrastinate on.

[0] Those little classes like ICantBelieveItsBeautifulSoup are also gone, because distinguishing between different techniques for parsing markup is now the parser's job. And those classes were kind of silly to begin with.

Wed Feb 23 2011 18:13 Looking For Work: That's right. I'm on the market. I may just move to a different position within Canonical, but I'm taking the opportunity to talk to people at other companies. If you're reading this and want to hire/work with the person who brought the world Beautiful Soup and RESTful Web Services (the book, not the concept), send me email at leonardr@segfault.org.

Here's my resume. Ideally I'd like to do a combination of coding and writing, but tell me what you've got.

Fri Feb 25 2011 00:12 The Board Game Remix Kit: I imported the book of The Board Game Remix Kit from the UK. I probably should have bought the PDF instead because it's a really small book that, pound for pound, was rather expensive to import. But, what I didn't realize until I bought the book is that it's by Holly and Kevan! So, buy that sucker. It's a lot of rules for new games you can play with the pieces from other games that you're tired of. The one into which the most care was put is a game that turns Clue[do], the game with the most irrelevant pieces ever, into a tactical wargame where you fight zombies.

There's also (for instance) a game played with Monopoly title cards that's suspiciously like our Man Bites Dog game remix. The whole thing makes me glad Sumana accepted that Trivial Pursuit game from Beth when Beth was moving. ("Dadaist Pursuit: ...every other player turns over their top card and selects the funniest answer from those printed on it...")