Thu Feb 02 2012 11:59 easy_install beautifulsoup4:
This is an HTMLized version of an email I sent to the Beautiful Soup discussion group, about the impending release of Beautiful Soup 4.
Introduction
When Beautiful Soup was first released in 2004, the state of HTML
parsing in Python was appalling. Over the past eight years, things
have improved so dramatically that Beautiful Soup's HTML parser is no
longer a competitive advantage. I don't want to duplicate other
people's work, so I'm getting Beautiful Soup out of the parser
business. Beautiful Soup's job is now to provide a Pythonic
screen-scraping API on top of a data structure created by a
third-party parser.
This will be Beautiful Soup 4, and I've been planning it for
years. With help from Thomas Kluyver and Ezio Melotti, I've now met
the three main goals of Beautiful Soup 4:
- Make a single codebase that works under Python 2 and Python 3.
- Stop using SGMLParser (removed in Python 3) and make it possible to
swap out one parser for another.
- Support two major Python parsers (lxml and html5lib) as well as
Python's (not currently very good) batteries-included parser,
html.parser.
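To give a sense of what the batteries-included parser looks like when used on its own, here is a minimal sketch for Python 3, where the module is named html.parser. The LinkCollector class and the sample markup are made up for illustration; this is the raw event-driven interface, not Beautiful Soup's API.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for
        # attributes written without a value.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Sample markup, invented for this example.
markup = '<p>See <a href="http://example.com/one">one</a> and <a href="/two">two</a>.</p>'

collector = LinkCollector()
collector.feed(markup)
collector.close()
print(collector.links)
```

The parser only reports events such as "start tag seen"; building a navigable tree and a friendlier API on top of those events is the part left to a library like Beautiful Soup.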
The first version of BS4 is almost ready for release, and I'd like you
to test it out, if you haven't already. I still need to fix some things, in
particular some performance problems. But note that even with the
performance problems, BS4 is faster than BS3 across the board.
On Python 2 or Python 3 you can install the BS4 beta with this command:
easy_install beautifulsoup4
You can also get the source tarball.
The documentation has been completely rewritten. You may find the section on porting BS3 code to BS4 especially
interesting.
There are three major things I'd like your feedback on before
completing the release.
Hall of Fame
The BS3 documentation lists open-source projects that use Beautiful
Soup. I stopped maintaining this list many years ago because there are
hundreds of these projects, and since most of them are
screen-scrapers, they're pretty ephemeral.
I'd like to bring this feature back as a "hall of fame", featuring
applications of Beautiful Soup that grab a reader's attention. People
who used Beautiful Soup in a high-profile way or to tackle a big
issue. Projects that are interesting to hear about even if the
software doesn't work anymore, or uses an old version of Beautiful
Soup, or if Beautiful Soup was used internally and the public only saw
the results.
My bias is towards projects having to do with space, science,
journalism, politics and social justice. Here are some examples so you
know the kind of thing I'm thinking of:
- "Movable Type", a work of digital art on display in the lobby of the
New York Times building, uses Beautiful Soup to scrape New York Times
feeds.
- Alexander Harrowell uses Beautiful Soup to track the business
activities of an arms merchant.
- The Lawrence Journal-World used Beautiful Soup in 2006 and 2010 to
gather election results.
- The NOAA's Forecast Applications Branch uses Beautiful Soup in
TopoGrabber, a script for downloading "high resolution USGS datasets."
If you did anything of this sort, or know of someone who did, I'd
like to hear about it.
Do you prefer lxml or html5lib?
Right now, the parser ranking goes lxml, html5lib, html.parser. I like
lxml because it's incredibly fast and it can parse anything. But I'd
like to see what you think of the trees it generates. Would html5lib,
with its web-browser-like heuristics, be a better default?
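One way to picture the ranking: walk the list in order and use the first parser that is actually installed. This is only a sketch of the idea using the Python 3 standard library, not how Beautiful Soup implements it; PARSER_RANKING, is_installed and pick_parser are hypothetical names.

```python
import importlib.util

# The ranking described above, best choice first.
PARSER_RANKING = ["lxml", "html5lib", "html.parser"]


def is_installed(module_name):
    """Return True if module_name can be located without fully importing it."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised for dotted names whose parent package is missing.
        return False


def pick_parser(ranking=PARSER_RANKING):
    """Return the first parser name in the ranking that is installed."""
    for name in ranking:
        if is_installed(name):
            return name
    raise LookupError("none of the ranked parsers are installed")


print(pick_parser())
```

On a machine with neither lxml nor html5lib installed, pick_parser() falls through to html.parser, which ships with Python. Changing the default would amount to reordering the list.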
substitute_html_entities
BS3 had a number of overlapping and inconsistent ways of turning
HTML/XML entities into Unicode characters, and possibly turning
Microsoft smart quotes into HTML entities at the same time. In BS4,
all this stuff is gone. HTML and XML entities are *always* converted
into Unicode characters.
This is great but there's one problem: output. If you want to turn
those Unicode characters back into entities when outputting as a
string, you need to call soup.encode(substitute_html_entities=True),
which is a little clunky. I'm thinking of adding an
output_html_entities attribute that you can set on a soup or tag to
control whether this substitution happens. Do you like this idea?
I think I also need to ensure that characters like "&" and "<" are always converted to XML entities on output, even though this will hurt performance a bit.
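The round trip in question (entities become Unicode characters on the way in, and optionally become entities again on the way out) can be sketched with Python 3's standard library alone. This does not use Beautiful Soup, and substitute_entities is a hypothetical helper name, not part of any BS4 API.

```python
import html
from html.entities import codepoint2name


def substitute_entities(text):
    """Replace every character that has a named HTML entity with that entity."""
    pieces = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        pieces.append("&%s;" % name if name else ch)
    return "".join(pieces)


# Input side: entities are turned into Unicode characters.
parsed = html.unescape("caf&eacute; &amp; cr&egrave;me")
print(parsed)

# Output side: Unicode characters are turned back into entities.
print(substitute_entities(parsed))
```

Because codepoint2name also covers "&", "<" and ">", this sketch escapes markup-significant characters on output as a side effect. Checking every character this way is also the kind of per-character work that costs some speed.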
Conclusion
What you install with easy_install beautifulsoup4 is a beta
release. If I hear of a problem soon, there's still time to fix it,
even if it means a major change to the API. So please try it out and
give me feedback.
Tue Jan 31 2012 09:18 Constellation Games Author Commentary #10: "K.I.S.S.I.N.G.":
This is Dana Light's big chapter, and I'm having trouble writing
commentary because it's pretty self-contained. A problem is introduced
and Ariel solves it by the application of technology. If I hadn't been
writing a novel when I came up with Dana, this chapter would have
become a short story, maybe part of a sequel to "Mallory". It
would have been about the way evil psychologists use game mechanics
and the ELIZA effect to manipulate users into spending money, and the
way people get real pleasure from spending money on things designed to
manipulate them.
Although evil
psychology does show up in Constellation Games, I didn't
have as much space for it as I'd like. Instead this chapter shows
the first grown-up thing we see Ariel do. In a world in
which sub-human-level AI has suddenly become very common, Ariel
decides to empathize with it.
He doesn't anthropomorphize Dana. Dana doesn't pass the Turing test,
she isn't terribly smart or self-aware, but she's capable of happiness and she doesn't deserve to be
deliberately made unhappy by evil psychologists. This attitude is what
ultimately makes Ariel a hero, not just a POV character. The consequences of his decision to empathize will run through the entire book, and then overflow the book into "Dana no Chousen," and I still don't know when and whether Ariel does the right thing w/r/t Dana. But you gotta have empathy.
Apart from that, I don't have much to say. Here are a few miscellaneous notes:
- As you might expect, a lot of this stuff will come up again in "Dana no Chousen". But the callback you probably won't notice unless I point it out is that Dana loves popcorn.
- I enjoy many bits of this chapter but my favorite is Bai's big moment of lucidity, when he immediately detects and shoots down Ariel's Manic Pixie Dream Girl fantasy. (And you can bet that's gonna come up again.)
- I'm sure that G'go Investigation: When You Gotta Die
makes sense in cultural context. Like, imagine if the first thing
you learned about 21st-century Earth was Mario Kart: Double Dash.
- I really like the design of the chainable memory cylinders on the Simulates Hi-Def False Daylight. In the second draft, "[False Daylight] games were
distributed as a set of ROM chips, snapped onto standardized circuit
boards, and enclosed in a removable plastic case to be plugged into
the computer's game slot." This led to chips popping out, hiding in the carpet and stabbing people in the foot. That's a design in keeping with the generally poor quality of Ip Shkoy consumer goods, but it doesn't fit with the fact that the False Daylight is a clone of the Brain Embryo, so I switched to the much cooler chainable cylinders.
- Originally I transliterated Bai's "bro" as "bra". Everyone hated
this. I changed it to "brah". The hate did not abate. What is wrong
with you people? "Brah" is an accurate transliteration! It's so
accurate.
Tune in next week for action, intrigue, and romance between people at the same level of sentience. It's the only chapter when Ariel will say: "I just have a slight fear of being a tiny speck in the infinite cosmic void." But not the only chapter when he'll think that.
PS: Due to an error on my part, the chapter 9 Twitter feeds ran as part of chapter 8, and chapter 10's Twitter feeds ran last week. This really can't go on, because next week's feeds are tightly integrated with chapter 11. So except for a brief bit of bonus material I just wrote, there will be no Twitter stuff this week. Sorry about that!
Photo credits: Kevin Trotman and Peter Anderson.
Sat Jan 28 2012 09:58 Fruit to Fruit:
Time for another crummy.com Apples to Apples variant (previous editions), this one discovered last week by Pat.
On every green A2A card there's the name of the card, like "Handsome", but there are also three related words, like "attractive", "elegant", "fine". In Fruit to Fruit, you don't read the name of the card. You just read the related words. Sometimes the related words are so similar that you might as well be reading the name of the card, but usually something goes missing (such as the masculinity of "handsome"), leading to funnier red cards being put down. The name of the card is finally revealed during judging.
We had a great time with this and played it in conjunction with the Apples to Placebos variant, even though there were four players. You might think this overkill, but at this point A2A is more a social activity than a game. Anyway, it says right on the box "The game of hilarious comparisons!", so anything that makes the comparisons more hilarious is legit.
While seeing if anyone else had come up with this variant I discovered Apples to Trivial Pursuit, and the improv comedy variant. I also discovered that the game is patented, and that there is an entire patent classification system for "means... by which contests of skill or chance may be engaged in among two or more participants, where the result of such contests can be indicated according to definite rules."
Tue Jan 24 2012 09:06 CG Author Commentary #9: "Import System":
Last week and this week have some of my favorite Twitter bits (e.g.) because the CDBOEGOACC is finally available in English. Sunday night while working on Loaded Dice I realized that one of the reasons I really like playing around with the BoardGameGeek dataset is it's like a real-life CDBOEGOACC.
The flip side is this chapter doesn't have a lot of plot. But hopefully you're okay
with that because of all the fun mini-stories like the Sea Level game/food. It's supposed to
represent the design phases of a software project, where you're
throwing around a lot of ideas but not much is being produced.
Next week is a set piece, and after that the plot won't let up
until the cliffhanger that ends Part One. Before that happens, I need to get some solid exoludology in to bring in topics that are important later, like Sayable Spice and Ariel's unsuccessful attempts to translate it.
Before beginning the chapter 9 commentary, I want to get something off my chest about the first sighting of the Farang in chapter 1. In that chapter, Ariel compares their antennacles to the oral tentacles of a
"cerebrophage". In the second draft I just out and said "mind
flayer". My writing group said I should change it because readers
might not know what a mind flayer is. ("Did you mean: mind flower?") Taking their advice to heart, I
changed the reference to a made-up reference that nobody will get. Well,
at least we're all in the same boat now!
And here's chapter 9. Vent your egg sacs before reading this commentary:
- This chapter represents the absolute end of the abandoned first
draft. Beyond this point everything is from the second or third draft.
In a questionable move on my part, Ariel gets an Alien computer
before he meets any Alien characters, requiring that I introduce you
to the species with an infodump ("eight-foot monkey-lizards"). Don't
worry, in just a couple weeks, Alien characters will show up and run
off with the whole damn book.
- Speaking of infodumps, I want to do a little infodump of my
own, about the Ip Shkoy. The Ip Shkoy were an ancient civilization of
Aliens, much like the Roman Empire was an ancient civilization of
humans. "Ip Shkoy" is not the native-language name for the
Alien species. I tried to make this abundantly clear, but I've noticed
well-intentioned people calling the Alien species "the Ip Shkoy" or
ascribing to modern Aliens the (frequently pretty awful) opinions of
the Ip Shkoy. Which would be like Curic thinking that Ariel regularly
offers sacrifices to Jupiter Optimus Maximus.
Star Trek has conditioned us to see an ET species as having
a single homogeneous culture that never changes, and this sort of
confusion is why they do that on Star Trek. That said, I don't
think this is anyone's fault but my own. If I'd presented modern Alien
society in as much detail as I present the Ip Shkoy, the other
probably wouldn't crowd out the one. It doesn't help that certain
features are shared by both cultures, such as transitive pair bonding (aka polyamory).
- Recapture That Remarkable Taste, the Ip Shkoy remake of
Sayable Spice, is not to be confused with the new William
Gibson anthology, Distrust That Particular Flavor.
Charlene Siph is mentioned again, which gives me a good excuse to
talk about Alien names. The Aliens on the contact mission have all taken human first names, but their surnames are monosyllables which I usually generated by truncating creepy English words ("siphon", "somnolent") to four letters. The impression I want is of someone who's trying to be accommodating but doesn't quite have it down.
This is another detail imported from "Vanilla", one that I'm really
happy with, one that even becomes important to the plot in one
place. And if you like symbolism, check this out: "Ariel Blum" could be an Alien name.
- In the second draft I wrote a whole review of Proty's Big
Escape, but turns out it's a dumb idea to insert one game review
wholesale into another game review. Let me know if you want the
Proty review, and I'll make it the first CG Deleted Scene, even
though it's only five paragraphs long.
Be sure to tune in next Tuesday, when Dana will say, "This application will terminate due to suspected theft or circumvention."
Oh, and you might want to keep an eye on @Tetsuo_Milk.
Image credits: Flickr user krusty, Guillaume Piolle, and Flickr user CoffeeGeek.
Mon Jan 23 2012 11:43 To This Basic Game Hedgehogs Are Added:
I bought a cute game about hedgehogs, Der Igelwettkampf ("The hedgehog contest"), as a Christmas present for my niece. On Der Igelwettkampf's BoardGameGeek page I noticed that it was classified under the game family "Animals: Hedgehogs/Porcupines". I'd thought "Family" was for boring things like grouping together the endless versions of Ticket to Ride, but turns out it's also used to group together all the games about hedgehogs.
The question then arises: what's the best game about hedgehogs? According to BGG it's Igel Ärgern + Tante Tarantel, a double bill in which Tante Tarantel might be doing some of that work because Igel Ärgern on its own is rated a bit lower.
More importantly, what's the worst hedgehog game? Indubitably it's Hedgehog's Revenge, "The GAME where the hedgehog STRIKES BACK!", whose BGG description includes the now-hopefully-immortal saying "To this basic game hedgehogs are added."
At this point I was on a roll... of the dice! I went back to my now-old BGG data dump, sorted the board game families by how many games they contained, and picked out interesting groupings for use in Loaded Dice. We've got Games about animals (most popular: dogs), Game versions of sports (soccer), and Games about countries (the Roman Empire, in a landslide). That page shows the top-rated game and the lowest-rated game, so get ready to load a lot of cover images.
I did a couple other lists, like media tie-ins (champion: Disney) and "families" that are strongly tied to one single game (the 889-strong "Monopoly" family), but I think the three lists I put up are the most interesting.
Bizarre trivia abounds! Did you know that crows are board game gold? The worst game about crows (The Crow and the Pitcher) has a BGG rating of 6.32, which isn't that bad at all. (Longtime fans will remember the median rating is 6.0.)
Did you know there are twenty rodeo-themed games? Apparently you didn't, since only one of those games has more than five ratings. How many wargames take place in Switzerland, a country that doesn't fight wars? Only two: Switzerland must be Swallowed and Zürich 1799.
My data is six months old now and it's starting to show some cracks. There are BGG families for Russia and Antarctica which were created after I took my dataset, so they don't show up in the country list even though most of their games are in my data. After getting the Switzerland idea I ran the "What percentage of a country's games are wargames?" test on all countries, but wargames were drastically undercounted. For instance, all but one "Vietnam" game on BGG is a wargame (the exception being Venture Vietnam), but only 35% of those games were classified under a general "Wargames" category.
But the lists are still a lot of fun and there are some interesting games in there. I'll leave you with the board game equivalent of the dusty World Book Encyclopedia sitting on the shelf at your grandparents' house: Trivial Pursuit - The Year in Review - Questions about 1992, the worst-rated game (3.90) in the 155-strong Trivial Pursuit family. Also available in 1993 flavor!