News You Can Bruise for 2015 February

Mon Feb 02 2015 09:07 January Film Roundup: January started with three highly anticipated films that all turned out to be duds! What to do for the rest of the month, but stack the deck?

The Strange Little Cat (2013) - a.k.a. "Das merkwürdige Kätzchen". The museum handout said that the sound design really shines in this movie, and maybe if we'd read the handout ahead of time we'd have focused on that and marvelled. But how groundbreaking can sound design be in a slice-of-life film that takes place in a normal house? I admit I don't understand the technical details, but I hear sounds every day, and the sounds in the movie were what I'd expect from a dull movie where someone fixes a washing machine. Is this the curse of the filmmaker? To have recreated the normal soundscape of life so precisely that philistines don't even realize anything special is going on? Anyway, not recommended unless you're a sound engineer and want to explain to me what the deal is here.
Hard to be a God (2013) - a.k.a. "Trudno byt bogom". Okay, look. I love the Strugatsky brothers. I know they're not the cheeriest science fiction writers. Judging from the plot summary Hard to be a God is not their cheeriest book. (As far as my reading goes, their cheeriest book is in fact Monday Begins on Saturday.) I don't mind seeing the occasional Russian sci-fi movie that's nearly three hours long. But I don't know what I did to deserve this film. My only clue, once again, comes from the museum handout. Director Aleksei German said that "[f]ilm has turned into something for people who are bored to read the book," so I guess this film is my punishment for not reading the book.
When I try to describe Hard to be a God I come up with words like "shitshow" and "grueling" which also describe the movie literally--there's a lot of shit in this movie and a fair amount of gruel. And there's pretty much no science fiction element. When there is science fiction on the screen, the film is grim but inventive and bearable. The image of a Will Riker-type medieval baron training his serf to accompany his jazz clarinet riff on a crappy medieval tuba. The woman who wants to have a baby by a demigod, but before they have sex she has to hang up the big religious statue of the demigod she inherited from her mother, and then the statue breaks in half while they're having sex and conks them on the head. It's creative stuff. But most of the film is like the first scene of Monty Python and the Holy Grail, except it never ends and King Arthur has to also be Dennis the Peasant.
As a viewer guide, to help you decide if you want to see this movie I'm gonna rank the top four bodily excreta featured:
1. shit
2. blood
3. phlegm
4. piss
Honorable mention to the technically ineligible but omnipresent "mud". If you like Game of Thrones but think it's not yucky enough to be real medieval, you might like this movie. I will admit there is one hilarious buckets-of-blood sight gag, but you could probably say the same for Cannibal Holocaust. There's someone who will read this review and think this movie sounds great and what's my problem, and if you're that person, I think I can guarantee you will like this movie. I'm laying it all out there! Everyone else, read the book, I guess? I'm interested in reading it just to see what exactly happened in this adaptation.
In a weird twist, many of the characters seem aware of the camera, or the audience, but nothing really comes of this. A Russian on IMDB says that "the main character has a camera on his forehead, that is transmitting back to Earth", but that detail is not in the English subtitles and I don't think it makes sense--who is this supposed "character" and why do they fill exactly the same filmic role usually filled by a non-diegetic camera? And would people who will never see anything displayed on a screen know how to engage with a camera? I don't know.
Inherent Vice (2014) - I'm glad they gave it a shot, I think they did as good a job as possible, Josh Brolin and Benicio del Toro are really good in this... but as the great Russian director Aleksei German has noted, "[f]ilm has turned into something for people who are bored to read the book." And in particular you can't act out what happens in a Pynchon book and call it a movie.
It's an especially bad deal when the film ends up very similar to The Big Lebowski, which not only superficially resembles Inherent Vice but which I've argued translates Pynchon's primary stylistic innovation to film. "[E]ach of his characters is surrounded by a protective bubble of literary genre, which colors the way the narrative is reported and even shapes the plot." It's not too difficult to pull this off when you have multiple-POV, but it's really really tricky when you have an omniscient narrator. That's why The Big Lebowski starts with a narrator who quickly discovers that he's a lousy narrator, and gives up and becomes a normal character.
The narrator of Inherent Vice the movie is also a character in the movie, but she never stops being the godlike omniscient narrator, even showing up hallucination-like in scenes she's not really in. The presence of this strong narrator stops the protective bubbles from forming. Doc Sportello is supposed to focus the classic Pynchon conspiracy through the lens of noir (private eye) and Illuminatus! (hippie pothead), revealing the Golden Fang and the 1970s in general as a grand conspiracy of the square against the hip. It shows up in the film if you know to look for it, but it's super confusing because the dominant voice of the film--the narrator, who again is a specific person in the film--isn't involved in this plotline at all.
The Pynchonness is more visible in Josh Brolin's Bigfoot Bjornson, the cop who thinks he's on a cop show, who actually picks up extra roles in cop shows to preserve this fantasy even as his real-life career stalls. That's what I want to see. My point is that The Big Lebowski is not just a better film, it's a better Pynchon adaptation, because it lets the bubbles form.
What to do? You could film different parts of the movie in different styles, but because of that dang narrator it would never be clear why one bit was filmed in one style versus another. Sumana suggested animation, which could work--the different characters could be drawn in slightly different styles.
Sullivan's Travels (1941) - After those three I had to throw in a ringer. Sumana saw Sullivan's Travels in college and liked it, and it's a movie the Coen brothers ripped off rather than the other way around, so I borrowed it from the library that's conveniently across the street from the library where I work. Oh man, it's great! It's got a really unusual plot structure. I was having a good time for the whole movie, but the third act kicked it up to such a higher level—comedically and politically and emotionally—that I started feeling bad for even liking the goofy butlers and movie producers in the first act. I think Down By Law may have also been ripping this movie off. And why not rip it off? It's funny, it's inventive, and it makes a dark-comedic argument for the value of light comedy.
I thought it was weird that the poster for this movie says "Veronica Lake's On The Take". How is that any way to advertise a movie, accusing your actors of corruption? That statement also has no justification within the movie. Maybe "on the take" meant something different back then.
Sweet Smell of Success (1957) - A really unusual, or maybe just uncommon, sort of noir in that it deals with the relationship between the upper crust and the pseudo-riche strivers. Instead of heists and gunplay it's all cutting words and breaking up engagements. Very All About Eve. Burt Lancaster and Tony Curtis are great. The characters who only show up in one scene are great. The secondary cast is kinda meh. Great movie though, a cut above average popcorn noir.
The Italian Job (1969): More shallow fun in the form of light comedy. Breaks the rule of heist movies that if you explain how the heist is going to go down, you need to introduce complications during the heist. The ending is obviously fishing for a sequel, but since there was no sequel I'm quite happy with the ending. Not a fan of the way the movie blatantly shuffles characters offstage once they've played their part in the heist.
I was not expecting Benny Hill as the super hacker. Michael Caine was well-cast—you gotta play Bruce Wayne before you can play Alfred—but his character's a pretty bad heist manager and I'm glad the no-sequel ending gave him his comeuppance in a lighthearted way.
The Godfather, Part II (1974): Bigger, badder, but not better than the original. It's a good movie, it kept my interest despite being long as hell, but at the end the obsession with mirroring the first movie kinda unleashed the Arrested Development farce that underlies the somber seriousness of the Godfather universe. The bit where Connie convinces Michael to forgive Fredo and give him a big hug really needed a Ron Howard "And that's when Michael realized..."
IMDB trivia:
Danny Aiello said that his line "Michael Corleone says hello" was completely ad-libbed. Francis Ford Coppola loved it and asked him to do it again in the retakes.

... ... ...doesn't that ad-lib completely change the main plotline of the movie? Oh well!

Wed Feb 04 2015 20:45 The Ghost of Ghostbusters Past: Just a quick semi-technical post on how I made @WeBustedGhosts, my new bot that casts movies from an alternate history where "ghostbusters" is a stock comedy genre, sort of a twentieth-century commedia dell'arte. In particular, I did a lot of work with IMDB data that I want to record for your benefit (and by you, I mean future me).

The bot was inspired by two things: first, this video by Ivan Guerrero which "premakes" Ghostbusters as a 1954 comedy starring Bob Hope, Fred MacMurray, and Martin/Lewis. Second, the reaction of fools to the fact that women comedians will bust ghosts in the upcoming Ghostbusters remake. More specifically, Kris's endless mockery of the idea that "ghostbuster" is a job with a legitimate gender qualification.

These things got me thinking about the minimal set of things you need to make Ghostbusters. You need the idea of combining a horror movie with a comedy about starting a business. Someone could have come up with that idea in the silent film era. You need a director and four actors who can do comedy. And all those people need to be alive and working at the same time, because ghosts aren't real... OR ARE THEY? Either way, you can describe a point in Ghostbusters space with six pieces of information: four actors, a director, and a year. That's small enough to fit into a tweet, so I made a Twitter bot.

Our journey to botdom starts, as you might expect, with an IMDB data dump. I've dealt with IMDB data before and this time I was excited to learn about IMDbPY, which promised to get a handle on the ancient and not-terribly-consistent flat-file IMDB data format. Unfortunately IMDbPY is designed for looking up facts about specific movies, not for reasoning over the set of all movies. However, it does have a great script called imdbpy2sql.py, which will take the flat-file format and turn it into a SQL database.

There will be SQL in this discussion (because I want to show you/future me how to do semi-complex stuff with the database created by IMDbPY), but unless you're future me, you can skip it. Basically, for each actor in IMDB, I need to calculate that actor's tendency to get high billing in popular comedies for a given year. They don't have to be good comedies, or Ghostbusters-like comedies, they just have to have a lot of IMDB ratings.

I also want to figure out each actor's effective comedy lifespan. If an actor stops doing popular comedy or dies or retires, they should stop showing up in the dataset. If a dramatic actor branches out into comedy they should show up in the dataset as of their first comedic performance. Basically, if you learned that this actor starred in a comedy that came out in a certain year, it shouldn't be a big surprise.

Orson Welles would be great in a Ghostbusters movie, but he never did comedy, so he's not in the dataset. How about... Cameron Diaz? She rarely gets top billing, but she has second or third billing in a lot of very popular comedies. For a year like 1997 she tops the list of potential women Ghostbusters.

How about... Peter Falk? His first comedy role was in 1961's Pocket Full of Miracles, his last in 2005's Checking Out. His acting career stretches from 1957 to 2009, but he's only a potential Ghostbuster between 1961 and 2005. He won't get chosen very often, because he's not primarily known for comedy (i.e. his comedies aren't as popular as other people's), but it will happen occasionally.

That's the data I extracted. Not "how famous is this actor" but "how much would you expect this actor to be in a comedy in a given year".
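The "comedy lifespan" idea is easy to sketch in Python. This is a minimal illustration, not the bot's actual code: the (actor, year) input shape and the helper names comedy_lifespans and eligible are made up for the example, and only the 1961 and 2005 Peter Falk years come from the text above.

```python
def comedy_lifespans(roles):
    """Map each actor to (first comedy year, last comedy year).

    `roles` is an iterable of (actor, year) pairs, one per comedy role.
    """
    spans = {}
    for actor, year in roles:
        first, last = spans.get(actor, (year, year))
        spans[actor] = (min(first, year), max(last, year))
    return spans


def eligible(spans, actor, year):
    """True if `year` falls inside the actor's comedy lifespan."""
    if actor not in spans:
        return False  # never did comedy, so never a potential Ghostbuster
    first, last = spans[actor]
    return first <= year <= last


# Hypothetical input rows; the middle year is only there to show that
# roles between the endpoints don't change the lifespan.
roles = [("Falk, Peter", 1961), ("Falk, Peter", 1976), ("Falk, Peter", 2005)]
spans = comedy_lifespans(roles)
```

With this, eligible(spans, "Falk, Peter", 1984) is true and eligible(spans, "Falk, Peter", 1957) is false, even though 1957 is inside his overall acting career.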

The IMDbPY database is more complicated than I like to deal with, so my strategy was to use SQL to get a big table of roles and then process it with Python. Here's SQL to get every major role in a comedy that has more than 1000 votes on IMDB:

select title.title, title.production_year, movie_info_idx.info, name.name, name.gender, cast_info.nr_order, kind_id from title join cast_info on title.id=cast_info.movie_id join name on cast_info.person_id=name.id join movie_info_idx on movie_info_idx.movie_id=title.id join movie_info on movie_info.movie_id=title.id where cast_info.role_id in (1,2) and kind_id in (1,3,4) and movie_info.info_type_id=3 and movie_info.info='Comedy' and cast(movie_info_idx.info as integer) > 1000 and movie_info_idx.info_type_id=100 and cast_info.nr_order <= 7;

Some explanation of numbers and IDs:

movie_info_idx.info_type_id=100 means the join against the movie_info_idx table is looking up the number of votes (id #100 in my info_type table).
cast(movie_info_idx.info as integer) > 1000 means that the number of votes has to be more than 1000.
cast_info.role_id in (1,2) means I'm only considering "actor" and "actress" roles (IDs 1 and 2 in my role_type table). I'm not considering directors, writers, etc.
movie_info.info_type_id=3 means that I'm looking up the genre of the movie ("genre" is ID 3 in my info_type table). Then I use movie_info.info='Comedy' to restrict to 'Comedy'.
kind_id in (1,3,4) means I'm only considering "movie", "tv movie" and "video movie" (items 1, 3, and 4 in my kind_type table). I'm not considering television, video games, etc.
cast_info.nr_order <= 7 means I'm only considering the top seven billed actors for each movie.

I run this on a SQLite database and the output looks like:

#1 Cheerleader Camp|2010|2297|Cassell, Seth|m|2|4
...

So the title of the movie is "#1 Cheerleader Camp", it came out in 2010, it has 2297 votes, and Seth Cassell (a man) was an actor in that movie and got second billing. The final 4 is the kind_id, which marks this as a "video movie".
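For the Python half of the job, each output line has to be split back into fields. Here's a rough sketch of how that could look; Role and parse_role are names invented for this example, and the field order simply follows the SELECT list of the query above.

```python
from collections import namedtuple

# Field order follows the SELECT list of the film query.
Role = namedtuple("Role", "title year votes name gender nr_order kind_id")


def parse_role(line):
    """Split one pipe-delimited sqlite3 output line into a Role."""
    # Split from the right so a "|" inside a title can't shift the columns.
    title, year, votes, name, gender, nr_order, kind_id = (
        line.rstrip("\n").rsplit("|", 6)
    )
    return Role(
        title,
        int(year) if year else None,  # production_year can be empty
        int(votes),
        name,
        gender,
        int(nr_order),
        int(kind_id),
    )


role = parse_role("#1 Cheerleader Camp|2010|2297|Cassell, Seth|m|2|4")
```

After this, role.votes is 2297, role.nr_order is 2 and role.kind_id is 4.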

Why didn't I include television in this query? Because television on IMDB is really complicated. See, actors aren't credited to television shows; they're credited to individual episodes. But nobody rates individual episodes; they rate the show as a whole. So I had to do a separate query to determine who the top actors were on each comedy television show, and then divide up that show's votes between the four top actors. Otherwise actors whose primary comedy career is in television won't get their due.

Here's SQL to get all the roles in TV episodes:

select tv_show.title, episode.title, episode.production_year, votes.info, name.name, name.gender, cast_info.nr_order from title as tv_show join title as episode on tv_show.id=episode.episode_of_id join cast_info on episode.id=cast_info.movie_id join name on cast_info.person_id=name.id join movie_info_idx as votes on votes.movie_id=tv_show.id join movie_info on movie_info.movie_id=tv_show.id where cast_info.role_id in (1,2) and tv_show.kind_id in (2,5) and episode.kind_id=7 and movie_info.info_type_id=3 and movie_info.info='Comedy' and cast(votes.info as integer) > 10000 and votes.info_type_id=100 and cast_info.nr_order < 5;

This is pretty similar to the last query but some of the IDs are different.

tv_show.kind_id in (2,5) means the show is a "tv series" or a "tv mini series", IDs 2 and 5 from my kind_type table.
episode.kind_id=7 is "episode". I'm joining the title table against itself, the first time as "tv_show" and the second time as "episode". The votes come from "tv_show" and the roles come from "episode".

I run this and the output looks like:

'Allo 'Allo!|A Bun in the Oven|1991|14022|Kaye, Gorden|m|1
...

This means there's an 'Allo 'Allo! episode called "A Bun in the Oven", the episode came out in 1991, 'Allo 'Allo! (NOT this specific episode) has 14,022 votes, and Gorden Kaye got top billing for this episode.

I got this data out of a database as quickly as possible and bashed at it to make a TV show look like a movie with four actors--the four actors who appeared in the most episodes of the TV show.
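The "make a TV show look like a movie" step might go something like this. It's a sketch under assumptions: the (episode, actor) input shape and the show_as_movie name are made up, and splitting the votes equally is a guess, since all that's stated above is that the votes get divided between the four top actors.

```python
from collections import Counter


def show_as_movie(show_votes, episode_credits, cast_size=4):
    """Pretend a TV show is a movie with `cast_size` actors.

    `episode_credits` is an iterable of (episode_title, actor) pairs.
    Returns (actor, vote_share) pairs for the actors in the most episodes.
    """
    appearances = Counter()
    # set() makes sure a duplicated credit row only counts once.
    for episode, actor in set(episode_credits):
        appearances[actor] += 1
    top = [actor for actor, count in appearances.most_common(cast_size)]
    # Assumption: the show's votes are split equally among the top actors.
    share = show_votes / len(top) if top else 0
    return [(actor, share) for actor in top]


# Hypothetical show: A and C are in three episodes, B and D in two, E in one.
credits = [
    ("ep1", "A"), ("ep2", "A"), ("ep3", "A"),
    ("ep1", "B"), ("ep2", "B"),
    ("ep1", "C"), ("ep2", "C"), ("ep3", "C"),
    ("ep1", "D"), ("ep2", "D"),
    ("ep3", "E"),
]
cast = show_as_movie(12000, credits)
```

Here the guest star E drops out and A, B, C and D each get credit for 3000 votes.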

Directors were pretty similar to film actors. For each director who's ever worked in comedy, I measured their tendency towards putting out a popular comedy in any given year. There's a very strong power law here, with a few modern directors overshadowing their contemporaries, and Charlie Chaplin completely obliterating all his contemporaries.

Here's SQL to get all comedies with their directors:

select title.title, title.production_year, movie_info_idx.info, name.name, name.gender from title join cast_info on title.id=cast_info.movie_id join name on cast_info.person_id=name.id join movie_info_idx on movie_info_idx.movie_id=title.id join movie_info on movie_info.movie_id=title.id where cast_info.role_id in (8) and kind_id in (1,3,4) and movie_info.info_type_id=3 and movie_info.info='Comedy' and cast(movie_info_idx.info as integer) > 5000 and movie_info_idx.info_type_id=100;

The only new number here is cast_info.role_id in (8), which means I'm now picking up directors instead of actors.

At this point I was done with the SQL database. I wrote the "Ghostbusters casting office". It chooses a year, picks a cast and a director for that year, and then (15% of the time) it picks a custom title. My stupidly hilarious technique for custom titles is to choose the name of an actual comedy from the given year and replace one of the nouns with "Ghost" or "Ghostbuster". So far this has led to films like "Don't Drink the Ghost" and (I swear this happened during testing) "Ghostbuster Dad".
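The custom-title trick can be sketched as follows. The big assumption is the noun detection: how the nouns are actually found isn't covered here, so a tiny hand-made NOUNS set stands in for real part-of-speech tagging, and ghost_title and maybe_custom_title are names invented for the example.

```python
import random

# Assumption: a hand-made noun list stands in for real part-of-speech tagging.
NOUNS = {"water", "dad", "day", "house", "night", "ghost"}


def ghost_title(title, rng=random):
    """Replace one noun in `title` with "Ghost" or "Ghostbuster".

    Returns None if the title has no noun we recognize.
    """
    words = title.split()
    noun_positions = [
        i for i, word in enumerate(words)
        if word.strip(".,!?").lower() in NOUNS
    ]
    if not noun_positions:
        return None
    i = rng.choice(noun_positions)
    words[i] = rng.choice(["Ghost", "Ghostbuster"])
    return " ".join(words)


def maybe_custom_title(real_titles, rng=random, chance=0.15):
    """15% of the time, build a custom title from a real comedy's title."""
    if rng.random() >= chance:
        return None
    return ghost_title(rng.choice(real_titles), rng)
```

Feeding it a title like "Don't Drink the Water" gives either "Don't Drink the Ghost" or "Don't Drink the Ghostbuster", depending on the second coin flip.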

Here's how I pick a cast for a given year: I line up all the actors for that year by my calculated variable "tendency towards being a Ghostbuster", and then I use random.expovariate to choose from different places near the front of the list (to bias the output towards actors you won't have to look up). This is the same trick I use for Serial Entrepreneur to choose common (but not too common) adjectives and nouns for its inventions. My means are 0.85, 0.8, 0.75, and 0.7, which will, on average, give me someone who's at the 85th percentile, someone at the 80th percentile, 75th percentile and 70th percentile.
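Here's a minimal sketch of that selection. How the random.expovariate sample gets mapped onto a list position isn't spelled out above, so this is one plausible mapping rather than the bot's code: treat the sample as a fractional distance from the top of the list, with a mean distance of one minus the target percentile. The names pick_near_front and pick_cast are made up.

```python
import random


def pick_near_front(ranked, mean_percentile, rng=random):
    """Pick from `ranked` (best first), near `mean_percentile` on average."""
    mean_distance = 1.0 - mean_percentile  # e.g. 0.15 of the way down the list
    while True:
        # expovariate takes a rate (1 / mean), not a mean.
        distance = rng.expovariate(1.0 / mean_distance)
        if distance < 1.0:  # redraw the rare sample that falls off the end
            index = min(int(distance * len(ranked)), len(ranked) - 1)
            return ranked[index]


def pick_cast(ranked, means=(0.85, 0.8, 0.75, 0.7), rng=random):
    """Pick one distinct actor per mean, from a best-first list."""
    if len(ranked) < len(means):
        raise ValueError("not enough actors to fill the cast")
    cast = []
    for mean in means:
        actor = pick_near_front(ranked, mean, rng)
        while actor in cast:  # nobody gets two roles
            actor = pick_near_front(ranked, mean, rng)
        cast.append(actor)
    return cast
```

Because samples past the end of the list are redrawn, the real average lands slightly closer to the front than the nominal percentile, which doesn't matter much for a joke bot.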

This is the best I could do to recreate the dynamic of 1984 Ghostbusters, where Bill Murray and Dan Aykroyd were very well-known actors even before Ghostbusters and Ernie Hudson and Harold Ramis were not. At this point you might object that Ernie Hudson and Harold Ramis weren't even 75th or 70th percentile. Ghostbusters was Ramis's second movie ever as an actor; I think there was an oral history that said he gave himself the part of Egon Spengler because no one else was a big enough dork. So for pure accuracy I should be doing, like, 0.90/0.85/0.35/0.30. But that gives you way too many obscure actors and the output isn't as fun. It also doesn't feel accurate, because 1984 Ghostbusters was a real movie, and all by itself it made Hudson and Ramis pretty famous actors. So now we expect "Ghostbuster" to be sort of a prestige comedy role.

A more valid point is that 0.85/0.8/0.75/0.7 also doesn't really capture the dynamic of the 2016 Ghostbusters, where all four actors are well-known but Kristen Wiig has twice the credits of the other three. So I also created an 0.85/0.8/0.8/0.75 mode, which will tend to give you more big-name ensembles.

As always, there's a lot of behind-the-scenes data munging. Going from a bunch of "xth billing in movie with y votes" entries to a single "tendency towards being a Ghostbuster" number required a lot of semi-arbitrary decisions, and I think my algorithm still undercounts television actors. Whenever there was a power law, I smoothed it out a little to increase the variety of the output. I smoothed out the overrepresentation of post-IMDB comedies compared to pre-IMDB comedies; of superstar directors like Chaplin who overshadow everyone else in their time; and of men directors vastly outnumbering women.

Representation of women comedic actors vs. men was not an issue because I followed the lead of the Ghostbusters remake. 45% of the ghostbusting teams are all women, and 45% are all men. (10% of teamups are coed, just to add variety.) There's no code that makes sure all the actors speak the same language or anything like that—I could extract that data from IMDB but it would be a lot of work to make the output of the bot less interesting.
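The 45/45/10 split is a weighted coin flip. A sketch, with assumptions flagged: choose_team_type and pool_for are invented names, the "f" gender code is assumed to mirror the "m" seen in the query output, and drawing coed teams from the whole pool is a guess.

```python
import random


def choose_team_type(rng=random):
    """Return "women", "men", or "coed" with 45/45/10 odds."""
    return rng.choices(["women", "men", "coed"], weights=[45, 45, 10])[0]


def pool_for(team_type, actors):
    """Filter (name, gender) pairs down to the pool for this kind of team.

    Assumption: gender codes are "f" and "m", and a coed team draws
    from everyone.
    """
    wanted = {"women": {"f"}, "men": {"m"}, "coed": {"f", "m"}}[team_type]
    return [name for name, gender in actors if gender in wanted]
```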

And there you go. It's not source code, but you should be able to see more or less how I took this bot from concept to execution, and how I negotiated the tricky space between "this is an accurate representation of what would happen in an alternate universe where the primary cinematic comedy genre is films about busting ghosts" and "this is a fun output for this bot to have."

Sun Feb 15 2015 18:23 Poems of SCIENCE! I Mean, Science: I picked up a cheap old poetry anthology called Poems of Science, figuring there'd be some good stuff. And... there was, but I had to wait for the modern conception of "science" to come about, and then spot poetry about a hundred years to come to grips with it, and decide that science is interesting and not going to go away. By that time I was more than halfway through the anthology. But around the late nineteenth century some excellent poetry starts happening, and I thought I'd share a couple links.

Miroslav Holub's Zito the Magician and Robert Browning's much longer An Epistle Containing the Strange Medical Experience of Karshish, the Arab Physician are really great and work as spec-fic stories. Swinburne's Hertha is this weird humanist we-are-made-of-star-stuff mythology that's what you'd expect from Swinburne. And then there's "Cosmic Gall", a goofy poem by John Updike which I'm gonna quote in full because it's the only thing of John Updike's I've read and liked.

Cosmic Gall
John Updike
Neutrinos, they are very small.
They have no charge and have no mass
And do not interact at all.
The earth is just a silly ball
To them, through which they simply pass,
Like dustmaids down a drafty hall
Or photons through a sheet of glass.
They snub the most exquisite gas,
Ignore the most substantial wall,
Cold shoulder steel and sounding brass,
Insult the stallion in his stall,
And, scorning barriers of class,
Infiltrate you and me. Like tall
And painless guillotines they fall
Down through our heads into the grass.
At night, they enter at Nepal
And pierce the lover and his lass
From underneath the bed—you call
It wonderful; I call it crass.

Sat Feb 21 2015 21:34 Minecraft Archive Project: 201502 Capture: I've done a new capture of data for the Minecraft Archive Project, my big 2014 project to archive the early history of Minecraft before it disappeared. My goal for the refresh was to capture what has happened in the past year while doing as little work as possible, and I met my goal. The whole thing took about two weeks, and most of that was a matter of letting things run overnight. Most of the actual work was refactoring the code I wrote the first time to make future captures even easier.

Top-line numbers: I've archived another 150 gigabytes of good stuff, including 18k maps and schematics, 1k mods, 11k skins, 7k texture packs (resource packs now, I guess), and 100k screenshots. I was able to archive about 73% of the maps. Four percent of the maps were just gone, and 23% I didn't know how to download.

The 201404 Minecraft Archive Project capture contains data from four sites. The new 201502 capture is limited to two sites: the official Minecraft forum and the huge Planet Minecraft site. I started archiving maps, mods, and textures for Minecraft Pocket Edition, and was able to pick up about 5500 MCPE maps.

Now that I've done this twice without getting into trouble, I'll give a little more detail about the process. I've got scripts that download the archives of the Minecraft forum and Planet Minecraft. I find all the threads/projects modified since the last capture, download the corresponding detail pages (e.g. the first page of a forum thread--I'm only after the original post), and extract all the links.
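The "extract all the links" step can be done with nothing but the standard library. This is a generic sketch, not the project's scripts: LinkExtractor and extract_links are names made up here, and the URLs in the usage note are placeholders.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href and src attribute values from a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def extract_links(html):
    """Return every href/src value in `html`, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```

Given a post body like '<a href="http://example.com/map.zip">map</a><img src="http://example.com/shot.png">', extract_links returns both URLs in order, ready to be handed to the recipes.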

Then it's a matter of archiving as many of those links as possible. I've written recipes for archiving images and downloads. These six recipes take care of the vast majority of items:

Two file hosts: Mediafire and Dropbox
Four image hosts: imgur, Photobucket, TinyPic, and postimage.org

There's also a general catch-all for people who host things on normal home pages, as Tim Berners-Lee intended. If your URL looks like the URL to an image or a binary archive, I will ask for that URL. If you serve me the image or the binary instead of an HTML file telling me to click on something, then I'll archive the file.
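The catch-all rule boils down to two checks, sketched here without any network code. The extension list and the function names are illustrative assumptions, not what the project actually uses.

```python
from urllib.parse import urlparse

# Assumption: an illustrative extension list, not the project's real one.
FILE_EXTENSIONS = (
    ".png", ".jpg", ".jpeg", ".gif",
    ".zip", ".rar", ".jar", ".schematic",
)


def looks_like_file(url):
    """True if the URL path ends in an image or archive extension."""
    return urlparse(url).path.lower().endswith(FILE_EXTENSIONS)


def served_a_file(content_type):
    """True unless the server answered with an HTML page instead of a file."""
    media_type = content_type.split(";")[0].strip().lower()
    return media_type not in ("text/html", "application/xhtml+xml")


def should_archive(url, content_type):
    """Apply both checks: a promising URL and a non-HTML response."""
    return looks_like_file(url) and served_a_file(content_type)
```

So should_archive("http://example.com/castle.zip", "application/zip") passes, while the same URL answered with "text/html; charset=utf-8" (a "click here to download" page) does not.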

I decode most link shorteners except for the ones that make you click through ads, mainly adfoc.us and adf.ly. The 2014 archive had about 18,000 maps behind adf.ly links, and I spent a lot of time running Selenium clients clicking through the ads to discover the Mediafire links. I think that took a month. This time there were about 3000 new maps behind adf.ly links and I just didn't bother.

There are two big blind spots in my dataset, and they're the same as last time. One is mods. A lot of mods are hosted on Github and CurseForge, two big sites I didn't write recipes for. There's also the issue of mod packs, which have been steadily growing in popularity and complexity as development on core Minecraft winds down. Thanks to things like the Hardcore Questing Mod, modpacks are entering the "custom challenge" territory previously occupied solely by world archives.

There are sites that list mod packs, but I don't want to spend the time figuring out how to archive all the mod packs. There's also the problem that mod packs are huge.

The second blind spot is servers. It's theoretically possible to join a public Minecraft server with a modded client and automatically archive the map, but realistically it ain't gonna happen. I complained about this last time, but now I've done an assessment of what's being lost.

Planet Minecraft has a big server list that mentions the last time it was able to ping any particular server. There doesn't seem to be any purging of dead servers, so I'm able to get good measurements of the typical lifecycle.

Of the 136k servers in the list, 12k are "online" (the most recent Planet Minecraft ping was successful), 51k are "offline" (the most recent Planet Minecraft ping failed, but there was a successful ping less than two weeks ago), and 73k I declare "dead" (last successful ping was more than two weeks ago). It seems really weird that nearly half of the servers that aren't online went offline in the past two weeks, so something's going on there; maybe Planet Minecraft's ping process is unreliable, or it just takes a long time to check every server, or servers go up and down all the time.
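The three-way classification is simple enough to write down. A sketch, assuming each server record carries whether the latest ping worked and when the last successful ping was; the classify name and its parameters are made up.

```python
from datetime import datetime, timedelta


def classify(last_ping_ok, last_success, now, grace=timedelta(weeks=2)):
    """Label a server "online", "offline", or "dead".

    `last_ping_ok`: did the most recent ping succeed?
    `last_success`: datetime of the last successful ping, or None.
    """
    if last_ping_ok:
        return "online"
    if last_success is not None and now - last_success < grace:
        return "offline"  # failed just now, but seen within two weeks
    return "dead"
```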

Anyway, the median lifetime for a public Minecraft server is 434 days, a little over a year. These things go online, people do a bunch of work on them, and then they disappear. I've kind of gotten to 'acceptance' on this, but it's still obnoxious.
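A median lifetime like that one comes from first-seen and last-seen dates. Here's a sketch of the arithmetic; which servers go into the calculation (only dead ones, or all of them) isn't stated above, so the function just takes whatever pairs it's given, and median_lifetime_days is an invented name.

```python
from datetime import date
from statistics import median


def median_lifetime_days(servers):
    """Median of (last_seen - first_seen) in days.

    `servers` is an iterable of (first_seen, last_seen) date pairs.
    """
    return median((last - first).days for first, last in servers)


# Three hypothetical servers that lived 10, 434 and 1000 days.
sample = [
    (date(2012, 1, 1), date(2012, 1, 11)),
    (date(2012, 1, 1), date(2013, 3, 10)),
    (date(2012, 1, 1), date(2014, 9, 27)),
]
```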

One final thing: I thought I'd check if I could see the result of Mojang's June announcement of rules for how you can make money by hosting servers (and, more importantly, how you can't). I wanted to see if these rules had a chilling effect on the formation of new servers or caused a lot of old servers to shut down.

And... no, not really. Here's a chart showing two sixty-day periods around June 12, the date of the Mojang blog post. For each day I show 'births' (the number of servers first seen on that day) and 'deaths' (the number of servers last seen on that day). There's a drop-off in new servers around the end of July, but then it picks up again stronger than before. I don't have an explanation for it but I don't think there's anything in here you can pin on a blog post. The Mojang rules were probably intended to go after a small number of large obnoxious servers, and everyone else either doesn't care or flies under the radar.
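The births-and-deaths tally behind a chart like this is a pair of counters. A sketch under assumptions: servers are (first_seen, last_seen) date pairs, the window is sixty days either side of a center date, and births_and_deaths is a made-up name.

```python
from collections import Counter
from datetime import date, timedelta


def births_and_deaths(servers, center, days=60):
    """Count servers first seen and last seen on each day near `center`.

    `servers` is an iterable of (first_seen, last_seen) date pairs.
    Returns two Counters keyed by date: (births, deaths).
    """
    start = center - timedelta(days=days)
    end = center + timedelta(days=days)
    births = Counter(
        first for first, last in servers if start <= first <= end
    )
    deaths = Counter(
        last for first, last in servers if start <= last <= end
    )
    return births, deaths
```

Days with no births or deaths simply come back as zero when you look them up in the Counter, which is convenient when plotting.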

(Screenshot is from World #57 by Art_Fox. I didn't archive the map because it's behind an adf.ly link, but I got the screenshot.)

PS: Congratulations to Anticraft, the oldest public Minecraft server I could find that's still online, added to Planet Minecraft on February 28, 2011.

Update: I fixed up the adf.ly code and let it run for another two weeks (!), saving another 2000 Minecraft maps and 700 MCPE maps. I probably won't do this again because it's a huge pain, but I said that this time and ended up doing it out of some sense of obligation to the future, so maybe obligation will strike again, who knows.