
January Film Roundup: January started with three highly anticipated films that all turned out to be duds! What to do for the rest of the month, but stack the deck?

The Ghost of Ghostbusters Past: Just a quick semi-technical post on how I made @WeBustedGhosts, my new bot that casts movies from an alternate history where "ghostbusters" is a stock comedy genre, sort of a twentieth-century commedia dell'arte. In particular, I did a lot of work with IMDB data that I want to record for your benefit (and by you, I mean future me).

The bot was inspired by two things: first, this video by Ivan Guerrero which "premakes" Ghostbusters as a 1954 comedy starring Bob Hope, Fred MacMurray, and Martin/Lewis. Second, the reaction of fools to the fact that women comedians will bust ghosts in the upcoming Ghostbusters remake. More specifically, Kris's endless mockery of the idea that "ghostbuster" is a job with a legitimate gender qualification.

These things got me thinking about the minimal set of things you need to make Ghostbusters. You need the idea of combining a horror movie with a comedy about starting a business. Someone could have come up with that idea in the silent film era. You need a director and four actors who can do comedy. And all those people need to be alive and working at the same time, because ghosts aren't real... OR ARE THEY? Either way, you can describe a point in Ghostbusters space with six pieces of information: four actors, a director, and a year. That's small enough to fit into a tweet, so I made a Twitter bot.

Our journey to botdom starts, as you might expect, with an IMDB data dump. I've dealt with IMDB data before and this time I was excited to learn about IMDbPY, which promised to get a handle on the ancient and not-terribly-consistent flat-file IMDB data format. Unfortunately IMDbPY is designed for looking up facts about specific movies, not for reasoning over the set of all movies. However, it does have a great script called imdbpy2sql.py, which will take the flat-file format and turn it into a SQL database.

There will be SQL in this discussion (because I want to show you/future me how to do semi-complex stuff with the database created by IMDbPY), but unless you're future me, you can skip it. Basically, for each actor in IMDB, I need to calculate that actor's tendency to get high billing in popular comedies for a given year. They don't have to be good comedies, or Ghostbusters-like comedies, they just have to have a lot of IMDB ratings.

I also want to figure out each actor's effective comedy lifespan. If an actor stops doing popular comedy or dies or retires, they should stop showing up in the dataset. If a dramatic actor branches out into comedy they should show up in the dataset as of their first comedic performance. Basically, if you learned that this actor starred in a comedy that came out in a certain year, it shouldn't be a big surprise.

Orson Welles would be great in a Ghostbusters movie, but he never did comedy, so he's not in the dataset. How about... Cameron Diaz? She rarely gets top billing, but she has second or third billing in a lot of very popular comedies. For a year like 1997 she tops the list of potential women Ghostbusters.

How about... Peter Falk? His first comedy role was in 1961's Pocketful of Miracles, his last in 2005's Checking Out. His acting career stretches from 1957 to 2009, but he's only a potential Ghostbuster between 1961 and 2005. He won't get chosen very often, because he's not primarily known for comedy (i.e. his comedies aren't as popular as other people's), but it will happen occasionally.
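The "effective comedy lifespan" idea can be sketched in a few lines of Python. This is only an illustration, not the bot's actual code: the function names and the `(actor, year)` input shape are assumptions, and the two sample rows are the Peter Falk films mentioned above.

```python
def comedy_lifespans(roles):
    """Map each actor to (first_year, last_year) across their comedy roles.

    `roles` is an iterable of (actor_name, year) pairs, one per comedy role.
    (Hypothetical input shape, chosen for this sketch.)
    """
    spans = {}
    for actor, year in roles:
        first, last = spans.get(actor, (year, year))
        spans[actor] = (min(first, year), max(last, year))
    return spans


def is_candidate(spans, actor, year):
    """True if `year` falls inside the actor's comedy lifespan."""
    if actor not in spans:
        return False  # never did comedy, so never a potential Ghostbuster
    first, last = spans[actor]
    return first <= year <= last


roles = [
    ("Falk, Peter", 1961),  # first comedy role
    ("Falk, Peter", 2005),  # last comedy role
]
spans = comedy_lifespans(roles)
print(is_candidate(spans, "Falk, Peter", 1984))  # inside the 1961-2005 window
print(is_candidate(spans, "Falk, Peter", 1957))  # acting, but not yet in comedy
```

An actor with no comedy roles at all never appears in `spans`, which matches the rule that dramatic-only actors stay out of the dataset.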

That's the data I extracted. Not "how famous is this actor" but "how much would you expect this actor to be in a comedy in a given year".

The IMDbPY database is more complicated than I like to deal with, so my strategy was to use SQL to get a big table of roles and then process it with Python. Here's SQL to get every major role in a comedy that has more than 1000 votes on IMDB:

select title.title, title.production_year, movie_info_idx.info, name.name, name.gender, cast_info.nr_order, kind_id
from title
  join cast_info on title.id=cast_info.movie_id
  join name on cast_info.person_id=name.id
  join movie_info_idx on movie_info_idx.movie_id=title.id
  join movie_info on movie_info.movie_id=title.id
where cast_info.role_id in (1,2)
  and kind_id in (1,3,4)
  and movie_info.info_type_id=3
  and movie_info.info='Comedy'
  and cast(movie_info_idx.info as integer) > 1000
  and movie_info_idx.info_type_id=100
  and cast_info.nr_order <= 7;

Some explanation of numbers and IDs:

I run this on a SQLite database and the output looks like:

#1 Cheerleader Camp|2010|2297|Cassell, Seth|m|2|4
...

So the title of the movie is "#1 Cheerleader Camp", it came out in 2010, it has 2297 votes, and Seth Cassell (a man) was an actor in that movie and got second billing. (The final 4 is the kind_id.)
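For the "process it with Python" half, here is a minimal sketch of turning one line of that pipe-separated output into a record. The `Role` type and `parse_role` function are hypothetical names made up for this example; the fields are read in the order of the select list, and the sketch assumes every column is present (a row with a missing year would need extra handling).

```python
from typing import NamedTuple


class Role(NamedTuple):
    title: str
    year: int
    votes: int
    name: str
    gender: str
    billing: int   # cast_info.nr_order
    kind_id: int


def parse_role(line):
    """Parse one pipe-separated output row into a Role."""
    # Split from the right so a "|" inside a movie title can't shift the columns.
    fields = line.rstrip("\n").rsplit("|", 6)
    title, year, votes, name, gender, billing, kind_id = fields
    return Role(title, int(year), int(votes), name, gender,
                int(billing), int(kind_id))


role = parse_role("#1 Cheerleader Camp|2010|2297|Cassell, Seth|m|2|4")
print(role.name, role.year, role.votes)
```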

Why didn't I include television in this query? Because television on IMDB is really complicated. See, actors aren't credited to television shows; they're credited to individual episodes. But nobody rates individual episodes; they rate the show as a whole. So I had to do a separate query to determine who the top actors were on each comedy television show, and then divide up that show's votes between the four top actors. Otherwise actors whose primary comedy career is in television won't get their due.

Here's SQL to get all the roles in TV episodes:

select tv_show.title, episode.title, episode.production_year, votes.info, name.name, name.gender, cast_info.nr_order
from title as tv_show
  join title as episode on tv_show.id=episode.episode_of_id
  join cast_info on episode.id=cast_info.movie_id
  join name on cast_info.person_id=name.id
  join movie_info_idx as votes on votes.movie_id=tv_show.id
  join movie_info on movie_info.movie_id=tv_show.id
where cast_info.role_id in (1,2)
  and tv_show.kind_id in (2,5)
  and episode.kind_id=7
  and movie_info.info_type_id=3
  and movie_info.info='Comedy'
  and cast(votes.info as integer) > 10000
  and votes.info_type_id=100
  and cast_info.nr_order < 5;

This is pretty similar to the last query but some of the IDs are different.

I run this and the output looks like:

'Allo 'Allo!|A Bun in the Oven|1991|14022|Kaye, Gorden|m|1
...

This means there's an 'Allo 'Allo! episode called "A Bun in the Oven", the episode came out in 1991, 'Allo 'Allo (NOT this specific episode) has 14,022 votes, and Gorden Kaye got top billing for this episode.

I got this data out of a database as quickly as possible and bashed at it to make a TV show look like a movie with four actors--the four actors who appeared in the most episodes of the TV show.
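A rough sketch of that "make a TV show look like a movie" step, under some loudly stated assumptions: the input is a list of `(show, votes, actor)` tuples (one per episode credit), the function name is made up, and the show's votes are split equally among the four most frequent actors. The post only says the votes were divided up, so the equal split is a guess, and the actor names in the sample data are placeholders.

```python
from collections import Counter, defaultdict


def shows_as_movies(episode_roles, top_n=4):
    """Collapse episode credits into one movie-like record per show.

    Returns {show: [(actor, vote_share), ...]} for the `top_n` actors who
    appear in the most episodes of each show.
    """
    appearances = defaultdict(Counter)
    show_votes = {}
    for show, votes, actor in episode_roles:
        appearances[show][actor] += 1
        show_votes[show] = votes  # votes belong to the show, not the episode
    result = {}
    for show, counts in appearances.items():
        top = [actor for actor, _ in counts.most_common(top_n)]
        share = show_votes[show] / len(top)  # assumption: equal split
        result[show] = [(actor, share) for actor in top]
    return result


# Placeholder credits: actor A appears in 5 episodes, B in 4, and so on.
credits = []
for actor, episodes in [("A", 5), ("B", 4), ("C", 3), ("D", 2), ("E", 1)]:
    credits += [("'Allo 'Allo!", 14022, actor)] * episodes

movies = shows_as_movies(credits)
print(movies["'Allo 'Allo!"])
```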

Directors were pretty similar to film actors. For each director who's ever worked in comedy, I measured their tendency towards putting out a popular comedy in any given year. There's a very strong power law here, with a few modern directors overshadowing their contemporaries, and Charlie Chaplin completely obliterating all his contemporaries.

Here's SQL to get all comedies with their directors:

select title.title, title.production_year, movie_info_idx.info, name.name, name.gender
from title
  join cast_info on title.id=cast_info.movie_id
  join name on cast_info.person_id=name.id
  join movie_info_idx on movie_info_idx.movie_id=title.id
  join movie_info on movie_info.movie_id=title.id
where cast_info.role_id in (8)
  and kind_id in (1,3,4)
  and movie_info.info_type_id=3
  and movie_info.info='Comedy'
  and cast(movie_info_idx.info as integer) > 5000
  and movie_info_idx.info_type_id=100;

The only new number here is cast_info.role_id in (8), which means I'm now picking up directors instead of actors.

At this point I was done with the SQL database. I wrote the "Ghostbusters casting office". It chooses a year, picks a cast and a director for that year, and then (15% of the time) it picks a custom title. My stupidly hilarious technique for custom titles is to choose the name of an actual comedy from the given year and replace one of the nouns with "Ghost" or "Ghostbuster". So far this has led to films like "Don't Drink the Ghost" and (I swear this happened during testing) "Ghostbuster Dad".

Here's how I pick a cast for a given year: I line up all the actors for that year by my calculated variable "tendency towards being a Ghostbuster", and then I use random.expovariate to choose from different places near the front of the list (to bias the output towards actors you won't have to look up). This is the same trick I use for Serial Entrepreneur to choose common (but not too common) adjectives and nouns for its inventions. My means are 0.85, 0.8, 0.75, and 0.7, which will, on average, give me someone who's at the 85th percentile, someone at the 80th percentile, 75th percentile and 70th percentile.
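Here is one way the random.expovariate trick could look. It's a sketch under assumptions, not the casting office's real code: `pick_near_front` and `pick_cast` are invented names, the list is assumed to be sorted best-first, and the target percentile is converted to a mean offset from the front of the list. It also assumes percentiles strictly between 0 and 1 and a list with at least as many actors as there are cast slots.

```python
import random


def pick_near_front(ranked, percentile, rng=random):
    """Pick an item from `ranked` (best first), centred on `percentile`.

    The offset from the front is drawn from an exponential distribution whose
    mean is the (1 - percentile) fraction of the list, so 0.85 lands roughly
    15% of the way down on average.
    """
    mean_offset = (1.0 - percentile) * len(ranked)
    # random.expovariate takes lambd, which is 1 / mean.
    offset = int(rng.expovariate(1.0 / mean_offset))
    # The exponential has a long tail; clamp draws that run off the end.
    return ranked[min(offset, len(ranked) - 1)]


def pick_cast(ranked, percentiles=(0.85, 0.8, 0.75, 0.7), rng=random):
    """Pick one distinct actor per target percentile."""
    cast = []
    for p in percentiles:
        choice = pick_near_front(ranked, p, rng)
        while choice in cast:  # redraw if we already cast this actor
            choice = pick_near_front(ranked, p, rng)
        cast.append(choice)
    return cast


ranked = ["actor%d" % i for i in range(100)]  # placeholder names, best first
print(pick_cast(ranked, rng=random.Random(0)))
```

Because the exponential distribution piles most of its weight near zero, most picks come from close to the front of the list, with an occasional deep cut from the tail.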

This is the best I could do to recreate the dynamic of 1984 Ghostbusters where Bill Murray and Dan Aykroyd were very well-known actors even before Ghostbusters, while Ernie Hudson and Harold Ramis were not. At this point you might object that Ernie Hudson and Harold Ramis weren't even 75th or 70th percentile. Ghostbusters was Ramis's second movie ever as an actor; I think there was an oral history that said he gave himself the part of Egon Spengler because no one else was a big enough dork. So for pure accuracy I should be doing, like, 0.90/0.85/0.35/0.30. But that gives you way too many obscure actors and the output isn't as fun. It also doesn't feel accurate, because 1984 Ghostbusters was a real movie, and all by itself it made Hudson and Ramis pretty famous actors. So now we expect "Ghostbuster" to be sort of a prestige comedy role.

A more valid point is that 0.85/0.8/0.75/0.7 also doesn't really capture the dynamic of the 2016 Ghostbusters, where all four actors are well-known but Kristen Wiig has twice the credits of the other three. So I also created an 0.85/0.8/0.8/0.75 mode, which will tend to give you more big-name ensembles.

As always, there's a lot of behind-the-scenes data munging. Going from a bunch of "xth billing in movie with y votes" entries to a single "tendency towards being a Ghostbuster" number required a lot of semi-arbitrary decisions, and I think my algorithm still undercounts television actors. Whenever there was a power law, I smoothed it out a little to increase the variety of the output. I smoothed out the overrepresentation of post-IMDB comedies compared to pre-IMDB comedies; of superstar directors like Chaplin who overshadow everyone else in their time; and of men directors vastly outnumbering women.

Representation of women comedic actors vs. men was not an issue because I followed the lead of the Ghostbusters remake. 45% of the ghostbusting teams are all women, and 45% are all men. (10% of teamups are coed, just to add variety.) There's no code that makes sure all the actors speak the same language or anything like that—I could extract that data from IMDB but it would be a lot of work to make the output of the bot less interesting.
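The 45/45/10 split is easy to express with a weighted choice. A minimal sketch, with a made-up function name and labels:

```python
import random


def choose_team_type(rng=random):
    """Return 'women', 'men', or 'coed' with 45/45/10 odds."""
    return rng.choices(["women", "men", "coed"], weights=[45, 45, 10])[0]


print(choose_team_type(random.Random(2016)))
```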

And there you go. It's not source code, but you should be able to see more or less how I took this bot from concept to execution, and how I negotiated the tricky space between "this is an accurate representation of what would happen in an alternate universe where the primary cinematic comedy genre is films about busting ghosts" and "this is a fun output for this bot to have."

Poems of SCIENCE! I Mean, Science: I picked up a cheap old poetry anthology called Poems of Science, figuring there'd be some good stuff. And... there was, but I had to wait for the modern conception of "science" to come about, and then spot poetry about a hundred years to come to grips with it, and decide that science is interesting and not going to go away. By that time I was more than halfway through the anthology. But around the late nineteenth century some excellent poetry starts happening, and I thought I'd share a couple links.

Miroslav Holub's Zito the Magician and Robert Browning's much longer An Epistle Containing the Strange Medical Experience of Karshish, the Arab Physician are really great and work as spec-fic stories. Swinburne's Hertha is this weird humanist we-are-made-of-star-stuff mythology that's what you'd expect from Swinburne. And then there's "Cosmic Gall", a goofy poem by John Updike which I'm gonna quote in full because it's the only thing of John Updike's I've read and liked.

Cosmic Gall
John Updike

Neutrinos, they are very small.
They have no charge and have no mass
And do not interact at all.
The earth is just a silly ball
To them, through which they simply pass,
Like dustmaids down a drafty hall
Or photons through a sheet of glass.
They snub the most exquisite gas,
Ignore the most substantial wall,
Cold shoulder steel and sounding brass,
Insult the stallion in his stall,
And, scorning barriers of class,
Infiltrate you and me. Like tall
And painless guillotines they fall
Down through our heads into the grass.
At night, they enter at Nepal
And pierce the lover and his lass
From underneath the bed—you call
It wonderful; I call it crass.

Minecraft Archive Project: 201502 Capture: I've done a new capture of data for the Minecraft Archive Project, my big 2014 project to archive the early history of Minecraft before it disappeared. My goal for the refresh was to capture what has happened in the past year while doing as little work as possible, and I met my goal. The whole thing took about two weeks, and most of that was a matter of letting things run overnight. Most of the actual work was refactoring the code I wrote the first time to make future captures even easier.

Top-line numbers: I've archived another 150 gigabytes of good stuff, including 18k maps and schematics, 1k mods, 11k skins, 7k texture packs (resource packs now, I guess), and 100k screenshots. I was able to archive about 73% of the maps. Four percent of the maps were just gone, and 23% I didn't know how to download.

The 201404 Minecraft Archive Project capture contains data from four sites. The new 201502 capture is limited to two sites: the official Minecraft forum and the huge Planet Minecraft site. I started archiving maps, mods, and textures for Minecraft Pocket Edition, and was able to pick up about 5500 MCPE maps.

Now that I've done this twice without getting into trouble, I'll give a little more detail about the process. I've got scripts that download the archives of the Minecraft forum and Planet Minecraft. I find all the threads/projects modified since the last capture, download the corresponding detail pages (e.g. the first page of a forum thread--I'm only after the original post), and extract all the links.

Then it's a matter of archiving as many of those links as possible. I've written recipes for archiving images and downloads. These six recipes take care of the vast majority of items:

There's also a general catch-all for people who host things on normal home pages, as Tim Berners-Lee intended. If your URL looks like the URL to an image or a binary archive, I will ask for that URL. If you serve me the image or the binary instead of an HTML file telling me to click on something, then I'll archive the file.
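The catch-all boils down to two checks: does the URL look like a file, and did the server actually send one? A sketch of both, with no network code. The extension lists and function names are hypothetical; the post doesn't say which extensions or content types the real catch-all looks at.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical extension lists, chosen for illustration only.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif"}
ARCHIVE_EXTENSIONS = {".zip", ".rar", ".7z", ".jar", ".schematic"}


def looks_like_file(url):
    """True if the URL's path ends in an image or binary-archive extension."""
    path = urlparse(url).path  # ignores any query string or fragment
    extension = posixpath.splitext(path)[1].lower()
    return extension in IMAGE_EXTENSIONS | ARCHIVE_EXTENSIONS


def worth_archiving(content_type):
    """True if the response is a file rather than an HTML click-through page.

    `content_type` is the value of the response's Content-Type header.
    """
    media_type = content_type.split(";")[0].strip().lower()
    return media_type not in ("text/html", "application/xhtml+xml")


print(looks_like_file("http://example.com/maps/castle.zip?dl=1"))
print(worth_archiving("text/html; charset=utf-8"))
```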

I decode most link shorteners except for the ones that make you click through ads, mainly adfoc.us and adf.ly. The 2014 archive had about 18,000 maps behind adf.ly links, and I spent a lot of time running Selenium clients clicking through the ads to discover the Mediafire links. I think that took a month. This time there were about 3000 new maps behind adf.ly links and I just didn't bother.

There are two big blind spots in my dataset, and they're the same as last time. One is mods. A lot of mods are hosted on Github and CurseForge, two big sites I didn't write recipes for. There's also the issue of mod packs, which have been steadily growing in popularity and complexity as development on core Minecraft winds down. Thanks to things like the Hardcore Questing Mod, modpacks are entering the "custom challenge" territory previously occupied solely by world archives.

There are sites that list mod packs (1 2) but I don't want to spend the time figuring out how to archive all the mod packs. There's also the problem that mod packs are huge.

The second blind spot is servers. It's theoretically possible to join a public Minecraft server with a modded client and automatically archive the map, but realistically it ain't gonna happen. I complained about this last time, but now I've done an assessment of what's being lost.

Planet Minecraft has a big server list that mentions the last time it was able to ping any particular server. There doesn't seem to be any purging of dead servers, so I'm able to get good measurements of the typical lifecycle.

Of the 136k servers in the list, 12k are "online" (The most recent Planet Minecraft ping was successful). 51k are "offline" (Most recent Planet Minecraft ping failed, but there was a successful ping less than two weeks ago) and 73k I declare "dead" (last successful ping was more than two weeks ago). It seems really weird that nearly half of the servers that aren't online went offline in the past two weeks, so something's going on there; maybe Planet Minecraft's ping process is unreliable, or it just takes a long time to check every server, or servers go up and down all the time.
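The three buckets can be written as a small function. This is a sketch with assumed inputs: whether the most recent ping succeeded, the time of the last successful ping (or None if there never was one), and the time of the capture.

```python
from datetime import datetime, timedelta

DEAD_AFTER = timedelta(weeks=2)


def classify(last_ping_ok, last_success, now):
    """Bucket a server as 'online', 'offline', or 'dead'."""
    if last_ping_ok:
        return "online"
    if last_success is not None and now - last_success < DEAD_AFTER:
        return "offline"  # failed just now, but seen within two weeks
    return "dead"


now = datetime(2015, 2, 15)  # placeholder capture time
print(classify(False, now - timedelta(days=3), now))
```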

Anyway, the median lifetime for a public Minecraft server is 434 days, a little over a year. These things go online, people do a bunch of work on them, and then they disappear. I've kind of gotten to 'acceptance' on this, but it's still obnoxious.
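A median lifetime like this can be computed from first-seen and last-seen dates. A sketch with made-up data; which servers belong in the calculation (only dead ones, say) is a choice the post doesn't spell out, so the function just takes whatever list it's given.

```python
from datetime import date
from statistics import median


def median_lifetime_days(servers):
    """Median of (last_seen - first_seen) in days.

    `servers` is a list of (first_seen, last_seen) date pairs.
    """
    return median((last - first).days for first, last in servers)


# Placeholder servers with lifetimes of 100, 434, and 900 days.
servers = [
    (date(2013, 1, 1), date(2013, 4, 11)),
    (date(2013, 1, 1), date(2014, 3, 11)),
    (date(2012, 1, 1), date(2014, 6, 19)),
]
print(median_lifetime_days(servers))
```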

One final thing: I thought I'd check if I could see the result of Mojang's June announcement of rules for how you can make money by hosting servers (and, more importantly, how you can't). I wanted to see if these rules had a chilling effect on the formation of new servers or caused a lot of old servers to shut down.

And... no, not really. Here's a chart showing two sixty-day periods around June 12, the date of the Mojang blog post. For each day I show 'births' (the number of servers first seen on that day) and 'deaths' (the number of servers last seen on that day). There's a drop-off in new servers around the end of July, but then it picks up again stronger than before. I don't have an explanation for it but I don't think there's anything in here you can pin on a blog post. The Mojang rules were probably intended to go after a small number of large obnoxious servers, and everyone else either doesn't care or flies under the radar.
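The numbers behind a chart like that are two per-day tallies. A sketch, assuming each server is a `(first_seen, last_seen)` pair of dates, and assuming the June 12 in question is in 2014 (the capture is from early 2015); the sample servers are invented.

```python
from collections import Counter
from datetime import date, timedelta


def births_and_deaths(servers, center, days=60):
    """Count servers first seen / last seen per day within `days` of `center`."""
    start = center - timedelta(days=days)
    end = center + timedelta(days=days)
    births = Counter(first for first, _ in servers if start <= first <= end)
    deaths = Counter(last for _, last in servers if start <= last <= end)
    return births, deaths


announcement = date(2014, 6, 12)  # assumption: the year is 2014
servers = [
    (date(2014, 6, 1), date(2014, 7, 30)),
    (date(2014, 6, 1), date(2015, 1, 5)),   # death falls outside the window
    (date(2013, 2, 1), date(2014, 7, 30)),  # birth falls outside the window
]
births, deaths = births_and_deaths(servers, announcement)
print(births[date(2014, 6, 1)], deaths[date(2014, 7, 30)])
```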

(Screenshot is from World #57 by Art_Fox. I didn't archive the map because it's behind an adf.ly link, but I got the screenshot.)

PS: Congratulations to Anticraft, the oldest public Minecraft server I could find that's still online, added to Planet Minecraft on February 28, 2011.

Update: I fixed up the adf.ly code and let it run for another two weeks (!), saving another 2000 Minecraft maps and 700 MCPE maps. I probably won't do this again because it's a huge pain, but I said that this time and ended up doing it out of some sense of obligation to the future, so maybe obligation will strike again, who knows.

Reviews of Old Science Fiction Magazines: F&SF October 1985: The first story in this magazine is James Tiptree's "The Only Neat Thing to Do", and the introductory copy introduces the main character as "a green-eyed young woman who happens to be one of the most appealing characters you are likely to encounter in these or any other pages," and my attitude was "Pffft, green eyes, sure, we'll see about that... DAMMIT." This story's so good. It starts out with this perfect wish-fulfillment space adventure but look at the title, folks, it's not gonna end well. Argh, so good.

Harlan Ellison still hates Gremlins, in fact he says he's been getting letters from people who scoffed at his Gremlins hate but now they've seen the movie they're swallowing their pride and sending him "toe-scuffling, red-faced, abnegating appeals for absolution." I'm harboring a doubt or two here, because he's also saying other people who took his advice (and presumably didn't see the movie) are thanking him. Given that Gremlins has consistently been a well-regarded film since its release, why would someone say "Thanks for warning me off the movie I haven't seen that people still seem to like."?

But all that's in the past. In this issue Ellison doubles down, telling people not to see The Goonies due to "utter emptyheadedness", which, okay, at least it's a critique and not 'the lurkers support me in email.' Also on Ellison's shit list for this month: Rambo: First Blood Part II, A View to a Kill, and The Black Cauldron. He loves Cocoon, Ladyhawke, and Return to Oz, and who's to say he's wrong? Not me, 'cause I haven't seen any of those movies.

There's some really corny back-cover copy in one of the ads for books, but I know from experience that writing back-cover copy is the worst, so as a professional courtesy I'm not going to make fun of it. Kind of weird that most of the stories in this issue are SF or horror, but all the ads are for fantasy books.

Halley's Comet fever strikes the classifieds! There's an ad for Halley's Comet, 1910: Fire in the Sky, sort of a historical recreation by Jerred Metz. Also a "HALLEY'S COMET. TIE TAC or Stick Pin. Four color enamel and beautiful." I'm hyping up the Halley's Comet thing because I happen to own a mint in-box Halley's Comet Hot Wheels car the likes of which are currently going on eBay for a measly $5.32 used including shipping. C'mon! This is my nest egg here! I demand... demand!



Unless otherwise noted, all content licensed by Leonard Richardson
under a Creative Commons License.