
The Ghost of Ghostbusters Past: Just a quick semi-technical post on how I made @WeBustedGhosts, my new bot that casts movies from an alternate history where "ghostbusters" is a stock comedy genre, sort of a twentieth-century commedia dell'arte. In particular, I did a lot of work with IMDB data that I want to record for your benefit (and by you, I mean future me).

The bot was inspired by two things: first, this video by Ivan Guerrero which "premakes" Ghostbusters as a 1954 comedy starring Bob Hope, Fred MacMurray, and Martin/Lewis. Second, the reaction of fools to the fact that women comedians will bust ghosts in the upcoming Ghostbusters remake. More specifically, Kris's endless mockery of the idea that "ghostbuster" is a job with a legitimate gender qualification.

These things got me thinking about the minimal set of things you need to make Ghostbusters. You need the idea of combining a horror movie with a comedy about starting a business. Someone could have come up with that idea in the silent film era. You need a director and four actors who can do comedy. And all those people need to be alive and working at the same time, because ghosts aren't real... OR ARE THEY? Either way, you can describe a point in Ghostbusters space with six pieces of information: four actors, a director, and a year. That's small enough to fit into a tweet, so I made a Twitter bot.

Our journey to botdom starts, as you might expect, with an IMDB data dump. I've dealt with IMDB data before and this time I was excited to learn about IMDbPY, which promised to get a handle on the ancient and not-terribly-consistent flat-file IMDB data format. Unfortunately IMDbPY is designed for looking up facts about specific movies, not for reasoning over the set of all movies. However, it does have a great script called imdbpy2sql.py, which will take the flat-file format and turn it into a SQL database.

There will be SQL in this discussion (because I want to show you/future me how to do semi-complex stuff with the database created by IMDbPY), but unless you're future me, you can skip it. Basically, for each actor in IMDB, I need to calculate that actor's tendency to get high billing in popular comedies for a given year. They don't have to be good comedies, or Ghostbusters-like comedies, they just have to have a lot of IMDB ratings.

I also want to figure out each actor's effective comedy lifespan. If an actor stops doing popular comedy or dies or retires, they should stop showing up in the dataset. If a dramatic actor branches out into comedy they should show up in the dataset as of their first comedic performance. Basically, if you learned that this actor starred in a comedy that came out in a certain year, it shouldn't be a big surprise.

Orson Welles would be great in a Ghostbusters movie, but he never did comedy, so he's not in the dataset. How about... Cameron Diaz? She rarely gets top billing, but she has second or third billing in a lot of very popular comedies. For a year like 1997 she tops the list of potential women Ghostbusters.

How about... Peter Falk? His first comedy role was in 1961's Pocketful of Miracles, his last in 2005's Checking Out. His acting career stretches from 1957 to 2009, but he's only a potential Ghostbuster between 1961 and 2005. He won't get chosen very often, because he's not primarily known for comedy (i.e. his comedies aren't as popular as other people's), but it will happen occasionally.

That's the data I extracted. Not "how famous is this actor" but "how much would you expect this actor to be in a comedy in a given year".
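Here's a minimal Python sketch of the "comedy lifespan" idea, not the bot's actual code. It assumes the roles have already been reduced to (actor, year) pairs, one per comedy role; the function names comedy_lifespans and is_candidate are made up for illustration.

```python
def comedy_lifespans(roles):
    """Map each actor to the (first, last) year of their comedy roles.

    `roles` is an iterable of (actor_name, year) pairs, one per comedy role.
    """
    spans = {}
    for actor, year in roles:
        first, last = spans.get(actor, (year, year))
        spans[actor] = (min(first, year), max(last, year))
    return spans


def is_candidate(spans, actor, year):
    """True if `year` falls inside the actor's comedy lifespan."""
    if actor not in spans:
        # Actors with no comedy roles never enter the dataset.
        return False
    first, last = spans[actor]
    return first <= year <= last


# The Peter Falk example from the text: first comedy 1961, last 2005.
roles = [
    ("Falk, Peter", 1961),  # first comedy role
    ("Falk, Peter", 2005),  # last comedy role
]
spans = comedy_lifespans(roles)
```

With this data, is_candidate(spans, "Falk, Peter", 1984) is true, while 1957 (before his first comedy) is not, even though he was already acting by then.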

The IMDbPY database is more complicated than I like to deal with, so my strategy was to use SQL to get a big table of roles and then process it with Python. Here's SQL to get every major role in a comedy that has more than 1000 votes on IMDB:

select title.title, title.production_year, movie_info_idx.info,
       name.name, name.gender, cast_info.nr_order, kind_id
from title
  join cast_info on title.id=cast_info.movie_id
  join name on cast_info.person_id=name.id
  join movie_info_idx on movie_info_idx.movie_id=title.id
  join movie_info on movie_info.movie_id=title.id
where cast_info.role_id in (1,2)
  and kind_id in (1,3,4)
  and movie_info.info_type_id=3
  and movie_info.info='Comedy'
  and cast(movie_info_idx.info as integer) > 1000
  and movie_info_idx.info_type_id=100
  and cast_info.nr_order <= 7;

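Some explanation of numbers and IDs: cast_info.role_id in (1,2) selects acting credits (actors and actresses), as opposed to the director credits picked up later with role_id 8. kind_id in (1,3,4) keeps movie-type titles and leaves out the TV series and episode kinds used in the television query below. movie_info.info_type_id=3 with movie_info.info='Comedy' filters on genre, movie_info_idx.info_type_id=100 is the vote count, and cast_info.nr_order is the billing position.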

I run this on a SQLite database and the output looks like:

#1 Cheerleader Camp|2010|2297|Cassell, Seth|m|2|4
...

So the title of the movie is "#1 Cheerleader Camp", it came out in 2010, it has 2297 votes, and Seth Cassell (a man) was an actor in that movie and got second billing. The final 4 is the movie's kind_id.
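Here's one way the Python side could read those rows, as a sketch rather than the code the bot actually uses. It assumes the default pipe-separated output of the sqlite3 command-line tool and takes the field order from the select list (title, year, votes, name, gender, nr_order, kind_id); the Role type and parse_role function are illustrative names.

```python
from typing import NamedTuple, Optional


class Role(NamedTuple):
    title: str
    year: Optional[int]
    votes: int
    name: str
    gender: str
    billing: int   # cast_info.nr_order
    kind_id: int   # title.kind_id


def parse_role(line):
    """Parse one pipe-separated row of the movie-role query output."""
    # Split from the right so a "|" inside a title can't shift the columns.
    title, year, votes, name, gender, billing, kind_id = (
        line.rstrip("\n").rsplit("|", 6)
    )
    return Role(
        title,
        int(year) if year else None,  # production_year may be NULL
        int(votes),
        name,
        gender,
        int(billing),
        int(kind_id),
    )


role = parse_role("#1 Cheerleader Camp|2010|2297|Cassell, Seth|m|2|4")
```

Splitting from the right matters because titles are free text, while the six trailing fields are not expected to contain a pipe character.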

Why didn't I include television in this query? Because television on IMDB is really complicated. See, actors aren't credited to television shows; they're credited to individual episodes. But nobody rates individual episodes; they rate the show as a whole. So I had to do a separate query to determine who the top actors were on each comedy television show, and then divide up that show's votes between the four top actors. Otherwise actors whose primary comedy career is in television won't get their due.

Here's SQL to get all the roles in TV episodes:

select tv_show.title, episode.title, episode.production_year, votes.info,
       name.name, name.gender, cast_info.nr_order
from title as tv_show
  join title as episode on tv_show.id=episode.episode_of_id
  join cast_info on episode.id=cast_info.movie_id
  join name on cast_info.person_id=name.id
  join movie_info_idx as votes on votes.movie_id=tv_show.id
  join movie_info on movie_info.movie_id=tv_show.id
where cast_info.role_id in (1,2)
  and tv_show.kind_id in (2,5)
  and episode.kind_id=7
  and movie_info.info_type_id=3
  and movie_info.info='Comedy'
  and cast(votes.info as integer) > 10000
  and votes.info_type_id=100
  and cast_info.nr_order < 5;

This is pretty similar to the last query but some of the IDs are different.

I run this and the output looks like:

'Allo 'Allo!|A Bun in the Oven|1991|14022|Kaye, Gorden|m|1
...

This means there's an 'Allo 'Allo! episode called "A Bun in the Oven", the episode came out in 1991, 'Allo 'Allo! as a whole (NOT this specific episode) has 14,022 votes, and Gorden Kaye got top billing for this episode.

I got this data out of a database as quickly as possible and bashed at it to make a TV show look like a movie with four actors--the four actors who appeared in the most episodes of the TV show.
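That bashing might look something like the following sketch. It isn't the original code. It assumes the episode rows have been reduced to (show, show_votes, actor) tuples, and it assumes the show's votes are split equally among the four actors, which the post doesn't specify. shows_as_movies is a made-up name and the sample data is invented.

```python
from collections import Counter, defaultdict


def shows_as_movies(episode_roles, cast_size=4):
    """Collapse per-episode credits into one four-actor 'movie' per show.

    `episode_roles` is an iterable of (show, show_votes, actor) tuples,
    one per credited episode appearance.
    Returns {show: [(actor, billing, vote_share), ...]}.
    """
    appearances = defaultdict(Counter)
    votes = {}
    for show, show_votes, actor in episode_roles:
        appearances[show][actor] += 1
        votes[show] = show_votes

    movies = {}
    for show, counts in appearances.items():
        # The actors credited in the most episodes become the "cast".
        top = [actor for actor, _ in counts.most_common(cast_size)]
        # Assumption: an equal split of the show's votes.
        share = votes[show] / len(top)
        movies[show] = [
            (actor, billing, share)
            for billing, actor in enumerate(top, start=1)
        ]
    return movies


# Invented data: actor "A" appears in 5 episodes, "B" in 4, and so on.
episode_counts = [("A", 5), ("B", 4), ("C", 3), ("D", 2), ("E", 1)]
rows = [
    ("Example Show", 12000, actor)
    for actor, n in episode_counts
    for _ in range(n)
]
movies = shows_as_movies(rows)
```

Here "E" drops out, and the remaining four each get a quarter of the show's 12,000 votes, billed in order of episode count.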

Directors were pretty similar to film actors. For each director who's ever worked in comedy, I measured their tendency towards putting out a popular comedy in any given year. There's a very strong power law here, with a few modern directors overshadowing their contemporaries, and Charlie Chaplin completely obliterating all his contemporaries.

Here's SQL to get all comedies with their directors:

select title.title, title.production_year, movie_info_idx.info,
       name.name, name.gender
from title
  join cast_info on title.id=cast_info.movie_id
  join name on cast_info.person_id=name.id
  join movie_info_idx on movie_info_idx.movie_id=title.id
  join movie_info on movie_info.movie_id=title.id
where cast_info.role_id in (8)
  and kind_id in (1,3,4)
  and movie_info.info_type_id=3
  and movie_info.info='Comedy'
  and cast(movie_info_idx.info as integer) > 5000
  and movie_info_idx.info_type_id=100;

The only new number here is cast_info.role_id in (8), which means I'm now picking up directors instead of actors.

At this point I was done with the SQL database. I wrote the "Ghostbusters casting office". It chooses a year, picks a cast and a director for that year, and then (15% of the time) it picks a custom title. My stupidly hilarious technique for custom titles is to choose the name of an actual comedy from the given year and replace one of the nouns with "Ghost" or "Ghostbuster". So far this has led to films like "Don't Drink the Ghost" and (I swear this happened during testing) "Ghostbuster Dad".
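The custom-title trick can be sketched like this. It's a guess at the shape of the code, not the real thing. It assumes you hand it a set of words to treat as nouns (a real bot would need a part-of-speech tagger or a word list), and ghostify and pick_title are hypothetical names.

```python
import random


def ghostify(title, nouns, rng=random):
    """Replace one noun in `title` with "Ghost" or "Ghostbuster".

    `nouns` is a set of lowercase words to treat as nouns.
    Returns None if the title contains none of them.
    """
    words = title.split()
    positions = [
        i for i, word in enumerate(words)
        if word.strip(".,!?'\"").lower() in nouns
    ]
    if not positions:
        return None
    i = rng.choice(positions)
    # Simplification: punctuation attached to the replaced word is lost.
    words[i] = rng.choice(["Ghost", "Ghostbuster"])
    return " ".join(words)


def pick_title(real_titles, nouns, rng=random, custom_rate=0.15):
    """Return a custom title about 15% of the time, otherwise None."""
    if rng.random() >= custom_rate:
        return None
    return ghostify(rng.choice(real_titles), nouns, rng)
```

For example, ghostify("Don't Drink the Water", {"water"}) gives either "Don't Drink the Ghost" or "Don't Drink the Ghostbuster".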

Here's how I pick a cast for a given year: I line up all the actors for that year by my calculated variable "tendency towards being a Ghostbuster", and then I use random.expovariate to choose from different places near the front of the list (to bias the output towards actors you won't have to look up). This is the same trick I use for Serial Entrepreneur to choose common (but not too common) adjectives and nouns for its inventions. My means are 0.85, 0.8, 0.75, and 0.7, which will, on average, give me someone who's at the 85th percentile, someone at the 80th percentile, 75th percentile and 70th percentile.
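Here's a sketch of one way to get that behavior out of random.expovariate. The mapping is an assumption, not necessarily what the bot does: for a target percentile p, it draws a "distance from the top of the list" with mean 1 - p, redraws anything that would fall off the end, and turns the distance into a list index. pick_near_top and pick_cast are illustrative names, and the list needs at least four distinct entries.

```python
import random


def pick_near_top(ranked, mean_percentile, rng=random):
    """Pick from `ranked` (best first), centered near `mean_percentile`."""
    mean_distance = 1.0 - mean_percentile
    while True:
        # expovariate takes lambda, which is 1 / (desired mean).
        distance = rng.expovariate(1.0 / mean_distance)
        if distance < 1.0:  # redraw anything past the end of the list
            index = min(int(distance * len(ranked)), len(ranked) - 1)
            return ranked[index]


def pick_cast(ranked, means=(0.85, 0.8, 0.75, 0.7), rng=random):
    """Pick one distinct actor per target percentile."""
    cast = []
    for mean in means:
        choice = pick_near_top(ranked, mean, rng)
        while choice in cast:  # no actor gets cast twice
            choice = pick_near_top(ranked, mean, rng)
        cast.append(choice)
    return cast
```

Because the exponential distribution has a long tail, most picks land close to the front of the list, but every so often someone from much further down gets the part.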

This is the best I could do to recreate the dynamic of 1984 Ghostbusters, where Bill Murray and Dan Aykroyd were very well-known actors even before Ghostbusters, while Ernie Hudson and Harold Ramis were not. At this point you might object that Ernie Hudson and Harold Ramis weren't even 75th or 70th percentile. Ghostbusters was Ramis's second movie ever as an actor; I think there was an oral history that said he gave himself the part of Egon Spengler because no one else was a big enough dork. So for pure accuracy I should be doing, like, 0.90/0.85/0.35/0.30. But that gives you way too many obscure actors and the output isn't as fun. It also doesn't feel accurate, because 1984 Ghostbusters was a real movie, and all by itself it made Hudson and Ramis pretty famous actors. So now we expect "Ghostbuster" to be sort of a prestige comedy role.

A more valid point is that 0.85/0.8/0.75/0.7 also doesn't really capture the dynamic of the 2016 Ghostbusters, where all four actors are well-known but Kristen Wiig has twice the credits of the other three. So I also created an 0.85/0.8/0.8/0.75 mode, which will tend to give you more big-name ensembles.

As always, there's a lot of behind-the-scenes data munging. Going from a bunch of "xth billing in movie with y votes" entries to a single "tendency towards being a Ghostbuster" number required a lot of semi-arbitrary decisions, and I think my algorithm still undercounts television actors. Whenever there was a power law, I smoothed it out a little to increase the variety of the output. I smoothed out the overrepresentation of post-IMDB comedies compared to pre-IMDB comedies; of superstar directors like Chaplin who overshadow everyone else in their time; and of men directors vastly outnumbering women.
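To give a flavor of those semi-arbitrary decisions, here's a sketch of one possible scoring scheme. Everything in it is an assumption made for illustration, not the bot's actual formula: each role contributes votes divided by billing position, scores add up per actor per year, and the power law is smoothed by raising the total to a power below 1.

```python
from collections import defaultdict


def tendency_scores(roles, smoothing=0.5):
    """Score each (actor, year) from roles, then flatten the power law.

    `roles` is an iterable of (actor, year, votes, billing) tuples.
    Assumption: a role is worth votes / billing, so top billing in a
    popular comedy counts the most.
    """
    raw = defaultdict(float)
    for actor, year, votes, billing in roles:
        raw[(actor, year)] += votes / billing
    # An exponent below 1 shrinks the gap between superstars and
    # everyone else without changing anyone's rank.
    return {key: value ** smoothing for key, value in raw.items()}


scores = tendency_scores([
    ("Star", 1990, 1000000, 1),
    ("Journeyman", 1990, 10000, 1),
])
```

With a smoothing exponent of 0.5, a raw 100-to-1 gap between the two actors shrinks to about 10-to-1, which lets less famous people show up more often while keeping the ordering intact.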

Representation of women comedic actors vs. men was not an issue because I followed the lead of the Ghostbusters remake. 45% of the ghostbusting teams are all women, and 45% are all men. (10% of teamups are coed, just to add variety.) There's no code that makes sure all the actors speak the same language or anything like that—I could extract that data from IMDB but it would be a lot of work to make the output of the bot less interesting.
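The 45/45/10 split is simple to sketch. This version assumes a coed team means at least one woman and one man, with each slot otherwise chosen independently, which the post doesn't specify. pick_team_genders is a made-up name, and "f" and "m" match the gender column in the query output.

```python
import random


def pick_team_genders(rng=random, size=4):
    """Return a list of "f"/"m" markers, one per Ghostbuster."""
    roll = rng.random()
    if roll < 0.45:
        return ["f"] * size   # 45%: all women
    if roll < 0.90:
        return ["m"] * size   # 45%: all men
    # Remaining 10%: coed. Redraw until both genders are present.
    while True:
        team = [rng.choice("fm") for _ in range(size)]
        if len(set(team)) == 2:
            return team
```

Each slot's marker then decides which ranked list (women or men) that slot's actor is drawn from.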

And there you go. It's not source code, but you should be able to see more or less how I took this bot from concept to execution, and how I negotiated the tricky space between "this is an accurate representation of what would happen in an alternate universe where the primary cinematic comedy genre is films about busting ghosts" and "this is a fun output for this bot to have."


Unless otherwise noted, all content licensed by Leonard Richardson
under a Creative Commons License.