
September Film Roundup: Didn't see a lot of movies this month, so I'm going to add a new mini-feature that will run for the next few months. I'll be briefly reviewing some TV shows where, although I haven't seen (and may never see) absolutely every episode, I feel like I can evaluate the show as a whole. But first, our feature presentations:

And now the TV section. Obviously my technique of waiting until I can evaluate the show as a whole creates a selection bias towards good television shows. I'll sit through a bad movie and then pan it in Film Roundup, but a bad TV show is outta here, especially since I watch movies on my own but I only watch TV with Sumana. But what's the problem with talking about good TV? Try this on for size:

(Before you ask, Religious Huckster Trick #1 is "God told me to tell you to give me money.")

To Stop Disturbance: I was reading to Sumana the most interesting bits from Washington Goes To War, a book by David Brinkley about the changes to Washington D.C. over the course of World War II. It's full of interesting historical tidbits, including:

But the thing Sumana wanted me to record verbatim was the policy that Washington D.C.'s Casino Royal put into place for dealing with the inevitable fistfights between soldiers and sailors. "Night after night," these inter-service resentments boiled over, and so the Casino Royal wrote down these rules and posted them "on a wall backstage under the heading TO STOP DISTURBANCE."

  1. Lower the house lights
  2. Turn the spotlight on a large American flag hanging from the ceiling
  3. Start up an electric fan aimed at the flag, causing it to flutter
  4. Have the band instantly stop playing dance music and strike up "The Star-Spangled Banner".
  5. Call in the military police and the navy's shore patrol
It always worked. The soldiers and sailors stopped swinging at each other, faced the flag and stood at attention while the band played. There was no way a uniformed military man in wartime could refuse to do this, however angry he was. Before the anthem was finished, the military police and the shore patrol were walking up the steps from Fourteenth Street.

The one that really gets me is #3. I can see how this behavior would be drilled into you as a reflex action, but #3 makes it feel like they're trying to inspire you, remind you what you're fightin' for. And then the MPs show up.

Recently I gave a talk called "The Enterprise Media Distribution Platform At The End Of This Book". It summarizes my first eighteen months on the Library Simplified project at NYPL Labs. The goal of Library Simplified is to make it as easy to check ebooks out from a public library as it is to buy them from Amazon.

We've just secured a multi-year grant to expand the project, and we are hiring up from two developers to eight. We are quadrupling the size of our development team.

This is a really satisfying job for me because I'm making life substantially better for people who aren't already well off. If you like that prospect, if you like what I say in the "Enterprise Media Distribution" talk, and you want to work on this project, you should apply for one of these positions by sending your resume to info@librarysimplified.com.

I'm going to link to the job listings in a minute, but first I want to make it real clear that we put up these listings largely to have entry points into the HR system. As the team lead I'm not concerned with counting how many terms on your resume match terms used in the job listing. We need two Android developers and four people to write server-side code and HTML and JavaScript. I don't think we need a team made up entirely of Senior Developers. Other skills might be more important.

For instance, we need someone with devops experience. We'll be dealing with e-commerce, cryptography, and machine learning—all things I know little about. We don't care if you have a CS degree, but if you have a Library Science degree or have worked in the publishing industry, that would be useful. We have big collections in Spanish, Chinese and Russian, but nobody on our team reads those languages. Stuff like that.

With that in mind, here are the job listings:

As you can see if you click around, getting into the HR system to formally "apply" for these jobs requires filling out a really long form. (Update: and now these links don't even work anymore because the jobs got shifted around.) Instead of doing that, send your resume to info@librarysimplified.com and we'll only ask you to fill out the form if we want to bring you in for an interview.

All these positions are in New York City, in the big building on 42nd Street with the lions. This is a project funded by grants, and the salaries we offer are not competitive with Facebook or Goldman Sachs, but they are competitive with other nonprofits. The benefits are good. This is not a job that ruins your life. It's 35 hours a week and you get four weeks of vacation per year. I work from home about one day a week. Send me email or leave a comment if you have any questions about benefits.

Auditioning: Sampling a Dataset to Maximize Diversity: My latest bot is Roller Derby Names, which takes its data from a list of about 40,000 distinct names chosen by roller derby participants. 40,000 is a lot of names, and although a randomly selected name is likely to be hilarious, if you look at a bunch of them they can get kind of repetitive. My challenge was to cut it down to a maximally distinctive subset of names. I used a simple technique I call 'auditioning' (couldn't find a preexisting name for it) which I first used with Minecraft Signs:

  1. Shuffle the list.
  2. Create a counter of words seen.
  3. For each string in the list:
    1. Split the string into words.
    2. Assume the string is not distinctive.
    3. For each word in the string:
      1. If this word has been seen fewer than n times, the string is distinctive.
      2. Increment the counter for this word.
    4. If the string is distinctive, output it.
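
The steps above can be sketched in a few lines of Python. This is one reading of the algorithm, not the actual code behind either bot: the function name `audition`, the `seed` parameter, splitting on whitespace, and lowercasing words before counting them are all assumptions made for illustration.

```python
import random
from collections import Counter


def audition(strings, n=1, seed=None):
    """Return the strings that contain at least one word seen fewer than n times so far."""
    rng = random.Random(seed)      # seed is only here to make runs repeatable
    shuffled = list(strings)
    rng.shuffle(shuffled)          # 1. Shuffle the list.
    seen = Counter()               # 2. Counter of words seen.
    output = []
    for s in shuffled:             # 3. For each string in the list:
        distinctive = False        #    assume it is not distinctive
        for word in s.lower().split():
            if seen[word] < n:     #    one rare word is enough to pass
                distinctive = True
            seen[word] += 1        #    every word counts, pass or fail
        if distinctive:
            output.append(s)       # 4. Output the distinctive strings.
    return output


# Ten identical signs auditioned with n=5: only the first five get through.
signs = audition(["White Wool"] * 10, n=5)
print(len(signs))  # prints 5
```

Because the counter is updated whether or not a string passes, the result depends only on the shuffle order and on n.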

My mental idea of this process is that each string is auditioning before the talent agent from the classic Chuck Jones cartoon One Froggy Evening. One word at a time, the string tries to impress the talent agent, but the agent has seen it all before. In fact, the agent has seen it all n times before! But then comes that magical word that the agent has seen only n-1 times. Huzzah! The string passes its audition. But the next string is going to have a tougher time, because with each successful audition the agent becomes more jaded.

You don't have to worry about stopwords because the string only needs one rare word to pass its audition. By varying n you can get a smaller or larger output set. For Minecraft Signs I set n=5, which gave a wide variety of signs while eliminating the ones that say "White Wool". For Roller Derby Names I decided on n=1.

Here's the size of the Roller Derby Names dataset, n-auditioned for varying values of n:
n                    Dataset size
∞ (original data)    40198
100                  40191
50                   40089
10                   37860
6                    36104
5                    35307
4                    34203
3                    32751
2                    30387
1                    25710

Auditioning the Roller Derby Names with n=50 excludes only the most generic sounding names: "Crash Baby", "Bad Lady", "Queen Bitch", etc. Setting n=1 restricts the dataset to the most distinctive names, like "Battlestar Kick Asstica" and "Collideascope". But it still includes over half the dataset. There's not really a lot of difference between n=10 and n=4, it's just, how many names do you want in the corpus.

I want to note that this is not a technique for picking out the 'good' items. It's a technique for maximizing diversity or distinctiveness. You can say that a name excluded by a lower value of n is more distinctive, but for a given value of n it can be totally random whether or not a name makes the cut. "Angry Beaver" made it into the final corpus and "Captain Beaver" didn't. As "beaver" jokes go, I'd say they're about the same quality. When the algorithm encountered "Captain Beaver", it had already seen "captain" and "beaver". If the list had been shuffled differently, the string "Captain Beaver" would have nailed its audition and "Angry Beaver" would be a has-been. That's show biz. This technique also magnifies the frequency of misspellings, as anyone who follows Minecraft Signs knows.
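
To see the order dependence concretely, here is a small sketch that skips the shuffle and auditions a list in the order given, with n=1. The helper name `audition_in_order` is made up for this example, and "Captain Hook" and "Angry Bird" are invented filler names (not from the dataset) that exist only so "captain" and "angry" have already been seen.

```python
from collections import Counter


def audition_in_order(strings, n=1):
    """Audition strings in the order given (no shuffle), so the outcome is repeatable."""
    seen = Counter()
    passed = []
    for s in strings:
        words = s.lower().split()
        # A string passes if any of its words has been seen fewer than n times.
        if any(seen[w] < n for w in words):
            passed.append(s)
        seen.update(words)  # count every word, whether or not the string passed
    return passed


order_a = ["Captain Hook", "Angry Bird", "Angry Beaver", "Captain Beaver"]
order_b = ["Captain Hook", "Angry Bird", "Captain Beaver", "Angry Beaver"]

print(audition_in_order(order_a))  # ['Captain Hook', 'Angry Bird', 'Angry Beaver']
print(audition_in_order(order_b))  # ['Captain Hook', 'Angry Bird', 'Captain Beaver']
```

Same four names, same n, and whichever "Beaver" shows up first gets the part.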

Also note that "Dirty Mary" is excluded by n=50. It's not the greatest name but it is a legitimate pun, so in terms of quality it should have made the corpus, but "Dirty" and "Mary" are both very common name components, so it didn't pass.

PS: Boat Name Bot (Roller Derby Names's sister bot) does not use this technique. There's no requirement that a boat name be unique, and TBH most boat-namers aren't terribly creative. Picking boat names that have only been used once (and are not names for human beings) cuts the dataset down plenty.

Bot Techniques: The Wandering Monster Table: In preparation for the talk I'm giving Friday at Allison's unofficial Bot Summit, I'm writing little essays explaining some of the techniques I've used in bots. Today: the Wandering Monster Table!

In D&D, the Wandering Monster Table is a big situation-specific table that makes it possible for you, the Dungeon Master, to derail your carefully planned campaign on a random mishap. You roll the dice and a monster just kind of shows up and has to be dealt with. There are different tables for different scenarios and different biomes, but they're generally based on this probability distribution (from AD&D 1st Edition):

  Common: 65%
  Uncommon: 20%
  Rare: 11%
  Very rare: 4%

This doesn't mean you're going to run into Ygorl (Lord of Entropy) once every twenty-five adventures. There are a ton of Very Rare monsters, and Ygorl is just one chaos lord. He can't be everywhere. What this means is that most of the time the PCs are going to experience normal, boring wandering monsters. Die rolls form a normal distribution, and 68% (~65%) of die rolls will fall within one standard deviation of the mean. Those are your common monsters.

Go out two standard deviations (95%, ~65%+20%+11%) and things might get a little hairy for the PCs. Go out three standard deviations (99.7%, ~65%+20%+11%+4%) and you're looking at something really weird that even the Dungeon Master didn't really plan for. But what, exactly? That depends on the situation, and it may require another dice roll.
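
As a rough sketch of how those percentages turn into a table lookup, here is a percentile (d100) roll mapped onto the four tiers. The exact boundaries (1-65, 66-85, 86-96, 97-100) are an assumption derived from the 65/20/11/4 split, and the function names are made up for this example.

```python
import random

# Upper bound of each tier on a d100, derived from the 65/20/11/4 split.
THRESHOLDS = [(65, "common"), (85, "uncommon"), (96, "rare"), (100, "very rare")]


def tier_for_roll(roll):
    """Map a d100 roll (1-100) to a rarity tier."""
    if not 1 <= roll <= 100:
        raise ValueError("a d100 roll must be between 1 and 100")
    for upper, tier in THRESHOLDS:
        if roll <= upper:
            return tier


def roll_tier(rng=random):
    """Roll a d100 and report which tier came up."""
    return tier_for_roll(rng.randint(1, 100))


print(tier_for_roll(42))   # common
print(tier_for_roll(97))   # very rare
```

Once you know the tier, picking the specific monster is the second roll.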

The WMT is a really good abstraction for creating variety. I use it in my bots all the time. Here's a sample of the WMT for Serial Entrepreneur:

    common = [
        "%(product)s",
        "%(product)s!",
        "%(product)s...\n%(variant)s...",
        "%(product)s? %(variant)s?",
        ...
    ]
    uncommon = [
        "%(product)s... %(variant)s...? Just throwing some ideas around.",
        "%(product)s... or maybe %(variant)s...",
        "%(product)s or %(variant)s?",
        "Eureka! %(product)s!",
        ...
    ]
    rare = [
        "I don't think I'll ever be happy with my %(product)s...",
        "Got a meeting with some VCs to pitch my %(product)s!",
        "I'm afraid that my new %(product)s is cannibalizing sales of my %(variant)s.",
        "The %(product)s flopped in my %(state)s test market... back to the drawing board.",
        ...
    ]
    very_rare = [
        "Am I to be remembered as the inventor of the %(product)s?",
        "Sometimes I think about Edison's famous %(product)s and I wonder... can my %(product2)s compare?",
        "I haven't sold a single %(product)s...",
        "I hear %(billionaire)s is working on %(a_product)s...",
        ...
    ]

This creates a personality that most of the time just mutters project ideas to itself, but sometimes (uncommonly) gets a little more verbose, or (rarely) talks about where it is in the product development process, or (very rarely) compares itself to other inventors. The 'common' bucket contains nine entries which are slight variants; the 'rare' bucket contains 32 entries which are worded very differently.
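
Here is a minimal sketch of how a table like this might be consulted. It is not the olipy implementation or the bot's real code: the class name, the `choice` method, and the 65/20/11/4 weights (borrowed from the D&D percentages) are assumptions, and the "solar toaster" values are invented sample data.

```python
import random

# Relative weight of each tier; assumed to follow the D&D-style percentages.
TIER_WEIGHTS = {"common": 65, "uncommon": 20, "rare": 11, "very_rare": 4}


class WanderingMonsterTable:
    """Pick a tier by weight, then pick uniformly within that tier."""

    def __init__(self, common, uncommon, rare, very_rare, rng=None):
        self.tiers = {
            "common": list(common),
            "uncommon": list(uncommon),
            "rare": list(rare),
            "very_rare": list(very_rare),
        }
        self.rng = rng or random.Random()

    def choice(self):
        names = list(TIER_WEIGHTS)
        weights = [TIER_WEIGHTS[name] for name in names]
        tier = self.rng.choices(names, weights=weights, k=1)[0]
        return self.rng.choice(self.tiers[tier])  # assumes no tier is empty


table = WanderingMonsterTable(
    common=["%(product)s", "%(product)s!"],
    uncommon=["Eureka! %(product)s!"],
    rare=["Got a meeting with some VCs to pitch my %(product)s!"],
    very_rare=["I haven't sold a single %(product)s..."],
)
# Fill the chosen template with (made-up) values.
print(table.choice() % {"product": "solar toaster", "variant": "wind toaster"})
```

Adding a new construct is just a matter of appending a string to whichever tier suits how often you want to see it.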

The WMT works the same way in Smooth Unicode and Euphemism Bot. All these bots have their standbys: common constructs they return to over and over. Then they have three more tiers of constructs where the result is aesthetically riskier, or the joke is less likely to land, or a little of that construct goes a long way.

I also use the WMT in A Dull Bot to a more subtle purpose. Each tweet contains a random number of typos, and each typo is chosen from a WMT. One of the common typos is to transpose two letters. A very rare typo is to uppercase one word while leaving the rest of the sentence alone.

The WMT fixes one of the common aesthetic problems with bots, where every output is randomly generated but it gets dull quickly because the presentation is always the same. Since you can always dump more stuff into a WMT, it's an easy way to keep your bot's output fresh. In particular, whenever I get an idea like emoji mosaics, I can add it to Smooth Unicode's WMT instead of creating a whole new bot.

There's a Python implementation of a Wandering Monster Table in olipy.



Unless otherwise noted, all content licensed by Leonard Richardson
under a Creative Commons License.