# News You Can Bruise for 2015October16 (entry 0)

Previous
Bot Techniques: The Wandering Monster Table

Auditioning: Sampling a Dataset to Maximize Diversity: My latest bot is Roller Derby Names, which takes its data from a list of about 40,000 distinct names chosen by roller derby participants. 40,000 is a lot of names, and although a randomly selected name is likely to be hilarious, if you look at a bunch of them they can get kind of repetitive. My challenge was to cut it down to a maximally distinctive subset of names. I used a simple technique I call 'auditioning' (couldn't find a preexisting name for it) which I first used with Minecraft Signs:

1. Shuffle the list.
2. Create a counter of words seen
3. For each string in the list:
1. Split the string into words.
2. Assume the string is not distinctive.
3. For each word in the string:
1. If this word has been seen fewer than n times, the string is distinctive.
2. Increment the counter for this word.
4. If the string is distinctive, output it.

My mental idea of this process is that each string is auditioning before the talent agent from the classic Chuck Jones cartoon One Froggy Evening. One word at a time, the string tries to impress the talent agent, but the agent has seen it all before. In fact, the agent has seen it all n times before! But then comes that magical word that the agent has seen only n-1 times. Huzzah! The string passes its audition. But the next string is going to have a tougher time, because with each successful audition the agent becomes more jaded.

You don't have to worry about stopwords because the string only needs one rare word to pass its audition. By varying n you can get a smaller or larger output set. For Minecraft Signs I set n=5, which gave a wide variety of signs while eliminating the ones that say "White Wool". For Roller Derby Names I decided on n=1.

Here's the size of the Roller Derby Names dataset, n-auditioned for varying values of n:
nDataset size
∞ (original data)40198
10040191
5040089
1037860
636104
535307
434203
332751
230387
125710

Auditioning the Roller Derby Names with n=50 excludes only the most generic sounding names: "Crash Baby", "Bad Lady", "Queen Bitch", etc. Setting n=1 restricts the dataset to the most distinctive names, like "Battlestar Kick Asstica" and "Collideascope". But it still includes over half the dataset. There's not really a lot of difference between n=10 and n=4, it's just, how many names do you want in the corpus.

I want to note that this is this is not a technique for picking out the 'good' items. It's a technique for maximizing diversity or distinctiveness. You can say that a name excluded by a lower value of n is more distinctive, but for a given value of n it can be totally random whether or not a name makes the cut. "Angry Beaver" made it into the final corpus and "Captain Beaver" didn't. As "beaver" jokes go, I'd say they're about the same quality. When the algorithm encountered "Captain Beaver", it had already seen "captain" and "beaver". If the list had been shuffled differently, the string "Captain Beaver" would have nailed its audition and "Angry Beaver" would be a has-been. That's show biz. This technique also magnifies the frequency of misspellings, as anyone who follows Minecraft Signs knows.

Also note that "Dirty Mary" is excluded by n=50. It's not the greatest name but it is a legitimate pun, so in terms of quality it should have made the corpus, but "Dirty" and "Mary" are both very common name components, so it didn't pass.

PS: Boat Name Bot (Roller Derby Names's sister bot) does not use this technique. There's no requirement that a boat name be unique, and TBH most boat-namers aren't terribly creative. Picking boat names that have only been used once (and are not names for human beings) cuts the dataset down plenty.

Filed under: