November Film Roundup: What a month! Mainly due to a huge film festival, but I also got another chance to see my favorite film of all time on the big screen. What might that film be? Clearly you haven't been reading my weblog for the past fifteen years.

@pony_strategies: My new bot, @pony_strategies, is the most sophisticated one I've ever created. It is the @horse_ebooks spambot from the Constellation Games universe.

Unlike @horse_ebooks, @pony_strategies will not abruptly stop publishing fun stuff, or turn out to be a cheesy tie-in trying to get you interested in some other project. It is a cheesy tie-in to some other project (Constellation Games), but you go into the relationship knowing this fact, and the connection is very subtle.

When explaining this project to people as I worked on it, I was astounded that many of them didn't know what @horse_ebooks was. But that just proves I inhabit a bubble in which fakey software has outsized significance. So a brief introduction:

@horse_ebooks was a spambot created by a Russian named Alexei Kouznetsov. It posted Twitter ads for crappy ebooks, some of which (but not all, or even most) were about horses. Its major innovative feature was its text generation algorithm for the things it would say between ads.

Are you ready? The amazing algorithm was this: @horse_ebooks ripped strings more or less randomly from the crappy ebooks it was selling and presented them with absolutely no context.

Trust me, this is groundbreaking. I'm sure this technique had been tried before, but @horse_ebooks was the first to make it popular. And it's great! Truncating a sentence in the right place generates some pretty funny stuff. Here are four consecutive @horse_ebooks tweets:

There was a tribute comic and everything.

I say @horse_ebooks "was" a spambot because in 2011 the Twitter account was acquired by two Americans, Jacob Bakkila and Thomas Bender, who took it over and started running it not to sell crappy ebooks, but to promote their Alternate Reality Game. This fact was revealed back in September 2013, and once the men behind the mask were revealed, @horse_ebooks stopped posting.

The whole conceit of @horse_ebooks was that there was no active creative process, just a dumb algorithm. But in reality Bakkila was "impersonating" the original algorithm—most likely curating its output so that you only saw the good stuff. No one likes to be played for a sucker, and when the true purpose of @horse_ebooks was revealed, folks felt betrayed.

As it happens, the question of whether it's artistically valid to curate the output of an algorithm is a major bone of contention in the ongoing Vorticism/Futurism-esque feud between Adam Parrish and myself. He is dead set against it; I think it makes sense if you are using an algorithm as the input into another creative process, or if your sole object is to entertain. We both agree that it's a little sketchy if you have 200,000 fans whose fandom is predicated on the belief that they're reading the raw output of an algorithm. On the other hand, if you follow an ebook spammer on Twitter, you get up with fleas. I think that's how the saying goes.

In any event, the fan comics ceased when @horse_ebooks did. There was a lot of chin-stroking and art-denial and in general the reaction was strongly negative. But that's not the end of the story.

You see, the death of @horse_ebooks led to an outpouring of imitation *_ebooks bots on various topics. (This had been happening before, actually.) As these bots were announced, I swore silent vengeance on each and every one of them. Why? Because those bots didn't use the awesome @horse_ebooks algorithm! Most of them used Markov chains, that most hated technique, to generate their text. It was as if the @horse_ebooks algorithm itself had been discredited by the revelation that two guys from New York were manually curating its output. (Confused reports that those guys had "written" the @horse_ebooks tweets didn't help matters--they implied that there was no algorithm at all and that the text was original.)

But there was hope. A single bot escaped my pronouncements of vengeance: Adam's excellent @zzt_ebooks. That is a great bot which you should follow, and it uses an approximation of the real @horse_ebooks algorithm:

  1. The corpus is word-wrapped at 35 characters per line.
  2. Pick a line to use as the first part of a tweet.
  3. If (random), append the next line onto the current line.
  4. Repeat until (random) is false or the line is as large as a tweet can get.
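
The four steps above can be sketched in Python roughly as follows. This is an illustration only, not the code behind @zzt_ebooks or olipy: the function name `ebooks_quote`, the 50% coin flip, and the 140-character limit are all assumptions made for the example.

```python
import random
import textwrap


def ebooks_quote(corpus, rng=random, wrap_width=35, max_length=140, p_continue=0.5):
    """Build one quote by gluing together consecutive word-wrapped lines.

    p_continue and max_length are assumed values, not taken from any real bot.
    """
    # Step 1: word-wrap the corpus at 35 characters per line.
    lines = textwrap.wrap(corpus, width=wrap_width)
    if not lines:
        return ""
    # Step 2: pick a line to use as the first part of the quote.
    index = rng.randrange(len(lines))
    quote = lines[index]
    # Steps 3 and 4: keep appending the next line while the coin flip
    # comes up true and the result still fits in a tweet.
    while index + 1 < len(lines) and rng.random() < p_continue:
        candidate = quote + " " + lines[index + 1]
        if len(candidate) > max_length:
            break
        quote = candidate
        index += 1
    return quote
```

Because the line breaks fall wherever the word wrap puts them, quotes tend to start and stop mid-sentence, which is where the humor comes from.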

And here are four consecutive quotes from @zzt_ebooks:

Works great.

The ultimate genesis of @pony_strategies was this conversation I had with Adam about @zzt_ebooks. Recently my anger with *_ebooks bots reached the point where I decided to add a real *_ebooks algorithm to olipy to encourage people to use it. Of course I'd need a demo bot to show off the algorithm...

The @pony_strategies bot has sixty years' worth of content loaded into it. I extracted the content from the same Project Gutenberg DVD I used to revive @everybrendan. There's a lot more where that came from--I ended up choosing about 0.0001% of the possibilities found in the DVD.

I have not manually curated the PG quotes and I have no idea what the bot is about to post. But the dataset is the result of a lot of algorithmic curation. I focused on technical books, science books and cookbooks--the closest PG equivalents to the crap that @horse_ebooks was selling. I applied a language filter to get rid of old-timey racial slurs. I privileged lines that were the beginnings of sentences over lines that were the middle of sentences. I eliminated lines that were boring (e.g. composed entirely of super-common English words).
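
Two of these filters are easy to sketch. The code below is a guess at the general shape, not the filters actually used: the tiny `COMMON_WORDS` list is a stand-in for a real frequency list, and "starts with a capital letter" is only a crude proxy for "begins a sentence".

```python
# Hypothetical stand-in for a real list of super-common English words.
COMMON_WORDS = frozenset(
    "the of and to a in is it that was for on with as be at by this you get".split()
)


def is_boring(line, common_words=COMMON_WORDS):
    """True if every word in the line is super-common (or the line is empty)."""
    words = [w.strip(".,;:!?\"'()").lower() for w in line.split()]
    return all(w in common_words for w in words if w)


def starts_sentence(line):
    """Crude proxy: treat a line that starts with a capital letter as a sentence start."""
    stripped = line.lstrip()
    return bool(stripped) and stripped[0].isupper()


def keep_line(line):
    """Keep lines that look like sentence beginnings and are not boring."""
    return starts_sentence(line) and not is_boring(line)
```

A softer version would turn `starts_sentence` into a score boost rather than a hard requirement, which is closer to "privileging" those lines than to excluding everything else.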

I also did some research into what distinguished funny, popular @horse_ebooks tweets from tweets that were not funny and less popular. Instead of trying to precisely reverse-engineer an algorithm that had a human at one end, I tried to figure out which outputs of the process gave results people liked, and focused my algorithm on delivering more of those. I'll post my findings in a separate post because this is getting way too long. Suffice to say that I'll pit the output of my program against the curated @horse_ebooks feed any day. Such as today, and every day for the next sixty years.

Like its counterpart in our universe, @pony_strategies doesn't just post quotes: it also posts ads for ebooks. Some of these books are strategy guides for the "Pôneis Brilhantes" series described in Constellation Games, but the others have randomly generated titles. Funny story: they're generated using Markov chains! Yes, when you have a corpus of really generic-sounding stuff and you want to make fun of how generic it sounds by generating more generic-sounding stuff, Markov chains give the best result. But do you really want to have that on your resume, Markov chains? "Successfully posed as unimaginative writer." Way to go, man.

Anyway, @pony_strategies. It's funny quotes, it's fake ads, it's an algorithm you can use in your own projects. Use it!

Secrets of (people's responses to) @horse_ebooks—revealed!: As part of my @pony_strategies project (see previous post), I grabbed the 3200 most recent @horse_ebooks tweets via the Twitter API, and ran them through some simple analysis scripts to figure out how they were made and which linguistic features separated the popular ones from the unpopular.

This let me prove one of my hypotheses about the secret to _ebooks style comedy gold. I also disproved one of my hypotheses re: comedy gold, and came up with an improved hypothesis that works much better. Using these as heuristics I was able to make @pony_strategies come up with more of what humans consider the good stuff.

Timing

The timing of @horse_ebooks posts formed a normal distribution with a mean of 3 hours and a standard deviation of 1 hour. Looking at ads alone, the situation was similar: a normal distribution with a mean of 15 hours and a standard deviation of 2 hours. This is pretty impressive consistency since Jacob Bakkila says he was posting @horse_ebooks tweets by hand. (No wonder he wanted to stop it!)

My setup is much different: I wrote a cheap scheduler that approximates a normal distribution and runs every fifteen minutes to see if it's time to post something.
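
One way such a scheduler could look is sketched below. The class name, the 3-hour mean and 1-hour standard deviation (borrowed from the @horse_ebooks numbers above), and the 15-minute floor on the delay are assumptions for illustration, not a description of the real scheduler.

```python
import random
from datetime import datetime, timedelta


class PostScheduler:
    """Decide, on each periodic check, whether it is time to post."""

    def __init__(self, mean_hours=3.0, stddev_hours=1.0, rng=random):
        self.mean_hours = mean_hours      # assumed value
        self.stddev_hours = stddev_hours  # assumed value
        self.rng = rng
        self.next_post = None

    def _draw_delay(self):
        # Draw a normally distributed delay, floored at one check interval
        # so that a negative or tiny draw cannot schedule a post in the past.
        hours = max(0.25, self.rng.gauss(self.mean_hours, self.stddev_hours))
        return timedelta(hours=hours)

    def should_post(self, now):
        """Call this every fifteen minutes; returns True when a post is due."""
        if self.next_post is None:
            self.next_post = now + self._draw_delay()
            return False
        if now >= self.next_post:
            self.next_post = now + self._draw_delay()
            return True
        return False
```

Since the check only runs every fifteen minutes, each post lands on the first check after its scheduled time, so the result only approximates a normal distribution. A real deployment run from something like cron would also need to persist `next_post` between runs; this sketch keeps it in memory.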

Beyond this point, my analysis excludes the ads and focuses exclusively on the quotes. Nobody actually liked the ads.

Length

The median length of a @horse_ebooks quote is 50 characters. Quotes shorter than the median were significantly more popular, but very long quotes were also more popular than quotes in the middle of the distribution.

Capitalization

I think that title case quotes (e.g. "Demand Furniture") are funnier than others. Does the public agree? For each quote, I checked whether the last word of the quote was capitalized.

43% of @horse_ebooks quotes end with a capitalized word. The median number of retweets for those quotes was 310, versus 235 for quotes with an uncapitalized last word. The public agrees with me. Title-case tweets are a little less common, but significantly more popular.
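
A comparison like this takes only a few lines. The sketch below assumes the tweets are available as `(text, retweet_count)` pairs; the function names are made up for the example.

```python
from statistics import median


def ends_capitalized(quote):
    """True if the last whitespace-separated word starts with an uppercase letter."""
    words = quote.split()
    return bool(words) and words[-1][0].isupper()


def median_retweets_by_ending(tweets):
    """Map True/False (capitalized last word or not) to the median retweet count.

    tweets is an iterable of (text, retweet_count) pairs.
    """
    groups = {True: [], False: []}
    for text, retweets in tweets:
        groups[ends_capitalized(text)].append(retweets)
    return {flag: median(counts) for flag, counts in groups.items() if counts}
```

Note that this check is deliberately naive: a last word that begins with a quotation mark or a digit counts as uncapitalized.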

The punchword

Since the last word of a joke is the most important, I decided to take a more detailed look at each quote's last word. My favorite @horse_ebooks tweets are the ones that cut off in the middle of a sentence, so I anticipated that I would see a lot of quotes that ended with boring words like "the".

I applied part-of-speech tagging to the last word of each quote and grouped them together. Nouns were the most common by far, followed by verbs of various kinds, determiners ("the", "this", "neither"), adjectives and adverbs.

I then sorted the list of parts of speech by the median number of retweets a @horse_ebooks quote got if it ended with that part of speech. Nouns and verbs were not only the most common, they were the most popular. (Median retweets for any kind of noun was over 300; verbs ranged from 191 retweets to 295, depending on the tense of the verb.) Adjectives underperformed relative to their frequency, except for comparative adjectives like "more", which overperformed.
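
The grouping and sorting step might look like the sketch below. It assumes the tagging has already been done by some part-of-speech tagger, so the input is a list of `(tag, retweet_count)` pairs, one per quote, where the tag describes the quote's last word. The function name is hypothetical.

```python
from collections import defaultdict
from statistics import median


def rank_tags_by_median(tagged_quotes):
    """Return (tag, number_of_quotes, median_retweets) rows, most popular first.

    tagged_quotes is an iterable of (last_word_tag, retweet_count) pairs.
    """
    by_tag = defaultdict(list)
    for tag, retweets in tagged_quotes:
        by_tag[tag].append(retweets)
    rows = [(tag, len(counts), median(counts)) for tag, counts in by_tag.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

Keeping the count next to the median makes it easy to spot tags that are common but unpopular, or rare but popular.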

I was right in thinking that quotes ending with a determiner or other boring word were very common, but they were also incredibly unpopular. The most popular among these were quotes that repeated gibberish over and over, e.g. "ORONGLY DGAGREE DISAGREE NO G G NO G G G G G G NO G G NEIEHER AGREE NOR DGAGREE O O O no O O no O O no O O no neither neither neither". A quote like "of events get you the" did very poorly. (By late-era @horse_ebooks standards, anyway.)

It's funny when you interrupt a noun

I pondered the mystery of the unpopular quotes and came up with a new hypothesis. People don't like interrupted sentences per se; they like interrupted noun phrases. Specifically, they like it when a noun phrase is truncated to a normal noun. Here are a few @horse_ebooks quotes that were extremely popular:

Clearly "computer", "science", "house", and "meal" were originally modifying some other noun, but when the sentence was truncated they became standalone nouns. Therefore, humor.

How can I test my hypothesis without access to the original texts from which @horse_ebooks takes its quotes? I don't have any automatic way to distinguish a truncated noun phrase from an ordinary noun. But I can see how many of the @horse_ebooks quotes end with a complete noun phrase. Then I can compare how well a quote does if it ends with a noun phrase, versus a noun that's not part of a noun phrase.

About 4.5% of the total @horse_ebooks quotes end in complete noun phrases. This is comparable to what I saw in the data I generated for @pony_strategies. I compared the popularity of quotes that ended in complete noun phrases, versus quotes that ended in standalone nouns.

Quote ends in      Median number of retweets
Standalone noun    330
Noun phrase        260
Other              216

So a standalone noun does better than a noun phrase, which does better than a non-noun. This confirms my hypothesis that truncating a noun phrase makes a quote funnier when the truncated phrase is also a noun. But a quote that ends in a complete noun phrase will still be more popular than one that ends with anything other than a noun.

Conclusion

At the time I did this research, I had about 2.5 million potential quotes taken from the Project Gutenberg DVD. I was looking for ways to rank these quotes and whittle them down to, say, the top ten percent. I used the techniques that I mentioned in my previous post for this, but I also used quote length, capitalization, and punchword part-of-speech to rank the quotes. I also looked for quotes that ended in complete noun phrases, and if truncating the noun phrase left me with a noun, most of the time I would go ahead and truncate the phrase. (For variety's sake, I didn't do this all the time.)
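
A ranking pass along these lines could be sketched as follows. The scoring weights are invented for the example and are not the ones actually used; the part-of-speech and noun-phrase signals are left out because they need a tagger. Only the 50-character median comes from the analysis above.

```python
def score_quote(quote, median_length=50):
    """Crude popularity score built from two of the signals described above.

    The +1.0 weights are arbitrary assumptions.
    """
    words = quote.split()
    if not words:
        return 0.0
    score = 0.0
    if len(quote) < median_length:
        score += 1.0  # quotes shorter than the median were more popular
    if words[-1][0].isupper():
        score += 1.0  # a capitalized last word was more popular
    return score


def top_fraction(quotes, fraction=0.1, key=score_quote):
    """Return the best-scoring fraction of the quotes (at least one, if any exist)."""
    ranked = sorted(quotes, key=key, reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]
```

With millions of candidates, many quotes will tie on a score this coarse; adding more signals, or a random tiebreaker, would spread them out.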

This stuff is currently not in olipy; I ran my filters and raters on the much smaller dataset I'd acquired from the DVD. There's no reason why these things couldn't go into olipy as part of the ebooks.py module, but it's going to be a while. I shouldn't be making bots at all; I have to finish Situation Normal.

Markov vs. Queneau: Sentence Assembly Smackdown: I mentioned earlier that when assembling strings of words, Markov chains do a better job than Queneau assembly. In this post I'd like to a) give the devil his due by showing what I mean, and b) qualify what I mean by "better job".

Markov wins when the structure is complex

I got the original idea for this post when generating the fake ads for @pony_strategies. My corpus is the titles of about 50,000 spammy-sounding ebooks, and this was the first time I did a head-to-head Markov/Queneau comparison. Here are ten of Markov's entries, using the Markov chain implementation I ended up adding to olipy:

  1. At Gas Pump!
  2. The Guy's Guide To The Atkins Diet
  3. Home Internet Business In The World.
  4. 101 Ways to Sharpen Your Memory
  5. SEO Relationship Building for Beginners
  6. Gary Secrets - Project Management Made Easy!
  7. Weight Success
  8. How get HER - Even If It's Just Money, So Easy and Effective Treatment Options
  9. Sams Yourself
  10. Define, With, Defeat! How To Get Traffic To Your Health

The Markov entries can get a little wacky ("Define, With, Defeat!"), which is good. But about half could be real titles without seeming weird at all, which is also good.

By contrast, here are ten of Queneau's entries:

  1. Adsense I Collection Profits: The bottom Guide Income!
  2. Reliable Your Earning Estate Develop Home And to life Fly Using Don't Your Partnership to Death
  3. Help the Your Causes, Successfully Business Vegetarian
  4. Connect New New Cooking
  5. 1 Tips, Me Life Starting to Simple Ultimate On Wills How Years Online With Living
  6. How Practice Health Best w/ Beauty
  7. Amazing Future & Codes Astrology to Definitive Green Carbs, Children Methods JV Engine Dollars And Effective Beginning Minutes NEW!
  8. I and - Gems Secrets Making Life Today!
  9. Succeeding For Inspiring Life
  10. Fast Survival Baby (Health Loss) Really How other of Look Symptoms, Your Business Encouragement: drive Health to Get with Easy Guide

At their very best ("Succeeding For Inspiring Life", "How Practice Health Best w/ Beauty"), these read like the work of a non-native English speaker. But most of them are way out there. They make no sense at all or they sound like a space alien wrote them to deal with space alien concerns. Sometimes this is what you want in your generated text! But usually not.

A Queneau assembler assumes that every string in its corpus has different tokens that follow an identical grammar. This isn't really true for spammy ebook titles, and it certainly isn't true for English sentences in general. A sentence is made up of words, sure, but there's nothing special about the fourth word in a sentence, the way there is about the fourth line of a limerick.
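
To make that assumption concrete, here is a stripped-down word-level Queneau assembler. It is a simplified sketch, not olipy's implementation: it borrows its length from one randomly chosen title, then fills each position with whatever word some title had at that same position.

```python
import random


def queneau_assemble(titles, rng=random):
    """Assemble a new title by picking the word for each position from a random source title."""
    tokenized = [title.split() for title in titles if title.split()]
    if not tokenized:
        return ""
    # Borrow the overall length from one randomly chosen source title.
    length = len(rng.choice(tokenized))
    result = []
    for position in range(length):
        # Any word that appeared at this position in any title is fair game.
        candidates = [tokens[position] for tokens in tokenized if len(tokens) > position]
        result.append(rng.choice(candidates))
    return " ".join(result)
```

Nothing ties the word chosen for one position to the word chosen for the next, which is why the output falls apart when position in the string does not correspond to grammatical role.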

A Markov chain assumes nothing about higher-level grammar. Instead, it assumes that surprises are rare, that the last few tokens are a good predictor of the next token. This is true for English sentences, and it's especially true for spammy ebook titles.

Markov chains don't need to bother with the overall structure of a sentence. They focus on the transitions between words, which can be modelled probabilistically. (And the good ones do treat the first and last tokens specially.)
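
A bare-bones word-level Markov chain shows the idea. This is a sketch rather than the olipy implementation; the `START` and `END` sentinels are one assumed way of treating the first and last tokens specially, and `max_tokens` is an arbitrary safety limit.

```python
import random
from collections import defaultdict

# Sentinel objects marking the start and end of a title; they can never
# collide with a real word.
START = object()
END = object()


def build_chain(titles, order=1):
    """Map each run of `order` tokens to the list of tokens seen right after it."""
    chain = defaultdict(list)
    for title in titles:
        tokens = [START] * order + title.split() + [END]
        for i in range(order, len(tokens)):
            chain[tuple(tokens[i - order:i])].append(tokens[i])
    return chain


def generate(chain, order=1, rng=random, max_tokens=30):
    """Walk the chain from START until END is drawn or max_tokens is reached."""
    state = (START,) * order
    output = []
    while len(output) < max_tokens:
        next_token = rng.choice(chain[state])
        if next_token is END:
            break
        output.append(next_token)
        state = state[1:] + (next_token,)
    return " ".join(output)
```

Every transition in the output was seen somewhere in the corpus, so the result is locally plausible even though nothing models the title as a whole. The `order` passed to `generate` must match the one used in `build_chain`, and the corpus must be non-empty.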

Markov wins when the corpus is large, Queneau when the corpus is tiny

Consider what happens to the two algorithms as the corpus grows in size. Markov chains get more believable, because the second word in a title is almost always a word commonly associated with the first word in the title. Queneau assemblies get wackier, because the second word in a title can be anything that was the second word in any title.

I have a corpus of 50,000 spammy titles. What if I chose a random sample of ten titles, and used those ten titles to construct a new title via Queneau assembly? This would make it more likely that the title's structure would hint at the structure of one or two of the source titles.

This is what I did in Board Game Dadaist, one of my first Queneau experiments. I pick a small number of board games and generate everything from that limited subset, increasing the odds that the result will make some kind of twisted sense.

If you run a Markov chain on a very small corpus, you'll probably just reproduce one of your input strings. But Queneau assembly works fine on a tiny corpus. I ran Queneau assembly ten times on ten samples from the spammy ebook titles, and here are the results:

  1. Beekeeping by Keep Grants
  2. Lose to Audience Business to to Your Backlink Physicists Environment
  3. HOT of Recruit Internet Because Financial the Memories
  4. Senior Guide Way! Business Way!
  5. Discover Can Power Successful Life How Steps
  6. Metal Lazy, Advice
  7. Insiders Came Warts Weapons Revealed
  8. 101 Secrets & THE Joint Health Than of Using Marketing! Using Using More Imagine
  9. Top **How Own 101**
  10. Multiple Spiritual Dynamite to Body - To Days

These are still really wacky, but they're better than when Queneau was choosing from 50,000 titles each time. For the @pony_strategies project, I still prefer the Markov chains.

Queneau wins when the outputs are short

Let's put spammy ebook titles to the side and move on to board game titles, a field where I think Queneau assembly is the clear winner. My corpus here is about 65,000 board game titles, gathered from BoardGameGeek. The key to what you're about to see is that the median length of a board game title is three words, versus nine words for a spammy ebook title.

Here are some of Markov's board game titles:

  1. Pointe Hoc
  2. Thieves the Pacific
  3. Illuminati Set 3
  4. Amazing Trivia Game
  5. Mini Game
  6. Meet Presidents
  7. Regatta: Game that the Government Played
  8. King the Rock
  9. Round 3-D Stand Up Game
  10. Cat Mice or Holes and Traps

A lot of these sound like real board games, but that's no longer a good thing. These are generic and boring. There are no surprises because the whole premise of Markov chains is that surprises are rare.

Here's Queneau:

  1. The Gravitas
  2. Risk: Tiles
  3. SESSION Pigs
  4. Yengo Edition Deadly Mat
  5. Ubongo: Fulda-Spiel
  6. Shantu Game Weltwunder Right
  7. Black Polsce Stars: Nostrum
  8. Peanut Basketball
  9. The Tactics: Reh
  10. Velvet Dos Centauri

Most of these are great! Board game names need to be catchy, so you want surprises. And short strings have highly ambiguous grammar anyway, so you don't get the "written by an alien" effect.

Conclusion

You know that I've been down on Markov chains for years, and you also know why: they rely on, and magnify, the predictability of their input. Markov chains turn creative prose into duckspeak. Whereas Queneau assembly simulates (or at least stimulates) creativity by manufacturing absurd juxtapositions.

The downside of Queneau is that if you can't model the underlying structure with code, the juxtapositions tend to be too absurd to use. And it's really difficult to model natural-language prose with code.

So here's my three-step meta-algorithm for deciding what to do with a corpus:

  1. If the items in your corpus follow a simple structure, code up that structure and go with Queneau.
  2. If the structure is too complex to be represented by a simple program (probably because it involves natural-language grammar), and you really need the output to be grammatical, go with Markov.
  3. Otherwise, write up a crude approximation of the complex structure, and go with Queneau.


Unless otherwise noted, all content licensed by Leonard Richardson
under a Creative Commons License.