News You Can Bruise for 2013 December

Mon Dec 02 2013 09:36 November Film Roundup: What a month! Mainly due to a huge film festival, but I also got another chance to see my favorite film of all time on the big screen. What might that film be? Clearly you haven't been reading my weblog for the past fifteen years.

Wives (1975): This movie has a 4.9 IMDB rating, and although it's not as good as Ishtar, it deserves a lot better than a 4.9. I mean, John Cassavetes's Husbands has a 7.3, and who needs that guy?
Uh, anyway, Wives is a fun cinema verité piece where three ladies blow off married life for a while and goof off. Columbia professor Jane Gaines introduced the movie by describing the main characters' activities as a "rampage", and I think that's a little strong, but maybe by 1975 Norway standards it was a real barn-burner. The film is sort of a more commercial Celine and Julie Go Boating. The humor is less reliant on in-jokes, the men are offscreen instead of totally absent, and it's ninety minutes long instead of three hours. It was pretty fun, but Celine and Julie is still the gold standard.
Next of Kin (1979): a.k.a. "Heritage". A ha-ha-only-serious farce that prefigures Arrested Development in its depiction of the magnetic power of money to keep a dysfunctional family together. Also has a 4.9 IMDB rating, and since all the movie info is in Norwegian I gotta figure it's Norwegians hating on their own filmmakers. Why the hate, Norwegians? Did you know that Kon-Tiki is the only Norwegian film people outside of Norway have ever heard of? Show some pride and get your name out there.
I guess I'm just stirring up trouble now, so I'll go back to Next of Kin. The centerpiece of the film for me was a long sequence in the house of the late paterfamilias, in which the family argues over who inherits what, then takes everything down off the walls, puts stickers on everything, and carries all the furniture out to their cars. That must have been incredibly difficult to film, and as someone who has lived through that event (minus the arguing) I gotta say Anja Breien nailed it.
Breien attended the screening and after the movie I asked her to talk about that bit. She said she likes "people carrying things" and the "surrealistic piles" you see in Hieronymus Bosch paintings. It symbolizes the alienating effect of materialism, you see. She mentioned that it was really difficult to find all those props; it had to be real expensive silver, paintings by big-name artists, etc. Sounds like they didn't insure it, either. The perfect time-travel heist!
Gentlemen Prefer Blondes (1953): Man, that was saucy. Jane Russell and Marilyn Monroe really tear it up. Russell's "Anyone Here For Love?" number ("The gayest thing I've ever seen." -Hal) annihilates the male gaze, which spends the rest of the movie trying to recover.
I must admit I'm warming to Marilyn Monroe. I also admit that's a weird thing for a heterosexual man to say, but keep in mind that for most of my life I experienced Marilyn Monroe entirely through the medium of cardboard cutouts used as decor for fake 50s diners. Then I saw her in Love Happy, where she's terrible, and Some Like it Hot, where she's not that great. But as I mentioned a year ago, she's awesome in All About Eve, and she's great in this movie as someone determined to get hers out of a sexist society.
Uh, the worst thing I can say about this movie is the plot bogs it down. I don't really care about the machinations or the milquetoast dudes or the tiara; I just want to see Russell and Monroe hit on some more dumb jocks and maybe commit a little light insurance fraud. Plus, we have a French courtroom conducting an inquiry in English, which may be the most unrealistic thing I've ever seen in a movie.
Finally, I'd just like to point out that this movie ends with the two female characters getting married to their milquetoast dudes, but then it zooms in and cuts the dudes out of frame, so it's just Russell and Monroe standing next to each other in their wedding dresses. I can only imagine what this film would have looked like with the Subtext Glasses they handed out during its original theatrical run.
The Wind Rises (2013): This was so close to being a good movie that I'm having a hard time pinning down the problem. I think it stems from the fact that this is one of the only Miyazaki films about an adult man. Does that make sense? Because the main character himself is fine but because he's a grown man I guess he's got to have this love interest who is sickly and angelic and apparently highly fictionalized. This would be okay if she was the mostly-offscreen mom from Totoro, but here she's supposed to carry the entire feminine side of the film and it's not good.
The other problem is that the movie doesn't tell its actual, interesting story--it obliquely tells the space around the story. Which, okay, it's a Japanese film and I'm not opposed to this technique in general, and I liked the way the actual story was told through foreshadowing and implication, but it also means we never see the main character directly struggle with the central problem of the film: the fact that he's designing beautiful things that will kill people. It skips past that part to focus on a cheesy fictionalized love story. I did not consider that a good trade.
Kiki's Delivery Service (1989): Rewatched on DVD as a palate cleanser from The Wind Rises. I think it drags in the middle but the beginning is SO GOOD, the way it assumes you already know the rules of its fantasy world. And it's a world that's better than the real world, which I feel is usually more a science fiction thing.
Good news, highbrow artists! I figured out how to get me to watch your avant-garde abstract film. Just use a computer to make it before 1988! The museum had a festival of early computer films, and I didn't see any of the features, but I watched almost all the shorts. It was a mix of really great films and incredibly boring films. (Making your film with a computer before 1988 does not guarantee I will give a good review. Offer still not valid for Andy Warhol.)
The worst offender was Woody Vasulka's Explanation (1974), a twelve-minute film in which a mesh is deformed and rotated before your eyes, over and over again. The mesh is the visual representation of a waveform which is also played aurally, and which always manifests as an obnoxious droning noise. Twelve minutes, folks. Explanation beats out Trent's Last Case to become the worst movie I've ever seen at the museum.
In the Q&A afterwards someone spoke up for the audience and demanded an explanation for Explanation. The answer actually made sense! Films like Explanation weren't meant to be screened in a theater. They were meant to be looped on a television in an art gallery. The essential affordance of an art gallery being that you can leave when you get tired of it, rather than sitting it out because there's an hour of hopefully better stuff afterwards.
It also would have helped if we'd seen the copyright date at the beginning of Explanation instead of the end, because most of the time I was thinking "This mesh deformation stuff would be groundbreaking for the early 70s, but if this turns out to be from 1986 I'm going to hack Woody Vasulka's Twitter account and make him follow Unicode Ebooks."
The other big sonic annoyance was that most of the films up to about 1972 had soundtracks featuring gratuitous sitar/gamelan/Japanese flute music that often didn't even match the animation. With no other point of reference, the new genre of computer graphics was comparable only to the wonders of LSD, so... toss in some hippy Eastern music! This interview about the film series puts it more diplomatically:
Science and Film: Can you discuss the early films’ fascination with Asian music and imagery?
Gregory Zinman: The influence of Asian music and imagery in early computer films can be traced to a couple of intertwining concerns. Following the horrors of the second world war, many people, including artists, were searching for different belief systems and ways of thinking about humanity’s place in the universe. This resulted, in part, in a flowering of interest in Eastern religions and philosophies, which in turn resulted in a number of cinematic works that simultaneously referenced other worlds and altered consciousnesses.

In a bit of cross-cultural revenge, we also saw a Japanese film (1969's Computer Movie No. 2), in which the soundtrack was Wendy Carlos's version of the third Brandenburg from Switched-On Bach, constantly interrupted by modem handshaking sounds. Make it stop!
Enough negativity. Let's cover the highlights, with links to full video or clips or at least semi-official pages about the films where possible.
First, the abstract stuff. I loved Mary Ellen Bute's very early, good-natured Abstronic (1952) and Mood Contrasts (1953). Especially the narrator at the beginning of Abstronic who explains the concept of computer art and then says "Enjoy yourself!" Here's a page with a couple clips of Mood Contrasts and I also discovered another great Bute film called Dada. Probably the cheeriest thing ever to be called Dada.
The Whitney family--John Sr., John Jr., and James, but sadly not my uncle Jon Whitney--were well represented and seem to have set the standard with films like Side Phase Drift (1965) and Lapis (1966) and Permutations (1968) and Arabesque (1975). The standard being "pointillism because otherwise the computer can't handle the math" and "slap some Asian music on the soundtrack."
But the champion of the abstract section IMO was Larry Cuba's work. 1978's 3/78 (Objects and Transformations) has a clear Whitney influence (moving dots + Japanese flute soundtrack), but by 1985 computer power had advanced to the point where he was able to create what ranks alongside Composition in Blue (1935) as one of my favorite abstract films of all time, the gloriously isometric Calculated Movements (here's a 30-second excerpt).
Cuba made Calculated Movements with a system called GRASS, which I believe he also used to create the animated Death Star infographic in Star Wars (1977). He was present for the screening, and in the Q&A I asked him if he still had the Calculated Movements source code and if there was a framework for running GRASS on modern computers. He dodged the first question and said no to the second--someone was working on something for Windows but the project died. He did mention that he considered Processing to be the successor to GRASS.
Between abstract and representative film sits the surreal, neon candy-colored demo reel for the computer graphics studio of Robert Abel and Associates. Their work was apparently described as "a psychedelic trip gone straight," and if I'm misremembering that quote, I'll use those exact words to describe it right now. We saw the 1974 reel and I can't find that exact one online, but here are a few later ones: 1981 and 1982
I especially enjoyed RAA's bonkers 1974 ad for 7-Up, which really lightened the mood after a half-hour of the Whitneys, I tell you what. Here's a YouTube playlist of their stuff. Here's a sequel to the 7-Up commercial with a McDonalds tie-in. Outstanding. This studio seems to have driven a big chunk of the late-70s early-80s aesthetic.
And now, my perennial favorite, representative film. Yay!
- La Faim (1974) used computer animation and morphing to create a traditional-style (albeit avant-garde) animated short. I'm surprised the disturbing, grotesque faces on display in this film aren't used in more memes. (See sample meme to the right.)
- Vol Libre (1980): This one really wowed 'em at SIGGRAPH with its fractal geometry. Bonus sci-fi connection: director Loren Carpenter says, "I used an antialiased version of this software to create the fractal planet in the Genesis Sequence of Star Trek 2, the Wrath of Khan."
- Voyager 2 Flyby (1981): We saw the second Saturn flyby, but YouTube also has the first Saturn flyby, as well as the 1986 sequel about Uranus and 1989's chilling "Neptune and Triton".
  Jim Blinn, creator of the Saturn flyby film, said, "Our storyboard was the NASA flight plan." (He wasn't there; the guy introducing the films told us that he said this.) The Voyager flyby film was apparently the first time computer graphics were shown on the nightly news as part of the news, rather than just in interstitials and 7-Up commercials from Robert Abel and Associates.
- Human Vectors (1982): This isn't a great work of art, but it was filmed off of a Vectrex, so it looks like nothing else in the show. It was apparently rescued by the New Museum's recent XFR STN project. I laughed at the C debugging joke.
- Big Electric Cat (1982): An 80s rock video. Not that great but I'm including it here because it's so weird. One of the directors was present and he introduced the video by saying: "It was the 80s." It sure was.
- Adventures in Success (1983): Now this is more like it! A funny music video for a good rock song. It's catchy and toe-tapping and satirical and also very 80s. Highly recommended.
- No No Nooky TV (1987): The journal of a love affair between a woman and her Amiga 1000. Funny and dirty and filled with the 16-color joy that flows from late-1980s computer paint programs. A triumph! Vimeo says the video is only 2:40, but the entire film is there.
I would be really interested to hear about the relationship between the demoscene and the computer film scene. I'm pretty sure there was no connection whatsoever, for a variety of reasons, but I would like to hear some people who came in to computer art through the "art" side talk about the stuff that came out from the "computer" side. I'm talking about the tension between Human Vectors (which is technically very skilled but nothing special artistically) and No No Nooky TV (which is clearly the work of a professional filmmaker but was made using only the programs that come loaded on the Amiga).
I didn't bring this up in Q&A because I figured no one would know what I was talking about, and if they did it would derail the whole Q&A. Perhaps I should have had more faith in computer animators. I guess I'll have to wait for the Jason Scott documentary.
I also think the museum did a good job of showcasing excellent work by women in a medium dominated (?) by male artists. The earliest films shown were Mary Ellen Bute's, and my two favorite films of the show were made by women: Lynn Goldsmith (who co-directed and sang Adventures in Success) and Barbara Hammer (No No Nooky TV). There was also a whole discussion with Lillian Schwartz which I didn't attend.
If this has whetted your appetite for old-fashioned computer animation, there's plenty more where that came from (the past).
The Big Lebowski (1998): I'm not someone who rewatches movies, and I've now seen The Big Lebowski six times. What can I say now that I haven't already said?
Well, how about this. My favorite thing about Thomas Pynchon is that each of his characters is surrounded by a protective bubble of literary genre, which colors the way the narrative is reported and even shapes the plot. This is most obvious with the Chums of Chance in Against the Day, who start off having a carefree Tom Swift adventure that, as they grow up, gradually becomes a WWI military novel. The Big Lebowski does the same thing for film.
I admit it took the publication of Inherent Vice, Thomas Pynchon's own version of The Big Lebowski, for me to realize this, but there it is. Walter is in an action movie. Maude Lebowski is in an arty Eurofilm where people trade wisecracks and laugh about nothing. The Stranger is in a Western. Bunny Lebowski is in an acausal porno. Jeffrey Lebowski is in a biopic of himself, with classical music and a narrator sonorously recounting his accomplishments. The Dude doesn't want to be in a movie at all, but his decision to get revenge for the death of his ~~partner~~ rug puts him into a bubble of film noir. And Donny is like a child who wanders into the middle of a movie and wants to know what's going on.
And I don't know what else to say. The Big Lebowski is my favorite movie. It's very nearly the perfect fiasco comedy, and since that's the best kind of movie, it's very nearly the perfect movie. But how many times can you watch the perfect movie? How can I laugh at a really funny joke knowing that my laughter rings hollow because I knew the joke's exact timing?
Here it stands, like Shakespeare's Hamlet or Larry Cuba's Star Wars, the source of cliches that will last a thousand years. Can I set down The Big Lebowski and walk away without betraying my love for it? Nay, and yet I must! For this is not 'Nam. This is Film Roundup. There are rules.

Wed Dec 04 2013 09:14 @pony_strategies: My new bot, @pony_strategies, is the most sophisticated one I've ever created. It is the @horse_ebooks spambot from the Constellation Games universe.

Unlike @horse_ebooks, @pony_strategies will not abruptly stop publishing fun stuff, or turn out to be a cheesy tie-in trying to get you interested in some other project. It is a cheesy tie-in to some other project (Constellation Games), but you go into the relationship knowing this fact, and the connection is very subtle.

When explaining this project to people as I worked on it, I was astounded that many of them didn't know what @horse_ebooks was. But that just proves I inhabit a bubble in which fakey software has outsized significance. So a brief introduction:

@horse_ebooks was a spambot created by a Russian named Alexei Kouznetsov. It posted Twitter ads for crappy ebooks, some of which (but not all, or even most) were about horses. Its major innovative feature was its text generation algorithm for the things it would say between ads.
Are you ready? The amazing algorithm was this: @horse_ebooks ripped strings more or less randomly from the crappy ebooks it was selling and presented them with absolutely no context.
Trust me, this is groundbreaking. I'm sure this technique had been tried before, but @horse_ebooks was the first to make it popular. And it's great! Truncating a sentence in the right place generates some pretty funny stuff. Here are four consecutive @horse_ebooks tweets:

Not only that, but whether you believe it (or want to believe it) the car salesmen will continue to laugh
Demand Furniture
Including simplified four part arrangements for the novice student and
Just look at everything that I am going

There was a tribute comic and everything.
I say @horse_ebooks "was" a spambot because in 2011 the Twitter account was acquired by two Americans, Jacob Bakkila and Thomas Bender, who took it over and started running it not to sell crappy ebooks, but to promote their Alternate Reality Game. This fact was revealed back in September 2013, and once the men behind the mask were revealed, @horse_ebooks stopped posting.
The whole conceit of @horse_ebooks was that there was no active creative process, just a dumb algorithm. But in reality Bakkila was "impersonating" the original algorithm—most likely curating its output so that you only saw the good stuff. No one likes to be played for a sucker, and when the true purpose of @horse_ebooks was revealed, folks felt betrayed.

As it happens, the question of whether it's artistically valid to curate the output of an algorithm is a major bone of contention in the ongoing Vorticism/Futurism-esque feud between Allison Parrish and myself. She is dead set against it; I think it makes sense if you are using an algorithm as the input into another creative process, or if your sole object is to entertain. We both agree that it's a little sketchy if you have 200,000 fans whose fandom is predicated on the belief that they're reading the raw output of an algorithm. On the other hand, if you follow an ebook spammer on Twitter, you get up with fleas. I think that's how the saying goes.

In any event, the fan comics ceased when @horse_ebooks did. There was a lot of chin-stroking and art-denial and in general the reaction was strongly negative. But that's not the end of the story.

You see, the death of @horse_ebooks led to an outpouring of imitation *_ebooks bots on various topics. (This had been happening before, actually.) As these bots were announced, I swore silent vengeance on each and every one of them. Why? Because those bots didn't use the awesome @horse_ebooks algorithm! Most of them used Markov chains, that most hated technique, to generate their text. It was as if the @horse_ebooks algorithm itself had been discredited by the revelation that two guys from New York were manually curating its output. (Confused reports that those guys had "written" the @horse_ebooks tweets didn't help matters--they implied that there was no algorithm at all and that the text was original.)

But there was hope. A single bot escaped my pronouncements of vengeance: Allison's excellent @zzt_ebooks. That is a great bot which you should follow, and it uses an approximation of the real @horse_ebooks algorithm:

The corpus is word-wrapped at 35 characters per line.
Pick a line to use as the first part of a tweet.
If (random), append the next line onto the current line.
Repeat until (random) is false or the line is as large as a tweet can get.
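
Here's a minimal Python sketch of those four steps. It is not the actual @zzt_ebooks or olipy code: the function name `ebooks_quote`, the 50% chance of continuing, and the 140-character cap are all assumptions made for illustration.

```python
import random
import textwrap

def ebooks_quote(corpus, rng=None, width=35, p_continue=0.5, max_len=140):
    """Sketch of the four steps: wrap, pick a line, maybe keep appending lines.

    p_continue and max_len are assumed values, not taken from any real bot.
    """
    rng = rng or random.Random()
    # Step 1: word-wrap the corpus at `width` characters per line.
    lines = textwrap.wrap(corpus, width=width)
    if not lines:
        return ""
    # Step 2: pick a line to use as the first part of the quote.
    i = rng.randrange(len(lines))
    quote = lines[i]
    # Steps 3 and 4: while the coin flip comes up true, append the next line,
    # stopping at the end of the corpus or when the quote would get too long.
    while i + 1 < len(lines) and rng.random() < p_continue:
        candidate = quote + " " + lines[i + 1]
        if len(candidate) > max_len:
            break
        quote = candidate
        i += 1
    return quote
```

Passing a seeded `random.Random` as `rng` makes the output repeatable, which helps when you're tuning the wrap width against a particular corpus.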

And here are four consecutive quotes from @zzt_ebooks:

SHAPIRO: Ouch! SHAPIRO: Shapiro cares not! SHAPIRO: Hooray!
things, but I saw some originality in it. The art was very simple, but it was good
You're tackled by the opponent!
Gender: Male Height: 5'9" Pilot? Yes Ph.D.? Yes

Works great.

The ultimate genesis of @pony_strategies was this conversation I had with Allison about @zzt_ebooks. Recently my anger with *_ebooks bots reached the point where I decided to add a real *_ebooks algorithm to olipy to encourage people to use it. Of course I'd need a demo bot to show off the algorithm...

The @pony_strategies bot has sixty years' worth of content loaded into it. I extracted the content from the same Project Gutenberg DVD I used to revive @everybrendan. There's a lot more where that came from--I ended up choosing about 0.0001% of the possibilities found in the DVD.

I have not manually curated the PG quotes and I have no idea what the bot is about to post. But the dataset is the result of a lot of algorithmic curation. I focused on technical books, science books and cookbooks--the closest PG equivalents to the crap that @horse_ebooks was selling. I applied a language filter to get rid of old-timey racial slurs. I privileged lines that were the beginnings of sentences over lines that were the middle of sentences. I eliminated lines that were boring (e.g. composed entirely of super-common English words).
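
To give a flavor of that kind of algorithmic curation, here's a rough Python sketch of two of the filters: dropping "boring" lines and favoring lines that look like sentence beginnings. This is not the code that built the @pony_strategies dataset. The word list is a tiny stand-in, the capital-letter test is a crude guess at "beginning of a sentence", and every name here is hypothetical.

```python
# A tiny stand-in list; a real filter would use a much longer list of
# common English words. This one is invented for illustration.
COMMON_WORDS = {"the", "of", "and", "to", "a", "in", "is", "it", "you", "that",
                "get", "for", "on", "with", "as", "was", "at", "by", "this"}

def is_boring(line, common_words=COMMON_WORDS):
    """True if every word in the line is a super-common English word."""
    words = [w.strip(".,;:!?\"'()").lower() for w in line.split()]
    words = [w for w in words if w]
    return all(w in common_words for w in words)

def starts_sentence(line):
    """Rough guess: the line begins with a capital letter."""
    stripped = line.lstrip()
    return bool(stripped) and stripped[0].isupper()

def curate(lines):
    """Drop boring lines, then put likely sentence beginnings first."""
    interesting = [line for line in lines if not is_boring(line)]
    # sorted() is stable, so lines keep their order within each group.
    return sorted(interesting, key=starts_sentence, reverse=True)
```

Note that an empty line counts as boring here, since it contains no uncommon words.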

I also did some research into what distinguished funny, popular @horse_ebooks tweets from tweets that were not funny and less popular. Instead of trying to precisely reverse-engineer an algorithm that had a human at one end, I tried to figure out which outputs of the process gave results people liked, and focused my algorithm on delivering more of those. I'll post my findings in a separate post because this is getting way too long. Suffice to say that I'll pit the output of my program against the curated @horse_ebooks feed any day. Such as today, and every day for the next sixty years.

Like its counterpart in our universe, @pony_strategies doesn't just post quotes: it also posts ads for ebooks. Some of these books are strategy guides for the "Pôneis Brilhantes" series described in Constellation Games, but the others have randomly generated titles. Funny story: they're generated using Markov chains! Yes, when you have a corpus of really generic-sounding stuff and you want to make fun of how generic it sounds by generating more generic-sounding stuff, Markov chains give the best result. But do you really want to have that on your resume, Markov chains? "Successfully posed as unimaginative writer." Way to go, man.

Anyway, @pony_strategies. It's funny quotes, it's fake ads, it's an algorithm you can use in your own projects. Use it!

(2) Wed Dec 04 2013 14:55 Secrets of (peoples' responses to) @horse_ebooks—revealed!: As part of my @pony_strategies project (see previous post), I grabbed the 3200 most recent @horse_ebooks tweets via the Twitter API, and ran them through some simple analysis scripts to figure out how they were made and which linguistic features separated the popular ones from the unpopular.

This let me prove one of my hypotheses about the secret to _ebooks style comedy gold. I also disproved one of my hypotheses re: comedy gold, and came up with an improved hypothesis that works much better. Using these as heuristics I was able to make @pony_strategies come up with more of what humans consider the good stuff.

Timing

The timing of @horse_ebooks posts formed a normal distribution with mean of 3 hours and a standard deviation of 1 hour. Looking at ads alone, the situation was similar: a normal distribution with mean of 15 hours and standard deviation of 2 hours. This is pretty impressive consistency since Jacob Bakkila says he was posting @horse_ebooks tweets by hand. (No wonder he wanted to stop it!)

My setup is much different: I wrote a cheap scheduler that approximates a normal distribution and runs every fifteen minutes to see if it's time to post something.
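
One way such a scheduler could work, sketched in Python. This is a guess at the general shape, not the real @pony_strategies scheduler: the class name is hypothetical, and the 180-minute mean and 60-minute standard deviation are simply borrowed from the @horse_ebooks figures above for illustration.

```python
import random

TICK_MINUTES = 15  # how often the scheduler is run

class CheapScheduler:
    """Hypothetical scheduler: draws the gap to the next post from a normal
    distribution, then posts on the first 15-minute tick at or after it."""

    def __init__(self, mean_minutes=180, stdev_minutes=60, rng=None):
        self.mean = mean_minutes
        self.stdev = stdev_minutes
        self.rng = rng or random.Random()
        self.next_post = None  # in minutes; None until the first tick

    def _draw_gap(self):
        # Clamp so a rare tiny or negative draw never lands in the past.
        return max(TICK_MINUTES, self.rng.gauss(self.mean, self.stdev))

    def tick(self, now_minutes):
        """Call every 15 minutes; returns True when it is time to post."""
        if self.next_post is None:
            self.next_post = now_minutes + self._draw_gap()
            return False
        if now_minutes >= self.next_post:
            self.next_post = now_minutes + self._draw_gap()
            return True
        return False
```

Because posts can only happen on a tick, the real gaps are rounded up to a multiple of fifteen minutes, which is why this only approximates a normal distribution.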

Beyond this point, my analysis excludes the ads and focuses exclusively on the quotes. Nobody actually liked the ads.

Length

The median length of a @horse_ebooks quote is 50 characters. Quotes shorter than the median were significantly more popular, but very long quotes were also more popular than quotes in the middle of the distribution.

Capitalization

I think that title case quotes (e.g. "Demand Furniture") are funnier than others. Does the public agree? For each quote, I checked whether the last word of the quote was capitalized.

43% of @horse_ebooks quotes end with a capitalized word. The median number of retweets for those quotes was 310, versus 235 for quotes with an uncapitalized last word. The public agrees with me. Title-case tweets are a little less common, but significantly more popular.
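
The check itself is simple. Here's a Python sketch of the comparison, assuming the tweets have already been fetched as (text, retweet count) pairs; the function names are hypothetical and this is not the original analysis script.

```python
from statistics import median

def ends_capitalized(quote):
    """True if the last word of the quote starts with an uppercase letter."""
    words = quote.split()
    return bool(words) and words[-1][0].isupper()

def median_retweets_by_capitalization(tweets):
    """tweets: iterable of (text, retweet_count) pairs.

    Returns a dict mapping True (capitalized last word) and False (not
    capitalized) to the median retweet count, skipping empty groups.
    """
    groups = {True: [], False: []}
    for text, retweets in tweets:
        groups[ends_capitalized(text)].append(retweets)
    return {flag: median(counts) for flag, counts in groups.items() if counts}
```

Medians are used rather than means so that one runaway hit doesn't swamp the comparison.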

The punchword

Since the last word of a joke is the most important, I decided to take a more detailed look at each quote's last word. My favorite @horse_ebooks tweets are the ones that cut off in the middle of a sentence, so I anticipated that I would see a lot of quotes that ended with boring words like "the".

I applied part-of-speech tagging to the last word of each quote and grouped them together. Nouns were the most common by far, followed by verbs of various kinds, determiners ("the", "this", "neither"), adjectives and adverbs.

I then sorted the list of parts of speech by the median number of retweets a @horse_ebooks quote got if it ended with that part of speech. Nouns and verbs were not only the most common, they were the most popular. (Median retweets for any kind of noun was over 300; verbs ranged from 191 retweets to 295, depending on the tense of the verb.) Adjectives underperformed relative to their frequency, except for comparative adjectives like "more", which overperformed.

I was right in thinking that quotes ending with a determiner or other boring word were very common, but they were also incredibly unpopular. The most popular among these were quotes that repeated gibberish over and over, e.g. "ORONGLY DGAGREE DISAGREE NO G G NO G G G G G G NO G G NEIEHER AGREE NOR DGAGREE O O O no O O no O O no O O no neither neither neither". A quote like "of events get you the" did very poorly. (By late-era @horse_ebooks standards, anyway.)

It's funny when you interrupt a noun

I pondered the mystery of the unpopular quotes and came up with a new hypothesis. People don't like interrupted sentences per se; they like interrupted noun phrases. Specifically, they like it when a noun phrase is truncated to a normal noun. Here are a few @horse_ebooks quotes that were extremely popular:

Don t worry if you are not computer
Don t feel stupid and doomed forever just because you failed on a science
You constantly misplace your house
I have completely eliminated your meal

Clearly "computer", "science", "house", and "meal" were originally modifying some other noun, but when the sentence was truncated they became standalone nouns. Therefore, humor.

How can I test my hypothesis without access to the original texts from which @horse_ebooks takes its quotes? I don't have any automatic way to distinguish a truncated noun phrase from an ordinary noun. But I can see how many of the @horse_ebooks quotes end with a complete noun phrase. Then I can compare how well a quote does if it ends with a noun phrase, versus a noun that's not part of a noun phrase.

About 4.5% of the total @horse_ebooks quotes end in complete noun phrases. This is comparable to what I saw in the data I generated for @pony_strategies. I compared the popularity of quotes that ended in complete noun phrases, versus quotes that ended in standalone nouns.

Quote ends in	Median number of retweets
Standalone noun	330
Noun phrase	260
Other	216

So a standalone noun does better than a noun phrase, which does better than a non-noun. This confirms my hypothesis that truncating a noun phrase makes a quote funnier when the truncated phrase is also a noun. But a quote that ends in a complete noun phrase will still be more popular than one that ends with anything other than a noun.

Conclusion

At the time I did this research, I had about 2.5 million potential quotes taken from the Project Gutenberg DVD. I was looking for ways to rank these quotes and whittle them down to, say, the top ten percent. I used the techniques that I mentioned in my previous post for this, but I also used quote length, capitalization, and punchword part-of-speech to rank the quotes. I also looked for quotes that ended in complete noun phrases, and if truncating the noun phrase left me with a noun, most of the time I would go ahead and truncate the phrase. (For variety's sake, I didn't do this all the time.)
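
A ranking pass like that might look something like the following Python sketch. It only covers the length and capitalization heuristics (the part-of-speech and noun-phrase steps need a tagger), and the equal weights and function names are invented for illustration rather than taken from the real filters.

```python
def score_quote(quote, median_length=50):
    """Hypothetical score, higher is better. The weights are invented."""
    score = 0.0
    if len(quote) < median_length:
        score += 1.0  # shorter-than-median quotes were more popular
    words = quote.split()
    if words and words[-1][0].isupper():
        score += 1.0  # quotes ending in a capitalized word did better
    return score

def top_fraction(quotes, fraction=0.1):
    """Keep the best-scoring `fraction` of quotes (at least one, if any)."""
    ranked = sorted(quotes, key=score_quote, reverse=True)
    keep = max(1, int(len(ranked) * fraction)) if ranked else 0
    return ranked[:keep]
```

With 2.5 million candidates, `top_fraction(quotes, 0.1)` would leave about 250,000 of them.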

This stuff is currently not in olipy; I ran my filters and raters on the much smaller dataset I'd acquired from the DVD. There's no reason why these things couldn't go into olipy as part of the ebooks.py module, but it's going to be a while. I shouldn't be making bots at all; I have to finish Situation Normal.

(3) Mon Dec 16 2013 13:10 Markov vs. Queneau: Sentence Assembly Smackdown: I mentioned earlier that when assembling strings of words, Markov chains do a better job than Queneau assembly. In this post I'd like to a) give the devil his due by showing what I mean, and b) qualify what I mean by "better job".

Markov wins when the structure is complex

I got the original idea for this post when generating the fake ads for @pony_strategies. My corpus is the titles of about 50,000 spammy-sounding ebooks, and this was the first time I did a head-to-head Markov/Queneau comparison. Here are ten of Markov's entries, using the Markov chain implementation I ended up adding to olipy:

At Gas Pump!
The Guy's Guide To The Atkins Diet
Home Internet Business In The World.
101 Ways to Sharpen Your Memory
SEO Relationship Building for Beginners
Gary Secrets - Project Management Made Easy!
Weight Success
How get HER - Even If It's Just Money, So Easy and Effective Treatment Options
Sams Yourself
Define, With, Defeat! How To Get Traffic To Your Health

The Markov entries can get a little wacky ("Define, With, Defeat!"), which is good. But about half could be real titles without seeming weird at all, which is also good.

By contrast, here are ten of Queneau's entries:

Adsense I Collection Profits: The bottom Guide Income!
Reliable Your Earning Estate Develop Home And to life Fly Using Don't Your Partnership to Death
Help the Your Causes, Successfully Business Vegetarian
Connect New New Cooking
1 Tips, Me Life Starting to Simple Ultimate On Wills How Years Online With Living
How Practice Health Best w/ Beauty
Amazing Future & Codes Astrology to Definitive Green Carbs, Children Methods JV Engine Dollars And Effective Beginning Minutes NEW!
I and - Gems Secrets Making Life Today!
Succeeding For Inspiring Life
Fast Survival Baby (Health Loss) Really How other of Look Symptoms, Your Business Encouragement: drive Health to Get with Easy Guide

At their very best ("Succeeding For Inspiring Life", "How Practice Health Best w/ Beauty"), these read like the work of a non-native English speaker. But most of them are way out there. They make no sense at all, or they sound like a space alien wrote them to deal with space alien concerns. Sometimes this is what you want in your generated text! But usually not.

A Queneau assembler assumes that every string in its corpus has different tokens that follow an identical grammar. This isn't really true for spammy ebook titles, and it certainly isn't true for English sentences in general. A sentence is made up of words, sure, but there's nothing special about the fourth word in a sentence, the way there is about the fourth line of a limerick.
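
To make that assumption concrete, here's a minimal sketch of positional Queneau assembly in Python. This is not the olipy implementation: the function name `queneau_assemble`, its `rng` parameter, and the whitespace-only tokenization are all assumptions made for illustration.

```python
import random

def queneau_assemble(corpus, rng=random):
    """Build a new title where word N comes from word N of some source title.

    Hypothetical helper for illustration; assumes every title in the
    corpus contains at least one word.
    """
    tokenized = [title.split() for title in corpus if title.split()]
    # Borrow the output length from one randomly chosen source title.
    length = len(rng.choice(tokenized))
    words = []
    for position in range(length):
        # Any word that appeared at this position in any title is fair game.
        candidates = [t[position] for t in tokenized if len(t) > position]
        words.append(rng.choice(candidates))
    return " ".join(words)

if __name__ == "__main__":
    titles = [
        "101 Ways to Sharpen Your Memory",
        "The Guy's Guide To The Atkins Diet",
        "SEO Relationship Building for Beginners",
    ]
    print(queneau_assemble(titles))
```

Nothing in the loop looks at the previous word, which is exactly why the results fall apart when the source strings don't share a grammar.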

A Markov chain assumes nothing about higher-level grammar. Instead, it assumes that surprises are rare, that the last few tokens are a good predictor of the next token. This is true for English sentences, and it's especially true for spammy ebook titles.

Markov chains don't need to bother with the overall structure of a sentence. They focus on the transitions between words, which can be modelled probabilistically. (And the good ones do treat the first and last tokens specially.)
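
Here's a rough sketch of that idea, assuming a word-level chain that looks back two tokens. The names `build_chain` and `generate`, the `START`/`END` sentinels, and the default order of 2 are all assumptions for illustration; this is not the code I added to olipy.

```python
import random
from collections import defaultdict

# Sentinel tokens so the chain knows how titles begin and end.
START, END = "<start>", "<end>"

def build_chain(corpus, order=2):
    """Map each run of `order` tokens to the tokens seen right after it."""
    chain = defaultdict(list)
    for title in corpus:
        tokens = [START] * order + title.split() + [END]
        for i in range(len(tokens) - order):
            state = tuple(tokens[i:i + order])
            chain[state].append(tokens[i + order])
    return chain

def generate(chain, order=2, rng=random, max_tokens=50):
    """Walk the chain from the start state until END (or max_tokens)."""
    state = (START,) * order
    output = []
    while len(output) < max_tokens:
        next_token = rng.choice(chain[state])
        if next_token == END:
            break
        output.append(next_token)
        # Slide the window forward by one token.
        state = state[1:] + (next_token,)
    return " ".join(output)

if __name__ == "__main__":
    titles = ["How To Get Traffic", "How To Lose Weight"]
    print(generate(build_chain(titles)))
```

Padding with `START` and appending `END` is what lets the chain treat the first and last tokens specially: it only opens with words that opened a real title, and it only stops where a real title stopped.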

Markov wins when the corpus is large, Queneau when the corpus is tiny

Consider what happens to the two algorithms as the corpus grows in size. Markov chains get more believable, because the second word in a title is almost always a word commonly associated with the first word in the title. Queneau assemblies get wackier, because the second word in a title can be anything that was the second word in any title.

I have a corpus of 50,000 spammy titles. What if I chose a random sample of ten titles, and used those ten titles to construct a new title via Queneau assembly? This would make it more likely that the title's structure would hint at the structure of one or two of the source titles.

This is what I did in Board Game Dadaist, one of my first Queneau experiments. I pick a small number of board games and generate everything from that limited subset, increasing the odds that the result will make some kind of twisted sense.
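
The sample-first approach can be sketched like this. It's a simplified illustration, not the Board Game Dadaist code: `queneau_from_sample` and its parameters are hypothetical names, titles are split on whitespace, and every title is assumed to be non-empty.

```python
import random

def queneau_from_sample(corpus, sample_size=10, rng=random):
    """Pick a small random subset of titles, then assemble from that subset only."""
    sample = rng.sample(corpus, min(sample_size, len(corpus)))
    tokenized = [title.split() for title in sample if title.split()]
    # Borrow the output length from one of the sampled titles.
    length = len(rng.choice(tokenized))
    # Word N of the output is word N of some title in the sample.
    return " ".join(
        rng.choice([t[i] for t in tokenized if len(t) > i])
        for i in range(length)
    )

if __name__ == "__main__":
    titles = [
        "101 Ways to Sharpen Your Memory",
        "The Guy's Guide To The Atkins Diet",
        "SEO Relationship Building for Beginners",
    ]
    print(queneau_from_sample(titles, sample_size=2))
```

The only knob that matters is `sample_size`: the smaller it is, the fewer source titles each output word can come from.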

If you run a Markov chain on a very small corpus, you'll probably just reproduce one of your input strings. But Queneau assembly works fine on a tiny corpus. I ran Queneau assembly ten times on ten samples from the spammy ebook titles, and here are the results:

Beekeeping by Keep Grants
Lose to Audience Business to to Your Backlink Physicists Environment
HOT of Recruit Internet Because Financial the Memories
Senior Guide Way! Business Way!
Discover Can Power Successful Life How Steps
Metal Lazy, Advice
Insiders Came Warts Weapons Revealed
101 Secrets & THE Joint Health Than of Using Marketing! Using Using More Imagine
Top **How Own 101**
Multiple Spiritual Dynamite to Body - To Days

These are still really wacky, but they're better than when Queneau was choosing from 50,000 titles each time. For the @pony_strategies project, I still prefer the Markov chains.

Queneau wins when the outputs are short

Let's put spammy ebook titles to the side and move on to board game titles, a field where I think Queneau assembly is the clear winner. My corpus here is about 65,000 board game titles, gathered from BoardGameGeek. The key to what you're about to see is that the median length of a board game title is three words, versus nine words for a spammy ebook title.
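
If you want to check that statistic on your own corpus, a minimal sketch follows. The helper name `median_word_count` is made up for illustration, and it counts whitespace-separated words, which may not match how the figures above were computed.

```python
from statistics import median

def median_word_count(titles):
    """Median number of whitespace-separated words per title."""
    return median(len(title.split()) for title in titles)

if __name__ == "__main__":
    sample = ["Risk", "Mini Game", "Amazing Trivia Game"]
    print(median_word_count(sample))
```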

Here are some of Markov's board game titles:

Pointe Hoc
Thieves the Pacific
Illuminati Set 3
Amazing Trivia Game
Mini Game
Meet Presidents
Regatta: Game that the Government Played
King the Rock
Round 3-D Stand Up Game
Cat Mice or Holes and Traps

A lot of these sound like real board games, but that's no longer a good thing. These are generic and boring. There are no surprises because the whole premise of Markov chains is that surprises are rare.

Here's Queneau:

The Gravitas
Risk: Tiles
SESSION Pigs
Yengo Edition Deadly Mat
Ubongo: Fulda-Spiel
Shantu Game Weltwunder Right
Black Polsce Stars: Nostrum
Peanut Basketball
The Tactics: Reh
Velvet Dos Centauri

Most of these are great! Board game names need to be catchy, so you want surprises. And short strings have highly ambiguous grammar anyway, so you don't get the "written by an alien" effect.

Conclusion

You know that I've been down on Markov chains for years, and you also know why: they rely on, and magnify, the predictability of their input. Markov chains turn creative prose into duckspeak. Whereas Queneau assembly simulates (or at least stimulates) creativity by manufacturing absurd juxtapositions.

The downside of Queneau is that if you can't model the underlying structure with code, the juxtapositions tend to be too absurd to use. And it's really difficult to model natural-language prose with code.

So here's my three-step meta-algorithm for deciding what to do with a corpus:

If the items in your corpus follow a simple structure, code up that structure and go with Queneau.
If the structure is too complex to be represented by a simple program (probably because it involves natural-language grammar), and you really need the output to be grammatical, go with Markov.
Otherwise, write up a crude approximation of the complex structure, and go with Queneau.
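
For the literal-minded, the three steps above boil down to a pair of yes/no questions. This is only a toy restatement: `choose_generator` and its two boolean parameters are invented for illustration, and answering those questions for a real corpus is still a judgment call.

```python
def choose_generator(structure_is_simple, must_be_grammatical):
    """Toy restatement of the three-step meta-algorithm."""
    if structure_is_simple:
        # Step 1: code up the structure and use Queneau.
        return "queneau"
    if must_be_grammatical:
        # Step 2: the structure is too complex to code and grammar matters.
        return "markov"
    # Step 3: approximate the structure crudely and use Queneau anyway.
    return "queneau with a crude structure"

if __name__ == "__main__":
    print(choose_generator(structure_is_simple=False, must_be_grammatical=True))
```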