The Crummy.com Review of Things 2018, Part One: Hey, how are you doing? I've been putting off writing this post because there's books and plays and etc. from 2018 I'd been meaning to write about, and I never did. Now I've got to get it out by way of explaining why these things I've never mentioned before are on my best-of-the-year list. So I'm just going to put the little essays I was going to write in here. It'll be a good time. Let's start with the easy one, where I already have detailed records on my consumption:

Film - There's nineteen new films on Film Roundup Roundup, but only films I hadn't seen before are eligible for the best-of awards, so no The Apartment or Fargo. Here's my top seven for 2018:

  1. The Court Jester (1955)
  2. Big Business (1988)
  3. The Death of Stalin (2017)
  4. your name. (2017)
  5. Sorry to Bother You (2018)
  6. Spider-Man: Into the Spider-Verse (2018)
  7. Lots of Kids, a Monkey, and a Castle (2017)

Kind of a surprising result for me; I remember reading the screenplay for The Court Jester back in the BBS days and thinking it wasn't funny at all. Even now, if you look at the IMDB quotes page it doesn't seem like a terribly funny movie. But what they filmed is funny as hell. The "flagon with the dragon" bit is a good example. It's a famous movie line that I find tiring in and of itself, but that line isn't the main joke; the jokes focus on the folly of using an annoying tongue twister as a mnemonic.

Theater - Sumana and I saw a few shows in 2018, and the one I liked the best was "The Play that Goes Wrong", which we saw on Broadway. Like Big Business in the Film section, this play shows a mastery of different types of comedy—verbal, physical, character, meta... It's constantly switching things up, setting up and claiming callbacks, and exploring every variant of its simple premise. Hits all my comedy buttons, basically.

Books - Two books I read recently that really stand out for me are And There I Stood With my Piccolo and But He Doesn't Know the Territory by Meredith Willson. Willson's main claim to fame is that he composed "The Music Man", and NYCB readers know how much I love that musical. After we watched The Apartment, Sumana said: "You know, the saddest part is he didn't get to use those 'Music Man' tickets."

Territory is an inspirational book about the incredibly frustrating eight-year process of writing and producing "The Music Man". It's really nice to read as someone who's trying to work on large long-term projects. But nearly as inspirational is Piccolo, a book Willson wrote and published in 1948, almost a decade before releasing the project he's remembered for today. At this point Willson is close to nobody in show biz, just a guy who works in radio, mostly behind the scenes. But he puts out this book of hilarious stories and hot takes anyway, because who cares? The work speaks for itself. Both of these are outstanding books full of great anecdotes.

In similar "funny person makes random observations" territory I really enjoyed the second volume of Mark Twain's autobiography. I read the first volume as a huge hardcover book and it was a big chore, but reading it as an ebook is a much better experience, especially since there's lots of good stuff in the end notes. Volume 2 has lots of Twain's thoughts on copyright, and his not exactly Mr. Rogers-esque experience of giving Congressional testimony on the topic. I was saving volume 3 for the new year, but guess what—this is the new year!

In 2018 I started reading Vikram Seth's Indian epic A Suitable Boy. Sumana is a huge fan, and this gives us a fun topic to discuss while she waits for the serially-delayed sequel, A Suitable Girl. It's really funny! I'm a couple hundred pages in and finally getting comfortable with all the characters and their relationships. But they keep adding more characters! BTW A Suitable Boy is one of those late-twentieth-century works where there just isn't an ebook available. It's pretty common, but not usually a big deal unless the book is both well-known and really long. The Power Broker is another example—I haven't read that one because it isn't physically compatible with the way I read now.

Other great books I read in 2018 include Hemingway's A Moveable Feast, Picking Up by Robin Nagle, Broad Band by Claire L. Evans, Wartime by Paul Fussell, and Lying For Money by Daniel Davies.

Broad Band starts off rehashing stuff I already knew about Ada Lovelace, but it really started surprising me after the end of WWII. There's a bit in Chapter 4 that gives me pause relating to the creation of COBOL. Like Javascript, COBOL was developed under an accelerated schedule. Unlike Javascript, the committee developing COBOL knew that everyone would be stuck for a really long time with whatever they came up with. But they decided to represent years as 2 digits anyway! I'd always assumed the Y2K problem was caused by a lack of foresight. But there was foresight, and they did it anyway! They weren't looking far enough ahead.
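To make the two-digit-year problem concrete, here is a minimal Python sketch. It is not COBOL and not taken from the book; the function name `years_between` and the sample years are made up for illustration.

```python
def years_between(start_yy: int, end_yy: int) -> int:
    """Subtract two years stored as two digits, with no century information."""
    return end_yy - start_yy


# A record created in 1985 and checked in 1999 looks fine.
age_in_1999 = years_between(85, 99)

# The same record checked in 2000 is stored as year 00, so the
# subtraction goes negative instead of giving 15.
age_in_2000 = years_between(85, 0)

print(age_in_1999, age_in_2000)
```

Nothing in the stored data says which century a year belongs to, so no later fix can recover it without guessing a cutoff.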

On that cheery note, I'll see you... in the future! Right now I'm going to go eat some food.

The Crummy.com Review of Things 2018, Part Two: Again, taking this post as an opportunity to discuss some things that maybe should have had their own entries, but let's take what we can get, huh?

Audio - Two recently discovered podcasts are worth your time: Farm to Taber, which focuses on the nuts and bolts of sustainable agriculture, and Gimme That Star Trek.

There are a ton of Star Trek podcasts that go episode-by-episode, but who has the time? In fact, I record an episode-by-episode Star Trek podcast and don't even release it, that's how much respect I have for your time. (If you do have the time, try Treks and the City.) "Gimme That Star Trek" mainly talks about the larger themes of Trek and ancillary material like the comics. Try "Is Starfleet Military?" and see if it grabs you.

Games - The Crummy.com Game of the Year is "Slay the Spire", which delivers my favorite part of roguelikes—emergent properties coming from random combinations of a large set of items. Honorable mention to "Dead Cells", which doesn't have much combo going on but is a fun feat of procedural generation.

I got a Switch in 2018 and haven't done anything super unusual with it but I have had a good time with the first-party games, especially "Breath of the Wild". I know I swore off Zelda games but the huge open world and side quests of Breath of the Wild made it easy to swallow the main arc, where a kid goes to four dungeons. "Nintendo games are fun" is an accurate but boring thing to say, so I'll say it but not dwell on it.

On my phone, I had a great time playing a game called Freeways, which I think will appeal to people who like Mini Metro. To me the darkness, the lonely desert, the directions identified only by highway numbers, brings back the nighttime Central California landscape I drove as a teenager. Honorable mention to Holedown. Dishonorable mention to another game that I won't name, which is a really good game but turns into gacha hell if you dare try to complete the main storyline.

Personal accomplishments - I finished a draft of Mine but it needs some serious work and I don't want to think about it right now, so moving on... I started putting my short fiction out there again and sold a story! ("Only g62 Kids Will Remember These Five Moments" from back in 2016.) Presumably will be published this year. Wrote five stories in 2018: "The Blanket Thief", "Why You Deserved to Die", "The Universe Pump", "The Wheel of Chores", and "The Procedure Sign". Got a good feeling about three of those, at least.

I'm coming up on the five-year mark of the Library Simplified project. It's an uphill battle, and 2018 didn't bring the breakthroughs I was hoping for, but we are making progress and there's no technical reason why this thing can't work, so I'm still hopeful.

The year in bots: I was mainly focused on other things, but I was inspired by the Internet Archive's holdings and API to create four new bots: Junk Mail Bot, Yorebooks, Podcast Roulette, and Almanac for New Yorkers, which premièred on January 1.

"Almanac for New Yorkers" is a replaying of an "urban almanac" for 1938 by the Federal Writers' Project. Advice on when to plant soybeans is replaced by info on what's playing at Carnegie Hall, and it's all written with that dry midcentury American wit that is better-known today from the WWII Army field guides these people would be writing in a couple years. There are two more of these -- 1939 for New York and 1938 for San Francisco -- so if the Almanac proves popular this year, I'll queue up another chunk for 2020.

Okay, I think that covers everything. If not... I'll just write another blog post! See you around!

January Film Roundup: Howdy-doo. I've completed my collection of Coen Brothers movies and I'm ready to pass judgement on the oeuvre as a whole. Also saw some disappointing Bollywood epics with Sumana. Let's get started!

The Art of Python: For a couple years Sumana has been mixing up the tech conference experience by adding aspects of performance and dramaturgy to her talks (see e.g. Python Grab Bag and Code Review, Forwards and Back). Now she's scaling it up by running an arts festival at this year's PyCon North America: "The Art of Python". You can submit proposals until the end of the month — music, dramatic performance, visual art, and so on.

I would love to see this become a regular feature of technical conferences. Many aspects of programming can't be expressed in traditional talks (xkcd does a lot of this), and it's also just fun to talk about programming in ways other than lectures -- I like to do it in fiction, for instance. If you're interested, check out the CFP!

February Film Roundup:

March Film Roundup: Just finished some rewrites for a novel, so... time to do more writing! At least you get to see this stuff right away!

April Film Roundup: It's been an action-packed April, as I watched the biggest blockbusters of 22, 28, and 48 years ago!

May Film Roundup: Missed a chance to see Claude Shannon doc The Bit Player (2018) at the museum, just making a note of it here so I remember to see it later if and when it becomes available online. Here are the movies I did see in May, often to my detriment:

June Film Roundup:

Addendum: After last month's The Bit Player experiment, I've found that Film Roundup is the best place to list interesting films that I can't put on a wishlist because they're not yet products you can wishlist. This month's entry: Dance with Me, the tragedy (?) of a woman who's cursed to live in a musical. It's showing at the Japan Cuts festival later this month, but I was slow on the draw and all the tickets sold out. We'll see it later... and I'll see you later!

Beautiful Soup 4.8.0: I'm getting back into the swing of putting up a NYCB post when I complete a project. Yesterday I published a feature release of Beautiful Soup, 4.8.0. This release makes it easy to make fine-grained customizations to the input mechanism (the TreeBuilder class) and the output mechanism (the Formatter class).

This makes it easy to do things like change the rules about which attributes are treated as multi-value attributes. If you don't like how Beautiful Soup parses class into a list of CSS classes, this is the release for you. It's not a huge release, but this project's now fifteen years old so I'm relieved at how stable it's been.
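A minimal sketch of the multi-valued attribute customization, assuming the `multi_valued_attributes` constructor argument as described in the Beautiful Soup documentation for this release. The sample markup is made up.

```python
from bs4 import BeautifulSoup

markup = '<p class="body strikeout"></p>'

# Default behavior: class is treated as a multi-valued attribute,
# so its value is split into a list of CSS classes.
default_soup = BeautifulSoup(markup, "html.parser")
print(default_soup.p["class"])

# Passing multi_valued_attributes=None turns the splitting off,
# so the attribute value stays a single string.
plain_soup = BeautifulSoup(markup, "html.parser", multi_valued_attributes=None)
print(plain_soup.p["class"])
```

The keyword is handed on to the TreeBuilder, so nothing needs to be subclassed for this simple case.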

Speaking of CSS, although this is a feature release, it's a little smaller than the 4.7.0 release I put out at the end of 2018. That one took out the lackluster implementation of CSS selectors, based on Simon Willison's "soupselect" project from the early 2010s. I replaced it with a dependency on Isaac Muse's SoupSieve project, which has a nearly complete CSS selector implementation. The old implementation was a common cause of complaints, but—like the HTML5 parsing algorithm—it's not something I have a strong interest in and I'm happy to give the whole job to an external dependency.

There was a period of about a year in 2017-2018 when I wasn't interested in doing Beautiful Soup work, but Tidelift changed that. Tidelift gathers subscription money from companies that rely on free software, and distributes the money to the developers in exchange for a level of support that I find sustainable.

Nobody builds an entire product around Beautiful Soup (or at least nobody will admit to doing this), but thousands of people have used Beautiful Soup to save time at their day jobs. Bundling Beautiful Soup together with bigger projects like Flask and numpy is a solution that works really well for me.

Secretly Public Domain: "Fun facts" are, sadly, often less than fun. But here's a genuinely fun fact: most books published in the US before 1964 are in the public domain! Back then, you had to send in a form to get a second 28-year copyright term, and most people didn't bother.

This is how Project Gutenberg is able to publish all these science fiction stories from the 50s and 60s. Those stories were published in issues of magazines that didn't send in the renewal form. But up til now this hasn't been a big factor, because 1) the big publishers generally made sure to send in their renewals, and 2) it's been impossible to check renewal status in bulk.

Up through the 1970s, the Library of Congress published a huge series of books listing all the registrations and the renewals. All these tomes have been scanned -- Internet Archive has the registration books—but only the renewal information was machine-readable. Checking renewal status for a given book was a tedious job, involving flipping back and forth between a bunch of books in a federal depository library or, more recently, a bunch of browser tabs. Checking the status for all books was impossible, because the list of registrations was not machine-readable.

But! A recent NYPL project has paid for the already-digitized registration records to be marked up as XML. (I was not involved, BTW, apart from saying "yes, this would work" four years ago.) Now for anything that's unambiguously a "book", we have a parseable record of its pre-1964 interactions with the Copyright Office: the initial registration and any potential renewal.

The two datasets are in different formats, but a little elbow grease will mesh them up. It turns out that eighty percent of 1924-1963 books never had their copyright renewed. More importantly, with a couple caveats about foreign publication and such, we now know which 80%.
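The "elbow grease" step is essentially a join between two lists of records. Here is a rough sketch under heavy assumptions: the field names (`regnum`, `renewal_id`), the titles, and the numbers are all invented for illustration, and the real datasets use different formats and need far messier matching.

```python
# Hypothetical, simplified records; "regnum" stands in for a registration number.
registrations = [
    {"regnum": "A100001", "title": "A Forgotten Novel", "year": 1950},
    {"regnum": "A100002", "title": "A Famous Novel", "year": 1951},
    {"regnum": "A100003", "title": "Another Obscure Title", "year": 1952},
]
renewals = [
    {"regnum": "A100002", "renewal_id": "R500001"},
]

# Collect every registration number that some renewal points back to.
renewed_regnums = {r["regnum"] for r in renewals}

# A registration with no matching renewal is a candidate "unrenewed" book.
unrenewed = [reg for reg in registrations if reg["regnum"] not in renewed_regnums]
share = len(unrenewed) / len(registrations)

for reg in unrenewed:
    print(reg["year"], reg["title"])
print(f"{share:.0%} of these sample registrations have no renewal on file")
```

An exact match on one ID is the easy case. It says nothing about foreign publication or mistyped numbers, which is where the caveats come in.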

This was announced back in May, but I don't think it got the attention it deserved. This is a really big deal, so I had no choice but to create a bot. Here's Secretly Public Domain, which highlights unrenewed works that have already been scanned for Hathi Trust. This only represents 10% of the 80%, but it's the ten percent most likely to be interesting, and these books have the easiest path towards being available online.

August 9 update: topline number is closer to 73%, next steps for the public domain books, and how to get the data on your own computer.

July Film Roundup:

Secretly Public Domain: Update: My "Secretly Public Domain" project got a lot of attention, which is great, but it also gave me a lot more work to do and pointed to some things that hadn't been explained very well. I've done that work, and here's an update:

Topline number is 73%

My original estimate was that 80% of pre-1964 books were not renewed. This was based on a couple of inaccurate assumptions, the big one being that I was counting works originally published in a foreign country. Those works might have lapsed into the public domain at some point, but the US copyright has since been restored by treaty. So their renewal status isn't really relevant.

Of the books where renewal status is relevant, here are the most recent statistics:

Credits

The "Secretly Public Domain" bot was a publicity stunt to draw attention to the machine-readable registration records. It worked great, but it also drew attention to me, the person doing the publicity stunt, even though I had basically nothing to do with the original work. For the record, here are the people who actually did the work. The project inside NYPL was run by Sean Redmond, Greg Cram, and Josh Hadro (now of IIIF). The work of making the copyright records machine-readable was done by Data Conversion Laboratory.

Buried treasure

Most of the books whose copyright wasn't renewed are really obscure titles, but without looking very hard I found a very well-known science fiction novel that has no renewal record. I'm not mentioning the name as an incentive to get people to look at the data themselves. It's probably not the only well-known work whose copyright wasn't renewed.

How to make your own list

My original estimate of 80% was based on the quick and dirty script I used to write the Mastodon bot. To fix the "foreign works" problem and to produce a dataset that would stand up to scrutiny, I published a Python library specifically for handling this data. It's got business logic for making determinations like "was this book published in a foreign country" and "how well does this renewal record match this registration record". You run the scripts and at the end you have a bunch of JSON files with consolidated data. If you think there are bad assumptions, you can change the business logic and run the scripts again.

How to see the data

There were a number of requests for this data in a tabular form. I totally understand where this is coming from, and it's certainly the easiest way to get into the data, but it's tricky, because converting the JSON to tabular data destroys information that would be useful for taking the next step (see below).

So, I've done the best I can. I added a script to the end of my Python workflow which generates three huge tab-separated files, and I put those files in the cce-spreadsheets project. This should be good for getting an overview of which books were renewed, which weren't, and which are foreign publications.
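For a sense of what the JSON-to-table step involves, and why it loses information, here is a small sketch using only the standard library. The record layout and column names are assumptions made up for this example, not the actual schema of the JSON files or of the cce-spreadsheets output.

```python
import csv
import io
import json

# Hypothetical consolidated records, one JSON document per line.
json_lines = [
    json.dumps({
        "title": "A Forgotten Novel",
        "year": 1950,
        "disposition": "not renewed",
        "renewals": [],
    }),
    json.dumps({
        "title": "A Famous Novel",
        "year": 1951,
        "disposition": "renewed",
        "renewals": [{"renewal_id": "R500001"}, {"renewal_id": "R500002"}],
    }),
]

columns = ["title", "year", "disposition", "renewal_count"]
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=columns, delimiter="\t", lineterminator="\n")
writer.writeheader()
for line in json_lines:
    record = json.loads(line)
    writer.writerow({
        "title": record["title"],
        "year": record["year"],
        "disposition": record["disposition"],
        # Flattening step: the nested renewal records collapse to a count,
        # which is the kind of detail a table throws away.
        "renewal_count": len(record["renewals"]),
    })

tsv_text = out.getvalue()
print(tsv_text)
```

The nested list of renewals has no natural home in a flat row, so a table is fine for an overview but the JSON is the better starting point for follow-up research.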

What's next?

Discovering that a book published in 1950 is in the public domain doesn't make a free digitized version of that book automatically appear. Somebody has to do the work. At this point we go from fast data processing to really slow research and digitization work. You or I can now make a near-complete list of unrenewed books in a few minutes, but that list just represents an enormous to-do list for someone.

There are basically three "someones" who might step up here: Project Gutenberg, Hathi Trust, and Internet Archive.

Project Gutenberg

As I mentioned earlier, Project Gutenberg digitized the copyright renewal records some time ago, and they use them all the time. They have a section of their Copyright How-To explaining how to check whether a particular title was renewed, and whether the renewal matters. There are other steps to clear a pre-1964 work: you have to verify that the author lived in the US at the time, stuff like that. The newly digitized registration records can help with some of this, and my data processing script that combines registration and renewal can help with more of it, but there's still some manual work you have to do for each book.

Once that work is done, Project Gutenberg volunteers will locate a copy of the book, scan it, and OCR it (assuming there's no existing scan). Then they'll proofread it and put out HTML and plain-text editions. As you can imagine, this process takes a really long time, but the result is a clean, accurate copy of the book that can be read on its own or reused in other projects. The catch is that somebody has to care enough about a specific book to go through all this trouble.

Hathi Trust

Hathi Trust already has scans of a lot of these 1924-1963 books. They just don't make these scans available to the public, because as far as they know, all these books are still under copyright. If they were convinced otherwise, they'd open up the scans—they opened up almost all of their 1923 stuff this January when the 95-year copyright term finally expired. So we have to make a case for opening up these books.

Earlier, NYPL took the highest-circulating 1924-1963 books in our research collection and checked to see which ones lacked a renewal record. We sent the list to Hathi Trust, and they did their own verification and opened up some of the books: The Americans in Santo Domingo from 1928 is an example. Once Hathi opens up a scan, it's available to the public. It also becomes possible for Gutenberg et al. to turn the raw scan into something more readable.

In the near future, people at NYPL (not me) will be talking to people at Hathi Trust about what kind of evidence is necessary, in general, to convince them that the copyright on a 1924-1963 book has lapsed. Then we'll be able to give them a list of all the books where we can find that kind of evidence. There'll still be a verification process on the Hathi Trust side -- at the very least, they have to go through the book and make sure it doesn't contain unauthorized reprints from other books -- but it should streamline things quite a bit.

Internet Archive

Internet Archive is a wild card here. They scan a lot of books, and I could see them treating the "unrenewed" list as a big list of additional books to scan, but it would be a new undertaking. Making unrenewed works available is something Project Gutenberg volunteers do already, and it's something that Hathi Trust could do relatively easily, but with Internet Archive it's more the sort of thing they'd do.

Data problems

That 8% of grey area, where it's not clear whether or not a book was renewed, points to the general difficulty of meshing together two sets of public records published across half a century and digitized by different people. The grey area represents a lot of manual work that has to be done, and of course there's always the fear that a book that seems to be free and clear actually isn't: the title page says "printed in Canada", or the smoking-gun copyright renewal didn't show up because its ID number was typed wrong.

There's going to be a lot of manual work in the process of clearing these books, but there's no reason to wait until everything's perfect to get started. My preference is to cast a very wide net, try to find any renewal that might possibly be related to a registration, and make the grey area as big as possible. We know that a majority of 1924-1963 books will always come up "no renewal", because there are way more registrations than renewals. We can deal with those and then take a closer look at the grey area.
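The "wide net" idea can be sketched as a three-way classification. This is an illustration only, not the matching logic of the actual Python library: the function names, the title-only comparison, and both thresholds are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def title_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score between two titles, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify(registration: dict, renewals: list,
             strong: float = 0.9, weak: float = 0.5) -> str:
    """Sort a registration into 'renewed', 'grey area', or 'no renewal'.

    The thresholds are arbitrary illustration values. Lowering `weak`
    casts a wider net and makes the grey area bigger.
    """
    best = max(
        (title_similarity(registration["title"], r["title"]) for r in renewals),
        default=0.0,
    )
    if best >= strong:
        return "renewed"
    if best >= weak:
        return "grey area"
    return "no renewal"


renewals = [{"title": "The Long Road Home"}]
for title in ["The Long Road Home", "Long Road Home", "Qqq"]:
    print(title, "->", classify({"title": title}, renewals))
```

Anything that lands in "no renewal" even with a generous `weak` threshold can be handled first, leaving the grey area for manual review.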

Other media

A couple of people asked whether it was possible to do this for other media. The good news is that there are volumes of the Catalog of Copyright Entries for:

All of these books have scans hosted at the Internet Archive. You can get an overview by looking at Penn's index of the CCE from a specific year, let's say 1960.

As far as I know--and I do know about one big exception--the rules here are the same as for books. If something wasn't registered, or the registration wasn't renewed, then the copyright on a work first published in the US 1924-1963 has lapsed.

Now, the bad news. We have scans of the Catalog of Copyright Entries, but the only part where both the registrations and the renewals are machine-readable is "Part 1 Class A". That's the "Books" part of "Books, Pamphlets, Serials, and Contributions to Periodicals", and it represents only about 30% of the total.

If you want to see whether there's a renewal record for a fishing map of Kansas, or a magazine article, or a cool retro ad, or a classic film noir, or a vintage restaurant placemat, it is quite possible, but it's a huge pain. And you can forget about running the numbers on all the movies or all the restaurant placemats. We don't have a good picture of what's in there.

The situation is this way because the Catalog of Copyright Entries is huge, and digitizing it is boring/expensive. Up to this point, book nerds are the only nerds who've put in the time and money to make "their" part of the CCE machine-readable. NYPL has plans to give this same treatment to the entire CCE, but the crucial part of the plan where we have money to pay someone to do this is currently missing; it's a matter for fundraising.

The second piece of bad news regards music. When we in 2019 think about "music", we think of sound recordings. When the CCE thinks about "music", it's thinking about the underlying composition—basically the stuff that would go on the sheet music. Until 1972 there was no federal-level copyright on sound recordings, and the result is that music copyrights are a bigger mess than other types of copyright. I do not want to get into territory I don't understand, but suffice to say that for a vinyl record to be in the public domain, it's necessary but not sufficient that the copyright on the underlying composition have expired. So the CCE can only help so much.

August Film Roundup: "Our shows" have either ended (Jane the Virgin, satisfying ending IMO) or are on summer break, so in August, Sumana and I ended up watching a lot of movies together.

September Film Roundup: This is not a film, but in September, Sumana and I played Untitled Goose Game and loved it. Check it out. Honk!

October Film Roundup: I saw a ton of movies this month and there was something fun or interesting in almost all of them! Here's the scoop:

NaNoGenMo 2019: "Linked by Love": This year I'm writing and announcing my NaNoGenMo project before November is over! "Linked by Love" is made from cunningly juxtaposed paragraphs of romance novel back-cover copy. Back-cover copy is some of the hardest stuff for an author to write, and it's basically treated as ephemeral, so it was fun to sort of give it its due in this project.

I originally had a much different book planned, something that would take a single individual on a universe-shifting journey, but it proved very difficult to determine the relationship between the referent of a sentence and the gendered pronouns in the sentence. Gender is very important to romance novels, so instead I let the proper nouns do the work and left the precise relationship between Carlotta_n and Carlotta_(n+1) a mystery for the reader to fill in.

November Film Roundup:

Unless otherwise noted, all content licensed by Leonard Richardson
under a Creative Commons License.