
June Film Roundup:

Addendum: After last month's The Bit Player experiment, I've found that Film Roundup is the best place to list interesting films that I can't put on a wishlist because they're not yet products you can wishlist. This month's entry: Dance with Me, the tragedy (?) of a woman who's cursed to live in a musical. It's showing at the Japan Cuts festival later this month, but I was slow on the draw and all the tickets sold out. We'll see it later... and I'll see you later!

Beautiful Soup 4.8.0: I'm getting back into the swing of putting up a NYCB post when I complete a project. Yesterday I published a feature release of Beautiful Soup, 4.8.0. This release makes it easy to make fine-grained customizations to the input mechanism (the TreeBuilder class) and the output mechanism (the Formatter class).

This makes it easy to do things like change the rules about which attributes are treated as multi-value attributes. If you don't like how Beautiful Soup parses class into a list of CSS classes, this is the release for you. It's not a huge release, but this project's now fifteen years old so I'm relieved at how stable it's been.
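Here's a minimal sketch of the multi-valued attribute customization, assuming Beautiful Soup 4.8.0 or later is installed as the `bs4` package. The markup and the `class_attribute` helper are made up for illustration; `multi_valued_attributes` is the constructor keyword argument that controls this behavior.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
markup = '<p class="body strikeout">Some text</p>'


def class_attribute(markup, **kwargs):
    """Parse the markup and return the value of the <p> tag's class attribute.

    Extra keyword arguments are passed straight to the BeautifulSoup constructor.
    """
    soup = BeautifulSoup(markup, "html.parser", **kwargs)
    return soup.p["class"]


# Default behavior: class is split into a list of CSS classes.
default_value = class_attribute(markup)

# With multi_valued_attributes=None, no attribute is treated as
# multi-valued, so class comes back as a single string.
single_value = class_attribute(markup, multi_valued_attributes=None)

print(default_value)
print(single_value)
```

The first value printed is a list of two class names and the second is the original attribute string, untouched.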

Speaking of CSS, although this is a feature release, it's a little smaller than the 4.7.0 release I put out at the end of 2018. That one took out the lackluster implementation of CSS selectors, based on Simon Willison's "soupselect" project from the early 2010s. I replaced it with a dependency on Isaac Muse's SoupSieve project, which has a nearly complete CSS selector implementation. The old implementation was a common cause of complaints, but—like the HTML5 parsing algorithm—it's not something I have a strong interest in and I'm happy to give the whole job to an external dependency.

There was a period of about a year in 2017-2018 when I wasn't interested in doing Beautiful Soup work, but Tidelift changed that. Tidelift gathers subscription money from companies that rely on free software, and distributes the money to the developers in exchange for a level of support that I find sustainable.

Nobody builds an entire product around Beautiful Soup (or at least nobody will admit to doing this), but thousands of people have used Beautiful Soup to save time at their day jobs. Bundling Beautiful Soup together with bigger projects like Flask and numpy is a solution that works really well for me.

Secretly Public Domain: "Fun facts" are, sadly, often less than fun. But here's a genuinely fun fact: most books published in the US before 1964 are in the public domain! Back then, you had to send in a form to get a second 28-year copyright term, and most people didn't bother.

This is how Project Gutenberg is able to publish all these science fiction stories from the 50s and 60s. Those stories were published in issues of magazines that didn't send in the renewal form. But up till now this hasn't been a big factor, because 1) the big publishers generally made sure to send in their renewals, and 2) it's been impossible to check renewal status in bulk.

Up through the 1970s, the Library of Congress published a huge series of books listing all the registrations and the renewals. All these tomes have been scanned (Internet Archive has the registration books), but only the renewal information was machine-readable. Checking renewal status for a given book was a tedious job, involving flipping back and forth between a bunch of books in a federal depository library or, more recently, a bunch of browser tabs. Checking the status for all books was impossible, because the list of registrations was not machine-readable.

But! A recent NYPL project has paid for the already-digitized registration records to be marked up as XML. (I was not involved, BTW, apart from saying "yes, this would work" four years ago.) Now for anything that's unambiguously a "book", we have a parseable record of its pre-1964 interactions with the Copyright Office: the initial registration and any potential renewal.

The two datasets are in different formats, but a little elbow grease will mesh them up. It turns out that eighty percent of 1924-1963 books never had their copyright renewed. More importantly, with a couple caveats about foreign publication and such, we now know which 80%.
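The "elbow grease" amounts to a join between the two datasets. Here's a rough sketch of the idea, not the actual process used: it assumes each renewal record cites the original registration number and date, and every field name (`regnum`, `reg_date`, `orig_regnum`, `orig_reg_date`) and every record below is invented for the example. The real data needs much more careful matching than this.

```python
def normalize_regnum(regnum):
    """Reduce formatting differences, e.g. 'A 12345' and 'a12345' -> 'A12345'."""
    return "".join(ch for ch in regnum.upper() if ch.isalnum())


def split_by_renewal(registrations, renewals):
    """Split registrations into (renewed, unrenewed) lists.

    A registration counts as renewed when some renewal record cites the
    same normalized registration number and the same registration date.
    """
    renewed_keys = {
        (normalize_regnum(r["orig_regnum"]), r["orig_reg_date"])
        for r in renewals
    }
    renewed, unrenewed = [], []
    for reg in registrations:
        key = (normalize_regnum(reg["regnum"]), reg["reg_date"])
        (renewed if key in renewed_keys else unrenewed).append(reg)
    return renewed, unrenewed


# Invented sample records; none of these correspond to real books.
registrations = [
    {"regnum": "A 12345", "reg_date": "1950-03-01", "title": "Hypothetical Book One"},
    {"regnum": "A67890", "reg_date": "1951-07-15", "title": "Hypothetical Book Two"},
]
renewals = [
    {"orig_regnum": "a12345", "orig_reg_date": "1950-03-01"},
]

renewed, unrenewed = split_by_renewal(registrations, renewals)
print([reg["title"] for reg in unrenewed])  # ['Hypothetical Book Two']
```

Matching on both number and date is a conservative choice: a formatting mismatch in either field leaves a book looking unrenewed, so anything this flags would still need a human check before being treated as public domain.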

This was announced back in May, but I don't think it got the attention it deserved. This is a really big deal, so I had no choice but to create a bot. Here's Secretly Public Domain, which highlights unrenewed works that have already been scanned for Hathi Trust. This only represents 10% of the 80%, but it's the ten percent most likely to be interesting, and these books have the easiest path towards being available online.

August 9 update: topline number is closer to 73%, next steps for the public domain books, and how to get the data on your own computer.


Unless otherwise noted, all content licensed by Leonard Richardson
under a Creative Commons License.