
In Search of the Beautiful Soup Double-Dippers: Recently I noticed that certain IPs were using distribute or setuptools to download the Beautiful Soup tarball multiple times in a row. I'm not sure why distribute and setuptools are downloading Beautiful Soup from crummy.com instead of from PyPI in the first place, especially since PyPI registers almost 150k downloads of the latest BS4. Why do some people use PyPI and not others?

If anyone knows how to convince everyone to use PyPI, I'd appreciate the knowledge. But it's not a big deal right now, and it gives me some visibility into how people are using Beautiful Soup. Visibility which I will share with you.

Yesterday, the 17th, the Beautiful Soup 4.1.3 tarball was downloaded 2223 times. It is by far the most popular thing on crummy.com. The second most popular thing is the Beautiful Soup 3.2.1 tarball, which was downloaded 381 times. The vast majority of the downloads were from installation scripts: distribute or setuptools.

1516 distinct IP addresses were responsible for the 2223 downloads of 4.1.3. I wrote a script to find out how many IP addresses downloaded Beautiful Soup more than once. The results:

Downloads from a single IP   Number of times this happened
55                              1
35                              1
15                              1
13                              1
11                              1
 5                              2
 4                             12
 3                             43
 2                            453
 1                           1001
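A script along these lines could produce that tally. This is a sketch, not my actual script: it assumes Apache combined log format (client IP as the first field), and the tarball filename is illustrative.

```python
from collections import Counter

def download_counts(log_lines, needle="beautifulsoup4-4.1.3.tar.gz"):
    """Count downloads of one file per client IP.

    log_lines is an iterable of access-log lines in Apache combined
    format, where the client IP is the first whitespace-separated field.
    """
    per_ip = Counter()
    for line in log_lines:
        if needle in line:
            per_ip[line.split(None, 1)[0]] += 1
    return per_ip

# The table above is then the distribution of those per-IP counts:
# Counter(download_counts(open("access.log")).values())
```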

Naturally my attention was drawn to the outliers at the top of the table. I investigated them individually. The IP address responsible for 55 downloads is a software company of the sort that might be deploying to a bunch of computers behind a proxy. The 35 is an individual on a cable modem who, judging from their other traces on the Internet, is deploying to a bunch of computers using Puppet. The 15, the 13, and the 11 are all from Travis CI, a continuous integration service.

One of the two 5s was an Amazon EC2 instance. Five of the twelve 4s were Amazon EC2 instances. Thirty-seven of the forty-three 3s were Amazon EC2 instances. And 395 of the 453 double-dippers were Amazon EC2 instances. Something's clearly going on with EC2. (There was also one download from within Amazon corporate, among other BigCo downloaders.)
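One way to classify an address as EC2 (not necessarily the method I used) is reverse DNS: EC2 public IPs typically have PTR records like ec2-54-0-0-0.us-west-2.compute.amazonaws.com. A sketch, with the resolver swappable so the check can be tested without real DNS lookups:

```python
import socket

def looks_like_ec2(ip, resolve=None):
    """Guess whether an IP is an EC2 instance from its reverse-DNS name.

    resolve defaults to a real PTR lookup, but can be replaced with any
    callable mapping an IP string to a hostname.
    """
    if resolve is None:
        resolve = lambda addr: socket.gethostbyaddr(addr)[0]
    try:
        hostname = resolve(ip)
    except (socket.herror, OSError):
        return False  # no PTR record: assume not EC2
    return hostname.endswith(".amazonaws.com")
```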

I hypothesized that most of the duplicate requests come from Amazon EC2 instances being wiped and redeployed. To test this hypothesis I went through all the double-dippers and calculated the time between the first request and the second. My results are in this scatter plot. Each point on the plot represents an IP address that downloaded Beautiful Soup twice yesterday.

For EC2 instances, the median time between requests is 11 hours and 45 minutes. So EC2 instances are being automatically redeployed roughly twice a day. For non-EC2 instances, the median time between requests is 51 minutes, and the modal time is about zero. Those people set up a dev environment, discover that something doesn't work, and try it again from scratch.
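The gap-and-median calculation can be sketched like this, assuming the request timestamps have already been parsed into datetime objects and grouped by IP (the function names are mine, for illustration):

```python
from statistics import median

def first_gap_hours(times):
    """Hours between an IP's first and second request."""
    first, second = sorted(times)[:2]
    return (second - first).total_seconds() / 3600

def median_gap_hours(times_by_ip):
    """Median first-to-second gap over all IPs with at least two requests."""
    return median(first_gap_hours(t)
                  for t in times_by_ip.values() if len(t) >= 2)

# To compare the two populations, split times_by_ip into EC2 and
# non-EC2 groups first, then compute median_gap_hours() for each.
```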



Unless otherwise noted, all content licensed by Leonard Richardson
under a Creative Commons License.