UPDATE, 20161210

When I originally wrote this post, the only way to download collections of files from the Internet Archive in bulk was to perform a manual search, process the resulting CSV, and feed that into wget in a rather inefficient process. Thankfully, that’s no longer the case: there’s now an official Internet Archive software package which includes a command-line too, ia, for performing a variety of actions – including downloading content in bulk.

So, here’s how you really download bulk content from the Internet Archive. You start by installing the tools:

sudo apt-get install python-pip
pip install internetarchive

You log in to your Internet Archive account via the tool:

ia configure

Then, finally, you trigger the download for your chosen collection name:

ia download --search='collection:personalcomputerworld' --no-directories --glob=\*pdf

Change the file extension at the end if you’d prefer formats other than PDF. If you’re a power-user (oooh) and have GNU Parallel installed, you can even run multiple download streams at once to speed things up:

ia search 'collection:personalcomputerworld' --itemlist | parallel 'ia download {} --no-directories --glob=\*pdf'

The original script and post are included below, for posterity.

Read More →


I upgraded my Galaxy SII to a custom version of Android 4.0.4 Ice Cream Sandwich the other night, and it’s been brilliant – except for one little bug. If I go into the Gallery app, images on my microSD are in a semi-random order. If I tell Android to group by time, the problem becomes apparent: large numbers of images, always in contiguous blocks, are thought to have been taken between 1992 and 2018. I didn’t have an Android phone in 1992, and 2018 hasn’t happened yet – clearly, it’s a bug.

A quick search turned up this Google Code issue, which matches what I’m seeing exactly: GPS data is sometimes corrupted as it writes to the image. Worryingly, the bug dates back to at least Android 2.3 Gingerbread – which I can confirm, as the images in question were taken while the Galaxy SII was still running 2.3.4 – but Google promises a fix is incoming.

Here’s an example of the problem. The GPS Date tag from a non-corrupt image reads 2012:10:30 – year, month, day with a colon delimiter. The GPS Date tag from a corrupt image reads 0112:09:26 – a clearly invalid date which is confusing the heck out of the Gallery app. (You can check this out yourself with the command exiftool -gps:GPSDateStamp filename.jpg)

But why have I only just spotted the problem? Simple: the ICS – and Jelly Bean – Gallery app has changed its behaviour from previous versions. Where the old Gingerbread Gallery used to use the file modified date to sort the images, the ICS and newer Gallery uses the GPS Date Stamp tag from the EXIF data – the very tag that is being corrupted by the bug. Why it does that, I have no idea: there are numerous other EXIF tags which have the right date stamp and would be more appropriate for sorting, such as the Date Time (Original) tag. Perhaps that’s Google’s fix: modify the Gallery to sort on a different tag.

A pending fix doesn’t solve the fact that I’ve got a few hundred images with corrupt EXIF information, though. Switching to a third-party image viewer, like QuickPic, orders the images correctly, but that’s a workaround at best. So, I’ve written a quick tool to correct the invalid EXIF data on all my images.

It’s a bash shell script, and simple enough: it calls the ExifTool Perl package to copy the date from the Date Time (Original) tag over the top of the GPS Date Stamp tag. The result: the invalid tag goes away, Android knows what year the image was taken, and it orders the gallery correctly. It’s a hack: it overwrites the GPS Date Stamp of all JPG images in the current directory, regardless of whether the year makes sense or not, but it fixes the problem – albeit slowly.

You can find the script on GitHub. Hope it helps!

When you’ve corrected your images, you’ll need to go into Settings and Applications on your Android phone. Choose ‘All,’ find the Gallery app and tap it to get into the menu. Hit ‘Force Stop’ and then ‘Clear Data’ – don’t worry, you won’t lose any images, this just deletes the file database and thumbnail cache. If you don’t, the Gallery will still use the old ordering data and you’ll still have a problem. When that’s done, load the Gallery, let it find your files, and enjoy having all your images in the right order.

Oh, and one other thing: take a backup of your images before running the script. Although it worked fine for me, I can’t guarantee it won’t corrupt images further – and I don’t want to be responsible for you losing your precious photos.