sed – Gareth Halfacree

UPDATE, 20161210

When I originally wrote this post, the only way to download collections of files from the Internet Archive in bulk was to perform a manual search, process the resulting CSV, and feed that into wget in a rather inefficient process. Thankfully, that’s no longer the case: there’s now an official Internet Archive software package which includes a command-line too, ia, for performing a variety of actions – including downloading content in bulk.

So, here’s how you really download bulk content from the Internet Archive. You start by installing the tools:

sudo apt-get install python-pip

pip install internetarchive

You log in to your Internet Archive account via the tool:

ia configure

Then, finally, you trigger the download for your chosen collection name:

ia download --search='collection:personalcomputerworld' --no-directories --glob=\*pdf

Change the file extension at the end if you’d prefer formats other than PDF. If you’re a power-user (oooh) and have GNU Parallel installed, you can even run multiple download streams at once to speed things up:

ia search 'collection:personalcomputerworld' --itemlist | parallel 'ia download {} --no-directories --glob=\*pdf'

The original script and post are included below, for posterity.

Tag Archives: Sed

Bulk Downloading Collections from Archive.org