Archive.org is one of my favourite sites on the whole wide interwibble. It’s a massive storage facility – an archive, if you will – for digitised media of all kinds. Books, magazines, music, video – it’s all on there and completely free of charge.
Trouble is, actually getting it isn’t always easy.
Don’t get me wrong: the Archive.org website is great. Clicking on a magazine from 1988 and being able to read it right there in my browser, with page-turning animations and all, is something special – but I want to be able to download the files, stick ’em on my tablet and read them in the bath.
Thankfully, Archive.org is cognisant of this requirement: the site officially supports downloading in bulk using the excellent wget utility, but the list of instructions isn’t exactly straightforward. According to Archive.org, you need to use the advanced search function to return a list of identifiers in CSV format, then strip out a bunch of formatting and an excess line at the top, then feed the modified file to wget…
Wait, change the formatting of the file? Manually download the CSV through your web browser? Isn’t there a better way?
No, it turns out. Well, no, it turned out at the time. Yes, now, because I’ve written one: archivedownload.sh. It’s a hacked-together shell script that chains a few useful GNU tools – wget, sed and tail primarily – and makes it possible to download an entire collection from Archive.org at the command line, no messin’.
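For a flavour of what the sed and tail part involves, here’s a minimal sketch of the clean-up Archive.org asks for: drop the header line, strip the quotes, and leave one bare identifier per line. It assumes the CSV wraps each value in double quotes and has a single header row; the function name and the file names in the comment are mine, purely for illustration, and aren’t lifted from the script itself.

```shell
#!/bin/sh
# Turn Archive.org's CSV search output into a bare identifier list for wget.
# Reads CSV on stdin, writes one identifier per line on stdout.
# Assumption: one header row, and values wrapped in double quotes.
clean_identifiers() {
    # tail -n +2 drops the header row; sed strips the double quotes.
    tail -n +2 | sed 's/"//g'
}

# Example (search.csv and itemlist.txt are hypothetical file names):
#   clean_identifiers < search.csv > itemlist.txt
```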
All you need is the script – available on GitHub – and the name of the collection you want to download. For argument’s sake, let’s say I want a copy of every issue that Archive.org holds of Acorn Programs Magazine. First, I need the collection name: looking at an individual entry page, I see that’s “acorn-programs” – which I could probably have guessed, to be honest.
At my terminal – at which I’ve downloaded a copy of archivedownload.sh and made it executable – I need only issue the following command: “./archivedownload.sh acorn-programs”
Voilà: the script grabs the CSV identifier list from Archive.org’s advanced search, formats it in the way wget expects, and then tells wget to go off and download the PDF format files. Due to the way Archive.org is laid out, this takes a while: the CSV doesn’t contain the locations of the files, just their identifiers, so wget has to spider its way around the site to find them. For the five issues of Acorn Programs – for which the script downloads ten PDF files, five featuring scanned images and five featuring OCR’d reflowable text – wget will actually end up downloading nearly 200 files, most of which it discards. Wasteful, yes, but Archive.org seems to be happy doing it that way.
EDIT: Thanks to Archive.org’s Hank Bromley, the author of the original instructions for bulk downloads, the wget command is now a lot more efficient: instead of downloading 200 files for 10 PDFs, it now downloads 20 files. See this comment for details.
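For the curious, the chain described above can be sketched roughly as follows. This is not a copy of archivedownload.sh: the wget flags follow Archive.org’s published bulk-download instructions as I understand them, and the function names, the itemlist.txt file name and the rows=100000 cap are assumptions made for the example.

```shell
#!/bin/sh
# Build the advanced-search URL that returns a collection's identifiers as CSV.
# Assumption: rows=100000 is an arbitrary cap chosen for this sketch.
search_url() {
    printf 'https://archive.org/advancedsearch.php?q=collection%%3A%s&fl%%5B%%5D=identifier&rows=100000&output=csv\n' "$1"
}

# Fetch the identifier list for a collection, then let wget spider the PDFs.
fetch_collection() {
    collection=$1
    # Download the CSV, drop the header row and the quotes, save the identifiers.
    wget -q -O - "$(search_url "$collection")" | tail -n +2 | sed 's/"//g' > itemlist.txt
    # -r -l1: recurse one level from each item page; -H: allow other hosts;
    # -nc: skip files that already exist; -np: never ascend to the parent;
    # -nH --cut-dirs=1: drop the host name and first directory from saved paths;
    # -A .pdf: keep only PDFs; -e robots=off: ignore robots.txt;
    # -B: resolve each identifier in the list against the download base URL.
    wget -r -H -nc -np -nH --cut-dirs=1 -A .pdf -e robots=off -l1 \
        -i itemlist.txt -B 'https://archive.org/download/'
}

# Example: fetch_collection acorn-programs
```

Nothing runs until you call fetch_collection yourself, so it’s safe to paste into a file and read through first.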
The script downloads the PDF format files by default; you can change this on line 18 to download a different format, like ePub, if you’d prefer. Just find the section that says “-A .pdf”, change it to “-A .epub”, and re-run the script.
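If you’d rather not open an editor, the same swap can be done with sed. This is only a sketch: it assumes the script contains the literal text “-A .pdf”, and the function name and output file name are made up for the example. It writes a modified copy rather than touching the original.

```shell
#!/bin/sh
# Write a copy of a script with wget's accept filter switched from PDF to ePub.
# $1 = original script, $2 = where to write the modified copy.
make_epub_variant() {
    # Replace the first "-A .pdf" on each line with "-A .epub".
    sed 's/-A \.pdf/-A .epub/' "$1" > "$2"
}

# Example (archivedownload-epub.sh is a hypothetical file name):
#   make_epub_variant archivedownload.sh archivedownload-epub.sh
```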
Just a few things to note. The script downloads the PDF files to the current directory and, if it finds any text-format PDFs – not all collections have them – moves them into a sub-directory called “text”, which it will create if it doesn’t already exist. It’s also pretty indiscriminate in what it does, so run it in a folder that doesn’t have any existing PDFs if you don’t want weird things to happen.
If you want the script to continue downloading even after you close your terminal, call it via screen as “screen /directoryname/archivedownload.sh collection-name”, then press CTRL-A followed by CTRL-D to detach the session from your controlling terminal.
Oh, and if the script gets interrupted part-way through, delete the most recent PDF file – which will likely only be partially complete – and re-run the script: wget is set to no-clobber, so it won’t download any PDF files that have already been downloaded, which is handy if it fails towards the end of a large collection.
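The tidying-up step, where text-format PDFs get shuffled into a “text” sub-directory, might look something like the sketch below. It assumes the text versions are named with a “_text.pdf” suffix, which is how I’ve seen Archive.org name them but isn’t something this post confirms; the function name is hypothetical too. It also shows why running in a folder with existing PDFs is risky: anything matching the pattern gets moved, whether the script downloaded it or not.

```shell
#!/bin/sh
# Move text-format PDFs from the current directory into ./text.
# Assumption: text versions are named "<identifier>_text.pdf".
sort_text_pdfs() {
    for f in ./*_text.pdf; do
        # If nothing matched, the pattern is left unexpanded; skip it.
        [ -e "$f" ] || continue
        # Create the sub-directory only when there is something to move.
        mkdir -p text
        mv "$f" text/
    done
}

# Example: cd into the download folder, then run sort_text_pdfs
```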