Archive.org is one of my favourite sites on the whole wide interwibble. It’s a massive storage facility – an archive, if you will – for digitised media of all kinds. Books, magazines, music, video – it’s all on there and completely free of charge.

Trouble is, actually getting it isn’t always easy.

Don’t get me wrong: the Archive.org website is great. Clicking on a magazine from 1988 and being able to read it right there in my browser, with page-turning animations and all, is something special – but I want to be able to download the files, stick ‘em on my tablet and read them in the bath.

Thankfully, Archive.org is cognisant of this requirement: the site officially supports downloading in bulk using the excellent wget utility, but the list of instructions isn’t exactly straightforward. According to Archive.org, you need to use the advanced search function to return a list of identifiers in CSV format, then strip out a bunch of formatting and an excess line at the top, then feed the modified file to wget…

Wait, change the formatting of the file? Manually download the CSV through your web browser? Isn’t there a better way?

No, it turned out at the time. There is now, though, because I’ve written one: archivedownload.sh. It’s a hacked-together shell script that chains a few useful GNU tools – primarily wget, sed and tail – and makes it possible to download an entire collection from Archive.org at the command line, no messin’.

All you need is the script – available on GitHub – and the name of the collection you want to download. For argument’s sake, let’s say I want a copy of every issue that Archive.org holds of Acorn Programs Magazine. First, I need the collection name: looking at an individual entry page, I see that’s “acorn-programs” – which I could probably have guessed, to be honest.

At my terminal – at which I’ve downloaded a copy of archivedownload.sh and made it executable – I need only issue the following command:

./archivedownload.sh acorn-programs

Voila: the script grabs the CSV identifier list from Archive.org’s advanced search, formats it in the way wget expects, and then tells wget to go off and download the PDF format files. Due to the way Archive.org is laid out, this takes a while: the CSV doesn’t contain the locations of the files, just their identifier, so wget has to spider its way around the site to find them. For the five issues of Acorn Programs – for which the script downloads ten PDF files, five featuring scanned images and five featuring OCRd reflowable text – wget will actually end up downloading nearly 200 files, most of which it discards. Wasteful, yes, but Archive.org seems to be happy doing it that way.
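The clean-up stage of that pipeline can be sketched in a few lines. This is a hedged illustration rather than the script itself: the sample CSV shape (a quoted “identifier” header followed by one quoted identifier per line, as the advanced search returns in CSV mode) and the file names itemlist.csv and itemlist.txt are assumptions for demonstration.

```shell
# Sketch of the CSV clean-up stage only (not the actual script).
# Assumed CSV shape: a quoted header line, then one quoted
# identifier per line.
cat > itemlist.csv <<'EOF'
"identifier"
"acorn-programs-issue-01"
"acorn-programs-issue-02"
EOF

# tail drops the excess header line; sed strips the quoting,
# leaving bare identifiers that wget can join to a base URL via -B.
tail -n +2 itemlist.csv | sed 's/"//g' > itemlist.txt
cat itemlist.txt
```

The cleaned list is then handed to wget with -i itemlist.txt alongside a -B base URL, which is exactly the hand-off the official instructions describe.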

EDIT: Thanks to Archive.org’s Hank Bromley, the author of the original instructions for bulk downloads, the wget command is now a lot more efficient: instead of downloading 200 files for 10 PDFs, it now downloads 20 files. See this comment for details.

The script downloads the PDF format files by default; you can change this in line 18 to download a different format like ePub if you’d prefer. Just find the section that says “-A .pdf” and change it to “-A .epub” and re-run the script.
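If you’d rather not edit the file by hand, the same change can be made with sed. This is a hedged sketch demonstrated on a stand-in line rather than the real script – the exact wording of line 18 is an assumption here:

```shell
# Stand-in for the wget invocation in the script; the real line
# will differ, but the accept-pattern substitution is the same.
line='wget -r -H -nc -np -nd -e robots=off -A .pdf'
echo "$line" | sed 's/-A \.pdf/-A .epub/'
# -> wget -r -H -nc -np -nd -e robots=off -A .epub
```

On the real file that would be `sed -i 's/-A \.pdf/-A .epub/' archivedownload.sh` with GNU sed – keep a backup copy first.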

Just a few things to note. The script downloads the PDF files to the current directory; if it finds any text-format PDFs – not all collections have them – it moves them into a sub-directory called “text”, creating it if it doesn’t already exist. It’s also pretty indiscriminate in what it does, so run it in a folder with no existing PDFs if you don’t want weird things to happen. If you want the script to continue downloading even after you close your terminal, call it via screen as “screen /directoryname/archivedownload.sh collection-name”, then press CTRL-A followed by CTRL-D to detach the session from your controlling terminal. Finally, if the script gets interrupted part-way through, delete the most recent PDF file – which will likely be only partially complete – and re-run the script: wget is set to no-clobber, so it won’t re-download any PDFs that are already present, which is handy if it fails towards the end of a large collection.

Enjoy!

6 Thoughts on “Bulk Downloading Collections from Archive.org”

  1. I think the “wastefulness” of downloading 200 files and discarding most of them is due to a misunderstanding about the PDF naming conventions. Your script is keeping only PDFs named *_text.pdf, and relatively few of the PDFs we make have that “_text” preextension. It occurs only when the original source file is a user-uploaded PDF that has no hidden text layer. In that case, after doing OCR, we make a new PDF with hidden text, and insert the “_text” preextension to avoid a name conflict with the original file.

    If the user-uploaded PDF contains a text layer, we don’t make an additional PDF – the original serves fine for those who want a PDF. And when the original source files are in some form other than a PDF (typically a zip or tar containing individual page images), the PDF we make has no “_text” preextension.

    In short, you generally want any file named *.pdf, and only if you get both a plain *.pdf and a *_text.pdf do you want to skip the first and keep the second (because it’ll have a text layer and the other one won’t).

  2. Thanks for the comment, Hank, but I think you’re actually misunderstanding the script: the script downloads all PDFs, but moves any that have the _text suffix into the “text” folder. All other PDFs remain in the directory from which archivedownload.sh was called, and definitely aren’t discarded – those are the ones I actually want!

    The “wastefulness” comes from the spidering, not any misunderstanding about PDF naming conventions: wget is downloading lots of HTML files to find the location of the PDFs to download, then discarding said HTML files. That’s what I was referring to in my post.

    Try it yourself: of the 162-odd files wget downloads for collection name “acorn-programs” (either using my script, or manually using Archive.org’s official instructions) only 10 are actually PDF files.

  3. My apologies, I read the post too quickly and filled in with some wrong assumptions, based on the fact that the wget command didn’t behave as you describe when I first formulated it 2+ years ago – at that time it downloaded only the requested files. But clearly you’re right, it now grabs all kinds of extraneous stuff.

    I get the expected results with this command:

    wget -r -H -nc -np -nH --cut-dirs=2 -e robots=off -i ../acorn -B 'http://archive.org/download/' -A .pdf

    i.e., the same as my original, but with “www.” dropped from the base URL. Some time ago – but after I last ran this wget – we arranged for www.archive.org requests to redirect to archive.org. I haven’t tracked down exactly why, but that redirect is apparently causing wget to see lots of additional links it wasn’t intended to. Dropping the “www.” makes it behave correctly again, at least with the version given above. (There are some differences between that version and yours, but I think you may have made those changes in response to the misbehavior, and may be able to revert some of them now.)

  4. Blimey, that’s significantly better: instead of downloading a total of 162 files to get ten PDFs, it’s now downloading 20 files! I’ve updated the script to include your revised wget with the different base URL, but I’ve retained one change: -nd instead of --cut-dirs=2, because I like it to drop the files in the current directory. Personal preference, nothing more.

    Thanks for the heads-up – and thanks for the Internet Archive, too, it’s an absolutely amazing project.

  5. Sure, all in one directory makes perfect sense here. The command was originally being used to download all the files in the specified items, and maintaining the directories was more suitable there.

    Thank *you* for pointing out the problem, and for making it easier to do the bulk downloads. I need to get that blog entry of ours updated, and make sure your script gets plugged there, too.

  6. Will this help me download my whole lost website?
