Bulk Downloading Collections from Archive.org

UPDATE, 20161210

When I originally wrote this post, the only way to download collections of files from the Internet Archive in bulk was to perform a manual search, process the resulting CSV, and feed that into wget in a rather inefficient process. Thankfully, that’s no longer the case: there’s now an official Internet Archive software package which includes a command-line tool, ia, for performing a variety of actions – including downloading content in bulk.

So, here’s how you really download bulk content from the Internet Archive. You start by installing the tools:

sudo apt-get install python-pip

pip install internetarchive

You log in to your Internet Archive account via the tool:

ia configure

Then, finally, you trigger the download for your chosen collection name:

ia download --search='collection:personalcomputerworld' --no-directories --glob=\*pdf

Change the file extension at the end if you’d prefer formats other than PDF. If you’re a power-user (oooh) and have GNU Parallel installed, you can even run multiple download streams at once to speed things up:

ia search 'collection:personalcomputerworld' --itemlist | parallel 'ia download {} --no-directories --glob=\*pdf'
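For anyone who switches collections or file formats often, the download command above can be wrapped in a small shell function. This is only a sketch: the name ia_bulk is made up for illustration and is not part of the official tool, and the function simply reuses the flags already shown in this post.

```shell
# Hypothetical wrapper (not part of the official tool): ia_bulk COLLECTION [EXTENSION]
# Downloads every file in COLLECTION whose name ends in EXTENSION (default: pdf),
# using the same flags as the command shown in the post.
ia_bulk() {
    ia_bulk_collection=$1
    ia_bulk_extension=${2:-pdf}
    # The glob is quoted so the shell passes it to ia untouched.
    ia download --search="collection:${ia_bulk_collection}" \
        --no-directories --glob="*${ia_bulk_extension}"
}

# Example (left commented out because it needs network access):
# ia_bulk personalcomputerworld epub
```

Quoting the glob does the same job as the backslash in the original command: it stops the shell expanding the asterisk before ia sees it.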

The original script and post are included below, for posterity.

Archive.org is one of my favourite sites on the whole wide interwibble. It’s a massive storage facility – an archive, if you will – for digitised media of all kinds. Books, magazines, music, video – it’s all on there and completely free of charge.

Trouble is, actually getting it isn’t always easy.

Don’t get me wrong: the Archive.org website is great. Clicking on a magazine from 1988 and being able to read it right there in my browser, with page-turning animations and all, is something special – but I want to be able to download the files, stick ’em on my tablet and read them in the bath.

Thankfully, Archive.org is cognisant of this requirement: the site officially supports downloading in bulk using the excellent wget utility, but the list of instructions isn’t exactly straightforward. According to Archive.org, you need to use the advanced search function to return a list of identifiers in CSV format, then strip out a bunch of formatting and an excess line at the top, then feed the modified file to wget…

Wait, change the formatting of the file? Manually download the CSV through your web browser? Isn’t there a better way?

No, it turns out. Well, no it turned out at the time. Yes, now, because I’ve written one: archivedownload.sh. It’s a hacked-together shell script that chains a few useful GNU tools – wget, sed and tail primarily – and makes it possible to download an entire collection from Archive.org at the command line, no messin’.

All you need is the script – available on GitHub – and the name of the collection you want to download. For argument’s sake, let’s say I want a copy of every issue that Archive.org holds of Acorn Programs Magazine. First, I need the collection name: looking at an individual entry page, I see that’s “acorn-programs” – which I could probably have guessed, to be honest.

At my terminal – at which I’ve downloaded a copy of archivedownload.sh and made it executable – I need only issue the following command:

./archivedownload.sh acorn-programs

Voila: the script grabs the CSV identifier list from Archive.org’s advanced search, formats it in the way wget expects, and then tells wget to go off and download the PDF format files. Due to the way Archive.org is laid out, this takes a while: the CSV doesn’t contain the locations of the files, just their identifiers, so wget has to spider its way around the site to find them. For the five issues of Acorn Programs – for which the script downloads ten PDF files, five featuring scanned images and five featuring OCR’d reflowable text – wget will actually end up downloading nearly 200 files, most of which it discards. Wasteful, yes, but Archive.org seems to be happy doing it that way.
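The clean-up step described above, dropping the header line and the CSV formatting, can be sketched with tail and sed. This is not the actual contents of archivedownload.sh, which isn’t reproduced in this post. The function name is hypothetical, and the sketch assumes the export has a single identifier column with one header line and one double-quoted identifier per line.

```shell
# Hypothetical helper: turn an advanced-search CSV export into a plain list of
# identifiers, one per line.
# Assumption: the CSV has one header line followed by one double-quoted
# identifier per line, and nothing else.
csv_to_itemlist() {
    # tail -n +2 skips the header; sed removes the surrounding double quotes.
    tail -n +2 "$1" | sed 's/"//g'
}

# Example (search.csv and itemlist.txt are illustrative file names):
# csv_to_itemlist search.csv > itemlist.txt
```

The resulting file is what gets handed to wget through its -i option.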

EDIT: Thanks to Archive.org’s Hank Bromley, the author of the original instructions for bulk downloads, the wget command is now a lot more efficient: instead of downloading 200 files for 10 PDFs, it now downloads 20 files. See this comment for details.

The script downloads the PDF format files by default; you can change this in line 18 to download a different format like ePub if you’d prefer. Just find the section that says “-A .pdf” and change it to “-A .epub” and re-run the script.
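If you’d rather not edit the file by hand, the same substitution can be made with sed. This is a rough sketch: switch_format is a hypothetical helper name, and it assumes the script contains the literal text “-A .pdf” exactly as described above. It prints a modified copy rather than changing the original.

```shell
# Hypothetical helper: print a copy of a script with the wget accept list
# switched from .pdf to another extension.
# Usage: switch_format SCRIPT EXTENSION > new-script.sh
# Assumption: the script contains the literal text "-A .pdf".
switch_format() {
    sed "s/-A \.pdf/-A .$2/" "$1"
}

# Example (archivedownload-epub.sh is an illustrative output name):
# switch_format archivedownload.sh epub > archivedownload-epub.sh
```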

Just a few things to note. The script downloads the PDF files to the current directory and, if it finds any text-format PDFs – not all collections have them – moves them into a sub-directory called “text,” which it will create if it doesn’t already exist. It’s also pretty indiscriminate in what it does, so make sure you run it in a folder that doesn’t have any existing PDFs if you don’t want weird things to happen.

If you want the script to continue downloading even after you close your terminal, call it via screen as “screen /directoryname/archivedownload.sh collection-name” then press CTRL-A followed by CTRL-D to detach the session from your controlling terminal.

Oh, and if the script gets interrupted part-way through, delete the most recent PDF file – which will likely only be partially complete – and re-run the script: wget is set to no-clobber, so it won’t download any PDF files that have already been downloaded, which is handy if it fails towards the end of a large collection.
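The sorting of text-format PDFs into a “text” sub-directory can be sketched in a few lines of shell. This illustrates the behaviour described above rather than quoting the script itself; the function name sort_text_pdfs is hypothetical, and the _text.pdf naming follows the convention discussed in the comments below.

```shell
# Hypothetical helper: move any *_text.pdf files in the current directory
# into a "text" sub-directory, creating it only if there is something to move.
sort_text_pdfs() {
    for pdf in ./*_text.pdf; do
        # If the glob matched nothing, the pattern is left as-is; skip it.
        [ -e "$pdf" ] || continue
        mkdir -p text
        mv "$pdf" text/
    done
}

# Example: run from the directory holding the downloaded PDFs.
# sort_text_pdfs
```

Because the loop matches every file ending in _text.pdf, it is as indiscriminate as the script: any such PDF already in the directory gets moved too.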

Enjoy!

Posted in: linux, software ⋅ Tagged: archive.org, bash, github, sed, shell script, wget

17 thoughts on “Bulk Downloading Collections from Archive.org”

Hank Bromley (Internet Archive) on Thursday, April 4, 2013 at 18:20 said:
I think the “wastefulness” of downloading 200 files and discarding most of them is due to a misunderstanding about the PDF naming conventions. Your script is keeping only PDFs named *_text.pdf, and relatively few of the PDFs we make have that “_text” preextension. It occurs only when the original source file is a user-uploaded PDF that has no hidden text layer. In that case, after doing OCR, we make a new PDF with hidden text, and insert the “_text” preextension to avoid a name conflict with the original file.
If the user-uploaded PDF contains a text layer, we don’t make an additional PDF – the original serves fine for those who want a PDF. And when the original source files are in some form other than a PDF (typically a zip or tar containing individual page images), the PDF we make has no “_text” preextension.
In short, you generally want any file named *.pdf, and only if you get both a plain *.pdf and a *_text.pdf do you want to skip the first and keep the second (because it’ll have a text layer and the other one won’t).
Gareth on Thursday, April 4, 2013 at 18:23 said:
Thanks for the comment, Hank, but I think you’re actually misunderstanding the script: the script downloads all PDFs, but moves any that have the _text suffix into the “text” folder. All other PDFs remain in the directory from which archivedownload.sh was called, and definitely aren’t discarded – those are the ones I actually want!
The “wastefulness” comes from the spidering, not any misunderstanding about PDF naming conventions: wget is downloading lots of HTML files to find the location of the PDFs to download, then discarding said HTML files. That’s what I was referring to in my post.
Try it yourself: of the 162-odd files wget downloads for collection name “acorn-programs” (either using my script, or manually using Archive.org’s official instructions) only 10 are actually PDF files.
Hank Bromley (Internet Archive) on Thursday, April 4, 2013 at 19:02 said:
My apologies, I read the post too quickly and filled in with some wrong assumptions, based on the fact that the wget command didn’t behave as you describe when I first formulated it 2+ years ago – at that time it downloaded only the requested files. But clearly you’re right, it now grabs all kinds of extraneous stuff.
I get the expected results with this command:
wget -r -H -nc -np -nH --cut-dirs=2 -e robots=off -i ../acorn -B 'http://archive.org/download/' -A .pdf
i.e., the same as my original, but with “www.” dropped from the base URL. Some time ago – but after I last ran this wget – we arranged for http://www.archive.org requests to redirect to archive.org. I haven’t tracked down exactly why, but that redirect is apparently causing wget to see lots of additional links it wasn’t intended to. Dropping the “www.” makes it behave correctly again, at least with the version given above. (There are some differences between that version and yours, but I think you may have made those changes in response to the misbehavior, and may be able to revert some of them now.)
Gareth on Thursday, April 4, 2013 at 19:20 said:
Blimey, that’s significantly better: instead of downloading a total of 162 files to get ten PDFs, it’s now downloading 20 files! I’ve updated the script to include your revised wget with the different base URL, but I’ve retained one change: -nd instead of --cut-dirs=2, because I like it to drop the files in the current directory. Personal preference, nothing more.
Thanks for the heads-up – and thanks for the Internet Archive, too, it’s an absolutely amazing project.
Hank Bromley (Internet Archive) on Thursday, April 4, 2013 at 19:40 said:
Sure, all in one directory makes perfect sense here. The command was originally being used to download all the files in the specified items, and maintaining the directories was more suitable there.
Thank *you* for pointing out the problem, and for making it easier to do the bulk downloads. I need to get that blog entry of ours updated, and make sure your script gets plugged there, too.
Rom on Wednesday, April 9, 2014 at 16:50 said:
Will this help me downloading my whole lost website?
MaxM on Monday, June 1, 2015 at 14:52 said:
Thanks for the great job! I’m no expert in Linux or with scripts, but why is the script “going through” so many different websites? Shouldn’t it find all the *.pdf that I’m looking for under archive.org/downloads? I started the script and it downloaded the first 6 files I was looking for, but in the collection there are many more… Is it supposed to do that? I mean I don’t really understand why it isn’t downloading #7, #8,… instead of searching on so many different websites. Thanks!
sky4055 on Sunday, April 24, 2016 at 12:56 said:
works perfect, a very big thank you!
Hank Bromley (Internet Archive) on Sunday, April 24, 2016 at 15:48 said:
Since the time of this post, one of my colleagues has developed and released an open-source command-line tool that supports bulk operations on Archive items: uploading or downloading files, reading or modifying metadata, searching, etc. It’s likely of interest to Gareth and users of his archivedownload.sh script. See:
https://github.com/jjjake/internetarchive
https://internetarchive.readthedocs.org/en/latest/
av_archivist on Wednesday, December 21, 2016 at 18:51 said:
I am a digital archivist and teacher, really loved the archivedownload.sh and used it quite a bit before it was disabled. Now running the script yields Terminal output that refers me to “https://github.com/jjjake/internetarchive”.
I am not a programmer and don’t know Python, so I couldn’t figure out how to run the official tool from jjjake in Ubuntu, and couldn’t even install the program in macOS. The archivedownload.sh script appears to be easier for a non-programmer to use.
If anyone else is in this position, delete lines 3 – 11 of the archivedownload.sh script and it will work (as of today).
Gareth on Wednesday, December 21, 2016 at 18:54 said:
The instructions at the top of this post should get you started with the official tool – it’s really much faster, more powerful, and more flexible than the old script I hacked together, so I’d definitely recommend making the leap at some point!
Books to Download on Tuesday, July 11, 2017 at 01:35 said:
This is a gem!
Wow Gareth you are amazing! How can we make something like a web to apk app or something that could pull downloadables using your script?
Hyram on Thursday, July 20, 2017 at 18:29 said:
What I’d like to see is a nice GUI app to handle all this, instead of grubbing around in Terminal.
Pingback: archive.org tools and stuff | Blog | ____
krisstravelblog on Sunday, November 19, 2017 at 00:04 said:
There is a Chrome extension for that: it’s called “Archive Downloader” and you can find it here https://chrome.google.com/webstore/detail/archive-downloader/elhoagejfapekjaefenmngphliikoace
AliM on Friday, April 13, 2018 at 21:35 said:
Excellent reading article thanks for sharing
Teoma N on Thursday, January 28, 2021 at 14:00 said:
Excellent tool. The difference between normal and `parallel` method is huge
