Automating ArchivesSpace exports, or Better Living Through APIs

In preparation for upcoming changes to the display of digital objects in DIMES, I’ve been pursuing some enhancements to data export from ArchivesSpace. This began with a plugin to improve METS exports, including embedded MODS records, but then grew into a more comprehensive project to automate the export of updated resource records, version that data, and then push it to DIMES.

In the past, when a staff member made a change to a resource record, they’d have to email our Head of Processing, who would then export the finding aid, move it to a folder on our shared network, and from there a number of scripts would take over and move it to DIMES and kick off an indexing process.

While this process has worked well for us to this point, allowing for good quality control and communication of updates, it’s started to become a little burdensome. The number of finding aids in DIMES has grown rapidly over the last few years, and with the introduction of ArchivesSpace, we have more staff than ever finding and correcting minor errors in our finding aids. In addition, we didn’t have good version control over the exported data: updated files would simply replace older versions in DIMES. It was time for a streamlined solution that gave us a more robust, automated and transparent workflow.

To do this, we used a couple of tools that are becoming more common in archives: Git for version control and the ArchivesSpace REST API. Git is a tool that can track changes in files at a character level, which is very helpful if you want to see exactly how a file has changed over time, or if you want to step back to an earlier version. Although it has traditionally been used mainly by software developers and other people writing code, an increasing number of archives and libraries are using it to version their data as well. An API, which is short for Application Programming Interface, provides a means to interact with a system without having to alter the code of that system or write a program in the same language that system is written in. The ArchivesSpace API follows REST conventions, communicating over a network using HTTP requests and predictable URL patterns. What this means is that I just need to send ArchivesSpace a URL it recognizes and it will give me back the data I want.
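To make "send ArchivesSpace a URL it recognizes" a little more concrete, here is a minimal sketch of a request helper using only the Python standard library. It is an illustration rather than the code from our script: the function names are hypothetical, the backend address is an assumption (8089 is the usual default backend port), and it assumes you have already logged in and hold a session token, which the backend expects in the `X-ArchivesSpace-Session` header.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_url(base_url, path, params=None):
    """Join the backend address, an endpoint path and optional query parameters."""
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    if params:
        url += "?" + urlencode(params)
    return url


def get_json(base_url, path, session_token, params=None):
    """Send an authenticated GET request and decode the JSON response.

    Hypothetical helper: assumes session_token came from a prior login call.
    """
    request = Request(
        build_url(base_url, path, params),
        headers={"X-ArchivesSpace-Session": session_token},
    )
    with urlopen(request) as response:
        return json.load(response)
```

With something like this in place, `get_json("http://localhost:8089", "/repositories/2/resources", token, {"all_ids": "true"})` would ask for the IDs of every resource record in repository 2 (the repository number here is just an example).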

So after some work, I’ve come up with a Python script that uses the ArchivesSpace API to run through our repository, look for published resource records that have been updated since the last time the script ran, and then export EAD for those records. It will also generate PDF finding aids for those records, using a slightly modified version of the EAD to PDF converter developed by ArchivesSpace. Then it runs through all of the archival objects in ArchivesSpace to see if any of them have been updated, and if so exports EAD for the associated resource record and generates a PDF file. After that, it exports METS files for any digital object records associated with updated resources.
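The selection logic described above can be sketched roughly as follows. This is not the actual script, just an outline under a few assumptions: the records are dictionaries already fetched from the API, timestamps are read from the `system_mtime` field, and the function names are made up for the example. The EAD export path follows the `resource_descriptions` endpoint pattern.

```python
from datetime import datetime, timezone


def parse_mtime(value):
    """Parse a timestamp such as '2015-06-09T15:34:23Z' into an aware datetime."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def needs_export(resource, last_run):
    """True for published resource records modified after the last run."""
    return bool(resource.get("publish")) and parse_mtime(resource["system_mtime"]) > last_run


def resources_to_export(resources, archival_objects, last_run):
    """Collect URIs of resources that changed directly or through a component."""
    uris = {r["uri"] for r in resources if needs_export(r, last_run)}
    uris.update(
        ao["resource"]["ref"]
        for ao in archival_objects
        if parse_mtime(ao["system_mtime"]) > last_run
    )
    return sorted(uris)


def ead_export_path(resource_uri):
    """Turn a resource URI into the matching EAD export endpoint path."""
    repo_part, _, resource_id = resource_uri.rpartition("/resources/")
    return f"{repo_part}/resource_descriptions/{resource_id}.xml"
```

One thing this sketch glosses over: a changed archival object pulls in its parent resource without checking whether that resource is published, so a real script would want to confirm that before exporting.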

Once it’s done that, it will version those files using Git, then push to a GitHub repository.[1] From there, a GitHub webhook[2] will copy the updated EAD, METS and PDF files to DIMES and trigger an indexing job.
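For anyone curious what the versioning step might look like, here is a small sketch that shells out to the `git` command line from Python. The function names, the `origin` remote and the `master` branch are all assumptions made for the example, and the export directory is assumed to be an existing Git working copy with a remote already configured.

```python
import subprocess


def git_commands(message, remote="origin", branch="master"):
    """Return the git invocations used to stage, commit and push exported files."""
    return [
        ["git", "add", "--all"],
        ["git", "commit", "-m", message],
        ["git", "push", remote, branch],
    ]


def has_changes(repo_dir):
    """True if the working copy has anything to commit."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return bool(result.stdout.strip())


def version_and_push(repo_dir, message):
    """Commit and push only when the export actually changed something."""
    if not has_changes(repo_dir):
        return False
    for command in git_commands(message):
        subprocess.run(command, cwd=repo_dir, check=True)
    return True
```

Checking for changes first matters because `git commit` exits with an error when there is nothing to commit, which would otherwise stop the run on days when no records were updated.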

Here’s a key point: because the script looks for published resource records, we have to be careful not to mark the top level of any resource records in ArchivesSpace as published that we don’t actually want published! In order to make sure we started out with the right resource records published, I wrote a script (again using the ArchivesSpace API) that matches resource IDs against a list to determine which should be published and which should be unpublished.
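The matching step in that publish script can be illustrated with a short sketch. Again, this is an assumption-laden outline rather than the script itself: it assumes the resource records have already been fetched as dictionaries, that the list is compared against the first identifier part (`id_0`), and the function name is hypothetical.

```python
def plan_publish_changes(resources, ids_to_publish):
    """Return copies of the records whose publish flag needs to be flipped.

    resources: resource records as dictionaries (as returned by the API).
    ids_to_publish: identifiers (id_0 values) that should be published.
    """
    wanted = set(ids_to_publish)
    changes = []
    for resource in resources:
        should_publish = resource["id_0"] in wanted
        if bool(resource.get("publish")) != should_publish:
            # Copy the record so the caller's data is left untouched.
            changes.append(dict(resource, publish=should_publish))
    return changes
```

Each record returned here would then be sent back to ArchivesSpace with a POST to its own URI. Records that are already in the right state are skipped, so running the script a second time makes no further changes.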

I’d be remiss if I didn’t thank a number of people who helped me with this along the way. First of all, a big thank you to Sibyl Schaefer, our former Head of Digital Programs, for suggesting this workflow a few months back. We hope you like how we’ve implemented it! Andromeda Yelton, Dave Mayo, Mark Matienzo and Mark Triggs reviewed this code and provided excellent suggestions which made the script much easier to maintain, far more robust, and just generally better. I’m very grateful to have such a fantastic and generous professional network!

We’d be really happy to have other people give this code a spin and, if you discover any problems, create an issue or pull request in the repository!

[1] GitHub is a service that integrates with Git and provides some really nice tools on top of it. It’s also an easy way to make your code public, and to allow other people to contribute to a project. For example, the recently released schema for EAD3 was developed using GitHub.

[2] GitHub webhooks allow you to trigger additional actions based on events in a repository, for example when a new commit is pushed. They are really cool and I wish I’d known about them earlier.
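As a rough illustration of the receiving end of a webhook: when a secret is configured, GitHub signs each delivery and sends the signature in the `X-Hub-Signature-256` header, and names the event type in `X-GitHub-Event`. The sketch below shows only the checks a receiver might make before acting on a delivery. The function names are hypothetical and this is not the code running on our server.

```python
import hashlib
import hmac


def signature_is_valid(secret, body, signature_header):
    """Check a delivery's HMAC-SHA256 signature against the shared secret.

    secret and body are bytes; signature_header is the raw header value,
    which looks like 'sha256=<hex digest>' (or None if it was missing).
    """
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected.encode(), (signature_header or "").encode())


def is_push_event(event_header):
    """True when the X-GitHub-Event header says a commit was pushed."""
    return event_header == "push"
```

Only if both checks pass would the receiver go on to pull the updated files and start indexing.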
