Overbooked: Removing Library Data from ArchivesSpace

Over the last few months, Patrick Galligan of the digital strategies team and Amy Berish, Darren Young, and I of the processing team have worked on a collaborative project to export JSON representations of bibliographic data and delete all corresponding resource, subject, agent, and top container records in ArchivesSpace. This was the first step of a larger project to create a new interim discovery system for library data, separate from our archival discovery system, DIMES.

The RAC has a small, non-circulating library of more than 10,000 (mostly) non-unique published volumes that are accessible to researchers and staff. Over the years, the library data has been stored in systems designed to describe archival material (migrated from Re:Discovery to Archivists’ Toolkit, and from Archivists’ Toolkit to ArchivesSpace). While these systems could store this data, they were never quite the right tool for the job. The mismatch between the archival data these systems were designed to hold and the bibliographic data we stored in them has caused some data quality issues and, as will be outlined in a future post, unnecessary confusion for both researchers and RAC staff members.

Setting Up the Project

To provide structure to the library project, we held weekly sprint standups to keep the work moving forward. The weekly meetings helped us divide up tasks, talk about where we were stuck, and set reasonable goals. The processing team members were responsible for writing and running the scripts used for export and deletion, and Patrick was on hand to answer questions and to serve as project manager, running the weekly meetings. We wanted to complete the library export and deletion sometime in August.

Exporting Library Data and Associated Records

For the export process, we needed to write a script to retrieve library resource records and all accompanying agents, subjects, and top containers linked to those records. The script needed to save exported records as individual JSON files. The links between records would be preserved via the URI information within each library resource record’s JSON file.
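To make the idea of preserving links via URIs concrete, here is a minimal sketch of how the references inside a resource record might be gathered. It is not the code from our script: the function name linked_uris and the sample record are hypothetical, and the sketch assumes the usual ArchivesSpace JSON layout in which subjects and linked_agents are lists of objects with a ref key and top containers are referenced under each instance’s sub_container.

```python
def linked_uris(resource):
    """Collect the URIs a resource record points to, grouped by kind.

    Assumes ArchivesSpace-style JSON; missing keys are treated as empty.
    """
    refs = {"subjects": [], "agents": [], "top_containers": []}
    for subject in resource.get("subjects", []):
        refs["subjects"].append(subject["ref"])
    for agent in resource.get("linked_agents", []):
        refs["agents"].append(agent["ref"])
    for instance in resource.get("instances", []):
        top = instance.get("sub_container", {}).get("top_container", {})
        if "ref" in top:
            refs["top_containers"].append(top["ref"])
    return refs


# Hypothetical, trimmed-down record used only for illustration.
sample = {
    "uri": "/repositories/2/resources/1712",
    "subjects": [{"ref": "/subjects/101"}],
    "linked_agents": [{"ref": "/agents/people/55", "role": "creator"}],
    "instances": [
        {"sub_container": {"top_container": {"ref": "/repositories/2/top_containers/7"}}}
    ],
}
print(linked_uris(sample))
```

Because each exported file keeps these ref values untouched, the relationships can be rebuilt later by matching them against the uri of the exported agent, subject, and top container files.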

The initial export script was adapted from an RAC script called get_data.py, with a few key changes, chiefly restricting it to library JSON. The script we wrote, export_library_json.py, skips resources with identifiers that begin with AC (our accession records) or FA (finding aid records), leaving only our library records, whose identifiers begin with LI. Like get_data.py, the script exports the records as individual JSON files saved as ‘identifier.json’ based on the record’s unique identifier in ArchivesSpace, for example, 1712.json. These individual JSON files are saved in folders based on their jsonmodel_type (agent_corporate_entity, subjects, etc.).

The trickiest piece of this script was dealing with subjects. We had several duplicate subjects with identical titles but different URIs. When looking at a library resource record, only one of these duplicate subjects would actually be linked within the JSON. However, when visiting both subject pages in the ArchivesSpace user interface, the resource would appear to be linked to both of these subject records. After some research, we found that the ArchivesSpace user interface was linking records by matching the subject string rather than by following the actual link. So despite this initial scare, the script was working as intended. This issue has been addressed in a newer version of ArchivesSpace.

After running the script, we added the JSON files to a GitHub repository for storage. We ended up with a GitHub repo containing 10,575 library resource records, 4,569 corporate agent records, 5 family agent records, 8,882 person agent records, 7,750 subject records, and 12,763 top container records. Once we had the library data exported and stored on GitHub, we created a set of sample data that the digital strategies team could use in developing the library discovery system.
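The filtering and file layout described above can be sketched roughly as follows. This is an illustration rather than the contents of export_library_json.py: the helper names is_library_record and save_record are hypothetical, the sketch assumes the identifier prefix lives in the record’s id_0 field, and it works on already-fetched records instead of calling the ArchivesSpace API.

```python
import json
import tempfile
from pathlib import Path

# Prefixes to skip: accession (AC) and finding aid (FA) identifiers.
EXCLUDED_PREFIXES = ("AC", "FA")


def is_library_record(resource):
    """Return True when the identifier does not start with AC or FA.

    Assumes the prefix is stored in id_0 (an assumption for this sketch).
    """
    return not resource.get("id_0", "").startswith(EXCLUDED_PREFIXES)


def save_record(record, base_dir):
    """Write a record to <base_dir>/<jsonmodel_type>/<numeric id>.json."""
    identifier = record["uri"].rstrip("/").split("/")[-1]
    target_dir = Path(base_dir) / record["jsonmodel_type"]
    target_dir.mkdir(parents=True, exist_ok=True)
    path = target_dir / f"{identifier}.json"
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return path


# Hypothetical sample records for illustration only.
records = [
    {"uri": "/repositories/2/resources/1712", "id_0": "LI 100", "jsonmodel_type": "resource"},
    {"uri": "/repositories/2/resources/9", "id_0": "FA 001", "jsonmodel_type": "resource"},
]
out_dir = tempfile.mkdtemp()
saved = [save_record(r, out_dir) for r in records if is_library_record(r)]
print([str(p.relative_to(out_dir)) for p in saved])
```

Naming each file after the numeric part of the URI keeps file names unique within a record type, which is why the type-based folders matter: a resource and a subject can share the same number.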

Removing Library Resource Records from ArchivesSpace

After the data was exported and versioned on GitHub, we could remove the library resource records from ArchivesSpace. However, we did not want to remove bibliographic data from DIMES yet, because there would be a gap between removing the bibliographic data from ArchivesSpace and the launch of the new interim discovery system. Thankfully, our DIMES publication pipeline allows us to delete records in ArchivesSpace without automatically removing them from DIMES, but we will have to go back and remove them from DIMES once the entire project is completed.

On Patrick’s advice, we decided that the easiest and safest approach to removing the library resource records was to create a CSV containing the identifier information of every library resource record and delete from that list. Working from a CSV would ensure that we removed exactly the intended resource records and avoided any accidental deletions. We wrote a script, export_library_CSV.py, that fetches the URIs of the library resource records and writes them to a CSV file. Concurrently, we wrote delete_library.py, a script that deletes the objects listed in a CSV file. The delete_library.py script expects a CSV file containing just the numerical portion of record URIs and a column heading based on the jsonmodel_type of the records being deleted. After we tested the deletion script on our ArchivesSpace development server and created a backup, we let the script run over a weekend to delete more than 10,000 library resource records from ArchivesSpace production. This removed the most visible aspect of the library’s presence from ArchivesSpace, but we still needed to clean up the associated records.
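The CSV round trip described above might look something like the sketch below. It is not the code in export_library_CSV.py or delete_library.py: the function names write_id_csv and delete_from_csv are hypothetical, and the actual deletion is passed in as a callable so the example runs without an ArchivesSpace server. In real use that callable would be whatever API client method issues the DELETE request.

```python
import csv
import io


def write_id_csv(uris, jsonmodel_type, handle):
    """Write a one-column CSV: a jsonmodel_type heading, then numeric IDs."""
    writer = csv.writer(handle)
    writer.writerow([jsonmodel_type])
    for uri in uris:
        writer.writerow([uri.rstrip("/").split("/")[-1]])


def delete_from_csv(handle, uri_prefix, delete):
    """Call delete(uri) for each numeric ID in the CSV; return the URIs used."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the jsonmodel_type heading
    deleted = []
    for row in reader:
        # Ignore blank or malformed rows so nothing unexpected is deleted.
        if not row or not row[0].strip().isdigit():
            continue
        uri = f"{uri_prefix}/{row[0].strip()}"
        delete(uri)
        deleted.append(uri)
    return deleted


# Illustration with an in-memory file and a stand-in for the API call.
buffer = io.StringIO()
write_id_csv(
    ["/repositories/2/resources/1712", "/repositories/2/resources/1713"],
    "resource",
    buffer,
)
buffer.seek(0)
calls = []  # records what would have been deleted
result = delete_from_csv(buffer, "/repositories/2/resources", calls.append)
print(result)
```

Keeping the list of IDs in a file separates deciding what to delete from doing the deletion, so the list can be reviewed, and tested against a development server, before anything touches production.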

Cleaning Up Remaining Orphan Subjects, Agents, and Top Containers

We did not need to remove all of the subjects, agents, and top containers we originally exported for the interim library system, because many of the agents and subjects were still linked to finding aids and accessions that would stay in ArchivesSpace. To remove the extraneous data, we wrote more scripts that get identifier information for orphan subjects, agents, and top containers from ArchivesSpace and write that information to a CSV file. Completing this work would also tie up some loose ends from a previous agents cleanup project that had unlinked thousands of agents from the Ford Foundation Grants and Catalogued Reports finding aids. For this purpose, we wrote getorphan_subjects.py and getOrphan_agents.py and adapted a previously written RAC script, delete_orphan_containers.py. Once we had CSVs of identifiers for the orphan objects, we added functions to the delete_library.py script to remove the agents, subjects, and top containers listed in a CSV.

From our experience running the deletion script on the library resource records, we knew that it could take several days to run to completion. This time, though, we decided that staff could continue working in ArchivesSpace production while the script removed the remaining orphan objects, which freed us to run it during normal working hours. We still tested the deletion functions on development and made sure to create a backup before running the script each time. Overall, we removed 36,862 total orphan agents (20,264 corporate agents, 16,593 person agents, and 3 family agents), 7,284 orphan subjects, and 16,168 orphan top containers.
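One way to think about what makes a record an orphan is as a set difference: every candidate URI minus every URI still referenced by a record that remains. The sketch below illustrates that idea only. It is not necessarily how getorphan_subjects.py or getOrphan_agents.py decide what is orphaned, and the function name find_orphans and the sample data are hypothetical.

```python
def find_orphans(candidate_uris, remaining_records):
    """Return candidate URIs that no remaining record references.

    remaining_records are ArchivesSpace-style dicts whose subjects and
    linked_agents entries each carry a ref key.
    """
    referenced = set()
    for record in remaining_records:
        for subject in record.get("subjects", []):
            referenced.add(subject["ref"])
        for agent in record.get("linked_agents", []):
            referenced.add(agent["ref"])
    return sorted(set(candidate_uris) - referenced)


# Hypothetical data: one finding aid that still uses a subject and an agent.
remaining = [
    {
        "uri": "/repositories/2/resources/3",
        "subjects": [{"ref": "/subjects/101"}],
        "linked_agents": [{"ref": "/agents/people/55"}],
    }
]
candidates = ["/subjects/101", "/subjects/202", "/agents/people/55", "/agents/people/77"]
orphans = find_orphans(candidates, remaining)
print(orphans)
```

Only the URIs that survive this comparison would be written to the CSV for deletion, so anything still linked to a finding aid or accession is left alone.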

Project Reflections

The project went smoothly, and we were thrilled to complete all the tasks within the proposed timeline. The project benefited from the previous experience Amy, Darren, and I had working together on ArchivesSpace cleanup work. Since we started working from home back in March, the three of us have completed projects to clean up agents, legacy access notes, and dates. The coding and project management skills gleaned from these projects translated easily to this one. It was also enormously helpful to have Patrick on hand to answer the questions we had while writing the scripts. This made the process painless and moved the project forward each week. Stay tuned to the blog as the digital team takes on the next steps of the interim library project!
