10th International Conference on Preservation of Digital Objects 

I was fortunate to attend this conference a few weeks ago and want to share some of what I learned. It was a new experience for me in several ways. For starters, it was the first conference I've attended focused solely on digital preservation. It was also much more research-focused than the conferences I normally attend, which I found refreshing. The mix of attendees was notable too: there were both technologists and information specialists, but many attendees seemed to represent large national libraries. While some archivists were present, I felt a lot of presentations approached preservation topics from a library, rather than archival, perspective. For example, some of the web archiving presentations focused on preserving and presenting the information on a web page, rather than preserving that information in context and as evidence.

Some highlights from the presentations:

Archives New Zealand Migration from Fedora Commons to the Rosetta Digital Preservation System

This presentation from Archives New Zealand discussed the migration of archival objects from Fedora to Rosetta. They had originally implemented Fedora as a temporary measure until a more permanent solution was developed. Part of the reason Fedora wasn't considered suitable in the long term was its limit on archival objects: it caps out at a million objects, and Archives New Zealand has far more objects than that. By the time the Rosetta system was up and running, ANZ had 48TB of data to migrate into it. They split the data into multiple ingests, with the entire ingest process lasting almost a year. During ingest, about half a TB of data was identified as problematic. The presentation (and linked paper) details some of the issues they discovered when delving into these problematic materials. Quite a few of them could be traced to a scanner that created faulty TIFFs, which were then not validated correctly by JHOVE during the migration. This underscores the necessity of checking files produced by vendors (in their case, the files came from other branches of the Archive and weren't validated upon receipt).
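As an aside, the kind of validation that tripped them up is easy to script at the point of receipt. Here's a minimal sketch, not from the paper, that shells out to the JHOVE command-line tool with its TIFF module and flags files that don't come back as well-formed and valid; the `jhove` executable name, the directory, and the exact status string it greps for are my assumptions about a typical JHOVE setup.

```python
import subprocess
from pathlib import Path

def validate_tiff(path: Path, jhove_cmd: str = "jhove") -> bool:
    """Run JHOVE's TIFF module against a file and report whether it is
    considered well-formed and valid. Assumes `jhove` is on PATH and that
    its plain-text output includes a 'Status:' line."""
    result = subprocess.run(
        [jhove_cmd, "-m", "TIFF-hul", str(path)],
        capture_output=True,
        text=True,
    )
    # JHOVE's text handler reports e.g. "Status: Well-Formed and valid"
    return "Status: Well-Formed and valid" in result.stdout

if __name__ == "__main__":
    # Check every TIFF in a (hypothetical) transfer directory before ingest
    for tiff in Path("incoming_transfer").glob("**/*.tif"):
        if not validate_tiff(tiff):
            print(f"Problem file, review before ingest: {tiff}")
```

In ANZ's case the lesson is less about the tooling than about where it runs: validating at the point of receipt would have caught the faulty scanner output long before a 48TB migration.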

Studies on the scalability of web presentation

Also interesting was this study, presented by Tessella folks, on characterizing embedded files within WARC containers. WARCs contain the aggregated files plus metadata from web crawls. Web crawls are essentially collections of web page captures; within these content blocks are files like images, PDFs, etc., which are all packaged into the WARC format. This study examined whether those embedded files could be characterized (using DROID, which identifies file formats and versions) in order to determine migration strategies. The study then migrated files deemed at preservation risk to different formats and re-packaged the WARC file. This presentation sparked a bit of debate: there were questions about how the migrated file formats would affect the integrity of the WARC file as an archival object. Another issue is that Wayback, the predominant viewer for WARC files, doesn't work well with the migrated files, creating an access issue for those WARCs. The upshot is that although we've developed tools for capture and playback of web crawls, we're still grappling with their inherent complexity.
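To make the "embedded files inside a WARC" idea concrete, here's a small sketch of how you might walk a WARC and list the payloads that would be handed off to a characterization tool. It uses the warcio Python library, which the study itself doesn't mention (they used DROID for identification), so treat it as an illustration rather than their method; the filename is a placeholder.

```python
from collections import Counter

from warcio.archiveiterator import ArchiveIterator

def summarize_warc(warc_path: str) -> Counter:
    """Tally the declared media types of response records in a WARC.
    In a real workflow each payload would be written out and passed to a
    format identification tool such as DROID."""
    counts = Counter()
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            if record.http_headers is None:
                continue
            content_type = record.http_headers.get_header("Content-Type", "unknown")
            # Strip parameters such as "; charset=utf-8"
            counts[content_type.split(";")[0].strip()] += 1
    return counts

if __name__ == "__main__":
    # "crawl.warc.gz" is a placeholder filename
    for media_type, count in summarize_warc("crawl.warc.gz").most_common():
        print(f"{count:6d}  {media_type}")
```

Counting declared Content-Type headers is only a first pass; signature-based identification with a tool like DROID matters precisely because the declared type and the actual format of an embedded file often disagree.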

Database Preservation Evaluation Report – SIARD vs CHRONOS

I found this presentation, which evaluated database archiving tools, pretty amazing, probably in part because I knew so little about the tools beforehand. I confess there's still a lot of complexity I don't quite understand, but it was helpful to see what criteria they identified for evaluating and comparing the database archiving tools.

An Analysis of Contemporary JPEG2000 Codecs for Image Format Migration

I sometimes find it amazing how, in the realm of digital preservation, something seemingly straightforward untangles into incredible complexity upon closer investigation. Take the JPEG2000 format, for example. JPEG2000 is a standard that was developed to improve on the JPEG format. It's been adopted by the preservation world to address two main problems with TIFFs:

  1. The TIFF file format is proprietary: it's controlled by Adobe, which hasn't updated it since 1992. Although the format is still widely supported, it will eventually be superseded by more actively developed formats.
  2. TIFFs take up a lot of space. JPEG2000 was developed to take up less space while maintaining image fidelity.

Given the adoption of JPEG2000 in libraries for digital preservation purposes, it was unexpected to find that there are multiple JPEG2000 codec libraries, some of them commercial. This study describes an experiment testing the quality of each codec after several rounds of lossy compression. It highlights that not all JPEG2000 codecs are equal, and that some are better at maintaining image fidelity through iterative compressions.
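The paper's methodology is more rigorous than this, but the basic idea of "compress repeatedly and measure how far the image drifts" can be sketched quickly. The example below uses Pillow's JPEG2000 support (backed by OpenJPEG) and a simple PSNR measure; neither the tooling nor the compression rate comes from the paper, and the filenames are placeholders.

```python
import numpy as np
from PIL import Image

def psnr(a: np.ndarray, b: np.ndarray) -> float:
    """Peak signal-to-noise ratio between two 8-bit images (higher is better)."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(255.0 ** 2 / mse)

def iterative_compression_psnr(src_path: str, rounds: int = 5, rate: int = 20) -> list[float]:
    """Repeatedly compress an image as lossy JPEG2000 and record how far each
    generation has drifted from the original. `rate` is a placeholder
    compression ratio passed to Pillow's JPEG2000 encoder."""
    original = Image.open(src_path).convert("RGB")
    reference = np.asarray(original)
    current = original
    scores = []
    for generation in range(rounds):
        out_path = f"generation_{generation}.jp2"
        current.save(out_path, "JPEG2000", quality_mode="rates", quality_layers=[rate])
        current = Image.open(out_path).convert("RGB")
        scores.append(psnr(reference, np.asarray(current)))
    return scores

if __name__ == "__main__":
    # "master.tif" is a placeholder for a master image file
    for i, score in enumerate(iterative_compression_psnr("master.tif"), start=1):
        print(f"generation {i}: PSNR {score:.2f} dB")
```

Running the same loop against different codec back-ends on the same master and comparing the resulting curves is, very roughly, the shape of the comparison the paper makes.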

There were many, many other great presentations, and unfortunately I don't have time to cover them all. A few final recommended reads are:

Cloudy Emulation – Efficient and Scalable Emulation-based Services (the paper is a bit technical, but it’s a fascinating project).

Supporting practical preservation work and making it sustainable with SPRUCE (good overview of this collaborative project)

You can find all the papers here.

Overall, I thought this was an excellent conference. It’s in Australia next year, so it’s unlikely that I’ll attend. However, in 2015 it will be in Chapel Hill, NC, so I hope to see you there!