
UC3 New Year Series: The Merritt Digital Preservation Repository in 2025

Posted in Digital Preservation and Merritt

At UC3, we’re dedicated to advancing the fields of digital curation, digital preservation, and open data practices. Over the years, we’ve built and supported a range of services and actively led and collaborated on open scholarship initiatives. With this in mind, we’re kicking off a new series of blog posts to highlight our core areas of work and where we’re heading in 2025.

With the close of 2024 and the beginning of 2025, the Merritt team is preparing for a new year of exciting projects and engagements with libraries and organizations across the UC system. We wrapped up 2024 by fully revising our Ingest queueing system to allow for more granular control over submissions processing, an effort that laid the groundwork for upcoming features, many of which we’ll discuss here. These include potential user-facing metrics, submission status reporting, and a new, easier-to-use manifest format for batch submissions.

Meanwhile, the team is always hard at work making the repository’s existing functionality more robust and future-proof. We’ll also review some of these efforts and how they both provide reassurance to collection owners and improve users’ overall experience with the system.

Submission Status and Collection Reports

Implementing Merritt’s revamped queueing system was the most significant change to the way the repository ingests content since its inception. The new system breaks the ingest process into many more distinct stages and provides visibility into each of them.

Every individual job now moves through these phases:

[Figure: Merritt job ingest stages]

Internally, the Merritt team can monitor these steps through the repository’s administrative layer and intervene as necessary. For example, if one or more jobs in a batch fail, our team can assist by restarting a job or investigating what happened. Once a job is processing again, we can also send additional notifications regarding the batch it belongs to.

While it’s possible for the team to view when a job or batch of jobs reaches any stage, Merritt currently does not provide this same insight to its users. We’ve already obtained preliminary feedback that such visibility would be helpful and are planning to gather additional input via an upcoming depositor tools survey regarding which stages are most useful and how to present them. In this vein, the survey will seek to capture information pertaining to your workflows for submitting content to the repository, and what new information and tools would make these more efficient for you. Our goal will be to identify trends in responses and focus on a key set of features to optimize Merritt’s usability.

Alongside visibility into a submission’s status, we would also like to provide additional information about the corresponding collection after submission. On every collection home page, we display specific metrics about the number of files, objects, and object versions in the collection, as well as the storage space used. Given our recent work on object analysis and data visualization, we plan to surface additional data on the composition of a collection in its entirety. Our current thinking is to bubble up the MIME types of files, a date range spanning the oldest and newest objects in the collection, and the most recently updated object. A downloadable report with this data, available directly from the collection home page, may also be an option. Again, stay tuned, as we’ll be looking for your input on what’s most important to you here.

New Manifest Format

There are a number of ways to ingest content into Merritt: single submissions through its UI, API-based submissions, and several types of manifests that tell the system where it should download a depositor’s content for processing.

In essence, a manifest is a list of file locations on the internet saved to a simple but proprietary format. Depending on the type of manifest, it may or may not contain object metadata. Although Merritt’s current range of manifest options provides a great deal of flexibility for structuring submissions, it can present a steep learning curve if a user’s goal is to deposit large batches of complex objects.

To help make working with manifests a more intuitive process, we’ve drafted a new manifest schema that records objects in YAML. Unlike current manifests, this single schema can be adapted to define any object or batch of objects. For example, it allows a batch of objects, where each object has its own list of files and object metadata, to be described in a single .yml file. Currently, this approach requires multiple manifests: one that lists a series of manifest files (a “manifest-of-manifests”) and the individual manifests it refers to, each of which records an object and its metadata. The new YAML schema should allow for more efficient definition of multiple objects in a single manifest, while also being more intuitive and easier to approach.
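To give a sense of the direction, here is a purely hypothetical sketch of what a batch manifest in YAML could look like. The field names (title, creator, files, url) and overall structure are illustrative assumptions only; the actual schema is still in draft and may differ.

```yaml
# Hypothetical example only -- field names and structure are illustrative,
# not the final Merritt schema. One batch, two objects, each with its own
# metadata and file list.
batch:
  objects:
    - title: "Oral history interview, 1978"
      creator: "Example, Jane"
      files:
        - url: https://example.org/staging/interview01.wav
        - url: https://example.org/staging/interview01_transcript.pdf
    - title: "Oral history interview, 1979"
      creator: "Example, Jane"
      files:
        - url: https://example.org/staging/interview02.wav
```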

Available Services

Object Analysis

The Merritt team has endeavored to provide collection owners with the means to gain deeper insight into the composition of the content being preserved in the repository. More specifically, we’ve leveraged Amazon OpenSearch to record and categorize characteristics of both the file hierarchy in objects and the metadata associated with them. Upon request, the Object Analysis process gathers this data from the objects in a specific collection and surfaces the results of categorization and tests in an OpenSearch dashboard for review. We then walk through the dashboard with the collection owner and library staff to explore insights. For more information, have a look at the Object Analysis Reference or our presentation at the Library of Congress, and let us know if you would like to explore your collections! Our goal is to provide you with the critical information needed to take preservation actions that promote the longevity of your content.
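To illustrate the kind of question an OpenSearch-backed analysis can answer, here is a minimal sketch of an aggregation query that counts files by MIME type. The endpoint, index name (“object-analysis”), and field name (“mime_type”) are assumptions for illustration, not Merritt’s actual configuration or dashboard queries.

```java
// Minimal sketch using OpenSearch's low-level REST client; endpoint, index,
// and field names are assumptions, not Merritt's actual configuration.
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.opensearch.client.Request;
import org.opensearch.client.Response;
import org.opensearch.client.RestClient;

public class MimeTypeAggregationSketch {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(
                new HttpHost("search.example.org", 443, "https")).build()) {

            // Ask for a count of files per MIME type across the analyzed collection.
            Request request = new Request("GET", "/object-analysis/_search");
            request.setJsonEntity(
                "{ \"size\": 0, \"aggs\": { \"mime_types\": "
                + "{ \"terms\": { \"field\": \"mime_type\" } } } }");

            Response response = client.performRequest(request);
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }
}
```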

Ingest Workspace

Last fall we brainstormed how we could more effectively help library partners ingest their collections into Merritt while also minimizing the overhead associated with staging content. For example, if a collection owner’s digitization project results in files saved to hard drives, staging those files shouldn’t necessarily require that they have on-premises storage and local IT staff to administer it while staging and validation take place. Using a workspace our team has designed, we can assist with copying content to S3 and leverage a custom process (implemented via AWS Lambda) to automate the generation of submission manifests.
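The core idea is straightforward: once content is staged in S3, a process can enumerate the staged files and emit manifest entries from the listing. The sketch below is not Merritt’s actual Lambda; the bucket and prefix names are hypothetical, the output is deliberately simplified, and a real submission manifest would follow Merritt’s manifest specification.

```java
// Simplified sketch of the idea, not the actual Lambda: list staged files in an
// S3 bucket and emit one line per file. Bucket and prefix are hypothetical.
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;

public class ManifestGenerationSketch {
    public static void main(String[] args) {
        try (S3Client s3 = S3Client.create()) {
            ListObjectsV2Request listRequest = ListObjectsV2Request.builder()
                    .bucket("example-staging-bucket")     // hypothetical staging bucket
                    .prefix("digitization-project/")      // hypothetical prefix
                    .build();

            // Paginate through every staged object and print its URL and size.
            s3.listObjectsV2Paginator(listRequest).contents().forEach(obj ->
                    System.out.printf(
                            "https://example-staging-bucket.s3.amazonaws.com/%s | %d%n",
                            obj.key(), obj.size()));
        }
    }
}
```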

If you’re spinning up a digitization project in the near future which entails a digital preservation element, let us know!

What We’re Working On

As mentioned, we’re always working to make Merritt more robust, secure, and easy to maintain for the long run. There are many more granular reasons for undertaking this work, but this year we’re keen to buckle down on our disaster recovery plans, ensure we’re using the latest AWS SDK version to interact with cloud storage at multiple service providers, and continue setting up the building blocks for autoscaling Merritt’s microservices.

Disaster Recovery Plans

It goes without saying that any digital preservation service and its accompanying policies are inherently geared toward preventing data loss. However, when truly unexpected events happen, the best thing we can do is strategize for the worst. In our case, because the content we steward lives entirely in the cloud, we need to be prepared for any of those services to become suddenly unavailable. Discussion in the larger community surrounding digital preservation in the cloud has continued, especially as organizations consider moving forward without a local, on-premises copy of their collections. Given that Merritt already operates in this manner, it should be able to pivot and recognize either of its two online copies as its primary copy, while continuing to run fixity checks and replicate data to new storage services. Note that the system’s third copy in Glacier is not a primary copy candidate due to the excessive costs associated with pulling data out of Glacier for replication.

So while we already have the means to back up critical components such as Merritt’s databases and recreate microservice hosts in a new geographic region, as well as a well-known approach to redefining collection configuration (i.e., primary vs. secondary object copies), we need a more efficient implementation to perform that reconfiguration on the fly. We intend to work on this storage management implementation in late 2025, detailing its operation not only as part of a disaster recovery plan, but also in our upcoming CoreTrustSeal renewal for 2026.

Talking to the Cloud

Although it doesn’t happen often (rewind to 2010, when the AWS SDK for Java 1.0 was introduced), major version updates to Amazon’s SDK that enables programmatic communication with S3 are inevitable. That’s why we’ve already spent considerable time testing v2 of the SDK for Java across all three of our cloud storage providers. So we’ll be ready when Amazon ends support for v1 in December of 2025.
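For readers unfamiliar with what the change looks like in practice, here is a minimal sketch of the v2 style, in which clients and requests are assembled with builders rather than v1-style constructors. The region, bucket name, and key below are placeholders, not Merritt’s actual configuration.

```java
// Minimal AWS SDK for Java v2 sketch: build a client, then check an object's
// metadata. Region, bucket, and key are placeholders.
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.HeadObjectRequest;
import software.amazon.awssdk.services.s3.model.HeadObjectResponse;

public class S3V2Sketch {
    public static void main(String[] args) {
        // In v2, clients are created with builders rather than constructors.
        try (S3Client s3 = S3Client.builder()
                .region(Region.US_WEST_2)                    // placeholder region
                .build()) {

            HeadObjectResponse head = s3.headObject(HeadObjectRequest.builder()
                    .bucket("example-preservation-bucket")   // hypothetical bucket
                    .key("example-object/version1/file.tif") // hypothetical key
                    .build());

            System.out.println("Size in bytes: " + head.contentLength());
        }
    }
}
```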

Autoscaling for Efficiency

As you may have gathered from some of our regular CDL INFO newsletter posts, a long-term goal for our team is to enable certain Merritt microservices to autoscale. In other words, the number and size of submissions to the repository are constantly in flux. Most days we see hundreds of small metadata updates processed for, for example, eScholarship publications being preserved in the system. On other days, depositors may submit upwards of a terabyte apiece of new content while the service concurrently processes direct deposits whose updates stem from collections in Nuxeo. Given this variability, our consideration of the environmental impact of excessive compute resources in the cloud, and compute and storage costs, Merritt should only be running as many servers as it needs at any given time. In particular, because we run every microservice in a high-availability fashion, we should be able to increase the amount of compute on days of heavy submissions and, more critically, reduce the number of hosts when minimal content is being ingested.

Revamping our queueing process was a major prerequisite for autoscaling. With this complete, we (and all UC3 teams) now need to migrate into our own AWS account to further refine control over repository infrastructure. This next step will move us ever closer to fulfilling our goal to autoscale. We’ll continue to share more information on this in our regular posts going forward, so keep an eye out for the latest news!
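As a rough illustration of the general pattern (not a description of Merritt’s implementation), one common approach on AWS assumes the hosts run in an EC2 Auto Scaling group whose capacity can be adjusted as workload changes. The group name and sizes below are hypothetical, and in practice scaling is usually driven by metrics-based policies rather than manual calls.

```java
// Hypothetical sketch: adjust an EC2 Auto Scaling group's capacity with the
// AWS SDK for Java v2. Group name and sizes are placeholders.
import software.amazon.awssdk.services.autoscaling.AutoScalingClient;
import software.amazon.awssdk.services.autoscaling.model.UpdateAutoScalingGroupRequest;

public class ScaleSketch {
    public static void main(String[] args) {
        try (AutoScalingClient autoScaling = AutoScalingClient.create()) {
            // Allow more hosts during heavy submission periods...
            autoScaling.updateAutoScalingGroup(UpdateAutoScalingGroupRequest.builder()
                    .autoScalingGroupName("example-ingest-workers") // hypothetical group
                    .minSize(1)
                    .maxSize(6)
                    .desiredCapacity(4)
                    .build());
            // ...and scale back down (e.g., desiredCapacity(1)) when ingest is quiet.
        }
    }
}
```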

In closing, we’re excited to work with our library partners throughout the coming year, both on potential new features and on new projects that make use of recently introduced tools. Reach out to us anytime!
