
UC3 New Year Series: The Merritt Digital Preservation Repository in 2025

At UC3, we’re dedicated to advancing the fields of digital curation, digital preservation, and open data practices. Over the years, we’ve built and supported a range of services and actively led and collaborated on open scholarship initiatives. With this in mind, we’re kicking off a new series of blog posts to highlight our core areas of work and where we’re heading in 2025.

With the close of 2024 and the beginning of 2025, the Merritt team is preparing for a new year of exciting projects and engagements with libraries and organizations across the UC system. We wrapped up 2024 by fully revising our Ingest queueing system to allow for more granular control over submissions processing, an effort that laid the groundwork for upcoming features, many of which we’ll discuss here. These include potential user-facing metrics, submission status visibility, and a new, easier-to-use manifest format for batch submissions.

Meanwhile, the team is always hard at work making the repository’s existing functionality more robust and future-proof. We’ll also review some of these efforts and how they both provide reassurance to collection owners and improve users’ overall experience with the system.

Submission Status and Collection Reports

Implementing Merritt’s revamped queueing system was the most significant change to the way the repository ingests content since its inception. The new system establishes, and allows for visibility into, many more stages of the ingest process.

Every individual job now moves through these phases:

[Figure: Merritt job ingest stages]

Internally, the Merritt team can monitor these steps through the repository’s administrative layer and intervene as necessary. For example, if one or more jobs in a batch fail, our team can assist by restarting a job or investigating what happened. Once a job is processing again, we can now also send additional notifications regarding the batch it belongs to.

While it’s possible for the team to view when a job or batch of jobs reaches any stage, Merritt currently does not provide this same insight to its users. We’ve already obtained preliminary feedback that such visibility would be helpful and are planning to gather additional input via an upcoming depositor tools survey regarding which stages are most useful and how to present them. In this vein, the survey will seek to capture information pertaining to your workflows for submitting content to the repository, and what new information and tools would make these more efficient for you. Our goal will be to identify trends in responses and focus on a key set of features to optimize Merritt’s usability.

Alongside visibility into a submission’s status, we would also like to provide additional information about the corresponding collection, post-submission. On every collection home page, we display specific metrics about the number of files, objects and object versions in the collection as well as storage space used. Given our recent work on object analysis and data visualization, we plan to surface additional data on the composition of a collection in its entirety. Our current thinking is to bubble up the MIME types of files, a date range that takes into account the oldest and newest objects in the collection, as well as the most recently updated object. A downloadable report with this data may also be an option, directly from the collection home. Again, stay tuned as we’ll be looking for your input on what’s most important to you in this case.

New Manifest Format

There are any number of ways to ingest content into Merritt – from single submissions through its UI, to API-based submissions, to several types of manifests that tell the system where it should download a depositor’s content for processing.

In essence, a manifest is a list of file locations on the internet saved to a simple but proprietary format. Depending on the type of manifest, it may or may not contain object metadata. Although Merritt’s current range of manifest options provides a great deal of flexibility for structuring submissions, it can present a steep learning curve if a user’s goal is to deposit large batches of complex objects.

To help make working with manifests a more intuitive process, we’ve drafted a new manifest schema that records objects in YAML. Unlike current manifests, this single schema can be adapted to support definition of any object or batch of objects. For example, it allows for defining a batch of objects, where each object can have its own list of files and object metadata, all in one .yml file. Currently, this approach requires use of multiple manifests – one that lists a series of manifest files, and the other manifests to which it refers (a.k.a. a “manifest-of-manifests”). Each of these latter files records an object and its individual metadata. The new YAML schema should allow for more efficient definition of multiple objects in a single manifest, while also being more intuitive and easier to approach.
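To make this concrete, here is a purely hypothetical sketch of what a batch definition in the drafted YAML schema might look like. The field names below are illustrative assumptions, not the final schema:

# Hypothetical sketch of the drafted YAML manifest schema;
# all field names are illustrative and subject to change.
batch:
  submitter: example-depositor
  objects:
    - title: "Example Object One"
      local_id: obj_0001
      files:
        - url: https://example.org/staging/obj_0001/image.tif
          md5: 9e107d9d372bb6826bd81d3542a419d6
        - url: https://example.org/staging/obj_0001/mods.xml
          md5: e4d909c290d0fb1ca068ffaddf22cbd0
    - title: "Example Object Two"
      local_id: obj_0002
      files:
        - url: https://example.org/staging/obj_0002/audio.wav
          md5: d41d8cd98f00b204e9800998ecf8427e

Each object carries its own file list and metadata, so a single .yml file could replace today’s manifest-of-manifests arrangement.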

Available Services

Object Analysis

The Merritt team has endeavored to provide collection owners the means to obtain more insight into the composition of the content being preserved in the repository. More specifically, we’ve leveraged Amazon OpenSearch to record and categorize characteristics of both the file hierarchy in objects as well as the metadata associated with them. The Object Analysis process is one by which we gather this data from the objects in a specific collection upon request, and then surface the results of subsequent categorization and tests in an OpenSearch dashboard for review. We then walk through the dashboard with the collection owner and library staff to explore insights. For more information, have a look at the Object Analysis Reference or our presentation at the Library of Congress, and let us know if you would like to explore your collections! Our goal is to provide you with the critical information needed for taking preservation actions that promote the longevity of your content.

Ingest Workspace

Last fall we brainstormed on how we could more effectively help library partners ingest their collections into Merritt while also minimizing the overhead associated with staging content. For example, if a collection owner’s digitization project results in files saved to hard drives, staging those files shouldn’t necessarily require that they have on-premises storage and local IT staff who can administer it while staging and validation actions occur. Through use of a workspace our team has designed, we can assist with copying content to S3 and leverage a custom process (implemented via AWS Lambda) to automate the generation of submission manifests.
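As a rough illustration only (this is not the team’s actual Lambda, and the event shape, bucket layout, and manifest format are all assumptions), such a function might list staged packages under an S3 prefix and emit one manifest entry per package:

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;
import software.amazon.awssdk.services.s3.model.S3Object;
import java.util.Map;

// Hypothetical sketch: generate manifest entries from staged content in S3.
public class ManifestGenerator implements RequestHandler<Map<String, String>, String> {
    private final S3Client s3 = S3Client.create();

    @Override
    public String handleRequest(Map<String, String> event, Context context) {
        String bucket = event.get("bucket");  // staging bucket (assumed event shape)
        String prefix = event.get("prefix");  // folder holding one batch of packages
        StringBuilder manifest = new StringBuilder();

        ListObjectsV2Request req = ListObjectsV2Request.builder()
                .bucket(bucket).prefix(prefix).build();

        // One entry per staged package: the URL the repository would retrieve it from.
        for (S3Object obj : s3.listObjectsV2Paginator(req).contents()) {
            manifest.append("https://").append(bucket)
                    .append(".s3.amazonaws.com/").append(obj.key()).append('\n');
        }
        return manifest.toString();
    }
}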

If you’re spinning up a digitization project in the near future which entails a digital preservation element, let us know!

What We’re Working On

As mentioned, we’re always working to make Merritt more robust, secure and easy to maintain for the long run. There are many more granular reasons for undertaking this work, but this year we’re keen to buckle down on our disaster recovery plans, ensure we’re making use of the latest AWS SDK version to interact with cloud storage at multiple service providers, and continue setting up the building blocks for autoscaling Merritt’s microservices.

Disaster Recovery Plans

It goes without saying that any digital preservation service and its accompanying policies are inherently geared to prevent data loss. However, when truly unexpected events happen, the best thing we can do is strategize for the worst. In our case, because the content we steward lives entirely in the cloud, we need to be prepared for any of those services to become suddenly unavailable. Discussion in the larger community surrounding digital preservation in the cloud has continued, especially as organizations consider moving forward without a local, on-prem copy of their collections. Given Merritt already operates in this manner, it should be able to pivot and recognize either of its two online copies as its primary copy, while continuing to run fixity checks and replicate data to new storage services. Note the system’s third copy in Glacier is not a primary copy candidate due to the excessive costs associated with pulling data out of Glacier for replication. So while we already have the means to back up critical components such as Merritt’s databases, recreate microservice hosts in a new geographic region, and a well known approach to redefining collection configuration (i.e. primary vs. secondary object copies), we need a more efficient implementation to perform the latter reconfiguration on the fly. We intend to work on this storage management implementation in late 2025, detailing its operation not only as part of a disaster recovery plan, but also in our upcoming CoreTrustSeal renewal for 2026.
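To give a sense of what that reconfiguration involves, a collection’s copy assignments might be modeled along the lines of the following sketch (purely illustrative; the node names and configuration format are assumptions, not Merritt’s actual implementation):

# Hypothetical sketch of per-collection copy roles.
collection: example_collection
copies:
  - node: sdsc-online        # online copy, currently primary
    role: primary
  - node: wasabi-us-east-2   # online copy; promoted to primary during a failover
    role: secondary
  - node: glacier            # nearline copy; never a primary candidate (retrieval costs)
    role: nearline

Flipping the primary and secondary roles on the fly, and pointing replication at a new storage service, is the capability we intend to build out in late 2025.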

Talking to the Cloud

Although it doesn’t happen often (rewind to 2010, when the AWS SDK for Java 1.0 was introduced), major version updates to the Amazon SDK that enables programmatic communication with S3 are inevitable. That’s why we’ve already spent considerable time testing use of v2 of the SDK for Java across all three of our cloud storage providers, so we’ll be ready when Amazon ends support for v1 in December of 2025.
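For a flavor of what the migration entails, here is a minimal sketch contrasting the two styles of S3 call (bucket and key names are placeholders; this is not Merritt code):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import java.nio.file.Path;

public class SdkComparison {
    public static void main(String[] args) {
        // v1 style: string-based convenience methods (support ends December 2025)
        AmazonS3 v1 = AmazonS3ClientBuilder.standard().build();
        v1.getObject("example-bucket", "object-key");

        // v2 style: immutable request objects built via the builder pattern
        S3Client v2 = S3Client.create();
        v2.getObject(GetObjectRequest.builder()
                        .bucket("example-bucket")
                        .key("object-key")
                        .build(),
                Path.of("downloaded-object"));
    }
}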

Autoscaling for Efficiency

As you may have gathered from some of our regular CDL INFO newsletter posts, a long-term goal for our team is to enable certain Merritt microservices to autoscale. In other words, the number and size of submissions to the repository are constantly in flux. Most days we see hundreds of small metadata updates processed for, e.g., eScholarship publications being preserved in the system. On other days, depositors may submit upwards of a terabyte apiece of new content while the service concurrently processes direct deposits whose updates stem from collections in Nuxeo. Given this variability, the environmental impact of excessive compute resources in the cloud, and compute and storage costs, Merritt should only be running as many servers as it needs at any given time. In particular, because we run every microservice in a high-availability fashion, we should be able to increase the amount of compute on days of heavy submissions and, more critically, reduce the number of hosts when minimal content is being ingested. Revamping our queueing process was a major requirement for autoscaling. With this complete, we (and all UC3 teams) now need to migrate into our own AWS account to further refine control over repository infrastructure. This next step will move us ever closer to fulfilling our goal to autoscale. We’ll continue to share more information on this in our regular posts going forward, so keep an eye out for the latest news!
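For illustration, target-tracking autoscaling of an EC2-based microservice fleet can be described in a CloudFormation fragment like the one below. The resource names, fleet sizes, and CPU target are placeholder assumptions, not our actual configuration:

# Hypothetical CloudFormation fragment; values are illustrative only.
IngestAutoScalingGroup:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    MinSize: "1"              # shrink the fleet when little content is ingested
    MaxSize: "6"              # grow it on heavy submission days
    VPCZoneIdentifier: !Ref IngestSubnets
    LaunchTemplate:
      LaunchTemplateId: !Ref IngestLaunchTemplate
      Version: !GetAtt IngestLaunchTemplate.LatestVersionNumber

IngestScalingPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AutoScalingGroupName: !Ref IngestAutoScalingGroup
    PolicyType: TargetTrackingScaling
    TargetTrackingConfiguration:
      PredefinedMetricSpecification:
        PredefinedMetricType: ASGAverageCPUUtilization
      TargetValue: 60.0       # add or remove hosts to hold average CPU near 60%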

In closing, we’re excited to work with our library partners throughout the coming year on both potential new features and by collaborating on new projects via recently introduced tools. Reach out to us anytime!

Exploring Merritt’s Recent Growth

Taking a Look at Merritt’s New Collections, Project Updates, and Collaborations

Merritt is CDL’s digital preservation repository. Merritt’s goal is to provide a comprehensive platform for digital preservation, ensuring valuable materials are securely stored and accessible for the long term. We regularly post updates on our projects and collaborations on the CDL INFO website. This post is a round-up of recent activities to showcase the growth of our collections, project updates, and campus collaborations.

Amazing Growth

While Merritt has been available since 2005, this year we have seen incredible interest in deposits.  In recent months, our collections have grown significantly. Since the beginning of the year we have grown from 411 TB to 532 TB.  

[Figure: Growth in Merritt Collections]

This growth has come through several new and ongoing collaborations. Here are some highlights of our current work.

New Content and Collections

UCLA Library – We collaborated with our colleagues at UCLA and the Palestinian Museum to move forward with programmatically collecting preservation metadata for over 146,000 objects in the Palestinian Museum’s archive. Through extensive investigation, we were able to extract a variety of metadata for each image that will be mapped to Merritt’s object-level metadata fields, enabling ingest of the entire collection.

UC Riverside – Our collaboration with UC Riverside Library is progressing well. We’re working on establishing Nuxeo-to-Merritt direct deposits, which will automate the process of capturing metadata from Library Special Collections and University Archives, as well as Water Resources Collections & Archives holdings. This initiative is aimed at streamlining curation efforts and enhancing our repository’s offerings.

UC Berkeley – UC Berkeley Library has been actively contributing to our collections. A variety of objects, including materials from the UC Berkeley Library Examiner collection, UC Berkeley Library JVAC, UCB Library California Audio-Visual Preservation Project, and more, have been deposited into Merritt. This diverse array of materials enriches the repository’s content.

UCSF Library – We’ve been collaborating with UCSF Archives & Special Collections to manage the ingest of the COVID Tracking Project (CTP) Archive into Merritt. We’re excited to say that we’ve completed depositing all content into this collection in Merritt.

Many others – The list above is just a few of the many projects we continue to work on with communities across all UC campuses and their partners. Stay tuned for more by checking out our regular updates on the CDL INFO website.

Digital Preservation Community

NDSA’s Storage Survey – The NDSA community is embarking on the next phase of its Storage survey, a project that examines storage practices within digital preservation organizations. The survey aims to provide insights into evolving practices and trends. Updates on the group’s progress will be available as they begin their work.

Digital Preservation Conferences – Fall brings a series of digital preservation conferences, including iPRES and NDSA’s DigiPres. These gatherings offer opportunities for professionals to come together, share knowledge, and discuss the latest developments in the field.

CoreTrustSeal – We are delighted to renew our CoreTrustSeal certification. The certification, which we first received four years ago, recognizes our commitment to ensuring the long-term preservation and accessibility of University of California archival collections. The certification process involved a rigorous evaluation of our policies, infrastructure and processes to ensure that we meet the highest standards for preservation and data management.

Many others – The list above is just a few of the many collaborations and partnerships we continue to stay involved in. Stay tuned for more by checking out our regular updates on the CDL INFO website.

Merritt’s Current Projects

OpenSearch Integration – Merritt’s microservices are undergoing a transformation to integrate OpenSearch for logging events and issues. This project facilitates the capture of operational logs and system events, providing valuable insights into the repository’s functioning. Additionally, data visualization and dashboard functionalities are being implemented to enhance tracking and analysis.

Data Import and Visualization – We’ve made progress in generating JSON data from Merritt’s inventory and billing databases for import into OpenSearch. This data provides insights into collection ownership, file attributes, and deposit timelines. The visualizations resulting from this effort will offer enhanced understanding and reporting capabilities; a sketch of what such a record might look like follows this list.

DevOps Improvements for Merritt Services – The Merritt Team has made a number of improvements to our DevOps (development operations) practices in preparation for a migration to a new server environment, migration to a new version of java, and to incorporate additional security best practices.

Many others – The list above is just a few of the many improvements to the Merritt platform that we continue to work on. Stay tuned for more by checking out our regular updates on the CDL INFO website.
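As referenced above, a single imported record might look something like this purely hypothetical sketch (the field names are assumptions, not the actual schema of our inventory or billing exports):

{
  "collection": "example_collection",
  "owner": "UC Example Campus Library",
  "mime_type": "image/tiff",
  "file_count": 4,
  "billable_size_bytes": 1073741824,
  "first_deposit": "2020-03-15",
  "last_modified": "2023-08-02"
}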

Future Endeavors

Merritt is committed to ongoing projects and collaborations with UC campuses and our partners. Stay tuned for more updates as we continue to navigate these initiatives and work towards providing a seamless and comprehensive preservation repository.

Merritt Renews its CoreTrustSeal Certification

We are delighted to announce that the Merritt preservation repository has recently received a renewal of its CoreTrustSeal certification. The certification, which we first received four years ago, recognizes our commitment to ensuring the long-term preservation and accessibility of University of California archival collections. 

At this time, the Merritt repository holds over 400 collections composed of UC library special collections content, eScholarship journal publications, as well as digital content from libraries across the state of California and other memory institutions. 

The CoreTrustSeal is an internationally recognized standard that demonstrates our repository’s compliance with best practices in digital preservation. The certification process involved a rigorous evaluation of our policies, infrastructure and processes to ensure that we meet the highest standards for preservation and data management.

By renewing our CoreTrustSeal certification, we are signaling to librarians, archivists, researchers, and data producers that they can trust our repository to preserve digital content for the long-term. This certification helps us to demonstrate our commitment to maintaining high-quality processes that ensure the integrity, authenticity, and usability of UC’s content.

We would like to extend our sincere thanks to the organizers and volunteers of the CoreTrustSeal community project for their invaluable work in developing and maintaining this important certification. Their dedication to promoting best practices in digital preservation is an essential contribution to the scholarly community and ensures the long-term availability of digital content.

The Merritt preservation repository is committed to providing the highest quality preservation services. We look forward to continuing to serve as a trusted repository for the University of California and our partners around the world.


Merritt is awarded a Mojgan Amini Operational Excellence Award at UC Tech 2022

We are excited to announce that Merritt, CDL’s digital preservation repository, has been awarded the Mojgan Amini Operational Excellence Award.  This recognition comes at a time when the Merritt team continues to witness exponential growth in content submission from across the UC system. 

As the Awards committee highlighted: “Today, Merritt manages close to two and a half times the amount of data it did in 2019. It does so through more reliable, transparent, efficient and cost-effective means that aim to mitigate the risk of data loss while providing content access in perpetuity.” 

[Figure: Storage Occupied by One Object Copy of All Content in Merritt]

This award demonstrates that success as a digital preservation system is not measured solely by an increase in content. The Merritt team’s efforts between 2019 and the present day highlight how a focus on technical refactoring, workflow automation, and process optimization benefits not only our current users, but future users as well.

We are thankful to the UC Tech Award committee for recognizing our transformative work and technical achievements. We know that Merritt must not only be a robust and stable platform that serves the University community, but it should also be able to adapt to new policies and advances in technology, all the while being a straightforward system to maintain. Though our tasks as content and system stewards will never be finished – after all, digital preservation is truly a journey rather than a race – what follows is a summary of our recent efforts and their results.

Merritt’s usage cost was lowered from $650/TB to $150/TB, while the number of replicas for every digital object in the system increased from two to three copies. 

The standing file audit (a.k.a. fixity check) cycle was reduced to a third of the time necessary to fixity check all data in the repository.

The rate of new deposits has increased significantly since 2019, such that holdings have more than doubled. 

Reduction of the number of local data copies in play across cloud storage and microservices has increased the rate of new content ingest.

Increased sustainability has been achieved through implementation of system assertions and added system transparency.

Steps have been taken toward system scalability, including the streamlining of microservice configuration.


UCLA Digital Library Program & UC3 Collaborate to Reaffirm a Preservation Solution for the Strachwitz Frontera Collection

UCLA has long collaborated with the Arhoolie Foundation to provide digital access to and enable the preservation of the world’s largest known musical collection of historical Mexican and Mexican American commercial recordings: The Frontera Collection. Now, in a renewed effort and in collaboration with UC3, UCLA’s Digital Library Program (DLP) has begun to expand the number of recordings that are being preserved in CDL’s Merritt digital preservation repository.

Owned by the Arhoolie Foundation, the physical collection began in the early 1970s when Chris Strachwitz, already a decade into his lifelong career as a record producer and collector of regional music, was introduced to corridos, Tejano, Norteño and other regional styles of Mexican and Mexican American music. This introduction, and Strachwitz’s devotion to song catching, is the origin of the Frontera Collection. Over time, the collection has grown to nearly 170,000 recordings, has started to include Latin American music, and, in its entirety, spans the better part of the 20th century, from 1905 to the 1990s.

Over twenty years ago, the UCLA Library, the UCLA Chicano Studies Research Center, and the Arhoolie Foundation formed a partnership to digitize and make available online this unparalleled collection. Leveraging an initial gift made by the Los Tigres Del Norte Foundation to the Chicano Studies Research Center, the digitization process was started in earnest. Continued collaboration in this effort has been funded through several grants provided by the National Endowment for the Humanities, as well as funding grants from the National Endowment for the Arts, the GRAMMY Foundation, the Fund for Folk Culture, Arhoolie Records, Mr. and Mrs. E. W. Littlefield Jr., the Edmund & Jeannik Littlefield Foundation, and others.

Recently, Elizabeth McAulay, head of the UCLA DLP, approached UC3 with a request to expand the existing collection in Merritt, which at the time housed more than twenty thousand Frontera digital objects. However, since the initial deposit in 2006, the DLP has received thousands more song files from Arhoolie. Beyond providing access copies of these newer recordings on the DLP’s Frontera website, DLP and Arhoolie are committed to ensuring their long term preservation.

Metadata & gap analysis

Working with Geno Sanchez, Digital Assets Coordinator at the DLP, we began a process to obtain the latest metadata for all of the Frontera Collection’s recordings, locate tens of thousands of master song files and their associated record album image files, conduct a gap analysis between the existing Merritt-based collection and metadata, and then finally assemble groups of files that would each comprise a digital object for ingest into the repository.

While the Arhoolie Foundation has delivered new song files and album artwork to the DLP over the years, it has also sent along spreadsheets of metadata to accompany every batch of files. At the outset of our project, we reached out to the Foundation to request a spreadsheet complete with metadata from all of these batches, consisting of every master recording file name in Frontera. In late April, we received a spreadsheet with no fewer than 153,000 entries, each with fields that note information about the album artist, composer, catalog number, record label, producer, recording date, format, and recording file name, among other details. There are 42 fields for every entry, leaving out no detail of import. Even the equipment used to transfer each recording from shellac or vinyl record, or magnetic media, to digital is listed.

Upon initial examination, it became clear that the recording file name would serve as the key field against which we would compare the content of digital objects in the existing Merritt collection. However, it also became evident there were many orphaned entries on both sides of the equation.

Through use of our file analysis tool, we were able to cross-reference fields in the spreadsheet against those in a spreadsheet of data exported from Merritt’s inventory database. Using the recording file name as a unique way to associate each existing digital object in Merritt with an entry on the Arhoolie spreadsheet, we discovered that slightly less than a thousand objects (out of 27,000) were lacking corresponding entries. And of those, 578 did not include recording files – only album artwork. Attempts to manually locate recording files associated with the album artwork by searching on-premises storage at the DLP did not succeed, and it was ultimately determined that we would start with a fresh collection in Merritt to house all of the known song files and their art, rather than attempt to augment existing objects.
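The essence of that cross-reference is a set comparison keyed on file name. A minimal sketch of the idea follows (this is not our actual file analysis tool; the file names, column positions, and naive CSV parsing are all simplifying assumptions):

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class GapAnalysis {
    // Pull the recording-file-name column out of a CSV export.
    // Naive split: assumes no embedded commas within fields.
    static Set<String> fileNames(Path csv, int column) throws Exception {
        try (Stream<String> lines = Files.lines(csv)) {
            return lines.skip(1) // skip the header row
                    .map(line -> line.split(",")[column].trim())
                    .collect(Collectors.toSet());
        }
    }

    public static void main(String[] args) throws Exception {
        Set<String> arhoolie = fileNames(Path.of("arhoolie_metadata.csv"), 0);
        Set<String> merritt = fileNames(Path.of("merritt_inventory.csv"), 0);

        // Merritt objects with no corresponding spreadsheet entry
        Set<String> orphaned = new HashSet<>(merritt);
        orphaned.removeAll(arhoolie);

        // Spreadsheet entries not yet represented in Merritt
        Set<String> missing = new HashSet<>(arhoolie);
        missing.removeAll(merritt);

        System.out.printf("Orphaned in Merritt: %d%n", orphaned.size());
        System.out.printf("Missing from Merritt: %d%n", missing.size());
    }
}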

With a path forward based on the gap analysis, we dove into examining the metadata for each song in the Arhoolie spreadsheet and determined a mapping into a MODS metadata file to accompany each object. Since the metadata in the spreadsheet was so robust and meaningful, we decided to carry over detailed information about the album as well as include artist names, Matrix Number, track duration, publisher, and of course donor. The importance of the Matrix Number cannot be overstated, as it identifies the unique master recording for a song and may ultimately appear on many different record labels alongside a variety of catalog numbers.
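To illustrate the shape of that mapping (the element choices and identifier types here are illustrative assumptions, not our exact MODS profile), a per-song record might look like:

<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Example Song Title</title></titleInfo>
  <name><namePart>Example Artist</namePart></name>
  <identifier type="matrix-number">1234-A</identifier>
  <identifier type="catalog-number">BV 318</identifier>
  <originInfo><publisher>Example Label</publisher></originInfo>
  <physicalDescription><extent>00:03:12</extent></physicalDescription>
  <note type="donor">Arhoolie Foundation</note>
</mods>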

Stepping back from the metadata, a complete digital object for a single song in the collection appears as follows in the repository. Note the front and back album artwork files, as well as a TIFF for the center label placed on the record. 

└── bv_33_318_a4    (digital object folder name and Local ID in Merritt)
    ├── bv_33_318_a.tif
    ├── bv_33_318_a4.wav
    ├── bv_33_318_a4.xml
    ├── bv_33_318_back.tif
    └── bv_33_318_front.tif

Not all objects contain a superset of files like this one. At a minimum, only the .wav and .xml files are included. 

Assembling what would become the submission information packages (SIPs) for each song took a tremendous effort to automate the location, movement and gathering of all required object components. At the same time, we decided to divide the full collection into batches to allow for easier transmission and management. The Arhoolie song list spreadsheet was divided into batches of roughly 24,000 songs, and we then transformed the spreadsheet for the initial batch into MODS XML files.

Submitting the first batch

With the first batch of objects assembled, we pivoted our focus toward enabling their submission to the repository. Submitting batches of content to Merritt is best done through use of manifest files. Manifests are CSV-like files that record the web-accessible location of each SIP to be ingested, along with object-level metadata and an MD5 checksum for verification purposes. Once each group of files was organized into a directory on local storage (one group per song), these were zipped up and an associated checksum generated. The URL at which the package could be retrieved was then recorded, along with the aforementioned metadata, in a single-line entry in the manifest.
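For illustration, a single entry boils down to something like the following (a simplified, hypothetical rendering; real Merritt manifests define a specific header and field order):

https://example.org/staging/bv_33_318_a4.zip | md5 | 9e107d9d372bb6826bd81d3542a419d6 | bv_33_318_a4.zip | Example Song Title | bv_33_318_a4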

With a very small subset of about a dozen or so SIPs, we tested submission into the staging environment for Merritt. This testing revealed we needed to make minor manifest encoding corrections to ensure support for diacritics, and that otherwise our process was successful. Shortly thereafter, a new collection in Merritt was established to house the initial batch of 24,000 song objects. We proceeded to split the primary manifest into four six-thousand-object manifests for submission in turn – and within four days’ time, all objects were ingested into the repository.

Looking forward

As the entire Frontera collection contains close to 170,000 sound recordings, many more batches of content are yet to be assembled. At this time, the second batch is nearly complete and we hope to have it ingested before the end of 2021. Slowly but surely, Frontera will make its way to a safe new home, its master recordings ready to be retrieved and shared with generations to come.

As an extended team, UC3 and the Digital Library Program at UCLA have already begun collaborating on more projects like this one. We’ll share our experiences as we move forward and look to hear from you with your thoughts and questions. But in the meantime, you may wish to do some song catching of your own while exploring Frontera – an extraordinarily rich collection of music that encompasses over 90 years of cultural heritage.


Themes for Digital Preservation

By: Eric Lopatin

Since our inception, a main focus of UC3 has been to deliver high quality, reliable digital preservation services for the UC community. Currently, this takes the form of both consultative and community engagement work, as well as technical development to ensure our digital preservation repository, Merritt, remains durable and innovative.

For 2021, our digital preservation goals fit into three key themes: Community Engagement, Simplification, and Scaling. By working with these themes in mind, our goal is to promote the values at the core of many preservation systems and programs – values such as reliability, authenticity, integrity and sustainability. The crossroads of technology and policy are where these values play out, and we’re aiming to keep abreast of many of them while heading into 2021.

Community Engagement 

Metrics and insight – Last year, our team laid the groundwork for more granular reporting on content held in Merritt. This work has already allowed us to provide reports for campuses that illustrate a variety of aspects related to their collections. This year, we’ll work on dashboards and data visualizations to give users more insight into their collections.

UC system-wide digital preservation – Over the past two years, UC3 has participated in multiple phases of a system-wide Digital Preservation Strategy working group. The next phase of this effort will establish a systemwide leadership group and begin to construct a digital preservation training program across UC campuses. In 2021, our team will continue to participate in these efforts as they set the course for future projects across our campus community.

NDSA engagement – The National Digital Stewardship Alliance is a longstanding, community building organization which promotes discussion, learning and standards surrounding digital preservation. This year I’ll be co-chairing the NDSA Infrastructure Interest group with Leah Prescott at Georgetown University. We’ll be facilitating conversations surrounding preservation technologies and infrastructure, while also joining NDSA Leadership meetings to help apply input from interest group participants directly to activities the organization takes on throughout the year. Given the record attendance of NDSA’s recent DigiPres 2020 conference, I’m looking forward to helping build out future opportunities through which the larger preservation community can collaborate.

Simplification

Preservation Assurance – The overarching UC3 digital preservation strategy calls for creating and maintaining three copies of every object in Merritt, across two geographic regions with differing disaster threats, with at least one of those copies being less volatile in nearline storage. All Merritt collection content now adheres to this strategy, and we replicate new submissions to our cloud storage providers as it arrives.

New submissions – One of the most commonly used methods of adding new content to Merritt is uploading a manifest file that enables batch ingest of hundreds or thousands of objects that are pulled from a user’s on-prem storage. In 2021, we’re planning to simplify and automate the manifest creation process to assist users with this task, so they can be assured that all of their objects and object-level metadata will be handled correctly.

Common API – One API for use across the Merritt system has been a goal for quite a while, and we’re looking forward to making it a reality. In 2021, we will continue our work to design a common API for use by users and Merritt microservices alike. This will allow for submitting content, gaining insight into existing submissions, easier external systems integration, and of course access to individual files and object versions. 

Scaling

Auto-scaling – The theme of Simplification goes hand-in-hand with Scaling. In our case, effectively scaling aspects of the Merritt system could be more aptly referred to as auto-scaling. In a recent blog post, we discussed how the team has been at work implementing a centralized parameter store with AWS Systems Manager to streamline Merritt microservice configuration. 

Resilience – In 2021, our work will include simplifying the process of adding new hosts when needed (during periods of increased load on the system). Eventually our goal is to reach the point where this can happen without human intervention. And on the flip side, spinning down hosts when they are not needed will occur as well. Auto-scaling microservices in this sense promises to make the overall Merritt system more resilient, secure and cost effective.

In summary, 2021 promises to be a busy year for our digital preservation team at CDL.  As always, feel free to contact me with any questions. I am happy to discuss any of these ideas and directions for 2021, along with others you may have in mind!

This blog is a part of the “A Peek Into 2021 for UC3” series.

Streamlining Merritt Microservice Configuration

Working with a Centralized Parameter Store and how Merritt Stands to Benefit

– Eric Lopatin, Terry Brady, Marisa Strong –


Introduction

There are many paths forward for the technology stack that underpins any digital preservation system. It’s often the process of choosing an appropriate path that can be challenging, rather than enumerating any number of possible, new implementations. The Merritt team has found itself in this position recently, having assembled a long list of new paths to tread, and knowing that our resources are finite and should be applied with care to solutions that stand to benefit users. 

Leading up to this point in time, we’ve finished migrations in the past year that reduced preservation storage costs for campuses to nearly a quarter of what they were. These also helped secure our vision for a revised approach to preservation that includes a third object copy in a separate geographic region of the country in order to mitigate risk to collections. 

Having completed this work, the team has decided that among a number of possible initiatives, one in particular provides a solid building block in a larger strategy to improve the repository’s resilience – and it takes the form of streamlining configuration practices across all of Merritt’s microservices. Let’s dig into that a bit.

Merritt is a complex system, employing both Java and Ruby web applications across nine microservices responsible for tasks ranging from content ingestion, to inventory, replication and fixity checking. Since its introduction, each microservice application has employed a unique method to specify its configuration for criteria such as database credentials, authentication service credentials, cloud storage node characteristics, and, among other things, information associated with the use of upstream, external services. This kind of strategy creates a need for specialized, tribal knowledge across the members of a development team, and in turn leaves itself vulnerable to the loss of that knowledge. Of equal and arguably greater importance, it distributes configuration information across several facilities, such as private code repositories and even executable packages themselves (for instance, within the .war file that encapsulates a Java web application). It’s easy to go a step further and intuit that the overall security of an application or system may be compromised as well when configuration parameters are distributed in this manner.

By focusing our efforts on streamlining the approach to application configuration across Merritt, the team stands to benefit from consistent implementations across microservices while raising the security bar and ultimately making the system as a whole more robust and manageable. There’s yet another benefit that we will be privy to through this focus – and it plays directly into a larger goal of dynamically scaling services when large influxes of content are generating high load: moving application configuration parameters out of compiled executables promotes the use of a single application package version across multiple microservice instances. Given that all Merritt microservices are high-availability, this added benefit clearly fits in with long term goals pertaining to scalability.

Our approach

But what does it mean to streamline configuration? For our team, it boils down to removing the complexity of literally hundreds of configuration properties files from our codebase, and shifting to the use of a centralized parameter store provided by AWS Systems Manager (a.k.a. SSM). 

SSM provides a parameter store that allows us to make use of a hierarchy of configuration parameters defined in YAML files. More specifically, the keys to parameters are stored in .yml files in application code repositories. Each compiled executable contains these keys, rather than values specific to a particular instance of the application running on a host. The key hierarchy incorporates entries for properties grouped by environment (dev/stage/production) including, for example, paths to endpoints, and service parameters such as regions, nodes and bucket names. Here’s a snippet of a YAML file with a key that refers to a database password in a Stage environment, followed by a corresponding section for Production:

stage:
  user: username
  password: {!SSM: app/db-password} 
  debug-level: {!SSM: app/debug-level !DEFAULT: warning}
  hostname: {!ENV: HOSTNAME}

production:
  user: username
  password: {!SSM: app/db-password} 
  debug-level: {!SSM: app/debug-level !DEFAULT: error} 
  hostname: {!ENV: HOSTNAME}

Importantly, an environment-specific path for a property is concatenated with the desired key value during a call to the SSM centralized store. For the above Stage environment example, a call to the store would result in the following path being used to obtain a secret:

/system/stage/app/db-password

When possible, keys are used in conjunction with SSM API endpoints to obtain actual secrets and other configuration information from the centralized store. On application startup, all necessary configuration values are obtained from the store and loaded at runtime. If it is not feasible to use an endpoint, information can be copied from the store into local environment variables on individual microservice EC2 hosts. In our case, environment variables are used for properties that are not sensitive information or are expected to be unchanged at runtime. 
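As a minimal sketch of such a lookup (using the AWS SDK for Java v2 rather than Merritt’s actual implementation), resolving the concatenated path at startup might look like this:

import software.amazon.awssdk.services.ssm.SsmClient;
import software.amazon.awssdk.services.ssm.model.GetParameterRequest;

public class SsmLookup {
    public static void main(String[] args) {
        SsmClient ssm = SsmClient.create();

        // Resolve the environment-specific path built from the YAML key.
        GetParameterRequest req = GetParameterRequest.builder()
                .name("/system/stage/app/db-password")
                .withDecryption(true) // SecureString values are decrypted via KMS
                .build();

        // The secret lives only in memory at runtime, never in the codebase.
        String dbPassword = ssm.getParameter(req).parameter().value();
    }
}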

Furthermore, if the parameter store is offline, an application can load a default value which causes it to throw an exception that is captured in logs. The following example illustrates such a default value for a cloud storage provider access key:

production:
  user: username
  password: {!SSM: app/db-password}
  debug-level: {!SSM: app/debug-level !DEFAULT: error}
  hostname: my-prod-hostname
  accessKey: "{!SSM: cloud/nodes/my-accessKey !DEFAULT: SSMFAIL}"

Altogether, this process is dynamic in nature and is the basis of a strategy that enables loading and potentially re-loading values when required, without the need to incur downtime. Although each application must of course implement a mechanism to reload configuration on-demand, use of the SSM parameter store enables our stage and production microservices to be configured on-the-fly. In the future, we see this as a potential way to switch to new storage nodes that come online when it is beneficial to channel incoming content to a node during a migration, or for other risk mitigation purposes. And altogether, it’s our hope that we’ll be able to minimize downtime, and make the Merritt system more resilient to changes to its dependencies.

Working with DevOps

Underlying the implementation of SSM calls in our microservice applications is associated infrastructure and a partnership with our DevOps team. By working with DevOps, we were able to complete preliminary research into the use of SSM and tackle the learning curves that come with it, all while the team experimented with different approaches to configure, set and retrieve SSM values on Merritt’s EC2 instances. Based on this cooperative experimentation, we designed our overall approach.

To begin working with SSM, Identity and Access Management (IAM) roles needed to be configured to perform certain actions. These roles help a DevOps administrator control access to resources provided by the systems manager. Roles are then assigned SSM-specific policies which allow Systems Manager to interact with an EC2 instance via a related IAM instance profile. Each policy defines access such as Systems Manager core functionality, S3 bucket access or running a CloudWatch agent.

In addition to the above roles, each of our AWS resources, servers, and lambda functions are assigned AWS tags. These tags are metadata that serve to define characteristics of each resource. For example, we have a tag that corresponds to a group of EC2 instances that all run Merritt’s Ingest microservice. Tags are used for many reasons. One in particular is to restrict access across environments. For example, recall our earlier SSM parameter path:

 /system/stage/app/db-password

This path corresponds to the tags for a stage resource. Therefore, the environment-specific construction of a tag promotes restrictions, such as a stage EC2 instance only being able to read stage parameters. It would therefore never be able to successfully access (e.g.) a production database password. Tags and policies govern the management and access to SSM parameters, and provide for an inherently more robust configuration strategy.
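A CloudFormation-style policy fragment conveys the idea (the account ID, region, and action list below are placeholder assumptions, not our actual policy):

# Hypothetical policy fragment: a stage role may read only stage parameters.
PolicyDocument:
  Version: "2012-10-17"
  Statement:
    - Effect: Allow
      Action:
        - ssm:GetParameter
        - ssm:GetParametersByPath
      Resource: arn:aws:ssm:us-west-2:123456789012:parameter/system/stage/*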

Of course, while these tag-based restrictions exist for Merritt’s microservice hosts, for administrative purposes, the team has access to an operations server that provides a central hub to manage SSM parameters. Here one can query SSM parameter values for all resources across all environments in UC3 systems. Our operations server makes for an excellent center of cooperation with DevOps that promotes further experimentation with SSM. 

One such experiment that’s come to fruition is a set of tools created by our DevOps engineer to assist with routine tasks. It consists of aliases for common SSM commands, wrapper scripts, and other utility functions. One wrapper script in particular allows for the retrieval of SSM parameters that store database credentials. It provides well-defined and secure access to databases according to the same roles established for the parameters themselves. In fact, the script allows a user to access a database without ever viewing required credentials.

Current implementation and the road ahead

So where are we on our journey with regard to the use of SSM and its secure parameter store? At this point, the Merritt microservice with the most complex configuration strategy is now taking advantage of the parameter store in production. This is Merritt’s Storage microservice, which coordinates the shuttling of digital objects and their metadata to multiple cloud storage providers for safekeeping. During this implementation, we literally did move hundreds of configuration parameters and files from a private code repository into the SSM parameter store. We’ve migrated Merritt’s frontend service as well, which provides the system’s UI and a limited number of API endpoints. These two services are also significant from the standpoint of implementation languages: the Storage service is a Java application, while the frontend is a Ruby on Rails application. We’ve strived for parity in functionality across implementations in both languages.

At present, we’re working on similar implementations for two more Java-based microservices. These control the ingest of new content, and the execution of inventory tasks for stored objects and their versions. Once done with those, we’ll be able to wrap up our configuration strategy by doing the same for our replication and fixity checking applications. 

In summary, it can be said that this project has helped (and is helping) us on a number of levels. Not only is it providing for centralized, secure application configuration management – but it is also allowing us to begin realizing the larger goal of dynamically scaling Merritt’s microservices. With scaling and increased resilience, the system will ultimately better serve our users and bolster our mission to provide a secure, cost-effective digital preservation solution.

If you would like to learn more about the specifics of our SSM implementation, please visit the following links.

Digital Preservation at CDL: Where We Are Headed

At CDL, our digital preservation strategy hinges on offering trusted, reliable, low-cost preservation services to the University of California. Over the past year, we’ve been busy moving forward with these values in mind. 

Along the way, the Merritt team has achieved its CoreTrustSeal certification for our preservation repository, evaluated two new cloud storage solutions, established a more thorough documentation portal, and embarked on a major data migration. At the same time, we’ve seen two of our colleagues move on to new endeavors, while two new ones – including myself and our new lead developer – have been welcomed to the UC3 team.

Over the past months, we have also been rethinking and evaluating key aspects of our approach to digital preservation. Some of these aspects include:

The first two aspects in particular relate directly to the National Digital Stewardship Alliance’s Levels of Preservation, a set of guidelines that have been well and widely received by the larger digital preservation community.

By the end of January 2020, when our team plans to complete the final steps of its data migration, we will have relocated the primary copy of the majority of Merritt’s objects and collections from OpenStack Swift storage at the San Diego Supercomputer Center (SDSC) to a new storage technology offered by SDSC. Known as Qumulo object storage, it is roughly one quarter of the cost of Swift, and will allow us to critically reduce the dollar amount per TB that we pass on to campuses as a recharge.

For any individual digital object, use of Qumulo is a significant first step toward realizing our new preservation approach. The approach we’ve agreed to pursue at UC3 also entails another object copy in a geographically separate location. To this end, we’re in the process of entering into an agreement with Wasabi Hot Cloud Storage. Wasabi’s US-East-2 data center, part of the Iron Mountain facility in Manassas, Virginia, will serve as the location for this third copy. As with Qumulo, Wasabi storage is online and allows for fixity checking without additional request or retrieval costs, which will in turn lend us the ability to run checks on two of three copies at any time.

The total cost for these two auditable copies, plus a third, nearline copy stored in Amazon Glacier, will still amount to less than a quarter of our current recharge per TB. This new combination allows us to implement a preservation approach where there are three individual copies of each object, two of which are in geographically separate regions, and one of which is stored in a less volatile, nearline service. And with such a significant cost reduction for campuses and affiliated organizations, we hope to break down the monetary barrier for libraries, cultural heritage institutions, and other organizations across the University of California.

By midsummer 2020, our plan is to have completed the implementation of our new storage configuration, allowing us to introduce campuses to a much more inexpensive option for digital preservation. 

We’re excited by this team goal, and look forward to engaging with libraries and organizations across UC to help them realize their own goals surrounding digital preservation.