
Digital Preservation at CDL: Where We Are Headed

At CDL, our digital preservation strategy hinges on offering trusted, reliable, low-cost preservation services to the University of California. Over the past year, we’ve been busy moving forward with these values in mind.

Along the way, the Merritt team has achieved CoreTrustSeal certification for our preservation repository, evaluated two new cloud storage solutions, established a more thorough documentation portal, and embarked on a major data migration. At the same time, two of our colleagues have moved on to new endeavors, and two new members, myself and our new lead developer, have been welcomed to the UC3 team.

Over the past months, we have also been rethinking and evaluating key aspects of our approach to digital preservation. Some of these aspects include:

As an organization, we would like to introduce increased geographic separation across digital object copies, to ensure that our copies reside in locations with different disaster threats.
Along the same lines, we would like to introduce additional storage diversification across object copies, also in order to mitigate risk.
Though cloud storage is an integral part of Merritt, its use has prevented our team from being able to effectively lower the cost of preservation for UC campuses and affiliated organizations. We would like to find ways to dramatically lower our costs.
We have not been satisfied with some of the technical limitations that come along with one of our storage solutions, particularly with regard to fixity checking, which confirms that data in the repository remains unchanged and unaffected by potential bit rot.

The first two aspects in particular relate directly to the National Digital Stewardship Alliance’s Levels of Preservation, a set of guidelines that has been widely and well received by the larger digital preservation community.

Moving forward, by the end of January 2020, when our team plans to complete the final steps of its data migration, we will have relocated the primary copy of the majority of Merritt’s objects and collections from OpenStack Swift storage at the San Diego Supercomputer Center (SDSC) to a new storage technology offered by SDSC. Known as Qumulo object storage, it costs roughly one quarter as much as Swift, and will allow us to substantially reduce the per-TB dollar amount that we pass on to campuses as a recharge.

For any individual digital object, use of Qumulo is a significant first step toward realizing our new preservation approach. Our approach, as we’ve agreed to pursue at UC3, also entails another object copy in a geographically separate location. To this end, we’re in the process of entering into an agreement with Wasabi Hot Cloud Storage. Wasabi’s US-East-2 data center, part of the Iron Mountain facility in Manassas, Virginia, will serve as the location for this third copy. As with Qumulo, Wasabi storage is online and allows for fixity checking without additional request or retrieval costs, which in turn will let us run checks on two of three copies at any time.
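For readers less familiar with fixity checking, here is a minimal sketch of the idea in Python. It is not Merritt's implementation: the choice of SHA-256 and the function names `sha256_digest` and `fixity_ok` are assumptions made for illustration. The sketch recomputes a digest over an object's bytes and compares it with the digest recorded when the object was deposited.

```python
import hashlib
import io


def sha256_digest(stream, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a binary stream, read in chunks."""
    hasher = hashlib.sha256()
    # Read fixed-size chunks so large objects never have to fit in memory.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        hasher.update(chunk)
    return hasher.hexdigest()


def fixity_ok(stream, recorded_digest):
    """True if the stream's current digest matches the recorded digest."""
    return sha256_digest(stream) == recorded_digest.lower()


# Hypothetical object content and the digest recorded at deposit time.
original = b"example object content"
recorded = hashlib.sha256(original).hexdigest()

# An unchanged copy passes; a copy with a single altered byte fails.
print(fixity_ok(io.BytesIO(original), recorded))
print(fixity_ok(io.BytesIO(b"example object c0ntent"), recorded))
```

Because a check like this has to read every byte of a copy, it is only practical to run routinely on storage that does not charge per request or retrieval, which is why online copies matter here.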

The total cost for these two auditable copies, plus a third, nearline copy stored in Amazon Glacier, will still amount to less than a quarter of our current recharge per TB. This new combination allows us to implement a preservation approach in which there are three individual copies of each object, two of which are in geographically separate regions, and one of which is stored in a less volatile, nearline service. And with such a significant cost reduction for campuses and affiliated organizations, we hope to lower the monetary barrier to preservation for libraries, cultural heritage institutions, and other organizations across the University of California.
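The copy arrangement described above can be expressed as a small policy check. This is only an illustrative sketch: the `Copy` class, the `meets_policy` function, and the region labels are hypothetical, and the region shown for the Glacier copy is a placeholder because the post does not state it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    provider: str
    region: str   # illustrative label, not an official region name
    online: bool  # True if the copy can be audited at any time


def meets_policy(copies):
    """Check for three copies, two auditable, one nearline, two regions."""
    online = [c for c in copies if c.online]
    nearline = [c for c in copies if not c.online]
    regions = {c.region for c in copies}
    return (
        len(copies) >= 3
        and len(online) >= 2
        and len(nearline) >= 1
        and len(regions) >= 2
    )


planned = [
    Copy("Qumulo at SDSC", "us-west", online=True),
    Copy("Wasabi", "us-east", online=True),
    # Region is a placeholder; the post does not say where this copy lives.
    Copy("Amazon Glacier", "us-west", online=False),
]

print(meets_policy(planned))
```

A configuration with all copies in one region, or with no nearline copy, would fail the check.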

By midsummer 2020, we plan to have completed the implementation of our new storage configuration, allowing us to offer campuses a far less expensive option for digital preservation.

We’re excited by this team goal, and look forward to engaging with libraries and organizations across UC to help them realize their own goals surrounding digital preservation.

Expanding the Carpentries Community in California Academic Libraries

Today we are announcing a new project to help build regional training partnerships in California.

As the Data Services Librarian at the University of California, San Francisco, and co-chair of the Library Carpentry Advisory Committee, I have been privileged to witness the transformative impact of data and software training for librarians. Whether it is a cataloging librarian figuring out how to automate data uploads, an instruction librarian realizing there is an easier way to clean their messy class registration data, or a research data management librarian who now has the language to connect with computational researchers, data training allows librarians to become invaluable resources to their libraries and academic communities. I am therefore very excited to share that I will be working with The Carpentries Team on a special project to develop the Carpentries community in academic libraries across California. The goal of this project is to create sustainable communities of academic libraries that can work together to make computational training more accessible to librarians and the communities that they serve. As part of this project, I will be reaching out to promote the work of the Carpentries in regional library groups, piloting new membership models for library networks and consortia, assisting with Library Carpentry workshops across the state, and offering more instructor training opportunities for California librarians.

Are you a California librarian interested in what the Carpentries can offer your library? Want to learn more about hosting a Library Carpentry workshop? Send me an email at ariel.deardorff@ucsf.edu.

This post was guest-authored by Ariel Deardorff, Data Services Librarian at UCSF, and originally published at The Carpentries blog.

Persistent Identifier Services at CDL: A Rich Tapestry

EZID is one strand in a larger tapestry of persistent identifier activity at CDL. These activities, at their core, are focused on how and where persistent identifiers can help enrich and connect the scholarly outputs and cultural heritage materials of the University of California system. Persistent identifiers in this sense both drive and support CDL’s underlying mission to “provide[s] transformative digital library services, grounded in campus partnerships and extended through external collaborations, that amplify the impact of the libraries, scholarship, and resources of the University of California.”

The past year was a transitional one for EZID in particular and for CDL’s identifier services portfolio in general. In the first half of 2019, we completed a multi-year process to rescope EZID’s DOI services to focus exclusively on UC users. We worked to support non-UC users of our DOI services in moving to other providers through direct memberships with Crossref and DataCite. We also welcomed Rushiraj Nenuji to the development team as we said farewell to EZID’s long-time developer and original architect Greg Janée.

Last year, in the midst of these transitions, we posed the following question:

Rather than thinking about EZID solely as a tool or a service, we want to situate it instead as one layer of a deep and broad persistent identifier portfolio at CDL. EZID is a great tool for creating and managing DOIs and ARKs—what else could it do? And how might it also support infrastructure, training, and outreach for a more networked and interoperable scholarly communication ecosystem through the use and coordination of persistent identifiers?

Now, as we kick off the new year, we wanted to provide a brief update on what this persistent identifier services portfolio looks like, and how it will continue to evolve in the months ahead.

EZID remains involved in the day-to-day business of supporting DOI and ARK services for UC campuses as well as ARK services for non-UC EZID members. EZID development work is currently focused on strengthening and upgrading the application for long-term robustness and stability, and reconfiguring the platform to minimize dependencies on external systems. Future development work in the coming months will be geared toward optimizing the EZID user interface and adding more support for different metadata schemas.

From the portfolio perspective, we are working on a number of initiatives to encourage and enable the adoption and use of persistent identifiers across the UCs and beyond. A few examples:

We work closely with CDL’s eScholarship Publishing team to help UC journals obtain Crossref DOIs. An integration between eScholarship and EZID assigns DOIs automatically to eScholarship journal articles and sends the metadata to Crossref. These articles are then available to indexes, libraries, and other third parties, enhancing journals’ exposure and increasing the discoverability of their Open Access content. This service currently supports about 20 journals, and our teams plan to expand it to more publications in the year ahead. Two related efforts concern greater adoption of ARK identifiers for special collections objects (UCSF’s Industry Documents Library is one recent project), and DataCite DOIs for UC data repositories.

Organization identifiers are growing in visibility across the scholarly infrastructure landscape with the launch of the Research Organization Registry (ROR), of which CDL is a founding partner. ROR now includes unique IDs for approximately 97,000 organizations, and these IDs are being supported in both DataCite and Crossref metadata. A number of platforms are integrating or looking to integrate ROR into their systems wherever affiliations are collected. The new Dryad platform was the first to pilot this type of ROR integration, and Dryad now has clean and consistent affiliation data for all of its datasets. With additional integrations expected in the new year, it will become increasingly easy for libraries and research administrators to track and analyze their institutions’ scholarly outputs.

Engaging with the broader PID community is another important aspect of our ongoing work. CDL is a member of the ORCID US Community, joining other institutions around the country in championing adoption and use of ORCID identifiers by UC researchers. We are also a founding sponsor of PIDapalooza, the festival of persistent identifiers now approaching its fourth year. We are collaborating within and beyond the UC in persistent identifier training and outreach, including providing guidance on identifiers for UC librarians, and organizing global workshops for stakeholders and practitioners.

All of these efforts showcase how persistent identifier services capture the spirit of CDL’s vision as a “catalyst for deeply collaborative solutions providing a rich, intuitive and seamless environment for publishing, sharing and preserving our scholars’ increasingly diverse outputs.”

We are looking forward to the year ahead! As always, get in touch with your ideas and questions.