
Merritt and Dash Certified as Trustworthy Repositories

University of California Curation Center (UC3) is happy to announce that its Merritt repository and companion Dash data publishing platform have received CoreTrustSeal (CTS) certification. This certification brings transparency and accountability to our stakeholder communities, as it provides public evidence of our adherence to community-accepted norms for the preservation and accessibility of managed digital content. Through this certification, Merritt and Dash join a select group of 137 international repositories meeting established standards for demonstrating the necessary technical, operational, and organizational trustworthiness. What this certification means for our institutional users of Merritt – librarians, archivists, curators – and for individual scholars and researchers using Dash is that their digital content is being stewarded and preserved for long-term access and reuse in an appropriate and effective manner.

The CTS certification process began with a critical self-audit by UC3 staff addressing issues in 16 key areas, organized into three high-level topical categories: Organizational Infrastructure (mission/scope, licensing, business continuity, ethical norms, business structure, and expertise and consultation); Digital Object Management (data integrity and authenticity, appraisal, storage, preservation planning, quality assurance, documented workflows, identification and discovery, and facilitating reuse); and Technology (infrastructure and security).  The UC3 submission was then reviewed by independent external experts who provided valuable comments and feedback, asking only for minor clarification of several points. Our revised submission was given final approval on August 7, 2018.

While certification does not entail any change in established behavior or workflow by Merritt and Dash users, it is indicative of a higher level of curatorial and preservation service and assurance provided by those systems and the UC3 team.  The achievement of certification is a reflection of UC3’s and CDL’s ongoing commitment to support innovative and sustainable open scholarship. It is also an important step in meeting the UC Libraries’ complementary strategic goals of maximizing discovery of and ensuring long-term access to the University’s valuable and often unique digital content.

The CoreTrustSeal (CTS) certification instrument and organization represents the consolidation of two prior independent certification groups: the Data Seal of Approval (DSA) and the ICSU World Data System (WDS).  CTS is also a component of the European Framework for Audit and Certification of Digital Repositories, a collaboration between CTS, the Consultative Committee on Space Data Systems (CCSDS), developer of the influential ISO 14721 Open Archival Information System (OAIS) reference model and its companion ISO 16363 Audit and Certification of Trusted Digital Repositories (TDR) standard, and DIN (Deutsches Institut für Normung), the German national standards organization.

The Datamirror.org Experiment: Preservation Assurance for Federal Research Data

In early 2017, UC3 created Datamirror.org as an independent, dynamic, online mirror of Data.gov, the US federal government’s primary research data portal.  Developed in collaboration with Code for Science & Society (CSS), a non-profit organization supporting innovative uses of technology for public good, Datamirror was intended to provide additional levels of assurance that the significant research data found at Data.gov remains freely accessible to the scholarly community and the public for open retrieval and reuse.  As noted by the government’s Project Open Data initiative, “Data is a valuable national resource and a strategic asset to the U.S. Government, its partners, and the Public.”  Thus, Datamirror.org plays a critical role in protecting this valuable resource from risks of data loss or loss of availability due to technological obsolescence, funding constraints, shifting organizational priorities, malicious attack, or inadvertent error.  

History

Datamirror.org can be seen as one activity within the larger "data rescue" or "refuge" movement that has arisen spontaneously in recent years in recognition of the central role data plays in so many aspects of commerce, culture, science, and education. These activities rely on a broad informal coalition of scholars, librarians, public interest groups, and citizen-scientists who have participated in numerous rescue events to collect, catalog, and provide open access to federal research data. While these efforts are significant, they are mostly targeted at narrowly-focused data sources, which means that critical scale is reached only through the uncoordinated actions of many independent actors, with the attendant risk of needless duplication of effort.

Datamirror.org takes an alternative – what could be called "wholesale" – approach of automated, agency-spanning collection from the central point of the already-existing Data.gov aggregation. While a mirror of Data.gov, Datamirror.org differs from it in one important respect: it is intended as a preservation fallback rather than as a primary point of discovery and retrieval.

The reason for this is to provide greater confidence that datasets discovered via Data.gov remain retrievable, while not distracting researchers from the original sources of information. Our approach keeps attention focused on the Data.gov aggregation: a researcher could use Datamirror.org at any time, but would only need to do so if a dataset is no longer retrievable through its catalog entry on Data.gov.

How Does It Work?

To date, Datamirror.org holds over 152,000 datasets totaling 42 TB originating from 188 organizational units spread across more than 50 federal agencies and laboratories.  (While Data.gov is focused on federal research data, it also catalogs datasets from state, county, local, city, regional, and tribal governments, as well as commercial, non-profit, and educational sources.  To avoid any potential intellectual property rights issues, Datamirror.org captures only the federal subset of the full Data.gov corpus.) Datamirror.org scans the Data.gov portal every 4 hours to identify new or modified metadata or data links; if found, Datamirror.org is automatically updated with new metadata and/or new copies of the data files.  In most cases, Datamirror.org is identical to Data.gov, with official metadata from the agency and links to the official copy at the agency, but with the addition of links to the local preservation copies available on Datamirror.org servers.
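
To give a concrete sense of how such a harvest cycle might look, the sketch below polls a CKAN catalog's search API (Data.gov runs CKAN) for packages modified within the last four hours and identifies the resource URLs that would need re-fetching. This is a minimal, hypothetical sketch, not the production Datamirror.org harvester; the endpoint URL, the Solr date filter, and the mirror_resource stub are assumptions for illustration.

    import requests

    # Hypothetical sketch: ask a CKAN catalog (Data.gov runs CKAN) for
    # packages whose metadata changed in the last four hours.
    CKAN_SEARCH = "https://catalog.data.gov/api/3/action/package_search"

    def recently_modified(rows=100):
        """Yield packages modified in the last 4 hours, paging through results."""
        start = 0
        while True:
            resp = requests.get(CKAN_SEARCH, params={
                "fq": "metadata_modified:[NOW-4HOURS TO NOW]",  # Solr date filter
                "rows": rows,
                "start": start,
            })
            resp.raise_for_status()
            result = resp.json()["result"]
            yield from result["results"]
            start += rows
            if start >= result["count"]:
                break

    def mirror_resource(url):
        # Placeholder: a real harvester would download the file and store a
        # preservation copy alongside the harvested metadata.
        print("would fetch:", url)

    for pkg in recently_modified():
        for res in pkg.get("resources", []):
            if res.get("url"):
                mirror_resource(res["url"])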

The project was outlined as a recommended path forward for the data rescue efforts at the Libraries+ Network workshop. It was also highlighted in Against the Grain as a successful project working to preserve federal research data. The software stack for Datamirror.org, like that of Data.gov, uses the open source CKAN data management system. UC3 recently participated in the CKANconUS conference, giving a summary of the Datamirror.org project. Datamirror.org was developed and is operated with the cooperation of Data.gov staff at the General Services Administration (GSA) Technology Transformation Service.

Lessons Learned

The creation of Datamirror.org was made possible by the serendipitous availability of spare storage capacity at UC3 following the successful conclusion of an unrelated project. That storage has now reached the end of its service life and, unfortunately, reprovisioning the necessary 42+ TB of capacity is financially prohibitive. UC3 is exploring options to identify alternative sources of funding or organizations prepared to take on hosting responsibility for the Datamirror.org corpus.

While the UC3 Datamirror.org experiment is coming to a close, the lessons learned from the exercise remain valid and pertinent to future related initiatives. Most importantly, it has validated the wholesale automated approach to data collection. Data creators can and should continue to contribute their data to central aggregation sites like Data.gov, where the data will have the most widespread visibility for high-level discovery. Digital preservationists can then step in to build added-value systems like Datamirror.org that offer increased preservation assurance through additional backup copies for use in the event that the primary copies ever become inaccessible. This is the best way to collect the broadest set of federal research data quickly, with the least duplication of work and the minimum of human effort and error.

Additional preservation assurance with DPN

CDL is a founding member of the Digital Preservation Network (DPN), a coalition of over 50 academic libraries, foundations, and non-profit memory institutions dedicated to the long-term preservation of the scholarly and cultural record.  UCLA and UCSD are also DPN members.  DPN supports a high level of preservation assurance through widespread replication of digital assets across a geographically-dispersed network of five technically and administratively heterogeneous repositories.  DPN membership agreements also incorporate language (a “quitclaim”) that ensures continuity of preservation management in the event a member organization cannot or chooses not to continue to exercise stewardship responsibility for material previously contributed to the network.  As a benefit of membership, CDL has the opportunity to contribute up to 5 TB of content to DPN annually at no additional cost.

In late 2015 the UC Libraries Advisory Structure (UCLAS) Direction and Oversight Committee (DOC) formed a DPN allocation project team (DAPT) to investigate how UC members could best take advantage of this DPN capacity. The DAPT recommended that CDL's 5 TB allotment be used as "a common resource for systemwide benefit." CDL determined that several collection groups drawn from content managed in UC3's Merritt repository meet that criterion.

All told, over 519,000 digital resources, 13.6 million files, and 3.3 TB have been successfully transferred to DPN, which maintains three independent external replicas, hosted across the Academic Preservation Trust (APT), HathiTrust, Texas Digital Library (TDL), and UCSD, in addition to the replication internal to the Merritt repository at the San Diego Supercomputer Center (SDSC) and the Amazon AWS S3 and Glacier storage clouds.  (As impressive as these numbers sound, the DPN subset constitutes only about 19% by number and 4% by size of the full Merritt corpus.)

Due to a flurry of deposits by DPN members at the end of 2017, submission processing took longer than expected, extending into February 2018.  To avoid a similar rush this year, the deposit of the 2018 Merritt material will begin earlier, with planning starting in September.

Dat-in-the-Lab: Announcing UC3 research collaboration

We are excited to announce that the Gordon and Betty Moore Foundation has awarded a research grant to the California Digital Library and Code for Science & Society (CSS) for the Dat-in-the-Lab project to develop practical new techniques for effective data management in the academic research environment.

Dat-in-the-Lab

The project will pilot the use of CSS’s Dat system to streamline data preservation, publication, sharing, and reuse in two UC research laboratories: the Evolution: Ecology, Environment lab at UC Merced, focused on basic ecological and evolutionary research under the direction of Michael Dawson; and the Center for Watershed Sciences at UC Davis, dedicated to the interdisciplinary study of water challenges.  UC researchers are increasingly faced with demands for proactive and sustainable management of their research data with respect to funder mandates, publication requirements, institutional policies, and evolving norms of scholarly best practice.  With the support of the UC Davis and UC Merced Libraries, the project team will conduct a series of site visits to the two UC labs in order to create, deploy, evaluate, and refactor Dat-based data management solutions built for real-world data collection and management contexts, along with outreach and training materials that can be repurposed for wider UC or non-UC use.  

What is Dat?

The Dat system enables effective research data management (RDM) through continuous data versioning, efficient distribution and synchronization, and verified replication.  Dat lets researchers continue to work with the familiar paradigm of file folders and directories yet still have access to rich, robust, and cryptographically-secure peer-to-peer networking functions.   You can think of Dat as doing for data what Git has done for distributed source code control.  Details of how the system works are explained in the Dat whitepaper.
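
To make the verified-replication idea concrete, here is a toy sketch of content-addressed versioning: a publisher records a cryptographic hash for every file in a folder, and a replica re-hashes what it received to detect corruption or tampering. This is a generic illustration of the underlying concept, not Dat's actual data structures or wire protocol (which use append-only logs and Merkle trees, as described in the whitepaper); the file and directory names are hypothetical.

    import hashlib
    import json
    import pathlib

    # Toy illustration of verified replication via content hashing; Dat's
    # real design (append-only logs, Merkle trees, peer discovery) is far
    # richer -- see the Dat whitepaper.

    def snapshot(directory):
        """Record a SHA-256 hash for every file: one 'version' of the folder."""
        manifest = {}
        for path in sorted(pathlib.Path(directory).rglob("*")):
            if path.is_file():
                digest = hashlib.sha256(path.read_bytes()).hexdigest()
                manifest[path.relative_to(directory).as_posix()] = digest
        return manifest

    def verify(directory, manifest):
        """Replica side: return files that are missing or fail verification."""
        current = snapshot(directory)
        return [name for name, digest in manifest.items()
                if current.get(name) != digest]

    # Publisher side: hash the folder and publish the manifest with the data.
    manifest = snapshot("my-data")
    with open("manifest.json", "w") as f:
        json.dump(manifest, f, indent=2)

    # Replica side: after copying "my-data", confirm every file checks out.
    print("corrupted or missing files:", verify("my-data", manifest))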

Project partners

Dat-in-the-Lab is the latest expression of CDL’s longstanding interest in supporting RDM at the University of California, and is complementary to other initiatives such as the DMPTool for data management planning, the Dash data publication service, and active collaboration with local campus-based RDM efforts.  CSS is a non-profit organization committed to improving access to research data for the public good, and works at the intersection of technology with science, journalism, and government to promote openness, transparency, and collaboration.  Dat-in-the-Lab activities will be coordinated by Max Ogden, CSS founder and director; Danielle Robinson, CSS scientific and partnerships director; and Stephen Abrams, associate director of the CDL’s UC Curation Center (UC3).

Learn more

Stay tuned for monthly updates on the project. You can bookmark Dat-in-the-Lab on GitHub for access to code, curricula, and other project outputs. Also follow along as the project evolves on our roadmap, chat with the project team, and keep up to date through the project Twitter feed. For more information about UC3, contact us at uc3@ucop.edu and follow us on Twitter.