The Datamirror.org Experiment: Preservation Assurance for Federal Research Data

Stephen Abrams, July 3, 2018

Posted in: Data Mirror, Data Publication, UC3

In early 2017, UC3 created Datamirror.org as an independent, dynamic, online mirror of Data.gov, the US federal government’s primary research data portal. Developed in collaboration with Code for Science & Society (CSS), a non-profit organization supporting innovative uses of technology for public good, Datamirror was intended to provide additional levels of assurance that the significant research data found at Data.gov remains freely accessible to the scholarly community and the public for open retrieval and reuse. As noted by the government’s Project Open Data initiative, “Data is a valuable national resource and a strategic asset to the U.S. Government, its partners, and the Public.” Thus, Datamirror.org plays a critical role in protecting this valuable resource from risks of data loss or loss of availability due to technological obsolescence, funding constraints, shifting organizational priorities, malicious attack, or inadvertent error.

History

Datamirror.org can be seen as one activity within the larger “data rescue” or “refuge” movement that has arisen spontaneously in recent years in recognition of the central role data plays in so many aspects of commerce, culture, science, and education. These activities rely on a broad informal coalition of scholars, librarians, public interest groups, and citizen-scientists who have participated in numerous rescue events to collect, catalog, and provide open access to federal research data. While these efforts are significant, most target narrowly focused data sources, so critical scale is reached only through the uncoordinated actions of many independent actors, with the unfortunate risk of needless duplication of effort.

Datamirror.org takes an alternative – what could be called “wholesale” – approach of automated agency-spanning collection from the central point of the already-existing aggregation by Data.gov. While a mirror of Data.gov, Datamirror.org differs from Data.gov in one important respect:

Data.gov is a portal, providing actionable links to individual datasets, but not hosting the datasets themselves, which reside on individual agency websites.
Datamirror.org, on the other hand, is both portal and repository, maintaining descriptive information about the data (the portal function) as well as holding actual copies of the datasets themselves (the repository function).

The reason for this is to provide greater confidence that datasets discovered via Data.gov remain retrievable, without distracting researchers from the original sources of information. Our approach keeps attention focused on the Data.gov aggregation: a researcher could use Datamirror.org at any time, but would only need to do so if a dataset is no longer retrievable through its catalog entry on Data.gov.

How Does It Work?

To date, Datamirror.org holds over 152,000 datasets totaling 42 TB originating from 188 organizational units spread across more than 50 federal agencies and laboratories. (While Data.gov is focused on federal research data, it also catalogs datasets from state, county, local, city, regional, and tribal governments, as well as commercial, non-profit, and educational sources. To avoid any potential intellectual property rights issues, Datamirror.org captures only the federal subset of the full Data.gov corpus.) Datamirror.org scans the Data.gov portal every 4 hours to identify new or modified metadata or data links; if any are found, Datamirror.org is automatically updated with the new metadata and/or new copies of the data files. In most cases, a Datamirror.org catalog entry is identical to its Data.gov counterpart, with official metadata from the agency and links to the official copy at the agency, but with the addition of links to the local preservation copies available on Datamirror.org servers.
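The change-detection step of such a periodic scan can be illustrated with a minimal sketch. This is not the actual Datamirror.org code, which is not shown here; it assumes each catalog snapshot is a dictionary keyed by dataset identifier, and that each record carries a `metadata_modified` timestamp and a list of `resources` with a `url`, in the style of CKAN dataset records. The names `fingerprint`, `diff_snapshots`, and `Changes`, and the example URLs, are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Changes(NamedTuple):
    new: List[str]       # dataset ids seen for the first time
    modified: List[str]  # dataset ids whose metadata or data links changed


def fingerprint(dataset: dict) -> tuple:
    """Reduce a dataset record to the fields the scan compares.

    Assumption: a change is signalled either by the metadata timestamp
    or by the set of data file links.
    """
    urls = tuple(sorted(r["url"] for r in dataset.get("resources", [])))
    return (dataset.get("metadata_modified"), urls)


def diff_snapshots(previous: Dict[str, dict], current: Dict[str, dict]) -> Changes:
    """Compare the last scan with the current one."""
    new = sorted(i for i in current if i not in previous)
    modified = sorted(
        i for i in current
        if i in previous and fingerprint(current[i]) != fingerprint(previous[i])
    )
    return Changes(new, modified)


# Hypothetical snapshots from two consecutive scans.
previous = {
    "a": {"metadata_modified": "2018-01-01T00:00:00",
          "resources": [{"url": "https://agency.example/a.csv"}]},
    "b": {"metadata_modified": "2018-01-01T00:00:00",
          "resources": [{"url": "https://agency.example/b.csv"}]},
}
current = {
    "a": {"metadata_modified": "2018-01-01T00:00:00",
          "resources": [{"url": "https://agency.example/a.csv"}]},
    "b": {"metadata_modified": "2018-02-01T00:00:00",
          "resources": [{"url": "https://agency.example/b-v2.csv"}]},
    "c": {"metadata_modified": "2018-02-01T00:00:00",
          "resources": [{"url": "https://agency.example/c.csv"}]},
}

changes = diff_snapshots(previous, current)
print(changes)
```

In a design along these lines, only the datasets listed in `changes.new` and `changes.modified` would need their metadata refreshed and their files re-copied, so each 4-hour pass touches a small fraction of the corpus.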

The project was outlined as a recommended path forward for data rescue efforts at the Libraries+ Network workshop. It was also highlighted in Against the Grain as a successful project working to preserve federal research data. Like Data.gov, Datamirror.org is built on the open source CKAN data management system. UC3 recently participated in the CKANconUS conference, giving a summary of the Datamirror.org project. Datamirror.org was developed and is operated with the cooperation of Data.gov staff at the General Services Administration (GSA), Technology Transformation Service.

Lessons Learned

The creation of Datamirror.org was made possible by the serendipitous availability to UC3 of spare storage capacity following the successful conclusion of an unrelated project. That storage has now reached the end of its service life and, unfortunately, reprovisioning the necessary 42+ TB of capacity is financially prohibitive. UC3 is exploring options to identify alternative sources of funding or organizations prepared to take on hosting responsibility for the Datamirror.org corpus.

While the UC3 Datamirror.org experiment is coming to a close, the lessons learned from the exercise remain valid and pertinent to future related initiatives. Most importantly, it has validated the wholesale automated approach to data collection. Data creators can and should continue to contribute their data to central aggregation sites like Data.gov, where the data will have the most widespread visibility for high-level discovery. Digital preservationists can then step in effectively to build added-value systems like Datamirror.org that offer increased preservation assurance through additional backup copies for use in the event that the primary copies ever become inaccessible. This is the best way for all of us to quickly collect the broadest set of federal research data with the least amount of duplicative work as well as the least amount of human effort and error.