
The Datamirror Experiment: Preservation Assurance for Federal Research Data

Posted in Data Mirror, Data Publication, and UC3

In early 2017, UC3 created Datamirror as an independent, dynamic, online mirror of Data.gov, the US federal government’s primary research data portal. Developed in collaboration with Code for Science & Society (CSS), a non-profit organization supporting innovative uses of technology for public good, Datamirror was intended to provide additional levels of assurance that the significant research data found at Data.gov remains freely accessible to the scholarly community and the public for open retrieval and reuse. As noted by the government’s Project Open Data initiative, “Data is a valuable national resource and a strategic asset to the U.S. Government, its partners, and the Public.” Thus, Datamirror plays a critical role in protecting this valuable resource from risks of data loss or loss of availability due to technological obsolescence, funding constraints, shifting organizational priorities, malicious attack, or inadvertent error.

History

Datamirror can be seen as one activity within the larger “data rescue” or “refuge” movement that has arisen spontaneously in recent years in recognition of the central role data plays in so many aspects of commerce, culture, science, and education. These activities rely on a broad informal coalition of scholars, librarians, public interest groups, and citizen-scientists who have participated in numerous rescue events to collect, catalog, and provide open access to federal research data. While these efforts are significant, they are mostly targeted at narrowly focused data sources, which means that critical scale is reached only through the separate actions of many independent actors, with the unfortunate potential for needless duplication of effort. Datamirror takes an alternative approach, one that could be called “wholesale”: automated, agency-spanning collection from the central point of the already-existing aggregation by Data.gov. While a mirror of Data.gov, Datamirror differs from it in one important respect:

  • Data.gov is a portal, providing actionable links to individual datasets, but not hosting the datasets themselves, which reside on individual agency websites.
  • Datamirror, on the other hand, is both portal and repository, maintaining descriptive information about the data (the portal function) as well as holding actual copies of the datasets themselves (the repository function).

The reason for this is to provide greater confidence that datasets discovered via the Data.gov portal remain retrievable, while not distracting researchers from the original sources of information. Our approach keeps attention on the Data.gov aggregation. A researcher could use Datamirror at any time, but would only need to do so if a dataset is no longer retrievable through its catalog entry on Data.gov.
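The order of preference described above, official copy first and preservation copy only as a fallback, can be illustrated with a minimal sketch. The CatalogEntry type, its field names, and the is_retrievable callback are all hypothetical and introduced only for illustration; they do not describe the project's actual code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CatalogEntry:
    """One dataset as a mirror might describe it (hypothetical field names)."""
    title: str
    agency_url: str                  # official copy on the agency website
    preservation_url: Optional[str]  # local backup copy, if one was captured


def choose_source(entry: CatalogEntry,
                  is_retrievable: Callable[[str], bool]) -> Optional[str]:
    """Prefer the official copy; fall back to the preservation copy."""
    if is_retrievable(entry.agency_url):
        return entry.agency_url
    if entry.preservation_url and is_retrievable(entry.preservation_url):
        return entry.preservation_url
    return None  # neither copy can currently be fetched
```

Passing the retrievability check in as a function keeps the sketch free of network code; a real caller might supply a function that issues an HTTP request and inspects the response status.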

How Does It Work?

To date, Datamirror holds over 152,000 datasets totaling 42 TB, originating from 188 organizational units spread across more than 50 federal agencies and laboratories. (While Data.gov is focused on federal research data, it also catalogs datasets from state, county, local, city, regional, and tribal governments, as well as commercial, non-profit, and educational sources. To avoid any potential intellectual property rights issues, Datamirror captures only the federal subset of the full Data.gov corpus.) Datamirror scans the Data.gov portal every 4 hours to identify new or modified metadata or data links; if any are found, Datamirror is automatically updated with new metadata and/or new copies of the data files. In most cases, Datamirror is identical to Data.gov, with official metadata from the agency and links to the official copy at the agency, but with the addition of links to the local preservation copies available on Datamirror servers.
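As a rough illustration of the periodic scan, the sketch below shows one way new or modified records could be detected between scans. It is not the project's actual implementation: the record layout (an id, a metadata_modified timestamp, and a list of resources with a url each) is an assumption modeled loosely on CKAN-style dataset records, and fingerprint, find_changes, and mirror_index are hypothetical names.

```python
import hashlib
import json

# Assumed scan cadence, matching the 4-hour interval described in the text.
SCAN_INTERVAL_SECONDS = 4 * 60 * 60


def fingerprint(record):
    """Hash the fields that matter for change detection (assumed layout)."""
    relevant = {
        "metadata_modified": record.get("metadata_modified"),
        "urls": sorted(r["url"] for r in record.get("resources", [])),
    }
    payload = json.dumps(relevant, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def find_changes(portal_records, mirror_index):
    """Return (new_ids, modified_ids) relative to the previous scan.

    portal_records: iterable of dicts taken from the source catalog.
    mirror_index: dict mapping record id -> fingerprint from the last scan.
    """
    new_ids, modified_ids = [], []
    for record in portal_records:
        known = mirror_index.get(record["id"])
        if known is None:
            new_ids.append(record["id"])       # never seen before
        elif known != fingerprint(record):
            modified_ids.append(record["id"])  # metadata or links changed
    return new_ids, modified_ids
```

A scheduler would call find_changes once per interval, fetch fresh metadata and data files for the returned ids, and then store the new fingerprints in mirror_index for the next scan.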

The Datamirror project was outlined as a recommended path forward for data rescue efforts at the Libraries+ Network workshop. It was also highlighted in Against the Grain as a successful project working to preserve federal research data. The software stack for Datamirror, like that of Data.gov, uses the open source CKAN data management system. UC3 recently participated in the CKANconUS conference, giving a summary of the project. Datamirror was developed and is operated with the cooperation of staff at the General Services Administration (GSA), Technology Transformation Service.

Lessons Learned

The creation of Datamirror was made possible by the serendipitous availability to UC3 of spare storage capacity following the successful conclusion of an unrelated project. That storage has now reached the end of its service life and, unfortunately, reprovisioning the necessary 42+ TB of capacity is financially prohibitive. UC3 is exploring options to identify alternative sources of funding or organizations prepared to take on hosting responsibility for the corpus.

While the UC3 Datamirror experiment is coming to a close, the lessons learned from the exercise remain valid and pertinent to future related initiatives. Most importantly, it has validated the wholesale automated approach to data collection. Data creators can and should continue to contribute their data to central aggregation sites like Data.gov, where the data will have the most widespread visibility for high-level discovery. Digital preservationists can then step in effectively to build added-value systems like Datamirror that offer increased preservation assurance through additional backup copies for use in the event that the primary copies ever become inaccessible. This is the best way for all of us to quickly collect the broadest set of federal research data with the least duplicative work and the least human effort and error.
