For the last two months, UC3 has been working with the teams at Data.gov, Data Refuge, Internet Archive, and Code for Science (creators of the Dat Project) to aggregate government data.
Data that spans the globe
Volunteers across the country are currently working to discover and preserve publicly funded research, especially climate data, so that it is not deleted or lost from the public record. The largest initiative is called Data Refuge and is led by librarians and scientists. They are holding events across the UC campuses and the rest of the US, which you can attend to help out in person, and they are organizing the library community to band together to curate the data and ensure it’s preserved and accessible.
Our initiative builds on this work and aims to build a corpus of government data and corresponding metadata. We are focusing on public research data, especially data at risk of disappearing. The initiative was nicknamed “Svalbard” by Max Ogden of the Dat Project, after the Svalbard Global Seed Vault in the Arctic. As of today, our friends at Code for Science have released 38GB of metadata, covering over 30 million hashes and URLs of research data files.
To aid in this effort
We have assembled the following metadata as part of Code for Science’s Svalbard v1:
- 2.7 million SHA-256 hashes for all downloadable resources linked from Data.gov, representing around 40TB of data
- 29 million SHA-1 hashes of files archived by the Internet Archive and the Archive Team from federal websites and FTP servers, representing over 120TB of data
- All metadata from Data.gov, about 2.1 million datasets
- A list of ~750 .gov and .mil FTP servers
There are additional sources, such as Archivers.Space, EDGI, Climate Mirror, and Azimuth Data Backup, for which we are working on adding metadata in future releases.
Following the principles set forth by the librarians behind Data Refuge, we believe it’s important to establish a clear and trustworthy chain of custody for research datasets so that mirror copies can be trusted. With this project, we are working to curate metadata that includes strong cryptographic hashes of data files in addition to metadata that can be used to reproduce a download procedure from the originating host.
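As a rough illustration of how a published hash supports that chain of custody, here is a minimal Python sketch that hashes a local mirror copy and compares it with a published value. The function names and the idea of passing the published hash as a hex string are assumptions made for illustration; they are not part of the Svalbard release.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large datasets need not fit in memory."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read until an empty bytes object signals end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_published_hash(path, published_hex, algorithm="sha256"):
    """Return True when the local copy hashes to the published value.

    `published_hex` is assumed to be a hex string; case and surrounding
    whitespace are ignored.
    """
    return file_digest(path, algorithm) == published_hex.strip().lower()
```

The `algorithm` argument matters here because the two hash sets described above use different functions: pass `"sha256"` for a Data.gov resource and `"sha1"` for a file archived by the Internet Archive.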
We are hoping the community can use this data in the following ways:
- To independently verify that the mirroring processes that produced these hashes can be reproduced
- To aid in developing new forms of redundant dataset distribution (such as peer to peer networks)
- To seed additional web crawls or scraping efforts with additional dataset source URLs
- To encourage other archiving efforts to publish their metadata in an easily accessible format
- To cross reference data across archives, for deduplication or verification purposes
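The cross-referencing use case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each archive's metadata is assumed to have been loaded already as (hash, URL) pairs, and both listings are assumed to use the same hash algorithm. The record layout and function names are hypothetical, not the format of the Svalbard release.

```python
def index_by_hash(records):
    """Group URLs by hash. `records` is an iterable of (hash_hex, url) pairs."""
    index = {}
    for hash_hex, url in records:
        # Normalize case so "ABC..." and "abc..." count as the same hash.
        index.setdefault(hash_hex.lower(), []).append(url)
    return index


def cross_reference(archive_a, archive_b):
    """Map each hash found in both archives to the URLs holding it in each."""
    index_a = index_by_hash(archive_a)
    index_b = index_by_hash(archive_b)
    shared = sorted(set(index_a) & set(index_b))
    return {h: (index_a[h], index_b[h]) for h in shared}
```

A hash that appears in both listings points to byte-identical files, so one copy can vouch for the other or be dropped as a duplicate. Hashes produced by different algorithms (for example SHA-256 and SHA-1) cannot be compared directly, so one side would need to be rehashed first.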
What about the data?
The metadata is great, but the initial release of 30 million hashes and URLs is just part of our project. The actual content (from which the hashes were derived) has also been downloaded. It is stored either at the Internet Archive or on our California Digital Library servers.
The Dat Project carried out a Data.gov HTTP mirror (~40TB) and uploaded it to our servers at California Digital Library. We are working with them to access ~160TB of data in the future and have partnered with UC Riverside to offer longer term storage.
You can download the metadata here using Dat Desktop or the Dat CLI tool. We are using the Dat Protocol for distribution so that we can publish new metadata releases efficiently while still keeping the old versions around. Dat provides a secure cryptographic ledger, similar in concept to a blockchain, that can verify the integrity of updates.
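To give a feel for the general idea of a ledger that can verify updates, here is a toy Python sketch of a hash chain, in which each entry's hash covers the entry before it. This is not Dat's actual log format or API, and it makes no claim to match how Dat works internally; the names `GENESIS`, `entry_hash`, `append`, and `verify` are invented for this example.

```python
import hashlib

# Placeholder "previous hash" for the first entry (an assumption of this sketch).
GENESIS = "0" * 64


def entry_hash(previous_hash, payload):
    """Hash the payload together with the previous entry's hash."""
    return hashlib.sha256(previous_hash.encode("ascii") + payload).hexdigest()


def append(log, payload):
    """Add a new release (as bytes) to the end of the log."""
    previous = log[-1]["hash"] if log else GENESIS
    log.append({"payload": payload, "hash": entry_hash(previous, payload)})


def verify(log):
    """Recompute every hash in order; any edited or reordered entry fails."""
    previous = GENESIS
    for entry in log:
        if entry["hash"] != entry_hash(previous, entry["payload"]):
            return False
        previous = entry["hash"]
    return True
```

Because each hash depends on everything before it, old releases stay in the log alongside new ones, and changing an earlier release without detection would require recomputing every later hash.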
If you want to learn more about how CDL and the UC3 team are involved, contact us at email@example.com or @UC3CDL. If you have suggestions or questions, you can join the Code for Science Community Chat. And, if you are a technical user, you can report issues or get involved at the Svalbard GitHub.
This is crossposted here: https://medium.com/@maxogden/project-svalbard-a-metadata-vault-for-research-data-7088239177ab#.f933mmts8