Library Carpentry Receives Supplemental IMLS Grant

In November 2017, the California Digital Library (CDL) announced a two-year Institute of Museum and Library Services (IMLS) grant-funded project to further advance the scope, adoption, and impact of Library Carpentry across the US. The grant enables CDL and the University of California Curation Center (UC3) to host Chris Erdmann, the Library Carpentry Community and Development Director, and to support his work with community members, Carpentries staff, and governance groups to integrate Library Carpentry as a lesson program, further develop the curriculum and lessons, grow the community of Carpentries instructors with library and information backgrounds, and continue outreach to raise awareness about Library Carpentry and The Carpentries in the broader library community.

To support the ongoing work of Library Carpentry in bringing data and software training to people in library- and information-related roles, we are happy to report that IMLS has awarded CDL supplemental funding. This supplemental funding will provide continued support for workshops and instructor training, as well as create a membership scholarship program to reach new library communities and consortiums. The funding will also provide continued support for Library Carpentry’s current goals to expand the pool of Carpentries trainers and instructors from library- and information-related roles and to complete and formalise curriculum and lessons currently being developed by community members. The CDL, The Carpentries, and the Library Carpentry Advisory Group are currently planning outreach to various library networks to see how we can work together towards providing data and software training to their communities. Members of these groups will be reaching out in the coming months. Also, this month (September 2019), The Carpentries will launch a new workshop request form that accommodates library-driven and library-related workshops.

About CDL

CDL was founded by the University of California in 1997 to take advantage of emerging technologies that were transforming the way digital information was being published and accessed. Since then, in collaboration with the ten UC campus libraries and other partners, CDL has assembled one of the world’s largest digital research libraries and changed the ways that faculty, students, and researchers discover and access information. We facilitate the licensing of online materials and develop shared services used throughout the UC system. Building on the foundations of the Melvyl Catalog, CDL has developed one of the largest online library catalogs in the United States and works in partnership with the UC campuses to bring the treasures of California’s libraries, museums, and cultural heritage organizations to the world. We continue to explore how services such as digital curation, scholarly publishing, archiving and preservation support research throughout the information lifecycle.

About The Carpentries

The Carpentries builds global capacity in essential data and computational skills for conducting efficient, open, and reproducible research. We train and foster an active, inclusive, diverse community of learners and instructors that promotes and models the importance of software and data in research. We collaboratively develop openly-available lessons and deliver these lessons using evidence-based teaching practices. We focus on people conducting and supporting research.

Cross-posted on The Carpentries blog

The Datamirror.org Experiment: Preservation Assurance for Federal Research Data

In early 2017, UC3 created Datamirror.org as an independent, dynamic, online mirror of Data.gov, the US federal government’s primary research data portal.  Developed in collaboration with Code for Science & Society (CSS), a non-profit organization supporting innovative uses of technology for public good, Datamirror was intended to provide additional levels of assurance that the significant research data found at Data.gov remains freely accessible to the scholarly community and the public for open retrieval and reuse.  As noted by the government’s Project Open Data initiative, “Data is a valuable national resource and a strategic asset to the U.S. Government, its partners, and the Public.”  Thus, Datamirror.org plays a critical role in protecting this valuable resource from risks of data loss or loss of availability due to technological obsolescence, funding constraints, shifting organizational priorities, malicious attack, or inadvertent error.  

History

Datamirror.org can be seen as one activity within the larger “data rescue” or “refuge” movement that has arisen spontaneously in recent years in recognition of the central role data plays in so many aspects of commerce, culture, science, and education.  These activities rely on a broad informal coalition of scholars, librarians, public interest groups, and citizen-scientists who have participated in numerous rescue events to collect, catalog, and provide open access to federal research data.  While these efforts are significant, they are mostly targeted at narrowly-focused data sources, which means that critical scale is reached only through the separate actions of many independent actors, with the attendant risk of needless duplication of effort.

Datamirror.org takes an alternative – what could be called “wholesale” – approach of automated agency-spanning collection from the central point of the already-existing aggregation by Data.gov.  While a mirror of Data.gov, Datamirror.org differs from Data.gov in one important respect:

The reason for this is to provide greater confidence that datasets discovered via Data.gov remain retrievable while not distracting researchers from the original sources of information.  Our approach focuses attention on the Data.gov aggregation. A researcher could use Datamirror.org at any time, but would only need to do so if a dataset is no longer retrievable through its catalog entry on Data.gov.

How Does It Work?

To date, Datamirror.org holds over 152,000 datasets totaling 42 TB originating from 188 organizational units spread across more than 50 federal agencies and laboratories.  (While Data.gov is focused on federal research data, it also catalogs datasets from state, county, local, city, regional, and tribal governments, as well as commercial, non-profit, and educational sources.  To avoid any potential intellectual property rights issues, Datamirror.org captures only the federal subset of the full Data.gov corpus.) Datamirror.org scans the Data.gov portal every 4 hours to identify new or modified metadata or data links; if found, Datamirror.org is automatically updated with new metadata and/or new copies of the data files.  In most cases, Datamirror.org is identical to Data.gov, with official metadata from the agency and links to the official copy at the agency, but with the addition of links to the local preservation copies available on Datamirror.org servers.
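The scan-and-update cycle described above can be sketched as a comparison between two snapshots of the catalog. This is only a minimal illustration under stated assumptions, not the code Datamirror.org actually runs: the names `fingerprint` and `find_changes` are hypothetical, each snapshot is assumed to be a dictionary keyed by dataset id, and each record is assumed to be a CKAN-style dictionary with a `metadata_modified` timestamp and a `resources` list whose entries carry a `url`.

```python
from typing import Any, Dict, List, Tuple

# Assumption: a catalog record is a CKAN-style dictionary.
Dataset = Dict[str, Any]

# The post states that the portal is scanned every 4 hours.
SCAN_INTERVAL_SECONDS = 4 * 60 * 60


def fingerprint(dataset: Dataset) -> Tuple[str, Tuple[str, ...]]:
    """Summarize the parts of a record a mirror would watch (hypothetical helper).

    Combines the metadata timestamp with the sorted list of data links, so a
    change to either one produces a different fingerprint.
    """
    resources = dataset.get("resources", [])
    urls = tuple(sorted(r.get("url", "") for r in resources))
    return (str(dataset.get("metadata_modified", "")), urls)


def find_changes(
    previous: Dict[str, Dataset], current: Dict[str, Dataset]
) -> Tuple[List[str], List[str]]:
    """Return (new_ids, modified_ids) between two catalog snapshots."""
    new_ids = sorted(i for i in current if i not in previous)
    modified_ids = sorted(
        i
        for i in current
        if i in previous and fingerprint(current[i]) != fingerprint(previous[i])
    )
    return new_ids, modified_ids


if __name__ == "__main__":
    # Illustrative records only; the ids, dates, and URLs are made up.
    before = {
        "a": {"metadata_modified": "2019-01-01", "resources": [{"url": "http://example.org/a.csv"}]},
        "b": {"metadata_modified": "2019-01-01", "resources": [{"url": "http://example.org/b.csv"}]},
    }
    after = {
        "a": {"metadata_modified": "2019-01-01", "resources": [{"url": "http://example.org/a.csv"}]},
        "b": {"metadata_modified": "2019-01-01", "resources": [{"url": "http://example.org/b-v2.csv"}]},
        "c": {"metadata_modified": "2019-02-01", "resources": []},
    }
    new_ids, modified_ids = find_changes(before, after)
    print("new:", new_ids)
    print("modified:", modified_ids)
```

In a real mirror, the ids returned by such a comparison would drive the refresh of metadata and the download of new file copies; datasets that disappear from the current snapshot are deliberately not reported here, since the mirror's purpose is to keep copies that the source no longer serves.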

The project was outlined as a recommended path forward for the data rescue efforts at the Libraries+ Network workshop.  It was also highlighted in Against the Grain as a successful project working to preserve federal research data.  The software stack for Datamirror.org, like that of Data.gov, uses the open source CKAN data management system.  UC3 recently participated in the CKANconUS conference, giving a summary of the Datamirror.org project.  Datamirror.org was developed and is operated with the cooperation of Data.gov staff at the General Services Administration (GSA), Technology Transformation Service.

Lessons Learned

The creation of Datamirror.org was made possible by the serendipitous availability to UC3 of spare storage capacity due to the successful conclusion of an unrelated project.  That storage has now reached the end of its service life and, unfortunately, reprovisioning the necessary 42+ TB of capacity is financially prohibitive. UC3 is exploring options to identify alternative sources of funding or organizations prepared to take on hosting responsibility for the Datamirror.org corpus.

While the UC3 Datamirror.org experiment is coming to a close, the lessons learned from the exercise remain valid and pertinent to future related initiatives.  Most importantly, it has validated the wholesale automated approach to data collection. Data creators can and should continue to contribute their data to central aggregation sites like Data.gov, where the data will have the most widespread visibility for high-level discovery. Digital preservationists can then step in effectively to build added-value systems like Datamirror.org that offer increased preservation assurance through additional backup copies for use in the event that the primary copies ever become inaccessible.  This is the best way for all of us to quickly collect the broadest set of federal research data with the least amount of duplicative work as well as the least amount of human effort and error.