
Finding a Home For Your Data

Where do I put my data? This question comes up often when I talk with researchers about long-term data archiving. There are many barriers to sharing and archiving data, and not knowing where to archive the data is certainly among them.

First and foremost, choose early. By choosing a data repository early in the life of your research, you guarantee that there will be no surprises when it comes time to deposit your dataset. If there are strict metadata requirements for the repository you choose, it is immensely beneficial to know those requirements before you collect your first data point. You can design your data collection materials (e.g., spreadsheets, data entry forms, or scripts for automating computational tasks) to ensure seamless metadata creation.
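As a rough illustration of the "scripts" idea above, here is a minimal Python sketch that checks a data record for missing metadata before you collect any further. The field names in REQUIRED_FIELDS and the sample record are hypothetical; a real repository publishes its own list of requirements, so treat this as a starting point rather than any repository's actual rules.

```python
# Hypothetical list of required metadata fields (an assumption for this
# sketch); substitute the requirements of the repository you choose.
REQUIRED_FIELDS = ["title", "creator", "collection_date", "location", "units"]


def missing_metadata(record):
    """Return the required fields that are absent or blank in one record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        # Treat a missing key, None, or whitespace-only text as missing.
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Illustrative record, as it might be read from one spreadsheet row.
record = {
    "title": "Unicorn sightings",
    "creator": "J. Doe",
    "collection_date": "2012-03-01",
    "location": "",
}

print(missing_metadata(record))  # prints ['location', 'units']
```

Running a check like this each time you enter data means that gaps in your metadata show up while you can still fill them in, not at deposit time.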

Second, choose carefully. Data repositories (also known as data centers or archives) are plentiful. Choosing the correct repository for your specific dataset is important: you want the right people to be able to find your data in the future. If you research how climate change affects unicorns, you don’t want to store your dataset at the Polar Data Centre (unless you study arctic unicorns). Researchers hunting for unicorn data are more likely to check the Fantasy Creatures Archive, for instance.

So how do you choose your perfect data repository? There is no match.com or eHarmony for matching data with its repository, but a close second is Databib. This is a great resource born from a partnership between Purdue and Penn State Libraries, funded by the Institute of Museum and Library Services. The site lets you search by keyword through its giant list of data repositories. For instance, “unicorns” surprisingly brings up no results, but “marine” brings up seven repositories that house marine data. Don’t see your favorite data repository? The site welcomes feedback (including suggestions for additional repositories not in its database).

Another good way to discover your perfect repository: think about where you go to find datasets for your research. Try asking around too – where do your colleagues look for data? Where do they archive it? Look for mentions of data centers in publications that you read, and perhaps do a little web hunting.

A final consideration: some institutions provide repositories for their researchers. Often you can store your data in these repositories while you are still working on it (password protected, of course), using the data center as a kind of backup system. These institutional repositories have benefits like more personal IT service and cheaper (or free) storage rates, but they might make your data more difficult to find for those in your research field. Examples of institutional repositories are DSpace@MIT, DataSpace at Princeton, and the Merritt Repository right here at CDL.

The DCXL add-in will initially connect with CDL’s Merritt Repository. The datasets submitted to Merritt via DCXL will be integrated into the DataONE network to ensure that data deposited via DCXL is accessible and discoverable. Long live Excel data!

Not familiar with the song “Homeward Bound”? Check out this amazing performance by Simon and Garfunkel.

Trailblazers in Demography

Last week I had the great pleasure of visiting Rostock, Germany. If your geography lessons were a long time ago, you are probably wondering “where’s Rostock?” I sure did… Rostock is located very close to the Baltic Sea, in northeast Germany. It’s a lovely little town with bumpy streets, lots of sausage, and great public transportation. I was there, however, to visit the prestigious Max Planck Institute for Demographic Research (MPIDR).

Demography is the study of populations, especially their birth rates, death rates, and growth rates. For humans, this data might be used for, say, calculating premiums for life insurance. For other organisms, these types of data are useful for studying population declines, increases, and changes. Such areas of study are especially important for endangered populations, invasive species, and commercially important plants and animals.

baby rhino — Sharing demography data saves adorable endangered species. From Flickr by haiwan42

I was invited to MPIDR because a group of scientists there is interested in creating a repository for non-human demography data. Luckily, they aren’t starting from scratch. They have a few existing collections of disparate datasets, some more refined and public-facing than others; their vision is to merge these datasets and create a useful, integrated database chock full of demographic data. Although the group has significant challenges ahead (metadata standards, security, data governance policies, long-term sustainability), their enthusiasm for the project will go a long way towards making it a reality.

I am blogging about this meeting because, for me, the group’s goals represent something much bigger than a demography database. In the past two years, I have been exposed to a remarkable range of attitudes towards data sharing (check out blog posts about it here, here, here, and here). Many of the scientists with whom I spoke needed convincing to share their datasets. But even in the short time that I have been involved in issues surrounding data, I have seen a shift towards the other end of the range. The Rostock group is one great example of scientists who are getting it.

More and more scientists are joining the open data movement, and a few of them are even working to convert others to believe in the cause. This group that met in Rostock could put their heads down, continue to work on their separate projects, and perhaps share data occasionally with a select few vetted colleagues that they trust and know well. But they are choosing instead to venture into the wilderness of scientific data sharing. Let them be an inspiration to data hoarders everywhere.

It is our intention that the DCXL project will result in an add-in and web application that will facilitate all of the good things the Rostock group is trying to promote in the demography community. Demographers use Microsoft Excel, in combination with Microsoft Access, to organize and manage their large datasets. Perhaps in the future our open-source add-in and web application will be linked up with the demography database; open source software, open data, and open minds make this possible.

DataShare: A Plan to Increase Scientific Data Sharing

This post was co-authored by Dr. Michael Weiner, CIND director at UCSF

The DataShare project is a collaboration between the University of California San Francisco’s Clinical and Translational Science Institute, the UCSF Library, and the UC Curation Center (UC3) at the California Digital Library. The goal of the DataShare project is to achieve widespread voluntary sharing of scientific data at the time of publication. This will be achieved by creating a data sharing website that could be used by all UCSF investigators, and ultimately by others in the UC system and at other institutions. Currently, data sharing is mostly done by large, well-funded multi-investigator projects. There would be great benefit if much more raw data were widely shared, especially data from individual investigators.

we are the world — Imagine the possible scientific advances if we pooled our data the way that “We are the world” pooled celebrity voices. From live.drjays.com

This project is the brainchild of Michael Weiner, M.D., director of the Center for Imaging of Neurodegenerative Diseases. Weiner’s experience as the Principal Investigator of the Alzheimer’s Disease Neuroimaging Initiative (ADNI) led him to conclude that widespread data sharing can be achieved now, with great scientific and economic benefits. All ADNI raw data is immediately shared (at UCLA/LONI/ADNI) with all scientists in the world without embargo. The project has been very successful: it has produced more than 300 publications, with many more submitted. This success demonstrates the feasibility and benefits of sharing data.

Individual initiatives:

The laboratory at the Center for Imaging of Neurodegenerative Diseases began to share data at the time of publication in 2011. This included both raw data and a description of how the raw data was processed and analyzed, leading to the findings in the publication. For the DataShare project, the following expansions to data sharing are planned:

ADNI scientists will be encouraged to share the raw data of their ADNI papers, and other papers from their laboratories
Other faculty in the Department of Radiology at UCSF and our collaborators in Neurology and Psychiatry at UCSF will be encouraged to share their raw data
The Chancellor, Deans, and Department Chairs at UCSF will be urged to make more widespread voluntary sharing of scientific data a UCSF priority or policy; this may include providing storage space for shared data, developing policies that reward data sharing in the hiring and promotion process, or both
The example UCSF sets may then encourage the entire University of California system to implement similar changes
Other collaborators and colleagues in other universities around the world will then be encouraged to adopt similar policies
A “data sharing impact factor” will be developed and tested which will allow scientists to cite others’ data that they use and provide metrics for how others are using their data.

Institutional initiatives:

The project seeks to encourage involvement by the National Institutes of Health (NIH), the National Science Foundation (NSF), and the National Library of Medicine (NLM), to promote and facilitate sharing of scientific data. This will be accomplished via five tasks:

Encourage NIH and NSF to emphasize and expand their existing policies concerning data sharing and notify the scientific community of this greater emphasis
Promote the establishment of a small group of committed individuals who can help formulate policy for NIH in this area, including a policy framework that favors open availability of scientific data.
Establish technical mechanisms for data sharing, such as a national system for storage of all raw scientific data (e.g., a national data repository or data bank). This repository may be created by NLM, or be housed at universities, foundations, or private companies (e.g., Dataverse).
Work to develop incentives for scientists and institutions to share their raw data. This may include:
1. Requesting reports in noncompetitive reviews, competitive reviews, and/or new applications
2. Instructing the reviewers to consider data sharing in assessing priority scores in grant reviews
3. Acknowledgment in publications
4. Providing affordable access to infrastructure, i.e. software and media, which facilitates data sharing
5. Encouraging NIH to provide funding for small grants aimed at promoting and taking advantage of shared data. Examples include projects that utilize data mining or cloud computing.

The potential gains from widespread sharing of raw scientific data greatly outweigh the relatively small costs involved in developing the necessary infrastructure. Industries likely to benefit from increased accessibility of large amounts of raw data include the pharmaceutical and health care industries, as well as chemistry, technology, and engineering. We also expect new technologies and new companies to develop to take advantage of newly available data. Furthermore, widespread sharing of scientific data will bring substantial societal benefits, primarily through the ability to link datasets and repurpose data to make unforeseen discoveries.