Your Time is Gonna Come
You know what they say: Timing is everything. Time enters into the data management and stewardship equation at several points and warrants discussion here. Why timeliness? Last week at the University of North Texas’ Open Access Symposium, several great speakers touched on the timeliness of data management, organization, and sharing. It led me to wonder whether there is agreement about the timing of data-related activities, so here I’ve posted my opinions about time at a few points in the data life cycle. Feel free to comment on this post with your own opinions.
1. When should you start thinking about data management? The best answer to this question is as soon as possible. The sooner you plan, the less likely you are to be surprised by issues like metadata standards or funder requirements (see my previous DCXL post about things you will wish you had thought about documenting). The NSF mandate for data management plans is a great motivator for thinking sooner rather than later, but let’s face facts: the DMP requirement is only two pages, and you can create one that might pass muster without really thinking too carefully about your data. I encourage everyone to go well beyond funder requirements and thoughtfully plan out your approach to data stewardship. Spend plenty of time doing this, and return to your plan often during your project to update it.
2. When should you start archiving your data? By archiving, I do not mean backing up your data (the answer to that one is “constantly”). I am referring to the action of putting your data into a repository for long-term (20+ years) storage. This is a more complicated question of timeliness. Issues that should be considered include:
- Is your data collection ongoing? Continuously updated sensor or instrument data should begin being archived as soon as collection begins.
- Is your dataset likely to go through many versions? You might wait to begin archiving until you get close to your final version.
- Are others likely to want access to your data soon? Especially colleagues or co-authors? If the answer is yes, begin archiving early so that you are all using the same datasets for analysis.
3. When should you make your data publicly accessible? My favorite answer to this question is also as soon as possible. But this might mean different things for different scientists. For instance, making your data available in near-real time, either on a website or in a repository that supports versioning, allows others to use it, comment on it, and collaborate with you while you are still working on the project. This approach has its benefits, but also tends to scare off some scientists who are worried about being scooped. So if you aren’t an open data kind of person, you should make your data publicly available at the time of publication. Some journals are already requiring this, and more are likely to follow.
Some would still balk at making data available at publication: What if I want to publish more papers with this dataset in the future? In that case, have an honest conversation with yourself. What do you mean by “future”? Are you really likely to follow through on those future projects that might use the dataset? If the answer is no, you should make the data available to enhance your chances for collaboration. If the answer is yes, give yourself a little bit of temporal padding, but not too much. Think about enforcing a deadline of two years, at which point you make the data available whether you have finished those dream projects or not. Alternatively, find out if your favorite data repository will enforce your deadline for you: you may be able to give them a release date on which your data become available, whether or not they hear from you first.
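The two-year deadline suggested above can be made concrete with a small date calculation. What follows is a minimal Python sketch, not any repository’s actual mechanism: the function names `release_date` and `is_public`, the default two-year embargo, and the example dates are all illustrative assumptions.

```python
from datetime import date


def release_date(published: date, embargo_years: int = 2) -> date:
    """Return the date on which the data should become public.

    The two-year default mirrors the deadline suggested in the post;
    it is an assumption, not a repository rule.
    """
    target_year = published.year + embargo_years
    try:
        return published.replace(year=target_year)
    except ValueError:
        # Published on Feb 29 and the target year is not a leap year:
        # fall back to Feb 28 of the target year.
        return published.replace(year=target_year, day=28)


def is_public(published: date, today: date, embargo_years: int = 2) -> bool:
    """True once the embargo has lapsed, finished dream projects or not."""
    return today >= release_date(published, embargo_years)


if __name__ == "__main__":
    paper_out = date(2012, 5, 30)  # hypothetical publication date
    print("Release on:", release_date(paper_out))
    print("Public today?", is_public(paper_out, date.today()))
```

Writing the release date down when you deposit the data, rather than deciding later, is what keeps the deadline from slipping.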
DataShare: A Plan to Increase Scientific Data Sharing
This post was co-authored by Dr. Michael Weiner, CIND director at UCSF
The DataShare project is a collaboration between University of California San Francisco’s Clinical and Translational Science Institute, the UCSF Library, and the UC Curation Center (UC3) at the California Digital Library. The goal of the DataShare project is to achieve widespread voluntary sharing of scientific data at the time of publication. This will be achieved by creating a data sharing website that can be used by all UCSF investigators, and ultimately by others in the UC system and at other institutions. Currently, data sharing is mostly done by large, well-funded, multi-investigator projects. There would be great benefit if much more raw data were widely shared, especially data from individual investigators.

This project is the brainchild of Michael Weiner, M.D., director of the Center for Imaging of Neurodegenerative Diseases. Weiner’s experience as the Principal Investigator of the Alzheimer’s Disease Neuroimaging Initiative (ADNI) led him to conclude that widespread data sharing can be achieved now, with great scientific and economic benefits. All ADNI raw data is immediately shared (at UCLA/LONI/ADNI) with all scientists in the world without embargo. The project has been very successful, with more than 300 publications and many more submitted. This success demonstrates the feasibility and benefits of sharing data.
Individual initiatives:
The laboratory at the Center for Imaging of Neurodegenerative Diseases began to share data at the time of publication in 2011. This included both raw data and a description of how the raw data was processed and analyzed, leading to the findings in the publication. For the DataShare project, the following expansions to data sharing are planned:
- ADNI scientists will be encouraged to share the raw data of their ADNI papers, and other papers from their laboratories
- Other faculty in the Department of Radiology at UCSF and our collaborators in Neurology and Psychiatry at UCSF will be encouraged to share their raw data
- The Chancellor, Deans, and Department Chairs at UCSF will be urged to make more widespread voluntary sharing of scientific data a UCSF priority/policy; this may include providing storage space for shared data and/or developing policies that reward data sharing in the hiring and promotion process
- The example UCSF sets may then encourage the entire University of California system to implement similar changes
- Other collaborators and colleagues in other universities around the world will then be encouraged to adopt similar policies
- A “data sharing impact factor” will be developed and tested, which will allow scientists to cite others’ data that they use and provide metrics for how others are using their data
Institutional initiatives:
The project seeks to encourage involvement by the National Institutes of Health (NIH), the National Science Foundation (NSF), and the National Library of Medicine (NLM), to promote and facilitate sharing of scientific data. This will be accomplished via five tasks:
- Encourage NIH and NSF to emphasize and expand their existing policies concerning data sharing and notify the scientific community of this greater emphasis
- Promote the establishment of a small group of committed individuals who can help formulate policy for NIH in this area, including a policy framework that favors open availability of scientific data.
- Establish technical mechanisms for data sharing, such as a national system for storage of all raw scientific data (e.g., a national data repository or data bank). This repository may be created by NLM, or be housed at universities, foundations, or private companies (e.g., Dataverse).
- Work to develop incentives for scientists and institutions to share their raw data. This may include:
  - Requesting reports in noncompetitive reviews, competitive reviews, and/or new applications
  - Instructing the reviewers to consider data sharing in assessing priority scores in grant reviews
  - Acknowledgment in publications
  - Providing affordable access to infrastructure, i.e., software and media, which facilitates data sharing
- Encourage NIH to provide funding for small grants that promote and take advantage of shared data. Examples include projects that utilize data mining or cloud computing.
The potential gains from widespread sharing of raw scientific data greatly outweigh the relatively small costs involved in developing the necessary infrastructure. Industries likely to benefit from increased accessibility of large amounts of raw data include the pharmaceutical and health care industries, as well as chemistry, technology, and engineering. We also expect new technologies and new companies to develop to take advantage of newly available data. Furthermore, there will be substantial societal benefits from widespread sharing of scientific data, primarily due to the ability to link data sets and repurpose data to make unforeseen discoveries.