Data Policies & Other Things

Last Friday I attended a seminar at UC Berkeley’s iSchool given by MacKenzie Smith, a terrific presenter and colleague who is affiliated with Creative Commons (among other prestigious organizations). MacKenzie was talking about data governance, an issue I covered a few months back for the DCXL blog. However, on Friday MacKenzie brought up a few things that I think warrant another post.

First, let’s define data governance for those who aren’t familiar with the concept. Based on Wikipedia’s entry, it’s the set of policies surrounding data, including data risk management, assignment of roles and responsibilities for data, and more generally the formal management of data assets throughout the research cycle. Now on to the new things:

Data policies are some combination of scary and confusing, similar to Thing from The Addams Family. From monstermoviemusic.blogspot.com

Thing 1: Facts cannot be copyrighted. It makes sense for things like, say, simple math. I can’t say “2+2=4” © 2011 Carly Strasser. Known facts can’t be copyrighted. So what about data? One might argue that data are facts (assuming you are doing science correctly). That means you don’t own the copyright to your data. Eeek! Scary thought, I know. You might be saved by the fact that a unique arrangement or collection of facts can be copyrighted. Huh. Data in a database? Can’t be copyrighted. The database itself? Can be copyrighted. This obviously makes things related to data quite messy when it comes to intellectual property.

Thing 2: Did you know that “attribution” can be legally imposed? The remedy for a lack of attribution where warranted is a lawsuit. Creative Commons licenses are built on this fact. This is not true, however, of citation. Citation is a “scholarly norm” that has no underlying legality.

Thing 3: Creative Commons is now working on a CC 4.0 license. Some of the goals of this new version are enabling internationalization and interoperability, and improving support for data, science, and education. They want input from scientists, librarians, administrators, and anyone else who might have an opinion about intellectual property, open science, and governance in general.

Thing 4: Open Knowledge Foundation is working on concepts related to governance with a global perspective. They have a range of projects in the works for improving the sharing of knowledge, data, and content.

Thing 5: While waiting for a consensus on how to properly govern digital data and other digital content, many data providers are dealing with governance by constructing data usage agreements. These are contracts created by lawyers for a specific data provider (e.g., an online database). The problem with data usage agreements is that they are all different. This means that if you want to use data from a source that requires you to agree to its terms, you have three options:

Carefully read the terms before agreeing (and who does that?)
Click that you agree without reading and hope you don’t accidentally break any rules
Find the data that you need from another source that doesn’t have terms and conditions for data usage.

Item three points to one of the serious downsides to data usage agreements: researchers may avoid using data if they don’t understand the terms of use. Furthermore, the terms only apply to the party that agreed to the contract (i.e., checked the box). If that party (potentially illegally) shares those data with someone else, that someone else is not bound by the terms.

Thing 6: What about international collaborations? As you might imagine, these add yet another layer of complication. As a scientist, you are expected to look into any data policies that may apply to your collaborators. From the NSF DMP FAQ (hello, alphabet soup!):

16. If I participate in a collaborative international research project, do I need to be concerned with data management policies established by institutions outside the United States?

Yes. There may be cases where data management plans are affected by formal data protocols established by large international research consortia or set forth in formal science and technology agreements signed by the United States Government and foreign counterparts. Be sure to discuss this issue with your sponsored projects office (or equivalent) and your international research partner when first planning your collaboration.

Hmm. It looks like the waters are very muddy right now, and until they clear, researchers should watch their step.

DataShare: A Plan to Increase Scientific Data Sharing

This post was co-authored by Dr. Michael Weiner, CIND director at UCSF

The DataShare project is a collaboration between University of California San Francisco’s Clinical and Translational Science Institute, the UCSF Library, and the UC Curation Center (UC3) at the California Digital Library. The goal of the DataShare project is to achieve widespread voluntary sharing of scientific data at the time of publication. This will be achieved by creating a data sharing website that could be used by all UCSF investigators, and ultimately by others in the UC system and at other institutions. Currently, data sharing is mostly done by large, well-funded multi-investigator projects. There would be great benefit if much more raw data were widely shared, especially data from individual investigators.

Imagine the possible scientific advances if we pooled our data the way that “We Are the World” pooled celebrity voices. From live.drjays.com

This project is the brainchild of Michael Weiner, M.D., the director of the Center for Imaging of Neurodegenerative Diseases. Weiner’s experience as the Principal Investigator of the Alzheimer’s Disease Neuroimaging Initiative (ADNI) led him to conclude that widespread data sharing can be achieved now, with great scientific and economic benefits. All ADNI raw data are immediately shared (at UCLA/LONI/ADNI) with all scientists in the world without embargo. The project is very successful: it has produced more than 300 publications, with many more submitted. This success demonstrates the feasibility and benefits of sharing data.

Individual initiatives:

The laboratory at the Center for Imaging of Neurodegenerative Diseases began to share data at the time of publication in 2011. This included both raw data and a description of how the raw data was processed and analyzed, leading to the findings in the publication. For the DataShare project, the following expansions to data sharing are planned:

ADNI scientists will be encouraged to share the raw data of their ADNI papers, and other papers from their laboratories
Other faculty in the Department of Radiology at UCSF and our collaborators in Neurology and Psychiatry at UCSF will be encouraged to share their raw data
Chancellor, Deans, and Department Chairs at UCSF will be urged to make more widespread voluntary sharing of scientific data a UCSF priority/policy; this may include providing storage space for shared data and/or development of policies which would reward data sharing in the hiring and promotion process
The example UCSF sets may then encourage the entire University of California system to implement similar changes
Other collaborators and colleagues in other universities around the world will then be encouraged to adopt similar policies
A “data sharing impact factor” will be developed and tested which will allow scientists to cite others’ data that they use and provide metrics for how others are using their data.

Institutional initiatives:

The project seeks to encourage involvement by the National Institutes of Health (NIH), the National Science Foundation (NSF), and the National Library of Medicine (NLM) to promote and facilitate sharing of scientific data. This will be accomplished via the following tasks:

Encourage NIH and NSF to emphasize and expand their existing policies concerning data sharing and notify the scientific community of this greater emphasis
Promote the establishment of a small group of committed individuals who can help formulate policy for NIH in this area, including a policy framework that favors open availability of scientific data.
Establish technical mechanisms for data sharing, such as a national system for storage of all raw scientific data (e.g., a national data repository or data bank). This repository may be created by NLM, or be housed at universities, foundations, or private companies (e.g., Dataverse).
Work to develop incentives for scientists and institutions to share their raw data. This may include:
1. Requesting reports in noncompetitive reviews, competitive reviews, and/or new applications
2. Instructing the reviewers to consider data sharing in assessing priority scores in grant reviews
3. Acknowledgment in publications
4. Providing affordable access to infrastructure (i.e., software and media) that facilitates data sharing
5. Encouraging NIH to provide funding for small grants aimed to promote and take advantage of shared data. Examples include projects that utilize data mining or cloud computing.

The potential gains from widespread sharing of raw scientific data greatly outweigh the relatively small costs involved in developing the necessary infrastructure. Sectors likely to benefit from increased accessibility of large amounts of raw data include the pharmaceutical and health care industries, chemistry, technology, and engineering. We also expect new technologies and new companies to emerge to take advantage of newly available data. Furthermore, widespread sharing of scientific data will bring substantial societal benefits, primarily through the ability to link data sets and repurpose data to make unforeseen discoveries.