Thoughts on Data Publication

CDL UC3, January 24, 2013


If you read last week’s post on the IDCC meeting in Amsterdam, you may know that today’s post was inspired by a post-conference workshop on Data Publication, sponsored by the PREPARDE group. The workshop was “Data publishing, peer review and repository accreditation: everyone a winner?” (to access the workshop agenda, goals, and slides, go to the conference workshop website and scroll down to Workshop 6).

The workshop covered all things data publication and sparked lively discussion among those in attendance. Check out the workshop’s Twitter backchannel via this Storify by Sarah Callaghan of STFC. My previous blog post about data publication sums up the idea like this:

The concept of data publication is rather simple in theory: rather than relying on journal articles alone for scholarly communication, let’s publish data sets as “first class citizens”. Data sets have inherent value that makes them standalone scholarly objects: they are more likely to be discovered by researchers in other domains and working on other questions if they are not associated with a specific journal and all of the baggage that entails.

Stealing shamelessly from Sarah’s presentation, I’m providing a brief overview of issues surrounding data publication for those not well-versed:

First, the benefits of data publication:

- Allows credit to data producers and curators (via data citation and emerging altmetrics)
- Encourages reuse of datasets and discourages duplication of effort
- Encourages proper curation and management of data (you don’t want to share messy data, right?)
- Ensures completeness of the scientific record, as well as transparency and reproducibility of research (fundamental tenets of the scientific method!)
- Improves discoverability of datasets (they will never be discovered on that old hard drive in your desk drawer)

We had an internal meeting here at CDL yesterday about data publication. After running through this list of benefits for those in attendance, one of my colleagues asked the question: “Does listing these benefits work? Do researchers want to publish their data?” I didn’t hesitate to answer “No”.

Why not? The biggest reason is a lack of time. Preparing data for sharing and publication is laborious, and overstretched researchers aren’t motivated by these benefits given the current incentive structures in research (papers, papers, papers, and citations of those papers). I think this will change in the very near future, though; check out my post on data sharing mandates in the works. So let’s assume that researchers do want to publish their data. How do they go about it?

Methods for “publishing” data:

- A personal or lab webpage. This is a common choice for researchers who wish to share data, since they can maintain control of the datasets. However, data siloed on individual websites suffer from problems with stability, persistence, and discoverability. Plus, website maintenance often falls to the bottom of a researcher’s to-do list.
- A disciplinary repository. This is a common solution for only a select few data types (e.g., genetic data). Most disciplines are still awaiting a culture change that will motivate researchers to share their data in this way.
- An institutional repository. Of course, researchers have to know that this is an option (most don’t), and must then properly prepare their data for deposit.
- Supplementary materials. In this case, the data accompany a primary journal article as supporting information. I recently shared data this way, but recognized that the data should also be placed in a curated repository. There are a few reasons for this apparent duplication:
  - Supplemental materials are sometimes unavailable many years after publication because of broken links.
  - Journals are not particularly excited about archiving supplementary data, especially in large volumes. This is not their area of expertise, after all.
- Data article. This is a new-ish option: basically, you publish your data in a proper data journal (see this semi-complete list of data journals on the PREPARDE blog).

Wondering what a “data article” is? Let’s look to Sarah again:

A data article describes a dataset, giving details of its collection, processing, software, file formats, et cetera, without the requirement of novel analyses or ground-breaking conclusions.

That is, it’s a standalone product of research that can be cited as such. There is much debate surrounding such data articles. Among the issues are:

- Is it really “publication”? How is this different from a landing page for the dataset that’s stored in a repository?
- Traditional academic use of “publication” implies peer review. How do you review datasets?
- How should publication differ depending on the discipline?

There are no easy answers to these questions, but I love hearing the debate. I’m optimistic that the data publication postdoc we are hiring will have some great ideas to contribute. Stay tuned!

Related Data Pub blog posts:
- Data Publication: An Introduction
- NSF Recognizes Data as an Academic Product
- Data Publication – The First 500 Years, by Lisa Schiff
- Data Publication and the Coproduction of Quality, by Eric Kansa
- We’re hiring a data publication postdoc