
A forthcoming experiment in data publication

Posted in UC3

What we’re doing:

Like these dapper gentlemen, as small or as large as needed…
From The Public Domain Review.

Some time next year, the CDL will start an experiment in data publication. Our version of data publication will take the form of lightweight, non-peer-reviewed dataset descriptions. These publications are designed to be flexible in structure and size. At a minimum, each document must have six elements:

  • Title
  • Creator(s)
  • Publisher
  • Publication year
  • Identifier (e.g. DOI or ARK)
  • Citation to the dataset

This bare-bones document can expand to be richly descriptive, with optional items like subject keywords, version number, spatial or temporal range, collection methods, and as much description as the author cares to supply.
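As a rough illustration of the minimum-plus-optional structure described above, here is a small Python sketch. The field names, the helper functions, and the sample record (including its identifier) are all hypothetical placeholders chosen for this example; they are not taken from EZID or from any published schema.

```python
# Illustrative names for the six minimum elements listed in the post.
# These names are assumptions made for this sketch, not an official schema.
REQUIRED_FIELDS = (
    "title",
    "creators",
    "publisher",
    "publication_year",
    "identifier",
    "dataset_citation",
)

# Optional elements mentioned in the post; the names are again hypothetical.
OPTIONAL_FIELDS = (
    "keywords",
    "version",
    "spatial_range",
    "temporal_range",
    "collection_methods",
    "description",
)


def missing_required(record):
    """Return the required fields that are absent or empty in a record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


def optional_present(record):
    """Return the optional fields that a record actually fills in."""
    return [name for name in OPTIONAL_FIELDS if record.get(name)]


# A made-up bare-bones record; every value is a placeholder.
minimal = {
    "title": "Example dataset",
    "creators": ["A. Researcher"],
    "publisher": "Example Repository",
    "publication_year": 2013,
    "identifier": "doi:10.1234/example",
    "dataset_citation": (
        "Researcher, A. (2013). Example dataset. "
        "Example Repository. doi:10.1234/example"
    ),
}

# The same record expanded with two optional items.
richer = dict(minimal, keywords=["hydrology"], version="1.1")

print(missing_required(minimal))
print(optional_present(richer))
print(missing_required({"title": "Untitled"}))
```

A record with only the six elements passes the check, and anything beyond that is simply extra description that the document can grow to include.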

Why we’re doing it:

The recently released draft FORCE11 Declaration of Data Citation Principles expresses general agreement that datasets should be treated like “first class” research objects in how they are discovered, cited, and recognized. That goal is still far from reality: datasets are largely invisible to search engines, and authors rarely cite them formally.

A solution being implemented by a number of journals (e.g. Nature Scientific Data and Geoscience Data Journal) is to publish proxy objects for discovery and citation called “data descriptors” or, more commonly, “data papers”. Data papers are formal scholarly publications that describe a dataset’s rationale and collection methods, but don’t analyze the data or draw any conclusions. Peer reviewers ensure that the paper contains all the information needed to use, re-use, or replicate the dataset.

The strength of the data paper approach (creators must write up rich and useful metadata to pass peer review) leads directly to its weakness: a data paper often takes more time and energy to produce than dataset creators are willing to invest. In a 2011 survey, researchers said that the biggest impediment to publishing data is lack of time. For researchers who manage to publish datasets but lack time to write and submit (and revise and resubmit) a data paper, we will provide some of the benefits of a data paper at none of the cost.

How we’re doing it:

We will publish these documents through EZID (easy-eye-dee), an identifier service that has supplied DataCite DOIs to over 167,000 datasets. Every one of these dataset metadata records has at least the five elements required by the DataCite metadata schema; more than 2,000 already have abstracts, and another 2,000 have other kinds of descriptive metadata. EZID will begin using dataset metadata to automatically generate publications that can be viewed as HTML in a web browser or as a dynamically generated PDF. The documents will be hosted by EZID in a format optimized for indexing by search engines like Google and Google Scholar.

To get a publication, dataset creators won’t have to do anything beyond what they already do to get a DOI. If the creator only fills in the required metadata, the document will function as a cover sheet or landing page. If they submit an abstract and methods, the document expands to begin to look like a traditional journal article (while retaining the linking functionality of a landing page). It will capture as much effort as the researcher puts forth, whether that’s a lot or very little.
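To make the “as small or as large as needed” idea concrete, here is a minimal Python sketch of a document that grows with its metadata. It is not EZID’s implementation: the `render_description` function, the field names, and the sample record are all assumptions invented for this illustration.

```python
from html import escape


def render_description(record):
    """Build a simple HTML page from a metadata record (hypothetical field names)."""
    creators = "; ".join(record["creators"])
    parts = [
        f"<h1>{escape(record['title'])}</h1>",
        f"<p>{escape(creators)} ({escape(str(record['publication_year']))}). "
        f"{escape(record['publisher'])}.</p>",
        f"<p>Identifier: {escape(record['identifier'])}</p>",
        f"<p>Cite the dataset as: {escape(record['dataset_citation'])}</p>",
    ]
    # Optional sections appear only when the creator supplied them, so a
    # bare record yields a landing page and a fuller one reads like an article.
    for key, heading in (("abstract", "Abstract"), ("methods", "Methods")):
        text = record.get(key)
        if text:
            parts.append(f"<h2>{heading}</h2>")
            parts.append(f"<p>{escape(text)}</p>")
    return "\n".join(parts)


# A made-up record; every value, including the identifier, is a placeholder.
minimal = {
    "title": "Example dataset",
    "creators": ["A. Researcher", "B. Collaborator"],
    "publisher": "Example Repository",
    "publication_year": 2013,
    "identifier": "doi:10.1234/example",
    "dataset_citation": "Researcher, A. (2013). Example dataset. doi:10.1234/example",
}

# The same record with an abstract added.
with_abstract = dict(minimal, abstract="Rainfall & runoff measurements.")

print(render_description(minimal))
print(render_description(with_abstract))
```

The required metadata always produces the same landing-page core, and each optional field the creator fills in adds one more section without any extra steps on their part.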

Do you have thoughts or comments on our idea? We would love to hear from you! Comment on this blog post or email us at
