Resources, and Versions, and Identifiers! Oh, my!

The only constant is change.  —Heraclitus

Data publication, management, and citation would all be so much easier if data never changed, or at least, if it never changed after publication. But as the Greeks observed so long ago, change is here to stay. We must accept that data will change, and given that fact, we are probably better off embracing change rather than avoiding it. Because the very essence of data citation is identifying what was referenced at the time it was referenced, we need to be able to put a name on that referenced quantity, which leads to the requirement of assigning named versions to data. With versions we are providing the x that enables somebody to say, “I used version x of dataset y.”

Since versions are ultimately names, the problem of defining versions is inextricably bound up with the general problem of identification. Key questions that must be asked when addressing data versioning and identification include:

So far we have only raised questions, and that’s the nature of dealing with versions: the answers tend to be very situation-specific. Fortunately, some broad guidelines have emerged:

These guidelines still leave unanswered the question of how to actually assign identifiers to versions. One approach is to assign a different, unrelated identifier to each version. For example, doi:10.1234/FOO might refer to version 1 of a resource and doi:10.5678/BAR to version 2. Linkages, stored in the resource versions themselves or externally in a database, can record the relationships between these identifiers. This approach may be appropriate in many cases, but it should be recognized that it places a burden on both the resource maintainer (every link that must be maintained represents a breakage point) and the user (there is no easily visible or otherwise obvious relationship between the identifiers). Another approach is to syntactically encode version information in the identifiers. With this approach, we might start with doi:10.1234/FOO as a base identifier for the resource, and then append version information in a visually apparent way. For example, doi:10.1234/FOO/v1 might refer to version 1, doi:10.1234/FOO/v2 to version 2, and so forth. In a logical extension, we could then treat the version-less identifier doi:10.1234/FOO as identifying the resource as a whole. This is exactly the approach used by the arXiv preprint service.
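The second approach above can be sketched in a few lines. This is a minimal illustration only: the base DOI and the "v" suffix convention are the hypothetical examples from the text, not a registered scheme.

```python
# Sketch of version-suffixed identifiers in the style of doi:10.1234/FOO/v1.
# The base identifier and "v<N>" suffix convention are illustrative assumptions.

def versioned_id(base, version=None):
    """Return the identifier for a specific version, or the base
    (whole-resource) identifier when no version is given."""
    return base if version is None else f"{base}/v{version}"

def split_version(identifier):
    """Split an identifier into (base, version).
    Version is None for a version-less identifier."""
    head, _, tail = identifier.rpartition("/")
    if tail.startswith("v") and tail[1:].isdigit():
        return head, int(tail[1:])
    return identifier, None
```

For example, `versioned_id("doi:10.1234/FOO", 2)` yields `doi:10.1234/FOO/v2`, and `split_version` recovers the version-less base identifier that names the resource as a whole.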

Resources, versions, identifiers, citations: the issues they present tend to get bound up in a Gordian knot.  Oh, my!

Further reading:

ESIP Interagency Data Stewardship/Citations/Provider Guidelines

DCC “Cite Datasets and Link to Publications” How-to Guide

Resources, Versions, and URIs

DataCite Metadata Schema update


This spring, work is underway on a new version of the DataCite metadata schema. DataCite is a worldwide consortium founded in 2009 dedicated to “helping you find, access, and reuse data.” The principal mechanism for doing so is the registration of digital object identifiers (DOIs) via the member organizations. To make sure dataset citations are easy to find, each registration of a DataCite DOI has to be accompanied by a small set of citation metadata. It is small on purpose: this is intended to be a “big tent” for all research disciplines. DataCite has specified these requirements with a metadata schema.

The team in charge of this task is the Metadata Working Group. This group responds to suggestions from DataCite clients and community members. I chair the group, and my colleagues on the group come from the British Library, GESIS, the TIB, CISTI, and TU Delft.

The new version of the schema, 2.3, will be the first to be paired with a corresponding version in the Dublin Core Application Profile format. It fulfills a commitment that the Working Group made with its first release in January of 2011. The hope is that the application profile will promote interoperability with Dublin Core, a common metadata format in the library community, going forward. We intend to maintain synchronization between the schema and the profile with future versions.

Additional changes will include some new selections for the optional fields, including support for a new relationType (isIdenticalTo), and we’re considering a way to specify the temporal collection characteristics of the resource being registered. This would mean optionally describing, in simple terms, a data set collected between two dates. There are a few other changes under discussion as well, so stay tuned.

DataCite metadata is available in the Search interface to the DataCite Metadata Store. The metadata is also exposed for harvest via the OAI-PMH protocol. California Digital Library is a founding member, and our DataCite implementation is the EZID service, which also offers ARKs, an alternative identifier scheme. Please let me know if you have any questions by contacting uc3 at ucop.edu.
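For the curious, an OAI-PMH harvest is just a sequence of HTTP GET requests with a few well-known query parameters. Here is a minimal sketch of building a ListRecords request; the endpoint URL is an assumption on my part, so check DataCite’s documentation for the current harvesting address.

```python
from urllib.parse import urlencode

# Assumed DataCite OAI-PMH endpoint -- verify against current documentation.
OAI_ENDPOINT = "https://oai.datacite.org/oai"

def list_records_url(metadata_prefix="oai_dc", from_date=None, until_date=None):
    """Build an OAI-PMH ListRecords query URL.

    Dates are YYYY-MM-DD strings; 'verb' and 'metadataPrefix' are the
    standard OAI-PMH parameter names.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    return f"{OAI_ENDPOINT}?{urlencode(params)}"
```

Fetching the resulting URL returns an XML document of metadata records, paged via OAI-PMH resumption tokens.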

EZID: now even easier to manage identifiers

EZID, the easy long-term identifier service, just got a new look. EZID lets you create and maintain ARKs and DataCite Digital Object Identifiers (DOIs), and now it’s even easier to use:

In the coming months, we will also be introducing these EZID user interface enhancements:

So, stay tuned: EZID just gets better and better!

Data Citation Redux

I know what faithful DCXL readers are thinking: didn’t you already post about data citation? (For the unfaithful among you, check out this post from last November). Yes, I did. But I’ve been inspired to post yet again because I just attended an amazing workshop about all things data citation related.

The workshop was hosted by the NCAR Library (NCAR stands for National Center for Atmospheric Research) and took place in Boulder on Thursday and Friday of last week. Workshop organizers expected about 30 attendees; more than 70 showed up to learn more about data citation. Hats off to the organizers – there were healthy discussions among attendees and interesting presentations by great speakers.

One of the presentations that struck me most was by Dr. Tim Killeen, Assistant Director for the Geosciences Directorate at NSF. His talk (available on the workshop website) discussed the motivation for data citation, and what policies have begun to emerge. Near the end of a rather long string of reports about data citation, data sharing, and data management, Killeen said, “There is a drumbeat into Washington about this.”

John Bonham
If Led Zeppelin drummer John Bonham were still alive, he would be leading the data charge into DC. Bonham was voted the best drummer of all time by Rolling Stone readers. Photo from drummerworld.com

This phrase stuck with me long after I flew home because it juxtaposed two things I hadn’t considered as being related: Washington DC and data policy. Yes, I understand that NSF is located in Washington, and that very recently the White House announced some exciting Big Data funding and initiatives. But Washington DC as a whole – Congress, lobbyists, lawyers, judges, etc. – would notice a drumbeat about data? I must say, I got pretty excited about the idea.

What are these reports cited by Killeen?  In chronological order:

The NSB report on long-lived digital data had yet another great phrase that stuck with me:

Long-lived digital data collections are powerful catalysts for progress and for democratization of science and education

Wow. I really love the idea of democratized data.  It warms the cockles, doesn’t it?  With regard to DCXL, the link is obvious.  One of the features we are developing is generation of a data citation for your Excel dataset.
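Citation generation of the kind DCXL envisions amounts to assembling a formatted string from a few metadata fields. Here is a minimal sketch; the field names and the creator (year): title. publisher. identifier ordering are my assumptions for illustration, not DCXL’s actual output format.

```python
def format_data_citation(creators, year, title, publisher, identifier):
    """Assemble a simple dataset citation from metadata fields.

    The field set and ordering here are illustrative assumptions,
    loosely following common dataset-citation guidelines.
    """
    authors = "; ".join(creators)
    return f"{authors} ({year}): {title}. {publisher}. {identifier}"
```

For example, a call with one creator, a year, a title, a data center name, and a DOI produces a single ready-to-paste citation line.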

The Future of Metrics in Science

Ask any researcher what they need for tenure, and the answer is virtually the same across institutions and disciplines: publications. The “publish or perish” model has reigned supreme for generations of scientists, despite the fact that it ignores the quality (as opposed to quantity) of publications, how many collaborations have been established, and even the novelty or difficulty of a particular research project. This archaic measure of impact tends to rely on measures like a scientist’s number of citations and the impact factor of the journals in which they publish.

With the upswing in blogs, Twitter feeds, and academic social sites like Mendeley, Zotero, and (my favorite) CiteULike, some folks are working on developing a new model for measuring one’s impact on science. Jason Priem, a graduate student at UNC’s School of Information and Library Science, coined the term “altmetrics” rather recently, and the idea has taken off like wildfire.

altmetrics is the creation and study of new metrics based on the Social Web for analyzing, and informing scholarship.

The concept is simple: instead of using traditional metrics for measuring impact (citation counts, journal impact factors), Priem and his colleagues want to take into account more modern measures of impact like number of bookmarks, shares, or re-tweets.  In addition, altmetrics seeks to consider not only publications, but associated data or code downloads.
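In mechanical terms, the idea boils down to tallying events per research output rather than only counting citations. A toy sketch, where the flat event log and the event-type names are invented for illustration and not any real altmetrics API:

```python
from collections import Counter

# Toy tally of altmetrics-style events (bookmarks, shares, tweets,
# downloads) per research output. The event log format is an
# illustrative assumption, not a real service's data model.

def tally_events(events):
    """events: iterable of (output_id, event_type) pairs.
    Returns {output_id: Counter mapping event_type -> count}."""
    totals = {}
    for output_id, event_type in events:
        totals.setdefault(output_id, Counter())[event_type] += 1
    return totals
```

A dataset identified by a DOI would then accumulate download and share counts alongside any formal citations, which is exactly the broadened notion of impact described above.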

sex pistols
The original alternatives: The Sex Pistols. From Arroz Do Ceu (limpa-vias.blogspot.com). Read more about the beginnings of alternative rock in Dave Thompson’s book “Alternative Rock”.

Old-school scientists and Luddites might balk at the idea of measuring a scientist’s impact on the community by the number of re-tweets their article received, or by the number of downloads of their dataset.  This reaction can be attributed to several causes, one of which may be an irrational fear of change.  But the reality is that the landscape of science is changing dramatically, and the trend towards social media as a scientific tool is only likely to continue.  See my blog post on why scientists should tweet for more information on the benefits of embracing one of the aspects of this trend.

Need another reason to get onboard? Funders see the value in altmetrics.  Priem, along with his co-PI (and my DataONE colleague) Heather Piwowar, just received $125K from the Sloan Foundation to expand their Total Impact project.  Check out the Total Impact website for more information, or read the UNC SILS news story about the grant.

The DCXL project feeds right into the concept of altmetrics. By providing citations for datasets that are housed in data centers, the impact of a scientist’s data can be easily incorporated into measures of their overall impact.