Resources, and Versions, and Identifiers! Oh, my!
The only constant is change. —Heraclitus
Data publication, management, and citation would all be so much easier if data never changed, or at least, if it never changed after publication. But as the Greeks observed so long ago, change is here to stay. We must accept that data will change, and given that fact, we are probably better off embracing change rather than avoiding it. Because the very essence of data citation is identifying what was referenced at the time it was referenced, we need to be able to put a name on that referenced quantity, which leads to the requirement of assigning named versions to data. With versions we are providing the x that enables somebody to say, “I used version x of dataset y.”
Since versions are ultimately names, the problem of defining versions is inextricably bound up with the general problem of identification. Key questions that must be asked when addressing data versioning and identification include:
- What is being identified by a version? This can be a surprisingly subtle question. Is a particular set of bits being identified? A conceptual quantity (to use FRBR terms, an expression or manifestation)? A location? A conceptual quantity at a location? For a resource that changes rapidly or predictably, such as a data stream that accumulates over time, it will probably be necessary to address the structure of the stream separately from the content of the stream, and to support versions and/or citation mechanisms that allow the state of the stream to be characterized at the time of reference. In any case, the answer to the question of what is being identified will greatly impact both what constitutes change (and therefore what constitutes a version) and the suitability of different identifier technologies for identifying those versions.
- When does a change constitute a new version? Always? Even when only a typographical error is being corrected? Or, in a hypertext document, when updating a broken hyperlink? (This is a particularly difficult case, since updating a hyperlink requires updating the document, of course, but a URL is really a property of the identifiee, not the identifier.) In the case of a science dataset, does changing the format of the data constitute a new version? Reorganizing the data within a format (e.g., changing from row-major to column-major order)? Re-computing the data on different floating-point hardware? Versions are often divided into “major” versions and “minor” versions to help characterize the magnitude and backward-compatibility of changes.
- Is each version an independent resource? Or is there one resource that contains multiple versions? This may seem a purely semantic distinction, but the question has implications for how the resource is managed in practice. The W3C struggled with this question in identifying the HTML specification. It could have defined a single HTML resource with many versions (3.2, 4.01, 5, …), but for manageability it settled on treating HTML3 as one resource, HTML4 as a separate resource (with versions 4.0, 4.01, etc.), and HTML5 as yet another resource.
So far we have only raised questions, and that’s the nature of dealing with versions: the answers tend to be very situation-specific. Fortunately, some broad guidelines have emerged:
- Assign an identifier to each version to support identification and citation.
- Assign an identifier to the resource as a whole, that is, to the resource without considering any particular version of the resource. There are many situations where it is desirable to be able to make a version-agnostic reference. Consider that, in the text above, we were able to refer to something called “HTML4” without having to name any particular version of that resource. What if that were not possible?
- Provide linkages between the versions, and between the versions and the resource as a whole.
These guidelines still leave the question of how to actually assign identifiers to versions unanswered. One approach is to assign a different, unrelated identifier to each version. For example, doi:10.1234/FOO might refer to version 1 of a resource and doi:10.5678/BAR to version 2. Linkages, stored in the resource versions themselves or externally in a database, can record the relationships between these identifiers. This approach may be appropriate in many cases, but it should be recognized that it places a burden on both the resource maintainer (every link that must be maintained represents a breakage point) and the user (there is no easily visible or otherwise obvious relationship between the identifiers).

Another approach is to syntactically encode version information in the identifiers. With this approach, we might start with doi:10.1234/FOO as a base identifier for the resource, and then append version information in a visually apparent way. For example, doi:10.1234/FOO/v1 might refer to version 1, doi:10.1234/FOO/v2 to version 2, and so forth. And in a logical extension we could then treat the version-less identifier doi:10.1234/FOO as identifying the resource as a whole. This is exactly the approach used by the arXiv preprint service.
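To make the second approach concrete, here is a minimal sketch in Python of building and parsing such version-suffixed identifiers. The base identifier comes from the example above; the helper functions and the /v&lt;N&gt; convention are illustrative only, not any registry's actual API.

```python
# A minimal sketch of the suffix-based versioning scheme described above.
# The base identifier comes from the example in the text; the helper
# functions and the /v<N> convention are illustrative only.
import re

BASE = "doi:10.1234/FOO"  # version-agnostic identifier for the resource as a whole

def versioned(base, version):
    """Append version information to a base identifier in a visually apparent way."""
    return f"{base}/v{version}"

def parse(identifier):
    """Split an identifier into (base, version); version is None if absent."""
    match = re.fullmatch(r"(.+?)/v(\d+)", identifier)
    if match:
        return match.group(1), int(match.group(2))
    return identifier, None  # version-less: refers to the resource as a whole

assert versioned(BASE, 2) == "doi:10.1234/FOO/v2"
assert parse("doi:10.1234/FOO/v2") == ("doi:10.1234/FOO", 2)
assert parse(BASE) == (BASE, None)
```

Note how the version-less identifier falls out naturally as the name for the resource as a whole, which is what makes the arXiv-style scheme so attractive.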
Resources, versions, identifiers, citations: the issues they present tend to get bound up in a Gordian knot. Oh, my!
Further reading:
ESIP Interagency Data Stewardship/Citations/Provider Guidelines
DCC “Cite Datasets and Link to Publications” How-to Guide
DataCite Metadata Schema update
This spring, work is underway on a new version of the DataCite metadata schema. DataCite is a worldwide consortium, founded in 2009, dedicated to “helping you find, access, and reuse data.” The principal mechanism for doing so is the registration of digital object identifiers (DOIs) via its member organizations. To make sure dataset citations are easy to find, each registration of a DataCite DOI has to be accompanied by a small set of citation metadata. The set is small on purpose: this is intended to be a “big tent” for all research disciplines. DataCite has specified these requirements with a metadata schema.
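For a concrete sense of how small that set is, here is a hedged Python sketch of a record carrying just the mandatory properties (identifier, creator, title, publisher, publication year). All values are invented, and real records carry schema namespaces and nested elements (e.g., a creatorName inside each creator), so treat this as an approximation rather than a valid instance of the schema.

```python
# A minimal sketch (not authoritative) of the small mandatory metadata set
# that accompanies a DataCite DOI registration: identifier, creator, title,
# publisher, and publication year. All values here are invented, and
# element names/namespaces vary by schema version.
import xml.etree.ElementTree as ET

def minimal_datacite_xml(identifier, creator, title, publisher, year):
    resource = ET.Element("resource")  # real records also declare a schema namespace
    ET.SubElement(resource, "identifier", identifierType="DOI").text = identifier
    creators = ET.SubElement(resource, "creators")
    ET.SubElement(creators, "creator").text = creator  # schema nests a creatorName here
    titles = ET.SubElement(resource, "titles")
    ET.SubElement(titles, "title").text = title
    ET.SubElement(resource, "publisher").text = publisher
    ET.SubElement(resource, "publicationYear").text = str(year)
    return ET.tostring(resource, encoding="unicode")

print(minimal_datacite_xml("10.1234/FOO", "Rios, Carly", "Example dataset",
                           "Example Data Center", 2012))
```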
The team in charge of this task is the Metadata Working Group. This group responds to suggestions from DataCite clients and community members. I chair the group, and my colleagues on the group come from the British Library, GESIS, the TIB, CISTI, and TU Delft.
The new version of the schema, 2.3, will be the first to be paired with a corresponding version in the Dublin Core Application Profile format. This fulfills a commitment the Working Group made with its first release in January 2011. The hope is that the application profile will promote interoperability with Dublin Core, a common metadata format in the library community. We intend to keep the schema and the profile synchronized in future versions.
Additional changes will include some new values for the optional fields, including support for a new relationType (isIdenticalTo). We’re also considering a way to specify the temporal collection characteristics of the resource being registered; this would mean optionally describing, in simple terms, a data set collected between two dates. There are a few other changes under discussion as well, so stay tuned.
DataCite metadata is available in the Search interface to the DataCite Metadata Store. The metadata is also exposed for harvest via the OAI-PMH protocol. California Digital Library is a founding member, and our DataCite implementation is the EZID service, which also offers ARKs, an alternative identifier scheme. Please let me know if you have any questions by contacting uc3 at ucop.edu.
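If you want to script a harvest rather than use the Search interface, here is a minimal sketch using only Python's standard library. The ListRecords verb and oai_dc metadata prefix are standard OAI-PMH; the endpoint URL is my assumption, so verify it against DataCite's current documentation.

```python
# A minimal sketch of harvesting DataCite metadata over OAI-PMH using only
# the standard library. The endpoint URL is an assumption (check DataCite's
# documentation); the verb and parameters are standard OAI-PMH.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ENDPOINT = "https://oai.datacite.org/oai"  # assumed endpoint; verify before use
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def list_records(metadata_prefix="oai_dc"):
    query = urllib.parse.urlencode({"verb": "ListRecords",
                                    "metadataPrefix": metadata_prefix})
    with urllib.request.urlopen(f"{ENDPOINT}?{query}") as response:
        return ET.fromstring(response.read())

# Print the OAI identifier of each harvested record header.
root = list_records()
for header in root.iter(f"{OAI_NS}header"):
    print(header.findtext(f"{OAI_NS}identifier"))
```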
EZID: now even easier to manage identifiers
EZID, the easy long-term identifier service, just got a new look. EZID lets you create and maintain ARKs and DataCite Digital Object Identifiers (DOIs), and now it’s even easier to use:
- One stop for EZID and all EZID information, including webinars, FAQs, and more.
- A clean, bright new look.
- No more hunting across two locations for the materials and information you need.
- NEW Manage IDs functions:
- View all identifiers created by the logged-in account;
- View the 10 most recent interactions, based on the account rather than the session;
- See the scope of your identifier work without any API programming.
- NEW in the UI: Reserve an Identifier
- Create identifiers early in the research cycle;
- Choose whether or not to make your identifiers public; reserve them if you don’t;
- On the Manage screen, view the identifier’s status (public, reserved, unavailable/just testing); for the API-inclined, a sketch of setting the same status programmatically follows below.
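For anyone who does want to script the new status feature, here is a hedged sketch of reserving an identifier through EZID's HTTP API. It assumes the documented conventions as I recall them (a PUT to /id/&lt;identifier&gt; with ANVL-formatted metadata over HTTP Basic authentication, and a `_status` value of public, reserved, or unavailable); check the current EZID API documentation before relying on it.

```python
# A hedged sketch of reserving an identifier via EZID's HTTP API rather
# than the UI. The endpoint path, ANVL metadata format, and "_status"
# values (public, reserved, unavailable) follow EZID's API documentation
# as I recall it; verify against the current docs before use.
import urllib.request

def reserve_identifier(identifier, username, password):
    url = f"https://ezid.cdlib.org/id/{identifier}"
    body = "_status: reserved\n".encode("utf-8")  # ANVL: one "key: value" per line
    request = urllib.request.Request(url, data=body, method="PUT")
    request.add_header("Content-Type", "text/plain; charset=UTF-8")
    # EZID uses HTTP Basic authentication
    manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    manager.add_password(None, url, username, password)
    opener = urllib.request.build_opener(urllib.request.HTTPBasicAuthHandler(manager))
    with opener.open(request) as response:
        return response.read().decode("utf-8")
```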
In the coming months, we will also be introducing these EZID user interface enhancements:
- Enhanced support for DataCite metadata in the UI;
- Reporting support for institution-level clients.
So, stay tuned: EZID just gets better and better!
Data Citation Redux
I know what faithful DCXL readers are thinking: didn’t you already post about data citation? (For the unfaithful among you, check out this post from last November). Yes, I did. But I’ve been inspired to post yet again because I just attended an amazing workshop about all things data citation related.
The workshop was hosted by the NCAR Library (NCAR stands for National Center for Atmospheric Research) and took place in Boulder on Thursday and Friday of last week. Workshop organizers expected about 30 attendees; more than 70 showed up to learn more about data citation. Hats off to the organizers: there were healthy discussions among attendees and interesting presentations from great speakers.
One of the presentations that struck me most was by Dr. Tim Killeen, Assistant Director for the Geosciences Directorate at NSF. His talk (available on the workshop website) discussed the motivation for data citation and the policies that have begun to emerge. Near the end of a rather long string of reports about data citation, data sharing, and data management, Killeen said, “There is a drumbeat into Washington about this.”

This phrase stuck with me long after I flew home because it juxtaposed two things I hadn’t considered as being related: Washington DC and data policy. Yes, I understand that NSF is located in Washington, and that very recently the White House announced some exciting Big Data funding and initiatives. But Washington DC as a whole (Congress, lobbyists, lawyers, judges, etc.) would notice a drumbeat about data? I must say, I got pretty excited about the idea.
What are these reports cited by Killeen? In chronological order:
- NSF’s advisory panel report waaay back in 2003: a “Harbinger of Cyberinfrastructure” according to Killeen
- National Science Board’s report in 2005 on the importance of ensuring digital data are long-lived.
- Final report from ARL/NSF Workshop on Long-Term Stewardship of Digital Data Collections in 2006: called for promoting “change in the research enterprise regarding… stewardship of digital data”
- NSF’s stated vision in a 2007 report Cyberinfrastructure Vision for 21st Century Discovery. The vision? Data being routinely deposited in a well-documented form. Love it.
- A 2009 Report of the Interagency Working Group on Digital Data stated that “all sectors of society are stakeholders in digital preservation and access.” Agreed!
- NSF’s 2012 Vision and Strategic Plan: Cyberinfrastructure Framework for the 21st Century
The NSB report on long-lived digital data had yet another great phrase that stuck with me:
Long-lived digital data collections are powerful catalysts for progress and for democratization of science and education
Wow. I really love the idea of democratized data. It warms the cockles, doesn’t it? With regard to DCXL, the link is obvious. One of the features we are developing is generation of a data citation for your Excel dataset.
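As a sketch of what that feature amounts to, here is one way to assemble a citation string from a handful of metadata fields. The field set and formatting loosely follow common dataset-citation guidance and are illustrative only, not DCXL's actual output.

```python
# A minimal sketch of assembling a dataset citation string from metadata.
# The field set and ordering loosely follow common data-citation guidance;
# this is illustrative, not DCXL's actual implementation.
def format_citation(creators, year, title, publisher, identifier, version=None):
    authors = "; ".join(creators)
    version_part = f" Version {version}." if version else ""
    return f"{authors} ({year}). {title}.{version_part} {publisher}. {identifier}"

print(format_citation(["Rios, C.", "Strasser, C."], 2012,
                      "Example phenology dataset", "Example Data Center",
                      "doi:10.1234/FOO/v2", version=2))
```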
The Future of Metrics in Science
Ask any researcher what they need for tenure, and the answer is virtually the same across institutions and disciplines: publications. The “publish or perish” model has reigned supreme for generations of scientists, despite its rather annoying indifference to the quality (as opposed to the quantity) of publications, to how many collaborations have been established, and even to the novelty or difficulty of a particular research project. This archaic measure of impact tends to rely on measures like a scientist’s number of citations and the impact factor of the journals in which they publish.
With the upswing in blogs, Twitter feeds, and academic social sites like Mendeley, Zotero, and (my favorite) CiteULike, some folks are working on developing a new model for measuring one’s impact on science. Jason Priem, a graduate student at UNC’s School of Information and Library Science, coined the term “altmetrics” rather recently, and the idea has spread like wildfire.
altmetrics is the creation and study of new metrics based on the Social Web for analyzing, and informing scholarship.
The concept is simple: instead of using traditional metrics for measuring impact (citation counts, journal impact factors), Priem and his colleagues want to take into account more modern measures of impact like number of bookmarks, shares, or re-tweets. In addition, altmetrics seeks to consider not only publications, but associated data or code downloads.
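As a toy illustration of the bookkeeping involved, the sketch below tallies impact events per research artifact instead of collapsing everything into a single citation count. The event types and identifiers are entirely invented.

```python
# A toy sketch of the altmetrics idea: tallying impact events (bookmarks,
# shares, tweets, data downloads) per research artifact alongside
# traditional citations. Event types and identifiers are invented.
from collections import Counter

events = [
    ("doi:10.1234/FOO", "tweet"), ("doi:10.1234/FOO", "bookmark"),
    ("doi:10.1234/FOO", "data_download"), ("doi:10.5678/BAR", "tweet"),
]

def impact_profile(events):
    """Count each event type per artifact, rather than collapsing
    everything into a single citation count."""
    profile = {}
    for artifact, kind in events:
        profile.setdefault(artifact, Counter())[kind] += 1
    return profile

for artifact, counts in impact_profile(events).items():
    print(artifact, dict(counts))
```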

Old-school scientists and Luddites might balk at the idea of measuring a scientist’s impact on the community by the number of re-tweets their article received, or by the number of downloads of their dataset. This reaction can be attributed to several causes, one of which may be an irrational fear of change. But the reality is that the landscape of science is changing dramatically, and the trend towards social media as a scientific tool is only likely to continue. See my blog post on why scientists should tweet for more information on the benefits of embracing one aspect of this trend.
Need another reason to get onboard? Funders see the value in altmetrics. Priem, along with his co-PI (and my DataONE colleague) Heather Piwowar, just received $125K from the Sloan Foundation to expand their Total Impact project. Check out the Total Impact website for more information, or read the UNC SILS news story about the grant.
The DCXL project feeds right into the concept of altmetrics. By providing citations for datasets that are housed in data centers, the impact of a scientist’s data can be easily incorporated into measures of their overall impact.