Skip to main content

(index page)

Resources, and Versions, and Identifiers! Oh, my!

The only constant is change.  —Heraclitus

Data publication, management, and citation would all be so much easier if data never changed, or at least, if it never changed after publication. But as the Greeks observed so long ago, change is here to stay. We must accept that data will change, and given that fact, we are probably better off embracing change rather than avoiding it. Because the very essence of data citation is identifying what was referenced at the time it was referenced, we need to be able to put a name on that referenced quantity, which leads to the requirement of assigning named versions to data. With versions we are providing the x that enables somebody to say, “I used version x of dataset y.”

Since versions are ultimately names, the problem of defining versions is inextricably bound up with the general problem of identification. Key questions that must be asked when addressing data versioning and identification include:

So far we have only raised questions, and that’s the nature of dealing with versions: the answers tend to be very situation-specific. Fortunately, some broad guidelines have emerged:

These guidelines still leave the question of how to actually assign identifiers to versions unanswered. One approach is to assign a different, unrelated identifier to each version. For example, doi:10.1234/FOO might refer to version 1 of a resource and doi:10.5678/BAR to version 2. Linkages, stored in the resource versions themselves or externally in a database, can record the relationships between these identifiers. This approach may be appropriate in many cases, but it should be recognized that it places a burden on both the resource maintainer (every link that must be maintained represents a breakage point) and user (there is no easily visible or otherwise obvious relationship between the identifiers). Another approach is to syntactically encode version information in the identifiers. With this approach, we might start with doi:10.1234/FOO as a base identifier for the resource, and then append version information in a visually apparent way. For example, doi:10.1234/FOO/v1 might refer to version 1, doi:10.1234/FOO/v2 to version 2, and so forth. And in a logical extension we could then treat the version-less identifier doi:10.1234/FOO as identifying the resource as a whole. This is exactly the approach used by the arXiv preprint service.

Resources, versions, identifiers, citations: the issues they present tend to get bound up in a Gordian knot.  Oh, my!

Further reading:

ESIP Interagency Data Stewardship/Citations/Provider Guidelines

DCC “Cite Datasets and Link to Publications” How-to Guide

Resources, Versions, and URIs