(index page)

Your Time is Gonna Come

You know what they say: Timing is everything. Time enters into the data management and stewardship equation at several points and warrants discussion here. Why timeliness? Last week at the University of North Texas‘ Open Access Symposium, there were several great speakers who touched on timeliness of data management, organization, and sharing. It led me to wonder whether there is agreement about the timeliness of activities data-related, so here I’ve posted my opinions about time in a few points in the life cycle of data. Feel free to comment on this post with your own opinions.

1. When should you start thinking about data management? The best answer to this question is as soon as possible. The sooner you plan, the less likely you are to be surprised by issues like metadata standards or funder requirements (see my previous DCXL post about things you will wish you had thought about documenting). The NSF mandate for data management plans is a great motivator for thinking sooner rather than later, but let’s face facts: the DMP requirement is only two pages, and you can create one th

dark side of the rainbow image — If you have never watched the Wizard of Oz while listening to Pink Floyd’s Dark Side of the Moon album, you should. Of course, timing is everything: start the album on the third roar of the MGM lion. Image from horrorhomework.com (click on the image to go to the site)

at might pass muster without really thinking too carefully about your data. I encourage everyone to go well beyond funder requirements and thoughtfully plan out your approach to data stewardship. Spend plenty of time doing this, and return to your plan often during your project to update it.

2. When should you start archiving your data? By archiving, I do not mean backing up your data (that answer is constantly). I am referring to the action of putting your data into a repository for long-term (20+ years) storage. This is a more complicated question of timeliness. Issues that should be considered include:

Is your data collection ongoing? Continuously updated sensor or instrument data should begin being archived as soon as collection begins.
Is your dataset likely to undergo a lot of versions? You might wait to begin archiving until you get close to your final version.
Are others likely to want access to your data soon? Especially colleagues or co-authors? If the answer is yes, begin archiving early so that you are all using the same datasets for analysis.

3. When should you make your data publicly accessible? My favorite answer to this question is also as soon as possible. But this might mean different things for different scientists. For instance, making your data available in near-real time, either on a website or in a repository that supports versioning, allows others to use it, comment on it, and collaborate with you while you are still working on the project. This approach has its benefits, but also tends to scare off some scientists who are worried about being scooped. So if you aren’t an open data kind of person, you should make your data publicly available at the time of publication. Some journals are already requiring this, and more are likely to follow.

There are some that would still balk at making data available at publication: What if I want to publish more papers with this dataset in the future? In that case, have an honest conversation with yourself. What do you mean by “future”? Are you really likely to follow through on those future projects that might use the dataset? If the answer is no, you should make the data available to enhance your chances for collaboration. If the answer is yes, give yourself a little bit of temporal padding, but not too much. Think about enforcing a deadline of two years, at which point you make the data available whether you have finished those dream projects or not. Alternatively, find out if your favorite data repository will enforce your deadline for you– you may be able to provide them with a release date for your data, whether or not they hear from you first.

Data Diversity is Okay

At risk of sounding like a motivational speaker, this is such an exciting time to be involved in science and research. We are swimming in data and information (yay!), there are exciting software tools available for researchers, librarians, and lay people alike, and the possibilities for discovery seem endless. Of course, all of this change can be a bit daunting. How do you handle the data deluge? What software is likely to be around for a while? How do you manage your time effectively in the face of so much technology?

Like many other groups, academic libraries are undergoing some growing pains in the face of the information age. This may be attributed drastic budget cuts, rising costs for journal subscriptions, and the less important role that physical collections play in due to increasing digitization of information. Researchers are quite content to sit at their laptops and download PDFs from their favorite journals rather than wander the stacks of their local library; they would rather use Google searches to scour the internet for obscure references rather than ask their friendly subject librarian for help in the hunt.

Despite the challenges above, I firmly believe that this is such an exciting time to be working at the interface of libraries, science, and technology. Many librarians agree with me, including those at UCLA. Lisa Federer and Jen Weintraub recently put on a great panel at the UCLA library focused on data curation. I was invited to participate and agreed, which turned out to be an excellent decision.

The panel was called “Data Curation in Action”, and featured four panelists: Chris Johanson, UCLA professor of classics and digital humanities; Tamar Kremer-Sadlik, director of research at the UCLA Center for Everyday Lives of Families (CELF); Paul Conner, the digital laboratory director of CELF; and myself, intended to represent some mix of researchers in science and librarians.

Without droning on about how great the panel was, and how interesting the questions from the audience were, and how wonderful my discussions were with attendees after the panel, I wanted to mention the major thing that I took away: there is so much diverse data being generated by so many different kinds of projects and researchers. Did I mention that this is an exciting time in the world of information?

Take Tamar and Paul: their project involves following families every day for hours on end, recording video, documenting interactions and locations of family members, taking digital photographs, conducting interviews, and measuring cortisol levels (an indicator for stress). You should read that sentence again, because that is an enormous diversity of data types, not to mention the volume. Interviews and video are transcribed, quantitative observations are recorded in databases, and there is an intense coding system for labeling images, videos, and audio files.

Now for Chris, who has the ability to say “I am a professor of classics” at dinner parties (I’m jealous). Chris doesn’t sit about reading old texts and talking about marble statues. Instead he is trying to reconstruct “ephemeral activities in the ancient world”, such as attending a funeral, going to the market, etcetera. He does this using a complex combination of Google Earth, digitized ancient maps, pictures, historical records, and data from excavations of ancient civilizations. He stole the show at the panel when he demonstrated how researchers are beginning to create virtual worlds in which a visitor can wander around the landscape, just like in a modern day 3D video game.

This is really just a blog post about how much I love my job. I can’t imagine anything more interesting than trying to solve problems and provide assistance for researchers such as Tamar, Paul and Chris.

In case you are not one of the 35 million who have watched it, OK Go has a wonderful video about getting through the tough times associated with the dawning information age (at least that’s my rather nerdy interpretation of this song):

Finding a Home For Your Data

Where do I put my data? This question comes up often when I talk with researchers about long term data archiving. There are many barriers to sharing and archiving data, and a lack of knowledge about where to archive the data is certainly among them.

First and foremost, choose early. By choosing a data repository early in the life of your research, you guarantee that there will be no surprises when it comes time to deposit your dataset. If there are strict metadata requirements for the repository you choose, it is immensely beneficial to know those requirements before you collect your first data point. You can design your data collection materials (e.g., spreadsheets, data entry forms, or scripts for automating computational tasks) to ensure seamless metadata creation.

Second, choose carefully. Data repositories (also known as data centers or archives) are plentiful. Choosing the correct repository for your specific dataset is important: you want the right people to be able to find your data in the future. If you research how climate change affects unicorns, you don’t want to store your dataset at the Polar Data Centre (unless you study arctic unicorns). Researchers hunting for unicorn data are more likely to check the Fantasy Creatures Archive, for instance.

So how do you choose your perfect data repository? There is no match.com or eHarmony for matching data with its repository, but a close second is Databib. This is a great resource born from a partnership between Purdue and Penn State Libraries, funded by the Institute of Museum and Library Services. The site allows you to search for keywords in their giant list of data repositories. For instance “unicorns” surprisingly brings up no results, but “marine” brings up seven repositories that house marine data. Don’t see your favorite data repository? The site welcomes feedback (including suggestions for additional repositories not in their database).

Another good way to discover your perfect repository: think about where you go to find datasets for your research. Try asking around too – where do your colleagues look for data? Where do they archive it? Look for mentions of data centers in publications that you read, and perhaps do a little web hunting.

A final consideration: some institutions provide repositories for their researchers. Often you can store your data in these repositories while you are still working on it (password protected, of course), using the data center as a kind of backup system. These institutional repositories have benefits like more personal IT service and cheaper (or free) storage rates, but they might make your data more difficult to find for those in your research field. Examples of institutional repositories are DSpace@MIT, DataSpace at Princeton, and the Merritt Repository right here at CDL.

The DCXL add-in will initially connect with CDL’s Merritt Repository. The datasets submitted to Merritt via DCXL will be integrated into the DataONE network to ensure that data deposited via DCXL is accessible and discoverable. Long live Excel data!

Not familiar with the song Homeward Bound? Check out this amazing performance by Simon and Garfunkel

Resources, and Versions, and Identifiers! Oh, my!

The only constant is change. —Heraclitus

Data publication, management, and citation would all be so much easier if data never changed, or at least, if it never changed after publication. But as the Greeks observed so long ago, change is here to stay. We must accept that data will change, and given that fact, we are probably better off embracing change rather than avoiding it. Because the very essence of data citation is identifying what was referenced at the time it was referenced, we need to be able to put a name on that referenced quantity, which leads to the requirement of assigning named versions to data. With versions we are providing the x that enables somebody to say, “I used version x of dataset y.”

Since versions are ultimately names, the problem of defining versions is inextricably bound up with the general problem of identification. Key questions that must be asked when addressing data versioning and identification include:

What is being identified by a version? This can be a surprisingly subtle question. Is a particular set of bits being identified? A conceptual quantity (to use FRBR terms, an expression or manifestation)? A location? A conceptual quantity at a location? For a resource that changes rapidly or predictably, such as a data stream that accumulates over time, it will probably be necessary to address the structure of the stream separately from the content of the stream, and to support versions and/or citation mechanisms that allow the state of the stream to be characterized at the time of reference. In any case, the answer to the question of what is being identified will greatly impact both what constitutes change (and therefore what constitutes a version) and the appropriateness of different identifier technologies to identifying those versions.
When does a change constitute a new version? Always? Even when only a typographical error is being corrected? Or, in a hypertext document, when updating a broken hyperlink? (This is a particularly difficult case, since updating a hyperlink requires updating the document, of course, but a URL is really a property of the identifiee, not the identifier.) In the case of a science dataset, does changing the format of the data constitute a new version? Reorganizing the data within a format (e.g., changing from row-major to column-major order)? Re-computing the data on different floating-point hardware? Versions are often divided into “major” versions and “minor” versions to help characterize the magnitude and backward-compatibility of changes.
Is each version an independent resource? Or is there one resource that contains multiple versions? This may seem a purely semantic distinction, but the question has implications on how the resource is managed in practice. The W3C struggled with this question in identifying the HTML specification. It could have created one HTML resource with many versions (3.1, 4.2, 5, …), but for manageability it settled on calling HTML3 one resource (with versions 3.1, 3.2, etc.), HTML4 a separate resource (with analogous versions 4.1, 4.2, etc.), and continuing on to HTML5 as yet another resource.

So far we have only raised questions, and that’s the nature of dealing with versions: the answers tend to be very situation-specific. Fortunately, some broad guidelines have emerged:

Assign an identifier to each version to support identification and citation.
Assign an identifier to the resource as a whole, that is, to the resource without considering any particular version of the resource. There are many situations where it is desirable to be able to make a version-agnostic reference. Consider that, in the text above, we were able to refer to something called “HTML4” without having to name any particular version of that resource. What if that were not possible?
Provide linkages between the versions, and between the versions and the resource as a whole.

These guidelines still leave the question of how to actually assign identifiers to versions unanswered. One approach is to assign a different, unrelated identifier to each version. For example, doi:10.1234/FOO might refer to version 1 of a resource and doi:10.5678/BAR to version 2. Linkages, stored in the resource versions themselves or externally in a database, can record the relationships between these identifiers. This approach may be appropriate in many cases, but it should be recognized that it places a burden on both the resource maintainer (every link that must be maintained represents a breakage point) and user (there is no easily visible or otherwise obvious relationship between the identifiers). Another approach is to syntactically encode version information in the identifiers. With this approach, we might start with doi:10.1234/FOO as a base identifier for the resource, and then append version information in a visually apparent way. For example, doi:10.1234/FOO/v1 might refer to version 1, doi:10.1234/FOO/v2 to version 2, and so forth. And in a logical extension we could then treat the version-less identifier doi:10.1234/FOO as identifying the resource as a whole. This is exactly the approach used by the arXiv preprint service.

Resources, versions, identifiers, citations: the issues they present tend to get bound up in a Gordian knot. Oh, my!

Trailblazers in Demography

Last week I had the great pleasure of visiting Rostock, Germany. If your geography lessons were a long time ago, you are probably wondering “where’s Rostock?” I sure did… Rostock is located very close to the Baltic Sea, in northeast Germany. It’s a lovely little town with bumpy streets, lots of sausage, and great public transportation. I was there, however, to visit the prestigious Max Planck Institute for Demographic Research (MPIDR).

Demography is the study of populations, especially their birth rates, death rates, and growth rates. For humans, this data might be used for, say, calculating premiums for life insurance. For other organisms, these types of data are useful for studying population declines, increases, and changes. Such areas of study are especially important for endangered populations, invasive species, and commercially important plants and animals.

baby rhino — Sharing demography data saves adorable endangered species. From Flickr by haiwan42

I was invited to MPIDR because there is a group of scientists interested in creating a repository for non-human demography data. Luckily, they aren’t starting from scratch. They have a few existing collections of disparate data sets, some more refined and public-facing than others; their vision is to merge these datasets and create a useful, integrated database chock full of demographic data. Although the group has significant challenges ahead (metadata standards, security, data governance policies, long term sustainability), their enthusiasm for the project will go a long way towards making it a reality.

The reason I am blogging about this meeting is because for me, the group’s goals represent something much bigger than a demography database. In the past two years, I have been exposed to a remarkable range of attitudes towards data sharing (check out blog posts about it here, here, here, and here). Many of the scientists with whom I spoke needed convincing to share their datasets. But even in this short period of time that I have been involved in issues surrounding data, I have seen a shift towards the other end of the range. The Rostock group is one great example of scientists who are getting it.

More and more scientists are joining the open data movement, and a few of them are even working to convert others to believe in the cause. This group that met in Rostock could put their heads down, continue to work on their separate projects, and perhaps share data occasionally with a select few vetted colleagues that they trust and know well. But they are choosing instead to venture into the wilderness of scientific data sharing. Let them be an inspiration to data hoarders everywhere.

It is our intention that the DCXL project will result in an add-in and web application that will facilitate all of the good things the Rostock group is trying to promote in the demography community. Demographers use Microsoft Excel, in combination with Microsoft Access, to organize and manage their large datasets. Perhaps in the future our open-source add-in and web application will be linked up with the demography database; open source software, open data, and open minds make this possible.

DataCite Metadata Schema update

This spring, work is underway on a new version of the DataCite metadata schema. DataCite is a worldwide consortium founded in 2009 dedicated to “helping you find, access, and reuse data.” The principle mechanism for doing so is the registration of digital object identifiers (DOIs) via the member organizations. To make sure dataset citations are easy to find, each registration for a DataCite DOI has to be accompanied by a small set of citation metadata. It is small on purpose: this is intended to be a “big tent” for all research disciplines. DataCite has specified these requirements with a metadata schema.

The team in charge of this task is the Metadata Working Group. This group responds to suggestions from DataCite clients and community members. I chair the group, and my colleagues on the group come from the British Library, GESIS, the TIB, CISTI, and TU Delft.

The new version of the schema, 2.3, will be the first to be paired with a corresponding version in the Dublin Core Application Profile format. It fulfills a commitment that the Working Group made with its first release in January of 2011. The hope is that the application profile will promote interoperability with Dublin Core, a common metadata format in the library community, going forward. We intend to maintain synchronization between the schema and the profile with future versions.

Additional changes will include some new selections for the optional fields including support for a new relationType (isIdenticalTo), and we’re considering a way to specify temporal collection characteristics of the resource being registered. This would mean describing, in simple terms and optionally, a data set collected between two dates. There are a few other changes under discussion as well, so stay tuned.

DataCite metadata is available in the Search interface to the DataCite Metadata Store. The metadata is also exposed for harvest, via an OAI-PMH protocol. California Digital Library is a founding member, and our DataCite implementation is the EZID service, which also offers ARKs, an alternative identifier scheme. Please let me know if you have any questions by contacting uc3 at ucop.edu.

QSE3, IGERT, OA and DCXL

A few months back I received an invite to visit the University of Florida in sunny Gainesville. The invite was from organizers of an annual symposium for the Quantitative Spatial Ecology, Evolution and Environment (QSE3) Integrative Graduate Education and Research Traineeship (IGERT) program. Phew! That was a lot of typing for the first two acronyms in my blog post’s title. The third acronym (OA) stands for Open Access, and the fourth acronym should be familiar.

I presented a session on data management and sharing for scientists, and afterward we had a round table discussion focused on OA. There were about 25 graduate students affiliated the QSE3 IGERT program, a few of their faculty advisors, and some guests (including myself) involved in the discussion. In 90 minutes we covered the gamut of current publishing models, incentive structures for scientists, LaTeX advantages and disadvantages, and data sharing. The discussion was both interesting and energetic in a way that I don’t encounter from scientists that are “more established”. Some of the themes that emerged from our discussion warrant a blog post.

First, we discussed that data sharing is an obvious scientific obligation in theory, but when it comes to your data, most scientists get a bit more cagey. This might be with good reason – many of the students in the discussion were still writing up their results in thesis form, never mind in journal-ready form. Throwing your data out into the ether without restrictions might result in some speedy scientist scooping you while you are dotting i’s and crossing t’s in your thesis draft. In the case of grad students and scientists in general, embargo periods seem to be a good response to most of this apprehension. We agreed as a group, however, that such embargos should be temporary and should be phased out over time as cultural norms shift.

The current publishing model needs to change, but there was disagreement about how this change should manifest. For instance, one (very computer-savvy) student who uses R, LaTeX and Sweave asked “Why do we need publishers? Why can’t we just put the formatted text and code online?” This is an obvious solution for someone well-versed in the world of document preparation in the vein of LaTeX. You get fully formated, high-quality publications by simply compiling documents. But this was argued against by many in attendance because LaTeX use is not widespread, and most articles need heavy amounts of formatting before publication. Of course, this is work that would need to be done by the overburdened scientist if they published their own work, which is not likely to become the norm any time soon.

No journals means empty library shelves. Perhaps the newly freed up space could be used to store curmudgeonly professors resistant to change.

Let’s pretend that we have overhauled both scientists and the publishing system as it is. In this scenario, scientists use free open-source tools like LaTeX and Sweave to generate beautiful documents. They document their workflows and create python scripts that run in the command line for reproducible results. Given this scenario, one of the students in the discussion asked “How do you decide what to read?” His argument was that the current journal system provides some structure for scientists to hone in on interesting publications and determine their quality based (at least partly) on the journal in which the article appears.

One of the other grad students had an interesting response to this: use tags and keywords, create better search engines for academia, and provide capabilities for real-time peer review of articles, data, and publication quality. In essence, he used the argument that there’s no such thing as too much information. You just need a better filter.

One of the final questions of the discussion came from the notable scientist Craig Osenberg. It was in reference to the shift in science towards “big data”, including remote sensing, text mining, and observatory datasets. To paraphrase: Is anyone worrying about the small datasets? They are the most unique, the hardest to document, and arguably the most important.

My answer was a resounding YES! Enter the DCXL project. We are focusing on providing support for the scientists that don’t have data managers, IT staff, and existing data repository accounts that facilitate data management and sharing. One of the main goals of the DCXL project is to help “the little guy”. These are often scientists working on relatively small datasets that can be contained in Excel files.

In summary, the very smart group of students at UF came to the same conclusions that many of us in the data world have: there needs to be a fundamental shift in the way that science is incentivized, and this is likely to take a while. Of course, given that these students are early in their careers, and their high levels of interest and intelligence, they are likely to be a part of that change.

Special thanks goes to Emilio Bruna (@brunalab) who not only scored me the invite to UF, but also hosted me for a lovely dinner during my visit (albeit NOT the Tasty Budda…)

EZID: now even easier to manage identifiers

EZID, the easy long-term identifier service, just got a new look. EZID lets you create and maintain ARKs and DataCite Digital Object Identifiers (DOIs), and now it’s even easier to use:

One stop for EZID and all EZID information, including webinars, FAQs, and more.
Image by Simon Cousins
- A clean, bright new look.
- No more hunting across two locations for the materials and information you need.

NEW Manage IDs functions:
- View all identifiers created by logged-in account;
- View most recent 10 interactions–based on the account–not the session;
- See the scope of your identifier work without any API programming.

NEW in the UI: Reserve an Identifier
- Create identifiers early in the research cycle;
- Choose whether or not you want to make your identifier public–reserve them if you don’t;
- On the Manage screen, view the identifier’s status (public, reserved, unavailable/just testing).

In the coming months, we will also be introducing these EZID user interface enhancements:

Enhanced support for DataCite metadata in the UI;
Reporting support for institution-level clients.

So, stay tuned: EZID just gets better and better!