Data Hangover

What happened? This is a common question among co-eds, Mardi Gras celebrants, and many scientists. I’m not referring to alcohol-soaked evenings at conferences, but instead to the question that inevitably results from poor data management. What happened? How did I get here?

It might seem a bit trite to pose poor data management as a parallel to poor decisions about alcohol consumption, but actually the results are quite similar: you have a hard time remembering what happened, you regret the decisions that led to your predicament, and it’s rather embarrassing.

Is your data management record any better than notes on cocktail napkins? From Flickr by sillydog

How can you avoid it? The solution for data hangover is easy: good data management. This is often easier said than done, especially if the project and data collection are all well underway. Let’s start with the easier scenario, which is that the project has not yet started. Preventing data hangover is easiest at this point since good planning will avoid most of the problematic issues. A few general guidelines (for more ideas see Data Management 101 on this website and the Resources tab on my website):

1. Choose a repository or data center for your long-term data storage. Some repositories allow you to store versions of your data and provide for access among collaborators during the project, in which case it is good to establish the relationship with the data manager before data are collected. Don’t know where to put your data? Check out DataCite’s list of repositories, check with your institution’s librarians (many institutions have repositories for their researchers, for example MIT’s DSpace), or ask senior colleagues in your field.
2. Establish standards, e.g. a “data dictionary” that sets the parameters and terms used for your data. It is wise to start by determining the metadata standards used by the data center where you plan to archive your data (see #1).
3. Assign roles and responsibilities to specific individuals for each component of the data life cycle: documentation of data collection, quality control measures, backup and storage, and long-term archiving.
4. Create a budget to cover costs associated with good data documentation, including hardware, software and personnel. This eliminates many of the most common excuses for data hangover.
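
To make the “data dictionary” idea above a bit more concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the column names, units, and ranges are invented for illustration, and a real dictionary should follow the metadata standards of the repository you choose. The point is only that once terms and parameters are written down, checking new records against them becomes routine.

```python
# A hypothetical data dictionary. Every column name, unit, and range here is
# illustrative only; substitute the terms your own project agrees on.
DATA_DICTIONARY = {
    "site_id": {"description": "Sampling site code", "type": str},
    "sample_date": {"description": "Collection date, YYYY-MM-DD", "type": str},
    "water_temp": {
        "description": "Water temperature",
        "type": float,
        "units": "degrees C",
        "min": -5.0,
        "max": 45.0,
    },
}


def check_record(record, dictionary=DATA_DICTIONARY):
    """Return a list of problems found in one record (empty if none)."""
    problems = []

    # Flag columns that nobody documented in the dictionary.
    for name in record:
        if name not in dictionary:
            problems.append(f"undocumented column: {name}")

    # Check each documented column for presence, type, and range.
    for name, spec in dictionary.items():
        if name not in record:
            problems.append(f"missing column: {name}")
            continue
        value = record[name]
        if not isinstance(value, spec["type"]):
            problems.append(f"{name}: expected {spec['type'].__name__}")
            continue
        if "min" in spec and value < spec["min"]:
            problems.append(f"{name}: {value} below minimum {spec['min']}")
        if "max" in spec and value > spec["max"]:
            problems.append(f"{name}: {value} above maximum {spec['max']}")

    return problems


if __name__ == "__main__":
    record = {"site_id": "A1", "sample_date": "2012-01-15", "water_temp": 99.0}
    for problem in check_record(record):
        print(problem)
```

A check like this is deliberately strict (a temperature entered as a whole number rather than a decimal would be flagged as the wrong type), which is one of the decisions a project would need to make for itself when setting its standards.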

After a recent presentation at UC Davis, someone in the audience asked, “What do you do if the project is already underway? How do you try to fix the problems?” That is a bit trickier to answer, and good answers will vary widely. Solutions depend on things like the types of data that exist, how they are currently stored, and the degree to which they have been documented already. I began to draft potential steps for solving this conundrum, but it became a rather lengthy blog post, so I will save that answer for next time. Stay tuned.

The DCXL add-in that this project will produce will reduce your chances of data hangover, although exactly which features will do so has not yet been determined. Most likely, the add-in will assist with #2 above: establishing standards. It may also assist with #1: choosing a repository, with the potential for linking to a particular repository via the add-in. If you have other ideas about how the add-in might be the cure for data hangover, please let me know.
