Data Questions: Who Can Help?

Discussions about data seem to be everywhere.  For evidence of this, look at recent discussions of big data, calls for increasing cyber-infrastructure for data, data management requirements by funders, and data sharing requirements by journals.  Given all of this discussion, researchers are (or should be) considering how to handle their own data for both the long term and the short term.

Admit it: You can’t help yourself. You need the expertise of others! The Four Tops knew it. Image from www.last.fm (click for more). Check out this live performance of “I can’t help myself”: http://www.youtube.com/watch?v=qXavZYeXEc0

The popularity of discussions about data is good and bad for the average researcher.  

Let’s start with the bad first: it means researchers are now, more than ever, responsible for being good data stewards (before commenting that this “isn’t a bad thing!!”, read on).  Gone are the days when you could manage your data in-house, with no worries that others might notice your terrible file naming schemes or scoff at the color coding system in your spreadsheets.  With increasing requirements for managing and sharing data, researchers should construct their datasets and perform their analyses knowing that they will eventually have to share those files.  This means that researchers need to learn a bit about best practices for data management and invest some time in creating data management plans that go beyond funder requirements alone (which are NOT adequate for properly managing your data – see next week’s blog post for more).

Arguably, the “bad” I mention above is not actually bad at all.  Speaking from the point of view of a researcher, however, anything that places more demands on your time can be taxing.  Moving on to the good: all of this attention being given to data stewardship means that there are lots of places to go for help and guidance.  You aren’t in this alone, researchers.  In previous posts I’ve written about the stubbornness of scientists and our inherent inability to believe that someone might be able to help us.  In the case of data management and related topics, it will pay off in the long run to put aside your ego and ask for help.  Who can you ask? Here are a few ideas:

  1. Librarians.  I’ve blogged about how great and under-used academic libraries and librarians tend to be, but it is worth mentioning again.  Librarians are very knowledgeable about information.  Yes, your information is special. No, no one can possibly understand how great/complex/important/nuanced your data set is.  But I promise you will learn something if you go hang out with a librarian.  Since my entry into the libraries community, I have found that librarians are great listeners.  They will actively listen while you babble on endlessly about your awesome data and project, and then provide you with insight that only someone from the outside can offer.  Bonus: many librarians are active in the digital data landscape, and are therefore likely to be able to guide you toward helpful resources for scientific data management.
  2. Data Centers/repositories.  If you have never submitted data to a data center for archiving, you will soon.  Calls for sharing data publicly will only get louder in the next few years, from funders, journals, and institutions interested in maximizing their investment and increasing credibility.  Although you might just now be hearing of data centers, they have been around for a long time and have been thinking about how to organize and manage data.  How do you pick a data center? A wonderful searchable database of repositories is available at www.databib.org. Once you zero in on a data center that’s appropriate for your particular data set, contact them.  They will have advice on all kinds of useful stuff, including metadata, file formats, and getting persistent identifiers for your data.
  3. Publishers and Funders.  Although they wouldn’t be my first resource for topics related to data, many publishers and funders are increasingly providing guidance, help text, and links to resources that might help you in your quest for improved data stewardship.

My final takeaway is this: researchers, you aren’t in this alone. There is lots of support available for those humble enough to accept it.

NAME CHANGES AND OTHER NEWS

You might notice that I haven’t spoken much about DCXL lately.  Alternatively, you read the blog for the riveting articles on data topics, in which case you might not realize that the blog was originally intended for updates on the DCXL project.  Well, lucky you, because this blog post is chock-full of updates on the project, so after reading it you will all be up to speed.  In itemized form, here are the updates:

Prince: sometimes name changes are a good thing. From wallpaper-addict.blogspot.com (click for the website).
  1. The DCXL project’s goal is to develop an add-in for Excel and a web application (for the differences between the two, see this blog post).  The names of these tools will NOT be DCXL: they will be called DataUp.  That means all things DCXL-related will be transitioning to DataUp.
  2. The beta version of the DataUp add-in is due out by the end of June. The beta version of the web application will follow shortly after that.
  3. The first public release of DataUp tools (look for this in August) will connect to ONEShare, a special DataONE/UC3 data repository that will accept tabular data from DataUp users.  You don’t need to be a member of any particular repository, and the bar is low for requirements pertaining to your data and metadata (for better or worse).
  4. We are looking for beta testers! If you or your colleagues fancy getting first crack at DataUp, email me.  We are interested in getting feedback from a wide range of folks, so whatever your background, don’t hesitate to sign up.
  5. If you or someone you know considers themselves a graphic artist, feel free to design a logo for DataUp.  We are considering crowd-sourcing the job at a site like 99designs or CrowdSPRING, but it would be way more interesting to have a DataUp community member design it.  With credit, of course.
  6. We are hoping for an extension of funds that will allow us to incorporate suggested improvements into DataUp before we widely release it. Stay tuned for updates on that front.  With that extension we would be able to create extensive documentation for other data centers to add their repository to the list of data deposition options.

The new DataUp site will be coming soon, and will likely have a new URL. Don’t worry, the DCXL URL will continue to work for a while.

Workflows Part II: Formal

In my last blog post, I provided an overview of scientific workflows in general. I also covered the basics of informal workflows, i.e. flow charts and commented scripts.  Well, put away the tuxedo t-shirt and pull out your cummerbund and bow tie, folks, because we are moving on to formal workflow systems.

Nothing says formal like a sequined cummerbund and matching bow tie. From awkwardfamilyphotos.com (click the pic for more)

A formal workflow (let’s call them FW) is essentially an “analytical pipeline” that takes data in one end and spits out results at the other.  The major difference between FW and commented scripts (one example of informal workflows) is that FW can be implemented across different software systems.  A commented R script for estimating parameters works for R, but what about those simulations you need to run in MATLAB afterward?  Saving the outputs from one program, importing them into another, and continuing the analysis there is a very common practice in modern science.

So how do you link together multiple software systems automatically? You have two options: become one of those geniuses who use the command line for all of their analyses, or use a FW software system developed by one of those geniuses.  The former requires a level of expertise that many (most?) Earth, environmental, and ecological scientists do not possess, myself included.  It involves writing code that will access different software programs on your machine, load data into them, perform analyses, save results, and use those results as input for a completely different set of analyses, often using a different software program.  FW are often called “executable workflows” because they are a way for you to push only one button (e.g., enter) and obtain your results.
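To make the idea concrete, here is a minimal sketch of what a hand-rolled executable workflow might look like: a single driver script that cleans data in R, hands the result to a MATLAB simulation, and pulls the output back for plotting. This is an illustration only: the file names, the simulate.m script, and the column names are hypothetical, and it assumes MATLAB is callable from your command line.

```r
## driver.R -- a minimal, hypothetical sketch of an executable workflow

## Step 1: clean the raw data in R
raw <- read.csv("raw_data.csv")                 # hypothetical input file
clean <- na.omit(raw)                           # drop incomplete records
write.csv(clean, "clean_data.csv", row.names = FALSE)

## Step 2: run a MATLAB simulation non-interactively
## (assumes a simulate.m script that reads clean_data.csv and writes
##  sim_results.csv, and that MATLAB is on your PATH)
system('matlab -nodisplay -r "simulate; exit"')

## Step 3: bring the simulation output back into R and summarize it
results <- read.csv("sim_results.csv")
pdf("results_summary.pdf")
hist(results$estimate, main = "Simulation estimates")   # hypothetical column
dev.off()
```

Running the entire analysis then comes down to a single command (e.g., Rscript driver.R), which is the “push one button” idea in practice.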

What about FW software systems? These are a bit more accessible for the average scientist.  FW software has been around for about 10 years, with the first user-friendly(ish) breakthrough being the Kepler Workflow System.  Kepler was developed with researchers in mind, and allows the user to drag and drop chunks of analytical tasks into a window.  The user can indicate which data files should be used as inputs and where the outputs should be sent, connecting the analytical tasks with arrows.  Kepler is still in a beta version, and most researchers will find the work required to set up a workflow prohibitive.

Genomicists are one group that has managed to incorporate workflows into their culture of sharing; this is because they tend to have predictable data as inputs, with a comparatively small set of analyses performed on those data.  Interestingly, a social networking site called myExperiment has grown up around genomicists’ use of workflows, where researchers can share workflows, download others’ workflows, and comment on those that they have tried.

The benefit of FW is that each step in the analytical pipeline, including any parameters or requirements, is formally recorded.  This means that researchers can reuse both individual steps (e.g., the data cleaning step in R or the maximum likelihood estimation in MATLAB) and the overall workflow.  Analyses can be re-run much more quickly, and repetitive tasks can be automated to reduce the chances of manual error.  Because the workflow can be saved and re-used, it is a great way to ensure reproducibility and transparency in the scientific process.

Although Kepler is not in wide use, it is a great example of something that will likely become commonplace in the researcher’s toolbox over the next decade.  Other FW software includes Taverna, VisTrails, and Pegasus – all with varying levels of user-friendliness and varied communities of use.  As the complexity of analyses and the variety of software systems used by scientists continue to increase, FW are going to become a more common part of the research process.  Perhaps more importantly, it is likely that funders will start requiring the archiving of FW alongside data to ensure accountability and reproducibility, and to promote reuse.

A few resources for more info:

Workflows Part I: Informal

Two weeks ago, I was back at NCEAS in Santa Barbara to help with a workshop on data management.  DataONE hosted the event, which provided us with a venue to test drive and improve upon some data management lessons we created.  One of the slide sets I created was about workflows, and it occurred to me – I haven’t blogged about workflows properly! So here it is, workflows in a DCXL two-part nutshell.

First, let’s define a workflow…  Actually, we are first going to define process metadata.  Process metadata is information about the process used to get to your final figures, tables, and other representations of your results.  Think of it as the things you write down in your lab notebook (you do keep a notebook, right?). If you have good records of your process metadata, it’s much easier for others to reproduce your results.  It’s also easier for you to trace your steps back in case of mistakes or errors.  A related concept is data provenance, which is essentially the origins of your data.

Like the tuxedo t-shirt, an informal workflow can work great in some situations. The next blog post will cover formalwear. Image from metashirt.cricketsoda.com (click to order!)

Now that we’ve established process metadata and provenance, we can move on to the workflow.  A workflow is a formalization of process metadata.  Think of it as a precise description of the scientific procedures you follow.  There are two main types of workflows: informal and formal.  The next post will cover formal workflows; here we will talk about the informal kind, which is easy to create.

The easiest informal workflow is a flow chart or a diagram expressing what goes into an analysis, and what comes out the other side.  Here’s an example:

The yellow boxes in the middle are the “transformation rules”, or more simply put, the things you are doing to your data. The blue boxes are the actual data, in different forms, moving in and out of the analysis boxes. The informal workflow represented above is probably about as complicated as a lab exercise from first year biology, but you get the idea.  A more complex example is represented here (hat tip to Stacy Rebich-Hespanha):

Flow charts are a great way to start thinking about workflows, especially because you probably already have one sketched out in your head or lab notebook.

Another example of an informal workflow is a commented script.  This really only applies if you work with scripting languages like R, MATLAB, or SAS. If you do, here are a few best practices (hat tip to Jim Regetz), with a short example script after the list:

  1. Add high-level information at the top of your script.  Consider adding a project description, script dependencies, parameter descriptions, etc.
  2. Organize your script into sections. Describe what happens in each section, why, and any dependencies.
  3. Construct an end-to-end script if possible. This means it runs without intervention on your part, from start to finish, and generates your outputs from your raw data. These are easier to archive and easier to understand.
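By way of illustration, here is a minimal sketch of a commented, end-to-end R script that follows the practices above. Everything in it is hypothetical (the project, file names, and column names); the point is the structure: a header block at the top, labeled sections, and a run that goes from raw data to saved outputs without any manual intervention.

```r
## -------------------------------------------------------------------
## Project:  stream temperature analysis (hypothetical example)
## Purpose:  read raw logger data, clean it, and fit a simple model
## Inputs:   data/raw_temps.csv
## Outputs:  output/model_fit.rds, output/temp_plot.pdf
## Depends:  base R only
## -------------------------------------------------------------------

## Section 1: read and clean the raw data -----------------------------
raw <- read.csv("data/raw_temps.csv")
clean <- subset(raw, !is.na(temp_c) & temp_c > -5 & temp_c < 40)  # drop bad readings

## Section 2: fit the model -------------------------------------------
fit <- lm(temp_c ~ day_of_year, data = clean)

## Section 3: save outputs so the script runs end to end --------------
dir.create("output", showWarnings = FALSE)
saveRDS(fit, "output/model_fit.rds")
pdf("output/temp_plot.pdf")
plot(clean$day_of_year, clean$temp_c, xlab = "Day of year", ylab = "Temperature (C)")
abline(fit)
dev.off()
```

Because the script reads its raw inputs and writes its outputs to disk, re-running the whole analysis is just a matter of Rscript script.R (or source("script.R") from within R).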

Why all the formalization, you ask? Because the more computationally complex and digitized our analyses become, the harder it will be to retrace our steps.  Reproducibility is one of the tenets of science, and workflows should be common practice among all modern day scientists.  Plus (although I have no hard evidence to this effect) workflows are likely to be the next scientific output required by funders, journals, and institutions.  Start creating workflows now, while you are still young and creative.