Two weeks ago, I was back at NCEAS in Santa Barbara to help with a workshop on data management. DataONE hosted the event, which gave us a venue to test drive and improve some data management lessons we created. One of the slide sets I created was about workflows, and it occurred to me that I haven’t blogged about workflows properly! So here it is, workflows in a DCXL two-part nutshell.
First, let’s define a workflow… Actually, we are first going to define process metadata. Process metadata is information about the process used to get to your final figures, tables, and other representations of your results. Think of it as the things you write down in your lab notebook (you do keep a notebook, right?). If you have good records of your process metadata, it’s much easier for others to reproduce your results. It’s also easier for you to trace your steps back in case of mistakes or errors. A related concept is data provenance, which is essentially the origins of your data.
Now that we’ve established process metadata and provenance, we can move on to the workflow. A workflow is a formalization of process metadata. Think of it as a precise description of the scientific procedures you follow. There are two main types of workflows: informal and formal. The next post will cover formal workflows; here we will talk about the informal kind, which is easy to create.
The easiest informal workflow is a flow chart or a diagram expressing what goes into an analysis, and what comes out the other side. Here’s an example:
The yellow boxes in the middle are the “transformation rules”, or more simply put, the things you are doing to your data. The blue boxes are the actual data, in different forms, moving in and out of the analysis boxes. The informal workflow represented above is probably about as complicated as a lab exercise from first-year biology, but you get the idea. A more complex example is represented here (hat tip to Stacy Rebich-Hespanha):
Flow charts are a great way to start thinking about workflows, especially because you probably already have one sketched out in your head or lab notebook.
Another example of an informal workflow is a commented script. This really only applies if you work in a scripting language such as R, MATLAB, or SAS. If you do, here are a few best practices (hat tip to Jim Regetz):
- Add high-level information at the top of your script. Consider adding a project description, script dependencies, parameter descriptions, etc.
- Identify and organize sections in your script. Describe what happens in each section, why, and any dependencies.
- Construct an end-to-end script if possible. This means it runs without intervention on your part, from start to finish, and generates your outputs from your raw data. These are easier to archive and easier to understand.
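To make those three practices concrete, here is a minimal sketch of what a commented, end-to-end script might look like. It is written in Python only for illustration; the same layout works in R, MATLAB, or SAS. Everything in it is hypothetical: the project, the site and temperature data, the valid-range parameters, and the function names are all made up for this example, and the "raw data" is an inline string so the script runs without any external file.

```python
"""
Project:      Hypothetical site temperature summary (illustrative example only)
Purpose:      Read raw readings, drop implausible values, report mean per site.
Dependencies: Python 3 standard library (csv, io, statistics)
Parameters:   MIN_VALID_C, MAX_VALID_C: plausible range for a reading, in Celsius
"""
import csv
import io
import statistics

MIN_VALID_C = -50.0
MAX_VALID_C = 60.0

# --- Section 1: raw data -----------------------------------------------------
# Made-up readings, inlined so the example is self-contained. In a real script
# this section would read the untouched raw file from disk.
RAW_CSV = """\
site,temp_c
A,12.5
A,13.5
B,999
B,8.0
B,10.0
"""


def load_raw(text):
    """Parse the raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


# --- Section 2: cleaning -----------------------------------------------------
# Why: sensor glitches (such as the 999 above) would distort the means.
# Depends on: rows produced by Section 1.
def clean(rows):
    """Keep only readings inside the plausible temperature range."""
    cleaned = []
    for row in rows:
        temp = float(row["temp_c"])
        if MIN_VALID_C <= temp <= MAX_VALID_C:
            cleaned.append((row["site"], temp))
    return cleaned


# --- Section 3: analysis -----------------------------------------------------
# Depends on: cleaned (site, temperature) pairs from Section 2.
def summarize(records):
    """Return the mean temperature for each site, ordered by site name."""
    by_site = {}
    for site, temp in records:
        by_site.setdefault(site, []).append(temp)
    return {site: statistics.mean(temps) for site, temps in sorted(by_site.items())}


# --- Section 4: output -------------------------------------------------------
def write_table(summary):
    """Format the summary as CSV text, one row per site."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["site", "mean_temp_c"])
    for site, mean in summary.items():
        writer.writerow([site, f"{mean:.1f}"])
    return out.getvalue()


def main():
    """Run every step in order, from raw data to final table."""
    return write_table(summarize(clean(load_raw(RAW_CSV))))


if __name__ == "__main__":
    print(main())
```

Running the script top to bottom regenerates the output table from the raw readings with no manual steps in between, which is the point of the end-to-end advice. The header and section comments are what turn it from "some code" into a record of the process.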
Why all the formalization, you ask? Because the more computationally complex and digitized our analyses become, the harder it will be to retrace our steps. Reproducibility is one of the tenets of science, and workflows should be common practice among all modern-day scientists. Plus (although I have no hard evidence to this effect) workflows are likely to be the next scientific output required by funders, journals, and institutions. Start creating workflows now, while you are still young and creative.