Workflows Part I: Informal
Two weeks ago, I was back at NCEAS in Santa Barbara to help with a workshop on data management. DataONE hosted the event, which provided us with a venue to test-drive and improve on some data management lessons we created. One of the slide sets I created was about workflows, and it occurred to me – I haven’t blogged about workflows properly! So here it is, workflows in a DCXL two-part nutshell.
First, let’s define a workflow… Actually, we are first going to define process metadata. Process metadata is information about the process used to get to your final figures, tables, and other representations of your results. Think of it as the things you write down in your lab notebook (you do keep a notebook, right?). If you have good records of your process metadata, it’s much easier for others to reproduce your results. It’s also easier for you to trace your steps back in case of mistakes or errors. A related concept is data provenance, which is essentially the origins of your data.

Now that we’ve established process metadata and provenance, we can move on to the workflow. A workflow is a formalization of process metadata. Think of it as a precise description of the scientific procedures you follow. There are two main types of workflows: informal and formal. The next post will cover formal workflows; here we will talk about the informal kind, which is easy to create.
The easiest informal workflow is a flow chart or a diagram expressing what goes into an analysis, and what comes out the other side. Here’s an example:
The yellow boxes in the middle are the “transformation rules”, or more simply put, the things you are doing to your data. The blue boxes are the actual data, in different forms, moving in and out of the analysis boxes. The informal workflow represented above is probably about as complicated as a lab exercise from first year biology, but you get the idea. A more complex example is represented here (hat tip to Stacy Rebich-Hespanha):
Flow charts are a great way to start thinking about workflows, especially because you probably already have one sketched out in your head or lab notebook.
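If you prefer typing to drawing, the same information can be jotted down in a few lines of code. Here is a toy Python sketch of an informal workflow, just to show the shape of the thing; the steps and file names are invented for illustration:

```python
# A toy stand-in for a hand-drawn flow chart. Each step records the data
# going in, the "transformation rule" applied to it, and the data coming out.
# The file names and steps below are invented purely for illustration.
workflow = [
    ("field_counts_raw.csv", "remove rows with missing values", "field_counts_clean.csv"),
    ("field_counts_clean.csv", "compute mean count per site", "site_means.csv"),
    ("site_means.csv", "plot means by site", "figure1_site_means.png"),
]

for data_in, rule, data_out in workflow:
    print(f"{data_in}  --[{rule}]-->  {data_out}")
```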
Another example of an informal workflow is a commented script. This really only applies if you work with scripting languages like R, MATLAB, or SAS. If you do, here are a few best practices (hat tip to Jim Regetz), along with a small example script after the list:
- Add high-level information at the top of your script. Consider adding a project description, script dependencies, parameter descriptions, and so on.
- Organize your script into sections. Describe what happens in each section, why, and any dependencies.
- Construct an end-to-end script if possible. This means it runs without intervention on your part, from start to finish, and generates your outputs from your raw data. These are easier to archive and easier to understand.
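To make those practices concrete, here is a minimal sketch of what an end-to-end, commented script might look like. It happens to be in Python rather than R, MATLAB, or SAS, and the project, file paths, and column names are all hypothetical:

```python
"""Compute site-level means from raw field counts and plot them.

Project: unicorn abundance survey (hypothetical example)
Dependencies: Python 3, pandas, matplotlib
Input:   data/raw/field_counts_raw.csv  (one row per observation)
Outputs: output/site_means.csv, output/figure1_site_means.png
"""

import pandas as pd
import matplotlib.pyplot as plt

# --- Section 1: load the raw data ----------------------------------------
raw = pd.read_csv("data/raw/field_counts_raw.csv")

# --- Section 2: clean -----------------------------------------------------
# Drop observations with missing counts (and say why in your metadata).
clean = raw.dropna(subset=["count"])

# --- Section 3: analyze ---------------------------------------------------
site_means = clean.groupby("site", as_index=False)["count"].mean()
site_means.to_csv("output/site_means.csv", index=False)

# --- Section 4: plot ------------------------------------------------------
ax = site_means.plot(kind="bar", x="site", y="count", legend=False)
ax.set_ylabel("Mean count per site")
plt.savefig("output/figure1_site_means.png", dpi=150)
```

Because the script runs start to finish with no manual steps, it doubles as a record of your process metadata: anyone with the raw data can regenerate the outputs.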
Why all the formalization, you ask? Because the more computationally complex and digitized our analyses become, the harder it will be to retrace our steps. Reproducibility is one of the tenets of science, and workflows should be common practice among all modern day scientists. Plus (although I have no hard evidence to this effect) workflows are likely to be the next scientific output required by funders, journals, and institutions. Start creating workflows now, while you are still young and creative.
Finding a Home For Your Data
Where do I put my data? This question comes up often when I talk with researchers about long term data archiving. There are many barriers to sharing and archiving data, and a lack of knowledge about where to archive the data is certainly among them.
First and foremost, choose early. By choosing a data repository early in the life of your research, you guarantee that there will be no surprises when it comes time to deposit your dataset. If there are strict metadata requirements for the repository you choose, it is immensely beneficial to know those requirements before you collect your first data point. You can design your data collection materials (e.g., spreadsheets, data entry forms, or scripts for automating computational tasks) to ensure seamless metadata creation.
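For example, if you already know which metadata fields your chosen repository will ask for, you can bake them into your data collection materials from the start. Here is a small Python sketch of that idea; the field names are invented, so substitute whatever your repository actually requires:

```python
import csv

# Hypothetical metadata fields a repository might require for each dataset;
# replace these with your repository's actual requirements.
REQUIRED_METADATA = ["title", "creator", "contact_email", "keywords",
                     "collection_start_date", "geographic_coverage"]

# Columns we plan to record for every observation.
DATA_COLUMNS = ["site", "date", "observer", "count", "notes"]

# Write an empty data-entry template so every spreadsheet starts with the
# same structure...
with open("data_entry_template.csv", "w", newline="") as f:
    csv.writer(f).writerow(DATA_COLUMNS)

# ...and a sidecar file that prompts for the required metadata up front.
with open("metadata_template.txt", "w") as f:
    for field in REQUIRED_METADATA:
        f.write(f"{field}: \n")
```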
Second, choose carefully. Data repositories (also known as data centers or archives) are plentiful. Choosing the correct repository for your specific dataset is important: you want the right people to be able to find your data in the future. If you research how climate change affects unicorns, you don’t want to store your dataset at the Polar Data Centre (unless you study arctic unicorns). Researchers hunting for unicorn data are more likely to check the Fantasy Creatures Archive, for instance.
So how do you choose your perfect data repository? There is no match.com or eHarmony for matching data with its repository, but a close second is Databib. This is a great resource born from a partnership between Purdue and Penn State Libraries, funded by the Institute of Museum and Library Services. The site allows you to search for keywords in their giant list of data repositories. For instance, “unicorns” surprisingly brings up no results, but “marine” brings up seven repositories that house marine data. Don’t see your favorite data repository? The site welcomes feedback (including suggestions for additional repositories not in their database).
Another good way to discover your perfect repository: think about where you go to find datasets for your research. Try asking around too – where do your colleagues look for data? Where do they archive it? Look for mentions of data centers in publications that you read, and perhaps do a little web hunting.
A final consideration: some institutions provide repositories for their researchers. Often you can store your data in these repositories while you are still working on it (password protected, of course), using the data center as a kind of backup system. These institutional repositories have benefits like more personal IT service and cheaper (or free) storage rates, but they might make your data more difficult to find for those in your research field. Examples of institutional repositories are DSpace@MIT, DataSpace at Princeton, and the Merritt Repository right here at CDL.
The DCXL add-in will initially connect with CDL’s Merritt Repository. Datasets submitted to Merritt via DCXL will be integrated into the DataONE network, ensuring that they are accessible and discoverable. Long live Excel data!
Not familiar with the song Homeward Bound? Check out this amazing performance by Simon and Garfunkel.
Putting the Meta in Metadata

The DCXL project is in full swing now: developers are working closely with Microsoft Research to create the add-in that will revolutionize scientific data curation (in my humble opinion!). Part of this process was deciding how to handle metadata. For a refresher on metadata, i.e., data documentation, read this post about the metadata in DCXL.
Creating metadata was one of the major requirements for the project, and arguably the most challenging task. The challenges stem from the fact that there are many metadata standards out there, and of course, none are perfect for our particular task. So how do we incorporate good work done by others much smarter than me into DCXL, without compromising our need for user-friendly, simple data documentation?
It was tricky, but we came up with a solution that will work for many, if not most, potential DCXL users. A few considerations shaped the metadata route we chose:
- DataONE: We are very interested in making sure that data posted to a repository via the DCXL add-in can be found using the DataONE Mercury metadata search system (called ONE-Mercury, to be released in May). That means we need to make sure we are using metadata that the DataONE infrastructure likes. At this point in DataONE’s development, that limits us to the International Organization for Standardization Geospatial Metadata Standard (ISO19115), the Federal Geographic Data Committee Geospatial Metadata Standard (FGDC), and Ecological Metadata Language (EML).
- We want metadata created by the DCXL software to be as flexible as possible for as many different types of data as possible. ISO19115 and FGDC are both geared towards spatial data specifically (e.g., GIS). EML is a bit more general and flexible, so we chose to go with it.
- EML is a very well documented metadata schema; rather than include every element of EML in DCXL, we cherry-picked the elements we thought would generate metadata that makes the data more discoverable and usable. Of course, just like never being too skinny or too rich, you can NEVER have too much metadata. But we chose to draw the line somewhere between “not useful at all” and “overwhelming”.
- We made sure that the metadata elements we included can be mapped to DataCite and Dublin Core minimal metadata, so that a data citation can be generated from the metadata collected for the dataset. A rough sketch of the kind of minimal record we have in mind appears below.
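To give a feel for what that looks like in practice, here is a rough Python sketch that assembles a minimal, EML-flavored record using a handful of dataset elements (title, creator, abstract, keywords, contact). The element names follow EML 2.1, but the particular subset DCXL collects is not spelled out here, and all of the values are placeholders:

```python
import xml.etree.ElementTree as ET

# Build a bare-bones EML dataset record. Element names follow EML 2.1;
# the packageId, names, and text below are placeholders, and this is only
# a sketch of the kind of minimal record discussed above.
eml = ET.Element("eml:eml", {
    "xmlns:eml": "eml://ecoinformatics.org/eml-2.1.1",
    "packageId": "example.1.1",
    "system": "example",
})
dataset = ET.SubElement(eml, "dataset")
ET.SubElement(dataset, "title").text = "Unicorn sightings, 2010-2012 (example)"

creator = ET.SubElement(dataset, "creator")
creator_name = ET.SubElement(creator, "individualName")
ET.SubElement(creator_name, "surName").text = "Doe"

abstract = ET.SubElement(dataset, "abstract")
ET.SubElement(abstract, "para").text = "A short description of the dataset."

keyword_set = ET.SubElement(dataset, "keywordSet")
ET.SubElement(keyword_set, "keyword").text = "unicorns"

contact = ET.SubElement(dataset, "contact")
contact_name = ET.SubElement(contact, "individualName")
ET.SubElement(contact_name, "surName").text = "Doe"

print(ET.tostring(eml, encoding="unicode"))
```

A record this small still carries enough to generate a citation, because the title, creator, and keyword elements map cleanly onto DataCite and Dublin Core fields.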