
Workflows Part I: Informal

Two weeks ago, I was back at NCEAS in Santa Barbara to help with a workshop on data management. DataONE hosted the event, which provided us with a venue to test drive and improve upon some data management lessons we created. One of the slide sets I created was about workflows, and it occurred to me – I haven’t blogged about workflows properly! So here it is, workflows in a DCXL two-part nutshell.

First, let’s define a workflow… Actually, we are first going to define process metadata. Process metadata is information about the process used to get to your final figures, tables, and other representations of your results. Think of it as the things you write down in your lab notebook (you do keep a notebook, right?). If you have good records of your process metadata, it’s much easier for others to reproduce your results. It’s also easier for you to trace your steps back in case of mistakes or errors. A related concept is data provenance, which is essentially the origins of your data.

Like the tuxedo t-shirt, an informal workflow can work great in some situations. The next blog post will cover formalwear. Image from metashirt.cricketsoda.com

Now that we’ve established process metadata and provenance, we can move on to the workflow. A workflow is a formalization of process metadata. Think of it as a precise description of the scientific procedures you follow. There are two main types of workflows: informal and formal. The next post will cover formal workflows; here we will talk about the informal kind, which is easy to create.

The easiest informal workflow is a flow chart or a diagram expressing what goes into an analysis, and what comes out the other side. Here’s an example:

The yellow boxes in the middle are the “transformation rules”, or more simply put, the things you are doing to your data. The blue boxes are the actual data, in different forms, moving in and out of the analysis boxes. The informal workflow represented above is probably about as complicated as a lab exercise from first year biology, but you get the idea. A more complex example is represented here (hat tip to Stacy Rebich-Hespanha):

Flow charts are a great way to start thinking about workflows, especially because you probably already have one sketched out in your head or lab notebook.

Another example of an informal workflow is a commented script. This really only applies if you work in a scripting language such as R, MATLAB, or SAS. If you do, here are a few best practices (hat tip to Jim Regetz):

Add high-level information at the top of your script. Consider adding a project description, script dependencies, parameter descriptions, etcetera.
Identify and organize sections in your script. Describe what happens in each section, why, and any dependencies.
Construct an end-to-end script if possible. This means it runs without intervention on your part, from start to finish, and generates your outputs from your raw data. These are easier to archive and easier to understand.
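
To make those three practices concrete, here is a minimal sketch of what a commented, end-to-end script might look like. It is written in Python purely for illustration; the same layout works in R, MATLAB, or SAS. Everything specific in it is made up for the example: the leaf-length project, the column names `site` and `length_mm`, and the function names are all hypothetical, not taken from a real analysis.

```python
"""
Project:  Hypothetical leaf-length analysis (illustrative only)
Purpose:  Read raw measurements, drop unusable rows, write a summary table.
Depends:  Python 3 standard library only (csv, statistics, collections, sys)
Inputs:   raw_csv - path to a CSV with columns "site" and "length_mm" (assumed names)
Outputs:  out_csv - path of the summary table (mean length per site)
"""
import csv
import statistics
import sys
from collections import defaultdict


# --- Section 1: load the raw data ------------------------------------------
# Why: the raw file is only ever read, never modified.
def load_raw(raw_csv):
    with open(raw_csv, newline="") as handle:
        return list(csv.DictReader(handle))


# --- Section 2: clean --------------------------------------------------------
# Why: blank or non-numeric lengths cannot be averaged, so they are dropped.
# Depends on: the rows returned by Section 1.
def clean(rows):
    cleaned = []
    for row in rows:
        try:
            cleaned.append((row["site"], float(row["length_mm"])))
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with missing or malformed values
    return cleaned


# --- Section 3: summarize ----------------------------------------------------
# Why: the final table reports one mean length per site.
# Depends on: the (site, length) pairs returned by Section 2.
def summarize(cleaned):
    by_site = defaultdict(list)
    for site, length in cleaned:
        by_site[site].append(length)
    return {site: statistics.mean(values) for site, values in sorted(by_site.items())}


# --- Section 4: write outputs ------------------------------------------------
def write_summary(summary, out_csv):
    with open(out_csv, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["site", "mean_length_mm"])
        for site, mean in summary.items():
            writer.writerow([site, f"{mean:.2f}"])


# --- End-to-end run: raw data in, outputs out, no manual steps ---------------
def run_workflow(raw_csv, out_csv):
    summary = summarize(clean(load_raw(raw_csv)))
    write_summary(summary, out_csv)
    return summary


if __name__ == "__main__" and len(sys.argv) == 3:
    run_workflow(sys.argv[1], sys.argv[2])
```

Saved under a name of your choosing, the script takes the raw file and the output file as its two command-line arguments and regenerates the summary in one go. The header block and the section comments are the "process metadata"; the code underneath is just the part the computer needs.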

Why all the formalization, you ask? Because the more computationally complex and digitized our analyses become, the harder it will be to retrace our steps. Reproducibility is one of the tenets of science, and workflows should be common practice among all modern-day scientists. Plus (although I have no hard evidence to this effect) workflows are likely to be the next scientific output required by funders, journals, and institutions. Start creating workflows now, while you are still young and creative.

The Digital Dark Age, Part 1

This will be known as the Digital Dark Age. The first time I heard this statement was at the Internet Archive, during the PDA 2012 Meeting (read my blog post about it here). What did this mean? What is a Digital Dark Age? Read on.

While serving in Vietnam, my father wrote letters to my grandparents about his life fighting a war in a foreign country. One of his letters was sent to arrive in time for my grandfather’s birthday, and it contained a lovely poem that articulated my father’s warm feelings about his childhood, his parents, and his upbringing. My grandparents kept the poem framed in a prominent spot in their home. When I visited them as a child, I would read the poem written in my young dad’s handwriting, stare at the yellowed paper, and think about how far that poem had to travel to relay its greetings to my grandparents. It was special– for its history, the people involved, and the fact that these people were intimately connected to me.

Now fast forward to 2012. Imagine modern-day soldiers all over the world, emailing, making satellite phone calls, and chatting with their families via video conferencing. When compared to snail mail, these modern communication methods are likely a much preferred way of staying in touch for those families. But how likely is it that future grandchildren will be able to listen to those conversations, read those emails, or watch those video calls? The answer: extremely unlikely.

These two scenarios sum up the concept of a Digital Dark Age: compared to 40 years ago, we are doing a terrible job of ensuring that future generations will be able to read our letters, look at our pictures, or use our scientific data.

You mean future generations won’t be able to listen to my mix tapes?! From Flickr by newrambler

The Digital Dark Age “refers to a possible future situation where it will be difficult or impossible to read historical digital documents and multimedia, because they have been stored in an obsolete and obscure digital format.” The phrase “Dark Age” is a reference to The Dark Ages, a period in history around the beginning of the Middle Ages characterized by a scarcity of historical and other written records at least for some areas of Europe, rendering it obscure to historians. Sounds scary, no?

How can we remedy this situation? What are people doing about it? Most importantly, what does this mean for scientific advancement? Check out my next post to find out.