Support your Data
Building an RDM Maturity Model: Part 4
By John Borghi
Researchers are faced with rapidly evolving expectations about how they should manage and share their data, code, and other research products. These expectations come from a variety of sources, including funding agencies and academic publishers. As part of our effort to help researchers meet these expectations, the UC3 team spent much of last year investigating current practices. We studied how neuroimaging researchers handle their data, examined how researchers use, share, and value software, and conducted interviews and focus groups with researchers across the UC system. All of this has reaffirmed our perception that researchers and other data stakeholders often think and talk about data in very different ways.
Such differences are central to another project, which we’ve referred to alternately as an RDM maturity model and an RDM guide for researchers. Since its inception, the goal of this project has been to give researchers tools to self-assess their data-related practices and access the skills and experience of data service providers within their institutional libraries. Drawing upon tools with convergent aims, including maturity-based frameworks and visualizations like the research data lifecycle, we’ve worked to ensure that our tools are user friendly, free of jargon, and adaptable enough to meet the needs of a range of stakeholders, including different research, service provider, and institutional communities. To this end, we’ve renamed this project yet again to “Support your Data”.

What’s in a name?
Because our tools are intended to be accessible to people with a broad range of perceptions, practices, and priorities, coming up with a name that encompasses complex concepts like “openness” and “reproducibility” proved to be quite difficult. We also wanted to capture the spirit of terms like “capability maturity” and “research data management (RDM)” without referencing them directly. After spending a lot of time trying to come up with something clever, we decided that the name of our tools should describe their function. Since the goal is to support researchers as they manage and share data (in ways potentially influenced by expectations related to openness and reproducibility), why not just use that?
Recent Developments
In addition to thinking through the name, we’ve also refined the content of our tools. The central element, a rubric that allows researchers to quickly benchmark their data-related practices, is shown below. As before, it highlights how the management of research data is an active and iterative process that occurs throughout the different phases of a project. Activities in different phases are represented in different rows. Proceeding left to right, a series of declarative statements describes specific activities within each phase in order of how well they are designed to foster access to and use of data in the future.

The four levels, “ad hoc”, “one-time”, “active and informative”, and “optimized for re-use”, are intended to be descriptive rather than prescriptive.
- Ad hoc — Refers to circumstances in which practices are neither standardized nor documented. Every time a researcher has to manage their data, they have to design new practices and procedures from scratch.
- One time — Refers to circumstances in which data management occurs only when it is necessary, such as in direct response to a mandate from a funder or publisher. Practices or procedures implemented at one phase of a project are not designed with later phases in mind.
- Active and informative — Refers to circumstances in which data management is a regular part of the research process. Practices and procedures are standardized, well documented, and well integrated with those implemented at other phases.
- Optimized for re-use — Refers to circumstances in which data management activities are designed to facilitate the re-use of data in the future.
Each row of the rubric is tied to a one-page guide that provides specific information about how to advance practices as desired or required. Development of the content of the guides has proceeded sequentially. During the autumn and winter of 2017, members of the UC3 team met to discuss issues relevant to each phase, reduce the use of jargon, and identify how content could be localized to meet the needs of different research and institutional communities. We are currently working on revising the content based on suggestions made during these meetings.
Next Steps
Now that we have scoped out the content, we’ve begun to focus on the design aspect of our tools. Working with CDL’s UX team, we’ve begun to think through the presentation of both the rubric and the guides in physical media and online.
As always, we welcome any and all feedback about content and application of our tools.
Workflows Part II: Formal
In my last blog post, I provided an overview of scientific workflows in general. I also covered the basics of informal workflows, i.e. flow charts and commented scripts. Well, put away the tuxedo t-shirt and pull out your cummerbund and bow tie, folks, because we are moving on to formal workflow systems.

A formal workflow (let’s call them FW) is essentially an “analytical pipeline” that takes data in one end and spits out results on the other. The major difference between FW and commented scripts (one example of informal workflows) is that FW can be implemented in different software systems. A commented R script for estimating parameters works for R, but what about those simulations you need to run in MATLAB afterward? Saving the outputs from one program, importing them into another, and continuing analysis there is a very common practice in modern science.
So how do you link together multiple software systems automatically? You have two options: become one of those geniuses that use the command line for all of your analyses, or use a FW software system developed by one of those geniuses. The former requires a level of expertise that many (most?) Earth, environmental, and ecological scientists do not possess, myself included. It involves writing code that will access different software programs on your machine, load data into them, perform analyses, save results, and use those results as input for a completely different set of analyses, often using a different software program. FW are often called “executable workflows” because they are a way for you to push only one button (e.g., enter) and obtain your results.
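To make the “one button” idea concrete, here is a toy sketch in Python of an executable workflow: each step writes its output to a file that the next step reads, and a single entry point runs the whole pipeline. The step names, file formats, and the use of Python throughout are illustrative assumptions; in practice the steps might be an R cleaning script and a MATLAB estimation, as described above.

```python
import json
import os
import tempfile

# Toy "executable workflow": each analytical step reads the previous
# step's output file and writes its own, so the whole pipeline runs
# from one entry point. Step names and formats are hypothetical.

def clean_data(raw_rows, out_path):
    # Step 1 (stand-in for, e.g., an R cleaning script):
    # drop rows with missing values, save the cleaned data.
    cleaned = [r for r in raw_rows if all(v is not None for v in r.values())]
    with open(out_path, "w") as f:
        json.dump(cleaned, f)
    return out_path

def estimate_parameters(in_path, out_path):
    # Step 2 (stand-in for, e.g., a MATLAB estimation):
    # load the cleaned data and compute a simple summary statistic.
    with open(in_path) as f:
        rows = json.load(f)
    mean = sum(r["value"] for r in rows) / len(rows)
    with open(out_path, "w") as f:
        json.dump({"mean": mean}, f)
    return out_path

def run_workflow(raw_rows, workdir):
    # The "one button": run every step in order and return the result.
    cleaned = clean_data(raw_rows, os.path.join(workdir, "cleaned.json"))
    result = estimate_parameters(cleaned, os.path.join(workdir, "estimates.json"))
    with open(result) as f:
        return json.load(f)

raw = [{"value": 1.0}, {"value": None}, {"value": 3.0}]
with tempfile.TemporaryDirectory() as d:
    print(run_workflow(raw, d))  # {'mean': 2.0}
```

Because every intermediate result lands in a file, each step can be swapped out for a call to a different program without changing the rest of the pipeline, which is the essential trick behind formal workflow systems.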
What about FW software systems? These are a bit more accessible for the average scientist. FW software has been around for about 10 years, with the first user-friendly(ish) breakthrough being the Kepler Workflow System. Kepler was developed with researchers in mind, and allows the user to drag and drop chunks of analytical tasks into a window. The user can indicate which data files should be used as inputs and where the outputs should be sent, connecting the analytical tasks with arrows. Kepler is still in a beta version, and most researchers will find the work required to set up a workflow prohibitive.
Groups that have managed to incorporate workflows into their community of sharing are genomicists; this is because they tend to have predictable data as inputs, with a comparatively small set of analyses performed on those data. Interestingly, a social networking site called myExperiment has grown up around genomicists’ use of workflows, where researchers can share workflows, download others’ workflows, and comment on those that they have tried.
The benefits of FW are that each step in the analytical pipeline, including any parameters or requirements, is formally recorded. This means that researchers can reuse both individual steps (e.g., the data cleaning step in R or the maximum likelihood estimation in MATLAB) as well as the overall workflow. Analyses can be re-run much more quickly, and repetitive tasks can be automated to reduce chances for manual error. Because the workflow can be saved and re-used, it is a great way to ensure reproducibility and transparency in the scientific process.
Although Kepler is not in wide use, it is a great example of something that will likely become commonplace in the researcher’s toolbox over the next decade. Other FW software includes Taverna, VisTrails, and Pegasus – all with varying levels of user-friendliness and varied communities of use. As the complexity of analyses and the variety of software systems used by scientists continues to increase, FW are going to become a more common part of the research process. Perhaps more importantly, it is likely that funders will start requiring the archiving of FW alongside data to ensure accountability, reproducibility, and to promote reuse.
A few resources for more info:
- My CiteULike list of workflow-related articles and publications
- Great book edited by Michener and Brunt, Ecological Data: Design, Management and Processing. (Blackwell, New York, 2000)