Software for Reproducibility Part 2: The Tools

CDL UC3, June 13, 2013

Last week I wrote about the workshop I attended (Workshop on Software Infrastructure for Reproducibility in Science), held in Brooklyn at the new Center for Urban Science and Progress, NYU. This workshop was made possible by the Alfred P. Sloan Foundation and brought together heavy-hitters from the reproducibility world who work on software for workflows. I provided some broad-strokes overviews last week; this week, I’ve created a list of some of the tools we saw during the workshop. Note: the level of detail for tools is consistent with my level of fatigue during their presentation!

Sumatra

Presenter: Andrew Davison
Short description: Sumatra is a library of components and graphical user interfaces (GUIs) for viewing and using tools via Python. This is an “electronic lab notebook” kind of idea.
Sumatra needs to interact with version control systems, such as Subversion, Git, Mercurial, or Bazaar.
Future plans: Integrate with R, Fortran, C/C++, Ruby.

From the Sumatra website:

The solution we propose is to develop a core library, implemented as a Python package, sumatra, and then to develop a series of interfaces that build on top of this: a command-line interface, a web interface, a graphical interface. Each of these interfaces will enable: (1) launching simulations/analyses with automated recording of provenance information; and (2) managing a computational project: browsing, viewing, deleting simulations/analyses.
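To make "automated recording of provenance information" a little more concrete, here is a minimal sketch in plain Python. It is not Sumatra's API: the function name `run_and_record` and the fields in the record are my own assumptions about what such a record might hold (the command, its parameters, the environment, and a hash of the output).

```python
import hashlib
import json
import platform
import subprocess
import sys
import time


def run_and_record(command, parameters=None):
    """Run a command and return a provenance record for that run.

    Hypothetical helper for illustration only; not part of Sumatra.
    """
    started = time.time()
    result = subprocess.run(command, capture_output=True, text=True)
    finished = time.time()
    return {
        "command": list(command),
        "parameters": dict(parameters or {}),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "started": started,
        "duration_seconds": finished - started,
        "returncode": result.returncode,
        "stdout": result.stdout,
        # A hash of the output makes it cheap to check whether a rerun matches.
        "stdout_sha256": hashlib.sha256(result.stdout.encode("utf-8")).hexdigest(),
    }


# Launch a tiny "analysis" and print what was captured about it.
record = run_and_record([sys.executable, "-c", "print(6 * 7)"], {"note": "demo"})
print(json.dumps(record, indent=2))
```

A real tool would also record the code version from the version control system, which is why Sumatra needs to talk to Git, Mercurial and friends.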

Taverna

Presenter: Carole Goble
Short description: Taverna is an “execution environment”, i.e. a way to design and execute formal workflows.
Other details: Written in Java. Consists of the Taverna Engine (the workhorse), the Taverna Workbench (desktop client) and Taverna Server (remote workflow execution server) that sit on top of the Engine.
Links up with myExperiment, a publication environment for sharing workflows. It allows you to put together the workflow, description, files, data, documents etc. and upload/share with others.

From the Taverna website:

Taverna is an open source and domain-independent Workflow Management System – a suite of tools used to design and execute scientific workflows and aid in silico experimentation.

IPython Notebook

Presenter: Brian Granger
IPython notebook is a web-based computing environment and open document format for telling stories about code and data. It is a spawn of the IPython Project (focused on interactive computing).
Very code-centric. Text (e.g., Markdown), LaTeX, images, code, etc. can be interwoven. Linked up with GitHub.

Galaxy

Presenter: James Taylor
Galaxy is an open source, free web service integrating a wealth of tools, resources, etc. to simplify researcher workflows in genomics.
Biggest challenge is archiving. They currently have 1 PB of user data due to the abundance of sequences used.

From their website:

Galaxy is an open, web-based platform for data intensive biomedical research. Whether on the free public server or your own instance, you can perform, reproduce, and share complete analyses.

Madagascar

Presenter: Sergey Fomel
Geophysics-focused project management system.

From the Madagascar website:

Madagascar is an open-source software package for multidimensional data analysis and reproducible computational experiments. Its mission is to provide
- a convenient and powerful environment
- a convenient technology transfer tool
for researchers working with digital image and data processing in geophysics and related fields. Technology developed using the Madagascar project management system is transferred in the form of recorded processing histories, which become “computational recipes” to be verified, exchanged, and modified by users of the system.

VisTrails

Presenter: David Koop
VisTrails is an open-source scientific workflow and provenance management system that provides support for simulations, data exploration and visualization.
It was built with ideas of provenance and transparency in mind. Essentially the research “trail” is followed as users generate and test a hypothesis.
Focus on change-based provenance: it keeps track of all changes via a version tree.
There’s also execution provenance: start time and end time, where it was done, etc. This is instrument-specific metadata.
Crowdlabs.org: associated social website for sharing workflows and provenance
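
The change-based provenance idea is easier to see in code. Below is a rough sketch, assuming a workflow is just a dictionary of module names and parameters. The `VersionTree` class and the change names (`add_module`, `set_param`, `delete_module`) are hypothetical and are not taken from VisTrails itself. The point is that each version stores only the change from its parent, and any version can be rebuilt by replaying the changes from the root.

```python
def apply_change(workflow, change):
    """Apply one recorded change to a workflow (module name -> parameters)."""
    kind = change[0]
    if kind == "add_module":
        _, name, params = change
        workflow[name] = dict(params)
    elif kind == "set_param":
        _, name, key, value = change
        workflow[name][key] = value
    elif kind == "delete_module":
        _, name = change
        del workflow[name]
    else:
        raise ValueError(f"unknown change: {kind}")


class VersionTree:
    """Stores changes rather than snapshots; version 0 is the empty workflow."""

    def __init__(self):
        self.parent = {0: None}
        self.change = {0: None}

    def commit(self, parent_id, change):
        """Record a change made on top of parent_id and return the new version id."""
        if parent_id not in self.parent:
            raise KeyError(parent_id)
        new_id = len(self.parent)
        self.parent[new_id] = parent_id
        self.change[new_id] = change
        return new_id

    def materialize(self, version_id):
        """Rebuild a version by replaying every change on the path from the root."""
        changes = []
        node = version_id
        while node != 0:
            changes.append(self.change[node])
            node = self.parent[node]
        workflow = {}
        for change in reversed(changes):
            apply_change(workflow, change)
        return workflow


tree = VersionTree()
v1 = tree.commit(0, ("add_module", "load", {"file": "data.csv"}))
v2 = tree.commit(v1, ("add_module", "plot", {"kind": "line"}))
# Branch from v1 to try a different idea; v2 stays untouched.
v3 = tree.commit(v1, ("add_module", "plot", {"kind": "histogram"}))
v4 = tree.commit(v3, ("set_param", "plot", "bins", 20))
print(tree.materialize(v2))
print(tree.materialize(v4))
```

Because nothing is ever overwritten, abandoned branches of the exploration stay in the tree and can be revisited later.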

RCloud

No formal website; GitHub site only
Presenter: Carlos Scheidegger
RCloud was developed for programmers who use R at AT&T Labs. They were having interactive sessions in R for data exploration and exploratory data analysis. Their idea: what if every R session was transparent and automatically versioned?
I know this is a bit of a thin description… but this is an ongoing active project. I’m anxious to see where it goes next.

ReproZip

No formal website, but see a 15 minute demo here
Presenter: Fernando Chirigati
The premise: few computational experiments are reproducible. To get closer to reproducibility, we need a record of: data description, experiment specs, description of environment (in addition to code, data, etc.).
ReproZip automatically and systematically captures required provenance of existing experiments. It does this by “packing” experiments. How it works:
- ReproZip executes the experiment. SystemTap captures the provenance. Each node of the process has details on it (provenance tree).
- Necessary components are identified.
- The workflow is specified, and the specification is fed into a workflow system.
When you are ready to examine, explore, or verify experiments, the package is extracted: ReproZip unpacks the experiment and workflows.
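
The "packing" step can be sketched in a few lines. This is not ReproZip's code. It skips the hard part (tracing the run) and simply assumes we already have a list of the files the experiment touched. The names `pack_experiment` and `verify_package` and the manifest layout are made up for illustration.

```python
import hashlib
import json
import os
import tempfile
import zipfile


def pack_experiment(command, accessed_files, archive_path):
    """Bundle the files a run touched, plus a manifest, into one archive."""
    manifest = {"command": command, "files": {}}
    with zipfile.ZipFile(archive_path, "w") as archive:
        for index, path in enumerate(sorted(set(accessed_files))):
            with open(path, "rb") as handle:
                data = handle.read()
            # Prefix with an index so two files with the same name do not collide.
            stored_name = f"files/{index}_{os.path.basename(path)}"
            archive.writestr(stored_name, data)
            manifest["files"][stored_name] = {
                "original_path": path,
                "sha256": hashlib.sha256(data).hexdigest(),
            }
        archive.writestr("manifest.json", json.dumps(manifest, indent=2))
    return manifest


def verify_package(archive_path):
    """Check every packed file against the hash recorded in the manifest."""
    with zipfile.ZipFile(archive_path) as archive:
        manifest = json.loads(archive.read("manifest.json"))
        return all(
            hashlib.sha256(archive.read(name)).hexdigest() == info["sha256"]
            for name, info in manifest["files"].items()
        )


# Demo: pretend a run of "python analyze.py" read one input file.
with tempfile.TemporaryDirectory() as workdir:
    input_path = os.path.join(workdir, "input.csv")
    with open(input_path, "w") as handle:
        handle.write("x,y\n1,2\n")
    package_path = os.path.join(workdir, "experiment.zip")
    pack_experiment("python analyze.py", [input_path], package_path)
    print(verify_package(package_path))
```

Recording the original path next to each file is what would let an unpacking step put everything back where the experiment expects it.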

Open Science Framework

Presenters: Brian Nosek & Jeff Spies
In brief, the OSF is a nifty way to track projects, work with collaborators, and link together tools from your workflow.

You simply go to the website and start a project (for free). Then add contributors and components to the project. Voila!
A neat feature – you can fork projects.
Provides a log of activities (i.e., version control) & access control (i.e., who can see your stuff).
As long as two software systems work with OSF, they can work together – OSF allows the APIs to “talk”.

From the OSF website:

The Open Science Framework (OSF) is part network of research materials, part version control system, and part collaboration software. The purpose of the software is to support the scientist’s workflow and help increase the alignment between scientific values and scientific practices.

RunMyCode

Presenter: Victoria Stodden
The idea behind this service is that researchers create “companion websites” associated with their publications. These websites allow others to implement the methods used in the paper.

From the RunMyCode website:

RunMyCode is a novel cloud-based platform that enables scientists to openly share the code and data that underlie their research publications. This service is based on the innovative concept of a companion website associated with a scientific publication. The code is run on a computer cloud server and the results are immediately displayed to the user.

Dexy

Presenter: Ana Nelson
Tagline: make | docs | sexy (what more can I say?)
Dexy is an Open Source platform that allows those writing up documents that have underpinning code to combine the two.
Can mix/match coding languages and documentation formats. (e.g., Python, C, Markdown to HTML, data from APIs, WordPress, etc.)

From their website:

Dexy lets you to continue to use your favorite documentation tools, while getting more out of them than ever, and being able to combine them in new and powerful ways. With Dexy you can bring your scripting and testing skills into play in your documentation, bringing project automation and integration to new levels.

DuraSpace

Presenter: Jonathan Markow
DuraSpace’s mission is access, discovery, and preservation of scholarly digital data. Their primary stakeholders and audience are libraries. As part of this role, DuraSpace is a steward of open source projects (Fedora, DSpace, VIVO).
Their main service is DuraCloud: an online storage and preservation service that does things like repair corrupt files, move online copies offsite, distribute content geographically, and scale up or down as needed.
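
How might a preservation service "repair corrupt files"? One common approach is to keep several copies plus a checksum, and to overwrite any copy that no longer matches. The sketch below shows that general idea only. It says nothing about how DuraCloud actually does it, and `repair_copies` and the storage locations are hypothetical.

```python
import hashlib


def sha256(data):
    """Return the SHA-256 hex digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def repair_copies(copies, expected_digest):
    """Replace any copy whose checksum is wrong with a verified good copy.

    copies maps a storage location name to the bytes stored there.
    Returns the sorted list of locations that had to be repaired.
    """
    good = [loc for loc, data in copies.items() if sha256(data) == expected_digest]
    if not good:
        raise ValueError("no intact copy left to repair from")
    source = copies[good[0]]
    repaired = []
    for location in list(copies):
        if sha256(copies[location]) != expected_digest:
            copies[location] = source
            repaired.append(location)
    return sorted(repaired)


original = b"survey data"
copies = {
    "onsite": original,
    "offsite": b"survey dat\x00",  # simulated corruption of the last byte
    "cloud": original,
}
print(repair_copies(copies, sha256(original)))
```

Spreading the copies across sites matters here: the repair only works as long as at least one copy somewhere is still intact.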

Dataverse

Presenter: Merce Crosas
Dataverse is a virtual archive for “research studies”. These research studies are containers for data, documentation, and code that are needed for reproducibility.

From their website:

A repository for research data that takes care of long term preservation and good archival practices, while researchers can share, keep control of and get recognition for their data.