Software Carpentry and Data Management

About a year ago, I started hearing about Software Carpentry. I wasn’t sure exactly what it was, but I envisioned tech-types showing up at your house with routers, hard drives, and wireless mice to repair whatever software was damaged by careless fumblings. Of course, this is completely wrong. I now know that it is actually an ambitious and awesome project that was recently adopted by Mozilla, and recently got a boost from the Alfred P. Sloan Foundation (how is it that they always seem to be involved in the interesting stuff?).

From their website:

Software Carpentry helps researchers be more productive by teaching them basic computing skills. We run boot camps at dozens of sites around the world, and also provide open access material online for self-paced instruction.

SWC got its start in the 1990s, when its founder, Greg Wilson, realized that many of the scientists who were trying to use supercomputers didn’t actually know how to build and troubleshoot their code, much less use things like version control. More specifically, most had never been shown how to do four basic tasks that are fundamentally important to any science involving computation (which is increasingly all science).

Software Carpentry is too cool for a reference to the Carpenters. From marshallmatlock.com (click for more).

Greg started teaching these topics (and others) at Los Alamos National Laboratory in 1998. After a bit of stop and start, he left a faculty position at the University of Toronto in April 2010 to devote himself to it full-time. Fast forward to January 2012, and Software Carpentry became the first project of what is now the Mozilla Science Lab, supported by funding from the Alfred P. Sloan Foundation.

This new incarnation of Software Carpentry has focused on offering intensive, two-day workshops aimed at grad students and postdocs. These workshops (which they call “boot camps”) are usually small – typically 40 learners – with low student-teacher ratios, ensuring that those in attendance get the attention and help they need.

Other than Greg himself, whose role is increasingly to train new trainers, Software Carpentry is a volunteer organization. More than 50 people are currently qualified to instruct, and the number is growing steadily. The basic framework for a boot camp is this:

  1. Someone decides to host a Software Carpentry workshop for a particular group (e.g., a flock of macroecologists, or a herd of new graduate students at a particular university). The host can be a fellow researcher, a department chair, a librarian, an advisor — you name it.
  2. Organizers round up funds to pay for travel expenses for the instructors and any other anticipated workshop expenses.
  3. Software Carpentry matches them with instructors according to the needs of their group; together, they and the organizers choose dates and open up enrollment.
  4. The boot camp itself runs eight hours a day for two consecutive days (though there are occasionally variations). Learning is hands-on: people work on their own laptops, and see how to use the tools listed below to solve realistic problems.

That’s it! They have a great webpage on how to run a boot camp, which includes checklists and thorough instructions on how to ensure your boot camp is a success. About 2,300 people have gone through an SWC boot camp, and the organization hopes to double that number by mid-2014.

The core curriculum for the two-day boot camp usually covers the Unix shell, version control, programming in Python or R, and working with structured data using SQL.
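To give a flavor of the hands-on sessions, a boot camp exercise might look something like the sketch below. This is my own illustrative example, not an official SWC lesson; the data and column names are invented.

```python
# Illustrative only: a small, boot-camp-style exercise (not an official SWC lesson).
# Compute the mean measurement per site from a tiny inline CSV.
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical data; in a workshop this would usually come from a file on disk.
raw = io.StringIO(
    "site,measurement\n"
    "A,2.3\nA,2.9\nB,4.1\nB,3.8\nB,4.4\n"
)

by_site = defaultdict(list)
for row in csv.DictReader(raw):
    by_site[row["site"]].append(float(row["measurement"]))

for site, values in sorted(by_site.items()):
    print(f"{site}: mean = {mean(values):.2f} (n = {len(values)})")
```

The specific task matters less than the habits around it: scripting the analysis instead of clicking through it, and putting that script under version control so the work can be tracked and shared.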

Software Carpentry also offers over a hundred short video lessons online, all of which are CC-BY licensed (go to the SWC webpage for a hyperlinked list).

Why focus on grad students and postdocs? Professors are often too busy with teaching, committees, and proposal writing to improve their software skills, while undergrads have less incentive to learn since they don’t have a longer-term project in mind yet. Software Carpentry is also playing a long game: today’s grad students are tomorrow’s professors, and the day after that, they will be the ones setting parameters for funding programs, editing journals, and shaping science in other ways. Teaching them these skills now is one way – maybe the only way – to make computational competence a “normal” part of scientific practice.

So why am I blogging about this? When Greg started thinking about training researchers to understand the basics of good computing practice and coding, he couldn’t have predicted the huge explosion in the availability of data, the number of software programs for analyzing those datasets, and the shortage of training that researchers receive in dealing with this new era. I believe part of the reason funders stepped up to support the mission of Software Carpentry is that now, more than ever, researchers need these skills to successfully do science. Reproducibility and accountability are in more demand, and data sharing mandates will likely morph into workflow sharing mandates. Ensuring reproducibility in analysis is next to impossible without the skills Software Carpentry’s volunteers teach.

My secret motive for talking about SWC? I want UC librarians to start organizing boot camps for groups of researchers on their campuses!

Software for Reproducibility Part 2: The Tools

Last week I wrote about the workshop I attended (Workshop on Software Infrastructure for Reproducibility in Science), held in Brooklyn at the new Center for Urban Science and Progress, NYU. This workshop was made possible by the Alfred P. Sloan Foundation and brought together heavy-hitters from the reproducibility world who work on software for workflows. I provided some broad-strokes overviews last week; this week, I’ve created a list of some of the tools we saw during the workshop. Note: the level of detail for tools is consistent with my level of fatigue during their presentation!

Sumatra

From the Sumatra website:

The solution we propose is to develop a core library, implemented as a Python package, sumatra, and then to develop a series of interfaces that build on top of this: a command-line interface, a web interface, a graphical interface. Each of these interfaces will enable: (1) launching simulations/analyses with automated recording of provenance information; and (2) managing a computational project: browsing, viewing, deleting simulations/analyses.

Taverna

From the Taverna website:

Taverna is an open source and domain-independent Workflow Management System – a suite of tools used to design and execute scientific workflows and aid in silico experimentation.

IPython Notebook

Galaxy

From their website:

Galaxy is an open, web-based platform for data intensive biomedical research. Whether on the free public server or your own instance, you can perform, reproduce, and share complete analyses.

Madagascar

From the Madagascar website:

Madagascar is an open-source software package for multidimensional data analysis and reproducible computational experiments. Its mission is to provide a convenient and powerful environment and a convenient technology transfer tool for researchers working with digital image and data processing in geophysics and related fields. Technology developed using the Madagascar project management system is transferred in the form of recorded processing histories, which become “computational recipes” to be verified, exchanged, and modified by users of the system.

VisTrails

RCloud

ReproZip 

Open Science Framework

From the OSF website:

The Open Science Framework (OSF) is part network of research materials, part version control system, and part collaboration software. The purpose of the software is to support the scientist’s workflow and help increase the alignment between scientific values and scientific practices.

RunMyCode

From the RunMyCode website:

RunMyCode is a novel cloud-based platform that enables scientists to openly share the code and data that underlie their research publications. This service is based on the innovative concept of a companion website associated with a scientific publication. The code is run on a computer cloud server and the results are immediately displayed to the user.

Dexy

From their website:

Dexy lets you continue to use your favorite documentation tools, while getting more out of them than ever, and being able to combine them in new and powerful ways. With Dexy you can bring your scripting and testing skills into play in your documentation, bringing project automation and integration to new levels.

DuraSpace 

Dataverse 

From their website:

A repository for research data that takes care of long term preservation and good archival practices, while researchers can share, keep control of and get recognition for their data.

Software for Reproducibility

Last week I thought a lot about one of the foundational tenets of science: reproducibility. I attended the Workshop on Software Infrastructure for Reproducibility in Science, held in Brooklyn at the new Center for Urban Science and Progress, NYU. This workshop was made possible by the Alfred P. Sloan Foundation and brought together heavy-hitters from the reproducibility world who work on software for workflows.

New to workflows? Read more about workflows in old blog posts on the topic, here and here. Basically, a workflow is a formalization of “process metadata”.  Process metadata is information about the process used to get to your final figures, tables, and other representations of your results. Think of it as a precise description of the scientific procedures you follow.
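To make that concrete, here is a minimal, hypothetical sketch (mine, not taken from any particular workflow tool) of what process metadata for a single analysis step might record: which script ran, on which inputs, with which parameters, and what it produced. The file names and parameters are invented for illustration.

```python
# A minimal, hypothetical sketch of "process metadata" for one step of a workflow.
# Real workflow tools capture much more (code versions, environments, dependencies);
# the file names and parameters below are invented for illustration.
import hashlib
import json
import platform
from datetime import datetime, timezone


def checksum(path):
    """Fingerprint a file so inputs and outputs can be identified later."""
    with open(path, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest()


def record_step(script, inputs, params, outputs):
    """Describe one step: what ran, on what data, with which settings, producing what."""
    return {
        "script": script,
        "inputs": {p: checksum(p) for p in inputs},
        "parameters": params,
        "outputs": {p: checksum(p) for p in outputs},
        "python_version": platform.python_version(),
        "run_at": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    # Create tiny placeholder files so the example runs end to end.
    for name in ("raw_counts.csv", "clean_counts.csv"):
        with open(name, "w") as f:
            f.write("sample,count\nA,10\nB,12\n")

    step = record_step(
        script="clean_data.py",  # the (hypothetical) analysis script
        inputs=["raw_counts.csv"],
        params={"drop_na": True, "threshold": 0.05},
        outputs=["clean_counts.csv"],
    )
    print(json.dumps(step, indent=2))
```

A pile of records like this, one per step, is essentially a workflow: enough information for someone else (or future you) to see how the final figures and tables were produced.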

After sitting through demos and presentations on the different tools folks have created, my head was spinning, in a good way. A few of my takeaways are below. In my next Data Pub post I will provide a list of the tools we discussed.

Takeaway #1: Reuse is different from reproducibility.

The end goal of documenting and archiving a workflow may be different for different people and systems. Reuse of a workflow, for instance, is potentially much easier than exactly reproducing its results. Any researcher will tell you: exact reproducibility is virtually impossible. Of course, this differs a bit by discipline: anything involving a living thing (i.e., biology) is much more unpredictable, while engineering experiments are more likely to be spot-on when reproduced. The level of detail needed to reproduce results is likely to dwarf the detail and information needed to reuse a workflow.

Takeaway #2: Think of reproducibility as archiving.

This was something Josh Greenberg said, and it struck a chord with me. It was said in the context of considering exactly how much stuff should be captured for reproducibility. Josh pointed out that there is a whole body of work out there addressing this very question: archival science.

Example: an archivist at a library gets boxes of stuff from a famous author who recently passed away. How does s/he decide what is important? What should be kept, and what should be thrown out? How should the items be arranged to ensure that they are useful? What metadata, context, or other information (like a finding aid) should be provided?

The situation with archiving workflows is similar: how much information is needed? What are the likely uses for the workflow? How much detail is too much? Too little? I like treating the problem of capturing the scientific process as an archival science scenario – it makes the problem seem a bit more manageable.

Takeaway #3: High-quality APIs are critical for any tool developed.

We talked about MANY different tools. The one thing we could all agree on was that they should play nice with other tools. In the software world, this means having a nice, user-friendly Application Programming Interface (API) that basically tells two pieces of software how to talk to one another.
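As a toy illustration of what “playing nice” can look like (this is not the API of any of the tools above, just a made-up example), one program can expose a tiny HTTP endpoint and another can consume it with a few lines of standard-library Python:

```python
# A toy illustration of two programs talking over an HTTP API.
# Neither side is a real workflow tool; the endpoint and JSON shape are made up.
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen


class ToyWorkflowAPI(BaseHTTPRequestHandler):
    """Pretend 'workflow tool' that reports the status of a run as JSON."""

    def do_GET(self):
        if self.path == "/runs/42":
            body = json.dumps({"run": 42, "status": "finished"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


if __name__ == "__main__":
    server = HTTPServer(("127.0.0.1", 0), ToyWorkflowAPI)
    threading.Thread(target=server.serve_forever, daemon=True).start()

    # A second program only needs the URL and the documented JSON format.
    url = "http://127.0.0.1:{}/runs/42".format(server.server_port)
    with urlopen(url) as response:
        print(json.load(response))  # {'run': 42, 'status': 'finished'}

    server.shutdown()
```

The details don’t matter; the point is that once a tool documents an interface like this, any other tool or script can build on it without knowing anything about its internals.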

Takeaway #4: We’ve got the tech-savvy researchers covered. Others? not so much.

The software we discussed is very nifty. That said, many of these tools are geared towards researchers with some impressive tech chops. The tools focus on helping capture code-based work, and integrate with things like LaTeX, Git/GitHub, and the command line. Did I lose you there? You aren’t alone… many of the researchers I interact with are not familiar with these tools, and would therefore not be able to effectively use the software we discussed.

Takeaway #5: Closing the gap between the tools and the researchers that should use them is hard. But not impossible.

There are three basic approaches that we can take:

  1. Focus on better user experience design
  2. Emphasize researcher training via workshops, one-on-one help from experts, et cetera
  3. Force researchers to close the gap on their own. (i.e., Wo/man up).

The reality is that it’s likely to be some combination of these three. Those at the workshop recognized the need for better user interfaces, and some projects here at the CDL are focusing on extensive usability testing prior to release. Funders are beginning to see the value of funding new positions for “human bridges” to help sync up researcher skill sets with available tools. And finally, researchers are slowly recognizing the need to learn basic coding – note the massive uptake of R in the ecology community as an example.