

Fifteen ideas about data validation (and peer review)

[Image: phrenology diagram showing honest and dishonest head shapes.]
It’s easy to evaluate a person by the shape of their head, but datasets are more complicated. From Vaught’s Practical Character Reader in the Internet Archive.

Many open issues drift around data publication, but validation is both the biggest and the haziest. Some form of validation at some stage in a data publication process is essential; data users need to know that they can trust the data they want to use, data creators need a stamp of approval to get credit for their work, and the publication process must avoid getting clogged with unusable junk. However, the scientific literature’s validation mechanisms don’t translate to data as directly as its mechanisms for, say, citation.

This post is in part a very late response to a data publication workshop I attended last February at the International Digital Curation Conference (IDCC). In a breakout discussion of models for data peer review, there were far more ideas about data review than time to discuss them. Here, for reference purposes, is a longish list of non-parallel, sometimes-overlapping ideas about how data review, validation, or quality assessment could or should work. I’ve tried to stay away from deeper consideration of what data quality means (which I’ll discuss in a future post) and from the broader issues of peer review associated with the literature, but they inevitably pop up anyway.

  1. Data validation is like peer review of the literature: Peer review is an integral part of science; even when they resent the process, scientists understand and respect it. If we are to ask them to start reviewing data, it behooves us to slip data into existing structures. Data reviewed in conjunction with a paper fits this approach. Nature Publishing Group’s Scientific Data publishes data papers through a traditional review process that considers the data as well as the paper. Peer review at F1000Research follows a literature-descended (although decidedly non-traditional) process that asks reviewers to examine underlying data together with the paper.
  2. Data validation is not like peer review of the literature: Data is fundamentally different from literature and shouldn’t be reviewed as though it were a paper. As Mark Parsons put it at the workshop, “literature is an argument; data is a fact.” The fundamental question in peer review of an article is “did the authors actually demonstrate what they claim?” This involves evaluation of the data, but in the context of a particular question and conclusion. Without a question, there is no context, and no way to meaningfully evaluate the data.
  3. Divide the concerns: Separate out aspects of data quality and consider them independently. For example, Sarah Callaghan divides data quality into technical and scientific quality. Technical quality demands complete data and metadata and appropriate file formats; scientific quality requires appropriate collection methods and high overall believability.
  4. Divvy up the roles: Separate concerns need not be evaluated by the same person or even the same organization. For instance, GigaScience assigns a separate data reviewer for technical review. Data paper publishers generally coordinate scientific review and leave at least some portion of the technical review to the repository that houses the data. Third-party peer-review services like LIBRE or Rubriq could conceivably take up data review.
  5. Review data and metadata together: A reviewer must assess data in conjunction with its documentation and metadata. Assessing data quality without considering documentation is both impossible and pointless; it’s impossible to know that data is “good” without knowing exactly what it is and, even if one could, it would be pointless because no one will ever be able to use it. This idea is at least implicit in any data review scheme. In particular, data paper journals explicitly raise evaluation of the documentation to the same level as evaluation of the data. Biodiversity Data Journal’s peer review guidelines are not unusual in addressing not only the quality of the data and the quality of the documentation, but the consistency between them.
  6. Experts should review the data: Like a journal article, a dataset should pass review by experts in the field. Datasets are especially prone to cross-disciplinary use, in which case the user may not have the background to evaluate the data themselves. Sarah Callaghan illustrated how peer review might work (even without a data paper) by reviewing a pair of already published datasets.
  7. The community should review the data: As with a journal article, the real value of a dataset emerges over time as a result of community engagement. After a slow start, post-publication commenting on journal articles (e.g. through PubMed Commons) seems to be gaining momentum.
  8. Users should review the data: Data review can be a byproduct of use. A researcher using a dataset interrogates it more thoroughly than someone just reviewing it. And, because they were doing it anyway, the only “cost” is the effort of capturing their opinion. In a pilot study, the Dutch Data Archiving and Networked Services repository solicited feedback by emailing a link to an online form to researchers who had downloaded their data.
  9. Use is review: “Indeed, data use in its own right provides a form of review.” Even without explicit feedback, evidence of successful use is itself evidence of quality. Such evidence could be presented by collecting a list of papers that cite the dataset.
  10. Forget quality, consider fitness for purpose: A dataset may be good enough for one purpose but not another. Trying to assess the general “quality” of a dataset is hopeless; consider instead whether the dataset is suited to a particular use. Extending the previous idea, documentation of how and in what contexts a dataset has been used may be more informative than an assessment of abstract quality.
  11. Rate data with multiple levels of quality: The binary accept/reject of traditional peer review (or, for that matter, fit/unfit for purpose) is overly reductive. A one-to-five (or one-to-ten) scale, familiar from pretty much the entire internet, affords a more nuanced view. The Public Library of Science (PLOS) Open Evaluation Tool applies a five-point scale to journal articles, and DANS users rated datasets on an Amazon-style five-star scale.
  12. Offer users multiple levels of assurance: Not all data, even in one place, needs to be reviewed to the same extent. It may be sensible to invest limited resources to most thoroughly validate those datasets which are most likely to be used. For example, Open Context offers five different levels of assurance, ranging from “demonstration, minimal editorial acceptance” to “peer-reviewed.” This idea could also be framed as levels of service ranging (as Mark Parsons put it at the workshop) from “just thrown out there” to “someone answers the phone.”
  13. Rate data along multiple facets: Data can be validated or rated along multiple facets or axes. DANS datasets are rated on quality, completeness, consistency, and structure; two additional facets address documentation quality and usefulness of file formats. This is arguably a different framing of divided concerns, with a difference in application: there, independent assessments are ultimately synthesized into a single verdict; here, the facets are presented separately.
  14. Dynamic datasets need ongoing review: Datasets can change over time, either through addition of new data or revision and correction of existing data. Additions and changes to datasets may necessitate a new (perhaps less extensive) review. Lawrence (2011) asserts that any change to a dataset should trigger a new review.
  15. Unknown users will put the data to unknown uses: Whereas the audience for, and findings of, a journal article are fairly well understood by the author, a dataset may be used by a researcher from a distant field for an unimaginable purpose. Such a person is both the most important to provide validation for (because they lack the expertise to evaluate the data themselves) and the most difficult (because no one can guess who they will be or what they will want to do).

Have an idea about data review that I left out? Let us know in the comments!

Git/GitHub: A Primer for Researchers

The Beastie Boys knew what’s up: Git it together. From egotripland.com

I might be what a guy named Everett Rogers would call an “early adopter”. Rogers wrote a book back in 1962 called Diffusion of Innovations, wherein he explains how and why technology spreads through cultures. The “adoption curve” from his book has been widely used to visualize the point at which a piece of technology or innovation reaches critical mass, and divides individuals into five categories depending on where in the curve they adopt a given piece of technology: innovators are first, then early adopters, early majority, late majority, and finally laggards.

At the risk of vastly oversimplifying a complex topic, being an early adopter simply means that I am excited about new stuff that seems promising; in other words, I am confident that the “stuff” (GitHub, in this case) will catch on and be important in the future. Let me explain.

Let’s start with version control.

Before you can understand the power of GitHub for science, you need to understand the concept of version control. From git-scm.com: “Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later.” We all deal with version control issues. I would guess that anyone reading this has at least one file on their computer with “v2” in the title. Collaborating on a manuscript is a special kind of version control hell, especially if the authors disagree about which system to use (e.g., LaTeX versus Microsoft Word). And figuring out the differences between two versions of an Excel spreadsheet? Good luck to you. The Wikipedia entry on version control makes a statement that brings versioning into focus:

The need for a logical way to organize and control revisions has existed for almost as long as writing has existed, but revision control became much more important, and complicated, when the era of computing began.

Ah, yes. The era of collaborative research, scripting languages, and big data does make this issue a bit more important and complicated. Enter Git. Git is a free, open-source, distributed version control system, originally created for Linux kernel development in 2005. There are other version control systems, most notably Apache Subversion (SVN) and Mercurial, but I posit that the existence of GitHub is what makes Git particularly interesting for researchers.
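To make that concrete, here is a minimal sketch of everyday version control with Git. The file name and commit message are invented for illustration, but the commands are standard Git:

    # Start tracking a project directory with Git
    git init

    # Stage a file and record a snapshot of it, with a message
    # explaining why this version exists
    git add analysis.R
    git commit -m "First pass at cleaning the field data"

    # Later: see exactly what has changed since that snapshot
    git diff analysis.R

    # Browse the project's history, one line per version
    git log --oneline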

So what is GitHub?

GitHub is a web-based hosting service for projects that use the Git revision control system. It’s free (with a few conditions) and has been quite successful since its launch in 2008. Historically, version control systems were developed for and by software developers, and GitHub was created primarily as a way to develop software projects efficiently, but its reach has been growing in the last few years. Here’s why.
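For a sense of how little glue connects the two: once a project is under Git locally, publishing it to GitHub takes essentially two commands. The URL below is a placeholder; GitHub supplies the real one when you create a repository on the site:

    # Point the local repository at a repository created on GitHub
    git remote add origin https://github.com/yourname/yourproject.git

    # Push your commits to GitHub, where collaborators can see them
    git push -u origin master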

Note: I am not going into the details of how Git works, its structure, or how to incorporate Git into your daily workflow. That’s a topic best left to online courses and Software Carpentry Bootcamps.

What’s in it for researchers?

At this point it is good to bring up a great paper by Karthik Ram titled “Git can facilitate greater reproducibility and increased transparency in science”, which came out in 2013 in the journal Source Code for Biology and Medicine. Ram goes into much more detail about the power of Git (and GitHub by extension) for researchers. I am borrowing heavily from his section on “Use cases for Git in science” for the four benefits of Git/GitHub below.

1. Lab notebooks make a comeback. The age-old practice of maintaining a lab notebook has been challenged by the digital age. It’s difficult to keep all of the files, software, programs, and methods well-documented in the best of circumstances, never mind when collaboration enters the picture. I see researchers struggling to keep track of their various threads of thought and work, and remember going through similar struggles myself. Enter online lab notebooks. naturejobs.com recently ran a piece about digital lab notebooks, which provides a nice overview of this topic. To really get a feel for the power of using GitHub as a lab notebook, see GitHubber and ecologist Carl Boettiger’s site. The gist is this: GitHub can serve as a home for all of the different threads of your project, including manuscripts, notes, datasets, and methods development (see the sketch after this list).

2. Collaboration is easier. You and your colleagues can work on a manuscript together, write code collaboratively, and share resources without the potential for overwriting each other’s work. No more v23.docx or appended file names with initials. Instead, a co-author can submit changes and document those with “commit messages” (read about them on GitHub here).

3. Feedback and review are easier. The GitHub issue tracker allows collaborators (potential or current), reviewers, and colleagues to ask questions, notify you of problems or errors, and suggest improvements or new ideas.

4. Increased transparency. Using a version control system means you and others are able to see decision points in your work, and understand why the project proceeded the way it did. If you are a super-savvy GitHubber, you can make your entire manuscript available, traceable on your site from the first data point collected to the final submitted version. This is my goal for my next manuscript.
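To ground the four benefits above, here is what they can look like at the command line. This is a sketch under invented names (the repository, branch, files, and messages are all hypothetical), not a prescribed workflow:

    # (1) The repository as lab notebook: clone the shared project
    git clone https://github.com/our-lab/phenology-project.git
    cd phenology-project

    # Add a dated notebook entry alongside the analysis it refers to
    echo "Tried a stricter outlier filter; see analysis/filter.R" > notes/2014-03-10.md

    # (2) Collaborate without overwriting anyone: work on your own branch
    git checkout -b stricter-filter
    git add notes/2014-03-10.md
    git commit -m "Tighten outlier filter after spot-checking raw data"

    # Push the branch so co-authors can review it; questions and
    # suggestions can go through the issue tracker (benefit 3)
    git push origin stricter-filter

    # (4) Transparency: the history records every decision point
    git log --oneline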

Final thoughts

Git can be an invaluable tool for researchers. It does, however, have a bit of a high activation energy. That is, if you aren’t familiar with version control systems, are scared of the command line, or are married to GUI-heavy proprietary programs like Microsoft Word, you will be hard pressed to effectively use Git in the ways I outline above. That said, spending the time and energy to learn Git and GitHub can make your life so. much. easier. I advise graduate students to learn Git (along with other great open tools like LaTeX and Python) as early in their grad careers as possible. Although it doesn’t feel like it, grad school is the perfect time to learn these systems. Don’t be a laggard; be an early adopter.

References and other good reads