Category: UC3

Posts written by UC3 staff.

Fifteen ideas about data validation (and peer review)

Phrenology diagram showing honest and dishonest head shapes

It’s easy to evaluate a person by the shape of their head, but datasets are more complicated. From Vaught’s Practical Character Reader in the Internet Archive.

Many open issues drift around data publication, but validation is both the biggest and the haziest. Some form of validation at some stage in a data publication process is essential; data users need to know that they can trust the data they want to use, data creators need a stamp of approval to get credit for their work, and the publication process must avoid getting clogged with unusable junk. However, the validation mechanisms of the scientific literature don't translate as directly to data as, say, its citation mechanisms do.

This post is in part a very late response to a data publication workshop I attended last February at the International Digital Curation Conference (IDCC). In a breakout discussion of models for data peer review, there were far more ideas about data review than time to discuss them. Here, for reference purposes, is a longish list of non-parallel, sometimes-overlapping ideas about how data review, validation, or quality assessment could or should work. I’ve tried to stay away from deeper consideration of what data quality means (which I’ll discuss in a future post) and from the broader issues of peer review associated with the literature, but they inevitably pop up anyway.

  1. Data validation is like peer review of the literature: Peer review is an integral part of science; even when they resent the process, scientists understand and respect it. If we are to ask them to start reviewing data, it behooves us to slip data into existing structures. Data reviewed in conjunction with a paper fits this approach. Nature Publishing Group's Scientific Data publishes data papers through a traditional review process that considers the data as well as the paper. Peer review at F1000Research follows a literature-descended (although decidedly non-traditional) process that asks reviewers to examine underlying data together with the paper.
  2. Data validation is not like peer review of the literature: Data is fundamentally different from literature and shouldn't be treated as though it were. As Mark Parsons put it at the workshop, "literature is an argument; data is a fact." The fundamental question in peer review of an article is "did the authors actually demonstrate what they claim?" This involves evaluation of the data, but in the context of a particular question and conclusion. Without a question, there is no context, and no way to meaningfully evaluate the data.
  3. Divide the concerns: Separate out aspects of data quality and consider them independently. For example, Sarah Callaghan divides data quality into technical and scientific quality. Technical quality demands complete data and metadata and appropriate file formats; scientific quality requires appropriate collection methods and high overall believability.
  4. Divvy up the roles: Separate concerns need not be evaluated by the same person or even the same organization. For instance, GigaScience assigns a separate data reviewer for technical review. Data paper publishers generally coordinate scientific review and leave at least some portion of the technical review to the repository that houses the data. Third party peer-review services like LIBRE or Rubriq could conceivably take up data review.
  5. Review data and metadata together: A reviewer must assess data in conjunction with its documentation and metadata. Assessing data quality without considering documentation is both impossible and pointless; it's impossible to know that data is "good" without knowing exactly what it is and, even if one could, it would be pointless because no one will ever be able to use it. This idea is at least implicit in any data review scheme. In particular, data paper journals explicitly raise evaluation of the documentation to the same level as evaluation of the data. Biodiversity Data Journal's peer review guidelines are not unusual in addressing not only the quality of the data and the quality of the documentation, but the consistency between them.
  6. Experts should review the data: Like a journal article, a dataset should pass review by experts in the field. Datasets are especially prone to cross-disciplinary use, in which case the user may not have the background to evaluate the data themselves. Sarah Callaghan illustrated how peer review might work– even without a data paper– by reviewing a pair of (already published) datasets.
  7. The community should review the data: Like a journal article, the real value of a dataset emerges over time as a result of community engagement. After a slow start, post-publication commenting on journal articles (e.g. through PubMed Commons) seems to be gaining momentum.
  8. Users should review the data: Data review can be a byproduct of use. A researcher using a dataset interrogates it more thoroughly than someone just reviewing it. And, because they were doing it anyway, the only “cost” is the effort of capturing their opinion. In a pilot study, the Dutch Data Archiving and Networked Services repository solicited feedback by emailing a link to an online form to researchers who had downloaded their data.
  9. Use is review: "Indeed, data use in its own right provides a form of review." Even without explicit feedback, evidence of successful use is itself evidence of quality. Such evidence could be presented by collecting a list of papers that cite the dataset.
  10. Forget quality, consider fitness for purpose: A dataset may be good enough for one purpose but not another. Trying to assess the general “quality” of a dataset is hopeless; consider instead whether the dataset is suited to a particular use. Extending the previous idea, documentation of how and in what contexts a dataset has been used may be more informative than an assessment of abstract quality.
  11. Rate data with multiple levels of quality: The binary accept/reject of traditional peer review (or, for that matter, fit/unfit for purpose) is overly reductive. A one-to-five (or one-to-ten) scale, familiar from pretty much the entire internet, affords a more nuanced view. The Public Library of Science (PLOS) Open Evaluation Tool applies a five-point scale to journal articles, and DANS users rated datasets on an Amazon-style five-star scale.
  12. Offer users multiple levels of assurance: Not all data, even in one place, needs to be reviewed to the same extent. It may be sensible to invest limited resources to most thoroughly validate those datasets which are most likely to be used. For example, Open Context offers five different levels of assurance, ranging from “demonstration, minimal editorial acceptance” to “peer-reviewed.” This idea could also be framed as levels of service ranging (as Mark Parsons put it at the workshop) from “just thrown out there” to “someone answers the phone.”
  13. Rate data along multiple facets: Data can be validated or rated along multiple facets or axes. DANS datasets are rated on quality, completeness, consistency, and structure; two additional facets address documentation quality and usefulness of file formats. This is arguably a different framing of divided concerns (idea 3), with a difference in application: there, independent assessments are ultimately synthesized into a single verdict; here, the facets are presented separately.
  14. Dynamic datasets need ongoing review: Datasets can change over time, either through addition of new data or revision and correction of existing data. Additions and changes to datasets may necessitate a new (perhaps less extensive) review. Lawrence (2011) asserts that any change to a dataset should trigger a new review.
  15. Unknown users will put the data to unknown uses: Whereas the audience for, and findings of, a journal article are fairly well understood by the author, a dataset may be used by a researcher from a distant field for an unimaginable purpose. Such a person is both the most important to provide validation for– because they lack the expertise to evaluate the data themselves– and the most difficult– because no one can guess who they will be or what they will want to do.

Have an idea about data review that I left out? Let us know in the comments!

Git/GitHub: A Primer for Researchers

The Beastie Boys knew what’s up: Git it together. From egotripland.com

I might be what a guy named Everett Rogers would call an “early adopter”. Rogers wrote a book back in 1962 called Diffusion of Innovations, wherein he explains how and why technology spreads through cultures. The “adoption curve” from his book has been widely used to visualize the point at which a piece of technology or innovation reaches critical mass, and it divides individuals into one of five categories depending on where in the curve they adopt a given piece of technology: innovators are the first, then early adopters, early majority, late majority, and finally laggards.

At the risk of vastly oversimplifying a complex topic, being an early adopter simply means that I am excited about new stuff that seems promising; in other words, I am confident that the “stuff” – GitHub, in this case –will catch on and be important in the future. Let me explain.

Let’s start with version control.

Before you can understand the power of GitHub for science, you need to understand the concept of version control. From git-scm.com, “Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later.” We all deal with version control issues. I would guess that anyone reading this has at least one file on their computer with “v2” in the title. Collaborating on a manuscript is a special kind of version control hell, especially if those writing are in disagreement about systems to use (e.g., LaTeX versus Microsoft Word). And figuring out the differences between two versions of an Excel spreadsheet? Good luck to you. The Wikipedia entry on version control makes a statement that brings versioning into focus:

The need for a logical way to organize and control revisions has existed for almost as long as writing has existed, but revision control became much more important, and complicated, when the era of computing began.

Ah, yes. The era of collaborative research, using scripting languages, and big data does make this issue a bit more important and complicated. Enter Git. Git is a free, open-source distributed version control system, originally created for Linux kernel development in 2005. There are other version control systems – most notably, Apache Subversion (aka SVN) and Mercurial. However, I posit that the existence of GitHub is what makes Git particularly interesting for researchers.

So what is GitHub?

GitHub is a web-based hosting service for projects that use the Git revision control system. It’s free (with a few conditions) and has been quite successful since its launch in 2008. Historically, version control systems were developed for and by software developers. GitHub was created primarily as a way to develop software projects efficiently, but its reach has been growing in the last few years. Here’s why.

Note: I am not going into the details of how git works, its structure, or how to incorporate git into your daily workflow. That’s a topic best left to online courses and Software Carpentry Bootcamps.

What’s in it for researchers?

At this point it is good to bring up a great paper by Karthik Ram titled “Git can facilitate greater reproducibility and increased transparency in science”, which came out in 2013 in the journal Source Code for Biology and Medicine. Ram goes into much more detail about the power of Git (and GitHub by extension) for researchers. I am borrowing heavily from his section on “Use cases for Git in science” for the four benefits of Git/GitHub below.

1. Lab notebooks make a comeback. The age-old practice of maintaining a lab notebook has been challenged by the digital age. It’s difficult to keep all of the files, software, programs, and methods well-documented in the best of circumstances, never mind when collaboration enters the picture. I see researchers struggling to keep track of their various threads of thought and work, and remember going through similar struggles myself. Enter online lab notebooks. naturejobs.com recently ran a piece about digital lab notebooks, which provides a nice overview of this topic. To really get a feel for the power of using GitHub as a lab notebook, see GitHubber and ecologist Carl Boettiger’s site. The gist is this: GitHub can serve as a home for all of the different threads of your project, including manuscripts, notes, datasets, and methods development.

2. Collaboration is easier. You and your colleagues can work on a manuscript together, write code collaboratively, and share resources without the potential for overwriting each other’s work. No more v23.docx or appended file names with initials. Instead, a co-author can submit changes and document those with “commit messages” (read about them on GitHub here).

3. Feedback and review is easier. The GitHub issue tracker allows collaborators (potential or current), reviewers, and colleagues to ask questions, notify you of problems or errors, and suggest improvements or new ideas.

4. Increased transparency. Using a version control system means you and others are able to see decision points in your work, and understand why the project proceeded in the way that it did. For the super savvy GitHubber, you can make your entire manuscript available on your site, traceable from the first data point collected to the final submitted version. This is my goal for my next manuscript.

Final thoughts

Git can be an invaluable tool for researchers. It does, however, have a bit of a high activation energy. That is, if you aren’t familiar with version control systems, are scared of the command line, or are married to GUI-heavy proprietary programs like Microsoft Word, you will be hard pressed to effectively use Git in the ways I outline above. That said, spending the time and energy to learn Git and GitHub can make your life so. much. easier. I advise graduate students to learn Git (along with other great open tools like LaTeX and Python) as early in their grad careers as possible. Although it doesn’t feel like it, grad school is the perfect time to learn these systems. Don’t be a laggard; be an early adopter.


Abandon all hope, ye who enter dates in Excel

Big thanks to Kara Woo of Washington State University for this guest blog post!

Update: The XLConnect package has been updated to fix the problem described below; however, other R packages for interfacing with Excel may import dates incorrectly. One should still use caution when storing data in Excel.


Like anyone who works with a lot of data, I have a strained relationship with Microsoft Excel. Its ubiquity forces me to tolerate it, yet I believe that it is fundamentally a malicious force whose main goal is to incite chaos through the obfuscation and distortion of data.[1] After discovering a truly ghastly feature of how it handles dates, I am now fully convinced.

As it turns out, Excel “supports” two different date systems: one beginning in 1900 and one beginning in 1904.[2] Excel stores all dates as floating point numbers representing the number of days since a given start date, and Excel for Windows and Mac have different default start dates (January 1, 1900 vs. January 1, 1904).[3] Furthermore, the 1900 date system deliberately (and erroneously) assumes that 1900 was a leap year to ensure compatibility with a bug in—wait for it—Lotus 1-2-3.

You can’t make this stuff up.

What is even more disturbing is how the two date systems can get mixed up in the process of reading data into R, causing all dates in a dataset to be off by four years and a day. If you don’t know to look for it, you might never even notice. Read on for a cautionary tale.

I work as a data manager for a project studying biodiversity in Lake Baikal, and one of the coolest parts of my job is getting to work with data that have been collected by Siberian scientists since the 1940s. I spend a lot of time cleaning up these data in R. It was while working on some data on Secchi depth (a measure of water transparency) that I stumbled across this Excel date issue.

To read in the data I do something like the following using the XLConnect package:

library(XLConnect)
# load the workbook, read the first worksheet, and give the columns usable names
wb1 <- loadWorkbook("Baikal_Secchi_64to02.xlsx")
secchi_main <- readWorksheet(wb1, sheet = 1)
colnames(secchi_main) <- c("date", "secchi_depth", "year", "month")

So far so good. But now, what’s wrong with this picture?

head(secchi_main)
##         date secchi_depth year month
## 1 1960-01-16           12 1964     1
## 2 1960-02-04           14 1964     2
## 3 1960-02-14           18 1964     2
## 4 1960-02-24           14 1964     2
## 5 1960-03-04           14 1964     3
## 6 1960-03-25           10 1964     3

As you can see, the year in the date column doesn’t match the year in the year column. When I open the data in Excel, things look correct.


This particular Excel file uses the 1904 date system, but that fact gets lost somewhere between Excel and R. XLConnect can tell that there are dates, but all the dates are wrong.
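To see exactly where “four years and a day” comes from, here is a quick illustration in base R (this isn’t part of my cleaning code; 21931 is simply a serial number back-calculated to match the first row above):

# Excel stores dates as serial numbers: days since the system's start date.
# The two systems' conversion origins sit 1,462 days apart ("1899-12-30" is
# the conventional origin for 1900-system serials, because it absorbs the
# phantom 1900 leap day):
as.Date("1904-01-01") - as.Date("1899-12-30")
## Time difference of 1462 days

# So the same serial number maps to dates four years and a day apart:
as.Date(21931, origin = "1899-12-30")  # interpreted as a 1900-system serial
## [1] "1960-01-16"
as.Date(21931, origin = "1904-01-01")  # interpreted as a 1904-system serial
## [1] "1964-01-17"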

My solution for these particular data was as follows:

# function to add four years and a day to a given date
fix_excel_dates <- function(date) {
    require(lubridate)
    return(ymd(date) + years(4) + days(1))
}

# create a correct date column
library(dplyr)
secchi_main <- mutate(secchi_main, corrected_date = fix_excel_dates(date))

The corrected_date column looks right.

head(secchi_main)
##         date secchi_depth year month corrected_date
## 1 1960-01-16           12 1964     1     1964-01-17
## 2 1960-02-04           14 1964     2     1964-02-05
## 3 1960-02-14           18 1964     2     1964-02-15
## 4 1960-02-24           14 1964     2     1964-02-25
## 5 1960-03-04           14 1964     3     1964-03-05
## 6 1960-03-25           10 1964     3     1964-03-26

That fix is easy, but I’m left with a feeling of anxiety. I nearly failed to notice the discrepancy between the date and year columns; a colleague using the data pointed it out to me. If these data hadn’t had a year column, it’s likely we never would have caught the problem at all. Has this happened before and I just didn’t notice it? Do I need to go check every single Excel file I have ever had to read into R?
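One partial safeguard (a rough sketch of my own, and it only works because these particular data happen to carry redundant year and month columns) is to compare the parsed dates against those columns right after import:

library(dplyr)
library(lubridate)

# flag rows where the parsed date disagrees with the redundant year/month
# columns; any hits point to a date-system mix-up (or some other problem)
mismatches <- filter(secchi_main,
                     lubridate::year(date) != year |
                       lubridate::month(date) != month)
nrow(mismatches)

A file without that kind of redundancy would sail right through, though, which is exactly the worrying case.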

And now that I know to look for this issue, I still can’t think of a way to check the dates Excel shows against the ones that appear in R without actually opening the data file in Excel and visually comparing them. This is not an acceptable solution in my opinion, but… I’ve got nothing else. All I can do is get up on my worn out data manager soapbox and say it once more: be very, very careful with dates in Excel.



  1. For evidence of its fearsome power, see these examples.
  2. Though as Dave Harris pointed out, “is burdened by” would be more accurate.
  3. To quote John Machin, “In reality, there are no such things [as dates in Excel spreadsheets]. What you have are floating point numbers and pious hope.”

Feedback Wanted: Publishers & Data Access

This post is co-authored with Jennifer Lin, PLOS

Short Version: We need your help!

We have generated a set of recommendations for publishers to help increase access to data in partnership with libraries, funders, information technologists, and other stakeholders. Please read and comment on the report (Google Doc), and help us to identify concrete action items for each of the recommendations here (EtherPad).

Background and Impetus

The recent governmental policies addressing access to research data from publicly funded research across the US, UK, and EU reflect the growing need for us to revisit the way that research outputs are handled. These recent policies have implications for many different stakeholders (institutions, funders, researchers) who will need to consider the best mechanisms for preserving and providing access to the outputs of government-funded research.

The infrastructure for providing access to data is largely still being architected and built. In this context, PLOS and the UC Curation Center hosted a set of leaders in data stewardship issues for an evening of brainstorming to re-envision data access and academic publishing. A diverse group of individuals from institutions, repositories, and infrastructure development collectively explored the question:

What should publishers do to promote the work of libraries and IRs in advancing data access and availability?

We collected the themes and suggestions from that evening in a report: The Role of Publishers in Access to Data. The report contains a collective call to action from this group for publishers to participate as informed stakeholders in building the new data ecosystem. It also enumerates a list of high-level recommendations for how to effect social and technical change as critical actors in the research ecosystem.

We welcome the community to comment on this report. Furthermore, the high-level recommendations need concrete details for implementation. How will they be realized? What specific policies and technologies are required for this? We have created an open forum for the community to contribute their ideas. We will then incorporate the catalog of listings into a final report for publication. Please participate in this collective discussion with your thoughts and feedback by April 24, 2014.

We need suggestions! Feedback! Comments! From Flickr by Hash Milhan

 

Mountain Observatories in Reno

A few months ago, I blogged about my experiences at the NSF Large Facilities Workshop. “Large Facilities” encompass things like NEON (National Ecological Observatory Network), IRIS PASSCAL Instrument Center (Incorporated Research Institutions for Seismology Program for Array Seismic Studies of the Continental Lithosphere), and the NRAO (National Radio Astronomy Observatory). I found the event itself to be an eye-opening experience: much to my surprise, there was some resistance to data sharing in this community. I had always assumed that large, government-funded projects had strict data sharing requirements, but this is not the case. I had stimulating arguments with Large Facilities managers who considered their data too big and complex to share, and (more worrisome) who believed that their researchers would be very resistant to opening up the data they generated at these large facilities.

Why all this talk about large facilities? Because I’m getting the chance to make my arguments again, to a group with overlapping interests to that of the Large Facilities community. I’m very excited to be speaking at Mountain Observatories: A Global Fair and Workshop  this July in Reno, Nevada. Here’s a description from the organizers:

The event is focused on observation sites, networks, and systems that provide data on mountain regions as coupled human-natural systems. So the meeting is expected to bring together biophysical as well as socio-economic researchers to discuss how we can create a more comprehensive and quantitative mountain observing network using the sites, initiatives, and systems already established in various regions of the world.

I must admit, I’m ridiculously excited to geek out with this community. I’ll get to hear about the GLORIA Project (GLObal Robotic-telescopes Intelligent Array), something called “Mountain Ethnobotany”, and “Climate Change Adaptation Governance”. See a full list of the proposed sessions here. The conference is geared towards researchers and managers, which means I’ll have the opportunity to hear about data sharing proclivities straight from their mouths. The roster of speakers joining me includes a hydroclimatologist (Mike Dettinger, USGS) and a researcher focused on socio-cultural systems (Courtney Flint, Utah State University), plus representatives from the NSF, a sensor networks company, and others. The conference should be a great one – the abstract submission deadline was just extended, so there’s still time to join me and nerd out about science!

Reno! From Flickr by Ravensmagiclantern

Finding Disciplinary Data Repositories with DataBib and re3data

This post is by Natsuko Nicholls and John Kratz.  Natsuko is a CLIR/DLF Postdoctoral Fellow in Data Curation for the Sciences and Social Sciences at the University of Michigan.

The problem: finding a repository

Everyone tells researchers not to abandon their data on a departmental server, hard drive, USB stick, CD-ROM, stack of Zip disks, or quipu – put it in a repository! But, most researchers don’t know what repository might be appropriate for their data. If your organization has an Institutional Repository (IR), that’s one good home for the data. However, not everyone has access to an IR, and data in IRs can be difficult for others to discover, so it’s important to consider the other major (and not mutually exclusive!) option: deposit in a Disciplinary Repository (DR).

Many disciplinary repositories exist to handle data from a particular field or of a particular type (e.g. WormBase cares about nematode biology, while GenBank takes only DNA sequences). Some may ask whether the co-existence of IRs and DRs means competition or mutual benefit for universities and research communities, and some may wonder how many repositories are out there for archiving digital assets, but most librarians and researchers just want to find an appropriate repository in a sea of choices.

For those involved in assisting researchers with data management, helping to find the right place to put data for sharing and preservation has become a crucial part of data services. This is certainly true at the University of Michigan—during a recent data management workshop, faculty members expressed their interest in receiving more guidance from librarians on disciplinary repositories.

The help: directories of data repositories

Fortunately, there is help to be found in the form of repository directories.  The Open Access Directory maintains a subdirectory of data repositories.  In the Life Sciences, BioSharing collects data policies, standards, and repositories.  Here, we’ll be looking at two large directories that list repositories from any discipline: DataBib and the REgistry of REsearch data REpositories (re3data.org).

DataBib originated in a partnership between Purdue and Penn State University, and it’s hosted by Purdue. The 600 repositories in DataBib are each placed in a single discipline-level category and tagged with more detailed descriptors of the contents.

re3data.org, which is sponsored by the German Research Foundation, started indexing relatively recently, in 2012, but it already lists 628 repositories.  Unlike DataBib, repositories aren’t assigned to a single category, but instead tagged with subjects, content types, and keywords.  Last November, re3data and BioSharing agreed to share records.  re3data is more completely described in this paper.

Given the similar number of repositories listed in DataBib and re3data, one might expect that their contents would be roughly similar and conclude that there are something around 600 operating DRs.  To test this possibility and get a better sense of the DR landscape, we examined the contents of both directories.

The question: how different are DataBib and re3data?

Contrary to expectation, there is little overlap between the databases. At least 1,037 disciplinary data repositories currently exist, and only 18% (191) are listed in both databases. That’s a lot of repositories to search through for the one right place to put your data, and that’s without counting IRs, which, with a few exceptions, are not listed in re3data or Databib (you can find a long list of academic open access repositories elsewhere). Of the repositories in both databases, a majority (72%) are categorized into STEM fields. Below is a breakdown of the overlap by discipline (as assigned by DataBib).

Breakdown of overlapping repositories by discipline

Another way of characterizing repository collections by re3data and Databib is by the repository’s host country. In re3data, the top three contributing countries (US 36%, Germany 15%, UK 12%) form the majority, whereas in Databib 58% of repositories are hosted by the US, followed by UK (12%) and Canada (7%). This finding may not be too surprising, since re3data is based in Germany and Databib is in the US.  If you are a researcher looking for the right disciplinary data repository, the host country may matter, depending on your (national-international/private-public) funding agencies and the scale of collaboration.

The full list of repositories is available here.

The conclusion: check both

Going forward, help with disciplinary repository selection will increasingly be a part of data management workflows; the Data Management Planning Tool (DMPTool) plans to incorporate repository recommendations through DataBib, and DataCite may integrate with re3data. Further simplifying matters, DataBib and re3data plan to merge their services in some as-yet-undefined way. But, for now, it’s safe to say that anyone looking for a disciplinary repository should check both DataBib and re3data.

Institutional Repositories: Part 2

A few weeks back I wrote a post describing institutional repositories (IRs for short). IRs have been around for a while, with the impetus of making scholarly publications open access. More recently, however, IRs have been cited as potential repositories for datasets, code, and other scholarly outputs. Here I continue the discussion of IRs and compare their utility to that of disciplinary repositories (DRs). Please note – although IRs are typically associated with open access publications, I discuss them here as potential repositories for data.

Honest criticism of IRs

In my discussions with colleagues at conferences and meetings, I have found that some are skeptical about the role of IRs in data access and preservation. I posit that this skepticism has a couple of origins:

  • IRs are often not intended for “self-service”, i.e., a researcher would need to connect with IR support staff (often via a face-to-face meeting), in order to deposit material into the IR.
  • Many IRs were created at minimum 5 years ago, with interfaces that sometimes appear to pre-date Facebook. Academic institutions often have no budget for a redesign of the user interface, which means those that visit an IR might be put off by the appearance and/or functionality.
  • IRs are run by libraries and IT departments, neither of which are known for self-promotion. Many (most?) researchers are likely unaware of an IR’s existence, and would not think to check in with the libraries regarding their data preservation needs.

These are all valid criticisms of many existing IRs. But there is one huge advantage to IRs over other data repositories: they are owned and operated by academic institutions that have a vested interest in preserving and providing access to scholarly work.

The bright side

IRs aren’t all bad, or I wouldn’t be blogging about them. I believe that they are undergoing a rebirth of sorts: they are now seen as viable places for datasets and other scholarly outputs. Institutions like Purdue are putting IRs at the center of their initiatives around data management, access, and preservation. Here at the CDL, the UC3 group is pursuing the implementation of a data curation platform, DataShare, to allow self-service deposit of datasets into the Merritt Repository (see the UCSF DataShare site). Recent mandates from above requiring access to data resulting from federal grants mean that funders (like IMLS) and organizations (like ARL) are taking an interest in improving the utility of IRs.

IRs versus discipline-specific repositories

In my last post, I mentioned that choosing a repository for your data doesn’t have to be an either/or decision between an IR and a discipline-specific repository (DR). The two types each have advantages and disadvantages, so using both makes sense.

DRs: ideal for data discovery and reuse

Often, DRs have collection policies for the specific types of data they are willing to accept. GenBank, for example, has standardized how you deposit your data, what types and formats of data they accept, and the metadata accompanying that data. This all means that searching for and using the data in GenBank is easy, and data users are able to easily download data for use. Another advantage of having a collection of similar, standardized data is the ability to build tools on top of these datasets, making reuse and meta-analyses easier.

The downside of DRs

DRs are by nature selective in the types of data they accept. Consider this scenario, typical of many research projects: what if someone worked on a project that combined sequencing genes, collecting population demographics, and documenting location with GIS? Many DRs would not want to (or be able to) handle these disparate types of data. The result is that some of the data gets shared via a DR, while data less suitable for the DR would not be shared.

In my work with the DataONE Community Engagement and Education working group, I reviewed what datasets were shared from NSF grants awarded between 2005 and 2009 (see Panel 1 in Hampton et al. 2013). Many of the resulting publications relied on multiple types of data. The percentage of those that shared all of the data produced was around 28%. However, of the data that was shared, 81% was in GenBank or TreeBase – likely due to the culture of data sharing around genetic work. That means most of the non-genetic data is not available, and potentially lost, despite its importance for the project as a whole. Enter: institutional repositories.

IRs: the whole enchilada

Unlike many DRs, IRs have the potential to host entire collections of data around a project – regardless of the type of data, its format, etc. My postdoctoral work on modeling the effects of temperature and salinity on copepod populations involved field collection, laboratory copepod growth experiments (which included logs of environmental conditions), food growth (algal density estimates and growth rates, nutrient concentrations), population size counts, R scripts, and the development of the mathematical models themselves. An IR could take all of these disparate datasets as a package, which I could then refer to in the publications that resulted from the work. A big bonus is that this package could sit next to other packages I’ve generated over the course of my career, making it easier for me to point people to the entire corpus of research work. The biggest bonus of all: having all of the data that produced a publication available at a single location helps ensure reproducibility and transparency.

Maybe you can have your cake (DRs) and eat it too (IRs). From Flickr by Mayaevening

There are certainly some repositories that could handle the type of data package I just described. The Knowledge Network for Biocomplexity is one such relatively generic repository (although I might argue that KNB is more like an IR than a discipline repository). Another is figshare, although this is a repository ultimately owned by a publisher. But as researchers start hunting for places to put their datasets, I would hope that they look to academic institutions rather than commercial publishers. (Full disclosure – I have data stored in figshare!)

Good news! You can have your cake and eat it too. Putting data in both the relevant DRs and more generic IRs is a good solution to ensure discoverability (DRs) and provenance (IRs).

Data Publication Practices and Perceptions

Surveyors working

Credit: Captain Harry Garber, C&GS. From NOAA Photo Library

Today, we’re opening a survey of researcher perceptions and practices around data publication.

Why are you doing a survey?

The term “Data publication” applies language and ideas from traditional scholarly publishing to datasets, with the goal of situating data within the academic reward system and encouraging sharing and reuse. However, the best way to apply these ideas to data is not obvious. The library community has been productively discussing these issues for some time; we hope to contribute by asking researchers directly what they would expect and want from a data publication.

Who should take it?

We are interested in responses from anyone doing research in any branch of the Sciences or Social Sciences at any level (but especially PIs and postdocs).

What do you hope to learn?

  • What do researchers think it means to “publish” data? What do they expect from “peer review” of data?
  • As creators of data, how do they want to be credited? What do they think is adequate?
  • As users of published data, what would help them decide whether to work with a dataset?
  • In evaluating their colleagues, what dataset metrics are most useful? What would be most impressive to, for instance, tenure & promotions committees?

What will you do with the results?

The results will inform the CDL’s vision of data publication and influence our efforts. Additionally, the results will be made public for use by anyone.

What do you want from me?

If you are a researcher, please take 5-10 minutes to complete the survey and consider telling your colleagues about it.

If you are a librarian or other campus staff, please consider forwarding the link to any researchers, departments, or listservs that you feel are appropriate. The text of an email describing the survey can be found here.

The survey can be found at:

http://goo.gl/PuIVoC


If you have any questions or concerns, email me or comment on this post.

My picks for #AGU13

Next week, the city of San Francisco will be overrun with nerds. More specifically, more than 22,000 geophysicists, oceanographers, geologists, seismologists, meteorologists, and volcanologists will be descending upon the Bay Area to attend the 2013 American Geophysical Union Fall Meeting.

If you are among the thousands of attendees, you are probably (like me) overwhelmed by the plethora of sessions, speakers, posters, and mixers. In an effort to force myself to look at the schedule well in advance of the actual meeting, I’m sharing my picks for must-sees at the AGU meeting below.

Note! I’m co-chairing “Managing Ecological Data for Effective Use and Reuse” along with Amber Budden of DataONE and Karthik Ram of rOpenSci. Prepare for a great set of talks about DMPTool, rOpenSci, DataONE, and others.

Session Title | Abbr | Type | Day | Time
Translating Science into Action: Innovative Services for the Geo- and Environmental- Sciences in the Era of Big Data I | GC11F | Oral | Mon | 8:00 AM
Data Curation, Credibility, Preservation Implementation, and Data Rescue to Enable Multi-source Science I | IN11D | Oral | Mon | 8:00 AM
Data Curation, Credibility, Preservation Implementation, and Data Rescue to Enable Multi-source Science II | IN12A | Oral | Mon | 10:20 AM
Enabling Better Science Through Improving Science Software Development Culture I | IN22A | Oral | Tue | 10:20 AM
Collaborative Frameworks and Experiences in Earth and Space Science Posters | IN23B | Poster | Tue | 1:40 PM
Enabling Better Science Through Improving Science Software Development Culture II Posters | IN23C | Poster | Tue | 1:40 PM
Managing Ecological Data for Effective Use and Reuse I | ED43E | Oral | Thu | 1:40 PM
Open-Source Programming, Scripting, and Tools for the Hydrological Sciences II | H51R | Oral | Fri | 8:00 AM
Data Stewardship in Theory and in Practice I | IN51D | Oral | Fri | 8:00 AM
Managing Ecological Data for Effective Use and Reuse II Posters | ED53B | Poster | Fri | 1:40 PM

Download the full program as a PDF

Previous Data Pub blog post about AGU: Scientific Data at AGU 2011

A forthcoming experiment in data publication

What we’re doing:

Like these dapper gentlemen, as small or as large as needed… From The Public Domain Review.

Some time next year, the CDL will start an experiment in data publication. Our version of data publication will look like lightweight, non-peer reviewed dataset descriptions. These publications are designed to be flexible in structure and size. At a minimum, each document must have six elements:

  • Title
  • Creator(s)
  • Publisher
  • Publication year
  • Identifier (e.g. DOI or ARK)
  • Citation to the dataset

This bare-bones document can expand to be richly descriptive, with optional items like subject keywords, version number, spatial or temporal range, collection methods, and as much description as the author cares to supply.
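For concreteness, a record containing only the six required elements might look something like the following sketch (all values are invented purely for illustration; this is not a fixed schema):

# a minimal, purely hypothetical record with just the six required elements
minimal_record <- list(
  title            = "Example dataset title",
  creators         = c("Researcher One", "Researcher Two"),
  publisher        = "Example Data Repository",
  publication_year = 2014,
  identifier       = "doi:10.xxxx/example",  # DOI or ARK
  dataset_citation = "Researcher One, Researcher Two (2014), Example dataset title, Example Data Repository, doi:10.xxxx/example"
)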

Why we’re doing it:

The general agreement expressed in the recently released draft FORCE11 Declaration of Data Citation Principles – that datasets should be treated like “first class” research objects in how they are discovered, cited, and recognized – is still far from reality. Datasets are largely invisible to search engines, and authors rarely cite them formally.

A solution being implemented by a number of journals (e.g. Nature Scientific Data and Geoscience Data Journal) is to publish proxy objects for discovery and citation called “data descriptors” or, more commonly, “data papers”. Data papers are formal scholarly publications that describe a dataset’s rationale and collection methods, but don’t analyze the data or draw any conclusions. Peer reviewers ensure that the paper contains all the information needed to use, re-use, or replicate the dataset.

The strength of the data paper approach– creators must write up rich and useful metadata to pass peer review– leads directly to the weakness: a data paper often takes more time and energy to produce than dataset creators are willing to invest. In a 2011 survey, researchers said that the biggest impediment to publishing data is lack of time. For researchers who manage to publish datasets but lack time to write and submit (and revise and resubmit) a data paper, we will provide some of the benefits of a data paper at none of the cost.

How we’re doing it:

We will publish these documents through EZID (easy-eye-dee), an identifier service that has supplied DataCite DOIs to over 167,000 datasets. All of the dataset metadata records have at least the five elements required by the DataCite metadata schema, more than 2,000 already have abstracts, and another 2,000 have other kinds of descriptive metadata. EZID will begin using dataset metadata to automatically generate publications that can be viewed as HTML in a web browser or as a dynamically generated PDF. The documents will be hosted by EZID in a format optimized for indexing by search engines like Google and Google Scholar.

Dataset creators won’t have to do anything to get a publication that they don’t already have to do to get a DOI. If the creator only fills in the required metadata, the document will function as a cover-sheet or landing page. If they submit an abstract and methods, the document expands to begin to look like a traditional journal article (while retaining the linking functionality of a landing page). It will capture as much effort as the researcher puts forth, whether that’s a lot or very little.

Do you have thoughts or comments on our idea? We would love to hear from you! Comment on this blog post or email us at uc3@ucop.edu.