Institutional Repositories: Part 2

CDL UC3, February 20, 2014

A few weeks back I wrote a post describing institutional repositories (IRs for short). IRs have been around for a while, with the impetus of making scholarly publications open access. More recently, however, IRs have been cited as potential repositories for datasets, code, and other scholarly outputs. Here I continue the discussion of IRs and compare their utility to that of discipline-specific repositories (DRs). Please note: although IRs are typically associated with open access publications, I discuss them here as potential repositories for data.

Honest criticism of IRs

In my discussions with colleagues at conferences and meetings, I have found that some are skeptical about the role of IRs in data access and preservation. I posit that this skepticism has a few origins:

IRs are often not intended for “self-service”, i.e., a researcher would need to connect with IR support staff (often via a face-to-face meeting) in order to deposit material into the IR.
Many IRs were created at least five years ago, with interfaces that sometimes appear to pre-date Facebook. Academic institutions often have no budget for a redesign of the user interface, which means those who visit an IR might be put off by its appearance and/or functionality.
IRs are run by libraries and IT departments, neither of which is known for self-promotion. Many (most?) researchers are likely unaware of their IR’s existence, and would not think to check in with the libraries regarding their data preservation needs.

These are all valid criticisms of many existing IRs. But there is one huge advantage to IRs over other data repositories: they are owned and operated by academic institutions that have a vested interest in preserving and providing access to scholarly work.

The bright side

IRs aren’t all bad, or I wouldn’t be blogging about them. I believe that they are undergoing a rebirth of sorts: they are now seen as viable places for datasets and other scholarly outputs. Institutions like Purdue are putting IRs at the center of their initiatives around data management, access, and preservation. Here at the CDL, the UC3 group is pursuing the implementation of a data curation platform, DataShare, to allow self-service deposit of datasets into the Merritt Repository (see the UCSF DataShare site). Recent mandates from above requiring access to data resulting from federal grants mean that funders (like IMLS) and organizations (like ARL) are taking an interest in improving the utility of IRs.

IRs versus discipline-specific repositories

In my last post, I mentioned that selecting a repository for your data doesn’t need to be a choice between an IR and a discipline-specific repository (DR). Each type of repository has advantages and disadvantages, so using both makes sense.

DRs: ideal for data discovery and reuse

Often, DRs have collection policies for the specific types of data they are willing to accept. GenBank, for example, has standardized how you deposit your data, what types and formats of data it accepts, and the metadata accompanying that data. This all means that searching for data in GenBank is easy, and data users can readily download data for their own use. Another advantage of having a collection of similar, standardized data is the ability to build tools on top of these datasets, making reuse and meta-analyses easier.

The downside of DRs

By their nature, DRs are selective in the types of data that they accept. Consider this scenario, typical of many research projects: what if someone worked on a project that combined sequencing genes, collecting population demographics, and documenting location with GIS? Many DRs would not want to (or be able to) handle these disparate types of data. The result is that some of the data gets shared via a DR, while data less suitable for the DR is not shared at all.

In my work with the DataONE Community Engagement and Education working group, I reviewed what datasets were shared from NSF grants awarded between 2005 and 2009 (see Panel 1 in Hampton et al. 2013). Many of the resulting publications relied on multiple types of data. The percentage of those that shared all of the data produced was around 28%. However, of the data that was shared, 81% was in GenBank or TreeBase – likely due to the culture of data sharing around genetic work. That means most of the non-genetic data is not available, and potentially lost, despite its importance for the project as a whole. Enter: institutional repositories.

IRs: the whole enchilada

Unlike many DRs, IRs have the potential to host entire collections of data around a project – regardless of the type of data, its format, etc. My postdoctoral work on modeling the effects of temperature and salinity on copepod populations involved field collection, laboratory copepod growth experiments (which included logs of environmental conditions), food growth (algal density estimates and growth rates, nutrient concentrations), population size counts, R scripts, and the development of the mathematical models themselves. An IR could take all of these disparate datasets as a package, which I could then refer to in the publications that resulted from the work. A big bonus is that this package could sit next to other packages I’ve generated over the course of my career, making it easier for me to point people to the entire corpus of research work. The biggest bonus of all: having all of the data that produced a publication available at a single location helps ensure reproducibility and transparency.

Maybe you can have your cake (DRs) and eat it too (IRs). From Flickr by Mayaevening

There are certainly some repositories that could handle the type of data package I just described. The Knowledge Network for Biocomplexity is one such relatively generic repository (although I might argue that KNB is more like an IR than a discipline repository). Another is figshare, although this is a repository ultimately owned by a publisher. But as researchers start hunting for places to put their datasets, I would hope that they look to academic institutions rather than commercial publishers. (Full disclosure – I have data stored in figshare!)

Good news! You can have your cake and eat it too. Putting data in both the relevant DRs and more generic IRs is a good solution to ensure discoverability (DRs) and provenance (IRs).