Neuroimaging as a case study in research data management: Part 2
Part 2: On practicing what we preach
A few weeks ago I described the results of a project investigating the data management practices of neuroimaging researchers. The main goal of this work is to help inform efforts to address rigor and reproducibility in both the brain imaging (neuroimaging) and academic library communities. But, as we were developing our materials, a second goal emerged: practice what we preach and actually apply the open science methods and tools we in the library community have been recommending to researchers.
Wait, what? Open science methods and tools
Before jumping into my experience of using open science tools as part of a project that involves investigating open science practices, it’s probably worth taking a step back and defining what the term actually means. It turns out this isn’t exactly easy. Even researchers working in the same field understand and apply open science in different ways. To make things simpler for ourselves when developing our materials, we used “open science” broadly to refer to the application of methods and tools that make the processes and products of the research enterprise available for examination, evaluation, use, and re-purposing by others. This definition doesn’t address the (admittedly fuzzy) distinctions between related movements such as open access, open data, open peer review, and open source, but we couldn’t exactly tackle all of that in a 75-question survey.
From programming languages used for data analysis like Python and R, to collaboration platforms like GitHub and the Open Science Framework (OSF), to writing tools like LaTeX and Zotero, to data sharing tools like Dash, figshare, and Zenodo, there are A LOT of different methods and tools that fall under the category of open science. Some of them worked for our project, some of them didn’t.
Data Analysis Tools
As both an undergraduate and a graduate student, all of my research methods and statistics courses involved analyzing data with SPSS. Even putting aside the considerable (and recurrent) cost of an SPSS license, I wanted to go in a different direction in order to get some first-hand experience with the breadth of analysis tools that have been developed and popularized over the last few years.
I thought about trying my hand at a Jupyter notebook, which would have allowed us to share all of our data and analyses in one go. However, I also didn’t want to delay things while I taught myself how to work within a new analysis environment. From there, I tried a few “SPSS-like” applications like PSPP and Jamovi and would recommend both to anyone who has a background like mine and isn’t quite ready to start writing code. I ultimately settled on JASP because, after taking a cursory look through our data using Excel (I know, I know), I saw that it was actually being used by the participants in our sample. It turns out that’s probably because it’s really intuitive and easy to use. Now that I’m not in the middle of analyzing data, I’m going to spend some time learning other tools. But, while I do that, I’m going to keep using and recommending JASP.
From the very beginning, we planned on making our data open. Though I wasn’t necessarily thinking about it at the time, this turned out to be another good reason to try something other than SPSS: though there are workarounds, .sav is not exactly an open file format. But our plan to make the data open not only affected the choice of analysis tools, it also affected how I felt while running the various statistical tests. On one hand, knowing that other researchers would be able to dive deep into our data amplified my normal anxiety about checking and re-checking (and statcheck-ing) the analyses. On the other hand, it also greatly reduced my anxiety about inadvertently relegating an interesting finding to the proverbial file drawer.
Collaboration Tools
When we first started, it seemed sensible to create a repository on the Open Science Framework in order to keep our various files and tools organized. However, since our collaboration is between just two people and there really aren’t that many files and tools involved, it became easier to just use services that were already incorporated into our day-to-day work: namely email, Skype, Google Drive, and Box. Though I can see how the OSF could be useful for a project with more moving parts, for our purposes it mostly just added an unnecessary extra step.
Writing Tools
This is where I restrain myself from complaining too much about LaTeX. Personally, I find it a less than awesome platform for doing any kind of collaborative writing. Since we weren’t writing chunks of code, I also couldn’t find an excuse to write the paper in R Markdown. Almost all of the collaborative writing I’ve done since graduate school has been in Google docs, and this project was no exception. It’s not exactly the best when it comes to formatting text or integrating with tables and figures, but I haven’t found a better tool for working on a text with other people.
We used a Mendeley folder to share papers and keep our citations organized. Zotero has the same functionality, but I personally find Mendeley slightly easier to use. In retrospect, we could also have used something like F1000 Workspace, which has more direct integration with Google docs.
This project is actually the first time I’ve published a preprint. Like making our data open, this was the plan all along. The formatting was done in Overleaf, mostly because it was a (relatively) user-friendly way to use LaTeX and I was worried our tables and figures would break the various MS Word bioRxiv templates that are floating around. Similar to making our data open, planning to publish a preprint had an impact on the writing process. I’ve since noticed a typo or two, but knowing that people would be reading our preprint only days after its submission made me especially anxious to check the spelling, grammar, and general flow of our paper. On the other hand, it was a relief to know that the community would be able to read the results of a project that started at the very beginning of my postdoc before its conclusion.
Data Sharing Tools
Our survey and data are both available via figshare. More specifically, we submitted our materials to KiltHub, Carnegie Mellon’s instance of figshare for institutions. For those of you out there currently raising an eyebrow, we didn’t submit to Dash, UC3’s data publication platform, because of an agreement outlined when we were going through the IRB process. Overall, the submission was relatively straightforward, though the curation process definitely made me consider how difficult it is to balance adding proper metadata and documentation to a project with the desire (or need) to just get material out there quickly.
A few more thoughts on working openly
More than once over the course of this project I joked to myself, my collaborator, or really anyone who would listen that “this would probably be easier or quicker if we could just do it the old way.” However, now that we’re at a point where we’ve submitted our paper (to an open access journal, of course), it’s been useful to look back on what it has been like to use these different open science methods and tools. My main takeaways are that there are a lot of ways to work openly and that what works for one researcher may not necessarily work for another. Most of the work I’ve done as a postdoc has been about meeting researchers where they are, and this process has reinforced my desire to do so when talking about open science, even when the researcher in question is myself.
Like our study participants, who largely reported that their data management practices are motivated and limited by immediate practical concerns, a lot of our decisions about which open science methods and tools to apply were heavily influenced by the need to keep our project moving forward. As much as I may have wanted to, I couldn’t pause everything to completely change how I analyze data or write papers. We committed ourselves to working openly, but we also wanted to make sure we had something to show for ourselves.
Additional Reading
Borghi, J. A., & Van Gulick, A. E. (2018). Data management and sharing in neuroimaging: Practices and perceptions of MRI researchers. bioRxiv.
We Are Talking Loudly and No One Is Listening
“Listening is not merely not talking, though even that is beyond most of our powers; it means taking a vigorous, human interest in what is being told to us” — Alice Duer Miller
A couple of months ago I wrote about how we need to advocate for data sharing and data management with more focus on adoption and less on technical backends. I thought this was the key to getting researchers to change their practices and make data available as part of their normal routines. But there’s more we need to change than just not arguing over platforms: we need to listen.
We are talking loudly and saying nothing.
I routinely visit campuses to lead workshops on data publishing (both train-the-trainer sessions for librarians and sessions for researchers). Regardless of the material presented, there are always two different conversations happening in the room. At each session, librarians pose technical questions about backend technologies and integrations with scholarly publishing tools (e.g., ORCiD). These are great questions for a scholarly publishing conference but confusing for researchers. This is how workshops start:
Daniella: “Who knows what Open Access is?”
Fewer than 50% of the researchers in the room raised their hands.
Daniella: “Has anyone here been asked to share their data, or understand what this means?”
Fewer than 20% of the researchers in the room raised their hands.
Daniella: “Does anyone here know what an ORCiD is, or have one?”
One person in total raised their hand.
We are talking too loudly and no one is listening.
We have characterized ‘Open Data’ as a success because incentives exist and authors write data statements, but this misconception has allowed the library community to focus on scholarly communications infrastructure instead of continuing to work on the issue at hand: sharing research data is not well understood, incentivized, or accessible. We need to focus our efforts on listening to the research community about what their processes are and how data sharing could be a part of them, and then take this as guidance in advocating for our library resources to become a part of lab norms.
We need to be focusing our efforts on education around HOW to organize, manage, and publish data.
Change will come when organizing data to be shared throughout the research process is the norm. Our goal should be to grow adoption of sharing and managing data and, as a result, to see an increase in researchers who know how to organize and publish data. Less talk about why data should be available, and more hands-on work getting research data into repositories in ways that are accessible and desirable to researchers.
We need to only build tools that researchers WANT.
The library community has lots of ideas about what is a priority right now in the data world, such as curation, data collections, and badges, but we are getting ahead of ourselves. While these initiatives may be shinier and more exciting, it feels like we are polishing marathon trophies before runners can finish a one-mile jog. And we’re not doing a good job of understanding their perspectives on running in the first place.
Before we can convince researchers that they should care about library curation and ‘FAIR’ data, we need to get researchers to even think about managing and publishing data as a normal part of research. This begins with organization at the lab level and with figuring out ways to integrate data publishing systems into lab practice without disrupting normal activity. When researchers are concerned about finishing their experiments, publishing, and their careers, simply naming platforms they should be using is neither effective nor helpful. What is effective is finding ways to relieve publishing pain points and make the process easier; tools and services for researchers should be understood as ways to do exactly that.
“When you listen, it’s amazing what you can learn. When you act on what you’ve learned, it’s amazing what you can change.” — Audrey McLaughlin
Librarians: this is a space where you can make an impact. Be the translators. Listen to what the researchers want, understand the research day-to-day, and translate to the infrastructure and policy makers what would be effective tools and incentives. If we spend less time and fewer resources building tools, services, and guides that will never be utilized or appreciated, we can be more effective in our jobs by re-focusing on the needs of the research community as expressed by the research community. Let’s act like scientists: gather evidence, and build our conclusions and tools on top of it. The first step is to engage with the research community in a way that allows us to gather that evidence. And if we do that, maybe we can start translating for an audience that wants to learn the scholarly communication tools and language, and we can each achieve our goals of making research available, usable, and stable.
Neuroimaging as a case study in research data management: Part 1
Part 1: What we did and what we found
This post was originally posted on Medium.
How do brain imaging researchers manage and share their data? This question, posed rather flippantly on Twitter a year and a half ago, prompted a collaborative research project. To celebrate the recent publication of a bioRxiv preprint, here is an overview of what we did, what we found, and what we’re looking to do next.
What we did and why
Magnetic resonance imaging (MRI) is a widely-used and powerful tool for studying the structure and function of the brain. Because of the complexity of the underlying signal, the iterative and flexible nature of analytical pipelines, and the cost (measured in terms of both grant funding and person hours) of collecting, saving, organizing, and analyzing such large and diverse datasets, effective research data management (RDM) is essential in research projects involving MRI. However, while the field of neuroimaging has recently grappled with a number of issues related to the rigor and reproducibility of its methods, information about how researchers manage their data within the laboratory remains mostly anecdotal.
Within and beyond the field of neuroimaging, efforts to address rigor and reproducibility often focus on problems such as publication bias and sub-optimal methodological practices and solutions such as the open sharing of research data. While it doesn’t make for particularly splashy headlines (unlike, say, this), RDM is also an important component of establishing rigor and reproducibility. If experimental results are to be verified and repurposed, the underlying data must be properly saved and organized. Said another way, even openly shared data isn’t particularly useful if you can’t make sense of it. Therefore, in an effort to inform the ongoing conversation about reproducibility in neuroimaging, Ana Van Gulick and I set out to survey the RDM practices and perceptions of the active MRI research community.
https://twitter.com/JohnBorghi/status/758030771097636869
With input from several active neuroimaging researchers, we designed and distributed a survey that described RDM-related topics using language and terminology familiar to researchers who use MRI. Questions inquired about the type(s) of data collected, the use of analytical tools, procedures for transferring and saving data, and the degree to which RDM practices and procedures were standardized within laboratories or research groups. Building on my work to develop an RDM guide for researchers, we also asked participants to rate the maturity of both their own RDM practices and those of the field as a whole. Throughout the survey, we were careful to note that our intention was not to judge researchers with different styles of data management and that RDM maturity is largely orthogonal to the sophistication of data collection and analysis techniques.
Wait, what? A brief introduction to MRI and RDM.
Magnetic resonance imaging (MRI) is a medical imaging technique that uses magnetic fields and radio waves to create detailed images of organs and tissues. Widely used in medical settings, MRI has also become an important tool for neuroscience researchers, especially since the development of functional MRI (fMRI) in the early 1990s. By detecting changes in blood flow that are associated with changes in brain activity, fMRI allows researchers to non-invasively study the structure and function of the living brain.
Because there are so many perspectives involved, it is difficult to give a single comprehensive definition of research data management (RDM). But, basically, the term covers activities related to how data is handled over the course of a research project. These activities include, but are certainly not limited to, those related to how data is organized and saved, how procedures and decisions are documented, and how research outputs are stored and shared. Many academic libraries have begun to offer services related to RDM.
Neuroimaging research involving MRI presented something of an ideal case for studying RDM among active researchers. The last few years have seen a rapid proliferation of standards, tools, and best practice recommendations related to the management and sharing of MRI data. Neuroimaging research also crosses many topics relevant to RDM support providers, such as data sharing and publication, the handling of sensitive data, and the use and curation of research software. Finally, as neuroimaging researchers who now work in academic libraries, we are uniquely positioned to work across the two communities.
What we found
After developing our survey and receiving the appropriate IRB approvals, we solicited responses during Summer 2017. A total of 144 neuroimaging researchers participated, and their responses revealed several trends that we hope will be informative both for neuroimaging researchers and for data support providers in academic libraries.
As shown below, our participants indicated that their RDM practices throughout the course of a research project were largely motivated by immediate practical concerns, such as preventing the loss of data and ensuring access for everyone within a lab or research group, and limited by a lack of time and of discipline-specific best practices.

We were relatively unsurprised to see that neuroimaging researchers use a wide array of software tools to analyze their often heterogeneous sets of data. What did surprise us somewhat were the different responses from trainees (graduate students and postdocs) and faculty on questions related to the consistency of RDM practices within their labs. Trainees were significantly less likely than faculty to say that practices related to backing up, organizing, and documenting data were standardized within their lab, which we think highlights the need for better communication about how RDM is an essential component of ensuring that research is rigorous and reproducible.
Analysis of RDM maturity ratings revealed that our sample generally rated their own RDM practices as more mature than those of the field as a whole, and rated practices during the data collection and analysis phases of a project as significantly more mature than those during the data sharing phase. There are several interpretations of the former result, but the latter is consistent with the low level of data sharing in the field. Though these ratings provide interesting insight into the perceptions of the active research community, we believe there is substantial room for improvement in establishing proper RDM across every phase of a project, not just after the data has already been analyzed.

For a complete overview of our results, including an analysis of how the field of neuroimaging is at a major point of transition when it comes to the adoption of practices including open access publishing, preregistration, and replication, check out our preprint now on bioRxiv. While you’re at it, feel free to peruse, reuse, or remix our survey and data, both of which are available on figshare.
Is this unique to MRI research?
Definitely not. Just as the consequences of sub-optimal methodological practices and publication biases have been discussed throughout the biomedical and behavioral sciences for decades, we suspect that the RDM-related practices and perceptions observed in our survey are not limited to neuroimaging research involving MRI.
To paraphrase and reiterate a point made in the preprint, this work was intended to be descriptive not prescriptive. We also very consciously have not provided best practice recommendations because we believe that such recommendations would be most valuable (and actionable) if developed in collaboration with active researchers. Moving forward, we hope to continue to engage with the neuroimaging community on issues related to RDM and also expand the scope of our survey to other research communities such as psychology and biomedical science.
Additional Reading
Our preprint, one more time:
- Borghi, J. A., & Van Gulick, A. E. (2018). Data management and sharing in neuroimaging: Practices and perceptions of MRI researchers. bioRxiv.
For a primer on functional magnetic resonance imaging:
- Soares, J. M., Magalhães, R., Moreira, P. S., Sousa, A., Ganz, E., Sampaio, A., … Sousa, N. (2016). A hitchhiker’s guide to functional magnetic resonance imaging. Frontiers in Neuroscience, 10, 1–35.
For more on rigor, reproducibility, and neuroimaging:
- Nichols, T. E., Das, S., Eickhoff, S. B., Evans, A. C., Glatard, T., Hanke M., … Yeo, B. T. T. (2017). Best practices in data analysis and sharing in neuroimaging using MRI. Nature Neuroscience, 20(3), 299–303. (Preprint)
- Poldrack, R. A., Baker, C. I., Durnez, J., Gorgolewski, K. J., Matthews, P. M., Munafò, M. R., … Yarkoni, T. (2017). Scanning the horizon: Towards transparent and reproducible neuroimaging research. Nature Reviews Neuroscience, 18(2), 115–126. (Preprint)
Support your Data
Building an RDM Maturity Model: Part 4
By John Borghi
Researchers are faced with rapidly evolving expectations about how they should manage and share their data, code, and other research products. These expectations come from a variety of sources, including funding agencies and academic publishers. As part of our effort to help researchers meet these expectations, the UC3 team spent much of last year investigating current practices. We studied how neuroimaging researchers handle their data, examined how researchers use, share, and value software, and conducted interviews and focus groups with researchers across the UC system. All of this has reaffirmed our perception that researchers and other data stakeholders often think and talk about data in very different ways.
Such differences are central to another project, which we’ve referred to alternately as an RDM maturity model and an RDM guide for researchers. Since its inception, the goal of this project has been to give researchers tools to self-assess their data-related practices and access the skills and experience of data service providers within their institutional libraries. Drawing upon tools with convergent aims, including maturity-based frameworks and visualizations like the research data lifecycle, we’ve worked to ensure that our tools are user friendly, free of jargon, and adaptable enough to meet the needs of a range of stakeholders, including different research, service provider, and institutional communities. To this end, we’ve renamed this project yet again to “Support your Data”.

What’s in a name?
Because our tools are intended to be accessible to people with a broad range of perceptions, practices, and priorities, coming up with a name that encompasses complex concepts like “openness” and “reproducibility” proved to be quite difficult. We also wanted to capture the spirit of terms like “capability maturity” and “research data management (RDM)” without referencing them directly. After spending a lot of time trying to come up with something clever, we decided that the name of our tools should describe their function. Since the goal is to support researchers as they manage and share data (in ways potentially influenced by expectations related to openness and reproducibility), why not just use that?
Recent Developments
In addition to thinking through the name, we’ve also refined the content of our tools. The central element, a rubric that allows researchers to quickly benchmark their data-related practices, is shown below. As before, it highlights how the management of research data is an active and iterative process that occurs throughout the different phases of a project. Activities in different phases are represented in different rows. Proceeding left to right, a series of declarative statements describes the specific activities within each phase, ordered by how well they foster future access to and use of the data.

The four levels, “ad hoc”, “one-time”, “active and informative”, and “optimized for re-use”, are intended to be descriptive rather than prescriptive.
- Ad hoc — Refers to circumstances in which practices are neither standardized nor documented. Every time researchers have to manage their data, they have to design new practices and procedures from scratch.
- One-time — Refers to circumstances in which data management occurs only when it is necessary, such as in direct response to a mandate from a funder or publisher. Practices or procedures implemented at one phase of a project are not designed with later phases in mind.
- Active and informative — Refers to circumstances in which data management is a regular part of the research process. Practices and procedures are standardized, well documented, and well integrated with those implemented at other phases.
- Optimized for re-use — Refers to circumstances in which data management activities are designed to facilitate the re-use of data in the future.
Each row of the rubric is tied to a one-page guide that provides specific information about how to advance practices as desired or required. Development of the content of the guides has proceeded sequentially. During the autumn and winter of 2017, members of the UC3 team met to discuss issues relevant to each phase, reduce the use of jargon, and identify how content could be localized to meet the needs of different research and institutional communities. We are currently working on revising the content based on suggestions made during these meetings.
Next Steps
Now that we have scoped out the content, we’ve begun to focus on the design aspect of our tools. Working with CDL’s UX team, we’ve begun to think through the presentation of both the rubric and the guides in physical media and online.
As always, we welcome any and all feedback about content and application of our tools.
Dash: 2017 in Review
The goal for Dash in 2017 was to build out features that would make Dash a desirable place to publish data. While we continue to work with the research community to find incentives to publish data generally, the small team of us working on Dash wanted to take a moment to thank everyone who published data this year.
In 2017 we worked in two-week sprint intervals to release 26 features and instances (not including fixes).

In 2018 we have one major focus: integrate into researcher workflows to make publishing data a more common practice.
To do so we will be working with the community to:
- Release a read only API for download of datasets
- Release a Submission REST API for publishing and versioning datasets
- Implement ‘Make Data Count’: standardized data usage metrics so that researchers have standard views, downloads, and citations of published data
- Integrate with publishers (e.g., submit data to Dash while submitting an article to UC Press)
- Integrate with online lab notebooks (e.g., right-click and submit data after analysis, with accompanying metadata, from Jupyter notebooks)
- Talk to as many labs and researchers as possible to educate on data publishing and better understand incentives and needs
Follow along with our Github and Twitter and please get in touch with us if you have ideas or experiences to share for making data publishing a more common practice in the research environment.
Test-driving the Dash read-only API
The Dash Data Publication service now allows access to dataset metadata and public files through a read-only RESTful API. Documentation is available at https://dash.ucop.edu/api/docs/index.html.

There are a number of ways to test out and access this API, such as through programming language libraries or with the Linux curl command. This short tutorial gives examples of accessing the API using Postman, an easy GUI for testing and browsing an API that is available for the major desktop operating systems. If you’d like to follow along, please download Postman from https://www.getpostman.com/ .
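If you prefer working from code rather than a GUI, the same basic request can be made in a few lines of Python. This is a minimal sketch, not an official client; it assumes the third-party requests library is installed (pip install requests) and uses the API root URL and Content-Type header described in the Postman steps below.

```python
# Minimal sketch: query the Dash read-only API from Python.
import requests

# The tutorial sets a Content-Type: application/json header in Postman
# so the API returns JSON; we send the same header here.
response = requests.get(
    "https://dash.ucop.edu/api",
    headers={"Content-Type": "application/json"},
)
response.raise_for_status()  # Fail loudly on HTTP errors.
print(response.json())       # The API root, including its HAL links.
```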
We are looking for feedback on the first of our Dash APIs before we embark on building our submission API. Please get in touch with us with feedback or if you would be interested in setting up an API integration with the Dash service.
Create a Dash Collection in Postman
After you’ve installed Postman, open it and create a Dash collection to hold your queries against the API.
1. Open Postman.
2. Click New > Collection.

3. Enter the collection name and click Create.

Set Up Your First Request
1. Click the folder icon for the collection you just set up.

2. Click Add requests under this collection.

3. Fill in a name for your request, place it in the Dash collection you created earlier, and click Save to Dash to create it.

4. Click on the request you just created in the left bar and then click the headers tab.

5. Enter the following key and value in the header list. Key: Content-Type and Value: application/json. This header ensures that you’ll receive JSON data.

6. Enter the request URL in the box toward the top of the page. Leave the request type on “GET.” Enter https://dash.ucop.edu/api for the URL and click Save.
Try Your Request
1. Test out your request by clicking the Send button.
2. If everything is set up correctly you’ll see results like these.

Information about the API is being returned in JavaScript Object Notation (JSON) and includes a few features to become familiar with.
– A links section in the JSON exposes Hypertext Application Language (HAL) links that can guide you to other parts of the API, much like links in a web page allow you to browse other parts of a site.
– The self link refers to the current request.
– Other links can allow you to get further information to create other requests in the API.
– The curies section leads to some basic documentation that may be used by some software.
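To make the link-following pattern concrete, here is a hedged Python sketch that reads the HAL links from the API root and follows the stash:datasets link shown in the next section. The `_links` key is where HAL responses conventionally keep their links; treat the exact response layout as an assumption to verify against the real API.

```python
import requests

BASE = "https://dash.ucop.edu"
HEADERS = {"Content-Type": "application/json"}

# Fetch the API root and inspect its HAL links.
root = requests.get(BASE + "/api", headers=HEADERS).json()

# HAL convention: links live under "_links". The tutorial shows a
# "stash:datasets" link whose path is /api/datasets.
datasets_path = root["_links"]["stash:datasets"]["href"]

# Follow the link, just as Postman does when you click it.
datasets = requests.get(BASE + datasets_path, headers=HEADERS).json()
print(list(datasets))  # Top-level keys of the datasets response.
```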
Following Links and Viewing Dataset Information
Postman has a nice feature that allows you to follow links in an API to create additional requests.
1. Try it out by clicking the URL path associated with stash:datasets, which shows as /api/datasets.

2. You’ll see a new tab open for your new request toward the top of the screen and then you can submit or save the new request.

3. If you send this request you will see a lot of information about datasets in Dash.

Some things to point out about this request:
– The top-level links section contains paging links because this request returns a list of datasets. Not all datasets are returned at once, but if you needed to see more you could go to the next page.

– The list contains a count of items in the current page and a total for all items.
– When you look at the embedded datasets you’ll see additional links for each individual dataset, which you could also follow.

– The dataset information shown here is the metadata for the most recent, successfully submitted version of each dataset.

Hopefully this gives a general idea of how the API can be used, and now you can create additional requests to browse the Dash API.
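For those scripting rather than clicking, the paging behavior described above can be automated. This sketch walks the “next” links until they run out; the `_embedded` key and the `stash:datasets` embedded list name are assumptions based on HAL conventions and the link names shown in this tutorial, so check them against the actual responses.

```python
import requests

BASE = "https://dash.ucop.edu"
HEADERS = {"Content-Type": "application/json"}

url = BASE + "/api/datasets"
count = 0
while url:
    page = requests.get(url, headers=HEADERS).json()
    # HAL convention: embedded resources live under "_embedded"; the key
    # for the dataset list is assumed to match the "stash:datasets" link.
    count += len(page.get("_embedded", {}).get("stash:datasets", []))
    # Follow the "next" paging link until no more pages remain.
    next_link = page.get("_links", {}).get("next")
    url = BASE + next_link["href"] if next_link else None

print(count, "datasets seen across all pages")
```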
Hints for Navigating the Dash Data Model
As you browse through the different links in the Dash API, keep the following in mind.
– A dataset may have multiple versions. If it has only been edited or submitted once in the UI it will still have one version.
– The metadata shown at the dataset level is based on the latest published version and the dataset indicates the version number it is using to derive this metadata.
– Each version has descriptive metadata and files associated with it.
– To download files, look for the stash:download links. There are downloads for a dataset, a version and for an individual file. These links are standard HTTP downloads that could be downloaded using a web browser or other HTTP client.
– If you know the DOI of the dataset you wish to view, use a GET request for /api/datasets/<doi>.
– The DOI would be in a format such as doi:10.5072/FKK2K64GZ22 and needs to be URL encoded when included in the URL.
– See for example https://www.w3schools.com/tags/ref_urlencode.asp or https://www.urlencoder.org/ or use the URL encoding methods available in most programming languages.
– For datasets that are currently private for peer review, downloads will not become available until the privacy period has passed.
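Putting the last few hints together, here is a sketch that URL-encodes the example DOI from the hints above, requests that dataset directly, and looks for a stash:download link. As before, the exact JSON layout is an assumption based on the HAL structure described in this post.

```python
from urllib.parse import quote

import requests

BASE = "https://dash.ucop.edu"
HEADERS = {"Content-Type": "application/json"}

# The example DOI format from the hints above; it must be URL encoded
# because it contains reserved characters like ":" and "/".
doi = "doi:10.5072/FKK2K64GZ22"
dataset = requests.get(
    BASE + "/api/datasets/" + quote(doi, safe=""),
    headers=HEADERS,
).json()

# Look for the standard HTTP download link for the whole dataset.
download = dataset.get("_links", {}).get("stash:download")
if download:
    print("Download from:", BASE + download["href"])
```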
Where’s the adoption? Shifting the Focus of Data Publishing in 2018
At RDA10 in Montreal I gave a presentation on Dash in the Repository Platforms for Research Data IG session. The session was focused on backend technology and technology communities for repository platforms. I talked a bit about the Dash open source software and features but walked away thinking “How productive is it to discuss software systems to support research data at length? Is adoption based on technology?”
The answers are: not productive, and no.
Following RDA10, I spent months talking with as many researchers and institutions as possible to figure out how much researchers know about data publishing and what would incentivize them to make it a common practice.
Researchers are the end users of research data publishing platforms and yet they are providing the least amount of input into these systems.
And if you think that is confusing, there is an additional layer of disorder: “researchers” is used as an umbrella term for various levels of scientists and humanists who can have drastically different opinions and values based on discipline and status.
I visited labs and took PIs, grad students, and postdocs to coffee at UCSF, UC Berkeley, and UC Santa Cruz. Coming from a science background and having spent time convincing authors to make their data available at PLOS, I thought I had a pretty good sense of incentives, but I needed to span disciplines and leave behind the mindset of “you have to make your data available, or your paper will not be published” to hear researchers’ honest answers. Here’s what I found:
People like the idea of data publishing in theory, but in practice, motivation is lacking and excuses are prominent.
This is not surprising, though. The following is an example scenario (with real quotes) of how data publishing is perceived at various career stages (for some control, this scenario takes place within biomedical research):
Grad Student: “Data publishing sounds awesome, I would totally put my data out there when publishing my work but it’s really up to my PI and my PI doesn’t think it is necessary.”
Postdoc: “I like the idea, but if we put data in here, are people going to use my data before I can publish 3 Nature papers as first author?”
PI: “I like the idea of having my students put their work in an archive so I can have all research outputs from the lab in one place, but until my Vice Chancellor of Research (VCR) tells me it is a priority I probably won’t use it.”
VCR: “Funder and Publisher mandates aren’t incentivizing enough?”
Publisher: “We really believe the funder mandates are the stick here.”
As you can tell, there is no consensus of understanding, and there is a gap between the theoretical and practical implementation of data publishing. As one postdoc at UCSF said, “If I am putting on my academic hat, of course my motivation is the goodness of it. But, practically speaking, I’m not motivated to do anything.” With differing perspectives for each stakeholder, it is easy to see how difficult it is to gauge interest in data publishing!
Other reasons adoption of data publishing practices is difficult:
At conferences and within the scholarly communication world, we speak in jargon about sticks (mandates) and carrots (reproducibility, transparency). We are talking to each other: people who have already bought into these incentives and needs, living in an echo chamber. We forget that these mandates and reasons for open data are neither well understood by researchers themselves nor consistently effective. Mandates and justifications about being “for the good of science” are not uniformly understood across the lab: PIs are applying for grants and writing up Data Management Plans (DMPs), but the grad students and postdocs are doing the data analysis and submitting the paper. There is plenty of space here for miscommunication, misinformation, and difficulty.

We also say that reproducibility, transparency, and getting credit for your work are wide-ranging carrots, but reproducibility and transparency initiatives vary by field. Getting credit for publishing data seems easy (as it is for articles): authorship on a dataset and citations of the DOI credit the researchers who first published the data. But how can we say that researchers are currently “getting credit” for their data publications if citing data isn’t common practice, few publishers support data citations, and tenure committees aren’t looking at the reach of data?
We spend time telling one another that open data is a success because publishers have released X many data statements and repositories hold X many datasets. But editors and reviewers typically do not check (or want to check) data associated with publications to ensure they are the underlying data or that they are FAIR, and many high-volume repositories accept any sort of work (conference talks, PDFs, posters). How many articles have their associated data publicly available and in a usable format? How many deposits to repositories are usable research data? We must take these metrics with a grain of salt and understand that, while we are making progress, there are various avenues we must invest in to make the open data movement a success.
All of this comes back to researcher education and lowering the activation energy (i.e., making data publishing a common and accepted practice).
A provocative conversation to bridge people together:
In my presentation at CNI I scrolled through a number of quotes from researchers that I gathered during these coffee talks, and the audience laughed at many of them. The quotes are funny (or sad or realistic or [insert every range of emotion]), but even this reaction is reason for us to re-think our ways of driving adoption of research data management and open data practices. To be talking about technologies and features that aren’t requested by researchers is getting ahead of ourselves.
Right now there should be one focus: finding incentives and ways to integrate into workflows that effectively get researchers to open up and preserve their data.
When presenting this I was apprehensive but confident: I was presenting opinions and experiences, and hearing someone say “we’re doing it wrong” usually does not come with applause. What came of the presentation was a 30-minute discussion full of genuine experiences, honest opinions, and advice. Some discussion points that came up:
- Yale University: “Find the pain” — talk to researchers not about their dream features but about what would really help them with their data needs
- Elsevier, Institutions: A debate about, and interest in, what counts as a Supporting Information (SI) file and whether SI files are a gateway drug we should support. Note: I and a few others agreed that no, publishing a table already in the article should not be rewarded; that would be positive reinforcement that common practices are good enough
- Duke University: Promoting open and preserved data as a way for PIs to reduce panic when students join and leave the lab, keeping an archived set of work from past grad students (who still receive authorship of the dataset)
- Claremont McKenna Colleges: Are incentives and workflows different per institution and institutional level, or should the focus be on domains/disciplines? Note: Researchers typically do not limit their focus to the institution level but rather look to their field, so this may be the better place to align (rather than institutional policies and incentives).
The general consensus was that we have to re-focus on researcher needs and integrate into researcher workflows. To do this successfully:
- We need to check our language.
- We need to ensure that our primary drive in this community is to build services and tools that make open data and data management common practices in the research workflows.
- We need to share our experiences and work with all research stakeholders to understand the landscape and needs (and not refer to an unrealistic lifecycle).
So, let’s work together. Let’s talk to as many researchers, in as many domains and at as many career levels as possible, in 2018. Let’s share these experiences when we meet at conferences and on social media. And let’s focus on the adoption of a practice (data publishing) instead of spotlighting technologies, to make open data a common, feasible, and incentivized success.
Dash Updates: Fall, 2017
Throughout the summer the Dash team has focused on features that better integrate with researcher workflows. The goal: make data publishing as easy as possible.
With that, here are the releases now up on the Dash site. Please feel free to use our demo site dashdemo.ucop.edu to test features and practice submitting data.
- Dash enabled co-author ORCiDs: all listed co-authors now have the ability to link their ORCiD iD with their data publication.
- Dash notifies “administrators” (set for each instance: campus data librarians and publishing staff) when data are deposited, so researchers can get assistance enhancing their metadata (to make data more reproducible, transparent, and discoverable).
- Dash has rich text editing. The abstract, methods, and usage notes fields now have HTML text editors that allow for stylistic text editing to properly format information about the data publication.
- Dash allows for individual file download. All versions of a dataset may now be downloaded at the file level, not just as the entire dataset.
- Dash welcomes UC Davis. Researchers at UC Davis may now publish and share their research data at dash.ucdavis.edu.
- Dash welcomes UC Press journal Elementa. Authors submitting to Elementa may now use UC Press Dash for all data supporting journal publications.
So, what is Dash working on now?
In order to integrate with various aspects of researcher workflows, Dash needs an open REST API. The first API being built is a new deposit API. The team is talking with the repository community and gathering use cases to map out how Dash can integrate with journals and online lab notebooks, offering alternate ways of submitting data that are more in line with researcher workflows.
OA Week 2017: Maximizing the value of research
By John Borghi and Daniella Lowenberg
Happy Friday! This week we’ve defined open data, discussed some notable anecdotes, outlined publisher and funder requirements, and described how open data helps ensure reproducibility. To cap off Open Access Week, let’s talk about one of the principal benefits of open data: it helps to maximize the value of research.

Research is expensive. There are different ways to break it down but, in the United States alone, billions of dollars are spent funding research and development every year. Much of this funding is distributed by federal agencies like the National Institutes of Health (NIH) and the National Science Foundation (NSF), meaning that taxpayer dollars are directly invested in the research process. The budgets of these agencies are under pressure from a variety of sources, meaning that there is increasing pressure on researchers to do more with less. Even if budgets weren’t stagnating, researchers would be obligated to ensure that taxpayer dollars aren’t wasted.
The economic return on investment for federally funded basic research may not be evident for decades, and overemphasizing certain outcomes can lead to the issues discussed in yesterday’s post. But making data open doesn’t just mean giving access to other researchers; it also means giving taxpayers access to the research they paid for. Open data also enables reuse and recombination, meaning that a single financial investment can actually fund any number of projects and discoveries.
Research is time consuming. In addition to funding dollars, the cost of research can be measured in the hours it takes to collect, organize, analyze, document, and share data. “The time it takes” is one of the primary reasons researchers cite when asked why they do not make their data open. However, while it certainly takes time to organize and document open data in such a way as to enable its use by others, making data open can actually save researchers time over the long run. For example, one consequence of the file drawer problem discussed yesterday is that researchers may inadvertently redo work already completed, but not published, by others. Making data open helps prevent this kind of duplication, which saves time and grant funding. And the beneficiaries of open data aren’t just other researchers: the organization and documentation involved in making data open can save researchers from having to redo their own work as well.
Research is expensive and time consuming for more than just researchers. One of the key principles for research involving human participants is beneficence: maximizing possible benefits while minimizing possible risks. Providing access to data by responsibly making it open increases the chances that researchers will be able to use it to make discoveries that result in significant benefits. Said another way, open data ensures that the time and effort graciously contributed by human research participants helps advance knowledge in as many ways as possible.
Making data open is not always easy. Organization and documentation take time. De-identifying sensitive data so that it can be made open responsibly can be less than straightforward. Understanding why doesn’t automatically translate into knowing how. But we hope this week has given you some insight into the advantages of open data, both for individual researchers and for everyone who engages with, publishes, pays for, and participates in the research process.
OA Week 2017: Transparency and Reproducibility
By John Borghi and Daniella Lowenberg
Yesterday we talked about why researchers may have to make their data open; today let’s start talking about why they may want to.
Though some communities have been historically hesitant to do so, researchers appear to be increasingly willing to share their data. Open data even seems to be associated with a citation advantage, meaning that as datasets are accessed and reused, the researchers involved in the original work continue to receive credit. But open data is about more than just complying with mandates and increasing citation counts, it’s also about researchers showing their work.
From discussions about publication decisions to declarations that “most published research findings are false”, concerns about the integrity of the research process go back decades. Nowadays, it is not uncommon to see the term “reproducibility” applied to any effort aimed at addressing the misalignment between good research practices, namely those emphasizing transparency and methodological rigor, and academic reward systems, which generally emphasize the push to publish only the most positive and novel results. Addressing reproducibility means addressing a range of issues related to how research is conducted, published, and ultimately evaluated. But, while the path to reproducibility is a long one, open data represents a crucial step forward.
“While the path to reproducibility is a long one, open data represents a crucial step forward.”
One of the most popular targets of reproducibility-related efforts is p-hacking, a term that refers to the practice of applying different methodological and statistical techniques until non-significant results become significant. The practice of p-hacking is not always intentional, but appears to be quite common. Even putting aside some truly astonishing headlines, p-hacking has been cited as a major contributor to the reproducibility crisis in fields such as psychology and medicine.
One application of open data is sharing the datasets, documentation, and other materials needed to reproduce the results described in a journal article, allowing other researchers (including peer reviewers) to check for errors and ensure that the conclusions discussed in the paper are supported by the underlying data and methods. This type of validation doesn’t necessarily prevent p-hacking, but it does increase the degree to which researchers are accountable for explaining marginally significant results.
But the impact of open data on reproducibility goes far beyond combatting p-hacking. Consider publication biases such as the file drawer problem, which refers to the tendency of researchers to publish papers describing studies with positive results while relegating studies with negative or nonconfirmatory results to the proverbial file drawer. Along with problems related to small sample sizes, this tendency majorly skews the effects described in the scientific literature. Open data provides a means for opening the file drawer, allowing researchers to share all of their results, even those that are negative or nonconfirmatory.
“Open data provides a means for opening the file drawer, allowing researchers to share all of their results- even those that are negative or nonconfirmatory.”
Open data is about researchers showing their work, being transparent about how they reach their conclusions, and providing their data for others to use and evaluate. This allows for validation and helps combat common but questionable research practices like p-hacking. But open data also helps advance reproducibility efforts in a way that is less confrontational, by allowing researchers to open the file drawer and share (and get credit for) all of their work.