Neuroimaging as a case study in research data management: Part 1
Part 1: What we did and what we found
This post was originally published on Medium.
How do brain imaging researchers manage and share their data? This question, posed rather flippantly on Twitter a year and a half ago, prompted a collaborative research project. To celebrate the recent publication of a bioRxiv preprint, here is an overview of what we did, what we found, and what we’re looking to do next.
What we did and why
Magnetic resonance imaging (MRI) is a widely-used and powerful tool for studying the structure and function of the brain. Because of the complexity of the underlying signal, the iterative and flexible nature of analytical pipelines, and the cost (measured in terms of both grant funding and person hours) of collecting, saving, organizing, and analyzing such large and diverse datasets, effective research data management (RDM) is essential in research projects involving MRI. However, while the field of neuroimaging has recently grappled with a number of issues related to the rigor and reproducibility of its methods, information about how researchers manage their data within the laboratory remains mostly anecdotal.
Within and beyond the field of neuroimaging, efforts to address rigor and reproducibility often focus on problems such as publication bias and sub-optimal methodological practices, and on solutions such as the open sharing of research data. While it doesn’t make for particularly splashy headlines (unlike, say, this), RDM is also an important component of establishing rigor and reproducibility. If experimental results are to be verified and repurposed, the underlying data must be properly saved and organized. Said another way, even openly shared data isn’t particularly useful if you can’t make sense of it. Therefore, in an effort to inform the ongoing conversation about reproducibility in neuroimaging, Ana Van Gulick and I set out to survey the RDM practices and perceptions of the active MRI research community.
https://twitter.com/JohnBorghi/status/758030771097636869
With input from several active neuroimaging researchers, we designed and distributed a survey that described RDM-related topics using language and terminology familiar to researchers who use MRI. Questions inquired about the type(s) of data collected, the use of analytical tools, procedures for transferring and saving data, and the degree to which RDM practices and procedures were standardized within laboratories or research groups. Building on my work to develop an RDM guide for researchers, we also asked participants to rate the maturity of both their own RDM practices and those of the field as a whole. Throughout the survey, we were careful to note that our intention was not to judge researchers with different styles of data management and that RDM maturity is largely orthogonal to the sophistication of data collection and analysis techniques.
Wait, what? A brief introduction to MRI and RDM.
Magnetic resonance imaging (MRI) is a medical imaging technique that uses magnetic fields and radio waves to create detailed images of organs and tissues. Widely used in medical settings, MRI has also become an important tool for neuroscience researchers, especially since the development of functional MRI (fMRI) in the early 1990s. By detecting changes in blood flow that are associated with changes in brain activity, fMRI allows researchers to non-invasively study the structure and function of the living brain.
Because there are so many perspectives involved, it is difficult to give a single comprehensive definition of research data management (RDM). But, basically, the term covers activities related to how data is handled over the course of a research project. These activities include, but are certainly not limited to, those related to how data is organized and saved, how procedures and decisions are documented, and how research outputs are stored and shared. Many academic libraries have begun to offer services related to RDM.
Neuroimaging research involving MRI presented something of an ideal case for studying RDM among active researchers. The last few years have seen a rapid proliferation of standards, tools, and best practice recommendations related to the management and sharing of MRI data. Neuroimaging research also crosses many topics relevant to RDM support providers, such as data sharing and publication, the handling of sensitive data, and the use and curation of research software. Finally, as neuroimaging researchers who now work in academic libraries, we are uniquely positioned to work across the two communities.
What we found
After developing our survey and receiving the appropriate IRB approvals, we solicited responses during Summer 2017. A total of 144 neuroimaging researchers participated, and their responses revealed several trends that we hope will be informative for both neuroimaging researchers and data support providers in academic libraries.
As shown below, our participants indicated that their RDM practices throughout the course of a research project were largely motivated by immediate practical concerns, such as preventing the loss of data and ensuring access for everyone within a lab or research group, and limited by a lack of time and of discipline-specific best practices.

We were relatively unsurprised to see that neuroimaging researchers use a wide array of software tools to analyze their often heterogeneous sets of data. What did surprise us somewhat was the difference in responses from trainees (graduate students and postdocs) and faculty on questions related to the consistency of RDM practices within their labs. Trainees were significantly less likely than faculty to say that practices related to backing up, organizing, and documenting data were standardized within their lab, which we think highlights the need for better communication about how RDM is an essential component of ensuring that research is rigorous and reproducible.
Analysis of RDM maturity ratings revealed that our sample generally rated their own RDM practices as more mature than those of the field as a whole, and rated practices during the data collection and analysis phases of a project as significantly more mature than those during the data sharing phase. There are several interpretations of the former result, but the latter is consistent with the low level of data sharing in the field. Though these ratings provide an interesting insight into the perceptions of the active research community, we believe there is substantial room for improvement in establishing proper RDM across every phase of a project, not just after the data has already been analyzed.

For a complete overview of our results, including an analysis of how the field of neuroimaging is at a major point of transition when it comes to the adoption of practices such as open access publishing, preregistration, and replication, check out our preprint now on bioRxiv. While you’re at it, feel free to peruse, reuse, or remix our survey and data, both of which are available on figshare.
Is this unique to MRI research?
Definitely not. Just as the consequences of sub-optimal methodological practices and publication biases have been discussed throughout the biomedical and behavioral sciences for decades, we suspect that the RDM-related practices and perceptions observed in our survey are not limited to neuroimaging research involving MRI.
To paraphrase and reiterate a point made in the preprint, this work was intended to be descriptive, not prescriptive. We also very consciously have not provided best practice recommendations, because we believe such recommendations would be most valuable (and actionable) if developed in collaboration with active researchers. Moving forward, we hope to continue to engage with the neuroimaging community on issues related to RDM and to expand the scope of our survey to other research communities such as psychology and biomedical science.
Additional Reading
Our preprint, one more time:
- Borghi, J. A., & Van Gulick, A. E. (2018). Data management and sharing in neuroimaging: Practices and perceptions of MRI researchers. bioRxiv.
For a primer on functional magnetic resonance imaging:
- Soares, J. M., Magalhães, R., Moreira, P. S., Sousa, A., Ganz, E., Sampaio, A., … Sousa, N. (2016). A hitchhiker’s guide to functional magnetic resonance imaging. Frontiers in Neuroscience, 10, 1–35.
For more on rigor, reproducibility, and neuroimaging:
- Nichols, T. E., Das, S., Eickhoff, S. B., Evans, A. C., Glatard, T., Hanke M., … Yeo, B. T. T. (2017). Best practices in data analysis and sharing in neuroimaging using MRI. Nature Neuroscience, 20(3), 299–303. (Preprint)
- Poldrack, R. A., Baker, C. I., Durnez, J., Gorgolewski, K. J., Matthews, P. M., Munafò, M. R., … Yarkoni, T. (2017). Scanning the horizon: Towards transparent and reproducible neuroimaging research. Nature Reviews Neuroscience, 18(2), 115–126. (Preprint)
Where’s the adoption? Shifting the Focus of Data Publishing in 2018
At RDA10 in Montreal I gave a presentation on Dash in the Repository Platforms for Research Data IG session. The session was focused on backend technology and technology communities for repository platforms. I talked a bit about the Dash open source software and features but walked away thinking “How productive is it to discuss software systems to support research data at length? Is adoption based on technology?”
The answers are: not productive, and no.
Following RDA10, I spent months talking with as many researchers and institutions as possible to figure out how much researchers know about data publishing and what would incentivize them to make it a common practice.
Researchers are the end users of research data publishing platforms and yet they are providing the least amount of input into these systems.
And if you think that is confusing, there is an additional layer of disorder: “researchers” is used as an umbrella term for various levels of scientists and humanists who can have drastically different opinions and values based on discipline and status.
I visited labs and took PIs, grad students, and postdocs to coffee at UCSF, UC Berkeley, and UC Santa Cruz. Coming from a science background and spending time convincing authors to make their data available at PLOS, I thought I had a pretty good sense of incentives, but I needed to span disciplines and leave the mindset of “you have to make your data available, or your paper will not be published” to hear researchers’ honest answers. Here’s what I found:
People like the idea of data publishing in theory, but in practice, motivation is lacking and excuses are prominent.
This is not surprising though. The following is an example scenario (with real quotes) of how data publishing is perceived at various career stages (for consistency, this scenario takes place within biomedical research):
Grad Student: “Data publishing sounds awesome, I would totally put my data out there when publishing my work but it’s really up to my PI and my PI doesn’t think it is necessary.”
Post Doc: “I like the idea but if we put data in here, are people going to use my data before I can publish 3 Nature papers as first author?”
PI: “I like the idea of having my students put their work in an archive so I can have all research outputs from the lab in one place, but until my Vice Chancellor of Research (VCR) tells me it is a priority I probably won’t use it.”
VCR: “Funder and Publisher mandates aren’t incentivizing enough?”
Publisher: “We really believe the funder mandates are the stick here.”
As you can tell, there is not a consensus of understanding, and there is a gap between the theoretical and the practical implementation of data publishing. As one postdoc at UCSF said, “If I am putting on my academic hat, of course my motivation is the goodness of it. But, practically speaking I’m not motivated to do anything.” With each stakeholder holding a different perspective, it is easy to see how difficult it is to gauge interest in data publishing!
Other reasons adoption of data publishing practices is difficult:
At conferences and within the scholarly communication world, we speak in jargon about sticks (mandates) and carrots (reproducibility, transparency). We are talking to each other: people who have already bought into these incentives and needs and are living in an echo chamber. We forget that these mandates and reasons for open data are not well understood by, or effective for, researchers themselves.
Mandates and justifications about being “for the good of science” are not consistently understood across the lab. PIs are applying for grants and writing up Data Management Plans (DMPs), but the grad students and postdocs are doing the data analysis and submitting the paper. There is plenty of space here for miscommunication, misinformation, and difficulty.
We also say that reproducibility, transparency, and getting credit for your work are wide-ranging carrots, but reproducibility and transparency initiatives vary by field. Getting credit for publishing data seems as easy as it is for articles: authorship on a dataset and citations of its DOI credit the researchers who first published the data. But how can we say that researchers are currently “getting credit” for their data publications if citing data isn’t common practice, few publishers support data citations, and tenure committees aren’t looking at the reach of data?
We spend time talking to one another about how open data is a success because publishers have released X many data statements and repositories hold X many datasets. But editors and reviewers typically do not check (or want to check) that the data associated with a publication actually underlies the findings or meets FAIR principles, and many high-volume repositories accept any sort of work (conference talks, PDFs, posters). How many articles have their associated data publicly available and in a usable format? How many deposits to repositories are usable research data? We must take these metrics with a grain of salt and understand that, while we are making progress, there are various avenues we must invest in to make the open data movement a success.
All aspects of this are related to researcher education and lowering the activation energy (i.e. making it a common and accepted practice).
A provocative conversation to bring people together:
In my presentation at CNI I scrolled through a number of quotes from researchers that I gathered during these coffee talks, and the audience laughed at many of them. The quotes are funny (or sad, or realistic, or [insert every range of emotion]), but even this reaction is reason for us to re-think our ways of driving adoption of research data management and open data practices. Talking about technologies and features that researchers haven’t asked for is getting ahead of ourselves.
Right now there should be one focus: finding incentives and ways to integrate into workflows that effectively get researchers to open up and preserve their data.
When presenting this I was apprehensive but confident: I was presenting opinions and experiences, and hearing someone say “we’re doing it wrong” usually does not come with applause. What came of the presentation was a 30-minute discussion full of genuine experiences, honest opinions, and advice. Some discussion points that came up:
- Yale University: “Find the pain” by talking to researchers not about what their dream features are but about what would really help them with their data needs
- Elsevier, Institutions: A debate about, and interest in, what counts as a Supporting Information (SI) file and whether SI files are a gateway drug that we should support. Note: I and a few others agreed that no, publishing a table already in the article should not be rewarded; that would be positive reinforcement that common practices are good enough.
- Duke University: Promoting open and preserved data as a way for PIs to reduce panic when students join and leave the lab, since the lab retains an archived set of work from past grad students (while those students still receive authorship of the dataset)
- Claremont McKenna Colleges: Are incentives and workflows different per institution and institution level, or should the focus be on domains/disciplines? Note: Researchers typically do not limit their focus to the institution level but rather look to their field, so discipline may be the better place to align (rather than institutional policies and incentives).
The general consensus was that we have to re-focus on researcher needs and integrate into researcher workflows. To do this successfully:
- We need to check our language.
- We need to ensure that our primary drive in this community is to build services and tools that make open data and data management common practices in research workflows.
- We need to share our experiences and work with all research stakeholders to understand the landscape and needs (and not refer to an unrealistic lifecycle).
So, let’s work together. Let’s talk to as many researchers, across as many domains and position levels, as we can in 2018. Let’s share these experiences when we meet at conferences and on social media. And let’s focus on adoption of a practice (data publishing) instead of spotlighting technologies, to make open data a common, feasible, and incentivized success.
OA Week 2017: Transparency and Reproducibility
By John Borghi and Daniella Lowenberg
Yesterday we talked about why researchers may have to make their data open; today let’s start talking about why they may want to.
Though some communities have been historically hesitant to do so, researchers appear to be increasingly willing to share their data. Open data even seems to be associated with a citation advantage, meaning that as datasets are accessed and reused, the researchers involved in the original work continue to receive credit. But open data is about more than just complying with mandates and increasing citation counts, it’s also about researchers showing their work.
From discussions about publication decisions to declarations that “most published research findings are false”, concerns about the integrity of the research process go back decades. Nowadays, it is not uncommon to see the term “reproducibility” applied to any effort aimed at addressing the misalignment between good research practices, namely those emphasizing transparency and methodological rigor, and academic reward systems, which generally emphasize the push to publish only the most positive and novel results. Addressing reproducibility means addressing a range of issues related to how research is conducted, published, and ultimately evaluated. But, while the path to reproducibility is a long one, open data represents a crucial step forward.
“While the path to reproducibility is a long one, open data represents a crucial step forward.”
One of the most popular targets of reproducibility-related efforts is p-hacking, a term that refers to the practice of applying different methodological and statistical techniques until non-significant results become significant. The practice of p-hacking is not always intentional, but appears to be quite common. Even putting aside some truly astonishing headlines, p-hacking has been cited as a major contributor to the reproducibility crisis in fields such as psychology and medicine.
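To make that concrete, here is a minimal, hypothetical simulation (mine, not from the original post, and assuming NumPy and SciPy are available) of one common form of p-hacking known as optional stopping: a researcher keeps collecting data and re-running the test until the result happens to cross the significance threshold. Even when the true effect is zero, this kind of peeking pushes the false positive rate well above the nominal 5%.

```python
# A minimal sketch of "optional stopping", one common form of p-hacking.
# The true effect is zero, but we re-test after every new batch of
# participants and stop as soon as p < .05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 2000      # simulated studies
batch, max_n = 10, 100    # peek after every 10 participants, up to 100

false_positives = 0
for _ in range(n_experiments):
    data = np.empty(0)
    while data.size < max_n:
        data = np.append(data, rng.normal(0.0, 1.0, batch))  # null effect
        if stats.ttest_1samp(data, 0.0).pvalue < 0.05:
            false_positives += 1  # "publish" the spurious result and stop
            break

print(f"False positive rate with peeking: {false_positives / n_experiments:.2f}")
# Typically around 0.2, roughly four times the nominal 0.05.
```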
One application of open data is sharing the datasets, documentation, and other materials needed to reproduce the results described in a journal article, thus allowing other researchers (including peer reviewers) to check for errors and ensure that the conclusions discussed in the paper are supported by the underlying data and methods. This type of validation doesn’t necessarily prevent p-hacking, but it does increase the degree to which researchers are accountable for explaining marginally significant results.
But the impact of open data on reproducibility goes far beyond combatting p-hacking. Consider publication biases such as the file drawer problem, which refers to the tendency of researchers to publish studies that produced positive results while relegating studies with negative or nonconfirmatory results to the proverbial file drawer. Along with problems related to small sample sizes, this tendency substantially skews the effects described in the scientific literature. Open data provides a means for opening the file drawer, allowing researchers to share all of their results, even those that are negative or nonconfirmatory.
“Open data provides a means for opening the file drawer, allowing researchers to share all of their results, even those that are negative or nonconfirmatory.”
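The file drawer effect is just as easy to see in a toy simulation. In the hypothetical sketch below (again mine, not from the original post), many labs study the same small true effect, but only the studies that reach significance get “published”; the published literature then reports an effect roughly twice as large as the truth.

```python
# A toy illustration of the file drawer problem: only significant studies
# are "published", so published effect sizes overestimate the true effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_effect, n_per_study, n_studies = 0.2, 30, 5000

published = []
for _ in range(n_studies):
    sample = rng.normal(true_effect, 1.0, n_per_study)
    if stats.ttest_1samp(sample, 0.0).pvalue < 0.05:  # leaves the file drawer
        published.append(sample.mean())

print(f"True effect: {true_effect}")
print(f"Mean published effect: {np.mean(published):.2f}")  # typically ~0.4
```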
Open data is about researchers showing their work, being transparent about how they reach their conclusions, and providing their data for others to use and evaluate. This allows for validation and helps combat common but questionable research practices like p-hacking. But open data also advances reproducibility efforts in a way that is less confrontational, by allowing researchers to open the file drawer and share (and get credit for) all of their work.
The Digital Dark Age, Part 2
Earlier this week I blogged about the concept of a Digital Dark Age. This is a phrase that some folks are using to describe a future scenario in which we are unable to read historical digital documents and multimedia because their formats have been rendered obsolete or the files were otherwise poorly archived. But what does this mean for scientific data?
Consider that Charles Darwin’s notebooks were recently scanned and made available online. This was possible because they were properly stored and archived, in a long-lasting format (in this case, on paper). Imagine if he had taken pictures of his finch beaks with a camera and saved the digital images in obsolete formats. Or ponder a scenario where he had used proprietary software to create his famous Tree of Life sketch. Would we be able to unlock those digital formats today? Probably not. We might have lost those important pieces of scientific history forever. Although it seems like software programs such as Microsoft Excel and MATLAB will be around forever, people probably said similar things about the programs Lotus 1-2-3 and iWeb.

It is a common misconception that things that are posted on the internet will be around “forever”. While that might be true of embarrassing celebrity photos, it is much less likely to be true for things like scientific data. This is especially the case if data are kept on a personal/lab website or archived as supplemental material, rather than being archived in a public repository (See Santos, Blake and States 2005 for more information). Consider the fact that 10% of data published as supplemental material in the six top-cited journals was not available a mere five years later (Evangelou, Trikalinos, and Ioannidis, 2005).
Natalie Ceeney, chief executive of the National Archives, summed it up best in this quote from The Guardian’s 2007 piece on preventing a Digital Dark Age: “Digital information is inherently far more ephemeral than paper.”
My next post and final DDA installment will provide tips on how to avoid losing your data to the dark side.