At the very beginning of my career in research, I conducted a study that involved asking college students to smile, frown, and then answer a series of questions about their emotional experience. The procedure was based on several classic studies positing that, while feeling happy and sad makes people smile and frown, smiling and frowning also make people feel happy and sad. After several frustrating months of trying and failing to get the procedure to work, I ended my experiment with no significant results. At the time, I chalked up my lack of success to inexperience. But then, almost a decade later, a registered replication report of the original work also found no significant effect, and I was left to wonder whether I had been caught up in what’s come to be known as psychology’s reproducibility crisis.
While I’ve since left the lab for the library, my work still often intersects with reproducibility. Earlier this year I attended a Research Transparency and Reproducibility Training session offered by the Berkeley Institute for Transparency in the Social Sciences (BITSS), and my projects involving brain imaging data, software, and research data management all invoke the term in some way. Unfortunately, though it has always been an important part of my professional activities, it isn’t always clear to me what we’re actually talking about when we talk about reproducibility.
The term “reproducibility” has been applied to efforts to enhance or ensure the quality of the research process for at least 25 years. However, related conversations about how research is conducted, published, and interpreted have been ongoing for more than half a century. Ronald Fisher, who popularized the p-value that lies at the center of so many modern reproducibility efforts, summed up the situation in 1935:
“We may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us statistically significant results.”
Putting this seemingly simple statement into action has proven to be quite complex. Some reproducibility-related efforts are aimed at how researchers share their results, others are aimed at how they define statistical significance. There is now a burgeoning body of scholarship devoted to the topic. Even putting aside terms like HARKing, QRPs, and p-hacking, seemingly mundane objects like file drawers are imbued with particular meaning in the language of reproducibility.
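Fisher’s criterion above can be read as a statement about statistical power: a phenomenon is “experimentally demonstrable” when a competent experiment detects it most of the time, not just once. A minimal simulation sketches this, assuming an illustrative one-sample z-test with known variance and arbitrary effect sizes and sample sizes (none of these numbers come from any particular study):

```python
import math
import random

def z_test_p(sample, mu0=0.0, sigma=1.0):
    """Two-sided p-value for H0: mean == mu0, assuming known sigma."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    # Survival function of the standard normal, doubled for a two-sided test.
    return math.erfc(abs(z) / math.sqrt(2))

def demonstrable(effect, n, alpha=0.05, runs=2000, seed=1):
    """Fraction of simulated experiments that reach p < alpha.

    Each run draws n observations from a normal distribution whose true
    mean is `effect`; the returned fraction approximates statistical power.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        sample = [rng.gauss(effect, 1.0) for _ in range(n)]
        if z_test_p(sample) < alpha:
            hits += 1
    return hits / runs

# A well-powered design "rarely fails to give us statistically
# significant results"; an underpowered one fails most of the time.
print(demonstrable(effect=0.5, n=64))   # large effect, decent n: high power
print(demonstrable(effect=0.2, n=20))   # small effect, small n: low power
```

The point of the sketch is that a single significant result says little about demonstrability in Fisher’s sense; it is the long-run rate of success across repeated experiments that matters, which is exactly what replication attempts try to estimate.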
So what actually is reproducibility?
Well… it’s complicated.
The best place to start might be the National Science Foundation, which defines reproducibility as “The ability of a researcher to duplicate the results of a prior study using the same materials and procedures used by the original investigator.” According to the NSF, reproducibility is one of three qualities that ensure research is robust. The other two, replicability and generalizability, are defined respectively as “The ability of a researcher to duplicate the results of a prior study if the same procedures are followed but new data are collected.” and “Whether the results of a study apply in other contexts or populations that differ from the original one.” The difference between these terms lies in the degree of separation from the original research, but all three converge on the quality of research. Good research is reproducible, replicable, and generalizable, and, at least in the context of the NSF, a researcher invested in ensuring the reproducibility of their work would deposit their research materials and data in a manner and location where they could be accessed and used by others.
Unfortunately, defining reproducibility isn’t always so simple. For example, according to the NSF’s terminology, the various iterations of the Reproducibility Project are actually replicability projects (muddying the waters further, the Reproducibility Project: Psychology was preceded by the Many Labs Replication Project). However, the complexity of defining reproducibility is perhaps best illustrated by comparing the NSF definition to that of the National Institutes of Health.
Like the NSF, the NIH invokes reproducibility in the context of addressing the quality of research. However, unlike the NSF, the NIH does not provide an explicit definition of the term. Instead, NIH grant applicants are asked to address rigor and reproducibility across four areas of focus: scientific premise, scientific rigor (design), biological variables, and authentication. Unlike the definition supplied by the NSF, the NIH’s conception of reproducibility appears to apply to an extremely broad set of circumstances and encompasses both replicability and generalizability. In the context of the NIH, a researcher invested in reproducibility must critically evaluate every aspect of their research program to ensure that any conclusions drawn from it are well supported.
Beyond the NSF and NIH, there have been numerous attempts to clarify what reproducibility actually means. For example, a paper out of the Meta-Research Innovation Center at Stanford (METRICS) distinguishes between “methods reproducibility”, “results reproducibility”, and “inferential reproducibility”. Methods and results reproducibility map onto the NSF definitions of reproducibility and replicability, while inferential reproducibility includes the NSF definition of generalizability and also the notion of different researchers reaching the same conclusion following reanalysis of the original study materials. Other approaches focus on methods by distinguishing between empirical, statistical, and computational reproducibility or specifying that replications can be direct or conceptual.
No really, what actually is reproducibility?
The deeper we dive into defining “reproducibility”, the muddier the waters become. In some contexts, the term refers to very specific practices related to authenticating the results of a single experiment. In other contexts, it describes a range of interrelated issues related to how research is conducted, published, and interpreted. For this reason, I’ve started to move away from explicitly invoking the term when I talk to researchers. Instead, I’ve tried to frame my various research and outreach projects in terms of how they relate to fostering good research practice.
To me, “reproducibility” is about problems. Some of these problems are technical or methodological and will evolve with the development of new techniques and methods. Some of these problems are more systemic and necessitate taking a critical look at how research is disseminated, evaluated, and incentivized. But fostering good research practice is central to addressing all of these problems.
Especially in my current role, I am not particularly well equipped to speak to whether a researcher should define statistical significance as p < 0.05, p < 0.005, or K > 3. What I am equipped to do is help a researcher manage their research materials so they can be used, shared, and evaluated over time. It’s not that I think the term is not useful, but the problems conjured by reproducibility are so complex and context dependent that I’d rather just talk about solutions.
Resources for understanding reproducibility and improving research practice
Goodman A., Pepe A., Blocker A. W., Borgman C. L., Cranmer K., et al. (2014) Ten simple rules for the care and feeding of scientific data. PLOS Computational Biology 10(4): e1003542.
Ioannidis J. P. A. (2005) Why most published research findings are false. PLOS Medicine 2(8): e124.
Kitzes, J., Turek, D., & Deniz, F. (Eds.). (2017). The Practice of Reproducible Research: Case Studies and Lessons from the Data-Intensive Sciences. Oakland, CA: University of California Press.
Munafò, M. R., Nosek, B. A., Bishop, D. V., Button, K. S., Chambers, C. D., du Sert, N. P., et al. (2017). A manifesto for reproducible science. Nature Human Behaviour, 1, 0021.
Wilson G., Bryan J., Cranston K., Kitzes J., Nederbragt L., et al. (2017) Good enough practices in scientific computing. PLOS Computational Biology 13(6): e1005510.