Tag: open data

The Significance of Managing Research Data

Some of the most influential research tools of the last century were created to ensure the quality of beer and extrapolate the results of agriculture experiments conducted in the English countryside. Though ostensibly about the placement of a decimal point, an ongoing debate about the application of these tools also provides a window for understanding what it actually means to manage research data.

The p-value: A very quick introduction

Though now ubiquitous in experiment-based research, statistical techniques for extending inferences from small sample (e.g. the participants in a research study) to larger populations are actually a relatively recent invention. The t-test, an early and still widely used example of “small sample” statistics was developed by William Sealy Gossett in the early 20th century as an economical way of ensuring the quality of stout. Several years later, while assisting with long-term experiments on wheat and grass at Rothamsted Experimental Station, Ronald Fisher would build on the work of Gosset and others to develop a statistical framework based around the idea of comparing observations to the null hypothesis- the position that there is no significant difference between two or more specified sets of observations.

In Fisher’s significance testing framework, devices like t-tests are tests of the null hypothesis. The results of these tests indicate the likelihood of observing a result when the null hypothesis is true. The logic is a little tricky, but the core idea is that these tests give researchers a way of understanding the likelihood that their data is the result of sampling or experimental error. In quantitative terms, this likelihood is known as a p-value. In his highly influential 1925 book, Statistical Methods for Research Workers, Fisher would introduce an informal threshold for rejecting the null hypothesis: p < 0.05.

In one of the most influential sentences in modern research methodology, Ronald Fisher describes p = 0.05 as a convenient point for judging the significance of a statistical test. From: Fisher, R.A. (1925). Statistical Methods for Research Workers.

Despite the vehement objections of all three, Fisher’s work would later be synthesized with that of statisticians Jerzy Neyman and Egon Pearson into a suite of tools that are still widely used in many fields of research. In practice, p < 0.05 has since become a one-size-fits-all indicator of success. For decades it has been acknowledged that work that meets this criterion is generally more likely to be reported in the scholarly literature while work that doesn’t is generally relegated the proverbial file drawer.

Beyond p < 0.05

The p < 0.05 threshold has become a flashpoint the ongoing conversation about research practices, reproducibility, and replicability. Heated conversations about the use and misuse of p-values have been ongoing for decades, but over the summer a group of 72 influential researchers proposed a seemingly simple step forward- change the threshold from 0.05 to 0.005. According to the authors, “Reducing the p-value threshold for claims of new discoveries to 0.005 is an actionable step that will immediately improve reproducibility.”.

As of this writing, two responses have been published. Both weigh the pros and cons of p < 0.005 and argue that the placement of a decimal point is less of a problem than the uncritical use of a single one-size-fits-all threshold across many different circumstances and fields of research. Both end on calls for greater transparency and stronger justifications for how decisions related to research design and statistical practice are made. If the initial paper proposed changing the answer from p < 0.05 to 0.005, both responses highlight the necessity of changing the question from one that is focused on statistics to one that incorporates research data management (RDM).

Ensuring that data can be used and evaluated in the future is one of the primary goals of RDM. For example, the RDM guide we’re developing does not have a space for assessing p-values. Instead, its focus is assessing and advancing practices related to planning for, saving, and documenting data and other research products. Such practices come with their own nuance, learning curves, and jargon, but are important elements to any effort to ensure that research decisions are transparent and justified.

Resources and Additional Reading

Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E. J., Berk, R., … & Cesarini, D. (2017). Redefine statistical significance. Nature Human Behaviour. doi: 10.1038/s41562-017-0189-z

Lakens, D., Adolfi, F. G., Albers, C. J., Anvari, F., Apps, M. A. J., Argamon, S. E., … Zwaan, R. A. (2017). Justify your alpha: A response to “Redefine statistical significance”PsyArxiv preprint. doi: 10.17605/OSF.IO/9S3Y6

McShane, B. B., Gal, D., Gelman, A., Robert, C., & Tackett, J. L. (2017). Abandon statistical significance. arXiv preprint. arXiv: 1709.07588.

Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance—or vice versaJournal of the American Statistical Association54(285), 30-34. doi: 10.1080/01621459.1959.10501497

Rosenthal, R. (1979). The file drawer problem and tolerance for null resultsPsychological Bulletin86(3), 638-641. doi: 10.1037/0033-2909.86.3.638


A few months back I received an invite to visit the University of Florida in sunny Gainesville.  The invite was from organizers of an annual symposium for the Quantitative Spatial Ecology, Evolution and Environment (QSE3) Integrative Graduate Education and Research Traineeship (IGERT) program.  Phew! That was a lot of typing for the first two acronyms in my blog post’s title.  The third acronym  (OA) stands for Open Access, and the fourth acronym should be familiar.

I presented a session on data management and sharing for scientists, and afterward we had a round table discussion focused on OA.  There were about 25 graduate students affiliated the QSE3 IGERT program, a few of their faculty advisors, and some guests (including myself) involved in the discussion.  In 90 minutes we covered the gamut of current publishing models, incentive structures for scientists, LaTeX advantages and disadvantages, and data sharing.  The discussion was both interesting and energetic in a way that I don’t encounter from scientists that are “more established”.  Some of the themes that emerged from our discussion warrant a blog post.

First, we discussed that data sharing is an obvious scientific obligation in theory, but when it comes to your data, most scientists get a bit more cagey.  This might be with good reason – many of the students in the discussion were still writing up their results in thesis form, never mind in journal-ready form.  Throwing your data out into the ether without restrictions might result in some speedy scientist scooping you while you are dotting i’s and crossing t’s in your thesis draft.  In the case of grad students and scientists in general, embargo periods seem to be a good response to most of this apprehension. We agreed as a group, however, that such embargos should be temporary and should be phased out over time as cultural norms shift.

The current publishing model needs to change, but there was disagreement about how this change should manifest. For instance, one (very computer-savvy) student who uses R, LaTeX and Sweave asked “Why do we need publishers? Why can’t we just put the formatted text and code online?”  This is an obvious solution for someone well-versed in the world of document preparation in the vein of LaTeX.  You get fully formated, high-quality publications by simply compiling documents. But this was argued against by many in attendance because LaTeX use is not widespread, and most articles need heavy amounts of formatting before publication.  Of course, this is work that would need to be done by the overburdened scientist if they published their own work, which is not likely to become the norm any time soon.

empty library

No journals means empty library shelves. Perhaps the newly freed up space could be used to store curmudgeonly professors resistant to change.

Let’s pretend that we have overhauled both scientists and the publishing system as it is.  In this scenario, scientists use free open-source tools like LaTeX and Sweave to generate beautiful documents.  They document their workflows and create python scripts that run in the command line for reproducible results.  Given this scenario, one of the students in the discussion asked “How do you decide what to read?” His argument was that the current journal system provides some structure for scientists to hone in on interesting publications and determine their quality based (at least partly) on the journal in which the article appears.

One of the other grad students had an interesting response to this: use tags and keywords, create better search engines for academia, and provide capabilities for real-time peer review of articles, data, and publication quality.  In essence, he used the argument that there’s no such thing as too much information. You just need a better filter.

One of the final questions of the discussion came from the notable scientist Craig Osenberg. It was in reference to the shift in science towards “big data”, including remote sensing, text mining, and observatory datasets. To paraphrase: Is anyone worrying about the small datasets? They are the most unique, the hardest to document, and arguably the most important.

My answer was a resounding YES! Enter the DCXL project.  We are focusing on providing support for the scientists that don’t have data managers, IT staff, and existing data repository accounts that facilitate data management and sharing.  One of the main goals of the DCXL project is to help “the little guy”.  These are often scientists working on relatively small datasets that can be contained in Excel files.

In summary, the very smart group of students at UF came to the same conclusions that many of us in the data world have: there needs to be a fundamental shift in the way that science is incentivized, and this is likely to take a while.  Of course, given that these students are early in their careers, and their high levels of interest and intelligence, they are likely to be a part of that change.

Special thanks goes to Emilio Bruna (@brunalab) who not only scored me the invite to UF, but also hosted me for a lovely dinner during my visit (albeit NOT the Tasty Budda…)

Data Publishing and the Coproduction of Quality

This post is authored by Eric Kansa

There is a great deal of interest in the sciences and humanities around how to manage “data.” By “data,” I’m referring to content that has some formal and logical structure needed to meet the requirements of software processing. Of course, distinctions between structured versus unstructured data represent more of a continuum or spectrum than a sharp line. What sets data apart from texts however is that data are usually intended for transactional (with queries and visualizations) rather than narrative applications.

The uses of data versus texts make a big difference in how we perceive “quality.” If there is a typo in a text, it usually does not break the entire work. Human readers are pretty forgiving with respect to those sorts of errors, since humans interpret texts via pattern recognition heavily aided by background knowledge and expectations. Small deviations from a reader’s expectations about what should be in a text can be glossed over or even missed entirely. If noticed, many errors annoy rather than confuse. This inherently forgiving nature of text makes editing and copy-editing attention-demanding tasks. One has to struggle to see what is actually written on a page rather than getting the general gist of a written text.

Scholars are familiar with editorial workflows that transform manuscripts into completed publications. Researchers submit text files to journal editors, who then circulate manuscripts for review. When a paper is accepted, a researcher works with a journal editor through multiple revisions (many suggested by peer-review evaluations) before the manuscript is ready for publication. Email, versioning, and edit-tracking help coordinate the work. The final product is a work of collaborative “coproduction” between authors, editors, reviewers, and type-setters.

What does this have to do with data?

Human beings typically don’t read data. We use data mediated through software. The transactional nature of data introduces a different set of issues impacting the quality and usability of data. Whereas small errors in a text often go unnoticed, such errors can have dramatic impacts on the use and interpretation of a dataset. For instance, a misplaced decimal point in a numeric field can cause problems for even basic statistical calculations. Such errors can also break visualizations.

These issues don’t only impact single datasets, they can also wreak havoc in settings where multiple individual datasets need to be joined together. I work mainly on archaeological data dissemination. Archaeology is an inherently multidisciplinary practice, involving inputs from different specialists in the natural sciences (especially zoology, botany, human osteology, and geomorphology), the social sciences, and the humanities. Meaningful integration of these diverse sources of structured data represents a great information challenge for archaeology. Archaeology also creates vast quantities of other digital documentation. A single field project may result in tens of thousands of digital photos documenting everything from excavation contexts to recovered artifacts. Errors and inconsistencies in identifiers can create great problems in joining together disparate datasets, even from a single archaeological project.

It is a tremendous challenge to relate all of these different datasets and media files together in a usable manner. The challenge is further compounded because archaeology, like many small sciences, typically lacks widely used recording terminologies and standards. Each archaeological dataset is custom crafted by researchers to address a particular suite of research interests and needs. This means that workflows and supporting software to find and fix data problems needs to be pretty generalized.

Fortunately, archaeology is not alone in needing tools to promote data quality. Google Refine helps meet these needs. Google Refine leverages the transactional nature of data to summarize and filter datasets in ways that make many common errors apparent. Once errors are discovered, Google Refine has powerful editing tools to fix problems. Users can also undo edits to roll-back fixes and return a dataset to an earlier state.

With funding from the Alfred P. Sloan Foundation, we’re working to integrate Google Refine in a collaborative workflow called “Data Refine“. Again, the transactional nature of data helps shape this workflow. Because use of data is heavily mediated by software, datasets can be seen as an integral part of software. This thinking motivated us to experiment with using software debugging and issue tracking tools to help organize collaborative work on editing data. Debugging and issue tracking tools are widely used and established ways of improving software quality. They can play a similar role in the “debugging” of data.

We integrated Google Refine and the PHP-based Mantis issue tracker to support collaboration in improving data quality. In this approach, contributing researchers and data editors collaborate in the coproduction of higher quality, more intelligible and usable datasets. These workflows try to address both supply and demand needs in scholarship. Researchers face strong and well known career pressures. Tenure may be worth $2 million or more over the course of a career, and its alternative can mean complete ejection from a discipline.  A model of editorially supervised “data sharing as publication” can help better align the community’s interest in data dissemination with the realities of individual incentives. On the demand side, datasets must have sufficient quality and documentation. To give context, data often need to be related and linked with shared concepts and with other datasets available on the Web (as in the case of “Linked Open Data” scenarios).

All of these processes require effort. New skills, professional roles, and scholarly communication channels need to be created to meet the specific requirements of meaningful data sharing. Tools and workflows as discussed here can help make this effort a bit more efficient and better suited to how data are used in research.

Popular Demand for Public Data

Scanned image of a 1940 Census Schedule (from http://1940census.archives.gov)

The National Archives and Records Administration digitized 3.9 million schedules from the 1940 U.S. census

When talking about data publication, many of us get caught up in protracted conversations aimed at carefully anticipating and building solutions for every possible permutation and use case. Last week’s release of U.S. census data, in its raw, un-indexed form, however, supports the idea that we don’t have to have all the answers to move forward.

Genealogists, statisticians and legions of casual web surfers have been buzzing about last week’s release of the complete, un-redacted collection of scanned 1940 U.S. census data schedules. Though census records are routinely made available to the public after a 72-year privacy embargo, this most recent release marks the first time that the census data set has been made available in such a widely accessible way: by publishing the schedules online.

In the first 3-hours that the data was available, 22.5 million hits crippled the 1940census.archives.gov servers. The following day, nearly 3 times that number of requests continued to hammer the servers as curious researchers scoured the census data looking for relatives of missing soldiers; hoping to find out a little bit more about their own family members; or trying to piece together a picture of life in post-Great Depression, pre-WWII America.

For the time being, scouring the data is a somewhat laborious task of narrowing in on the census schedules for a particular district, then performing a quick visual scan for people’s names. The 3.9 million scanned images that make up the data set are not, in other words, fully indexed — in fact, only a single field (the Enumeration District number field) is searchable. Encoding that field alone took 6 full-time archivists 3-months.

The task of encoding the remaining 5.3 billion fields is being taken up by an army of volunteers. Some major genealogy websites (such as Ancestry.com and MyHeritage.com) hope the crowd-sourced effort will result in a fully indexed, fully searchable database by the end of the year.

Release day for the census has been described as “the Super Bowl for genealogists.” This excitement about data, and participation by the public in transforming the data set into a more useable, indexed form are encouraging indications that those of us interested in how best to facilitate even more sharing and publishing of data online are doing work that has enormous, widely-appreciated value. The crowd-sourced volunteer effort also reminds us that we don’t necessarily have to have all the answers when thinking about publishing data. In some cases, functionality that seems absolutely essential (such as the ability to search through the data set) is work that can (and will) be taken up by others.

So, how about your data set(s)? Who are the professional and armchair domain enthusiasts that will line up to download your data? What are some of the functionality roadblocks that are preventing you from publishing your data, and how might a third party (or a crowd sourced effort) work as a solution? (Feel free to answer in the comments section below.)