Data publishing is the new hot topic in a growing number of academic communities. The scholarly ecosphere is filled with listserv threads, colloquia and conference hallway chats punctuated with questions of why to do it, how to do it, where to do it, when to do it and even what to call this seemingly new breed of scholarly output. Scholars, and those who provide the tools and infrastructure to support them, are consumed with questions that don’t seem to have easy answers, and certainly not answers that span all disciplines. How can researchers gain credit for the data they produce, separate from and in addition to the analysis of those data as articulated in formal publications? How can scholars researching an area find relevant data sets within and across their own disciplines? How are data and methodologies most effectively reviewed, validated, and corrected? How are meaningful connections maintained between different versions, iterations, and “re-uses” of a given set of data?
The intensity of debate on these topics is surprising in some ways, given that datasets, at least in certain fields, have been readily available for a while. The Inter-university Consortium for Political and Social Research (ICPSR) has been allowing social scientists to publish or find datasets since the early 1960s. Great datasets gathered and published under institutional auspices include UN Data, the UNESCO Statistical Yearbook, and the IMF Depository Library Program. Closer to home is the United States’ Federal Depository Library Program, which since its establishment in 1841 has served as a distribution mechanism to ensure public access to governmental documents and data.
While these outlets are viable solutions for only some disciplines, their presence started me down a path exploring the history of data publishing in an effort to gain some perspective on the challenges we are facing today. Somewhat surprisingly, data publishing, conducted in a manner that would be recognized by today’s scholars, has been occurring for almost half a millennium. Yes, that’s right; we are now nearly 500 years into producing, analyzing and publishing data.
These early activities centered on demographic data, presumably in an effort to identify and understand the dramatic patterns of life and death. Starting in the late 1500s and prompted by the Plague, “Bills of Mortality” recording deaths within London began to be published, and they soon appeared on a weekly basis. That raw data caught the attention of John Graunt, a community-minded London draper who was extremely bright but not affiliated with a university. He was inspired to gather those numerical lists, turn them into a dataset, and analyze those data (looking for causes of death, ages of death, comparisons of rates between London and elsewhere, etc.). In 1662 he published both the dataset and his findings regarding population patterns in a groundbreaking work, “Natural and Political Observations Mentioned in a Following Index, and Made upon the Bills of Mortality.” The work was submitted to the Royal Society of Philosophers, which recognized its merit and inducted the author into its fellowship. Graunt continued to extend his data and analysis, publishing new versions of each in subsequent years. Thus was born the first great work (at least in the Western world) of statistical analysis, or “political arithmetic” as it came to be called at that time.
Moving from the 16th and 17th centuries to the 18th brings us to another major point in data publishing history with Johann Sussmilch of Germany. Sussmilch was originally a cleric involved in a variety of intellectual pursuits, though unaffiliated with a university, at least initially. His interests included theology, statistics and linguistics, and he was eventually appointed to the Royal Academy of Sciences and Fine Arts for his linguistic scholarship. Sussmilch’s great work was the “Divine Order,” an ambitious effort to collect detailed data about the population in Prussia in order to prove his religious theory of “Rational Theology.” In other words, Sussmilch was engaged in a basic research program: he had a theory, formed a research question, collected the data required to test that theory, analyzed his data, and then published his results along with his data.
The rigorous quality of Sussmilch’s work (both the data and the analysis) elevated it far beyond his original and personal religious motivations, leading it to have a wide impact throughout parts of Europe. It became a focal point of exchange between scholars across countries and prompted debate over his data collection methodology and interpretation. Put another way, Sussmilch’s work inspired his colleagues to engage in the modern model of “scholarly communication”: a spirited critical dialogue that in turn resulted in changes to the next edition of the work (for instance, separate tables for immigration and racial data). Published first in 1741, it was updated and reprinted six times through 1798.
In this earlier time, as in our own, the drive to engage with other intellectuals was paramount. Publishing, sharing, critiquing and modifying data production efforts and analysis were seemingly as much a driving force among this community as they are among the scholars of today. Researchers of the 17th and 18th centuries dealt with issues of attribution, review of data veracity and analytical methodology, and even versioning. The surprising discovery of apparent similarities across such a large gulf of time prompts many questions. If data could be published and shared centuries ago, why are we faced with such tremendous challenges to do the same today? Are we overlooking approaches from the past that could help us today? Or are we glossing over the difficulties of the past?
More research would have to be done to answer these questions thoroughly, but perhaps a gesture can be made in that regard by identifying some of the contrasts between yesterday and today. Taking the examples from above as a jumping-off point, perhaps the most striking difference between past activities and the goals articulated in conversations about data publishing today is that the data publication efforts of the past were accompanied by an equally important piece of analysis, and the research community was interested in both. The conclusions drawn from the data were subjected to scrutiny, as were the data and data collection methods that provided their foundation. All of the components of the research were of concern. These scholars were not interested in publishing the data on their own, but rather wanted to present them along with their arguments, with each underscoring the other.
Another difference is the changing relationship between individual researchers and the entities that support them. Not only do we have governments and academic institutions, but we have a new contemporary player, the corporation, which is driven by a substantially different motivation from entities of past ages. In addition, a broader range of disciplines is now concerned with data publication, and perhaps those disciplines face stumbling blocks not at issue for the social scientists working with demographic and public health data. Given the known heterogeneity of scholarly communication practices across different fields, there seems to be no reason to think that data publishing needs, expectations and concerns would not also vary. And of course, the most obvious difference between then and now is in tools and technology. Have those advancements altered fundamental data publishing practices and, if so, how?
These are interesting but complex questions to pursue. Fortunately, the above examples of our data publishing antecedents have, I hope, revealed that there are meaningful touchstones to use as reference points as we attempt to address them. Data publishing has a rich, resonant past stretching back hundreds of years, providing us with an opportunity to reach into that past to better understand the trajectory that has brought us to this moment, thereby helping us more effectively grapple with the questions that seem to confound us today.