A new opportunity to build a better (data) future

Last month I left my comfort zone.

After 30 years of working as an engineer, developer, and technical leader at Scripps Institution of Oceanography (SIO at UC San Diego), I started a new career as a Senior Product Manager and Research Data Specialist with the UC Curation Center (UC3) at the California Digital Library. While it may sound like a big change, it was more of a steady evolution.

Although my projects at SIO were initially focused on scientific instrumentation, software development, and engineering specifications, I found the curation of in situ data fascinating and better aligned with my skills and preferences. This led to service opportunities, including leadership positions within national and international data initiatives, and those projects allowed me to collaborate with members of UC3.

Joining their team was the next logical step.

The transition from being part of the technical staff in a research setting to being a hands-on data advocate in UC3 has been an invigorating challenge so far, and it provides an excellent opportunity to build on my foundation of knowledge and grow in new areas.   

It’s an honor to pick up where my predecessor, Daniella Lowenberg, left off. I’ve long admired her approach to all things data. I am grateful for the extraordinary measures that she and John Chodacki have taken to bring me up to speed as soon as possible. 

Data publishing is a dynamic young field, and my colleagues and I will be able to help shape the conversations, initiatives, and tools that serve the international research community. I look forward to working with my new colleagues as we advocate for open data and help build and implement infrastructure to make data more discoverable, interoperable, and reusable.

Data Publishing and the Coproduction of Quality

This post is authored by Eric Kansa

There is a great deal of interest in the sciences and humanities around how to manage “data.” By “data,” I’m referring to content that has some formal and logical structure needed to meet the requirements of software processing. Of course, distinctions between structured versus unstructured data represent more of a continuum or spectrum than a sharp line. What sets data apart from texts, however, is that data are usually intended for transactional applications (queries and visualizations) rather than narrative ones.

The uses of data versus texts make a big difference in how we perceive “quality.” If there is a typo in a text, it usually does not break the entire work. Human readers are pretty forgiving with respect to those sorts of errors, since humans interpret texts via pattern recognition heavily aided by background knowledge and expectations. Small deviations from a reader’s expectations about what should be in a text can be glossed over or even missed entirely. If noticed, many errors annoy rather than confuse. This inherently forgiving nature of text makes editing and copy-editing attention-demanding tasks. One has to struggle to see what is actually written on a page rather than getting the general gist of a written text.

Scholars are familiar with editorial workflows that transform manuscripts into completed publications. Researchers submit text files to journal editors, who then circulate manuscripts for review. When a paper is accepted, a researcher works with a journal editor through multiple revisions (many suggested by peer-review evaluations) before the manuscript is ready for publication. Email, versioning, and edit-tracking help coordinate the work. The final product is a work of collaborative “coproduction” between authors, editors, reviewers, and type-setters.

What does this have to do with data?

Human beings typically don’t read data. We use data mediated through software. The transactional nature of data introduces a different set of issues impacting the quality and usability of data. Whereas small errors in a text often go unnoticed, such errors can have dramatic impacts on the use and interpretation of a dataset. For instance, a misplaced decimal point in a numeric field can cause problems for even basic statistical calculations. Such errors can also break visualizations.
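
To make the point concrete, here is a minimal sketch (the values are invented for illustration, not drawn from any real dataset) of how a single misplaced decimal point quietly distorts even a simple mean:

```python
# A misplaced decimal point in one numeric field skews a basic statistic.
temperatures_c = [12.4, 13.1, 12.8, 128.0, 13.0]  # 128.0 should have been 12.80

mean_with_error = sum(temperatures_c) / len(temperatures_c)  # ~35.9
corrected = [12.4, 13.1, 12.8, 12.8, 13.0]
mean_corrected = sum(corrected) / len(corrected)             # ~12.8

print(f"mean with misplaced decimal: {mean_with_error:.1f}")
print(f"mean after correction:       {mean_corrected:.1f}")
```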

These issues don’t only impact single datasets; they can also wreak havoc in settings where multiple individual datasets need to be joined together. I work mainly on archaeological data dissemination. Archaeology is an inherently multidisciplinary practice, involving inputs from different specialists in the natural sciences (especially zoology, botany, human osteology, and geomorphology), the social sciences, and the humanities. Meaningful integration of these diverse sources of structured data represents a great information challenge for archaeology. Archaeology also creates vast quantities of other digital documentation. A single field project may result in tens of thousands of digital photos documenting everything from excavation contexts to recovered artifacts. Errors and inconsistencies in identifiers can create great problems in joining together disparate datasets, even from a single archaeological project.
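
As a hedged illustration of that join problem (the field names and values below are hypothetical, not taken from any actual archaeological project), consider two tables keyed on a context identifier that one specialist recorded with different capitalization and a stray trailing space:

```python
import pandas as pd

# Hypothetical excavation-context records and a specialist's bone counts.
contexts = pd.DataFrame({
    "context_id": ["Trench1-001", "Trench1-002", "Trench1-003"],
    "phase": ["Iron Age", "Iron Age", "Roman"],
})
fauna = pd.DataFrame({
    "context_id": ["trench1-001 ", "Trench1-002", "Trench1-003"],  # note case + trailing space
    "bone_count": [42, 17, 5],
})

# A naive join silently drops the mismatched row.
naive = contexts.merge(fauna, on="context_id", how="inner")   # 2 rows, not 3

# Normalizing identifiers before joining recovers the lost record.
for df in (contexts, fauna):
    df["context_id"] = df["context_id"].str.strip().str.lower()
clean = contexts.merge(fauna, on="context_id", how="inner")   # 3 rows
print(len(naive), len(clean))
```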

It is a tremendous challenge to relate all of these different datasets and media files together in a usable manner. The challenge is further compounded because archaeology, like many small sciences, typically lacks widely used recording terminologies and standards. Each archaeological dataset is custom crafted by researchers to address a particular suite of research interests and needs. This means that workflows and supporting software to find and fix data problems need to be fairly general.

Fortunately, archaeology is not alone in needing tools to promote data quality. Google Refine helps meet these needs. Google Refine leverages the transactional nature of data to summarize and filter datasets in ways that make many common errors apparent. Once errors are discovered, Google Refine has powerful editing tools to fix problems. Users can also undo edits to roll-back fixes and return a dataset to an earlier state.
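
Google Refine’s faceting and clustering features are its own; the sketch below only imitates the general idea in plain Python (the column values are invented), tallying the values in a field so that near-duplicate spellings stand out for editing:

```python
from collections import Counter

# Invented artifact-material entries with inconsistent spellings.
materials = ["Ceramic", "ceramic ", "Ceramic", "Bone", "bone", "Ceramc"]

# A text facet: raw value counts expose the variants side by side.
print(Counter(materials))

# A crude normalization key (lowercase, trimmed) groups likely duplicates,
# loosely analogous to Refine's clustering, without claiming to match its algorithm.
clusters: dict[str, set[str]] = {}
for value in materials:
    key = value.strip().lower()
    clusters.setdefault(key, set()).add(value)
for key, variants in clusters.items():
    if len(variants) > 1:
        print(f"possible duplicates for '{key}': {variants}")
```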

With funding from the Alfred P. Sloan Foundation, we’re working to integrate Google Refine in a collaborative workflow called “Data Refine”. Again, the transactional nature of data helps shape this workflow. Because use of data is heavily mediated by software, datasets can be seen as an integral part of software. This thinking motivated us to experiment with using software debugging and issue tracking tools to help organize collaborative work on editing data. Debugging and issue tracking tools are widely used and established ways of improving software quality. They can play a similar role in the “debugging” of data.

We integrated Google Refine and the PHP-based Mantis issue tracker to support collaboration in improving data quality. In this approach, contributing researchers and data editors collaborate in the coproduction of higher quality, more intelligible, and more usable datasets. These workflows try to address both supply and demand needs in scholarship. Researchers face strong and well-known career pressures. Tenure may be worth $2 million or more over the course of a career, and its alternative can mean complete ejection from a discipline. A model of editorially supervised “data sharing as publication” can help better align the community’s interest in data dissemination with the realities of individual incentives. On the demand side, datasets must have sufficient quality and documentation to be usable. To give context, data often need to be related and linked with shared concepts and with other datasets available on the Web (as in the case of “Linked Open Data” scenarios).
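
The actual Refine–Mantis integration is the project’s own; the sketch below only illustrates the general pattern of turning failed data checks into tracked issues that an editor and contributor can resolve together. The `file_issue()` helper is hypothetical and stands in for whatever tracker API is used; it does not depict Mantis itself.

```python
from dataclasses import dataclass

@dataclass
class DataIssue:
    dataset: str
    row: int
    field: str
    problem: str

def validate_rows(dataset: str, rows: list[dict]) -> list[DataIssue]:
    """Run simple checks and return one issue per problem found."""
    issues = []
    for i, row in enumerate(rows):
        if not row.get("context_id"):
            issues.append(DataIssue(dataset, i, "context_id", "missing identifier"))
        if row.get("count") is not None and row["count"] < 0:
            issues.append(DataIssue(dataset, i, "count", "negative count"))
    return issues

def file_issue(issue: DataIssue) -> None:
    # Placeholder: a real workflow would open a ticket in the issue tracker here.
    print(f"[{issue.dataset}] row {issue.row}, {issue.field}: {issue.problem}")

for issue in validate_rows("fauna_2012", [{"context_id": "", "count": 3},
                                          {"context_id": "t1-002", "count": -1}]):
    file_issue(issue)
```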

All of these processes require effort. New skills, professional roles, and scholarly communication channels need to be created to meet the specific requirements of meaningful data sharing. Tools and workflows as discussed here can help make this effort a bit more efficient and better suited to how data are used in research.

The Future of Metrics in Science

Ask any researcher what they need for tenure, and the answer is virtually the same across institutions and disciplines: publications.  The “publish or perish” model has reigned supreme for generations of scientists, despite ignoring whether a researcher’s publications favor quality over quantity, how many collaborations have been established, or even the novelty or difficulty of a particular research project.  This archaic measure of impact tends to rely on measures like a scientist’s number of citations and the impact factor of the journals in which they publish.

With the upswing in blogs, Twitter feeds, and academic social sites like Mendeley, Zotero, and (my favorite) CiteULike, some folks are working on developing a new model for measuring one’s impact on science.  Jason Priem, a graduate student at UNC’s School of Information and Library Science, coined the term “altmetrics” rather recently, and the idea has taken off like wildfire.

altmetrics is the creation and study of new metrics based on the Social Web for analyzing, and informing scholarship.

The concept is simple: instead of using traditional metrics for measuring impact (citation counts, journal impact factors), Priem and his colleagues want to take into account more modern measures of impact like number of bookmarks, shares, or re-tweets.  In addition, altmetrics seeks to consider not only publications, but associated data or code downloads.
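
As a purely illustrative sketch (the signal names and weights below are invented for this post, not part of any altmetrics specification or the Total Impact tool), one could tally these newer signals alongside citations for a single research output:

```python
# Hypothetical counts of attention events for one article and its dataset.
signals = {
    "citations": 12,
    "mendeley_bookmarks": 85,
    "tweets": 40,
    "dataset_downloads": 310,
}

# Invented weights, for illustration only; altmetrics does not prescribe a formula.
weights = {"citations": 5.0, "mendeley_bookmarks": 0.5, "tweets": 0.1, "dataset_downloads": 0.2}

score = sum(weights[name] * count for name, count in signals.items())
print(f"composite attention score: {score:.1f}")
```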

[Image: The original alternatives, The Sex Pistols. From Arroz Do Ceu (limpa-vias.blogspot.com). Read more about the beginnings of alternative rock in Dave Thompson’s book “Alternative Rock”.]

Old-school scientists and Luddites might balk at the idea of measuring a scientist’s impact on the community by the number of re-tweets their article received, or by the number of downloads of their dataset.  This reaction can be attributed to several causes, one of which may be an irrational fear of change.  But the reality is that the landscape of science is changing dramatically, and the trend towards social media as a scientific tool is only likely to continue.  See my blog post on why scientists should tweet for more information on the benefits of embracing one of the aspects of this trend.

Need another reason to get on board? Funders see the value in altmetrics. Priem, along with his co-PI (and my DataONE colleague) Heather Piwowar, just received $125K from the Sloan Foundation to expand their Total Impact project. Check out the Total Impact website for more information, or read the UNC SILS news story about the grant.

The DCXL project feeds right into the concept of altmetrics.  By providing citations for datasets that are housed in data centers, DCXL makes it possible for the impact of a scientist’s data to be folded into measures of that scientist’s overall impact.

Data Publishing–the First 500 Years

Data publishing is the new hot topic in a growing number of academic communities.  The scholarly ecosophere is filled with listserv threads, colloquia and conference hallway chats punctuated with questions of why to do it, how to do it, where to do it, when to do it and even what to call this seemingly new breed of scholarly output.  Scholars, and those who provide the tools and infrastructure to support them, are consumed with questions that don’t seem to have easy answers, and certainly not answers that span all disciplines.  How can researchers gain credit for the data they produce, separate from and in addition to the analysis of those data as articulated in formal publications?  How can scholars researching an area find relevant data sets within and across their own disciplines?  How are data and methodologies most effectively reviewed, validated, and corrected?  How are meaningful connections maintained between different versions, iterations, and “re-uses” of a given set of data?

The high-pitched level of debate on these topics is surprising in some ways, given that datasets, at least in certain fields, have been readily available for a while. The Inter-University Consortium for Political and Social Research (ICPSR) has been allowing social scientists to publish or find datasets since the early 1960s.  Great datasets gathered and published under institutional auspices include UN Data, the UNESCO Statistical Yearbook, and the IMF depository library program.  Closer to home is the United States’ Federal Depository Library Program, which since its establishment in 1841 has served as a distribution mechanism to ensure public access to governmental documents and data.

While these outlets are only viable solutions for some disciplines, their presence started me down a path exploring the history of data publishing in an effort to try to gain some perspective on the challenges we are facing today. Somewhat surprisingly, data publishing, conducted in a manner that would be recognized by today’s scholars, has been occurring for almost half a millennium. Yes, that’s right; we are now 500 years into producing, analyzing and publishing data.

These early activities centered on demographic data, presumably in an effort to identify and understand the dramatic patterns of life and death. Starting in the late 1500s and prompted by the Plague, “Bills of Mortality” recording deaths within London began to be published, soon appearing on a weekly basis. That raw data generation caught the attention of John Graunt, a community-minded draper: an extremely bright, though non-university-affiliated, London resident who was inspired to gather those numerical lists, turn them into a dataset, analyze those data (looking for causes of death, ages of death, comparisons of rates between London and elsewhere, etc.), and publish both the dataset and his findings regarding population patterns in a groundbreaking 1662 work, “Natural and Political Observations Mentioned in a Following Index, and Made upon the Bills of Mortality.” The work was submitted to the Royal Society, which recognized its merit and inducted the author into its fellowship. Graunt continued to extend his data and analysis, publishing new versions of each in subsequent years. Thus was born the first great work (at least in the Western world) of statistical analysis, or “political arithmetic” as it came to be called at that time.

Moving from the 16th and 17th centuries to the 18th brings us to another major point in data publishing history with Johann Sussmilch of Germany. Sussmilch was originally a cleric involved in a variety of intellectual pursuits, though unaffiliated with a university, at least initially. Sussmilch’s interests included theology, statistics, and linguistics. He was eventually appointed to the Royal Academy of Sciences and Fine Arts for his linguistic scholarship. Sussmilch’s great work was the “Divine Order,” an ambitious effort to collect detailed data about the population in Prussia in order to prove his religious theory of “Rational Theology.” In other words, Sussmilch was engaged in a basic research program: he had a theory, formed a research question, collected the data required to test that theory, analyzed his data, and then published his results along with those data.

The rigorous quality of Sussmilch’s work (both the data and the analysis) elevated it far beyond his original and personal religious motivations, leading it to have a wide impact throughout parts of Europe. It became a focal point of exchange between scholars across countries and prompted debate over his data collection methodology and interpretation. Put another way, Sussmilch’s work inspired his colleagues to engage in the modern model of “scholarly communication” – engaging in a spirited critical dialogue which in turn resulted in changes to the next edition of the work (for instance, separate tables for immigration and racial data). Published first in 1741, it was updated and reprinted six times through 1798.

In this earlier time, as in our own, the drive to engage with other intellectuals was paramount. Publishing, sharing, critiquing and modifying data production efforts and analysis was seemingly as much a driving force among this community as it is among the scholars of today. Researchers of the 17th and 18th centuries dealt with issues of attribution, review of data veracity and analytical methodology and even versioning. The surprising discovery of apparent similarities across such a large gulf of time prompts many questions. If data could be published and shared centuries ago, why are we faced with such tremendous challenges to do the same today? Are we overlooking approaches from the past that could help us today? Or are we glossing over the difficulties of the past?

More research would have to be done to answer these questions thoroughly, but perhaps a gesture can be made in that regard by identifying some of the contrasting aspects between yesterday and today. Taking the examples from above as a jumping off point, perhaps the most striking difference between past activities and the goals articulated in conversations about data publishing today is that the data publication efforts of the past were accompanied by an equally important piece of analysis, and the research community was interested in both. The conclusions drawn from the data were held to scrutiny as were the data and data collection methods that provided their foundation. All of the components of the research were of concern. These scholars were not interested in publishing the data on their own, but rather wanted to present them along with their arguments, with each underscoring the other.

Another difference is the changing relationships between individual researchers and the entities that support them. Not only do we have governments and academic institutions, but we have a new contemporary player, the corporation, which is driven by a substantially different motivation from entities of past ages. In addition, a broader range of disciplines is now concerned with data publication, and perhaps those disciplines face stumbling blocks not at issue for the social scientists working with demographic and public health data. Given the known heterogeneity of scholarly communication practices across different fields, there seems to be no reason to think that data publishing needs, expectations and concerns would not also vary. And of course, the most obvious difference between then and now is with tools and technology. Have those advancements altered fundamental data publishing practices and if so, how?

These are interesting but complex questions to pursue. Fortunately, the above examples of our data publishing antecedents have hopefully revealed that there are meaningful touchstones to use as reference points as we attempt to address these points. Data publishing has a rich, resonant past stretching back hundreds of years, providing us with an opportunity to reach into that past to better understand the trajectory that has brought us to this moment, thereby helping us more effectively grapple with the questions that seem to confound us today.