Today, we are pleased to announce the publication Making Data Count in Scientific Data. John Kratz and Carly Strasser led the research effort to understand the needs and values of both the researchers who create and use data and of the data managers who preserve and publish it. The Making Data Count project is a collaboration between the CDL, PLOS, and DataONE to define and implement a practical suite of metrics for evaluating the impact of datasets, which is a necessary prerequisite to widespread recognition of datasets as first class scholarly objects.
We started the project with research to understand what metrics would be meaningful to stakeholders and what metrics we can practically collect. We conducted a literature review, focus groups, and– the subject of today’s paper– a pair of online surveys for researchers and data managers.
In November and December of 2014, 247 researchers and 73 data repository managers answered our questions about data sharing, use, and metrics. Survey and anonymized data are available in the Dash repository. These responses told us, among other things, which existing Article Level Metrics (ALMs) might be profitably applied to data:
- Social media: We should not worry excessively about capturing social media (Twitter, Facebook, etc.) activity around data yet, because there is not much to capture. Only 9% of researchers said they would “definitely” use social media to look for a dataset.
- Page views: Page views are widely collected by repositories but neither researchers nor data managers consider them meaningful. (It stands to reason that, unlike a paper, you can’t have engaged very deeply with a dataset if all you’ve done is read about it.)
- Downloads: Download counts, on the other hand, are both highly valuable and practical to collect. Downloads were a resounding second-choice metric for researchers and 85% of repositories already track them.
- Citations: Citations are the coin of the academic realm. They were by far the most interesting metric to both researchers and data managers. Unfortunately, citations are much more difficult than download counts to work with, and relatively few repositories track them. Beyond technical complexity, the biggest challenge is cultural: data citation practices are inconsistent at best, and formal data citation is rare. Despite the difficulty, the value of citations is too high to ignore, even in the short term.
We have already begun to collect data on the sample project corpus– the entire DataONE collection of 100k+ datasets. Using this pilot corpus, we see preliminary indications of researcher engagement with data across a number of online channels not previously thought to be in use by scholars. The results of this pilot will complement the survey described in today’s paper with real measurement of data-related activities “in the wild.”
For more conclusions and in-depth discussion of the initial research, see the paper, which is open access and available here: http://dx.doi.org/10.1038/sdata.2015.39. Stay tuned for analysis and results of the DataONE data-level metrics data on the Making Data Count project page: http://lagotto.io/MDC/.