Understanding the Vision Behind Make Data Count and the Open Global Data Citation Corpus

John Chodacki, August 20, 2024

Posted in: data metrics, Make Data Count

As the scientific community increasingly embraces open data, the question of how these datasets are being accessed and utilized becomes ever more pressing. Researchers, funders, and policymakers alike are keen to understand the impact and reach of the data they produce, support, and use. This is where the vision of Make Data Count (MDC) and the Data Citation Corpus comes into play.

What is Make Data Count?

Make Data Count is an international initiative aimed at transforming how we measure the impact of open research data. Traditionally, the scholarly community has focused on citations to articles as a metric of impact. However, as research becomes more data-intensive, it’s clear that we need new metrics to capture the influence and reuse of datasets. MDC is committed to developing evidence-based data metrics that go beyond traditional measures, allowing for a more comprehensive understanding of data usage.

MDC’s efforts focus on creating the infrastructure and standards needed to track, collect, and report data usage and citation metrics. This includes not only citations to datasets within scholarly articles but also how data is used across various fields and sectors. The ultimate goal is to provide a holistic view of how open data contributes to scientific progress, policy-making, and beyond.

For more details on the roadmap and future developments of Make Data Count, you can explore the MDC Roadmap.

CDL’s Role in Make Data Count

The University of California Curation Center (UC3) at California Digital Library (CDL) has been a key player in the Make Data Count initiative since its inception. CDL’s expertise in managing collaborative projects and its commitment to open data practices have been instrumental in developing and advancing MDC’s goals. Over the years, CDL team members have provided strategic oversight and technical infrastructure. Currently, CDL team members serve on MDC’s advisory committee and work with other key partners, such as DataCite and the MDC Director, Iratxe Puebla, on MDC project execution. CDL continues to play a vital role in fostering collaborations with other institutions and organizations to expand the reach and impact of MDC.

A Centralized Resource for Data Citations

The Data Citation Corpus, developed in collaboration with the Chan Zuckerberg Initiative (CZI) and the Wellcome Trust, is a cornerstone of this vision. The Corpus aims to be a vast, open repository of data citations from diverse sources and disciplines, providing a centralized resource for understanding how data is being cited and reused.

This initiative addresses a significant challenge in the current landscape: the fragmented and incomplete nature of data citation information. While data citations are increasingly being created, the existing workflows for collecting and propagating these citations are often leaky, leading to gaps in the persistent identifier (PID) metadata. Furthermore, in some fields, especially within the life sciences, data sharing via repositories that use accession numbers instead of DOIs is common, which further complicates the collection of metadata on data reuse.

The Data Citation Corpus aggregates data citations from a variety of sources, including:

Persistent Identifier Authorities: DataCite and Crossref, which collect citations as part of their DOI registration metadata.
Third-Party Aggregators: Organizations using advanced techniques like full-text mining and machine learning to identify mentions of data in the full text of articles.

The corpus is being developed in iterative stages, with the initial prototype already incorporating data citations from DataCite event data and the CZI Knowledge Graph. This prototype allows for visualizations based on parameters like institution or data repository, providing valuable insights into how datasets are being cited and used across the research ecosystem.
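For readers who think in code, the aggregation idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Corpus's actual pipeline or schema: the `Citation` record, its field names, the sample identifiers, and the choice to match identifiers case-insensitively are all hypothetical. It merges citation records from two kinds of sources, drops duplicates of the same dataset/publication pair, and counts citations per repository, the kind of grouping a visualization by data repository would rely on.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Hypothetical citation record; not the Corpus's real schema."""
    dataset_id: str      # DOI or accession number of the cited dataset
    publication_id: str  # DOI of the citing publication
    repository: str      # repository that hosts the dataset
    source: str          # where this citation record came from


def merge_citations(*sources):
    """Merge citation lists, keeping the first record per dataset/publication pair."""
    merged = {}
    for records in sources:
        for record in records:
            # Assumption: identifiers are compared case-insensitively.
            key = (record.dataset_id.lower(), record.publication_id.lower())
            merged.setdefault(key, record)
    return list(merged.values())


def count_by_repository(citations):
    """Count merged citations per hosting repository."""
    return Counter(c.repository for c in citations)


# Made-up sample records standing in for PID-authority metadata and text-mined mentions.
pid_records = [
    Citation("10.1234/example.1", "10.5555/article.a", "Example Repo A", "pid-metadata"),
    Citation("10.1234/example.2", "10.5555/article.b", "Example Repo A", "pid-metadata"),
]
text_mined_records = [
    # Same citation as the first PID record, found again by text mining.
    Citation("10.1234/EXAMPLE.1", "10.5555/article.a", "Example Repo A", "text-mining"),
    # Dataset identified by an accession number rather than a DOI.
    Citation("ABC000123", "10.5555/article.c", "Example Repo B", "text-mining"),
]

merged = merge_citations(pid_records, text_mined_records)
counts = count_by_repository(merged)
print(counts)
```

Keeping the first record seen means sources passed earlier take precedence when the same citation appears more than once; a real workflow might instead retain every source as provenance.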

As the project progresses, the goal is to expand the Data Citation Corpus to include additional sources and features, ultimately creating a resource that different stakeholders—researchers, funders, institutions, and policymakers—can use to integrate data usage information into their work.

Expanding the Corpus and Engaging the Community

To further the goals of expanding and refining the Data Citation Corpus, MDC is hosting a hackathon on September 4, 2024, focused on building curation workflows for the corpus. The hackathon will bring together data scientists, developers, and engineers to work on two key projects: developing user interfaces for the corpus and creating workflows for community-driven curation of data citations.

The hackathon will take place in two locations, with sessions at the Wellcome Trust in London and the California Digital Library in Oakland, California. Participants will collaborate on innovative solutions that will be presented the following day at the MDC Summit.

Stay tuned for a follow-up post where we will share the outcomes of the hackathon and the exciting developments that emerge from this collaborative effort.