Understanding the Vision Behind Make Data Count and the Open Global Data Citation Corpus

As the scientific community increasingly embraces open data, the question of how these datasets are being accessed and utilized becomes ever more pressing. Researchers, funders, and policymakers alike are keen to understand the impact and reach of the data they produce, support, and use. This is where the vision of Make Data Count (MDC) and the Data Citation Corpus comes into play.

What is Make Data Count?

Make Data Count is an international initiative aimed at transforming how we measure the impact of open research data. Traditionally, the scholarly community has focused on citations to articles as a metric of impact. However, as research becomes more data-intensive, it’s clear that we need new metrics to capture the influence and reuse of datasets. MDC is committed to developing evidence-based data metrics that go beyond traditional measures, allowing for a more comprehensive understanding of data usage.

MDC’s efforts focus on creating the infrastructure and standards needed to track, collect, and report data usage and citation metrics. This includes not only citations to datasets within scholarly articles but also how data is used across various fields and sectors. The ultimate goal is to provide a holistic view of how open data contributes to scientific progress, policy-making, and beyond.

For more details on the roadmap and future developments of Make Data Count, you can explore the MDC Roadmap.

CDL’s Role in Make Data Count

The University of California Curation Center (UC3) at the California Digital Library (CDL) has been a key player in the Make Data Count initiative since its inception. CDL’s expertise in managing collaborative projects and its commitment to open data practices have been instrumental in the development and implementation of MDC’s goals. Over the years, CDL team members have provided both strategic oversight and technical infrastructure. Currently, CDL team members serve on MDC’s advisory committee and work with other key partners, such as DataCite and the MDC Director, Iratxe Puebla, on MDC project execution. CDL continues to play a vital role in fostering collaborations with other institutions and organizations to expand the reach and impact of MDC.

A Centralized Resource for Data Citations

The Data Citation Corpus, developed in collaboration with the Chan Zuckerberg Initiative (CZI) and the Wellcome Trust, is a cornerstone of this vision. The Corpus aims to be a vast, open repository of data citations from diverse sources and disciplines, providing a centralized resource for understanding how data is being cited and reused.

This initiative addresses a significant challenge in the current landscape: the fragmented and incomplete nature of data citation information. While data citations are increasingly being created, the existing workflows for collecting and propagating these citations are often leaky, leading to gaps in the persistent identifier (PID) metadata. Furthermore, in some fields, especially within the life sciences, data sharing via repositories that use accession numbers instead of DOIs is common, which further complicates the collection of metadata on data reuse.
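To make the identifier problem concrete, the sketch below classifies a raw data identifier as a DOI or a possible accession number. This is purely illustrative: the accession patterns shown (GenBank- and GEO-style) are assumed examples, not an exhaustive registry, and this is not code from the Data Citation Corpus itself.

```python
import re

# Minimal sketch: classify a raw identifier as a DOI or a possible
# accession number. The accession patterns below are illustrative
# assumptions (GenBank- and GEO-style), not an exhaustive registry.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")
ACCESSION_PATTERNS = [
    re.compile(r"^[A-Z]{1,2}\d{5,8}(\.\d+)?$"),  # e.g. GenBank-style nucleotide IDs
    re.compile(r"^GSE\d+$"),                     # e.g. GEO series accessions
]

def classify_identifier(raw: str) -> str:
    """Return 'doi', 'accession', or 'unknown' for a raw identifier string."""
    value = raw.strip()
    # Strip common DOI URL/prefix forms before matching.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if value.lower().startswith(prefix):
            value = value[len(prefix):]
            break
    if DOI_PATTERN.match(value):
        return "doi"
    if any(p.match(value) for p in ACCESSION_PATTERNS):
        return "accession"
    return "unknown"
```

A real pipeline would need a far richer registry of repository-specific accession formats, which is exactly why accession-based citations are harder to collect than DOI-based ones.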

The corpus is being developed in iterative stages, with the initial prototype already incorporating data citations from DataCite event data and the CZI Knowledge Graph. This prototype allows for visualizations based on parameters like institution or data repository, providing valuable insights into how datasets are being cited and used across the research ecosystem.
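As a rough illustration of the kind of aggregation such a corpus enables, the sketch below tallies citation events per data repository. The record fields here are assumptions for illustration, not the corpus’s actual schema.

```python
from collections import Counter

# Illustrative citation events; the field names ("dataset", "repository",
# "citing_article") are assumptions, not the corpus's actual schema.
citations = [
    {"dataset": "10.5061/dryad.abc", "repository": "Dryad", "citing_article": "10.1234/a1"},
    {"dataset": "10.5281/zenodo.123", "repository": "Zenodo", "citing_article": "10.1234/a2"},
    {"dataset": "10.5061/dryad.def", "repository": "Dryad", "citing_article": "10.1234/a3"},
]

def citations_by_repository(records):
    """Tally citation events per data repository."""
    return Counter(r["repository"] for r in records)

print(citations_by_repository(citations))  # Counter({'Dryad': 2, 'Zenodo': 1})
```

The same grouping by institution, funder, or field underlies the prototype’s visualizations.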

As the project progresses, the goal is to expand the Data Citation Corpus to include additional sources and features, ultimately creating a resource that different stakeholders—researchers, funders, institutions, and policymakers—can use to integrate data usage information into their work. 

Expanding the Corpus and Engaging the Community

To further the goals of expanding and refining the Data Citation Corpus, MDC is hosting a hackathon on September 4, 2024, focused on building curation workflows for the corpus. The hackathon will bring together data scientists, developers, and engineers to work on two key projects: developing user interfaces for the corpus and creating workflows for community-driven curation of data citations.

The hackathon will take place in two locations, with sessions at the Wellcome Trust in London and the California Digital Library in Oakland, California. Participants will collaborate on innovative solutions that will be presented the following day at the MDC Summit.

Stay tuned for a follow-up post where we will share the outcomes of the hackathon and the exciting developments that emerge from this collaborative effort.

Embracing a New Era of Data Curation: A Vision for Openness and Innovation at UC3

At the University of California Curation Center (UC3), our commitment to advancing data curation and publishing is deeply rooted in our belief in open access and the open data movement. For years, we’ve worked to support researchers and ensure that UC scholarship resonates beyond academia. Our recent efforts, including our successful partnership with Dryad, are part of a broader strategy to amplify UC research and foster a more connected and open scientific landscape. As with all areas of our work, the world of research data is evolving rapidly, and the UC3 data curation team is embracing this change. Following the successful conclusion of the Dryad co-development work, we are now exploring new projects that continue to support the research data community. This blog post outlines our direction and describes a few ways our team is leveraging past successes and continuing to evolve.

Overcoming Challenges, Exploring Opportunities

Publishing research data is complex, especially when ensuring it is Findable, Accessible, Interoperable, and Reusable (FAIR). However, advancements in artificial intelligence (AI) may offer exciting opportunities. Through our work with partners at the Generalist Repository Ecosystem Initiative (GREI), we have started to investigate AI tools that can help streamline curation to provide data creators with real-time feedback, targeted guidance, and even dynamic visualizations. This approach simplifies and enhances the publication process, making it more accessible and valuable to researchers.

Revolutionizing the Curation Process

High-quality, accessible research data is essential for progress in any field. Common dataset issues such as missing or inconsistent metadata, formatting errors, and lack of standardization can hinder progress. We are evaluating approaches to transform manual data curation processes to address these challenges directly. By doing so, we aim to unlock the full potential of datasets, enabling greater collaboration and reproducibility and accelerating progress across different fields.

Our two-part strategy focuses on:

  1. Pre-Deposit Support: Researchers receive interactive assistance in preparing their data for publication, ensuring it is ready for widespread dissemination and interoperable use.
  2. Post-Deposit Enhancement: This process involves reviewing and enhancing published datasets to improve their quality, usability, and potential for further research and applications.
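The pre-deposit step above could take the form of automated feedback on draft metadata. The sketch below is a hypothetical example of such a check; the required field names and rules are assumptions for illustration, not an actual UC3 or repository policy.

```python
# Hypothetical pre-deposit check: flag missing or thin metadata fields
# before a dataset is published. Field names and rules are illustrative
# assumptions, not an actual repository policy.
REQUIRED_FIELDS = ["title", "creators", "description", "license"]

def predeposit_feedback(metadata: dict) -> list[str]:
    """Return human-readable issues found in draft dataset metadata."""
    issues = []
    for field in REQUIRED_FIELDS:
        if not metadata.get(field):
            issues.append(f"Missing required field: {field}")
    description = metadata.get("description", "")
    if description and len(description) < 50:
        issues.append("Description is very short; consider expanding it.")
    return issues
```

Checks like this could feed the real-time feedback and targeted guidance described above; post-deposit enhancement would run similar rules against already-published records.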

Collaboration for a Brighter Future

Our team has a long history of collaboration within the repository ecosystem. Through these partnerships, we have been able to learn from and strengthen the data publishing space. Our team’s continued participation in GREI expands our community of data repository infrastructure even further. This initiative brings together seven generalist repositories: Zenodo, figshare, Dryad, Vivli, Mendeley Data, the Center for Open Science, and Dataverse, allowing UC3 to leverage a wide range of tools and expertise for handling diverse datasets and complex curation tasks. By collaborating with the multiple repositories in GREI, we have opportunities to learn and work with varying approaches to managing and sharing data.

Data Packages: A Key to Unlocking Potential

One promising avenue our team has been evaluating is the use and utility of Data Packages, a concept pioneered by the Frictionless Data project at the Open Knowledge Foundation. Data Packages elevate the value of datasets by ensuring data and essential metadata are prepared in predictable structures, making them self-explanatory, easily shareable, and reusable. This enhances discoverability and usability for researchers. Data repositories can implement data packages by providing tools and guidance for consistent metadata creation. Researchers benefit from streamlined data submission processes automatically generating well-documented and accessible data packages. Implementing Data Packages is a key part of our broader strategy. Our initial API experiments across different repositories have shown promising results. While still in the early stages, we see significant potential to scale this work, transforming how data is published and curated.
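To show what “predictable structure” means in practice, here is a minimal Frictionless Data-style descriptor built with the standard library. The dataset name, resource, and field values are illustrative examples; the Data Package specification defines the full set of properties.

```python
import json

# A minimal Frictionless Data-style package descriptor (datapackage.json).
# All concrete values below are illustrative examples.
descriptor = {
    "name": "ocean-temperatures",
    "title": "Ocean Temperature Observations",
    "licenses": [{"name": "CC0-1.0"}],
    "resources": [
        {
            "name": "observations",
            "path": "data/observations.csv",
            "format": "csv",
            "schema": {
                "fields": [
                    {"name": "date", "type": "date"},
                    {"name": "site", "type": "string"},
                    {"name": "temp_c", "type": "number"},
                ]
            },
        }
    ],
}

# Serialized, this descriptor travels alongside the data files, making
# the dataset self-describing for both humans and machines.
print(json.dumps(descriptor, indent=2))
```

Because the descriptor declares each column’s name and type, downstream tools can validate the data and generate documentation without guessing at its structure.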

Building a Sustainable Future Together

Our new direction is not just about technology; it’s about building a culture of openness and collaboration. By partnering with other organizations facing similar challenges, we can “future-proof” our efforts and ensure our solutions are sustainable and adaptable to the ever-changing landscape.

We are exploring innovative tools and workflows, including automated data quality assessment tools, AI-powered metadata enrichment tools, dynamic data visualization platforms, API integrations for seamless submissions, and data packaging tools. All these efforts aim to improve data curation and publishing. We are actively seeking comprehensive solutions to ensure scalability and optimize resource allocation for final implementation.

We are excited about creating a more connected and open scientific landscape where research data can achieve greater reach and impact. If you want to learn more, please contact the UC3 Data Publishing Product Manager, Steve Diggs, at steve.diggs@ucop.edu.

High-Quality Metadata: A Collective Responsibility and Opportunity

Cross-posted from the Upstream Blog: https://upstream.force11.org/high-quality-metadata/

Our community and tools rely on high-quality DOI metadata for building connections and obtaining efficiencies. However, the current model – where improvements to this metadata are limited to its creators or done within service-level silos – perpetuates a system of large-scale gaps, inefficiency, and disconnection. It doesn’t have to be this way. By collaboratively building open, robust, and scalable systems for enriching DOI metadata, we can leverage the work of our community to break down these barriers and improve the state and interconnectedness of research information.

On August 3, 2024, at the FORCE11 conference in Los Angeles, the University of California Curation Center (UC3) hosted the first in what will be a series of discussions about community enrichment of DOI metadata: why we need it, how to do it, and who would like to be involved. As part of the California Digital Library (CDL), UC3 is an established leader in collaborative, open infrastructure and persistent identifier (PID) projects. This effort builds on our existing work to enhance scholarly communication and research data management using the collective expertise of our community. To broaden the scope of this engagement and include those who could not attend, we’d like to share the initial thoughts that guided this discussion, organized as a series of observations, principles, and goals.

Collaborative Infrastructure as a Shared Source of Truth

Building the corpus of DOI metadata over many years has taught our community an important lesson: when we work together to define how infrastructure should exist, how we want to build and improve upon it, we arrive at better outcomes than when we do this work alone. Collective stewardship of our shared sources of truth is what allows us to make the right decisions for as many people as we can.

It is thus unsurprising that Crossref and DataCite, two of the primary organizations responsible for this work, have grown in scope and impact alongside the systems they have brought into existence. In pursuing the immense value and network effects that result from solving the same problems in the same places, they have demonstrated that it is possible to align a diverse set of actors around the goals of open infrastructure and open research information. In their embrace of the Principles of Open Scholarly Infrastructure (POSI), in the example and sponsorship they have provided to new services, through their constant advocacy, they have steered our efforts toward greater openness and continued improvement.

The fruits of this labor extend beyond their own services to everything that is derived therefrom. From bibliometric analysis to research evaluation, from discovery services to funder impact tracking, this rich web of scholarly metadata is the basis for so much of our work. It bridges open and closed systems, fosters interoperability, and helps to guarantee the integrity of the scholarly record.

Enriching Metadata in Service-level Silos Creates Inefficiencies and Disconnects

Despite these successes, a persistent problem has arisen from the maintenance model of DOI metadata. In its current conception, making corrections or improvements to records is almost exclusively the remit of their depositors. As a result, much of the work to improve records is done in services that consume DOI metadata, as opposed to at their sources.

To a great extent, these efforts are admirable. They demonstrate the ingenuity and resilience of our community to route around any obstacles we encounter. This service-level enrichment, however, also leads to the duplication of work and a more fragmentary, isolated view of research information that our shared efforts seek to avoid. When the collaboration and observability derived from the “one place, one thing” model of DOI metadata are removed, changes to records occur multiple times in many different places. Each change introduces the potential for discrepancies that must then be reconciled. Since DOI metadata relies heavily on accurate linking between authors, institutions, and other works, these individual discrepancies can quickly compound into aggregate views that are wildly divergent from their sources and from each other.

This is, again, contrary to our building and maintenance of DOI infrastructure as our source of truth. We derive value from DOIs by having a persistent reference to an object, a description of that object, and by being able to perform some basic validation that the object exists. Reliance on service-level enrichment leads to a more unstable arrangement, where to either provide or discern a more complete description, we have to stitch together different views of an object in multiple services that have no corresponding guarantee of stability, provenance, or persistence. As a result, organizations can invest a great deal of time into service-level workflows, only to lose access to them, for the services to change or degrade, and for all of their efforts to become non-transferable or lost.

A More Comprehensive Form of Research Information Can Be Achieved Through Diverse and Consensus-Based Descriptions

While it is important to acknowledge the complications that result from service-level enrichment, the history of this work has also shown that it is necessary to synthesize many forms of improvement to achieve complete and accurate descriptions. The investment made by users in these services results from them being permitted to make changes that cannot be made at the source and because it is simply not true that a depositor of DOI metadata always has the time, resources, or ability to produce a better form of it. Perhaps more importantly, the depositor can also not anticipate in advance what every user will require from their records. Instead, it is the diverse feedback from all users that captures their corresponding needs from this metadata, correcting for gaps, errors and biases that may be present when we rely on the depositor as the sole source of truth.

At the same time, to guarantee that records remain usable for all, we need to build consensus mechanisms that define how and when changes should be applied, as well as when they are correct and appropriate. Here, we could have an extensive discussion about these specificities, but the point should never be to anticipate every possible scenario in advance. Instead, we should determine what structures are needed to navigate these issues as a community, from their most basic to their most complex. There are countless examples we can draw from: ROR’s community curation model, the coopetition framework of the Generalist Repository Ecosystem Initiative (GREI), the rigorous analysis done by the Centre for Science and Technology Studies (CWTS), all of which demonstrate that collaborative, community-driven approaches are both effective and sustainable in guiding improvements to our sources of truth.

Empowering the Community to Validate and Improve Metadata

That records may be improved more frequently outside their sources than within them is unsurprising, given the many systems we know to be doing this work, but it suggests the need for a change in approach. Past, successful efforts to improve DOI metadata have focused on lobbying depositors to contribute better and more complete records. Although this advocacy is still needed for aspects of records that only their authors can improve, the overall work should be refocused away from the advocacy model, given what we know can be accomplished through service-level improvements.

Specifically, we must allow for the same community enrichment of DOI metadata to occur at the source, meaning Crossref and DataCite, such that these records are maintained at a comparable level of quality and completeness. By doing so, we better reflect the existing reality where users are direct contributors to this metadata, further refining it from the baseline provided by depositors to be more comprehensive, correct, and aligned with their needs. Existing work being done at the service-level can then move upstream, and achieve the same visibility and collective stewardship that has been integral to the success of DOI infrastructure. 

New Systems for Enrichment Should Be Open, Reproducible, Scalable, and Technically Sound

This visibility and stewardship require open and reproducible enrichment processes. At the most basic level, openness and reproducibility are needed to validate both the quality and performance of any enrichment process. Without them, we have no way of accurately determining whether a given set of improvements meets the needs of the community or is useful to apply at scale. We likewise establish confidence in enrichment by allowing users to validate things like the representativeness of our benchmarks, the soundness of our designs, and the overall improvements that result from any work. This openness also allows users to immediately leverage and iterate upon any enrichment process, such that they can derive value from it separate from or in the absence of its implementation.

To succeed in this way, the work of enrichment must also be able to transition from any one system to another. Who has the resources, interest, and expertise to engage in these activities can and will shift over time. Openness and reproducibility ensure that we can adapt to these changes, transfer responsibilities, welcome new contributors, and accommodate attrition.

Enrichment Systems Require Shared Standards and Provenance Information

We know from past efforts that we need to bring together a diverse group of users and a disparate set of systems to improve DOI metadata. We likewise can gather from the success of DOI infrastructure and the enrichment found in service-level descriptions that this is an achievable outcome. However, to realize this aim also requires that community enrichment occur in consistent and actionable ways.

At a practical level, what this means is shared formats for describing enrichment that can be generated by any system and that include provenance information linking the enrichment back to its source. Whether a user is submitting an individual correction or some matching process is updating millions of records, we should indicate the source of these actions such that they can be evaluated, approved, or reverted, as needed. Enrichment must likewise be described in machine-actionable ways, meaning that if we establish consensus or thresholds for forms of improvement, these can be acted on automatically and occur at scale.
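As one possible shape for such a shared format, the sketch below builds an enrichment record: a proposed change to a DOI record, carrying provenance that links it back to its source so it can be evaluated, approved, or reverted. Every field name here is a hypothetical illustration, not an existing Crossref or DataCite schema.

```python
import json
from datetime import datetime, timezone

# Hypothetical enrichment-record format. All field names are
# illustrative assumptions, not an existing Crossref/DataCite schema.
def make_enrichment(doi: str, field: str, old, new, source: str, method: str) -> dict:
    """Describe one proposed change to a DOI record, with provenance."""
    return {
        "target_doi": doi,
        "field": field,
        "old_value": old,
        "proposed_value": new,
        "provenance": {
            "source": source,  # who or what produced the change
            "method": method,  # e.g. "manual" or "affiliation-matching"
            "timestamp": datetime.now(timezone.utc).isoformat(),
        },
    }

record = make_enrichment(
    "10.1234/example", "affiliation", None,
    "https://ror.org/03yrm5c26", "uc3-curation", "manual",
)
print(json.dumps(record, indent=2))
```

Because each record names its source and method, the same format works for a single manual correction and for a bulk matching process, and either can be approved or reverted by policy.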

This approach has firm precedents in our existing systems. Both Crossref’s and DataCite’s schemas have been refined through multiple iterations of planning and community feedback and are used in a constant stream of references, creations, and updates to existing works. We can thus use these as models to rationalize enrichment within their well-defined frameworks.

Moving Forward Together

Community enrichment of DOI metadata poses significant challenges, but not insurmountable ones. Our initial meeting in Los Angeles reaffirmed the community’s interest in tackling this together, just as we have done with other successful infrastructure initiatives. Through collaboration and use of our shared expertise, we can build a better, more connected system of research information. UC3 will be continuing these critical discussions, and we encourage you to stay engaged with us. If you have any additional questions or would like to contribute further to these conversations, please feel free to reach out to me at adam.buttrick@ucop.edu. We hope you will join us in this work!