Why Award DOIs Matter: Strengthening Discovery Across UC’s Funding Programs
The Research Grants Program Office (RGPO) at the University of California Office of the President manages one of the UC system’s most impactful research portfolios, comprising over $100 million in yearly awards across programs such as the California Breast Cancer Research Program and the California HIV/AIDS Program. These diverse funding activities are complemented by rigorous internal data practices for tracking their impact, including rich and detailed descriptions of these activities in RGPO’s public-facing grants database.
As persistent identifier (PID) enthusiasts, the UC3 team immediately recognized this high-quality data source as a unique opportunity. By leveraging RGPO’s comprehensive metadata to generate award DOIs in DataCite, we could bridge the gap between its internal accounting and the larger research ecosystem, broadcasting the full scope of RGPO’s impact to a much wider audience.
Why DOIs?
Registering DOIs for awards provides a high-level view of all of RGPO’s work across its funding programs, describing them in a unified fashion using the DataCite schema. This provides a persistent, machine-readable reference and improves the visibility of these funding activities. By assigning DOIs, RGPO awards become connected to the broader persistent identifier ecosystem, meaning that other systems can easily discover, link to, and reuse information about these awards and their associated research outputs. Ultimately, this helps close gaps between internal and external systems, creating a more comprehensive picture of the University of California’s impact and RGPO’s role in that success.
How we did it
We worked closely with the DataCite team to analyze existing practices for representing awards in their schema, ensuring that awards were modeled correctly. This included identifying and resolving inconsistencies in how awards had previously been represented.
Once those issues were resolved and the model was set, the registration process itself was straightforward:
- We mapped RGPO award data to the DataCite schema.
- Added ROR IDs for the funder (University of California Office of the President) and other relevant entities.
- Linked research outputs to the awards by including DOIs for their related works.
- Generated the XML for the award records and registered them via the DataCite API.
- Finally, we provided the RGPO with a report of these registrations so that the DOIs could be integrated back into the RGPO’s grants database.
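The steps above can be sketched roughly as follows. This is a minimal illustration, not RGPO’s actual pipeline: the award fields, the placeholder ROR ID and DOIs, the `HasPart` relation type, and the use of DataCite’s JSON REST API (rather than the XML we generated) are all assumptions for demonstration, as is the `Award` resource type introduced in DataCite schema 4.6.

```python
import json

DATACITE_API = "https://api.datacite.org/dois"  # DataCite REST endpoint (not called here)

def build_award_payload(prefix, award):
    """Map a simplified award record to a DataCite REST API payload.

    The field choices below are illustrative assumptions, not RGPO's
    actual mapping.
    """
    return {
        "data": {
            "type": "dois",
            "attributes": {
                "prefix": prefix,
                "titles": [{"title": award["title"]}],
                "creators": [{
                    "name": award["funder_name"],
                    "nameType": "Organizational",
                    "nameIdentifiers": [{
                        "nameIdentifier": award["funder_ror"],
                        "nameIdentifierScheme": "ROR",
                    }],
                }],
                "publisher": award["funder_name"],
                "publicationYear": award["year"],
                "types": {"resourceTypeGeneral": "Award"},
                # Link known research outputs to the award record.
                "relatedIdentifiers": [
                    {
                        "relatedIdentifier": doi,
                        "relatedIdentifierType": "DOI",
                        "relationType": "HasPart",  # assumption; other relation types may fit better
                    }
                    for doi in award.get("output_dois", [])
                ],
                "url": award["landing_page"],
                "event": "publish",
            },
        }
    }

award = {
    "title": "Example breast cancer research award",    # placeholder
    "funder_name": "University of California Office of the President",
    "funder_ror": "https://ror.org/00dmfq477",          # placeholder; look up the real UCOP ROR ID
    "year": 2024,
    "output_dois": ["10.1234/example-output"],          # placeholder DOI
    "landing_page": "https://example.org/awards/123",   # placeholder URL
}
payload = build_award_payload("10.5072", award)  # 10.5072 is a reserved test prefix
print(json.dumps(payload, indent=2)[:80])
```

In a real workflow, the payload would be POSTed to the API endpoint with repository credentials, and the resulting DOIs collected for the report back to RGPO.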
What’s Next: Automation and Research Graph Connections
We’re now focused on two primary next steps:
1. We’re working to automate more of the DOI registration and update processes, so that new or changed awards are registered more frequently than our current manual process allows.
2. We’re collaborating with OpenAlex on their new grant-funded project to incorporate better and more complete funding metadata into OpenAlex’s scholarly graph. Because CDL was among the first to register award DOIs with DataCite, OpenAlex is using our detailed account of RGPO’s funding activities to model the ingestion and mapping of DataCite award DOIs more broadly.
These efforts also open RGPO grant projects to further metadata enrichment and connections. This work includes matching grant-funded research outputs with their corresponding award DOIs, both by matching unstructured publication references in the funder metadata and by mining full-text publications to identify links that were not explicitly asserted in DOI metadata. Once these connections are identified, they can be incorporated back into the award DOIs in DataCite, making their descriptions more comprehensive and complete.
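As a rough illustration of the reference-matching step, a naive title-based matcher might look like the sketch below. This is an assumption-laden simplification: a production system would also use authors, years, venues, and more robust similarity measures, and all DOIs and strings here are placeholders.

```python
import re
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase and strip punctuation for rough string comparison."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def match_reference(reference, candidates, threshold=0.9):
    """Match an unstructured reference string against candidate DOI titles.

    `candidates` maps DOI -> title. Title similarity alone is a crude
    proxy; real pipelines combine multiple metadata fields.
    """
    best_doi, best_score = None, 0.0
    ref_norm = normalize(reference)
    for doi, title in candidates.items():
        score = SequenceMatcher(None, ref_norm, normalize(title)).ratio()
        # A containment check helps when the reference string wraps the
        # title in author names and venue information.
        if normalize(title) in ref_norm:
            score = max(score, 1.0)
        if score > best_score:
            best_doi, best_score = doi, score
    return best_doi if best_score >= threshold else None

reference = "Smith J, et al. A study of tumor suppressor genes. J Cancer Res, 2023."
candidates = {"10.1234/abc": "A Study of Tumor Suppressor Genes"}  # placeholder DOI
print(match_reference(reference, candidates))  # -> 10.1234/abc
```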
Our hope is that this work demonstrates the value of PIDs for awards and encourages other funders to adopt a similar approach. Registering award DOIs doesn’t just improve local data quality; it strengthens global research infrastructure and helps make the impact of publicly funded research more visible and more connected.
UC3 New Year Series: Persistent Identifiers at CDL in 2025
CDL’s persistent identifier portfolio, which includes the Research Organization Registry (ROR), EZID, Name-to-Thing (N2T), and the new Collaborative Metadata Enrichment Taskforce (COMET) initiative, had a busy and productive year in 2024, seeing record adoption and interest and making significant technical improvements across its services. These complementary work streams help build a more rational, networked, and efficient research ecosystem – one where persistent identifiers allow for seamless connections between researchers, institutions, and scholarly outputs, reducing redundant effort and unnecessary costs and making everyone’s work more visible and impactful. As we move into 2025, I’m excited to bring you a look ahead at what we have planned for this year.
ROR
ROR is a global, community-led registry of open persistent identifiers for research and funding organizations, operated as a collaborative initiative by the California Digital Library, Crossref, and DataCite. As a trusted, free, and openly available service, ROR has become the standard for organizational identification in the scholarly communications ecosystem. The story of ROR in 2025 will be seizing the opportunities provided by this widespread adoption with better performance, improved services, and higher-quality data.
This work will begin with a Q1 launch of a new and improved version of our affiliation matching service, which has been battle-tested in OpenAlex and used to make millions of new connections between authors, works, and institutions. From here, we will further improve our API’s performance by implementing response caching for repeat requests, speeding up response times and reducing overall resource usage. Once this is complete, we will round things out by implementing a client identification system, allowing ROR to better manage its traffic, while also keeping our API services available at the same generous level of public access.
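For illustration, a client consuming the affiliation matching service might handle responses as in the sketch below. The response shape (items carrying a confidence `score` and a `chosen` flag) follows ROR’s documented affiliation matching endpoint (`GET https://api.ror.org/v2/organizations?affiliation=...`), but the sample data is abbreviated and should be verified against the live API.

```python
def pick_affiliation_match(response):
    """Return the ROR ID of the match the service marks as 'chosen', if any.

    ROR's affiliation matching only sets 'chosen' on matches confident
    enough to assert automatically; weaker candidates are left for review.
    """
    for item in response.get("items", []):
        if item.get("chosen"):
            return item["organization"]["id"]
    return None  # no match confident enough to assert automatically

# Abbreviated example response for the affiliation string
# "California Digital Library, Oakland, CA"
sample = {
    "items": [
        {"score": 1.0, "chosen": True,
         "organization": {"id": "https://ror.org/03yrm5c26",
                          "name": "California Digital Library"}},
    ]
}
print(pick_affiliation_match(sample))  # -> https://ror.org/03yrm5c26
```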
Concurrent with this technical work, ROR will pursue a number of improvements to the quality and completeness of its data, building on work done in 2024, as well as embracing new and emerging opportunities. This will include many regional improvement efforts, with work already underway in Portugal and Japan; better representation of publishers, society organizations, and funders; and the addition of new external identifiers and improved domain field coverage in support of all of these efforts. ROR will also continue to refine its curation processes to meet the growing needs of its community. In 2024, ROR processed over 8,000 curation requests—a 44% increase from 2023—with trends indicating that we should expect to receive 1,000 requests per month by year’s end. Our goal is to continue publishing the same high-quality data on our monthly release schedule, even in the face of this increased demand!
EZID and N2T
EZID and N2T are complementary persistent identifier services that enable reliable, long-term access to research outputs. EZID provides identifier creation and management services focused on ARKs and DOIs, while N2T serves as a global resolver that ensures identifiers remain reliable and actionable over time.
In 2024, EZID was sharply focused on improving its reliability and performance. This included moving to OpenSearch to power our search functionality and rewriting many of our background jobs and database queries to increase their speed and efficiency. This work has resulted in a more stable and high-performing service, capable of handling the large increase in traffic that resulted from EZID assuming resolution functionality for its own ARK identifiers. Building from this foundation, in 2025, we will continue to optimize our underlying systems, while also adding support for DataCite schema v4.6 and improving EZID’s user interface. Our UI updates will be focused on improving the application’s accessibility, such that all users can effectively manage their persistent identifiers. These coordinated improvements will help ensure that EZID remains a dependable and inclusive platform for persistent identifier management.
Alongside EZID’s improvements, we completed another major milestone in 2024: rebuilding N2T as a modern Python service. With the new flexibility and performance this provides, our 2025 plans include rolling out additional service enhancements for N2T that will better support ARK curation workflows, adding public-facing resolution statistics, and fully deprecating the legacy instance of this application. This work will continue to strengthen N2T’s role as essential infrastructure for all the great identifier usage that occurs outside and in parallel to the DOI ecosystem.
COMET
The COMET initiative, launched in November of 2024, seeks to address a critical problem in DOI metadata management. Currently, only record owners can update DOI metadata, even when others have improvements to contribute. This leads organizations to maintain their own enhanced versions of the same records in separate systems, resulting in duplicated effort and inconsistent representations of research outputs, both individually and in aggregate. COMET’s solution is to create an open framework that allows the community to contribute validated metadata improvements directly to DOI records, unlocking tremendous new value and efficiencies at the sources of this metadata.
To date, COMET has brought together experts from publishers, libraries, funding organizations, and infrastructure services in a series of listening sessions focused on the vision, product definition, and governance model for a service that would realize its goals. In March 2025, these efforts will culminate in a community call-to-action, soliciting partnerships, funding, and other resources to help build this service. Subscribe to COMET’s email list to receive up-to-date news or follow its LinkedIn page for updates.
As I hope this all conveys, there has never been a better, more energizing time to help build and participate in the persistent identifier ecosystem. Here’s to 2025 and all the exciting work ahead for UC3 and the scholarly communications community!
High-Quality Metadata: A Collective Responsibility and Opportunity
Cross-posted from the Upstream Blog: https://upstream.force11.org/high-quality-metadata/
Our community and tools rely on high-quality DOI metadata for building connections and obtaining efficiencies. However, the current model – where improvements to this metadata are limited to its creators or done within service-level silos – perpetuates a system of large-scale gaps, inefficiency, and disconnection. It doesn’t have to be this way. By collaboratively building open, robust, and scalable systems for enriching DOI metadata, we can leverage the work of our community to break down these barriers and improve the state and interconnectedness of research information.
On August 3, 2024, at the FORCE11 conference in Los Angeles, the University of California Curation Center (UC3) hosted the first in what will be a series of discussions about community enrichment of DOI metadata: why we need it, how to do it, and who would like to be involved. As part of the California Digital Library (CDL), UC3 is an established leader in collaborative, open infrastructure and persistent identifier (PID) projects. This effort builds on our existing work to enhance scholarly communication and research data management using the collective expertise of our community. To broaden the scope of this engagement and include those who could not attend, we’d like to share the initial thoughts that guided this discussion, organized as a series of observations, principles, and goals.
Collaborative Infrastructure as a Shared Source of Truth
Building the corpus of DOI metadata over many years has taught our community an important lesson: when we work together to define how infrastructure should exist, how we want to build and improve upon it, we arrive at better outcomes than when we do this work alone. Collective stewardship of our shared sources of truth is what allows us to make the right decisions for as many people as we can.
It is thus unsurprising that Crossref and DataCite, two of the primary organizations responsible for this work, have grown in scope and impact alongside the systems they have brought into existence. In pursuing the immense value and network effects that result from solving the same problems in the same places, they have demonstrated that it is possible to align a diverse set of actors around the goals of open infrastructure and open research information. In their embrace of the Principles of Open Scholarly Infrastructure (POSI), in the example and sponsorship they have provided to new services, through their constant advocacy, they have steered our efforts toward greater openness and continued improvement.
The fruits of this labor extend beyond their own services to everything that is derived therefrom. From bibliometric analysis to research evaluation, from discovery services to funder impact tracking, this rich web of scholarly metadata is the basis for so much of our work. It bridges open and closed systems, fosters interoperability, and helps to guarantee the integrity of the scholarly record.
Enriching Metadata in Service-level Silos Creates Inefficiencies and Disconnects
Despite these successes, a persistent problem has arisen from the maintenance model of DOI metadata. In its current conception, making corrections or improvements to records is the almost exclusive remit of their depositors. As a result, much of the work to improve records is done in services that consume DOI metadata, rather than at their sources.
To a great extent, these efforts are admirable. They demonstrate the ingenuity and resilience of our community to route around any obstacles we encounter. This service-level enrichment, however, also leads to the duplication of work and a more fragmentary, isolated view of research information that our shared efforts seek to avoid. When the collaboration and observability derived from the “one place, one thing” model of DOI metadata is removed, changes to records occur multiple times in many different places. Each change introduces the potential for discrepancies that have to then be reconciled. Since DOI metadata relies heavily on the accurate linking between authors, institutions, and other works, these individual discrepancies can quickly compound into aggregate views that are wildly divergent from their sources and each other.
This is, again, contrary to our building and maintenance of DOI infrastructure as our source of truth. We derive value from DOIs by having a persistent reference to an object, a description of that object, and by being able to perform some basic validation that the object exists. Reliance on service-level enrichment leads to a more unstable arrangement, where to either provide or discern a more complete description, we have to stitch together different views of an object in multiple services that have no corresponding guarantee of stability, provenance, or persistence. As a result, organizations can invest a great deal of time in service-level workflows, only to lose access to them when the services change or degrade, leaving all of their efforts non-transferable or lost.
A More Comprehensive Form of Research Information Can Be Achieved Through Diverse and Consensus-Based Descriptions
While it is important to acknowledge the complications that result from service-level enrichment, the history of this work has also shown that it is necessary to synthesize many forms of improvement to achieve complete and accurate descriptions. Users invest in these services because they can make changes that cannot be made at the source, and because it is simply not true that a depositor of DOI metadata always has the time, resources, or ability to produce a better form of it. Perhaps more importantly, the depositor cannot anticipate in advance what every user will require from their records. Instead, it is the diverse feedback from all users that captures their corresponding needs from this metadata, correcting for the gaps, errors, and biases that may be present when we rely on the depositor as the sole source of truth.
At the same time, to guarantee that records remain usable for all, we need to build consensus mechanisms that define how and when changes should be applied, as well as when they are correct and appropriate. Here, we could have an extensive discussion about these specificities, but the point should never be to anticipate every possible scenario in advance. Instead, we should determine what structures are needed to navigate these issues as a community, from their most basic to their most complex. There are countless examples we can draw from: ROR’s community curation model, the coopetition framework of the Generalist Repository Ecosystem Initiative (GREI), the rigorous analysis done by the Centre for Science and Technology Studies (CWTS), all of which demonstrate that collaborative, community-driven approaches are both effective and sustainable in guiding improvements to our sources of truth.
Empowering the Community to Validate and Improve Metadata
That records may be improved more frequently outside their sources than within them is unsurprising, given the many systems we know to be doing this work, but it suggests the need for a change in approach. Past, successful efforts to improve DOI metadata have focused on lobbying depositors to contribute better and more complete records. While such advocacy is still needed for aspects of records that only their authors can improve, the overall work should be refocused away from this model, given what we know can be accomplished through service-level improvements.
Specifically, we must allow for the same community enrichment of DOI metadata to occur at the source, meaning Crossref and DataCite, such that these records are maintained at a comparable level of quality and completeness. By doing so, we better reflect the existing reality where users are direct contributors to this metadata, further refining it from the baseline provided by depositors to be more comprehensive, correct, and aligned with their needs. Existing work being done at the service-level can then move upstream, and achieve the same visibility and collective stewardship that has been integral to the success of DOI infrastructure.
New Systems for Enrichment Should Be Open, Reproducible, Scalable, and Technically Sound
This visibility and stewardship requires open and reproducible enrichment processes. At the most basic level, openness and reproducibility are needed to validate both the quality and performance of any enrichment process. Without them, we have no way of accurately determining whether a given set of improvements meets the needs of the community or is useful to apply at scale. We likewise establish confidence in enrichment by allowing users to validate things like the representativeness of our benchmarks, the soundness of our designs, and the overall improvements that result from any work. This openness also allows users to immediately leverage and iterate upon any enrichment process, such that they can derive value from it separate from or in the absence of its implementation.
To succeed in this way, the work of enrichment must also be able to transition from any one system to another. Who has the resources, interest, and expertise to engage in these activities can and will shift over time. Openness and reproducibility ensure that we can adapt to these changes, transfer responsibilities, welcome new contributors, and accommodate attrition.
Enrichment Systems Require Shared Standards and Provenance Information
We know from past efforts that we need to bring together a diverse group of users and a disparate set of systems to improve DOI metadata. We likewise can gather from the success of DOI infrastructure and the enrichment found in service-level descriptions that this is an achievable outcome. However, to realize this aim also requires that community enrichment occur in consistent and actionable ways.
At a practical level, what this means is shared formats for describing enrichment that can be generated by any system and include provenance information linking the enrichment back to its source. Whether a user is submitting an individual correction or a matching process is updating millions of records, we should indicate the source of these actions so that they can be evaluated, approved, or reverted as needed. Enrichment must likewise be described in machine-actionable ways, meaning that if we establish consensus or thresholds for forms of improvements, these can be acted on automatically and occur at scale.
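To make this concrete, a hypothetical enrichment assertion with provenance might look like the sketch below. No such shared standard exists yet; every field name here is illustrative, and the DOI, field path, and service URL are placeholders.

```python
import json
from datetime import datetime, timezone

def make_enrichment_assertion(doi, field, value, source, method):
    """Build a hypothetical machine-actionable enrichment assertion.

    This format is illustrative only. The key idea is that every
    assertion carries provenance, so it can be evaluated, approved,
    or reverted independently of other changes to the record.
    """
    return {
        "target": {"doi": doi, "field": field},
        "assertion": {"operation": "add", "value": value},
        "provenance": {
            "source": source,   # who or what system produced this change
            "method": method,   # e.g. manual curation vs. automated matching
            "timestamp": datetime.now(timezone.utc).isoformat(),
        },
    }

# e.g. an automated matcher proposing a missing affiliation identifier
assertion = make_enrichment_assertion(
    doi="10.1234/example",                       # placeholder DOI
    field="creators[0].affiliation",
    value={"ror": "https://ror.org/03yrm5c26"},  # CDL's ROR ID
    source="https://example.org/matching-service",
    method="automated-affiliation-matching",
)
print(json.dumps(assertion, indent=2)[:60])
```

With a format like this, a consensus threshold (say, requiring review of all manual assertions but auto-applying high-confidence automated ones) could be enforced mechanically over a stream of assertions.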
This approach has firm precedents in our existing systems. Both Crossref’s and DataCite’s schemas have been refined through multiple iterations of planning and community feedback and are used in a constant stream of creation, reference, and updates to existing works. We can thus use them as models to rationalize enrichment within well-defined frameworks.
Moving Forward Together
Community enrichment of DOI metadata poses significant challenges, but not insurmountable ones. Our initial meeting in Los Angeles reaffirmed the community’s interest in tackling this together, just as we have done with other successful infrastructure initiatives. Through collaboration and use of our shared expertise, we can build a better, more connected system of research information. UC3 will be continuing these critical discussions, and we encourage you to stay engaged with us. If you have any additional questions or would like to contribute further to these conversations, please feel free to reach out to me at adam.buttrick@ucop.edu. We hope you will join us in this work!