
Exploring How AI Can Help Research Data Management

At UC3, several of our latest initiatives involve integrating AI tools, with a particular focus on improving metadata and assisting researchers with creating best practice DMPs.

A clear philosophy guides UC3’s approach to the use of generative AI: addressing researchers’ and the broader research community’s needs, keeping humans as the authority, complementing human work for scale and efficiency, and prioritizing open-source solutions where possible. 

Improving ROR Metadata

One key application of AI we are exploring is enhancing the quality and scale of our metadata curation activities, including those for the Research Organization Registry (ROR). ROR, a widely adopted persistent identifier service for research organizations, operates on a model where anyone can submit a request to add or update its records. This community-focused approach to curation has allowed ROR to grow rapidly by gathering diverse and valuable feedback from a global userbase. However, as one might expect with crowd-sourced data, it also has inherent complexities that require special attention to maintain consistency and quality. 

AI helps by taking these diverse user inputs and automatically transforming them into clean, structured, authoritative outputs in the ROR dataset. For adding records to the registry, this automation seamlessly handles data standardization, formatting, and enrichment tasks that would otherwise require specialized logic and manual intervention to achieve. For updates to the registry, AI can transform natural language descriptions of desired changes into structured modifications, described using ROR’s data model. These interventions have dramatically accelerated ROR’s request processing ability, enabling the service to now efficiently handle its growing request volume and process over 1,000 user-submitted requests per month. 
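To make the idea concrete, here is a minimal sketch of applying a structured change request to a registry record. The change-request format and field names here are hypothetical stand-ins, not ROR's actual curation schema; in practice the LLM would emit changes expressed in ROR's data model, and a human curator would review them before release.

```python
# Illustrative only: the change-request shape and field names are invented
# for this sketch; ROR's real curation workflow uses its own data model.

ALLOWED_FIELDS = {"name", "acronyms", "links", "types"}  # illustrative subset

def apply_change(record: dict, change: dict) -> dict:
    """Validate one structured change request and apply it to a copy of the record."""
    field, action, value = change["field"], change["action"], change["value"]
    if field not in ALLOWED_FIELDS:
        raise ValueError(f"unsupported field: {field}")
    updated = dict(record)
    if action == "replace":
        updated[field] = value
    elif action == "add":
        updated[field] = list(record.get(field, [])) + [value]
    else:
        raise ValueError(f"unsupported action: {action}")
    return updated

record = {"id": "https://ror.org/03yrm5c26", "name": "Example University", "acronyms": []}
change = {"field": "acronyms", "action": "add", "value": "EU"}
print(apply_change(record, change)["acronyms"])  # ['EU']
```

The point of the structure is that every AI-proposed edit is constrained to known fields and known actions, so a reviewer can inspect and approve it mechanically.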

Despite these advances, achieving 100% accuracy or completeness with these methods is neither possible nor desirable. Instead, we choose to pursue hybrid approaches that balance the efficiency and scalability of GenAI with the measured judgment and domain expertise that only human curators can provide. In doing so, we can embrace both innovation and authoritative oversight, allowing ROR to further grow in its position as a reliable, community-driven infrastructure, in service to the complex needs of the global research ecosystem. 

DMP Chef: Exploring AI-powered DMP Generation 

Another example of our AI exploration is “DMP Chef,” a large language model (LLM) based DMP generator. We are in the initial stages of this work, partnering with the California Medical Innovations Institute (CalMI2) to develop a new tool that allows researchers to provide simple descriptions of their work, from which the DMP Chef can generate a draft DMP. We are currently developing this tool to work with NIH DMPs and plan to extend it to NSF and other templates.

The current process involves asking researchers for a short description of their study and the types of data they plan to collect, then using a detailed prompt to have the LLM draft an initial DMP using NIH’s template for review. To test the initial quality, we used the NIH exemplar DMP, extracting the study design and data types from Element 1, and then feeding that information into the tool. We compared the generated output with the actual DMP section by section. Our next step is to recruit data librarians to review these generated DMPs for quality and comprehensiveness.
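As an illustration, the prompt for this kind of draft generation might be assembled along these lines. The element headings below paraphrase the structure of NIH's Data Management and Sharing plan, and the prompt wording is an assumption for the sketch, not DMP Chef's actual prompt.

```python
# Hypothetical prompt assembly for LLM-based DMP drafting. The element list
# paraphrases NIH's DMS plan structure; the wording is not DMP Chef's real prompt.

NIH_ELEMENTS = [
    "Data Type",
    "Related Tools, Software and Code",
    "Standards",
    "Data Preservation, Access, and Associated Timelines",
    "Access, Distribution, or Reuse Considerations",
    "Oversight of Data Management and Sharing",
]

def build_prompt(study_description: str, data_types: str) -> str:
    """Combine the researcher's short answers with the template outline."""
    sections = "\n".join(f"- {e}" for e in NIH_ELEMENTS)
    return (
        "Draft a data management and sharing plan following the NIH template.\n"
        f"Study description: {study_description}\n"
        f"Data to be collected: {data_types}\n"
        "Address each element in turn:\n" + sections
    )

print(build_prompt("Longitudinal imaging study", "MRI scans and survey responses"))
```

The draft returned by the LLM would then be compared section by section against the template, as described above.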

We’re seeing moderate initial success with off-the-shelf LLMs, including open-source models, and plan to continue refining quality by exploring options such as asking the researcher additional questions, generating sections separately, and feeding the LLM additional policy documents. Our goal is to help create an initial draft of a high-quality plan that researchers can then refine to their needs, suggesting best-practice repositories and standards based on their specific data.

Matching Related Works: Connecting Plans to Outputs

We’re also developing new tools to automatically connect DMPs to the research outputs they describe, such as datasets, articles, and software. These new connections improve the discoverability of research data and make it easier for researchers, funders, and administrators to see the complete picture of a project’s outputs. Our approach combines structured metadata from maDMPs with information from sources like DataCite, Crossref, OpenAlex, and the Make Data Count Citation Corpus. We use machine learning, incorporating embeddings generated by large language models and vector similarity search, to compare the text of a DMP’s title and abstract with the same descriptive fields in candidate datasets, rather than relying solely on author and funder metadata. A human reviewer then confirms the matches to ensure accuracy and reduce the manual reporting burden on researchers. You can read more about this feature at the DMP Tool Blog.
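The core of the matching step can be sketched as follows. A toy bag-of-words vector stands in for the LLM embeddings and vector index used in the real pipeline, but the ranking logic is the same: embed the DMP text, embed each candidate's descriptive text, and sort by cosine similarity.

```python
import math
from collections import Counter

# Toy stand-in for an LLM embedding; the real pipeline uses dense vectors
# from a language model plus a vector similarity index.
def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_candidates(dmp_text: str, candidates: dict) -> list:
    """Return (identifier, score) pairs sorted by similarity, best first."""
    q = embed(dmp_text)
    scored = [(doi, cosine(q, embed(text))) for doi, text in candidates.items()]
    return sorted(scored, key=lambda p: p[1], reverse=True)

candidates = {
    "10.0000/coral": "coral reef bleaching temperature dataset",
    "10.0000/survey": "household income survey microdata",
}
ranked = rank_candidates("effects of temperature on coral reef bleaching", candidates)
print(ranked[0][0])  # the coral dataset ranks first
```

In the production workflow the top-ranked candidates are then shown to a human reviewer for confirmation, as the post describes.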

UC3’s AI initiatives are focused on making research data easier to find, connect, and trust. By pairing AI-driven efficiencies with human expertise, we can accelerate workflows while maintaining the accuracy, transparency, and trust essential to research.

Why CDL Is Investing in COMET: A Community Centered Path to Richer Metadata

When the California Digital Library (CDL) signed the Barcelona Declaration in April 2025, it marked a deeper institutional commitment to building open and community-led research infrastructure. At the heart of this commitment is a recognition that metadata is not a passive byproduct of scholarship, but an active force that shapes how research is discovered, connected, cited, and reused. To build an ecosystem where metadata reflects the values of openness, equity, and trust, we must ensure that its stewardship is shared, inclusive, and sustainable.

This is why CDL’s University of California Curation Center (UC3) program is investing in COMET (Collaborative Metadata Enrichment Taskforce). COMET is both a vision and a framework for creating a healthier metadata ecosystem, where persistent identifiers are enriched and maintained through transparent, distributed workflows that engage the full research community. These principles of shared stewardship, transparency, and community engagement are the building blocks of the COMET model and the foundation of CDL’s participation therein.

How COMET Emerged and CDL’s Participation

COMET emerged from a shared realization across the scholarly infrastructure community: if we want metadata that is trustworthy, complete, and actionable, we need to design systems that allow more people to contribute to it and more institutions to shape its governance. This vision came into sharper focus during a series of workshops at FORCE2024 held in Los Angeles and the Barcelona Declaration Community Meeting held in Paris, where participants from across disciplines and sectors gathered to discuss new models for collaborative metadata curation. These sessions surfaced a common theme: metadata enrichment can’t be sustained by individual repositories or publishers alone. What’s needed is a coordinated, community-powered model that invites researchers, libraries, funders, and infrastructure providers to play an active role in improving the quality of metadata tied to persistent identifiers.

Out of these conversations, COMET was born. By early 2025, COMET had evolved into a formal FORCE11 Project and culminated in an open “Community Call to Action” that invited broad participation in shaping workflows, tools, and governance models for metadata enrichment.

CDL was an early and enthusiastic supporter because the vision aligned with our mission and we see an opportunity to help bring it to life. Our involvement isn’t passive. CDL’s UC3 program brings more than two decades of experience in digital curation, persistent identifier infrastructure, and open scholarly systems. We contribute governance know-how, technical insight from our work on initiatives like EZID, Crossref, ROR, and DataCite, and convening power across academic and infrastructure communities. We also see COMET as a proving ground: a space to pilot scalable, community-led metadata workflows that can extend across institutions, repositories, and disciplines.

For CDL, joining COMET is a continuation of our long-standing commitment to open, shared infrastructure and collective progress. It’s an investment in a future where metadata is openly enriched, transparently verified, and valued by the very communities who depend on it.

What Community Participation Means

When libraries and institutions like CDL engage with efforts like COMET, the benefits extend far beyond improved metadata. Our participation brings a deep commitment to equity, transparency, and public stewardship, values that help shape infrastructure for the public good. By contributing expertise in curation, governance, and metadata standards, libraries ensure that research information is more complete, discoverable, and reusable across repositories, researcher profiles, and campus systems.

Shared governance is a central feature of COMET’s approach, and institutional involvement helps ensure that decisions reflect the needs of a global, diverse, distributed community. When institutions engage in this work, they align their local priorities with broader efforts to create trustworthy, persistent, and openly governed metadata. This alignment reduces redundancy, increases impact, and builds capacity for meaningful contributions across the ecosystem.

But the benefits of this work aren’t just at the institutional level. For researchers and end users, the results are tangible: better discovery, clearer provenance, and richer metadata that supports citation, reuse, and reproducibility. And for funders, repositories, and service providers, this community-driven model offers a scalable alternative to siloed or proprietary solutions, one that emphasizes interoperability, transparency, and accountability.

That’s why we believe that COMET offers more than just a framework for metadata enrichment. It provides an opportunity for us to embody our mission-driven values and help build the connective infrastructure that research depends on. For CDL, supporting COMET is a way to double down on its long-standing commitment to open, community-led infrastructure. It’s about creating shared pathways to trust, equity, and impact where metadata isn’t hidden or locked down, but serves as the connective tissue for discovery and collaboration.

Webinar Series: Insights from the Machine Actionable Data Management Plans Pilot

Want to learn how technological advancements in data management plans can benefit research at your university? Have you heard the term “machine actionable” a lot but aren’t sure what it means or why it’s important? Are you looking for strategies to reduce the burden on researchers and administrators working on data management plans?

Join our free webinar series to learn from several US institutions that explored and piloted machine-actionable approaches to data management plans (DMPs).

Funded by the Institute of Museum and Library Services (award LG-254861-OLS-23), and led jointly by the California Digital Library (CDL) and the Association of Research Libraries (ARL), the Machine Actionable Plans (MAP) Pilot initiative enabled institutions to test and pilot data management plans that are machine-actionable and facilitate communication with other university research and IT systems. Each institution developed its own projects in alignment with its institutional mission, taking its specific challenges and opportunities into consideration. The DMP Tool team also worked with pilot partners to test features and advance technical developments to improve usability, best practice adoption, compliance, and efficiency.

In this series of webinars, we invite librarians, administrators, data managers, and IT & security staff to find out more about the motivations of these institutions to explore machine-actionable DMP integrations: what they did, how they did it, and what they learned. For those interested in the more technical aspects of integrations, some webinars will also provide detail on the DMP Tool’s API, along with more detailed implementation instructions and advice.

Webinar 1: Streamlining Research Support: Lessons from maDMP Pilots  

This webinar is for those looking to improve the efficiency, collaboration, and coordination of research support within their institutions. Learn from several institutions about their explorations of maDMP integrations to facilitate automated notifications for coordination across campus, and about how they used the pilot more broadly to facilitate discovery and collaboration within their institutions. This webinar will provide an overview of each institution’s activity, rather than detailed instructions about integrations.

Presenters include:  Katherine E. Koziar, Briana Wham, Matt Carson, Andrew Johnson

Register

Webinar 2: Creative Approaches for Seamless and Efficient Resource Allocation 

Don’t miss this webinar if you’re interested in new ways to enable efficient resource allocation. Institutions will share their experiences in leveraging maDMPs to develop integrations for automation systems that enable such allocations. This webinar will provide an overview of each institution’s activity, rather than detailed technical instructions about integrations.

Presenters include:  Katherine E. Koziar, Andrew Johnson

Register

Webinar 3: Five Technological Advancements in DMPs to Benefit Your Organization 

If you’re interested in emerging technologies within the pilot project and the DMP Tool and how they can help your institution expedite research sharing, compliance, and operational efficiency, this webinar will provide a strong introduction. We’ll also hear from pilot partners about promising AI developments related to reviewing DMPs, and will hear more detail on technical advancements coming to the DMP Tool based on feedback from the pilot. 

Presenters include:  Jim Taylor, Becky Grady

Register

Webinar 4: How to Implement Machine-Actionable DMPs at Your Institution

If you want to find out more about specific integrations and how to implement maDMPs, this webinar is for you. Hear from the DMP Tool team about the API, common challenges and how to overcome them, and actionable recommendations for campus buy-in.

Presenters include:  Becky Grady, Brian Riley

Register

Working Toward a Common Standard API for Machine-Actionable DMPs

DMP Tool and the Research Data Alliance

Our work at DMP Tool has been shaped from the ground up through collaborations at the Research Data Alliance (RDA). From the earliest conversations about machine-actionable Data Management Plans (maDMPs) to the creation of the DMP common standard and the DMP ID, the RDA has served as the convening space where we’ve found shared purpose, co-developed solutions, and built lasting partnerships with peers across the globe. That same spirit is captured in the Salzburg Manifesto on Active DMPs, which outlines a vision for DMPs as living, integrated components of the research lifecycle. That vision continues today, as we are helping launch a new initiative at RDA to update a common API standard for DMP service providers. This effort will help ensure our systems can connect more seamlessly and serve the broader research ecosystem more effectively. This post gives some context on why this new effort is needed, what we’ve done so far for it, and what we have coming next.

DMP Tool implementation of the RDA common standard

The DMP Tool team was an early advocate of maDMPs and saw the potential value of capturing structured information during the creation of a DMP. The goal is to use as many persistent identifiers (PIDs) as possible to help facilitate integrations with external systems. To gather this data, we introduced new fields into the DMP Tool to capture detailed information about project contributors (ORCIDs, RORs, and CRediT roles) as well as the repositories (re3data), metadata standards (RDA metadata standards), and licenses (SPDX) that would be used when creating a project’s research outputs. These new data points are captured alongside the traditional DMP narrative. We also started allowing researchers to publish their DMPs. This process generates a DMP ID, a DOI customized to capture and deliver DMP-focused metadata. This approach allows the DMP to be discoverable in knowledge graphs like DataCite Commons. Once the DOI is registered, the DMP Tool provides a landing page for the DOI.

Screenshot of the DMP Tool showing how to register your plan for a DMP ID

One of the main points of collecting all of this structured metadata is to facilitate integrations with other systems. To make that possible, we introduced a new version of the API that outputs the DMP metadata in the common standard developed with RDA. Our first integration was with the RSpace electronic lab notebook system. When a researcher is working in RSpace, they are able to connect RSpace with the DMP Tool to fetch their DMPs in PDF format and store the document alongside their other research outputs. Once connected, RSpace is able to send the DMP Tool the DOIs of any research outputs that the researcher deposits in repositories like Dataverse or Zenodo. These DOIs are then available as part of the DMP’s structured metadata.
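For an integrator, consuming such a document is straightforward. The sketch below pulls the identifiers of linked research outputs out of a maDMP JSON document; the field names follow the RDA DMP common standard, while the sample document itself is invented for illustration.

```python
import json

# Sample maDMP document: field names follow the RDA DMP common standard
# ("dmp", "dmp_id", "dataset", "dataset_id"); the values are invented.
sample = json.loads("""
{
  "dmp": {
    "title": "Coral bleaching study DMP",
    "dmp_id": {"identifier": "https://doi.org/10.0000/dmp.example", "type": "doi"},
    "dataset": [
      {"title": "Reef temperature logs",
       "dataset_id": {"identifier": "https://doi.org/10.5281/zenodo.000000", "type": "doi"}}
    ]
  }
}
""")

def related_output_ids(madmp: dict) -> list:
    """Collect the identifiers of research outputs linked to this DMP."""
    dmp = madmp["dmp"]
    return [d["dataset_id"]["identifier"] for d in dmp.get("dataset", [])]

print(related_output_ids(sample))
```

Because every conforming tool exposes the same structure, an integration like RSpace's can read any provider's DMPs with the same parsing code.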

Moving the Standard Forward 

The original RDA DMP common standard was released 3 years ago. Since that time, systems like the DMP Tool have found areas where we need to deviate from the base standard. This is a normal process when any standard is developed and first put into use. We have discovered key fields that should be added to the standard (e.g., contributor affiliation information) and areas that don’t really make sense to capture within the DMP itself (e.g., the PID systems a particular repository supports). 

Other DMP systems have also been implementing the common standard and making it available via API calls, but this was done without conformity as to how an external system can access those APIs. This results in systems like RSpace needing to develop and maintain separate integrations for each tool. Over time, this extra work leads to fewer integrations between systems, making each more siloed.

RDA is made up of Interest Groups and Working Groups in which members from across the world come together to work on a common topic, producing guidelines, best practices, tools, standards, and other resources for the wider community. To tackle this use case and address shared issues, our RDA group decided to release a new version of the common standard, v1.2, and to form a new working group to develop API standards that each tool should support. Members of the DMP community gathered at the end of March to discuss both topics. The DMP systems represented at the meeting included Argos, DAMAP, Data Stewardship Wizard, DMPonline, DMP OPIDoR, DMP Tool, DMPTuuli, and ROHub.

Our DMP Tool team attended the meeting to make sure that the needs of our funders, researchers, and institutions were properly represented. The meeting was split into two parts, one devoted to each topic.

Meeting attendees, representing a variety of DMP service providers, worked together on the common standard

Next steps

The original common metadata standard working group plans to incorporate the proposed non-breaking changes this summer as release v1.2. We have also committed to keep the conversation going about future enhancements as we work towards v2.

Meanwhile, the new RDA working group also hopes to release an official API specification this summer. The individual tools would then be tasked with ensuring that their systems support the new API endpoints. For our part, the DMP Tool will ensure that our new website supports this API standard when it launches, as well as additional endpoints specific to our application. The goal is that integrator services like RSpace will then be able to connect more easily with any DMP service, making connections across the research system more robust.

Anyone can review the proposed work statement for the new DMP Common API for maDMPs working group. We would value your input, and if you’re interested in joining the group and contributing to the API specification, you can join RDA (it’s free!) and join our Working Group.

UC3 New Year Series: Data Publishing at CDL in 2025 

Structured, well-documented, and FAIR-aligned data is the foundation of effective research dissemination. However, data publishing activities have often focused on the last step in the research process, putting energy into helping researchers clean up disorganized datasets and place them in repositories. While this is essential to ensuring the accessibility and preservation of important data outputs, it is also important to connect the dots and address the underlying issues that lead to poor data quality in the first place. Our previous development work and continuing membership with Dryad are great examples of this commitment to supporting well-formatted deposits. However, it has also always been the strategy of the UC3 data publishing team to invest in people through training, comprehensive documentation, institutional support and policies, and innovative tools. Our goal is to connect those dots and help empower the research community with the skills and knowledge to create high-quality, well-structured data from the outset.

In 2025, we aim to create a more open, transparent, and sustainable data-sharing future by combining emerging technologies with structured training programs. This dual approach improves data deposit quality and empowers researchers to contribute to a more efficient data-publishing ecosystem.

AI Tools for Data Publishing

Conversations with repository managers often highlight recurring challenges: incomplete documentation, missing README files, or data files that don’t match metadata standards. Automated “nudges” can catch such issues at the point of deposit. The vision is for AI-based systems to serve as virtual coaches that flag inconsistencies and as active collaborators capable of implementing necessary changes where appropriate. These tools will be able to modify metadata directly, generate appropriate README files, and restructure dataset instructions when needed—transforming how researchers prepare and deposit their data.
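Some of these nudges can be expressed as simple rules even before any AI is involved, which is a useful baseline for what the AI-assisted tools should catch. The checks and metadata field names below are assumptions for illustration, not an actual repository schema.

```python
# Illustrative rule-based deposit checks; field names ("description",
# "file_list") are hypothetical, not a real repository's metadata schema.

def deposit_nudges(files: list, metadata: dict) -> list:
    """Return human-readable warnings for a deposit, or an empty list if clean."""
    issues = []
    names = {f.lower() for f in files}
    if not any(n.startswith("readme") for n in names):
        issues.append("No README file found; consider adding one.")
    if not metadata.get("description", "").strip():
        issues.append("Metadata description is empty.")
    declared = set(metadata.get("file_list", []))
    missing = declared - set(files)
    if missing:
        issues.append(f"Files listed in metadata but not deposited: {sorted(missing)}")
    return issues

print(deposit_nudges(
    ["data.csv"],
    {"description": "", "file_list": ["data.csv", "codebook.pdf"]},
))
```

An AI-based coach would go further, drafting the missing README or proposing metadata fixes rather than only flagging them.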

One promising component of our 2025 strategy involves the development of AI-assisted curation tools, which we’ve begun exploring to provide researchers with real-time feedback on their data deposits. This approach leverages artificial intelligence to identify potential metadata, documentation, and formatting issues before submission. We won’t cover this topic in detail here; if you’re interested in our AI curation initiatives, please see our previous article, where we discuss them thoroughly.

In this post, we highlight CDL’s collaboration with The Carpentries as a key component of the UC3 data publishing strategy for 2025, emphasizing the human side of our approach: training. The Carpentries teaches foundational coding and data science skills to researchers worldwide, and through our partnership, we directly address the skills gap in metadata, documentation, and data formatting.

Training Translates to Broader Impact

Good data practices are part of many successful interdisciplinary collaborations. For example, after the Deepwater Horizon spill in the Gulf of Mexico, researchers in fields as varied as biology, oceanography, engineering, and socioeconomics relied on consistent metadata standards to share thousands of datasets seamlessly. That synergy is best achieved when data management principles are embedded long before a crisis or urgent need arises. Planting the seeds of data literacy in labs and classrooms allows institutions to sidestep the friction and duplicative efforts that often accompany cross-institutional projects.

Robust training programs also help teams stay nimble when policies shift. As mandates continue to change—whether through federal agencies or international collaborations—researchers grounded in best practices can adapt quickly, avoiding costly do-overs. In this sense, the cost-effectiveness of up-front training becomes an investment in a more flexible, forward-looking data ecosystem.

Why Training Makes Data Publishing Easier (and Less Costly)

Early exposure to best practices in data management often prevents unnecessary clean-up and steep learning curves later on in a researcher’s career. This observation was echoed in discussions at a recent Earth Science Information Partners (ESIP) meeting, where a central theme was the value of weaving data skills into formal coursework—rather than treating them as optional add-ons for already overworked researchers. Students who learn these concepts in undergraduate or graduate courses, sometimes through a single assignment requiring a formal data management plan, become more adept at producing coherent, reusable datasets.

In many cases, the hands-on philosophy developed by The Carpentries aligns with such classroom activities. Whether using version control for a small-scale project or learning to structure metadata for a mock submission to a repository, these experiences reduce the likelihood of encountering major data-quality issues down the line. Once researchers join labs and undertake funded projects, they have the knowledge required to meet evolving mandates without resorting to frantic, last-minute adjustments.

The Carpentries and CDL: A Long-Standing Partnership

For over a decade, the CDL has worked with The Carpentries to refine curricula on coding, documentation, and data management best practices. A 2017 grant from the Institute of Museum and Library Services (IMLS) helped expand “Library Carpentry,” enabling librarians to participate actively in curation. Last year, we received another IMLS award to help The Carpentries scale their operations and curriculum. Over the years, UC3 staff have been closely involved in shaping these workshops, hosting sessions, and serving on governance councils to promote a broader culture of responsible data stewardship.

One of the main strengths of The Carpentries’ model is its train-the-trainer approach: certifying volunteer instructors within organizations makes it possible to seed new workshops across campuses and disciplines. This approach has found synergy with our participation in the Generalist Repository Ecosystem Initiative (GREI), a collaborative effort bringing together seven major generalist repositories, including Zenodo, Dryad, Vivli, Center for Open Science, and Dataverse. Through GREI, we’re expanding the reach and impact of data publishing best practices across diverse repository infrastructures.

Under the auspices of the GREI project, we’re working with selected Carpentries modules to address specific data publishing challenges across multiple repository environments. In 2025, we’ll pilot these modified modules in workshops to gain practical teaching experience with this GREI-relevant curriculum. This field testing will provide valuable instructor and participant feedback, allowing us to refine the content and delivery methods. This iterative approach ensures that these modules will ultimately integrate seamlessly into the broader Carpentries curriculum, creating sustainable resources that address the complexities of modern data publishing.

Moving Forward in 2025

High-quality data deposits rarely emerge by accident; they require intentional investment in training, documentation, institutional support, and tools. At UC3, we take a holistic approach, recognizing that creating better datasets goes beyond technical solutions: it demands strategic investments across the entire research data lifecycle.

By strengthening training programs, refining repository workflows, and making learning resources widely accessible, we help researchers at all levels produce well-structured, reusable data. Our ongoing collaborations with The Carpentries and GREI ensure that best practices continue to evolve alongside the research community’s needs. With these efforts, “deposit-ready data” can become the standard rather than the exception, reducing inefficiencies and accelerating scientific discovery. As we move through 2025 and beyond, our focus remains clear: building a sustainable, scalable, and human-centered data publishing ecosystem that empowers researchers and institutions alike.

UC3 New Year Series: Persistent Identifiers at CDL in 2025

CDL’s persistent identifier portfolio, which includes the Research Organization Registry (ROR), EZID, Name-to-Thing (N2T), and the new Collaborative Metadata Enrichment Taskforce (COMET) initiative, had a busy and productive year in 2024, seeing record adoption and interest and making significant technical improvements across its services. These complementary work streams help build a more rational, networked, and efficient research ecosystem – one where persistent identifiers allow for seamless connections between researchers, institutions, and scholarly outputs, reducing redundant efforts and unnecessary costs, and making everyone’s work more visible and impactful. As we move into 2025, I’m excited to bring you a look ahead at what we have planned for this year.

ROR

ROR  is a global, community-led registry of open persistent identifiers for research and funding organizations, operated as a collaborative initiative by the California Digital Library, Crossref, and DataCite. As a trusted, free, and openly available service, ROR has become the standard for organizational identification in the scholarly communications ecosystem. The story of ROR in 2025 will be seizing the opportunities provided by this widespread adoption with better performance, improved services, and higher-quality data.  

This work will begin with a Q1 launch of a new and improved version of our affiliation matching service, which has been battle-tested in OpenAlex and used to make millions of new connections between authors, works, and institutions. From here, we will further improve our API’s performance by implementing response caching for repeat requests, speeding up response times and reducing overall resource usage. Once this is complete, we will round things out by implementing a client identification system, allowing ROR to better manage its traffic, while also keeping our API services available at the same generous level of public access.
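As a rough illustration of the client side of this work, the sketch below builds a query for ROR's affiliation matching endpoint and memoizes repeat requests, mirroring the kind of response caching described above. The HTTP call itself is stubbed out so the example runs offline; a real client would issue a GET to the URL built here, and the endpoint path shown is ROR's v1 affiliation route.

```python
from functools import lru_cache
from urllib.parse import urlencode

# ROR's v1 API exposes affiliation matching via the "affiliation" parameter.
API_BASE = "https://api.ror.org/organizations"

def affiliation_url(affiliation: str) -> str:
    """Build the affiliation-matching query URL for a free-text string."""
    return f"{API_BASE}?{urlencode({'affiliation': affiliation})}"

@lru_cache(maxsize=1024)
def match_affiliation(affiliation: str) -> str:
    # Stub: return the request URL instead of performing the HTTP GET.
    # A real client would fetch this URL and parse the ranked matches.
    return affiliation_url(affiliation)

url = match_affiliation("University of California Office of the President")
match_affiliation("University of California Office of the President")  # repeat hit served from cache
print(url)
```

The `lru_cache` decorator here plays the role of the planned server-side response cache: identical repeat queries are answered without redoing the work.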

Concurrent with this technical work, ROR will pursue a number of improvements to the quality and completeness of its data, building on work done in 2024, as well as embracing new and emerging opportunities. This will include many regional improvement efforts, with work already underway in Portugal and Japan; better representation of publishers, society organizations, and funders; and the addition of new external identifiers and improved domain field coverage in support of all of these efforts. ROR will also continue to refine its curation processes to meet the growing needs of its community. In 2024, ROR processed over 8,000 curation requests—a 44% increase from 2023—with trends indicating that we should expect to receive 1,000 requests per month by year’s end. Our goal is to continue publishing the same high-quality data on our monthly release schedule, even in the face of this increased demand!

EZID and N2T

EZID and N2T are complementary persistent identifier services that enable reliable, long-term access to research outputs. EZID provides identifier creation and management services focused on ARKs and DOIs, while N2T serves as a global resolver that ensures identifiers remain reliable and actionable over time. 

In 2024, EZID was sharply focused on improving its reliability and performance. This included moving to OpenSearch to power our search functionality and rewriting many of our background jobs and database queries to increase their speed and efficiency. This work has resulted in a more stable and high-performing service, capable of handling the large increase in traffic that resulted from EZID assuming resolution functionality for its own ARK identifiers. Building from this foundation, in 2025, we will continue to optimize our underlying systems, while also adding support for DataCite schema v4.6 and improving EZID’s user interface. Our UI updates will be focused on improving the application’s accessibility, such that all users can effectively manage their persistent identifiers. These coordinated improvements will help ensure that EZID remains a dependable and inclusive platform for persistent identifier management.

Alongside EZID’s improvements, we completed another major milestone in 2024: rebuilding N2T as a modern Python service. With the new flexibility and performance this provides, our 2025 plans include rolling out additional service enhancements for N2T that will better support ARK curation workflows, adding public-facing resolution statistics, and fully deprecating the legacy instance of this application. This work will continue to strengthen N2T’s role as essential infrastructure for all the great identifier usage that occurs outside and in parallel to the DOI ecosystem.

COMET

The COMET initiative, launched in November of 2024, seeks to address a critical problem in DOI metadata management. Currently, only record owners can update DOI metadata, even when others have improvements to contribute. This leads organizations to maintain their own enhanced versions of the same records in separate systems, resulting in duplicated effort and inconsistent representations of research outputs, both individually and in aggregate. COMET’s solution is to create an open framework that allows the community to contribute validated metadata improvements directly to DOI records, unlocking tremendous new value and efficiencies at the sources of this metadata.

To date, COMET has brought together experts from publishers, libraries, funding organizations, and infrastructure services in a series of listening sessions focused on the vision, product definition, and governance model for a service that would realize its goals. In March 2025, these efforts will culminate in a community call-to-action, soliciting partnerships, funding, and other resources to help build this service. Subscribe to COMET’s email list to receive up-to-date news or follow its LinkedIn page for updates.

As I hope this all conveys, there has never been a better, more energizing time to help build and participate in the persistent identifier ecosystem. Here’s to 2025 and all the exciting work ahead for UC3 and the scholarly communications community!

UC3 New Year Series: The Merritt Digital Preservation Repository in 2025

At UC3, we’re dedicated to advancing the fields of digital curation, digital preservation, and open data practices. Over the years, we’ve built and supported a range of services and actively led and collaborated on initiatives to open scholarship. With this in mind, we’re kicking off a new series of blog posts to highlight our core areas of work and where we’re heading in 2025.

With the close of 2024 and beginning of 2025, the Merritt team is preparing for a new year of exciting projects and engagements with libraries and organizations across the UC system. We wrapped up 2024 by fully revising our Ingest queueing system to allow for more granular control over submissions processing, an effort which laid the groundwork for upcoming features, many of which we’ll discuss here. These include potential user-facing metrics, submission status, and a new, easier-to-use manifest format for batch submissions.

Meanwhile, the team is always hard at work making the repository’s existing functionality more robust and future-proof. We’ll also review some of these efforts and how they both provide reassurance to collection owners and improve users’ overall experience with the system.

Submission Status and Collection Reports

Implementing Merritt’s revamped queueing system was the most significant change to the way the repository ingests content that’s taken place since its inception. The new system establishes and allows for visibility into many more stages of the ingest process. 

Every individual job now moves through these phases:

Merritt job ingest stages

Internally, the Merritt team can monitor these steps through the repository’s administrative layer and intervene as necessary. For example, if one or more jobs in a batch fails, our team can assist by restarting a job or investigating what happened. Once a job is processing again, we can now also send additional notifications regarding the batch it belongs to.

While it’s possible for the team to view when a job or batch of jobs reaches any stage, Merritt currently does not provide this same insight to its users. We’ve already obtained preliminary feedback that such visibility would be helpful and are planning to gather additional input via an upcoming depositor tools survey regarding which stages are most useful and how to present them. In this vein, the survey will seek to capture information pertaining to your workflows for submitting content to the repository, and what new information and tools would make these more efficient for you. Our goal will be to identify trends in responses and focus on a key set of features to optimize Merritt’s usability.

Alongside visibility into a submission’s status, we would also like to provide additional information about the corresponding collection, post-submission. On every collection home page, we display specific metrics about the number of files, objects and object versions in the collection as well as storage space used. Given our recent work on object analysis and data visualization, we plan to surface additional data on the composition of a collection in its entirety. Our current thinking is to bubble up the MIME types of files, a date range that takes into account the oldest and newest objects in the collection, as well as the most recently updated object. A downloadable report with this data may also be an option, directly from the collection home. Again, stay tuned as we’ll be looking for your input on what’s most important to you in this case.

New Manifest Format

There are any number of ways to ingest content into Merritt – from single submissions through its UI, to API-based submissions, to the use of several types of manifests that tell the system from where it should download a depositor’s content for processing.

In essence, a manifest is a list of file locations on the internet saved to a simple but proprietary format. Depending on the type of manifest, it may or may not contain object metadata. Although Merritt’s current range of manifest options provides a great deal of flexibility for structuring submissions, it can present a steep learning curve if a user’s goal is to deposit large batches of complex objects.

To help make working with manifests a more intuitive process, we’ve drafted a new manifest schema that records objects in YAML. Unlike current manifests, this single schema can be adapted to support definition of any object or batch of objects. For example, it allows for defining a batch of objects, where each object can have its own list of files and object metadata, all in one .yml file. Currently, this approach requires use of multiple manifests – one that lists a series of manifest files, and the other manifests to which it refers (a.k.a. a “manifest-of-manifests”). Each of these latter files records an object and its individual metadata. The new YAML schema should allow for more efficient definition of multiple objects in a single manifest, while also being more intuitive and easier to approach.
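
To make the idea concrete, here is a purely hypothetical example of what a batch manifest in the new format might look like; the field names below are our own invention for illustration, not the actual draft schema:

```yaml
# Hypothetical batch manifest. Field names are illustrative only,
# not the actual draft schema.
batch:
  submitter: example_depositor
  objects:
    - metadata:
        title: "Field Notes, 1972"
        creator: "Doe, Jane"
      files:
        - url: https://example.org/staging/notes-1972.pdf
    - metadata:
        title: "Oral History Interview 14"
        creator: "Roe, Richard"
      files:
        - url: https://example.org/staging/interview-14.wav
        - url: https://example.org/staging/interview-14-transcript.txt
```

Here a single file defines a batch of two objects, each with its own metadata and file list; under the current system this would require a manifest-of-manifests plus one additional manifest per object.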

Available Services

Object Analysis

The Merritt team has endeavored to provide collection owners the means to obtain more insight into the composition of the content being preserved in the repository. More specifically, we’ve leveraged Amazon OpenSearch to record and categorize characteristics of both the file hierarchy in objects as well as the metadata associated with them. The Object Analysis process is one by which we gather this data from the objects in a specific collection upon request, and then surface the results of subsequent categorization and tests in an OpenSearch dashboard for review. We then walk through the dashboard with the collection owner and library staff to explore insights. For more information, have a look at the Object Analysis Reference or our presentation at the Library of Congress, and let us know if you would like to explore your collections! Our goal is to provide you with the critical information needed for taking preservation actions that promote the longevity of your content.
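
Categorizations like these map naturally onto OpenSearch aggregations. For example, counting files by MIME type and finding a collection’s date range might look like the following sketch; the index and field names here are assumptions for illustration, not Merritt’s actual mappings:

```python
import json

# Hypothetical aggregation request: count files per MIME type and find
# the date range of objects in one collection. The field names are
# illustrative, not Merritt's actual OpenSearch mappings.
query = {
    "size": 0,  # we only want aggregates, not individual documents
    "query": {"term": {"collection_id": "example-collection"}},
    "aggs": {
        "mime_types": {"terms": {"field": "mime_type", "size": 20}},
        "oldest_object": {"min": {"field": "created"}},
        "newest_object": {"max": {"field": "created"}},
    },
}
print(json.dumps(query, indent=2))
```

A request of this shape returns the per-type counts and date bounds in one round trip, which is what makes dashboard-style roll-ups cheap to compute over an entire collection.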

Ingest Workspace

Last fall we brainstormed on how we could more effectively help library partners ingest their collections into Merritt while also minimizing the overhead associated with staging content. For example, if a collection owner’s digitization project results in files saved to hard drives, staging those files shouldn’t necessarily require that they have on-premises storage and local IT staff who can administer it while staging and validation actions occur. Through use of a workspace our team has designed, it’s possible to assist with copying content to S3 and leverage a custom process (implemented via AWS Lambda) to automate the generation of submission manifests.
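
That manifest-generation step can be sketched as a pure function that a Lambda handler would call after listing the staged keys in S3. Both the function and the output line format below are our own illustration, not Merritt’s actual process or manifest syntax:

```python
def build_manifest(bucket, keys):
    """Turn a list of staged S3 keys into manifest lines pointing at
    their URLs. The line format here is illustrative only; Merritt's
    real manifest syntax is more involved."""
    lines = []
    for key in sorted(keys):
        url = f"https://{bucket}.s3.amazonaws.com/{key}"
        filename = key.rsplit("/", 1)[-1]
        lines.append(f"{url} | {filename}")
    return "\n".join(lines)

# In an actual AWS Lambda, the handler would list the staged objects
# and pass their keys here; we simulate that input instead.
manifest = build_manifest(
    "example-staging",
    ["project/scan-002.tif", "project/scan-001.tif"],
)
print(manifest)
```

Keeping the formatting logic pure like this makes it easy to test locally, independent of any cloud wiring.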

If you’re spinning up a digitization project in the near future which entails a digital preservation element, let us know!

What We’re Working On

As mentioned, we’re always working to make Merritt more robust, secure and easy to maintain for the long run. There are many more granular reasons for undertaking this work, but this year we’re keen to buckle down on our disaster recovery plans, ensure we’re making use of the latest AWS SDK version to interact with cloud storage at multiple service providers and also continuing to set up the building blocks for autoscaling Merritt’s microservices.

Disaster Recovery Plans

It goes without saying that any digital preservation service and its accompanying policies are inherently geared to prevent data loss. However, when truly unexpected events happen, the best thing we can do is strategize for the worst. In our case, because the content we steward lives entirely in the cloud, we need to be prepared for any of those services to become suddenly unavailable. Discussion in the larger community surrounding digital preservation in the cloud has continued, especially as organizations consider moving forward without a local, on-prem copy of their collections. Given that Merritt already operates in this manner, it should be able to pivot and recognize either of its two online copies as its primary copy, while continuing to run fixity checks and replicate data to new storage services. Note that the system’s third copy in Glacier is not a primary copy candidate due to the excessive costs associated with pulling data out of Glacier for replication. We already have the means to back up critical components such as Merritt’s databases and to recreate microservice hosts in a new geographic region, as well as a well-known approach to redefining collection configuration (i.e., primary vs. secondary object copies). What we still need is a more efficient implementation for performing that reconfiguration on the fly. We intend to work on this storage management implementation in late 2025, detailing its operation not only as part of a disaster recovery plan, but also in our upcoming CoreTrustSeal renewal for 2026.

Talking to the Cloud

Although it doesn’t happen often (rewind to 2010 when the AWS SDK for Java 1.0 was introduced), major version updates to Amazon’s SDK that enables programmatic communication with S3 are inevitable. That’s why we’ve already spent considerable time testing use of v2 of the SDK for Java across all three of our cloud storage providers. So we’ll be ready Amazon ends support for v1 in December of 2025.

Autoscaling for Efficiency

As you may have gathered from some of our regular CDL INFO newsletter posts, a long-term goal for our team is to enable certain Merritt microservices to autoscale. In other words, the number and size of submissions to the repository are constantly in flux. Most days we see hundreds of small metadata updates processed for, e.g., eScholarship publications being preserved in the system. On other days, depositors may submit upwards of a terabyte apiece of new content while the service concurrently processes direct deposits whose updates stem from collections in Nuxeo. Given this variability, the environmental impact of excessive compute resources in the cloud, and compute and storage costs, Merritt should only be running as many servers as it needs at any given time. In particular, because we run every microservice in a high-availability fashion, we should be able to increase the amount of compute on days of heavy submissions, and, more critically, reduce the number of hosts when minimal content is being ingested. Revamping our queueing process was a major requirement for autoscaling. With this complete, we (and all UC3 teams) now need to migrate into our own AWS account to further refine control over repository infrastructure. This next step will move us ever closer to fulfilling our goal to autoscale. We’ll continue to share more information on this in our regular posts going forward, so keep an eye out for the latest news!

In closing, we’re excited to work with our library partners throughout the coming year on both potential new features and by collaborating on new projects via recently introduced tools. Reach out to us anytime!

UC3 New Year Series: Looking Ahead through 2025 for the DMP Tool

At UC3, we’re dedicated to advancing the fields of digital curation, digital preservation, and open data practices. Over the years, we’ve built and supported a range of services and actively led and collaborated on initiatives to open scholarship. With this in mind, we’re kicking off a new series of blog posts to highlight our core areas of work and where we’re heading in 2025.

We’re gearing up for a big year over at the DMP Tool!  Thousands of researchers and universities across the world use the DMP Tool to create data management plans (DMPs) and keep up with funder requirements and best practices.  As we kick off 2025, we wanted to share some of our major focus areas to improve the application and introduce powerful new capabilities.  We always want to be responsive to evolving community needs and policies, so these plans could change if needed.

The DMP Tool in 2025

Our primary goal for the year is to launch the rebuild of the DMP Tool application.  You can read more detail about this work in this blog post, but it will include the current functionality of the tool plus much more, still in a free, easy-to-use website.  The plan is still to release this by the end of 2025, likely in the later months (no exact date yet).  We’re making good progress towards a usable prototype of core functionality, like creating an account and making a template with basic question types.

Another common request is to offer more functionality within our API.  For example, people can already read registered DMPs through the API, but many librarians want to be able to access draft DMPs to integrate a feedback flow into their own university systems.  As part of our rebuild, we are moving to a system that uses the same API on the website as the one available to external partners (GraphQL, for those interested).  This will allow almost any functionality on the website to be available through the API.  It should be released at the same time as the new tool, with documentation and training to come. Get your integration ideas ready!
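
As a purely hypothetical illustration of what such a GraphQL call might look like, here is a sketch of a draft-plan query for librarian review; the operation and field names are our guesses for illustration, not the actual DMP Tool schema:

```graphql
# Hypothetical query: fetch an institution's draft DMPs so librarians
# can review them. Names are illustrative, not the actual schema.
query DraftPlans($affiliation: String!) {
  plans(status: DRAFT, affiliation: $affiliation) {
    id
    title
    modified
    contributors {
      name
      role
    }
  }
}
```

Because the website and external partners would share one API, a review workflow built this way would see the same data the tool itself uses.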

Finally, we are continuing to work on our related works matching, tracking down published outputs and connecting them to a registered DMP.  This is part of an overall effort to make DMPs more valuable throughout the lifecycle of a project, not just at the grant submission stage, and to reduce burden on researchers, librarians, and funders to find related works.  It’s too early to tell when this will be released publicly on the website, but likely will come some time after the rebuild launch.

AI Exploration

While most of our focus will be on the above projects, we are in the early stages of exploring topics for future development of the DMP Tool.  One big area is in the use of generative AI to assist in reviewing or writing data management plans.  We’ve heard interest from both researchers and librarians in using AI to help construct plans.  People sometimes write their DMP the night before a grant is due and request feedback without enough time for librarians to provide it.  AI could help review these plans, if trained on relevant policy, to give immediate feedback when there’s not enough time for human review.

We’re also interested in exploring the possibility of an AI assistant to help write a DMP.  We know many people are more comfortable answering a series of multiple choice questions than they are in crafting a narrative, and it’s possible we could help turn that structured data into the narrative format that funders require, making it easier for researchers to write a plan and keeping the structured data for machine actionability. Another option is an AI chatbot within the tool that can help provide our best practice guidance in a more interactive format.  It will be important for us to balance taking some of the writing burden off of researchers while making sure that they are still the one responsible for the content within it.

These ideas are in their early phases – something we’ll be exploring but not releasing this year – but we’re excited about their potential to make DMPs easier to write.

Community Engagement

While it may seem we’ll be heads down working on these big projects, we want to make sure we’re communicating and participating in the wider community more than ever.  As we get towards a workable prototype of the new tool, we’ll be running more user research sessions.  The initial sessions, reviewed here, offered a lot of valuable insight that shaped the current designs, and we know once people get their hands on the new tool they’ll have more feedback.  If you haven’t already, sign up here to be on the list for future invites. 

We also want to be more transparent with the community about our operations and goals.  We’ve started putting together documents within our team about our Mission and Vision for the DMP Tool, which we’ll be sharing with everyone shortly.  Over 2025, we want to continue producing artifacts like these that we can share regularly, so that you all know what our priorities are.  One goal is to create a living will, as recommended by the Principles of Open Scholarly Infrastructure, outlining how we’d handle a potential wind-down of CDL’s management of the DMP Tool.  This is a sensitive area because we have no plans to wind down the tool, and we don’t want to give the impression that we will!  But it’s important for trust and transparency that we have a plan in place if things change, as we know people care about the tool and their data within it.

Finally, we’ll be wrapping up our pilot project with ARL this year, where we had 10 institutions pilot implementation of machine-actionable DMPs at their university.  We’ve seen prototypes and mockups for integrations related to resource allocation, interdepartmental communication, security policies, AI-review, and so much more. We’ve brought on Clare Dean to help us create resources and toolkits, disseminate the findings, and host a series of webinars about what we’ve learned to help others implement at their own universities.  We’ll be presenting talks on the DMP Tool at IDCC25 in February, RDAP in March, and we plan to submit for other conferences throughout the year, including IDW/RDA, to share what we’ve learned with others. We hope to continue working with DMP-related groups in RDA to ensure our work is compatible with others in the space, and we’re following best practices for API development.

We hope you’re as excited for these projects as we are!  We’re a small team but we work with many amazing partners that help us achieve ambitious goals.  Keep an eye on this space for more to come.

Proposed revisions to the Principles of Open Scholarly Infrastructure (POSI)

Sustainable, community-driven infrastructure is essential for advancing open scholarship. That’s why UC3 not only championed the Principles of Open Scholarly Infrastructure (POSI) through our advocacy and authorship but also actively supported their adoption by key organizations like ROR, Dryad, DataCite, and Make Data Count. POSI has provided an invaluable framework for transparency, accountability, and community alignment.

As we look toward the future, we’re thrilled to see the next evolution of POSI taking shape with the proposed POSI 2.0 revisions. These updates, informed by real-world experiences of adopters, aim to refine the principles to ensure they remain practical and relevant in an ever-changing landscape. This evolution is not just about updating a framework—it’s about strengthening the foundation for a more open and resilient scholarly ecosystem.

Why does this matter to UC3? As active stewards of open infrastructure, we know that collective input is key to shaping effective, inclusive principles. POSI has empowered organizations to perform self-assessments, build trust with their communities, and advocate for long-term sustainability. We’ve seen firsthand how these principles can elevate not just individual organizations but the entire ecosystem.

With POSI 2.0, we’re calling on the scholarly community to contribute to this critical conversation. The proposed revisions are open for public comment, and this is your chance to help ensure that POSI continues to reflect the needs and aspirations of our diverse community.

How to Get Involved

  1. Review the Proposed Revisions: Dive into the draft of POSI 2.0 and explore the updates.
  2. Share Your Feedback: Take the short survey to share your thoughts and insights.
  3. Spread the Word: Encourage your networks to join this important dialogue.

Deadline: March 5, 2025
Learn More & Participate: https://openscholarlyinfrastructure.org/public-comment-v2/

Understanding the Vision Behind Make Data Count and the Open Global Data Citation Corpus

As the scientific community increasingly embraces open data, the question of how these datasets are being accessed and utilized becomes ever more pressing. Researchers, funders, and policymakers alike are keen to understand the impact and reach of the data they produce, support, and use. This is where the vision of Make Data Count (MDC) and the Data Citation Corpus comes into play.

What is Make Data Count?

Make Data Count is an international initiative aimed at transforming how we measure the impact of open research data. Traditionally, the scholarly community has focused on citations to articles as a metric of impact. However, as research becomes more data-intensive, it’s clear that we need new metrics to capture the influence and reuse of datasets. MDC is committed to developing evidence-based data metrics that go beyond traditional measures, allowing for a more comprehensive understanding of data usage.

MDC’s efforts focus on creating the infrastructure and standards needed to track, collect, and report data usage and citation metrics. This includes not only citations to datasets within scholarly articles but also how data is used across various fields and sectors. The ultimate goal is to provide a holistic view of how open data contributes to scientific progress, policy-making, and beyond.

For more details on the roadmap and future developments of Make Data Count, you can explore the MDC Roadmap.

CDL’s Role in Make Data Count

The University of California Curation Center (UC3) at the California Digital Library (CDL) has been a key player in the Make Data Count initiative since its inception. CDL’s expertise in managing collaborative projects and its commitment to open data practices have been instrumental in the development and implementation of MDC’s goals. Over the years, CDL team members have provided strategic oversight and technical infrastructure. Currently, CDL team members serve on MDC’s advisory committee and work with other key partners, such as DataCite and the MDC Director, Iratxe Puebla, on MDC project execution. CDL continues to play a vital role in fostering collaborations with other institutions and organizations to expand the reach and impact of MDC.

A Centralized Resource for Data Citations

The Data Citation Corpus, developed in collaboration with the Chan Zuckerberg Initiative (CZI) and the Wellcome Trust, is a cornerstone of this vision. The Corpus aims to be a vast, open repository of data citations from diverse sources and disciplines, providing a centralized resource for understanding how data is being cited and reused.

This initiative addresses a significant challenge in the current landscape: the fragmented and incomplete nature of data citation information. While data citations are increasingly being created, the existing workflows for collecting and propagating these citations are often leaky, leading to gaps in the persistent identifier (PID) metadata. Furthermore, in some fields, especially within the life sciences, data sharing via repositories that use accession numbers instead of DOIs is common, which further complicates the collection of metadata on data reuse.

The corpus is being developed in iterative stages, with the initial prototype already incorporating data citations from DataCite event data and the CZI Knowledge Graph. This prototype allows for visualizations based on parameters like institution or data repository, providing valuable insights into how datasets are being cited and used across the research ecosystem.

As the project progresses, the goal is to expand the Data Citation Corpus to include additional sources and features, ultimately creating a resource that different stakeholders—researchers, funders, institutions, and policymakers—can use to integrate data usage information into their work. 

Expanding the Corpus and Engaging the Community

To further the goals of expanding and refining the Data Citation Corpus, MDC is hosting a hackathon on September 4, 2024, focused on building curation workflows for the corpus. The hackathon will bring together data scientists, developers, and engineers to work on two key projects: developing user interfaces for the corpus and creating workflows for community-driven curation of data citations.

The hackathon will take place in two locations, with sessions at the Wellcome Trust in London and the California Digital Library in Oakland, California. Participants will collaborate on innovative solutions that will be presented the following day at the MDC Summit.

Stay tuned for a follow-up post where we will share the outcomes of the hackathon and the exciting developments that emerge from this collaborative effort.