UC3 New Year Series: Data Management Planning in 2026

Welcome to the second post of UC3’s New Year blog post series, where different services of UC3 take a look at the coming year.  If you haven’t already read it, check out the first one on digital preservation.

Over in the world of Data Management Planning, we’ve got a lot of exciting work this year to share!

DMP Tool Rebuild

Our main project continues to be working on the rebuild of the DMP Tool.  While we initially hoped to have it ready early this year, we’re now targeting the summer of 2026.  This gives us more time to make sure it’s at a high level of quality, and also releases it at a time that will hopefully be less disruptive to people who teach classes using the DMP Tool.  There’s a chance it will take longer than the summer though – we’re focused on quality over speed.

We’ve done 3 rounds of user testing so far on the site, and each time has given us a lot of valuable information.  We’ve gotten a lot of positive feedback about new features we will be offering, such as alias email addresses, adding collaborators to templates, a revamped API, and much more.  Other changes, though, have caused some confusion for people used to the current tool, and through testing we have found opportunities to improve the workflow and usability of the new site.  Changes like these mean the rebuild will take longer than initially planned, but we think they are worth the time to get right.

To keep updates about the rebuild in one place, we have a Rebuild Hub page on our blog.  We’ll keep this page up to date with the latest information about the release date, FAQ, status updates, and more.  We plan to make posts leading up to the new release showing the major changes and giving guidance to make the transition as seamless as possible.  If you’d like to help with testing at any point, please sign up for our user panel to get invitations to future feedback sessions.

As we’ve said before, we’re limiting updates to the current tool so we can focus our limited resources on the rebuild; but of course we also want to keep the tool live and helpful during the transition.  We’re fixing any major issues that come up, such as keeping it up to date with the new ROR API and schema, and addressing user tickets as quickly as possible.  We are trying to keep funder templates up to date as well, but the frequency of new information and potential changes has made it difficult to perfectly capture all updates to federal guidelines.  We want to make sure we have the most relevant information possible on the tool without changing templates too often (as that can cause organization guidance to be lost), so we’ve been collecting updates from our Editorial Board members for a template release in the near future.  If you see any instances where a template in our tool does not match a funder template, please reach out to us by email so we can get it corrected.

Get Involved with API Integrations

With our rebuild comes a completely revamped API that takes advantage of our new machine-actionable functionality.  We’re currently looking for partners that would like early access to our new API in order to develop new integrations for our rebuild.  Our goal is that the new API can do anything the user interface can do, which means the sky (or, more relevantly, the cloud) is the limit for possible tools.  If you’ve been wanting to connect to our API for some sort of automation that our current API does not support, we’d love to hear from you.  You can hear more about past pilot integrations and how to work with our API at this recording of our webinar from the Machine Actionable Plans pilot project.  We’ll be following the common API standard being developed with the Research Data Alliance, meaning many integrations with our tool should work for other DMP service providers as well.  If you have an idea for an integration you’d like to build on our new API, please reach out to dmptool@ucop.edu.

Matching to Published Research Outputs

We’ve talked before about a major project to use machine learning models to help match DMPs to their eventual research outputs, like datasets and software publications, to help make data from published DMPs easier to find and re-use.  This work has continued and we plan to release it with the rebuilt DMP Tool.  Since our last update, we’ve made some significant steps towards this goal, including:

Screenshot of a webpage that says "Published Research Outputs" at the top and includes a list of scholarly research citations.  Next to each item in the list are buttons that say "Accept" and "Reject", as well as information about the work such as date found, source, and confidence of the match.
New user interface showing a list of published outputs that have been matched to a DMP in our rebuilt DMP Tool.  Interface is subject to change before release.

Improvements we plan to work on over 2026 include:

We’re excited for people to get to use this tool with the rebuild and start accepting and rejecting potential matches so we can learn from this and improve the matching algorithm further over time.  People will also be able to manually add DOIs as research outputs, like they can on the current tool, which will also help train the model over time on what we missed as potential matches.  This will be available for all DMPs that have been published, i.e., registered for a DMP ID.  Accepted works will be added to the metadata for the plan as related identifiers.
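As a rough illustration of the review flow described above, the sketch below models potential matches being accepted or rejected, with accepted works recorded as related identifiers on the plan. Every class name, field name, and DOI here is hypothetical; the rebuilt DMP Tool’s actual data model is not described in this post.

```python
from dataclasses import dataclass, field


@dataclass
class PotentialMatch:
    """A work the matcher proposes as an output of a plan (hypothetical shape)."""
    doi: str
    source: str              # e.g. "DataCite" or "Crossref"
    confidence: float        # matcher score between 0 and 1
    status: str = "pending"  # "pending", "accepted", or "rejected"


@dataclass
class Plan:
    """A published DMP, i.e. one registered for a DMP ID (hypothetical shape)."""
    dmp_id: str
    related_identifiers: list = field(default_factory=list)
    matches: list = field(default_factory=list)

    def review(self, doi: str, accept: bool) -> None:
        """Record a human decision; only accepted works reach the metadata."""
        for match in self.matches:
            if match.doi == doi:
                match.status = "accepted" if accept else "rejected"
                if accept and doi not in self.related_identifiers:
                    self.related_identifiers.append(doi)
                return
        raise KeyError(f"no potential match with DOI {doi}")

    def add_manual_output(self, doi: str) -> None:
        """A researcher adds a DOI that the matcher missed."""
        self.matches.append(PotentialMatch(doi, "manual", 1.0, "accepted"))
        if doi not in self.related_identifiers:
            self.related_identifiers.append(doi)

    def training_examples(self) -> list:
        """(doi, accepted?) pairs from reviewed matches, usable as feedback."""
        return [(m.doi, m.status == "accepted")
                for m in self.matches if m.status != "pending"]


# Example with made-up DOIs.
plan = Plan(dmp_id="10.5555/example-dmp")
plan.matches = [
    PotentialMatch("10.5555/example-dataset", "DataCite", 0.92),
    PotentialMatch("10.5555/example-article", "Crossref", 0.41),
]
plan.review("10.5555/example-dataset", accept=True)
plan.review("10.5555/example-article", accept=False)
plan.add_manual_output("10.5555/example-software")
```

In this sketch, rejected matches never touch the plan’s metadata, but they are still kept as labeled examples, which is the kind of feedback a matching model could learn from.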

DMP Chef

Another exciting area we’re exploring is the use of generative AI to assist in writing Data Management Plans.  We’ve partnered with the FAIR Data Innovations Hub to work on the DMP Chef, a project to explore using large language models (LLMs) to draft DMPs.  Our goal is not to take away the key decisions in data management planning from a researcher, but instead to simplify the process as much as possible by asking a few critical questions, combining those responses with funder requirements that need to be met, and using those to produce a draft of a DMP for their review and edits.

We have promising early results, with both automated statistics and human evaluations showing the LLM-drafted DMPs can be comprehensive, accurate, and follow best practices.  Commercial models are performing better than the open-source models, but since we want to remain open-source, we’re looking at ways to improve the open-source models through additional retrieval augmented generation and other options.  And we’ll be testing carefully how accurate and helpful the output is, as well as looking at ways to help ensure researchers read and edit the plan as needed, rather than just accept the output right away.

DMP Source | Overall Satisfaction Rating (1-5) | Average Error Count per DMP | Accuracy in Guessing LLM vs Human
Human | 3.1 | 7.2 | 65%
LLMs (combined) | 3.4 | 4.9 | 43%
    Llama 3.3 | 2.6 | 7.5 | 70%
    GPT-4.1 | 4.2 | 2.3 | 15%
Results presented at the Research Data Alliance 2025 plenary, showing GPT-4.1 generated DMPs with higher satisfaction ratings and fewer errors reported than human-written exemplar DMPs from NIH.  N = 20 participants rating a DMP from each source, for a total of 60 DMP ratings

Over the course of 2026, we plan to keep testing and improving this model, starting with NIH and NSF plans.  The ultimate goal is a general use model that can be used within the DMP Tool for any funder to get a first draft of either a whole DMP or specific sections a researcher is struggling with.  We have a working prototype tool for DMP generation we will use for testing purposes, with integration into the DMP Tool planned for further out.  If you’d like to be part of testing out this new tool, please sign up for our user panel.

Thanks for reading about our major initiatives for the year!  Keep an eye on this space for the next post in our series, about our 2026 plans for persistent identifiers.

We are grateful to the Institute of Museum and Library Services, the National Science Foundation, and the Chan Zuckerberg Initiative for each supporting core components of these initiatives.

UC3 New Year Series: Digital Preservation in 2026

At UC3, we’re dedicated to advancing the fields of digital curation, digital preservation, and open data practices. Over the years, we’ve built and supported a range of services and actively led and collaborated on initiatives to open scholarship. With this in mind, we’re kicking off a new series of blog posts to highlight our core areas of work and where we’re heading in 2026.

Welcome to UC3’s 2026 New Year blog post series! This first post dives into the team’s engagement with all things digital preservation, along with its further development of CDL’s digital preservation repository, Merritt.


Moving Merritt into the Realm of Containerization

Although our team has always been improving the Merritt repository with an eye on operational efficiency, security, transparency and durability, it was not until 2025 that we really entered the beginning of a major paradigm shift with how the system functions. This shift entails movement of Merritt’s microservices and queueing system into a fully containerized state. 

What’s in a container and why are we making this shift?

In essence, a container is a fully functional and portable cloud computing environment that surrounds an application. Inside a container, an application such as Merritt’s Ingest microservice has all of the resources it needs to run – including a common operating system, an allocation of memory, configuration files and any required software libraries and dependencies. As a result, the new Merritt system will be composed of a series of small, lightweight, secure containers that process significant amounts of incoming archival content more quickly.

That’s also to say, going forward, each Merritt service will no longer operate on a customized, often expensive AWS Elastic Compute Cloud (EC2) instance. Instead, services will operate in containers that are orchestrated (i.e. automatically managed) by Amazon’s Elastic Container Service (ECS). 

There are numerous advantages to running Merritt using this new strategy and infrastructure. First and foremost, it allows microservices to scale according to the activity of depositors. In other words, when multiple depositors have sent multi-gigabyte or multi-terabyte batches to ingest, Merritt’s Ingest and Storage microservices will scale up as needed – meaning more containers that operate the software that performs ingest and storage operations will be started automatically.

And, as opposed to manually requesting additional, more costly EC2 instances be instantiated temporarily by DevOps, each new container spins up via ECS automatically in minutes – with the latest operating system and dependency updates already in place. Just as importantly, once the load on these services reduces, unneeded containers will be automatically spun down. Doing so keeps the repository’s overall footprint as small as possible. Fewer unused compute resources means less power consumption and heat generation at data centers, and in turn lowered impact on the environment. 
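To make the scale-up and scale-down idea concrete, here is a toy sketch of how a desired container count might follow the ingest load. It is not how ECS computes scaling (ECS applies its own configured scaling policies), and the function name, thresholds, and load figures are all hypothetical.

```python
import math


def desired_containers(pending_batches: int,
                       batches_per_container: int = 5,
                       min_containers: int = 1,
                       max_containers: int = 10) -> int:
    """Return how many containers a service should run for the current load.

    The count grows with the number of pending batches and is clamped
    between a floor (keep the service available) and a ceiling (cap cost).
    """
    needed = math.ceil(pending_batches / batches_per_container)
    return max(min_containers, min(max_containers, needed))


# Load rises as depositors submit batches, then falls back to idle.
loads = [0, 3, 12, 48, 80, 7, 0]
counts = [desired_containers(n) for n in loads]
print(counts)  # [1, 1, 3, 10, 10, 2, 1]
```

The last two values show the spin-down behavior: once the queue drains, the count returns to the minimum, so no idle containers are left running.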

Given all these benefits, we are incredibly excited to be within reach of our containerization goal. Merritt’s administrative service has already been containerized and its UI is next on the list. After that, we’ll move Merritt’s Audit and Replication services to containers, followed by Inventory, Ingest, Storage and finally ZooKeeper. Throughout the year, we’ll keep you informed of our progress through upcoming, monthly CDL INFO newsletter posts.

Community Engagement

At the end of 2025, the UC Libraries Digital Preservation Working Group (DPWG) completed its charge. By year’s end the group had submitted both a comprehensive gap analysis and a new digital preservation framework to the Direction and Oversight Committee (DOC) in final form. 

The gap analysis, based on the Digital Preservation Coalition’s Rapid Assessment Model (DPC RAM), allowed the team to evaluate to what degree main campus libraries met multiple criteria considered key to good digital preservation practices. The gap analysis subgroup effectively transformed the RAM maturity model into a survey instrument, gathering over 50 responses from participants.

As the gap analysis was underway, a separate DPWG subgroup reviewed existing digital preservation frameworks from a number of academic institutions and consortia. This group evaluated which framework elements were most applicable to the needs of UC campus libraries. The subgroup explored frameworks and associated documentation from ICPSR, University of Washington Libraries, Yale, Northwestern, and Harvard, among others, as well as from our own campuses at San Diego and Santa Cruz.

What became evident during this process was that the forthcoming framework needed to be practicable by a large number of independently operating libraries with mixed resources and varying priorities – but with the added requirement to report back to a critical, long-standing governance structure: in this case, the University Committee on Library and Scholarly Communication, or UCOLASC.

Through an iterative process that was informed by gap analysis results, a new digital preservation framework was created. Its key components include a range of operating principles, identification of potential campus designees and discussion surrounding shared services – specifically with a slant towards development involving collaborative design, transparent governance and inclusiveness of campus library voices.

Throughout the year, we look forward to collaborating with campus libraries on the introduction of the new framework and how they can potentially take advantage of what it has to offer.

Please note that neither the gap analysis nor the framework has been published yet, but links to them will be posted on this site when they become available!

Coalition for Content Provenance and Authenticity (C2PA)

A few years ago a group of major players in the technology and media space including Adobe, Microsoft, Google, BBC and Sony began collaboration around what would become a public specification for the promotion of content authenticity and provenance. This specification and its implementation by vendors and organizations strives to provide the means to include a living manifest of metadata in digital materials. A C2PA manifest can record when a digital asset first came to be along with myriad changes that span the asset’s lifecycle. Such structured and secure metadata is purposed to record the provenance of the item while also describing the ongoing changes made to it by actors, be they human or machine. In essence, through purposeful metadata handling, it becomes possible to identify if the asset was altered or utilized in a fashion that was not intended by its creator and subsequent users.

Through engagement with the C2PA for G+LAM working group driven by members of the Library of Congress, the Merritt team has contributed to the development of a newly defined C2PA use case for Open Access journal publications, as well as a forthcoming, public call-to-action white paper addressing high-level issues related to AI and content authenticity and provenance. This latter deliverable is intended for cultural heritage administrators and practitioners. It will cover “potential risks and opportunities presented by recent AI technologies and outline potential directions for future collaborative research and experimental applications.” 

Stay tuned, as a draft should be posted for public comment in February.

Merritt’s CoreTrustSeal (re)Certification

This year we will re-apply for Merritt’s certification by CoreTrustSeal (CTS). At the same time, 2026 marks the beginning of a standard three-year period in which revised CTS requirements come into play. The upshot is that we will need to address many more requirements than in Merritt’s last certification – which is a good thing!

In a similar vein, the Merritt service manager (myself) has been part of the CTS Assembly of Reviewers, a group of nearly 100 individuals responsible for reviewing CTS submissions from repositories around the world. That also means I’m very much looking forward to going through the exercise of revamping our application and all of the supporting information that’s needed for a successful certification. An internal audit such as this one, driven by an organization with international roots and a vast amount of combined experience, presents an incredibly beneficial opportunity to introspect. And through introspection, we’ll better our repository and the services it provides for our colleagues across the entire UC system.


Again – welcome to UC3’s 2026 New Year Series. The next post on Data Management Plans and DMP Tool should arrive next week!

Why Award DOIs Matter: Strengthening Discovery Across UC’s Funding Programs

The Research Grants Program Office (RGPO) at the University of California Office of the President manages one of the UC system’s most impactful research portfolios, comprising over $100 million in yearly awards across programs such as the California Breast Cancer Research Program and the California HIV/AIDS Program. These diverse and impactful funding activities are complemented by rigorous internal data practices for tracking their impact, including providing rich and detailed descriptions of these activities in their public-facing grants database. 

As persistent identifier (PID) enthusiasts, the UC3 team immediately saw the availability of this high-quality data source as a unique opportunity. By leveraging the RGPO’s comprehensive metadata to generate award DOIs in DataCite, we could bridge the gap between their accounting and the larger research ecosystem, broadcasting the full scope of RGPO’s impact to this even larger audience.

Why DOIs?

Registering DOIs for awards provides a high-level view of all of the RGPO’s work across its funding programs, describing them in a unified fashion using the DataCite schema, which provides a persistent, machine-readable reference and improves the visibility of these funding activities. By assigning DOIs, RGPO awards become connected to the broader persistent identifier ecosystem, meaning that other systems can easily discover, link to, and reuse information about these awards and their associated research outputs. Ultimately, this helps close gaps between internal and external systems, creating a more comprehensive picture of the University of California’s impact and RGPO’s role in that success.

How we did it

We worked closely with the DataCite team to analyze existing practices for representing awards in their schema, ensuring that everything was modeled correctly. This included identifying and resolving inconsistencies in representations.

Once those issues were resolved and the model was set, the registration process itself was straightforward:

  1. We mapped RGPO award data to the DataCite schema.
  2. We added ROR IDs for the funder (University of California Office of the President) and other relevant entities.
  3. We linked research outputs to the awards by including DOIs for their related works.
  4. We generated the XML for the award records and registered them via the DataCite API.
  5. Finally, we provided the RGPO with a report of these registrations so that the DOIs could be integrated back into the RGPO’s grants database.
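The mapping steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used for the RGPO registrations: the post says XML was generated, whereas this sketch builds the JSON form that DataCite’s REST API also accepts, and it makes no network call. The input field names, the example DOIs, the placeholder ROR ID, and the choice of relation type are all hypothetical, and the "Award" resource type assumes a recent version of the DataCite schema.

```python
import json


def build_award_payload(award: dict, funder_name: str, funder_ror: str,
                        relation_type: str) -> dict:
    """Map a simplified award record to a DataCite-style JSON payload.

    `award` uses made-up keys; a real grants database will differ.
    `relation_type` is left to the caller because the post does not say
    which DataCite relationType links the awards to their outputs.
    """
    attributes = {
        "doi": award["doi"],
        "titles": [{"title": award["title"]}],
        "creators": [{"name": award["pi_name"]}],
        "publisher": funder_name,
        "publicationYear": award["year"],
        # Assumption: "Award" is available as a resourceTypeGeneral value.
        "types": {"resourceTypeGeneral": "Award"},
        # Step 2: identify the funder with a ROR ID.
        "fundingReferences": [{
            "funderName": funder_name,
            "funderIdentifier": funder_ror,
            "funderIdentifierType": "ROR",
            "awardNumber": award["award_number"],
        }],
        # Step 3: link research outputs by their DOIs.
        "relatedIdentifiers": [
            {
                "relatedIdentifier": output_doi,
                "relatedIdentifierType": "DOI",
                "relationType": relation_type,
            }
            for output_doi in award["output_dois"]
        ],
    }
    return {"data": {"type": "dois", "attributes": attributes}}


# Made-up award record; none of these values come from the RGPO database.
example_award = {
    "doi": "10.5555/example-award",
    "title": "Example Award Title",
    "pi_name": "Researcher, Example",
    "year": 2024,
    "award_number": "EX-0001",
    "output_dois": ["10.5555/example-article", "10.5555/example-dataset"],
}
payload = build_award_payload(
    example_award,
    funder_name="University of California Office of the President",
    funder_ror="https://ror.org/EXAMPLE",  # placeholder, not a real ROR ID
    relation_type="IsReferencedBy",        # placeholder choice
)
print(json.dumps(payload, indent=2))
```

A payload like this would then be submitted to the registration endpoint with repository credentials, and the resulting DOIs reported back as in step 5.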

What’s Next: Automation and Research Graph Connections

We’re now focused on two primary next steps: 

1. We’re working to automate more of the DOI registration and update processes to ensure new or changed awards are registered more frequently than the current manual updates allow.

2. We’re collaborating with OpenAlex on their new grant-funded project to incorporate better and more complete funding metadata into OpenAlex’s scholarly graph. Because CDL is one of the first registrants of award DOIs with DataCite, OpenAlex is using CDL’s detailed account of the RGPO’s funding activities to model the ingestion and mapping of DataCite award DOIs more broadly.

Our efforts open RGPO grant projects to further enrichment of metadata and connections. This work includes matching grant-funded research outputs with their corresponding award DOIs, both by matching unstructured publication references in the funder metadata and by mining full-text publications to identify links that were not explicitly asserted in their DOI metadata. The goal is that once these connections are identified, they can also be incorporated back into the award DOIs and DataCite, thereby making their description more comprehensive and complete.  

Our hope is that this work demonstrates the value of PIDs for awards and encourages other funders to adopt a similar approach. Registering award DOIs doesn’t just improve local data quality; it strengthens global research infrastructure and helps make the impact of publicly funded research more visible and more connected.

Exploring How AI Can Help Research Data Management

At UC3, several of our latest initiatives involve integrating AI tools, with a particular focus on improving metadata and assisting researchers with creating best practice DMPs.

A clear philosophy guides UC3’s approach to the use of generative AI: addressing researchers’ and the broader research community’s needs, keeping humans as the authority, complementing human work for scale and efficiency, and prioritizing open-source solutions where possible. 

Improving ROR Metadata

One key application of AI we are exploring is enhancing the quality and scale of our metadata curation activities, including those for the Research Organization Registry (ROR). ROR, a widely adopted persistent identifier service for research organizations, operates on a model where anyone can submit a request to add or update its records. This community-focused approach to curation has allowed ROR to grow rapidly by gathering diverse and valuable feedback from a global userbase. However, as one might expect with crowd-sourced data, it also has inherent complexities that require special attention to maintain consistency and quality. 

AI helps by taking these diverse user inputs and automatically transforming them into clean, structured, authoritative outputs in the ROR dataset. For adding records to the registry, this automation seamlessly handles data standardization, formatting, and enrichment tasks that would otherwise require specialized logic and manual intervention to achieve. For updates to the registry, AI can transform natural language descriptions of desired changes into structured modifications, described using ROR’s data model. These interventions have dramatically accelerated ROR’s request processing ability, enabling the service to now efficiently handle its growing request volume and process over 1,000 user-submitted requests per month. 
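As a minimal sketch of what a "structured modification" could look like, the example below applies a change to a simplified organization record only after a human curator has approved it. The change format, the function name, and the example organization are hypothetical; the record fields are only loosely modeled on ROR’s data model, and the LLM step that would turn a request into the change is represented by a hand-written dictionary.

```python
from copy import deepcopy
from typing import Optional


def apply_change(record: dict, change: dict, approved_by: Optional[str]) -> dict:
    """Return an updated copy of `record`, leaving the original untouched.

    Raises PermissionError unless a curator has signed off, mirroring the
    idea that humans remain the authority over what enters the registry.
    """
    if not approved_by:
        raise PermissionError("a human curator must approve the change first")
    updated = deepcopy(record)
    if change["op"] == "add_name":
        updated["names"].append({
            "value": change["value"],
            "types": change["types"],
            "lang": change.get("lang"),
        })
    elif change["op"] == "set_status":
        updated["status"] = change["value"]
    else:
        raise ValueError(f"unsupported operation: {change['op']}")
    return updated


# Simplified, made-up record.
record = {
    "names": [{"value": "Example University", "types": ["ror_display", "label"], "lang": "en"}],
    "status": "active",
}

# Request text as a user might write it, and a hand-written stand-in for the
# structured change an LLM might propose from it.
request_text = "Please add 'ExU' as an acronym for Example University."
change = {"op": "add_name", "value": "ExU", "types": ["acronym"]}

updated = apply_change(record, change, approved_by="curator@example.org")
```

Keeping the proposed change as data, separate from the record it modifies, is what lets a curator inspect it before anything is written.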

Despite these advances, achieving 100% accuracy or completeness with these methods is neither possible nor desirable. Instead, we choose to pursue hybrid approaches that balance the efficiency and scalability of GenAI with the measured judgment and domain expertise that only human curators can provide. In doing so, we can embrace both innovation and authoritative oversight, allowing ROR to further grow in its position as a reliable, community-driven infrastructure, in service to the complex needs of the global research ecosystem. 

DMP Chef: Exploring AI-powered DMP Generation 

Another example of our AI exploration is “DMP Chef,” a large language model (LLM) based DMP generator. We are in the initial stages of this work, partnering with the California Medical Innovations Institute (CalMI2) to develop a new tool that allows researchers to provide simple descriptions of their work, from which the DMP Chef can generate a draft DMP. We are currently developing this tool to work with NIH DMPs and plan to follow up on this work by working with NSF and other templates.

The current process involves asking researchers for a short description of their study and the types of data they plan to collect, then using a detailed prompt to have the LLM draft an initial DMP using NIH’s template for review. To test the initial quality, we used the NIH exemplar DMP, extracting the study design and data types from Element 1, and then feeding that information into the tool. We compared the generated output with the actual DMP section by section. Our next step is to recruit data librarians to review these generated DMPs for quality and comprehensiveness.
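The first step of that process, turning a researcher’s short answers into a prompt organized around NIH’s template, might look something like the sketch below. The prompt wording and function name are hypothetical (the actual DMP Chef prompt is not reproduced in this post), and no LLM is called; the element titles follow NIH’s Data Management and Sharing Plan format.

```python
NIH_ELEMENTS = [
    "Element 1: Data Type",
    "Element 2: Related Tools, Software and/or Code",
    "Element 3: Standards",
    "Element 4: Data Preservation, Access, and Associated Timelines",
    "Element 5: Access, Distribution, or Reuse Considerations",
    "Element 6: Oversight of Data Management and Sharing",
]


def build_prompt(study_description: str, data_types: list) -> str:
    """Assemble a drafting prompt from the researcher's short answers."""
    data_lines = "\n".join(f"- {item}" for item in data_types)
    element_lines = "\n".join(NIH_ELEMENTS)
    return (
        "Draft a data management and sharing plan for the study below.\n"
        "Write one section per element, in this order:\n"
        f"{element_lines}\n\n"
        f"Study description:\n{study_description}\n\n"
        f"Data to be collected:\n{data_lines}\n\n"
        "Flag any detail you had to assume so the researcher can review it."
    )


# Made-up study details for illustration.
prompt = build_prompt(
    "A two-year observational study of sleep patterns in adults.",
    ["wearable sensor readings", "survey responses"],
)
print(prompt)
```

The resulting text would be sent to whichever model is under test, and the draft it returns compared against a reference plan section by section.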

We’re seeing some moderate initial success with off-the-shelf LLMs, including open-source models, and plan to continue refining the quality by exploring options such as asking the researcher additional questions, generating sections separately, and feeding the LLM additional policy documents. Our goal is to help create an initial draft of a high-quality plan that researchers can then refine to their needs, suggesting best practice repositories and standards based on their specific data.

Matching Related Works: Connecting Plans to Outputs

We’re also developing new tools to automatically connect DMPs to the research outputs they describe, such as datasets, articles, and software. These new connections improve the discoverability of research data and make it easier for researchers, funders, and administrators to see the complete picture of a project’s outputs. Our approach combines structured metadata from maDMPs with information from sources like DataCite, Crossref, OpenAlex, and the Make Data Count Citation Corpus. We utilize machine learning, incorporating embeddings generated by large language models and vector similarity search, to compare the text from the title and abstract of a DMP with those descriptive fields within the datasets, rather than relying solely on metadata for authors and funders. A human reviewer then confirms the matches to ensure accuracy and reduce the manual reporting burden on researchers. You can read more about this feature at the DMP Tool Blog.
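To show the vector-similarity idea in miniature, the sketch below ranks candidate works by cosine similarity to a DMP. The three-dimensional vectors are made-up stand-ins for real embeddings, which would come from a language model applied to titles and abstracts, and the threshold and identifiers are hypothetical.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(dmp_vector: list, candidates: dict, threshold: float) -> list:
    """Return (score, work_id) pairs at or above the threshold, best first."""
    scored = [(cosine_similarity(dmp_vector, vector), work_id)
              for work_id, vector in candidates.items()]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)


# Toy vectors standing in for embeddings of title and abstract text.
dmp_vector = [0.9, 0.1, 0.0]
candidates = {
    "dataset-a": [0.8, 0.2, 0.1],   # points in nearly the same direction
    "article-b": [0.0, 0.1, 0.9],   # nearly orthogonal
    "software-c": [0.5, 0.5, 0.1],  # partially similar
}

strict = rank_candidates(dmp_vector, candidates, threshold=0.8)
loose = rank_candidates(dmp_vector, candidates, threshold=0.5)
print([work_id for _, work_id in strict])  # ['dataset-a']
print([work_id for _, work_id in loose])   # ['dataset-a', 'software-c']
```

Candidates that clear the threshold would then go to a human reviewer, so the threshold trades off how many weak matches reviewers see against how many true matches are missed.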

UC3’s AI initiatives are focused on making research data easier to find, connect, and trust. By pairing AI-driven efficiencies with human expertise, we can accelerate workflows while maintaining the accuracy, transparency, and trust essential to research.

Why CDL Is Investing in COMET: A Community Centered Path to Richer Metadata

When the California Digital Library (CDL) signed the Barcelona Declaration in April 2025, it marked a deeper institutional commitment to building open and community-led research infrastructure. At the heart of this commitment is a recognition that metadata is not a passive byproduct of scholarship, but an active force that shapes how research is discovered, connected, cited, and reused. To build an ecosystem where metadata reflects the values of openness, equity, and trust, we must ensure that its stewardship is shared, inclusive, and sustainable.

This is why CDL’s University of California Curation Center (UC3) program is investing in COMET (Collaborative Metadata Enrichment Taskforce). COMET is both a vision and a framework for creating a healthier metadata ecosystem, where persistent identifiers are enriched and maintained through transparent, distributed workflows that engage the full research community. The principles below represent the building blocks of the COMET model and the foundation of CDL’s participation therein:

How COMET Emerged and CDL’s Participation

COMET emerged from a shared realization across the scholarly infrastructure community: if we want metadata that is trustworthy, complete, and actionable, we need to design systems that allow more people to contribute to it and more institutions to shape its governance. This vision came into sharper focus during a series of workshops at FORCE2024 held in Los Angeles and the Barcelona Declaration Community Meeting held in Paris, where participants from across disciplines and sectors gathered to discuss new models for collaborative metadata curation. These sessions surfaced a common theme: metadata enrichment can’t be sustained by individual repositories or publishers alone. What’s needed is a coordinated, community-powered model that invites researchers, libraries, funders, and infrastructure providers to play an active role in improving the quality of metadata tied to persistent identifiers.

Out of these conversations, COMET was born. By early 2025, COMET had evolved into a formal FORCE11 Project and culminated in an open “Community Call to Action” that invited broad participation in shaping workflows, tools, and governance models for metadata enrichment.

CDL was an early and enthusiastic supporter because the vision aligned with our mission and we see an opportunity to help bring it to life. Our involvement isn’t passive. CDL’s UC3 program brings more than two decades of experience in digital curation, persistent identifier infrastructure, and open scholarly systems. We contribute governance know-how, technical insight from our work on initiatives like EZID, Crossref, ROR, and DataCite, and convening power across academic and infrastructure communities. We also see COMET as a proving ground: a space to pilot scalable, community-led metadata workflows that can extend across institutions, repositories, and disciplines.

For CDL, joining COMET is a continuation of our long-standing commitment to open, shared infrastructure and collective progress. It’s an investment in a future where metadata is openly enriched, transparently verified, and valued by the very communities who depend on it.

What Community Participation Means

When libraries and institutions like CDL engage with efforts like COMET, the benefits extend far beyond improved metadata. Our participation brings a deep commitment to equity, transparency, and public stewardship, values that help shape infrastructure for the public good. By contributing expertise in curation, governance, and metadata standards, libraries ensure that research information is more complete, discoverable, and reusable across repositories, researcher profiles, and campus systems.

Shared governance is a central feature of COMET’s approach, and institutional involvement helps ensure that decisions reflect the needs of a global, diverse, distributed community. When institutions engage in this work, they align their local priorities with broader efforts to create trustworthy, persistent, and openly governed metadata. This alignment reduces redundancy, increases impact, and builds capacity for meaningful contributions across the ecosystem.

But the benefits of this work aren’t just at the institutional level. For researchers and end users, the results are tangible: better discovery, clearer provenance, and richer metadata that supports citation, reuse, and reproducibility. And for funders, repositories, and service providers, this community-driven model offers a scalable alternative to siloed or proprietary solutions, one that emphasizes interoperability, transparency, and accountability.

That’s why we believe that COMET offers more than just a framework for metadata enrichment. It provides an opportunity for us to embody our mission-driven values and help build the connective infrastructure that research depends on. For CDL, supporting COMET is a way to double down on its long-standing commitment to open, community-led infrastructure. It’s about creating shared pathways to trust, equity, and impact where metadata isn’t hidden or locked down, but serves as the connective tissue for discovery and collaboration.

Webinar Series: Insights from the Machine Actionable Data Management Plans Pilot

Want to learn about how technological advancements in data management plans can benefit research at your university? Have you heard the term “machine actionable” a lot but aren’t sure what it is or why it’s important? Are you looking for strategies to reduce burden on researchers and administrators in working on data management plans?

Join our free webinar series to learn from several US institutions that explored and piloted machine-actionable approaches to data management plans (DMPs).

Funded by the Institute of Museum and Library Services (award LG-254861-OLS-23), and led jointly by the California Digital Library (CDL) and the Association of Research Libraries (ARL), the Machine Actionable Plans (MAP) Pilot initiative enabled institutions to test and pilot data management plans that are machine-actionable and facilitate communication with other university research and IT systems. Each institution developed its own projects in alignment with its institutional mission, taking its specific challenges and opportunities into consideration. The DMP Tool team also worked with pilot partners to test features and advance technical developments to improve usability, best practice adoption, compliance, and efficiency.

In this series of webinars, we invite librarians, administrators, data managers, IT & security staff to find out more about the motivations of these institutions to explore machine-actionable DMP integrations: what they did, how they did it, and what they learned. For those interested in more technical aspects of integrations, some webinars will also provide detail on the API of the DMP Tool, along with more detailed implementation instructions and advice.

Webinar 1: Streamlining Research Support: Lessons from maDMP Pilots  

This webinar is for those looking to improve the efficiency, collaboration, and coordination of research support within their institutions. Learn from several institutions about their explorations of maDMP integrations to facilitate automated notifications for coordination across campus, and about how they used the pilot more broadly to facilitate discovery and collaboration within their institutions. This webinar will provide an overview of each institution’s activity, rather than detailed instructions about integrations.

Presenters include:  Katherine E. Koziar, Briana Wham, Matt Carson, Andrew Johnson

Register

Webinar 2: Creative Approaches for Seamless and Efficient Resource Allocation 

Don’t miss this webinar if you’re interested in new ways to enable efficient resource allocation. Institutions will share their experiences in leveraging maDMPs to develop integrations for automation systems that enable such allocations. This webinar will provide an overview of each institution’s activity, rather than detailed technical instructions about integrations.

Presenters include:  Katherine E. Koziar, Andrew Johnson

Register

Webinar 3: Five Technological Advancements in DMPs to Benefit Your Organization 

If you’re interested in emerging technologies within the pilot project and the DMP Tool and how they can help your institution expedite research sharing, compliance, and operational efficiency, this webinar will provide a strong introduction. We’ll also hear from pilot partners about promising AI developments related to reviewing DMPs, and will hear more detail on technical advancements coming to the DMP Tool based on feedback from the pilot. 

Presenters include:  Jim Taylor, Becky Grady

Register

Webinar 4: How to Implement Machine-Actionable DMPs at your Institution

If you want to find out more about specific integrations and how to implement maDMPs, this webinar is for you. Hear from the DMP Tool team about the API, common challenges and how to overcome them, and actionable recommendations for campus buy-in.

Presenters include:  Becky Grady, Brian Riley

Register

Working Toward a Common Standard API for Machine-Actionable DMPs

DMP Tool and the Research Data Alliance

Our work at DMP Tool has been shaped from the ground up through collaborations at the Research Data Alliance (RDA). From the earliest conversations about machine-actionable Data Management Plans (maDMPs) to the creation of the DMP common standard and the DMP ID, the RDA has served as the convening space where we’ve found shared purpose, co-developed solutions, and built lasting partnerships with peers across the globe. That same spirit is captured in the Salzburg Manifesto on Active DMPs, which outlines a vision for DMPs as living, integrated components of the research lifecycle. That vision continues today, as we are helping launch a new initiative at RDA to update a common API standard for DMP service providers. This effort will help ensure our systems can connect more seamlessly and serve the broader research ecosystem more effectively. This post gives some context on why this new effort is needed, what we’ve done so far for it, and what we have coming next.

DMP Tool implementation of the RDA common standard

The DMP Tool team were early advocates of maDMPs and saw the potential value of capturing structured information during the creation of a DMP. The goal is to use as many persistent identifiers (PIDs) as possible to help facilitate integrations with external systems. To gather this data, we introduced new fields into the DMP Tool to capture detailed information about project contributors (ORCIDs, RORs, and CRediT roles) as well as what repositories (re3data), metadata standards (RDA metadata standards) and licenses (SPDX) would be used when creating a project’s research outputs. These new data points are captured alongside the traditional DMP narrative. We also started allowing researchers to publish their DMPs. This process generates a DMP ID, a DOI customized to capture and deliver DMP-focused metadata. This approach allows the DMP to be discoverable in knowledge graphs like DataCite Commons. Once the DOI is registered, the DMP Tool provides a landing page for the DOI.

Screenshot of the DMP Tool showing how to register your plan for a DMP ID

One of the main points of collecting all of this structured metadata is to facilitate integrations with other systems. To make that possible, we introduced a new version of the API that outputs the DMP metadata in the common standard developed with RDA. Our first integration was with the RSpace electronic lab notebook system. When a researcher is working in RSpace, they are able to connect RSpace with the DMP Tool to fetch their DMPs in PDF format and store the document alongside their other research outputs. Once connected, RSpace is able to send the DMP Tool the DOIs of any research outputs that the researcher deposits in repositories like Dataverse or Zenodo. These DOIs are then available as part of the DMP’s structured metadata.
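To make the idea of PID-rich structured metadata more concrete, here is a minimal Python sketch of how an integrating system might pull identifiers out of a DMP record. The field names (`dmp_id`, `contributor_id`, `dataset_id`) follow the RDA DMP common standard as we understand it, the sample record is an illustrative subset with made-up values, and `collect_pids` is a hypothetical helper rather than part of the DMP Tool API.

```python
import json

# Illustrative subset of a machine-actionable DMP record. Field names are
# assumed to match the RDA DMP common standard; every value is made up.
SAMPLE_DMP_JSON = """
{
  "dmp": {
    "title": "Example plan",
    "dmp_id": {"identifier": "https://doi.org/10.99999/EXAMPLE", "type": "doi"},
    "contributor": [
      {
        "name": "Jane Researcher",
        "contributor_id": {
          "identifier": "https://orcid.org/0000-0000-0000-0000",
          "type": "orcid"
        },
        "role": ["Data curation"]
      }
    ],
    "dataset": [
      {
        "title": "Survey results",
        "dataset_id": {
          "identifier": "https://doi.org/10.99999/DATASET",
          "type": "doi"
        }
      }
    ]
  }
}
"""


def collect_pids(record):
    """Return (type, identifier) pairs for the plan, its contributors, and its datasets."""
    dmp = record["dmp"]
    pids = [(dmp["dmp_id"]["type"], dmp["dmp_id"]["identifier"])]
    # Contributors and datasets are optional, and so are their identifiers.
    for section, id_field in (("contributor", "contributor_id"), ("dataset", "dataset_id")):
        for entry in dmp.get(section, []):
            pid = entry.get(id_field)
            if pid:
                pids.append((pid["type"], pid["identifier"]))
    return pids


if __name__ == "__main__":
    for pid_type, identifier in collect_pids(json.loads(SAMPLE_DMP_JSON)):
        print(pid_type, identifier)
```

Because the identifiers sit in predictable places, a lab notebook or repository can use them without parsing the narrative text of the plan.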

Moving the Standard Forward 

The original RDA DMP common standard was released 3 years ago. Since that time, systems like the DMP Tool have found areas where we need to deviate from the base standard. This is a normal process when any standard is developed and first put into use. We have discovered key fields that should be added to the standard (e.g., contributor affiliation information) and areas that don’t really make sense to capture within the DMP itself (e.g., the PID systems a particular repository supports). 

Other DMP systems have also been implementing the common standard and making it available via API calls, but this was done without conformity as to how an external system can access those APIs. This results in systems like RSpace needing to develop and maintain separate integrations for each tool. Over time, this extra work leads to fewer integrations between systems, making each more siloed.

RDA is made up of Interest Groups and Working Groups where members across the world join together to work on a common topic, making guidelines, best practices, tools, standards, and other resources for the wider community. To tackle this use case and address shared issues, our RDA group decided to release a new version of the common standard, v1.2, and to form a new working group to develop API standards that each tool should support. Members of the DMP community gathered together at the end of March to discuss both topics. The DMP systems represented at the meeting included Argos, DAMAP, Data Stewardship Wizard, DMPonline, DMP OPIDoR, DMP Tool, DMPTuuli, and ROHub.

Our DMP Tool team attended the meeting to make sure that the needs of our funders, researchers and institutions were properly represented. The meeting was split into two parts: 

Photograph of 14 meeting attendees representing a variety of service providers in a conference room
Meeting attendees, representing a variety of DMP service providers, worked together on the common standard

Next steps

The original common metadata standard working group plans to incorporate the proposed non-breaking changes this summer as release v1.2. We have also committed to keep the conversation going about future enhancements as we work towards v2.

Meanwhile, the new RDA working group also hopes to release an official API specification this summer. The individual tools would then be tasked with ensuring that their systems support the new API endpoints. For our part, the DMP Tool will ensure that our new website supports this API standard when it launches, as well as additional endpoints specific to our application. The goal is that integrator services like RSpace will then be able to connect more easily with any DMP service, making connections across the research system more robust.

Anyone can review the proposed work statement for the new DMP common API for maDMP working group. We would value your input, and if you’re interested in joining the group and contributing to the API specification, you can join RDA (it’s free!) and join our Working Group.

UC3 New Year Series: Data Publishing at CDL in 2025 

Structured, well-documented, and FAIR-aligned data is the foundation of effective research dissemination. However, data publishing activities have often focused on the last step in the research process. This puts energy on helping researchers clean up disorganized data sets and placing them in repositories. While this is essential to ensuring accessibility and preservation of important data outputs, it is also important to connect the dots and address the underlying issues that lead to poor data quality in the first place. Our previous development work and continuing membership with Dryad are great examples of this commitment to supporting well-formatted deposits.  However, it has also always been the strategy of the UC3 data publishing team to invest in people through training, comprehensive documentation, institutional support and policies, and innovative tools. Our goal is to connect those dots and help empower the research community with the skills and knowledge to create high-quality, well-structured data from the outset. 

In 2025, we aim to create a more open, transparent, and sustainable data-sharing future by combining emerging technologies with structured training programs. This dual approach improves data deposit quality and empowers researchers to contribute to a more efficient data-publishing ecosystem.

AI Tools for Data Publishing

Conversations with repository managers often highlight recurring challenges: incomplete documentation, missing README files, or data files that don’t match metadata standards. Automated “nudges” can catch such issues at the point of deposit. The vision is for AI-based systems to serve as virtual coaches that flag inconsistencies and as active collaborators capable of implementing necessary changes where appropriate. These tools will be able to modify metadata directly, generate appropriate README files, and restructure dataset instructions when needed—transforming how researchers prepare and deposit their data.

One promising component of our 2025 strategy involves the development of AI-assisted curation tools, which we’ve begun exploring to provide researchers real-time feedback on their data deposits. This approach leverages artificial intelligence to identify potential metadata, documentation, and formatting issues before submission. However, we won’t be covering this topic in detail here. If you’re interested in our AI curation initiatives, please refer to our previous article, in which we discussed them thoroughly.

In this post, we highlight CDL’s collaboration with The Carpentries as a key component of the UC3 data publishing strategy for 2025, emphasizing the human side of our approach: training. The Carpentries teaches foundational coding and data science skills to researchers worldwide, and through our partnership, we directly address the skills gap in metadata, documentation, and data formatting.

Training Translates to Broader Impact

Good data practices are part of many successful interdisciplinary collaborations. For example, after the Deepwater Horizon spill in the Gulf of Mexico, researchers in fields as varied as biology, oceanography, engineering, and socioeconomics exploited consistent metadata standards to share thousands of datasets seamlessly. That synergy is best achieved when data management principles are embedded long before a crisis or urgent need arises. Planting the seeds of data literacy in labs and classrooms allows institutions to sidestep the friction and duplicative efforts that often accompany cross-institutional projects.

Robust training programs also help teams stay nimble when policies shift. As mandates continue to change—whether through federal agencies or international collaborations—researchers grounded in best practices can adapt quickly, avoiding costly do-overs. In this sense, the cost-effectiveness of up-front training becomes an investment in a more flexible, forward-looking data ecosystem.

Why Training Makes Data Publishing Easier (and Less Costly)

Early exposure to best practices in data management often prevents unnecessary clean-up and steep learning curves later on in a researcher’s career. This observation was echoed in discussions at a recent Earth Science Information Partners (ESIP) meeting, where a central theme was the value of weaving data skills into formal coursework—rather than treating them as optional add-ons for already overworked researchers. Students who learn these concepts in undergraduate or graduate courses, sometimes through a single assignment requiring a formal data management plan, become more adept at producing coherent, reusable datasets.

In many cases, the hands-on philosophy developed by The Carpentries aligns with such classroom activities. Whether using version control for a small-scale project or learning to structure metadata for a mock submission to a repository, these experiences reduce the likelihood of encountering major data-quality issues down the line. Once researchers join labs and undertake funded projects, they have the required knowledge to meet evolving mandates without incurring frantic, last-minute adjustments.

The Carpentries and CDL: A Long-Standing Partnership

For over a decade, the CDL has worked with The Carpentries to refine curricula on coding, documentation, and data management best practices. A 2017 grant from the Institute of Museum and Library Services (IMLS) helped expand “Library Carpentry,” allowing librarians to actively participate in curation. Last year, we received another IMLS award to help The Carpentries scale their operations and curriculum.  Over the years, UC3 staff have been closely involved in shaping these workshops, hosting sessions, and serving on governance councils to promote a broader culture of responsible data stewardship.

One of the main strengths of The Carpentries’ model is its train-the-trainer approach. Certifying volunteer instructors within organizations makes it possible to seed new workshops across campuses and disciplines. This approach has found synergy with our participation in the Generalist Repository Ecosystem Initiative (GREI), a collaborative effort bringing together seven major generalist repositories, including Zenodo, Dryad, Vivli, Center for Open Science, and Dataverse. Through GREI, we’re expanding the reach and impact of data publishing best practices across diverse repository infrastructures.

Under the auspices of the GREI project, we’re working with selected Carpentries modules to address specific data publishing challenges across multiple repository environments. In 2025, we’ll pilot these modified modules in workshops to gain practical teaching experience with this GREI-relevant curriculum. This field testing will provide valuable instructor and participant feedback, allowing us to refine the content and delivery methods. This iterative approach ensures that these modules will ultimately integrate seamlessly into the broader Carpentries curriculum, creating sustainable resources that address the complexities of modern data publishing.

Moving Forward in 2025

High-quality data deposits rarely emerge by accident; they require intentional investment in training, documentation, institutional support, and tools. At UC3, we take a holistic approach, recognizing that creating better datasets goes beyond technical solutions – it demands strategic investments across the entire research data lifecycle.

By strengthening training programs, refining repository workflows, and making learning resources widely accessible, we help researchers at all levels produce well-structured, reusable data. Our ongoing collaborations with The Carpentries and GREI ensure that best practices continue to evolve alongside the research community’s needs. With these efforts, “deposit-ready data” can become the standard rather than the exception, reducing inefficiencies and accelerating scientific discovery. As we move through 2025 and beyond, our focus remains clear: building a sustainable, scalable, and human-centered data publishing ecosystem that empowers researchers and institutions alike.

UC3 New Year Series: Persistent Identifiers at CDL in 2025

CDL’s persistent identifier portfolio, which includes the Research Organization Registry (ROR), EZID, Name-to-Thing (N2T), and the new Collaborative Metadata Enrichment Taskforce (COMET) initiative, had a busy and productive year in 2024, seeing record adoption and interest and making significant technical improvements across its services. These complementary work streams help build a more rational, networked, and efficient research ecosystem – one where persistent identifiers allow for seamless connections between researchers, institutions, and scholarly outputs, reducing redundant efforts and unnecessary costs and making everyone’s work more visible and impactful. As we move into 2025, I’m excited to bring you a look ahead into what we have planned for this year.

ROR

ROR  is a global, community-led registry of open persistent identifiers for research and funding organizations, operated as a collaborative initiative by the California Digital Library, Crossref, and DataCite. As a trusted, free, and openly available service, ROR has become the standard for organizational identification in the scholarly communications ecosystem. The story of ROR in 2025 will be seizing the opportunities provided by this widespread adoption with better performance, improved services, and higher-quality data.  

This work will begin with a Q1 launch of a new and improved version of our affiliation matching service, which has been battle-tested in OpenAlex and used to make millions of new connections between authors, works, and institutions. From here, we will further improve our API’s performance by implementing response caching for repeat requests, speeding up response times and reducing overall resource usage. Once this is complete, we will round things out by implementing a client identification system, allowing ROR to better manage its traffic, while also keeping our API services available at the same generous level of public access.
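As a rough illustration of what response caching for repeat requests means in practice, the following Python sketch wraps any lookup function in a small time-limited cache. It is a generic example rather than ROR’s implementation: the `TTLCache` class, the five-minute default lifetime, and the injectable `fetch` and `clock` parameters are all assumptions made for illustration.

```python
import time


class TTLCache:
    """Cache the result of fetch(key) for a limited time (hypothetical helper)."""

    def __init__(self, fetch, ttl_seconds=300.0, clock=time.monotonic):
        self._fetch = fetch      # function that performs the real, slower lookup
        self._ttl = ttl_seconds  # how long a cached response stays valid
        self._clock = clock      # injectable so the cache can be tested without waiting
        self._entries = {}       # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            # Repeat request within the lifetime: answer from memory.
            return entry[1]
        # First request, or the cached copy has expired: do the real work.
        value = self._fetch(key)
        self._entries[key] = (now + self._ttl, value)
        return value


if __name__ == "__main__":
    def slow_lookup(query):
        print("looking up", query)
        return {"query": query}

    cache = TTLCache(slow_lookup)
    cache.get("University of Example")  # performs the lookup
    cache.get("University of Example")  # served from the cache
```

The trade-off in any design like this is freshness against load: a longer lifetime saves more work but may serve slightly older results.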

Concurrent with this technical work, ROR will pursue a number of improvements to the quality and completeness of its data, building on work done in 2024, as well as embracing new and emerging opportunities. This will include many regional improvement efforts, with work already underway in Portugal and Japan, better representation of publishers, society organizations and funders, as well as the addition of new external identifiers and improved domain field coverage in support of all of these efforts. ROR will also continue to refine its curation processes to meet the growing needs of its community. In 2024, ROR processed over 8,000 curation requests—a 44% increase from 2023—with trends indicating that we should expect to receive 1,000 requests per month by year’s end. Our goal is to continue publishing the same, high quality data on our monthly release schedule, even in the face of this increased demand!

EZID and N2T

EZID and N2T are complementary persistent identifier services that enable reliable, long-term access to research outputs. EZID provides identifier creation and management services focused on ARKs and DOIs, while N2T serves as a global resolver that ensures identifiers remain reliable and actionable over time. 

In 2024, EZID was sharply focused on improving its reliability and performance. This included moving to OpenSearch to power our search functionality and rewriting many of our background jobs and database queries to increase their speed and efficiency. This work has resulted in a more stable and high-performing service, capable of handling the large increase in traffic that resulted from EZID assuming resolution functionality for its own ARK identifiers. Building from this foundation, in 2025, we will continue to optimize our underlying systems, while also adding support for DataCite schema v4.6 and improving EZID’s user interface. Our UI updates will be focused on improving the application’s accessibility, such that all users can effectively manage their persistent identifiers. These coordinated improvements will guarantee that EZID remains a dependable and inclusive platform for persistent identifier management.

Alongside EZID’s improvements, we completed another major milestone in 2024: rebuilding N2T as a modern Python service. With the new flexibility and performance this provides, our 2025 plans include rolling out additional service enhancements for N2T that will better support ARK curation workflows, adding public-facing resolution statistics, and fully deprecating the legacy instance of this application. This work will continue to strengthen N2T’s role as essential infrastructure for all the great identifier usage that occurs outside and in parallel to the DOI ecosystem.

COMET

The COMET initiative, launched in November of 2024, seeks to address a critical problem in DOI metadata management. Currently, only record owners can update DOI metadata, even when others have improvements to contribute. This leads organizations to maintain their own enhanced versions of the same records in separate systems, resulting in duplicated effort and inconsistent representations of research outputs, both individually and in aggregate. COMET’s solution is to create an open framework that allows the community to contribute validated metadata improvements directly to DOI records, unlocking tremendous new value and efficiencies at the sources of this metadata.

To date, COMET has brought together experts from publishers, libraries, funding organizations, and infrastructure services in a series of listening sessions focused on the vision, product definition, and governance model for a service that would realize its goals. In March 2025, these efforts will culminate in a community call-to-action, soliciting partnerships, funding, and other resources to help build this service. Subscribe to COMET’s email list to receive up-to-date news or follow its LinkedIn page for updates.

As I hope this all conveys, it has never been a better, more energizing time to both help build and participate in the persistent identifier ecosystem. Here’s to 2025 and all the exciting work ahead for UC3 and the scholarly communications community!

UC3 New Year Series: The Merritt Digital Preservation Repository in 2025

At UC3, we’re dedicated to advancing the fields of digital curation, digital preservation, and open data practices. Over the years, we’ve built and supported a range of services and actively led and collaborated on initiatives to open scholarship. With this in mind, we’re kicking off a new series of blog posts to highlight our core areas of work and where we’re heading in 2025.

With the close of 2024 and beginning of 2025, the Merritt team is preparing for a new year of exciting projects and engagements with libraries and organizations across the UC system. We wrapped up 2024 by fully revising our Ingest queueing system to allow for more granular control over submissions processing, an effort which laid the groundwork for upcoming features, many of which we’ll discuss here. These include potential user-facing metrics, submission status and a new, easier to use manifest format for batch submissions. 

Meanwhile, the team is always hard at work making the repository’s existing functionality more robust and future proof. We’ll also review some of these efforts and how they both provide reassurance to collection owners and improve users’ overall experience with the system.

Submission Status and Collection Reports

Implementing Merritt’s revamped queueing system was the most significant change to the way the repository ingests content that’s taken place since its inception. The new system establishes and allows for visibility into many more stages of the ingest process. 

Every individual job now moves through these phases:

Merritt job ingest stages

Internally, the Merritt team can monitor these steps through the repository’s administrative layer and intervene as necessary. For example, if one or more jobs in a batch fail, our team can assist by restarting a job or investigating what happened. Once a job is processing again, we can now also send additional notifications regarding the batch it belongs to.

While it’s possible for the team to view when a job or batch of jobs reaches any stage, Merritt currently does not provide this same insight to its users. We’ve already obtained preliminary feedback that such visibility would be helpful and are planning to gather additional input via an upcoming depositor tools survey regarding which stages are most useful and how to present them. In this vein, the survey will seek to capture information pertaining to your workflows for submitting content to the repository, and what new information and tools would make these more efficient for you. Our goal will be to identify trends in responses and focus on a key set of features to optimize Merritt’s usability.

Alongside visibility into a submission’s status, we would also like to provide additional information about the corresponding collection, post-submission. On every collection home page, we display specific metrics about the number of files, objects and object versions in the collection as well as storage space used. Given our recent work on object analysis and data visualization, we plan to surface additional data on the composition of a collection in its entirety. Our current thinking is to bubble up the MIME types of files, a date range that takes into account the oldest and newest objects in the collection, as well as the most recently updated object. A downloadable report with this data may also be an option, directly from the collection home. Again, stay tuned as we’ll be looking for your input on what’s most important to you in this case.

New Manifest Format

There are any number of ways to ingest content into Merritt – from single submissions through its UI, to API-based submissions and also use of several types of manifests that tell the system from where it should download a depositor’s content for processing.  

In essence, a manifest is a list of file locations on the internet saved to a simple but proprietary format. Depending on the type of manifest, it may or may not contain object metadata. Although Merritt’s current range of manifest options provides a great deal of flexibility for structuring submissions, it can present a steep learning curve if a user’s goal is to deposit large batches of complex objects.

To help make working with manifests a more intuitive process, we’ve drafted a new manifest schema that records objects in YAML. Unlike current manifests, this single schema can be adapted to support definition of any object or batch of objects. For example, it allows for defining a batch of objects, where each object can have its own list of files and object metadata, all in one .yml file. Currently, this approach requires use of multiple manifests – one that lists a series of manifest files, and the other manifests to which it refers (a.k.a. a “manifest-of-manifests”). Each of these latter files records an object and its individual metadata. The new YAML schema should allow for more efficient definition of multiple objects in a single manifest, while also being more intuitive and easier to approach.
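To picture the difference, here is a sketch in Python of a batch where each object carries its own files and metadata in a single nested structure, the same nesting a YAML file would express. Every key name (`objects`, `metadata`, `files`) and every URL is an assumption made for illustration and does not reflect the draft Merritt schema, and `validate_batch` is a hypothetical helper.

```python
# Hypothetical batch structure: the key names are assumptions for
# illustration only, not the draft Merritt manifest schema.
EXAMPLE_BATCH = {
    "objects": [
        {
            "metadata": {"title": "Field notebook 1", "creator": "Example Lab"},
            "files": [
                "https://example.org/notebook1/page-001.tif",
                "https://example.org/notebook1/page-002.tif",
            ],
        },
        {
            "metadata": {"title": "Field notebook 2", "creator": "Example Lab"},
            "files": ["https://example.org/notebook2/page-001.tif"],
        },
    ]
}


def validate_batch(batch):
    """Return a list of problems found in the batch; an empty list means none."""
    problems = []
    objects = batch.get("objects", [])
    if not objects:
        problems.append("batch has no objects")
    for index, obj in enumerate(objects):
        files = obj.get("files", [])
        if not files:
            problems.append(f"object {index} lists no files")
        for location in files:
            # A manifest lists file locations on the internet.
            if not location.startswith(("http://", "https://")):
                problems.append(f"object {index}: not a web location: {location}")
    return problems


if __name__ == "__main__":
    print(validate_batch(EXAMPLE_BATCH) or "batch looks well formed")
```

The point of the sketch is the shape: one document describes the whole batch, so there is no top-level manifest pointing at a separate manifest per object.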

Available Services

Object Analysis

The Merritt team has endeavored to provide collection owners the means to obtain more insights into the composition of the content being preserved in the repository. More specifically, we’ve leveraged Amazon OpenSearch to record and categorize characteristics of both the file hierarchy in objects as well as the metadata associated with them. The Object Analysis process is one by which we gather this data from the objects in a specific collection upon request, and then surface the results of subsequent categorization and tests in an OpenSearch dashboard for review. We then walk through the dashboard with the collection owner and library staff to explore insights. For more information, have a look at the Object Analysis Reference or our presentation at the Library of Congress and let us know if you would like to explore your collections! Our goal is to provide you with the critical information needed for taking preservation actions that promote the longevity of your content.

Ingest Workspace

Last fall we brainstormed on how we could more effectively help library partners ingest their collections into Merritt while also minimizing their overhead associated with staging content. For example, if a collection owner’s digitization project results in files saved to hard drives, staging those files shouldn’t necessarily require they have on-premises storage and local IT staff who can administer it while staging and validation actions occur. Through use of a workspace our team has designed, it’s possible to assist with copying content to S3 and leverage a custom process (implemented via AWS Lambda) to automate the generation of submission manifests.

If you’re spinning up a digitization project in the near future which entails a digital preservation element, let us know!

What We’re Working On

As mentioned, we’re always working to make Merritt more robust, secure and easy to maintain for the long run. There are many more granular reasons for undertaking this work, but this year we’re keen to buckle down on our disaster recovery plans, ensure we’re making use of the latest AWS SDK version to interact with cloud storage at multiple service providers and continue to set up the building blocks for autoscaling Merritt’s microservices.

Disaster Recovery Plans

It goes without saying that any digital preservation service and its accompanying policies are inherently geared to prevent data loss. However, when truly unexpected events happen, the best thing we can do is strategize for the worst. In our case, because the content we steward lives entirely in the cloud, we need to be prepared for any of those services to become suddenly unavailable. Discussion in the larger community surrounding digital preservation in the cloud has continued, especially as organizations consider moving forward without a local, on-prem copy of their collections. Given Merritt already operates in this manner, it should be able to pivot and recognize either of its two online copies as its primary copy, while continuing to run fixity checks and replicate data to new storage services. Note the system’s third copy in Glacier is not a primary copy candidate due to the excessive costs associated with pulling data out of Glacier for replication. So while we already have the means to back up critical components such as Merritt’s databases and recreate microservice hosts in a new geographic region, as well as a well-known approach to redefining collection configuration (i.e. primary vs. secondary object copies), we need a more efficient implementation to perform the latter reconfiguration on the fly. We intend to work on this storage management implementation in late 2025, detailing its operation not only as part of a disaster recovery plan, but also in our upcoming CoreTrustSeal renewal for 2026.
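The pivot described above can be sketched as a small selection rule. This Python example is only an illustration of the logic, not Merritt’s storage management code: the `StorageCopy` type, its `online` and `available` flags, and the provider names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageCopy:
    """One stored copy of a collection (hypothetical model)."""
    name: str
    online: bool     # False for archival tiers such as Glacier
    available: bool  # False while the storage service is unreachable


def choose_primary(copies, current_primary):
    """Keep the current primary if it is usable, otherwise pivot to another online copy."""
    by_name = {copy.name: copy for copy in copies}
    current = by_name.get(current_primary)
    if current is not None and current.online and current.available:
        return current.name
    for copy in copies:
        # Archival copies are never promoted because retrieval is too costly.
        if copy.online and copy.available:
            return copy.name
    raise RuntimeError("no online copy is available to act as primary")


if __name__ == "__main__":
    copies = [
        StorageCopy("provider-a", online=True, available=False),  # outage
        StorageCopy("provider-b", online=True, available=True),
        StorageCopy("glacier", online=False, available=True),
    ]
    print(choose_primary(copies, current_primary="provider-a"))
```

In this sketch an outage at the first provider promotes the second online copy, while the archival copy is skipped even though it is reachable.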

Talking to the Cloud

Although it doesn’t happen often (rewind to 2010 when the AWS SDK for Java 1.0 was introduced), major version updates to the Amazon SDK that enables programmatic communication with S3 are inevitable. That’s why we’ve already spent considerable time testing v2 of the SDK for Java across all three of our cloud storage providers, so we’ll be ready when Amazon ends support for v1 in December of 2025.

Autoscaling for Efficiency

As you may have gathered from some of our regular CDL INFO newsletter posts, a long-term goal for our team is to enable certain Merritt microservices to autoscale. The number and size of submissions to the repository are constantly in flux. Most days we see hundreds of small metadata updates processed, e.g. for eScholarship publications being preserved in the system. On other days, depositors may submit upwards of a terabyte apiece of new content while the service concurrently processes direct deposits whose updates stem from collections in Nuxeo. Given this variability, the environmental impact of excessive compute resources in the cloud, and compute and storage costs, Merritt should only be running as many servers as it needs at any given time. In particular, because we run every microservice in a high-availability fashion, we should be able to increase the amount of compute on days of heavy submissions and, more critically, reduce the number of hosts when minimal content is being ingested. Revamping our queueing process was a major requirement for autoscaling. With this complete, we (and all UC3 teams) now need to migrate into our own AWS account to further refine control over repository infrastructure. This next step will move us ever closer to fulfilling our goal to autoscale. We’ll continue to share more information on this in our regular posts going forward, so keep an eye out for the latest news!
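As a rough illustration of the scaling rule described above, the following Python sketch computes how many hosts a microservice might run for a given amount of queued work. It is a simplified assumption, not a description of how Merritt will actually autoscale: the `desired_hosts` function, its parameters, and the default floor and ceiling are all hypothetical. The floor of two hosts stands in for the high-availability requirement.

```python
import math


def desired_hosts(queued_jobs, jobs_per_host, min_hosts=2, max_hosts=10):
    """Return a host count sized to the queue, clamped to [min_hosts, max_hosts].

    All parameters are hypothetical: queued_jobs is the amount of pending work,
    jobs_per_host is how much one host can handle, min_hosts keeps the service
    highly available on quiet days, and max_hosts caps cost on heavy days.
    """
    if jobs_per_host <= 0:
        raise ValueError("jobs_per_host must be positive")
    if min_hosts > max_hosts:
        raise ValueError("min_hosts must not exceed max_hosts")

    needed = math.ceil(queued_jobs / jobs_per_host)
    return max(min_hosts, min(max_hosts, needed))
```

On a quiet day with an empty queue this returns the floor, and on a day of heavy submissions it grows with the queue until it reaches the cap.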

In closing, we’re excited to work with our library partners throughout the coming year, both on potential new features and on new projects via recently introduced tools. Reach out to us anytime!