Exploring How AI Can Help Research Data Management
At UC3, several of our latest initiatives involve integrating AI tools, with a particular focus on improving metadata and assisting researchers with creating best practice DMPs.
A clear philosophy guides UC3’s approach to the use of generative AI: addressing researchers’ and the broader research community’s needs, keeping humans as the authority, complementing human work for scale and efficiency, and prioritizing open-source solutions where possible.
Improving ROR Metadata
One key application of AI we are exploring is enhancing the quality and scale of our metadata curation activities, including those for the Research Organization Registry (ROR). ROR, a widely adopted persistent identifier service for research organizations, operates on a model where anyone can submit a request to add or update its records. This community-focused approach to curation has allowed ROR to grow rapidly by gathering diverse and valuable feedback from a global userbase. However, as one might expect with crowd-sourced data, it also has inherent complexities that require special attention to maintain consistency and quality.
AI helps by taking these diverse user inputs and automatically transforming them into clean, structured, authoritative outputs in the ROR dataset. For adding records to the registry, this automation seamlessly handles data standardization, formatting, and enrichment tasks that would otherwise require specialized logic and manual intervention. For updates to the registry, AI can transform natural language descriptions of desired changes into structured modifications expressed in ROR’s data model. These interventions have dramatically accelerated ROR’s request processing, enabling the service to handle its growing request volume efficiently and process over 1,000 user-submitted requests per month.
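As an illustration of the kind of transformation involved, rather than ROR’s actual pipeline, the sketch below turns a free-text update request into a structured change proposal; the call_llm helper, the prompt wording, and the field names are hypothetical placeholders.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for whatever LLM client the pipeline uses."""
    raise NotImplementedError("plug in an LLM client of your choice")

def request_to_change(user_request: str) -> dict:
    """Convert a free-text curation request into a structured change proposal."""
    prompt = (
        "You help curate a registry of research organizations. "
        "Convert the request below into JSON with the keys "
        '"ror_id", "field", and "new_value". Respond with JSON only.\n\n'
        f"Request: {user_request}"
    )
    # The output is a proposal only; a human curator reviews it before
    # anything is applied to the registry.
    return json.loads(call_llm(prompt))

# Hypothetical example:
# request_to_change("Please add the acronym 'UCB' to the record for UC Berkeley")
# might return {"ror_id": "<the organization's ROR ID>", "field": "acronyms", "new_value": "UCB"}
```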
Despite these advances, achieving 100% accuracy or completeness with these methods is neither possible nor desirable. Instead, we choose to pursue hybrid approaches that balance the efficiency and scalability of GenAI with the measured judgment and domain expertise that only human curators can provide. In doing so, we can embrace both innovation and authoritative oversight, allowing ROR to further grow in its position as a reliable, community-driven infrastructure, in service to the complex needs of the global research ecosystem.
DMP Chef: Exploring AI-powered DMP Generation
Another example of our AI exploration is “DMP Chef,” a large language model (LLM)-based DMP generator. We are in the initial stages of this work, partnering with the California Medical Innovations Institute (CalMI2) to develop a new tool that allows researchers to provide simple descriptions of their work, from which DMP Chef can generate a draft DMP. We are currently developing the tool to work with NIH DMPs and plan to extend it to NSF and other templates.
The current process involves asking researchers for a short description of their study and the types of data they plan to collect, then using a detailed prompt to have the LLM draft an initial DMP using NIH’s template for review. To test the initial quality, we used the NIH exemplar DMP, extracting the study design and data types from Element 1, and then feeding that information into the tool. We compared the generated output with the actual DMP section by section. Our next step is to recruit data librarians to review these generated DMPs for quality and comprehensiveness.
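A rough sketch of that prompt-assembly step is below. The element list mirrors the publicly documented NIH Data Management and Sharing Plan format, but the wording and structure are illustrative assumptions, not the actual DMP Chef prompts or code.

```python
# Illustrative only: not the actual DMP Chef prompts or code.
NIH_DMS_ELEMENTS = [
    "Data Type",
    "Related Tools, Software and/or Code",
    "Standards",
    "Data Preservation, Access, and Associated Timelines",
    "Access, Distribution, or Reuse Considerations",
    "Oversight of Data Management and Sharing",
]

def build_dmp_prompt(study_description: str, data_types: str) -> str:
    """Combine the researcher's short answers with the NIH plan structure."""
    elements = "\n".join(f"{i}. {name}" for i, name in enumerate(NIH_DMS_ELEMENTS, 1))
    return (
        "Draft an NIH Data Management and Sharing Plan organized by these "
        f"elements:\n{elements}\n\n"
        f"Study description: {study_description}\n"
        f"Data to be collected: {data_types}\n"
        "Note any assumptions so a human reviewer can verify them."
    )

# The resulting prompt would then be sent to an LLM and the draft returned
# to the researcher for review and refinement.
```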
We’re seeing moderate initial success with off-the-shelf LLMs, including open-source models, and plan to continue refining the quality by exploring options such as asking the researcher additional questions, generating sections separately, and feeding the LLM additional policy documents. Our goal is to help create an initial draft of a high-quality plan that researchers can then refine to their needs, with suggestions for best practice repositories and standards based on their specific data.
Matching Related Works: Connecting Plans to Outputs
We’re also developing new tools to automatically connect DMPs to the research outputs they describe, such as datasets, articles, and software. These new connections improve the discoverability of research data and make it easier for researchers, funders, and administrators to see the complete picture of a project’s outputs. Our approach combines structured metadata from maDMPs with information from sources like DataCite, Crossref, OpenAlex, and the Make Data Count Citation Corpus. We use machine learning, incorporating embeddings generated by large language models and vector similarity search, to compare a DMP’s title and abstract with the corresponding descriptive fields of candidate datasets, rather than relying solely on author and funder metadata. A human reviewer then confirms the matches to ensure accuracy and reduce the manual reporting burden on researchers. You can read more about this feature at the DMP Tool Blog.
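As a rough illustration of the embedding-and-similarity step (not the production matching pipeline), the sketch below scores a DMP’s title and abstract against candidate dataset descriptions with an off-the-shelf sentence-embedding model; the specific model and the idea of returning a top-k shortlist for human review are assumptions of the example.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# A small public model standing in for whatever embeddings the real pipeline uses.
model = SentenceTransformer("all-MiniLM-L6-v2")

def rank_candidates(dmp_text: str, candidates: dict[str, str], top_k: int = 5):
    """Rank candidate outputs by similarity to a DMP's title and abstract.

    candidates maps an identifier (e.g. a DOI) to its title + abstract text.
    """
    ids = list(candidates)
    dmp_vec = model.encode(dmp_text, convert_to_tensor=True)
    cand_vecs = model.encode([candidates[i] for i in ids], convert_to_tensor=True)
    scores = util.cos_sim(dmp_vec, cand_vecs)[0]  # cosine similarity per candidate
    ranked = sorted(zip(ids, scores.tolist()), key=lambda pair: -pair[1])
    return ranked[:top_k]  # a human reviewer confirms any proposed matches
```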
UC3’s AI initiatives are focused on making research data easier to find, connect, and trust. By pairing AI-driven efficiencies with human expertise, we can accelerate workflows while maintaining the accuracy, transparency, and trust essential to research.
Expanding the Carpentries Community in California Academic Libraries
As the Data Services Librarian at the University of California, San Francisco, and co-chair of the Library Carpentry Advisory Committee, I have been privileged to witness the transformative impact of data and software training for librarians. Whether it is a cataloging librarian figuring out how to automate data uploads, an instruction librarian realizing there is an easier way to clean their messy class registration data, or a research data management librarian who now has the language to connect with computational researchers, data training allows librarians to become invaluable resources to their libraries and academic communities.

I am therefore very excited to share that I will be working with The Carpentries Team on a special project to develop the Carpentries community in academic libraries across California. The goal of this project is to create sustainable communities of academic libraries that can work together to make computational training more accessible to librarians and the communities that they serve. As part of this project, I will be reaching out to promote the work of the Carpentries in regional library groups, piloting new membership models for library networks and consortia, assisting with Library Carpentry workshops across the state, and offering more instructor training opportunities for California librarians.
Are you a California librarian interested in what the Carpentries can offer your library? Want to learn more about hosting a Library Carpentry workshop? Send me an email at ariel.deardorff@ucsf.edu.
This post was guest-authored by Ariel Deardorff, Data Services Librarian at UCSF, and originally published at The Carpentries blog.
Greg Janée in Transition

Greg Janée, lead developer for the EZID service, has recently stepped up to become Director of the Data Curation Program at the UCSB Library. Greg joined CDL’s UC3 team at 50% time 10 years ago and later shifted to CDL’s infrastructure team for a few years; he will continue working with UC3 (at 25% time) until June 2019 as we transition his knowledge and expertise. During that time he will continue to work remotely from Santa Barbara, alongside the new Identifier Product Manager / Research Data Specialist who will start in mid-November.
Greg has had a long career with CDL and UCSB. He’s made major contributions to identifiers in digital libraries, especially CDL’s very popular EZID service, and also recently published work on “compact identifiers” and ARKs in the Open. His transition to full-time lead of the UCSB Data Curation Program aligns well with CDL’s efforts in research data management. In this role, Greg also acts as the liaison for UCSB’s participation in Dash (soon to be Dryad), which provides replication storage for UCSB’s collections and data publications.
Test-driving the Dash read-only API
The Dash Data Publication service now allows access to dataset metadata and public files through a read-only RESTful API. Documentation is available at https://dash.ucop.edu/api/docs/index.html.

There are a number of ways to access this API, such as through programming language libraries or with the curl command-line tool. This short tutorial gives examples of accessing the API using Postman, a GUI application for testing and browsing APIs that is available for the major desktop operating systems. If you’d like to follow along, please download Postman from https://www.getpostman.com/.
We are looking for feedback on this first Dash API before we embark on building our submission API. Please get in touch with feedback or if you would be interested in setting up an API integration with the Dash service.
Create a Dash Collection in Postman
After you’ve installed Postman, open it and create a Dash collection to hold your queries against the API.
1. Open Postman.
2. Click New > Collection.
3. Enter the collection name and click Create.

Set Up Your First Request
1. Click the folder icon for the collection you just set up.

2. Click Add requests under this collection.

3. Fill in a name for your request, select the Dash collection you created earlier, and click Save to Dash to create it.

4. Click on the request you just created in the left bar and then click the headers tab.

5. Enter the following key and value in the header list. Key: Content-Type and Value: application/json. This header ensures that you’ll receive JSON data.

6. Enter the request URL in the box toward the top of the page. Leave the request type on “GET.” Enter https://dash.ucop.edu/api for the URL and click Save.
Try Your Request
1. Test out your request by clicking the Send button.
2. If everything is set up correctly, you’ll see JSON results returned in the response pane.

Information about the API is returned in JavaScript Object Notation (JSON) and includes a few features to become familiar with (a scripted version of this request follows the list below).
- A links section in the JSON exposes Hypertext Application Language (HAL) links that can guide you to other parts of the API, much like links in a web page allow you to browse other parts of a site.
- The self link refers to the current request.
- Other links can allow you to get further information to create other requests in the API.
- The curies section leads to some basic documentation that may be used by some software.
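If you prefer scripting to a GUI, the same request can be reproduced in a few lines of Python. This is a minimal sketch assuming the requests library is available; the _links key name follows the HAL convention and may differ slightly in the live response.

```python
import requests

# Same request as the Postman example: GET the API root and ask for JSON.
response = requests.get(
    "https://dash.ucop.edu/api",
    headers={"Content-Type": "application/json"},
)
response.raise_for_status()

api_info = response.json()
# The HAL links section points to other parts of the API (self, datasets, curies, ...).
for name, link in api_info.get("_links", {}).items():
    print(name, "->", link)
```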
Following Links and Viewing Dataset Information
Postman has a nice feature that allows you to follow links in an API to create additional requests.
1. Try it out by clicking the URL path associated with stash:datasets, which shows as /api/datasets.

2. You’ll see a new tab open for your new request toward the top of the screen, and then you can send or save the new request.

3. If you send this request, you will see a lot of information about the datasets in Dash.

Some things to point out about this request:
- The top-level links section contains paging links because this request returns a list of datasets. Not all datasets are returned at once, but if you need to see more you can go to the next page.
- The list contains a count of items in the current page and a total for all items.
- When you look at the embedded datasets, you’ll see additional links for each individual dataset, which you can also follow.
- The dataset information shown here is the metadata for the most recent successfully submitted version of each dataset.

Hopefully this gives a general idea of how the API can be used; from here, you can create additional requests to browse the Dash API.
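The link-following shown in Postman can also be scripted. The sketch below again assumes the requests library, and the _embedded, next, and title key names are common HAL-style conventions assumed for illustration rather than confirmed details of the Dash response.

```python
import requests

BASE = "https://dash.ucop.edu"
HEADERS = {"Content-Type": "application/json"}

def list_datasets(max_pages: int = 2) -> None:
    """Follow the stash:datasets link and page through the results."""
    url = BASE + "/api/datasets"
    for _ in range(max_pages):
        page = requests.get(url, headers=HEADERS).json()
        # Embedded dataset records each carry their own links and metadata.
        for dataset in page.get("_embedded", {}).get("stash:datasets", []):
            print(dataset.get("title"))  # field name assumed for illustration
        # Paging links let you walk beyond the first page of results.
        next_href = page.get("_links", {}).get("next", {}).get("href")
        if not next_href:
            break
        url = BASE + next_href

list_datasets()
```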
Hints for Navigating the Dash Data Model
As you browse through the different links in the Dash API, keep the following in mind:
- A dataset may have multiple versions. Even if it has only been edited or submitted once in the UI, it will still have one version.
- The metadata shown at the dataset level is based on the latest published version, and the dataset indicates the version number used to derive this metadata.
- Each version has descriptive metadata and files associated with it.
- To download files, look for the stash:download links. There are downloads for a dataset, a version, and an individual file. These links are standard HTTP downloads that can be retrieved using a web browser or other HTTP client.
- If you know the DOI of the dataset you wish to view, use a GET request for /api/datasets/<doi> (a minimal scripted example follows this list).
- The DOI would be in a format such as doi:10.5072/FKK2K64GZ22 and needs to be URL encoded when included in the URL.
- See, for example, https://www.w3schools.com/tags/ref_urlencode.asp or https://www.urlencoder.org/, or use the URL encoding methods available in most programming languages.
- For datasets that are currently private for peer review, downloads will not become available until the privacy period has passed.
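For the DOI lookup mentioned above, here is a minimal sketch using Python’s requests library (an assumption of this example, as is the title field); urllib.parse.quote handles the URL encoding, and the DOI shown is the placeholder format from the hint.

```python
import urllib.parse

import requests

def get_dataset(doi: str) -> dict:
    """Fetch a single dataset record by its URL-encoded DOI."""
    encoded = urllib.parse.quote(doi, safe="")  # e.g. doi%3A10.5072%2FFKK2K64GZ22
    url = f"https://dash.ucop.edu/api/datasets/{encoded}"
    response = requests.get(url, headers={"Content-Type": "application/json"})
    response.raise_for_status()
    return response.json()

# Placeholder DOI format from the hint above; substitute a real one from /api/datasets.
record = get_dataset("doi:10.5072/FKK2K64GZ22")
print(record.get("title"))  # field name assumed for illustration
```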
Using Amazon S3 and Glacier for Merritt: An Update
The integration of the Merritt repository with Amazon’s S3 and Glacier cloud storage services, previously described in an August 16 post on the Data Pub blog, is now mostly complete. The new Amazon storage supplements Merritt’s longstanding reliance on UC private cloud offerings at UCLA and UCSD. Content tagged for public access is now routed to S3 for primary storage, with automatic replication to UCSD and UCLA. Private content is routed first to UCSD, and then replicated to UCLA and Glacier. Content is served for retrieval from the primary storage location; in the unlikely event of a failure, Merritt automatically retries from secondary UCSD or UCLA storage. Glacier, which provides near-line storage with four-hour retrieval latency, is not used to respond to user-initiated retrieval requests.
| Content Type | Primary Storage | Secondary Storage | Primary Retrieval | Secondary Retrieval |
|--------------|-----------------|-------------------|-------------------|---------------------|
| Public       | S3              | UCSD, UCLA        | S3                | UCSD, UCLA          |
| Private      | UCSD            | UCLA, Glacier     | UCSD              | UCLA                |
In preparation for this integration, all retrospective public content, over 1.1 million objects and 3 TB, was copied from UCSD to S3, a process that took about six days to complete. A similar move from UCSD to Glacier is now underway for the much larger corpus of private content, 1.5 million objects and 71 TB, which is expected to take about five weeks to complete.
The Merritt-Amazon integration enables more optimized internal workflows and increased levels of reliability and preservation assurance. It also holds the promise of lowering overall storage costs, and thus, the recharge price of Merritt for our campus customers. Amazon has, for example, recently announced significant price reductions for S3 and Glacier storage capacity, although their transactional fees remain unchanged. Once the long-term impact of S3 and Glacier pricing on Merritt costs is understood, CDL will be able to revise Merritt pricing appropriately.
CDL is also investigating the possible use of the Oracle archive cloud as a lower-cost alternative, or supplement, to Glacier for dark archival content hosting. While the Oracle archive cloud offers functionality similar to Glacier’s, including four-hour retrieval latency, its price point for storage capacity is about one quarter of Glacier’s.
Collaborative Web Archiving with Cobweb
A partnership between the CDL, Harvard Library, and UCLA Library has been awarded funding from IMLS to create Cobweb, a collaborative collection development platform for web archiving.
The demands of archiving the web in comprehensive breadth or thematic depth easily exceed the technical and financial capacity of any single institution. To ensure that the limited resources of archiving programs are deployed most effectively, it is important that their curators know something about the collection development priorities and holdings of other, similarly engaged institutions. Cobweb will meet this need by supporting three key functions: nomination, claiming, and holdings. The nomination function will let curators and stakeholders suggest web sites pertinent to specific thematic areas; the claiming function will allow archival programs to indicate an intention to capture some subset of nominated sites; and the holdings function will allow programs to document sites that have actually been captured.
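As a purely conceptual sketch (Cobweb itself was still to be built when this was announced), the three functions could be modeled roughly as the record types below; every class and field name here is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Nomination:
    """A web site suggested for a thematic collecting project."""
    url: str
    project: str
    nominated_by: str

@dataclass
class Claim:
    """An archiving program's stated intention to capture nominated sites."""
    institution: str
    urls: list[str] = field(default_factory=list)

@dataclass
class Holding:
    """A metadata-only record of a site an institution has actually captured."""
    institution: str
    url: str
    collection: str  # the local collection where the archived content lives
```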
How will Cobweb work? Imagine a fast-moving news event unfolding online via news reports, videos, blogs, and social media. Recognizing the importance of recording this event, a curator immediately creates a new Cobweb project and issues an open call for nominations. Scholars, subject area specialists, interested members of the public, and event participants themselves quickly respond, contributing to a site list more comprehensive than could be created by any one curator or institution. Archiving institutions review the site list and publicly claim responsibility for capturing portions of it that are consistent with their local policies and technical capabilities. After capture, the institutions’ holdings information is updated in Cobweb to disclose the various collections containing newly available content. It’s important to note that Cobweb collects only metadata; the actual archived web content would continue to be managed by the individual collecting organizations. Nevertheless, by distributing the responsibility, more content will be captured more quickly with less overall effort than would otherwise be possible.
Cobweb will help libraries and archives make better informed decisions regarding the allocation of their individual programmatic resources, and promote more effective institutional collaboration and sharing.
This project was made possible in part by the Institute of Museum and Library Services, #LG-70-16-0093-16.
CC BY and data: Not always a good fit

Simba dans le carton, jacme31, CC BY-SA 2.0
This post was originally published on the University of California Office of Scholarly Communication blog.
Last post I wrote about data ownership, and how focusing on “ownership” might drive you nuts without actually answering important questions about what can be done with data. In that context, I mentioned a couple of times that you (or your funder) might want data to be shared under CC0, but I didn’t clarify what CC0 actually means. This week, I’m back to dig into the topic of Creative Commons (CC) licenses and public domain tools — and how they work with data.
Who “owns” your data?

Tug of war, Kathleen Tyler Conklin, CC BY-NC 2.0
This post was originally published on the University of California Office of Scholarly Communication blog.
Which of these is true?
“The PI owns the data.”
“The university owns the data.”
“Nobody can own it; data isn’t copyrightable.”
You’ve probably heard somebody say at least one of these things — confidently. Maybe you’ve heard all of them. Maybe about the same dataset (but in that case, hopefully not from the same person). So who really owns research data? Well, the short answer is “it depends.”
A longer answer is that determining ownership (and whether there’s even anything to own) can be frustratingly complicated — and, even when obvious, ownership only determines some of what can be done with data. Other things like policies, contracts, and laws may dictate certain terms in circumstances where ownership isn’t relevant — or even augment or overrule an owner where it is. To avoid an unpleasant surprise about what you can or can’t do with your data, you’ll want to plan ahead and think beyond the simple question of ownership.
PIDapalooza – What, Why, When, Who?

PIDapalooza, a community-led conference on persistent identifiers
November 9-10, 2016
Radisson Blu Saga Hotel
pidapalooza.org
PIDapalooza will bring together creators and users of persistent identifiers (PIDs) from around the world to shape the future PID landscape through the development of tools and services for the research community. PIDs support proper attribution and credit, promote collaboration and reuse, enable reproducibility of findings, foster faster and more efficient progress, and facilitate effective sharing, dissemination, and linking of scholarly works.
If you’re doing something interesting with persistent identifiers, or you want to, come to PIDapalooza and share your ideas with a crowd of committed innovators.
Conference themes include:
- PID myths. Are PIDs better in our minds than in reality? PID stands for Persistent IDentifier, but what does that mean and does such a thing exist?
- Achieving persistence. So many factors affect persistence: mission, oversight, funding, succession, redundancy, governance. Is open infrastructure for scholarly communication the key to achieving persistence?
- PIDs for emerging uses. Long-term identifiers are no longer just for digital objects. We have use cases for people, organizations, vocabulary terms, and more. What additional use cases are you working on?
- Legacy PIDs. There are thousands of venerable old identifier systems that people want to continue using and bring into the modern data citation ecosystem. How can we manage this effectively?
- The I-word. What would make heterogeneous PID systems “interoperate” optimally? Would standardized metadata and APIs across PID types solve many of the problems, and if so, how would that be achieved? What about standardized link/relation types?
- PIDagogy. It’s a challenge for those who provide PID services and tools to engage the wider community. How do you teach, learn, persuade, discuss, and improve adoption? What does it mean to build a pedagogy for PIDs?
- PID stories. Which strategies worked? Which strategies failed? Tell us your horror stories! Share your victories!
- Kinds of persistence. What are the frontiers of ‘persistence’? We hear lots about fraud prevention with identifiers for scientific reproducibility, but what about data papers promoting PIDs for long-term access to reliably improving objects (software, pre-prints, datasets) or live data feeds?
PIDapalooza is organized by California Digital Library, Crossref, DataCite, and ORCID.
We believe that bringing together everyone who’s working with PIDs for two days of discussions, demos, workshops, brainstorming, and updates on the state of the art will catalyze the development of PID community tools and services.
And you can help by getting involved!
Propose a session
Please send us your session ideas by September 18. We will notify you about your proposals in the first week of October.
Register to attend
Registration is now open — come join the festival with a crowd of like-minded innovators. And please help us spread the word about PIDapalooza in your community!
Stay tuned
Keep updated with the latest news at the PIDapalooza website and on Twitter (@PIDapalooza) in the coming weeks.
See you in November!
UC3 to Explore Amazon S3 and Glacier Use for Merritt Storage
The UC Curation Center (UC3) has offered innovative digital content access and preservation services to the UC community for over six years through its Merritt repository. Merritt was developed by UC3 to address unique needs for high-quality curation services at scale and at a low price point. Recently, UC3 started looking into Amazon’s S3 and Glacier cloud storage products as a way to address cost concerns, fine-tune reliability issues, increase service options, and keep pace with ever-increasing scale in the volume, variety, and velocity of new content contributions.
The current Merritt pricing model, in effect since July 1, 2015, is based on recovering the costs of storage use, currently totaling over 73 TB contributed from all 10 UC campuses. This content is now being replicated in UC private clouds supported by UCLA and UCSD. Since the closure earlier this year of the UCOP data center, the computational processes underlying Merritt, along with all other CDL services, have been moved to virtual machines in the Amazon AWS cloud. Co-locating storage alongside this computational presence in AWS will provide increased data transfer throughput during Merritt deposit and retrieval. In addition, the integration of online S3 with near-line Glacier storage offers opportunities to lower storage costs by moving archival materials with no expectation of direct end-user access to Glacier. The cost for Glacier storage is about one quarter of that for S3, which is comparable with UCLA and UCSD pricing. Of course, the additional dispersed replication of Merritt-managed data in AWS will also increase overall reliability and long-term preservation assurance.
The integration of S3 and Glacier will supplement Merritt’s existing use of UC storage. Merritt’s storage function acts as a broker that automatically routes submitted content to the appropriate storage location based on its curatorially-defined access characteristics. Once Amazon storage has been added to Merritt, content tagged for public access will be routed to S3 for primary storage, from which it will be automatically replicated to a UC cloud. Retrieval requests for this content will be served from the S3 copy; should these requests fail (for example, if S3 is temporarily non-responsive), Merritt automatically retries from its secondary copy.
The path for content tagged for private access is somewhat different. It is initially routed to S3 for temporary storage until the replication to a UC cloud completes. The content is then moved into Glacier for permanent low-cost primary storage. Retrieval requests will be served from the UC cloud. In the unlikely event that this retrieval doesn’t succeed, there is no automatic retry from Glacier, since Glacier, while inexpensive for static storage, is costly for systematic retrieval. UC3 staff can, however, intervene manually to retrieve from Glacier if it becomes necessary. In the case of both public and private access, the digital content will continue to be managed with at least five copies spread across independent storage infrastructures and data centers.
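The routing rules described in the last two paragraphs can be summarized in a short decision sketch. This is conceptual shorthand for the behavior described in this post, not Merritt’s actual implementation.

```python
def storage_plan(access: str) -> dict:
    """Summarize where Merritt places and retrieves content by access type.

    Conceptual only: a restatement of the routing described in this post.
    """
    if access == "public":
        return {
            "primary": "S3",
            "replica": "UC private cloud (UCSD/UCLA)",
            "retrieval": ["S3", "UC cloud copy (automatic retry on failure)"],
        }
    if access == "private":
        return {
            "primary": "Glacier (after temporary staging in S3)",
            "replica": "UC private cloud (UCSD/UCLA)",
            "retrieval": ["UC cloud copy", "Glacier (manual intervention only)"],
        }
    raise ValueError(f"unknown access type: {access}")

print(storage_plan("private"))
```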
The integration of Amazon S3 and Glacier into Merritt’s storage architecture will increase overall reliability and performance, while possibly leading to future reductions in costs. Once the integration is complete, UC3 will monitor AWS storage usage and associated costs through the end of the current Merritt service year on June 30, 2017, to determine the impact on Merritt pricing.