By Brian Riley, California Digital Library (CDL), and Mary O’Brien Uhlmansiek, The Association of Research Libraries (ARL)
In March, the lead developer for the DMP Tool, Brian Riley, attended a workshop on “Scientometrics Using Open Data” offered by the Centre for Science and Technology Studies (CWTS) at Leiden University. Participation in this session allowed us to share the work we are doing as part of the MAP Pilot project funded by the NSF and IMLS, and to collaborate on scientometric analyses using open data sources such as Crossref and DataCite.
The MAP Pilot project involves working with 10 institutions across the US to test connecting machine-actionable data management and sharing plans (maDMSPs) with related research outputs. Using research project metadata and persistent identifiers to query open data sources, it is relatively easy to find research articles produced by a particular project, but harder to find the datasets, software, and other artifacts described in a DMSP. We are investigating ways to improve the findability of these outputs using automation, including machine learning/AI.
When maDMSPs are created in the DMP Tool, users can enter useful project metadata to enable queries with other systems. This includes ORCIDs for contributors, funding opportunity identifiers, RORs for affiliations and funders, anticipated project start and end dates, and the planned data repository for storage. The DMP Tool then assigns a DMP ID to the DMSP.
DMSPs are often created years before the research outputs. The DMSPs in the DMP Tool with good metadata are only 2-3 years old, and their DMSP outputs have not yet been published. Therefore, the institutions contributing to our pilot have been asked to find older, funded research projects and their outputs to use as test cases. Using a new feature to upload an existing DMSP, they will enter basic information about the project (i.e., title, PI, grant identifiers) for research funded by four major US agencies (NSF, NIH, DOE, and NASA), the agencies for which we have the most developed API integrations. As potential DMSP outputs are identified, the pilot teams will verify their relation to the research.
Identifying related DMSP outputs within the DMP Tool will give data librarians and research/grant management offices insight into the outputs of research projects, academic departments, and the institution. Users can generate reports for compliance checks (for example, was the data shared according to the funder’s policy?), grant reporting, and research management activities.
With sufficient metadata, how do we find related DMSP outputs? We start by exploring open data sources like Crossref, DataCite, and COKI. For example, we query DataCite’s GraphQL API to extract metadata and compare it with DMP Tool projects. Each data source structures its metadata differently, so we first transform that metadata into a standardized format. An algorithm then compares each field in the records and scores the confidence level of any matches found. Matching grant IDs give high confidence, but such matches are currently rare. Confidence also improves when additional identifiers like ORCIDs, RORs, and repository IDs match.
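The normalize-then-score idea can be illustrated with a minimal Python sketch. This is not the DMP Tool's actual algorithm: the `Record` schema, the field names, the `WEIGHTS` values, and the sample identifiers are all hypothetical, chosen only to show how a grant ID match could dominate the score while other shared identifiers add smaller amounts of confidence.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Record:
    """Standardized view of a project or a candidate output (hypothetical schema)."""
    grant_ids: set[str] = field(default_factory=set)
    orcids: set[str] = field(default_factory=set)
    rors: set[str] = field(default_factory=set)
    repository_ids: set[str] = field(default_factory=set)


# Illustrative weights only: a grant ID match counts most, other identifiers add to it.
WEIGHTS = {"grant_ids": 0.6, "orcids": 0.2, "rors": 0.1, "repository_ids": 0.1}

# URL forms that identifiers commonly appear in; stripped so bare and URL forms compare equal.
URL_PREFIXES = ("https://orcid.org/", "http://orcid.org/", "https://ror.org/", "http://ror.org/")


def normalize_id(value: str) -> str:
    """Trim, lowercase, and drop a leading resolver URL from an identifier."""
    value = value.strip().lower()
    for prefix in URL_PREFIXES:
        if value.startswith(prefix):
            return value[len(prefix):]
    return value


def make_record(raw: dict[str, list[str]]) -> Record:
    """Transform source-specific metadata (already mapped to our field names) into a Record."""
    return Record(**{name: {normalize_id(v) for v in raw.get(name, [])} for name in WEIGHTS})


def score(project: Record, candidate: Record) -> float:
    """Sum the weights of every field where the two records share at least one identifier."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        if getattr(project, name) & getattr(candidate, name):
            total += weight
    return round(total, 2)


# Made-up identifiers, for illustration only.
project = make_record({
    "grant_ids": ["ABC-123"],
    "orcids": ["https://orcid.org/0000-0000-0000-0001"],
    "rors": ["https://ror.org/00example"],
})
candidate = make_record({"orcids": ["0000-0000-0000-0001"], "rors": ["00EXAMPLE"]})
print(score(project, candidate))
```

In this sketch the candidate shares an ORCID and a ROR with the project but no grant ID, so it receives only a modest score. A real implementation would also need per-source mapping code to produce the `raw` dictionaries, plus fuzzier comparisons for fields such as titles and dates.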
Some development challenges discussed at the workshop include:
- US funding agencies lack a standard way of sharing metadata via their APIs and rarely include grant IDs. Grant IDs are important but not yet reliable for identification purposes.
- Research/DMSP outputs associated with older projects frequently lack identifiers such as RORs and ORCIDs in their metadata records.
- How can we find datasets and software related to published research articles in systems like COKI? Can we use an article’s references to find these artifacts? What other hooks will allow us to identify these related outputs, and how could improved metadata and the usage of identifiers help facilitate making these connections?
We are exploring adding more data aggregators to combine findings and create a clearer picture of a research project and its outputs. We will also explore methods to identify related works from research article reference sections, like dataset or software references. We are experimenting with ML/AI techniques to determine if a research output might be related to a DMSP.
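As one possible starting point for mining reference sections (a sketch under stated assumptions, not the approach the pilot has settled on), the Python below pulls DOIs out of free-text reference strings and flags those whose prefix belongs to a data or software repository. The `REPOSITORY_PREFIXES` table is a small illustrative sample, the regular expression is a common heuristic rather than a complete DOI grammar, and the sample references are invented.

```python
from __future__ import annotations

import re

# Heuristic DOI matcher: "10.", a 4-9 digit registrant code, "/", then a non-space suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")

# Illustrative sample of DOI prefixes used by data/software repositories (not exhaustive).
REPOSITORY_PREFIXES = {"10.5281": "Zenodo", "10.5061": "Dryad", "10.6084": "figshare"}


def extract_dois(reference: str) -> list[str]:
    """Return lowercased DOIs found in one reference string, minus trailing punctuation."""
    return [match.rstrip(".,;)").lower() for match in DOI_PATTERN.findall(reference)]


def find_repository_dois(references: list[str]) -> list[tuple[str, str]]:
    """Return (doi, repository name) pairs for DOIs whose prefix is in the sample table."""
    hits = []
    for reference in references:
        for doi in extract_dois(reference):
            prefix = doi.split("/", 1)[0]
            if prefix in REPOSITORY_PREFIXES:
                hits.append((doi, REPOSITORY_PREFIXES[prefix]))
    return hits


# Invented references, for illustration only.
references = [
    "Doe, J. (2020). Example dataset. Zenodo. https://doi.org/10.5281/zenodo.0000000.",
    "Roe, R. (2019). An example article. Example Journal. doi:10.1234/abcd.5678",
]
print(find_repository_dois(references))
```

Prefix matching only catches outputs that were cited with a DOI from a known repository. References that name a dataset or software package without an identifier would need other techniques, which is one reason improved metadata and wider use of identifiers matter.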
Findings from the MAP Pilot will be published as reports and best practices for implementing maDMSP workflows at research institutions after the project ends in 2025. If you are interested in collaborating on this important developmental work, please contact muhlmansiek [at] arl [dot] org for more information.