(index page)

Additional preservation assurance with DPN

CDL is a founding member of the Digital Preservation Network (DPN), a coalition of over 50 academic libraries, foundations, and non-profit memory institutions dedicated to the long-term preservation of the scholarly and cultural record. UCLA and UCSD are also DPN members. DPN supports a high level of preservation assurance through widespread replication of digital assets across a geographically-dispersed network of five technically and administratively heterogeneous repositories. DPN membership agreements also incorporate language (a “quitclaim”) that ensures continuity of preservation management in the event a member organization cannot or chooses not to continue to exercise stewardship responsibility for material previously contributed to the network. As a benefit of membership, CDL has the opportunity to contribute up to 5 TB of content to DPN annually at no additional cost.

In late 2015 the UC Libraries Advisory Structure (UCLAS) Direction and Oversight Committee (DOC) formed a DPN allocation project team (DAPT) to investigate how UC members could best take advantage of this DPN capacity. The DAPT recommended that CDL’s 5 TB allotment be used as “a common resource for systemwide benefit.” CDL determined that the following collection groups, drawn from content managed in UC3’s Merritt repository, meet that criterion:

Dash – Open datasets from UCB, UCD, UCI, UCM, UCR, UCSC, UCSF, and UCOP: 140 datasets, 11,192 files, 266 GB
DataONE/ONEShare – Open datasets from outside UC: 242 datasets, 10,048 files, 12 GB
Digital Special Collections
- California census data: 30 datasets, 6,100 files, 4 GB
- LSTA collections – archival assets from 94 California public libraries, archives, historical societies, and other local memory institutions: 31,469 archival assets, 943,711 files, 935 GB
- Online Archive of California (OAC): 306,273 archival assets, 4.8 million files, 144 GB
- McGraw-Hill eBooks: 289 eBooks, 6,069 files, 7 GB
- eScholarship Editions: 1,833 publications, 197,790 files, 9 GB
- Mark Twain Editions: 2,946 publications, 38,508 files, 223 MB
eScholarship – UC open access publications: 138,562 publications, 6.7 million files, 698 GB
ETDs – Electronic theses and dissertations from UCB, UCI, UCLA, UCM, UCR, UCSB, UCSC, UCSD, and UCSF: 30,207 ETDs, 380,751 files, 225 GB
UC3
- iPRES 2009 Conference proceedings: 46 papers, 1,602 files, 14 GB
- UCLA modular digital courses: 1 course, 2,652 files, 234 MB
- WAS – Archived copy of UC3’s deprecated Web Archiving Service: 4 objects, 59 files, 376 GB

All told, over 519,000 digital resources, 13.6 million files, and 3.3 TB have been successfully transferred to DPN, which maintains three independent external replicas, hosted across the Academic Preservation Trust (APT), HathiTrust, Texas Digital Library (TDL), and UCSD, in addition to the replication internal to the Merritt repository at the San Diego Supercomputer Center (SDSC) and the Amazon AWS S3 and Glacier storage clouds. (As impressive as these numbers sound, the DPN subset constitutes only about 19% by number and 4% by size of the full Merritt corpus.)

Due to a flurry of deposits by DPN members at the end of 2017, submission processing took longer than expected, extending into February 2018. To avoid a similar rush this year, the deposit of the 2018 Merritt material will begin earlier, with planning starting in September.

Farewell CDL!

A little over two years ago, after an exhausting day of packing up our apartment in Brooklyn, I turned to my partner and said “Hey, remember when I said I wasn’t going to do a postdoc?”

This was a joke, intended to offset the anxiety we were both feeling about our impending move across the country. But, after deciding to not pursue the “traditional” academic path (graduate school → postdoctoral fellowships → faculty position) and shifting from working in cognitive neuroscience labs to working in academic libraries, I had long assumed that my window into the liminal space occupied by postdocs had closed. That is, until I learned about the CLIR Postdoctoral Fellowship Program and saw an opportunity to dive headfirst into the wider world of scholarly communications and open science with the UC3 team at California Digital Library.

Today is my last day in the office at CDL and so much has happened in the world and for me personally and professionally over the course of my fellowship that I’m not sure anything I could write here would ever do it all justice. I suppose I could assess my time at CDL in terms of the number of posters, papers, and presentations I helped put together. I could mention my involvement with groups like BITSS and RDA. I could add up all the hours I’ve spent talking on Skype and Zoom or all the words I’ve written (and rewritten) in Slack and Google Docs. But really the most meaningful metric of my time at CDL would be the number of new colleagues, collaborators, and friends I’ve gained as a postdoc. I came to CDL because I wanted to become part of the broad community of folks working on research data in academic libraries. And now, as I’m about to move into a new position as the Data Services Librarian at Lane Medical Library, I can say that has happened more than I would have thought possible.

Looking back on the last two years, there are about a million people I owe a heartfelt thanks. If you’re out there and you don’t get an email from me, it’s almost definitely because I wrote something, decided it was completely insufficient, wrote something else, decided that was completely insufficient, and then got completely overwhelmed by the number of drafts in my mailbox. But seriously, thanks to everyone on the UC3 team, at CDL and the UC libraries, and beyond for everything you’ve done for me and for everything you’ve helped me do.

Looking forward to what comes next, I have about a million ideas for new projects. Some are extensions of work I started during my fellowship while others are the product of the connections, insights, or interests I developed while at CDL. But, since this is my last blog post as a postdoc, I also want to devote some space to one last UC3 project update.

Support Your Data

If there is a common thread that ties together all of the work I’ve done at CDL it is that I really want to bridge the communication gap that exists between researchers and data librarians. The most explicit manifestation of this has been the Support Your Data project.

If you’ve missed all my blog posts, posters, and presentations on the topic, the goal of the Support Your Data project is to create tools for researchers to assess and advance their own data management practices. With an immense amount of help from the UC3 team, I drafted a rubric that describes activities related to data management and sharing in a framework that we hope is familiar and useful to researchers. Complementing this rubric, we also created a series of short guides that give actionable advice on topics such as data management planning, data organization and storage, documentation, and data publishing. Because we assumed that different research communities (e.g. researchers in different disciplines, researchers at different institutions) have different data-related needs and access to different data-related resources, all of these materials were designed with an eye towards easy customization.

A full rundown of the Support Your Data project will be given in a forthcoming project report. The short version is that, now that the majority of the content has been drafted, the next step is to work on design and adoption. We want researchers and librarians to use these tools so we want to make sure the final products don’t look like something I’ve been working on in a series of Google spreadsheets. Though I will no longer be leading the project, this work will continue at CDL. That said, I have a lot of ideas about using the Support Your Data materials as they currently exist as a jumping off point for future projects.

Data Management Practices in Neuroscience

I’m still surprised I convinced a library to let me do a neuroimaging project. I mean, I’m not that surprised, I can be pretty convincing when I start arguing that neuroimaging is a perfect case study for studying how researchers actually manage their data. But I think it says a lot about the UC3 team that they fully supported me as I dove deep into the literature describing fMRI data analysis workflows, charted the history of data sharing in cognitive neuroscience, and wrangled all manner of acronyms (ahem, BIDS, BIDS).

As I outlined in a previous blog post, the idea to survey neuroimaging researchers literally started with a tweet. But, before too long, it became a full-fledged collaborative research project. As a former imaging researcher, I am still marveling over the fact that my collaborator Ana Van Gulick (another neuroscientist turned research-data-in-libraries person) and I managed to collect data from over 140 participants so quickly. Our principal aim was to provide valuable insights to both the neuroimaging and data curation communities, but this project also gave us the opportunity to practice what we preach and apply open science practices to our own work. A paper describing the results of our survey of the data management practices of MRI researchers is currently going through the peer review process, but we’ve already published a preprint and made our materials and data openly available.

We definitely hope to continue working with the neuroimaging community, but we also plan to do follow-up surveys of other research communities. Given the growing emphasis on transparency and open science practices in the field, what do data management practices look like in psychology? We hope to find out soon!

Exploring Researcher Needs and Values Related to Software

One of the principal aims of my fellowship was to explore issues around software curation. Spoiler alert: Though the majority of my projects touched on the subject of research software in some way, I’m still not sure I’ve come up with a comprehensive definition of what “software curation” actually means in practice. Shoutout to my fellow software curation fellows who continue to bring their array of perspectives and high levels of expertise to this issue (and thanks for not rolling your eyes at the cognitive neuroscientist trying to understand how computers work).

Before I started at CDL I knew that I would be working with Yasmin AlNoamany, my counterpart at the UC Berkeley library, on a project involving research software. To extend previous work done by the UC3 team around issues related to data publishing, we eventually decided to survey researchers on how they use, share, and value their software tools. Our results, which we hope will help libraries and other research support groups shape their service offerings, are described in this preprint. We’ve also made our materials and data openly available.

There is still a lot of work to be done defining the problems and solutions of software curation. Though we currently don’t have plans to do any follow-up studies, we have another paper in the works describing the rest of our results and our survey will definitely inform how I plan to organize software-related training and outreach in the future. The UC3 team will also be continuing to work in this area, through their involvement with The Carpentries.

But wait, there’s more

Earlier this week, after another exhausting day of packing up our apartment outside of Berkeley, I remarked to my partner “Hey, remember when I thought I’d never get a job at Stanford?”

This is a joke too. We’re not moving across the country this time, but the move feels just as significant. Two years ago I was sad to leave New York, but ultimately decided I needed to take a step forward in my career. Now, as I’m about to take another step, I’m very sad to leave CDL. I’m very excited about what comes next, of course. But I will always be grateful to CLIR and the UC3 team for giving me the opportunity to learn so much and connect with so many amazing friends, collaborators, and colleagues.

Thanks everyone!

How To Link Dash Data Publications With Your ORCiD Profile

Dash, the data publishing platform, is integrated with ORCiD, an author disambiguation service, in a couple of different ways: you can log in to Dash with ORCiD, and you (and co-authors) can display your ORCiD on dataset landing pages.

But, let’s take a step back.

What is an ORCiD? It is a unique identifier for you as a researcher. Increasingly funders, publishers, and institutions are requiring ORCiD iDs as a way to identify researchers and track research output. If you’re submitting articles (to journals), or other research output like data and code, take a minute to get yourself an ORCiD and connect all of your research output!
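For the curious, here is a minimal sketch of what the identifier itself looks like. An ORCiD iD is 16 characters in four hyphenated groups, and the last character is a check digit computed with the ISO 7064 MOD 11-2 algorithm. The function names below are hypothetical helpers written for this illustration; they are not part of Dash or of any ORCiD library. The sample iD is the one ORCiD uses for its fictional test researcher.

```python
import re

# Shape of an ORCiD iD: four groups of four, the last character may be "X".
ORCID_PATTERN = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Return True if the string has the ORCiD shape and a matching check digit."""
    if not ORCID_PATTERN.fullmatch(orcid):
        return False
    compact = orcid.replace("-", "")
    return orcid_check_digit(compact[:15]) == compact[15]


print(is_valid_orcid("0000-0002-1825-0097"))  # sample iD for a fictional researcher
print(is_valid_orcid("0000-0002-1825-0098"))  # same iD with a wrong check digit
```

A check like this only catches typos in the identifier. It does not confirm that the iD is registered or that it belongs to a particular person.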

What is the benefit?

ORCiD is an identifier you can use for article and data publishing workflows, and it is also a public profile of your research work. It is a great way to display and track all of your research work.

How does this all relate to Dash?

As mentioned above, Dash is integrated with ORCiD for login and credit purposes. But for ORCiD to properly display datasets submitted to Dash, you need to get a DataCite profile.

A quick step to ensure your data publications automatically appear on your ORCiD profile

DataCite mints the Dash DOI (which can be used for access and citation of your dataset), and by following this simple DataCite guide you can grant permissions for DataCite to send your dataset DOI information back to ORCiD. After you have adjusted your permissions to allow for this, datasets you submit anywhere that utilizes DataCite (including other data repositories) will begin displaying on your ORCiD profile, just like an article.

Dash Releases First Submission REST API

Over the last year the Dash team has spent time surveying the community on incentives for and ways to drive adoption of data publishing practices. Many of the barriers we heard about concern how hard it is to submit data and the fact that data publishing sits outside of status quo research workflows. To help with this, the Dash team has implemented our first Submission REST API. Our hope is that this is the first step towards opening up integration opportunities with electronic lab notebooks and publishers, and towards allowing research data to be submitted from analytical environments. The first release of our API allows a user to publish a new dataset or version an existing dataset with metadata and receive a citable DOI. Because the API supports versioning, users are now able to update data dynamically.

To get started, check out our technical “How-To” guide. If you have any questions, feedback or would like to discuss integrations, please get in touch at uc3@ucop.edu.

From Networking to Curation: Summing Up the 2018 Data Curation Unconference

Authors: Vessela Ensberg (UC Davis), Jeanine Finn (Claremont Colleges), Greg Janée (UC Santa Barbara/California Digital Library), Amy Neeser (UC Berkeley), Scott Peterson (UC Berkeley)

The data curation unconference in 2018 took place prior to the UCDLFx meeting at UC Riverside. Nearly 30 attendees representing 11 institutions signed up to attend the event. In the course of three hours we completed two sessions and discussed seven problem statements. Topics ranged from building relationships to exchange of hands-on project experience. Below we highlight shared valuable experience and steps for moving forward in addressing these challenges.

How do you form long-term relationships with units outside the library? What do you propose when you initiate the conversation? What kind of involvement do you have in their work and vice versa?

The key to establishing these relationships is to identify units with a similar research support mission. These relationships tend to be between the library and IT units, Grant Support or the Office of Research, educational technology centers, digital learning centers, campus learning groups, student makerspace groups or library student advisory groups. Some examples of collaborators are BIDS and DLab at UC Berkeley and CRESP at UC Riverside.

Meeting with these groups can identify any gaps in service. Collaboration between departments can also create a pool of consultants, wherein library staff will contribute their time and expertise to a growing pool of experts that can be used for consultations with various researchers. Another area for collaboration is co-teaching in different instruction areas. Finally, there are the opportunities for joint events. UC Berkeley holds consulting summits twice a year for all consultants from different departments to get together and talk in a semi-organized format that sometimes resembles an unconference, but can also be more focused with reports given from various working groups. Examples of presented projects could be implementing Docker in an instructional setting or how to acquire datasets and make them available.

Another approach to strengthening relationships with external units is to invest jointly in resources or positions. Due to the complexities associated with such collaborations, the parties may want to start with a proof of concept or a pilot project for a tool, or by hiring a limited-term position such as a CLIR Fellow.

In all approaches, it is important to communicate the goals, and how the outcomes will be assessed, clearly to everyone involved. In other words, the group agrees on what success looks like. A steering committee of stakeholders can be useful to identify the correct goals. These agreements should be documented, sometimes as an MOU. It is important to keep in mind that sustainability is key, and relationships require maintenance.

How do you work with non-traditional data for archiving? e.g., relational databases and digital humanities products?

There are three types of challenges in archiving non-traditional products: communication, resources, and the need for workflows for novel digital research output. In addition to requiring the usual technological and human resources investment to process the materials, there are accommodations to be made for usability. For example, users expect to be able to stream video or to be able to search a database, while the process of archiving the database requires flattening it. The challenge is only going to grow as more faculty are engaged in Digital Humanities, which poses the question of how their research output is going to be archived once they retire.

There are some solutions available to UC users. For instance, eScholarship now allows video streaming. Still, we need to meet the need for storage of and access to the dissemination information package, which is actively used and retains many relevant functionalities, in addition to the archival information package, which is rarely accessed. Ideally, we will see further integration of existing platforms (Dash and eScholarship) to enhance discovery. Perhaps a UC network can provide the infrastructure necessary to take the archival service a step further.

Have you assisted faculty by doing hands-on work with their data?

One paradigm for how libraries can interact with researchers in the area of data curation is the library providing consultation (only), e.g., giving advice on data management plans and repository selection. Another paradigm is the library acquiring (and assuming ownership of) the researcher’s data, and turning it into a library collection. Is there a middle ground? Are there paradigms in which the library plays a more active role in the handling and processing of researcher data?

The consensus that emerged out of a discussion of our collective experiences is that libraries generally do not perform hands-on work with researcher data. To the extent that librarians have worked with data, the focus has been strictly on pre-ingest, higher-level review. There are a variety of reasons for this, including limited resources and sustainability—no surprises there—but also the degree of faculty interest and the need for sufficient discipline knowledge. Some institutions (University of Michigan was noted) have policies that explicitly state (and limit) the degree of librarian involvement in faculty research.

It is in the area of metadata that libraries have played a much more hands-on role. There is ample precedent for librarians assisting with metadata preparation and review, particularly for the metadata backing dataset landing pages, since it is the metadata that directly supports library discovery services.

Ultimately, it was agreed that “success looks like researchers having the skills to curate their own data.”

How do we make publicly available data discoverable, and/or integrate into websites, help people find them?

There are two potential users who need to discover data: the casual user who is browsing and the researcher looking for deep datasets. Since users don’t think of the library catalog when they are looking for data, we need to utilize other means of communicating dataset locations. For example, research is often done on the causes and consequences of current events. With that need in mind, UC Davis’ Michele Tobias wrote a blog post about generating a set of maps for the boundaries of American Viticultural Areas. The blog was discovered by a journalist, who followed up on the dataset. Even though the dataset was not chosen to be featured in a later publication, the episode demonstrates how writing a story that connects events with datasets helps researchers discover them. In a similar vein, clearly linking datasets with articles will assist their discovery for research.

It is also important to keep in mind how our patrons search. Since many users start with Google, it is important that dataset metadata is discoverable that way, for example through schema.org for datasets or DataCite.
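To make the schema.org option concrete, here is a minimal sketch of the kind of JSON-LD record a dataset landing page can embed so that search engines can read its metadata. The property names (name, description, identifier, creator, license) are standard schema.org Dataset properties. The helper functions, the dataset values, and the DOI (which uses a placeholder prefix) are all assumptions made up for this illustration and do not describe how any particular repository implements it.

```python
import json


def dataset_jsonld(name, description, doi, creators, license_url):
    """Build a schema.org Dataset record as a plain dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "identifier": f"https://doi.org/{doi}",
        "creator": [{"@type": "Person", "name": c} for c in creators],
        "license": license_url,
    }


def as_script_tag(record):
    """Wrap the record in the script element that a landing page would embed."""
    body = json.dumps(record, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Hypothetical dataset: every value below is a placeholder.
example = dataset_jsonld(
    name="Example boundary maps",
    description="Illustrative record for a dataset landing page.",
    doi="10.5555/example",
    creators=["A. Researcher"],
    license_url="https://creativecommons.org/licenses/by/4.0/",
)
print(as_script_tag(example))
```

The useful part is less the code than the habit: if the landing page carries a machine-readable description, a patron who starts with a web search has a better chance of ending up at the dataset.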

How do we better educate and advocate for data curation services with researchers?

The first step for successful advocacy is building relationships, following the discussion outlined in the beginning of this blog. In addition to the Office of Research and IT, Management Service Offices (MSO) were identified as important partners assisting with outreach. We should seek to make it easier for our partners to communicate with their network about our services, and provide them with a clear and eloquent message about what we offer. The message used for outreach needs to emphasize the free-of-charge services that set libraries apart and frame the library as a resource for data services. The services are designed to make the researchers’ work easier.

Successfully assisting researchers provides word-of-mouth marketing that is very powerful. Examples of success stories of assisting researchers and raising the data services profile among them were the file transfer service at UCSD, providing training in data skills and using that training to promote services, and exploring the consequences of the claim that research data belongs to the UC Regents. A personal experience that leads a researcher to vouch for the library makes for a persuasive and impactful message.

What would a UC expertise consortium for RDM/curation look like?

We combined a long-term vision and a pragmatic approach in tackling this question. The Data Curation Network provided an inspiring example of how subject-specific expertise can be exchanged across partnering institutions. After Lisa Johnson’s presentation for the CKG Deep Dive in September 2017, this collaborative model was probably on the minds of many participants. To move towards this long-term vision, we proposed two actions: an exchange of educational materials and a catalog of expertise.

For the educational materials, including curriculum and training, we can use the CKG Google Drive folder. To use those effectively, we will also need a catalog and shared definitions. We can also share workflows in the same manner. Similarly, we wanted to have a catalog of skills relevant to working with research data and find out who in the Library possesses them. After some discussion we settled on undertaking this step within the members of the CKG. We will develop a survey instrument and apply for IRB approval to distribute it to CKG members and ask them to describe their skill set and the skill set of their team. Going forward, we envision presenting a larger project proposal to DOC.

How do we engage students and involve them?

There is no simple answer here. Participants discussed a variety of experiences that have met with some success in developing student engagement with data and data curation practices. The key theme was that it is necessary for library data services advocates to *both* welcome students into library spaces to take advantage of services and go out to meet the students where they are. Students, graduates and undergraduates alike, are producing and publishing their research in greater numbers. We need to find ways to engage with them in the venues where they are already working and sharing their research products. For example, we discussed attending research poster presentations and asking authors about their funding plans as a way of introducing the library’s services for data management planning. Additionally, library programs that are not at first glance “data centric” can provide a gateway to a larger conversation about data management and preservation. Examples of these activities included workshops for establishing a scholarly identity/setting up an ORCID, developing OA educational materials, and learning how to use citation manager software. Connections with student organizations (like the graduate student organization) can also be beneficial in maintaining connections as students graduate.

New hire: Library Carpentry Community & Development Director

We’re excited to announce that Chris Erdmann has been hired as the Library Carpentry Community and Development Director starting May 4, 2018.

Chris has been working in the libraries for more than 21 years to integrate data management and workflows in database and library systems and has been working with research and library communities through training, consulting and tool development to build programs and empower people to work effectively with data. Chris received his MLIS at the University of Washington iSchool while working at the University’s Technology Transfer Office where he helped automate workflows and develop the unit’s web presence and analytics. He spent roughly ten years working alongside astronomers at the European Southern Observatory (ESO) and Harvard-Smithsonian Center for Astrophysics advancing library data mining and linking services, e.g. ESO Telescope Bibliography.

Also during this time, he led an experimental training series called Data Scientist Training for Librarians geared towards teaching librarians data savvy skills to help transform their library services to meet the needs of their research communities, and he recently joined the Library Carpentry governance group. He’s a co-author with Matt Burton, Liz Lyon, and Bonnie Tijerina on the recent report Shifting to Data Savvy: The Future of Data Science In Libraries, where Library Carpentry and The Carpentries are highlighted as a necessary next step for libraries to advance their research services.

Chris will be working with the Library Carpentry community and The Carpentries to start mapping out the infrastructure for growing the community, formalizing lesson development processes, expanding its pool of instructors, and inspiring more instructor trainers to meet the demand for Library Carpentry workshops around the globe and reach new regions and communities.

While this new position is hosted by the University of California Curation Center (UC3), the digital curation program of the California Digital Library (CDL), it is intended to support the work of the Library Carpentry governance committee on streamlining operations within The Carpentries. The position is funded by IMLS and focused on determining standard curriculum, growing instructor training for librarians and planning for community events like the upcoming Mozilla Sprint on Library Carpentry materials.

We are excited to have Chris on board! Feel free to reach out via Twitter (@libcce), GitHub or LinkedIn.

For more information on Library Carpentry: https://librarycarpentry.github.io & @libcarpentry