Library Carpentry Receives Supplemental IMLS Grant
We are happy to report that IMLS has awarded CDL supplemental funding to support the ongoing work of Library Carpentry and the data and software training it provides to library- and information-related roles. This supplemental funding will provide continued support for workshops and instructor training, and will create a membership scholarship program to reach new library communities and consortia. The funding will also support Library Carpentry’s current goals: expanding the pool of Carpentries trainers and instructors from library- and information-related roles, and completing and formalizing the curriculum and lessons currently being developed by community members. CDL, The Carpentries, and the Library Carpentry Advisory Group are currently planning outreach to various library networks to see how we can work together to provide data and software training to their communities. Members of these groups will be reaching out in the coming months. Also, this month (September 2019), The Carpentries will launch a new workshop request form that will accommodate library-driven and related workshops.
About CDL
CDL was founded by the University of California in 1997 to take advantage of emerging technologies that were transforming the way digital information was being published and accessed. Since then, in collaboration with the ten UC campus libraries and other partners, CDL has assembled one of the world’s largest digital research libraries and changed the ways that faculty, students, and researchers discover and access information. We facilitate the licensing of online materials and develop shared services used throughout the UC system. Building on the foundations of the Melvyl Catalog, CDL has developed one of the largest online library catalogs in the United States and works in partnership with the UC campuses to bring the treasures of California’s libraries, museums, and cultural heritage organizations to the world. We continue to explore how services such as digital curation, scholarly publishing, archiving and preservation support research throughout the information lifecycle.
About The Carpentries
The Carpentries builds global capacity in essential data and computational skills for conducting efficient, open, and reproducible research. We train and foster an active, inclusive, diverse community of learners and instructors that promotes and models the importance of software and data in research. We collaboratively develop openly available lessons and deliver these lessons using evidence-based teaching practices. We focus on people conducting and supporting research.
UC Data Network: Lessons Learned
Scholars at the University of California need effective solutions to preserve their research data. This is essential for complying with funder mandates, publication requirements, policies, and evolving norms of scholarly best practice. However, several cost barriers have impeded consistent, comprehensive preservation of UC research data. In an attempt to tackle some of these challenges, California Digital Library (CDL) brought together campus Vice Chancellors of Research (VCRs), Chief Information Officers (CIOs)/Research IT, and University Librarians (ULs) from across the UC system to explore the creation of a UC Data Network (UCDN) as a distributed storage solution.
For the past 18 months, CDL has led an exploratory pilot preservation project to establish UCDN with three campuses. We have now decided to conclude this pilot and want to take this opportunity to reflect on our successes and challenges in tackling such an ambitious scope of work. There are many lessons learned. We offer this post as a way of capturing some of the main findings and takeaways of the UCDN activities.
UCDN pilot project
Campuses routinely grapple with how to offer long-term preservation for the research data our researchers create. The goal of the UCDN project was to chip away at one consistent hurdle: the recurring data storage costs associated with long-term digital preservation. In early 2018, we brought together VCRs, CIOs, and ULs across the UC system to explore pilot ideas for tackling this hurdle. From those consultations we crafted a pilot project: pilot campuses would make upfront capital investments in storage, and CDL would plug that storage into our Merritt preservation repository. This storage, via the preservation repository, would then be used by UC’s Dash data publishing platform. In essence, the pilot entailed moving the costs of preserving published datasets from a recurring individual campus expense to a shared UC-wide investment.
What we learned
After nearly 18 months, we have decided to conclude the UCDN pilot. We have learned several lessons that can help guide where we go next.
Lesson #1. We need to make preservation a more compelling story for users. It was difficult to demonstrate UCDN’s value to researchers. We were piloting a service focused on the back-end storage costs of back-end preservation services. This was not an easy story to tell, and quite often our message to campuses and researchers got lost in describing this relationship.
Lesson #2. Project ownership is key. We knew that buy-in from multiple departments was key to the success of UCDN. Campus IT teams, libraries, and research offices all needed to own this effort, and we were successful in getting traction at the beginning. However, as time progressed and storage provisioning became the sole immediate task, the project lost broad ownership. While commitment remained high, we were not able to find specific champions to ensure the pilot remained a top priority.
Lesson #3. Smaller scale ≠ smaller scope. We started the project knowing that multiple campuses provisioning and maintaining storage for a pilot might be risky. To help mitigate this, we started with a set of 3-4 campuses. This smaller set of campuses, however, did not reduce the overall complexity of the project and we quickly saw that reducing the scale of the pilot did not reduce the scope of the effort: instead of working on a small pilot, we ended up trying to achieve a full solution at fewer places.
Lesson #4. Systemwide efforts are not necessarily (or uniquely) efficient. Our original premise was that a systemwide effort at data preservation would be the most efficient approach. However, as the pilot progressed, we realized that the wider academic community beyond UC was also grappling with similar cost issues. Pilot team members realized that appropriate economies of scale should actually come from collaborations beyond the UC system.
Lesson #5. We need to keep our eyes on the prize. Our original goal was to remove the cost barriers to data preservation. The UCDN pilot team remained focused on this as our goal and the pilot experience gave us the space to brainstorm alternative approaches to tackling this issue. This consistent focus on our ultimate goal eventually led to the partnership CDL forged with Dryad (described further below).
What’s next
While we have decided not to continue the UCDN pilot, we are now in a position to leverage our lessons learned and achieve the original goals of the UCDN effort by focusing time and resources on our new Dryad partnership.
CDL is now putting the finishing touches on the rollout of the Dryad data publishing service across all UC campuses. Dryad is a trusted name in the researcher community and, with this new arrangement, it will be a space where UC researchers can publish their datasets in a repository with consistent preservation policies at no cost to the researcher, department, or campus. This means that UC will be able to simultaneously drive adoption of data publishing and long-term stewardship in one space, without the hurdles associated with recurring storage costs. And with this, we will have met the original goals of the UCDN project.
csv,conf,v4: call for proposals
Although a ubiquitous term, the acronym CSV means different things depending on who you ask. In the data space, CSV usually translates to comma-separated values – a machine-readable format for storing tabular data in plain text. To many, the format represents simplicity, interoperability, compactness, and hackability, among other things.
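For readers new to the format, here is a minimal example (the file name and records are, of course, made up): a CSV file is just plain text, with one record per line and fields separated by commas.

```bash
# A complete CSV file in three lines: a header row plus two records.
cat > talks.csv <<'EOF'
speaker,topic,year
Ada,data packaging,2019
Grace,reproducible pipelines,2019
EOF
```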
Since it first launched in July 2014 as a conference for data makers everywhere, csv,conf has adopted the comma-separated values format metaphorically in its branding. Needless to say, as a data conference that brings together people from different disciplines and domains, conversations and anecdotes shared at csv,conf are not limited to the CSV file format.
On May 8-9, 2019, the fourth edition of csv,conf will take place at the Eliot Center in Portland, Oregon. Over two days, attendees will have the opportunity to hear about ongoing work, share skills, exchange ideas (and stickers!), and kickstart collaborations. You are welcome to submit session proposals for our 25-minute talk slots between now and end of day, February 9, 2019.

The commallama has now become a big and fun part of csv,conf. How did we settle on this llama? What is its significance? Is it even a llama? We hear your questions, and implore you to join us in Portland on May 8 and 9 to meet the commallama and find out!
We are keen on getting as many people as possible to csv,conf,v4, and will award travel grants to subsidize travel and associated costs for interested parties who lack the resources and support to get to Portland. To that end, we have set up our honor-system conference ticketing page on Eventbrite. We encourage you to get your conference tickets as soon as possible, keeping in mind that, as a non-profit, community-run conference, proceeds from ticket sales will help cover our catering and venue costs in addition to offering travel support for speakers and attendees where needed.
Across its first three conferences over the last four years, csv,conf has brought together over 500 participants from over 30 countries, and 300+ talks spanning over 180 hours have been presented, packaged, and shared on our YouTube channel. Many post-conference narratives and think pieces, as well as interdisciplinary collaborations, have also surfaced from previous conferences. This is only part of the story; we can’t wait to see and hear from you in Portland in May, and are excited for all that awaits!
The UC3 team is part of the conference committee and happy to answer any questions you may have. Feel free to reach out to us at uc3@ucop.edu or to the full committee at csv-conf-coord@googlegroups.com.
csv,conf,v4: https://csvconf.com
Community-Owned Data Publishing Infrastructure
As a library community, we continue to struggle to find scalable approaches to offering open, shared, sustainable scholarly infrastructure. This is especially true in the data publishing and research data management space where institution-focused approaches to capturing and curating data may be hindering our ability to grow adoption by our researchers.
To alleviate this impasse and jumpstart a new community-led approach, California Digital Library is formally partnering with Dryad to build a globally-accessible, transparent, and low-cost data publishing and curation service. The goal of this partnership is to completely reimagine the potential for Dryad as an open, free community hub for collecting and curating data for researchers. It is not intended to compete with existing institution-based services, but to complement and amplify each campus’s efforts.
We hope that we can start a global discussion with institutions worldwide on better ways to support institutions and researchers in the face of rapid commercialization of the research data space. We cannot do this alone. For our collective action to effectively leverage institutional knowledge and serve researchers as end users, we need a diverse group of institutions to participate in defining the goals and values of this activity.
What does this look like?
We are putting the finishing touches on the migration of the Dryad service onto CDL’s technical platform. Dryad is a trusted name in the researcher community and, with this technical shift, it will be a space where institutional members will have transparent reporting features and the ability to join a global data curation community. Dryad will also be positioned to enhance technical integrations (via API) with publishing partners to seamlessly capture data publications at the time of article publishing. This means that we will be able to simultaneously drive adoption of data publishing and offer digital curation and stewardship in one space.
CDL Awarded IMLS Grant for Community-Owned Data Publishing Infrastructure
Supporting shared scholarly infrastructure must be done by the community and for the community. To help jumpstart this process, California Digital Library and Dryad will be facilitating several one-on-one discussions and community workshops in the coming months to determine the features and services most needed in our community.
Our first community workshop will be held in December after CNI in Washington, DC. With funding from an IMLS National Infrastructure grant, we will host a facilitated discussion on institutional values, needs, and potential community-based business models that meet our collective goals, support our researchers, and create a sustainable, attractive new Dryad service offering. Our goal is to chart a path forward for this movement and gain concrete institutional commitments to joining the Dryad community.
How do you get involved?
Please read the latest blog post from Melissanne Scheld, Dryad’s Executive Director, about the next steps for Dryad.
Institutions: If a member of senior leadership would be interested in participating in our one-day workshop on December 12, 2018, please contact the UC Curation Center (UC3) at CDL for more information.
Don’t Worry: There will be additional workshops planned in the US and abroad. We will keep you posted on future opportunities to get involved in this important initiative. Please contact the UC Curation Center (UC3) at CDL for more information.
This post is cross-posted at CDLinfo: https://www.cdlib.org/cdlinfo/2018/10/24/community-owned-data-publishing-infrastructure/
Tackling the storage costs of digital preservation
Over the past year, California Digital Library (CDL) has facilitated a discussion between UC campus Vice Chancellors of Research (VCRs), Chief Information Officers (CIOs), and University Librarians (ULs) to explore pilot ideas for breaking down the high data storage costs associated with digital preservation. Our goal is to work in small, incremental ways towards building a sustainable and reliable network of storage nodes for sharing and preserving research data that does not rely on uncertain funding sources. In addition, we are looking to find ways for campuses to retain copies of their datasets in a financially responsible manner. This exploration is codenamed UCDN (UC Data Network).
UCDN: capital investment into campus storage
During our consultation period, the most popular UCDN pilot idea to materialize was that campuses could break this logjam by making upfront capital investments in storage. The hypothesis: if pre-established storage nodes can be leveraged for research data preservation, this could remove or reduce the need for recurring charges to cash-strapped departments and offer a way for each campus to retain copies of its outputs.
This idea gained traction and, over this past summer, three campuses volunteered to participate in a pilot to explore this idea further: UCSF, UC Irvine, and UC Riverside.
New pilot projects at UCSF, UC Irvine, and UC Riverside
Starting in the fall of 2018, IT teams at UCSF, UC Irvine, and UC Riverside are provisioning storage nodes to support this pilot project. As each campus gets its storage online, campus teams that span research offices, IT teams, and libraries will set up new procedures and/or re-evaluate existing procedures regarding the preservation of research data. They will then use this as an opportunity to re-engage with research projects that benefit from this new investment.
In addition to these campus-based collaborations, CDL will also connect to each new storage node as back-end storage components of our Merritt preservation repository. This will allow us to automatically leverage the new campus investment any time a researcher from one of the pilot campuses uses the Dash data publishing platform for publishing their research data. In addition, any researcher from one of these three campuses who is interested in working on other research data preservation projects can contact their local campus teams or UC3 for more information on utilizing this storage.
How can you get involved?
UCDN is a unique approach to back-end storage and preservation for research data. This new pilot is meant to help with streamlining campus administrative processes and establish more logical resource sharing. Through this, we hope it will also allow for more consistent processes for research data preservation to emerge.
How can you leverage this back-end system in your research projects? If you are a researcher or research team lead at UCSF, UCI, or UCR, you can utilize these new resources by continuing to use (or beginning to use):
- Dash manual submissions: your campus offers the Dash data publishing platform for sharing datasets. Any researcher from UCSF, UCI, or UCR can sign in at any time and submit a dataset to be published. All deposits are assigned a DataCite DOI to streamline citation and simplify the process of connecting your datasets to journal articles during the publishing process. We will leverage the new storage nodes for all manual deposits from these three campuses. You can learn more here:
- UCSF dash: https://datashare.ucsf.edu/
- UC Irvine dash: http://dash.lib.uci.edu/
- UC Riverside dash: https://dash.ucr.edu/
- Dash by API: Dash offers a sophisticated API for submitting datasets directly from other environments (e.g., electronic lab notebooks, code repositories, web scripts). We will leverage the new storage nodes for all API deposits from these three campuses. You can learn more about how to integrate Dash (and digital preservation) by visiting our tech documentation below; a brief illustrative sketch also follows after this list:
- Technical How-To guide: https://github.com/CDLUC3/stash/blob/master/stash_api/basic_submission.md
- Swagger API documentation: https://dash.ucop.edu/api/docs/index.html
- Additional Projects: Researchers are routinely looking for digital preservation options for their research projects. When none is available, the result can be orphaned datasets (those left on a hard drive, Box, or Drive) or orphaned data projects (those left on old lab webpages or old research collaboration pages). We can help move those datasets into Dash for long-term preservation (leveraging the new storage nodes). If you know of data in need of a long-term home, please contact UC3 or your campus data curation team.
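To make the API route more concrete, here is a minimal, hypothetical sketch of a scripted deposit. The endpoint paths, metadata fields, and credentials below are illustrative assumptions, not the authoritative interface; the how-to guide and Swagger documentation linked above remain the definitive references.

```bash
#!/usr/bin/env bash
# Hypothetical Dash API deposit sketch -- endpoints and field names are
# illustrative; consult the how-to guide and Swagger docs for the real API.

# 1. Obtain an OAuth2 bearer token using client credentials issued by UC3
#    (CLIENT_ID and CLIENT_SECRET are placeholders).
TOKEN=$(curl -s -X POST "https://dash.ucop.edu/oauth/token" \
  -d "client_id=${CLIENT_ID}" \
  -d "client_secret=${CLIENT_SECRET}" \
  -d "grant_type=client_credentials" | jq -r '.access_token')

# 2. Create a dataset record with minimal metadata; the response would
#    include the DataCite DOI assigned to the deposit.
curl -s -X POST "https://dash.ucop.edu/api/datasets" \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
        "title": "Example eDNA analysis outputs",
        "authors": [{"firstName": "Ada", "lastName": "Lovelace",
                     "affiliation": "UC Riverside"}],
        "abstract": "Dataset submitted from an electronic lab notebook."
      }'
```

Because the deposit is just an authenticated HTTP call, it can be triggered from a notebook, a build script, or a scheduled job, which is what makes the API route attractive for automated workflows.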
A note about Dash: As we announced in May of 2018, CDL formally partnered with Dryad. We are in the process of migrating Dryad onto the Dash platform, at which point Dash will be rebranded as Dryad. This does not change the UCDN storage node pilot. The new Dryad service will continue to utilize localized shared storage for researchers at UCSF, UC Irvine, and UC Riverside.
Tackling the storage costs of preservation
By relying upon upfront capital investment of storage from IT teams rather than direct campus recharge to individual departments or libraries, we hope to remove common administrative and financial barriers to wider campus adoption of research data preservation. At the same time, we hope this will enable a simple way for campuses to retain copies of their research outputs in a financially sustainable way.
- Researchers from UCSF, UCR, UCI: To learn more about how you can leverage these resources during the pilot, please contact UC3 or your campus research office, IT departments, or libraries.
- Researchers from other UC campuses: While your campus is not piloting this approach to back-end storage, there are other preservation projects/services you can leverage. Please contact UC3 or your campus research office, IT departments, or libraries for details/ideas.
Lessons from Dat in the Lab: Webinar
We have received several inquiries about the status of our Dat-in-the-Lab project. To share our project outputs, we held a webinar on Friday, October 19, 2018. We spent the webinar showcasing our work and opening up a dialogue with the community on next steps.
Please learn more about our project and lessons learned by watching the recording of our webinar.
Lessons from Dat in the Lab – Agenda
Friday October 19th, 2018
8:00 am San Francisco / 11:00 am New York / 4:00 pm London / 8:30 pm Delhi
- Introduction and overview of the ‘Dat in the Lab’ project
- Anacapa: Archiving and sharing analysis pipelines with Singularity and Dat
- Discussion on containerized workflows and sharing
- Questions and discussion
- What’s next?
How to watch the webinar
Webinar: Dat in the Lab
Time: Oct 19, 2018 8:00 AM Pacific Time (US and Canada)
Recording of our webinar
dat-in-the-lab: Announcing The Dat Anacapa Container
Today we are releasing the Anacapa Container, which enables reproducibility of research environments and data across campuses.
If you’ve been following our work over the last year, you’ll be aware of the Dat in the Lab project, funded by the Gordon and Betty Moore Foundation (read our previous writeups on a lab visit, eDNA, and containerization challenges). As this project comes to a close, we are excited to release this final piece of work. A final project wrap-up will be released later this fall.
The Anacapa Container project has been a collaboration between the Code for Science & Society team and researchers at five different University of California campuses: UCLA, UC Merced, UC Davis, UC Riverside and UC Santa Cruz. Our goal was to take the Anacapa pipeline from UCLA and use a combination of Dat plus containerization technologies to replicate the pipeline across the various University of California research cluster environments.
The Anacapa pipeline itself is a collection of software written in Bash, Python, R and Perl that takes eDNA sequences and performs computationally expensive and complex analysis on the data to do things such as detect which species were in the sample. Anacapa is the core analysis tool for the CALeDNA consortium, and there are a number of collaborating institutions within California that wish to use the pipeline. Additionally, there are now a growing number of research groups world-wide who are interested in re-using the Anacapa pipeline for their own local eDNA research.
Problem: Complex Software Installation
One of the most challenging parts of using any complex scientific pipeline is installing all of the necessary software dependencies to run it. This may not seem challenging at first, but scientific software is usually poorly documented and rarely tested on research servers beyond the ones at the originating institution. A growing number of researchers are now using modern software development practices, such as writing user-friendly documentation and putting their projects on GitHub, but it can still take weeks or months of effort to replicate the dependencies from the originating institution in a new software environment.
In our case, the CALeDNA consortium includes members from 6 universities. This means that a researcher at UC Merced who wants to run Anacapa would have to request that the UC Merced research cluster install a long list of specific versions of the R, Python, Perl, and shell bioinformatics utilities that UCLA’s Hoffman cluster provides. The UCLA-based authors of Anacapa may never have had to make such requests for certain software packages, as many of them may already have been installed at the request of other researchers who use the Hoffman cluster. UC Merced has a different, independently maintained research cluster, which means a completely different set of pre-installed packages and a different Linux distribution. All of this results in a lot of back and forth between researchers and research cluster administrators at both campuses to debug the many differences that pop up when trying to replicate an exact environment composed of dozens of independent software packages.
When we started working on this project, one particular researcher had already spent two months working on getting the necessary packages installed locally, but had not yet finished them all. We realized we needed a way to simplify the installation of the Anacapa environment so that every new research group could avoid months of setup work.
Anacapa Container
We decided to use the Singularity containerization software as the main dependency for the Anacapa Container. Singularity is an open-source container system developed by folks at Lawrence Berkeley National Laboratory. It has a security model that works well for university compute cluster users. While looking into other options, we learned that the approach taken by the popular container software Docker requires sudo access, which university compute clusters cannot grant to individual users, so most universities are unable to offer Docker to researchers. Singularity, on the other hand, uses a different technique to load the containerized environment without requiring sudo privileges. Docker is much more popular in the general tech industry, but this issue eliminated it as an option for us. Singularity is a young project that is developing quite rapidly, and it has worked well for this application.
The Anacapa Container itself is a Singularity image that we developed to include all of the software dependencies needed to run the Anacapa Toolkit from UCLA. We have a script called a Containerfile that installs each software package, step by step, into an Ubuntu Linux server operating system disk image file. At the end of the process, a single 2GB disk image file can be distributed. Instead of requiring that the numerous dependencies be installed onto new systems, the only dependency is the Singularity runtime. This simplifies the request a researcher has to make to their system administrator.
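As a rough illustration of the approach, a miniature recipe and build step might look like the sketch below. This is not the project’s actual Containerfile (which installs dozens of bioinformatics packages); the package list here is a placeholder.

```bash
# Hypothetical miniature of the Containerfile approach; the real recipe
# in the project repository installs many more packages.
cat > Containerfile <<'EOF'
Bootstrap: docker
From: ubuntu:16.04

%post
    apt-get update
    apt-get install -y python r-base perl wget
EOF

# Build the single distributable image file. Root privileges are needed
# at build time only, not at run time.
sudo singularity build anacapa.img Containerfile
```

The resulting anacapa.img is the one file that has to be copied to a new cluster.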
To make it easy for people on other UC campuses to run Anacapa, we have been involved in getting Singularity installed at five UC campuses. Even though Singularity is a relatively straightforward package, we encountered numerous install errors that had to be corrected through tedious back-and-forth remote technical support with sysadmins. Even with our streamlined approach, which required that only one new package be installed (Singularity), it was still painful at times. This issue faces anyone looking to share resources between institutions, and we hope we can improve the process for others who wish to share analysis environments. Singularity is now up and running at five UC campuses, and any future projects that use Singularity images as a distribution format will require zero new software package requests, as those campuses already have everything they need.
Dat sits in the Anacapa software container and is used to replicate the details of the original Anacapa compute environment. This means that, as the container is replicated and reused, folks can use Dat to version and share their new versions of the container environment.
To work with the Anacapa Container, users only need to download the Anacapa Container file and run a singularity command, and they are in a shell prompt that has all of the Anacapa software pre-installed. Because of the containerization approach Singularity takes, the heavy compute resources on the host machine are available as native resources in the container, meaning there is no loss in performance (as there is with virtualization-based approaches like VirtualBox).
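In practice, that interaction might look like the following sketch (the pipeline path and script name here are illustrative, not the toolkit’s documented interface):

```bash
# Drop into an interactive shell with all of the Anacapa software available.
singularity shell anacapa.img

# Or run a pipeline step non-interactively, bind-mounting a data directory
# from the host cluster into the container.
singularity exec --bind /scratch/$USER/edna:/data anacapa.img \
    bash /anacapa/run_pipeline.sh -i /data/reads -o /data/results
```

Because Singularity bind-mounts host directories rather than virtualizing them, large sequence files never have to be copied into the image itself.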
Software as data
We often think of data as separate from analysis scripts, software, and compute environments. In reality, these are all types of digital information that can be handled by Dat. By treating software as data, we can approach preservation, versioning, and sharing differently. We like the simplicity of the single-file disk images that Singularity uses, as it fits well with the Dat ethos of sharing your research as one folder that includes your manuscript, your datasets, your paper’s website, and now your entire research software environment.
This is another step towards easy software reproducibility. The image ensures that the exact software versions required are used at runtime. The traditional problem of a system-wide update of a Python package breaking everyone’s existing scripts that depended on the old version is no longer an issue. Researchers can simply load the environment they want by grabbing a specific Singularity image.
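As a hypothetical sketch with the dat command-line tool (the folder contents and archive key are placeholders), bundling and sharing the image alongside the rest of a project might look like:

```bash
# Share a research folder -- manuscript, data, and the Singularity image
# together -- as a versioned Dat archive.
cd my-edna-project/      # contains paper/, data/, anacapa.img
dat share                # create and seed a versioned dat:// archive

# A collaborator at another campus clones the whole environment,
# image included, and stays in sync with later versions.
dat clone dat://<archive-key> my-edna-project
```

Each new build of the image becomes a new version of the archive, so collaborators can pin their analysis to the exact environment that produced a result.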
By representing the software environment as a file that can be archived along with the dataset, we can ensure future researchers can always get up and running quickly in their quest to reproduce or modify the Anacapa pipeline. One of our partners on this project is the California Digital Library, a group within the University of California that (among other things) develops tools to ensure research datasets can be archived and made accessible forever. The challenge at hand is building a system that can coordinate dataset archiving across the giant distributed system that is the University of California and all of the external research groups that depend on UC data.
We have made this work available through DASH, which is a data repository hosted by California Digital Library. Any UC researcher has access to publish datasets through DASH, and we are hoping Anacapa Container can serve as a model for how to distribute reproducible software as part of the research dataset.
Dat and Singularity
Distributing the research container as a single file means that it can be used in conjunction with Dat as the distribution tool. We have only scratched the surface of the possibilities here, but we are looking forward to more partnerships with data repository providers like California Digital Library and the Internet Archive to build a distributed data archive that includes executable software containers. This would ensure data does not simply go to a data repository never to get used again, as the container would allow for the dataset to become interactive and available instantly.
Crossposted from https://blog.datproject.org/2018/09/18/announcing-the-dat-anacapa-container/
PIDapalooza 2019 – are you ready to rock!?!
Yes, it’s back and – with your support – it’s going to be better than ever! The third annual PIDapalooza open festival of persistent identifiers will take place at the Griffith Conference Centre, Dublin, Ireland on January 23-24, 2019 – and we hope you’ll join us there!
Hosted, once again, by California Digital Library, Crossref, DataCite, and ORCID, PIDapalooza will follow the same format as past events — rapid-fire, interactive, 30-60 minute sessions (presentations, discussions, debates, brainstorms, etc.) presented on three stages — plus main stage attractions, which will be announced shortly. New for this year is an unconference track, as suggested by several attendees last time.
In the meantime, get those creative juices flowing and send us your session PIDeas! What would you like to talk about? Hear about? Learn about? What’s important for your organization and your community and why? What’s working and what’s not? What’s needed and what’s missing? We want to hear from as many PID people as possible! Please use this form to send us your suggestions. The PIDapalooza Festival Committee will review all forms submitted by September 21, 2018 and decide on the lineup by mid-October.
As a reminder, the regular themes are:
- PID myths: Are PIDs better in our minds than in reality? PID stands for Persistent IDentifier, but what does that mean and does such a thing exist?
- PIDs forever – achieving persistence: So many factors affect persistence: mission, oversight, funding, succession, redundancy, governance. Is open infrastructure for scholarly communication the key to achieving persistence?
- PIDs for emerging uses: Long-term identifiers are no longer just for digital objects. We have use cases for people, organizations, vocabulary terms, and more. What additional use cases are you working on?
- Legacy PIDs: There are thousands of venerable old identifier systems that people want to continue using and bring into the modern data citation ecosystem. How can we manage this effectively?
- Bridging worlds: What would make heterogeneous PID systems ‘interoperate’ optimally? Would standardized metadata and APIs across PID types solve many of the problems, and if so, how would that be achieved? What about standardized link/relation types?
- PIDagogy: It’s a challenge for those who provide PID services and tools to engage the wider community. How do you teach, learn, persuade, discuss, and improve adoption? What does it mean to build a pedagogy for PIDs?
- PID stories: Which strategies worked? Which strategies failed? Tell us your horror stories! Share your victories!
- Kinds of persistence: What are the frontiers of ‘persistence’? We hear lots about fraud prevention with identifiers for scientific reproducibility, but what about data papers promoting PIDs for long-term access to reliably improving objects (software, pre-prints, datasets) or live data feeds?
We’ll be posting more information on the PIDapalooza website over the coming months, as well as keeping you updated on Twitter (@pidapalooza).
In the meantime, what are you waiting for!? Book your place now — and we also strongly recommend that you book your accommodation early as there are other big conferences in Dublin that week.
PIDapalooza, Dublin, Ireland, January 23-24, 2019 – it’s a date!
Org ID: a recap and a hint of things to come
Over the past couple of years, a group of organizations with a shared purpose—California Digital Library, Crossref, DataCite, and ORCID—invested our time and energy into launching the Org ID initiative, with the goal of defining requirements for an open, community-led organization identifier registry. The goal of our initiative has been to offer a transparent, accessible process that builds a better system for all of our communities. As the working group chair, I wanted to provide an update on this initiative and let you know where our efforts are headed.
Community-led effort
First, I would like to summarize all of the work that has gone into this project, a truly community-driven initiative, over the last two years:
- A series of collaborative workshops were held at the Coalition for Networked Information (CNI) meeting in San Antonio TX (2016), the FORCE11 conference in Portland OR (2016), and at PIDapalooza in Reykjavik (2016).
- Findings from these workshops were summarized in three documents, which we made openly available to the community for public comment:
- A Working Group met throughout 2017 and voted to approve a set of recommendations and principles for ‘governance’ and ‘product’:
- We then put out a Request for Information that sought expressions of interest from organizations to be involved in implementing and running an organization identifier registry.
- There was a really good response to the RFI; reviewing the responses and thinking about next steps led to our most recent stakeholder meeting in Girona in January 2018, where ORCID, DataCite, and Crossref were tasked with drafting a proposal that meets the Working Group’s requirements for a community-led, organizational identifier registry.
Thank you
I want to take this opportunity to thank everyone who has contributed to this effort so far. We’ve been able to make good progress with the initiative because of the time and expertise many of you have volunteered. We have truly benefited from the support of the community, with representatives from the Alfred P. Sloan Foundation, American Physical Society, California Digital Library, Cornell University, Crossref, DataCite, Digital Science, Editeur, Elsevier, Foundation for Earth Sciences, Hindawi, Jisc, ORCID, Ringgold, Springer Nature, The IP Registry, and U.S. Geological Survey involved throughout this initiative. And we couldn’t have done any of it without the help and guidance of our consultants, Helen Szigeti and Kristen Ratan.
The way forward
The recommendations from our initiative have been converted into a concrete plan for building a registry for research organizations. This plan will be posted in the coming weeks.
The initiative’s leadership group has already secured start-up resourcing and is getting ready to announce the launch plan—more details coming soon.
We hope that all stakeholders will continue to support the next phase of our work — look for announcements in the coming weeks about how to get involved.
As always, we welcome your feedback and involvement as this effort continues. Please contact me directly with any questions or comments at john.chodacki@ucop.edu. And thanks again for your help bringing an open organization identifier registry to fruition!
References
Bilder, G., Brown, J., & Demeranville, T. (2016). Organisation identifiers: current provider survey. ORCID. https://doi.org/10.5438/4716
Cruse, P., Haak, L., & Pentz, E. (2016). Organization Identifier Project: A Way Forward. ORCID. https://doi.org/10.5438/2906
Fenner, M., Paglione, L., Demeranville, T., & Bilder, G. (2016). Technical Considerations for an Organization Identifier Registry. https://doi.org/10.5438/7885
Haak, L., Bilder, G., Brown, C., Cruse, P., Devenport, T., Fenner, M., … Smith, A. (2017). ORG ID WG Product Principles and Recommendations. https://doi.org/10.23640/07243.5402047
Haak, L., Pentz, E., Cruse, P., & Chodacki, J. (2017). Organization Identifier Project: Request for Information. https://doi.org/10.23640/07243.5458162
Pentz, E., Cruse, P., Haak, L., & Warner, S. (2017). ORG ID WG Governance Principles and Recommendations. https://doi.org/10.23640/07243.5402002
This was crossposted from the DataCite blog on Aug 2, 2018: https://doi.org/10.5438/67sj-4y05
A Carpentries-Based Approach to Teaching FAIR Data and Software Principles
originally posted by Chris Erdmann
Recently, I was lucky to participate in an innovative workshop held at Technische Informationsbibliothek (TIB) Hannover from 9 – 13 July, 2018, which paired The Carpentries’ pedagogical style of teaching and lesson material with in-depth background on the FAIR Data Principles. FAIR comprises a set of guiding principles to make data findable, accessible, interoperable, and reusable (Wilkinson et al., 2016). FAIR is a relatively new initiative that is gaining momentum, and key stakeholders across the research lifecycle are exploring how the underlying FAIR Principles can be applied and assessed at various points. For instance, at the 2018 Research Data Alliance (RDA) Plenary Meeting in Berlin, FAIR was mentioned in at least 23 of the sessions.
Researchers are already starting to ask: how can my research be more FAIR? Thanks to Angelina Kraft (team lead, research data and scientific software) and Katrin Leinweber (research assistant) at TIB Hannover, we have a head start on developing a training program for the research community on FAIR Data and Software. Angelina and Katrin were joined by Carpentries instructors Konrad Foerstner (ZB MED – Informationszentrum Lebenswissenschaften), Martin Hammitzsch (Helmholtz-Zentrum Potsdam Deutsches GeoForschungsZentrum), Luke Johnston (Aarhus University), and Mateusz Kuzak (Dutch Techcentre for Life Sciences) in contributing to the workshop lesson materials and notes. These can be found on the workshop website, while the slides can be viewed in the 2018-07-09-FAIR-Data-and-Software-TIB-workshop Google Drive folder (or as PDFs). All the materials are openly shared in the hope that others will reuse and develop them further. Video recordings will also be available at the TIB AV-Portal. In addition, I and other participants tweeted non-stop to document the workshop for others following along remotely via the hashtag #TIBFDS.
Katrin and Angelina demonstrated that you can successfully pair background information on the FAIR Principles with the hands-on examples taught in The Carpentries. For others hoping to better prepare for FAIR and train their communities in the principles, the TIB Hannover workshop serves as an excellent starting point. I know a number of us in Library Carpentry will be working with Angelina and Katrin to further develop their material.
Crossposted from Library Carpentry blog: https://librarycarpentry.org/blog/2018/07/24/tib-hannover-fair-report/

