The History of Higher Education in California: A Big Data Approach

Thursday, April 6, 10am-11:30am, 180 Doe Library. Zach Bleemer discusses how he used data science (thousands of computer-processed versions of annual registers, directories, and catalogs) to reconstruct a near-complete database of all students, faculty, and courses at four-year universities in California in the first half of the 20th century, including the UC …

Berkeley commits to accelerating universal open access, signs the OA2020 Expression of Interest

The University Library at UC Berkeley took a major step today in its commitment to achieving universal open access for scholarly journal literature by signing the OA2020 Expression of Interest, in collaboration with UC Davis and UC San Francisco. OA2020 is an international movement, led by the Max Planck Digital Library in Munich, to convert …

Announcing New Dash Features: April 2017

The Dash team is pleased to announce the release of our newest features. Drawing on requests from users as well as standards in the field, we have updated the platform with the following releases: Private for Peer Review (Timed-Release of Data), ORCiD integration, email capture for corresponding authors, user-friendly downloads, and a variety of search and view enhancements.

Private for Peer Review (Timed-Release of Data)

As mentioned in a previous post, this was formerly referred to as embargoing data, but we are releasing this feature in the context of keeping data private for the length of peer review. We have now implemented a feature that allows researchers to keep data private, for the purposes of peer review, for up to six months. If a researcher decides to use this option, they will be given a private Reviewer URL that can be used by an external party to download the data.

This URL will redirect to the landing page with available data for download as soon as the data are public. If external parties have any questions or would like to request a download they will also now have the ability to reach the corresponding author.
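As a rough illustration of the timed-release rule described above, here is a minimal sketch. It is not Dash's actual implementation: the function names are hypothetical, and we assume that "six months" means six calendar months counted from the submission date.

```python
import calendar
from datetime import date

# Assumption: the limit is six calendar months from the submission date.
MAX_PRIVATE_MONTHS = 6


def add_months(d, months):
    """Return the date `months` calendar months after `d`, clamping the day
    to the end of the target month when needed (e.g. Aug 31 + 6 -> Feb 28)."""
    index = d.month - 1 + months
    year = d.year + index // 12
    month = index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def is_allowed_release_date(submitted, release):
    """A release date is allowed if it falls between the submission date
    and the six-month limit, inclusive (hypothetical rule)."""
    return submitted <= release <= add_months(submitted, MAX_PRIVATE_MONTHS)


def is_public(release, today):
    """Data are private before the release date and public from that day on."""
    return today >= release
```

Under these assumptions, a dataset submitted on April 1 could be held private until October 1 at the latest, and the Reviewer URL would be the only download route until then.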

Corresponding Author Email Capture & ORCiD Integration

Corresponding authors (and contributing authors) can now enter their email address and ORCiD iD, both of which will appear on the landing page beneath the author name. Just as articles have a corresponding author, we believe Data Publications should have a corresponding author who can be reached with questions about the dataset.

User Friendly Downloads & Interface Improvements

What one uploads is what another may download. When choosing to download the data files, only the files uploaded by the corresponding author will be downloaded.

Some other fixes and features include:

the wording of our search filters and browse option
a checkbox at the file upload stage to ensure researchers are not uploading sensitive or identifying information
explanatory information within the metadata submission for usage notes and related work
a preview of how large the dataset is on the download button

What’s up next?

Next Feature: large file upload and bulk file upload
Future Feature: a curation layer that will allow for administration capabilities

For more information or if you have any questions please check for updates on the @uc3cdl twitter feed, or get in touch at uc3@ucop.edu.

Embargoing the Term “Embargoes” Indefinitely

I’m two months into a position that lends part of its time to overseeing Dash, a Data Publication platform for the University of California. On my first day I was told that a big priority for Dash was to build out an embargo feature. Coming to the California Digital Library (CDL) from PLOS, an OA publisher with an OA Data Policy, I couldn’t understand why I would be leading endeavors to embargo data and not open it up, so I met this embargo directive with apprehension.

I began to acquaint myself with the campuses and a couple of weeks ago while at UCSF I presented the prototype for what this “embargo” feature would look like and I questioned why researchers wanted to close data on an open data platform. This is where it gets fun.

“Our researchers really just want a feature to keep their data private while their associated paper is under peer review. We see this frequently when people submit to PLOS”.

Yes, I had contributed to my own conflict.

While I laughed about how I was previously the person at PLOS convincing UC researchers to make their data public, I recognized that this would be an easy issue to clarify. And here we are.

The word “embargo” carries a negative connotation in the open community, and I ask that moving forward we do not use it to talk about keeping data private until an associated manuscript has been accepted. Let us call this “Private for Peer Review” or “Timed Release”, with a “Peer Review URL” that is available for sharing data during the peer review process, as Dryad does.

Embargoes imply that data are being held private for reasons other than the peer review process.
Embargoes are not appropriate if you have a funder, publisher, or other mandate to open up your data.
Embargoes are not appropriate for sensitive data, as these data should not be held (embargoed) in a public repository unless access is managed through a data access committee and the repository has proper security.
Embargoes are not appropriate for open Data Publications.

To embargo your data for longer than the peer review process (or for other reasons) is to shield your data from being used, built off of, or validated. This is contrary to “Open” as a strategy to further scientific findings and scholarly communications.

Dash is implementing features that will allow researchers to choose, in line with what we believe is reasonable for peer review and revisions, a publication date up to six months after submission. If researchers choose to use this feature, they will be given a Peer Review URL that can be shared to download the data until the data are public. It is important to note though that while the data may be private during this time, the DOI for the data and associated metadata will be public and should be used for citation. These features will be for the use of Peer Review; we do not believe that data should be held private for a period of time on an open data publication platform for other reasons.

Opening up data, publishing data, and giving credit to data are all important in emphasizing that data are a credible and necessary piece of scholarly work. Dash and other repositories will allow for data to be private through peer review (with the intent to have data be public and accessible in the near future). However, my hope is that as the data revolution evolves, incentives to open up data sooner will become apparent. The first step is to check our vocab and limit the use of the term “embargo” to cases where data are being held private without an open data intention.

California Digital Library Supports the Initiative for Open Citations

California Digital Library (CDL) is proud to announce our formal endorsement for the Initiative for Open Citations (I4OC). CDL has long supported free and reusable scholarly work, as well as organizations and initiatives supporting citations in publication. With a growing database of literature and research data citations, there is a need for an open global network of citation data.

The Initiative for Open Citations will work with Crossref and their Cited-by service to open up all references indexed in Crossref. Many publishers and stakeholders have opted in to participate in opening up their citation data, and we hope that each year this list will grow to encompass all fields of publication. Furthermore, we are looking forward to seeing how research data citations will be a part of this discussion.

CDL is a firm believer in and advocate for data citations and persistent identifiers in scholarly work. However, if research publications are cited and those citations are not freely accessible and searchable, our goal is not accomplished. We are proud to support the Initiative for Open Citations and invite you to get in touch with any questions you may have about the need for open citations or ways to be an advocate for this necessary change.

Below are some Frequently Asked Questions about the need, ways to get involved, and misconceptions regarding citations. The answers are provided by the Board and founders of the I4OC Initiative:

I am a scholarly publisher not enrolled in the Cited-by service. How do I enable it?

If not already a participant in Cited-by, a Crossref member can register for this service free-of-charge. Having done so, there is nothing further the publisher needs to do to ‘open’ its reference data, other than to give its consent to Crossref, since participation in Cited-by alone does not automatically make these references available via Crossref’s standard APIs.

I am a scholarly publisher already depositing references to Crossref. How do I publicly release them?

We encourage all publishers to make their reference metadata publicly available. If you are already submitting article metadata to Crossref as a participant in their Cited-by service, opening your references can be achieved in a matter of days. Publishers can easily and freely achieve this:

either by contacting Crossref support directly by e-mail, asking them to turn on reference distribution for all of the relevant DOI prefixes;
or by themselves setting the <reference_distribution_opt> metadata element to “any” for each DOI deposit for which they want to make references openly available.

How do I access open citation data?

Once made open, the references for individual scholarly publications may be accessed immediately through the Crossref REST API.

Open citations are also available from the OpenCitations Corpus, a database created to house scholarly citations, which is progressively and systematically harvesting citation data from Crossref and other sources. An advantage of accessing citation data from the OpenCitations Corpus is that they are available in standards-compliant machine-readable RDF format, and include information about both incoming and outgoing citations of bibliographic resources (published articles and books).
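For the Crossref route mentioned above, a minimal sketch of pulling the cited DOIs out of a single work record might look like the following. It assumes that the REST API serves a work at /works/{DOI} and lists open references under a "reference" key inside "message", which is our understanding of the API rather than something stated in this FAQ. The function names and the sample record are made up for illustration.

```python
from urllib.parse import quote

# Assumption: single works are served under this route of the Crossref REST API.
WORKS_BASE = "http://api.crossref.org/works/"


def work_url(doi):
    """Build the URL for one work record; the slash inside the DOI is kept."""
    return WORKS_BASE + quote(doi)


def cited_dois(work_json):
    """Return the DOIs found in a work record's reference list.

    References without a DOI (for example, unstructured citations) are
    skipped. If the references are not open, the list is assumed to be
    absent and the result is empty.
    """
    references = work_json.get("message", {}).get("reference", [])
    return [ref["DOI"] for ref in references if "DOI" in ref]


# Hypothetical record, hand-written for this example (not real API output).
SAMPLE_WORK = {
    "message": {
        "DOI": "10.5555/example",
        "reference": [
            {"key": "ref1", "DOI": "10.5555/12345678"},
            {"key": "ref2", "unstructured": "A book without a DOI, 1999."},
        ],
    }
}
```

To try it against live data, fetch work_url(...) with any HTTP client and pass the decoded JSON to cited_dois.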

Does this initiative cover future citations only or also historical data?

Both. All DOIs under a prefix set for open reference distribution will have open references through Crossref, for past, present, and future publications.

Past and present publications that lack DOIs are not dealt with by Crossref, and gaining access to their citation data will require separate initiatives by their publishers or others to extract and openly publish those references.

Under what licensing terms is citation data being made available?

Crossref exposes article and reference metadata without a license, since it regards these as raw facts that cannot be licensed.

The structured citation metadata within the OpenCitations Corpus are published under a Creative Commons CC0 public domain dedication, to make it explicitly clear that these data are open.

My journal is open access. Aren’t its articles’ citations automatically available?

No. Although Open Access articles may be open and freely available to read on the publisher’s website, their references are not separate, and are not necessarily structured or accessible programmatically. Additionally, although their reference metadata may be submitted to Crossref, Crossref historically set the default for references to “closed,” with a manual opt-in being required for public references. Many publisher members have not been aware that they could simply instruct Crossref to make references open, and, as a neutral party, Crossref has not promoted the public reference option. All publishers therefore have to opt in to open distribution of references via Crossref.

Is there a programmatic way to check whether a publisher’s or journal’s citation data is free to reuse?

For Crossref metadata, their REST API reveals how many and which publishers have opened references. Any system or tool (or a JSON viewer) can be pointed to this query: http://api.crossref.org/members?filter=has-public-references:true&rows=1000 to show the count and the list of publishers with “public-references”: true.

To query a specific publisher’s status, use, for example:

http://api.crossref.org/members?filter=has-public-references:true&rows=1000&query=springer

Then find the tag for public-references. In some cases it will be set to false.
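The same check can be scripted. The sketch below builds the query URL shown above and then searches a JSON response for every "public-references" tag, at whatever depth it appears, so it does not depend on the exact layout of the response. The function names and the sample response are hypothetical, and the sample's structure is only our guess at what the API returns.

```python
import json
from urllib.parse import urlencode

MEMBERS_URL = "http://api.crossref.org/members"


def members_query_url(publisher=None, rows=1000):
    """Build the members query, optionally narrowed to one publisher name."""
    params = {"filter": "has-public-references:true", "rows": rows}
    if publisher:
        params["query"] = publisher
    # safe=":" keeps the colon in the filter value readable, as in the FAQ.
    return MEMBERS_URL + "?" + urlencode(params, safe=":")


def find_public_references(node):
    """Yield every value stored under a "public-references" key, at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "public-references":
                yield value
            else:
                yield from find_public_references(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_public_references(item)


# Hypothetical response, hand-written for this example (not real API output).
SAMPLE_RESPONSE = json.loads("""
{"message": {"items": [{"primary-name": "Example Press",
  "prefix": [{"value": "10.5555", "public-references": true},
             {"value": "10.5556", "public-references": false}]}]}}
""")
```

Fetching members_query_url("springer") with any HTTP client and passing the decoded JSON to find_public_references would list the flags for the matching publishers, including any that are set to false.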

Contact

You can contact the founding group by e-mail at: info@i4oc.org.

Describing the Research Process

We at UC3 are constantly developing new tools and resources to help researchers manage their data. However, while working on projects like our RDM guide for researchers, we’ve noticed that researchers, librarians, and people working in the broader digital curation space often talk about the research process in very different ways.

To help bridge this gap, we are conducting an informal survey to understand the terms researchers use when talking about the various stages of a research project.

If you are a researcher and can spare about 5 minutes, we would greatly appreciate it if you would click the link below to participate in our survey.

http://survey.az1.qualtrics.com/jfe/form/SV_a97IJAEMwR7ifRP

Thank you.

Data Publication: Sharing, Crediting, and Re-Using Research Data

In the most basic terms, Data Publishing is the process of making research data publicly available for re-use. But even in this simple statement there are many misconceptions about what Data Publications are and why they are necessary for the future of scholarly communications.

Let’s break down a commonly accepted definition of “research data publishing”. A Data Publication has three core features: 1 – data that are publicly accessible and are preserved for an indefinite amount of time, 2 – descriptive information about the data (metadata), and 3 – a citation for the data (giving credit to the data). Why are these elements essential? These three features make research data reusable and reproducible, which is the goal of a Data Publication.

Data are publicly accessible and preserved indefinitely

There are many ways for researchers to make their data publicly available, be it within Supporting Information files of a journal article or within an institutional, field-specific, or general repository. For a true Data Publication, data should be submitted to a stable repository that can ensure data will be available and stored for an indefinite amount of time. There are over a thousand repositories registered with re3data, and many publishers have repository guides to help with field-specific guidance. When data are not suitable for public deposition, e.g. when data contain sensitive information, data should still be stored in a preserved and compliant space. While this restriction is a more difficult hurdle to jump over in advocating for data publishing and data preservation, it is important to ensure these data are not violating ethical requirements, nor are they locked up in a filing cabinet and eventually thrown out. Preservation of data is a necessity for the future.

Data are described (data have metadata)

Data without proper documentation or descriptive metadata are about as useful as research without data. If a Data Publication is a citable piece of scholarly work, it should contain information that allows it to be a useful and valued piece of scholarly work. Documentation and metadata range from information regarding software used for analysis to who funded the work. While these examples serve separate purposes (one for re-use and the other for credit), it is important that all information about the creation of the dataset (who, where, how, related publications) is available.

Data are citable and credible

We’ve established that datasets are essential to research output and are an important piece of scholarly work, so they should receive the same benefits. Data need to have a persistent identifier (a stable link) that can be referenced. While many repositories use a DataCite DOI to fulfill this, some field-specific repositories use accession numbers (e.g. NCBI repositories) that can be referenced within a URL. This is one of the reasons data need to be available in a stable repository. It’s a bit difficult to reference and credit data that are on your hard drive!

If it’s so clear, why are there barriers?

Data publishing has become more widely accepted in the last ten years, with new standards from funders and publishers and a growth in stable repositories. However, there’s still work to be done and more questions to be answered before we reach mass adoption. Let’s start that conversation (you can be the questioner and I’ll be the advocate):

Organizing and submitting data are time intensive and in turn, costly

Trying to replicate a data set from scratch takes much more time (and money) than publishing your data (see robotics example here). Taking the time to search your old computer files or get in touch with your last institution to get your data is more complicated than publishing your data. Having your paper retracted because your data are called into question and you can’t share them, or no longer have them, would cost more time and money, and do more damage to your reputation, than proactively publishing your datasets.

As an important side note: Data Publications do not need to be linked to a journal publication. While it may take extra time to submit a Data Publication in proper form, if used as an intermediate step in the research process you can reduce time later, get credit, and benefit the research community in the meantime.

What’s the incentive?

Credit. Next question?

But beyond credit for a citable piece of work, publishing data as a common practice will shift focus from publications being an end point in the research cycle to a starting point and this shift is crucial for transparency and reproducibility in published works. Incentives will become clear once Data Citations become common practice within the publisher and research community, and resources are available for researchers to know how (and have the time/funds) to submit Data Publications.

Too few resources for understanding Data Publishing

Many great papers have been posted and published in the last ten years about what a Data Publication is; however, fewer resources have been made available to the research community on how to integrate Data Publishing into the research life cycle and how to organize data to even be suitable for a Data Publication. Data Management Plans, courses on research data management, and pressure from various funder and publisher policies will help, but there’s a serious need for education on data planning/organization (including metadata and format requirements) as well as awareness of data publishing platforms and their benefits. This is a call to the community to release these materials and engage in the Research Data Management (RDM) community to get as many of these conversations going as possible. The more resources, answers, and guidance that institutions can provide to researchers, the less the “it takes too much time and money” argument will arise, the easier it will be to achieve the incentive, and the further we will push the boundaries of transparency in scholarly communications.

There’s no better time than now to re-evaluate what resources are available for research output. If we strive for re-use and reproducibility of research data within the community, then now is the time to increase awareness and adoption of Data Publication.

For more information about research data organizations, machine actionable Data Management Plans, or Data Publication platforms, please utilize UC3 resources or get in touch at uc3@ucop.edu.