CC BY and data: Not always a good fit

CDL UC3

Simba dans le carton, jacme31, CC BY-SA 2.0

This post was originally published on the University of California Office of Scholarly Communication blog.

In my last post, I wrote about data ownership, and how focusing on “ownership” might drive you nuts without actually answering important questions about what can be done with data. In that context, I mentioned a couple of times that you (or your funder) might want data to be shared under CC0, but I didn’t clarify what CC0 actually means. This week, I’m back to dig into the topic of Creative Commons (CC) licenses and public domain tools, and how they work with data.

People often refer to everything with a CC in front of it as a “Creative Commons license.” But only six of the Creative Commons tools are licenses. They are CC BY, CC BY-SA, CC BY-NC, CC BY-ND, CC BY-NC-SA, and CC BY-NC-ND.

The licenses permit people to use a work in exchange for agreeing to certain conditions: attribution at a minimum, and others if the author so chooses. Let’s not focus on what those other conditions are today, just the fact that a) there are conditions and b) attribution is always one of them.

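The public domain tools are different. There are two of them: CC0, which authors use to waive whatever copyright they hold in a work, and the Public Domain Mark, which labels works that are already free of known copyright restrictions.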

There aren’t conditions placed on the user in these two cases, so we don’t call them licenses.

Sharing data with CC0 is increasingly encouraged (see these pages by Dryad and BioMed Central for a couple of examples) because the easier you make it for people to reuse your data — and the clearer you make it that you aren’t planning on suing people who do — the more likely it is that your data will be reused. The Open Data Commons has a similar tool called the Public Domain Dedication and License (PDDL).

That said, CC BY and the other five Creative Commons licenses are also popular among scholars publishing data, because of the attribution requirement I mentioned above. Specifically, the CC licenses require that those who reuse a work provide attribution to the work’s creator by retaining “identification of the creator(s) of the Licensed Material and any others designated to receive attribution, in any reasonable manner requested by the Licensor (including by pseudonym if designated).” This makes great sense for photos, articles, and music that creators share with a CC license.

It’s not surprising that researchers producing data also appreciate attribution. Unfortunately, CC licenses can be a poor tool for mandating it. They are neither sufficient nor necessary.

1. CC licenses are not sufficient for ensuring proper attribution in many cases because their restrictions — including attribution — do not apply to facts. The restrictions only apply to whatever portion of a work is copyrightable. (See last week’s discussion of copyrightability.)

Because of this, sharing uncopyrightable data under a CC license can lead to confusion about what the data producer is trying to achieve. So can sharing data with no reuse terms stated at all: if data is shared publicly, but without a license, copyright law’s defaults apply. That means that if the data isn’t copyrightable, reuse doesn’t require the author’s permission. But if there’s something copyrightable, the author has exclusive rights to control reproduction and distribution of it, subject to certain exceptions like fair use. With or without a CC license, users who want to reproduce or redistribute the data can be left wondering about what’s protected by copyright, and what kinds of reuse (if any) the data producer was hoping to enable.

CC0 is much clearer. It doesn’t matter if a dataset is copyrightable or not. Whatever copyright the author has, they waive. Everywhere. If it’s nothing, they waive nothing, but they’re making a clear statement that they don’t plan on using copyright to restrict reuse. That’s especially handy when you consider that a) reasonable people can argue about whether something is copyrightable or not and b) something can be copyrightable in some countries but not others.

2. CC licenses’ attribution requirements aren’t necessary because scholars have very good reasons to provide attribution that have nothing to do with copyright. See, for instance, the Joint Declaration of Data Citation Principles, which doesn’t mention copyright or intellectual property at all. Data that comes from nowhere has little credibility. If someone wants to use data as persuasive evidence, they need to refer readers and reviewers back to its source: who it came from and how it was produced.

And licenses that dictate exactly how later projects reuse data and provide attribution can hobble those projects. For just one example, see this discussion of licensing issues in Rephetio, a project that’s done a particularly good job documenting the challenges faced in attempting to reuse drug repurposing data. Data reusers should follow best practices for data citation in their field given their particular project. If data sharers can trust them to do just that, rather than trying to shoehorn attribution practices into one-size-fits-all copyright licenses, they can make it much easier for others to reuse their data.

Meanwhile, cheaters will cheat. If professional ethics aren’t enough to dissuade someone from plagiarism or other misconduct, why would the restrictions of a CC license dissuade them, especially in cases where they’re using factual data that those restrictions don’t apply to?

If you’re sharing data publicly:

If you’re using publicly shared data:

Public data sharing on a large scale is still relatively new. Things are going to be confusing for a while. People are going to make mistakes. Don’t be afraid to ask questions. Share your concerns, successes, and roadblocks with others in your field, publicly if you can. As data sharing becomes more common due to funder and publisher requirements, answers will be increasingly easy to find.