
What You and Odysseus Have in Common

If you are like me, you begrudgingly read Homer’s epic poem The Odyssey in high school, but remember very little. I would like to remind you of one specific (and perhaps most famous) scene: Odysseus’ encounter with the Sirens. (Stay with me: this does eventually relate to data.) Odysseus, as captain of his ship, is warned of the Sirens that lure sailors to their death. Intrigued by the idea of hearing such a beautiful song, Odysseus concocts a harebrained scheme worthy of an episode of I Love Lucy.

Odysseus recognizes the inevitable dynamic inconsistency of his situation (although he doesn’t call it that). That is, Present Odysseus knows how dangerous the Sirens are, and believes he can resist the temptation by being tied to the mast of his ship by his men (the men will have their ears plugged to avoid hearing the Sirens). In insisting that he be tied up, Odysseus recognizes that Future Odysseus will be unable to resist the temptation, despite knowledge of the potential dangers.

“Ulysses and the Sirens” by John William Waterhouse. Courtesy of Wikimedia Commons.

I won’t spoil the outcome of Odysseus’ personal battles, but the battle between present and future selves is a very relevant topic in modern times, especially during the holiday season (e.g. the epic battle between sugar cookies and your skinny jeans). Another parallel surrounds the time it takes to document your data. Present Scientist Self (PSS) has two papers, three reviews, and four grants to finish before Friday. PSS has no patience for the intricacies of well-documented data: it is not a pressing issue. Future Scientist Self would greatly benefit from good documentation, especially when it comes time to archive those data (you WERE going to archive them, right?). Of course, Future Scientist Self is not the only one who will benefit from good data management and organization; other future scientists will benefit too.

Food for thought: it’s inevitable that Odysseus will want to go to the Sirens, that you will regret those sugar cookies, and that poor data documentation will end up costing your future self much time and frustration.

DCXL, Policies & NSF Data Management Plans

Unless you live in a cave, you are probably aware that NSF started requiring that researchers submit a two-page supplement to all proposals titled “Data Management Plan”. To paraphrase from the Grant Proposal Guide, investigators are told they need to discuss:

1. Types of data
2. Standards they will use for data and metadata
3. Policies for access and sharing
4. Policies and provisions for re-use
5. Plans for archiving, preserving, and providing access to data

Points #3 and #4 were discussed quite a bit last week at the Data Governance Workshop I attended, and I was quite concerned about how scientists would be able to find and comprehend these policies. If a room full of librarians, funders, publishers, and experts can’t figure out what policies might apply to scientific data, I began to wonder if scientists had any hope of understanding data governance. I think they do, so long as some of the proposed products that will result from the workshop come to fruition.

So where might the Excel add-in we are developing fit into this scheme? The first version of the add-in will likely not have much utility for data governance issues, like setting policies, establishing access rights, and restricting data availability. We do, however, envision that this add-in might provide a framework for future developers to implement tools to facilitate good data governance practices. This might be in the form of a link to an archive’s policy, metadata with provisions for access and use, or other methods.

I like to think that because this add-in is intended to be open-source, it will become a useful tool upon which savvy developers can build in capabilities for things like governance, collaboration, links to social networking tools, etc.

Intellectual Property, Copyright, & Other Dry Topics

Recently, I found myself wondering: what the heck is data governance? I was asked to participate in a workshop on Data Governance, supported by DataONE and led by MacKenzie Smith of Creative Commons and Trisha Cruse of UC3. I promptly replied “yes!”, pretending to understand the phrase, and then hurried back to my computer and Googled it.

Data governance is one of those phrases where you can define all of the words involved, but are unclear what they represent when strung together. No need for you to start Googling – after participating in the data governance workshop in DC for the last couple of days, I can happily report all that I learned and save you the effort.

First, let’s define data governance (based on Wikipedia’s entry): it’s the set of policies surrounding data, including data risk management, the assignment of roles and responsibilities for data, and, more generally, the formal management of data assets throughout the research cycle. Data governance issues include things like

data sharing licenses
providing credit for data (see my post about data citation here)
managing persistent identifiers (like those available via EZID)
documenting data provenance
sharing metadata to enable discovery
establishing registries for standards and ontologies

Many scientists might think this is a rather dry set of topics (whether they are correct is a matter of opinion!). Scientists aren’t concerned about the policies surrounding data, and they have very little incentive to care. We have all signed copyright agreements when we publish in journals and patent agreements for our institutions (like this one for the UC system). But how many of us have read those documents? We have agreed to the terms and conditions of accepting funding, using institutional resources, publishing in journals, and engaging in collaborative research; but how many of us know what we have agreed to do with our data? My guess? Very close to zero.

The important point here is that we SHOULD care. In my conversations with scientists, I have discovered that most of them, if willing to share at all, would like to place restrictions on access and use of their data. We need to be involved in those data governance discussions if we want to set the terms of our data sharing.

The data governance meeting was attended by 30 folks representing a wide range of perspectives. There were publishers, librarians, funders, scientists, data managers and a lawyer to offer up their ideas about how best to tackle the issues surrounding digital data. Examples of issues that surfaced:

Who owns the data?
Who is legally allowed to set the policies for data access and use?
How are data affected by copyright law?
How should we handle data that are used for meta-analysis, and are therefore subject to many different policies?
What is the implicit policy if none is specified?
How should we educate the community of stakeholders about data governance?

We certainly didn’t solve all of the problems associated with data governance, but we made good headway on starting the conversation and encouraging further work in this area. I will expand on some of these topics in the next blog entry, so stay tuned! For a preview, check out this Storify record of the Twitter feed from the meeting.

Whether or not you think this graffiti speaks the truth, you should be part of the discussion. From Flickr by 917press.

The Skinny on Data Publication

The concept of data publication is rather simple in theory: rather than relying on journal articles alone for scholarly communication, let’s publish data sets as “first class citizens” (hat tip to the DataCite group). Data sets have inherent value that makes them standalone scholarly objects— they are more likely to be discovered by researchers in other domains and working on other questions if they are not associated with a specific journal and all of the baggage that entails.

Consider this example (taken from personal experience). If you are a biologist interested in studying clam population connectivity, how likely are you to find the (extremely relevant) data related to clam shell chemistry that are associated with paleo-oceanography journals? It took me several months before I discovered them during my PhD. If those datasets had been published in a repository, however, with a few well-chosen keywords and a quick web search, I would have located those datasets much more quickly.

Who would be against this idea, you ask? It turns out data publication is similar to data management: no one is against the concept per se, but they are against all of the work, angst, and effort involved in making it a reality. There is also considerable debate about how we should proceed to make data publication the norm in scientific communication.

A summary of what’s wrong with the current system, from a PhD Comics cartoon: http://www.phdcomics.com/comics/archive.php?comicid=1200

I had a lovely dinner last week with some colleagues in town for the AGU meeting, where a passionate debate ensued about data publication. One of the scientists made the (quite valid) argument that data publication is a terrible phrase because the word “publication” insinuates that we are beholden to the current broken system of journal publication. The word itself has too much baggage. The opposing counsel suggested that bureaucrats, funders, and institutions have a familiarity with the word publication and that will ensure the success of the data publication goals, regardless of whether we break the mold in the process. We agreed to brainstorm potential metaphors for the concept of data publication that might result in a better phrase to describe the idea. Any suggestions?

This has relevance to the DCXL project since we consider this Excel add-in to be a stepping stone towards data publication (whatever we end up calling it). By allowing scientists to directly link with archives and upload their data, we are promoting data as a unique scholarly object. Through services like EZID, you can even get a DOI for your dataset. These are all good advances towards promoting data as a first class object.

For more on the current debate that is raging about scholarly communication via journal publication, check out these two recent excellent pieces:

And for a giggle, watch the awesome cartoon called Scientist Meets Publisher from the blog Ceptional.

Scientific Data at the AGU Meeting

This past week, 20,000 Earth scientists gathered in San Francisco for the American Geophysical Union’s Fall 2011 Meeting. Like most conference attendees, I had grand plans of attending the conference all week, filling my head with geophysical knowledge and talking to scientists about their use of Excel. The week didn’t quite work out as planned, but I have good information to share nonetheless. Here are some highlights:

1. Earth scientists are using Excel just as much as ecologists and fisheries folks. They are using it in similar ways, too: they organize their data in Excel, then export it to other programs like MATLAB and JMP to perform analyses.
2. Earth scientists are more familiar with the idea of downloading data from websites and repositories, and then combining these potentially disparate data types to perform analyses. (Check out this great list of relevant data repositories from University of Oregon libraries.)
3. Despite their familiarity with downloading data from repositories, very few of them upload their data to those repositories. They think of those data centers as read-only rather than read-write.

Item #3 above was the most interesting, from my perspective. Earth scientists are benefiting greatly from others sharing their data and making it available for use and re-use. Why doesn’t it occur to them to contribute? It’s the holiday season, the season of giving and receiving. That means you, Earth scientists!

Data centers are a two-way street. From Flickr by z6p6tist6.

Drumroll please… Excel add-in features we are proposing

Just in time for the holidays (and the AGU Fall 2011 Meeting): a list of the overarching requirements we are working on for the Excel add-in project! After talking with about 150 scientists about their Excel use and their data management/archiving practices, we think we have narrowed down the giant list of potential Excel improvements to a manageable list for the project.

Generate metadata. Over and over, scientists said they didn’t have good data documentation. Wouldn’t it be great if Excel helped you create it? We think so. Data centers require a certain amount of metadata for you to archive your data with them; the process would be much easier if you had a way to generate the correct metadata format and structure for the archive you designate.
Generate a data citation for the data file. Data citation is the fastest way to encourage good data stewardship since it gives scientists incentives to publish and share their data. For more information on data citation, read my post on the subject.
Check the spreadsheet for export compatibility. Most scientists don’t use Excel alone. Instead, Excel is a stepping stone to other programs: it is used to organize the data and perform basic quality control, but the data are then promptly copied and pasted or exported to another program like R, MATLAB, ArcGIS, or SAS. The spreadsheet format needed to import your data into other programs is similar to what’s needed to submit your data to an archive. That means if you eliminate problems that would cause statistical programs to choke (see my post on problematic features in Excel), your data are one step closer to being archived quickly and easily.
Link to archive services. We want you to be able to archive your data with the click of a button. You will need to have a relationship with the archive you plan to submit to, but hopefully that’s already established. Don’t worry: you can specify your usage restrictions and access policies depending on the archive you choose. This requirement is where DataONE comes in. We hope to make the connection between Excel and DataONE as seamless as possible.
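
To make the export-compatibility requirement a little more concrete, here is a minimal Python sketch of the kind of check it describes, run on a spreadsheet that has been saved as CSV. This is not the add-in’s code. The function name check_export_compatibility and the particular checks (missing or duplicate column headers, blank rows, rows with the wrong number of cells) are assumptions made for illustration, chosen because they are common reasons that statistical programs choke on a file.

```python
import csv
import io


def check_export_compatibility(csv_text):
    """Return a list of human-readable problems found in CSV text.

    Hypothetical helper for illustration only; the checks are assumptions,
    not the add-in's actual rules.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["file is empty"]

    problems = []
    header = [name.strip() for name in rows[0]]

    # Missing or repeated column names confuse most import routines.
    seen = set()
    for position, name in enumerate(header, start=1):
        if not name:
            problems.append(f"column {position} has no header")
        elif name in seen:
            problems.append(f"duplicate header: {name}")
        seen.add(name)

    for line_number, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            # Ragged rows usually mean stray notes or merged cells.
            problems.append(
                f"row {line_number} has {len(row)} cells, expected {len(header)}"
            )
        elif all(cell.strip() == "" for cell in row):
            # Blank rows often separate several tables stacked in one sheet.
            problems.append(f"row {line_number} is blank")
    return problems


if __name__ == "__main__":
    sample = "site,depth,depth\nA,1.5,2\n,,\nB,3\n"
    for problem in check_export_compatibility(sample):
        print(problem)
```

A file that passes checks like these is not guaranteed to be archive-ready, but each problem it reports is one that would otherwise surface later, during import into another program or submission to an archive.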

This add-in is going to rescue you from bad data stewardship. Original photo from Flickr by alandberning.

What do you think? Ideas? Concerns? More details will come but suggestions are always welcomed. Just shoot me an email.