Data Publishing and the Coproduction of Quality
This post is authored by Eric Kansa
There is a great deal of interest in the sciences and humanities around how to manage “data.” By “data,” I’m referring to content that has some formal and logical structure needed to meet the requirements of software processing. Of course, distinctions between structured versus unstructured data represent more of a continuum or spectrum than a sharp line. What sets data apart from texts, however, is that data are usually intended for transactional (with queries and visualizations) rather than narrative applications.
The uses of data versus texts make a big difference in how we perceive “quality.” If there is a typo in a text, it usually does not break the entire work. Human readers are pretty forgiving with respect to those sorts of errors, since humans interpret texts via pattern recognition heavily aided by background knowledge and expectations. Small deviations from a reader’s expectations about what should be in a text can be glossed over or even missed entirely. If noticed, many errors annoy rather than confuse. This inherently forgiving nature of text makes editing and copy-editing attention-demanding tasks. One has to struggle to see what is actually written on a page rather than getting the general gist of a written text.
Scholars are familiar with editorial workflows that transform manuscripts into completed publications. Researchers submit text files to journal editors, who then circulate manuscripts for review. When a paper is accepted, a researcher works with a journal editor through multiple revisions (many suggested by peer-review evaluations) before the manuscript is ready for publication. Email, versioning, and edit-tracking help coordinate the work. The final product is a work of collaborative “coproduction” between authors, editors, reviewers, and type-setters.
What does this have to do with data?
Human beings typically don’t read data. We use data mediated through software. The transactional nature of data introduces a different set of issues impacting the quality and usability of data. Whereas small errors in a text often go unnoticed, such errors can have dramatic impacts on the use and interpretation of a dataset. For instance, a misplaced decimal point in a numeric field can cause problems for even basic statistical calculations. Such errors can also break visualizations.
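To make the decimal-point problem concrete, here is a minimal Python sketch. The measurements are invented purely for illustration (they are not from any real dataset), and the variable names are my own assumptions. It shows how a single misplaced decimal point shifts a basic statistic like the mean.

```python
from statistics import mean, median

# Hypothetical bone lengths in millimetres; values invented for illustration.
lengths_mm = [41.2, 39.8, 40.5, 42.1, 40.9]

# The same list with one misplaced decimal point: 40.5 entered as 405.0.
lengths_with_typo = [41.2, 39.8, 405.0, 42.1, 40.9]

clean_mean = mean(lengths_mm)
typo_mean = mean(lengths_with_typo)
typo_median = median(lengths_with_typo)

print(f"mean without typo: {clean_mean:.2f}")
print(f"mean with typo:    {typo_mean:.2f}")
print(f"median with typo:  {typo_median:.2f}")
```

In this toy case the single bad value more than doubles the mean, while the median barely moves. A human skimming the column might never notice the typo; the software faithfully computes the wrong answer.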
These issues don’t only impact single datasets; they can also wreak havoc in settings where multiple individual datasets need to be joined together. I work mainly on archaeological data dissemination. Archaeology is an inherently multidisciplinary practice, involving inputs from different specialists in the natural sciences (especially zoology, botany, human osteology, and geomorphology), the social sciences, and the humanities. Meaningful integration of these diverse sources of structured data represents a great information challenge for archaeology. Archaeology also creates vast quantities of other digital documentation. A single field project may result in tens of thousands of digital photos documenting everything from excavation contexts to recovered artifacts. Errors and inconsistencies in identifiers can create great problems in joining together disparate datasets, even from a single archaeological project.
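The identifier problem can be sketched in a few lines of Python. Everything here is hypothetical: the identifier scheme, the records, and the `normalize` helper are assumptions made up for illustration, not part of any real project’s data. The sketch joins a table of finds to a table of photos on a shared identifier, first naively and then after some light normalization.

```python
# Hypothetical context identifiers and records, invented for illustration.
finds = {
    "TR1-001": "bone fragment",
    "TR1-002": "pottery sherd",
    "TR1-003": "flint blade",
}
photos = {
    "TR1-001": "img_0001.jpg",
    "tr1-002 ": "img_0002.jpg",  # wrong case plus a trailing space
    "TR1-03": "img_0003.jpg",    # missing digit
}


def normalize(identifier: str) -> str:
    """Hypothetical cleanup: trim whitespace and upper-case the identifier."""
    return identifier.strip().upper()


# A naive join matches only identifiers that are character-for-character equal.
naive_matches = sorted(set(finds) & set(photos))

# Normalizing first recovers matches that differ only in case or whitespace.
normalized_photos = {normalize(key): name for key, name in photos.items()}
normalized_matches = sorted(set(finds) & set(normalized_photos))

# Whatever still fails to match needs a human editor to look at it.
unmatched = sorted(set(normalized_photos) - set(finds))

print("naive matches:     ", naive_matches)
print("normalized matches:", normalized_matches)
print("still unmatched:   ", unmatched)
```

Simple normalization rescues the case-and-whitespace slip, but the identifier with a missing digit cannot be repaired mechanically. That kind of error is where a contributing researcher and a data editor need to talk to each other.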
It is a tremendous challenge to relate all of these different datasets and media files together in a usable manner. The challenge is further compounded because archaeology, like many small sciences, typically lacks widely used recording terminologies and standards. Each archaeological dataset is custom crafted by researchers to address a particular suite of research interests and needs. This means that workflows and supporting software to find and fix data problems need to be pretty generalized.
Fortunately, archaeology is not alone in needing tools to promote data quality. Google Refine helps meet these needs. Google Refine leverages the transactional nature of data to summarize and filter datasets in ways that make many common errors apparent. Once errors are discovered, Google Refine has powerful editing tools to fix problems. Users can also undo edits to roll-back fixes and return a dataset to an earlier state.
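The idea of summarizing a column so that errors stand out can be illustrated with plain Python. To be clear, this is not Google Refine’s API or its implementation; it is only a rough sketch of the general technique of counting distinct values, and the taxon names below are invented for illustration.

```python
from collections import Counter

# Hypothetical "taxon" column from a zooarchaeological dataset (invented values).
taxa = [
    "Ovis aries",
    "Ovis aries",
    "ovis aries",   # inconsistent capitalization
    "Bos taurus",
    "Ovis aires",   # transposed letters
    "Bos taurus",
    "Ovis aries",
]

# Count how often each distinct value occurs, most frequent first.
counts = Counter(taxa)
for value, n in counts.most_common():
    print(f"{n:>3}  {value}")

# Values that occur only once are worth a second look: they are often typos.
suspects = sorted(value for value, n in counts.items() if n == 1)
print("values to review:", suspects)
```

Seven rows collapse into four distinct values, and the two rare ones are exactly the entries that need fixing. Scanning a short summary like this is far easier than proofreading every row.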
With funding from the Alfred P. Sloan Foundation, we’re working to integrate Google Refine in a collaborative workflow called “Data Refine”. Again, the transactional nature of data helps shape this workflow. Because use of data is heavily mediated by software, datasets can be seen as an integral part of software. This thinking motivated us to experiment with using software debugging and issue tracking tools to help organize collaborative work on editing data. Debugging and issue tracking tools are widely used and established ways of improving software quality. They can play a similar role in the “debugging” of data.
We integrated Google Refine and the PHP-based Mantis issue tracker to support collaboration in improving data quality. In this approach, contributing researchers and data editors collaborate in the coproduction of higher quality, more intelligible and usable datasets. These workflows try to address both supply and demand needs in scholarship. Researchers face strong and well known career pressures. Tenure may be worth $2 million or more over the course of a career, and its alternative can mean complete ejection from a discipline. A model of editorially supervised “data sharing as publication” can help better align the community’s interest in data dissemination with the realities of individual incentives. On the demand side, datasets must have sufficient quality and documentation. To give context, data often need to be related and linked with shared concepts and with other datasets available on the Web (as in the case of “Linked Open Data” scenarios).
All of these processes require effort. New skills, professional roles, and scholarly communication channels need to be created to meet the specific requirements of meaningful data sharing. Tools and workflows as discussed here can help make this effort a bit more efficient and better suited to how data are used in research.
Communication Breakdown: Nerds, Geeks, and Dweebs
Last week the DCXL crew worked on finishing up the metadata schema that we will implement in the DCXL project. WAIT! Keep reading! I know the phrase “metadata schema” doesn’t necessarily excite folks – especially science folks. I have a theory for why this might be, and it can be boiled down to a systemic problem I’ve encountered ever since becoming deeply entrenched in all things related to data stewardship: communication breakdown.
I began working with the DataONE group in 2010, and I was quickly overwhelmed by the rather steep learning curve I encountered related to data topics. There was a whole vocabulary set I had to learn, an entire ecosphere of software and hardware, and a hugely complex web of computer science-y, database-y, programming-y concepts to unpack. I persevered because the topics were interesting to me, but I often found myself spending time on websites that were indecipherable to the average intelligent person, or reading 50-page “quick start guides”, or getting entangled in a rabbit hole of Wikipedia entries for new concepts related to data.

I love learning, so I am not one to complain about spending time exploring new concepts. However, I would argue that my difficulties represent a much bigger issue plaguing advances in data stewardship: communication issues. It’s actually quite obvious why these communication problems exist. There are a lot of smart people involved in data, all of whom have very divergent backgrounds. I suggest that the smart people can be broken down into three camps: the nerds, the geeks, and the dweebs. These stereotypes should not be considered insults; rather they are an easy way to refer to scientists, librarians, and computer types. Check out the full Venn diagram of nerds here.
The nerds. This is the group to which I belong. We are specially trained in a field and have in-depth knowledge of our pet projects, but general education about computers, digital data, and data preservation is not part of our education. Certainly that might change in the near future, but in general we avoid the command line like the plague, prefer user-friendly GUIs, and resist any learning of new software, tools, etc. that might take away from learning about our pet projects.
The geeks. Also known as computer folks. These folks might be developers, computer scientists, information technology specialists, database managers, etc. They are uber-smart, but from what I can tell their uber-smart brains do not work like mine. In my experience, geeks explain things to me in one of two ways:
1. “To turn your computing machine on, you need to first plug it in. Then push the big button.”
2. “First go to bluberdyblabla and enter c>*#&$) at the prompt. Make sure the juberdystuff is installed in the right directory, though. Otherwise you need to enter #($&%@> first and check the shumptybla before proceeding.”
In all fairness, (1) occurs far less often than (2). But often you get (1) after trying to get clarification on (2). How to remedy this? First, geeks should realize that our brains don’t think in terms of directories and command line prompts. We are more comfortable with folders we can color code and GUIs that allow us to use the mouse for making things happen. That said, we aren’t completely clueless. Just remember that our vocabularies are often quite different from yours. Often I’ve found myself writing down terms in a meeting so I can go look them up later. Things like “elements” and “terminal” are not unfamiliar words in and of themselves. However, the contexts in which they are used are completely new to me. That doesn’t even count the unfamiliar words and acronyms, like APIs, GitHub, Python, and XML.
The dweebs. Also known as librarians. These folks are more often being called “information professionals”, but the gist is the same – they are all about understanding how to deal with information in all its forms. There’s certainly a bit of crossover with the computer types, especially when it comes to data. However, librarian types are fundamentally different in that they are often concerned with information generated by other people: put simply, they want to help, or at least interact with, data producers. There are certainly a host of terms that are used more often by librarian types: “indexing” and “curation” come to mind. Check out the DCXL post on libraries from January.
Many of the projects in which I am currently involved require all three of these groups: nerds, geeks, and dweebs. I watch each group struggle to communicate their points to the others, and too often decide that it’s not worth the effort. How can we solve this communication impasse? I have a few ideas:
- Nerds: open your minds to the possibility that computer types and librarian types might know about better ways of doing what you are doing. Tap the resources that these groups have to offer. Stop being scared of the unknown. You love learning or you wouldn’t be a scientist; devote some of that love in the direction of improving your computer savvy.
- Geeks: dumb it down, but not too much. Recognize that scientists and librarians are smart, but potentially in very different ways than you. Also, please recognize that change will be incremental, and we will not universally adopt whatever you think is the best possible set of tools or strategies, however “totally stupid” our current workflow seems.
- Dweebs: spend some time getting to know the disciplines you want to help. Toot your own horn– you know A LOT of stuff that nerds and geeks don’t, and you are all so darn shy! Make sure both geeks and nerds know of your capacity to help, and your ability to lend important information to the discussion.
And now a special message to nerds (please see the comment string below about this message and its potential misinterpretation). I plead with you to stop reinventing the wheel. As scientists have begun thinking about their digital data, I’ve seen a scary trend of them taking the initiative to invent standards, start databases, or create software. It’s frustrating to see since there is a whole set of folks out there who have been working on databases, standards, vocabularies, and software: librarians and computer types. Consult with them rather than starting from scratch.
In the case of dweebs, nerds, and geeks, working together as a whole is much much better than summing up our parts.
Fun Uses for Excel

It’s Friday! Better still, it’s Friday afternoon! To honor all of the hard work we’ve done this week, let’s have some fun with Excel. Check out these interesting uses for Excel that have nothing to do with your data:
Want to see some silly spreadsheet movies? Here ya go.
Excel Hero: Download .xls files that create nifty optical illusions. Here’s one of them.
From PCWorld, Fun uses for Excel, including a Web radio player that plays inside your worksheet (click to download the zip file and then select a station), or simulating dice rolls in case of a lack-of-dice emergency during Yahtzee.
Mona Lisa never looked so smart. Want to know more? Check out the YouTube video tutorial or read Creating art with Microsoft Excel from the blog digital inspiration.
Tweeting for Science
At risk of veering off course of this blog’s typical topics, I am going to post about tweeting. This topic is timely given my previous post about the lack of social media use in Ocean Sciences, the blog post that it spawned at Words in mOcean, and the Twitter hash tag #NewMarineTweep. A grad school friend asked me recently what I like about tweeting (ironically, this was asked using Facebook). So instead of touting my thoughts on Twitter to my limited Facebook friends, I thought I would post here and face the consequences of avoiding DCXL almost completely this week on the blog.
First, there’s no need to reinvent the wheel. Check out these resources about tweeting in science:
- Wired did a great piece on Twitter + Science, including a list of tweets collected by the piece’s author about why scientists choose to tweet. Don’t take my word for it- read up on what the masses said about Twitter.
- The social media expert + scientist Christie Wilcox (aka @NerdyChristie) created a super duper set of slides about “Why every lab should tweet”; it’s a visual, easy-to-follow way to understand how Twitter could shape your science for the better.
- Of course, the amazing Marine Science blog Deep Sea News posted about Twitter’s power way back in 2010. Read up on what they say about it.
- The blog Biodiversity in Focus (written by a grad student) recently posted about science and Twitter use, and sums up why it’s valuable in a single word: Networking.
- If you are more geoscience-inclined, check out AGU’s piece on Twitter in science.
That being said, I will now pontificate on the value of Twitter for science, in handy numbered list form.
1. It saves me time. This might seem counter-intuitive, but it’s absolutely true. If you are a head-in-the-sand kind of person, this point might not be for you. But I like to know what’s going on in science, science news, the world of science publishing, science funding, etc. etc. That doesn’t even include regular news or local events. The point here is that instead of checking websites, digging through RSS feeds, or having an overfull email inbox, I have filtered all of these things through HootSuite. HootSuite is one of several free services for organizing your Twitter feeds; mine looks like a bunch of columns arranged by topic. That way I can quickly and easily check on the latest info, in a single location. Here’s a screenshot of my HootSuite page, to give you an idea of the possibilities: click to open the PDF: HootSuite_Screenshot
2. It is great for networking. I’ve met quite a few folks via Twitter that I probably never would have encountered otherwise. Some have become important colleagues, others have become friends, and all of them have helped me find resources, information, and insight. I’ve been given academic opportunities based on these relationships and connections. How does this happen? The Twittersphere is intimate and small enough that you can have meaningful interactions with folks. Plus, there’s tweetups, where Twitter folks meet up at a designated physical location for in-person interaction and networking.
3. It’s the best way to experience a conference, whether or not you are physically there. This is what spawned that previous post about Oceanography and the lack of social media use. I was excited to experience my first Ocean Sciences meeting with all of the benefits of Twitter, only to be disappointed at the lack of participation. In a few words, here’s how conference (or any event) tweeting works:
- A hash tag is declared. It’s something short and pithy, like #Oceans2012. How do you find out about the tag? Usually the organizing committee tells you, or in lieu of that you rely on your Twitter network to let you know.
- Everyone who tweets about a conference, interaction, talk, etc. uses the hash tag in their tweet. Examples:

- Hash tags are ephemeral, but they allow you to see exactly who’s talking about something, whether you follow them or not. They are a great way to find people on Twitter that you might want to network with… I’m looking at you, @rejectedbanana @miriamGoldste.
- If you are not able to attend a conference, you can “follow along” on your computer and get real-time feeds of what’s happening. I’ve followed several conferences like this- over the course of the day, I will check in on the feed a few times and see what’s happening. It’s the next best thing to being there.
I could continue expounding the greatness of Twitter, but as I said before, others have done a better job than I could (see links above). No, it’s not for everyone. But keep in mind that you can follow people, hash tags, etc. without actually ever tweeting. You can reap the benefits of everything I mentioned above, except for the networking. Food for thought.
My friend from WHOI, who also attended the Ocean Sciences meeting, emailed me this comment later:
…I must say those “#tweetstars” were pretty smug about their tweeting, like they were sitting at the cool kids’ table during lunch or something…
I countered that it was more like those tweeting at OS were incredulous at the lack of tweets, but yes, we are definitely the cool kids.
Oceanographers: Why So Shy?
Last week I attended the TOS/ASLO/AGU Ocean Sciences 2012 Meeting in Salt Lake City. (If you are a DCXL blog regular, you know I was also at the Personal Digital Archiving 2012 Conference last week: my ears were bleeding by Friday night!). These two conferences were starkly different in many ways. Ocean Sciences had about 4,000 attendees, while PDA was closer to 100. Ocean Sciences had concurrent sessions, plenaries, and workshops, while PDA had only one room where all of the speakers presented. Although both provided provisions during breaks, PDA’s coffee and treats far surpassed those provided at the Salt Palace. But the most interesting difference? The incorporation of social media into the conference.
There are some amazing blogs out there for ocean scientists: Deep Sea News and SeaMonster come to mind immediately. There is also a plethora of active tweeters and bloggers in the ocean sciences community, including @labroides @jebyrnes (and his blog) @MiriamGoldste @RockyRohde @JohnFBruno @kzelnio @SFriedScientist @rejectedbanana @DrCraigMc @rmacpherson @Dr_Bik. (I’m sure I’ve left some great ones out; feel free to tweet me and let me know: @carlystrasser.)
That being said, ocean scientists stink at social media if OS 2012 was any indication.
First, the Ocean Sciences Meeting did not declare a hash tag – this is the first major conference I’ve been to in a while that didn’t do so. What does this mean? Those of us who were trying to communicate about OS 2012 via Twitter were not able to converge under a single hash tag until Tuesday (#oceans2012). Perhaps that isn’t such a big deal since there were only a dozen Tweeters at the conference. This is unusual for a conference of this size: at AGU 2011 in December, I would hazard a guess that there were more like 200 Tweeters. Food for thought.
Second, I heard from @MiriamGoldste that there was actual, audible clapping when disparaging comments were made about social media in one of the presentations. For shame, oceanographers! You should take advantage of tools offered to you; short of using social media yourself, you should recognize its growing importance in science (read some of the linked articles below).
Now for PDA 2012. A hash tag was declared (#pda12) and about two dozen active tweeters were off and running. We had dialogues during the conference, helped answer each other’s questions, commented on speakers’ major conclusions, and generally kept those that couldn’t attend the conference in person abreast of the goings-on. Combine that with real-time blogging of the meeting, and you had a recipe for being connected whether you were sitting in a pew at the Internet Archive or not. Links were tweeted to newly-posted slides, and generally there was a buzz about the conference.
So listen up, OS 2012 attendees: You are being left in the dust by other scientists who have embraced social media. I know what you are thinking: “I don’t have time to do all of that stuff!” One of the conference tweets says it best:
More information…
Read this great post from Scientific American on Social Media for Scientists
COMPASS: Communication partnership for science and the sea. I attended a COMPASS workshop two years ago at NCEAS and was swayed by the lovely Liz Neeley that social media was not only worth my time, but it could advance my career (read “Highly tweeted articles were 11x more likely to be cited” from The Atlantic).
Generally all of the resources on the Social Media For Scientists wikispace
Social Media for Scientists Recap from American Fisheries Society blog
As for how social media relates to the DCXL project, isn’t it obvious? I’ve been collecting feedback straight from potential DCXL users using social media. Because I have tapped into these networks, the DCXL project’s outcomes are likely to be useful for a large contingent of our target audience.

Archiving Your Life: PDA 2012 Meeting
I’m currently sitting in a church. No, I’m not being disrespectful and blogging while at church. Technically, I’m in a former church, in the Richmond District of San Francisco. The Internet Archive bought an old church and turned it into an amazing space for their operation, as well as for meetings like the 2012 Personal Digital Archiving Meeting I’m currently attending.
I wasn’t sure what “personal digital archiving” meant, exactly, before I heard about this conference. It turns out the concept is very familiar to me. It’s basically thinking about how to preserve your life’s digital content – photos, emails, writings, files, scanned images, etc. etc. The concept of archiving personal materials is a very hot topic right now. Think about Facebook, Storify, iCloud, WordPress, and Flickr, to name a few. As a scientist, I actually think of my data as personal digital files: they represent a very long period of my life, after all. So I’m at this meeting talking a bit about DCXL, and also learning a lot about some amazing new stuff. Here are a few interesting tidbits:
Cowbird: This is a place for people to tell stories, rather than just archive their lives. According to the founder (who is attending this conference), Cowbird is about the experience of life, as opposed to merely curating life. For an amazing, moving example of how Cowbird works, check this out: First Love
The Brain: Very cool, free software that helps you organize links, definitions, notes, etc. The idea is that it works just like your brain: it makes connections and creates networks to provide meaning to each link. Play with it a bit and you will be hooked.
Pinboard: Technically, I already knew about Pinboard. But the founder of the bookmarking system gave a great talk, so I’m including it here. Pinboard has been described as how the bookmarking service Delicious used to work, before it stopped working well. For a very small fee (~$10) you can store your bookmarks, tag them, and even save copies of the web pages as they were when you viewed them- this comes in particularly handy if you use a website for research and it might mysteriously disappear without warning. My favorite thing about Pinboard is it isn’t mucked up with ads and other visual distractions.



