Putting the Meta in Metadata

The DCXL project is in full swing now: developers are working closely with Microsoft Research to create the add-in that will revolutionize scientific data curation (in my humble opinion!). Part of this process was deciding how to handle metadata. For a refresher on metadata, i.e., data documentation, read this post about the metadata in DCXL.
Creating metadata was one of the major requirements for the project, and arguably the most challenging task. The challenges stem from the fact that there are many metadata standards out there, and of course, none are perfect for our particular task. So how do we incorporate good work done by others much smarter than me into DCXL, without compromising our need for user-friendly, simple data documentation?
It was tricky, but we came up with a solution that will work for many, if not most, potential DCXL users. A few considerations shaped the metadata route we chose:
- DataONE: We are very interested in making sure that data posted to a repository via the DCXL add-in can be found using the DataONE Mercury metadata search system (called ONE-Mercury; to be released in May). That means we need to make sure we are using metadata that the DataONE infrastructure likes. At this point in DataONE's development, that limits us to the International Organization for Standardization Geospatial Metadata Standard (ISO 19115), the Federal Geographic Data Committee Geospatial Metadata Standard (FGDC), and the Ecological Metadata Language (EML).
- We want metadata created by the DCXL software to be as flexible as possible for as many different types of data as possible. ISO19115 and FGDC are both geared towards spatial data specifically (e.g., GIS). EML is a bit more general and flexible, so we chose to go with it.
- EML is a very well documented metadata schema; rather than include every element of EML in DCXL, we cherry-picked the elements we thought would generate metadata that makes the data more discoverable and useable. Of course, just like never being too skinny or too rich, you can NEVER have too much metadata. But we chose to draw the line somewhere between “not useful at all” and “overwhelming”.
- We ensured that the metadata elements we included could be mapped to DataCite and Dublin Core minimal metadata. This ensures that a data citation can be generated based on the metadata collected for the dataset (a rough sketch of how such a mapping works follows below).
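To make that mapping a little more concrete, here is a minimal, purely illustrative sketch in Python. The element names, the example dataset, and the citation format are all made up for this example; this is not the actual DCXL schema or code.

```python
# Illustrative sketch only: a handful of EML-style elements mapped to
# Dublin Core-style terms, then used to assemble a citation string.
# Field names and values are hypothetical, not the actual DCXL schema.

eml_to_dc = {
    "creator/individualName": "dc:creator",
    "title": "dc:title",
    "pubDate": "dc:date",
    "distribution/online/url": "dc:identifier",
}

record = {
    "creator/individualName": "Smith, J.",
    "title": "Example field dataset",
    "pubDate": "2012",
    "distribution/online/url": "doi:10.xxxx/example",
}

# Translate the EML-style paths into Dublin Core terms.
dc_record = {dc_term: record[eml_path]
             for eml_path, dc_term in eml_to_dc.items()
             if eml_path in record}

# Build a simple citation from the mapped fields.
citation = "{creator} ({date}): {title}. {identifier}".format(
    creator=dc_record["dc:creator"],
    date=dc_record["dc:date"],
    title=dc_record["dc:title"],
    identifier=dc_record["dc:identifier"],
)
print(citation)
# Smith, J. (2012): Example field dataset. doi:10.xxxx/example
```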
Data Publishing and the Coproduction of Quality
This post is authored by Eric Kansa
There is a great deal of interest in the sciences and humanities around how to manage “data.” By “data,” I’m referring to content that has some formal and logical structure needed to meet the requirements of software processing. Of course, distinctions between structured and unstructured data represent more of a continuum or spectrum than a sharp line. What sets data apart from texts, however, is that data are usually intended for transactional applications (queries and visualizations) rather than narrative ones.
The uses of data versus texts make a big difference in how we perceive “quality.” If there is a typo in a text, it usually does not break the entire work. Human readers are pretty forgiving with respect to those sorts of errors, since humans interpret texts via pattern recognition heavily aided by background knowledge and expectations. Small deviations from a reader’s expectations about what should be in a text can be glossed over or even missed entirely. If noticed, many errors annoy rather than confuse. This inherently forgiving nature of text makes editing and copy-editing attention-demanding tasks. One has to struggle to see what is actually written on a page rather than getting the general gist of a written text.
Scholars are familiar with editorial workflows that transform manuscripts into completed publications. Researchers submit text files to journal editors, who then circulate manuscripts for review. When a paper is accepted, a researcher works with a journal editor through multiple revisions (many suggested by peer-review evaluations) before the manuscript is ready for publication. Email, versioning, and edit-tracking help coordinate the work. The final product is a work of collaborative “coproduction” between authors, editors, reviewers, and type-setters.
What does this have to do with data?
Human beings typically don’t read data. We use data mediated through software. The transactional nature of data introduces a different set of issues impacting the quality and usability of data. Whereas small errors in a text often go unnoticed, such errors can have dramatic impacts on the use and interpretation of a dataset. For instance, a misplaced decimal point in a numeric field can cause problems for even basic statistical calculations. Such errors can also break visualizations.
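As a toy illustration of that point (made-up numbers, plain Python), a single misplaced decimal is enough to swamp a simple summary statistic:

```python
# Toy illustration with made-up values: one misplaced decimal point
# (58.0 entered as 580) noticeably shifts the mean.
from statistics import mean

correct = [55.2, 58.0, 56.7, 57.1, 54.9]
typo    = [55.2, 580,  56.7, 57.1, 54.9]   # 58.0 mistyped as 580

print(mean(correct))  # about 56.4
print(mean(typo))     # about 160.8 -- a single bad cell dominates the result
```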
These issues don’t only impact single datasets; they can also wreak havoc in settings where multiple individual datasets need to be joined together. I work mainly on archaeological data dissemination. Archaeology is an inherently multidisciplinary practice, involving inputs from different specialists in the natural sciences (especially zoology, botany, human osteology, and geomorphology), the social sciences, and the humanities. Meaningful integration of these diverse sources of structured data represents a great information challenge for archaeology. Archaeology also creates vast quantities of other digital documentation. A single field project may result in tens of thousands of digital photos documenting everything from excavation contexts to recovered artifacts. Errors and inconsistencies in identifiers can create great problems in joining together disparate datasets, even from a single archaeological project.
It is a tremendous challenge to relate all of these different datasets and media files together in a usable manner. The challenge is further compounded because archaeology, like many small sciences, typically lacks widely used recording terminologies and standards. Each archaeological dataset is custom crafted by researchers to address a particular suite of research interests and needs. This means that workflows and supporting software to find and fix data problems need to be pretty generalized.
Fortunately, archaeology is not alone in needing tools to promote data quality. Google Refine helps meet these needs. Google Refine leverages the transactional nature of data to summarize and filter datasets in ways that make many common errors apparent. Once errors are discovered, Google Refine has powerful editing tools to fix problems. Users can also undo edits to roll back fixes and return a dataset to an earlier state.
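A rough analogue of what a Refine-style facet does can be sketched in a few lines of Python with pandas. The identifiers below are invented, and the sketch is only meant to illustrate the idea of summarizing a column so that near-duplicate values stand out; it does not reproduce Google Refine's actual interface or behavior.

```python
# A rough analogue of a "text facet": count the distinct values in a column
# so that inconsistent spellings of the same identifier stand out.
# The data are made up for illustration.
import pandas as pd

contexts = pd.DataFrame({
    "context_id": ["Trench1-Locus12", "Trench1-Locus12", "trench1 locus 12",
                   "Trench2-Locus03", "Trench2-Locus3"],
    "artifact_count": [14, 9, 3, 21, 5],
})

# "Facet" the identifier column: near-duplicates become visible immediately.
print(contexts["context_id"].value_counts())

# One partial fix before joining with another dataset: normalize case and
# separators. Note that "Locus03" vs "Locus3" still differs afterwards --
# that kind of mismatch needs a human (editorial) judgment call.
contexts["context_id"] = (contexts["context_id"]
                          .str.lower()
                          .str.replace(r"[\s-]+", "", regex=True))
print(contexts["context_id"].value_counts())
```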
With funding from the Alfred P. Sloan Foundation, we’re working to integrate Google Refine in a collaborative workflow called “Data Refine”. Again, the transactional nature of data helps shape this workflow. Because use of data is heavily mediated by software, datasets can be seen as an integral part of software. This thinking motivated us to experiment with using software debugging and issue tracking tools to help organize collaborative work on editing data. Debugging and issue tracking tools are widely used and established ways of improving software quality. They can play a similar role in the “debugging” of data.
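The analogy is easy to picture: a data problem gets written up much like a software bug. The record below is a hypothetical illustration of that idea, not the actual Data Refine or Mantis schema.

```python
# Hypothetical sketch of a data-quality "bug report" -- the same shape of
# record an issue tracker keeps for software bugs, pointed at a dataset
# instead. Field names and values are illustrative only.
data_issue = {
    "dataset": "site_atlas_2011.csv",
    "column": "context_id",
    "rows": [118, 119, 344],
    "summary": "Inconsistent locus identifiers prevent join with faunal data",
    "severity": "blocks-integration",
    "reported_by": "data editor",
    "assigned_to": "contributing researcher",
    "status": "open",          # open -> in progress -> resolved -> verified
    "resolution_note": None,   # filled in once the fix is applied
}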
We integrated Google Refine and the PHP-based Mantis issue tracker to support collaboration in improving data quality. In this approach, contributing researchers and data editors collaborate in the coproduction of higher quality, more intelligible and usable datasets. These workflows try to address both supply and demand needs in scholarship. Researchers face strong and well-known career pressures. Tenure may be worth $2 million or more over the course of a career, and its alternative can mean complete ejection from a discipline. A model of editorially supervised “data sharing as publication” can help better align the community’s interest in data dissemination with the realities of individual incentives. On the demand side, datasets must have sufficient quality and documentation. To give context, data often need to be related and linked with shared concepts and with other datasets available on the Web (as in the case of “Linked Open Data” scenarios).
All of these processes require effort. New skills, professional roles, and scholarly communication channels need to be created to meet the specific requirements of meaningful data sharing. Tools and workflows as discussed here can help make this effort a bit more efficient and better suited to how data are used in research.
Communication Breakdown: Nerds, Geeks, and Dweebs
Last week the DCXL crew worked on finishing up the metadata schema that we will implement in the DCXL project. WAIT! Keep reading! I know the phrase “metadata schema” doesn’t necessarily excite folks – especially science folks. I have a theory for why this might be, and it can be boiled down to a systemic problem I’ve encountered ever since becoming deeply entrenched in all things related to data stewardship: communication breakdown.
I began working with the DataONE group in 2010, and I was quickly overwhelmed by the rather steep learning curve I encountered related to data topics. There was a whole vocabulary set I had to learn, an entire ecosphere of software and hardware, and a hugely complex web of computer science-y, database-y, programming-y concepts to unpack. I persevered because the topics were interesting to me, but I often found myself spending time on websites that were indecipherable to the average intelligent person, or reading 50-page “quick start guides”, or getting entangled in a rabbit hole of Wikipedia entries for new concepts related to data.

I love learning, so I am not one to complain about spending time exploring new concepts. However, I would argue that my difficulties represent a much bigger issue plaguing advances in data stewardship: communication issues. It’s actually quite obvious why these communication problems exist. There are a lot of smart people involved in data, all of whom have very divergent backgrounds. I suggest that the smart people can be broken down into three camps: the nerds, the geeks, and the dweebs. These stereotypes should not be considered insults; rather they are an easy way to refer to scientists, librarians, and computer types. Check out the full Venn diagram of nerds here.
The nerds. This is the group to which I belong. We are specially trained in a field and have in-depth knowledge of our pet projects, but general education about computers, digital data, and data preservation is not part of our training. Certainly that might change in the near future, but in general we avoid the command line like the plague, prefer user-friendly GUIs, and resist any learning of new software, tools, etc. that might take away from learning about our pet projects.
The geeks. Also known as computer folks. These folks might be developers, computer scientists, information technology specialists, database managers, etc. They are uber-smart, but from what I can tell their uber-smart brains do not work like mine. In my experience, geeks can explain things to me in one of two ways:
- “To turn your computing machine on, you need to first plug it in. Then push the big button.”
- “First go to bluberdyblabla and enter c>*#&$) at the prompt. Make sure the juberdystuff is installed in the right directory, though. Otherwise you need to enter #($&%@> first and check the shumptybla before proceeding.”
In all fairness, (1) occurs far less than (2). But often you get (1) after trying to get clarification on (2). How to remedy this? First, geeks should realize that our brains don’t think in terms of directories and command line prompts. We are more comfortable with folders we can color code and GUIs that allow us to use the mouse for making things happen. That said, we aren’t completely clueless. Just remember that our vocabularies are often quite different from yours. Often I’ve found myself writing down terms in a meeting so I can go look them up later. Things like “elements” and “terminal” are not unfamiliar words in and of themselves. However, the contexts in which they are used are completely new to me. And that doesn’t even count the unfamiliar words and acronyms, like APIs, GitHub, Python, and XML.
The dweebs. Also known as librarians. These folks are more often called “information professionals” these days, but the gist is the same – they are all about understanding how to deal with information in all its forms. There’s certainly a bit of crossover with the computer types, especially when it comes to data. However, librarian types are fundamentally different in that they are often concerned with information generated by other people: put simply, they want to help, or at least interact with, data producers. There are certainly a host of terms that are used more often by librarian types: “indexing” and “curation” come to mind. Check out the DCXL post on libraries from January.
Many of the projects in which I am currently involved require all three of these groups: nerds, geeks, and dweebs. I watch each group struggle to communicate their points to the others, and too often decide that it’s not worth the effort. How can we solve this communication impasse? I have a few ideas:
- Nerds: open your minds to the possibility that computer types and librarian types might know about better ways of doing what you are doing. Tap the resources that these groups have to offer. Stop being scared of the unknown. You love learning or you wouldn’t be a scientist; devote some of that love in the direction of improving your computer savvy.
- Geeks: dumb it down, but not too much. Recognize that scientists and librarians are smart, but potentially in very different ways than you. Also, please recognize that change will be incremental, and we will not universally adopt whatever you think is the best possible set of tools or strategies, no matter how “totally stupid” our current workflow seems.
- Dweebs: spend some time getting to know the disciplines you want to help. Toot your own horn: you know A LOT of stuff that nerds and geeks don’t, and you are all so darn shy! Make sure both geeks and nerds know of your capacity to help, and your ability to lend important information to the discussion.
And now a special message to nerds (please see the comment string below about this message and its potential misinterpretation). I plead with you to stop reinventing the wheel. As scientists have begun thinking about their digital data, I’ve seen a scary trend of them taking the initiative to invent standards, start databases, or create software. It’s frustrating to see since there are a whole set of folks out there who have been working on databases, standards, vocabularies, and software: librarians and computer types. Consult with them rather than starting from scratch.
In the case of dweebs, nerds, and geeks, working together makes the whole much, much greater than the sum of our parts.
Popular Demand for Public Data

When talking about data publication, many of us get caught up in protracted conversations aimed at carefully anticipating and building solutions for every possible permutation and use case. Last week’s release of U.S. census data, in its raw, un-indexed form, however, supports the idea that we don’t have to have all the answers to move forward.
Genealogists, statisticians and legions of casual web surfers have been buzzing about last week’s release of the complete, un-redacted collection of scanned 1940 U.S. census data schedules. Though census records are routinely made available to the public after a 72-year privacy embargo, this most recent release marks the first time that the census data set has been made available in such a widely accessible way: by publishing the schedules online.
In the first 3 hours that the data were available, 22.5 million hits crippled the 1940census.archives.gov servers. The following day, nearly 3 times that number of requests continued to hammer the servers as curious researchers scoured the census data looking for relatives of missing soldiers; hoping to find out a little bit more about their own family members; or trying to piece together a picture of life in post-Great Depression, pre-WWII America.
For the time being, scouring the data is a somewhat laborious task of narrowing in on the census schedules for a particular district, then performing a quick visual scan for people’s names. The 3.9 million scanned images that make up the data set are not, in other words, fully indexed — in fact, only a single field (the Enumeration District number field) is searchable. Encoding that field alone took 6 full-time archivists 3 months.
The task of encoding the remaining 5.3 billion fields is being taken up by an army of volunteers. Some major genealogy websites (such as Ancestry.com and MyHeritage.com) hope the crowd-sourced effort will result in a fully indexed, fully searchable database by the end of the year.
Release day for the census has been described as “the Super Bowl for genealogists.” This excitement about data, and participation by the public in transforming the data set into a more useable, indexed form are encouraging indications that those of us interested in how best to facilitate even more sharing and publishing of data online are doing work that has enormous, widely-appreciated value. The crowd-sourced volunteer effort also reminds us that we don’t necessarily have to have all the answers when thinking about publishing data. In some cases, functionality that seems absolutely essential (such as the ability to search through the data set) is work that can (and will) be taken up by others.
So, how about your data set(s)? Who are the professional and armchair domain enthusiasts that will line up to download your data? What are some of the functionality roadblocks that are preventing you from publishing your data, and how might a third party (or a crowd sourced effort) work as a solution? (Feel free to answer in the comments section below.)
Data Citation Redux
I know what faithful DCXL readers are thinking: didn’t you already post about data citation? (For the unfaithful among you, check out this post from last November). Yes, I did. But I’ve been inspired to post yet again because I just attended an amazing workshop about all things data citation related.
The workshop was hosted by the NCAR Library (NCAR stands for National Center for Atmospheric Research) and took place in Boulder on Thursday and Friday of last week. Workshop organizers expected about 30 attendees; more than 70 showed up to learn more about data citation. Hats off to the organizers – there were healthy discussions among attendees and interesting presentations by great speakers.
One of the presentations that struck me most was by Dr. Tim Killeen, Assistant Director for the Geosciences Directorate at NSF. His talk (available on the workshop website) discussed the motivation for data citation, and what policies have begun to emerge. Near the end of a rather long string of reports about data citation, data sharing, and data management, Killeen said “There is a drumbeat into Washington about this.”

This phrase stuck with me long after I flew home because it juxtaposed two things I hadn’t considered as being related: Washington DC and data policy. Yes, I understand that NSF is located in Washington, and that very recently the White House announced some exciting Big Data funding and initiatives. But Washington DC as a whole – Congress, lobbyists, lawyers, judges, etc. – would notice a drumbeat about data? I must say, I got pretty excited about the idea.
What are these reports cited by Killeen? In chronological order:
- NSF’s advisory panel report waaay back in 2003: a “Harbinger of Cyberinfrastructure” according to Killeen
- National Science Board’s report in 2005 on the importance of ensuring digital data are long-lived.
- Final report from ARL/NSF Workshop on Long-Term Stewardship of Digital Data Collections in 2006: called for promoting “change in the research enterprise regarding… stewardship of digital data”
- NSF’s stated vision in a 2007 report Cyberinfrastructure Vision for 21st Century Discovery. The vision? Data being routinely deposited in a well-documented form. Love it.
- A 2009 Report of the Interagency Working Group on Digital Data stated that “all sectors of society are stakeholders in digital preservation and access.” Agreed!
- NSF’s 2012 Vision and Strategic Plan: Cyber Infrastructure Framework for the 21st Century
The NSB report on long-lived digital data had yet another great phrase that stuck with me:
Long-lived digital data collections are powerful catalysts for progress and for democratization of science and education
Wow. I really love the idea of democratized data. It warms the cockles, doesn’t it? With regard to DCXL, the link is obvious. One of the features we are developing is generation of a data citation for your Excel dataset.
The Future of Metrics in Science
Ask any researcher what they need for tenure, and the answer is virtually the same across institutions and disciplines: publications. The “publish or perish” model has reigned supreme for generations of scientists, despite its rather annoying disregard for quality over quantity in publications, for how many collaborations have been established, or even for the novelty or difficulty of a particular research project. This archaic measure of impact tends to rely on measures like a scientist’s number of citations and the impact factor of the journals in which they publish.
With the upswing in blogs, Twitter feeds, and academic social sites like Mendeley, Zotero, and (my favorite) CiteULike, some folks are working on developing a new model for measuring one’s impact on science. Jason Priem, a graduate student at UNC’s School of Information and Library Science, coined the term “altmetrics” rather recently, and the idea has taken off like wildfire.
altmetrics is the creation and study of new metrics based on the Social Web for analyzing, and informing scholarship.
The concept is simple: instead of using traditional metrics for measuring impact (citation counts, journal impact factors), Priem and his colleagues want to take into account more modern measures of impact like number of bookmarks, shares, or re-tweets. In addition, altmetrics seeks to consider not only publications, but associated data or code downloads.
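To make the idea concrete, here is a deliberately naive sketch of what a composite, altmetrics-style indicator might look like. The event types and weights are invented for illustration; they have no connection to how Priem and colleagues actually measure anything.

```python
# Deliberately naive illustration of an altmetrics-style composite score:
# weight several kinds of online activity around a single research output.
# The event types, counts, and weights are all made up for this example.
events = {
    "citations": 4,
    "mendeley_bookmarks": 37,
    "tweets": 112,
    "dataset_downloads": 58,
    "code_forks": 3,
}

weights = {
    "citations": 5.0,
    "mendeley_bookmarks": 1.0,
    "tweets": 0.25,
    "dataset_downloads": 0.5,
    "code_forks": 2.0,
}

# Weighted sum across all recorded activity types.
score = sum(weights[kind] * count for kind, count in events.items())
print(score)  # 120.0 with these made-up numbers
```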

Old-school scientists and Luddites might balk at the idea of measuring a scientist’s impact on the community by the number of re-tweets their article received, or by the number of downloads of their dataset. This reaction can be attributed to several causes, one of which may be an irrational fear of change. But the reality is that the landscape of science is changing dramatically, and the trend towards social media as a scientific tool is only likely to continue. See my blog post on why scientists should tweet for more information on the benefits of embracing one of the aspects of this trend.
Need another reason to get onboard? Funders see the value in altmetrics. Priem, along with his co-PI (and my DataONE colleague) Heather Piwowar, just received $125K from the Sloan Foundation to expand their Total Impact project. Check out the Total Impact website for more information, or read the UNC SILS news story about the grant.
The DCXL project feeds right into the concept of altmetrics. By providing citations for datasets that are housed in data centers, DCXL makes it easier to incorporate the impact of a scientist’s data into measures of their overall impact.
Trending: Big Data
Last week, the White House Office of Science and Technology Policy hosted a “Big Data” R&D event, which was broadcast live on the internet (recording available here, press release available as a pdf). GeekWire did a great piece on the event that provides context. Wondering what “Big Data” means? Keep reading.

Big Data is a phrase being used to describe the huge volume of data being produced by modern technological infrastructure. Some examples include social media and remote sensing instruments. Facebook, Twitter, and other social media are producing huge amounts of data that can be analyzed to understand trends on the Internet. Satellites and other scientific instruments are producing constant streams of data that can be used to assess the state of the environment and understand patterns in the global ecosphere. In general, Big Data is just what it sounds like: a sometimes overwhelming amount of information, flooding scientists, statisticians, economists, and analysts with an ever-increasing pile of fodder for understanding the world.
Big Data is often used alongside the “Data Deluge”, which is a phrase used to describe the onslaught of data from multiple sources, all waiting to be collated and analyzed. The phrase brings about images of being overwhelmed by data: check out The Economist‘s graphic that represents the concept. From Wikipedia:
…datasets are growing so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics, and visualizing.
Despite the challenges of Big Data, folks are hungry for big data sets to analyze. Just this week, the 1940 US Census data was released; there was so much interest in downloading and analyzing the data that the servers crashed. You only need to follow the Twitter hashtag #bigdata to see it’s a very hot topic right now. Of course, Big Data should not be viewed as a bad thing. There is no such thing as too much information; it’s simply a matter of finding the best tools for handling all of those data.
Big Data goes hand-in-hand with Big Science, which is a term first coined back in 1961 by Alvin Weinberg, then the director of the Oak Ridge National Laboratory. Weinberg used “Big Science” to describe large, complex scientific endeavors in which society makes big investments in science, often via government funding. Examples include the US space program, the Sloan Digital Sky Survey, and the National Ecological Observatory Network. These projects produce mountains of data, sometimes continuously 24 hours a day, 7 days a week. Therein lies the challenge and awesomeness of Big Data.
What does all of this mean for small datasets, like those managed and organized in Excel? The individual scientist with their unique, smaller scale dataset has a big role in the era of Big Data. New analytics tools for meta-analysis offer a way for individuals to participate in Big Science, but we have to be willing to make our data standardized, useable, and available. The DCXL add-in will facilitate all three of these goals.
In the past, meta-analysis of small data sets meant digging through old papers, copying data out of tables or reconstructing data from graphs. Wondering about the gland equivalent of phenols from castoreum? Dig through this paper and reconstruct the data table in Excel. Would you like to combine that data set with data on average amounts of neutral compounds found in one beaver castor sac? That’s another paper to download and more data to reconstruct. By making small datasets available publicly (with links to the datasets embedded in the paper), and adhering to discipline-wide standards, meta-analysis will be much easier and small datasets can be incorporated into the landscape of Big Science. In essence, the whole is greater than the sum of the parts.
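Once small datasets are published in a standardized form, pooling them stops being an exercise in retyping tables from old papers. Here is a minimal sketch of the idea in Python with pandas; the column names and values are hypothetical, simply assuming two labs publish castoreum measurements with the same column conventions.

```python
# Minimal sketch: two small, independently published datasets that follow the
# same column conventions can be pooled for meta-analysis in a few lines.
# Columns and values are hypothetical.
import pandas as pd

lab_a = pd.DataFrame({
    "compound": ["phenol", "catechol"],
    "mean_mg_per_sac": [1.2, 0.8],
    "source": ["lab_a_2010"] * 2,
})
lab_b = pd.DataFrame({
    "compound": ["phenol", "catechol"],
    "mean_mg_per_sac": [1.5, 0.6],
    "source": ["lab_b_2012"] * 2,
})

# Pool the datasets and compute a cross-study mean per compound.
pooled = pd.concat([lab_a, lab_b], ignore_index=True)
print(pooled.groupby("compound")["mean_mg_per_sac"].mean())
```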
Think you can take on the Data Deluge? NSF’s funding call for big data proposals is available here.
DataShare: A Plan to Increase Scientific Data Sharing
This post was co-authored by Dr. Michael Weiner, CIND director at UCSF
The DataShare project is a collaboration between the University of California San Francisco’s Clinical and Translational Science Institute, the UCSF Library, and the UC Curation Center (UC3) at the California Digital Library. The goal of the DataShare project is to achieve widespread voluntary sharing of scientific data at the time of publication. This will be achieved by creating a data sharing website that could be used by all UCSF investigators, and ultimately by others in the UC system and other institutions. Currently, data sharing is mostly done by large, well-funded, multi-investigator projects. There would be great benefit if much more raw data were widely shared, especially data from individual investigators.

This project is the brainchild of Michael Weiner, M.D., director of the Center for Imaging of Neurodegenerative Diseases. Weiner’s experience as the Principal Investigator of the Alzheimer’s Disease Neuroimaging Initiative (ADNI) led him to conclude that widespread data sharing can be achieved now, with great scientific and economic benefits. All ADNI raw data is immediately shared (at UCLA/LONI/ADNI) with all scientists in the world without embargo. The project has been very successful: more than 300 publications, with many more submitted. This success demonstrates the feasibility and benefits of sharing data.
Individual initiatives:
The laboratory at the Center for Imaging of Neurodegenerative Diseases began to share data at the time of publication in 2011. This included both raw data and a description of how the raw data was processed and analyzed, leading to the findings in the publication. For the DataShare project, the following expansions to data sharing are planned:
- ADNI scientists will be encouraged to share the raw data of their ADNI papers, and other papers from their laboratories
- Other faculty in the Department of Radiology at UCSF and our collaborators in Neurology and Psychiatry at UCSF will be encouraged to share their raw data
- Chancellor, Deans, and Department Chairs at UCSF will be urged to make more widespread voluntary sharing of scientific data a UCSF priority/policy; this may include providing storage space for shared data and/or development of policies which would reward data sharing in the hiring and promotion process
- The example UCSF sets may then encourage the entire University of California system to implement similar changes
- Other collaborators and colleagues in other universities around the world will then be encouraged to adopt similar policies
- A “data sharing impact factor” will be developed and tested which will allow scientists to cite others’ data that they use and provide metrics for how others are using their data.
Institutional initiatives:
The project seeks to encourage involvement by the National Institutes of Health (NIH), the National Science Foundation (NSF), and the National Library of Medicine (NLM), to promote and facilitate sharing of scientific data. This will be accomplished via five tasks:
- Encourage NIH and NSF to emphasize and expand their existing policies concerning data sharing and notify the scientific community of this greater emphasis
- Promote the establishment of a small group of committed individuals who can help formulate policy for NIH in this area, including a policy framework that favors open availability of scientific data.
- Establish technical mechanisms for data sharing, such as a national system for storage of all raw scientific data (e.g., a national data repository or data bank). This repository may be created by NLM, or be housed at universities, foundations, or private companies (e.g., Dataverse).
- Work to develop incentives for scientists and institutions to share their raw data. This may include
- Requesting reports in non-competitive reviews, competitive reviews, and/or new applications
- Instructing the reviewers to consider data sharing in assessing priority scores in grant reviews
- Acknowledgment in publications
- Providing affordable access to infrastructure, i.e. software and media, which facilitates data sharing
- Encouraging NIH to provide funding for small grants aimed at promoting and taking advantage of shared data. Examples include projects that utilize data mining or cloud computing.
The potential gains from widespread sharing of raw scientific data greatly outweigh the relatively small costs involved in developing the necessary infrastructure. Industries likely to benefit from increased accessibility of large amounts of raw data include the pharmaceutical and health care industry, chemistry, technology, engineering, etc. We also expect new technologies and new companies to develop to take advantage of newly available data. Furthermore, there will be substantial societal benefits gained by widespread sharing of scientific data, primarily due to the ability to link data sets and repurpose data for making unforeseen discoveries.
Survey says…
A few weeks ago we reached out to the scientific community for help on the direction of the DCXL project. The major issue at hand was whether we should develop a web-based application or an add-in for Microsoft Excel. Last week, I reported that we decided that rather than choose, we will develop both. This might seem like a risky proposition: the DCXL project has a one-year timeline, meaning this all needs to be developed before August (!). As someone in a DCXL meeting recently put it, aren’t we settling for “twice the product and half the features”? We discussed what features might need to be dropped from our list of desirables based on the change in trajectory, however we are confident that both of the DCXL products we develop will be feature-rich and meet the needs of the target scientific community. Of course, this is made easier by the fact that the features in the two products will be nearly identical.

How did we arrive at developing an add-in and a web app? By talking to scientists. It became obvious that there were aspects of both products that appeal to our user communities based on feedback we collected. Here’s a summary of what we heard:
Show of hands: I ran a workshop on Data Management for Scientists at the Ocean Sciences 2012 Meeting in February. At the close of the workshop, I described the DCXL project and went over the pros and cons of the add-in option and the web app option. By show of hands, folks in the audience voted about 80% for the web app (n ≈ 150).
Conversations: here’s a sampling of some of the things folks told me about the two options:
- “I don’t want to go to the web. It’s much easier if it’s incorporated into Excel.” (add-in)
- “As long as I can create metadata offline, I don’t mind it being a web app. It seems like all of the other things it would do require you to be online anyway” (either)
- “If there’s a link in the spreadsheet, that seems sufficient. (either) It would be better to have something that stays on the menu bar no matter what file is open.” (add-in)
- “The updates are the biggest issue for me. If I have to update software a lot, I get frustrated. It seems like Microsoft is always making me update something. I would rather go to the web and know it’s the most recent version.” (web app)
- Workshop attendee: “Can it work like Zotero, where there’s ways to use it both offline and online?” (both)
Survey: I created a very brief survey using the website SurveyMonkey. I then sent the link to the survey out via social media and listservs. Within about a week, I received over 200 responses.
(Charts in the original post: education level of respondents; survey questions & answers.)
So with those results, there was a resounding “both!” emanating from the scientific community. First we will develop the add-in since it best fits the needs of our target users (those who use Excel heavily and need assistance with good data management skills). We will then develop the web application, with the hope that the community at large will adopt and improve on the web app over time. The internet is a great place for building a community with shared needs and goals; we can only hope that DCXL will be adopted as wholeheartedly as other internet sources offering help and information.
Data Publishing–the First 500 Years
Data publishing is the new hot topic in a growing number of academic communities. The scholarly ecosophere is filled with listserv threads, colloquia and conference hallway chats punctuated with questions of why to do it, how to do it, where to do it, when to do it and even what to call this seemingly new breed of scholarly output. Scholars, and those who provide the tools and infrastructure to support them, are consumed with questions that don’t seem to have easy answers, and certainly not answers that span all disciplines. How can researchers gain credit for the data they produce, separate from and in addition to the analysis of those data as articulated in formal publications? How can scholars researching an area find relevant data sets within and across their own disciplines? How are data and methodologies most effectively reviewed, validated, and corrected? How are meaningful connections maintained between different versions, iterations, and “re-uses” of a given set of data?
The high-pitched level of debate on these topics is surprising in some ways given that datasets, at least in certain fields, have been readily available for a while. The Inter-University Consortium for Political and Social Research (ICPSR) has been allowing social scientists to publish or find datasets since the early 1960s. Great datasets gathered and published under institutional auspices include UN Data, the UNESCO Statistical Yearbook, and the IMF Depository library program. Closer to home is the United States’ Federal Depository Library program, which since its establishment in 1841 has served as a distribution mechanism to ensure public access to governmental documents and data.
While these outlets are only viable solutions for some disciplines, their presence started me down a path exploring the history of data publishing in an effort to try to gain some perspective on the challenges we are facing today. Somewhat surprisingly, data publishing, conducted in a manner that would be recognized by today’s scholars, has been occurring for almost half a millennium. Yes, that’s right; we are now 500 years into producing, analyzing and publishing data.
These early activities centered on demographic data, presumably in an effort to identify and understand the dramatic patterns of life and death. Starting in the late 1500s and prompted by the Plague, “Bills of Mortality” recording deaths within London began to be published, soon on a weekly basis. That raw data generation got noticed by a community-minded draper: the extremely bright, but non-university-affiliated, London resident John Graunt. Graunt was inspired to gather those numerical lists, turn them into a dataset, analyze those data (looking for causes of death, ages of death, comparisons of rates between London and elsewhere, etc.), and publish both the dataset and his findings regarding population patterns in a groundbreaking 1662 work, “Natural and Political Observations Mentioned in a Following Index, and made upon the Bills of Mortality.” The work was submitted to the Royal Society of Philosophers, which recognized its merit and inducted the author into its fellowship. Graunt continued to extend his data and analysis, publishing new versions of each in subsequent years. Thus was born the first great work (at least in the Western world) of statistical analysis, or “political arithmetic” as it came to be called at that time.
Moving from the 16th and 17th centuries to the 18th brings us to another major point in data publishing history with Johann Sussmilch of Germany. Sussmilch was originally a cleric involved in a variety of intellectual pursuits, though unaffiliated with a university, at least initially. Sussmilch’s interests included theology, statistics, and linguistics. He was eventually appointed to the Royal Academy of Sciences and Fine Arts for his linguistic scholarship. Sussmilch’s great work was the “Divine Order,” an ambitious effort to collect detailed data about the population in Prussia in order to prove his religious theory of “Rational Theology.” In other words, Sussmilch was engaged in a basic research program: he had a theory, formed a research question, collected the data required to test that theory, analyzed his data, and then published his results along with his data.
The rigorous quality of Sussmilch’s work (both the data and the analysis) elevated it far beyond his original and personal religious motivations, leading it to have a wide impact throughout parts of Europe. It became a focal point of exchange between scholars across countries and prompted debate over his data collection methodology and interpretation. Put another way, Sussmilch’s work inspired his colleagues to engage in the modern model of “scholarly communication” – engaging in a spirited critical dialogue which in turn resulted in changes to the next edition of the work (for instance, separate tables for immigration and racial data). Published first in 1741, it was updated and reprinted six times through 1798.
In this earlier time, as in our own, the drive to engage with other intellectuals was paramount. Publishing, sharing, critiquing and modifying data production efforts and analysis was seemingly as much a driving force among this community as it is among the scholars of today. Researchers of the 17th and 18th centuries dealt with issues of attribution, review of data veracity and analytical methodology and even versioning. The surprising discovery of apparent similarities across such a large gulf of time prompts many questions. If data could be published and shared centuries ago, why are we faced with such tremendous challenges to do the same today? Are we overlooking approaches from the past that could help us today? Or are we glossing over the difficulties of the past?
More research would have to be done to answer these questions thoroughly, but perhaps a gesture can be made in that regard by identifying some of the contrasting aspects between yesterday and today. Taking the examples from above as a jumping off point, perhaps the most striking difference between past activities and the goals articulated in conversations about data publishing today is that the data publication efforts of the past were accompanied by an equally important piece of analysis, and the research community was interested in both. The conclusions drawn from the data were held to scrutiny as were the data and data collection methods that provided their foundation. All of the components of the research were of concern. These scholars were not interested in publishing the data on their own, but rather wanted to present them along with their arguments, with each underscoring the other.
Another difference is the changing relationships between individual researchers and the entities that support them. Not only do we have governments and academic institutions, but we have a new contemporary player, the corporation, which is driven by a substantially different motivation from entities of past ages. In addition, a broader range of disciplines is now concerned with data publication, and perhaps those disciplines face stumbling blocks not at issue for the social scientists working with demographic and public health data. Given the known heterogeneity of scholarly communication practices across different fields, there seems to be no reason to think that data publishing needs, expectations and concerns would not also vary. And of course, the most obvious difference between then and now is with tools and technology. Have those advancements altered fundamental data publishing practices and if so, how?
These are interesting, but complex questions to pursue. Fortunately, what the above examples of our data publishing antecedents have hopefully revealed is that there are meaningful touchstones to use as reference points as we attempt to address these points. Data publishing has a rich, resonant past stretching back hundreds of years, providing us with an opportunity to reach into that past to better understand the trajectory that has brought us to this moment, thereby helping us more effectively grapple with the questions that seem to confound us today.




