
What Scientists Want: The Final Chapter

Here it is: the final installment of the requirements we are submitting for the DCXL add-in. “Requirements” are the capabilities we want the proposed add-in to have, based on discussions with scientists and other stakeholders. For more information, read my three previous posts (here, here, and here) and check out the new Requirements page for more details about each of the proposed requirements.

Requirement 5:    Deposit into a repository

When folks here at CDL and over at Microsoft Research first started talking about this project, the overarching goal was to facilitate data archiving. They had a vision of archiving data using a tool already familiar to scientists: Excel. This requirement is all about archiving data for the long term. We want users of the add-in to essentially click a button and be able to submit their data to a data repository. This particular requirement comes with a lot of challenges, and the details are still very much in flux (see details on the Requirements tab of this site). However, at the root of the simple statement “Deposit into a repository” is the gist of this entire project: not only to manage the data better, but to share it so that others, now and well into the future, can use it.

night deposit box at bank
Wouldn’t it be great to have direct deposit for your data? From Flickr by Jim (jaytay)

Semantics and Data

There is a good reason that this post on semantics is directly preceded by a post on Ontologies and Data: I often get confused about the differences between the two. This is probably because my left brain, which likes to clearly define, categorize, calculate, and organize, struggles with such right-brain definitions as the study of the nature of existence in the case of ontology, and the study of meaning in the case of semantics. Whaa? But I’m feeling a bit cocky because the Ontologies post was the most-read DCXL post so far, so I will tackle semantics while I’m on a roll.

Engrish
It can be difficult to interpret the meanings of words, even when the language is familiar. Used with permission from fundulus77 on Flickr

Semantics: unlike “ontology”, it’s a word we hear in popular media and in conversations with friends. The colloquialism “It’s just semantics” implies that the difference between opinions is purely a verbal quibble, bearing no relationship to anything in the real world (D. Crystal, How Language Works).

Semantics has a different interpretation in the field of linguistics: it’s the study of meaning in language. In data and informatics, it has to do with making sure data and information are machine-readable, and therefore in a set, common format and structure. The data and information can then interact with one another in meaningful ways. The most common way you might hear “semantic” in reference to information science is the semantic web.

The Semantic Web is a “collaborative effort” cooked up by the World Wide Web Consortium (W3C) that promotes common data formats on the web.  The goal is to

…provide a framework that allows data to be shared and reused across application, enterprise, and community boundaries…  It enables machines to “understand” and respond to complex human requests based on their meaning, which requires that the relevant information sources is semantically structured.

Think of it as a universal language for data on the web.

There’s quite a bit of overlap in papers with the keywords “ontologies” and “semantics”. If you need a concrete example based in ecology, the Ecological Complexity paper “An ontology for landscapes” by Lepczyk, Lortie, and Anderson is a good one (doi:10.1016/j.ecocom.2008.04.001):

…This ontology places the [concept of] ‘landscape’ within a broader logical relational ecological context in order to establish formal rules and consistent semantics so that individual researchers can continue to study organisms at select scales while others may potentially integrate results across scales.

Clear definitions of words, their meanings, and their relationships facilitate better science. This is even more true in the era of digital data and the world wide web. The more consistent scientists are with their descriptions, definitions, and data, the more likely it is that those data will be usable in the future. Semantics in science is all about being able to combine, compare, and relate disparate data. The goal is not to make all data uniform, but rather to make them uniformly understandable.
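
To make “uniformly understandable” a little more concrete, here is a minimal Python sketch of two labs that record the same measurement under different names and units. Everything in it is hypothetical and made up for illustration: the column names, the shared terms, the site names, and the values. It is not part of the DCXL add-in, just one way the idea could look.

```python
# Hypothetical shared vocabulary: each local column name maps to a
# common term plus a function that converts the value to the common unit.
SHARED_TERMS = {
    "temp_c": ("water_temperature_celsius", lambda v: v),
    "TempF": ("water_temperature_celsius", lambda v: (v - 32) * 5 / 9),
    "site": ("site_name", lambda v: v),
    "Location": ("site_name", lambda v: v),
}


def to_shared(record):
    """Translate one record's local field names into the shared vocabulary."""
    translated = {}
    for local_name, value in record.items():
        shared_name, convert = SHARED_TERMS[local_name]
        translated[shared_name] = convert(value)
    return translated


# Two made-up records that describe the same kind of observation differently.
lab_a = {"site": "Site A", "temp_c": 12.0}
lab_b = {"Location": "Site B", "TempF": 59.0}

# Each lab keeps its own layout; the mapping makes the records comparable.
combined = [to_shared(lab_a), to_shared(lab_b)]
for row in combined:
    print(row)
```

Neither lab had to change how it records data. The records only become comparable because both sets of names point to the same agreed-upon terms.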

The DCXL project comes into play with semantics in much the same way that it did with ontologies: ideally, we want to be consistent with efforts like the semantic web. The underlying structure of Excel documents (.xlsx) is based on XML (Extensible Markup Language), which is designed for describing data in a machine-readable way. That’s a great first step.
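
If you are curious what “based on XML” looks like in practice, an .xlsx file is a ZIP package whose parts are mostly XML documents. The Python sketch below uses only the standard library to list the XML parts of such a package and read the root element of one of them. The helper names (`xml_parts`, `root_tag`) are my own, and the archive is a tiny stand-in built in memory, not a real workbook. A file saved from Excel contains more parts, and its element names carry XML namespaces.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def xml_parts(package):
    """Return the sorted names of the XML parts inside a ZIP-based package."""
    with zipfile.ZipFile(package) as archive:
        return sorted(name for name in archive.namelist() if name.endswith(".xml"))


def root_tag(package, part_name):
    """Parse one XML part and return the tag of its root element."""
    with zipfile.ZipFile(package) as archive:
        return ET.fromstring(archive.read(part_name)).tag


# Stand-in package (assumption: simplified contents, no namespaces).
# To inspect a real workbook, pass its file path instead of this buffer.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("xl/workbook.xml", "<workbook><sheets/></workbook>")
    archive.writestr("xl/worksheets/sheet1.xml", "<worksheet><sheetData/></worksheet>")

for name in xml_parts(buffer):
    print(name, "->", root_tag(buffer, name))
```

Because the contents are plain XML inside a standard archive, any tool that can unzip a file and parse XML can get at the data without Excel itself.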