Last week some pretty fabulous speakers congregated in Chicago for the Microsoft eScience Workshop, which was scheduled to coincide with the 2012 IEEE eScience Conference (IEEE stands for the Institute of Electrical and Electronics Engineers, though their conference has evolved into a general tech gathering). Having never attended either event, I wasn’t sure what to expect. I was invited to participate because of DataUp; I took part in the “DemoFest” and led a panel on data curation that included an overview of the DataUp tool and DataONE.
I was pleasantly surprised by the workshop’s breadth and depth of topics. My favorite session by far, however, was titled “Publishing and eScience”, co-chaired by Mark Abbott (dean of the College of Earth, Ocean, and Atmospheric Sciences at Oregon State) and Jeff Dozier (faculty at the UCSB Bren School of Environmental Science and Management). Abbott and Dozier were joined by Jim Frew (also of UCSB) and Shuichi Iwata (Emeritus Professor at the University of Tokyo). The topic du jour was how to maintain dataset provenance, especially for datasets that are used to publish results.
If the word “provenance” is throwing you for a loop, you aren’t alone. Many researchers aren’t familiar with this term as it relates to research. It’s more commonly used in, say, the art or museum world. From Wikipedia:
…from the French provenir, “to come from”, refers to the chronology of the ownership or location of a historical object.
In his talk “When Provenance Gets Real”, Frew exploited our familiarity with provenance as an art term by describing the 2009 story of a painting attributed to Leonardo da Vinci after his fingerprint was discovered on the canvas. (Read a summary on the Park West Gallery blog or from CNN.) The painting was originally bought for $19,000 in 2007; based on its clarified provenance, however, it is now worth around $160 million. It is hard to estimate what a well-documented dataset with excellent provenance is worth; we should always operate under the assumption, however, that future users of our data might do spectacularly important things with it. I like the fact that, in this scenario, I can be the Leonardo da Vinci of data.
Provenance is something I’ve blogged about before (see my two posts on workflows: informal and formal). It’s a topic near and dear to my heart, since I believe that documenting and archiving provenance will be the next major frontier for scientific research and advancement. The discussion during the workshop session ran the gamut from informal to formal; one particularly fabulous moment was when Jim Frew projected a scripted workflow (UNIX, no less!) to demonstrate what provenance looks like in the real world. Frew went on to suggest that provenance for digital resources is the foundation for other important scientific concepts, like authenticity, trust, and reproducibility. Hear, hear!
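To give a flavor of what a scripted workflow with provenance might look like, here’s a minimal sketch (my own invention, not Frew’s actual script; the file names and log format are hypothetical): each step of the pipeline is run through a wrapper that records a timestamp and the exact command in a provenance log, and checksums tie each derived file back to the inputs that produced it.

```shell
#!/bin/sh
# Hypothetical sketch of a scripted workflow that records its own provenance.
# File names and the log format are invented for illustration.

LOG=provenance.log

run_step() {
  # Log a UTC timestamp and the exact command, then execute the step itself.
  printf '%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$LOG"
  "$@"
}

# A toy two-step pipeline: generate raw data, then derive a product from it.
run_step sh -c 'seq 1 5 > raw_data.txt'
run_step sh -c 'awk "{sum += \$1} END {print sum}" raw_data.txt > derived_sum.txt'

# Checksums tie each derived file back to the exact inputs that produced it.
cksum raw_data.txt derived_sum.txt >> "$LOG"
```

The point of even a crude log like this is that someone picking up derived_sum.txt later can trace exactly which commands and which version of the raw data produced it, which is the kernel of authenticity, trust, and reproducibility.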
I did a rough Storify with tweets from the workshop. Check it out: Storify for Microsoft eScience Workshop. You can also check out videos of the workshop presentations on the Microsoft website.