
Finding a Home For Your Data

Where do I put my data? This question comes up often when I talk with researchers about long-term data archiving. There are many barriers to sharing and archiving data, and a lack of knowledge about where to archive the data is certainly among them.

First and foremost, choose early.  By choosing a data repository early in the life of your research, you guarantee that there will be no surprises when it comes time to deposit your dataset.  If there are strict metadata requirements for the repository you choose, it is immensely beneficial to know those requirements before you collect your first data point.  You can design your data collection materials (e.g., spreadsheets, data entry forms, or scripts for automating computational tasks) to ensure seamless metadata creation.
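
To make this concrete, here is a minimal sketch in Python of a collection script that writes each batch of observations together with a sidecar metadata file. The column names and metadata elements are hypothetical; a real repository will specify its own required elements.

```python
# A minimal sketch of metadata-ready data collection. The column names and
# metadata fields below are hypothetical placeholders, not the requirements
# of any particular repository.
import csv
import json
from datetime import date

observations = [
    {"site": "A1", "date": "2012-03-01", "temperature_c": 11.2},
    {"site": "A2", "date": "2012-03-01", "temperature_c": 10.8},
]

# Write the data themselves as plain CSV.
with open("observations.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["site", "date", "temperature_c"])
    writer.writeheader()
    writer.writerows(observations)

# Capture the metadata at the same moment as the data, so nothing has to be
# reconstructed from memory when it is time to deposit the dataset.
metadata = {
    "title": "Example temperature observations",
    "creator": "Jane Researcher",
    "recorded": str(date.today()),
    "units": {"temperature_c": "degrees Celsius"},
    "methods": "Handheld thermometer, two readings per site",
}
with open("observations_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```

The details matter far less than the habit: the metadata gets written alongside the very first data point, not reconstructed months later.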

Second, choose carefully. Data repositories (also known as data centers or archives) are plentiful.  Choosing the correct repository for your specific dataset is important: you want the right people to be able to find your data in the future.  If you research how climate change affects unicorns, you don’t want to store your dataset at the Polar Data Centre (unless you study arctic unicorns).  Researchers hunting for unicorn data are more likely to check the Fantasy Creatures Archive, for instance.

So how do you choose your perfect data repository? There is no match.com or eHarmony for matching data with its repository, but a close second is Databib. This great resource was born from a partnership between the Purdue and Penn State Libraries, funded by the Institute of Museum and Library Services. The site allows you to search for keywords in its giant list of data repositories. For instance, “unicorns” surprisingly brings up no results, but “marine” brings up seven repositories that house marine data. Don’t see your favorite data repository? The site welcomes feedback (including suggestions for additional repositories not in their database).

Another good way to discover your perfect repository: think about where you go to find datasets for your research.  Try asking around too – where do your colleagues look for data? Where do they archive it?  Look for mentions of data centers in publications that you read, and perhaps do a little web hunting.

A final consideration: some institutions provide repositories for their researchers. Often you can store your data in these repositories while you are still working on it (password protected, of course), using the data center as a kind of backup system.  These institutional repositories have benefits like more personal IT service and cheaper (or free) storage rates, but they might make your data more difficult to find for those in your research field.  Examples of institutional repositories are DSpace@MIT, DataSpace at Princeton, and the Merritt Repository right here at CDL.

The DCXL add-in will initially connect with CDL’s Merritt Repository. Datasets submitted to Merritt via DCXL will be integrated into the DataONE network, ensuring that they are accessible and discoverable. Long live Excel data!
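
For the curious, a programmatic deposit might look roughly like the sketch below. The endpoint, token, and response field are hypothetical placeholders, not Merritt’s actual interface; the point is only to show the general shape of pairing a data file with its metadata in a single submission.

```python
# A hedged sketch of depositing a dataset over HTTP. The URL, token, and
# response format are hypothetical and do NOT describe Merritt's real API.
import requests

DEPOSIT_URL = "https://repository.example.org/api/deposit"  # hypothetical endpoint
API_TOKEN = "replace-with-your-token"                        # hypothetical credential

with open("observations.csv", "rb") as data, \
     open("observations_metadata.json", "rb") as meta:
    response = requests.post(
        DEPOSIT_URL,
        headers={"Authorization": "Bearer " + API_TOKEN},
        files={"data": data, "metadata": meta},
    )

response.raise_for_status()
# Assume the (hypothetical) service answers with an identifier for the deposit.
print("Deposit accepted:", response.json().get("identifier"))
```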

Not familiar with the song Homeward Bound? Check out this amazing performance by Simon and Garfunkel.

The Science of the DeepSea Challenge

Recently the film director and National Geographic explorer-in-residence James Cameron descended to the deepest spot on Earth: the Challenger Deep in the Mariana Trench. He partnered with lots of sponsors, including National Geographic and Rolex, to make this amazing trip happen. A lot of folks outside of the scientific community might not realize this, but until this week, there had been only one successful descent to the trench by a human-occupied vehicle (that’s a submarine for you non-oceanographers). You can read more about that 1960 exploration here and here.

I could go on about how astounding it is that we know more about the moon than the bottom of the ocean, or discuss the seemingly intolerable physical conditions found at those depths, most prominently the extremely high pressure. However, what I immediately thought when reading the first few articles about this expedition was: where are the scientists?

Before Cameron, Swiss oceanographer Jacques Piccard and Navy officer Don Walsh descended in the bathyscaphe Trieste to the virgin waters of the deep. From www.history.navy.mil/photos/sh-usn/usnsh-t/trste-b

After combing through many news stories, several National Geographic sites including the site for the expedition, and a few press releases, I discovered (to my relief) that there are plenty of scientists involved.  The team that’s working with Cameron includes scientists from Scripps Institution of Oceanography (the primary scientific partner and long-time collaborator with Cameron),  Jet Propulsion Lab, University of Hawaii, and University of Guam.

While I firmly believe that the success of this expedition will be a HUGE accomplishment for science in the United States, I wonder if we are sending the wrong message to aspiring scientists and youngsters in general. We are celebrating the celebrity film director involved in the project rather than the huge team of well-educated, interesting, and devoted scientists who are also responsible for this spectacular feat (I found fewer than five scientists’ names in my internet hunt). Certainly Cameron deserves the bulk of the credit for enabling this descent, but I would like there to be a bit more emphasis on the scientists as well.

Better yet, how about emphasis on the science in general? It’s too early for them to release any footage from the journey down, but I’m interested in how the samples will be (or were) collected, how they will be stored, what analyses will be done, whether there are experiments planned, and how the resulting scientific advances will be made just as public as Cameron’s trip was. The expedition site has plenty of information about the biology and geology of the trench, but it’s just background: there appears to be nothing about scientific methods or plans to ensure that this project will yield the maximum scientific advancement.

How does all of this relate to data and DCXL? I suppose this post falls in the category of data is important.  The general public and many scientists hear the word “data” and glaze over.  Data isn’t inherently interesting as a concept (except to a sick few of us).  It needs just as much bolstering from big names and fancy websites as the deep sea does.  After all, isn’t data exactly what this entire trip is about?  Collecting data on the most remote corners of our planet? Making sure we document what we find so others can learn from it?

Here’s a roundup of some great reads about the Challenger expedition:

The Digital Dark Age, Part 2

Earlier this week I blogged about the concept of a Digital Dark Age. This is a phrase that some folks use to describe a future scenario in which we are unable to read historical digital documents and multimedia because they have been rendered obsolete or were otherwise poorly archived. But what does this mean for scientific data?

Consider that Charles Darwin’s notebooks were recently scanned and made available online.  This was possible because they were properly stored and archived, in a long-lasting format (in this case, on paper).  Imagine if he had taken pictures of his finch beaks with a camera and saved the digital images in obsolete formats.  Or ponder a scenario where he had used proprietary software to create his famous Tree of Life sketch.  Would we be able to unlock those digital formats today?  Probably not.  We might have lost those important pieces of scientific history forever.   Although it seems like software programs such as Microsoft Excel and MATLAB will be around forever, people probably said similar things about the programs Lotus 1-2-3 and iWeb.
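
One practical hedge against that kind of obsolescence is to keep a plain-text copy of anything locked in a proprietary format. Here is a minimal sketch, assuming a hypothetical workbook name and the pandas library (with an Excel reader such as openpyxl) installed:

```python
# Export every sheet of a spreadsheet to CSV, a plain-text format that future
# software should still be able to read. The file name is hypothetical.
import pandas as pd

# sheet_name=None reads every sheet into a dict of {sheet name: DataFrame}.
sheets = pd.read_excel("finch_measurements.xlsx", sheet_name=None)

for name, frame in sheets.items():
    frame.to_csv("finch_measurements_{}.csv".format(name), index=False)
```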

“Darwin with Finches” by Diana Sudyka, from Flickr by Karen E James

It is a common misconception that things that are posted on the internet will be around “forever”.  While that might be true of embarrassing celebrity photos, it is much less likely to be true for things like scientific data.  This is especially the case if data are kept on a personal/lab website or archived as supplemental material, rather than being archived in a public repository (See Santos, Blake and States 2005 for more information).  Consider the fact that 10% of data published as supplemental material in the six top-cited journals was not available a mere five years later (Evangelou, Trikalinos, and Ioannidis, 2005).

Natalie Ceeney, chief executive of the National Archives, summed it up best in this quote from The Guardian’s 2007 piece on preventing a Digital Dark Age: “Digital information is inherently far more ephemeral than paper.”

My next post and final DDA installment will provide tips on how to avoid losing your data to the dark side.

Archiving Your Life: PDA 2012 Meeting

I’m currently sitting in a church.  No, I’m not being disrespectful and blogging while at church.  Technically, I’m in a former church, in the Richmond District of San Francisco.  The Internet Archive bought an old church and turned it into an amazing space for their operation, as well as for meetings like the 2012 Personal Digital Archiving Meeting I’m currently attending.

I wasn’t sure what “personal digital archiving” meant, exactly, before I heard about this conference. It turns out the concept is very familiar to me. It’s basically thinking about how to preserve your life’s digital content: photos, emails, writings, files, scanned images, and so on. The concept of archiving personal materials is a very hot topic right now. Think about Facebook, Storify, iCloud, WordPress, and Flickr, to name a few. As a scientist, I actually think of my data as personal digital files: they represent a very long period of my life, after all. So I’m at this meeting talking a bit about DCXL, and also learning a lot about some amazing new stuff. Here are a few interesting tidbits:

Cowbird: This is a place for people to tell stories, rather than just archive their lives. According to the founder (who is attending this conference), Cowbird is about the experience of life, as opposed to merely curating life. For an amazing, moving example of how Cowbird works, check this out: First Love

The Brain: Very cool, free software that helps you organize links, definitions, notes, etc.  The idea is that it works just like your brain: it makes connections and creates networks to provide meaning to each link.  Play with it a bit and you will be hooked.

Pinboard: Technically, I already knew about Pinboard, but the founder of the bookmarking system gave a great talk, so I’m including it here. Pinboard has been described as how the bookmarking service Delicious used to work, before it stopped working well. For a very small fee (~$10) you can store your bookmarks, tag them, and even save copies of web pages as they were when you viewed them. This comes in particularly handy when a website you rely on for research mysteriously disappears without warning. My favorite thing about Pinboard is that it isn’t mucked up with ads and other visual distractions.
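
Pinboard also offers a simple HTTP API for saving bookmarks from your own scripts. The sketch below reflects my reading of the v1 documentation (the token is a placeholder), so double-check the current docs before relying on it.

```python
# A small sketch of saving a tagged bookmark through Pinboard's v1 API.
# The auth token is a placeholder; see https://pinboard.in/api/ for details.
import requests

AUTH_TOKEN = "username:XXXXXXXXXXXXXXXX"  # from your Pinboard settings page

response = requests.get(
    "https://api.pinboard.in/v1/posts/add",
    params={
        "auth_token": AUTH_TOKEN,
        "url": "https://example.org/useful-paper",
        "description": "A page I might need again",  # Pinboard's name for the title
        "tags": "research reference",
        "format": "json",
    },
)
response.raise_for_status()
print(response.json())  # expect {"result_code": "done"} on success
```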

The church meant for worship of all things digital: The Internet Archive. From Flickr by evan_carroll