Finding a Home For Your Data

Where do I put my data?  This question comes up often when I talk with researchers about long-term data archiving.  There are many barriers to sharing and archiving data, and a lack of knowledge about where to archive the data is certainly among them.

First and foremost, choose early.  By choosing a data repository early in the life of your research, you guarantee that there will be no surprises when it comes time to deposit your dataset.  If there are strict metadata requirements for the repository you choose, it is immensely beneficial to know those requirements before you collect your first data point.  You can design your data collection materials (e.g., spreadsheets, data entry forms, or scripts for automating computational tasks) to ensure seamless metadata creation.
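One way to act on this advice is to build the repository's metadata requirements into your data collection materials from day one. The sketch below (illustrative only, not part of DCXL; all column names are made up) writes an empty data-entry template alongside a companion data dictionary that records the description and units a repository would likely ask for:

```python
# A minimal sketch of metadata-friendly data collection: a blank CSV
# template for data entry, plus a data dictionary describing every
# column. Fill in the dictionary before your first data point.
import csv

columns = [
    # (column name, description, units) -- all hypothetical examples
    ("site_id", "Unique identifier for the sampling site", "none"),
    ("sample_date", "Date of collection, ISO 8601 (YYYY-MM-DD)", "none"),
    ("water_temp", "Water temperature at 1 m depth", "degrees Celsius"),
]

# The empty data-entry sheet: just the column headers.
with open("data_template.csv", "w", newline="") as f:
    csv.writer(f).writerow([name for name, _, _ in columns])

# The companion data dictionary, one row per column of the template.
with open("data_dictionary.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["column", "description", "units"])
    writer.writerows(columns)
```

When deposit time comes, the data dictionary already contains most of what a repository's metadata form will ask about your variables.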

Second, choose carefully. Data repositories (also known as data centers or archives) are plentiful.  Choosing the correct repository for your specific dataset is important: you want the right people to be able to find your data in the future.  If you research how climate change affects unicorns, you don’t want to store your dataset at the Polar Data Centre (unless you study arctic unicorns).  Researchers hunting for unicorn data are more likely to check the Fantasy Creatures Archive, for instance.

So how do you choose your perfect data repository?  There is no match.com or eHarmony for matching data with its repository, but a close second is Databib. This is a great resource born from a partnership between Purdue and Penn State Libraries, funded by the Institute of Museum and Library Services.  The site allows you to search for keywords in their giant list of data repositories.  For instance, “unicorns” surprisingly brings up no results, but “marine” brings up seven repositories that house marine data.  Don’t see your favorite data repository?  The site welcomes feedback (including suggestions for additional repositories not in their database).

Another good way to discover your perfect repository: think about where you go to find datasets for your research.  Try asking around too – where do your colleagues look for data? Where do they archive it?  Look for mentions of data centers in publications that you read, and perhaps do a little web hunting.

A final consideration: some institutions provide repositories for their researchers. Often you can store your data in these repositories while you are still working on it (password protected, of course), using the data center as a kind of backup system.  These institutional repositories have benefits like more personal IT service and cheaper (or free) storage rates, but they might make your data more difficult to find for those in your research field.  Examples of institutional repositories are DSpace@MIT, DataSpace at Princeton, and the Merritt Repository right here at CDL.

The DCXL add-in will initially connect with CDL’s Merritt Repository.  The datasets submitted to Merritt via DCXL will be integrated into the DataONE network to ensure that data deposited via DCXL is accessible and discoverable.  Long live Excel data!

Not familiar with the song Homeward Bound? Check out this amazing performance by Simon and Garfunkel 

Putting the Meta in Metadata

The prefix “meta” implies “abstraction”. Metadata is an abstraction of the data. My favorite abstract artist is Jean-Michel Basquiat. See the connection? From www.provisionslibrary.com

The DCXL project is in full swing now – developers are working closely with Microsoft Research to create the add-in that will revolutionize scientific data curation (in my humble opinion!).  Part of this process was deciding how to handle metadata. For a refresher on metadata, i.e. data documentation, read this post about the metadata in DCXL.

Creating metadata was one of the major requirements for the project, and arguably the most challenging task.  The challenges stem from the fact that there are many metadata standards out there, and of course, none are perfect for our particular task.  So how do we incorporate good work done by others much smarter than me into DCXL, without compromising our need for user-friendly, simple data documentation?

It was tricky, but we came up with a solution that will work for many, if not most, potential DCXL users.  A few things entered into the metadata route we chose:

  1. DataONE: We are very interested in making sure that data posted to a repository via the DCXL add-in can be found using the DataONE Mercury metadata search system (called ONE-Mercury; to be released in May). That means we need to make sure we are using metadata that the DataONE infrastructure supports.  At this point in DataONE’s development, that limits us to the International Organization for Standardization Geospatial Metadata Standard (ISO 19115), the Federal Geographic Data Committee Geospatial Metadata Standard (FGDC), and Ecological Metadata Language (EML).
  2. We want metadata created by the DCXL software to be as flexible as possible for as many different types of data as possible.  ISO19115 and FGDC are both geared towards spatial data specifically (e.g., GIS).  EML is a bit more general and flexible, so we chose to go with it.
  3. EML is a very well documented metadata schema; rather than include every element of EML in DCXL, we cherry-picked the elements we thought would generate metadata that makes the data more discoverable and usable.  Of course, just like never being too skinny or too rich, you can NEVER have too much metadata.  But we chose to draw the line somewhere between “not useful at all” and “overwhelming”.
  4. We ensured that the metadata elements we included could be mapped to DataCite and Dublin Core minimal metadata.  This ensures that a data citation can be generated based on the metadata collected for the dataset.
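To make the approach above concrete, here is a rough sketch of a few EML elements and how the same fields feed a minimal citation. The element names (title, creator, pubDate) follow the EML schema, but this is an illustration of the idea, not the DCXL implementation, and the dataset details are invented:

```python
# Build a tiny EML-style record, then reuse its fields for a citation
# of the creator-(year)-title form that DataCite/Dublin Core minimal
# metadata supports. Illustrative sketch only.
import xml.etree.ElementTree as ET

record = {
    "title": "Arctic unicorn sightings, 2010-2012",  # hypothetical dataset
    "creator": "Jane Researcher",
    "pubDate": "2012",
}

eml = ET.Element("eml")
dataset = ET.SubElement(eml, "dataset")
ET.SubElement(dataset, "title").text = record["title"]
individual = ET.SubElement(ET.SubElement(dataset, "creator"), "individualName")
ET.SubElement(individual, "surName").text = record["creator"]
ET.SubElement(dataset, "pubDate").text = record["pubDate"]

# The same three fields are enough to generate a basic data citation:
citation = f'{record["creator"]} ({record["pubDate"]}): {record["title"]}.'
print(citation)
print(ET.tostring(eml, encoding="unicode"))
```

The point of requiring the mapping in item 4 is exactly this: once the EML elements are filled in, a citation falls out for free, with no extra data entry.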
The result of this work can be seen here: dot.ucop.edu/specs/emlcore.html. Word of warning – proceed to this link only if HTML doesn’t scare you! (Hat tip to CDL’s John Kunze for leading the charge on the DCXL metadata front).
Still hungry for more? There are three DataONE education modules pertaining to metadata: What Is Metadata? | The Value of Metadata | Writing Metadata

Communication Breakdown: Nerds, Geeks, and Dweebs

Last week the DCXL crew worked on finishing up the metadata schema that we will implement in the DCXL project.  WAIT! Keep reading!  I know the phrase “metadata schema” doesn’t necessarily excite folks – especially science folks.  I have a theory for why this might be, and it can be boiled down to a systemic problem I’ve encountered ever since becoming deeply entrenched in all things related to data stewardship: communication breakdown.

I began working with the DataONE group in 2010, and I was quickly overwhelmed by the rather steep learning curve I encountered related to data topics.  There was a whole vocabulary set I had to learn, an entire ecosphere of software and hardware, and a hugely complex web of computer science-y, database-y, programming-y concepts to unpack.  I persevered because the topics were interesting to me, but I often found myself spending time on websites that were indecipherable to the average intelligent person, or reading 50-page “quick start guides”, or getting entangled in a rabbit hole of Wikipedia entries for new concepts related to data.

Fredo Corleone was smart. Not stupid like everybody says. Nerds, Geeks, and Dweebs are all smart – just in different ways. From godfather.wikia.com

I love learning, so I am not one to complain about spending time exploring new concepts. However, I would argue that my difficulties represent a much bigger issue plaguing advances in data stewardship: communication issues.  It’s actually quite obvious why these communication problems exist.  There are a lot of smart people involved in data, all of whom have very divergent backgrounds.  I suggest that the smart people can be broken down into three camps: the nerds, the geeks, and the dweebs.  These stereotypes should not be considered insults; rather, they are an easy way to refer to scientists, librarians, and computer types. Check out the full Venn diagram of nerds here.

The nerds. This is the group to which I belong.  We are specially trained in a field and have in-depth knowledge of our pet projects, but general education about computers, digital data, and data preservation is not part of our training.  Certainly that might change in the near future, but in general we avoid the command line like the plague, prefer user-friendly GUIs, and resist learning any new software or tools that might take time away from our pet projects.

The geeks. Also known as computer folks.  These folks might be developers, computer scientists, information technology specialists, database managers, etc.  They are uber-smart, but from what I can tell their uber-smart brains do not work like mine.  Geeks can explain things to me in one of two ways:

  1. “To turn your computing machine on, you need to first plug it in. Then push the big button.”
  2. “First go to bluberdyblabla and enter c>*#&$) at the prompt. Make sure the juberdystuff is installed in the right directory, though. Otherwise you need to enter #($&%@> first and check the shumptybla before proceeding.”

In all fairness, (1) occurs far less than (2).  But often you get (1) after trying to get clarification on (2).  How to remedy this? First, geeks should realize that our brains don’t think in terms of directories and command line prompts. We are more comfortable with folders we can color code and GUIs that let us use the mouse to make things happen.  That said, we aren’t completely clueless. Just remember that our vocabularies are often quite different from yours.  Often I’ve found myself writing down terms in a meeting so I can look them up later.  Words like “elements” and “terminal” are not unfamiliar in and of themselves; however, the contexts in which they are used are completely new to me.  And that doesn’t even count the genuinely unfamiliar words and acronyms, like APIs, GitHub, Python, and XML.

The dweebs.  Also known as librarians.  These folks are increasingly called “information professionals”, but the gist is the same – they are all about understanding how to deal with information in all its forms.  There’s certainly some crossover with the computer types, especially when it comes to data.  However, librarian types are fundamentally different in that they are often concerned with information generated by other people: put simply, they want to help, or at least interact with, data producers.  There are certainly a host of terms used more often by librarian types: “indexing” and “curation” come to mind.  Check out the DCXL post on libraries from January.

Many of the projects in which I am currently involved require all three of these groups: nerds, geeks, and dweebs.  I watch each group struggle to communicate their points to the others, and too often decide that it’s not worth the effort.  How can we solve this communication impasse? I have a few ideas.

And now a special message to nerds (please see the comment string below about this message and its potential misinterpretation).  I plead with you to stop reinventing the wheel.  As scientists have begun thinking about their digital data, I’ve seen a scary trend of them taking the initiative to invent standards, start databases, or create software.  It’s frustrating to see, since there is a whole set of folks out there who have been working on databases, standards, vocabularies, and software all along: librarians and computer types.  Consult with them rather than starting from scratch.

In the case of dweebs, nerds, and geeks, working together as a whole is much, much better than the sum of our parts.