Where do I put my data? This question comes up often when I talk with researchers about long-term data archiving. There are many barriers to sharing and archiving data, and a lack of knowledge about where to archive the data is certainly among them.
First and foremost, choose early. By choosing a data repository early in the life of your research, you greatly reduce the chance of surprises when it comes time to deposit your dataset. If the repository you choose has strict metadata requirements, it is immensely beneficial to know those requirements before you collect your first data point. You can then design your data collection materials (e.g., spreadsheets, data entry forms, or scripts for automating computational tasks) so that the required metadata is captured as you go.
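If you collect data in spreadsheets, one way to act on this is to check your blank template against the repository's metadata requirements before any data goes in. Here is a minimal sketch in Python. The field names in `REQUIRED_FIELDS` and the function `missing_metadata_fields` are hypothetical, made up for illustration; a real repository publishes its own list of required fields.

```python
import csv
import io

# Hypothetical required fields, for illustration only.
# Substitute the list published by the repository you chose.
REQUIRED_FIELDS = {"site_id", "collection_date", "collector", "units"}


def missing_metadata_fields(csv_text, required=REQUIRED_FIELDS):
    """Return the required column names absent from the CSV header, sorted."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    # Compare case-insensitively and ignore stray spaces around names.
    present = {name.strip().lower() for name in header}
    return sorted(required - present)


# A template exported from a spreadsheet, header row only.
template = "site_id,collection_date,temperature\n"
print(missing_metadata_fields(template))
```

Running a check like this on the empty template, rather than on the finished dataset, is the point: a missing column is trivial to add before the first data point and painful to reconstruct afterward.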
Second, choose carefully. Data repositories (also known as data centers or archives) are plentiful. Choosing the correct repository for your specific dataset is important: you want the right people to be able to find your data in the future. If you research how climate change affects unicorns, you don’t want to store your dataset at the Polar Data Centre (unless you study arctic unicorns). Researchers hunting for unicorn data are more likely to check the Fantasy Creatures Archive, for instance.
So how do you choose your perfect data repository? There is no match.com or eHarmony for matching data with its repository, but a close second is Databib. This is a great resource born from a partnership between Purdue and Penn State Libraries, funded by the Institute of Museum and Library Services. The site allows you to search for keywords in its giant list of data repositories. For instance, “unicorns” surprisingly brings up no results, but “marine” brings up seven repositories that house marine data. Don’t see your favorite data repository? The site welcomes feedback (including suggestions for additional repositories not in its database).
Another good way to discover your perfect repository: think about where you go to find datasets for your research. Try asking around too – where do your colleagues look for data? Where do they archive it? Look for mentions of data centers in publications that you read, and perhaps do a little web hunting.
A final consideration: some institutions provide repositories for their researchers. Often you can store your data in these repositories while you are still working on it (password-protected, of course), using the data center as a kind of backup system. These institutional repositories have benefits like more personal IT service and cheaper (or free) storage rates, but they might make your data more difficult to find for those in your research field. Examples of institutional repositories are DSpace@MIT, DataSpace at Princeton, and the Merritt Repository right here at CDL.
The DCXL add-in will initially connect with CDL’s Merritt Repository. The datasets submitted to Merritt via DCXL will be integrated into the DataONE network to ensure that data deposited via DCXL is accessible and discoverable. Long live Excel data!
Not familiar with the song Homeward Bound? Check out this amazing performance by Simon and Garfunkel.