
The Significance of Managing Research Data

Some of the most influential research tools of the last century were created to ensure the quality of beer and extrapolate the results of agriculture experiments conducted in the English countryside. Though ostensibly about the placement of a decimal point, an ongoing debate about the application of these tools also provides a window for understanding what it actually means to manage research data.

The p-value: A very quick introduction

Though now ubiquitous in experiment-based research, statistical techniques for extending inferences from small samples (e.g. the participants in a research study) to larger populations are actually a relatively recent invention. The t-test, an early and still widely used example of “small sample” statistics, was developed by William Sealy Gosset in the early 20th century as an economical way of ensuring the quality of stout. Several years later, while assisting with long-term experiments on wheat and grass at Rothamsted Experimental Station, Ronald Fisher would build on the work of Gosset and others to develop a statistical framework based around the idea of comparing observations to the null hypothesis: the position that there is no significant difference between two or more specified sets of observations.

In Fisher’s significance testing framework, devices like t-tests are tests of the null hypothesis. The results of these tests indicate the likelihood of observing a result at least as extreme as the one obtained when the null hypothesis is true. The logic is a little tricky, but the core idea is that these tests give researchers a way of understanding how likely it is that data like theirs could arise from sampling or experimental error alone. In quantitative terms, this likelihood is known as a p-value. In his highly influential 1925 book, Statistical Methods for Research Workers, Fisher would introduce an informal threshold for rejecting the null hypothesis: p < 0.05.

In one of the most influential sentences in modern research methodology, Ronald Fisher describes p = 0.05 as a convenient point for judging the significance of a statistical test. From: Fisher, R.A. (1925). Statistical Methods for Research Workers.
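To make the logic of a p-value concrete, here is a minimal sketch in Python. It is not Gosset's t-test: it swaps in an exact permutation test, which needs nothing beyond the standard library and makes the "how often would chance alone do this?" question explicit. The function name and the measurements are made up for illustration.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test for a difference in means.

    Returns the fraction of all possible relabelings of the pooled data
    whose difference in means is at least as large as the observed one.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    pooled_sum = sum(pooled)
    observed = abs(mean(group_a) - mean(group_b))

    extreme = 0
    total = 0
    # Enumerate every way of choosing which values are labeled "group A".
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        diff = abs(sum_a / n_a - (pooled_sum - sum_a) / n_b)
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total


# Hypothetical measurements from two small batches (invented numbers).
batch_a = [4.1, 4.5, 4.3, 4.8, 4.4]
batch_b = [4.9, 5.2, 4.7, 5.1, 5.3]
print(f"p = {permutation_p_value(batch_a, batch_b):.3f}")
```

The printed value is what would then be compared against whatever threshold the researcher has chosen, whether 0.05, 0.005, or something else.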

Despite the vehement objections of all three, Fisher’s work would later be synthesized with that of statisticians Jerzy Neyman and Egon Pearson into a suite of tools that are still widely used in many fields of research. In practice, p < 0.05 has since become a one-size-fits-all indicator of success. For decades it has been acknowledged that work that meets this criterion is generally more likely to be reported in the scholarly literature, while work that doesn’t is generally relegated to the proverbial file drawer.

Beyond p < 0.05

The p < 0.05 threshold has become a flashpoint in the ongoing conversation about research practices, reproducibility, and replicability. Heated conversations about the use and misuse of p-values have been ongoing for decades, but over the summer a group of 72 influential researchers proposed a seemingly simple step forward: change the threshold from 0.05 to 0.005. According to the authors, “Reducing the p-value threshold for claims of new discoveries to 0.005 is an actionable step that will immediately improve reproducibility.”

As of this writing, two responses have been published. Both weigh the pros and cons of p < 0.005 and argue that the placement of a decimal point is less of a problem than the uncritical use of a single one-size-fits-all threshold across many different circumstances and fields of research. Both end on calls for greater transparency and stronger justifications for how decisions related to research design and statistical practice are made. If the initial paper proposed changing the answer from p < 0.05 to 0.005, both responses highlight the necessity of changing the question from one that is focused on statistics to one that incorporates research data management (RDM).

Ensuring that data can be used and evaluated in the future is one of the primary goals of RDM. For example, the RDM guide we’re developing does not have a space for assessing p-values. Instead, its focus is assessing and advancing practices related to planning for, saving, and documenting data and other research products. Such practices come with their own nuance, learning curves, and jargon, but are important elements of any effort to ensure that research decisions are transparent and justified.

Resources and Additional Reading

Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E. J., Berk, R., … & Cesarini, D. (2017). Redefine statistical significance. Nature Human Behaviour. doi: 10.1038/s41562-017-0189-z

Lakens, D., Adolfi, F. G., Albers, C. J., Anvari, F., Apps, M. A. J., Argamon, S. E., … Zwaan, R. A. (2017). Justify your alpha: A response to “Redefine statistical significance”. PsyArXiv preprint. doi: 10.17605/OSF.IO/9S3Y6

McShane, B. B., Gal, D., Gelman, A., Robert, C., & Tackett, J. L. (2017). Abandon statistical significance. arXiv preprint. arXiv: 1709.07588.

Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance—or vice versa. Journal of the American Statistical Association, 54(285), 30-34. doi: 10.1080/01621459.1959.10501497

Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86(3), 638-641. doi: 10.1037/0033-2909.86.3.638

Your Time is Gonna Come

You know what they say: Timing is everything. Time enters into the data management and stewardship equation at several points and warrants discussion here. Why timeliness? Last week at the University of North Texas Open Access Symposium, there were several great speakers who touched on timeliness of data management, organization, and sharing. It led me to wonder whether there is agreement about the timeliness of data-related activities, so here I’ve posted my opinions about time at a few points in the life cycle of data. Feel free to comment on this post with your own opinions.

1. When should you start thinking about data management? The best answer to this question is as soon as possible. The sooner you plan, the less likely you are to be surprised by issues like metadata standards or funder requirements (see my previous DCXL post about things you will wish you had thought about documenting). The NSF mandate for data management plans is a great motivator for thinking sooner rather than later, but let’s face facts: the DMP requirement is only two pages, and you can create one that might pass muster without really thinking too carefully about your data. I encourage everyone to go well beyond funder requirements and thoughtfully plan out your approach to data stewardship. Spend plenty of time doing this, and return to your plan often during your project to update it.

dark side of the rainbow image

If you have never watched the Wizard of Oz while listening to Pink Floyd’s Dark Side of the Moon album, you should. Of course, timing is everything: start the album on the third roar of the MGM lion. Image from horrorhomework.com

2. When should you start archiving your data? By archiving, I do not mean backing up your data (that answer is constantly).  I am referring to the action of putting your data into a repository for long-term (20+ years) storage. This is a more complicated question of timeliness. Issues that should be considered include:

  • Is your data collection ongoing? Continuously updated sensor or instrument data should begin being archived as soon as collection begins.
  • Is your dataset likely to undergo a lot of versions? You might wait to begin archiving until you get close to your final version.
  • Are others likely to want access to your data soon?  Especially colleagues or co-authors? If the answer is yes, begin archiving early so that you are all using the same datasets for analysis.

3. When should you make your data publicly accessible?  My favorite answer to this question is also as soon as possible.  But this might mean different things for different scientists.  For instance, making your data available in near-real time, either on a website or in a repository that supports versioning, allows others to use it, comment on it, and collaborate with you while you are still working on the project.  This approach has its benefits, but also tends to scare off some scientists who are worried about being scooped.  So if you aren’t an open data kind of person, you should make your data publicly available at the time of publication.  Some journals are already requiring this, and more are likely to follow.

There are some who would still balk at making data available at publication: What if I want to publish more papers with this dataset in the future? In that case, have an honest conversation with yourself. What do you mean by “future”? Are you really likely to follow through on those future projects that might use the dataset? If the answer is no, you should make the data available to enhance your chances for collaboration. If the answer is yes, give yourself a little bit of temporal padding, but not too much. Think about enforcing a deadline of two years, at which point you make the data available whether you have finished those dream projects or not. Alternatively, find out if your favorite data repository will enforce your deadline for you: you may be able to provide them with a release date for your data, whether or not they hear from you first.

Data Diversity is Okay

At risk of sounding like a motivational speaker, this is such an exciting time to be involved in science and research.  We are swimming in data and information (yay!), there are exciting software tools available for researchers, librarians, and lay people alike, and the possibilities for discovery seem endless.  Of course, all of this change can be a bit daunting.  How do you handle the data deluge? What software is likely to be around for a while? How do you manage your time effectively in the face of so much technology?

Growing Pains

Just like Kirk Cameron’s choice of hair style, academics and their librarians are going through some growing pains. From www.1051jackfm.com

Like many other groups, academic libraries are undergoing some growing pains in the face of the information age. This may be attributed to drastic budget cuts, rising costs for journal subscriptions, and the less important role that physical collections play due to increasing digitization of information. Researchers are quite content to sit at their laptops and download PDFs from their favorite journals rather than wander the stacks of their local library; they would sooner use Google searches to scour the internet for obscure references than ask their friendly subject librarian for help in the hunt.

Despite the challenges above, I firmly believe that this is such an exciting time to be working at the interface of libraries, science, and technology.  Many librarians agree with me, including those at UCLA.  Lisa Federer and Jen Weintraub recently put on a great panel at the UCLA library focused on data curation.  I was invited to participate and agreed, which turned out to be an excellent decision.

The panel was called “Data Curation in Action”, and featured four panelists: Chris Johanson, UCLA professor of classics and digital humanities; Tamar Kremer-Sadlik, director of research at the UCLA Center for Everyday Lives of Families (CELF); Paul Conner, the digital laboratory director of CELF; and myself, intended to represent some mix of researchers in science and librarians.

Without droning on about how great the panel was, and how interesting the questions from the audience were, and how wonderful my discussions were with attendees after the panel, I wanted to mention the major thing that I took away: there is so much diverse data being generated by so many different kinds of projects and researchers.  Did I mention that this is an exciting time in the world of information?

Take Tamar and Paul: their project involves following families every day for hours on end, recording video, documenting interactions and locations of family members, taking digital photographs, conducting interviews, and measuring cortisol levels (an indicator for stress).  You should read that sentence again, because that is an enormous diversity of data types, not to mention the volume. Interviews and video are transcribed, quantitative observations are recorded in databases, and there is an intense coding system for labeling images, videos, and audio files.

Now for Chris, who has the ability to say “I am a professor of classics” at dinner parties (I’m jealous).  Chris doesn’t sit about reading old texts and talking about marble statues. Instead he is trying to reconstruct “ephemeral activities in the ancient world”, such as attending a funeral, going to the market, etcetera. He does this using a complex combination of Google Earth, digitized ancient maps, pictures, historical records, and data from excavations of ancient civilizations.  He stole the show at the panel when he demonstrated how researchers are beginning to create virtual worlds in which a visitor can wander around the landscape, just like in a modern day 3D video game.

This is really just a blog post about how much I love my job. I can’t imagine anything more interesting than trying to solve problems and provide assistance for researchers such as Tamar, Paul and Chris.

In case you are not one of the 35 million who have watched it, OK Go has a wonderful video about getting through the tough times associated with the dawning information age (at least that’s my rather nerdy interpretation of this song):


Trailblazers in Demography

Last week I had the great pleasure of visiting Rostock, Germany.  If your geography lessons were a long time ago, you are probably wondering “where’s Rostock?” I sure did… Rostock is located very close to the Baltic Sea, in northeast Germany.  It’s a lovely little town with bumpy streets, lots of sausage, and great public transportation.  I was there, however, to visit the prestigious Max Planck Institute for Demographic Research (MPIDR).

Demography is the study of populations, especially their birth rates, death rates, and growth rates.  For humans, this data might be used for, say, calculating premiums for life insurance.  For other organisms, these types of data are useful for studying population declines, increases, and changes.  Such areas of study are especially important for endangered populations, invasive species, and commercially important plants and animals.

baby rhino

Sharing demography data saves adorable endangered species. From Flickr by haiwan42

I was invited to MPIDR because there is a group of scientists interested in creating a repository for non-human demography data.  Luckily, they aren’t starting from scratch.  They have a few existing collections of disparate data sets, some more refined and public-facing than others; their vision is to merge these datasets and create a useful, integrated database chock full of demographic data.  Although the group has significant challenges ahead (metadata standards, security, data governance policies, long term sustainability), their enthusiasm for the project will go a long way towards making it a reality.

The reason I am blogging about this meeting is that, for me, the group’s goals represent something much bigger than a demography database. In the past two years, I have been exposed to a remarkable range of attitudes towards data sharing (see my earlier blog posts on the topic). Many of the scientists with whom I spoke needed convincing to share their datasets. But even in this short period of time that I have been involved in issues surrounding data, I have seen a shift towards the other end of the range. The Rostock group is one great example of scientists who are getting it.

More and more scientists are joining the open data movement, and a few of them are even working to convert others to  believe in the cause.  This group that met in Rostock could put their heads down, continue to work on their separate projects, and perhaps share data occasionally with a select few vetted colleagues that they trust and know well.  But they are choosing instead to venture into the wilderness of scientific data sharing.  Let them be an inspiration to data hoarders everywhere.

It is our intention that the DCXL project will result in an add-in and web application that will facilitate all of the good things the Rostock group is trying to promote in the demography community.  Demographers use Microsoft Excel, in combination with Microsoft Access, to organize and manage their large datasets.  Perhaps in the future our open-source add-in and web application will be linked up with the demography database; open source software, open data, and open minds make this possible.


A few months back I received an invite to visit the University of Florida in sunny Gainesville.  The invite was from organizers of an annual symposium for the Quantitative Spatial Ecology, Evolution and Environment (QSE3) Integrative Graduate Education and Research Traineeship (IGERT) program.  Phew! That was a lot of typing for the first two acronyms in my blog post’s title.  The third acronym  (OA) stands for Open Access, and the fourth acronym should be familiar.

I presented a session on data management and sharing for scientists, and afterward we had a round table discussion focused on OA. There were about 25 graduate students affiliated with the QSE3 IGERT program, a few of their faculty advisors, and some guests (including myself) involved in the discussion. In 90 minutes we covered the gamut of current publishing models, incentive structures for scientists, LaTeX advantages and disadvantages, and data sharing. The discussion was both interesting and energetic in a way that I don’t encounter from scientists who are “more established”. Some of the themes that emerged from our discussion warrant a blog post.

First, we discussed that data sharing is an obvious scientific obligation in theory, but when it comes to their own data, most scientists get a bit more cagey. This might be with good reason: many of the students in the discussion were still writing up their results in thesis form, never mind in journal-ready form. Throwing your data out into the ether without restrictions might result in some speedy scientist scooping you while you are dotting i’s and crossing t’s in your thesis draft. In the case of grad students and scientists in general, embargo periods seem to be a good response to most of this apprehension. We agreed as a group, however, that such embargos should be temporary and should be phased out over time as cultural norms shift.

The current publishing model needs to change, but there was disagreement about how this change should manifest. For instance, one (very computer-savvy) student who uses R, LaTeX and Sweave asked “Why do we need publishers? Why can’t we just put the formatted text and code online?” This is an obvious solution for someone well-versed in the world of document preparation in the vein of LaTeX. You get fully formatted, high-quality publications by simply compiling documents. But this was argued against by many in attendance because LaTeX use is not widespread, and most articles need heavy amounts of formatting before publication. Of course, this is work that would need to be done by the overburdened scientist if they published their own work, which is not likely to become the norm any time soon.

empty library

No journals means empty library shelves. Perhaps the newly freed up space could be used to store curmudgeonly professors resistant to change.

Let’s pretend that we have overhauled both scientists and the publishing system as it is. In this scenario, scientists use free open-source tools like LaTeX and Sweave to generate beautiful documents. They document their workflows and create Python scripts that run from the command line for reproducible results. Given this scenario, one of the students in the discussion asked “How do you decide what to read?” His argument was that the current journal system provides some structure for scientists to home in on interesting publications and determine their quality based (at least partly) on the journal in which the article appears.

One of the other grad students had an interesting response to this: use tags and keywords, create better search engines for academia, and provide capabilities for real-time peer review of articles, data, and publication quality.  In essence, he used the argument that there’s no such thing as too much information. You just need a better filter.

One of the final questions of the discussion came from the notable scientist Craig Osenberg. It was in reference to the shift in science towards “big data”, including remote sensing, text mining, and observatory datasets. To paraphrase: Is anyone worrying about the small datasets? They are the most unique, the hardest to document, and arguably the most important.

My answer was a resounding YES! Enter the DCXL project.  We are focusing on providing support for the scientists that don’t have data managers, IT staff, and existing data repository accounts that facilitate data management and sharing.  One of the main goals of the DCXL project is to help “the little guy”.  These are often scientists working on relatively small datasets that can be contained in Excel files.

In summary, the very smart group of students at UF came to the same conclusions that many of us in the data world have: there needs to be a fundamental shift in the way that science is incentivized, and this is likely to take a while.  Of course, given that these students are early in their careers, and their high levels of interest and intelligence, they are likely to be a part of that change.

Special thanks go to Emilio Bruna (@brunalab) who not only scored me the invite to UF, but also hosted me for a lovely dinner during my visit (albeit NOT the Tasty Budda…)

Communication Breakdown: Nerds, Geeks, and Dweebs

Last week the DCXL crew worked on finishing up the metadata schema that we will implement in the DCXL project.  WAIT! Keep reading!  I know the phrase “metadata schema” doesn’t necessarily excite folks – especially science folks.  I have a theory for why this might be, and it can be boiled down to a systemic problem I’ve encountered ever since becoming deeply entrenched in all things related to data stewardship: communication breakdown.

I began working with the DataONE group in 2010, and I was quickly overwhelmed by the rather steep learning curve I encountered related to data topics. There was a whole vocabulary set I had to learn, an entire ecosphere of software and hardware, and a hugely complex web of computer science-y, database-y, programming-y concepts to unpack. I persevered because the topics were interesting to me, but I often found myself spending time on websites that were indecipherable to the average intelligent person, or reading 50-page “quick start guides”, or getting lost down a rabbit hole of Wikipedia entries for new concepts related to data.

Fredo Corleone

Fredo Corleone was smart. Not stupid like everybody says. Nerds, Geeks, and Dweebs are all smart – just in different ways. From godfather.wikia.com

I love learning, so I am not one to complain about spending time exploring new concepts. However, I would argue that my difficulties represent a much bigger issue plaguing advances in data stewardship: communication issues. It’s actually quite obvious why these communication problems exist. There are a lot of smart people involved in data, all of whom have very divergent backgrounds. I suggest that the smart people can be broken down into three camps: the nerds, the geeks, and the dweebs. These stereotypes should not be considered insults; rather they are an easy way to refer to scientists, computer types, and librarians. Check out the full Venn diagram of nerds here.

The nerds. This is the group to which I belong. We are specially trained in a field and have in-depth knowledge of our pet projects, but general training in computers, digital data, and data preservation is not part of our education. Certainly that might change in the near future, but in general we avoid the command line like the plague, prefer user-friendly GUIs, and resist any learning of new software, tools, etc. that might take away from learning about our pet projects.

The geeks. Also known as computer folks. These folks might be developers, computer scientists, information technology specialists, database managers, etc. They are uber-smart, but their uber-smart brains do not work like mine. From what I can tell, geeks can explain things to me in one of two ways:

  1. “To turn your computing machine on, you need to first plug it in. Then push the big button.”
  2. “First go to bluberdyblabla and enter c>*#&$) at the prompt. Make sure the juberdystuff is installed in the right directory, though. Otherwise you need to enter #($&%@> first and check the shumptybla before proceeding.”

In all fairness, (1) occurs far less often than (2). But often you get (1) after trying to get clarification on (2). How to remedy this? First, geeks should realize that our brains don’t think in terms of directories and command line prompts. We are more comfortable with folders we can color code and GUIs that allow us to use the mouse for making things happen. That said, we aren’t completely clueless. Just remember that our vocabularies are often quite different from yours. Often I’ve found myself writing down terms in a meeting so I can go look them up later. Things like “elements” and “terminal” are not unfamiliar words in and of themselves. However, the contexts in which they are used are completely new to me. That doesn’t even count the unfamiliar words and acronyms, like APIs, GitHub, Python, and XML.

The dweebs. Also known as librarians. These folks are increasingly being called “information professionals”, but the gist is the same – they are all about understanding how to deal with information in all its forms. There’s certainly a bit of crossover with the computer types, especially when it comes to data. However, librarian types are fundamentally different in that they are often concerned with information generated by other people: put simply, they want to help, or at least interact with, data producers. There are certainly a host of terms that are used more often by librarian types: “indexing” and “curation” come to mind. Check out the DCXL post on libraries from January.

Many of the projects in which I am currently involved require all three of these groups: nerds, geeks, and dweebs.  I watch each group struggle to communicate their points to the others, and too often decide that it’s not worth the effort.  How can we solve this communication impasse? I have a few ideas:

  • Nerds: open your minds to the possibility that computer types and librarian types might know about better ways of doing what you are doing.  Tap the resources that these groups have to offer. Stop being scared of the unknown. You love learning or you wouldn’t be a scientist; devote some of that love in the direction of improving your computer savvy.
  • Geeks: dumb it down, but not too much. Recognize that scientists and librarians are smart, but potentially in very different ways than you. Also, please recognize that change will be incremental, and we will not universally adopt whatever you think is the best possible set of tools or strategies, no matter how “totally stupid” our current workflow seems.
  • Dweebs: spend some time getting to know the disciplines you want to help. Toot your own horn: you know A LOT of stuff that nerds and geeks don’t, and you are all so darn shy! Make sure both geeks and nerds know of your capacity to help, and your ability to lend important information to the discussion.

And now a special message to nerds (please see the comment string below about this message and its potential misinterpretation). I plead with you to stop reinventing the wheel. As scientists have begun thinking about their digital data, I’ve seen a scary trend of them taking the initiative to invent standards, start databases, or create software. It’s frustrating to see since there is a whole set of folks out there who have been working on databases, standards, vocabularies, and software: librarians and computer types. Consult with them rather than starting from scratch.

In the case of dweebs, nerds, and geeks, working together as a whole is much much better than summing up our parts.

Data Citation Redux

I know what faithful DCXL readers are thinking: didn’t you already post about data citation? (For the unfaithful among you, check out this post from last November). Yes, I did. But I’ve been inspired to post yet again because I just attended an amazing workshop about all things data citation related.

The workshop was hosted by the NCAR Library (NCAR stands for National Center for Atmospheric Research) and took place in Boulder on Thursday and Friday of last week. Workshop organizers expected about 30 attendees; more than 70 showed up to learn more about data citation. Hats off to the organizers – there were healthy discussions among attendees and interesting presentations by great speakers.

One of the presentations that struck me most was by Dr. Tim Killeen, Assistant Director for the Geosciences Directorate at NSF. His talk (available on the workshop website) discussed the motivation for data citation, and what policies have begun to emerge. Near the end of a rather long string of reports about data citation, data sharing, and data management, Killeen said “There is a drumbeat into Washington about this.”

John Bonham

If Led Zeppelin drummer John Bonham were still alive, he would be leading the data charge into DC. Bonham was voted by Rolling Stone readers as the best drummer of all time. Photo from drummerworld.com

This phrase stuck with me long after I flew home because it juxtaposed two things I hadn’t considered as being related: Washington DC and data policy. Yes, I understand that NSF is located in Washington, and that very recently the White House announced some exciting Big Data funding and initiatives. But Washington DC as a whole – Congress, lobbyists, lawyers, judges, etc. – would notice a drum beat about data? I must say, I got pretty excited about the idea.

What are these reports cited by Killeen?  In chronological order:

The NSB report on long-lived digital data had yet another great phrase that stuck with me:

Long-lived digital data collections are powerful catalysts for progress and for democratization of science and education

Wow. I really love the idea of democratized data.  It warms the cockles, doesn’t it?  With regard to DCXL, the link is obvious.  One of the features we are developing is generation of a data citation for your Excel dataset.

Survey says…

A few weeks ago we reached out to the scientific community for help on the direction of the DCXL project. The major issue at hand was whether we should develop a web-based application or an add-in for Microsoft Excel. Last week, I reported that we decided that rather than choose, we will develop both. This might seem like a risky proposition: the DCXL project has a one-year timeline, meaning this all needs to be developed before August (!). As someone in a DCXL meeting recently put it, aren’t we settling for “twice the product and half the features”? We discussed what features might need to be dropped from our list of desirables based on the change in trajectory; however, we are confident that both of the DCXL products we develop will be feature-rich and meet the needs of the target scientific community. Of course, this is made easier by the fact that the features in the two products will be nearly identical.

Family Feud screen shot

What would Richard Dawson want? Add-in or web app? From Wikipedia. Source: J Graham (1988). Come on Down!!!: the TV Game Show Book. Abbeville Press

How did we arrive at developing an add-in and a web app? By talking to scientists. It became obvious that there were aspects of both products that appeal to our user communities based on feedback we collected.  Here’s a summary of what we heard:

Show of hands:  I ran a workshop on Data Management for Scientists at the Ocean Sciences 2012 Meeting in February.  At the close of the workshop, I described the DCXL project and went over the pros and cons of the add-in option and the web app option.  By show of hands, about 80% of the folks in the audience (n~150) voted for the web app.

Conversations: here’s a sampling of some of the things folks told me about the two options:

  • “I don’t want to go to the web. It’s much easier if it’s incorporated into Excel.” (add-in)
  • “As long as I can create metadata offline, I don’t mind it being a web app. It seems like all of the other things it would do require you to be online anyway” (either)
  • “If there’s a link in the spreadsheet, that seems sufficient.” (either)  “It would be better to have something that stays on the menu bar no matter what file is open.” (add-in)
  • “The updates are the biggest issue for me. If I have to update software a lot, I get frustrated. It seems like Microsoft is always making me update something. I would rather go to the web and know it’s the most recent version.” (web app)
  • Workshop attendee: “Can it work like Zotero, where there’s ways to use it both offline and online?” (both)

Survey: I created a very brief survey using the website SurveyMonkey. I then sent the link to the survey out via social media and listservs.  Within about a week, I received over 200 responses.

Education level of respondents:

Survey questions & answers:


So with those results, there was a resounding “both!” emanating from the scientific community.  First we will develop the add-in since it best fits the needs of our target users (those who use Excel heavily and need assistance with good data management skills).  We will then develop the web application, with the hope that the community at large will adopt and improve on the web app over time.  The internet is a great place for building a community with shared needs and goals– we can only hope that DCXL will be adopted as wholeheartedly as other internet sources offering help and information.

The Digital Dark Age, Part 1

This will be known as the Digital Dark Age.  The first time I heard this statement was at Internet Archive, during the PDA 2012 Meeting (read my blog post about it here).  What did this mean?  What is a Digital Dark Age? Read on.

While serving in Vietnam, my father wrote letters to my grandparents about his life fighting a war in a foreign country.  One of his letters was sent to arrive in time for my grandfather’s birthday, and it contained a lovely poem that articulated my father’s warm feelings about his childhood, his parents, and his upbringing.  My grandparents kept the poem framed in a prominent spot in their home.  When I visited them as a child, I would read the poem written in my young dad’s  handwriting, stare at the yellowed paper, and think about how far that poem had to travel to relay its greetings to my grandparents.  It was special– for its history, the people involved, and the fact that these people were intimately connected to me.

Now fast forward to 2012.  Imagine modern-day soldiers all over the world, emailing, making satellite phone calls, and chatting with their families via video conferencing.  When compared to snail mail, these modern communication methods are likely a much preferred way of staying in touch for those families.  But how likely is it that future grandchildren will be able to listen to those conversations, read those emails, or watch those video calls?  The answer: extremely unlikely.

These two scenarios sum up the concept of a Digital Dark Age: compared to 40 years ago, we are doing a terrible job of ensuring that future generations will be able to read our letters, look at our pictures, or use our scientific data.

mix tapes

You mean future generations won’t be able to listen to my mix tapes?! From Flickr by newrambler

The Digital Dark Age “refers to a possible future situation where it will be difficult or impossible to read historical digital documents and multimedia, because they have been stored in an obsolete and obscure digital format.”  The phrase “Dark Age” is a reference to The Dark Ages, a period in history around the beginning of the Middle Ages characterized by a scarcity of historical and other written records at least for some areas of Europe, rendering it obscure to historians.  Sounds scary, no?

How can we remedy this situation? What are people doing about it? Most importantly, what does this mean for scientific advancement? Check out my next post to find out.

Why You Should Floss

No, I won’t be discussing proper oral hygiene. What I mean by “flossing” is actually “backing up your data”.  Why the floss analogy? Here are the similarities between flossing and backing up your data:

  1. It’s undisputed that it’s important
  2. Most people don’t do it as often as they should
  3. You lie (to yourself, or your dentist) about how often you do it

Oral (and data) hygiene can be fun! From Calisphere, courtesy of UC Berkeley Bancroft Library

So think about backing up similarly to the way you think about flossing:  you probably aren’t doing it enough.  In this post, I will provide some general guidance about backing up your data; as always, the advice will vary greatly depending on the types of data you are generating, how often they change, and what computational resources are available to you.

First, create multiple copies in multiple locations.  The old rule of thumb is original, near, far.  The first copy is your working copy of data; the second copy is kept near your original (this is most likely an external hard drive or thumb drive); the third is kept far from your original (off site, such as at home or on a server outside of your office building).  This is the important part: all three of these copies should be up-to-date.  Which brings me to my second point.
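If you are comfortable with a little scripting, the original, near, far rule can be sketched in a few lines of Python.  This is only a minimal illustration, not a full backup tool: the function name `back_up` and the two directory arguments are my own placeholders, and it assumes your “near” and “far” locations both show up as ordinary folders on your computer (for example, an external drive and a mounted server share).

```python
import shutil
from pathlib import Path


def back_up(original: Path, near_dir: Path, far_dir: Path) -> list:
    """Copy the working file into a 'near' and a 'far' backup folder.

    Returns the paths of the two copies that were written.
    """
    copies = []
    for backup_dir in (near_dir, far_dir):
        # Create the backup folder if it does not exist yet
        backup_dir.mkdir(parents=True, exist_ok=True)
        destination = backup_dir / original.name
        # copy2 copies the contents and also tries to keep the timestamps
        shutil.copy2(original, destination)
        copies.append(destination)
    return copies
```

You would call it with your own paths, something like `back_up(Path("results.xlsx"), Path("/Volumes/ExternalDrive/backup"), Path("/Volumes/LabServer/backup"))`; those locations are made up, so substitute whatever applies to you.  Note that each run overwrites the previous copy with the same file name, which keeps all three copies up-to-date but does not keep older versions.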

Second, back up your data more often.  I have had many conversations with scientists over the last few months, and I always ask, “How do you back up your data?”  Answers range, but most of them scare me silly.  For instance, there was a fifth-year graduate student who had all of her data on a six-year-old laptop, and only backed up once a month.  I get heart palpitations just typing that sentence.  Other folks have said things like “I use my external drive to back things up once every couple of months”, or worst case scenario, “I know I should, but I just don’t back up”.  It is strongly recommended that you back up every day. It’s a pain, right? There are two very easy ways to back up every day, and neither requires purchasing any hardware or software: (1) Keep a copy on Dropbox, or (2) Email yourself the data file as an attachment.  Note: these suggestions are not likely to work for large data sets.

Third, find out what resources are available to you. Institutions are becoming aware of the importance of good backup and data storage systems, which means there might be ways for you to back up your data regularly with minimal effort.  Check with your department or campus IT folks and ask about server space and automated backup services. If server space and/or automated backup isn’t available, consider joining forces with other scientists to purchase servers for backing up (this is an option for professors more often than graduate students).

Finally, ensure that your backup plan is working.  This is especially important if others are in charge of data backup.  If your lab group has automated backup to a common computer, check to be sure your data are there, in full, and readable.  Ensure that the backup is actually occurring as regularly as you think it is.  More generally, you should be sure that if your laptop dies, or your office is flooded, or your home is burgled, you will be able to recover your data in full.
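One way to check that a backup is “there, in full, and readable” is to compare checksums of the original and the copy.  Here is a minimal sketch in Python using the standard `hashlib` module; the helper names `sha256_of` and `backup_matches` are my own, and the sketch assumes both files can be reached as regular files from the same computer.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read 1 MB at a time so large data files don't fill up memory
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(original: Path, backup: Path) -> bool:
    """True only if the backup exists and has the same checksum as the original."""
    if not backup.is_file():
        return False
    return sha256_of(original) == sha256_of(backup)
```

If `backup_matches` returns `False`, the backup is either missing or differs from your working copy (for instance, because it is out of date).  Matching checksums tell you the bytes are identical; it is still worth opening a backup file now and then to be sure you can actually read it.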

For more information on backing up, check out the DataONE education module “Protected back-ups”