The Significance of Managing Research Data

Some of the most influential research tools of the last century were created to ensure the quality of beer and extrapolate the results of agricultural experiments conducted in the English countryside. Though ostensibly about the placement of a decimal point, an ongoing debate about the application of these tools also provides a window for understanding what it actually means to manage research data.

The p-value: A very quick introduction

Though now ubiquitous in experiment-based research, statistical techniques for extending inferences from small samples (e.g. the participants in a research study) to larger populations are actually a relatively recent invention. The t-test, an early and still widely used example of “small sample” statistics, was developed by William Sealy Gosset in the early 20th century as an economical way of ensuring the quality of stout. Several years later, while assisting with long-term experiments on wheat and grass at Rothamsted Experimental Station, Ronald Fisher would build on the work of Gosset and others to develop a statistical framework based around the idea of comparing observations to the null hypothesis: the position that there is no significant difference between two or more specified sets of observations.

In Fisher’s significance testing framework, devices like t-tests are tests of the null hypothesis. The results of these tests indicate the probability of observing a result at least as extreme as the one obtained if the null hypothesis is true. The logic is a little tricky, but the core idea is that these tests give researchers a way of gauging how readily their data could have arisen from sampling or experimental error alone. In quantitative terms, this probability is known as a p-value. In his highly influential 1925 book, Statistical Methods for Research Workers, Fisher would introduce an informal threshold for rejecting the null hypothesis: p < 0.05.
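To make the idea concrete, here is a minimal sketch of how a p-value can be computed for a tiny two-group comparison. It uses an exact permutation test rather than the t-test described above (a swap made so that the example needs only Python's standard library), and the function name and sample numbers are hypothetical.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in group means.

    Under the null hypothesis the group labels are interchangeable, so we
    relabel the pooled observations in every possible way and count how often
    the relabeled difference is at least as large as the observed one.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))

    at_least_as_extreme = 0
    n_splits = 0
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        relabeled_a = [pooled[i] for i in chosen]
        relabeled_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        diff = abs(mean(relabeled_a) - mean(relabeled_b))
        # Small tolerance so ties are not lost to floating-point rounding.
        if diff >= observed - 1e-12:
            at_least_as_extreme += 1
        n_splits += 1
    return at_least_as_extreme / n_splits


# Hypothetical measurements from two small batches.
batch_a = [1, 2, 3]
batch_b = [4, 5, 6]
p = permutation_p_value(batch_a, batch_b)
print(f"p = {p:.3f}")
print("reject the null hypothesis at 0.05?", p < 0.05)
```

With only three observations per group there are just 20 ways to relabel the data, so the smallest two-sided p-value this sketch can return is 2/20 = 0.1. Even perfectly separated groups cannot cross the 0.05 line at this sample size, which is one reason a single threshold says little without the research design behind it.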

In one of the most influential sentences in modern research methodology, Ronald Fisher describes p = 0.05 as a convenient point for judging the significance of a statistical test. From: Fisher, R.A. (1925). Statistical Methods for Research Workers.

Despite the vehement objections of all three, Fisher’s work would later be synthesized with that of statisticians Jerzy Neyman and Egon Pearson into a suite of tools that are still widely used in many fields of research. In practice, p < 0.05 has since become a one-size-fits-all indicator of success. For decades it has been acknowledged that work that meets this criterion is generally more likely to be reported in the scholarly literature while work that doesn’t is generally relegated to the proverbial file drawer.

Beyond p < 0.05

The p < 0.05 threshold has become a flashpoint in the ongoing conversation about research practices, reproducibility, and replicability. Heated conversations about the use and misuse of p-values have been ongoing for decades, but over the summer a group of 72 influential researchers proposed a seemingly simple step forward: change the threshold from 0.05 to 0.005. According to the authors, “Reducing the p-value threshold for claims of new discoveries to 0.005 is an actionable step that will immediately improve reproducibility.”

As of this writing, two responses have been published. Both weigh the pros and cons of p < 0.005 and argue that the placement of a decimal point is less of a problem than the uncritical use of a single one-size-fits-all threshold across many different circumstances and fields of research. Both end on calls for greater transparency and stronger justifications for how decisions related to research design and statistical practice are made. If the initial paper proposed changing the answer from p < 0.05 to 0.005, both responses highlight the necessity of changing the question from one that is focused on statistics to one that incorporates research data management (RDM).

Ensuring that data can be used and evaluated in the future is one of the primary goals of RDM. For example, the RDM guide we’re developing does not have a space for assessing p-values. Instead, its focus is assessing and advancing practices related to planning for, saving, and documenting data and other research products. Such practices come with their own nuance, learning curves, and jargon, but are important elements of any effort to ensure that research decisions are transparent and justified.

Resources and Additional Reading

Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E. J., Berk, R., … & Cesarini, D. (2017). Redefine statistical significance. Nature Human Behaviour. doi: 10.1038/s41562-017-0189-z

Lakens, D., Adolfi, F. G., Albers, C. J., Anvari, F., Apps, M. A. J., Argamon, S. E., … Zwaan, R. A. (2017). Justify your alpha: A response to “Redefine statistical significance”. PsyArXiv preprint. doi: 10.17605/OSF.IO/9S3Y6

McShane, B. B., Gal, D., Gelman, A., Robert, C., & Tackett, J. L. (2017). Abandon statistical significance. arXiv preprint. arXiv: 1709.07588.

Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance—or vice versa. Journal of the American Statistical Association, 54(285), 30-34. doi: 10.1080/01621459.1959.10501497

Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86(3), 638-641. doi: 10.1037/0033-2909.86.3.638

Workflows Part II: Formal

In my last blog post, I provided an overview of scientific workflows in general. I also covered the basics of informal workflows, i.e. flow charts and commented scripts.  Well, put away the tuxedo t-shirt and pull out your cummerbund and bow tie, folks, because we are moving on to formal workflow systems.

Nothing says formal like a sequined cummerbund and matching bow tie. From awkwardfamilyphotos.com

A formal workflow (let’s call them FW) is essentially an “analytical pipeline” that takes data in one end and spits out results on the other.  The major difference between FW and commented scripts (one example of informal workflows) is that FW can be implemented in different software systems.  A commented R script for estimating parameters works for R, but what about those simulations you need to run in MATLAB afterward?  Saving the outputs from one program, importing them into another, and continuing analysis there is a very common practice in modern science.

So how do you link together multiple software systems automatically? You have two options: become one of those geniuses that use the command line for all of your analyses, or use a FW software system developed by one of those geniuses.  The former requires a level of expertise that many (most?) Earth, environmental, and ecological scientists do not possess, myself included.  It involves writing code that will access different software programs on your machine, load data into them, perform analyses, save results, and use those results as input for a completely different set of analyses, often using a different software program.  FW are often called “executable workflows” because they are a way for you to push only one button (e.g., enter) and obtain your results.
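As a rough illustration of the command-line route, the sketch below chains two analysis steps from a single Python script, passing the first step's output file to the second. To keep it runnable anywhere, both steps invoke the Python interpreter as a stand-in; in a real pipeline the commands would call whatever tools you actually use (for example R and then MATLAB). The file names, the step contents, and the `run_pipeline` helper are all hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Stand-in for a step run in one program: clean the raw data (drop blank lines).
CLEAN_STEP = """
import sys
with open(sys.argv[1]) as src, open(sys.argv[2], "w") as dst:
    for line in src:
        if line.strip():
            print(line.strip(), file=dst)
"""

# Stand-in for a step run in a second program: summarize the cleaned data.
SUMMARY_STEP = """
import sys
with open(sys.argv[1]) as src:
    values = [float(line) for line in src]
with open(sys.argv[2], "w") as dst:
    print(sum(values) / len(values), file=dst)
"""


def run_pipeline(raw_file, workdir):
    """Run each step in order; each step's output file is the next step's input."""
    cleaned = Path(workdir) / "cleaned.txt"
    summary = Path(workdir) / "summary.txt"
    steps = [
        [sys.executable, "-c", CLEAN_STEP, str(raw_file), str(cleaned)],
        [sys.executable, "-c", SUMMARY_STEP, str(cleaned), str(summary)],
    ]
    for command in steps:
        subprocess.run(command, check=True)  # stop the pipeline if a step fails
    return summary.read_text().strip()


with tempfile.TemporaryDirectory() as workdir:
    raw = Path(workdir) / "raw.txt"
    raw.write_text("1.0\n\n2.0\n3.0\n")
    print("mean of cleaned data:", run_pipeline(raw, workdir))
```

Running the script is the "one button": every intermediate file is produced and consumed without any manual exporting or importing.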

What about FW software systems? These are a bit more accessible for the average scientist.  FW software has been around for about 10 years, with the first user-friendly(ish) breakthrough being the Kepler Workflow System.  Kepler was developed with researchers in mind, and allows the user to drag and drop chunks of analytical tasks into a window.  The user can indicate which data files should be used as inputs and where the outputs should be sent, connecting the analytical tasks with arrows.  Kepler is still in a beta version, and most researchers will find the work required to set up a workflow prohibitive.

One group that has managed to incorporate workflows into its community of sharing is genomicists; this is because they tend to have predictable data as inputs, with a comparatively small set of analyses performed on those data.  Interestingly, a social networking site called myExperiment has grown up around genomicists’ use of workflows, where researchers can share workflows, download others’ workflows, and comment on those that they have tried.

The main benefit of FW is that each step in the analytical pipeline, including any parameters or requirements, is formally recorded.  This means that researchers can reuse both individual steps (e.g., the data cleaning step in R or the maximum likelihood estimation in MATLAB) and the overall workflow.  Analyses can be re-run much more quickly, and repetitive tasks can be automated to reduce chances for manual error.  Because the workflow can be saved and re-used, it is a great way to ensure reproducibility and transparency in the scientific process.
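The recording idea can be sketched in a few lines. The example below is not how Kepler or any other FW system works internally; it is a hypothetical, minimal illustration in Python in which every step is logged with its name and parameters so that a run can be inspected or repeated later. The `Workflow` class and the two step functions are invented for this example.

```python
import json


class Workflow:
    """Runs steps in order and keeps a record of each step and its parameters."""

    def __init__(self):
        self.steps = []   # (function, parameters) pairs, in execution order
        self.record = []  # what was actually run, in a form that can be saved

    def add_step(self, func, **params):
        self.steps.append((func, params))
        return self  # returning self allows steps to be chained

    def run(self, data):
        self.record = []
        for func, params in self.steps:
            data = func(data, **params)
            self.record.append({"step": func.__name__, "params": params})
        return data


def drop_outliers(values, max_value):
    """Example cleaning step: discard values above a threshold."""
    return [v for v in values if v <= max_value]


def scale(values, factor):
    """Example analysis step: rescale every value."""
    return [v * factor for v in values]


workflow = Workflow().add_step(drop_outliers, max_value=10).add_step(scale, factor=2)
result = workflow.run([1, 5, 50])
print(result)
print(json.dumps(workflow.record, indent=2))
```

Because the record is plain JSON, it could be archived next to the data, and the same `workflow` object can be re-run on a new input without retyping any parameters.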

Although Kepler is not in wide use, it is a great example of something that will likely become commonplace in the researcher’s toolbox over the next decade.  Other FW software includes Taverna, VisTrails, and Pegasus, all with varying levels of user-friendliness and varied communities of use.  As the complexity of analyses and the variety of software systems used by scientists continue to increase, FW are going to become a more common part of the research process.  Perhaps more importantly, it is likely that funders will start requiring the archiving of FW alongside data to ensure accountability, reproducibility, and to promote reuse.

Your Time is Gonna Come

You know what they say:  Timing is everything.  Time enters into the data management and stewardship equation at several points and warrants discussion here.  Why timeliness? Last week at the University of North Texas Open Access Symposium, there were several great speakers who touched on timeliness of data management, organization, and sharing.  It led me to wonder whether there is agreement about the timeliness of data-related activities, so here I’ve posted my opinions about time at a few points in the life cycle of data.  Feel free to comment on this post with your own opinions.

1. When should you start thinking about data management?  The best answer to this question is as soon as possible.  The sooner you plan, the less likely you are to be surprised by issues like metadata standards or funder requirements (see my previous DCXL post about things you will wish you had thought about documenting).  The NSF mandate for data management plans is a great motivator for thinking sooner rather than later, but let’s face facts: the DMP requirement is only two pages, and you can create one that might pass muster without really thinking too carefully about your data.  I encourage everyone to go well beyond funder requirements and thoughtfully plan out your approach to data stewardship.  Spend plenty of time doing this, and return to your plan often during your project to update it.

If you have never watched the Wizard of Oz while listening to Pink Floyd’s Dark Side of the Moon album, you should. Of course, timing is everything: start the album on the third roar of the MGM lion. Image from horrorhomework.com

2. When should you start archiving your data? By archiving, I do not mean backing up your data (that answer is constantly).  I am referring to the action of putting your data into a repository for long-term (20+ years) storage. This is a more complicated question of timeliness. Issues that should be considered include:

  • Is your data collection ongoing? Continuously updated sensor or instrument data should begin being archived as soon as collection begins.
  • Is your dataset likely to undergo a lot of versions? You might wait to begin archiving until you get close to your final version.
  • Are others likely to want access to your data soon?  Especially colleagues or co-authors? If the answer is yes, begin archiving early so that you are all using the same datasets for analysis.

3. When should you make your data publicly accessible?  My favorite answer to this question is also as soon as possible.  But this might mean different things for different scientists.  For instance, making your data available in near-real time, either on a website or in a repository that supports versioning, allows others to use it, comment on it, and collaborate with you while you are still working on the project.  This approach has its benefits, but also tends to scare off some scientists who are worried about being scooped.  So if you aren’t an open data kind of person, you should make your data publicly available at the time of publication.  Some journals are already requiring this, and more are likely to follow.

There are some that would still balk at making data available at publication: What if I want to publish more papers with this dataset in the future?  In that case, have an honest conversation with yourself.  What do you mean by “future”?  Are you really likely to follow through on those future projects that might use the dataset?  If the answer is no, you should make the data available to enhance your chances for collaboration. If the answer is yes, give yourself a little bit of temporal padding, but not too much.  Think about enforcing a deadline of two years, at which point you make the data available whether you have finished those dream projects or not.  Alternatively, find out if your favorite data repository will enforce your deadline for you: you may be able to provide them with a release date for your data, which they will honor whether or not they hear from you first.


A few months back I received an invite to visit the University of Florida in sunny Gainesville.  The invite was from organizers of an annual symposium for the Quantitative Spatial Ecology, Evolution and Environment (QSE3) Integrative Graduate Education and Research Traineeship (IGERT) program.  Phew! That was a lot of typing for the first two acronyms in my blog post’s title.  The third acronym  (OA) stands for Open Access, and the fourth acronym should be familiar.

I presented a session on data management and sharing for scientists, and afterward we had a round table discussion focused on OA.  There were about 25 graduate students affiliated with the QSE3 IGERT program, a few of their faculty advisors, and some guests (including myself) involved in the discussion.  In 90 minutes we covered the gamut of current publishing models, incentive structures for scientists, LaTeX advantages and disadvantages, and data sharing.  The discussion was both interesting and energetic in a way that I don’t encounter from scientists that are “more established”.  Some of the themes that emerged from our discussion warrant a blog post.

First, we discussed that data sharing is an obvious scientific obligation in theory, but when it comes to your data, most scientists get a bit more cagey.  This might be with good reason – many of the students in the discussion were still writing up their results in thesis form, never mind in journal-ready form.  Throwing your data out into the ether without restrictions might result in some speedy scientist scooping you while you are dotting i’s and crossing t’s in your thesis draft.  In the case of grad students and scientists in general, embargo periods seem to be a good response to most of this apprehension. We agreed as a group, however, that such embargos should be temporary and should be phased out over time as cultural norms shift.

The current publishing model needs to change, but there was disagreement about how this change should manifest. For instance, one (very computer-savvy) student who uses R, LaTeX and Sweave asked “Why do we need publishers? Why can’t we just put the formatted text and code online?”  This is an obvious solution for someone well-versed in the world of document preparation in the vein of LaTeX.  You get fully formatted, high-quality publications by simply compiling documents. But this was argued against by many in attendance because LaTeX use is not widespread, and most articles need heavy amounts of formatting before publication.  Of course, this is work that would need to be done by the overburdened scientist if they published their own work, which is not likely to become the norm any time soon.

No journals means empty library shelves. Perhaps the newly freed up space could be used to store curmudgeonly professors resistant to change.

Let’s pretend that we have overhauled both scientists and the publishing system as it is.  In this scenario, scientists use free open-source tools like LaTeX and Sweave to generate beautiful documents.  They document their workflows and create Python scripts that run in the command line for reproducible results.  Given this scenario, one of the students in the discussion asked “How do you decide what to read?” His argument was that the current journal system provides some structure for scientists to home in on interesting publications and determine their quality based (at least partly) on the journal in which the article appears.

One of the other grad students had an interesting response to this: use tags and keywords, create better search engines for academia, and provide capabilities for real-time peer review of articles, data, and publication quality.  In essence, he used the argument that there’s no such thing as too much information. You just need a better filter.

One of the final questions of the discussion came from the notable scientist Craig Osenberg. It was in reference to the shift in science towards “big data”, including remote sensing, text mining, and observatory datasets. To paraphrase: Is anyone worrying about the small datasets? They are the most unique, the hardest to document, and arguably the most important.

My answer was a resounding YES! Enter the DCXL project.  We are focusing on providing support for the scientists that don’t have data managers, IT staff, and existing data repository accounts that facilitate data management and sharing.  One of the main goals of the DCXL project is to help “the little guy”.  These are often scientists working on relatively small datasets that can be contained in Excel files.

In summary, the very smart group of students at UF came to the same conclusions that many of us in the data world have: there needs to be a fundamental shift in the way that science is incentivized, and this is likely to take a while.  Of course, given that these students are early in their careers, and their high levels of interest and intelligence, they are likely to be a part of that change.

Special thanks go to Emilio Bruna (@brunalab) who not only scored me the invite to UF, but also hosted me for a lovely dinner during my visit (albeit NOT the Tasty Budda…)

Data Policies & Other Things

Last Friday I attended a seminar at UC Berkeley’s iSchool given by MacKenzie Smith, a terrific presenter and colleague who is affiliated with Creative Commons (among other prestigious organizations).  MacKenzie was talking about data governance, an issue I covered a few months back for the DCXL blog.  However, on Friday MacKenzie brought up a few things that I think warrant another post.

First, let’s define data governance for those that aren’t familiar with the concept. Based on Wikipedia’s entry, it’s the policies surrounding data, including data risk management, assignment of roles and responsibilities for data, and more generally formally managing data assets throughout the research cycle.  Now on to the new things:

Data policies are some combination of scary and confusing. Similar to Thing from The Addams Family. From monstermoviemusic.blogspot.com

Thing 1: Facts cannot be copyrighted. It makes sense for things like, say, simple math. I can’t say “2+2=4” © 2011 Carly Strasser. Known facts can’t be copyrighted.  So what about data? One might argue that data are facts (assuming you are doing science correctly). That means you don’t own the copyright to your data. Eeek! Scary thought, I know. You might be saved by the fact that a unique arrangement or collection of facts can be copyrighted. Huh.  Data in a database? Can’t be copyrighted. The database itself? Can be copyrighted. This obviously makes things related to data quite messy when it comes to intellectual property.

Thing 2: Did you know that “attribution” can be legally imposed? The remedy for a lack of attribution where warranted is a lawsuit. Creative Commons licenses are built on this fact.  This is not true, however, of citation.  Citation is a “scholarly norm” that has no underlying legality.

Thing 3: Creative Commons is now working on a CC 4.0 license. Some of the goals of this new version are enabling internationalization and interoperability, and improving support for data, science, and education. They want input from scientists, librarians, administrators, and anyone else who might have an opinion about intellectual property, open science, and governance in general.

Thing 4: The Open Knowledge Foundation is working on concepts related to governance with a global perspective.  They have a range of projects in the works for improving the sharing of knowledge, data, and content.

Thing 5: While waiting for a consensus on how to properly govern digital data and other digital content, many data providers are dealing with governance by constructing data usage agreements.  These are contracts created by lawyers for a specific data provider (e.g., an online database).  The problem with data usage agreements is that they are all different.  This means that if you want to use data from a source that requires you agree to their terms, you have three options:

  1. Carefully read the terms before agreeing (and who does that?)
  2. Click that you agree without reading and hope you don’t accidentally break any rules
  3. Find the data that you need from another source that doesn’t have terms and conditions for data usage.

Item three points to one of the serious downsides to data usage agreements: researchers may avoid using data if they don’t understand the terms of use.  Furthermore, the terms only apply to the party that agreed to the contract (i.e. checked the box).  If they (potentially illegally) share those data with someone else, that someone else is not bound by the terms.

Thing 6: What about international collaborations? As you might imagine, this offers yet another layer of complication. As a scientist, you are supposed to be ensuring that you look into any data policies that may apply to your collaborators. From NSF DMP FAQ (hello, alphabet soup!):

16. If I participate in a collaborative international research project, do I need to be concerned with data management policies established by institutions outside the United States?

Yes. There may be cases where data management plans are affected by formal data protocols established by large international research consortia or set forth in formal science and technology agreements signed by the United States Government and foreign counterparts. Be sure to discuss this issue with your sponsored projects office (or equivalent) and your international research partner when first planning your collaboration.

Hmm. It looks like the waters are very muddy right now, and until they clear, researchers should watch their step.

Data Literacy Instruction: Training the Next Generation of Researchers

This post was contributed by Lisa Federer, Health and Life Sciences Librarian at UCLA Louise M. Darling Biomedical Library

In my previous life as an English professor, every semester I looked forward to the information literacy instruction that our librarian did for my classes.  I always learned something new, and, even better, my students no longer tried to cite Wikipedia as a source in their research papers.  Now that I’m a health and life sciences librarian, the tables are turned, and I’m the one responsible for making sure that my patrons are equipped to locate and use the information they need.  When it comes to the people I work with in the sciences, often the information they need is not an article or a book, but a dataset.  As a result, I am one of many librarians starting to think about best practices for providing data literacy instruction.

According to the National Forum on Information Literacy, information literacy is “the ability to know when there is a need for information, to be able to identify, locate, evaluate, and effectively use that information for the issue or problem at hand.”  The American Library Association has outlined a list of Information Literacy Competency Standards for Higher Education.  So far, a similar list of competencies for data literacy instruction has not been defined, but the general concepts are the same – researchers need to know how to locate data, evaluate it, and use it.  More importantly, as data creators themselves, they need to know how to make their datasets available and useful not just to their own research group, but to others.

Fortunately, a number of groups around the country are working on developing data literacy curricula.  Teams from Purdue University, Stanford University, the University of Minnesota, and the University of Oregon have received a grant from the Institute of Museum and Library Services (IMLS) to “develop a training program in data information literacy for graduate students who will become the next generation of scientists.”  Results and resources will eventually be available on their project website.  Also working under the auspices of an IMLS grant, a team from University of Massachusetts Medical School and Worcester Polytechnic Institute has developed a set of seven curricular modules for teaching data literacy.  Their curriculum centers on teaching researchers what they would need to know to complete a data management plan as required by the National Science Foundation (NSF) and several other major grant funders.

All of the work that these other institutions have done is a fantastic start, but at my institution, the researchers and students are very busy, and not likely to commit to a seven-session data literacy program.  Nonetheless, it’s still important that they learn how to manage, preserve, and share their data, not only because many funders now require it, but also because it’s the right thing to do as a member of the scientific community.  Thus, my challenge has been to design a one-off session that would be applicable across a variety of scientific (and perhaps even social science) fields.  In order to do so, I’ve started with my own list of core competencies for data literacy instruction, including:

  • understanding the “data life cycle” and the importance of sharing and preservation across the entire life cycle, especially for rare or unique datasets
  • knowing how to write a data management plan that will fulfill the requirements of funders like NSF
  • making appropriate choices about file forms and formats (such as by choosing open rather than proprietary standards)
  • keeping data organized and discoverable using file naming standards and appropriate metadata schema
  • planning for long-term, secure storage of data
  • promoting sharing by publishing datasets and assigning persistent identifiers like DOIs
  • recognizing data as scholarly output that should be considered in the context of promotion and tenure

Does this list cover everything a researcher would need to know to effectively manage their data?  Almost certainly not, but as with any single session, my goal is to introduce learners to the major issues and let them know that the library has the expertise to assist them with the more complicated issues that will inevitably arise.  Supporting the data needs of researchers is a daunting task, but librarians already have much of the knowledge and skills to provide this assistance – we simply need to adapt our knowledge of information structures and best practices to this burgeoning area.

As research becomes increasingly data-driven, libraries will be doing a great service to individuals and the research community as a whole by helping to create researchers who are good data stewards.  Like my formerly Wikipedia-dependent students, many of our researchers are still taking shortcuts when it comes to handling their data because they simply don’t know any better.  It’s up to librarians and other information professionals to ensure that the valuable research that is going on at our institutions remains available for future generations of researchers.

The Digital Dark Age, Part 2

Earlier this week I blogged about the concept of a Digital Dark Age.  This is a phrase that some folks are using to describe some future scenario where we are not able to read historical digital documents and multimedia because they have been rendered obsolete or were otherwise poorly archived.  But what does this mean for scientific data?

Consider that Charles Darwin’s notebooks were recently scanned and made available online.  This was possible because they were properly stored and archived, in a long-lasting format (in this case, on paper).  Imagine if he had taken pictures of his finch beaks with a camera and saved the digital images in obsolete formats.  Or ponder a scenario where he had used proprietary software to create his famous Tree of Life sketch.  Would we be able to unlock those digital formats today?  Probably not.  We might have lost those important pieces of scientific history forever.   Although it seems like software programs such as Microsoft Excel and MATLAB will be around forever, people probably said similar things about the programs Lotus 1-2-3 and iWeb.

“Darwin with Finches” by Diana Sudyka, from Flickr by Karen E James

It is a common misconception that things that are posted on the internet will be around “forever”.  While that might be true of embarrassing celebrity photos, it is much less likely to be true for things like scientific data.  This is especially the case if data are kept on a personal/lab website or archived as supplemental material, rather than being archived in a public repository (See Santos, Blake and States 2005 for more information).  Consider the fact that 10% of data published as supplemental material in the six top-cited journals was not available a mere five years later (Evangelou, Trikalinos, and Ioannidis, 2005).

Natalie Ceeney, chief executive of the National Archives, summed it up best in this quote from The Guardian’s 2007 piece on preventing a Digital Dark Age: “Digital information is inherently far more ephemeral than paper.”

My next post and final DDA installment will provide tips on how to avoid losing your data to the dark side.

The Digital Dark Age, Part 1

This will be known as the Digital Dark Age.  The first time I heard this statement was at Internet Archive, during the PDA 2012 Meeting (read my blog post about it here).  What did this mean?  What is a Digital Dark Age? Read on.

While serving in Vietnam, my father wrote letters to my grandparents about his life fighting a war in a foreign country.  One of his letters was sent to arrive in time for my grandfather’s birthday, and it contained a lovely poem that articulated my father’s warm feelings about his childhood, his parents, and his upbringing.  My grandparents kept the poem framed in a prominent spot in their home.  When I visited them as a child, I would read the poem written in my young dad’s handwriting, stare at the yellowed paper, and think about how far that poem had to travel to relay its greetings to my grandparents.  It was special for its history, the people involved, and the fact that these people were intimately connected to me.

Now fast forward to 2012.  Imagine modern-day soldiers all over the world, emailing, making satellite phone calls, and chatting with their families via video conferencing.  When compared to snail mail, these modern communication methods are likely a much preferred way of staying in touch for those families.  But how likely is it that future grandchildren will be able to listen to those conversations, read those emails, or watch those video calls?  The answer is extremely unlikely.

These two scenarios sum up the concept of a Digital Dark Age: compared to 40 years ago, we are doing a terrible job of ensuring that future generations will be able to read our letters, look at our pictures, or use our scientific data.

You mean future generations won’t be able to listen to my mix tapes?! From Flickr by newrambler

The Digital Dark Age “refers to a possible future situation where it will be difficult or impossible to read historical digital documents and multimedia, because they have been stored in an obsolete and obscure digital format.”  The phrase “Dark Age” is a reference to The Dark Ages, a period in history around the beginning of the Middle Ages characterized by a scarcity of historical and other written records at least for some areas of Europe, rendering it obscure to historians.  Sounds scary, no?

How can we remedy this situation? What are people doing about it? Most importantly, what does this mean for scientific advancement? Check out my next post to find out.

Why You Should Floss

No, I won’t be discussing proper oral hygiene. What I mean by “flossing” is actually “backing up your data”.  Why the floss analogy? Here are the similarities between flossing and backing up your data:

  1. It’s undisputed that it’s important
  2. Most people don’t do it as often as they should
  3. You lie (to yourself, or your dentist) about how often you do it

Oral (and data) hygiene can be fun! From Calisphere, courtesy of UC Berkeley Bancroft Library

So think about backing up similarly to the way you think about flossing:  you probably aren’t doing it enough.  In this post, I will provide general guidance about backing up your data; as always, the advice will vary greatly depending on the types of data you are generating, how often they change, and what computational resources are available to you.

First, create multiple copies in multiple locations.  The old rule of thumb is original, near, far.  The first copy is your working copy of data; the second copy is kept near your original (this is most likely an external hard drive or thumb drive); the third is kept far from your original (off site, such as at home or on a server outside of your office building).  This is the important part: all three of these copies should be up-to-date.  Which brings me to my second point.
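If you are comfortable with a little scripting, here is a minimal sketch of the original, near, far rule in Python. It assumes your near and far locations are simply folders you can write to (say, a mounted external drive and a synced or network folder). The function name and the example paths are hypothetical, so adjust them to your own setup.

```python
import shutil
from pathlib import Path


def back_up(original, destinations):
    """Copy one file into each destination folder, replacing any older copy."""
    original = Path(original)
    copies = []
    for dest in destinations:
        dest_dir = Path(dest)
        # Create the backup folder if it does not exist yet
        dest_dir.mkdir(parents=True, exist_ok=True)
        # copy2 copies the contents and tries to keep the file's timestamps
        copies.append(Path(shutil.copy2(original, dest_dir / original.name)))
    return copies


# Hypothetical usage: one "near" copy and one "far" copy
# back_up("field_data.csv", ["/Volumes/ExternalDrive/backups", "/mnt/lab_server/backups"])
```

Running something like this at the end of each working day refreshes both copies at once, which helps keep all three versions up-to-date.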

Second, back up your data more often.  I have had many conversations with scientists over the last few months, and I always ask, “How do you back up your data?”  Answers range, but most of them scare me silly.  For instance, there was a 5th year graduate student who had all of her data on a six-year-old laptop, and only backed up once a month.  I get heart palpitations just typing that sentence.  Other folks have said things like “I use my external drive to back things up once every couple of months”, or worst case scenario, “I know I should, but I just don’t back up”.  It is strongly recommended that you back up every day. It’s a pain, right? There are two very easy ways to back up every day, and neither requires purchasing any hardware or software: (1) Keep a copy on Dropbox, or (2) Email yourself the data file as an attachment.  Note: these suggestions are not likely to work for large data sets.

Third, find out what resources are available to you. Institutions are becoming aware of the importance of good backup and data storage systems, which means there might be ways for you to back up your data regularly with minimal effort.  Check with your department or campus IT folks and ask about server space and automated backup service. If server space and/or backing up isn’t available, consider joining forces with other scientists to purchase servers for backing up (this is an option for professors more often than graduate students).

Finally, ensure that your backup plan is working.  This is especially important if others are in charge of data backup.  If your lab group has automated backup to a common computer, check to be sure your data are there, in full, and readable.  Ensure that the backup is actually occurring as regularly as you think it is.  More generally, you should be sure that if your laptop dies, or your office is flooded, or your home is burgled, you will be able to recover your data in full.
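One way to check that a backup copy is there, in full, and readable is to compare checksums of the original file and the copy. The Python sketch below is only an illustration under that assumption; the function names and the example paths are hypothetical, and it checks a single file rather than a whole folder.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large data files do not have to fit in memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(original, backup):
    """True only if the backup file exists and is byte-for-byte identical."""
    backup = Path(backup)
    return backup.is_file() and sha256_of(original) == sha256_of(backup)


# Hypothetical usage:
# backup_matches("field_data.csv", "/mnt/lab_server/backups/field_data.csv")
```

A result of False means the copy is missing, incomplete, or out of date, and it is time to find out why before you actually need it.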

For more information on backing up, check out the DataONE education module “Protected back-ups”.