Skip to content

Abandon all hope, ye who enter dates in Excel

Posted in UC3

Big thanks to Kara Woo of Washington State University for this guest blog post!

Update: The XLConnect package has been updated to fix the problem described below; however, other R packages for interfacing with Excel may import dates incorrectly. One should still use caution when storing data in Excel.


Like anyone who works with a lot of data, I have a strained relationship with Microsoft Excel. Its ubiquity forces me to tolerate it, yet I believe that it is fundamentally a malicious force whose main goal is to incite chaos through the obfuscation and distortion of data.1 After discovering a truly ghastly feature of how it handles dates, I am now fully convinced.

As it turns out, Excel “supports” two different date systems: one beginning in 1900 and one beginning in 1904.2 Excel stores all dates as floating point numbers representing the number of days since a given start date, and Excel for Windows and Mac have different default start dates (January 1, 1900 vs. January 1, 1904).3 Furthermore, the 1900 date system purposely erroneously assumes that 1900 was a leap year to ensure compatibility with a bug in—wait for it—Lotus 1-2-3.

You can’t make this stuff up.

What is even more disturbing is how the two date systems can get mixed up in the process of reading data into R, causing all dates in a dataset to be off by four years and a day. If you don’t know to look for it, you might never even notice. Read on for a cautionary tale.

I work as a data manager for a project studying biodiversity in Lake Baikal, and one of the coolest parts of my job is getting to work with data that have been collected by Siberian scientists since the 1940s. I spend a lot of time cleaning up these data in R. It was while working on some data on Secchi depth (a measure of water transparency) that I stumbled across this Excel date issue.

To read in the data I do something like the following using the XLConnect package:

library(XLConnect)
wb1 <- loadWorkbook("Baikal_Secchi_64to02.xlsx")
secchi_main <- readWorksheet(wb1, sheet = 1)
colnames(secchi_main) <- c("date", "secchi_depth", "year", "month")

So far so good. But now, what’s wrong with this picture?

head(secchi_main)
##         date secchi_depth year month
## 1 1960-01-16           12 1964     1
## 2 1960-02-04           14 1964     2
## 3 1960-02-14           18 1964     2
## 4 1960-02-24           14 1964     2
## 5 1960-03-04           14 1964     3
## 6 1960-03-25           10 1964     3

As you can see, the year in the date column doesn’t match the year in the year column. When I open the data in Excel, things look correct.

excel_secchi_data

This particular Excel file uses the 1904 date system, but that fact gets lost somewhere between Excel and R. XLConnect can tell that there are dates, but all the dates are wrong.

My solution for these particular data was as follows:

# function to add four years and a day to a given date
fix_excel_dates <- function(date) {
    require(lubridate)
    return(ymd(date) + years(4) + days(1))
}

# create a correct date column
library(dplyr)
secchi_main <- mutate(secchi_main, corrected_date = fix_excel_dates(date))

The corrected_date column looks right.

head(secchi_main)
##         date secchi_depth year month corrected_date
## 1 1960-01-16           12 1964     1     1964-01-17
## 2 1960-02-04           14 1964     2     1964-02-05
## 3 1960-02-14           18 1964     2     1964-02-15
## 4 1960-02-24           14 1964     2     1964-02-25
## 5 1960-03-04           14 1964     3     1964-03-05
## 6 1960-03-25           10 1964     3     1964-03-26

That fix is easy, but I’m left with a feeling of anxiety. I nearly failed to notice the discrepancy between the date and year columns; a colleague using the data pointed it out to me. If these data hadn’t had a year column, it’s likely we never would have caught the problem at all. Has this happened before and I just didn’t notice it? Do I need to go check every single Excel file I have ever had to read into R?

And now that I know to look for this issue, I still can’t think of a way to check the dates Excel shows against the ones that appear in R without actually opening the data file in Excel and visually comparing them. This is not an acceptable solution in my opinion, but… I’ve got nothing else. All I can do is get up on my worn out data manager soapbox and say:

and-thats-why-excel


  1. For evidence of its fearsome power, see these examples.
  2. Though as Dave Harris pointed out, “is burdened by” would be more accurate.
  3. To quote John Machin, “In reality, there are no such things [as dates in Excel spreadsheets]. What you have are floating point numbers and pious hope.”

Be First to Comment

Leave a Reply

Your email address will not be published. Required fields are marked *