Skip to main content

(index page)

Notes on an Amazing Existing Excel Add-In

Last week I had an interesting conversation with Hazel Asuncion from University of Washington’s Bothell campus.  Along with an undergraduate computer science student, she’s  been working on the development of an add-in for Excel that helps track data provenance.

“Hmm…”, you are thinking, “what is data provenance?”  (at least, that’s what I asked when I first heard about it about 16 months ago).  Data provenance is basically a description of where the data came from (here’s Wikipedia’s explanation).  Provenance is an old concept in science- it’s directly related to reproducibility, which is considered among the most important tenets of scientific progress (more Wikipedia information here).  Without good data provenance, i.e. a good record of how you got from measurements to reported results, others can’t evaluate and verify your conclusions independently.

With that background in mind, consider all of the complex steps that scientists take to get from the data they collect to the graphs for a journal publication.  Most scientists would probably admit that this process is extremely difficult to recreate, even for themselves.  The most common way to keep track of the various steps is using a lab notebook, or perhaps notes in a Word document, or maybe even in the same Excel file, in a different spreadsheet.  Wouldn’t it be better if there were an easy way to keep track of your actions in Excel without having to manually take notes?

Here’s where Hazel’s work comes in: she has figured out a way to document activities in Excel that seamlessly integrates with the scientist’s current workflow.  It’s a great example of a useful, interesting add-in that will prove invaluable to those that use it.  Curious? Hazel will also be presenting the work at the eScience Conference in Stockholm this December, and a paper will be published in conjunction.

Here’s a demo of the add-in, available here under the agreement that it will be credited to Hazel’s group if used:

Provenance Add-In

DCXLblog Lite

Due to a number of upcoming activities associated with the DCXL project, both of this week’s posts are quite… light.  However there was a great webinar yesterday about DCXL-  Here’s the link.  It’s about 45 minutes of riveting discussion about DCXL, data management practices among scientists, and our future plans for the project.

The DCXL Project Kickoff Meeting is Monday, so next week’s blog entries are sure to be full of exciting new information. Stay tuned.

Your Turn!

For those of you that have been reading this blog, you know that I’ve devoted much of the content to results from scientist surveys.  How about you get involved?  If you or someone you know uses Excel in the course of Oceanography, Ecology, or Environmental or Earth Sciences research, please take 5 minutes to fill out this teeny tiny survey:

Excel use survey

Who knows… you might be featured (anonymously, of course) in a future blog post!

 

Sharing Fish Data

Before last week, I had never attended an American Fisheries Society function.  I have known a few fisheries biologists in my academic and personal life, including one of my closest friends who works on migration patterns in coastal fishes.  I don’t recall having any preconceived notions about the differences between fisheries folks and ecologists (or other types of scientists with which I am more familiar), but there appear to be distinct peculiarities specific to these groups.

Observation #1: Fisheries scientists have tidy spreadsheets.  When asked how much effort it would take to “clean up” their spreadsheets so that others could use their data, more than 60% said it would take very little or no work:

How much work to prepare a spreadsheet for sharing?
The amount of work required by the respondent before sharing a spreadsheet with a colleague.

I hypothesize that fisheries data tends to be quite large and unwieldy if not managed and organized well.  It may also be that fisheries scientists are more frequently analyzing their data using other software programs and therefore are less likely to conduct analyses within Excel, which often results in cluttered spreadsheets.

Observation #2: Fisheries scientists are less likely to be willing to share their data compared to ecologists.  I admit that the graph below suggests otherwise, however this statement is based on my discussions with ecologists and fisheries scientists, and the attitudes I perceived during our talks.  The cold, hard data shows that most fisheries scientists interviewed think it’s important to share data (72%):

How important is sharing?
Respondents’ responses to the question, “How important is it to share your data?”

Although I did not include this question in my survey at ESA, ecologists were generally quite positive about sharing data, and were a bit sheepish if they had to admit they had never shared their data publicly.  This was definitely not the case with fisheries scientists. They were in general hesitant to say that sharing was important, and often qualified it with the statement “data in general should be shared”, not willing to say they should share their own data.

Observation #3: Fisheries scientists’ unwillingness to share data translates to what they want from the add-in: they are most interested in having help with organizing and managing their data for their own use.  Those who answered “no opinion” either thought  data sharing was unimportant, or were satisfied with Excel as it is.

What the add-in should do
Respondents’ answer to the question, “What should the Excel add-in help you accomplish?

 

Excel Use Among Fisheries Folks

Last week I spent some time at the 2011 Fall Meeting for the American Fisheries Society.  The streets of Seattle were swamped with an estimated 4,000 AFS attendees… I wonder how many made it to Pike Place Market for a little on-the-ground research?  I was more interested in the tasty coffee choices that Seattle has to offer, and I wasn’t disappointed.  My reason for going to Seattle was not, however, to partake of the city’s many offerings, but instead to continue my quest for understanding how different types of scientists are using Excel, and how the Excel add-in that will result from this project might help them.

In that vein, fisheries folks are heavy Excel users: about 75% of the 36 scientists surveyed used Excel either every day or almost every day.  They use Excel in concert with other programs (primarily Microsoft Access, R, and ArcGIS):

Programs used with Excel
Percent of 36 fisheries scientists who use these software programs along with Excel

Similar to the ecologists at ESA I spoke with, Excel is being used for a wide range of applications:

Excel use by AFS attendees
Percent of 36 fisheries scientists surveyed using Excel for a particular application

Again, sharing and organizing data float to the top as primary uses for Excel among scientists.  The next post will some interesting discussions I had with AFS attendees about data sharing and data archiving, and what they would like to see specifically from the add-in being developed.

Potentially Problematic Excel Features

This post wraps up my report from the ESA conference, but I saved some of the best info for last.  I asked ecologists what features they used in Excel in the course of their research. There’s no question that this group of scientists is using Excel to its full potential.  However some of the features that make Excel so great cause major problems when trying to archive data.  Here’s what people are using:

Features used in Excel
Percent of Ecologists surveyed who use a particular feature of Excel

Not surprisingly, everyone is using the formulas available in Excel.  This is certainly one of Excel’s greatest strengths, and what makes it universally useful across disciplines and applications (I know I sometimes open Excel rather than a calculator to do simple arithmetic).  While at ESA I had an interesting conversation with an ecologist who possessed a 30-year data set of tree diversity.  Some of the data in his Excel spreadsheets were contained within embedded formulas.  These data would be lost if the spreadsheet were exported as a .csv file, but he does not have the personnel or funds to manually extract the data.  This is an interesting challenge we are considering in light of the Excel add-in.

The phrase “cell shading” indicates that color-coding in cells is to convey information about the data.  Essentially, the shading is metadata that is only visible in Excel.   This is bad news for data archiving since any color-coding will be lost if data are exported as any type of file except .xls.  The same problems go for using the “Comments” feature in Excel- these potentially important nuggets of information are lost outside of the software program.

Other potentially tricky features for a data archive to handle are multiple tabs in a file, multiple tables in a tab, the ever-popular pivot tables, and macros.   A somewhat comforting statistic is that ecologists are consistently labeling their columns.  This means there exists at least some metadata in a given Excel spreadsheet.

Next week’s post will pertain to the American Fisheries Society Meeting in Seattle, WA that I attended this week.  Here’s a sneak peek: fisheries folks are on average much less likely to share data (!).

Quantitative Results From the ESA Conference

For the last two posts, I have given a brief overview of the qualitative results from the ESA meeting.  For this post, I am reporting the nitty-gritty numbers.

I formally surveyed 55 Ecologists while at the ESA meeting, and managed to get a fairly diverse pool. Here’s how their academic status fell out:

Survey Participant Breakdown
Breakdown of survey participants

Their areas of study were fairly diverse too:

Areas of study
Number of participants for each area of study

When asked how frequently they used Excel, 75% responded with “every day” or “almost every day”.   All of that Excel use resulted in a general feeling that they were knowledgeable about Excel:

How knowledgeable are you?
Results from question, “How knowledgeable are you about Excel?”

Among the most interesting results related to Excel use. I asked respondents what they used Excel for:

What do you use Excel for?
Answers to the question, “What do you use Excel for?”

Not surprisingly, almost all participants used Excel for organizing their data, and a large percentage shared their data in the form of Excel files.

What does all of this mean for the DCXL project? It means there’s a need we are going to fill: Ecologists use Excel for data organization and sharing, but they aren’t using it as effectively as they could (see previous post).

The goal of this project is to identify what features the Excel software could include that would facilitate data sharing, data management, and data archiving. It looks like we have our work cut out for us!

Ecology Spreadsheets 101

While at ESA, I wore two hats- one as the DCXL project manager, and one as a DataONE postdoc working on engaging the scientific community in data management, education, and archiving (see more about my work with DataONE here).  Part of my DataONE tasks at ESA involved participating in two workshops related to data management (slides for both workshops are available here).

For one of the workshops, attendees brought their Excel spreadsheets and allowed organizers to take a look.  The goal was to help them better organize their own data, but I might have learned just as much as they did.  I observed several common practices that represent poor data management, and were almost universally used among Ecologists.  So if you are an Excel user, here’s some tips that might help you get one step closer to being a rockstar at data organization (and more prepared for data archiving):

  1. Create only one table per spreadsheet. I too am guilty of creating multiple tables, separated by empty columns, on a single spreadsheet.  Perhaps these tables represent different sites, or species, or days.  A better option, however, is to separate these tables into multiple spreadsheets (or different tabs in a single Excel file).
  2. Many researchers use comments, graph things, and make notes on a spreadsheet while pondering how best to display their data. However all of these things will make your spreadsheet unreadable to many statistical programs and make exporting as a text file for archiving more difficult.  If eschewing these tools inhibits your creativity, consider maintaining a “raw” tab that is not cluttered by miscellaneous information.
  3. Following up on that last point, keep your raw data raw. Once you’ve entered all of your data into your spreadsheet, you should save it with the title “DO_NOT_TOUCH_UNDER_ANY_CIRCUMSTANCES_RAW_DATA.xls” (or perhaps some other, more informative title of your choosing).  Raw data should stay raw- it guarantees that you always have a starting point if you screw something up during analysis, quality control, or other points in your research.  Manipulate and graph your data in a separate spreadsheet from your raw data.
  4. Use good organization in your tables. Ideally, a row is a complete record, and a column consists of all of the data that make up that record.
  5. Construct your spreadsheet so that new information adds rows, not columns. If the overall structure of your spreadsheet doesn’t change with each field season or sample site, it will be easier to import into other programs or reanalyze updated data files.
  6. Atomize: make sure each cell has only one piece of information. For instance, if your spreadsheet has a column titled Location with an entry “Austin, TX”, an improvement would be to create two columns for location, one for City and the other State. Commas and spaces in data cells might cause problems for archiving data with certain data centers, and also inhibits importing spreadsheets into other software programs.
  7. Use plain text ASCII characters for variable names and data.  Again, this ensures your data are accurate if you export them to other programs

Do you have some suggestions for good data organization in Excel spreadsheets? Pass them along by commenting below.