While at ESA, I wore two hats- one as the DCXL project manager, and one as a DataONE postdoc working on engaging the scientific community in data management, education, and archiving (see more about my work with DataONE here). Part of my DataONE tasks at ESA involved participating in two workshops related to data management (slides for both workshops are available here).
For one of the workshops, attendees brought their Excel spreadsheets and allowed organizers to take a look. The goal was to help them better organize their own data, but I might have learned just as much as they did. I observed several common practices that represent poor data management, and were almost universally used among Ecologists. So if you are an Excel user, here’s some tips that might help you get one step closer to being a rockstar at data organization (and more prepared for data archiving):
- Create only one table per spreadsheet. I too am guilty of creating multiple tables, separated by empty columns, on a single spreadsheet. Perhaps these tables represent different sites, or species, or days. A better option, however, is to separate these tables into multiple spreadsheets (or different tabs in a single Excel file).
- Many researchers use comments, graph things, and make notes on a spreadsheet while pondering how best to display their data. However all of these things will make your spreadsheet unreadable to many statistical programs and make exporting as a text file for archiving more difficult. If eschewing these tools inhibits your creativity, consider maintaining a “raw” tab that is not cluttered by miscellaneous information.
- Following up on that last point, keep your raw data raw. Once you’ve entered all of your data into your spreadsheet, you should save it with the title “DO_NOT_TOUCH_UNDER_ANY_CIRCUMSTANCES_RAW_DATA.xls” (or perhaps some other, more informative title of your choosing). Raw data should stay raw- it guarantees that you always have a starting point if you screw something up during analysis, quality control, or other points in your research. Manipulate and graph your data in a separate spreadsheet from your raw data.
- Use good organization in your tables. Ideally, a row is a complete record, and a column consists of all of the data that make up that record.
- Construct your spreadsheet so that new information adds rows, not columns. If the overall structure of your spreadsheet doesn’t change with each field season or sample site, it will be easier to import into other programs or reanalyze updated data files.
- Atomize: make sure each cell has only one piece of information. For instance, if your spreadsheet has a column titled Location with an entry “Austin, TX”, an improvement would be to create two columns for location, one for City and the other State. Commas and spaces in data cells might cause problems for archiving data with certain data centers, and also inhibits importing spreadsheets into other software programs.
- Use plain text ASCII characters for variable names and data. Again, this ensures your data are accurate if you export them to other programs
Do you have some suggestions for good data organization in Excel spreadsheets? Pass them along by commenting below.