Communication Breakdown: Nerds, Geeks, and Dweebs

Last week the DCXL crew worked on finishing up the metadata schema that we will implement in the DCXL project. WAIT! Keep reading! I know the phrase “metadata schema” doesn’t necessarily excite folks – especially science folks. I have a theory for why this might be, and it can be boiled down to a systemic problem I’ve encountered ever since becoming deeply entrenched in all things related to data stewardship: communication breakdown.

I began working with the DataONE group in 2010, and I was quickly overwhelmed by the steep learning curve I encountered related to data topics. There was a whole vocabulary set I had to learn, an entire ecosphere of software and hardware, and a hugely complex web of computer science-y, database-y, programming-y concepts to unpack. I persevered because the topics were interesting to me, but I often found myself spending time on websites that were indecipherable to the average intelligent person, or reading 50-page “quick start guides”, or getting lost in a rabbit hole of Wikipedia entries for new concepts related to data.

[Image: Fredo Corleone] Fredo Corleone was smart. Not stupid like everybody says. Nerds, Geeks, and Dweebs are all smart – just in different ways. (Image from godfather.wikia.com)

I love learning, so I am not one to complain about spending time exploring new concepts. However, I would argue that my difficulties represent a much bigger issue plaguing advances in data stewardship: communication issues. It’s actually quite obvious why these communication problems exist. There are a lot of smart people involved in data, and they come from very different backgrounds. I suggest that the smart people can be broken down into three camps: the nerds, the geeks, and the dweebs. These stereotypes should not be considered insults; rather, they are an easy way to refer to scientists, computer types, and librarians, respectively. Check out the full Venn diagram of nerds here.

The Nerds. Also known as scientists. This is the group to which I belong. We are specially trained in a field and have in-depth knowledge of our pet projects, but general training in computers, digital data, and data preservation is not part of our education. That might change in the near future, but in general we avoid the command line like the plague, prefer user-friendly GUIs, and resist learning any new software or tools that might take time away from our pet projects.

The Geeks. Also known as computer folks. These folks might be developers, computer scientists, information technology specialists, database managers, etc. They are uber-smart, but from what I can tell their uber-smart brains do not work like mine. Geeks tend to explain things to me in one of two ways:

1. “To turn your computing machine on, you need to first plug it in. Then push the big button.”
2. “First go to bluberdyblabla and enter c>*#&$) at the prompt. Make sure the juberdystuff is installed in the right directory, though. Otherwise you need to enter #($&%@> first and check the shumptybla before proceeding.”

In all fairness, (1) occurs far less often than (2). But often you get (1) after trying to get clarification on (2). How to remedy this? First, geeks should realize that our brains don’t think in terms of directories and command line prompts. We are more comfortable with folders we can color code and GUIs that allow us to use the mouse for making things happen. That said, we aren’t completely clueless. Just remember that our vocabularies are often quite different from yours. Often I’ve found myself writing down terms in a meeting so I can go look them up later. Words like “elements” and “terminal” are not unfamiliar in and of themselves. However, the contexts in which they are used are completely new to me. That doesn’t even count the unfamiliar words and acronyms, like APIs, GitHub, Python, and XML.

The Dweebs. Also known as librarians. These folks are increasingly called “information professionals”, but the gist is the same – they are all about understanding how to deal with information in all its forms. There’s certainly a bit of crossover with the computer types, especially when it comes to data. However, librarian types are fundamentally different in that they are often concerned with information generated by other people: put simply, they want to help, or at least interact with, data producers. There are certainly a host of terms that are used more often by librarian types: “indexing” and “curation” come to mind. Check out the DCXL post on libraries from January.

Many of the projects in which I am currently involved require all three of these groups: nerds, geeks, and dweebs. I watch each group struggle to communicate their points to the others, and too often they decide that it’s not worth the effort. How can we solve this communication impasse? I have a few ideas:

Nerds: open your minds to the possibility that computer types and librarian types might know about better ways of doing what you are doing. Tap the resources that these groups have to offer. Stop being scared of the unknown. You love learning or you wouldn’t be a scientist; devote some of that love in the direction of improving your computer savvy.
Geeks: dumb it down, but not too much. Recognize that scientists and librarians are smart, but potentially in very different ways than you. Also, please recognize that change will be incremental, and we will not universally adopt whatever you think is the best possible set of tools or strategies, no matter how “totally stupid” our current workflow seems.
Dweebs: spend some time getting to know the disciplines you want to help. Toot your own horn – you know A LOT of stuff that nerds and geeks don’t, and you are all so darn shy! Make sure both geeks and nerds know of your capacity to help, and your ability to lend important information to the discussion.

And now a special message to nerds (please see the comment string below about this message and its potential misinterpretation). I plead with you to stop reinventing the wheel. As scientists have begun thinking about their digital data, I’ve seen a scary trend of them taking the initiative to invent standards, start databases, or create software. It’s frustrating to see, since there is a whole set of folks out there who have already been working on databases, standards, vocabularies, and software: librarians and computer types. Consult with them rather than starting from scratch.

In the case of dweebs, nerds, and geeks, working together as a whole is much much better than summing up our parts.

What’s the Deal with .xlsx?

A few years back, Microsoft Excel started automatically saving my spreadsheet files with the extension .xlsx. I first noticed it when I got a new laptop for my postdoc at the University of Alberta. Suddenly, I had to be cognizant of the fact that if I left Excel to its own devices, the spreadsheets I generated would not be readable on my home computer, which had an older version of Excel.

First, let’s cover exactly what that extra “x” is for. The additional “x” in Excel file extensions stands for XML. XML is Extensible Markup Language, which is a markup language useful for data, databases, and data-related applications. The file type .xlsx is a combination of XML architecture and ZIP compression for size reduction. Here’s a succinct summary from mrexcel.com:

If you’ve ever looked at the “View Source” view of a webpage in Notepad, you are familiar with the structure of XML. While HTML allows for certain tags, like TABLE, BODY, TR, TD, XML allows for any tags. You can make up any sort of a tag to describe your data.

You can also check out Microsoft’s description of XML in Excel. What all of this means is that .xlsx files are more generalized and easier to use with web-based applications. It’s a good thing!
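For the curious, the “ZIP plus XML” idea can be seen with a few lines of Python. What follows is only a rough sketch: the helper names are made up for illustration, and the toy archive it builds is a stand-in with two hand-written XML parts, not a valid workbook. The part names mimic the layout commonly found inside real .xlsx files, and real parts also carry XML namespaces that this sketch leaves out.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def list_parts(xlsx_bytes: bytes) -> list[str]:
    """Return the names of the files stored inside an .xlsx-style ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as archive:
        return archive.namelist()


def read_xml_root_tag(xlsx_bytes: bytes, part_name: str) -> str:
    """Parse one XML part from the archive and return the name of its root tag."""
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as archive:
        root = ET.fromstring(archive.read(part_name))
    return root.tag


def make_toy_archive() -> bytes:
    """Build a tiny stand-in archive for demonstration (NOT a valid workbook)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Hypothetical, heavily simplified parts; real files hold much more.
        archive.writestr("xl/workbook.xml", "<workbook><sheets/></workbook>")
        archive.writestr(
            "xl/worksheets/sheet1.xml", "<worksheet><sheetData/></worksheet>"
        )
    return buffer.getvalue()


if __name__ == "__main__":
    data = make_toy_archive()
    print(list_parts(data))
    print(read_xml_root_tag(data, "xl/workbook.xml"))
```

To try this on a real spreadsheet, you could read the file's bytes with `open(path, "rb").read()` and pass them to `list_parts`. Renaming a copy of the file from .xlsx to .zip and opening it with any unzip tool shows the same thing without code.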

[Image: Beatles album cover] Just like John and Paul, XML and Excel come together to make beautiful things happen.

You might be asking yourself why I’m writing about .xlsx. Isn’t this an old issue that folks have figured out by now? The answer to that is yes and no. Many of the scientists I have spoken with over the last few months are entrenched in their current Excel version, and have major complaints about moving to newer versions. Excel 2003 (2004 for Mac), which predates the .xlsx file type, is still heavily used among some groups. Other scientists have moved on to later versions of Excel, but still have colleagues, advisors, or collaborators who use older versions and therefore cannot open .xlsx files. So while many scientists can tell you they have noticed the new extension on their Excel files, they don’t understand the underlying changes.

Of course, you can tell Excel to generate and save files in the old .xls format by going to “Excel Options… Save” and changing your settings so files are saved as .xls.

On a Mac, use the “Preferences… Compatibility” menu.

Help Wanted: Add-in versus Web Application?

I recently updated this site with a page listing the DCXL Requirements. These five requirements are the basic feature set and capabilities we would like to have for the Excel Add-in that is to be developed in the course of the project. The engineering team at Microsoft Research checked out our requirements and had a (rather surprising) suggestion: instead of an add-in, they recommended a web-based application.

Add-ins are little pieces of software that you can download to extend the capabilities of a program – in our case, Microsoft Excel. Synonyms for add-in are plug-in and add-on. They are downloaded, installed, and then appear within a specific program. An add-in for Excel would appear in the Excel “ribbon”, and would add new features to Excel.

A web-based application is something a bit different. It’s a software system designed to support “machine-to-machine interaction over a network”. Web applications require the web (shocking, I know) and do not require that you download a program. Instead, you use an internet connection and the web-based application. Basically, these are web sites that do more than just display information – they do something with the information or files provided by the user, on the user’s behalf. Web sites such as Facebook, YouTube, and SkyDrive are examples of web applications.

So I turn to you, community: what are your thoughts on this? Make your voice heard! You can email me directly, comment on the blog below, or come on down to CDL’s Downtown Oakland office and tell me in person. But please comment quickly – this decision needs to be made soon. You can also vote using the quick poll in the sidebar to the right of this post. We want to know what you think!

To help you formulate intelligent comments, here’s a rough comparison of the two options:

Add-in: The user would download the add-in for use on the current machine. A new “ribbon” at the top of the Excel window would let them perform the tasks listed below on their current spreadsheet.

Web application: The user would go to the website hosting the web application. They would upload (drag-and-drop) a spreadsheet to the site. They could then perform the tasks listed below on the spreadsheet. The spreadsheet could then be downloaded back onto their PC.

Feature	Office Add-In	Web-Based Application
Platform compatibility	Windows only	Any
Spreadsheet compatibility	Different add-in for each Excel version	One application covers multiple versions; potential future expansion to SQL, CSV, XML, OpenOffice, Google Docs, etc.
Download necessary?	Yes	No
Software updates	Bug fixes require download and re-install	No download or re-install necessary
Cloud-based?	No	Yes
Offline use?	Yes	No; potential future offline use via HTML5
Languages	C#/.NET, C/C++	HTML/JavaScript, C#/ASP.NET
Has all the functionality of Excel?	Yes	No

And here are the basic capabilities we want, regardless of which of the two options above becomes a reality:

Must work for Excel users without the add-in
No additional software (other than add-in and Excel) necessary
Can be used offline
Perform CSV compatibility checks, reporting, and automated fixes
Add Metadata to data file
1. Can use existing metadata as a template
2. Add-in can automatically generate some of the metadata where the info is available from the file
Generate a citation for the data file
Deposit data and metadata in a repository
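
To make the “CSV compatibility checks” item a little more concrete, here is a rough sketch of what a check-and-report step might look like. The three checks (empty headers, duplicate headers, and rows with the wrong number of cells) and the function name `check_csv` are my own illustrative assumptions. They are not the checks DCXL has settled on, and the automated-fix part is left out entirely.

```python
import csv
import io


def check_csv(text: str) -> list[str]:
    """Return human-readable problems found in CSV text (empty list if none).

    The checks here are hypothetical examples, not an official DCXL list.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]

    problems = []
    header = [cell.strip() for cell in rows[0]]

    # Check 1: every column should have a name.
    for index, name in enumerate(header, start=1):
        if not name:
            problems.append(f"column {index} has an empty header")

    # Check 2: column names should be unique.
    seen = set()
    for name in header:
        if name and name in seen:
            problems.append(f"duplicate header: {name}")
        seen.add(name)

    # Check 3: every data row should have as many cells as the header.
    for row_number, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"row {row_number} has {len(row)} cells, expected {len(header)}"
            )
    return problems


if __name__ == "__main__":
    for problem in check_csv("site,,site\nA,1\n"):
        print(problem)
```

A report like this could be shown to the user before any fixes are attempted, so nothing in the spreadsheet changes without the scientist seeing why.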

Download the complete requirements as a PDF: DCXL Requirements