Oceanographers: Why So Shy?
Last week I attended the TOS/ASLO/AGU Ocean Sciences 2012 Meeting in Salt Lake City. (If you are a DCXL blog regular, you know I was also at the Personal Digital Archiving 2012 Conference last week: my ears were bleeding by Friday night!). These two conferences were starkly different in many ways. Ocean Sciences had about 4,000 attendees, while PDA was closer to 100. Ocean Sciences had concurrent sessions, plenaries, and workshops, while PDA had only one room where all of the speakers presented. Although both provided provisions during breaks, PDA’s coffee and treats far surpassed those provided at the Salt Palace. But the most interesting difference? The incorporation of social media into the conference.
There are some amazing blogs out there for ocean scientists: Deep Sea News and SeaMonster come to mind immediately. There is also a plethora of active tweeters and bloggers in the ocean sciences community, including @labroides @jebyrnes (and his blog) @MiriamGoldste @RockyRohde @JohnFBruno @kzelnio @SFriedScientist @rejectedbanana @DrCraigMc @rmacpherson @Dr_Bik. I’m sure I’ve left some great ones out; feel free to tweet me and let me know (@carlystrasser)!
That being said, ocean scientists stink at social media if OS 2012 was any indication.
First, the Ocean Sciences Meeting did not declare a hash tag – this is the first major conference I’ve been to in a while that didn’t do so. What does this mean? Those of us who were trying to communicate about OS 2012 via Twitter were not able to converge under a single hash tag until Tuesday (#oceans2012). Perhaps that isn’t such a big deal since there were only a dozen Tweeters at the conference. This is unusual for a conference of this size: at AGU 2011 in December, I would hazard a guess that there were more like 200 Tweeters. Food for thought.
Second, I heard from @MiriamGoldste that there was actual, audible clapping when disparaging comments were made about social media in one of the presentations. For shame, oceanographers! You should take advantage of tools offered to you; short of using social media yourself, you should recognize its growing importance in science (read some of the linked articles below).
Now for PDA 2012. A hash tag was declared (#pda12) and about two dozen active tweeters were off and running. We had dialogues during the conference, helped answer each other’s questions, commented on speakers’ major conclusions, and generally kept those who couldn’t attend the conference in person abreast of the goings-on. Combine that with real-time blogging of the meeting, and you had a recipe for being connected whether you were sitting in a pew at the Internet Archive or not. Links were tweeted to newly-posted slides, and generally there was a buzz about the conference.
So listen up, OS 2012 attendees: You are being left in the dust by other scientists who have embraced social media. I know what you are thinking: “I don’t have time to do all of that stuff!” One of the conference tweets says it best:
More information…
Read this great post from Scientific American on Social Media for Scientists
COMPASS: Communication partnership for science and the sea. I attended a COMPASS workshop two years ago at NCEAS and was swayed by the lovely Liz Neeley that social media was not only worth my time, but it could advance my career (read “Highly tweeted articles were 11x more likely to be cited” from The Atlantic).
Generally all of the resources on the Social Media For Scientists wikispace
Social Media for Scientists Recap from American Fisheries Society blog
As for how social media relates to the DCXL project, isn’t it obvious? I’ve been collecting feedback straight from potential DCXL users using social media. Because I have tapped into these networks, the DCXL project’s outcomes are likely to be useful for a large contingent of our target audience.

Archiving Your Life: PDA 2012 Meeting
I’m currently sitting in a church. No, I’m not being disrespectful and blogging while at church. Technically, I’m in a former church, in the Richmond District of San Francisco. The Internet Archive bought an old church and turned it into an amazing space for their operation, as well as for meetings like the 2012 Personal Digital Archiving Meeting I’m currently attending.
I wasn’t sure what “personal digital archiving” meant, exactly, before I heard about this conference. It turns out the concept is very familiar to me. It’s basically thinking about how to preserve your life’s digital content – photos, emails, writings, files, scanned images, etc. The concept of archiving personal materials is a very hot topic right now. Think about Facebook, Storify, iCloud, WordPress, and Flickr, to name a few. As a scientist, I actually think of my data as personal digital files: they represent a very long period of my life, after all. So I’m at this meeting talking a bit about DCXL, and also learning a lot about some amazing new stuff. Here are a few interesting tidbits:
Cowbird: This is a place for people to tell stories, rather than just archive their lives. According to the founder (who is attending this conference), Cowbird is about the experience of life, as opposed to merely curating life. For an amazing, moving example of how Cowbird works, check this out: First Love
The Brain: Very cool, free software that helps you organize links, definitions, notes, etc. The idea is that it works just like your brain: it makes connections and creates networks to provide meaning to each link. Play with it a bit and you will be hooked.
Pinboard: Technically, I already knew about Pinboard. But the founder of the bookmarking system gave a great talk, so I’m including it here. Pinboard has been described as how the bookmarking service Delicious used to work, before it stopped working well. For a very small fee (~$10) you can store your bookmarks, tag them, and even save copies of the web pages as they were when you viewed them. This comes in particularly handy if you use a website for research and it might mysteriously disappear without warning. My favorite thing about Pinboard is that it isn’t mucked up with ads and other visual distractions.

SOS to Scientists: Help!
We are in the final stages of deciding how to proceed with the DCXL project, and we are still unsure what will work best for scientists: add-in for Excel or web-based application? (For a full comparison check out my previous blog post).
What the debate really boils down to is this: what will help scientists more? Which of the two options is most likely to foster good scientific data stewardship?
If you are a scientist, please (pretty please) take this VERY short survey on SurveyMonkey.com and help us decide what will work for most scientists most of the time.
Survey link: http://www.surveymonkey.com/s/KJHNVYC

What’s the Deal with .xlsx?
A few years back, Microsoft Excel started automatically saving my spreadsheet files with the extension .xlsx. I first noticed it when I got a new laptop for my postdoc at University of Alberta. Suddenly, I had to be cognizant of the fact that if I left Excel to its own devices, the spreadsheets I generated would not be readable on my home computer equipped with an older version of Excel.
First, let’s cover exactly what that extra “x” is for. The additional “x” in Excel file extensions stands for XML. XML is Extensible Markup Language, which is a markup language useful for data, databases, and data-related applications. The file type .xlsx is a combination of XML architecture and ZIP compression for size reduction. Here’s a succinct summary from mrexcel.com:
If you’ve ever looked at the “View Source” view of a webpage in Notepad, you are familiar with the structure of XML. While HTML allows for certain tags, like TABLE, BODY, TR, TD, XML allows for any tags. You can make up any sort of a tag to describe your data.
You can also check out Microsoft’s description of XML in Excel. What all of this means is that .xlsx files are more generalized and easier to use with web-based applications. It’s a good thing!
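To see the “XML plus ZIP” structure for yourself, here is a minimal Python sketch that uses only the standard library. Since there is no real workbook to hand, it builds a tiny stand-in archive in memory; the part names mimic those in a typical .xlsx file, but the contents are placeholders, and the helper names `list_parts` and `build_stand_in_workbook` are my own inventions for illustration.

```python
import io
import zipfile


def list_parts(xlsx_file):
    """Return the names of the files packed inside an .xlsx (ZIP) archive."""
    with zipfile.ZipFile(xlsx_file) as archive:
        return archive.namelist()


def build_stand_in_workbook():
    """Build a tiny ZIP in memory that mimics the layout of an .xlsx file.

    Assumption: this is NOT a valid workbook, only an illustration of the
    container format. Real files hold more parts, with real content.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr("xl/workbook.xml", "<workbook/>")
        archive.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    # Print each XML part stored inside the stand-in archive.
    for name in list_parts(build_stand_in_workbook()):
        print(name)
```

Passing the path of a real file to `list_parts` instead of the stand-in should show a similar, longer set of XML parts.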

You might be asking yourself why I’m writing about .xlsx. Isn’t this an old issue that folks have figured out by now? The answer to that is yes and no. Many of the scientists I have spoken with over the last few months are entrenched in their current Excel version, and have major complaints about moving to newer versions. Excel 2003 (2004 for Mac), which predates the .xlsx file type, is still heavily used among some groups. Other scientists have moved on to later versions of Excel, but still have colleagues, advisors, or collaborators who use older versions and therefore cannot open the .xlsx file type. So while many scientists can tell you they have noticed the new extension on their Excel files, they don’t understand the underlying changes.
Of course, you can tell Excel to generate and save files in the old .xls format by going to “Excel Options… Save” and changing your settings so files are saved as .xls:
Or on a Mac, the “Preferences…. Compatibility” menu:
The Good & Bad: Web Application versus Add-in
If you missed it, I recently posted about the future direction of the DCXL project. I boiled it down to the question of add-in versus web application. The community has offered feedback, and some major themes have emerged, which I summarize below. But first, a reminder of the goods and bads of our two possible approaches:
Web application

| Good | Bad |
| --- | --- |
| Easier to maintain, update | Requires learning new user interface |
| Use with any platform (Mac, Windows, Linux, …) | |
| Generalizable/extensible | Not integrated into Excel |
| Community involvement easier | Offline use may be limited |

Excel Add-in

| Good | Bad |
| --- | --- |
| Integrated in workflow | Windows only |
| Familiar user interface & functionality | Install & updates required |
| Smaller shift in practice | Not as generalizable/extensible |
| Available offline | Not as easy for community to get involved in development, improvement |

It seems that there are strong feelings on both sides of this issue. The majority are excited about the web application, but there are some serious concerns about going whole hog into the web application realm. Most of this apprehension stems from two major issues: potential problems when offline, and the lack of a visible DCXL presence in the Excel program.
Offline use: Metadata is best collected at the time the data are collected, which means the scientist might not have an internet connection. We should make sure that any features associated with generating metadata are available offline.
DCXL presence within Excel: What if we devise a way to connect the Excel user directly to the web application from within Excel? A “Lite” version of the add-in?

If we assume that we can tackle the two problems above, then the web application might be a great direction to take. The DCXL project should focus on assisting scientists with metadata generation first, and connection to repositories second. Both of these tasks may be easier with a web application. Metadata generation could be aided by connecting to existing metadata schema and standards, which a generalizable API would make easier. More interesting is the possibility of connecting with repositories and institutions: what if there were a repository-specific implementation of the DCXL web application for each interested repository? Or a DCXL web application specifically geared towards the Geology department at UC Riverside? The possibilities for connecting with existing services become more interesting if web connections are made easy.
Needless to say, we still want feedback from the community. Decisions will be made soon, so drop me an email or comment on the blog to make your voice heard.
Google Refine: An Interesting Take on Data Organization
“A powerful tool for working with messy data.” This is the tag line for Google Refine, a web-based application that can be used to manipulate and clean up data sets. Google acquired Freebase Gridworks (originally developed by Metaweb Technologies, Inc.) back in 2010 and re-branded the application as Google Refine.
I certainly don’t claim to be an expert on exactly how Google Refine works, but it has great potential. You download the application, which works through a browser. The idea is that you upload your spreadsheet or download it from the web from within Google Refine. You can then manipulate your data, remove duplicates, rename cell entries in bulk, etc. The underlying code is available and it appears that developers are encouraged to participate. Alternatively, if you are generally fearful of code, Google Refine “protects users from all that nasty command line stuff,” as my smart friend Karthik says.
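To give a flavor of the kind of clean-up described above, here is a rough sketch in plain Python. It does not use Google Refine and makes no claim about how Refine works internally; the function names and the sample data are made up for illustration.

```python
def remove_duplicates(rows):
    """Drop repeated rows, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def rename_in_bulk(rows, column, replacements):
    """Trim whitespace in one column, then rename values using a lookup table."""
    cleaned = []
    for row in rows:
        new_row = list(row)
        value = new_row[column].strip()
        new_row[column] = replacements.get(value, value)
        cleaned.append(new_row)
    return cleaned


if __name__ == "__main__":
    # Hypothetical messy data: inconsistent site names and one repeated row.
    rows = [
        ["Site A", "12.1"],
        ["site a", "11.8"],
        ["Site A", "12.1"],
        ["Site B ", "9.4"],
    ]
    rows = rename_in_bulk(rows, 0, {"site a": "Site A"})
    rows = remove_duplicates(rows)
    for row in rows:
        print(row)
```

Renaming before removing duplicates matters here: rows that differ only in how a name was typed become identical once the names are made consistent.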
The trajectory of the DCXL project is still in flux, but I can say with certainty that Google Refine is a pretty great web-based application we can aspire to learn from in the course of our development. Just yesterday the blog iPhylo had a great post about using Google Refine along with taxonomic databases. This is one of the features we would like to incorporate into the DCXL project, so it’s great to hear that others have been hammering away at the problem of linking controlled vocabularies and data sets.
Want to know a bit more? Here’s Google’s blog entry about Google Refine. FlowingData also posted a blog about Google Refine, which is where I first heard of it. Freebase (which appears to be some iteration of Metaweb Technologies Inc.) has a Twitter feed that mentions Google Refine quite a bit at @fbase.
And in keeping with the organization theme of this post, here are some links to one of my latest artist crushes: Ursus Wehrli. He’s the embodiment of organization, in beautiful art form. One of his photographs is below, but check out his TED Talk, this Visual News post about him, or Google image search him for more amazing visuals.

Help Wanted: Add-in versus Web Application?
I recently updated this site with a page listing the DCXL Requirements. These five requirements are the basic feature set and capabilities we would like to have for the Excel Add-in that is to be developed in the course of the project. The engineering team at Microsoft Research checked out our requirements and had a (rather surprising) suggestion: instead of an add-in, they recommended a web-based application.
Add-ins are little pieces of software that you can download to extend the capabilities of a program – in our case, Microsoft Excel. Synonyms for add-in are plug-in and add-on. They are downloaded, installed, and then appear within a specific program. An add-in for Excel would appear in the Excel “ribbon”, and would add new features to Excel.
A web-based application is something a bit different. It’s a software system designed to support “machine-to-machine interaction over a network”. Web applications require the web (shocking, I know) and do not require that you download a program. Instead, you use an internet connection and the web-based application. Basically, these are web sites that do more than just display information – they do something with the information or files provided by the user, on the user’s behalf. Web sites such as Facebook, YouTube, and SkyDrive are examples of web applications.
So I turn to you, community: what are your thoughts on this? Make your voice heard! You can email me directly, comment on the blog below, or come on down to CDL’s Downtown Oakland office and tell me in person. But please comment quickly – this decision needs to be made soon. You can also vote using the quick poll in the sidebar to the right of this post. We want to know what you think!
To help you formulate intelligent comments, here’s a rough comparison of the two options:
Add-in: The user would download the add-in for use on the current machine. They could then perform the tasks listed below on their current spreadsheet via a new “ribbon” that appears at the top of the Excel window.
Web application: The user would go to the website hosting the web application. They would upload (drag-and-drop) a spreadsheet to the site. They could then perform the tasks listed below on the spreadsheet. The spreadsheet could then be downloaded back onto their PC.
| | Office Add-In | Web-Based Application |
| --- | --- | --- |
| Platform Compatibility | Windows only | Any |
| Spreadsheet compatibility | Different add-in for each Excel version | One application covers multiple versions; potential future expansion to SQL, CSV, XML, Open Office, GoogleDocs etc |
| Download necessary? | Yes | No |
| Software updates | Fixed bugs require download & re-install | No download/re-install necessary |
| Cloud-based? | No | Yes |
| Offline use? | Yes | No; potential future for HTML5 and offline use |
| Languages | C#/.NET, C/C++ | HTML/JavaScript, C#/ASP.NET |
| Has all the functionality of Excel | Yes | No |
And here are the basic capabilities we want, regardless of which of the two options above becomes a reality:
- Must work for Excel users without the add-in
- No additional software (other than add-in and Excel) necessary
- Can be used offline
- Perform CSV compatibility checks, reporting, and automated fixes
- Add metadata to data file
  - Can use existing metadata as a template
  - Add-in can automatically generate some of the metadata where the info is available from the file
- Generate a citation for the data file
- Deposit data and metadata in a repository
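As a rough illustration of what “CSV compatibility checks” and automatically generated metadata might look like, here is a minimal Python sketch. The specific checks (blank or duplicate column names, rows with the wrong number of cells), the metadata fields, and the function names are all my assumptions for illustration; they are not taken from the DCXL requirements document.

```python
import csv
import io


def check_csv_compatibility(text):
    """Return a list of human-readable problems found in CSV text."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]
    problems = []
    header = [name.strip() for name in rows[0]]
    # Check 1 (assumed rule): every column should have a name.
    for index, name in enumerate(header, start=1):
        if not name:
            problems.append(f"column {index} has no name")
    # Check 2 (assumed rule): column names should be unique.
    duplicates = sorted({n for n in header if n and header.count(n) > 1})
    for name in duplicates:
        problems.append(f"column name '{name}' is used more than once")
    # Check 3 (assumed rule): every row should have as many cells as the header.
    for row_number, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"row {row_number} has {len(row)} cells, expected {len(header)}"
            )
    return problems


def basic_metadata(text):
    """Derive the metadata that is available from the file itself."""
    rows = list(csv.reader(io.StringIO(text)))
    header = [name.strip() for name in rows[0]] if rows else []
    return {"columns": header, "row_count": max(len(rows) - 1, 0)}


if __name__ == "__main__":
    # Hypothetical sample with a repeated column name and a short row.
    sample = "site,temp,temp\nA,12.1,3\nB,9.4\n"
    for problem in check_csv_compatibility(sample):
        print(problem)
    print(basic_metadata(sample))
```

Reporting problems as plain sentences, rather than fixing them silently, leaves the scientist in control of what happens to the data; automated fixes could be layered on top once the checks are agreed upon.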
Download the complete requirements as a PDF: DCXL Requirements


