Google Refine: An Interesting Take on Data Organization

“A powerful tool for working with messy data” – that’s the tagline for Google Refine, a web-based application for manipulating and cleaning up data sets. Google acquired Freebase Gridworks (originally developed by Metaweb Technologies, Inc.) back in 2010 and re-branded the application as Google Refine.

I certainly don’t claim to be an expert on exactly how Google Refine works, but it has great potential. You download the application, which then runs in your browser. The idea is that you load your spreadsheet from your computer, or pull it from the web, from within Google Refine. You can then manipulate your data: remove duplicates, rename cell entries in bulk, and so on. The underlying code is available, and it appears that developers are encouraged to participate. Alternatively, if you are generally fearful of code, Google Refine “protects users from all that nasty command line stuff,” as my smart friend Karthik says.

The trajectory of the DCXL project is still in flux, but I can say with certainty that Google Refine is a pretty great web-based application that we can aspire to learn from in the course of our development. Just yesterday the blog iPhylo had a great post about using Google Refine with taxonomic databases. This is one of the features we would like to incorporate into the DCXL project, so it’s great to hear that others have been hammering away at the problem of linking controlled vocabularies and data sets.

Want to know a bit more? Here’s Google’s blog entry about Google Refine. FlowingData also posted about Google Refine, which is where I first heard of it. Freebase (which appears to be some iteration of Metaweb Technologies, Inc.) has a Twitter feed, @fbase, that mentions Google Refine quite a bit.

And in keeping with the organization theme of this post, here are some links for one of my latest artist crushes: Ursus Wehrli. He’s the embodiment of organization, in beautiful art form. One of his photographs is below, but check out his TED Talk, this Visual News post about him, or do a Google image search for more amazing visuals.

If you love organizing as much as I do, check out the artist Ursus Wehrli. He tidies up in amazing, artsy ways. From Flickr by Lawrence

Help Wanted: Add-in versus Web Application?

I recently updated this site with a page listing the DCXL Requirements. These five requirements are the basic feature set and capabilities we would like to have for the Excel add-in to be developed in the course of the project. The engineering team at Microsoft Research checked out our requirements and had a (rather surprising) suggestion: instead of an add-in, they recommended a web-based application.

Add-ins are little pieces of software that you can download to extend the capabilities of a program – in our case, Microsoft Excel. Synonyms for add-in are plug-in and add-on. They are downloaded, installed, and then appear within a specific program. An add-in for Excel would appear in the Excel “ribbon” and would add new features to Excel.

A web-based application is something a bit different: a software system designed to support “machine-to-machine interaction over a network.” Web applications require the web (shocking, I know) but do not require that you download a program. Instead, you use an internet connection to reach the application. Basically, these are web sites that do more than just display information – they do something with the information or files provided by the user, on the user’s behalf. Facebook, YouTube, and SkyDrive are examples of web applications.

So I turn to you, community: what are your thoughts on this?  Make your voice heard!  You can email me directly, comment on the blog below, or come on down to CDL‘s Downtown Oakland office and tell me in person.  But please comment quickly – this decision needs to be made soon.  You can also vote using the quick poll in the sidebar to the right of this post.  We want to know what you think!

To help you formulate intelligent comments, here’s a rough comparison of the two options:

Add-in: The user would download the add-in for use on the current machine. They could perform the above tasks via a new “ribbon” that appears at the top of the Excel window.  They would be able to perform the above tasks on their current spreadsheet.

Web application: The user would go to the website hosting the web application. They would upload (drag-and-drop) a spreadsheet to the site.  They could then perform the above tasks to the spreadsheet.  The spreadsheet could then be downloaded back onto their PC.

|  | Office Add-in | Web-Based Application |
| --- | --- | --- |
| Platform compatibility | Windows only | Any |
| Spreadsheet compatibility | Different add-in for each Excel version | One application covers multiple versions; potential future expansion to SQL, CSV, XML, OpenOffice, Google Docs, etc. |
| Download necessary? | Yes | No |
| Software updates | Bug fixes require download & re-install | No download/re-install necessary |
| Cloud-based? | No | Yes |
| Offline use? | Yes | No; potential for offline use via HTML5 in the future |
| Languages | C#/.NET, C/C++ | HTML/JavaScript, C#/ASP.NET |
| Has all the functionality of Excel | Yes | No |

And here are the basic capabilities we want, regardless of which of the two options above becomes a reality:

  1. Must work for Excel users without the add-in
  2. No additional software (other than add-in and Excel) necessary
  3. Can be used offline
  4. Perform CSV compatibility checks, reporting, and automated fixes
  5. Add metadata to the data file
    1. Can use existing metadata as a template
    2. Add-in can automatically generate some of the metadata where the info is available from the file
  6. Generate a citation for the data file
  7. Deposit data and metadata in a repository

Download the complete requirements as a PDF: DCXL Requirements

From Flickr by Thewmatt

Academic Libraries: Under-Used & Under-Appreciated

I’m guilty. I often admit this when I meet librarians at conferences and workshops: in my 13 years of higher ed, spread across seven academic institutions, I never used my librarians as a resource. At the very impressive MBL-WHOI Library in Woods Hole, MA, there are quite a few friendly librarians who make their presence known to visitors. They certainly offered to help me, but it never occurred to me that they might be useful beyond telling me on what floor I could find the journal Limnology and Oceanography.

In hindsight, I didn’t know any better.  Yes, we took the requisite library tour in grad school, and yes, I certainly used the libraries for research and access to books and journals, but no, I never talked to the librarians.  Why is this? I have a few theories:

Librarians are terrible at self-promotion. Every time I meet a librarian, I’m awed and amazed by the vast quantities of knowledge they hold about all kinds of information. But most of the librarians I’ve encountered are unwilling to own up to their vast skill set. These humble folks assume scientists will come to them, completely underestimating the average academic’s stubbornness and propensity for self-sufficiency. In my opinion, librarians should stake out the popular coffee spot on campus and wear sandwich boards saying things like “You have no idea how to do research” or “Five minutes with me can change your <research> life.” Come on, librarians – toot your own horns!

Academics are trained to be self-sufficient. Every grad student has probably gotten “the talk” from their advisor at some point in their grad education. In my case, the talk had phrases like these:

It only takes a couple of brush-offs from your advisor before you realize that part of learning to be a scientist involves solving problems all by yourself. This bodes well for future academic success, but it does not allow us to entertain the idea that librarians might be helpful and save us oodles of time.

Google gives academics a false sense of security. Yes, I spend a lot of time Googling things. Much of this Googling occurs while having a drink with friends – some hotly debated item of trivia comes up, which requires that we pull out our smartphones to find out who’s right (it’s usually me). But Google can’t answer everything. Yes, it’s wonderful for figuring out who that actor in that movie was, or for showing a latecomer the amazing honey badger video. But Google is not necessarily the most efficient way to go about scholarly research. Librarians know this – they have entire schools dedicated to figuring out how to deal with information. The field of information science, which encompasses librarianship, gives out graduate degrees in information. Do you really think you know more about research than someone with a grad degree in information? Extremely unlikely. Learn more about information science here.

Stereotype alert: there’s a lot of knowledge hiding behind librarians’ sensible shoes. From Flickr by Kingston Information & Library Service

This post does, in fact, relate to the DCXL project.  If you weren’t aware, the DCXL project is based out of California Digital Library.  It turns out that librarians are quite good at being stewards of scholarly communication; who better to help us navigate the tricky world of digital data curation than librarians?

This post was inspired by a great blog post yesterday from CogSci Librarian: How Librarians Can Help in Real Life, at #Sci013, and more.

What Scientists Want: Requirements Part 3

I’m in the process of posting the requirements we are submitting for the DCXL add-in, in four parts. “Requirements” are the capabilities we want the proposed add-in to have, based on discussions with scientists and other stakeholders. For more information, read my two previous posts (here and here) and check out the new Requirements page for more details about each of the proposed requirements.

Requirement 4: Generate a Citation for the Data File

I am a big believer in data citation. If you are a researcher, you are well aware of how much time and effort goes into collecting data.  In my experience, I spent much more time collecting, documenting, and cleaning up data than I did writing the resulting publications.  It is time we get credit for that hard work; the best way to make sure that happens is to cite others’ data, and provide others with the tools to cite your data.  We want the add-in to make it as easy as possible for scientists to promote data citation of their work.  The add-in will provide the capability of generating a citation from the data set’s existing metadata, potentially even including a persistent identifier such as a DOI.  Read more about my take on data citation in this blog post, or at the DataCite website.
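To make this concrete, here’s a minimal sketch (in Python) of how a citation string might be assembled from a data set’s existing metadata. The field names and citation format below are my own illustrative assumptions, loosely modeled on the DataCite style; this is not the add-in’s actual design.

```python
# Illustrative sketch: build a data citation from existing metadata.
# Field names and format are assumptions loosely modeled on the
# DataCite style, not the DCXL add-in's actual design.

def format_citation(metadata):
    """Return a string like:
    Author (Year): Title. Publisher. https://doi.org/10.xxxx/xxxx
    """
    authors = "; ".join(metadata["authors"])
    citation = (f"{authors} ({metadata['year']}): "
                f"{metadata['title']}. {metadata['publisher']}.")
    if metadata.get("doi"):  # append the persistent identifier, if one exists
        citation += f" https://doi.org/{metadata['doi']}"
    return citation

example = {
    "authors": ["Strasser, C."],
    "year": 2012,
    "title": "Copepod counts, Nanaimo estuary",
    "publisher": "Example Data Repository",  # hypothetical repository
    "doi": "10.xxxx/example",                # hypothetical DOI
}
print(format_citation(example))
```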

Stay tuned – the next DCXL post will be the final installment of the requirements saga.

stuck boot in field
Data collection is hard! Poor Alexis managed to lose a boot and a sock to the Wellfleet mudflats during clam collection. CC-BY 3.0

What Scientists Want: Requirements Part 2

I’m in the process of posting the requirements we are submitting for the DCXL add-in. These are the capabilities we want the proposed add-in to have, based on discussions with scientists and other stakeholders. For more information, read my previous post and check out the new Requirements page for more details about each of the proposed requirements.

Requirement 3: Generate metadata that is linked to the data file

Scientists want help with metadata. The world of metadata standards, schemas, and programs is overwhelming. There are already journals, techniques, statistical programs, and proposal calls to keep track of; metadata doesn’t make the list of things that researchers frequently worry about. We want the add-in we develop to make metadata generation easier. Scientists are already using Excel; wouldn’t it be great if creating metadata for your spreadsheet were as easy as opening a new tab and filling out a form? We think so.

metadata plate
Hopefully you aren’t THIS nerdy about metadata. From Flickr by Shira Golding

This capability comes with all kinds of challenges: How do we handle the many different metadata standards? Can repositories request a particular standard via the Excel add-in? Is there a way to automatically populate some of the metadata fields based on information stored on the computer? What is the relationship between the data file and the metadata? These are all tricky questions, but we are optimistic that this requirement will result in the most useful capabilities for Excel.
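As a thought experiment, here’s a rough Python sketch of the “metadata form” idea: collect a few fields from the user, auto-populate what the file itself can tell us, and save the result as a sidecar file next to the spreadsheet. The field names and the sidecar approach are assumptions for illustration; a real tool would follow a community metadata standard (EML, Dublin Core, etc.).

```python
# Rough sketch of "metadata as a sidecar file". Field names are
# assumptions for illustration; a real tool would follow a community
# metadata standard such as EML or Dublin Core.
import datetime
import json
import os

def build_metadata(data_path, title, creator, description=""):
    stat = os.stat(data_path)  # auto-populate what the file itself knows
    return {
        "title": title,
        "creator": creator,
        "description": description,
        "file_name": os.path.basename(data_path),
        "file_size_bytes": stat.st_size,
        "last_modified": datetime.date.fromtimestamp(stat.st_mtime).isoformat(),
    }

def write_sidecar(data_path, metadata):
    sidecar = data_path + ".metadata.json"  # lives next to the spreadsheet
    with open(sidecar, "w") as f:
        json.dump(metadata, f, indent=2)
    return sidecar

# Hypothetical usage:
# meta = build_metadata("counts.xlsx", "Copepod counts", "C. Strasser")
# write_sidecar("counts.xlsx", meta)
```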


What Scientists Want: Requirements Part 1

Just before the holidays, I managed to finish the first version of the DCXL requirements. These requirements are the set of Excel capabilities that scientists would like in order to help them manage, share, and archive their data. I created these requirements based on many conversations with scientists, librarians, and information specialists, and they reflect the communities that helped generate them.

I gave you a sneak peek into these requirements back in December, but over the next few blog posts I will be giving you more details about each of the specific capabilities we would like to see from Excel. This is the first of those posts. Check out the new Requirements page for more details about the requirements below.

Requirement 1: Ensure compatibility for Excel users without the add-in

We want the information that this add-in helps generate to be accessible regardless of the version of Excel you are using, and whether you are a Mac or Windows person.  This is Requirement #1 since we think it is very important that scientists be able to access and share the metadata and data they create/modify with the add-in.

Requirement 2: Check the data file for CSV compatibility

Many archives prefer that the data you submit be in a non-proprietary file format that is readable by many different programs. Comma-separated values (CSV) is one of the most common non-proprietary formats for tabular data. We want the add-in to let you know about problems in your data file that might prevent generation of a CSV file, and therefore data archiving.

We envision that this compatibility check will result in a report that the Excel user can consult for tips on how to improve the archivability of their data. The report will include best practices for data management, as well as specific information about why a particular element might be problematic and how best to modify the data file to improve its compatibility. For example, embedded figures in a spreadsheet might cause problems for export; the compatibility check would notify the user of the figure’s presence and suggest it be moved to a new tab or a different file.
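For the curious, here’s a minimal sketch of what such a compatibility check might look like, written in Python with the openpyxl library. This is just one possible approach under my own assumptions, not the DCXL implementation; it flags a few spreadsheet features that don’t survive export to CSV.

```python
# Minimal sketch of a CSV-compatibility report using openpyxl
# (an assumption; the real add-in's implementation is undecided).
from openpyxl import load_workbook

def csv_compatibility_report(path):
    wb = load_workbook(path, data_only=False)
    problems = []
    # CSV holds exactly one table per file, so extra sheets are a problem.
    if len(wb.sheetnames) > 1:
        problems.append(f"Multiple sheets {wb.sheetnames}: CSV holds one table per file.")
    for ws in wb.worksheets:
        # Merged cells lose their structure when flattened to CSV.
        for rng in ws.merged_cells.ranges:
            problems.append(f"Merged cells at {rng} in '{ws.title}': unmerge before export.")
        # Formulas export as text or stale values rather than live calculations.
        for row in ws.iter_rows():
            for cell in row:
                if cell.data_type == "f":
                    problems.append(f"Formula in {ws.title}!{cell.coordinate}: "
                                    "consider storing the computed value.")
    return problems or ["No obvious CSV compatibility problems found."]

# Hypothetical usage:
# for line in csv_compatibility_report("mydata.xlsx"):
#     print(line)
```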

The NeverEnding Task: Organizing Files

When I speak with scientists about data management, I often ask how they organize their work files on their computers. Asking someone about this is a deeply personal question – often people are highly defensive of their system, while simultaneously being frustrated with its structure.

Organizing files on your computer might sometimes feel like the NeverEnding task. You can spend two hours on Monday re-working the structure of your file system, only to find on Thursday that you are disappointed with the outcome and start over again. Or perhaps you are quite happy with the structure, but a new project starts up (perhaps related to work in existing folders) and it seems illogical to keep the current organization scheme. Here are a few thoughts (and tips) that might help:

  1. Plan for the types of files you will be generating – spend 30 minutes brainstorming the anticipated files you will generate in the course of the project, then determine the most logical links between those files. Document the system with a flow chart or text file, post it in your workspace, and stick with it until you have a legitimate reason to change things.
  2. Use the same file structure consistently: on all of your computers, in your Dropbox, and in your Google Docs. Also use similar naming strategies for different types of files, like scripts, data sets, metadata files, and images. Example: Species_Site_Date_FileType.FileExtension might be the base structure (see the sketch after this list), and files might be
      • Eaffinis_nanaimo_20100901_FieldCounts.xls
      • Eaffinis_nanaimo_20100901_ANOVAcode.R
      • Eaffinis_nanaimo_20100901_adult232.tiff
  3. Consider including dates in your file names (I use YYYYMMDD, so I can always sort by most recent).
  4. There are tools available for renaming and re-organizing files in bulk, like Bulk Rename Utility, Renamer, and File Buddy.
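As promised in tip 2, here’s a small Python sketch that checks file names against the Species_Site_Date_FileType.FileExtension convention. The exact regular expression is my own assumption based on the example names above; adapt it to your own scheme.

```python
# Check file names against the Species_Site_Date_FileType.ext convention
# described above. The pattern is an assumption based on the examples given.
import re

NAME_PATTERN = re.compile(
    r"^(?P<species>[A-Za-z]+)_"
    r"(?P<site>[a-z]+)_"
    r"(?P<date>\d{8})_"           # YYYYMMDD, so names sort chronologically
    r"(?P<filetype>\w+)\.\w+$"
)

def check_name(filename):
    """Return the parsed name parts, or None if the name doesn't conform."""
    m = NAME_PATTERN.match(filename)
    return m.groupdict() if m else None

for name in ["Eaffinis_nanaimo_20100901_FieldCounts.xls",
             "notes final v2.doc"]:
    print(name, "->", check_name(name))
```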

There is no one-size-fits-all solution for organizing your files. The system that works best for you depends on the types of files you generate, the frequency with which you need access to those files, the interrelationships between your files, and so on.

neverending story
If only Falkor from The NeverEnding Story (1984) could help with file organization… From www.weirdworm.com

NSF now allows data in biosketch accomplishments

Hip hip hooray for data! Contributed to Calisphere by Sourisseau Academy for State and Local History (click for more information)

Back in October, the National Science Foundation announced changes to its Grant Proposal Guidelines (Full GPG for January 2013 here).  I blogged about this back when the announcement was made, but now that the changes are official, I figure it warrants another mention.

As of January 2013, you can now list products in your biographical sketches, not just publications. This is big (and very good) news for data advocates like myself.

The change is that the biosketch for senior personnel should contain a list of up to five products closely related to the project and up to five other significant products that may or may not be related to the project. But what counts as a product? “products are…including but not limited to publications, data sets, software, patents, and copyrights.”

To make it count, however, it needs to be both citable and accessible. How to do this?

  1. Archive your data in a repository (find help picking a repo here)
  2. Obtain a unique, persistent identifier for your dataset (e.g., a DOI or ARK)
  3. Start citing your product!

For the librarians, data nerds, and information specialists in the group, the UC3 has put together a flyer you can use to promote listing data as a product. It’s available as a PDF (click on the image to the right to download). For the original PPT that you can customize for your institution and/or repository, send me an email.

NSF_products_flyer

Direct from the digital mouths of NSF:

Summary of changes: http://www.nsf.gov/pubs/policydocs/pappguide/nsf13001/gpg_sigchanges.jsp

Chapter II.C.2.f(i)(c), Biographical Sketch(es), has been revised to rename the “Publications” section to “Products” and amend terminology and instructions accordingly. This change makes clear that products may include, but are not limited to, publications, data sets, software, patents, and copyrights.

New wording: http://www.nsf.gov/pubs/policydocs/pappguide/nsf13001/gpg_2.jsp

(c) Products

A list of: (i) up to five products most closely related to the proposed project; and (ii) up to five other significant products, whether or not related to the proposed project. Acceptable products must be citable and accessible including but not limited to publications, data sets, software, patents, and copyrights. Unacceptable products are unpublished documents not yet submitted for publication, invited lectures, and additional lists of products. Only the list of 10 will be used in the review of the proposal.

Each product must include full citation information including (where applicable and practicable) names of all authors, date of publication or release, title, title of enclosing work such as journal or book, volume, issue, pages, website and Uniform Resource Locator (URL) or other Persistent Identifier.

Ontologies and Data

“Ontology” is one of those words I hear people toss about in conversations about computing, programming, and development. I usually nod and smile, pretending I know exactly what the word means and how it relates to scientific data. It took some vigorous Google searching and a great discussion with M. Schildhauer of NCEAS before I could say, with confidence, that I kind-of understand the concept of ontologies.

In case you are in the same situation I was a few months ago, allow me to enlighten you. First, let’s start with the pre-computer-era definition: ontology is the study of the nature of existence, the categories of being, and the relationships between those categories. Still not clear? Let’s let Wikipedia explain what the study of ontology entails:

Questions concerning what entities exist or can be said to exist, and how such entities can be grouped, related within a hierarchy, and subdivided according to similarities and differences.

I haven’t thought about the nature of existence since university-level philosophy courses, so this explanation makes my brain ache mildly. Remarkably, the computer science definition of ontology is slightly more tangible (and also sheds light on the descriptions above). In this field, an ontology is a set of concepts that represents the knowledge of a particular field of study (i.e., a domain), along with the relationships between those concepts. Here are examples of some important consequences of a field having an ontology:

And Wikipedia provides an example that may help clarify things:

Particular meanings of terms applied to that domain are provided by domain ontology. For example the word card has many different meanings. An ontology about the domain of poker would model the “playing card” meaning of the word, while an ontology about the domain of computer hardware would model the “punched card” and “video card” meanings.

An important point to make is how vital ontologies are in this era of international collaboration, data deluge, and digital data. Take the field of genetics. What if every geneticist decided on their own way to describe genes, proteins, and sequences? Furthermore, what if they used words other than “genes,” “proteins,” and “sequences” to describe these things? It would be incredibly difficult for the field to progress, since no one would be quite sure what anyone else was talking about in their research. The Gene Ontology was established within the community to prevent this scenario from taking place.

There is much more to ontologies than standard vocabularies, but a standard vocabulary is certainly the easiest ontology concept to grasp. In terms of the DCXL add-in, ontologies could be used to structure how Excel spreadsheets are formatted and coded, to facilitate universal discoverability and usability. It’s not likely that the first version of the add-in will be able to accommodate a wide range of ontologies (i.e., domain-specific vocabularies), but we hope that future versions might find ways to direct users to the standards used in their field of interest.
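To give a feel for the simplest case, a controlled vocabulary, here’s a small Python sketch that checks spreadsheet column headers against a list of approved terms. The vocabulary below is entirely hypothetical; a real tool would pull its terms from a community resource like the Gene Ontology.

```python
# Illustration only: checking column headers against a controlled
# vocabulary. The vocabulary here is hypothetical; a real tool would
# draw terms from a community ontology (e.g., the Gene Ontology).
VOCABULARY = {
    "species": "Scientific name, Genus species",
    "site": "Sampling location name",
    "date": "Sampling date, ISO 8601",
    "abundance": "Count of individuals per sample",
}

def check_headers(headers):
    """Report any headers that aren't in the shared vocabulary."""
    unknown = [h for h in headers if h.lower() not in VOCABULARY]
    if unknown:
        print("Headers not in the controlled vocabulary:", unknown)
    else:
        print("All headers match the controlled vocabulary.")

check_headers(["Species", "Site", "Date", "Abundance", "WaterTemp"])
# -> Headers not in the controlled vocabulary: ['WaterTemp']
```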

science ontology
A map of science from Ontology Explorer: www.science-metrix.com. Ontologies can be thought of as maps describing relationships.

NSF Panel Review of Data Management Plans

With the clarity of the New Year, I realized I broke a promise to you, DCXL readers: in my post on data policies, I stated that my next post would be about the current state of data management plan evaluation on NSF panels. Although it is a bit late, here’s that post.

My information comes from a couple of different sources: a program officer or two at NSF, a few scientists who have served on panels for several different directorates, and some miscellaneous experts in data management plans. In general, they all said about the same thing: we are in the early days of data management plans as an NSF requirement, and the process is still evolving. With that in mind, here are a few more specific pieces of information I gathered (note: these should be taken with a grain of salt, since this is not the official position of NSF):

zach morris cell phone
Just like Zach Morris’ cell phone, data management plans are sure to evolve into something much fancier in a few years. From zackmorriscellphone.wordpress.com
  1. The NSF program officer who leads the panel sets the tone for DMP evaluation. Scientists who serve on the proposal review panels generally are not experts in data management or archiving, and therefore are unsure what to look for in DMPs.
  2. The contents of a data management plan will not tank a proposal unless the plan is completely absent. Since no one is quite sure what should be in these DMPs, it’s tough to eliminate a good proposal on the basis of its DMP. Overall, DMPs are not currently a part of the merit review process. One person said it very succinctly:

    PIs received a slap on the wrist if they had a good proposal with a bad DMP. If it was a bad proposal, the bad DMP was just another nail in the coffin.

  3. The panelists are merely trying to determine whether a DMP is “adequate.” What does this mean? It generally boils down to two criteria: (1) Is the DMP present? and (2) Does the PI discuss how they will archive the data? Even (2) is up for debate, since proposals have made it to the top despite no clear plans for archiving, e.g., no mention of where the data will be archived.
  4. Finally, there is buzz about some knowledgeable PIs using DMPs as a strategic tool.  Rather than considering this two-page requirement a burden, they use the DMP as part of their proposal’s narrative.  Food for thought.