“A powerful tool for working with messy data.” This is the tagline for Google Refine, a web-based application that can be used to manipulate and clean up data sets. A bit of history: in 2010 Google acquired Metaweb Technologies, Inc., the original developer of Freebase Gridworks, and re-branded the application as Google Refine.
I certainly don’t claim to be an expert on exactly how Google Refine works, but it has great potential. You download the application, which then runs through your browser. From within Google Refine you either upload your spreadsheet or pull it in from the web. You can then manipulate your data: remove duplicates, rename cell entries in bulk, and so on. The underlying code is available, and it appears that developers are encouraged to participate. Alternatively, if you are generally fearful of code, Google Refine “protects users from all that nasty command line stuff,” as my smart friend Karthik says.
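For the curious who are not fearful of code, here is a rough idea of what that kind of cleanup looks like when done by hand. This is only a minimal sketch in plain Python, not how Google Refine itself works: the sample data, the `RENAMES` mapping, and the `clean_rows` function are all made up for illustration.

```python
import csv
import io

# Hypothetical sample data, invented for this example.
RAW = """species,site,count
Ursus arctos,Site A,3
ursus arctos ,Site A,3
Ursus arctos,site a,3
Canis lupus,Site B,5
"""

# Hypothetical mapping used to rename cell entries in bulk.
RENAMES = {"site a": "Site A", "site b": "Site B"}


def clean_rows(text):
    """Trim whitespace, normalize capitalization, apply renames, drop duplicates."""
    reader = csv.DictReader(io.StringIO(text))
    seen = set()
    cleaned = []
    for row in reader:
        # Strip stray spaces and capitalize only the first letter of the name.
        species = row["species"].strip().capitalize()
        # Look up the site in the rename table, ignoring case.
        site = row["site"].strip()
        site = RENAMES.get(site.lower(), site)
        count = row["count"].strip()
        # Keep a row only the first time its cleaned values appear.
        key = (species, site, count)
        if key not in seen:
            seen.add(key)
            cleaned.append({"species": species, "site": site, "count": count})
    return cleaned


if __name__ == "__main__":
    for row in clean_rows(RAW):
        print(row)
```

The first three rows of the sample differ only in spacing and capitalization, so they collapse into a single row once cleaned. Google Refine lets you do this sort of thing by pointing and clicking instead.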
The trajectory of the DCXL project is still in flux, but I can say with certainty that Google Refine is a pretty great web-based application that we can learn from in the course of our development. Just yesterday the blog iPhylo had a great post about using Google Refine along with taxonomic databases. This is one of the features we would like to incorporate into the DCXL project, so it’s great to hear that others have been hammering away at the problem of linking controlled vocabularies and data sets.
Want to know a bit more? Here’s Google’s blog entry about Google Refine. FlowingData also wrote a blog post about Google Refine, which is where I first heard of it. Freebase (which appears to be some iteration of Metaweb Technologies, Inc.) has a Twitter feed, @fbase, that mentions Google Refine quite a bit.
And in keeping with the organization theme of this post, here are some links to one of my latest artist crushes: Ursus Wehrli. He’s the embodiment of organization, in beautiful art form. One of his photographs is below, but check out his TED Talk, this Visual News post about him, or do a Google image search for more amazing visuals.