Last week, the White House Office of Science and Technology Policy hosted a “Big Data” R&D event, which was broadcast live on the internet (recording available here, press release available as a pdf). GeekWire did a great piece on the event that provides context. Wondering what “Big Data” means? Keep reading.
Big Data is a phrase used to describe the huge volume of data produced by modern technological infrastructure. Some examples include social media and remote sensing instruments. Facebook, Twitter, and other social media platforms generate huge datasets that can be analyzed to understand trends on the Internet. Satellites and other scientific instruments produce constant streams of data that can be used to assess the state of the environment and understand patterns in the global ecosphere. In general, Big Data is just what it sounds like: a sometimes overwhelming amount of information, flooding scientists, statisticians, economists, and analysts with an ever-increasing pile of fodder for understanding the world.
Big Data is often used alongside the phrase “Data Deluge,” which describes the onslaught of data from multiple sources, all waiting to be collated and analyzed. The phrase conjures images of being overwhelmed by data: check out The Economist’s graphic that represents the concept. From Wikipedia:
…datasets are growing so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analysis, and visualization.
Despite the challenges of Big Data, folks are hungry for big datasets to analyze. Just this week, the 1940 US Census data was released; there was so much interest in downloading and analyzing the data that the servers crashed. You only need to follow the Twitter hashtag #bigdata to see it’s a very hot topic right now. Of course, Big Data should not be viewed as a bad thing. There is no such thing as too much information; it’s simply a matter of finding the best tools for handling all of those data.
Big Data goes hand-in-hand with Big Science, a term coined in 1961 by Alvin Weinberg, then the director of the Oak Ridge National Laboratory. Weinberg used “Big Science” to describe large, complex scientific endeavors in which society makes big investments in science, often via government funding. Examples include the US space program, the Sloan Digital Sky Survey, and the National Ecological Observatory Network. These projects produce mountains of data, sometimes continuously, 24 hours a day, 7 days a week. Therein lies the challenge and the awesomeness of Big Data.
What does all of this mean for small datasets, like those managed and organized in Excel? The individual scientist with a unique, smaller-scale dataset has a big role to play in the era of Big Data. New analytics tools for meta-analysis offer a way for individuals to participate in Big Science, but we have to be willing to make our data standardized, usable, and available. The DCXL add-in will facilitate all three of these goals.
In the past, meta-analysis of small datasets meant digging through old papers, copying data out of tables, or reconstructing data from graphs. Wondering about the gland equivalent of phenols from castoreum? Dig through this paper and reconstruct the data table in Excel. Want to combine that dataset with data on the average amounts of neutral compounds found in a beaver castor sac? That’s another paper to download and more data to reconstruct. By making small datasets publicly available (with links to the datasets embedded in the paper), and by adhering to discipline-wide standards, meta-analysis will become much easier and small datasets can be incorporated into the landscape of Big Science. In essence, the whole is greater than the sum of the parts.
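To make the point concrete, here is a minimal sketch in Python of combining two small datasets for meta-analysis. The sample IDs and measurement values are entirely hypothetical; the sketch only illustrates that merging is trivial when both datasets follow the same identifier convention, and impossible when they don’t.

```python
# Two small datasets, as if reconstructed from two different papers.
# All IDs and values below are hypothetical, for illustration only.
paper_a = {"sample2": 3.8, "sample3": 5.1, "sample1": 4.2}
paper_b = {"sample2": 3.9, "sample3": 4.8, "sample4": 4.5}

# Merging is only possible because both papers used the same
# sample identifiers -- this is what discipline-wide standards buy us.
shared = sorted(paper_a.keys() & paper_b.keys())

# For each shared sample, average the two reported measurements.
combined = {s: (paper_a[s] + paper_b[s]) / 2 for s in shared}

print(shared)    # the samples both papers have in common
print(combined)  # a simple pooled estimate per shared sample
```

With real data the pooling step would be a proper meta-analytic model rather than a plain average, but the bottleneck it removes is the same: no shared standard, no merge.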