
Survey says…

A few weeks ago we reached out to the scientific community for help on the direction of the DCXL project.  The major issue at hand was whether we should develop a web-based application or an add-in for Microsoft Excel.  Last week, I reported that rather than choose, we decided to develop both.  This might seem like a risky proposition: the DCXL project has a one-year timeline, meaning this all needs to be developed before August (!).  As someone in a DCXL meeting recently put it, aren't we settling for "twice the product and half the features"?  We discussed which features might need to be dropped from our list of desirables based on the change in trajectory; however, we are confident that both of the DCXL products we develop will be feature-rich and meet the needs of the target scientific community.  Of course, this is made easier by the fact that the features in the two products will be nearly identical.

Family Feud screen shot
What would Richard Dawson want? Add-in or web app? From Wikipedia. Source: J Graham (1988). Come on Down!!!: the TV Game Show Book. Abbeville Press

How did we arrive at developing an add-in and a web app? By talking to scientists. Based on the feedback we collected, it became obvious that aspects of both products appeal to our user communities.  Here's a summary of what we heard:

Show of hands: I ran a workshop on Data Management for Scientists at the Ocean Sciences 2012 Meeting in February.  At the close of the workshop, I described the DCXL project and went over the pros and cons of the add-in option and the web app option.  By show of hands, folks in the audience voted about 80% for the web app (n ≈ 150).

Conversations: Here's a sampling of the things folks told me about the two options:

Survey: I created a very brief survey using the website SurveyMonkey. I then sent the link to the survey out via social media and listservs.  Within about a week, I received over 200 responses.

Education level of respondents:

Survey questions & answers:

 

So with those results, there was a resounding "both!" emanating from the scientific community.  First, we will develop the add-in, since it best fits the needs of our target users (those who use Excel heavily and need assistance with good data management skills).  We will then develop the web application, with the hope that the community at large will adopt and improve on the web app over time.  The internet is a great place for building a community with shared needs and goals; we can only hope that DCXL will be adopted as wholeheartedly as other internet resources offering help and information.

Data Publishing–the First 500 Years

Data publishing is the new hot topic in a growing number of academic communities.  The scholarly ecosphere is filled with listserv threads, colloquia, and conference hallway chats punctuated with questions of why to do it, how to do it, where to do it, when to do it, and even what to call this seemingly new breed of scholarly output.  Scholars, and those who provide the tools and infrastructure to support them, are consumed with questions that don't seem to have easy answers, and certainly not answers that span all disciplines.  How can researchers gain credit for the data they produce, separate from and in addition to the analysis of those data as articulated in formal publications?  How can scholars researching an area find relevant datasets within and across their own disciplines?  How are data and methodologies most effectively reviewed, validated, and corrected?  How are meaningful connections maintained between different versions, iterations, and "re-uses" of a given set of data?

The high pitch of the debate on these topics is surprising in some ways, given that datasets in certain fields have been readily available for a while. The Inter-University Consortium for Political and Social Research (ICPSR) has been allowing social scientists to publish or find datasets since the early 1960s.  Great datasets gathered and published under institutional auspices include UN Data, the UNESCO Statistical Yearbook, and the IMF depository library program.  Closer to home is the United States' Federal Depository Library Program, which since its establishment in 1841 has served as a distribution mechanism to ensure public access to governmental documents and data.

While these outlets are only viable solutions for some disciplines, their presence started me down a path exploring the history of data publishing in an effort to try to gain some perspective on the challenges we are facing today. Somewhat surprisingly, data publishing, conducted in a manner that would be recognized by today’s scholars, has been occurring for almost half a millennium. Yes, that’s right; we are now 500 years into producing, analyzing and publishing data.

These early activities centered on demographic data, presumably in an effort to identify and understand the dramatic patterns of life and death. Starting in the late 1500s, and prompted by the Plague, "Bills of Mortality" recording deaths within London began to be published, soon on a weekly basis. That raw data generation caught the attention of John Graunt, a community-minded draper and an extremely bright, though non-university-affiliated, London resident. Graunt was inspired to gather those numerical lists, turn them into a dataset, analyze those data (looking for causes of death, ages at death, comparisons of rates between London and elsewhere, etc.), and publish both the dataset and his findings regarding population patterns in a groundbreaking 1662 work, "Natural and Political Observations Mentioned in a Following Index, and Made upon the Bills of Mortality." The work was submitted to the Royal Society, which recognized its merit and inducted the author into its fellowship. Graunt continued to extend his data and analysis, publishing new versions of each in subsequent years. Thus was born the first great work (at least in the Western world) of statistical analysis, or "political arithmetic" as it came to be called at the time.

Moving from the 16th and 17th centuries to the 18th brings us to another major point in data publishing history with Johann Sussmilch of Germany. Sussmilch was originally a cleric involved in a variety of intellectual pursuits, though unaffiliated with a university, at least initially. His interests included theology, statistics, and linguistics, and he was eventually appointed to the Royal Academy of Sciences and Fine Arts for his linguistic scholarship. Sussmilch's great work was the "Divine Order," an ambitious effort to collect detailed data about the population of Prussia in order to prove his religious theory of "Rational Theology." In other words, Sussmilch was engaged in a basic research program: he had a theory, formed a research question, collected the data required to test that theory, analyzed his data, and then published his results along with his data.

The rigorous quality of Sussmilch’s work (both the data and the analysis) elevated it far beyond his original and personal religious motivations, leading it to have a wide impact throughout parts of Europe. It became a focal point of exchange between scholars across countries and prompted debate over his data collection methodology and interpretation. Put another way, Sussmilch’s work inspired his colleagues to engage in the modern model of “scholarly communication” – engaging in a spirited critical dialogue which in turn resulted in changes to the next edition of the work (for instance, separate tables for immigration and racial data). Published first in 1741, it was updated and reprinted six times through 1798.

In this earlier time, as in our own, the drive to engage with other intellectuals was paramount. Publishing, sharing, critiquing, and modifying data production efforts and analysis was seemingly as much a driving force among this community as it is among the scholars of today. Researchers of the 17th and 18th centuries dealt with issues of attribution, review of data veracity and analytical methodology, and even versioning. The surprising discovery of apparent similarities across such a large gulf of time prompts many questions. If data could be published and shared centuries ago, why are we faced with such tremendous challenges in doing the same today? Are we overlooking approaches from the past that could help us today? Or are we glossing over the difficulties of the past?

More research would have to be done to answer these questions thoroughly, but perhaps a gesture can be made in that direction by identifying some of the contrasts between yesterday and today. Taking the examples above as a jumping-off point, perhaps the most striking difference between past activities and the goals articulated in today's conversations about data publishing is that the data publication efforts of the past were accompanied by an equally important piece of analysis, and the research community was interested in both. The conclusions drawn from the data were held to scrutiny, as were the data and the data collection methods that provided their foundation. All of the components of the research were of concern. These scholars were not interested in publishing the data on their own; rather, they wanted to present the data along with their arguments, with each underscoring the other.

Another difference is the changing relationships between individual researchers and the entities that support them. Not only do we have governments and academic institutions, but we have a new contemporary player, the corporation, which is driven by a substantially different motivation from entities of past ages. In addition, a broader range of disciplines is now concerned with data publication, and perhaps those disciplines face stumbling blocks not at issue for the social scientists working with demographic and public health data. Given the known heterogeneity of scholarly communication practices across different fields, there seems to be no reason to think that data publishing needs, expectations and concerns would not also vary. And of course, the most obvious difference between then and now is with tools and technology. Have those advancements altered fundamental data publishing practices and if so, how?

These are interesting but complex questions to pursue. Fortunately, what the above examples of our data publishing antecedents have hopefully revealed is that there are meaningful touchstones to use as reference points as we attempt to address them. Data publishing has a rich, resonant past stretching back hundreds of years, providing us with an opportunity to reach into that past to better understand the trajectory that has brought us to this moment, thereby helping us more effectively grapple with the questions that seem to confound us today.

The Science of the DeepSea Challenge

Recently the film director and National Geographic explorer-in-residence James Cameron descended to the deepest spot on Earth: the Challenger Deep in the Mariana Trench.  He partnered with many sponsors, including National Geographic and Rolex, to make this amazing trip happen.  A lot of folks outside the scientific community might not realize this, but until this week there had been only one successful descent to the trench by a human-occupied vehicle (that's a submarine for you non-oceanographers).  You can read more about that 1960 exploration here and here.

I could go on about how astounding it is that we know more about the moon than about the bottom of the ocean, or discuss the seemingly intolerable physical conditions found at those depths, most prominently the extremely high pressure.  However, what I immediately thought when reading the first few articles about this expedition was: where are the scientists?

Before Cameron, Swiss oceanographer Jacques Piccard and US Navy officer Don Walsh descended to the virgin waters of the deep in the bathyscaphe Trieste. From www.history.navy.mil/photos/sh-usn/usnsh-t/trste-b

After combing through many news stories, several National Geographic sites including the site for the expedition, and a few press releases, I discovered (to my relief) that there are plenty of scientists involved.  The team that’s working with Cameron includes scientists from Scripps Institution of Oceanography (the primary scientific partner and long-time collaborator with Cameron),  Jet Propulsion Lab, University of Hawaii, and University of Guam.

While I firmly believe that the success of this expedition will be a HUGE accomplishment for science in the United States, I wonder if we are sending the wrong message to aspiring scientists and youngsters in general.  We are celebrating the celebrity film director involved in the project instead of the huge team of well-educated, interesting, and devoted scientists who are also responsible for this spectacular feat (I found fewer than five names of scientists in my internet hunt).  Certainly Cameron deserves the bulk of the credit for enabling this descent, but I would like there to be a bit more emphasis on the scientists as well.

Better yet, how about emphasis on the science in general?  It's too early for them to release any footage from the journey down; however, I'm interested in how the samples will be (or were) collected, how they will be stored, what analyses will be done, whether there are experiments planned, and how the resulting scientific advances will be made just as public as Cameron's trip was.  The expedition site has plenty of information about the biology and geology of the trench, but it's just background: there appears to be nothing about scientific methods or plans to ensure that this project will yield the maximum scientific advancement.

How does all of this relate to data and DCXL? I suppose this post falls into the category of "data is important."  The general public and many scientists hear the word "data" and glaze over.  Data isn't inherently interesting as a concept (except to a sick few of us).  It needs just as much bolstering from big names and fancy websites as the deep sea does.  After all, isn't data exactly what this entire trip is about?  Collecting data on the most remote corners of our planet? Making sure we document what we find so others can learn from it?

Here’s a roundup of some great reads about the Challenger expedition:

Hooray for Progress!

Great news on the DCXL front! We are moving forward with the Excel add-in and will have something to share with the community this summer.  If you missed it, back in January the DCXL project had an existential crisis: add-in or web-based application? I posted on the subject here and here. We spent a lot of time talking to the community and collating feedback, weighing the pros and cons of each option, and carefully considering how best to proceed with the DCXL project.

And the conclusion we came to… let’s develop both!

Comparing web-based applications and add-ins (aka plug-ins) is really an apples-and-oranges comparison.  How could we discount the fact that a web-based application is yet another piece of software for scientists to learn? Or that an add-in is only useful for Excel users running a Windows operating system? Instead, we have chosen to first create an add-in (this was the original intent of the project), then move that functionality to a web-based application that will have more flexibility for the longer term.

Albert-Camus
What do Camus, The Cure, and DCXL have in common? Existentialists at heart. From www.openlettersmonthly.com

The capabilities of the add-in and the web-based application will be similar: we are still aiming to create metadata, check the data file for .csv compatibility, generate a citation, and upload the data set to a data repository.  For a full read of the requirements (updated last week), check out the Requirements page on this site. The implementation of these requirements might be slightly different, but the goals of the DCXL project will be met in both cases: we will facilitate good data management, data archiving, and data sharing.
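To make the .csv compatibility idea a bit more concrete, here is a minimal sketch of the kind of check such a tool might run. It is written in Python with a hypothetical function and file name purely for illustration; it is not the actual DCXL implementation, which will live inside Excel and the web app.

```python
import csv

def check_csv_compatibility(path):
    """Sketch of a .csv compatibility check: the file must parse as CSV,
    have a non-empty header row, and use the same column count throughout."""
    problems = []
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    if not rows:
        return ["file is empty"]
    header = rows[0]
    if any(not cell.strip() for cell in header):
        problems.append("header row contains empty cells")
    width = len(header)
    for i, row in enumerate(rows[1:], start=2):
        if len(row) != width:
            problems.append(f"row {i} has {len(row)} columns, expected {width}")
    return problems

# Example usage with a hypothetical file exported from Excel via "Save As .csv":
issues = check_csv_compatibility("field_samples.csv")
print("Looks CSV-compatible" if not issues else issues)
```

A real tool would also need to flag Excel-specific features that do not survive export to .csv, such as merged cells, multiple worksheets, and embedded charts.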

It’s true that the DCXL project is running a bit behind schedule, but we believe that it will be possible to create the two prototypes before the end of the summer.  Check back here for updates on our progress.

The Digital Dark Age, Part 2

Earlier this week I blogged about the concept of a Digital Dark Age.  This is a phrase some folks are using to describe a future scenario in which we are not able to read historical digital documents and multimedia because they have been rendered obsolete or were otherwise poorly archived.  But what does this mean for scientific data?

Consider that Charles Darwin’s notebooks were recently scanned and made available online.  This was possible because they were properly stored and archived, in a long-lasting format (in this case, on paper).  Imagine if he had taken pictures of his finch beaks with a camera and saved the digital images in obsolete formats.  Or ponder a scenario where he had used proprietary software to create his famous Tree of Life sketch.  Would we be able to unlock those digital formats today?  Probably not.  We might have lost those important pieces of scientific history forever.   Although it seems like software programs such as Microsoft Excel and MATLAB will be around forever, people probably said similar things about the programs Lotus 1-2-3 and iWeb.

darwin by diana sudyka
“Darwin with Finches” by Diana Sudyka, from Flickr by Karen E James

It is a common misconception that things that are posted on the internet will be around “forever”.  While that might be true of embarrassing celebrity photos, it is much less likely to be true for things like scientific data.  This is especially the case if data are kept on a personal/lab website or archived as supplemental material, rather than being archived in a public repository (See Santos, Blake and States 2005 for more information).  Consider the fact that 10% of data published as supplemental material in the six top-cited journals was not available a mere five years later (Evangelou, Trikalinos, and Ioannidis, 2005).

Natalie Ceeney, chief executive of the National Archives, summed it up best in this quote from The Guardian’s 2007 piece on preventing a Digital Dark Age: “Digital information is inherently far more ephemeral than paper.”

My next post and final DDA installment will provide tips on how to avoid losing your data to the dark side.

The Digital Dark Age, Part 1

"This will be known as the Digital Dark Age."  The first time I heard this statement was at the Internet Archive, during the PDA 2012 Meeting (read my blog post about it here).  What did this mean?  What is a Digital Dark Age? Read on.

While serving in Vietnam, my father wrote letters to my grandparents about his life fighting a war in a foreign country.  One of his letters was sent to arrive in time for my grandfather’s birthday, and it contained a lovely poem that articulated my father’s warm feelings about his childhood, his parents, and his upbringing.  My grandparents kept the poem framed in a prominent spot in their home.  When I visited them as a child, I would read the poem written in my young dad’s  handwriting, stare at the yellowed paper, and think about how far that poem had to travel to relay its greetings to my grandparents.  It was special– for its history, the people involved, and the fact that these people were intimately connected to me.

Now fast forward to 2012.  Imagine modern-day soldiers all over the world, emailing, making satellite phone calls, and chatting with their families via video conferencing.  When compared to snail mail, these modern communication methods are likely a much preferred way of staying in touch for those families.  But how likely is it that future grandchildren will be able to listen to those conversations, read those emails, or watch those video calls?  The answer: it is extremely unlikely.

These two scenarios sum up the concept of a Digital Dark Age: compared to 40 years ago, we are doing a terrible job of ensuring that future generations will be able to read our letters, look at our pictures, or use our scientific data.

mix tapes
You mean future generations won’t be able to listen to my mix tapes?! From Flickr by newrambler

The Digital Dark Age “refers to a possible future situation where it will be difficult or impossible to read historical digital documents and multimedia, because they have been stored in an obsolete and obscure digital format.”  The phrase “Dark Age” is a reference to The Dark Ages, a period in history around the beginning of the Middle Ages characterized by a scarcity of historical and other written records at least for some areas of Europe, rendering it obscure to historians.  Sounds scary, no?

How can we remedy this situation? What are people doing about it? Most importantly, what does this mean for scientific advancement? Check out my next post to find out.

Fun Uses for Excel

Friday movie
“Excel can do WHAT?” Image from Friday (the movie), from newsone.com

It’s Friday! Better still, it’s Friday afternoon!  To honor all of the hard work we’ve done this week, let’s have some fun with Excel.  Check out these interesting uses for Excel that have nothing to do with your data:

Want to see some silly spreadsheet movies? Here ya go.

Excel Hero: Download .xls files that create nifty optical illusions.  Here’s one of them.

From PCWorld, Fun uses for Excel, including a Web radio player that plays inside your worksheet (click to download the zip file and then select a station), or simulating dice rolls in case of a lack-of-dice emergency during Yahtzee.

Here are the results of a Google Image Search for "Excel art":

excel art

 

Mona Lisa never looked so smart.  Want to know more? Check out the YouTube video tutorial or read "Creating art with Microsoft Excel" from the blog Digital Inspiration.

 

Why You Should Floss

No, I won’t be discussing proper oral hygiene. What I mean by “flossing” is actually “backing up your data”.  Why the floss analogy? Here are the similarities between flossing and backing up your data:

  1. It’s undisputed that it’s important
  2. Most people don’t do it as often as they should
  3. You lie (to yourself, or your dentist) about how often you do it
dentist
Oral (and data) hygiene can be fun! From Calisphere, courtesy of UC Berkeley Bancroft Library

So think about backing up similarly to the way you think about flossing:  you probably aren’t doing it enough.  In this post, I will provide some general guidance about backing up your data; as always, the advice will vary greatly depending on the types of data you are generating, how often they change, and what computational resources are available to you.

First, create multiple copies in multiple locations.  The old rule of thumb is “original, near, far.”  The first copy is your working copy of the data; the second copy is kept near your original (most likely on an external hard drive or thumb drive); the third is kept far from your original (off site, such as at home or on a server outside of your office building).  This is the important part: all three of these copies should be up to date.  Which brings me to my second point.

Second, back up your data more often.  I have had many conversations with scientists over the last few months, and I always ask, “How do you back up your data?”  Answers range, but most of them scare me silly.  For instance, there was a fifth-year graduate student who had all of her data on a six-year-old laptop and only backed up once a month.  I get heart palpitations just typing that sentence.  Other folks have said things like “I use my external drive to back things up once every couple of months”, or, worst-case scenario, “I know I should, but I just don’t back up”.  It is strongly recommended that you back up every day. It’s a pain, right? Not necessarily: there are two very easy ways to back up every day, and neither requires purchasing any hardware or software: (1) keep a copy on Dropbox, or (2) email yourself the data file as an attachment.  Note: these suggestions are not likely to work for large data sets.
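If you would rather script these first two points than rely on memory, here is a minimal sketch of the “original, near, far” routine in Python. The file paths and server address are hypothetical stand-ins; swap in your own working file, external drive, and off-site location, and run it daily (by hand, or via a scheduler such as cron or Windows Task Scheduler):

```python
import shutil
import subprocess
from pathlib import Path

# Hypothetical locations; adjust to your own setup.
ORIGINAL = Path("C:/data/experiment_results.xlsx")                  # working copy
NEAR = Path("E:/backups/experiment_results.xlsx")                   # external or thumb drive
FAR = "me@lab-server.example.edu:backups/experiment_results.xlsx"   # off-site server (needs ssh access)

def back_up():
    """Refresh the near and far copies of the working file."""
    NEAR.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(ORIGINAL, NEAR)                                     # near copy, preserves timestamps
    subprocess.run(["scp", str(ORIGINAL), FAR], check=True)          # far copy over ssh

if __name__ == "__main__":
    back_up()
    print("Backed up", ORIGINAL.name, "to near and far locations.")
```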

Third, find out what resources are available to you. Institutions are becoming aware of the importance of good backup and data storage systems, which means there might be ways for you to back up your data regularly with minimal effort.  Check with your department or campus IT folks and ask about server space and automated backup service. If server space and/or backing up isn’t available, consider joining forces with other scientists to purchase servers for backing up (this is an option for professors more often than graduate students).

Finally, ensure that your backup plan is working.  This is especially important if others are in charge of data backup.  If your lab group has automated backup to a common computer, check to be sure your data are there, in full, and readable.  Ensure that the backup is actually occurring as regularly as you think it is.  More generally, you should be sure that if your laptop dies, or your office is flooded, or your home is burgled, you will be able to recover your data in full.
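One lightweight way to check a backup is to compare checksums between the original and the backup copy. The sketch below does this for a single file, again with hypothetical paths standing in for your own:

```python
import hashlib
from pathlib import Path

def sha256(path):
    """Return the SHA-256 digest of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical original and backup locations.
original = Path("C:/data/experiment_results.xlsx")
backup = Path("E:/backups/experiment_results.xlsx")

if not backup.exists():
    print("Backup file is missing!")
elif sha256(original) != sha256(backup):
    print("Backup differs from the original; it may be stale or corrupted.")
else:
    print("Backup matches the original.")
```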

For more information on backing up, check out the DataONE education module “Protected back-ups”.

Tweeting for Science

At the risk of veering off course from this blog’s typical topics, I am going to post about tweeting.  This topic is timely given my previous post about the lack of social media use in Ocean Sciences, the blog post it spawned at Words in mOcean, and the Twitter hash tag #NewMarineTweep. A grad school friend recently asked me what I like about tweeting (ironically, the question was asked on Facebook).  So instead of touting my thoughts on Twitter to my limited Facebook friends, I thought I would post here and face the consequences of avoiding DCXL almost completely on the blog this week.

First, there’s no need to reinvent the wheel.  Check out these resources about tweeting in science:

That being said, I will now pontificate on the value of Twitter for science, in handy numbered list form.

  1. It saves me time.  This might seem counter-intuitive, but it’s absolutely true.  If you are a head-in-the-sand kind of person, this point might not be for you. But I like to know what’s going on in science, science news, the world of science publishing, science funding, etc. etc.  That doesn’t even include regular news or local events.  The point here is that instead of checking websites, digging through RSS feeds, or having an overfull email inbox, I have filtered all of these things through HootSuite.  HootSuite is one of several free services for organizing your Twitter feeds; mine looks like a bunch of columns arranged by topic.  That way I can quickly and easily check on the latest info, in a single location. Here’s a screenshot of my HootSuite page, to give you an idea of the possibilities: click to open the PDF: HootSuite_Screenshot
  2. It is great for networking.  I’ve met quite a few folks via Twitter that I probably never would have encountered otherwise.  Some have become important colleagues, others have become friends, and all of them have helped me find resources, information, and insight.  I’ve been given academic opportunities based on these relationships and connections.  How does this happen? The Twittersphere is intimate and small enough that you can have meaningful interactions with folks.  Plus, there’s tweetups, where Twitter folks meet up at a designated physical location for in-person interaction and networking.
  3. It’s the best way to experience a conference, whether or not you are physically there. This is what spawned that previous post about Oceanography and the lack of social media use.  I was excited to experience my first Ocean Sciences meeting with all of the benefits of Twitter, only to be disappointed at the lack of participation.  In a few words, here’s how conference (or any event) tweeting works:
    1. A hash tag is declared. It’s something short and pithy, like #Oceans2012. How do you find out about the tag? Usually the organizing committee tells you, or in lieu of that you rely on your Twitter network to let you know.
    2. Everyone who tweets about a conference, interaction, talk, etc. uses the hash tag in their tweet. Examples:    
    3. Hash tags are ephemeral, but they allow you to see exactly who’s talking about something, whether you follow them or not.  They are a great way to find people on Twitter that you might want to network with… I’m looking at you, @rejectedbanana @miriamGoldste.
    4. If you are not able to attend a conference, you can “follow along” on your computer and get real-time feeds of what’s happening.  I’ve followed several conferences like this- over the course of the day, I will check in on the feed a few times and see what’s happening. It’s the next best thing to being there.

I could continue expounding on the greatness of Twitter, but as I said before, others have done a better job than I could (see links above).  No, it’s not for everyone. But keep in mind that you can follow people, hash tags, etc. without ever actually tweeting. You can reap the benefits of everything I mentioned above, except for the networking.  Food for thought.

My friend from WHOI, who also attended the Ocean Sciences meeting, emailed me this comment later:

…I must say those “#tweetstars” were pretty smug about their tweeting, like they were sitting at the cool kids’ table during lunch or something…

I countered that it was more like those tweeting at OS were incredulous at the lack of tweets, but yes, we are definitely the cool kids.

Help us get started

We at University of California Curation Center (UC3) are very interested in engaging in the conversations happening about data.  The purpose of this blog is to explore the landscape of digital data.  We are interested in topics such as

Do you have topics you would like us to discuss on this blog? Please comment on this post or email me.  The conversation is only beginning, and is sure to be interesting.