A new opportunity to build a better (data) future

Last month I left my comfort zone.

After 30 years of working as an engineer, developer, and technical leader at Scripps Institution of Oceanography (SIO at UC San Diego), I started a new career as a Senior Product Manager and Research Data Specialist with UC Curation Center (UC3) at the California Digital Library. While it may sound like a big change, it was more of a steady evolution.

Although my projects at SIO were initially focused on scientific instrumentation, software development, and engineering specifications, I found the curation of the in situ data to be fascinating and better aligned with my skills and preferences. This led to service opportunities which included leadership positions within national and international data initiatives, and those projects allowed me to collaborate with members of UC3. 

Joining their team was the next logical step.

The transition from being part of the technical staff in a research setting to being a hands-on data advocate in UC3 has been an invigorating challenge so far, and it provides an excellent opportunity to build on my foundation of knowledge and grow in new areas.   

It’s an honor to pick up where my predecessor, Daniella Lowenberg, left off. I’ve long admired her approach to all things data. I am grateful for the extraordinary measures that she and John Chodacki have taken to bring me up to speed as soon as possible. 

Data publishing is a dynamic young field and my colleagues and I will be able to help shape the conversations, initiatives, and tools that serve the international research community. I look forward to working with my new colleagues as we advocate for open data and help build and implement infrastructure to make data more discoverable, interoperable, and reusable.

Farewell CDL!

A little over two years ago, after an exhausting day of packing up our apartment in Brooklyn, I turned to my partner and said “Hey, remember when I said I wasn’t going to do a postdoc?”

This was a joke, intended to offset the anxiety we were both feeling about our impending move across the country. But, after deciding to not pursue the “traditional” academic path (graduate school → postdoctoral fellowships → faculty position) and shifting from working in cognitive neuroscience labs to working in academic libraries, I had long assumed that my window into the liminal space occupied by postdocs had closed. That is, until I learned about the CLIR Postdoctoral Fellowship Program and saw an opportunity to dive headfirst into the wider world of scholarly communications and open science with the UC3 team at California Digital Library.

Today is my last day in the office at CDL and so much has happened in the world and for me personally and professionally over the course of my fellowship that I’m not sure anything I could write here would ever do it all justice. I suppose I could assess my time at CDL in terms of the number of posters, papers, and presentations I helped put together. I could mention my involvement with groups like BITSS and RDA. I could add up all the hours I’ve spent talking on Skype and Zoom or all the words I’ve written (and rewritten) in Slack and Google Docs. But really the most meaningful metric of my time at CDL would be the number of new colleagues, collaborators, and friends I’ve gained as a postdoc. I came to CDL because I wanted to become part of the broad community of folks working on research data in academic libraries. And now, as I’m about to move into a new position as the Data Services Librarian at Lane Medical Library, I can say that has happened more than I would have thought possible.

Looking back on the last two years, there are about a million people I owe a heartfelt thanks. If you’re out there and you don’t get an email from me, it’s almost definitely because I wrote something, decided it was completely insufficient, wrote something else, decided that was completely insufficient, and then got completely overwhelmed by the number of drafts in my mailbox. But seriously, thanks to everyone on the UC3 team, at CDL and the UC libraries, and beyond for everything you’ve done for me and for everything you’ve helped me do. 

Looking forward to what comes next, I have about a million ideas for new projects. Some are extensions of work I started during my fellowship while others are the product of the connections, insights, or interests I developed while at CDL. But, since this is my last blog post as a postdoc, I also want to devote some space to one last UC3 project update.

Support Your Data

If there is a common thread that ties together all of the work I’ve done at CDL it is that I really want to bridge the communication gap that exists between researchers and data librarians. The most explicit manifestation of this has been the Support Your Data project.

If you’ve missed all my blog posts, posters, and presentations on the topic, the goal of the Support Your Data project is to create tools for researchers to assess and advance their own data management practices. With an immense amount of help from the UC3 team, I drafted a rubric that describes activities related to data management and sharing in a framework that we hope is familiar and useful to researchers. Complementing this rubric, we also created a series of short guides that give actionable advice on topics such as data management planning, data organization and storage, documentation, and data publishing. Because we assumed that different research communities (e.g. researchers in different disciplines, researchers at different institutions) have different data-related needs and access to different data-related resources, all of these materials were designed with an eye towards easy customization.

A full rundown of the Support Your Data project will be given in a forthcoming project report. The short version is that, now that the majority of the content has been drafted, the next step is to work on design and adoption. We want researchers and librarians to use these tools, so we want to make sure the final products don’t look like something I’ve been working on in a series of Google spreadsheets. Though I will no longer be leading the project, this work will continue at CDL. That said, I have a lot of ideas about using the Support Your Data materials as they currently exist as a jumping-off point for future projects.

Data Management Practices in Neuroscience

I’m still surprised I convinced a library to let me do a neuroimaging project. I mean, I’m not that surprised, I can be pretty convincing when I start arguing that neuroimaging is a perfect case study for studying how researchers actually manage their data. But I think it says a lot about the UC3 team that they fully supported me as I dove deep into the literature describing fMRI data analysis workflows, charted the history of data sharing in cognitive neuroscience, and wrangled all manner of acronyms (ahem, BIDS, BIDS).

As I outlined in a previous blog post, the idea to survey neuroimaging researchers literally started with a tweet. But, before too long, it became a full-fledged collaborative research project. As a former imaging researcher, I am still marveling over the fact that my collaborator Ana Van Gulick (another neuroscientist turned research data in libraries person) and I managed to collect data from over 140 participants so quickly. Our principal aim was to provide valuable insights to both the neuroimaging and data curation communities, but this project also gave us the opportunity to practice what we preach and apply open science practices to our own work. A paper describing the results of our survey of the data management practices of MRI researchers is currently going through the peer review process, but we’ve already published a preprint and made our materials and data openly available.

We definitely hope to continue working with the neuroimaging community, but we also plan to do follow-up surveys of other research communities. Given the growing emphasis on transparency and open science practices in the field, what do data management practices look like in psychology? We hope to find out soon!

Exploring Researcher Needs and Values Related to Software

One of the principal aims of my fellowship was to explore issues around software curation. Spoiler alert: Though the majority of my projects touched on the subject of research software in some way, I’m still not sure I’ve come up with a comprehensive definition of what “software curation” actually means in practice. Shoutout to my fellow software curation fellows who continue to bring their array of perspectives and high levels of expertise to this issue (and thanks for not rolling your eyes at the cognitive neuroscientist trying to understand how computers work).

Before I started at CDL I knew that I would be working with Yasmin AlNoamany, my counterpart at the UC Berkeley library, on a project involving research software. To extend previous work done by UC3 around issues related to data publishing, we eventually decided to survey researchers on how they use, share, and value their software tools. Our results, which we hope will help libraries and other research support groups shape their service offerings, are described in this preprint. We’ve also made our materials and data openly available.

There is still a lot of work to be done defining the problems and solutions of software curation. Though we currently don’t have plans to do any follow-up studies, we have another paper in the works describing the rest of our results and our survey will definitely inform how I plan to organize software-related training and outreach in the future. The UC3 team will also be continuing to work in this area, through their involvement with The Carpentries.

But wait, there’s more

Earlier this week, after another exhausting day of packing up our apartment outside of Berkeley, I remarked to my partner, “Hey, remember when I thought I’d never get a job at Stanford?”

This is a joke too. We’re not moving across the country this time, but the move feels just as significant. Two years ago I was sad to leave New York, but ultimately decided I needed to take a step forward in my career. Now, as I’m about to take another step, I’m very sad to leave CDL. I’m very excited about what comes next, of course. But I will always be grateful to CLIR and the UC3 team for giving me the opportunity to learn so much and connect with so many amazing friends, collaborators, and colleagues.

Thanks everyone!

Dash: The Data Publication Tool for Researchers

We all know that research data should be archived and shared. That’s why Dash, a data publishing platform free to UC researchers, was created. Dash complies with journal and funder requirements, follows best practices, and is easy to use. In addition, new features are continuously being developed to better integrate with your research workflow.

Why is Dash the best solution for UC researchers?

We hear a lot about the cost of storage being an inhibitor. But, on many campuses, the storage costs associated with Dash are subsidized by academic libraries or departments. The cost of storage could also be written into grants (as funders do require data to be archived).

We are always looking for feedback on what features would be the most useful, so that we can make data publishing a part of your normal workflows. Get in touch with us or start using Dash to archive and share your data.

Data Diversity is Okay

At risk of sounding like a motivational speaker, this is such an exciting time to be involved in science and research.  We are swimming in data and information (yay!), there are exciting software tools available for researchers, librarians, and lay people alike, and the possibilities for discovery seem endless.  Of course, all of this change can be a bit daunting.  How do you handle the data deluge? What software is likely to be around for a while? How do you manage your time effectively in the face of so much technology?

Growing Pains
Just like Kirk Cameron’s choice of hair style, academics and their librarians are going through some growing pains. From www.1051jackfm.com

Like many other groups, academic libraries are undergoing some growing pains in the face of the information age. This may be attributed to drastic budget cuts, rising costs for journal subscriptions, and the diminished role that physical collections play due to increasing digitization of information. Researchers are quite content to sit at their laptops and download PDFs from their favorite journals rather than wander the stacks of their local library; they would rather use Google searches to scour the internet for obscure references than ask their friendly subject librarian for help in the hunt.

Despite the challenges above, I firmly believe that this is such an exciting time to be working at the interface of libraries, science, and technology.  Many librarians agree with me, including those at UCLA.  Lisa Federer and Jen Weintraub recently put on a great panel at the UCLA library focused on data curation.  I was invited to participate and agreed, which turned out to be an excellent decision.

The panel was called “Data Curation in Action”, and featured four panelists: Chris Johanson, UCLA professor of classics and digital humanities; Tamar Kremer-Sadlik, director of research at the UCLA Center for Everyday Lives of Families (CELF); Paul Conner, the digital laboratory director of CELF; and myself, intended to represent some mix of researchers in science and librarians.

Without droning on about how great the panel was, and how interesting the questions from the audience were, and how wonderful my discussions were with attendees after the panel, I wanted to mention the major thing that I took away: there is so much diverse data being generated by so many different kinds of projects and researchers.  Did I mention that this is an exciting time in the world of information?

Take Tamar and Paul: their project involves following families every day for hours on end, recording video, documenting interactions and locations of family members, taking digital photographs, conducting interviews, and measuring cortisol levels (an indicator for stress).  You should read that sentence again, because that is an enormous diversity of data types, not to mention the volume. Interviews and video are transcribed, quantitative observations are recorded in databases, and there is an intense coding system for labeling images, videos, and audio files.

Now for Chris, who has the ability to say “I am a professor of classics” at dinner parties (I’m jealous).  Chris doesn’t sit about reading old texts and talking about marble statues. Instead he is trying to reconstruct “ephemeral activities in the ancient world”, such as attending a funeral, going to the market, etcetera. He does this using a complex combination of Google Earth, digitized ancient maps, pictures, historical records, and data from excavations of ancient civilizations.  He stole the show at the panel when he demonstrated how researchers are beginning to create virtual worlds in which a visitor can wander around the landscape, just like in a modern day 3D video game.

This is really just a blog post about how much I love my job. I can’t imagine anything more interesting than trying to solve problems and provide assistance for researchers such as Tamar, Paul and Chris.

In case you are not one of the 35 million who have watched it, OK Go has a wonderful video about getting through the tough times associated with the dawning information age (at least that’s my rather nerdy interpretation of this song).

Trailblazers in Demography

Last week I had the great pleasure of visiting Rostock, Germany.  If your geography lessons were a long time ago, you are probably wondering “where’s Rostock?” I sure did… Rostock is located very close to the Baltic Sea, in northeast Germany.  It’s a lovely little town with bumpy streets, lots of sausage, and great public transportation.  I was there, however, to visit the prestigious Max Planck Institute for Demographic Research (MPIDR).

Demography is the study of populations, especially their birth rates, death rates, and growth rates.  For humans, this data might be used for, say, calculating premiums for life insurance.  For other organisms, these types of data are useful for studying population declines, increases, and changes.  Such areas of study are especially important for endangered populations, invasive species, and commercially important plants and animals.

baby rhino
Sharing demography data saves adorable endangered species. From Flickr by haiwan42

I was invited to MPIDR because there is a group of scientists interested in creating a repository for non-human demography data.  Luckily, they aren’t starting from scratch.  They have a few existing collections of disparate data sets, some more refined and public-facing than others; their vision is to merge these datasets and create a useful, integrated database chock full of demographic data.  Although the group has significant challenges ahead (metadata standards, security, data governance policies, long term sustainability), their enthusiasm for the project will go a long way towards making it a reality.

I am blogging about this meeting because, for me, the group’s goals represent something much bigger than a demography database. In the past two years, I have been exposed to a remarkable range of attitudes towards data sharing (check out blog posts about it here, here, here, and here). Many of the scientists with whom I spoke needed convincing to share their datasets. But even in the short time that I have been involved in issues surrounding data, I have seen a shift towards the other end of the range. The Rostock group is one great example of scientists who are getting it.

More and more scientists are joining the open data movement, and a few of them are even working to convert others to the cause. This group that met in Rostock could put their heads down, continue to work on their separate projects, and perhaps share data occasionally with a select few vetted colleagues that they trust and know well. But they are choosing instead to venture into the wilderness of scientific data sharing. Let them be an inspiration to data hoarders everywhere.

It is our intention that the DCXL project will result in an add-in and web application that will facilitate all of the good things the Rostock group is trying to promote in the demography community.  Demographers use Microsoft Excel, in combination with Microsoft Access, to organize and manage their large datasets.  Perhaps in the future our open-source add-in and web application will be linked up with the demography database; open source software, open data, and open minds make this possible.

Popular Demand for Public Data

Scanned image of a 1940 Census Schedule (from http://1940census.archives.gov)
The National Archives and Records Administration digitized 3.9 million schedules from the 1940 U.S. census

When talking about data publication, many of us get caught up in protracted conversations aimed at carefully anticipating and building solutions for every possible permutation and use case. Last week’s release of U.S. census data, in its raw, un-indexed form, however, supports the idea that we don’t have to have all the answers to move forward.

Genealogists, statisticians and legions of casual web surfers have been buzzing about last week’s release of the complete, un-redacted collection of scanned 1940 U.S. census data schedules. Though census records are routinely made available to the public after a 72-year privacy embargo, this most recent release marks the first time that the census data set has been made available in such a widely accessible way: by publishing the schedules online.

In the first 3 hours that the data was available, 22.5 million hits crippled the 1940census.archives.gov servers. The following day, nearly 3 times that number of requests continued to hammer the servers as curious researchers scoured the census data looking for relatives of missing soldiers, hoping to find out a little bit more about their own family members, or trying to piece together a picture of life in post-Great Depression, pre-WWII America.

For the time being, scouring the data is a somewhat laborious task of narrowing in on the census schedules for a particular district, then performing a quick visual scan for people’s names. The 3.9 million scanned images that make up the data set are not, in other words, fully indexed; in fact, only a single field (the Enumeration District number field) is searchable. Encoding that field alone took 6 full-time archivists 3 months.

The task of encoding the remaining 5.3 billion fields is being taken up by an army of volunteers. Some major genealogy websites (such as Ancestry.com and MyHeritage.com) hope the crowd-sourced effort will result in a fully indexed, fully searchable database by the end of the year.
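To get a rough sense of why that army of volunteers is needed, here is a back-of-envelope sketch using only the figures quoted above. It rests on two assumptions that are mine, not the National Archives': that the archivists encoded one Enumeration District number per scanned image, and that every remaining field would take the same effort per field. Treat the result as an order-of-magnitude illustration only.

```python
# Back-of-envelope estimate built from the figures quoted in this post.
SCANNED_IMAGES = 3.9e6      # schedules digitized for the 1940 census
ARCHIVISTS = 6              # full-time staff who encoded the one searchable field
MONTHS = 3                  # time they spent on it
REMAINING_FIELDS = 5.3e9    # fields still to be encoded

# ASSUMPTION: one Enumeration District number was encoded per scanned image.
fields_encoded = SCANNED_IMAGES
person_months_spent = ARCHIVISTS * MONTHS

# ASSUMPTION: each remaining field takes the same effort as the encoded one.
fields_per_person_month = fields_encoded / person_months_spent
person_months_needed = REMAINING_FIELDS / fields_per_person_month
person_years_needed = person_months_needed / 12

print(f"Effort already spent: {person_months_spent} person-months")
print(f"Effort remaining at the same rate: {person_years_needed:,.0f} person-years")
```

Under those assumptions the remaining work comes to roughly two thousand person-years, far beyond what a handful of staff could take on alone.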

Release day for the census has been described as “the Super Bowl for genealogists.” This excitement about data, and the public’s participation in transforming the data set into a more usable, indexed form, are encouraging indications that those of us interested in how best to facilitate even more sharing and publishing of data online are doing work that has enormous, widely appreciated value. The crowd-sourced volunteer effort also reminds us that we don’t necessarily have to have all the answers when thinking about publishing data. In some cases, functionality that seems absolutely essential (such as the ability to search through the data set) is work that can (and will) be taken up by others.

So, how about your data set(s)? Who are the professional and armchair domain enthusiasts that will line up to download your data? What are some of the functionality roadblocks that are preventing you from publishing your data, and how might a third party (or a crowd-sourced effort) work as a solution? (Feel free to answer in the comments section below.)