Author: CDL UC3


The integration of the Merritt repository with Amazon’s S3 and Glacier cloud storage services, previously described in an August 16 post on the Data Pub blog, is now mostly complete. The new Amazon storage supplements Merritt’s longstanding reliance on UC private cloud offerings at UCLA and UCSD. Content tagged for public access is now routed to S3 for primary storage, with automatic replication to UCSD and UCLA. Private content is routed first to UCSD, and then replicated to UCLA and Glacier. Content is served for retrieval from the primary storage location; in the unlikely event of a failure, Merritt automatically retries from secondary UCSD or UCLA storage. Glacier, which provides near-line storage with four-hour retrieval latency, is not used to respond to user-initiated retrieval requests.

Content Type | Primary Storage | Secondary Storage | Primary Retrieval | Secondary Retrieval
Public | S3 | UCSD, UCLA | S3 | UCSD or UCLA
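
The routing and retrieval behavior described above can be sketched in a few lines of Python. This is only an illustration of the policy as stated in this post, not Merritt's actual code: the names `STORAGE_PLAN`, `NEAR_LINE`, `retrieval_order`, and `retrieve` are hypothetical, and the `fetch` callback stands in for whatever storage client is really used.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical table mirroring the prose: where each access class is stored.
STORAGE_PLAN: Dict[str, Dict] = {
    "public": {"primary": "S3", "replicas": ["UCSD", "UCLA"]},
    "private": {"primary": "UCSD", "replicas": ["UCLA", "Glacier"]},
}

# Near-line storage is never used to answer user-initiated retrievals.
NEAR_LINE = {"Glacier"}


def retrieval_order(access: str) -> List[str]:
    """Return the nodes to try for a retrieval, primary first."""
    plan = STORAGE_PLAN[access]
    online_replicas = [n for n in plan["replicas"] if n not in NEAR_LINE]
    return [plan["primary"]] + online_replicas


def retrieve(access: str, fetch: Callable[[str], Optional[bytes]]) -> bytes:
    """Try each node in order; fetch(node) returns None when a node fails."""
    for node in retrieval_order(access):
        data = fetch(node)
        if data is not None:
            return data
    raise LookupError(f"no online copy available for {access} content")
```

Calling `retrieval_order("private")` yields UCSD followed by UCLA; Glacier is filtered out, which matches the statement that it is not used for user-initiated requests.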

In preparation for this integration, all retrospective public content, over 1.1 million objects and 3 TB, was copied from UCSD to S3, a process that took about six days to complete. A similar move from UCSD to Glacier is now underway for the much larger corpus of private content, 1.5 million objects and 71 TB, which is expected to take about five weeks to complete.
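
For a rough sense of scale, the figures above imply quite different average transfer rates for the two copies. The calculation below is a back-of-the-envelope sketch that assumes decimal terabytes and takes "five weeks" as exactly 35 days; the function name is made up for illustration.

```python
SECONDS_PER_DAY = 86_400
TB = 10**12  # assumption: decimal terabytes (10^12 bytes)
MB = 10**6


def throughput_mb_per_s(terabytes: float, days: float) -> float:
    """Average transfer rate in MB/s for a copy of the given size and duration."""
    return terabytes * TB / (days * SECONDS_PER_DAY) / MB


public_rate = throughput_mb_per_s(3, 6)        # UCSD to S3, about six days
private_rate = throughput_mb_per_s(71, 5 * 7)  # UCSD to Glacier, about five weeks

print(f"public copy:  {public_rate:.1f} MB/s")
print(f"private copy: {private_rate:.1f} MB/s")
```

Under these assumptions the public copy averaged just under 6 MB/s, while the five-week estimate for the private copy implies roughly 23 MB/s sustained.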

The Merritt-Amazon integration enables better-optimized internal workflows and increased levels of reliability and preservation assurance. It also holds the promise of lowering overall storage costs, and thus the recharge price of Merritt for our campus customers. Amazon has, for example, recently announced significant price reductions for S3 and Glacier storage capacity, although its transactional fees remain unchanged. Once the long-term impact of S3 and Glacier pricing on Merritt costs is understood, CDL will be able to revise Merritt pricing appropriately.

CDL is also investigating the possible use of the Oracle archive cloud as a lower-cost alternative or supplement to Glacier for dark archival content hosting. While offering similar function to Glacier, including four-hour retrieval latency, Oracle’s price point is about one quarter of Glacier’s for storage capacity.

Collaborative Web Archiving with Cobweb

A partnership between the CDL, Harvard Library, and UCLA Library has been awarded funding from IMLS to create Cobweb, a collaborative collection development platform for web archiving.

The demands of archiving the web in comprehensive breadth or thematic depth easily exceed the technical and financial capacity of any single institution. To ensure that the limited resources of archiving programs are deployed most effectively, it is important that their curators know something about the collection development priorities and holdings of other, similarly-engaged institutions. Cobweb will meet this need by supporting three key functions: nominating, claiming, and holdings. The nomination function will let curators and stakeholders suggest web sites pertinent to specific thematic areas; the claiming function will allow archival programs to indicate an intention to capture some subset of nominated sites; and the holdings function will allow programs to document sites that have actually been captured.

How will Cobweb work? Imagine a fast-moving news event unfolding online via news reports, videos, blogs, and social media. Recognizing the importance of recording this event, a curator immediately creates a new Cobweb project and issues an open call for nominations. Scholars, subject area specialists, interested members of the public, and event participants themselves quickly respond, contributing to a site list more comprehensive than could be created by any one curator or institution. Archiving institutions review the site list and publicly claim responsibility for capturing portions of it that are consistent with their local policies and technical capabilities. After capture, the institutions’ holdings information is updated in Cobweb to disclose the various collections containing newly available content. It’s important to note that Cobweb collects only metadata; the actual archived web content would continue to be managed by the individual collecting organizations. Nevertheless, by distributing the responsibility, more content will be captured more quickly with less overall effort than would otherwise be possible.
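
To make the three functions concrete, here is one way the metadata for a single project might be modeled. This is purely a sketch: Cobweb's real data model is not described in this post, so the class, method, and field names are hypothetical, as is the rule that a site must be nominated before it can be claimed or recorded as held.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Project:
    """Hypothetical metadata-only record of one collecting project."""
    title: str
    nominations: Set[str] = field(default_factory=set)         # suggested site URLs
    claims: Dict[str, Set[str]] = field(default_factory=dict)  # URL -> institutions intending to capture
    holdings: Dict[str, Set[str]] = field(default_factory=dict)  # URL -> institutions that captured it

    def nominate(self, url: str) -> None:
        self.nominations.add(url)

    def claim(self, url: str, institution: str) -> None:
        # Assumption: only nominated sites can be claimed.
        if url not in self.nominations:
            raise ValueError(f"{url} has not been nominated")
        self.claims.setdefault(url, set()).add(institution)

    def record_holding(self, url: str, institution: str) -> None:
        # Assumption: only nominated sites can be recorded as held.
        if url not in self.nominations:
            raise ValueError(f"{url} has not been nominated")
        self.holdings.setdefault(url, set()).add(institution)

    def unclaimed(self) -> Set[str]:
        """Nominated sites that no institution has claimed yet."""
        return self.nominations - set(self.claims)
```

Note that the sketch stores only URLs and institution names, in keeping with the point that Cobweb collects metadata rather than archived content.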

Cobweb will help libraries and archives make better informed decisions regarding the allocation of their individual programmatic resources, and promote more effective institutional collaboration and sharing.

This project was made possible in part by the Institute of Museum and Library Services, #LG-70-16-0093-16.

CC BY and data: Not always a good fit


Simba dans le carton, jacme31, CC BY-SA 2.0

This post was originally published on the University of California Office of Scholarly Communication blog.

Last post I wrote about data ownership, and how focusing on “ownership” might drive you nuts without actually answering important questions about what can be done with data. In that context, I mentioned a couple of times that you (or your funder) might want data to be shared under CC0, but I didn’t clarify what CC0 actually means. This week, I’m back to dig into the topic of Creative Commons (CC) licenses and public domain tools — and how they work with data.

Who “owns” your data?


Tug of war, Kathleen Tyler Conklin, CC BY-NC 2.0

This post was originally published on the University of California Office of Scholarly Communication blog.

Which of these is true?

“The PI owns the data.”

“The university owns the data.”

“Nobody can own it; data isn’t copyrightable.”

You’ve probably heard somebody say at least one of these things — confidently. Maybe you’ve heard all of them. Maybe about the same dataset (but in that case, hopefully not from the same person). So who really owns research data? Well, the short answer is “it depends.”

A longer answer is that determining ownership (and whether there’s even anything to own) can be frustratingly complicated — and, even when obvious, ownership only determines some of what can be done with data. Other things like policies, contracts, and laws may dictate certain terms in circumstances where ownership isn’t relevant — or even augment or overrule an owner where it is. To avoid an unpleasant surprise about what you can or can’t do with your data, you’ll want to plan ahead and think beyond the simple question of ownership.

PIDapalooza – What, Why, When, Who?


PIDapalooza, a community-led conference on persistent identifiers
November 9-10, 2016
Radisson Blu Saga Hotel

PIDapalooza will bring together creators and users of persistent identifiers (PIDs) from around the world to shape the future PID landscape through the development of tools and services for the research community. PIDs support proper attribution and credit, promote collaboration and reuse, enable reproducibility of findings, foster faster and more efficient progress, and facilitate effective sharing, dissemination, and linking of scholarly works.

If you’re doing something interesting with persistent identifiers, or you want to, come to PIDapalooza and share your ideas with a crowd of committed innovators.

Conference themes include:

  1. PID myths. Are PIDs better in our minds than in reality? PID stands for Persistent IDentifier, but what does that mean and does such a thing exist?
  2. Achieving persistence. So many factors affect persistence: mission, oversight, funding, succession, redundancy, governance. Is open infrastructure for scholarly communication the key to achieving persistence?
  3. PIDs for emerging uses. Long-term identifiers are no longer just for digital objects. We have use cases for people, organizations, vocabulary terms, and more. What additional use cases are you working on?
  4. Legacy PIDs. There are thousands of venerable old identifier systems that people want to continue using and bring into the modern data citation ecosystem. How can we manage this effectively?
  5. The I-word. What would make heterogeneous PID systems “interoperate” optimally? Would standardized metadata and APIs across PID types solve many of the problems, and if so, how would that be achieved? What about standardized link/relation types?
  6. PIDagogy. It’s a challenge for those who provide PID services and tools to engage the wider community. How do you teach, learn, persuade, discuss, and improve adoption? What’s it mean to build a pedagogy for PIDs?
  7. PID stories. Which strategies worked? Which strategies failed? Tell us your horror stories! Share your victories!
  8. Kinds of persistence. What are the frontiers of ‘persistence’? We hear lots about fraud prevention with identifiers for scientific reproducibility, but what about data papers promoting PIDs for long-term access to reliably improving objects (software, pre-prints, datasets) or live data feeds?

PIDapalooza is organized by California Digital Library, Crossref, DataCite, and ORCID.

We believe that bringing together everyone who’s working with PIDs for two days of discussions, demos, workshops, brainstorming, and updates on the state of the art will catalyze the development of PID community tools and services.

And you can help by getting involved!

Propose a session

Please send us your session ideas by September 18. We will notify you about your proposals in the first week of October.

Register to attend

Registration is now open — come join the festival with a crowd of like-minded innovators. And please help us spread the word about PIDapalooza in your community!

Stay tuned

Keep updated with the latest news at the PIDapalooza website and on Twitter (@PIDapalooza) in the coming weeks.

See you in November!

UC3 to Explore Amazon S3 and Glacier Use for Merritt Storage

The UC Curation Center (UC3) has offered innovative digital content access and preservation services to the UC community for over six years through its Merritt repository. Merritt was developed by UC3 to address unique needs for high-quality curation services at scale and at a low price point. Recently, UC3 started looking into Amazon’s S3 and Glacier cloud storage products as a way to address cost concerns, fine-tune reliability issues, increase service options, and keep pace with ever-increasing scale in the volume, variety, and velocity of new content contributions.

The current Merritt pricing model, in effect since July 1, 2015, is based on recovering the costs of storage use, currently totaling over 73 TB contributed from all 10 UC campuses. This content is now being replicated in UC private clouds supported by UCLA and UCSD. Since the closure earlier this year of the UCOP data center, the computational processes underlying Merritt, along with all other CDL services, have been moved to virtual machines in the Amazon AWS cloud. Collocating storage alongside this computational presence in AWS will provide increased data transfer throughput during Merritt deposit and retrieval. In addition, the integration of online S3 with near-line Glacier storage offers opportunities to lower storage costs by moving archival materials with no expectation of direct end-user access to Glacier. The cost for Glacier storage is about one quarter of that for S3, which is comparable with UCLA and UCSD pricing. Of course, the additional dispersed replication of Merritt-managed data in AWS will also increase overall reliability and long-term preservation assurance.

The integration of S3 and Glacier will supplement Merritt’s existing use of UC storage.  Merritt’s storage function acts as a broker that automatically routes submitted content to the appropriate storage location based on its curatorially-defined access characteristics.  Once Amazon storage has been added to Merritt, content tagged for public access will be routed to S3 for primary storage, from which it will be automatically replicated to a UC cloud.  Retrieval requests for this content will be served from the S3 copy; should these requests fail (for example, if S3 is temporarily non-responsive), Merritt automatically retries from its secondary copy.

The path for content tagged for private access is somewhat different. It is initially routed to S3 for temporary storage until the replication to a UC cloud completes. The content is then moved into Glacier for permanent low-cost primary storage. Retrieval requests will be served from the UC cloud. In the unlikely event that this retrieval doesn’t succeed, there is no automatic retry from Glacier, since Glacier, while inexpensive for static storage, is costly for systematic retrieval. UC3 staff can, however, intervene manually to retrieve from Glacier if it becomes necessary. In the case of both public and private access, the digital content will continue to be managed with at least five copies spread across independent storage infrastructures and data centers.

The integration of Amazon S3 and Glacier into Merritt’s storage architecture will increase overall reliability and performance, while possibly leading to future reduction in costs. Once the integration is complete, UC3 will monitor AWS storage usage and associated costs through the end of the current Merritt service year on June 30, 2017, to determine the impact on Merritt pricing.

We’re hiring a new Product Manager!

CDL is recruiting for a new Product Manager.  This position will oversee the product management and outreach activities for the Dash project and service, as well as offer research data management and digital preservation consulting for the UC community.

We are looking for an experienced professional with a full understanding of product/service development and production practices.  This position (officially titled “UC3 Service Manager, Dash”) will focus on the successful development, outreach, and adoption of the Dash service.  A complete revamp of the UI and technical architecture of Dash is nearing completion.  More detail about Dash is available here. A recent presentation on the project is also available here. Because this position will focus on continuous development of Dash, it requires an enthusiastic advocate for research data management best practices, open source community building, and digital curation skills development.

A successful candidate will advocate for the needs of our constituents and translate those needs into detailed enhancements of diverse scope, size, impact, and budget. This Dash Product Manager will have a large support network: the UC3 Director, other UC3 product managers, UC3 development team, other California Digital Library departments, plus the library/IT teams across the 10 UC campuses.

Learn more and apply here.

What is Dash?

Dash is an open source, online data publication service that makes research data sharing easy.  While Dash gives the appearance of being a full-fledged data repository, it is actually a lightweight overlay layer that sits on top of, and freely interoperates with, standards-compliant repositories supporting common protocols for submission and harvesting.  UC3 has integrated Dash with its Merritt curation repository. The Dash system provides intuitive, easy-to-use interfaces for dataset submission, description, publication, and discovery.  Dash imposes minimal prescriptive eligibility and submission requirements, and automates and hides the mechanical details of DOI assignment, data packaging, and repository deposit from the user.  It features a streamlined, self-service user experience that can be integrated easily and unobtrusively into multifarious scholarly workflows.  

What is UC3?

This position is within the University of California Curation Center (UC3) at the California Digital Library (CDL), an administrative unit of the University of California Office of the President (UCOP).  UC3 works within CDL and across the 10 UC campuses to deliver leading-edge digital curation services.  We plan, create, maintain, enhance, and operate robust services responsive to the evolving needs of UC stakeholders.  UC3’s current initiatives include digital preservation, research data management, data publication, alternative metrics for usage and impact, and web archiving. Reporting to the UC3 Director, this position is responsible for managing the development and maintenance of the Dash service, including playing a key role in promoting  and setting the strategic direction for Dash. As a member of this dynamic team, a successful candidate will be asked to contribute to furthering our work advancing digital curation concepts across the UC community.  More information about UC3 can be found at  

More information about this position can be found here.

Data Science meets Academia

(guest post by Johannes Otterbach)

First Big Data and Data Science, then Data Driven and Data Informed. Even before I changed job titles—from Physicist to Data Scientist—I spent a good bit of time pondering what makes everyone so excited about these things, and whether they have a place in the academy.

Data Science is an incredibly young and flaming hot field (searching for ‘Data Science’ on Google Search yields about 283,000,000 results in 0.48 seconds [!] and the count is rising). The promises—and accordingly the stakes—of Data Science are high, and seem to follow a classic Hype Cycle. Nevertheless, Data Science is already having major impacts on all aspects of life, with personalized advertisement and self-quantification leading the charge. But is there a place for Data Science in Academia? To try and answer this question, first we have to understand more about Data Science itself, from lofty promises to practical workflows, and later I’ll offer some potential (big-picture) academic applications.

Yet another attempt at defining Data Science

There are gazillions of blogs, articles, diagrams, and other information channels that aim to define this new and still-fuzzy term ‘Data Science,’ and it will still be some years before we achieve consensus. At least for now there is some agreement surrounding the main ingredients; Drew Conway summarizes them nicely in his Venn diagram:


In this popular tweet, Josh Wills defines a Data Scientist as an individual ‘who is better at statistics than any software engineer and better at software engineering than any statistician.’  This definition just barely captures some of the basics. Referring back to the Venn diagram, a Data Scientist finds her/himself at the intersection of Statistics, Machine Learning, and a particular business need (in academic parlance, a research question).

  • Statistics is perhaps the most obvious component, as Data Science is partially about analyzing data using summary statistics (e.g., averages, standard deviations, correlations, etc.) and more complex mathematical tools. This is supplemented by
  • Machine Learning, which subsumes the programming and data munging aspects of a Data Scientist’s toolkit. Machine Learning is used to automatically sift through data that are too unwieldy for humans to analyze. (This is sometimes an aspect of defining Big Data). As an example, just try to imagine how many dimensions you could define to monitor student performance: past and current grades, participation, education history, family and social circles, physical and mental health, just to name a few categories that you could explode into several subcategories. Typically the output of Machine Learning is a certain number of features that are important within a given business problem and that can provide insight when evaluated in the context of
  • the Domain Knowledge. Domain Knowledge is essential in order to identify and explore the questions that will drive business actions. It is the one ingredient that’s not generalizable across different segments of industry (disciplines or domains) and as such a Data Scientist must acquire new Domain Knowledge for each new problem that she/he encounters.

The most formalized definition I’ve come across is from NIST’s Big Data Framework:

Data science is the empirical synthesis of actionable knowledge from raw data through the complete data lifecycle process.

I won’t elaborate on these terms here, but I do want to draw your attention to the modest word actionable. This is the key component of Data Science that distinguishes it from mere data analysis, and the implementation of which gives rise to the dichotomy of Data Driven vs. Data Informed.

Promises and shortcomings of Data Science: The Hype Cycle

The Gartner Hype Cycle report (2014) on emerging technologies places Data Science just past the threshold of inflated expectations.


This hype inflation contributes to unreasonable expectations about the problem-solving power of Data Science. All the way back in 2008, one of the early proponents of Big Data and Data Science, the Editor-in-Chief of Wired, Chris Anderson, blogged that the new data age would bring The End of Theory: The Data Deluge Makes the Scientific Method Obsolete. He claimed that by using sufficiently advanced Machine Learning algorithms, gaining insight into a problem would become trivial. This ignores the Domain Knowledge needed to understand and pose the right questions, and by now it’s not hard to see that his projection was off. If we consider highly complex processes where sufficient data are not and might never be available, we can only make advances by means of educated guesses and by building appropriate models and hypotheses. This requires a substantial amount of Domain Knowledge. Nick Barrowman formulated a detailed argument (that goes beyond just a response to Anderson’s opinion) in his article on Correlation, Causation and Confusion.

Data Science, and in particular Applied Machine Learning, is not completely agnostic of the problem space in which it’s applied; this has serious implications for the analyst’s approach to unknown data. Most importantly, the Domain Knowledge is indispensable for correctly evaluating the predictions of the algorithms and making smart decisions rather than placing blind faith in the computational output. As Yoshua Bengio frames it in his book, Deep Learning [Ch. 5.3.1, p.110]:

The most sophisticated [Machine Learning] algorithm we can conceive of has the same average performance (over all possible tasks) as merely predicting that every point belongs to the same class. Fortunately, these results hold only when we average over all possible data generating distributions. If we make assumptions about the kinds of probability distributions we encounter in real-world applications, then we can design learning algorithms that perform well on these distributions.
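
The claim in this quote, commonly known as the no free lunch theorem, can be checked by brute force on a toy problem. The sketch below is an illustration under simplifying assumptions, not an example from the book: it uses five binary-labeled points, treats the first three as training data, and averages accuracy on the two unseen points over every possible labeling. The two learners and all names are hypothetical.

```python
from itertools import product
from typing import Callable, Sequence

# A learner sees the training labels and returns one prediction for unseen points.
Learner = Callable[[Sequence[int]], int]


def constant_zero(train_labels: Sequence[int]) -> int:
    """Ignore the data and always predict class 0."""
    return 0


def majority_vote(train_labels: Sequence[int]) -> int:
    """Predict whichever class was more common in training."""
    return int(2 * sum(train_labels) > len(train_labels))


def average_off_training_accuracy(learner: Learner, n_train: int = 3, n_test: int = 2) -> float:
    """Average accuracy on unseen points over ALL possible labelings (tasks)."""
    tasks = list(product([0, 1], repeat=n_train + n_test))
    total = 0.0
    for labels in tasks:
        train, test = labels[:n_train], labels[n_train:]
        prediction = learner(train)
        total += sum(prediction == y for y in test) / n_test
    return total / len(tasks)


for learner in (constant_zero, majority_vote):
    print(learner.__name__, average_off_training_accuracy(learner))
```

Both learners average exactly 0.5, because across all tasks the unseen labels are independent of whatever was seen in training. Restricting the set of tasks to those where training and test labels are related is precisely the kind of assumption that Domain Knowledge supplies.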

Actionable business insights: Data Driven vs. Data Informed

The oft-quoted expression ‘Be data informed, not data driven’ seems to originate with Adam Mosseri’s (from Facebook) 2010 talk. He coined these terms to distinguish two different approaches to a data problem.

  • The Data-Driven approach involves analyzing the data and then adjusting the system to optimize a certain metric. Ad placement on a website provides a simple example. We move the ad slightly until we maximize the number of clicks on the ad. The problem with this approach is that we can get trapped in locally optimal points, i.e., points where any deviation leads to a decreasing click rate, however, we can’t be sure that there’s not an even better way of displaying the ad. Joshua Porter summarizes the pitfalls of a Data-Driven approach in the context of UX design. To find the absolute best solution, a tremendous amount of data and time are necessary (technically, an infinite amount of both).

Another shortcoming of the Data-Driven approach is that not everything can be formulated as an optimization problem, the fundamental mathematical formulation of Machine Learning. As a result, we can’t always guarantee that proper data have been collected, particularly in cases where we don’t have a good idea of a what a satisfying answer would look like. To circumvent these problems we can apply

  • The Data-Informed way of viewing a problem, which avoids micro-optimization as mentioned above. Furthermore it allows us to include decision-making inputs that cannot be cast into a ‘standard Machine Learning form,’ such as:
    – Qualitative data
    – Strategic interests
    – Regulatory bodies
    – Business interests
    – Competition
    – Market
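
The local-optimum trap described for the Data-Driven approach can be shown with a toy version of the ad placement example. Everything here is made up for illustration: the ten candidate positions, the click rates, and the function names. The optimizer simply moves the ad to a neighboring position whenever that improves the click rate.

```python
# Hypothetical click rates for ten ad positions: a small peak at position 2
# and a higher one at position 8.
CLICK_RATES = [0.010, 0.020, 0.030, 0.020, 0.010, 0.015, 0.030, 0.045, 0.060, 0.040]


def click_rate(position: int) -> float:
    return CLICK_RATES[position]


def hill_climb(start: int) -> int:
    """Move the ad one step at a time while the click rate keeps improving."""
    position = start
    while True:
        neighbors = [p for p in (position - 1, position + 1) if 0 <= p < len(CLICK_RATES)]
        best = max(neighbors, key=click_rate)
        if click_rate(best) <= click_rate(position):
            return position  # no neighbor is better: a (possibly local) optimum
        position = best


best_overall = max(range(len(CLICK_RATES)), key=click_rate)
print("start at 0 ->", hill_climb(0))
print("start at 5 ->", hill_climb(5))
print("best overall:", best_overall)
```

Starting from position 0 the search settles on the smaller peak, and only a different starting point reaches the best position. Nothing in the metric alone tells the optimizer that a better placement exists elsewhere, which is where the additional inputs of the Data-Informed approach come in.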

Data-Informed decisions leverage the best of two worlds: the analysis of data given a hypothesis, followed by a well-rounded decision that in turn leads to the collection of new data to improve business. Joe Blitzstein’s visualization summarizes the Data Science Process, and there’s even an industry standard known as CRISP-DM:


What about Data Science in Academia?

There have long been calls to Academia to better prepare students (especially Ph.D. graduates) for the job market. The explosion of Data Science as the sexiest job of the 21st century is fueling the creation of an increasing number of Data Science Masters programs. The value of these programs remains to be tested, as few graduates have hit the market, but the trend reveals that Academia is at least trying to respond to calls for reform.

Apart from preparing students for careers outside the academy, is there space for applying Data Science to traditional academic fields, and maybe establishing it as a field unto itself? Data Science involves much more than statistical data analysis, encompassing aspects of data management, data warehousing, reproducibility, and data best practices. To advance science as a whole, it will be necessary for researchers and staff to develop a pi-shaped skills profile (as coined by Alex Szalay):


The first leg, a.k.a. the domain specialty or Domain Knowledge, is already established after years of efforts to advance a field. However, this hints at a fundamental problem for Data Science as a domain-agnostic, standalone field. Data Science as a Service (DSaaS) is likely to fail. Instead, Data Scientists should be embedded in a field and possess domain expertise, in addition to the cross-disciplinary techniques required to tackle the data challenges at hand.

This feeds into the second, to-be-developed, leg, which represents advanced computational literacy. As more and more researchers leave the academy it’s obvious that the current system disincentivizes this development. However, it also reveals some low-hanging fruit. An easy win would be adopting simple best practices to improve how scientific data are handled and encouraging students to develop solid data skills. Another win would be to reward researchers for their efforts to make studies transparent and reproducible. Without such cultural changes, Academia will fail to advance ever-more-diversified scientific fields into the next century. Perpetuating current practices will only undermine scientific research and make it increasingly undiscoverable. As Denis Diderot put it in his 1755 Encyclopédie:

As long as the centuries continue to unfold, the number of books will grow continually, and one can predict that a time will come when it will be almost as difficult to learn anything from books as from the direct study of the whole universe. It will be almost as convenient to search for some bit of truth concealed in nature as it will be to find it hidden away in an immense multitude of bound volumes.

Next steps

It’s clear that Data Science will have major impacts on our digital and non-digital lives. The Internet of Things already transcends our individual internet presence by connecting everyday devices—such as thermostats, fridges, cars, etc.—to the internet, and thus makes them available to optimizations using Data Science. The extent of these impacts, though, will depend on our ability to make sense of the data and develop tools and intuitions to check computerized predictions against reality. Moreover, we require a better understanding of the limitations of Data Science as well as its mathematical-statistical foundations. Without thorough basic knowledge, Data Science and Machine Learning will be seen as belonging to the Dark Arts and raise skepticism. This is true for data of all sizes and depends strongly on whether we succeed in making data discoverable and processable. Data Science has a role to play in this (both in industry as well as the academy). To succeed we first need to rethink the way scientific information is produced, stored, and prepared for further investigations. And this goal hinges on overdue changes of incentives within the academy.

About the author

Johannes Otterbach is a Data Scientist at LendUp with a passion for big data technologies and applications to real world problems. He earned his Ph.D. in Physics in topics related to Quantum Computing.

Data (Curation) Viz.

Data management and data curation are related concepts, but they do not refer to precisely the same things. I use these terms so often now that sometimes the distinctions, fuzzy as they are, blur entirely. When this happens I return to visual abstractions to clarify, in my own mind, what I mean by one vs. the other. Data management is more straightforward and almost always comes in the guise of something like this:

The obligatory research data management life cycle slide. Everyone uses it, myself included, in just about every presentation I give these days. This simple (arguably oversimplified) but useful model defines more-or-less discrete data activities that correspond with different phases of the research process. It conveys what it needs to convey; namely, that data management is a dynamic cycle of activities that constantly influence one another. Essentially, we can envision a feedback loop.

Data curation, on the other hand, is a complex beastie. Standard definitions cluster around something like this one from the Digital Curation Centre in the UK:

Data curation involves maintaining, preserving, and adding value to digital research data throughout its lifecycle.

When pressed for a definition, this is certainly an elegant response. But, personally, I don’t find it to be helpful at all when I try to wrap my head around the myriad activities that go into curating anything, much less distinguishing management activities from curation activities. Moreover, I’m talking about all kinds of activities in the context of “data,” a squishy concept in and of itself. (We’ll go with the NSF’s definition: the recorded factual material commonly accepted in the scientific community as necessary to validate research findings.)

I suppose I should mention sooner or later that the point of defining “data” and all these terms appended to it is the following. There’s a lot of it [data] and we need to figure out what on earth to do with it, ergo the proliferation of new positions with these things “data management” and “data curation” in their titles. It’s important to make sure we’re speaking the same language.

There are other, more expansive approaches to defining data curation and a related post on this very blog, but to really grasp what I’m talking about when I’m saying the words “data curation,” I invariably come back to this visualization created by Tim Norris. Tim is a geographer turned CLIR Postdoctoral Fellow in Data Curation at the University of Miami. Upon assuming a new post with an unfamiliar title, he decided to draw a map of his job to explain (to himself and to others) what he means by data curation. Many thanks to Tim for sharing this exercise with the rest of our CLIR cohort and now with the blogo-world-at-large.

Below is an abbreviated caption, in Tim’s own words, as well as short- (3 min) and long-format (9 min) tours of the map narrated by Tim. And here is a handy PNG file for those occasions when the looping life cycle visualizations just won’t do.

This map of data curation has two visual metaphors. The first is that of a stylized mandala: a drawing that implies both inwards and outwards motion that is in balance. And the second is that of a Zen koan: first there is a mountain, then there’s none, and then there is. We start with visual complexity—the mountain. To build the data curation mountain we start with a definition of the word “curation” as a five step process that moves inwards. The final purpose of this curation is to move what is being curated back into the world for re-use, publication and dissemination. This can be understood as stewardship. Next we think about the sources of data in the outside world. These sources have been abstracted into three data spaces: library digital collections, external data sources, and research data products. As this data moves “inwards” we can think of verbs that describe the ingestion processes. Metadata creation, or describing the data, is a key that enables later data linkages to be identified with the final goal of making data interoperable. Once the data is “inside” the curation space it passes through a standard process that begins with storage and ends with discovery. Specific to data in this process are the formats in which the data is stored and the difference between preservation and conservation for data. To enable this work we need hardware, software, and human interfaces to the curated data. Finally, as the data moves back out into the world, we must pay attention to institutions of property rights and access. If we get this all right we will have a system that is sustainable, secure, and increases the value of our research data collections. Once again we have a mountain.

Science Boot Camp West

Last week Stanford Libraries hosted the third annual Science Boot Camp West (SBCW 2015),

“… building on the great Science Boot Camp events held at the University of Colorado, Boulder in 2013 and at the University of Washington, Seattle in 2014. Started in Massachusetts and spreading throughout the USA, science boot camps for librarians are 2.5 day events featuring workshops and educational presentations delivered by scientists with time for discussion and information sharing among all the participants. Most of the attendees are librarians involved in supporting research in the sciences, engineering, medicine or technology although anybody with an interest in science research is welcome.”

As a former researcher and newcomer to the library and research data management (RDM) scenes, I was already familiar with many of the considerable challenges on both sides of the equation (Jake Carlson recently summarized the plight of data librarians). What made SBCW 2015 such an excellent event is that it brought researchers and librarians together to identify immediate opportunities for collaboration. It also showcased examples of Stanford libraries and librarians directly facilitating the research process, from the full-service Stanford Geospatial Center to organizing Software and Data Carpentry workshops (more on this below, and from an earlier post).

Collaboration: Not just a fancy buzzword

The mostly Stanford-based researchers were generous with their time, introducing us to high-level concerns (e.g., why electrons do what they do in condensed matter) as well as more practical matters (e.g., shopping for alternatives to Evernote—yikes—for electronic lab notebooks [ELNs]). They revealed the intimate details of their workflows and data practices (Dr. Audrey Ellerbee admitted that it felt like letting guests into her home to find dirty laundry strewn everywhere, a common anxiety among researchers that in her case was unwarranted), flagged the roadblocks, and presented a constant stream of ideas for building relationships across disciplines and between librarians and researchers.

From the myriad opportunities for library involvement, here are some of the highlights:

  • Facilitate community discussions of best practices, especially for RDM issues such as programming, digital archiving, and data sharing
  • Consult with researchers about available software solutions (e.g., ELNs such as Labguru and LabArchives; note: representatives from both of these companies gave presentations and demonstrations at SBCW 2015), connect them with other users on campus, and provide help with licensing
  • Provide local/basic IT support for students and researchers using commercial products such as ELNs (e.g., maintain FAQ lists to field common questions)
  • Leverage experience with searching databases to improve delivery of informatics content to researchers (e.g., chemical safety data)
  • Provide training in and access to GIS and other data visualization tools

A winning model

The final half-day was dedicated to computer science-y issues. Following a trio of presentations involving computational workflows and accompanying challenges (the most common: members of the same research group writing the same pieces of code over and over with scant documentation and zero version control), Tracy Teal (Executive Director of Data Carpentry) and Amy Hodge (Science Data Librarian at Stanford) introduced a winning model for improving everyone’s research lives.

Software Carpentry and Data Carpentry are extremely affordable 2-day workshops that present basic concepts and tools for more effective programming and data handling, respectively. Training materials are openly licensed (CC-BY) and workshops are led by practitioners for practitioners, which allows them to be tailored to specific domains (genomics, geosciences, etc.). At present the demand for these (international) workshops exceeds the capacity to meet it … except at Stanford. With local, library-based coordination, Amy has brokered (and in some cases taught) five workshops for individual departments or research groups (who covered the costs themselves). This is the very thing I wished for as a graduate student—muddling through databases and programming in R on my own—and I think it should be replicated at every research institution. Better yet, workshops aren’t restricted to the sciences; Data Carpentry is developing training materials for techniques used in the digital humanities such as text mining.

Learning to live outside of the academic bubble

Another, subtler theme that ran throughout the program was the need/desire to strengthen connections between the academy and industry. Efforts along these lines stand to improve the science underlying matters of public policy (e.g., water management in California) and public health (e.g., new drug development). They also address the mounting pressure placed on researchers to turn knowledge into products. Mark Smith addressed this topic directly during his presentation on ChEM-H: a new Stanford initiative for supporting research across Chemistry, Engineering, and Medicine to understand and advance Human Health. I appreciated that Mark—a medicinal chemist with extensive experience in both sectors—and others emphasized the responsibility to prepare students for jobs in a rapidly shifting landscape with increasing demand for technical skills.

Over the course of SBCW 2015 I met engaged librarians, data managers, researchers, and product managers, including some repeat attendees who raved about the previous two SBCW events; the consensus seemed to be that the third was another smashing success. Helen Josephine (Head of the Engineering Library at Stanford who chaired the organizing committee) is already busy gathering feedback for next year.

SBCW 2015 at Stanford included researchers from:

Gladstone Institutes in San Francisco

ChEM-H, Stanford’s lab for Chemistry, Engineering & Medicine for Human Health

Water in the West Institute at Stanford

NSF Engineering Research Center for Re-inventing the Nation’s Urban Water Infrastructure (ReNUWIt)


Special project topics on Software and Data Carpentry with Physics and BioPhysics faculty and Tracy Teal from Data Carpentry.

Many thanks to:

Helen Josephine, Suzanne Rose Bennett, and the rest of the Local Organizing Committee at Stanford. Sponsored by the National Network of Libraries of Medicine – Pacific Southwest Region, Greater Western Library Alliance, Stanford University Libraries, SPIE, IEEE, Springer Science+Business Media, Annual Reviews, Elsevier.

From Flickr by Paula Fisher (It was just like this, but indoors, with coffee, and powerpoints.)
