PIDapalooza is back!

PIDapalooza is back, by popular demand! We’re building on the best of the inaugural PIDapalooza and organizing two days packed with discussions, demos, informal and interactive sessions, updates, talks by leading PID innovators, and more. There will be lots of opportunities to network – and to learn from and engage with PID enthusiasts from around the world. All in a fun, relaxed, and welcoming atmosphere!

We’re looking for your PIDeas! Want to update the community on your current PID projects? Brainstorm new ones? Bring together experts with different perspectives on PID-related topics? Find out what’s new in PID-land? Share your experiences of creating, innovating, or communicating about PIDs? We welcome your proposals for energetic, exciting, and thoughtful rapid-fire sessions related to our eight festival themes:

PID myths. Are PIDs a dream or reality? PID stands for Persistent IDentifier, but what does that mean and does such a thing exist?
Achieving persistence. So many factors affect persistence: resolvability, mission, oversight, funding, succession, redundancy, governance. Is open infrastructure for scholarly communication the key to achieving persistence?
PIDs for emerging uses. Long-term identifiers are no longer just for digital objects. PIDs are used for people, organizations, resources, vocabulary terms, and more. What are you identifying?
Legacy PIDs. There are thousands of venerable identifier systems that people want to bring into the modern research information ecosystem. How can we manage this effectively?
Bridging worlds. What would optimize the interoperation of PID systems? Would standardized metadata and APIs across PID types solve many of the problems, and if so, how would that be achieved? What about standardized link/relation types?
PIDagogy. It’s a challenge for those who provide PID services and tools to engage the wider community. How do you teach, learn, persuade, discuss, and improve adoption? What’s it mean to build a pedagogy for PIDs?
PID stories. Which strategies work? Which strategies fail? Tell us your horror stories! Share your victories!
Kinds of persistence. What are the frontiers of ‘persistence’? We hear lots about rigor and reproducibility, but what about data papers promoting PIDs for long-term access to objects that change over time, like software or live data feeds?

Please use this short form to tell us about your proposed session. The program committee will review all suggestions received by the deadline, and we’ll let you know whether you’ve been successful by the first week of October.

We’ll be posting more information about the festival lineup on the PIDapalooza website and on Twitter (@PIDapalooza) in the coming weeks. We hope to see you in January!

PIDapalooza – the details

Where: Auditori Palau de Congressos de Girona, Passeig de la Devesa, 35, Girona, Catalonia, Spain
When: 23rd and 24th January 2018
Deadline for proposals: September 18 – please use this short form to submit session(s)

Dash Enables ORCiD Login

The Dash team has now added a second way to log in and submit. In addition to using Single Sign-On, users can now log in with ORCiD. This means that not only can you authenticate with ORCiD, but once you have logged in this way, your ORCiD ID will be connected to your Dash account. The next time you submit to Dash, your ORCiD ID will auto-populate in your submission form.

To back up a little: ORCiD is a persistent identifier used to distinguish researchers from one another and to connect researchers with their research. If you are a researcher and do not currently have an ORCiD, sign up!

To connect your ORCiD:

Log in using the button on the far right of the Dash homepage.
Here you will see two options. Clicking the top ORCiD button will send you out to the ORCiD authentication page and, after you correctly enter your ORCiD info, send you back to Dash.
Although you have now successfully authenticated with ORCiD, to ensure you are connected to your correct submitting instance (a campus, a department, DataONE, etc.) you will be asked to choose your Single Sign-On. This is the only time you will be asked to log in twice.
After successfully logging in with Single Sign-On, you will have your account connected to your ORCiD. In the future, you will not need to repeat this process; instead, you will be able to either save your login to your browser or choose one of the two options for logging in.

If you have already submitted to Dash before, you may log out and go through the same steps above. This process will tie your ORCiD to your existing account and allow for either ORCiD or Single Sign-On in the future.

What We Talk About When We Talk About Reproducibility

At the very beginning of my career in research I conducted a study which involved asking college students to smile, frown, and then answer a series of questions about their emotional experience. This procedure was based on several classic studies which posited that, while feeling happy and sad makes people smile and frown, smiling and frowning also makes people feel happy and sad. After several frustrating months of trying and failing to get this to work, I ended my experiment with no significant results. At the time, I chalked up my lack of success to inexperience. But then, almost a decade later, a registered replication report of the original work also showed a lack of significant results and I was left to wonder if I had also been caught up in what’s come to be known as psychology’s reproducibility crisis.

While I’ve since left the lab for the library, my work still often intersects with reproducibility. Earlier this year I attended a Research Transparency and Reproducibility Training session offered by the Berkeley Institute for Transparency in the Social Sciences (BITSS), and my projects involving brain imaging data, software, and research data management all invoke the term in some way. Unfortunately, though it has always been an important part of my professional activities, it isn’t always clear to me what we’re actually talking about when we talk about reproducibility.

The term “reproducibility” has been applied to efforts to enhance or ensure the research process for at least 25 years. However, related conversations about how research is conducted, published, and interpreted have been ongoing for more than half a century. Ronald Fisher, who popularized the p-value that is so central to many modern reproducibility efforts, summed up the situation in 1935.

“We may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us statistically significant results.”

Putting this seemingly simple statement into action has proven to be quite complex. Some reproducibility-related efforts are aimed at how researchers share their results, others are aimed at how they define statistical significance. There is now a burgeoning body of scholarship devoted to the topic. Even putting aside terms like HARKing, QRPs, and p-hacking, seemingly mundane objects like file drawers are imbued with particular meaning in the language of reproducibility.

So what actually is reproducibility?

Well… it’s complicated.

The best place to start might be the National Science Foundation, which defines reproducibility as “The ability of a researcher to duplicate the results of a prior study using the same materials and procedures used by the original investigator.” According to the NSF, reproducibility is one of three qualities that ensure research is robust. The other two, replicability and generalizability, are defined as “The ability of a researcher to duplicate the results of a prior study if the same procedures are followed but new data are collected.” and “Whether the results of a study apply in other contexts or populations that differ from the original one.” respectively. The difference between these terms is in the degree of separation from the original research, but all three converge on the quality of research. Good research is reproducible, replicable, and generalizable, and, at least in the context of the NSF, a researcher invested in ensuring the reproducibility of their work would deposit their research materials and data in a manner and location where they could be accessed and used by others.
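
One way to keep the three NSF terms straight is to treat them as answers to two questions: were new data collected, and did the context or population change? The short Python sketch below encodes that reading. The function name and its two boolean parameters are illustrative shorthand rather than NSF terminology, and real studies rarely sort this cleanly.

```python
def nsf_term(new_data: bool, new_context: bool) -> str:
    """Map a follow-up study onto the NSF vocabulary quoted above.

    Both parameters are hypothetical simplifications for illustration.
    """
    if new_context:
        # Results are being tested in a different context or population.
        return "generalizability"
    if new_data:
        # Same procedures are followed, but new data are collected.
        return "replicability"
    # Same materials and procedures as the original investigator.
    return "reproducibility"


print(nsf_term(new_data=False, new_context=False))  # reproducibility
print(nsf_term(new_data=True, new_context=False))   # replicability
print(nsf_term(new_data=True, new_context=True))    # generalizability
```

Checking the context question first reflects the "degree of separation" idea: a study in a new population is the furthest from the original, whatever else it shares with it.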

Unfortunately, defining reproducibility isn’t always so simple. For example, according to the NSF’s terminology, the various iterations of the Reproducibility Project are actually replicability projects (muddying the waters further, the Reproducibility Project: Psychology was preceded by the Many Labs Replication Project). However, the complexity of defining reproducibility is perhaps best illustrated by comparing the NSF definition to that of the National Institutes of Health.

Like the NSF, the NIH invokes reproducibility in the context of addressing the quality of research. However, unlike the NSF, the NIH does not provide an explicit definition of the term. Instead, NIH grant applicants are asked to address rigor and reproducibility across four areas of focus: scientific premise, scientific rigor (design), biological variables, and authentication. Unlike the definition supplied by the NSF, the NIH’s conception of reproducibility appears to apply to an extremely broad set of circumstances and encompasses both replicability and generalizability. In the context of the NIH, a researcher invested in reproducibility must critically evaluate every aspect of their research program to ensure that any conclusions drawn from it are well supported.

Beyond the NSF and NIH, there have been numerous attempts to clarify what reproducibility actually means. For example, a paper out of the Meta-Research Innovation Center at Stanford (METRICS) distinguishes between “methods reproducibility”, “results reproducibility”, and “inferential reproducibility”. Methods and results reproducibility map onto the NSF definitions of reproducibility and replicability, while inferential reproducibility includes the NSF definition of generalizability and also the notion of different researchers reaching the same conclusion following reanalysis of the original study materials. Other approaches focus on methods by distinguishing between empirical, statistical, and computational reproducibility or specifying that replications can be direct or conceptual.

No really, what actually is reproducibility?

It’s everything.

The deeper we dive into defining “reproducibility”, the muddier the waters become. In some contexts, the term refers to very specific practices related to authenticating the results of a single experiment. In other contexts, it describes a range of interrelated issues related to how research is conducted, published, and interpreted. For this reason, I’ve started to move away from explicitly invoking the term when I talk to researchers. Instead, I’ve tried to frame my various research and outreach projects in terms of how they relate to fostering good research practice.

To me, “reproducibility” is about problems. Some of these problems are technical or methodological and will evolve with the development of new techniques and methods. Some of these problems are more systemic and necessitate taking a critical look at how research is disseminated, evaluated, and incentivized. But fostering good research practice is central to addressing all of these problems.

Especially in my current role, I am not particularly well equipped to speak to whether a researcher should define statistical significance as p < 0.05, p < 0.005, or K > 3. What I am equipped to do is to help a researcher manage their research materials so they can be used, shared, and evaluated over time. It’s not that I think the term is not useful, but the problems conjured by reproducibility are so complex and context dependent that I’d rather just talk about solutions.

Resources for understanding reproducibility and improving research practice

Goodman A., Pepe A., Blocker A. W., Borgman C. L., Cranmer K., et al. (2014) Ten simple rules for the care and feeding of scientific data. PLOS Computational Biology 10(4): e1003542.

Ioannidis J. P. A. (2005) Why most published research findings are false. PLOS Medicine 2(8): e124.

Kitzes, J., Turek, D., & Deniz, F. (Eds.). (2017). The Practice of Reproducible Research: Case Studies and Lessons from the Data-Intensive Sciences. Oakland, CA: University of California Press.

Munafò, M. R., Nosek, B. A., Bishop, D. V., Button, K. S., Chambers, C. D., du Sert, N. P., et al. (2017). A manifesto for reproducible science. Nature Human Behaviour, 1, 0021.

Wilson G., Bryan J., Cranston K., Kitzes J., Nederbragt L., et al. (2017) Good enough practices in scientific computing. PLOS Computational Biology 13(6): e1005510.

Dash: The Data Publication Tool for Researchers

This post has been crossposted on Medium

We all know that research data should be archived and shared. That’s why we created Dash, a data publishing platform that is free to UC researchers. Dash complies with journal and funder requirements, follows best practices, and is easy to use. In addition, new features are continuously being developed to better integrate with your research workflow.

Why Dash is the best solution for UC researchers:

Data are archived indefinitely. You can use Dash to ensure all of your research data will be available even after you get a new computer or switch institutions. Beyond that, your data will have all the important associated documentation on the funding sources for the research, the research methods and equipment used, and readme files on how your data were processed, so future researchers, whether in your own lab or around the world, can use your work.
Data can be published at any time. While we do have features that assist with affiliated article publication, like keeping your data private during the review process, Data Publications do not need to be associated with an article. Publish your data at any point in time.
Data can be versioned. As you update and optimize protocols, or do further analysis on your data, you may update your data files or documentation. Your DOI will always resolve to a landing page listing all versions of the dataset.
Data can be uploaded to Dash directly from your computer or through a “manifest”. “Manifest” means you may enter up to 1000 URLs where your data are living on servers, Box, Dropbox, or Google Drive and the data will be transferred to Dash without waiting several hours or dealing with timeouts.
You can upload up to 100 GB of data per submission.
Dash does not limit file type. So long as the data are within the size limits listed above, publications can be image data, tabular data, qualitative data, etc.
Related works can be linked. Code, articles, other datasets, and protocols can be linked to your data for a more comprehensive package of your research.
Data deposited to Dash receive a DOI. This means that not only can your data be located but you can cite your data as you would articles. The landing page for each dataset includes an author list for your citation as well, so each author who contributed to the data collection and analysis may receive credit for their work.
Data are assigned an open license. Data deposited are publicly available for re-use by anyone under a Creative Commons license. You put many hours and coffees into producing these data; public release will give your research a broader reach. A light reminder that your name is still associated with your data, and making your data public does not mean you are “giving away” your work.
Dash is a UC project. Dash can be customized per campus. Many campus libraries are subsidizing the cost of storage, and Dash is developed by the University of California Curation Center (UC3), meaning this service is set up to serve your needs.
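
If you are assembling a large manifest, it can help to check it against the limits above before you start. The sketch below is a hypothetical pre-flight check in Python. It is not part of Dash or its API, the names are made up for illustration, and it assumes the 100 GB limit is counted in decimal gigabytes.

```python
# Limits quoted in the list above; the names are hypothetical, not Dash settings.
MAX_MANIFEST_URLS = 1000
MAX_SUBMISSION_BYTES = 100 * 10**9  # assumes the limit means decimal gigabytes


def check_submission(urls, total_bytes):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if len(urls) > MAX_MANIFEST_URLS:
        problems.append(
            f"manifest lists {len(urls)} URLs; the limit is {MAX_MANIFEST_URLS}"
        )
    if total_bytes > MAX_SUBMISSION_BYTES:
        problems.append(
            f"submission is {total_bytes / 10**9:.1f} GB; the limit is 100 GB"
        )
    return problems


# Example: three placeholder URLs totalling 2.5 GB pass both checks.
urls = [f"https://example.org/data/file{i}.csv" for i in range(3)]
print(check_submission(urls, total_bytes=2_500_000_000))  # []
```

Returning a list of problems rather than raising on the first one means you see every limit you have exceeded in a single pass.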

We hear a lot about the cost of storage being an inhibitor. But, on many campuses, the storage costs associated with Dash are subsidized by academic libraries or departments. The cost of storage could also be written into grants (as funders do require data to be archived).

We are always looking for feedback on what features would be the most useful, so that we can make data publishing a part of your normal workflows. Get in touch with us or start using Dash to archive and share your data.

From Brain Blobs to Research Data Management

If you spend some time browsing the science section of a publication like the New York Times you’ll likely run across an image that looks something like the one below: A cross section of a brain covered in colored blobs. These images are often used to visualize the results of studies using a technique called functional magnetic resonance imaging (fMRI), a non-invasive method for measuring brain activity (or, more accurately, a correlate of brain activity) over time. Researchers who use fMRI are often interested in measuring the activity associated with a particular mental process or clinical condition.

Because of the size and complexity of the datasets involved, research data management (RDM) is incredibly important in fMRI research. In addition to the brain images, a typical fMRI study involves the collection of questionnaire data, behavioral measures, and sensitive medical information. Analyzing all this data often requires the development of custom code or scripts. This analysis is also iterative and cumulative, meaning that a researcher’s decisions at each step along the way can have significant effects on both the subsequent steps and what is ultimately reported in a presentation, poster, or journal article. Those blobby brain images may look cool, but they aren’t particularly useful in the absence of information about the underlying data and analyses.

In terms of both the financial investment and researcher hours involved, fMRI research is quite expensive. Throughout fMRI’s relatively short history, data sharing has been proposed multiple times as a method for maximizing the value of individual datasets and for overcoming the field’s ongoing methodological issues. Unfortunately, a very practical issue has hampered efforts to foster the open sharing of fMRI data: researchers have historically organized, documented, and saved their data (and code) in very different ways.

What we are doing and why

Recently, following concerns about sub-optimal statistical practices and long-standing software errors, fMRI researchers have begun to cohere around a set of standards regarding how data should be collected, analyzed, and reported. From a research data management perspective, it’s very exciting to see that there is also an emerging standard regarding how data should be organized and described. But, even with these emerging standards, our understanding of the data-related practices actually employed by fMRI researchers in the lab, and of how those practices relate to data sharing and other open science-related activities, remains mostly anecdotal.

To help fill this knowledge gap and hopefully advance some best practices related to data management and sharing, Dr. Ana Van Gulick and I are conducting a survey of fMRI researchers. Developed in consultation with members of the open and reproducible neuroscience communities, our survey asks researchers about their own data-related practices, how they view the field as a whole, their interactions with RDM service providers, and the degree to which they’ve embraced developments like registrations and pre-prints. Our hope is that our results will be useful both for the community of researchers who use fMRI and for data service providers looking to engage with researchers on their own terms.

If you are a researcher who uses fMRI and would like to complete our survey, please follow this link. We estimate that the survey should take between 10 and 20 minutes.

If you are a data service provider and would like to chat with us about what we’re doing and why, please feel free to either leave a comment or contact me directly.

Building a Community: Three months of Library Carpentry

Back in May, almost 30 librarians, researchers, and faculty members got together in Portland, Oregon to learn how to teach lessons from Software, Data, and Library Carpentry. After spending two days learning the ins and outs of Carpentry pedagogy and live coding, we all returned to our home institutions as part of the burgeoning Library Carpentry community.

Library Carpentry didn’t begin in Portland, of course. It began in 2014 when the community began developing a group of lessons at the British Library. Since then, dozens of Library Carpentry workshops have been held across four continents. But the Portland event, hosted by California Digital Library, was the first Library Carpentry-themed instructor training session. Attendees not only joined the Library Carpentry community, but took their first step in getting certified as Software and Data Carpentry instructors. If Library Carpentry was born in London, it went through a massive growth spurt in Portland.

Together, the Carpentries are a global movement focused on teaching people computing skills like navigating the Unix Shell, doing version control with Git, and programming with Python. While Software and Data Carpentry are focused on researchers, Library Carpentry is by and for librarians. Library Carpentry lessons include an introduction to data for librarians, OpenRefine, and many more. Many attendees of the Portland instructor training contributed to these lessons during the Mozilla Global Sprint in June. After more than 850 GitHub events (pull requests, forks, issues, etc.), Library Carpentry ended up as far and away the most active part of the global sprint. We even had a five-month-old get in on the act!

Since the instructor training and the subsequent sprint, a number of Portland attendees have completed their instructor certification. We are on track to have 10 certified instructors in the UC system alone. Congratulations, everyone!