A few months back I received an invite to visit the University of Florida in sunny Gainesville. The invite was from organizers of an annual symposium for the Quantitative Spatial Ecology, Evolution and Environment (QSE3) Integrative Graduate Education and Research Traineeship (IGERT) program. Phew! That was a lot of typing for the first two acronyms in my blog post’s title. The third acronym (OA) stands for Open Access, and the fourth acronym should be familiar.
I presented a session on data management and sharing for scientists, and afterward we had a round table discussion focused on OA. There were about 25 graduate students affiliated with the QSE3 IGERT program, a few of their faculty advisors, and some guests (including myself) involved in the discussion. In 90 minutes we covered the gamut of current publishing models, incentive structures for scientists, LaTeX advantages and disadvantages, and data sharing. The discussion was both interesting and energetic in a way that I don’t encounter from scientists who are “more established”. Some of the themes that emerged from our discussion warrant a blog post.
First, we discussed that data sharing is an obvious scientific obligation in theory, but when it comes to their own data, most scientists get a bit more cagey. This might be with good reason: many of the students in the discussion were still writing up their results in thesis form, never mind in journal-ready form. Throwing your data out into the ether without restrictions might result in some speedy scientist scooping you while you are dotting i’s and crossing t’s in your thesis draft. In the case of grad students and scientists in general, embargo periods seem to be a good response to most of this apprehension. We agreed as a group, however, that such embargoes should be temporary and should be phased out over time as cultural norms shift.
The current publishing model needs to change, but there was disagreement about how this change should manifest. For instance, one (very computer-savvy) student who uses R, LaTeX and Sweave asked “Why do we need publishers? Why can’t we just put the formatted text and code online?” This is an obvious solution for someone well-versed in the world of document preparation in the vein of LaTeX. You get fully formatted, high-quality publications by simply compiling documents. Many in attendance argued against this, however, because LaTeX use is not widespread, and most articles need heavy formatting before publication. Of course, this is work that the overburdened scientists would need to do themselves if they published their own work, which is not likely to become the norm any time soon.
Let’s pretend that we have overhauled both scientists and the publishing system as it is. In this scenario, scientists use free open-source tools like LaTeX and Sweave to generate beautiful documents. They document their workflows and create Python scripts that run from the command line for reproducible results. Given this scenario, one of the students in the discussion asked “How do you decide what to read?” His argument was that the current journal system provides some structure for scientists to home in on interesting publications and determine their quality based (at least partly) on the journal in which the article appears.
One of the other grad students had an interesting response to this: use tags and keywords, create better search engines for academia, and provide capabilities for real-time peer review of articles, data, and publication quality. In essence, he used the argument that there’s no such thing as too much information. You just need a better filter.
One of the final questions of the discussion came from the notable scientist Craig Osenberg. It was in reference to the shift in science towards “big data”, including remote sensing, text mining, and observatory datasets. To paraphrase: Is anyone worrying about the small datasets? They are the most unique, the hardest to document, and arguably the most important.
My answer was a resounding YES! Enter the DCXL project. We are focusing on providing support for the scientists who don’t have data managers, IT staff, or existing data repository accounts that facilitate data management and sharing. One of the main goals of the DCXL project is to help “the little guy”. These are often scientists working on relatively small datasets that can be contained in Excel files.
In summary, the very smart group of students at UF came to the same conclusions that many of us in the data world have: there needs to be a fundamental shift in the way that science is incentivized, and this is likely to take a while. Of course, given that these students are early in their careers and show high levels of interest and intelligence, they are likely to be a part of that change.