Last week I spent three days in the desert, south of Albuquerque, at the NSF Large Facilities Workshop. What are these “large facilities”, you ask? I did too… this was a new world for me, but the workshop ended up being a great learning experience.
The NSF has a Large Facilities Office within the Office of Budget, Finance and Award Management, which supports “Major Research Equipment and Facilities Construction” (MREFC for short). Examples of these Large Facilities include NEON (National Ecological Observatory Network), IRIS PASSCAL Instrument Center (Incorporated Research Institutions for Seismology Program for Array Seismic Studies of the Continental Lithosphere), and the NRAO (National Radio Astronomy Observatory). Needless to say, I spent half of the workshop googling acronyms.
I was there to talk about data management, which made me a bit of an anomaly. Other attendees administered, managed, and worked at large facilities. In the course of my conversations with attendees, I was surprised to learn that these facilities aren’t too concerned with data sharing, and most of these administrator types implied that the data were owned by the researcher; it was therefore the researcher’s prerogative to share or not to share. From what I understand, the scenario is this: the NSF pays huge piles of money to get these facilities up and running, with hardware, software, technicians, managers, and on and on. The researchers then write a grant to the NSF or the facilities themselves to do work using these facilities. The researcher is then under no obligation to share the data with their colleagues. Does this seem fishy to anyone else?
I understand the point of view of the administrators that attended this conference: they have enough on their plate to worry about, without dealing with the myriad problems that accompany data management, archiving, sharing, et cetera. These problems are only compounded by researchers’ general resistance to sharing. For example, an administrator told me that, upon completion of their study, one researcher had gone into their system and deleted all of the data related to their project to make sure no one else could get it. I nearly fell over from shock.
Whatever cultural hangups the researchers have, aren’t these big datasets, collected by expensive equipment, among the most important to be shared? Observations of the sky at a single place and time are not reproducible. You only get one shot at collecting data on an earthquake or the current spreading rate of a rift zone. Not sharing these datasets is tantamount to scientific malpractice.
One administrator respectfully disagreed with my charge that they should be doing more to promote data sharing. He said that their workflow for data processing was so complex and nuanced that no one could ever reproduce the dataset, and certainly no one could ever understand what exactly was done to obtain results. This marks the second time I nearly fell over during a conversation. If science isn’t reproducible because it’s too complex, you aren’t doing it right. Yes, I realize that exactly reproducing results is nearly impossible under the best of circumstances. But to not even try? With datasets this important? When all analyses are done via computers? It seems ludicrous.
So, after three days of dry skin and Mexican food, my takeaway from the workshop was this: all large facilities sponsored by the NSF need thorough, clear policies about data produced using their equipment. These policies should include provisions for sharing, access, use, and archiving. They will most certainly be met with skepticism and resistance, but in these tight fiscal times, data sharing is of utmost importance when equipment this expensive is being used.