In the most basic terms- Data Publishing is the process of making research data publicly available for re-use. But even in this simple statement there are many misconceptions about what Data Publications are and why they are necessary for the future of scholarly communications.
Let’s break down a commonly accepted definition of “research data publishing”. A Data Publication has three core features: 1 – data that are publicly accessible and are preserved for an indefinite amount of time, 2 – descriptive information about the data (metadata), and 3 – a citation for the data (giving credit to the data). Why are these elements essential? These three features make research data reusable and reproducible- the goal of a Data Publication.
Data are publicly accessible and preserved indefinitely
There are many ways for researchers to make their data publicly available, be it within Supporting Information files of a journal article or within an institutional, field specific, or general repository. For a true Data Publication, data should be submitted to a stable repository that can ensure data will be available and stored for an indefinite amount of time. There are over a thousand repositories registered with re3data and many publishers have repository guides to help with field specific guidance. When data are not suitable for public deposition, i.e. when data contain sensitive information, data should still be stored in a preserved and compliant space. While this restriction is a more difficult hurdle to jump over in advocating for data publishing and data preservation, it is important to ensure these data are not violating ethical requirements, nor are they locked up in a filing cabinet and eventually thrown out. Preservation of data is a necessity for the future.
Data are described (data have metadata)
Data without proper documentation or descriptive metadata are about as useful as research without data. If a Data Publication is a citable piece of scholarly work, it should contain information that it allow it to be a useful and valued piece of scholarly work. Documentation and metadata range from information regarding software used for analysis to who funded the work. While these examples serve separate purposes (one for re-use and the other for credit), it is important that all information about the creation of the dataset (who, where, how, related publications) are available.
Data are citable and credible
We’ve established that datasets are essential to research output and are an important piece of scholarly work- and they should receive the same benefits. Data need to have a persistent identifier (a stable link) that can be referenced. While many repositories use a DataCite DOI to fulfill this, some field-specific repositories use accession numbers (i.e. NCBI repositories) that can be referenced within a URL. This is one of the reasons data need to be available in a stable repository. It’s a bit difficult to reference and credit data that are on your hard drive!
If it’s so clear- why are there barriers?
Data publishing has become more widely accepted in the last ten years, with new standards from funders and publishers and a growth in stable repositories. However, there’s still work to be done and more questions to be answered before we reach mass adoption. Let’s start that conversation (you can be the questioner and I’ll be the advocate):
Organizing and submitting data are time intensive and in turn, costly
Trying to replicate a data set from scratch takes much more time (and money) than publishing your data (see robotics example here). Taking the time to search your old computer files or get in touch with your last institution to get your data is more complicated than publishing your data. Having your paper retracted because your data are called into question and you can’t share your data or don’t have it would take more time, money, and hit to your reputation than proactively publishing your datasets.
As an important side note: Data Publications do not need to be linked to a journal publication. While it may take extra time to submit a Data Publication in proper form, if used as an intermediate step in the research process you can reduce time later, get credit, and benefit the research community in the meantime.
What’s the incentive?
Credit. Next question?
But beyond credit for a citable piece of work, publishing data as a common practice will shift focus from publications being an end point in the research cycle to a starting point and this shift is crucial for transparency and reproducibility in published works. Incentives will become clear once Data Citations become common practice within the publisher and research community, and resources are available for researchers to know how (and have the time/funds) to submit Data Publications.
Too few resources for understanding Data Publishing
Many great papers have been posted and published in the last ten years about what a Data Publication is; however, less resources have been made available to the research community on how to integrate Data Publishing into the research life cycle and how to organize data to even be suitable for a Data Publication. Data Management Plans, courses on research data management, and pressure from various funder and publisher policies will help, but there’s a serious need for education on data planning/organization (including metadata and format requirements) as well as awareness of data publishing platforms and their benefits. This is a call to the community to release these materials and engage in the Research Data Management (RDM) community to get as many of these conversations going. The more resources, answers, and guidance that institutions can provide to researchers, the less the “it takes too much time and money” argument will arise, the easier it will be to achieve the incentive, and the further we will push the boundaries of transparency in scholarly communications.
There’s no better time than now to re-evaluate what resources are available for research output. If we strive for re-use and reproducibility of research data within the community, then now is the time to increase awareness and adoption of Data Publication.