At Ovation, data management for researchers in the life sciences is our passion, and breaking down barriers to communicating great science and facilitating greater scientific collaboration is our obsession. So we were thrilled to see a recently published article by Amanda Whitmire (Harold A. Miller Library, Stanford) and Steven Van Tuyl (Oregon State University), entitled “Water, Water, Everywhere: Defining and Assessing Data Sharing in Academia.” The article looks into recent projects, and their subsequent publications, that received funding from the National Science Foundation (NSF). The NSF sets requirements for sharing the data produced by funded research. Whitmire and Van Tuyl’s conclusion? “Sharing at both the project level and the journal article level was not carried out in the majority of cases, and when sharing was accomplished, the shared data were often of questionable usability due to access, documentation, and formatting issues.”
Even with a data management plan in place, and thereby in compliance with the NSF grant requirements, the majority of research projects (76%) and papers (53%) studied in Whitmire and Van Tuyl’s analysis scored zero out of a possible eight on the scale the authors developed. Their rubric, called the DATA score, awards zero to two points in each of four categories: discoverability, accessibility, transparency, and actionability. While a couple of projects and papers did, in fact, get top marks, the overall picture painted by Whitmire and Van Tuyl suggests to us, as it did to them, that we have a long way to go if our aim is to maximize data sharing for all research, published or not.
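To make the arithmetic of the rubric concrete, here is a minimal sketch of how a DATA score adds up. It assumes only what is described above (four categories, zero to two points each, eight points maximum). The class name, field names, and validation are our own illustration, not the authors' code, and the criteria for awarding points within each category are defined in the article, not here.

```python
from dataclasses import dataclass

# The four categories of the DATA score, as named in the article.
CATEGORIES = ("discoverability", "accessibility", "transparency", "actionability")
MAX_PER_CATEGORY = 2  # each category is scored 0, 1, or 2


@dataclass(frozen=True)
class DataScore:
    """Hypothetical container for one project's or paper's DATA score."""
    discoverability: int
    accessibility: int
    transparency: int
    actionability: int

    def __post_init__(self):
        # Reject anything outside the 0-2 range used by the rubric.
        for name in CATEGORIES:
            value = getattr(self, name)
            if not isinstance(value, int) or not 0 <= value <= MAX_PER_CATEGORY:
                raise ValueError(f"{name} must be an integer from 0 to {MAX_PER_CATEGORY}")

    @property
    def total(self) -> int:
        # The overall score is the sum across categories (0 to 8).
        return sum(getattr(self, name) for name in CATEGORIES)


# A project that shared nothing scores zero; a fully shared one scores eight.
print(DataScore(0, 0, 0, 0).total)
print(DataScore(2, 2, 2, 2).total)
```

Seen this way, a score of zero means a project or paper earned no points in any of the four categories.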
Here are a few thoughts on what stood out for us, in the interest of continuing and participating in the broader discussion about data sharing and management that Whitmire and Van Tuyl have started. We hope that highlighting these points helps to identify and discuss the obstacles that researchers face. Through great design and close partnerships with scientific innovators, we hope to make some of these obstacles, like metadata compliance and data sharing requirements, easy and seamless to overcome.
Project data vs. paper data: dramatic differences in sharing
We refer to data produced for a project but not used for a paper (and almost certainly not shared) as “dark data.” This dark data is an idle asset, abandoned on someone’s hard drive somewhere, that has potentially enormous value to the scientific community. As Whitmire and Van Tuyl point out, there’s a big gap between the DATA scores awarded to data used in a paper (which were still less than stellar) and those awarded to data that was simply produced as part of the project. Nonetheless, it’s important to note that the NSF grant used by the researcher was for the production of the data itself, not just for the paper. All of the data has potential value. Therefore, requirements by publishers to share the data used in a paper only get part of the way there. Our estimates indicate that 70% of data produced in the U.S. with grant funding becomes “dark data,” which represents more than $24 billion of research funding. This is a tremendous issue not only for grant givers, but also for research institutes and life science companies that invest millions, even billions, annually, only to leave the majority of their product unused. We believe, along with many in the open science movement and legions of data librarians around the country, that a better state of affairs is possible. Unfortunately, current tools, growing requirements, and compliance with available platforms represent an unrealistic burden for many scientists and labs. We think the scientific community, and the cause of scientific advancement, deserves better.
The status quo isn’t achieving its goal
Simply put, current guidelines are not sufficient to guarantee success in data sharing. As the article points out, even with a data management plan in place (DMPs being a somewhat recent requirement for many grants and publications), most research projects still fail to share their data in a way that’s useful and actionable to other researchers. Why is this? In some instances, simply getting data onto a website or repository, a seemingly easy step, is instead a major pain. In other instances, it is a matter of formatting, and of the readability of proprietary data formats. Sometimes, the documentation just isn’t there: important components of the total experimental and analytical workflow lack the metadata crucial to making them understandable and reusable. Lastly, we’ve found that data sharing fails simply because it lacks context: where in the project’s story did this data come about? Was it the result of an experiment? The output of an analysis? What were the protocols for those, and how can I replicate and verify the data’s accuracy? Good intentions in the scientific community have slowly become requirements, and those requirements have slowly become more and more complex. The tools simply aren’t there to empower and incentivize researchers to transition from a scientific-paper-as-currency world into one in which their research is not only shared, but cited, valued, and appreciated by other researchers.
How can we accomplish goals for data sharing without just creating more work?
We think about the possible answers to this question all the time. We propose a digital solution (and we’d love your help designing it!) that makes providing adequate documentation of metadata and provenance a part of day-to-day data creation. In such a scenario, meeting data sharing requirements wouldn’t be extra work for the PI to manage, or for a graduate student in the lab to execute, or for a data librarian to come in late in the project’s lifecycle to make sense of. Similarly, it shouldn’t fall on researchers to provide their own platform for sharing data, as the authors point out in the case of the researcher who committed to sharing their data on their lab’s website. So while opportunities to make data public may have proliferated over the last few years, tools to make the process of managing and sharing data painless have not. Researchers and their labs are left in a precarious position where the requirements for communicating scientific discovery have increased, leaving them with less time to focus on scientific discovery itself. This disconnect represents a tremendously interesting design challenge, one with far-reaching impact throughout the scientific community.
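As a rough illustration of what “documentation as a part of day-to-day data creation” could look like, here is a minimal sketch that writes a small provenance record next to a data file at the moment the file is saved. Everything in it is an assumption made for the sake of the example: the function name, the sidecar file naming, and the fields recorded are hypothetical, and it is not a description of any existing tool.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def save_with_provenance(path, payload: bytes, *, produced_by: str,
                         protocol: str, notes: str = "") -> Path:
    """Write `payload` to `path`, plus a JSON sidecar saying where it came from.

    Hypothetical helper: the field names below are illustrative only.
    """
    data_path = Path(path)
    data_path.write_bytes(payload)

    record = {
        "file": data_path.name,
        # Checksum lets a later reader verify the file is unchanged.
        "sha256": hashlib.sha256(payload).hexdigest(),
        "created_utc": datetime.now(timezone.utc).isoformat(),
        # Was this the result of an experiment or the output of an analysis?
        "produced_by": produced_by,
        # Which protocol or script generated it?
        "protocol": protocol,
        "notes": notes,
    }
    sidecar = data_path.with_name(data_path.name + ".provenance.json")
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar
```

The point of the sketch is the timing rather than the format: the context is captured when the data is created, so nobody has to reconstruct it at the end of the project.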
We anticipate seeing not only a great deal of conversation around the paper, but also more literature, both scholarly and otherwise, discussing what we believe is a huge problem. And the U.S. government agrees: at a meeting at the end of 2015 of the president’s recently convened initiative to cure cancer, data silos and the lack of adequate tools for data sharing were cited as having hampered innovation and collaboration. Scientists quite simply have not been adequately empowered to reach, and have in fact been held back from reaching, their fullest potential. This is certainly a technology problem, especially as datasets grow to terabytes in size.
But at its core, it’s a design problem, which means that together, as a community committed to developing a better way of conducting scientific research without the ever-growing frustration, we can develop real solutions to tackle data management, collaboration, and exchange. Reach out to us any time at [email protected] to tell us about your frustrations, and join the many others who have come together around this topic. Thanks again to the authors for a great read.