The research community is moving toward accepting that data and software are both essential to understanding and reproducing many research outcomes (e.g., papers, books, presentations) in science, engineering, and the humanities, and that they are themselves valuable outputs that can capture and explain knowledge. However, we don’t currently have a good set of practices to support reuse and sharing of data and software. Here are some steps we could take to improve this.
I’m going to focus on work funded by US NSF because that is the set of work I best understand, but I think the ideas here are more generally applicable to all research projects that aim at increasing knowledge and benefiting society, whether publicly or privately funded.
NSF’s policy on the dissemination and sharing of research results (AAG Chapter VI.D.4) says:
- Investigators are expected to promptly prepare and submit for publication, with authorship that accurately reflects the contributions of those involved, all significant findings from work conducted under NSF grants.
- Investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants. Grantees are expected to encourage and facilitate such sharing.
- Investigators and grantees are encouraged to share software and inventions created under the grant or otherwise make them or their products widely available and usable.
- NSF normally allows grantees to retain principal legal rights to intellectual property developed under NSF grants to provide incentives for development and dissemination of inventions, software and publications that can enhance their usefulness, accessibility and upkeep. Such incentives do not, however, reduce the responsibility that investigators and organizations have as members of the scientific and engineering community, to make results, data and collections available to other researchers.
- NSF program management will implement these policies for dissemination and sharing of research results, in ways appropriate to field and circumstances, through the proposal review process; through award negotiations and conditions; and through appropriate support and incentives for data cleanup, documentation, dissemination, storage and the like.
And the NSF Grant Proposal Guide (GPG), 2016, Chapter II.C.2.j includes:
- Plans for data management and sharing of the products of research. Proposals must include a supplementary document of no more than two pages labeled “Data Management Plan.” This supplement should describe how the proposal will conform to NSF policy on the dissemination and sharing of research results (see AAG Chapter VI.D.4), and may include:
  - the types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in … the project
  - the standards to be used for data and metadata format and content …
  - policies for access and sharing …
  - policies and provisions for re-use, re-distribution, and the production of derivatives
  - plans for archiving data, samples, and other research products, and for preservation of access to them
- Data management requirements and plans specific to the Directorate, Office, Division, Program, or other NSF unit, relevant to a proposal are available at: http://www.nsf.gov/bfa/dias/policy/dmp.jsp
- If guidance specific to the program is not available, then the requirements established in this section apply.
This has been reinforced by the 2013 “Increasing Access to the Results of Federally Funded Scientific Research” memo from OSTP stating that, “The Administration is committed to ensuring that … the direct results of federally funded scientific research are made available to and useful for the public, industry, and the scientific community.”
And recent “Open Data Policy” Guidance from OMB “… requires agencies to collect or create information in a way that supports downstream information processing and dissemination activities … includes using machine-readable and open formats, data standards, and common core and extensible metadata for all new information creation and collection efforts … use of open licenses …” Open data is defined as consistent with a set of principles: public, accessible, described, reusable, complete, timely, and managed post-release.
And this year, there is a draft US Federal source code policy that suggests “steps to help ensure that new custom-developed Federal source code be made broadly available for reuse across the Federal Government.”
With these documents and policies in mind, I suggest that, for the following reasons, there are specific changes that should be made:
1. While software is data, it’s not merely data.
Software suffers from a different type of bit rot than data: it must be constantly maintained so that it continues to function as the hardware and software environments it runs on change, as bugs are found and fixed, and as user requirements demand new features and capabilities. Software is frequently built on other software, leading to complex dependencies, and those dependencies are themselves frequently changing. Software is generally smaller than data, so a number of the storage and preservation constraints on data don’t apply to software. Finally, the lifetime of software is generally not as long as that of data.
So, while NSF currently considers software as a type of data that should be described in a data management plan, this is probably not sufficient, which leads to the recommendation:
Funding agencies should require a separate software management plan in addition to the existing data management plan.
2. Data management plans are currently private documents.
They are viewed by peer reviewers as part of the review of a proposal, but those reviewers are looking at what is planned for this project, not what happened in the PI’s previous projects, because the data management plans for those previous projects are private. How seriously the reviewers take their duty to review the data management plan depends on their understanding of NSF’s data policies (as discussed at the start of this post) and on how much the program officers running the review emphasize this in their instructions and guidance.
The data management plans are also viewed by program officers during the project, in particular when an annual or final report is submitted, at which point the program officer can refuse to accept the report if the data management plan is not being carried out. In my experience, this is not uniformly enforced across NSF: many program officers and some divisions take the plans very seriously, but others do not.
This leads to two recommendations:
The science community (reviewers and program officers) needs to thoughtfully consider, at all stages of proposal review and project management, the public interest and NSF’s requirement that investigators and organizations have the responsibility as members of the scientific and engineering community to make results, data, and collections available to other researchers.
Software and data management plans should be public documents, stored as archival documents like papers, software, or datasets, so the record of what projects planned can be later verified by other scientists, including peer-reviewers of future projects by the same investigators.
The latter could be partially satisfied by making key parts of software and data management plans part of the public NSF award abstract, though ideally the whole plans would be public.
3. Data management plans must be read, judged, and verified by a person.
Because data management plans are unstructured text documents, a human being must read them, judge whether they are suitable, and ideally verify whether they have been carried out. While it’s good that the plans are general and can be customized to different community needs, it would also be good if some standard elements were machine-readable, so that some of the burden on reviewers and program officers could be relieved.
This could include software and data requirements. For example, a software management plan could propose developing software on GitHub at a specific URL and creating a software release every six months. If this were machine-readable, a system could check whether the software at the URL had been updated recently and whether the PI’s annual reports included two software releases in each year.
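To make the idea concrete, here is a minimal sketch of such a check in Python. Everything in it is an assumption for illustration: the `SoftwarePlan` fields, the function names, the 180-day threshold for "updated recently", and the placeholder repository URL are hypothetical, not an existing NSF format or tool. The last-update and release dates are passed in directly; a real system would have to fetch them from the repository host and from the annual reports.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class SoftwarePlan:
    """Hypothetical machine-readable elements of a software management plan."""
    repo_url: str               # where the software is developed (placeholder)
    releases_per_year: int      # 2 corresponds to "a release every six months"
    max_days_since_update: int  # assumed threshold for "updated recently"


def updated_recently(plan, last_update, today):
    """True if the last repository update is within the plan's threshold."""
    return (today - last_update) <= timedelta(days=plan.max_days_since_update)


def releases_by_project_year(start, release_dates, years):
    """Count releases in each project year; year 1 begins on `start`."""
    counts = [0] * years
    for d in release_dates:
        # Number of full anniversaries of `start` that have passed by date d.
        index = d.year - start.year - ((d.month, d.day) < (start.month, start.day))
        if 0 <= index < years:
            counts[index] += 1
    return counts


def meets_release_schedule(plan, start, release_dates, years):
    """True if every project year has at least the planned number of releases."""
    counts = releases_by_project_year(start, release_dates, years)
    return all(c >= plan.releases_per_year for c in counts)


# Illustrative data only: a two-year project with three reported releases.
plan = SoftwarePlan(
    repo_url="https://github.com/example-org/example-project",  # placeholder
    releases_per_year=2,
    max_days_since_update=180,
)
start = date(2016, 9, 1)
releases = [date(2017, 2, 15), date(2017, 8, 20), date(2018, 3, 1)]

print(releases_by_project_year(start, releases, years=2))  # [2, 1]
print(meets_release_schedule(plan, start, releases, years=2))  # False
print(updated_recently(plan, date(2018, 3, 1), today=date(2018, 6, 1)))  # True
```

In this made-up example the second project year has only one release, so the check would flag the project for the program officer to look at. A person would still decide whether the shortfall matters.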
Software and data management plans should have standard machine-readable elements.