As I’ve been involved in the new FAIR for Research Software (FAIR4RS) activity (jointly run under FORCE11, RDA, and ReSA), I’ve been leading a subgroup looking at the FAIR (data) principles and thinking about how they can be applied to software. One of the things that has become clear to me is how fundamentally different software and data are, and the consequences this has for how software can be made FAIR.
The gaps that we have in practices and technology are really the same gaps we’ve known about for a few years, which I wrote about in a 2017 blog post and in a 2019 arXiv report. Here I want to focus on software metadata.
In my naive view of data, when it is shared, the people who share it usually think about metadata at that time, perhaps because they want credit, they want their data to be discovered, or because the repository they are using to share it requires metadata. But when software, particularly open source software, is shared (by being developed in the open on a platform like GitHub), no one explicitly thinks about metadata. People may assume that git commits automatically capture all the needed metadata, not realizing that many important contributions to a project may not be commits. In the Software Citation Principles, we suggested that people who want their software to be cited can publish it to a preservation repository, which requires them to provide metadata; in return they get a DOI that they can ask people to cite. This is a valid solution.
However, software changes rapidly, with frequent commits in many projects. This timescale does not match the practice of submitting the software to a preservation repository, even using tools like the GitHub to Zenodo linkage where new tagged releases on GitHub can trigger new archival deposits in Zenodo, since the vast majority of commits aren’t new tagged releases. And even when this linkage is used, someone has to update the Zenodo metadata or else it remains the same as it was for the previous archival deposit.
This suggests to me that we also need another solution, one that gets us, for open source software, to the point where:
- All needed metadata is stored with code (in the working development repository on GitHub, GitLab, etc.)
- Creating a new repository means creating a new metadata file (containing information such as the project title, license, and initial contributors), just as creating a new repository today can include creating a new README and LICENSE
- We expect that a pull request that changes code will also change the metadata, if needed (e.g., if a new person has contributed to the project), just as we expect new functionality to include new documentation and new tests. This includes the ability to record non-code contributions alongside the information automatically captured with each commit to the repository
- We have bibliographically useful (permanent) identifiers for code that refer to code at the commit level, in either the working development repository or an automatically captured archive of it
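The second item in the list above could be supported by a repository-creation hook that writes a pre-populated metadata file. As a minimal sketch, the helper below (`scaffold_codemeta` is a hypothetical name, not an existing tool) generates a codemeta.json using the CodeMeta 2.0 vocabulary:

```python
import json
from pathlib import Path

def scaffold_codemeta(repo_dir, name, license_id, contributors):
    """Write a minimal, pre-populated codemeta.json into a new repository.

    The field names follow the CodeMeta/schema.org vocabulary; the helper
    itself is hypothetical, sketching what a repository-creation hook
    (on GitHub, GitLab, etc.) could generate automatically.
    """
    metadata = {
        "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
        "@type": "SoftwareSourceCode",
        "name": name,
        # license_id is assumed to be an SPDX identifier, e.g. "MIT"
        "license": f"https://spdx.org/licenses/{license_id}",
        "contributor": [{"@type": "Person", "name": c} for c in contributors],
    }
    path = Path(repo_dir) / "codemeta.json"
    path.write_text(json.dumps(metadata, indent=2) + "\n")
    return path
```

A platform could run something like this at repository creation, leaving the developer to fill in or correct the generated fields.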
Fortunately, the fourth item has been accomplished by Software Heritage, so we just need to worry about the first three. (Roberto Di Cosmo wrote a paper with more details.) To make this happen, we need a set of changes, both social and technical. For example:
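The commit-level identifiers that Software Heritage provides are SWHIDs, which follow a simple syntax: `swh:1:<type>:<40-hex-digit hash>`, where the type is one of `cnt` (content), `dir` (directory), `rev` (revision/commit), `rel` (release), or `snp` (snapshot). A small sketch of a validator for this core syntax (ignoring the optional qualifiers the full specification allows):

```python
import re

# Core SWHID syntax, without optional qualifiers:
#   swh:1:<type>:<40-hex-digit intrinsic hash>
SWHID_RE = re.compile(r"^swh:1:(cnt|dir|rev|rel|snp):[0-9a-f]{40}$")

def is_core_swhid(identifier: str) -> bool:
    """Return True if the string matches the core SWHID syntax."""
    return SWHID_RE.fullmatch(identifier) is not None
```

A `rev`-type SWHID is what would serve as a bibliographically useful identifier for a specific commit.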
- We could work with GitHub and GitLab so that they create blank (or somewhat pre-populated) metadata files when a new repository is created, similar to how LICENSE and README files can be created. As Neil Chue Hong points out, there’s some conflict between human and machine readability for these files. README files are aimed at humans, as are LICENSE files mostly, though given that they are somewhat standardized, they are also somewhat machine readable. Metadata files, however, must serve two purposes: they need to be writable and updatable by humans, but also usable by both humans and machines. This probably requires both standardized formats and tooling.
- Neil also points out that we could work with ORCID to tie together the various identifiers for people that could appear in software metadata (name, GitHub/GitLab user name, ORCID), perhaps by registering GitHub/GitLab user names with ORCID. Martin Fenner suggests that linking an ORCID ID and a GitHub username should be part of the intended solution, and that the linking could also be mediated by a third party and should include authentication with both services. Arfon Smith adds that ORCID could also support login with GitHub and GitLab to improve the linkages (similar to how they currently support Google and Facebook).
- We could automatically submit PRs to repos that include a semi-populated metadata file, perhaps for repositories that are cited in papers or submitted to indices like ASCL or SciCrunch or SwMath.
- We could work with the cookiecutter developers and the developers of well-used cookiecutters to incorporate metadata file creation. We could work with the R and Python communities to create and harmonize practices, perhaps through CRAN and PyPI, to make this easier. We could add template repositories on well-used platforms such as GitHub and GitLab that contain metadata file templates to be completed.
- We could work with the Carpentries and with disciplinary communities that teach programming practices to make sure that this is part of their lessons.
- Matt Turk points out that we could also automate a lot, potentially via a dependabot for software metadata files, similar to all-contributors. Arfon notes that users would likely have to opt in to installing this; it’s unlikely that this would be provided to all repositories. Stephan Druskat notes that part of an automated process can be the ingestion of existing, programming language-specific software metadata, such as that provided in manifests and metadata files (pom.xml, MANIFEST.MF, setup.py, etc.)
- Daina Bouquin points out that automatically updated metadata stored with software may conflict with the metadata generated by archives when releases are eventually deposited. We could work with archives accepting software deposits to develop practices and tooling to avoid this problem.
Of course, this all requires that we have one or more standard ways to record metadata. Today, we have CITATION.cff, which is fairly easy to create and focuses on the metadata needed for citation. We also have codemeta.json, which is more general, supporting more metadata fields, but also more complex to create. (However, the Software Heritage-created CodeMeta generator is reasonable, and it’s fairly easy to imagine a similar CodeMeta editor. And there is tooling to create CodeMeta files from existing Citation File Format files.) A third option is schema.org, which has a software type. Work in the FORCE11 Software Citation Implementation Working Group is trying to suggest some small additions and changes to schema.org that would allow CodeMeta to be merged into it. There are also zenodo.json files, which store metadata that can be automatically imported into a Zenodo archive, perhaps when a GitHub repository is linked to Zenodo; for software, these probably should be replaced by one of the other options. Yet another metadata record, coming from citations in objects submitted to obtain DOIs, is Crossref’s schema, which in a near-future version will include a software type that will support different types of identifiers, including Software Heritage IDs (SWHIDs). This relies on the author of the submission to find and submit the metadata for the citations.
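The conversion between these formats is largely a field-name mapping. As a simplified sketch (assuming the CITATION.cff file has already been parsed into a dict with a YAML library, and covering only a few common fields), a CFF-to-CodeMeta converter might look like:

```python
def cff_to_codemeta(cff):
    """Map a parsed CITATION.cff dict onto CodeMeta properties.

    A simplified sketch: only title, version, license, and authors are
    handled; real converters cover the full Citation File Format schema.
    """
    codemeta = {
        "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
        "@type": "SoftwareSourceCode",
    }
    if "title" in cff:
        codemeta["name"] = cff["title"]
    if "version" in cff:
        codemeta["version"] = cff["version"]
    if "license" in cff:
        # CFF uses SPDX license identifiers, e.g. "Apache-2.0"
        codemeta["license"] = f"https://spdx.org/licenses/{cff['license']}"
    for person in cff.get("authors", []):
        # CFF splits names into given-names / family-names
        name = " ".join(
            p for p in (person.get("given-names"), person.get("family-names")) if p
        )
        codemeta.setdefault("author", []).append({"@type": "Person", "name": name})
    return codemeta
```

Keeping these mappings well defined (a "crosswalk" in CodeMeta terms) is what makes it plausible to support several formats at once without fragmenting the metadata.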
Overall, there’s a lot of work to do, but I think there’s a path forward, even if it has some forks. This work could be organized by the FORCE11 Software Citation Implementation Working Group, as well as a potential meeting organized under new IMLS support. If you are interested in participating, please join the FORCE11 group to stay up to date.
Thanks to Neil Chue Hong, Matthew Turk, Morane Gruenpeter, Martin Fenner, Daina Bouquin, Arfon Smith, and Stephan Druskat for very thoughtful and useful comments on a draft of this post, many of which are captured in the text.