Second thoughts on Proper Citation Guidance for Software Developers

About a month ago, I attended (thanks Sloan Foundation!) the Scientific Software Registry Collaboration Workshop, which had very interesting discussions. While there, I continued thinking about how software should be cited, focusing on open source. My thinking was certainly influenced by talking with Anita Bandrowski, Bryce Mecum, Shelley Stall, Katrina Fenlon, and Carly Robinson at the workshop.

My main thought is that I am even more uncertain than I have been before whether the guidance we are building for software citation is correct and useful. Specifically, starting with the FORCE11 Software Citation Principles, we’ve been modeling software citation after data citation and paper citation, with the idea that we want software authors to deposit their software in an archival repository, where they will obtain a DOI (or other identifier), and then ask citers to use this DOI when citing the software. As I’ve written before, this is problematic in many cases and for many reasons. One alternative, particularly useful when a citer wants to cite software that has not been deposited, is to use a registry. Another alternative, which I’ve also written about before, is to use identifiers that are automatically created with the software, such as Git hashes or Software Heritage identifiers. Again, this is not new. What is new is that I have started thinking more systematically about the advantages and disadvantages of each software citation scheme for the common case of open source software developed with Git on a collaborative platform (e.g., GitHub, GitLab), which follow:

 

DOI Advantages

  • The identifier is understood by all, as it is standard across disciplines.
  • Metadata must be registered as part of deposition.
  • The repository has policies to ensure persistence.

DOI Disadvantages

  • Not every software version will be registered, leading to challenges in reproducibility, and in credit if there are new contributors since the last version was registered.
  • If the software is not deposited by its developers (or their institutions), there is no DOI for a citer to use.
  • Metadata will usually have to be manually edited/curated to be complete, correct, and reflective of the deposited version. (For example, the semi-automated methods used by some workflows, such as the Zenodo-GitHub integration, may not correctly capture all intellectual contributors to a package.)

 

Registry ID Advantages

  • It’s relatively easy to register software, even for people who didn’t develop the software.
  • Entries and metadata are usually curated.
  • The registry most likely has, or is developing, policies to ensure persistence.

Registry ID Disadvantages

  • Identical software can be registered multiple times in one registry, though this can potentially be avoided with automated or manual registry curation.
  • Identical software can be registered in multiple registries, which can lead to multiple identifiers for the same software in registries that are not federated or synchronized.
  • Identifiers might be less understood across disciplines (e.g., ASCL ID, SciCrunch RRID, swMATH ID).
  • Metadata will likely be incomplete, unless registration is done by developers, or unless the authors have been maintaining a metadata file in the source code repository, such as codemeta.json or CITATION.cff.
  • The registry may only register software concepts, not multiple versions.
  • The registry may only include some software versions, but not all, or may have a scheme for authors to cite both the package via the repository and a specific version that is hosted elsewhere; this may lead to challenges in reproducibility, and in credit if there are new contributors since the last version was registered.

 

Git Hash Advantages

  • The version cited can always be the specific version that was used.
  • This doesn’t require the software author to do anything.

Git Hash Disadvantages

  • The identifier is unclear: there’s no standard for how to cite. Perhaps the URL of the base repository plus the commit hash? Or the URL of a specific commit?
  • The identifier is non-archival; the platform hosting the code repository (e.g., GitHub, GitLab) may disappear, and the source code repository may be deleted.
  • Metadata is unclear, unless the authors have been maintaining a metadata file in the source code repository, such as codemeta.json or CITATION.cff.
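To make the two candidate forms of a Git-based identifier concrete, here is a minimal sketch that builds both from a repository URL and a commit hash. The function name, the repository URL, and the hash are hypothetical, and the `/commit/<hash>` path is an assumption that matches GitHub-style URLs; other platforms may use a different path. Neither form is a standard.

```python
import re


def git_citation_candidates(repo_url: str, commit: str) -> dict:
    """Return two candidate ways to point at one commit of a Git repository."""
    # A full SHA-1 commit hash is 40 lowercase hexadecimal characters.
    if not re.fullmatch(r"[0-9a-f]{40}", commit):
        raise ValueError("expected a full 40-character SHA-1 commit hash")
    base = repo_url.rstrip("/")
    if base.endswith(".git"):
        base = base[:-4]
    return {
        # Form 1: URL of the base repository plus the commit hash.
        "repo_plus_hash": f"{base} (commit {commit})",
        # Form 2: URL of the specific commit.
        # Assumption: GitHub-style path; other platforms may differ.
        "commit_url": f"{base}/commit/{commit}",
    }


# Hypothetical repository and commit hash, for illustration only.
candidates = git_citation_candidates(
    "https://github.com/example-org/example-tool.git",
    "0123456789abcdef0123456789abcdef01234567",
)
for form, text in candidates.items():
    print(f"{form}: {text}")
```

Both forms depend on the hosting platform continuing to exist, which is the non-archival problem noted in the list above.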

 

Software Heritage ID Advantages

  • The version cited can always be the specific version that was used.
  • Doesn’t require the software author to do anything (but suggestions include creating a codemeta.json file).
  • Can use reverse lookup: given an object, one can find the identifier by using the publicly-available hash function.
  • Because the identifier is a function of the object, modifications to the object will yield a different identifier. Similarly, one can recompute the identifier to verify that it matches the specific version of the object.
  • Captures multiple layers of source code granularity.
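The reverse lookup and verification properties above can be sketched for the simplest case, a single file. As I understand it, a Software Heritage identifier for file content has the form `swh:1:cnt:<hash>`, where the hash is the same SHA-1 that Git computes for a blob (a header `blob <length>\0` followed by the bytes). The function name and sample data are my own; directories, revisions, releases, and snapshots are hashed differently and are not covered here.

```python
import hashlib


def swhid_for_content(data: bytes) -> str:
    """Compute the content-level identifier (swh:1:cnt:...) for raw file bytes.

    Uses the Git blob convention: SHA-1 over "blob <length>\\0" + data.
    """
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


# Hypothetical file contents, for illustration only.
original = b"print('hello')\n"
modified = b"print('hello!')\n"

cited_id = swhid_for_content(original)

# Verification: recomputing on the same bytes reproduces the cited identifier.
print(swhid_for_content(original) == cited_id)

# Any modification to the object yields a different identifier.
print(swhid_for_content(modified) == cited_id)
```

A reader who holds a copy of the file can run the same computation to check that it is the exact object that was cited, without asking the archive.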

Software Heritage ID Disadvantages

  • Metadata is unclear, unless the authors have been maintaining a metadata file in the source code repository, such as codemeta.json or CITATION.cff.
  • These identifiers are not currently well-understood. Getting authors and publishers to recognize them will be a large effort (unlike DOIs, which are easy for authors and publishers to use for software in the same manner they now use them for papers and datasets).

 

Note that one issue that is common in all schemes is software creation metadata (properties of the software itself as source code, such as: authors, language, license, version number, location, etc.), which must be provided by the developers (owners); it cannot be determined by the user (see section “The metadata/credit gap” in my previous blog post). There are three options for metadata:

  1. It can be deposited by the developers when the software is deposited in an archival repository.
  2. It can be maintained by the developers in the source code repository (and if so, this can also be used in case 1).
  3. If it is not stored by the developers, it (at least the part of the metadata that lists the authors) cannot be known (though the authors can always be listed as “The … Project”).
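As an illustration of option 2, here is a minimal sketch that assembles a codemeta.json record covering the properties named above (authors, language, license, version number, location). The helper name and every value are hypothetical; the keys are CodeMeta 2.0 terms as I understand them, and a real project should check them against the CodeMeta documentation.

```python
import json


def build_codemeta(name, version, license_url, language, repository, authors):
    """Build a minimal CodeMeta-style record as a plain dictionary."""
    return {
        "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
        "@type": "SoftwareSourceCode",
        "name": name,
        "version": version,
        "license": license_url,
        "programmingLanguage": language,
        "codeRepository": repository,  # the "location" of the source code
        "author": [
            {"@type": "Person", "givenName": given, "familyName": family}
            for given, family in authors
        ],
    }


# Hypothetical project details, for illustration only.
record = build_codemeta(
    name="example-tool",
    version="1.2.0",
    license_url="https://spdx.org/licenses/MIT",
    language="Python",
    repository="https://github.com/example-org/example-tool",
    authors=[("Ada", "Developer"), ("Sam", "Maintainer")],
)

# The serialized text is what developers would keep as codemeta.json
# in the root of the source code repository.
codemeta_json = json.dumps(record, indent=2)
print(codemeta_json)
```

Because the file is versioned with the code, the author list at any cited commit reflects the contributors at that point.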

If we assume that we can change the developer culture so that developers store their metadata in their source code repository, which is probably one of the most important changes we should be trying to make, and in which I believe we are making progress, we can ignore the metadata disadvantages in all of the schemes. (Another way of looking at this is that we accept that projects that are concerned with credit will store their metadata in the code repository and that citing other projects as “The … Project” is reasonable.)

Then to me, the option with the fewest disadvantages and most advantages becomes using Software Heritage IDs, rather than the previously advised depositing of software in an archival repository to obtain a DOI, though there would be a large amount of work needed to get authors and publishers to understand and use Software Heritage IDs.

I wonder whether this makes sense to others, and whether they agree with the conclusion.

Another option, which some others favor and which I’m still thinking about, is that there is no single answer. Either we provide guidance that includes multiple options, or we suggest that different options (different schemes or different identifiers) are best in different cases.

There will be a session at the next RDA plenary (organized by the RDA/FORCE11 Software Source Code Identification WG in March 2020, Melbourne, Australia) where we can discuss this, to see whether others find different advantages and disadvantages of each scheme, and whether we can come to a consensus on changes to the guidance we provide.

Another option would be to turn this (and the proposed session discussion) into a paper. If anyone is interested in joining this effort, please let me know.

And of course, this entire post is focused on open source software in GitHub or another similar public source code hosting platform. How we handle other types of software, for example, closed source packages and services, isn’t addressed here.

Acknowledgments

Thanks to Tom Morrell, Alice Allen, Morane Gruenpeter, Neil Chue Hong, Daina Bouquin, Stephan Druskat, and Bryce Mecum for helpful feedback on this blog post, as well as to Tom, Alice, and Mike Hucka for co-hosting the workshop.

 

Published by:

Daniel S. Katz

Chief Scientist at NCSA, Research Associate Professor in CS, ECE, and the iSchool at the University of Illinois Urbana-Champaign; works on systems and tools (aka cyberinfrastructure) and policy related to computational and data-enabled research, primarily in science and engineering


5 thoughts on “Second thoughts on Proper Citation Guidance for Software Developers”

  1. Thanks for posting this summary Dan, I wonder though why the DOI does not have the disadvantage of having multiple DOIs associated with a tool, especially when there is no requirement that only the author(s) (potentially a shifting list of people) can pull a DOI. I see this problem as much bigger in the DOI world than in a repository, where you do talk about the potential of having multiple entries (which are mainly just mistakes that are generally corrected when they are found).


    1. Anita – to me, the problem is multiple entries in different places, and it’s reduced by having the authors submit for a DOI, where they will not want to split citations across multiple DOIs, but for non-authors submitting to different registries, there’s no incentive for them not to submit duplicates (without knowing about previous entries in other registries)


  2. Thanks for this great post Dan. A very clear analysis indeed. About your conclusion that Software Heritage IDs seem to be the best solution: I agree that it probably is the best TECHNICAL solution. However, I am not sure that the issues we have are technical in nature in the first place. I think this is more about a cultural change, and then sticking as close as possible to a concept that most people already know and use (DOIs) seems to have the best chance of being accepted, even though it might not be the best from a technical perspective.

    Anyway, what might be a nice addition to this post is an analysis, for each of the options, of how we could address the disadvantages in practice. Maybe some of them are very fundamental, while others can be relatively easily addressed. They might not all have the same “weight”.

