Software Heritage and repository metadata: a software citation solution

This is the third (and last) in a series of short (ok, the first and second were actually short, even if this one isn’t) blog posts related to talks and discussions at the 10th RDA Plenary.

Software Heritage

Software Heritage (as described in a recent white paper) is a really interesting and very ambitious project that started (or at least was announced) last summer. Its mission is to collect, preserve, and share all software that is publicly available in source code form.

And the project has been doing this. For example, it currently stores all public repositories in GitHub, plus a bunch of other things, and is working with Bitbucket. It currently holds almost 4 billion source code files, almost a billion commits, and about 65 million projects.

This means that all public code will be stored, indexed, and made available (currently via an API and eventually simply to browse). The Software Heritage project describes three use cases for this archive:

  • Heritage: Software is an important part of human production. It is also a key enabler for salvaging our entire digital heritage. We collect, preserve, and make accessible source code for the benefits of present and future generations.
  • Science: Science relies more and more on software. To guarantee scientific reproducibility we need to preserve it. Amassing source code at this scale will be challenging, but will also enable the next generation of software studies.
  • Industry: Software is present in all industrial processes and products. The universal source code archive we are building will help industry with provenance tracking, long-term archival, and software bill of materials.

These use cases include reproducibility but, surprisingly to me, not citation and credit. Perhaps that is because the archive by itself is not sufficient to allow credit, though it could be with just a bit of effort by the community.

Citation: citable and cited

As I’ve written before, the citation process involves two elements: making something citable and then actually citing it. These two elements involve three steps, for example for a paper:

  1. The creator of a paper (aka, an author) submits the paper to a publisher
  2. After some number of steps, the publisher publishes the paper and assigns it an identifier, most likely a DOI.
  3. Someone who wants to refer to the paper within another work cites the metadata of the paper, likely including the identifier.

The first two steps make the paper citable, and the third step cites the paper.

The second step is the key to making the paper citable, by making it recoverable. The APA Publication Manual distinguishes between recoverable and unrecoverable data: recoverable data is that which can be accessed by the reader via the citation information, while unrecoverable data is that which cannot be accessed via the citation information. The APA Manual goes on to recommend that recoverable data should be cited as a formal citation, and unrecoverable data should be referred to within the text as “(author, personal communication, date)”.

Software citation

But for software, this distinction between recoverable (published) and unrecoverable (not available) doesn’t work. All versions of software on GitHub, even if never published, are recoverable by default (more or less, barring the project being deleted from GitHub, though even then they could be recovered from a local repository).

The software citation principles try to force this issue, by recommending the insertion of Step 2 in the process. Today, when a creator develops software on GitHub, the software is never really complete, though it may be released at different stages (versions) during its development. The principles say that the creator should also publish each software release (for example, through Zenodo or figshare). This finishes the process of making the software citable, and allows someone else then to cite it.

This is a reasonable solution in many cases, because it allows the reader of the paper to recover (access) the software that was cited. But in some cases, it will not work, because it adds a step to the software developer’s workflow that they may not care about enough to implement. Even if we do get to a future time in which developers routinely publish their software releases, what happens until then, or for existing software?

Enter Software Heritage

For software that it archives, Software Heritage mostly removes the need for Step 2. If I, as a user of software, want to cite the software I’ve used, I just need to:

  1. Find it on Software Heritage
  2. Cite it

Of course, this isn’t quite as simple as it sounds, but it could be in the future.

Three gaps that Software Heritage can fill

Here are three of the things that I think are missing, for which Software Heritage provides the basic answers, but some additional work is still needed:

  1. To cite software, how do I find it on Software Heritage? Many people today use software from GitHub, and they would like to cite it by pointing to the GitHub repository and commit hash. However, GitHub is not an archive while Software Heritage is. (While much software development is done on GitHub today, at some future point, this will likely no longer be true. Think about Google Code. And while Google has created a Google Code archive, it’s unlikely a smaller company would support creating and maintaining an archive of a dead project.) I find it easy to imagine a set of tools that could link from a GitHub commit hash to a location in Software Heritage’s internal Merkle tree.
  2. What is a Software Heritage ID? Software Heritage uses a (very long) hash to represent a file (a node in the Merkle tree). Exactly how this hash should be translated to a PID is not clear. Perhaps something of the form https://softwareheritage.org/ID/hash? Or, given that most Software Heritage hashes won’t be cited, perhaps a smaller hash space could be used for those that are, leading to PIDs that are easier to document as text?
  3. How do I access cited software on Software Heritage? Of course, an extra function is also needed to make the recoverable part of the citation work, to go from a Software Heritage ID to actually obtaining the software that was cited. This needs to obtain the full package, and it also ideally should link back to where the software is being developed. Using Software Heritage would be enough to see if there had been further developments of the software, such as bug fixes, that could be important depending on whether your need is simply to access the software to repeat the exact experiment in a work that cited the software, or to reproduce the work at a higher level. And, if you want to contribute to the software, or create an issue about it, you also need a link to the software repository where the active development is occurring.
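To make the first two items a bit more concrete, here is a minimal Python sketch of what such a linking tool might look like. It rests on two assumptions that are not settled by anything in this post: that Software Heritage identifies file contents with the same hash git uses for a blob, and that a revision can be looked up at the API URL pattern a reader reports in the comments below. The function names are hypothetical, and the sketch only builds identifiers and URLs; it does not contact any service.

```python
import hashlib
from urllib.parse import urlparse

# Assumed endpoint pattern, as reported in the comment thread on this post.
SWH_REVISION_API = "https://archive.softwareheritage.org/api/1/revision/{sha}/"


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 that git assigns to a file with this content.

    git hashes the header "blob <size>\\0" followed by the raw bytes.
    Whether Software Heritage uses exactly this value is an assumption.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def github_commit_to_swh(commit_url: str) -> str:
    """Map a GitHub commit URL to the assumed Software Heritage revision URL."""
    parts = urlparse(commit_url).path.strip("/").split("/")
    # Expected path shape: <owner>/<repo>/commit/<sha>
    if len(parts) != 4 or parts[2] != "commit":
        raise ValueError(f"not a GitHub commit URL: {commit_url}")
    sha = parts[3].lower()
    if len(sha) != 40 or any(c not in "0123456789abcdef" for c in sha):
        raise ValueError(f"not a full 40-character commit hash: {sha}")
    return SWH_REVISION_API.format(sha=sha)
```

Because only the commit hash is carried over, the same lookup would work for a git repository hosted anywhere, not just on GitHub; a real tool would still need to check that the archive actually holds that revision.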

The metadata/credit gap

The remaining gap is not one that Software Heritage can solve alone. It can be asked as two related questions:

  1. How do I give credit to the developers of the software?
  2. How do I find the appropriate metadata for the citation?

The question of what is appropriate metadata for software (for the purpose of citation) was partially addressed in the Software Citation Principles paper, specifically in Table 2, though this hasn’t really been tested in practice. DataCite has drafted a new version (4.1) of their schema that adds and updates elements to reflect these principles, though this isn’t public yet.

It’s also important to note that, as the Software Citation Principles paper says:

Similarly, the software metadata recorded as part of data provenance will overlap the metadata recorded as part of software citation for the software that was used in the work. The data recorded for reproducibility should also overlap the metadata recorded as part of software citation. In general, we intend the software citation principles to cover the minimum of what is necessary for software citation for the purpose of software identification. Some use cases related to citation (e.g., provenance, reproducibility) might have additional requirements beyond the basic metadata needed for citation, as Table 2 shows.

One way to think about this is that there are some metadata that describe properties of the software itself as source code, such as: authors, language, license, version number, location, etc. Let’s call this software creation metadata. And there are also metadata that describe how the code is being used, possibly including how it is built, such as: compiler version, operating system, parallel computing platform, command-line options, etc. Let’s call this software usage metadata. The software citation principles say that the software creation metadata are needed for citation, while the software usage metadata are needed for provenance and reproducibility.
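One way to picture this split is as two separate records, sketched below in Python. The class and field names are my own illustration, taken from the examples in the paragraph above; they are not the schema of the Software Citation Principles, CodeMeta, or DataCite.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SoftwareCreationMetadata:
    """Properties of the source code itself; needed for citation."""
    name: str
    authors: list[str]
    version: str
    license: str
    language: str
    location: str  # for example, the repository URL


@dataclass
class SoftwareUsageMetadata:
    """How the code was built and run; needed for provenance and reproducibility."""
    compiler_version: str
    operating_system: str
    parallel_platform: str = ""
    command_line_options: list[str] = field(default_factory=list)
```

A citation would draw only on the first record, while a provenance or reproducibility record would carry both.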

While a person who uses some software in their research can determine the software usage metadata, this person cannot determine the software creation metadata. This can only be done by the software creators. So, Software Heritage cannot provide the metadata needed for software citation – this is why there is a gap.

Filling most of the metadata gap

But the authors of the software who want to be cited can fill this gap, and could do so relatively easily. They just would need to create a single metadata file in the root of their repository, with an agreed upon name.

The first time I heard this, it was suggested by Martin Fenner, based on work done in the CodeMeta project, which has the goal of creating a minimal metadata schema for science software and code, in JSON and XML. Martin provided an example of how this could be done: the codemeta.json file in the repository https://github.com/datacite/maremma.  According to Martin, the process by which DataCite today could generate a DOI and a citation from this is semi-manual and involves using for DataCite XML generation.

If code developers created a codemeta.json file in their repository when they started working on their project, they would then just need to keep it up to date, much like they do their README (description of their project) or CONTRIBUTORS (who has contributed to the project) files, and they might not need to create a CITATION (how the project should be cited) file. Or, the CONTRIBUTORS and CITATION files could be generated from the codemeta.json as part of continuous integration, or as part of releasing or packaging.

Since Software Heritage would keep all versions of codemeta.json with the corresponding versions of the software code, it would be relatively easy to retrospectively build the proper citation metadata for any version of the software.
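As a rough sketch of how a citation could be generated from such a file, the Python below turns a parsed codemeta.json into a one-line citation. I am assuming the file uses the schema.org-style property names name, version, author (with givenName and familyName) and codeRepository; the citation layout, the function name, and the example record are all made up for illustration and are not a format anyone has agreed on.

```python
import json

# A made-up record for illustration; not taken from any real repository.
EXAMPLE_CODEMETA = """
{
  "@type": "SoftwareSourceCode",
  "name": "example-tool",
  "version": "1.2.0",
  "codeRepository": "https://example.org/example-tool",
  "author": [{"@type": "Person", "givenName": "Ada", "familyName": "Lovelace"}]
}
"""


def format_citation(codemeta: dict) -> str:
    """Build a simple one-line citation from parsed codemeta.json content."""
    authors = codemeta.get("author") or []
    if isinstance(authors, dict):  # a single author may be given as one object
        authors = [authors]
    names = []
    for a in authors:
        family, given = a.get("familyName", ""), a.get("givenName", "")
        if family and given:
            names.append(f"{family}, {given}")
        else:
            names.append(family or given or a.get("name", ""))
    # With no usable author entries, fall back to the software name.
    author_text = "; ".join(n for n in names if n) or codemeta.get("name", "Unknown")
    title = codemeta.get("name", "Untitled software")
    if codemeta.get("version"):
        title += f" (version {codemeta['version']})"
    parts = [author_text, title]
    if codemeta.get("codeRepository"):
        parts.append(codemeta["codeRepository"])
    return ". ".join(parts)
```

Running format_citation(json.loads(EXAMPLE_CODEMETA)) gives "Lovelace, Ada. example-tool (version 1.2.0). https://example.org/example-tool". The same function could be run against the codemeta.json stored with any archived version, or called from a release script to write a CITATION file.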

The rest of the metadata gap

This still leaves a portion of the gap: How do we build a citation to code when the authors don’t care about credit and have not provided a codemeta.json file? This also needs to be answered to cite almost all software that has been built to date.

Most of the metadata needed for this (see Table 2) can be extracted from the repository directly. The one thing that cannot is the authors. While some would argue that the authors are the same as the contributors to the repository, I don’t agree. There are contributors to a software project who may not contribute to the repository (e.g., a person who gets the funding and other resources to enable the project, a person who provides training on the software) and there are contributors to the repository who are not authors of the software (e.g., an administrator who adds license information to source code files). Therefore, I think the best thing to do is simply to identify the project as the authors (e.g., authors = “CodeMeta Project”), and if the authors feel this is incorrect, they can create a codemeta.json file to provide the correct information.
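The fallback rule I’m proposing is simple enough to sketch in a few lines of Python. The function name and the shape of the returned records are hypothetical; the only things taken from the discussion above are the file name codemeta.json in the repository root and the idea of naming the project as the author when that file is missing or lists no authors.

```python
import json
from pathlib import Path


def citation_authors(repo_dir: str, project_name: str) -> list:
    """Return the authors to cite for the repository checked out at repo_dir.

    Uses the "author" entry of codemeta.json when the file provides one,
    and otherwise names the project itself as the author.
    """
    path = Path(repo_dir) / "codemeta.json"
    if path.is_file():
        authors = json.loads(path.read_text(encoding="utf-8")).get("author")
        if authors:
            # A single author may be stored as one object instead of a list.
            return authors if isinstance(authors, list) else [authors]
    # Fallback: credit the project as a whole.
    return [{"@type": "Organization", "name": project_name}]
```

For a repository without the file, citation_authors(path, "CodeMeta Project") returns a single entry naming the project, which the authors can override at any time by adding a codemeta.json.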

Summary

To put this all together, I am really excited to see Software Heritage emerge, as this archive of all source code will enable a lot of understanding of software, as the project claims. And I’m also excited that Software Heritage will better enable software citation than anything we have today. Finally, this also opens a path by which software authors can create and maintain a single file in each repository to provide the metadata needed to make software citation almost automatic.

Published by:

Daniel S. Katz

Chief Scientist at NCSA, Research Associate Professor in CS, ECE, and the iSchool at the University of Illinois Urbana-Champaign; works on systems and tools (aka cyberinfrastructure) and policy related to computational and data-enabled research, primarily in science and engineering


10 thoughts on “Software Heritage and repository metadata: a software citation solution”

  1. Hi Dan,

    Really nice writeup.

    Currently in CodeMeta, we have several tools that will let software authors automatically generate codemeta.json files by detecting metadata in the package. For instance, R packages already list author information, including roles, and even support using ORCID IDs now (something we’ve tried to catalyze in the CodeMeta project): see https://github.com/ropensci/codemetar as an example.

    As you know, this is a rather separate step from the process of going from data repository to DataCite metadata. Repositories like Zenodo already have to map their internal metadata into the DataCite XML; the fact that this is rough and lossy was a big motivation for CodeMeta. Lars tells me Zenodo plans to shift next year from its internal .zenodo.json format to codemeta.json; this would effectively automate the process of going from user -> data repo -> DataCite metadata with minimal metadata loss, and hopefully encourage the uptake of richer metadata (in particular, things like ORCID IDs, which we would really like if we want to compute citation information about authors, etc.). Likewise, we hope to get a codemeta-based software metadata description up and running in DataONE, which would similarly provide some incentive for authors to bother making a decent codemeta.json file and also streamline the translation of this data between repos, DataCite, and elsewhere.

    Looking forward to talking more to Software Heritage folks and others in testing this out. The input from you, Martin, Arfon and others last year to align CodeMeta with the simple schema.org vocabulary and tooling has proven really powerful so far!


    1. Finding commits or releases in Software Heritage

      Dear Dan,
      thanks for this nice piece on the work we are doing at Software Heritage: the fact that our mission is to collect all software source code out there, and not just a part of it, can indeed be a game changer when it comes to software citation and preservation.

      I just want to mention that you can already mark as “done” the first item in the list of desiderata in this article: locating in Software Heritage an existing commit or release found on GitHub (or any other git repo).

      Let’s consider, for example, my library parmap, whose development is currently hosted here https://github.com/rdicosmo/parmap and take the hash of one of the commits, for example this one
      https://github.com/rdicosmo/parmap/commit/a9f3396f372ed4a51d75e15ca16c1c2df1fc5c97

      Then you can trivially find it on Software Heritage here

      https://archive.softwareheritage.org/api/1/revision/a9f3396f372ed4a51d75e15ca16c1c2df1fc5c97/

      Indeed, in our current architecture, we use git-compatible hashes in the giant Merkle tree we maintain at Software Heritage, and that makes things quite straightforward.

      Let me add that, as explained in our white paper https://hal.archives-ouvertes.fr/hal-01590958, we are on a mission to cover all existing repositories, and version control systems, so you will find in Software Heritage also software maintained using subversion, mercurial, and the like. Getting the right hash for these kinds of VCS will not be as straightforward, but it will be doable too.

