Compact identifiers for software: The last missing link in user-oriented software citation?

I would like to have a well-functioning, easy-to-use software citation system in place. In brief, I believe that such a system would increase recognition of the role that software and its developers and maintainers play in research, and would give them credit in a form that is useful in the academic and related settings in which many of them work. This is the motivation for the FORCE11 software citation activities (both the initial principles working group and the current implementation working group).

I now believe that compact identifiers have a key role to play in making this happen, along with CodeMeta and Software Heritage, working together in the framework provided by the Software Citation Principles. As I’ve written about much of this before, I’m just going to briefly summarize where we are now, and where we might be able to go.

The problem

The process of citing a work, such as a paper, involves 3 distinct steps:

  1. A creator submits the work to a publisher, along with metadata describing the work
  2. The publisher, possibly after a peer-review step, publishes the work, registering it and its metadata, including a newly minted DOI for the work.
  3. Someone who wants to cite the work includes some identifying metadata about it, possibly including the DOI, in another work.

Software, and specifically open source software, is often freely distributed while it is being developed, and in some sense, development may never be complete, as the software will have to be maintained so that it continues to work as hardware and underlying software change. However, the software may have releases: points where the developers feel that they have assembled a version that will be useful to others. In either case, the developers usually don’t document the software creation metadata that would be needed to consider submitting the work. And since there typically is no step 1 (submitting the work to a publisher), there is no publisher who demands this metadata, and there is likewise no step 2 that registers the metadata and mints an identifier. Thus, step 3 cannot be completed either: the software cannot be cited in the normal way a published paper can be cited.

The Software Citation Principles try to fix this: they suggest that developers publish their releases through a repository such as Zenodo (step 1), which fills the registration role (step 2), and enables citation (step 3). However, there are a number of problems with this solution, mostly related to the fact that researchers use software that is not published, either because the developers are not publishing their software at all, or because they are publishing it but the version used does not match a published release.

A solution?

In these cases, three projects working together (Software Heritage, CodeMeta, and compact identifiers) might provide a workable solution.

First, Software Heritage is building an archive of all public software in version control systems. This means there is an archival version of the software that can be referred to by a user who wants to cite that software.

Ideally, such citations would include metadata about the software. CodeMeta is a developing guideline for the minimal metadata that software developers can easily provide, stored in a JSON-LD file (called CodeMeta.json) in their software repository, which would be captured by Software Heritage.
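To make the idea of a minimal metadata file concrete, here is a rough sketch in Python that builds a small JSON-LD record of the kind a developer might commit to their repository. The field names and the context URL are my assumptions based on the CodeMeta 2.0 vocabulary, the function minimal_codemeta is hypothetical, and the project details are invented for illustration; the CodeMeta documentation is the authority on what the file should contain.

```python
import json


def minimal_codemeta(name, version, repository, authors):
    """Build a small CodeMeta-style record as a plain dict.

    Assumption: the keys below follow the CodeMeta 2.0 vocabulary;
    check the CodeMeta documentation before relying on them.
    """
    return {
        "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
        "@type": "SoftwareSourceCode",
        "name": name,
        "version": version,
        "codeRepository": repository,
        "author": [
            {"@type": "Person", "givenName": given, "familyName": family}
            for given, family in authors
        ],
    }


# Hypothetical project details, used only for illustration.
record = minimal_codemeta(
    name="example-tool",
    version="0.1.0",
    repository="https://example.org/example-tool",
    authors=[("Ada", "Example")],
)

# Serialize the record; a developer would save this text in the repository.
text = json.dumps(record, indent=2)
print(text)
```

The point of keeping the record this small is that a developer can write it once by hand and update only the version field at each release.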

Finally, the researcher who wants to cite the unpublished software needs a way to identify the software. This is where compact identifiers enter. Compact identifiers are a method for referring to a stored object in a registered repository. For example, pdb:2gc4 is a compact identifier for Protein Data Bank accession (item) number 2gc4. Compact identifier metaresolvers (identifiers.org and N2T.net) can then be used to fully resolve a compact identifier, for example: http://identifiers.org/pdb:2gc4. The compact identifier metaresolvers could work with Software Heritage to define compact identifiers for all the repositories that Software Heritage works with, and a means to resolve them to a pointer in the Software Heritage archive. Such an identifier might look like github:repo_name/commithash.
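As a minimal sketch of how a compact identifier maps to a resolvable URL, the Python below splits an identifier at its first colon and appends it to a metaresolver base URL, following the pdb:2gc4 example above. The function names are hypothetical, and the github: form is only the proposal made in this post, not a registered prefix that is known to resolve.

```python
from typing import Tuple


def parse_compact_identifier(identifier: str) -> Tuple[str, str]:
    """Split 'prefix:accession' at the first colon."""
    prefix, sep, accession = identifier.partition(":")
    if not sep or not prefix or not accession:
        raise ValueError(f"not a compact identifier: {identifier!r}")
    return prefix, accession


def resolver_url(identifier: str, resolver: str = "http://identifiers.org") -> str:
    """Build the metaresolver URL for a compact identifier.

    Only the URL string is built; no network request is made.
    """
    prefix, accession = parse_compact_identifier(identifier)
    return f"{resolver}/{prefix}:{accession}"


print(resolver_url("pdb:2gc4"))

# The form proposed above; whether it would resolve depends on the
# metaresolvers and Software Heritage agreeing on such a prefix.
print(parse_compact_identifier("github:repo_name/commithash"))
```

Splitting only at the first colon keeps anything after the prefix intact, so the repository name and commit hash in the proposed github: form stay together as the accession.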

With these three projects working together, those who want to cite software would no longer be limited to citing software that has been published, but could easily and effectively cite any archived software.

Acknowledgements

I want to thank John Kunze and Sarala Wimalaratne for their work on compact identifiers, Martin Fenner for his work on metadata and PIDs in general, and on software specifically, Greg Janee for discussions related to PID services, Roberto Di Cosmo for his work and discussions about Software Heritage, Carl Boettiger and Matt Jones for their work and discussions about CodeMeta, Arfon Smith and Kyle Niemeyer for co-leading the FORCE11 software citation working group with me, and Martin and Neil Chue Hong for co-leading the FORCE11 software citation implementation working group with me.

Published by:

danielskatz

Assistant Director for Scientific Software and Applications at NCSA, Research Associate Professor in CS, ECE, and the iSchool at the University of Illinois Urbana-Champaign; works on systems and tools (aka cyberinfrastructure) and policy related to computational and data-enabled research, primarily in science and engineering
