Creation, publication, and citation: Issues in software citation versus paper and data citation

This post begins with an idea from a response to an NIH request for information that I wrote with Kyle E. Niemeyer and Arfon M. Smith. In it, we discussed what I then thought were the two roles needed to make software citation work: creating/publishing and citing, which I referred to as push (creating the object to be cited) and pull (actually citing it).

Upon further thought, I now believe that there are really three relevant steps, which can be seen if we look at papers as an example:

  1. The creator of a paper (i.e., an author) submits the paper to a publisher.
  2. After some number of steps, the publisher publishes the paper and assigns it an identifier, most likely a DOI.
  3. Someone who wants to refer to the paper within another work cites the metadata of the paper, likely including the identifier.

A key property is that these three activities (creation, publication, and citation) occur in order, with a break between each.  In addition, self-citation (where the creator in step 1 is the same as the citer in step 3) generally happens as a series, where the creator wants to show some connection to their own past work. This mostly works for data as well, using the data citation principles, though self-citation is different: the data cited in a paper may be published almost simultaneously with the paper (though the data publication needs to slightly precede the paper so that the identifier for the data can be generated and used in the citation in the paper).

However, it’s not clear that this fully holds for software, though the software citation principles try to force it to do so.  A common example is that a creator (or in many cases a set of creators) develops software on GitHub, where the software is never really complete, though it may be released at different stages (versions) during its development.  Today, someone else who uses that software will likely not cite it, but if they do, they will cite the repository.  Here, we are missing step 2, and because of this, step 3 cannot fully be completed, because there is no clear metadata or identifier for the software that was used.

The software citation principles recommend inserting step 2 into the process.  Specifically, a creator should publish each software release (for example, through Zenodo or figshare).  Then, users of the software can use the metadata and identifier that the publisher provides.  This is a reasonable solution in many cases, because it allows the reader of the paper to recover (access) the software that was cited.
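As a minimal sketch of what the publish step gives a citer, the following Python formats a citation from a release's metadata. The record fields, the example project, and the DOI are all hypothetical placeholders chosen for illustration; they are not the actual metadata schema of Zenodo or figshare, and the citation layout is just one plausible style.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ReleaseRecord:
    """Hypothetical metadata a publisher might hold for one software release."""
    authors: Tuple[str, ...]
    title: str
    version: str
    year: int
    publisher: str
    doi: str  # identifier assigned at publication (step 2)


def format_citation(record: ReleaseRecord) -> str:
    """Turn the published metadata into a citation string (step 3)."""
    authors = ", ".join(record.authors)
    return (
        f"{authors} ({record.year}). {record.title} "
        f"(Version {record.version}) [Software]. "
        f"{record.publisher}. https://doi.org/{record.doi}"
    )


# Placeholder values only; none of these refer to a real release.
example = ReleaseRecord(
    authors=("A. Developer", "B. Contributor"),
    title="examplelib",
    version="1.2.0",
    year=2016,
    publisher="Example Repository",
    doi="10.0000/example.12345",
)
print(format_citation(example))
```

Without the published record, the citer has no agreed source for the version, author list, or identifier, which is the gap described above.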

But in some cases, it will not work, because it adds a step to the software developer’s workflow that they may not care about enough to implement.  Even if we do get to a future time in which developers routinely publish their software releases, what happens until then, or for existing software?

The real problem is that this set of discrete coarse steps (create, publish, cite) does not match how open source is developed and used, which is more fine-grained and iterative. And, because open source development mostly occurs in the open, there is no natural need for the publish step at all, other than marketing and credit, which are not primary concerns in all projects.

(Note that commercial software does go through a stage equivalent to publication, where a version is released and is sold.)

If we go back to papers, what happens if the citer wants to refer to something that has not been published?  At some level in school, students are taught to avoid this situation, but later, they may be taught to cite the material as a personal communication.  In particular, in the APA Publication Manual, a distinction is made between recoverable and unrecoverable data.  The APA manual recommends that recoverable data (that which can be accessed by the reader via the citation information) should be cited as a formal citation, and unrecoverable data should be referred to within the text as “(author, personal communication, date)”.

But for software, this distinction between recoverable (published) and unrecoverable (not available) doesn’t work.  All versions of software on GitHub, even if never published, are recoverable by default, barring the project being deleted from GitHub (and even then, they could be recovered from a local repository).
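To illustrate why an unpublished version is still recoverable, here is a small sketch that builds a reference to one exact commit of a repository. It assumes GitHub's `/tree/<commit>` URL layout; the function name, the repository, and the commit hash are hypothetical placeholders.

```python
import re


def commit_reference(repo_url: str, commit: str) -> str:
    """Build a URL pointing at one exact commit of a GitHub repository.

    Assumes the https://github.com/<owner>/<repo>/tree/<commit> layout.
    """
    # Accept an abbreviated (7+) or full (40) hexadecimal SHA-1 commit hash.
    if not re.fullmatch(r"[0-9a-f]{7,40}", commit):
        raise ValueError("expected a hexadecimal Git commit hash")
    return f"{repo_url.rstrip('/')}/tree/{commit}"


# Placeholder repository and hash, for illustration only.
print(commit_reference(
    "https://github.com/example-org/example-project",
    "0123456789abcdef0123456789abcdef01234567",
))
```

Such a reference lets a reader retrieve the code that was used, but it carries none of the metadata (authors, version name, persistent identifier) that a publication step would supply.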

So what does this mean?  Is software fundamentally different than other products, like papers or data sets?  Can we credit software creators?

As a product, software is not technically different than papers or data sets, though the social and community methods and mechanisms around it are different.  One reason is that software is never really complete in the same way that a paper or a data set is complete.  Another is that the open source style of fine-grained, rapid, iterative development and use is very different than the coarse-grained and slow creation and use cycle of papers.  Additionally, open source tools such as Git have distributed the publication-like step for software, and have removed the single easy place to register metadata and assign identifiers that exists for other products.

Regarding credit, as we stated in the Software Citation Principles paper:

“It is not that academic software needs a separate credit system from that of academic papers, but that the need for credit for research software underscores the need to overhaul the system of credit for all research products.”

The realization that the three-step model of distinct creator, publisher, and citer doesn’t really fit modern open source practices is another argument for that overhaul.


Published by:

Daniel S. Katz

Chief Scientist at NCSA, Research Associate Professor in CS, ECE, and the iSchool at the University of Illinois Urbana-Champaign; works on systems and tools (aka cyberinfrastructure) and policy related to computational and data-enabled research, primarily in science and engineering


5 thoughts on “Creation, publication, and citation: Issues in software citation versus paper and data citation”

  1. Regarding software release, my impression is that it fails in a typical academic setting, where it is important to accumulate citations on a single output. This favours sparse releases (to accumulate citations), while it’s arguably better to release often to share and test new features. Do you have any data or experience with that?


      1. Thank you. A GROUPID would be absolutely great, in many situations beyond software releases and pre-prints and papers, as you point out. I hope to see this implemented, or at least considered ‘higher up’ and discussed more widely.

