Software metrics – what to count and how

Measuring software impact is important for a variety of reasons and to a variety of people, including software creators, their employers, their funders, potential users of the software, and academics studying software and science.  This topic has been studied by many, including some nice work by Thain, Tannenbaum, and Livny about 10 years ago.

What should we count?

In brief, to measure software impact, we need to count something.  I’ve previously suggested some things we might want to count:

  • How many downloads? (easiest to measure, least value)
  • How many contributors?
  • How many uses?
  • How many papers cite it?
  • How many papers that cite it are cited? (hardest to measure, most value)

But these mix together several types of metrics: download metrics, installation/build metrics, use/run metrics, and impact metrics.  Here, I want to give some examples of each, and talk about how the counts are actually made.

How should we count it?

Build metrics example

OpenQBMM, an open-source implementation of Quadrature-Based Moment Methods, tracks the number of successful builds of the code, obtained using a curl instruction in the build script, which mimics a click on a dedicated bitly URL. Two URLs are used, one to track builds of the development version of the code, and one for the stable version. Data corresponding to the development version are reported here because no stable version has been released at this time. The count does not exclude multiple builds from the same user or contributors who rebuild their code to include their changes. Users are informed about the tracking, and the download page of the software website provides instructions for removing the curl instruction if they do not want their builds to be counted.
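
To make this concrete, here is a minimal sketch of the idea in Python (OpenQBMM itself uses a curl call in its build script; the tracking URL and the opt-out environment variable below are placeholders I made up):

    # Minimal sketch of build-count tracking (not OpenQBMM's actual script).
    # The tracking URL and the opt-out environment variable are placeholders.
    import os
    import urllib.request

    TRACKING_URL = "https://bit.ly/example-build-counter"  # hypothetical URL

    def report_build():
        """Ping the tracking URL once per successful build, unless the user opts out."""
        if os.environ.get("BUILD_TRACKING_OPT_OUT"):
            return
        try:
            # A simple GET is enough; the URL-shortener service records the "click".
            urllib.request.urlopen(TRACKING_URL, timeout=5)
        except OSError:
            # Never let tracking failures break the build.
            pass

    if __name__ == "__main__":
        report_build()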

Use metrics examples

GridFTP – GridFTP, a popular tool for large file transfers, counts usage (transfers) as described in its documentation.  Specifically, at the end of each transfer, the servers to/from which the file is transferred send a UDP packet containing statistics about the transfer to the GridFTP developers.  This reporting is on by default, but users can opt out through an option set in the transfer command, or through an environment variable.
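
As a rough sketch of the approach (not GridFTP's actual packet format or collector; the address and fields below are assumptions), an end-of-transfer report over UDP might look something like this:

    # Illustrative sketch of end-of-transfer usage reporting over UDP
    # (not GridFTP's actual packet format; collector address and fields are made up).
    import json
    import socket
    import time

    COLLECTOR = ("usage-stats.example.org", 4810)  # hypothetical collector

    def report_transfer(bytes_moved: int, duration_s: float) -> None:
        """Send one fire-and-forget UDP packet summarizing a completed transfer."""
        payload = json.dumps({
            "timestamp": int(time.time()),
            "bytes": bytes_moved,
            "seconds": duration_s,
        }).encode()
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            try:
                sock.sendto(payload, COLLECTOR)
            except OSError:
                pass  # reporting must never interfere with the transfer itself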

BoneJ – a software bundle for bone image analysis, collects data through phone-home code, similar in idea to GridFTP.  In BoneJ, Google Analytics is used to report when a plugin’s run() method completes successfully.  Again like GridFTP, the data that is collected is clearly identified in the documentation, and users can opt out.
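
The pattern, report only on successful completion and only if the user has not opted out, might be sketched roughly as follows (BoneJ itself is an ImageJ plugin written in Java that reports via Google Analytics; the names and endpoint below are placeholders):

    # Rough sketch of "report on successful completion" phone-home logic.
    # Everything here (function names, endpoint) is a placeholder for illustration.
    import urllib.parse
    import urllib.request

    ANALYTICS_ENDPOINT = "https://analytics.example.org/collect"  # hypothetical

    def report_success(plugin_name: str, opted_in: bool) -> None:
        if not opted_in:
            return
        params = urllib.parse.urlencode({"event": "plugin_run", "plugin": plugin_name})
        try:
            urllib.request.urlopen(f"{ANALYTICS_ENDPOINT}?{params}", timeout=5)
        except OSError:
            pass

    def run_plugin(plugin, opted_in: bool):
        result = plugin.run()              # report only if run() returns without error
        report_success(plugin.name, opted_in)
        return result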

PsychoPy does similar tracking to BoneJ, again clearly identifying what is tracked and why, and I suspect a number of other packages do as well.

sempervirens – an experimental prototype for gathering anonymous, opt-in usage data for open scientific software.  This project intends to build shared infrastructure that other projects can use. There will be some manual capability to control opt-in/opt-out (e.g., a special command you can run), but the main way is that when people are using the Jupyter Notebook (and hopefully later other interface projects as well, like Spyder or RStudio), it will notice if you haven’t expressed a preference either way, and pop up a little box inviting you to opt in to data collection. The key idea here is that this would control your opt-in/opt-out on behalf of all the projects that are using the sempervirens infrastructure.
The actual information collected will be up to individual projects. Some examples would be “how many users in total”, “how many users are on which versions of the software”, “how many people use the Jupyter Notebook interface to run Python versus R versus Julia”, and “how many times did someone encounter this particular warning condition in this library.” Anything that the software can compute at all could be collected, but, of course, there will be formal guidelines about what’s acceptable (an extreme example: if software asks people for their name for some reason, that name shouldn’t be sent to the server that’s collecting anonymous metrics!).
sempervirens is planning that NumFOCUS, as a neutral non-profit-in-the-public-interest, will act as a host for the shared infrastructure and steward access to the data (though this is not yet formally agreed to).
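
The central idea, a single opt-in/opt-out preference consulted by every participating project, could be sketched like this (this is not sempervirens’ actual design or API; the preference file location and function names are my own assumptions):

    # Sketch of a shared opt-in preference consulted by many projects
    # (not sempervirens' actual design; the file path and API are hypothetical).
    import json
    from pathlib import Path

    PREF_FILE = Path.home() / ".config" / "usage-metrics" / "preference.json"

    def get_preference():
        """Return True (opted in), False (opted out), or None (not yet asked)."""
        try:
            return json.loads(PREF_FILE.read_text()).get("opt_in")
        except (FileNotFoundError, json.JSONDecodeError):
            return None

    def set_preference(opt_in: bool) -> None:
        """Record the user's choice once, on behalf of all participating projects."""
        PREF_FILE.parent.mkdir(parents=True, exist_ok=True)
        PREF_FILE.write_text(json.dumps({"opt_in": opt_in}))

    def may_collect() -> bool:
        """Projects call this before collecting; an unset preference means no collection."""
        return get_preference() is True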

duecredit – a tool aimed at providing the “publication references” for the modules/methods used in a given analysis, so they can be collated and later provided to the user in a ready-to-use form, such as free-form text or a BibTeX file (e.g., to accompany the methods section of a paper). Unlike other tools that aim to track entire software packages, duecredit aims to annotate specific functions/(sub)modules so that only the relevant references are provided, but in good detail.  Future work might include a centralized collection/reporting service or portal to aggregate and report voluntarily submitted usage statistics.
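
Conceptually, this kind of function-level annotation might look like the following decorator sketch (an illustration of the idea only, not duecredit’s actual API; the DOI and function here are made up):

    # Conceptual sketch of function-level citation annotation in the spirit of
    # duecredit (not its actual API): references for code that actually runs are
    # collected and can later be dumped, e.g., as BibTeX for a methods section.
    import functools

    _collected = set()

    def cite(doi: str, description: str):
        """Mark a function with the reference to credit when it is actually used."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                _collected.add((doi, description))  # record only what was used
                return func(*args, **kwargs)
            return wrapper
        return decorator

    @cite("10.1234/example.doi", "Hypothetical clustering method")
    def cluster(data):
        ...

    def references():
        """Return the citations accumulated during this analysis."""
        return sorted(_collected)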

The Debian popularity contest measures software installation and usage, though usage here means that a package is in use, not a count of how many times it is used.  Similar services are provided by Debian sub-projects (such as NeuroDebian) and derivatives (such as Ubuntu).  It makes regular (weekly) automated submissions of which (software or data) packages are installed and used (if the filesystem is mounted with atime).  Participation is completely voluntary, offered as an option during system installation, and the choice can easily be changed later (to opt in or opt out).  No reliable multiplier is available to estimate the overall number of users from these counts, but anecdotal evidence points to somewhere from 10x to 100x.

XALT, an open source tool that collects job-level activity on a cluster into a local database, also falls into the use-metrics category. However, it is somewhat different from the other examples listed, since it does not collect information about the usage of a particular software package in any centralized location, but rather collects data on a per-cluster basis with the aim of directly supporting the cluster owners and administrators.  Of course, it might also benefit the software developers, particularly if they focus on developing for a small number of high-performance systems, or if another tool could gather XALT data from multiple instances.

Cactus, an open source problem solving environment designed for scientists and engineers that has grown out of the numerical relativity community, counts registered users — people who are willing to say they are users of the software, with their names on the Cactus Code web page. For a small community like relativistic astrophysics, this number is small but meaningful, since there are not many numerical relativists, and it can be correlated against high-performing groups to see the overlap between Cactus users and, for example, authors of highly cited papers in the field.
Cactus also provides test accounts for new users on a supercomputer; prospective users email the developers through an interface to ask for these accounts.  These new users are usually students whom the developers might not otherwise know about.

Impact metrics example

Depsy – a system to track and display the impact of research software, currently for software written in Python or R and stored in PyPI or CRAN, respectively, will count four types of impact: downloads (similar to what any code repository can count), software reuse (one package being reused in another package), literature mentions (citations of a software package), and social mentions (tweets, etc.).

Other related work

The topic of software metrics was discussed at the WSSSPE3 meeting in September 2015; some of that discussion is in this GitHub issue.  The working group was following up on discussion that happened at the SI2 PI meeting in February 2015 (see the metrics discussion in the final report), and the start of a draft white paper from the WSSSPE3 group is available.

If you know of other work that should be added to this blog post, please let me know and I will add it.  I am not interested in building an exhaustive list of such software, but if there are other interesting methods for tracking builds, usage, or impact, I would like to add representative examples.

Acknowledgements

Much of the content of this blog is assembled from the various cited projects and activities.  While I have been trying to publicize this as a concern, it is the work done by others in actually implementing these concepts that has shown this is an issue to be discussed and worked on, and that it can be solved.  In particular, most of the text about OpenQBMM is from Alberto Passalacqua, most of the sempervirens text is from Nathaniel Smith, much of the duecredit and Debian popularity contest text is from Yaroslav O. Halchenko, and much of the text about Cactus is from Gabrielle Allen.

Some work by the author was supported by the National Science Foundation (NSF) while working at the Foundation; any opinion, finding, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the NSF.