Advancing social aspects of the research software ecosystem: Citation and metrics

While I was a program director at NSF, I led the SI2 (Software Infrastructure for Sustained Innovation) program that had been created by Manish Parashar and Abani Patra, then was led by Gabrielle Allen, before it was my turn to shape it and lead it for about four years, from 2012 to 2016. This program initially funded projects that developed and maintained software, as well as projects that planned and built community institutes around software. And while many of these projects were (and in some cases continue to be) successful, I also quickly realized that the $15-$30m annually that I could provide to these projects was just a small fraction of the overall NSF funding that was spent on creating software.

I thought that we could encourage better, more sustainable software by addressing the social aspects of software work. Specifically, if we could reward people when their software was used, they would have an incentive to make it as good and as easy to use as possible, which in turn could lead to more people reusing existing software rather than developing new, similar software.

To examine this, in 2014 I worked with Joshua Rosenbloom, who led the NSF Science of Science and Innovation Policy (SciSIP) program, to gain support and budget from our division directors and directorate associate directors to publish a Dear Colleague Letter (DCL): a reminder to the community of existing NSF funding mechanisms, along with a note that we were interested in projects using these mechanisms with a specific focus, in this case on citation, attribution, and metrics related to software and data.

As we wrote in the DCL:

How scientific research is conducted across all science disciplines is changing. One important direction of change is toward more open science, often driven by projects in which the output is purely digital, i.e., software or data. Scientists and engineers who develop software and generate data for their research spend significant time in the initial development of software or data frameworks, where they focus on the instantiation of a new idea, the widespread use of some infrastructure, or the evaluation of concepts for a new standard. Despite the growing importance of data and software products, the effort required for their production is neither recognized nor rewarded. At present there is a lack of well-developed metrics with which to assess the impact and quality of scientific software and data. Unlike generally accepted citation-based metrics for papers, software and data citations are not systematically collected or reported. NSF seeks to explore new norms and practices in the research community for software and data citation and attribution, so that data producers, software and tool developers, and data curators are credited for their contributions.

A search of the NSF awards database (for expired awards started between 1 July 2014 and 1 October 2014, with reference code 8004) shows that the DCL funded seven EAGER projects:

and one workshop project:

Overall, these projects have led to progress in citation and metrics. Each project award link above contains a project outcomes report that includes papers and other products. And perhaps more significantly, these projects have shaped the social environment in which software (and data) are produced. Specifically:

  • The Mayernik/UCAR project focused on the use of identifiers, and used GROBID to classify papers that mentioned an NCAR supercomputer, which is related to work now ongoing at U Texas to find software mentions in papers where formal citations don’t exist.
  • The Kellogg/UC Davis project worked in the context of geodynamics to understand barriers in software development, current software citation practices, and to develop solutions to increase the occurrence of software citation in the literature, including a tool that is now part of the Computational Infrastructure for Geodynamics (CIG): the attribution builder for citation (abc).
  • The Sliz/Harvard project created a similar tool: AppCiter, which is part of the SBGrid and BioGrids infrastructures, making it easier for researchers in these environments to cite the software they use in their research. [Disclosure – I recently became a member of the SBGrid advisory board.]
  • The Seltzer/Harvard project led to improvements in storing data provenance and making data citable in Dataverse, open source data repository software that is in use at 75 institutions around the world.
  • The Cruse/California Digital Library project developed into the vibrant Making Data Count community initiative, which holds that the development of open data metrics will encourage broad adoption, and that the best path towards that is global community involvement throughout each phase of work. It is working to produce evidence-based studies on researcher behavior in citing and using data, and to drive a community of practice around open, normalized data usage and data citation as initial steps towards the development of research data assessment metrics.
  • The Vardigan/Michigan & Hoyle/Kansas collaborative project worked to improve the Data Documentation Initiative (DDI) standard, an international standard for describing the data produced by surveys and other observational methods in the social, behavioral, economic, and health sciences, and produced a custom metadata model to describe citation-like supporting materials in addition to data.
  • The Berez-Kroeker/Hawaii project brought about increased attention to reproducible research in linguistics, especially with regard to the citation of research data. The team led a community effort to define principles for data citation in linguistics (Austin Principles of Data Citation in Linguistics) and recommendations for researchers and publishers (Tromsø Recommendations for the Citation of Research Data in Linguistics).
  • The Ahalt/UNC-led workshop brought together a cross-section of researchers, funders, publishers, and communities in a national, interdisciplinary discussion and exploration of new norms and practices for software and data citation and attribution. The workshop identified social and technical challenges facing current software development and data generation efforts, and explored viable methods and metrics to support software and data attribution in the scientific research community. The workshop report proposed pilot projects to implement and experiment with actionable ideas, including improving cross-referencing of citation data, increasing the use of identifiers for people and products, creating a consistent citation format for software and data, and developing guidelines for trusted software repositories, many of which have since been begun or accomplished.

This Dear Colleague Letter and the projects that it led to have been reasonably successful at raising awareness of the issues around software and data citation and in taking action to improve them, both in specific disciplinary communities and more generally. However, much work is still needed. While there is no existing focused program in this area, a more recent NSF DCL on collaboration between CISE and SBE researchers might include the opportunity for new projects in this area, and some other government agencies and private foundations have shown willingness to fund such projects. 

While most of the projects that were funded have ended, they have built community and community work in this area, including groups within the Research Data Alliance (RDA) and FORCE11. The Research Software Alliance (ReSA) has also started to examine this area under its people and policy themes. If you are interested in software citation, the area I know the best, you can join the FORCE11 Software Citation Implementation Working Group. Regarding software metrics, the CHAOSS project is a good starting point to get involved. For research software in general, you can join the ReSA mailing list. Regarding data, RDA has a number of interest and working groups you can join.
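To make this concrete, consider what software citation can look like in practice (my own illustration, not drawn from the DCL or the funded projects): one outcome of the broader community effort around software citation is machine-readable citation metadata that developers ship alongside their code, such as a CITATION.cff file in the Citation File Format, which services like GitHub and Zenodo can read. A minimal sketch, with hypothetical project, author, DOI, and dates:

```yaml
# CITATION.cff — citation metadata kept at a repository's root.
# All names, the DOI, and the dates below are hypothetical placeholders.
cff-version: 1.2.0
message: "If you use this software, please cite it using these metadata."
title: "ExampleSim"
version: "2.1.0"
doi: "10.5281/zenodo.0000000"
date-released: "2016-03-01"
authors:
  - family-names: "Researcher"
    given-names: "Robin"
    orcid: "https://orcid.org/0000-0000-0000-0000"
```

A file like this lets citation tools and repositories generate a formatted citation automatically, so the developers, rather than only a loosely related paper, get credit when the software is used.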
