While on a call today for the ACM Emerging Interest Group on Reproducibility and Replicability, I realized that for computational reproducibility to become pervasive, we need to solve a scalability problem. Today, much computational reproducibility is checked by hand, whether as part of artifact evaluation for a conference or journal paper, or through post-publication checks.
I was reminded of fault tolerance, and of how some algorithms can be checked after they run to determine whether they ran correctly, at least with respect to transient errors. The highest-cost way to do such a check is simply to run the algorithm a second time and see if the same results are returned; this is 100% overhead. In many cases, however, lower-cost checks are possible, such as result checking, derived from algorithm-based fault tolerance (ABFT). I was involved in some work in this area at JPL around the turn of the century, for example, where we checked an O(n²) matrix multiplication with an O(n) check. Another key element of that work was being able to determine whether a calculation was correct in the presence of both numeric and algorithmic noise.
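To make the result-checking idea concrete, here is a minimal sketch (illustrative only, not the JPL code) of an ABFT-style check on a matrix-vector product y = Ax: a checksum row c = 1ᵀA is computed once per matrix, after which each product can be verified in O(n) by testing whether sum(y) matches c·x, with a tolerance to absorb numeric noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)

c = A.sum(axis=0)  # checksum row 1^T A: O(n^2) once, amortized over many products
y = A @ x          # the O(n^2) computation being checked

def result_ok(y, c, x, rtol=1e-8):
    """O(n) check: sum(y) = 1^T (A x) must equal (1^T A) x = c . x,
    up to a tolerance that accounts for floating-point noise."""
    return bool(np.isclose(y.sum(), c @ x, rtol=rtol))

assert result_ok(y, c, x)

# A transient fault that corrupts one entry of y is caught by the same check.
y_faulty = y.copy()
y_faulty[7] += 1.0
assert not result_ok(y_faulty, c, x)
```

The check costs two O(n) dot products per verification instead of an O(n²) recomputation, which is the kind of low-overhead verification the analogy below relies on.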
The analogy here is that hand-checking reproducibility is like a fully redundant calculation. If we did this for all calculations, the cost in our time would be enormous. If, on the other hand, we could build low-overhead methods to reproduce our work, this would be an incentive for increased reproducibility.
This seems to me to call for automated reproducibility checking, similar to how continuous integration works for software development. This is not a new idea, nor is it my idea. The first time I heard about this was, I think, in James Howison’s “Retract bit-rotten publications: Aligning incentives for sustaining scientific software” at WSSSPE2. I think it’s worth revisiting and republicizing this idea.
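As a sketch of what such a CI-style check might look like, the snippet below (a hypothetical example; the `analysis` function and reference values are placeholders for a paper's archived pipeline and artifacts) re-runs a computation and compares its outputs against archived reference results within a tolerance, exactly as a continuous-integration job would on each change.

```python
import math

def analysis():
    """Stand-in for a paper's computational pipeline; assumed deterministic
    given its declared inputs (seeds, data versions, environment)."""
    return {"mean_effect": 0.4235, "n_samples": 128}

def check_reproduction(results, reference, rtol=1e-6):
    """Compare freshly computed results to archived reference values,
    tolerating small floating-point differences."""
    for key, ref in reference.items():
        got = results.get(key)
        if isinstance(ref, float):
            if got is None or not math.isclose(got, ref, rel_tol=rtol):
                return False
        elif got != ref:
            return False
    return True

# In practice this would be loaded from the archived artifact, not hard-coded.
reference = {"mean_effect": 0.4235, "n_samples": 128}
assert check_reproduction(analysis(), reference)
```

A CI system would run this on every commit to the paper's repository, turning reproducibility from a one-time manual audit into a continuously verified property.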
- What are the pros and cons of automated reproducibility checking versus the hand-performed reproducibility checking we do today?
- What infrastructure would we need to make this pervasive?
- How would this impact common peer-review practices?
- Would this help make progress toward the larger goal of better research, or would this instead lead to emphasis on one element that doesn’t really help overall?