Is software reproducibility possible and practical?

This blog is based on part of a talk I gave in January 2017, and the thinking behind it, in turn, is based on my view of a series of recent talks and blogs, and how they might fit together. The short summary is that general software reproducibility is hard at best, and may not be practical except in special cases.

Software reproducibility here means the ability for someone to replicate a computational experiment that was done by someone else, using the same software and data, and then to be able to change part of it (the software and/or the data) to better understand the experiment and its bounds.

I’m borrowing from Carole Goble (slide 12), who defines:

  • Repeat: the same lab runs the same experiment with the same set up
  • Replicate: an independent lab runs the same experiment with the same set up
  • Reproduce: an independent lab varies the experiment or set up
  • Reuse: an independent lab runs a different experiment

(I am aware that my choice of definitions for replicate and reproduce is very much a matter of dispute, but I nonetheless choose to use them this way. If you prefer the other choice, please feel free to switch the words in your mind as you read.)
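To make the four terms concrete, here is a minimal sketch in Python that maps a rerun onto the list above. The function name `classify` and the labels "none", "varied", and "different" are my own illustrative choices, not Goble's; the list only describes changes made by an independent lab, so other combinations return `None`.

```python
from enum import Enum
from typing import Optional


class Kind(Enum):
    REPEAT = "repeat"
    REPLICATE = "replicate"
    REPRODUCE = "reproduce"
    REUSE = "reuse"


def classify(independent_lab: bool, change: str) -> Optional[Kind]:
    """Map a rerun onto the four terms.

    change is a hypothetical label: "none" (same experiment, same set up),
    "varied" (experiment or set up varied), or "different" (a different
    experiment).
    """
    if change not in ("none", "varied", "different"):
        raise ValueError(f"unknown change label: {change!r}")
    if change == "none":
        return Kind.REPLICATE if independent_lab else Kind.REPEAT
    if not independent_lab:
        return None  # the list does not name this case
    return Kind.REPRODUCE if change == "varied" else Kind.REUSE
```

Swapping the meanings of replicate and reproduce, as some communities do, would only mean swapping two of the returned values.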

Note that an interesting alternative way of looking at this is using the concept of confirmation depth, as proposed by David Soergel, which is meant to be a measure of the reproducibility of scientific research. It’s defined as: given two experiments that provide the same result, how many steps back from that result is the first commonality of materials or methods found? Conversely, how similar is the derivation of the inputs shared by the two experiments? The shallowest form of confirmation is peer review, while deeper forms, such as using different software, different approaches, or different inputs, give more confidence in results. Though their intents are somewhat different, confirmation depth thus overlaps Goble’s definitions, with Soergel’s shallow confirmation depth outside the scope of Goble’s definitions, and Goble’s reuse outside the scope of Soergel’s deepest confirmation depth.
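One way to picture confirmation depth is as a walk backward along two chains of steps. The sketch below is my own simplification, not Soergel's formulation: it assumes each experiment can be written as a list of steps ordered from the result backward, that the two lists line up stage by stage, and that two steps are "common" when their labels are equal. All step labels are hypothetical.

```python
from typing import Optional, Sequence


def confirmation_depth(chain_a: Sequence[str], chain_b: Sequence[str]) -> Optional[int]:
    """Count steps back from the result until both chains share a step.

    Returns None if no shared step is found in the stages both chains have.
    """
    for depth, (step_a, step_b) in enumerate(zip(chain_a, chain_b)):
        if step_a == step_b:
            return depth
    return None


# Two experiments that agree on a result but share only the input data.
experiment_a = ["analysis script v1", "simulation code X", "input dataset D"]
experiment_b = ["analysis script v2", "simulation code Y", "input dataset D"]
depth = confirmation_depth(experiment_a, experiment_b)
```

Here the first commonality sits two steps back from the result, so `depth` is 2. Two runs that share the same analysis script at the first step would give 0, the shallowest case this sketch can express.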

Much like Mark Twain’s definition of classic books as those that people praise but don’t read (see Following the Equator, Chapter 25), reproducibility seems to be a goal mostly discussed in the abstract but not actually practiced. There are notable exceptions, such as Claerbout, Donoho, etc., as discussed by Lorena Barba, who is also one of a small number of researchers who are very seriously attempting to make their work reproducible. As Barba mentions, our culture and our institutions do not reward reproducibility; we generally don’t have incentives or practices that translate the high-level concept of reproducibility into actions that support actual reproducibility.

Reproducibility can be hard due to a unique situation. For example, data can be taken with a unique instrument or can be transient, meaning that the data cannot be recollected, so that the starting point for reproducibility might have to assume the same data. Or perhaps a unique computer system was used, so that the calculation itself cannot be repeated. What’s more, given limited resources, reproducibility is considered less important than new research. For example, a computer run that took months is unlikely to be repeated, because generating a new result is seen as a better use of the computing resources than reproducing the old result. In the days when Moore’s Law applied to computer speeds, waiting a few years would allow these heroic calculations to be reproduced, though they rarely were.

But time is an important factor in software reproducibility. Konrad Hinsen has coined the term software collapse for the fact that software eventually stops working if it is not actively maintained. He says that software stacks used in computational science have a nearly universal multi-layer structure:

  • Project-specific software: whatever it takes to do a computation using software building blocks from the lower three levels: scripts, workflows, computational notebooks, small special-purpose libraries and utilities
  • Discipline-specific research software: tools and libraries that implement models and methods which are developed and used by research communities
  • Scientific infrastructure: libraries and utilities used for research in many different disciplines, such as LAPACK, NumPy, or Gnuplot
  • Non-scientific infrastructure: operating systems, compilers, and support code for I/O, user interfaces, etc.

where software in each layer builds on and depends on software in all layers below it, and any changes in any lower layer can cause it to collapse.
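The dependency direction in this stack can be sketched in a few lines of Python. The layer names come from the list above; the function name `at_risk` and the list-based representation are my own illustrative assumptions.

```python
from typing import List

# Layers ordered from the bottom of the stack to the top.
LAYERS = [
    "non-scientific infrastructure",
    "scientific infrastructure",
    "discipline-specific research software",
    "project-specific software",
]


def at_risk(changed_layer: str) -> List[str]:
    """Return the layers above changed_layer, which may collapse when it changes."""
    index = LAYERS.index(changed_layer)  # raises ValueError for unknown names
    return LAYERS[index + 1:]


exposed = at_risk("scientific infrastructure")
```

A change in the scientific infrastructure layer puts both layers above it at risk, while a change in project-specific software puts nothing above it at risk, which is why fixing only the top layer does not protect it from what happens below.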

Hinsen goes on to say that just addressing project-specific software (the top layer) isn’t enough to solve software collapse; the lower layers are still likely to change. And the options he suggests (and I’ve named) are similar to those available to house owners facing the risk of earthquakes:

  • Teardown – treat your home as having minimal value, subject to collapse at any time, and in case of collapse, start from scratch
  • Repair – whenever shaking foundations cause damage, do repair work before more serious collapse happens
  • Flexible – make your house or software robust against perturbations from below
  • Bedrock – choose stable foundations

Hinsen suggests that many researchers building new code for what they think is a single use choose teardown, while most active projects that are building code intended to last choose repair. While engineers know how to build flexible buildings to survive a given level of shaking, we don’t know how to do this in software (though people like Jessica Kerr and Rich Hickey are talking about potential practical and social solutions, and as suggested by Greg Wilson, perhaps new computer science research could more fundamentally address it). The bedrock choice is possible, as demonstrated by the military and NASA, but it also dramatically limits innovation, so it’s probably not appropriate for research projects.

Much of the immediate inspiration for this blog post and Hinsen’s is a blog from C. Titus Brown called “Archivability is a disaster in the software world.” Titus talked about why using containers or VMs isn’t satisfactory: they themselves are not robust, and they provide bitwise reproducibility, but aren’t scientifically useful, because as black boxes, you can’t really remix the contents. Titus suggested we either run everything all the time, or accept that exact repeatability has a half-life.

Running everything all the time is related to Hinsen’s repair option; it just guarantees you know when it’s time to make repairs. This could be done through continuous analysis (as proposed many times, for example by Howison at WSSSPE2 and in a preprint by Beaulieu-Jones and Greene), similar to continuous integration.
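As a rough sketch of what such a check might look like, the Python below reruns an analysis and compares a hash of its output with a digest recorded when the result was first produced. This is not the method from the cited work; the names `digest`, `still_reproduces`, and `toy_analysis` are hypothetical, and a bitwise comparison is an assumption that is often too strict for floating-point results, where a tolerance-based comparison may be needed instead.

```python
import hashlib
from typing import Callable


def digest(result: bytes) -> str:
    """Return a SHA-256 hex digest of an analysis output."""
    return hashlib.sha256(result).hexdigest()


def still_reproduces(run_analysis: Callable[[], bytes], recorded_digest: str) -> bool:
    """Rerun the analysis and compare its output with the recorded digest."""
    return digest(run_analysis()) == recorded_digest


def toy_analysis() -> bytes:
    """Stand-in for a real workflow; returns its result as bytes."""
    values = [1, 2, 3, 4]
    return f"mean={sum(values) / len(values)}\n".encode()


recorded = digest(toy_analysis())               # stored alongside the paper
ok_today = still_reproduces(toy_analysis, recorded)
```

Run on a schedule, a failing check is the signal that a lower layer has shifted and that it is time to make repairs.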

Accepting that exact repeatability has a half-life has a nice architectural parallel: we don’t build houses to last forever, and that seems fine. But if we do accept this, perhaps we should consider costs and benefits (as is done for houses, where we don’t worry about earthquakes for sheds, but we do for hospitals and highways).

For example, if software-enabled results could be made reproducible at no cost, we would do so. If the cost was 10x the cost of the original result, we probably wouldn’t. If the cost was equal to the cost of the original result, we probably still wouldn’t, though maybe in some cases, for particularly important results, we would. What about if the cost of making the work reproducible was an extra 10%? Or an extra 20%?

We need to balance both the cost of reproducibility vs. lost opportunity of new research, and the cost of not having reproducibility vs. the lost opportunity of future reuse. This could be a specific question about any one experiment/result, or it could be a general question about the culture of science. Similarly, if it’s not practical to make everything reproducible, we could also use a cost/benefit ratio to determine what to do in any particular case.
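The trade-off can be written down as a toy decision rule. This is only a sketch under assumptions of my own: that the extra cost is a fraction of the original cost, and that the benefit can be summarized as a probability of future reuse times the value of that reuse. None of the numbers below are measured; they are placeholders to show the shape of the comparison.

```python
def worth_making_reproducible(
    original_cost: float,
    overhead_fraction: float,
    reuse_probability: float,
    reuse_value: float,
) -> bool:
    """Compare the extra cost of reproducibility with the expected benefit of reuse.

    All four inputs are hypothetical quantities in the same cost units.
    """
    extra_cost = original_cost * overhead_fraction
    expected_benefit = reuse_probability * reuse_value
    return expected_benefit > extra_cost


# Placeholder numbers: an original result costing 100 units, with a 20%
# chance that someone later gets 100 units of value from reusing it.
at_ten_percent = worth_making_reproducible(100.0, 0.10, 0.2, 100.0)
at_ten_times = worth_making_reproducible(100.0, 10.0, 0.2, 100.0)
```

With these placeholders, a 10% overhead passes the test and a 10x overhead does not; the hard part in practice is estimating the reuse probability and value, not the arithmetic.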

Going back to the original question that started this blog, “is software reproducibility possible and practical?” I think the answer is yes, over a short period, but more generally, the answer is no. As Barba and Brown and others are doing, we can provide reproducibility over a short period, though our incentives don’t align with this and it’s fairly difficult, so most researchers don’t. In the longer term, our systems do not really support reproducibility, mostly due to software collapse issues as defined by Hinsen, and the fact that containers can support repeatability and replicability, but not reproducibility or reusability. Additionally, the costs associated with full long-term reproducibility are not considered worthwhile for all software-based research. A method to get to reproducibility and reusability while not dramatically reducing our ability to innovate is potentially to overcome software collapse by using flexible or fuzzy APIs between underlying layers, though much computer science research is needed to enable this. This could also lower some of the costs, increasing the amount of work that could be made reproducible.


Thanks to Kyle E. Niemeyer, Sandra Gesing, Matt Turk, and Kyle Chard for their comments on an earlier draft of this blog, and of course, all of the people who I’ve named and whose work I’ve quoted, paraphrased, or linked to; any incorrect interpretations of their work are mine.

Published by:

danielskatz

Assistant Director for Scientific Software and Applications at NCSA, Research Associate Professor in CS, ECE, and the iSchool at the University of Illinois Urbana-Champaign; works on systems and tools (aka cyberinfrastructure) and policy related to computational and data-enabled research, primarily in science and engineering

7 thoughts on “Is software reproducibility possible and practical?”

  1. I like the term “software collapse”, because that really is what it is. Having built — and maintained — software we distribute openly for the better part of 20 years now, I see this nearly every day: a very sizable fraction of patches we merge deal with (i) configuration issues trying to keep up with OS development, tool chains, and dependency hell, (ii) updating the code base to how newer and different compilers interpret computer languages. We really try very very hard to be compatible and portable, but the reality is that I have no confidence at all that 5 years from now, when Trilinos/PETSc/SLEPc/TBB/UMFPACK and compilers will have gone through several more revisions, today’s code base will come even close to compiling successfully. It’s hard for me to admit this, because we really spend so much time on it, but I do think that that is the reality for all large-scale projects that work at a sufficiently high level.

    What this means in practice is that you can’t expect to really archive anything for posterity that is more than just a few lines of an interpreted language. I think it’s realistic to save a 500 line Matlab script — if the Matlab interpreter changes, it can’t be *that* much work to update 500 lines. But it doesn’t make all that much sense to archive the nearly 1M lines of C++ in deal.II along with a paper we publish. I wished it did, but I don’t think it really does: in a few years, you’d have to adjust so much code to get it to run that only an expert could make that happen. That doesn’t really help others reproduce our results.

  2. This is a very important topic, but the reference given for VMs and containers not being robust, https://thehftguy.com/2016/11/01/docker-in-production-an-history-of-failure/, is rather misleading. It’s nothing more than a rant of a sys admin struggling to orchestrate a complex set of services using evolving Docker software. It tells us very little about the likelihood of running a simple scientific Docker container image 5-10 years from now. It’s just not very relevant to the way containers are used in science. Furthermore, the way you added the reference suggests that VMs and Docker containers have the same problems, but the referenced post does not talk about VMs. There are many success stories of researchers running VM images that are many years old with no problems.

    Container images and VMs are not perfect (they can make remixing harder for example), but are a relatively cheap way to greatly increase chances that you would be able to rerun your code in the future. For that reason, I do recommend that most scientists should use them.

  3. Hi Dan,

    Great summary; in general I agree. I use the term “bitrot” rather than software collapse but it’s the same thing 🙂

    Is there a way to flip ongoing maintenance of reproducibility from a deadweight cost to an activity that generates resources?

    For example, can we highlight that reproducible workflows could make wonderful regression tests for the software components in them? Could reproducible workflows then be seen as something worth maintaining by the projects producing the components?

    Similarly reproducible workflows could communicate how software components are being used in the field, allowing projects to save money on user requirements efforts. Similarly they could be used to identify components that are substitutes for each other, thus decreasing the costs of funding multiple partially overlapping projects.

    Might showing an ability to reproduce another’s work act as a signal of competence in grant applications?

    No silver bullet, of course, but can we specify the value of reproducible workflows and a mechanism to capture that value and apply it to maintenance? Sometimes, though, the value will be insufficient even if it can be captured, and sometimes large value simply can’t be efficiently captured. That is an extension of your ideas above: the value of reproducibility is often unclear and the activity of maintenance doesn’t capture its own value.

    1. Thanks James.

      Re terms: I’ve used bitrot and software rot in the past, but I think software collapse is much more descriptive and less likely to be confused with other issues, such as when bitrot is used to mean bits of data on physical media becoming corrupted.

      Thanks for your other suggestions as well – these are good ideas to explore.

  4. Again regarding containers/VMs: there is a distinction to be drawn between their role in a) distributing software in an immediately executable form (which can at least ensure that your software is reusable), and b) effectively documenting the author’s suggested environment for building _and_ running the code (which can additionally aid with repurposing the software). For containers this is roughly analogous to distributing the container itself (with instructions on how to invoke it) vs distributing the (automated!) scripts used to build it (e.g. the Dockerfile, Vagrantfile, playbook etc). This is, admittedly, a simplified view, and there’s definitely room for existing tools to improve, but I do think that the situation has improved greatly since I first started writing research software.

  5. Thanks for this – I’m particularly pleased to see you’ve included ideas from a broad spectrum of folk familiar to the WSSSPE community, and also for the reference to Soergel’s ‘Confirmation depth’.

    Leaving aside the main question (of practicality), I think it is also worth highlighting a couple of angles (apologies if these seem obvious):

    1. Reproduction is not verification. [1] Prior to the 20th Century, peers would attempt to reproduce others’ findings with the materials they had to hand. Even if they were following protocols as reported, the conditions were different (think biological replicates). If these attempts were successful, the findings would gain traction, but such ‘me too’ reports didn’t verify any theories postulated by the original authors. Societal patterns in science have changed since then, but this original principle remains valid.

    In the long view, the degree of ‘Confirmation depth’ required for any particular postulate to become accepted as true is far higher than simply allowing others to see the same results from the same experimental setup. That’s basically the same as inviting one of your peers to your lab to use your equipment and reagents, as well as following your instructions (or even watching you do the experiment for them). However, the magnitude of the verification process depends on the complexity of the model that describes the postulate (gravitational lensing and the speed of light experiments come to mind here).

    2. Reproducing the output of a computational experiment tool chain does not verify correctness of the chain. There are a couple of implications:
    i. if there are errors in the chain – fixing the errors (usually) means a different set of results. That’s why we need version numbers.
    ii. For ‘strict reproducibility’, defects must be preserved at all levels for the whole stack to be reproducible.

    Taking point 1 into consideration again, what I mean by #2 is that ensuring software reproducibility has no bearing on the validity of its results. The slightly-less gloomy interpretation, however, is that if you can explain observed differences between two versions of a computational experiment’s tool chain, that could contribute a degree of confirmation depth – particularly if your postulate is coupled directly to the differences in the chain. The downside of the less-gloomy interpretation is that we might therefore have to maintain all versions of our software, even the broken ones, which is probably not tractable.

    Returning to the original question – on the tractability of reproducible science, I found that Hinsen’s observations on the ‘time/volume’ tradeoff in molecular dynamics simulations have important parallels in other areas of computational science – and not just those that employ integrators. For the most part, computational experiments in any field which involve the generation of large volumes of intermediate results are probably not worth reproducing, providing the salient details can be efficiently archived. If there are no accepted tools for archiving in a particular field, however, then that field has a problem for which data storage facilities can only provide a short term solution.

    [1]. Thanks to the folks from Plan9 who attended the GSoC mentors summit back in 2010 for the observation that reproducibility has nothing to do with validity in computational science!
