The Need for Notebooks

In response to Doug Thain’s question:

What currently-available tools do you recommend for enabling reproducible scientific computing?  Is there a tool that we ought to have, but do not?

P.S. I am using “reproducibility” as an easy shorthand for re-usability, re-creation, verification, and related tasks that have already seen some discussion.  Please interpret the question broadly.

I think the most important thing we need is an electronic lab notebook that really allows us to go back and understand exactly what we did, repeat it, modify it, record it, etc.  If you accept this, it leads to a number of points:

  1. Why don’t more people (including me) do this now?
  2. What tool(s) should we use?
  3. How should this integrate into the publication process (for papers, software, data, etc.)?

It’s a bit hypocritical for me to write this post, since I’ve never really been a good notebook user or note taker, whether in classes, meetings, or labs.  And I think this is true of a lot of us.  Part of it is probably a matter of habit: it’s hard to break a bad habit or start a new one.  And part of it is the lack of good tools with good interfaces.

Tools I’ve seen that I’ve liked include VisTrails for data exploration and visualization, and Project Jupyter for a lot of other things.  And software version control (e.g., Git) is also part of the answer, most likely.  While these independent tools have their strengths and weaknesses, I don’t think they really fit the bill.

I would really like something that’s much more integrated into my computer, probably at a runtime or OS layer, that helps me understand all of my work, not just my visualization or coding.

If we did have something that really was more an automated work-tracking notebook, we could use it to help us with publications as well.  For example, in my idea of transitive credit, if we are going to decide what products contributed to a new product, a starting point is the list of products we used during the creation of the new product, and such a notebook could generate that list.  Or we could track our reading list in the period leading towards a new paper as a starting point for the papers we should reference.

And, this potentially leads us beyond the PDF on the path towards executable papers, which would be a key step towards reproducible science in the large.

(note: this is a crossposted blog from http://reproduciblescience.blogspot.com/2015/07/the-need-for-notebooks.html)

Disclaimer: Some work by the author was supported by the National Science Foundation (NSF) while working at the Foundation; any opinion, finding, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the NSF.

Published by:

Daniel S. Katz

Chief Scientist at NCSA, Research Associate Professor in CS, ECE, and the iSchool at the University of Illinois Urbana-Champaign; works on systems and tools (aka cyberinfrastructure) and policy related to computational and data-enabled research, primarily in science and engineering

5 thoughts on “The Need for Notebooks”

  1. I just reviewed a great paper by Stephen Piccolo et al. that talks about this area and will hopefully come out soon. The two dominant notebook systems at this point are Jupyter and knitr, and they fit most people’s workflows very well, as far as I can tell. The one practical obstacle to their use that I see is in long-running tasks where workflow systems (make, makeflow, etc.) can be important for orchestrating the graph of commands.

    Tools and integration (#2 and #3) are being explored and we already have some pretty good answers, AFAICT. As for why more people don’t use them, well, it’s mostly a matter of training and incentives; Software and Data Carpentry can help with the former, while funding agencies, publishers, and reviewers can help with the latter.

  2. Hi Dan,
    Great post; I agree entirely with your points that while tools like GitHub and Jupyter may be part of the solution, they focus (both in design and common use) only on specific parts of research and do not usually capture the bigger picture of ideas, or all the rest of the process that might be captured in an electronic laboratory notebook. I have kept an open lab notebook online of all my research for the past 5.5 years (http://www.carlboettiger.info/2012/09/28/Welcome-to-my-lab-notebook.html), which has moved across 4 different platforms (openwetware wiki -> wordpress blog -> jekyll -> knitr/pandoc/jekyll/docker/circle-ci mashup) so far. Of course many other open and enterprise versions of electronic lab notebooks also exist. From my experience, I believe that it is not a specific technology that we need (surely these will continue to change and evolve), but that the general concept of a web-native electronic notebook makes it relatively easy to adapt and incorporate specific technologies. For instance, I’ve experimented with the reading list idea you mentioned in several ways, from citing what I’m reading in notebook entries (tagged by project) to rss-feeds of reading lists by topic (e.g. via mendeley groups).

    My notebooks from more recent years have been less detailed or regular than they were in my graduate-school days, despite (or because of?) being more tightly integrated with computational reproducible technology. (Spending more time on administrative tasks hasn’t helped.) While I agree that the challenge is partly a lack of tools and partly a lack of habit, I think the more salient barriers are (1) the disconnect between the rather excellent tools we have vs the far more limited scope provided by the tools of scholarly publishing, and (2) not so much the lack of habits in individuals in this regard as the lack of habit as a community in sharing and discussing these details.
