Expressing workflows as code vs data

I was part of a panel in the WORKS workshop at the SC17 conference a couple of months ago, where I gave a short talk called “Expressing and Sharing Workflows” (slides). I’ve written this blog post to put the slides into words and add some details I left out.

First, what is a workflow? To me, it’s a set of tasks and the dependencies between them. Workflows can be represented as directed graphs, where nodes are tasks and edges are dependencies. In some cases, these graphs are acyclic, i.e., DAGs. But computer programs can also be thought of as graphs at many different levels, such as components, functions, or instructions.
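As a minimal sketch of this graph view, the snippet below stores a small workflow as a mapping from each task to the tasks it depends on, and derives a valid execution order with Python's standard `graphlib` module (Python 3.9+). The task names are hypothetical and chosen only for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each key is a task (a node), and each value is
# the set of tasks it depends on (the incoming edges).
dependencies = {
    "fetch": set(),
    "clean": {"fetch"},
    "analyze": {"clean"},
    "plot": {"analyze"},
    "report": {"analyze", "plot"},
}

# static_order() yields every task after all of its dependencies, and
# raises graphlib.CycleError if the graph is not a DAG.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Any order in which every task comes after its dependencies is a legal way to run the workflow; a real workflow system would also run independent tasks in parallel.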

So how is a workflow different from a computer program? I think the difference is that the tasks in a workflow are well specified, with clear inputs and outputs, and that they are often longer-running, with run times on the order of seconds to hours.

In some sense, though, a program is a workflow. So why should we express a workflow differently than we do a program? I want to suggest that a program or script is a natural and useful way to express a workflow. For example, most people’s first workflow is a shell script that’s used to automate and repeat a set of tasks. But in a shell script, the workflow logic can be unclear, so there are tools like YesWorkflow that can be used to annotate the script to expose the workflow logic. Beyond shell scripts, some people use scripting or programming languages such as Swift or Python (through Parsl) to express workflows. Both Swift and Parsl expose the workflow logic in a program by using “app” definitions to define tasks, but Swift does so as a workflow language, while Parsl is a library-based approach that uses standard Python.

Bertram Ludäscher shared some information about YesWorkflow with me, which I’m interpreting and sharing in this paragraph. The name comes from the idea that Yes, scripts are (can be) Workflows too. But the workflow is usually hidden in the script. The idea of YesWorkflow is to let the script author reveal the structure by declaring tasks (steps) and the dataflow between tasks. This revealing is a modeling step, and can be done at different levels. At a very coarse level, the model of the workflow can be one big black box with inputs and outputs. At a fine level, the workflow can be many steps, linked by dataflow. The model explains (perhaps graphically) the conceptual workflow that the script author has in mind, such as the relevant steps and relevant data, and this model can then be shared with the script. The conceptual YesWorkflow model can also be queried, and it can be linked with runtime observables and provenance.
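To make this concrete, here is a small sketch of what an annotated script might look like, assuming YesWorkflow's comment tags `@begin`, `@end`, `@in`, and `@out`. The script itself, its step names, and its data are hypothetical; only the comments carry the workflow model, and the Python code runs unchanged.

```python
# @begin summarize_temperatures
# @in readings
# @out mean_temperature

readings = [20.5, None, 22.0, 21.5]  # hypothetical input data

# @begin drop_missing
# @in readings
# @out valid_readings
valid_readings = [r for r in readings if r is not None]
# @end drop_missing

# @begin compute_mean
# @in valid_readings
# @out mean_temperature
mean_temperature = sum(valid_readings) / len(valid_readings)
# @end compute_mean

# @end summarize_temperatures
```

Here the outer block is the coarse, black-box view of the workflow, and the two nested blocks are the finer-grained steps, linked by the `valid_readings` data that one produces and the other consumes.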

Let me also explain Parsl in a little more detail, as an example of a scripting method for expressing workflows. It’s a Python-based parallel scripting library, based on ideas in Swift. In Parsl, tasks are exposed as functions, and they can either be Python functions or wrappers around external functions, such as shell scripts or executables. For example:

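# data_flow_kernel is assumed to have been created earlier in the script.

# A bash app: the returned string is the command line to run.
@App('bash', data_flow_kernel)
def echo(message, outputs=[]):
    return 'echo {0} &> {outputs[0]}'

# A Python app: the function body itself is the task.
@App('python', data_flow_kernel)
def cat(inputs=[]):
    with open(inputs[0]) as f:
        return f.readlines()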

The return values from these tasks are futures, proxies that are immediately available even if the function is still computing and the real value is not yet available. This lets us then call other tasks that depend on these futures, though those tasks will not actually execute until the futures they need are satisfied or filled. In Parsl, the main Python code is used to glue the tasks together, for example:

hello = echo("Hello World!", outputs=['hello1.txt'])
message = cat(inputs=[hello.outputs[0]])

The idea behind Parsl is that this is easy to use, as it’s a library in standard Python, and it’s easy to understand, as the tasks and their inputs and outputs are clearly exposed.
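For readers who haven't used futures before, the following sketch shows the same idea using only Python's standard `concurrent.futures` module. This is not Parsl: the function names are hypothetical, and here the results are unwrapped explicitly with `result()`, whereas in the Parsl example above the future's output is passed straight to the next task.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_square(x):
    """Stand-in for a long-running task."""
    time.sleep(0.1)
    return x * x


def add(a, b):
    """A task that depends on the results of two earlier tasks."""
    return a + b


with ThreadPoolExecutor(max_workers=2) as pool:
    # submit() returns a future immediately, while the work continues
    # in the background.
    first = pool.submit(slow_square, 3)
    second = pool.submit(slow_square, 4)

    # result() blocks until each future has been filled.
    total = add(first.result(), second.result())

print(total)  # 25
```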

On the other hand, a workflow expressed as data (a graph or a DAG) can be thought of as closer to the compiled (assembly) version of a program. This can be useful for lots of things, such as performance optimization and detailed modeling, but it is not as useful for understanding and sharing the workflow as the higher-level program.

To address three possible arguments:

  1. I know that you can create a graphic image of the DAG, and while this is useful for understanding the workflow, I believe that a graphic created from the workflow script/code itself is even more useful.
  2. I know that assembly code is really still code, not data, but information is lost in going from high-level code to assembly code. All code can be thought of as data (a set of bits), but what makes code not just data is that those bits can be interpreted as having content, and the amount of content varies depending on how the code is expressed. I view the workflow DAG as much closer to a structured data set than to more general, more understandable code. Of course, what the workflow language allows to be expressed matters too. Workflows that are static DAGs with simple scatter/gather patterns (as I believe is the case for Pegasus, Galaxy, and CWL) are written in the most limited language and are most like data. Adding conditionals (as in WDL) makes the workflow more like code. Adding loops (as in Askalon) makes it even more like code. And making it fully dynamic (like Swift and Parsl) moves to using a complete language, and makes it fully like code.
  3. I know you can program graphically, such as with Scratch, and this is, in some sense, a program as a graph, but I also think that this programming is typically higher-level than what is typical of a workflow stored as a DAG.
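The data-to-code spectrum in the second point can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the dictionary format, the `OPS` table, and both runner functions are hypothetical and don't correspond to any of the systems named above. The first workflow is a static description that a generic engine interprets; the second needs a loop whose length is only known at run time, which a fixed description like the first can't express.

```python
# Workflow as data: a static description interpreted by a generic engine.
# String arguments name an input or the result of an earlier task.
workflow_as_data = {
    "double": {"op": "mul", "args": ["x", 2]},
    "shift": {"op": "add", "args": ["double", 1]},
}

OPS = {"mul": lambda a, b: a * b, "add": lambda a, b: a + b}


def run_static(workflow, inputs):
    """Run tasks in the listed order, which must respect dependencies."""
    values = dict(inputs)
    for name, task in workflow.items():
        args = [values[a] if isinstance(a, str) else a for a in task["args"]]
        values[name] = OPS[task["op"]](*args)
    return values


# Workflow as code: the number of steps depends on intermediate results.
def run_dynamic(x, limit):
    """Repeat the double-and-shift step until the value reaches the limit."""
    steps = 0
    while x < limit:
        x = x * 2 + 1
        steps += 1
    return x, steps


print(run_static(workflow_as_data, {"x": 5})["shift"])  # 11
print(run_dynamic(5, 100))  # (191, 5)
```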

So, if we can think of workflows as either software or data, how should we share workflows? We share general software either as libraries, providing units of execution with well-defined APIs, or as source code, most often via source code repositories such as GitHub, or packaging systems/repositories such as PyPI and CRAN. On the other hand, we generally share data through data repositories, such as Dryad.

For workflows, we could follow the library model, and share sub-workflows, defined to provide well-specified functionality. Or we could follow a source code model, and share them as source code (scripts), which might be hard to understand, unless we use a tool like YesWorkflow or a workflow language/library like Swift or Parsl. Or we could follow the data model, and share them using a data repository, for example a data repository for workflows such as myExperiment.

Carole Goble and Dave DeRoure shared some information about myExperiment with me, which I’m interpreting and sharing in this paragraph. It started a little over 10 years ago, in November 2007, as a workflow commons for workflow sharing, designed using Web 2.0 principles, and it’s still actively being used. It’s the largest public collection of workflows, with workflows for multiple workflow systems. As such, a lot of researchers have used it as a set of workflows for their own computer science research on scientific workflows and e-Science, as well as a source of individual workflows for particular science problems; there are more than 2400 entries in Google Scholar that refer to myExperiment. myExperiment is open source, provides a REST API, and is part of the Open Linked Data cloud, which contains 66k triples. It also introduced “packs,” which led to Research Objects. The myExperiment service is maintained by Manchester and Oxford universities, and it informs the design of other workflow sharing systems. As of November 2017, it had 10591 members, 393 groups, 3876 workflows, 1233 files, and 477 packs.

Today, GitHub is widely used for sharing software, and for socially working on and with software (and many other types of documents). GitHub is also used today for sharing workflows, both as scripts and as data. That raises the question of how a shared workflow should be licensed.

  • Borrowing from “Software vs. data in the context of citation”:
  • A workflow as a program or a script is code, a creative work
      • Appropriate license: OSI-approved open source (e.g., BSD)
  • A workflow as a DAG is data?
      • Appropriate license: Creative Commons (e.g., CC-BY)?

So, let’s keep workflows as programs/scripts. We can use tools like YesWorkflow with scripts to expose the workflow concepts in them, and for scripts written in Parsl and similar tools, the workflow concepts should already be clear. And we can then simply use GitHub (or whatever is the social code development and sharing site of the day) to share them.

Acknowledgements: I want to thank Kyle Chard and Yadu Nand Babuji for their suggestions on how to briefly explain Parsl, Bertram Ludäscher for his information about YesWorkflow, and Carole Goble and Dave DeRoure for their information about myExperiment. Of course, any errors in the descriptions of these systems should be blamed on my misinterpretation of the information. I also want to thank Doug Thain, who was in the audience at the original talk and asked a number of great questions.

Published by:

danielskatz

Assistant Director for Scientific Software and Applications at NCSA, Research Associate Professor in CS, ECE, and the iSchool at the University of Illinois Urbana-Champaign; works on systems and tools (aka cyberinfrastructure) and policy related to computational and data-enabled research, primarily in science and engineering
