Using workflows expressed as code and workflows expressed as data together

I’ve previously written about the concept of workflows, sets of independent tasks connected by data dependencies, being expressed either as data, for example, a Pegasus DAG or CWL document, or as code, for example, a Python program written in Parsl.
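As a rough illustration of the distinction, here is a minimal sketch in plain Python. It deliberately uses neither Parsl nor CWL; the task names (`fetch`, `clean`, `summarize`), the dictionary layout, and the `run_workflow` helper are all assumptions made up for this example, not any system's actual format.

```python
# Hypothetical tasks; the names and bodies are illustrative only.
def fetch():
    return [3, 1, 2]

def clean(values):
    return sorted(values)

def summarize(values):
    return {"count": len(values), "max": max(values)}

# Workflow expressed as code: the dependencies are implicit in how
# results are passed from one call to the next.
def workflow_as_code():
    raw = fetch()
    cleaned = clean(raw)
    return summarize(cleaned)

# Workflow expressed as data: the same tasks and dependencies written
# as a plain structure that can be inspected without running anything.
WORKFLOW_AS_DATA = {
    "fetch": {"run": "fetch", "needs": []},
    "clean": {"run": "clean", "needs": ["fetch"]},
    "summarize": {"run": "summarize", "needs": ["clean"]},
}

# Lookup table from the names used in the data to the functions above.
TASKS = {"fetch": fetch, "clean": clean, "summarize": summarize}

def run_workflow(spec, tasks):
    """Run each step once all of the steps it needs have produced results."""
    results = {}
    pending = dict(spec)
    while pending:
        ready = [name for name, step in pending.items()
                 if all(dep in results for dep in step["needs"])]
        if not ready:
            raise ValueError("cycle or missing dependency in workflow")
        for name in ready:
            step = pending.pop(name)
            args = [results[dep] for dep in step["needs"]]
            results[name] = tasks[step["run"]](*args)
    return results
```

Calling `workflow_as_code()` and `run_workflow(WORKFLOW_AS_DATA, TASKS)["summarize"]` should give the same summary; the difference is only in where the dependency graph lives.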

A workflow expressed as code clearly has advantages: it may be easier to understand and to change, both for the original author and for others who might reuse it. A workflow expressed as data may also have advantages: it might be easier to re-execute or to use as a building block, though I’m not confident that either of these is correct. Overall, we can view the workflow expressed as code as more transparent, and the workflow expressed as data as more opaque.

I’ve come to realize that these do not have to be mutually exclusive alternatives. It might make sense to think of a workflow first as code. At this level, you develop the workflow, experiment with it, and modify it until it reaches a stable point where it is useful to others who want to perform the same processing in bulk (large numbers of users, large numbers of usages, or both). At this point, you might convert the workflow into a data-based representation and share it.

But don’t throw away the code-based representation. Either store it with the data-based representation, or publish/archive it, and have the data point to it as the source.
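One way this could look, as a minimal sketch rather than a description of how any real workflow system does it: the code-based workflow is run once under a small recorder, which then exports the observed task graph as JSON with a `source` field pointing back to the code. The `Handle` and `Recorder` classes, the JSON layout, and the `workflow.py` path are all hypothetical.

```python
import json

class Handle:
    """A task result tagged with the name of the step that produced it."""
    def __init__(self, step, value):
        self.step = step
        self.value = value

class Recorder:
    """Records each task call and which earlier steps its inputs came from."""
    def __init__(self):
        self.steps = []

    def task(self, func):
        def wrapper(*args):
            # Inputs that are Handles reveal the data dependencies.
            needs = [a.step for a in args if isinstance(a, Handle)]
            values = [a.value if isinstance(a, Handle) else a for a in args]
            name = f"{func.__name__}_{len(self.steps)}"
            self.steps.append({"name": name, "run": func.__name__, "needs": needs})
            return Handle(name, func(*values))
        return wrapper

    def export(self, source):
        # The data-based representation keeps a pointer to the code it came from.
        return json.dumps({"source": source, "steps": self.steps}, indent=2)

rec = Recorder()

@rec.task
def load():
    return [1, 2, 3]

@rec.task
def total(values):
    return sum(values)

# The code-based workflow: ordinary Python calls.
result = total(load())

# The data-based workflow, with a (hypothetical) pointer to its source.
document = rec.export(source="workflow.py")
```

The exported document is what would be shared for bulk use, while the `source` field lets anyone who wants a deeper understanding find the code.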

This can be thought of as a distribution mechanism, similar to how software is often distributed: as a binary or a container for most users, but as source code for those who want a deeper understanding, or perhaps the chance to make changes.

Carole Goble has pointed out to me that in this thinking, I’ve reinvented the concept of workflow lifecycle, which is at least 13 years old. She explains that workflows follow a cycle that starts with an experimentation/exploration phase (scientific hacking), where the workflow is an extension of the thought processes of the workflow maker. This may (or may not) be followed by a phase where the developer (or someone else) prepares the workflow for wider and repeated use through productization (documentation and optimization) and dissemination. This use by others can simply be use, or it can be further development.

Carole also points out that different types of users have different needs, from experts who want to be able to do as much as possible with a workflow, to other users who are happy to trade off complex features for a simpler user interface. Note that if we think about domain scientists, this level of expertise is only somewhat related to their disciplinary expertise; it’s more their computational expertise, though I think this relation is strengthening as many fields develop an increasing dependence on computing.

Given this, I still have a few remaining questions:

  1. I wonder if I can hire Carole to review all my work and suggest alternative ways to think about problems? 🙂
  2. What are the advantages to expressing a workflow as data, and do these apply more at the end of the workflow lifecycle than at the start?
  3. Do the advantages of representing a workflow as data apply more to non-computational experts? I suspect not: I don’t see the data representation as more intuitive to non-experts than the code representation. A visual representation, on the other hand, is almost certainly more intuitively understandable, but the same visual representation can be generated from either the code representation or the data representation.
  4. Where does CWL fit in this? I used to think of it as a data representation, but then I learned that it could include JavaScript, which makes this less correct.

Acknowledgements: Thanks to Carole Goble for useful discussions, and Doug Thain for some useful questions when I first thought about this topic.

Published by:


Chief Scientist at NCSA, Research Associate Professor in CS, ECE, and the iSchool at the University of Illinois Urbana-Champaign; works on systems and tools (aka cyberinfrastructure) and policy related to computational and data-enabled research, primarily in science and engineering

Categories: RSE, Uncategorized

5 thoughts on “Using workflows expressed as code and workflows expressed as data together”

  1. JavaScript expressions in CWL are limited to pure functions. Unlike “code”-based workflows, you cannot use JavaScript within CWL to change the execution environment or the workflow document on the fly. The intended use is for data reshaping between steps.

    One could imagine, as an alternative, some gnarly YAML-based pure functional language for expressions. Would that be more like “data”?


    1. As I wrote in

      A workflow is a set of tasks and dependencies between them. Workflows can be represented as directed graphs (where nodes are tasks and edges are dependencies). In some cases, these graphs are acyclic, i.e., DAGs. But computer programs can also be thought of as graphs at many different levels, such as components, functions, or instructions.

      An example of a system that expresses a workflow as code is An example of a system that expresses a workflow as data is


  2. I would say the distinguishing feature is how much you can predict the behavior of a workflow without having to actually run it. Workflow-as-code uses a general-purpose language to construct the graph, so you don’t know what the graph is without actually running the code (although you may not need to run the actual workflow). Workflow-as-data is declarative, so you can determine the graph directly.

