I’ve previously written about the concept of workflows, sets of otherwise independent tasks connected by data dependencies, being expressed either as data (for example, a Pegasus DAG or a CWL document) or as code (for example, a Python program written in Parsl).
A workflow expressed as code clearly has advantages: it may be easier to understand and to change, both for the original author and for others who might reuse it. A workflow expressed as data may also have advantages: it might be easier to re-execute or to use as a building block, though I’m not confident that either of these is correct. Overall, we can view the workflow expressed as code as more transparent, and the workflow expressed as data as more opaque.
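To make the contrast concrete, here is an illustrative sketch (deliberately not Parsl or CWL, whose details differ): the same trivial two-task workflow expressed once as ordinary code, where the dependency structure lives in the program itself, and once as a data structure that a generic engine interprets. All function and key names here are invented for the example.

```python
# As code: tasks are plain functions, and the data dependencies are
# implicit in the program's structure.
def clean(raw):
    return raw.strip().lower()

def count_words(text):
    return len(text.split())

def run_as_code(raw):
    return count_words(clean(raw))

# As data: the same workflow as a declarative task graph. Each task names
# the inputs it depends on; a generic engine decides the execution order.
workflow = {
    "tasks": {
        "clean": {"func": clean, "inputs": ["raw"]},
        "count_words": {"func": count_words, "inputs": ["clean"]},
    },
    "output": "count_words",
}

def run_as_data(wf, **inputs):
    results = dict(inputs)
    pending = dict(wf["tasks"])
    # Naive scheduler: repeatedly run any task whose inputs are all ready.
    while pending:
        for name, task in list(pending.items()):
            if all(dep in results for dep in task["inputs"]):
                args = [results[dep] for dep in task["inputs"]]
                results[name] = task["func"](*args)
                del pending[name]
    return results[wf["output"]]

print(run_as_code("  Hello workflow World  "))                 # 3
print(run_as_data(workflow, raw="  Hello workflow World  "))   # 3
```

The code version is easy to read and edit; the data version is easy for tooling to inspect, re-execute, or visualize, which is the trade-off discussed above.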
I’ve come to realize that these do not have to be mutually exclusive alternatives. It might make sense to think of a workflow first as code: at this level, you develop the workflow, experiment with it, and modify it until it reaches a stable point, useful to others who want to perform the same processing in bulk (large numbers of users, large numbers of uses, or both). At this point, you might convert the workflow into a data-based representation and share it.
But don’t throw away the code-based representation. Either store it with the data-based representation, or publish/archive it, and have the data point to it as the source.
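One way to keep that link is to embed provenance in the data-based representation itself. The sketch below is hypothetical (the document schema, repository URL, and commit hash are all invented placeholders, not any real standard): a shareable data document for the workflow that records where its code-based source lives.

```python
import json

# Hypothetical data-based representation of a workflow, with a pointer
# back to the code-based representation it was derived from.
workflow_doc = {
    "name": "word-count",
    "version": "1.0",
    "tasks": {
        "clean": {"inputs": ["raw"], "outputs": ["text"]},
        "count_words": {"inputs": ["text"], "outputs": ["count"]},
    },
    # Provenance: the source this data representation was generated from.
    "source": {
        "repository": "https://example.org/workflows/word-count",  # placeholder
        "commit": "abc1234",  # placeholder
    },
}

# Serialize for sharing or archiving alongside (or instead of) the code.
print(json.dumps(workflow_doc, indent=2))
```

A consumer of the data document can run it as-is, while anyone who wants a deeper understanding, or to make changes, can follow the `source` pointer back to the code.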
This can be thought of as a distribution mechanism, similar to how software is often distributed, as a binary or a container for most users, but as source code for those who want a deeper understanding, or perhaps the chance to make changes.
Carole Goble has pointed out to me that in this thinking, I’ve reinvented the concept of the workflow lifecycle, which is at least 13 years old. She explains that workflows follow a cycle that begins with an experimentation/exploration phase (scientific hacking), where the workflow is an extension of the thought processes of the workflow maker. This may (or may not) be followed by a phase where the developer (or someone else) prepares the workflow for wider and repeated use through productization (documentation and optimization) and dissemination. This use by others can simply be use, or it can be further development.
Carole also points out that different types of users have different needs, from experts who want to be able to do as much as possible with a workflow, to other users who are happy to trade off complex features for a simpler user interface. Note that if we think about domain scientists, this level of expertise is only somewhat related to their disciplinary expertise; it’s more about their computational expertise, though I think this relation is strengthening as many fields develop an increasing dependence on computing.
Given this, I still have a few remaining questions:
- I wonder if I can hire Carole to review all my work and suggest alternative ways to think about problems? 🙂
- What are the advantages of expressing a workflow as data, and do these apply more at the end of the workflow lifecycle than at the start?
- Do the advantages of representing a workflow as data apply more to non-computational experts? I suspect not, as I don’t see the representation as data being more intuitive to non-experts than the representation as code. A visual representation is almost certainly more intuitively understandable, but the same visual representation can be generated from either the code representation or the data representation.
Acknowledgements: Thanks to Carole Goble for useful discussions, and Doug Thain for some useful questions when I first thought about this topic.