Using explicit control processes in distributed workflows to gather provenance Sergio M. S. Cruz, Fernando Seabra Chirigati, Rafael Dahis, Maria Luiza M. Campos, Marta Mattoso Federal University of Rio de Janeiro, Brazil (UFRJ)
Agenda • Introduction • Motivation • Control flow in data centric workflows • Objective • Provenance Gathering in Distributed Workflows with Explicit Control Flows • Use Case • Control Flow on VisTrails • Conclusion
Distribution & Heterogeneity in Workflows • Scientific Wf enables data intensive analyses • Use of grid x remote parallel machines • Use of different WfMS • Different provenance capture mechanisms • Use Centralized x Distributed WfMS • often offer disjoint set of capabilities How to obtain a homogeneous provenance representation and capture mechanism?
Control flow matters in data centric workflows • Scientific workflows also need control structures to specify how the data flow should be directed • Goderis et al. [6] stress the importance of combining different models of computation in one scientific workflow • Bowers et al. [5] say that: • “modeling control-flow using only dataflow constructs can quickly lead to overly complex workflows that are hard to understand, reuse, reconfigure, maintain, and schedule” • Tudruj et al. [7] state the importance of general dynamic control flow, but focus on synchronization of parallel execution • Presented a set of generic control structures and proposed the use of a monitoring middleware
A real example: OrthoSearch workflow Detects distant homologies in five parasites associated with neglected tropical diseases
OrthoSearch specification in Kepler MAFFT/HMMER packages Best Hits Finder FormatDB BLAST InterPRO Time consuming tasks • Some lightweight tasks can run locally • Suppose we need to execute MAFFT/HMMER in a High Performance Environment • Just send it to a grid!
OrthoSearch - loops, choice, … MAFFT/HMMER packages How to map this to the grid language? Best Hits Finder FormatDB BLAST InterPRO
OrthoSearch - loops, choice, … MAFFT/HMMER packages Alternatively, send one job at a time to execute remotely Best Hits Finder FormatDB LOCAL BLAST InterPRO Can be very inefficient!
OrthoSearch - loops, choice, … Rewrite this to the grid language, e.g., Triana supports loops! But how to bring provenance data back to Kepler? How to register loop iterations?
OrthoSearch - loops, choice: other issues What if my available grid supports another WfMS? What if the grid WfMS does not support loops? What if my available grid does not have a WfMS? Generic control flow modules with remote provenance gathering!
Motivation • Workflow design • Different WfMS present their own control structures, parallel execution models, etc. • Expose different modeling semantics to the users! • Provenance gathering • WfMS register provenance in their own schemas • Often encompassing specific grid features • Based on application domain attributes Many challenges in changing the WfMS for the same workflow: a lot of mappings and conversions!
Objective • Diminish the dependence of the workflow definition on the WfMS • decoupling the provenance gathering system from the WfMS • keeping some control flow of execution independent of the WfMS workflow specification language • Plugging control flow and provenance gathering modules along the workflow's original tasks • the workflow specification can be executed almost independently of the current WfMS • provenance can be gathered uniformly
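As a rough illustration of plugging provenance gathering along the original tasks, the sketch below wraps an arbitrary task invocation and appends a uniform, WfMS-independent provenance record. The record schema, file name, and helper name are assumptions for illustration only, not the implementation described in the paper.

```python
import json
import subprocess
import time
from datetime import datetime, timezone


def run_with_provenance(task_name, command, log_path="provenance.jsonl"):
    """Run one workflow task and append a WfMS-independent provenance record.

    The JSON-lines record format is an illustrative assumption; the point is
    only that the same capture code can surround any task on any engine.
    """
    started = datetime.now(timezone.utc).isoformat()
    t0 = time.time()
    result = subprocess.run(command, capture_output=True, text=True)
    record = {
        "task": task_name,
        "command": command,
        "started": started,
        "duration_s": round(time.time() - t0, 3),
        "exit_code": result.returncode,
        "stdout_bytes": len(result.stdout),
        "stderr_bytes": len(result.stderr),
    }
    with open(log_path, "a") as log:
        log.write(json.dumps(record) + "\n")
    return result


# Hypothetical usage: wrap a legacy BLAST call exactly like any other task.
# run_with_provenance("blast", ["blastall", "-p", "blastp", "-i", "query.fasta"])
```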
Scientific Workflow Control Flows • A small set of generic workflow-level control modules • Based on workflow patterns by Van der Aalst et al.
Scientific Workflow Control Flows Implicit LOOP COGs DB MAFFT hmmbuild HMMER Implicit DECISION hmmcalibrate Ptn DB hmmsearch hmmpfam Reciprocal Best Hits Finder BLAST fastacmd formatdb Reannotated genes InterPRO
Scientific Workflow with Explicit Control Flows Initial condition MUX MAFFT hmmbuild HMMER hmmcalibrate Explicit LOOP Explicit DECISION T IF F hmmsearch hmmpfam The Meta-Workflow eases migration of a workflow from one WfMS to another! • All these modules can be sent to execute in any HPC environment • Provenance gathering mechanisms can be inserted in the control flow modules or other specific modules
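To make the meta-workflow concrete, here is a minimal, engine-independent emulation of the explicit LOOP built from the MUX and IF modules: the MUX merges the initial condition with the feedback edge, the body stands in for the MAFFT/hmmbuild/hmmcalibrate sub-workflow, and the IF decides whether to iterate again. Function and parameter names are illustrative, not the VisTrails implementation.

```python
def explicit_loop(initial_value, body, condition, provenance=None):
    """Emulate the explicit LOOP pattern: the MUX merges the initial value with
    the feedback value, the body runs once per iteration, and the IF test
    decides between looping back (T) and leaving the loop (F)."""
    value = initial_value            # MUX: first firing takes the initial condition
    iteration = 0
    while True:
        value = body(value)          # the time-consuming sub-workflow step
        iteration += 1
        if provenance is not None:   # hook where per-iteration provenance is recorded
            provenance(iteration, value)
        if not condition(value):     # IF: F port leaves the loop
            return value             # T port would feed the MUX back


# Toy usage: iterate until a score reaches a threshold, logging each iteration.
result = explicit_loop(
    initial_value=0.0,
    body=lambda v: v + 0.25,
    condition=lambda v: v < 1.0,
    provenance=lambda i, v: print(f"iteration {i}: value={v}"),
)
```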
Control flow modules on VisTrails • All these control flow modules were made available on VisTrails • Advantages: • More explicit control is now available • Remote execution can keep the specified control • Remote execution can bring provenance data back to VisTrails in a compatible structure
OrthoSearch on VisTrails Explicit DECISION External LOOP (parameter exploration) • All these inner modules (sub-workflow) can be sent to execute in a grid or HPC environment • Provenance gathering mechanisms can be inserted in the control flow modules or other specific modules • In VisTrails the loop could not be implemented because it is a DAG-based WfMS
Scientific Workflow - Heterogeneity COGs DB MAFFT hmmbuild HMMER hmmcalibrate Ptn DB hmmsearch Time consuming hmmpfam Reciprocal Best Hits Finder BLAST fastacmd formatdb Reannotated genes InterPRO
OrthoSearch on VisTrails REMOTE PARALLEL EXECUTION BLAST • BLAST modules should be sent to execute in a PC cluster • Provenance gathering mechanisms can be inserted in the control flow modules to be sent to the parallel environment • In VisTrails this can be achieved using the MidMon modules
MidMon on VisTrails Implementation • Monitoring tool that checks scientific processes running on distributed environments • Message exchange-based tool • Decoupled, with a modular infrastructure • Support for legacy applications on distributed resources Control Modules Data Modules BLAST
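A minimal sketch of the message-exchange idea behind MidMon, assuming a file-based channel and an invented message schema; the real transport, message format, and module split between the desktop and the remote resource are not detailed here.

```python
import json
import time
from dataclasses import dataclass, asdict


@dataclass
class StatusMessage:
    """One monitoring message exchanged between the remote data modules and
    the desktop control modules. The field set is an illustrative assumption."""
    task: str
    state: str       # e.g. "submitted", "running", "finished", "failed"
    node: str
    timestamp: float


def publish(msg: StatusMessage, channel_path: str) -> None:
    """Remote side: append the message to a shared channel (here, a plain file)."""
    with open(channel_path, "a") as channel:
        channel.write(json.dumps(asdict(msg)) + "\n")


def poll(channel_path: str, last_offset: int = 0):
    """Desktop side: read any messages appended since the last poll."""
    with open(channel_path) as channel:
        channel.seek(last_offset)
        lines = channel.readlines()
        return [json.loads(line) for line in lines if line.strip()], channel.tell()


# Example round trip on one machine.
publish(StatusMessage("BLAST", "running", "node07", time.time()), "midmon.log")
messages, offset = poll("midmon.log")
```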
Concluding • We share the same motivation as Bowers et al., Goderis et al. and Tudruj et al. • And the same as Groth et al. • We propose: • A set of generic control-flow structures independent of the WfMS • Our implementation has shown that: • Control-flow structures can allow generic sub-workflow remote execution • Remote process provenance can be captured in the same representation as the workflow • Workflow refactoring is facilitated • Control-flow structures can be coupled to monitoring middleware Using explicit control flow: provenance independent of a WfMS
Conclusion • Distribution & Heterogeneity are inevitable in scientific workflows • Adding control-flow modules to the scientific workflow specification can help execution by heterogeneous WfMS running on distributed environments • Acts as documentation of the workflow's execution control • Allows evaluating and monitoring the activities of the workflow • Helps to gather provenance from heterogeneous and independent environments with low programming effort • MidMon on top of VisTrails • Enables scientists to monitor the status of submitted jobs on their desktops • Preserves the workflows' original features
Future work • Use workflow views, e.g. ZOOM* • Our solution makes the workflow very verbose • Use software component reuse and refactoring techniques to help the automatic incorporation of these modules • “Using Provenance to Improve Workflow Design” Tosta et al. • Work with other workflows from bioinformatics and oil industry
Thanks! Using explicit control processes in distributed workflows to gather provenance Sergio M. S. da Cruz, Fernando Seabra Chirigati, Rafael Dahis, Maria Luiza M. Campos, Marta Mattoso Federal University of Rio de Janeiro, Brazil
Scientific Workflow Control Flows • A small set of generic workflow-level control modules • Based on workflow patterns by Van der Aalst et al. MUX Describes a convergence between two or more input ports, resulting in just one branch
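A minimal Python sketch of the MUX pattern as a standalone class, not the actual VisTrails module code; the firing rule (forward the first activated port) is an assumption.

```python
class Mux:
    """MUX pattern: two or more input ports converge into one output branch;
    whichever port was activated is forwarded downstream."""

    def __init__(self, n_ports: int):
        self.ports = [None] * n_ports   # None marks a port that never fired

    def set_port(self, index: int, value) -> None:
        self.ports[index] = value

    def compute(self):
        # Forward the first activated port; the real engine's firing rule
        # (e.g. order of arrival) is abstracted away here.
        for value in self.ports:
            if value is not None:
                return value
        raise ValueError("Mux fired with no activated input port")
```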
Scientific Workflow Control Flows • A small set of generic workflow-level control modules • Based on workflow patterns by Van der Aalst et al. DEMUX Represents an incoming branch that diverges into two or more parts. Just one of the outgoing branches is enabled, depending on an associated condition
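The DEMUX can be sketched the same way; the selector function standing in for the associated condition is an illustrative assumption.

```python
class Demux:
    """DEMUX pattern: one incoming branch diverges into two or more outgoing
    branches, and only the branch chosen by the associated condition is enabled."""

    def __init__(self, n_branches: int, selector):
        self.n_branches = n_branches
        self.selector = selector        # maps the input value to a branch index

    def compute(self, value):
        branch = self.selector(value)
        if not 0 <= branch < self.n_branches:
            raise IndexError("selector chose a non-existent branch")
        outputs = [None] * self.n_branches
        outputs[branch] = value         # the other branches stay disabled (None)
        return outputs
```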
Scientific Workflow Control Flows • A small set of generic workflow-level control modules • Based on workflow patterns by Van der Aalst et al. STRING CONTROL The workflow is divided into two or more branches, and just one of them can be enabled; the other outgoing branches are withdrawn
Scientific Workflow Control Flows • A small set of generic workflow-level control modules • Based on workflow patterns by Van der Aalst et al. NUMBER CONTROL All output data are generated simultaneously
Scientific Workflow Control Flows • A small set of generic workflow-level control modules • Based on workflow patterns by Van der Aalst et al. NUMBER COMPARE Two or more incoming branches become one outgoing branch, which is enabled only after all the input data have arrived.
Scientific Workflow Control Flows • A small set of generic workflow-level control modules • Based on workflow patterns by Van der Aalst et al. IF Same pattern as the Demux, but with two differences: the If has only two input ports and holds a logical expression, where scientists can define any condition they need.
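A minimal sketch of the If with its two input ports and a scientist-defined logical expression; class and port names are illustrative, not the VisTrails implementation.

```python
class If:
    """IF pattern: like the Demux, but with exactly two input ports (the values
    routed on True and on False) and a free-form logical expression."""

    def __init__(self, expression):
        self.expression = expression    # callable returning True or False

    def compute(self, true_input, false_input, *condition_args):
        # The expression may inspect whatever data the scientist wires in.
        return true_input if self.expression(*condition_args) else false_input


# Example: route to the "converged" branch once an e-value drops below a cutoff.
branch = If(lambda e_value: e_value < 1e-5).compute("converged", "iterate", 1e-7)
```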
MidMon • Offers a generic and lightweight monitoring tool that checks scientific processes running on distributed environments • Message exchange-based, two-layered modular infrastructure • Decoupled and lightweight, crossing different network boundaries • Easy to deploy and manage • Support for legacy applications on distributed resources
MidMon Monitoring Data • Task state data that may be monitored • State of the execution environment that may be monitored • Service availability that may be monitored
MidMon – State Data • List of task state data that it may be possible to monitor: • Progress of a service - relies on checkpoints within the service, or the service may be able to provide an estimate of its progress • Completion of a service - a simple event indicating that a service has produced all of its output files • Data consumption rate of a service - a measure of the rate at which a service is consuming data from its input files • Data production rate of a service - a measure of the rate at which a service is generating data for its output files
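A small sketch of how two of these task-state signals could be derived on the execution node; watching output-file growth is only a stand-in for checkpoint-based progress reports, and the marker-file convention for completion is an invented example.

```python
import os
import time


def production_rate(output_path: str, interval_s: float = 5.0) -> float:
    """Estimate a service's data production rate (bytes/s) from the growth of
    its output file between two samples. Measuring the consumption rate would
    need the service itself to report its read offset."""
    size_0 = os.path.getsize(output_path)
    time.sleep(interval_s)
    return (os.path.getsize(output_path) - size_0) / interval_s


def is_complete(output_path: str, done_marker: str = ".done") -> bool:
    """Completion event: signalled here by a marker file next to the output,
    which is purely an illustrative convention."""
    return os.path.exists(output_path + done_marker)
```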
MidMon – State of the environment • A list of the useful data that it may be possible to monitor about the state of the environment: • Available execution nodes - a list of changes in the execution nodes available in the environment • Load on an execution node - a measure of the load on an execution node; it could be a single value or a composite of measures, e.g., the CPU load, the number of processes, and the free resources of the node • Load on a network link - a measure of the usage of a network link, in terms of available bandwidth • Memory usage on an execution node - a measure of the memory usage on an execution node
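A sketch of sampling two of these environment metrics on the local node, assuming a Unix/Linux node (os.getloadavg and /proc/meminfo); network-link load would need an external probe and is omitted.

```python
import os


def node_state() -> dict:
    """Sample load and memory usage of the local execution node (Linux only)."""
    load_1min, load_5min, load_15min = os.getloadavg()
    meminfo = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            meminfo[key] = int(value.strip().split()[0])   # values reported in kB
    return {
        "load_1min": load_1min,
        "load_5min": load_5min,
        "load_15min": load_15min,
        "mem_total_kb": meminfo.get("MemTotal"),
        "mem_free_kb": meminfo.get("MemFree"),
    }
```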
MidMon – Service availability • The following is a list of useful data that it may be possible to monitor about service availability: • Available services - a list of the services available as mapping targets for tasks in a workflow; the data could also include, e.g., the status of currently deployed services • Available data resources - a list of the data resources available as mapping targets for inputs and outputs in a workflow
OrthoSearch – SSH version • Without Control-Flow modules
OrthoSearch on Kepler 1/3 (hmmSearch, hmmPFam)
OrthoSearch on Kepler 2/3 (FormatDB, FastaCmd)
OrthoSearch on Kepler 3/3 (InterPro)