WS-VLAM: Towards a Scalable Workflow System on the Grid

V. Korkhov, D. Vasyunin, A. Wibisono, V. Guevara-Masis, A. Belloum
vkorkhov@science.uva.nl
Institute of Informatics, Faculty of Science, University of Amsterdam

Outline
• Introduction: what is WS-VLAM?
• Architecture of WS-VLAM
• Large-scale workflow support:
  • Distributed workflow engine and multi-cluster execution support
  • Hierarchical resource management and workload balancing
  • Workflow farming
  • Semantic workflow support
• Conclusions

Introduction
WS-VLAM (Virtual Lab AMsterdam) concepts:
• Data-driven workflow system
• Data streaming between workflow components running on the Grid
• Components: input and output ports for data exchange; parameters for control (during runtime as well); graphical output (X11) supported
• GUI and engine decoupled, interfaced using WS-RF
Engine (RTS – Run Time System):
• Implemented as a GT4 WS-RF service
• Uses GT4 features (delegation service, GSI, notifications, etc.)

WS-VLAM architecture

Large-scale distributed workflow support
• Multi-cluster distributed experiments: distributed workflow engine
• Heterogeneous resources: workload balancing and resource management
• Complex workflows with parameter sweeps and iterative processing: workflow farming
• Semantic support

Distributed workflow engine
[Figure: distributed workflow engine. Labels: WS-VLAM GUI; EPR; Resource Manager; and, for each of Cluster 1 and Cluster 2: GT4 Service Container, WS-RTSM Factory, GRAM, Distributed RTSM, WS-RTSM Instance, GUI proxy, Data proxy, and worker nodes running workflow components.]

Hierarchical resource management and workload balancing
• Task level: adaptive workload balancing for parallel applications (MPI) on heterogeneous resources
• Job level: inter-task workload distribution and balancing for multi-task applications (DIANE user-level scheduling environment)
• Workflow level: workflow farming

Workload balancing strategy (parallel and multi-task applications)
• Divisible workload is distributed between tasks based on application characteristics (communication/computation ratio) and resource characteristics (CPU, memory, bandwidth)
• Weights are assigned to all the resources that execute tasks, according to their capacities
• A fast heuristic algorithm gives an approximate weighting of the resources processing the workload
• Iterative processing of similar data: execution performance is measured for each iteration and the weights (and thus the workload distribution) are adapted on the fly
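The weight-based distribution and per-iteration adaptation described above can be illustrated with a minimal sketch. It is not the WS-VLAM heuristic itself: the function names `split_workload` and `adapt_weights`, the largest-remainder rounding, and the `smoothing` parameter are assumptions made for illustration, and the weights here are derived only from measured throughput rather than from CPU, memory, and bandwidth characteristics.

```python
def split_workload(total_units, weights):
    """Split an integer number of work units proportionally to the weights.

    Uses largest-remainder rounding so the shares always sum to total_units.
    """
    total_w = sum(weights)
    raw = [total_units * w / total_w for w in weights]
    shares = [int(r) for r in raw]  # floor of each ideal share
    leftover = total_units - sum(shares)
    # Hand the remaining units to the resources with the largest fractions.
    order = sorted(range(len(weights)),
                   key=lambda i: raw[i] - shares[i], reverse=True)
    for i in order[:leftover]:
        shares[i] += 1
    return shares


def adapt_weights(weights, shares, elapsed, smoothing=0.5):
    """Blend the old weights with the throughput measured in one iteration.

    shares[i] units took elapsed[i] seconds on resource i (elapsed[i] > 0).
    smoothing=1.0 trusts only the latest measurement (hypothetical knob).
    """
    old_total = sum(weights)
    old = [w / old_total for w in weights]
    speeds = [s / t for s, t in zip(shares, elapsed)]  # units per second
    speed_total = sum(speeds)
    if speed_total == 0:
        return old  # nothing was measured; keep the previous weights
    measured = [v / speed_total for v in speeds]
    return [(1 - smoothing) * o + smoothing * m
            for o, m in zip(old, measured)]


if __name__ == "__main__":
    weights = [1.0, 1.0]                  # initial guess: equal capacities
    shares = split_workload(100, weights)
    elapsed = [10.0, 5.0]                 # made-up timings for illustration
    weights = adapt_weights(weights, shares, elapsed)
    print(shares, weights)
```

Blending old and new weights rather than replacing them outright is one way to keep a single noisy measurement from swinging the distribution; whether WS-VLAM does this is not stated in the slides.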

Workflow farming: adaptive data distribution
[Figure: a WF Distributor with an Estimator farms independent data or parameters (iterative processing) to workflows WF 1 (W=1) and WF 2 (W=2). WF 1 is twice as slow.]
Each farmed workflow first gets a single data element to process so that its performance can be assessed. The processing speed is evaluated, and the future workload distribution is determined from this information. Weights reflecting the performance are assigned to the workflows.
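The probe-then-distribute scheme described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the WS-VLAM distributor: workflows are modelled as plain Python callables, and the names `probe`, `weights_from_timings`, and `distribute`, as well as the greedy assignment rule, are hypothetical.

```python
import time


def probe(workflows, items):
    """Give each farmed workflow one data element and time its processing."""
    timings = []
    for wf, item in zip(workflows, items):
        start = time.perf_counter()
        wf(item)
        timings.append(time.perf_counter() - start)
    return timings


def weights_from_timings(timings):
    """The slowest workflow gets weight 1; one twice as fast gets weight 2."""
    slowest = max(timings)
    return [slowest / t for t in timings]


def distribute(items, weights):
    """Assign each item to the workflow with the lowest projected load.

    Projected load is (items assigned + 1) / weight; ties go to the
    workflow with the lower index.
    """
    buckets = [[] for _ in weights]
    for item in items:
        target = min(range(len(weights)),
                     key=lambda i: (len(buckets[i]) + 1) / weights[i])
        buckets[target].append(item)
    return buckets


if __name__ == "__main__":
    # Timings as in the slide's example: WF 1 is twice as slow as WF 2.
    weights = weights_from_timings([2.0, 1.0])
    print(weights)
    print(distribute(list(range(6)), weights))
```

With these weights, the second workflow receives twice as many data elements as the first, matching the W=1 and W=2 labels in the figure.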

Workflow farming: WF service
[Figure: workflow farming service. Labels: GUI; XML topology; RTSM Factory; Resource Manager; list of WS-RTSM EPRs; data to farm (elements 1 to 6); WS-RTSM 1, WS-RTSM 2 and WS-RTSM 3 running WF 1, WF 2 and WF 3; performance data (Perf) reported per workflow.]
Workflows WF 1, WF 2 and WF 3 are running with a WS interface, ready to process data from the Resource Manager (RM) "on demand".

Semantic workflow support

Conclusions
• WS-VLAM features supporting large-scale data-driven workflows:
  • Multi-cluster support for a single workflow, with data exchange between internal nodes of different clusters
  • Adaptive workload balancing for parallel applications (workflow components) on heterogeneous resources
  • Workload balancing at the workflow level: parameter/data sweep for workflows
  • Semantic support for workflow composition

http://www.vl-e.nl/
http://www.science.uva.nl/~gvlam/wsvlam
