A tuple space based job submission system for distributing short jobs
Peter Praxmarer, GUP, Joh. Kepler University Linz, praxmarer@gup.jku.at
Agenda
• Architecture Outline
• Challenges imposed on the job submission system by POP-C++
• Challenges imposed on the job submission system by the infrastructure
• Requirements
• Today's system setup
• Performance results
Peter Praxmarer, GUP, Universität Linz
Architecture Outline
Challenges: POP-C++ (1)
• Challenges imposed by the NQueen app and POP-C++
• Performance
  - POP-C++ assumes fast instantiation of so-called Remote Objects on the worker nodes.
  - Remote Objects are instantiated one after another.
  - Consequently, many Remote Objects cannot be submitted at once as a bulk job.
  - Instead, creating a single Remote Object corresponds to submitting a single Grid job.
Challenges: POP-C++ (2)
• Challenges imposed by the NQueen app and POP-C++ (continued)
• Need for outbound connectivity
• Fault tolerance
  - The failure of a Grid job would require complex error handling within the POP-C++ library or within the application built on top of it.
  - This negatively impacts performance.
Challenges: gLite (1)
• Challenges imposed by the Grid infrastructure (gLite, ProActive VO)
• Performance
  - Delay of several minutes between submitting a gLite job and its start-up on a worker node
• Reliability
  - High rate of failed gLite jobs, for various reasons
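To see why these two points hurt POP-C++'s one-job-per-Remote-Object model, the following Python sketch (the system itself is written in C++) estimates the expected start-up time when Remote Objects are created one after another. The model is an assumption, not a measurement from this system: each creation is a separate Grid job, each attempt costs `startup_s` seconds before it is either running or known to have failed, failures are independent with probability `p_fail`, and failed jobs are simply resubmitted. The function names and the demo values are hypothetical.

```python
# Hypothetical cost model for sequential Remote Object creation on the Grid.
# Assumption: every attempt (successful or failed) costs the full start-up
# delay, and failed jobs are resubmitted until one succeeds.


def expected_attempts(p_fail: float) -> float:
    """Mean number of submissions until the first success (geometric distribution)."""
    if not 0.0 <= p_fail < 1.0:
        raise ValueError("p_fail must be in [0, 1)")
    return 1.0 / (1.0 - p_fail)


def sequential_startup_time(n_objects: int, startup_s: float, p_fail: float) -> float:
    """Expected wall-clock seconds to create n_objects Remote Objects one after another."""
    return n_objects * startup_s * expected_attempts(p_fail)


if __name__ == "__main__":
    # Illustrative values only: 10 objects, 3-minute start-up, 20 % failure rate.
    t = sequential_startup_time(n_objects=10, startup_s=180.0, p_fail=0.2)
    print(f"{t / 60:.1f} minutes")  # 10 * 180 s * 1.25 = 2250 s = 37.5 minutes
```

Because creation is sequential, the delay and the failure rate multiply rather than overlap, which is what motivates keeping pre-started resources ready to accept work.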
Challenges: gLite (2)
• Challenges imposed by the Grid infrastructure (continued)
• Dynamically changing environment
  - Machines are being added and removed
• Heterogeneous resources
  - IBM PPC cluster (not yet supported by gLite)
  - Non-gLite cluster at EIF
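One common way to cope with machines joining and leaving is to have each worker advertise itself with a time-limited lease that it renews periodically, similar to the leases used by some tuple space systems such as JavaSpaces. The following Python sketch is an assumption for illustration, not the mechanism described in these slides: `LeaseRegistry`, `renew`, and `available` are hypothetical names, and the clock is injectable so the behaviour can be checked without sleeping.

```python
import time
from typing import Callable, Dict, List


class LeaseRegistry:
    """Tracks workers that advertise themselves with a time-limited lease.

    Hypothetical design: a machine that is removed simply stops renewing,
    and its entry disappears once its lease runs out. No explicit
    deregistration is needed.
    """

    def __init__(self, lease_s: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._lease_s = lease_s
        self._clock = clock
        self._expiry: Dict[str, float] = {}  # worker id -> absolute expiry time

    def renew(self, worker_id: str) -> None:
        """Register a newly arrived worker or extend an existing worker's lease."""
        self._expiry[worker_id] = self._clock() + self._lease_s

    def available(self) -> List[str]:
        """Drop expired leases and return the remaining worker ids, sorted."""
        now = self._clock()
        self._expiry = {w: t for w, t in self._expiry.items() if t > now}
        return sorted(self._expiry)


if __name__ == "__main__":
    fake_now = [0.0]  # manually advanced clock for the demo
    reg = LeaseRegistry(lease_s=30.0, clock=lambda: fake_now[0])
    reg.renew("node-a")
    reg.renew("node-b")
    fake_now[0] = 20.0
    reg.renew("node-b")  # node-b is still alive and renews; node-a does not
    fake_now[0] = 40.0
    print(reg.available())  # ['node-b']
```

The design choice here is that liveness is inferred from renewals rather than announced, so a crashed or reclaimed Grid node never leaves a stale "available" entry behind for longer than one lease period.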
Requirements for the job submission system
• Has to be fast
  - Job requests have to be fulfilled immediately if a resource is available.
• Has to be reliable
  - A high number of job requests must be processed within a short time frame.
• Has to cope with the dynamic arrival of new resources
• Should ensure that resources advertised as available are indeed capable of successfully executing the Remote Object
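The last requirement can be met by letting a node run a self-test before it announces itself. The slides do not say how this check is done, so the following Python sketch is only an assumption: `passes_self_test`, `advertise_if_healthy`, and the probe command are hypothetical, and `DEFAULT_PROBE` merely launches the current Python interpreter as a stand-in for a real check such as starting the Remote Object executable once.

```python
import shutil
import subprocess
import sys
from typing import Callable, Sequence

# Hypothetical probe: a stand-in for a real test-run of the Remote Object executable.
DEFAULT_PROBE = (sys.executable, "-c", "import sys; sys.exit(0)")


def passes_self_test(probe_cmd: Sequence[str] = DEFAULT_PROBE, timeout_s: float = 10.0) -> bool:
    """True only if the probe is installed, runs, and exits with status 0 in time."""
    if not probe_cmd or shutil.which(probe_cmd[0]) is None:
        return False  # probe executable not found on this node
    try:
        result = subprocess.run(list(probe_cmd), capture_output=True, timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired):
        return False  # could not start, or hung past the timeout
    return result.returncode == 0


def advertise_if_healthy(worker_id: str, advertise: Callable[[str], None],
                         probe_cmd: Sequence[str] = DEFAULT_PROBE) -> bool:
    """Call advertise(worker_id), e.g. to insert an 'available' tuple, only after a passing self-test."""
    if passes_self_test(probe_cmd):
        advertise(worker_id)
        return True
    return False
```

A node that fails the probe never advertises itself, so a job request is only ever matched to a resource that has just demonstrated it can run the job.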
A Tuple Space based Job Submission System (1)
• Based on ideas introduced in Carriero and Gelernter's Linda system (1983)
• A tuple space (TS) is a collection of tuples.
• A tuple is a data object containing one or more components, e.g. (234, “command”) or (234, 0, “result string”).
• The TS is accessed through three operations: insert, read, and take.
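The three operations can be illustrated with a small in-process tuple space. This is a Python sketch, not the C++ implementation described in the slides: the class and method names are hypothetical, `None` is assumed to act as a wildcard in templates (so a literal `None` field cannot be matched exactly), and `read`/`take` block until a matching tuple exists or an optional timeout expires, in the spirit of Linda's blocking `rd`/`in`.

```python
import threading
from typing import Any, List, Optional, Tuple

WILDCARD = None  # a None field in a template matches any value in that position


def matches(template: Tuple[Any, ...], tup: Tuple[Any, ...]) -> bool:
    """A template matches a tuple of the same length whose fields are equal,
    except where the template holds WILDCARD."""
    return len(template) == len(tup) and all(
        t is WILDCARD or t == v for t, v in zip(template, tup)
    )


class TupleSpace:
    """In-process tuple space with Linda-style insert/read/take."""

    def __init__(self) -> None:
        self._tuples: List[Tuple[Any, ...]] = []
        self._cond = threading.Condition()

    def insert(self, tup: Tuple[Any, ...]) -> None:
        """Add a tuple and wake every blocked reader or taker."""
        with self._cond:
            self._tuples.append(tup)
            self._cond.notify_all()

    def _wait_for_match(self, template, remove: bool, timeout: Optional[float]):
        found: List[Tuple[Any, ...]] = []

        def lookup() -> bool:
            # Find the first matching tuple; remove it if this is a take.
            for i, tup in enumerate(self._tuples):
                if matches(template, tup):
                    found.append(self._tuples.pop(i) if remove else tup)
                    return True
            return False

        with self._cond:
            # wait_for re-runs lookup() after every notify_all() from insert().
            if self._cond.wait_for(lookup, timeout):
                return found[0]
        return None  # timed out without a match

    def read(self, template, timeout: Optional[float] = None):
        """Return a matching tuple, leaving it in the space."""
        return self._wait_for_match(template, remove=False, timeout=timeout)

    def take(self, template, timeout: Optional[float] = None):
        """Remove and return a matching tuple."""
        return self._wait_for_match(template, remove=True, timeout=timeout)


if __name__ == "__main__":
    ts = TupleSpace()

    def worker() -> None:
        job_id, command = ts.take((WILDCARD, WILDCARD))  # any 2-tuple job request
        ts.insert((job_id, 0, f"result of {command}"))

    threading.Thread(target=worker).start()
    ts.insert((234, "command"))
    print(ts.take((234, WILDCARD, WILDCARD)))  # (234, 0, 'result of command')
```

Because `take` removes the tuple atomically under the lock, two workers can never pick up the same job request, which is what makes a tuple space usable as a simple job queue.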
A Tuple Space based Job Submission System (2)
• Currently a client-server architecture (one central TS server, multiple clients)
• Implemented in C++
• Network connectivity via TCP (without GSI)
• Access control via ACLs
• Provides a simple but fast form of resource brokering
• Built to support the execution of short jobs
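The slides state only that clients reach the central TS server over plain TCP; the wire format is not described. Because TCP delivers a byte stream rather than discrete messages, any such protocol needs framing. The following Python sketch assumes a hypothetical format (a 4-byte big-endian length prefix followed by a UTF-8 JSON array) and only round-trips tuples whose fields are JSON scalars (numbers, strings, booleans, null); the function names are illustrative.

```python
import json
import socket
import struct
from typing import Any, Tuple

_HEADER = struct.Struct("!I")  # hypothetical frame header: payload length, 4 bytes, big-endian


def send_tuple(sock: socket.socket, tup: Tuple[Any, ...]) -> None:
    """Send one tuple as a length-prefixed JSON array."""
    payload = json.dumps(list(tup)).encode("utf-8")
    sock.sendall(_HEADER.pack(len(payload)) + payload)


def _recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes; recv() may return fewer bytes than requested."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection mid-message")
        buf.extend(chunk)
    return bytes(buf)


def recv_tuple(sock: socket.socket) -> Tuple[Any, ...]:
    """Receive one length-prefixed tuple sent by send_tuple()."""
    (length,) = _HEADER.unpack(_recv_exact(sock, _HEADER.size))
    return tuple(json.loads(_recv_exact(sock, length).decode("utf-8")))
```

For a quick local check, `socket.socketpair()` gives two connected sockets: one side calls `send_tuple(a, (234, "command"))` and the other receives it with `recv_tuple(b)`. A real client-server deployment would additionally need an operation field (insert, read, or take) and the ACL check on the server side.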
Today's system setup
(Diagram not preserved; recoverable labels:)
• hydra@GUP (hosts 2 TS)
• crunchmachines@EIF (runs the NQueen App and some Workers)
• … IBM PPC@CSCS …
• tisiphone@GUP (gLite UI), lcg-job-submit, ProActive VO LCG resources
• Job-Containers started via ssh
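The job containers are what make short jobs cheap: a container pays the Grid or ssh start-up cost once and can then serve several job requests. The following Python sketch of such a container loop is an assumption, not the deployed code: `run_container`, `run_command`, and the callables it takes are hypothetical, and `queue.Queue` stands in for the tuple space in the local demo. The job and result shapes mirror the example tuples (234, "command") and (234, 0, "result string").

```python
import queue
import shlex
import subprocess
from typing import Callable, Optional, Tuple

Job = Tuple[int, str]          # (job id, command)
Result = Tuple[int, int, str]  # (job id, status, output)


def run_command(command: str, timeout_s: float = 60.0) -> Tuple[int, str]:
    """Run one short job locally and return (exit status, captured stdout)."""
    proc = subprocess.run(shlex.split(command), capture_output=True, text=True,
                          timeout=timeout_s)
    return proc.returncode, proc.stdout


def run_container(take_job: Callable[[], Optional[Job]],
                  put_result: Callable[[Result], None],
                  execute: Callable[[str], Tuple[int, str]] = run_command,
                  max_jobs: int = 1000) -> int:
    """Serve jobs until take_job() returns None or max_jobs is reached.

    Returns the number of jobs served.
    """
    served = 0
    while served < max_jobs:
        job = take_job()
        if job is None:
            break  # idle: exit and free the Grid slot
        job_id, command = job
        status, output = execute(command)
        put_result((job_id, status, output))
        served += 1
    return served


if __name__ == "__main__":
    jobs = queue.Queue()  # local stand-in for the tuple space
    jobs.put((234, "command"))

    def take_job() -> Optional[Job]:
        try:
            return jobs.get(timeout=0.1)
        except queue.Empty:
            return None

    # A fake executor keeps the demo portable; run_command would run a real job.
    print(run_container(take_job, print, execute=lambda c: (0, "result string")))
```

The demo prints the result tuple `(234, 0, 'result string')` and then `1`, the number of jobs served. Passing `take_job` and `put_result` as callables keeps the loop independent of the transport, so the same logic could sit on top of a remote tuple space client.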
Performance Results
• Within this setup:
  - Served up to 500 Job-Containers
  - Served the same number of job requests
  - Time for executing a single job: min. 0.103 s, max. 4.328 s (sample size: 750 jobs)
• A detailed performance analysis will be published later.
Questions?