ToPoS: High-Throughput Parallel Processing Pipelines on the Grid

Pieter van Beek
SARA Computing and Networking Services
High Performance Computing and Visualization
e-Science Support

User experiences with gLite
• Overhead for starting jobs is considerable
• Determining the best chunk size is difficult:
  • Too small -> large overhead
  • Too large -> timeouts and throughput problems
• Resource brokering is far from optimal
• Jobs often fail, and users create their own tools for administrative tasks

Resource Brokering
• Submitted jobs are sent to a CE immediately
• When another CE becomes available later, jobs that are already queued will not use it automatically

Failing Jobs (1)
• Common experiences:
  • Sorry, an Incomprehensible Error occurred
  • Your VOMS Credential has expired
  • What Job?
  • Success! (but there’s no output)
  • Failure! (but it ran just fine)
  • Out of Wall-time (but no CPU-time?)
• A lot of “monitoring and resubmission” software is created again and again by many users

Failing Jobs (2)
• A real-world example:
  • 27,000 jobs
  • duration: approx. 4 hrs per job
  • approx. 280 WNs
• Theoretical duration: 16 days
• But with a success rate of 70% …
  • Approx. 9 resubmissions
  • “Practical” duration: >2 months
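The figures in this example can be checked with a few lines of arithmetic. The sketch below assumes that all worker nodes stay busy, that each attempt fails independently with probability 30%, and that a batch counts as finished once the expected number of unfinished jobs drops below one. These assumptions and the function names are illustrative, not taken from the slides. The model ignores the time spent detecting failures and resubmitting, which is where most of the "practical" duration goes.

```python
def theoretical_days(jobs, hours_per_job, worker_nodes):
    """Wall-clock days if every job succeeds and all worker nodes stay busy."""
    return jobs * hours_per_job / worker_nodes / 24


def submission_rounds(jobs, success_rate):
    """Rounds needed until the expected number of unfinished jobs is below one.

    Assumes every attempt fails independently with probability 1 - success_rate.
    """
    remaining = float(jobs)
    rounds = 0
    while remaining >= 1:
        remaining *= 1 - success_rate  # only the failed jobs go to the next round
        rounds += 1
    return rounds


days = theoretical_days(27_000, 4, 280)
rounds = submission_rounds(27_000, 0.70)
print(f"theoretical duration: {days:.1f} days")
print(f"submission rounds:    {rounds}")
```

Under these assumptions the theoretical duration comes out at about 16 days, and roughly nine submission rounds are needed, in line with the numbers on the slide.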

Pilot Jobs • “Normal” jobs • Pilot jobs

Simplest possible solution: Topos I
• An online counter, like a “page views” counter
• Numbers are “leased” for some period
• Leases must be renewed
• Interfaced with HTTP (REST web service)
• Can be used with any HTTP client (wget, browsers)
• As little security as possible
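The leasing idea can be illustrated with a small in-memory model. This is only a sketch of the concept, not the Topos I service or its HTTP interface: the class name `LeaseCounter`, its methods, and the rule that an expired number is handed out again before a fresh one are all assumptions made for illustration.

```python
import time


class LeaseCounter:
    """Hypothetical in-memory model of a counter whose numbers are leased."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._next = 0      # next number that has never been issued
        self._leases = {}   # number -> time at which its lease expires

    def lease(self, duration):
        """Hand out a number for `duration` seconds."""
        now = self._clock()
        # Assumption: a number whose lease expired is reissued first.
        expired = sorted(n for n, exp in self._leases.items() if exp <= now)
        if expired:
            number = expired[0]
        else:
            number = self._next
            self._next += 1
        self._leases[number] = now + duration
        return number

    def renew(self, number, duration):
        """Extend a lease; returns False if it is unknown or already expired."""
        now = self._clock()
        expiry = self._leases.get(number)
        if expiry is None or expiry <= now:
            return False
        self._leases[number] = now + duration
        return True

    def complete(self, number):
        """Mark the work for `number` as done so it is never reissued."""
        self._leases.pop(number, None)


# Usage with a fake clock so the example is deterministic.
now = [0.0]
counter = LeaseCounter(clock=lambda: now[0])
a = counter.lease(60)                 # first worker
b = counter.lease(60)                 # second worker
now[0] = 30
renewed = counter.renew(b, 60)        # second worker is still alive
now[0] = 61                           # first worker never renewed: lease expired
c = counter.lease(60)                 # the expired number is handed out again
counter.complete(b)
d = counter.lease(60)                 # nothing expired, so a fresh number
print(a, b, c, d)
```

A worker that crashes simply stops renewing, so its number becomes available to another worker without any explicit failure reporting.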

Pilot job flow
[Flowchart. States: Pilot job, Running pilot job, Pilot job with token. Transitions: Submit, Get unused token, Execute token task, Affirm token use, Delete token. Decision: Finished? (yes / no)]
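One plausible reading of this flow is a loop in which a pilot job keeps claiming unused tokens until none are left, and deletes a token only after its task has finished. The sketch below models that loop against an in-memory pool. The names `TokenPool`, `pilot_job`, and `expire_leases` are hypothetical, and lease expiry is simulated by an explicit call rather than by a timer.

```python
class TokenPool:
    """Hypothetical in-memory stand-in for a token pool server."""

    def __init__(self, tokens):
        self.tokens = dict(tokens)  # token id -> task description
        self.locked = set()         # tokens currently claimed by a pilot job

    def get_unused(self):
        """Claim and return (token id, task), or None if nothing is available."""
        for token_id in self.tokens:
            if token_id not in self.locked:
                self.locked.add(token_id)
                return token_id, self.tokens[token_id]
        return None

    def delete(self, token_id):
        """Remove a token whose task has finished."""
        self.tokens.pop(token_id, None)
        self.locked.discard(token_id)

    def expire_leases(self):
        """Simulate lease timeouts: every claimed token becomes unused again."""
        self.locked.clear()


def pilot_job(pool, execute, max_tasks=100):
    """Run tasks in sequence until the pool has no unused tokens left."""
    completed = []
    for _ in range(max_tasks):
        claimed = pool.get_unused()
        if claimed is None:
            break
        token_id, task = claimed
        try:
            execute(task)
        except Exception:
            # Not finished: keep the token so another pilot job can retry it
            # once the claim has expired.
            continue
        pool.delete(token_id)
        completed.append(token_id)
    return completed


attempts = {}


def flaky(task):
    """Example task runner that fails the first attempt at task 'b'."""
    attempts[task] = attempts.get(task, 0) + 1
    if task == "b" and attempts[task] == 1:
        raise RuntimeError("simulated worker-node failure")


pool = TokenPool({1: "a", 2: "b", 3: "c"})
first = pilot_job(pool, flaky)    # completes tokens 1 and 3
pool.expire_leases()              # the claim on token 2 times out
second = pilot_job(pool, flaky)   # a later pilot job retries token 2
print(first, second, pool.tokens)
```

Because a failed task leaves its token in the pool, resubmission needs no separate bookkeeping: the token is picked up again by whichever pilot job asks next.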

Advantages
• Simple design and use
• Using HTTP REST
• Automatic resubmissions
• Less overhead for large numbers of jobs: one pilot job can execute several tasks in sequence
• Improved scheduling
• Easy job administration by querying the Token Pool Server:
  • Progress
  • Failure rate

Topos I screenshots

Topos 2.x
• Interfaced by WebDAV instead of plain HTTP
• Tokens are files, i.e. they have
  • identity
  • content
  • mime-type
  • properties
• Token pools are directories
• Tokens can be moved between directories
• Allows users to build pipelines and workflows (high-level colored Petri nets)
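The pipeline idea can be sketched with ordinary directories standing in for token pools. This is an analogy on the local filesystem, not the Topos 2.x WebDAV interface: the pool names (`raw`, `aligned`, `done`), the helper `run_stage`, and the toy transformations are all made up for illustration, and the file rename plays the role that a WebDAV MOVE request would play against a server.

```python
import tempfile
from pathlib import Path


def run_stage(src, dst, transform):
    """Process every token file in pool `src`, then move it to pool `dst`."""
    moved = []
    for token in sorted(src.iterdir()):
        token.write_text(transform(token.read_text()))
        token.rename(dst / token.name)  # local analogue of a WebDAV MOVE
        moved.append(token.name)
    return moved


with tempfile.TemporaryDirectory() as root:
    # Each pipeline stage reads from one pool (directory) and feeds the next.
    pools = {name: Path(root) / name for name in ("raw", "aligned", "done")}
    for pool in pools.values():
        pool.mkdir()

    # Tokens are files: the name is the identity, the text is the content.
    for i in range(3):
        (pools["raw"] / f"token-{i}").write_text(f"sequence {i}")

    stage1 = run_stage(pools["raw"], pools["aligned"], str.upper)
    stage2 = run_stage(pools["aligned"], pools["done"], lambda s: s + " [ok]")

    results = {p.name: p.read_text() for p in sorted(pools["done"].iterdir())}
    raw_left = list(pools["raw"].iterdir())

print(results)
```

Seen this way, each directory behaves like a place in a Petri net and each stage like a transition that consumes tokens from one place and produces them in another.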

Topos 2 screenshot

“Portfolio”
• SciaGrid
  • Collaboration between SRON, KNMI, NIKHEF and SARA
  • Website where users can select
    • satellite data (Sciamachy)
    • data processors
• Arnold Kuzniar and Jack Leunissen (WUR)
  • BLAST protein sequence alignment
• Bas Dutilh (CMBI)
  • HAMMER sequence alignment (?)
• Jan Bot (TUD)

Future directions
• Documentation
• Atom/RSS instead of WebDAV
• Back to numbers instead of files
• TODO

pieterb@sara.nl
