ToPoS: High-Throughput Parallel Processing Pipelines on the Grid
Pieter van Beek
SARA Computing and Networking Services
High Performance Computing and Visualization
e-Science Support
User experiences with gLite
• Overhead for starting jobs is considerable
• Determining the best chunk size is difficult
  • Too small -> large overhead
  • Too large -> timeouts and throughput problems
• Resource brokering is far from optimal
• Jobs often fail, and users create their own tools for administrative tasks
Resource Brokering
Submitted jobs are sent to a CE (Computing Element) immediately. When another CE becomes available later, jobs that are already queued do not move to it automatically.
Failing Jobs (1)
• Common experiences:
  • Sorry, an Incomprehensible Error occurred
  • Your VOMS Credential has expired
  • What Job?
  • Success! (but there’s no output)
  • Failure! (but it ran just fine)
  • Out of Wall-time (but no CPU-time?)
• Many users end up writing the same “monitoring and resubmission” software again and again.
Failing Jobs (2)
• A real-world example:
  • 27,000 jobs
  • Duration: approx. 4 hrs per job
  • Approx. 280 WNs (worker nodes)
  • Theoretical duration: 16 days
• But with a success rate of 70% …
  • Approx. 9 resubmissions
  • “Practical” duration: >2 months
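The arithmetic behind these figures can be checked with a short sketch. It assumes that all worker nodes stay busy, that the 4 hours apply per job, that every failed job is resubmitted in the next round, and that the 70% success rate holds for each attempt. The function names are illustrative and not part of ToPoS or gLite.

```python
def theoretical_days(jobs, hours_per_job, worker_nodes):
    """Wall-clock days if every worker node is always busy and nothing fails."""
    return jobs * hours_per_job / worker_nodes / 24


def submission_rounds(jobs, success_rate):
    """Rounds (initial run plus resubmissions) until the expected number
    of unfinished jobs drops below one."""
    remaining = float(jobs)
    rounds = 0
    while remaining >= 1:
        remaining *= 1 - success_rate  # only the failed fraction is left over
        rounds += 1
    return rounds


if __name__ == "__main__":
    print(round(theoretical_days(27_000, 4, 280), 1))  # about 16 days
    print(submission_rounds(27_000, 0.7))              # 9 rounds
```

Under these assumptions the model gives about 16 days and 9 rounds, in line with the slide. It does not model queueing delays or the time users spend detecting failures, which is where the gap between 16 days and more than 2 months comes from.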
Pilot Jobs
• “Normal” jobs
• Pilot jobs
Simplest possible solution: ToPoS I
• An online counter, like a “page views” counter
• Numbers are “leased” for some period
• Leases must be renewed
• Accessed over HTTP (REST web service)
• Can be used with any HTTP client (wget, browsers)
• As little security as possible
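The leasing idea can be illustrated with a small in-memory model. This is a sketch only, not the ToPoS implementation: the class and method names are hypothetical, the real service is reached over HTTP, and the rule that a number with a lapsed lease is handed out again is an assumption about how leases lead to resubmission.

```python
import time


class LeaseCounter:
    """In-memory model of a counter whose numbers are leased, not given away."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable so the example can be tested
        self._next = 0        # next number that has never been handed out
        self._leases = {}     # number -> time at which its lease expires

    def lease(self, duration):
        """Return a number whose lease has expired, or else a fresh one."""
        now = self._clock()
        expired = [n for n, expiry in self._leases.items() if expiry <= now]
        if expired:
            number = min(expired)   # assumption: lapsed numbers are reissued
        else:
            number = self._next
            self._next += 1
        self._leases[number] = now + duration
        return number

    def renew(self, number, duration):
        """Extend a lease that is still valid; return False if it has lapsed."""
        now = self._clock()
        expiry = self._leases.get(number)
        if expiry is None or expiry <= now:
            return False
        self._leases[number] = now + duration
        return True

    def finish(self, number):
        """Mark the work for this number as done so it is never reissued."""
        self._leases.pop(number, None)
```

A job that crashes simply stops renewing its lease, so in this model its number becomes available to the next caller without any explicit failure report.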
Pilot job flow
[Flowchart. Node and edge labels: Pilot job; Submit; Running pilot job; Get unused token; Pilot job with token; Execute token task; affirm token use; Finished? (yes / no); Delete token]
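One possible reading of this flow, as a minimal sketch: a pilot job keeps fetching unused tokens, executes the task for each, and deletes a token only after its task succeeds. The `InMemoryPool` class, its method names, and `run_pilot` are hypothetical stand-ins for the token pool server. The periodic "affirm token use" step (renewing the lease while a task runs) is left out for brevity.

```python
class InMemoryPool:
    """Hypothetical stand-in for a token pool server."""

    def __init__(self, tokens):
        self.unused = list(tokens)
        self.in_use = []   # tokens handed out but not yet deleted

    def get_unused_token(self):
        """Hand out the next unused token, or None if there is none."""
        if not self.unused:
            return None
        token = self.unused.pop(0)
        self.in_use.append(token)
        return token

    def delete(self, token):
        """Remove a token whose task has completed."""
        self.in_use.remove(token)


def run_pilot(pool, execute):
    """Process tokens until none are unused; return the number completed."""
    completed = 0
    while True:
        token = pool.get_unused_token()
        if token is None:
            break
        try:
            execute(token)
        except Exception:
            # Leave the token undeleted; on a real server its lease would
            # lapse and another pilot job could pick it up.
            continue
        pool.delete(token)
        completed += 1
    return completed
```

Because a failed task leaves its token in the pool, resubmission needs no extra bookkeeping on the user's side in this model.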
Advantages
• Simple design and use
  • Using HTTP REST
• Automatic resubmissions
• Less overhead for large numbers of jobs: one pilot job can execute several tasks in sequence
• Improved scheduling
• Easy job administration by querying the token pool server
  • Progress
  • Fail rate
ToPoS 2.x
• Accessed via WebDAV instead of plain HTTP
• Tokens are files, i.e. they have
  • identity
  • content
  • mime-type
  • properties
• Token pools are directories
• Tokens can be moved between directories
• Allows users to build pipelines and workflows (high-level colored Petri nets)
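The pipeline idea can be sketched with ordinary local directories standing in for WebDAV collections. This is an analogy under stated assumptions, not ToPoS code: `run_stage` is a hypothetical helper, and on a WebDAV server the rename below would correspond to a MOVE request. Each directory acts as a token pool, and a stage processes a token's content before moving it to the next pool.

```python
from pathlib import Path


def run_stage(source: Path, target: Path, transform) -> int:
    """Apply transform to every token file in source, then move it to target.

    Returns the number of tokens moved.
    """
    target.mkdir(parents=True, exist_ok=True)
    moved = 0
    # sorted() materializes the listing before any file is renamed.
    for token in sorted(source.iterdir()):
        if not token.is_file():
            continue
        token.write_text(transform(token.read_text()))
        token.rename(target / token.name)   # token changes pool
        moved += 1
    return moved


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as root:
        raw = Path(root) / "raw"
        raw.mkdir()
        (raw / "token-1").write_text("acgt")
        run_stage(raw, Path(root) / "processed", str.upper)
        print((Path(root) / "processed" / "token-1").read_text())
```

Chaining several such stages, each reading from the directory the previous one writes to, gives a simple pipeline in which a token's location records how far it has progressed.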
“Portfolio”
• SciaGrid
  • Collaboration between SRON, KNMI, NIKHEF and SARA
  • Website where users can select
    • satellite data (Sciamachy)
    • data processors
• Arnold Kuzniar and Jack Leunissen (WUR)
  • BLAST protein sequence alignment
• Bas Dutilh (CMBI)
  • HAMMER sequence alignment (?)
• Jan Bot (TUD)
Future directions
• Documentation
• Atom/RSS instead of WebDAV
• Back to numbers instead of files
• TODO