{swarm}: Scheduling Large-scale Jobs over Loosely-Coupled HPC Clusters
Sangmi Lee Pallickara and Marlon Pierce, Indiana University
Distributed HPC Clusters
• How to access the clusters?
• How to submit jobs remotely?
• How to decide where to submit?
• How to track submitted jobs?
• How to detect faults?
Clients: desktop users, Web portals, and scientific gateways
Motivation 1: mRNA Sequence Clustering and Assembly Pipeline
• Sequence assembly: deriving consensus sequences (contigs) from individual overlapping DNA fragments.
• Expressed Sequence Tag (EST) sequencing: assembling fragments of messenger RNAs.
• Stage 1: data preprocessing (serial job)
• Stage 2: clustering mRNA fragments (large-scale parallel job)
• Stage 3: assembling fragments within each cluster (large number of parallel or serial jobs)
Motivation 2: BioDrugScore Project
• Samy Merough, IU School of Medicine
• Application for computational drug design
• Docking of a large number of compounds to a binding pocket, followed by ranking of the ensuing complexes and selection of the top candidates.
• Customizing scoring functions for each pocket within these large numbers of receptors to rank molecules and predict efficacy and potential toxicity due to off-target effects early in the discovery process.
Workflow:
1. Structure Selection: ligands are chosen via search results limited by user-provided property ranges
2. Parameter Selection: parameters on which to derive the function are chosen via check boxes
3. Calculate Terms: TeraGrid calculates energy and entropy terms via AMBER
4. Parameter Derivation: the function is displayed with numerical and graphical details
5. Validation Stage: function validation results are displayed
Existing and Ongoing Activities
• Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS) at SC08
• BoF on Megajobs: How to Run One Million Jobs, at SC08
Computational Challenges
• Dealing with the policies of different HPC clusters: limits on concurrent jobs in the batch queue, maximum wall-clock time
• Dealing with different clusters: hardware, software, and maintenance issues
• Dealing with different jobs: parallel jobs, serial jobs
• Tracking a large number of jobs
• Data management for input, output, and temporary files, including log/error messages, while executing a large number of jobs
• Managing faults: detecting, reacting, recovering, and preventing
SWARM at a Glance
• Schedules millions of jobs over distributed clusters
• A monitoring framework for large-scale jobs
• User-based job scheduling
• Ranks resources based on predicted wait times
• Standard Web Service interface for Web applications
• Extensible design for domain-specific software logic
Serves desktop users, Web portals, and scientific gateways across distributed HPC clusters
Architecture
• Standard Web Service interface for desktop users, Web portals, and gateway-style applications
• Components: Request Manager, Resource Ranking Manager, Data Model Manager, Fault Manager, Job Distributor, and Job Execution Manager (Condor-G with Birdbath), backed by an RDBMS
• Per-user state: job board, job queue, and resource pool holding tokens for resources X, Y, and Z
• Resource pool: each user keeps a resource pool containing a certain number of tokens for each computing resource; the number of tokens represents the limit on concurrent jobs in the batch system of the targeted resource
• QBETS Web Service (hosted by UCSB) supplies batch-queue delay predictions for resource ranking
• MyProxy server (hosted by the TeraGrid project) supplies credentials
• Targets: high-performance computing clusters, including Grid-style clusters and Condor computing nodes
Queuing Groups of Jobs and Matchmaking
• Per-user task queue (e.g., for User A): requests are retrieved first-in-first-served from the backend database
• The Job Distributor finds the first available resource, following the order of the job's resource list
• Resource pool: each user keeps a resource pool containing a certain number of tokens; the number of tokens represents the limit on concurrent jobs in the batch system of the targeted resource
• Example token pools: Lonestar (User A, 100 tokens), ORNL (User A, 10 tokens), BigRed (User A, 800 tokens), Steele (User A, 500 tokens), Cobalt (User B, 300 tokens)
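The first-in-first-served matchmaking described above can be sketched as follows. This is a minimal illustration, not Swarm's implementation; the class and function names, and the token limits, are assumptions for the example.

```python
from collections import deque

class ResourcePool:
    """Per-user pool of tokens; one token represents one concurrent
    job slot in the batch system of a target resource."""
    def __init__(self, token_limits):
        self.tokens = dict(token_limits)  # resource name -> available tokens

    def acquire(self, resource):
        # Take a token if one is available for this resource.
        if self.tokens.get(resource, 0) > 0:
            self.tokens[resource] -= 1
            return True
        return False

    def release(self, resource):
        # Return a token when a job finishes on this resource.
        self.tokens[resource] += 1

def dispatch(job_queue, ranked_resources, pool):
    """First-in-first-served: pop jobs in arrival order and place each
    on the first resource in the ranked list with a free token."""
    placements = []
    while job_queue:
        for resource in ranked_resources:
            if pool.acquire(resource):
                placements.append((job_queue.popleft(), resource))
                break
        else:
            break  # no tokens on any resource: leave remaining jobs queued
    return placements

# Hypothetical, tiny token limits so the queue exhausts quickly
pool = ResourcePool({"Lonestar": 2, "ORNL": 1})
jobs = deque(["job-1", "job-2", "job-3", "job-4"])
print(dispatch(jobs, ["Lonestar", "ORNL"], pool))  # job-4 stays queued
```

Releasing a token on job completion (via `release`) is what lets the distributor keep the batch queue filled up to the per-resource concurrency limit without exceeding it.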
Describing a Group of Jobs
• Standard Web Service interface
• Creating a ticket for the job group
• Submitting job(s) with the associated ticket
• Parameters encoded in the Web Service interface:
• Location of executables
• Location of input data
• Location of output data
• Arguments
• Job type (serial, parallel)
• Wall-clock time
• Memory requirement
• Required number of computing nodes
• List of resources that can execute the job (sorted or not)
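A ticket-plus-job-description submission along these lines can be sketched as below. The field and function names are illustrative assumptions, not Swarm's actual WSDL schema; only the parameter list itself comes from the slide.

```python
import uuid
from dataclasses import dataclass

@dataclass
class JobDescription:
    """One job in a group, carrying the parameters the interface encodes."""
    executable: str          # location of the executable
    input_location: str      # location of input data
    output_location: str     # location of output data
    arguments: list          # command-line arguments
    job_type: str            # "serial" or "parallel"
    wall_clock_minutes: int  # requested wall-clock time
    memory_mb: int           # memory requirement
    node_count: int          # required number of computing nodes
    resource_list: list      # candidate clusters, sorted or unsorted

def create_ticket():
    """Create an opaque ticket identifying a job group."""
    return str(uuid.uuid4())

def submit_group(ticket, jobs, registry):
    """Associate each submitted job with its ticket for later tracking."""
    registry.setdefault(ticket, []).extend(jobs)
    return len(registry[ticket])
```

The ticket lets a client submit jobs incrementally and later query the whole group's status with a single identifier.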
Ranking the Computing Resources
• Sorting the list of resources based on the predicted wait time for each batch queue system
• Maintaining the resource ranking table through periodic queries to QBETS, which provides batch-queue delay predictions
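The ranking step reduces to sorting candidates by predicted queue delay. A minimal sketch, where the prediction table stands in for periodic QBETS queries and the numbers are made up:

```python
def rank_resources(resources, predicted_wait):
    """Sort candidate clusters by predicted batch-queue delay (seconds).
    predicted_wait is a cached table refreshed periodically from the
    prediction service; resources with no prediction sort last."""
    return sorted(resources, key=lambda r: predicted_wait.get(r, float("inf")))

# Hypothetical predictions; real values would come from QBETS
predictions = {"BigRed": 1200, "Steele": 300, "Lonestar": 3600}
print(rank_resources(["Lonestar", "BigRed", "Steele"], predictions))
# Steele first: shortest predicted wait
```

The sorted list then feeds the matchmaking step, which walks it in order until a resource with a free token is found.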
Tracking the Status of Jobs
• Statistical approach to provide the big picture for a large number of jobs
• Tracking each job through its states:
• Requested
• Queued
• Submitted
• Idle
• Completed
• Held
• Running
• Stores information about each job submission (job description, timestamp, status history), making job failures easy to track
• Users can design their own data model for log and error files: useful for Web portal or gateway-style applications
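The "statistical big picture" amounts to aggregating per-job states into group-level counts. A small sketch under that assumption; the state names come from the slide, the function is illustrative:

```python
from collections import Counter

STATES = ("Requested", "Queued", "Submitted", "Idle",
          "Running", "Held", "Completed")

def summarize(job_statuses):
    """Collapse per-job states into counts per state, giving a
    group-level overview instead of one row per job."""
    counts = Counter(job_statuses.values())
    return {state: counts.get(state, 0) for state in STATES}

# Hypothetical snapshot of a small job group
statuses = {"j1": "Running", "j2": "Completed", "j3": "Queued", "j4": "Running"}
print(summarize(statuses))
```

For hundreds of thousands of jobs, this kind of summary query against the status history table is what keeps monitoring tractable.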
Performance Evaluation
• Host: 4.50 GHz Intel Pentium 4 CPU, 1 GB RAM
• Client: 2.33 GHz Intel Xeon CPU, 8 GB RAM
• 1 Gbps network
• Axis2 Web service container
• Test scenario for the multi-user environment
Total turnaround time for the job submission and status check with various job sizes in a single-user environment
Average turnaround time for the various job sizes with varying number of concurrent users
Average turnaround time per operation with varying numbers of concurrent users
Conclusions
• Swarm provides a lightweight framework for users to submit a large number of jobs.
• Swarm manages a large number of jobs based on user preferences.
• Swarm schedules over 100,000 jobs and prioritizes multiple resources based on queue delay prediction.
• Swarm provides a job monitoring scheme for large numbers of job submissions.
• Swarm provides easily customizable software for application-specific requirements, including the data model and fault handling scheme.
Future Work
• Intelligent fault detection
• Extensible fault handling
• Proactive fault prevention scheme
• Integration with large-scale data management to handle input and output data
• Scalability for extremely large clusters
Thanks! {swarm}