This SEBD tutorial shows how to detect misconfigured machines in grid batch systems with a distributed outlier-detection algorithm, preventing failures and improving system efficiency.
Mining for Misconfigured Machines in Grid Batch Systems
[forthcoming SIGKDD 2006]
Together with: Noam Palatin, Arie Lazarovitch, Ran Wolff
SEBD Tutorial
Batch System in Grid
(figure: job submission, resource broker, and execution machines)
• Batch systems in grids span many organizations and administration sites
• 10,000s of machines
• Heterogeneous machines
• Non-dedicated machines
• Different installations and configurations
Motivation
• Many potential causes of failures and misbehavior
  • Software bugs, hardware, network, configuration
• Current solutions
  • Manual diagnosis
  • Rule-based expert systems
• Data mining
  • Limited, if any, prior knowledge
• Related work: [Chen et al. '04, classification of eBay failures]
Solution Guidelines and Challenges
• Function shipping is cheaper than data shipping
  • Distributed warehousing
• Centralized data requires special resources
  • Distributed data mining
• Low availability of machines
  • Asynchronous algorithm
• Grid limitations
  • Use cross-domain grid services
  • If possible, piggyback on the underlying system
• Portability (NetBatch, Condor, LSF, …)
  • Batch system ontology
• Non-intrusive data collection
  • Use available logs
  • Enrich with static data
Grid Monitor System (GMS)
(figure: data collector feeding a distributed database, which feeds the data miner)
• Data collector: non-intrusive
• Distributed database with preprocessing
• Data miner: distributed
Data Preprocessing
Example condor_starter log (the slide annotates it with the concepts Job, Job Submission, Job Execution, Job Execution Events, and Job Execution Status):
5/2 01:16:23 ******************************************************
5/2 01:16:23 ** condor_starter (CONDOR_STARTER) STARTING UP
5/2 01:16:23 ** /usr/local/condor/glibc23/condor-6.7.18/sbin/condor_starter
5/2 01:16:23 ** $CondorVersion: 6.7.18 Mar 11 2006 PRE-RELEASE-UWCS $
5/2 01:16:23 ** $CondorPlatform: I386-LINUX_RH9 $
5/2 01:16:23 ** PID = 17381
5/2 01:16:23 ******************************************************
5/2 01:16:23 Submitting machine is "ds-con-sub.cs.technion.ac.il"
5/2 01:16:23 Submit UidDomain: "cs.technion.ac.il"
5/2 01:16:23 Initialized user_priv as "USER1"
5/2 01:16:23 Starting a VANILLA universe job with ID: 180.5
5/2 01:16:23 IWD: /home/user1/
5/2 01:16:23 Input file: /dev/null
5/2 01:16:23 Output file: /home/user1/output.out
5/2 01:16:23 Error file: /home/user1/error.err
5/2 01:16:23 exec condor_exec.exe /home/user1/my_program.sh
5/2 01:16:23 Env = _CONDOR_SCRATCH_DIR=/var/condor/execute/dir_17381
5/2 01:16:23 Create_Process succeeded, pid=17382
5/2 02:06:44 Process exited, pid=17382, status=0
5/2 02:06:44 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
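To make the preprocessing step concrete, here is a minimal sketch of turning a condor_starter log such as the one above into structured job events. It is not the actual GMS parser; the event names and regular expressions are illustrative assumptions.

```python
import re

# Hypothetical patterns for a few condor_starter events; the real GMS
# preprocessing and its field names are not shown in the slides.
PATTERNS = {
    "submit_machine": re.compile(r'Submitting machine is "(?P<host>[^"]+)"'),
    "job_start": re.compile(r"Starting a (?P<universe>\w+) universe job with ID: (?P<job_id>[\d.]+)"),
    "process_create": re.compile(r"Create_Process succeeded, pid=(?P<pid>\d+)"),
    "process_exit": re.compile(r"Process exited, pid=(?P<pid>\d+), status=(?P<status>\d+)"),
}

def parse_starter_log(lines):
    """Turn raw condor_starter log lines into (timestamp, event, fields) records."""
    events = []
    for line in lines:
        # Each log line starts with "<month>/<day> <HH:MM:SS>".
        head = re.match(r"(\d+/\d+ \d{2}:\d{2}:\d{2}) (.*)", line)
        if not head:
            continue
        timestamp, rest = head.groups()
        for name, pattern in PATTERNS.items():
            hit = pattern.search(rest)
            if hit:
                events.append((timestamp, name, hit.groupdict()))
                break
    return events

# Usage: events = parse_starter_log(log_text.splitlines())
```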
Data Preprocessing
• Batch system ontology: a hierarchical structure of general concepts
  • Machine, Job, Pool, Matchmaking
• Benefits
  • Enriches the data
  • Reduces the data
  • Increases the portability of the GMS
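As an illustration of the data reduction, parsed job events can be rolled up into per-machine records that match the ontology's Machine concept. This is only a sketch under assumed attribute names (jobs_run, failure_rate, mean_runtime); the attributes actually used by the GMS are not listed in the slides.

```python
from collections import defaultdict
from statistics import mean

def build_machine_records(job_records):
    """Roll per-job records up into one feature vector per machine.

    job_records: iterable of dicts with the (assumed) keys
    'machine', 'runtime_sec' and 'exit_status'.
    """
    per_machine = defaultdict(list)
    for job in job_records:
        per_machine[job["machine"]].append(job)

    machines = {}
    for machine, jobs in per_machine.items():
        failures = sum(1 for j in jobs if j["exit_status"] != 0)
        machines[machine] = {
            "jobs_run": len(jobs),
            "failure_rate": failures / len(jobs),
            "mean_runtime": mean(j["runtime_sec"] for j in jobs),
        }
    return machines
```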
Detecting Misconfigured Machines
• Misconfigured machines can
  • Obstruct the entire organization
  • Reduce goodput
  • Become black holes
• Our solution: distributed outlier detection
• Assumptions and limitations
  • The vast majority of machines are properly configured
  • Misconfigured machines behave differently
  • The pool contains clusters of similar machines
  • Difficult data distribution
Distributed HilOut Algorithm
• Principles
  • Works on only a subset of the points
  • Computes the exact solution
  • Handles distributed data by sharing relevant points
• The rule (see the sketch below): a participant shares a point if
  • It is an outlier of the current shared set
  • It is a neighbor of an outlier of the current shared set
  • It proves that a neighbor of an outlier is not an outlier itself
  • It is a neighbor of a point thought to be an outlier by another participant of the process
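The following is a simplified sketch of the sharing rule only, not the actual distributed HilOut implementation: it uses a plain k-NN-distance outlier score instead of HilOut's Hilbert-curve approximation and iterative refinement, it merges the two "neighbor of a shared outlier" conditions into one check, and all function and parameter names are illustrative assumptions.

```python
import numpy as np

def knn_distances(center, points, k=5):
    """Ascending distances from `center` to its k nearest neighbours in `points`
    (zero distances, i.e. the point itself, are ignored)."""
    d = np.sort(np.linalg.norm(points - center, axis=1))
    return d[d > 0][:k]

def outlier_score(center, points, k=5):
    """HilOut-style weight: the sum of the k nearest-neighbour distances."""
    return knn_distances(center, points, k).sum()

def top_outliers(points, k=5, n=4):
    """Indices of the n points with the largest outlier score in `points`."""
    scores = [outlier_score(p, points, k) for p in points]
    return set(np.argsort(scores)[-n:])

def is_neighbour(point, center, points, k=5):
    """True if `point` would be among the k nearest neighbours of `center`."""
    knn = knn_distances(center, points, k)
    return len(knn) < k or np.linalg.norm(center - point) <= knn[-1]

def should_share(point, shared, suspected_by_others, k=5, n=4):
    """Decide whether a local point has to be sent to the other participants.

    `shared` holds the points already shared by everyone; `suspected_by_others`
    holds points that other participants currently report as outliers.
    """
    candidate = np.vstack([shared, point])
    outliers = top_outliers(candidate, k, n)

    # (1) the point itself is an outlier of the current shared set
    if len(candidate) - 1 in outliers:
        return True
    # (2) the point is a neighbour of a shared outlier; sharing it can either
    #     confirm the outlier or show that its neighbourhood is dense
    if any(is_neighbour(point, candidate[i], candidate, k) for i in outliers):
        return True
    # (3) the point is a neighbour of a point another participant suspects
    return any(is_neighbour(point, np.asarray(q), candidate, k) for q in suspected_by_others)
```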
Distributed HilOut Algorithm
(animation figures omitted: a step-by-step illustration of the point-sharing protocol between participants P1, P2, P3 with local sets S1, S2, S3 and shared sets SG1, SG2, SG3 converging to the global set SG)
Evaluation
• Environment: 42 Linux machines
  • 10 Xeon 1800 MHz with 1 GB RAM
  • 6 Xeon 2400 MHz with 2 GB RAM
  • 26 IBM PowerPC 2200 MHz, 64-bit, with 4 GB RAM
• Benchmarks: about 9000 jobs were executed
  • BYTEmark
  • Bonnie
Evaluation – Qualitative Results
• 3 of the top 4 suspected machines were actually misconfigured
  • bh10: unknown reason
  • i4: loaded by a network service
  • bh13: HyperThreading was active
  • i3: root file system was nearly full
Evaluation – Quantitative Results
(figures: scalability with the number of machines and the amount of data, n = 7; scalability with the number of outliers n; BYTEmark and Bonnie panels)
Evaluation – Interoperability
(figures: a grid pool under regular working load; BYTEmark, |SN| = 4334; Bonnie, |SN| = 4560)
Evaluation – Interoperability
(figures: a grid pool with twenty of the computers disabled; BYTEmark, |SN| = 4334; Bonnie, |SN| = 4560)
Future Work
• Other algorithms for the GMS
  • Discovering sources of malfunctions
  • Discovering bottlenecks
  • System optimization
• HCI (a smarter user interface)
• Dealing with temporal behaviors