Mining for Misconfigured Machines in Grid Batch Systems [forthcoming SIGKDD 2006]

Presentation Transcript


  1. Mining for Misconfigured Machines in Grid Batch Systems [forthcoming SIGKDD 2006]
     Together with: Noam Palatin, Arie Lazarovitch, Ran Wolff
     SEBD Tutorial

  2. Batch System in Grid [diagram labels: Submission, Resource broker, Execution]
     • Batch systems in Grid
     • Many organizations or administration sites
     • 10,000s of machines
     • Heterogeneous machines
     • Non-dedicated
     • Different installation and configuration

  3. Motivation
     • Many potential causes of failures and misbehaviors
     • Software bugs, hardware, network, configuration
     • Current solutions
     • Manual diagnosis
     • Rule-based expert system
     • Data mining
     • Limited, if any, prior knowledge
     • Related work: [Chen et al. 04, classification of eBay failures]

  4. Solution Guidelines and Challenges
     • Function shipping is cheaper than data shipping
     • Distributed warehousing
     • Centralized data requires special resources
     • Distributed data mining
     • Low availability of machines
     • Asynchronous algorithm
     • Grid limitations
     • Use cross-domain grid services
     • If possible, piggyback on the underlying system
     • Portability
     • NetBatch, Condor, LSF, …
     • Batch system ontology
     • Non-intrusive data collection
     • Use available logs
     • Enrich with static data

  5. Grid Monitor System (GMS) [diagram labels: Data collector, Data miner]
     • Data collector
     • Non-intrusive
     • Distributed database
     • Preprocessing
     • Data miner
     • Distributed

  6. Data Preprocessing
     5/2 01:16:23 ******************************************************
     5/2 01:16:23 ** condor_starter (CONDOR_STARTER) STARTING UP
     5/2 01:16:23 ** /usr/local/condor/glibc23/condor-6.7.18/sbin/condor_starter
     5/2 01:16:23 ** $CondorVersion: 6.7.18 Mar 11 2006 PRE-RELEASE-UWCS $
     5/2 01:16:23 ** $CondorPlatform: I386-LINUX_RH9 $
     5/2 01:16:23 ** PID = 17381
     5/2 01:16:23 ******************************************************
     5/2 01:16:23 Submitting machine is "ds-con-sub.cs.technion.ac.il"
     5/2 01:16:23 Submit UidDomain: "cs.technion.ac.il"
     5/2 01:16:23 Initialized user_priv as "USER1"
     5/2 01:16:23 Starting a VANILLA universe job with ID: 180.5
     5/2 01:16:23 IWD: /home/user1/
     5/2 01:16:23 Input file: /dev/null
     5/2 01:16:23 Output file: /home/user1/output.out
     5/2 01:16:23 Error file: /home/user1/error.err
     5/2 01:16:23 exec condor_exec.exe /home/user1/my_program.sh
     5/2 01:16:23 Env = _CONDOR_SCRATCH_DIR=/var/condor/execute/dir_17381
     5/2 01:16:23 Create_Process succeeded, pid=17382
     5/2 02:06:44 Process exited, pid=17382, status=0
     5/2 02:06:44 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
     [Diagram labels: Job, Job Execution, Job Execution Status, Job Execution Events, Job Submission]
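
The preprocessing step on this slide turns raw starter log lines into structured job-execution records. The following is only a minimal sketch of that idea, not the GMS implementation: the `JobExecution` record, its field names, and the `parse_starter_log` function are hypothetical, and the regular expressions are written against the log excerpt shown on the slide rather than against any documented Condor log format.

```python
import re
from dataclasses import dataclass
from typing import Optional

# A log line is "<month>/<day> <hh:mm:ss> <message>", as in the slide's excerpt.
LINE_RE = re.compile(r"^(\d+/\d+ \d{2}:\d{2}:\d{2}) (.*)$")


@dataclass
class JobExecution:
    """Hypothetical record for one job execution (field names are assumptions)."""
    job_id: Optional[str] = None
    universe: Optional[str] = None
    submit_machine: Optional[str] = None
    start_time: Optional[str] = None
    end_time: Optional[str] = None
    exit_status: Optional[int] = None


def parse_starter_log(text: str) -> JobExecution:
    """Extract a few job-execution attributes from a starter log excerpt."""
    rec = JobExecution()
    for raw in text.splitlines():
        m = LINE_RE.match(raw.strip())
        if not m:
            continue  # skip lines without a timestamp prefix
        stamp, msg = m.groups()
        if hit := re.search(r'Submitting machine is "([^"]+)"', msg):
            rec.submit_machine = hit.group(1)
        elif hit := re.search(r"Starting a (\w+) universe job with ID: ([\d.]+)", msg):
            rec.universe, rec.job_id = hit.group(1), hit.group(2)
            rec.start_time = stamp
        elif hit := re.search(r"Process exited, pid=\d+, status=(\d+)", msg):
            rec.exit_status = int(hit.group(1))
            rec.end_time = stamp
    return rec


# Four lines taken from the excerpt on the slide.
SAMPLE = """\
5/2 01:16:23 Submitting machine is "ds-con-sub.cs.technion.ac.il"
5/2 01:16:23 Starting a VANILLA universe job with ID: 180.5
5/2 01:16:23 Create_Process succeeded, pid=17382
5/2 02:06:44 Process exited, pid=17382, status=0
"""

record = parse_starter_log(SAMPLE)
print(record)
```

Timestamps are kept as strings because the log lines carry no year; a real collector would have to take the year from elsewhere before computing durations.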

  7. Data Preprocessing
     • Batch system ontology: hierarchical structure
     • General concepts
     • Machine, Job, Pool, Matchmaking
     • Benefits:
     • Enriches the data
     • Data reduction
     • Increases the GMS portability

  8. Detecting Misconfigured Machines
     • Misconfigured machines can
     • Obstruct the entire organization
     • Reduce goodput
     • Act as black holes
     • Our solution: Distributed outlier detection
     • Assumptions and limitations
     • Vast majority of machines are properly configured
     • Misconfigured machines behave differently
     • Pool contains clusters of similar machines
     • Difficult distribution

  9. Distributed HilOut Algorithm
     • Principles
     • Only a subset of points
     • Exact solution
     • Distributed data
     • Sharing relevant points
     • The rule
     • Share a point if
     • It is an outlier of the current shared set
     • It is a neighbor of an outlier of the current shared set
     • It is a proof that a neighbor of an outlier is not an outlier itself
     • It is a neighbor of a point thought to be an outlier by another participant of the process
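
The slide does not spell out the outlier definition that the sharing rule relies on. HilOut scores each point by the sum of distances to its k nearest neighbors and reports the n highest-scoring points as outliers. The sketch below computes that score by brute force on a single node; it is not the distributed algorithm and it omits the Hilbert-curve approximation that HilOut itself uses. The machine names, the two-dimensional feature vectors, and the function names are all hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Sequence[float]


def knn_weight(name: str, points: Dict[str, Point], k: int) -> float:
    """Sum of distances from one point to its k nearest neighbors."""
    p = points[name]
    dists = sorted(
        math.dist(p, q) for other, q in points.items() if other != name
    )
    return sum(dists[:k])


def top_n_outliers(
    points: Dict[str, Point], k: int, n: int
) -> List[Tuple[str, float]]:
    """Return the n points with the largest k-NN weight, largest first."""
    weights = [(knn_weight(name, points, k), name) for name in points]
    # Sort by descending weight; break ties by name for determinism.
    weights.sort(key=lambda t: (-t[0], t[1]))
    return [(name, w) for w, name in weights[:n]]


# Hypothetical per-machine feature vectors (e.g. two benchmark scores).
machines = {
    "m1": (1.0, 1.0),
    "m2": (1.1, 0.9),
    "m3": (0.9, 1.1),
    "m4": (1.0, 1.2),
    "m5": (5.0, 5.0),  # far from the cluster of similar machines
}

suspects = top_n_outliers(machines, k=2, n=1)
print(suspects)
```

With these made-up values, `m5` ranks first because it lies far from the cluster formed by the other four machines, which matches the slide's assumption that misconfigured machines behave differently from a majority of similar ones.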

  10. Distributed HilOut Algorithm

  11. Distributed HilOut Algorithm

  12. Distributed HilOut Algorithm

  13. Distributed HilOut Algorithm

  14. Distributed HilOut Algorithm [figure labels: P1, P2, P3; S1, S2, S3; SG, SG2, SG3; numbers 1, 1, 1, 2]

  15. Distributed HilOut Algorithm [figure labels: P1, P2, P3; S1, S2, S3; SG, SG2, SG3; numbers 1, 3, 2, 1]

  16. Distributed HilOut Algorithm [figure labels: P1, P2, P3; S1, S2, S3; SG, SG1, SG2, SG3; numbers 1, 3]

  17. Evaluation
     • Environment:
     • Machines: 42 Linux machines
     • 10 XEON 1800 MHz with 1 GB RAM
     • 6 XEON 2400 MHz with 2 GB RAM
     • 26 IBM PowerPC 2200 MHz 64-bit with 4 GB RAM
     • Benchmarks: about 9000 jobs were executed
     • BYTEmark
     • Bonnie

  18. Evaluation – Qualitative Results
     • 3 of the top 4 suspected machines are actually misconfigured
     • bh10: unknown reason
     • i4: loaded by network service
     • bh13: active HyperThreading
     • i3: root file system was nearly full

  19. Evaluation – Quantitative Results
     [Plots for BYTEmark and Bonnie: scalability with the number of machines and data (n = 7); scalability with the number of outliers (n)]

  20. Evaluation – Interoperability
     • A grid pool under regular working load
     [Plots: BYTEmark (|SN| = 4334); Bonnie (|SN| = 4560)]

  21. Evaluation – Interoperability
     • A grid pool with twenty of the computers disabled
     [Plots: BYTEmark (|SN| = 4334); Bonnie (|SN| = 4560)]

  22. Future Work
     • Other algorithms for GMS
     • Discovering sources of malfunctions
     • Discovering bottlenecks
     • System optimization
     • HCI (smarter user interface)
     • Dealing with temporal behaviors