This paper explores the challenges of troubleshooting distributed systems and proposes using data mining techniques to identify patterns and solve system failures. It discusses various types of failures and suggests user-oriented diagnostic tools to improve system performance.
Troubleshooting Distributed Systems via Data Mining
David Cieslak, Douglas Thain, and Nitesh Chawla
University of Notre Dame
It’s Ugly in the Real World
• Machine related failures:
  • Power outages, network outages, faulty memory, corrupted file system, bad config files, expired certs, packet filters...
• Job related failures:
  • Crash on some args, bad executable, missing input files, mistake in args, missing components, failure to understand dependencies...
• Incompatibilities between jobs and machines:
  • Missing libraries, not enough disk/cpu/mem, wrong software installed, wrong version installed, wrong memory layout...
• Load related failures:
  • Slow actions induce timeouts; kernel tables: files, sockets, procs; router tables: addresses, routes, connections; competition with other users...
• Non-deterministic failures:
  • Multi-thread/CPU synchronization, event interleaving across systems, random number generators, interactive effects, cosmic rays...
Reports of Bad News
Grid2003: Thirty percent of ATLAS/CMS jobs failed!
“Jobs often failed due to site configuration problems, or in groups from site service failures”
- R. Gardner et al, “The Grid2003 Production Grid: Principles and Practice”, HPDC 2003.
“Users ... need tools to help debug why failures happen.” Need “user oriented diagnostic tools”
- J. Schopf, “State of Grid Users: 25 Conversations with UK eScience Groups”, Argonne Tech Report ANL/MCS-TM-278.
A “Grand Challenge” Problem:
• A user submits one million jobs to the grid.
• Half of them fail.
• Now what?
  • Examine the output of every failed job?
  • Log in to every site to examine the logs?
  • Resubmit and hope for the best?
• We need some way of getting the big picture.
• We need to identify problems not seen before.
The Wisdom of Secretary Rumsfeld
As we know,
There are known knowns.
There are things we know we know.
We also know
There are known unknowns.
That is to say
We know there are some things
We do not know.
But there are also unknown unknowns,
The ones we don't know
We don't know.
- Donald Rumsfeld, 12 February 2002
An Idea:
• We have lots of structured information about the components of a grid.
• Can we perform some form of data mining to discover the big picture of what is going on?
  • User: Your jobs work fine on RH Linux 12.1 and 12.3 but they always seem to crash on version 12.2.
  • Admin: User “joe” is running 1000s of jobs that transfer 10 TB of data and then fail immediately; perhaps he needs help.
• Can we act on this information to improve the system?
  • User: Avoid resources that aren't working for you.
  • Admin: Assist the user in understanding and fixing the problem.
Job ClassAd
MyType = "Job"
TargetType = "Machine"
ClusterId = 11839
QDate = 1150231068
CompletionDate = 0
Owner = "dcieslak"
JobUniverse = 5
Cmd = "ripper-cost-can-9-50.sh"
LocalUserCpu = 0.000000
LocalSysCpu = 0.000000
ExitStatus = 0
ImageSize = 40000
DiskUsage = 110000
NumCkpts = 0
NumRestarts = 0
NumSystemHolds = 0
CommittedTime = 0
ExitBySignal = FALSE
PoolName = "ccl00.cse.nd.edu"
CondorVersion = "6.7.19 May 10 2006"
CondorPlatform = I386-LINUX_RH9
RootDir = "/"
Iwd = "/tmp/dcieslak/smotewrap1"
MinHosts = 1
WantRemoteSyscalls = FALSE
WantCheckpoint = FALSE
JobPrio = 0
User = "dcieslak@nd.edu"
NiceUser = FALSE
Env = "LD_LIBRARY_PATH=."
EnvDelim = ";"
JobNotification = 0
WantRemoteIO = TRUE
UserLog = "/tmp/dcieslak/smotewrap1/ripper-cost-can-9-50.log"
CoreSize = -1
KillSig = "SIGTERM"
Rank = 0.000000
In = "/dev/null"
TransferIn = FALSE
Out = "ripper-cost-can-9-50.output"
StreamOut = FALSE
Err = "ripper-cost-can-9-50.error"
StreamErr = FALSE
BufferSize = 524288
BufferBlockSize = 32768
ShouldTransferFiles = "YES"
WhenToTransferOutput = "ON_EXIT_OR_EVICT"
TransferFiles = "ALWAYS"
TransferInput = "scripts.tar.gz,can-ripper.tar.gz"
TransferOutput = "ripper-cost-50-can-9.tar.gz"
ExecutableSize_RAW = 1
ExecutableSize = 10000
Requirements = (OpSys == "LINUX") && (Arch == "INTEL") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && (HasFileTransfer)
JobLeaseDuration = 1200
PeriodicHold = FALSE
PeriodicRelease = FALSE
PeriodicRemove = FALSE
OnExitHold = FALSE
OnExitRemove = TRUE
LeaveJobInQueue = FALSE
Arguments = ""
GlobalJobId = "cclbuild02.cse.nd.edu#1150231069#11839.0"
ProcId = 0
AutoClusterId = 0
AutoClusterAttrs = "Owner,Requirements"
JobStartDate = 1150256907
LastRejMatchReason = "no match found"
LastRejMatchTime = 1150815515
TotalSuspensions = 73
CumulativeSuspensionTime = 8179
RemoteWallClockTime = 432493.000000
LastRemoteHost = "hobbes.helios.nd.edu"
LastClaimId = "<129.74.221.168:9359>#1150811733#2"
MaxHosts = 1
WantMatchDiagnostics = TRUE
LastMatchTime = 1150817352
NumJobMatches = 34
OrigMaxHosts = 1
JobStatus = 2
EnteredCurrentStatus = 1150817354
LastSuspensionTime = 0
CurrentHosts = 1
ClaimId = "<129.74.20.20:9322>#1150232335#157"
RemoteHost = "vm2@sirius.cse.nd.edu"
RemoteVirtualMachineID = 2
ShadowBday = 1150817355
JobLastStartDate = 1150815519
JobCurrentStartDate = 1150817355
JobRunCount = 24
WallClockCheckpoint = 65927
RemoteSysCpu = 52.000000
ImageSize_RAW = 31324
DiskUsage_RAW = 102814
RemoteUserCpu = 62319.000000
LastJobLeaseRenewal = 11

Machine ClassAd
MyType = "Machine"
TargetType = "Job"
Name = "ccl00.cse.nd.edu"
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
MachineGroup = "ccl"
MachineOwner = "dthain"
CondorVersion = "6.7.19 May 10 2006"
CondorPlatform = "I386-LINUX_RH9"
VirtualMachineID = 1
ExecutableSize = 20000
JobUniverse = 1
NiceUser = FALSE
VirtualMemory = 962948
Memory = 498
Cpus = 1
Disk = 19072712
CondorLoadAvg = 1.000000
LoadAvg = 1.130000
KeyboardIdle = 817093
ConsoleIdle = 817093
StartdIpAddr = "<129.74.153.164:9453>"
Arch = "INTEL"
OpSys = "LINUX"
UidDomain = "nd.edu"
FileSystemDomain = "nd.edu"
Subnet = "129.74.153"
HasIOProxy = TRUE
CheckpointPlatform = "LINUX INTEL 2.4.x normal"
TotalVirtualMemory = 962948
TotalDisk = 19072712
TotalCpus = 1
TotalMemory = 498
KFlops = 659777
Mips = 2189
LastBenchmark = 1150271600
TotalLoadAvg = 1.130000
TotalCondorLoadAvg = 1.000000
ClockMin = 347
ClockDay = 3
TotalVirtualMachines = 1
HasFileTransfer = TRUE
HasPerFileEncryption = TRUE
HasReconnect = TRUE
HasMPI = TRUE
HasTDP = TRUE
HasJobDeferral = TRUE
HasJICLocalConfig = TRUE
HasJICLocalStdin = TRUE
HasPVM = TRUE
HasRemoteSyscalls = TRUE
HasCheckpointing = TRUE
CpuBusyTime = 0
CpuIsBusy = FALSE
TimeToLive = 2147483647
State = "Claimed"
EnteredCurrentState = 1150284871
Activity = "Busy"
EnteredCurrentActivity = 1150877237
Start = ((KeyboardIdle > 15 * 60) && (((LoadAvg - CondorLoadAvg) <= 0.300000) || (State != "Unclaimed" && State != "Owner")))
Requirements = (START) && (IsValidCheckpointPlatform)
IsValidCheckpointPlatform = (((TARGET.JobUniverse == 1) == FALSE) || ((MY.CheckpointPlatform =!= UNDEFINED) && ((TARGET.LastCheckpointPlatform =?= MY.CheckpointPlatform) || (TARGET.NumCkpts == 0))))
MaxJobRetirementTime = 0
CurrentRank = 1.000000
RemoteUser = "johanes@nd.edu"
RemoteOwner = "johanes@nd.edu"
ClientMachine = "cclbuild00.cse.nd.edu"
JobId = "2929.0"
GlobalJobId = "cclbuild00.cse.nd.edu#1150425594#2929.0"
JobStart = 1150425941
LastPeriodicCheckpoint = 1150879661
ImageSize = 54196
TotalJobRunTime = 456222
TotalJobSuspendTime = 1080
TotalClaimRunTime = 597057
TotalClaimSuspendTime = 1271
MonitorSelfTime = 1150883051
MonitorSelfCPUUsage = 0.066660
MonitorSelfImageSize = 8244.000000
MonitorSelfResidentSetSize = 2036
MonitorSelfAge = 0
DaemonStartTime = 1150231320
UpdateSequenceNumber = 2208
MyAddress = "<129.74.153.164:9453>"
LastHeardFrom = 1150883243
UpdatesTotal = 2785
UpdatesSequenced = 2784
UpdatesLost = 0
UpdatesHistory = "0x00000000000000000000000000000000"
Machine = "ccl00.cse.nd.edu"
Rank = ((Owner == "dthain") ||(Owner == "psnowber") ||(Owner == "cmoretti") ||(Owner == "jhemmes") ||(Owner == "gniederw")) * 2 + (PoolName =?= "ccl00.cse.nd.edu") * 1

User Job Log
Job 1 submitted.
Job 2 submitted.
Job 1 placed on ccl00.cse.nd.edu
Job 1 evicted.
Job 1 placed on smarty.cse.nd.edu.
Job 1 completed.
Job 2 placed on dvorak.helios.nd.edu
Job 2 suspended
Job 2 resumed
Job 2 exited normally with status 1.
...
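The job and machine ClassAds above are flat lists of `Attribute = Expression` pairs, which is what makes them usable as mining input. As a minimal sketch, assuming the ad text is available with one attribute per line, the following turns an ad into a Python dictionary. The function names `parse_classad` and `_parse_value` are illustrative and not part of Condor; only simple literals are converted, and anything that looks like an expression (such as `Requirements`) is kept as a raw string rather than evaluated.

```python
def _parse_value(raw):
    """Convert a simple ClassAd literal; leave anything else as raw text."""
    # A single quoted string with no embedded quotes.
    if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"' and '"' not in raw[1:-1]:
        return raw[1:-1]
    if raw.upper() == "TRUE":
        return True
    if raw.upper() == "FALSE":
        return False
    try:
        return int(raw)
    except ValueError:
        pass
    try:
        return float(raw)
    except ValueError:
        return raw  # an expression such as Requirements; not evaluated here


def parse_classad(text):
    """Parse 'Name = Value' lines into a dict (assumes one attribute per line)."""
    ad = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        # Split on the first '=' only, so '==' inside expressions survives.
        name, _, raw = line.partition("=")
        ad[name.strip()] = _parse_value(raw.strip())
    return ad


sample = '''
MyType = "Job"
ClusterId = 11839
ExitBySignal = FALSE
RemoteWallClockTime = 432493.000000
Requirements = (OpSys == "LINUX") && (Arch == "INTEL")
'''
job = parse_classad(sample)
```

Keeping expressions as strings is a deliberate simplification: a rule learner mostly needs the numeric and categorical attributes, not a full ClassAd evaluator.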
[Diagram: the User Job Log is used to sort each (Job Ad, Machine Ad) pair into a Success Class or a Failure Class; both classes are fed to RIPPER, which produces rules such as “Your jobs work fine on RH Linux 12.1 and 12.3 but they always seem to crash on version 12.2.”]
Failure Criteria:
• exit != 0
• core dump
• evicted
• suspended
• bad output
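The labeling step can be sketched as follows: each placement of a job on a machine becomes one training row that joins the job ad with the machine ad, and the failure criteria listed above decide its class. This is only an illustration under stated assumptions. The `outcome` dictionary and its keys (`exit_status`, `core_dumped`, `evicted`, `suspended`, `output_ok`) are hypothetical names for facts that would be extracted from the user job log, and the `Job.` / `Machine.` prefixes are an assumed way to keep attributes such as `ImageSize`, which appear in both ads, from colliding.

```python
def is_failure(outcome):
    """Apply the slide's failure criteria to one job placement."""
    return (
        outcome.get("exit_status", 0) != 0      # exit != 0
        or outcome.get("core_dumped", False)    # core dump
        or outcome.get("evicted", False)        # evicted
        or outcome.get("suspended", False)      # suspended
        or not outcome.get("output_ok", True)   # bad output
    )


def training_row(job_ad, machine_ad, outcome):
    """Join a job ad and a machine ad into one labeled feature row."""
    row = {f"Job.{k}": v for k, v in job_ad.items()}
    row.update({f"Machine.{k}": v for k, v in machine_ad.items()})
    row["class"] = "failure" if is_failure(outcome) else "success"
    return row


# Placements taken from the sample user job log; ads are abbreviated.
job = {"Cmd": "ripper-cost-can-9-50.sh"}
rows = [
    # Job 1 was evicted from ccl00, then completed on smarty.
    training_row(job, {"Name": "ccl00.cse.nd.edu", "Memory": 498}, {"evicted": True}),
    training_row(job, {"Name": "smarty.cse.nd.edu"}, {"exit_status": 0}),
    # Job 2 was suspended on dvorak and exited with status 1.
    training_row(job, {"Name": "dvorak.helios.nd.edu"}, {"suspended": True, "exit_status": 1}),
]
```

Treating each placement, rather than each job, as a row means a job that fails on one machine and succeeds on another contributes evidence about both machines.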
------------------------- run 1 -------------------------
Hypothesis:
exit1 :- Memory>=1930, JobStart>=1.14626e+09, MonitorSelfTime>=1.14626e+09 (491/377)
exit1 :- Memory>=1930, Disk<=555320 (1670/1639).
default exit0 (11904/4503).
Error rate on holdout data is 30.9852%
Running average of error rate is 30.9852%
------------------------- run 2 -------------------------
Hypothesis:
exit1 :- Memory>=1930, Disk<=541186 (2076/1812).
default exit0 (12090/4606).
Error rate on holdout data is 31.8791%
Running average of error rate is 31.4322%
------------------------- run 3 -------------------------
Hypothesis:
exit1 :- Memory>=1930, MonitorSelfImageSize>=8.844e+09 (1270/1050).
exit1 :- Memory>=1930, KeyboardIdle>=815995 (793/763).
exit1 :- Memory>=1927, EnteredCurrentState<=1.14625e+09, VirtualMemory>=2.09646e+06, LoadAvg>=30000, LastBenchmark<=1.14623e+09, MonitorSelfImageSize<=7.836e+09 (94/84).
exit1 :- Memory>=1927, TotalLoadAvg<=1.43e+06, UpdatesTotal<=8069, LastBenchmark<=1.14619e+09, UpdatesLost<=1 (77/61).
default exit0 (11940/4452).
Error rate on holdout data is 31.8111%
Running average of error rate is 31.5585%
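To make the hypotheses above concrete, here is a small sketch that parses a rule line and applies it to a machine's attributes. It assumes the usual ordered-rule reading (the first rule whose conditions all hold decides the class, otherwise the default applies) and handles only the `>=` and `<=` comparisons that appear in this output; the parenthesized counts at the end of each rule are ignored. The names `parse_rule`, `predict`, and `running_average` are illustrative, and the `big_mem` machine is a made-up example.

```python
import operator
import re

OPS = {">=": operator.ge, "<=": operator.le}
# Matches conditions such as "Memory>=1930" or "JobStart>=1.14626e+09".
COND = re.compile(r"(\w+)\s*(>=|<=)\s*([-+.\w]+)")


def parse_rule(text):
    """Split 'label :- cond, cond (p/n).' into (label, [(attr, op, value)])."""
    head, _, body = text.partition(":-")
    conds = [(attr, OPS[op], float(val)) for attr, op, val in COND.findall(body)]
    return head.strip(), conds


def predict(rules, default, ad):
    """Return the label of the first rule whose conditions all hold."""
    for label, conds in rules:
        if all(attr in ad and op(ad[attr], val) for attr, op, val in conds):
            return label
    return default


def running_average(values):
    """Mean of the first k values, for k = 1..len(values)."""
    out, total = [], 0.0
    for k, v in enumerate(values, start=1):
        total += v
        out.append(total / k)
    return out


# The single rule learned in run 2.
rules = [parse_rule("exit1 :- Memory>=1930, Disk<=541186 (2076/1812).")]
ccl00 = {"Memory": 498, "Disk": 19072712}    # values from the machine ad
big_mem = {"Memory": 2048, "Disk": 500000}   # hypothetical machine

print(predict(rules, "exit0", ccl00))
print(predict(rules, "exit0", big_mem))
print(running_average([30.9852, 31.8791, 31.8111]))
```

The running averages computed this way agree with the reported 31.4322% and 31.5585% to within rounding of the displayed per-run error rates.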
Unexpected Discoveries
• Purdue Teragrid (91343 jobs on 2523 CPUs)
  • Jobs fail on machines with (Memory>1920MB)
  • Diagnosis: Linux machines with > 3GB have a different memory layout that breaks some programs that do inappropriate pointer arithmetic.
• UND & UW (4005 jobs on 1460 CPUs)
  • Jobs fail on machines with less than 4MB disk.
  • Diagnosis: Condor failed in an unusual way when the job transfers input files that don’t fit.
Many Open Problems
• Strengths and Weaknesses of Approach
  • Correlation != Causation -> could be enough?
  • Limits of reported data -> increase resolution?
  • Not enough data points -> direct job placement?
• Acting on Information
  • Steering by the end user.
  • Applying learned rules back to the system.
  • Evaluating (and sometimes abandoning) changes.
  • Creating tools that assist with “digging deeper.”
• Data Mining Research
  • Continuous intake + incremental construction.
  • Creating results that non-specialists can understand.
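One way to picture "applying learned rules back to the system" is to turn a failure rule into an extra clause on the job's `Requirements` expression, so the job is steered away from machines the rule flags. The sketch below only builds the expression text, using the `&&` and comparison syntax seen in the job ad's own `Requirements` plus the ClassAd `!` operator. It is an assumption about how such steering could be done, not a description of the authors' tooling; `avoid_clause` and `steer_requirements` are hypothetical helper names.

```python
def avoid_clause(conditions):
    """Build a clause that is false on machines matching every condition."""
    inner = " && ".join(f"({attr} {op} {value})" for attr, op, value in conditions)
    return f"!({inner})"


def steer_requirements(existing, failure_rules):
    """AND the existing Requirements with one avoidance clause per rule."""
    clauses = [f"({existing})"] + [avoid_clause(rule) for rule in failure_rules]
    return " && ".join(clauses)


# The rule from run 2: exit1 :- Memory>=1930, Disk<=541186
learned = [[("Memory", ">=", 1930), ("Disk", "<=", 541186)]]
existing = '(OpSys == "LINUX") && (Arch == "INTEL")'
requirements = steer_requirements(existing, learned)
print(requirements)
```

Because the rules express correlation rather than cause, a clause like this is best treated as a temporary workaround that is re-evaluated, and sometimes abandoned, as more data arrives.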
Just Getting Started
• Douglas Thain
• University of Notre Dame
• dthain@cse.nd.edu
• We like to collect things:
  • Obscure failure modes.
  • War stories about how the bugs were found.
  • Log files from big batch runs.