Attacking Data Intensive Science with Distributed Computing Prof. Douglas Thain University of Notre Dame http://www.cse.nd.edu/~dthain
Outline • Large Scale Distributed Computing • Plentiful Computing Resources World-Wide • Challenges: Data and Debugging • The Cooperative Computing Lab • Distributed Data Management • Applications to Scientific Computing • Debugging Complex Systems • Open Problems in Distributed Computing • Proposal: The All-Pairs Abstraction
Plentiful Computing Power • As of 04 Sep 2006: • Teragrid • 21,972 CPUs / 220 TB / 6 sites • Open Science Grid • 21,156 CPUs / 83 TB / 61 sites • Condor Worldwide: • 96,352 CPUs / 1608 sites • At Notre Dame: • CRC: 500 CPUs • BOB: 212 CPUs • Lots of little clusters!
Who is using all of this? • Anyone with unlimited computing needs! • High Energy Physics: • Simulating the detector of a particle accelerator before turning it on allows one to understand the output. • Biochemistry: • Simulate complex molecules under different forces to understand how they fold/mate/react. • Biometrics: • Given a large database of human images, evaluate matching algorithms by comparing all to all. • Climatology: • Given a starting global climate, simulate how climate develops under varying assumptions or events.
Buzzwords • Distributed Computing • Cluster Computing • Beowulf • Grid Computing • Utility Computing • Something@Home = A bunch of computers.
Some Outstanding Successes • TeraGrid: • AMANDA project uses 1000s of CPUs over months to calibrate and process data from a neutrino telescope. • PlanetLab: • Hundreds of nodes used to test and validate a wide variety of distributed and P2P systems: Chord, Pastry, etc... • Condor: • MetaNEOS project solves a 30-year-old optimization problem using brute force on 1000 heterogeneous CPUs across multiple sites over several weeks. • SETI@home: • Millions of CPUs used to analyze celestial signals.
And now the bad news... Large distributed systems fall to pieces when you have lots of data!
Example: Grid3 (OSG) Robert Gardner, et al. (102 authors) The Grid3 Production Grid Principles and Practice IEEE HPDC 2004 The Grid2003 Project has deployed a multi-virtual organization, application-driven grid laboratory that has sustained for several months the production-level services required by… ATLAS, CMS, SDSS, LIGO…
Problem: Data Management The good news: • 27 sites with 2800 CPUs • 40985 CPU-days provided over 6 months • 10 applications with 1300 simultaneous jobs The bad news: • 40-70 percent utilization • 30 percent of jobs would fail • 90 percent of failures were site problems • Most site failures were due to disk space!
Problem: Debugging “Most groups reported problems in which a job had been submitted... and something had not performed correctly, but they were unable to determine where, why, or how to fix that problem...” Jennifer Schopf and Steven Newhouse, “State of Grid Users: 25 Conversations with UK eScience Users” Argonne National Lab Tech Report ANL/MCS-TM-278, 2004.
Both Problems: Debugging I/O • A user submits 1000 jobs to a grid. • Each requires 1 GB of input. • 100 start at once. (Quite normal.) • The interleaved transfers all fail. • The “robust” system retries... • (Happened last week in this department!)
A Common Thread: • Each of these problems: • “I can’t make storage do what I want!” • “I have no idea why this system is failing!” • Arises from the following: • Both service providers and users are lacking the tools and models that they need to harness and analyze complex environments.
Outline • Large Scale Distributed Computing • Plentiful Computing Resources World-Wide • Challenges: Data and Debugging • The Cooperative Computing Lab • Distributed Data Management • Applications to Scientific Computing • Debugging Complex Systems • Open Problems in Distributed Computing • Proposal: The All-Pairs Abstraction
Cooperative Computing Lab at the University of Notre Dame • Basic Computer Science Research • Overlapping categories: Operating systems, distributed systems, grid computing, filesystems, databases... • Modest Local Operation • 300 CPUs, 20 TB of storage, 6 stakeholders • Keeps us honest: we eat our own dog food. • Software Development and Publication • http://www.cctools.org • Students learn engineering as well as science. • Collaboration with External Users • High energy physics, bioinformatics, molecular dynamics... http://www.cse.nd.edu/~ccl
Computing Environment • [Figure: submitted jobs pass through the Condor Match Maker, which places them on the CPUs and disks of four pools: Miscellaneous CSE Workstations, Fitzpatrick Workstation Cluster, CVRL Research Cluster, and CCL Research Cluster.] • Resource owners state their own policies: • “I will only run jobs between midnight and 8 AM.” • “I will only run jobs when there is no one working at the keyboard.” • “I prefer to run a job submitted by a CCL student.”
[Figures: CPU History; Storage History]
Flocking Between Universities • [Figure: pools at four sites linked by flocking.] • Wisconsin: 1200 CPUs • Purdue A: 541 CPUs • Purdue B: 1016 CPUs • Notre Dame: 300 CPUs • http://www.cse.nd.edu/~ccl/operations/condor/
Problems and Solutions • “I can’t make storage do what I want!” • Need root access, configure, reboot, etc... • Solution: Tactical Storage Systems • “I have no idea why this system is failing!” • Multiple services, unreliable networks... • Solution: Debugging Via Data Mining
Why is Storage Hard? • Easy within one cluster: • Shared filesystem on all nodes. • But, limited to a few disks provided by admin. • Even a “macho” file server has limited BW. • Terrible across two or more clusters: • No shared filesystem on all nodes. • Too hard to move data back and forth. • Limited to using storage on head nodes. • Unable to become root to configure.
Conventional Clusters • [Figure: many CPUs sharing only two disks.]
Tactical Storage Systems (TSS) • A TSS allows any node to serve as a file server or as a file system client. • All components can be deployed without special privileges – but with standard grid security (GSI) • Users can build up complex structures. • Filesystems, databases, caches, ... • Admins need not know/care about larger structures. • Takes advantage of two resources: • Total Storage (200 disks yields 20TB) • Total Bandwidth (200 disks at 10 MB/s = 2 GB/s)
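The two aggregate figures above follow from simple multiplication. The sketch below is only a check of that arithmetic; it assumes decimal units and 100 GB per disk, which is the per-disk capacity implied by "200 disks yields 20TB" but not stated on the slide.

```python
# Aggregate capacity and bandwidth of a pool of commodity disks.
# per_disk_gb=100 is an assumption implied by "200 disks yields 20TB".

def aggregate(num_disks, per_disk_gb, per_disk_mb_per_s):
    total_tb = num_disks * per_disk_gb / 1000               # GB -> TB (decimal)
    total_gb_per_s = num_disks * per_disk_mb_per_s / 1000   # MB/s -> GB/s
    return total_tb, total_gb_per_s

total_tb, total_bw = aggregate(num_disks=200, per_disk_gb=100, per_disk_mb_per_s=10)
print(total_tb, "TB,", total_bw, "GB/s")
```

The point of the calculation is that neither number is available from a single file server; both come only from using every node's disk.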
Tactical Storage System • [Figure: every CPU is paired with a disk, and groups of disks are combined into logical volumes.] • 1 – Uniform access between any nodes in either cluster. • 2 – Ability to group together multiple disks for a common purpose.
Tactical Storage Structures • [Figure: applications reach file servers and their disks through adapters; some servers are grouped into a logical volume.] • Structures shown: Expandable File System, Replicated File System, WAN File System. • Scalable Capacity/BW for Large Data • Scalable Bandwidth for Small Data • Secured by Grid GSI Credentials
Applications and Examples • Bioinformatics: • A WAN Filesystem for BLAST on EGEE grid. • Atmospheric Physics: • A cluster filesystem for scalable data analysis. • Biometrics: • Distributed I/O for high-throughput image comparison. • Molecular Dynamics: • GEMS: Scalable distributed database. • High Energy Physics: • Global access to software distributions.
Simple Wide Area File System • Bioinformatics on the European Grid • Users want to run BLAST on standard DBs. • Cannot copy every DB to every node of the grid! • Many databases of biological data in different formats around the world: • Archives: Swiss-Prot, TreMBL, NCBI, etc... • Replicas: Public, Shared, Private, ??? • Goal: Refer to data objects by logical name. • Access the nearest copy of the non-redundant protein database, don’t care where it is. Credit: Christophe Blanchet, Bioinformatics Center of Lyon, CNRS IBCP, France http://gbio.ibcp.fr/cblanchet, Christophe.Blanchet@ibcp.fr
Wide Area File System • [Figure: BLAST reads nr.data through an adapter that consults the EGEE File Location Service; copies of nr.data sit on HTTP, FTP, and RFIO servers, and the adapter speaks RFIO, gLite, HTTP, and FTP.] • Run BLAST on LFN://ncbi.gov/nr.data • Where is LFN://ncbi.gov/nr.data? • open(LFN://ncbi.gov/nr.data) • Find it at: FTP://ibcp.fr/nr.data • open(FTP://ibcp.fr/nr.data) • RETR nr.data
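The lookup sequence on this slide can be sketched in a few lines. This is an illustration only, not the adapter's actual code: `REPLICA_CATALOG`, `PROTOCOL_HANDLERS`, `resolve`, `open_ftp`, and `adapter_open` are hypothetical names, the catalog is a hard-coded dictionary standing in for the file location service, and the FTP handler merely returns a description of the request instead of contacting a server.

```python
from urllib.parse import urlparse

# Hypothetical replica catalog: logical name -> known physical copies.
REPLICA_CATALOG = {
    "LFN://ncbi.gov/nr.data": ["FTP://ibcp.fr/nr.data"],
}

def open_ftp(host, path):
    # Stand-in for a real FTP client: just describe the request.
    return f"FTP RETR {path.lstrip('/')} from {host}"

# Hypothetical table of per-protocol handlers (HTTP, RFIO, ... would go here).
PROTOCOL_HANDLERS = {"ftp": open_ftp}

def resolve(lfn):
    """Ask the (stand-in) location service where copies of a logical file live."""
    replicas = REPLICA_CATALOG.get(lfn)
    if not replicas:
        raise FileNotFoundError(lfn)
    return replicas

def adapter_open(name):
    """Open a logical or physical name on behalf of the application."""
    if name.upper().startswith("LFN://"):
        # Take the first replica; a real system would choose the nearest one.
        name = resolve(name)[0]
    url = urlparse(name)  # urlparse lower-cases the scheme
    handler = PROTOCOL_HANDLERS[url.scheme]
    return handler(url.netloc, url.path)

print(adapter_open("LFN://ncbi.gov/nr.data"))
```

The application only ever names the logical file; which server and which protocol are used is decided below it.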
Expandable Filesystem for Experimental Data • Project GRAND: http://www.nd.edu/~grand • [Figure: analysis code, a buffer disk, and a 30-year archive of daily tapes.] • 2 GB/day today; could be lots more! • Can only analyze the most recent data. • Credit: John Poirer @ Notre Dame Astrophysics Dept.
Expandable Filesystem for Experimental Data • Project GRAND: http://www.nd.edu/~grand • [Figure: analysis code reads through an adapter from a logical volume built on several file servers, alongside the buffer disk and the 30-year archive of daily tapes.] • 2 GB/day today; could be lots more! • Can analyze all data over large time scales. • Credit: John Poirer @ Notre Dame Astrophysics Dept.
Scalable I/O for Biometrics • Computer Vision Research Lab in CSE • Goal: Develop robust algorithms for identifying humans from (non-ideal) images. • Technique: Collect lots of images. Think up clever new matching function. Compare them. • How do you test a matching function? • For a set S of images, • Compute F(Si,Sj) for all Si and Sj in S. • Compare the result matrix to known functions. Credit: Patrick Flynn at Notre Dame CSE
Computing Similarities • [Figure: the matching function F applied to pairs of images.]
A Big Data Problem • Data Size: 10k images of 1MB = 10 GB • Total I/O: 10k * 10k * 2 MB *1/2 = 100 TB • Would like to repeat many times! • In order to execute such a workload, we must be careful to partition both the I/O and the CPU needs, taking advantage of distributed capacity.
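The I/O estimate above can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 TB = 10^6 MB) and reads the 1/2 factor as skipping symmetric pairs, an interpretation the slide does not spell out; the function name is illustrative.

```python
def all_pairs_io_tb(num_images, image_mb):
    # Assumed reading of the 1/2 factor: F(a, b) and F(b, a) are computed once.
    comparisons = num_images * num_images / 2
    # Every comparison reads two images.
    total_mb = comparisons * 2 * image_mb
    return total_mb / 1_000_000  # MB -> TB (decimal)

print(all_pairs_io_tb(10_000, 1))
```

The data set itself is only 10 GB; the volume comes from re-reading each image once for every comparison it takes part in.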
Conventional Solution • [Figure: jobs running on CPUs read all of their input from a separate set of disks.] • Move 200 TB at Runtime!
Using Tactical Storage • [Figure: jobs run on CPUs that each have a local disk.] • 1. Break array into MB-size chunks. • 2. Replicate data to many disks. • 3. Jobs find nearby data copy, and make full use before discarding.
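The third step, making full use of a local copy before discarding it, can be sketched as a blocked traversal of the comparison matrix. This is a toy illustration under stated assumptions, not the deployed system: `make_blocks` and `run_block` are hypothetical names, the chunk size is arbitrary, replication to disks is not modeled, and the "images" are plain integers.

```python
from itertools import combinations_with_replacement

def make_blocks(num_items, chunk):
    """Split indices 0..num_items-1 into chunks, then pair the chunks into jobs."""
    chunks = [range(start, min(start + chunk, num_items))
              for start in range(0, num_items, chunk)]
    # One job per unordered pair of chunks covers every unordered pair of items.
    return list(combinations_with_replacement(chunks, 2))

def run_block(rows, cols, load, compare):
    """Load each needed item once, then reuse it for every comparison in the block."""
    cache = {i: load(i) for i in set(rows) | set(cols)}
    return {(i, j): compare(cache[i], cache[j])
            for i in rows for j in cols if i <= j}

# Toy usage: "images" are integers and the comparison is absolute difference.
jobs = make_blocks(num_items=6, chunk=2)
results = {}
for rows, cols in jobs:
    results.update(run_block(rows, cols,
                             load=lambda i: i * 10,
                             compare=lambda a, b: abs(a - b)))
print(len(jobs), len(results))
```

Each job loads at most two chunks but performs every comparison between them, so the number of loads grows with the number of blocks instead of the number of pairs.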
Problems and Solutions • “I can’t make storage do what I want!” • Need root access, configure, reboot, etc... • Solution: Tactical Storage Systems • “I have no idea why this system is failing!” • Multiple services, unreliable networks... • Solution: Debugging Via Data Mining
It’s Ugly in the Real World • Machine related failures: • Power outages, network outages, faulty memory, corrupted file system, bad config files, expired certs, packet filters... • Job related failures: • Crash on some args, bad executable, missing input files, mistake in args, missing components, failure to understand dependencies... • Incompatibilities between jobs and machines: • Missing libraries, not enough disk/cpu/mem, wrong software installed, wrong version installed, wrong memory layout... • Load related failures: • Slow actions induce timeouts; kernel tables: files, sockets, procs; router tables: addresses, routes, connections; competition with other users... • Non-deterministic failures: • Multi-thread/CPU synchronization, event interleaving across systems, random number generators, interactive effects, cosmic rays...
A “Grand Challenge” Problem: • A user submits one million jobs to the grid. • Half of them fail. • Now what? • Examine the output of every failed job? • Login to every site to examine the logs? • Resubmit and hope for the best? • We need some way of getting the big picture. • Need to identify problems not seen before.
An Idea: • We have lots of structured information about the components of a grid. • Can we perform some form of data mining to discover the big picture of what is going on? • User: Your jobs work fine on RH Linux 12.1 and 12.3 but they always seem to crash on version 12.2. • Admin: Joe is running 1000s of jobs with 10 TB of data that fail immediately; perhaps he needs help. • Can we act on this information? • User: Avoid resources that aren’t working for you. • Admin: Assist the user in understanding and fixing the problem.
Job ClassAd
MyType = "Job"
TargetType = "Machine"
ClusterId = 11839
QDate = 1150231068
CompletionDate = 0
Owner = "dcieslak"
JobUniverse = 5
Cmd = "ripper-cost-can-9-50.sh"
LocalUserCpu = 0.000000
LocalSysCpu = 0.000000
ExitStatus = 0
ImageSize = 40000
DiskUsage = 110000
NumCkpts = 0
NumRestarts = 0
NumSystemHolds = 0
CommittedTime = 0
ExitBySignal = FALSE
PoolName = "ccl00.cse.nd.edu"
CondorVersion = "6.7.19 May 10 2006"
CondorPlatform = I386-LINUX_RH9
RootDir = "/"
Iwd = "/tmp/dcieslak/smotewrap1"
MinHosts = 1
WantRemoteSyscalls = FALSE
WantCheckpoint = FALSE
JobPrio = 0
User = "dcieslak@nd.edu"
NiceUser = FALSE
Env = "LD_LIBRARY_PATH=."
EnvDelim = ";"
JobNotification = 0
WantRemoteIO = TRUE
UserLog = "/tmp/dcieslak/smotewrap1/ripper-cost-can-9-50.log"
CoreSize = -1
KillSig = "SIGTERM"
Rank = 0.000000
In = "/dev/null"
TransferIn = FALSE
Out = "ripper-cost-can-9-50.output"
StreamOut = FALSE
Err = "ripper-cost-can-9-50.error"
StreamErr = FALSE
BufferSize = 524288
BufferBlockSize = 32768
ShouldTransferFiles = "YES"
WhenToTransferOutput = "ON_EXIT_OR_EVICT"
TransferFiles = "ALWAYS"
TransferInput = "scripts.tar.gz,can-ripper.tar.gz"
TransferOutput = "ripper-cost-50-can-9.tar.gz"
ExecutableSize_RAW = 1
ExecutableSize = 10000
Requirements = (OpSys == "LINUX") && (Arch == "INTEL") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && (HasFileTransfer)
JobLeaseDuration = 1200
PeriodicHold = FALSE
PeriodicRelease = FALSE
PeriodicRemove = FALSE
OnExitHold = FALSE
OnExitRemove = TRUE
LeaveJobInQueue = FALSE
Arguments = ""
GlobalJobId = "cclbuild02.cse.nd.edu#1150231069#11839.0"
ProcId = 0
AutoClusterId = 0
AutoClusterAttrs = "Owner,Requirements"
JobStartDate = 1150256907
LastRejMatchReason = "no match found"
LastRejMatchTime = 1150815515
TotalSuspensions = 73
CumulativeSuspensionTime = 8179
RemoteWallClockTime = 432493.000000
LastRemoteHost = "hobbes.helios.nd.edu"
LastClaimId = "<129.74.221.168:9359>#1150811733#2"
MaxHosts = 1
WantMatchDiagnostics = TRUE
LastMatchTime = 1150817352
NumJobMatches = 34
OrigMaxHosts = 1
JobStatus = 2
EnteredCurrentStatus = 1150817354
LastSuspensionTime = 0
CurrentHosts = 1
ClaimId = "<129.74.20.20:9322>#1150232335#157"
RemoteHost = "vm2@sirius.cse.nd.edu"
RemoteVirtualMachineID = 2
ShadowBday = 1150817355
JobLastStartDate = 1150815519
JobCurrentStartDate = 1150817355
JobRunCount = 24
WallClockCheckpoint = 65927
RemoteSysCpu = 52.000000
ImageSize_RAW = 31324
DiskUsage_RAW = 102814
RemoteUserCpu = 62319.000000
LastJobLeaseRenewal = 11

Machine ClassAd
MyType = "Machine"
TargetType = "Job"
Name = "ccl00.cse.nd.edu"
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
MachineGroup = "ccl"
MachineOwner = "dthain"
CondorVersion = "6.7.19 May 10 2006"
CondorPlatform = "I386-LINUX_RH9"
VirtualMachineID = 1
ExecutableSize = 20000
JobUniverse = 1
NiceUser = FALSE
VirtualMemory = 962948
Memory = 498
Cpus = 1
Disk = 19072712
CondorLoadAvg = 1.000000
LoadAvg = 1.130000
KeyboardIdle = 817093
ConsoleIdle = 817093
StartdIpAddr = "<129.74.153.164:9453>"
Arch = "INTEL"
OpSys = "LINUX"
UidDomain = "nd.edu"
FileSystemDomain = "nd.edu"
Subnet = "129.74.153"
HasIOProxy = TRUE
CheckpointPlatform = "LINUX INTEL 2.4.x normal"
TotalVirtualMemory = 962948
TotalDisk = 19072712
TotalCpus = 1
TotalMemory = 498
KFlops = 659777
Mips = 2189
LastBenchmark = 1150271600
TotalLoadAvg = 1.130000
TotalCondorLoadAvg = 1.000000
ClockMin = 347
ClockDay = 3
TotalVirtualMachines = 1
HasFileTransfer = TRUE
HasPerFileEncryption = TRUE
HasReconnect = TRUE
HasMPI = TRUE
HasTDP = TRUE
HasJobDeferral = TRUE
HasJICLocalConfig = TRUE
HasJICLocalStdin = TRUE
HasPVM = TRUE
HasRemoteSyscalls = TRUE
HasCheckpointing = TRUE
CpuBusyTime = 0
CpuIsBusy = FALSE
TimeToLive = 2147483647
State = "Claimed"
EnteredCurrentState = 1150284871
Activity = "Busy"
EnteredCurrentActivity = 1150877237
Start = ((KeyboardIdle > 15 * 60) && (((LoadAvg - CondorLoadAvg) <= 0.300000) || (State != "Unclaimed" && State != "Owner")))
Requirements = (START) && (IsValidCheckpointPlatform)
IsValidCheckpointPlatform = (((TARGET.JobUniverse == 1) == FALSE) || ((MY.CheckpointPlatform =!= UNDEFINED) && ((TARGET.LastCheckpointPlatform =?= MY.CheckpointPlatform) || (TARGET.NumCkpts == 0))))
MaxJobRetirementTime = 0
CurrentRank = 1.000000
RemoteUser = "johanes@nd.edu"
RemoteOwner = "johanes@nd.edu"
ClientMachine = "cclbuild00.cse.nd.edu"
JobId = "2929.0"
GlobalJobId = "cclbuild00.cse.nd.edu#1150425594#2929.0"
JobStart = 1150425941
LastPeriodicCheckpoint = 1150879661
ImageSize = 54196
TotalJobRunTime = 456222
TotalJobSuspendTime = 1080
TotalClaimRunTime = 597057
TotalClaimSuspendTime = 1271
MonitorSelfTime = 1150883051
MonitorSelfCPUUsage = 0.066660
MonitorSelfImageSize = 8244.000000
MonitorSelfResidentSetSize = 2036
MonitorSelfAge = 0
DaemonStartTime = 1150231320
UpdateSequenceNumber = 2208
MyAddress = "<129.74.153.164:9453>"
LastHeardFrom = 1150883243
UpdatesTotal = 2785
UpdatesSequenced = 2784
UpdatesLost = 0
UpdatesHistory = "0x00000000000000000000000000000000"
Machine = "ccl00.cse.nd.edu"
Rank = ((Owner == "dthain") || (Owner == "psnowber") || (Owner == "cmoretti") || (Owner == "jhemmes") || (Owner == "gniederw")) * 2 + (PoolName =?= "ccl00.cse.nd.edu") * 1

User Job Log
Job 1 submitted.
Job 2 submitted.
Job 1 placed on ccl00.cse.nd.edu
Job 1 evicted.
Job 1 placed on smarty.cse.nd.edu.
Job 1 completed.
Job 2 placed on dvorak.helios.nd.edu
Job 2 suspended
Job 2 resumed
Job 2 exited normally with status 1.
...
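Because both ads are flat lists of named attributes, they are easy to load into a program for analysis. The sketch below is a simplification, not Condor's ClassAd evaluator: `parse_ad` is a hypothetical helper that handles only quoted strings, TRUE/FALSE, and numbers, and `job_requirements` is a hand translation into Python of the job's Requirements expression, using a handful of attribute values copied from the two ads.

```python
def parse_ad(text):
    """Parse simple 'Name = value' lines; anything else is kept as a raw string."""
    ad = {}
    for line in text.strip().splitlines():
        name, _, value = line.partition(" = ")
        name, value = name.strip(), value.strip()
        if value.startswith('"') and value.endswith('"'):
            ad[name] = value[1:-1]
        elif value in ("TRUE", "FALSE"):
            ad[name] = (value == "TRUE")
        else:
            try:
                ad[name] = float(value) if "." in value else int(value)
            except ValueError:
                ad[name] = value  # unevaluated expression
    return ad

job = parse_ad('''
ImageSize = 40000
DiskUsage = 110000
''')
machine = parse_ad('''
OpSys = "LINUX"
Arch = "INTEL"
Disk = 19072712
Memory = 498
HasFileTransfer = TRUE
''')

def job_requirements(job, machine):
    # Hand translation of the job ad's Requirements expression.
    return (machine["OpSys"] == "LINUX" and machine["Arch"] == "INTEL"
            and machine["Disk"] >= job["DiskUsage"]
            and machine["Memory"] * 1024 >= job["ImageSize"]
            and machine["HasFileTransfer"])

print(job_requirements(job, machine))
```

Once ads are dictionaries, each (job ad, machine ad, outcome) triple becomes one row of a table that a learning algorithm can consume.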
[Figure: the user job log sorts each job ad and machine ad pair into a success class or a failure class; both classes are fed to RIPPER.] • Failure criteria: exit != 0, core dump, evicted, suspended, bad output. • Example result: “Your jobs work fine on RH Linux 12.1 and 12.3 but they always seem to crash on version 12.2.”
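The kind of rule this pipeline is meant to produce can be illustrated with a much simpler learner. The sketch below is not RIPPER; it is a hypothetical single-test search that scores each (attribute, value) pair by precision times recall on the failure class. The attribute name `OpSysVersion` and the four records are invented for the example.

```python
from collections import defaultdict

def best_single_rule(records):
    """Find the (attribute, value) test that best separates failures from successes.

    Each record is (attributes_dict, failed_bool). This is a simplified
    stand-in for a rule learner, not the RIPPER algorithm itself.
    """
    counts = defaultdict(lambda: [0, 0])  # (attr, value) -> [failures, total]
    for attrs, failed in records:
        for item in attrs.items():
            counts[item][1] += 1
            if failed:
                counts[item][0] += 1
    total_failures = sum(1 for _, failed in records if failed)

    def score(item):
        failures, total = counts[item]
        precision = failures / total  # how often the test implies failure
        recall = failures / total_failures if total_failures else 0.0
        return precision * recall

    return max(counts, key=score)

# Invented records: (machine attributes, did the job fail?)
records = [
    ({"OpSysVersion": "12.1", "Arch": "INTEL"}, False),
    ({"OpSysVersion": "12.2", "Arch": "INTEL"}, True),
    ({"OpSysVersion": "12.2", "Arch": "INTEL"}, True),
    ({"OpSysVersion": "12.3", "Arch": "INTEL"}, False),
]
print(best_single_rule(records))
```

A real rule learner would combine several tests and prune them against held-out data; the single-test version is enough to show how a statement such as "crashes on version 12.2" falls out of labeled ads.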
Unexpected Discoveries • Purdue Teragrid (91343 jobs on 2523 CPUs) • Jobs fail on machines with (Memory>1920MB) • Diagnosis: Linux machines with > 3GB have a different memory layout that breaks some programs that do inappropriate pointer arithmetic. • UND & UW (4005 jobs on 1460 CPUs) • Jobs fail on machines with less than 4MB disk. • Diagnosis: Condor failed in an unusual way when the job transfers input files that don’t fit.
Many Open Problems • Strengths and Weaknesses of Approach • Correlation != Causation -> could be enough? • Limits of reported data -> increase resolution? • Not enough data points -> direct job placement? • Acting on Information • Steering by the end user. • Applying learned rules back to the system. • Evaluating (and sometimes abandoning) changes. • Data Mining Research • Continuous intake + incremental construction. • Creating results that non-specialists can understand. • Next Step: Monitor 21,000 CPUs on the OSG!
Problems and Solutions • “I can’t make storage do what I want!” • Need root access, configure, reboot, etc... • Solution: Tactical Storage Systems • “I have no idea why this system is failing!” • Multiple services, unreliable networks... • Solution: Debugging Via Data Mining
Outline • Large Scale Distributed Computing • Plentiful Computing Resources World-Wide • Challenges: Data and Debugging • The Cooperative Computing Lab • Distributed Data Management • Applications to Scientific Computing • Debugging Complex Systems • Open Problems in Distributed Computing • Proposal: The All-Pairs Abstraction
Some Ruminations • These tools attack technical problems. • But, users still have to be clever: • Where should my jobs run? • How should I partition data? • How long should I run before a checkpoint? • Can we provide an interface such that: • Scientific users state what to compute. • The system decides where, when, and how. • Previous attempts didn’t incorporate data.
The All-Pairs Abstraction • All-Pairs: • For a set S and a function F: • Compute F(Si,Sj) for all Si and Sj in S. • The end user provides: • Set S: A bunch of files. • Function F: A self-contained program. • The computing system determines: • Optimal decomposition in time and space. • What resources to employ. (F easy to distr.) • What to do when failures occur.
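The user-facing contract of the abstraction fits in a few lines. The sketch below gives only sequential reference semantics, assuming S is an in-memory list and F an ordinary Python function; the decomposition, placement, and failure handling listed above are exactly what a real backend would add and are not modeled here.

```python
def all_pairs(S, F):
    """Return the matrix M with M[i][j] = F(S[i], S[j]) for every i and j."""
    return [[F(a, b) for b in S] for a in S]

# Toy usage: S is a list of strings and F counts the characters two strings share.
S = ["abc", "abd", "xyz"]
F = lambda a, b: len(set(a) & set(b))
for row in all_pairs(S, F):
    print(row)
```

Because the caller supplies only S and F, the system is free to reorder, partition, or retry the individual F calls without changing the result matrix, provided F is self-contained.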
An All-Pairs Facility at Notre Dame • [Figure: an All Pairs Web Portal in front of 100s-1000s of machines with CPUs and disks, each running copies of F.] • 1 – User uploads S and F into the system. • 2 – Backend decides where to run, how to partition, when to retry failures... • 3 – Return result matrix to user.
Our Mode of Research • Find researchers with systems problems. • Solve them by developing new tools. • Generalize the solutions to new domains. • Publish papers and software!