The Difficulties of Distributed Data
Douglas Thain
thain@cs.wisc.edu
Condor Project, University of Wisconsin
http://www.cs.wisc.edu/condor

The Condor Project
• Established in 1985.
• Software for high-throughput cluster computing on sites ranging from 10 to 1000s of nodes.
• Example installations:
    • 643 CPUs at UW-Madison in the CS building
        • Computer architecture simulations
    • 264 CPUs at INFN, all across Italy
        • CMS simulations
• Serves two communities: production software and computer science research.

No Repository Here!
• No master source of anyone’s data at UW-CS Condor!
• But a large amount of buffer space:
    • 128 * 10 GB + 64 * 30 GB (3,200 GB in total)
• Ultimate store is at other sites:
    • NCSA mass store
    • CERN LHC repositories
• We concentrate on software for loading, buffering, caching, and producing output efficiently.

The Challenges of Large-Scale Data Access are…
• 1 - Correctness!
    • Single stage: crashed machines, lost connections, missing libraries, wrong permissions, expired proxies…
    • End-to-end: a job is not “complete” until the output has been verified and written to disk.
• 2 - Heterogeneity
    • By design: aggregated clusters.
    • By situation: disk layout, buffer capacity, net load.

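The end-to-end point can be made concrete with a small sketch. It assumes the output fits in memory and that a SHA-256 comparison is an acceptable form of "verified"; the function names `sha256_of` and `commit_output` are hypothetical and are not part of Condor.

```python
import hashlib
import os
import tempfile


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def commit_output(data, final_path):
    """Write data to final_path; return True only once it is verified on disk."""
    expected = hashlib.sha256(data).hexdigest()
    directory = os.path.dirname(os.path.abspath(final_path))
    # Write to a temporary file in the same directory so the rename is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())          # force the bytes out of OS buffers
        if sha256_of(tmp_path) != expected:
            return False                  # verification failed: not "complete"
        os.replace(tmp_path, final_path)  # the output appears all at once
        return True
    finally:
        if os.path.exists(tmp_path):      # only true if the rename never happened
            os.remove(tmp_path)
```

Under these assumptions, a caller would mark the job complete only when `commit_output` returns True, and would treat any other outcome as a reason to retry or reschedule.
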
Your Comments:
• Jobs need scripts that check readiness of system before execution. (Tim Smith)
• Single node failures not worth investigating: reboot, reimage, replace. (Steve DuChene)
• “A cluster is a large error amplifier.” (Chuck Boeheim)

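A readiness check of the kind the first comment asks for might look like the sketch below. The particular checks (readable inputs, programs on PATH, writable scratch space with enough free bytes) are assumptions chosen for illustration, and `check_readiness` is a hypothetical name.

```python
import os
import shutil


def check_readiness(input_files, programs, scratch_dir, min_free_bytes):
    """Return a list of problems; an empty list means the node looks ready."""
    problems = []
    for path in input_files:
        if not os.path.isfile(path):
            problems.append(f"missing input: {path}")
        elif not os.access(path, os.R_OK):
            problems.append(f"unreadable input: {path}")
    for name in programs:
        if shutil.which(name) is None:
            problems.append(f"program not found on PATH: {name}")
    if not os.path.isdir(scratch_dir):
        problems.append(f"no scratch directory: {scratch_dir}")
    else:
        if not os.access(scratch_dir, os.W_OK):
            problems.append(f"scratch directory not writable: {scratch_dir}")
        free = shutil.disk_usage(scratch_dir).free
        if free < min_free_bytes:
            problems.append(f"only {free} bytes free in {scratch_dir}")
    return problems
```

A wrapper could run this before starting the job and, if the list is non-empty, report the problems to the scheduler instead of letting the job start and fail partway through.
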
Data Management in Condor
• Production -> Research
    • Remote I/O
    • DAGMan
    • Kangaroo
• Common denominators:
    • Hide errors from jobs -- they cannot deal with “connection refused” or “network down.”
    • Propagate failures first to scheduler, and perhaps later to the user.

Remote I/O
• Relink job with Condor C library.
• I/O is performed along a TCP connection to the submit site: either fine-grained RPCs or whole-file staging.
Some failures:
• NFS down
• DNS down
• Node rebooting
• Missing input
On any failure:
1 - Kill -9 job
2 - Log event
3 - Email user?
4 - Reschedule
[Diagram: one submit site and several exec sites, two of them running a job.]

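The "on any failure" steps can be sketched as a single handler. This is an illustration only, not Condor's implementation: the job is assumed to be a local child process, the event log is a plain list, the run queue is a `deque`, and `start_stand_in_job` and `handle_io_failure` are hypothetical names.

```python
import subprocess
import sys
from collections import deque


def start_stand_in_job():
    """Start a stand-in for a real job: a child process that just sleeps."""
    return subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )


def handle_io_failure(proc, job_id, reason, event_log, run_queue):
    """Apply the four steps to a job whose remote I/O has failed."""
    proc.kill()                         # 1 - on POSIX this sends SIGKILL (kill -9)
    proc.wait()                         # reap the child so it does not linger
    event_log.append((job_id, reason))  # 2 - log event
    # 3 - emailing the user is omitted; the slide marks it with a question mark
    run_queue.append(job_id)            # 4 - reschedule
```

For example, `handle_io_failure(start_stand_in_job(), "job-1", "NFS down", [], deque())` kills the sleeping child, records the event, and puts `"job-1"` back on the queue. The job itself never sees the error, which matches the idea of hiding failures from jobs.
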
DAGMan (Directed Acyclic Graph Manager)
• A persistent ‘make’ for distributed computing.
• Handles dependencies and failures in multi-job tasks, including CPU and data movement.
[Diagram: Begin DAG -> Stage Input -> Run Remote Job -> Stage Output -> Check Output -> DAG Complete]
• If transfer fails… retry up to 5 times.
• If results are bogus… retry up to 10 times.

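The retry policy on this slide can be sketched in a few lines. This is not DAGMan's input format or API: it is a linear pipeline with hypothetical names (`retry`, `run_dag`), each step is assumed to be a callable returning True on success, and "bogus results" is read as "run the job again". It also keeps no state on disk, so it lacks the persistence that DAGMan provides.

```python
def retry(step, max_retries):
    """Run step() once, then up to max_retries more times, until it returns True."""
    for _ in range(1 + max_retries):
        if step():
            return
    name = getattr(step, "__name__", "step")
    raise RuntimeError(f"{name} still failing after {max_retries} retries")


def run_dag(stage_input, run_job, stage_output, check_output,
            transfer_retries=5, result_retries=10):
    """Stage input, then repeat run/stage/check until the output passes."""
    retry(stage_input, transfer_retries)

    def produce_and_check():
        if not run_job():
            return False
        retry(stage_output, transfer_retries)  # a dead transfer aborts the DAG
        return check_output()                  # False means the results are bogus

    retry(produce_and_check, result_retries)
```

When every retry is used up, `RuntimeError` propagates to the caller, which stands in for reporting the failed DAG to the scheduler.
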
Kangaroo
• Simple idea: use all available net, mem, and disk to buffer data. “Hop” it to the destination.
• A background process, not the job, is responsible for handling both faults and variations.
• Allows overlap of CPU and I/O.
[Diagram: App at the execution site; four K nodes forming the data movement system; Disk at the storage site.]

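The division of labor described here, where the job writes and a background process delivers, can be sketched with a queue and a thread. This is a simplification under stated assumptions: the buffer lives in memory rather than on disk, there is a single hop, delivery is retried forever, and `Spooler` and its `deliver` callback are hypothetical names rather than Kangaroo's interface.

```python
import queue
import threading
import time


class Spooler:
    """Accept output blocks from a job and deliver them in the background."""

    def __init__(self, deliver, retry_delay=0.01):
        self._deliver = deliver        # hypothetical callable: block -> bool
        self._retry_delay = retry_delay
        self._blocks = queue.Queue()
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def write(self, block):
        """Called by the job: returns at once and never reports a network error."""
        self._blocks.put(block)

    def _drain(self):
        """Background loop: deliver blocks in order, retrying on failure."""
        while True:
            block = self._blocks.get()
            while not self._deliver(block):  # fault: keep the block, try again
                time.sleep(self._retry_delay)
            self._blocks.task_done()

    def push(self):
        """Block until everything written so far has reached the destination."""
        self._blocks.join()
```

The job calls `write` and goes straight back to computing, which is where the overlap of CPU and I/O comes from. Only `push` waits for the data to arrive.
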
I/O Models
[Timeline diagram with two rows:]
• Kangaroo Output: INPUT, four CPU phases, four OUTPUT phases, PUSH
• Stage Output: INPUT, four CPU phases, four OUTPUT phases

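A rough timing model shows why overlapping output with computation helps. The model is an assumption, not taken from the slide: staged output is taken to finish each transfer before the next CPU phase starts, overlapped output runs transfers in the background one at a time, and the final push waits for the last transfer. The function names and any numbers fed to them are illustrative.

```python
def staged_time(input_t, cpu_t, output_t, phases):
    """Each output transfer completes before the next CPU phase starts."""
    return input_t + phases * (cpu_t + output_t)


def overlapped_time(input_t, cpu_t, output_t, phases):
    """Transfers run in the background, one at a time, while the CPU continues."""
    cpu_done = input_t
    transfer_done = input_t
    for _ in range(phases):
        cpu_done += cpu_t
        # A transfer starts once its data exists and the previous one is done.
        transfer_done = max(cpu_done, transfer_done) + output_t
    return transfer_done
```

With hypothetical values of 5 time units of input, four phases of 10 units of CPU, and 4 units per output transfer, `staged_time(5, 10, 4, 4)` gives 61 and `overlapped_time(5, 10, 4, 4)` gives 49.
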
In Summary…
• Correctness is a major obstacle to high-throughput cluster computing.
• Jobs must be protected from all of the possible errors in data access.
• Handle failures in two ways:
    • Abort, and inform the scheduler (not the user).
    • Fall back to an alternate resource.
• Pleasant side effect: higher throughput!
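
The two ways of handling failure can be combined in one sketch: try each alternate resource in turn, and abort only when none is left. The names `AbortJob` and `fetch_with_fallback` are hypothetical, and each source is assumed to be a callable that raises `OSError` (which covers "connection refused") when it is unavailable.

```python
class AbortJob(Exception):
    """Raised so that the scheduler, not the job, learns about the failure."""


def fetch_with_fallback(name, sources):
    """Try each source in turn; abort only when every one has failed."""
    errors = []
    for fetch in sources:
        try:
            return fetch(name)
        except OSError as exc:   # includes ConnectionRefusedError
            errors.append(exc)   # remember why, then fall back to the next one
    raise AbortJob(f"no source could supply {name!r}: {errors}")
```

In this sketch the code that launched the job would catch `AbortJob` and hand the job back to the scheduler, so the user's program never has to interpret a network error.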