
The Difficulties of Distributed Data

This presentation discusses the difficulties of accessing distributed data in high-throughput cluster computing environments, including correctness, heterogeneity, and data management. It also explores solutions such as remote I/O, DAGMan, and Kangaroo.

Presentation Transcript


  1. The Difficulties of Distributed Data
  Douglas Thain (thain@cs.wisc.edu)
  Condor Project, University of Wisconsin
  http://www.cs.wisc.edu/condor

  2. The Condor Project
  • Established in 1985.
  • Software for high-throughput cluster computing on sites ranging from 10 to 1000s of nodes.
  • Example installations:
    • 643 CPUs at UW-Madison in the CS building (computer architecture simulations)
    • 264 CPUs at INFN sites across Italy (CMS simulations)
  • Serves two communities: production software and computer science research.

  3. No Repository Here!
  • No master source of anyone’s data at UW-CS Condor!
  • But, a large amount of buffer space: 128 × 10 GB + 64 × 30 GB.
  • The ultimate store is at other sites:
    • NCSA mass store
    • CERN LHC repositories
  • We concentrate on software for loading, buffering, caching, and producing output efficiently.

  4. The Challenges of Large-Scale Data Access are…
  • 1 - Correctness!
    • Single stage: crashed machines, lost connections, missing libraries, wrong permissions, expired proxies…
    • End-to-end: a job is not “complete” until the output has been verified and written to disk.
  • 2 - Heterogeneity
    • By design: aggregated clusters.
    • By situation: disk layout, buffer capacity, network load.

  5. Your Comments
  • Jobs need scripts that check the readiness of the system before execution. (Tim Smith)
  • Single node failures are not worth investigating: reboot, reimage, replace. (Steve DuChene)
  • “A cluster is a large error amplifier.” (Chuck Boeheim)

  6. Data Management in Condor
  • Tools spanning production to research:
    • Remote I/O
    • DAGMan
    • Kangaroo
  • Common denominators:
    • Hide errors from jobs -- they cannot deal with “connection refused” or “network down.”
    • Propagate failures first to the scheduler, and perhaps later to the user (see the sketch below).
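
The “hide errors from jobs” policy can be pictured as a small classification step inside the I/O layer: transient network faults are never returned to the application; instead the wrapper aborts so the scheduler can retry the job elsewhere. The following is only a minimal sketch of that idea, not Condor source code; the function name and the exit code the scheduler is assumed to recognize are invented for illustration.

  /* Minimal sketch of the "propagate failures to the scheduler" policy.
   * The names and the exit code are hypothetical. */
  #include <errno.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define EXIT_RETRY_ELSEWHERE 111   /* hypothetical code meaning "reschedule me" */

  /* Called by the I/O layer when a remote operation fails.
   * The job itself never sees "connection refused" or "network down". */
  void handle_io_failure(int err)
  {
      switch (err) {
      case ECONNREFUSED:
      case ENETDOWN:
      case ETIMEDOUT:
          /* Transient environment problem: log it and hand control back
           * to the scheduler, which can reschedule the job elsewhere. */
          fprintf(stderr, "transient I/O failure (%d): rescheduling\n", err);
          exit(EXIT_RETRY_ELSEWHERE);
      default:
          /* Permanent problem (e.g. missing input): surface it, possibly
           * notifying the user only after the scheduler gives up. */
          fprintf(stderr, "permanent I/O failure (%d): aborting\n", err);
          exit(EXIT_FAILURE);
      }
  }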

  7. Remote I/O
  • Relink the job with the Condor C library.
  • I/O is performed along a TCP connection to the submit site: either fine-grained RPCs or whole-file staging (see the sketch below).
  • Some failures: NFS down, DNS down, node rebooting, missing input.
  • On any failure:
    • 1 - Kill -9 the job
    • 2 - Log the event
    • 3 - Email the user?
    • 4 - Reschedule
  [Diagram: jobs at many execution sites perform their I/O over the network against a single submit site.]
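
The fine-grained RPC mode can be pictured as interposing on the job’s I/O calls and forwarding each one over the socket to the submit machine. The sketch below is illustrative only: it is not the Condor library or its wire protocol, and the request layout, opcode, and remote_read function are invented for the example (it also ignores byte order and struct padding).

  /* Illustrative sketch of a fine-grained remote read RPC. */
  #include <stdint.h>
  #include <unistd.h>

  struct rpc_request {            /* hypothetical wire format */
      uint32_t opcode;            /* e.g. REMOTE_READ */
      uint32_t fd;                /* file descriptor on the submit site */
      uint64_t length;            /* number of bytes requested */
  };

  enum { REMOTE_READ = 1 };

  /* Send one read request over the TCP socket to the submit site and copy
   * the reply into buf.  Returns bytes read, or -1 on error; the caller
   * (the interposition layer) decides whether the error is fatal or
   * should trigger a reschedule. */
  ssize_t remote_read(int sock, uint32_t fd, void *buf, uint64_t length)
  {
      struct rpc_request req = { REMOTE_READ, fd, length };

      if (write(sock, &req, sizeof(req)) != sizeof(req))
          return -1;                  /* network fault: never exposed to the job */

      int64_t reply_len;              /* server answers with a length, then data */
      if (read(sock, &reply_len, sizeof(reply_len)) != sizeof(reply_len))
          return -1;
      if (reply_len < 0 || (uint64_t)reply_len > length)
          return -1;

      uint64_t got = 0;               /* read the payload, handling short reads */
      while (got < (uint64_t)reply_len) {
          ssize_t n = read(sock, (char *)buf + got, (uint64_t)reply_len - got);
          if (n <= 0)
              return -1;
          got += (uint64_t)n;
      }
      return (ssize_t)reply_len;
  }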

  8. DAGMan (Directed Acyclic Graph Manager)
  • A persistent ‘make’ for distributed computing.
  • Handles dependencies and failures in multi-job tasks, including CPU and data movement (see the example DAG below).
  [Diagram: Begin DAG -> Stage Input -> Run Remote Job -> Stage Output -> Check Output -> DAG Complete. If the results are bogus, retry the job up to 10 times; if a transfer fails, retry it up to 5 times.]
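
The retry structure in the diagram maps naturally onto a DAGMan input file. The example below is a sketch rather than an excerpt from a real workflow: the submit file names and the check_output.sh script are hypothetical, while the node ordering and retry counts follow the slide.

  # Illustrative DAGMan input file (submit file names are hypothetical).
  JOB  StageIn   stage_in.sub
  JOB  RunJob    run_job.sub
  JOB  StageOut  stage_out.sub

  PARENT StageIn  CHILD RunJob
  PARENT RunJob   CHILD StageOut

  # If the results look bogus, the POST script exits non-zero,
  # the node fails, and DAGMan retries it.
  SCRIPT POST RunJob check_output.sh

  # Retry counts from the slide: 10 for the job, 5 for the transfer.
  RETRY RunJob   10
  RETRY StageOut 5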

  9. Kangaroo
  • Simple idea: use all available network, memory, and disk to buffer data, and “hop” it to its destination.
  • A background process, not the job, is responsible for handling both faults and variations.
  • Allows overlap of CPU and I/O (see the sketch below).
  [Diagram: the application at the execution site writes to a chain of Kangaroo (K) processes, which move the data via intermediate disks to the storage site.]
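
One way to picture the Kangaroo idea on the application side: each write is appended to a local spool at the execution site and the call returns immediately, while a separate background mover “hops” the spooled records toward the storage site and absorbs faults and bandwidth variations. The sketch below shows only that spooling step; the record layout, spool path, and kangaroo_put name are invented for illustration and the background mover is not shown.

  /* Illustrative application-side spooling for Kangaroo-style output. */
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  /* Append one (path, offset, data) record to the local spool and return
   * immediately, so the CPU can keep computing while I/O trails behind.
   * A background mover would later drain the spool toward the storage
   * site, retrying on failures. */
  int kangaroo_put(const char *spool, const char *dest_path,
                   uint64_t offset, const void *data, uint32_t length)
  {
      FILE *f = fopen(spool, "ab");
      if (!f)
          return -1;

      uint32_t path_len = (uint32_t)strlen(dest_path);
      int ok = fwrite(&path_len, sizeof(path_len), 1, f) == 1
            && fwrite(dest_path, 1, path_len, f) == path_len
            && fwrite(&offset, sizeof(offset), 1, f) == 1
            && fwrite(&length, sizeof(length), 1, f) == 1
            && fwrite(data, 1, length, f) == length;

      if (fclose(f) != 0)
          ok = 0;
      return ok ? 0 : -1;
  }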

  10. I/O Models
  [Diagram comparing two output models. Stage Output: input, then the CPU phases run to completion, and only afterwards is all of the output transferred. Kangaroo Output: input, then output is pushed to the destination during the CPU phases, overlapping computation with I/O.]

  11. In Summary…
  • Correctness is a major obstacle to high-throughput cluster computing.
  • Jobs must be protected from all of the possible errors in data access.
  • Handle failures in two ways:
    • Abort, and inform the scheduler (not the user).
    • Fall back to an alternate resource.
  • Pleasant side effect: higher throughput!
