Atlas: An Infrastructure for Global Computing

jmilliken

Presentation Transcript


  1. Atlas: An Infrastructure for Global Computing

  2. People • Eric Baldeschwieler (UC Berkeley) • Bobby Blumofe (UT Austin) • Eric Brewer (UC Berkeley)

  3. Outline • Introduction • Programming model • Architecture • Examples • Discussion • Limitations & Conclusion

  4. Introduction: Properties of an Internet computing infrastructure • Scalability: to 10^6 nodes • Heterogeneity: of machines & OSs • Fault tolerance: completion probability comparable to that of a sequential program • Adaptive parallelism: dynamic set of resources

  5. Properties ... • Safety: Hosts must be secure • Anonymity: Protect the privacy of the client's data & program • Hierarchy: Locality of communication (local bandwidth typically is higher) • Ease of use: Minimize “costs” of participating. • Reasonable performance: Low overhead, so that even a small set of machines yields a benefit.

  6. Introduction ... • Atlas combines mechanisms from: • Cilk • Java • with new mechanisms. • Java “ensures”: • heterogeneity • safety

  7. Introduction ... Atlas: • extends Cilk’s work-stealing scheduler to a hierarchical Internet setting • uses Cilk-NOW’s mechanisms for: • adaptive parallelism • fault tolerance

  8. Programming Model • Applications are written in Java • When a native library is used, heterogeneity is limited to platforms that support it. • Programming model is: • a Java-based implementation of Cilk: • Non-blocking, explicit continuation passing threads • a Unix-like URL-based file system & local caching with coherence.
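
To illustrate what “non-blocking, explicit continuation passing threads” means, here is a minimal single-process sketch of Fibonacci written in that style. It is an assumption-laden illustration, not Atlas's API: the class `CpsFib`, the `Sum` successor, and the ready queue are hypothetical names, and a real runtime would spread the ready threads across compute servers.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.IntConsumer;

// Hypothetical sketch; none of these names come from Atlas or Cilk.
final class CpsFib {
    // Ready threads; each one runs to completion and never blocks.
    private final Deque<Runnable> ready = new ArrayDeque<>();

    // Successor thread: becomes ready only after both arguments have arrived.
    private final class Sum {
        private final IntConsumer cont;
        private int missing = 2;
        private int total = 0;

        Sum(IntConsumer cont) {
            this.cont = cont;
        }

        void send(int value) {
            total += value;
            if (--missing == 0) {
                ready.push(() -> cont.accept(total));
            }
        }
    }

    // fib never returns a value; it sends its result to the continuation.
    private void fib(int n, IntConsumer cont) {
        if (n < 2) {
            cont.accept(n);
            return;
        }
        Sum join = new Sum(cont);
        ready.push(() -> fib(n - 1, join::send));
        ready.push(() -> fib(n - 2, join::send));
    }

    static int run(int n) {
        CpsFib scheduler = new CpsFib();
        int[] result = new int[1];
        scheduler.ready.push(() -> scheduler.fib(n, v -> result[0] = v));
        while (!scheduler.ready.isEmpty()) {
            scheduler.ready.pop().run();
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(run(24));
    }
}
```

Since no thread ever waits on a child, any ready thread can be handed to another worker without saving a call stack, which is what makes this style a natural fit for work stealing.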

  9. Architecture: Basic architecture [Diagram: a client, a manager, and several compute servers. Software stack shown: application (Java), runtime library, Java interpreter, native libraries (C or C++).]

  10. Architecture ... • Client is a Java application • connects to compute servers on machines other than its manager’s. • Idle servers steal work from busy ones.

  11. Architecture • Compute server: • relinquishes control when there is non-Atlas work (a screensaver?) • Runs as a daemon: • working • pings manager & siblings for work to steal

  12. Architecture: Porting Atlas • A Java runtime system • Port: • natively written URL-based file system • some support routines.

  13. Hierarchical Work Stealing [Diagram: several managers and several compute servers.]

  14. Hierarchical Work Stealing ... • Manager keeps track of when its subtree is idle • If manager’s subtree is idle, manager steals work from its siblings • If a subtree has “too much” work, it “allows” work stealing from above • Open question: What are the definition & implementation of “too much”?
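
The stealing rule on this slide can be sketched as a walk up the manager tree: an idle node first asks its siblings, and only when the whole subtree is idle does the request escalate one level. This is a minimal sketch under stated assumptions, not Atlas code. The class `HierNode`, its methods, and the choice to steal the oldest task first (borrowed from Cilk-style deques) are all hypothetical, and the “too much” threshold is not modeled because the slides leave it undefined.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: names and steal policy are assumptions, not Atlas's API.
final class HierNode {
    final String name;
    final HierNode parent;                          // null at the root manager
    final List<HierNode> children = new ArrayList<>();
    final Deque<String> work = new ArrayDeque<>();  // used by compute servers (leaves)

    HierNode(String name, HierNode parent) {
        this.name = name;
        this.parent = parent;
        if (parent != null) {
            parent.children.add(this);
        }
    }

    // A subtree is idle when no node in it holds any work.
    boolean subtreeIdle() {
        if (!work.isEmpty()) {
            return false;
        }
        for (HierNode child : children) {
            if (!child.subtreeIdle()) {
                return false;
            }
        }
        return true;
    }

    // Remove the oldest task found in this subtree, or return null if it is idle.
    String takeFromSubtree() {
        if (!work.isEmpty()) {
            return work.pollFirst();
        }
        for (HierNode child : children) {
            String task = child.takeFromSubtree();
            if (task != null) {
                return task;
            }
        }
        return null;
    }

    // Steal on behalf of this node: siblings first, then one level up at a time.
    String steal() {
        for (HierNode level = this; level.parent != null; level = level.parent) {
            for (HierNode sibling : level.parent.children) {
                if (sibling == level) {
                    continue;
                }
                String task = sibling.takeFromSubtree();
                if (task != null) {
                    return task;
                }
            }
        }
        return null; // the whole tree is idle
    }
}
```

Because the search only leaves a subtree after every sibling in it has come up empty, steals stay local whenever local work exists, which is the communication-locality goal the slides describe.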

  15. Hierarchical Work Stealing • The authors claim that proven properties of Cilk hold in this hierarchical setting. • Goals: • Localize communication • Sub-trees map to domain hierarchy • Administrators can control thread migration: • Outflow: Privacy • Inflow: Host security

  16. Examples • Fib: fine grained threads • POV-Ray: coarse grained threads

      |          | Base  | 1 Node | 3 Nodes  | 8 Nodes    |
      |----------|-------|--------|----------|------------|
      | Fib (24) | 1.3   | 80     | 40 (2.0) | 31 (2.6)   |
      | POV-Ray  | 20700 | 21000  | -        | 2700 (7.8) |

      Numbers in ( ) are speedups over 1-node case.
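
The parenthesized speedups can be rechecked from the raw figures. The sketch below assumes that “Base” is the running time of the plain sequential program and that all entries share one time unit. The slide states neither, and the class and method names are illustrative only.

```java
// Hypothetical helper for rechecking the table; not part of Atlas.
final class Speedup {
    // Speedup of a P-node run relative to the 1-node run.
    static double speedup(double oneNodeTime, double pNodeTime) {
        return oneNodeTime / pNodeTime;
    }

    // Slowdown of the 1-node Atlas run relative to the base run (assumed sequential).
    static double slowdown(double baseTime, double oneNodeTime) {
        return oneNodeTime / baseTime;
    }

    public static void main(String[] args) {
        System.out.printf("Fib, 3 nodes:     %.1f%n", speedup(80, 40));
        System.out.printf("Fib, 8 nodes:     %.1f%n", speedup(80, 31));
        System.out.printf("POV-Ray, 8 nodes: %.1f%n", speedup(21000, 2700));
        System.out.printf("Fib, 1 node vs base:     %.1fx%n", slowdown(1.3, 80));
        System.out.printf("POV-Ray, 1 node vs base: %.3fx%n", slowdown(20700, 21000));
    }
}
```

Under that reading, the fine-grained Fib run pays a large per-thread cost on one node (80 against a base of 1.3), while coarse-grained POV-Ray stays within about 1.5% of its base on one node.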

  17. Examples ... • POV-Ray is not written in Java • Partitioning is done in Java • 8 nodes: only 2% overhead. • What about larger P?

  18. Discussion • Scalable: Yes. • Heterogeneity: Incomplete until Atlas divorces itself from all native libraries. • Safety: • Java: OK. • Native libraries: ?

  19. Discussion ... • Fault tolerance: A timed-out thread is recomputed from a checkpoint maintained by the subtree (manager?) • What is the effect of checkpointing on performance? • The subtree rooted at a thread is its subcomputation.
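
One way to picture recovery from a checkpoint is a table that records, per subcomputation, the last saved progress and the last time its host was heard from. This is a rough sketch under assumptions: the slides do not say who keeps the checkpoint or what it contains, so `CheckpointTable`, the integer progress value, and the caller-supplied logical clock are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of timeout-driven recovery; not Atlas's mechanism.
final class CheckpointTable {
    private static final class Entry {
        String host;       // host currently running the subcomputation
        int checkpoint;    // last saved progress (0 = nothing saved yet)
        long lastHeard;    // logical time of the last message from the host
    }

    private final Map<String, Entry> entries = new HashMap<>();
    private final long timeout;

    CheckpointTable(long timeout) {
        this.timeout = timeout;
    }

    void start(String id, String host, long now) {
        Entry e = new Entry();
        e.host = host;
        e.lastHeard = now;
        entries.put(id, e);
    }

    // Record progress; a checkpoint also counts as a sign of life.
    void checkpoint(String id, int progress, long now) {
        Entry e = entries.get(id);
        e.checkpoint = progress;
        e.lastHeard = now;
    }

    String hostOf(String id) {
        return entries.get(id).host;
    }

    // If the host has been silent longer than the timeout, reassign the
    // subcomputation and return the progress to resume from; otherwise -1.
    int recoverIfTimedOut(String id, String newHost, long now) {
        Entry e = entries.get(id);
        if (now - e.lastHeard <= timeout) {
            return -1;
        }
        e.host = newHost;
        e.lastHeard = now;
        return e.checkpoint;
    }
}
```

Work done after the last checkpoint is lost and recomputed, so the checkpoint interval trades steady-state overhead against recovery cost, which is the performance question the slide raises.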

  20. Fault Tolerance ... Subcomputations are transactions: • Authors claim: side effects can be undone • How does this relate to hierarchical work stealing?

  21. Discussion ... • Anonymity: A host executing a stolen subtree cannot determine client. • Managers are assumed to be trustworthy • Hierarchy: Yes, via manager hierarchy. • Ease of use: Interface incomplete. • clients submit jobs via a special “shell”

  22. Discussion ... • Adaptive parallelism: • “Owner” (?) of compute server sets a policy that defines when server is idle. • How? • When a compute server becomes unavailable for Atlas work, all its sub-computations are moved to another compute server.

  23. Adaptive Parallelism ... • Moving a subcomputation requires updating information linking subcomputation to its: • parent • children • How long does it take to retreat? • Is sub-computation restarted? From checkpoint?
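
The bookkeeping described on this slide can be sketched as follows. It assumes each subcomputation caches the host address of its parent and of each child, so a move must refresh those cached addresses on both sides. The class `SubComputation` and its fields are hypothetical and are not taken from Atlas or Cilk-NOW.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of relinking after a move; not Atlas's data structures.
final class SubComputation {
    final String id;
    String host;                                             // where this subcomputation runs
    final SubComputation parent;                             // null for the root
    final List<SubComputation> children = new ArrayList<>();
    String parentHost;                                       // cached address of the parent
    final Map<String, String> childHosts = new HashMap<>();  // cached child addresses

    SubComputation(String id, String host, SubComputation parent) {
        this.id = id;
        this.host = host;
        this.parent = parent;
        if (parent != null) {
            this.parentHost = parent.host;
            parent.children.add(this);
            parent.childHosts.put(id, host);
        }
    }

    // Move to a new host and fix the links that point at this subcomputation.
    void migrateTo(String newHost) {
        host = newHost;
        if (parent != null) {
            parent.childHosts.put(id, newHost);  // the parent must find us again
        }
        for (SubComputation child : children) {
            child.parentHost = newHost;          // children return results here
        }
    }
}
```

In this model the cost of a move grows with the number of children plus one, since every neighbor in the tree holds a link that needs updating.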

  24. Limitations • Atlas inherits tree-structured program limitation from Cilk. • But this is still a rich set! • Generalizing to non-tree-structured programs seems hard. • No shared variables among threads. • Global file system is read-only.

  25. Conclusion • Jicos design goals = those for Atlas. • Use JXTA to give Jicos a “file system” • Then, Jicos becomes Atlas’s heir.
