Parallel Programming on Computational Grids


  1. Parallel Programming on Computational Grids

  2. Outline • Grids • Application-level tools for grids • Parallel programming on grids • Case study: Ibis

  3. Grids • Seamless integration of geographically distributed computers, databases, instruments • The name is an analogy with power grids • Highly active research area • Global Grid Forum • Globus middleware • Many European projects • e.g. GridLab: Grid Application Toolkit and Testbed • VL-e (Virtual laboratory for e-Science) project • ….

  4. Why Grids? • New distributed applications that use data or instruments across multiple administrative domains and that need much CPU power • Computer-enhanced instruments • Collaborative engineering • Browsing of remote datasets • Use of remote software • Data-intensive computing • Very large-scale simulation • Large-scale parameter studies

  5. Web, Grids and e-Science • Web is about exchanging information • Grid is about sharing resources • Computers, databases, instruments • e-Science supports experimental science by providing a virtual laboratory on top of Grids • Support for visualization, workflows, data management, security, authentication, high-performance computing

  6. Grids • Harness distributed resources • The big picture: [diagram: applications, each with a potential generic part, running on a Virtual Laboratory layer of application-oriented services, which in turn sits on top of the management of communication & computing resources]

  7. Applications • e-Science experiments generate large amounts of data that are often distributed and need much (parallel) processing • High-resolution imaging: ~ 1 GByte per measurement • Bio-informatics queries: 500 GByte per database • Satellite world imagery: ~ 5 TByte/year • Current particle physics: 1 PByte per year • LHC physics (2007): 10-30 PByte per year

  8. Grid programming • The goal of a grid is to make resource sharing very easy (transparent) • Practice: grid programming is very difficult • Finding resources, running applications, dealing with heterogeneity and security, etc. • Grid middleware (Globus Toolkit) makes this somewhat easier, but is still low-level and changes frequently • Need application-level tools

  9. Application-level tools • Build on the grid software infrastructure • Isolate users from the dynamics of the grid hardware infrastructure • Generic (broad classes of applications) • Easy to use

  10. Taxonomy of existing application-level tools • Grid programming models • RPC • Task parallelism • Message passing • Java programming • Grid application execution environments • Parameter sweeps • Workflow • Portals

  11. Remote Procedure Call (RPC) • GridRPC: specializes RPC-style (client/server) programming for grids • Allows coarse-grain task parallelism & remote access • Extended with resource discovery, scheduling, etc. • Example: NetSolve • Solves numerical problems on a grid • Current development: use web technology (WSDL, SOAP) for grid services • Web and grid technology are merging
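
The RPC-style call pattern described above can be sketched in plain Java. The RemoteSolver interface and GridRpcClient class are hypothetical illustrations of the programming style, not the actual GridRPC or NetSolve API, and the asynchrony is simulated with a local thread pool rather than a real remote server.

  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.Future;

  // Illustrative RPC-style client: a coarse-grain numerical task is handed to
  // a solver and the client continues while it runs. In a real GridRPC system
  // the middleware would pick a remote server; here a thread pool stands in.
  interface RemoteSolver {
      double[] solve(double[][] matrix, double[] rhs);   // coarse-grain call
  }

  class GridRpcClient {
      private final ExecutorService pool = Executors.newCachedThreadPool();

      // Asynchronous call: returns immediately with a handle to the result.
      Future<double[]> solveAsync(RemoteSolver server, double[][] a, double[] b) {
          return pool.submit(() -> server.solve(a, b));
      }
  }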

  12. Task parallelism • Many systems for task parallelism (master-worker, replicated workers) exist for the grid • Examples • MW (master-worker) • Satin: divide&conquer (hierarchical master-worker)
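
A minimal master-worker sketch in plain Java, using only java.util.concurrent rather than the MW or Satin APIs; the task here is a toy summation standing in for real work.

  import java.util.*;
  import java.util.concurrent.*;

  // Master-worker sketch: the master hands out independent tasks to a pool of
  // workers and combines their partial results.
  class MasterWorker {
      public static void main(String[] args) throws Exception {
          ExecutorService workers = Executors.newFixedThreadPool(4);
          List<Future<Long>> results = new ArrayList<>();

          // Master: hand out independent tasks (here: sum a range of numbers).
          for (int task = 0; task < 16; task++) {
              final long lo = task * 1_000_000L, hi = lo + 1_000_000L;
              results.add(workers.submit(() -> {
                  long sum = 0;
                  for (long i = lo; i < hi; i++) sum += i;
                  return sum;                      // worker returns its partial result
              }));
          }

          // Master: combine the partial results.
          long total = 0;
          for (Future<Long> r : results) total += r.get();
          System.out.println("total = " + total);
          workers.shutdown();
      }
  }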

  13. Message passing • Several MPI implementations exist for the grid • PACX MPI (Stuttgart): • Runs on heterogeneous systems • MagPIe (Thilo Kielmann): • Optimizes MPI’s collective communication for hierarchical wide-area systems • MPICH-G2: • Similar to PACX and MagPIe, implemented on Globus
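
The idea behind MagPIe-style wide-area collectives can be sketched as follows; the Node and Cluster types are illustrative, not the MagPIe or MPICH-G2 code. The point is that a broadcast message crosses the slow wide-area link only once per cluster, after which it is forwarded over the fast local network.

  import java.util.List;

  // Sketch of a cluster-aware broadcast: one wide-area message per cluster,
  // then fast local forwarding inside each cluster. Types are illustrative.
  interface Node { void deliver(byte[] message); }

  class Cluster {
      final Node coordinator;
      final List<Node> members;       // the coordinator's local peers, plus itself
      Cluster(Node coordinator, List<Node> members) {
          this.coordinator = coordinator;
          this.members = members;
      }
  }

  class WideAreaBroadcast {
      static void broadcast(byte[] msg, List<Cluster> clusters) {
          for (Cluster c : clusters) {
              c.coordinator.deliver(msg);            // 1 wide-area message per cluster
              for (Node n : c.members) {
                  if (n != c.coordinator) n.deliver(msg);   // local forwarding
              }
          }
      }
  }

A real implementation would perform the local forwarding at the receiving cluster, in parallel with the other wide-area sends; the sketch flattens that into one loop to keep the structure visible.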

  14. Java programming • Java uses bytecode and is very portable • “Write once, run anywhere” • Can be used to handle heterogeneity • Many systems now have Java interfaces: • Globus (Globus Java Commodity Grid) • MPI (MPIJava, MPJ, …) • GridLab Application Toolkit (Java GAT) • Ibis is a Java-centric grid programming system

  15. Parameter sweep applications • Computations that are mostly independent • E.g. run the same simulation many times with different parameters • Can tolerate high network latencies, can easily be made fault-tolerant • Many systems use this type of trivial parallelism to harness idle desktops • APST, SETI@home, Entropia, XtremWeb
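
A parameter sweep is trivially parallel, which a few lines of plain Java can illustrate; the simulate method is a placeholder for the real simulation.

  import java.util.stream.IntStream;

  // Parameter-sweep sketch: run the same simulation for many independent
  // parameter values in parallel; each run is self-contained.
  class ParameterSweep {
      // Placeholder for an expensive, self-contained simulation run.
      static double simulate(double parameter) {
          return Math.sin(parameter) * parameter;   // stand-in computation
      }

      public static void main(String[] args) {
          double[] results = IntStream.range(0, 1000)
                  .parallel()                       // runs are independent
                  .mapToDouble(i -> simulate(i * 0.01))
                  .toArray();
          System.out.println("first result = " + results[0]);
      }
  }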

  16. Workflow applications • Link and compose diverse software tools and data formats • Connect modules and data filters • Results in coarse-grain, dataflow-like parallelism that can be run efficiently on a grid • Several workflow management systems exist • E.g. Virtual Lab Amsterdam (predecessor of VL-e)

  17. Portals • Graphical interfaces to the grid • Often application-specific • Also portals for resource brokering, file transfers, etc.

  18. Outline • Grids • Application-level tools for grids • Parallel programming on grids • Case study: Ibis

  19. Distributed supercomputing • Parallel processing on geographically distributed computing systems (grids) • Examples: • SETI@home, RSA-155, Entropia, Cactus • Currently limited to trivially parallel applications • Questions: • Can we generalize this to more HPC applications? • What high-level programming support is needed?

  20. Grids versus supercomputers • Performance/scalability • Speedups on geographically distributed systems? • Heterogeneity • Different types of processors, operating systems, etc. • Different networks (Ethernet, Myrinet, WANs) • General grid issues • Resource management, co-allocation, firewalls, security, authorization, accounting, ….

  21. Approaches • Performance/scalability • Exploit hierarchical structure of grids (previous project: Albatross) • Heterogeneity • Use Java + JVM (Java Virtual Machine) technology • General grid issues • Studied in many grid projects: VL-e, GridLab, GGF

  22. Speedups on a grid? • Grids usually are hierarchical • Collections of clusters, supercomputers • Fast local links, slow wide-area links • Can optimize algorithms to exploit this hierarchy • Minimize wide-area communication • Successful for many applications • Did many experiments on a homogeneous wide-area test bed (DAS)

  23. Example: N-body simulation • Much wide-area communication • Each node needs info about remote bodies • [Diagram: two CPUs in Amsterdam and two in Delft exchanging body information over the wide-area link]

  24. Trivial optimization • [Diagram: the same Amsterdam and Delft CPUs, now with reduced wide-area communication between the clusters]

  25. Wide-area optimizations • Message combining on wide-area links • Latency hiding on wide-area links • Collective operations for wide-area systems • Broadcast, reduction, all-to-all exchange • Load balancing • Conclusions: • Many applications can be optimized to run efficiently on a hierarchical wide-area system • Need better programming support
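
The first optimization above, message combining, can be sketched as follows; the WideAreaLink interface and the threshold-based flushing policy are illustrative assumptions, not taken from the Albatross or Ibis code.

  import java.io.ByteArrayOutputStream;

  // Sketch of message combining on a wide-area link: buffer small messages
  // and send them as one combined wide-area message.
  interface WideAreaLink { void send(byte[] combined); }

  class CombiningSender {
      private final WideAreaLink link;
      private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      private final int threshold;

      CombiningSender(WideAreaLink link, int threshold) {
          this.link = link;
          this.threshold = threshold;
      }

      // Queue a small message; only cross the WAN when enough data is buffered.
      synchronized void send(byte[] msg) {
          buffer.write(msg, 0, msg.length);
          if (buffer.size() >= threshold) flush();
      }

      synchronized void flush() {
          if (buffer.size() > 0) {
              link.send(buffer.toByteArray());   // one wide-area message
              buffer.reset();
          }
      }
  }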

  26. The Ibis system • High-level & efficient programming support for distributed supercomputing on heterogeneous grids • Use Java-centric approach + JVM technology • Inherently more portable than native compilation • “Write once, run anywhere” • Requires entire system to be written in pure Java • Optimized special-case solutions with native code • E.g. native communication libraries

  27. Ibis programming support • Ibis provides • Remote Method Invocation (RMI) • Replicated objects (RepMI) - as in Orca • Group/collective communication (GMI) - as in MPI • Divide & conquer (Satin) - as in Cilk • All integrated in a clean, object-oriented way into Java, using special “marker” interfaces • Invoking a native library (e.g. MPI) would give up Java’s “run anywhere” portability
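
For the RMI model, the programming style is that of standard Java RMI, which the minimal sketch below shows; the Compute service is a made-up example, not taken from Ibis.

  import java.rmi.Remote;
  import java.rmi.RemoteException;
  import java.rmi.registry.LocateRegistry;
  import java.rmi.registry.Registry;
  import java.rmi.server.UnicastRemoteObject;

  // A remotely callable interface and its server-side implementation.
  interface Compute extends Remote {
      double work(double input) throws RemoteException;
  }

  class ComputeServer extends UnicastRemoteObject implements Compute {
      ComputeServer() throws RemoteException { super(); }

      public double work(double input) throws RemoteException {
          return Math.sqrt(input);                          // stand-in computation
      }

      public static void main(String[] args) throws Exception {
          Registry registry = LocateRegistry.createRegistry(1099);
          registry.rebind("compute", new ComputeServer());  // publish the object
          // A client would then do:
          //   Compute c = (Compute) LocateRegistry.getRegistry("host").lookup("compute");
          //   double r = c.work(2.0);   // looks like a local call, runs remotely
      }
  }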

  28. Compiling/optimizing programs • [Diagram: source → Java compiler → bytecode → bytecode rewriter → rewritten bytecode, which then runs on the JVMs] • Optimizations are done by bytecode rewriting • E.g. compiler-generated serialization (as in Manta)

  29. Satin: a parallel divide-and-conquer system on top of Ibis • Divide-and-conquer is inherently hierarchical • More general than master/worker • Satin: Cilk-like primitives (spawn/sync) in Java • New load balancing algorithm (CRS) • Cluster-aware random work stealing

  30. Example

  interface FibInter {
      public int fib(int n);
  }

  class Fib implements FibInter {
      public int fib(int n) {
          if (n < 2) return n;
          return fib(n - 1) + fib(n - 2);
      }
  }

  Single-threaded Java

  31. Example

  interface FibInter extends ibis.satin.Spawnable {
      public int fib(int n);
  }

  class Fib extends ibis.satin.SatinObject implements FibInter {
      public int fib(int n) {
          if (n < 2) return n;
          int x = fib(n - 1);   // spawned
          int y = fib(n - 2);   // spawned
          sync();               // wait for the spawned calls to finish
          return x + y;
      }
  }

  Java + divide&conquer

  32. Ibis implementation • Want to exploit Java’s “run everywhere” property, but • That requires a 100% pure Java implementation, without a single line of native code • Hard to use native communication (e.g. Myrinet) or native compiler/runtime system • Ibis approach: • Reasonably efficient pure Java solution (for any JVM) • Optimized solutions with native code for special cases

  33. Ibis design

  34. Status and current research • Programming models • ProActive (mobility) • Satin (divide&conquer) • Applications • Jem3D (ProActive) • Barnes-Hut, satisfiability solver (Satin) • Spectrometry, sequence alignment (RMI) • Communication • WAN-communication, performance-awareness

  35. Challenges • How to make the system flexible enough • Run seamlessly on different hardware / protocols • Make the pure-Java solution efficient enough • Need fast local communication even for grid applications • Special-case optimizations

  36. Fast communication in pure Java • Manta system [ACM TOPLAS Nov. 2001] • RMI at RPC speed, but using native compiler & RTS • Ibis does similar optimizations, but in pure Java • Compiler-generated serialization at bytecode level • 5-9x faster than using runtime type inspection • Reduce copying overhead • Zero-copy native implementation for primitive arrays • Pure-Java requires type-conversion (=copy) to bytes
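
The difference between compiler-generated serialization and runtime type inspection can be illustrated as follows; the generatedWrite method is hand-written here to show the kind of code a bytecode rewriter could emit, and is not the actual Ibis output.

  import java.io.DataOutputStream;
  import java.io.IOException;
  import java.lang.reflect.Field;

  class Body {
      double x, y, z, mass;

      // What a bytecode rewriter could generate: direct, type-specific writes,
      // with no runtime type lookups.
      void generatedWrite(DataOutputStream out) throws IOException {
          out.writeDouble(x);
          out.writeDouble(y);
          out.writeDouble(z);
          out.writeDouble(mass);
      }

      // Reflection-based alternative: inspects the type at runtime for every object.
      void reflectiveWrite(DataOutputStream out) throws IOException {
          try {
              for (Field f : getClass().getDeclaredFields()) {
                  out.writeDouble(f.getDouble(this));     // per-field runtime lookup
              }
          } catch (IllegalAccessException e) {
              throw new IOException(e);
          }
      }
  }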

  37. Java/Ibis vs. C/MPI on Pentium-3 cluster (using SOR)

  38. Grid experiences with Ibis • Using Satin divide-and-conquer system • Implemented with Ibis in pure Java, using TCP/IP • Application measurements on • DAS-2 (homogeneous) • Testbed from EC GridLab project (heterogeneous) • Grid’5000 (France) – N Queens challenge

  39. Distributed ASCI Supercomputer (DAS) 2 • [Map: clusters at VU (72 nodes), UvA (32), Leiden (32), Delft (32), and Utrecht (32), connected by GigaPort (1-10 Gb)]

  40. Satin on wide-area DAS-2

  41. Satin on GridLab • Heterogeneous European grid testbed • Implemented Satin/Ibis on GridLab, using TCP • Source: van Nieuwpoort et al., AGRIDM’03 (Workshop on Adaptive Grid Middleware, New Orleans, Sept. 2003)

  42. GridLab • Latencies: • 9-200 ms (daytime), 9-66 ms (night) • Bandwidths: • 9-4000 KB/s

  43. Configuration

  44. Experiences • No support for co-allocation yet (done manually) • Firewall problems everywhere • Currently: use a range of site-specific open ports • Can be solved with TCP splicing • Java indeed runs anywhere modulo bugs in (old) JVMs • Need clever load balancing mechanism (CRS)

  45. Cluster-aware Random Stealing • Use Cilk’s Random Stealing (RS) inside cluster • When idle • Send asynchronous wide-area steal message • Do random steals locally, and execute stolen jobs • Only 1 wide-area steal attempt in progress at a time • Prefetching adapts • More idle nodes → more prefetching • Source: van Nieuwpoort et al., ACM PPoPP’01
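
A simplified sketch of the CRS idle loop is given below; the helper methods are hypothetical hooks into the runtime, and the sketch omits the adaptive prefetching mentioned above, so this is an illustration of the algorithm, not the Satin implementation.

  import java.util.Random;

  // Cluster-aware Random Stealing, simplified: at most one asynchronous
  // wide-area steal is outstanding, while local steals continue meanwhile.
  abstract class CrsScheduler {
      private final Random rnd = new Random();

      // Called when this node has no local work left.
      void onIdle() {
          sendWideAreaStealRequest(randomRemoteCluster());   // asynchronous
          boolean wideAreaPending = true;

          while (wideAreaPending) {
              // Meanwhile, steal synchronously from random local peers.
              Runnable job = stealLocal(randomLocalPeer());
              if (job != null) job.run();                    // execute stolen local job

              if (wideAreaReplyArrived()) {                  // reply may or may not carry work
                  Runnable remote = takeWideAreaJob();
                  if (remote != null) remote.run();
                  wideAreaPending = false;                   // a new request may follow later
              }
          }
      }

      private int randomLocalPeer()     { return rnd.nextInt(numLocalPeers()); }
      private int randomRemoteCluster() { return rnd.nextInt(numRemoteClusters()); }

      // Hypothetical runtime hooks.
      abstract int numLocalPeers();
      abstract int numRemoteClusters();
      abstract Runnable stealLocal(int peer);                // synchronous local steal
      abstract void sendWideAreaStealRequest(int cluster);   // asynchronous WAN steal
      abstract boolean wideAreaReplyArrived();
      abstract Runnable takeWideAreaJob();
  }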

  46. Performance on GridLab • Problem: how to define efficiency on a grid? • Our approach: • Benchmark each CPU with Raytracer on small input • Normalize CPU speeds (relative to a DAS-2 node) • Our case: 40 CPUs, equivalent to 24.7 DAS-2 nodes • Define: • T_perfect = sequential time / 24.7 • efficiency = T_perfect / actual runtime • Also compare against single 25-node DAS-2 cluster
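
A worked example of this metric, using the 24.7 node-equivalents from the slide and made-up runtimes:

  // Worked example of the efficiency definition; the sequential and measured
  // runtimes are hypothetical, only the 24.7 node-equivalents come from the slide.
  class GridEfficiency {
      public static void main(String[] args) {
          double nodeEquivalents = 24.7;     // 40 grid CPUs, normalized to DAS-2 nodes
          double sequentialTime  = 4940.0;   // hypothetical sequential runtime (s)
          double actualRuntime   = 250.0;    // hypothetical measured grid runtime (s)

          double tPerfect   = sequentialTime / nodeEquivalents;   // 200 s
          double efficiency = tPerfect / actualRuntime;           // 0.8 = 80%

          System.out.printf("T_perfect = %.1f s, efficiency = %.0f%%%n",
                            tPerfect, efficiency * 100);
      }
  }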

  47. Results for Raytracer • (RS = Random Stealing, CRS = Cluster-aware RS)

  48. Some statistics • Variations in execution times: • RS @ day: 0.5 - 1 hour • CRS @ day: less than 20 secs variation • Internet communication (total): • RS: 11,000 (night) - 150,000 (day) messages; 137 (night) - 154 (day) MB • CRS: 10,000 - 11,000 messages; 82 - 86 MB

  49. N-Queens Challenge • How many ways are there to arrange N non-attacking queens on an NxN chessboard?
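
A plain-Java sketch of the sequential counting kernel that such challenges parallelize (Satin would spawn the recursive calls as divide-and-conquer jobs); the bitmask representation is a common trick and not necessarily what the challenge entrants used.

  // Sequential N-Queens counter using bitmasks for occupied columns and diagonals.
  class NQueens {
      static long solve(int n, int cols, int diag1, int diag2) {
          if (cols == (1 << n) - 1) return 1;            // all rows placed
          long count = 0;
          int free = ~(cols | diag1 | diag2) & ((1 << n) - 1);
          while (free != 0) {
              int bit = free & -free;                    // lowest free column
              free -= bit;
              count += solve(n, cols | bit,
                             (diag1 | bit) << 1,
                             (diag2 | bit) >> 1);
          }
          return count;
      }

      public static void main(String[] args) {
          int n = 12;                                    // small case; N=25 is the record
          System.out.println(n + "-queens solutions: " + solve(n, 0, 0, 0));
      }
  }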

  50. Known Solutions • Currently, all solutions up to N=25 are known • Result N=25 was computed using the 'spare time' of 260 machines • Took 6 months • Used over 53 CPU-years!
