Distributed Resource Management and Parallel Computation
Dr Michael Rudgyard, Streamline Computing Ltd

Presentation Transcript

  1. Distributed Resource Management and Parallel Computation
     Dr Michael Rudgyard, Streamline Computing Ltd

  2. Streamline Computing Ltd
     • Spin-out of Warwick (& Oxford) University
     • Specialising in distributed (technical) computing
     • Cluster and GRID computing technology
     • 14 employees & growing; focussed expertise in:
       • Scientific Computing
       • Computer systems and support
     • Presently 5 PhDs in HPC and Parallel Computation
     • Expect growth to 20+ people in 2003

  3. Strategy
     • Establish an HPC systems integration company…
     • …but re-invest profits into software
       • Exploiting IP and significant expertise
     • First software product released
     • Two more products in prototype stage
     • Two complementary ‘businesses’
       • Both high growth

  4. Track Record (2001 to date)
     • Installations include:
       • Largest Sun HPC cluster in Europe (176 processors)
       • Largest Sun / Myrinet cluster in the UK (128 processors)
       • AMD, Intel and Sun clusters at 21 UK universities
     • Commercial clients include Akzo Nobel, Fujitsu, McLaren F1, Rolls-Royce, Schlumberger, Texaco…
     • Delivered a 264-processor Intel/Myrinet cluster:
       • 1.3 Tflop/s peak!
       • Forms part of the White Rose Computational Grid

  5. Streamline and Grid Computing
     • Pre-configured ‘grid’-enabled systems:
       • Clusters and farms
       • The SCore parallel environment
       • Virtual ‘desktop’ clusters
     • Grid-enabled software products:
       • The Distributed Debugging Tool
       • Large-scale distributed graphics
       • Scalable, intelligent & fault-tolerant parallel computing

  6. ‘Grid’-enabled turnkey clusters
     • Choice of DRMs and schedulers:
       • (Sun) GridEngine
       • PBS / PBS Pro
       • LSF / ClusterTools
       • Condor
       • Maui Scheduler
     • Globus 2.x gatekeeper (Globus 3?)
     • Customised access portal

  7. The SCore parallel environment
     • Developed by the Real World Computing Partnership in Japan (www.pccluster.org)
     • Unique features that are unavailable in most parallel environments:
       • Low-latency, high-bandwidth MPI drivers
       • Network transparency: Ethernet, Gigabit Ethernet and Myrinet (minimal example below)
       • Multi-user time-sharing (gang scheduling)
       • O/S-level checkpointing and failover
       • Integration with PBS and SGE
       • MPICH-G port
       • Cluster management functionality
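The network transparency point is easy to picture: the same MPI source is compiled once, and the environment binds it to Ethernet, Gigabit Ethernet or Myrinet at launch time. A minimal C/MPI sketch; note that nothing in the code names the interconnect:

```c
/* Minimal MPI program: the source is identical whichever
 * interconnect the parallel environment selects at launch. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Process %d of %d reporting\n", rank, size);
    MPI_Finalize();
    return 0;
}
```

Under SCore such a binary would typically be launched through the environment's own launcher (scrun) rather than a network-specific mpirun.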

  8. ‘Desktop’ Clusters
     • Linux workstation strategy
     • Integrated software stack for HPTC (compilers, tools & libraries), cf. UNIX workstations
     • Aim to provide a GRID at point of sale:
       • Single point of administration for several machines
       • Files served from the front-end
       • Resource management
       • Globus-enabled
       • Portal
     • A cluster with monitors!

  9. The Distributed Debugging Tool
     • A debugger for distributed parallel applications
     • Launched at Supercomputing 2002
     • Aim is to be the de facto HPC debugging tool
       • Linux ports for GNU, Absoft, Intel and PGI
       • IA64 and Solaris ports; AIX and HP-UX soon…
       • Commodity pricing structure!
     • Existing architecture lends itself to the GRID:
       • Thin-client GUI + XML middleware + back-end
     • Expect a GRID-enabled version in 2003

  10. Distributed Graphics Software
      • Aims:
        • To enable very large models to be viewed and manipulated using commodity clusters
        • Visualisation on a (local or remote) graphics client
      • Technology:
        • Sophisticated data-partitioning and parallel I/O tools
        • Compression using distributed model simplification
        • Parallel (real-time) rendering
      • To be GRID-enabled within the e-Science ‘Gviz’ project

  11. Parallel Compiler and Tools Strategy
      • Aim to invest in new computing paradigms
      • Developing parallel applications is far from trivial:
        • OpenMP does not marry well with cluster architectures
        • MPI is too low-level (see the sketch below)
        • Few skills in the marketplace!
      • Yet growth of MPPs is exponential…
      • Most existing applications are not GRID-friendly:
        • Number of processors fixed
        • No fault tolerance
        • Little interaction with the DRM
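To ground the "MPI is too low-level" bullet: even a one-dimensional halo exchange, about the simplest communication pattern in a parallel solver, obliges the programmer to manage ghost cells, neighbour ranks and message pairing by hand. A sketch in C; the array size and tags are arbitrary:

```c
/* 1-D halo exchange: the programmer tracks ghost cells,
 * neighbour ranks and message pairing explicitly, which is
 * the sense in which MPI is low-level.                     */
#include <mpi.h>

#define N 1024                    /* local block size (illustrative) */

int main(int argc, char **argv)
{
    double u[N + 2];              /* local block plus two ghost cells */
    int rank, size, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (i = 0; i < N + 2; i++)
        u[i] = (double)rank;      /* dummy initial data */

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* send first owned cell left, receive the right neighbour's
     * into the right ghost cell; then the mirror-image exchange */
    MPI_Sendrecv(&u[1],     1, MPI_DOUBLE, left,  0,
                 &u[N + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[N],     1, MPI_DOUBLE, right, 1,
                 &u[0],     1, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```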

  12. DRM for Parallel Computation
      • Throughput of parallel jobs is limited by:
        • Static submission model: ‘mpirun -np …’ (see the sketch below)
        • Static execution model: number of processors fixed
        • Scalability; many jobs use too many processors!
        • Job starvation
      • Available tools can only solve some issues:
        • Advance reservation and back-fill (e.g. Maui)
        • Multi-user time-sharing (gang scheduling)
      • The application itself must take responsibility!
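A hedged sketch of the alternative to the static 'mpirun -np' model: the launch logic asks the DRM how many slots were actually granted instead of baking a count into the script. GridEngine, for example, exports NSLOTS into the job environment; other DRMs use different variable names.

```c
/* Sketch: derive the process count from the scheduler rather
 * than hard-coding it.  NSLOTS is the variable GridEngine sets
 * for parallel jobs; the name differs per DRM.               */
#include <stdio.h>
#include <stdlib.h>

int granted_slots(void)
{
    const char *s = getenv("NSLOTS");   /* set by GridEngine */
    return s ? atoi(s) : 1;             /* fall back to serial */
}

int main(void)
{
    printf("Scheduler granted %d slots\n", granted_slots());
    return 0;
}
```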

  13. Dynamic Job Submission
      • The job scheduler should decide the available processor resource!
      • The application then requires:
        • In-built partitioning / data management (sketched below)
        • An appropriate parallel I/O model
        • Hooks into the DRM
      • The DRM requires:
        • Typical memory and processor requirements
        • LOS information
        • Hooks into the application
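A minimal sketch of the "in-built partitioning" requirement: the application derives its data decomposition from whatever processor count the scheduler decided, so the same binary runs correctly on any allocation. The global problem size here is illustrative.

```c
/* Sketch of in-built partitioning: each process computes the
 * block it owns from whatever size the DRM granted, so no
 * processor count is baked into the source.                  */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const long n = 1000000;   /* global problem size (illustrative) */
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* even block decomposition, remainder spread over low ranks */
    long base = n / size, rem = n % size;
    long lo   = rank * base + (rank < rem ? rank : rem);
    long len  = base + (rank < rem ? 1 : 0);

    printf("rank %d owns [%ld, %ld)\n", rank, lo, lo + len);
    MPI_Finalize();
    return 0;
}
```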

  14. Dynamic Parallel Execution
      • Additional resources may become available, or be required by other applications, during execution…
      • Ideal situation:
        • DRM informs the application
        • Application dynamically re-partitions itself (see the sketch below)
      • Other issues:
        • DRM requires knowledge of the application (the benefit of data redistribution must outweigh the cost!)
        • Frequency of dynamic scheduling
        • Message passing must have dynamic capabilities
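MPI-2 dynamic process management gives message passing some of the dynamic capability the last bullet asks for. A sketch, assuming the DRM has somehow signalled that four extra processors are available; that signalling path and the worker executable name are assumptions, since MPI only standardises the spawn itself:

```c
/* Sketch: growing a job with MPI-2 dynamic process management.
 * How the DRM tells the application that extra processors are
 * available is assumed; "worker" is a hypothetical executable. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm workers;
    int errcodes[4];

    MPI_Init(&argc, &argv);

    /* suppose the DRM has offered 4 more processors: spawn
     * workers there, then re-partition data over the union  */
    MPI_Comm_spawn("worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &workers, errcodes);

    /* ... redistribute data across the enlarged job ... */

    MPI_Finalize();
    return 0;
}
```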

  15. The Intelligent Parallel Application
      • Optimal scheduling requires more information:
        • How well the application scales
        • Peak and average memory requirements
        • Application performance vs. architecture
      • The application ‘cookie’ concept (sketched below):
        • The application (and/or DRM) should gather information about its own capabilities
        • The DRM can then limit the number of available processors
      • Ideally requires hooks into the programming paradigm…
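The "cookie" is the speaker's own concept rather than an existing interface, so the following is purely an assumed shape: a small record of observed scaling and memory behaviour that the application appends after each run, and that a DRM could later consult when limiting processor counts. The file name and fields are invented for illustration.

```c
/* Hypothetical 'application cookie': a record of observed
 * behaviour appended after each run for the DRM to consult.
 * The file name and fields are assumptions, not a standard. */
#include <stdio.h>

struct app_cookie {
    int    nprocs;        /* processors used on this run */
    double elapsed_s;     /* wall-clock time              */
    double peak_mem_mb;   /* peak memory per process      */
};

static void record_cookie(const struct app_cookie *c)
{
    FILE *f = fopen("app.cookie", "a");
    if (!f) return;
    fprintf(f, "%d %.1f %.1f\n", c->nprocs, c->elapsed_s, c->peak_mem_mb);
    fclose(f);
}

int main(void)
{
    struct app_cookie c = { 32, 840.0, 512.0 };  /* example figures */
    record_cookie(&c);
    return 0;
}
```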

  16. Fault Tolerance
      • On large MPPs, processors and components will fail!
      • Applications need fault tolerance:
        • Checkpointing + RAID-like redundancy (cf. SCore)
        • Dynamic re-partitioning capabilities
        • Interaction with the DRM
        • Transparency from the user’s perspective
      • Fault tolerance relies on many of the capabilities described above… (a minimal sketch follows)
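SCore provides checkpointing at O/S level; by way of contrast, a minimal sketch of application-level checkpointing, where the solver itself periodically dumps restartable state so a failed run can resume from the last checkpoint rather than from scratch. The file layout is illustrative only.

```c
/* Minimal application-level checkpoint: dump restartable state
 * every k iterations so a failed run can resume rather than
 * restart.  The file format here is illustrative only.        */
#include <stdio.h>
#include <string.h>

#define NCELLS 1024

static void checkpoint(int step, const double *state, int n)
{
    FILE *f = fopen("restart.chk", "wb");
    if (!f) return;
    fwrite(&step, sizeof step, 1, f);
    fwrite(state, sizeof *state, n, f);
    fclose(f);
}

int main(void)
{
    double state[NCELLS];
    memset(state, 0, sizeof state);

    for (int step = 1; step <= 1000; step++) {
        /* ... advance the solution ... */
        if (step % 100 == 0)
            checkpoint(step, state, NCELLS);
    }
    return 0;
}
```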

  17. Conclusions
      • Commitment to near-term GRID objectives:
        • Turnkey clusters, farms and storage installations
        • Ongoing development of ‘GRID-enabled’ tools
        • Driven by existing commercial opportunities…
      • ‘Blue-sky’ project for next-generation applications:
        • Exploits existing IP and an advanced prototype
        • Expect moderate income from focussed exploitation
      • Strategic positioning: existing paradigms will ultimately be a barrier to the success of (V-)MPP computers / clusters!
