Scaling Up: Teraflop to Petaflop Performance


Presentation Transcript


  1. Scaling Up: Teraflop to Petaflop Performance. SDSC Summer Institute 2006. Robert Harkness, SDSC

  2. Reality Check • Top500 is about politics, not productivity • HPC Challenge is a better measure, but narrow • Industry driven by mass marketing, not HPC • Cost of ownership (per peak flop) drives hardware designs that are sub-optimal for HPC • Gap between peak and sustained growing exponentially • Continual increase in code complexity to compensate • Who measures the cost in productivity? • Performance on your application is what matters • Scientific results are the only measure of success

  3. Challenges • Processor speed falls below Moore’s law • Memory speed and cpu speed still diverging • Power, cooling and physical size • Reliability – HW & SW MTBF • Lack of HW investment forces MPP • MPP incurs overhead and high programmer load • Inherent limits to scaling & load balancing • Overcoming latency at every level • Operational model

  4. The end of Moore’s Law? [Chart comparing the Moore’s Law trend with CPU, MEM and DISK performance curves]

  5. Easy or Hard? • Easy problems? • Embarrassingly Parallel • High degree of locality • Nearest neighbor communication only • Hard problems? • Wide range of physical scales • Highly non-local communication • Multi-physics • Long relaxation time, long dynamical times • Intrinsically serial processes

  6. Limits to Scalability • Physics • Long-range interaction requiring global communication • Local nonlinear effects leading to load imbalance • Separation of time scales • Relaxation over many dynamical times can limit usable parallelism for domain decomposition • Computation • Correctness, validation & verification • I/O • Scheduling • Cost

  7. Full development cycle • Mathematical statement and decomposition • Cost analysis for practical problem size • Coding • Debugging • Production • Post-processing and data management • Archival storage

  8. Reaching 1 TFlop • How do you reach 1 TFlop today? • Net efficiency 10% => 10 TF system @ full scale • 2000 processors @ 5 GFlop peak each • 2000 MPI tasks or threads • O/S redundancy, replication overhead • Only DataStar and BigBen sustain 1 TFlop • Most users still in 1-100 Gflop range

  9. Reaching 1 Petaflop • 1 Pflop sustained requires 2-20 PFlop peak • Micro cores may reach 5 GHz, 4 ops/cycle • Memory bandwidth starved in many cases • Efficiency likely less than today – say 5% • At 20 GFlop/core > 1 MILLION PROCESSORS • Custom processors could exceed 50% efficiency • FPGAs may be 100x faster than micros • Algorithms in hardware

  10. [Chart comparing Earth Simulator, C90, DataStar and LLNL BG/L]

  11. With Apologies to Jack Worlton: Belief in Petaflop Computing [Diagram of belief categories: Atheists (+R,-P), Heretics (+R,P), Believers (+R,+P), True Believers (R,+P), (R,P) ?, (-R,P), Fanatics (-R,+P), Luddites (-R,-P)]

  12. Future HW Directions • Vector registers and functional units • MTA with large number of contexts • PIM for locality with SIMD-like economy • FPGAs and Accelerators • Superconductors, non-electronic devices • Carbon nanotubes • Spintronics • Optical storage

  13. A Petaflop for the Rest of Us • Investment in hardware and software • Reduce the burden on the programmer! • Real improvements in specific performance require $$ • New languages may help, but adoption will be slow and very risky for users and vendors • Keep complexity at a manageable level • Design for the future

  14. Factors Limiting Useful Parallelism • Latency • Load Imbalance • Synchronization overhead • Task & thread management • Competition for shared resources (bandwidth) • Parallelism at differing scales • Empty pipes and other no-ops

  15. Latency is Enemy #1 • Pipeline latency ~10 cp (cp = clock periods) • Cache latency ~1-10 cp • Memory latency ~100-1000+ cp • Switch latency ~10,000-100,000+ cp • Software latency ~10,000-100,000+ cp • Speed of light :-( It's too slow! • Across the machine room ~1000 cp • Across the country ~100 ms ~10^8-10^9 cp • Latency is getting worse!
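
One way to see these numbers on a real machine is a simple ping-pong test. The sketch below is a minimal illustration, not taken from the slides; the repetition count is an arbitrary choice. It times empty round trips between two MPI tasks with MPI_Wtime to estimate the switch-plus-software latency, and must be run with at least two tasks.

      PROGRAM pingpong
      USE mpi
      IMPLICIT NONE
      INTEGER, PARAMETER :: nrep = 1000
      INTEGER :: ierr, rank, i, dummy
      INTEGER :: status(MPI_STATUS_SIZE)
      DOUBLE PRECISION :: t0, t1

      CALL MPI_Init(ierr)
      CALL MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

      dummy = 0
      CALL MPI_Barrier(MPI_COMM_WORLD, ierr)
      t0 = MPI_Wtime()
      DO i = 1, nrep
        IF (rank == 0) THEN
          CALL MPI_Send(dummy, 1, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, ierr)
          CALL MPI_Recv(dummy, 1, MPI_INTEGER, 1, 0, MPI_COMM_WORLD, status, ierr)
        ELSE IF (rank == 1) THEN
          CALL MPI_Recv(dummy, 1, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, status, ierr)
          CALL MPI_Send(dummy, 1, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, ierr)
        END IF
      END DO
      t1 = MPI_Wtime()

      ! Half of the average round-trip time approximates the one-way latency
      IF (rank == 0) PRINT *, 'One-way latency (s):', (t1 - t0) / (2.0d0 * nrep)

      CALL MPI_Finalize(ierr)
      END PROGRAM pingpong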

  16. Latency Function

  17. Bandwidth is Enemy #2 • Excellent bandwidth within a microprocessor • Immediate loss going to external cache • Enormous on-chip L3 is possible now (~2B transistors!) • Irregular memory access patterns are incompatible with current microprocessor caches, which are designed for streaming • Typically huge loss in going to memory • Memory is large and cheap but slow • Multi-cores share bandwidth – less per processor • Only custom hardware exposes parallelism in memory • Only custom hardware addresses irregular access • Network & I/O bandwidth usually least of all…

  18. Bandwidth-2 • Network bandwidth 10 Gbit/s – 40 Gbit/s • Requires many cpus to drive at that rate • Still not adequate for large computational processor count • Reasonable for file transfers • Achievable bandwidth depends on software • What will happen if networks get busier?

  19. Strong Scaling • Fixed problem size, increasing cpu count N • Serial fraction f, maximum speedup 1/f • Parallel fraction (1-f) is not independent of N
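
For reference, these bullets are Amdahl's law: with serial fraction f the speedup on N cpus is S(N) = 1 / (f + (1 - f)/N), which saturates at 1/f. A minimal sketch tabulating this (the 1% serial fraction is just an example value):

      PROGRAM amdahl
      IMPLICIT NONE
      DOUBLE PRECISION, PARAMETER :: f = 0.01d0   ! assumed serial fraction (example only)
      INTEGER :: n

      ! Speedup approaches 1/f = 100 no matter how many cpus are added
      DO n = 1, 16384
        IF (IAND(n, n - 1) == 0) THEN             ! report powers of two only
          PRINT '(A,I6,A,F8.2)', ' N =', n, '   speedup =', 1.0d0 / (f + (1.0d0 - f) / n)
        END IF
      END DO
      END PROGRAM amdahl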

  20. Weak Scaling • Scale problem size with number of processors N • Limited by resources • Scaled speedup (Barsis): if s’ is the serial time and p’ the parallel time on the parallel system, a serial process would take s’ + p’N, giving a scaled speedup of S(N) = (s’ + p’N)/(s’ + p’) = N - (N - 1)s’ when times are normalized so that s’ + p’ = 1

  21. Weak or Strong? • The truth lies in between • Serial fraction may not be independent of problem size • “The Corollary of Modest Potential” (Snyder) • Real resource constraints or policy may limit weak scaling in practice • How big an allocation can one get? • Can a calculation be finished within one resource (allocation) cycle? • Can a calculation be finished in the life of the machine?

  22. Example • Hydrodynamics on a fixed Eulerian mesh • Courant condition on timestep • Ghost cells for 3D decomposition
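
In a code like this, the Courant condition turns into a global reduction every timestep, because all tasks must advance with the same dt. A minimal sketch (the variable names and the Courant number 0.4 are illustrative, not taken from any particular code):

      SUBROUTINE courant_dt(u, cs, n, dx, dt, comm)
      USE mpi
      IMPLICIT NONE
      INTEGER, INTENT(IN) :: n, comm
      DOUBLE PRECISION, INTENT(IN)  :: u(n), cs(n), dx
      DOUBLE PRECISION, INTENT(OUT) :: dt
      DOUBLE PRECISION, PARAMETER :: cfl = 0.4d0   ! example Courant number
      DOUBLE PRECISION :: smax_local, smax_global
      INTEGER :: ierr

      ! Fastest signal speed (flow speed + sound speed) on the local sub-domain
      smax_local = MAXVAL(ABS(u) + cs)

      ! Every task must agree on dt, so take the global maximum
      CALL MPI_Allreduce(smax_local, smax_global, 1, MPI_DOUBLE_PRECISION, &
                         MPI_MAX, comm, ierr)

      dt = cfl * dx / smax_global
      END SUBROUTINE courant_dt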

  23. Cost at fixed work per cpu

  24. Work Smarter, Not Harder…? • Adaptive Mesh Refinement • Can vastly exceed capability of uniform meshes • Different scaling model – higher overheads • Different MPP model: shared memory or globally addressable memory • Strong implications for HW design

  25. ENZO Hydrodynamical Cosmology: 2048^3 mesh on 2048 processors

  26. ENZO Strong Scaling

  27. ENZO Weak Scaling

  28. Decomposition • Choose decomposition • Select a method that exposes maximum parallel content - plan for execution with >> 10,000 threads • Choose the memory model • Is shared memory required? • Is globally addressable memory required? • Choose the I/O strategy • Massively parallel I/O must be designed for at the outset • MP is 100% overhead!
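
A minimal sketch of the decomposition step itself, letting MPI factor the task count into a 3-D processor grid and build a Cartesian communicator (the periodic boundary setting here is an arbitrary assumption):

      SUBROUTINE make_decomposition(cart_comm)
      USE mpi
      IMPLICIT NONE
      INTEGER, INTENT(OUT) :: cart_comm
      INTEGER :: ierr, nprocs, dims(3)
      LOGICAL :: periodic(3)

      CALL MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

      ! Let MPI choose a balanced 3-D factorization of the task count
      dims = 0
      CALL MPI_Dims_create(nprocs, 3, dims, ierr)

      ! Cartesian communicator; neighbours then come from MPI_Cart_shift
      periodic = .TRUE.
      CALL MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periodic, .TRUE., cart_comm, ierr)
      END SUBROUTINE make_decomposition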

  29. Coding • No religion - use the best tool for the job • Stick to mainstream languages (C/C++/F95) • Strictly adhere to standards for portability • Use the minimum set of features you need • Check all possible result codes and design for error detection and recovery • Design in checkpointing and restart • Parallel I/O is essential • Fault tolerance?
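
In the spirit of “check all possible result codes and design for error detection and recovery”, even plain Fortran I/O returns a status that costs nothing to test. A minimal sketch (the unit number and file name are illustrative; a production checkpoint would of course use parallel I/O, as the slide says):

      SUBROUTINE write_checkpoint(x, n, step)
      IMPLICIT NONE
      INTEGER, INTENT(IN) :: n, step
      DOUBLE PRECISION, INTENT(IN) :: x(n)
      INTEGER, PARAMETER :: iunit = 10
      INTEGER :: ios

      OPEN(UNIT=iunit, FILE='checkpoint.dat', FORM='UNFORMATTED', &
           ACTION='WRITE', IOSTAT=ios)
      IF (ios /= 0) THEN
        PRINT *, 'checkpoint open failed, iostat =', ios
        RETURN                       ! recover or retry rather than abort blindly
      END IF

      WRITE(iunit, IOSTAT=ios) step, n, x
      IF (ios /= 0) PRINT *, 'checkpoint write failed, iostat =', ios

      CLOSE(iunit)
      END SUBROUTINE write_checkpoint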

  30. MPI • Send/Recv is not enough • Buffered, asynchronous messaging • But does your HW and/or MPI really allow it? • Aggregate messages to increase message length • Caution with derived types (holes) • Send bytes for speed (dangerous) • One-sided model: always use “get” not “put” • Beware of cache effects • Use your own communicator instance rather than COMM_WORLD
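
A minimal sketch combining the last bullet with asynchronous messaging: duplicate COMM_WORLD once at start-up so your traffic cannot collide with other libraries, then post non-blocking sends and receives and wait on them together (buffer sizes, tags and neighbour ranks are placeholders):

      ! Once, at start-up, instead of using MPI_COMM_WORLD directly:
      !   CALL MPI_Comm_dup(MPI_COMM_WORLD, mycomm, ierr)

      SUBROUTINE exchange(sendbuf, recvbuf, n, left, right, mycomm)
      USE mpi
      IMPLICIT NONE
      INTEGER, INTENT(IN) :: n, left, right, mycomm
      DOUBLE PRECISION, INTENT(IN)  :: sendbuf(n)
      DOUBLE PRECISION, INTENT(OUT) :: recvbuf(n)
      INTEGER :: req(2), ierr

      ! Post the receive before the send; actual overlap depends on the HW and MPI
      CALL MPI_Irecv(recvbuf, n, MPI_DOUBLE_PRECISION, left,  1, mycomm, req(1), ierr)
      CALL MPI_Isend(sendbuf, n, MPI_DOUBLE_PRECISION, right, 1, mycomm, req(2), ierr)
      CALL MPI_Waitall(2, req, MPI_STATUSES_IGNORE, ierr)
      END SUBROUTINE exchange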

  31. CAF and UPC • Coding for automatic messaging managed by the compiler – eliminates error-prone MP • Clean, logical approach but can lack flexibility • Exploit HW with real global addressing capability • Preserves investment in coding even with shared memory systems

  32. Co-Array Fortran • Co-Arrays to become part of standard Fortran in 2008 • Almost trivial extension to Fortran • Arrays replicated on all images • Co-size is always equal to NUM_IMAGES() • Upper bound must always be [*]

      REAL    :: X(NX)[ND,*]
      INTEGER :: II[*]
      …
      X(J)[2,3] = II[3]    ! Automatic put/get generation

  33. CAF-2 • Limitations (as of current implementation *) • Co-array can be a derived type but can never be a component of a derived type • Co-arrays cannot be assumed size • REAL :: X(*)[*] is not allowed • Co-arrays cannot be assumed-shape • REAL :: X(:)[*] is not allowed • REAL :: Y(:)[:] is not allowed • Automatic co-arrays are not supported • A significant problem for dynamic structures • But co-arrays can be allocatable and can also appear in COMMON and EQUIVALENCE

  34. CAF-3 • Explicit synchronization • CALL SYNC_IMAGES() • Image ID • Index = THIS_IMAGE() • Number of images • Image_count = NUM_IMAGES()
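
A minimal co-array sketch using the image inquiry functions from this slide. The synchronization below uses the Fortran 2008 statement form SYNC ALL so that it compiles with current compilers; the pre-standard form shown on the slide is CALL SYNC_IMAGES(). The wrap-around neighbour pattern is illustrative only.

      PROGRAM caf_demo
      IMPLICIT NONE
      INTEGER :: me, np, left
      REAL    :: edge[*]             ! one co-array element per image

      me = THIS_IMAGE()
      np = NUM_IMAGES()
      left = me - 1
      IF (left < 1) left = np        ! wrap around to the last image

      edge = REAL(me)
      SYNC ALL                       ! every image must have written edge first

      ! One-sided get from the neighbouring image
      PRINT *, 'image', me, 'sees', edge[left], 'from image', left
      END PROGRAM caf_demo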

  35. I/O - Advantages of HDF5 • Machine independent data format • No endian-ness issues • Easy control of precision • Parallel interface built on MPI I/O • High performance and very robust • Excellent logical design! • Hierarchical structure ideal for consolidation • Easy to accommodate local metadata • Useful inspection tools • h5ls • h5dump
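
A minimal serial HDF5 sketch in Fortran (the file and dataset names and the array size are arbitrary; a parallel version adds an MPI-IO file-access property list but follows the same create/write/close pattern):

      PROGRAM hdf5_write
      USE hdf5
      IMPLICIT NONE
      INTEGER(HID_T)   :: file_id, space_id, dset_id
      INTEGER(HSIZE_T) :: dims(1)
      DOUBLE PRECISION :: x(1000)
      INTEGER :: err

      x = 1.0d0
      dims(1) = 1000

      CALL h5open_f(err)
      CALL h5fcreate_f('results.h5', H5F_ACC_TRUNC_F, file_id, err)
      CALL h5screate_simple_f(1, dims, space_id, err)
      CALL h5dcreate_f(file_id, 'density', H5T_NATIVE_DOUBLE, space_id, dset_id, err)
      CALL h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, x, dims, err)
      IF (err /= 0) PRINT *, 'HDF5 write failed'   ! every call returns a status flag

      CALL h5dclose_f(dset_id, err)
      CALL h5sclose_f(space_id, err)
      CALL h5fclose_f(file_id, err)
      CALL h5close_f(err)
      END PROGRAM hdf5_write

A file written this way can be inspected directly with h5ls and h5dump, which is exactly why the slide lists them.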

  36. Network Realities • Why do it at all? • NSF allocation (2.1 million SU in 2005) is too large for one center to support • Some architectures are better suited for different parts of the computational pipeline • Central location for processing and archival storage at SDSC (IBM P690s, HPSS, SAM-QFS, SRB) • TeraGrid backbone and GridFTP make it possible…

  37. Local and Remote Resources • SDSC • IBM Power4 P655 and P690 (DataStar) • IBM BlueGene/L • NCSA • TeraGrid IA-64 Cluster (Mercury) • SGI Altix (Cobalt) • PSC • Compaq ES45 Cluster (Lemieux) • Cray XT3 (Big Ben) • LLNL • IA-64 Cluster (Thunder) • NERSC • IBM Power3 (Seaborg) • IBM Power5 (Bassi) • Linux Networx Opteron cluster (Jacquard)

  38. Network Transfer Options • GridFTP • Clumsy, but fast: 250+ MB/sec across TG • globus-url-copy, tgcp • BBftp • Easy to use, moderate speed: 90 MB/sec across Abilene • SRB • Global accessibility, complex capability, wide support • Lower performance • Can be combined with faster methods but still provide global access • HPSS • Easy to use, moderate speed • Local support only

  39. Recommendations • Maximize parallelism in all I/O operations • Use HDF5 or MPI I/O • Process results while they are on disk • Never use scp when GridFTP or BBftp is available • Containerize/tar your data before archiving it! • Use md5 checksums when you move data • Archive your code and metadata as well as your results – the overhead is minimal but you will never regret it! • Use SRB to manage all results from a project • Maximize parallelism in your work flow

  40. Debugging • Built in self-test • Levels of debug detail and verbosity • Problem test suite for accuracy and performance • Regression tests • Reasonable scale for interactive debuggers • Make use of norms for error checking • Use full error detection • Check that results are independent of task count
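
For the norm-based checks, a single global L2 norm compared against a stored reference catches many errors and also tests that results do not depend on task count. A minimal sketch (the relative tolerance is an arbitrary choice; summation order changes with the decomposition, so it cannot be too tight):

      SUBROUTINE check_norm(x, n, ref_norm, comm)
      USE mpi
      IMPLICIT NONE
      INTEGER, INTENT(IN) :: n, comm
      DOUBLE PRECISION, INTENT(IN) :: x(n), ref_norm
      DOUBLE PRECISION :: local, global
      INTEGER :: ierr, rank

      local = SUM(x * x)
      CALL MPI_Allreduce(local, global, 1, MPI_DOUBLE_PRECISION, MPI_SUM, comm, ierr)
      global = SQRT(global)

      CALL MPI_Comm_rank(comm, rank, ierr)
      ! The same norm should come out (to rounding) for any task count
      IF (rank == 0 .AND. ABS(global - ref_norm) > 1.0d-10 * ref_norm) THEN
        PRINT *, 'Regression check FAILED: norm =', global, ' expected', ref_norm
      END IF
      END SUBROUTINE check_norm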

  41. Dos and Don'ts • Master/Slave will not scale up far enough • Never serialize any part of the process • In particular, plan for parallel I/O • Instrument your code for computation, I/O and communication performance • Design for checkpointing – must be parallel! • Design for real-time visualization, monitoring and steering • Anticipate failure – check every result code, particularly with I/O and networking

  42. Dos and Don'ts-2 • Always use 64-bit address mode • Code using a flexible approach to precision • Define your own types • Use strongly typed languages • Use 32-bit floating point with caution • Beware of lack of support for 128-bit fp – especially in C • Run with arithmetic checks ON – not always the default • Will you need 64-bit integers? • MPI, HDF5, libraries all assume 32-bit integer controls
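
A minimal sketch of “code using a flexible approach to precision / define your own types”: one module chooses the kinds and the rest of the code never hard-wires 4-byte or 8-byte sizes (the particular kind selections are example values):

      MODULE my_kinds
      IMPLICIT NONE
      ! Change the whole code's precision by editing this one place
      INTEGER, PARAMETER :: RK = SELECTED_REAL_KIND(15, 307)   ! 64-bit real
      INTEGER, PARAMETER :: IK = SELECTED_INT_KIND(18)         ! 64-bit integer
      END MODULE my_kinds

      ! Usage elsewhere:
      !   USE my_kinds
      !   REAL(RK)    :: density
      !   INTEGER(IK) :: ncells       ! counts that may exceed 2**31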

  43. Scaling is not all… • Scalability is far less important than: • Correctness • Reproducible results (consider global operations) • Robust operation • Throughput (scientific output) • Computational performance • Every code declines in efficiency beyond some processor/task count - measure where! • Fewer & faster always beats more & slower

  44. Site Policy Issues • Productive Petascale systems will require a completely different approach to operations • Observatory/instrument-style operations • Planned computational campaigns • Long dedicated runs at near full scale • Dedicated support staff to ensure run-time reliability • Error recovery procedures • Hot spares • Planned data transfer/archival storage capacity • Long-term storage policy

  45. Magic Bullets • Sorry – there are no magic bullets…
