Introduction to High Performance Computing. Jon Johansson Academic ICT University of Alberta. Agenda. What is High Performance Computing? What is a “supercomputer”? is it a mainframe? Supercomputer architectures Who has the fastest computers? Speedup Programming for parallel computing
Introduction to High Performance Computing Jon Johansson Academic ICT University of Alberta
Agenda • What is High Performance Computing? • What is a “supercomputer”? • is it a mainframe? • Supercomputer architectures • Who has the fastest computers? • Speedup • Programming for parallel computing • The GRID??
High Performance Computing • HPC is the field that concentrates on developing supercomputers and software to run on supercomputers • a main area of this discipline is developing parallel processing algorithms and software • programs that can be divided into little pieces so that each piece can be executed simultaneously by separate processors
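The idea of dividing a program into pieces that separate processors can execute can be sketched in a few lines. This is a minimal illustration, not a production pattern: the function names are hypothetical, and worker threads stand in for processors (in standard CPython, threads do not truly run this arithmetic simultaneously, so the sketch shows the decomposition rather than a real speedup).

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_pieces(data, n_pieces):
    """Cut a list into n_pieces contiguous chunks of nearly equal size."""
    size, extra = divmod(len(data), n_pieces)
    pieces, start = [], 0
    for i in range(n_pieces):
        # The first `extra` pieces take one additional element each.
        end = start + size + (1 if i < extra else 0)
        pieces.append(data[start:end])
        start = end
    return pieces


def sum_of_squares(piece):
    """The work done independently on one piece."""
    return sum(x * x for x in piece)


def parallel_sum_of_squares(data, n_workers=4):
    """Hand one piece to each worker, then combine the partial results."""
    pieces = split_into_pieces(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum_of_squares, pieces))
    return sum(partial_sums)
```

The pieces share no data, so the only coordination needed is the final sum of the partial results.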
High Performance Computing • HPC is about “big problems”, i.e. problems that need: • lots of memory • many cpu cycles • big hard drives • no matter what field you work in, perhaps your research would benefit from making problems “larger” • 2d → 3d • finer mesh • increase number of elements in the simulation
Grand Challenges • weather forecasting • economic modeling • computer-aided design • drug design • exploring the origins of the universe • searching for extra-terrestrial life • computer vision • nuclear power and weapons simulations
Grand Challenges – Protein • To simulate the folding of a 300 amino acid protein in water: • # of atoms: ~ 32,000 • folding time: 1 millisecond • # of FLOPs: 3 × 10²² • Machine Speed: 1 PetaFLOP/s • Simulation Time: 1 year (Source: IBM Blue Gene Project) • Ken Dill and Kit Lau’s protein folding model • IBM’s answer: The Blue Gene Project • US$ 100 M of funding to build a 1 PetaFLOP/s computer • Charles L Brooks III, Scripps Research Institute
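The one-year figure follows directly from the operation count and the machine speed quoted on the slide. A quick check, assuming the count is 3 × 10²² floating point operations and a sustained rate of exactly 1 PetaFLOP/s (10¹⁵ operations per second); the function name is illustrative only:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 seconds, ignoring leap years


def simulation_time_years(total_flops, machine_flops_per_second):
    """Run time in years if the machine sustains the given rate throughout."""
    seconds = total_flops / machine_flops_per_second
    return seconds / SECONDS_PER_YEAR


# 3e22 operations at 1e15 operations per second is 3e7 seconds.
years = simulation_time_years(3e22, 1e15)
print(f"{years:.2f} years")
```

The result is a little under one year, which matches the rounded figure on the slide.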
Grand Challenges - Nuclear • National Nuclear Security Administration • http://www.nnsa.doe.gov/ • use supercomputers to run three-dimensional codes to simulate instead of test • address critical problems of materials aging • simulate the environment of the weapon and try to gauge whether the device continues to be usable • stockpile science, molecular dynamics and turbulence calculations http://archive.greenpeace.org/comms/nukes/fig05.gif
Grand Challenges - Nuclear ASCI White • March 7, 2002: first full-system three-dimensional simulations of a nuclear weapon explosion • simulation used more than 480 million cells (roughly a 780x780x780 grid, if the grid is a cube) • 1,920 processors on IBM ASCI White at the Lawrence Livermore National Laboratory • 2,931 wall-clock hours or 122.5 days • 6.6 million CPU hours Test shot “Badger” Nevada Test Site – Apr. 1953 Yield: 23 kilotons http://nuclearweaponarchive.org/Usa/Tests/Upshotk.html
Grand Challenges - Nuclear • Advanced Simulation and Computing Program (ASC) • http://www.llnl.gov/asc/asc_history/asci_mission.html
Agenda • What is High Performance Computing? • What is a “supercomputer”? • is it a mainframe? • Supercomputer architectures • Who has the fastest computers? • Speedup • Programming for parallel computing • The GRID??
What is a “Mainframe”? • large and reasonably fast machines • the speed isn't the most important characteristic • high-quality internal engineering and resulting proven reliability • expensive but high-quality technical support • top-notch security • strict backward compatibility for older software
What is a “Mainframe”? • these machines can, and do, run successfully for years without interruption (long uptimes) • repairs can take place while the mainframe continues to run • the machines are robust and dependable • IBM coined a term to advertise the robustness of their mainframe computers: • Reliability, Availability and Serviceability (RAS)
What is a “Mainframe”? • Introducing IBM System z9 109 • Designed for the On Demand Business • IBM is delivering a holistic approach to systems design • Designed and optimized with a total systems approach • Helps keep your applications running with enhanced protection against planned and unplanned outages • Extended security capabilities for even greater protection capabilities • Increased capacity with more available engines per server
What is a Supercomputer?? • at any point in time the term “Supercomputer” refers to the fastest machines currently available • a supercomputer this year might be a mainframe in a couple of years • a supercomputer is typically used for scientific and engineering applications that must do a great amount of computation
What is a Supercomputer?? • the most significant difference between a supercomputer and a mainframe: • a supercomputer channels all its power into executing a few programs as fast as possible • if the system crashes, restart the job(s) – no great harm done • a mainframe uses its power to execute many programs simultaneously • e.g. – a banking system • must run reliably for extended periods
What is a Supercomputer?? • to see the world’s “fastest” computers look at • http://www.top500.org/ • measure performance with the Linpack benchmark • http://www.top500.org/lists/linpack.php • solve a dense system of linear equations • the performance numbers give a good indication of peak performance
What is a Supercomputer?? • count the number of “floating point operations” required to solve the problem • + - x / • results of the benchmark are so many Floating point Operations Per Second (FLOPS) • a supercomputer is a machine that can provide a very large number of FLOPS
Floating Point Operations • multiply 2 1000x1000 matrices • for each resulting array element • 1000 multiplies • 999 adds • do this 1,000,000 times • ~2 × 10⁹ operations needed • increasing the array size N increases the number of operations as O(N³)
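The operation count for the matrix multiply can be checked by counting. The sketch below uses hypothetical helper names: one function applies the per-element tally from the slide (N multiplies and N - 1 adds for each of the N × N result elements), and a naive multiply counts its own operations on a small case to confirm the formula.

```python
def matmul_flops(n):
    """Floating point operations for a naive n x n matrix multiply."""
    multiplies = n * n * n        # n multiplies per result element
    adds = n * n * (n - 1)        # n - 1 adds per result element
    return multiplies + adds


def counted_matmul(a, b):
    """Naive multiply of two square matrices that also counts its operations."""
    n = len(a)
    ops = 0
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            total = a[i][0] * b[0][j]          # first product: 1 multiply
            ops += 1
            for k in range(1, n):
                total += a[i][k] * b[k][j]     # 1 multiply and 1 add
                ops += 2
            c[i][j] = total
    return c, ops


print(matmul_flops(1000))   # 1,999,000,000, on the order of 10**9
print(matmul_flops(2000))   # roughly 8 times larger, the O(N**3) growth
```

Doubling N multiplies the work by about eight, which is why enlarging a problem quickly exhausts a single processor.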
Agenda • What is High Performance Computing? • What is a “supercomputer”? • is it a mainframe? • Supercomputer architectures • Who has the fastest computers? • Speedup • Programming for parallel computing • The GRID??
High Performance Computing • supercomputers use many CPUs to do the work • note that all supercomputing architectures have • processors and some combination of cache • some form of memory and IO • the processors are separated from the other processors by some distance • there are major differences in the way that the parts are connected • some problems fit into different architectures better than others
High Performance Computing • increasing computing power available to researchers allows • increasing problem dimensions • adding more particles to a system • increasing the accuracy of the result • improving experiment turnaround time
Flynn’s Taxonomy • Michael J. Flynn (1972) • classified computer architectures based on the number of concurrent instructions and data streams available • single instruction, single data (SISD) – basic old PC • multiple instruction, single data (MISD) – redundant systems • single instruction, multiple data (SIMD) – vector (or array) processor • multiple instruction, multiple data (MIMD) – shared or distributed memory systems: symmetric multiprocessors and clusters • common extension: • single program (or process), multiple data (SPMD)
Architectures • we can also classify supercomputers according to how the processors and memory are connected • couple processors to a single large memory address space • couple computers, each with its own memory address space
Architectures • Symmetric Multiprocessing (SMP) • Uniform Memory Access (UMA) • multiple CPUs, residing in one cabinet, share the same memory • processors and memory are tightly coupled • the processors share memory and the I/O bus or data path
Architectures • SMP • a single copy of the operating system is in charge of all the processors • SMP systems range from two to as many as 32 or more processors
Architectures • SMP • "capability computing" • one CPU can use all the memory • all the CPUs can work on a little memory • whatever you need
Architectures • UMA-SMP negatives • as the number of CPUs gets large the buses become saturated • long wires cause latency problems
Architectures • Non-Uniform Memory Access (NUMA) • NUMA is similar to SMP - multiple CPUs share a single memory space • hardware support for shared memory • memory is separated into close and distant banks • basically a cluster of SMPs • memory on the same processor board as the CPU (local memory) is accessed faster than memory on other processor boards (shared memory) • hence "non-uniform" • NUMA architecture scales much better to higher numbers of CPUs than SMP
Architectures • photos: University of Alberta SGI Origin; SGI NUMA cables
Architectures • Cache Coherent NUMA (ccNUMA) • each CPU has an associated cache • ccNUMA machines use special-purpose hardware to maintain cache coherence • typically done by using inter-processor communication between cache controllers to keep a consistent memory image when the same memory location is stored in more than one cache • ccNUMA performs poorly when multiple processors attempt to access the same memory area in rapid succession
Architectures • Distributed Memory Multiprocessor (DMMP) • each computer has its own memory address space • looks like NUMA but there is no hardware support for remote memory access • the special purpose switched network is replaced by a general purpose network such as Ethernet or more specialized interconnects: • Infiniband • Myrinet • Lattice: Calgary’s HP ES40 and ES45 cluster – each node has 4 processors
Architectures • Massively Parallel Processing (MPP) Cluster of commodity PCs • processors and memory are loosely coupled • "capacity computing" • each CPU contains its own memory and copy of the operating system and application. • each subsystem communicates with the others via a high-speed interconnect. • in order to use MPP effectively, a problem must be breakable into pieces that can all be solved simultaneously
Architectures • lots of “how to build a cluster” tutorials on the web – just Google: • http://www.beowulf.org/ • http://www.cacr.caltech.edu/beowulf/tutorial/building.html
Architectures • Vector Processor or Array Processor • a CPU design that is able to run mathematical operations on multiple data elements simultaneously • a scalar processor operates on data elements one at a time • vector processors formed the basis of most supercomputers through the 1980s and into the 1990s • “pipeline” the data
Architectures • Vector Processor or Array Processor • operate on many pieces of data simultaneously • consider the following add instruction: • C = A + B • on both scalar and vector machines this means: • add the contents of A to the contents of B and put the sum in C • on a scalar machine the operands are numbers • on a vector machine the operands are vectors and the instruction directs the machine to compute the pair-wise sum of each pair of vector elements
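The difference in meaning of C = A + B on the two kinds of machine can be modelled in software. This is only a sketch of the semantics with hypothetical function names: Python executes every addition below one element at a time, so it does not reproduce the hardware behaviour, only what each instruction is asked to compute.

```python
def scalar_add(a, b):
    """Scalar machine: the operands of one add instruction are single numbers."""
    return a + b


def vector_add(a, b):
    """Vector machine: one add instruction yields the pair-wise sum of two vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [x + y for x, y in zip(a, b)]


def vector_add_on_scalar_machine(a, b):
    """On a scalar machine the same result needs one add instruction per element."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    c = []
    for i in range(len(a)):
        c.append(scalar_add(a[i], b[i]))
    return c
```

Both routes give the same C; the vector machine simply expresses the whole loop as a single instruction that its hardware can pipeline.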
Architectures • University of Victoria has 4 NEC SX-6/8A vector processors • in the School of Earth and Ocean Sciences • each has 32 GB of RAM • 8 vector processors in the box • peak performance is 72 GFLOPS
Agenda • What is High Performance Computing? • What is a “supercomputer”? • is it a mainframe? • Supercomputer architectures • Who has the fastest computers? • Speedup • Programming for parallel computing • The GRID??
BlueGene/L • The fastest on the 28th (Nov. 2006) TOP500 list: • http://www.top500.org/ • installed at the Lawrence Livermore National Laboratory (LLNL) (US Department of Energy) • Livermore, California
http://www.llnl.gov/asc/platforms/bluegenel/photogallery.html
BlueGene/L • processors: 131072 • memory: 32 TB • 64 racks – each has 2048 processors and 512 GB of RAM (256 MB/processor) • a Linpack performance of 280.6 TFlop/s • in Nov 2005 it was the only system ever to exceed the 100 TFlop/s mark • there are now 2 machines over 100 TFlop/s
Agenda • What is High Performance Computing? • What is a “supercomputer”? • is it a mainframe? • Supercomputer architectures • Who has the fastest computers? • Speedup • Programming for parallel computing • The GRID??
Speedup • how can we measure how much faster our program runs when using more than one processor? • define Speedup S as the ratio of 2 program execution times at constant problem size: S = T1 / TP • T1 is the execution time for the problem on a single processor (use the “best” serial time) • TP is the execution time for the problem on P processors
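Reading the definition as S = T1 / TP, with both times measured at the same problem size, the calculation is a single division. The timings in the example are made up for illustration and are not measurements from the talk.

```python
def speedup(t1, tp):
    """Speedup S = T1 / TP for the same problem size.

    t1: best serial execution time; tp: execution time on P processors.
    Both must be in the same units.
    """
    if t1 <= 0 or tp <= 0:
        raise ValueError("execution times must be positive")
    return t1 / tp


# Hypothetical timings: 100 s on one processor, 12.5 s on eight processors.
print(speedup(100.0, 12.5))
```

With these invented numbers the speedup equals the processor count, which is the linear case.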
Speedup • Linear speedup • the time to execute the problem decreases in proportion to the number of processors • if a job requires 1 week with 1 processor it will take less than 10 minutes with 1024 processors
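The one-week example can be verified with a short calculation, assuming ideal linear speedup so that the time on P processors is simply T1 / P. The function name is illustrative.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080 minutes


def linear_speedup_time(t1, p):
    """Execution time on p processors under ideal linear speedup."""
    return t1 / p


minutes = linear_speedup_time(MINUTES_PER_WEEK, 1024)
print(f"{minutes:.2f} minutes")
```

This comes to a little under ten minutes, as the slide states.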
Speedup • Sublinear speedup • the usual case • there are generally some limitations to the amount of speedup that you get • communication
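One way to see how communication limits speedup is a toy cost model. Everything here is an assumption made for illustration and is not taken from the talk: the work divides perfectly across P processors, and a communication cost grows linearly with the number of additional processors.

```python
def toy_execution_time(t1, p, comm_cost):
    """Hypothetical model: divided work plus communication that grows with p."""
    return t1 / p + comm_cost * (p - 1)


def toy_speedup(t1, p, comm_cost):
    """Speedup T1 / TP under the toy model."""
    return t1 / toy_execution_time(t1, p, comm_cost)


def best_processor_count(t1, comm_cost, max_p):
    """Processor count in 1..max_p that gives the largest toy speedup."""
    return max(range(1, max_p + 1), key=lambda p: toy_speedup(t1, p, comm_cost))


# Invented numbers: 1000 s of serial work, 1 s of communication per extra processor.
for p in (1, 8, 32, 64):
    print(p, round(toy_speedup(1000.0, p, 1.0), 2))
```

In this model the speedup always stays below P, and past a certain processor count the added communication outweighs the divided work, so the speedup starts to fall.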
Speedup • Superlinear speedup • very rare • memory access patterns may allow this for some algorithms
Speedup • why do a speedup test? • it’s hard to tell how a program will behave • e.g. • “Strange” is actually fairly common behaviour for un-tuned code • in this case: • linear speedup to ~10 cpus • after 24 cpus speedup is starting to decrease