Introduction to Supercomputers, Architectures and High Performance Computing
Amit Majumdar, Scientific Computing Applications Group, SDSC
Many others: Tim Kaiser, Dmitry Pekurovsky, Mahidhar Tatineni, Ross Walker
Topics • Intro to Parallel Computing • Parallel Machines • Programming Parallel Computers • Supercomputer Centers and Rankings • SDSC Parallel Machines • Allocations on NSF Supercomputers • One Application Example – Turbulence
First Topic – Intro to Parallel Computing • What is parallel computing • Why do parallel computing • Real life scenario • Types of parallelism • Limits of parallel computing • When do you do parallel computing
What is Parallel Computing? • Consider your favorite computational application • One processor can give me results in N hours • Why not use N processors and get the results in just one hour? The concept is simple: Parallelism = applying multiple processors to a single problem
Parallel computing is computing by committee • Parallel computing: the use of multiple computers or processors working together on a common task. • Each processor works on its section of the problem • Processors are allowed to exchange information with other processors [Figure: the grid of the problem to be solved is divided into areas, one per CPU (#1 to #4); neighboring CPUs exchange boundary data in the x and y directions]
Why Do Parallel Computing? • Limits of single CPU computing • Available memory • Performance/Speed • Parallel computing allows: • Solve problems that don’t fit on a single CPU’s memory space • Solve problems that can’t be solved in a reasonable time • We can run… • Larger problems • Faster • More cases • Run simulations at finer resolution • Model physical phenomena more realistically
Parallel Computing – Real Life Scenario • Stacking or reshelving of a set of library books • Assume books are organized into shelves and shelves are grouped into bays • A single worker can do it only at a certain rate • We can speed it up by employing multiple workers • What is the best strategy? • Simple way: divide the total books equally among workers. Each worker stacks the books one at a time. Each worker must walk all over the library. • Alternate way: assign a fixed, disjoint set of bays to each worker. Each worker is assigned an equal # of books arbitrarily. Workers stack books in their own bays or pass a book to the worker responsible for the bay it belongs to.
Parallel Computing – Real Life Scenario • Parallel processing allows us to accomplish a task faster by dividing the work into a set of subtasks assigned to multiple workers. • Assigning a set of books to workers is task partitioning. Passing of books to each other is an example of communication between subtasks. • Some problems may be completely serial, e.g. digging a post hole; these are poorly suited to parallel processing. • Not all problems are equally amenable to parallel processing.
Weather Forecasting • Atmosphere is modeled by dividing it into three-dimensional regions or cells, 1 mile x 1 mile x 1 mile - about 500 x 10^6 cells. • The calculations for each cell are repeated many times to model the passage of time. • About 200 floating point operations per cell per time step, or about 10^11 floating point operations per time step • A 10 day forecast with 10 minute resolution => ~1.5 x 10^14 flop • On a machine with 100 Mflops (100 x 10^6 flop/sec) sustained performance this would take: 1.5 x 10^14 flop / 100 x 10^6 flop/sec = ~17 days • On a machine with 1.7 Tflops sustained performance it would take: 1.5 x 10^14 flop / 1.7 x 10^12 flop/sec = ~2 minutes
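As a sanity check, here is a small back-of-the-envelope sketch (not from the original slides) that reproduces the arithmetic above; the cell count, flop-per-cell figure, and machine speeds are simply the slide's numbers.

```c
#include <stdio.h>

int main(void) {
    double cells         = 500e6;          /* ~500 x 10^6 cells */
    double flop_per_cell = 200.0;          /* per cell per time step */
    double steps         = 10.0 * 24 * 6;  /* 10-day forecast, 10-minute steps */
    double total_flop    = cells * flop_per_cell * steps;  /* ~1.5 x 10^14 */

    double slow = 100e6;   /* 100 Mflop/s sustained */
    double fast = 1.7e12;  /* 1.7 Tflop/s sustained */

    printf("total work: %.2e flop\n", total_flop);
    printf("at 100 Mflop/s: %.1f days\n", total_flop / slow / 86400.0);
    printf("at 1.7 Tflop/s: %.1f minutes\n", total_flop / fast / 60.0);
    return 0;
}
```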
Other Examples • Vehicle design and dynamics • Analysis of protein structures • Human genome work • Quantum chromodynamics • Astrophysics • Earthquake wave propagation • Molecular dynamics • Climate, ocean modeling • CFD • Imaging and Rendering • Petroleum exploration • Nuclear reactor, weapon design • Database query • Ozone layer monitoring • Natural language understanding • Study of chemical phenomena • And many other scientific and industrial simulations
Types of Parallelism : Two Extremes • Data parallel • Each processor performs the same task on different data • Example - grid problems • Task parallel • Each processor performs a different task • Example - signal processing • Most applications fall somewhere on the continuum between these two extremes
Typical Data Parallel Program • Example: integrate a 2-D propagation problem, starting from a partial differential equation and its finite difference approximation [Figure: the x-y grid is divided among processing elements PE #0 through PE #7; the original slide shows the starting PDE and its finite difference approximation as equations]
Basics of Data Parallel Programming • One code will run on 2 CPUs • The program has an array of data to be operated on by the 2 CPUs, so the array is split into two parts

program:
  ...
  if CPU=a then
    low_limit=1
    upper_limit=50
  elseif CPU=b then
    low_limit=51
    upper_limit=100
  end if
  do I = low_limit, upper_limit
    work on A(I)
  end do
  ...
end program

In effect, CPU A runs the loop over A(1)..A(50) and CPU B runs it over A(51)..A(100).
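A minimal MPI version of the same idea, as an illustrative sketch only (MPI itself is introduced later in this material): each of the two ranks computes its own loop bounds and works on its half of a hypothetical array a.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    double a[100];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which CPU am I? */

    /* rank 0 plays the role of CPU A (first half of the array),
       rank 1 plays CPU B (second half); C indices are 0-based */
    int low  = (rank == 0) ?  0 :  50;
    int high = (rank == 0) ? 50 : 100;

    for (int i = low; i < high; i++)
        a[i] = 2.0 * i;                     /* "work on A(I)" */

    printf("rank %d worked on elements %d..%d\n", rank, low, high - 1);
    MPI_Finalize();
    return 0;
}
```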
Typical Task Parallel Application • Example: Signal Processing • Use one processor for each task • Can use more processors if one is overloaded [Figure: DATA flows through separate FFT, multiply, inverse FFT, and normalize tasks]
Basics of Task Parallel Programming • One code will run on 2 CPUs • Program has 2 tasks (a and b) to be done by 2 CPUs

program.f:
  ...
  initialize
  ...
  if CPU=a then
    do task a
  elseif CPU=b then
    do task b
  end if
  ...
end program

In effect, CPU A initializes and does task a, while CPU B initializes and does task b.
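The same structure as a minimal MPI sketch (again an illustration, not the slide's code): both ranks run the same executable and initialize, then branch to different tasks based on their rank.

```c
#include <mpi.h>
#include <stdio.h>

static void task_a(void) { printf("doing task a\n"); }
static void task_b(void) { printf("doing task b\n"); }

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* common initialization would go here */

    if (rank == 0)      task_a();   /* "CPU A" */
    else if (rank == 1) task_b();   /* "CPU B" */

    MPI_Finalize();
    return 0;
}
```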
How Your Problem Affects Parallelism • The nature of your problem constrains how successful parallelization can be • Consider your problem in terms of • When data is used, and how • How much computation is involved, and when • Importance of problem architectures • Perfectly parallel • Fully synchronous
Perfect Parallelism • Scenario: seismic imaging problem • Same application is run on data from many distinct physical sites • Concurrency comes from having multiple data sets processed at once • Could be done on independent machines (if data can be available) • This is the simplest style of problem • Key characteristic: calculations for each data set are independent • Could divide/replicate data into files and run as independent serial jobs • (also called “job-level parallelism”)
Fully Synchronous Parallelism • Scenario: atmospheric dynamics problem • Data models atmospheric layer; highly interdependent in horizontal layers • Same operation is applied in parallel to multiple data • Concurrency comes from handling large amounts of data at once • Key characteristic: Each operation is performed on all (or most) data • Operations/decisions depend on results of previous operations • Potential problems • Serial bottlenecks force other processors to “wait”
Limits of Parallel Computing • Theoretical Upper Limits • Amdahl’s Law • Practical Limits • Load balancing • Non-computational sections (I/O, system ops etc.) • Other Considerations • time to re-write code
Theoretical Upper Limits to Performance • All parallel programs contain: • Serial sections • Parallel sections • Serial sections – when work is duplicated or no useful work is done (waiting for others) – limit the parallel effectiveness • A lot of serial computation gives bad speedup • No serial work "allows" perfect speedup • Speedup is the ratio of the time required to run a code on one processor to the time required to run the same code on multiple (N) processors - Amdahl's Law states this formally
Amdahl's Law • Amdahl's Law places a strict limit on the speedup that can be realized by using multiple processors. • Effect of multiple processors on run time: tn = (fs + fp/N) t1 • Effect of multiple processors on speedup: S = t1/tn = 1 / (fs + fp/N) • Where • fs = serial fraction of code • fp = parallel fraction of code (fs + fp = 1) • N = number of processors • tn = time to run on N processors
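A small sketch that tabulates the speedup formula above for a few values of fp and N (the same values plotted on the next slide); nothing here is specific to any machine.

```c
#include <stdio.h>

int main(void) {
    double fp_values[] = { 1.0, 0.999, 0.99, 0.9 };

    for (int i = 0; i < 4; i++) {
        double fp = fp_values[i];
        double fs = 1.0 - fp;                  /* serial fraction */
        for (int n = 50; n <= 250; n += 50) {
            double s = 1.0 / (fs + fp / n);    /* Amdahl's Law */
            printf("fp=%.3f  N=%3d  S=%6.1f\n", fp, n, s);
        }
    }
    return 0;
}
```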
Illustration of Amdahl's Law • It takes only a small fraction of serial content in a code to degrade the parallel performance. [Figure: speedup vs. number of processors (0 to 250) for fp = 1.000, 0.999, 0.990, and 0.900]
Amdahl's Law vs. Reality • Amdahl's Law provides a theoretical upper limit on parallel speedup, assuming that there are no costs for communications. In reality, communications will result in a further degradation of performance. [Figure: speedup vs. number of processors (0 to 250) for fp = 0.99, comparing the Amdahl's Law curve with a lower "Reality" curve]
Practical Limits: Amdahl’s Law vs. Reality • In reality, Amdahl’s Law is limited by many things: • Communications • I/O • Load balancing (waiting) • Scheduling (shared processors or memory)
When do you do parallel computing • Writing effective parallel applications is difficult • Communication can limit parallel efficiency • Serial time can dominate • Load balance is important • Is it worth your time to rewrite your application? • Do the CPU requirements justify parallelization? • Will the code be used just once?
Parallelism Carries a Price Tag • Parallel programming • Involves a learning curve • Is effort-intensive • Parallel computing environments can be complex • Don’t respond to many serial debugging and tuning techniques Will the investment of your time be worth it?
Test the "Preconditions for Parallelism" • According to experienced parallel programmers: • no green: Don't even consider it • one or more red: Parallelism may cost you more than you gain • all green: You need the power of parallelism (but there are no guarantees)

Pre-condition | Frequency of Use | Execution Time | Resolution Needs
positive | thousands of times between changes | days or weeks | must significantly increase resolution or complexity
possible to some extent | dozens of times between changes | 4-8 hours | want to increase resolution/complexity
negative | only a few times between changes | minutes | current resolution/complexity already more than needed
Second Topic – Parallel Machines • Simplistic architecture • Types of parallel machines • Network topology • Parallel computing terminology
Simplistic Architecture [Figure: an array of nodes, each consisting of a CPU paired with its own memory (MEM)]
Processor Related Terms • RISC: Reduced Instruction Set Computers • PIPELINE : Technique where multiple instructions are overlapped in execution • SUPERSCALAR: Multiple instructions per clock period
Network Interconnect Related Terms • LATENCY : How long does it take to start sending a "message"? Units are generally microseconds nowadays. • BANDWIDTH : What data rate can be sustained once the message is started? Units are bytes/sec, Mbytes/sec, Gbytes/sec etc. • TOPOLOGY : What is the actual 'shape' of the interconnect? Are the nodes connected by a 2D mesh? A ring? Something more elaborate?
Memory/Cache Related Terms • CACHE : Cache is the level of memory hierarchy between the CPU and main memory. Cache is much smaller than main memory, and hence data from main memory is mapped into cache. [Figure: CPU connected to a small cache, which sits in front of main memory]
Memory/Cache Related Terms • ICACHE : Instruction cache • DCACHE (L1) : Data cache closest to registers • SCACHE (L2) : Secondary data cache • Data from SCACHE has to go through DCACHE to registers • SCACHE is larger than DCACHE • L3 cache • TLB : Translation-lookaside buffer; keeps addresses of pages (blocks of memory) in main memory that have recently been accessed
Memory/Cache Related Terms (cont.) [Figure: the memory hierarchy, from the CPU down through L1 cache, L2/L3 cache, main memory (DRAM), and the file system]
Memory/Cache Related Terms (cont.) • The data cache was designed with two key concepts in mind • Spatial Locality • When an element is referenced, its neighbors will be referenced too • Cache lines are fetched together • Work on consecutive data elements in the same cache line • Temporal Locality • When an element is referenced, it might be referenced again soon • Arrange code so that data in cache is reused as often as possible (see the sketch below)
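A minimal C sketch of spatial locality (an illustration, not from the slides): C stores 2-D arrays row by row, so making the last index the inner loop walks consecutive elements of each cache line.

```c
#define N 1000
double a[N][N], b[N][N];

/* cache friendly: the inner loop runs over j, so accesses are stride-1
   and every element of a fetched cache line gets used */
void good_order(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 2.0 * b[i][j];
}

/* same arithmetic, but the inner loop jumps N doubles at a time,
   touching a new cache line on nearly every access */
void bad_order(void) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] = 2.0 * b[i][j];
}
```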
Types of Parallel Machines • Flynn's taxonomy has commonly been used to classify parallel computers into one of four basic types: • Single instruction, single data (SISD): single scalar processor • Single instruction, multiple data (SIMD): Thinking Machines CM-2 • Multiple instruction, single data (MISD): various special purpose machines • Multiple instruction, multiple data (MIMD): nearly all parallel machines • Since the MIMD model "won", a much more useful way to classify modern parallel computers is by their memory model • Shared memory • Distributed memory
Shared and Distributed Memory • Distributed memory - each processor has its own local memory. Must do message passing to exchange data between processors. (examples: CRAY T3E, XT; IBM Power, Sun and other vendor-made machines) • Shared memory - single address space. All processors have access to a pool of shared memory. (examples: CRAY T90, SGI Altix) • Methods of memory access: - Bus - Crossbar [Figure: shared memory shown as processors on a bus attached to a single memory; distributed memory shown as processor + memory pairs connected by a network]
Styles of Shared memory: UMA and NUMA Uniform memory access (UMA) Each processor has uniform access to memory - Also known as symmetric multiprocessors (SMPs) Non-uniform memory access (NUMA) Time for memory access depends on location of data. Local access is faster than non-local access. Easier to scale than SMPs (example: HP-Convex Exemplar, SGI Altix)
UMA - Memory Access Problems • Conventional wisdom is that these systems do not scale well • Bus based systems can become saturated • Fast large crossbars are expensive • Cache coherence problem • Copies of a variable can be present in multiple caches • A write by one processor may not become visible to others • They'll keep accessing the stale value in their caches • Need to take actions to ensure visibility or cache coherence
Machines • T90, C90, YMP, XMP, SV1, SV2 • SGI Origin (sort of) • HP-Exemplar (sort of) • Various Suns • Various Wintel boxes • Most desktop Macintoshes • Not new • BBN GP 1000 Butterfly • Vax 780
Programming methodologies • Standard Fortran or C and let the compiler do it for you • Directives can give hints to the compiler (OpenMP); a sketch follows below • Libraries • Thread-like methods • Explicitly start multiple tasks • Each given its own section of memory • Use shared variables for communication • Message passing can also be used but is not common
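A minimal OpenMP sketch of the directive style mentioned above (assuming a compiler with OpenMP support, e.g. built with -fopenmp): all threads see the shared arrays, and the directive tells the compiler how to split the loop among them.

```c
#include <omp.h>

/* scale one shared array into another; loop iterations are divided
   among the threads, which all access the same shared memory */
void scale(double *a, const double *b, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = 2.0 * b[i];
}
```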
Distributed shared memory (NUMA) • Consists of N processors and a global address space • All processors can see all memory • Each processor has some amount of local memory • Access to the memory of other processors is slower • Non-Uniform Memory Access
Memory • Easier to build because of slower access to remote memory • Similar cache problems • Code writers should be aware of data distribution • Load balance • Minimize access of "far" memory
Programming methodologies • Same as shared memory • Standard Fortran or C and let the compiler do it for you • Directives can give hints to the compiler (OpenMP) • Libraries • Thread-like methods • Explicitly start multiple tasks • Each given its own section of memory • Use shared variables for communication • Message passing can also be used
Machines • SGI Origin, Altix • HP-Exemplar
Distributed Memory • Each of N processors has its own memory • Memory is not shared • Communication occurs using messages
Programming methodology • Mostly message passing using MPI • Data distribution languages • Simulate global name space • Examples • High Performance Fortran • Split C • Co-array Fortran
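A minimal MPI sketch of explicit message passing (an illustration only): rank 0's memory is invisible to rank 1, so the value has to be sent as a message.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    double x = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        x = 3.14;                             /* lives in rank 0's local memory */
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", x);    /* now in rank 1's local memory */
    }

    MPI_Finalize();
    return 0;
}
```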
Hybrid machines • SMP nodes (clumps) with interconnect between clumps • Machines • Cray XT3/4 • IBM Power4/Power5 • Sun, other vendor machines • Programming • SMP methods on clumps or message passing • Message passing between all processors [Figure: two bus-based SMP nodes, each with several processors sharing one memory, joined by an interconnect]
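A minimal sketch of the hybrid style (an illustration, assuming an MPI library and an OpenMP-capable compiler): message passing between the SMP nodes, threads sharing memory within each node.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank;

    /* FUNNELED: only the main thread of each process will make MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* typically one MPI process per node (clump), several OpenMP threads inside it */
    #pragma omp parallel
    printf("MPI rank %d, OpenMP thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}
```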