Introduction to Scientific Computing Doug Sondak sondak@bu.edu Boston University Scientific Computing and Visualization
Outline • Introduction • Software • Parallelization • Hardware
Introduction • What is Scientific Computing? • Need for speed • Need for memory • Simulations tend to grow until they overwhelm available resources • If I can simulate 1000 neurons, wouldn’t it be cool if I could do 2000? 10000? 10^87? • Example – flow over an airplane • It has been estimated that if a teraflop machine were available, it would take about 200,000 years to solve (resolving all scales) • If Homo erectus had had a teraflop machine, we could be getting the result right about now.
Introduction (cont’d) • Optimization • Profile serial (1-processor) code • Tells where most time is consumed • Is there any low-hanging fruit? • Faster algorithm • Optimized library • Wasted operations (see the sketch below) • Parallelization • Break problem up into chunks • Solve chunks simultaneously on different processors
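As an illustration of the “wasted operations” item above, here is a minimal C sketch (the function and variable names are hypothetical, not from the original slides): a loop-invariant sqrt() is hoisted out of the loop so it is computed once instead of n times.

#include <math.h>

/* Wasted operations: sqrt(a) does not depend on i, so computing it
   inside the loop repeats the same work n times. */
void scale_slow(double *x, int n, double a)
{
    for (int i = 0; i < n; i++)
        x[i] = sqrt(a) * x[i];   /* sqrt recomputed every iteration */
}

/* Hoisting the invariant out of the loop removes the wasted work. */
void scale_fast(double *x, int n, double a)
{
    double s = sqrt(a);          /* computed once */
    for (int i = 0; i < n; i++)
        x[i] = s * x[i];
}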
Compiler • The compiler is your friend (usually) • Optimizers are quite refined • Always try highest level • Usually -O3 • Sometimes -fast, -O5, … • Loads of flags, many for optimization • Good news – many compilers will automatically parallelize for shared-memory systems • Bad news – this usually doesn’t work well
Software • Libraries • Solver is often a major consumer of CPU time • Numerical Recipes is a good book, but many algorithms are not optimal • Lapack is a good resource • Libraries are often available that have been optimized for the local architecture • Disadvantage – not portable
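As a sketch of calling an optimized library instead of hand-rolling a solver, the snippet below uses the LAPACKE C interface to LAPACK’s dgesv to solve a small linear system; the 2x2 matrix and right-hand side are made-up illustration values, and availability of lapacke.h and the LAPACKE library on the system is assumed.

#include <stdio.h>
#include <lapacke.h>

int main(void)
{
    /* Solve the 2x2 system A x = b with LAPACK's dgesv
       (LU factorization with partial pivoting), row-major storage */
    double A[4] = { 4.0, 1.0,
                    1.0, 3.0 };
    double b[2] = { 1.0, 2.0 };      /* overwritten with the solution */
    lapack_int ipiv[2];

    lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 2, 1,
                                    A, 2, ipiv, b, 1);
    if (info != 0) {
        printf("dgesv failed, info = %d\n", (int) info);
        return 1;
    }
    printf("x = [ %f  %f ]\n", b[0], b[1]);
    return 0;
}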
Parallelization • Divide and conquer! • divide operations among many processors • perform operations simultaneously • if serial run takes 10 hours and we hit the problem with 5000 processors, it should take about 7 seconds to complete, right? • not so easy, of course
Parallelization (cont’d) • problem – some calculations depend upon previous calculations • can’t be performed simultaneously • sometimes tied to the physics of the problem, e.g., time evolution of a system • want to maximize amount of parallel code • occasionally easy • usually requires some work
Parallelization (3) • method used for parallelization may depend on hardware • distributed memory • each processor has own address space • if one processor needs data from another processor, must be explicitly passed • shared memory • common address space • no message passing required
Parallelization (4) • [Diagram] • shared memory – all processors (proc 0 … proc 3) access one common memory • distributed memory – each processor (proc 0 … proc 3) has its own memory (mem 0 … mem 3) • mixed memory – groups of processors each share a memory, and the groups are connected together
Parallelization (5) • MPI • for both distributed and shared memory • portable • freely downloadable • OpenMP • shared memory only • must be supported by the compiler (most compilers do) • usually easier than MPI • can be implemented incrementally
MPI • Computational domain is typically decomposed into regions • One region assigned to each processor • Separate copy of program runs on each processor
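A minimal sketch of the “separate copy on each processor” model: every MPI process runs the same program, finds out its own rank, and would then work on its own region of the decomposed domain.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this copy's id         */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* total number of copies */

    /* Each process would be assigned its own region of the
       decomposed domain here, based on rank. */
    printf("process %d of %d\n", rank, nprocs);

    MPI_Finalize();
    return 0;
}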
MPI • Discretized domain for solving flow over an airfoil • System of coupled PDEs solved at each point
MPI • Decomposed domain for 4 processors
MPI • Since points depend on adjacent points, must transfer information after each iteration • This is done with explicit calls in the source code
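A sketch of the kind of explicit call this involves, assuming a 1-D decomposition in which each process holds interior points u[1] … u[n] plus ghost cells u[0] and u[n+1]; the function and variable names are made up for illustration.

#include <mpi.h>

/* Exchange boundary values with the left/right neighbor processes
   after each iteration.  Edge processes use MPI_PROC_NULL, which
   turns that half of the exchange into a no-op. */
void exchange_ghosts(double *u, int n, int rank, int nprocs)
{
    int left  = (rank == 0)          ? MPI_PROC_NULL : rank - 1;
    int right = (rank == nprocs - 1) ? MPI_PROC_NULL : rank + 1;

    /* send first interior point to the left, receive right ghost cell */
    MPI_Sendrecv(&u[1],     1, MPI_DOUBLE, left,  0,
                 &u[n + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* send last interior point to the right, receive left ghost cell */
    MPI_Sendrecv(&u[n], 1, MPI_DOUBLE, right, 0,
                 &u[0], 1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}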
MPI • Diminishing returns • Sending messages can get expensive • Want to maximize ratio of computation to communication • Parallel speedup: S(n) = T(1) / T(n) • Parallel efficiency: E(n) = S(n) / n • T = time, n = number of processors
OpenMP • Usually loop-level parallelization • An OpenMP directive is placed in the source code before the loop • Assigns subset of loop indices to each processor • No message passing since each processor can “see” the whole domain • Example loop (sketch below):
for(i=0; i<N; i++){
    do lots of stuff
}
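A minimal sketch of what that looks like in C (hypothetical function and array names; the loop body is just a stand-in for “lots of stuff”):

void update(double *a, const double *b, int n)
{
    /* The directive below splits the loop indices among the threads;
       each thread updates its own subset of i, with no messages. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        a[i] = 2.0 * b[i];        /* placeholder for "lots of stuff" */
    }
}

Compiling with the compiler’s OpenMP flag (e.g. -fopenmp for gcc) is what activates the directive; without it, the pragma is ignored and the loop runs serially.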
OpenMP • Example of how to do it wrong: parallelize this loop on 2 processors
for(i = 0; i < 7; i++) a[i] = 1;
for(i = 1; i < 7; i++) a[i] = 2*a[i-1];
• The second loop has a dependence: a[i] uses a[i-1] from the previous iteration • If the iterations are split between Proc. 0 and Proc. 1, we can’t guarantee the order of operations, so the result can be wrong
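One safe way to handle this pair of loops, sketched below under the same assumptions: parallelize only the first loop, whose iterations are independent, and leave the dependent loop serial.

void fill(double *a)
{
    int i;

    /* iterations are independent: safe to parallelize */
    #pragma omp parallel for
    for (i = 0; i < 7; i++)
        a[i] = 1;

    /* a[i] depends on a[i-1]: must run in order, so keep it serial */
    for (i = 1; i < 7; i++)
        a[i] = 2 * a[i - 1];
}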
Hardware • A faster processor is obviously good, but: • Memory access speed is often a big driver • Cache – a critical element of memory system • Processors have internal parallelism such as pipelines and multiply-add instructions
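As an illustration of the multiply-add point (hypothetical function name, not from the original slides), a loop written as one multiply plus one add per iteration is the kind of code an optimizer can map onto fused multiply-add instructions and keep the pipeline busy:

/* y[i] = a*x[i] + y[i]: one multiply and one add per iteration,
   a natural candidate for fused multiply-add instructions. */
void axpy(double *y, const double *x, double a, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}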
Cache • Cache is a small chunk of fast memory between the main memory and the registers • Memory hierarchy (fastest to slowest): registers → primary cache → secondary cache → main memory
Cache (cont’d) • Variables are moved from main memory to cache in lines • L1 cache line sizes on our machines • Opteron (blade cluster) 64 bytes • Power4 (p-series) 128 bytes • PPC440 (Blue Gene) 32 bytes • Pentium III (linux cluster) 32 bytes • If variables are used repeatedly, code will run faster since cache memory is much faster than main memory
Cache (cont’d) • Why not just make the main memory out of the same stuff as cache? • Expensive • Runs hot • This was actually done in Cray computers • Liquid cooling system
Cache (cont’d) • Cache hit • Required variable is in cache • Cache miss • Required variable not in cache • If cache is full, something else must be thrown out (sent back to main memory) to make room • Want to minimize number of cache misses
Cache example • “mini” cache holds 2 lines, 4 words each • Loop to execute:
for(i = 0; i < 10; i++) x[i] = i;
• Main memory holds x[0] … x[9] plus two other variables a and b; the cache starts out empty
Cache example (cont’d) • We will ignore i itself for simplicity • Need x[0], not in cache → cache miss • Load the line x[0] x[1] x[2] x[3] from memory into cache • Next 3 loop indices result in cache hits
Cache example (cont’d) • Need x[4], not in cache → cache miss • Load the line x[4] x[5] x[6] x[7] from memory into the second cache line • Next 3 loop indices result in cache hits
Cache example (cont’d) • Need x[8], not in cache → cache miss • Load line from memory into cache • No room in cache! • Replace an old line (x[0] x[1] x[2] x[3]) to make room • Cache now holds x[8] x[9] a b and x[4] x[5] x[6] x[7]
Cache (cont’d) • Contiguous access is important • In C, a multidimensional array is stored in memory row by row: a[0][0] a[0][1] a[0][2] …
Cache (cont’d) • In Fortran and Matlab, a multidimensional array is stored the opposite way, column by column: a(1,1) a(2,1) a(3,1) …
Cache (cont’d) • Rule: Always order your loops so that the innermost loop runs over contiguously stored elements
C:
for(i=0; i<N; i++){
  for(j=0; j<N; j++){
    a[i][j] = 1.0;
  }
}
Fortran:
do j = 1, n
  do i = 1, n
    a(i,j) = 1.0
  enddo
enddo
p-series • Shared memory • IBM Power4 processors • 32 KB L1 cache per processor • 1.41 MB L2 cache per pair of processors • 128 MB L3 cache per 8 processors
Blue Gene • Distributed memory • 2048 processors • 1024 2-processor nodes • IBM PowerPC 440 processors • 700 MHz • 512 MB memory per node (per 2 processors) • 32 KB L1 cache per node • 2 MB L2 cache per node • 4 MB L3 cache per node
BladeCenter • Hybrid memory • 56 processors • 14 4-processor nodes • AMD Opteron processors • 2.6 GHz • 8 GB memory per node (per 4 processors) • Each node has shared memory • 64 KB L1 cache per 2 processors • 1 MB L2 cache per 2 processors
Linux Cluster • Hybrid memory • 104 processors • 52 2-processor nodes • Intel Pentium III processors • 1.3 GHz • 1 GB memory per node (per 2 processors) • Each node has shared memory • 16 KB L1 cache per 2 processors • 512 KB L2 cache per 2 processors
For More Information • SCV web site http://scv.bu.edu/ • Today’s presentations are available at http://scv.bu.edu/documentation/presentations/ under the title “Introduction to Scientific Computing and Visualization”
Next Time • G & T code • Time it • Look at effect of compiler flags • Profile it • Where is time consumed? • Modify it to improve serial performance