
Computational Methods in Physics PHYS 3437

Presentation Transcript


  1. Computational Methods in Physics PHYS 3437 Dr Rob Thacker Dept of Astronomy & Physics (MM-301C) thacker@ap.smu.ca

  2. Today’s Lecture • Introduction to parallel programming • Concepts – what are parallel computers, what is parallel programming? • Why do you need to use parallel programming? • When parallelism will be beneficial • Amdahl’s Law • Very brief introduction to OpenMP

  3. Why bother to teach this in an undergrad Physics course? • Because parallel computing is now ubiquitous • Most laptops are parallel computers, for example • Dual/Quad core chips are already standard, in the future we can look forward to 8/16/32/64 cores per chip! • Actually Sun Microsystems already sells a chip with 8 cores • I predict that by 2012 you will be buying chips with 16 cores • If we want to use all this capacity, then we will need to run codes that can use more than one CPU core at a time • Such codes are said to be parallel • Exposure to these concepts will help you significantly if you want to go to grad school in an area that uses computational methods extensively • Because not many people have these skills! • If you are interested, an excellent essay on how computing is changing can be found here: • http://view.eecs.berkeley.edu/wiki/Main_Page

  4. Some caveats • In two lectures we cannot cover very much on parallel computing • We will concentrate on the simplest kind of parallel programming • Exposes some of the inherent problems • Still gives you useful increased performance • Remember, making a code run 10 times faster turns a week into a day! • The type of programming we’ll be looking at is often limited in terms of the maximum speed-up possible, but factors of 10 are pretty common

  5. Why can’t the compiler just make my code parallel for me? • In some situations it can, but most of the time it can’t • You really are smarter than a compiler is! • There are many situations where a compiler will not be able to make something parallel but you can • Compilers that can attempt to parallelize code are called “auto-parallelizing” • Some people have suggested writing parallel languages that only allow the types of code that can be easily parallelized • These have proven to not be very popular and are too restrictive • At present, the most popular way of parallel programming is to add additional commands to your original code • These commands are sometimes called pragmas or directives

  6. Recap: von Neumann architecture • Machine instructions are encoded in binary & stored in memory – the key insight! • First practical stored-program architecture, developed while von Neumann was working on the EDVAC design • Still in use today • Speed is limited by the bandwidth of data between memory and the processing unit – the "von Neumann" bottleneck • [Diagram: CPU (control unit + process unit) connected to data memory and program memory, with input and output]

  7. Shared memory computers • Traditional shared memory design – all processors share a memory bus • All of the processors see the same memory locations, which makes programming these computers reasonably straightforward • Sometimes called "SMP"s, for symmetric multi-processor • Program these computers using "OpenMP" extensions to C, FORTRAN • [Diagram: four CPUs connected to a single shared MEMORY]

  8. Distributed memory computers • Really a collection of computers linked together via a network • Each processor has its own memory and must communicate with other processors over the network to get information from other memory locations – this can be quite difficult at times • This is the architecture of "computer clusters" (each "CPU" here could actually itself be a shared memory computer) • Program these computers using MPI or PVM extensions to C, FORTRAN • [Diagram: four CPUs, each with its own MEMORY, connected by a NETWORK]

  9. What do we mean by being able to do things in parallel? • Suppose the input data of an operation is divided into a series of independent parts • Processing of the parts is carried out independently • A simple example is operations on vectors/arrays where we loop over array indices – a single loop over array A(i) can be split into tasks that execute in parallel:
      Task 1:   do i=1,10    a(i)=a(i)*2.   end do
      Task 2:   do i=11,20   a(i)=a(i)*2.   end do
      Task 3:   do i=21,30   a(i)=a(i)*2.   end do
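To make the idea concrete, here is a minimal free-form Fortran sketch (not from the slides) that divides the range 1..n into a chosen number of chunks and processes each chunk with an independent inner loop; the array size n and chunk count nchunk are illustrative choices.

      program chunked_loop
        implicit none
        integer, parameter :: n = 30, nchunk = 3
        integer :: c, i, lo, hi
        real :: a(n)

        a = 1.0                          ! fill the array with test data

        ! Each pass of the outer loop handles one independent chunk;
        ! because no chunk reads data written by another, the chunks
        ! could be handed to different processors.
        do c = 1, nchunk
          lo = (c-1)*(n/nchunk) + 1      ! first index of this chunk
          hi = c*(n/nchunk)              ! last index (n divisible by nchunk here)
          do i = lo, hi
            a(i) = a(i)*2.0
          end do
        end do

        print *, a(1), a(n)              ! both should be 2.0
      end program chunked_loop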

  10. Some subtleties • However, you can’t always do this • Consider do i=2,n a(i)=a(i-1) end do • This kind of loop has what we call a dependence • If you update a value of a(i) before a(i-1) has been updated then you will get the wrong answer compared to running on a single processor • We’ll talk a little more about this later, but it does mean that not every loop can be “parallelized”
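A quick, hedged demonstration of the point (a sketch, not from the slides): the loop below copies a(i-1) into a(i), so after the serial loop every element holds the original a(1). If the iterations were executed out of order, some elements would pick up already-updated values and the result would differ.

      program dependence_demo
        implicit none
        integer, parameter :: n = 5
        integer :: i
        real :: a(n)

        a = (/ 1.0, 2.0, 3.0, 4.0, 5.0 /)

        ! Each iteration reads the value written by the previous one,
        ! so the order of execution matters: this is a loop-carried
        ! dependence and the loop cannot simply be split across threads.
        do i = 2, n
          a(i) = a(i-1)
        end do

        print *, a      ! serial result: 1.0 1.0 1.0 1.0 1.0
      end program dependence_demo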

  11. Issues to be aware of • Parallel computing is not about being “cool” and doing lots and lots of “flops” • Flops = floating point operations per second • We want solutions to problems in a reasonable amount of time • Sometimes that means doing a lot of calculations – e.g. consider what we found about the number of collisions for molecules in air • Gains from algorithmic improvements will often swamp hardware improvements • Don’t be brain-limited, if there is a better algorithm use it

  12. Algorithmic Improvements in n-body simulations Improvements in the speed of algorithms are proportionally better than the speed increase of computers over the same time interval.

  13. Identifying Performance Desires • Frequency of use: Daily (positive precondition) → Monthly → Yearly (negative precondition) • Code evolution timescale: Hundreds of executions between changes (positive precondition) → Changes each run (negative precondition)

  14. Performance Characteristics • Execution time: Days (positive precondition) → Hours → Minutes (negative precondition) • Level of synchronization: None (positive precondition) → Infrequent (every minute) → Frequent (many per second) (negative precondition)

  15. Data and Algorithm • Algorithmic complexity (approximately the number of stages): Simple (positive precondition) → Complex (negative precondition) • Data structures: Regular, static (positive precondition) → Irregular, dynamic (negative precondition)

  16. Requirements • Must significantly increase resolution/length of integration (positive precondition) → Need a factor of 2 increase → Current resolution meets needs (negative precondition)

  17. How much speed-up can we achieve? • Some parts of a code cannot be run in parallel • For example, the loop over a(i)=a(i-1) from earlier • Any code that cannot be executed in parallel is said to be serial or sequential • Let's suppose that, in terms of the total execution time of a program, a fraction fs has to be run in serial, while a fraction fp can be run in parallel on n CPUs • Equivalently, the time spent in each fraction will be ts and tp, so the total time on 1 CPU is t1cpu = ts + tp • If we can run the parallel fraction on n CPUs then it will take a time tp/n • The total time will then be tncpu = ts + tp/n

  18. Amdahl's Law • How much speed-up (Sn = t1cpu/tncpu) is feasible? • Amdahl's Law is the most significant limit. Given our previous results and n processors, the maximum speed-up is given by Sn = (ts + tp)/(ts + tp/n) = 1/(fs + fp/n) = 1/(fs + (1-fs)/n) • Only if the serial fraction fs (= ts/(ts+tp)) is zero is perfect speed-up possible (at least in theory)
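As a concrete, illustrative check of the formula: with an assumed serial fraction fs = 0.1 and 16 CPUs, the speed-up is only 1/(0.1 + 0.9/16) = 6.4, and no number of CPUs can ever push it past 1/fs = 10. The short Fortran sketch below (not from the slides) evaluates Sn for a few processor counts.

      program amdahl
        implicit none
        real, parameter :: fs = 0.1      ! assumed serial fraction (illustrative)
        integer :: k, n
        real :: sn

        ! Maximum speed-up on n CPUs: Sn = 1/(fs + (1-fs)/n)
        do k = 0, 6
          n  = 2**k                      ! n = 1, 2, 4, ..., 64
          sn = 1.0/(fs + (1.0-fs)/real(n))
          print *, 'n =', n, '  speed-up =', sn
        end do
      end program amdahl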

  19. Amdahl's Law • [Plot: speed-up Sn versus number of CPUs Ncpu for several values of fs. Annotations: scaling is similar for different fs at low CPU counts; you have to achieve excellent parallelism (very small fs) to keep gaining speed-up at high CPU counts.]

  20. What is OpenMP? • OpenMP is a "pragma" based "application programmer interface" (API) that provides a simple extension to C/C++ and FORTRAN • A pragma is just a fancy word for an instruction to the compiler (a directive) • It is designed exclusively for shared memory programming • Ultimately, OpenMP is a very simple interface to something called thread-based programming • What actually happens when you break up a loop into pieces is that a number of threads of execution are created that can run the loop pieces in parallel

  21. Threads based execution • Serial execution, interspersed with parallel sections • A master thread executes the serial sections; at the start of each parallel section it FORKs a team of threads, and at the end of the section the threads JOIN back into the single master thread • [Diagram: Master Thread running Serial Section → FORK → Parallel Section → JOIN → Serial Section → FORK → Parallel Section → JOIN → Serial Section] • In practice many compilers block execution of the extra threads during serial sections; this saves the overhead of the `fork-join' operation
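A minimal, illustrative free-form Fortran sketch of the fork-join pattern (not from the slides): the master thread prints on its own, forks a team inside the parallel region, and the team joins back before the final print. It assumes an OpenMP-aware compiler and uses the standard runtime routine omp_get_thread_num().

      program fork_join_demo
        use omp_lib                      ! gives access to omp_get_thread_num()
        implicit none

        print *, 'serial section: master thread only'

        !$OMP PARALLEL                   ! FORK: a team of threads starts here
          print *, 'parallel section: hello from thread', omp_get_thread_num()
        !$OMP END PARALLEL               ! JOIN: back to the master thread

        print *, 'serial section again: master thread only'
      end program fork_join_demo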

  22. Some background to threads programming • There is actually an entire set of commands in C to allow you to create threads • You could, if you wanted, program with these commands • The most common thread standard is called POSIX • However, OpenMP provides a simple interface to a lot of the functionality provided by threads • If it is simple, and does what you need why bother going to the effort of using threads programming?

  23. Components of OpenMP • Directives (pragmas in your code) • Runtime library routines (provided with the compiler) • Environment variables (set at the Unix prompt)
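A small, hedged illustration of all three components together (the routine and variable names are standard OpenMP, but the program itself is just a sketch): a directive marks the parallel loop, a runtime library routine queries the thread count, and the OMP_NUM_THREADS environment variable set at the shell controls how many threads are used.

      ! Set the thread count at the Unix prompt before running, e.g.
      !   export OMP_NUM_THREADS=4       <- environment variable
      program components_demo
        use omp_lib                      ! runtime library routines
        implicit none
        integer, parameter :: n = 1000
        integer :: i
        real :: y(n)

        print *, 'max threads available:', omp_get_max_threads()   ! runtime routine

        !$OMP PARALLEL DO DEFAULT(NONE) PRIVATE(i) SHARED(y)       ! directive
        do i = 1, n
          y(i) = real(i)**2
        end do
        !$OMP END PARALLEL DO

        print *, y(n)
      end program components_demo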

  24. OpenMP: Where did it come from? • Prior to 1997, vendors all had their own proprietary shared memory programming commands • Programs were not portable from one SMP to another • Researchers were calling for some kind of portability • ANSI X3H5 (1994) proposal tried to formalize a shared memory standard – but ultimately failed • OpenMP (1997) worked because the vendors got behind it and there was new growth in the shared memory market place • Very hard for researchers to get new languages supported now, must have backing from computer vendors!

  25. Bottom line • For OpenMP & shared memory programming in general, one only has to worry about parallelism of work • This is because all the processors in a shared-memory computer can see all the same memory locations • On distributed-memory computers one has to worry both about parallelism of the work and also about the placement of data • Is the value I need in the memory of another processor? • Data movement is what makes distributed-memory codes (usually written in something called MPI) so much longer – it can be highly non-trivial • Although it can be easy – it depends on the algorithm

  26. First Steps • Loop level parallelism is the simplest and easiest way to use OpenMP • Take each do loop and make it parallel (if possible) • It allows you to slowly build up parallelism within your application • However, not all loops are immediately parallelizable due to dependencies

  27. Loop Level Parallelism • Consider the single precision vector add-multiply operation Y=aX+Y ("SAXPY")
    FORTRAN (serial):
        do i=1,n
          Y(i)=a*X(i)+Y(i)
        end do
    FORTRAN (parallel):
C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i),SHARED(X,Y,n,a)
        do i=1,n
          Y(i)=a*X(i)+Y(i)
        end do
    C/C++ (serial):
        for (i=1;i<=n;++i) {
          Y[i]+=a*X[i];
        }
    C/C++ (parallel):
        #pragma omp parallel for \
          private(i) shared(X,Y,n,a)
        for (i=1;i<=n;++i) {
          Y[i]+=a*X[i];
        }
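For reference, a hedged, self-contained free-form Fortran version of the same SAXPY loop (using the !$OMP sentinel described on slide 29 rather than the fixed-form C$OMP one); the array size and the constant a are arbitrary test values.

      program saxpy_demo
        implicit none
        integer, parameter :: n = 1000000
        integer :: i
        real :: a
        real, allocatable :: x(:), y(:)

        allocate(x(n), y(n))
        a = 2.0
        x = 1.0
        y = 3.0

        ! Each iteration touches only its own Y(i), so the loop has no
        ! dependencies and can be divided among threads.
        !$OMP PARALLEL DO DEFAULT(NONE) PRIVATE(i) SHARED(x,y,a)
        do i = 1, n
          y(i) = a*x(i) + y(i)
        end do
        !$OMP END PARALLEL DO

        print *, y(1), y(n)      ! both should be 5.0
      end program saxpy_demo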

  28. In more detail
C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i),SHARED(X,Y,n,a)
        do i=1,n
          Y(i)=a*X(i)+Y(i)
        end do
  • C$OMP marks a comment pragma for FORTRAN – the ampersand is necessary for continuation lines • PARALLEL DO denotes that this is a region of code for parallel execution • DEFAULT(NONE) is good programming practice – you must then declare the nature of all variables • Thread SHARED variables: all threads can access these variables, but must not update individual memory locations simultaneously • Thread PRIVATE variables: each thread must have its own copy of the variable (in this case i is the only private variable)
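To illustrate why PRIVATE matters beyond the loop index, here is a hedged sketch (not from the slides) with a scalar temporary: each thread needs its own copy of tmp, otherwise threads would overwrite each other's value mid-iteration and produce wrong answers.

      program private_demo
        implicit none
        integer, parameter :: n = 1000
        integer :: i
        real :: tmp
        real :: x(n), y(n)

        x = 0.5
        y = 0.0

        ! tmp is written and then read within each iteration, so every
        ! thread must carry its own private copy of it.
        !$OMP PARALLEL DO DEFAULT(NONE) PRIVATE(i,tmp) SHARED(x,y)
        do i = 1, n
          tmp  = 3.0*x(i) + 1.0
          y(i) = tmp*tmp
        end do
        !$OMP END PARALLEL DO

        print *, y(1)        ! (3*0.5+1)^2 = 6.25
      end program private_demo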

  29. A quick note • To be fully lexically correct you may want to include a C$OMP END PARALLEL DO • In f90 programs use !$OMP as a sentinel • Notice that the sentinels mean that the OpenMP commands look like comments • A compiler that has OpenMP compatibility turned on will treat the comments that begin with the sentinel as directives • This means you can still compile the code on computers that don't have OpenMP

  30. How the compiler handles OpenMP • When you compile an OpenMP code you need to add "flags" to the compile line, e.g. f77 -openmp -o myprogram myprogram.f • Unfortunately different compilers have different commands for turning on OpenMP support; the above will work on Sun machines • When the compiler flag is turned on, you force the compiler to link in all of the additional libraries (and so on) necessary to run the threads • This is all transparent to you though
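For comparison, these are the usual flags on the GNU and Intel Fortran compilers (compilers and versions differ, so check your compiler's documentation; the file and program names are placeholders):

      # GNU Fortran
      gfortran -fopenmp -o myprogram myprogram.f
      # Intel Fortran (newer versions use -qopenmp; older ones accepted -openmp)
      ifort -qopenmp -o myprogram myprogram.f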

  31. Requirements for parallel loops • To divide up the work the compiler needs to know the number of iterations to be executed – the trip count must be computable • The loops must also not exhibit any of the dependencies we mentioned • We'll review this more in the next lecture • Actually, a good test for dependencies is running the loop from n to 1 rather than 1 to n – if you get a different answer, that suggests there are dependencies • DO WHILE is not parallelizable using these directives • There is actually a way of parallelizing DO WHILE using a different set of OpenMP commands, but we don't have time to cover that • The loop can only have one exit point – therefore BREAKs or GOTOs out of the loop are not allowed
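Two illustrative examples (a sketch, not taken from the slides) of loops that fail these requirements: the first loop's trip count cannot be computed before it runs, and the second has a second exit point (an EXIT, the Fortran analogue of BREAK).

      program not_parallelizable
        implicit none
        integer, parameter :: n = 10
        integer :: i
        real :: a(n), err, tol

        a   = (/ 4.0, 9.0, 16.0, 25.0, 36.0, 1.0, 2.0, 3.0, 4.0, 5.0 /)
        err = 1.0
        tol = 1.0e-3

        ! Trip count not computable in advance: how many iterations this
        ! takes depends on values computed inside the loop.
        do while (err > tol)
          err = err*0.5
        end do

        ! More than one exit point: the EXIT (Fortran's BREAK) makes this
        ! loop unsuitable for PARALLEL DO as well.
        do i = 1, n
          if (a(i) < 2.0) exit
          a(i) = sqrt(a(i))
        end do

        print *, err, a(1), a(n)
      end program not_parallelizable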

  32. Performance limitations • Each time you start and end a parallel loop there is an overhead associated with the threads • These overheads must always be added to the time taken to calculate the loop itself • Therefore there is a limit on the smallest loop size that will achieve speed up • In practice, we need roughly 5000 floating point operations in a loop for it to be worth parallelizing • A good rule of thumb is that any thread should have at least 1000 floating point operations • Thus small loops are simply not worth the bother!
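One way to act on this rule of thumb, sketched here under the assumption that your compiler supports the standard OpenMP IF clause (the threshold 5000 is just the illustrative figure from the slide):

      program if_clause_demo
        implicit none
        integer :: i, n
        real, allocatable :: y(:)

        n = 20000                ! illustrative size; try small values too
        allocate(y(n))
        y = 1.0

        ! The IF clause runs the loop in parallel only when the trip count
        ! is large enough to amortize the fork/join overhead; for small n
        ! it simply executes serially on the master thread.
        !$OMP PARALLEL DO DEFAULT(NONE) PRIVATE(i) SHARED(y,n) IF(n > 5000)
        do i = 1, n
          y(i) = 2.0*y(i)
        end do
        !$OMP END PARALLEL DO

        print *, y(1), y(n)      ! both 2.0
      end program if_clause_demo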

  33. Summary • Shared memory parallel computers can be programmed using the OpenMP extensions to C,FORTRAN • Distributed memory computers require a different parallel language • The easiest way to use OpenMP is to make loops parallel by dividing work up among threads • Compiler handles most of the difficult parts of coding • However, not all loops are immediately parallelizable • Dependencies may prevent parallelization • Loops are made to run in parallel by adding directives (“pragmas”) to your code • These directives appear to be comments to ordinary compilers

  34. Next Lecture • More details on dependencies and how we can deal with them
