Parallel coding: approaches to converting sequential programs to run on parallel machines
Goals
• Reduce wall-clock time
• Scalability:
  • increase resolution
  • expand the domain without loss of efficiency
It's all about efficiency – the threats being poor data communication, poor load balancing and an inherently sequential algorithm nature.
Efficiency
• Communication overhead – data transfer runs at no more than about 10^-3 of the processing speed, so every transfer is costly
• Load balancing – an uneven load that is only statically balanced may cause idle processor time
• Inherently sequential algorithm nature – if all tasks must be performed serially, there is no room for parallelization
A lack of efficiency can cause a parallel code to perform worse than a similar sequential code.
Scalability Amdahl's Law states that potential program speedup is defined by the fraction of code (f) which can be parallelized
Scalability

Maximum speedup with an unlimited number of processors, where f is the parallel fraction:

    speedup = 1 / (1 - f)

Introducing the number of processors N, with parallel fraction P and serial fraction S = 1 - P:

    speedup = 1 / (P/N + S)

        N    P = .50    P = .90    P = .99
    -----    -------    -------    -------
       10       1.82       5.26       9.17
      100       1.98       9.17      50.25
     1000       1.99       9.91      90.99
    10000       1.99       9.91      99.02
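A minimal sketch in standard Fortran (not part of the original material) that reproduces the table above from the second formula; the program and variable names are invented for illustration:

program amdahl
  ! Tabulate Amdahl's-law speedup = 1 / (P/N + S), with S = 1 - P,
  ! for the processor counts and parallel fractions shown above.
  implicit none
  integer, parameter :: nprocs(4) = (/ 10, 100, 1000, 10000 /)
  real,    parameter :: pfrac(3)  = (/ 0.50, 0.90, 0.99 /)
  integer :: i, j

  write(*,'(a10,3a10)') 'N', 'P=0.50', 'P=0.90', 'P=0.99'
  do i = 1, size(nprocs)
     ! One row per processor count; the implied do evaluates the formula
     ! for each parallel fraction.
     write(*,'(i10,3f10.2)') nprocs(i), &
          (1.0 / (pfrac(j)/real(nprocs(i)) + (1.0 - pfrac(j))), j = 1, 3)
  end do
end program amdahl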
Before we start - Framework
• Code may be influenced or even determined by the machine architecture, so understand the architecture first
• Choose a programming paradigm
• Choose the compiler
• Determine the communication pattern
• Choose the network topology
• Add code to accomplish task control and communications
• Make sure the code is sufficiently optimized (this may involve the architecture)
• Debug the code
• Eliminate lines that impose unnecessary overhead
Before we start - Program
• If we are starting with an existing serial program, debug the serial code completely
• Identify the parts of the program that can be executed concurrently:
  • Requires a thorough understanding of the algorithm
  • Exploit any inherent parallelism which may exist
  • May require restructuring of the program and/or algorithm, or even an entirely new algorithm
Before we start - Framework
• Architecture – Intel Xeon, 16 GB distributed memory, Rocks cluster
• Compiler – Intel FORTRAN / pgf
• Network – star (mesh?)
• Overhead – make sure the communication channels aren't clogged (net admin)
• Optimized code – write C code when necessary, use the CPU pipelines, use debugged programs…
Improvement methods Sequential coding practice
The COMMON problem
Problem: COMMON blocks are copied as one chunk of data each time a process forks, and the compiler doesn't distinguish between COMMON blocks that are actively used and redundant ones.
Sequential coding practice
The COMMON problem
• On NUMA (Non-Uniform Memory Access), MPP/SMP (Massively Parallel Processing / Symmetric Multi-Processor) and vector machines this is rarely an issue
• On a distributed computer (cluster) it is crucial – the network is congested by these copies!
Sequential coding practice
The COMMON problem
Resolution:
• Pass only the data the task requires
• Functional style – pass arguments on the call (see the sketch below)
• On shared-memory architectures use the shm* system calls (shmget, shmat, …)
• On distributed-memory architectures use message passing
Sequential coding practice
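A minimal sketch of the "pass only the required data" idea in plain Fortran; the COMMON block in the comment and the routine, array and size names are hypothetical, not taken from any actual model code:

program pass_only_needed
  ! Instead of relying on a large COMMON block such as
  !     COMMON /BIGBLK/ TEMP(1000000), WIND(1000000), SCRATCH(5000000)
  ! which would be replicated wholesale when a process forks,
  ! hand each task only the slice it actually works on.
  implicit none
  real :: column(100)
  column = 1.0
  call advect_column(column, size(column))
  print *, 'column(1) after update:', column(1)
end program pass_only_needed

subroutine advect_column(temp_col, n)
  ! Receives only the one column it operates on, passed as an argument.
  implicit none
  integer, intent(in)    :: n
  real,    intent(inout) :: temp_col(n)
  integer :: k
  do k = 1, n
     temp_col(k) = temp_col(k) * 0.99   ! placeholder computation
  end do
end subroutine advect_column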
Swapping to secondary storage
Problem: swapping is transparent but uncontrolled – the kernel cannot predict which pages are needed next, only determine which are needed frequently.
"Swap space is a way to emulate physical RAM, right? No, generally swap space is a repository to hold things from memory when memory is low. Things in swap cannot be addressed directly and need to be paged into physical memory before use, so there's no way swap could be used to emulate memory. So no, 512M + 512M swap is not the same as 1G memory and no swap." – KernelTrap.org
Sequential coding practice
Swapping to secondary storage - Example
Data: 381 MB x 2 – together larger than the physical RAM
• CPU – dual Intel Pentium 3
• Speed – 1000 MHz
• RAM – 512 MB
• Compiler – Intel Fortran
• Optimization – O2 (default)
Sequential coding practice
Swapping to secondary storage
• Swap space grows on demand; RAM is fully consumed
• 135 sec x 4000 kB/sec = 520 MB in each direction!
• Garbage collection takes time (memory is not freed)
• For processing ~800 MB of data, over 1 GB of data travels at hard-disk rate throughout the run
Sequential coding practice
Swapping to secondary storage
Resolution: prevent swapping by fitting the data volume to the user process's share of RAM (read and write temporary files from/to disk in chunks).
Sequential coding practice
Swapping to secondary storage
On every node:
• Memory size = 2 GB
• Predicted number of jobs pending = 3
• Use MOSIX for load balancing
• Work with data segments no greater than ~600 MB/process (open files + memory + output buffers) – see the sketch below
Sequential coding practice
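A hedged sketch of the chunking idea: stream a large, hypothetical unformatted file (bigdata.bin) through a fixed-size buffer so the working set stays within the per-process budget above, instead of loading everything and letting the kernel swap. The buffer size is illustrative and should be tuned to the actual RAM budget:

program chunked_io
  implicit none
  integer, parameter :: chunk = 1024*1024        ! elements per chunk (tune to RAM budget)
  real(kind=8) :: buf(chunk)                     ! 8 MB buffer in this sketch
  integer :: ios, u

  open(newunit=u, file='bigdata.bin', access='stream', form='unformatted', &
       status='old', iostat=ios)
  if (ios /= 0) stop 'cannot open bigdata.bin'

  do
     read(u, iostat=ios) buf        ! read one chunk from disk
     if (ios /= 0) exit             ! end of file (a short final chunk is skipped here)
     buf = buf * 2.0d0              ! placeholder processing on the resident chunk
     ! ... write results for this chunk to a temporary output file ...
  end do
  close(u)
end program chunked_io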
Paging, cache
Problem: like swapping, memory pages (16 KB here) go in and out of the CPU's cache. Again, the compiler cannot predict the order in which pages enter the cache, and this semi-controlled paging again leads to performance degradation.
Note: on-board memory is slower than cache memory (bus speed) but still much faster than disk access.
Sequential coding practice
Paging, cache
Resolution: prevent cache thrashing by adjusting the working-set size to the CPU cache.
Cache size (Xeon) = 512 KB, so work in 512 KB chunks whenever possible (e.g. 256 x 256 double precision), as in the sketch below.
Sequential coding practice
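A minimal sketch of cache blocking in Fortran, using the 256 x 256 double-precision tile suggested above (256*256*8 bytes = 512 KB); the blocked transpose and array sizes are illustrative only:

program cache_blocked
  implicit none
  integer, parameter :: n = 2048, blk = 256      ! one tile = 512 KB of doubles
  real(kind=8), allocatable :: a(:,:), b(:,:)
  integer :: i, j, ib, jb

  allocate(a(n,n), b(n,n))
  call random_number(a)

  ! Blocked transpose: each tile is processed while it is cache-resident.
  ! (For kernels touching two arrays at once, a smaller tile may be needed
  ! so that both tiles fit in the cache together.)
  do jb = 1, n, blk
     do ib = 1, n, blk
        do j = jb, min(jb + blk - 1, n)
           do i = ib, min(ib + blk - 1, n)
              b(j, i) = a(i, j)
           end do
        end do
     end do
  end do

  print *, 'check:', b(1, 2), a(2, 1)
end program cache_blocked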
Example
Data: 381 MB vs. 244 KB (larger than vs. smaller than the cache)
• CPU – Intel Pentium 4
• Speed – 1400 MHz
• L2 Cache – 256 KB
• Compiler – Intel Fortran
• Optimization – O2 (default)
Sequential coding practice
Example results
• 2.3 times less code
• 516 times slower overall
• 361 times slower do-loop execution
• Profile counts: 36 cache misses, 40 function calls, 3 print statements
Sequential coding practice
Workload summary (ordered from fastest to slowest)
• Adjust to cache size
• Adjust to pages in sequence
• Adjust to RAM size
• Control disk activity
Sparse Arrays
Current – dense (full) arrays:
• All array indices occupy memory
• Matrix manipulations are usually element by element (no linear-algebra operations when handling parameters on the grid)
Sequential coding practice
Dense arrays in HUCM: cloud drop size distribution (F1)
• Number of nonzeros ~ 110,000 → load = 5%
• Number of nonzeros ~ 3,700 → load = 0.2%
Sequential coding practice
Dense arrays in HUCM: cloud drop size distribution (F1) – lots of LHOLEs
• Number of nonzeros ~ 110,000 → load = 14%
• Number of nonzeros ~ 3,700 → load = 0.5%
Sequential coding practice
Sparse Arrays
Current – dense (full) arrays:
• All array subscripts occupy memory
• Matrix manipulations are usually element by element (no linear-algebra operations when handling parameters on the grid)
Improvement – sparse arrays:
• Only non-zero elements occupy memory cells (sparse notation)
• When calculating algebraic matrices, run the profiler to check for performance degradation due to the sparse data structure
Sequential coding practice
Sparse Arrays - HOWTO
The array as displayed is dense; what is actually stored is a list of (I, J, val) triplets for the nonzeros. Sparse matrices are a supported data type in the Intel Math Kernel Library (MKL). A sketch of the triplet idea follows.
Sequential coding practice
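A minimal sketch of the (I, J, val) triplet idea in plain Fortran; the array sizes and values are invented, and a production code would normally use a library format (e.g. the sparse storage supported by Intel MKL) rather than a hand-rolled one:

program sparse_coo
  implicit none
  integer, parameter :: maxnz = 1000
  integer      :: ii(maxnz), jj(maxnz)       ! row and column indices of the nonzeros
  real(kind=8) :: val(maxnz)                 ! the nonzero values themselves
  real(kind=8) :: dense(200, 200)
  integer :: nnz, i, j

  ! A mostly-zero dense array (placeholder data).
  dense = 0.0d0
  dense(3, 7)   = 1.5d0
  dense(42, 99) = 2.5d0

  ! Compress: keep only the nonzeros as (i, j, value) triplets.
  nnz = 0
  do j = 1, size(dense, 2)
     do i = 1, size(dense, 1)
        if (dense(i, j) /= 0.0d0) then
           nnz = nnz + 1
           ii(nnz) = i;  jj(nnz) = j;  val(nnz) = dense(i, j)
        end if
     end do
  end do

  print *, 'nonzeros stored:', nnz, ' out of ', size(dense), ' elements'
end program sparse_coo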
DO LOOPs
Current – loops written with no regard for the memory layout. Example: FORTRAN stores arrays in column-major order, so the leftmost subscript is the one that is contiguous in memory.
[Figure: memory layout vs. virtual layout of a 2D array (column major)]
Sequential coding practice
DO LOOPs
• The order of the subscripts is crucial
• With the wrong order the data pointer advances in large strides, crossing the 16 KB page limit
• Result: many page faults
[Figure: memory layout vs. virtual layout of a 2D array (column major), loops numbered in the wrong order]
Sequential coding practice
DO LOOPs
• The order of the subscripts is crucial
[Figure: memory layout vs. virtual layout of a 2D array (column major), loops numbered in the right order]
Sequential coding practice
DO LOOPs - example (a 125 MB array)
Sequential coding practice
DO LOOPs – example results
The wrong subscript order causes 42 times more idle crunching (an order of magnitude).
[Figure: wall-clock time of the do-loop, the 'print' statement and a system call for both loop orders]
Sequential coding practice
DO LOOPs
Improvements (see the sketch below):
• Reorder the DO LOOPs so the innermost (fastest-running) loop drives the leftmost subscript, or
• Rearrange the dimensions of the array: GFF2R(NI, NKR, NK, ICEMAX) -> GFF2R(ICEMAX, NKR, NK, NI), with the innermost (fastest) running subscript first and the outermost (slowest) last
Sequential coding practice
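A minimal sketch of the corrected ordering; the array name follows the GFF2R example above, but the dimension sizes are invented for illustration:

program loop_order
  ! Fortran arrays are column-major, so after rearranging the dimensions the
  ! leftmost subscript is driven by the innermost (fastest) loop and the
  ! traversal becomes stride-1 in memory.
  implicit none
  integer, parameter :: icemax = 3, nkr = 33, nk = 40, ni = 200   ! illustrative sizes
  real(kind=8), allocatable :: gff2r(:,:,:,:)
  integer :: i, k, kr, ice

  allocate(gff2r(icemax, nkr, nk, ni))     ! fastest-varying subscript first

  do i = 1, ni                 ! outermost loop over the slowest subscript
     do k = 1, nk
        do kr = 1, nkr
           do ice = 1, icemax  ! innermost loop over the first subscript
              gff2r(ice, kr, k, i) = 1.0d0
           end do
        end do
     end do
  end do

  print *, 'sum =', sum(gff2r)
end program loop_order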
Job Scheduling
Current:
• Manual batch runs – hard to track, no monitoring or control
Improvements:
• Batch scheduling / parameter sweep (e.g. shell scripts, NIMROD)
• EASY/MAUI backfilling job schedulers
Parallel coding practice
Load balancing
Current:
• Administrative – manual (and rough) load balancing: Haim
• MPI, PVM, … libraries – no load-balancing capabilities of their own; it is software dependent:
  • RAMS – variable grid-point area
  • MM5, MPP – ?
  • WRF – ?
• File system – NFS, a disaster!: client-side caching, no segmented file locks, network congestion
Improvements:
• MOSIX – kernel-level governing, better monitoring of jobs, no stray (defunct) residues
• MOPI – DFSA (not PVFS, and definitely not NFS)
Parallel coding practice
NFS – client-side cache: every node has a non-concurrent mirror of the image
• Write – two writes to the same location may crash the system
• Read – old data may be read
Parallel I/O – Local / MOPI (MOSIX Parallel I/O System)
• Local disk – nearly local-bus rate
• MOPI – cannot perform better than the network communication rate
Parallel coding practice
Parallel I/O – Local / MOPI
• Local – can be adapted with minor changes in the source code
• MOPI – needs installation, but requires no changes in the source code
Converting sequential to parallel – an easy 5-step method
1. Hotspot identification
2. Partition
3. Communication
4. Agglomeration
5. Mapping
Parallel coding practice
Parallelizing should be done methodically, in a clean, accurate and meticulous way. However intuitive parallel programming may seem, it does not always lend itself to straightforward, automatic, mechanical methods. One approach is the methodical approach (Ian Foster): it maximizes the potential for parallelization and provides efficient steps that exploit this potential. Furthermore, it provides explicit checklists for the completion of each step (not detailed here). Parallel coding practice
5-step hotspots
Identify the hotspots – the parts of the program that consume the most run time. The goal is to know which code segments can and should be parallelized.
Why? For example, greatly improving code that consumes 10% of the run time can raise performance by at most about 10%, whereas optimizing code that consumes 90% of the run time may enable an order-of-magnitude speedup.
How?
• Algorithm inspection (in theory)
• By looking at the code
• By profiling (tools such as prof or another 3rd-party profiler) to identify bottlenecks; a crude manual timing also works, as in the sketch below
Parallel coding practice
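Where a profiler is not at hand, a crude manual check is possible by timing candidate sections with the cpu_time intrinsic; the loop below is just a stand-in for a real hotspot and the whole program is an illustrative sketch:

program time_hotspot
  implicit none
  real :: t0, t1, t2
  real(kind=8) :: s
  integer :: i

  call cpu_time(t0)

  s = 0.0d0
  do i = 1, 50000000                 ! candidate hotspot A
     s = s + sqrt(real(i, kind=8))
  end do
  call cpu_time(t1)

  print *, 'checksum:', s            ! candidate hotspot B (I/O); also keeps s live
  call cpu_time(t2)

  print *, 'section A:', t1 - t0, ' s   section B:', t2 - t1, ' s'
end program time_hotspot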
5-step partition1 Definition: The ratio between computation and communication is known as granularity Parallel coding practice
5-step partition2
Goal: partition the work into the most fine-grained tasks possible.
Why? We want to discover all the available opportunities for parallel execution and to keep flexibility for the following steps (communication, memory and other requirements will dictate the optimal agglomeration and mapping).
How? Functional parallelism or data parallelism:
• Data decomposition – sometimes it's easier to start by partitioning the data into segments which are not mutually dependent (see the sketch below)
Parallel coding practice
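A minimal sketch of data decomposition: splitting a 1-D range of grid columns as evenly as possible across worker tasks; the column and task counts are invented for illustration:

program domain_split
  implicit none
  integer, parameter :: ncols = 1003, ntasks = 8
  integer :: rank, lo, hi, base, extra

  base  = ncols / ntasks            ! columns every task gets
  extra = mod(ncols, ntasks)        ! leftover columns handed to the first tasks

  do rank = 0, ntasks - 1
     lo = rank * base + min(rank, extra) + 1
     hi = lo + base - 1
     if (rank < extra) hi = hi + 1  ! the first 'extra' tasks take one more column
     print '(a,i3,a,i5,a,i5)', 'task ', rank, ' owns columns ', lo, ' .. ', hi
  end do
end program domain_split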
5-step partition3 Parallel coding practice
5-step partition4
Goal: partition the work into the most fine-grained tasks possible.
Why? We want to discover all the available opportunities for parallel execution and to keep flexibility for the following steps (communication, memory and other requirements will dictate the optimal agglomeration and mapping).
How? Functional parallelism or data parallelism:
• Functional decomposition – partitioning the calculation into segments which are not mutually dependent (e.g. integration components are evaluated before the integration step)
Parallel coding practice
5-step partition5 Parallel coding practice
5-step communication1
• Communication occurs during data passing and synchronization. We strive to minimize data communication between tasks, or to make the transfers more coarse-grained
• Sometimes the master process encounters too much incoming traffic: if large data chunks must be transferred, try to form hierarchies when aggregating the data
• The most efficient granularity depends on the algorithm and on the hardware environment in which it runs
• Data decomposition has a crucial role here; consider revisiting step 2
Parallel coding practice
5-step communication2
Sending data out to sub-tasks (see the MPI sketch below):
• Point-to-point is best for sending personalized data to each independent task
• Broadcast is a good way to clog the network (all processors update the data, then need to send it back to the master), but it is useful when a large computation can be performed once and lookup tables sent across the network
• Collection (reduction) is usually used to perform mathematics such as min, max, sum, …
• Shared-memory systems synchronize using memory-locking techniques
• Distributed-memory systems may use blocking or non-blocking message passing; blocking message passing may be used for synchronization
Parallel coding practice
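A hedged sketch of the three patterns using MPI's standard Fortran interface (MPI_Send/MPI_Recv, MPI_Bcast, MPI_Reduce); the buffer sizes and contents are invented for illustration:

program comm_patterns
  use mpi
  implicit none
  integer :: ierr, rank, nprocs, i
  real(kind=8) :: mydata(100), table(10), partial, total

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! Broadcast: rank 0 computes a lookup table once, everyone receives a copy.
  if (rank == 0) table = (/ (real(i, kind=8), i = 1, 10) /)
  call MPI_Bcast(table, 10, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierr)

  ! Point-to-point: rank 0 sends a personalized chunk to each worker.
  if (rank == 0) then
     do i = 1, nprocs - 1
        mydata = real(i, kind=8)
        call MPI_Send(mydata, 100, MPI_DOUBLE_PRECISION, i, 0, MPI_COMM_WORLD, ierr)
     end do
     mydata = 0.0d0
  else
     call MPI_Recv(mydata, 100, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, &
                   MPI_STATUS_IGNORE, ierr)
  end if

  ! Collection (reduction): combine per-task partial sums on rank 0.
  partial = sum(mydata)
  call MPI_Reduce(partial, total, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, &
                  MPI_COMM_WORLD, ierr)
  if (rank == 0) print *, 'global sum =', total

  call MPI_Finalize(ierr)
end program comm_patterns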