More on Parallel Computing
Spring Semester 2005
Geoffrey Fox
Community Grids Laboratory, Indiana University
505 N Morton, Suite 224, Bloomington IN
gcf@indiana.edu
What is Parallel Architecture?
• A parallel computer is a collection of processing elements that cooperate to solve large problems fast
  • anything from a pile of PCs to a shared-memory multiprocessor
• Some broad issues:
  • Resource allocation:
    • how large a collection?
    • how powerful are the elements?
    • how much memory?
  • Data access, communication and synchronization:
    • how do the elements cooperate and communicate?
    • how are data transmitted between processors?
    • what are the abstractions and primitives for cooperation?
  • Performance and scalability:
    • how does it all translate into performance?
    • how does it scale?
Parallel Computers -- Classic Overview
• Parallel computers allow several CPUs to contribute to a computation simultaneously.
• For our purposes, a parallel computer has three types of parts:
  • Processors
  • Memory modules
  • Communication / synchronization network
• Key points:
  • All processors must be busy for peak speed.
  • Local memory is directly connected to each processor.
  • Accessing local memory is much faster than accessing other memory.
  • Synchronization is expensive, but necessary for correctness.
(Figure: key to the colors used in the following pictures)
Distributed Memory Machines
• Every processor has a memory that other processors cannot access.
• Advantages:
  • Relatively easy to design and build
  • Predictable behavior
  • Can be scalable
  • Can hide the latency of communication
• Disadvantages:
  • Hard to program
  • Program and O/S (and sometimes data) must be replicated
Communication on Distributed Memory Architecture
• On distributed memory machines, each chunk of decomposed data resides in a separate memory space -- a single processor is responsible for storing and processing it (the owner-computes rule)
• Information needed on the edges for an update must be communicated via explicitly generated messages (a minimal message-exchange sketch follows)
(Figure: decomposed grid with messages exchanged along the edges between processors)
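To make the edge exchange concrete, here is a minimal sketch in C with MPI of a 1-D ghost-row exchange. The decomposition (a strip of NY rows per rank, with one ghost row above and below) and all sizes and names are illustrative assumptions, not part of the original slides.

```c
/* Minimal 1-D halo (ghost-row) exchange sketch: each rank owns NY rows of
 * an NX-wide grid and keeps one ghost row above and below for neighbors. */
#include <mpi.h>
#include <stdio.h>

#define NX 64
#define NY 16   /* rows owned by this rank (illustrative sizes) */

int main(int argc, char **argv) {
    double u[NY + 2][NX];            /* rows 1..NY are owned; 0 and NY+1 are ghosts */
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    for (int j = 0; j < NY + 2; j++)
        for (int i = 0; i < NX; i++)
            u[j][i] = rank;          /* dummy initial data */

    /* Exchange boundary rows: send my first owned row up and receive the
     * lower neighbor's row into my bottom ghost row, and vice versa. */
    MPI_Sendrecv(u[1],      NX, MPI_DOUBLE, up,   0,
                 u[NY + 1], NX, MPI_DOUBLE, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(u[NY],     NX, MPI_DOUBLE, down, 1,
                 u[0],      NX, MPI_DOUBLE, up,   1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* An interior update (e.g. a Laplace/Jacobi step) would now read the ghost rows. */
    if (rank == 0) printf("halo exchange complete on %d ranks\n", size);
    MPI_Finalize();
    return 0;
}
```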
Distributed Memory Machines -- Notes
• Conceptually, the nCUBE, CM-5, Paragon, SP-2, Beowulf PC cluster and BlueGene are quite similar.
  • Bandwidth and latency of the interconnects differ
  • The network topology is a two-dimensional torus for the Paragon, a three-dimensional torus for BlueGene, a fat tree for the CM-5, a hypercube for the nCUBE and a switch for the SP-2
• To program these machines:
  • Divide the problem to minimize the number of messages while retaining parallelism
  • Convert all references to global structures into references to local pieces (explicit messages convert distant variables into local ones)
  • Optimization: pack messages together to reduce the fixed per-message overhead (almost always needed; see the packing sketch below)
  • Optimization: carefully schedule messages (usually done by the library)
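As an illustration of the message-packing optimization, the sketch below gathers a non-contiguous boundary column into one buffer and sends it as a single message rather than one message per element. The helper name and array layout are assumptions for illustration only.

```c
/* Pack a non-contiguous boundary column into one buffer so it travels as a
 * single message (one fixed overhead) rather than NY separate messages. */
#include <mpi.h>

#define NX 64
#define NY 64

/* Hypothetical helper: send the right-hand boundary column of 'u' to rank 'dest'. */
void send_right_boundary(double u[NY][NX], int dest, MPI_Comm comm) {
    double buf[NY];
    for (int j = 0; j < NY; j++)
        buf[j] = u[j][NX - 1];                      /* gather the strided column */
    MPI_Send(buf, NY, MPI_DOUBLE, dest, 0, comm);   /* one message, not NY */
}
```

MPI derived datatypes (e.g. MPI_Type_vector) can describe the same strided column and let the library perform the packing instead.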
BlueGene/L has Classic Architecture
• The 32768-node BlueGene/L took the #1 TOP500 position on 29 September 2004, at 70.7 Teraflops
BlueGene/L Fundamentals
• Low-complexity nodes give more flops per transistor and per watt
• The 3D interconnect supports many scientific simulations, as nature as we see it is 3D
(Figure: a 1987 MPP -- a 1024-node full system with hypercube interconnect)
Shared-Memory Machines
• All processors access the same memory.
• Advantages:
  • Retain sequential programming languages such as Java or Fortran
  • Easy to program (correctly)
  • Can share code and data among processors
• Disadvantages:
  • Hard to program (optimally)
  • Not scalable, due to bandwidth limitations in the bus
Communication on Shared Memory Architecture
• On a shared memory machine, a CPU is responsible for processing a decomposed chunk of data but not for storing it
• The nature of the parallelism is identical to that on distributed memory machines, but communication is implicit: a processor "just" accesses memory (a shared-memory sketch follows)
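For contrast with the explicit messages above, here is a minimal OpenMP sketch in C of the same kind of grid update on shared memory: threads split the rows among themselves, and neighbor values are obtained simply by reading the shared array. The grid size and function name are illustrative assumptions.

```c
/* Shared-memory version of a grid update: communication is implicit --
 * each thread just reads its neighbors' rows directly from the shared array. */
#include <omp.h>

#define N 512

void jacobi_step(double u[N][N], double unew[N][N]) {
    /* The loop iterations are divided among threads; no messages are needed,
     * since every thread can read any element of the shared arrays. */
    #pragma omp parallel for
    for (int j = 1; j < N - 1; j++)
        for (int i = 1; i < N - 1; i++)
            unew[j][i] = 0.25 * (u[j - 1][i] + u[j + 1][i] +
                                 u[j][i - 1] + u[j][i + 1]);
}
```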
Shared-Memory Machines -- Notes
• The interconnection network varies from machine to machine
• These machines share data by direct access.
  • Potentially conflicting accesses must be protected by synchronization.
  • Simultaneous access to the same memory bank will cause contention, degrading performance.
  • Some access patterns will collide in the network (or bus), causing contention.
  • Many machines have caches at the processors.
  • All these features make it profitable to have each processor concentrate on one area of memory that others access infrequently.
Distributed Shared Memory Machines
• Combining the (dis)advantages of shared and distributed memory
• Lots of hierarchical designs.
  • Typically, "shared memory nodes" with 4 to 32 processors
  • Each processor has a local cache
  • Processors within a node access shared memory
  • Nodes can get data from or put data to other nodes' memories
Summary on Communication etc.
• Distributed shared memory machines have the communication features of both distributed (messages) and shared (memory access) architectures
• Note that for distributed memory, the programming model must express data location (the HPF Distribute command) and the invocation of messages (MPI syntax)
• For shared memory, one needs to express control (openMP) or processing parallelism and synchronization -- one must make certain that when a variable is updated, the "correct" version is used by other processors accessing it, and that values living in caches are updated
Seismic Simulation of Los Angeles Basin
• This is a (sophisticated) wave equation, similar to the Laplace example; you divide the Los Angeles Basin geometrically and assign roughly equal numbers of grid points to each processor
Communication Must be Reduced
• 4 by 4 regions in each processor
  • 16 Green (Compute) and 16 Red (Communicate) points
• 8 by 8 regions in each processor
  • 64 Green and "just" 32 Red points
• Communication is an edge effect (the ratio is worked out below)
  • Give each processor plenty of memory and increase the region in each machine
• Large problems parallelize best
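The edge-effect argument can be made quantitative. Assuming a two-dimensional five-point stencil with an n by n block of grid points per processor (the situation the slide's numbers describe), there are n^2 points to compute and 4n boundary points to communicate, so

\[
\frac{\text{communication}}{\text{computation}} \;=\; \frac{4n}{n^{2}} \;=\; \frac{4}{n},
\qquad n=4:\ \tfrac{16}{16}=1, \qquad n=8:\ \tfrac{32}{64}=\tfrac{1}{2}.
\]

Doubling the block size per processor halves the relative communication cost, which is why large problems parallelize best.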
Irregular 2D Simulation -- Flow over an Airfoil
• The Laplace grid points become finite element mesh nodal points, arranged as triangles filling space
• All the action (small triangles) is near the wing boundary
• Use domain decomposition, but no longer equal area -- rather equal triangle count
Heterogeneous Problems
• Simulation of a cosmological cluster (say 10 million stars)
• Lots of work per star where stars are very close together (may need a smaller time step)
• Little work per star where the force changes slowly and can be well approximated by a low-order multipole expansion
Load Balancing Particle Dynamics
• Particle dynamics of this type (irregular, with sophisticated force calculations) always needs complicated decompositions
• Equal-area decompositions, as shown here, lead to load imbalance
• If you use simpler algorithms (full O(N²) forces) or an FFT, then equal area is best
(Figure: equal-volume decomposition of a universe simulation onto 16 processors; each point is a galaxy or star or ...)
Reduce Communication
• Consider a geometric problem with 4 processors
• In the top (block) decomposition, we divide the domain into 4 blocks, with all points in a given block contiguous
• In the bottom (cyclic) decomposition, we give each processor the same amount of work, but divided into 4 separate domains
• edge/area(bottom) = 2 * edge/area(top)
• So minimizing communication implies we keep the points in a given processor together (the two index mappings are sketched below)
(Figures: block decomposition on top, cyclic decomposition on bottom)
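The two decompositions correspond to two different owner functions mapping a grid index to a processor. The one-dimensional sketch below is an illustrative simplification, not from the slides; in two dimensions the same idea is applied to rows, columns or blocks of the grid.

```c
/* Which processor owns element i of an array of length n, for P processors?
 * Block decomposition keeps contiguous chunks together (less edge communication);
 * cyclic decomposition deals elements out like cards (better load balance when
 * the work is concentrated in one part of the domain). */
int owner_block(int i, int n, int P)  { return i / ((n + P - 1) / P); }
int owner_cyclic(int i, int n, int P) { (void)n; return i % P; }
```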
Minimize Load Imbalance
• But this has a flip side. Suppose we are decomposing the seismic wave problem and all the action is near a particular earthquake fault (the marked region in the figure).
• In the top (block) decomposition, only the white processor does any work while the other 3 sit idle.
  • Efficiency is 25% due to load imbalance
• In the bottom (cyclic) decomposition, all the processors do roughly the same work and so we get good load balance
(Figures: block decomposition on top, cyclic decomposition on bottom)
Parallel Irregular Finite Elements
• Here is a cracked plate; calculating stresses with an equal-area decomposition leads to terrible results
• All the work is near the crack
(Figure: mesh of the cracked plate, one color per processor region)
Irregular Decomposition for Crack
• Concentrating processors near the crack leads to good workload balance
• Equal nodal points per processor -- not equal area -- but, to minimize communication, the nodal points assigned to a particular processor are contiguous
• This is an NP-complete (exponentially hard) optimization problem, but in practice there are many ways of getting good, though not exact, decompositions (one common heuristic is sketched below)
(Figure: region assigned to one processor; the workload is not perfect!)
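One widely used heuristic for this kind of partitioning (not named on the slide, so an illustrative assumption) is recursive coordinate bisection: repeatedly split the mesh points into two equal-count halves along a coordinate direction. A compact sketch in C:

```c
/* Recursive coordinate bisection: assign 2^levels spatially contiguous regions,
 * each with a (nearly) equal number of mesh points, to processors. */
#include <stdlib.h>

typedef struct { double x, y; int proc; } Point;

static int axis;                       /* coordinate currently being split on */
static int cmp(const void *a, const void *b) {
    const Point *p = a, *q = b;
    double d = axis ? p->y - q->y : p->x - q->x;
    return (d > 0) - (d < 0);
}

/* Split points[0..n) into 2^levels pieces; label each piece with a processor id. */
void rcb(Point *points, int n, int levels, int first_proc) {
    if (levels == 0 || n <= 1) {
        for (int i = 0; i < n; i++) points[i].proc = first_proc;
        return;
    }
    axis = levels % 2;                 /* alternate axes (a simple stand-in for
                                          splitting along the longest extent)   */
    qsort(points, n, sizeof(Point), cmp);
    int half = n / 2;                  /* equal point counts, not equal area */
    rcb(points,        half,     levels - 1, first_proc);
    rcb(points + half, n - half, levels - 1, first_proc + (1 << (levels - 1)));
}
```

Each leaf of the recursion is a contiguous region with roughly equal point count, which is the equal-nodal-point decomposition described on the slide; production codes typically rely on graph-partitioning libraries for the same purpose.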
Further Decomposition Strategies
• Not all decompositions are quite the same
  • In defending against missile attacks, you track each missile on a separate node -- geometric again
  • In playing chess, you decompose the chess tree -- an abstract, not geometric, space
(Figures: "California gets its independence" cartoon; a computer chess tree showing the current position (a node in the tree), the first set of moves, and the opponent's counter moves)
Summary of Parallel Algorithms
• A parallel algorithm is a collection of tasks and a partial ordering between them.
• Design goals:
  • Match tasks to the available processors (exploit parallelism).
  • Minimize ordering (avoid unnecessary synchronization points).
  • Recognize ways parallelism can be helped by changing the ordering.
• Sources of parallelism:
  • Data parallelism: updating array elements simultaneously.
  • Functional parallelism: conceptually different tasks which combine to solve the problem. This happens at both fine and coarse grain size
    • fine is "internal", such as overlapping I/O and computation; coarse is "external", such as separate modules linked together
Data Parallelism in Algorithms
• Data-parallel algorithms exploit the parallelism inherent in many large data structures.
  • A problem is an (identical) algorithm applied to multiple points in a data "array"
  • Usually one iterates over such "updates"
• Features of data parallelism:
  • Scalable parallelism -- can often get a million-way or more parallelism
  • Hard to express when the "geometry" is irregular or dynamic
• Note that data-parallel algorithms can be expressed in ALL programming models (Message Passing, HPF-like, openMP-like)
Functional Parallelism in Algorithms
• Functional parallelism exploits the parallelism between the parts of many systems.
  • Many pieces to work on, so many independent operations
  • Example: coarse grain aeroelasticity (aircraft design)
    • CFD (fluids), CSM (structures) and others (acoustics, electromagnetics etc.) can be evaluated in parallel
• Analysis:
  • Parallelism limited in size -- tens, not millions
  • Synchronization is probably good, as the parallelism is natural from the problem and from the usual way of writing software
  • The Web exploits functional parallelism, NOT data parallelism
Pleasingly Parallel Algorithms
• Many applications are what is called (essentially) embarrassingly or, more kindly, pleasingly parallel
• These are made up of independent concurrent components
  • Each client independently accesses a Web server
  • Each roll of a Monte Carlo dice (random number) is an independent sample
  • Each stock can be priced separately in a financial portfolio
  • Each transaction in a database is almost independent (a given account is locked, but usually different accounts are accessed at the same time)
  • Different parts of seismic data can be processed independently
• In contrast, the points in a finite difference grid (from a differential equation) canNOT be updated independently
• Such problems are often formally data-parallel but can be handled much more easily -- like functional parallelism (see the Monte Carlo sketch below)
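As an illustration of a pleasingly parallel computation, here is a minimal Monte Carlo sketch in C with OpenMP: every random sample is independent, so the only cooperation needed is summing the per-thread counts at the end. The pi-estimation example, sample count and seeds are illustrative assumptions, not from the slides.

```c
/* Pleasingly parallel Monte Carlo: estimate pi by throwing random points at
 * the unit square; each sample is independent, only the final sum is shared. */
#define _POSIX_C_SOURCE 200112L   /* for the POSIX rand_r generator */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const long samples = 10000000;
    long hits = 0;

    #pragma omp parallel reduction(+:hits)
    {
        /* Give each thread its own random stream via a private seed. */
        unsigned int seed = 12345u + 17u * (unsigned int)omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < samples; i++) {
            double x = rand_r(&seed) / (double)RAND_MAX;
            double y = rand_r(&seed) / (double)RAND_MAX;
            if (x * x + y * y <= 1.0) hits++;   /* each sample is independent */
        }
    }
    printf("pi is approximately %f\n", 4.0 * hits / samples);
    return 0;
}
```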
Parallel Languages
• A parallel language provides an executable notation for implementing a parallel algorithm.
• Design criteria:
  • How are parallel operations defined?
    • static tasks vs. dynamic tasks vs. implicit operations
  • How is data shared between tasks?
    • explicit communication/synchronization vs. shared memory
  • How is the language implemented?
    • low-overhead runtime systems vs. optimizing compilers
• Usually a language reflects a particular style of expressing parallelism.
  • Data parallel expresses the concept of an identical algorithm on different parts of an array
  • Message parallel expresses the fact that, at a low level, parallelism implies information is passed between different concurrently executing program parts
Data-Parallel Languages
• Data-parallel languages provide an abstract, machine-independent model of parallelism.
  • Fine-grain parallel operations, such as element-wise operations on arrays
  • Shared data in large, global arrays with mapping "hints"
  • Implicit synchronization between operations
  • Partially explicit communication from the operation definitions
• Advantages:
  • Global operations are conceptually simple
  • Easy to program (particularly for certain scientific applications)
• Disadvantages:
  • Unproven compilers
  • Because they express the "problem", they can be inflexible if a new algorithm arises which the language does not express well
• Examples: HPF
• Originated on SIMD machines, where parallel operations run in lock-step, and was generalized (not so successfully, as the compilers are too hard) to MIMD
Approaches to Parallel Programming
• Data Parallel, typified by CMFortran and its generalization, High Performance Fortran, which in previous years we discussed in detail but this year we will not; see the Source Book for more on HPF
• Typical data parallel Fortran statements are full array statements:
  • B = A1 + A2
  • B = EOSHIFT(A, -1)
  • Function operations on arrays representing the full data domain
• Message Passing, typified by the later discussion of the Laplace example, specifies specific machine actions, i.e. send a message between nodes, whereas the data parallel model is at a higher level as it (tries to) specify a problem feature
• Note: we are always using "data parallelism" at the problem level, whether the software is "message passing" or "data parallel"
• Data parallel software is translated by a compiler into "machine language", which is typically message passing on a distributed memory machine and threads on a shared memory machine
Shared Memory Programming Model
• Experts in Java are familiar with this, as it is built into the Java language through thread primitives
• We take "ordinary" languages such as Fortran, C++ and Java and add constructs to help compilers divide processing (automatically) into separate threads
  • indicate which DO/for loop instances can be executed in parallel, and where there are critical sections with global variables etc. (see the sketch below)
• openMP is a recent set of compiler directives supporting this model
• This model tends to be inefficient on distributed memory machines, as the optimizations (data layout, communication blocking etc.) are not natural
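A minimal example of these directives in C (a sketch, not from the slides): the loop iterations are marked parallel, and the update of shared data is protected either by a reduction clause or by a critical section. Function and variable names are illustrative.

```c
/* OpenMP directives: parallelize loops and protect updates to shared data. */
#include <omp.h>

double dot(const double *a, const double *b, int n) {
    double sum = 0.0;

    /* Reduction: each thread keeps a private partial sum, combined at the end. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}

void tally(const int *flags, int n, int *count) {
    /* Alternative: a critical section serializes the update of a shared
     * variable -- correct, but slower than a reduction if hit frequently. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        if (flags[i]) {
            #pragma omp critical
            (*count)++;
        }
}
```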
Structure (Architecture) of Applications - I
• Applications are metaproblems with a mix of module (aka coarse grain functional) and data parallelism
• Modules are decomposed into parts (data parallelism) and composed hierarchically into full applications. They can be:
  • the "10,000" separate programs (e.g. structures, CFD ...) used in the design of an aircraft
  • the various filters used in Adobe Photoshop or the Matlab image processing system
  • the ocean-atmosphere components in an integrated climate simulation
  • the database or file system access of a data-intensive application
  • the objects in a distributed Forces Modeling Event Driven Simulation
Structure (Architecture) of Applications - II
• Modules are "natural" message-parallel components of the problem and tend to have less stringent latency and bandwidth requirements than those needed to link data-parallel components
  • modules are what HPF needs task parallelism for
• Often modules are naturally distributed, whereas the parts of a data parallel decomposition may need to be kept on a tightly coupled MPP
• Assume that the primary goal of a metacomputing system is to add to existing parallel computing environments a higher level supporting module parallelism
• Now if one takes a large CFD problem and divides it into a few components, those "coarse grain data-parallel components" will be supported by computational grid technology
• Use Java/Distributed Object Technology for modules -- note that Java is to a growing extent used to write servers for CORBA and COM object systems
Multi Server Model for Metaproblems
• We have multiple supercomputers in the backend -- one doing a CFD simulation of airflow, another structural analysis -- while in more detail you have linear algebra servers (Netsolve), optimization servers (NEOS), image processing filters (Khoros), databases (NCSA Biology Workbench) and visualization systems (AVS, CAVEs)
• One runs 10,000 separate programs to design a modern aircraft, and these must be scheduled and linked ...
• All are linked to collaborative information systems in a sea of middle-tier servers (as on the previous page) to support design, crisis management and multi-disciplinary research
Multi-Server Scenario
(Figure: a multidisciplinary control layer (WebFlow) coordinating middle-tier proxies and backend services -- a parallel database behind a database gateway; NEOS optimization control in front of an optimization service; an Origin 2000 proxy with agent-based choice of compute engine for an MPP; a NetSolve linear algebra server in front of a matrix solver; and an IBM SP2 proxy in front of a data analysis server on an MPP)