More on Parallel Computing
Spring Semester 2005
Geoffrey Fox
Community Grids Laboratory, Indiana University
505 N Morton, Suite 224, Bloomington IN
gcf@indiana.edu
What is Parallel Architecture?
• A parallel computer is any old collection of processing elements that cooperate to solve large problems fast
  • from a pile of PCs to a shared-memory multiprocessor
• Some broad issues:
  • Resource allocation:
    • how large a collection?
    • how powerful are the elements?
    • how much memory?
  • Data access, communication and synchronization:
    • how do the elements cooperate and communicate?
    • how are data transmitted between processors?
    • what are the abstractions and primitives for cooperation?
  • Performance and scalability:
    • how does it all translate into performance?
    • how does it scale?
Parallel Computers -- Classic Overview
• Parallel computers allow several CPUs to contribute to a computation simultaneously.
• For our purposes, a parallel computer has three types of parts:
  • Processors
  • Memory modules
  • Communication / synchronization network
• Key points:
  • All processors must be busy for peak speed.
  • Local memory is directly connected to each processor.
  • Accessing local memory is much faster than other memory.
  • Synchronization is expensive, but necessary for correctness.
Distributed Memory Machines
• Every processor has a memory others can't access.
• Advantages:
  • Relatively easy to design and build
  • Predictable behavior
  • Can be scalable
  • Can hide latency of communication
• Disadvantages:
  • Hard to program
  • Program and O/S (and sometimes data) must be replicated
Communication on Distributed Memory Architecture
• On distributed memory machines, each chunk of decomposed data resides in a separate memory space -- a processor is typically responsible both for storing and for processing its data (the owner-computes rule)
• Information needed on the edges for an update must be communicated via explicitly generated messages
Distributed Memory Machines -- Notes
• Conceptually, the nCUBE, CM-5, Paragon, SP-2, Beowulf PC cluster and BlueGene are quite similar.
  • Bandwidth and latency of the interconnects differ
  • The network topology is a two-dimensional torus for the Paragon, a three-dimensional torus for BlueGene, a fat tree for the CM-5, a hypercube for the nCUBE and a switch for the SP-2
• To program these machines:
  • Divide the problem to minimize the number of messages while retaining parallelism
  • Convert all references to global structures into references to local pieces (explicit messages convert distant to local variables)
  • Optimization: pack messages together to reduce fixed overhead (almost always needed)
  • Optimization: carefully schedule messages (usually done by a library)
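A minimal sketch of this programming style, assuming MPI as the message-passing library; the one-dimensional block decomposition and the size NLOCAL are invented for illustration, not taken from the slides. Each processor stores only its local piece plus two ghost cells, and the edge values it needs arrive through explicit messages.

  /* Hypothetical example: each rank owns a 1D block of a global array plus
     two ghost cells, and exchanges its edge values with its left and right
     neighbours before doing a purely local update. */
  #include <mpi.h>
  #include <stdio.h>

  #define NLOCAL 100                   /* points owned by each processor */

  int main(int argc, char **argv) {
      int rank, size;
      double u[NLOCAL + 2];            /* u[0] and u[NLOCAL+1] are ghost cells */

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      for (int i = 1; i <= NLOCAL; i++) u[i] = rank;   /* stand-in local data */

      int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
      int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

      /* send my first owned point left, receive the right neighbour's edge */
      MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                   &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      /* send my last owned point right, receive the left neighbour's edge */
      MPI_Sendrecv(&u[NLOCAL], 1, MPI_DOUBLE, right, 0,
                   &u[0], 1, MPI_DOUBLE, left, 0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      /* u[0..NLOCAL+1] now holds everything needed for a local edge update */
      MPI_Finalize();
      return 0;
  }

Packing whole boundary arrays into one message, rather than sending single values, is exactly the fixed-overhead optimization noted in the bullets above.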
BlueGene/L has Classic Architecture
• The 32,768-node BlueGene/L took the #1 TOP500 position on 29 September 2004 at 70.7 Teraflops
BlueGene/L Fundamentals
• Low-complexity nodes give more flops per transistor and per watt
• The 3D interconnect supports many scientific simulations, as nature as we see it is 3D
1987 MPP: 1024-node full system with hypercube interconnect
Shared-Memory Machines
• All processors access the same memory.
• Advantages:
  • Retain sequential programming languages such as Java or Fortran
  • Easy to program (correctly)
  • Can share code and data among processors
• Disadvantages:
  • Hard to program (optimally)
  • Not scalable due to bandwidth limitations in the bus
Communication on Shared-Memory Architecture
• On a shared memory machine, a CPU is responsible for processing a decomposed chunk of data but not for storing it
• The nature of the parallelism is identical to that for distributed memory machines, but communication is implicit as one "just" accesses memory
Shared-Memory Machines -- Notes
• The interconnection network varies from machine to machine
• These machines share data by direct access.
  • Potentially conflicting accesses must be protected by synchronization.
  • Simultaneous access to the same memory bank will cause contention, degrading performance.
  • Some access patterns will collide in the network (or bus), causing contention.
  • Many machines have caches at the processors.
  • All these features make it profitable to have each processor concentrate on one area of memory that others access infrequently.
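A small illustration of the synchronization point above, assuming POSIX threads; the shared counter is made up for the example. Without the mutex, concurrent increments can interleave and updates are lost.

  /* Illustrative example: two threads updating one made-up shared counter.
     The lock serializes the conflicting accesses so no increment is lost. */
  #include <pthread.h>
  #include <stdio.h>

  static long counter = 0;
  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

  static void *worker(void *arg) {
      (void)arg;
      for (int i = 0; i < 1000000; i++) {
          pthread_mutex_lock(&lock);     /* protect the shared variable */
          counter++;
          pthread_mutex_unlock(&lock);
      }
      return NULL;
  }

  int main(void) {
      pthread_t t1, t2;
      pthread_create(&t1, NULL, worker, NULL);
      pthread_create(&t2, NULL, worker, NULL);
      pthread_join(t1, NULL);
      pthread_join(t2, NULL);
      printf("counter = %ld\n", counter);  /* always 2000000 with the lock */
      return 0;
  }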
Distributed Shared Memory Machines
• Combining the (dis)advantages of shared and distributed memory
• Lots of hierarchical designs
  • Typically, "shared memory nodes" with 4 to 32 processors
  • Each processor has a local cache
  • Processors within a node access shared memory
  • Nodes can get data from or put data to other nodes' memories
Summary on Communication etc.
• Distributed shared memory machines have the communication features of both distributed (messages) and shared (memory access) architectures
• Note that for distributed memory, the programming model must express data location (the HPF Distribute command) and the invocation of messages (MPI syntax)
• For shared memory, one needs to express control (openMP) or processing parallelism and synchronization -- one must make certain that when a variable is updated, the "correct" version is seen by other processors accessing this variable, and that values living in caches are updated
Seismic Simulation of Los Angeles Basin
• This is a (sophisticated) wave equation similar to the Laplace example: you divide Los Angeles geometrically and assign a roughly equal number of grid points to each processor
Communication Must be Reduced
• 4 by 4 regions in each processor
  • 16 green (compute) and 16 red (communicate) points
• 8 by 8 regions in each processor
  • 64 green and "just" 32 red points
• Communication is an edge effect
  • Give each processor plenty of memory and increase the region in each machine
• Large problems parallelize best
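The edge effect can be checked with a little arithmetic; the sketch below (block sizes chosen arbitrarily) counts, for an n by n region with a nearest-neighbour stencil, the n*n compute points against the 4n edge points that must be communicated, reproducing the 16/16 and 64/32 figures above.

  /* Illustrative arithmetic only: compute points grow as n*n, communicated
     edge points grow as 4*n, so the ratio falls as the region grows. */
  #include <stdio.h>

  int main(void) {
      int sizes[] = {4, 8, 16, 32};
      for (int i = 0; i < 4; i++) {
          int n = sizes[i];
          int compute = n * n;        /* "green" interior work         */
          int communicate = 4 * n;    /* "red" edge values to exchange */
          printf("%2d x %2d block: %4d compute, %3d communicate, ratio %.2f\n",
                 n, n, compute, communicate, (double)communicate / compute);
      }
      return 0;
  }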
Irregular 2D Simulation -- Flow over an Airfoil
• The Laplace grid points become finite element mesh nodal points, arranged as triangles filling space
• All the action (triangles) is near the wing boundary
• Use domain decomposition, but with equal triangle counts rather than equal areas
Heterogeneous Problems
• Simulation of a cosmological cluster (say 10 million stars)
• Lots of work per star where stars are very close together (may need a smaller time step)
• Little work per star where the force changes slowly and can be well approximated by a low-order multipole expansion
Load Balancing Particle Dynamics
• Particle dynamics of this type (irregular, with sophisticated force calculations) always needs complicated decompositions
• Equal-area decompositions, as shown here, lead to load imbalance
• If simpler algorithms are used (full O(N^2) forces, or an FFT), then equal area is best
(Figure: equal-volume decomposition of a universe simulation on 16 processors; each point is a galaxy or star or ...)
Reduce Communication
(Figures: block decomposition, top; cyclic decomposition, bottom)
• Consider a geometric problem with 4 processors
• In the top (block) decomposition, we divide the domain into 4 blocks with all points in a given block contiguous
• In the bottom (cyclic) decomposition, we give each processor the same amount of work but divided into 4 separate domains
• edge/area(bottom) = 2 * edge/area(top)
• So minimizing communication implies we keep the points in a given processor together
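A throwaway sketch of the two index mappings (N = 16 points and P = 4 processors are chosen only for illustration): the block mapping keeps neighbouring points on the same processor, while the cyclic mapping scatters every neighbour onto a different processor.

  /* Assumed sketch: map a global index 0..N-1 onto P processors two ways. */
  #include <stdio.h>

  #define N 16
  #define P 4

  int block_owner(int i)  { return i / (N / P); }  /* contiguous blocks  */
  int cyclic_owner(int i) { return i % P; }        /* round-robin points */

  int main(void) {
      printf("index : block : cyclic\n");
      for (int i = 0; i < N; i++)
          printf("%5d : %5d : %6d\n", i, block_owner(i), cyclic_owner(i));
      return 0;
  }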
Minimize Load Imbalance
(Figures: block decomposition, top; cyclic decomposition, bottom)
• But this has a flip side. Suppose we are decomposing the seismic wave problem and all the action is near a particular earthquake fault (marked in the figure).
• In the top (block) decomposition, only the white processor does any work while the other 3 sit idle.
  • Efficiency is 25% due to load imbalance
• In the bottom (cyclic) decomposition, all the processors do roughly the same work and so we get good load balance.
Parallel Irregular Finite Elements
• Here is a cracked plate; calculating stresses with an equal-area decomposition leads to terrible results
• All the work is near the crack
Irregular Decomposition for Crack
• Concentrating processors near the crack leads to good workload balance
• Equal nodal points -- not equal area -- but, to minimize communication, the nodal points assigned to a particular processor are contiguous
• This is an NP-complete (exponentially hard) optimization problem, but in practice there are many ways of getting good, though not exactly optimal, decompositions
(Figure: region assigned to one processor; the workload is not perfect!)
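One common heuristic of this kind is recursive coordinate bisection; the sketch below is illustrative only (random nodal coordinates and a power-of-two processor count are assumptions), not the decomposition actually used for the crack mesh.

  /* Illustrative sketch of recursive coordinate bisection: repeatedly split
     the nodal points at the median coordinate, alternating x and y, so each
     processor gets an equal count of spatially contiguous points.
     Works for a power-of-two number of processors. */
  #include <stdio.h>
  #include <stdlib.h>

  typedef struct { double x, y; int proc; } Node;

  static int axis;                       /* 0 = split on x, 1 = split on y */
  static int cmp(const void *a, const void *b) {
      const Node *p = a, *q = b;
      double d = axis ? p->y - q->y : p->x - q->x;
      return (d > 0) - (d < 0);
  }

  /* assign processors first..first+nproc-1 to nodes[0..n-1] */
  static void bisect(Node *nodes, int n, int first, int nproc, int depth) {
      if (nproc == 1) {                  /* one processor left: label nodes */
          for (int i = 0; i < n; i++) nodes[i].proc = first;
          return;
      }
      axis = depth % 2;                  /* alternate the splitting axis */
      qsort(nodes, n, sizeof(Node), cmp);
      int half = n / 2;
      bisect(nodes, half, first, nproc / 2, depth + 1);
      bisect(nodes + half, n - half, first + nproc / 2, nproc / 2, depth + 1);
  }

  int main(void) {
      Node nodes[16];
      for (int i = 0; i < 16; i++) {     /* made-up nodal points */
          nodes[i].x = rand() / (double)RAND_MAX;
          nodes[i].y = rand() / (double)RAND_MAX;
      }
      bisect(nodes, 16, 0, 4, 0);        /* 4 processors, equal node counts */
      for (int i = 0; i < 16; i++)
          printf("(%.2f, %.2f) -> processor %d\n",
                 nodes[i].x, nodes[i].y, nodes[i].proc);
      return 0;
  }

Production codes use more sophisticated graph partitioners, but the trade-off is the same: equal work per processor while keeping each processor's points contiguous.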
Further Decomposition Strategies
• Not all decompositions are quite the same
• In defending against missile attacks, you track each missile on a separate node -- geometric again
• In playing chess, you decompose the chess tree -- an abstract, not geometric, space
(Figures: a map captioned "California gets its independence"; a computer chess tree showing the current position as a node, the first set of moves, and the opponent's counter moves)
Summary of Parallel Algorithms
• A parallel algorithm is a collection of tasks and a partial ordering between them.
• Design goals:
  • Match tasks to the available processors (exploit parallelism).
  • Minimize ordering (avoid unnecessary synchronization points).
  • Recognize ways parallelism can be helped by changing the ordering.
• Sources of parallelism:
  • Data parallelism: updating array elements simultaneously.
  • Functional parallelism: conceptually different tasks which combine to solve the problem. This happens at fine and coarse grain size
    • fine is "internal", such as I/O and computation; coarse is "external", such as separate modules linked together
Data Parallelism in Algorithms
• Data-parallel algorithms exploit the parallelism inherent in many large data structures.
  • A problem is an (identical) algorithm applied to multiple points in a data "array"
  • Usually one iterates over such "updates"
• Features of data parallelism:
  • Scalable parallelism -- can often get million-way or greater parallelism
  • Hard to express when the "geometry" is irregular or dynamic
• Note that data-parallel algorithms can be expressed in ALL programming models (message passing, HPF-like, openMP-like)
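A minimal data-parallel sketch, assuming OpenMP and made-up arrays: the identical operation (the array statement B = A1 + A2 written out as a loop) is applied at every point, and each thread takes its own chunk of the iterations.

  /* Illustrative example: the same operation applied to every element of
     large (made-up) arrays, shared out across threads by OpenMP. */
  #include <omp.h>
  #include <stdio.h>

  #define N 1000000

  static double a1[N], a2[N], b[N];

  int main(void) {
      for (int i = 0; i < N; i++) { a1[i] = i; a2[i] = 2.0 * i; }

      #pragma omp parallel for        /* each thread updates its own chunk */
      for (int i = 0; i < N; i++)
          b[i] = a1[i] + a2[i];

      printf("b[N-1] = %.1f\n", b[N - 1]);
      return 0;
  }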
Functional Parallelism in Algorithms
• Functional parallelism exploits the parallelism between the parts of many systems.
  • Many pieces to work on, so many independent operations
  • Example: coarse grain aeroelasticity (aircraft design)
    • CFD (fluids) and CSM (structures) and others (acoustics, electromagnetics etc.) can be evaluated in parallel
• Analysis:
  • Parallelism limited in size -- tens, not millions
  • Synchronization is probably good, as the parallelism is natural from the problem and the usual way of writing software
  • The Web exploits functional parallelism, NOT data parallelism
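A coarse-grain functional-parallelism sketch, assuming POSIX threads; the fluid and structure "modules" are empty stand-ins invented for the example, not real CFD or CSM codes.

  /* Illustrative example: two conceptually different modules run as
     independent threads and are joined before their results are combined. */
  #include <pthread.h>
  #include <stdio.h>

  static void *fluid_module(void *arg) {
      (void)arg;                      /* stand-in for the CFD (fluids) work */
      puts("fluids module finished");
      return NULL;
  }

  static void *structure_module(void *arg) {
      (void)arg;                      /* stand-in for the CSM (structures) work */
      puts("structures module finished");
      return NULL;
  }

  int main(void) {
      pthread_t cfd, csm;
      pthread_create(&cfd, NULL, fluid_module, NULL);
      pthread_create(&csm, NULL, structure_module, NULL);
      pthread_join(cfd, NULL);        /* synchronize: both modules must finish */
      pthread_join(csm, NULL);
      puts("combine the coupled aeroelastic result");
      return 0;
  }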
Pleasingly Parallel Algorithms
• Many applications are what is called (essentially) embarrassingly or, more kindly, pleasingly parallel
• These are made up of independent concurrent components
  • Each client independently accesses a Web server
  • Each roll of the Monte Carlo dice (random number) is an independent sample
  • Each stock can be priced separately in a financial portfolio
  • Each transaction in a database is almost independent (a given account is locked, but usually different accounts are accessed at the same time)
  • Different parts of seismic data can be processed independently
• In contrast, points in a finite difference grid (from a differential equation) canNOT be updated independently
• Such problems are often formally data-parallel but can be handled much more easily -- like functional parallelism
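A pleasingly parallel sketch of the Monte Carlo bullet above, assuming MPI: every rank draws its own independent samples (here estimating pi) with no communication at all until one final reduction.

  /* Illustrative example: independent Monte Carlo samples on every rank,
     combined only once at the end. */
  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv) {
      int rank, size;
      long local_hits = 0, total_hits = 0, samples = 1000000;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      srand(rank + 1);                   /* crude per-rank seeding for the sketch */
      for (long i = 0; i < samples; i++) {
          double x = rand() / (double)RAND_MAX;
          double y = rand() / (double)RAND_MAX;
          if (x * x + y * y <= 1.0) local_hits++;
      }

      /* the only communication: sum the independent counts on rank 0 */
      MPI_Reduce(&local_hits, &total_hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
      if (rank == 0)
          printf("pi ~ %f\n", 4.0 * total_hits / (samples * (double)size));
      MPI_Finalize();
      return 0;
  }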
Parallel Languages
• A parallel language provides an executable notation for implementing a parallel algorithm.
• Design criteria:
  • How are parallel operations defined?
    • static tasks vs. dynamic tasks vs. implicit operations
  • How is data shared between tasks?
    • explicit communication/synchronization vs. shared memory
  • How is the language implemented?
    • low-overhead runtime systems vs. optimizing compilers
• Usually a language reflects a particular style of expressing parallelism.
  • Data parallel expresses the concept of an identical algorithm on different parts of an array
  • Message parallel expresses the fact that, at a low level, parallelism implies information is passed between different concurrently executing program parts
Data-Parallel Languages
• Data-parallel languages provide an abstract, machine-independent model of parallelism.
  • Fine-grain parallel operations, such as element-wise operations on arrays
  • Shared data in large, global arrays with mapping "hints"
  • Implicit synchronization between operations
  • Partially explicit communication from operation definitions
• Advantages:
  • Global operations conceptually simple
  • Easy to program (particularly for certain scientific applications)
• Disadvantages:
  • Unproven compilers
  • Since they express the "problem", they can be inflexible if a new algorithm arises that the language doesn't express well
• Examples: HPF
  • Originated on SIMD machines, where parallel operations are in lock-step, but generalized (not so successfully, as the compilers are too hard) to MIMD
Approaches to Parallel Programming
• Data parallel, typified by CMFortran and its generalization, High Performance Fortran, which in previous years we discussed in detail but this year we will not; see the Source Book for more on HPF
• Typical data-parallel Fortran statements are full array statements
  • B = A1 + A2
  • B = EOSHIFT(A, -1)
  • Function operations on arrays representing the full data domain
• Message passing, typified by the later discussion of the Laplace example, specifies specific machine actions, i.e. send a message between nodes, whereas the data parallel model is at a higher level as it (tries to) specify a problem feature
• Note: we are always using "data parallelism" at the problem level, whether the software is "message passing" or "data parallel"
• Data parallel software is translated by a compiler into "machine language", which is typically message passing on a distributed memory machine and threads on a shared memory machine
Shared Memory Programming Model
• Experts in Java are familiar with this, as it is built into the Java language through thread primitives
• We take "ordinary" languages such as Fortran, C++ or Java and add constructs to help compilers divide processing (automatically) into separate threads
  • indicate which DO/for loop instances can be executed in parallel, and where there are critical sections with global variables etc.
• openMP is a recent set of compiler directives supporting this model
• This model tends to be inefficient on distributed memory machines, as optimizations (data layout, communication blocking etc.) are not natural
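A minimal sketch of these directives, assuming OpenMP in C with a made-up array: the parallel for directive splits the loop instances across threads, and the critical directive protects the update of the shared variable.

  /* Illustrative example: parallel loop plus a critical section guarding a
     shared (global) variable. */
  #include <omp.h>
  #include <stdio.h>

  #define N 1000

  int main(void) {
      double a[N], biggest = -1.0;
      for (int i = 0; i < N; i++) a[i] = (i * 37) % 101;   /* made-up data */

      #pragma omp parallel for           /* loop instances run in parallel   */
      for (int i = 0; i < N; i++) {
          a[i] = a[i] * a[i];
          #pragma omp critical           /* shared variable needs protection */
          if (a[i] > biggest) biggest = a[i];
      }
      printf("biggest = %.1f\n", biggest);
      return 0;
  }

An OpenMP reduction clause would be the more idiomatic way to find the maximum; the critical section is kept here only to show the synchronization construct the bullet mentions.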
Structure (Architecture) of Applications - I
• Applications are metaproblems with a mix of module (aka coarse grain functional) and data parallelism
• Modules are decomposed into parts (data parallelism) and composed hierarchically into full applications. They can be:
  • the "10,000" separate programs (e.g. structures, CFD ..) used in the design of aircraft
  • the various filters used in Adobe Photoshop or the Matlab image processing system
  • the ocean-atmosphere components in an integrated climate simulation
  • the database or file system access of a data-intensive application
  • the objects in a distributed Forces Modeling Event Driven Simulation
Structure (Architecture) of Applications - II
• Modules are "natural" message-parallel components of the problem and tend to have less stringent latency and bandwidth requirements than those needed to link data-parallel components
  • modules are what HPF needs task parallelism for
• Often modules are naturally distributed, whereas parts of a data parallel decomposition may need to be kept on a tightly coupled MPP
• Assume that the primary goal of a metacomputing system is to add, to existing parallel computing environments, a higher level supporting module parallelism
• Now if one takes a large CFD problem and divides it into a few components, those "coarse grain data-parallel components" will be supported by computational grid technology
• Use Java/distributed object technology for modules -- note Java is to a growing extent used to write servers for CORBA and COM object systems
Multi Server Model for Metaproblems
• We have multiple supercomputers in the backend -- one doing a CFD simulation of airflow, another structural analysis -- while in more detail you have linear algebra servers (NetSolve), optimization servers (NEOS), image processing filters (Khoros), databases (NCSA Biology Workbench) and visualization systems (AVS, CAVEs)
• One runs 10,000 separate programs to design a modern aircraft, which must be scheduled and linked .....
• All are linked to collaborative information systems in a sea of middle-tier servers (as on the previous page) to support design, crisis management and multi-disciplinary research
Multi-Server Scenario
(Diagram: a Multidisciplinary Control gateway (WebFlow) linking a parallel database proxy and database, a NEOS optimization control and optimization service, an Origin 2000 proxy with agent-based choice of compute engine (MPP), a NetSolve linear algebra server with matrix solver, an IBM SP2 proxy, and a data analysis server (MPP))