620 likes | 639 Views
Fall 2009 Jih-Kwon Peir Computer Information Science Engineering University of Florida. CIS 6930: Chip Multiprocessor: Parallel Architecture and Programming. Chapter 10 Parallel Programming and Computational Thinking. Fundamentals of Parallel Computing. Parallel computing requires that
E N D
Fall 2009 • Jih-Kwon Peir • Computer Information Science Engineering • University of Florida CIS 6930: Chip Multiprocessor: Parallel Architecture and Programming
Fundamentals of Parallel Computing • Parallel computing requires that • The problem can be decomposed into sub-problems that can be safely solved at the same time • The programmer structures the code and data to solve these sub-problems concurrently • The goals of parallel computing are • To solve problems in less time, and/or • To solve bigger problems, and/or • To achieve better solutions The problems must be large enough to justify parallel computing and to exhibit exploitable concurrency.
A Recommended Reading • Mattson, Sanders, Massingill, Patterns for Parallel Programming, Addison Wesley, 2005, ISBN 0-321-22811-1. • We draw quite a bit from the book • A good overview of challenges, best practices, and common techniques in all aspects of parallel programming
Key Parallel Programming Steps • To find the concurrency in the problem • To structure the algorithm so that concurrency can be exploited • To implement the algorithm in a suitable programming environment • To execute and tune the performance of the code on a parallel system Unfortunately, these have not been separated into levels of abstractions that can be dealt with independently.
Challenges of Parallel Programming • Finding and exploiting concurrency often requires looking at the problem from a non-obvious angle • Computational thinking (J. Wing) • Dependences need to be identified and managed • The order of task execution may change the answers • Obvious: One step feeds result to the next steps • Subtle: numeric accuracy may be affected by ordering steps that are logically parallel with each other • Performance can be drastically reduced by many factors • Overhead of parallel processing • Load imbalance among processor elements • Inefficient data sharing patterns • Saturation of critical resources such as memory bandwidth
Shared Memory vs. Message Passing • We will focus on shared memory parallel programming • This is what CUDA is based on • Future massively parallel microprocessors are expected to support shared memory at the chip level • The programming considerations of message passing model is quite different! • Look at MPI (Message Passing Interface) and its relatives such as Charm++
Finding Concurrency in Problems • Identify a decomposition of the problem into sub-problems that can be solved simultaneously • A task decomposition that identifies tasks for potential concurrent execution • A data decomposition that identifies data local to each task • A way of grouping tasks and ordering the groups to satisfy temporal constraints • An analysis on the data sharing patterns among the concurrent tasks • A design evaluation that assesses of the quality the choices made in all the steps
Finding Concurrency – The Process Dependence Analysis Decomposition Group Tasks Task Decomposition Design Evaluation Order Tasks Data Decomposition Data Sharing This is typically a iterative process. Opportunities exist for dependence analysis to play earlier role in decomposition.
Task Decomposition • Many large problems can be naturally decomposed into tasks – CUDA kernels are largely tasks • The number of tasks used should be adjustable to the execution resources available. • Each task must include sufficient work in order to compensate for the overhead of managing their parallel execution. • Tasks should maximize reuse of sequential program code to minimize effort. “In an ideal world, the compiler would find tasks for the programmer. Unfortunately, this almost never happens.” - Mattson, Sanders, Massingill
Task Decomposition Example - Square Matrix Multiplication • P = M * N of WIDTH ● WIDTH • One natural task (sub-problem) produces one element of P • All tasks can execute in parallel in this example. N WIDTH M P WIDTH WIDTH WIDTH
Task Decomposition Example –Molecular Dynamics • Simulation of motions of a large molecular system • For each atom, there are natural tasks to calculate • Vibrational forces • Rotational forces • Neighbors that must be considered in non-bonded forces • Non-bonded forces • Update position and velocity • Misc physical properties based on motions • Some of these can go in parallel for an atom It is common that there are multiple ways to decompose any given problem Understand of the problem.
NAMD - molecular dynamics simulation For tutorial information , visit: http://www.ks.uiuc.edu/Training/SumSchool/materials/tutorials/02-namd-tutorial/namd-tutorial.pdf PatchList Data Structure Force & Energy Calculation Inner Loops SPEC_NAMD SelfComputes Objects ….... 144 iterations (per patch) PairComputes Objects …. ….. 1872 iterations (per patch pair) 6 Different NAMD Configurations (all independent) Independent Iterations
Data Decomposition • The most compute intensive parts of many large problem manipulate a large data structure • Similar operations are being applied to different parts of the data structure, in a mostly independent manner. • This is what CUDA is optimized for. • The data decomposition should lead to • Efficient data usage by tasks within the partition • Few dependencies across the tasks that work on different partitions • Adjustable partitions that can be varied according to the hardware characteristics
Data Decomposition Example - Square Matrix Multiplication • Row blocks • Computing each partition requires access to entire N array • Square sub-blocks • Only bands of M and N are needed • Important! Allow data sharing N WIDTH M P WIDTH WIDTH WIDTH
Tasks Grouping • Sometimes natural tasks of a problem can be grouped together to improve efficiency • Reduced synchronization overhead – all tasks in the group can use a barrier to wait for a common dependence • All tasks in the group efficiently share data loaded into a common on-chip, shared storage (Shard Memory) • Grouping and merging dependent tasks into one task reduces need for synchronization • CUDA thread blocks are task grouping examples.
Task Grouping Example - Square Matrix Multiplication • Tasks calculating a P sub-block • Extensive input data sharing, reduced memory bandwidth using Shared Memory • All synched in execution N WIDTH M P WIDTH WIDTH WIDTH
Task Ordering • Identify the data and resource required by a group of tasks before they can execute them • Find the task group that creates it • Determine a temporal order that satisfy all data constraints • Task ordering can be impacted by the scheduling constraints on GPU, i.e. not all thread blocks can be scheduled at the same time.
Task Ordering Example - Block Scheduling for Iterative PDE Solver Blocks scheduled in groups due to limited resources. No updated data from inactive blocks Try to minimize boundary nodes of the whole group Stripe scheduling Square scheduling
Task Ordering Example:Molecular Dynamics Neighbor List Vibrational and Rotational Forces Non-bonded Force Update atomic positions and velocities Next Time Step
Data Sharing • Data sharing can be a double-edged sword • Excessive data sharing can drastically reduce advantage of parallel execution • Localized sharing can improve memory bandwidth efficiency • Efficient memory bandwidth usage can be achieved by synchronizing the execution of task groups and coordinating their usage of memory data • Efficient use of on-chip, shared storage • Read-only sharing can usually be done at much higher efficiency than read-write sharing, which often requires synchronization
Data Sharing Example – Matrix Multiplication • Each task group will finish usage of each sub-block of N and M before moving on • N and M sub-blocks loaded into Shared Memory for use by all threads of a P sub-block • Amount of on-chip Shared Memory strictly limits the number of threads working on a P sub-block • Read-only shared data can be more efficiently accessed as Constant or Texture data
Data Sharing Example – Molecular Dynamics • The atomic coordinates • Read-only access by the neighbor list, bonded force, and non-bonded force task groups • Read-write access for the position update task group • The force array • Read-only access by position update group • Accumulate access by bonded and non-bonded task groups • The neighbor list • Read-only access by non-bonded force task groups • Generated by the neighbor list task group
Key Parallel Programming Steps • To find the concurrency in the problem • To structure the algorithm to translate concurrency into performance • To implement the algorithm in a suitable programming environment • To execute and tune the performance of the code on a parallel system Unfortunately, these have not been separated into levels of abstractions that can be dealt with independently.
Algorithm Consideration • A step by step procedure that is guaranteed to terminate, such that each step is precisely stated and can be carried out by a computer • Definiteness – the notion that each step is precisely stated • Effective computability – each step can be carried out by a computer • Finiteness – the procedure terminates • Multiple algorithms can be used to solve the same problem • Some require fewer steps • Some exhibit more parallelism • Some have larger memory footprint than others
Choosing Algorithm Structure Start Organize by Task Organize by Data Organize by Data Flow Linear Recursive Linear Recursive Regular Irregular Task Parallelism Divide and Conquer Geometric Decomposition Recursive Data Pipeline Event Driven
Mapping a Divide and Conquer Algorithm Thread 0 Thread 2 Thread 4 Thread 6 Thread 8 Thread 10 0 1 2 3 4 5 6 7 8 9 10 11 1 0+1 2+3 4+5 6+7 8+9 10+11 2 0...3 4..7 8..11 3 0..7 8..15 iterations Array elements
bx 0 1 2 tx bsize-1 0 1 2 N BLOCK_WIDTH WIDTH BLOCK_WIDTH M P 0 0 Psub 1 2 by WIDTH BLOCK_SIZE 1 ty bsize-1 BLOCK_WIDTH BLOCK_WIDTH BLOCK_WIDTH 2 WIDTH WIDTH Tiled (Stenciled) Algorithms are Important for Geometric Decomposition • A framework for memory data sharing and reuse by increasing data access locality. • Tiled access patterns allow small cache/scartchpad memories to hold on to data for re-use. • For matrix multiplication, a 16X16 thread block perform 2*256 = 512 float loads from device memory for 256 * (2*16) = 8,192 mul/add operations. • A convenient framework for organizing threads (tasks)
Increased Work per Thread for even more locality bx 0 1 2 tx TILE_WIDTH-1 0 1 2 • Each thread computes two element of Pdsub • Reduced loads from global memory (Md) to shared memory • Reduced instruction overhead • More work done in each iteration Nd TILE_WIDTH WIDTH TILE_WIDTH Pd Md 0 0 Pdsub Pdsub 1 ty by 2 1 WIDTH TILE_WIDTHE TILE_WIDTH-1 TILE_WIDTH TILE_WIDTH TILE_WIDTH 2 WIDTH WIDTH
Double Buffering - a frequently used algorithm pattern • One could double buffer the computation, getting better instruction mix within each thread • This is classic software pipelining in ILP compilers Load next tile from global memory Loop { Deposit current tile to shared memory syncthreads() Load next tile from global memory Compute current tile syncthreads() } Loop { Load current tile to shared memory syncthreads() Compute current tile syncthreads() }
bx 0 1 2 tx TILE_WIDTH-1 0 1 2 Nd TILE_WIDTH WIDTH TILE_WIDTH Md Pd 0 0 Pdsub 1 2 ty WIDTH TILE_WIDTHE by 1 TILE_WIDTH-1 TILE_WIDTH TILE_WIDTH TILE_WIDTH 2 WIDTH WIDTH Double Buffering • Deposit blue tile from register into shared memory • Syncthreads • Load orange tile into register • Compute Blue tile • Deposit orange tile into shared memory • …. • Not suitable if host synchronization is required
One can trade more work for increased parallelism - MPEG • Diamond search algorithm for motion estimation work efficient but sequential • Popular in traditional CPU implementations • Exhaustive Search totally parallel but work inefficient • Popular in HW and parallel implementations
An MPEG Algorithm based onData Parallelism Communication • Loops distributed –DOALL style • Replicates instructions and tables across accelerators • If instructions and data are too large for local memory… • Large memory transfers required to preserve data • Saturation of and contention for communication resources can leave computation resources idle Memory bandwidth constrains performance
Loop fusion & memory privatization • Stage loops fused into single DOALL macroblock loop • Memory privatization reduces main memory access • Replicates instructions and tables across processors • Local memory constraints may prevent this technique Novel dimensions of parallelism reduce communication
Pipeline or “Spatial Computing” Model • Each PE performs as one pipeline stage in macroblock processing • Imbalanced stages result in idle resources • Takes advantage of direct, accelerator to accelerator communication • Not very effective in CUDA but can be effective for Cell Efficient point-to-point communication can enable new models
From Moore’s Law to Amdahl’s Law – Moving to Many-Core CMP • General Ideas • Moore’s Law • Amdahl's Law • Processes and Threads • Concurrency vs. Parallelism
General Ideas “Andy giveth, and Bill taketh away.” No mater how fast hardware gets, software quickly expands of overwhelm the new hardware performance. Is above statement still TRUE?? “Unfortunately, many programs are so big that there is no one individual who really knows all the pieces, and so the amount of code sharing you get isn't as great. Also, the opportunity to go back and really rewrite something isn't quite as great, because there's always a new set of features that you're adding on to the same program.” “Technology happens, it's not good, it's not bad. Is steel good or bad?”
General Ideas • The major processor manufacturers and architectures, from Intel and AMD to Sun Sparc and IBM PowerPC, have run out of room with most of their traditional approaches to boosting CPU performance. • Silicon manufacturers are turning to hyperthreading and multicore architectures.
General Ideas • Ramifications of this change will hit hard in software development. • Typical software solutions can no longer suffice, relying on hardware for speed increases. • Software will need to adapt to the new hardware platforms. • Traditional programming models break down in n-core processor environments.
General Ideas • Intel CPU Introductions over time: • Clock Speed increases until ~2003 then levels off. • Number of transistors in a processor form continues to increase. • New Fermi: 3B transistors
General Ideas • Over the past 30 years, CPU designers have achieved performance gains through: • Clock Speed : Doing the same work faster. • Execution Optimization (microarchitecture): Doing more work per clock cycle. • Cache : Stay away from RAM as much as possible.
General Ideas • The previously mentioned “speedup” methods are concurrency agnostic methods – i.e. they will work with any sequential code base. • CPU Performance increase hit a wall in 2003 from a traditional uniprocessor approach. • It is becoming increasingly difficult to exploit faster CPU speeds. • Key physical limitations are standing in the way: • Heat • Power Consumption • Current Leakage
Moore’s Law • From: Electronics Magazine – April 1965 • “The complexity for minimum component costs has increased at a rate of roughly a factor of two per year ... Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years. That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000. I believe that such a large circuit can be built on a single wafer.”
Moore’s Law • For chip integration • Worded differently: • The number of transistors on a chip will double about every two years. • Basically, Moore’s Law became a self fulfilling prophesy.
Moore’s Law • Heat, power consumption, and current leakage are limiting the sheer “transistor” size for creation of a single CPU. • Consumer demands are for smaller, more portable, yet feature rich digital devices. • Moore’s predictions still hold – transistor counts continue to rise… • … however, performance gains are going to be accomplished in fundamentally different ways. • Most current applications will no longer benefit from the free “clock cycle” ride without significant redesign.
Moore’s Law • Near future performance gains will be achieved through: • Hyperthreading: Running two or more threads in parallel inside a single CPU. • Multicore: Running two or more CPUs on a single silicon form factor (CPU chip). • Cache: Bringing instructions and data across the memory access bus in blocks and storing “on-die” for fast access. • Multicore example: 1 : CPU 0, 2 : Cache L2 CPU 0, 3 : CPU 1, 4 : Cache L2 CPU 1, 5 : System Request Interface, Crossbar Switch, Memory Controller, Hypertransport
Amdahl’s Law • Amdahl's law states that the overall speedup of applying the improvement will be: • P - proportion of a computation where the improvement has a speedup of S. • Example: • If an improvement can speed up 30% of the computation, P will be 0.3 • If the improvement makes the portion affected twice as fast, S will be 2. • Adding a second processor to a multithreaded application sets S to 2. The speedup of a program using multiple processors in parallel computing is limited by the sequential fraction of the program.
Amdahl’s Law • To get rich applications running on shrinking form factors, Moore’s Law and Amdahl’s Law become increasingly important in the design of software based systems. • Moore’s law dictates continued growth of CPU systems – driven toward multicore because of current physical limitations. • Amdahl’s law dictates the extent to which multiple cores will increase software system performance.
Stack Code segment Data segment Stack Stack thread thread … Processes and Threads • Modern operating systems load programs as processes • Resource holder • Execution • A process starts executing at its entry point as a thread • Threads can create other threads within the process • Each thread gets its own stack • All threads within a process share code & data segments thread main()
Thread 1 Thread 2 Thread 1 Thread 2 Concurrency vs. Parallelism • Concurrency: two or more threads are in progress at the same time: • Parallelism: two or more threads are executing at the same time • Multiple cores needed!