Extending the Unified Parallel Processing Speedup Model
• Computer architectures take advantage of low-level parallelism: multiple pipelines.
• The next generations of integrated circuits will continue to support increasing numbers of transistors.
• How can the additional transistors be used efficiently?
• Answer: parallelism beyond multiple pipelines, by adding multiple processors or processing components in a single chip or single package.
• Performance at each level of parallelism suffers from the law of diminishing returns outlined by Amdahl.
• Incorporating multiple levels of parallelism results in higher overall performance and efficiency.
Presentation Content
A discussion of practical and theoretical alternative methods for achieving parallel speedup, and of the efficient use of hardware/processing resources in capturing that speedup.
• Parallel Speedup/Amdahl’s Law, Scaled Speedup
• Pipelined Processors
• Multiprocessors and Multicomputers
• Multiple concurrent threads
• Multiple concurrent processes
• Multiple levels of parallelism with integrated chips/packages that combine microcontrollers with Digital Signal Processing chips
Presentation Summary
• Architects/chip manufacturers are integrating additional levels of parallelism.
• Multiple levels of parallelism achieve higher speedups and greater efficiencies than increasing hardware at a single parallel level.
• A balanced approach would deliver parallel speedup at about the same efficiency, measured as cost of hardware resources allocated, at each level of parallelism.
• Numerous architectural approaches are possible, each with different trade-offs and performance returns.
• Current technology is integrating DSP processing with microcontroller functionality - achieving up to three levels of parallelism.
Classic Model of Parallel Processing
• Multiple processors available (4)
• A process can be divided into serial and parallel portions
• The parallel parts are executed concurrently
• Serial time: 10 time units
• Parallel time: 4 time units
An example parallel process of time 10:
S - Serial or non-parallel portion
A - All A parts can be executed concurrently
B - All B parts can be executed concurrently
All A parts must be completed prior to executing the B parts.
Executed on a single processor: S A A A A B B B B S
Executed in parallel on 4 processors: S, then A A A A concurrently, then B B B B concurrently, then S
Amdahl’s Law (Analytical Model)
• Analytical model of parallel speedup from the 1960s
• The parallel fraction (α) is run over n processors, taking α/n time
• The part that must be executed in serial (1 - α) gets no speedup
• Overall performance is limited by the fraction of the work that cannot be done in parallel (1 - α)
• Diminishing returns with increasing processors (n)
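The bullets above describe the standard form of Amdahl's law, Speedup = 1 / ((1 - α) + α/n). The sketch below evaluates it in Python. The function names `amdahl_speedup` and `efficiency` are illustrative and do not come from the slides, and the value α = 0.8 is chosen only because it matches a process with 2 serial and 8 parallel time units out of 10.

```python
def amdahl_speedup(alpha: float, n: int) -> float:
    """Speedup when a fraction alpha of the work is spread over n processors."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    serial_part = 1.0 - alpha      # gets no speedup
    parallel_part = alpha / n      # divided over n processors
    return 1.0 / (serial_part + parallel_part)


def efficiency(alpha: float, n: int) -> float:
    """Speedup obtained per processor (illustrative helper, an assumption)."""
    return amdahl_speedup(alpha, n) / n


if __name__ == "__main__":
    # Diminishing returns: the speedup approaches 1 / (1 - alpha) = 5.
    for n in (1, 2, 4, 8, 16, 1024):
        print(n, round(amdahl_speedup(0.8, n), 3), round(efficiency(0.8, n), 3))
```

With α = 0.8 and n = 4 the function gives a speedup of 2.5, which agrees with a process that takes 10 time units serially and 4 time units on four processors.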
Pipelined Processing
• Single processor enhanced with discrete stages
• Instructions “flow” through pipeline stages
• Parallel speedup with multiple instructions being executed (by parts) simultaneously
• Realized speedup is partly determined by the number of stages: 5 stages = at most 5 times faster
Pipeline stages by cycle: 1: F, 2: D, 3: OF, 4: EX, 5: WB
F - Instruction Fetch
D - Instruction Decode
OF - Operand Fetch
EX - Execute
WB - Write Back or Result Store
The processor clock cycle is divided into sub-cycles; each stage takes one sub-cycle.
Pipeline Performance
• Speedup is serial time (nS) over parallel time
• Performance is limited by the number of pipeline flushes (n) due to jumps
• Speculative execution and branch prediction can minimize pipeline flushes
• Performance is also reduced by pipeline stalls (s), due to conflicts with bus access, data-not-ready delays, and other sources
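A minimal sketch of how flushes and stalls erode pipeline speedup. It is a common textbook approximation, not the formula from the slide. It assumes that an unpipelined processor spends one sub-cycle per stage per instruction, that each flush costs `num_stages - 1` refill sub-cycles, and that each stall costs one sub-cycle. The function and parameter names are hypothetical.

```python
def pipeline_speedup(num_instructions: int, num_stages: int,
                     flushes: int = 0, stalls: int = 0) -> float:
    """Approximate speedup of a pipeline over unpipelined execution."""
    # Unpipelined: every instruction passes through every stage in turn.
    serial_time = num_instructions * num_stages
    # Pipelined: fill the pipe once, then retire one instruction per sub-cycle.
    ideal_time = num_stages + (num_instructions - 1)
    # Assumed penalties: a flush refills the pipe, a stall loses one sub-cycle.
    penalty = flushes * (num_stages - 1) + stalls
    return serial_time / (ideal_time + penalty)


if __name__ == "__main__":
    print(pipeline_speedup(1000, 5))                           # close to 5
    print(pipeline_speedup(1000, 5, flushes=100))              # lower
    print(pipeline_speedup(1000, 5, flushes=100, stalls=200))  # lower still
```

Under these assumptions the speedup stays below the number of stages and drops as flushes and stalls accumulate.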
Super-Scalar: Multiple Pipelines
• Concurrent execution of multiple sets of instructions
• Example: simultaneous execution of instructions through an integer pipeline while processing instructions through a floating-point pipeline
• Compiler: identifies and specifies separate instruction sets for concurrent execution through different pipes
Algorithm/Thread Level Parallelism
• Example: algorithms to compute the Fast Fourier Transform (FFT) used in Digital Signal Processing (DSP)
• Many separate computations in parallel (high Degree Of Parallelism)
• Large exchange of data - much communication between processors
• Fine-grained parallelism
• Communication time (latency) may be a consideration if multiple processors are combined on a board or motherboard
• A large communication load (fine-grained parallelism) can force the algorithm to become bandwidth-bound rather than computation-bound.
Simple Algorithm/Thread Parallelism Model
• Parallel “threads of execution” could be separate processes, or threads of a multi-threaded process
• Each thread of execution obeys Amdahl’s parallel speedup model
• Multiple concurrently executing processes result in multiple serial components executing concurrently - another level of parallelism
Timeline for Program 1 (P1) and Program 2 (P2), each on two processors: S, then A A concurrently, then B B concurrently, then S
Observe that the serial parts of Program 1 and Program 2 are now running in parallel with each other. Each program would take 6 time units on a uniprocessor, for a total workload serial time of 12. Each has a speedup of 1.5. The total speedup is 12/4 = 3, which is also the sum of the program speedups.
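The arithmetic in this example can be checked with a short sketch. It assumes, as an illustration only, that each program consists of one serial unit, two concurrent A parts, two concurrent B parts, and a final serial unit, and that each program has two processors to itself. The names `PROGRAM` and `run_time` are hypothetical.

```python
from math import ceil

# Each phase is (label, number of one-unit parts); parts within a phase can
# run concurrently, and phases run one after another.
PROGRAM = [("S", 1), ("A", 2), ("B", 2), ("S", 1)]


def run_time(phases, processors: int) -> int:
    """Time units needed to run the phases in order on the given processors."""
    return sum(ceil(parts / processors) for _label, parts in phases)


num_programs = 2
uniprocessor_time = run_time(PROGRAM, 1)   # one program, one processor
parallel_time = run_time(PROGRAM, 2)       # one program, two processors
program_speedup = uniprocessor_time / parallel_time

# Both programs run side by side, so the whole workload finishes in
# parallel_time while a uniprocessor would need num_programs times as long.
total_speedup = (num_programs * uniprocessor_time) / parallel_time

if __name__ == "__main__":
    print(uniprocessor_time, parallel_time, program_speedup, total_speedup)
```

The total equals the sum of the two program speedups because both programs occupy the same 4 time units.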
Concurrent Execution of Multiple Processes
• Each process is limited by Amdahl’s parallel speedup
• Multiple concurrently executing processes result in multiple serial components executing concurrently - another level of parallelism
• Avoids Degree of Parallelism (DOP) speedup limitations
• Linear scaling up to machine limits of processors and memory: n × single-process speedup
Multiprocess Speedup
• No speedup - uniprocessor: 12 t
• Single process: 8 t, speedup = 1.5
• Multi-process: 4 t, speedup = 3
Algorithm/Thread Parallelism - Analytical Model
Multi-Process/Thread Speedup, similar processes:
• α = fraction of work that can be done in parallel
• n = number of processors
• N = number of concurrent (assumed similar) processes or threads
Multi-Process/Thread Speedup, dissimilar processes:
• α = fraction of work that can be done in parallel
• n = number of processors
• N = number of concurrent (assumed dissimilar) processes or threads
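One way to write a multi-process speedup that is consistent with the two-program example (total speedup equal to the sum of the program speedups) is sketched below. This is an assumption-based reconstruction, not the equation from the slide: it takes n as the number of processors given to each process, lets each process follow Amdahl's law, and assumes all processes occupy the same parallel time so that their speedups add. The per-process symbols α_i and n_i are introduced here for illustration.

```latex
% Similar processes: N copies, each with parallel fraction \alpha on n processors
S_{\text{similar}} = \frac{N}{(1-\alpha) + \dfrac{\alpha}{n}}

% Dissimilar processes: process i has parallel fraction \alpha_i on n_i processors
S_{\text{dissimilar}} = \sum_{i=1}^{N} \frac{1}{(1-\alpha_i) + \dfrac{\alpha_i}{n_i}}
```

As a check, two similar programs with α = 2/3 on n = 2 processors each give 2 / (1/3 + 1/3) = 3, matching a workload of 12 time units finishing in 4.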
(Simple) Unified Model with Scaled Speedup
Adds a scaling factor on the parallel work, while holding the serial work constant.
• k1 = scaling factor on the parallel portion
• α = fraction of work that can be done in parallel
• n = number of processors
• N = number of concurrent (assumed dissimilar) processes or threads
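A minimal sketch of scaling the parallel work for a single process. It is not the unified formula from the slide, which also involves N. It assumes the serial work stays at (1 - α), the parallel work grows to k1·α, and the scaled parallel work is divided evenly over n processors. The function name `scaled_speedup` is hypothetical.

```python
def scaled_speedup(alpha: float, n: int, k1: float) -> float:
    """Speedup when the parallel work is scaled by k1 and serial work is fixed."""
    serial_work = 1.0 - alpha          # held constant
    parallel_work = k1 * alpha         # scaled parallel portion
    time_on_one = serial_work + parallel_work
    time_on_n = serial_work + parallel_work / n
    return time_on_one / time_on_n


if __name__ == "__main__":
    print(scaled_speedup(0.8, 4, k1=1))   # k1 = 1 reduces to Amdahl's law
    print(scaled_speedup(0.8, 4, k1=4))   # more parallel work, higher speedup
```

With k1 = 1 the expression reduces to Amdahl's law. With k1 = n the denominator becomes 1 and the speedup is (1 - α) + n·α, the familiar scaled-speedup form.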
Capturing Multiple Levels of Parallelism
• Most parallelism suffers from diminishing returns, resulting in limited scalability.
• Allocating hardware resources to capture multiple levels of parallelism lets each level operate at the efficient end of its speedup curve.
• Manufacturers of microcontrollers are integrating multiple levels of parallelism on a single chip.
Architectural Variations
• DSP and microcontroller cores on the same chip
• DSP that also does microprocessor work
• Microprocessor that also does DSP
• Multiprocessor
• Each variation captures some speedup from all three levels
• Varying amounts of speedup from each level
• Each parallel level operates at a more efficient level than if all hardware resources were allocated to a single parallel level
Trend in Microprocessor Architectures
1. Intra-Instruction Parallelism: Pipelines
2. Instruction-Level Parallelism: Super-Scalar - Multiple Pipelines
3. Algorithm/Thread Parallelism:
• Multiple processing elements
• Integrated DSP with microcontroller
• Enhanced microcontroller to do DSP
• Enhanced DSP processor that also functions as a microcontroller
More Levels of Parallelism Outside the Chip
• Multiple processors in a box: on a motherboard, or on a back-plane with daughter-boards
• Shared-Memory Multiprocessors: communication is through shared memory
• Clustered Multiprocessors: another hierarchical level; processors are grouped into clusters; intra-cluster can be bus or network; inter-cluster can be bus or network
• Distributed Multicomputers: multiple computers loosely coupled through a network
• n-tiered Architectures: modern client/server architectures
Speedup of Client-Server, 2-Tier Systems
• Workload balance: the fraction of the workload on the client
• Workload balance = 1 (100% on the client): completely distributed
• Workload balance = 0 (none on the client): completely centralized
• n clients, m servers
Figure labels: n clients, LAN, Internet, LAN, m servers
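Speedup of Client-Server, n-Tier Systems
• m1 level 1 machines (clients)
• m2 machines at server level 2, m3 at server level 3, m4 at server level 4, etc.
• Level 1 workload balance: % of workload on the clients
• Level 2 workload balance: % of workload on the level 2 servers; level 3 workload balance: % of workload on the level 3 servers, etc.
Figure labels: m1 clients, Internet, LAN, servers m2, m3, m4, LAN, SAN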
Hierarchy of Embedded Parallelism
1. N-tiered Client-Server Distributed Systems
2. Clustered Multi-computers
3. Clustered Multiprocessors
4. Multiple Processors on a Chip
5. Multiple Processing Elements
6. Multiple Pipelines
7. Multiple Stages per Pipeline
Goals:
• Single analytical model that captures parallelism from all levels
• Simulator that allows exploration
References
K. Hoganson, "Alternative Mechanisms to Achieve Parallel Speedup", First IEEE Online Symposium for Electronics Engineers, IEEE Society, August 2000.
K. Hoganson, "Mapping Parallel Application Communication Topology to Rhombic Overlapping-Cluster Multiprocessors", accepted for publication, The Journal of Supercomputing, Vol. 17, No. 1, to appear August 2000.
K. Hoganson, "Workload Execution Strategies and Parallel Speedup on Clustered Computers", accepted for publication, IEEE Transactions on Computers, Vol. 48, No. 11, November 1999.
Undergraduate Research Project: Unified Parallel System Modeling project, Directed Study, Summer-Fall 2000.