CSE 8383 - Advanced Computer Architecture Week-5 Week of Feb 9, 2004 engr.smu.edu/~rewini/8383
Contents • Project/Schedule • Introduction to Multiprocessors • Parallelism • Performance • PRAM Model • ….
Warm Up • Parallel Numerical Integration • Parallel Matrix Multiplication In class: Discuss with your neighbor! Videotape: Think about it! What kind of architecture do we need?
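One way the numerical-integration warm-up might be sketched: split the interval across worker processes, integrate each piece with the midpoint rule, and sum the partial results. The integrand f(x) = x² and all parameter names are illustrative assumptions, since the slide leaves the problem open.

```python
from multiprocessing import Pool

def f(x):
    """Stand-in integrand; the warm-up leaves the function unspecified."""
    return x * x

def partial_integral(args):
    """Midpoint-rule integral of f over one subinterval."""
    a, b, steps = args
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

def parallel_integrate(a, b, workers=4, steps_per_worker=25000):
    """Split [a, b] into one subinterval per worker and sum the pieces."""
    width = (b - a) / workers
    chunks = [(a + k * width, a + (k + 1) * width, steps_per_worker)
              for k in range(workers)]
    with Pool(workers) as pool:
        return sum(pool.map(partial_integral, chunks))

if __name__ == "__main__":
    # Integral of x^2 on [0, 1] is 1/3
    print(round(parallel_integrate(0.0, 1.0), 6))
```

The same split-then-combine shape carries over to parallel matrix multiplication: partition the output rows across workers and combine the results.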
Explicit vs. Implicit Parallelism
• Explicit: Parallel program → Programming Environment → Parallel Architecture
• Implicit: Sequential program → Parallelizer → Parallel Architecture
Motivation • One-processor systems are not capable of delivering solutions to some problems in reasonable time • Multiple processors cooperate to jointly execute a single computational task in order to speed up its execution • Speed-up versus Quality-up
Multiprocessing
• One-processor: physical limitations
• Multiprocessor: N processors cooperate to solve a single computational task
• Benefits: speed-up, quality-up, sharing
Flynn’s Classification- revisited • SISD (single instruction stream over a single data stream) • SIMD (single instruction stream over multiple data streams) • MIMD (multiple instruction streams over multiple data streams) • MISD (multiple instruction streams and a single data stream)
SISD (single instruction stream over a single data stream)
• SISD uniprocessor architecture: a single IS flows from CU to PU, and a single DS flows between PU and MU, with I/O attached
Captions: CU = control unit, PU = processing unit, MU = memory unit, IS = instruction stream, DS = data stream, PE = processing element, LM = local memory
SIMD (single instruction stream over multiple data streams)
• SIMD architecture: one CU (program loaded from host) broadcasts a single IS to PE1 … PEn, each with its own local memory LM1 … LMn (data sets loaded from host)
MIMD (multiple instruction streams over multiple data streams)
• MIMD architecture (with shared memory): CU1 … CUn each issue their own IS to PU1 … PUn, whose data streams meet in a shared memory, with I/O attached
MISD (multiple instruction streams and a single data stream)
• MISD architecture (the systolic array): CU1, CU2, … CUn fetch distinct instruction streams from a common memory (program and data); a single DS flows through PU1 → PU2 → … → PUn to I/O
System Components • Three major Components • Processors • Memory Modules • Interconnection Network
Memory Access
• Shared Memory: all processors P access a common memory M
• Distributed Memory: each processor P has its own local memory M
Interconnection Network Taxonomy
• Static: 1-D, 2-D, HC (hypercube)
• Dynamic: bus-based (single, multiple) and switch-based (SS = single stage, MS = multistage, crossbar)
MIMD Shared Memory Systems
• Processors P connect through an interconnection network to shared memory modules M
Shared Memory • Single address space • Communication via read & write • Synchronization via locks
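The three bullets above can be illustrated with threads, which share one address space. This is a sketch in Python (an assumed language; the slides name none): communication is a plain read and write of `counter`, and the lock provides the synchronization.

```python
import threading

# Shared address space: all threads read and write the same variable.
counter = 0
lock = threading.Lock()

def worker(increments):
    global counter
    for _ in range(increments):
        with lock:           # synchronization via a lock
            counter += 1     # communication via plain reads and writes

threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000; without the lock, concurrent updates could be lost
```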
Bus-Based & Switch-Based SM Systems
• Bus-based: processors P, each with a cache C, share a global memory over a common bus
• Switch-based: processors P reach memory modules M through a switch
Cache Coherent NUMA
• Each node pairs a processor P and cache C with a local memory module M; nodes communicate through an interconnection network, so remote memory is accessible but at non-uniform cost
MIMD Distributed Memory Systems
• Each processor P has its own private memory M; the processor-memory pairs communicate through an interconnection network
Distributed Memory • Multiple address spaces • Communication via send & receive • Synchronization via messages
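These bullets can be mimicked with operating-system processes, which do not share an address space. A minimal sketch using Python's `multiprocessing.Pipe` (the squaring task is an arbitrary stand-in): all data crosses between address spaces only as explicit send/receive messages.

```python
from multiprocessing import Process, Pipe

def worker(conn):
    """Runs in a separate address space; data moves only by messages."""
    n = conn.recv()      # receive work from the parent
    conn.send(n * n)     # send the result back
    conn.close()

def run():
    results = []
    for n in [1, 2, 3, 4]:
        parent, child = Pipe()
        p = Process(target=worker, args=(child,))
        p.start()
        parent.send(n)                  # communication via send ...
        results.append(parent.recv())   # ... and receive
        p.join()  # the message exchange itself orders the two processes
    return results

if __name__ == "__main__":
    print(run())  # [1, 4, 9, 16]
```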
SIMD Computers
• In contrast to a single von Neumann computer, a SIMD machine drives many processor-memory pairs (P, M) connected by some interconnection network
SIMD (Data Parallel) • Parallel operations within a computation are partitioned spatially rather than temporally • Scalar instructions vs. array instructions • Processors are incapable of operating autonomously; they must be driven by the control unit
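The scalar-versus-array distinction can be seen in NumPy (used here only as an illustration, not as anything the lecture prescribes): the loop issues one operation per element in time, while the array expression applies a single operation across all elements at once.

```python
import numpy as np

a = np.arange(8, dtype=np.float64)
b = np.full(8, 2.0)

# Scalar view: one operation per element, issued temporally in a loop.
c_scalar = np.empty(8)
for i in range(8):
    c_scalar[i] = a[i] * b[i]

# Array view: one "instruction" applied spatially across all elements.
c_array = a * b

print(np.array_equal(c_scalar, c_array))  # True
```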
Past Trends in Parallel Architecture (inside the box) • Completely custom designed components (processors, memory, interconnects, I/O) • Longer R&D time (2-3 years) • Expensive systems • Quickly becoming outdated • Bankrupt companies!!
New Trends in Parallel Architecture (outside the box) • Advances in commodity processors and network technology • Network of PCs and workstations connected via LAN or WAN forms a Parallel System • Network Computing • Compete favorably (cost/performance) • Utilize unused cycles of systems sitting idle
Clusters
• Each node is a complete computer: processor P, cache C, memory M, I/O, and its own OS
• Nodes are connected by an interconnection network, with middleware and a programming environment layered on top
Grids • Grids are geographically distributed platforms for computation. • They provide dependable, consistent, pervasive, and inexpensive access to high end computational capabilities.
Problem Assume that a switching component such as a transistor can switch in zero time. We propose to construct a disk-shaped computer chip with such a component. The only limitation is the time it takes to send electronic signals from one edge of the chip to the other. Make the simplifying assumption that electronic signals travel 300,000 kilometers per second. What must be the diameter of a round chip so that it can switch 10^9 times per second? What would the diameter be if the switching requirement were 10^12 times per second?
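A short calculation sketches the answer: within one switching period a signal must be able to cross the chip, so the diameter is bounded by signal speed times the period.

```python
SIGNAL_SPEED_M_PER_S = 300_000 * 1000  # 300,000 km/s, as stated in the problem

def max_diameter_m(switches_per_second):
    """Largest diameter such that a signal crosses the chip in one period."""
    return SIGNAL_SPEED_M_PER_S / switches_per_second

print(max_diameter_m(10**9))   # 0.3 m: a 30 cm chip for 10^9 switches/s
print(max_diameter_m(10**12))  # 0.0003 m: 0.3 mm for 10^12 switches/s
```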
Grosch’s Law (1960s) • “To sell a computer for twice as much, it must be four times as fast” • Vendors skip small speed improvements in favor of waiting for large ones • Buyers of expensive machines would wait for a twofold improvement in performance for the same price.
Moore’s Law • Gordon Moore (cofounder of Intel) • Processor performance would double every 18 months • This prediction has held for several decades • Unlikely that single-processor performance continues to increase indefinitely
Von Neumann’s bottleneck • Great mathematician of the 1940s and 1950s • Single control unit connecting a memory to a processing unit • Instructions and data are fetched one at a time from memory and fed to processing unit • Speed is limited by the rate at which instructions and data are transferred from memory to the processing unit.
Parallelism • Multiple CPUs • Within the CPU • One Pipeline • Multiple pipelines
Speedup • S = Speed(new) / Speed(old) • S = Work/time(new) / Work/time(old) • S = time(old) / time(new) • S = time(before improvement) / time(after improvement)
Speedup • Time (one CPU): T(1) • Time (n CPUs): T(n) • Speedup: S • S = T(1)/T(n)
Amdahl’s Law The performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used
Example A must travel 200 miles to B; a fixed 20 hours of the trip cannot be sped up.
• Walk, 4 miles/hour: 50 + 20 = 70 hours, S = 1
• Bike, 10 miles/hour: 20 + 20 = 40 hours, S = 1.8
• Car-1, 50 miles/hour: 4 + 20 = 24 hours, S = 2.9
• Car-2, 120 miles/hour: 1.67 + 20 = 21.67 hours, S = 3.2
• Car-3, 600 miles/hour: 0.33 + 20 = 20.33 hours, S = 3.4
Amdahl’s Law (1967) • α: the fraction of the program that is naturally serial • (1 − α): the fraction of the program that is naturally parallel
S = T(1)/T(N)

T(N) = α T(1) + (1 − α) T(1) / N

S = 1 / (α + (1 − α)/N) = N / (α N + (1 − α))
Gustafson-Barsis Law • N and α are not independent of each other • α: the fraction of the program that is naturally serial • Fix the parallel execution time: T(N) = 1, so T(1) = α + (1 − α) N • S = T(1)/T(N) = N − (N − 1) α
Comparison of Amdahl’s Law vs Gustafson-Barsis’ Law
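The contrast between the two laws can be seen by evaluating both at the same serial fraction (α = 0.05 and N = 64 are arbitrary illustrative values): Amdahl assumes a fixed problem size and is pessimistic; Gustafson-Barsis assumes the problem scales with N and is optimistic.

```python
def amdahl(alpha, n):
    """Fixed problem size: S = N / (alpha*N + (1 - alpha))."""
    return n / (alpha * n + (1 - alpha))

def gustafson_barsis(alpha, n):
    """Scaled problem size: S = N - (N - 1) * alpha."""
    return n - (n - 1) * alpha

alpha, n = 0.05, 64
print(round(amdahl(alpha, n), 2))            # pessimistic: 15.42
print(round(gustafson_barsis(alpha, n), 2))  # optimistic: 60.85
```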
Example

For I = 1 to 10 do
begin
    S[I] = 0.0;
    for J = 1 to 10 do
        S[I] = S[I] + M[I, J];
    S[I] = S[I] / 10;
end
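The outer loop carries no dependence between iterations, so the ten row averages can be computed in parallel. A sketch in Python (the contents of the 10×10 matrix M are made up here, since the slide leaves them unspecified):

```python
from concurrent.futures import ThreadPoolExecutor

# Sample 10x10 matrix; the slide leaves M's contents unspecified.
M = [[float(i + j) for j in range(10)] for i in range(10)]

def row_average(row):
    """Body of one outer-loop iteration: sum the row, then divide by 10."""
    total = 0.0
    for x in row:
        total += x
    return total / 10

# Each outer-loop iteration is independent, so they may run concurrently.
with ThreadPoolExecutor(max_workers=10) as pool:
    S = list(pool.map(row_average, M))

print(S[0])  # 4.5 for the sample matrix: (0 + 1 + ... + 9) / 10
```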
Distributed Computing Performance • Single Program Performance • Multiple Program Performance
What is a Model? • According to Webster’s Dictionary, a model is “a description or analogy used to help visualize something that cannot be directly observed.” • According to The Oxford English Dictionary, a model is “a simplified or idealized description or conception of a particular system, situation or process.”
Why Models? • In general, the purpose of modeling is to capture the salient characteristics of phenomena with clarity and the right degree of accuracy to facilitate analysis and prediction. Maggs, Matheson and Tarjan (1995)
Models in Problem Solving • Computer Scientists use models to help design problem solving tools such as: • Fast Algorithms • Effective Programming Environments • Powerful Execution Engines
An Interface
• A model is an interface separating high-level properties from low-level ones
• Above the model: applications (the model provides operations to them)
• Below the model: architectures (the model requires an implementation from them)
PRAM Model
• p processors P1, P2, …, Pp, each with a private memory, share a global memory under a common control
• Synchronized read-compute-write cycle
• Variants: EREW, ERCW, CREW, CRCW
• Complexity measures: T(n), P(n), C(n)
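The synchronous read-compute-write cycle can be mimicked sequentially. A sketch of an EREW-style parallel sum (my own illustration, not from the slides): in each synchronous step every active "processor" reads two distinct cells and writes one, so reads and writes never collide, and n values are summed in O(log n) steps.

```python
def erew_parallel_sum(values):
    """Sum n values in O(log n) synchronous PRAM-style steps (EREW)."""
    cells = list(values)  # plays the role of global memory
    n = len(cells)
    stride = 1
    while stride < n:
        # Read phase: processor i reads cells[i] and cells[i + stride];
        # each cell is read by at most one processor (exclusive read).
        reads = [(i, cells[i] + cells[i + stride])
                 for i in range(0, n - stride, 2 * stride)]
        # Write phase: all writes happen after all reads, as in the cycle.
        for i, v in reads:
            cells[i] = v
        stride *= 2
    return cells[0]

print(erew_parallel_sum(range(1, 9)))  # 36
```

With n = 8 the sum finishes in three synchronous steps, matching the T(n) = O(log n) complexity the PRAM cost measures are meant to capture.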