Higher Level Parallelism
• The PRAM Model
• Vector Processors
• Flynn Classification
• Connection Machine CM-2 (SIMD)
• Communication Networks
• Memory Architectures
• Synchronization
Amdahl’s Law
• The performance gain from speeding up some operations is limited by the fraction of the time those (faster) operations are used
• Speedup = Original execution time / Improved execution time
• Speedup = Improved performance / Original performance
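As an illustration (the numbers are chosen here, not taken from the slide): suppose 80% of the execution time can be made 4 times faster. Then

    Speedup = 1 / ((1 - 0.8) + 0.8/4) = 1 / 0.4 = 2.5

and no matter how much that 80% is sped up, the overall speedup can never exceed 1 / 0.2 = 5.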
PRAM Model
• All processors share the same memory space
• CRCW
  • concurrent read, concurrent write
  • resolution function on collision (first/or/largest/error)
• CREW
  • concurrent read, exclusive write
• EREW
  • exclusive read, exclusive write
PRAM Algorithm
• Same program/algorithm in all processors
• Each processor also has local memory/registers
• Example: search for one value in an array
  • Using p processors
  • Array size m
  • p = m
• Example data: search for the value 2 in the array [3 2 5 7 2 5 1 6]
Search CRCW, p = m
• Step 1: concurrent read of A; the same memory location (the search value 2) is accessed by all processors, so every processor P1..P8 holds A = 2
• Step 2: read B; a different memory address for each processor, so P1..P8 hold B = 3, 2, 5, 7, 2, 5, 1, 6
Search CRCW, p = m (continued)
• Step 3: concurrent write; each processor writes 1 if A = B, else 0, and we use the "or" resolution function (1: value found, 0: value not found)
• Complexity
  • All operations are performed in constant time
  • Count only the cost of communication steps
  • The number of steps is independent of m (given enough processors)
  • Search is done in constant time, O(1), for CRCW and p = m
Search CREW, p = m
• Step 3: each processor computes 1 if A = B, else 0, giving the result vector [0 1 0 0 1 0 0 0]
• With exclusive write the partial results must be combined pairwise; the same processors can be reused in the next step
  • Step 4.1: read A
  • Step 4.2: read B
  • Step 4.3: compute A or B
  • Repeating with half the processors each round (8 -> 4 -> 2 -> 1) takes log m steps
• Complexity
  • We need log m steps to "collect" the result
  • Each operation is done in constant time
  • O(log m) complexity
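A sequential sketch of the CREW search on the example data (written here in plain C; on a real PRAM each iteration of the inner loops would run on its own processor):

    #include <stdio.h>

    #define M 8                                /* array size, p = m = 8       */

    int main(void) {
        int B[M] = {3, 2, 5, 7, 2, 5, 1, 6};   /* the array to search         */
        int A    = 2;                          /* the value searched for      */
        int R[M];

        /* Step 3: on the PRAM every processor does one comparison in parallel */
        for (int i = 0; i < M; i++)
            R[i] = (B[i] == A);

        /* Steps 4.x: OR-reduction tree; each round halves the number of active
           processors, so the PRAM needs log2(m) rounds to collect the result  */
        for (int half = M / 2; half >= 1; half /= 2)
            for (int i = 0; i < half; i++)
                R[i] = R[i] | R[i + half];

        puts(R[0] ? "value found" : "value not found");
        return 0;
    }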
Search EREW, p = m
• Exclusive read: the search value 2 must first be distributed by doubling, P1 -> P1 P2 -> P1..P4 -> P1..P8
• It takes log m steps to distribute the value; is the algorithm more complex? No, it is still O(log m), only the constant differs
PRAM, a Theoretical Model
• CRCW
  • Very elegant
  • Not of much practical use (too hard to implement)
• CREW
  • This model can be used to develop algorithms for parallel computers, e.g. our search example
  • p = 1 (a single processor): checking all elements gives O(m)
  • p = m (m processors): complexity O(log m), not O(1)
• From our example we conclude that even in theory we do not get an m-times "speedup" using m processors
  • That is one big problem with parallel computers
Parallelism so far
• By pipelining, several instructions (at different stages) are executed simultaneously
  • Pipeline depth limited by hazards
• Superscalar designs provide parallel execution units
  • Limited by instruction and machine level parallelism
• VLIW might improve over hardware instruction issuing
• All limited by the instruction fetch mechanism
  • Called the FLYNN BOTTLENECK
  • Only a very limited number of instructions can be fetched each cycle
  • That makes vector operations ineffective
Vector Processors
• Taking pipelining to its limits for vector operations
  • Sometimes referred to as a SuperPipeline
• The same operation is performed on a vector of data
  • No data dependencies within the vector data
  • Example: add two vectors
• Solves the FLYNN BOTTLENECK problem
  • A loop over a vector can be issued by a single instruction
• Proven to be very effective for scientific calculations
  • CRAY-1, CRAY-2, CRAY X-MP, CRAY Y-MP
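As a concrete illustration (my example, not from the slide), the classic DAXPY loop below is the kind of loop a vector processor issues as a few vector instructions (vector load, vector multiply, vector add, vector store) instead of fetching a scalar instruction stream per element:

    /* y = a*x + y over n elements; there are no dependencies between
       iterations, so a vector machine can run the whole loop as a handful
       of vector instructions operating on vector registers */
    void daxpy(int n, double a, const double *x, double *y) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }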
Vector Processor (CRAY-1 like)
• Main memory, accessed through a vector load/store unit
• Vector registers and scalar registers (like the MIPS register file)
• Superpipelined arithmetic units: FP add/subtract, FP multiply, FP divide, integer, logical
Vector Operations
• Fully pipelined
  • CPI = 1: we produce one result each cycle when the pipe is full
• Pipeline latency
  • Startup cost = pipeline depth
  • Vector add: 6 cycles
  • Vector multiply: 6 cycles
  • Vector divide: 20 cycles
  • Vector load: 12 cycles (depends on the memory hierarchy)
• Sustained rate
  • Time/element for a collection of related vector operations
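A rough worked example (the vector length is chosen here, and it assumes one result per cycle once the pipe is full): a 64-element vector add costs about 6 + 64 = 70 cycles, roughly 1.1 cycles per element, while a 64-element vector load costs about 12 + 64 = 76 cycles. The longer the vector, the closer the sustained rate gets to one element per cycle.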
Vector Processor Design
• Vector length control
  • VLR register (Maximum Vector Length, MVL)
  • Strip mining in software (a vector longer than MVL causes a loop)
• Stride
  • How to lay out vectors and matrices in memory so that memory banks can be accessed without collision
• Vector chaining
  • Forwarding between vector registers (minimizes latency)
• Vector mask register (Boolean valued)
  • Conditional writeback (if 0, no writeback)
  • Sparse matrices and conditional execution
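A minimal strip-mining sketch (assuming an MVL of 64; the daxpy loop from the earlier example stands in for one vector operation):

    #define MVL 64   /* assumed maximum vector length (machine dependent) */

    /* Process a vector of arbitrary length n in MVL-sized strips; on a real
       vector machine each strip is one set of vector instructions with the
       VLR register set to 'len'. */
    void daxpy_strip_mined(int n, double a, const double *x, double *y) {
        for (int i = 0; i < n; i += MVL) {
            int len = (n - i < MVL) ? n - i : MVL;   /* VLR for this strip   */
            for (int j = 0; j < len; j++)            /* one vector operation */
                y[i + j] = a * x[i + j] + y[i + j];
        }
    }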
Programming
• By use of language constructs the compiler is able to utilize the vector functions
• FORTRAN is widely used for scientific calculations
  • built-in matrix and vector functions/commands
• LINPACK
  • A library of optimized linear algebra functions
  • Often used as a benchmark (but does it tell the whole truth?)
• Some more (implicit) vectorization is possible with advanced compilers
Flynn Classification
• SISD (Single Instruction, Single Data)
  • The MIPS, and even the vector processor
• SIMD (Single Instruction, Multiple Data)
  • Each instruction activates several execution units in parallel
• MISD (Multiple Instruction, Single Data)
  • The VLIW architecture might be considered, but MISD is a seldom used classification
• MIMD (Multiple Instruction, Multiple Data)
  • Multiprocessor architectures
  • Multicomputers (communicating over a LAN), sometimes treated as a separate class of architectures
Communication Networks
• Bus
  • Total bandwidth = link bandwidth
  • Bisection bandwidth = link bandwidth
• Ring
  • Total bandwidth = P * link bandwidth
  • Bisection bandwidth = 2 * link bandwidth
• Fully connected
  • Total bandwidth = (P * (P - 1) / 2) * link bandwidth
  • Bisection bandwidth = (P / 2)^2 * link bandwidth
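A quick check with P = 8 (the processor count is chosen here for illustration): the bus has total = bisection = 1 link bandwidth; the ring has total = 8 and bisection = 2 link bandwidths; the fully connected network has total = 8 * 7 / 2 = 28 and bisection = (8/2)^2 = 16 link bandwidths.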
Multistage Networks
• Crossbar switch
  • Many connections at once, e.g. P1 to P2 and P3, P2 to P4, P3 to P1
• Omega network
  • Built from log2(P) stages of small switches
  • Blocking: P1 to P6 is possible, but then P2 to P8 is not possible at the same time
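A common routing scheme for Omega networks (not spelled out on the slide, so treat this as background): number the processors 0..7 in binary and use destination-tag routing, where each of the log2(8) = 3 stages looks at one bit of the destination address and takes the switch's upper output on 0 and lower output on 1. To reach destination 101 (processor 5) the path is lower, upper, lower, regardless of the source; blocking occurs when two such paths need the same switch output in the same stage.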
Connection Machine CM-2 (SIMD)
• 16 1-bit fully connected CPUs on each chip
  • Each CPU has 3 1-bit registers and 64 Kbits of memory
• 1024 chips form a block of 16k 1-bit CPUs plus 512 FPAs (floating-point accelerators); the slide's diagram shows four such blocks around the sequencer
• A front end (SISD) and a sequencer drive the CPU array
• CM-2 uses a 12-cube for communication between the chips (illustrated on the slide by a 3-cube)
• Data Vault (disk array) for mass storage
SIMD Programming, Parallel Sum

    sum = 0;
    for (i = 0; i < 65536; i = i + 1)    /* Loop over 65k elements            */
        sum = sum + A[Pn,i];             /* Pn is the processor number        */

    limit = 8192; half = limit;          /* Collect sums from 8192 processors */
    repeat
        half = half / 2;                 /* Split into senders/receivers      */
        if (Pn >= half && Pn < limit) send(Pn - half, sum);
        if (Pn < half) sum = sum + receive();
        limit = half;
    until (half == 1)                    /* Final sum                         */

The slide traces the last rounds: with limit = 4, P3 does send(1, sum) and P2 does send(0, sum); with limit = 2, P1 does send(0, sum); P0 then adds the received value and holds the final sum.
SIMD vs MIMD
• SIMD
  • Single instruction stream (one PC)
  • All processors perform the same work (synchronized)
  • Conditional execution (case/if etc.): each processor holds an enable bit
• MIMD
  • Each processor has its own PC
  • Possible to run different programs, BUT all may run the same program (SPMD, Single Program Multiple Data)
  • Use MIMD-style programming for conditional execution
  • Use SIMD-style programming for synchronized actions
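A small SPMD sketch (my own illustration; Pn is the processor number as in the parallel-sum example, and the "processors" are simulated by an ordinary loop): every processor runs the same program but branches on its own id, which is the natural MIMD style of conditional execution; on a SIMD machine the same if would instead be handled with the enable bit while each branch is broadcast in turn.

    #include <stdio.h>

    /* The same function runs on every processor; Pn is this processor's id */
    void spmd_body(int Pn) {
        if (Pn == 0)
            printf("P%d: collect and report the results\n", Pn);   /* one branch       */
        else
            printf("P%d: work on my own slice of the data\n", Pn); /* the other branch */
    }

    int main(void) {
        for (int Pn = 0; Pn < 4; Pn++)   /* on real hardware these run in parallel */
            spmd_body(Pn);
        return 0;
    }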
Memory Architectures for MIMD
• Centralized
  • A single bus is used for all main memory
  • Uniform memory access (after passing the local cache)
• Distributed
  • The sought address might be hosted by another processor
  • Non-uniform memory access (dynamic "find" time)
  • The extreme: a cache-only memory
• Shared
  • All processors share the same address space
  • Memory can be used for communication
• Private
  • Each processor has a unique address space
  • Communication must be done by "message passing"
Shared Bus MIMD
• Usually 2-32 processors, each with a cache and snoop tag, connected by a single bus to memory and I/O
• Cache coherency protocol
  • Write invalidate
    • The first write to address A causes all other cached copies of A to be invalidated
  • Write update
    • On a write to address A, all cached copies of A are updated (high bus activity)
  • On a cache read miss when using write-back (WB) caches, either
    • the cache holding the valid data writes it to memory, or
    • the cache holding the valid data writes it directly to the cache requesting the data
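An illustrative write-invalidate sequence (the scenario is made up here to show how the bullets above fit together):
1. P1 reads A: miss, the block is loaded from memory, P1 holds a clean copy
2. P2 reads A: miss, the block is loaded, P1 and P2 both hold A
3. P1 writes A: an invalidate is broadcast on the bus, P2's copy is invalidated, P1's copy is now dirty
4. P2 reads A: miss; with write-back caches P1 holds the only valid data and supplies it, either by writing it back to memory or directly to P2's cache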
Synchronization
• When using shared data we must ensure that only one processor at a time accesses the data when updating it
• We need an atomic operation for TEST&SET

Both Processor 1 and Processor 2 run the same code:

    loop: TEST&SET A.lock
          beq A.go loop
          update A
          clear A.lock

• Processor 1 gets the lock (A.go), updates the shared data, and finally clears the lock (A.lock)
• Processor 2 spin-waits until the lock is released, then updates the shared data and releases the lock
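A minimal sketch of the same spin lock in C11 (my own illustration; atomic_flag_test_and_set plays the role of the slide's TEST&SET, and the names A and A_lock are borrowed from the slide):

    #include <stdatomic.h>

    static atomic_flag A_lock = ATOMIC_FLAG_INIT;   /* the lock protecting A  */
    static int A;                                    /* the shared data        */

    void update_A(int value) {
        /* spin until test-and-set returns 0, i.e. we are the one who set it */
        while (atomic_flag_test_and_set(&A_lock))
            ;                                        /* spin-wait              */
        A = value;                                   /* update the shared data */
        atomic_flag_clear(&A_lock);                  /* release the lock       */
    }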