Parallel Processing Chapter 9
Problem: • Branches, cache misses, and dependencies limit the instruction-level parallelism (ILP) available • Solution: • Divide the program into parts • Run each part on a separate CPU of a larger machine
Motivations • Desktops are incredibly cheap • Hooking up 100 desktops costs far less than a custom high-performance uniprocessor • Squeezing out more ILP is difficult • More complexity/power is required each time • Would eventually require a change in cooling technology
Challenges • Parallelizing code is not easy • A languages, software-engineering, and software-verification issue – beyond the scope of this class • Communication can be costly • Our performance analysis ignores caches – these costs are much higher in practice • Requires HW support • Multiple processes modifying the same data cause race conditions, and out-of-order processors arbitrarily reorder operations
Speedup • Amdahl's Law! • 70% of the program is parallelizable • What is the highest speedup possible? • 1 / (0.30 + 0.70/∞) = 1 / 0.30 = 3.33 • What is the speedup with 100 processors? • 1 / (0.30 + 0.70/100) = 1 / 0.307 = 3.26
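These numbers are easy to check by coding the formula directly. A minimal C sketch (the function name is ours, not from the chapter):

    #include <stdio.h>

    /* Amdahl's Law: speedup = 1 / ((1 - p) + p/n), where p is the
     * parallelizable fraction and n is the number of processors. */
    double amdahl_speedup(double p, double n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    int main(void) {
        printf("100 CPUs: %.2f\n", amdahl_speedup(0.70, 100.0)); /* 3.26 */
        printf("limit:    %.2f\n", 1.0 / 0.30); /* 3.33 as n grows without bound */
        return 0;
    }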
Taxonomy • SISD – single instruction, single data • uniprocessor • SIMD – single instruction, multiple data • vector, MMX extensions, graphics cards • MISD – multiple instruction, single data • MIMD – multiple instruction, multiple data
SIMD • [Figure: one controller driving an array of processor (P) / data (D) pairs] • The controller fetches instructions • All processors execute the same instruction • Conditional instructions are the only way to get variation across processors
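To make SIMD concrete on a desktop CPU, here is a minimal sketch using x86 SSE intrinsics; a single addps instruction applies the same add to four data elements at once. The function name and the assumption that n is a multiple of 4 are ours:

    #include <xmmintrin.h>  /* SSE intrinsics */

    /* c[i] = a[i] + b[i], four floats per instruction. */
    void add_arrays(float *c, const float *a, const float *b, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);          /* load 4 floats */
            __m128 vb = _mm_loadu_ps(&b[i]);
            _mm_storeu_ps(&c[i], _mm_add_ps(va, vb)); /* one SIMD add */
        }
    }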
Taxonomy • SISD – single instruction, single data • uniprocessor • SIMD – single instruction, multiple data • vector machines, MMX extensions, graphics cards • MISD – multiple instruction, single data • never built – pipeline architectures? streaming apps? • MIMD – multiple instruction, multiple data • most multiprocessors • cheap, flexible
Example • Sum the elements in A[] and place the result in sum

    int sum = 0;
    int i;
    for (i = 0; i < n; i++)
        sum = sum + A[i];
Parallel Version – Shared Memory

    int A[NUM];
    int numProcs;
    int sum;
    int sumArray[numProcs];

    myFunction( /* input arguments */ )
    {
        int myNum = ...;  /* this processor's ID, 0 .. numProcs-1 */
        int mySum = 0;
        for (i = (NUM/numProcs)*myNum; i < (NUM/numProcs)*(myNum+1); i++)
            mySum += A[i];
        sumArray[myNum] = mySum;
        barrier();            /* wait until every processor has its partial sum */
        if (myNum == 0) {     /* one processor combines the partial sums */
            for (i = 0; i < numProcs; i++)
                sum += sumArray[i];
        }
    }
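The slide code is pseudocode: barrier() and the assignment of myNum are left abstract. A runnable sketch of the same pattern with POSIX threads might look like the following; the array size, thread count, and initialization are hypothetical:

    #include <pthread.h>
    #include <stdio.h>

    #define NUM    1000
    #define NPROCS 4

    int A[NUM];
    int sumArray[NPROCS];
    int sum = 0;
    pthread_barrier_t barrier;

    void *myFunction(void *arg) {
        int myNum = (int)(long)arg;   /* this thread's ID */
        int mySum = 0;
        for (int i = (NUM/NPROCS)*myNum; i < (NUM/NPROCS)*(myNum+1); i++)
            mySum += A[i];
        sumArray[myNum] = mySum;
        pthread_barrier_wait(&barrier);  /* wait for all partial sums */
        if (myNum == 0)
            for (int i = 0; i < NPROCS; i++)
                sum += sumArray[i];
        return NULL;
    }

    int main(void) {
        pthread_t t[NPROCS];
        for (int i = 0; i < NUM; i++) A[i] = 1;
        pthread_barrier_init(&barrier, NULL, NPROCS);
        for (long i = 0; i < NPROCS; i++)
            pthread_create(&t[i], NULL, myFunction, (void *)i);
        for (int i = 0; i < NPROCS; i++)
            pthread_join(t[i], NULL);
        printf("sum = %d\n", sum);  /* expect 1000 */
        return 0;
    }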
Why Synchronization? • Why can’t you figure out when proc x will finish work? • Cache misses • Different control flow • Context switches
Supporting Parallel Programs • Synchronization • Cache Coherence • False Sharing
Synchronization • Sum += A[i]; • Two processors, i = 0, i = 50 • Before the action: • Sum = 5 • A[0] = 10 • A[50] = 33 • What is the proper result?
Synchronization • Sum = Sum + A[i]; • Assembly for this statement, assuming: • A[i] is already in $t0 • &Sum is already in $s0

    lw  $t1, 0($s0)
    add $t1, $t1, $t0
    sw  $t1, 0($s0)
Synchronization – Ordering #1

    P1: lw  $t1, 0($s0)   # P1 reads Sum = 5
    P2: lw  $t1, 0($s0)   # P2 reads Sum = 5
    P1: add $t1, $t1, $t0
    P1: sw  $t1, 0($s0)   # P1 writes Sum = 15
    P2: add $t1, $t1, $t0
    P2: sw  $t1, 0($s0)   # P2 writes Sum = 38

Final Sum = 38

Synchronization – Ordering #2

    P2: lw  $t1, 0($s0)   # P2 reads Sum = 5
    P1: lw  $t1, 0($s0)   # P1 reads Sum = 5
    P2: add $t1, $t1, $t0
    P2: sw  $t1, 0($s0)   # P2 writes Sum = 38
    P1: add $t1, $t1, $t0
    P1: sw  $t1, 0($s0)   # P1 writes Sum = 15

Final Sum = 15 • Neither ordering gives the proper result of 48 (5 + 10 + 33) – one update is always lost
Synchronization Problem • A read followed by a write of memory is not atomic • You cannot read and write a memory location in a single operation • We need hardware primitives that allow us to read and write without interruption
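Modern languages expose such primitives directly; as an illustration, a minimal C11 sketch in which the read-modify-write is performed as one indivisible operation:

    #include <stdatomic.h>

    atomic_int Sum = 5;

    /* The load, add, and store are a single indivisible
     * operation -- no interleaving can lose an update. */
    void add_element(int value) {
        atomic_fetch_add(&Sum, value);
    }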
Solution • Software • "lock" – a function that allows one processor to proceed while all others loop • "unlock" – releases one looping processor (or resets the lock so the next arriving processor may proceed) • Hardware • Provide primitives that read & write atomically, in order to implement lock and unlock
Software – Using lock and unlock

    lock(&balancelock)
    Sum += A[i]
    unlock(&balancelock)
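In a real threading library this pair maps onto a mutex. A minimal POSIX-threads sketch of the same critical section, reusing balancelock as the mutex name:

    #include <pthread.h>

    pthread_mutex_t balancelock = PTHREAD_MUTEX_INITIALIZER;
    int Sum;

    /* Only one thread at a time executes the code
     * between lock and unlock. */
    void add_to_sum(int value) {
        pthread_mutex_lock(&balancelock);
        Sum += value;                    /* critical section */
        pthread_mutex_unlock(&balancelock);
    }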
Hardware – Implementing lock & unlock • swap $1, 100($2) • Atomically swaps the contents of $1 and M[$2+100]
Hardware: Implementing lock & unlock with swap • If the lock holds 0, it is free • If the lock holds 1, it is held

    Lock:   li   $t0, 1
    Loop:   swap $t0, 0($a0)     # atomically exchange $t0 with the lock
            bne  $t0, $0, Loop   # got back 1: lock was held, retry

    Unlock: sw   $0, 0($a0)      # store 0 to release the lock
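The same spin-lock can be written in portable C with C11's atomic exchange, which plays the role of swap; a minimal sketch:

    #include <stdatomic.h>

    atomic_int lock_var = 0;   /* 0 = free, 1 = held */

    void lock(atomic_int *l) {
        /* Keep swapping in 1 until we get 0 back, meaning
         * the lock was free and is now ours. */
        while (atomic_exchange(l, 1) != 0)
            ;  /* spin */
    }

    void unlock(atomic_int *l) {
        atomic_store(l, 0);  /* like sw $0, 0($a0) */
    }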
Outline • Synchronization • Cache Coherence • False Sharing
Cache Coherence • P1 and P2 have write-back caches in front of a shared DRAM

    Current value of a in:   P1$   P2$   DRAM
    (initially)               *     *     7
    1. P2: Rd a               *     7     7
    2. P2: Wr a, 5            *     5     7
    3. P1: Rd a               5     5     5    (P2 writes back, P1 fills)
    4. P2: Wr a, 3            5     3     5
    5. P1: Rd a               5     3     5    (P1 hits on its stale copy)

AAAAAAAAAAAAAAAAAAAAAH! Inconsistency! • What will P1 receive from its load? 5 • What should P1 receive from its load? 3
Whatever are we to do? • Write-Invalidate • Invalidate that value in all other caches • Set their valid bits to 0 • Write-Update • Update the value in all other caches
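Neither policy is implemented in software, but a toy model can make the difference concrete. A minimal C sketch of the two policies acting on per-processor copies of a single location; everything here is hypothetical, not the hardware mechanism:

    #include <stdbool.h>

    #define NCACHES 2

    struct line { bool valid; int value; };
    struct line cache[NCACHES];   /* one cached copy of `a` per processor */

    /* Write-invalidate: the writer keeps its copy and
     * every other copy is marked invalid. */
    void write_invalidate(int writer, int value) {
        for (int p = 0; p < NCACHES; p++)
            if (p != writer)
                cache[p].valid = false;
        cache[writer].valid = true;
        cache[writer].value = value;
    }

    /* Write-update: the new value is broadcast into every
     * valid copy, so readers never see a stale one. */
    void write_update(int writer, int value) {
        cache[writer].valid = true;
        for (int p = 0; p < NCACHES; p++)
            if (cache[p].valid)
                cache[p].value = value;
    }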