This lecture covers the basics of programming multicore processors using multithreading. Topics include shared memory, synchronization, and writing multithreaded code. An example application and performance characterization are also discussed.
Announcements
• A1 posted by 9AM on Tuesday morning, probably sooner; will announce via Piazza
• TA lab/office hours starting next week: will be posted by Sunday afternoon
• Remember Kesden's office hours are posted on the Web, and he is often available outside of office hours. See his schedule.
Have you found a programming partner?
• Yes
• Not yet, but I may have a lead
• No
Recapping from last time
• We will program multicore processors with multithreading
• Multiple program counters
• A new storage class: shared data
• Synchronization may be needed when updating shared state (thread safety)
[Figure: shared memory holds s and y; each processor P0, P1, …, Pn has its own private copy of i]
Hello world with <thread>

#include <thread>
#include <iostream>
using namespace std;

void Hello(int TID) {
    cout << "Hello from thread " << TID << endl;
}

int main(int argc, char *argv[]) {
    // NT = number of threads, taken from the command line (elided on the slide)
    thread *thrds = new thread[NT];
    // Spawn threads
    for (int t = 0; t < NT; t++) {
        thrds[t] = thread(Hello, t);
    }
    // Join threads
    for (int t = 0; t < NT; t++)
        thrds[t].join();
}

Sample runs (note that the output order varies from run to run, and lines can even interleave):
$ ./hello_th 3
Hello from thread 0
Hello from thread 1
Hello from thread 2
$ ./hello_th 3
Hello from thread 1
Hello from thread 0
Hello from thread 2
$ ./hello_th 4
Running with 4 threads
Hello from thread 0
Hello from thread 3
Hello from thread
Hello from thread 21

$PUB/Examples/Threads/Hello-Th, PUB = /share/class/public/cse160-wi17
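A hedged build note (the compile command is not shown on the slide, and the file and binary names below are placeholders): programs using std::thread are typically compiled with C++11 or later and linked against pthreads, e.g.

$ g++ -std=c++11 -pthread hello.cpp -o hello_th
$ ./hello_th 3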
What things can threads do?
• Create even more threads
• Join with others created by the parent
• Run different code fragments
• Run in lockstep
• A, B & C
Steps in writing multithreaded code
• We write a thread function that gets called each time we spawn a new thread
• Spawn threads by constructing objects of class thread (in the C++ library)
• Each thread runs on a separate processing core (if more threads than cores, the threads share cores)
• Threads share memory: declare shared variables outside the scope of any function
• Divide up the computation fairly among the threads
• Join threads so we know when they are done
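A minimal sketch that pulls these steps together (this skeleton is illustrative only, not the course's code; NT, N, and the work done per element are assumptions made here, and NT would normally come from the command line):

#include <thread>
using namespace std;

const int NT = 4;                // number of threads (assumed fixed here)
const int N  = 1000;             // problem size (assumed)
int out[N];                      // shared data: declared outside any function

void worker(int TID) {           // thread function: runs once per spawned thread
    int i0 = TID * (N/NT), i1 = i0 + (N/NT);   // a fair share of the iterations
    for (int i = i0; i < i1; i++)
        out[i] = i * i;          // threads write disjoint elements, so no conflict
}

int main() {
    thread thrds[NT];
    for (int t = 0; t < NT; t++) thrds[t] = thread(worker, t);  // spawn
    for (int t = 0; t < NT; t++) thrds[t].join();               // join: all work is done
    return 0;
}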
Today's lecture
• A first application
• Performance characterization
• Data races
A first application
• Divide one array of numbers by another, pointwise
  for i = 0:N-1
      c[i] = a[i] / b[i];
• Partition arrays into intervals, assign each to a unique thread
• Each thread sweeps over a reduced problem
[Figure: arrays a and b partitioned among threads T0–T3; each thread divides its chunk of a by its chunk of b to produce its chunk of c]
Pointwise division of 2 arrays with threads

#include <thread>
using namespace std;

int *a, *b, *c;

void Div(int TID, int N, int NT) {
    int64_t i0 = TID * (N/NT), i1 = i0 + (N/NT);
    for (int r = 0; r < REPS; r++)
        for (int i = i0; i < i1; i++)
            c[i] = a[i] / b[i];
}

int main(int argc, char *argv[]) {
    thread *thrds = new thread[NT];
    // allocate a, b and c
    // Spawn threads
    for (int t = 0; t < NT; t++) {
        thrds[t] = thread(Div, t, N, NT);
    }
    // Join threads
    for (int t = 0; t < NT; t++)
        thrds[t].join();
}

Timings (via qlogin), N = 50,000,000 (50M):
$ ./div 1 50000000    0.3099 seconds
$ ./div 2 50000000    0.1980 seconds
$ ./div 4 50000000    0.1258 seconds
$ ./div 8 50000000    0.1185 seconds

$PUB/Examples/Threads/Div, PUB = /share/class/public/cse160-wi17
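One detail worth noting about this block partitioning: it assumes NT divides N evenly, otherwise the trailing N mod NT elements are never processed. A common variant (a sketch, not the course's code) hands the leftover iterations to the last thread:

int64_t i0 = TID * (N/NT);
int64_t i1 = (TID == NT-1) ? N : i0 + (N/NT);   // last thread also takes the remainder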
Why did the program run only a little faster on 8 cores than on 4?
• There wasn't enough work to give out, so some cores were starved
• Memory traffic is saturating the bus
• The workload is shared unevenly and not all cores are doing their fair share
Today's lecture
• A first application
• Performance characterization
• Data races
Measures of Performance
• Why do we measure performance?
• How do we report it?
  • Completion time
  • Processor time product: completion time × #processors
  • Throughput: amount of work that can be accomplished in a given amount of time
  • Relative performance, given a reference architecture or implementation (AKA speedup)
Parallel Speedup and Efficiency
• How much of an improvement did our parallel algorithm obtain over the serial algorithm?
• Define the parallel speedup SP = T1/TP, where T1 is the running time of the best serial program on 1 processor and TP is the running time of the parallel program on P processors
• T1 is defined as the running time of the "best serial algorithm": in general, not the running time of the parallel algorithm on 1 processor
• Definition: parallel efficiency EP = SP/P
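A quick worked example with hypothetical numbers (not from the slide): if T1 = 10 seconds and T4 = 4 seconds, then S4 = T1/T4 = 2.5 and E4 = S4/4 ≈ 0.63, i.e. on average each of the 4 processors is doing useful work about 63% of the time.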
What can go wrong with speedup?
• Not always an accurate way to compare different algorithms…
• … or the same algorithm running on different machines
• We might be able to obtain a better running time even if we lower the speedup
• If our goal is performance, the bottom line is running time TP
Program P gets a higher speedup on machine A than on machine B. Does the program run faster on machine A or B?
• A
• B
• Can't say
Superlinear speedup
• We have a super-linear speedup when EP > 1, i.e. SP > P
• Is it believable?
• Super-linear speedups are often an artifact of inappropriate measurement technique
• Where there is a super-linear speedup, a better serial algorithm may be lurking
What is the maximum possible speedup of any program running on 2 cores?
• 1
• 2
• 4
• 10
• None of these
Scalability
• A computation is scalable if performance increases as a "nice function" of the number of processors, e.g. linearly
• In practice scalability can be hard to achieve
  ► Serial sections: code that runs on only one processor
  ► "Non-productive" work associated with parallel execution, e.g. synchronization
  ► Load imbalance: uneven work assignments over the processors
• Some algorithms present intrinsic barriers to scalability, leading to alternatives
  for i = 0:n-1
      sum = sum + x[i]
Serial Sections
• Serial sections limit scalability
• Let f = the fraction of T1 that runs serially
  • T1 = f T1 + (1-f) T1
  • TP = f T1 + (1-f) T1/P
• Thus SP = 1/[f + (1-f)/P]
• As P → ∞, SP → 1/f
• This is known as Amdahl's Law (1967)
[Figure: T1 with its serial fraction f highlighted]
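To see how limiting even a small serial fraction is, here is an illustrative calculation (the numbers are not on the slide): with f = 0.1, SP = 1/[0.1 + 0.9/P], so S8 ≈ 4.7, S64 ≈ 8.8, and SP can never exceed 1/f = 10 no matter how many processors are used.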
Amdahl's law (1967)
• A serial section limits scalability
• Let f = fraction of T1 that runs serially
• Amdahl's Law (1967): as P → ∞, SP → 1/f
[Figure: speedup vs. number of processors for f = 0.1, 0.2, 0.3]
Performance questions
• You observe the following running times for a parallel program running a fixed workload N
• Assume that the only losses are due to serial sections
• What are the speedup and efficiency on 2 processors?
• What is the maximum possible speedup on an infinite number of processors?
• What is the running time on 4 processors?
  Recall: SP = 1/[f + (1 - f)/P]
[Table of observed running times; the values used on the next slide are T1 = 1.0 s and T2 = 0.6 s]
Performance questions
• You observe the following running times for a parallel program running a fixed workload, and the only losses are due to serial sections
• What are the speedup and efficiency on 2 processors?
  • S2 = T1/T2 = 1.0/0.6 = 5/3 ≈ 1.67; E2 = S2/2 ≈ 0.83
• What is the maximum possible speedup on an infinite number of processors?
  • Recall SP = 1/[f + (1 - f)/P]
  • To compute the max speedup, we need to determine f. To determine f, we plug in known values (S2 and P):
    5/3 = 1/[f + (1-f)/2] ⟹ 3/5 = f + (1-f)/2 ⟹ f = 1/5
  • So what is S∞?
• What is the running time on 4 processors?
  • Plugging values into the SP expression: S4 = 1/[1/5 + (4/5)/4] ⟹ S4 = 5/2
  • But S4 = T1/T4, so T4 = T1/S4
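Completing the arithmetic (the slide leaves these last two values for the reader, but they follow directly from the expressions above): S∞ = 1/f = 5, and since S4 = 5/2, T4 = T1/S4 = 1.0/2.5 = 0.4 seconds.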
Weak scaling
• Is Amdahl's law pessimistic?
• Observation: Amdahl's law assumes that the workload (W) remains fixed
• But parallel computers are used to tackle more ambitious workloads
• If we increase W with P we have weak scaling
  • f often decreases with W
  • We can continue to enjoy speedups
• Gustafson's law [1992]
  • http://en.wikipedia.org/wiki/Gustafson's_law
  • www.scl.ameslab.gov/Publications/Gus/FixedTime/FixedTime.pdf
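For reference (the standard statement, not spelled out on the slide): if f is the fraction of the scaled run that is serial and the parallel part grows with P, Gustafson's law gives a scaled speedup of Sscaled = f + (1 - f)·P = P - (P - 1)·f, which grows roughly linearly in P rather than saturating at 1/f.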
Isoefficiency
• The consequence of Gustafson's observation is that we increase N with P
• We can maintain constant efficiency so long as we increase N appropriately
• The isoefficiency function specifies the growth of N in terms of P
• If N is linear in P, we have a scalable computation
• If not, memory per core grows with P!
• Problem: the amount of memory per core is shrinking over time
Today's lecture
• A first application
• Performance characterization
• Data races
Summing a list of integers
• for i = 0:N-1
      sum = sum + x[i];
• Partition x[ ] into intervals, assign each to a unique thread
• Each thread sweeps over a reduced problem
[Figure: x partitioned among threads T0–T3; each thread computes a local ∑ and these are combined into a global ∑]
First version of summing code

int *x;
int64_t global_sum;

void sum(int TID, int N, int NT) {
    int64_t i0 = TID * (N/NT), i1 = i0 + (N/NT);
    int64_t local_sum = 0;
    for (int64_t i = i0; i < i1; i++)
        local_sum += x[i];
    global_sum += local_sum;
}

Main():
for (int64_t t = 0; t < NT; t++) {
    thrds[t] = thread(sum, t, N, NT);
}
Results
• The program usually runs correctly
• But sometimes it produces incorrect results:
  Result verified to be INCORRECT, should be 549756338176
• What happened? There is a conflict when updating global_sum: a data race
    void sum(int TID, int N, int NT) {
        …
        global_sum += local_sum;
    }
[Figure: global_sum lives on the shared heap; each processor P has its own private stack]
Data Race
• Consider the following thread function, where x is shared and initially 0:
    void threadFn(int TID) {
        x++;
    }
• Let's run on 2 threads
• What is the value of x after both threads have joined?
• A data race arises because the timing of accesses to shared data can affect the outcome
• We say we have a non-deterministic computation
• This is true because we have a side effect (changes to global variables, I/O and random number generators)
• Normally, if we repeat a computation using the same inputs we expect to obtain the same results
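A minimal, self-contained sketch (not from the slides) that makes the race easy to observe by having each thread repeat the increment many times; with unsynchronized updates the final count usually comes out below 2,000,000:

#include <iostream>
#include <thread>

int x = 0;                       // shared counter, initially 0

void threadFn(int TID) {
    for (int i = 0; i < 1000000; i++)
        x++;                     // unsynchronized read-modify-write: the data race
}

int main() {
    std::thread t0(threadFn, 0), t1(threadFn, 1);
    t0.join();
    t1.join();
    // Often prints a value less than 2000000 because increments are lost
    std::cout << "x = " << x << std::endl;
    return 0;
}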
Under the hood of a race condition
• Assume x is initially 0
    x = x + 1;
• Generated assembly code:
    r1 ← (x)
    r1 ← r1 + #1
    (x) ← r1
• Possible interleaving with two threads P1 and P2 (each has its own private r1):
    P1: r1 ← x          // P1's r1 gets 0
    P2: r1 ← x          // P2's r1 also gets 0
    P1: r1 ← r1 + #1    // P1's r1 set to 1
    P2: r1 ← r1 + #1    // P2's r1 set to 1
    P1: x ← r1          // P1 writes its r1
    P2: x ← r1          // P2 writes its r1
• Both threads write 1, so x ends up as 1 instead of 2
Avoiding the data race in summation
• Perform the global summation in main()
• After a thread joins, add its contribution to the global sum, one thread at a time
• We need to wrap std::ref() around reference arguments (int64_t &); the compiler needs a hint*

void sum(int TID, int N, int NT, int64_t& localSum) {
    ….
    for (int i = i0; i < i1; i++)
        localSum += x[i];
}

int64_t global_sum;
…
int64_t *locSums = new int64_t[NT];
for (int t = 0; t < NT; t++)
    thrds[t] = thread(sum, t, N, NT, ref(locSums[t]));
for (int t = 0; t < NT; t++) {
    thrds[t].join();
    global_sum += locSums[t];
}

* Williams, pp. 23-4
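A note on why the two ingredients matter: std::thread copies its arguments into the new thread by default, so without std::ref each thread would update a private copy of locSums[t] and main() would read back nothing useful; and because each thread writes only its own slot of locSums while main() adds the contributions after join(), no two threads ever update the same location, so the data race is gone.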
Next time
• Avoiding data races
• Passing arguments by reference
• The Mandelbrot set computation