This lecture covers the basics of programming multicore processors using multithreading. Topics include shared memory, synchronization, and writing multithreaded code. An example application and performance characterization are also discussed.
Announcements
• A1 posted by 9AM on Tuesday morning, probably sooner; will announce via Piazza
• TA lab/office hours starting next week: will be posted by Sunday afternoon
• Remember Kesden's office hours are posted on the Web, and he is often available outside of office hours. See his schedule.
Have you found a programming partner?
• Yes
• Not yet, but I may have a lead
• No
Recapping from last time
• We will program multicore processors with multithreading
• Multiple program counters
• A new storage class: shared data
• Synchronization may be needed when updating shared state (thread safety)
[Figure: shared memory holds s and y; each processor P0, P1, …, Pn has its own private copy of i]
Hello world with <thread>

#include <thread>
#include <iostream>
using namespace std;

void Hello(int TID) {
    cout << "Hello from thread " << TID << endl;
}

int main(int argc, char *argv[]) {
    // NT = number of threads, taken from the command line (elided on the slide)
    thread *thrds = new thread[NT];
    // Spawn threads
    for (int t = 0; t < NT; t++) {
        thrds[t] = thread(Hello, t);
    }
    // Join threads
    for (int t = 0; t < NT; t++)
        thrds[t].join();
}

Sample runs (note that the output order varies from run to run, and lines can even interleave):
$ ./hello_th 3
Hello from thread 0
Hello from thread 1
Hello from thread 2
$ ./hello_th 3
Hello from thread 1
Hello from thread 0
Hello from thread 2
$ ./hello_th 4
Running with 4 threads
Hello from thread 0
Hello from thread 3
Hello from thread
Hello from thread 21

$PUB/Examples/Threads/Hello-Th, PUB = /share/class/public/cse160-wi17
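A hedged build note (the compile command is not shown on the slide, and the file and binary names below are placeholders): programs using std::thread are typically compiled with C++11 or later and linked against pthreads, e.g.

$ g++ -std=c++11 -pthread hello.cpp -o hello_th
$ ./hello_th 3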
What things can threads do?
• Create even more threads
• Join with others created by the parent
• Run different code fragments
• Run in lockstep
• A, B & C
Steps in writing multithreaded code
• We write a thread function that gets called each time we spawn a new thread
• Spawn threads by constructing objects of class thread (in the C++ library)
• Each thread runs on a separate processing core (if more threads than cores, the threads share cores)
• Threads share memory: declare shared variables outside the scope of any function
• Divide up the computation fairly among the threads
• Join threads so we know when they are done
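A minimal sketch that pulls these steps together (this skeleton is illustrative only, not the course's code; NT, N, and the work done per element are assumptions made here, and NT would normally come from the command line):

#include <thread>
using namespace std;

const int NT = 4;                // number of threads (assumed fixed here)
const int N  = 1000;             // problem size (assumed)
int out[N];                      // shared data: declared outside any function

void worker(int TID) {           // thread function: runs once per spawned thread
    int i0 = TID * (N/NT), i1 = i0 + (N/NT);   // a fair share of the iterations
    for (int i = i0; i < i1; i++)
        out[i] = i * i;          // threads write disjoint elements, so no conflict
}

int main() {
    thread thrds[NT];
    for (int t = 0; t < NT; t++) thrds[t] = thread(worker, t);  // spawn
    for (int t = 0; t < NT; t++) thrds[t].join();               // join: all work is done
    return 0;
}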
Today's lecture
• A first application
• Performance characterization
• Data races
A first application
• Divide one array of numbers by another, pointwise
  for i = 0:N-1
      c[i] = a[i] / b[i];
• Partition arrays into intervals, assign each to a unique thread
• Each thread sweeps over a reduced problem
[Figure: arrays a and b partitioned among threads T0–T3; each thread divides its chunk of a by its chunk of b to produce its chunk of c]
Pointwise division of 2 arrays with threads

#include <thread>
using namespace std;

int *a, *b, *c;

void Div(int TID, int N, int NT) {
    int64_t i0 = TID * (N/NT), i1 = i0 + (N/NT);
    for (int r = 0; r < REPS; r++)
        for (int i = i0; i < i1; i++)
            c[i] = a[i] / b[i];
}

int main(int argc, char *argv[]) {
    thread *thrds = new thread[NT];
    // allocate a, b and c
    // Spawn threads
    for (int t = 0; t < NT; t++) {
        thrds[t] = thread(Div, t, N, NT);
    }
    // Join threads
    for (int t = 0; t < NT; t++)
        thrds[t].join();
}

Timings (via qlogin), N = 50,000,000 (50M):
$ ./div 1 50000000    0.3099 seconds
$ ./div 2 50000000    0.1980 seconds
$ ./div 4 50000000    0.1258 seconds
$ ./div 8 50000000    0.1185 seconds

$PUB/Examples/Threads/Div, PUB = /share/class/public/cse160-wi17
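One detail worth noting about this block partitioning: it assumes NT divides N evenly, otherwise the trailing N mod NT elements are never processed. A common variant (a sketch, not the course's code) hands the leftover iterations to the last thread:

int64_t i0 = TID * (N/NT);
int64_t i1 = (TID == NT-1) ? N : i0 + (N/NT);   // last thread also takes the remainder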
Why did the program run only a little faster on 8 cores than on 4?
• There wasn't enough work to give out, so some cores were starved
• Memory traffic is saturating the bus
• The workload is shared unevenly and not all cores are doing their fair share
Today's lecture
• A first application
• Performance characterization
• Data races
Measures of Performance
• Why do we measure performance?
• How do we report it?
  • Completion time
  • Processor time product: completion time × #processors
  • Throughput: amount of work that can be accomplished in a given amount of time
  • Relative performance, given a reference architecture or implementation (AKA speedup)
Parallel Speedup and Efficiency
• How much of an improvement did our parallel algorithm obtain over the serial algorithm?
• Define the parallel speedup SP = T1/TP, where T1 is the running time of the best serial program on 1 processor and TP is the running time of the parallel program on P processors
• T1 is defined as the running time of the "best serial algorithm": in general, not the running time of the parallel algorithm on 1 processor
• Definition: parallel efficiency EP = SP/P
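A quick worked example with hypothetical numbers (not from the slide): if T1 = 10 seconds and T4 = 4 seconds, then S4 = T1/T4 = 2.5 and E4 = S4/4 ≈ 0.63, i.e. on average each of the 4 processors is doing useful work about 63% of the time.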
What can go wrong with speedup?
• Not always an accurate way to compare different algorithms…
• … or the same algorithm running on different machines
• We might be able to obtain a better running time even if we lower the speedup
• If our goal is performance, the bottom line is running time TP
Program P gets a higher speedup on machine A than on machine B. Does the program run faster on machine A or B?
• A
• B
• Can't say
Superlinear speedup
• We have a super-linear speedup when EP > 1, i.e. SP > P
• Is it believable?
• Super-linear speedups are often an artifact of inappropriate measurement technique
• Where there is a super-linear speedup, a better serial algorithm may be lurking
What is the maximum possible speedup of any program running on 2 cores?
• 1
• 2
• 4
• 10
• None of these
Scalability
• A computation is scalable if performance increases as a "nice function" of the number of processors, e.g. linearly
• In practice scalability can be hard to achieve
  ► Serial sections: code that runs on only one processor
  ► "Non-productive" work associated with parallel execution, e.g. synchronization
  ► Load imbalance: uneven work assignments over the processors
• Some algorithms present intrinsic barriers to scalability, leading to alternatives
  for i = 0:n-1
      sum = sum + x[i]
Serial Sections
• Serial sections limit scalability
• Let f = the fraction of T1 that runs serially
  • T1 = f T1 + (1-f) T1
  • TP = f T1 + (1-f) T1/P
• Thus SP = 1/[f + (1-f)/P]
• As P → ∞, SP → 1/f
• This is known as Amdahl's Law (1967)
[Figure: T1 with its serial fraction f highlighted]
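To see how limiting even a small serial fraction is, here is an illustrative calculation (the numbers are not on the slide): with f = 0.1, SP = 1/[0.1 + 0.9/P], so S8 ≈ 4.7, S64 ≈ 8.8, and SP can never exceed 1/f = 10 no matter how many processors are used.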
Amdahl's law (1967)
• A serial section limits scalability
• Let f = fraction of T1 that runs serially
• Amdahl's Law (1967): as P → ∞, SP → 1/f
[Figure: speedup vs. number of processors for f = 0.1, 0.2, 0.3]
Performance questions
• You observe the following running times for a parallel program running a fixed workload N
• Assume that the only losses are due to serial sections
• What are the speedup and efficiency on 2 processors?
• What is the maximum possible speedup on an infinite number of processors?
• What is the running time on 4 processors?
  Recall: SP = 1/[f + (1 - f)/P]
[Table of observed running times; the values used on the next slide are T1 = 1.0 s and T2 = 0.6 s]
Performance questions
• You observe the following running times for a parallel program running a fixed workload, and the only losses are due to serial sections
• What are the speedup and efficiency on 2 processors?
  • S2 = T1/T2 = 1.0/0.6 = 5/3 ≈ 1.67; E2 = S2/2 ≈ 0.83
• What is the maximum possible speedup on an infinite number of processors?
  • Recall SP = 1/[f + (1 - f)/P]
  • To compute the max speedup, we need to determine f. To determine f, we plug in known values (S2 and P):
    5/3 = 1/[f + (1-f)/2] ⟹ 3/5 = f + (1-f)/2 ⟹ f = 1/5
  • So what is S∞?
• What is the running time on 4 processors?
  • Plugging values into the SP expression: S4 = 1/[1/5 + (4/5)/4] ⟹ S4 = 5/2
  • But S4 = T1/T4, so T4 = T1/S4
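Completing the arithmetic (the slide leaves these last two values for the reader, but they follow directly from the expressions above): S∞ = 1/f = 5, and since S4 = 5/2, T4 = T1/S4 = 1.0/2.5 = 0.4 seconds.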
Weak scaling
• Is Amdahl's law pessimistic?
• Observation: Amdahl's law assumes that the workload (W) remains fixed
• But parallel computers are used to tackle more ambitious workloads
• If we increase W with P we have weak scaling
  • f often decreases with W
  • We can continue to enjoy speedups
• Gustafson's law [1992]
  • http://en.wikipedia.org/wiki/Gustafson's_law
  • www.scl.ameslab.gov/Publications/Gus/FixedTime/FixedTime.pdf
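For reference (the standard statement, not spelled out on the slide): if f is the fraction of the scaled run that is serial and the parallel part grows with P, Gustafson's law gives a scaled speedup of Sscaled = f + (1 - f)·P = P - (P - 1)·f, which grows roughly linearly in P rather than saturating at 1/f.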
Isoefficiency
• The consequence of Gustafson's observation is that we increase N with P
• We can maintain constant efficiency so long as we increase N appropriately
• The isoefficiency function specifies the growth of N in terms of P
• If N is linear in P, we have a scalable computation
• If not, memory per core grows with P!
• Problem: the amount of memory per core is shrinking over time
Today's lecture
• A first application
• Performance characterization
• Data races
Summing a list of integers
• for i = 0:N-1
      sum = sum + x[i];
• Partition x[ ] into intervals, assign each to a unique thread
• Each thread sweeps over a reduced problem
[Figure: x partitioned among threads T0–T3; each thread computes a local ∑ and these are combined into a global ∑]
First version of summing code

int *x;
int64_t global_sum;

void sum(int TID, int N, int NT) {
    int64_t i0 = TID * (N/NT), i1 = i0 + (N/NT);
    int64_t local_sum = 0;
    for (int64_t i = i0; i < i1; i++)
        local_sum += x[i];
    global_sum += local_sum;
}

Main():
for (int64_t t = 0; t < NT; t++) {
    thrds[t] = thread(sum, t, N, NT);
}
Results
• The program usually runs correctly
• But sometimes it produces incorrect results:
  Result verified to be INCORRECT, should be 549756338176
• What happened? There is a conflict when updating global_sum: a data race
    void sum(int TID, int N, int NT) {
        …
        global_sum += local_sum;
    }
[Figure: global_sum lives on the shared heap; each processor P has its own private stack]
Data Race
• Consider the following thread function, where x is shared and initially 0:
    void threadFn(int TID) {
        x++;
    }
• Let's run on 2 threads
• What is the value of x after both threads have joined?
• A data race arises because the timing of accesses to shared data can affect the outcome
• We say we have a non-deterministic computation
• This is true because we have a side effect (changes to global variables, I/O and random number generators)
• Normally, if we repeat a computation using the same inputs we expect to obtain the same results
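A minimal, self-contained sketch (not from the slides) that makes the race easy to observe by having each thread repeat the increment many times; with unsynchronized updates the final count usually comes out below 2,000,000:

#include <iostream>
#include <thread>

int x = 0;                       // shared counter, initially 0

void threadFn(int TID) {
    for (int i = 0; i < 1000000; i++)
        x++;                     // unsynchronized read-modify-write: the data race
}

int main() {
    std::thread t0(threadFn, 0), t1(threadFn, 1);
    t0.join();
    t1.join();
    // Often prints a value less than 2000000 because increments are lost
    std::cout << "x = " << x << std::endl;
    return 0;
}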
Under the hood of a race condition
• Assume x is initially 0
    x = x + 1;
• Generated assembly code:
    r1 ← (x)
    r1 ← r1 + #1
    (x) ← r1
• Possible interleaving with two threads P1 and P2 (each has its own private r1):
    P1: r1 ← x          // P1's r1 gets 0
    P2: r1 ← x          // P2's r1 also gets 0
    P1: r1 ← r1 + #1    // P1's r1 set to 1
    P2: r1 ← r1 + #1    // P2's r1 set to 1
    P1: x ← r1          // P1 writes its r1
    P2: x ← r1          // P2 writes its r1
• Both threads write 1, so x ends up as 1 instead of 2
Avoiding the data race in summation
• Perform the global summation in main()
• After a thread joins, add its contribution to the global sum, one thread at a time
• We need to wrap std::ref() around reference arguments (int64_t &); the compiler needs a hint*

void sum(int TID, int N, int NT, int64_t& localSum) {
    ….
    for (int i = i0; i < i1; i++)
        localSum += x[i];
}

int64_t global_sum;
…
int64_t *locSums = new int64_t[NT];
for (int t = 0; t < NT; t++)
    thrds[t] = thread(sum, t, N, NT, ref(locSums[t]));
for (int t = 0; t < NT; t++) {
    thrds[t].join();
    global_sum += locSums[t];
}

* Williams, pp. 23-4
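A note on why the two ingredients matter: std::thread copies its arguments into the new thread by default, so without std::ref each thread would update a private copy of locSums[t] and main() would read back nothing useful; and because each thread writes only its own slot of locSums while main() adds the contributions after join(), no two threads ever update the same location, so the data race is gone.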
Next time
• Avoiding data races
• Passing arguments by reference
• The Mandelbrot set computation