CSE-160 Lecture 2

This lecture covers the basics of programming multicore processors using multithreading. Topics include shared memory, synchronization, and writing multithreaded code. An example application and performance characterization are also discussed.


Presentation Transcript


  1. CSE-160 Lecture 2

  2. Announcements • A1 posted by 9AM on Tuesday morning, probably sooner; will announce via Piazza • TA lab/office hours starting next week: will be posted by Sunday afternoon • Remember Kesden’s office hours are posted on the web, and he is often available outside of office hours. See his schedule.

  3. Have you found a programming partner? • Yes • Not yet, but I may have a lead • No

  4. Recapping from last time • We will program multicore processors with multithreading • Multiple program counters • A new storage class: shared data • Synchronization may be needed when updating shared state (thread safety) • [Diagram: shared memory holding s, visible to all processors P0, P1, …, Pn; private memory holding each thread’s own i]

  5. Hello world with <thread>

  #include <thread>
  #include <iostream>
  using namespace std;

  void Hello(int TID) {
      cout << "Hello from thread " << TID << endl;
  }

  int main(int argc, char *argv[]) {
      // NT = number of threads (obtained from the command line; parsing not shown)
      thread *thrds = new thread[NT];
      // Spawn threads
      for (int t = 0; t < NT; t++) {
          thrds[t] = thread(Hello, t);
      }
      // Join threads
      for (int t = 0; t < NT; t++)
          thrds[t].join();
  }

  Sample runs (thread output interleaves nondeterministically):
  $ ./hello_th 3
  Hello from thread 0
  Hello from thread 1
  Hello from thread 2
  $ ./hello_th 3
  Hello from thread 1
  Hello from thread 0
  Hello from thread 2
  $ ./hello_th 4
  Running with 4 threads
  Hello from thread 0
  Hello from thread 3
  Hello from thread Hello from thread 21   (output of threads 1 and 2 interleaved)

  $PUB/Examples/Threads/Hello-Th, PUB = /share/class/public/cse160-wi17
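  A note on building these examples (not on the original slide): with g++ or clang++ on Linux, std::thread programs must be compiled with a C++11 (or later) standard flag and linked with -pthread; the source file name hello_th.cpp is my assumption.

  g++ -std=c++11 -O2 -pthread hello_th.cpp -o hello_th
  ./hello_th 3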

  6. What things can threads do? • Create even more threads • Join with others created by the parent • Run different code fragments • Run in lockstep • A, B & C

  7. Steps in writing multithreaded code • We write a thread function that gets called each time we spawn a new thread • Spawn threads by constructing objects of class thread (in the C++ library) • Each thread runs on a separate processing core • (If more threads than cores, the threads share cores) • Threads share memory; declare shared variables outside the scope of any function • Divide up the computation fairly among the threads • Join threads so we know when they are done
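  As a concrete illustration of these steps, a minimal sketch of mine (not from the slides; the names worker and shared_data are illustrative): shared data is declared at file scope, a thread function does each thread's share of the work, and main() spawns and then joins the threads.

  #include <thread>
  #include <iostream>

  const int NT = 4;              // number of threads (fixed here for simplicity)
  int shared_data[NT];           // shared: declared outside any function

  void worker(int TID) {         // thread function, run once per spawned thread
      shared_data[TID] = TID * TID;    // each thread updates only its own slot
  }

  int main() {
      std::thread thrds[NT];
      for (int t = 0; t < NT; t++)         // spawn: construct thread objects
          thrds[t] = std::thread(worker, t);
      for (int t = 0; t < NT; t++)         // join: wait for all threads to finish
          thrds[t].join();
      for (int t = 0; t < NT; t++)
          std::cout << shared_data[t] << " ";
      std::cout << std::endl;              // prints: 0 1 4 9
  }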

  8. Today’s lecture • A first application • Performance characterization • Data races

  9. A first application • Divide one array of numbers by another, pointwise • for i = 0:N-1 • c[i] = a[i] / b[i]; • Partition the arrays into intervals, assign each to a unique thread • Each thread sweeps over a reduced problem • [Diagram: arrays a and b split into four blocks; threads T0–T3 each compute their block of c = a ÷ b]

  10. Pointwise division of 2 arrays with threads

  #include <thread>
  int *a, *b, *c;   // shared arrays

  void Div(int TID, int N, int NT) {
      int64_t i0 = TID * (N/NT), i1 = i0 + (N/NT);
      for (int r = 0; r < REPS; r++)
          for (int i = i0; i < i1; i++)
              c[i] = a[i] / b[i];
  }

  int main(int argc, char *argv[]) {
      // N, NT and REPS are set elsewhere (e.g. from the command line; not shown)
      thread *thrds = new thread[NT];
      // allocate a, b and c
      // Spawn threads
      for (int t = 0; t < NT; t++) {
          thrds[t] = thread(Div, t, N, NT);
      }
      // Join threads
      for (int t = 0; t < NT; t++)
          thrds[t].join();
  }

  Timings on an interactive (qlogin) node, N = 50000000 (50M):
  $ ./div 1 50000000    0.3099 seconds
  $ ./div 2 50000000    0.1980 seconds
  $ ./div 4 50000000    0.1258 seconds
  $ ./div 8 50000000    0.1185 seconds

  $PUB/Examples/Threads/Div, PUB = /share/class/public/cse160-wi17
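  Working out the speedups implied by these timings (my arithmetic, not on the slide), relative to the 1-thread run:

  S2 = 0.3099 / 0.1980 ≈ 1.57
  S4 = 0.3099 / 0.1258 ≈ 2.46
  S8 = 0.3099 / 0.1185 ≈ 2.62

  Going from 4 to 8 threads buys almost nothing, which is the question posed on the next slide.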

  11. Why did the program run only a little faster on 8 cores than on 4? • There wasn’t enough work to give out, so some cores were starved • Memory traffic is saturating the bus • The workload is shared unevenly and not all cores are doing their fair share

  12. Today’s lecture • A first application • Performance characterization • Data races

  13. Measures of performance • Why do we measure performance? • How do we report it? • Completion time • Processor time product: completion time × #processors • Throughput: amount of work that can be accomplished in a given amount of time • Relative performance, given a reference architecture or implementation • AKA speedup

  14. Parallel speedup and efficiency • How much of an improvement did our parallel algorithm obtain over the serial algorithm? • Define the parallel speedup SP = T1/TP = (running time of the best serial program on 1 processor) / (running time of the parallel program on P processors) • T1 is defined as the running time of the “best serial algorithm”; in general this is not the running time of the parallel algorithm on 1 processor • Definition: parallel efficiency EP = SP/P
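  A quick worked example (numbers chosen for illustration, not from the slide): if the best serial program runs in T1 = 10 seconds and the parallel program runs in T4 = 4 seconds on 4 processors, then

  S4 = T1/T4 = 10/4 = 2.5 and E4 = S4/4 = 0.625,

  i.e. we got 2.5× faster while keeping the 4 processors about 62% busy doing useful work.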

  15. What can go wrong with speedup? • Not always an accurate way to compare different algorithms… • …or the same algorithm running on different machines • We might be able to obtain a better running time even if we lower the speedup • If our goal is performance, the bottom line is running time TP

  16. Program P gets a higher speedup on machine A than on machine B. Does the program run faster on machine A or B? • A • B • Can’t say

  17. Superlinear speedup • We have a super-linear speedup when EP > 1, i.e. SP > P • Is it believable? • Super-linear speedups are often an artifact of inappropriate measurement technique • Where there is a super-linear speedup, a better serial algorithm may be lurking

  18. What is the maximum possible speedup of any program running on 2 cores? • 1 • 2 • 4 • 10 • None of these

  19. Scalability • A computation is scalable if performance increases as a “nice function” of the number of processors, e.g. linearly • In practice scalability can be hard to achieve • ► Serial sections: code that runs on only one processor • ► “Non-productive” work associated with parallel execution, e.g. synchronization • ► Load imbalance: uneven work assignments over the processors • Some algorithms present intrinsic barriers to scalability, leading to alternatives, e.g. for i = 0:n-1, sum = sum + x[i]

  20. Serial sections • Serial sections limit scalability • Let f = the fraction of T1 that runs serially • T1 = f·T1 + (1-f)·T1 • TP = f·T1 + (1-f)·T1/P • Thus SP = 1/[f + (1-f)/P] • As P → ∞, SP → 1/f • This is known as Amdahl’s Law (1967) • [Diagram: T1 divided into a serial fraction f and a parallelizable fraction 1-f]
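  To get a feel for the formula, here is a small sketch of mine (not part of the course code) that tabulates SP = 1/[f + (1-f)/P] for the serial fractions plotted on the next slide; even with f = 0.1, the speedup can never exceed 1/f = 10 no matter how many processors are added.

  #include <cstdio>

  // Amdahl's Law: speedup on P processors when a fraction f of T1 is serial
  double amdahl(double f, int P) {
      return 1.0 / (f + (1.0 - f) / P);
  }

  int main() {
      const double fs[] = {0.1, 0.2, 0.3};          // serial fractions
      const int    Ps[] = {2, 4, 8, 16, 64, 1024};  // processor counts
      for (double f : fs) {
          printf("f = %.1f:", f);
          for (int P : Ps)
              printf("  S%d = %.2f", P, amdahl(f, P));
          printf("   (limit 1/f = %.1f)\n", 1.0 / f);
      }
  }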

  21. Amdahl’s law (1967) • A serial section limits scalability • Let f = fraction of T1 that runs serially • Amdahl’s Law (1967): as P → ∞, SP → 1/f • [Plot: speedup SP versus number of processors P for serial fractions f = 0.1, 0.2, 0.3]

  22. Performance questions • You observe the following running times for a parallel program running a fixed workload N, where the only losses are due to serial sections: [Table: T1 = 1.0 s, T2 = 0.6 s, T4 = ?] • What are the speedup and efficiency on 2 processors? • What is the maximum possible speedup on an infinite number of processors? (SP = 1/[f + (1-f)/P]) • What is the running time on 4 processors?

  23. Performance questions • You observe the following running times for a parallel program running a fixed workload, and the only losses are due to serial sections • What are the speedup and efficiency on 2 processors? S2 = T1/T2 = 1.0/0.6 = 5/3 ≈ 1.67; E2 = S2/2 ≈ 0.83 • What is the maximum possible speedup on an infinite number of processors? To compute the max speedup, we need to determine f. To determine f, we plug the known values (S2 and P = 2) into SP = 1/[f + (1-f)/P]: 5/3 = 1/[f + (1-f)/2] ⟹ 3/5 = f + (1-f)/2 ⟹ f = 1/5. So what is S∞? • What is the running time on 4 processors? Plugging values into the SP expression: S4 = 1/[1/5 + (4/5)/4] = 5/2. But S4 = T1/T4, so T4 = T1/S4
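  Finishing the arithmetic left as questions above (my working, not on the slide): with f = 1/5, the limiting speedup is S∞ = 1/f = 5, so no number of processors can make this program more than 5× faster than the serial version; and T4 = T1/S4 = 1.0/2.5 = 0.4 seconds.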

  24. Weak scaling • Is Amdahl’s law pessimistic? • Observation: Amdahl’s law assumes that the workload (W) remains fixed • But parallel computers are used to tackle more ambitious workloads • If we increase W with P we have weak scaling • f often decreases with W • We can continue to enjoy speedups • Gustafson’s law [1992] • http://en.wikipedia.org/wiki/Gustafson's_law • www.scl.ameslab.gov/Publications/Gus/FixedTime/FixedTime.pdf
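  For reference, one common statement of Gustafson’s law (quoted here from general knowledge rather than from the slide): if f is the serial fraction of the time measured on the P-processor run, the scaled speedup is

  S(P) = f + (1 - f)·P = P - f·(P - 1),

  which grows roughly linearly in P instead of saturating at 1/f, because the parallel part of the workload grows with P while the serial part stays fixed.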

  25. Isoefficiency • The consequence of Gustafson’s observation is that we increase N with P • We can maintain constant efficiency so long as we increase N appropriately • The isoefficiency function specifies the growth of N in terms of P • If N is linear in P, we have a scalable computation • If not, memory per core grows with P! • Problem: the amount of memory per core is shrinking over time

  26. Today’s lecture • A first application • Performance characterization • Data races

  27. Summing a list of integers • for i = 0:N-1 • sum = sum + x[i]; • Partition x[ ] into intervals, assign each to a unique thread • Each thread sweeps over a reduced problem • [Diagram: x split across threads T0–T3; each computes a local ∑, and the local sums are combined into a global ∑]

  28. First version of summing code

  int *x;                    // shared input array
  int64_t global_sum;        // shared

  void sum(int TID, int N, int NT) {
      int64_t i0 = TID*(N/NT), i1 = i0 + (N/NT);
      int64_t local_sum = 0;
      for (int64_t i = i0; i < i1; i++)
          local_sum += x[i];
      global_sum += local_sum;   // every thread updates the shared global_sum
  }

  Main():
      for (int t = 0; t < NT; t++) {
          thrds[t] = thread(sum, t, N, NT);
      }

  29. Results • The program usually runs correctly • But sometimes it produces incorrect results: “Result verified to be INCORRECT, should be 549756338176” • What happened? The update global_sum += local_sum in sum() is the culprit: there is a conflict when updating global_sum, a data race • [Diagram: global_sum lives on the shared heap; each processor P has its own private stack]

  30. Data race • Consider the following thread function, where x is shared and initially 0: void threadFn(int TID) { x++; } • Let’s run on 2 threads • What is the value of x after both threads have joined? • A data race arises because the timing of accesses to shared data can affect the outcome • We say we have a non-deterministic computation • This is true because we have a side effect (changes to global variables, I/O, and random number generators) • Normally, if we repeat a computation using the same inputs, we expect to obtain the same results
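  A sketch of my own (not from the slides) that makes the race easy to observe: with a single increment per thread the lost update is rare, so this version has each thread increment x a million times; on a multicore machine the final value usually comes out below the expected 2000000.

  #include <thread>
  #include <iostream>

  int x = 0;                        // shared, initially 0

  void threadFn(int TID) {
      for (int i = 0; i < 1000000; i++)
          x++;                      // unsynchronized read-modify-write: a data race
  }

  int main() {
      std::thread t0(threadFn, 0), t1(threadFn, 1);
      t0.join();
      t1.join();
      // Would be 2000000 if the increments never interleaved; typically prints less.
      std::cout << "x = " << x << std::endl;
  }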

  31. Under the hood of a race condition • Assume x is initially 0 • x = x + 1; • Generated assembly code: r1 ← (x); r1 ← r1 + #1; (x) ← r1 • Possible interleaving with two threads P1 and P2:

  P1: r1 ← x           P1’s r1 gets 0
  P2: r1 ← x           P2’s r1 also gets 0
  P1: r1 ← r1 + #1     P1’s r1 set to 1
  P2: r1 ← r1 + #1     P2’s r1 set to 1
  P1: x ← r1           P1 writes its r1
  P2: x ← r1           P2 writes its r1

  • x ends up as 1, even though it was incremented by both threads

  32. Avoiding the data race in summation

  • Perform the global summation in main()
  • After a thread joins, add its contribution to the global sum, one thread at a time
  • We need to wrap std::ref() around reference arguments (int64_t &); the compiler needs a hint*

  int64_t global_sum;
  …
  void sum(int TID, int N, int NT, int64_t& localSum) {
      ….
      for (int i = i0; i < i1; i++)
          localSum += x[i];
  }

  Main():
      int64_t *locSums = new int64_t[NT];
      for (int t = 0; t < NT; t++)
          thrds[t] = thread(sum, t, N, NT, ref(locSums[t]));
      for (int t = 0; t < NT; t++) {
          thrds[t].join();
          global_sum += locSums[t];
      }

  * Williams, pp. 23-4
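  Putting the pieces together, here is a self-contained sketch of this pattern (mine, not the course code; the array size and thread count are hard-coded for brevity, whereas the course examples presumably take them from the command line):

  #include <thread>
  #include <iostream>
  #include <functional>   // std::ref

  int *x;                              // shared input array
  int64_t global_sum = 0;              // updated only in main(), after each join

  void sum(int TID, int N, int NT, int64_t &localSum) {
      int64_t i0 = TID * (N / NT), i1 = i0 + (N / NT);
      for (int64_t i = i0; i < i1; i++)
          localSum += x[i];            // each thread writes only its own localSum
  }

  int main() {
      const int N = 1 << 20, NT = 4;   // illustrative sizes (N divisible by NT)
      x = new int[N];
      for (int i = 0; i < N; i++) x[i] = 1;

      std::thread *thrds = new std::thread[NT];
      int64_t *locSums = new int64_t[NT]();    // zero-initialized partial sums
      for (int t = 0; t < NT; t++)
          thrds[t] = std::thread(sum, t, N, NT, std::ref(locSums[t]));
      for (int t = 0; t < NT; t++) {
          thrds[t].join();                     // after the join, locSums[t] is safe to read
          global_sum += locSums[t];            // serial accumulation: no data race
      }
      std::cout << "sum = " << global_sum << std::endl;   // prints 1048576
      delete [] thrds; delete [] locSums; delete [] x;
  }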

  33. Next time • Avoiding data races • Passing arguments by reference • The Mandelbrot set computation
