Parallelization at a Glance Christian Terboven <terboven@rz.rwth-aachen.de> 20.11.2012 / Aachen, Germany As of: 19.11.2012, Version 2.3
Agenda • Basic Concepts • Parallelization Strategies • Possible Issues • Efficient Parallelization
Processes and Threads • Process: Instance of a computer program • Thread: A sequence of instructions (own program counter) • Parallel / concurrent execution: multiple processes, multiple threads • Per process: address space, files, environment, … • Per thread: program counter, stack, …
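To make the distinction concrete, here is a minimal pthreads sketch (my illustration, not from the slides): both threads see the same address for the process-wide variable, but each one has its own stack.

```c
#include <pthread.h>
#include <stdio.h>

int shared = 0;   /* per process: one copy, visible to every thread */

static void *worker(void *arg) {
    int local = 42;   /* per thread: lives on this thread's own stack */
    printf("thread %s: &shared = %p, &local = %p\n",
           (char *)arg, (void *)&shared, (void *)&local);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, "A");
    pthread_create(&t2, NULL, worker, "B");
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;   /* both threads print the same &shared, different &local */
}
```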
Process and Thread Scheduling by the OS • Even on a multi-socket multi-core system you should not make any assumptions about which process / thread is executed when and where! • Two threads on one core: the OS time-slices between them • Two threads on two cores: the OS may still migrate threads between cores unless they are "pinned" to specific cores
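Pinning is platform-specific; as an illustrative sketch (assuming Linux and glibc), a thread can fix itself to one core with pthread_setaffinity_np, after which the scheduler can no longer migrate it. OpenMP programs can get a similar effect via the OMP_PROC_BIND environment variable.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);   /* allow this thread to run on core 0 only */
    /* Linux-specific ("_np" = non-portable): pin the calling thread */
    int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rc != 0)
        fprintf(stderr, "pinning failed (error %d)\n", rc);
    else
        printf("thread pinned to core 0, no migration possible\n");
    return 0;
}
```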
Shared Memory Parallelization • Memory can be accessed by several threads running on different cores in a multi-socket multi-core system: for example, CPU1 executes a = 4 while CPU2 executes c = 3 + a on the shared variable a
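A hedged OpenMP sketch of this situation (my illustration): with a barrier between the write and the read, the second thread reliably sees a = 4; without it, this would be a data race (covered below).

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    int a = 0, c = 0;   /* shared between the threads below */
    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0)
            a = 4;            /* "CPU1" writes a */
        #pragma omp barrier   /* the write completes before anyone reads */
        if (omp_get_thread_num() == 1)
            c = 3 + a;        /* "CPU2" reads a */
    }
    printf("c = %d\n", c);    /* deterministically 7 */
    return 0;
}
```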
Finding Concurrency • Chances for concurrent execution: • Look for tasks that can be executed simultaneously (task parallelism) • Decompose data into distinct chunks to be processed independently (data parallelism) • Organize by task: task parallelism, divide and conquer • Organize by data decomposition: geometric decomposition, recursive data • Organize by flow of data: pipeline, event-based coordination
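A minimal OpenMP sketch of both flavors (my illustration, not from the slides): a parallel loop distributes chunks of data, while sections run different, independent activities side by side.

```c
#include <stdio.h>
#include <omp.h>

#define N 8

int main(void) {
    int x[N];

    /* data parallelism: the same operation on distinct chunks of data */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        x[i] = i * i;

    /* task parallelism: different, independent activities side by side */
    #pragma omp parallel sections
    {
        #pragma omp section
        printf("section 1: could compute a checksum of x\n");
        #pragma omp section
        printf("section 2: could write x to a file\n");
    }
    return 0;
}
```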
Divide and conquer • Split the problem into subproblems, and split those again until they are small enough to solve directly • Solve the subproblems (potentially concurrently), then merge the subsolutions step by step into the overall solution
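As an illustrative sketch of this pattern (not the slide's own code), a recursive array sum with OpenMP tasks: split the range, solve both halves concurrently, merge by adding. The cut-off of 1000 elements is an arbitrary assumption.

```c
#include <stdio.h>
#include <omp.h>

/* split the range in half, solve both halves concurrently, merge */
static long sum(const int *a, int lo, int hi) {
    if (hi - lo < 1000) {              /* small enough: solve directly */
        long s = 0;
        for (int i = lo; i < hi; i++)
            s += a[i];
        return s;
    }
    int mid = lo + (hi - lo) / 2;
    long left, right;
    #pragma omp task shared(left)      /* left half as a concurrent task */
    left = sum(a, lo, mid);
    right = sum(a, mid, hi);           /* right half in the current task */
    #pragma omp taskwait               /* both subsolutions must exist */
    return left + right;               /* merge */
}

int main(void) {
    enum { N = 100000 };
    static int a[N];
    for (int i = 0; i < N; i++)
        a[i] = 1;
    long s = 0;
    #pragma omp parallel
    #pragma omp single                 /* one thread starts the recursion */
    s = sum(a, 0, N);
    printf("sum = %ld\n", s);          /* 100000 */
    return 0;
}
```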
Geometric decomposition • Decompose the data domain into chunks that are processed independently • Example: ventricular assist device (VAD)
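A minimal sketch of the pattern (assuming a simple regular grid rather than the VAD geometry): each thread explicitly owns a contiguous slab of rows and updates only that chunk.

```c
#include <stdio.h>
#include <omp.h>

#define NX 400
#define NY 400

static double grid[NX][NY];

int main(void) {
    #pragma omp parallel
    {
        int t  = omp_get_thread_num();
        int nt = omp_get_num_threads();
        int lo = (NX * t) / nt;          /* thread t owns rows lo..hi-1 */
        int hi = (NX * (t + 1)) / nt;
        for (int i = lo; i < hi; i++)    /* update only the owned slab */
            for (int j = 0; j < NY; j++)
                grid[i][j] = i + j;
    }
    printf("grid[%d][%d] = %g\n", NX - 1, NY - 1, grid[NX - 1][NY - 1]);
    return 0;
}
```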
Pipeline • Analogy: assembly line • Assign different stages to different PEs (processing elements) • Work items C1…C5 flow through stages 1-4 over time, so different items are being processed in different stages concurrently
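One way to express a pipeline in shared memory is OpenMP task dependencies (4.0 or later); this is an illustrative sketch, not the slide's example. Each item passes through its three stages in order, while different items may overlap.

```c
#include <stdio.h>
#include <omp.h>

#define N 5   /* five work items C1..C5, as in the slide */

int main(void) {
    int buf1[N], buf2[N];   /* hand-off buffers between the stages */
    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < N; i++) {
        #pragma omp task depend(out: buf1[i])                     /* stage 1 */
        buf1[i] = i * i;
        #pragma omp task depend(in: buf1[i]) depend(out: buf2[i]) /* stage 2 */
        buf2[i] = buf1[i] + 1;
        #pragma omp task depend(in: buf2[i])                      /* stage 3 */
        printf("item %d -> %d\n", i, buf2[i]);
    }
    return 0;
}
```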
Further parallel design patterns • Recursive data pattern: parallelize operations on recursive data structures (see example later on: Fibonacci) • Event-based coordination pattern: decomposition into semi-independent tasks interacting in an irregular fashion (see example later on: while-loop with tasks)
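A sketch of the kind of Fibonacci example the slides refer to, using OpenMP tasks on the recursive call tree (the cut-off n > 15 is an arbitrary assumption to limit task overhead):

```c
#include <stdio.h>
#include <omp.h>

static long fib(int n) {
    if (n < 2)
        return n;
    long a, b;
    #pragma omp task shared(a) if(n > 15)   /* no extra tasks for tiny calls */
    a = fib(n - 1);
    b = fib(n - 2);
    #pragma omp taskwait                    /* wait for the child task */
    return a + b;
}

int main(void) {
    long r = 0;
    #pragma omp parallel
    #pragma omp single
    r = fib(20);
    printf("fib(20) = %ld\n", r);           /* 6765 */
    return 0;
}
```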
Data Races / Race Conditions • Data race: concurrent access to the same memory location by multiple threads without proper synchronization • Let variable x have the initial value 1; one thread executes x = 5; while another executes printf("%d\n", x); • Depending on which thread is faster, you will see either 1 or 5 • The result is nondeterministic (i.e., it depends on OS scheduling) • Data races (and how to detect and avoid them) will be covered in more detail later!
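The two statements above as a runnable OpenMP sketch (my illustration): run it repeatedly and the printed value flips between 1 and 5 depending on scheduling.

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    int x = 1;                        /* initial value 1 */
    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0)
            x = 5;                    /* one thread writes ... */
        else
            printf("%d\n", x);        /* ... the other reads: 1 or 5 */
    }
    return 0;
}
```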
Serialization • Serialization: when threads / processes wait "too much" • Limited scalability, if any at all • Simple (and stupid) example: two processes repeatedly exchange data (send / receive / wait / calculate), so each one spends much of its time waiting for the other instead of computing
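The slide sketches message passing; an analogous shared-memory sketch (my illustration) is a critical region that admits one thread at a time, so adding threads cannot speed anything up.

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    double t0 = omp_get_wtime();
    #pragma omp parallel
    {
        /* only one thread at a time may enter: the "parallel" work
           below is executed strictly one thread after the other */
        #pragma omp critical
        {
            double s = 0.0;
            for (int i = 0; i < 10000000; i++)
                s += i * 0.5;
            printf("thread %d done (s = %g)\n", omp_get_thread_num(), s);
        }
    }
    printf("elapsed: %.3f s (does not shrink with more threads)\n",
           omp_get_wtime() - t0);
    return 0;
}
```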
Parallelization Overhead • Overhead introduced by the parallelization: • Time to start / end / manage threads • Time to send / exchange data • Time spent in synchronization of threads / processes • With parallelization: the total CPU time increases, the wall time decreases, and the system time stays the same • Efficient parallelization is about minimizing the overhead introduced by the parallelization itself!
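Thread-management overhead can be made visible with a simple microbenchmark (an illustrative sketch): time many empty parallel regions, so all that is measured is the fork/join cost.

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    enum { REPS = 1000 };
    double t0 = omp_get_wtime();
    for (int r = 0; r < REPS; r++) {
        #pragma omp parallel
        { /* no work: only thread start/end/management remains */ }
    }
    double t1 = omp_get_wtime();
    printf("average fork/join overhead: %g s\n", (t1 - t0) / REPS);
    return 0;
}
```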
Load Balancing • Perfect load balancing: all threads / processes finish at the same time • Load imbalance: some threads / processes take longer than others • But: all threads / processes have to wait for the slowest one, which thus limits the scalability
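A hedged OpenMP sketch of load balancing: the iterations below take deliberately uneven time, and schedule(dynamic, 1) hands them out one by one at run time instead of splitting them into fixed, equal-sized chunks.

```c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>

int main(void) {
    /* iterations deliberately take uneven time */
    #pragma omp parallel for schedule(dynamic, 1)
    for (int i = 0; i < 16; i++) {
        usleep((i % 4) * 10000);   /* simulate uneven work */
        printf("iteration %2d ran on thread %d\n", i, omp_get_thread_num());
    }
    return 0;
}
```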
Speedup and Efficiency • Time using 1 CPU: T(1) • Time using p CPUs: T(p) • Speedup S: S(p) = T(1)/T(p); it measures how much faster the parallel computation is! • Efficiency E: E(p) = S(p)/p • Example: T(1) = 6 s, T(2) = 4 s, so S(2) = 1.5 and E(2) = 0.75 • Ideal case: T(p) = T(1)/p, so S(p) = p and E(p) = 1.0
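The slide's example as a few lines of C (using the given T(1) = 6 s and T(2) = 4 s):

```c
#include <stdio.h>

int main(void) {
    double t1 = 6.0, tp = 4.0;   /* T(1) = 6 s, T(2) = 4 s */
    int p = 2;
    double s = t1 / tp;          /* speedup    S(p) = T(1)/T(p) */
    double e = s / p;            /* efficiency E(p) = S(p)/p    */
    printf("S(%d) = %.2f, E(%d) = %.2f\n", p, s, p, e);  /* 1.50, 0.75 */
    return 0;
}
```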
Amdahl's Law • Describes the influence of the serial part on scalability (without taking any overhead into account) • S(p) = T(1)/T(p) = T(1)/(f*T(1) + (1-f)*T(1)/p) = 1/(f + (1-f)/p) • f: serial part (0 ≤ f ≤ 1) • T(1): time using 1 CPU • T(p): time using p CPUs • S(p): speedup; S(p) = T(1)/T(p) • E(p): efficiency; E(p) = S(p)/p • It is rather easy to scale to a small number of cores, but any parallelization is limited by the serial part of the program!
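Amdahl's law as a small C function (an illustrative sketch; f = 0.2 matches the 80%-parallel example on the next slide):

```c
#include <stdio.h>

/* upper bound on speedup with serial fraction f on p CPUs */
static double amdahl(double f, int p) {
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void) {
    for (int p = 1; p <= 1024; p *= 2)
        printf("p = %4d  S = %.2f\n", p, amdahl(0.2, p));
    printf("limit for p -> infinity: S = %.1f\n", 1.0 / 0.2);  /* 5.0 */
    return 0;
}
```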
Amdahl's Law illustrated • If 80% (measured in program runtime) of your work can be parallelized and "just" 20% still runs sequentially, then your speedup will be: • 1 processor: time 100%, speedup 1 • 2 processors: time 60%, speedup 1.7 • 4 processors: time 40%, speedup 2.5 • infinitely many processors: time 20%, speedup 5
Speedup in Practice • After the initial parallelization of a program, you will typically see speedup curves like this: [figure: speedup S over the number of processors p = 1…8, comparing the ideal speedup S(p) = p, the speedup according to Amdahl's law, and a realistic speedup that stays below both]