Parallelization at a Glance Christian Terboven <terboven@rz.rwth-aachen.de> 20.11.2012 / Aachen, Germany As of: 19.11.2012, Version 2.3
Agenda • Basic Concepts • Parallelization Strategies • Possible Issues • Efficient Parallelization
Processes and Threads • Process: Instance of a computer program • Thread: A sequence of instructions (own program counter) • Parallel / concurrent execution: multiple processes, multiple threads • Per process: address space, files, environment, … • Per thread: program counter, stack, …
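To make the distinction concrete, here is a minimal pthreads sketch (my illustration, not from the slides): both threads see the same address for the process-wide variable, but each one has its own stack.

```c
#include <pthread.h>
#include <stdio.h>

int shared = 0;   /* per process: one copy, visible to every thread */

static void *worker(void *arg) {
    int local = 42;   /* per thread: lives on this thread's own stack */
    printf("thread %s: &shared = %p, &local = %p\n",
           (char *)arg, (void *)&shared, (void *)&local);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, "A");
    pthread_create(&t2, NULL, worker, "B");
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;   /* both threads print the same &shared, different &local */
}
```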
Process and Thread Scheduling by the OS • Even on a multi-socket multi-core system you should not make any assumptions about which process / thread is executed when and where! • Two threads on one core: the OS time-slices between them • Two threads on two cores: the OS may still migrate threads between cores unless they are "pinned" to specific cores
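Pinning is platform-specific; as an illustrative sketch (assuming Linux and glibc), a thread can fix itself to one core with pthread_setaffinity_np, after which the scheduler can no longer migrate it. OpenMP programs can get a similar effect via the OMP_PROC_BIND environment variable.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);   /* allow this thread to run on core 0 only */
    /* Linux-specific ("_np" = non-portable): pin the calling thread */
    int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rc != 0)
        fprintf(stderr, "pinning failed (error %d)\n", rc);
    else
        printf("thread pinned to core 0, no migration possible\n");
    return 0;
}
```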
Shared Memory Parallelization • Memory can be accessed by several threads running on different cores in a multi-socket multi-core system: for example, CPU1 executes a = 4 while CPU2 executes c = 3 + a on the shared variable a
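A hedged OpenMP sketch of this situation (my illustration): with a barrier between the write and the read, the second thread reliably sees a = 4; without it, this would be a data race (covered below).

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    int a = 0, c = 0;   /* shared between the threads below */
    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0)
            a = 4;            /* "CPU1" writes a */
        #pragma omp barrier   /* the write completes before anyone reads */
        if (omp_get_thread_num() == 1)
            c = 3 + a;        /* "CPU2" reads a */
    }
    printf("c = %d\n", c);    /* deterministically 7 */
    return 0;
}
```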
Finding Concurrency • Chances for concurrent execution: • Look for tasks that can be executed simultaneously (task parallelism) • Decompose data into distinct chunks to be processed independently (data parallelism) • Organize by task: task parallelism, divide and conquer • Organize by data decomposition: geometric decomposition, recursive data • Organize by flow of data: pipeline, event-based coordination
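A minimal OpenMP sketch of both flavors (my illustration, not from the slides): a parallel loop distributes chunks of data, while sections run different, independent activities side by side.

```c
#include <stdio.h>
#include <omp.h>

#define N 8

int main(void) {
    int x[N];

    /* data parallelism: the same operation on distinct chunks of data */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        x[i] = i * i;

    /* task parallelism: different, independent activities side by side */
    #pragma omp parallel sections
    {
        #pragma omp section
        printf("section 1: could compute a checksum of x\n");
        #pragma omp section
        printf("section 2: could write x to a file\n");
    }
    return 0;
}
```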
Divide and conquer • Split the problem into subproblems, and split those again until they are small enough to solve directly • Solve the subproblems (potentially concurrently), then merge the subsolutions step by step into the overall solution
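As an illustrative sketch of this pattern (not the slide's own code), a recursive array sum with OpenMP tasks: split the range, solve both halves concurrently, merge by adding. The cut-off of 1000 elements is an arbitrary assumption.

```c
#include <stdio.h>
#include <omp.h>

/* split the range in half, solve both halves concurrently, merge */
static long sum(const int *a, int lo, int hi) {
    if (hi - lo < 1000) {              /* small enough: solve directly */
        long s = 0;
        for (int i = lo; i < hi; i++)
            s += a[i];
        return s;
    }
    int mid = lo + (hi - lo) / 2;
    long left, right;
    #pragma omp task shared(left)      /* left half as a concurrent task */
    left = sum(a, lo, mid);
    right = sum(a, mid, hi);           /* right half in the current task */
    #pragma omp taskwait               /* both subsolutions must exist */
    return left + right;               /* merge */
}

int main(void) {
    enum { N = 100000 };
    static int a[N];
    for (int i = 0; i < N; i++)
        a[i] = 1;
    long s = 0;
    #pragma omp parallel
    #pragma omp single                 /* one thread starts the recursion */
    s = sum(a, 0, N);
    printf("sum = %ld\n", s);          /* 100000 */
    return 0;
}
```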
Geometric decomposition • Decompose the data domain into chunks that are processed independently • Example: ventricular assist device (VAD)
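A minimal sketch of the pattern (assuming a simple regular grid rather than the VAD geometry): each thread explicitly owns a contiguous slab of rows and updates only that chunk.

```c
#include <stdio.h>
#include <omp.h>

#define NX 400
#define NY 400

static double grid[NX][NY];

int main(void) {
    #pragma omp parallel
    {
        int t  = omp_get_thread_num();
        int nt = omp_get_num_threads();
        int lo = (NX * t) / nt;          /* thread t owns rows lo..hi-1 */
        int hi = (NX * (t + 1)) / nt;
        for (int i = lo; i < hi; i++)    /* update only the owned slab */
            for (int j = 0; j < NY; j++)
                grid[i][j] = i + j;
    }
    printf("grid[%d][%d] = %g\n", NX - 1, NY - 1, grid[NX - 1][NY - 1]);
    return 0;
}
```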
Pipeline • Analogy: assembly line • Assign different stages to different PEs (processing elements) • Work items C1…C5 flow through stages 1-4 over time, so different items are being processed in different stages concurrently
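One way to express a pipeline in shared memory is OpenMP task dependencies (4.0 or later); this is an illustrative sketch, not the slide's example. Each item passes through its three stages in order, while different items may overlap.

```c
#include <stdio.h>
#include <omp.h>

#define N 5   /* five work items C1..C5, as in the slide */

int main(void) {
    int buf1[N], buf2[N];   /* hand-off buffers between the stages */
    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < N; i++) {
        #pragma omp task depend(out: buf1[i])                     /* stage 1 */
        buf1[i] = i * i;
        #pragma omp task depend(in: buf1[i]) depend(out: buf2[i]) /* stage 2 */
        buf2[i] = buf1[i] + 1;
        #pragma omp task depend(in: buf2[i])                      /* stage 3 */
        printf("item %d -> %d\n", i, buf2[i]);
    }
    return 0;
}
```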
Further parallel design patterns • Recursive data pattern: parallelize operations on recursive data structures (see example later on: Fibonacci) • Event-based coordination pattern: decomposition into semi-independent tasks interacting in an irregular fashion (see example later on: while-loop with tasks)
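A sketch of the kind of Fibonacci example the slides refer to, using OpenMP tasks on the recursive call tree (the cut-off n > 15 is an arbitrary assumption to limit task overhead):

```c
#include <stdio.h>
#include <omp.h>

static long fib(int n) {
    if (n < 2)
        return n;
    long a, b;
    #pragma omp task shared(a) if(n > 15)   /* no extra tasks for tiny calls */
    a = fib(n - 1);
    b = fib(n - 2);
    #pragma omp taskwait                    /* wait for the child task */
    return a + b;
}

int main(void) {
    long r = 0;
    #pragma omp parallel
    #pragma omp single
    r = fib(20);
    printf("fib(20) = %ld\n", r);           /* 6765 */
    return 0;
}
```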
Data Races / Race Conditions • Data race: concurrent access to the same memory location by multiple threads without proper synchronization • Let variable x have the initial value 1; one thread executes x = 5; while another executes printf("%d\n", x); • Depending on which thread is faster, you will see either 1 or 5 • The result is nondeterministic (i.e., it depends on OS scheduling) • Data races (and how to detect and avoid them) will be covered in more detail later!
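The two statements above as a runnable OpenMP sketch (my illustration): run it repeatedly and the printed value flips between 1 and 5 depending on scheduling.

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    int x = 1;                        /* initial value 1 */
    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0)
            x = 5;                    /* one thread writes ... */
        else
            printf("%d\n", x);        /* ... the other reads: 1 or 5 */
    }
    return 0;
}
```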
Serialization • Serialization: when threads / processes wait "too much" • Limited scalability, if any at all • Simple (and stupid) example: two processes repeatedly exchange data (send / receive / wait / calculate), so each one spends much of its time waiting for the other instead of computing
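The slide sketches message passing; an analogous shared-memory sketch (my illustration) is a critical region that admits one thread at a time, so adding threads cannot speed anything up.

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    double t0 = omp_get_wtime();
    #pragma omp parallel
    {
        /* only one thread at a time may enter: the "parallel" work
           below is executed strictly one thread after the other */
        #pragma omp critical
        {
            double s = 0.0;
            for (int i = 0; i < 10000000; i++)
                s += i * 0.5;
            printf("thread %d done (s = %g)\n", omp_get_thread_num(), s);
        }
    }
    printf("elapsed: %.3f s (does not shrink with more threads)\n",
           omp_get_wtime() - t0);
    return 0;
}
```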
Parallelization Overhead • Overhead introduced by the parallelization: • Time to start / end / manage threads • Time to send / exchange data • Time spent in synchronization of threads / processes • With parallelization: the total CPU time increases, the wall time decreases, and the system time stays the same • Efficient parallelization is about minimizing the overhead introduced by the parallelization itself!
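Thread-management overhead can be made visible with a simple microbenchmark (an illustrative sketch): time many empty parallel regions, so all that is measured is the fork/join cost.

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    enum { REPS = 1000 };
    double t0 = omp_get_wtime();
    for (int r = 0; r < REPS; r++) {
        #pragma omp parallel
        { /* no work: only thread start/end/management remains */ }
    }
    double t1 = omp_get_wtime();
    printf("average fork/join overhead: %g s\n", (t1 - t0) / REPS);
    return 0;
}
```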
Load Balancing • Perfect load balancing: all threads / processes finish at the same time • Load imbalance: some threads / processes take longer than others • But: all threads / processes have to wait for the slowest one, which thus limits the scalability
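A hedged OpenMP sketch of load balancing: the iterations below take deliberately uneven time, and schedule(dynamic, 1) hands them out one by one at run time instead of splitting them into fixed, equal-sized chunks.

```c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>

int main(void) {
    /* iterations deliberately take uneven time */
    #pragma omp parallel for schedule(dynamic, 1)
    for (int i = 0; i < 16; i++) {
        usleep((i % 4) * 10000);   /* simulate uneven work */
        printf("iteration %2d ran on thread %d\n", i, omp_get_thread_num());
    }
    return 0;
}
```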
Speedup and Efficiency • Time using 1 CPU: T(1) • Time using p CPUs: T(p) • Speedup S: S(p) = T(1)/T(p); it measures how much faster the parallel computation is! • Efficiency E: E(p) = S(p)/p • Example: T(1) = 6 s, T(2) = 4 s, so S(2) = 1.5 and E(2) = 0.75 • Ideal case: T(p) = T(1)/p, so S(p) = p and E(p) = 1.0
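The slide's example as a few lines of C (using the given T(1) = 6 s and T(2) = 4 s):

```c
#include <stdio.h>

int main(void) {
    double t1 = 6.0, tp = 4.0;   /* T(1) = 6 s, T(2) = 4 s */
    int p = 2;
    double s = t1 / tp;          /* speedup    S(p) = T(1)/T(p) */
    double e = s / p;            /* efficiency E(p) = S(p)/p    */
    printf("S(%d) = %.2f, E(%d) = %.2f\n", p, s, p, e);  /* 1.50, 0.75 */
    return 0;
}
```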
Amdahl's Law • Describes the influence of the serial part on scalability (without taking any overhead into account) • S(p) = T(1)/T(p) = T(1)/(f*T(1) + (1-f)*T(1)/p) = 1/(f + (1-f)/p) • f: serial part (0 ≤ f ≤ 1) • T(1): time using 1 CPU • T(p): time using p CPUs • S(p): speedup; S(p) = T(1)/T(p) • E(p): efficiency; E(p) = S(p)/p • It is rather easy to scale to a small number of cores, but any parallelization is limited by the serial part of the program!
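Amdahl's law as a small C function (an illustrative sketch; f = 0.2 matches the 80%-parallel example on the next slide):

```c
#include <stdio.h>

/* upper bound on speedup with serial fraction f on p CPUs */
static double amdahl(double f, int p) {
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void) {
    for (int p = 1; p <= 1024; p *= 2)
        printf("p = %4d  S = %.2f\n", p, amdahl(0.2, p));
    printf("limit for p -> infinity: S = %.1f\n", 1.0 / 0.2);  /* 5.0 */
    return 0;
}
```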
Amdahl's Law illustrated • If 80% (measured in program runtime) of your work can be parallelized and "just" 20% still runs sequentially, then your speedup will be: • 1 processor: time 100%, speedup 1 • 2 processors: time 60%, speedup 1.7 • 4 processors: time 40%, speedup 2.5 • infinitely many processors: time 20%, speedup 5
Speedup in Practice • After the initial parallelization of a program, you will typically see speedup curves like this: [figure: speedup S over the number of processors p = 1…8, comparing the ideal speedup S(p) = p, the speedup according to Amdahl's law, and a realistic speedup that stays below both]