Attacking the programming model wall Marco Danelutto, Dept. Computer Science, Univ. of Pisa Belfast, February 28th 2013
Setting the scenario (HW) • Market pressure
Multicores • Moore's law moves from components to cores • Simpler cores, shared memory, cache coherent, full interconnect
Manycores • Even simpler cores, shared memory, cache coherent, regular interconnection, co-processors (via PCIe) • Options for cache coherence, more complex inter-core communication protocols
GPUs • ALUs + instruction sequencers, large and fast memory access, co-processors (via PCIe) • Data parallel computations only
FPGA • Low-scale manufacturing, accelerators (mission critical sw), GP computing (PCIe co-processors, CPU socket replacement), possibly hosting GP CPU/cores • Non-standard programming tools
Power wall (2) • Reducing idle costs • E4 CARMA CLUSTER • ARM + nVIDIA • Spare Watt → GPU • Reducing the cooling costs • Eurotech AURORA TIGON • Intel technology • Water cooling • Spare Watt → CPU
Programming models • Low abstraction level – Pros: performance/efficiency, heterogeneous hw targeting – Cons: huge application programmer responsibilities, portability (functional, performance), quantitative parallelism exploitation • High abstraction level – Pros: expressive power, separation of concerns, qualitative parallelism exploitation – Cons: performance/efficiency, hw targeting
Separation of concerns • Functional: what has to be computed – function from input data to output data – domain specific, application dependent • Non-functional: how the result is computed – parallelism, power management, security, fault tolerance, … – target hw specific, factorizable
Current programming frameworks • Cilk • TBB • OpenMP • MPI • OpenCL
Urgencies • Need for parallel programming models • Need for parallel programmers
Structured parallel programming • Algorithmic skeletons (from the HPC community) – started in the early '90s (M. Cole's PhD thesis) – pre-defined parallel patterns, exposed to programmers as programming constructs/library calls • Parallel design patterns (from the SW engineering community) – started in the early '00s – "recipes" to handle parallelism (name, problem, forces, solutions, …)
Algorithmic skeletons • Common, parametric, reusable parallelism exploitation patterns (from HPC community) • Exposed to programmers as constructs, library calls, objects, higher order functions, components, ... • Composable • Two tier model: “stream parallel” skeletons with inner “data parallel” skeletons
Sample classical skeletons • Stream parallel: parallel computation of different items from an input stream – task farm (master/worker), pipeline • Data parallel: parallel computation on (possibly overlapped) partitions of the same input data – map, stencil, reduce, scan, mapreduce
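To fix intuition, a minimal sketch of the sequential semantics of two of these data parallel skeletons (plain C++11; the names and signatures are chosen here for illustration, and a skeleton framework evaluates the same functions in parallel):

    #include <vector>

    // map: apply f independently to every item of a collection;
    // here evaluated sequentially, but each iteration is an
    // independent task in a parallel run.
    template <typename T, typename F>
    std::vector<T> map(const std::vector<T>& in, F f) {
        std::vector<T> out;
        for (const T& x : in) out.push_back(f(x));
        return out;
    }

    // reduce: combine all items with an (assumed associative)
    // operator op; associativity is what allows a parallel,
    // tree-shaped evaluation.
    template <typename T, typename Op>
    T reduce(const std::vector<T>& in, T identity, Op op) {
        T acc = identity;
        for (const T& x : in) acc = op(acc, x);
        return acc;
    }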
Implementing skeletons • Template based: skeleton implemented by instantiating a "concurrent activity graph template" – performance models used to instantiate quantitative parameters – P3L, Muesli, SkeTo, FastFlow • Macro data flow based: skeleton program compiled to a macro data flow graph, executed by a parallel MDF graph interpreter – rewriting/refactoring compiling process – Muskel, Skipper, Skandium
Refactoring skeletons • Formally proven rewriting rules – Farm(Δ) = Δ – Pipe(Δ1, Δ2) = SeqComp(Δ1, Δ2) – Pipe(Map(Δ1), Map(Δ2)) = Map(Pipe(Δ1, Δ2)) • E.g., the last rule fuses two map stages into a single map of a pipeline, trading pipeline parallelism for lower communication and synchronization overhead
Sample performance models • Pipeline service time: max_{i=1..k} serviceTime(Stage_i) • Pipeline latency: Σ_{i=1..k} serviceTime(Stage_i) • Farm service time: max { taskSchedTime, resGathTime, workerTime / #workers } • Map latency: partitionTime + workerTime + gatherTime (see the sketch below)
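These models translate directly into code; a minimal sketch (hypothetical helper functions, not part of any skeleton framework), assuming measured times in consistent units:

    #include <algorithm>
    #include <numeric>
    #include <vector>

    // Service time of a pipeline: the slowest stage dominates.
    double pipeServiceTime(const std::vector<double>& stageTime) {
        return *std::max_element(stageTime.begin(), stageTime.end());
    }

    // Latency of a pipeline: the sum of all the stage times.
    double pipeLatency(const std::vector<double>& stageTime) {
        return std::accumulate(stageTime.begin(), stageTime.end(), 0.0);
    }

    // Service time of a farm: scheduling, gathering, or the workers,
    // whichever is the slowest.
    double farmServiceTime(double schedTime, double gathTime,
                           double workerTime, int nWorkers) {
        return std::max({schedTime, gathTime, workerTime / nWorkers});
    }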
Key strengths • Full parallel structure of the application exposed to the skeleton framework – exploited by optimizations, support for autonomic non-functional concern management • Framework responsibility for architecture targeting – write once, run everywhere code, with architecture specific compiler and back end (run time) tools • Only functional debugging required of application programmers
Parallel design patterns • Carefully describe a parallelism exploitation pattern, including – applicability – forces – possible implementations/problem solutions • As text • At different levels of abstraction
Patterns • Collapsed into algorithmic skeletons • Application programmer → concurrency and algorithm spaces • Skeleton implementation (system programmer) → support structures and implementation mechanisms
Structured parallel programming: design patterns • Problem → follow, learn, and use design patterns → low level code, written with programming tools
Structured parallel programming: skeletons • Problem → instantiate and compose skeletons from a library → high level code
Structured parallel programming: design patterns + skeletons • Problem → use design pattern knowledge to instantiate and compose skeletons → high level code
Working unstructured • Tradeoffs • CPU/GPU threads • Processes/Threads • Coarse/fine grain tasks • Target architecture dependent decisions
Thread/processes • Creation – thread pool vs. on-the-fly creation • Pinning – operating system dependent effectiveness (sketch below) • Memory management – embarrassingly parallel patterns may benefit from process memory space separation (see the Memory slide, next)
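A minimal Linux-only sketch of pinning (pthread_setaffinity_np is a non-portable GNU extension, hence the OS dependent effectiveness noted above):

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <pthread.h>
    #include <sched.h>

    // Pin the calling thread to the given core; returns 0 on success.
    int pinToCore(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }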
Memory • Cache friendly algorithms • Minimization of cache coherency traffic – data alignment/padding (sketch below) • Memory wall – 1-2 memory interfaces per 4-8 cores – 4-8 memory interfaces per 60-64 cores (+ internal routing)
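A one-struct sketch of data alignment/padding against false sharing (assumes 64-byte cache lines, the common case on x86; C++11 alignas):

    // Each counter gets its own cache line, so updates from one worker
    // do not invalidate the lines holding the other workers' counters.
    struct alignas(64) PaddedCounter {
        long value;
    };
    static_assert(sizeof(PaddedCounter) == 64, "one counter per line");

    PaddedCounter perWorkerHits[8];   // e.g., one slot per worker thread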
Synchronization • High level, general purpose mechanisms – passive wait, high latency • Low level mechanisms – active wait, smaller latency • Possibly, synchronization directly on memory (fences) – see the sketch below
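The two options, contrasted in a small C++11 sketch (a flag set once by a producer; names are illustrative):

    #include <atomic>
    #include <condition_variable>
    #include <mutex>

    std::atomic<bool> ready{false};
    std::mutex m;
    std::condition_variable cv;

    // Active wait: burns a core, but reacts with minimal latency.
    void activeWait() {
        while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    }

    // Passive wait: the thread sleeps in the OS, paying wake-up latency.
    void passiveWait() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return ready.load(); });
    }

    // Producer side: the release store acts as the memory fence.
    void signal() {
        ready.store(true, std::memory_order_release);
        cv.notify_all();
    }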
Devising the parallelism degree • Ideally – as many parallel activities as needed to sustain the input data rate • Base measures – estimated input pressure & task processing time, communication overhead • Compile vs. run time choices – try to devise statically some optimal values, then adjust the initial settings dynamically based on observations (sketch below)
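One possible back-of-the-envelope rule for the static choice, as a sketch (the inputs are the estimated base measures listed above; the result is then tuned at run time):

    #include <cmath>

    // Workers needed so that nw * interArrival >= taskTime + commOverhead,
    // i.e. so that the farm service time sustains the input rate.
    int initialParDegree(double taskTimeMs, double commOverheadMs,
                         double interArrivalMs) {
        return (int)std::ceil((taskTimeMs + commOverheadMs) / interArrivalMs);
    }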
NUMA memory exploitation • Auto scheduling – idle workers request tasks from a "global" queue – workers on far nodes request fewer tasks than those on near ones • Affinity scheduling – tasks scheduled on the cores that produced them – round robin allocation of dynamically allocated chunks
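A sketch of the auto scheduling side: one shared queue that idle workers pull from (mutex-protected for brevity; a production run-time support would rather use lock-free queues and NUMA-aware allocation):

    #include <mutex>
    #include <queue>

    template <typename Task>
    class GlobalTaskQueue {
        std::queue<Task> q;
        std::mutex m;
    public:
        void push(const Task& t) {
            std::lock_guard<std::mutex> lk(m);
            q.push(t);
        }
        // Called by an idle worker. Near workers come back more often,
        // so far nodes naturally end up requesting fewer tasks.
        bool pop(Task& t) {
            std::lock_guard<std::mutex> lk(m);
            if (q.empty()) return false;
            t = q.front(); q.pop();
            return true;
        }
    };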
Behavioural skeletons • Structured parallel algorithm code exposes sensors & actuators – sensors determine what can be perceived of the computation – actuators determine what can be affected/changed in the computation • Autonomic manager: executes a MAPE (Monitor, Analyze, Plan, Execute) loop – at each iteration, an ECA (Event Condition Action) rule system is executed, using monitored values and possibly operating actions on the structured parallel pattern – the per-NFC autonomic manager reads an ECA rule based program
Sample rules • Event: inter arrival time changes • Condition: faster than service time • Action: increase the parallelism degree • Event: fault at worker • Condition: service time low • Action: recruit new worker resource
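A minimal sketch of one manager iteration executing these two rules (the struct, names, and threshold are illustrative, not the actual behavioural skeleton API):

    struct Monitored {            // filled in by the sensors
        double interArrival;      // observed inter-arrival time
        double serviceTime;       // observed farm service time
        bool   workerFault;       // a worker failed since the last check
    };

    const double kLowServiceTime = 10.0;  // illustrative threshold

    // One ECA pass of the MAPE loop: check events/conditions, fire actions.
    void ecaIteration(const Monitored& m, int& parDegree) {
        if (m.interArrival < m.serviceTime)   // input faster than service
            parDegree++;                      // action: add a worker
        if (m.workerFault && m.serviceTime < kLowServiceTime)
            parDegree++;                      // action: recruit a new resource
    }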
Yes, nice, but then? We have MPI, OpenMP, CUDA, OpenCL …
FastFlow • Full C++, skeleton based, streaming parallel processing framework http://mc-fastflow.sourceforge.net
Bring skeletons to your desk • Full POSIX/C++ compliance – g++, make, gprof, gdb, pthread, … • Reuse existing code – proper wrappers • Run from laptops to clusters & clouds – same skeleton structure
Basic abstraction: ff_node

    class RedEye: public ff_node {
      …
      int svc_init() { … }          // optional: run once, at thread start
      void svc_end() { … }          // optional: run once, at thread exit
      void *svc(void *task) {       // run once per input task
        Image *in = (Image *)task;
        Image *out = …;
        return (void *)out;         // result delivered to the next stage
      }
      …
    };
Basic stream parallel skeletons • Farm(Worker, Nw) – embarrassingly parallel computations on streams – computes Worker in parallel (Nw copies) – Emitter + string of workers + Collector implementation (sketch below) • Pipeline(Stage1, …, StageN) – StageK processes the output of StageK-1 and delivers to StageK+1 • Feedback(Skel, Cond) – routes results from Skel back to the input or forward to the output, depending on Cond
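The farm can be sketched with the classic FastFlow interface (a sketch, assuming the ff_farm API of that release; RedEye is the node from the earlier slide, and the parallelism degree is passed in by the caller):

    #include <vector>
    #include <ff/farm.hpp>
    using namespace ff;

    // Build a farm with nw copies of the RedEye worker; ff_farm is itself
    // an ff_node, so the result can be used as a pipeline stage.
    ff_farm<> *buildRedEyeFarm(int nw) {
        ff_farm<> *farm = new ff_farm<>();
        std::vector<ff_node*> workers;
        for (int i = 0; i < nw; ++i)
            workers.push_back(new RedEye());
        farm->add_workers(workers);      // the string of workers
        farm->add_collector(NULL);       // default collector
        return farm;
    }

Since the farm is a node, myImageProcessingPipe.addStage(buildRedEyeFarm(4)) would parallelize the red-eye stage of the pipeline shown next.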
Setting up a pipeline

    ff_pipeline myImageProcessingPipe;
    ff_node *startNode = new Reader(…);
    ff_node *redEye    = new RedEye();
    ff_node *light     = new LightCalibration();
    ff_node *sharpen   = new Sharpen();
    ff_node *endNode   = new Writer(…);
    myImageProcessingPipe.addStage(startNode);
    myImageProcessingPipe.addStage(redEye);
    myImageProcessingPipe.addStage(light);
    myImageProcessingPipe.addStage(sharpen);
    myImageProcessingPipe.addStage(endNode);
    myImageProcessingPipe.run_and_wait_end();