
Attacking the programming model wall

Marco Danelutto, Dept. Computer Science, Univ. of Pisa. Belfast, February 28th 2013.


Presentation Transcript


  1. Attacking the programming model wall • Marco Danelutto, Dept. Computer Science, Univ. of Pisa • Belfast, February 28th 2013

  2. Setting the scenario (HW)

  3. Market pressure

  4. Multicores • Moore's law from components to cores • Simpler cores, shared memory, cache coherent, full interconnect

  5. Manycores • Even simpler cores, shared memory, cache coherent, regular interconnection, co-processors (via PCIe) • Options for cache coherence, more complex inter core communication protocols

  6. GPUs • ALUs + instruction sequencers, large and fast memory access, co-processors (via PCIe) • Data parallel computations only

  7. FPGA • Low scale manufacturing, Accelerators (mission critical sw), GP computing (PCIe co-processors, CPU socket replacement), possibly hosting GP CPU/cores • Non-standard programming tools

  8. Power wall

  9. Power wall (2) • Reducing idle costs • E4 CARMA CLUSTER • ARM + nVIDIA • Spare Watt → GPU • Reducing the cooling costs • Eurotech AURORA TIGON • Intel technology • Water cooling • Spare Watt → CPU

  10. Setting the scenario (sw)

  11. Programming models • Low abstraction level: • Pros • Performance / efficiency • Heterogeneous hw targeting • Cons • Huge application programmer responsibilities • Portability (functional, performance) • Quantitative parallelism exploitation • High abstraction level: • Pros • Expressive power • Separation of concerns • Qualitative parallelism exploitation • Cons • Performance / efficiency • Hw targeting

  12. Separation of concerns • Functional: what has to be computed • Function from input data to output data • Domain specific • Application dependent • Non functional: how the result is computed • Parallelism, Power management, Security, Fault Tolerance, … • Target hw specific • Factorizable

  13. Current programming frameworks • CILK • TBB • OpenMP • MPI • OpenCL

  14. Urgencies • Need for • Parallel programming models • Parallel programmers

  15. Structured parallel programming • Algorithmic skeletons: • From HPC community • Started in the early '90s (M. Cole's PhD thesis) • Pre-defined parallel patterns, exposed to programmers as programming constructs/lib calls • Parallel design patterns: • From SW engineering community • Started in the early '00s • "Recipes" to handle parallelism (name, problem, forces, solutions, …)

  16. Similarities

  17. Algorithmic skeletons • Common, parametric, reusable parallelism exploitation patterns (from HPC community) • Exposed to programmers as constructs, library calls, objects, higher order functions, components, ... • Composable • Two tier model: “stream parallel” skeletons with inner “data parallel” skeletons

  18. Sample classical skeletons • Stream parallel: • Parallel computation of different items from an input stream • Task/farm (master/worker), Pipeline • Data parallel: • Parallel computation on (possibly overlapped) partitions of the same input data • Map, Stencil, Reduce, Scan, Mapreduce

  19. Evolution of the concept

  20. Evolution of the concept (2)

  21. Implementing skeletons • Template based: • Skeleton implemented by instantiating a "concurrent activity graph template" • Performance models used to instantiate quantitative parameters • P3L, Muesli, SkeTo, FastFlow • Macro Data Flow based: • Skeleton program compiled to macro data flow graphs • Rewriting/refactoring compiling process • Parallel MDF graph interpreter • Muskel, Skipper, Skandium

  22. Refactoring skeletons • Formally proven rewriting rules • Farm(Δ) = Δ • Pipe(Δ1, Δ2) = SeqComp(Δ1, Δ2) • Pipe(Map(Δ1), Map(Δ2)) = Map(Pipe(Δ1, Δ2))
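The map rule above (a pipeline of two maps computes the same result as a single map of the pipelined functions) can be checked functionally with a small sketch. `Map`, `Pipe`, `Fn`, `Vec` and `map_fusion_holds` below are hypothetical sequential stand-ins written for illustration only, not the API of any skeleton framework; they capture the functional semantics, not the parallel behaviour. C++14 or later is assumed.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical sequential stand-ins for the skeletons (illustration only).
using Fn  = std::function<int(int)>;
using Vec = std::vector<int>;

// Map(f): apply f to every element of a collection.
std::function<Vec(const Vec&)> Map(Fn f) {
    return [f](const Vec& in) {
        Vec out;
        out.reserve(in.size());
        for (int x : in) out.push_back(f(x));
        return out;
    };
}

// Pipe(s1, s2): feed the output of stage s1 into stage s2.
template <typename A, typename B>
auto Pipe(A s1, B s2) {
    return [s1, s2](auto&& x) { return s2(s1(x)); };
}

// Checks Pipe(Map(f), Map(g)) == Map(Pipe(f, g)) on one input collection.
bool map_fusion_holds(Fn f, Fn g, const Vec& input) {
    assert(f && g);                         // both stages must be callable
    Vec lhs = Pipe(Map(f), Map(g))(input);  // two maps in sequence
    Vec rhs = Map(Pipe(f, g))(input);       // one map over the fused function
    return lhs == rhs;
}
```

A single test input does not prove the rule; the sketch only shows what the equality states. In a framework, the right-hand side would avoid materialising the intermediate collection between the two maps.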

  23. Sample refactoring: normal form

  24. Performance modelling

  25. Sample performance models • Pipeline service time: max_{i=1,…,k} serviceTime(Stage_i) • Pipeline latency: ∑_{i=1,…,k} serviceTime(Stage_i) • Farm service time: max { taskSchedTime, resGathTime, workerTime / #workers } • Map latency: partitionTime + workerTime + gatherTime
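The four models on this slide translate directly into code. The sketch below is only a transcription of the formulas; the function and parameter names are chosen here for illustration, and all times are assumed to be in the same unit.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Pipeline service time: the slowest stage bounds the throughput.
double pipeline_service_time(const std::vector<double>& stage_times) {
    assert(!stage_times.empty());
    return *std::max_element(stage_times.begin(), stage_times.end());
}

// Pipeline latency: one item traverses every stage in turn.
double pipeline_latency(const std::vector<double>& stage_times) {
    return std::accumulate(stage_times.begin(), stage_times.end(), 0.0);
}

// Farm service time: bounded by scheduling, result gathering,
// or the worker time divided among the workers.
double farm_service_time(double task_sched_time, double res_gath_time,
                         double worker_time, int n_workers) {
    assert(n_workers > 0);
    return std::max({task_sched_time, res_gath_time, worker_time / n_workers});
}

// Map latency: partition the data, compute, gather the results.
double map_latency(double partition_time, double worker_time, double gather_time) {
    return partition_time + worker_time + gather_time;
}
```

For example, with a worker time of 8 and a scheduling time of 1, the farm model gives a service time of 2 with 4 workers and 1 with 16 workers: beyond 8 workers the emitter, not the workers, becomes the bottleneck.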

  26. Key strengths • Full parallel structure of the application exposed to the skeleton framework • Exploited by optimizations, support for autonomic non functional concern management • Framework responsibility for architecture targeting • Write once run everywhere code, with architecture specific compiler and back end (run time) tools • Only functional debugging required of application programmers

  27. Ideally

  28. Assessments

  29. Parallel design patterns • Carefully describe a parallelism exploitation pattern including • Applicability • Forces • Possible implementations/problem solutions • As text • At different levels of abstraction

  30. Pattern spaces

  31. Patterns • Collapsed in algorithmic skeletons • application programmer → concurrency and algorithm spaces • Skeleton implementation (system programmer)→ support structures and implementation mechanisms

  32. Structured parallel programming: design patterns • Problem → design patterns (follow, learn, use) → programming tools → low level code

  33. Structured parallel programming: skeletons • Problem → skeleton library (instantiate, compose) → high level code

  34. Structured parallel programming • Problem → design patterns + skeletons (use knowledge to instantiate, compose) → high level code

  35. Working unstructured • Tradeoffs • CPU/GPU threads • Processes/Threads • Coarse/fine grain tasks • Target architecture dependent decisions

  36. Thread/processes • Creation • Thread pool vs. on-the-fly creation • Pinning • Operating system dependent effectiveness • Memory management • Embarrassingly parallel patterns may benefit from process memory space separation (see the next slide, Memory)

  37. Memory • Cache friendly algorithms • Minimization of cache coherency traffic • Data alignment/padding • Memory wall • 1-2 memory interfaces per 4-8 cores • 4-8 memory interfaces per 60-64 cores (+internal routing)
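As an illustration of the alignment/padding point, the sketch below gives each thread its own counter padded to a full cache line, so that updates from different threads do not invalidate each other's lines (false sharing). The 64-byte line size, `PaddedCounter` and `count_in_parallel` are assumptions made for this example; C++17 is assumed so that the over-aligned type is allocated with the requested alignment.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Assumed cache line size; the real value is architecture dependent.
constexpr std::size_t kCacheLine = 64;

// One counter per cache line: alignas pads the struct to kCacheLine bytes.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<long> value{0};
};

// Each thread increments only its own counter; the totals are summed at the end.
long count_in_parallel(int n_threads, long increments) {
    assert(n_threads > 0);
    std::vector<PaddedCounter> counters(n_threads);
    std::vector<std::thread> threads;
    for (int t = 0; t < n_threads; ++t)
        threads.emplace_back([&counters, t, increments] {
            for (long i = 0; i < increments; ++i)
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : threads) th.join();
    long total = 0;
    for (auto& c : counters) total += c.value.load();
    return total;
}
```

Removing `alignas(kCacheLine)` leaves the result unchanged but packs several counters into one cache line, which typically increases coherency traffic.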

  38. Synchronization • High level, general purpose mechanisms • Passive wait • High latency • Low level mechanisms • Active wait • Smaller latency • Possibly • Synchronization on memory (fences)
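The two families of mechanisms can be contrasted with a minimal sketch. `PassiveFlag` blocks the waiting thread on a condition variable (the thread sleeps, wake-up latency is higher); `ActiveFlag` spins on an atomic with acquire/release ordering (the core stays busy, latency is lower). Both classes and the `handoff` helper are hypothetical names introduced for this example.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Passive wait: the waiter sleeps until notified.
class PassiveFlag {
    std::mutex m_;
    std::condition_variable cv_;
    bool ready_ = false;
public:
    void set() {
        { std::lock_guard<std::mutex> lock(m_); ready_ = true; }
        cv_.notify_all();
    }
    void wait() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return ready_; });
    }
};

// Active wait: the waiter spins on an atomic flag.
class ActiveFlag {
    std::atomic<bool> ready_{false};
public:
    // Release store: writes made before set() become visible to the waiter.
    void set() { ready_.store(true, std::memory_order_release); }
    // Acquire load: pairs with the release store above.
    void wait() { while (!ready_.load(std::memory_order_acquire)) { /* spin */ } }
};

// Hands one value from a producer thread to the caller through the given flag.
template <typename Flag>
int handoff(int value) {
    Flag flag;
    int slot = 0;
    std::thread producer([&] { slot = value; flag.set(); });
    flag.wait();
    int seen = slot;  // safe to read: ordered after the producer's write
    producer.join();
    assert(seen == value);
    return seen;
}
```

Spinning only pays off when the expected wait is short and a core can be dedicated to the waiter; otherwise the passive version is the safer default.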

  39. Devising parallelism degree • Ideally • As many parallel activities as necessary to sustain the input data rate • Base measures • Estimated input pressure & task processing time, communication overhead • Compile vs. run time choices • Try to devise statically some optimal values • Adjust initial settings dynamically based on observations
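One common way to turn these measures into a number is sketched below, under the assumption that a worker is busy for the task processing time plus the communication overhead, and that a new task arrives every inter-arrival time. The static estimate is then the ratio of the two, rounded up; the adjustment function is a deliberately naive one-step policy. Both functions and their parameters are illustrative, not taken from the slides.

```cpp
#include <cassert>
#include <cmath>

// Static estimate: enough workers to absorb one task per inter-arrival time.
int initial_parallelism_degree(double task_time, double inter_arrival_time,
                               double comm_overhead) {
    assert(task_time > 0 && inter_arrival_time > 0 && comm_overhead >= 0);
    return static_cast<int>(std::ceil((task_time + comm_overhead) / inter_arrival_time));
}

// Dynamic adjustment (naive): move one worker at a time towards the observed load.
int adjust_parallelism_degree(int current, double observed_service_time,
                              double observed_inter_arrival_time, int max_workers) {
    assert(current >= 1 && max_workers >= 1);
    if (observed_service_time > observed_inter_arrival_time && current < max_workers)
        return current + 1;  // too slow: tasks arrive faster than they are served
    if (observed_service_time < observed_inter_arrival_time && current > 1)
        return current - 1;  // over-provisioned: release one worker
    return current;
}
```

A production policy would add hysteresis to avoid oscillating between two degrees when the observed times are close.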

  40. NUMA memory exploitation • Auto scheduling • Idle workers request tasks from a "global" queue • Far nodes request fewer tasks than near ones • Affinity scheduling • Tasks scheduled on the producing cores • Round robin allocation of dynamically allocated chunks

  41. More separation of concerns

  42. Behavioural skeletons • Structured parallel algorithm code exposes sensors & actuators • Sensors: determine what can be perceived of the computation • Actuators: determine what can be affected/changed in the computation • Autonomic manager: executes a MAPE loop. At each iteration, an ECA (Event Condition Action) rule system is executed using monitored values and possibly operating actions on the structured parallel pattern • NFC autonomic manager reads an ECA rule based program

  43. Sample rules • Event: inter arrival time changes • Condition: faster than service time • Action: increase the parallelism degree • Event: fault at worker • Condition: service time low • Action: recruit new worker resource
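The first rule on this slide can be written as data plus a tiny rule engine. The sketch below is an assumption about how such a rule system might look, not the manager used in the behavioural skeleton frameworks: `Sensors`, `Actuators`, `Rule`, `run_rules`, `sample_rules` and the event name are all hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// What the manager can perceive and what it can change (illustrative fields).
struct Sensors   { double inter_arrival_time; double service_time; };
struct Actuators { int parallelism_degree; };

// An Event-Condition-Action rule.
struct Rule {
    std::string event;
    std::function<bool(const Sensors&)> condition;
    std::function<void(Actuators&)> action;
};

// One rule-evaluation step: fire every rule matching the event whose
// condition holds on the monitored values. Returns the number of rules fired.
int run_rules(const std::vector<Rule>& rules, const std::string& event,
              const Sensors& s, Actuators& a) {
    int fired = 0;
    for (const auto& r : rules) {
        assert(r.condition && r.action);
        if (r.event == event && r.condition(s)) { r.action(a); ++fired; }
    }
    return fired;
}

// Rule from the slide: when the inter-arrival time changes and tasks arrive
// faster than they are served, increase the parallelism degree.
std::vector<Rule> sample_rules() {
    std::vector<Rule> rules;
    rules.push_back(Rule{
        "inter_arrival_time_changed",
        [](const Sensors& s) { return s.inter_arrival_time < s.service_time; },
        [](Actuators& a) { ++a.parallelism_degree; }});
    return rules;
}
```

An autonomic manager would call `run_rules` once per iteration of its MAPE loop, with the sensor values read from the running skeleton.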

  44. BS assessments

  45. Yes, nice, but then ? We have MPI, OpenMP, Cuda, OpenCL …

  46. FastFlow • Full C++, skeleton based, streaming parallel processing framework http://mc-fastflow.sourceforge.net

  47. Bring skeletons to your desk • Full POSIX/C++ compliance • G++, make, gprof, gdb, pthread, … • Reuse existing code • Proper wrappers • Run from laptops to clusters & clouds • Same skeleton structure

  48. Basic abstraction: ff_node
      class RedEye: public ff_node {
        …
        int svc_init() { … }
        void svc_end() { … }
        void * svc(void * task) {
          Image * in = (Image *)task;
          Image * out = …;
          return ((void *) out);
        }
        …
      };

  49. Basic stream parallel skeletons • Farm(Worker, Nw) • Embarrassingly parallel computations on streams • Computing Worker in parallel (Nw copies) • Emitter + string of workers + Collector implementation • Pipeline(Stage1, … , StageN) • StageK processes output of Stage(K-1) and delivers to Stage(K+1) • Feedback(Skel, Cond) • Routes back results from Skel to input or forward to output depending on Cond
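The farm semantics described above can be sketched with standard C++ threads. This is not FastFlow code and does not use its emitter/collector nodes: here a shared atomic index plays the emitter (workers pull the next item on demand) and the result vector plays the collector (each result is stored at its input position). The `farm` function and its signature are assumptions made for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Farm(Worker, Nw): nw threads apply `worker` to independent stream items.
std::vector<int> farm(const std::vector<int>& stream,
                      const std::function<int(int)>& worker, int nw) {
    assert(nw > 0);
    std::vector<int> results(stream.size());
    std::atomic<std::size_t> next{0};  // "emitter": index of the next unprocessed item
    std::vector<std::thread> workers;
    for (int w = 0; w < nw; ++w)
        workers.emplace_back([&] {
            // Each worker repeatedly claims one item until the stream is exhausted.
            for (std::size_t i = next.fetch_add(1); i < stream.size();
                 i = next.fetch_add(1))
                results[i] = worker(stream[i]);  // "collector": ordered by input position
        });
    for (auto& t : workers) t.join();
    return results;
}
```

The worker function must be safe to call concurrently, which holds for the embarrassingly parallel computations the farm targets.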

  50. Setting up a pipeline
      ff_pipeline myImageProcessingPipe;
      // each stage is an ff_node, handled through a pointer
      ff_node * startNode = new Reader(…);
      ff_node * redEye = new RedEye();
      ff_node * light = new LightCalibration();
      ff_node * sharpen = new Sharpen();
      ff_node * endNode = new Writer(…);
      // stages run in the order they are added
      myImageProcessingPipe.add_stage(startNode);
      myImageProcessingPipe.add_stage(redEye);
      myImageProcessingPipe.add_stage(light);
      myImageProcessingPipe.add_stage(sharpen);
      myImageProcessingPipe.add_stage(endNode);
      myImageProcessingPipe.run_and_wait_end();
