Attacking the programming model wall Marco Danelutto, Dept. Computer Science, Univ. of Pisa Belfast, February 28th 2013
Setting the scenario (HW) • Market pressure
Multicores • Moore's law moves from components to cores • Simpler cores, shared memory, cache coherent, full interconnect
Manycores • Even simpler cores, shared memory, cache coherent, regular interconnection, co-processors (via PCIe) • Options for cache coherence, more complex inter-core communication protocols
GPUs • ALUs + instruction sequencers, large and fast memory access, co-processors (via PCIe) • Data parallel computations only
FPGA • Low-scale manufacturing, accelerators (mission critical sw), GP computing (PCIe co-processors, CPU socket replacement), possibly hosting GP CPU/cores • Non-standard programming tools
Power wall (2) • Reducing idle costs • E4 CARMA CLUSTER • ARM + nVIDIA • Spare Watt → GPU • Reducing the cooling costs • Eurotech AURORA TIGON • Intel technology • Water cooling • Spare Watt → CPU
Programming models • Low abstraction level – Pros: performance/efficiency, heterogeneous hw targeting – Cons: huge application programmer responsibilities, portability (functional, performance), quantitative parallelism exploitation • High abstraction level – Pros: expressive power, separation of concerns, qualitative parallelism exploitation – Cons: performance/efficiency, hw targeting
Separation of concerns • Functional: what has to be computed – function from input data to output data – domain specific, application dependent • Non-functional: how the result is computed – parallelism, power management, security, fault tolerance, … – target hw specific, factorizable
Current programming frameworks • Cilk • TBB • OpenMP • MPI • OpenCL
Urgencies • Need for parallel programming models • Need for parallel programmers
Structured parallel programming • Algorithmic skeletons (from the HPC community) – started in the early '90s (M. Cole's PhD thesis) – pre-defined parallel patterns, exposed to programmers as programming constructs/library calls • Parallel design patterns (from the SW engineering community) – started in the early '00s – "recipes" to handle parallelism (name, problem, forces, solutions, …)
Algorithmic skeletons • Common, parametric, reusable parallelism exploitation patterns (from HPC community) • Exposed to programmers as constructs, library calls, objects, higher order functions, components, ... • Composable • Two tier model: “stream parallel” skeletons with inner “data parallel” skeletons
Sample classical skeletons • Stream parallel: parallel computation of different items from an input stream – task farm (master/worker), pipeline • Data parallel: parallel computation on (possibly overlapped) partitions of the same input data – map, stencil, reduce, scan, mapreduce
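To fix intuition, a minimal sketch of the sequential semantics of two of these data parallel skeletons (plain C++11; the names and signatures are chosen here for illustration, and a skeleton framework evaluates the same functions in parallel):

    #include <vector>

    // map: apply f independently to every item of a collection;
    // here evaluated sequentially, but each iteration is an
    // independent task in a parallel run.
    template <typename T, typename F>
    std::vector<T> map(const std::vector<T>& in, F f) {
        std::vector<T> out;
        for (const T& x : in) out.push_back(f(x));
        return out;
    }

    // reduce: combine all items with an (assumed associative)
    // operator op; associativity is what allows a parallel,
    // tree-shaped evaluation.
    template <typename T, typename Op>
    T reduce(const std::vector<T>& in, T identity, Op op) {
        T acc = identity;
        for (const T& x : in) acc = op(acc, x);
        return acc;
    }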
Implementing skeletons • Template based: skeleton implemented by instantiating a "concurrent activity graph template" – performance models used to instantiate quantitative parameters – P3L, Muesli, SkeTo, FastFlow • Macro data flow based: skeleton program compiled to a macro data flow graph, executed by a parallel MDF graph interpreter – rewriting/refactoring compiling process – Muskel, Skipper, Skandium
Refactoring skeletons • Formally proven rewriting rules – Farm(Δ) = Δ – Pipe(Δ1, Δ2) = SeqComp(Δ1, Δ2) – Pipe(Map(Δ1), Map(Δ2)) = Map(Pipe(Δ1, Δ2)) • E.g., the last rule fuses two map stages into a single map of a pipeline, trading pipeline parallelism for lower communication and synchronization overhead
Sample performance models • Pipeline service time: max_{i=1..k} serviceTime(Stage_i) • Pipeline latency: Σ_{i=1..k} serviceTime(Stage_i) • Farm service time: max { taskSchedTime, resGathTime, workerTime / #workers } • Map latency: partitionTime + workerTime + gatherTime (see the sketch below)
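These models translate directly into code; a minimal sketch (hypothetical helper functions, not part of any skeleton framework), assuming measured times in consistent units:

    #include <algorithm>
    #include <numeric>
    #include <vector>

    // Service time of a pipeline: the slowest stage dominates.
    double pipeServiceTime(const std::vector<double>& stageTime) {
        return *std::max_element(stageTime.begin(), stageTime.end());
    }

    // Latency of a pipeline: the sum of all the stage times.
    double pipeLatency(const std::vector<double>& stageTime) {
        return std::accumulate(stageTime.begin(), stageTime.end(), 0.0);
    }

    // Service time of a farm: scheduling, gathering, or the workers,
    // whichever is the slowest.
    double farmServiceTime(double schedTime, double gathTime,
                           double workerTime, int nWorkers) {
        return std::max({schedTime, gathTime, workerTime / nWorkers});
    }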
Key strengths • Full parallel structure of the application exposed to the skeleton framework – exploited by optimizations, support for autonomic non-functional concern management • Framework responsibility for architecture targeting – write once, run everywhere code, with architecture specific compiler and back end (run time) tools • Only functional debugging required of application programmers
Parallel design patterns • Carefully describe a parallelism exploitation pattern, including – applicability – forces – possible implementations/problem solutions • As text • At different levels of abstraction
Patterns • Collapsed into algorithmic skeletons • Application programmer → concurrency and algorithm spaces • Skeleton implementation (system programmer) → support structures and implementation mechanisms
Structured parallel programming: design patterns • Problem → follow, learn, and use design patterns → low level code, written with programming tools
Structured parallel programming: skeletons • Problem → instantiate and compose skeletons from a library → high level code
Structured parallel programming: design patterns + skeletons • Problem → use design pattern knowledge to instantiate and compose skeletons → high level code
Working unstructured • Tradeoffs • CPU/GPU threads • Processes/Threads • Coarse/fine grain tasks • Target architecture dependent decisions
Thread/processes • Creation – thread pool vs. on-the-fly creation • Pinning – operating system dependent effectiveness (sketch below) • Memory management – embarrassingly parallel patterns may benefit from process memory space separation (see the Memory slide, next)
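A minimal Linux-only sketch of pinning (pthread_setaffinity_np is a non-portable GNU extension, hence the OS dependent effectiveness noted above):

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <pthread.h>
    #include <sched.h>

    // Pin the calling thread to the given core; returns 0 on success.
    int pinToCore(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }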
Memory • Cache friendly algorithms • Minimization of cache coherency traffic – data alignment/padding (sketch below) • Memory wall – 1-2 memory interfaces per 4-8 cores – 4-8 memory interfaces per 60-64 cores (+ internal routing)
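A one-struct sketch of data alignment/padding against false sharing (assumes 64-byte cache lines, the common case on x86; C++11 alignas):

    // Each counter gets its own cache line, so updates from one worker
    // do not invalidate the lines holding the other workers' counters.
    struct alignas(64) PaddedCounter {
        long value;
    };
    static_assert(sizeof(PaddedCounter) == 64, "one counter per line");

    PaddedCounter perWorkerHits[8];   // e.g., one slot per worker thread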
Synchronization • High level, general purpose mechanisms – passive wait, high latency • Low level mechanisms – active wait, smaller latency • Possibly, synchronization directly on memory (fences) – see the sketch below
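The two options, contrasted in a small C++11 sketch (a flag set once by a producer; names are illustrative):

    #include <atomic>
    #include <condition_variable>
    #include <mutex>

    std::atomic<bool> ready{false};
    std::mutex m;
    std::condition_variable cv;

    // Active wait: burns a core, but reacts with minimal latency.
    void activeWait() {
        while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    }

    // Passive wait: the thread sleeps in the OS, paying wake-up latency.
    void passiveWait() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return ready.load(); });
    }

    // Producer side: the release store acts as the memory fence.
    void signal() {
        ready.store(true, std::memory_order_release);
        cv.notify_all();
    }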
Devising the parallelism degree • Ideally – as many parallel activities as needed to sustain the input data rate • Base measures – estimated input pressure & task processing time, communication overhead • Compile vs. run time choices – try to devise statically some optimal values, then adjust the initial settings dynamically based on observations (sketch below)
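One possible back-of-the-envelope rule for the static choice, as a sketch (the inputs are the estimated base measures listed above; the result is then tuned at run time):

    #include <cmath>

    // Workers needed so that nw * interArrival >= taskTime + commOverhead,
    // i.e. so that the farm service time sustains the input rate.
    int initialParDegree(double taskTimeMs, double commOverheadMs,
                         double interArrivalMs) {
        return (int)std::ceil((taskTimeMs + commOverheadMs) / interArrivalMs);
    }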
NUMA memory exploitation • Auto scheduling – idle workers request tasks from a "global" queue – workers on far nodes request fewer tasks than those on near ones • Affinity scheduling – tasks scheduled on the cores that produced them – round robin allocation of dynamically allocated chunks
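A sketch of the auto scheduling side: one shared queue that idle workers pull from (mutex-protected for brevity; a production run-time support would rather use lock-free queues and NUMA-aware allocation):

    #include <mutex>
    #include <queue>

    template <typename Task>
    class GlobalTaskQueue {
        std::queue<Task> q;
        std::mutex m;
    public:
        void push(const Task& t) {
            std::lock_guard<std::mutex> lk(m);
            q.push(t);
        }
        // Called by an idle worker. Near workers come back more often,
        // so far nodes naturally end up requesting fewer tasks.
        bool pop(Task& t) {
            std::lock_guard<std::mutex> lk(m);
            if (q.empty()) return false;
            t = q.front(); q.pop();
            return true;
        }
    };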
Behavioural skeletons • Structured parallel algorithm code exposes sensors & actuators – sensors determine what can be perceived of the computation – actuators determine what can be affected/changed in the computation • Autonomic manager: executes a MAPE (Monitor, Analyze, Plan, Execute) loop – at each iteration, an ECA (Event Condition Action) rule system is executed, using monitored values and possibly operating actions on the structured parallel pattern – the per-NFC autonomic manager reads an ECA rule based program
Sample rules • Event: inter arrival time changes • Condition: faster than service time • Action: increase the parallelism degree • Event: fault at worker • Condition: service time low • Action: recruit new worker resource
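A minimal sketch of one manager iteration executing these two rules (the struct, names, and threshold are illustrative, not the actual behavioural skeleton API):

    struct Monitored {            // filled in by the sensors
        double interArrival;      // observed inter-arrival time
        double serviceTime;       // observed farm service time
        bool   workerFault;       // a worker failed since the last check
    };

    const double kLowServiceTime = 10.0;  // illustrative threshold

    // One ECA pass of the MAPE loop: check events/conditions, fire actions.
    void ecaIteration(const Monitored& m, int& parDegree) {
        if (m.interArrival < m.serviceTime)   // input faster than service
            parDegree++;                      // action: add a worker
        if (m.workerFault && m.serviceTime < kLowServiceTime)
            parDegree++;                      // action: recruit a new resource
    }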
Yes, nice, but then? We have MPI, OpenMP, CUDA, OpenCL …
FastFlow • Full C++, skeleton based, streaming parallel processing framework http://mc-fastflow.sourceforge.net
Bring skeletons to your desk • Full POSIX/C++ compliance – g++, make, gprof, gdb, pthread, … • Reuse existing code – proper wrappers • Run from laptops to clusters & clouds – same skeleton structure
Basic abstraction: ff_node

    class RedEye: public ff_node {
      …
      int svc_init() { … }          // optional: run once, at thread start
      void svc_end() { … }          // optional: run once, at thread exit
      void *svc(void *task) {       // run once per input task
        Image *in = (Image *)task;
        Image *out = …;
        return (void *)out;         // result delivered to the next stage
      }
      …
    };
Basic stream parallel skeletons • Farm(Worker, Nw) – embarrassingly parallel computations on streams – computes Worker in parallel (Nw copies) – Emitter + string of workers + Collector implementation (sketch below) • Pipeline(Stage1, …, StageN) – StageK processes the output of StageK-1 and delivers to StageK+1 • Feedback(Skel, Cond) – routes results from Skel back to the input or forward to the output, depending on Cond
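The farm can be sketched with the classic FastFlow interface (a sketch, assuming the ff_farm API of that release; RedEye is the node from the earlier slide, and the parallelism degree is passed in by the caller):

    #include <vector>
    #include <ff/farm.hpp>
    using namespace ff;

    // Build a farm with nw copies of the RedEye worker; ff_farm is itself
    // an ff_node, so the result can be used as a pipeline stage.
    ff_farm<> *buildRedEyeFarm(int nw) {
        ff_farm<> *farm = new ff_farm<>();
        std::vector<ff_node*> workers;
        for (int i = 0; i < nw; ++i)
            workers.push_back(new RedEye());
        farm->add_workers(workers);      // the string of workers
        farm->add_collector(NULL);       // default collector
        return farm;
    }

Since the farm is a node, myImageProcessingPipe.addStage(buildRedEyeFarm(4)) would parallelize the red-eye stage of the pipeline shown next.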
Setting up a pipeline

    ff_pipeline myImageProcessingPipe;
    ff_node *startNode = new Reader(…);
    ff_node *redEye    = new RedEye();
    ff_node *light     = new LightCalibration();
    ff_node *sharpen   = new Sharpen();
    ff_node *endNode   = new Writer(…);
    myImageProcessingPipe.addStage(startNode);
    myImageProcessingPipe.addStage(redEye);
    myImageProcessingPipe.addStage(light);
    myImageProcessingPipe.addStage(sharpen);
    myImageProcessingPipe.addStage(endNode);
    myImageProcessingPipe.run_and_wait_end();