
Task Parallelism and Task Superscalar Processing



  1. Task Parallelism and Task Superscalar Processing A preview of work done at the Barcelona Supercomputing Center (and other places) Collected by Alon Naveh, June 2010 Seminar in VLSI Architecture (048879)

  2. Content • Preface • Incentives for Task-Superscalar • Alternative solutions for ILP • Why use Task Level parallelism • Task Superscalar – next level of task level parallelism • The pipeline & similarity to ILP • Smart scheduling prospects • Overcoming ILP-pipeline bottlenecks with Task Superscalar • Topics for further research • References

  3. Preface • The previous session presented an architecture aimed at allowing efficient execution of almost embarrassingly parallel applications (MRI, GFX physics, perceptual computing, etc.) • Focused mostly on efficient scaling of data and coherency management • Assumed the SW aspect of parallelism is handled mostly by the developer, by partitioning the work between the parallel resources • Did provide some queuing runtime to alleviate task management • This session tries to achieve massive parallelism WITHOUT heavy programmer intervention • It tries to extract parallelism from an existing workload by moving to a “task-level” granularity

  4. Agenda • Preface • Incentives for Task-Superscalar • Alternative solutions for ILP • Why use Task Level parallelism • Task Superscalar – next level of task level parallelism • The pipeline & similarity to ILP • Smart scheduling prospects • Overcoming ILP-pipeline bottlenecks with Task Superscalar • Topics for further research

  5. Incentives for activity (1) • Disparity between HW capabilities for parallelism (1000 cores) and programmers’ capabilities • Traditional parallelism management: • a “serial” execution programming model + a “hidden” ILP extraction mechanism • VLIW: via smart compilers • OOO: via dependency checking and OOO execution in HW • Relies on the existence of ILP in the code Programmers Need a Simple Interface!

  6. Incentives for activity (2) • ILP is at the point of diminishing returns • Most of the existing ILP is already “used” (overcoming memory latencies) • Need windows of ~1000 in-flight instructions to make headway [4] – Area/Pwr!!! • MOST of the “window” contains speculative instructions, likely to be “nuked” • Most transistors in high-end processors are aimed today at “multi-core” • Benefits are gained through parallel execution of multiple tasks • Back to the programming-capabilities gap on the previous slide… ILP at Diminishing-Returns Point – Need a good parallelization solution for Multi-Core

  7. Agenda • Preface • Incentives for Task-Superscalar • Alternative solutions for ILP • Why use Task Level parallelism • Task Superscalar – next level of task level parallelism • The pipeline & similarity to ILP • Smart scheduling prospects • Overcoming ILP-pipeline bottlenecks with Task Superscalar • Topics for further research

  8. Alternative Solutions to ILP – dataflow machines (1) • Explicit dataflow machines [6] • An architecture where each operation points to the consumers of its results • Each instruction is a stand-alone task of its own • A program can be represented as a dependency graph with each instruction as a node (see the sketch below) • [Figure: dataflow graph after firing]
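  As a rough, illustrative sketch (not taken from any of the referenced machines), a dataflow node can be modeled as a record holding its operation, an outstanding-operand count, and pointers to its consumers; when the last operand arrives the node fires and forwards its result to each consumer:

    /* Minimal sketch of an explicit-dataflow node (illustrative only;
       names and structure are not taken from any referenced machine). */
    #include <stdio.h>

    #define MAX_CONSUMERS 4

    typedef struct Node Node;
    struct Node {
        const char *name;
        int         pending;                  /* operands still missing      */
        double      operand[2];               /* operand slots               */
        double    (*op)(double, double);      /* the node's operation        */
        Node       *consumer[MAX_CONSUMERS];  /* who receives the result     */
        int         consumer_slot[MAX_CONSUMERS];
        int         nconsumers;
    };

    static double add(double a, double b) { return a + b; }
    static double mul(double a, double b) { return a * b; }

    /* Deliver an operand; the node fires once all its operands have arrived. */
    static void deliver(Node *n, int slot, double value)
    {
        n->operand[slot] = value;
        if (--n->pending == 0) {
            double r = n->op(n->operand[0], n->operand[1]);
            printf("%s fired -> %g\n", n->name, r);
            for (int i = 0; i < n->nconsumers; i++)
                deliver(n->consumer[i], n->consumer_slot[i], r);
        }
    }

    int main(void)
    {
        /* Graph for (2*3) + (4*5): two multiply nodes feed one add node. */
        Node sum = { "add",  2, {0}, add, {0},    {0}, 0 };
        Node m1  = { "mul1", 2, {0}, mul, {&sum}, {0}, 1 };
        Node m2  = { "mul2", 2, {0}, mul, {&sum}, {1}, 1 };

        deliver(&m1, 0, 2); deliver(&m1, 1, 3);   /* mul1 fires -> 6              */
        deliver(&m2, 0, 4); deliver(&m2, 1, 5);   /* mul2 fires -> 20, add -> 26  */
        return 0;
    }

  Here the small graph for (2*3) + (4*5) executes purely by operand arrival, with no program counter.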

  9. Dataflow Machines (2) • Many studies focused on instruction-level dataflow architectures • Originally they did not support conventional memory and programming models like C/C++ • More recent work (TRIPS and WaveScalar) does allow better support for conventional SW models • However, synchronization overhead limits the scalability of such designs as well [5] – Dataflow machines provide an alternative way to extract ILP, but scaling is still limited by communication overhead

  10. Alternative Solutions to ILP (2) • Task-level parallel programming • Trigger: the shift to heavily CMPed architectures • Examples: CellSs and SMPSs from BSC; RapidMind; Sequoia • What is it? A SW infrastructure that • Allows programmers to tag each task variable as Input, Output, or InOut • Generates dependency graphs for tasks (based on those inputs/outputs) – statically at compilation or dynamically • Schedules execution of tasks when their inputs are ready – allowing OOO execution and completion • Allows renaming memory (input/output/inout locations) in case of false dependencies

  11. SMP Superscalar – Example • Programming example we will use to clarify operation: a block-matrix multiply & accumulate flow • [Figure: block matrices – a × b accumulated into c, d × e accumulated into f, then f accumulated into c] • The basic “Task”:

    #pragma css task input(A, B) inout(C)
    void block_macc(double A[N][N], double B[N][N], double C[N][N])
    {
        int i, j, k;
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                for (k = 0; k < N; k++)
                    C[i][j] += A[i][k] * B[k][j];
    }

    #pragma css task input(bigF) inout(bigC)
    void block_acc(…)

  • The “program” (block-matrix multiply and accumulate, N = 2):

    int i, j, k;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                block_macc(bigA[i][k], bigB[k][j], bigC[i][j]);
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                block_macc(bigD[i][k], bigE[k][j], bigF[i][j]);
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            block_acc(bigC[i][j], bigF[i][j]);

  12. SMPSs – Compiler & Dependency Generator • Source-to-source compiler: • takes user-annotated code (marked as tasks that can run in parallel) and generates encapsulated modules with data dependencies exposed • Runtime (1): generates dependency graphs dynamically • Each sub-block output of C and F can be calculated separately • Each sub-block sum of C and F can be calculated when its sources are ready • Can handle all inter-task dependencies automatically • For non-task code, it needs to be encapsulated in a separate “task”, and then the tool will create the dependencies correctly (a minimal sketch of such dependency tracking follows below) • [Figure: dynamically generated dependency graph – block_macc nodes (A00×B00, A01×B10, A10×B01, A11×B11, D00×E00, D01×E10, D10×E01, D11×E11) feeding the C and F blocks, which in turn feed block_acc nodes for C00 and C11]
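  A minimal illustrative sketch of how a runtime could build such a graph dynamically (hypothetical structures and names, not the actual SMPSs code): remember the last task that wrote each data block, and make every new task that reads a block a successor of that writer:

    /* Illustrative dependency tracking via a last-writer record
       (hypothetical structures; not the actual SMPSs runtime). */
    #define MAX_TASKS   256
    #define MAX_WRITES  256

    typedef struct {
        int unresolved;                 /* producers not yet finished          */
        int successors[MAX_TASKS];      /* tasks waiting on this one           */
        int nsucc;
    } Task;

    static Task  tasks[MAX_TASKS];
    static int   ntasks;
    static void *written_addr[MAX_WRITES];   /* data-block address             */
    static int   written_by[MAX_WRITES];     /* task id of its last writer     */
    static int   nwrites;

    static int last_writer(void *addr)
    {
        for (int i = nwrites - 1; i >= 0; i--)   /* most recent writer wins    */
            if (written_addr[i] == addr)
                return written_by[i];
        return -1;                               /* block is already in memory */
    }

    /* Submit a task given the addresses of its input and output/inout blocks. */
    int submit_task(void *inputs[], int nin, void *outputs[], int nout)
    {
        int id = ntasks++;
        for (int i = 0; i < nin; i++) {          /* true (RAW) dependencies    */
            int w = last_writer(inputs[i]);
            if (w >= 0) {
                tasks[w].successors[tasks[w].nsucc++] = id;
                tasks[id].unresolved++;
            }
        }
        for (int i = 0; i < nout; i++) {         /* record the new producer    */
            written_addr[nwrites] = outputs[i];
            written_by[nwrites++] = id;
        }
        /* A task with tasks[id].unresolved == 0 goes straight to a ready list. */
        return id;
    }

  In the matrix example, each block_acc would receive a dependency edge from the last block_macc that updated its bigC and bigF blocks. Inout blocks would be passed as both an input and an output; anti- and output dependencies (handled in SMPSs by renaming) are omitted from this sketch.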

  13. SMPSs – Scheduling • Runtime (2): scheduling • Scheduling goals: • Maximize parallelization – issue any task that is ‘ready’ • Minimize data-transfer overhead – run dependent tasks on the same processor if possible • Master thread: • runs all non-task user code, • generates a dependency list per task it encounters, • places all tasks with no input dependency in its “ready list” • Worker threads: • as many as the physical cores in the system (minus 1) • pick tasks from their own ready lists, from the master thread’s list, or from other threads’ lists if their own list is empty • once a task completes, push all dependent tasks that are now “ready” into their own ready list (see the worker-loop sketch below) • [Figure: shade/shape indicates the actual scheduling done by the runtime environment on an 8-thread CPU]
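  A simplified, single-threaded simulation of that policy (hypothetical code, not the SMPSs implementation): each worker prefers its own ready list, then falls back to the master's list, and finally steals from a peer:

    /* Illustrative simulation of the scheduling policy described above. */
    #include <stdio.h>

    #define NWORKERS 4
    #define QCAP     64

    typedef struct { int task[QCAP]; int head, tail; } Queue;

    static Queue ready[NWORKERS + 1];            /* slot 0 = master's list      */

    static void push(Queue *q, int t) { q->task[q->tail++ % QCAP] = t; }
    static int  pop (Queue *q)        { return q->head == q->tail ? -1
                                               : q->task[q->head++ % QCAP]; }

    static int get_work(int me)
    {
        int t = pop(&ready[me]);                 /* 1. my own ready list         */
        if (t < 0) t = pop(&ready[0]);           /* 2. the master's ready list   */
        for (int w = 1; t < 0 && w <= NWORKERS; w++)
            if (w != me) t = pop(&ready[w]);     /* 3. steal from another worker */
        return t;
    }

    int main(void)
    {
        for (int t = 1; t <= 8; t++)             /* master enqueues tasks with   */
            push(&ready[0], t);                  /* no input dependencies        */

        for (int me = 1; me <= NWORKERS; me++) {
            int t;
            while ((t = get_work(me)) >= 0)
                printf("worker %d runs task %d\n", me, t);
            /* On completion, a real runtime would push newly ready successors
               onto this worker's own list, to keep data local.                 */
        }
        return 0;
    }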

  14. SMPSs – Partial Synchronization Points • The runtime component cannot handle synchronization with non-task user code • One solution: encapsulate the relevant code in “task” format • Works well for simple user code, but not for “external visibility” events (e.g. I/O to disk at the end of the program) • Data-specific synchronization points: • barriers that prevent execution until specific data is “ready” • will delay until that data’s sources are complete, but will not wait until all source tasks are done

    int i, j;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            #pragma css wait on (bigF[i][j])
            block_writeOut(bigF[i][j], outputFile);
        }

  15. SMPSs – Summary • A SW infrastructure for extracting parallelism from serial code “automatically” • Runs on standard C/C++ code • Requires only programmer hints and annotations describing each task’s input/output characteristics • Done at a TASK level (not ILP) • Creates runtime dependency graphs • Distributes work between HW threads/cores based on data readiness, and in a manner that minimizes data transfers • Parallel distribution of work (like OpenCL/OpenMP), but without programmer intervention and analysis! • Can do loop unrolling if the input data is independent!

  16. Agenda • Preface • Incentives for Task-Superscalar • Alternative solutions for ILP • Why use Task Level parallelism • Task Superscalar – next level of task level parallelism • The pipeline & similarity to ILP • Smart scheduling prospects • Overcoming ILP-pipeline bottlenecks with Task Superscalar • Topics for further research

  17. Why Use Task-Level Parallelism • Scaling: • 100–1000 parallel tasks “discovered” in computational kernels exhibiting simple task-dependency structures such as MatMul, FFT, and kNN, but also Cholesky (multiple wave fronts), for over 50% of the execution duration • 10–150 parallel tasks evident even in complex dependency structures such as H.264 (diagonal wave front) and Jacobi (neighboring-block dependencies) • [Figure: available task parallelism in sample kernels, as uncovered by StarSs (similar to SMPSs), with an 8K-task window]

  18. Why Use Task-Level Parallelism (2) • Allows automatic work distribution between HW threads • Like OpenCL/OpenMP, but without the need for manual dependency analysis by the programmer • Can minimize memory traffic by scheduling consumers and producers of data on the same execution units and physical memory/cache • Relaxes the need for data-coherency management, thanks to the availability of data-flow info • Can use bulk synchronization, as shown in the HW solution (Rigel) • Can use DAG consistency [7] – optimized for distributed shared memory • → saves silicon area, saves energy, removes synchronization overhead

  19. Agenda • Preface • Incentives for Task-Superscalar • Alternative solutions for ILP • Why use Task Level parallelism • Task Superscalar – next level of task level parallelism • The pipeline & similarity to ILP • Smart scheduling prospects • Overcoming ILP-pipeline bottlenecks with Task Superscalar • Topics for further research

  20. Task Superscalar – OUR TOPIC • Goal: raising the level of abstraction • Start from the task-level parallelism work already done • Leverage the work already done on ILP-OOO systems • Each “operation” is now a TASK • Create an OOO task pipeline that • leverages the benefits of coarse-grain granularity • scales beyond what ILP can provide • Show how Task Superscalar overcomes ILP-OOO restrictions • Highlight directions for future research

  21. Task Superscalar Pipeline • Fetch: receive NON-SPECULATIVE tasks from master/worker threads • Decode: • task operand addresses are looked up in Object Versioning Tables (OVT) → identify dependencies • place the task, plus a mapping of its producers, into Multiplexed Reservation Stations (MRS) → a distributed dependency graph (see the decode sketch below) • Schedule: • wait until the input data is ready (MRS) • schedule “intelligently” • Execute: • create new non-speculative tasks as needed • execute computations • data results can be written to memory! (nothing is speculative!) • [Figure: pipeline diagram – Task Decode Unit with OVTs, Multiplexed Reservation Stations, a needed-task-generation stage, an HW/SW scheduler and dispatch, and worker processors / functional units / GPU as the backend]
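  A minimal sketch of the decode step (hypothetical data structures, not the actual OVT design): each operand address is looked up in a version table, and a hit on a still-pending output makes that output's producer a predecessor of the new task:

    /* Illustrative object-version lookup during task decode
       (hypothetical structures; not the actual OVT design). */
    #include <stddef.h>

    #define TABLE_SIZE 1024

    typedef struct {
        void *addr;        /* address of the data object (block)              */
        int   producer;    /* id of the task that produces this version       */
        int   pending;     /* 1 while that producer has not yet executed      */
    } VersionEntry;

    static VersionEntry ovt[TABLE_SIZE];

    static size_t slot_of(void *addr) { return ((size_t)addr >> 4) % TABLE_SIZE; }

    /* Decode one operand of task `task_id`.  Returns the producer task the
       new task must wait for, or -1 if the operand is already in memory.     */
    int decode_operand(int task_id, void *addr, int is_output)
    {
        VersionEntry *e = &ovt[slot_of(addr)];   /* trivial hash; no collision
                                                    handling in this sketch   */
        int wait_for = (e->addr == addr && e->pending) ? e->producer : -1;

        if (is_output) {                         /* register the new version   */
            e->addr     = addr;
            e->producer = task_id;
            e->pending  = 1;
        }
        return wait_for;  /* recorded as a producer edge in the task's MRS entry */
    }

  When the producer task later executes, the runtime would clear its pending version and notify the consumers recorded in the MRS, exactly as a retiring instruction wakes up its dependents in an ILP pipeline.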

  22. Similarity to the ILP OOO Pipeline • Very similar in concept, but the coarse-grain nature allows • slower heuristics and “decisions” • distributed OOO management (MRS) • smarter dispatch/scheduling (SW/HW) • using L1/memory for operands instead of register files → a much larger OOO window (>10K)

  23. Smart Scheduling Prospects • Minimize data-transfer overhead: schedule consumer tasks on the same processors as producer tasks (same cache or NUMA memory) • Done in CellSs (a predecessor of SMPSs) [3]: • the runtime keeps a hierarchy of “ready queues”, from low to high priority, based on temporal locality; ReadyLocks hold data-dependency info for all tasks in the ready queues • when SPE1 completes execution of a task, it notifies that data X is updated and present in its local memory (updating a LocHint array) • on every scheduler pass, when the scheduler detects that X is “available” in SPEi, it bumps the dependent task “t” one step, from queue Rj to Rj+1 – now more likely to be scheduled (see the sketch below)
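  A rough sketch of that priority bump (hypothetical structures, not the CellSs code): when a completed task reports that block X now resides in an SPE's local store, every ready task that consumes X is promoted one level in the ready-queue hierarchy:

    /* Illustrative locality-driven priority bump (not the CellSs scheduler). */
    #define NLEVELS   4            /* ready-queue hierarchy R0 (low) .. R3 (high) */
    #define MAX_READY 128

    typedef struct {
        int   id;
        int   level;               /* current ready-queue level                   */
        void *needs;               /* one input block, for simplicity             */
    } ReadyTask;

    static ReadyTask ready_tasks[MAX_READY];
    static int       nready;

    /* Called when an SPE reports that `block` now resides in its local store. */
    void on_block_resident(void *block, int spe_id)
    {
        (void)spe_id;              /* a fuller version would also record WHERE,   */
                                   /* so the task is dispatched to that same SPE  */
        for (int i = 0; i < nready; i++) {
            ReadyTask *t = &ready_tasks[i];
            if (t->needs == block && t->level < NLEVELS - 1)
                t->level++;        /* bump one queue level: more likely to be     */
        }                          /* picked on the next dispatch pass            */
    }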

  24. Smart Scheduling Prospects (2) • Match the required task computations to CPU capabilities: • FP, vector arithmetic, media accelerators, heavy single-thread high-performance computations • Utilize GPUs as the backend • Automatically co-schedule parts of the task in parallel on multiple machines • → a “hybrid computing adapter”

  25. Agenda • Preface • Incentives for Task-Superscalar • Alternative solutions for ILP • Why use Task Level parallelism • Task Superscalar – next level of task level parallelism • The pipeline & similarity to ILP • Smart scheduling prospects • Overcoming ILP-pipeline bottlenecks with Task Superscalar • Topics for further research

  26. ILP-OOO Bottlenecks • Typical ILP-OOO bottleneck: efficient implementation and utilization of a large instruction window • Limits: • Conceptual, execution-model-related issues: • branches and jumps every 4–5 instructions → many instructions in flight are ‘speculative’ and will be dropped • data anti-dependencies → need address resolution to decide whether there is an ordering issue or not • exceptions: require precise handling (detect the exact point of occurrence; restore the pipe to the correct state) • Technology limits & physics: • timing for synchronization & clocking: updates on every instruction → need a uniform clock and synchronized logic for fast pipe updates → need a fast mapping between instructions and reservation stations → simplistic scheduling • memory wall → instructions access memory “at random”; a data fetch into the cache takes 100s of cycles, stalling the pipe

  27. Overcoming Pipeline Timings • Pipeline synchronization must occur at the issue rate of the execution items • ILP pipeline: an instruction every few ns • a synchronous implementation using a global clock • pipeline size limited by timing • Task pipeline: an “issue” every few µs • relaxed timing constraints • allows an asynchronous implementation → a scalable implementation is possible

  28. Window Implementation Physics (1) Bypass network and reservation-station architecture: • ILP-OOO: fast updates require a 1:1 mapping of instructions to reservation stations and a bus for parameter updates at retirement • This limits the instruction window to 128–256 in-flight instructions • Task pipelines: updates every 1–10 µs • Allows replacing the bypass bus with a switched NoC → supports hundreds of reservation stations and processing elements • Allows storing many tasks in a single RS using large arrays (eDRAM); assuming task metadata requires 128 B–1 KB, a 256 KB eDRAM could accommodate up to 2K tasks per station (worked out below)
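  The sizing claim above is simple arithmetic, taking the smaller (128 B) descriptor size:

    \frac{256\ \text{KB}}{128\ \text{B}} = \frac{262144\ \text{B}}{128\ \text{B}} = 2048 \approx 2\text{K tasks per station}

  With the larger 1 KB descriptors, the same 256 KB array would hold only 256 tasks.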

  29. Window Implementation Physics (2) Exception support can also be relaxed for task-based scheduling: • ILP pipe: precise exception tracking is required; complexity increases exponentially as the instruction window grows • Task pipe: no need to manage exceptions at a global level • I/O interrupts can be handled by dedicated processors • traps and many exceptions (page faults) are task-dependent → handled locally by the assigned processor • a global effect occurs mostly for error conditions (division by zero, for example) → these can be handled by a software runtime

  30. Window Implementation Physics (3) Smart scheduling and prefetching: • Can use SW scheduling as opposed to HW-based mechanisms: more flexibility • Can be better tuned to data locality, memory BW, heterogeneous processing, etc. • Locality-based scheduling for memory-BW savings • Leverage the data-flow information available in the model for aggregate data transfers • DMA, prefetch instructions • Hides memory latencies altogether

  31. Window Utilization and Speculation • ILP: heavy speculation to overcome the memory wall and extract ILP • the branch-instruction rate (1 in 4–5) impacts pipe efficiency (even with branch prediction) • Task pipeline: not speculative • tasks are scheduled only by committed instructions • all activated tasks will retire and will be useful • can support multiple task generators and nested tasks, since everything is non-speculative • → the window can grow to 1000s without increasing inefficiency

  32. Overcoming the Memory Wall • By nature, solving the memory wall requires non-random access to slow memories – not a von Neumann architecture… • This triggered work on data-flow and intelligent-RAM architectures • However, the programming model is very different from what programmers are used to • The task pipeline allows us to have the cake and eat it too: • task execution: classic von Neumann, with random access to memories • however – activate bulk transfers of data ahead of execution (data-flow hints from the dependency graphs) • DMA (Cell); prefetch (IA-64) (see the sketch below) • Effectively removes the memory wall, with correct management
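  A minimal sketch of the idea (illustrative only; a Cell runtime would instead issue DMA transfers into the SPE local store): because the dependency graph names a task's input blocks in advance, the runtime can prefetch the next scheduled task's inputs while the current task executes. Uses the GCC/Clang __builtin_prefetch intrinsic; the TaskDesc structure is hypothetical:

    /* Illustrative software prefetch of the NEXT task's input blocks. */
    #include <stddef.h>

    typedef struct {
        void  *input[4];           /* input-block base addresses               */
        size_t bytes[4];           /* size of each block                       */
        int    ninputs;
        void (*run)(void **);      /* the task body                            */
    } TaskDesc;

    static void prefetch_inputs(const TaskDesc *t)
    {
        for (int i = 0; i < t->ninputs; i++) {
            const char *p   = (const char *)t->input[i];
            const char *end = p + t->bytes[i];
            for (; p < end; p += 64)              /* one touch per 64 B line    */
                __builtin_prefetch(p, 0, 1);      /* read, low temporal reuse   */
        }
    }

    void run_schedule(TaskDesc *tasks, int n)
    {
        for (int i = 0; i < n; i++) {
            if (i + 1 < n)
                prefetch_inputs(&tasks[i + 1]);   /* overlap the next task's    */
            tasks[i].run(tasks[i].input);         /* data movement with the     */
        }                                         /* current task's execution   */
    }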

  33. Task Pipeline Summary • Done automatically, without user “analysis” (unlike OpenMP, …) • Leverages the coarse granularity of task activation to overcome many of the ILP bottlenecks • Scales to a window depth of 1000s • Can be done in a distributed manner • Leverages data-dependency information to overcome the memory wall • → Allows parallelism to scale to 1000s of processors automatically – the next generation of OOO.

  34. Topics for the Next Stage of Research • Should draw from existing OOO ILP pipeline knowledge, aiming to provide task-level managed parallelism • Task-level compiler optimizations • Possible HW accelerations • Task-dependency decoding • Novel task- and data-scheduling assists • Existing work: • Carbon and ADM – use hardware task queues to support fast task dispatch and stealing • MLCA – manages global dependencies using a universal register file • StarPU: • a SW environment for data-coherency management and a unified execution model • allows managing and running parallel code on a variety of processor types simultaneously • contains • a high-level library that transparently performs data movements • a unified execution model • Missing: smart HW-based heuristics and decision making?

  35. References • [1] Y. Etsion, A. Ramirez, R. Badia, E. Ayguade, J. Labarta, M. Valero: Task Superscalar: Using Processors as Functional Units; Barcelona Supercomputing Center – Yoav Etsion website • [2] SMPSs – SMP Superscalar; Barcelona Supercomputing Center – http://www.bsc.es/plantillaG.php?cat_id=385 • [3] P. Bellens, J. Perez, F. Cabarcas, A. Ramirez, R. Badia, and J. Labarta: CellSs: Scheduling techniques to better exploit memory hierarchy; Scientific Programming 17 (2009) 77–95 • [4] A. Cristal, O. J. Santana, F. Cazorla, M. Galluzzi, T. Ramirez, M. Pericas, and M. Valero: Kilo-instruction processors: Overcoming the memory wall; IEEE Micro, 25(3):48–57, 2005 • [5] R. A. Iannucci: Toward a dataflow/von Neumann hybrid architecture; In Intl. Symp. on Comp. Arch., pp. 131–140, May 1988 • [6] A. Veen: Dataflow machine architecture – http://www.cs.uiuc.edu/class/sp10/cs533/reading_list/p365-veen.pdf • [7] R. Blumofe, M. Frigo, C. Joerg, C. Leiserson, and K. Randall: DAG-consistent distributed shared memory; In Intl. Parallel Processing Symp., pp. 132–141, Apr 1996

  36. Backup (in the works)

  37. Kilo-Instruction Window Processors [4] (1) • ILP pipeline memory issue • an L2 miss requires >500 cycles to fill • the pipe blocks when the ROB fills (~100 entries) • misses usually come in bursts (task switch) – and it happens again and again…

  38. Kilo-Instruction Window Processors (2) • For 500- and 1000-cycle memory latency (typical for DRAM), a 4K window provides ~1.5x performance gain over a 128-entry window (current ILP designs) for SPECint, and >2.5x for SPECfp

  39. Kilo-Instruction Processors – Implementation Issues • ROB and register resources for 2–4K instructions are impractical • However, the instruction flight-time distribution shows that only 30% of instructions need this window size • Possible directions • “Checkpoint” the architectural state only at specific instructions and release the ROB earlier (may be less efficient on misprediction, but reasonable area-wise) • Bi-modal issue queue: • the issue queue is a very expensive resource; it holds all instructions waiting for data before execution → a stall on a load, etc., can be very expensive • solution: move long-delay instructions to smaller/cheaper queues and return them to the “fast” queues when their data is ready • PRF • …

  40. Data-Flow Machines – memory format
