Compiling with multicore

Compiling with multicore Jeehyung Lee 15-745 Spring 2009

Papers • Automatic Thread Extraction with Decoupled Software Pipelining • Fully automatic • Fine grained pipelining • A Practical Approach to Exploring Coarse-Grained Pipeline Parallelism in C Programs • Semi-automatic • Coarse grained pipelining

First paper • Automatic Thread Extraction with Decoupled Software Pipelining • Guilherme Ottoni, Ram Rangan, Adam Stoler and David August • From Princeton University

What is the paper about? • Despite increasing uses of multiprocessors, many single threaded applications do not benefit • Let the compiler automatically extract threads and exploit lurking pipeline parallelism • Extract non-speculative and truly decoupled threads through Decoupled Software Pipelining(DSWP)

Why decoupled pipelining? Example Linked list traversal

Why decoupled pipelining? DOACROSS Iteration * (LD latency + communication latency)

Why decoupled pipelining? DSWP One way pipelining Iteration * LD latency

DSWP • Flow of data (dependency) is acyclic among cores • With use of inter-core queue, threads can be decoupled • Efficiency + high tolerance for latency

DSWP Algorithm • Build dependence graph • Find strongly connected components (SCC) • Create DAG of SCC • Partition DAG • Split codes into partitions • Add flows to partitions

Build dependence graph Include every traditional dependence (data, control, and memory) & extensions

Find SCC • SCC : Instructions that form a dependency cycle in a loop • Instructions in SCC cannot be parallelized 1 1 2 2 1 2

Create DAG of SCCs • Merge instructions within each SCC and update dependency arrows

Partition DAG • Partition DAG nodes into n partitions ( n <= # of processors) • Use heuristic to maximize load balance • Decide # of partitions (threads) • Start filling in from partition 1 with nodes from the top of DAG. • When the partition is stuffed (estimated by # of cycles), move on to next partition • Find the best # of threads and its partition

Split codes and insert flows (done!) • For each partition, insert code basic blocks relevant to its contained SCC node • Add in codes for dependency flow

Result • 19.4% speedup on important benchmark loops, 9.2% overall • When core bandwidth is halved • Single threaded code slows down by 17.1% • DSWP code is still slightly faster than single-threaded code running on full-bandwidth core • Promising enabler for Thread-Level-Parallelism(TLP)?

Second Paper • A Practical Approach to Exploring Coarse-Grained Pipeline Parallelism in C Programs • William Thies, Vikram Chandrasekhar and Saman Amaransinghe • From MIT

What is the paper about? • Despite increasing uses of multiprocessors, many single threaded… (Repeated) • Coarse grained pipelining is more desirable, but is especially hard with obfuscated C codes • Let people define pipeline, and learn practical dependencies in runtime

What is the paper about? • Despite increasing uses of multiprocessors, many single threaded… (Repeated) • Coarse grained pipelining is more desirable, but is especially hard with obfuscated C codes • Let people define stages, and learn practical dependencies in runtime …for streaming applications

Interface • Add annotations in the body of top loop

Dynamic analysis • The system creates a stream graph according to annotations. How do they find dependencies?

Dynamic analysis • Streaming applications tend to have a fixed pattern of dataflow (stable flow) among pipeline stages

Dynamic analysis • Run the application on training examples, and record every relevant store-load pair across pipeline boundaries This gives us practical dependencies

Interface • Program shows a complete stream graph • User decides if he/she likes this • pipelining or not • If yes, done! • else, redo annotations. Iterate over until satisfied

Actual pipelining • When compiled, annotation macros emit codes that will fork original program for each pipeline stage

Result • Average 2.78x speedup, max 3.89x on 4-core • Seems unsound but practical (?)

Compiling with multicore