250 likes | 265 Views
Compiling with multicore. Jeehyung Lee 15-745 Spring 2009. Papers. Automatic Thread Extraction with Decoupled Software Pipelining Fully automatic Fine grained pipelining A Practical Approach to Exploring Coarse-Grained Pipeline Parallelism in C Programs Semi-automatic
E N D
Compiling with multicore Jeehyung Lee 15-745 Spring 2009
Papers • Automatic Thread Extraction with Decoupled Software Pipelining • Fully automatic • Fine grained pipelining • A Practical Approach to Exploring Coarse-Grained Pipeline Parallelism in C Programs • Semi-automatic • Coarse grained pipelining
First paper • Automatic Thread Extraction with Decoupled Software Pipelining • Guilherme Ottoni, Ram Rangan, Adam Stoler and David August • From Princeton University
What is the paper about? • Despite increasing uses of multiprocessors, many single threaded applications do not benefit • Let the compiler automatically extract threads and exploit lurking pipeline parallelism • Extract non-speculative and truly decoupled threads through Decoupled Software Pipelining(DSWP)
Why decoupled pipelining? Example Linked list traversal
Why decoupled pipelining? DOACROSS Iteration * (LD latency + communication latency)
Why decoupled pipelining? DSWP One way pipelining Iteration * LD latency
DSWP • Flow of data (dependency) is acyclic among cores • With use of inter-core queue, threads can be decoupled • Efficiency + high tolerance for latency
DSWP Algorithm • Build dependence graph • Find strongly connected components (SCC) • Create DAG of SCC • Partition DAG • Split codes into partitions • Add flows to partitions
Build dependence graph Include every traditional dependence (data, control, and memory) & extensions
Find SCC • SCC : Instructions that form a dependency cycle in a loop • Instructions in SCC cannot be parallelized 1 1 2 2 1 2
Create DAG of SCCs • Merge instructions within each SCC and update dependency arrows
Partition DAG • Partition DAG nodes into n partitions ( n <= # of processors) • Use heuristic to maximize load balance • Decide # of partitions (threads) • Start filling in from partition 1 with nodes from the top of DAG. • When the partition is stuffed (estimated by # of cycles), move on to next partition • Find the best # of threads and its partition
Split codes and insert flows (done!) • For each partition, insert code basic blocks relevant to its contained SCC node • Add in codes for dependency flow
Result • 19.4% speedup on important benchmark loops, 9.2% overall • When core bandwidth is halved • Single threaded code slows down by 17.1% • DSWP code is still slightly faster than single-threaded code running on full-bandwidth core • Promising enabler for Thread-Level-Parallelism(TLP)?
Second Paper • A Practical Approach to Exploring Coarse-Grained Pipeline Parallelism in C Programs • William Thies, Vikram Chandrasekhar and Saman Amaransinghe • From MIT
What is the paper about? • Despite increasing uses of multiprocessors, many single threaded… (Repeated) • Coarse grained pipelining is more desirable, but is especially hard with obfuscated C codes • Let people define pipeline, and learn practical dependencies in runtime
What is the paper about? • Despite increasing uses of multiprocessors, many single threaded… (Repeated) • Coarse grained pipelining is more desirable, but is especially hard with obfuscated C codes • Let people define stages, and learn practical dependencies in runtime …for streaming applications
Interface • Add annotations in the body of top loop
Dynamic analysis • The system creates a stream graph according to annotations. How do they find dependencies?
Dynamic analysis • Streaming applications tend to have a fixed pattern of dataflow (stable flow) among pipeline stages
Dynamic analysis • Run the application on training examples, and record every relevant store-load pair across pipeline boundaries This gives us practical dependencies
Interface • Program shows a complete stream graph • User decides if he/she likes this • pipelining or not • If yes, done! • else, redo annotations. Iterate over until satisfied
Actual pipelining • When compiled, annotation macros emit codes that will fork original program for each pipeline stage
Result • Average 2.78x speedup, max 3.89x on 4-core • Seems unsound but practical (?)