Beyond Auto-Parallelization: Compilers for Many-Core Systems
Marcelo Cintra, University of Edinburgh
http://www.homepages.inf.ed.ac.uk/mc
Compilers for Parallel Computers (Today)
• Auto-parallelizing compilers
  • “Holy grail”: convert sequential programs into parallel programs with little or no user intervention
  • Only partial success, despite decades of work
  • No performance debugging tools
• Compilers for explicitly parallel languages/annotations (e.g., OpenMP, Java Threads)
  • Main goal: correctly map high-level data and control flow to hardware/OS threads and communication
  • Secondary goal: perform simple optimizations specific to parallel execution
  • Simple correctness and performance debugging tools
Moore for Less Keynote - September 2008
Compilers for Parallel Computers (Future)
• Data flow/dependence analysis tools – unsafe/speculative
  • Probabilistic approaches
  • Profile-based approaches
• Multithreading-specific optimization toolbox
  • Including alternative/speculative parallel programming models (e.g., Transactional Memory (TM))
• Auto-parallelizing compilers – with speculation
  • Thread-level speculation (TLS)
  • Helper threads
The result: a holistic parallelizing tool chain.
Why Be Speculative?
• The performance of programs is ultimately limited by control and data flow
• Most compiler optimizations exploit knowledge of control and data flow
• Techniques based on complete/accurate knowledge of control and data flow are reaching their limits
  • True for both sequential and parallel optimizations
Future compiler optimizations must rely on incomplete knowledge: speculative execution.
Compilers for Parallel Computers (Future)
[Tool-chain diagram: a Dependence/Flow Analysis Tool performs unsafe analysis of the sequential program; a Parallelizing Compiler targets TM to produce P-way parallel code, while an Auto-TLS Compiler targets TLS to produce <P-way parallel code.]
Outline
• Context and motivation
  • History and status quo of auto-parallelizing compilers
  • Data dependence analysis for array-based programs
  • Data dependence analysis for irregular programs
• Auto-parallelizing compilers for TLS
  • TLS execution model (speculative parallelization)
  • Static compiler cost model (PACT’04, TACO’07)
Data Dependence Analysis for Arrays
• Based on mathematical evaluation of array index expressions within loop nests
• Progressively more capable analyses (e.g., GCD test, Banerjee test), but still restricted to affine loop index expressions
• Coupled with a mathematical framework to represent loop transformations (e.g., loop interchange, skewing) that can help expose more parallelism
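As a concrete illustration of the first kind of test (a minimal sketch, not the full Banerjee framework): for two affine references A[a*i + b] and A[c*j + d] in a loop nest, the GCD test reports that a dependence is possible only if gcd(a, c) divides d - b; otherwise the equation a*i + b = c*j + d has no integer solution, so the two references can never touch the same element.

```python
from math import gcd

def gcd_test(a, b, c, d):
    """GCD dependence test for references A[a*i + b] and A[c*j + d].

    Returns False when a*i + b == c*j + d has no integer solution,
    i.e. the two references can never alias.  Returns True when a
    dependence is *possible* (the test is conservative: it ignores
    loop bounds and directions)."""
    if a == 0 and c == 0:
        return b == d            # both references are loop-invariant
    return (d - b) % gcd(a, c) == 0

# A[2*i] vs A[2*j + 1]: even vs odd indices, can never overlap
assert gcd_test(2, 0, 2, 1) is False
# A[i] vs A[j + 40]: gcd(1, 1) = 1 divides 40, dependence is possible
assert gcd_test(1, 0, 1, 40) is True
```

Note the asymmetry: a False answer is a proof of independence, while a True answer only means the compiler must assume a dependence (or apply a stronger test such as Banerjee's).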
Data Dependence Analysis for Arrays
• What’s wrong with traditional data dependence analysis?
  • Not all index expressions are affine or even statically defined (e.g., subscripted subscripts)
  • Not all loops are well structured (e.g., conditional exits, internal control flow)
  • Not all procedures are analyzable (e.g., unavailable code, aliasing, global data access)
  • Not all applications make intense use of arrays and loop nests (many are built on trees, hash tables, linked lists, etc.)
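A hypothetical fragment illustrating the first point: with subscripted subscripts, whether two iterations conflict depends on the runtime contents of the index arrays, so no compile-time test on the index expressions alone can decide it. The `scatter` helper below is invented for this example.

```python
def scatter(A, K, L):
    """Irregular loop: A[K[i]] = A[L[i]] + 1.

    The index arrays K and L are only known at run time, so the
    compiler cannot prove (or disprove) loop-carried dependences."""
    for i in range(len(K)):
        A[K[i]] = A[L[i]] + 1

# Here K[i] == L[i+1], so every iteration reads the previous
# iteration's write: a loop-carried RAW dependence chain.
A = [0] * 8
scatter(A, K=[1, 2, 3], L=[0, 1, 2])
assert A == [0, 1, 2, 3, 0, 0, 0, 0]

# With disjoint index sets the very same loop is fully parallel.
B = [0] * 8
scatter(B, K=[4, 5], L=[0, 1])
assert B == [0, 0, 0, 0, 1, 1, 0, 0]
```

The same source loop is serial for one input and embarrassingly parallel for another, which is exactly what makes a purely static answer impossible and motivates speculative approaches.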
Data Dependence Analysis for Irregular Programs
• Based on ad hoc analyses (e.g., pointer analysis, shape analysis, task graph analysis)
• There is no comprehensive data dependence analysis framework for irregular applications
Thread-Level Speculation (TLS)
• Assume no dependences and execute threads in parallel
• While speculating, buffer speculative data separately
• Track data accesses and monitor cross-thread violations
• Squash offending threads and restart them
• All this can be done in hardware, software, or a combination

for(i=0; i<100; i++) {
  … = A[L[i]] + …
  A[K[i]] = …
}

Example: iteration J reads A[4] and writes A[5]; iteration J+1 reads and writes A[2]; iteration J+2 reads A[5] and writes A[6]. The write of A[5] by iteration J and the earlier speculative read of A[5] by iteration J+2 form a cross-thread RAW violation, so iteration J+2 is squashed and restarted.
TLS Overheads
• Squash & restart: re-executing the squashed threads
• Speculative buffer overflow: when the speculative buffer is full, the thread stalls until it becomes non-speculative
• Dispatch & commit: writing back speculative data to memory and starting the next speculative thread
• Load imbalance: a processor waits for its thread to become non-speculative before it can commit
Coping with Overheads: a Cost Model!
• Compiler cost models are key to guiding optimizations, but no such cost model exists for TLS
  • Speculative parallelization can deliver significant speedup or slowdown
  • There are several speculation overheads
  • The overheads are hard to estimate (e.g., will a thread be squashed?)
• A quantitative prediction of the speedup can be useful
  • e.g., in a multi-tasking environment:
    • program A wants to run speculatively in parallel on 4 cores (predicted speedup: 1.8)
    • other programs are waiting to be scheduled
    • the OS decides speculation does not pay off
TLS Overheads
• Squash & restart: re-executing the squashed threads
  • Hard to estimate because violations are highly unpredictable
• Speculative buffer overflow: when the speculative buffer is full, the thread stalls until it becomes non-speculative
  • Hard because write-sets are somewhat unpredictable
• Dispatch & commit: writing back speculative data to memory and starting the next speculative thread
  • Hard because write-sets are somewhat unpredictable
• Load imbalance: a processor waits for its thread to become non-speculative before it can commit
  • Hard because workloads are very unpredictable, and order matters due to the in-order commit requirement
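The flavour of such a cost model can be sketched in a back-of-the-envelope formula (illustrative only; this is not the model from the PACT’04/TACO’07 papers): estimate the parallel time of a batch of threads as its slowest thread plus dispatch/commit overhead, inflate it by the expected squash re-execution, and divide sequential time by that.

```python
def predicted_speedup(thread_costs, p_squash, overhead):
    """Illustrative TLS speedup estimate for one batch of P threads.

    thread_costs: estimated cycles of each speculative thread
    p_squash:     probability a thread is squashed and re-executed
    overhead:     dispatch + commit cost per thread (cycles)

    Load imbalance shows up as max(thread_costs): with in-order
    commit, the batch cannot finish before its slowest thread."""
    t_seq = sum(thread_costs)
    t_par = max(thread_costs) + overhead
    # Expected squash cost: a squashed thread re-runs once (crude).
    t_par += p_squash * max(thread_costs)
    return t_seq / t_par

# Balanced threads, rare squashes: speculation pays off (~3.3x on 4 cores).
assert round(predicted_speedup([100, 100, 100, 100], 0.1, 10), 2) == 3.33
# One long thread dominates: load imbalance wipes out the benefit (~1.08x).
assert round(predicted_speedup([100, 10, 10, 10], 0.1, 10), 2) == 1.08
```

Even this crude sketch shows why a quantitative answer matters: both loops are "parallelizable", but only the first is worth the four cores.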
Our Compiler Cost Model: Highlights
• First fully static compiler cost model for TLS
• Handles all TLS overheads in a single framework
  • Including load imbalance, which is not handled by any other cost model
• Produces not just a qualitative (“good” or “bad”) assessment of the TLS benefits but a quantitative one (i.e., expected speedup/slowdown)
• Can be easily integrated into most compilers at the intermediate representation level
• Simple and fast to compute
Speedup Distribution
[Chart: very varied speedup/slowdown behavior.]
Model Accuracy (I): Outcomes
[Chart]
• Most speedups/slowdowns correctly predicted by the model
• Only 17% false positives (performance degradation)
• Negligible false negatives (missed opportunities)
Current Developments
• Done:
  • Completed implementation of a TLS code generator in GCC
• Doing:
  • Implementing the cost model in this TLS GCC
  • Profiling TLS program behavior (with IBM and U. of Manchester)
• To do:
  • Develop hybrid cost models based on static and profile information
  • Develop “intelligent” cost models based on machine learning (with U. of Manchester)
Summary
• Paraphrasing M. Snir† (UIUC): “parallel programming will have to become synonymous with programming”
• However, getting there will require:
  • Better (and unsafe) data dependence analysis tools
  • Explicit (and speculative) parallel programming models
  • Auto-parallelizing (speculative) compilers
• Much work still needs to be done. At U. of Edinburgh:
  • Auto-parallelizing TLS compilers
  • TLS hardware
  • STM (software TM)
† Director of the Intel+Microsoft UPCRC
Acknowledgments
• Research team and collaborators:
  • Jialin Dou, Salman Khan, Polychronis Xekalakis, Nikolas Ioannou, Fabricio Goes, Constantino Ribeiro
  • Dr. G. Brown, Dr. M. Lujan, Prof. I. Watson (U. of Manchester)
  • Prof. Diego Llanos (U. of Valladolid)
• Funding:
  • UK – EPSRC: GR/R65169/01, EP/G000697/1