Combining Thread Level Speculation, Helper Threads, and Runahead Execution University of Edinburgh http://www.homepages.inf.ed.ac.uk/mc/Projects/VESPA Polychronis Xekalakis, Nikolas Ioannou and Marcelo Cintra
Introduction • Single-core, out-of-order designs don’t scale • Simpler solution: multi-core architectures • No speedup for single-threaded applications • Use Thread Level Speculation to extract TLP • Use Helper Threads or RunAhead to improve ILP • However, for different applications (or phases), some models work better than others • Our Proposal: • Combine these execution models • Decide at runtime when to employ them ICS 2009
Contributions • Introduce mixed Speculative Multithreading (SM) Execution Models • Design one that combines TLS, HT and RA • Propose a performance model able to quantify ILP and TLP benefits • Unified approach outperforms state-of-the-art SM models: • TLS by 10.2% avg. (up to 41.2%) • RA by 18.3% avg. (up to 35.2%)
Outline • Introduction • Speculative Multithreading Models • Performance Model • Unified Scheme • Experimental Setup and Results • Conclusions
Helper Threads • Compiler deals with: • Memory ops that miss / hard-to-predict branches • Backward slices • HW deals with: • Spawn threads • Different context • Discard when finished • Benefit: • ILP (Prefetch/Warmup)
RunAhead Execution • Compiler deals with: • Nothing • HW deals with: • Different context • When to do RA • Value prediction (VP) for memory • Commit/Discard • Benefit: • ILP (Prefetch/Warmup)
Thread Level Speculation • Compiler deals with: • Task selection • Code generation • HW deals with: • Different context • Spawn threads • Detecting violations • Replaying • Arbitrate commit • Benefit: TLP/ILP • TLP (Overlapped Execution) • + ILP (Prefetching)
Understanding Performance Benefits • Complex TLS thread interactions obscure performance benefits • Even more true for mixed execution models • We need a way to quantify ILP and TLP contributions to bottom-line performance • Proposed model: • Able to break benefits down into ILP and TLP contributions
Performance Model • $S_{all} = S_{seq} \times S_{ilp} \times S_{ovl}$ • Compute overall speedup: $S_{all} = T_{seq}/T_{mt}$ • Compute sequential TLS speedup: $S_{seq} = T_{seq}/T_{1p}$ • Compute speedup due to ILP: $S_{ilp} = (T_1 + T_2)/(T_1' + T_2')$ • Use everything to compute TLP: $S_{ovl} = S_{all}/(S_{seq} \times S_{ilp})$
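As a rough illustration of how the speedup breakdown could be computed from measured times, here is a minimal Python sketch. The function name, the argument names and all timing values are hypothetical. It also assumes that T1 and T2 are per-thread times from the single-processor run and that T1' and T2' are the same threads' times in the multithreaded run; the slides do not spell this out.

```python
def speedup_breakdown(t_seq, t_mt, t_1p, thread_times_1p, thread_times_mt):
    """Split the overall speedup into sequential, ILP and overlap (TLP) factors.

    All parameter names are illustrative assumptions, not taken from the slides.
    """
    s_all = t_seq / t_mt                                  # overall speedup
    s_seq = t_seq / t_1p                                  # sequential TLS speedup
    s_ilp = sum(thread_times_1p) / sum(thread_times_mt)   # (T1+T2)/(T1'+T2')
    s_ovl = s_all / (s_seq * s_ilp)                       # what remains is TLP
    return s_all, s_seq, s_ilp, s_ovl


# Made-up timings (arbitrary units), used only to exercise the formulas.
s_all, s_seq, s_ilp, s_ovl = speedup_breakdown(
    t_seq=100.0, t_mt=50.0, t_1p=110.0,
    thread_times_1p=[60.0, 50.0], thread_times_mt=[48.0, 40.0],
)
print(f"S_all={s_all:.3f} S_seq={s_seq:.3f} S_ilp={s_ilp:.3f} S_ovl={s_ovl:.3f}")
```

Because S_ovl is defined as the remainder, the three factors always multiply back to S_all; a value of S_seq below 1 would reflect overhead of the TLS code on one processor.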
Unified Execution Model • Can we improve TLS? • Some of the threads do not help • Slack in usage of cores • Improve TLP: • Requires a better compiler • Improve ILP: • Combine TLS with another SM model! • Most of the HW is common
Combining TLS, HT and RA • Start with TLS • Provide support to clone TLS threads and convert them to HT • Conversion to HT means: • Put them in RA mode • Suppress squashes and do not cause additional squashes • Discard them when they finish • No compiler slicing: purely HW approach
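The clone-and-convert step described on this slide can be pictured with a small Python sketch. This is a software analogy of what the slide describes as a hardware mechanism; the `ThreadContext` fields, the `Mode` enum and `clone_as_helper` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, replace
from enum import Enum, auto


class Mode(Enum):
    TLS = auto()
    HELPER_RUNAHEAD = auto()  # helper thread running in RA mode


@dataclass(frozen=True)
class ThreadContext:
    # Illustrative fields only; a real context holds registers, buffers, etc.
    thread_id: int
    pc: int
    mode: Mode = Mode.TLS
    can_be_squashed: bool = True     # violations against it trigger a replay
    can_squash_others: bool = True   # it may cause squashes of other threads
    commits_state: bool = True       # its results are kept when it finishes


def clone_as_helper(parent: ThreadContext, new_id: int) -> ThreadContext:
    """Copy the parent's context and apply the conversion rules from the slide."""
    return replace(
        parent,
        thread_id=new_id,
        mode=Mode.HELPER_RUNAHEAD,   # put the clone in RA mode
        can_be_squashed=False,       # suppress squashes
        can_squash_others=False,     # do not cause additional squashes
        commits_state=False,         # discard it when it finishes
    )


parent = ThreadContext(thread_id=3, pc=0x400)
helper = clone_as_helper(parent, new_id=4)
print(helper.mode.name, helper.pc == parent.pc)
```

The parent is left unchanged, mirroring the idea that the TLS thread keeps running while the clone only prefetches.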
Intricacies to be Handled • HT may not prefetch effectively! • Dealing with contention • HTs are much faster and can saturate bandwidth • Dealing with thread ordering • TLS imposes total thread order • A killed HT squashes TLS threads
Creating and Terminating HT • Create an HT on an L2 miss that we can value-predict (VP) • Use a memory-address-based confidence estimator • VP only if confident • Create an HT if we have a free processor • Only allow the most speculative thread to clone • Seamless integration of HT with TLS • BUT: if the parent is no longer the most speculative TLS thread, the HT has to be killed • Additionally kill the HT when: • Parent/HT thread finishes • HT causes an exception
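The creation and termination rules on this slide can be written down as two predicates. The Python sketch below is only an illustration: the class and function names are hypothetical, and the confidence estimator is assumed to be a table of saturating counters paired with last values and indexed by memory address, with made-up table size and threshold.

```python
class ConfidenceEstimator:
    """Address-indexed last-value table with saturating confidence counters.

    Table size, threshold and counter range are assumptions for illustration.
    """

    def __init__(self, entries=1024, threshold=2, max_count=3):
        self.entries = entries
        self.threshold = threshold
        self.max_count = max_count
        self.last_value = [None] * entries
        self.counter = [0] * entries

    def _index(self, address):
        return address % self.entries

    def is_confident(self, address):
        return self.counter[self._index(address)] >= self.threshold

    def update(self, address, actual_value):
        i = self._index(address)
        if self.last_value[i] == actual_value:
            # Same value as last time: raise confidence, saturating at max_count.
            self.counter[i] = min(self.counter[i] + 1, self.max_count)
        else:
            # Value changed: remember it and reset confidence.
            self.last_value[i] = actual_value
            self.counter[i] = 0


def should_create_ht(is_l2_miss, address, estimator,
                     free_processors, is_most_speculative):
    """All creation conditions from the slide must hold at once."""
    return (is_l2_miss
            and estimator.is_confident(address)
            and free_processors > 0
            and is_most_speculative)


def should_kill_ht(parent_is_most_speculative, parent_finished,
                   ht_finished, ht_raised_exception):
    """Any single termination condition from the slide is enough."""
    return (not parent_is_most_speculative
            or parent_finished
            or ht_finished
            or ht_raised_exception)


est = ConfidenceEstimator()
for _ in range(3):
    est.update(0x40, 7)   # the same value observed repeatedly builds confidence
print(should_create_ht(True, 0x40, est, free_processors=1,
                       is_most_speculative=True))
```

Keeping creation as a conjunction and termination as a disjunction matches the slide: an HT is only spawned when every condition holds, and is removed as soon as any one of them fails.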
Experimental Setup • Simulator, Compiler and Benchmarks: • SESC (http://sesc.sourceforge.net/) • POSH (Liu et al., PPoPP '06) • SPEC 2000 Int. • Architecture: • Four-way CMP, 4-issue cores • 16KB L1 Data (multi-versioned) and Instruction Caches • 1MB unified L2 Caches • Inst. window/ROB – 80/104 entries • 16KB Last Value Predictor
Comparing TLS, RunAhead and Unified Scheme • Almost additive benefits • 10.2% over TLS, 18.3% over RA
Understanding the extra ILP • Improvements of ILP come from: • Mainly memory • Branch prediction (improvement 0.5%) • Focus on memory: • Miss rate on committed path • Clustering of misses (different cost)
Normalized Shared Cache Misses • All schemes better than sequential • Unified 41% better than sequential
Isolated vs. Clustered Misses • TLS and RA both behave like large-window machines • Unified does even better
Also in the paper … • Dealing with the load of the system • Converting TLS threads to HT • Multiple HT • Effect of a better VP • Detailed comparison of performance model against existing models (Renau et al., ICS ’05)
Conclusions • CMPs are here to stay: • What about single-threaded apps and apps with significant sequential sections? • Different apps require different SM techniques • Even within an app, different phases do • We propose the first mixed execution model • TLS is nicely complemented by HT and RA • Our unified scheme outperforms existing SM models • TLS by 10.2% avg. (up to 41.2%) • RA by 18.3% avg. (up to 35.2%)