Dynamic Region Selection for Thread Level Speculation
Presented by: Jeff Da Silva, Stanley Fung, Martin Labrecque
Feb 6, 2004
Builds on research done by: Chris Colohan (CMU) and Greg Steffan
Multithreading on a Chip is here TODAY!
• Simultaneous Multithreading (ALPHA 21464, Intel Xeon, Pentium IV)
• Chip Multiprocessors (IBM Power4/5, SUN MAJC, UltraSPARC IV)
From desktops to supercomputers, threads of execution are everywhere. But what can we do with them?
[Figure: processors and caches on a single chip]
Improving Performance with a Chip Multiprocessor
With a bunch of independent applications, running one application per processor improves throughput (total work per second).
[Figure: several applications mapped onto separate processors and caches, overlapping in time]
Improving Performance with a Chip Multiprocessor
With a single application, we need parallel threads to reduce execution time.
[Figure: one application split into threads across the processors, shortening execution time]
Thread-Level Speculation: the Basic Idea
Run iterations as speculative threads to exploit the available thread-level parallelism; if a data dependence is violated (e.g. a speculative load through *q conflicts with an earlier store through *p), recover and re-execute.
[Figure: speculative threads shortening execution time, with one thread violated by *p and recovered]
Support for TLS: What Do We Need?
• Break programs into speculative threads
  • to maximize thread-level parallelism
• Track data dependences
  • to determine whether speculation was safe
• Recover from failed speculation
  • to ensure correct execution
These are the three key elements of every TLS system.
Support for TLS: What Do We Need?
• Lots of research has been done on TLS hardware
  • Tracking data dependences
  • Recovering from violations
• We focus on how to select regions to run in parallel
  • A region is any segment of code that you want to speculatively parallelize
  • For this work, region == loop, iterations == speculative threads
Why is static region selection hard?
• Extensive profiling information is needed
• Regions can be nested:

for ( i = 1 to N ) {        <= 2x faster in parallel
  ....
  for ( j = 1 to N ) {      <= 3x faster in parallel
    ....
    for ( k = 1 to N ) {    <= 4x faster in parallel
      ....
    }
  }
}

Which loop should we parallelize?
• Dynamic behaviour
Dynamic Region Selection is a potential solution
Dynamic Region Selection
• The compiler transforms all candidate regions into parallel and sequential versions
• Through dynamic profiling, we decide which regions are to be run in parallel
• Key Questions:
  • Is there any dynamic behaviour between region instances?
  • What is a good algorithm for selecting regions?
  • Are there performance trade-offs for doing dynamic profiling?
  • Is there any dynamic behaviour within region instances? (not the focus of this research)
Outline
• The role of the TLS compiler
• Characterizing dynamic behaviour
• Dynamic Region Selection (DRS) algorithms
• Results
• Conclusions
• Open questions and future work
Current Compilation for TLS
The compiler commits each candidate region to a single version at compile time.
[Figure: the same nest of regions (LoopA containing LoopB and LoopC/LoopD; LoopE containing LoopF; LoopG containing LoopH) shown twice: once compiled sequential, once compiled parallel]
DRS Compilation
[Figure: the same nest of regions, but the DRS binary carries both a sequential and a parallel version of every candidate region]
DRS Compilation (by Colohan)
1. Extract candidate region
2. Create sequential and parallel versions of the region (clone)
3. Add some extra overhead to monitor the region's performance
4. Introduce a DRS algorithm to make the decision at runtime
[Figure: the extracted region E cloned into sequential and parallel versions, with the DRS algorithm choosing between them]
Characterizing TLS Region Behaviour
[Figure: speedup over time for two behaviours: Constant (steady above 1x) and Periodic (oscillating around 1x)]
Characterizing TLS Region Behaviour
[Figure: speedup over time for two more behaviours: Continuous Improvement (rising past 1x) and Continuous Degradation (falling below 1x)]
DRS Algorithms
• Sample Twice
• Continuous Monitoring
• Continuous Resample
• Path Sensitive Sampling
Sample Twice Algorithm
[Figure: the Constant behaviour case]
• Effective if behaviour is constant.
• When a region is encountered:
  • 1st time: run sequential version and record execution time t1
  • 2nd time: run parallel version (if possible) and record execution time tp
  • Subsequent instances: if tp < t1 then run parallel version, else run sequential version
• Note that by using execution time as a metric, it is assumed that the amount of work done from instance to instance remains relatively constant. Using throughput (IPC) as a metric eliminates the need for this assumption but adds additional complexity.
Sample Twice Example
[Figure: regions progressing through the states Sample Sequential? → Sample Parallel? → Decided]
Continuous Monitoring
[Figure: the Continuous Improvement and Continuous Degradation cases]
• Effective if behaviour is continuously degrading.
• Extension of the sample twice method: continuously monitor all regions and reevaluate the decision if the speedup changes.
• Since we are not doing much more besides monitoring continuously, the overhead is essentially free.
• When a region is encountered:
  • 1st time: run sequential version and record execution time t1
  • 2nd time: run parallel version (if possible) and record execution time tp
  • Subsequent instances: if tp < t1 then run parallel version and update tp, else run sequential version and update t1
Continuous Monitoring Example
[Figure: regions progressing through Sample Sequential? → Sample Parallel? → Decided, with t1 and tp re-recorded at each instance (e.g. t1 = 5, tp = 3, then tp = 4, then tp = 6)]
Continuous Resample
[Figure: the Continuous Improvement and Continuous Degradation cases]
• Effective if behaviour is continuously changing.
• Continuously resample by periodically flushing the recorded values t1 and tp.
• Adds new overhead.
• This algorithm has not yet been explored.
Path Sensitive Sampling
[Figure: the Periodic behaviour case]
• If the behaviour is periodic, a means of filtering is required.
• One intuitive solution is to sample when the invocation path or region nesting path changes.
Path Sensitive Sampling
• Sample when the region nesting path changes
• Makes the assumption that state stays the same if the invocation path does not change

void foo() { while (cond) moo(); }
void bar() { while (cond) moo(); }
void moo() { while (cond) moo(); }

[Figure: periodic speedup of the loop in moo(), with phases labelled foo_while, bar_while, and moo_while according to the invocation path]
Results – Static analysis
[Figure: average number of per-path instances for all regions]
Interesting Region in IJPEG
[Figure: number of speculative threads per region instance, over program execution]
Interesting Region in Perl
[Figure: number of instructions per region instance, over program execution]
Experimental Framework
• SPEC benchmarks
• TLS compiler
• MIPS architecture
• TLS profiler and simulator
Outline
• The role of the TLS compiler
• Characterizing dynamic behaviour
• Dynamic Region Selection (DRS) algorithms
• Results
• Conclusions
• Open questions and future work
Results – Dynamic behavior Regions with high coverage have low instruction variance between instances
Results – Dynamic behavior Regions with high coverage have low violation variance between instances
Results – Dynamic behavior Regions with high coverage have low speculative thread count variance between instances
[Figure: performance relative to the static 'optimal' selection (slower to the left, faster to the right)]
• Continuous monitoring is 1% better on average than sample twice
• About 10% worse than static 'optimal' selection
[Figure: per-benchmark agreement with the static 'optimal' selection]
• Sample twice agrees 57% of the time, on average
• Continuous monitoring agrees 43% of the time, on average
• Levels of agreement are close: no dynamic behavior?
Agreeing with static ‘optimal’ gives better performance? Another sign of no dynamic behaviour?
Sample twice often leaves regions undecided Overall, undecided regions represent low coverage
Outline
• The role of the TLS compiler
• Characterizing dynamic behaviour
• Dynamic Region Selection (DRS) algorithms
• Results
• Conclusions
• Open questions and future work
Conclusions
• This is an unexplored research topic (as far as we know)
• Is there any dynamic behavior between region instances?
  • We have good indications that there isn't tons of it
• What is the best algorithm for selecting regions?
  • Continuous monitoring does 1% better than sample twice
  • Within 10% of the static 'optimal' without any sampling done!
• Any performance trade-offs for doing dynamic profiling?
  • The code size is increased by at most 30%
  • The runtime performance overhead is believed to be negligible
• Is there any dynamic behavior within a region instance?
  • We don't know yet
Open Questions
• The dynamic optimal is the theoretical optimal
  • How close are we to the dynamic optimal?
  • How close is the static 'optimal' to the dynamic optimal?
• How do the other proposed algorithms perform?
• What should be implemented in hardware/software?
Results – Potential Study
[Figure: execution time versus invocation (IJPEG)]
Results – Potential Study
[Figure: execution time versus invocation (CRAFTY)]
Results – Potential Study
[Figure: execution time versus invocation (LI)]