Dynamic Region Selection for Thread Level Speculation
Presented by: Jeff Da Silva, Stanley Fung, Martin Labrecque
Feb 6, 2004
Builds on research done by: Chris Colohan (CMU) and Greg Steffan
Multithreading on a Chip is here TODAY!
• Simultaneous Multithreading (ALPHA 21464, Intel Xeon, Pentium IV)
• Chip Multiprocessors (IBM Power4/5, SUN MAJC, UltraSPARC IV)
From desktops to supercomputers, threads of execution are everywhere. But what can we do with them?
[Figure: processors and caches on a single chip]
Improving Performance with a Chip Multiprocessor
With a bunch of independent applications, running one application per processor improves throughput (total work per second).
[Figure: several applications mapped onto separate processors and caches, overlapping in time]
Improving Performance with a Chip Multiprocessor
With a single application, we need parallel threads to reduce execution time.
[Figure: one application split into threads across the processors, shortening execution time]
Thread-Level Speculation: the Basic Idea
Run iterations as speculative threads to exploit the available thread-level parallelism; if a data dependence is violated (e.g. a speculative load through *q conflicts with an earlier store through *p), recover and re-execute.
[Figure: speculative threads shortening execution time, with one thread violated by *p and recovered]
Support for TLS: What Do We Need?
• Break programs into speculative threads
  • to maximize thread-level parallelism
• Track data dependences
  • to determine whether speculation was safe
• Recover from failed speculation
  • to ensure correct execution
These are the three key elements of every TLS system.
Support for TLS: What Do We Need?
• Lots of research has been done on TLS hardware
  • Tracking data dependences
  • Recovering from violations
• We focus on how to select regions to run in parallel
  • A region is any segment of code that you want to speculatively parallelize
  • For this work, region == loop, iterations == speculative threads
Why is static region selection hard?
• Extensive profiling information is needed
• Regions can be nested:

for ( i = 1 to N ) {        <= 2x faster in parallel
  ....
  for ( j = 1 to N ) {      <= 3x faster in parallel
    ....
    for ( k = 1 to N ) {    <= 4x faster in parallel
      ....
    }
  }
}

Which loop should we parallelize?
• Dynamic behaviour
Dynamic Region Selection is a potential solution
Dynamic Region Selection
• The compiler transforms all candidate regions into parallel and sequential versions
• Through dynamic profiling, we decide which regions are to be run in parallel
• Key Questions:
  • Is there any dynamic behaviour between region instances?
  • What is a good algorithm for selecting regions?
  • Are there performance trade-offs for doing dynamic profiling?
  • Is there any dynamic behaviour within region instances? (not the focus of this research)
Outline
• The role of the TLS compiler
• Characterizing dynamic behaviour
• Dynamic Region Selection (DRS) algorithms
• Results
• Conclusions
• Open questions and future work
Current Compilation for TLS
The compiler commits each candidate region to a single version at compile time.
[Figure: the same nest of regions (LoopA containing LoopB and LoopC/LoopD; LoopE containing LoopF; LoopG containing LoopH) shown twice: once compiled sequential, once compiled parallel]
DRS Compilation
[Figure: the same nest of regions, but the DRS binary carries both a sequential and a parallel version of every candidate region]
DRS Compilation (by Colohan)
1. Extract candidate region
2. Create sequential and parallel versions of the region (clone)
3. Add some extra overhead to monitor the region's performance
4. Introduce a DRS algorithm to make the decision at runtime
[Figure: the extracted region E cloned into sequential and parallel versions, with the DRS algorithm choosing between them]
Characterizing TLS Region Behaviour
[Figure: speedup over time for two behaviours: Constant (steady above 1x) and Periodic (oscillating around 1x)]
Characterizing TLS Region Behaviour
[Figure: speedup over time for two more behaviours: Continuous Improvement (rising past 1x) and Continuous Degradation (falling below 1x)]
DRS Algorithms
• Sample Twice
• Continuous Monitoring
• Continuous Resample
• Path Sensitive Sampling
Sample Twice Algorithm
[Figure: the Constant behaviour case]
• Effective if behaviour is constant.
• When a region is encountered:
  • 1st time: run sequential version and record execution time t1
  • 2nd time: run parallel version (if possible) and record execution time tp
  • Subsequent instances: if tp < t1 then run parallel version, else run sequential version
• Note that by using execution time as a metric, it is assumed that the amount of work done from instance to instance remains relatively constant. Using throughput (IPC) as a metric eliminates the need for this assumption but adds additional complexity.
Sample Twice Example
[Figure: regions progressing through the states Sample Sequential? → Sample Parallel? → Decided]
Continuous Monitoring
[Figure: the Continuous Improvement and Continuous Degradation cases]
• Effective if behaviour is continuously degrading.
• Extension of the sample twice method: continuously monitor all regions and reevaluate the decision if the speedup changes.
• Since we are not doing much more besides monitoring continuously, the overhead is essentially free.
• When a region is encountered:
  • 1st time: run sequential version and record execution time t1
  • 2nd time: run parallel version (if possible) and record execution time tp
  • Subsequent instances: if tp < t1 then run parallel version and update tp, else run sequential version and update t1
Continuous Monitoring Example
[Figure: regions progressing through Sample Sequential? → Sample Parallel? → Decided, with t1 and tp re-recorded at each instance (e.g. t1 = 5, tp = 3, then tp = 4, then tp = 6)]
Continuous Resample
[Figure: the Continuous Improvement and Continuous Degradation cases]
• Effective if behaviour is continuously changing.
• Continuously resample by periodically flushing the recorded values t1 and tp.
• Adds new overhead.
• This algorithm has not yet been explored.
Path Sensitive Sampling
[Figure: the Periodic behaviour case]
• If the behaviour is periodic, a means of filtering is required.
• One intuitive solution is to sample when the invocation path or region nesting path changes.
Path Sensitive Sampling
• Sample when the region nesting path changes
• Makes the assumption that state stays the same if the invocation path does not change

void foo() { while (cond) moo(); }
void bar() { while (cond) moo(); }
void moo() { while (cond) moo(); }

[Figure: periodic speedup of the loop in moo(), with phases labelled foo_while, bar_while, and moo_while according to the invocation path]
Results – Static analysis
[Figure: average number of per-path instances for all regions]
Interesting Region in IJPEG
[Figure: number of speculative threads per region instance, over program execution]
Interesting Region in Perl
[Figure: number of instructions per region instance, over program execution]
Experimental Framework
• SPEC benchmarks
• TLS compiler
• MIPS architecture
• TLS profiler and simulator
Outline
• The role of the TLS compiler
• Characterizing dynamic behaviour
• Dynamic Region Selection (DRS) algorithms
• Results
• Conclusions
• Open questions and future work
Results – Dynamic behavior Regions with high coverage have low instruction variance between instances
Results – Dynamic behavior Regions with high coverage have low violation variance between instances
Results – Dynamic behavior Regions with high coverage have low speculative thread count variance between instances
[Figure: performance relative to the static 'optimal' selection (slower to the left, faster to the right)]
• Continuous monitoring is 1% better on average than sample twice
• About 10% worse than static 'optimal' selection
[Figure: per-benchmark agreement with the static 'optimal' selection]
• Sample twice agrees 57% of the time, on average
• Continuous monitoring agrees 43% of the time, on average
• Levels of agreement are close: no dynamic behavior?
Agreeing with static ‘optimal’ gives better performance? Another sign of no dynamic behaviour?
Sample twice often leaves regions undecided Overall, undecided regions represent low coverage
Outline
• The role of the TLS compiler
• Characterizing dynamic behaviour
• Dynamic Region Selection (DRS) algorithms
• Results
• Conclusions
• Open questions and future work
Conclusions
• This is an unexplored research topic (as far as we know)
• Is there any dynamic behavior between region instances?
  • We have good indications that there isn't tons of it
• What is the best algorithm for selecting regions?
  • Continuous monitoring does 1% better than sample twice
  • Within 10% of the static 'optimal' without any sampling done!
• Any performance trade-offs for doing dynamic profiling?
  • The code size is increased by at most 30%
  • The runtime performance overhead is believed to be negligible
• Is there any dynamic behavior within a region instance?
  • We don't know yet
Open Questions
• The dynamic optimal is the theoretical optimal
  • How close are we to the dynamic optimal?
  • How close is the static 'optimal' to the dynamic optimal?
• How do the other proposed algorithms perform?
• What should be implemented in hardware/software?
Results – Potential Study
[Figure: execution time versus invocation (IJPEG)]
Results – Potential Study
[Figure: execution time versus invocation (CRAFTY)]
Results – Potential Study
[Figure: execution time versus invocation (LI)]