Value Prediction: Are(n’t) We Done Yet? Mikko Lipasti, University of Wisconsin-Madison
Definition • What is value prediction? Broadly, three salient attributes: • Generate a speculative value (predict) • Consume speculative value (execute) • Verify speculative value (compare/recover) • This subsumes branch prediction • Focus here on operand values Keynote, 2nd Value Prediction Workshop, Oct. 10, 2004
Some History • “Classical” value prediction • Independently invented by 4 groups in 1995-1996 • AMD (Nexgen): L. Widigen and E. Sowadsky, patent filed March 1996, inv. March 1995 • Technion: F. Gabbay and A. Mendelson, inv. sometime 1995, TR 11/96, US patent Sep 1997 • CMU: M. Lipasti, C. Wilkerson, J. Shen, inv. Oct. 1995, ASPLOS paper submitted March 1996 • Wisconsin: Y. Sazeides, J. Smith, Summer 1996
Why? • Possible explanations: • Natural evolution from branch prediction • Natural evolution from memoization • Natural evolution from rampant speculation • Cache hit speculation • Memory independence speculation • Speculative address generation • Improvements in tracing/simulation technology • “There’s a lot of zeroes out there.” (C. Wilkerson) • Values, not just instructions & addresses • TRIP6000 [A. Martin-de-Nicolas, IBM]
Publications by Year • Excludes journals, workshops, compiler conferences
What Happened? • Tremendous academic interest • Dozens of research groups, papers, proposals • No industry uptake • No present or planned CPU with value prediction • Why? • Meager performance benefit (< 10%) • Power consumption • Dynamic power for extra activity • Static power (area) for prediction tables • Complexity and correctness • Subtle memory ordering issues [MICRO ’01] • Misprediction recovery [HPCA ’04]
Performance? • Relationship between timely fetch and value prediction benefit [Gabbay, ISCA] • Value prediction doesn’t help when the result can be computed before the consumer instruction is fetched • High-bandwidth fetch helps • Wide trace caches studied in late 1990s • But these have several negative attributes • Recent designs focus on frequency, not ILP • High-bandwidth fetch is a red herring • More important to fetch the right instructions
Future Adoption? • Classical value prediction will only make it in the context of a very different microarchitecture • One that explicitly and aggressively exposes ILP • Promising trends • Deep pipelining craze appears to be over • Can’t manage the design complexity • High frequency mania appears to be over • Can’t afford the power • Architects are pursuing ILP once again • Value prediction has another opportunity
What Value Prediction Begat • Value prediction catalyzed a new focus on values in computation • This had not been studied before • A whole new realm of research: Value-Aware Microarchitecture • Spans numerous subdisciplines • Significant industrial impact already • Also, developments in supporting technologies
Value-Aware Microarchitecture • Memory Hierarchy • Register File Compression [several] • Cache Compression [Gupta, Alameldeen] • Memory Compression [e.g. IBM MXT] • Bandwidth compression • Address and data bus encoding [Rudolph] • Initialization Traffic [Lewis] • Load/Store Processing • Load value prediction [numerous] • Fast address calculation [Austin] • Value-aware alias prediction [Onder] • Memory consistency [Cain]
Value-Aware Microarchitecture • Execution Core • Value Prediction • Operand Significance • Low Power [Canal] • Execution bandwidth [Loh] • Bit-slicing [Pentium 4, Mestan] • Instruction reuse [Sodani] • Carry prediction [Circuit-level Speculation] • Cache Coherence • Producer-side • Silent stores, temporally silent stores [Lepak] • Speculative lock elision [Rajwar; Wisc, UIUC] • Consumer side • Load value prediction using stale lines [Lepak] • “Coherence decoupling” [Burger, Sohi; ASPLOS 04]
Supporting Technologies • Value prediction presented some unique challenges: • Relatively low correct prediction rate (initially 40-50%) • Nontrivial misprediction rate with avoidable misprediction cost • These drove study of: • Confidence prediction/estimation • First microarchitectural application of confidence estimation, though not widely credited or cited as such • Since studied for numerous applications, e.g. gating control speculation • Selective recovery [Sazeides Ph.D., Kim HPCA ’04] • Numerous challenges in extending recovery to entire window • Both have proved to be fruitful research areas • Also stimulated development of software technology: • Value profiling • Value-based compiler optimizations • Run-time code specialization
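To make the idea of confidence estimation concrete, here is a minimal sketch of a value predictor that only speculates when a saturating counter says the prediction is likely to be right. The last-value scheme, the counter width, the threshold, and all names (`LastValuePredictor`, `Entry`) are illustrative assumptions, not details taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    value: int = 0        # last value produced by this instruction
    confidence: int = 0   # saturating counter, 0..MAX_CONF


class LastValuePredictor:
    """Hypothetical last-value predictor gated by a confidence counter."""

    MAX_CONF = 3    # assumed 2-bit counter
    THRESHOLD = 2   # assumed: predict only at or above this confidence

    def __init__(self):
        self.table = {}  # maps instruction address (pc) -> Entry

    def predict(self, pc):
        """Return a speculative value, or None when confidence is too low."""
        entry = self.table.get(pc)
        if entry is not None and entry.confidence >= self.THRESHOLD:
            return entry.value
        return None

    def update(self, pc, actual):
        """Train on the verified result of the instruction at pc."""
        entry = self.table.setdefault(pc, Entry())
        if entry.value == actual:
            entry.confidence = min(entry.confidence + 1, self.MAX_CONF)
        else:
            entry.value = actual
            entry.confidence = 0
```

Withholding low-confidence predictions trades coverage for fewer mispredictions, which is the "avoidable misprediction cost" the slide refers to.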
Outline • Some History • Industry Trends • Value-Aware Microarchitecture • Case study: Memory Consistency [Trey Cain, ISCA 2004] • Conventional load queue microarchitecture • Value-based memory ordering • Replay-reduction heuristics • Performance evaluation • Conclusions
Value-based Memory Consistency • High ILP => Large instruction windows • Larger physical register file • Larger scheduler • Larger load/store queues • Result in increased access latency • Value-based Replay • If load queue scalability a problem…who needs one! • Instead, re-execute load instructions a 2nd time in program order • Filter replays: heuristics reduce extra cache bandwidth to 3.5% on average
Enforcing RAW dependences • Load queue contains load addresses • Memory independence speculation • Hoist load above unknown store assuming it is to a different address • Check correctness at store retirement • One search per store address calculation • If address matches, the load is squashed • Example, in program order (execution order in parentheses): store A (1), store ? (3), load A (2)
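The associative search described above can be sketched as follows. This is a simplified software model under stated assumptions: the names `LoadEntry` and `loads_to_squash` are hypothetical, ages are program-order sequence numbers, and a real load queue performs this comparison in parallel with a CAM rather than a loop.

```python
from dataclasses import dataclass


@dataclass
class LoadEntry:
    age: int       # program-order sequence number (larger = younger)
    address: int   # resolved load address
    issued: bool   # True once the load has read the cache


def loads_to_squash(load_queue, store_age, store_address):
    """Search the load queue when a store's address becomes known.

    A load is squashed if it is younger than the store, has already
    issued, and accessed the same address (the speculation was wrong).
    """
    return [
        ld.age
        for ld in load_queue
        if ld.issued and ld.age > store_age and ld.address == store_address
    ]
```

For the slide's example, the load to A (third in program order) issues before the second store's address is known; if that store later resolves to A, the search returns the load's age and it is squashed.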
Enforcing memory consistency • Example, in program order (execution order in parentheses): Processor p1: 1. load A (3), 2. load A (1); Processor p2: 1. store A (2); the figure labels the edges between the processors “raw” and “war” • Two approaches • Snooping: Search per incoming invalidate • Insulated: Search per load address calculation
Load queue implementation • [Figure: load queue built from an address CAM, a load meta-data RAM, and queue management logic; inputs are external request, external address, store address, store age, load address, and load age; output is squash determination] • # of write ports = load address calc width • # of read ports = load+store address calc width (+ 1) • Current generation designs (32-48 entries, 2 write ports, 2 (3) read ports)
Load queue scaling • Larger instruction window => larger load queue • Increases access latency • Increases energy consumption • Wider issue width => more read/write ports • Also increases latency and energy
Related work: MICRO 2003 • Park et al., Purdue • Extra structure dedicated to enforcing memory consistency • Increase capacity through segmentation • Sethumadhavan et al., UT-Austin • Add set of filters summarizing contents of load queue
Keep it simple… • Throw more hardware at the problem? • Need to design/implement/verify • Execution core is already complicated • Load queue checks for rare errors • Why not move error checking away from exe?
Value-based Consistency … • Replay: access the cache a second time, cheaply! • Almost always cache hit • Reuse address calculation and translation • Share cache port used by stores in commit stage • Compare: compares new value to original value • Squash if the values differ • This is value prediction! • Predict: access cache prematurely • Execute: as usual • Verify: replay load, compare value, recover if necessary • [Figure: pipeline diagram with stages labeled IF1, IF2, D, R, Q, S, EX, REP, C, CMP, WB]
Rules of replay • All prior stores must have written data to the cache • No store-to-load forwarding • Loads must replay in program order • If a load is squashed, it should not be replayed a second time • Ensures forward progress
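The predict/replay/compare flow can be illustrated with a small sketch. It models the cache as a Python dictionary and assumes, as the rules of replay require, that all prior stores have already written the cache before the replay; the function names are hypothetical.

```python
def execute_load(cache, address):
    """Premature (speculative) load: read whatever the cache holds now."""
    return cache.get(address, 0)


def replay_load(cache, address, premature_value):
    """Replay at commit: re-read the cache and compare against the
    value obtained by the premature access.

    Returns (ok, replay_value); ok is False when the load and its
    dependents must be squashed.
    """
    replay_value = cache.get(address, 0)
    return replay_value == premature_value, replay_value


# Usage sketch: a load runs early, then an older store commits.
cache = {0x100: 5}
early = execute_load(cache, 0x100)     # load hoisted above the store
cache[0x100] = 9                       # older store writes a new value
ok, value = replay_load(cache, 0x100, early)   # values differ: squash

cache[0x100] = 9                       # a silent store rewrites the same value
ok_silent, _ = replay_load(cache, 0x100, 9)    # values match: no squash
```

Because only values are compared, a reordering that happens to produce the same value is not treated as a violation.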
Replay reduction • Replay costs • Consumes cache bandwidth (and power) • Increases reorder buffer occupancy • Can we avoid these penalties? • Infer correctness of certain operations • Four replay filters • These are used to avoid checking our value prediction when in fact no value prediction occurred (loaded value is known to be correct) • Similar to “constant prediction” in initial work
No-Reorder filter • Avoid replay if load isn’t reordered wrt other memory operations • Can we do better?
Enforcing single-thread RAW dependencies • No-Unresolved Store Address Filter • Load instruction i is replayed if there are prior stores with unresolved addresses when i issues • Works for intra-processor RAW dependences • Doesn’t enforce memory consistency
Enforcing MP consistency • No-Recent-Miss Filter • Avoid replay if there have been no cache line fills (to any address) while load in instruction window • No-Recent-Snoop Filter • Avoid replay if there have been no external invalidates (to any address) while load in instruction window
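One way to read the filters above is as a per-load decision made at commit. The sketch below pairs the No-Unresolved Store Address filter with one of the two multiprocessor filters; the field names, the `mp_filter` parameter, and the way the flags are tracked are assumptions for illustration, not the hardware design.

```python
from dataclasses import dataclass


@dataclass
class LoadState:
    # True if an older store had an unresolved address when the load issued
    issued_past_unresolved_store: bool
    # True if any cache line fill occurred while the load was in the window
    line_fill_while_in_window: bool
    # True if any external invalidate arrived while the load was in the window
    external_invalidate_while_in_window: bool


def must_replay(load, mp_filter="no-recent-snoop"):
    """Decide whether a committing load needs to be replayed."""
    # Single-thread RAW check (No-Unresolved Store Address filter)
    if load.issued_past_unresolved_store:
        return True
    # Multiprocessor consistency check: one of the two MP filters
    if mp_filter == "no-recent-miss":
        return load.line_fill_while_in_window
    if mp_filter == "no-recent-snoop":
        return load.external_invalidate_while_in_window
    raise ValueError(f"unknown filter: {mp_filter}")
```

A load that saw a line fill but no external invalidate is replayed under the no-recent-miss filter and skipped under the no-recent-snoop filter.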
Constraint graph • Defined for sequential consistency by Landin et al., ISCA-18 • Directed graph represents a multithreaded execution • Nodes represent dynamic instruction instances • Edges represent their transitive orders (program order, RAW, WAW, WAR) • If the constraint graph is acyclic, then the execution is correct
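The acyclicity test can be sketched with a standard depth-first search. The edge lists below describe a hypothetical four-instruction execution chosen for illustration (one processor stores A then B, the other loads B then A); the node names and the specific edges are assumptions, not a transcription of the talk's figure.

```python
def has_cycle(edges):
    """Return True if the directed graph given as (src, dst) pairs has a cycle."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the DFS stack, finished
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GREY
        for nxt in graph[node]:
            if color[nxt] == GREY:          # back edge: cycle found
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)


# Load of B sees the new value, load of A sees the old one: a cycle.
violating = [
    ("ST A", "ST B"),   # program order
    ("ST B", "LD B"),   # RAW: the load reads the stored value
    ("LD B", "LD A"),   # program order
    ("LD A", "ST A"),   # WAR: the load read the value the store overwrote
]

# Same instructions, but the load of A also sees the new value: acyclic.
legal = [
    ("ST A", "ST B"),
    ("ST B", "LD B"),
    ("LD B", "LD A"),
    ("ST A", "LD A"),   # RAW
]
```

Here `has_cycle(violating)` is true and `has_cycle(legal)` is false, matching the rule that only acyclic executions are correct under sequential consistency.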
Constraint graph example - SC • [Figure: constraint graph for Proc 1 and Proc 2 with nodes ST A, ST B, LD B, and LD A, connected by program-order, RAW, and WAR edges; execution order numbered 1-4] • Cycle indicates that execution is incorrect
Anatomy of a cycle • [Figure: the same constraint graph (Proc 1 and Proc 2; nodes ST A, ST B, LD B, LD A; program-order, WAR, and RAW edges), annotated with the events “incoming invalidate” and “cache miss”]
Filter Summary • From conservative to aggressive: • Replay all committed loads • No-Reorder Filter • No-Unresolved Store / No-Recent-Miss Filter • No-Unresolved Store / No-Recent-Snoop Filter
Outline • Some History • Industry Trends • Value-Aware Microarchitecture • Case study: Memory Consistency [Cain, ISCA] • Conventional load queue microarchitecture • Value-based memory ordering • Replay-reduction heuristics • Performance evaluation • Conclusions
Base machine model
% L1 DCache bandwidth increase • [Chart: benchmark groups SPECint2000, SPECfp2000, commercial, multiprocessor; series (a) replay all, (b) no-reorder filter, (c) no-recent-miss filter, (d) no-recent-snoop filter] • On average, 3.4% bandwidth overhead using no-recent-snoop filter
Value-based replay performance (relative to constrained load queue) • [Chart: benchmark groups SPECint2000, SPECfp2000, commercial, multiprocessor] • Value-based replay 8% faster on avg than baseline using 16-entry ld queue
Does value locality help? • Not much… • Value locality does avoid memory ordering violations • 59% of single-thread violations avoided • 95% of consistency violations avoided • But these violations rarely occur • ~1 single-thread violation per 100 million instr • 4 consistency violations per 10,000 instr
What About Power? • Simple power model: $\Delta\mathit{Energy} = \#\mathit{replays} \times (E_{\text{per cache access}} + E_{\text{per word comparison}}) + \mathit{replay\ overhead} - (E_{\text{per ldq search}} \times \#\mathit{ldq\ searches})$ • Empirically: 0.02 replay loads per committed instruction • If load queue CAM energy/insn > 0.02 × energy expenditure of a cache access and comparison: • value-based implementation saves power!
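The power model is simple enough to evaluate directly. The sketch below implements the energy difference and the per-instruction break-even test from the slide; the function names are hypothetical, and any energy figures passed in are placeholders in arbitrary units, not measurements from the talk.

```python
def delta_energy(num_replays, e_cache_access, e_word_compare,
                 replay_overhead, e_ldq_search, num_ldq_searches):
    """Energy of value-based replay minus energy of load queue searches.

    A negative result means the value-based scheme uses less energy.
    """
    replay_energy = num_replays * (e_cache_access + e_word_compare)
    ldq_energy = e_ldq_search * num_ldq_searches
    return replay_energy + replay_overhead - ldq_energy


def saves_power(e_ldq_per_insn, e_cache_access, e_word_compare,
                replays_per_insn=0.02):
    """Break-even test per committed instruction, ignoring replay overhead.

    The 0.02 default is the empirical replay rate quoted on the slide.
    """
    return e_ldq_per_insn > replays_per_insn * (e_cache_access + e_word_compare)
```

With the quoted rate of 0.02 replays per instruction, the load queue CAM only needs to cost more than 2% of a cache access plus comparison, per instruction, for the value-based scheme to come out ahead.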
Value-based replay Pros/Cons • Eliminates associative lookup hardware • Load queue becomes simple FIFO • Negligible IPC or L1D bandwidth impact • Can be used to fix value prediction • Enforces dependence order consistency constraint [MICRO ’01] • Requires additional pipeline stages • Requires additional cache datapath for loads
Conclusions • Value prediction • Continues to generate lots of academic interest • Little industry uptake so far • Historical trends (narrow deep pipelines) minimized benefit • Sea-change underway on this front • Value prediction will be revisited in quest for ILP • Power consumption is key! • Value-Aware Microarchitecture • Multiple fertile areas of research • Some has found its way into products • Are we done yet? No! • Questions?
Backups
Caveat: Memory Dependence Prediction • Some predictors train using the conflicting store • (e.g. store-set predictor) • Replay mechanism is unable to pinpoint conflicting store • Fair comparison: • Baseline machine: store-set predictor w/ 4k entry SSIT and 128 entry LFST • Experimental machine: Simple 21264-style dependence predictor w/ 4k entry history table
Load queue search energy • Based on 0.09 micron process technology using Cacti v. 3.2
Load queue search latency • Based on 0.09 micron process technology using Cacti v. 3.2
Benchmarks • MP (16-way) • Commercial workloads (SPECweb, TPC-H) • SPLASH2 scientific application (ocean) • Error bars signify 95% statistical confidence • UP • 3 from SPECfp2000 • Selected due to high reorder buffer utilization • apsi, art, wupwise • 3 commercial • SPECjbb2000, TPC-B, TPC-H • A few from SPECint2000
Life cycle of a load • [Figure: out-of-order execution window holding loads and stores, most with unresolved addresses (ST ?, LD ?), in which LD A and ST A conflict (“Blam!”), shown alongside the load queue holding LD ? and LD A]
Performance relative to unconstrained load queue • Good news: Replay w/ no-recent-snoop filter only 1% slower on average
Reorder-Buffer Utilization
Why focus on load queue? • Load queue has different constraints than store queue • More loads than stores (30% vs 14% of dynamic instructions) • Load queue searched more frequently (consuming more power) • Store-forwarding logic performance critical • Many non-scalable structures in OoO processor • Scheduler • Physical register file • Register map
Prior work: formal memory model representations • Local, WRT, global “performance” of memory ops (Dubois et al., ISCA-13) • Acyclic graph representation (Landin et al., ISCA-18) • Modeling memory operation as a series of sub-operations (Collier, RAPA) • Acyclic graph + sub-operations (Adve, thesis) • Initiation event, for modeling early store-to-load forwarding (Gharachorloo, thesis)
Some History
From: Larry.Widigen@amd.com (Larry Widigen)
Date: Wed, 14 Aug 96 10:33:12 PDT
To: Mikko_H._Lipasti@cmu.edu
Subject: www location of paper
I would like to review your forthcoming paper, "Value Locality and Load Value Prediction." Could you provide a www address where it resides? I am curious as to its contents since its title suggests that it may discuss an area where I have done some work.
Cordially, Larry Widigen, Manager of Processor Development