A Scalable Front-End Architecture for Fast Instruction Delivery Paper by: Glenn Reinman, Todd Austin and Brad Calder Presenter: Alexander Choong
Conventional Pipeline Architecture • High-performance processors can be broken down into two parts • Front-end: fetches and decodes instructions • Execution core: executes instructions
Front-End and Pipeline • [Diagram: a simple front-end pipeline, where each Fetch is followed by a Decode stage]
Front-End with Prediction • [Diagram: the same pipeline with a Predict stage accompanying each Fetch before Decode]
Front-End Issues I • Flynn’s bottleneck: • IPC is bounded by the number of instructions fetched per cycle • Implies: as execution performance increases, the front-end must keep up to ensure overall performance
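As a back-of-the-envelope illustration (ours, not the paper’s): sustained IPC ≤ min(fetch width, issue width), so the baseline machine in the methodology section, which fetches 8 but issues 16 instructions per cycle, can never sustain more than 8 IPC however capable its execution core is.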
Front-End Issues II • Two opposing forces • Designing a faster front-end argues for a larger I-cache • But the interconnect scaling problem (wire performance does not scale with feature size) argues for a smaller I-cache
Key Contributions: Fetch Target Queue • Objective • Avoid using a large cache for branch prediction • Purpose • Decouple the I-cache from branch prediction • Result • Improves throughput
Key Contributions: Fetch Target Buffer • Objective • Avoid large caches for branch prediction • Implementation • A multi-level buffer • Results • Delivers performance 25% better than a single-level design • Scales better with “future” feature sizes
Outline • Scalable Front-End and Components • Fetch Target Queue • Fetch Target Buffer • Experimental Methodology • Results • Analysis and Conclusion
Fetch Target Queue • Decouples I-cache from branch prediction • Branch predictor can generate predictions independent of when the I-cache uses them • [Diagram: without the FTQ, Fetch and Predict proceed in lockstep; with the FTQ, Predict runs ahead and queues targets that Fetch consumes later]
Fetch Target Queue • Fetch and predict can have different latencies • Allows the I-cache to be pipelined • As long as the two sides have the same throughput
Fetch Blocks • The FTQ stores fetch blocks • A fetch block is a sequence of instructions • Starting at a branch target • Ending at a strongly biased branch • Instructions are fed directly into the pipeline
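To make the decoupling concrete, here is a minimal software sketch (ours, not from the paper; all names are hypothetical) of an FTQ as a ring buffer of fetch-block descriptors. The predictor enqueues and the I-cache dequeues independently, which is what lets the two sides run at different latencies as long as throughput matches.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical fetch-block descriptor: a start PC plus enough
 * information for the I-cache to fetch the whole block. */
typedef struct {
    uint32_t start_pc;    /* branch target that begins the block  */
    uint8_t  num_insts;   /* instructions up to the ending branch */
} FetchBlock;

#define FTQ_SIZE 8        /* entries; a power of two simplifies wrap */

typedef struct {
    FetchBlock entry[FTQ_SIZE];
    unsigned   head, tail;   /* dequeue / enqueue indices */
} FTQ;

static bool ftq_full(const FTQ *q)  { return q->tail - q->head == FTQ_SIZE; }
static bool ftq_empty(const FTQ *q) { return q->tail == q->head; }

/* Predictor side: runs whenever a prediction is ready. */
static bool ftq_enqueue(FTQ *q, FetchBlock fb) {
    if (ftq_full(q)) return false;        /* predictor stalls */
    q->entry[q->tail++ % FTQ_SIZE] = fb;
    return true;
}

/* I-cache side: runs whenever a fetch port is free. */
static bool ftq_dequeue(FTQ *q, FetchBlock *out) {
    if (ftq_empty(q)) return false;       /* fetch stalls */
    *out = q->entry[q->head++ % FTQ_SIZE];
    return true;
}
```

On a misprediction, resetting both indices would squash every queued block.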
Outline • Scalable Front-End and Components • Fetch Target Queue • Fetch Target Buffer • Experimental Methodology • Results • Analysis and Conclusion
Fetch Target Buffer:Outline • Review: Branch Target Buffer • Fetch Target Buffer • Fetch Blocks • Functionality
Review: Branch Target Buffer I • Previous work (Perleberg and Smith [2]) • Makes fetch independent of predict • [Diagram: in a simple front-end, Fetch and Predict are serialized; with a branch target buffer they proceed in parallel]
Review: Branch Target Buffer II • Characteristics • Hash table • Makes predictions • Caches prediction information
FTB Optimizations over BTB • Multi-level structure • Solves a conundrum: • The cache must be small to be fast • Yet needs enough entries to predict branches successfully
FTB Optimizations over BTB • Oversize bit • Indicates whether a block is larger than a cache line • With a multi-ported cache • Allows several smaller blocks to be loaded at the same time
FTB Optimizations over BTB • Only stores a partial fall-through address • The fall-through address is close to the current PC • So only an offset needs to be stored
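A sketch of the offset trick under assumed encoding details (the paper only says a partial fall-through address is stored): because the fall-through PC lies only a few instructions past the block start, the FTB can keep a small instruction-count offset and rebuild the full address on a hit.

```c
#include <stdint.h>

#define INST_BYTES  4   /* assumes fixed-size instructions          */
#define OFFSET_BITS 5   /* 4-5 bits suffice per the results section */

/* Compress: keep only the instruction distance from block start to
 * the fall-through point. Blocks too long to encode would instead be
 * handled by the oversize mechanism or simply not stored. */
static uint8_t encode_fall_through(uint32_t start_pc, uint32_t fall_pc) {
    return (uint8_t)(((fall_pc - start_pc) / INST_BYTES)
                     & ((1u << OFFSET_BITS) - 1));
}

/* Rebuild the full fall-through address from the stored offset. */
static uint32_t decode_fall_through(uint32_t start_pc, uint8_t offset) {
    return start_pc + (uint32_t)offset * INST_BYTES;
}
```

With 5 offset bits this encodes distances of up to 31 instructions, comfortably more than the 16-instruction fetch distance beyond which the paper saw no benefit.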
FTB Optimizations over BTB • Doesn’t store every block: • Fall-through blocks • Blocks that are seldom taken
Fetch Target Buffer • Each entry holds: • Target: target address of the block-ending branch • Type: conditional, subroutine call/return • Oversize: set if the block size > cache line • The FTB lookup produces the next PC
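Collecting the slide’s fields, a hypothetical FTB entry could look like the struct below (field names and widths are illustrative, not from the paper):

```c
#include <stdint.h>

typedef enum { BR_COND, BR_CALL, BR_RETURN } BranchType;

/* One FTB entry, looked up by the fetch block's starting PC. */
typedef struct {
    uint32_t   tag;          /* identifies the block's starting PC     */
    uint32_t   target;       /* target address of the ending branch    */
    uint8_t    fall_offset;  /* partial fall-through address (offset)  */
    BranchType type;         /* conditional, call, or return           */
    uint8_t    oversize;     /* 1 if the block exceeds a cache line    */
    uint8_t    valid;
} FTBEntry;
```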
FTB Operation • L1 hit, branch predicted not taken: the fall-through address becomes the next PC • L1 hit, branch predicted taken: the branch target becomes the next PC • L1 miss: fetch falls through sequentially while the L2 is probed; on an L2 hit, the entry arrives after an N-cycle delay • L1 and L2 miss: fetch keeps falling through, which eventually mispredicts and is repaired (see the sketch below)
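The cases above amount to a next-PC selection rule. A simplified software model (hypothetical, and ignoring the N-cycle L2 fill timing) that reuses FTBEntry and decode_fall_through() from the earlier sketches:

```c
#define MAX_BLOCK_BYTES (16 * 4)  /* assumed maximum fetch block: 16 insts */

/* e == NULL models an FTB miss (either level). */
static uint32_t next_fetch_pc(const FTBEntry *e, uint32_t pc, int pred_taken) {
    if (e == NULL)
        return pc + MAX_BLOCK_BYTES;  /* fall through sequentially; a bad
                                         guess is later repaired as a
                                         misprediction */
    if (pred_taken)
        return e->target;             /* taken: redirect to the target */
    return decode_fall_through(pc, e->fall_offset);   /* not taken */
}
```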
Hybrid Branch Prediction • A meta-predictor selects among • Local history predictor • Global history predictor • Bimodal predictor
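A minimal sketch of meta-prediction (generic McFarling-style selection logic, not the paper’s exact tables, and shown for two components for brevity): a 2-bit counter picks which component to trust, and is trained toward whichever one was right.

```c
#include <stdint.h>

/* 2-bit saturating counter helpers. */
static uint8_t sat_inc(uint8_t c) { return c < 3 ? c + 1 : 3; }
static uint8_t sat_dec(uint8_t c) { return c > 0 ? c - 1 : 0; }

/* meta >= 2 means "trust the global predictor". */
static int hybrid_predict(uint8_t meta, int local_pred, int global_pred) {
    return (meta >= 2) ? global_pred : local_pred;
}

/* After the branch resolves, nudge the counter toward the component
 * that was correct; no update when both predicted the same way. */
static uint8_t meta_update(uint8_t meta, int local_pred, int global_pred,
                           int taken) {
    if (global_pred == taken && local_pred != taken) return sat_inc(meta);
    if (local_pred == taken && global_pred != taken) return sat_dec(meta);
    return meta;
}
```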
Branch Prediction • [Diagram: structure of the global history predictor]
Committing Results • When full, the speculative history queue (SHQ) commits its oldest value to the local or global history
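A rough model of the commit rule on this slide (structure and sizes are our assumptions): outcomes enter the SHQ speculatively, and when the queue is full the oldest outcome is retired into the architectural local or global history.

```c
#include <stdint.h>

#define SHQ_SIZE 16          /* hypothetical depth */

typedef struct {
    uint32_t pc;             /* branch the outcome belongs to     */
    uint8_t  taken;          /* speculative outcome               */
    uint8_t  is_local;       /* commit to local or global history */
} SHQEntry;

typedef struct {
    SHQEntry entry[SHQ_SIZE];
    unsigned head, tail;
} SHQ;

/* Push a speculative outcome; if the queue is full, first commit the
 * oldest value into the real history, as the slide describes. */
static void shq_push(SHQ *q, SHQEntry e,
                     void (*commit)(const SHQEntry *)) {
    if (q->tail - q->head == SHQ_SIZE) {
        commit(&q->entry[q->head % SHQ_SIZE]);   /* oldest value */
        q->head++;
    }
    q->entry[q->tail++ % SHQ_SIZE] = e;
}
```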
Outline • Scalable Front-End and Components • Fetch Target Queue • Fetch Target Buffer • Experimental Methodology • Results • Analysis and Conclusion
Experimental Methodology I • Baseline Architecture • Processor • 8-instruction fetch with 16-instruction issue per cycle • 128-entry reorder buffer with 32-entry load/store buffer • 8-cycle minimum branch misprediction penalty • Cache • 64K 2-way instruction cache • 64K 4-way data cache (pipelined)
Experimental Methodology II • Timing Model • CACTI cache compiler • Models on-chip memory • Modified for 0.35 µm, 0.18 µm, and 0.10 µm processes • Test set • 6 SPEC95 benchmarks • 2 C++ programs
Outline • Scalable Front-End and Components • Fetch Target Queue • Fetch Target Buffer • Experimental Methodology • Results • Analysis and Conclusion
Comparing FTB to BTB • FTB provides slightly better performance • Tested for various cache sizes: 64, 256, 1K, 4K, and 8K entries
Comparing Multi-Level FTB to Single-Level FTB • Two-level FTB performance • Smaller fetch size • 2-level average size: 6.6 • 1-level average size: 7.5 • Higher accuracy on average • Two-level: 83.3% • Single-level: 73.1% • Higher performance • 25% average speedup over single-level
Fall-Through Bits Used • Number of fall-through bits needed: 4-5 • Because fetch distances beyond 16 instructions do not improve performance, a short offset suffices
FTQ Occupancy • Roughly indicates throughput • On average, the FTQ is empty 21.1% of the time and full 10.7% of the time
Scalability • Two-level FTBs scale well with feature size • (In the scaling plot, a higher slope is better)
Outline • Scalable Front-End and Components • Fetch Target Queue • Fetch Target Buffer • Experimental Methodology • Results • Analysis and Conclusion
Analysis • 25% improvement in IPC over the best-performing single-level designs • The system scales well with feature size • On average, the FTQ is empty only 21.1% of the time • The FTB design requires at most 5 bits for the fall-through address
Conclusion • The FTQ and FTB design • Decouples the I-cache from branch prediction, producing higher throughput • Uses a multi-level buffer, producing better scalability
References • [1] Glenn Reinman, Todd Austin, and Brad Calder. A Scalable Front-End Architecture for Fast Instruction Delivery. ACM/IEEE 26th Annual International Symposium on Computer Architecture (ISCA), May 1999. • [2] Chris Perleberg and Alan Smith. Branch Target Buffer Design and Optimization. Technical Report, December 1989.