
Design tradeoffs for the Alpha EV8 Conditional Branch Predictor



  1. Design tradeoffs for the Alpha EV8 Conditional Branch Predictor. André Seznec, IRISA/INRIA; Stephen Felix, Intel; Venkata Krishnan, Stargen Inc; Yiannakis Sazeides, University of Cyprus

  2. Alpha EV8 (cancelled June 2001) • SMT: 4 threads • wide-issue superscalar processor: 8-way issue • Single-process performance is the goal; multithreaded performance is a bonus • 5-10 % overhead for SMT

  3. Challenges for the EV8 conditional branch predictor • High accuracy is needed: 14-cycle minimum misprediction penalty • Up to 16 predictions per cycle: from two non-contiguous fetch blocks! • Various implementation constraints: • keep the number of physical memory arrays under control • use of single-ported memory cells • timing constraints

  4. [Figure: instruction fetch blocks on EV8: blocks of up to 8 instructions, each block ending at the first taken branch or at an alignment boundary; not-taken branches do not end a block]

  5. Alpha EV8 front-end pipeline • Fetches up to two 8-instruction blocks per cycle from the I-cache: • a block ends either at an aligned 8-instruction boundary or at a taken control-flow instruction • up to 16 conditional branches fetched and predicted per cycle • The next two block addresses must be predicted in a single cycle: • critical path: use of a line predictor backed by a more complex PC address generator (conditional branch predictor, RAS, jump predictor, ...)
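To make the fetch-block rule concrete, here is a minimal Python sketch (ours, not EV8 hardware) that splits a dynamic instruction stream into fetch blocks: a block closes either at an aligned 8-instruction boundary or at a taken control-flow instruction. The stream representation and function name are illustrative.

```python
def split_fetch_blocks(stream):
    """Split a dynamic instruction stream into EV8-style fetch blocks.

    'stream' is a list of (pc, taken) pairs in fetch order, where pc is an
    instruction address (in instruction units) and 'taken' marks a taken
    control-flow instruction. A block ends at the first taken branch or at
    an aligned 8-instruction boundary, whichever comes first.
    """
    blocks, current = [], []
    for pc, taken in stream:
        current.append(pc)
        if taken or (pc + 1) % 8 == 0:   # taken branch or aligned boundary
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks

# Example: a taken branch at pc 2 ends the first block early; fetch resumes
# at the branch target (pc 21) and that block ends at the aligned boundary.
print(split_fetch_blocks([(0, False), (1, False), (2, True),
                          (21, False), (22, False), (23, False)]))
# [[0, 1, 2], [21, 22, 23]]
```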

  6. [Diagram: PC address generation pipeline: block pairs Y/Z, A/B and C/D are fetched in successive cycles; for each pair, line prediction completes first, then the prediction table read, then full PC address generation]

  7. EV8 predictor: derived from 2Bc-gskew. [Diagram: 2Bc-gskew combines a bimodal table (BIM) and two history-indexed tables (G0, G1); the e-gskew prediction is the majority vote of BIM, G0 and G1, and a meta predictor chooses between BIM and that vote]

  8. 2Bc-gskew degrees of freedom: the partial update policy • On correct predictions, only the components that predicted correctly are updated: • do not destroy the other components' predictions • better accuracy! • On correct predictions: • the prediction bit is only read • the hysteresis bit is only written • USE OF DISTINCT PREDICTION AND HYSTERESIS ARRAYS!! • No reason for the hysteresis and prediction arrays to have the same size
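As a behavioral illustration of the prediction/hysteresis split, here is a small Python sketch of a 2-bit counter stored as one prediction bit plus one hysteresis (strength) bit. The correct-prediction case follows the slide (prediction bit read-only, hysteresis bit written); the misprediction behavior shown is the standard 2-bit-counter interpretation, not necessarily the exact EV8 update logic.

```python
class SplitCounter:
    """2-bit saturating counter split across two single-bit arrays:
    'pred' lives in the prediction array, 'hyst' in the hysteresis array."""

    def __init__(self, pred=False, hyst=False):
        self.pred = pred    # predicted direction (True = taken)
        self.hyst = hyst    # strength bit (True = strong)

    def predict(self):
        return self.pred    # the prediction array is only read

    def update(self, outcome):
        if outcome == self.pred:
            # Correct prediction: the prediction bit is left untouched,
            # only the hysteresis bit is written (strengthen).
            self.hyst = True
        elif self.hyst:
            # Mispredicted while strong: weaken, keep the prediction bit.
            self.hyst = False
        else:
            # Mispredicted while weak: flip the prediction, stay weak.
            self.pred = outcome
```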

  9. EV8 predictor: leveraging the degrees of freedom • Smaller bimodal table • Different history lengths

  10. Dealing with implementation constraints

  11. Issues with global history • On each cycle, up to 16 branches are predicted: 0 to 16 bits would have to be inserted into the history vector!? • Blocks Y and Z, A and B, C and D are in flight simultaneously: branch outcomes from A, B and C are not yet available when predicting D!

  12. Block-compressed history (lghist) • Incorporate at most one bit into the history per fetch block: • 0, 1 or 2 bits incorporated into the history vector per cycle • Which bit? • The direction of the last conditional branch in the block • the previous conditional branches in the block are necessarily not taken • XORed with its position (1st half / 2nd half) in the block • gives a more uniform distribution of the history vectors
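A behavioral Python sketch of the lghist rule above; the variable names and the treatment of blocks without a conditional branch are our assumptions, and the actual hardware encoding may differ.

```python
def lghist_bit(last_branch_taken, in_second_half):
    """Bit contributed by one fetch block: direction of the block's last
    conditional branch (earlier ones are necessarily not taken), XORed
    with that branch's half-block position for a more uniform history."""
    return int(last_branch_taken) ^ int(in_second_half)

def update_lghist(lghist, fetched_blocks, width=21):
    """Shift 0, 1 or 2 bits into the history per cycle, one per fetched
    block. 'fetched_blocks' holds (has_cond_branch, taken, in_second_half)
    tuples for the up-to-two blocks of this cycle; 'width' is the longest
    history length used (21 bits for G1). Blocks with no conditional
    branch are assumed to contribute nothing."""
    for has_branch, taken, second_half in fetched_blocks:
        if has_branch:
            lghist = (lghist << 1) | lghist_bit(taken, second_half)
            lghist &= (1 << width) - 1
    return lghist
```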

  13. [Figure: example instruction fetch blocks on EV8, showing for each block whether a 0 or a 1 is inserted into lghist]

  14. The EV8 branch predictor information vector • History information is not available for the three previous blocks A, B and C • but their addresses are available!! • Information vector used to index the predictor: 1. the instruction (block) address, 2. lghist, the 3-block-old compressed history, 3. path information on the last three blocks
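A loose Python sketch of how these three pieces might be grouped before indexing. Only the list of ingredients comes from the slide; the field names and the choice of two address bits per block of path are purely illustrative assumptions.

```python
from collections import namedtuple

# Ingredients available when block D is fetched: its own address, the
# lghist value as it stood three blocks ago, and address (path) bits from
# the three intervening blocks A, B and C.
InfoVector = namedtuple("InfoVector", ["block_address", "old_lghist", "path_bits"])

def build_info_vector(block_address, lghist_3_blocks_old, last_three_block_addrs):
    path_bits = 0
    for addr in last_three_block_addrs:
        # Fold a couple of low-order block-address bits per block into the
        # path field (the exact number of bits per block is an assumption).
        path_bits = (path_bits << 2) | (addr & 0b11)
    return InfoVector(block_address, lghist_3_blocks_old, path_bits)
```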

  15. Using single-ported memory arrays • The challenge: 16 predictions to be performed per cycle, for two non-contiguous blocks! • 8 updates per cycle, also for two non-contiguous blocks! • But single-ported arrays are highly desirable :-)

  16. Bank-interleaved or double-ported branch predictor? • Reads of predictions for two 8-instruction blocks: • double-porting: memory cells twice as large, i.e. losing half of the entries? • bank-interleaving: need for arbitration • longer critical electrical path • losing throughput on bank conflicts • e.g. short loops fitting in a single 8-instruction block!?

  17. Conflict-free bank-interleaved predictor • Key idea: force adjacent predictions to lie in distinct banks • The bank for block A is determined by Y and Z • 4-way interleaving: if (y6, y5) == B_Z then B_A = (y6, y5) + 1, else B_A = (y6, y5)
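In Python, the bank-selection rule on this slide looks roughly as follows; the modulo-4 wrap-around on the "+1" is our assumption.

```python
def bank_for_block_A(y_address, bank_of_Z):
    """Conflict-free 4-way bank choice for block A, computed only from the
    previous cycle's information: address bits (y6, y5) of block Y and the
    bank already assigned to block Z."""
    candidate = (y_address >> 5) & 0b11     # bits y6 and y5 of Y's address
    if candidate == bank_of_Z:
        return (candidate + 1) % 4          # would collide with Z: take the next bank
    return candidate
```

Because the inputs (Y's address, Z's bank) are known one cycle ahead, the computation stays off the critical path, as the next slide notes.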

  18. Conflict free bank-interleaved predictor (2) • Conflicts are avoided by construction • Bank number is computed one cycle ahead • not on the critical path Single ported bank-interleaved memory arrays !

  19. «Logical view» vs. real implementation • Logical view: 4 tables × 4 banks × 2 (prediction + hysteresis) = 32 memory arrays; indexing functions are computed, then the arrays are accessed • Real implementation: 4 banks × 2 (prediction + hysteresis), with the 4 tables merged into a single array = 8 memory arrays; no time to lose: start the access and compute part of the index in parallel

  20. Reading the branch prediction tables • [Diagram: read path shared by the Meta, G0, G1 and BIM tables] • Bank selection: 1 out of 4 • Wordline selection: 1 out of 64 • Column selection: 8 out of 256 • Unshuffle: 8 to 8

  21. Reading the branch prediction tables (2) • Spans five cycle phases: • Cycle -1: • bank number computation • bank selection • Cycle 0: • phase 0: wordline selection • phase 1: column selection • Cycle 1: • phase 0: unshuffle permutation

  22. Constraints on index composition • Strong constraint, wordline bits: • must be immediately available • common to the four logical tables • Medium constraint, column bits: • a single two-input XOR gate per bit • Weak constraint, unshuffle bits: • near-complete freedom, a full tree of XOR gates if needed

  23. Designing the indexing functions (1): 6 wordline bits • Must be available at the beginning of the cycle: • block address bits • 3-block-old lghist bits • path bits • Tradeoff: • address bits emphasize the bimodal component behavior • lghist bits are more uniformly distributed • Chosen: 4 lghist bits + 2 address bits

  24. Designing the indexing functions (2): column selection and unshuffle • Favor independence of the four indexing functions: • if two (address, history) pairs conflict on one table, try to avoid repeating the conflict on another table • Guarantee that, for a single address, two histories differing in only one or two bits do not map to the same entry • Favor usage of the whole table: • lghist bits are more uniformly distributed than address bits • Resulting functions: each column bit XORs 2 lghist bits; unshuffle bits use XOR trees of up to 11 bits
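Putting slides 22 to 24 together, a table index is assembled from three groups of bits with increasing logic depth. The Python sketch below only mirrors that structure; which particular lghist and address bits feed each stage, and the field widths, are illustrative choices, not the actual EV8 functions.

```python
def xor_reduce(bits):
    """XOR together a list of bits (stand-in for one XOR tree)."""
    out = 0
    for b in bits:
        out ^= b & 1
    return out

def bit(value, i):
    return (value >> i) & 1

def predictor_index(block_addr, lghist):
    """One table's index, split by logic depth:
      - wordline bits: raw, immediately available bits (4 lghist + 2 address)
      - column bits: each a single two-input XOR of lghist bits
      - unshuffle bits: each a wider XOR tree (up to 11 inputs on EV8)."""
    wordline = (lghist & 0b1111) | ((block_addr & 0b11) << 4)          # 6 bits

    column = 0
    for i in range(5):
        column |= xor_reduce([bit(lghist, 4 + i), bit(lghist, 9 + i)]) << i

    unshuffle = 0
    for i in range(3):
        taps = [bit(lghist, 6 + i + 4 * k) for k in range(3)] + [bit(block_addr, 2 + i)]
        unshuffle |= xor_reduce(taps) << i

    return wordline, column, unshuffle
```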

  25. EV8 branch predictor configuration • 208 Kbits for prediction and 144 Kbits for hysteresis • BIM: 16 Kbits prediction + 16 Kbits hysteresis, 4 lghist bits (+ 3-block path) • G0: 64 Kbits + 32 Kbits, 13 lghist bits • G1: 64 Kbits + 64 Kbits, 21 lghist bits • Meta: 64 Kbits + 32 Kbits, 17 lghist bits • 4 prediction banks and 4 hysteresis banks
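A quick check, using the numbers copied from the slide, that the per-table prediction and hysteresis sizes sum to the stated totals of 208 Kbits and 144 Kbits:

```python
# (prediction Kbits, hysteresis Kbits) per table, copied from the slide
sizes = {"BIM": (16, 16), "G0": (64, 32), "G1": (64, 64), "Meta": (64, 32)}
print(sum(p for p, _ in sizes.values()),   # 208 Kbits of prediction
      sum(h for _, h in sizes.values()))   # 144 Kbits of hysteresis
```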

  26. Performance evaluation Sorry, SPEC 95 :-)

  27. Benchmark characteristics • Highly optimized SPECint 95 binaries: • many more not-taken than taken branches • ratio of lghist to ghist history length: from 1.12 to 1.59 • from 8.9 to 16.2 branches per 100 instructions

  28. 2Bc-gskew vs other global history predictors

  29. Quality of information vector

  30. Reducing some table sizes: no significant impact

  31. Quality of indexing functions

  32. Conclusion • Designing a real branch predictor raises challenges ignored in most academic studies: • 3-block-old history vector • inability to maintain a complete history • simultaneous accesses to the predictor • minimizing the number of memory arrays • timing constraints on the indexing functions • We overcame these difficulties and adapted a state-of-the-art academic branch predictor to real-world constraints.

  33. Summary of the contributions • An efficient information vector can be built by mixing path information and compressed history: • don't obsess over the info vector, use what is conveniently available! • Use of different table sizes and history lengths within the predictor • Sharing of hysteresis bits • A conflict-free parallel access scheme for the predictor • Engineering of the indexing functions

  34. Acknowledgements To the whole EV8 design team Special mention to: Ta-chung Chang, George Chrysos, John Edmondson, Joel Emer, Tryggve Fossum, Glenn Giacalone, Balakrishnan Iyer, Manickavelu Balasubramanian, Harish Patil, George Tien and James Vash.
