Emulating Unimplemented Instructions in an SMT
Suan Yong & Brian Forney
CS/ECE 752, Spring 2000
Motivation
• Simultaneous multithreaded (SMT) processors are promising and likely to be embraced by industry
• They exploit thread-level parallelism
• Compaq’s Alpha 21464 has planned SMT support
The question
• What if SMT support is needed, but cost, complexity, or power consumption is an issue?
• One solution is to remove functional units and emulate the instructions they would have executed
• Can anything be done to speed up the emulation of those instructions?
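As an illustration of what emulating a removed functional unit can involve, here is a minimal sketch of an integer multiply built only from shifts, adds, and bit tests. The slides do not say which routine the emulation code actually runs; the shift-and-add algorithm, the name `emulate_mul64`, and the 64-bit register width are all assumptions made for this example.

```python
MASK64 = (1 << 64) - 1  # assumption: 64-bit registers


def emulate_mul64(src1: int, src2: int) -> int:
    """Return the low 64 bits of src1 * src2 without using a multiplier.

    Hypothetical helper: classic shift-and-add, one iteration per
    remaining bit of the second operand.
    """
    a = src1 & MASK64
    b = src2 & MASK64
    product = 0
    while b:
        if b & 1:                      # lowest bit of b set: add shifted a
            product = (product + a) & MASK64
        a = (a << 1) & MASK64          # next power-of-two multiple of src1
        b >>= 1                        # consume one bit of src2
    return product
```

The loop runs up to 64 iterations of simple integer operations, which gives a sense of why emulating a multiply costs far more than a dedicated unit and why speeding up emulation is worth asking about.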
Related work
• “The Use of Multithreading for Exception Handling,” Zilles et al., MICRO-32, November 1999
• “Simultaneous Subordinate Microthreading (SSMT),” Chappell et al., 26th Annual ISCA, May 1999
Exception Handling
[Figure: instruction timelines contrasting exception handling on a standard pipeline with exception handling on an SMT]
Simultaneous Multithreading (SMT)
[Figure: pipeline diagram labeled BRANCH PREDICTOR, FETCH, I$, DECODE, D$, R/W, I-ALU, I-MUL, and FP, with two sets of PC and registers R1, R2, ..., Rn]
Emulating SMT approach
[Figure, several animation frames: pipeline diagram labeled BRANCH PREDICTOR, FETCH, I$, DECODE, D$, R/W, I-ALU, I-MUL, FP, T-STRT, and T-RET, with two sets of PC and registers R1, R2, ..., Rn; in later frames one register set carries src1 and src2 labels]
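The T-STRT and T-RET labels and the src1/src2 fields in the diagram suggest a flow in which an unimplemented instruction starts a handler thread with its two source operands and later receives the result back. That reading is an interpretation of the labels, not something the slides state. The toy sketch below shows only that control flow, sequentially and with no timing or concurrency; the opcode names, the `execute` function, and the register names are hypothetical.

```python
MASK64 = (1 << 64) - 1  # assumption: 64-bit registers


def software_mul(src1, src2):
    # Stand-in for the emulation routine; Python's * is used for brevity.
    return (src1 * src2) & MASK64


# Hypothetical opcodes that still have a hardware unit.
HARDWARE_OPS = {
    "add": lambda a, b: (a + b) & MASK64,
    "sub": lambda a, b: (a - b) & MASK64,
}
# Hypothetical opcodes whose unit was removed and must be emulated.
EMULATED_OPS = {"mul": software_mul}


def execute(program, regs):
    """Run (op, dest, src1, src2) tuples; return how many were emulated."""
    emulated = 0
    for op, dest, s1, s2 in program:
        if op in HARDWARE_OPS:
            regs[dest] = HARDWARE_OPS[op](regs[s1], regs[s2])
        elif op in EMULATED_OPS:
            # "Thread start": copy the operands into the handler's context.
            handler_regs = {"src1": regs[s1], "src2": regs[s2]}
            result = EMULATED_OPS[op](handler_regs["src1"], handler_regs["src2"])
            # "Thread return": hand the result back to the main thread.
            regs[dest] = result
            emulated += 1
        else:
            raise ValueError(f"unknown opcode: {op}")
    return emulated
```

For example, running `[("mul", "r3", "r1", "r2"), ("add", "r4", "r3", "r1")]` with `r1 = 3` and `r2 = 4` routes the multiply through the handler and lets the dependent add consume its result.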
Methodology
• Modified Zilles’s sim-multi Compaq Alpha SMT simulator
  • added exception thread support
  • added multiply thread
• Ran representative execution traces of benchmarks from SPEC CPU2000 and MediaBench
Conclusions
• ESMT usually minimizes the performance cost of emulation
• “ooo” mode (non-pausing) works best
• “squash” is occasionally better, because of resource contention
  • do partial squashing?
• Some of the hardware is already needed, and could be useful for other purposes