Operation Tables for Scheduling in the Presence of Partial Bypassing
Aviral Shrivastava (1), Eugene Earlie (2), Nikil Dutt (1), Alex Nicolau (1)
(1) Center for Embedded Computer Systems, University of California, Irvine, CA, USA
(2) Strategic CAD Labs, Intel, Hudson, MA, USA

Bypassing Improves Performance
• Pipelining improves performance
  • Limited by pipeline hazards
• Bypasses eliminate certain data hazards
  • Further improve performance
[Figure: two pipeline diagrams (stages F, D, OR, X1, X2, WB; register file RF) for the dependent operations R1 ← R2 + R3 and R4 ← R4 + R1]

Impact of Bypassing
[Figure: pipeline with stages F, D, X1, X2, WB, register file RF, and multiplexers M1, M2]
• Area and power consumption
  • Wide multiplexers
  • Bypass control logic
  • Bypass wires
• Cycle time
  • Bypasses may be part of the timing-critical path
• Overall chip complexity
  • Deeply pipelined
  • Out-of-order processors
  • Wiring congestion
P. Ahuja et al., The Performance Impact of Incomplete Bypassing in Processor Pipelines, MICRO 1995.
A. Abnous and N. Bagherzadeh, Pipelining and Bypassing in a VLIW Processor, IEEE Trans... 1995.

Bypassing in Embedded Systems
• Bypassing increases performance
  • But may have a significant impact on area, power consumption, wire congestion, etc.
• The Embedded Systems Dilemma
  • No bypassing: performance too low
  • Full bypassing: too much area, power, wire congestion
• How to customize bypassing?

Partial Bypassing: Solution and Problem
[Figure: pipeline with stages F, D, OR, X1, X2, WB and register file RF]
• Solution
  • Only the most beneficial bypasses are present
  • Implements a trade-off between performance, area, power consumption, etc. of the processor
• Problem
  • How to compile for a processor with partial bypassing?

Related Work
• Compilation for partial bypassing
  • P. Ahuja et al. [MICRO’95]
    • Manual compilation
  • M. Buss et al. [CASES’01]
    • Optimize inter-cluster copy operations
  • K. Fan et al. [ASSP’03]
    • FU-allocation strategy for VLIW processors
• No existing generic compilation technique
  • RISC, superscalar, superpipelined
  • No instruction reordering
  • No accurate “pipeline hazard detection” technique
We present: an accurate, generic, retargetable pipeline hazard detection technique

Pipeline Hazards
• Data Hazards
• Resource Hazards
• Resource Hazards: Structural Information
  • Reservation Tables
[Figure: pipeline with stages F, D, OR, X1, X2, WB, register file RF, and connections C1, C2, C3]

Resource Hazard Detection
[Figure: pipeline (stages F, D, OR, X1, X2, WB; register file RF; connections C1, C2, C3) with a resource hazard highlighted]

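Resource hazard detection with reservation tables can be sketched as follows. This is a minimal illustration, not the authors' implementation: the table contents, the hypothetical MUL operation (which holds X1 for two cycles), and the name `resource_hazard` are all assumptions made for the example.

```python
from typing import List, Set

# One set of busy resources per cycle, counted from the issue cycle.
ReservationTable = List[Set[str]]

# Assumed tables. ADD follows the pipeline on the slide; MUL is hypothetical
# and occupies X1 for two cycles so that a conflict becomes possible.
ADD_RT: ReservationTable = [
    {"F"}, {"D"}, {"OR", "C1", "C2"}, {"X1"}, {"X2"}, {"WB", "C3"},
]
MUL_RT: ReservationTable = [
    {"F"}, {"D"}, {"OR", "C1", "C2"}, {"X1"}, {"X1"}, {"X2"}, {"WB", "C3"},
]


def resource_hazard(rt_a: ReservationTable, issue_a: int,
                    rt_b: ReservationTable, issue_b: int) -> bool:
    """Return True if both operations need the same resource in the same cycle."""
    busy = {issue_a + offset: resources for offset, resources in enumerate(rt_a)}
    return any(
        busy.get(issue_b + offset, set()) & resources
        for offset, resources in enumerate(rt_b)
    )


# MUL issued at cycle 0 still holds X1 in cycle 4, when an ADD issued at
# cycle 1 would need it.
print(resource_hazard(MUL_RT, 0, ADD_RT, 1))
print(resource_hazard(MUL_RT, 0, ADD_RT, 2))
```

The check needs only structural information: no register names appear anywhere in the tables.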
Data Hazard Detection
• Control Flow Graph: register information
• Operation Latency
  • Least delay (in cycles) by which dependent operations must be separated to avoid a data hazard
[Figure: control flow graph of operations a to f annotated with operation latencies, and the resulting scheduled operations over time]

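The traditional use of operation latencies can be sketched as below. The dependence graph and its latencies are hypothetical (the figure's exact edges are not recoverable from this transcript), and a single-issue, in-order pipeline is assumed.

```python
from typing import Dict, List, Tuple


def issue_cycles(order: List[str],
                 deps: Dict[str, List[Tuple[str, int]]]) -> Dict[str, int]:
    """Issue operations in the given order, one per cycle at most, delaying
    each until every producer is at least its latency away."""
    issue: Dict[str, int] = {}
    next_free = 0
    for op in order:
        # Earliest cycle allowed by the data dependences of this operation.
        ready = max((issue[p] + lat for p, lat in deps.get(op, [])), default=0)
        issue[op] = max(ready, next_free)
        next_free = issue[op] + 1
    return issue


# Hypothetical graph: consumer -> [(producer, operation latency)].
deps = {"b": [("a", 1)], "c": [("a", 2)], "d": [("b", 2), ("c", 1)]}

print(issue_cycles(["a", "b", "c", "d"], deps))  # no stalls
print(issue_cycles(["a", "c", "b", "d"], deps))  # c and d must wait
```

Here a data hazard is detected from register dependences plus one constant per edge, which is exactly what stops working once bypasses are only partially present.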
Traditional - Operation Latency
• Operation latency of a non-bypassed or fully bypassed pipeline is a constant
[Figure: two pipeline diagrams (stages F, D, OR, X1, X2, WB; register file RF) for R1 ← R2 + R3 followed by R4 ← R4 + R1]
No Bypassing: Operation Latency = 3
Full Bypassing: Operation Latency = 1

Partial Bypasses - Operation Latency
• Operation latency is ill-defined
[Figure: pipeline (stages F, D, OR, X1, X2, X3, WB; register file RF) for R1 ← R2 + R3 followed by R4 ← R4 + R1]
Partial Bypassing: Operation Latency = ??
• Delay (in cycles) depends on the structure
  • Processor pipeline
  • Presence/absence of bypasses
• Need structural information to detect data hazards

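Why a single latency number stops being meaningful can be seen in a small model. The sketch below makes several assumptions that are not stated on the slides: a six-stage pipeline F, D, OR, X1, X2, WB; the result exists from X1 onward; operands are read in OR; and a value written in WB can be read from RF in the same cycle. The function names are hypothetical.

```python
from typing import List, Set

STAGES = ["F", "D", "OR", "X1", "X2", "WB"]


def operand_readable(distance: int, bypass_sources: Set[str]) -> bool:
    """Can a consumer issued `distance` cycles after its producer read the result?"""
    # Stage index of the producer in the cycle the consumer sits in OR.
    producer_stage = STAGES.index("OR") + distance
    if producer_stage >= STAGES.index("WB"):
        return True   # result is (being) written to the register file
    if producer_stage < STAGES.index("X1"):
        return False  # result not computed yet
    return STAGES[producer_stage] in bypass_sources


def safe_distances(bypass_sources: Set[str], limit: int = 4) -> List[int]:
    return [d for d in range(1, limit + 1) if operand_readable(d, bypass_sources)]


print(safe_distances(set()))         # no bypassing
print(safe_distances({"X1", "X2"}))  # full bypassing
print(safe_distances({"X1"}))        # partial: only the X1 bypass exists
```

Under these assumptions, no bypassing gives a minimum distance of 3 and full bypassing a minimum of 1, matching the constants on the previous slide. With only the X1 bypass, distances 1 and 3 are safe but 2 is not, so no single "least delay" describes the dependence.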
Partial Bypassing - Pipeline Hazards
• Traditionally (no or full bypassing)
  • Resource hazards: structural information
  • Data hazards: register information + operation latency
• Partial bypassing
  • Resource hazards: structural information
  • Data hazards: register information + structural information
• Structural information is captured by Reservation Tables
• Augment Reservation Tables with register information
Our contribution: Operation Table

Reservation Table
Reservation Table for ADD:
1. F
2. D
3. OR
   C1 RF
   C2 RF
4. X1
5. X2
6. WB
   C3 RF
• A Reservation Table is a binding between
  • Operation and processor resources
• Does not support multiple datapaths
[Figure: pipeline with stages F, D, OR, X1, X2, WB, register file RF, and connections C1, C2, C3]

Enhanced Reservation Table
[Figure: pipeline with stages F, D, OR, X1, X2, WB, register files RF and BRF, and connections C1 to C5]
Reservation Table for ADD:
1. F
2. D
3. OR
   C1 RF
   C2 RF
   C5 BRF
4. X1
   C4 BRF
5. X2
6. WB
   C3 RF
• A Reservation Table is a binding between
  • Operation and processor resources

Operation Table
[Figure: pipeline with stages F, D, OR, X1, X2, WB, register files RF and BRF, and connections C1 to C5]
Operation Table for ADD R1 R2 R3:
1. F
2. D
3. OR
   ReadOperands
     R2: C1 RF
     R3: C2 RF, C5 BRF
   DestOperands
     R1: RF
4. X1
   WriteOperands
     R1: C4 BRF
5. X2
6. WB
   WriteOperands
     R1: C3 RF
• An Operation Table is a binding between
  • Operation and processor resources and registers
• Can be used to detect both data and resource hazards

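One possible in-memory encoding of the Operation Table for ADD R1 R2 R3 is sketched below. The class and field names (`Operand`, `OTCycle`, `add_operation_table`) are hypothetical; only the stage, connection, and register-file names come from the slide.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# (connection, storage); the connection is None where the slide lists none.
Path = Tuple[Optional[str], str]


@dataclass
class Operand:
    register: str
    paths: List[Path]  # alternative ways to reach the register


@dataclass
class OTCycle:
    unit: str
    reads: List[Operand] = field(default_factory=list)   # ReadOperands
    dests: List[Operand] = field(default_factory=list)   # DestOperands
    writes: List[Operand] = field(default_factory=list)  # WriteOperands


def add_operation_table(dst: str, src1: str, src2: str) -> List[OTCycle]:
    """Operation Table for ADD dst src1 src2, one entry per cycle."""
    return [
        OTCycle("F"),
        OTCycle("D"),
        OTCycle(
            "OR",
            reads=[
                Operand(src1, [("C1", "RF")]),
                Operand(src2, [("C2", "RF"), ("C5", "BRF")]),
            ],
            dests=[Operand(dst, [(None, "RF")])],
        ),
        OTCycle("X1", writes=[Operand(dst, [("C4", "BRF")])]),
        OTCycle("X2"),
        OTCycle("WB", writes=[Operand(dst, [("C3", "RF")])]),
    ]


ot = add_operation_table("R1", "R2", "R3")
for number, cycle in enumerate(ot, start=1):
    print(number, cycle.unit, cycle.reads, cycle.dests, cycle.writes)
```

Dropping the register names from this structure leaves the enhanced reservation table; keeping them is what allows data hazards to be checked against the same table.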
Pipeline Hazard Detection using OT
[Figure: pipeline with stages F, D, OR, X1, X2, WB, register files RF and BRF, and connections C1 to C5]

Resource Hazard Detection
[Figure: the same pipeline with a resource hazard highlighted]

Data Hazard Detection
[Figure: the same pipeline with a data hazard highlighted]

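A data hazard check driven by the ADD Operation Table might look like the sketch below. It is a simplification under stated assumptions, not the authors' algorithm: every operation is ADD-like, BRF holds a result only in the cycle the producer is in X1, a WB write is readable from RF in the same cycle, and only the second source operand has a path from BRF (connection C5). All names are hypothetical.

```python
from typing import List, Tuple

# Cycle offsets from issue, taken from the ADD Operation Table.
READ_OFFSET = 2    # OR: source operands are read
BYPASS_OFFSET = 3  # X1: result is written to BRF over C4
WRITE_OFFSET = 5   # WB: result is written to RF over C3

# Source slots that have a read path from BRF (only the second one, via C5).
BYPASSED_SLOTS = {1}

Op = Tuple[str, Tuple[str, str]]  # (destination, (source0, source1))


def data_hazard(scheduled: List[Tuple[int, Op]], issue: int, op: Op) -> bool:
    """True if `op`, issued at cycle `issue`, cannot read one of its sources.

    `scheduled` holds (issue cycle, operation) pairs issued earlier.
    """
    read_cycle = issue + READ_OFFSET
    _, sources = op
    for slot, reg in enumerate(sources):
        for prev_issue, (prev_dst, _) in scheduled:
            if prev_dst != reg or prev_issue >= issue:
                continue
            if read_cycle >= prev_issue + WRITE_OFFSET:
                continue  # value has reached RF
            in_brf = read_cycle == prev_issue + BYPASS_OFFSET
            if in_brf and slot in BYPASSED_SLOTS:
                continue  # value is read from BRF over C5
            return True
    return False


producer = (0, ("R1", ("R2", "R3")))  # R1 <- R2 + R3, issued at cycle 0
for cycle in (1, 2, 3):
    print(cycle, data_hazard([producer], cycle, ("R4", ("R4", "R1"))))
```

With R1 as the first source instead, the consumer has no bypass path and must wait for the register file, which is the operand-level detail a single latency value cannot express.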
Scheduling using Operation Tables
• Operation Tables provide a way to accurately detect pipeline hazards
  • Detect both data and resource hazards
• Most scheduling algorithms have two main components
  • Generate possible reorderings
  • Evaluate each to find the best one
• Most scheduling algorithms should be able to benefit from a better evaluation mechanism

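The generate-and-evaluate structure can be illustrated with a brute-force sketch. The slides do not say how the scheduler searches, so exhaustive enumeration here is an assumption that is only practical for very small basic blocks; the `hazard` callback stands in for an OT-based check and is equally hypothetical.

```python
from itertools import permutations
from typing import Callable, Dict, List, Sequence, Tuple


def respects_dependences(order: Sequence[str], deps: Dict[str, List[str]]) -> bool:
    pos = {op: i for i, op in enumerate(order)}
    return all(pos[p] < pos[c] for c, preds in deps.items() for p in preds)


def total_cycles(order: Sequence[str], deps: Dict[str, List[str]],
                 hazard: Callable[[int], bool]) -> int:
    """Evaluate one ordering: stall each operation while any producer is at a
    hazardous distance, then return the number of issue cycles used."""
    issue: Dict[str, int] = {}
    cycle = 0
    for op in order:
        while any(hazard(cycle - issue[p]) for p in deps.get(op, [])):
            cycle += 1
        issue[op] = cycle
        cycle += 1
    return cycle


def best_order(ops: Sequence[str], deps: Dict[str, List[str]],
               hazard: Callable[[int], bool]) -> Tuple[str, ...]:
    """Generate every legal reordering and keep the cheapest one."""
    legal = (o for o in permutations(ops) if respects_dependences(o, deps))
    return min(legal, key=lambda o: total_cycles(o, deps, hazard))


# Hypothetical block: c consumes the result of a; b and d are independent.
ops = ["a", "c", "b", "d"]
deps = {"c": ["a"]}
no_bypass = lambda distance: distance < 3  # assumed hazard model

best = best_order(ops, deps, no_bypass)
print(total_cycles(ops, deps, no_bypass), best, total_cycles(best, deps, no_bypass))
```

Only the `hazard` callback would change when moving from a latency-based evaluator to an OT-based one, which is the sense in which existing schedulers can reuse a better evaluation mechanism.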
Experimental Setup
• Platform: Intel XScale
  • 7-stage super-pipelined RISC
• Benchmarks: MiBench
• Scheduler
  • Instruction reordering within a basic block
  • Currently a post pass in the compiler
[Figure: Application → gcc -O3 → Executable → Cycle Accurate Simulator → GCC Cycles; Executable → OT-based Scheduler → Executable → Cycle Accurate Simulator → OT Cycles]
Performance Improvement = (GCC Cycles - OT Cycles) / GCC Cycles
Intel XScale Microarchitecture Programmers Reference Manual, http://www.developer.intel.com
M. R. Guthaus et al., MiBench: A free commercially representative…, IEEE Workshop… 2001

Up to 20% Performance Improvement
Performance Improvement = (GCC Cycles - OT Cycles) / GCC Cycles

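The metric is a relative reduction in cycle count. A small worked example follows; the cycle counts are made up for illustration and are not measurements from the experiments.

```python
def performance_improvement(gcc_cycles: int, ot_cycles: int) -> float:
    """(GCC Cycles - OT Cycles) / GCC Cycles, as a fraction of GCC Cycles."""
    if gcc_cycles <= 0:
        raise ValueError("gcc_cycles must be positive")
    return (gcc_cycles - ot_cycles) / gcc_cycles


# Hypothetical counts: 1000 cycles from gcc -O3, 800 after OT-based scheduling.
print(f"{performance_improvement(1000, 800):.0%}")
```

A value of 0 means the OT-based schedule takes as many cycles as the gcc -O3 code, and a negative value would mean it takes more.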
Summary
• Bypassing improves performance but is costly in terms of area, power, etc.
• Partial bypassing offers valuable trade-offs but poses challenges for compilation
• Operation latencies in a partially bypassed pipeline are ill-defined
• We define the Operation Table (OT) as a binding between an operation and the processor's resources and registers
• OTs can be used to accurately detect hazards even in the presence of partial bypassing
• Simple OT-based scheduling at the basic-block level yields up to 20% performance improvement

Thank You! Questions/Comments? aviral@ics.uci.edu