Operation Tables for Scheduling in the presence of Partial Bypassing

Operation Tables for Scheduling in the presence of Partial Bypassing
Aviral Shrivastava¹, Eugene Earlie², Nikil Dutt¹, Alex Nicolau¹
¹ Center for Embedded Computer Systems, University of California, Irvine, CA, USA
² Strategic CAD Labs, Intel, Hudson, MA, USA

Bypassing Improves Performance
• Pipelining improves performance
  • Limited by pipeline hazards
• Bypasses eliminate certain data hazards
  • Further improve performance
[Figure: pipeline F, D, OR, X1, X2, WB with register file RF, executing R1 ← R2 + R3 followed by R4 ← R4 + R1]

Impact of Bypassing
• Area and power consumption
  • Wide multiplexers
  • Bypass control logic
  • Bypass wires
• Cycle time
  • Bypasses may be a part of the timing-critical path
• Overall chip complexity
  • Deeply pipelined
  • Out-of-order processors
• Wiring congestion
[Figure: pipeline F, D, X1, X2, WB with register file RF and multiplexers M1, M2]
P. Ahuja et al., The Performance Impact of incomplete bypassing in processor pipelines, MICRO 1995
A. Abnous and N. Bagherzadeh, Pipelining and bypassing in a VLIW processor, IEEE Trans... 1995

Bypassing in Embedded Systems
• Bypassing increases performance
  • But may have a significant impact on area, power consumption, wire congestion, etc.
• The Embedded Systems Dilemma
  • No bypassing: performance too low
  • Full bypassing: too much area, power, wire congestion
• How to customize bypassing?

Partial Bypassing – Solution and Problem
• Solution
  • Only the most beneficial bypasses are present
  • Implements a trade-off between performance, area, power consumption, etc. of the processor
• Problem
  • How to compile for a processor with partial bypassing?
[Figure: pipeline F, D, OR, X1, X2, WB with register file RF]

Related Work
• Compilation for partial bypassing
  • P. Ahuja et al. [MICRO’95]
    • Manual compilation
  • M. Buss et al. [CASES’01]
    • Optimize inter-cluster copy operations
  • K. Fan et al. [ASSP’03]
    • FU-allocation strategy for VLIW processors
• No existing generic compilation technique
  • RISC, superscalar, superpipelined
• No instruction reordering
• No accurate “pipeline hazard detection” technique
We present: an accurate, generic, retargetable pipeline hazard detection technique

Pipeline Hazards
• Data Hazards
• Resource Hazards
• Resource Hazards – Structural Information
  • Reservation Tables
[Figure: pipeline F, D, OR, X1, X2, WB with register file RF and C1, C2, C3]

Resource Hazard Detection
[Figure: pipeline F, D, OR, X1, X2, WB with register file RF and C1, C2, C3, with a resource hazard marked]

Data Hazard Detection
• Control Flow Graph – Register Information
• Operation Latency
  • Least delay (in cycles) by which dependent operations must be separated to avoid a data hazard
[Figure: control flow graph of operations a to f with operation latencies on its edges, next to the scheduled operations laid out over time]
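The latency-based check described on this slide can be sketched in a few lines. This is only an illustration, not the authors' implementation: the function name `earliest_issue_cycles`, the single-issue in-order model, and the example latencies are all assumptions made here.

```python
def earliest_issue_cycles(order, latency):
    """Assign each operation the earliest cycle that avoids data hazards.

    order:   operations in issue order (single issue, in order; an
             assumption made for this sketch)
    latency: dict mapping (producer, consumer) to the operation latency,
             i.e. the least number of cycles that must separate the two
    """
    cycle = {}
    next_free = 0  # earliest cycle in which the next operation may issue
    for op in order:
        earliest = next_free
        for (producer, consumer), lat in latency.items():
            if consumer == op and producer in cycle:
                earliest = max(earliest, cycle[producer] + lat)
        cycle[op] = earliest
        next_free = earliest + 1
    return cycle


# Hypothetical example: b and c both depend on a, with latencies 2 and 1.
schedule = earliest_issue_cycles(["a", "b", "c"], {("a", "b"): 2, ("a", "c"): 1})
print(schedule)  # {'a': 0, 'b': 2, 'c': 3}
```

Here b is delayed by one cycle because of its latency-2 dependence on a, while c issues as soon as the issue slot is free.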

Traditional - Operation Latency
• Operation latency of a non-bypassed or fully bypassed pipeline is a constant
  • No bypassing: operation latency = 3
  • Full bypassing: operation latency = 1
[Figure: pipeline F, D, OR, X1, X2, WB with register file RF, shown without bypasses and with full bypassing, each executing R1 ← R2 + R3 followed by R4 ← R4 + R1]

Partial Bypasses - Operation Latency
• Operation latency is ill-defined
  • Partial bypassing: operation latency = ??
• Delay (in cycles) depends on the structure
  • Processor pipeline
  • Presence/absence of bypasses
• Need structural information to detect data hazards
[Figure: partially bypassed pipeline (stage labels F, D, OR, X1, X2, X3, WB) with register file RF, executing R1 ← R2 + R3 followed by R4 ← R4 + R1]
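One way to see why a single latency number stops being meaningful is to enumerate which producer-to-consumer distances are hazard-free for a given set of bypasses. The sketch below rests on assumptions that are not stated on the slide: a 6-stage pipeline F, D, OR, X1, X2, WB, a result that is available from X1 onwards, operands read in OR, and a register-file write in WB that can be read in the same cycle. These assumptions were chosen because they reproduce the latencies of 3 (no bypassing) and 1 (full bypassing) quoted on the previous slide. The names `operand_available` and `legal_distances` are hypothetical.

```python
# Stages a producer passes through after reading its operands in OR.
# Assumed pipeline: F, D, OR, X1, X2, WB.
STAGES_AFTER_OR = ["X1", "X2", "WB"]


def operand_available(distance, bypasses):
    """Can a consumer issued `distance` cycles after the producer read the result?

    bypasses: set of stages that have a bypass to OR, e.g. {"X1", "X2"}.
    """
    if distance < 1:
        return False
    if distance > len(STAGES_AFTER_OR):
        return True  # producer has left the pipeline; the value is in RF
    stage = STAGES_AFTER_OR[distance - 1]  # where the producer is right now
    if stage == "WB":
        return True  # assumption: an RF write in WB is readable in the same cycle
    return stage in bypasses


def legal_distances(bypasses, limit=5):
    """All hazard-free distances from 1 to `limit` cycles."""
    return [d for d in range(1, limit + 1) if operand_available(d, bypasses)]


print(legal_distances(set()))         # no bypassing:      [3, 4, 5]
print(legal_distances({"X1", "X2"}))  # full bypassing:    [1, 2, 3, 4, 5]
print(legal_distances({"X1"}))        # only X1 bypassed:  [1, 3, 4, 5]
```

Under these assumptions the partially bypassed case allows a distance of 1 or 3 but not 2, so the set of legal distances can no longer be summarized by one minimum separation.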

Partial Bypassing - Pipeline Hazards
• Traditionally (no or full bypassing)
  • Resource hazards - Structural information
  • Data hazards - Register information + Operation latency
• Partial bypassing
  • Resource hazards - Structural information
  • Data hazards - Register information + Structural information
• Structural information is captured by Reservation Tables
• Augment Reservation Tables with register information
Our Contribution - Operation Table

Reservation Table
• A Reservation Table is a binding between an operation and processor resources
• Does not support multiple datapaths
Reservation Table for ADD
1. F
2. D
3. OR
   C1 RF
   C2 RF
4. X1
5. X2
6. WB
   C3 RF
[Figure: pipeline F, D, OR, X1, X2, WB with register file RF and C1, C2, C3]
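A reservation table of this kind can be checked for resource hazards by overlaying two tables at a given issue distance. The following is a minimal sketch, not the authors' code: each table is a list of per-cycle resource sets, the `ADD` entry follows the slide (RF itself is left out as a shared storage), and `MUL` is a hypothetical multi-cycle operation introduced only to provoke a conflict.

```python
# Reservation table for ADD from the slide: resources held in each cycle.
ADD = [{"F"}, {"D"}, {"OR", "C1", "C2"}, {"X1"}, {"X2"}, {"WB", "C3"}]

# Hypothetical operation (not from the slides) that holds X1 for two cycles.
MUL = [{"F"}, {"D"}, {"OR", "C1", "C2"}, {"X1"}, {"X1"}, {"X2"}, {"WB", "C3"}]


def resource_conflicts(first, second, distance):
    """List (cycle, resources) where `second`, issued `distance` cycles
    after `first`, needs a resource that `first` holds in the same cycle."""
    conflicts = []
    for i, needed in enumerate(second):
        t = distance + i  # absolute cycle, counted from the issue of `first`
        if 0 <= t < len(first):
            shared = first[t] & needed
            if shared:
                conflicts.append((t, shared))
    return conflicts


print(resource_conflicts(ADD, ADD, 1))     # [] : back-to-back ADDs do not collide
print(resource_conflicts(MUL, ADD, 1)[0])  # (4, {'X1'}) : ADD reaches X1 while MUL still holds it
```

The check only reports raw overlaps; a real pipeline model would stall the second operation at the first conflict rather than report the later ones.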

Enhanced Reservation Table
• A Reservation Table is a binding between an operation and processor resources
Reservation Table for ADD
1. F
2. D
3. OR
   C1 RF
   C2 RF
   C5 BRF
4. X1
   C4 BRF
5. X2
6. WB
   C3 RF
[Figure: pipeline F, D, OR, X1, X2, WB with RF, BRF, and C1 to C5]

Operation Table
• An Operation Table is a binding between an operation and processor resources and registers
• Can be used to detect both data and resource hazards
Operation Table for ADD R1 R2 R3
1. F
2. D
3. OR
   ReadOperands
     R2
       C1 RF
     R3
       C2 RF
       C5 BRF
   DestOperands
     R1
       RF
4. X1
   WriteOperands
     R1
       C4 BRF
5. X2
6. WB
   WriteOperands
     R1
       C3 RF
[Figure: pipeline F, D, OR, X1, X2, WB with RF, BRF, and C1 to C5]
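The Operation Table for ADD R1 R2 R3 can be written down as a small data structure. The sketch below is one possible encoding, not the authors' representation: the class and field names (`Access`, `OTCycle`, `paths`) are hypothetical, and it reads the slide as saying that R3 can be fetched either over C2 from RF or over C5 from BRF, while R2 can only be fetched over C1 from RF.

```python
from dataclasses import dataclass, field


@dataclass
class Access:
    register: str  # e.g. "R2"
    paths: list    # alternative (connection, storage) pairs, e.g. [("C2", "RF"), ("C5", "BRF")]


@dataclass
class OTCycle:
    unit: str                                           # pipeline unit held in this cycle
    read_operands: list = field(default_factory=list)   # list of Access
    dest_operands: list = field(default_factory=list)   # (register, storage) pairs
    write_operands: list = field(default_factory=list)  # list of Access


def add_operation_table(dst, src1, src2):
    """Operation Table for 'ADD dst src1 src2', following the slide's ADD R1 R2 R3."""
    return [
        OTCycle("F"),
        OTCycle("D"),
        OTCycle(
            "OR",
            read_operands=[
                Access(src1, [("C1", "RF")]),
                Access(src2, [("C2", "RF"), ("C5", "BRF")]),
            ],
            dest_operands=[(dst, "RF")],
        ),
        OTCycle("X1", write_operands=[Access(dst, [("C4", "BRF")])]),
        OTCycle("X2"),
        OTCycle("WB", write_operands=[Access(dst, [("C3", "RF")])]),
    ]


table = add_operation_table("R1", "R2", "R3")
for number, row in enumerate(table, start=1):
    print(number, row.unit, row.read_operands, row.dest_operands, row.write_operands)
```

Keeping the registers next to the resources in each cycle is what lets one structure answer both questions: which resources are busy, and where each register value can be read or written.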

Pipeline Hazard Detection using OT
[Figure: pipeline F, D, OR, X1, X2, WB with RF, BRF, and C1 to C5]

Resource Hazard Detection
[Figure: pipeline F, D, OR, X1, X2, WB with RF, BRF, and C1 to C5, with a resource hazard marked]

Data Hazard Detection
[Figure: pipeline F, D, OR, X1, X2, WB with RF, BRF, and C1 to C5, with a data hazard marked]
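Data hazard detection with an Operation Table might be sketched as follows, using the ADD table from the earlier slide. The timing rules are assumptions made for this illustration and are not stated in the slides: a value written to BRF is readable only in the cycle in which it is written, and a value written to RF is readable from that cycle onwards. The dictionary layout and function names are likewise hypothetical.

```python
# Operation Table for "ADD dst, src1, src2", one entry per cycle.
# reads:  operand -> storages it can be read from (per the slide, only the
#         second source operand has a path from BRF)
# writes: storages the destination is written to in that cycle
ADD_OT = [
    {"unit": "F"},
    {"unit": "D"},
    {"unit": "OR", "reads": {"src1": ["RF"], "src2": ["RF", "BRF"]}},
    {"unit": "X1", "writes": ["BRF"]},
    {"unit": "X2"},
    {"unit": "WB", "writes": ["RF"]},
]


def write_cycles(ot, issue):
    """Map each storage to the absolute cycle in which the result is written to it."""
    return {s: issue + i for i, row in enumerate(ot) for s in row.get("writes", [])}


def read_cycle(ot, issue):
    """Absolute cycle in which the operation reads its operands."""
    for i, row in enumerate(ot):
        if "reads" in row:
            return issue + i
    raise ValueError("operation table has no operand-read cycle")


def has_data_hazard(distance, operand):
    """Producer ADD issues at cycle 0; a consumer ADD issued `distance` cycles
    later uses the producer's destination register as `operand`."""
    written = write_cycles(ADD_OT, 0)
    t = read_cycle(ADD_OT, distance)
    sources = next(row["reads"] for row in ADD_OT if "reads" in row)[operand]
    for storage in sources:
        w = written.get(storage)
        if w is None:
            continue
        if storage == "RF" and t >= w:   # assumption: RF value stays readable
            return False
        if storage == "BRF" and t == w:  # assumption: BRF value lasts one cycle
            return False
    return True


print([has_data_hazard(d, "src2") for d in (1, 2, 3)])  # [False, True, False]
print([has_data_hazard(d, "src1") for d in (1, 2, 3)])  # [True, True, False]
```

Under these assumptions the same producer-consumer pair is hazard-free at distance 1 only when the dependent value is the second source operand, which is the kind of structure-dependent behaviour a single operation latency cannot express.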

Scheduling using Operation Tables
• Operation Tables provide a way to accurately detect pipeline hazards
  • Detect both data and resource hazards
• Most scheduling algorithms have two main components
  • Generate possible reorderings
  • Evaluate each to find the best one
• Most scheduling algorithms should be able to benefit from a better evaluation mechanism
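The generate-and-evaluate structure described above can be illustrated with a deliberately naive scheduler. This is a sketch under stated assumptions, not the scheduler used in the experiments: it enumerates every dependence-respecting order of a basic block (factorial cost, so only viable for tiny blocks), and `toy_count_cycles` is a hypothetical stand-in for an Operation Table based evaluator that simply forces a fixed gap between dependent operations.

```python
from itertools import permutations


def respects_dependences(order, deps):
    """True if every producer appears before its consumer."""
    position = {op: i for i, op in enumerate(order)}
    return all(position[p] < position[c] for p, c in deps)


def toy_count_cycles(order, deps, min_gap=3):
    """Stand-in evaluator: cycles needed to issue `order` when each consumer
    must issue at least `min_gap` cycles after its producer."""
    issue = {}
    t = 0
    for op in order:
        for producer, consumer in deps:
            if consumer == op:
                t = max(t, issue[producer] + min_gap)
        issue[op] = t
        t += 1
    return t


def best_schedule(ops, deps, count_cycles):
    """Generate all legal reorderings and keep the cheapest one."""
    candidates = (o for o in permutations(ops) if respects_dependences(o, deps))
    return min(candidates, key=count_cycles)


# Hypothetical basic block: b depends on a; c and d are independent.
ops = ["a", "b", "c", "d"]
deps = [("a", "b")]
best = best_schedule(ops, deps, lambda order: toy_count_cycles(order, deps))
print(toy_count_cycles(ops, deps), toy_count_cycles(best, deps))  # 6 4
```

Only the evaluator would need to change to make the search hazard-accurate: the generator is independent of how hazards are detected, which is the point the slide makes about existing scheduling algorithms.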

Experimental Setup
• Platform – Intel XScale
  • 7-stage super-pipelined RISC
• Benchmarks – MiBench
• Scheduler
  • Instruction reordering within a basic block
  • Currently a post pass in the compiler
[Figure: experimental flow. The application is compiled with gcc -O3. The resulting executable, and the executable produced from it by the OT-based scheduler, are each run on a cycle-accurate simulator, giving GCC Cycles and OT Cycles.]
Performance Improvement = (GCC Cycles - OT Cycles)/GCC Cycles
Intel XScale Microarchitecture Programmers Reference Manual, http://www.developer.intel.com
M. R. Guthaus et al., MiBench: A free commercially representative…, IEEE Workshop… 2001

Up to 20% Performance Improvement
Performance Improvement = (GCC Cycles - OT Cycles)/GCC Cycles
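The metric is the relative reduction in simulated cycles. As a small worked example, the function below evaluates it; the function name and the cycle counts are hypothetical and are not measurements from the experiments.

```python
def performance_improvement(gcc_cycles, ot_cycles):
    """(GCC Cycles - OT Cycles) / GCC Cycles, as defined on the slide."""
    return (gcc_cycles - ot_cycles) / gcc_cycles


# Hypothetical cycle counts, chosen only to illustrate the arithmetic.
improvement = performance_improvement(gcc_cycles=1000, ot_cycles=800)
print(f"{improvement:.0%}")  # 20%
```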

Summary
• Bypassing improves performance but is costly in terms of area, power, etc.
• Partial bypassing offers valuable trade-offs but poses challenges for compilation
• Operation latencies in a partially bypassed pipeline are ill-defined
• We define the Operation Table (OT) as a binding between an operation and the processor's resources and registers
• OTs can be used to accurately detect hazards even in the presence of partial bypassing
• Simple OT-based scheduling at the basic-block level results in up to 20% performance improvement

Thank You! Questions/Comments? aviral@ics.uci.edu
