PBExplore: A Framework for CIL Exploration of Partial Bypasses in Embedded Processors

Aviral Shrivastava (1), Nikil Dutt (1), Alex Nicolau (1), Eugene Earlie (2) • (1) Center for Embedded Computer Systems, University of California, Irvine, CA, USA • (2) Strategic CAD Labs, Intel, Hudson, MA, USA

Bypassing Improves Performance • Pipelining improves performance • Limited by pipeline hazards • Bypasses eliminate certain data hazards • Further improve performance • [Figure: pipeline with stages F, D, OR, X1, X2, WB and register file RF, shown without and with a bypass, executing R1 ← R2 + R3 followed by the dependent operation R4 ← R4 + R1]

Impact of Bypassing • Wiring congestion • Cycle time • Bypasses may be part of the timing-critical path • Overall chip complexity • Deeply pipelined processors • Out-of-order processors • Area and power consumption • Wide multiplexers • Bypass control logic • Bypass wires • [Figure: pipeline with stages F, D, OR, X1, X2, WB, M1, M2 and register file RF] • References: P. Ahuja et al., The performance impact of incomplete bypassing in processor pipelines, MICRO 1995; A. Abnous and N. Bagherzadeh, Pipelining and bypassing in a VLIW processor, IEEE Trans... 1995.

Problem, Solution and Problem • Problem: How do I customize bypasses? • Important for embedded systems • Solution: Keep only the most beneficial bypasses • Area, power and performance trade-off • Problems: How to compile for a processor with partial bypassing? • Requires Compiler-in-the-Loop (CIL) exploration • [Figure: pipeline with stages F, D, OR, X1, X2, WB and register file RF]

Related Work • Optimizations for partial bypassing • P. Ahuja et al. [MICRO’95] • Manual code generation • M. Buss et al. [CASES’01] • Optimize inter-cluster copy operations • K. Fan et al. [ASAP’03] • FU-allocation strategy • Only for VLIW processors • A. Shrivastava et al. [CODES’04] • A generic “pipeline hazard detection” mechanism to generate bypass-sensitive code • We present • A generic Compiler-in-the-Loop bypass exploration framework • An area-power-performance trade-off on Intel XScale, obtained by varying bypasses

PBExplore: A CIL Exploration Framework • [Figure: block diagram. Inputs: Application, Bypass Configuration. Tools: Bypass-sensitive Compiler, Synthesis Tool, Cycle-accurate Simulator, Power Simulator. Intermediate artifacts: Executable, Bypass-control Logic, Stimulus. Results collected in the Report: Execution Cycles, Energy Estimate, Area Estimate]
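
The exploration loop implied by this diagram can be sketched roughly as follows. This is only an illustration: the data flow is inferred from the component names on the slide, the four callables (`compile_app`, `simulate`, `synthesize`, `estimate_energy`) are hypothetical stand-ins for the real tools, the Stimulus artifact is folded into the energy callable, and the toy cost models at the bottom are placeholders, not measurements.

```python
from typing import Callable, Dict, Iterable, List


def explore(configs: Iterable[int],
            compile_app: Callable[[int], object],
            simulate: Callable[[object, int], int],
            synthesize: Callable[[int], float],
            estimate_energy: Callable[[int, int], float]) -> List[Dict]:
    """Evaluate every bypass configuration with the compiler in the loop."""
    report = []
    for config in configs:
        executable = compile_app(config)           # bypass-sensitive compiler
        cycles = simulate(executable, config)      # cycle-accurate simulator
        area = synthesize(config)                  # synthesis of bypass control logic
        energy = estimate_energy(config, cycles)   # power simulator
        report.append({"config": config, "cycles": cycles,
                       "area": area, "energy": energy})
    return report


def bypass_count(config: int) -> int:
    """Number of bypasses present in a bit-encoded configuration."""
    return bin(config).count("1")


# Toy placeholder models (assumptions for illustration only).
toy_report = explore(
    configs=range(4),
    compile_app=lambda config: ("executable", config),
    simulate=lambda exe, config: 1000 - 10 * bypass_count(config),
    synthesize=lambda config: float(bypass_count(config)),
    estimate_energy=lambda config, cycles: float(bypass_count(config) * cycles),
)
for row in toy_report:
    print(row)
```

The point of the structure is that the compiler is re-run for every configuration, so the cycle count reflects code scheduled for that specific set of bypasses.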

Bypass Sensitive Scheduling • Bypasses transfer data between dependent operations • Missing bypasses cause pipeline hazards • A bypass-sensitive compiler should be able to detect and avoid pipeline hazards • [Figure: two panels, No Hazard and Hazard, showing R1 ← R2 + R3 followed by R4 ← R4 + R1 in a pipeline with stages F, D, OR, X1, X2, WB and register file RF]

Operation Table • Operation Table for ADD R1 R2 R3 (details are in the paper):
1. F
2. D
3. OR: ReadOperands R2 via C1 from RF, R3 via C2 from RF or C5 from BRF; DestOperands R1 in RF
4. X1: WriteOperands R1 via C4 to BRF
5. X2
6. XWB: WriteOperands R1 via C3 to RF
• An Operation Table (OT) is a binding between an operation and the processor resources and registers • Can detect resource hazards, because OTs model processor resources • Can detect data hazards, because OTs model processor registers • [Figure: pipeline with stages F, D, OR, X1, X2, WB, storages RF and BRF, and connections C1 to C5]
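
As a rough illustration of how an Operation Table can drive hazard detection, the sketch below encodes the ADD table from this slide and checks a candidate issue cycle against the pipeline state. It is not the authors' implementation. The class and function names are hypothetical, only pipeline units are tracked as resources (connections C1 to C5 are kept as labels), and it assumes a value written in a cycle can be read in that same cycle.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    unit: str                                   # pipeline unit occupied in this cycle
    reads: list = field(default_factory=list)   # [(reg, [(connection, storage), ...])]
    writes: list = field(default_factory=list)  # [(reg, connection, storage)]


def add_table(dst, src1, src2, bypass=True):
    """Operation table for ADD dst src1 src2; bypass=False drops the C5/BRF path."""
    src2_paths = [("C2", "RF")] + ([("C5", "BRF")] if bypass else [])
    return [
        Step("F"),
        Step("D"),
        Step("OR", reads=[(src1, [("C1", "RF")]), (src2, src2_paths)]),
        Step("X1", writes=[(dst, "C4", "BRF")]),
        Step("X2"),
        Step("XWB", writes=[(dst, "C3", "RF")]),
    ]


class PipelineState:
    def __init__(self):
        self.busy = {}   # cycle -> set of units already reserved
        self.ready = {}  # reg -> {storage: first cycle the newest value is readable}

    def hazards(self, table, start):
        """List the resource and data hazards of issuing `table` at cycle `start`."""
        found = []
        for offset, step in enumerate(table):
            cycle = start + offset
            if step.unit in self.busy.get(cycle, set()):
                found.append(("resource", step.unit, cycle))
            for reg, paths in step.reads:
                # Registers with no recorded writer are assumed valid in RF from cycle 0.
                ready = self.ready.get(reg, {"RF": 0})
                if not any(ready.get(storage, float("inf")) <= cycle
                           for _, storage in paths):
                    found.append(("data", reg, cycle))
        return found

    def issue(self, table, start):
        """Reserve the units and record when each written value becomes readable."""
        written = {}
        for offset, step in enumerate(table):
            cycle = start + offset
            self.busy.setdefault(cycle, set()).add(step.unit)
            for reg, _, storage in step.writes:
                written.setdefault(reg, {})[storage] = cycle
        self.ready.update(written)

    def earliest_issue(self, table, start, limit=64):
        for cycle in range(start, start + limit):
            if not self.hazards(table, cycle):
                return cycle
        raise RuntimeError("no hazard-free issue cycle within the search limit")


def demo(bypass):
    state = PipelineState()
    state.issue(add_table("R1", "R2", "R3"), 0)
    return state.earliest_issue(add_table("R4", "R4", "R1", bypass=bypass), 1)


print(demo(True), demo(False))
```

Under these assumptions the dependent ADD can issue in the very next cycle when the BRF path exists, and must wait two extra cycles for the register-file write when it does not. A bypass-sensitive scheduler would use this difference to reorder operations.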

Experiments • Experiments I: Need for a CIL framework • Need for bypass-sensitive Compiler-in-the-Loop exploration • Traditional exploration versus bypass-sensitive Compiler-in-the-Loop exploration • Experiments II: CIL exploration • Use of bypass-sensitive Compiler-in-the-Loop exploration • Perform power-performance-area trade-offs • Identify alternate interesting design points

Experiments I - Framework • Traditional exploration versus bypass-sensitive Compiler-in-the-Loop exploration • [Figure: two flows, each starting from the Application, with the Bypass Configuration as input. Traditional exploration: gcc -O3, Executable, Cycle Accurate Simulator, Traditional Cycles. Bypass-sensitive Compiler-in-the-Loop exploration: OT-based Compiler, Executable, Cycle Accurate Simulator, CIL Cycles]

Experiments I - Setup • 7 pipeline stages can bypass a result • We vary which pipeline stages bypass a result • 2^7 = 128 bypass configurations • Encode the bypass configuration as <DWB D2 MWB M2 XWB X2 X1> • Configuration 28 = <0011100> • Bypass paths from MWB, M2 and XWB are present • [Figure: XScale pipeline with stages F1, F2, ID, RF, X1, X2, XWB, M1, M2, MWB, D1, D2, DWB]
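
The encoding on this slide can be made concrete with a short sketch. The function names `decode_config` and `encode_config` are hypothetical. The bit order follows the slide, with DWB as the most significant bit and X1 as the least significant.

```python
# Bit order from the slide: <DWB D2 MWB M2 XWB X2 X1>, most significant bit first.
STAGES = ["DWB", "D2", "MWB", "M2", "XWB", "X2", "X1"]


def decode_config(config):
    """Return the bypass sources present in a configuration number (0..127)."""
    if not 0 <= config < 2 ** len(STAGES):
        raise ValueError("configuration out of range")
    bits = format(config, "0{}b".format(len(STAGES)))
    return [stage for stage, bit in zip(STAGES, bits) if bit == "1"]


def encode_config(present):
    """Inverse of decode_config: build the configuration number from stage names."""
    value = 0
    for stage in STAGES:
        value = (value << 1) | (1 if stage in present else 0)
    return value


print(decode_config(28))                      # ['MWB', 'M2', 'XWB']
print(encode_config({"MWB", "M2", "XWB"}))    # 28
```

Iterating `range(128)` through `decode_config` enumerates all 2^7 configurations explored in this experiment.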

Bypass Explorations on XScale • [Figure: bitcount execution cycles (y-axis, 850000 to 1250000) versus bypass source configuration (x-axis, 0 to 128), one curve for Traditional and one for CIL exploration] • The CIL compiler can effectively exploit the bypass configuration • Significant performance difference

X-bypass explorations in XScale • [Figure: bitcount execution cycles (y-axis, 850000 to 1200000) versus X-bypass configuration (x-axis, subsets of X1, X2 and XWB, from none to all three), one curve for Traditional and one for CIL exploration; inset of the XScale pipeline] • Difference in trends

M-bypass explorations in XScale • [Figure: plot with inset of the XScale pipeline; axis values not recoverable] • Difference in trends

D-bypass exploration in XScale • [Figure: bitcount execution cycles (y-axis, 860000 to 980000) versus D-bypass configuration (x-axis, subsets of D2 and DWB, from none to both), one curve for Traditional and one for CIL exploration; inset of the XScale pipeline] • Difference in trends

Experiments II - Setup • Power-Performance-Area trade-offs • Scheduler • Exhaustive instruction reordering within basic blocks • Synthesis tool • Synopsys Design Compiler 2001.10 • 0.8µ library lsi_10k • Power estimation • Synopsys power_estimate • [Figure: PBExplore framework block diagram with Application, Bypass Configuration, Bypass-sensitive Compiler, Synthesis Tool, Bypass Control Logic, Executable, Cycle-accurate Simulator, Power Simulator, Report] • References: Intel XScale Microarchitecture Programmers Reference Manual, http://www.developer.intel.com; M. R. Guthaus et al., MiBench: A free commercially representative…, IEEE Workshop… 2001; Synopsys Design Compiler, 2001, http://www.synopsys.com/products/logic/design compiler.html

Performance-Energy-Area Trade-off • [Figure: trade-off plots with Point 1 and Point 2 marked] • Design Point 1 • No bypass from MWB and XWB to first operand • 18% less area and 14% less energy consumption of bypass control logic • 2% performance loss • Design Point 2 • Only D2 and X2 bypass to first operand • 25% less area and 16% less energy consumption of bypass control logic • 6% performance loss
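
One way to see why both points count as interesting is a Pareto check over the three reported metrics. The sketch below is illustrative only: the `DesignPoint` type and helper names are hypothetical, and the reference point with no savings and no loss is assumed here to be the configuration against which the slide's percentages were measured.

```python
from typing import List, NamedTuple


class DesignPoint(NamedTuple):
    name: str
    area_saving: float    # % less area of bypass control logic
    energy_saving: float  # % less energy of bypass control logic
    perf_loss: float      # % performance loss


def dominates(a: DesignPoint, b: DesignPoint) -> bool:
    """True if a is at least as good as b on every metric and better on one."""
    no_worse = (a.area_saving >= b.area_saving
                and a.energy_saving >= b.energy_saving
                and a.perf_loss <= b.perf_loss)
    better = (a.area_saving > b.area_saving
              or a.energy_saving > b.energy_saving
              or a.perf_loss < b.perf_loss)
    return no_worse and better


def pareto_front(points: List[DesignPoint]) -> List[DesignPoint]:
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


points = [
    DesignPoint("Reference (assumed baseline)", 0.0, 0.0, 0.0),
    DesignPoint("Design Point 1", 18.0, 14.0, 2.0),
    DesignPoint("Design Point 2", 25.0, 16.0, 6.0),
]
for point in pareto_front(points):
    print(point.name)
```

None of the three points dominates another, so all of them are printed: each buys additional area and energy savings at the price of additional performance loss.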

Summary • Bypassing improves performance but is costly in terms of area and power • Partial bypassing presents valuable trade-offs but poses challenges in compilation • We presented PBExplore, a Compiler-in-the-Loop exploration framework for partial bypasses • PBExplore uses Operation Tables to generate bypass-sensitive code • PBExplore automatically synthesizes bypass control logic to explore power and area trade-offs • PBExplore is able to discover interesting design points that trade off performance for power and area of bypass control logic

Thank You

Pipeline Hazard Detection using OT • [Figure: pipeline with stages F, D, OR, X1, X2, WB, storages RF and BRF, and connections C1 to C5]

Resource Hazard Detection • [Figure: same pipeline, with a resource hazard marked]

Data Hazard Detection • [Figure: same pipeline, with a data hazard marked]
