What Is a Compiler When The Architecture Is Not Hard(ware)?
Krishna V Palem
This work was supported in part by awards from Hewlett-Packard Corporation, IBM, Panasonic AVC, and by DARPA under Contract No. DABT63-96-C-0049 and Grant No. 25-74100-F0944. Portions of this presentation were given by the speaker as a keynote at ACM LCTES 2001 and as an invited talk at EMSOFT 2001.
Embedded Computing: Why? What? How?
The Nature of Embedded Systems • Visible computing: the view is of the end application • The computing element itself is hidden
Favorable Trends • Supported by Moore’s (second) law: computing power doubles every eighteen months • Corollary: cost per unit of computing halves every eighteen months • From hundreds of millions to billions of units • Projected by market research firms (VDC) to be a 50 billion+ space over the next five years • High volume, relatively low per-unit $ margin
Embedded Systems Desiderata • Low power: high battery life • Small size or footprint • Real-time constraints • Performance comparable to or surpassing leading-edge COTS technology • Rapid time-to-market
Timing Example: Video-On-Demand • Figure contrasts predictable vs. unpredictable timing behavior
Current Art • Vertical application domains (e.g. industrial automation, telecom): select computational kernels migrate to ASICs • Meet desiderata while overcoming NRE cost hurdles through volume • High migration inertia across applications • Long time to market
Subtle but Sure Hurdles • For Moore’s corollary to hold, non-recurring engineering (NRE) cost must be amortized over high volume - Else per-unit costs are prohibitively high - Implies “uniform designs” over large workload classes (e.g. numerical, integer, signal processing) • Demands of embedded systems - “Non-uniform” or application-specific designs - Per-application volume might not be high - High NRE costs imply an infeasible cost per unit - Time-to-market pressure
The Embedded Systems Challenge • Multiple application domains (3D graphics, industrial automation, Medtronics, e-textiles, telecom) • Rapidly changing application kernels in moderate volume • Custom computing solutions must meet constraints while - Sustaining Moore’s corollary - Keeping NRE costs down
Responding Via Automation • Multiple application domains (3D graphics, industrial automation, Medtronics, e-textiles, telecom) with rapidly changing application kernels in low volume • Automatic tools produce application-specific designs meeting power, timing, and size constraints
Three Active Approaches • Custom microprocessors • Architecture exploration and synthesis • Architecture assembly for reconfigurable computing
Custom Processor Implementation (Time Intensive) • Flow: application analysis → proprietary ISA and architecture specification → custom processor implementation with proprietary tools → fabricate processor; applications written in a language with custom extensions are compiled against the proprietary ISA into binaries • High-performance implementation, customized in silicon for a particular application domain • O(months) of design time • Once designed, programmable like standard processors • The Tensilica and HP-STMicroelectronics approach
Architecture Exploration and Synthesis: The PICO Vision • PICO ("Program In, Chip Out"): automatic synthesis of application-specific parallel/VLIW ULSI microprocessors and their compilers for embedded computing • “Computer design for the masses” • “A custom system architecture in 1 week, tape-out in 4 weeks” (B. Ramakrishna Rau, “The Era of Embedded Computing”, invited talk, CASES 2000)
Custom Microprocessors • Application(s) define the workload • Analyze the workload to define an ISA extension (e.g. IA-64+) and the corresponding compiler optimizations • Design and implement an optimizing compiler and a microprocessor (e.g. Itanium+)
Application Specific Design • A single application is analyzed using program analysis and extended EPIC compiler technology • Explore and synthesize implementations from a library of possible implementations, bypassing a fixed ISA • Result: a VLIW core plus a non-programmable extension; the application-specific processor runs a single application
The Compiler Optimization Trajectory • The frontend and optimizer are followed by four steps before execution: determine dependences, determine independences, bind operations to function units, bind transports to busses. Architectures differ in where the compiler stops and the hardware takes over: - Superscalar: hardware performs all four steps - Dataflow: compiler determines dependences - Independence architectures: compiler also determines independences - VLIW: compiler also binds operations to function units - TTA: compiler also binds transports to busses; hardware only executes • B. Ramakrishna Rau and Joseph A. Fisher. Instruction-level parallel processing: History, overview, and perspective. The Journal of Supercomputing, 7(1-2):9-50, May 1993.
What Is the Compiler’s Target “ISA”? • The target is a range of architectures and their building blocks • The compiler reaches into a constrained space of silicon and explores architectural implementations • O(days to weeks) of design time • Exploration is sensitive to application-specific hardware modules • Fixed-function silicon is the result • Verification NRE costs are still there • One approach to overcoming time to market
Choices of Silicon • High-level design/synthesis produces an (EDIF) netlist • From the netlist: either a fixed silicon implementation (standard cell design etc.) or an emulated, i.e. reconfigurable, target
FPGAs As an Alternative Choice for Customization • Frequent (re)configuration and hence frequent recustomization • Fabrication process is steadily improving • Gate densities are going up • Performance levels are acceptable • Amortize large NRE investments by using COTS platform
Major Motivations • Poor compilation times: place and route is inherently complex • Lack of correspondence between a standard IR and the final configurations • Would like compile times on the order of seconds to minutes • Provide customization of the datapath by automatic analysis and optimization in software
Adaptive EPIC • Adaptive Explicitly Parallel Instruction Computing. Surendranath Talla, Department of Computer Science, New York University. PhD thesis, May 2001. Janet Fabri award for outstanding dissertation. • Adaptive Explicitly Parallel Instruction Computing. Krishna V. Palem and Surendranath Talla, Courant Institute of Mathematical Sciences; Patrick W. Devaney, Panasonic AVC American Laboratories Inc. Proceedings of the 4th Australasian Computer Architecture Conference, Auckland, NZ, January 1999.
Compiler-Processor Interface • The ISA is the fixed contract: instruction formats and semantics (e.g. ADD, LD), registers, exceptions, ... • The compiler translates the source program into an executable against this fixed ISA
Redefining the Processor-Compiler Interface • The compiler still translates the source program into an executable, but the ISA now carries compiler-defined instructions (e.g. mysub, myxor) with their formats and semantics • Let the compiler determine the instruction sets (and their realization on chip)
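As a minimal sketch of what might cross this redefined interface, the description below models a compiler-defined instruction; the field names (mnemonic, operand widths, latency, configuration handle) are illustrative assumptions, not the actual AEPIC interchange format.

```python
# Sketch of a compiler-defined instruction description crossing the
# redefined processor-compiler interface. Field names are illustrative
# assumptions, not the actual AEPIC format.

from dataclasses import dataclass

@dataclass
class CustomInstruction:
    mnemonic: str            # e.g. "myxor", chosen by the compiler
    operand_widths: tuple    # bit widths of the source operands
    result_width: int
    latency_cycles: int      # estimated latency of the synthesized CFU
    config_id: str           # handle naming the configuration bitstream

myxor = CustomInstruction(
    mnemonic="myxor",
    operand_widths=(32, 32),
    result_width=32,
    latency_cycles=1,
    config_id="cfg_myxor_v1",
)
```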
EPIC Execution Model • Fixed functional units exploit ILP to produce the record of execution
Adaptive EPIC Execution Model • Configured functional units exploit ILP (phase ILP-1), the datapath is reconfigured, and execution continues with differently configured functional units (phase ILP-2), all contributing to the record of execution
Adaptive EPIC: What Can Be Efficiently Compiled for Today? • Figure plots architectures by parallelism against approximate instruction packet size (0, 4, 16, 32, 64, 128-512, 1K-10K, 100K-1M, >1M): early x86 and simple pipelined/embedded designs at the small end; superscalar, SMT, superspeculative, TRACE (multiscalar), dataflow, vector, EPIC/VLIW, and TTA in between; RAW, RaPiD, GARP, DPGA, CVH, multichip, FPGA, and simple/complex ASICs toward the large end
Adaptive EPIC • AEPIC = EPIC processing + datapath reconfiguration • Reconfiguration through three basic operations (see the sketch below): - add a functional unit to the datapath - remove a functional unit from the datapath - execute an operation on a resident functional unit
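A minimal sketch of how these three operations might appear to the compiler, modeled in Python; the operation names and the capacity model are assumptions for exposition, not the actual AEPIC encoding.

```python
# Minimal model of the three AEPIC reconfiguration operations as they
# might be exposed to the compiler. Names and the capacity model are
# illustrative assumptions, not the actual AEPIC encoding.

class Datapath:
    def __init__(self, capacity_cells):
        self.capacity = capacity_cells      # free reconfigurable cells
        self.resident = {}                  # cfu name -> cells occupied

    def config_load(self, cfu, cells):
        """Add a configured functional unit (CFU) to the datapath."""
        if cells > self.capacity:
            raise RuntimeError("not enough reconfigurable logic free")
        self.resident[cfu] = cells
        self.capacity -= cells

    def config_free(self, cfu):
        """Remove a resident CFU, releasing its cells."""
        self.capacity += self.resident.pop(cfu)

    def cfu_exec(self, cfu, *operands):
        """Execute an operation on a resident CFU (semantics supplied elsewhere)."""
        assert cfu in self.resident, "operation issued to a non-resident CFU"
        # ... dispatch to the CFU's implementation ...

dp = Datapath(capacity_cells=1024)
dp.config_load("myxor", cells=96)   # add a functional unit
dp.cfu_exec("myxor", "r1", "r2")    # execute on the resident unit
dp.config_free("myxor")             # remove it again
```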
AEPIC Machine Organization • I-Cache, fetch/decode (F/D), GPR/FPR/… register files (the EPIC core) • Multi-context reconfigurable logic array (MRLA) with an array register file • Configuration register file (CRF) • Configuration cache (C-cache) with level-1 and level-2 levels
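This multi-level organization suggests that the cost of activating a configuration depends on where it currently resides. A small sketch of a compiler-side cost model of that hierarchy; the latency figures are placeholders, not measured AEPIC numbers.

```python
# Sketch of a compiler-side model of the configuration storage hierarchy
# (MRLA contexts, L1/L2 configuration cache, memory). Latencies are
# placeholder values for illustration only.

LOAD_CYCLES = {
    "mrla": 1,           # already resident in a reconfigurable-array context
    "l1_ccache": 10,
    "l2_ccache": 100,
    "memory": 10_000,
}

def load_cost(config, residency):
    """Estimated cycles to make `config` active, given where it resides."""
    return LOAD_CYCLES[residency.get(config, "memory")]

# e.g. load_cost("cfg_myxor_v1", {"cfg_myxor_v1": "l1_ccache"})  -> 10
```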
Key Tasks For the AEPIC Compiler • Partitioning: identifying code sections that convert to custom instructions • Operation synthesis: efficient CFU implementations for identified partitions • Instruction selection: choosing among multiple CFU choices for code partitions • Resource allocation: allocation of C-cache and MRLA resources for CFUs • Scheduling: when and where to schedule adaptive component instructions
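As an illustration of the instruction-selection task, the sketch below greedily picks one CFU implementation per partition by estimated cycles saved per reconfigurable cell within an area budget; the data structures and cost model are assumptions for exposition, not the thesis's algorithm.

```python
# Illustrative instruction selection for AEPIC: each partitioned region
# has several candidate CFU implementations; pick at most one per region,
# greedily by cycles saved per cell, within an area budget.
# The cost model is an assumption for exposition only.

def select_cfus(partitions, area_budget):
    """partitions: {region: [(cfu_name, cells, cycles_saved), ...]}"""
    chosen, used = {}, 0
    candidates = []
    for region, impls in partitions.items():
        for name, cells, saved in impls:
            candidates.append((saved / cells, region, name, cells))
    for _, region, name, cells in sorted(candidates, reverse=True):
        if region not in chosen and used + cells <= area_budget:
            chosen[region] = name
            used += cells
    return chosen   # regions left unchosen fall back to the EPIC core

# e.g. select_cfus({"fir": [("fir8", 120, 400), ("fir16", 200, 520)]},
#                  area_budget=256)   -> {"fir": "fir8"}
```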
Compiler Modules • Source program → front-end processing → high-level optimizations (EPIC core) → partitioning → mapping (adaptive) → pre-pass scheduling → resource allocation → post-pass scheduling → code generation → simulation • The configuration library and machine description are additional inputs; simulation produces performance statistics
Resource Allocation for Configurations • The reconfigurable logic can accommodate only two configurations simultaneously • Figure shows a record of execution on the processor in which configurations have overlapping live-range regions competing for the reconfigurable logic
Managing the Reconfigurable Resource (contd.) • A simpler version is related to the register allocation problem • New parameters: - No need to save to memory on spill: configurations are immutable - Load costs differ across configurations - The C-cache and MRLA form a multi-level register file • Adapted register allocation techniques to configuration allocation - Non-uniform sizes of configurations lead to graph multi-coloring - Adapted Chaitin’s coloring-based techniques (see the sketch below)
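A minimal sketch of the adapted Chaitin-style approach: build an interference graph over configurations with overlapping live ranges and multi-color it, since a configuration may need several MRLA contexts because sizes are non-uniform. The context model and ordering heuristic are assumptions for illustration, not the thesis's exact algorithm.

```python
# Sketch of configuration allocation as graph multi-coloring, in the
# spirit of Chaitin-style register allocation. A configuration needing
# k MRLA contexts must receive k colors disjoint from every neighbor's.

def allocate(configs, interferes, num_contexts):
    """configs: {name: contexts_needed}; interferes: {name: set(neighbors)}"""
    assignment, spilled = {}, []
    # color larger / more constrained configurations first
    order = sorted(configs, key=lambda c: (len(interferes[c]), configs[c]),
                   reverse=True)
    for c in order:
        taken = set()
        for n in interferes[c]:
            taken |= set(assignment.get(n, ()))
        free = [ctx for ctx in range(num_contexts) if ctx not in taken]
        if len(free) >= configs[c]:
            assignment[c] = tuple(free[:configs[c]])
        else:
            # No store to memory is needed on a "spill": configurations are
            # immutable, so spilling just means reloading from the C-cache.
            spilled.append(c)
    return assignment, spilled
```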
The Problem With Long Reconfiguration Times • Figure contrasts configuration load time with configuration execution time on a timeline
Speculating Configuration Loads • Figure: speculated configuration loads for f1 and f2 placed ahead of the configuration execution time • Mask reconfiguration times! • Need to know when and where to speculate • If f1 >> f2, do not speculate into the empty load slot (shown in red in the figure)
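One plausible reading of the f1 >> f2 rule is as a profile-frequency test: speculate only the load whose expected masked reconfiguration time justifies occupying the single empty slot. The sketch below encodes that reading; the inputs and threshold are assumptions, not the thesis's actual heuristic.

```python
# Sketch of choosing which configuration load (if any) to speculate into
# a single empty load slot before a branch. With successor frequencies
# f1 >> f2, the f2 configuration loses and is not loaded speculatively.
# All inputs are illustrative assumptions.

def expected_cycles_hidden(path_freq, load_cycles, slack_cycles):
    """Expected reconfiguration cycles masked if we speculate this load."""
    return path_freq * min(load_cycles, slack_cycles)

def pick_speculative_load(candidates, slack_cycles, min_benefit=1.0):
    """candidates: [(config_name, path_freq, load_cycles), ...] competing
    for one empty slot; returns the config to load early, or None."""
    best, best_gain = None, min_benefit
    for name, freq, load in candidates:
        gain = expected_cycles_hidden(freq, load, slack_cycles)
        if gain > best_gain:
            best, best_gain = name, gain
    return best

# e.g. with f1 = 0.95, f2 = 0.05 and equal load times, only f1's
# configuration is worth the slot; f2's is not speculated.
pick_speculative_load([("cfg_f1", 0.95, 200), ("cfg_f2", 0.05, 200)],
                      slack_cycles=150)   # -> "cfg_f1"
```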
Sample Compiler Topics • Configuration cache management • Power/clock rate vs. Performance tradeoff • Bit-width optimizations
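For the last of these topics, here is a brief sketch of the kind of bit-width analysis such a compiler might perform: forward value-range propagation to bound the bits each temporary needs, so synthesized CFU datapaths can be narrowed. The three-address IR shape is hypothetical.

```python
# Sketch of a simple bit-width analysis: propagate value ranges forward
# through straight-line IR and report the width each value needs, so
# synthesized CFU datapaths can be narrower than a default 32 bits.
# The three-address IR used here is hypothetical.

def bits_for(lo, hi):
    """Bits needed to represent every integer in [lo, hi]."""
    if lo >= 0:
        return max(hi.bit_length(), 1)                          # unsigned
    return max(hi.bit_length(), (-lo - 1).bit_length()) + 1     # two's complement

def widths(code, input_ranges):
    """code: list of (dst, op, src1, src2); input_ranges maps names to (lo, hi)."""
    r = dict(input_ranges)
    for dst, op, a, b in code:
        (alo, ahi), (blo, bhi) = r[a], r[b]
        if op == "add":
            r[dst] = (alo + blo, ahi + bhi)
        elif op == "and":   # non-negative operands only, for simplicity
            r[dst] = (0, min(ahi, bhi))
    return {v: bits_for(lo, hi) for v, (lo, hi) in r.items()}

# widths([("t0", "add", "x", "y")], {"x": (0, 255), "y": (0, 255)})
# -> x and y need 8 bits; t0 needs 9
```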
More Generally: Architecture Assembly • An ISA view: synthesis and other hardware design happen off-line • Much closer to compiler optimizations, which implies faster compile times • Applications are compiled against prebuilt implementations of datapath, storage, and interconnect building blocks, built off-line by synthesis, place and route • The “compiler” selects, assembles, and optimizes the program over a dynamically variable ISA and architecture implementation • Also applicable to yield fixed implementations in silicon