What Is a Compiler When The Architecture Is Not Hard(ware)?


Presentation Transcript


  1. What Is a Compiler When The Architecture Is Not Hard(ware)? Krishna V Palem. This work was supported in part by awards from Hewlett-Packard Corporation, IBM, Panasonic AVC, and by DARPA under Contract No. DABT63-96-C-0049 and Grant No. 25-74100-F0944. Portions of this presentation were given by the speaker as a keynote at ACM LCTES 2001 and as an invited talk at EMSOFT 2001.

  2. Embedded Computing: Why? What? How?

  3. The Nature of Embedded Systems • The view is of the end application • Visible computing (hidden computing element)

  4. Favorable Trends • Supported by Moore’s (second) law: computing power doubles every eighteen months • Corollary: cost per unit of computing halves every eighteen months • From hundreds of millions to billions of units • Projected by market research firms (VDC) to be a 50 billion+ space over the next five years • High volume, relatively low per-unit dollar margin

  5. Embedded Systems Desiderata • Low power, for long battery life • Small size or footprint • Real-time constraints • Performance comparable to or surpassing leading-edge COTS technology • Rapid time-to-market

  6. Timing Example: Video-on-Demand • Predictable timing behavior vs. unpredictable timing behavior

  7. Current Art • Vertical application domains (e.g., industrial automation, telecom): select computational kernels and move them to ASICs • Meet the desiderata while overcoming NRE cost hurdles through volume • High migration inertia across applications • Long time to market

  8. Subtle but Sure Hurdles • For Moore’s corollary to hold, non-recurring engineering (NRE) cost must be amortized over high volume; otherwise per-unit costs are prohibitively high • This implies “uniform designs” over large workload classes (e.g., numerical, integer, signal processing) • Embedded systems instead demand “non-uniform” or application-specific designs • Per-application volume might not be high, so high NRE costs imply an infeasible cost per unit (see the sketch below) • Time-to-market pressure
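A minimal sketch of the amortization arithmetic behind this hurdle; the NRE, marginal cost, and volume figures are purely hypothetical, not numbers from the talk:

```python
# Hypothetical illustration of NRE amortization (all numbers are made up).
def unit_cost(nre_dollars, marginal_cost, volume):
    """Per-unit cost = amortized NRE share + marginal manufacturing cost."""
    return nre_dollars / volume + marginal_cost

NRE = 10_000_000      # one-time design/verification cost (assumed)
MARGINAL = 5.0        # per-chip manufacturing cost (assumed)

# A uniform design amortized over a large workload class stays cheap ...
print(unit_cost(NRE, MARGINAL, volume=10_000_000))   # ~6.0 per unit
# ... while an application-specific design at modest volume is infeasible.
print(unit_cost(NRE, MARGINAL, volume=50_000))       # ~205.0 per unit
```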

  9. The Embedded Systems Challenge • Multiple application domains (3D graphics, industrial automation, Medtronics, e-textiles, telecom) • Rapidly changing application kernels in moderate volume • Need custom computing solutions that meet the constraints • Sustain Moore’s corollary • Keep NRE costs down

  10. Responding Via Automation • Multiple application domains (3D graphics, industrial automation, Medtronics, e-textiles, telecom) • Rapidly changing application kernels in low volume • Automatic tools produce an application-specific design under power, timing, and size constraints

  11. Three Active Approaches • Custom microprocessors • Architecture exploration and synthesis • Architecture assembly for reconfigurable computing

  12. Time-Intensive Custom Processor Implementation • High-performance implementation, customized in silicon for a particular application domain • O(months) of design time • Once designed, programmable like standard processors • Flow: application analysis → proprietary ISA and architecture specification → custom processor implementation with proprietary tools → fabricate processor; application → compiler (language with custom extensions, proprietary ISA) → binary • The Tensilica and HP-STMicroelectronics approach

  13. Architecture Exploration and Synthesis: The PICO Vision • Program In, Chip Out: “computer design for the masses” • “A custom system architecture in 1 week, tape-out in 4 weeks” (B. Ramakrishna Rau, “The Era of Embedded Computing”, invited talk, CASES 2000) • Automatic synthesis of application-specific parallel/VLIW ULSI microprocessors and their compilers for embedded computing

  14. Custom Microprocessors • Application(s) define the workload • Analyze → define ISA extension (e.g., IA-64+) → define compiler optimizations → optimizing compiler • Design → implementation → microprocessor (e.g., Itanium+)

  15. Application Specific Design • A single application is analyzed • Program analysis with extended EPIC compiler technology • Explore and synthesize implementations from a library of possible implementations (bypassing a fixed ISA) • Result: VLIW core + non-programmable extension; the application-specific processor runs a single application

  16. The Compiler Optimization Trajectory • Where the compiler stops and the hardware takes over, by architecture style: • Superscalar: compiler performs the front end and optimization; hardware determines dependences, determines independences, binds operations to function units, binds transports to busses, and executes • Dataflow: compiler also determines dependences • Independence architectures: compiler also determines independences • VLIW: compiler additionally binds operations to function units • TTA: compiler additionally binds transports to busses; hardware executes • B. Ramakrishna Rau and Joseph A. Fisher. Instruction-level parallel processing: History, overview, and perspective. The Journal of Supercomputing, 7(1-2):9-50, May 1993.

  17. What Is the Compiler’s Target “ISA”? • The target is a range of architectures and their building blocks • The compiler reaches into a constrained space of silicon and explores architectural implementations • O(days to weeks) of design time • Exploration is sensitive to application-specific hardware modules • Fixed-function silicon is the result • Verification NRE costs are still there • One approach to overcoming time to market • (Compiler/hardware boundary figure from the previous slide repeated; Rau and Fisher, The Journal of Supercomputing, 7(1-2):9-50, May 1993)

  18. Choices of Silicon • High-level design/synthesis produces an (EDIF) netlist • The netlist can target a fixed silicon implementation (standard cell design, etc.) or an emulated, i.e. reconfigurable, target

  19. Reconfigurable Computing

  20. FPGAs As an Alternative Choice for Customization • Frequent (re)configuration and hence frequent recustomization • Fabrication process is steadily improving • Gate densities are going up • Performance levels are acceptable • Amortize large NRE investments by using a COTS platform

  21. Major Motivations • Poor compilation times • Lack of correspondence between a standard IR and the final configurations • Place and route is inherently complex • Would like compile times on the order of seconds to minutes • Provide customization of the datapath by automatic analysis and optimization in software

  22. Adaptive EPIC • Adaptive Explicitly Parallel Instruction Computing. Surendranath Talla, Department of Computer Science, New York University. PhD thesis, May 2001. Janet Fabri award for outstanding dissertation. • Adaptive Explicitly Parallel Instruction Computing. Krishna V. Palem and Surendranath Talla, Courant Institute of Mathematical Sciences; Patrick W. Devaney, Panasonic AVC American Laboratories Inc. Proceedings of the 4th Australasian Computer Architecture Conference, Auckland, NZ, January 1999.

  23. Compiler-Processor Interface • Conventionally, the ISA is fixed: instruction formats and semantics (ADD, LD, ...), registers, exceptions, etc. • The compiler translates the source program into an executable against this fixed interface

  24. Redefining the Processor-Compiler Interface • Let the compiler determine the instruction set (and its realization on chip) • The interface now carries compiler-defined instructions, e.g. the format and semantics of custom operations such as mysub and myxor • The compiler still maps the source program to an executable (a sketch of such a compiler-visible instruction description follows)
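A minimal sketch of what a compiler-determined instruction description might look like; the field names and the mysub example below are illustrative assumptions, not the actual AEPIC interface:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CustomInstruction:
    """Hypothetical compiler-defined instruction: the compiler, not the silicon,
    fixes its format and semantics and supplies a hardware realization."""
    mnemonic: str                         # e.g. "mysub"
    operand_format: str                   # register/immediate layout chosen by the compiler
    semantics: Callable[[int, int], int]  # reference behavior for the compiler/simulator
    configuration: bytes                  # bitstream (or netlist handle) realizing it on chip

# Illustrative instance: a fused subtract-and-clamp the compiler chose to synthesize.
mysub = CustomInstruction(
    mnemonic="mysub",
    operand_format="rd, rs1, rs2",
    semantics=lambda a, b: max(a - b, 0),
    configuration=b"<bitstream elided>",
)
print(mysub.semantics(3, 5))   # 0
```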

  25. EPIC execution model • Fixed functional units • ILP over these units produces the record of execution

  26. Adaptive EPIC execution model • Configured functional units support ILP-1 in one phase of the record of execution • Reconfiguring the datapath yields a different set of functional units supporting ILP-2

  27. Adaptive EPIC • What can be efficiently compiled for today? • [Figure: architectures ordered by approximate instruction packet size, from 0-4 up to >1M, ranging across simple pipelined/embedded processors, early x86, superscalar, dataflow, vector, EPIC/VLIW, TTA, SMT, superspeculative, TRACE (Multiscalar), RAW, RaPiD, GARP, DPGA, CVH, multichip systems, FPGAs, and simple through complex ASICs]

  28. Adaptive EPIC • AEPIC = EPIC processing + datapath reconfiguration (figure: reconfigurable logic array holding configured units F1 and F2) • Reconfiguration through three basic operations (sketched below): add a functional unit to the datapath, remove a functional unit from the datapath, and execute an operation on a resident functional unit
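A minimal sketch of the three reconfiguration operations as a compiler or simulator might model them; the class and method names are hypothetical, not the actual AEPIC opcodes:

```python
class ReconfigurableDatapath:
    """Hypothetical model of an AEPIC-style reconfigurable logic array."""

    def __init__(self, capacity):
        self.capacity = capacity      # how many configurations fit at once
        self.resident = {}            # name -> configuration currently loaded

    def add_unit(self, name, configuration):
        """Add a custom functional unit to the datapath (a configuration load)."""
        if len(self.resident) >= self.capacity:
            raise RuntimeError("no free space: a resident unit must be removed first")
        self.resident[name] = configuration

    def remove_unit(self, name):
        """Remove a functional unit; configurations are immutable, so nothing is written back."""
        del self.resident[name]

    def execute(self, name, *operands):
        """Execute an operation on a resident custom functional unit."""
        return self.resident[name](*operands)

# Illustrative use: load a unit, run it, then evict it to make room.
rla = ReconfigurableDatapath(capacity=2)
rla.add_unit("mysub", lambda a, b: max(a - b, 0))
print(rla.execute("mysub", 7, 9))   # -> 0
rla.remove_unit("mysub")
```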

  29. AEPIC Machine Organization • Fetch/decode, I-cache, and a conventional GPR/FPR register file, as in an EPIC core • A configuration register file (CRF) and an array register file attached to a multi-context reconfigurable logic array • A two-level configuration cache: a level-1 configuration cache backed by a larger configuration cache, analogous to L1/L2 (a sketch of this hierarchy follows)
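A minimal sketch of how a compiler might model this configuration storage hierarchy when allocating configurations; the level names, slot counts, and load costs are assumptions for illustration only:

```python
from dataclasses import dataclass, field

@dataclass
class ConfigStore:
    """One level of configuration storage (MRLA contexts, L1 or L2 C-cache)."""
    name: str
    slots: int        # hypothetical capacity in configuration slots
    load_cost: int    # hypothetical cycles to load a configuration from here into the MRLA
    held: set = field(default_factory=set)

# The hierarchy the allocator reasons about, closest (executable) level first.
hierarchy = [
    ConfigStore("MRLA contexts", slots=2,  load_cost=0),
    ConfigStore("L1 C-cache",    slots=8,  load_cost=10),
    ConfigStore("L2 C-cache",    slots=64, load_cost=100),
]

def cost_to_execute(cfg, memory_cost=1000):
    """Cycles to make the configuration resident, from wherever it currently sits."""
    for level in hierarchy:
        if cfg in level.held:
            return level.load_cost
    return memory_cost   # not cached anywhere: fetch from main memory (assumed cost)

hierarchy[1].held.add("mysub")
print(cost_to_execute("mysub"))   # 10: one load from the L1 C-cache into the MRLA
```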

  30. AEPIC Compilation

  31. Key Tasks For AEPIC Compiler • Partitioning • Identifying code sections that convert to custom instructions • Operation synthesis • Efficient CFU implementations for identified partitions • Instruction selection • Multiple CFU choices for code partitions • Resource allocation • Allocation of C-cache and MRLA resources for CFUs • Scheduling • When and where to schedule adaptive component instructions

  32. Compiler Modules • Source program → front-end processing → partitioning → high-level optimizations (EPIC core) → mapping (adaptive) → pre-pass scheduling → resource allocation → post-pass scheduling → code generation → simulation • Partitioning and mapping consult a configuration library; scheduling and allocation consult a machine description; simulation produces performance statistics • A sketch of this pass ordering follows
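A minimal sketch of a driver for this pass ordering, assuming the module names above; every pass here is a stand-in stub, not the actual AEPIC compiler API:

```python
# Hypothetical pass driver mirroring the module diagram on this slide.
def _pass(name):
    """Make a placeholder pass that just records its name on the IR (a list)."""
    return lambda ir, *aux: ir + [name]

front_end            = lambda source: [f"front-end({source})"]
partitioning         = _pass("partitioning")
high_level_opts      = _pass("high-level optimizations (EPIC core)")
mapping_adaptive     = _pass("mapping (adaptive)")
prepass_scheduling   = _pass("pre-pass scheduling")
resource_allocation  = _pass("resource allocation")
postpass_scheduling  = _pass("post-pass scheduling")
code_generation      = _pass("code generation")
simulation           = lambda binary: {"passes_run": len(binary)}   # stand-in statistics

def compile_aepic(source, config_library, machine_description):
    ir = front_end(source)
    ir = partitioning(ir, config_library)       # pick code sections for CFUs
    ir = high_level_opts(ir)
    ir = mapping_adaptive(ir, config_library)   # bind partitions to CFU choices
    ir = prepass_scheduling(ir, machine_description)
    ir = resource_allocation(ir, machine_description)   # C-cache / MRLA allocation
    ir = postpass_scheduling(ir, machine_description)
    binary = code_generation(ir)
    return binary, simulation(binary)           # performance statistics

print(compile_aepic("kernel.c", config_library={}, machine_description={}))
```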

  33. Resource allocation for configurations • Example: the reconfigurable logic (RL) can accommodate only two configurations simultaneously • Configurations B, G, and R have an overlapping live-range region in the record of execution, so the allocator must decide which to keep resident and which to reload

  34. Managing the Reconfigurable Resource (contd.) • A simpler version is related to the register allocation problem • New parameters: no need to save to memory on spill (configurations are immutable); load costs differ across configurations; the C-cache and MRLA act as a multi-level register file • Adapted register allocation techniques to configuration allocation • Non-uniform configuration sizes lead to graph multi-coloring; adapted Chaitin’s coloring-based techniques (a sketch follows)
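A minimal sketch of a Chaitin-style greedy allocator adapted to configurations of non-uniform size; the heuristic ordering, sizes, and interference graph below are illustrative assumptions, not the algorithm from the thesis:

```python
def allocate_configurations(interference, size, capacity):
    """Greedy Chaitin-style allocation for configurations of non-uniform size.

    interference: dict mapping each configuration to the set of configurations
                  whose live ranges overlap with it
    size:         dict mapping each configuration to the slots it occupies
    capacity:     total slots available in the reconfigurable array

    Returns (resident, spilled). "Spilling" here means reloading on demand;
    nothing is written back because configurations are immutable.
    """
    resident, spilled = set(), set()
    # Consider large, highly constrained configurations first (simple heuristic).
    order = sorted(interference, key=lambda c: (size[c], len(interference[c])), reverse=True)
    for cfg in order:
        live_neighbors = interference[cfg] & resident
        used = sum(size[n] for n in live_neighbors)
        if used + size[cfg] <= capacity:
            resident.add(cfg)      # fits alongside its live neighbors
        else:
            spilled.add(cfg)       # must be reloaded when needed
    return resident, spilled

# Illustrative instance: three mutually overlapping configurations, two slots
# (cf. the previous slide): one of them must be spilled.
interference = {"B": {"G", "R"}, "G": {"B", "R"}, "R": {"B", "G"}}
size = {"B": 1, "G": 1, "R": 1}
print(allocate_configurations(interference, size, capacity=2))
```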

  35. The Problem With Long Reconfiguration Times • Figure contrasts configuration load time with configuration execution time: the load time is significant relative to the execution it enables

  36. Speculating Configuration Loads • Mask reconfiguration times by issuing configuration loads speculatively, ahead of where they are needed • Need to know when and where to speculate • If f1 >> f2, do not speculate into the “red” empty load slot (a sketch of the decision heuristic follows)
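A minimal sketch of one way such a when-and-where decision could be made, using expected cost under an estimated probability of use; the cost model and thresholds are assumptions for illustration, not the heuristic from the talk:

```python
def should_speculate_load(prob_needed, load_time, exec_time_before_use, stall_penalty=None):
    """Decide whether to hoist a configuration load above the branch that needs it.

    prob_needed:          estimated probability the hoisted configuration is actually used
    load_time:            cycles to load the configuration
    exec_time_before_use: cycles of other work available to overlap with a hoisted load
    Speculating hides min(load_time, exec_time_before_use) cycles when the configuration
    is needed, but wastes the load slot (modeled as stall_penalty) when it is not.
    """
    stall_penalty = load_time if stall_penalty is None else stall_penalty
    hidden = min(load_time, exec_time_before_use)
    expected_gain = prob_needed * hidden
    expected_waste = (1.0 - prob_needed) * stall_penalty
    return expected_gain > expected_waste

# If the frequent path (f1) rarely needs the CFU used on the infrequent path (f2),
# the expected gain is small and we do not speculate into the empty load slot.
print(should_speculate_load(prob_needed=0.05, load_time=1000, exec_time_before_use=800))  # False
print(should_speculate_load(prob_needed=0.9,  load_time=1000, exec_time_before_use=800))  # True
```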

  37. Sample Compiler Topics • Configuration cache management • Power/clock rate vs. Performance tradeoff • Bit-width optimizations

  38. More Generally: Architecture Assembly • An ISA view: synthesis and other hardware design happen off-line (synthesis and place-and-route build a library of prebuilt implementations of datapath, storage, and interconnect) • The “compiler” selects, assembles, and optimizes these prebuilt implementations for the program, which keeps the process much closer to compiler optimizations and hence implies faster compile times • The result is a dynamically variable ISA and architecture implementation • Also applicable to yield fixed implementations in silicon
