SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)

Markus Willems, Product Manager, Synopsys

SoC Design
• What to do when the performance of your main processor is insufficient?
• Go multicore?
  – Application mapping difficult, resource utilisation unbalanced
• Add hardwired accelerators?
  – Balanced but inflexible

SoC Design
• What to do when the performance of your main processor is insufficient?
• ASIPs: application-specific processors
  – Anything between a general-purpose µP and a hardwired datapath
  – Deploys classic hardware tricks (parallelism and customized datapaths) while retaining programmability
  – Hardware efficiency with software programmability

Agenda
• ASIPs as accelerators in SoCs
• How to design ASIPs
• Examples
• Conclusions

Architectural Optimization Space
ASIP architectural optimization space: parallelism and specialization

Architectural Optimization Space: Parallelism
• Instruction-level parallelism (ILP): orthogonal instruction set (VLIW), encoded instruction set
• Data-level parallelism: vector processing (SIMD)
• Task-level parallelism: multicore, multi-threading

Architectural Optimization Space: Specialization
• App.-specific data types: integer, fractional, floating-point, bits, complex, vector…
• App.-specific instructions
  – App.-spec. memory addressing: direct, indirect, post-modification, indexed, stack indirect…
  – App.-spec. data processing: any exotic operator; single or multi-cycle
  – App.-spec. control processing: jumps, subroutines, interrupts, HW do-loops, residual control, predication…; relative or absolute, address range, delay slots…
• Pipeline
• Connectivity & storage matching application’s data-flow: distributed regs, sub-ranges; multiple mem’s, sub-ranges

IP Designer: ASIP Design and Programming

Synopsys - Full Spectrum Processor Technology Provider

32-bit ARC HS Processors: High-Performance for Embedded Applications
• Over 3100 DMIPS @ 1.6 GHz*
• 53 mW* of power; 0.12 mm² area in 28-nm process*
• HS Family products
  – HS34 CCM, HS36 CCM plus I&D cache
  – HS234, HS236 dual-core
  – HS434, HS436 quad-core
• Configurable so each instance can be optimized for performance and power
• Custom instructions enable integration of proprietary hardware
[Block diagram: ARCv2 ISA / DSP core with 10-stage pipeline; ALU, late ALU, multiplier, divider, MAC & SIMD; ARC floating point unit; memory protection unit; JTAG; real-time trace; user-defined extensions; instruction CCM and data CCM; instruction cache and data cache; some blocks are marked optional]
*Worst case 28-nm silicon and conditions

Pedestrian Detection and HOG
• Pedestrian detection
  – Standard feature in luxury vehicles
  – Moving to mid-size and compact vehicles in the next 5-10 years, also due to legislation efforts
• Implementation requirements
  – Low cost
  – Low power (small form factor, and/or battery powered)
  – Programmable (to allow for in-field SW upgrades)
• Most popular algorithm for pedestrian detection is Histogram of Oriented Gradients (HOG)

Histogram Of Oriented Gradients
• Scale to multiple resolutions: use a fixed 64x128-pixel detection window and apply it to scaled frames.
• Gradient computation: apply the two Sobel operators [operator matrices not preserved].
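The gradient step can be sketched as follows. This is a minimal NumPy illustration, assuming the standard 3x3 Sobel kernels (the slide's own matrices are not preserved) and unsigned orientations in the range 0 to 180 degrees; the names `filter3x3` and `gradients` are hypothetical and not taken from the presentation.

```python
import numpy as np

# Standard 3x3 Sobel kernels (an assumption; the slide's matrices are missing).
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.int32)
SOBEL_Y = SOBEL_X.T


def filter3x3(img, kernel):
    """Correlate a 2-D image with a 3x3 kernel, replicating border pixels."""
    padded = np.pad(img.astype(np.int32), 1, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.int32)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out


def gradients(img):
    """Return per-pixel gradient magnitude and unsigned orientation (degrees)."""
    gx = filter3x3(img, SOBEL_X)
    gy = filter3x3(img, SOBEL_Y)
    magnitude = np.hypot(gx, gy)
    orientation = np.degrees(np.arctan2(gy, gx)) % 180.0
    return magnitude, orientation
```

Calling `gradients(frame)` on a grey-scale frame yields the two arrays that the histogram stage consumes. The nine shifted multiply-accumulates in `filter3x3` are the kind of regular, data-parallel work that maps naturally onto a vector datapath.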

Histogram Of Oriented Gradients
• Histogram computation: the image is divided into 8x8-pixel cells. For every block of 2x2 cells, apply Gaussian weights and compute 4 histograms of orientation of gradients.
• Normalization of the histograms: (1) L2 normalization, (2) clipping (saturation), (3) L2 normalization.
• Support vector machine: linear classification of histograms for every 64x128 window position.
• Non-max suppression: cluster the multi-scale dense scan of detection windows and select unique
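The normalization and classification steps can be sketched in a few lines. This is an illustration only: the clipping threshold of 0.2, the 9 orientation bins and the one-cell block stride are commonly used HOG defaults assumed here, since the slides do not state them, and the function names are hypothetical.

```python
import numpy as np


def l2_hys(block, clip=0.2, eps=1e-6):
    """(1) L2-normalize, (2) clip (saturate), (3) L2-normalize again.

    clip=0.2 is an assumed default, not a value given in the slides.
    """
    v = np.asarray(block, dtype=np.float64)
    v = v / np.sqrt(np.sum(v * v) + eps ** 2)
    v = np.minimum(v, clip)
    return v / np.sqrt(np.sum(v * v) + eps ** 2)


def descriptor_length(win_w=64, win_h=128, cell=8, block_cells=2, bins=9):
    """Descriptor size of one detection window.

    Assumes blocks slide by one cell and each histogram has `bins` bins.
    """
    blocks_x = win_w // cell - block_cells + 1
    blocks_y = win_h // cell - block_cells + 1
    return blocks_x * blocks_y * block_cells * block_cells * bins


def svm_score(descriptor, weights, bias):
    """Linear SVM decision value; a window is a detection if the score is > 0."""
    return float(np.dot(weights, descriptor) + bias)
```

Under these assumptions a 64x128 window has 7 x 15 blocks of 4 histograms each, so `descriptor_length()` gives 3780 values and the linear SVM reduces to one 3780-element dot product per window position.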

HOG Functional Validation on ARC HS (640 x 480 pixels)
• OpenCV float profiling results: 2.6 G cycles per frame
• Fixed-point profiling results: 2.4 G cycles per frame
[Diagram, task assignment 1: HOG pipeline stages (grey-scale conversion, rescaling, gradient, histogram, normalization, SVM, non-max suppression) and the subsystem template: ARC HS with DCCM and subsystem control, ASIP1 to ASIPn on a dedicated streaming interconnect (FIFOs), AXI local interconnect, DMA, sync & I/O]

Profiling (640 x 480 pixels, at 30 FPS)

Task Assignment #2
[Diagram: HOG pipeline stages mapped onto ARC HS (DCCM, subsystem control) plus ASIP1, ASIP2 and ASIP4, with dedicated streaming interconnect (FIFOs), AXI local interconnect, DMA, sync & I/O, and L3 external DRAM]

ASIP Example: HISTOGRAM
• Vector slot next to existing scalar instructions (VLIW)
• 16x(8/16)-bit vector register files
• 16x8-bit SRAM interface
• 16x8-bit FIFO interfaces
• Vector arithmetic instructions
• Special registers and instructions to compute histograms
4x size increase & 200x speedup (relative to RISC template)
Implemented in less than 1 week

Task Assignment #3
[Diagram: HOG pipeline stages mapped onto ARC HS (DCCM, subsystem control) plus ASIP1, ASIP2, ASIP3 and ASIP4, with dedicated streaming interconnect (FIFOs), AXI local interconnect, DMA, sync & I/O, and L3 external DRAM]

Task Assignment #4
[Diagram: HOG pipeline stages mapped onto ARC HS (DCCM, subsystem control) plus ASIP1’, ASIP2, ASIP3 and ASIP4, with dedicated streaming interconnect (FIFOs), AXI local interconnect, DMA, sync & I/O, and L3 external DRAM]

Task Assignment #4’
[Diagram: HOG pipeline stages mapped onto ARC HS (DCCM) plus ASIP1’, ASIP2, ASIP3 and ASIP4, with dedicated streaming interconnect (FIFOs), AXI local interconnect, DMA, sync & I/O, L2 SRAM, and L3 external DRAM]

Comparison of task assignments 1, 2, 3 and 4 [chart not preserved]

Final Results
• 1 ARC HS, 4 ASIPs, AXI interconnect, private SRAM, L2 SRAM
• 30 frames/second at 500 MHz
• Functionally identical to OpenCV reference
• TSMC 28 nm
  – ASIP gate count: 330k gates
  – ASIP power consumption: ~130 mW
• Power/performance/area via ASIPs
  – Scaling due to multi-core, specialization and SIMD usage
  – Performance gains and power efficiency due to tailored instruction sets and dedicated memory architecture
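To put these results next to the single-core profiling figures, the arithmetic below estimates the aggregate reduction in cycles per frame. It is a back-of-the-envelope sketch that uses only numbers from the slides (2.6 G and 2.4 G cycles per frame on ARC HS, 30 frames/second, 500 MHz). The assumption that every processor runs at 500 MHz, and the function name `aggregate_speedup`, are mine.

```python
def aggregate_speedup(cycles_per_frame, fps, clock_hz):
    """Ratio between the cycles one core would need per second and the
    cycles a single clock domain provides per second.

    Assumes the whole subsystem is clocked at `clock_hz`; the result is an
    aggregate factor, not a per-ASIP speedup.
    """
    required_cycles_per_second = cycles_per_frame * fps
    return required_cycles_per_second / clock_hz


# Figures quoted in the slides.
FIXED_POINT_CYCLES = 2.4e9   # cycles per frame, fixed-point code on ARC HS
FLOAT_CYCLES = 2.6e9         # cycles per frame, OpenCV float code on ARC HS
FPS = 30
CLOCK_HZ = 500e6

fixed_factor = aggregate_speedup(FIXED_POINT_CYCLES, FPS, CLOCK_HZ)
float_factor = aggregate_speedup(FLOAT_CYCLES, FPS, CLOCK_HZ)
```

With these inputs the fixed-point code would need 72 G cycles per second, roughly 144 times what one 500 MHz clock delivers (156 times for the float version). That gap is what the combination of multi-core task assignment, specialization and SIMD has to close.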

Scenario: Need for Flexible FEC Core
• Existing and emerging standards use advanced FEC schemes such as turbo coding, LDPC and Viterbi
• Instead of duplicating FEC cores, a reconfigurable architecture is needed at minimum power and area
[Diagram: standards and codes (DVB-X? LDPC-A, .11n LDPC-C, .11n Vit, .16e LDPC-D, 3GPP-LTE turbo-A, UMTS Turbo-B) around a FlexFEC (turbo/LDPC/Vit) core]

Architecture Refinement to Increase Throughput: Increased ILP from 2 to 6
• ILP: 2 FUs (scalar + vector unit)
• ILP: 6 FUs (1 scalar + 5 vector units)
  – No duplication of arithmetic functionality
  – To exploit ILP and increase throughput: 2 FUs for local memory access

Fast Area/Performance Trade-off (40 nm logic synthesis, processor only)
[Chart values: 0.189 mm² and 0.177 mm²]

Architectural Exploration: FU Utilization, 2 → 5
• Vector slot separated into different FUs without overlapping functionality
• Local memory access congestion

Architectural Exploration: More Balanced FU Utilization, 5 → 6

Highly Efficient C Compilation: Vast Majority of 6 FUs Used

Blox-LDPC ASIP: Latest IP Available from IMEC
Instances available ad

Conclusion
• ASIPs enable programmable accelerators
• IP Designer enables efficient design and programming of ASIPs
• “Programmable datapath” ASIPs offer performance, area and power comparable to hardwired accelerators
• ASIPs enable balanced multicore SoC architectures
