SoC Subsystem Acceleration using Application-Specific Processors (ASIPs). Markus Willems Product Manager Synopsys. SoC Design. What to do when the performance of your main processor is insufficient? Go multicore? Application mapping difficult, resource utilisation unbalanced
SoC Subsystem Acceleration using Application-Specific Processors (ASIPs) Markus Willems Product Manager Synopsys
SoC Design • What to do when the performance of your main processor is insufficient? • Go multicore? • Application mapping difficult, resource utilisation unbalanced • Add hardwired accelerators? • Balanced but inflexible
SoC Design • What to do when the performance of your main processor is insufficient? • ASIPs: application-specific processors • Anything between general-purpose µP and hardwired data-path • Deploys classic hardware tricks (parallelism and customized datapaths) while retaining programmability – Hardware efficiency with software programmability
Agenda • ASIPs as accelerators in SoCs • How to design ASIPs • Examples • Conclusions
Architectural Optimization Space • ASIP architectural optimization space: Parallelism, Specialization
Architectural Optimization Space: Parallelism • Instruction-level parallelism (ILP): orthogonal instruction set (VLIW), encoded instruction set • Data-level parallelism: vector processing (SIMD) • Task-level parallelism: multicore, multi-threading
Architectural Optimization Space: Specialization • App.-specific data types • App.-specific instructions • Pipeline • Connectivity & storage matching application’s data-flow • Integer, fractional, floating-point, bits, complex, vector… • Distributed regs, sub-ranges • Multiple mem’s, sub-ranges • App.-spec. memory addressing • App.-spec. data processing • App.-spec. control processing • Direct, indirect, post-modification, indexed, stack indirect… • Any exotic operator • Jumps, subroutines, interrupts, HW do-loops, residual control, predication… • Single or multi-cycle • Relative or absolute, address range, delay slots…
Agenda • ASIPs as accelerators in SoCs • How to design ASIPs • Examples • Conclusions
32-bit ARC HS Processors: High-Performance for Embedded Applications • Over 3100 DMIPS @ 1.6 GHz* • 53 mW* of power; 0.12 mm² area in 28-nm process* • HS Family products • HS34 CCM, HS36 CCM plus I&D cache • HS234, HS236 dual-core • HS434, HS436 quad-core • Configurable so each instance can be optimized for performance and power • Custom instructions enable integration of proprietary hardware • *Worst case 28-nm silicon and conditions • Block diagram labels: ARCv2 ISA / DSP, 10-stage pipeline, ALU, Late ALU, MAC & SIMD, Multiplier, Divider, ARC Floating Point Unit, User Defined Extensions, Memory Protection Unit, Instruction CCM, Data CCM, Instruction Cache, Data Cache, JTAG, Real-Time Trace (some blocks marked optional)
Pedestrian Detection and HOG • Pedestrian detection • Standard feature in luxury vehicles • Moving to mid-size and compact vehicles in the next 5-10 years, also due to legislation efforts • Implementation requirements • Low cost • Low power (small form factor, and/or battery powered) • Programmable (to allow for in-field SW upgrades) • Most popular algorithm for pedestrian detection is Histogram of Oriented Gradients (HOG)
Histogram Of Oriented Gradients • Scale to Multiple Resolutions: Use a fixed 64x128-pixel detection window. Apply this detection window to scaled frames. • Gradient Computation: Apply Sobel operators in the horizontal and vertical directions.
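The gradient computation step can be sketched in a few lines of Python. This is a minimal illustration that assumes the standard 3x3 Sobel kernels and an unsigned orientation range of 0 to 180 degrees; the slide's own kernel figures did not survive, so the kernels, the function name `sobel_gradients`, and the border handling (borders left at zero) are assumptions, not the presenter's implementation.

```python
import math

# Standard 3x3 Sobel kernels (assumed; the slide's kernel figures are not available).
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_gradients(img):
    """Return (magnitude, orientation) maps for a grey-scale image.

    img is a list of rows of numbers. Orientation is in degrees, folded
    into [0, 180). Border pixels are left at zero for simplicity.
    """
    h, w = len(img), len(img[0])
    mag = [[0.0] * w for _ in range(h)]
    ang = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(SOBEL_X[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(SOBEL_Y[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            mag[y][x] = math.hypot(gx, gy)
            ang[y][x] = math.degrees(math.atan2(gy, gx)) % 180.0
    return mag, ang


# A vertical edge: the gradient points along x, so the orientation is 0 degrees.
edge = [[0, 0, 10, 10]] * 3
mag, ang = sobel_gradients(edge)
```

On the scaled frames, the same routine would run once per resolution before the histogram stage.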
Histogram Of Oriented Gradients • Histogram Computation: The image is divided into 8x8-pixel cells. For every block of 2x2 cells, apply Gaussian weights and compute 4 histograms of orientation of gradients. • Normalization of the Histograms: (1) L2 normalization (2) clipping (saturation) (3) L2 normalization • Support Vector Machine: Linear classification of histograms for every 64x128 window position. • Non-Max Suppression: Cluster multi-scale dense scan of detection windows and select unique
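The histogram, normalization and SVM steps can be sketched as follows. This is a simplified illustration, not the presented implementation: the 9 orientation bins and the clipping threshold of 0.2 are common HOG defaults assumed here (the slides do not state them), Gaussian weighting and bin interpolation are left out, and all function names are hypothetical.

```python
import math


def cell_histogram(mags, angs, bins=9):
    """Accumulate gradient magnitudes into orientation bins over [0, 180) degrees.

    Hard binning only; Gaussian weights and interpolation are omitted.
    """
    hist = [0.0] * bins
    width = 180.0 / bins
    for m, a in zip(mags, angs):
        hist[int(a // width) % bins] += m
    return hist


def l2_normalize(v, eps=1e-6):
    """Scale v to unit L2 norm; eps avoids division by zero for empty blocks."""
    norm = math.sqrt(sum(x * x for x in v) + eps * eps)
    return [x / norm for x in v]


def normalize_block(v, clip=0.2):
    """The three-step scheme from the slide: L2 normalize, clip, L2 normalize."""
    v = l2_normalize(v)
    v = [min(x, clip) for x in v]
    return l2_normalize(v)


def svm_score(features, weights, bias):
    """Linear SVM decision value; a window is a detection if the score is positive."""
    return sum(f * w for f, w in zip(features, weights)) + bias


hist = cell_histogram([1.0, 2.0, 3.0], [0.0, 10.0, 170.0])
block = normalize_block(hist)
```

For a 64x128 window, the normalized block vectors would be concatenated into one feature vector and passed to `svm_score` once per window position.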
HOG Functional Validation on ARC HS (640 x 480 pixels) • Task assignment 1 • OpenCV float profiling results: 2.6 G cycles per frame • Fixed point profiling results: 2.4 G cycles per frame • Pipeline stages: grey scale conversion, rescaling, gradient, histogram, normalization, SVM, non-max suppression • Subsystem diagram labels: ARC HS with DCCM, subsystem control, ASIP1, ASIP2 … ASIPn with "D" blocks, dedicated streaming interconnect (FIFOs), AXI local interconnect, DMA, sync & I/O
Task Assignment #2 • Pipeline stages: grey scale conversion, rescaling, gradient, histogram, normalization, SVM, non-max suppression • Subsystem diagram labels: ARC HS with DCCM, subsystem control, ASIP1, ASIP2, ASIP4 with "D" blocks, dedicated streaming interconnect (FIFOs), AXI local interconnect, DMA, sync & I/O, L3 external DRAM
ASIP Example: HISTOGRAM • Vector-slot next to existing scalar instructions (VLIW) • 16x(8/16)-bit vector register files • 16x8-bit SRAM interface • 16x8-bit FIFO interfaces • Vector arithmetic instructions • Special registers and instructions to compute histograms • 4x size increase & 200x speedup (relative to RISC template) • Implemented in less than 1 week
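One common way to compute a histogram with 16-element vectors is to give each lane its own private histogram and merge them at the end, so that lanes never collide on the same bin within a vector step. The sketch below models that idea in plain Python. It is an assumption for illustration only: the slide does not say how the special registers and instructions work, and `LANES` and `vector_histogram` are hypothetical names.

```python
LANES = 16  # matches the 16-element vectors mentioned on the slide


def vector_histogram(bin_indices, weights, bins):
    """Histogram with lane-private accumulators, merged after the last vector step."""
    lane_hist = [[0] * bins for _ in range(LANES)]
    for base in range(0, len(bin_indices), LANES):
        # One "vector step": each lane handles one element of the 16-wide slice.
        idx = bin_indices[base:base + LANES]
        wts = weights[base:base + LANES]
        for lane, (b, w) in enumerate(zip(idx, wts)):
            lane_hist[lane][b] += w
    # Reduction across lanes gives the final histogram.
    return [sum(lane_hist[lane][b] for lane in range(LANES)) for b in range(bins)]


result = vector_histogram([0, 1, 1, 2] * 5, [1] * 20, bins=3)
```

In hardware, the inner loop over lanes would be a single vector instruction, which is where a large speedup over a scalar RISC loop could come from.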
Task Assignment #3 • Pipeline stages: grey scale conversion, rescaling, gradient, histogram, normalization, SVM, non-max suppression • Subsystem diagram labels: ARC HS with DCCM, subsystem control, ASIP1, ASIP2, ASIP3, ASIP4 with "D" blocks, dedicated streaming interconnect (FIFOs), AXI local interconnect, DMA, sync & I/O, L3 external DRAM
Task Assignment #4 • Pipeline stages: grey scale conversion, rescaling, gradient, histogram, normalization, SVM, non-max suppression • Subsystem diagram labels: ARC HS with DCCM, subsystem control, ASIP1’, ASIP2, ASIP3, ASIP4 with "D" blocks, dedicated streaming interconnect (FIFOs), AXI local interconnect, DMA, sync & I/O, L3 external DRAM
Task Assignment #4’ • Pipeline stages: grey scale conversion, rescaling, gradient, histogram, normalization, SVM, non-max suppression • Subsystem diagram labels: ARC HS with DCCM, L2 SRAM, ASIP1’, ASIP2, ASIP3, ASIP4 with "D" blocks, dedicated streaming interconnect (FIFOs), AXI local interconnect, DMA, sync & I/O, L3 external DRAM
Comparison • Task assignments 1, 2, 3 and 4 [chart]
Final Results • 1 ARC HS, 4 ASIPs, AXI interconnect, private SRAM, L2 SRAM • 30 frames/second at 500 MHz • Functionally identical to OpenCV reference • TSMC 28nm • ASIP gate count: 330k gates • ASIP power consumption: ~130mW • Power/performance/area via ASIPs • Scaling due to multi-core, specialization and SIMD usage • Performance gains and power efficiency due to tailored instruction sets and dedicated memory architecture
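To put the 30 frames/second figure next to the earlier profiling number, the cycle budget can be worked out directly. The sketch below uses only figures from the slides (500 MHz, 30 frames/second, and the 2.4 G cycles per frame fixed-point profile on a single ARC HS). It assumes each processor in the pipeline has one full frame period per frame; how the resulting factor is split between multicore, specialization and SIMD is not stated in the deck.

```python
def cycle_budget_per_frame(clock_hz, frames_per_second):
    """Cycles one processor can spend per frame period at the given clock."""
    return clock_hz / frames_per_second


BASELINE_CYCLES = 2.4e9  # fixed-point profile on a single ARC HS, per frame

budget = cycle_budget_per_frame(500e6, 30)
reduction = BASELINE_CYCLES / budget

print(f"budget per frame: {budget / 1e6:.1f} M cycles")  # 16.7 M cycles
print(f"baseline / budget: {reduction:.0f}x")            # 144x
```

In other words, the single-core baseline needs about 144 frame periods' worth of cycles at 500 MHz, which is the gap the four ASIPs and the streaming architecture have to close.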
Scenario: Need for Flexible FEC Core • Existing and emerging standards use advanced FEC schemes like turbo coding, LDPC and Viterbi • Instead of duplication of FEC cores, need for re-configurable architecture at minimum power and area • Diagram labels: standard-specific cores (DVB-X? LDPC-A, .11n LDPC-C, .11n Vit, .16e LDPC-D, 3GPP-LTE turbo-A, UMTS Turbo-B) and a single FlexFEC core (turbo/LDPC/Vit)
Architecture Refinement to Increase Throughput: Increased ILP from 2 to 6 • ILP: 2 FU (scalar + vector unit) • ILP: 6 FU (1 scalar + 5 vector units) • No duplication for arithmetic functionality • For exploiting ILP to increase throughput • 2 FUs for local memory access
Fast Area/Performance Trade-off (40nm logical synthesis, processor only) • 0.189 sqmm • 0.177 sqmm
Architectural Exploration: FU Utilization, 2 → 5 • Vector slot separated in different FUs without overlapping functionality • Local memory access congestion
Architectural Exploration: More Balanced FU Utilization, 5 → 6
Blox-LDPC ASIP Latest IP Available from IMEC Instances available ad
Agenda • ASIPs as accelerators in SoCs • How to design ASIPs • Examples • Conclusions
Conclusion • ASIPs enable programmable accelerators • IP Designer enables efficient design and programming of ASIPs • “Programmable datapath” ASIPs offer performance, area and power comparable to hardwired accelerators • ASIPs enable balanced multicore SoC architectures