Nature vs. Nurture for Artificial Intelligence SoCs
Sept 12th, 2019
Haopeng Liu/刘好朋
Agenda
• What's an AI SoC?
• Market Trends
• Key Challenges
• Processing
• Memory
• Connectivity
• Security
• Summary
Defining Artificial Intelligence
• Artificial intelligence mimics human behavior
• Machine learning uses advanced statistical models to find patterns & results
• Deep learning is a specialized subset of machine learning that uses neural networks to recognize patterns in data
[Taxonomy graphic]
• Artificial Intelligence: mimics human behavior
• Machine Learning: uses advanced statistical algorithms to improve AI (regression, Bayesian methods, clustering, decision trees, support vector machines, ...)
• Deep Learning (Neural Networks): recurrent neural networks, convolutional neural networks, capsule neural networks, spiking neural networks, ...
What Makes it an AI SoC?
• Most investment is dedicated to CNN and RNN, with some SNN
• Solutions include a Software Development Kit (SDK) for mapping AI graphs to hardware
• The most competitive solutions include neural network hardware acceleration
[Graphic callouts comparing hardware options: vast majority of inference today; vast majority of training today, which enabled better-than-human-error capabilities; power-per-performance leader; vast majority of investment for AI SoCs]
AI's Insatiable Need for Compute, Memory, & Electricity
• COMPUTE
  • ResNeXt-101: >30B operations
  • Google's Voice Recognition: >19B operations
• MEMORY
  • ResNet-152: >60M parameters (verified in the sketch below)
  • Google's Voice Recognition: >34M parameters
• ELECTRICITY
  • "AI workloads could consume 80% of all compute cycles and 10% of global electricity use by 2030"
[Chart: model sizes in millions of weights]
Model Innovation is Increasing the Number of Weights & Multiplications Needed
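The ResNet-152 figure above is easy to sanity-check. A minimal sketch, assuming PyTorch and torchvision are available (neither is part of the original deck):

```python
# Count the parameters of ResNet-152 to verify the ">60M parameters" claim.
import torchvision.models as models

resnet152 = models.resnet152()  # untrained weights; only the shapes matter here
n_params = sum(p.numel() for p in resnet152.parameters())
print(f"ResNet-152 parameters: {n_params / 1e6:.1f}M")  # prints roughly 60.2M
```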
Data Center AI is Moving to the Edge
AI Capabilities Will Drive Edge Computing
• NVIDIA, Intel, and Google dominate AI data center market share
• 2018 data center AI revenue (source: companies): NVIDIA $3 billion, Intel $1 billion
• Edge computing expected CAGR of 20% to 50% by 2022
• End-node inference optimizations with compression can be time-intensive and costly
Market Opportunity is at the Edge
Deep Learning SoC Challenges
Unique Requirements for Processing, Memory, Connectivity, Security
• Specialized Processing
• Memory Performance
• Real-Time Connectivity
• Security
Leading AI Processor Options
• ASIP Designer
  • Design Application-Specific Instruction-set Processors
  • Unlimited design flexibility
  • The slide's example shows an LSTM (Long Short-Term Memory), a form of RNN, built via ASIP Designer (a reference sketch of the LSTM math follows below)
• EV6x Embedded Vision Processor IP
  • Vision cores + CNN
  • Standard software toolchains (MetaWare EV software): libraries (OpenCV) & API (OpenVX), compilers/debuggers (C/C++, OpenCL C), simulators (fast NSIM, EV VDK), CNN mapping tool
[Block diagram: EV6x vision CPU (1/2/4 cores, each with a 32-bit scalar unit, VFPU, SFPU, and 512-bit vector DSP) plus a scalable CNN engine (880/1760/3520 MAC configurations for 2D convolution, 1D convolution, and classification), streaming transfer unit, sync & debug, shared memory, and AXI interconnect]
Best Processing Solutions & Expertise
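The ASIP Designer LSTM graphic itself is not reproduced here; as a reference for the computation such a processor would accelerate, here is a minimal NumPy sketch of one LSTM cell step (the standard textbook formulation, not Synopsys code):

```python
import numpy as np

def lstm_cell(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W: (4H, D), U: (4H, H), b: (4H,), gates stacked i,f,g,o."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b            # the bulk of the MACs an ASIP accelerates
    i = 1 / (1 + np.exp(-z[0*H:1*H]))     # input gate
    f = 1 / (1 + np.exp(-z[1*H:2*H]))     # forget gate
    g = np.tanh(z[2*H:3*H])               # candidate cell state
    o = 1 / (1 + np.exp(-z[3*H:4*H]))     # output gate
    c = f * c_prev + i * g                # new cell state
    h = o * np.tanh(c)                    # new hidden state
    return h, c
```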
ARC HS47 DSP for Control and Scalar DSP
Low-cost Controller for AI Engines
• Dual issue / 64-bit LD/ST
• ARCv2DSP ISA: code compatible with EMxD, higher Fmax
• Designed-in multicore support: scalable from 1 to 4 cores
• 1 x 32x32 MAC/cycle
• 2 x 16x16 MAC/cycle (a behavioral model of this packed MAC follows below)
• Complete C/C++-based tool suite (MetaWare development tools: C/C++ compilers/debuggers, libraries, and simulators such as fast NSIM and VDK)
[Block diagram: the HS47 scalar DSP (I$/D$, 32-bit scalar unit, scalar FPU) acting as controller beside vision CPUs and a customer-specific AI engine built from 512-bit SIMD lanes, crossbars, vector closely coupled memories (CCMs), APEX user-defined extensions, and a network-on-chip]
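To make the "2 x 16x16 MAC/cycle" figure concrete, here is a behavioral Python model of one such packed-MAC operation; the lane packing shown is an illustrative assumption, not the documented ARCv2DSP encoding:

```python
def split_i16(word: int) -> tuple[int, int]:
    """Split a 32-bit word into two signed 16-bit lanes (lo, hi)."""
    def s16(v: int) -> int:               # interpret 16 bits as two's complement
        return v - 0x10000 if v & 0x8000 else v
    return s16(word & 0xFFFF), s16((word >> 16) & 0xFFFF)

def dual_mac(acc: int, a: int, b: int) -> int:
    """One modeled cycle: two 16x16 multiplies folded into the accumulator."""
    a_lo, a_hi = split_i16(a)
    b_lo, b_hi = split_i16(b)
    return acc + a_lo * b_lo + a_hi * b_hi

# Example: a packs lanes (3, -2), b packs lanes (5, 7): 3*5 + (-2)*7 = 1
a = (((-2) & 0xFFFF) << 16) | 3
b = (7 << 16) | 5
assert dual_mac(0, a, b) == 1
```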
Specialized Processing Challenges in AI
NATURE: Processor Hardware Challenges
• Identify application targets (vision, voice, pattern recognition; scalar, vector, and matrix-multiplication workloads)
• Support new and emerging AI algorithms (CNNs such as AlexNet, VGG, and ResNet; RNNs; SNNs)
• Support training, inference, and compression (32-bit FP, 16-bit int, 8-bit int)
• Heterogeneous compute (coherency)
NURTURING: System Design Optimization Challenges
• Framework support (TensorFlow, Caffe2, ONNX; see the export sketch below)
• Mapping tool optimizations
• Benchmarking
• Simulation / prototyping (architectural exploration, software development, power analysis)
Traditional Processing Architectures are Insufficient; Synopsys Supports Your Processing Innovation
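Framework support usually means ingesting a graph in an interchange format. A minimal sketch of the first step of such a flow, assuming PyTorch and torchvision are available; the downstream mapping tool is left abstract:

```python
# Export a trained graph to ONNX, the interchange format named on the slide.
import torch
import torchvision.models as models

model = models.resnet18().eval()       # any trained network would do
dummy = torch.randn(1, 3, 224, 224)    # example input pins down the graph shapes
torch.onnx.export(model, dummy, "model.onnx", opset_version=11)
# A vendor mapping tool would then quantize and schedule model.onnx
# onto the CNN engine hardware.
```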
Deep Learning SoC Challenges
Unique Requirements for Processing, Memory, Connectivity, Security
• Specialized Processing
• Memory Performance
• Real-Time Connectivity
• Security
Why is HBM2/2E on Many AI Accelerators?
Providing best-in-class pJ/bit & the highest possible bandwidth
• HBM2: up to 8Gb die, up to 2400Mbps
• HBM2E: up to 16Gb die, up to 3200Mbps
• HBM2 provides more bandwidth and better power efficiency than DDR4 & GDDR5/5X (the arithmetic is worked below)
  • 8 x 64b DDR4 channels @ 3.2Gb/s: 204.8 GB/s
  • 1 x 1024b HBM2 stack @ 2Gb/s: 256 GB/s
• HBM technologies provide the best roadmap for expanding bandwidth and improving pJ/bit access with HBM2E & HBM3
• Chiplet technologies are trending
• Synopsys provides proven solutions with optimized PPA, DFT, and interposer design support
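The two bandwidth figures follow directly from bus width times per-pin data rate; a quick check in Python:

```python
# Peak bandwidth = channels x bus width (bits) x per-pin rate (Gb/s) / 8 bits per byte
def bandwidth_gbs(channels, bus_bits, gbps_per_pin):
    return channels * bus_bits * gbps_per_pin / 8

print(bandwidth_gbs(8, 64, 3.2))    # DDR4: 8 x 64b @ 3.2 Gb/s -> 204.8 GB/s
print(bandwidth_gbs(1, 1024, 2.0))  # HBM2: 1 x 1024b @ 2 Gb/s -> 256.0 GB/s
```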
Memory Bandwidth at the Edge
An Exercise in Processor Configuration & Mapping Tools
• Mobile & automotive use LPDDR (compression assumed)
• Available techniques (sketched below):
  • Quantization: converting from 32b FP to 8b/12b INT or FP
  • Pruning: removing zero/near-zero coefficients
  • Compression: reducing the size of feature maps by removing statistical redundancy
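A minimal NumPy sketch of the first two techniques; the symmetric per-tensor scheme and the magnitude threshold are common illustrative choices, not a specific mapping tool's implementation:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor 8-bit quantization: 32b FP -> 8b INT plus one scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale            # recover approximate weights with q * scale

def prune_by_magnitude(w, threshold=1e-3):
    """Zero out near-zero coefficients so sparse storage and MAC skipping pay off."""
    return np.where(np.abs(w) < threshold, 0.0, w)

w = np.random.randn(1024).astype(np.float32)
q, s = quantize_int8(prune_by_magnitude(w))   # 4x smaller than FP32 storage
```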
Machine Learning Foundation IP
Foundation IP for 7-nm Processes
• Foundation IP customized for machine learning
  • Special cells for low-power dot-product implementation
  • Near-threshold-voltage cells
  • Multi-ported memory (up to 10 ports)
  • Design-specific analysis for large SRAM content
  • HPC kit customized for ARC EV
• Integrated test mode in Synopsys memory and SMS reduces area by 7% and dynamic power by 10%
• Total system regression testing using Synopsys tools and test blocks; proven with other industry tools
• One-stop shop: silicon-validated solution consisting of memories, libraries, and test, delivering best PPA
[Portfolio graphic: high-speed (HS logic library, HS SP SRAM, HS 1P RF, cache, HPC design kit), high-density (HD logic library, HD SP SRAM, HD DP SRAM with 2 clocks, HD 1P RF, HD 2P RF with 2 clocks, HPC design kit), ultra-high-density (UHD SP SRAM, UHD 2P SRAM with 1 clock, UHD 1P RF, UHD 2P RF with 1 clock, ViaROM), plus embedded memory test & repair]
Customizations for Power & Density are Critical to Optimize Memories for Deep Learning
HPC Kit Enhanced for AI Applications
• Special cells introduced to reduce the CNN engine's power consumption by up to 39%
• Tradeoff tuning enables a 7% frequency boost with 28% lower power
Memory Performance Challenges in AI
NATURE: IP Selection Addresses Memory Challenges (e.g., Synopsys EMLT)
• Capacity (DDR5)
• Bandwidth (HBM2E)
• Power consumption (LPDDR4X/5)
NURTURING: But System Design Optimizations are Required
• Memory & processing co-design
• Large array yield challenges (EMLT)
• SRAM customizations (EMLT)
• Additional test vectors (SMS/SHS)
• HBM2 packaging expertise & support
[Charts: processing time and memory used vs. number of blocks, comparing recalculating activations against storing activations in memory; see the checkpointing sketch below]
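The recompute-versus-store tradeoff in the charts is what training frameworks expose as activation (gradient) checkpointing. A minimal sketch, assuming PyTorch is available; the use_reentrant flag exists in recent PyTorch versions:

```python
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU())
x = torch.randn(32, 1024, requires_grad=True)

# Instead of storing this block's activations, recompute them during the
# backward pass: less memory used, more processing time, as in the charts.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```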
Deep Learning SoC Challenges
Unique Requirements for Processing, Memory, Connectivity, Security
• Specialized Processing
• Memory Performance
• Real-Time Connectivity
• Security
DesignWare Die-to-Die SR-112
Support for Ultra & Extra Short Reach Standards
• Die-to-die interconnect options:
  • 56G NRZ (no FEC)
  • 112G PAM4 (no FEC)
  • 112G PAM4 with FEC
• Key features / tradeoffs: bandwidth, power, latency, reach, quality (the signaling arithmetic is sketched below)
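Why 112G PAM4 pairs naturally with FEC follows from the modulation arithmetic; a quick illustration:

```python
# NRZ carries 1 bit per symbol; PAM4 carries 2 bits per symbol.
def symbol_rate_gbd(line_rate_gbps, bits_per_symbol):
    return line_rate_gbps / bits_per_symbol

print(symbol_rate_gbd(56, 1))    # 56G NRZ   -> 56 GBd
print(symbol_rate_gbd(112, 2))   # 112G PAM4 -> 56 GBd, same baud rate
# PAM4 doubles the bandwidth at the same symbol rate but squeezes four
# voltage levels into the eye, so the with-FEC option trades latency for
# the bit-error-rate margin lost to the smaller eye height.
```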
Test Support for AI
• AI SoC impact as the market moves to 7nm:
  • Soft defects increase
  • The number of test modes needed increases
  • Expertise in IP, test, and subsystems becomes more important
• DesignWare STAR Hierarchical System: test integration, pattern porting & test scheduling of IP on the SoC
• DesignWare STAR Memory System: test, repair, and diagnostics for FinFET and planar memories
• DesignWare STAR Memory System supports eFlash and now eMRAM at 40nm and beyond
(A textbook memory-test sketch follows below.)
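For readers new to memory test, here is a textbook March C- pass over a simple RAM model; it illustrates the style of MBIST pattern involved, not the DesignWare STAR Memory System's actual algorithms:

```python
def march_c_minus(ram):
    """Textbook March C- over a bit-addressable RAM model (list of 0/1 values).
    Detects stuck-at, transition, and many coupling faults."""
    n = len(ram)
    up, down = range(n), range(n - 1, -1, -1)

    def elem(order, expect, write):
        for addr in order:
            if expect is not None and ram[addr] != expect:
                raise AssertionError(f"fault at address {addr}")
            if write is not None:
                ram[addr] = write

    elem(up, None, 0)    # ascending:  write 0
    elem(up, 0, 1)       # ascending:  read 0, write 1
    elem(up, 1, 0)       # ascending:  read 1, write 0
    elem(down, 0, 1)     # descending: read 0, write 1
    elem(down, 1, 0)     # descending: read 1, write 0
    elem(down, 0, None)  # descending: read 0

march_c_minus([0] * 1024)  # a fault-free RAM model passes silently
```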
Real-Time Connectivity Challenges in AI
NATURE: Great IP for AI Connectivity Challenges
• Rapid 7nm development
• Cache coherency
• High-speed chip-to-chip
• Latency
NURTURING: System Design Optimizations
• Area optimizations (e.g., DDR hardening)
• Time to market (subsystems)
• Industry expertise (standards expertise)
• Early software development (simulation/prototyping)
[Diagram: image sensors feeding AI SoCs over MIPI, with PCIe, CXL, or CCIX high-speed SerDes linking the AI SoCs to a host SoC]
Securing Deep Learning SoCs
DesignWare Security IP Solutions: Certified and Standards-Compliant
Secure authentication, data encryption, key management, platform security & content protection
• AI models are expensive and require updates
• AI applications use private data (facial recognition, biometrics)
• Integrity of the model matters: models corrupted by nefarious agents behave poorly
• A Trusted Execution Environment (TEE) with DesignWare IP secures neural network SoCs for AI applications (a model-encryption sketch follows below)
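As one concrete piece of the data-encryption and key-management story, here is a minimal sketch of protecting model weights at rest with authenticated encryption, assuming the Python cryptography package; a production flow would source the key from the TEE's key manager rather than generating it in software:

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)  # stand-in; use the hardware key manager
nonce = os.urandom(12)                     # never reuse a nonce with the same key
weights = b"\x00" * 1024                   # stand-in for a real model blob

aesgcm = AESGCM(key)
ciphertext = aesgcm.encrypt(nonce, weights, b"model-v1")  # AAD binds the version

# Inside the trusted execution environment, decrypt-and-verify before inference;
# a corrupted or tampered model fails authentication instead of behaving poorly.
assert aesgcm.decrypt(nonce, ciphertext, b"model-v1") == weights
```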
Nurturing Amazing AI SoCs
• Expertise reduces risk, improves PPA, and improves time to market: AI frameworks, graphs & mapping tools; large SRAM array analysis; IP subsystems; near-threshold libraries
• Industry-leading tools enable more competitive designs: ASIP Designer, HPC kit for the EV processor, nSIM, MetaWare, Platform Architect, SMS/SHS, HAPS & ZeBu
• Customizations optimize system performance: architectural analysis, benchmarking, security consulting, customized processors, custom memory cells, IP hardening