Real Time Machine Learning (RTML)
Andreas Olofsson, Program Manager, DARPA/MTO
Proposers Day, April 2, 2019
Distribution Statement "A": Approved for Public Release, Distribution Unlimited
Background
DARPA UPSIDE program (2012-2018): Unconventional Processing of Signals for Intelligent Data Exploitation
Objective: Exploit the physics of emerging devices, analog CMOS, and non-Boolean computational models to achieve new levels of performance and power for real-time sensor imaging systems.
Approach:
• TA1: Image Application for Benchmarking: Recreate a traditional image processing pipeline (IPP) using UPSIDE compute models, showing no degradation in performance; benchmarked using object classification and tracking applications.
• TA2: MS CMOS Demonstration: Mixed-signal CMOS implementation of the computational model and a system test bed showing a 1x10^5 combined speed-power improvement for analog CMOS.
• TA3: Emerging Device Implementation: Image processing demonstration combining next-generation devices (oscillators, graphene, memristors) with the new computation model; 1x10^7 combined improvement (projected).
Goal: Demonstrate the capability and a pathway toward embedded computing efficiency in ISR applications with >1,000x processing speed and >10,000x improvement in power consumption.
[Figure: image pixels mapped through an extracted library (e.g., edges, 3x3 pixels) into emerging devices and analog CMOS; analog/floating-gate pattern match and analog vector-matrix multiply with low-precision probabilistic computing algorithms yield detected salient pixels]
Selected UPSIDE results (University of Michigan, UCSB)
• Mixed-signal processing (50 TOPS/W)
• Sparse image reconstruction in memristors
• First memristor-based multilayer perceptron (2-layer MLP neural network)
• Flash-based 55 nm analog computing (>10 TOPS/W)
• Numerous publications (Nature, …)
Key takeaways:
• Analog computing beats digital on VMMs
• Challenges motivating RTML: comparing results (lack of data); the transition "valley of death"; high cost of design; manufacturing latency too long; manufacturability and scalability
Building a proper baseline for path-finding AI HW research
• The extreme expense of HW development means extremely sparse data
• How can we know if a new result is good without a baseline?
• A compiler would let us "paint the space" of possibilities
• Objective: Better science
[Figure source: ISSCC 2019]
Generating "right-sized" HW for SWaP-constrained systems
• 10-100X network tradeoffs
• Additional micro-tradeoffs (bit-width, pruning, etc.)
• Having more accuracy than needed wastes energy, latency, and power
• A compiler would enable generation of right-sized HW (see the sketch below)
• Objective: Enable new applications
[Figure source: A. Canziani et al., "An Analysis of Deep Neural Network Models for Practical Applications"]
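As a rough illustration of the right-sizing idea, the Python sketch below picks the cheapest candidate network that still meets an accuracy target. The network names and accuracy/compute numbers are approximate, illustrative values in the spirit of Canziani et al., not program data.

    # Right-sizing sketch: choose the smallest network that meets the accuracy requirement.
    CANDIDATES = [
        # (name, approximate top-1 accuracy, approximate inference GFLOPs) -- illustrative only
        ("squeezenet-like", 0.58, 0.8),
        ("mobilenet-like", 0.70, 1.1),
        ("resnet50-like", 0.76, 7.7),
        ("vgg16-like", 0.71, 31.0),
    ]

    def right_size(required_accuracy):
        """Return the lowest-cost network meeting the accuracy target, or None."""
        feasible = [c for c in CANDIDATES if c[1] >= required_accuracy]
        return min(feasible, key=lambda c: c[2]) if feasible else None

    for target in (0.55, 0.70, 0.75):
        print(target, "->", right_size(target))

Relaxing the accuracy target by a few points can cut compute by an order of magnitude, which is the 10-100X tradeoff the slide refers to.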
Optimizing hardware for ultra-low latency
• Current HW is optimized for throughput and programmability
• The extreme expense of HW development means the latency limits of ASICs are unexplored
• Green field: How low can we go?
• Objective: Enable new applications
[Figure source: NVIDIA]
Building bridges
• RTML (new!) connects application experts, ML experts, and platform experts around common frameworks (TensorFlow, PyTorch)
• Objective: Faster innovation
[Image sources: NVIDIA, Getty Images, Wikipedia]
Example of a low-latency application
[Figure source: Qualcomm, 2017]
DARPA RTML Program
DARPA RTML program details
The DARPA RTML program seeks to create no-human-in-the-loop hardware generators and compilers to enable fully automated creation of ML Application-Specific Integrated Circuits (ASICs) from high-level source code.
Phase 1: machine learning hardware compiler
• Develop a hardware generator that converts programs expressed in common ML frameworks (such as TensorFlow and PyTorch) into standard Verilog code and hardware configurations
• Generate synthesizable Verilog that can be fed into layout-generation tools, such as those from DARPA IDEA
• Demonstrate a compiler that auto-generates a large catalog of scalable ML hardware instances
• Demonstrate generation of instances for a diversity of architectures
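As a toy sketch only (not the program's generator, and with an invented layer-description format), the Python below emits a small piece of synthesizable Verilog for one dense-layer dot-product tile; a real Phase 1 generator would lower full TensorFlow/PyTorch graphs into many such modules plus their hardware configurations.

    def emit_dot_product_verilog(name, n_inputs, width=8):
        """Emit a toy synthesizable Verilog module for a fixed-size dot product.
        Interface and naming are illustrative only."""
        # Accumulator grows with product width plus log2 of the number of terms.
        acc_width = 2 * width + max(1, (n_inputs - 1).bit_length())
        ports = ",\n  ".join(
            [f"input  signed [{width-1}:0] a{i}, b{i}" for i in range(n_inputs)]
            + [f"output signed [{acc_width-1}:0] y"]
        )
        products = " + ".join(f"(a{i} * b{i})" for i in range(n_inputs))
        return (
            f"module {name} (\n  {ports}\n);\n"
            f"  assign y = {products};\n"
            f"endmodule\n"
        )

    # Example: lower a tiny dense layer described at the framework level
    # (here just a dict, standing in for a TensorFlow/PyTorch graph node).
    layer = {"type": "dense", "in_features": 4, "bits": 8}
    print(emit_dot_product_verilog("dense4_dot", layer["in_features"], layer["bits"]))

The emitted module can be handed to a downstream synthesis and layout flow, which is the hand-off point to the IDEA tools mentioned above.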
RTML general-purpose generator
The RTML generator should support a diversity of ML architectures. Architectures of interest include:
• conventional feed-forward (convolutional) neural networks;
• recurrent networks and their specialized versions;
• neuroscience-inspired architectures, such as spike time-dependent neural nets, including their stochastic counterparts;
• non-neural ML architectures inspired by psychophysics as well as statistical techniques;
• classical supervised learning (e.g., regression and decision trees);
• unsupervised learning (e.g., clustering) approaches;
• semi-supervised learning methods;
• generative adversarial learning techniques; and
• other approaches such as transfer learning, reinforcement learning, manifold learning, and/or life-long learning.
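Purely as a hypothetical interface sketch (the function, parameters, and family names below are invented for illustration, not defined by the BAA), a single generator entry point covering several of the families above might look like this:

    ARCHITECTURE_FAMILIES = {
        "cnn": {"layers": 18, "bits": 8},
        "rnn": {"cell": "lstm", "hidden": 256, "bits": 8},
        "snn": {"neuron": "lif", "stochastic": True},
        "clustering": {"algorithm": "kmeans", "k": 16},
        "gan": {"generator_layers": 6, "discriminator_layers": 4},
    }

    def generate_instance(family, **overrides):
        """Return an illustrative hardware-configuration record for one instance."""
        config = dict(ARCHITECTURE_FAMILIES[family], **overrides)
        return {"family": family, "config": config, "rtl_top": f"{family}_top.v"}

    # One catalog entry per family, plus a tuned variant of one of them.
    catalog = [generate_instance(f) for f in ARCHITECTURE_FAMILIES]
    catalog.append(generate_instance("cnn", bits=4, layers=8))

The point is that one tool, driven by per-family parameters, produces the catalog of scalable instances called for in Phase 1.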
Phase 1 RTML generator metrics (footnotes to the metrics table):
(1) The program is interested in real work accomplished per Watt, not arbitrary peak mathematical ops/W. As general guidance we specify 10 TOPS/W at 14nm as a minimum threshold, with the understanding that efficiency numbers are tightly coupled to accuracy, data sets, and actual applications. The efficiency metric includes all SoC power, including the IO power needed to sustain peak throughput, and is based upon normalized MACs for the proposed application.
(2) To demonstrate a general-purpose ML compiler, teams are expected to complete GDSII implementation of multiple ML architectures.
(3) Delivered with a minimum of government purpose rights; open-source licenses are preferred.
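A back-of-envelope check of what the 10 TOPS/W threshold implies. The MAC count and clock rate below are assumed for illustration only; as footnote (1) notes, the metric is tied to normalized MACs for the proposed application rather than peak ops.

    # Assumed array size and clock, purely for illustration.
    macs = 4096          # assumed number of MAC units
    clock_hz = 1.0e9     # assumed 1 GHz clock
    ops_per_mac = 2      # one multiply + one add

    peak_tops = macs * clock_hz * ops_per_mac / 1e12   # 8.192 TOPS
    power_budget_w = peak_tops / 10.0                  # W allowed at 10 TOPS/W

    print(f"Peak throughput: {peak_tops:.3f} TOPS")
    print(f"Total SoC power budget (incl. IO) at 10 TOPS/W: {power_budget_w:.3f} W")

Under these assumed numbers, the entire SoC, including IO, would have to stay under roughly 0.8 W while sustaining peak throughput.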
An introduction to the IDEA silicon compiler (RTL/schematic to GDSII)
[Diagram: training data and models feed the IDEA unified layout generator, which produces chip, package, and board layouts in 24 hours with no human in the loop]
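In the spirit of that no-human-in-the-loop flow, a wrapper might look like the hypothetical Python sketch below. Every tool command here is a placeholder, not an actual IDEA tool name or interface.

    import subprocess
    import time

    # Hypothetical RTL-to-GDSII flow. All command names are placeholders.
    FLOW = [
        ["synth_tool", "--rtl", "rtml_top.v", "--out", "netlist.v"],       # placeholder
        ["pnr_tool", "--netlist", "netlist.v", "--out", "layout.def"],     # placeholder
        ["signoff_tool", "--layout", "layout.def", "--out", "chip.gds"],   # placeholder
    ]

    def run_flow(deadline_hours=24.0):
        """Run each stage with no human intervention; fail on any error or if
        the 24-hour budget is exceeded."""
        start = time.time()
        for cmd in FLOW:
            try:
                result = subprocess.run(cmd)
            except FileNotFoundError:
                return False  # placeholder tool not installed
            if result.returncode != 0:
                return False
            if (time.time() - start) / 3600.0 > deadline_hours:
                return False
        return True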
An introduction to the CHIPS interface
AIB (Advanced Interface Bus) is a PHY-level interface standard for high-bandwidth, low-power die-to-die communication.
• AIB is a clock-forwarded parallel data transfer, like DDR DRAM
• High density with a 2.5D interposer (e.g., CoWoS, EMIB) for multi-chip packaging
• AIB is PHY level (OSI Layer 1); protocols such as AXI-4 can be built on top of AIB
AIB performance:
• 1 Tbps/mm of shoreline
• ~0.1 pJ/bit
• <5 ns latency
Open source: standard and reference implementation at https://github.com/intel/aib-phy-hardware
Adopters: Boeing, Intrinsix, Synopsys, Intel, Lockheed Martin, Sandia, Jariet, NCSU, U. of Michigan, Ayar Labs
[Diagram: your chiplet (ADC/DAC, machine learning, memory, processors, adjacent IP, etc.) connected to our chiplet over AIB]
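Quick arithmetic with the AIB figures above; the 2 mm shoreline length is an assumed example, while the bandwidth density and energy per bit come from the slide.

    shoreline_mm = 2.0          # assumed interface length
    tbps_per_mm = 1.0           # 1 Tbps/mm of shoreline (from the slide)
    pj_per_bit = 0.1            # ~0.1 pJ/bit (from the slide)

    bandwidth_tbps = shoreline_mm * tbps_per_mm                # 2 Tbps
    link_power_w = bandwidth_tbps * 1e12 * pj_per_bit * 1e-12  # 0.2 W at full rate

    print(f"Die-to-die bandwidth: {bandwidth_tbps:.1f} Tbps")
    print(f"Interface power at full rate: {link_power_w:.2f} W (plus <5 ns latency)")

In other words, an RTML chiplet could exchange terabits per second with an adjacent sensor or processor chiplet for a fraction of a watt.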
Phase 2: real time machine learning systems
• Design-space exploration through circuit implementation of multiple ML architectures
• A general-purpose, tunable generator that can support optimization of ML hardware for specific requirements
• Hardware demonstration of RTML for a particular application area
• Application areas: future high-bandwidth wireless communication systems, such as the 60 GHz range of the 5G standard, and high-bandwidth image processing in SWaP-constrained systems
• DARPA will provide fabrication support through a number of separately funded multi-project or dedicated wafer runs
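To make the "real time" constraint concrete for the wireless application area, the sketch below uses standard 5G NR mmWave numerology figures (not numbers from the slide or the BAA) to show the time scale an in-loop ML block would have to meet.

    # Example 5G NR numerology (mu = 3): 120 kHz subcarrier spacing, 14 symbols per slot.
    subcarrier_spacing_hz = 120e3
    symbols_per_slot = 14

    slot_s = 1e-3 / (subcarrier_spacing_hz / 15e3)   # 0.125 ms slot
    symbol_s = slot_s / symbols_per_slot             # ~8.9 us per OFDM symbol

    print(f"Slot duration:   {slot_s*1e6:.1f} us")
    print(f"Symbol duration: {symbol_s*1e6:.2f} us")
    print("ML that adapts within a slot must infer in well under "
          f"{slot_s*1e6:.0f} us, far below typical cloud or edge-server latencies.")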
Phase 2 RTML metrics (footnotes to the metrics table):
(1) Teams are expected to explore a wide trade space of power, latency, accuracy, and data throughput and show the ability to tune hardware over a large range of performance metrics. Max values are not expected to be achieved simultaneously.
(2) Power must include everything needed to operate, including power delivery, thermal management, external memory, and sensor interfaces.
(3) For example, ResNet-152 has an accuracy of >0.96 on the ImageNet database: http://image-net.org/challenges/LSVRC/2015/results
(4) Proposals are expected to outline a clear plan for validating the quality of the compiler output, including details of the publicly available benchmarks and datasets from industry, government, and academia that will be used.
(5) Delivered with a minimum of government purpose rights; open-source licenses are preferred.
RTML schedule
• 0 months (Fall 2019): Kickoff workshop
• 9 months (mid 2020): Alpha release of the RTML generator at a joint NSF/DARPA workshop
• 18 months (Spring 2021): Release of the V1.0 RTML generator and demonstration with an RTML compiler flow
• 27 months (end 2021): Release of the V1.5 tunable hardware generator
• 36 months (Fall 2022): Hardware demonstration of a real-time machine for a specific application
RTML seeks answers to the following research questions
• Can we build an application-specific silicon compiler for RTML?
• What subset of current ML frameworks' syntax/methods can be supported by a compiler?
• What needs to be added to current ML frameworks to support efficient translation?
• What hardware architectures are best suited for real-time operation?
• What are the lower limits on latency for various RTML tasks?
• What is the lowest SWaP feasible for various RTML tasks?
• What are the tradeoffs between energy efficiency, throughput, latency, area, and accuracy?
RTML does NOT seek proposals for these areas
• Investigatory research that does not result in deliverable hardware designs
• Circuits that cannot be produced in standard CMOS foundries (e.g., 14nm)
• New domain-specific languages
• New approaches to physical layout (RTL to GDSII)
• Incremental efforts
Joint NSF collaboration
NSF and DARPA team to explore rapid development of energy-efficient hardware and real-time machine learning architectures.
• NSF: Single-phase, exploratory research into circuit architectures and algorithms
• DARPA Phase 1: Fully automated hardware generators ("compilers") for state-of-the-art machine learning algorithms and networks, using existing programming frameworks (TensorFlow, etc.) as inputs
• DARPA Phase 2: Deliver novel machine learning architectures and circuit generators that enable real-time machine learning for autonomous machines
• Joint solicitation release and workshops at 9 and 18 months into each phase
• DARPA teams pull in NSF work during the Phase 1 to Phase 2 transition
[Timeline: alpha release → V1.0 release and GDSII delivery → V1.5 release and tapeout → silicon demo]
Collaboration and licensing
Required:
• Collaboration with other program performers
• Active participation in joint DARPA-NSF workshops every 9 months
• Open interfaces
Strongly encouraged:
• Publishing code and results early and often
• Permissive (non-viral, non-proprietary) open-source licensing
Funding of DARPA RTML Phase 2
• RTML includes a base Phase 1 and an option Phase 2
• The proposed planning and costing by Phase (and by Task) provides DARPA with convenient times to evaluate funding options and technical progress
• Progression into Phase 2 is not guaranteed; factors that may affect Phase 2 funding decisions include:
  • Availability of funding
  • Cost of proposals selected for funding
  • Demonstrated performance relative to program goals
  • Interaction with government evaluation teams
  • Compatibility with potential national security needs
Important dates
• BAA posting date: March 15, 2019
• Proposers Day: April 2, 2019
• FAQ submission deadline: April 15, 2019 at 1:00 PM
• DARPA will post a consolidated Question and Answer (FAQ) document on a regular basis at http://www.darpa.mil/work-with-us/opportunities
• Proposal due date: May 1, 2019 at 1:00 PM
• Estimated period of performance start: October 2019
• Questions: HR001119S0037@darpa.mil
Evaluation criteria, in order of importance
• Overall Scientific and Technical Merit
  • The proposed technical approach is innovative, feasible, achievable, and complete
  • A clear and feasible plan for release of high-quality software is provided
  • Task descriptions and associated technical elements are complete and in a logical sequence, with all proposed research clearly defined such that a final outcome that achieves the program goals can be expected
• Potential Contribution and Relevance to the DARPA Mission
  • Note the updated wording, with an emphasis on contribution to U.S. national security and U.S. technological capabilities
• Impact on the Machine Learning Landscape
  • The proposed research will successfully complete a fundamental exploration of the tradeoffs between system efficiency and performance for a number of ML architectures
  • The proposed research significantly advances the state of the art in machine learning hardware
• Cost Realism
  • Proposed costs are realistic for the technical and management approach and accurately reflect the goals and objectives of the solicitation
  • Proposed costs are sufficiently detailed, complete, and consistent with the Statement of Work
Agenda
"Distribution Statement "A" Approved for Public Release, Distribution Unlimited"