Visual Object Recognition Accelerator Based on Approximate In-Memory Processing Yeseong Kim, Mohsen Imani, Tajana Rosing University of California, San Diego Department of Computer Science and Engineering seelab.ucsd.edu
Internet of Things and Big Data • Internet of Things: billions to trillions of interconnected devices • 1.8 zettabytes of data generated in 2015, projected to grow by 50% by 2020 • Diverse applications must handle this Big Data in a system-efficient way http://www.iottechworld.com/
Cost of Operations A DRAM access consumes 170x more energy than an FPU multiply Ref: Dally, Tutorial, NIPS’15
Processing In Memory • Processing In Memory (PIM): performing part of the computation tasks inside the memory itself [Figure: general-purpose processor with many cores vs. large memory for Big Data with embedded computational logic]
Supporting In-Memory Operations • Supported operations: bitwise (OR, AND, XOR), search (multiple-row search / nearest search), and addition/multiplication (matrix multiplication) • Target applications: classification, clustering, database and query processing, deep learning, security, multimedia, HD computing, and graph processing
Machine Learning Acceleration • Machine learning is a popular choice to handle and assimilate Big Data, but usually requires lots of computation & tuning • e.g., deep neural networks • AdaBoost: one of the best off-the-shelf learning algorithms • Exploits an ensemble of weak learning models (e.g., decision trees) => robust & general purpose [Figure: dataset and ML model with computation acceleration, answering “Is this a face? What’s the probability?”]
DNN vs AdaBoost • Deep neural networks: • Show superior quality in recognition tasks • Have been accelerated on FPGA-, ASIC-, and PIM-based designs • Suffer significant energy and performance issues due to high computation complexity and the large memory footprints of the models • In contrast, our design targets AdaBoost, which: • Is a viable solution for diverse object recognition tasks without losing generality • Has been widely used in the computer vision field • Has a relatively lightweight learning method and shows better accuracy than DNNs in some image recognition cases, e.g., face detection • Requires less effort to tune parameters, and the trained models are easy to interpret
Example of AdaBoost Functionality Training set: 10 points (represented by plus or minus). Initial status: equal weights for all training samples www.ist.temple.edu/~vucetic/cis526fall2007/xin.ppt
Example of AdaBoost Functionality (cont’d) Round 1: Three “plus” points are not correctly classified; they are given higher weights. www.ist.temple.edu/~vucetic/cis526fall2007/xin.ppt
Example of AdaBoost Functionality (cont’d) Round 2: Three “minus” points are not correctly classified; they are given higher weights. www.ist.temple.edu/~vucetic/cis526fall2007/xin.ppt
Example of AdaBoost Functionality (cont’d) Round 3: One “minus” and two “plus” points are not correctly classified; they are given higher weights. www.ist.temple.edu/~vucetic/cis526fall2007/xin.ppt
Example of AdaBoost Functionality (cont’d) Final Classifier: integrate the three “weak” classifiers and obtain a final strong classifier. www.ist.temple.edu/~vucetic/cis526fall2007/xin.ppt
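As a concrete reference for the weight-update loop walked through above, here is a minimal NumPy sketch of AdaBoost with decision stumps as the weak learners; the function names (fit_stump, adaboost) and the brute-force stump search are illustrative, not part of the ORCHARD hardware.

```python
import numpy as np

def fit_stump(X, y, w):
    """Pick the single-feature threshold/sign that minimizes the weighted error."""
    best = None
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for sign in (1, -1):
                pred = np.where(sign * (X[:, f] - thr) > 0, 1, -1)
                err = np.sum(w[pred != y])
                if best is None or err < best[0]:
                    best = (err, f, thr, sign)
    return best

def adaboost(X, y, rounds=3):
    """Train `rounds` weak classifiers, re-weighting misclassified samples each round."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # equal weights for all samples initially
    ensemble = []
    for _ in range(rounds):
        err, f, thr, sign = fit_stump(X, y, w)
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-10))
        pred = np.where(sign * (X[:, f] - thr) > 0, 1, -1)
        w *= np.exp(-alpha * y * pred)           # misclassified points get higher weights
        w /= w.sum()
        ensemble.append((alpha, f, thr, sign))
    return ensemble

def predict(ensemble, X):
    """Final strong classifier: weighted vote of the weak classifiers."""
    score = sum(a * np.where(s * (X[:, f] - t) > 0, 1, -1) for a, f, t, s in ensemble)
    return np.sign(score)
```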
Object Recognition Acceleration • We conducted a case study on a popular classification problem: image object recognition • The classification decision is made by an ensemble of DT-MEM blocks • Additional memory blocks are designed for in-memory feature extraction [Figure: recognition pipeline from the input image through feature extraction to the DT-MEM ensemble]
Histogram of Oriented Gradient (HoG) • Describes the shape and appearance of target objects • Computes the gradient of every pixel by considering its adjacent pixels • The gradient can be represented by a vector with an orientation (direction) and a magnitude • A pixel of an image color channel can take 256 values and each cell includes 9 pixels, so covering all pixel combinations would require a prohibitively huge memory
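A minimal NumPy sketch of the per-pixel gradient described above, as a software reference point; the 9 orientation bins of 20 degrees each follow the usual HoG convention and are an assumption, not a number taken from the slide.

```python
import numpy as np

def pixel_gradients(img):
    """Per-pixel gradient magnitude and orientation bin (9 bins over 0-180 degrees)."""
    img = img.astype(np.float32)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]        # difference of horizontally adjacent pixels
    gy[1:-1, :] = img[2:, :] - img[:-2, :]        # difference of vertically adjacent pixels
    magnitude = np.hypot(gx, gy)                  # gradient vector magnitude
    orientation = np.rad2deg(np.arctan2(gy, gx)) % 180
    bins = (orientation // 20).astype(int)        # 9 orientation bins of 20 degrees each
    return magnitude, bins
```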
Approximate HoG • Optimize this memory size by storing only approximate and representative values • For example, the MNIST handwritten digits can represent the input pixels using two values, e.g., black and white • MNIST: 87% of the pixels in MNIST images are either 0 or 255 • WebFaces: many pixels have similar values in the middle range
Feature Extraction Acceleration • The address decoder quantizes each pixel into Q levels • E.g., for Q = 4, the 256 pixel values are quantized to 4 values: 00, 01, 10, and 11 • The quantized values are concatenated to form a memory address that selects a row of the crossbar memory block • Each row of the recipe memory includes: • di: the bin index of the gradient direction • mi: the magnitude [Figure: original computation vs. in-memory computation]
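A small software model of the address formation above, assuming the cell's pixels are quantized and concatenated into the row index; the exact bit layout and the contents of the recipe table (precomputed di/mi pairs) are assumptions for illustration.

```python
Q = 4                                  # quantization levels per pixel
BITS = Q.bit_length() - 1              # bits per quantized pixel (2 bits for Q = 4)

def quantize(pixel):
    """Map an 8-bit pixel value (0-255) to one of Q levels."""
    return pixel * Q // 256

def row_address(cell_pixels):
    """Concatenate the quantized pixel values of a cell into a crossbar row address."""
    addr = 0
    for p in cell_pixels:
        addr = (addr << BITS) | quantize(p)
    return addr

# recipe[addr] would hold (d_i, m_i): the precomputed orientation-bin index and
# magnitude for that combination of quantized pixel values.
```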
Haar-like Feature Extraction • A feature’s value is calculated as the difference between the sums of the pixels within the white and black rectangular regions
Integral Image • Each entry of the integral image stores the sum of all pixels in the rectangle from the top-left corner of the image to that point • To sum the pixel values of a region in a detection sub-window, there is no need to add every pixel: the sum of any rectangle D needs only its four corner values, D = P4 - P2 - P3 + P1 • Haar-feature value: Pixel_Sum(Rect_W) - Pixel_Sum(Rect_B) [Figure: original image, its integral image, and facial Haar features] Ref: Scalable Object Detection Accelerators on FPGAs Using Custom Design Space Exploration
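A minimal NumPy sketch of the integral-image trick above: each entry stores the pixel sum of the rectangle from the top-left corner, so any rectangle sum needs only four lookups (D = P4 - P2 - P3 + P1). The (top, left, height, width) rectangle convention is just for illustration.

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img[0:y, 0:x], with a zero-padded first row/column for clean indexing."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of pixels in a rectangle using only its four corner values (P4 - P2 - P3 + P1)."""
    return (ii[top + height, left + width] - ii[top, left + width]
            - ii[top + height, left] + ii[top, left])

def haar_feature(ii, white_rect, black_rect):
    """Haar-like feature value: Pixel_Sum(Rect_W) - Pixel_Sum(Rect_B)."""
    return rect_sum(ii, *white_rect) - rect_sum(ii, *black_rect)
```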
In-Memory Haar-Like Feature Extraction • The memory block is initialized with the integral image • Computes a Haar-like feature from two in-memory additions • The subsequent subtraction and weighting are processed by a small CMOS-based block • The weighted subtractor and the weighting logic are implemented using shift operations • The memory is optimized for write latency
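One way the weighting logic can reduce to shift operations is to restrict the rectangle weights to powers of two, so each multiplication becomes a single shift; the sketch below assumes that restriction (and the common 1:2 weighting of two-rectangle Haar features), which is not spelled out on the slide.

```python
import math

def shift_scale(value, weight):
    """Approximate value * weight with one shift, rounding the weight to the nearest power of two."""
    exp = round(math.log2(weight))
    return value << exp if exp >= 0 else value >> -exp

def weighted_haar(sum_white, sum_black, w_white=1, w_black=2):
    """Weighted subtraction w_white*Pixel_Sum(Rect_W) - w_black*Pixel_Sum(Rect_B) using shifts only."""
    return shift_scale(sum_white, w_white) - shift_scale(sum_black, w_black)
```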
In-Memory Decision Tree • A DT-MEM implements a decision tree based on the concept of auto-associative memory • 1) Activates the decision stump of the root node • 2) Auto-associative memory: performs a similarity search over the two enabled rows • 3) Repeatedly searches for the row most similar to the given buffer data until the node type flag is 0 • 4) Tree-based adder: combines the weak learners’ outputs based on their weights
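A purely functional software model of the DT-MEM traversal steps above; the six-field row layout is illustrative, and the CAM's nearest-row search is replaced here by an ordinary threshold comparison.

```python
def traverse(rows, features, root=0):
    """Walk the stored tree: follow child rows until the node-type flag marks a leaf (flag = 0)."""
    row = root
    while True:
        feat_idx, threshold, left, right, node_type, value = rows[row]
        if node_type == 0:                          # leaf: return the weak learner's decision
            return value
        row = left if features[feat_idx] <= threshold else right

def ensemble_decision(trees, alphas, features):
    """Tree-based adder: weighted sum of the weak learners' decisions (AdaBoost vote)."""
    score = sum(a * traverse(rows, features) for rows, a in zip(trees, alphas))
    return 1 if score >= 0 else -1
```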
Experimental Setup • C++ cycle-accurate simulator to model the ORCHARD functionality • Circuit-level simulation to estimate the performance and energy consumption of the proposed hardware • Cadence Virtuoso with 45nm CMOS technology • VTEAM memristor model [*] for the memory design simulation: • RON and ROFF of 10kΩ and 10MΩ, respectively • Evaluated energy and performance efficiency against existing processor-based implementations • Measured the power consumption of an Intel Xeon E5440 processor and an ARM Cortex A53 processor [*] S. Kvatinsky et al., “VTEAM: a general model for voltage-controlled memristors,” TCAS II, vol. 62, no. 8, pp. 786–790, 2015.
Object Recognition Accuracy • The proposed design successfully recognizes different objects using the same acceleration strategy • Benchmarks: • MNIST • 10000 WebFaces • INRIA Pedestrian • UIUC Vehicle
Tradeoff: Accuracy vs. Approximation • The approximate in-memory feature extraction improves efficiency with minimal accuracy loss, e.g., • MNIST: only 0.4% loss (97.5% accuracy) for Q = 2 and L = 1024 • WebFace: only 0.3% loss (96.7% accuracy) for Q = 6 and L = 2048
Energy & Performance Comparison • ORCHARD executes all the tasks inside memory: • 1,896X energy efficiency improvement and 376X speedup compared to an Intel Xeon E5440 • 552X energy efficiency improvement and 2,654X speedup compared to an ARM Cortex A53
In-Memory Computing Accelerator • Supports both training and testing • Classification: hyperdimensional classification, AdaBoost, DNN/CNN, decision tree, kNN • Clustering: k-means, hyperdimensional clustering • Database: query processing • Graph processing
Conclusion • We propose ORCHARD, which: • Accelerates two well-known feature extractors fully in memory • Accelerates the decision tree, the base learner of AdaBoost, using CAM and crossbar memory • Supports approximate in-memory computing • Our evaluation: • Tested on four practical image recognition tasks • ORCHARD achieves up to 1,896x energy efficiency improvement and 376x speedup • The accuracy loss due to the approximation is minimal: only 0.3%
Energy/Execution Breakdown • The feature extractors dominate, since they require many memory operations • Haar-like feature extractor: • 93% of its power is consumed writing the integral image • Write operations account for 63% of the latency (hence the write-latency optimization) • DT-MEM: only 4% of the energy, and parallelizable across different weak kernels
Area Overhead • The crossbar memories for the two feature extractors take 63.7% of the total area • DT-MEM blocks take 31.9% of the total area (86% CAM, 7.5% latch) • The tree-based adder takes 3.5% of the total area