Accelerating Machine Learning Applications using Delite

Anand Atreya, Kevin Brown, George Rossin • Stanford University • CS315A • 1st June, 2010

What is Machine Learning? • Learning patterns from data • Regression • Inference (e.g. Loopy Belief Propagation) • Adaptive control (e.g. Reinforcement Learning) • Neural networks (e.g. Restricted Boltzmann Machine) • A good domain for studying parallelism • Both throughput and latency are important • Many applications exhibit both data and task parallelism • Often at varying granularities • At the core of many emerging applications (speech recognition, robotics, data mining, etc.) • Many optimizations specific to the domain • e.g., Sacrificing accuracy for performance

Domain Specific Languages • A language or library that exploits domain knowledge for productivity and efficiency • Widely used in many application areas • MATLAB, Verilog, OpenGL • Raises the level of abstraction above that of general-purpose languages • The programmer describes what to do rather than how to do it • Allows for an implicitly parallel environment

OptiML: A DSL for ML • Provides a familiar (MATLAB-like) language and API for writing ML applications • Embedded in Scala • Encodes common ML kernels as implicitly parallel operations • Matrix multiply, dot product, etc.

What is Delite? • A dynamic parallel runtime • Domain Extracted Locality Informed Task Execution • Executes a task graph on parallel, heterogeneous hardware • CPUs, GPUs, etc. • Performs both static and dynamic scheduling • Integrates task and data parallelism in a single environment • Can apply dynamic domain-specific optimizations provided by a Domain-Specific Language

Delite Execution Model • Application calls Matrix DSL methods • DSL defers OP execution to Delite • Delite applies generic & domain transformations and generates mapping
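The deferral step on this slide can be illustrated with a small sketch. It is not Delite or OptiML code: the `Op` and `Matrix` classes and their methods are hypothetical names invented here, and plain Python lists stand in for real matrices. The point is only that calling a DSL method records a node in a task graph instead of computing a result, so a runtime could inspect, transform, and map the graph before anything runs.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Op:
    """A deferred operation, i.e. one node of the task graph (hypothetical type)."""
    name: str
    fn: Callable[..., object]
    deps: List["Op"] = field(default_factory=list)
    _result: object = None
    _done: bool = False

    def force(self):
        """Run the dependencies first, then this OP, and cache the result."""
        if not self._done:
            args = [d.force() for d in self.deps]
            self._result = self.fn(*args)
            self._done = True
        return self._result


def _add(a, b):
    # Element-wise sum of two equally sized list-of-lists matrices.
    return [[x + y for x, y in zip(r, s)] for r, s in zip(a, b)]


def _matmul(a, b):
    # Naive matrix product; zip(*b) iterates over the columns of b.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class Matrix:
    """DSL-facing wrapper: every method returns a new deferred Matrix."""

    def __init__(self, op):
        self.op = op

    @staticmethod
    def of(rows):
        return Matrix(Op("const", lambda: rows))

    def __add__(self, other):
        return Matrix(Op("add", _add, [self.op, other.op]))

    def __matmul__(self, other):
        return Matrix(Op("matmul", _matmul, [self.op, other.op]))

    def get(self):
        """Stand-in for the runtime: execute the graph that produces this value."""
        return self.op.force()
```

With `a = Matrix.of([[1, 2], [3, 4]])` and `b = Matrix.of([[1, 0], [0, 1]])`, the expression `(a @ b) + a` only builds a three-level graph; nothing is computed until `get()` is called. A real runtime would schedule the nodes across devices rather than walk them recursively on one thread as this sketch does.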

Scheduling • An NP-hard problem in general • Very simple local clustering algorithm for general-purpose scheduling • Checks for dependency on previous M OPs to minimize communication • Control flow hints • Allow an efficient parallel for-loop schedule when the loop iterations are independent, without an explicit parallelFor construct • Data-parallel operations • Splits each OP into N chunks for N threads
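Two of the ideas on this slide can be sketched briefly. The slide gives no algorithmic detail, so the code below is one plausible reading and not Delite's scheduler: `cluster_schedule` assumes that an OP is placed on the thread of its most recent dependency among the previous `window` OPs (the slide's M), and otherwise on the least-loaded thread; `split_into_chunks` and `parallel_map` assume that a data-parallel OP is an element-wise map over a list. All function and parameter names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def cluster_schedule(deps, n_threads, window):
    """Assign each OP to a thread.

    deps[i] is the set of earlier OP indices that OP i depends on.
    Only the previous `window` OPs are inspected (assumed meaning of M).
    """
    assignment = []
    load = [0] * n_threads
    for i, d in enumerate(deps):
        recent = range(max(0, i - window), i)
        hits = [j for j in recent if j in d]
        if hits:
            # Co-locate with the most recent dependency to avoid communication.
            thread = assignment[hits[-1]]
        else:
            # No nearby dependency: pick the least-loaded thread.
            thread = load.index(min(load))
        assignment.append(thread)
        load[thread] += 1
    return assignment


def split_into_chunks(n_items, n_threads):
    """Return one (start, end) index range per thread, covering 0..n_items."""
    base, extra = divmod(n_items, n_threads)
    ranges, start = [], 0
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def parallel_map(fn, data, n_threads):
    """Apply fn to every element, one contiguous chunk per worker thread."""
    ranges = split_into_chunks(len(data), n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        parts = pool.map(lambda r: [fn(x) for x in data[r[0]:r[1]]], ranges)
        return [y for part in parts for y in part]
```

For example, `cluster_schedule([set(), set(), {0}, {1}, {2, 3}], n_threads=2, window=2)` returns `[0, 1, 0, 1, 1]`: the two independent OPs land on different threads and each dependent OP follows a producer. `split_into_chunks(10, 4)` returns `[(0, 3), (3, 6), (6, 8), (8, 10)]`. Python threads are used only to show the chunking structure; they do not give CPU-bound speedups under the global interpreter lock.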

Integrating the GPU(s) • Portion of the task graph to be executed on the GPU is sent to a dedicated GPU scheduler • GPU scheduler identifies OP and sends appropriate CUDA kernel to GPU • Manages the GPU memory • Shipped data remains on GPU for fast re-use until memory overflows or CPU requests data
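The memory-management policy in the last bullet can be modeled with a small host-side sketch. It performs no real GPU work, and the `DeviceCache` class, its method names, and the byte sizes are hypothetical. The slide does not say which data is dropped on overflow, so least-recently-used eviction is an assumption made here; releasing the device copy when the CPU requests the data is likewise one reading of "until ... CPU requests data".

```python
from collections import OrderedDict


class DeviceCache:
    """Tracks which buffers are resident on a simulated device (hypothetical)."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.resident = OrderedDict()  # name -> (data, size), oldest use first
        self.transfers = 0             # number of host-to-device copies

    def ship(self, name, data, size):
        """Make `data` resident, reusing the existing copy if it is already there."""
        if name in self.resident:
            self.resident.move_to_end(name)  # mark as most recently used
            return
        if size > self.capacity:
            raise ValueError("buffer larger than device memory")
        # Overflow: evict least-recently-used buffers (assumed policy).
        while self.used + size > self.capacity:
            _, (_, freed) = self.resident.popitem(last=False)
            self.used -= freed
        self.resident[name] = (data, size)
        self.used += size
        self.transfers += 1

    def request_from_cpu(self, name):
        """The CPU asks for a buffer: hand it back and release the device copy."""
        data, size = self.resident.pop(name)
        self.used -= size
        return data
```

Shipping the same buffer twice costs only one transfer, which is the fast re-use the slide describes; a second buffer that does not fit alongside the first forces the first to be evicted.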

Experimental Results • Experiments use ML applications written in OptiML and executed with Delite • The application and the Delite scheduler run in a single thread, plus either N CPU worker threads or 1 GPU

ML Kernel Tests • 3 Application Kernels • Gaussian Discriminant Analysis • Naïve Bayes • Weighted Linear Regression • System 1: Multi-Core CPU & GPU Tests • Intel Nehalem • 2 sockets, 8 cores, 16 threads • 24 GB DRAM • NVIDIA GTX 275 GPU • System 2: Scalability Tests • Sun Niagara T2+ • 4 sockets, 32 cores, 256 threads • 128 GB DRAM

Gaussian Discriminant Analysis • [Chart; speedup labels: 2.4x, 2.6x, 3.4x, 3.9x, 13.1x, 18.7x] • *Normalized to execution time for 1 CPU

Naïve Bayes • [Chart; speedup labels: 2.2x, 3.5x, 5.6x, 7.6x]

Weighted Linear Regression • [Chart; speedup labels: 1.1x, 2.5x, 3.3x, 3.9x, 4.3x, 5.5x]

Multi-Core Scalability

Overheads: GDA

Deep Belief Networks (DBNs) • Very promising algorithms • Learns complex features • Shows great potential in solving difficult problems • Researched by Andrew Ng • Research is limited by compute power • Computation scales quadratically • Algorithm dominated by serial matrix multiplications

DBN Current Results • [Chart; speedup labels: 3.1x, 22.3x]

Conclusions • Domain knowledge facilitates implicit coarse-grained parallelism • Delite targets heterogeneous hardware automatically • Hits the sweet spot of ease-of-programming and scalable performance

Future Work • Hardware scheduling acceleration • Dataflow processing could become more feasible due to the natural expression of coarse-grained tasks in Delite • Static analysis of task graph • Allows intelligent scheduling before runtime • Task graph optimizations

Thank You! • Questions? • Thanks to Hassan Chafi, Arvind Sujeeth, HyoukJoong Lee, Nathan Bronson, and Kunle Olukotun
