Accelerating Machine Learning Applications using Delite • Anand Atreya, Kevin Brown, George Rossin • Stanford University, CS315A • 1st June, 2010
What is Machine Learning? • Learning patterns from data • Regression • Inference (e.g. Loopy Belief Propagation) • Adaptive control (e.g. Reinforcement Learning) • Neural networks (e.g. Restricted Boltzmann Machine) • A good domain for studying parallelism • Both throughput and latency are important • Many applications exhibit both data and task parallelism • Often at varying granularities • At the core of many emerging applications (speech recognition, robotics, data mining, etc.) • Many optimizations specific to the domain • e.g., Sacrificing accuracy for performance
Domain Specific Languages • A language or library that exploits domain knowledge for productivity and efficiency • Widely used in many application areas • MATLAB, Verilog, OpenGL • Raises the level of abstraction above that of general-purpose languages • The programmer describes what to do rather than how to do it • Allows for an implicitly parallel environment
OptiML: A DSL for ML • Provides a familiar (MATLAB-like) language and API for writing ML applications • Embedded in Scala • Encodes common ML kernels as implicitly parallel operations • Matrix multiply, dot product, etc.
What is Delite? • A dynamic parallel runtime • Domain Extracted Locality Informed Task Execution • Executes a task graph on parallel, heterogeneous hardware • CPUs, GPUs, etc. • Performs both static and dynamic scheduling • Integrates task and data parallelism in a single environment • Can apply dynamic domain-specific optimizations provided by a Domain-Specific Language
Delite Execution Model • Calls Matrix DSL methods • DSL defers OP execution to Delite • Delite applies generic & domain transformations and generates mapping
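The deferral step can be illustrated with a small sketch. This is not Delite's or OptiML's API: the names `Op`, `Matrix`, `force`, and `value` are hypothetical, and the sketch runs the graph serially where Delite would transform and schedule it. It only shows how a DSL method can record an OP in a task graph instead of computing a result immediately.

```python
# Hypothetical sketch of deferred execution; none of these names come from Delite.
class Op:
    """A deferred operation: one node in the task graph."""
    def __init__(self, name, fn, deps=()):
        self.name = name
        self.fn = fn
        self.deps = list(deps)   # OPs whose results this OP consumes
        self._done = False
        self._value = None

    def force(self):
        """Run this OP's dependencies first, then the OP itself, at most once."""
        if not self._done:
            args = [d.force() for d in self.deps]
            self._value = self.fn(*args)
            self._done = True
        return self._value


class Matrix:
    """DSL-facing proxy: its methods record OPs instead of computing results."""
    def __init__(self, op):
        self.op = op

    @staticmethod
    def of(rows):
        return Matrix(Op("const", lambda: rows))

    def __add__(self, other):
        def add(a, b):
            return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]
        return Matrix(Op("add", add, (self.op, other.op)))

    def __matmul__(self, other):
        def mul(a, b):
            return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
                    for row in a]
        return Matrix(Op("matmul", mul, (self.op, other.op)))

    def value(self):
        """Force the graph rooted at this matrix (serially, in this sketch)."""
        return self.op.force()


a = Matrix.of([[1, 2], [3, 4]])
b = Matrix.of([[5, 6], [7, 8]])
c = (a @ b) + a              # builds a two-OP graph; nothing has run yet
deferred = not c.op._done    # True: the "add" OP is still pending
result = c.value()           # runs matmul, then add
```

Because `c` holds a graph rather than a value, a runtime has the chance to inspect, reorder, or distribute the pending OPs before anything executes.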
Scheduling • An NP-hard problem in general • Very simple local clustering algorithm for general-purpose scheduling • Checks for dependencies on the previous M OPs to minimize communication • Control flow hints • Allow an efficient parallel for-loop schedule when the loop iterations are independent, without an explicit parallelFor construct • Data-parallel operations • Splits each OP into N chunks for N threads
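Two of the ideas above can be sketched in a few lines. The slides do not give the actual algorithms, so everything here is an assumption: `split_into_chunks`, `parallel_map`, and `cluster_schedule` are hypothetical names, and the tie-breaking rule (follow the first recent dependency, otherwise pick the least-loaded thread) is one plausible reading of "local clustering", not Delite's implementation. Python threads also do not speed up CPU-bound work, so the example shows only the partitioning and placement logic.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(n_items, n_chunks):
    """Return (start, end) ranges covering range(n_items) as evenly as possible."""
    base, extra = divmod(n_items, n_chunks)
    ranges, start = [], 0
    for i in range(n_chunks):
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def parallel_map(fn, data, n_threads):
    """Data-parallel OP: each of n_threads processes one contiguous chunk."""
    ranges = split_into_chunks(len(data), n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        parts = pool.map(lambda r: [fn(x) for x in data[r[0]:r[1]]], ranges)
        return [y for part in parts for y in part]


def cluster_schedule(ops, n_threads, window):
    """Assumed clustering rule. ops is a list of (name, deps) in program order.

    An OP is placed on the thread of a dependency found among the previous
    `window` OPs; otherwise it goes to the least-loaded thread.
    """
    placement, load = {}, [0] * n_threads
    for i, (name, deps) in enumerate(ops):
        recent = [n for n, _ in ops[max(0, i - window):i]]
        local = [placement[d] for d in deps if d in recent]
        thread = local[0] if local else load.index(min(load))
        placement[name] = thread
        load[thread] += 1
    return placement


squares = parallel_map(lambda x: x * x, list(range(10)), n_threads=4)
graph = [("a", []), ("b", []), ("c", ["a"]), ("d", ["b"]), ("e", ["c", "d"])]
plan = cluster_schedule(graph, n_threads=2, window=2)
```

In this small graph, `c` follows `a` and `d` follows `b`, so each dependent OP stays on the thread that already holds its input.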
Integrating the GPU(s) • Portion of the task graph to be executed on the GPU is sent to a dedicated GPU scheduler • GPU scheduler identifies OP and sends appropriate CUDA kernel to GPU • Manages the GPU memory • Shipped data remains on GPU for fast re-use until memory overflows or CPU requests data
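The residency policy in the last bullet can be modeled without a GPU. The sketch below is a simulation only: `DeviceCache` and its methods are hypothetical, no CUDA calls are made, and least-recently-used eviction is an assumption, since the slides say only that data stays on the GPU until memory overflows or the CPU requests it.

```python
from collections import OrderedDict


class DeviceCache:
    """Tracks which buffers are resident on a simulated fixed-capacity device."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.resident = OrderedDict()   # name -> size, least recently used first
        self.used = 0
        self.transfers = 0              # host-to-device copies performed

    def ensure_on_device(self, name, size):
        """Make `name` resident, copying it only if it is not already there."""
        if name in self.resident:
            self.resident.move_to_end(name)   # re-use: no copy needed
            return
        if size > self.capacity:
            raise MemoryError(f"{name} does not fit on the device")
        while self.used + size > self.capacity:
            # Memory would overflow: evict the least recently used buffer
            # (assumed policy).
            _, freed = self.resident.popitem(last=False)
            self.used -= freed
        self.resident[name] = size
        self.used += size
        self.transfers += 1

    def copy_to_host(self, name):
        """The CPU requests the data, so the device copy is released."""
        self.used -= self.resident.pop(name)


cache = DeviceCache(capacity_bytes=100)
cache.ensure_on_device("x", 60)
cache.ensure_on_device("x", 60)   # already resident, no second transfer
cache.ensure_on_device("y", 30)
cache.ensure_on_device("z", 50)   # overflow: "x" is evicted to make room
```

The point of keeping data resident is visible in `transfers`: the second use of `x` costs no copy, which matters when the same matrix feeds several consecutive kernels.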
Experimental Results • Performed using ML applications written in OptiML and using Delite • The application and the Delite scheduler run in a single thread, plus: • Either N CPU worker threads • Or 1 GPU
ML Kernel Tests • 3 Application Kernels • Gaussian Discriminant Analysis • Naïve Bayes • Weighted Linear Regression • System 1: Multi-Core CPU & GPU Tests • Intel Nehalem • 2 sockets, 8 cores, 16 threads • 24 GB DRAM • NVIDIA GTX 275 GPU • System 2: Scalability Tests • Sun Niagara T2+ • 4 sockets, 32 cores, 256 threads • 128 GB DRAM
Gaussian Discriminant Analysis • [Chart of speedups, normalized to execution time for 1 CPU. Labeled values: 2.4x, 2.6x, 3.4x, 3.9x, 13.1x, 18.7x]
Naïve Bayes • [Chart of speedups. Labeled values: 2.2x, 3.5x, 5.6x, 7.6x]
Weighted Linear Regression • [Chart of speedups. Labeled values: 1.1x, 2.5x, 3.3x, 3.9x, 4.3x, 5.5x]
Deep Belief Networks (DBNs) • Very promising algorithms • Learns complex features • Shows great potential in solving difficult problems • Researched by Andrew Ng • Research is limited by compute power • Computation scales quadratically • Algorithm dominated by serial matrix multiplications
DBN Current Results • [Chart of speedups. Labeled values: 3.1x, 22.3x]
Conclusions • Domain knowledge facilitates implicit coarse-grained parallelism • Delite targets heterogeneous hardware automatically • Hits the sweet spot of ease-of-programming and scalable performance
Future Work • Hardware scheduling acceleration • Dataflow processing could become more feasible due to the natural expression of coarse-grained tasks in Delite • Static analysis of task graph • Allows intelligent scheduling before runtime • Task graph optimizations
Thank You! • Questions? • Thanks to Hassan Chafi, Arvind Sujeeth, HyoukJoong Lee, Nathan Bronson, and Kunle Olukotun