SINGA: Putting Deep Learning into the Hands of Multimedia Users http://singa.apache.org/ Wei Wang, Gang Chen, Tien Tuan Anh Dinh, Jinyang Gao, Beng Chin Ooi, Kian-Lee Tan, and Sheng Wang
Introduction • Multimedia data and applications • Motivations • Deep learning models, training, and design principles • SINGA • Usability • Scalability • Implementation • Experiments
Introduction • Multimedia data: audio, image/video, text • Application domains: social media, e-commerce, health-care • Deep learning has been noted for its effectiveness for multimedia applications! • Example companies and services: VocalIQ (acquired by Apple), Madbits (acquired by Twitter), Perceptio (acquired by Apple), LookFlow (acquired by Yahoo! Flickr), Deepomatic (e-commerce product search), Descartes Labs (satellite images), Clarifai (tagging), AlchemyAPI (acquired by IBM), Semantria (NLP tasks in >10 languages), Ldibon, ParallelDots
Motivations: Model Categories • Feedforward models (CNN, MLP, auto-encoder): image/video classification (Krizhevsky, Sutskever, and Hinton, 2012; Szegedy et al., 2014; Simonyan and Zisserman, 2014a) • Energy models (DBN, RBM, DBM): speech recognition (Dahl et al., 2012) • Recurrent neural networks (RNN, LSTM, GRU): natural language processing (Mikolov et al., 2010; Cho et al., 2014) • Design Goal I. Usability: easy to implement various models
Motivations: Training Process • Training process: update model parameters to minimize prediction error • Training algorithm: mini-batch Stochastic Gradient Descent (SGD), with gradients computed by back-propagation (BP) or contrastive divergence (CD) • Training time = (time per SGD iteration) x (number of SGD iterations) • Training large models over large datasets takes a long time, e.g., 2 weeks to train OverFeat (Sermanet et al.), as reported by Intel (https://software.intel.com/sites/default/files/managed/74/15/SPCS008.pdf)
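To make the SGD update concrete, here is a minimal sketch of one parameter update in C++; the function SgdStep and the flat parameter/gradient vectors are illustrative assumptions for this deck, not SINGA's actual API.

#include <vector>

// One SGD step: w <- w - lr * dL/dw, applied element-wise to every parameter.
// The gradient is assumed to have been computed by BP or CD over one mini-batch.
void SgdStep(std::vector<float>& param, const std::vector<float>& grad, float lr) {
  for (size_t i = 0; i < param.size(); ++i)
    param[i] -= lr * grad[i];
}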
Motivations: Distributed Training Frameworks • Synchronous training (Google Sandblaster, Dean et al., 2012; Baidu AllReduce, Wu et al., 2015): reduces the time per iteration; scalable for a single node with multiple GPUs; cannot scale to large clusters • Asynchronous training (Google Downpour, Dean et al., 2012; Hogwild!, Recht et al., 2011): reduces the number of iterations per machine; scalable for large clusters of commodity (CPU) machines; not stable • Hybrid frameworks • Design Goal II. Scalability: not just flexible, but also efficient and adaptive enough to run different training frameworks
SINGA: A Distributed Deep Learning Platform
Usability: Abstraction • Core abstractions: NeuralNet, Layer, and TrainOneBatch • The Layer interface:

class Layer {
  vector<Blob> data, grad;
  vector<Param*> param;
  ...
  void Setup(LayerProto& conf, vector<Layer*> src);
  void ComputeFeature(int flag, vector<Layer*> src);
  void ComputeGradient(int flag, vector<Layer*> src);
};

Driver::RegisterLayer<FooLayer>("Foo");  // register new layers
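As an illustration of the abstraction, below is a sketch of a user-defined layer written against the interface above; FooLayer and the comments in the method bodies are hypothetical, not code shipped with SINGA.

class FooLayer : public Layer {
 public:
  void Setup(LayerProto& conf, vector<Layer*> src) {
    // Read hyper-parameters from conf; shape the data, grad, and param blobs
    // according to the source layers' outputs.
  }
  void ComputeFeature(int flag, vector<Layer*> src) {
    // Forward pass: read the source layers' data blobs, write this layer's data.
  }
  void ComputeGradient(int flag, vector<Layer*> src) {
    // Backward pass: fill param gradients and the grad blobs of the source layers.
  }
};

Driver::RegisterLayer<FooLayer>("Foo");  // now "Foo" can be referenced in job configurations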
Usability: Neural Net Representation • A NeuralNet connects layers (input, hidden, and loss layers, with labels fed to the loss layer) • The same representation covers feedforward models (e.g., CNN), RBM, and RNN
Usability: TrainOneBatch • TrainOneBatch visits the NeuralNet to train the model on one mini-batch: back-propagation (BP) for feedforward and recurrent models (e.g., CNN, RNN), contrastive divergence (CD) for energy models (e.g., RBM) • Just override the TrainOneBatch function to implement other algorithms!
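For example, a BP-style TrainOneBatch could be sketched as follows; the helpers layers(), srclayers(), and the kTrain flag are assumed names used only for illustration, not SINGA's exact internals.

void BPTrainOneBatch(NeuralNet* net) {
  const vector<Layer*>& layers = net->layers();       // assumed: topological order
  // Forward pass: compute each layer's feature from its source layers.
  for (Layer* layer : layers)
    layer->ComputeFeature(kTrain, net->srclayers(layer));
  // Backward pass: propagate gradients in reverse topological order.
  for (auto it = layers.rbegin(); it != layers.rend(); ++it)
    (*it)->ComputeGradient(kTrain, net->srclayers(*it));
  // Parameter gradients are then sent to the servers, which apply the SGD update.
}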
Scalability: Partitioning for Distributed Training • NeuralNet partitioning: 1. partition layers into different subsets; 2. partition each single layer on the batch dimension; 3. partition each single layer on the feature dimension; 4. hybrid partitioning combining 1, 2, and 3 • Users just need to CONFIGURE the partitioning scheme and SINGA takes care of the real work (e.g., slicing and connecting layers); a conceptual sketch of batch-dimension slicing follows
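As a conceptual illustration of scheme 2 (batch-dimension partitioning), the hypothetical helper below computes which slice of a mini-batch a given worker would hold; it is a sketch, not SINGA code.

#include <algorithm>
#include <utility>

// Returns {begin, size}: rows [begin, begin + size) of an n-example mini-batch
// assigned to worker `worker_id` out of `k` workers, balanced to within one row.
std::pair<int, int> BatchSlice(int n, int k, int worker_id) {
  int base = n / k, extra = n % k;
  int begin = worker_id * base + std::min(worker_id, extra);
  int size = base + (worker_id < extra ? 1 : 0);
  return {begin, size};
}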
Scalability: Training Framework • Cluster topology: nodes organized into worker groups and server groups; workers hold neural net replicas, servers hold parameters, and groups exchange them via inter-node communication • Synchronous training cannot scale to a large group size
Scalability: Training Framework • Communication is the bottleneck!
Scalability: Training Framework • SINGA is able to configure most known frameworks via the cluster topology: (a) Sandblaster and (b) AllReduce (synchronous), (c) Downpour and (d) Distributed Hogwild (asynchronous)
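To show how the cluster topology selects a framework, here is a rough sketch; the struct and field names are illustrative assumptions, and the comments paraphrase the four frameworks above rather than SINGA's exact configuration schema.

// Illustrative topology knobs (not SINGA's actual config fields).
struct ClusterTopology {
  int worker_groups;       // >1 groups run asynchronously with respect to each other
  int workers_per_group;   // workers inside a group train synchronously
  int server_groups;       // server group(s) maintain (replicas of) the parameters
  int servers_per_group;
};

// Assumed rough correspondence to the frameworks above:
//   (a) Sandblaster:         one worker group plus a separate server group (synchronous)
//   (b) AllReduce:           one worker group, servers co-located with workers (synchronous)
//   (c) Downpour:            many worker groups updating one server group (asynchronous)
//   (d) Distributed Hogwild: worker and server groups co-located per node, sharing
//                            parameters in memory (asynchronous)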
Implementation: SINGA Software Stack • Main thread: Driver::Train() starts training; Stub::Run() handles communication with remote nodes • Worker thread: while (not stop) Worker::TrainOneBatch() • Server thread: while (not stop) Server::Update() • Built-in models: CNN, RBM, RNN • SINGA components: Driver, Stub, Worker, Server • Optional components: Mesos, Zookeeper, HDFS, DiskFile, Docker • Runs on Ubuntu, CentOS, MacOS
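Expanding the two loops above into a sketch: StopNow() and the loop wrappers are assumed names for illustration; only TrainOneBatch() and Update() come from the slide.

// Worker thread: repeatedly trains on one mini-batch.
void WorkerLoop(Worker* worker) {
  while (!worker->StopNow()) {      // assumed stop check
    worker->TrainOneBatch();        // BP or CD over one mini-batch
    // Parameter gradients go to the stub, which routes them to (possibly remote) servers.
  }
}

// Server thread: repeatedly applies received gradients to its parameters.
void ServerLoop(Server* server) {
  while (!server->StopNow()) {      // assumed stop check
    server->Update();               // SGD update, then serve fresh parameter values
  }
}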
Deep Learning as a Service (DLaaS): SINGA's Rafiki • Third-party apps (web, mobile, ...) call the API and developers use a browser GUI; both issue HTTP requests to the Rafiki server • Rafiki server: user, job, model, and node management; database; file storage system (e.g., HDFS); routing (load balancing) • Requests are routed to Rafiki agents, each driving SINGA instances through Timon (a C++ wrapper) • Goals: 1. improve the usability of SINGA; 2. “level” the playing field by taking care of complex system plumbing work, and its reliability, efficiency, and scalability
Comparison: Features of the Systems • Comparison with other open-source projects (MXNet as of 28/09/15)
Experiment --- Usability • Used SINGA to train three known models and verify the results • RBM and deep auto-encoders: Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, Vol. 313, no. 5786, pp. 504-507, 28 July 2006.
Experiment --- Usability • Deep multi-modal neural network (CNN + MLP) • W. Wang, X. Yang, B. C. Ooi, D. Zhang, Y. Zhuang: Effective Deep Learning Based Multi-Modal Retrieval. VLDB Journal, special issue of VLDB'14 best papers, 2015. • W. Wang, B. C. Ooi, X. Yang, D. Zhang, Y. Zhuang: Effective Multi-Modal Retrieval Based on Stacked Auto-Encoders. Int'l Conference on Very Large Data Bases (VLDB), 2014.
Experiment --- Usability • RNN language model • Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, Sanjeev Khudanpur: Recurrent Neural Network Based Language Model. INTERSPEECH 2010, Makuhari, Chiba, JP.
Experiment --- Efficiency and Scalability • Train a DCNN over CIFAR-10 (https://code.google.com/p/cuda-convnet) • Single node: 4 NUMA nodes (Intel Xeon 7540, 2.0 GHz), 6 cores per node, hyper-threading enabled, 500 GB memory • Cluster: quad-core Intel Xeon 3.1 GHz CPU and 8 GB memory per node, 1 Gbps switch, 32 nodes, 4 workers per node • Compared with Caffe (GTX 970) • Synchronous training
Experiment --- Scalability • Train a DCNN over CIFAR-10 (https://code.google.com/p/cuda-convnet) • Single node and cluster, compared with Caffe • Asynchronous training
Conclusions • Programming Model, Abstraction, and System Architecture • Easy to implement different models • Flexible and efficient to run different frameworks • Experiments • Train models from different categories • Scalability test for different training frameworks • SINGA • Usable, extensible, efficient and scalable • Apache SINGA v0.1.0 has been released • V0.2.0 (with GPU-CPU, DLaaS, more features) out next month • Being used for healthcare analytics, product search, …
Thank You! Acknowledgement: Apache SINGA Team (ASF mentors, contributors, committers, and users) + funding agencies (NRF, MOE, ASTAR)