MLSLP-2012: Learning Deep Architectures Using Kernel Modules (thanks to collaborations/discussions with many people) Li Deng, Microsoft Research, Redmond
Outline • Deep neural net (“modern” multilayer perceptron) • Hard to parallelize in learning • Deep Convex Net (Deep Stacking Net) • Hidden-layer size is limited, and part of the parameters are not learned convexly • (Tensor DSN/DCN) and Kernel DCN • K-DCN: combines the elegance of kernel methods with the high performance of deep learning • Linearity of pattern functions (kernel) and nonlinearity in deep nets
Deep Stacking Network (DSN) • “Stacked generalization” in machine learning: • Use a high-level model to combine low-level models • Aim to achieve greater predictive accuracy • This principle has been reduced to practice: • Learning parameters in DSN/DCN (Deng & Yu, Interspeech-2011; Deng, Yu, Platt, ICASSP-2012) • Parallelizable, scalable learning (Deng, Hutchinson, Yu, Interspeech-2012)
DCN Architecture • Many modules • Still easily trainable • Alternating linear & nonlinear sub-layers • Actual architecture for digit image recognition (10 classes): each module has 784 linear input units, 3000 hidden units, and 10 linear output units, shown here stacked with L = 3 modules • MNIST: 0.83% error rate (LeCun’s MNIST site)
Anatomy of a Module in DCN • Input x: 784 linear units • Hidden layer h: 3000 nonlinear units, with weights W set randomly (W_rand) or from an RBM (W_RBM) • Output: 10 linear units predicting the targets, with weights learned in closed form as U = pinv(h) t • The module’s 10 outputs are appended to the 784 raw inputs to feed the next module
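A minimal sketch of one such module (assuming NumPy; the sigmoid nonlinearity, weight scale, and function names are illustrative, not the authors' code): the hidden weights W stay fixed at random or RBM-initialized values, and the output weights have the closed-form least-squares solution U = pinv(H) T, which is what makes each module cheap to train and stack.

```python
import numpy as np

def train_dsn_module(X, T_targets, n_hidden=3000, seed=0):
    """X: (n_samples, 784) inputs; T_targets: (n_samples, 10) one-hot targets."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(X.shape[1], n_hidden))  # W_rand (an RBM-trained W_RBM could be used instead)
    H = 1.0 / (1.0 + np.exp(-X @ W))                        # nonlinear hidden layer h
    U = np.linalg.pinv(H) @ T_targets                       # closed-form U = pinv(h) t -- no backprop needed
    return W, U

def stack_input(X, W, U):
    """Concatenate the module's 10 linear outputs with the 784 raw inputs for the next module."""
    H = 1.0 / (1.0 + np.exp(-X @ W))
    return np.hstack([X, H @ U])
```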
Kernel-DCN • Replace each module by kernel ridge regression • Prediction y(x) = k(x)^T alpha, where alpha = (K + lambda*I)^{-1} Y and K_ij = k(x_i, x_j) • The symmetric kernel matrix corresponds to a kernel-induced feature space G that is typically infinite-dimensional (e.g., Gaussian kernel) • Kernel trick: no need to compute G directly • Problem: the inverse is expensive when the training set is large • Solutions: • Nystrom-Woodbury approximation • Reduced-rank kernel regression (feature vector selection)
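A hedged sketch of such a kernel ridge-regression module with a Gaussian kernel (function names and the default sigma and lam values are assumptions), showing where the expensive T x T solve appears:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for every pair of rows of A and B."""
    d2 = (A * A).sum(1)[:, None] + (B * B).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-d2 / (2.0 * sigma**2))

def fit_kernel_module(X, Y, lam=1e-3, sigma=1.0):
    K = gaussian_kernel(X, X, sigma)                      # T x T Gram matrix; G itself is never formed
    return np.linalg.solve(K + lam * np.eye(len(X)), Y)   # alpha = (K + lam*I)^{-1} Y -- the O(T^3) bottleneck

def predict_kernel_module(X_train, alpha, X_new, sigma=1.0):
    return gaussian_kernel(X_new, X_train, sigma) @ alpha  # y(x) = k(x)^T alpha
```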
Nystrom-Woodbury Approximation • Kernel ridge regression: alpha = (K + lambda*I)^{-1} Y, where K is the T x T kernel matrix • K is large when the number of training samples T is large • Nystrom-Woodbury approximation: • Sample m columns from K -> C (T x m), with intersection block W (m x m) • Approximation: K ≈ C W^+ C^T • SVD of W: O(m^3) • Computation of the regularized inverse via the Woodbury identity scales with T*m^2 rather than T^3
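A sketch of that route under simple assumptions (uniform column sampling, a well-conditioned sampled block; names are illustrative): approximate K from m sampled columns and apply the Woodbury identity so that only an m x m system is factored.

```python
import numpy as np

def nystrom_woodbury_fit(X, Y, kernel, m=5000, lam=1e-3, seed=0):
    """Approximate alpha = (lam*I + K)^{-1} Y using m sampled columns of K."""
    rng = np.random.default_rng(seed)
    T = len(X)
    idx = rng.choice(T, size=min(m, T), replace=False)
    C = kernel(X, X[idx])                 # T x m sampled columns of K
    W = C[idx, :]                         # m x m intersection block, so K ~= C W^+ C^T
    # Woodbury: (lam*I + C W^{-1} C^T)^{-1} = (1/lam) * (I - C (lam*W + C^T C)^{-1} C^T)
    inner = np.linalg.solve(lam * W + C.T @ C, C.T @ Y)   # only an m x m system is factored
    return (Y - C @ inner) / lam          # approximate alpha; cost ~ O(T*m^2 + m^3) instead of O(T^3)
```

For example, kernel=gaussian_kernel from the sketch above (or a lambda fixing sigma) plugs in directly; the returned alpha is used exactly like the exact solution.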
Nystrom-Woodbury Approximation • [Figure: eigenvalues of W for 5,000 sampled columns, with an enlarged view of the spectrum]
Nystrom-Woodbury Approximation • [Figure: kernel approximation error (5,000 samples), measured in the Frobenius norm and the 2-norm]
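A small helper to measure the same two error norms for a Nystrom reconstruction (sizes here are up to the caller; the slide's experiment used 5,000 sampled columns):

```python
import numpy as np

def nystrom_error(K, idx):
    """Error of the Nystrom reconstruction K_hat = C W^+ C^T for sampled column indices idx."""
    C = K[:, idx]
    W = K[np.ix_(idx, idx)]
    E = K - C @ np.linalg.pinv(W) @ C.T
    return np.linalg.norm(E, 'fro'), np.linalg.norm(E, 2)  # Frobenius norm and 2-norm (largest singular value)
```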
K-DSN Using Reduced-Rank Kernel Regression • Solution: feature vector selection • Identify a subset of the training data whose images form an approximate basis in the feature space • The kernel expansion can then be written as a sparse expansion involving only the terms corresponding to this subset • Feature vector selection algorithm from “Feature vector selection and projection using kernels” (G. Baudat and F. Anouar)
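A simplified greedy sketch in the spirit of that feature-vector-selection idea (the stopping rule, the jitter, and the "add the worst-represented point" heuristic are my assumptions, not the exact Baudat-Anouar algorithm): grow a subset S until every training point is well approximated by the span of the selected points in feature space.

```python
import numpy as np

def select_feature_vectors(K, max_size=500, tol=1e-3, jitter=1e-10):
    """K: full T x T kernel matrix. Returns indices of an approximate basis in feature space."""
    S = [int(np.argmax(np.diag(K)))]               # start from the largest-norm point
    while len(S) < max_size:
        K_SS = K[np.ix_(S, S)] + jitter * np.eye(len(S))
        K_S = K[S, :]                              # |S| x T cross-kernel
        # fitness of each point: ||projection onto span(S)||^2 / ||phi(x_t)||^2, a value in [0, 1]
        proj = np.einsum('it,it->t', K_S, np.linalg.solve(K_SS, K_S))
        fitness = proj / np.maximum(np.diag(K), 1e-12)
        worst = int(np.argmin(fitness))
        if 1.0 - fitness[worst] < tol:             # every point already well represented by S
            break
        S.append(worst)                            # add the worst-represented point to the basis
    return S
```

The kernel regression is then restricted to the selected columns, so the solve involves only |S| terms instead of T.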
Computation Analysis (run time) • D-dim input, M hidden units, training data size T, k outputs • DCN module run time per input scales with D, M, and k; the K-DCN module scales with T • The K-DCN cost is reduced if there are sparse support vectors or tree-VQ clusters
K-DCN: Layer-Wise Regularization • Two hyper-parameters in each module (e.g., the kernel width and the regularization parameter) • Tuned using cross-validation data • Relaxation at lower modules • Different regularization procedures for lower vs. higher modules
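A minimal sketch of the per-module tuning loop (the grid values and the fit/predict interface are assumptions; they match the kernel-module sketch above): each module's two hyper-parameters are chosen on held-out cross-validation data, and one plausible reading of "relaxation at lower modules" is to search a looser or coarser grid there.

```python
import numpy as np

def tune_module(X_tr, Y_tr, X_cv, Y_cv, sigmas, lams, fit, predict):
    """Grid-search the module's (sigma, lam) pair on held-out cross-validation data."""
    best = None
    for sigma in sigmas:
        for lam in lams:
            alpha = fit(X_tr, Y_tr, lam=lam, sigma=sigma)
            pred = predict(X_tr, alpha, X_cv, sigma=sigma)
            err = np.mean(np.argmax(pred, axis=1) != np.argmax(Y_cv, axis=1))
            if best is None or err < best[0]:
                best = (err, sigma, lam)
    return best  # (cv error, sigma, lam) for this module
```

For example, tune_module(X_tr, Y_tr, X_cv, Y_cv, sigmas=[1, 2, 4], lams=[1e-3, 1e-2], fit=fit_kernel_module, predict=predict_kernel_module) tunes one module; higher modules can reuse a narrower grid around the values found below them.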