Multiple Kernel Learning • Marius Kloft, Technische Universität Berlin (now: Courant / Sloan-Kettering Institute)
Machine Learning • Example: object detection in images • Aim: learning the relation between two random quantities X and Y from observations (x1, y1), …, (xn, yn) • Approach: kernel-based learning, i.e., a predictor built from a kernel k(x, x') = ⟨φ(x), φ(x')⟩ (a minimal sketch follows below)
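A minimal sketch of kernel-based learning with a single precomputed kernel, assuming scikit-learn and synthetic data; MKL, introduced next, replaces the single kernel matrix K by a weighted combination of several such matrices:

```python
# Kernel-based learning with one precomputed kernel (illustration only).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))        # observations x_i
y = np.where(X[:, 0] > 0, 1, -1)         # labels y_i (synthetic rule)
K = X @ X.T                              # a linear kernel matrix k(x_i, x_j)
clf = SVC(kernel="precomputed").fit(K, y)
```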
Multiple Views / Kernels (Lanckriet et al., 2004) • Multiple views of the data, e.g., color, shape, and space, each giving rise to its own kernel • How to combine the views? Weightings: k = θ1 k_color + θ2 k_shape + θ3 k_space
Computation of Weights? (Lanckriet et al., 2004; Bach et al., 2004) • State of the art: sparse weights • Entire kernels / views are completely discarded • But why discard information?
From Vision to Reality? • State of the art: sparse methods, empirically often ineffective (Gehler et al., Noble et al., Cortes et al., Shawe-Taylor et al., etc.; NIPS WS 2008, ICML, ICCV, …) • New methodology • more effective and more efficient in applications • learning bounds with rate up to O(M/n)
Presentation of Methodology: Non-sparse Multiple Kernel Learning
New Methodology • (Kloft et al., NIPS 2009; ECML 2009/10; JMLR 2011) • Model: mixture kernel k = Σ_{m=1..M} θ_m k_m with non-negative kernel weights θ_m • Generalized formulation • arbitrary loss • arbitrary norms on the weights, e.g., lp-norms ‖θ‖_p ≤ 1 • the 1-norm leads to sparsity; p > 1 yields non-sparse weightings • Computation of weights? Jointly with the classifier, via a single mathematical program; the problem is convex (a sketch of the primal follows below)
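A schematic of the lp-norm MKL primal in the spirit of the JMLR 2011 formulation (V denotes the loss; the Ivanov constraint on θ shown here and the Tikhonov-regularized variant are interchangeable up to reparametrization, and constants are omitted):

```latex
% lp-norm MKL primal (schematic), following Kloft et al. (JMLR 2011):
\min_{\theta \ge 0,\ \|\theta\|_p \le 1}\ \min_{w,\, b}\quad
  \frac{1}{2} \sum_{m=1}^{M} \frac{\|w_m\|_2^2}{\theta_m}
  \;+\; C \sum_{i=1}^{n} V\!\Big( \sum_{m=1}^{M} \langle w_m, \phi_m(x_i) \rangle + b,\; y_i \Big)
```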
Theoretical Analysis • (Kloft & Blanchard, NIPS 2011, JMLR 2012; Kloft, Rückert, Bartlett, ECML 2010) • Theoretical foundations of MKL: an active research topic (NIPS workshop 2009) • We show: • Theorem (Kloft & Blanchard). The local Rademacher complexity of lp-norm MKL is bounded in terms of the truncated spectra of the M kernels (schematic shape below) • Corollaries (learning bounds) • the bound depends on the kernel spectra (eigenvalue decay) • best known rate, up to O(M/n), improving on the global-complexity analysis of Cortes, Mohri, Rostamizadeh (ICML 2010)
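For orientation, the single-kernel (Mendelson-style) shape of such a local bound is sketched below; the MKL theorem replaces the eigenvalue sum by a blockwise ℓ_{p*/2}-structure over the M kernels. Constants and exact exponents are omitted; see Kloft & Blanchard (2011, 2012) for the precise statement:

```latex
% Schematic shape only: lambda_j are the eigenvalues of the kernel
% operator and r is the localization radius.
R_r(\mathcal{H}) \;\lesssim\; \sqrt{\frac{2}{n} \sum_{j \ge 1} \min\{\,r,\ \lambda_j\,\}}
```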
Proof (Sketch) • 1. Relate the original function class to the centered class • 2. Bound the complexity of the centered class, via the Khintchine–Kahane and Rosenthal inequalities (one ingredient shown below) • 3. Bound the complexity of the original class • 4. Relate the bound to the truncation of the spectra of the kernels
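One ingredient of step 2, the Khintchine–Kahane inequality in a commonly used Hilbert-space form (the constant C varies across statements):

```latex
% Rademacher variables eps_i, vectors u_i in a Hilbert space, q >= 2:
\Big( \mathbb{E}_{\varepsilon} \Big\| \sum_{i=1}^{n} \varepsilon_i u_i \Big\|^{q} \Big)^{1/q}
\;\le\; C \sqrt{q}\, \Big( \sum_{i=1}^{n} \| u_i \|^{2} \Big)^{1/2}
```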
Optimization • (Kloft et al., JMLR 2011) • Algorithms • Newton method • sequential, quadratically constrained programming with level-set projections • block-coordinate descent: alternate between solving (P) w.r.t. w and solving (P) w.r.t. θ (analytical update), until convergence (convergence proved); a sketch follows below • Implementation • in C++ ("SHOGUN toolbox"), with Matlab/Octave/Python/R interfaces • runtime roughly 1–2 orders of magnitude faster than previous MKL solvers
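A minimal sketch of the block-coordinate descent scheme, assuming scikit-learn, NumPy, and a list of precomputed kernel matrices; this illustrates the alternation and the analytical weight update from the JMLR 2011 paper, not the optimized SHOGUN implementation:

```python
# lp-norm MKL via block-coordinate descent (illustrative sketch).
import numpy as np
from sklearn.svm import SVC

def lp_mkl(kernels, y, p=2.0, C=1.0, n_iter=20, tol=1e-4):
    """kernels: list of (n, n) PSD kernel matrices; y: labels in {-1, +1}."""
    M = len(kernels)
    theta = np.full(M, M ** (-1.0 / p))          # uniform start, ||theta||_p = 1
    for _ in range(n_iter):
        # Step 1: solve w.r.t. w -- an SVM on the weighted kernel mixture.
        K = sum(t * Km for t, Km in zip(theta, kernels))
        svm = SVC(C=C, kernel="precomputed").fit(K, y)
        sv, dc = svm.support_, svm.dual_coef_.ravel()     # dc_i = alpha_i * y_i
        # Squared block norms: ||w_m||^2 = theta_m^2 * dc' K_m dc on the SVs.
        norms2 = np.array([t ** 2 * dc @ Km[np.ix_(sv, sv)] @ dc
                           for t, Km in zip(theta, kernels)])
        # Step 2: analytical weight update (Kloft et al., JMLR 2011):
        # theta_m proportional to ||w_m||^(2/(p+1)), renormalized to ||theta||_p = 1.
        new_theta = (norms2 + 1e-12) ** (1.0 / (p + 1))
        new_theta /= np.linalg.norm(new_theta, ord=p)
        if np.linalg.norm(new_theta - theta) < tol:
            theta = new_theta
            break
        theta = new_theta
    return theta, svm
```

At test time, the learned weights are reused: combine the train-test kernel blocks with the same θ before calling the SVM's predict.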
Toy Experiment • Design • two 50-dimensional Gaussians; mean μ1 := a binary vector, so zero-mean features are irrelevant • variance chosen so that the Bayes error stays constant across scenarios • 50 training examples • one linear kernel per feature • six scenarios, varying the percentage of irrelevant features (0%, 44%, 64%, 82%, 92%, 98% sparsity) • Results (test error) • [Figure: test error vs. sparsity for the six scenarios] • the choice of p is crucial; the optimal p depends on the true sparsity • the SVM fails in the sparse scenarios • the 1-norm is best in the sparsest scenario only • lp-norm MKL proves robust across all scenarios • Bounds • the bounds can be minimized w.r.t. p • the bounds reflect the empirical results well • (a sketch of the data generation follows below)
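A hypothetical reconstruction of the toy design in NumPy: two 50-dimensional Gaussians whose means differ only in the relevant features, with one linear kernel per feature. The exact variance calibration that keeps the Bayes error constant is omitted here; sigma is a free parameter in this sketch:

```python
# Toy data: class means +/- mu, irrelevant features have zero mean.
import numpy as np

def make_toy(n=50, d=50, frac_irrelevant=0.64, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    mu = np.ones(d)
    mu[: int(d * frac_irrelevant)] = 0.0          # zero-mean = irrelevant
    y = rng.choice([-1, 1], size=n)
    X = y[:, None] * mu + sigma * rng.standard_normal((n, d))
    # One linear kernel per feature (rank-one kernel matrices).
    kernels = [np.outer(X[:, j], X[:, j]) for j in range(d)]
    return X, y, kernels
```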
Applications • Lesson learned: the optimality of a method depends on the true underlying sparsity of the problem; the applications below exhibit non-sparsity • Applications studied: • Computer vision • visual object recognition • Bioinformatics • subcellular localization • protein fold prediction • DNA transcription start site detection • metabolic network reconstruction
Application Domain: Computer Vision • Visual object recognition • Aim: annotation of visual media (e.g., images) with object categories such as aeroplane, bicycle, bird • Motivation: content-based image retrieval
Application Domain: Computer Vision • Multiple kernels, based on: • color histograms • shapes (gradients) • local features (SIFT words) • spatial features
Application Domain: Computer Vision • Datasets • 1. VOC 2009 challenge: 7054 train / 6925 test images; 20 object categories (aeroplane, bicycle, …) • 2. ImageCLEF2010 challenge: 8000 train / 10000 test images, taken from Flickr; 93 concept categories (partylife, architecture, skateboard, …) • 32 kernels • 4 types, varied over color-channel combinations and spatial tilings (levels of a spatial pyramid); a sketch of the enumeration follows below • one-vs.-rest classifier for each visual concept
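An illustrative enumeration of such a 32-kernel design. Apart from BoW-S, which appears on the following slides, the feature-type, channel, and tiling names below are assumptions made for the sketch, not the paper's exact configuration:

```python
# 4 feature types x 2 channel sets x 4 spatial tilings = 32 kernels.
from itertools import product

feature_types = ["BoW-S", "BoW-C", "HoG", "HoC"]   # assumed type names
channels = ["gray", "rgb"]                          # assumed channel sets
tilings = ["1x1", "2x2", "3x1", "4x4"]              # assumed pyramid levels
kernel_configs = list(product(feature_types, channels, tilings))
assert len(kernel_configs) == 32
```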
Application Domain: Computer Vision • (Binder, Kloft, et al., 2010, 2011) • Challenge results • employed our approach in the ImageCLEF 2011 Photo Annotation challenge • achieved the winning entries in 3 categories • Preliminary results • using BoW-S kernels only gives worse results → BoW-S alone is not sufficient
Application Domain: Computer Vision • Experiment: disagreement of single-kernel classifiers • the BoW-S kernels induce more or less the same predictions • Why can MKL help? • some images are better captured by certain kernels than by others • different images may have different kernels that capture them well
Application Domain: Genetics • (Kloft et al., NIPS 2009, JMLR 2011) • Detection of transcription start sites, by means of kernels based on: • sequence alignments • distribution of nucleotides (downstream, upstream) • folding properties • binding energies and angles • Empirical analysis • detection accuracy (AUC): higher accuracies than sparse MKL and ARTS (for small n) • ARTS = winner of an international comparison of 19 models (Abeel et al., 2009)
Application Domain: Genetics • (Kloft et al., NIPS 2009, JMLR 2011) • Theoretical analysis • impact of the lp-norm on the bound • confirms the experimental results: stronger theoretical guarantees for the proposed approach (p > 1) • the empirically and theoretically optimal values of p approximately coincide
Application Domain: Pharmacology • Protein fold prediction • prediction of the fold class of a protein; the fold class is related to the protein's function (e.g., important for drug design) • data set and kernels from Y. Ying: 27 fold classes, fixed train and test sets • 12 biologically inspired kernels (e.g., hydrophobicity, polarity, van der Waals volume) • Results (accuracy) • 1-norm MKL and SVM on par • lp-norm MKL performs best: 6% higher accuracy than the baselines
Further Applications (all exhibiting non-sparsity) • Computer vision • visual object recognition • Bioinformatics • subcellular localization • protein fold prediction • DNA splice site detection • metabolic network reconstruction
Conclusion: Non-sparse Multiple Kernel Learning • Applications • visual object recognition: winner of the ImageCLEF 2011 Photo Annotation challenge • computational biology: gene detector competitive with the winner of an international comparison • Scalability: training with >100,000 data points and >1,000 kernels • Theory: sharp learning bounds
Thank you for your attention • I will be happy to answer any additional questions
References • Abeel, Van de Peer, Saeys (2009). Toward a gold standard for promoter prediction evaluation. Bioinformatics. • Bach (2008). Consistency of the Group Lasso and Multiple Kernel Learning. Journal of Machine Learning Research (JMLR). • Kloft, Brefeld, Laskov, and Sonnenburg (2008). Non-sparse Multiple Kernel Learning. NIPS Workshop on Kernel Learning. • Kloft, Brefeld, Sonnenburg, Laskov, Müller, and Zien (2009). Efficient and Accurate Lp-norm Multiple Kernel Learning. Advances in Neural Information Processing Systems (NIPS 2009). • Kloft, Rückert, and Bartlett (2010). A Unifying View of Multiple Kernel Learning. ECML. • Kloft and Blanchard (2011). The Local Rademacher Complexity of Lp-Norm Multiple Kernel Learning. Advances in Neural Information Processing Systems (NIPS 2011). • Kloft, Brefeld, Sonnenburg, and Zien (2011). Lp-Norm Multiple Kernel Learning. Journal of Machine Learning Research (JMLR), 12(Mar):953-997. • Kloft and Blanchard (2012). On the Convergence Rate of Lp-norm Multiple Kernel Learning. Journal of Machine Learning Research (JMLR), 13(Aug):2465-2502. • Lanckriet, Cristianini, Bartlett, El Ghaoui, Jordan (2004). Learning the Kernel Matrix with Semidefinite Programming. Journal of Machine Learning Research (JMLR).