Density Ratio Estimation: A New Versatile Tool for Machine Learning Masashi Sugiyama Department of Computer Science, Tokyo Institute of Technology sugi@cs.titech.ac.jp http://sugiyama-www.cs.titech.ac.jp/~sugi
Overview of My Talk (1) • Consider the ratio of two probability densities: w(x) = p_nu(x) / p_de(x). • If the ratio is estimated, various machine learning problems can be solved! • Non-stationarity adaptation, domain adaptation, multi-task learning, outlier detection, change detection in time series, feature selection, dimensionality reduction, independent component analysis, conditional density estimation, classification, two-sample test
Overview of My Talk (2) • Vapnik said: When solving a problem of interest, one should not solve a more general problem as an intermediate step. • Estimating the density ratio is substantially easier than estimating the two densities! • Various direct density ratio estimation methods have been developed recently. [Diagram: knowing the two densities implies knowing their ratio, but not vice versa]
Organization of My Talk • Usage of Density Ratios: • Covariate shift adaptation • Outlier detection • Mutual information estimation • Conditional probability estimation • Methods of Density Ratio Estimation
Covariate Shift Adaptation Shimodaira (JSPI2000) • Training and test input distributions are different, but the target function remains unchanged: “extrapolation” [Figure: training and test input densities differ; samples from both are fitted to the same target function]
Adaptation Using Density Ratios • Ordinary least-squares is not consistent under covariate shift. • Least-squares weighted by the density ratio w(x) = p_test(x) / p_train(x) is consistent (see the sketch below). • Applicable to any likelihood-based method!
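A minimal sketch of density-ratio weighted least-squares, assuming the weights w_i = p_test(x_i)/p_train(x_i) are already available from some ratio estimator; the function name and design-matrix convention are illustrative, not from the talk.

```python
import numpy as np

def importance_weighted_least_squares(Phi, y, w):
    """Solve min_theta sum_i w_i * (y_i - phi(x_i)^T theta)^2.

    Phi : (n, b) design matrix of basis functions
    y   : (n,) training targets
    w   : (n,) importance weights p_test(x_i) / p_train(x_i)
    """
    PhiW = Phi * w[:, None]                       # row-wise weighting
    # Weighted normal equations: (Phi^T W Phi) theta = Phi^T W y
    return np.linalg.solve(PhiW.T @ Phi, PhiW.T @ y)
```

Setting all weights to 1 recovers ordinary least-squares, which stays biased under covariate shift; the weighted version is consistent.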
Model Selection • Controlling the bias-variance trade-off is important in covariate shift adaptation: • No weighting: low variance, high bias • Density-ratio weighting: low bias, high variance • Density-ratio weighting plays an important role for unbiased model selection: • Akaike information criterion (regular models) Shimodaira (JSPI2000) • Subspace information criterion (linear models) Sugiyama & Müller (Stat&Dec.2005) • Cross-validation (arbitrary models; see the sketch below) Sugiyama, Krauledat & Müller (JMLR2007)
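A hedged sketch of importance-weighted cross-validation (IWCV) for the squared loss; `fit` and `predict` are hypothetical user-supplied callables, and the model fit is assumed to accept per-sample weights.

```python
import numpy as np
from sklearn.model_selection import KFold

def iwcv_score(X, y, w, fit, predict, n_splits=5):
    """Importance-weighted CV: each held-out loss is reweighted by the
    density ratio w so the score stays unbiased under covariate shift."""
    scores = []
    for tr, te in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        model = fit(X[tr], y[tr], w[tr])          # e.g., weighted least-squares
        resid = y[te] - predict(model, X[te])
        scores.append(np.mean(w[te] * resid ** 2))
    return float(np.mean(scores))
```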
Active Learning • Covariate shift naturally occurs in active learning scenarios: • The training input distribution p_train(x) is designed by the user. • The test input distribution p_test(x) is determined by the environment. • Density-ratio weighting plays an important role for low-bias active learning. • Population-based methods: Wiens (JSPI2000), Kanamori & Shimodaira (JSPI2003), Sugiyama (JMLR2006) • Pool-based methods: Kanamori (NeuroComp2007), Sugiyama & Nakajima (MLJ2009)
Real-world Applications • Age prediction from faces (with NECsoft): illumination and angle change (KLS&CV) Ueki, Sugiyama & Ihara (ICPR2010) • Speaker identification (with Yamaha): voice quality change (KLR&CV) Yamada, Sugiyama & Matsui (SP2010) • Brain-computer interface: mental condition change (LDA&CV) Sugiyama, Krauledat & Müller (JMLR2007), Li, Kambara, Koike & Sugiyama (IEEE-TBE2010)
Real-world Applications (cont.) • Japanese text segmentation (with IBM): domain adaptation from general conversation to medicine (CRF&CV) Tsuboi, Kashima, Hido, Bickel & Sugiyama (JIP2008) • Example segmentation: こんな・失敗・は・ご・愛敬・だ・よ (“a failure like this is forgivable”) • Semiconductor wafer alignment (with Nikon): optical marker allocation (AL) Sugiyama & Nakajima (MLJ2009) • Robot control: dynamic motion update (KLS&CV), optimal exploration (AL) Hachiya, Akiyama, Sugiyama & Peters (NN2009), Hachiya, Peters & Sugiyama (ECML2009), Akiyama, Hachiya & Sugiyama (NN2010)
Organization of My Talk • Usage of Density Ratios: • Covariate shift adaptation • Outlier detection • Mutual information estimation • Conditional probability estimation • Methods of Density Ratio Estimation
Inlier-based Outlier Detection Hido, Tsuboi, Kashima, Sugiyama & Kanamori (ICDM2008) Smola, Song & Teo (AISTATS2009) • Test samples having a different tendency from the regular (inlier) samples are regarded as outliers. • The density-ratio value w(x) = p_regular(x) / p_test(x) serves as an inlier score: it is small for outliers (see the sketch below).
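A tiny sketch of the resulting detection rule, assuming ratio estimates at the test points have been obtained with any direct estimator (such as the SQ sketch later in this talk); the threshold is an illustrative choice, not from the cited papers.

```python
import numpy as np

def detect_outliers(ratio_values, threshold=0.5):
    """ratio_values: estimated p_regular(x) / p_test(x) at test samples.
    Low ratio values indicate samples unlikely under the regular data."""
    return np.where(np.asarray(ratio_values) < threshold)[0]
```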
Real-world Applications • Steel plant diagnosis (with JFE Steel) • Printer roller quality control (with Canon) • Loan customer inspection (with IBM) • Sleep therapy from biometric data Takimoto, Matsugu & Sugiyama (DMSS2009) Hido, Tsuboi, Kashima, Sugiyama & Kanamori (KAIS2010) Kawahara & Sugiyama (SDM2009)
Organization of My Talk • Usage of Density Ratios: • Covariate shift adaptation • Outlier detection • Mutual information estimation • Conditional probability estimation • Methods of Density Ratio Estimation
Mutual Information Estimation Suzuki, Sugiyama & Tanaka (ISIT2009) • Mutual information (MI): MI(X, Y) = ∬ p(x, y) log [ p(x, y) / (p(x) p(y)) ] dx dy • MI as an independence measure: MI = 0 if and only if x and y are statistically independent. • MI can be approximated using the density ratio w(x, y) = p(x, y) / (p(x) p(y)): MI ≈ (1/n) Σ_i log ŵ(x_i, y_i) (see the sketch below).
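A one-line estimator following directly from the slide's approximation, assuming ratio estimates at the paired samples are available:

```python
import numpy as np

def mi_estimate(ratio_values):
    """ratio_values: estimates of p(x, y) / (p(x) p(y)) at paired samples
    (x_i, y_i); MI is approximated by the average log ratio."""
    return float(np.mean(np.log(ratio_values)))
```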
Usage of MI Estimator • Independence between input and output: feature selection Suzuki, Sugiyama, Sese & Kanamori (BMC Bioinformatics 2009); sufficient dimension reduction Suzuki & Sugiyama (AISTATS2010) • Independence between inputs: independent component analysis Suzuki & Sugiyama (ICA2009) • Independence between input and residual: causal inference Yamada & Sugiyama (AAAI2010) [Diagram: variable relations (x, x', y, ε) for feature selection, ICA, and causal inference]
Organization of My Talk • Usage of Density Ratios: • Covariate shift adaptation • Outlier detection • Mutual information estimation • Conditional probability estimation • Methods of Density Ratio Estimation
Conditional Probability Estimation • The conditional probability p(y|x) = p(x, y) / p(x) is itself a density ratio. • When y is continuous: conditional density estimation, e.g., mobile robots’ transition estimation Sugiyama, Takeuchi, Suzuki, Kanamori, Hachiya & Okanohara (AISTATS2010) • When y is categorical: probabilistic classification (e.g., a pattern belongs to Class 1 with probability 80% and Class 2 with probability 20%) Sugiyama (Submitted)
Organization of My Talk • Usage of Density Ratios • Methods of Density Ratio Estimation: • Probabilistic Classification Approach • Moment Matching Approach • Ratio Matching Approach • Comparison • Dimensionality Reduction
Density Ratio Estimation • Density ratios are shown to be versatile. • In practice, however, the density ratio should be estimated from data. • Naïve approach: Estimate two densities separately and take their ratio
Vapnik’s Principle • When solving a problem, do not solve a more difficult problem as an intermediate step. • Estimating the density ratio is substantially easier than estimating the two densities! • We estimate the density ratio without going through density estimation. [Diagram: knowing the two densities implies knowing their ratio, but not vice versa]
Organization of My Talk • Usage of Density Ratios • Methods of Density Ratio Estimation: • Probabilistic Classification Approach • Moment Matching Approach • Ratio Matching Approach • Comparison • Dimensionality Reduction
Probabilistic Classification Qin (Biometrika1998) • Separate numerator and denominator samples by a probabilistic classifier: assign label y = +1 to numerator samples and y = −1 to denominator samples; by Bayes’ theorem, w(x) = p_nu(x) / p_de(x) = (n_de / n_nu) · p(y = +1 | x) / p(y = −1 | x) (see the sketch below). • Logistic regression achieves the minimum asymptotic variance when the model is correct.
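A common instantiation of this approach using scikit-learn's logistic regression; this is a generic sketch, not the implementation from the cited work.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ratio_by_classification(X_nu, X_de):
    """Estimate w(x) = p_nu(x)/p_de(x) by discriminating the two samples."""
    X = np.vstack([X_nu, X_de])
    y = np.concatenate([np.ones(len(X_nu)), np.zeros(len(X_de))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    def ratio(X_query):
        p = clf.predict_proba(X_query)[:, 1]            # P(numerator | x)
        return (len(X_de) / len(X_nu)) * p / (1.0 - p)  # Bayes correction

    return ratio
```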
Organization of My Talk • Usage of Density Ratios • Methods of Density Ratio Estimation: • Probabilistic Classification Approach • Moment Matching Approach • Ratio Matching Approach • Comparison • Dimensionality Reduction
Moment Matching • Match the moments of p_nu(x) and w(x) p_de(x), where w is the ratio model. • Ex) Matching the mean: minimize ‖ ∫ x w(x) p_de(x) dx − ∫ x p_nu(x) dx ‖²
Kernel Mean Matching Huang, Smola, Gretton, Borgwardt & Schölkopf (NIPS2006) • Matching a finite number of moments does not yield the true ratio even asymptotically. • Kernel mean matching: all the moments are efficiently matched using a Gaussian RKHS: minimize ‖ E_de[w(x) k(x, ·)] − E_nu[k(x, ·)] ‖²_RKHS • Ratio values at the denominator samples can be estimated via a quadratic program (see the sketch below). • Variants for learning the entire ratio function and for other loss functions are also available. Kanamori, Suzuki & Sugiyama (arXiv2009)
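A rough KMM sketch using a general-purpose solver (scipy's SLSQP) in place of a dedicated QP package; the box bound B, slack eps, and Gaussian width sigma are illustrative defaults, while the quadratic program itself follows the standard KMM formulation.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics.pairwise import rbf_kernel

def kmm_weights(X_de, X_nu, sigma=1.0, B=1000.0, eps=0.1):
    """Ratio values w_i ~ p_nu/p_de at the denominator samples, found by
    matching kernel means of w * p_de and p_nu in a Gaussian RKHS."""
    n_de, n_nu = len(X_de), len(X_nu)
    gamma = 1.0 / (2.0 * sigma ** 2)
    K = rbf_kernel(X_de, X_de, gamma=gamma)
    kappa = (n_de / n_nu) * rbf_kernel(X_de, X_nu, gamma=gamma).sum(axis=1)
    res = minimize(
        lambda w: 0.5 * w @ K @ w - kappa @ w,     # QP objective
        x0=np.ones(n_de),
        jac=lambda w: K @ w - kappa,
        bounds=[(0.0, B)] * n_de,
        constraints=[                               # average weight stays near 1
            {"type": "ineq", "fun": lambda w: n_de * (1 + eps) - w.sum()},
            {"type": "ineq", "fun": lambda w: w.sum() - n_de * (1 - eps)},
        ],
        method="SLSQP",
    )
    return res.x
```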
Organization of My Talk • Usage of Density Ratios • Methods of Density Ratio Estimation: • Probabilistic Classification Approach • Moment Matching Approach • Ratio Matching Approach • Kullback-Leibler Divergence • Squared Distance • Comparison • Dimensionality Reduction
Ratio Matching Sugiyama, Suzuki & Kanamori (RIMS2010) • Match a ratio model r(x) with the true ratio w(x) under the Bregman divergence: BR_f(w ∥ r) = ∫ p_de(x) [ f(w(x)) − f(r(x)) − f′(r(x)) (w(x) − r(x)) ] dx • Its empirical approximation is given by (constant term ignored): BR̂_f(r) = (1/n_de) Σ_j [ f′(r(x_j^de)) r(x_j^de) − f(r(x_j^de)) ] − (1/n_nu) Σ_i f′(r(x_i^nu))
Organization of My Talk • Usage of Density Ratios • Methods of Density Ratio Estimation: • Probabilistic Classification Approach • Moment Matching Approach • Ratio Matching Approach • Kullback-Leibler Divergence • Squared Distance • Comparison • Dimensionality Reduction
Kullback-Leibler Divergence (KL) Sugiyama, Nakajima, Kashima, von Bünau & Kawanabe (NIPS2007) Nguyen, Wainwright & Jordan (NIPS2007) • Put f(t) = t log t − t. Then BR yields the (unnormalized) Kullback-Leibler divergence, with empirical objective (1/n_de) Σ_j r(x_j^de) − (1/n_nu) Σ_i log r(x_i^nu). • For a linear model r(x) = Σ_l θ_l φ_l(x) (e.g., Gaussian kernels), minimization of KL is convex (see the sketch below).
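A minimal KLIEP-style sketch of the KL approach with a Gaussian kernel model centered at the numerator samples; the projected-gradient loop, step size, and iteration count are illustrative simplifications rather than the published algorithm.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def kliep(X_nu, X_de, sigma=1.0, n_iter=500, lr=1e-3):
    """Maximize (1/n_nu) sum_i log r(x_i^nu) subject to
    (1/n_de) sum_j r(x_j^de) = 1 and nonnegative coefficients."""
    gamma = 1.0 / (2.0 * sigma ** 2)
    Phi_nu = rbf_kernel(X_nu, X_nu, gamma=gamma)    # centers at numerator samples
    Phi_de = rbf_kernel(X_de, X_nu, gamma=gamma)
    b = Phi_de.mean(axis=0)                         # normalization vector
    theta = np.ones(len(X_nu)) / len(X_nu)
    for _ in range(n_iter):
        grad = (Phi_nu / (Phi_nu @ theta)[:, None]).mean(axis=0)
        theta = np.maximum(theta + lr * grad, 0.0)  # ascent step + nonnegativity
        theta /= b @ theta                          # enforce the constraint
    return lambda X: rbf_kernel(X, X_nu, gamma=gamma) @ theta
```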
Properties Nguyen, Wainwright & Jordan (NIPS2007) Sugiyama, Suzuki, Nakajima, Kashima, von Bünau & Kawanabe (AISM2008) • Parametric case: the learned parameter converges to the optimal value at order O_p(n^{-1/2}). • Non-parametric case: the learned function converges to the optimal function at an order slightly slower than O_p(n^{-1/2}) (depending on the covering number or the bracketing entropy of the function class).
Experiments • Setup: d-dimensional Gaussians with identity covariance: • Denominator: mean (0,0,0,…,0) • Numerator: mean (1,0,0,…,0) • Ratio of kernel density estimators (RKDE): estimate the two densities separately and take their ratio; Gaussian width chosen by CV. • KL method: estimate the density ratio directly; Gaussian width chosen by CV. • A sketch of this setup follows below.
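A hedged sketch reproducing this setup for one fixed dimensionality; the normalized-MSE definition here is an illustrative stand-in for the one used in the talk, and no CV is performed.

```python
import numpy as np
from scipy.stats import gaussian_kde, multivariate_normal

d, n = 5, 200
rng = np.random.default_rng(0)
X_de = rng.standard_normal((n, d))                 # denominator: mean (0,...,0)
X_nu = rng.standard_normal((n, d))
X_nu[:, 0] += 1.0                                  # numerator: mean (1,0,...,0)

mu_nu = np.zeros(d); mu_nu[0] = 1.0
true_w = (multivariate_normal(mu_nu, np.eye(d)).pdf(X_de)
          / multivariate_normal(np.zeros(d), np.eye(d)).pdf(X_de))

# Naive RKDE baseline: fit two kernel density estimators, take their ratio
rkde_w = gaussian_kde(X_nu.T)(X_de.T) / gaussian_kde(X_de.T)(X_de.T)

nmse = np.mean((rkde_w / rkde_w.mean() - true_w / true_w.mean()) ** 2)
print(f"RKDE normalized MSE at d={d}: {nmse:.3f}")
```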
Accuracy as a Function of Input Dimensionality • RKDE: suffers from the “curse of dimensionality” • KL: works well [Plot: normalized MSE vs. input dimensionality for RKDE and KL]
Variations • EM algorithms for KL with log-linear models Tsuboi, Kashima, Hido, Bickel & Sugiyama (JIP2009), Gaussian mixture models Yamada & Sugiyama (IEICE2009), and probabilistic PCA mixture models Yamada, Sugiyama, Wichern & Jaak (Submitted) are also available.
Organization of My Talk • Usage of Density Ratios • Methods of Density Ratio Estimation: • Probabilistic Classification Approach • Moment Matching Approach • Ratio Matching Approach • Kullback-Leibler Divergence • Squared Distance • Comparison • Dimensionality Reduction
Squared Distance (SQ) Kanamori, Hido & Sugiyama (JMLR2009) • Put f(t) = (t − 1)² / 2. Then BR yields the squared distance: SQ(w ∥ r) = (1/2) ∫ p_de(x) (r(x) − w(x))² dx • For a linear model r(x) = θᵀφ(x), the minimizer of SQ is analytic: θ̂ = (Ĥ + λI)⁻¹ ĥ, with Ĥ = (1/n_de) Σ_j φ(x_j^de) φ(x_j^de)ᵀ and ĥ = (1/n_nu) Σ_i φ(x_i^nu) (see the sketch below). • Computationally efficient!
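A minimal sketch of the analytic SQ solution (in the spirit of unconstrained least-squares importance fitting); the kernel centers, width, and regularization constant are illustrative choices.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def sq_ratio(X_nu, X_de, centers, sigma=1.0, lam=0.1):
    """theta_hat = (H_hat + lam I)^{-1} h_hat; returns a callable ratio model."""
    gamma = 1.0 / (2.0 * sigma ** 2)
    Phi_de = rbf_kernel(X_de, centers, gamma=gamma)
    Phi_nu = rbf_kernel(X_nu, centers, gamma=gamma)
    H = Phi_de.T @ Phi_de / len(X_de)              # (1/n_de) sum phi phi^T
    h = Phi_nu.mean(axis=0)                        # (1/n_nu) sum phi
    theta = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    return lambda X: rbf_kernel(X, centers, gamma=gamma) @ theta
```

Typical usage picks the centers as a subset of the numerator samples, e.g., sq_ratio(X_nu, X_de, centers=X_nu[:100]).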
Properties Kanamori, Hido & Sugiyama (JMLR2009) Kanamori, Suzuki & Sugiyama (arXiv2009) • Optimal parametric/non-parametric rates of convergence are achieved. • The leave-one-out cross-validation score can be computed analytically. • SQ has the smallest condition number. • The analytic solution allows computation of the derivative of the mutual information estimator, which is useful for: • Sufficient dimension reduction • Independent component analysis • Causal inference
Organization of My Talk • Usage of Density Ratios • Methods of Density Ratio Estimation: • Probabilistic Classification Approach • Moment Matching Approach • Ratio Matching Approach • Comparison • Dimensionality Reduction
Density Ratio Estimation Methods • [Comparison table omitted] • SQ would be preferable in practice: it has an analytic solution, allows cross-validation, and is computationally efficient.
Organization of My Talk • Usage of Density Ratios • Methods of Density Ratio Estimation: • Probabilistic Classification Approach • Moment Matching Approach • Ratio Matching Approach • Comparison • Dimensionality Reduction
Direct Density Ratio Estimation with Dimensionality Reduction (D3) • Direct density ratio estimation without going through density estimation is promising. • However, for high-dimensional data, density ratio estimation is still challenging. • We combine direct density ratio estimation with dimensionality reduction!
Hetero-distributional Subspace (HS) • Key assumption: p_nu(x) and p_de(x) are different only in a subspace (called the HS). • Decompose x into a component u inside the HS and a component v orthogonal to it, via a full-rank orthogonal transformation. • This allows us to estimate the density ratio only within the low-dimensional HS: w(x) = p_nu(u) / p_de(u).
HS Search Based on Supervised Dimension Reduction Sugiyama, Kawanabe & Chui (NN2010) • In the HS, p_nu and p_de are different, so samples from the two densities would be separable there. • We may use any supervised dimension reduction method for HS search. • Local Fisher discriminant analysis (LFDA): Sugiyama (JMLR2007) • Can handle within-class multimodality • No limitation on reduced dimensions • Analytic solution available
Pros and Cons of Supervised Dimension Reduction Approach • Pros: • SQ+LFDA gives an analytic solution for D3. • Cons: • Rather restrictive implicit assumption that the component u inside the HS and the orthogonal component v are independent. • Separability does not always lead to the HS (e.g., two overlapping but different densities). [Figure: separable direction vs. hetero-distributional subspace]
Characterization of HS Sugiyama, Hara, von Bünau, Suzuki, Kanamori & Kawanabe (SDM2010) • The HS is given as the maximizer of the Pearson divergence: PE[p_nu(u) ∥ p_de(u)] = (1/2) ∫ p_de(u) ( p_nu(u)/p_de(u) − 1 )² du • Thus, we can identify the HS by finding the maximizer of PE with respect to the projection matrix U. • PE can be analytically approximated by SQ!
HS Search • Gradient descent • Natural gradient descent: utilize the Stiefel-manifold structure for efficiency Amari (NeuralComp1998), Plumbley (NeuroComp2005) • Givens rotation: choose a pair of axes and rotate in the subspace • Subspace rotation: ignore rotation within subspaces for efficiency [Diagram: rotation across vs. within the hetero-distributional subspace]
Experiments [Figure: true ratio (2d), samples (2d), plain SQ estimate (2d), and D3-SQ estimate (2d); as noisy dimensions are added, plain SQ degrades while D3-SQ remains accurate]
Conclusions • Many ML tasks can be solved via density ratio estimation: • Non-stationarity adaptation, domain adaptation, multi-task learning, outlier detection, change detection in time series, feature selection, dimensionality reduction, independent component analysis, conditional density estimation, classification, two-sample test • Directly estimating density ratios without going through density estimation is the key. • Methods: LR, KMM, KL, SQ, and D3.
Books • Quiñonero-Candela, Sugiyama, Schwaighofer & Lawrence (Eds.), Dataset Shift in Machine Learning, MIT Press, 2009. • Sugiyama, von Bünau, Kawanabe & Müller, Covariate Shift Adaptation in Machine Learning, MIT Press (coming soon!) • Sugiyama, Suzuki & Kanamori, Density Ratio Estimation in Machine Learning, Cambridge University Press (coming soon!)
The World of Density Ratios • Real-world applications: brain-computer interface, robot control, image understanding, speech recognition, natural language processing, bioinformatics • Machine learning algorithms: covariate shift adaptation, outlier detection, conditional probability estimation, mutual information estimation • Density ratio estimation: fundamental algorithms (LR, KMM, KL, SQ); large-scale, high-dimensionality, stabilization, robustification • Theoretical analysis: convergence, information criteria, numerical stability