500 likes | 845 Views
Density Ratio Estimation: A New Versatile Tool for Machine Learning. Department of Computer Science Tokyo Institute of Technology Masashi Sugiyama sugi@cs.titech.ac.jp http://sugiyama-www.cs.titech.ac.jp/~sugi. Overview of My Talk (1). Consider the ratio of two probability densities .
E N D
CMU Density Ratio Estimation:A New Versatile Toolfor Machine Learning Department of Computer Science Tokyo Institute of Technology Masashi Sugiyama sugi@cs.titech.ac.jp http://sugiyama-www.cs.titech.ac.jp/~sugi
Overview of My Talk (1) • Consider the ratio of two probability densities. • If the ratio is estimated, various machine learning problems can be solved! • Non-stationarity adaptation, domain adaptation, multi-task learning, outlier detection, change detection in time series, feature selection, dimensionality reduction, independent component analysis, conditional density estimation, classification, two-sample test
Overview of My Talk (2) Vapnik said: When solving a problem of interest, one should not solve a more general problem as an intermediate step • Estimating density ratio is substantially easier than estimating densities! • Various direct density ratio estimation methods have been developed recently. Knowing densities Knowing ratio
Organization of My Talk • Usage of Density Ratios: • Covariate shift adaptation • Outlier detection • Mutual information estimation • Conditional probability estimation • Methods of Density Ratio Estimation
Covariate Shift Adaptation Shimodaira (JSPI2000) • Training/test input distributions are different, but target function remains unchanged: “extrapolation” Input density Function Training samples Test samples Target function
Ordinary least-squares is not consistent. Adaptation Using Density Ratios • Density-ratio weighted least-squares is consistent. Applicable to any likelihood-based methods!
Model Selection • Controlling bias-variance trade-off is important in covariate shift adaptation. • No weighting: low-variance, high-bias • Density ratio weighting: low-bias, high-variance • Density ratio weighting plays an important role for unbiased model selection: • Akaike information criterion (regular models) • Subspace information criterion (linear models) • Cross-validation (arbitrary models) Shimodaira (JSPI2000) Sugiyama & Müller (Stat&Dec.2005) Sugiyama, Krauledat & Müller (JMLR2007)
Active Learning • Covariate shift naturally occurs in active learning scenarios: • is designed by users. • is determined by environment. • Density ratio weighting plays an important role for low-bias active learning. • Population-based methods: • Pool-based methods: Wiens (JSPI2000) Kanamori & Shimodaira (JSPI2003) Sugiyama, (JMLR2006) Kanamori, (NeuroComp2007), Sugiyama & Nakajima (MLJ2009)
Real-world Applications • Age prediction from faces (with NECsoft): • Illumination and angle change (KLS&CV) • Speaker identification (with Yamaha): • Voice quality change (KLR&CV) • Brain-computer interface: • Mental condition change (LDA&CV) Ueki, Sugiyama & Ihara (ICPR2010) Yamada, Sugiyama & Matsui (SP2010) Sugiyama, Krauledat & Müller (JMLR2007) Li, Kambara, Koike & Sugiyama (IEEE-TBE2010)
Real-world Applications (cont.) こんな・失敗・は・ご・愛敬・だ・よ・. • Japanese text segmentation (with IBM): • Domain adaptation from general conversation to medicine (CRF&CV) • Semiconductor wafer alignment (with Nikon): • Optical marker allocation (AL) • Robot control: • Dynamic motion update (KLS&CV) • Optimal exploration (AL) Tsuboi, Kashima, Hido, Bickel & Sugiyama (JIP2008) Sugiyama & Nakajima (MLJ2009) Hachiya, Akiyama, Sugiyama & Peters (NN2009) Hachiya, Peters & Sugiyama (ECML2009) Akiyama, Hachiya, & Sugiyama (NN2010)
Organization of My Talk • Usage of Density Ratios: • Covariate shift adaptation • Outlier detection • Mutual information estimation • Conditional probability estimation • Methods of Density Ratio Estimation
Inlier-based Outlier Detection Hido, Tsuboi, Kashima, Sugiyama & Kanamori (ICDM2008) Smola, Song & Teo (AISTATS2009) • Test samples having different tendency from regular ones are regarded as outliers. Outlier
Real-world Applications • Steel plant diagnosis (with JFE Steel) • Printer roller quality control (with Canon) • Loan customer inspection (with IBM) • Sleep therapy from biometric data Takimoto, Matsugu & Sugiyama (DMSS2009) Hido, Tsuboi, Kashima, Sugiyama & Kanamori (KAIS2010) Kawahara & Sugiyama (SDM2009)
Organization of My Talk • Usage of Density Ratios: • Covariate shift adaptation • Outlier detection • Mutual information estimation • Conditional probability estimation • Methods of Density Ratio Estimation
Mutual Information Estimation • Mutual information (MI): • MI as an independence measure: • MI can be approximated using density ratio: andare statistically independent Suzuki, Sugiyama & Tanaka (ISIT2009)
Usage of MI Estimator Suzuki, Sugiyama, Sese & Kanamori (BMC Bioinformatics 2009) • Independence between input and output: • Feature selection • Sufficient dimension reduction • Independence between inputs: • Independent component analysis • Independence between input and residual: • Causal inference Suzuki & Sugiyama (AISTATS2010) Suzuki & Sugiyama (ICA2009) Yamada & Sugiyama (AAAI2010) y Causal inference ε Feature selection, Sufficient dimension reduction x x’ Independent component analysis
Organization of My Talk • Usage of Density Ratios: • Covariate shift adaptation • Outlier detection • Mutual information estimation • Conditional probability estimation • Methods of Density Ratio Estimation
Conditional Probability Estimation • When is continuous: • Conditional density estimation • Mobile robots’ transition estimation • When is categorical: • Probabilistic classification Sugiyama, Takeuchi, Suzuki, Kanamori, Hachiya & Okanohara (AISTATS2010) 80% Class 1 Pattern Class 2 20% Sugiyama (Submitted)
Organization of My Talk • Usage of Density Ratios • Methods of Density Ratio Estimation: • Probabilistic Classification Approach • Moment matching Approach • Ratio matching Approach • Comparison • Dimensionality Reduction
Density Ratio Estimation • Density ratios are shown to be versatile. • In practice, however, the density ratio should be estimated from data. • Naïve approach: Estimate two densities separately and take their ratio
Vapnik’s Principle When solving a problem, don’t solve more difficult problems as an intermediate step • Estimating density ratio is substantially easier than estimating densities! • We estimate density ratio without going through density estimation. Knowing densities Knowing ratio
Organization of My Talk • Usage of Density Ratios • Methods of Density Ratio Estimation: • Probabilistic Classification Approach • Moment Matching Approach • Ratio Matching Approach • Comparison • Dimensionality Reduction
Probabilistic Classification • Separate numerator and denominator samples by a probabilistic classifier: • Logistic regression achieves the minimum asymptotic variance when model is correct. Qin (Biometrika1998)
Organization of My Talk • Usage of Density Ratios • Methods of Density Ratio Estimation: • Probabilistic Classification Approach • Moment Matching Approach • Ratio Matching Approach • Comparison • Dimensionality Reduction
Moment Matching • Match moments of and : • Ex) Matching the mean:
Kernel Mean Matching Huang, Smola, Gretton, Borgwardt & Schölkopf (NIPS2006) • Matching a finite number of moments does not yield the true ratio even asymptotically. • Kernel mean matching: All the moments are efficiently matched using Gaussian RKHS: • Ratio values at samples can be estimated via quadratic program. • Variants for learning entire ratio function and other loss functions are also available. Kanamori, Suzuki & Sugiyama (arXiv2009)
Organization of My Talk • Usage of Density Ratios • Methods of Density Ratio Estimation: • Probabilistic Classification Approach • Moment Matching Approach • Ratio Matching Approach • Kullback-Leibler Divergence • Squared Distance • Comparison • Dimensionality Reduction
Ratio Matching Sugiyama, Suzuki & Kanamori (RIMS2010) • Match a ratio model with true ratio under the Bregman divergence: • Its empirical approximation is given by (Constant is ignored)
Organization of My Talk • Usage of Density Ratios • Methods of Density Ratio Estimation: • Probabilistic Classification Approach • Moment Matching Approach • Ratio Matching Approach • Kullback-Leibler Divergence • Squared Distance • Comparison • Dimensionality Reduction
Kullback-Leibler Divergence (KL) Sugiyama, Nakajima, Kashima, von Bünau & Kawanabe (NIPS2007) Nguyen, Wainwright & Jordan (NIPS2007) • Put . Then BR yields • For a linear model , minimization of KL is convex. (ex. Gauss kernel)
Properties Nguyen, Wainwright & Jordan (NIPS2007) Sugiyama, Suzuki, Nakajima, Kashima, von Bünau & Kawanabe (AISM2008) • Parametric case: • Learned parameter converge to the optimal value with order . • Non-parametric case: • Learned function converges to the optimal function with order slightly slower than (depending on the covering number or the bracketing entropy of the function class).
Experiments • Setup: d-dim. Gaussian with covariance identity and • Denominator: mean (0,0,0,…,0) • Numerator: mean (1,0,0,…,0) • Ratio of kernel density estimators (RKDE): • Estimate two densities separately and take ratio. • Gaussian width is chosen by CV. • KL method: • Estimate density ratio directly. • Gaussian width is chosen by CV.
Accuracy as a Functionof Input Dimensionality • RKDE: “curse of dimensionality” • KL: works well RKDE Normalized MSE KL Dim.
Variations • EM algorithms for KL with • Log-linear models • Gaussian mixture models • Probabilistic PCA mixture models are also available. Tsuboi, Kashima, Hido, Bickel & Sugiyama (JIP2009) Yamada & Sugiyama (IEICE2009) Yamada, Sugiyama, Wichern & Jaak (Submitted)
Organization of My Talk • Usage of Density Ratios • Methods of Density Ratio Estimation: • Probabilistic Classification Approach • Moment Matching Approach • Ratio Matching Approach • Kullback-Leibler Divergence • Squared Distance • Comparison • Dimensionality Reduction
Squared Distance (SQ) Kanamori, Hido & Sugiyama (JMLR2009) • Put . Then BR yields • For a linear model , minimizer of SQ is analytic: • Computationally efficient!
Properties Kanamori, Hido & Sugiyama (JMLR2009) Kanamori, Suzuki & Sugiyama (arXiv2009) • Optimal parametric/non-parametric rates of convergence are achieved. • Leave-one-out cross-validation score can be computed analytically. • SQ has the smallest condition number. • Analytic solution allows computation of the derivative of mutual information estimator: • Sufficient dimension reduction • Independent component analysis • Causal inference
Organization of My Talk • Usage of Density Ratios • Methods of Density Ratio Estimation: • Probabilistic Classification Approach • Moment Matching Approach • Ratio Matching Approach • Comparison • Dimensionality Reduction
Density Ratio Estimation Methods • SQ would be preferable in practice.
Organization of My Talk • Usage of Density Ratios • Methods of Density Ratio Estimation: • Probabilistic Classification Approach • Moment Matching Approach • Ratio Matching Approach • Comparison • Dimensionality Reduction
Direct Density Ratio Estimationwith Dimensionality Reduction (D3) • Directly density ratio estimation without going through density estimation is promising. • However, for high-dimensional data, density ratio estimation is still challenging. • We combine direct density ratio estimation with dimensionality reduction!
Hetero-distributional Subspace (HS) • Key assumption: and are different only in a subspace (called HS). • This allows us to estimate the density ratio only within the low-dimensional HS! HS : Full-rank and orthogonal
HS Search Based onSupervised Dimension Reduction Sugiyama, Kawanabe & Chui (NN2010) • In HS, and are different, so and would be separable. • We may use any supervised dimension reduction method for HS search. • Local Fisher discriminant analysis (LFDA): • Can handle within-class multimodality • No limitation on reduced dimensions • Analytic solution available Sugiyama (JMLR2007)
Pros and Cons of SupervisedDimension Reduction Approach • Pros: • SQ+LFDA gives analytic solution for D3. • Cons: • Rather restrictive implicit assumption that and are independent. • Separability does not always lead to HS (e.g., two overlapped, but different densities) Separable HS
Characterization of HS Sugiyama, Hara, von Bünau, Suzuki, Kanamori & Kawanabe (SDM2010) • HS is given as the maximizer of the Pearson divergence : • Thus, we can identify HS by finding maximizer of PE with respect to . • PE can be analytically approximated by SQ!
HS Search Amari (NeuralComp1998) • Gradient descent: • Natural gradient descent: • Utilize the Stiefel-manifold structure for efficiency • Givens rotation: • Choose a pair of axes and rotate in the subspace • Subspace rotation: • Ignore rotation within subspaces for efficiency Plumbley (NeuroComp2005) Rotation across the subspace Rotation within the subspace Hetero-distributional subspace
Experiments True ratio (2d) Samples (2d) Plain SQ (2d) Increasing dimensionality (by adding noisy dims) Plain SQ D3-SQ (2d) D3-SQ
Conclusions • Many ML tasks can be solved via density ratio estimation: • Non-stationarity adaptation, domain adaptation, multi-task learning, outlier detection, change detection in time series, feature selection, dimensionality reduction, independent component analysis, conditional density estimation, classification, two-sample test • Directly estimating density ratios without going through density estimation is the key. • LR, KMM, KL, SQ, and D3 .
Books • Quiñonero-Candela, Sugiyama, Schwaighofer & Lawrence (Eds.), Dataset Shift in Machine Learning, MIT Press, 2009. • Sugiyama, von Bünau, Kawanabe & Müller, Covariate Shift Adaptation in Machine Learning, MIT Press (coming soon!) • Sugiyama, Suzuki & Kanamori, Density Ratio Estimation in Machine Learning, Cambridge University Press (coming soon!)
The World of Density Ratios Real-world applications: Brain-computer interface, Robot control, Image understanding, Speech recognition, Natural language processing, Bioinformatics Machine learning algorithms: Covariate shift adaptation, Outlier detection, Conditional probability estimation, Mutual information estimation Density ratio estimation: Fundamental algorithms (LR, KMM, KL, SQ) Large-scale, High-dimensionality, Stabilization, Robustification Theoretical analysis: Convergence, Information criteria, Numerical stability