
Nonparametric Tests of Independence and Conditional Independence with Positive Definite Kernels. Kenji Fukumizu, The Institute of Statistical Mathematics. Based on collaborations with A. Gretton, X. Sun, B. Sriperumbudur, D. Janzing and B. Schölkopf


Presentation Transcript


  1. Nonparametric Tests of Independence and Conditional Independence with Positive Definite Kernels. Kenji Fukumizu, The Institute of Statistical Mathematics. Based on collaborations with A. Gretton, X. Sun, B. Sriperumbudur, D. Janzing and B. Schölkopf. Machine Learning Approaches to Statistical Dependences and Causality, Dagstuhl, Sept. 28 - Oct. 2, 2009.

  2. Outline • Introduction: kernel methods and RKHS • Independence with kernels • Conditional independence with kernels • Discussions

  3. Statistical tests for causal learning • Statistical tests of conditional independence: independence and conditional independence tests are an important component in constraint-based causal learning, e.g. Inductive Causation (Verma & Pearl 1990), PC algorithm (Spirtes & Glymour 1991). • Difficulties in tests for causal learning: • High dimensionality: combinations of variables are tested. • Mixed types of variables: continuous, discrete, etc. • Multiple comparison: the same variables are tested many times. • Conventional methods for continuous variables: • Linear/Gaussian assumptions. • Discretizing variables and contingency-table approaches.

  4. Kernel method • Transform data to a feature space (RKHS): Φ: Ω (original space) → H (RKHS), X ↦ Φ(X), which captures "nonlinearity" or "higher-order moments". Data: X1, …, XN ↦ Φ(X1), …, Φ(XN). • Linear methods on the RKHS (e.g. SVM). • Basic statistics: mean, covariance, conditional covariance. • "Kernel trick": easy computation of the inner product, ⟨Φ(Xi), Φ(Xj)⟩ = k(Xi, Xj).

  5. Positive definite kernel and RKHS • Positive definite kernel: Ω a set; k: Ω × Ω → ℝ is positive definite if k(x, y) = k(y, x) and, for any x1, …, xn ∈ Ω, the n × n Gram matrix (k(xi, xj))ij is positive semidefinite. • Examples: Gaussian RBF kernel k(x, y) = exp(−‖x − y‖² / (2σ²)); polynomial kernel k(x, y) = (xᵀy + c)^d. • Reproducing kernel Hilbert space (RKHS): for a positive definite kernel k on Ω, H is a Hilbert space consisting of functions on Ω with the reproducing property ⟨f, k(·, x)⟩_H = f(x) for all f ∈ H and x ∈ Ω.
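To make the Gram-matrix definition concrete, here is a minimal numerical sketch (not from the slides) that builds a Gaussian RBF Gram matrix and checks positive semidefiniteness; the point set and σ are arbitrary illustrative choices.

```python
# A minimal sketch: Gaussian RBF Gram matrix and a numerical PSD check.
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-sq_dists / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))        # 50 arbitrary points in R^3
K = gaussian_gram(X, sigma=1.0)

# Positive definiteness: every Gram matrix has nonnegative eigenvalues
# (up to floating-point error), whatever points were chosen.
print(np.linalg.eigvalsh(K).min() >= -1e-10)   # True
```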

  6. Why RKHS? The reproducing property gives easy empirical computation. • The inner product of H is efficiently computable, while the dimensionality may be infinite (cf. L² inner product / power-series expansion). • The computational cost essentially depends on the sample size N, once the Gram matrices are obtained. • Construction of empirical estimators is easy. • Can be applied to non-Euclidean data (strings, graphs, etc.).

  7. Independence with Kernels

  8. Covariance on RKHS • Linear case (Gaussian): V_YX = Cov[X, Y] = E[YXᵀ] − E[Y]E[X]ᵀ: the covariance matrix. • On RKHS: X, Y: random variables on Ω_X and Ω_Y, resp.; (H_X, k_X), (H_Y, k_Y): RKHSs defined on Ω_X and Ω_Y, resp. Define random variables on the RKHSs H_X and H_Y by Φ(X) = k_X(·, X) and Ψ(Y) = k_Y(·, Y). • Def. The cross-covariance operator Σ_YX: H_X → H_Y is the operator satisfying ⟨g, Σ_YX f⟩ = Cov[f(X), g(Y)] for all f ∈ H_X, g ∈ H_Y.

  9. Higher-order moments - intuition. If we can represent the feature vector w.r.t. a basis 1, u, u², u³, …, i.e. Φ(u) = (1, u, u², u³, …), then the entries of Σ_YX are the cross-covariances Cov[X^i, Y^j], so the operator Σ_YX contains the information on all the higher-order correlations.

  10. Characterization of independence • Independence and the cross-covariance operator. Theorem: If the product kernel k_X k_Y is characteristic on Ω_X × Ω_Y, then Σ_YX = O ⟺ X ⊥ Y. Note Σ_YX = O means Cov[f(X), g(Y)] = 0 for all f ∈ H_X, g ∈ H_Y. • Analog to Gaussian random vectors: V_YX = O ⟺ X ⊥ Y.

  11. • A kernel is called characteristic if E_{X~P}[k(·, X)] = E_{X~Q}[k(·, X)] implies P = Q, i.e. the RKHS is rich enough to determine a probability by its mean element. Examples: Gaussian kernel, Laplacian kernel. • Analogy to the characteristic function approach: X ⊥ Y ⟺ E[e^{i(uᵀX + vᵀY)}] = E[e^{iuᵀX}]E[e^{ivᵀY}] for all u and v. The kernel characterization: X ⊥ Y ⟺ E[k_X(u, X)k_Y(v, Y)] = E[k_X(u, X)]E[k_Y(v, Y)] for all u and v.

  12. Measuring dependence • HSIC (Hilbert-Schmidt Independence Criterion, Gretton et al. 2005): HSIC(X, Y) = ‖Σ_YX‖²_HS. • Hilbert-Schmidt norm of an operator (cf. Frobenius norm of a matrix): ‖Σ_YX‖²_HS = Σ_{i,j} ⟨ψ_j, Σ_YX φ_i⟩², where {φ_i}, {ψ_j} are orthonormal bases of H_X and H_Y, resp. • In terms of kernels: HSIC(X, Y) = E[k_X(X, X̃)k_Y(Y, Ỹ)] − 2E[E_X̃[k_X(X, X̃)] E_Ỹ[k_Y(Y, Ỹ)]] + E[k_X(X, X̃)]E[k_Y(Y, Ỹ)], where (X̃, Ỹ) is an independent copy of (X, Y).

  13. Estimation of cross-cov. operator • Empirical estimation is straightforward. (X1, Y1), …, (XN, YN): i.i.d. sample on Ω_X × Ω_Y. Empirical covariance operator: Σ̂_YX = (1/N) Σ_i (k_Y(·, Yi) − m̂_Y) ⊗ (k_X(·, Xi) − m̂_X) (rank ≤ N). • Measure of independence: HSIC_emp = ‖Σ̂_YX‖²_HS = (1/N²) Tr[K̃_X K̃_Y], where K̃_X = H K_X H is the centered Gram matrix, with H = I_N − (1/N)11ᵀ.
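A minimal sketch of the empirical HSIC formula above, assuming the 1/N² normalization (some papers use 1/(N−1)²) and reusing the gaussian_gram helper from the earlier sketch:

```python
import numpy as np

def hsic(X, Y, sigma_x=1.0, sigma_y=1.0):
    """Empirical HSIC = (1/N^2) Tr[K~_X K~_Y] with centered Gram matrices."""
    # gaussian_gram is the helper defined in the slide-5 sketch.
    N = X.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N       # centering matrix
    Kx = H @ gaussian_gram(X, sigma_x) @ H    # K~_X = H K_X H
    Ky = H @ gaussian_gram(Y, sigma_y) @ H    # K~_Y = H K_Y H
    # Tr[K~_X K~_Y] equals the elementwise sum since both are symmetric.
    return np.sum(Kx * Ky) / N**2
```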

  14. Application to independence test. Statistic for the independence test: T_N = N · HSIC_emp. • Under the null hypothesis (X and Y are independent), T_N converges in distribution to Σ_i λ_i Z_i², where the Z_i are i.i.d. ~ N(0, 1) and the λ_i are the eigenvalues of an integral operator defined by the centered kernel on Ui = (Xi, Yi). • Under the alternative (X and Y are not independent), T_N → ∞ in probability.
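Since the weighted χ² limit above has no closed form, a common practical alternative (a standard device, not a procedure stated on the slide) is to approximate the null distribution by permutations, which break any X-Y dependence:

```python
import numpy as np

def hsic_permutation_test(X, Y, n_perm=500, seed=0):
    """Approximate p-value for H0: X independent of Y, using hsic() above."""
    rng = np.random.default_rng(seed)
    stat = hsic(X, Y)
    # Shuffling the rows of Y simulates samples from the null.
    null = [hsic(X, Y[rng.permutation(len(Y))]) for _ in range(n_perm)]
    # Add-one smoothing keeps the p-value strictly positive.
    p_value = (1 + sum(s >= stat for s in null)) / (1 + n_perm)
    return stat, p_value
```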

  15. Other nonparametric indep. tests • Power divergence (Ku & Fine 2005; Read & Cressie): make a partition of the space, where each dimension is divided into q parts so that each bin contains almost the same number of data points; the statistic compares the frequency in each bin A_j with the product of the marginal frequencies in the corresponding intervals. I₀ = mutual information; I₂ = mean square contingency (χ²-divergence). • The null distribution under independence is known asymptotically. • Difficulty for high-dimensional data (no data in most bins).
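For comparison, the contingency-table approach above can be sketched for two scalar samples with scipy; the bin count q is an illustrative choice, and binning every dimension of high-dimensional data runs into the empty-bin problem just noted.

```python
import numpy as np
from scipy.stats import chi2_contingency

def binned_chi2_test(x, y, q=4):
    """Quantile-bin two scalar samples into q parts each, then Pearson chi^2."""
    edges_x = np.quantile(x, np.linspace(0, 1, q + 1)[1:-1])
    edges_y = np.quantile(y, np.linspace(0, 1, q + 1)[1:-1])
    bx, by = np.digitize(x, edges_x), np.digitize(y, edges_y)
    table = np.zeros((q, q))
    for i, j in zip(bx, by):
        table[i, j] += 1              # joint frequency in bin (i, j)
    chi2, p, dof, _ = chi2_contingency(table)
    return chi2, p
```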

  16. • P.d.f.-based (e.g. estimation by Parzen window): • Low accuracy for high-dimensional data. • Sensitive to the choice of bandwidth. • Characteristic-function based (e.g. Feuerverger 1993): • Direct estimation of the integral is difficult. • Some choices of the weight function w give estimators relevant to the kernel method.

  17. Example of independence test • Synthesized data: two d-dimensional samples, with a parameter controlling the strength of dependence. [Figure: test results plotted against the strength of dependence, comparing the kernel test with the χ² test and the power-divergence test.]

  18. Conditional Independence with Kernels

  19. Conditional covariance on RKHS • X, Y, Z: random variables on Ω_X, Ω_Y, Ω_Z (resp.); (H_X, k_X), (H_Y, k_Y), (H_Z, k_Z): RKHSs defined on Ω_X, Ω_Y, Ω_Z (resp.). • Conditional cross-covariance operator: Σ_YX|Z = Σ_YX − Σ_YZ Σ_ZZ⁻¹ Σ_ZX. • It expresses a conditional covariance; cf. for Gaussian variables, Cov[X, Y | Z] = V_YX − V_YZ V_ZZ⁻¹ V_ZX.

  20. • Characterization of conditional independence. Theorem: Define the augmented variable Ẍ = (X, Z) and define a kernel on Ω_X × Ω_Z by k_Ẍ = k_X k_Z. Assume the kernels involved are characteristic; then Σ_YẌ|Z = O ⟺ X ⊥ Y | Z. • cf. for Gaussian variables, Cov[X, Y | Z] = O ⟺ X ⊥ Y | Z.

  21. Empirical estimator of cond. cov. operator • (X1, Y1, Z1), …, (XN, YN, ZN): i.i.d. sample. • Empirical conditional covariance operator: Σ̂_YX|Z = Σ̂_YX − Σ̂_YZ (Σ̂_ZZ + ε_N I)⁻¹ Σ̂_ZX; the empirical operators have finite rank, so regularization ε_N > 0 is needed for the inversion. • The estimator of the Hilbert-Schmidt norm is computed from the centered Gram matrices K̃ = HKH.
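One concrete way to assemble this estimator is sketched below. The trace formula follows from the operator expression above with R_Z = K̃_Z(K̃_Z + NεI)⁻¹, but that derivation, the shared σ, and the ε value are my working assumptions, not choices stated on the slides; gaussian_gram is the earlier helper.

```python
import numpy as np

def centered(K):
    """K~ = H K H with the centering matrix H = I - (1/N) 1 1^T."""
    N = K.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N
    return H @ K @ H

def cond_dep_stat(X, Y, Z, sigma=1.0, eps=1e-3):
    """|| Sigma^_{Y Xdd | Z} ||_HS^2 with the augmented variable Xdd = (X, Z)."""
    N = X.shape[0]
    # Product kernel for the augmented variable: elementwise Gram product.
    Kxz = centered(gaussian_gram(X, sigma) * gaussian_gram(Z, sigma))
    Ky = centered(gaussian_gram(Y, sigma))
    Kz = centered(gaussian_gram(Z, sigma))
    Rz = Kz @ np.linalg.inv(Kz + N * eps * np.eye(N))  # regularized Z-"projection"
    P = np.eye(N) - Rz
    return np.trace(P @ Ky @ P @ Kxz) / N**2
```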

  22. Kernel conditional indep. test • Test statistic: the squared HS norm of the empirical conditional covariance operator. • The asymptotic null distribution has not been derived, unfortunately. • The computational cost is high; low-rank approximation helps much. • A permutation test still requires partitioning: to simulate the null of conditional independence for continuous variables, we need to partition Z and permute X or Y within each bin (see the sketch below). This is better than partitioning all the variables, but high-dimensional Z causes problems.
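A sketch of the within-bin permutation just described, for a scalar Z binned by quantiles (higher-dimensional Z would need clustering instead; the bin count q is an illustrative choice):

```python
import numpy as np

def permute_within_z_bins(Y, Z, q=5, rng=None):
    """Permute Y only inside quantile bins of Z, roughly preserving p(Y|Z)."""
    rng = rng or np.random.default_rng()
    edges = np.quantile(Z.ravel(), np.linspace(0, 1, q + 1)[1:-1])
    bins = np.digitize(Z.ravel(), edges)
    Y_perm = Y.copy()
    for b in np.unique(bins):
        idx = np.where(bins == b)[0]
        Y_perm[idx] = Y[rng.permutation(idx)]  # shuffle within this bin only
    return Y_perm
```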

  23. Other methods • Discretization and χ² tests. • Conditional characteristic-function based (Su & White 2006): a nonparametric conditional independence test for continuous domains. • The conditional p.d.f. must be used. • A clever bandwidth must be chosen to derive the asymptotic null distribution.

  24. Choice of kernel • How to choose a kernel? • No theoretically justified method for dependence analysis (cf. for classification/regression, cross-validation is standard). • Some heuristic methods that work: • Heuristics for Gaussian kernels: σ = median of the pairwise distances ‖Xi − Xj‖ (see the sketch below). • Speed of asymptotic convergence: compare the bootstrapped variance with the theoretical one under independence, and choose the parameter giving the minimum discrepancy.
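The σ = median heuristic from the first bullet, as a one-function sketch:

```python
import numpy as np

def median_heuristic(X):
    """sigma = median of pairwise Euclidean distances among the sample points."""
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Use only the strictly upper-triangular entries (distinct pairs).
    return np.median(dists[np.triu_indices_from(dists, k=1)])
```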

  25. Normalized independence measure • HS Normalized Independence Criterion (HSNIC): assume the normalized cross-covariance operator V_YX = Σ_YY^{−1/2} Σ_YX Σ_XX^{−1/2} is Hilbert-Schmidt, and define HSNIC = ‖V_YX‖²_HS. • Can be extended to the conditional case (not shown here).

  26. Kernel-free expression • Integral expression of HSNIC without kernels. Theorem (FGSS07): Assume k_X k_Y is characteristic, and the law P_XY has a p.d.f. w.r.t. the product of the measures μ1 and μ2 (with marginal densities p_X and p_Y, resp.). Then HSNIC = ∫∫ (p_XY(x, y) / (p_X(x) p_Y(y)) − 1)² p_X(x) p_Y(y) dμ1 dμ2 = χ²-divergence between P_XY and P_X P_Y (mean square contingency). • HSNIC is defined by kernels, but it does not depend on the kernels! • HSNIC_emp gives a kernel estimator of the χ²-divergence (mean square contingency), which is a well-known dependence criterion.
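A sketch of an empirical normalized criterion consistent with the definitions above, assuming the estimator HSNIC_emp = Tr[R_X R_Y] with R = K̃(K̃ + NεI)⁻¹; the regularized form and the ε value are assumptions, and centered / gaussian_gram are the earlier helpers.

```python
import numpy as np

def hsnic(X, Y, sigma=1.0, eps=1e-3):
    """Normalized dependence measure Tr[R_X R_Y]: a kernel estimator of the
    chi^2-divergence (mean square contingency) per the kernel-free result."""
    N = X.shape[0]
    Kx = centered(gaussian_gram(X, sigma))
    Ky = centered(gaussian_gram(Y, sigma))
    Rx = Kx @ np.linalg.inv(Kx + N * eps * np.eye(N))
    Ry = Ky @ np.linalg.inv(Ky + N * eps * np.eye(N))
    return np.trace(Rx @ Ry)
```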

  27. Experiment: coupled Hénon map. X, Y: time series (Xt−1 → Xt, Yt−1 → Yt), with X driving Y with coupling strength γ. [Figure: scatter plots of coordinates (x1 vs. y1, x1 vs. x2) for γ = 0, 0.25, 0.8.]

  28. Causality of the coupled Hénon map • X is a cause of Y if γ > 0; Y is not a cause of X for any γ. • Permutation tests for non-causality with N = 100, p = 1. [Table: number of times H0 is accepted among 100 datasets (α = 5%).]

  29. Chicken or egg? Data: annual US time series, 1930-83, on egg production and the chicken population (Thurman and Fisher, Amer. J. Agricultural Economics, 1988). Null hypothesis (A): EGG is NOT a cause of CHICKEN: p(Ct | Ct−1, …, Ct−L, Et−1, …, Et−L) = p(Ct | Ct−1, …, Ct−L). Null hypothesis (B): CHICKEN is NOT a cause of EGG: p(Et | Et−1, …, Et−L, Ct−1, …, Ct−L) = p(Et | Et−1, …, Et−L). Result: the egg comes first! (This coincides with the Granger-causality test by Thurman and Fisher.)

  30. Discussions • Nonparametric inference with kernels: positive definite kernels work for nonparametric inference such as independence tests, conditional independence tests, and more. • Application to causal learning: see Xiaohai Sun's Ph.D. thesis! • Challenges: • Theoretical study of kernel choice. • Power or efficiency of independence tests? • Null distribution of conditional independence tests: derivation of the asymptotic null distribution; a permutation test without partitioning? • For constraint-based causal learning: reduce the number of tests; how to set the significance level for the highly correlated tests.

  31. References • Fukumizu, K., F.R. Bach and A. Gretton. Statistical consistency of kernel canonical correlation analysis. JMLR 8:361-383 (2007). • Fukumizu, K., F. Bach and M. Jordan. Kernel dimension reduction in regression. The Annals of Statistics 37(4):1871-1905 (2009). • Sun, X., D. Janzing, B. Schölkopf and K. Fukumizu. A kernel-based causal learning algorithm. Proc. 24th Intern. Conf. Machine Learning (ICML 2007), pp. 855-862 (2007). • Gretton, A., K. Fukumizu, C.H. Teo, L. Song, B. Schölkopf and A. Smola. A kernel statistical test of independence. Advances in NIPS 20:585-592 (2008). • Fukumizu, K., A. Gretton, X. Sun and B. Schölkopf. Kernel measures of conditional dependence. Advances in NIPS 20:489-496 (2008). • Sriperumbudur, B., A. Gretton, K. Fukumizu, G. Lanckriet and B. Schölkopf. Injective Hilbert space embeddings of probability measures. Proc. 21st Annual Conference on Learning Theory (COLT 2008) (2008). • Slides for a lecture on kernel methods: http://www.ism.ac.jp/~fukumizu/H20_kernel/
