Distance Correlation E-Statistics. Gábor J. Székely, Rényi Institute of the Hungarian Academy of Sciences. Columbia University, April 28-30, 2014
Topics
• Lecture 1. Distance Correlation. From correlation (Galton/Pearson, 1895) to distance correlation (Székely, 2005). Important measures of dependence and how to classify them via invariances. The distance correlation t-test of independence. Open problems for big data.
• Lecture 2. Energy statistics (E-statistics) and their applications. Testing for symmetry, testing for normality, DISCO analysis, energy clustering, etc. A simple inequality on energy statistics and a beautiful theorem on Fourier transforms. What makes a statistic U (or V)?
• Lecture 3. Brownian correlation. Correlation with respect to stochastic processes. Distances and negative definite functions. Physics principles in statistics (the uncertainty principle of statistics, symmetries/invariances, equilibrium estimates). CLT for dependent variables via Brownian correlation. What if the sample is not iid, what if the sample comes from a stochastic process?
• Colloquium talk. Partial distance correlation. Distance correlation and dissimilarities via unbiased distance covariance estimates. What is wrong with the Mantel test? Variable selection via pdCor. What is a good measure of dependence? My Erlangen program in statistics.
Lecture 1. Distance Correlation
Dependence Measures and Tests for Independence
Kolmogorov: "Independence is the most important notion of probability theory"
• Correlation (Galton 1885-1888, Natural Inheritance, 1889; Pearson, 1895)
• Chi-square (Pearson, 1900)
• Spearman's rank correlation (1904), Amer. J. Psychol. 15: 72–101.
• Fisher, R. (1922) and Fisher's exact test
• Kendall's tau (1938), "A New Measure of Rank Correlation", Biometrika 30 (1–2): 81–89.
• Maximal correlation: Hirschfeld, Gebelein (1941), Lancaster (1957), Rényi (1959), Sarmanov (1958), Buja (1990), Dembo (2001)
• Hoeffding's independence test (1948), Annals of Mathematical Statistics 19: 293–325.
• Blum-Kiefer-Wolfowitz (1961)
• Mantel test (1967)
• RKHS: Baker (1973), Fukumizu, Gretton, Póczos, …
• RV coefficient (1976): Robert, P.; Escoufier, Y., "A Unifying Tool for Linear Multivariate Statistical Methods: The RV-Coefficient", Applied Statistics 25 (3): 257–265.
  [A Stack Exchange question/answer mentions dCor and reports it performed better than the RV coefficient: http://math.stackexchange.com/questions/690972/distance-or-similarity-between-matrices-that-are-not-the-same-size]
• Distance correlation (dCor): Székely (2005); Székely, Bakirov and Rizzo (2007)
  [A free Matlab/Octave implementation: http://mastrave.org/doc/mtv_m/dist_corr ; a link could be added to our energy page.]
• Brownian correlation: Székely and Rizzo (2009)
dCor generalizes and improves Correlation, RV, Mantel and Chi-square (denominator!), MIC (2010). Valhalla --- GÖTTERDÄMMERUNG
Kolmogorov: "Independence is the most important notion of probability theory"
What is Pearson's correlation?
Sample: (Xk, Yk), k = 1, 2, …, n
Centered sample: Ak := Xk − X̄, Bk := Yk − Ȳ (X̄, Ȳ are the sample means)
cov(x,y) = (1/n) Σk Ak Bk
r := cor(x,y) = cov(x,y) / [cov(x,x) cov(y,y)]^(1/2)
Prehistory:
(i) Gauss (1823): the normal surface with n correlated variables; for Gauss this was just one of several parameters.
(ii) Auguste Bravais (1846) referred to one of the parameters of the bivariate normal distribution as "une corrélation", but like Gauss he did not recognize the importance of correlation as a measure of dependence between variables. [Analyse mathématique sur les probabilités des erreurs de situation d'un point. Mémoires présentés par divers savants à l'Académie royale des sciences de l'Institut de France, 9, 255-332.]
(iii) Francis Galton (1885-1888)
(iv) Karl Pearson (1895), product-moment r. "LIII. On lines and planes of closest fit to systems of points in space", Philosophical Magazine Series 6, 1901 -- cited by 1700. Pearson had no unpublished thoughts.
Why do we (NOT) like Pearson's correlation? What is the remedy?
Apples and Oranges
If we want to study the dependence between oranges and apples, it is hard to add or multiply them, but it is always easy to do the same with their distances.
ak,l := |Xk − Xl|, bk,l := |Yk − Yl| for k, l = 1, 2, …, n
Ak,l := ak,l − āk. − ā.l + ā.. , Bk,l := bk,l − b̄k. − b̄.l + b̄.. (row mean, column mean, grand mean of the distance matrices)
Distance Cov(X,Y)² := dCov²(X,Y) := V²(X,Y) := (1/n²) Σk,l Ak,l Bk,l ≥ 0 (!?!)
See Székely, Rizzo, Bakirov (2007), Ann. Statist. 35/7.
Distance covariance: V²(X,Y) := (1/n²) Σk,l Ak,l Bk,l
Distance standard deviation: V(X) := V(X,X), V(Y) := V(Y,Y)
Distance correlation: dCor²(X,Y) := R²(X,Y) := V²(X,Y) / [V(X) V(Y)]
This should be introduced in our teaching at the undergraduate level.
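A minimal sketch of these sample statistics in Python/NumPy, following the double-centering recipe of the previous slide (the helper names dcov2 and dcor2 are mine, not from the energy package):

```python
import numpy as np

def _centered_dist(x):
    """Double-centered distance matrix: A_kl = a_kl - row mean - column mean + grand mean."""
    x = np.atleast_2d(x)
    if x.shape[0] == 1:                      # treat a flat vector as n observations in R^1
        x = x.T
    a = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)   # a_kl = |X_k - X_l|
    return a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()

def dcov2(x, y):
    """V-statistic estimate of dCov^2: (1/n^2) * sum_kl A_kl B_kl."""
    A, B = _centered_dist(x), _centered_dist(y)
    return (A * B).mean()

def dcor2(x, y):
    """Squared distance correlation R^2 = V^2(X,Y) / sqrt(V^2(X,X) V^2(Y,Y))."""
    v_xy, v_xx, v_yy = dcov2(x, y), dcov2(x, x), dcov2(y, y)
    return 0.0 if v_xx * v_yy == 0 else v_xy / np.sqrt(v_xx * v_yy)

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = x ** 2 + 0.1 * rng.normal(size=200)      # strongly dependent but nearly uncorrelated
print(dcor2(x, y), np.corrcoef(x, y)[0, 1])
```

For data in R^p one passes an (n, p) array; the flat-vector branch treats a 1-D input as n scalar observations.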
The population values are a.s. limits of the empirical ones as n → ∞.
Thm: dCov² = ||f_n(s,t) − f_n(s) f_n(t)||², where the f_n are the empirical (joint and marginal) characteristic functions and ||·|| is the L²-norm with the singular kernel w(s,t) := c/(st)².
This kernel is unique if we require the following invariance: dCov²(a1 + b1 O1 X, a2 + b2 O2 Y) = |b1 b2| dCov²(X,Y).
A beautiful theorem on Fourier transforms
∫ (1 − cos(tx)) / t² dt = c |x|
The Fourier transform of any power of |t| is a constant times a power of |x|.
Gel'fand, I. M. and Shilov, G. E. (1958, 1964), Generalized Functions.
Thm
V(X) = 0 iff X is constant
V(a + bCX) = |b| V(X) (C an orthonormal matrix)
V(X+Y) ≤ V(X) + V(Y) for independent random variables, with equality iff X or Y is constant
0 ≤ dCor(X,Y) ≤ 1
dCor(X,Y) = 0 iff X and Y are independent
dCor(X,Y) = 1 iff Y = a + bXC for some vector a, nonzero scalar b and orthonormal matrix C
akl := |Xk − Xl|^α, bkl := |Yk − Yl|^α define Rα, for 0 < α < 2 [R1 = R]
R2(X,Y) = |Cor(X,Y)| = |Pearson's correlation|
energy: E-statistics (energy statistics). R package version 1.1-0.
Thm. Under independence of X and Y,
n dCov²n(X,Y) → Q = Σk λk Z²k (in distribution);
otherwise the limit is ∞.
Thus we have a universally consistent test of independence.
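Since the null distribution Q involves unknown eigenvalues, one practical route is a permutation test on the statistic n·dCov²n. A hedged sketch for univariate samples (helper names are mine; the energy package in R provides the reference implementation):

```python
import numpy as np

def _dcov2(x, y):
    """V-statistic dCov^2 for univariate samples (double-centered distance matrices)."""
    def centered(v):
        d = np.abs(v[:, None] - v[None, :])
        return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()
    return (centered(x) * centered(y)).mean()

def dcov_permutation_test(x, y, n_perm=999, seed=0):
    """Permutation p-value for independence based on the statistic n * dCov^2_n."""
    rng = np.random.default_rng(seed)
    n = len(x)
    stat = n * _dcov2(x, y)
    exceed = sum(n * _dcov2(x, y[rng.permutation(n)]) >= stat for _ in range(n_perm))
    return stat, (exceed + 1) / (n_perm + 1)

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = np.sin(x) + 0.3 * rng.normal(size=100)
print(dcov_permutation_test(x, y))
```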
What if (X,Y) is bivariate normal?
In this case 0.89 |cor| ≤ dCor ≤ |cor|.
Unbiased distance correlation
The unbiased estimator of dCov²(X,Y) is dCov*n := <A*, B*> := 1/[n(n−3)] Σk≠l A*k,l B*k,l, where A* and B* are the U-centered distance matrices.
This is an inner product in the linear space Hn of n×n matrices generated by n×n distance matrices. The population Hilbert space is denoted by H, where the inner product is (generated by) dCov*(X,Y).
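A sketch of the U-centered estimator and the resulting bias-corrected distance correlation (assuming the U-centering used in the partial distance correlation construction and n > 3; helper names are mine):

```python
import numpy as np

def _u_centered(x):
    """U-centered distance matrix A*: row/column sums scaled by 1/(n-2), grand sum by 1/((n-1)(n-2)); zero diagonal."""
    x = np.atleast_2d(x)
    if x.shape[0] == 1:
        x = x.T
    n = x.shape[0]
    a = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    A = (a
         - a.sum(axis=0)[None, :] / (n - 2)
         - a.sum(axis=1)[:, None] / (n - 2)
         + a.sum() / ((n - 1) * (n - 2)))
    np.fill_diagonal(A, 0.0)
    return A

def dcov2_unbiased(x, y):
    """dCov*_n = (1/[n(n-3)]) * sum_{k != l} A*_kl B*_kl (unbiased for dCov^2)."""
    A, B = _u_centered(x), _u_centered(y)
    n = A.shape[0]
    return (A * B).sum() / (n * (n - 3))

def dcor2_bias_corrected(x, y):
    """Bias-corrected squared distance correlation R*_n (may be slightly negative)."""
    v_xy = dcov2_unbiased(x, y)
    v_xx = dcov2_unbiased(x, x)
    v_yy = dcov2_unbiased(y, y)
    return 0.0 if v_xx * v_yy <= 0 else v_xy / np.sqrt(v_xx * v_yy)
```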
The power of the dCor test of independence is very good, especially in high dimensions p, q.
Denote the unbiased version by dCov*n; the corresponding bias-corrected distance correlation is R*n. This is the correlation for the 21st century.
Theorem. In high dimension, if the CLT holds for the coordinates, then
Tn := [M − 1]^(1/2) R*n / [1 − (R*n)²]^(1/2), where M = n(n−3)/2,
is t-distributed with d.f. M − 1.
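A sketch of the resulting t-test (valid under the high-dimensional conditions of the theorem; it takes the bias-corrected statistic R*n, e.g. from a helper such as dcor2_bias_corrected sketched above, and the sample size n):

```python
import numpy as np
from scipy import stats

def dcor_t_test(r_star, n):
    """T_n = sqrt(M-1) * R*_n / sqrt(1 - (R*_n)^2) with M = n(n-3)/2; p-value from t with M-1 d.f."""
    M = n * (n - 3) / 2.0
    t = np.sqrt(M - 1) * r_star / np.sqrt(1 - r_star ** 2)
    return t, stats.t.sf(t, df=M - 1)       # reject independence for large T_n
```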
Why? R*n = Σij Uij Vij / [Σ Uij² Σ Vij²]^(1/2) with iid standard normal variables.
Put Zij = Uij / [Σ Uij²]^(1/2); then Σ Zij² = 1.
Under the null (independence of the Uij and Vij), Zij does not depend on Vij.
Given Z, by Cochran's theorem (the square of Σij Zij Vij has rank 1), Tn is t-distributed when Z is given; since this conditional distribution does not depend on Z, the same holds unconditionally.
Under the alternative?
We need to show that if U, V are standard normal with zero expected value and correlation ρ > 0, then P(UV > c) is a monotone increasing function of ρ.
For the proof notice that if X, Y are iid standard normal and a² + b² = 1, 2ab = ρ, then for U := aX + bY and V := bX + aY we have Var(U) = Var(V) = 1 and E(UV) = 2ab = ρ. Thus
UV = ab(X² + Y²) + (a² + b²)XY = ρ(X² + Y²)/2 + XY,
which is pointwise increasing in ρ since X² + Y² ≥ 0. Q.E.D.
(I do not need it, but I do not know what happens if the expectations are not zero.)
The number of operations is O(n²), independently of the dimension, which can even be infinite (X and Y can be in two different metric spaces, e.g. Hilbert spaces).
The storage complexity can be reduced to O(n) via a recursive formula.
Parallel processing for big n?
A characteristic measure of dependence (population value):
dCov²(X,Y) = E|X−X'||Y−Y'| + E|X−X'| E|Y−Y'| − 2 E|X−X'||Y−Y''|
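A quick Monte Carlo illustration that the empirical analogue of this formula agrees with the double-centering formula of the earlier slides (purely illustrative; the two sample versions are algebraically equal):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)

dx = np.abs(x[:, None] - x[None, :])
dy = np.abs(y[:, None] - y[None, :])

# Empirical analogue of E|X-X'||Y-Y'| + E|X-X'| E|Y-Y'| - 2 E|X-X'||Y-Y''|
s1 = (dx * dy).mean()
s2 = dx.mean() * dy.mean()
s3 = (dx.mean(axis=1) * dy.mean(axis=1)).mean()
formula_version = s1 + s2 - 2 * s3

# The double-centering version gives the same number
Ax = dx - dx.mean(axis=0) - dx.mean(axis=1)[:, None] + dx.mean()
Ay = dy - dy.mean(axis=0) - dy.mean(axis=1)[:, None] + dy.mean()
print(formula_version, (Ax * Ay).mean())
```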
dCov = cov of distances?
(X,Y), (X',Y'), (X'',Y'') are iid.
dCov²(X,Y) = E[|X−X'||Y−Y'|] + E|X−X'| E|Y−Y'| − E[|X−X'||Y−Y''|] − E[|X−X''||Y−Y'|]
= cov(|X−X'|, |Y−Y'|) − 2 cov(|X−X'|, |Y−Y''|)
(i) Does cov(|X−X'|, |Y−Y'|) = 0 imply that X and Y are independent?
(ii) Does the independence of X and Y imply the independence of X, Y?
For (i): q(x) = −c/2 for −1 < x < 0, ½ for 0 < x < c, 0 otherwise, and p(x,y) := 1/4 − q(x)q(y) gives a counterexample.
Max correlation?
maxCor(X,Y) := sup over f, g of Cor(f(X), g(Y)), for all Borel functions f, g with 0 < Var f(X), Var g(Y) < ∞.
Why should we (not) like max cor?
• If maxCor(X,Y) = 0 then X, Y are independent.
• For the bivariate normal, maxCor = |cor|.
• For partial sums of iid variables, maxCor²(Sm, Sn) = m/n for m ≤ n. Sarmanov (1958), Dokl. Akad. Nauk SSSR.
What is wrong with maxCor?
Trigonometric coins
Sn := sin U + sin 2U + … + sin nU tends to Cauchy (we did not divide by √n !!)
Open problem: What is the sup of dCor for uncorrelated X and Y? Can it be > 0.85?
Lecture 2. Energy Statistics (E-statistics)
Newton's gravitational potential energy can be generalized for statistical applications. Statistical observations are heavenly bodies (in a metric space) governed by a statistical potential energy which is zero iff an underlying statistical null hypothesis holds. Potential energy statistics are symmetric functions of distances between statistical observations in metric spaces.
EXAMPLE: Testing Independence
Potential Energy Statistics
Potential energy statistics, or energy statistics, or E-statistics for short, are U-statistics or V-statistics that are functions of distances between sample elements. The idea is to consider statistical observations as heavenly bodies governed by a statistical potential energy which is zero iff an underlying statistical null hypothesis is true.
Distances and Energy: the next level of abstraction (Prelude)
In the beginning Man created integers. The accountants of Uruk in Mesopotamia, about five thousand years ago, invented the first numerals: signs encoding the concept of oneness, twoness, threeness, etc., abstracted from any particular entity. Before that, for about another 5000 years, jars of oil were counted with ovoids, measures of grain were counted with cones, etc.; numbers were indicated by one-to-one correspondence. Numerals revolutionized our civilization: they expressed abstract thoughts; after all, "two" does not exist in nature, only two fingers, two people, two sheep, two apples, two oranges. After this abstraction we could not tell from the numerals what the objects were; seeing the signs 1, 2, 3, ... we could not see or smell oranges, apples, etc., but we could do comparisons, we could do "statistics", "statistical inference".
In this lecture, instead of working with statistical observations, data taking integer or real values, or taking values in Euclidean spaces, Hilbert spaces or in more general metric spaces, we make inferences from their distances. Distances and angles work wonders in science (see e.g. Thales, 600 BC; G. J. Székely: Thales and the Ten Commandments). Here we will exploit this in statistics. Instead of working with numbers, vectors, functions, etc., we first compute their distances, and all our inferences will be based on these distances.
This is the next level of abstraction, where not only can we not tell what the objects are, we cannot even tell how big their numbers are; we cannot tell what the data are, we can only tell how far they are from each other. At this level of abstraction we of course lose even more information; we cannot sense many properties of the data. For example, if we add the same constant to all data then their distances do not change; no rigid motion of the space changes the distances. On the other hand we gain a lot: distances are always easy to add, multiply, etc., even when it is not so natural to add or multiply vectors and more abstract observations, especially if they are not from the same space.
The next level of abstraction, beyond energy statistics, is invariance with respect to ratios of distances: the angles are invariant. Distance correlation depends on angles.
Application in statistics
Construct a U (or V) statistic with kernel h(x,y) = E|x−X| + E|y−Y| − E|X−Y| − |x−y|.
Vn = (1/n²) Σi,k h(Xi, Xk)
Under the null E h(X,Y) = 0, and h is also a rank-1 degenerate kernel because E h(x,Y') = 0 a.s. under the null. Thus under the null the limit distribution of nVn is Q := Σk λk Zk², where the λk are eigenvalues of the Hilbert-Schmidt equation ∫ h(x,y) ψ(y) dF(y) = λ ψ(x), and under the alternative (X and Y have different distributions) nVn → ∞ a.s.
What to do with the Hilbert-Schmidt equation ∫ h(x,y) ψ(y) dF(y) = λ ψ(x), Q := Σk λk Zk² ?
(i) Approximate the eigenvalues: solve (1/n) Σi h(Xi, Xj) ψ(Xi) = λ ψ(Xj), i.e. take the eigenvalues of the matrix [h(Xi, Xj)/n].
(ii) If Σi λi = 1 then P(Q ≥ c) ≤ P(Z² ≥ c) whenever this probability is at most 0.215 [conservative, consistent test].
(iii) t-test (see later).
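A hedged sketch of step (i) given a precomputed symmetric kernel matrix H with entries h(Xi, Xj) (the function name is mine; instead of the analytic bound in (ii), it simply simulates Q from the approximated eigenvalues):

```python
import numpy as np

def null_quantile_from_kernel(H, alpha=0.05, n_sim=100_000, seed=0):
    """Approximate the eigenvalues lambda_k from H/n and simulate Q = sum_k lambda_k Z_k^2."""
    n = H.shape[0]
    lam = np.linalg.eigvalsh(H / n)
    lam = lam[lam > 1e-12]                  # keep the numerically positive eigenvalues
    rng = np.random.default_rng(seed)
    Q = rng.standard_normal((n_sim, lam.size)) ** 2 @ lam
    return np.quantile(Q, 1 - alpha)        # compare n * V_n with this critical value
```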
Simple vs Good
Eα(X,Y) := 2E|X−Y|^α − E|X−X'|^α − E|Y−Y'|^α ≥ 0 for 0 < α < 2, with = 0 iff X and Y are identically distributed.
For α = 2 we have E2(X,Y) = 2[E(X) − E(Y)]².
In the case of "classical statistics" (α = 2) life is simple but not always good; in the case of "energy statistics" (0 < α < 2) life is not so simple but good.
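A minimal sketch of the two-sample plug-in estimate of Eα for univariate samples (higher dimensions would use Euclidean norms; the name energy_distance is mine):

```python
import numpy as np

def energy_distance(x, y, alpha=1.0):
    """Plug-in estimate of E_alpha(X,Y) = 2 E|X-Y|^a - E|X-X'|^a - E|Y-Y'|^a (univariate samples)."""
    dxy = np.abs(x[:, None] - y[None, :]) ** alpha
    dxx = np.abs(x[:, None] - x[None, :]) ** alpha
    dyy = np.abs(y[:, None] - y[None, :]) ** alpha
    return 2 * dxy.mean() - dxx.mean() - dyy.mean()
```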
Testing for (multivariate) normality
Standardized sample: y1, y2, …, yn
E|Z − Z'| = ? For U(a,b): E|U − U'| = |b−a|/3; for the exponential, E|e − e'| = 1/λ.
E|y − Z| = ? For U(a,b): E|x − U| is a quadratic polynomial. (Hint: if Z is a d-variate standard normal then |y − Z|² has a noncentral chi-square distribution with non-centrality parameter |y|²/2 and d.f. d + p, where p is a Poisson r.v. with mean |y|²/2; see Zacks (1981), p. 55.)
In 1 dimension: E|y − Z| = (2/π)^(1/2) exp{−y²/2} + y(2Φ(y) − 1).
For implementation see the energy package in R and Székely and Rizzo (2004).
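A univariate sketch of the resulting energy statistic for normality, using the closed form above and E|Z − Z'| = 2/√π in one dimension (the function name is mine; see the energy package in R for the reference implementation):

```python
import numpy as np
from scipy import stats

def energy_normality_stat(y):
    """n * [ (2/n) sum_i E|y_i - Z| - E|Z - Z'| - (1/n^2) sum_{i,j} |y_i - y_j| ] for a standardized sample y."""
    n = len(y)
    e_yz = np.sqrt(2 / np.pi) * np.exp(-y ** 2 / 2) + y * (2 * stats.norm.cdf(y) - 1)  # E|y_i - Z|
    e_zz = 2 / np.sqrt(np.pi)                                                          # E|Z - Z'|
    e_yy = np.abs(y[:, None] - y[None, :]).mean()                                      # (1/n^2) sum |y_i - y_j|
    return n * (2 * e_yz.mean() - e_zz - e_yy)

v = np.random.default_rng(3).normal(size=200)
y = (v - v.mean()) / v.std()
print(energy_normality_stat(y))
```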
Why is energy a very good test for normality?
1. It is affine invariant.
2. It is consistent against general alternatives.
3. It is a powerful omnibus test.
In the univariate case our energy test is "almost" the same as the Anderson-Darling EDF test based on ∫ (Fn(x) − F(x))² dF(x) / [F(x)(1 − F(x))].
But here dF(x)/[F(x)(1 − F(x))] is close to constant for the standard normal F, and thus almost the same as "energy"; so our energy test is essentially a multivariate extension of the powerful Anderson-Darling test.
Distance skewness
Advantage: Skew(X) := E[((X − E(X))/σ)³] = 0 does NOT characterize symmetry, but distance skewness
dSkew(X) := 1 − E|X − X'| / E|X + X'| = 0 iff X is centrally symmetric.
Sample version: 1 − Σ|Xi − Xk| / Σ|Xi + Xk|
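A tiny sketch of the sample distance skewness (the function name is mine):

```python
import numpy as np

def distance_skewness(x):
    """Sample dSkew = 1 - sum|Xi - Xk| / sum|Xi + Xk|; values near 0 indicate central symmetry."""
    diff = np.abs(x[:, None] - x[None, :]).sum()
    plus = np.abs(x[:, None] + x[None, :]).sum()
    return 1 - diff / plus
```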
DISCO: a nonparametric extension of ANOVA
DISCO is a multi-sample test of equal distributions, a generalization of the hypothesis of equal means, which is ANOVA.
Put A = (X1, X2, …, Xn), B = (Y1, Y2, …, Ym), and d(A,B) := (1/(nm)) Σ |Xi − Yk|.
Within-sample dispersion: W := Σj (nj/2) d(Aj, Aj)
Put N = n1 + n2 + … + nK and A := {A1, A2, …, AK}.
Total dispersion: T := (N/2) d(A, A)
Thm. T = B + W, where B, the between-sample dispersion, is the energy distance, i.e. the weighted sum of E(Aj, Ak) = 2 d(Aj, Ak) − d(Aj, Aj) − d(Ak, Ak).
The same thing with exponent α = 2 in d(A,B) is ANOVA.
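A sketch of the decomposition T = B + W for univariate groups, computing B as T − W (helper names are mine):

```python
import numpy as np

def d_mean(a, b, alpha=1.0):
    """d(A,B) = (1/(nm)) * sum_{i,k} |X_i - Y_k|^alpha."""
    return (np.abs(a[:, None] - b[None, :]) ** alpha).mean()

def disco(groups, alpha=1.0):
    """Return (T, W, B): total, within-sample, and between-sample dispersion with T = W + B."""
    sizes = [len(g) for g in groups]
    N = sum(sizes)
    pooled = np.concatenate(groups)
    T = (N / 2) * d_mean(pooled, pooled, alpha)
    W = sum((n_j / 2) * d_mean(g, g, alpha) for n_j, g in zip(sizes, groups))
    return T, W, T - W

rng = np.random.default_rng(4)
groups = [rng.normal(loc=m, size=50) for m in (0.0, 0.0, 0.5)]
print(disco(groups))
```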
E-clustering
Hierarchical clustering: we merge the clusters with minimum energy distance, updating via
E(Ci ∪ Cj, Ck) = [(ni + nk)/(ni + nj + nk)] E(Ci, Ck) + [(nj + nk)/(ni + nj + nk)] E(Cj, Ck) − [nk/(ni + nj + nk)] E(Ci, Cj).
In E-clustering not only the cluster centers matter but the whole cluster point distributions. If the exponent in d is α = 2 then we get Ward's minimum variance method, a geometric method that separates and identifies clusters by their centers. Thus Ward is not consistent but E-clustering is consistent. The ability of E-clustering to separate and identify clusters with equal or nearly equal centers has important practical applications.
For details see Székely and Rizzo (2005), Hierarchical clustering via joint between-within distances, Journal of Classification, 22(2), 151-183.
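A minimal sketch of the merge update above (just the formula, not a full clustering routine; names are mine):

```python
def merged_energy_distance(E_ik, E_jk, E_ij, n_i, n_j, n_k):
    """Energy distance between the merged cluster Ci U Cj and Ck, from the pairwise distances and cluster sizes."""
    n = n_i + n_j + n_k
    return ((n_i + n_k) / n) * E_ik + ((n_j + n_k) / n) * E_jk - (n_k / n) * E_ij
```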
Kinetic Energy (E)
Under the null the limit distribution of nVn is Q := Σk λk Zk², where the λk are eigenvalues of the Hilbert-Schmidt equation ∫ h(x,y) ψ(y) dF(y) = λ ψ(x) with h(x,y) = E|x−X| + E|y−Y| − E|X−Y| − |x−y|.
Differentiate twice: −ψ''/(2f) = E ψ, with boundary conditions ψ'(a) = ψ'(b) = 0 (the second derivative with respect to x of (1/2)|x−y| is −δ(x−y), where δ is the Dirac delta).
Thus in 1 dimension E = 1/λ. Thus we transformed the potential energy (Hilbert-Schmidt) equation into a kinetic energy (Schrödinger) equation.
Schrödinger equation (1926): −ψ(x)''/(2m) + V(x)ψ(x) = (E + 1/E)ψ(x)
Energy conservation law?
My Erlangen Program in Statistics
Klein, Felix (1872), "A comparative review of recent researches in geometry": a classification of geometries via invariances (Euclidean, similarity, affine, projective, …). Klein was then at Erlangen.
Energy statistics are always rigid-motion invariant; their ratios, e.g. dCor, are also invariant with respect to scaling (angles remain invariant, as in Thales's geometry of similarities).
Can we have more invariance? In the univariate case we have monotone-invariant rank statistics. But in the multivariate case, if a statistic is 1-1 affine/projective invariant and continuous then it is constant. (A projection is affine but not 1-1; still, because of continuity the statistics are invariant under all projections onto (coordinate) lines, and thus they are constant.)
Affine invariant energy statistics
They cannot be continuous, but in the case of testing for normality affine invariance is natural (it is not natural for testing independence, because it changes angles). BUT dCor = 0 is invariant with respect to all 1-1 Borel functions, and max cor is also invariant with respect to all 1-1 Borel functions; these, however, are population values. Maximal correlation is too invariant. Why? Max correlation can easily be 1 for uncorrelated random variables, but the max of dCor for uncorrelated variables is < 0.85.
Unsolved dCor problems
• Using subsampling, construct a confidence interval for dCor². Why (not) bootstrap?
• Define the complexity of a function f via dCor(X, f(X)).
• Find sup dCor(X,Y) for uncorrelated X and Y.
Lecture 3. Brownian Correlation / Covariance
Xid := X − E(X) = id(X) − E(id(X) | id(·)), Cov²(X,Y) = E(Xid X'id Yid' Y'id')
XW := W(X) − E(W(X) | W(·)), Cov²W(X,Y) := E(XW X'W YW' Y'W')
Remark: Covid(X,Y) = |Cov(X,Y)|
Theorem: dCov(X,Y) = CovW(X,Y) (!!) Székely (2009), Ann. Appl. Statist. 3/4, Discussion Paper.
What if Brownian motion is replaced by another stochastic process? What matters is the (positive definite) covariance function of the process.
Why Brownian?
We can replace BM by any two stochastic processes U = U(t) and V = V(t):
Cov²U,V(X,Y) := E(XU X'U YV Y'V)
But why is this generalization good, how do we compute it, how do we apply it?
The covariance function of BM is 2 min(s,t) = |s| + |t| − |s−t|.
Fractional BM
The simplest extension is |s|^α + |t|^α − |s−t|^α, and a zero-mean Gaussian process with this covariance is the fractional BM, defined for 0 < α < 2. This process was mentioned for the first time in Kolmogorov (1940). α = 2H, where H is the Hurst exponent; the fractal dimension is D = 2 − H. H describes the raggedness of the resultant motion, with a higher value leading to a smoother motion. The value of H determines what kind of process the fBm is:
• if H = 1/2 then the process is in fact a Brownian motion or Wiener process;
• if H > 1/2 then the increments of the process are positively correlated;
• if H < 1/2 then the increments of the process are negatively correlated.
The increment process, X(t) = BH(t+1) − BH(t), is known as fractional Gaussian noise.
Variogram
What properties of the (fractional) BM do we need to make sure that the covariance with respect to certain stochastic processes is of "energy" type, i.e. depends only on the distances of the observations?
In spatial statistics the variogram 2γ(s,t) of a random field Z(t) is 2γ(s,t) := Var(Z(s) − Z(t)). Suppose E(Z(t)) = 0. For stationary processes γ(s,t) = γ(s−t), and for stationary isotropic ones γ(s,t) = γ(|s−t|).
A function is a variogram of a zero-expected-value process/field iff it is conditionally negative definite (see later).
If the covariance function C(s,t) of the process exists, then 2C(s,t) = 2E[Z(s)Z(t)] = E[Z(s)²] + E[Z(t)²] − E[(Z(t) − Z(s))²]. For BM we had 2 min(s,t) = |s| + |t| − |s−t|. We also have the converse: 2γ(s,t) = Var(Z(s) − Z(t)) = C(s,s) + C(t,t) − 2C(s,t).
Cov²U,V(X,Y) is of "energy type" if the increments of U, V are stationary and isotropic.
Special Gaussian processes
The negative log of the symmetric Laplace ch.f., γ(t) := log(1 + |t|²), is conditionally negative definite, so it defines a Laplace-Gaussian process with the corresponding C(s,t).
The negative log of the ch.f. of the difference of two iid Poisson variables is γ(t) := 1 − cos t (up to a constant factor). This defines a Poisson-Gaussian process.
Correlation with respect to stochastic processes
When only the covariance counts, we can assume the processes are Gaussian. Why do we need this generalization?
Conjecture. We need this generalization if the observations (Xt, Yt) are not iid but stationary ergodic. Then consider the correlation with respect to zero-mean (Gaussian) processes with stationary increments having the same covariance as (Xt, Yt)?
A property of squared distances
What exactly are the properties of distances in Euclidean spaces (and Hilbert spaces) that we need for statistical inference? We need the following property of squared distances |x−y|² in Hilbert spaces.
Let H be a real Hilbert space and xi ∈ H. If ai ∈ R and Σi ai = 0, then
Σij ai aj |xi − xj|² = −2 |Σi ai xi|² ≤ 0.
If yi ∈ H, i = 1, 2, …, n, is another set of elements of H, then
Σij ai aj |xi − yj|² = −2 ⟨Σi ai xi, Σj aj yj⟩ ≥ −|Σi ai (xi + yi)|².
This is what we call the (conditional) negative definite property of |x−y|².
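A quick numeric check of the first identity, the conditional negative definiteness of squared distances (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 10, 3
x = rng.normal(size=(n, d))
a = rng.normal(size=n)
a -= a.mean()                              # enforce sum_i a_i = 0

D2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)   # squared distances |x_i - x_j|^2
lhs = a @ D2 @ a                                           # sum_ij a_i a_j |x_i - x_j|^2
rhs = -2 * np.sum(((a[:, None] * x).sum(axis=0)) ** 2)     # -2 |sum_i a_i x_i|^2
print(lhs, rhs)                                            # equal up to rounding, and <= 0
```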