Multivariate Analysis: Past, Present and Future. Harrison B. Prosper, Florida State University. PHYSTAT 2003, 10 September 2003.
Multivariate Analysis: Past, Present and Future Harrison B. Prosper Florida State University PHYSTAT 2003 10 September 2003 Multivariate Analysis PHYSTAT 2003 Harrison B. Prosper
Outline • Introduction • Historical Note • Current Practice • Issues • Summary
Introduction • Data are invariably multivariate • Particle physics (η, φ, E, f) • Astrophysics (θ, φ, E, t)
Introduction – II A Textbook Example • Objects • Jet 1 (b): 3 • Jet 2: 3 • Jet 3: 3 • Jet 4 (b): 3 • Positron: 3 • Neutrino: 2 • Total: 17
Introduction – III • Astrophysics/Particle physics: Similarities • Events • Interesting events occur at random • Poisson processes • Backgrounds are important • Experimental response functions • Huge datasets
Introduction – IV • Differences • In particle physics we control when events occur and under what conditions • We have detailed predictions of the relative frequency of various outcomes
Introduction – V All we do is Count! • Our experiments are ideal Bernoulli trials • At Fermilab, each collision, that is, each trial, is conducted the same way every 400 ns • de Finetti’s analysis of exchangeable trials is an accurate model of what we do [Figure: time axis]
Introduction – VI • Typical analysis tasks • Data Compression • Clustering and cluster characterization • Classification/Discrimination • Estimation • Model selection/Hypothesis testing • Optimization
Historical Note • Karl Pearson (1857 – 1936) • R.A. Fisher (1890 – 1962) • P.C. Mahalanobis (1893 – 1972)
Historical Note – Iris Data Iris Versicolor Iris Setosa R.A. Fisher, The Use of Multiple Measurements in Taxonomic Problems, Annals of Eugenics, v. 7, p. 179-188 (1936)
Iris Data • Variables • X1 Sepal length • X2 Sepal width • X3 Petal length • X4 Petal width • “What linear function of the four measurements will maximize the ratio of the difference between the specific means to the standard deviations within species?” R.A. Fisher
Fisher Linear Discriminant (1936) Solution: Which is the same, within a constant, as
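The equations on this slide did not survive extraction. As a rough illustration only, the sketch below assumes the standard textbook form of Fisher's solution, w ∝ S_W^-1 (μ_a − μ_b), where S_W is the summed within-class scatter matrix. The two-variable toy samples, the function names and the squared form of the ratio are all assumptions made for the example; this is not the iris data and not necessarily the notation used in the talk.

```python
import random

def mean_vec(points):
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]

def scatter(points, mu):
    # 2x2 within-class scatter matrix: sum of (p - mu)(p - mu)^T
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = (p[0] - mu[0], p[1] - mu[1])
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_direction(a, b):
    # w = S_W^-1 (mu_a - mu_b), using the closed-form inverse of a 2x2 matrix
    mu_a, mu_b = mean_vec(a), mean_vec(b)
    sa, sb = scatter(a, mu_a), scatter(b, mu_b)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    dm = (mu_a[0] - mu_b[0], mu_a[1] - mu_b[1])
    return [inv[0][0] * dm[0] + inv[0][1] * dm[1],
            inv[1][0] * dm[0] + inv[1][1] * dm[1]]

def fisher_ratio(w, a, b):
    # (difference of projected means)^2 / summed projected within-class scatter
    pa = [w[0] * p[0] + w[1] * p[1] for p in a]
    pb = [w[0] * p[0] + w[1] * p[1] for p in b]
    ma, mb = sum(pa) / len(pa), sum(pb) / len(pb)
    within = sum((v - ma) ** 2 for v in pa) + sum((v - mb) ** 2 for v in pb)
    return (ma - mb) ** 2 / within

# Hypothetical toy samples (not Fisher's measurements)
rng = random.Random(1)
class_a = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
class_b = [(rng.gauss(2, 1), rng.gauss(1, 1)) for _ in range(200)]
w = fisher_direction(class_a, class_b)
print("Fisher direction:", w)
print("ratio along w:", fisher_ratio(w, class_a, class_b))
```

Projecting onto any other direction, for example a single input variable, should give a ratio no larger than the one along w.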
Current Practice in Particle Physics • Reducing number of variables • Principal Component Analysis (PCA) • Discrimination/Classification • Fisher Linear Discriminant (FLD) • Random Grid Search (RGS) • Feedforward Neural Network (FNN) • Kernel Density Estimation (KDE)
Current Practice – II • Parameter Estimation • Maximum Likelihood (ML) • Bayesian (KDE and analytical methods) • e.g., see talk by Florencia Canelli (12A) • Weighting • Usually 0, 1, referred to as “cuts” • Sometimes use the R. Barlow method
Cuts (0, 1 weights) • Points that lie below the cuts are “cut out” (weight 0); the rest are kept (weight 1) • We refer to (x0, y0) as a cut-point [Figure: scatter plot of background (B) and signal (S) points in the (x, y) plane]
Grid Search • Apply cuts at each grid point, compute some measure of their effectiveness, and choose the most effective cuts • Curse of dimensionality: number of cut-points ~ Nbin^Ndim [Figure: scatter plot of background (B) and signal (S) points]
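A minimal sketch of the grid search, under stated assumptions: the cuts keep events with x ≥ x0 and y ≥ y0, and effectiveness is measured by s/√(s+b). The slide does not name a measure, so that figure of merit, the Gaussian toy samples and all function names are illustrative choices only.

```python
import math
import random

def count_passing(events, x0, y0):
    # Events at or above both cuts are kept (weight 1); the rest are cut out.
    return sum(1 for x, y in events if x >= x0 and y >= y0)

def grid_search(signal, background, grid_x, grid_y):
    best = None
    for x0 in grid_x:
        for y0 in grid_y:
            s = count_passing(signal, x0, y0)
            b = count_passing(background, x0, y0)
            if s + b == 0:
                continue
            score = s / math.sqrt(s + b)  # assumed measure of effectiveness
            if best is None or score > best[0]:
                best = (score, x0, y0)
    return best

# Hypothetical toy samples
rng = random.Random(2)
signal = [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(500)]
background = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]

n_bin, n_dim = 10, 2
grid = [-3 + 6 * i / (n_bin - 1) for i in range(n_bin)]
best_score, best_x0, best_y0 = grid_search(signal, background, grid, grid)
n_cut_points = n_bin ** n_dim  # grows exponentially with the number of variables
print(n_cut_points, best_score, best_x0, best_y0)
```

With 10 bins per variable, two variables need 100 cut-points, while 17 variables would need 10^17, which is the curse of dimensionality the slide refers to.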
Random Grid Search • Take each point of the signal class as a cut-point • n = # events in sample • k = # events after cuts • fraction = k/n [Figure: signal fraction vs. background fraction, both axes from 0 to 1; inset with axes x and y] H.B.P. et al., Proceedings, CHEP 1995
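The random grid search can be sketched in the same spirit: every signal event supplies one cut-point, and each cut-point yields one (background fraction, signal fraction) pair. The cut direction (keep x ≥ x0 and y ≥ y0), the toy samples and the final selection rule (highest signal fraction with at most 10% background) are assumptions for illustration, not taken from the CHEP 1995 proceedings.

```python
import random

def fraction_passing(events, x0, y0):
    n = len(events)                                          # events in sample
    k = sum(1 for x, y in events if x >= x0 and y >= y0)     # events after cuts
    return k / n

def random_grid_search(signal, background):
    # One cut-point per signal event, so the "grid" follows the signal density.
    results = []
    for x0, y0 in signal:
        results.append({
            "cut": (x0, y0),
            "signal_fraction": fraction_passing(signal, x0, y0),
            "background_fraction": fraction_passing(background, x0, y0),
        })
    return results

# Hypothetical toy samples
rng = random.Random(3)
signal = [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(300)]
background = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(300)]

results = random_grid_search(signal, background)
# Assumed selection rule: best signal fraction among cuts with <= 10% background
allowed = [r for r in results if r["background_fraction"] <= 0.1]
best = max(allowed, key=lambda r: r["signal_fraction"])
print(best)
```

The number of cut-points equals the number of signal events regardless of how many variables are cut on, which is how this approach sidesteps the exponential growth of a regular grid.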
Example: DØ Top Discovery (1995)
Optimal Discrimination • r(x,y) = constant defines the optimal decision boundary • Bayes Discriminant
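The slide's formulas were images and are not preserved. The sketch below assumes the usual definitions: r(x,y) = p(x,y|S)/p(x,y|B) is the signal-to-background likelihood ratio, and the Bayes discriminant is D = p(x,y|S)p(S) / [p(x,y|S)p(S) + p(x,y|B)p(B)]. The unit-width Gaussian densities are toy choices for which the contours of constant r are the straight lines x + y = constant.

```python
import math

def gauss(v, mu, sigma=1.0):
    return math.exp(-0.5 * ((v - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical densities: signal centred at (2, 2), background at (0, 0)
def p_signal(x, y):
    return gauss(x, 2.0) * gauss(y, 2.0)

def p_background(x, y):
    return gauss(x, 0.0) * gauss(y, 0.0)

def likelihood_ratio(x, y):
    return p_signal(x, y) / p_background(x, y)

def bayes_discriminant(x, y, prior_signal=0.5):
    s = p_signal(x, y) * prior_signal
    b = p_background(x, y) * (1.0 - prior_signal)
    return s / (s + b)

# Points on the same contour x + y = 2 share the same r and the same D
print(likelihood_ratio(0.5, 1.5), likelihood_ratio(1.5, 0.5))
print(bayes_discriminant(1.0, 1.0))
```

Because D is a monotonic function of r, cutting on D and cutting on r select the same events.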
Feedforward Neural Networks • Applications • Discrimination • Parameter estimation • Function and density estimation • Basic Idea • Encode a mapping using a set of 1-D functions (Kolmogorov, 1950s)
Example: DØ Search for Leptoquarks (LQ) [Figure: Feynman diagram with labels l, l, q, LQ, q, g]
Issues • Method choice • Life is short and data finite; so how should one choose a method? • Model complexity • How to reduce dimensionality of data, while minimizing loss of “information”? • How many model parameters? • How should one avoid over-fitting?
Issues – II • Model robustness • Is a cut on a multivariate discriminant necessarily more sensitive to modeling errors than a cut on each of its input variables? • What is a practical, but useful, way to assess sensitivity to modeling errors and robustness with respect to assumptions?
Issues – III • Accuracy of predictions • How should one place “error bars” on multivariate-based results? • Is a Bayesian approach useful? • Goodness of fit • How can this be done in multiple dimensions?
Summary • After ~80 years of effort we have many powerful methods of analysis • A few of these are now used routinely in physics analyses • The most pressing need is to understand some issues better so that when the data tsunami strikes we can respond sensibly
FNN – Probabilistic Interpretation Minimize the empirical risk function with respect to w Solution (for large N) If t(x) = kδ[1 − I(x)], where I(x) = 1 if x is of class k, 0 otherwise D.W. Ruck et al., IEEE Trans. Neural Networks 1(4), 296-298 (1990); E.A. Wan, IEEE Trans. Neural Networks 1(4), 303-305 (1990)
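The risk function and its large-N solution were images and are not preserved. The sketch below assumes the squared-error form R(w) = (1/N) Σ [t_i − n(x_i, w)]² with targets t = 1 for one class and t = 0 for the other, which is the setting of the cited Ruck et al. and Wan papers. The "network" is deliberately reduced to a single logistic unit, and the toy Gaussian classes are chosen so that the exact posterior, 1/(1 + e^(−2x)), is known. Names, learning rate and step count are assumptions.

```python
import math
import random

def network(x, w):
    # A single logistic unit standing in for a feedforward network
    return 1.0 / (1.0 + math.exp(-(w[0] + w[1] * x)))

def empirical_risk(data, w):
    return sum((t - network(x, w)) ** 2 for x, t in data) / len(data)

def train(data, n_steps=300, rate=2.0):
    # Plain full-batch gradient descent on the squared-error risk
    w = [0.0, 0.0]
    n = len(data)
    for _ in range(n_steps):
        g0 = g1 = 0.0
        for x, t in data:
            out = network(x, w)
            common = 2.0 * (out - t) * out * (1.0 - out) / n
            g0 += common
            g1 += common * x
        w[0] -= rate * g0
        w[1] -= rate * g1
    return w

def exact_posterior(x):
    # P(class 1 | x) for unit Gaussians at +1 and -1 with equal priors
    return 1.0 / (1.0 + math.exp(-2.0 * x))

rng = random.Random(5)
data = ([(rng.gauss(1.0, 1.0), 1.0) for _ in range(1000)] +
        [(rng.gauss(-1.0, 1.0), 0.0) for _ in range(1000)])
w = train(data)
for x in (-1.0, 0.0, 0.5, 1.0):
    print(x, round(network(x, w), 3), round(exact_posterior(x), 3))
```

The trained output should track the class posterior to within statistical fluctuations, which is the sense in which a network trained with 0/1 targets can be read as a probability.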
Self-Organizing Map • Basic Idea (Kohonen, 1988) • Map each of K feature vectors X = (x1,..,xN)^T into one of M regions of interest defined by the vector wm so that all X mapped to a given wm are closer to it than to all remaining wm. • Basically, perform a coarse-graining of the feature space.
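A rough sketch of the idea, assuming a one-dimensional chain of M units, Euclidean distance, and a Gaussian neighbourhood whose width and learning rate shrink linearly during training. These schedules, the toy clusters and the function names are illustrative assumptions rather than Kohonen's exact prescription.

```python
import math
import random

def nearest_unit(x, units):
    # Index m of the vector w_m closest to x
    return min(range(len(units)), key=lambda m: math.dist(x, units[m]))

def train_som(data, n_units, n_epochs=20, seed=0):
    rng = random.Random(seed)
    units = [list(rng.choice(data)) for _ in range(n_units)]
    total_steps = n_epochs * len(data)
    step = 0
    for _ in range(n_epochs):
        for x in rng.sample(data, len(data)):
            frac = 1.0 - step / total_steps
            rate = 0.5 * frac + 0.01                  # assumed learning-rate schedule
            width = max(0.5 * n_units * frac, 0.5)    # assumed neighbourhood width
            winner = nearest_unit(x, units)
            for m in range(n_units):
                h = math.exp(-((m - winner) ** 2) / (2.0 * width ** 2))
                for i in range(len(x)):
                    units[m][i] += rate * h * (x[i] - units[m][i])
            step += 1
    return units

# Hypothetical data: three clusters in a 2-D feature space
rng = random.Random(6)
centres = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]
data = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
        for cx, cy in centres for _ in range(100)]

units = train_som(data, n_units=6)
assignments = [nearest_unit(x, units) for x in data]
print(units)
```

Each feature vector ends up assigned to the unit it is closest to, so the M units partition the feature space into M coarse regions.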
Support Vector Machines • Basic Idea • Data that are non-separable in N dimensions have a higher chance of being separable if mapped into a space of higher dimension • Use a linear discriminant to partition the high-dimensional feature space.
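The mapping idea can be shown with a toy example. This is not a full support vector machine: there is no margin maximization or kernel here, and the feature map x → (x, x²), the data values and the hand-picked linear discriminant are all assumptions chosen for illustration.

```python
def feature_map(x):
    # Map the 1-D input into a 2-D feature space
    return (x, x * x)

def separable_by_threshold(class_a, class_b):
    # In 1-D, a linear discriminant is a single threshold on x
    return max(class_a) < min(class_b) or max(class_b) < min(class_a)

def linear_discriminant(z, w=(0.0, 1.0), b=-2.25):
    # Hand-picked hyperplane in the mapped space: x^2 = 2.25
    return w[0] * z[0] + w[1] * z[1] + b

# Hypothetical data: one class sits between the two halves of the other
inner = [-0.9, -0.5, 0.0, 0.4, 0.8]
outer = [-3.0, -2.2, 2.1, 2.7]

print(separable_by_threshold(inner, outer))
inner_scores = [linear_discriminant(feature_map(x)) for x in inner]
outer_scores = [linear_discriminant(feature_map(x)) for x in outer]
print(inner_scores, outer_scores)
```

No single threshold on x separates the two classes, but in the mapped space the sign of the linear discriminant does.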
Independent Component Analysis • Basic Idea • Assume X = (x1,..,xN)^T is a linear sum X=AS of independent sources S = (s1,..,sN)^T. Both A, the mixing matrix, and S are unknown. • Find a de-mixing matrix T such that the components of U = TX are statistically independent
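A minimal two-source sketch of de-mixing, under stated assumptions: the data are first whitened, and the remaining rotation is found by a brute-force scan that maximizes non-Gaussianity, measured here by the absolute excess kurtosis. This is a stand-in chosen for brevity, not a production algorithm such as FastICA, and the uniform sources and mixing matrix are invented for the example.

```python
import math
import random

def centered(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

def rotate(x1, x2, angle):
    c, s = math.cos(angle), math.sin(angle)
    return ([c * a + s * b for a, b in zip(x1, x2)],
            [-s * a + c * b for a, b in zip(x1, x2)])

def whiten(x1, x2):
    # Rotate to diagonalize the 2x2 covariance, then scale to unit variance
    n = len(x1)
    a = sum(u * u for u in x1) / n
    b = sum(u * v for u, v in zip(x1, x2)) / n
    c = sum(v * v for v in x2) / n
    y1, y2 = rotate(x1, x2, 0.5 * math.atan2(2.0 * b, a - c))
    s1 = math.sqrt(sum(u * u for u in y1) / n)
    s2 = math.sqrt(sum(v * v for v in y2) / n)
    return [u / s1 for u in y1], [v / s2 for v in y2]

def excess_kurtosis(v):
    # Valid for zero-mean, unit-variance input
    return sum(a ** 4 for a in v) / len(v) - 3.0

def demix(x1, x2, n_angles=90):
    z1, z2 = whiten(centered(x1), centered(x2))
    best = None
    for i in range(n_angles):
        u1, u2 = rotate(z1, z2, 0.5 * math.pi * i / n_angles)
        score = abs(excess_kurtosis(u1)) + abs(excess_kurtosis(u2))
        if best is None or score > best[0]:
            best = (score, u1, u2)
    return best[1], best[2]

def corr(u, v):
    u, v = centered(u), centered(v)
    return sum(a * b for a, b in zip(u, v)) / math.sqrt(
        sum(a * a for a in u) * sum(b * b for b in v))

# Hypothetical independent sources S and mixing matrix A
rng = random.Random(4)
s1 = [rng.uniform(-1, 1) for _ in range(2000)]
s2 = [rng.uniform(-1, 1) for _ in range(2000)]
x1 = [1.0 * a + 0.5 * b for a, b in zip(s1, s2)]
x2 = [0.3 * a + 1.0 * b for a, b in zip(s1, s2)]

u1, u2 = demix(x1, x2)
print(corr(u1, s1), corr(u1, s2), corr(u2, s1), corr(u2, s2))
```

The recovered components can match the sources only up to ordering, sign and scale, so the comparison uses absolute correlations.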