Exploratory Data Analysis and Multivariate Strategies Andrew Mead (School of Life Sciences)
Multi-… approaches in statistics • Multiple comparison tests • Multiple testing adjustments • Methods for adjusting the significance levels when doing a large number of tests (comparisons between treatments) within a single analysis • Multiple regression analysis • Statistical model building with more than one explanatory variable • Multi-factor analysis of variance • Analysis of designed experiments with more than one explanatory factor • Multivariate Analysis • Methods to summarise and explore the relationships among multiple response variables, and/or to assess differences among “treatments” based on multiple response variables
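As a minimal sketch of what "adjusting the significance levels" can look like in practice, the block below applies two common adjustments (Bonferroni and Holm) to a set of p-values. The slides do not name a specific method, so the choice of these two, the function names and the example p-values are all assumptions made for illustration.

```python
import numpy as np

def bonferroni(p_values):
    """Bonferroni: multiply each p-value by the number of tests, capped at 1."""
    p = np.asarray(p_values, dtype=float)
    return np.minimum(p * p.size, 1.0)

def holm(p_values):
    """Holm step-down adjustment (never more conservative than Bonferroni)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # The k-th smallest p-value (k = 0, 1, ...) is multiplied by (m - k)
    stepped = p[order] * (m - np.arange(m))
    # Adjusted values must not decrease as the raw p-values increase
    stepped = np.maximum.accumulate(stepped)
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(stepped, 1.0)
    return adjusted

# Illustrative p-values only (hypothetical, not from the slides)
p_raw = [0.01, 0.04, 0.03, 0.20]
p_bonf = bonferroni(p_raw)
p_holm = holm(p_raw)
print(p_bonf, p_holm)
```

Each adjusted p-value is then compared with the nominal significance level (for example 0.05) in place of the raw value.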
Multivariate data • Responses by a number of individuals to a range of different (related) questions in a (social studies) survey • Counts of different species in a range of locations in an ecological survey • Measurements of a range of traits of individual people, animals, plants, products, … • Different medical measurements made on a group of patients • Measurements of gene, protein or metabolite expression in biological samples • Counts of sequence consensus matches from microbial samples • Data on the “distances” between a number of objects
Multivariate questions • Identifying groups of objects with similar (and different) responses • e.g. UK landscape areas with similar percentage composition of different land covers (woodland, arable, urban, …)? • Identifying the particular measurements that contribute to the variability among a set of objects • e.g. weed species that are present under different long-term herbicide strategies? • Identifying the particular measurements that contribute to differences between groups of objects • e.g. which genes discriminate between patients with and without some form of cancer? • Identifying the particular measurements that explain variation in some over-arching response variable • e.g. which traits of predatory insects influence the predation rate of aphids?
Exploratory data analysis • Summary statistics / graphical summaries • Variability for each variable/measurement • Groups of observations • both pre-determined – to find potential differences • and to be identified – based on each individual variable • Scatter plots / correlations • Associations between pairs of variables
Univariate analysis • For each individual variable: • Hypothesis tests • Choice depends on the question to be answered • Analysis of variance • For variables measured in designed experiments • Regression analysis • To build statistical models to describe how one response variable depends on one (or more) explanatory variables • Generalised Linear Models (GLMs) • For data where standard assumptions do not hold! • Time Series Analysis • …
Multivariate analysis • For a set of “correlated” variables: • Assess relationships between variables • Consider the effects of “treatments” on these relationships • Consider how a “response” depends on these relationships • Multivariate methods concerned with “data reduction” • Summarise the correlations between variables • Produce a smaller set of (uncorrelated) variables containing the important information • For a set of “related” objects • Identify groups of similar objects • Identify differences between groups of similar objects • And what makes the objects similar!
Simple graphical summaries 1 • For compositional data • e.g. numbers of onion bulbs in different marketable size grades • Present as a stacked bar-chart • For raw data • For percentage of the total
Simple graphical summaries 2 • More general data • e.g. different measurements on a set of plants • Scatter plots for each pair of variables • Present in a matrix • Calculate linear correlation coefficient for each pair of variables
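The correlation step on this slide can be sketched as follows. The three plant measurements are simulated purely for illustration (their names and values are assumptions); the data matrix has one row per plant and one column per variable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30

# Hypothetical measurements on n plants (simulated, not real data)
height = rng.normal(50.0, 5.0, n)
leaf_area = 2.0 * height + rng.normal(0.0, 4.0, n)  # built to correlate with height
root_mass = rng.normal(10.0, 2.0, n)                # generated independently

data = np.column_stack([height, leaf_area, root_mass])  # n rows, p columns

# Linear (Pearson) correlation coefficient for every pair of variables
corr = np.corrcoef(data, rowvar=False)
print(np.round(corr, 2))
```

The result is a symmetric p-by-p matrix with ones on the diagonal, laid out in the same way as the matrix of pairwise scatter plots.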
Two forms of data matrix • The DATA matrix • p variables for each of n samples (observations) • Presented in a rectangular matrix • n rows and p columns • The association matrix • Distance, similarity or dissimilarity • Between every pair of variables or every pair of samples • Symmetric square matrix • n-by-n - between samples • p-by-p - between variables • just show lower triangle • Turns multivariate data into univariate data? • [Figure: the n-by-p data matrix (n samples, p variables) alongside the p-by-p association matrix, shown as its lower triangle]
Analysing Association Data • Start with associations • Distances between locations on a map • Psychometric (sensory) similarities between products • Construct associations from data • Depends on the types of data • Binary (presence/absence) data • Simple matching coefficient; Jaccard coefficient; … • Continuous data • Euclidean distance; Manhattan distance; … • Similarities or Dissimilarities/Distances
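The four coefficients named on this slide can be written out in a few lines. This is a sketch with hypothetical function names and made-up example vectors; the value returned by the Jaccard coefficient when neither object has any presences is a convention assumed here.

```python
import numpy as np

def simple_matching(a, b):
    """Proportion of attributes on which two binary records agree (1-1 or 0-0)."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    return np.mean(a == b)

def jaccard(a, b):
    """Joint presences divided by attributes present in at least one record."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    either = np.sum(a | b)
    # Assumed convention: two all-absent records are treated as identical
    return np.sum(a & b) / either if either else 1.0

def euclidean(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y))

# Illustrative presence/absence records for two sites over five species
site_1 = [1, 1, 0, 0, 1]
site_2 = [1, 0, 0, 0, 1]
sm = simple_matching(site_1, site_2)   # joint absences count as agreement
jc = jaccard(site_1, site_2)           # joint absences are ignored

# Illustrative continuous measurements on two objects
d_euc = euclidean([1, 2, 3], [4, 6, 3])
d_man = manhattan([1, 2, 3], [4, 6, 3])
```

The first two are similarities (larger means more alike) and the last two are distances (larger means less alike), which matters for the methods that follow.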
Finding groups of similar objects • Hierarchical Cluster Analysis • Aim: to arrange the objects into homogeneous groups • Output: • Dendrogram showing how objects are joined together • Levels of similarity/distance at which groups are formed or divided • Primarily a descriptive technique • Interpretation includes identification of “how many groups?” • Agglomerative methods • Start with individual objects, group the two most similar together, re-calculate similarity between new group and other objects, and continue until all objects are in one group • Different rules (algorithms) for re-calculating similarities, resulting in different dendrograms
Simple example • Relative intensity of fluorescence spectrum at four different wavelengths • Calculate distances using the “Euclidean” metric • standardised by the mean absolute deviation • Illustrate HCA using Single Link (Nearest Neighbour) algorithm • Distance to new group is the minimum of the distances to the objects being grouped
Step 1 • Identify minimum distance • 1.153 between E and F • Join these objects into a group • Re-calculate all distances to this group • And repeat!
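The join-and-repeat procedure can be reproduced with SciPy's `linkage` function using the single-link rule. The fluorescence data are not given in the slides, so the five one-dimensional points below are hypothetical and no standardisation is applied; the sketch only illustrates the order in which groups are formed.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Five hypothetical objects measured on a single variable
X = np.array([[0.0], [1.0], [3.0], [7.0], [7.5]])

# Euclidean distances between every pair of objects (condensed form)
distances = pdist(X, metric="euclidean")

# Single link: distance to a new group is the minimum of the distances
# to the objects that were joined
Z = linkage(distances, method="single")

# Each row of Z records one join: the two groups merged, the distance
# at which they merged, and the size of the new group
print(Z)

# Cut the tree to obtain a chosen number of groups (here two)
groups = fcluster(Z, t=2, criterion="maxclust")
print(groups)
```

The third column of `Z` gives the levels at which groups form, which is what a dendrogram displays on its distance axis; `scipy.cluster.hierarchy.dendrogram(Z)` draws it when a plotting backend is available.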
Non-Hierarchical Clustering • Aim: to divide units into a number of mutually exclusive groups • Optimize some suitable criterion directly from the data matrix • Does not analyse the similarity matrix • Criteria include • Maximise the total Euclidean distance between groups • Minimise the determinant of the within-group variance-covariance matrix, pooled over groups • Repeat for different numbers of groups • Usually start with a large number of groups, and gradually reduce the number • Grouping is not hierarchical, i.e. best 3-group solution may not be best 2-group solution with one group divided into 2 sub-groups • Need a rule to determine the “right” number of groups • K-means clustering is the best-known example
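A minimal K-means sketch is given below. It uses the standard alternating scheme (assign each unit to the nearest group centre, then recompute the centres), which reduces the within-group sum of squares; that criterion, the function name `k_means` and the six example points are assumptions for illustration rather than the specific procedure behind the slide.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Basic K-means on an n-by-p data matrix; returns labels and centres."""
    rng = np.random.default_rng(seed)
    # Start from k distinct units chosen at random as initial centres
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Euclidean distance from every unit to every centre (n-by-k)
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each centre; keep the old one if a group is empty
        new_centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres

# Six hypothetical units forming two obvious groups
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centres = k_means(X, k=2)
print(labels)
```

Because the result can depend on the starting centres, the algorithm is usually run from several random starts and the solution with the best criterion value is kept.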
Analysing Association Data • Multidimensional Scaling (MDS) and Principal Co-ordinate Analysis (PCO) • Analyse the same matrix of similarities or distances to produce a multidimensional picture of the relationships between units • Generates an “ordination” or configuration for a set of objects • Matches the inter-point distances to the dissimilarities or distances • PCO works with similarities • Produces an analytical solution (metric scaling) • Matches configuration distances to the observed dissimilarities based on the sum of squared differences • MDS works with distances or dissimilarities • Produces an iterated solution (non-metric scaling) • Matches the configuration distances to the observed dissimilarities based on rank orders (monotonic regression)
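The analytical (metric scaling) solution can be sketched with an eigenvalue decomposition. The version below starts from a matrix of distances, so similarities would first need converting to distances; the function name and the four example points are hypothetical.

```python
import numpy as np

def principal_coordinates(D, n_dims=2):
    """Classical metric scaling of an n-by-n distance matrix."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    # Double-centre the squared distances
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    # Eigen-decomposition of the symmetric matrix B (ascending order)
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_dims]
    # Negative eigenvalues (non-Euclidean distances) are clipped to zero
    lam = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(lam)

# Four hypothetical locations at the corners of a 3-by-4 rectangle
pts = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)

coords = principal_coordinates(D, n_dims=2)
D_fit = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
```

Here the fitted inter-point distances `D_fit` reproduce `D` because the original points lie in two dimensions; the configuration itself is only recovered up to rotation, reflection and translation.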
Exploring patterns • Principal Component Analysis (PCA) • Aim: to identify the (combinations of) variables that explain the variability within a data set • Primarily a descriptive technique • Usually for quantitative variables • Starts with DATA matrix (p variables by n units) • Transforms original set of correlated variables into new set of orthogonal (uncorrelated) variables • Linear combinations of original variables • First principal component accounts for as much of the variability in the data as possible • Second principal component accounts for as much of the remaining variability as possible, and is orthogonal to the first • etc.
Matrix algebra • PCA best described in terms of matrix algebra • In common with almost all multivariate analysis methods • PCA is an eigenvalue decomposition of the matrix of associations between the variables • Produces two matrices • A diagonal matrix containing the eigenvalues • A square (p-by-p) matrix whose columns are the eigenvectors • Three possible matrices of associations can be used • Constructed from the original data matrix • Sum of Squares and Products Matrix (SSPM) • Variance-Covariance Matrix • Correlation Matrix • Different results from the PCA applied to each
PCA Output • Roots (eigenvalues) • How much of the variation is explained by each component • Expressed as a percentage of total • Indicates how many components are necessary • Loadings (eigenvectors) • How each original variable contributes to each principal component • Shows which variables are important • Scores • Values of each observation on each principal component
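The three outputs listed here (roots, loadings, scores) follow directly from an eigenvalue decomposition. The sketch below works from either the variance-covariance matrix or the correlation matrix; the function name `pca`, the centring of the data before computing scores, and the simulated data are assumptions for illustration.

```python
import numpy as np

def pca(X, use_correlation=False):
    """PCA of an n-by-p data matrix via eigen-decomposition."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                    # centre each variable
    if use_correlation:
        Xc = Xc / Xc.std(axis=0, ddof=1)       # standardise each variable
    S = np.cov(Xc, rowvar=False)               # p-by-p association matrix
    roots, loadings = np.linalg.eigh(S)        # ascending eigenvalues
    order = np.argsort(roots)[::-1]            # largest root first
    roots, loadings = roots[order], loadings[:, order]
    percent = 100.0 * roots / roots.sum()      # % of total variation
    scores = Xc @ loadings                     # n-by-p matrix of PC scores
    return roots, percent, loadings, scores

# Simulated data: four correlated variables on 40 units (not real data)
rng = np.random.default_rng(1)
base = rng.normal(size=(40, 1))
X = base * np.array([1.0, 0.8, 0.6, 0.2]) + 0.3 * rng.normal(size=(40, 4))

roots, percent, loadings, scores = pca(X)
print(np.round(percent, 1))
```

The variance of each column of `scores` equals the corresponding root, and the columns are uncorrelated with each other, which is the sense in which the new variables are orthogonal.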
Fluorescence example • Relative intensity of fluorescence spectrum at four different wavelengths
PCA output • Analysis based on the variance-covariance matrix • Variances of original variables are similar (~25% each) • PC1 accounts for almost 73% of total variability • First two PCs account for nearly 89% • Obtain PC Scores for each compound by multiplying observed values by coefficients (loadings) • View groupings against PCs
Example PCA plot • [Figure: compounds A to L plotted by their principal component scores]
Biplots • Graphical approach to present results from PCA (and a number of other multivariate methods) • Plots objects as points • As for example PCA plot • Plots variables as vectors • Supports interpretation of analysis • Shows those objects that are similar • Identifies variables that are highly correlated • Identifies variables that are particularly associated with groups of objects
Correspondence Analysis • Analogous to Principal Components Analysis • Appropriate for categorical variables rather than continuous variables • Also known as “reciprocal averaging” • Finds an ordination (ordering) of each categorical variable that maximises the correlation between the two categorical variables • Used in the analysis of ecological community data • e.g. counts of the numbers of different species in different environments – identifies the species associated with particular environments • Extension to Canonical Correspondence Analysis • Incorporates the influence of one or more explanatory variables (such as environmental variables) in finding the ordination • Enables sites to be ranked along each environmental variable, taking account of correlations between species and environmental variables
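Simple correspondence analysis of a two-way table of counts can be sketched with a singular value decomposition of the standardised residuals from independence. This is one common formulation rather than the reciprocal averaging algorithm itself; the function name and the species-by-environment counts are hypothetical.

```python
import numpy as np

def correspondence_analysis(N):
    """Simple CA of a two-way table of counts (no empty rows or columns)."""
    N = np.asarray(N, dtype=float)
    P = N / N.sum()                       # correspondence matrix
    r = P.sum(axis=1)                     # row masses
    c = P.sum(axis=0)                     # column masses
    # Standardised residuals from the independence model
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    # Principal coordinates (ordination scores) for rows and columns
    row_scores = (U / np.sqrt(r)[:, None]) * sv
    col_scores = (Vt.T / np.sqrt(c)[:, None]) * sv
    inertia = sv ** 2                     # variation explained by each axis
    return row_scores, col_scores, inertia

# Hypothetical counts: 3 species (rows) in 3 environments (columns)
table = np.array([[20, 5, 2],
                  [6, 15, 4],
                  [1, 7, 18]])
row_scores, col_scores, inertia = correspondence_analysis(table)
print(np.round(100 * inertia / inertia.sum(), 1))
```

The total inertia equals the Pearson chi-squared statistic for the table divided by the grand total, so the axes partition the departure from independence between rows and columns.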
Factor Analysis • Similar approach to PCA, predominantly used in the social sciences • Principle: that correlations between variables can be explained by a number of common factors, plus a number of specific factors (one for each original variable) • Focused on explaining the covariance between variables • While PCA is focused on explaining the maximum amount of variance in the data • Observed variables are assumed to be linear combinations of hypothetical underlying (and un-observable) factors • Creates an underlying causal model • Original application to the derivation of factors underlying intelligence • Can be used in a hypothesis testing mode (confirmatory factor analysis)
Canonical Variate Analysis (CVA) • Similar to PCA in working on the data matrix • Works on within-group SSPM and between-group SSPM • Finds combinations of the original variables to maximise the ratio of between-group variance to within-group variance • Groups are separated as much as possible whilst keeping each group as compact as possible • Combinations can be used to discriminate between the groups • For g groups – at most g-1 combinations of variables to discriminate between them • Need at least g-1 original variables • For a new observation, use “discriminant” functions to identify which group it is most likely to belong to • Discriminant Analysis
CVA Output • Latent roots (eigenvalues) • How much variation is explained by each component • Expressed as a percentage of total • Root greater than 1 indicates that there is discrimination between groups on that canonical variate • Explicit test for dimensionality • Latent vectors (loadings) • Contributions of each original variable to new canonical variates • Canonical Variate Means • Mean values for each group on the canonical variates • Adjustment terms so that the centroid of group means is at the origin • Produce plots showing each group mean with a 95% confidence interval • Construct confidence intervals for the “population”
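The latent roots and vectors can be obtained as a generalised eigenproblem on the between-group and within-group matrices. The sketch below scales both matrices by their degrees of freedom, which is one possible convention and affects the size of the roots; the function name and the simulated three-group data are assumptions for illustration.

```python
import numpy as np
from scipy.linalg import eigh

def canonical_variates(X, groups):
    """CVA of an n-by-p data matrix with a group label for each row."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    n, p = X.shape
    g = len(labels)
    grand_mean = X.mean(axis=0)
    W = np.zeros((p, p))   # within-group sums of squares and products
    B = np.zeros((p, p))   # between-group sums of squares and products
    for lab in labels:
        Xg = X[groups == lab]
        mg = Xg.mean(axis=0)
        W += (Xg - mg).T @ (Xg - mg)
        d = (mg - grand_mean)[:, None]
        B += len(Xg) * (d @ d.T)
    W /= (n - g)           # pooled within-group variance-covariance
    B /= (g - 1)           # between-group variance-covariance
    # Solve B v = root * W v; vectors are scaled so that v' W v = 1
    roots, vectors = eigh(B, W)
    keep = min(p, g - 1)   # at most g-1 canonical variates
    order = np.argsort(roots)[::-1][:keep]
    roots, vectors = roots[order], vectors[:, order]
    # Centre on the grand mean (with equal group sizes this is also
    # the centroid of the group means)
    scores = (X - grand_mean) @ vectors
    return roots, vectors, scores

# Simulated data: three groups of 20 units on four variables
rng = np.random.default_rng(2)
means = np.array([[0, 0, 0, 0], [3, 1, 0, 0], [0, 2, 3, 0]], dtype=float)
X = np.vstack([rng.normal(m, 1.0, size=(20, 4)) for m in means])
groups = np.repeat([0, 1, 2], 20)

roots, vectors, scores = canonical_variates(X, groups)
cv_means = np.array([scores[groups == k].mean(axis=0) for k in range(3)])
```

With this scaling the pooled within-group variance of each canonical variate is 1, so distances between the group means in `cv_means` are expressed in within-group standard deviations.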
Example – Fisher “iris” data • Measurements of sepal length, sepal width, petal length and petal width for 50 plants of each of three iris species • Plot shows separation between the three species • Loadings (coefficients) indicate which variables are used to separate the groups
Multivariate ANOVA and Regression • Generalisation of univariate Analysis of Variance • Analyse multiple variables of data from a designed experiment • Assess for the effects of different factors on the whole set of variables • Takes account of the covariance between variables • Interpretation based on matrices of variances and covariances • Similar approach for multivariate regression • Relates a set of correlated response variables to one or more explanatory variables • PC Regression • PLS Regression
Procrustes Rotation • Provides a way of comparing two (or more) multidimensional configurations of a set of units • Procrustes = Greek inn-keeper who fitted his guests to one size of bed by chopping bits off or stretching bits! • Takes one configuration and fits the second configuration to it • Combinations of rotation, reflection and scaling each axis • Measure how much manipulation is needed to make the configurations similar • Measure how similar it is possible to make them • Generalise to more than two configurations
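SciPy provides a two-configuration version as `scipy.spatial.procrustes`. Note that it standardises both configurations (centring and scaling to unit size) and uses a single overall scaling factor rather than scaling each axis separately, so it is a sketch of the idea rather than the exact procedure described above. The configurations are hypothetical.

```python
import numpy as np
from scipy.spatial import procrustes

# A hypothetical configuration of five units in two dimensions
config_a = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 2.0],
                     [0.0, 2.0], [0.5, 3.0]])

# Second configuration: the first one rotated, rescaled and shifted
theta = np.deg2rad(40.0)
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta), np.cos(theta)]])
config_b = 2.5 * config_a @ R.T + np.array([3.0, -1.0])

# Fit config_b to config_a; disparity is the residual sum of squares
# between the two standardised configurations after fitting
fit_a, fit_b, disparity = procrustes(config_a, config_b)

# A noisy third configuration cannot be matched exactly
rng = np.random.default_rng(3)
config_c = config_b + rng.normal(0.0, 0.3, size=config_b.shape)
_, _, disparity_noisy = procrustes(config_a, config_c)

print(disparity, disparity_noisy)
```

The disparity is essentially zero when one configuration is an exact rotation, reflection, scaling and shift of the other, and grows as the configurations become harder to match.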