Exploratory Data Analysis and Multivariate Strategies Andrew Mead (School of Life Sciences)
Multi-… approaches in statistics • Multiple comparison tests • Multiple testing adjustments • Methods for adjusting the significance levels when doing a large number of tests (comparisons between treatments) within a single analysis • Multiple regression analysis • Statistical model building with more than one explanatory variable • Multi-factor analysis of variance • Analysis of designed experiments with more than one explanatory factor • Multivariate Analysis • Methods to summarise and explore the relationships among multiple response variables, and/or to assess differences among “treatments” based on multiple response variables
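As a concrete illustration of a multiple testing adjustment, the Bonferroni correction simply multiplies each raw p-value by the number of tests performed (capping at 1). This is a minimal sketch with hypothetical p-values, not data from the lecture:

```python
# Bonferroni adjustment: multiply each raw p-value by the number of tests m
# (capped at 1.0) so the family-wise error rate stays at the nominal level.
raw_p = [0.001, 0.012, 0.03, 0.21]   # hypothetical p-values from m = 4 tests
m = len(raw_p)
adjusted = [min(p * m, 1.0) for p in raw_p]
print(adjusted)  # [0.004, 0.048, 0.12, 0.84]
```

With a 5% family-wise level, only the first two tests remain significant after adjustment; more powerful alternatives (e.g. Holm's step-down procedure) follow the same multiply-and-compare pattern.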
Multivariate data • Responses by a number of individuals to a range of different (related) questions in a (social studies) survey • Counts of different species in a range of locations in an ecological survey • Measurements of a range of traits of individual people, animals, plants, products, … • Different medical measurements made on a group of patients • Measurements of gene expression / protein expression, metabolite expression in biological samples • Counts of sequence consensus matches from microbial samples • Data on the “distances” between a number of objects
Multivariate questions • Identifying groups of objects with similar (and different) responses • e.g. UK landscape areas with similar percentage composition of different land covers (woodland, arable, urban, …)? • Identifying the particular measurements that contribute to the variability among a set of objects • e.g. weed species that are present under different long-term herbicide strategies? • Identifying the particular measurements that contribute to differences between groups of objects • e.g. which genes discriminate between patients with and without some form of cancer? • Identifying the particular measurements that explain variation in some over-arching response variable • e.g. which traits of predatory insects influence the predation rate of aphids?
Exploratory data analysis • Summary statistics / graphical summaries • Variability for each variable/measurement • Groups of observations • both pre-determined – to find potential differences • and to be identified – based on each individual variable • Scatter plots / correlations • Associations between pairs of variables
Univariate analysis • For each individual variable: • Hypothesis tests • Choice depends on the question to be answered • Analysis of variance • For variables measured in designed experiments • Regression analysis • To build statistical models to describe how one response variable depends on one (or more) explanatory variables • Generalised Linear Models (GLMs) • For data where standard assumptions do not hold! • Time Series Analysis • …
Multivariate analysis • For a set of “correlated” variables: • Assess relationships between variables • Consider the effects of “treatments” on these relationships • Consider how a “response” depends on these relationships • Multivariate methods concerned with “data reduction” • Summarise the correlations between variables • Produce a smaller set of (uncorrelated) variables containing the important information • For a set of “related” objects • Identify groups of similar objects • Identify differences between groups of similar objects • And what makes the objects similar!
Simple graphical summaries 1 • For compositional data • e.g. numbers of onion bulbs in different marketable size grades • Present as a stacked bar-chart • As raw data • As percentages of the total
Simple graphical summaries 2 • More general data • e.g. different measurements on a set of plants • Scatter plots for each pair of variables • Present in a matrix • Calculate linear correlation coefficient for each pair of variables
Two forms of data matrix • The DATA matrix • p variables for each of n samples (observations) • Presented in a rectangular matrix • n rows and p columns • The association matrix • Distance, similarity or dissimilarity • Between every pair of variables or every pair of samples • Symmetric square matrix • n-by-n – between samples • p-by-p – between variables • just show the lower triangle • Turns multivariate data into univariate data?
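The two forms can be sketched in a few lines of numpy. Here a hypothetical n-by-p DATA matrix (4 samples, 3 variables) is turned into an n-by-n association matrix of Euclidean distances between samples; note the symmetry and the zero diagonal, which is why only the lower triangle is usually shown:

```python
import numpy as np

# Hypothetical DATA matrix: n = 4 samples (rows) by p = 3 variables (columns).
X = np.array([[1.0, 2.0, 0.5],
              [1.2, 1.8, 0.4],
              [3.0, 0.5, 2.0],
              [2.8, 0.7, 1.9]])

# n-by-n association matrix: Euclidean distance between every pair of samples.
diff = X[:, None, :] - X[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=2))

print(D.shape)          # (4, 4) -- symmetric square matrix, zero diagonal
print(np.tril(D, -1))   # the lower triangle, as usually reported
```

Swapping the roles of rows and columns (distances between the p variables instead) would give the p-by-p form.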
Analysing Association Data • Start with associations • Distances between locations on a map • Psychometric (sensory) similarities between products • Construct associations from data • Depends on the types of data • Binary (presence/absence) data • Simple matching coefficient; Jaccard coefficient; … • Continuous data • Euclidean distance; Manhattan distance; … • Similarities or Dissimilarities/Distances
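Two of the coefficients named above can be written out directly. This is a minimal sketch with hypothetical vectors (the function names are illustrative):

```python
import numpy as np

def jaccard_similarity(a, b):
    """Jaccard coefficient for binary presence/absence data:
    shared presences / positions where at least one is present
    (joint absences are ignored, unlike the simple matching coefficient)."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    both = np.logical_and(a, b).sum()
    either = np.logical_or(a, b).sum()
    return both / either if either else 1.0

def manhattan_distance(x, y):
    """Manhattan (city-block) distance for continuous data:
    sum of absolute differences across variables."""
    return np.abs(np.asarray(x, float) - np.asarray(y, float)).sum()

print(jaccard_similarity([1, 1, 0, 0], [1, 0, 1, 0]))  # 1 shared / 3 present = 0.333...
print(manhattan_distance([1.0, 2.0], [4.0, 0.0]))      # |1-4| + |2-0| = 5.0
```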
Finding groups of similar objects • Hierarchical Cluster Analysis • Aim: to arrange the objects into homogenous groups • Output: • Dendrogram showing how objects are joined together • Levels of similarity/distance at which groups are formed or divided • Primarily a descriptive technique • Interpretation includes identification of “how many groups?” • Agglomerative methods • Start with individual objects, group the two most similar together, re-calculate similarity between new group and other objects, and continue until all objects in one group • Different rules (algorithms) for re-calculating similarities, resulting in different dendrograms
Simple example • Relative intensity of fluorescence spectrum at four different wavelengths • Calculate distances using the “Euclidean” metric • standardised by the mean absolute deviation • Illustrate HCA using Single Link (Nearest Neighbour) algorithm • Distance to new group is the minimum of the distances to the objects being grouped
Step 1 • Identify minimum distance • 1.153 between E and F • Join these objects into a group • Re-calculate all distances to this group • And repeat!
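One agglomeration step of the single-link algorithm can be sketched as follows. The distance matrix here is hypothetical (four objects A–D, not the fluorescence data from the lecture): find the smallest off-diagonal distance, join that pair, and set the distance from the new group to each remaining object to the minimum of the two merged rows:

```python
import numpy as np

# Hypothetical distance matrix for four objects A-D (not the lecture's data).
labels = ["A", "B", "C", "D"]
D = np.array([[ 0.0, 2.0, 6.0, 10.0],
              [ 2.0, 0.0, 5.0,  9.0],
              [ 6.0, 5.0, 0.0,  4.0],
              [10.0, 9.0, 4.0,  0.0]])

def single_link_step(D, labels):
    """One agglomeration step: join the closest pair, then set the distance
    from the new group to every other object to the MINIMUM of the two
    merged rows (the single-link / nearest-neighbour rule)."""
    n = len(labels)
    iu = np.triu_indices(n, k=1)              # off-diagonal upper triangle
    k = np.argmin(D[iu])                      # position of the minimum distance
    i, j = iu[0][k], iu[1][k]                 # the pair to be joined
    merged = np.minimum(D[i], D[j])           # single-link re-calculated distances
    keep = [r for r in range(n) if r not in (i, j)]
    newD = np.zeros((len(keep) + 1, len(keep) + 1))
    newD[:-1, :-1] = D[np.ix_(keep, keep)]
    newD[-1, :-1] = merged[keep]
    newD[:-1, -1] = merged[keep]
    new_labels = [labels[r] for r in keep] + ["(%s,%s)" % (labels[i], labels[j])]
    return newD, new_labels

D, labels = single_link_step(D, labels)
print(labels)  # ['C', 'D', '(A,B)']
```

Repeating the step until one group remains, and recording the distance at each join, gives exactly the information drawn in the dendrogram. Replacing `np.minimum` with `np.maximum` or a weighted mean gives the complete-link and average-link variants mentioned above.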
Non-Hierarchical Clustering • Aim: to divide units into a number of mutually exclusive groups • Optimize some suitable criterion directly from the data matrix • Does not analyse the similarity matrix • Criteria include • Maximise the total Euclidean distance between groups • Minimise the determinant of the within-group variance-covariance matrix, pooled over groups • Repeat for different numbers of groups • Usually start with a large number of groups, and gradually reduce the number • Grouping is not hierarchical, i.e. best 3-group solution may not be best 2-group solution with one group divided into 2 sub-groups • Need a rule to determine the “right” number of groups • Also known as K-means clustering
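The commonest non-hierarchical method, K-means, alternates between assigning each unit to its nearest group centroid and moving each centroid to its group mean. A minimal sketch of this (Lloyd's algorithm) on hypothetical 2-D data:

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Minimal Lloyd's algorithm: alternate between assigning each point to
    its nearest centroid and moving each centroid to its group mean."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]   # start from k data points
    for _ in range(n_iter):
        # assign each point to the nearest centroid (squared Euclidean distance)
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        groups = d.argmin(axis=1)
        # re-compute each centroid as the mean of its assigned points
        for g in range(k):
            if (groups == g).any():
                centroids[g] = X[groups == g].mean(axis=0)
    return groups, centroids

# Two well-separated hypothetical clusters.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
groups, _ = kmeans(X, k=2)
print(groups[:3], groups[3:])  # first three points share one label, last three the other
```

As the slide notes, the result depends on k, so in practice the fit is repeated for a range of group numbers and a rule (e.g. an elbow in the within-group sum of squares) is used to choose among them.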
Analysing Association Data • Multidimensional Scaling (MDS) and Principal Co-ordinate Analysis (PCO) • Analyse the same matrix of similarities or distances to produce a multidimensional picture of the relationships between units • Generates an “ordination” or configuration for a set of objects • Matches the inter-point distances to the dissimilarities or distances • PCO works with similarities • Produces an analytical solution (metric scaling) • Matches configuration distances to the observed dissimilarities based on the sum of squared differences • MDS works with distances or dissimilarities • Produces an iterated solution (non-metric scaling) • Matches the configuration distances to the observed dissimilarities based on rank orders (monotonic regression)
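The analytical (metric) solution can be sketched directly: classical scaling double-centres the squared-distance matrix and takes the leading eigenvectors, scaled by the square roots of their eigenvalues, as the configuration. A minimal numpy sketch, checked on distances computed from known 2-D points (the function name is illustrative):

```python
import numpy as np

def pco(D, k=2):
    """Principal Co-ordinate Analysis (classical/metric scaling):
    double-centre the squared-distance matrix, then use the k leading
    eigenvectors scaled by sqrt(eigenvalue) as the configuration."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centred inner-product matrix
    vals, vecs = np.linalg.eigh(B)             # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]         # take the k largest
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))

# Distances from known 2-D points should be recovered exactly (up to
# rotation/reflection), so the configuration's inter-point distances match.
pts = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
D = np.sqrt(((pts[:, None] - pts[None, :]) ** 2).sum(-1))
config = pco(D, k=2)
recovered = np.sqrt(((config[:, None] - config[None, :]) ** 2).sum(-1))
print(np.allclose(recovered, D))  # True
```

Non-metric MDS replaces this one-shot eigen-solution with an iterative search that only tries to preserve the rank order of the dissimilarities.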
Exploring patterns • Principal Component Analysis (PCA) • Aim: to identify the (combinations of) variables that explain the variability within a data set • Primarily a descriptive technique • Usually for quantitative variables • Starts with DATA matrix (p variables by n units) • Transforms original set of correlated variables into new set of orthogonal (independent) variables • Linear combinations of original variables • First principal component accounts for as much of the variability in the data as possible • Second principal component accounts for as much of the remaining variability as possible, and is orthogonal to the first • etc.
Matrix algebra • PCA best described in terms of matrix algebra • In common with almost all multivariate analysis methods • PCA is an eigenvalue decomposition of the matrix of associations between the variables • Produces two matrices • A diagonal matrix containing the eigenvalues • A square (p-by-p) matrix whose columns are the eigenvectors • Three possible matrices of associations can be used • Constructed from the original data matrix • Sum of Squares and Products Matrix (SSPM) • Variance-Covariance Matrix • Correlation Matrix • Different results from the PCA applied to each
PCA Output • Roots (eigenvalues) • How much of the variation is explained by each component • Expressed as a percentage of total • Indicates how many components are necessary • Loadings (eigenvectors) • How each original variable contributes to each principal component • Shows which variables are important • Scores • Values of each observation on each principal component
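All three outputs can be produced in a few lines of numpy, here via the variance-covariance matrix on hypothetical data (n = 6 observations, p = 3 correlated variables), a sketch rather than the lecture's fluorescence analysis:

```python
import numpy as np

# Hypothetical data: n = 6 observations (rows) by p = 3 variables (columns).
X = np.array([[2.5, 2.4, 1.0],
              [0.5, 0.7, 0.2],
              [2.2, 2.9, 1.1],
              [1.9, 2.2, 0.9],
              [3.1, 3.0, 1.4],
              [2.3, 2.7, 1.0]])

Xc = X - X.mean(axis=0)                 # centre each variable
S = np.cov(Xc, rowvar=False)            # p-by-p variance-covariance matrix
vals, vecs = np.linalg.eigh(S)          # eigenvalues come out ascending
order = np.argsort(vals)[::-1]          # sort components largest-first
roots, loadings = vals[order], vecs[:, order]

percent = 100 * roots / roots.sum()     # % of variation explained per component
scores = Xc @ loadings                  # value of each observation on each PC

print(percent.round(1))                 # PC1 dominates for this correlated data
print(scores.shape)                     # (6, 3): one score per observation per PC
```

The roots give the percentages, the columns of `loadings` show how each original variable contributes to each component, and `scores` places every observation on the new axes, exactly the three outputs listed above.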
Fluorescence example • Relative intensity of fluorescence spectrum at four different wavelengths
PCA output • Analysis based on the variance-covariance matrix • Variances of original variables are similar (~25% each) • PC1 accounts for almost 73% of total variability • First two PCs account for nearly 89% • Obtain PC Scores for each compound by multiplying observed values by coefficients (loadings) • View groupings against PCs
Example PCA plot • [Scatter of compounds A–L on the first two principal components]
Biplots • Graphical approach to present results from PCA (and a number of other multivariate methods) • Plots objects as points • As for example PCA plot • Plots variables as vectors • Supports interpretation of analysis • Shows those objects that are similar • Identifies variables that are highly correlated • Identifies variables that are particularly associated with groups of objects
Correspondence Analysis • Analogous to Principal Components Analysis • Appropriate for categorical variables rather than continuous variables • Also known as “reciprocal averaging” • Finds an ordination (ordering) of each categorical variable that maximises the correlation between the two categorical variables • Used in the analysis of ecological community data • e.g. counts of the numbers of different species in different environments – identifies the species associated with particular environments • Extension to Canonical Correspondence Analysis • Incorporates the influence of one or more explanatory variables (such as environmental variables) in finding the ordination • Enables sites to be ranked along each environmental variable, taking account of correlations between species and environmental variables
Factor Analysis • Similar approach to PCA, predominantly used in the social sciences • Principle: that correlations between variables can be explained by a number of common factors, plus a number of specific factors (one for each original variable) • Focused on explaining the covariance between variables • While PCA is focused on explaining the maximum amount of variance in the data • Observed variables are assumed to be linear combinations of hypothetical underlying (and un-observable) factors • Creates an underlying causal model • Original application to the derivation of factors underlying intelligence • Can be used in a hypothesis testing mode (confirmatory factor analysis)
Canonical Variate Analysis (CVA) • Similar to PCA in working on the data matrix • Works on within-group SSPM and between-group SSPM • Finds combinations of the original variables to maximise the ratio of between-group variance to within-group variance • Groups are separated as much as possible whilst keeping each group as compact as possible • Combinations can be used to discriminate between the groups • For g groups – at most g-1 combinations of variables to discriminate between them • Need at least g-1 original variables • For a new observation, use “discriminant” functions to identify which group it is most likely to belong to • Discriminant Analysis
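For the simplest case of g = 2 groups there is a single discriminant combination (g-1 = 1), and it has a closed form: the direction w = W⁻¹(m₁ − m₂), where W is the pooled within-group covariance and m₁, m₂ the group means, maximises the between-to-within variance ratio. A minimal sketch on hypothetical two-variable data:

```python
import numpy as np

# Hypothetical two-group, two-variable data (4 observations per group).
g1 = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [1.1, 2.1]])
g2 = np.array([[3.0, 4.0], [3.2, 3.8], [2.8, 4.2], [3.1, 4.1]])

m1, m2 = g1.mean(axis=0), g2.mean(axis=0)
W = np.cov(g1, rowvar=False) + np.cov(g2, rowvar=False)   # pooled within-group
w = np.linalg.solve(W, m1 - m2)                           # discriminant direction

def classify(x):
    """Assign a new observation to the group whose mean is closer
    along the discriminant direction."""
    proj = x @ w
    return 1 if abs(proj - m1 @ w) < abs(proj - m2 @ w) else 2

print(classify(np.array([1.0, 2.0])))  # 1
print(classify(np.array([3.0, 4.0])))  # 2
```

With more than two groups the same idea becomes a generalised eigenproblem on the between- and within-group SSPMs, yielding up to g-1 canonical variates.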
CVA Output • Output • Latent roots (eigenvalues) • How much variation is explained by each component • Expressed as a percentage of total • Root greater than 1 indicates that there is discrimination between groups on that canonical variate • Explicit test for dimensionality • Latent vectors (loadings) • Contributions of each original variable to new canonical variates • Canonical Variate Means • Mean values for each group on the canonical variates • Adjustment terms so that the centroid of group means is at the origin • Produce plots showing each group mean with a 95% confidence interval • Construct confidence intervals for the “population”
Example – Fisher “iris” data • Measurements of sepal length, sepal width, petal length and petal width for 50 plants of each of three iris species • Plot shows separation between the three species • Loadings (coefficients) indicate which variables are used to separate the groups
Multivariate ANOVA and Regression • Generalisation of univariate Analysis of Variance • Analyse multiple variables of data from a designed experiment • Assess the effects of different factors on the whole set of variables • Takes account of the covariance between variables • Interpretation based on matrices of variances and covariances • Similar approach for multivariate regression • Relates a set of correlated response variables to one or more explanatory variables • PC Regression • PLS Regression
Procrustes Rotation • Provides a way of comparing two (or more) multidimensional configurations of a set of units • Procrustes = Greek inn-keeper who fitted his guests to one size of bed by chopping bits off or stretching bits! • Takes one configuration and fits the second configuration to it • Combinations of rotation, reflection and scaling each axis • Measure how much manipulation is needed to make the configurations similar • Measure how similar it is possible to make them • Generalise to more than two configurations
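The fit of one configuration to another can be sketched with the standard SVD solution: after centring, the best rotation/reflection and uniform scaling are obtained from the singular value decomposition of the cross-product matrix, and the residual sum of squares measures how different the two configurations remain. A minimal sketch, checked on a configuration that is an exact rotation of the other:

```python
import numpy as np

def procrustes(X, Y):
    """Fit configuration Y onto configuration X: centre both, find the best
    rotation/reflection R (via SVD of Yc'Xc) and uniform scale, and return
    the fitted configuration plus the residual sum of squares."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    U, s, Vt = np.linalg.svd(Yc.T @ Xc)
    R = U @ Vt                              # best rotation/reflection
    scale = s.sum() / (Yc ** 2).sum()       # best uniform scaling
    fitted = scale * Yc @ R
    residual = ((Xc - fitted) ** 2).sum()   # how much disagreement remains
    return fitted, residual

# A configuration rotated by 90 degrees should fit its original exactly.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
theta = np.pi / 2
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
Y = X @ rot
_, residual = procrustes(X, Y)
print(round(residual, 10))  # 0.0
```

A residual of zero means the two configurations are identical up to rotation, reflection and scale; larger residuals quantify how much "chopping and stretching" was needed.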