Canadian Bioinformatics Workshops www.bioinformatics.ca
Module 4 Backgrounder in Statistical Methods Jeff Xia Informatics and Statistics for Metabolomics May 26-27, 2016
[Figure: yesterday — an NMR spectrum (chemical shift axis, ppm)]
[Figure: today — a PCA scores plot (PC1 vs. PC2) separating ANIT, Control, and PAP groups]
Learning Objectives Learn about summary statistics and normal distributions Learn about univariate statistics (t-tests and ANOVA) Learn about p-value calculation and hypothesis testing Learn about multivariate statistics (clustering, PCA and PLS-DA)
What is Statistics? Statistics is a way to get information from (usually big & complex) data: Data → Statistics → Information
Main Components • Input: metabolomics data • A matrix containing numerical values • Meta-data: data about data • Class labels, experimental factors • Output: useful information • Significant features • Clustering patterns • Rules (for prediction) • Models • …
Types of Data: the data matrix (X) and its meta-data (Y)
Quantitative Data • The data matrix • Continuous • Microarray intensities • Metabolite concentrations • Discrete • Read counts • Need to be treated with different statistical models
Categorical Data • Binary data • 0/1, Y/N, Case/Control • Nominal Data (> two groups) • Single = 1, Married = 2, Divorced = 3, Widowed = 4 • Orders are not important • Ordinal data • Low < Medium < High • Orders matter
Some Jargon (I) • Data are the observed values of a variable • A variable is some characteristic of a population or sample • e.g. a gene or a compound • The values of the variable are the range of possible values for a variable • e.g. measurements of gene expression, metabolite concentration • The dimensionality of data is based on the number of variables it contains • Omics data are usually called high-dimensional data
Some Jargon (II) • Univariate: measuring one variable per subject • Bivariate: measuring two variables per subject • Multivariate: measuring many variables per subject
From samples to population • So how do we know whether the effect observed in our sample was genuine? • We don’t • Instead we use p values to indicate our level of certainty that our results represent a genuine effect present in the whole population
P values • P value = the probability that the observed result was obtained by chance • i.e. when the null hypothesis is true • α level is set a priori (usually 0.05) • If p < α, we reject the null hypothesis and accept the experimental hypothesis • If p > α, we fail to reject the null hypothesis (we do not “accept” the null — we simply lack evidence against it) • More on this topic later
How do we describe the data? • Central Tendency – center of the data location • Mean, Median, Mode • Variability – the spread of the data • Variance • Standard deviation • Relative standing – distribution of data within the spread • Quantiles • Range • IQR (interquartile range)
Mean, Median, Mode • In a Normal Distribution the mean, mode and median are all equal • In skewed distributions they are unequal • Mean - average value, affected by extreme values in the distribution • Median - the “middlemost” value, usually half way between the mode and the mean • Mode - most common value
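A minimal sketch of these three measures using Python's standard library (the numbers are made up to show how a single extreme value affects each measure differently):

```python
import statistics

# A small right-skewed sample (hypothetical values)
data = [2, 3, 3, 4, 5, 9, 30]

mean = statistics.mean(data)      # pulled upward by the extreme value 30
median = statistics.median(data)  # the "middlemost" value
mode = statistics.mode(data)      # the most common value

# mean = 8, median = 4, mode = 3: the mean is the only
# measure dragged away from the bulk of the data by 30
```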
Mean, Median & Mode [Figure: skewed distribution with the mode, median, and mean marked]
Variance, SD and SEM • Variance (σ²): average of the squared distance to the center (mean) • Because it is squared, its units are not directly meaningful • Outliers contribute disproportionately • SD (σ): standard deviation • Square root of variance • “Standardized”, unit is meaningful • SEM: standard error of the mean • Quantifies the precision of the mean • Takes into account both the value of the SD and the sample size
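These three spread measures can be computed with the standard library (a sketch on hypothetical concentrations; SEM is simply SD divided by the square root of the sample size):

```python
import math
import statistics

# Hypothetical metabolite concentrations from one group
x = [4.0, 5.0, 6.0, 7.0, 8.0]

var = statistics.variance(x)   # sample variance (divides by n - 1)
sd = statistics.stdev(x)       # sample standard deviation
sem = sd / math.sqrt(len(x))   # standard error of the mean

# Unlike SD, SEM shrinks as the sample size grows,
# reflecting the increasing precision of the mean estimate
```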
Quantiles • The 1st quartile Q1 is the value for which 25% of the observations are smaller and 75% are larger • Q2 is the same as the median (50% are smaller, 50% are larger) • Q3 is the value for which only 25% of the observations are larger • Range is minimum to maximum
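The quartiles, IQR, and range can be obtained directly from the standard library (a minimal sketch on made-up data):

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

# statistics.quantiles splits the data into n intervals;
# n=4 returns the three quartile cut points Q1, Q2, Q3
q1, q2, q3 = statistics.quantiles(data, n=4)

iqr = q3 - q1                 # interquartile range
rng = max(data) - min(data)   # range

# Q2 coincides with the median, as described above
```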
Mean vs. Variance • Most univariate tests compare differences in the means, assuming equal variance
Univariate Statistics • Univariate means a single variable • If you measure a population using some single measure such as height, weight, test score, IQ, you are measuring a single variable • If you plot that single variable over the whole population, measuring the frequency that a given value is reached you will get the following:
A Bell Curve [Figure: histogram of heights — frequency vs. height] Also called a Gaussian or Normal Distribution
Features of a Normal Distribution • Symmetric distribution • Has an average or mean value (μ) at the centre • Has a characteristic width called the standard deviation (σ) • Most common type of distribution known [Figure: bell curve with μ = mean at the centre]
Normal Distribution • Almost any set of biological or physical measurements will display some variation, and these will almost always follow a Normal distribution • The larger the set of measurements, the more “normal” the curve • The minimum set of measurements needed to get a normal distribution is roughly 30-40
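This can be illustrated by simulation: drawing many values from a normal distribution, roughly 68% of them should land within one standard deviation of the mean (the mean and SD below are hypothetical "height" parameters):

```python
import random

random.seed(42)

# Simulated measurements from a normal distribution
# (hypothetical heights: mean 170, SD 10)
mu, sigma = 170.0, 10.0
sample = [random.gauss(mu, sigma) for _ in range(10_000)]

# For a normal distribution, about 68% of values fall
# within one standard deviation of the mean
within_1sd = sum(1 for v in sample if abs(v - mu) <= sigma) / len(sample)
```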
Some Equations • Mean: μ = Σxᵢ / N • Variance: σ² = Σ(xᵢ − μ)² / N • Standard Deviation: σ = √[ Σ(xᵢ − μ)² / N ]
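The three formulas above translate directly into code (note they are the population versions, dividing by N; sample statistics divide by N − 1 instead):

```python
import math

def pop_mean(xs):
    # μ = Σxᵢ / N
    return sum(xs) / len(xs)

def pop_variance(xs):
    # σ² = Σ(xᵢ − μ)² / N   (population formula, as on the slide)
    m = pop_mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def pop_stdev(xs):
    # σ = √(Σ(xᵢ − μ)² / N)
    return math.sqrt(pop_variance(xs))

x = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values
# pop_mean(x) → 5.0, pop_variance(x) → 4.0, pop_stdev(x) → 2.0
```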
Standard Deviation (σ) [Figure: normal curve with shaded regions covering 95% and 99% of the values]
Different Distributions [Figure: unimodal vs. bimodal distributions]
Skewed Distribution • Resembles an exponential or Poisson-like distribution • Lots of extreme values far from the mean or mode • Hard to do useful statistical tests with this type of distribution [Figure: right-skewed distribution with outliers in the tail]
Fixing a Skewed Distribution • A skewed or exponentially decaying distribution can be transformed into a “normal” or Gaussian distribution by applying a log transformation • This brings the outliers closer to the mean because it rescales the x-variable; it also makes the distribution much more Gaussian
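A minimal sketch of the rescaling effect, on made-up skewed intensities: values spanning a 100-fold range on the linear scale span only 2 units after a log10 transformation.

```python
import math

# A right-skewed set of intensities (hypothetical values)
skewed = [1.2, 1.5, 2.0, 2.4, 3.1, 4.0, 55.0, 120.0]

logged = [math.log10(v) for v in skewed]

# Linear scale: the data span 120 / 1.2 = a 100-fold range.
# Log scale: log10(120) - log10(1.2) = 2 units, so the extreme
# values sit much closer to the bulk of the data.
```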
Log Transformation [Figure: exp’t B on a linear scale (skewed distribution) vs. exp’t B log transformed (normal distribution)]
Centering, scaling, and transformations BMC Genomics. 2006; 7: 142
The Result [Figure: two height distributions plotted together] Are they different?
t-tests • Compare the mean between 2 samples/conditions • If 2 samples are taken from the same population, then they should have fairly similar means • If 2 means are statistically different, then the samples are likely drawn from 2 different populations, i.e. they really are different
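The t statistic behind this comparison can be sketched from the group means and variances (Welch's version, which does not assume equal variances; the data below are made up, and in practice a library routine would also return the p-value):

```python
import math
import statistics

def welch_t(a, b):
    # t = (mean_a - mean_b) / sqrt(s_a²/n_a + s_b²/n_b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        va / len(a) + vb / len(b)
    )

# Hypothetical measurements from two conditions
control = [5.1, 4.9, 5.3, 5.0, 5.2]
treated = [6.0, 6.2, 5.9, 6.1, 6.3]

t = welch_t(control, treated)
# A large |t| suggests the two means differ by more than
# their sampling variability alone would explain
```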
Paired t-tests • Compare paired measurements (e.g. the same subject before and after treatment) • Pairing reduces variance from between-subject differences
Another approach to group differences • Analysis Of VAriance (ANOVA) • Variances not means • Multiple groups (> 2) • H0 = no differences between groups • H1 = differences between groups • Based on F distribution or F-tests
Calculating F • F = the between group variance divided by the within group variance • the model variance/error variance • For F to be significant the between group variance should be considerably larger than the within group variance • A large value of F indicates relatively more difference between groups than within groups • Evidence against H0
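The F ratio described above can be computed directly: partition the total variation into between-group and within-group sums of squares, divide each by its degrees of freedom, and take the ratio (a sketch on made-up groups):

```python
import statistics

def f_statistic(groups):
    # One-way ANOVA: F = between-group variance / within-group variance
    all_vals = [v for g in groups for v in g]
    grand_mean = statistics.mean(all_vals)
    k = len(groups)       # number of groups
    n = len(all_vals)     # total observations

    # Between-group sum of squares (k - 1 degrees of freedom)
    ss_between = sum(
        len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups
    )
    # Within-group sum of squares (n - k degrees of freedom)
    ss_within = sum(
        (v - statistics.mean(g)) ** 2 for g in groups for v in g
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Three well-separated hypothetical groups → large F
groups = [[5.0, 5.2, 4.8], [6.1, 6.0, 5.9], [7.2, 6.9, 7.1]]
f = f_statistic(groups)
```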
What can be concluded from a significant ANOVA? • There is a significant difference between the groups • NOT where this difference lies • Finding exactly where the differences lie requires further statistical analyses • Post-hoc tests
Different types of ANOVA • One-way ANOVA • One factor with more than 2 levels • Factorial ANOVAs • More than 1 factor • Mixed design ANOVAs • Some factors independent, others related
Conclusions • T-tests assess if two group means differ significantly • Can compare two samples or one sample to a given value • ANOVAs compare more than two groups or more complicated scenarios • They assess mean differences through variance ratios (F-tests)
The p-value The p-value is the probability of seeing a result as extreme or more extreme than the result from a given sample, if the null hypothesis is true How do we calculate a p value?
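As a concrete illustration of "as extreme or more extreme": suppose the null hypothesis is that a coin is fair and we observe 15 heads in 20 flips. The one-sided p-value is the total binomial probability of 15 or more heads, which can be computed exactly (hypothetical numbers):

```python
from math import comb

n, observed = 20, 15

# Under H0 the number of heads X ~ Binomial(20, 0.5);
# the p-value sums P(X = k) over all outcomes at least as extreme
p_value = sum(comb(n, k) for k in range(observed, n + 1)) / 2**n

# p ≈ 0.021 < α = 0.05, so we would reject H0 here
```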
How do we compute a p value? [Figure: normal curve with shaded 95% and 99% regions; observed results falling in the tails yield small p values]
Non-normal distribution • What if we don’t know the distribution? • The only thing we know is that the data do not follow a normal distribution • Models based on the normal distribution perform poorly • Common approaches • Normalization • Non-parametric tests • Permutation (i.e. empirical p values)
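The permutation approach can be sketched without any distributional assumption: randomly relabel the pooled data many times and count how often the shuffled group difference is at least as extreme as the observed one (the groups below are made up; 10,000 permutations is an arbitrary but common choice):

```python
import random
import statistics

def permutation_p(a, b, n_perm=10_000, seed=0):
    # Empirical p-value: the fraction of random relabellings of the
    # pooled data whose mean difference is at least as extreme as
    # the one actually observed
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(
            statistics.mean(pooled[:len(a)]) - statistics.mean(pooled[len(a):])
        )
        if diff >= observed:
            hits += 1
    return hits / n_perm

# Two clearly separated hypothetical groups → small empirical p-value
p = permutation_p([1.0, 1.2, 0.9, 1.1], [3.0, 3.2, 2.9, 3.1])
```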