
Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops. www.bioinformatics.ca. Module 4: Backgrounder in Statistical Methods. Jeff Xia. Informatics and Statistics for Metabolomics, May 26-27, 2016.

feliciag

Presentation Transcript


  1. Canadian Bioinformatics Workshops www.bioinformatics.ca

  2. Module #: Title of Module 2

  3. Module 4 Backgrounder in Statistical Methods Jeff Xia Informatics and Statistics for Metabolomics May 26-27, 2016

  4. Yesterday [Figure: spectrum with a ppm axis running from 7 to 1]

  5. Today [Figure: scores plot of PC1 vs. PC2 showing the ANIT, Control and PAP groups]

  6. Learning Objectives • Learn about summary statistics and normal distributions • Learn about univariate statistics (t-tests and ANOVA) • Learn about p value calculation and hypothesis testing • Learn about multivariate statistics (clustering, PCA and PLS-DA)

  7. What is Statistics • A way to get information from (usually big & complex) data • [Diagram: Data → Statistics → Information]

  8. Main Components • Input: metabolomics data • A matrix containing numerical values • Meta-data: data about data • Class labels, experimental factors • Output: useful information • Significant features • Clustering patterns • Rules (for prediction) • Models • …...

  9. Types of Data [Diagram: Meta Data (Y) alongside the Data Matrix (X)]

  10. Quantitative Data • The data matrix • Continuous • Microarray intensities • Metabolite concentrations • Discrete • Read counts • Need to be treated with different statistical models

  11. Categorical Data • Binary data • 0/1, Y/N, Case/Control • Nominal Data (> two groups) • Single = 1, Married = 2, Divorced = 3, Widowed = 4 • Orders are not important • Ordinal data • Low < Medium < High • Orders matter

  12. Some Jargons (I) • Data are the observed values of a variable • A variable is some characteristic of a population or sample • A gene or a compound • The values of a variable are the range of possible values it can take • e.g. measurements of gene expression, metabolite concentration • The dimension of the data is the number of variables it contains • Omics data are usually called high-dimensional data

  13. Some Jargons (II) • Univariate: • Measuring one variable per subject • Bivariate: • Measuring two variables per subject • Multivariate • Measuring many variables per subject

  14. Key Concepts in Statistics

  15. Issues when making inferences

  16. From samples to population • So how do we know whether the effect observed in our sample was genuine? • We don’t • Instead we use p values to indicate our level of certainty that our results represent a genuine effect present in the whole population

  17. P values • P value = the probability of obtaining a result at least as extreme as the one observed, by chance alone • i.e. when the null hypothesis is true • α level is set a priori (usually 0.05) • If p < α level then we reject the null hypothesis in favour of the experimental hypothesis • If however, p ≥ α level then we fail to reject the null hypothesis (this is not proof that the null hypothesis is true) • More on this topic later

  18. Summary/Descriptive Statistics

  19. How do we describe the data? • Central Tendency – center of the data location • Mean, Median, Mode • Variability – the spread of the data • Variance • Standard deviation • Relative standing – distribution of data within the spread • Quantiles • Range • IQR (interquartile range)

  20. Mean, Median, Mode • In a Normal Distribution the mean, mode and median are all equal • In skewed distributions they are unequal • Mean - average value, affected by extreme values in the distribution • Median - the “middlemost” value; in a skewed distribution it usually lies between the mode and the mean • Mode - most common value
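
The three measures can be compared on a small skewed sample. This is a minimal sketch using Python's standard `statistics` module; the numbers are hypothetical and chosen only so that one extreme value pulls the mean away from the median and mode.

```python
import statistics

# Hypothetical right-skewed sample: the single extreme value (40)
# pulls the mean upward but leaves the median and mode untouched.
values = [2, 3, 3, 3, 4, 5, 6, 40]

mean_value = statistics.mean(values)      # average, sensitive to the extreme value
median_value = statistics.median(values)  # middle of the sorted data
mode_value = statistics.mode(values)      # most common value

print("mean:", mean_value)
print("median:", median_value)
print("mode:", mode_value)
```

In this sample the ordering is mode < median < mean, which is the pattern expected for a right-skewed distribution.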

  21. Mean, Median & Mode [Figure: distribution with the mode, median and mean marked]

  22. Variance, SD and SEM • Variance (σ²): average of the squared distance to the center (mean) • Expressed in squared units, so its magnitude is hard to interpret directly • Outliers contribute disproportionately • SD (σ): standard deviation • Square root of variance • Same units as the data, so its magnitude is meaningful • SEM: standard error of the mean • Quantifies the precision of the mean • Takes into account both the value of the SD and the sample size (SEM = SD / √n)
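
As a rough illustration of how these three quantities relate, the sketch below computes them for a hypothetical set of concentrations. It assumes the sample form of the variance (dividing by n − 1), which is what `statistics.variance` does; `statistics.pvariance` divides by N instead.

```python
import math
import statistics

# Hypothetical metabolite concentrations (arbitrary units).
concentrations = [4.0, 5.0, 6.0, 7.0, 8.0]
n = len(concentrations)

sample_variance = statistics.variance(concentrations)       # divides by n - 1
population_variance = statistics.pvariance(concentrations)  # divides by N

sd = math.sqrt(sample_variance)  # back in the original units
sem = sd / math.sqrt(n)          # shrinks as the sample size grows

print("sample variance:", sample_variance)
print("population variance:", population_variance)
print("SD:", sd)
print("SEM:", sem)
```

Because the SEM divides the SD by √n, quadrupling the sample size halves the SEM even when the spread of the data stays the same.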

  23. Quantiles • The first quartile Q1 is the value for which 25% of the observations are smaller and 75% are larger • Q2 is the same as the median (50% are smaller and 50% larger) • Q3 is the value for which 75% of the observations are smaller and only 25% are larger • Range is minimum to maximum
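
A small sketch of these quantities with the standard library follows. The data are hypothetical, and `statistics.quantiles` is used with its default "exclusive" method; other interpolation methods can give slightly different cut points on the same data.

```python
import statistics

# Hypothetical observations, already sorted for readability.
observations = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(observations, n=4)

iqr = q3 - q1                                      # spread of the middle 50%
data_range = max(observations) - min(observations)

print("Q1, Q2, Q3:", q1, q2, q3)
print("IQR:", iqr)
print("range:", data_range)
```

Unlike the range, the IQR ignores the most extreme observations, so it is less affected by outliers.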

  24. Mean vs. Variance • Most univariate tests compare the difference in the means, assuming equal variance

  25. Univariate Statistics • Univariate means a single variable • If you measure a population using some single measure such as height, weight, test score, IQ, you are measuring a single variable • If you plot that single variable over the whole population, measuring the frequency that a given value is reached you will get the following:

  26. A Bell Curve [Figure: bell-shaped histogram; x-axis: Height, y-axis: # of each] Also called a Gaussian or Normal Distribution

  27. Features of a Normal Distribution • Symmetric distribution • Has an average or mean value (μ) at the centre • Has a characteristic width called the standard deviation (σ) • Most common type of distribution known [Figure label: μ = mean]

  28. Normal Distribution • Almost any set of biological or physical measurements will display some variation, and this variation often follows an approximately Normal distribution • The larger the set of measurements, the more “normal” the curve • As a rule of thumb, about 30-40 measurements are needed before the distribution looks normal

  29. Some Equations • Mean: $\mu = \frac{\sum x_i}{N}$ • Variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$ • Standard Deviation: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$

  30. Standard Deviation (σ) [Figure: normal curve with the central 95% and 99% regions marked]
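
The 95% and 99% regions can be checked numerically. The sketch below assumes they are the central areas within about ±1.96σ and ±2.58σ of the mean, which is the usual convention but is not stated on the slide. It uses `statistics.NormalDist` from the standard library; the `coverage` helper is a name introduced here for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def coverage(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_1_96 = coverage(1.96)  # assumed cut-off for the 95% region
within_2_58 = coverage(2.58)  # assumed cut-off for the 99% region

print("within ±1.96σ:", round(within_1_96, 4))
print("within ±2.58σ:", round(within_2_58, 4))
```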

  31. Different Distributions [Figure: a unimodal and a bimodal distribution]

  32. Skewed Distribution • Resembles an exponential or Poisson-like distribution • Lots of extreme values far from mean or mode • Hard to do useful statistical tests with this type of distribution [Figure label: Outliers]

  33. Fixing a Skewed Distribution • A skewed distribution or exponentially decaying distribution can often be transformed into a “normal” or Gaussian distribution by applying a log transformation • Because the log rescales the x-variable, it brings the outliers closer to the mean • It also makes the distribution much more Gaussian
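
The effect of the log transformation can be seen on a toy example. In this sketch the data are hypothetical (values doubling at each step), and `skewness` is a helper defined here for illustration: it computes the average cubed standardized deviation, which is about 0 for a symmetric sample and positive for a right-skewed one.

```python
import math
import statistics

def skewness(values):
    """Average cubed standardized deviation (0 for a symmetric sample)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Hypothetical intensities spanning several orders of magnitude.
raw = [1, 2, 4, 8, 16, 32, 64, 128, 256]
logged = [math.log2(v) for v in raw]  # becomes 0, 1, 2, ..., 8

print("skewness before log:", skewness(raw))
print("skewness after log:", skewness(logged))
```

The raw values have a long right tail and a clearly positive skewness, while the log-transformed values are evenly spaced and symmetric.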

  34. Log Transformation [Figure: exp’t B on a linear scale (skewed distribution) vs. exp’t B log transformed (normal distribution)]

  35. Log Transformation (Real Data)

  36. Centering, scaling, and transformations BMC Genomics. 2006; 7: 142

  37. The Result [Figure: height distributions; x-axis: Height, y-axis: # of each] Are they different?

  38. The Result [Figure: height distributions; x-axis: Height, y-axis: # of each] Are they different?

  39. t-tests • Compare the means of 2 samples/conditions • If 2 samples are taken from the same population, then they should have fairly similar means • If 2 means are statistically different, then the samples are likely to be drawn from 2 different populations, i.e. they really are different
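
To make the comparison of two means concrete, here is a minimal sketch of the equal-variance (pooled) two-sample t statistic using only the standard library. The function name and the data are hypothetical. The p value would then come from a t distribution with the returned degrees of freedom, a step this sketch leaves out because the standard library has no t distribution.

```python
import math
import statistics

def pooled_t_statistic(a, b):
    """Two-sample t statistic assuming equal variances; returns (t, degrees of freedom)."""
    n_a, n_b = len(a), len(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    # Weighted average of the two sample variances.
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.fmean(a) - statistics.fmean(b)) / standard_error
    return t, n_a + n_b - 2

# Hypothetical measurements for two conditions.
control = [4.0, 5.0, 6.0, 7.0, 8.0]
treated = [7.0, 8.0, 9.0, 10.0, 11.0]

t_stat, dof = pooled_t_statistic(control, treated)
print("t =", t_stat, "with", dof, "degrees of freedom")
```

The statistic is the difference in means measured in units of its standard error, so a value far from zero suggests the two samples come from populations with different means.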

  40. Types of t-tests

  41. Paired t-tests • Reduce variance
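
One way to see why pairing reduces variance is to compare the spread of the raw measurements with the spread of the within-subject differences. The sketch below uses hypothetical before/after values for the same five subjects and computes the paired t statistic from the differences; it is an illustration under those assumptions, not the slide's own example.

```python
import math
import statistics

# Hypothetical measurements on the same five subjects.
before = [10.0, 20.0, 30.0, 40.0, 50.0]
after = [12.0, 21.0, 33.0, 42.0, 52.0]

# Pairing works on the per-subject differences, which removes the
# large subject-to-subject variability from the comparison.
differences = [a - b for a, b in zip(after, before)]

between_subject_variance = statistics.variance(before)
difference_variance = statistics.variance(differences)

mean_diff = statistics.fmean(differences)
sem_diff = statistics.stdev(differences) / math.sqrt(len(differences))
paired_t = mean_diff / sem_diff

print("variance across subjects:", between_subject_variance)
print("variance of differences:", difference_variance)
print("paired t:", paired_t)
```

Here the subjects differ a lot from each other, but each changes by a similar amount, so the differences have a much smaller variance than the raw values.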

  42. Another approach to group differences • Analysis Of VAriance (ANOVA) • Tests for differences in group means by comparing variances • Multiple groups (> 2) • H0 = no differences between groups • H1 = differences between groups • Based on F distribution or F-tests

  43. Calculating F • F = the between group variance divided by the within group variance • the model variance/error variance • For F to be significant the between group variance should be considerably larger than the within group variance • A large value of F indicates relatively more difference between groups than within groups • Evidence against H0
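
The ratio described above can be sketched for a one-way design as follows. The function name and the three groups are hypothetical, and the code only computes the F statistic; turning it into a p value would require the F distribution, which is outside the standard library.

```python
import statistics

def one_way_f(groups):
    """F = between-group mean square / within-group mean square."""
    all_values = [v for group in groups for v in group]
    grand_mean = statistics.fmean(all_values)
    k = len(groups)            # number of groups
    n_total = len(all_values)  # total number of observations

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (statistics.fmean(group) - grand_mean) ** 2
        for group in groups
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (v - statistics.fmean(group)) ** 2
        for group in groups
        for v in group
    )

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Hypothetical measurements for three groups.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = one_way_f(groups)
print("F =", f_stat)
```

In this example the group means (2, 3 and 6) differ far more than the values within any one group, which is why F comes out much larger than 1.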

  44. What can be concluded from a significant ANOVA? • There is a significant difference between the groups • NOT where this difference lies • Finding exactly where the differences lie requires further statistical analyses • Post-hoc tests

  45. Different types of ANOVA • One-way ANOVA • One factor with more than 2 levels • Factorial ANOVAs • More than 1 factor • Mixed design ANOVAs • Some factors independent, others related

  46. Conclusions • T-tests assess if two group means differ significantly • Can compare two samples or one sample to a given value • ANOVAs compare more than two groups or more complicated scenarios • They test for differences in means by comparing between-group and within-group variances

  47. Understanding P values

  48. The p-value • The p-value is the probability of seeing a result as extreme or more extreme than the result from a given sample, if the null hypothesis is true • How do we calculate a p value?

  49. How do we compute a p value [Figure: normal curve with the central 95% and 99% regions marked]
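
For a test statistic that follows a standard normal distribution under the null hypothesis (an assumption made for this sketch), the two-sided p value is the area in both tails beyond the observed value. The helper name `two_sided_p_value` is introduced here for illustration.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """Probability of a result at least as extreme as z, in either tail,
    assuming z is standard normal under the null hypothesis."""
    upper_tail = 1.0 - NormalDist().cdf(abs(z))
    return 2.0 * upper_tail

p_at_1_96 = two_sided_p_value(1.96)
p_at_2_58 = two_sided_p_value(2.58)

print("p at z = 1.96:", round(p_at_1_96, 4))
print("p at z = 2.58:", round(p_at_2_58, 4))
```

The tails left outside the central 95% and 99% regions correspond to p values of about 0.05 and 0.01.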

  50. Non-normal distribution • What if we don’t know the distribution? • The only thing we know is that the data do not follow a normal distribution • Models based on the normal distribution perform poorly • Common approaches • Normalization • Non-parametric tests • Permutation (i.e. empirical p values)
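
The permutation approach can be sketched as follows: shuffle the group labels many times and count how often the shuffled difference in means is at least as large as the observed one. The function name, the data, the number of permutations and the fixed seed are all assumptions made for this illustration.

```python
import random
import statistics

def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Empirical two-sided p value for a difference in means."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign the group labels at random
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        difference = abs(statistics.fmean(perm_a) - statistics.fmean(perm_b))
        if difference >= observed:
            hits += 1
    # Add 1 to both counts so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical measurements for two conditions.
control = [1.1, 0.9, 1.3, 1.0, 1.2, 0.8]
treated = [2.1, 2.4, 1.9, 2.6, 2.2, 2.0]

p_value = permutation_p_value(control, treated)
print("empirical p value:", p_value)
```

No distributional assumption is needed here: the null distribution is built from the data themselves, at the cost of more computation.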
