Learn about distributions, significance, univariate and multivariate statistics in bioinformatics through this informative and practical workshop module. Understand the application and importance of statistical methods in various biological and physical measurements.
Canadian Bioinformatics Workshops www.bioinformatics.ca
Module 6 Backgrounder in Statistical Methods David Wishart Informatics and Statistics for Metabolomics June 16-17, 2014
Learning Objectives • Learn about distributions and significance • Learn about univariate statistics (t-tests and ANOVA) • Learn about correlation and clustering • Learn about multivariate statistics (PCA and PLS-DA)
Statistics • There are three kinds of lies: lies, damned lies, and statistics - Benjamin Disraeli • 98% of all statistics are made up – Unknown • Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital - Aaron Levenstein • Statistics is the mathematics of impressions
Univariate Statistics • Univariate means a single variable • If you measure a population using some single measure such as height, weight, test score, IQ, you are measuring a single variable • If you plot that single variable over the whole population, measuring the frequency that a given value is reached you will get the following:
A Bell Curve [Figure: # of each versus Height] Also called a Gaussian or Normal Distribution
Features of a Normal Distribution • Symmetric Distribution • Has an average or mean value (m) at the centre • Has a characteristic width called the standard deviation (s) • Most common type of distribution known m = mean
Normal Distribution • Almost any set of biological or physical measurements will display some variation, and these will almost always follow a Normal distribution • The larger the set of measurements, the more “normal” the curve • A set of at least 30-40 measurements is usually needed to get a normal distribution
Some Equations
Mean: $m = \frac{\sum_i x_i}{N}$
Variance: $s^2 = \frac{\sum_i (x_i - m)^2}{N}$
Standard Deviation: $s = \sqrt{\frac{\sum_i (x_i - m)^2}{N}}$
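As a minimal sketch of the three formulas above, the following Python computes the mean, variance and standard deviation with N in the denominator, as on the slide. The function names and the `heights` values are made up for illustration.

```python
import math

def mean(xs):
    # m = sum(x_i) / N
    return sum(xs) / len(xs)

def variance(xs):
    # s^2 = sum((x_i - m)^2) / N
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def std_dev(xs):
    # s = square root of the variance
    return math.sqrt(variance(xs))

# Hypothetical heights in inches, for illustration only
heights = [62.0, 65.0, 67.0, 69.0, 72.0]
print(mean(heights), variance(heights), std_dev(heights))
```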
Significance • Based on the Normal Distribution, the probability that something is >1 SD away (larger or smaller) from the mean is 32% • Based on the Normal Distribution, the probability that something is >2 SD away (larger or smaller) from the mean is 5% • Based on the Normal Distribution, the probability that something is >3 SD away (larger or smaller) from the mean is 0.3%
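The three percentages above can be checked with the complementary error function from Python's standard library. This is a small sketch; the helper name `two_sided_tail` is an assumption of this example.

```python
import math

def two_sided_tail(k):
    """Probability that a normally distributed value lies more than
    k standard deviations from the mean (either side)."""
    return math.erfc(k / math.sqrt(2))

for k in (1, 2, 3):
    # Prints the tail probability as a percentage for 1, 2 and 3 SD
    print(k, "SD:", round(100 * two_sided_tail(k), 2), "%")
```

The values come out close to the rounded 32%, 5% and 0.3% quoted on the slide.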
Significance • In a test with a class of 400 students, if you score the average you typically receive a “C” • In a test with a class of 400 students, if you score 1 SD above the average you typically receive a “B” • In a test with a class of 400 students, if you score 2 SD above the average you typically receive an “A”
The P-value • The p-value is the probability of obtaining a test statistic (a score, a set of events, a height) at least as extreme as the one that was actually observed, assuming the null hypothesis is true • One "rejects the null hypothesis" when the p-value is less than the significance level α, which is often 0.05 or 0.01 • When the null hypothesis is rejected, the result is said to be statistically significant
P-value • If the average height of an adult (M+F) human is 5’ 7” and the standard deviation is 5”, what is the probability of finding someone who is more than 6’ 10”? • If you choose an α of 0.05 is a 6’ 11” individual a member of the human species? • If you choose an α of 0.01 is a 6’ 11” individual a member of the human species?
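One way to work through the height questions is to assume heights are normally distributed and compute a one-sided (upper) tail probability. The sketch below does that with the standard library; the function and variable names are assumptions of this example.

```python
import math

def upper_tail(x, mean, sd):
    """P(X > x) for a normal distribution with the given mean and SD."""
    z = (x - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))

mean_in = 5 * 12 + 7   # 5' 7" = 67 inches
sd_in = 5              # 5 inches

p_6ft10 = upper_tail(6 * 12 + 10, mean_in, sd_in)  # 3 SD above the mean
p_6ft11 = upper_tail(6 * 12 + 11, mean_in, sd_in)  # 3.2 SD above the mean

print(p_6ft10, p_6ft11)
```

Under these assumptions both probabilities fall below 0.05 and 0.01, so the null hypothesis would be rejected at either level even though such tall people do exist.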
P-value • If you flip a coin 20 times and the coin turns up heads 14/20 times, the probability of getting 14 or more heads with a fair coin is 60,460/1,048,576 = 0.058 • If you choose an α of 0.05 is this coin a fair coin? • If you choose an α of 0.10 is this coin a fair coin?
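The coin-flip probability can be reproduced by summing binomial terms for 14 through 20 heads. This is a sketch using `math.comb` (Python 3.8+); the helper name `prob_at_least` is an assumption of this example.

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """Probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# One-sided p-value for seeing 14 or more heads in 20 flips of a fair coin
p_value = prob_at_least(14, 20)
print(round(p_value, 4))  # 0.0577
```

The result sits between 0.05 and 0.10, which is why the two choices of α on the slide lead to different conclusions.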
Mean, Median & Mode Mode Median Mean
Mean, Median, Mode • In a Normal Distribution the mean, mode and median are all equal • In skewed distributions they are unequal • Mean - average value, affected by extreme values in the distribution • Median - the “middlemost” value, usually lying between the mode and the mean • Mode - most common value
Different Distributions Unimodal Bimodal
Other Distributions • Binomial Distribution • Poisson Distribution • Extreme Value Distribution • Skewed or Exponential Distribution
Binomial Distribution
1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1
$P(x) = (p + q)^n$
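The triangle of coefficients and the expansion of (p + q)^n are linked: each term of the expansion is the probability of a given number of successes. A minimal sketch, with hypothetical function names, assuming q = 1 - p:

```python
from math import comb

def pascal_row(n):
    """Binomial coefficients for the expansion of (p + q)^n."""
    return [comb(n, k) for k in range(n + 1)]

def binomial_pmf(n, p):
    """Terms of (p + q)^n with q = 1 - p: P(k successes in n trials)."""
    q = 1 - p
    return [comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]

print(pascal_row(5))         # [1, 5, 10, 10, 5, 1]
print(binomial_pmf(5, 0.5))  # six probabilities that sum to 1
```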
Poisson Distribution [Figure: P(x) (proportion of samples) versus x, with curves for m = 0.1, 1, 2, 3 and 10]
Extreme Value Distribution • Arises from sampling the extreme end of a normal distribution • A distribution which is “skewed” due to its selective sampling • Skew can be either right or left Gaussian Distribution
Skewed Distribution • Resembles an exponential or Poisson-like distribution • Lots of extreme values far from mean or mode • Hard to do useful statistical tests with this type of distribution Outliers
Fixing a Skewed Distribution • A skewed distribution or exponentially decaying distribution can often be transformed into a more “normal” or Gaussian distribution by applying a log transformation • Because it rescales the x-variable, this brings the outliers a little closer to the mean and makes the distribution much more Gaussian
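The effect of a log transformation can be illustrated on simulated data. The sketch below assumes log-normally distributed values (a convenient stand-in for skewed measurements, not data from the workshop) and compares the skewness before and after taking logs. The `skewness` helper is written here for the example.

```python
import math
import random
import statistics

def skewness(xs):
    """Population skewness: mean of the cubed standardized deviations."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

random.seed(0)  # fixed seed so the illustration is repeatable
raw = [random.lognormvariate(0.0, 1.0) for _ in range(5000)]  # right-skewed
logged = [math.log(x) for x in raw]                           # log transformed

print("skewness before:", skewness(raw))
print("skewness after: ", skewness(logged))
```

A skewness near zero after the transform indicates a roughly symmetric, more Gaussian shape. The log transform only applies to positive values.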
Log Transformation [Figure: exp’t B on a linear scale (skewed distribution) beside exp’t B log transformed (normal distribution)]
Distinguishing 2 Populations Normals Leprechauns
The Result # of each Height Are they different?
Student’s t-Test • Also called the t-Test • Used to determine if 2 populations are different • Formally allows you to calculate the probability (p-value) of seeing a difference between 2 sample means at least as large as the one observed if the 2 population means were actually the same • If the t-Test gives you a p=0.4, and the α is 0.05, then there is no significant evidence that the 2 populations are different • If the t-Test gives you a p=0.04, and the α is 0.05, then the 2 populations are significantly different • Paired and unpaired t-Tests are available: paired is used for “before & after” expts., while unpaired is for 2 randomly chosen (independent) samples
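A minimal sketch of the unpaired Student's t-Test statistic, using only the standard library. It assumes equal variances in the two groups (the classic pooled form). The height values are invented for illustration, and the critical value 2.101 is the two-sided 5% cutoff from a t table for 18 degrees of freedom. A statistics package would normally report the p-value directly.

```python
import math
import statistics

def pooled_t(a, b):
    """Unpaired Student's t statistic (pooled variance) and degrees of freedom."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    diff = statistics.fmean(a) - statistics.fmean(b)
    t = diff / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2

# Hypothetical heights in inches, for illustration only
normals = [66, 68, 70, 65, 69, 67, 71, 64, 68, 67]
leprechauns = [36, 38, 35, 40, 37, 39, 36, 38, 37, 34]

t, df = pooled_t(normals, leprechauns)
T_CRIT_DF18 = 2.101  # two-sided critical value for alpha = 0.05, df = 18

print("t =", t, "df =", df)
print("significantly different at 0.05:", abs(t) > T_CRIT_DF18)
```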
Student’s t-Test • A t-Test can also be used to determine whether 2 clusters are different if the clusters follow a normal distribution Variable 1 Variable 2
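Mann-Whitney U-Test • Also called the Wilcoxon Rank Sum Test • Used to determine if 2 non-normally distributed populations are different • More robust to outliers and non-normal data than the t-test, though somewhat less powerful when the data really are normal • Formally allows you to calculate the probability (p-value) of seeing a difference in ranks at least as large as the one observed if the 2 populations had the same distribution (often summarized as having the same median) • If the U-Test gives you a p=0.4, and the α is 0.05, then there is no significant evidence that the 2 populations are different • If the U-Test gives you a p=0.04, and the α is 0.05, then the 2 populations are significantly different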
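The U statistic itself is easy to compute: it counts, over all pairs, how often a value from one sample exceeds a value from the other. The sketch below uses this pairwise-count definition (equivalent to the rank-sum form) and stops at the statistic. Converting U to a p-value would use tables or a statistics library. The function name is an assumption of this example.

```python
def mann_whitney_u(a, b):
    """U statistic for sample a: number of pairs (x in a, y in b)
    with x > y, counting ties as one half."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

# Completely separated samples give the most extreme values of U
low = [1, 2, 3]
high = [4, 5, 6]
print(mann_whitney_u(low, high))   # 0.0
print(mann_whitney_u(high, low))   # 9.0
```

The two U values always add up to the product of the sample sizes, so a U near 0 or near that product suggests the populations differ.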
Distinguishing 3+ Populations Normals Leprechauns Elves
The Result # of each Height Are they different?
ANOVA • Also called Analysis of Variance • Used to determine if 3 or more populations are different; it is a generalization of the t-Test • Formally ANOVA provides a statistical test (by comparing the variance between groups with the variance within groups) of whether or not the means of several groups are all equal • Uses an F-statistic to test for significance • 1-way, 2-way, 3-way and n-way ANOVAs exist; the most common is 1-way, which only tests whether any of the 3+ populations is different, not which pair is different
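A minimal sketch of the 1-way ANOVA F-statistic, written with the standard library only. It compares the variability between group means with the variability within groups. The function name and the toy groups are assumptions of this example, and the p-value lookup (from an F table or a statistics package) is left out.

```python
import statistics

def one_way_f(groups):
    """Return (F, df_between, df_within) for a 1-way ANOVA."""
    k = len(groups)
    all_values = [x for g in groups for x in g]
    n = len(all_values)
    grand_mean = sum(all_values) / n

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual values around their own group mean
    ss_within = sum(sum((x - statistics.fmean(g)) ** 2 for x in g) for g in groups)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within, k - 1, n - k

# Toy data: three groups with means 2, 3 and 4
f_stat, df1, df2 = one_way_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
print(f_stat, df1, df2)  # 3.0 2 6
```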
ANOVA • ANOVA can also be used to determine whether 3+ clusters are different -- if the clusters follow a normal distribution Variable 1 Variable 2
Distinguishing N Populations (False Discovery Rate) • Suppose you performed 100 different t-tests, and found 20 results with a p value of <0.05 • How many of these findings are likely to be false positives? • If none of the 100 comparisons reflected a real difference, you would still expect about 100 x 0.05 = 5 results with p < 0.05 by chance alone • So several of the 20 findings could be false positives • To correct for this (Bonferroni correction) you keep only those results with a p value < 0.05/100 or p < 0.0005
Example (Some Weather Predictions) • P = 0.08 It will rain • P = 0.05 It will be sunny • P = 0.06 It will be foggy • P = 0.02 It’ll be cloudy • P = 0.05 It will snow • P = 0.07 It will be windy • P = 0.06 It will be calm • P = 0.09 It will hail • P = 0.02 Lightning • P = 0.16 Thunder • P = 0.001 Eclipse • P = 0.09 Tornado • P = 0.18 Hurricane • P = 0.05 Sleet 100% certainty it will do something tomorrow Only one prediction is significant with FDR or Bonferroni correction (Eclipse)
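The weather example can be checked with a short sketch. The slide does not say which FDR procedure it has in mind, so the Benjamini-Hochberg step-up procedure is used here as an assumption, alongside a plain Bonferroni correction. The function names and dictionary keys are chosen for this example.

```python
def bonferroni(p_values, alpha=0.05):
    """Keep results whose p-value is below alpha divided by the number of tests."""
    cutoff = alpha / len(p_values)
    return [name for name, p in p_values.items() if p < cutoff]

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg: find the largest rank k with p_(k) <= (k/m) * alpha
    and keep the k smallest p-values."""
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ranked)
    last = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        if p <= rank / m * alpha:
            last = rank
    return [name for name, _ in ranked[:last]]

# The 14 predictions and p-values listed on the slide
forecast = {
    "rain": 0.08, "sunny": 0.05, "foggy": 0.06, "cloudy": 0.02,
    "snow": 0.05, "windy": 0.07, "calm": 0.06, "hail": 0.09,
    "lightning": 0.02, "thunder": 0.16, "eclipse": 0.001,
    "tornado": 0.09, "hurricane": 0.18, "sleet": 0.05,
}

print(bonferroni(forecast))          # ['eclipse']
print(benjamini_hochberg(forecast))  # ['eclipse']
```

With 14 tests the Bonferroni cutoff is 0.05/14, about 0.0036, so only the eclipse prediction survives under either correction.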
Normalization/Scaling • What if we measured the top population using a ruler that was miscalibrated or biased (inches were short by 10%)? We would get the following result: # of each Height
Normalization • Normalization adjusts for systematic bias in the measurement tool • After normalization we would get: # of each Height
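For the miscalibrated ruler, normalization amounts to rescaling the readings by the known bias. The sketch below assumes that "short by 10%" means each ruler inch is really 0.9 of a true inch, so readings come out too large and are corrected by multiplying by 0.9. The names and height values are hypothetical.

```python
SCALE = 0.9  # assumed: one ruler "inch" equals 0.9 true inches

def normalize(measured, scale=SCALE):
    """Convert readings taken with the biased ruler back to true inches."""
    return [m * scale for m in measured]

# Hypothetical true heights and the inflated readings the bad ruler would give
true_heights = [63.0, 67.0, 72.0]
measured = [h / SCALE for h in true_heights]

corrected = normalize(measured)
print(measured)
print(corrected)
```

When the bias is not known in advance, it would have to be estimated from the data or from reference measurements, which this sketch does not attempt.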