Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops www.bioinformatics.ca


Module 6: Backgrounder in Statistical Methods David Wishart Informatics and Statistics for Metabolomics June 16-17, 2014

Schedule

Learning Objectives • Learn about distributions and significance • Learn about univariate statistics (t-tests and ANOVA) • Learn about correlation and clustering • Learn about multivariate statistics (PCA and PLS-DA)

Statistics • There are three kinds of lies: lies, damned lies, and statistics - Benjamin Disraeli • 98% of all statistics are made up – Unknown • Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital - Aaron Levenstein • Statistics is the mathematics of impressions

Distributions & Significance

Univariate Statistics

Univariate Statistics • Univariate means a single variable • If you measure a population using some single measure such as height, weight, test score, IQ, you are measuring a single variable • If you plot that single variable over the whole population, measuring the frequency that a given value is reached you will get the following:

A Bell Curve • Also called a Gaussian or Normal Distribution [Figure: histogram of # of each (y-axis) versus Height (x-axis)]

Features of a Normal Distribution • Symmetric Distribution • Has an average or mean value (μ) at the centre • Has a characteristic width called the standard deviation (σ) • Most common type of distribution known [Figure label: μ = mean]

Normal Distribution • Almost any set of biological or physical measurements will display some variation, and these will almost always follow a Normal distribution • The larger the set of measurements, the more “normal” the curve • Minimum set of measurements to get a normal distribution is about 30-40

Gaussian Distribution

Some Equations • Mean: $\mu = \frac{\sum x_i}{N}$ • Variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$ • Standard Deviation: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$
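The mean, variance and standard deviation on this slide can be checked numerically. The sketch below assumes the population form (dividing by N, as on the slide); the data set and function names are made up for illustration.

```python
import math

def mean(xs):
    # Sum of the values divided by the number of values N
    return sum(xs) / len(xs)

def variance(xs):
    # Average squared distance from the mean (population form, divides by N)
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def std_dev(xs):
    # Square root of the variance
    return math.sqrt(variance(xs))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements
print(mean(data), variance(data), std_dev(data))
```

For this small data set the mean is 5, the variance is 4 and the standard deviation is 2.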

Standard Deviations (Z-values)

Significance • Based on the Normal Distribution, the probability that something is >1 SD away (larger or smaller) from the mean is 32% • Based on the Normal Distribution, the probability that something is >2 SD away (larger or smaller) from the mean is 5% • Based on the Normal Distribution, the probability that something is >3 SD away (larger or smaller) from the mean is 0.3%
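The three percentages above are rounded tail areas of the standard normal curve. A minimal sketch of where they come from, using Python's standard library (the helper name `two_sided_tail` is an illustration, not part of the workshop material):

```python
from statistics import NormalDist

def two_sided_tail(k):
    """Probability of falling more than k SDs from the mean, in either direction."""
    return 2 * (1 - NormalDist().cdf(k))

for k in (1, 2, 3):
    print(f">{k} SD: {two_sided_tail(k):.4f}")
```

The values come out near 0.317, 0.046 and 0.003, which the slide rounds to 32%, 5% and 0.3%.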

Significance • In a test with a class of 400 students, if you score the average you typically receive a “C” • In a test with a class of 400 students, if you score 1 SD above the average you typically receive a “B” • In a test with a class of 400 students, if you score 2 SD above the average you typically receive an “A”

The P-value • The p-value is the probability of obtaining a test statistic (a score, a set of events, a height) at least as extreme as the one that was actually observed, assuming the null hypothesis is true • One "rejects the null hypothesis" when the p-value is less than the significance level α, which is often 0.05 or 0.01 • When the null hypothesis is rejected, the result is said to be statistically significant

P-value • If the average height of an adult (M+F) human is 5’ 7” and the standard deviation is 5”, what is the probability of finding someone who is more than 6’ 10”? • If you choose an α of 0.05, is a 6’ 10” individual a member of the human species? • If you choose an α of 0.01, is a 6’ 10” individual a member of the human species?
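The height question can be worked through as a z-score calculation. This is a sketch that assumes adult heights follow a single normal distribution with the mean and SD given on the slide, and that only the upper tail (taller than 6’ 10”) is of interest; the variable names are illustrative.

```python
from statistics import NormalDist

mean_in = 5 * 12 + 7      # 5' 7" expressed in inches
sd_in = 5                 # standard deviation in inches
height_in = 6 * 12 + 10   # 6' 10" expressed in inches

# Number of standard deviations above the mean
z = (height_in - mean_in) / sd_in

# One-sided probability of being taller than this, under the normal assumption
p_taller = 1 - NormalDist(mean_in, sd_in).cdf(height_in)
print(z, p_taller)
```

Under these assumptions the height sits 3 SD above the mean and the one-sided p-value is below both 0.05 and 0.01.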

P-value • If you flip a fair coin 20 times, the probability that it turns up heads 14 or more times is 60,460/1,048,576 ≈ 0.058 • If you choose an α of 0.05, is this coin a fair coin? • If you choose an α of 0.10, is this coin a fair coin?
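The coin-flip figure can be reproduced by counting outcomes. The sketch below assumes a fair coin and a one-sided question (14 or more heads out of 20), which is the reading that matches the 0.058 on the slide.

```python
from math import comb

n, k = 20, 14

# Number of 20-flip sequences containing 14 or more heads
ways = sum(comb(n, i) for i in range(k, n + 1))

# Total number of equally likely sequences for a fair coin
total = 2 ** n

p_value = ways / total
print(ways, total, round(p_value, 3))
```

Because the fair-coin distribution is symmetric, a two-sided test (14 or more heads, or 6 or fewer) would double this value.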

Mean, Median & Mode [Figure: distribution curve with the Mode, Median and Mean marked]

Mean, Median, Mode • In a Normal Distribution the mean, mode and median are all equal • In skewed distributions they are unequal • Mean - average value, affected by extreme values in the distribution • Median - the “middlemost” value, usually lying between the mode and the mean • Mode - most common value
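A small right-skewed sample shows how the three measures separate. The data below are invented for illustration; the functions come from Python's standard `statistics` module.

```python
from statistics import mean, median, mode

# Hypothetical right-skewed sample: most values are small, a few are large
skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 20]

print("mean:", mean(skewed))      # pulled upward by the extreme values
print("median:", median(skewed))  # the middlemost value
print("mode:", mode(skewed))      # the most common value
```

In this sample the mode is the smallest of the three and the mean the largest, with the median in between.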

Different Distributions [Figure panels: Unimodal; Bimodal]

Other Distributions • Binomial Distribution • Poisson Distribution • Extreme Value Distribution • Skewed or Exponential Distribution

Binomial Distribution • Pascal’s triangle (rows n = 0 to 5): 1 / 1 1 / 1 2 1 / 1 3 3 1 / 1 4 6 4 1 / 1 5 10 10 5 1 • $P(x) = (p + q)^n$
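The triangle of coefficients on this slide and the expansion of (p + q)^n are linked: each term of the expansion is the probability of x successes in n trials. A minimal sketch, where the function names and the choice p = 0.5 are assumptions made for illustration:

```python
from math import comb

def pascal_row(n):
    # Binomial coefficients C(n, 0) ... C(n, n), i.e. row n of Pascal's triangle
    return [comb(n, x) for x in range(n + 1)]

def binomial_pmf(x, n, p):
    # One term of the expansion of (p + q)^n, with q = 1 - p
    q = 1 - p
    return comb(n, x) * p ** x * q ** (n - x)

for n in range(6):
    print(pascal_row(n))

# Probabilities of 0..5 successes in 5 trials with p = 0.5; they sum to 1
probs = [binomial_pmf(x, 5, 0.5) for x in range(6)]
print(probs, sum(probs))
```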

Poisson Distribution [Figure: P(x), the proportion of samples, plotted against x for μ = 0.1, 1, 2, 3 and 10]

Extreme Value Distribution • Arises from sampling the extreme end of a normal distribution • A distribution which is “skewed” due to its selective sampling • Skew can be either right or left [Figure label: Gaussian Distribution]

Skewed Distribution • Resembles an exponential or Poisson-like distribution • Lots of extreme values far from mean or mode • Hard to do useful statistical tests with this type of distribution [Figure label: Outliers]

Fixing a Skewed Distribution • A skewed or exponentially decaying distribution can often be transformed into a “normal” or Gaussian distribution by applying a log transformation • Because the log rescales the x-variable, it brings the outliers closer to the mean and makes the distribution much more Gaussian
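The effect of a log transformation can be seen on simulated data. The sketch below assumes log-normally distributed values (an idealized case in which the log is exactly normal) and measures asymmetry with a simple moment-based skewness; the `skewness` helper and the random seed are illustrative choices.

```python
import math
import random
from statistics import mean, pstdev

def skewness(xs):
    # Average cubed z-score: about 0 for a symmetric distribution,
    # positive when there is a long right tail
    m, s = mean(xs), pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

random.seed(0)  # fixed seed so the run is repeatable
raw = [random.lognormvariate(0, 1) for _ in range(5000)]  # skewed, all positive
logged = [math.log(x) for x in raw]                       # log-transformed copy

print("skewness before:", skewness(raw))
print("skewness after: ", skewness(logged))
```

Real data are rarely exactly log-normal, so the transformed values will usually be closer to Gaussian rather than perfectly Gaussian. The log also requires strictly positive values.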

Log Transformation [Figure panels: exp’t B, linear scale (skewed distribution); exp’t B, log transformed (normal distribution)]

Log Transformation on Real Data

Distinguishing 2 Populations Normals Leprechauns

The Result • Are they different? [Figure: # of each (y-axis) versus Height (x-axis)]

What about these 2 Populations?

Student’s t-Test • Also called the t-Test • Used to determine if 2 populations are different • Formally, it gives the probability (p-value) of seeing a difference between the 2 sample means at least as large as the one observed if the 2 population means were the same • If the t-Test gives you p = 0.4 and α is 0.05, there is no evidence that the 2 populations are different • If the t-Test gives you p = 0.04 and α is 0.05, the 2 populations are considered different • Paired and unpaired t-Tests are available: paired is used for “before & after” experiments, while unpaired is for 2 randomly chosen samples
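The unpaired and paired t-tests can be sketched with the standard library alone. This is an illustration under stated assumptions, not the workshop's implementation: the unpaired version assumes equal variances (the classic pooled form), the p-value is obtained by numerically integrating the t density with Simpson's rule, and all function names and data are hypothetical. In practice a statistics package would normally be used.

```python
import math
from statistics import mean, variance

def t_pdf(t, df):
    # Density of Student's t distribution with df degrees of freedom
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    # 1 minus the area between -|t| and +|t| (Simpson's rule, steps must be even)
    b = abs(t)
    if b == 0:
        return 1.0
    h = b / steps
    total = t_pdf(0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = 2 * total * h / 3
    return max(0.0, 1.0 - central)

def unpaired_t(a, b):
    # Pooled-variance t statistic for 2 independent samples
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2

def paired_t(before, after):
    # t statistic on the per-subject differences ("before & after" design)
    d = [y - x for x, y in zip(before, after)]
    t = mean(d) / math.sqrt(variance(d) / len(d))
    return t, len(d) - 1

# Hypothetical heights (inches) for 2 randomly chosen samples
group_a = [66.0, 68.5, 70.0, 67.2, 69.1, 71.3]
group_b = [40.1, 42.3, 39.8, 41.0, 43.2, 40.7]
t, df = unpaired_t(group_a, group_b)
print(t, df, two_sided_p(t, df))
```

The resulting p-value is compared with the chosen α in the same way as on the slide.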

Student’s t-Test • A t-Test can also be used to determine whether 2 clusters are different if the clusters follow a normal distribution [Figure: 2 clusters plotted on Variable 1 versus Variable 2]

What if the Distributions are not Normal?

Mann-Whitney U-Test • Also called the Wilcoxon Rank Sum Test • Used to determine if 2 non-normally distributed populations are different • More robust to outliers and non-normal data than the t-test, though slightly less powerful when the data really are normal • Formally, it gives the probability (p-value) of seeing a difference in ranks at least as large as the one observed if the 2 populations had the same distribution (often summarized as having the same median) • If the U-Test gives you p = 0.4 and α is 0.05, there is no evidence that the 2 populations are different • If the U-Test gives you p = 0.04 and α is 0.05, the 2 populations are considered different
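The U statistic itself is simple to compute: it counts, over all pairs, how often a value from one sample exceeds a value from the other. The sketch below is an illustration only; it assumes the large-sample normal approximation for the p-value, with no tie or continuity correction, which is rough for small samples. The function name and data are hypothetical.

```python
import math
from statistics import NormalDist

def mann_whitney_u(a, b):
    """Return (U, approximate two-sided p-value)."""
    # Count pairs where the value from a exceeds the value from b; ties count 0.5
    u_a = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    u_b = len(a) * len(b) - u_a
    u = min(u_a, u_b)

    # Normal approximation to the distribution of U when the samples do not differ
    n1, n2 = len(a), len(b)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma          # never positive, because u is the smaller U
    p = 2 * NormalDist().cdf(z)   # two-sided p-value
    return u, min(p, 1.0)

# Hypothetical skewed measurements from 2 groups
group_a = [1.2, 0.8, 1.5, 0.9, 1.1, 14.0, 1.3, 1.0]
group_b = [2.9, 3.4, 2.2, 25.0, 3.1, 2.7, 3.8, 2.5]
print(mann_whitney_u(group_a, group_b))
```

Because only the ordering of the values matters, the two extreme values (14.0 and 25.0) have no more influence than any other high-ranking observation.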

Distinguishing 3+ Populations Normals Leprechauns Elves

Distinguishing 3+ Populations

ANOVA • Also called Analysis of Variance • Used to determine if 3 or more populations are different; it is a generalization of the t-Test • Formally, ANOVA provides a statistical test (by comparing the variance between groups with the variance within groups) of whether or not the means of several groups are all equal • Uses an F-statistic to test for significance • 1-way, 2-way, 3-way and n-way ANOVAs exist; the most common is 1-way, which tests only whether any of the 3+ populations are different, not which pair is different
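The F-statistic for a 1-way ANOVA is the ratio of between-group to within-group variance. A minimal sketch follows, with a hypothetical function name and made-up data; it stops at the F-statistic and its degrees of freedom, since turning F into a p-value needs the F distribution from a statistics package or table.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups."""
    all_values = [x for g in groups for x in g]
    grand = mean(all_values)
    k, n = len(groups), len(all_values)

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Variation of each value around its own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between, df_within = k - 1, n - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]  # hypothetical measurements
print(one_way_anova_f(groups))
```

A large F means the group means are spread out relative to the scatter inside each group. As the slide notes, it does not say which pair of groups differs.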

ANOVA • ANOVA can also be used to determine whether 3+ clusters are different, provided the clusters follow a normal distribution [Figure: 3 clusters plotted on Variable 1 versus Variable 2]

Distinguishing N Populations (False Discovery Rate) • Suppose you performed 100 different t-tests, and found 20 results with a p-value of < 0.05 • How many of these 20 findings are likely to be false positives? • If none of the 100 effects were real, you would still expect about 100 x 0.05 = 5 results with p < 0.05 by chance alone • So roughly 5 of the 20 findings could be false • To correct for this (Bonferroni correction) you keep only those results with a p-value < 0.05/100, or p < 0.0005

Example (Some Weather Predictions) • P = 0.08 It will rain • P = 0.05 It will be sunny • P = 0.06 It will be foggy • P = 0.02 It’ll be cloudy • P = 0.05 It will snow • P = 0.07 It will be windy • P = 0.06 It will be calm • P = 0.09 It will hail • P = 0.02 Lightning • P = 0.16 Thunder • P = 0.001 Eclipse • P = 0.09 Tornado • P = 0.18 Hurricane • P = 0.05 Sleet • 100% certainty it will do something tomorrow • Only one prediction is significant with FDR or Bonferroni correction (Eclipse)
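The weather example can be checked with the 14 p-values listed on the slide. The sketch below applies a Bonferroni cutoff (α divided by the number of tests) and, as an assumption about what the slide means by "FDR", the Benjamini-Hochberg step-up procedure. The function names and short labels are illustrative.

```python
predictions = {
    "Rain": 0.08, "Sunny": 0.05, "Foggy": 0.06, "Cloudy": 0.02,
    "Snow": 0.05, "Windy": 0.07, "Calm": 0.06, "Hail": 0.09,
    "Lightning": 0.02, "Thunder": 0.16, "Eclipse": 0.001,
    "Tornado": 0.09, "Hurricane": 0.18, "Sleet": 0.05,
}

def bonferroni(pvals, alpha=0.05):
    # Keep only results below alpha divided by the number of tests
    cutoff = alpha / len(pvals)
    return [name for name, p in pvals.items() if p < cutoff]

def benjamini_hochberg(pvals, alpha=0.05):
    # Sort by p-value, find the largest rank k with p(k) <= (k / m) * alpha,
    # and keep every result up to that rank
    ranked = sorted(pvals.items(), key=lambda item: item[1])
    m = len(ranked)
    last = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        if p <= rank / m * alpha:
            last = rank
    return [name for name, _ in ranked[:last]]

print(bonferroni(predictions))
print(benjamini_hochberg(predictions))
```

With α = 0.05 and 14 tests the Bonferroni cutoff is about 0.0036, so only the eclipse prediction (P = 0.001) survives either correction.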

Normalization/Scaling

Normalization/Scaling • What if we measured the top population using a ruler that was miscalibrated or biased (inches were short by 10%)? We would get the following result: [Figure: # of each (y-axis) versus Height (x-axis)]

Normalization • Normalization adjusts for systematic bias in the measurement tool • After normalization we would get: [Figure: # of each (y-axis) versus Height (x-axis)]
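For the miscalibrated ruler, normalization amounts to rescaling every reading by a known calibration factor. The sketch below assumes that "inches were short by 10%" means each marked inch spans 0.9 of a true inch, so readings come out too large; the heights and variable names are invented for illustration.

```python
SCALE = 0.9  # assumed: one miscalibrated "inch" is 0.9 of a true inch

true_heights = [60.0, 66.0, 72.0]             # hypothetical true heights in inches
measured = [h / SCALE for h in true_heights]  # readings from the short-inch ruler
normalized = [m * SCALE for m in measured]    # readings after removing the bias

print(measured)
print(normalized)
```

In real data the calibration factor is usually not known in advance and has to be estimated, for example from a reference sample measured with both tools.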
