Advanced Statistics I

Advanced Statistics I Xiayu (Stacy) Huang Bioinformatics Shared Resource Sanford | Burnham Medical Research Institute

Outline • Summary of previous talk • Descriptive & inferential statistics • T test • Common statistical tests and applications (continued) • One-way ANOVA • Post hoc/multiple comparison tests • One-way ANOVA power analysis • Common data transformation methods

Why is statistics important to biologists? • Designing experiments • Analyzing biological data and understanding analysis results • Preparing manuscripts and grant applications • Typical questions: How many replicates do I need for my microarray experiment? How do I identify outliers? Which normalization/transformation and statistical test should I use to find DEGs? No replicates = no statistics?

Summary of previous talk • Descriptive statistics • Measure of central tendency (mean, median, mode) • Measure of dispersion (standard deviation, range, etc). Previous ppt can be downloaded from http://bsrweb.burnham.org

Summary of previous talk • Inferential statistics • Null hypothesis (H0), e.g. new drug effect = old drug effect, and alternative hypothesis (Ha), e.g. new drug effect ≠ or > old drug effect • Type I and type II errors (a.k.a. false positives and false negatives) • P-value (statistically significant if less than the cutoff α, typically 0.05) • Power (probability of detecting a true difference, typically 0.80) • T test • 3 types of t tests • T test can be performed in Excel and GraphPad Prism • Power analysis of t test in G*Power

Analysis of variance (ANOVA) Overview • History of ANOVA • Ronald Fisher proposed ANOVA in 1918 • His first application of ANOVA was published in 1921 • What does ANOVA do? • Comparing the means of 3 or more populations without knowing the exact difference between population means • Types of ANOVA • One-way and multi-way ANOVA (# of factors of interest) • Repeated measures ANOVA (repeated measurements on the same subject) • MANOVA (# of different measurements)

ANOVA Overview • How to decide which ANOVA to use? • How many different measurements? => ANOVA or MANOVA • How many factors of interest? => One-way or multi-way • Are there any repeated measures? => Repeated or not • Examples • 1: How does tumor size change among three different mouse genotypes within 7 weeks (tumor volume measured every week)? Measurements: tumor size => ANOVA; Factors: genotype and time => two-way; Repeated measures: repeated; Decision: two-way repeated ANOVA • 2: How do tumor size and body weight change among three different mouse genotypes within 7 weeks? Measurements: tumor size and body weight => MANOVA; Factors: genotype and time => two-way; Repeated measures: repeated; Decision: two-way repeated MANOVA
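The three questions above can be sketched as a small helper. This is only an illustration of the decision rule on this slide; the function name `choose_anova` and its parameters are hypothetical.

```python
def choose_anova(n_measurements, n_factors, repeated):
    """Name the ANOVA variant from the three questions on the slide.

    n_measurements: number of different measurements (1 => ANOVA, >1 => MANOVA)
    n_factors:      number of factors of interest (1 => one-way, 2 => two-way, ...)
    repeated:       True if the same subjects are measured repeatedly
    """
    family = "MANOVA" if n_measurements > 1 else "ANOVA"
    ways = {1: "one-way", 2: "two-way"}.get(n_factors, "multi-way")
    rep = " repeated" if repeated else ""
    return f"{ways}{rep} {family}"


# Example 1: tumor size only, genotype and time, weekly measurements
example_1 = choose_anova(n_measurements=1, n_factors=2, repeated=True)
# Example 2: tumor size and body weight, genotype and time, weekly measurements
example_2 = choose_anova(n_measurements=2, n_factors=2, repeated=True)
print(example_1)
print(example_2)
```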

ANOVA application in biology One-way ANOVA Two-way ANOVA One-way MANOVA Two-way repeated ANOVA

One-way ANOVA • One measurement, no repeats, one factor with multiple levels (“groups”), e.g. comparing the effect of three different drug treatments (control, drug A and drug B) on body weight • Measurement: body weight; Factor: treatment; Levels: control, drug A and drug B • Hypothesis • H0: $\mu_1 = \mu_2 = \mu_3$ • No treatment effect • Ha: at least one population mean is different • There is a treatment effect

One-way ANOVA • Basic idea • compares 2 types of variation to test equality of population means • Source of variation(error) Total variation (SS(Total)) Variation among treatments (SST) Variation within treatments (SSE) SS(total)=SST+SSE

One-way ANOVA • Source of variation • $\bar{x}_1$: mean of group 1; $n_1$: sample size in group 1 • $\bar{x}_2$: mean of group 2; $n_2$: sample size in group 2 • $\bar{x}_3$: mean of group 3; $n_3$: sample size in group 3 • $\bar{x}$: overall mean of all the samples • [Figure: three panels showing group1, group2 and group3, illustrating total variation, variation among treatments, and variation within treatments]
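The three sources of variation can be written out as sums of squares. The notation is an assumption made for this sketch: $x_{ij}$ is observation $j$ in group $i$, $\bar{x}_i$ and $n_i$ are the mean and sample size of group $i$, $\bar{x}$ is the overall mean, and $k$ is the number of groups.

```latex
\begin{align*}
\mathrm{SS(Total)} &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij}-\bar{x})^2 \\
\mathrm{SST} &= \sum_{i=1}^{k} n_i\,(\bar{x}_i-\bar{x})^2 \\
\mathrm{SSE} &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} (x_{ij}-\bar{x}_i)^2 \\
\mathrm{SS(Total)} &= \mathrm{SST}+\mathrm{SSE}
\end{align*}
```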

One-way ANOVA • F test statistic • The F statistic is a ratio of two variations • Numerator: variation among treatments/(k-1) • Denominator: variation within treatments/(n-k) • k = number of populations, treatment groups, or levels • n = total sample size • $F = \frac{\mathrm{SST}/(k-1)}{\mathrm{SSE}/(n-k)}$ • A large F value indicates that variation among treatments is greater than variation within treatments, i.e. there is a treatment effect • The p-value is computed from the F statistic
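As a rough illustration of the F statistic, the sketch below computes SST, SSE and F for three small groups in plain Python. The data are made up for the example, and the function name `one_way_anova` is hypothetical.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (SST, SSE, F) for a list of groups of observations."""
    all_values = [x for g in groups for x in g]
    n = len(all_values)   # total sample size
    k = len(groups)       # number of groups (levels)
    grand_mean = mean(all_values)
    # Variation among treatments: group means around the overall mean
    sst = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation within treatments: observations around their own group mean
    sse = sum((x - mean(g)) ** 2 for g in groups for x in g)
    f_stat = (sst / (k - 1)) / (sse / (n - k))
    return sst, sse, f_stat


# Made-up body weights for control, drug A and drug B
control = [20.0, 22.0, 24.0]
drug_a = [25.0, 27.0, 29.0]
drug_b = [30.0, 32.0, 34.0]
sst, sse, f_stat = one_way_anova([control, drug_a, drug_b])
print(sst, sse, f_stat)
```

The p-value is the upper tail of the F distribution with (k-1, n-k) degrees of freedom. If SciPy is available, `scipy.stats.f_oneway` returns both the F statistic and the p-value directly.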

One-way ANOVA • Assumptions • Sampling should be independent and randomized. • Homogeneity of variance • Populations (for each condition) have equal variances • Check by Bartlett’s or Levene’s test • Normality • Populations (for each condition) are normally distributed • Check by normality test (such as Kolmogorov-Smirnov test) • Normal distribution=Gaussian distribution=>“bell-shaped” curve Note: equal sample size is preferred

One-way ANOVA example • Goal: determining whether mouse group has a significant effect on rotarod performance • Measurement: number of seconds staying on a rotarod

Data summarization using descriptive statistics

Statistical test decision tree [Figure: decision tree; node labels: Relationship between variables, Two sample comparison, One measurement, Multiple sample comparison]

Statistical test decision tree [Figure: decision tree; node labels: Number of measurements, Number of factors]

Normality check in GraphPad Prism

Variance check and ANOVA analysis in GraphPad Prism

Variance check and ANOVA analysis in GraphPad Prism [Figure: output with SST, SSE and SS(Total) labeled]

ANOVA analysis of example data in Excel

ANOVA analysis results of example data • Conclusion • At least one group of mice performs differently on the rotarod than the other groups • Next • What are the exact differences between the groups? • Are groups 1 and 2 significantly different? Are groups 1 and 3 significantly different?

Post hoc/multiple comparison tests introduction • ANOVA does not tell which groups differ from each other • Problem of doing multiple t tests • High combined false positive error rate • 6 t tests, each at α = 0.05, give about a 26% chance of at least one false positive • Multiple comparison tests • They are based on the t test or F test, take the number of comparisons into account, and adjust the p-value obtained from the regular t test • Pairwise comparisons • A pairwise comparison is the difference between any two population means. For k populations or groups, there are k(k-1)/2 possible pairwise comparisons

Post hoc/multiple comparison tests introduction • Error rate • Familywise error rate (FWER) • Probability of making at least one false positive among all the comparisons • Example: 4 groups, number of all possible pairwise comparisons = 4(4-1)/2 = 6 • If the allowed false positive rate for each comparison is 0.05, the probability of making at least one false positive is $1-(1-0.05)^6 \approx 0.26$ (treating the comparisons as independent) • Control FWER at a given α level such as 0.05 • False discovery rate (FDR) • Used for large data sets such as microarray data • Expected proportion of declared significant results that are actually false positives • Example: if 1000 genes were declared statistically significant and FDR = 0.05, then 50 genes would be expected to be false positives • Control FDR at a given level such as 0.05
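The familywise error rate arithmetic on this slide can be reproduced with a few lines of Python. This is a sketch that treats the comparisons as independent, which is the assumption behind the formula; the function names are hypothetical.

```python
def pairwise_comparisons(k):
    """Number of possible pairwise comparisons among k groups: k(k-1)/2."""
    return k * (k - 1) // 2


def familywise_error_rate(alpha, m):
    """Chance of at least one false positive in m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m


m = pairwise_comparisons(4)            # 4 groups
fwer = familywise_error_rate(0.05, m)  # each comparison tested at 0.05
print(m, round(fwer, 2))
```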

Common post hoc/multiple comparison tests • Fisher’s protected least significant difference (Fisher’s LSD) • The first post hoc test, developed by Fisher to study pairwise comparisons • Assumptions: equal variances, normality, and a significant F test result • Does not control FWER • Tukey’s test • Most popular post hoc test; suitable for all pairwise comparisons (6 or more) • Assumptions: equal variances and normality • Controls FWER and is suitable for both equal and unequal sample sizes • Bonferroni test • Suitable for any set of preplanned comparisons • α per comparison = α/total number of comparisons • Controls FWER, but is too conservative in many situations
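A minimal sketch of the Bonferroni rule above (α per comparison = α/total number of comparisons). The p-values are hypothetical, and the function name `bonferroni` is chosen for this example only.

```python
def bonferroni(p_values, alpha=0.05):
    """Apply the Bonferroni correction to a list of raw p-values.

    Returns the per-comparison alpha, the adjusted p-values (capped at 1),
    and a list of booleans marking which comparisons are significant.
    """
    m = len(p_values)
    alpha_per_comparison = alpha / m
    adjusted = [min(1.0, p * m) for p in p_values]
    significant = [p <= alpha_per_comparison for p in p_values]
    return alpha_per_comparison, adjusted, significant


# Hypothetical raw p-values from 6 pairwise comparisons
raw_p = [0.001, 0.008, 0.02, 0.04, 0.3, 0.6]
alpha_pc, adjusted_p, is_significant = bonferroni(raw_p)
print(alpha_pc, is_significant)
```

Comparing raw p-values against α/m and comparing adjusted p-values against α lead to the same decisions; the second form is what most software reports.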

Common post hoc/multiple comparison tests • Benjamini-Hochberg false discovery rate (FDR) • Proposed by Benjamini and Hochberg in 1995 • Four-step procedure • Conduct m separate t tests for m genes, with a common level α = 0.05 • Order the p-values of the m genes from smallest to largest, $P_1 \le P_2 \le \dots \le P_m$ • Find the largest k such that $P_k \le (k/m)\alpha$; call it K • Declare genes 1…K statistically significantly different between the two groups being compared • Less conservative than the Bonferroni test and more powerful than methods that control the familywise error rate
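The ordering and largest-K steps of the procedure above can be sketched as follows. The p-values are hypothetical (they are not the ones from the talk's example), and the function name `benjamini_hochberg` is chosen for this illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices (into p_values) of tests declared significant."""
    m = len(p_values)
    # Order tests by p-value, smallest first, remembering original positions
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with P_k <= (k/m) * alpha
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            largest_k = rank
    # Everything ranked 1..K is declared significant
    return sorted(order[:largest_k])


# Hypothetical p-values for 8 genes, in arbitrary order
gene_p = [0.04, 0.001, 0.5, 0.003, 0.02, 0.9, 0.012, 0.3]
significant_genes = benjamini_hochberg(gene_p, alpha=0.05)
print(significant_genes)
```

The loop keeps going after a failed rank because the rule asks for the largest k that passes, so a later rank can still rescue earlier ones.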

Post hoc/multiple comparison tests example • Goal: determining whether there is an overall significant difference in performance among the four mouse groups and how the groups differ from each other • Measurement: number of seconds staying on a rotarod • Number of possible pairwise comparisons: 4(4-1)/2 = 6 • Multiple comparison test: Tukey’s test (6 or more comparisons)

Post hoc/multiple comparison tests in GraphPad Prism

Comparing different multiple comparison tests

Post hoc/multiple comparison tests example • Goal: determining whether there is a statistically significant difference between WT and KO for each of a set of m = 12 genes at FDR (Q) = 0.05 • Order the genes by p-value in increasing order • Find the largest K such that $P_K \le (K/m)Q$ • Conclusion: two genes (SLU7 and LGI1) are statistically significantly different between WT and KO among the 12 genes tested at FDR (Q) = 0.05

Power analysis • Power depends on: • Sample size ($n$) • Standard deviation ($\sigma$ or $s$) • Minimal detectable difference (relative to the standard deviation, this is the effect size) • False positive rate ($\alpha$) • What you can do with power analysis • Find the minimal sample size required • Find the minimal detectable difference or effect size • Find the power of the test

Power analysis software/packages • G*Power (free!!!) • Optimal Design (free!!!) • SPSS SamplePower • PASS • SAS PROC POWER, Stata sampsi, etc. • Mplus for more advanced/complicated analysis • Many free online programs • http://www.stat.uiowa.edu/~rlenth/Power/

One-way ANOVA power analysis in G*Power • Test family • F tests • Statistical test • ANOVA: fixed effects, omnibus, one-way • Type of power analysis • Compute sample size given α, power, and effect size • Input parameters • Effect size ($f$) • False positive rate ($\alpha$) • usually 0.05 • Minimum power ($1-\beta$) • usually 0.80 • Number of groups • Output parameters • Noncentrality parameter ($\lambda$) • Critical F • Degrees of freedom • Total sample size • Actual power
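To make the effect size input more concrete, the sketch below computes Cohen's f from group means and a common within-group standard deviation, and the noncentrality parameter as λ = f² × total sample size. It assumes equal group sizes; the means, standard deviation and sample size are made up, and the function names are hypothetical. Computing the power itself needs the noncentral F distribution, which is left to G*Power.

```python
from math import sqrt


def cohens_f(group_means, sd_within):
    """Cohen's f for one-way ANOVA, assuming equal group sizes.

    f = (standard deviation of the group means) / (within-group standard deviation)
    """
    k = len(group_means)
    grand_mean = sum(group_means) / k
    sd_between = sqrt(sum((m - grand_mean) ** 2 for m in group_means) / k)
    return sd_between / sd_within


def noncentrality(f, total_n):
    """Noncentrality parameter lambda = f^2 * total sample size."""
    return f ** 2 * total_n


# Made-up pilot data: three group means and a common within-group SD
f = cohens_f([22.0, 27.0, 32.0], sd_within=5.0)
lam = noncentrality(f, total_n=30)
print(round(f, 3), round(lam, 1))
```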

One-way ANOVA sample size calculation • Goal: how many mice should I use to have an 80% probability of detecting the observed difference at the 0.05 level of significance?

One-way ANOVA sample size calculation in G*Power

One-way ANOVA sample size calculation in G*Power ANOVA analysis results

One-way ANOVA sample size calculation in G*Power

Data Transformation • Why? • Many biological variables do not follow a normal distribution • How? • Applying a mathematical function to each observation • Performing statistical tests using the transformed data • Interpreting results using back transformation • Common data transformation methods in biology • Log transformation • Square root transformation • Arcsine transformation • Reciprocal transformation

Log transformation • Usage • Converts a right-skewed distribution into a symmetrical one • Applicable when variances are unequal and standard deviations are proportional to the means • Mathematical function: $x' = \log_b(x)$ • Logarithms in any base $b$ are satisfactory • Back transformation: $x = b^{x'}$

Square root transformation • Usage • Applicable when the group variances are proportional to the means • Samples taken from a Poisson distribution, such as count data • Mathematical function: $x' = \sqrt{x}$ • Back transformation: $x = (x')^2$

Arcsine transformation • Usage • Applicable when the data (proportions or percentages) were taken from a binomial distribution • Mathematical function: $p' = \arcsin(\sqrt{p})$, with $p$ expressed as a proportion • Back transformation: $p = \sin^2(p')$ • Shortcoming • Not good at the ends of the range (near 0 and 100%) • Adjustment needed when p is near 0 or 100%
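The log, square root and arcsine transformations and their back transformations can be sketched in Python as below. The base-10 logarithm is an arbitrary choice for the example, the data values are made up, and the function names are hypothetical.

```python
from math import asin, log10, sin, sqrt
from statistics import mean


def log_transform(x):
    return log10(x)


def log_back(y):
    return 10 ** y


def sqrt_transform(x):
    return sqrt(x)


def sqrt_back(y):
    return y ** 2


def arcsine_transform(p):
    """p is a proportion between 0 and 1."""
    return asin(sqrt(p))


def arcsine_back(y):
    return sin(y) ** 2


# Analyze on the transformed scale, then back-transform the summary
weights = [1.0, 10.0, 100.0]                  # made-up right-skewed data
mean_log = mean(log_transform(w) for w in weights)
back_mean_log = log_back(mean_log)            # geometric mean, not the arithmetic mean

counts = [4.0, 9.0, 16.0]                     # made-up count data
back_mean_sqrt = sqrt_back(mean(sqrt_transform(c) for c in counts))

proportion = 0.25
round_trip = arcsine_back(arcsine_transform(proportion))
print(back_mean_log, back_mean_sqrt, round_trip)
```

Note that a back-transformed mean generally differs from the arithmetic mean of the raw data; for the log transformation it is the geometric mean.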

Choosing transformation methods based on data distribution
Shape | Figure | Transformation
Reverse J | A | 1/X
Severe skew right | B | Log(X)
Moderate skew right | C | sqrt(X)

Choosing transformation based on data distribution
Shape | Figure | Transformation
Moderate skew left | D | 1/sqrt(X)
Severe skew left | E | -1/Log(X)
J shape | F | -1/X

Data transformation example [Figure: histogram of weight; x-axis: weight, y-axis: frequency]

Data transformation example

Choosing transformation based on data distribution
Shape | Figure | Transformation
Reverse J | A | 1/X
Severe skew right | B | Log(X)
Moderate skew right | C | sqrt(X)

Data transformation example [Figure: histograms before transformation (x-axis: weight) and after transformation (x-axis: log2_weight); y-axis: frequency]
