Today's Topics • Outliers and Cheating • Departures from Normality • Assumptions of General Linear Models • Checking Independence
Outliers and Cheating Why you shouldn't worry about outliers
Outliers and Cheating • An outlier is a data point with one or more values so extreme that it may not belong in the sample • You must have a legitimate reason for eliminating a data point, such as instrument error or miscoding of data • Generally you will find that the value is not so unusual once you have performed sufficient replicates
Outliers and Cheating • Often, data from trials seem to be outliers • Repeated experiments show that they are not
Departures from Normality Testing for Normality
Departures from Normality • The z- or t-score relies on sampling a population that conforms to the normal frequency distribution • If your sample data do not conform to normality, your analysis may be unreliable • How do you know whether this is the case with your sample, and what do you do if it isn't?
Departures from Normality • Mathematical methods for testing normality (goodness-of-fit tests) • Testing for symmetry and kurtosis: complex, requires the use of statistical tables • Kolmogorov-Smirnov test: low power, not recommended • W and KSL tests: Shapiro-Wilk (n < 2000) and Kolmogorov-Smirnov-Lilliefors (n ≥ 2000); complex calculation, but can be done with software; excellent
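The symmetry and kurtosis check can be sketched with sample moments. This is a minimal illustration, assuming the simple (biased) moment estimators; the function name and both datasets are made up for the example, and judging significance would still need tables or software.

```python
def sample_moments(values):
    """Return (skewness, excess kurtosis) from the biased moment estimators."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5          # 0 for a symmetric sample
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    return skewness, excess_kurtosis


# Hypothetical samples, for illustration only
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
right_skewed = [1, 1, 2, 2, 3, 4, 6, 10, 25]

print(sample_moments(symmetric))
print(sample_moments(right_skewed))
```

A skewness far from zero, as in the second sample, is the kind of departure that the transformations discussed later are meant to reduce.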
Departures from Normality • Normality testing assesses the likelihood that data are derived from a normal population • Compares to a normal distribution with the same moments
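Comparing the data to a normal distribution with the same moments can be sketched as a distance between two cumulative distributions. The block below computes only the distance statistic used by Kolmogorov-Smirnov-type tests with the mean and standard deviation estimated from the sample; it is an illustration, not the exact routine of any statistics package, and it produces no p-value. The function name and datasets are assumptions.

```python
import math
from statistics import NormalDist, mean, stdev


def ks_distance_to_fitted_normal(values):
    """Largest gap between the empirical CDF and a normal CDF that has
    the sample's own mean and standard deviation."""
    data = sorted(values)
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))
    d = 0.0
    for i, x in enumerate(data):
        cdf = fitted.cdf(x)
        # The empirical CDF steps from i/n to (i + 1)/n at x; check both sides.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


# Hypothetical samples: evenly spaced normal quantiles, and their exponentials
n = 20
near_normal = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
skewed = [math.exp(z) for z in near_normal]

print(ks_distance_to_fitted_normal(near_normal))
print(ks_distance_to_fitted_normal(skewed))
```

The skewed sample gives the larger distance. Turning that distance into a probability requires the appropriate tables or software, which is why the slides recommend letting a package run the W or KSL test.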
Departures from Normality • Transformation can correct departures • Log: use when variance is proportional to y² • Arcsine: use when analyzing proportions (binomial populations) • Square root: use when means are proportional to variances (Poisson distributions)
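The three transformations can be written as small helper functions. This is a sketch with hypothetical function names; it assumes the arcsine transformation is applied in its usual form, the arcsine of the square root of the proportion, which the slide does not spell out.

```python
import math


def log_transform(y):
    """Natural log; only defined for y > 0."""
    return math.log(y)


def arcsine_transform(p):
    """Arcsine of the square root of a proportion p in [0, 1] (assumed form)."""
    return math.asin(math.sqrt(p))


def sqrt_transform(count):
    """Square root, typically applied to counts (count >= 0)."""
    return math.sqrt(count)


# Hypothetical values, for illustration only
yields = [12.0, 30.0, 85.0, 240.0]
proportions = [0.05, 0.25, 0.50, 0.95]
counts = [0, 1, 4, 9]

print([round(log_transform(y), 3) for y in yields])
print([round(arcsine_transform(p), 3) for p in proportions])
print([sqrt_transform(c) for c in counts])
```

Zeros need care with the log transformation; adding a small constant before taking logs is a common workaround, but it is a separate analysis decision and is not shown here.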
Transformations-Fertilizer • Log transformation corrects non-normality of the Fertilizer dataset
Transformations-Fertilizer • Transformations will generally improve R-square and F scores, though not necessarily post-hoc tests
Transformations • Log transformation • Use when variance is proportional to y²
Transformations • Square root • Use when means are proportional to variances (Poisson distributions)
Assessing Normality • Do's: • Check your data before analyzing • Transform your data if necessary • Analyze both untransformed and transformed data to judge the effects • Don'ts: • Don't worry if a transformation doesn't fix the distribution • Don't discard an analysis that came from non-normal data • The Central Limit Theorem means that ANOVA and two-sample tests are tolerant of small deviations
Assumptions of General Linear Models Checking Independence Chpt 8 Grafen and Hails
BIOM 285 Assumptions of GLMs • General Linear Models rely on four basic assumptions of the data to be analyzed • Independence • Homogeneity of Variance • Normality of Error • Linearity/Additivity
Assumptions of GLMs • The goal of testing is to describe the likelihood that our null hypothesis is false • If the assumptions underlying our arrival at that likelihood are violated, then our conclusion (p-value) is at risk of being false • When this happens, we cannot tell the degree to which this affects the p-value
Assumptions of GLMs • Therefore, it is important to check the validity of the underlying assumptions when you analyze your data • You should make any appropriate corrections to violations, or note those that cannot be corrected • This is considered Good Laboratory Practice
Checking Independence Grafen and Hails Chpt. 8
Checking Independence Heterogeneous Data Repeated Measures Nested Data
Checking Independence • Independence of Data • Datapoints are independent if knowing the error of one or a subset provides no knowledge of the error of any others
Why Check Independence? • If data are not independent, then you are not sampling the true population • This violates a fundamental assumption of hypothesis testing and ANOVA • Sources of data non-independence • Heterogeneous data • Data from repeated measures • Nested data
Heterogeneous Data • Data are not derived from the same population • An example • Hypothesize that caterpillar weight gain is limited by competition • Weight gain of caterpillars over 5 days • Measure the overall population density on selected plants • Select from three habitats • Tested by the GLM: • Weight Gain=Population Density
Heterogeneous Data • Regression is significant • Density explains weight
Heterogeneous Data • Is habitat a factor? • Identifying source of the data shows that individual data provide information on other data
Heterogeneous Data • After grouping, data are no longer predictable • Appropriate analysis
Heterogeneous Data • Incorrect GLM: • Weight Gain=Population Density • Conclude density explains weight gain • Correct GLM: • Weight Gain=Habitat+Population Density • Conclude that habitat dictates weight gain • ANOVA will show that population density is not significant
What Happened? • We created an artifactual relationship between two variables because we ignored a third • The data were not randomly drawn from the same population, but from three subsets • Knowing the residual error in one datum predicts the likely error in the others from that group • We violated the assumption of data independence
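The artifact can be reproduced with a small made-up dataset. In this sketch the habitats, densities and weight gains are invented for illustration, not taken from the caterpillar study. The pooled slope corresponds to Weight Gain=Population Density; the within-habitat slope, obtained by centering each habitat on its own means, equals the density coefficient from Weight Gain=Habitat+Population Density.

```python
def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical records: (habitat, population density, weight gain)
records = [
    ("A", 1, 10), ("A", 2, 11), ("A", 3, 10),
    ("B", 4, 7), ("B", 5, 8), ("B", 6, 7),
    ("C", 7, 4), ("C", 8, 5), ("C", 9, 4),
]


def centre_within_habitats(records):
    """Subtract each habitat's mean density and mean gain from its members."""
    groups = {}
    for habitat, density, gain in records:
        groups.setdefault(habitat, []).append((density, gain))
    xs, ys = [], []
    for members in groups.values():
        mean_d = sum(d for d, _ in members) / len(members)
        mean_g = sum(g for _, g in members) / len(members)
        for d, g in members:
            xs.append(d - mean_d)
            ys.append(g - mean_g)
    return xs, ys


# Habitat ignored: density appears to drive weight gain
pooled_slope = slope([d for _, d, _ in records], [g for _, _, g in records])
# Habitat accounted for: the density effect vanishes in this dataset
within_slope = slope(*centre_within_habitats(records))

print(pooled_slope, within_slope)
```

For these numbers the pooled slope is clearly negative while the within-habitat slope is zero: the apparent density effect comes entirely from differences between habitats.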
Let Residuals Help You • What Happened? • By grouping the data, we can no longer predict the residual error of the other members of the group • Satisfying the assumption of independence
Heterogeneous Data Regression Grouping
Repeated Measures • If an individual is measured more than once, these measures cannot be treated as independent • Two methods to correct this • Single summary • Multivariate
Repeated Measures • An example • Two diets, 5 animals each, measured on 4 different days • 40 datapoints
Repeated Measures-Single Summary • Summarize the data with a single value • Each day’s weights grouped
Repeated Measures-Single Summary • Incorrect GLM • Log(weight)=diet+animal+sample • Correct GLM • Log(final weight)=diet • Log(weight1-weight4)=diet • Can summarize in any way that addresses your question • You will discard data • Inefficient experimental design
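The single-summary approach amounts to collapsing each animal's four measurements into one number before fitting the model. The sketch below uses invented weights and hypothetical function names; the two summaries shown (log of the final weight, and the change from first to last measurement) are examples of possible choices rather than the ones the slides necessarily intend.

```python
import math

# Hypothetical weights: diet -> animal -> weights on the 4 measurement days
weights = {
    "diet_1": {
        "a1": [20.0, 21.5, 23.1, 24.8],
        "a2": [19.4, 20.9, 22.0, 23.6],
        "a3": [21.2, 22.8, 24.9, 26.3],
        "a4": [20.5, 21.7, 23.4, 25.0],
        "a5": [19.9, 21.0, 22.7, 24.1],
    },
    "diet_2": {
        "b1": [20.3, 22.9, 25.8, 28.4],
        "b2": [19.8, 22.1, 24.9, 27.7],
        "b3": [21.0, 23.6, 26.5, 29.9],
        "b4": [20.7, 23.0, 25.6, 28.8],
        "b5": [19.5, 21.9, 24.4, 27.2],
    },
}


def final_log_weight(series):
    """Summary 1: log of the last measurement only."""
    return math.log(series[-1])


def total_gain(series):
    """Summary 2: change from the first to the last measurement."""
    return series[-1] - series[0]


def summarise(weights, summary):
    """Collapse each animal's repeated measures to one (diet, value) row."""
    return [
        (diet, summary(series))
        for diet, animals in weights.items()
        for series in animals.values()
    ]


rows = summarise(weights, final_log_weight)
print(len(rows))  # one row per animal
```

Each animal now contributes a single independent value, so the 40 datapoints become 10 rows for a model of the form summary=diet. The cost is the one the slide names: the intermediate measurements are discarded.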
Repeated Measures-Multivariate • Use Y variables to distinguish between X variables • Use two variables, for example • Four possibilities • Distinguish on Y1, not Y2 • Distinguish on Y2, not Y1 • Distinguish on both • Distinguish on neither [Figure axes: weight at 60 weeks, weight at 20 weeks]
Repeated Measures-Multivariate • Distinguish on Y1, not Y2 • Can distinguish on weight at 60 weeks, but not at 20 weeks [Figure axes: weight at 60 weeks, weight at 20 weeks]
Repeated Measures-Multivariate • Distinguish on Y2, not Y1 • Can distinguish on weight at 20 weeks, but not at 60 weeks [Figure axes: weight at 60 weeks, weight at 20 weeks]
Repeated Measures-Multivariate • Can identify a relationship between weight at 60 weeks and weight at 20 weeks • Distinguish groups based on a function of Y1, Y2 • NOT independent [Figure axes: weight at 60 weeks, weight at 20 weeks]
Nested Data • Response=Sample((Leaf)Branch) • Hierarchical arrangement of data • These relationships need to be accounted for in the design
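One simple way to respect a hierarchy like this, assumed here for illustration rather than taken from the slides, is to average up the levels so that each branch contributes a single value. The data and the function name are hypothetical.

```python
from statistics import mean

# Hypothetical measurements: branch -> leaf -> repeated samples
data = {
    "branch_1": {"leaf_1": [1.0, 3.0], "leaf_2": [4.0, 6.0]},
    "branch_2": {"leaf_1": [10.0, 12.0], "leaf_2": [13.0]},
}


def branch_means(data):
    """Average samples within each leaf, then leaves within each branch."""
    result = {}
    for branch, leaves in data.items():
        leaf_means = [mean(samples) for samples in leaves.values()]
        result[branch] = mean(leaf_means)
    return result


print(branch_means(data))
```

Samples from the same leaf are not independent of each other, so treating every sample as a separate datapoint would overstate the amount of information. Averaging is the crudest remedy; a nested model that keeps all the levels uses the data more fully.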
Nested Data • Nesting can be applied to the Habitat Model
Checking Independence • Independence is a key assumption, and is the most difficult in practice • Be alert to violations • Check data and residuals • Know what can be done at the analysis stage to correct violations • Mistakes at the design stage are often unrecoverable at analysis