
Today's Topics


morrison

Presentation Transcript


  1. Today's Topics • Outliers and Cheating • Departures from Normality • Assumptions of General Linear Models • Checking Independence

  2. Outliers and Cheating • Why you shouldn't worry about outliers

  3. Outliers and Cheating • An outlier is a data point with one or more values so extreme that it may not belong in the sample • You must have a legitimate reason for eliminating a data point, such as instrument error or miscoding of data • Generally you will find that your value is not so unusual once you have performed sufficient replicates

  4. Outliers and Cheating • Often, data from trials seem to be outliers • Repeated experiments show that they are not

  5. Departures from Normality Testing for Normality

  6. Departures from Normality • The z- or t-score relies on sampling a population that conforms to the normal frequency distribution • If your sample data do not conform to normality, your analysis may be invalid • How do you know whether this is the case with your sample, and what do you do if it isn’t?

  7. Departures from Normality • Mathematical methods for testing normality • Goodness-of-fit tests • Testing for symmetry and kurtosis: complex, requires the use of statistical tables • Kolmogorov-Smirnov test: low power, not recommended • W and KSL tests: Shapiro-Wilk (n < 2000), Kolmogorov-Smirnov-Lilliefors (n ≥ 2000); complex calculation, but can be done with software; excellent

  8. Departures from Normality • Normality testing assesses the likelihood that data are derived from a normal population • Compares to a normal distribution with the same moments
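
As a rough illustration of comparing a sample's moments with those of a normal distribution, the sketch below computes sample skewness (symmetry) and excess kurtosis using only the Python standard library. Both are 0 for a normal population. The function names and the data are hypothetical, made up for illustration; this is not a formal test such as Shapiro-Wilk, which would normally be run in statistical software.

```python
import statistics


def sample_skewness(values):
    """Third standardized moment; 0 for a perfectly symmetric sample."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def excess_kurtosis(values):
    """Fourth standardized moment minus 3; 0 for a normal population."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0


# Hypothetical measurements, invented for this example only.
symmetric = [2.0, 3.0, 4.0, 5.0, 6.0]
right_skewed = [1.0, 1.0, 2.0, 2.0, 3.0, 12.0]

print("symmetric skewness:", sample_skewness(symmetric))
print("right-skewed skewness:", sample_skewness(right_skewed))
print("symmetric excess kurtosis:", excess_kurtosis(symmetric))
```

Values far from 0 suggest a departure from normality, but small samples give noisy estimates, so these numbers are best read alongside a plot or a formal test.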

  9. Departures from Normality • Transformation can correct departures • Log: use when variance is proportional to y² • Arcsine: use when analyzing proportions (binomial populations) • Square root: use when means are proportional to variances (Poisson distributions)

  10. Transformations-Fertilizer • Log transformation corrects non-normality of the Fertilizer dataset

  11. Transformations-Fertilizer • Transformations will generally improve R-squared and F values, though not necessarily post-hoc test results

  12. Transformations • Log transformation • Use when variance is proportional to y²

  13. Transformations • Square root • Use when means are proportional to variances (Poisson distributions)
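
The three transformations named in these slides can be sketched in a few lines of standard-library Python. The function names and input values are hypothetical. The arcsine transformation is written here in its common arcsine-square-root form, which is an assumption, since the slides do not give the formula.

```python
import math


def log_transform(y):
    """Log transformation; requires y > 0."""
    return math.log(y)


def sqrt_transform(count):
    """Square-root transformation, typically applied to counts."""
    return math.sqrt(count)


def arcsine_transform(p):
    """Arcsine of the square root of a proportion p in [0, 1] (radians)."""
    return math.asin(math.sqrt(p))


# Hypothetical values, invented for illustration only.
sizes = [1.0, 10.0, 100.0]            # spread grows with the mean
counts = [0, 1, 4, 9]                 # Poisson-like counts
proportions = [0.0, 0.25, 0.5, 1.0]   # binomial proportions

log_sizes = [log_transform(y) for y in sizes]
sqrt_counts = [sqrt_transform(c) for c in counts]
arcsine_props = [arcsine_transform(p) for p in proportions]

print(log_sizes)
print(sqrt_counts)
print(arcsine_props)
```

Note that the log transformation turns multiplicative steps (1, 10, 100) into equal additive steps, which is why it suits data whose spread grows with the mean.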

  14. Assessing Normality • Do’s • Check your data before analyzing • Transform your data if necessary • Analyze both untransformed and transformed data to judge the effects • Don’ts • Don’t worry if a transformation doesn’t fix the distribution • Don’t discard an analysis that came from non-normal data • The Central Limit Theorem means that ANOVA and two-sample tests are tolerant of small deviations

  15. Assumptions of General Linear Models • Checking Independence • Grafen and Hails, Chapter 8

  16. BIOM 285 Assumptions of GLMs • General Linear Models rely on four basic assumptions about the data to be analyzed • Independence • Homogeneity of variance • Normality of error • Linearity/additivity

  17. Assumptions of GLMs • The goal of testing is to assess the strength of the evidence against our null hypothesis • If the assumptions underlying that assessment are violated, then our conclusion (p-value) is at risk of being wrong • When this happens, we cannot tell the degree to which the p-value is affected

  18. Assumptions of GLMs • Therefore, it is important to check the validity of the underlying assumptions when you analyze your data • You should make any appropriate corrections to violations, or note those that cannot be corrected • This is considered Good Laboratory Practice

  19. Checking Independence • Grafen and Hails, Chapter 8

  20. Checking Independence • Heterogeneous Data • Repeated Measures • Nested Data

  21. Checking Independence • Independence of data • Data points are independent if knowing the error of one, or of a subset, provides no knowledge of the error of any others

  22. Why Check Independence? • If data are not independent, then you are not sampling the true population • This violates a fundamental assumption of hypothesis testing and ANOVA • Sources of data non-independence • Heterogeneous data • Data from repeated measures • Nested data

  23. Heterogeneous Data • Data are not derived from the same population • An example • Hypothesize that caterpillar weight gain is limited by competition • Measure the weight gain of caterpillars over 5 days • Measure the overall population density on selected plants • Select plants from three habitats • Tested by the GLM: Weight Gain = Population Density

  24. Heterogeneous Data • Regression is significant • Density explains weight

  25. Heterogeneous Data • Is habitat a factor? • Identifying the source of the data shows that individual data points provide information about others

  26. Heterogeneous Data • After grouping, data are no longer predictable • Appropriate analysis

  27. Heterogeneous Data • Incorrect GLM: Weight Gain = Population Density • Conclude that density explains weight gain • Correct GLM: Weight Gain = Habitat + Population Density • Conclude that habitat dictates weight gain • ANOVA will show that population density is not significant

  28. What Happened? • We created an artifactual relationship between two variables because we ignored a third • The data were not randomly drawn from the same population, but from three subsets • Knowing the residual error in one datum predicts the likely error in the others from that group • We violated the assumption of data independence
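
The artifact described here can be reproduced with a small sketch. The data are invented for illustration (they are not the caterpillar data from the slides): three habitats differ in both density and weight gain, but within each habitat density has no effect. The pooled slope corresponds to Weight Gain = Population Density; the within-habitat slope, obtained by centering both variables on their habitat means, is the density coefficient that a model of the form Weight Gain = Habitat + Population Density would estimate. All names are hypothetical.

```python
from collections import defaultdict


def slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def center_within_groups(groups, values):
    """Subtract each group's mean from its members' values."""
    members = defaultdict(list)
    for g, v in zip(groups, values):
        members[g].append(v)
    means = {g: sum(vs) / len(vs) for g, vs in members.items()}
    return [v - means[g] for g, v in zip(groups, values)]


# Hypothetical data: three plants sampled in each of three habitats.
habitat = ["A"] * 3 + ["B"] * 3 + ["C"] * 3
density = [1, 2, 3, 5, 6, 7, 9, 10, 11]
weight_gain = [10, 9, 10, 6, 5, 6, 2, 1, 2]

# Ignoring habitat: density appears to reduce weight gain.
pooled_slope = slope(density, weight_gain)

# Accounting for habitat: the apparent density effect disappears.
within_slope = slope(
    center_within_groups(habitat, density),
    center_within_groups(habitat, weight_gain),
)

print("pooled slope:", pooled_slope)
print("within-habitat slope:", within_slope)
```

In this made-up example the pooled slope is strongly negative while the within-habitat slope is essentially zero, which is the pattern the slides attribute to ignoring habitat.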

  29. Let Residuals Help You • What happened? • By grouping the data, we can no longer predict the residual error of the other members of the group • This satisfies the assumption of independence

  30. Heterogeneous Data • Regression • Grouping

  31. Repeated Measures • If an individual is measured more than once, these measures cannot be treated as independent • Two methods to correct this • Single summary • Multivariate

  32. Repeated Measures • An example • Two diets, 5 animals each, measured on 4 different days • 40 data points

  33. Repeated Measures-Single Summary • Summarize the data with a single value • Each day’s weights grouped

  34. Repeated Measures-Single Summary • Incorrect GLM: Log(weight) = diet + animal + sample • Correct GLM: Log(final weight) = diet, or Log(weight1-weight4) = diet • Can summarize in any way that addresses your question • You will discard data • Inefficient experimental design
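
A minimal sketch of the single-summary approach, using invented weights for two diets, 5 animals each, measured on 4 days. Each animal is reduced to one value (here the log of its final weight, one of the summaries the slide lists), so the comparison between diets rests on 10 independent values rather than 40 non-independent ones. The pooled two-sample t statistic is an assumed choice of test for illustration; all names and numbers are hypothetical.

```python
import math
import statistics

# Hypothetical weights (invented): diet -> animal -> weights on days 1-4.
weights = {
    "diet1": {
        "a1": [20.0, 22.0, 24.0, 26.0],
        "a2": [21.0, 23.0, 25.0, 28.0],
        "a3": [19.0, 21.0, 24.0, 27.0],
        "a4": [20.0, 23.0, 25.0, 29.0],
        "a5": [22.0, 24.0, 26.0, 30.0],
    },
    "diet2": {
        "b1": [20.0, 21.0, 22.0, 23.0],
        "b2": [21.0, 22.0, 23.0, 24.0],
        "b3": [19.0, 20.0, 22.0, 22.0],
        "b4": [20.0, 22.0, 23.0, 25.0],
        "b5": [22.0, 23.0, 24.0, 26.0],
    },
}


def summarize(animal_weights):
    """One value per animal: log of the final (day 4) weight."""
    return math.log(animal_weights[-1])


def pooled_t(a, b):
    """Two-sample t statistic with pooled variance, plus its degrees of freedom."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(
        pooled_var * (1 / na + 1 / nb)
    )
    return t, na + nb - 2


summaries = {
    diet: [summarize(w) for w in animals.values()]
    for diet, animals in weights.items()
}
n_measurements = sum(len(w) for animals in weights.values() for w in animals.values())
t_stat, df = pooled_t(summaries["diet1"], summaries["diet2"])

print("measurements:", n_measurements, "summaries:", sum(map(len, summaries.values())))
print("t =", t_stat, "with", df, "degrees of freedom")
```

The degrees of freedom come from the 10 animals, not the 40 measurements, which is the cost in discarded data that the slide points out.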

  35. Repeated Measures-Multivariate (Figure axes: weight at 60 weeks, weight at 20 weeks) • Use Y variables to distinguish between X variables • Use two variables, for example • Four possibilities • Distinguish on Y1, not Y2 • Distinguish on Y2, not Y1 • Distinguish on both • Distinguish on neither

  36. Repeated Measures-Multivariate (Figure axes: weight at 60 weeks, weight at 20 weeks) • Distinguish on Y1, not Y2 • Can distinguish on weight at 60 weeks, but not at 20 weeks

  37. Repeated Measures-Multivariate (Figure axes: weight at 60 weeks, weight at 20 weeks) • Distinguish on Y2, not Y1 • Can distinguish on weight at 20 weeks, but not at 60 weeks

  38. Repeated Measures-Multivariate (Figure axes: weight at 60 weeks, weight at 20 weeks) • Can identify a relationship between weight at 60 weeks and weight at 20 weeks • Distinguish groups based on a function of Y1, Y2 • NOT independent

  39. Nested Data • Response=Sample((Leaf)Branch) • Hierarchical arrangement of data • These relationships need to be accounted for in the design
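
One simple way to respect a hierarchy like this at the analysis stage is to collapse each level to its mean, so that samples from the same leaf are not counted as independent replicates. The sketch below assumes, purely for illustration, that branches are the independent units; the data and names are hypothetical, and a full nested ANOVA would instead model each level explicitly.

```python
import statistics

# Hypothetical measurements (invented): branch -> leaf -> repeated samples.
measurements = {
    "branch1": {"leaf1": [4.0, 6.0], "leaf2": [5.0, 7.0]},
    "branch2": {"leaf1": [8.0, 10.0], "leaf2": [9.0, 11.0]},
}

# Collapse samples to one mean per leaf.
leaf_means = {
    branch: {leaf: statistics.fmean(samples) for leaf, samples in leaves.items()}
    for branch, leaves in measurements.items()
}

# Collapse leaf means to one value per branch (the assumed independent unit).
branch_means = {
    branch: statistics.fmean(list(means.values()))
    for branch, means in leaf_means.items()
}

n_samples = sum(
    len(samples) for leaves in measurements.values() for samples in leaves.values()
)

print("raw samples:", n_samples)
print("independent units:", len(branch_means))
print(branch_means)
```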

  40. Nested Data • Nesting can be applied to the Habitat Model

  41. Checking Independence • Independence is a key assumption, and the most difficult to satisfy in practice • Be alert to violations • Check data and residuals • Know what can be done at the analysis stage to correct violations • Mistakes at the design stage are often unrecoverable at analysis
