Today's Topics

Today's Topics • Outliers and Cheating • Departures from Normality • Assumptions of General Linear Models • Checking Independence

Outliers and Cheating: Why you shouldn't worry about outliers

Outliers and Cheating • An outlier is a data point with one or more values so extreme that they may not belong in the sample • You must have a legitimate reason for eliminating a data point, such as instrument error or miscoding of data • Generally, you will find that the value is not so unusual once you have performed sufficient replicates

Outliers and Cheating • Often, data from trials seem to be outliers • Repeated experiments show that they are not

Departures from Normality: Testing for Normality

Departures from Normality • The z- or t-score relies on sampling a population that conforms to the normal frequency distribution • If your sample data depart strongly from normality, the results of your analysis may be unreliable • How do you know whether your sample conforms, and what do you do if it doesn't?

Departures from Normality • Mathematical methods for testing normality • Goodness-of-fit tests • Testing for symmetry and kurtosis: complex, requires the use of statistical tables • Kolmogorov-Smirnov test: low power, not recommended • W and KSL tests: Shapiro-Wilk (n < 2000), Kolmogorov-Smirnov-Lilliefors (n ≥ 2000); complex calculation, but can be done with software; excellent
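As a rough illustration of the "symmetry and kurtosis" check, the sketch below computes moment-based skewness and excess kurtosis using plain Python. The function name and the two small datasets are made up for illustration. Both values are near zero for normal data. Turning them into a formal test still requires tables or software, which is not shown.

```python
def skewness_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) computed from central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # a normal distribution gives 0
    return skew, excess_kurtosis

# Hypothetical data for illustration only
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
right_skewed = [1, 1, 1, 2, 2, 3, 4, 8, 20]

skew_sym, kurt_sym = skewness_and_excess_kurtosis(symmetric)
skew_right, kurt_right = skewness_and_excess_kurtosis(right_skewed)
print(skew_sym, skew_right)
```

A skewness well above zero, as in the second dataset, points to a long right tail, which is the pattern a log or square-root transformation is usually meant to reduce.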

Departures from Normality • Normality testing assesses the likelihood that data are derived from a normal population • Compares to a normal distribution with the same moments
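The comparison against "a normal distribution with the same moments" can be sketched as the largest gap between the sample's empirical cumulative distribution and a normal curve fitted with the sample's own mean and standard deviation. This is the distance used by Kolmogorov-Smirnov-type tests with estimated parameters. The function name and data are hypothetical, and the sketch stops at the distance: converting it to a p-value needs tables or statistical software.

```python
from statistics import NormalDist, mean, stdev

def ks_distance_to_fitted_normal(values):
    """Largest gap between the empirical CDF and a normal CDF
    that has the sample's own mean and standard deviation."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # The empirical CDF steps from i/n to (i+1)/n at x; check both sides.
        d = max(d, abs(cdf - i / n), abs((i + 1) / n - cdf))
    return d

# Hypothetical data for illustration only
roughly_normal = [4.1, 4.6, 4.8, 5.0, 5.1, 5.3, 5.5, 6.0]
right_skewed = [1, 1, 1, 2, 2, 3, 4, 8, 20]

d_normal = ks_distance_to_fitted_normal(roughly_normal)
d_skewed = ks_distance_to_fitted_normal(right_skewed)
print(round(d_normal, 3), round(d_skewed, 3))
```

The skewed sample sits much further from its fitted normal curve than the roughly symmetric one does.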

Departures from Normality • Transformation can correct departures • Log: use when variance is proportional to y² • Arcsine: use when analyzing proportions (binomial populations) • Square root: use when means are proportional to variances (Poisson distributions)
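The three transformations can be written as one-line functions. This is a minimal sketch with hypothetical function names. It assumes the arcsine transformation takes its usual form, the arcsine of the square root of the proportion, which the slide does not spell out.

```python
import math

def log_transform(y):
    """Log transform; requires y > 0."""
    return math.log(y)

def arcsine_transform(p):
    """Arcsine square-root transform of a proportion 0 <= p <= 1 (radians)."""
    return math.asin(math.sqrt(p))

def sqrt_transform(count):
    """Square-root transform for counts (count >= 0)."""
    return math.sqrt(count)

# Hypothetical proportions for illustration only
proportions = [0.0, 0.25, 0.5, 1.0]
transformed = [arcsine_transform(p) for p in proportions]
print(transformed)
```

Data containing zeros cannot be log-transformed directly. A common workaround is to add a small constant first, but that choice is outside what the slides cover.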

Transformations-Fertilizer • Log transformation corrects non-normality of the Fertilizer dataset

Transformations-Fertilizer • Transformations will generally improve R-square and F scores, though not necessarily post-hoc tests

Transformations • Log transformation • Use when variance is proportional to y²
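A small worked example may help show why the log transformation suits data whose variance grows with y². The two groups below are invented: the second is the first multiplied by ten, so its standard deviation is ten times larger (variance 100 times larger). After taking logs, the two groups have the same spread.

```python
import math
from statistics import stdev

# Hypothetical groups: same shape, but the second is ten times larger
low_group = [8.0, 10.0, 12.0]
high_group = [80.0, 100.0, 120.0]

# Standard deviations on the raw scale differ by a factor of ten
raw_sd = (stdev(low_group), stdev(high_group))

# Standard deviations on the log scale are equal
log_sd = (stdev([math.log(y) for y in low_group]),
          stdev([math.log(y) for y in high_group]))

print(raw_sd)
print(log_sd)
```

Multiplying by a constant becomes adding a constant on the log scale, and adding a constant does not change the spread.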

Transformation • Square Root • Use when means are proportional to variances (Poisson Distributions)

Assessing Normality • Do's • Check your data before analyzing • Transform your data if necessary • Analyze both untransformed and transformed data to judge the effects • Don'ts • Don't worry if a transformation doesn't fix the distribution • Don't discard an analysis that came from non-normal data • The Central Limit Theorem means that ANOVA and two-sample tests are tolerant of small deviations

Assumptions of General Linear Models: Checking Independence (Grafen and Hails, Chapter 8)

BIOM 285 Assumptions of GLMs • General Linear Models rely on four basic assumptions of the data to be analyzed • Independence • Homogeneity of Variance • Normality of Error • Linearity/Additivity

Assumptions of GLMs • The goal of testing is to describe the likelihood that our null hypothesis is false • If the assumptions underlying our arrival at that likelihood are violated, then our conclusion (p-value) is at risk of being false • When this happens, we cannot tell the degree to which this affects the p-value

Assumptions of GLMs • Therefore, it is important to check the validity of the underlying assumptions when you analyze your data • You should make any appropriate corrections to violations, or note those that cannot be corrected • This is considered Good Laboratory Practice

Checking Independence (Grafen and Hails, Chapter 8)

Checking Independence • Heterogeneous Data • Repeated Measures • Nested Data

Checking Independence • Independence of Data • Datapoints are independent if knowing the error of one or a subset provides no knowledge of the error of any others

Why Check Independence? • If data are not independent, then you are not sampling the true population • This violates a fundamental assumption of hypothesis testing and ANOVA • Sources of data non-independence • Heterogeneous data • Data from repeated measures • Nested data

Heterogeneous Data • Data are not derived from the same population • An example • Hypothesize that caterpillar weight gain is limited by competition • Measure weight gain of caterpillars over 5 days • Measure the overall population density on selected plants • Select plants from three habitats • Tested by the GLM: • Weight Gain = Population Density

Heterogeneous Data • Regression is significant • Density explains weight

Heterogeneous Data • Is habitat a factor? • Identifying the source of the data shows that individual data provide information on other data

Heterogeneous Data • After grouping, data are no longer predictable • Appropriate analysis

Heterogeneous Data • Incorrect GLM: • Weight Gain = Population Density • Conclude density explains weight gain • Correct GLM: • Weight Gain = Habitat + Population Density • Conclude that habitat dictates weight gain • ANOVA will show that population density is not significant
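The contrast between the two models can be sketched with a small invented dataset (not the caterpillar data from the lecture). Ignoring habitat gives a clearly negative density slope. To mimic Weight Gain = Habitat + Population Density without a statistics package, the sketch centers density and weight gain on their habitat means. The slope of the centered data equals the density coefficient in the model that includes habitat.

```python
def slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical data: habitat -> (population densities, weight gains)
habitats = {
    "A": ([1, 2, 3], [10, 11, 10]),
    "B": ([4, 5, 6], [7, 8, 7]),
    "C": ([7, 8, 9], [4, 5, 4]),
}

# Weight Gain = Population Density (habitat ignored)
all_x = [x for xs, _ in habitats.values() for x in xs]
all_y = [y for _, ys in habitats.values() for y in ys]
pooled_slope = slope(all_x, all_y)

# Weight Gain = Habitat + Population Density:
# centering on habitat means removes the habitat effect, and the slope
# of the centered data is the density coefficient in that model.
centered_x, centered_y = [], []
for xs, ys in habitats.values():
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    centered_x += [x - x_bar for x in xs]
    centered_y += [y - y_bar for y in ys]
within_slope = slope(centered_x, centered_y)

print(pooled_slope, within_slope)
```

In this constructed example the apparent effect of density comes entirely from differences between habitats. Within any one habitat, density tells you nothing about weight gain.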

What Happened? • We created an artifactual relationship between two variables because we ignored a third • The data were not randomly drawn from the same population, but from three subsets • Knowing the residual error in one datum predicts the likely error in the others from that group • We violated the assumption of data independence

Let Residuals Help You • What Happened? • By grouping the data, we can no longer predict the residual error of the other members of the group • Satisfying the assumption of independence
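One way to see this is to list the residuals by habitat before and after habitat enters the model. The data below are invented for illustration. With the density-only model, every residual in a habitat has the same sign, so one residual predicts the others. Once habitat is in the model, the residuals within each habitat are mixed in sign.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

# Hypothetical data: habitat -> (population densities, weight gains)
habitats = {
    "A": ([1, 2, 3], [10, 11, 10]),
    "B": ([4, 5, 6], [5, 6, 5]),
    "C": ([7, 8, 9], [7, 8, 7]),
}

all_x = [x for xs, _ in habitats.values() for x in xs]
all_y = [y for _, ys in habitats.values() for y in ys]
a, b = fit_line(all_x, all_y)

# Residuals from Weight Gain = Population Density, listed by habitat
ungrouped = {h: [y - (a + b * x) for x, y in zip(xs, ys)]
             for h, (xs, ys) in habitats.items()}

# Residuals once habitat is in the model. The within-habitat density slope
# is zero for these data, so each fitted value is just the habitat mean.
grouped = {h: [y - sum(ys) / len(ys) for y in ys]
           for h, (_, ys) in habitats.items()}

for h in habitats:
    print(h,
          [round(r, 2) for r in ungrouped[h]],
          [round(r, 2) for r in grouped[h]])
```

Plotting or tabulating residuals against any grouping variable you recorded is a cheap check: clusters of same-signed residuals suggest the grouping belongs in the model.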

Heterogeneous Data (figure panels: Regression, Grouping)

Repeated Measures • If an individual is measured more than once, these measures cannot be treated as independent • Two methods to correct this • Single summary • Multivariate

Repeated Measures • An example • Two diets, 5 animals each, measured on 4 different days • 40 datapoints

Repeated Measures-Single Summary • Summarize the data with a single value • Each day’s weights grouped

Repeated Measures-Single Summary • Incorrect GLM • Log(weight) = diet + animal + sample • Correct GLM • Log(final weight) = diet • Log(weight1-weight4) = diet • Can summarize in any way that addresses your question • You will discard data • Inefficient experimental design
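The single-summary approach can be sketched as follows, using invented weights for two diets, five animals each, measured on four days. The summary chosen here, log of last weight minus log of first weight, is only one possibility and is an assumption of this sketch. The final weight alone would serve equally well if it answers the question. The point is that 40 measurements collapse to 10 independent values, one per animal.

```python
import math
from statistics import mean

# Hypothetical weights: diet -> animal -> weights on the 4 measurement days
weights = {
    "diet_1": {
        "a1": [20.0, 22.0, 24.0, 26.0],
        "a2": [21.0, 22.5, 24.5, 27.0],
        "a3": [19.5, 21.0, 23.0, 25.0],
        "a4": [20.5, 22.0, 23.5, 25.5],
        "a5": [22.0, 23.5, 25.0, 27.5],
    },
    "diet_2": {
        "b1": [20.0, 23.0, 27.0, 31.0],
        "b2": [21.0, 24.0, 27.5, 32.0],
        "b3": [19.5, 22.5, 26.0, 30.0],
        "b4": [20.5, 23.0, 26.5, 30.5],
        "b5": [22.0, 25.0, 28.5, 33.0],
    },
}

def log_gain(series):
    """One summary value per animal: log(last weight) - log(first weight)."""
    return math.log(series[-1]) - math.log(series[0])

# One value per animal, grouped by diet; these are the data for the GLM
summaries = {diet: [log_gain(w) for w in animals.values()]
             for diet, animals in weights.items()}

n_raw = sum(len(w) for animals in weights.values() for w in animals.values())
n_independent = sum(len(v) for v in summaries.values())

print(n_raw, n_independent)
print({diet: round(mean(v), 3) for diet, v in summaries.items()})
```

The diet comparison is then run on the 10 summary values, which is where the loss of data noted on the slide comes from.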

Repeated Measures-Multivariate (plot axes: weight at 60 weeks, weight at 20 weeks) • Use Y variables to distinguish between X variables • Use two variables for example • Four possibilities • Distinguish on Y1, not Y2 • Distinguish on Y2, not Y1 • Distinguish on both • Distinguish on neither

Repeated Measures-Multivariate • Distinguish on Y1, not Y2 • Can distinguish on weight at 60 weeks, but not at 20 weeks

Repeated Measures-Multivariate • Distinguish on Y2, not Y1 • Can distinguish on weight at 20 weeks, but not at 60 weeks

Repeated Measures-Multivariate • Can identify a relationship between weight at 60 weeks and weight at 20 weeks • Distinguish groups based on a function of Y1, Y2 • Y1 and Y2 are NOT independent

Nested Data • Response = Sample(Leaf(Branch)) • Hierarchical arrangement of data • These relationships need to be accounted for in the design
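A simple way to respect the hierarchy, sketched below with invented numbers, is to average upward: samples are averaged within each leaf, then leaf means within each branch. This is a swapped-in simplification and not a fitted nested GLM. It assumes that branches are the independent units, which depends on how the study was designed.

```python
from statistics import mean

# Hypothetical measurements: branch -> leaf -> samples taken from that leaf
tree = {
    "branch_1": {"leaf_1": [4.0, 4.2], "leaf_2": [3.8, 4.0]},
    "branch_2": {"leaf_1": [5.1, 4.9], "leaf_2": [5.3, 5.1]},
    "branch_3": {"leaf_1": [4.4, 4.6], "leaf_2": [4.5, 4.5]},
}

# Step 1: one mean per leaf
leaf_means = {branch: {leaf: mean(samples) for leaf, samples in leaves.items()}
              for branch, leaves in tree.items()}

# Step 2: one mean per branch, computed from the leaf means
branch_means = {branch: mean(list(means.values()))
                for branch, means in leaf_means.items()}

n_samples = sum(len(s) for leaves in tree.values() for s in leaves.values())
n_branches = len(branch_means)

print(n_samples, n_branches)
print(branch_means)
```

Twelve samples reduce to three branch-level values. Treating all twelve as independent would overstate the amount of information in the data.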

Nested Data • Nesting can be applied to the Habitat Model

Checking Independence • Independence is a key assumption, and the most difficult to satisfy in practice • Be alert to violations • Check data and residuals • Know what can be done at the analysis stage to correct violations • Mistakes at the design stage are often unrecoverable at analysis
