Learn key assumptions in data analyses, such as normality and homogeneity of variances, explore tests like ANOVA F-Test, and discover remedies for non-normality and unequal variances in data sets.
General issues in data analysis • Violations of assumptions • Missing data
General Assumptions of ANOVA • The error terms are random and normally distributed: populations (for each condition) are normally distributed • The variances of the different populations are homogeneous (homoscedasticity): populations (for each condition) have equal variances • The variances and means of the different populations are not correlated (independence)
CRD ANOVA F-Test Assumptions • Randomness & Normality • Homogeneity of Variance • Independence of Errors
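As a hedged illustration of the CRD F-test under these assumptions, the minimal sketch below runs a one-way ANOVA with SciPy; the group names and values are made up for illustration, not taken from the slides.

```python
# Minimal sketch of a CRD one-way ANOVA F-test (illustrative data).
from scipy import stats

group_a = [23.1, 25.4, 24.8, 26.0, 24.2]
group_b = [27.5, 28.1, 26.9, 29.3, 27.7]
group_c = [22.0, 23.3, 21.8, 22.9, 23.5]

# H0: all group means are equal; the test assumes normal errors,
# equal variances, and independent observations.
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
```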
Randomized Block F Test Assumptions 1. Normality Populations are normally distributed 2. Homogeneity of Variance Populations have equal variances 3. Independence of Errors Independent random samples are drawn 4. No Interaction Between Blocks & Treatments
Diagnosis: Normality • The points on the normality plot must more or less follow a straight line to support the claim that the data are normally distributed. • There are statistical tests to verify this formally. • The ANOVA method presented here is not very sensitive to the normality assumption; a mild departure from the normal distribution will not change the conclusions much. Normality plot: normal scores vs. residuals
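A minimal sketch of such a normality plot in Python is shown below; the residuals are simulated placeholders standing in for the residuals of a fitted ANOVA model.

```python
# Normal probability plot (normal scores vs. residuals); illustrative data.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
residuals = rng.normal(loc=0.0, scale=1.0, size=40)  # placeholder residuals

# probplot pairs theoretical normal quantiles with the ordered residuals;
# points falling roughly on the line support the normality assumption.
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Normal probability plot of residuals")
plt.show()
```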
Normality Tests • A wide variety of tests can be used to check whether the data follow a normal distribution. • Mardia (1980) provides an extensive list for both the univariate and multivariate cases, categorized into two types: • Tests based on properties of the normal distribution, specifically its first four moments • Shapiro-Wilk's W (compares the ratio of the standard deviation to the variance, multiplied by a constant, to one) • Lilliefors-Kolmogorov-Smirnov test • Graphical methods based on the residual errors (residual plots) • Goodness-of-fit tests • Kolmogorov-Smirnov D • Cramer-von Mises W² • Anderson-Darling A²
Formal Tests of Normality • Kolmogorov-Smirnov test; Anderson-Darling test (both based on the empirical CDF). • Shapiro-Wilk's test; Ryan-Joiner test (both correlation-based tests, applicable for n < 50). • D'Agostino's test (n ≥ 50). All are quite conservative – they fail to reject the null hypothesis of normality more often than they should.
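The sketch below shows how several of the tests named above can be run in Python with SciPy; the residuals are assumed, simulated data, and the Lilliefors correction mentioned in the comment is an assumption about where that variant lives rather than something stated on the slide.

```python
# Hedged sketch of formal normality tests on residuals (illustrative data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
residuals = rng.normal(size=60)  # placeholder residuals

# Shapiro-Wilk (correlation/moment based, typically used for smaller n)
w_stat, p_sw = stats.shapiro(residuals)

# Kolmogorov-Smirnov against a normal fitted to the sample
# (estimating the mean and sd from the data makes this conservative;
#  the Lilliefors variant adjusts for that)
ks_stat, p_ks = stats.kstest(residuals, "norm",
                             args=(residuals.mean(), residuals.std(ddof=1)))

# Anderson-Darling returns a statistic plus critical values, not a p-value
ad_result = stats.anderson(residuals, dist="norm")

print(f"Shapiro-Wilk:     W = {w_stat:.3f}, p = {p_sw:.3f}")
print(f"Kolmogorov-Smirnov: D = {ks_stat:.3f}, p = {p_ks:.3f}")
print(f"Anderson-Darling:  A2 = {ad_result.statistic:.3f}")
```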
The Consequences of Non-Normality • The F-test is very robust against non-normal data, especially in a fixed-effects model • A large sample size will approximate normality by the Central Limit Theorem (recommended sample size > 50) • Simulations have shown that unequal sample sizes between treatment groups magnify any departure from normality • A large deviation from normality leads to hypothesis-test conclusions that are too liberal and to a decrease in power and efficiency
Remedial Measures for Non-Normality • Data transformation • Be aware – transformations may lead to a fundamental change in the relationship between the dependent and independent variables and are not always recommended. • Don't use the standard F-test. • Modified F-tests • Adjust the degrees of freedom • Rank F-test (capitalizes on the F-test's robustness) • Randomization test on the F-ratio • Other non-parametric tests if the distribution is unknown • Construct a test using a likelihood ratio if the distribution is known
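As a hedged sketch of the rank-based idea above, the example below uses the Kruskal-Wallis test, a common rank-based alternative to the one-way F-test; it is a stand-in for the "rank F-test" named on the slide, not necessarily the exact procedure the author had in mind, and the data are illustrative.

```python
# Rank-based alternative to the standard F-test (illustrative data).
from scipy import stats

group_a = [1.2, 3.4, 2.8, 5.6, 2.1]
group_b = [7.9, 6.5, 8.8, 7.2, 9.1]
group_c = [2.5, 3.0, 4.1, 3.7, 2.9]

# Kruskal-Wallis H-test: compares groups using ranks, so it does not
# require normally distributed errors.
h_stat, p_value = stats.kruskal(group_a, group_b, group_c)
print(f"H = {h_stat:.3f}, p = {p_value:.4f}")
```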
Homogeneity of Variances • Eisenhart (1947) describes the problem of unequal variances as follows: • the ANOVA model is based on the ratio of the mean squares of the factors to the residual mean square • The residual mean square is the unbiased estimator of σ², the variance of a single observation • The between-treatment mean square accounts not only for the differences between observations (σ², just like the residual mean square) but also for the variance between treatments • If there is non-constant variance among treatments, the residual mean square can be replaced with some overall variance, σₐ², and a treatment variance, σₜ², which is some weighted version of σₐ² • The "neatness" of ANOVA is lost
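As a hedged sketch of the decomposition being described, the standard expected-mean-squares result for a balanced random-effects (Model II) layout is given below; the notation (t treatments, n observations per treatment, σ² error variance, σₜ² between-treatment variance) is assumed here rather than taken from the slides.

```latex
% Expected mean squares, balanced one-way random-effects (Model II) layout
E[\mathrm{MS}_{\text{error}}] = \sigma^{2}
\qquad
E[\mathrm{MS}_{\text{treatment}}] = \sigma^{2} + n\,\sigma_{t}^{2}
```

Under homogeneity the F-ratio compares two estimators of the same σ²; once treatment-specific variances enter, that clean interpretation (the "neatness") is lost.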
Homogeneity of Variances • The overall F-test is very robust against heterogeneity of variances, especially with fixed effects and equal sample sizes. • Tests for treatment differences such as t-tests and contrasts are severely affected, resulting in inferences that may be too liberal or too conservative • Unequal variances can have a marked effect on the level of the test, especially if smaller sample sizes are associated with groups that have larger variances • Unequal variances will lead to biased conclusions
Tests for Homogeneity of Variances • Bartlett's Test • Levene's Test: computes a one-way ANOVA on the absolute value (or sometimes the square) of the residuals, |yij – ŷi|, with t – 1 and N – t degrees of freedom; considered robust to departures from normality, but somewhat conservative • Brown-Forsythe Test: a slight modification of Levene's test in which the median is substituted for the mean (Kuehl (2000) refers to it as the Levene (med) test) • The Fmax Test (Hartley Test): takes the ratio of the largest treatment-group variance to the smallest and compares it to a critical value table
Levene's Test More work but a powerful result. Compute zij = |yij – ỹi|, where ỹi is the sample median of the i-th group. Let df1 = t – 1 and df2 = nT – t. Reject H0 of equal variances if the one-way ANOVA F statistic computed on the zij exceeds F(α; df1, df2). Essentially an ANOVA on the zij.
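The sketch below runs the tests discussed above in Python with SciPy; the groups are illustrative, and `center="median"` gives the median-based (Brown-Forsythe / Levene (med)) variant described on the slides.

```python
# Hedged sketch of homogeneity-of-variance tests (illustrative data).
from scipy import stats

group_a = [23.1, 25.4, 24.8, 26.0, 24.2]
group_b = [27.5, 28.1, 26.9, 29.3, 30.7]
group_c = [22.0, 23.3, 21.8, 22.9, 23.5]

# Bartlett's test (sensitive to non-normality)
b_stat, p_bartlett = stats.bartlett(group_a, group_b, group_c)

# Levene's test about the mean; center="median" is the Brown-Forsythe form
l_stat, p_levene = stats.levene(group_a, group_b, group_c, center="mean")
bf_stat, p_bf = stats.levene(group_a, group_b, group_c, center="median")

print(f"Bartlett:                 p = {p_bartlett:.3f}")
print(f"Levene (mean):            p = {p_levene:.3f}")
print(f"Brown-Forsythe (median):  p = {p_bf:.3f}")
```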
Independence Lack of independence is a special case and the most common cause of heterogeneity of variance • Independent observations • No correlation between error terms • No correlation between the independent variables and the errors • Positively correlated data inflate the standard error • The treatment means are then estimated more accurately than the standard error suggests.
Independence Tests • If there is some understanding of how the data were collected, a check for autocorrelation can be performed. • The Durbin-Watson statistic looks at the correlation of each value with the value before it • The data must be sorted in the correct order for meaningful results • For example, if the results are suspected to depend on time, samples would be ordered by collection time
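A minimal sketch of this check in Python is shown below; the residuals are simulated placeholders assumed to already be sorted in collection order.

```python
# Durbin-Watson check on residuals ordered by collection time (illustrative).
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(2)
residuals_in_time_order = rng.normal(size=50)  # placeholder, time-ordered

# A statistic near 2 suggests no first-order autocorrelation;
# values toward 0 indicate positive, toward 4 negative autocorrelation.
dw = durbin_watson(residuals_in_time_order)
print(f"Durbin-Watson = {dw:.3f}")
```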
Independence • A positive correlation between means and variances is often encountered when there is a wide range of sample means • Data that often show a relation between variances and means are counts and data consisting of proportions or percentages • Transforming the data can frequently solve these problems
Remedial Measures for Dependent Data • The first defense against dependent data is proper study design and randomization • Designs can be implemented that take correlation into account, e.g., a crossover design • Look for environmental factors unaccounted for • Add covariates to the model if they are causing the correlation, e.g., quantified learning curves • If no underlying factors can be found to which the autocorrelation can be attributed: • Use a different model, e.g., a random-effects model • Transform the independent variables using the correlation coefficient
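As a hedged sketch of the random-effects option above, the example below fits a mixed model in Python with statsmodels; the column names (`y`, `treatment`, `subject`) and values are assumed for illustration, with `subject` standing in for whatever grouping induces the correlation.

```python
# Random-effects (mixed) model for correlated observations (illustrative data).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "y":         [5.1, 5.9, 6.3, 7.0, 4.8, 5.5, 6.8, 7.4,
                  5.3, 6.1, 6.6, 7.2, 5.0, 5.7, 6.9, 7.6],
    "treatment": ["a", "a", "b", "b"] * 4,
    "subject":   [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4],
})

# Fixed effect for treatment, random intercept per subject
model = smf.mixedlm("y ~ treatment", data=df, groups=df["subject"])
result = model.fit()
print(result.summary())
```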
Missing data Observations that were intended to be made but were not. Reasons for missing data: • An animal may die • An experimental plot may be flooded out • A worker may be ill and not turn up on the job • A jar of jelly may be dropped on the floor • The recorded data may be lost Since most experiments are designed with at least some degree of balance/symmetry, any missing observations will destroy that balance
Missing data • In the presence of missing data, the research goal remains making inferences that apply to the population targeted by the complete sample – i.e., the goal remains what it would have been had we seen the complete data. • However, both making inferences and performing the analysis are now more complex. • Assumptions must be made in order to draw inferences, and an appropriate computational approach must then be used for the analysis • Consider the causes and pattern of the missing data when making appropriate changes to the planned analysis
Missing data • Avoid adopting computationally simple solutions (such as analyzing only the complete data or carrying the last observation forward in a longitudinal study), which generally lead to misleading inferences. • In a one-factor experiment the analysis can proceed with a well-estimated value, but a factorial experiment theoretically cannot be analyzed this way • In a one-factor CRD with missing data, the data can be analyzed with unequal replication numbers • In a one-factor RCBD, if one or two complete blocks or treatments are missing but at least two complete blocks remain, the analysis can simply proceed
Missing data • In a one-factor RCBD/LS experiment, if there are one or two missing observations in a block or treatment, the data can be handled by: a. the appropriate method for unequal frequencies b. estimating the unknown value from the observed data • The estimate of the missing observation is most frequently the value that minimizes the experimental error sum of squares when the regular analysis is performed
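For a single missing cell in an RCBD, a commonly cited closed form (often attributed to Yates) for the value that minimizes the error sum of squares is sketched below; the notation is assumed here rather than taken from the slides.

```latex
% Estimate of a single missing value in an RCBD that minimizes the error
% sum of squares (notation assumed: t = treatments, b = blocks,
% T = observed total of the treatment containing the missing value,
% B = observed total of its block, G = observed grand total)
\hat{M} = \frac{tT + bB - G}{(t-1)(b-1)}
```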
Imputation • The error df should be reduced by one, since M was estimated • SAS can compute the F statistic, but the p-value will have to be computed separately • The method is efficient only when a couple of cells are missing • The usual Type III analysis is available, but be careful with its interpretation • Little and Rubin use MLE and simulation-based approaches • PROC MI in SAS v9 implements the Little and Rubin approaches
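As a hedged Python analogue to the model-based approaches mentioned above, the sketch below uses scikit-learn's iterative imputer; it is not SAS PROC MI itself, and the data matrix is purely illustrative.

```python
# Model-based imputation sketch, a rough analogue to MLE/simulation approaches
# (not PROC MI); illustrative data with missing cells.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([
    [7.1, 2.3, 5.0],
    [6.8, np.nan, 4.7],
    [7.4, 2.9, np.nan],
    [6.5, 2.1, 4.4],
])

# Iteratively regresses each column with missing values on the other columns
imputer = IterativeImputer(random_state=0)
X_completed = imputer.fit_transform(X)
print(X_completed)
```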