Statistics in MSmcDESPOT Jason Su (borrowed heavily from STATS191 and Prof. Jonathan Taylor)
Comparison of 2 Populations • Null hypothesis (H0): the populations are the same • Alternative hypothesis (HA): the populations are different • t-test is the standard tool here • Assumes the two populations are Gaussian distributed; the test statistic follows a t-distribution because the mean and standard deviation must be estimated from the data • Wilcoxon rank-sum (Mann–Whitney U) test • A non-parametric alternative that does not assume any particular distribution • Compares the populations through ranks (roughly, medians) rather than means • Typically reject when the p-value < 0.05 • Interpretation: assuming the null hypothesis is true, the p-value is the probability of observing data at least as extreme as what we actually observed • Rejecting at 0.05 means we tolerate being wrong 5% of the time when the populations are actually the same
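The two tests above can be sketched with SciPy; the sample values below are synthetic stand-ins, not MSmcDESPOT data.

```python
# Two-sample comparison sketch: parametric t-test vs. non-parametric rank-sum.
# The "normals" and "patients" samples are made up for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normals = rng.normal(loc=0.0, scale=1.0, size=30)   # hypothetical control values
patients = rng.normal(loc=2.0, scale=1.0, size=30)  # hypothetical patient values

# Two-sample t-test: assumes both populations are Gaussian.
t_stat, t_p = stats.ttest_ind(normals, patients)

# Wilcoxon rank-sum (Mann-Whitney U): no distributional assumption.
u_stat, u_p = stats.mannwhitneyu(normals, patients, alternative='two-sided')

# Reject H0 at the 0.05 level if the p-value falls below it.
print(t_p < 0.05, u_p < 0.05)
```

With a real group difference this large, both tests reject; on data where the Gaussian assumption is doubtful, the rank-sum p-value is the safer one to report.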
Simple Linear Regression • y = a*x + b • Least-squares fit of the predictor to the outcome; equivalent to maximum likelihood when the assumptions hold • Assumptions: the design has full column rank and the residuals are independent N(0, σ^2), i.e., normal with constant variance • In MSmcDESPOT the predictor is log(DV) and the outcome is EDSS • EDSS = a*log(DV) + b • R^2 measures how much of the variability in the outcome is explained by the predictor
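A minimal sketch of the EDSS = a*log(DV) + b fit, using synthetic data (the slope, intercept, and noise level below are invented for illustration):

```python
# Simple linear regression sketch with scipy.stats.linregress.
# log_dv and edss are synthetic; real MSmcDESPOT values would replace them.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
log_dv = rng.uniform(-3.0, 0.0, size=40)              # hypothetical log(DV)
edss = 2.0 * log_dv + 6.0 + rng.normal(0, 0.5, 40)    # hypothetical EDSS

fit = stats.linregress(log_dv, edss)
r_squared = fit.rvalue ** 2  # fraction of EDSS variability explained by log(DV)
print(fit.slope, fit.intercept, r_squared)
```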
Diagnostics • What can go wrong? • Wrong regression function • Incorrect model for the errors • Not normal • Not independent • Non-constant variance • Tools • Q-Q plot: plot the quantiles of the residuals vs. those of a normal distribution; the relationship should be linear • Plot residuals vs. the predictor
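Both diagnostics can be computed numerically; `scipy.stats.probplot` returns the Q-Q points and their correlation, which should be near 1 for normal residuals. The data here are synthetic.

```python
# Residual diagnostics sketch: Q-Q check against a normal distribution,
# plus residuals-vs-predictor. x and y are synthetic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 50)
y = 3.0 * x + 1.0 + rng.normal(0, 0.2, 50)

fit = stats.linregress(x, y)
residuals = y - (fit.slope * x + fit.intercept)

# Q-Q data: theoretical normal quantiles vs. ordered residuals.
# qq_r near 1.0 indicates the points lie on a line, i.e. normal residuals.
(osm, osr), (qq_slope, qq_intercept, qq_r) = stats.probplot(residuals, dist='norm')
print(qq_r)

# Residuals vs. predictor: for OLS the sample correlation is zero by
# construction; a visible trend in the plot would mean a wrong model.
corr = np.corrcoef(x, residuals)[0, 1]
print(corr)
```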
Multiple Linear Regression • Y = X*a • a = pinv(X)*Y is the least-squares solution, where pinv(X) = inv(X'*X)*X' • X is now a matrix whose columns are the predictors • The outcome is linear in each predictor after accounting for all the others • The same assumptions from simple linear regression apply • Adding any column to X, even random noise, can only increase R^2 • Adjusted R^2 uses the mean square error instead of the sum of squared errors: this favors simpler models
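A sketch of the multiple-regression fit and of the R^2 vs. adjusted R^2 point: adding a pure-noise column cannot decrease R^2, while adjusted R^2 penalizes the extra parameter. Data and coefficients are synthetic.

```python
# Multiple linear regression sketch: least squares, R^2, and adjusted R^2.
import numpy as np

def fit_r2(X, y):
    """Return (R^2, adjusted R^2) for an OLS fit; X includes the intercept."""
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ a
    sse = np.sum(resid ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    n, p = X.shape
    r2 = 1.0 - sse / sst
    # Adjusted R^2: mean squares in place of sums of squares.
    adj_r2 = 1.0 - (sse / (n - p)) / (sst / (n - 1))
    return r2, adj_r2

rng = np.random.default_rng(3)
n = 60
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.5 * x1 - 2.0 * x2 + rng.normal(0, 1.0, n)

X = np.column_stack([np.ones(n), x1, x2])           # intercept + 2 predictors
X_noise = np.column_stack([X, rng.normal(size=n)])  # add a junk predictor

r2, adj = fit_r2(X, y)
r2_noise, adj_noise = fit_r2(X_noise, y)
print(r2_noise >= r2)  # R^2 never drops when a column is added
```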
Diagnostics • The earlier tools still apply • New tools measure the influence of each observation, useful for finding outliers • DFFITS: how much the fitted value at the i-th observation changes when the i-th row is removed from X • Cook's distance: how much the entire regression function changes when the i-th row is removed • DFBETAS: how much the coefficients change when the i-th row is removed
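The influence measures can be computed directly from the hat matrix H = X (X'X)^-1 X'; the sketch below uses the standard textbook formulas for leverage and Cook's distance, with an internally studentized residual as a simple stand-in in DFFITS. The data, including the planted outlier, are synthetic.

```python
# Influence-diagnostics sketch: leverage, Cook's distance, and a simplified
# DFFITS, computed from the hat matrix. One outlier is planted on purpose.
import numpy as np

rng = np.random.default_rng(4)
n = 30
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(0, 0.3, n)
y[0] += 5.0  # plant an outlier at observation 0

X = np.column_stack([np.ones(n), x])
p = X.shape[1]

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                       # leverages
resid = y - H @ y                    # residuals (H @ y are the fitted values)
s2 = np.sum(resid ** 2) / (n - p)    # residual variance estimate

# Cook's distance: overall change in the fit when observation i is removed.
cooks_d = (resid ** 2 / (p * s2)) * h / (1.0 - h) ** 2

# DFFITS (approximate: internal rather than external studentization).
dffits = (resid / np.sqrt(s2 * (1.0 - h))) * np.sqrt(h / (1.0 - h))

print(np.argmax(cooks_d))  # the planted outlier should stand out
```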
Model Selection • As adjusted R^2 suggests, what we really want is a parsimonious model • One that predicts the outcome well with only a few predictors • This is a combinatorially hard problem • Models are evaluated with a criterion • Adjusted R^2 • Mallows' Cp – an estimate of the model's predictive power • Akaike information criterion (AIC) – related to Cp • Bayesian information criterion (BIC) • Cross-validation with MSE
Search Strategy • If the model is small enough, we can search all subsets • In MSmcDESPOT this is probably feasible; our predictors are: age, PVF, log(DV), gender, PP, SP, RR, CIS • 127 possibilities • Stepwise • A popular search method where the algorithm is given a starting point and then adds or removes predictors one at a time until the criterion stops improving
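An exhaustive search is a short loop over subsets; the sketch below scores each subset by AIC (smaller is better). The predictor names echo the slide, but the data are synthetic, and one reading of the slide's count is that CIS is a reference category, so seven free predictors give 2^7 − 1 = 127 non-empty subsets.

```python
# Exhaustive best-subset search sketch, scored by Gaussian AIC.
# Names follow the slide; the data and "true" model are invented.
import itertools
import numpy as np

def aic_of(X, y):
    """Gaussian AIC for an OLS fit; X already includes the intercept."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ beta) ** 2)
    return n * np.log(sse / n) + 2 * p

rng = np.random.default_rng(5)
n = 80
names = ['age', 'PVF', 'log_DV', 'gender', 'PP', 'SP', 'RR']
Z = rng.normal(size=(n, len(names)))
# Hypothetical truth for this demo: only age, PVF, and log_DV matter.
y = 1.0 * Z[:, 0] + 0.8 * Z[:, 1] - 1.2 * Z[:, 2] + rng.normal(0, 0.5, n)

best = None
for k in range(1, len(names) + 1):                      # all subset sizes
    for subset in itertools.combinations(range(len(names)), k):
        X = np.column_stack([np.ones(n), Z[:, subset]])
        score = aic_of(X, y)
        if best is None or score < best[0]:
            best = (score, subset)

print([names[i] for i in best[1]])
```

Swapping `aic_of` for BIC (replace `2 * p` with `p * np.log(n)`) or adjusted R^2 changes only the scoring line; the search loop stays the same.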
Discussion • All the relevant rank-sum tests (Normals vs. the MS classes, RR vs. SP) remain below the p < 0.01 threshold, as before • The drop in correlation is probably due to N024, who shows an unusually high amount of demyelination (half the level of the lowest CIS patients) and could be an outlier • I'm not certain log() is the correct transform for DV; more diagnostics are needed • How accurate is EDSS?
Stepwise Model Selection • Stepwise model selection keeps age, PVF, and log(DV), shown on the left • On the right is a Q-Q plot of a model with age, PP, and SP • Is there a MATLAB function for exhaustive model search?