1 / 46

ANOVA

SM339 ANOVA - Spring 2007. 2. ANOVA. Think of an extension to 2 sample t problemNeed an estimate of SDIn 2 sample t, we always assumed the two distns had the same SDHad to combine our 2 estimates (one from each sample) to get a single est. SM339 ANOVA - Spring 2007. 3. ANOVA. Review 2 sampl

cyma
Download Presentation

ANOVA

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


    1. SM339 ANOVA - Spring 2007 1 ANOVA Analysis of Variance Method to compare more than 2 means As with comparing more than 2 p’s, we can’t use subtraction Will use sum of squares

    2. SM339 ANOVA - Spring 2007 2 ANOVA Think of an extension to 2 sample t problem Need an estimate of SD In 2 sample t, we always assumed the two distns had the same SD Had to combine our 2 estimates (one from each sample) to get a single est

    3. SM339 ANOVA - Spring 2007 3 ANOVA Review 2 sample t test Given N, Avg, SD for 2 groups SP=pool(s1,n1,s2,n2) PV=tprob(0,SP*?(1/n1+1/n2),av1-av2,999) (double for two sided)

    4. SM339 ANOVA - Spring 2007 4 ANOVA Pool(s1,n1,s2,n2) =?[ (n1-1)s12 + (n2-1) s22 ]/ (n1-1 + n2-1)] Concentrate on the numerator and ignore sqrt ? (n-1)*s2 = SSE, Sum of Squares Error (Error not in sense of mistake)

    5. SM339 ANOVA - Spring 2007 5 ANOVA If we ignored the fact that our data came from (potentially) different groups, we might have started with ? (x-avg)2 =SSTotal Note that ?SSTotal/(N-1) would be our estimator of SD using all the data (and ignoring grouping)

    6. SM339 ANOVA - Spring 2007 6 ANOVA The Sum of Squares Identity SSTotal = SSE + SSTreatment Where SSTr = ? nj * (avgj-AVG)2 Except for nj this is like the sum of squares of the averages

    7. SM339 ANOVA - Spring 2007 7 ANOVA See Ex 62, p. 488 3 levels of blockage Measure flow rate where collapse occurs In book as fig11.4 Not a good form for Excel See ANOVA1.xls First col tells which blockage level Second col is the flowrate

    8. SM339 ANOVA - Spring 2007 8 ANOVA To extract the values that correspond to flow rate=1, use Dt(:,1)==1 “:” means all rows Equality is == Single = is assignment Use above expression to select rows Dt( Dt(:,1)==1, : )

    9. SM339 ANOVA - Spring 2007 9 ANOVA To count how many, use length() Length(Dt( Dt(:,1)==1, : )) Average is mean() SD is std() Most convenient to store these in lists

    10. SM339 ANOVA - Spring 2007 10 ANOVA For i=1:3, Ns(i) = Length(Dt( Dt(:,1)==i, : )); Avs(i) = … Sds(i) = … End

    11. SM339 ANOVA - Spring 2007 11 ANOVA These are rows We can display conveniently by making them cols of a matrix [ns’ avs’ sds’] 11.0000 11.2091 1.8987 14.0000 15.0857 2.1504 10.0000 17.3300 2.1680

    12. SM339 ANOVA - Spring 2007 12 ANOVA Alternate plan is to compute Ns, Avs, SDs as a matrix in the first place Write a function SUMSTATS(X, grp) which returns such a matrix The number of rows of the answer will be the number of groups Assume “grp” contains 1,…#groups

    13. SM339 ANOVA - Spring 2007 13 ANOVA Notice that SDs are quite similar in our example Good if we are going to assume the underlying SDs are all equal Easy to get SSE from here >> sse=sum((ns-1).*sds.^2) 138.4672 (Close to p. 497 in book)

    14. SM339 ANOVA - Spring 2007 14 ANOVA Exercises: 1. Fig 11.5, p. 489, has data on roadway aggregate. Do sumstats. 2. Fig 11.6, p. 491, has data on carpet fibers. Do sumstats.

    15. SM339 ANOVA - Spring 2007 15 ANOVA To get SSTr, we need overall average Generally NOT the avg of the avgs Have to weight by the N’s >> grand=sum(ns.*avs)/sum(ns) 14.5086 >> sstr=sum(ns.*(avs-grand).^2) 204.0202

    16. SM339 ANOVA - Spring 2007 16 ANOVA Don’t actually need SST, but it would be SSTr+SSE = 342.4874 >> (length(d114)-1)*std(dt(:,2)).^2 342.4874

    17. SM339 ANOVA - Spring 2007 17 ANOVA H0: all means equal Ha: not all means equal (Not the same as “all means different”) It can be shown that when H0 is true, then each of the SS’s is a (multiple of) ?2

    18. SM339 ANOVA - Spring 2007 18 ANOVA Degrees of Freedom Usual est of SD has df=N-1 So, if H0 true, Total df = N-1 Each term of SSE has df= ni-1 Fact: Sums of (ind) Chi^2 has df given by sum of df So df for SSE = S (ni-1) Df for SSTr is difference = # groups-1 = G-1

    19. SM339 ANOVA - Spring 2007 19 ANOVA

    20. SM339 ANOVA - Spring 2007 20 ANOVA Recall that the mean of a ?2 = df For comparison, we should divide SS/df Called Mean Square MSTr, MSE (don’t usually do MST)

    21. SM339 ANOVA - Spring 2007 21 ANOVA

    22. SM339 ANOVA - Spring 2007 22 ANOVA

    23. SM339 ANOVA - Spring 2007 23 ANOVA If H0 is true, then MSTr = MSE Consider their ratio (to get rid of the common multiple) F = MSTr / MSE Near 1 -> H0 F large -> avgs are more spread out than expected -> Ha

    24. SM339 ANOVA - Spring 2007 24 ANOVA F distn Depends on df for numerator and df for denominator FPROB(df1, df2, lo, hi) Can calculate p-value

    25. SM339 ANOVA - Spring 2007 25 ANOVA For our problem, F=102/4.33 = 23.57 >> fprob(2,32,23.57,999) = 5.1058e-007 So if H0 were true, it would be nearly impossible to get F this large. Can be quite sure that H0 is not true. Means are not all exactly the same.

    26. SM339 ANOVA - Spring 2007 26 ANOVA Write a function that takes SUMSTATS and returns ANOVA Since ANOVA table is not a full matrix, suggest we return a matrix for SS, df, MS and then return F and PV as scalars [F,pv,SS] = ANOVA(sumstats)

    27. SM339 ANOVA - Spring 2007 27 ANOVA Exercises: 1. Fig 11.5, p. 489, has data on roadway aggregate. Do ANOVA. 2. Fig 11.6, p. 491, has data on carpet fibers. Do ANOVA.

    28. SM339 ANOVA - Spring 2007 28 ANOVA If we only have 2 groups, should we do ANOVA or 2 sample t test? Consider the following: N1=8, Av1=12.3, SD1=5.3 N2=12, Av2=16.4, SD2=6.6

    29. SM339 ANOVA - Spring 2007 29 ANOVA SP=6.1273 For the two sided Ha >>pv=tprob(0,sp*sqrt(1/N1+1/N2),N1-1+N2-1,Av2-Av1,999)*2 0.1599

    30. SM339 ANOVA - Spring 2007 30 ANOVA For ANOVA sm = 8.0000 12.3000 5.3000 12.0000 16.4000 6.6000 >> [f,pv,ssa]=anova1(sm) f = 2.1492 pv = 0.1599 So the (two-sided) pvalue is exactly the same by either method

    31. SM339 ANOVA - Spring 2007 31 ANOVA But there’s more!! ssa = 80.6880 1.0000 80.6880 675.7900 18.0000 37.5439 756.4780 19.0000 39.8146 and SP^2= 37.5439 So vMSE = RMSE is the analog of SP

    32. SM339 ANOVA - Spring 2007 32 ANOVA If we reject H0, then some means are different. Which ones? Problem of multiple comparisons If H0 were true (all means equal) and we did 20 tests with alpha=0.05, then we should expect that 1 of the comparisons would appear to be different If we are going to do a number of 2 sample t tests, we will have to use a lower ? than usual

    33. SM339 ANOVA - Spring 2007 33 ANOVA The book uses the Studentized range method There are a number of different methods for addressing this problem We will use the Bonferroni method

    34. SM339 ANOVA - Spring 2007 34 ANOVA The probability of a union of sets could be as high as the sum of the probabilities (if the sets are mutually disjoint) In our cases, the sets are “making a Type I error” for each test we are doing If we are doing N tests and each test has a Prob(Type I) = 0.05/N, then the overall Prob(Type I) = 0.05

    35. SM339 ANOVA - Spring 2007 35 ANOVA Therefore, if we want to end up at an overall level of 5%, say, then we need to conduct each test at a level of 0.05/# tests For G groups, there are G*(G-1)/2 pairs For flowrate example, G=3 and there are 3*2/2 = 3 pairs to compare Do 2 sample t tests with ?/3 = 0.05/3

    36. SM339 ANOVA - Spring 2007 36 ANOVA If we use p-values, then we should multiply the (apparent) p-value by the # of comparisons If we do confidence intervals, we should get wider intervals For 95% intervals, we usually solve for 0.025 Now we solve for 0.025 / (# groups)

    37. SM339 ANOVA - Spring 2007 37 ANOVA Recall that in the 2 sample t test, we use SD*?(1/n1+1/n2) for the diff of avgs What do we use for the pooled SD? This was the basis for SSE (or MSE) For SD, use ?MSE For df, use df for MSE

    38. SM339 ANOVA - Spring 2007 38 ANOVA The p-value for comparing 1-2 is 0.0082 For 1-3, PV=0.0003 For 2-3, PV=0.0872 Using Bonferoni, we cannot conclude that 2 and 3 are different, but 1 IS diff from both 2 and 3. Note that on p. 504, he gets a slightly diff result using conf intervals and Studentized range The Studentized range is a bit more efficient than Bonferroni, but note that his last CI comes quite close to 0

    39. SM339 ANOVA - Spring 2007 39 Rank procedures The underlying assumption of ANOVA is that the data has a normal distn When we are not comfortable with that assumption there are other methods available Nonparametric methods

    40. SM339 ANOVA - Spring 2007 40 Rank procedures A large class of NP methods are based on ranks Replace the data with ranks and then do calculations We should feel comfortable in knowing which values are larger than which other values, even if we don’t know the probability of such a difference

    41. SM339 ANOVA - Spring 2007 41 Rank procedures Function y=ranks(x) “Midranks” to handle ties in the x’s For comparing multiple groups, we use Kruskal-Wallis Based on the sum of ranks within each group

    42. SM339 ANOVA - Spring 2007 42 Rank procedures Text has a formula on p. 713, but not very meaningful Better formula is based on S(obs-exp)2/exp If we have N total obs and n1 obs in group 1, then the avg rank of all the data is (N+1)/2 We would expect the rank sum for group 1 to be n1*(N+1)/2, etc Can then compute the diff between obs and exp TRICK: There is a fudge factor. Have to multiply by 6/N to get ?2

    43. SM339 ANOVA - Spring 2007 43 Rank procedures For artery flow: r=ranks(dt(:,2)) rs(1)=sum(r(dt(:,1)==1)); etc ns(1)=sum(dt(:,1)==1); etc expt=ns*(n+1)/2 >> c2=6/n*sum((rs-expt).^2./expt) 20.4380 >> pv=chiprob(2,c2,99) 3.6471e-005 NOTE df=# groups-1 = df for Treatment

    44. SM339 ANOVA - Spring 2007 44 Rank procedures Can use pooling as we did for Chi^2 to determine which differences are significant But the pooling can be done on the rank sums (and expected’s) As before, df do not change

    45. SM339 ANOVA - Spring 2007 45 Rank procedures Carpet fiber example Assume we’ve run sumstats >> ns=smst(:,1)' ns = 16 16 13 16 14 15 >> r=ranksums(x(:,2),x(:,1)) r = 495 1177.5 353 854.5 288 927 >> n=sum(ns) n = 90

    46. SM339 ANOVA - Spring 2007 46 Rank procedures >> expt=ns*(n+1)/2 expt = 728 728 591.5 728 637 682.5 >> c2=6/n*sum((r-expt).^2 ./ max(1,expt)) c2 = 49.937 Chi^2 w/ df=5, so extremely large

    47. SM339 ANOVA - Spring 2007 47 Rank procedures >> (r-expt)./expt ans = -0.32005 0.61745 -0.40321 0.17376 -0.54788 0.35824 Note that 1 and 3 look similar >> r=cpool(r,1,3),expt=cpool(expt,1,3), r = 848 1177.5 0 854.5 288 927 expt = 1319.5 728 0 728 637 682.5 >> c2=6/n*sum((r-expt).^2 ./ max(1,expt)) c2 = 49.787 Still large (unchanged), so diff is not due to a diff between 1 & 3

More Related