510 likes | 1.06k Views
SM339 ANOVA - Spring 2007. 2. ANOVA. Think of an extension to 2 sample t problemNeed an estimate of SDIn 2 sample t, we always assumed the two distns had the same SDHad to combine our 2 estimates (one from each sample) to get a single est. SM339 ANOVA - Spring 2007. 3. ANOVA. Review 2 sampl
E N D
1. SM339 ANOVA - Spring 2007 1 ANOVA Analysis of Variance
Method to compare more than 2 means
As with comparing more than 2 p’s, we can’t use subtraction
Will use sum of squares
2. SM339 ANOVA - Spring 2007 2 ANOVA Think of an extension to 2 sample t problem
Need an estimate of SD
In 2 sample t, we always assumed the two distns had the same SD
Had to combine our 2 estimates (one from each sample) to get a single est
3. SM339 ANOVA - Spring 2007 3 ANOVA Review 2 sample t test
Given N, Avg, SD for 2 groups
SP=pool(s1,n1,s2,n2)
PV=tprob(0,SP*?(1/n1+1/n2),av1-av2,999)
(double for two sided)
4. SM339 ANOVA - Spring 2007 4 ANOVA Pool(s1,n1,s2,n2)
=?[ (n1-1)s12 + (n2-1) s22 ]/ (n1-1 + n2-1)]
Concentrate on the numerator and ignore sqrt
? (n-1)*s2
= SSE, Sum of Squares Error
(Error not in sense of mistake)
5. SM339 ANOVA - Spring 2007 5 ANOVA If we ignored the fact that our data came from (potentially) different groups, we might have started with
? (x-avg)2
=SSTotal
Note that ?SSTotal/(N-1) would be our estimator of SD using all the data (and ignoring grouping)
6. SM339 ANOVA - Spring 2007 6 ANOVA The Sum of Squares Identity
SSTotal = SSE + SSTreatment
Where
SSTr = ? nj * (avgj-AVG)2
Except for nj this is like the sum of squares of the averages
7. SM339 ANOVA - Spring 2007 7 ANOVA See Ex 62, p. 488
3 levels of blockage
Measure flow rate where collapse occurs
In book as fig11.4
Not a good form for Excel
See ANOVA1.xls
First col tells which blockage level
Second col is the flowrate
8. SM339 ANOVA - Spring 2007 8 ANOVA To extract the values that correspond to flow rate=1, use
Dt(:,1)==1
“:” means all rows
Equality is ==
Single = is assignment
Use above expression to select rows
Dt( Dt(:,1)==1, : )
9. SM339 ANOVA - Spring 2007 9 ANOVA To count how many, use length()
Length(Dt( Dt(:,1)==1, : ))
Average is mean()
SD is std()
Most convenient to store these in lists
10. SM339 ANOVA - Spring 2007 10 ANOVA For i=1:3,
Ns(i) = Length(Dt( Dt(:,1)==i, : ));
Avs(i) = …
Sds(i) = …
End
11. SM339 ANOVA - Spring 2007 11 ANOVA These are rows
We can display conveniently by making them cols of a matrix
[ns’ avs’ sds’]
11.0000 11.2091 1.8987
14.0000 15.0857 2.1504
10.0000 17.3300 2.1680
12. SM339 ANOVA - Spring 2007 12 ANOVA Alternate plan is to compute Ns, Avs, SDs as a matrix in the first place
Write a function SUMSTATS(X, grp) which returns such a matrix
The number of rows of the answer will be the number of groups
Assume “grp” contains 1,…#groups
13. SM339 ANOVA - Spring 2007 13 ANOVA Notice that SDs are quite similar in our example
Good if we are going to assume the underlying SDs are all equal
Easy to get SSE from here
>> sse=sum((ns-1).*sds.^2)
138.4672
(Close to p. 497 in book)
14. SM339 ANOVA - Spring 2007 14 ANOVA Exercises:
1. Fig 11.5, p. 489, has data on roadway aggregate. Do sumstats.
2. Fig 11.6, p. 491, has data on carpet fibers. Do sumstats.
15. SM339 ANOVA - Spring 2007 15 ANOVA To get SSTr, we need overall average
Generally NOT the avg of the avgs
Have to weight by the N’s
>> grand=sum(ns.*avs)/sum(ns)
14.5086
>> sstr=sum(ns.*(avs-grand).^2)
204.0202
16. SM339 ANOVA - Spring 2007 16 ANOVA Don’t actually need SST, but it would be SSTr+SSE = 342.4874
>> (length(d114)-1)*std(dt(:,2)).^2
342.4874
17. SM339 ANOVA - Spring 2007 17 ANOVA H0: all means equal
Ha: not all means equal
(Not the same as “all means different”)
It can be shown that when H0 is true, then each of the SS’s is a (multiple of) ?2
18. SM339 ANOVA - Spring 2007 18 ANOVA Degrees of Freedom
Usual est of SD has df=N-1
So, if H0 true, Total df = N-1
Each term of SSE has df= ni-1
Fact: Sums of (ind) Chi^2 has df given by sum of df
So df for SSE = S (ni-1)
Df for SSTr is difference
= # groups-1 = G-1
19. SM339 ANOVA - Spring 2007 19 ANOVA
20. SM339 ANOVA - Spring 2007 20 ANOVA Recall that the mean of a ?2 = df
For comparison, we should divide SS/df
Called Mean Square
MSTr, MSE (don’t usually do MST)
21. SM339 ANOVA - Spring 2007 21 ANOVA
22. SM339 ANOVA - Spring 2007 22 ANOVA
23. SM339 ANOVA - Spring 2007 23 ANOVA If H0 is true, then MSTr = MSE
Consider their ratio (to get rid of the common multiple)
F = MSTr / MSE
Near 1 -> H0
F large -> avgs are more spread out than expected
-> Ha
24. SM339 ANOVA - Spring 2007 24 ANOVA F distn
Depends on df for numerator and df for denominator
FPROB(df1, df2, lo, hi)
Can calculate p-value
25. SM339 ANOVA - Spring 2007 25 ANOVA For our problem, F=102/4.33 = 23.57
>> fprob(2,32,23.57,999)
= 5.1058e-007
So if H0 were true, it would be nearly impossible to get F this large.
Can be quite sure that H0 is not true.
Means are not all exactly the same.
26. SM339 ANOVA - Spring 2007 26 ANOVA Write a function that takes SUMSTATS and returns ANOVA
Since ANOVA table is not a full matrix, suggest we return a matrix for SS, df, MS and then return F and PV as scalars
[F,pv,SS] = ANOVA(sumstats)
27. SM339 ANOVA - Spring 2007 27 ANOVA Exercises:
1. Fig 11.5, p. 489, has data on roadway aggregate. Do ANOVA.
2. Fig 11.6, p. 491, has data on carpet fibers. Do ANOVA.
28. SM339 ANOVA - Spring 2007 28 ANOVA If we only have 2 groups, should we do ANOVA or 2 sample t test?
Consider the following:
N1=8, Av1=12.3, SD1=5.3
N2=12, Av2=16.4, SD2=6.6
29. SM339 ANOVA - Spring 2007 29 ANOVA SP=6.1273
For the two sided Ha
>>pv=tprob(0,sp*sqrt(1/N1+1/N2),N1-1+N2-1,Av2-Av1,999)*2
0.1599
30. SM339 ANOVA - Spring 2007 30 ANOVA For ANOVA
sm =
8.0000 12.3000 5.3000
12.0000 16.4000 6.6000
>> [f,pv,ssa]=anova1(sm)
f =
2.1492
pv =
0.1599
So the (two-sided) pvalue is exactly the same by either method
31. SM339 ANOVA - Spring 2007 31 ANOVA But there’s more!!
ssa =
80.6880 1.0000 80.6880
675.7900 18.0000 37.5439
756.4780 19.0000 39.8146
and SP^2= 37.5439
So vMSE = RMSE is the analog of SP
32. SM339 ANOVA - Spring 2007 32 ANOVA If we reject H0, then some means are different. Which ones?
Problem of multiple comparisons
If H0 were true (all means equal) and we did 20 tests with alpha=0.05, then we should expect that 1 of the comparisons would appear to be different
If we are going to do a number of 2 sample t tests, we will have to use a lower ? than usual
33. SM339 ANOVA - Spring 2007 33 ANOVA The book uses the Studentized range method
There are a number of different methods for addressing this problem
We will use the Bonferroni method
34. SM339 ANOVA - Spring 2007 34 ANOVA The probability of a union of sets could be as high as the sum of the probabilities (if the sets are mutually disjoint)
In our cases, the sets are “making a Type I error” for each test we are doing
If we are doing N tests and each test has a Prob(Type I) = 0.05/N, then the overall Prob(Type I) = 0.05
35. SM339 ANOVA - Spring 2007 35 ANOVA Therefore, if we want to end up at an overall level of 5%, say, then we need to conduct each test at a level of 0.05/# tests
For G groups, there are G*(G-1)/2 pairs
For flowrate example, G=3 and there are 3*2/2 = 3 pairs to compare
Do 2 sample t tests with ?/3 = 0.05/3
36. SM339 ANOVA - Spring 2007 36 ANOVA If we use p-values, then we should multiply the (apparent) p-value by the # of comparisons
If we do confidence intervals, we should get wider intervals
For 95% intervals, we usually solve for 0.025
Now we solve for 0.025 / (# groups)
37. SM339 ANOVA - Spring 2007 37 ANOVA Recall that in the 2 sample t test, we use SD*?(1/n1+1/n2) for the diff of avgs
What do we use for the pooled SD?
This was the basis for SSE (or MSE)
For SD, use ?MSE
For df, use df for MSE
38. SM339 ANOVA - Spring 2007 38 ANOVA The p-value for comparing 1-2 is 0.0082
For 1-3, PV=0.0003
For 2-3, PV=0.0872
Using Bonferoni, we cannot conclude that 2 and 3 are different, but 1 IS diff from both 2 and 3.
Note that on p. 504, he gets a slightly diff result using conf intervals and Studentized range
The Studentized range is a bit more efficient than Bonferroni, but note that his last CI comes quite close to 0
39. SM339 ANOVA - Spring 2007 39 Rank procedures The underlying assumption of ANOVA is that the data has a normal distn
When we are not comfortable with that assumption there are other methods available
Nonparametric methods
40. SM339 ANOVA - Spring 2007 40 Rank procedures A large class of NP methods are based on ranks
Replace the data with ranks and then do calculations
We should feel comfortable in knowing which values are larger than which other values, even if we don’t know the probability of such a difference
41. SM339 ANOVA - Spring 2007 41 Rank procedures Function y=ranks(x)
“Midranks” to handle ties in the x’s
For comparing multiple groups, we use Kruskal-Wallis
Based on the sum of ranks within each group
42. SM339 ANOVA - Spring 2007 42 Rank procedures Text has a formula on p. 713, but not very meaningful
Better formula is based on S(obs-exp)2/exp
If we have N total obs and n1 obs in group 1, then the avg rank of all the data is (N+1)/2
We would expect the rank sum for group 1 to be n1*(N+1)/2, etc
Can then compute the diff between obs and exp
TRICK: There is a fudge factor. Have to multiply by 6/N to get ?2
43. SM339 ANOVA - Spring 2007 43 Rank procedures For artery flow:
r=ranks(dt(:,2))
rs(1)=sum(r(dt(:,1)==1)); etc
ns(1)=sum(dt(:,1)==1); etc
expt=ns*(n+1)/2
>> c2=6/n*sum((rs-expt).^2./expt)
20.4380
>> pv=chiprob(2,c2,99)
3.6471e-005
NOTE df=# groups-1 = df for Treatment
44. SM339 ANOVA - Spring 2007 44 Rank procedures Can use pooling as we did for Chi^2 to determine which differences are significant
But the pooling can be done on the rank sums (and expected’s)
As before, df do not change
45. SM339 ANOVA - Spring 2007 45 Rank procedures Carpet fiber example
Assume we’ve run sumstats
>> ns=smst(:,1)'
ns =
16 16 13 16 14 15
>> r=ranksums(x(:,2),x(:,1))
r =
495 1177.5 353 854.5 288 927
>> n=sum(ns)
n =
90
46. SM339 ANOVA - Spring 2007 46 Rank procedures >> expt=ns*(n+1)/2
expt =
728 728 591.5 728 637 682.5
>> c2=6/n*sum((r-expt).^2 ./ max(1,expt))
c2 =
49.937
Chi^2 w/ df=5, so extremely large
47. SM339 ANOVA - Spring 2007 47 Rank procedures >> (r-expt)./expt
ans =
-0.32005 0.61745 -0.40321 0.17376 -0.54788 0.35824
Note that 1 and 3 look similar
>> r=cpool(r,1,3),expt=cpool(expt,1,3),
r =
848 1177.5 0 854.5 288 927
expt =
1319.5 728 0 728 637 682.5
>> c2=6/n*sum((r-expt).^2 ./ max(1,expt))
c2 =
49.787
Still large (unchanged), so diff is not due to a diff between 1 & 3