Discriminant Analysis

Discriminant Analysis: Defining and Testing Groups

Goals
• Develop classificatory key for groups that have already been defined
• Identify important variables in defining clusters after cluster analysis
• Classify new observations into an existing classification

Two Approaches
• Discriminant Analysis
  – Assign probabilities of group membership to an unknown specimen
• Canonical Discriminant Analysis
  – Display a picture of the groups in two dimensions using discriminant function scores

Steps
• Need more cases than variables, preferably five more cases in each group than the number of variables
• Explanatory variables are interval, ratio, or dichotomous
• Response variable is nominal (categorical)
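The sample-size guideline above can be checked before fitting by comparing the smallest group with the number of explanatory variables. This is a minimal sketch only: it uses R's built-in iris data as a stand-in (an assumption; the slides use the Snodgrass data), and the helper name enough_cases is hypothetical.

```r
# Stand-in data: iris ships with R; Species plays the role of the grouping variable.
group <- iris$Species
predictors <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]

# Hypothetical helper: TRUE if every group has at least `extra` more cases
# than there are explanatory variables.
enough_cases <- function(group, predictors, extra = 5) {
  group_sizes <- table(group)
  min(group_sizes) >= ncol(predictors) + extra
}

group_sizes <- table(group)   # cases per group
ok <- enough_cases(group, predictors)
print(group_sizes)
print(ok)
```

The margin of five is passed as an argument so that a stricter or looser rule of thumb can be tried without rewriting the check.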

Analysis
• Discriminant Analysis finds the linear combination of the explanatory variables that provides the maximum separation of the group means
• Subsequent dimensions must be orthogonal (uncorrelated) to the earlier ones
• The maximum number of dimensions is the smaller of k-1 and the number of explanatory variables, where k is the number of groups
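The two properties listed above, a limited number of dimensions and uncorrelated scores, can be seen on a small example. The sketch below assumes the MASS package is installed and uses the built-in iris data (three groups, four variables) as a stand-in for the slide data; the number of functions returned is limited by both the number of groups and the number of variables.

```r
library(MASS)  # lda() lives in the MASS package

# Stand-in data: 3 groups (Species), 4 explanatory variables.
fit <- lda(Species ~ ., data = iris)

k <- nlevels(iris$Species)    # number of groups
p <- ncol(iris) - 1           # number of explanatory variables
n_dims <- ncol(fit$scaling)   # discriminant functions actually returned

# Scores of the original cases on each discriminant function.
scores <- predict(fit)$x
score_cor <- cor(scores[, "LD1"], scores[, "LD2"])

print(c(k = k, p = p, n_dims = n_dims))
print(score_cor)   # expected to be zero up to rounding error
```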

Snodgrass Houses (Again)
• Rcmdr does not provide access to discriminant analysis
• lda() (in the MASS package) computes the discriminant functions
• plot() plots the scores by group
• predict() predicts group membership of original or new data

> library(MASS)   # provides lda()
> LdaModel.1 <- lda(Inside~Area, prior=c(.5, .5), data=Snodgrass)
> LdaModel.1
Call:
lda(Inside ~ Area, data = Snodgrass, prior = c(0.5, 0.5))

Prior probabilities of groups:
 Inside Outside
    0.5     0.5

Group means:
            Area
Inside  317.3711
Outside 179.0566

Coefficients of linear discriminants:
            LD1
Area 0.01538446
> plot(LdaModel.1)

> PInside <- predict(LdaModel.1)
> PInside
> str(PInside)
List of 3
 $ class    : Factor w/ 2 levels "Inside","Outside": 2 1 1 1 1...
 $ posterior: num [1:91, 1:2] 0.0319 0.5634 0.869 0.9987 0.995 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:91] "1" "2" "3" "4" ...
  .. ..$ : chr [1:2] "Inside" "Outside"
 $ x        : num [1:91, 1] -1.603 0.12 0.889 3.127 2.489 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:91] "1" "2" "3" "4" ...
  .. ..$ : chr "LD1"
> xtabs(~Snodgrass$Inside+PInside$class)
                PInside$class
Snodgrass$Inside Inside Outside
         Inside      29       9
         Outside      5      48
> (29+48)/(29+9+5+48)
[1] 0.8461538
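The class and posterior components shown above are linked: each case is assigned to the group with the largest posterior probability. The sketch below illustrates this on a stand-in two-group problem built from the built-in iris data (an assumption; it is not the Snodgrass data), with equal priors as in the slide.

```r
library(MASS)

# Stand-in two-group problem: drop one iris species so only two groups remain.
two <- droplevels(subset(iris, Species != "setosa"))
fit <- lda(Species ~ Petal.Length, data = two, prior = c(0.5, 0.5))
pred <- predict(fit)

# Assign each case to the group with the larger posterior probability.
best_col <- max.col(pred$posterior, ties.method = "first")
by_hand <- colnames(pred$posterior)[best_col]

same <- all(by_hand == as.character(pred$class))   # matches predict()'s classes?
row_sums <- rowSums(pred$posterior)                # posteriors sum to 1 per case
print(same)
print(head(round(pred$posterior, 4)))
```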

Results
• Predictions are the same as when we used logistic regression
• Predictions are optimistic since we used the same data to generate the model
• Could split the data: run lda() on one half and predict the other half
• Could use leave-one-out cross-validation: run lda() n times, leaving one case out each time
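Both remedies listed above can be sketched in a few lines. The example assumes the MASS package and uses the built-in iris data as a stand-in for the slide data; the seed and the 50/50 split are arbitrary choices made for illustration.

```r
library(MASS)

set.seed(1)  # arbitrary seed so the random split is reproducible

# Option 1: split the data, fit on one half, predict the other half.
n <- nrow(iris)
train_rows <- sample(n, size = n %/% 2)
fit_half <- lda(Species ~ ., data = iris[train_rows, ])
pred_half <- predict(fit_half, newdata = iris[-train_rows, ])
holdout_acc <- mean(pred_half$class == iris$Species[-train_rows])

# Option 2: leave-one-out cross-validation done by hand.
loo_class <- character(n)
for (i in seq_len(n)) {
  fit_i <- lda(Species ~ ., data = iris[-i, ])   # fit without case i
  loo_class[i] <- as.character(predict(fit_i, newdata = iris[i, ])$class)
}
loo_acc <- mean(loo_class == iris$Species)

# lda(..., CV = TRUE) returns leave-one-out classifications directly.
cv_fit <- lda(Species ~ ., data = iris, CV = TRUE)
cv_acc <- mean(cv_fit$class == iris$Species)

print(c(holdout = holdout_acc, loo = loo_acc, cv = cv_acc))
```

With CV=TRUE the result is a list of classes and posterior probabilities rather than a fitted model, so it cannot be passed to predict() or plot().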

> LdaModel.2 <- lda(Inside~Area, data=Snodgrass,
+    prior=c(.5, .5), CV=TRUE)
> LdaModel.2
$class
 [1] Outside Inside  Inside  Inside  Inside  Inside  Inside  Inside
 [9] Outside Outside Inside  Outside Outside Outside Inside  Inside
. . .
$posterior
        Inside      Outside
1 0.0329603669 0.9670396331
2 0.5703984220 0.4296015780
3 0.8664921383 0.1335078617
4 0.9988861784 0.0011138216
5 0.9950634758 0.0049365242
. . .
> xtabs(~Snodgrass$Inside+LdaModel.2$class)
                LdaModel.2$class
Snodgrass$Inside Inside Outside
         Inside      29       9
         Outside      6      47
> (29+47)/(29+9+6+47)
[1] 0.8351648

Segments
• Expand the model to predict Segment (1, 2, 3)
• Add variables Total and Types
• Overall accuracy is 68% (against chance of 33%)
• Per-segment accuracy is 76% for Segment 1, 68% for Segment 2, and 56% for Segment 3

> LdaModel.3 <- lda(Segment~Area+Types+Total, data=Snodgrass,
+    prior=rep(1/3,3))
> LdaModel.3
Call:
lda(Segment ~ Area + Types + Total, data = Snodgrass, prior = rep(1/3, 3))

Prior probabilities of groups:
        1         2         3
0.3333333 0.3333333 0.3333333

Group means:
      Area    Types    Total
1 317.3711 7.684211 13.23684
2 166.7946 1.821429  2.00000
3 192.7900 1.680000  2.00000

Coefficients of linear discriminants:
              LD1         LD2
Area  -0.01138796  0.01394856
Types -0.16654527 -0.39272065
Total  0.02337886  0.03327477

Proportion of trace:
   LD1    LD2
0.9757 0.0243
> plot(LdaModel.3)
> plot(LdaModel.3, dimen=1)
> LdaPred.3 <- predict(LdaModel.3)
> Ptable <- xtabs(~Snodgrass$Segment+LdaPred.3$class)
> Ptable
                 LdaPred.3$class
Snodgrass$Segment  1  2  3
                1 29  1  8
                2  1 19  8
                3  1 10 14
> sum(diag(Ptable))/sum(Ptable)
[1] 0.6813187
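The overall and per-segment accuracies quoted on the Segments slide can be recovered from the confusion table above. The sketch below retypes that table by hand so that it runs without the Snodgrass data; the object names overall and per_segment are illustrative.

```r
# Confusion table copied from the output above
# (rows = actual Segment, columns = predicted Segment).
Ptable <- matrix(c(29,  1,  8,
                    1, 19,  8,
                    1, 10, 14),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(actual = c("1", "2", "3"),
                                 predicted = c("1", "2", "3")))

overall <- sum(diag(Ptable)) / sum(Ptable)      # share of all cases on the diagonal
per_segment <- diag(Ptable) / rowSums(Ptable)   # share correct within each actual segment

print(round(overall, 2))       # 0.68
print(round(per_segment, 2))   # 0.76 0.68 0.56
```

Dividing by rowSums() gives accuracy for each actual segment; dividing by colSums() instead would give the share of each predicted segment that is correct.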
