Chapter 8

rhoda
Presentation Transcript


  1. Chapter 8 Discriminant Analysis

  2. 8.1 Introduction
     • Classification is an important issue in multivariate analysis and data mining.
     • Classification constructs a model from the training set and the values (class labels) of a classifying attribute, then uses that model to classify new data, i.e., to predict unknown or missing values.

  3. Classification: A Two-Step Process
     • Model construction: describing a set of predetermined classes
       • Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
       • The set of tuples used for model construction is the training set.
       • The model is represented as classification rules, decision trees, or mathematical formulae.
     • Prediction: classifying future or unknown objects
       • Estimate the accuracy of the model:
         • The known label of each test sample is compared with the model's classification.
         • The accuracy rate is the percentage of test set samples that the model classifies correctly.
         • The test set must be independent of the training set; otherwise the accuracy estimate is over-optimistic and over-fitting goes undetected.
       • If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.
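As a rough illustration of the accuracy estimate described on slide 3, the sketch below compares the known labels of a held-out test set with a model's predictions. The function name `accuracy_rate` and the toy labels are hypothetical and do not come from the slides.

```python
def accuracy_rate(true_labels, predicted_labels):
    """Percentage of test samples whose predicted label matches the known label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("test set must not be empty")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)


# Hypothetical test set, kept separate from the data used to build the model.
known = ["yes", "no", "yes", "yes"]
predicted = ["yes", "no", "no", "yes"]
print(accuracy_rate(known, predicted))  # 3 of 4 correct -> 75.0
```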

  4. Classification Process: Model Construction. [Diagram: Training Data → Classification Algorithms → Classifier (Model)] Example rule produced by the classifier: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

  5. Classification Process: Use the Model in Prediction. [Diagram: Testing Data and Unseen Data are passed to the Classifier] Unseen data: (Jeff, Professor, 4). Tenured?
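The rule on slide 4 can be applied to the unseen tuple on slide 5 directly. A minimal sketch, where the function name `tenured` and the case-insensitive rank comparison are assumptions made for illustration:

```python
def tenured(rank, years):
    """Rule from the model-construction slide: professor OR more than 6 years."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"


# Unseen tuple from the prediction slide: (Jeff, Professor, 4)
name, rank, years = ("Jeff", "Professor", 4)
print(name, tenured(rank, years))  # Jeff yes
```

Jeff has only 4 years, but the rank condition alone satisfies the OR, so the model predicts tenured = 'yes'.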

  6. Supervised vs. Unsupervised Learning
     • Supervised learning (classification)
       • Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of each observation.
       • New data are classified based on the training set.
     • Unsupervised learning (clustering)
       • The class labels of the training data are unknown.
       • Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.

  7. Discrimination: Introduction. Discrimination is a technique concerned with allocating new observations to previously defined groups. There are k samples from k distinct populations. One wants to find the so-called discriminant function and the related rule for allocating new observations.

  8. Example 11.3 Bivariate case

  9. Discriminant function and rule

  10. Example 11.1: Riding mowers. Consider two groups in a city: riding-mower owners and those without riding mowers. To identify the best sales prospects for an intensive sales campaign, a riding-mower manufacturer is interested in classifying families as prospective owners or non-owners on the basis of income and lot size.

  11. Example 11.1: Riding mowers

  12. Example 11.1: Riding mowers

  13. 8.2 Discriminant by Distance Assume k=2 for simplicity

  14. 8.2 Discriminant by Distance Consider the Mahalanobis distance

  15. 8.2 Discriminant by Distance Let

  16. 8.2 Discriminant by Distance
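The formulas on slides 13 to 16 did not survive in the transcript. A standard formulation of the distance rule for k = 2 is sketched below; it is an assumption about what the slides showed, with $\mu_1, \mu_2$ denoting the mean vectors and $\Sigma$ the common covariance matrix of the two populations $G_1$ and $G_2$ (all notation introduced here, not taken from the slides).

```latex
% Squared Mahalanobis distance from x to population G_i
\[
  d^2(x, G_i) = (x - \mu_i)' \Sigma^{-1} (x - \mu_i), \qquad i = 1, 2 .
\]
% Distance rule: allocate x to the nearer population
\[
  x \in G_1 \ \text{if } d^2(x, G_1) \le d^2(x, G_2), \qquad
  x \in G_2 \ \text{otherwise.}
\]
% Expanding the difference gives a linear discriminant function W(x)
\[
  d^2(x, G_2) - d^2(x, G_1) = 2\,W(x), \qquad
  W(x) = a'(x - \bar{\mu}), \quad
  a = \Sigma^{-1}(\mu_1 - \mu_2), \quad
  \bar{\mu} = \tfrac{1}{2}(\mu_1 + \mu_2),
\]
% so the rule is equivalent to: x in G_1 if W(x) >= 0, x in G_2 if W(x) < 0.
```

In practice the unknown $\mu_i$ and $\Sigma$ would be replaced by the sample means and the pooled sample covariance matrix.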

  17. Example: Univariate case with equal variance (figure, with the point a marked)

  18. Example: Univariate case with equal variance (figure, with the point a* marked)
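In the univariate case with equal variance, the distance rule reduces to comparing x with the midpoint of the two means. The sketch below illustrates this under that assumption; the function name `classify_by_distance` and the numbers are hypothetical, not taken from the slides' figure.

```python
def classify_by_distance(x, mu1, mu2, sigma):
    """Allocate x to group 1 or 2 by the smaller squared standardized distance."""
    d1 = ((x - mu1) / sigma) ** 2
    d2 = ((x - mu2) / sigma) ** 2
    return 1 if d1 <= d2 else 2


# Hypothetical population parameters (common standard deviation sigma).
mu1, mu2, sigma = 10.0, 4.0, 2.0
cutoff = (mu1 + mu2) / 2  # midpoint of the means: 7.0

for x in (9.0, 7.5, 6.0, 3.0):
    # Points above the midpoint go to group 1, points below go to group 2.
    print(x, classify_by_distance(x, mu1, mu2, sigma))
```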

  19. 8.3 Fisher’s Discriminant Function Idea: projection, ANOVA

  20. 8.3 Fisher’s Discriminant Function Training samples

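  21. 8.3 Fisher’s Discriminant Function. Project the data onto a direction and compute the F-statistic of the projected data, i.e., the ratio of between-group to within-group variation along that direction.

  22. 8.3 Fisher’s Discriminant Function. Find the direction that maximizes this ratio. The solution is the eigenvector associated with the largest eigenvalue of the matrix obtained by multiplying the inverse of the within-group sum-of-squares matrix by the between-group sum-of-squares matrix. Discriminant function: the projection of an observation onto this eigenvector.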
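The formulas on slides 21 and 22 are missing from the transcript. The block below gives the usual textbook form of Fisher's criterion as a hedged reconstruction. The symbols $a$, $B$, $E$, $x_{ij}$, $\bar{x}_i$, $\bar{x}$, $n_i$ and $n$ are notation assumed here and may differ from the slides.

```latex
% Training samples: x_{ij}, j = 1..n_i, from group i = 1..k; n = n_1 + ... + n_k
% Between-group and within-group sum-of-squares matrices
\[
  B = \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})', \qquad
  E = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)' .
\]
% One-way ANOVA F-statistic of the projected data a'x_{ij}
\[
  F(a) = \frac{a' B a / (k - 1)}{a' E a / (n - k)} .
\]
% Maximizing F(a) is equivalent to maximizing the ratio a'Ba / a'Ea,
% which leads to the eigenvalue problem
\[
  E^{-1} B \, a = \lambda a ,
\]
% with a the eigenvector of the largest eigenvalue lambda.
% Discriminant function:
\[
  y = a' x .
\]
```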

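  23. (B) Two Populations. Note: with two populations the between-group matrix has rank one, so there is only one non-zero eigenvalue.

  24. (B) Two Populations. The associated eigenvector is proportional to the inverse of the within-group matrix times the difference of the two group mean vectors.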

  25. (B) Two Populations When is replaced by where
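For two populations, the standard result is that Fisher's direction is proportional to $E^{-1}(\bar{x}_1 - \bar{x}_2)$, with the cutting point at the projected midpoint of the two group means. The sketch below works this through for two variables in plain Python. The helper names and the toy data are hypothetical (this is not the insect data set), and the slides' own formulas are not preserved, so this is an illustration under those assumptions rather than the slides' exact procedure.

```python
def mean_vector(rows):
    """Component-wise mean of a list of 2-D observations."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def scatter(rows, m):
    """2x2 sum of squares and cross-products of rows about the mean m."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - m[0], r[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fisher_two_groups(g1, g2):
    """Return direction a = E^{-1}(mean1 - mean2) and the midpoint cutting value."""
    m1, m2 = mean_vector(g1), mean_vector(g2)
    s1, s2 = scatter(g1, m1), scatter(g2, m2)
    e = [[s1[i][j] + s2[i][j] for j in range(2)] for i in range(2)]  # within-group matrix
    det = e[0][0] * e[1][1] - e[0][1] * e[1][0]
    inv = [[e[1][1] / det, -e[0][1] / det], [-e[1][0] / det, e[0][0] / det]]
    diff = [m1[0] - m2[0], m1[1] - m2[1]]
    a = [inv[i][0] * diff[0] + inv[i][1] * diff[1] for i in range(2)]
    mid = [(m1[i] + m2[i]) / 2 for i in range(2)]
    cut = a[0] * mid[0] + a[1] * mid[1]
    return a, cut


def classify(x, a, cut):
    """Group 1 if the discriminant value a'x is at least the cutting point."""
    return 1 if a[0] * x[0] + a[1] * x[1] >= cut else 2


# Hypothetical training samples for two groups.
g1 = [(4, 2), (5, 3), (6, 3), (5, 4)]
g2 = [(1, 1), (2, 1), (2, 2), (3, 2)]
a, cut = fisher_two_groups(g1, g2)
print(a, cut)
print([classify(x, a, cut) for x in g1 + g2])
```

Because $a'(\bar{x}_1 - \bar{x}_2)$ is a positive quadratic form, group 1 always projects above the cutting point, which is why the rule uses `>=` for group 1.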

  26. Example: Insect Classification. Note: the data x1 and x2 are characteristics of insects (Hoel, 1947); n.g. means natural group (species), c.g. the classified group, and y the value of the discriminant function.

  27. Example: Insect Classification. The eigenvalue of is 1.9187 and the associated eigenvector is

  28. Example: Insect Classification. The discriminant function is and the associated value of each observation is given in the table. The cutting point is Classification is If we use , we have the same classification.

  29. 8.4 Bayes’ Discriminant Analysis
      • Idea
        • There are k populations G1, …, Gk in Rp.
        • A partition of Rp, R1, …, Rk, is determined based on a training sample.
      Rule: assign x to Gi if x falls into Ri.
      Loss: incurred when x is from Gi but falls into Rj.
      The probability of this misclassification is the integral over Rj of the density of Gi.

  30. 8.4 Bayes’ Discriminant Analysis. The expected cost of misclassification (ECM) weights each loss by the corresponding misclassification probability and by the prior probabilities q1, …, qk. We want to minimize ECM(R1, …, Rk) with respect to R1, …, Rk.
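The formulas for the misclassification probability and the ECM are missing from the transcript. In the usual textbook notation they read as follows; the symbols $C(j \mid i)$ for the loss, $P(j \mid i)$ for the misclassification probability and $f_i$ for the density of $G_i$ are assumptions and may not match the slides.

```latex
% Probability that an observation from G_i falls into R_j (i != j)
\[
  P(j \mid i) = \int_{R_j} f_i(x)\, dx .
\]
% Expected cost of misclassification, with priors q_1, ..., q_k
% and C(j | i) the loss when an observation from G_i is allocated to G_j
\[
  \mathrm{ECM}(R_1, \dots, R_k)
    = \sum_{i=1}^{k} q_i \sum_{j \ne i} C(j \mid i)\, P(j \mid i) .
\]
```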

  31. B. Method Theorem 6.4.1 Let Then the optimal Rt’s are

  32. Corollary 1 Take if and 0 if . Then Proof:

  33. Corollary 2. In the case of k=2 we have

  34. Corollary 3 In the case of k=2 and

  35. Then
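The statements of the theorem and corollaries are not preserved. For k = 2, the standard minimum-ECM rule allocates x to G1 when q1·C(2|1)·f1(x) ≥ q2·C(1|2)·f2(x), and for normal populations with equal variance this becomes a single cutting point. The sketch below illustrates that standard rule in the univariate case. The function names, the parameter names `c21` and `c12`, and the numbers are hypothetical, and the assumption that Corollary 3 concerns equal-variance normal populations is mine.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a univariate normal distribution."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def bayes_classify(x, mu1, mu2, sigma, q1=0.5, q2=0.5, c21=1.0, c12=1.0):
    """Minimum-ECM rule for two groups.

    c21: cost of allocating to G2 an observation from G1; c12: the reverse.
    """
    h1 = q2 * c12 * normal_pdf(x, mu2, sigma)  # expected loss if x is sent to G1
    h2 = q1 * c21 * normal_pdf(x, mu1, sigma)  # expected loss if x is sent to G2
    return 1 if h1 <= h2 else 2


def bayes_cutoff(mu1, mu2, sigma, q1=0.5, q2=0.5, c21=1.0, c12=1.0):
    """Cutting point for equal variances, assuming mu1 > mu2 (G1 lies above it)."""
    return (mu1 + mu2) / 2 + sigma ** 2 * math.log((q2 * c12) / (q1 * c21)) / (mu1 - mu2)


# Hypothetical parameters.
mu1, mu2, sigma = 10.0, 4.0, 2.0
print(bayes_cutoff(mu1, mu2, sigma))                  # equal priors and costs: midpoint
print(bayes_cutoff(mu1, mu2, sigma, q1=0.8, q2=0.2))  # a larger prior for G1 lowers the cutoff
print(bayes_classify(6.5, mu1, mu2, sigma))
print(bayes_classify(6.5, mu1, mu2, sigma, q1=0.8, q2=0.2))
```

With equal priors and equal costs the rule coincides with the distance rule of Section 8.2; unequal priors or costs shift the cutting point toward the less likely or less costly group.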

  36. C. Example 11.3: Detection of hemophilia A carriers. To construct a procedure for detecting potential hemophilia A carriers, blood samples from two groups of women were assayed and measurements on two variables were recorded. The first group of 30 women was selected from a population of women who did not carry the hemophilia gene; this group was called the normal group. The second group of 22 women was selected from known hemophilia A carriers; this group was called the obligatory carriers.

  37. C. Example 11.3: Detection of hemophilia A carriers
      Variables:
      • log10(AHF activity)
      • log10(AHF-like antigen)
      Populations:
      • women who did not carry the hemophilia gene (n1=30)
      • women who are known hemophilia A carriers (n2=45)

  38. C. Example 11.3: Detection of hemophilia A carriers

  39. C. Example 11.3: Detection of hemophilia A carriers. Data set

      Normal group, log10(AHF activity) (30 values):
      -0.0056 -0.1698 -0.3469 -0.0894 -0.1679 -0.0836 -0.1979 -0.0762 -0.1913 -0.1092
      -0.5268 -0.0842 -0.0225 0.0084 -0.1827 0.1237 -0.4702 -0.1519 0.0006 -0.2015
      -0.1932 0.1507 -0.1259 -0.1551 -0.1952 0.0291 -0.228 -0.0997 -0.1972 -0.0867

      Normal group, log10(AHF-like antigen) (30 values):
      -0.1657 -0.1585 -0.1879 0.0064 0.0713 0.0106 -0.0005 0.0392 -0.2123 -0.119
      -0.4773 0.0248 -0.058 0.0782 -0.1138 0.214 -0.3099 -0.0686 -0.1153 -0.0498
      -0.2293 0.0933 -0.0669 -0.1232 -0.1007 0.0442 -0.171 -0.0733 -0.0607 -0.056

      Obligatory carriers, log10(AHF activity) (45 values):
      -0.3478 -0.3618 -0.4986 -0.5015 -0.1326 -0.6911 -0.3608 -0.4535 -0.3479 -0.3539
      -0.4719 -0.361 -0.3226 -0.4319 -0.2734 -0.5573 -0.3755 -0.495 -0.5107 -0.1652
      -0.2447 -0.4232 -0.2375 -0.2205 -0.2154 -0.3447 -0.254 -0.3778 -0.4046 -0.0639
      -0.3351 -0.0149 -0.0312 -0.174 -0.1416 -0.1508 -0.0964 -0.2642 -0.0234 -0.3352
      -0.1878 -0.1744 -0.4055 -0.2444 -0.4784

      Obligatory carriers, log10(AHF-like antigen) (45 values):
      0.1151 -0.2008 -0.086 -0.2984 0.0097 -0.339 0.1237 -0.1682 -0.1721 0.0722
      -0.1079 -0.0399 0.167 -0.0687 -0.002 0.0548 -0.1865 -0.0153 -0.2483 0.2132
      -0.0407 -0.0998 0.2876 0.0046 -0.0219 0.0097 -0.0573 -0.2682 -0.1162 0.1569
      -0.1368 0.1539 0.14 -0.0776 0.1642 0.1137 0.0531 0.0867 0.0804 0.0875
      0.251 0.1892 -0.2418 0.1614 0.0282

  40. C. Example 11.3: Detection of hemophilia A carriers. SAS output

  41. C. Example 11.3: Detection of hemophilia A carriers

  42. C. Example 11.3: Detection of hemophilia A carriers

  43. C. Example 11.3: Detection of hemophilia A carriers
