Data Mining Algorithms: Classification
Classification Outline Goal: Provide an overview of the classification problem and introduce some of the basic algorithms • Classification Problem Overview • Classification Techniques • Regression • Distance • Decision Trees • Rules • Neural Networks
Classification Problem • Given a database D = {t1, t2, …, tn} and a set of classes C = {C1, …, Cm}, the Classification Problem is to define a mapping f: D → C where each ti is assigned to one class. • This actually divides D into equivalence classes. • Prediction is similar, but may be viewed as having an infinite number of classes.
Classification vs. Prediction • Classification: • predicts categorical class labels • classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data • Prediction: • models continuous-valued functions, i.e., predicts unknown or missing values • Typical Applications: • credit approval • target marketing • medical diagnosis • treatment effectiveness analysis
Classification Examples • Teachers classify students’ grades as A, B, C, D, or F. • Predict when a disaster will strike • Identify individuals with credit risks. • Speech recognition • Pattern recognition
Classification Ex: Grading • If x >= 90 then grade = A. • If 80 <= x < 90 then grade = B. • If 70 <= x < 80 then grade = C. • If 60 <= x < 70 then grade = D. • If x < 60 then grade = F. (Figure: the equivalent decision tree, splitting on x at the thresholds 90, 80, 70, and 60.)
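These thresholds translate directly into code; a minimal sketch (Python; the function name is ours):

```python
def grade(x: float) -> str:
    """Map a numeric score to a letter grade using the rules above."""
    if x >= 90:
        return "A"
    elif x >= 80:
        return "B"
    elif x >= 70:
        return "C"
    elif x >= 60:
        return "D"
    else:
        return "F"

# grade(85) -> "B"; grade(55) -> "F"
```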
Classification Ex: Letter Recognition View letters as constructed from 5 components. (Figure: example decompositions of letters A, B, C, D, E, and F into these components.)
Classification Techniques • Approach: • Create a specific model by evaluating training data (or using domain experts' knowledge). • Apply the model to new data. • Classes must be predefined. • Most common techniques use decision trees or neural networks, or are based on distances or statistical methods.
Classification—A 2 Step Process • Model construction: describing a set of predetermined classes • Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute • The set of tuples used for model construction: training set • The model is represented as classification rules, decision trees, or mathematical formulae
Classification—A 2 Step Process • Model usage: for classifying future or unknown objects • Estimate accuracy of the model • The known label of each test sample is compared with the model's classification result • Accuracy rate is the percentage of test set samples that are correctly classified by the model • The test set is independent of the training set; otherwise over-fitting will occur
Classification Process (1): Model Construction • Training data is fed into a classification algorithm, which constructs the classifier (model). • Example model, expressed as a rule: IF rank = 'professor' OR years > 6 THEN regular = 'yes'
Classification Process (2): Use the Model in Prediction • The classifier is first evaluated on testing data, then applied to unseen data. • Example: for the unseen tuple (Jeff, Professor, 4), the model answers the query REGULAR? with YES.
Supervised vs. Unsupervised Learning • Supervised learning (classification) • Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations • New data is classified based on the training set • Unsupervised learning (clustering) • The class labels of the training data are unknown • Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Issues in Classification • Missing Data • Ignore • Replace with assumed value • Measuring Performance • Classification accuracy on test data • Confusion matrix • OC Curve
Classification Performance

|  | Predicted Positive | Predicted Negative |
| --- | --- | --- |
| Actual Positive | True Positive | False Negative |
| Actual Negative | False Positive | True Negative |
Confusion Matrix Example Using the height data example, with Output1 as the correct assignment and Output2 as the classifier's actual assignment
Classifier Accuracy Measures • Accuracy of a classifier M, acc(M): percentage of test set tuples that are correctly classified by the model M • Error rate (misclassification rate) of M = 1 − acc(M) • Given m classes, CM(i, j), an entry in the confusion matrix, indicates the number of tuples in class i that are labeled by the classifier as class j • Alternative accuracy measures (e.g., for cancer diagnosis): sensitivity = t-pos / pos /* true positive recognition rate */ specificity = t-neg / neg /* true negative recognition rate */ precision = t-pos / (t-pos + f-pos) accuracy = sensitivity · pos/(pos + neg) + specificity · neg/(pos + neg) • This model can also be used for cost-benefit analysis
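As an illustrative sketch (Python; the argument names are ours), these measures follow directly from the four confusion-matrix counts:

```python
def accuracy_measures(t_pos, f_neg, f_pos, t_neg):
    """Compute the measures above from confusion-matrix counts."""
    pos = t_pos + f_neg              # actual positives
    neg = f_pos + t_neg              # actual negatives
    sensitivity = t_pos / pos        # true positive recognition rate
    specificity = t_neg / neg        # true negative recognition rate
    precision = t_pos / (t_pos + f_pos)
    accuracy = (sensitivity * pos + specificity * neg) / (pos + neg)
    return sensitivity, specificity, precision, accuracy

# e.g., a rare-class screening problem: accuracy_measures(90, 10, 50, 9850)
```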
Evaluating the Accuracy of a Classifier or Predictor (I) • Holdout method • Given data is randomly partitioned into two independent sets • Training set (e.g., 2/3) for model construction • Test set (e.g., 1/3) for accuracy estimation • Random sampling: a variation of holdout • Repeat holdout k times, accuracy = avg. of the accuracies obtained • Cross-validation (k-fold, where k = 10 is most popular) • Randomly partition the data into k mutually exclusive subsets D1, …, Dk, each of approximately equal size • At the i-th iteration, use Di as the test set and the others as the training set (see the sketch below) • Leave-one-out: k folds where k = # of tuples, for small-sized data • Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
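A minimal k-fold cross-validation sketch (pure Python; `train` and `evaluate` are assumed caller-supplied functions, not a library API):

```python
import random

def k_fold_accuracy(data, k, train, evaluate, seed=0):
    """Average accuracy over k folds.

    train(training_set) returns a model; evaluate(model, test_set)
    returns an accuracy in [0, 1]. Both are caller-supplied.
    """
    data = list(data)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]   # k roughly equal-sized, disjoint folds
    scores = []
    for i in range(k):
        test_set = folds[i]
        training_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(evaluate(train(training_set), test_set))
    return sum(scores) / k
```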
Metrics for Performance Evaluation • Focus on the predictive capability of a model, rather than on how fast it classifies or builds models, scalability, etc. • Confusion Matrix (rows = actual class, columns = predicted class): a: TP (true positive) b: FN (false negative) c: FP (false positive) d: TN (true negative)
Limitation of Accuracy • Consider a 2-class problem • Number of Class 0 examples = 9990 • Number of Class 1 examples = 10 • If model predicts everything to be class 0, accuracy is 9990/10000 = 99.9 % • Accuracy is misleading because model does not detect any class 1 example
Cost Matrix • C(i|j): the cost of misclassifying a class j example as class i
Computing Cost of Classification • Example (the cost matrix and confusion matrices appear in the original figure): one model has Accuracy = 80% and Cost = 3910, while another has Accuracy = 90% and Cost = 4255; the more accurate model incurs the higher cost.
Cost vs Accuracy • Accuracy is proportional to cost if 1. C(Yes|No) = C(No|Yes) = q and 2. C(Yes|Yes) = C(No|No) = p • With N = a + b + c + d: Accuracy = (a + d) / N Cost = p(a + d) + q(b + c) = p(a + d) + q(N − a − d) = qN − (q − p)(a + d) = N [q − (q − p) × Accuracy]
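A quick numeric check of the identity (Python; the counts and unit costs are arbitrary illustrations):

```python
a, b, c, d = 40, 10, 5, 45                  # TP, FN, FP, TN counts
p, q = 1, 5                                 # C(Yes|Yes)=C(No|No)=p; misclassification cost q
N = a + b + c + d
accuracy = (a + d) / N
cost_direct = p * (a + d) + q * (b + c)     # cost summed over the confusion matrix
cost_from_accuracy = N * (q - (q - p) * accuracy)
assert abs(cost_direct - cost_from_accuracy) < 1e-9   # both forms give 160 here
```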
Cost-Sensitive Measures • Precision p = a / (a + c) is biased towards C(Yes|Yes) & C(Yes|No) • Recall r = a / (a + b) is biased towards C(Yes|Yes) & C(No|Yes) • F-measure F = 2rp / (r + p) = 2a / (2a + b + c) is biased towards all except C(No|No)
Statistical Based Algorithms - Regression • Assume data fits a predefined function • Determine best values for regression coefficients c0, c1, …, cn • Assume an estimate: y = c0 + c1x1 + … + cnxn + e • Estimate error using the mean squared error over the training set of m examples: MSE = (1/m) Σi=1..m (yi − (c0 + c1xi1 + … + cnxin))²
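A minimal sketch of estimating such coefficients by least squares (NumPy; the data values are made up for illustration):

```python
import numpy as np

# Illustrative training set: 4 examples, 2 attributes.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 5.0]])
y = np.array([6.1, 5.9, 10.2, 14.8])

A = np.hstack([np.ones((X.shape[0], 1)), X])     # column of ones gives intercept c0
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)   # [c0, c1, c2] minimizing squared error

mse = np.mean((y - A @ coeffs) ** 2)             # training-set mean squared error
print(coeffs, mse)
```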
Classification Using Regression • Division: Use regression function to divide area into regions. • Prediction: Use regression function to predict a class membership function. Input includes desired class.
Bayesian Classification: Why? • Probabilistic learning: Calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems • Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data. • Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities • Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
Bayesian Theorem • Given training data D, the posterior probability of a hypothesis h, P(h|D), follows Bayes' theorem: P(h|D) = P(D|h) P(h) / P(D) • MAP (maximum a posteriori) hypothesis: hMAP = argmax h∈H P(h|D) = argmax h∈H P(D|h) P(h) • Practical difficulty: requires initial knowledge of many probabilities, significant computational cost
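Since P(D) is the same for every hypothesis, the MAP choice needs only the numerator; a one-function sketch (Python; `likelihood` and `prior` are assumed callables):

```python
def map_hypothesis(hypotheses, likelihood, prior):
    """Return argmax over h of P(D|h) * P(h); P(D) cancels out."""
    return max(hypotheses, key=lambda h: likelihood(h) * prior(h))
```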
Bayesian classification • The classification problem may be formalized using a-posteriori probabilities: • P(C|X) = probability that the sample tuple X = <x1, …, xk> is of class C • E.g., P(class=N | outlook=sunny, windy=true, …) • Idea: assign to sample X the class label C such that P(C|X) is maximal
Estimating a-posteriori probabilities • Bayes theorem: P(C|X) = P(X|C)·P(C) / P(X) • P(X) is constant for all classes • P(C) = relative frequency of class C samples • C such that P(C|X) is maximum = C such that P(X|C)·P(C) is maximum • Problem: computing P(X|C) directly is infeasible!
Naïve Bayesian Classification • Naïve assumption: attribute independence P(x1, …, xk|C) = P(x1|C)·…·P(xk|C) • If the i-th attribute is categorical: P(xi|C) is estimated as the relative frequency of samples having value xi as the i-th attribute in class C • If the i-th attribute is continuous: P(xi|C) is estimated through a Gaussian density function • Computationally easy in both cases
Play-tennis example: classifying X • An unseen sample X = <rain, hot, high, false> • P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9·2/9·3/9·6/9·9/14 = 0.010582 • P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5·2/5·4/5·2/5·5/14 = 0.018286 • Sample X is classified in class n (don’t play)
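The slide's arithmetic can be checked in a few lines (Python; the fractions are the counts quoted above):

```python
from fractions import Fraction as F

# P(X|p)·P(p) and P(X|n)·P(n) for X = <rain, hot, high, false>
p_play  = F(3, 9) * F(2, 9) * F(3, 9) * F(6, 9) * F(9, 14)
p_nplay = F(2, 5) * F(2, 5) * F(4, 5) * F(2, 5) * F(5, 14)

print(float(p_play))    # ≈ 0.010582
print(float(p_nplay))   # ≈ 0.018286, the larger value, so X is classified as n
```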
Overview of Naive Bayes • The goal of Naive Bayes is to work out whether a new example is in a class given that it has a certain combination of attribute values. We work out the likelihood of the example being in each class given the evidence (its attribute values), and take the highest likelihood as the classification. • Bayes Rule (E is the evidence, i.e., the event that has occurred): P[H|E] = P[E|H] · P[H] / P[E] • P[H] is called the prior probability (of the hypothesis). P[H|E] is called the posterior probability (of the hypothesis given the evidence).
Overview of Naive Bayes • Our Hypotheses are: • H1: 'the example is in class A' • H2: 'the example is in class B' • etc. • Our Evidence is the attribute values of a particular new example that is presented: • E1=x: 'the example has value x for attribute A1' • E2=y: 'the example has value y for attribute A2' • ... • En=z: 'the example has value z for attribute An' • Note that, assuming the attributes are equally important and independent, we estimate the joint probability of that combination of attribute values as: P[E|Hk] = P[E1=x|Hk] × P[E2=y|Hk] × … × P[En=z|Hk] • For each class, k, work out P[Hk|E] ∝ P[E|Hk] × P[Hk]. The goal is then to find the hypothesis (i.e., the class k) for which the value of P[Hk|E] is at a maximum.
Overview of Naive Bayes • For categorical variables we use simple proportions: P[Ei=x|Hk] = (no. of training examples in class k having value x for attribute Ai) / (number of training examples in class k) • For continuous variables we assume a normal (Gaussian) distribution, and use the mean (μ) and standard deviation (σ) of the class's training values to compute the conditional probabilities: P[Ei=x|Hk] = (1 / (σ√(2π))) · e^(−(x−μ)² / (2σ²))
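That density is a one-liner; a sketch (Python; μ and σ would be estimated from the class-k training examples):

```python
import math

def gaussian_likelihood(x, mu, sigma):
    """P[Ei = x | Hk] under a normal distribution with the class's mu and sigma."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coefficient * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# e.g., likelihood of x = 1.75 for a class with mean 1.70 and std dev 0.05
print(gaussian_likelihood(1.75, 1.70, 0.05))
```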
Worked Example 1 Take the following training data, from bank loan applicants:

| ApplicantID | City | Children | Income | Status |
| --- | --- | --- | --- | --- |
| 1 | Delhi | Many | Medium | DEFAULTS |
| 2 | Delhi | Many | Low | DEFAULTS |
| 3 | Delhi | Few | Medium | PAYS |
| 4 | Delhi | Few | High | PAYS |

• P[City=Delhi | Status = DEFAULTS] = 2/2 = 1 • P[City=Delhi | Status = PAYS] = 2/2 = 1 • P[Children=Many | Status = DEFAULTS] = 2/2 = 1 • P[Children=Few | Status = DEFAULTS] = 0/2 = 0 • etc.
Worked Example 1 Summarizing, we have the following probabilities (all derived from the training data): P[Delhi|DEFAULTS] = 1, P[Delhi|PAYS] = 1; P[Many|DEFAULTS] = 1, P[Many|PAYS] = 0; P[Few|DEFAULTS] = 0, P[Few|PAYS] = 1; P[Medium|DEFAULTS] = 0.5, P[Low|DEFAULTS] = 0.5, P[High|DEFAULTS] = 0; P[Medium|PAYS] = 0.5, P[High|PAYS] = 0.5, P[Low|PAYS] = 0; and P[Status = DEFAULTS] = 2/4 = 0.5, P[Status = PAYS] = 2/4 = 0.5. For instance, the probability of Income=Medium given the applicant DEFAULTs = the number of applicants with Income=Medium who DEFAULT divided by the number of applicants who DEFAULT = 1/2 = 0.5
Worked Example 1 Now, assume a new example is presented where City=Delhi, Children=Many, and Income=Medium: First, we estimate the likelihood that the example is a defaulter, given its attribute values: P[H1|E] ∝ P[E|H1]·P[H1] (the denominator P[E] is omitted, since it is the same for every class) P[Status = DEFAULTS | Delhi, Many, Medium] = P[Delhi|DEFAULTS] × P[Many|DEFAULTS] × P[Medium|DEFAULTS] × P[DEFAULTS] = 1 × 1 × 0.5 × 0.5 = 0.25 Then we estimate the likelihood that the example is a payer, given its attributes: P[H2|E] ∝ P[E|H2]·P[H2] (denominator again omitted) P[Status = PAYS | Delhi, Many, Medium] = P[Delhi|PAYS] × P[Many|PAYS] × P[Medium|PAYS] × P[PAYS] = 1 × 0 × 0.5 × 0.5 = 0 As the conditional likelihood of being a defaulter is higher (because 0.25 > 0), we conclude that the new example is a defaulter.
Worked Example 1 Now, assume a new example is presented where City=Delhi, Children=Many, and Income=High: First, we estimate the likelihood that the example is a defaulter, given its attribute values: P[Status = DEFAULTS | Delhi, Many, High] = P[Delhi|DEFAULTS] × P[Many|DEFAULTS] × P[High|DEFAULTS] × P[DEFAULTS] = 1 × 1 × 0 × 0.5 = 0 Then we estimate the likelihood that the example is a payer, given its attributes: P[Status = PAYS | Delhi, Many, High] = P[Delhi|PAYS] × P[Many|PAYS] × P[High|PAYS] × P[PAYS] = 1 × 0 × 0.5 × 0.5 = 0 As the conditional likelihood of being a defaulter is the same as that of being a payer, we can come to no conclusion for this example.
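Both computations can be reproduced with a short script (Python; the probability tables are the ones summarized earlier):

```python
# Conditional probabilities derived from the four training applicants.
p_given = {
    "DEFAULTS": {"Delhi": 1.0, "Many": 1.0, "Few": 0.0,
                 "Medium": 0.5, "Low": 0.5, "High": 0.0},
    "PAYS":     {"Delhi": 1.0, "Many": 0.0, "Few": 1.0,
                 "Medium": 0.5, "Low": 0.0, "High": 0.5},
}
prior = {"DEFAULTS": 0.5, "PAYS": 0.5}

def score(evidence, label):
    """Naive Bayes numerator: prior times the product of conditionals."""
    result = prior[label]
    for value in evidence:
        result *= p_given[label][value]
    return result

for evidence in [("Delhi", "Many", "Medium"), ("Delhi", "Many", "High")]:
    print(evidence, {c: score(evidence, c) for c in prior})
# First example: DEFAULTS 0.25 vs PAYS 0.0 -> defaulter.
# Second example: both 0.0 -> no conclusion.
```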
Worked Example 2 Take the following training data, for credit card authorizations:

| TransactionID | Income | Credit | Decision |
| --- | --- | --- | --- |
| 1 | Very High | Excellent | AUTHORIZE |
| 2 | High | Good | AUTHORIZE |
| 3 | Medium | Excellent | AUTHORIZE |
| 4 | High | Good | AUTHORIZE |
| 5 | Very High | Good | AUTHORIZE |
| 6 | Medium | Excellent | AUTHORIZE |
| 7 | High | Bad | REQUEST ID |
| 8 | Medium | Bad | REQUEST ID |
| 9 | High | Bad | REJECT |
| 10 | Low | Bad | CALL POLICE |

Assume we'd like to determine how to classify a new transaction, with Income=Medium and Credit=Good.
Worked Example 2 Our conditional probabilities (derived from the training data) are: P[Income=Medium | AUTHORIZE] = 2/6, P[Income=Medium | REQUEST ID] = 1/2, P[Income=Medium | REJECT] = 0/1, P[Income=Medium | CALL POLICE] = 0/1; P[Credit=Good | AUTHORIZE] = 3/6, P[Credit=Good | REQUEST ID] = 0/2, P[Credit=Good | REJECT] = 0/1, P[Credit=Good | CALL POLICE] = 0/1. Our class probabilities are: P[Decision = AUTHORIZE] = 6/10 P[Decision = REQUEST ID] = 2/10 P[Decision = REJECT] = 1/10 P[Decision = CALL POLICE] = 1/10
Worked Example 2 Our goal is now to work out, for each class, the conditional probability of the new transaction (with Income=Medium & Credit=Good) being in that class. The class with the highest probability is the classification we choose. Our conditional probabilities (again, ignoring Bayes's denominator) are: P[Decision = AUTHORIZE | Income=Medium & Credit=Good] = P[Income=Medium|Decision=AUTHORIZE] × P[Credit=Good|Decision=AUTHORIZE] × P[Decision=AUTHORIZE] = 2/6 × 3/6 × 6/10 = 36/360 = 0.1 P[Decision = REQUEST ID | Income=Medium & Credit=Good] = P[Income=Medium|Decision=REQUEST ID] × P[Credit=Good|Decision=REQUEST ID] × P[Decision=REQUEST ID] = 1/2 × 0/2 × 2/10 = 0
Worked Example 2 P[Decision = REJECT | Income=Medium & Credit=Good] = P[Income=Medium|Decision=REJECT] × P[Credit=Good|Decision=REJECT] × P[Decision=REJECT] = 0/1 × 0/1 × 1/10 = 0 P[Decision = CALL POLICE | Income=Medium & Credit=Good] = P[Income=Medium|Decision=CALL POLICE] × P[Credit=Good|Decision=CALL POLICE] × P[Decision=CALL POLICE] = 0/1 × 0/1 × 1/10 = 0 The highest of these probabilities is the first, so we conclude that the decision for our new transaction should be AUTHORIZE.
Weaknesses • Naive Bayes assumes that variables are equally important and that they are independent, which is often not the case in practice. • Naive Bayes is damaged by the inclusion of redundant (strongly dependent) attributes, e.g., if people with high income have expensive houses, then including both income and house-price in the model would unfairly multiply the effect of having low income. • Sparse data: If some attribute values are not present in the training data, then a zero probability for P[E|H] can arise. This forces P[H|E] to zero no matter how high P[E|H] is for the other attribute values. Small positive pseudo-counts (as in Laplace smoothing) are often added to the estimates to correct this, as sketched below.
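A sketch of the standard correction, add-one (Laplace) smoothing for the categorical estimate (Python; the parameter names are ours):

```python
def smoothed_conditional(n_value_in_class, n_class, n_distinct_values, alpha=1.0):
    """Estimate P[Ei = x | Hk] with add-alpha (Laplace) smoothing.

    n_value_in_class:  training examples in class k with value x for attribute Ai
    n_class:           training examples in class k
    n_distinct_values: number of distinct values attribute Ai can take
    """
    return (n_value_in_class + alpha) / (n_class + alpha * n_distinct_values)

# Worked Example 1 revisited: P[Many|PAYS] was 0/2 = 0. With smoothing
# (Children takes 2 values) it becomes (0 + 1) / (2 + 2) = 0.25, so a single
# zero no longer forces the whole product to zero.
print(smoothed_conditional(0, 2, 2))
```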