1. Data Mining Classification: Alternative Techniques
2. Rule-Based Classifier Classify records by using a collection of “if…then…” rules
Rule: (Condition) → y
where
Condition is a conjunction of attributes
y is the class label
LHS: rule antecedent or condition
RHS: rule consequent
Examples of classification rules:
(Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
(Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No
3. Rule-based Classifier (Example) R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
4. Application of Rule-Based Classifier A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule
5. Rule Coverage and Accuracy Coverage of a rule:
Fraction of records that satisfy the antecedent of a rule
Accuracy/Confidence of a rule:
Fraction of records for which the rule “fires” (antecedent is satisfied) that are correctly classified
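As a minimal sketch, coverage and accuracy could be computed like this in Python; the rule representation (an antecedent dict of attribute tests plus a consequent class) and the record format are assumptions for illustration:

    def covers(rule, record):
        # The rule covers the record if every attribute test in the antecedent is satisfied
        return all(record.get(attr) == val for attr, val in rule["antecedent"].items())

    def coverage_and_accuracy(rule, records):
        covered = [r for r in records if covers(rule, r)]
        coverage = len(covered) / len(records)                 # fraction satisfying the antecedent
        correct = sum(1 for r in covered if r["class"] == rule["consequent"])
        accuracy = correct / len(covered) if covered else 0.0  # fraction of covered records classified correctly
        return coverage, accuracy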
6. How does Rule-based Classifier Work?
7. Characteristics of Rule-Based Classifier Rule Learners may or may not have these properties:
Mutually exclusive rules
Every record is covered by at most one rule
Exhaustive rules
Classifier has exhaustive coverage if it accounts for every possible combination of attribute values
Each record is covered by at least one rule
If both mutually exclusive and exhaustive, then every record covered by exactly one rule
8. From Decision Trees To Rules
9. Rules Can Be Simplified
10. Effect of Rule Simplification Rules are no longer mutually exclusive
A record may trigger more than one rule
Solution?
Ordered rule set
Unordered rule set – use voting schemes
Rules are no longer exhaustive
A record may not trigger any rules
Solution?
Use a default class
11. Ordered Rule Set Rules are rank ordered according to their priority
An ordered rule set is known as a decision list
When a test record is presented to the classifier
It is assigned to the class label of the highest ranked rule it has triggered
If none of the rules fired, it is assigned to the default class
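A minimal sketch of decision-list classification, reusing the hypothetical covers() helper sketched earlier:

    def classify(decision_list, record, default_class):
        # Scan the rules in rank order; the first rule that fires decides the class
        for rule in decision_list:
            if covers(rule, record):
                return rule["consequent"]
        return default_class  # no rule fired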
12. Rule Ordering Schemes Rule-based ordering
Individual rules are ranked based on their quality
Class-based ordering
Rules that belong to the same class appear together
13. Building Classification Rules Direct Method:
Extract rules directly from data
e.g.: RIPPER, CN2, Holte’s 1R
Indirect Method:
Extract rules from other classification models (e.g., decision trees, neural networks)
e.g.: C4.5rules
14. Direct Method: Sequential Covering Start from an empty rule
Grow a rule using the Learn-One-Rule function
Remove training records covered by the rule
Repeat steps (2) and (3) until the stopping criterion is met
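A sketch of this outer loop, treating Learn-One-Rule as a black box and using a hypothetical minimum-coverage stopping criterion:

    def sequential_covering(records, learn_one_rule, min_coverage=1):
        rule_set, remaining = [], list(records)
        while remaining:
            rule = learn_one_rule(remaining)                        # step (2): grow a rule
            covered = [r for r in remaining if covers(rule, r)]
            if len(covered) < min_coverage:                         # stopping criterion (an assumption)
                break
            rule_set.append(rule)
            remaining = [r for r in remaining if r not in covered]  # step (3): remove covered records
        return rule_set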
15. Example of Sequential Covering
16. Example of Sequential Covering…
17. Aspects of Sequential Covering Rule Growing
Instance Elimination
Rule Evaluation
Stopping Criterion
Rule Pruning
18. Rule Growing Two common strategies
19. Instance Elimination Why do we need to eliminate instances?
Otherwise, the next rule learned would be identical to the previous rule
Why do we remove positive instances?
Ensure that the next rule is different
Why do we remove negative instances?
Prevent underestimating accuracy of rule
Compare rules R2 and R3 in the diagram
20. Rule Evaluation Metrics:
Accuracy
Laplace
Laplace moves the probability estimate towards the probability you would use with no information, assuming all classes are equally likely
We will work through an example when we get to Naïve Bayes Classifiers
21. Rule Pruning Rule Pruning
Similar to post-pruning of decision trees
Reduced Error Pruning:
Remove one of the conjuncts in the rule
Compare error rate on validation set before and after pruning
If error improves, prune the conjunct
22. Summary of Direct Method Grow a single rule
Remove Instances from rule
Prune the rule (if necessary)
Add rule to Current Rule Set
Repeat
23. Direct Method: RIPPER For a 2-class problem, make the minority class the positive class and the majority class the negative class
Learn rules for positive class
Negative class will be default class
Good for data sets with unbalanced class distributions
For multi-class problem
Order the classes according to increasing class prevalence (fraction of instances that belong to a particular class)
Learn the rule set for smallest class first, treat the rest as negative class
Repeat with next smallest class as positive class
24. Direct Method: RIPPER Growing a rule:
Start from empty rule
Add conjuncts as long as they improve FOIL’s information gain
Stop when rule no longer covers negative examples
Prune the rule immediately using incremental reduced error pruning
Measure for pruning: v = (p-n)/(p+n)
p: number of positive examples covered by the rule in the validation set
n: number of negative examples covered by the rule in the validation set
Pruning method: delete any final sequence of conditions that maximizes v
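A sketch of this pruning step, assuming the antecedent is an ordered collection of conjuncts; it tries keeping each prefix (i.e., deleting each final sequence of conditions) and returns whichever truncation maximizes v:

    def prune_metric(rule, validation):
        covered = [r for r in validation if covers(rule, r)]
        p = sum(1 for r in covered if r["class"] == rule["consequent"])  # positives covered
        n = len(covered) - p                                             # negatives covered
        return (p - n) / (p + n) if covered else float("-inf")

    def prune_rule(rule, validation):
        conjuncts = list(rule["antecedent"].items())
        best, best_v = rule, prune_metric(rule, validation)
        for k in range(1, len(conjuncts)):                # keep only the first k conjuncts
            candidate = {"antecedent": dict(conjuncts[:k]),
                         "consequent": rule["consequent"]}
            v = prune_metric(candidate, validation)
            if v > best_v:
                best, best_v = candidate, v
        return best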
25. Direct Method: RIPPER Building a Rule Set:
Use sequential covering algorithm
Finds the best rule that covers the current set of positive examples
Eliminate both positive and negative examples covered by the rule
Each time a rule is added to the rule set, compute the new description length
Stop adding new rules when the new description length is d bits longer than the smallest description length obtained so far
26. Indirect Methods
27. Indirect Method: C4.5rules Extract rules from an unpruned decision tree
For each rule, r: A → y,
consider an alternative rule r′: A′ → y where A′ is obtained by removing one of the conjuncts in A
Compare the pessimistic error rate of r against that of each r′
Prune if one of the r′ alternatives has a lower pessimistic error rate
Repeat until we can no longer improve generalization error
28. Indirect Method: C4.5rules Instead of ordering the rules, order subsets of rules (class ordering)
Each subset is a collection of rules with the same rule consequent (class)
Compute description length of each subset
Order classes with lowest Minimum Description Length first
29. Advantages of Rule-Based Classifiers As highly expressive as decision trees
Are they more expressive?
Can every DT be turned into a Rule Set?
Can every Rule Set be turned into a DT?
Easy to interpret
Easy to generate
Can classify new instances rapidly
Performance comparable to decision trees
30. Instance-Based Classifiers
31. Instance Based Classifiers Examples:
Rote-learner
Memorizes entire training data and performs classification only if attributes of record match one of the training examples exactly
No generalization
Nearest neighbor
Uses k “closest” points (nearest neighbors) for performing classification
32. Nearest Neighbor Classifiers Basic idea:
If it walks like a duck, quacks like a duck, then it’s probably a duck
33. Nearest-Neighbor Classifiers
34. Definition of Nearest Neighbor
35. 1 nearest-neighbor Voronoi Diagram
36. Nearest Neighbor Classification Compute distance between two points:
Euclidean distance
Determine the class from nearest neighbor list
take the majority vote of class labels among the k-nearest neighbors
Weigh the vote according to distance
weight factor, w = 1/d²
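A minimal k-NN sketch with Euclidean distance and the 1/d² vote weighting just described; the training-set format (a list of (feature_vector, label) pairs) is an assumption:

    import math
    from collections import defaultdict

    def knn_classify(train, query, k=3):
        neighbors = sorted(((math.dist(x, query), y) for x, y in train),
                           key=lambda pair: pair[0])[:k]
        votes = defaultdict(float)
        for d, label in neighbors:
            votes[label] += 1.0 / (d * d + 1e-9)  # weight vote by 1/d^2; epsilon avoids division by zero
        return max(votes, key=votes.get)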
37. Nearest Neighbor Classification… Choosing the value of k:
If k is too small, sensitive to noise points
If k is too large, neighborhood may include points from other classes
38. Nearest Neighbor Classification… Scaling issues
Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
Example:
height of a person may vary from 1.5m to 1.8m
weight of a person may vary from 90lb to 300lb
income of a person may vary from $10K to $1M
39. Nearest Neighbor Classification… Problem with Euclidean measure:
High dimensional data
curse of dimensionality
Can produce counter-intuitive results if space is very sparse
40. Nearest neighbor Classification… k-NN classifiers are lazy learners
They do not build models explicitly
Unlike eager learners such as decision tree induction and rule-based systems
Classifying unknown records is relatively expensive
41. Remarks on Lazy vs. Eager Learning Instance-based learning: lazy evaluation
Decision-tree and Rule-Based classification: eager evaluation
Key differences
Lazy method may consider query instance xq when deciding how to generalize beyond the training data D
Eager methods cannot, since they have already chosen a global approximation before seeing the query
Efficiency: Lazy - less time training but more time predicting
Accuracy
Lazy method effectively uses a richer hypothesis space since it uses many local linear functions to form its implicit global approximation to the target function
Eager: must commit to a single hypothesis that covers the entire instance space
42. Bayes Classifier A probabilistic framework for classification problems
Often appropriate because the world is noisy and also some relationships are probabilistic in nature
Is predicting who will win a baseball game probabilistic in nature?
Before getting to the heart of the matter, we will go over some basic probability.
I will be much more “gentle” than the text and go over all of the details and reasoning
You may find a few of these basic probability questions on your exam and also should understand how Bayesian classifiers work
Stop me if you have questions!!!!
43. Conditional Probability The conditional probability of an event C given that event A has occurred is denoted P(C|A)
Here are some practice questions that you should be able to answer based on your knowledge from CS1100/1400
Given a standard six-sided die:
What is P(roll a 1|roll an odd number)?
What is P(roll a 1|roll an even number)?
What is P(roll a number >2)?
What is P(roll a number >2|roll a number > 1)?
Given a standard deck of cards:
What is P(pick an Ace|pick a red card)?
What is P(pick ace of clubs|pick an Ace)?
What is P(pick an ace on second card|pick an ace on first card)?
Do you get it? Any questions before we move on?
44. Conditional Probability Continued P(A,C) is the probability that A and C both occur
What is P(pick an Ace, pick a red card) = ???
Note that this is the same as the probability of picking a red ace
Two events are independent if one occurring does not impact the occurrence or non-occurrence of the other one
Please give me some examples of independent events
You can think of some examples from flipping coins if that helps
In the example from above, are A and C independent events?
If two events A and B are independent, then:
P(A, B) = P(A) x P(B)
Note that this helps us answer the P(A,C) above, although most of you probably solved it directly.
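To check: P(A) = 4/52 and P(C) = 26/52, so independence would give P(A,C) = 4/52 × 26/52 = 2/52, which is exactly the probability of picking one of the two red aces. So yes, A and C are independent here.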
45. Conditional Probability Continued
46. An Example Let’s use the example from before, where:
A = “pick an Ace” and
C = “pick a red card”
Using the previous equation:
P(C|A) = P(A,C)/P(A)
We now get:
P(red card|Ace) = P(red ace)/P(Ace)
P(red card|Ace) = (2/52)/(4/52) = 2/4 = .5
Hopefully this makes sense, just like the Venn Diagram should
47. Bayes Theorem Bayes Theorem states that:
P(C|A) = [P(A|C) P(C)] / P(A)
Prove this equation using the two equations we informally showed to be true using the Venn Diagrams:
P(C|A) = P(A,C)/P(A) and P(A|C) = P(A,C)/P(C)
Start with P(C|A) = P(A,C)/P(A)
Then notice that we are part way there and only need to replace P(A,C)
By rearranging the second equation (after the “and”) to isolate P(A,C), we can substitute in P(A|C)P(C) for P(A,C) and the proof is done!
48. Some more terminology The Prior Probability is the probability assuming no specific information.
Thus we would refer to P(A) as the prior probability of event A occurring
We would not say that P(A|C) is the prior probability of A occurring
The Posterior probability is the probability given that we know something
We would say that P(A|C) is the posterior probability of A (given that C occurs)
49. Example of Bayes Theorem Given:
A doctor knows that meningitis causes stiff neck 50% of the time
Prior probability of any patient having meningitis is 1/50,000
Prior probability of any patient having stiff neck is 1/20
If a patient has stiff neck, what’s the probability he/she has meningitis?
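Applying Bayes theorem with the numbers above: P(M|S) = P(S|M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002, so even given a stiff neck the probability of meningitis is only 1 in 5,000.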
50. Bayesian Classifiers Given a record with attributes (A1, A2,…,An)
The goal is to predict class C
Actually, we want to find the value of C that maximizes P(C| A1, A2,…,An )
Can we estimate P(C| A1, A2,…,An ) directly (w/o Bayes)?
Yes, we simply need to count up the number of times we see A1, A2,…,An and then see what fraction belongs to each class
For example, if n=3 and the feature vector “4,3,2” occurs 10 times and 4 of these belong to C1 and 6 to C2, then:
What is P(C1|”4,3,2”)?
What is P(C2|”4,3,2”)?
Unfortunately, this is generally not feasible since not every feature vector will be found in the training set
51. Bayesian Classifiers Indirect Approach: Use Bayes Theorem
compute the posterior probability P(C | A1, A2, …, An) for all values of C using the Bayes theorem
Choose value of C that maximizes P(C | A1, A2, …, An)
Equivalent to choosing value of C that maximizes P(A1, A2, …, An|C) P(C)
Since the denominator is the same for all values of C
52. Naïve Bayes Classifier How can we estimate P(A1, A2, …, An |C)?
We can measure it directly, but only if the training set samples every feature vector. Not practical!
So, we must assume independence among attributes Ai when class is given:
P(A1, A2, …, An |Cj) = P(A1|Cj) P(A2|Cj) … P(An|Cj)
Then we can estimate P(Ai| Cj) for all Ai and Cj from training data
This is reasonable because now we are looking only at one feature at a time. We can expect to see each feature value represented in the training data.
New point is classified to Cj if P(Cj) Π P(Ai|Cj) is maximal.
53. How to Estimate Probabilities from Data? Class: P(C) = Nc/N
e.g., P(No) = 7/10, P(Yes) = 3/10
For discrete attributes: P(Ai | Ck) = |Aik|/ Nc
where |Aik| is the number of instances having attribute value Ai that belong to class Ck
Examples:
P(Status=Married|No) = 4/7
P(Refund=Yes|Yes) = 0
54. How to Estimate Probabilities from Data? For continuous attributes:
Discretize the range into bins
Two-way split: (A < v) or (A > v)
choose only one of the two splits as new attribute
Creates a binary feature
Probability density estimation:
Assume attribute follows a normal distribution and use the data to fit this distribution
Once probability distribution is known, can use it to estimate the conditional probability P(Ai|c)
We will not deal with continuous values on HW or exam
Just understand the general ideas above
For the tax cheating example, we will assume that “Taxable Income” is discrete
Each of the 10 values will therefore have a prior probability of 1/10
55. Example of Naïve Bayes We start with a test example and want to know its class. Does this individual evade their taxes: Yes or No?
Here is the feature vector:
Refund = No, Married, Income = 120K
Now what do we do?
First try writing out the thing we want to measure
56. Example of Naïve Bayes We start with a test example and want to know its class. Does this individual evade their taxes: Yes or No?
Here is the feature vector:
Refund = No, Married, Income = 120K
Now what do we do?
First try writing out the thing we want to measure
P(Evade|[No, Married, Income=120K])
Next, what do we need to maximize?
57. Example of Naïve Bayes We start with a test example and want to know its class. Does this individual evade their taxes: Yes or No?
Here is the feature vector:
Refund = No, Married, Income = 120K
Now what do we do?
First try writing out the thing we want to measure
P(Evade|[No, Married, Income=120K])
Next, what do we need to maximize?
P(Cj) Π P(Ai|Cj)
58. Example of Naïve Bayes Since we want to maximize P(Cj) Π P(Ai|Cj)
What quantities do we need to calculate in order to use this equation?
Someone come up to the board and write them out, without calculating them
Recall that we have three attributes:
Refund: Yes, No
Marital Status: Single, Married, Divorced
Taxable Income: 10 different “discrete” values
While we could compute every P(Ai| Cj) for all Ai, we only need to do it for the attribute values in the test example
59. Values to Compute Given we need to compute P(Cj) Π P(Ai|Cj)
We need to compute the class probabilities
P(Evade=No)
P(Evade=Yes)
We need to compute the conditional probabilities
P(Refund=No|Evade=No)
P(Refund=No|Evade=Yes)
P(Marital Status=Married|Evade=No)
P(Marital Status=Married|Evade=Yes)
P(Income=120K|Evade=No)
P(Income=120K|Evade=Yes)
60. Computed Values Given we need to compute P(Cj) Π P(Ai|Cj)
We need to compute the class probabilities
P(Evade=No) = 7/10 = .7
P(Evade=Yes) = 3/10 = .3
We need to compute the conditional probabilities
P(Refund=No|Evade=No) = 4/7
P(Refund=No|Evade=Yes) = 3/3 = 1.0
P(Marital Status=Married|Evade=No) = 4/7
P(Marital Status=Married|Evade=Yes) =0/3 = 0
P(Income=120K|Evade=No) = 1/7
P(Income=120K|Evade=Yes) = 0/3 = 0
61. Finding the Class Now compute P(Cj) Π P(Ai|Cj) for both classes for the test example [No, Married, Income = 120K]
For Class Evade=No we get:
.7 x 4/7 x 4/7 x 1/7 = 0.032
For Class Evade=Yes we get:
.3 x 1 x 0 x 0 = 0
Which one is best?
Clearly we would select “No” for the class value
Note that these are not the actual probabilities of each class, since we did not divide by P([No, Married, Income = 120K])
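The same computation as a few lines of Python, with the counts hard-coded from the slide above:

    # Class priors and conditional probabilities from the 10-record training set
    p_no  = 7/10 * 4/7 * 4/7 * 1/7   # P(No)  * P(Refund=No|No)  * P(Married|No)  * P(120K|No)
    p_yes = 3/10 * 3/3 * 0/3 * 0/3   # P(Yes) * P(Refund=No|Yes) * P(Married|Yes) * P(120K|Yes)
    print(p_no, p_yes)               # ~0.0326 vs 0.0, so predict Evade = No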
62. Naïve Bayes Classifier If one of the conditional probabilities is zero, then the entire expression becomes zero
This is not ideal, especially since probability estimates may not be very precise for rarely occurring values
We use the Laplace estimate to improve things. Without a lot of observations, the Laplace estimate moves the probability towards the value assuming all classes equally likely
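One common form of the Laplace estimate (the exact variant intended here is an assumption) replaces the raw fraction nc/n with: P(Ai|C) = (nc + 1) / (n + v), where nc is the number of class-C records with value Ai, n is the number of class-C records, and v is the number of distinct values of the attribute. With no observations (n = 0) this gives 1/v, i.e., all values equally likely, and it never produces a zero probability.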
63. Naïve Bayes (Summary) Robust to isolated noise points
Robust to irrelevant attributes
Independence assumption may not hold for some attributes
But works surprisingly well in practice for many problems
64. More Examples There are two more examples coming up
Go over them before trying the HW, unless you are clear on Bayesian Classifiers
You are not responsible for Bayesian Belief Networks
66. Play-tennis example: classifying X An unseen sample X = <rain, hot, high, false>
P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9·2/9·3/9·6/9·9/14 = 0.010582
P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5·2/5·4/5·2/5·5/14 = 0.018286
Sample X is classified in class n (don’t play)
67. Example of Naïve Bayes Classifier
68. Artificial Neural Networks (ANN) The study of ANNs was inspired by the human brain
ANNs do not try to exactly mimic the brain
Relevant terms:
Neurons: nerve cells in the brain
Synapse: gap between neurons over which signals travel
The brain learns by changing the strength of synaptic connections upon repeated stimulation by same impulse
How do you think the brain and an ANN differ?
Neurons very slow but brain massively parallel
Can you think of tasks that show this difference?
Adding numbers: sequential so brain slower
Seeing an image: brain can exploit parallelism
69. Artificial Neural Networks (ANN)
70. Artificial Neural Networks (ANN)
71. Artificial Neural Networks (ANN) Model is an assembly of inter-connected nodes and weighted links
Output node sums up its input values according to the weights of its links
Compare output node against some threshold t
Perceptron model has no hidden units and simple activation function
What is Y for X=1,0,1 and 0,1,1?
72. Algorithm for learning ANN Initialize the weights (w0, w1, …, wk)
Adjust the weights in such a way that the output of ANN is consistent with class labels of training examples
Objective function: minimize the sum of squared errors, E = Σi [yi − f(w, xi)]²
Find the weights wi’s that minimize the above objective function
e.g., backpropagation algorithm
Make small corrections to the weights for each example
Iterate over the data many times; each full pass over the training data is called an epoch.
73. Weight Update formula Weight update formula for iteration k
wj(k+1) = wj(k) + λ(yi − y′i(k))xij
xij is the value for the jth attribute of training example xi
λ is the learning rate and controls how big the correction is
What are the advantages of making λ big or small?
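A sketch of the full training loop using this update rule; the sign activation, zero-initialized weights, and {-1, +1} labels are assumptions for illustration:

    def train_perceptron(X, y, lr=0.1, epochs=100):
        # X: list of feature vectors, y: list of labels in {-1, +1}
        w = [0.0] * len(X[0])
        for _ in range(epochs):                  # one epoch = one pass over the training data
            for xi, yi in zip(X, y):
                pred = 1 if sum(wj * xj for wj, xj in zip(w, xi)) >= 0 else -1
                for j in range(len(w)):          # w_j <- w_j + lr * (y_i - y'_i) * x_ij
                    w[j] += lr * (yi - pred) * xi[j]
        return w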
74. General Structure of ANN
75. Learning Power of an ANN Perceptron is guaranteed to converge if data is linearly separable
It will learn a hyperplane that separates the classes
The XOR function on page 250 is not linearly separable
A multilayer ANN has no such guarantee of convergence but can learn functions that are not linearly separable
An ANN with a hidden layer can learn the XOR function by constructing two hyperplanes (see page 253)
76. History of ANNs In the 1940s, McCulloch and Pitts introduced the first neural network computing model
In the 1950s, Rosenblatt introduced the perceptron
Because perceptrons have limited expressive power, the initial interest in ANNs waned
Later people realized that the lack of convergence isn't necessarily such a big deal and that multilayer perceptrons are powerful
referred to as universal approximators
Great interest in ANNs returned in the 1980s
77. Characteristics of an ANN Very expressive and can approximate any function
Much more expressive than decision trees
Training is relatively time consuming
Much slower than decision trees
But classification of examples is very quick
Do you think that neural networks are interpretable?
Try looking at someone’s brain to see what they are thinking!
Handles continuous inputs quite well and naturally
Seem popular with sensory applications (e.g., recognizing objects)
78. Final Notes on ANNs More complex ANNs exist
Recurrent ANNs allow links from one layer back to previous layers or to the same layer
If this is not allowed, then we have a feedforward ANN. Perceptrons are feedforward only.
Most ANNs use an activation function that is more complex than the sign function.
Sigmoid functions are popular (see page 252)
These functions compute values that are non-linear in terms of the input values
79. Genetic Algorithms (in 1 Slide) GA: based on an analogy to biological evolution
Each rule is represented by a string of bits
An initial population is created consisting of randomly generated rules
e.g., If A1 and Not A2 then C2 encoded as 100
Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring
The fitness of a rule is represented by its classification accuracy on a set of training examples
Offspring are generated by crossover and mutation
GA’s are a general search/optimization method, not just a classification method. This can be contrasted with other methods
80. Ensemble Methods Construct a set of classifiers from the training data
Predict class label of previously unseen records by aggregating predictions made by multiple classifiers
81. General Idea
82. Why does it work? Suppose there are 25 base classifiers
Each classifier has error rate ε = 0.35
Assume classifiers are independent
Probability that the ensemble classifier makes a wrong prediction (a majority, i.e., at least 13 of the 25, must err): Σi=13..25 C(25, i) ε^i (1 − ε)^(25−i) ≈ 0.06
Practice has shown that even when independence does not hold results are good
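A quick check of that calculation in Python:

    from math import comb

    eps, n = 0.35, 25
    p_wrong = sum(comb(n, i) * eps**i * (1 - eps)**(n - i)
                  for i in range(13, n + 1))
    print(p_wrong)  # ~0.06, far below the individual error rate of 0.35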
83. Methods for generating Multiple Classifiers Manipulate the training data
Sample the data differently each time
Examples: Bagging and Boosting
Manipulate the input features
Sample the features differently each time
Makes especially good sense if there is redundancy
Example: Random Forest
Manipulate Class labels
Convert multi-class problem to ensemble of binary classifiers
Manipulate the learning algorithm
Vary some parameter of the learning algorithm
For example: amount of pruning, ANN network topology, etc.
84. Background Classifier performance can be impacted by:
Bias: assumptions made to help with generalization
"Simpler is better" is a bias
Variance: a learning method will give different results based on small changes (e.g., in training data).
When I run experiments and use random sampling with repeated runs, I get different results each time.
Noise: measurements may have errors or the class may be inherently probabilistic
85. How Ensembles Help Ensemble methods can assist with the bias and variance
Averaging the results over multiple runs will reduce the variance
I observe this when I use 10 runs with random sampling and see that my learning curves are much smoother
Ensemble methods especially helpful for unstable classifier algorithms
Decision trees are unstable since small changes in the training data can greatly impact the structure of the learned decision tree
If you combine different classifier methods into an ensemble, then you are using methods with different biases
You are more likely to use a classifier with a bias that is a good match for the problem
You may even be able to identify the best methods and weight them more
86. Examples of Ensemble Methods How to generate an ensemble of classifiers?
Bagging
Boosting
These methods have been shown to be quite effective
A technique ignored by the textbook is to combine classifiers built separately
By simple voting
By voting and factoring in the reliability of each classifier
Note that Enterprise Miner allows you to do this by combining diverse classifiers
87. Bagging Sampling with replacement
Build classifier on each bootstrap sample
Each record has probability 1 − (1 − 1/n)^n of being included in a bootstrap sample (about 63% for large n)
Some records will be picked more than once
Combine the resulting classifiers, such as by majority voting
Greatly reduces the variance when compared to a single base classifier
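A minimal bagging sketch; the base learner train_fn (assumed here to return a callable model) is a placeholder:

    import random

    def bagging(records, train_fn, n_models=25):
        models = []
        for _ in range(n_models):
            # Bootstrap sample: draw n records with replacement
            sample = [random.choice(records) for _ in range(len(records))]
            models.append(train_fn(sample))
        return models

    def bagged_predict(models, record):
        votes = [m(record) for m in models]       # each model votes
        return max(set(votes), key=votes.count)   # majority vote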
88. Boosting An iterative procedure to adaptively change distribution of training data by focusing more on previously misclassified records
Initially, all N records are assigned equal weights
Unlike bagging, weights may change at the end of each boosting round
89. Boosting Records that are wrongly classified will have their weights increased
Records that are classified correctly will have their weights decreased
90. Ensembles and Enterprise Miner Enterprise Miner has support for:
Bagging
Boosting
Combining different models
For example, an ANN and a DT model
C5 also has support for boosting
C4.5 has no support for boosting
91. Class Imbalance Class imbalance occurs when the classes are very unevenly distributed
Examples would include fraud prediction and identification of rare diseases
If there is class imbalance, accuracy may be high even if the rare class is never predicted
This could be okay, but only if both classes are equally important
This is usually not the case. Typically the rare class is more important
The cost of a false negative is usually much higher than the cost of a false positive
The rare class is designated the positive class
92. Confusion Matrix The following abbreviations are standard:
FP: False Positive
TP: True Positive
FN: False Negative
TN: True Negative
93. Classifier Evaluation Metrics Accuracy = (TP + TN)/(TP + TN + FN + FP)
Precision and Recall
Recall = TP/(TP + FN)
Recall measures the fraction of positive class examples that are correctly identified
Precision = TP/(TP + FP)
Precision is essentially the accuracy of the examples that are classified as positive
We would like to maximize precision and recall
They are usually competing goals
These measures are appropriate for class imbalance since recall explicitly addresses the rare class
94. Classifier Evaluation Metrics Cont. One issue with precision and recall is that they are two numbers, not one
That makes simple comparisons more difficult, and it is not clear how to determine the best classifier
Solution: combine the two
F-measure combines precision and recall
The F1 measure is defined as:
(2 x recall x precision)/(recall + precision)
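All four metrics computed from the raw confusion-matrix counts, as a small sketch:

    def metrics(tp, tn, fp, fn):
        accuracy  = (tp + tn) / (tp + tn + fp + fn)
        recall    = tp / (tp + fn)                 # fraction of actual positives found
        precision = tp / (tp + fp)                 # fraction of predicted positives that are correct
        f1 = 2 * recall * precision / (recall + precision)
        return accuracy, precision, recall, f1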
95. ROC Curves Plots TP Rate on the y-axis vs. FP Rate on the x-axis
TP Rate = TP/(TP + FN)
FP Rate = FP/(TN + FP)
Interpretation of critical points
(0,0): Model always predicts negative
(1,1): Model always predicts positive
(0,1): Ideal model: all positives predicted correctly and no negatives mistakenly classified as positive
ROC Curves can be used to compare classifiers
Area under the ROC curve (AUC) generates a single statistic of overall robustness of a model
AUC is becoming almost as common as accuracy
The ROC curve does not depend on the class distribution of the population
96. What you need to know For evaluation metrics, you need to know
What the confusion matrix is and how to generate it
How to define accuracy and compute it from a confusion matrix
You should be able to figure out the formula for precision and recall and compute the values from a confusion matrix
Understand the F-measure
I would provide the formula
Understand the basics of the ROC curve but not the details
97. Cost-Sensitive Learning Cost-sensitive learning will factor in cost information
For example, you may be given the relative cost of a FN vs. FP
You may be given the costs or utilities associated with each quadrant in the confusion matrix
We did this with Enterprise Miner to figure out the optimal classifier
Cost-sensitive learning can be implemented by sampling the data to reflect the costs
If the cost of a FN is twice that of a FP, then you can increase the ratio of positive examples by a factor of 2 when constructing the training set
This is a good general area for research and relates to some of my research interests