1. Data Mining Classification: Alternative Techniques
2. Rule-Based Classifier Classify records by using a collection of “if…then…” rules
Rule: (Condition) → y
where
Condition is a conjunction of attributes
y is the class label
LHS: rule antecedent or condition
RHS: rule consequent
Examples of classification rules:
(Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
(Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No
3. Rule-based Classifier (Example) R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
4. Application of Rule-Based Classifier A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule
5. Rule Coverage and Accuracy Coverage of a rule:
Fraction of records that satisfy the antecedent of a rule
Accuracy/Confidence of a rule:
Fraction of records for which the rule “fires” (antecedent is satisfied) that are correctly classified
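As a minimal sketch, coverage and accuracy could be computed like this in Python; the rule representation (an antecedent dict of attribute tests plus a consequent class) and the record format are assumptions for illustration:

    def covers(rule, record):
        # The rule covers the record if every attribute test in the antecedent is satisfied
        return all(record.get(attr) == val for attr, val in rule["antecedent"].items())

    def coverage_and_accuracy(rule, records):
        covered = [r for r in records if covers(rule, r)]
        coverage = len(covered) / len(records)                 # fraction satisfying the antecedent
        correct = sum(1 for r in covered if r["class"] == rule["consequent"])
        accuracy = correct / len(covered) if covered else 0.0  # fraction of covered records classified correctly
        return coverage, accuracy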
6. How does Rule-based Classifier Work?
7. Characteristics of Rule-Based Classifier Rule Learners may or may not have these properties:
Mutually exclusive rules
Every record is covered by at most one rule
Exhaustive rules
Classifier has exhaustive coverage if it accounts for every possible combination of attribute values
Each record is covered by at least one rule
If both mutually exclusive and exhaustive, then every record covered by exactly one rule
8. From Decision Trees To Rules
9. Rules Can Be Simplified
10. Effect of Rule Simplification Rules are no longer mutually exclusive
A record may trigger more than one rule
Solution?
Ordered rule set
Unordered rule set – use voting schemes
Rules are no longer exhaustive
A record may not trigger any rules
Solution?
Use a default class
11. Ordered Rule Set Rules are rank ordered according to their priority
An ordered rule set is known as a decision list
When a test record is presented to the classifier
It is assigned to the class label of the highest ranked rule it has triggered
If none of the rules fired, it is assigned to the default class
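A minimal sketch of decision-list classification, reusing the hypothetical covers() helper sketched earlier:

    def classify(decision_list, record, default_class):
        # Scan the rules in rank order; the first rule that fires decides the class
        for rule in decision_list:
            if covers(rule, record):
                return rule["consequent"]
        return default_class  # no rule fired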
12. Rule Ordering Schemes Rule-based ordering
Individual rules are ranked based on their quality
Class-based ordering
Rules that belong to the same class appear together
13. Building Classification Rules Direct Method:
Extract rules directly from data
e.g.: RIPPER, CN2, Holte’s 1R
Indirect Method:
Extract rules from other classification models (e.g., decision trees, neural networks)
e.g.: C4.5rules
14. Direct Method: Sequential Covering Start from an empty rule
Grow a rule using the Learn-One-Rule function
Remove training records covered by the rule
Repeat steps (2) and (3) until the stopping criterion is met
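A sketch of this outer loop, treating Learn-One-Rule as a black box and using a hypothetical minimum-coverage stopping criterion:

    def sequential_covering(records, learn_one_rule, min_coverage=1):
        rule_set, remaining = [], list(records)
        while remaining:
            rule = learn_one_rule(remaining)                        # step (2): grow a rule
            covered = [r for r in remaining if covers(rule, r)]
            if len(covered) < min_coverage:                         # stopping criterion (an assumption)
                break
            rule_set.append(rule)
            remaining = [r for r in remaining if r not in covered]  # step (3): remove covered records
        return rule_set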
15. Example of Sequential Covering
16. Example of Sequential Covering…
17. Aspects of Sequential Covering Rule Growing
Instance Elimination
Rule Evaluation
Stopping Criterion
Rule Pruning
18. Rule Growing Two common strategies
19. Instance Elimination Why do we need to eliminate instances?
Otherwise, the next rule learned would be identical to the previous rule
Why do we remove positive instances?
Ensure that the next rule is different
Why do we remove negative instances?
Prevent underestimating accuracy of rule
Compare rules R2 and R3 in the diagram
20. Rule Evaluation Metrics:
Accuracy
Laplace
Laplace moves the probability estimate towards the probability you would use with no information, assuming all classes are equally likely
We will work through an example when we get to Naïve Bayes Classifiers
21. Rule Pruning Rule Pruning
Similar to post-pruning of decision trees
Reduced Error Pruning:
Remove one of the conjuncts in the rule
Compare error rate on validation set before and after pruning
If error improves, prune the conjunct
22. Summary of Direct Method Grow a single rule
Remove Instances from rule
Prune the rule (if necessary)
Add rule to Current Rule Set
Repeat
23. Direct Method: RIPPER For a 2-class problem, make the minority class the positive class and the majority class the negative class
Learn rules for positive class
Negative class will be default class
Good for data sets with unbalanced class distributions
For multi-class problem
Order the classes according to increasing class prevalence (fraction of instances that belong to a particular class)
Learn the rule set for smallest class first, treat the rest as negative class
Repeat with next smallest class as positive class
24. Direct Method: RIPPER Growing a rule:
Start from empty rule
Add conjuncts as long as they improve FOIL’s information gain
Stop when rule no longer covers negative examples
Prune the rule immediately using incremental reduced error pruning
Measure for pruning: v = (p-n)/(p+n)
p: number of positive examples covered by the rule in the validation set
n: number of negative examples covered by the rule in the validation set
Pruning method: delete any final sequence of conditions that maximizes v
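A sketch of this pruning step, assuming the antecedent is an ordered collection of conjuncts; it tries keeping each prefix (i.e., deleting each final sequence of conditions) and returns whichever truncation maximizes v:

    def prune_metric(rule, validation):
        covered = [r for r in validation if covers(rule, r)]
        p = sum(1 for r in covered if r["class"] == rule["consequent"])  # positives covered
        n = len(covered) - p                                             # negatives covered
        return (p - n) / (p + n) if covered else float("-inf")

    def prune_rule(rule, validation):
        conjuncts = list(rule["antecedent"].items())
        best, best_v = rule, prune_metric(rule, validation)
        for k in range(1, len(conjuncts)):                # keep only the first k conjuncts
            candidate = {"antecedent": dict(conjuncts[:k]),
                         "consequent": rule["consequent"]}
            v = prune_metric(candidate, validation)
            if v > best_v:
                best, best_v = candidate, v
        return best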
25. Direct Method: RIPPER Building a Rule Set:
Use sequential covering algorithm
Finds the best rule that covers the current set of positive examples
Eliminate both positive and negative examples covered by the rule
Each time a rule is added to the rule set, compute the new description length
Stop adding new rules when the new description length is d bits longer than the smallest description length obtained so far
26. Indirect Methods
27. Indirect Method: C4.5rules Extract rules from an unpruned decision tree
For each rule, r: A → y,
consider an alternative rule r′: A′ → y where A′ is obtained by removing one of the conjuncts in A
Compare the pessimistic error rate of r against that of each r′
Prune if one of the r′ alternatives has a lower pessimistic error rate
Repeat until we can no longer improve generalization error
28. Indirect Method: C4.5rules Instead of ordering the rules, order subsets of rules (class ordering)
Each subset is a collection of rules with the same rule consequent (class)
Compute description length of each subset
Order classes with lowest Minimum Description Length first
29. Advantages of Rule-Based Classifiers As highly expressive as decision trees
Are they more expressive?
Can every DT be turned into a Rule Set?
Can every Rule Set be turned into a DT?
Easy to interpret
Easy to generate
Can classify new instances rapidly
Performance comparable to decision trees
30. Instance-Based Classifiers
31. Instance Based Classifiers Examples:
Rote-learner
Memorizes entire training data and performs classification only if attributes of record match one of the training examples exactly
No generalization
Nearest neighbor
Uses k “closest” points (nearest neighbors) for performing classification
32. Nearest Neighbor Classifiers Basic idea:
If it walks like a duck, quacks like a duck, then it’s probably a duck
33. Nearest-Neighbor Classifiers
34. Definition of Nearest Neighbor
35. 1 nearest-neighbor Voronoi Diagram
36. Nearest Neighbor Classification Compute distance between two points:
Euclidean distance
Determine the class from nearest neighbor list
take the majority vote of class labels among the k-nearest neighbors
Weigh the vote according to distance
weight factor, w = 1/d²
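A minimal k-NN sketch with Euclidean distance and the 1/d² vote weighting just described; the training-set format (a list of (feature_vector, label) pairs) is an assumption:

    import math
    from collections import defaultdict

    def knn_classify(train, query, k=3):
        neighbors = sorted(((math.dist(x, query), y) for x, y in train),
                           key=lambda pair: pair[0])[:k]
        votes = defaultdict(float)
        for d, label in neighbors:
            votes[label] += 1.0 / (d * d + 1e-9)  # weight vote by 1/d^2; epsilon avoids division by zero
        return max(votes, key=votes.get)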
37. Nearest Neighbor Classification… Choosing the value of k:
If k is too small, sensitive to noise points
If k is too large, neighborhood may include points from other classes
38. Nearest Neighbor Classification… Scaling issues
Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
Example:
height of a person may vary from 1.5m to 1.8m
weight of a person may vary from 90lb to 300lb
income of a person may vary from $10K to $1M
39. Nearest Neighbor Classification… Problem with Euclidean measure:
High dimensional data
curse of dimensionality
Can produce counter-intuitive results if space is very sparse
40. Nearest neighbor Classification… k-NN classifiers are lazy learners
They do not build models explicitly
Unlike eager learners such as decision tree induction and rule-based systems
Classifying unknown records is relatively expensive
41. Remarks on Lazy vs. Eager Learning Instance-based learning: lazy evaluation
Decision-tree and Rule-Based classification: eager evaluation
Key differences
Lazy method may consider query instance xq when deciding how to generalize beyond the training data D
Eager methods cannot, since they have already chosen a global approximation before seeing the query
Efficiency: Lazy - less time training but more time predicting
Accuracy
Lazy method effectively uses a richer hypothesis space since it uses many local linear functions to form its implicit global approximation to the target function
Eager: must commit to a single hypothesis that covers the entire instance space
42. Bayes Classifier A probabilistic framework for classification problems
Often appropriate because the world is noisy and also some relationships are probabilistic in nature
Is predicting who will win a baseball game probabilistic in nature?
Before getting to the heart of the matter, we will go over some basic probability.
I will be much more “gentle” than the text and go over all of the details and reasoning
You may find a few of these basic probability questions on your exam and also should understand how Bayesian classifiers work
Stop me if you have questions!!!!
43. Conditional Probability The conditional probability of an event C given that event A has occurred is denoted P(C|A)
Here are some practice questions that you should be able to answer based on your knowledge from CS1100/1400
Given a standard six-sided die:
What is P(roll a 1|roll an odd number)?
What is P(roll a 1|roll an even number)?
What is P(roll a number >2)?
What is P(roll a number >2|roll a number > 1)?
Given a standard deck of cards:
What is P(pick an Ace|pick a red card)?
What is P(pick ace of clubs|pick an Ace)?
What is P(pick an ace on second card|pick an ace on first card)?
Do you get it? Any questions before we move on?
44. Conditional Probability Continued P(A,C) is the probability that A and C both occur
What is P(pick an Ace, pick a red card) = ???
Note that this is the same as the probability of picking a red ace
Two events are independent if one occurring does not impact the occurrence or non-occurrence of the other one
Please give me some examples of independent events
You can think of some examples from flipping coins if that helps
In the example from above, are A and C independent events?
If two events A and B are independent, then:
P(A, B) = P(A) x P(B)
Note that this helps us answer the P(A,C) above, although most of you probably solved it directly.
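To check: P(A) = 4/52 and P(C) = 26/52, so independence would give P(A,C) = 4/52 × 26/52 = 2/52, which is exactly the probability of picking one of the two red aces. So yes, A and C are independent here.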
45. Conditional Probability Continued
46. An Example Let’s use the example from before, where:
A = “pick an Ace” and
C = “pick a red card”
Using the previous equation:
P(C|A) = P(A,C)/P(A)
We now get:
P(red card|Ace) = P(red ace)/P(Ace)
P(red card|Ace) = (2/52)/(4/52) = 2/4 = .5
Hopefully this makes sense, just like the Venn Diagram should
47. Bayes Theorem Bayes Theorem states that:
P(C|A) = [P(A|C) P(C)] / P(A)
Prove this equation using the two equations we informally showed to be true using the Venn Diagrams:
P(C|A) = P(A,C)/P(A) and P(A|C) = P(A,C)/P(C)
Start with P(C|A) = P(A,C)/P(A)
Then notice that we are part way there and only need to replace P(A,C)
By rearranging the second equation (after the “and”) to isolate P(A,C), we can substitute in P(A|C)P(C) for P(A,C) and the proof is done!
48. Some more terminology The Prior Probability is the probability assuming no specific information.
Thus we would refer to P(A) as the prior probability of event A occurring
We would not say that P(A|C) is the prior probability of A occurring
The Posterior probability is the probability given that we know something
We would say that P(A|C) is the posterior probability of A (given that C occurs)
49. Example of Bayes Theorem Given:
A doctor knows that meningitis causes stiff neck 50% of the time
Prior probability of any patient having meningitis is 1/50,000
Prior probability of any patient having stiff neck is 1/20
If a patient has stiff neck, what’s the probability he/she has meningitis?
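Applying Bayes theorem with the numbers above: P(M|S) = P(S|M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002, so even given a stiff neck the probability of meningitis is only 1 in 5,000.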
50. Bayesian Classifiers Given a record with attributes (A1, A2,…,An)
The goal is to predict class C
Actually, we want to find the value of C that maximizes P(C| A1, A2,…,An )
Can we estimate P(C| A1, A2,…,An ) directly (w/o Bayes)?
Yes, we simply need to count up the number of times we see A1, A2,…,An and then see what fraction belongs to each class
For example, if n=3 and the feature vector “4,3,2” occurs 10 times and 4 of these belong to C1 and 6 to C2, then:
What is P(C1|”4,3,2”)?
What is P(C2|”4,3,2”)?
Unfortunately, this is generally not feasible since not every feature vector will be found in the training set
51. Bayesian Classifiers Indirect Approach: Use Bayes Theorem
compute the posterior probability P(C | A1, A2, …, An) for all values of C using the Bayes theorem
Choose value of C that maximizes P(C | A1, A2, …, An)
Equivalent to choosing value of C that maximizes P(A1, A2, …, An|C) P(C)
Since the denominator is the same for all values of C
52. Naïve Bayes Classifier How can we estimate P(A1, A2, …, An |C)?
We can measure it directly, but only if the training set samples every feature vector. Not practical!
So, we must assume independence among attributes Ai when class is given:
P(A1, A2, …, An |Cj) = P(A1|Cj) P(A2|Cj) … P(An|Cj)
Then we can estimate P(Ai| Cj) for all Ai and Cj from training data
This is reasonable because now we are looking only at one feature at a time. We can expect to see each feature value represented in the training data.
New point is classified to Cj if P(Cj) Π P(Ai|Cj) is maximal.
53. How to Estimate Probabilities from Data? Class: P(C) = Nc/N
e.g., P(No) = 7/10, P(Yes) = 3/10
For discrete attributes: P(Ai | Ck) = |Aik|/ Nc
where |Aik| is the number of instances having attribute value Ai that belong to class Ck
Examples:
P(Status=Married|No) = 4/7
P(Refund=Yes|Yes) = 0
54. How to Estimate Probabilities from Data? For continuous attributes:
Discretize the range into bins
Two-way split: (A < v) or (A > v)
choose only one of the two splits as new attribute
Creates a binary feature
Probability density estimation:
Assume attribute follows a normal distribution and use the data to fit this distribution
Once probability distribution is known, can use it to estimate the conditional probability P(Ai|c)
We will not deal with continuous values on HW or exam
Just understand the general ideas above
For the tax cheating example, we will assume that “Taxable Income” is discrete
Each of the 10 values will therefore have a prior probability of 1/10
55. Example of Naïve Bayes We start with a test example and want to know its class. Does this individual evade their taxes: Yes or No?
Here is the feature vector:
Refund = No, Married, Income = 120K
Now what do we do?
First try writing out the thing we want to measure
56. Example of Naïve Bayes We start with a test example and want to know its class. Does this individual evade their taxes: Yes or No?
Here is the feature vector:
Refund = No, Married, Income = 120K
Now what do we do?
First try writing out the thing we want to measure
P(Evade|[No, Married, Income=120K])
Next, what do we need to maximize?
57. Example of Naïve Bayes We start with a test example and want to know its class. Does this individual evade their taxes: Yes or No?
Here is the feature vector:
Refund = No, Married, Income = 120K
Now what do we do?
First try writing out the thing we want to measure
P(Evade|[No, Married, Income=120K])
Next, what do we need to maximize?
P(Cj) Π P(Ai|Cj)
58. Example of Naïve Bayes Since we want to maximize P(Cj) Π P(Ai|Cj)
What quantities do we need to calculate in order to use this equation?
Someone come up to the board and write them out, without calculating them
Recall that we have three attributes:
Refund: Yes, No
Marital Status: Single, Married, Divorced
Taxable Income: 10 different “discrete” values
While we could compute every P(Ai| Cj) for all Ai, we only need to do it for the attribute values in the test example
59. Values to Compute Given we need to compute P(Cj) Π P(Ai|Cj)
We need to compute the class probabilities
P(Evade=No)
P(Evade=Yes)
We need to compute the conditional probabilities
P(Refund=No|Evade=No)
P(Refund=No|Evade=Yes)
P(Marital Status=Married|Evade=No)
P(Marital Status=Married|Evade=Yes)
P(Income=120K|Evade=No)
P(Income=120K|Evade=Yes)
60. Computed Values Given we need to compute P(Cj) Π P(Ai|Cj)
We need to compute the class probabilities
P(Evade=No) = 7/10 = .7
P(Evade=Yes) = 3/10 = .3
We need to compute the conditional probabilities
P(Refund=No|Evade=No) = 4/7
P(Refund=No|Evade=Yes) = 3/3 = 1.0
P(Marital Status=Married|Evade=No) = 4/7
P(Marital Status=Married|Evade=Yes) =0/3 = 0
P(Income=120K|Evade=No) = 1/7
P(Income=120K|Evade=Yes) = 0/3 = 0
61. Finding the Class Now compute P(Cj) Π P(Ai|Cj) for both classes for the test example [No, Married, Income = 120K]
For Class Evade=No we get:
.7 x 4/7 x 4/7 x 1/7 = 0.032
For Class Evade=Yes we get:
.3 x 1 x 0 x 0 = 0
Which one is best?
Clearly we would select “No” for the class value
Note that these are not the actual probabilities of each class, since we did not divide by P([No, Married, Income = 120K])
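The same computation as a few lines of Python, with the counts hard-coded from the slide above:

    # Class priors and conditional probabilities from the 10-record training set
    p_no  = 7/10 * 4/7 * 4/7 * 1/7   # P(No)  * P(Refund=No|No)  * P(Married|No)  * P(120K|No)
    p_yes = 3/10 * 3/3 * 0/3 * 0/3   # P(Yes) * P(Refund=No|Yes) * P(Married|Yes) * P(120K|Yes)
    print(p_no, p_yes)               # ~0.0326 vs 0.0, so predict Evade = No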
62. Naïve Bayes Classifier If one of the conditional probabilities is zero, then the entire expression becomes zero
This is not ideal, especially since probability estimates may not be very precise for rarely occurring values
We use the Laplace estimate to improve things. Without a lot of observations, the Laplace estimate moves the probability towards the value assuming all classes equally likely
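One common form of the Laplace estimate (the exact variant intended here is an assumption) replaces the raw fraction nc/n with: P(Ai|C) = (nc + 1) / (n + v), where nc is the number of class-C records with value Ai, n is the number of class-C records, and v is the number of distinct values of the attribute. With no observations (n = 0) this gives 1/v, i.e., all values equally likely, and it never produces a zero probability.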
63. Naïve Bayes (Summary) Robust to isolated noise points
Robust to irrelevant attributes
Independence assumption may not hold for some attributes
But works surprisingly well in practice for many problems
64. More Examples There are two more examples coming up
Go over them before trying the HW, unless you are clear on Bayesian Classifiers
You are not responsible for Bayesian Belief Networks
66. Play-tennis example: classifying X An unseen sample X = <rain, hot, high, false>
P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9·2/9·3/9·6/9·9/14 = 0.010582
P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5·2/5·4/5·2/5·5/14 = 0.018286
Sample X is classified in class n (don’t play)
67. Example of Naïve Bayes Classifier
68. Artificial Neural Networks (ANN) The study of ANNs was inspired by the human brain
ANNs do not try to exactly mimic the brain
Relevant terms:
Neurons: nerve cells in the brain
Synapse: gap between neurons over which signals travel
The brain learns by changing the strength of synaptic connections upon repeated stimulation by same impulse
How do you think the brain and an ANN differ?
Neurons very slow but brain massively parallel
Can you think of tasks that show this difference?
Adding numbers: sequential so brain slower
Seeing an image: brain can exploit parallelism
69. Artificial Neural Networks (ANN)
70. Artificial Neural Networks (ANN)
71. Artificial Neural Networks (ANN) Model is an assembly of inter-connected nodes and weighted links
Output node sums up its input values according to the weights of its links
Compare output node against some threshold t
Perceptron model has no hidden units and simple activation function
What is Y for X=1,0,1 and 0,1,1?
72. Algorithm for learning ANN Initialize the weights (w0, w1, …, wk)
Adjust the weights in such a way that the output of ANN is consistent with class labels of training examples
Objective function: minimize the sum of squared errors, E = Σi [yi − f(w, xi)]²
Find the weights wi’s that minimize the above objective function
e.g., backpropagation algorithm
Make small corrections to the weights for each example
Iterate over the data many times; each full pass over the training data is called an epoch.
73. Weight Update formula Weight update formula for iteration k
wj(k+1) = wj(k) + λ(yi − y′i(k))xij
xij is the value for the jth attribute of training example xi
λ is the learning rate and controls how big the correction is
What are the advantages of making λ big or small?
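A sketch of the full training loop using this update rule; the sign activation, zero-initialized weights, and {-1, +1} labels are assumptions for illustration:

    def train_perceptron(X, y, lr=0.1, epochs=100):
        # X: list of feature vectors, y: list of labels in {-1, +1}
        w = [0.0] * len(X[0])
        for _ in range(epochs):                  # one epoch = one pass over the training data
            for xi, yi in zip(X, y):
                pred = 1 if sum(wj * xj for wj, xj in zip(w, xi)) >= 0 else -1
                for j in range(len(w)):          # w_j <- w_j + lr * (y_i - y'_i) * x_ij
                    w[j] += lr * (yi - pred) * xi[j]
        return w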
74. General Structure of ANN
75. Learning Power of an ANN Perceptron is guaranteed to converge if data is linearly separable
It will learn a hyperplane that separates the classes
The XOR function on page 250 is not linearly separable
A multilayer ANN has no such guarantee of convergence but can learn functions that are not linearly separable
An ANN with a hidden layer can learn the XOR function by constructing two hyperplanes (see page 253)
76. History of ANNs In the 1940s, McCulloch and Pitts introduced the first neural network computing model
In the 1950s, Rosenblatt introduced the perceptron
Because perceptrons have limited expressive power, the initial interest in ANNs waned
Later people realized that the lack of convergence isn't necessarily such a big deal and that multilayer perceptrons are powerful
referred to as universal approximators
Great interest in ANNs returned in the 1980s
77. Characteristics of an ANN Very expressive and can approximate any function
Much more expressive than decision trees
Training is relatively time consuming
Much slower than decision trees
But classification of examples is very quick
Do you think that neural networks are interpretable?
Try looking at someone’s brain to see what they are thinking!
Handles continuous inputs quite well and naturally
Seem popular with sensory applications (e.g., recognizing objects)
78. Final Notes on ANNs More complex ANNs exist
Recurrent ANNs allow links from one layer back to previous layers or to the same layer
If this is not allowed, then we have a feedforward ANN. Perceptrons are feedforward only.
Most ANNs use an activation function that is more complex than the sign function.
Sigmoid functions are popular (see page 252)
These functions compute values that are non-linear in terms of the input values
79. Genetic Algorithms (in 1 Slide) GA: based on an analogy to biological evolution
Each rule is represented by a string of bits
An initial population is created consisting of randomly generated rules
e.g., If A1 and Not A2 then C2 encoded as 100
Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring
The fitness of a rule is represented by its classification accuracy on a set of training examples
Offspring are generated by crossover and mutation
GA’s are a general search/optimization method, not just a classification method. This can be contrasted with other methods
80. Ensemble Methods Construct a set of classifiers from the training data
Predict class label of previously unseen records by aggregating predictions made by multiple classifiers
81. General Idea
82. Why does it work? Suppose there are 25 base classifiers
Each classifier has error rate ε = 0.35
Assume classifiers are independent
Probability that the ensemble classifier makes a wrong prediction (a majority, i.e., at least 13 of the 25, must err): Σi=13..25 C(25, i) ε^i (1 − ε)^(25−i) ≈ 0.06
Practice has shown that even when independence does not hold results are good
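A quick check of that calculation in Python:

    from math import comb

    eps, n = 0.35, 25
    p_wrong = sum(comb(n, i) * eps**i * (1 - eps)**(n - i)
                  for i in range(13, n + 1))
    print(p_wrong)  # ~0.06, far below the individual error rate of 0.35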
83. Methods for generating Multiple Classifiers Manipulate the training data
Sample the data differently each time
Examples: Bagging and Boosting
Manipulate the input features
Sample the features differently each time
Makes especially good sense if there is redundancy
Example: Random Forest
Manipulate Class labels
Convert multi-class problem to ensemble of binary classifiers
Manipulate the learning algorithm
Vary some parameter of the learning algorithm
For example: amount of pruning, ANN network topology, etc.
84. Background Classifier performance can be impacted by:
Bias: assumptions made to help with generalization
"Simpler is better" is a bias
Variance: a learning method will give different results based on small changes (e.g., in training data).
When I run experiments and use random sampling with repeated runs, I get different results each time.
Noise: measurements may have errors or the class may be inherently probabilistic
85. How Ensembles Help Ensemble methods can assist with the bias and variance
Averaging the results over multiple runs will reduce the variance
I observe this when I use 10 runs with random sampling and see that my learning curves are much smoother
Ensemble methods especially helpful for unstable classifier algorithms
Decision trees are unstable since small changes in the training data can greatly impact the structure of the learned decision tree
If you combine different classifier methods into an ensemble, then you are using methods with different biases
You are more likely to use a classifier with a bias that is a good match for the problem
You may even be able to identify the best methods and weight them more
86. Examples of Ensemble Methods How to generate an ensemble of classifiers?
Bagging
Boosting
These methods have been shown to be quite effective
A technique ignored by the textbook is to combine classifiers built separately
By simple voting
By voting and factoring in the reliability of each classifier
Note that Enterprise Miner allows you to do this by combining diverse classifiers
87. Bagging Sampling with replacement
Build classifier on each bootstrap sample
Each record has probability 1 − (1 − 1/n)^n of being included in a bootstrap sample (about 63% for large n)
Some records will be picked more than once
Combine the resulting classifiers, such as by majority voting
Greatly reduces the variance when compared to a single base classifier
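A minimal bagging sketch; the base learner train_fn (assumed here to return a callable model) is a placeholder:

    import random

    def bagging(records, train_fn, n_models=25):
        models = []
        for _ in range(n_models):
            # Bootstrap sample: draw n records with replacement
            sample = [random.choice(records) for _ in range(len(records))]
            models.append(train_fn(sample))
        return models

    def bagged_predict(models, record):
        votes = [m(record) for m in models]       # each model votes
        return max(set(votes), key=votes.count)   # majority vote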
88. Boosting An iterative procedure to adaptively change distribution of training data by focusing more on previously misclassified records
Initially, all N records are assigned equal weights
Unlike bagging, weights may change at the end of each boosting round
89. Boosting Records that are wrongly classified will have their weights increased
Records that are classified correctly will have their weights decreased
90. Ensembles and Enterprise Miner Enterprise Miner has support for:
Bagging
Boosting
Combining different models
For example, an ANN and a DT model
C5 also has support for boosting
C4.5 has no support for boosting
91. Class Imbalance Class imbalance occurs when the classes are very unevenly distributed
Examples would include fraud prediction and identification of rare diseases
If there is class imbalance, accuracy may be high even if the rare class is never predicted
This could be okay, but only if both classes are equally important
This is usually not the case. Typically the rare class is more important
The cost of a false negative is usually much higher than the cost of a false positive
The rare class is designated the positive class
92. Confusion Matrix The following abbreviations are standard:
FP: False Positive
TP: True Positive
FN: False Negative
TN: True Negative
93. Classifier Evaluation Metrics Accuracy = (TP + TN)/(TP + TN + FN + FP)
Precision and Recall
Recall = TP/(TP + FN)
Recall measures the fraction of positive class examples that are correctly identified
Precision = TP/(TP + FP)
Precision is essentially the accuracy of the examples that are classified as positive
We would like to maximize precision and recall
They are usually competing goals
These measures are appropriate for class imbalance since recall explicitly addresses the rare class
94. Classifier Evaluation Metrics Cont. One issue with precision and recall is that they are two numbers, not one
That makes simple comparisons more difficult, and it is not clear how to determine the best classifier
Solution: combine the two
F-measure combines precision and recall
The F1 measure is defined as:
(2 x recall x precision)/(recall + precision)
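All four metrics computed from the raw confusion-matrix counts, as a small sketch:

    def metrics(tp, tn, fp, fn):
        accuracy  = (tp + tn) / (tp + tn + fp + fn)
        recall    = tp / (tp + fn)                 # fraction of actual positives found
        precision = tp / (tp + fp)                 # fraction of predicted positives that are correct
        f1 = 2 * recall * precision / (recall + precision)
        return accuracy, precision, recall, f1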
95. ROC Curves Plots TP Rate on the y-axis vs. FP Rate on the x-axis
TP Rate = TP/(TP + FN)
FP Rate = FP/(TN + FP)
Interpretation of critical points
(0,0): Model always predicts negative
(1,1): Model always predicts positive
(0,1): Ideal model: all positives predicted correctly and no negatives mistakenly classified as positive
ROC Curves can be used to compare classifiers
Area under the ROC curve (AUC) generates a single statistic of overall robustness of a model
AUC is becoming almost as common as accuracy
The ROC curve does not depend on the class distribution of the population
96. What you need to know For evaluation metrics, you need to know
What the confusion matrix is and how to generate it
How to define accuracy and compute it from a confusion matrix
You should be able to figure out the formula for precision and recall and compute the values from a confusion matrix
Understand the F-measure
I would provide the formula
Understand the basics of the ROC curve but not the details
97. Cost-Sensitive Learning Cost-sensitive learning will factor in cost information
For example, you may be given the relative cost of a FN vs. FP
You may be given the costs or utilities associated with each quadrant in the confusion matrix
We did this with Enterprise Miner to figure out the optimal classifier
Cost-sensitive learning can be implemented by sampling the data to reflect the costs
If the cost of a FN is twice that of a FP, then you can increase the ratio of positive examples by a factor of 2 when constructing the training set
This is a good general area for research and relates to some of my research interests