
Mining with Rare Cases




Presentation Transcript


  1. Mining with Rare Cases Paper by Gary M. Weiss Presenter: Indar Bhatia INFS 795 April 28, 2005

  2. Presentation Overview • Motivation and Introduction to problem • Why Rare Cases are Problematic • Techniques for Handling Rare Cases • Summary and Conclusion

  3. Motivation and Introduction • What are rare cases? • A case corresponds to a region in the instance space that is meaningful to the domain under study • A rare case is a case that covers a small region of the instance space and, in a classification problem, relatively few training examples • Why are they important? • Detecting suspicious cargo • Finding sources of rare diseases • Detecting fraud • Finding terrorists • Identifying rare diseases • Example: finding associations between infrequently purchased supermarket items

  4. Modeling Problem • For a classification problem, rare cases may manifest themselves as small disjuncts, i.e., those disjuncts in the classifier that cover few training examples • In unsupervised learning, rare cases are more difficult to generalize from because they contain fewer data points • In association rule mining, the problem is to detect items that co-occur infrequently [Figure: a clustering example showing 1 common and 3 rare cases (P1, P2, P3), and a two-class classification example in which the positive class contains 1 common and 2 rare cases]

  5. Modeling Problem • Current research indicates that rare cases and small disjuncts pose difficulties for data mining, i.e., rare cases have a much higher misclassification rate than common cases • Small disjuncts collectively cover a substantial fraction of all examples and cannot simply be eliminated – doing so would substantially degrade the performance of a classifier • In the most thorough study of small disjuncts (Weiss & Hirsh, 2000), it was shown that in classifiers induced from 30 real-world data sets, most classifier errors are contributed by the smaller disjuncts

  6. Why Rare Cases are Problematic • Problems arise due to absolute rarity • The most fundamental problem is the associated lack of data – only a few examples related to rare cases are in the data set (absolute rarity) • Lack of data makes it difficult to detect rare cases and, once they are detected, makes generalization difficult • Problems arise due to relative rarity • Looking for a needle in a haystack – rare cases are obscured by common cases (relative rarity) • Data mining algorithms rely on greedy search heuristics that examine one variable at a time; since the detection of rare cases may depend on the conjunction of many conditions, any single condition in isolation may not provide much guidance • For example, in association rule mining, finding rare associations requires a very low minimum support threshold, which causes a combinatorial explosion in large data sets (see the sketch below)
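
To make the support-threshold trade-off concrete, here is a minimal sketch using the mlxtend library; the transactions and threshold values are invented for illustration and are not from the paper.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

# Toy market-basket data (invented for illustration).
transactions = [
    ["milk", "bread"], ["milk", "bread", "eggs"],
    ["bread", "eggs"], ["milk", "eggs"],
    ["caviar", "truffles"],            # a rare co-occurrence
]

te = TransactionEncoder()
onehot = te.fit(transactions).transform(transactions)
df = pd.DataFrame(onehot, columns=te.columns_)

# A typical support threshold misses the rare itemset entirely ...
print(apriori(df, min_support=0.4, use_colnames=True))
# ... while the very low threshold needed to find {caviar, truffles}
# forces apriori to enumerate far more candidate itemsets.
print(apriori(df, min_support=0.15, use_colnames=True))
```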

  7. Why Rare Cases are Problematic • The metrics • The metrics used to evaluate classifier accuracy are focused on common cases; as a consequence, rare cases may be totally ignored • Example: • Consider decision trees. Most decision trees are grown in a top-down manner, where test conditions are repeatedly evaluated and the best one selected • The metric used to select the best test (e.g., information gain) generally prefers tests that result in a balanced tree where purity is increased for most of the examples • Rare cases, which correspond to high-purity branches covering few examples, will often not be included in the decision tree (the worked example below illustrates this preference)
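
A small worked example of this preference, with invented counts: a balanced split that modestly improves purity everywhere earns far more information gain than a split that isolates a small, perfectly pure rare case.

```python
import math

def entropy(pos, total):
    """Binary entropy of a node with `pos` positives out of `total`."""
    if pos in (0, total):
        return 0.0
    p = pos / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def info_gain(total_pos, total, branches):
    """branches: list of (pos, n) pairs, one per child node."""
    parent = entropy(total_pos, total)
    children = sum(n / total * entropy(pos, n) for pos, n in branches)
    return parent - children

# 1000 examples, 500 positive.
# Split A: balanced, modest purity gain on both sides.
print(info_gain(500, 1000, [(400, 500), (100, 500)]))  # ~0.278
# Split B: isolates a pure 10-example rare case.
print(info_gain(500, 1000, [(10, 10), (490, 990)]))    # ~0.010
```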

  8. Why Rare Cases are Problematic • The bias • The bias of a data mining system is critical to its performance; extra-evidentiary bias is what makes it possible to generalize from specific examples • Many data mining systems, especially those used to induce classifiers, employ a maximum-generality bias • This means that when a disjunct covering some set of training examples is formed, only the most general set of conditions that satisfies those examples is selected • The maximum-generality bias works well for common cases, but not for rare cases/small disjuncts • Addressing the problems of small disjuncts therefore requires selecting a more appropriate bias

  9. Why Rare Cases are Problematic • Noisy data • A sufficiently high level of background noise may prevent the learner from distinguishing between noise and rare cases • Unfortunately, there is not much that can be done to minimize the impact of noise on rare cases • For example, pruning and overfitting-avoidance techniques, as well as inductive biases that foster generalization, can minimize the overall impact of noise; but because these methods tend to remove noise-induced cases and genuine rare cases alike, they do so at the expense of the rare cases

  10. Techniques for Handling Rare Cases • Obtain Additional Training Data • Use a More Appropriate Inductive Bias • Use More Appropriate Metrics • Employ Non-Greedy Search Techniques • Employ Knowledge/Human Interaction • Employ Boosting • Place Rare Cases Into Separate Classes

  11. 1. Obtain Additional Training Data • Simply obtaining additional training data will not help much, because most of the new data will also be associated with the common cases and only a little of it with the rare cases; this may help with problems of “absolute rarity” but not with “relative rarity” • Only by selectively obtaining additional data for the rare cases can one address relative rarity; such a sampling scheme would also help with absolute rarity • The selective sampling approach does not, however, seem practical for real-world data sets

  12. 2. Use a More Appropriate Inductive Bias • Rare cases tend to cause small disjuncts to be formed in a classifier induced from labeled data; this is partly due to the bias used by most learners • Simple strategies that eliminate all small disjuncts, or that use statistical significance testing to prevent small disjuncts from being formed, have proven to perform poorly • More sophisticated approaches for adjusting the bias of a learner in order to minimize the problem with small disjuncts have been investigated • Holte et al. (1989) use a maximum-generality bias for large disjuncts and a maximum-specificity bias for small disjuncts; this was shown to improve the performance of small disjuncts but to degrade the performance of large disjuncts, yielding poorer overall performance

  13. 2. Use a More Appropriate Inductive Bias • The approach was refined to ensure that the more specific bias used to induce the small disjuncts does not affect – and therefore cannot degrade – the performance of the large disjuncts • This was accomplished by using different learners for examples that fall into large disjuncts and examples that fall into small disjuncts (Ting, 1994) • Although this hybrid approach was shown to improve the accuracy of small disjuncts, the results were not conclusive • Carvalho and Freitas (2002a, 2002b) use essentially the same approach, except that the set of training examples falling into each individual small disjunct is used to generate a separate classifier • Several attempts have been made to perform better on rare cases by using a highly specific bias for the induced small disjuncts; these methods have shown mixed success

  14. 3. Use More Appropriate Metrics Altering the Relative Importance of Precision vs. Recall: • Use evaluation metrics that, unlike accuracy, do not discount the importance of rare cases • Given a classification rule R that predicts target class C, the recall of R is the percentage of examples belonging to C that are correctly identified, while the precision of R is the percentage of times that the rule is correct (see the sketch below) • Rare cases can be given more prominence by increasing the importance of precision relative to recall • Timeweaver (Weiss, 1999), a genetic-algorithm-based classification system, searches for rare cases by carefully altering the relative importance of precision vs. recall
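
A minimal illustration of the two measures (this is not code from Timeweaver; the rule firings and labels below are invented):

```python
def precision_recall(predictions, labels, target="C"):
    """Precision and recall of a rule that predicts class `target`.

    `predictions` and `labels` are parallel lists of class labels.
    """
    fired = [l for p, l in zip(predictions, labels) if p == target]
    tp = sum(1 for l in fired if l == target)      # correct firings
    actual_pos = sum(1 for l in labels if l == target)
    precision = tp / len(fired) if fired else 0.0
    recall = tp / actual_pos if actual_pos else 0.0
    return precision, recall

# Toy example: the rule fires on 4 cases, 3 of them correctly,
# and there are 6 true members of class C in the data.
preds  = ["C", "C", "C", "C", "-", "-", "-", "-", "-", "-"]
labels = ["C", "C", "C", "-", "C", "C", "C", "-", "-", "-"]
print(precision_recall(preds, labels))  # (0.75, 0.5)
```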

  15. 3. Use More Appropriate Metrics Two-Phase Rule Induction: • PNrule (Joshi, Agarwal & Kumar, 2001) uses two-phase rule induction to focus on each measure separately • The first phase focuses on recall; in the second phase, precision is optimized by learning to identify the false positives admitted by the phase-1 rules • In the needle-in-the-haystack analogy, the first phase identifies regions likely to contain the needle, and the second phase learns to discard the strands of hay within those regions

  16. PN-rule Learning • P-phase: • Find rules that cover the positive examples with good support • Seek good recall • N-phase: • Remove the false positives covered by the P-phase rules • Seek high accuracy with significant support • (A schematic sketch of this two-phase scheme follows)
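
A two-phase scheme in the spirit of PNrule, sketched with shallow decision trees standing in for rule learners. PNrule itself induces rules directly, so everything here – the data, the class weighting, and the tree stand-ins – is an illustrative assumption.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = ((X[:, 0] > 1.0) & (X[:, 1] > 1.0)).astype(int)  # a rare conjunction

# P-phase: a deliberately loose model that favors recall, via a
# class weight that makes missing a positive very costly.
p_model = DecisionTreeClassifier(max_depth=2, class_weight={0: 1, 1: 20})
p_model.fit(X, y)
covered = p_model.predict(X) == 1      # region "likely to hold the needle"

# N-phase: within the covered region, learn to separate the true
# positives from the false positives the P-phase let through.
n_model = DecisionTreeClassifier(max_depth=3)
n_model.fit(X[covered], y[covered])

# Final prediction: positive only where both phases agree.
pred = np.zeros(len(X), dtype=int)
pred[covered] = n_model.predict(X[covered])
print((pred == y).mean())
```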

  17. 4. Employ Non-Greedy Search Techniques • Greedy algorithms make locally optimal choices so that the search remains tractable; as a consequence, mining algorithms based on the greedy method are not globally optimal • Greedy algorithms are not suitable for dealing with rare cases, because rare cases may depend on the conjunction of many conditions, and any single condition in isolation may not provide much guidance • Mining algorithms for handling rare cases must therefore use more powerful global search methods • Recommended solution: • Genetic algorithms, which operate on a population of candidate solutions rather than a single solution • For this reason GAs are more appropriate for rare cases (Goldberg, 1989; Freitas, 2002; Weiss, 1999; Carvalho and Freitas, 2002)

  18. 5. Employ Knowledge/Human Interaction • The interaction and domain knowledge of experts can be used to mine rare cases more effectively • Examples: • SAR detection • Rare disease detection • Etc.

  19. 6. Employ Boosting • Boosting algorithms, such as AdaBoost, are iterative algorithms that place different weights on the training distribution at each iteration • Following each iteration, boosting increases the weights associated with the incorrectly classified examples and decreases the weights associated with the correctly classified examples • This forces the learner to focus more on the incorrectly classified examples in the next iteration (see the sketch below) • RareBoost (Joshi, Kumar & Agarwal, 2001) applies a modified weight-update mechanism to improve the performance on rare classes and rare cases
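
A minimal sketch of the AdaBoost-style weight update described above; the toy numbers are invented, and RareBoost's actual modification of this update is not shown.

```python
import numpy as np

def adaboost_reweight(weights, correct, alpha):
    """One AdaBoost-style weight update (illustrative sketch).

    weights: current example weights; correct: boolean array marking
    examples the weak learner classified correctly; alpha: the weak
    learner's vote weight, typically 0.5 * ln((1 - err) / err).
    """
    w = weights * np.exp(np.where(correct, -alpha, alpha))
    return w / w.sum()   # renormalize to a distribution

# Toy round: 5 equally weighted examples, the last one misclassified.
w = np.full(5, 0.2)
err = 0.2
alpha = 0.5 * np.log((1 - err) / err)
print(adaboost_reweight(w, np.array([True, True, True, True, False]), alpha))
# The misclassified example's weight grows from 0.2 to 0.5, so the
# next weak learner concentrates on it.
```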

  20. 7. Place Rare Cases Into Separate Classes • Rare cases complicate classification because different rare cases may have little in common with one another, making it difficult to assign the same class label to all of them • Solution: reformulate the problem so that rare cases are viewed as separate classes • Approach: • Separate each class into subclasses using clustering • Learn after re-labeling the training examples with the new class labels • Because multiple clustering experiments are used in step 1, step 2 involves learning multiple models • These models are combined using voting (a sketch of a single clustering-and-relabeling pass follows)
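
A minimal sketch of one clustering-and-relabeling pass; k-means, the fixed subclass count, and the decision-tree learner are illustrative assumptions, not choices prescribed by the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

def relabel_by_cluster(X, y, n_subclasses=3):
    """Split each class into subclasses via clustering, then relabel."""
    new_y = np.empty(len(y), dtype=object)
    for cls in np.unique(y):
        idx = np.where(y == cls)[0]
        k = min(n_subclasses, len(idx))
        sub = KMeans(n_clusters=k, n_init=10).fit_predict(X[idx])
        for i, s in zip(idx, sub):
            new_y[i] = f"{cls}_{s}"   # e.g. "pos_2" for subclass 2 of "pos"
    return new_y

# Learn on the subclass labels; to recover the original prediction,
# map a predicted subclass "cls_s" back to its parent class "cls".
# model = DecisionTreeClassifier().fit(X, relabel_by_cluster(X, y))
```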

  21. Boosting-based algorithms • RareBoost • Updates the weights differently for rare classes • SMOTEBoost • A combination of SMOTE (Synthetic Minority Oversampling Technique) and boosting (see the simplified sketch below)
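
A simplified stand-in for SMOTEBoost using the imbalanced-learn and scikit-learn libraries: SMOTEBoost proper injects synthetic minority examples inside every boosting round, whereas this sketch oversamples once and then boosts, which is only an approximation of the idea.

```python
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification

# Synthetic imbalanced data: ~5% minority class (invented for demo).
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)

# SMOTE synthesizes new minority examples, then AdaBoost trains on
# the rebalanced data.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
clf = AdaBoostClassifier(n_estimators=50).fit(X_res, y_res)
```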

  22. CREDOS • First uses ripple-down rules, a commonly used rule representation, to overfit the training data • Then prunes to improve generalization • The pruning mechanism differs from that of decision trees

  23. Cost Sensitive Modeling • Detection rate / false-alarm rate alone may be misleading • Cost factors: damage cost, response cost, operational cost • Assign costs to TP, FP, TN, and FN outcomes • Define a cumulative cost over all outcomes (see the sketch below)
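
A minimal sketch of a cumulative cost over the four outcome types; the per-outcome costs below are invented placeholders (a missed intrusion is usually the most expensive, and even a true detection carries a response cost).

```python
def cumulative_cost(tp, fp, tn, fn,
                    c_tp=5.0, c_fp=10.0, c_tn=0.0, c_fn=100.0):
    """Cumulative cost of a classifier's outcomes.

    The counts come from the confusion matrix; the costs combine the
    damage, response, and operational cost factors listed above.
    """
    return tp * c_tp + fp * c_fp + tn * c_tn + fn * c_fn

# E.g. 40 detections, 25 false alarms, 930 quiet cases, 5 misses:
print(cumulative_cost(40, 25, 930, 5))  # 40*5 + 25*10 + 0 + 5*100 = 950.0
```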

  24. Outlier Detection Schemes • Detect intrusions (data points) that are very different from the “normal” activities (rest of the data points) • General Steps • Identify “normal” behavior • Construct useful set of features • Define similarity function • Use outlier detection algorithm • Statistics based • Distance based • Model based

  25. Distance Based Outlier Detection • Represent data as a vector of features • Major approaches • Nearest neighbor based • Density based • Clustering based • Problem • High dimensionality of data

  26. Distance Based – Nearest Neighbor [Figure: scatter plot with outliers p1 and p2 lying in sparse neighborhoods] • Not enough neighbors → outlier • Compute the distance d to the k-th nearest neighbor • Outlier points • Are located in sparser neighborhoods • Have d larger than a certain threshold (see the sketch below) • Mahalanobis-distance based approach • More appropriate for computing distances with skewed distributions
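
A brute-force sketch of the k-th nearest-neighbor distance score (O(n²), for illustration only; the data are invented): larger scores mean sparser neighborhoods and hence likelier outliers.

```python
import numpy as np

def knn_distance_scores(X, k=5):
    """Distance from each point to its k-th nearest neighbor."""
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))   # pairwise distances
    dists.sort(axis=1)                           # row-wise ascending
    return dists[:, k]                           # index 0 is the point itself

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),       # dense "normal" cloud
               [[6.0, 6.0]]])                    # one isolated point
scores = knn_distance_scores(X, k=5)
print(scores.argmax())  # 100 -> the isolated point has the largest score
```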

  27. Distance Based – Density [Figure: points p1 and p2 in neighborhoods of differing density] • Local Outlier Factor (LOF) • Compute the density of the local neighborhood for each point • Compute LOF: the average of the ratios of the densities of example p's nearest neighbors to the density of p itself • Larger LOF → outlier (see the sketch below)
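
A minimal LOF sketch using scikit-learn's LocalOutlierFactor; the three-group data set is invented for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),    # dense cluster
               rng.normal(5, 0.2, (20, 2)),   # small tight cluster
               [[10.0, 10.0]]])               # isolated point

lof = LocalOutlierFactor(n_neighbors=10)
labels = lof.fit_predict(X)                   # -1 marks outliers
scores = -lof.negative_outlier_factor_        # larger -> more outlying
print(scores.argmax())                        # 120 -> the isolated point
```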

  28. Distance Based – Clustering • A radius w of proximity is specified • Two points x1 and x2 are “near” if d(x1, x2) < w • Define N(x) as the number of points that are within w of x • Points in small clusters → outliers • Fixed-width clustering is used for speedup (see the sketch below)
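
A one-pass fixed-width clustering sketch (the data and radius are invented): each point joins the first cluster whose center lies within w, otherwise it starts a new cluster, and members of tiny clusters are outlier candidates.

```python
import numpy as np

def fixed_width_clusters(X, w):
    """One-pass fixed-width clustering (illustrative sketch)."""
    centers, counts, assign = [], [], []
    for x in X:
        for i, c in enumerate(centers):
            if np.linalg.norm(x - c) < w:   # near an existing center
                counts[i] += 1
                assign.append(i)
                break
        else:                               # no center is near enough
            centers.append(x)
            counts.append(1)
            assign.append(len(centers) - 1)
    return np.array(assign), np.array(counts)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (200, 2)),   # one dense cloud
               [[5.0, 5.0]]])                  # one isolated point
assign, counts = fixed_width_clusters(X, w=1.5)
print(200 in np.where(counts[assign] == 1)[0])  # True: singleton cluster
```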

  29. Distance Based – Clustering (cont.) • K-Nearest Neighbor + canopy clustering • Compute the sum of distances to the k nearest neighbors • Small k-NN sum → point in a dense region • Canopy clustering is used for speedup • WaveCluster • Transform the data into multidimensional signals using the wavelet transform • Remove the high/low frequency parts • Remaining parts → outliers

  30. Model Based Outlier Detection • Similar to probabilistic-based schemes • Build a prediction model for normal behavior • Deviation from the model → potential intrusion • Major approaches • Neural networks • Unsupervised Support Vector Machines (SVMs)

  31. Model Based – Neural Networks • Use a replicator four-layer feed-forward neural network (RNN) • The input variables are also the target outputs during training • The RNN forms a compressed model of the training data • Outlyingness → reconstruction error (see the sketch below)
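
A minimal sketch of the replicator idea using scikit-learn's MLPRegressor as the network (the layer sizes and data are illustrative assumptions, not the architecture from the original work): the model is trained to reproduce its input through a narrow bottleneck, and reconstruction error serves as the outlyingness score.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (500, 4)),    # "normal" training data
               [[8.0, 8.0, 8.0, 8.0]]])       # one anomalous record

# Train the network to reproduce its input through a bottleneck,
# so it forms a compressed model of the normal data.
rnn = MLPRegressor(hidden_layer_sizes=(8, 2, 8), max_iter=2000,
                   random_state=0)
rnn.fit(X[:-1], X[:-1])                       # targets == inputs

# Outlyingness = reconstruction error; the anomaly reconstructs badly.
errors = ((rnn.predict(X) - X) ** 2).mean(axis=1)
print(errors.argmax())   # typically 500 -> the anomalous record
```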

  32. Model Based – SVMs [Figure: one-class SVM separating the training data from the origin] • Attempt to separate the entire set of training data from the origin • Regions where most of the data lies are labeled as one class • Parameters • Expected outlier rate • Works best with high-quality, controlled training data • Variance of the Radial Basis Function (RBF) kernel • Larger → higher detection rate but more false alarms • Smaller → lower detection rate but fewer false alarms (see the sketch below)
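
A minimal one-class SVM sketch using scikit-learn; the data, `nu` (which bounds the expected outlier fraction), and `gamma` (the RBF width knob described above) are invented for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, (500, 2))          # controlled "normal" data
X_test = np.vstack([rng.normal(0, 1, (10, 2)),
                    [[6.0, 6.0]]])            # one obvious outlier

# nu ~ expected outlier rate; gamma controls the RBF kernel variance.
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.1).fit(X_train)
print(ocsvm.predict(X_test))                  # +1 normal, -1 outlier
```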

  33. Summary And Conclusion • Rare classes, which result from a highly skewed class distribution, share many of the problems associated with rare cases; rare classes and rare cases are connected • Although rare cases can occur within both rare classes and common classes, rare cases are expected to be more of an issue for rare classes • (Japkowicz, 2001) views rare classes as a consequence of between-class imbalance and rare cases as a consequence of within-class imbalance • Thus, both forms of rarity are a type of data imbalance • The modeling improvements presented in this paper are applicable to both types of rarity
