1 / 13

Efficient Data Reduction Techniques for Improved Data Mining Performance

This chapter explores instance selection as a nontrivial process aiming to identify valid, novel, and useful patterns in data. It delves into methods for choosing data subsets to enhance data mining outcomes, focusing on functions like enabling, cleaning, and tradeoffs for mining quality. Classifications into critical points, boundary points, and sampling methods like tree-based and delegate sampling are discussed. Clustering techniques involving prototypes and hierarchical data description are also covered. Future directions include developing universal models for instance selection tailored to various mining algorithms. Evaluation metrics and issues, platform compatibility, and database protectability are highlighted, along with related topics such as feature selection and boosting. The chapter concludes with the significance of instance selection in optimizing data mining processes.

Download Presentation

Efficient Data Reduction Techniques for Improved Data Mining Performance

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Data Reduction via Instance Selection Chapter 1

  2. Background • KDD • Nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. • Instance selection • Choosing a subset of data to achieve the original purpose of a data mining application as if the whole data is used. • The ideal outcome of instance selection is model independent

  3. Instance Selection • Instance selection has the following prominent functions. • Enabling • When a data set is too huge, it may not be possible to run a data mining algorithm • Focusing • One application is normally only about one aspect of the domain • Cleaning • Remove irrelevant ones as well as noise and/or redundant data. • Tradeoff between the sample size and mining quality. • Instance selection methods associated with data mining tasks such as classification and clustering

  4. Methods of Classification (1) • Four types of selected instance • Critical point • The points that matter the most to a classifier • Originated from the learning method of NN. • Keep only the critical points so that noisy data points are removed as well as the data set is reduced. • Boundary points • Support vectors • Prototypes • Representatives of groups of instances via averaging

  5. Methods of Classification (2) • Tree based sampling • Instance selection can be done via the decision tree built • Delegate sampling • Construct a decision tree such that instances at the leaves of the tree are approximately uniformly distributed • Samples instances from the leaves in inverse proportion to the density at the leaf and assigns weights to the sample points that are proportional to the leaf density

  6. Methods of Clustering • Prototypes • Pseudo data points generated from the formed clusters • Prototypes & sufficient statistics • Representing a cluster using both defiant points and pseudo points • Data description in a hierarchy • When the clustering produces a hierarchy, the prototype approach to instance selection will not work as there are many ways of choosing appropriate clusters. • Squashed Data • Each data point has a weight and the sum of the weights is equal to the number of instances in the original data set. • Obtaining squashed data • Model free, model dependent (or likelihood based)

  7. Instance Labeling • Problem of selecting which unlabeled data for labeling

  8. Towards the Next Step • Universal model of instance selection. • This would allow the selected instances to be useful for a group of mining algorithms in solving real world problems. • Not yet found. • Different groups of learning algorithms need different instance selectors in order to suit their learning/search bias well.

  9. Evaluation Issues • Sampling • Performance is of sufficient statistics and can depend on the data distribution. • If it is of normal distribution, means and variances are the two major measure. • Classification • Performance is more about predictive accuracy as well as aspects such as comprehensibility and simplicity. • Clustering • Performance is naturally about clusters: inter- and intra-cluster similarity, shapes of clusters, number of clusters etc.

  10. Evaluation Measures • Direct Measure • Keep as much resemblance as possible between the selected data and the original data. • Ex) Entropy, moments, and histograms. • Indirect Measure • For example, a classifier can be used to check whether instance selection results in better, equal, or worse predictive accuracy. • Conventional evaluation methods in sampling, classification, and clustering can be used in assessing the performance of instance selection. • Ex) Precision, recall.

  11. Platform Compatibility & Database Protectability • Platform compatibility • The subset of selected instances should be compatible with the mining algorithm used in application. • Database protectability • Under any circumstances, the original data should be kept intact.

  12. Related Work • Feature selection • We can transpose an instance selection problem to a problem of attribute selection. • Boosting • Active Learning

  13. Conclusion and Future Work • The central point of instance selection is approximation

More Related