130 likes | 147 Views
This chapter explores instance selection as a nontrivial process aiming to identify valid, novel, and useful patterns in data. It delves into methods for choosing data subsets to enhance data mining outcomes, focusing on functions like enabling, cleaning, and tradeoffs for mining quality. Classifications into critical points, boundary points, and sampling methods like tree-based and delegate sampling are discussed. Clustering techniques involving prototypes and hierarchical data description are also covered. Future directions include developing universal models for instance selection tailored to various mining algorithms. Evaluation metrics and issues, platform compatibility, and database protectability are highlighted, along with related topics such as feature selection and boosting. The chapter concludes with the significance of instance selection in optimizing data mining processes.
E N D
Data Reduction via Instance Selection Chapter 1
Background • KDD • Nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. • Instance selection • Choosing a subset of data to achieve the original purpose of a data mining application as if the whole data is used. • The ideal outcome of instance selection is model independent
Instance Selection • Instance selection has the following prominent functions. • Enabling • When a data set is too huge, it may not be possible to run a data mining algorithm • Focusing • One application is normally only about one aspect of the domain • Cleaning • Remove irrelevant ones as well as noise and/or redundant data. • Tradeoff between the sample size and mining quality. • Instance selection methods associated with data mining tasks such as classification and clustering
Methods of Classification (1) • Four types of selected instance • Critical point • The points that matter the most to a classifier • Originated from the learning method of NN. • Keep only the critical points so that noisy data points are removed as well as the data set is reduced. • Boundary points • Support vectors • Prototypes • Representatives of groups of instances via averaging
Methods of Classification (2) • Tree based sampling • Instance selection can be done via the decision tree built • Delegate sampling • Construct a decision tree such that instances at the leaves of the tree are approximately uniformly distributed • Samples instances from the leaves in inverse proportion to the density at the leaf and assigns weights to the sample points that are proportional to the leaf density
Methods of Clustering • Prototypes • Pseudo data points generated from the formed clusters • Prototypes & sufficient statistics • Representing a cluster using both defiant points and pseudo points • Data description in a hierarchy • When the clustering produces a hierarchy, the prototype approach to instance selection will not work as there are many ways of choosing appropriate clusters. • Squashed Data • Each data point has a weight and the sum of the weights is equal to the number of instances in the original data set. • Obtaining squashed data • Model free, model dependent (or likelihood based)
Instance Labeling • Problem of selecting which unlabeled data for labeling
Towards the Next Step • Universal model of instance selection. • This would allow the selected instances to be useful for a group of mining algorithms in solving real world problems. • Not yet found. • Different groups of learning algorithms need different instance selectors in order to suit their learning/search bias well.
Evaluation Issues • Sampling • Performance is of sufficient statistics and can depend on the data distribution. • If it is of normal distribution, means and variances are the two major measure. • Classification • Performance is more about predictive accuracy as well as aspects such as comprehensibility and simplicity. • Clustering • Performance is naturally about clusters: inter- and intra-cluster similarity, shapes of clusters, number of clusters etc.
Evaluation Measures • Direct Measure • Keep as much resemblance as possible between the selected data and the original data. • Ex) Entropy, moments, and histograms. • Indirect Measure • For example, a classifier can be used to check whether instance selection results in better, equal, or worse predictive accuracy. • Conventional evaluation methods in sampling, classification, and clustering can be used in assessing the performance of instance selection. • Ex) Precision, recall.
Platform Compatibility & Database Protectability • Platform compatibility • The subset of selected instances should be compatible with the mining algorithm used in application. • Database protectability • Under any circumstances, the original data should be kept intact.
Related Work • Feature selection • We can transpose an instance selection problem to a problem of attribute selection. • Boosting • Active Learning
Conclusion and Future Work • The central point of instance selection is approximation