Efficient Data Reduction Techniques for Improved Data Mining Performance

Data Reduction via Instance Selection Chapter 1

Background • KDD • Nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. • Instance selection • Choosing a subset of data to achieve the original purpose of a data mining application as if the whole data is used. • The ideal outcome of instance selection is model independent

Instance Selection • Instance selection has the following prominent functions. • Enabling • When a data set is too huge, it may not be possible to run a data mining algorithm • Focusing • One application is normally only about one aspect of the domain • Cleaning • Remove irrelevant ones as well as noise and/or redundant data. • Tradeoff between the sample size and mining quality. • Instance selection methods associated with data mining tasks such as classification and clustering

Methods of Classification (1) • Four types of selected instance • Critical point • The points that matter the most to a classifier • Originated from the learning method of NN. • Keep only the critical points so that noisy data points are removed as well as the data set is reduced. • Boundary points • Support vectors • Prototypes • Representatives of groups of instances via averaging

Methods of Classification (2) • Tree based sampling • Instance selection can be done via the decision tree built • Delegate sampling • Construct a decision tree such that instances at the leaves of the tree are approximately uniformly distributed • Samples instances from the leaves in inverse proportion to the density at the leaf and assigns weights to the sample points that are proportional to the leaf density

Methods of Clustering • Prototypes • Pseudo data points generated from the formed clusters • Prototypes & sufficient statistics • Representing a cluster using both defiant points and pseudo points • Data description in a hierarchy • When the clustering produces a hierarchy, the prototype approach to instance selection will not work as there are many ways of choosing appropriate clusters. • Squashed Data • Each data point has a weight and the sum of the weights is equal to the number of instances in the original data set. • Obtaining squashed data • Model free, model dependent (or likelihood based)

Instance Labeling • Problem of selecting which unlabeled data for labeling

Towards the Next Step • Universal model of instance selection. • This would allow the selected instances to be useful for a group of mining algorithms in solving real world problems. • Not yet found. • Different groups of learning algorithms need different instance selectors in order to suit their learning/search bias well.

Evaluation Issues • Sampling • Performance is of sufficient statistics and can depend on the data distribution. • If it is of normal distribution, means and variances are the two major measure. • Classification • Performance is more about predictive accuracy as well as aspects such as comprehensibility and simplicity. • Clustering • Performance is naturally about clusters: inter- and intra-cluster similarity, shapes of clusters, number of clusters etc.

Evaluation Measures • Direct Measure • Keep as much resemblance as possible between the selected data and the original data. • Ex) Entropy, moments, and histograms. • Indirect Measure • For example, a classifier can be used to check whether instance selection results in better, equal, or worse predictive accuracy. • Conventional evaluation methods in sampling, classification, and clustering can be used in assessing the performance of instance selection. • Ex) Precision, recall.

Platform Compatibility & Database Protectability • Platform compatibility • The subset of selected instances should be compatible with the mining algorithm used in application. • Database protectability • Under any circumstances, the original data should be kept intact.

Related Work • Feature selection • We can transpose an instance selection problem to a problem of attribute selection. • Boosting • Active Learning

Conclusion and Future Work • The central point of instance selection is approximation

Efficient Data Reduction Techniques for Improved Data Mining Performance

Efficient Data Reduction Techniques for Improved Data Mining Performance

Presentation Transcript

NOx Reduction via Combustion Optimization

Dimension Reduction and Feature Selection

Data Reduction via Instance Selection

Fuzzy-Rough Instance Selection

Reduction in Downtime via Interlocking

Drift Reduction And Nozzle Selection

NIFS Data Reduction

The Landmark Model: An Instance Selection Method for Time Series Data

Genetic-Algorithm-Based Instance and Feature Selection

OWL Instance Data Evaluation

Competence Guided Instance Selection for Case-based Reasoning

Instance Construction via Likelihood-Based Data Squashing

Semantic Web Instance Data Evaluation

Evolution via natural selection

Instance Selection

Instance Construction via Likelihood-Based Data Squashing

Dimension Reduction and Feature Selection

Competence Guided Instance Selection for Case-based Reasoning

Data Reduction

Genetic-Algorithm-Based Instance and Feature Selection