Identifying Competence-Critical Instances for Instance-Based Learners 2001. 5. 9 Presenter: Kyu-Baek Hwang
Abstract • The basic nearest neighbor classifier with a large dataset • Classification accuracy and response time • Review of past work tackling these problems • No consistent method • Insight into the problem characteristics • Iterative case filtering (ICF) algorithm
Introduction • Harmful and superfluous instances are stored. • Selectively store instances (or delete stored instances) • The data miner has to gain insight into the structure of the classes in the instance space. • The experimental comparison of RT3 and ICF • Neither algorithm performs better in all cases.
Defining the Problem • Two practical issues that arise in this area • Instance removal (retain only the critical instances) • Different approaches according to the type of the classification problem • The same (or higher) accuracy with less storage • Which instances should be deleted?
Four Cases Where NNC Fails • Noisy instance • Close to the interclass border • Border instances are critical in general. • Small region defining the class • Small k values cope with this kind of problem. • Unsolvable problem
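As a minimal sketch of the nearest neighbor classifier these failure cases refer to, the following uses plain Python with Euclidean distance and majority voting. The function name `knn_classify` and the toy data are hypothetical, chosen only to show how a class occupying a small region is recovered with k = 1 but outvoted with a larger k.

```python
from collections import Counter
from math import dist


def knn_classify(query, points, labels, k=1):
    """Classify `query` by majority vote among its k nearest stored instances."""
    # Indices of the k stored instances closest to the query.
    nearest = sorted(range(len(points)), key=lambda j: dist(query, points[j]))[:k]
    return Counter(labels[j] for j in nearest).most_common(1)[0][0]


# Toy data (hypothetical): class "b" occupies a small region with one instance.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,)]
labels = ["a", "a", "a", "a", "b"]

print(knn_classify((10.2,), points, labels, k=1))  # b
print(knn_classify((10.2,), points, labels, k=3))  # a
```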
Instance Space Structure • Two categories of instance space structure • Homogeneous region (locality) • Non-homogeneous region (no locality)
Which Instances Are Critical? • Prototypes • For non-homogeneous regions • Instances with high utility • Needs classification feedback • Instances which lie on borders are almost always critical.
Review • Competence enhancement • By removing noisy or corrupt instances • Competence preservation • By removing superfluous instances • Hybrid approach • Many modern approaches
Competence Enhancement • Stochastic noise • Wilson Editing • All instances which are incorrectly classified by their nearest neighbors are assumed to be noisy instances. • Smoothing effect • Empirically tested • Noisy instances and genuine exceptions
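The Wilson Editing rule above can be sketched roughly as follows. This is an illustrative version, not the presenter's code: the function name `wilson_edit` is hypothetical, and k = 3 with Euclidean distance is an assumption, since the slide does not fix either.

```python
from collections import Counter
from math import dist


def wilson_edit(points, labels, k=3):
    """Return the indices of instances kept by Wilson Editing.

    An instance is dropped when the majority label of its k nearest
    neighbours (excluding itself) disagrees with its own label.
    k=3 is an assumption of this sketch.
    """
    kept = []
    for i, p in enumerate(points):
        # Rank every other instance by distance to p (leave-one-out).
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: dist(p, points[j]))
        votes = Counter(labels[j] for j in others[:k])
        if votes.most_common(1)[0][0] == labels[i]:
            kept.append(i)
    return kept


# Toy data (hypothetical): index 3 is a "b" sitting inside the "a" cluster.
points = [(0.0,), (1.0,), (2.0,), (1.5,), (10.0,), (11.0,), (12.0,)]
labels = ["a", "a", "a", "b", "b", "b", "b"]
print(wilson_edit(points, labels))  # [0, 1, 2, 4, 5, 6]
```

The same rule removes a genuine exception just as readily as a noisy instance, which is the trade-off the last bullet points at.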
Competence Preservation • Condensed nearest neighbor (CNN) • Look for cases for which removal does not lead to additional misclassification • Chang’s algorithm (Korean) • Merging two instances into one synthetic instance (the prototype) • Footprint deletion policy • Local-set of a case c • The set of cases contained in the largest hypersphere centered on c such that only cases in the same class as c are contained in the hypersphere.
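For orientation, here is a small sketch of condensed nearest neighbor in its usual incremental form: grow a store by adding every instance that the current store misclassifies with 1-NN, and repeat until a full pass adds nothing. The slide does not spell out the procedure, so this formulation, the function name `cnn`, and the choice of the first instance as the seed are assumptions of the sketch.

```python
from math import dist


def cnn(points, labels):
    """Return indices of a condensed subset (incremental CNN sketch)."""
    store = [0]  # assumption: seed the store with the first instance
    changed = True
    while changed:
        changed = False
        for i in range(len(points)):
            if i in store:
                continue
            # 1-NN prediction using only the instances stored so far.
            nearest = min(store, key=lambda j: dist(points[i], points[j]))
            if labels[nearest] != labels[i]:
                store.append(i)  # misclassified, so it must be kept
                changed = True
    return store


# Toy data (hypothetical): two well-separated clusters.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
labels = ["a", "a", "a", "b", "b", "b"]
print(cnn(points, labels))  # [0, 3]
```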
Footprint Deletion Policy • For a case-base CB = {c1, c2, …, cn} • Coverage(c) = {c’ ∈ CB: c’ ∈ Local-set(c)} • Reachable(c) = {c’ ∈ CB: c ∈ Local-set(c’)} • Pivotal group • With an empty reachable set • Delete the instance with large local-set
Hybrid Approaches (1/2) • IB2 (on-line) • If a new case to be added can already be classified correctly on the basis of the current case-base, the case is discarded. • IB3 • IB2 with time delay • The order of presentation is crucial for IB2 and IB3. • RT1 • k nearest neighbor • Associates of the case p are the cases that have p as their k nearest neighbor. • The instance which has many associates is tested and removed.
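The IB2 rule and its sensitivity to presentation order can be illustrated with a short sketch. It assumes 1-NN classification with Euclidean distance; the function name `ib2` and the toy stream are hypothetical.

```python
from math import dist


def ib2(stream):
    """Process (point, label) pairs in order; keep only misclassified ones."""
    store = []
    for x, y in stream:
        if store:
            # Classify the incoming case with the current case-base (1-NN).
            _, predicted = min(store, key=lambda s: dist(s[0], x))
            if predicted == y:
                continue  # already classified correctly: discard
        store.append((x, y))
    return store


# The same three cases in two different orders (hypothetical data).
order_1 = [((0.0,), "a"), ((6.0,), "b"), ((4.0,), "a")]
order_2 = [((4.0,), "a"), ((6.0,), "b"), ((0.0,), "a")]

print(len(ib2(order_1)))  # 3
print(len(ib2(order_2)))  # 2
```

In the first order the case at 4.0 arrives when its nearest stored case is the "b" at 6.0, so it is kept; in the second order the case at 0.0 is already covered by the stored case at 4.0 and is discarded.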
Hybrid Approaches (2/2) • RT2 is identical to RT1 and additionally, • Cases furthest from their nearest enemy are removed first. • Removed associates still guide the deletion process. • RT3 is identical to RT2 and additionally, • Wilson’s noise filtering step is executed first. • RT algorithms are analogous to the footprint deletion policy.
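The ordering step that RT2 adds can be sketched in isolation. The snippet below only computes the order in which cases would be considered for removal (furthest from their nearest enemy first); it is not a full RT2 implementation, and the function name `removal_order` is hypothetical.

```python
from math import dist


def removal_order(points, labels):
    """Indices sorted so that cases furthest from their nearest enemy come first."""
    def nearest_enemy_distance(i):
        # Distance from case i to the closest case of a different class.
        return min(dist(points[i], points[j])
                   for j in range(len(points)) if labels[j] != labels[i])

    return sorted(range(len(points)), key=nearest_enemy_distance, reverse=True)


# Toy data (hypothetical): the class border lies between 2.0 and 4.0.
points = [(0.0,), (1.0,), (2.0,), (4.0,), (5.0,)]
labels = ["a", "a", "a", "b", "b"]
order = removal_order(points, labels)
print(order[0])  # 0
```

Cases 2 and 3, which face each other across the border, end up last in the order, so they are the last to be considered for removal.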
An Iterative Case Filtering Algorithm • Coverage set and reachable set • RTn algorithm • Associate set of fixed size • Remove cases which have a reachable set size greater than the coverage set size. • Intuitively, this approach removes the cases that are far from the border. • A noisy case will have a singleton reachable set and a singleton coverage set. • This property protects the noisy case from being removed. • Wilson Editing
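A rough sketch of the iterative filtering loop follows. It makes several assumptions that should be read as such: the reachable set of a case c is taken to be c's own local set, the coverage set is taken to be the cases whose local set contains c (the orientation under which cases far from the border are the ones flagged), distances are Euclidean, and the Wilson Editing pre-pass is omitted. The names `local_set`, `icf`, and `max_iter` are hypothetical.

```python
from math import dist


def local_set(i, points, labels, alive):
    """Same-class cases (including i) closer to i than i's nearest enemy."""
    enemy = min((dist(points[i], points[j]) for j in alive
                 if labels[j] != labels[i]), default=float("inf"))
    return {j for j in alive
            if labels[j] == labels[i] and dist(points[i], points[j]) < enemy}


def icf(points, labels, max_iter=20):
    """Return (kept indices, number of cases removed in each iteration)."""
    alive = set(range(len(points)))
    profile = []
    for _ in range(max_iter):
        # Assumed orientation: reachable(c) is c's local set ...
        reach = {i: local_set(i, points, labels, alive) for i in alive}
        # ... and coverage(c) is the set of cases whose local set contains c.
        cover = {i: {j for j in alive if i in reach[j]} for i in alive}
        flagged = {i for i in alive if len(reach[i]) > len(cover[i])}
        if not flagged:
            break  # no progress: stop iterating
        alive -= flagged
        profile.append(len(flagged))
    return sorted(alive), profile


# Toy data (hypothetical): two classes on a line, border between 3.0 and 4.0.
points = [(float(x),) for x in range(8)]
labels = ["a"] * 4 + ["b"] * 4
print(icf(points, labels))  # ([3, 4], [4, 2])
```

On this toy problem only the two cases facing each other across the border survive, and the per-iteration removal counts give a small reduction profile.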
Experiments • Experiments on 30 datasets from UCI repository • Maximum number of iterations: 17 • switzerland database • In general, 3 iterations are required.
Reduction Profiles • The percentage of cases removed after each iteration • switzerland database: 17 iterations, 2 – 13% (complicated) • zoo database: 2 iterations, 37% (simple structure)
Comparative Evaluation • (1) Early approaches • CNN, RNN, SNN, Chang, Wilson Editing, repeated Wilson Editing, and all k-NN • (2) Recent additions • IB2, IB3, TIBLE, and IBL-MDL • (3) State of the art • RT3 and ICF
Conclusions • The structure of the instance space is important. • ICF and RT3 behave in a very similar way. • Their intrinsic properties are similar. • 80% removal with little degradation of accuracy. • The reduction profile provides some insight into the properties of the problem.