
An Adaptive Nearest Neighbor Classification Algorithm for Data Streams


  1. An Adaptive Nearest Neighbor Classification Algorithm for Data Streams
  Yan-Nei Law & Carlo Zaniolo, University of California, Los Angeles
  PKDD, Porto, 2005

  2. Outline
  • Related Work
  • ANNCAD
  • Properties of ANNCAD
  • Conclusion

  3. Classifying Data Streams
  • Problem Statement: we seek an algorithm for classifying data streams with numerical attributes; it will also work for totally ordered domains.
  • Desiderata:
    • Fast update speed for newly arriving records.
    • Requires only a single pass over the data.
    • Incremental algorithms are needed.
    • Copes with concept changes.
  • Classical mining algorithms were not designed for data streams and need to be replaced or modified.

  4. Classifying Data Streams: Related Work
  • Hoeffding trees:
    • VFDT and CVFDT: build a decision tree incrementally.
    • Require a large number of examples to obtain a classifier with fair performance.
    • Unsatisfactory performance when the training set is small.
  • Ensembles:
    • Combine base models by a voting technique.
    • Suitable for coping with concept drift.
    • Fail to provide a single simple model and understanding of the problem.

  5. State of the Art: Nearest Neighbor Classifiers
  • Pros and cons:
    • +: strong intuitive appeal and simple to implement.
    • -: fail to provide simple models/rules.
    • -: computationally expensive.
  • ANN: Approximate Nearest Neighbor search with error guarantee 1+ε:
    • Idea: pre-process the data into a data structure (e.g., a ring-cover tree) to speed up the search.
    • Designed for stored data only.
    • The time to update the pre-processing structure depends on the size of the data set, which may be unbounded.

  6. Our Algorithm: ANNCAD (Adaptive Nearest Neighbor Classification Algorithm for Data streams)
  • Model building:
    • Pre-assign classes to blocks to obtain an approximate result and provide simple models/rules.
    • Decompose the feature space to make classification decisions.
    • Akin to wavelets.
  • Classification:
    • Find the nearest neighbor for classification adaptively.
    • Progressively expand the search to the area near a test point (the star in the figure).

  7. Quantize Feature Space and Compute Multi-resolution Coefficients
  • Quantize the feature space and record class counts in data arrays.
  • Coarser levels are obtained by averaging blocks of cells, e.g., (8+9+10+0)/4 in the figure (sketched in the code below).
  • Figures: a set of 100 two-class training points; the multi-resolution representation of a two-class data set.
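As a concrete illustration of this step, here is a minimal Python sketch. It assumes the feature space is the unit hypercube divided into g cells per dimension (g a power of two) and integer class labels 0..n_classes-1; the function names (quantize, coarsen, build_pyramid) are illustrative, not from the paper. Each coarser level averages a 2×...×2 block of children, matching the (8+9+10+0)/4 example.

```python
import numpy as np

def quantize(points, labels, g, n_classes):
    """Per-cell class counts at the finest level; points lie in the unit hypercube."""
    d = points.shape[1]
    counts = np.zeros((n_classes,) + (g,) * d)
    cells = np.clip((points * g).astype(int), 0, g - 1)   # cell index of each point
    for cell, y in zip(cells, labels):
        counts[(y,) + tuple(cell)] += 1
    return counts

def coarsen(counts):
    """One multi-resolution step: average each 2x...x2 block of children,
    e.g. (8 + 9 + 10 + 0) / 4 for the 2-d block shown on the slide."""
    c = counts
    for axis in range(1, counts.ndim):                    # axis 0 is the class axis
        shape = c.shape[:axis] + (c.shape[axis] // 2, 2) + c.shape[axis + 1:]
        c = c.reshape(shape).mean(axis=axis + 1)
    return c

def build_pyramid(counts):
    """All resolution levels, finest first, down to a single block."""
    levels = [counts]
    while levels[-1].shape[-1] > 1:
        levels.append(coarsen(levels[-1]))
    return levels
```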

  8. Building a Classifier
  • Label each block with its majority class.
  • Label a block only if |C1st| - |C2nd| > 80%; otherwise tag it "M" (mixed).
  • Example blocks from the figure: B=6.75, R=0.6 → Blue; B=2, R=4.25 → M(ix); B=3, R=3.25 → M(ix).
  • Figure: hierarchical structure of the ANNCAD classifier (labelling sketched below).
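A hedged sketch of the labelling rule, reusing the count arrays from the sketch above. Interpreting the threshold as the gap between the two most frequent classes exceeding 80% of the block total is an assumption, though it matches the example values on the slide (6.75 vs. 0.6 → Blue; 2 vs. 4.25 → M); the MIXED and EMPTY constants are illustrative.

```python
import numpy as np

MIXED = -1    # stands for the "M" tag on the slide
EMPTY = -2    # block containing no training points

def label_level(counts, threshold=0.8):
    """counts: (n_classes, ...) array for one level -> integer label array."""
    total = counts.sum(axis=0)
    ordered = np.sort(counts, axis=0)
    gap = ordered[-1] - ordered[-2]                        # |C_1st| - |C_2nd|
    frac = np.divide(gap, total, out=np.zeros_like(total, dtype=float),
                     where=total > 0)
    labels = np.where(frac > threshold, np.argmax(counts, axis=0), MIXED)
    return np.where(total > 0, labels, EMPTY)
```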

  9. Decision Algorithm on the ANNCAD Hierarchy
  • If the block containing the test point is classified, output its label (class I or class II in the figure).
  • Block with tag "M": go back to the previous (finer) level.
  • Unclassified block: go to the next (coarser) level.
  • After stepping back, compute the distance between the test point and the center of every nonempty neighboring block.
  • The combined classifier works over multiple levels (sketched below).
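A sketch of this decision walk under the earlier assumptions (finest level first, reusing the MIXED/EMPTY tags and count pyramids from the sketches above). One simplification: after hitting an "M" tag, the slide restricts the search to neighbouring blocks of the test point's block, whereas this sketch scans all nonempty blocks of the finer level.

```python
import numpy as np

def classify(x, level_labels, level_counts):
    """Walk from the finest level (index 0) towards coarser levels."""
    x = np.asarray(x, dtype=float)
    for lv, labels in enumerate(level_labels):
        cells = labels.shape[0]                            # cells per dimension
        cell = tuple(np.clip((x * cells).astype(int), 0, cells - 1))
        lab = labels[cell]
        if lab >= 0:                                       # classified block: done
            return int(lab)
        if lab == MIXED:                                   # "M": step back one level
            finer = max(lv - 1, 0)
            return nearest_block_class(x, level_counts[finer])
        # otherwise the block is empty: expand the search to the next coarser level
    return None                                            # no training data at all

def nearest_block_class(x, counts):
    """Majority class of the nearest nonempty block (by distance to block centre)."""
    cells = counts.shape[1]
    nonempty = np.argwhere(counts.sum(axis=0) > 0)         # (k, d) cell indices
    centres = (nonempty + 0.5) / cells
    nearest = nonempty[np.argmin(np.linalg.norm(centres - x, axis=1))]
    return int(np.argmax(counts[(slice(None),) + tuple(nearest)]))
```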

  10. Incremental Update
  • Figure: inserting a new training point into the hierarchy (sketched below).
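The slide illustrates the update with a figure only, so the following is a hedged sketch of absorbing one new training point, assuming the averaged count pyramid from the earlier sketches. Only the cells on the single path from the finest level to the root change, which is what keeps the per-record update cost low.

```python
def update(levels, x, y):
    """Absorb one new training point (x, y); levels are count arrays, finest first."""
    d = levels[0].ndim - 1
    for lv, counts in enumerate(levels):
        cells = counts.shape[1]
        cell = tuple(min(int(xi * cells), cells - 1) for xi in x)
        # each coarser level averages 2^d children, so a unit count added at the
        # finest level contributes 1 / (2^d)^lv to the level-lv average
        counts[(y,) + cell] += 1.0 / (2 ** d) ** lv
```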

  11. Concept Drift: Adaptation by Exponential Forgetting
  • Data array A, forgetting factor λ (0 ≤ λ ≤ 1): A_new ← λ · A_old.
  • No effect if the concept does not change.
  • Adapts quickly (exponentially) if the concept changes.
  • No extra memory needed (no sliding window required); see the sketch below.
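A minimal sketch of the forgetting step, assuming the same count pyramid as above; lam stands for the factor λ (the concept-shift experiment later in the talk uses λ = 0.98).

```python
def forget(levels, lam=0.98):
    """Scale every stored count by lam (0 <= lam <= 1): old evidence decays geometrically."""
    for counts in levels:
        counts *= lam        # in-place exponential decay, no window buffer needed

# usage: decay the stored counts, then insert the newly arrived point
# forget(levels, 0.98)
# update(levels, x_new, y_new)
```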

  12. Grid Position and Resolution
  • Problem: the neighborhood decision strongly depends on the grid position.
  • Solution: build several classifiers by shifting the grid position by 1/n of a cell, then combine the results by voting (sketched below).
  • Example (figure): 4 different grids for building 4 classifiers.
  • Theorem: let x be a test point, with n^d shifted classifiers and b(x) the set of blocks containing x; then for every z ∉ b(x) there exists y ∈ b(x) such that dist(x, y) < (1 + 1/(n-1)) · dist(x, z).
  • In practice, only 2-3 classifiers are needed to achieve a good result.
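A hedged sketch of the ensemble, reusing the helper functions from the earlier sketches. Two simplifying assumptions: shifting the training and test points by +i/(n·g) stands in for shifting the grid origin, and one classifier is built per diagonal shift rather than per combination of per-dimension shifts; the vote combination itself follows the slide.

```python
import numpy as np
from collections import Counter

def build_ensemble(points, labels, g, n_classes, n_shifts=2):
    """One classifier per grid shift of i/n_shifts of a cell width."""
    classifiers = []
    for i in range(n_shifts):
        shift = i / (n_shifts * g)                         # i/n of one cell width
        counts = quantize(points + shift, labels, g, n_classes)
        levels = build_pyramid(counts)
        classifiers.append((shift, levels, [label_level(c) for c in levels]))
    return classifiers

def predict(x, classifiers):
    """Majority vote over the shifted classifiers."""
    votes = [classify(np.asarray(x) + shift, labs, levels)
             for shift, levels, labs in classifiers]
    votes = [v for v in votes if v is not None]
    return Counter(votes).most_common(1)[0][0] if votes else None
```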

  13. Properties of ANNCAD
  • Compact support: the locality property allows fast updates.
  • Dealing with noise: a threshold can be set for the classification decision.
  • Multi-resolution: controls the fineness of the result and can be used to optimize system resources.
  • Low complexity (g^d = total number of cells):
    • Building the classifier: O(min(N, g^d)).
    • Testing: O(log2(g) + 2^d).
    • Updating: O(log2(g) + 1).

  14. Experiments
  • Synthetic data:
    • 3-d unit cube.
    • Class distribution: class 0 inside a sphere of radius 0.5, class 1 outside (generation sketched below).
    • 3,000 training examples, 1,000 test examples.
  • Exact ANN (baseline):
    • Expand the search area by doubling the radius until some training point is reached.
    • Classify the test point with the majority class among the points found.
  • Figures: accuracy for (a) different initial resolutions and (b) different numbers of ensemble classifiers.
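A sketch of the synthetic data generator described above, assuming the sphere is centred in the unit cube (the slide does not say where it is centred).

```python
import numpy as np

rng = np.random.default_rng(0)

def make_sphere_data(n, d=3, radius=0.5):
    """Uniform points in the unit cube; class 0 inside the sphere, class 1 outside."""
    X = rng.random((n, d))
    y = (np.linalg.norm(X - 0.5, axis=1) > radius).astype(int)
    return X, y

X_train, y_train = make_sphere_data(3000)   # 3,000 training examples
X_test, y_test = make_sphere_data(1000)     # 1,000 test examples
```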

  15. Experiments (Cont’) • Real Data 1 -- Letter Recognition • Objective: identify a pixel displays as one of the 26 letter. • 16 numerical attributes to describe its pixel displays. • 15,000 training examples • 5,000 test examples • Add 5 % noise by randomly assign class. • Grid size: 16 units • #Classifiers: 2

  16. ANNCAD vs. VFDT
  • Real Data 2 -- Forest Cover Type
    • Objective: predict the forest cover type.
    • 10 numerical attributes.
    • 12,000 training examples, 9,000 test examples.
    • Grid size: 32 units; #Classifiers: 2.

  17. Concept Shift: ANNCAD vs. CVFDT
  • Real Data 3 -- Adult
    • Objective: determine whether a person's salary is >50K.
    • Concept shift simulation: records grouped by race.
    • λ = 0.98; Grid size: 64; #Classifiers: 2.

  18. Conclusion and Future Work
  • ANNCAD:
    • An incremental classification algorithm that finds nearest neighbors adaptively.
    • Suitable for mining data streams: fast update speed.
    • Exponential forgetting for concept shift/drift.
  • Future work: detect concept shift/drift from changes in the class labels of blocks.

  19. THANK YOU!
