This presentation discusses a scalable and fast nearest neighbor-based classification algorithm for image recognition, addressing challenges of large training sets. It introduces SMART TV (SMall Absolute diffeRence of ToTal Variation) for efficient classification. The method involves unclassified object search and K-Nearest Neighbors voting. The algorithm aims to enhance the speed and scalability of the classification process in scenarios with millions of objects in the training set. Various techniques like predicate trees, file structures, and Total Variation calculations are explored. The presentation delves into the concept of functional contours to optimize the nearest neighbor set and improve scanning efficiency. The algorithm's implementation and the process of deriving attributes using dual functions are also detailed.
A Fast and Scalable Nearest Neighbor Based Classification Taufik Abidin and William Perrizo Department of Computer Science North Dakota State University
Outline • Nearest Neighbors Classification • Problems • SMART TV (SMall Absolute diffeRence of ToTal Variation): A Fast and Scalable Nearest Neighbors Classification Algorithm • SMART TV in Image Classification
Classification • Given a (large) TRAINING SET, R(A1,…,An, C), with C=CLASSES and {A1…An}=FEATURES, classification is labeling unclassified objects based on the class label assignment pattern of objects in the training set • kNN classification goes as follows: Training Set + Unclassified Object → Search for the K-Nearest Neighbors → Vote the class
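The search-then-vote flow of plain kNN can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `knn_classify`, the toy `training` list, and the use of Euclidean distance with an unweighted majority vote are all assumptions made for the example.

```python
import math
from collections import Counter

def knn_classify(training, query, k):
    """training: list of (feature_tuple, class_label); query: feature tuple."""
    # Linear scan: rank every training object by its distance to the query.
    by_distance = sorted(training, key=lambda rec: math.dist(rec[0], query))
    # Unweighted majority vote among the k closest objects.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy training set with two well-separated classes.
training = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
            ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b")]
```

Every query touches every training object here, which is the cost the rest of the presentation tries to avoid.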
Problems with KNN • Finding the k-nearest-neighbor set can be expensive when the training set is very large (millions of objects) • The search cost is linear in the size of the training set • Can we make it faster (more scalable)?
A file, R(A1..An), containing horizontal structures (records), is processed vertically (vertical scans). Predicate trees: vertically partition the file; compress each vertical bit slice into a basic Ptree; horizontally process these basic Ptrees using one multi-operand logical AND.

R(A1 A2 A3 A4), 3 bits per attribute (horizontal records):
A1  A2  A3  A4
010 111 110 001
011 111 110 000
010 110 101 001
010 111 101 111
101 010 001 100
010 010 001 101
111 000 001 100
111 000 001 100

Scanned vertically, R yields the bit slices R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43, which are compressed into the basic Ptrees P11 P12 P13 P21 P22 P23 P31 P32 P33 P41 P42 P43.

1-Dimensional Ptrees are built by recording the truth of the predicate "pure 1" recursively on halves, until there is purity. For P11, built from R11 = 0 0 0 0 1 0 1 1:
1. Whole file is not pure1: 0
2. 1st half is not pure1: 0 (but it is pure (pure0), so this branch ends)
3. 2nd half is not pure1: 0
4. 1st half of 2nd half is not pure1: 0
5. 2nd half of 2nd half is pure1: 1
6. 1st half of 1st of 2nd is pure1: 1
7. 2nd half of 1st of 2nd is not pure1: 0

E.g., to count the occurrences of 111 000 001 100, use "pure111000001100": P11^P12^P13^P'21^P'22^P'23^P'31^P'32^P33^P41^P'42^P'43. The resulting tree is 0 at the 2^3-level, 0 0 at the 2^2-level, and 0 1 at the 2^1-level, so the count is 2.
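The construction above can be tried out with a short sketch. It is an illustration under assumptions: the names `bit_slices`, `build_ptree`, and `count_pattern` are hypothetical, a mixed node is represented as a tuple `(0, left, right)`, and the AND is carried out on uncompressed bit slices rather than directly on compressed Ptrees, which is what a real P-tree implementation would do.

```python
# The eight records of R(A1 A2 A3 A4) from the slide, 3 bits per attribute.
RECORDS = [
    "010 111 110 001", "011 111 110 000", "010 110 101 001", "010 111 101 111",
    "101 010 001 100", "010 010 001 101", "111 000 001 100", "111 000 001 100",
]

def bit_slices(records):
    """Vertically partition: return one list of bits per bit position (R11..R43)."""
    rows = [r.replace(" ", "") for r in records]
    return [[int(row[pos]) for row in rows] for pos in range(len(rows[0]))]

def build_ptree(bits):
    """Record the truth of 'pure 1' recursively on halves until a half is pure."""
    if all(bits):
        return 1                      # pure1 leaf
    if not any(bits):
        return 0                      # pure0 leaf: not pure1, but the branch ends
    mid = len(bits) // 2
    # Mixed node: 0 (not pure1) followed by the trees of the two halves.
    return (0, build_ptree(bits[:mid]), build_ptree(bits[mid:]))

def count_pattern(records, pattern):
    """Root count of the AND of P (for a 1 bit) or P' (for a 0 bit) over all slices."""
    target = pattern.replace(" ", "")
    result = [1] * len(records)
    for slc, want in zip(bit_slices(records), target):
        result = [r & (b if want == "1" else 1 - b) for r, b in zip(result, slc)]
    return sum(result)
```

With the slide's data, `build_ptree(bit_slices(RECORDS)[0])` reproduces the seven steps listed for P11, and `count_pattern(RECORDS, "111 000 001 100")` gives the count of matching records.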
Total Variation • The Total Variation of a set X about the mean, $\mu$, measures the total squared separation of objects in X about $\mu$, defined as follows: $TV(X, \mu) = \sum_{x \in X} (x-\mu)\circ(x-\mu)$ • We will use the concept of functional contours in this presentation to determine a small pruned superset of the nearest neighbor set (which will then be scanned) • First we will discuss functional contours in general, then consider the specific TV contours.
Functional contours • Given $f: R(A_1..A_n) \to Y$ (any range) and $S \subseteq Y$ (any subset of the range), define $contour(f,S) \equiv f^{-1}(S)$ • $graph(f) = \{(a_1,...,a_n,f(a_1..a_n)) \mid (a_1..a_n) \in R\}$ • There is a DUALITY between functions, $f: R(A_1..A_n) \to Y$, and derived attributes, $A_f$ of R, given by $x.A_f \equiv f(x)$ where $Dom(A_f)=Y$ • From the derived attribute point of view, Contour(Af,S) = SELECT A1..An FROM R* WHERE R*.Af ∈ S, where R* is R extended with the column Af • If S={a}, $f^{-1}(\{a\})$ is Isobar(f, a)
[Figure: graph(f) over the A1..An space, with a subset S of the range Y and contour(f,S) marked in R; and the table R(A1..An) extended to R*(A1..An, Af) with the derived column f(x)]
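The duality between a function and a derived attribute can be made concrete with a small sketch. The names `derive`, `contour`, and `isobar`, the toy relation `R`, and the choice to pass S as a membership predicate are all assumptions made for this illustration.

```python
def derive(R, f):
    """Build R*: each record of R extended with the derived attribute Af = f(x)."""
    return [x + (f(x),) for x in R]

def contour(R, f, in_S):
    """SELECT A1..An FROM R* WHERE R*.Af in S, with S given as a predicate."""
    return [row[:-1] for row in derive(R, f) if in_S(row[-1])]

def isobar(R, f, a):
    """The contour of the single value a."""
    return contour(R, f, lambda y: y == a)

# Hypothetical relation and functional: squared distance from the origin.
R = [(0, 0), (1, 0), (0, 2), (3, 4)]
f = lambda x: sum(v * v for v in x)
```

Here `contour(R, f, lambda y: 1 <= y <= 4)` selects the records whose derived value falls in the interval [1, 4], which is the kind of interval contour used for pruning later in the presentation.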
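Given R(A1..An), $TV(a) = \sum_{x \in R} (x-a)\circ(x-a)$. If we use d as an index variable over the dimensions and i, j, k as bit-slice indexes (so that $x_d = \sum_k 2^k x_{dk}$):

$$TV(a) = \sum_{x \in R} \sum_{d=1..n} (x_d^2 - 2 a_d x_d + a_d^2)$$
$$= \sum_{x \in R} \sum_{d=1..n} \Big(\sum_k 2^k x_{dk}\Big)^2 - 2 \sum_{x \in R} \sum_{d=1..n} a_d \Big(\sum_k 2^k x_{dk}\Big) + |R||a|^2$$
$$= \sum_{x,d} \Big(\sum_i 2^i x_{di}\Big)\Big(\sum_j 2^j x_{dj}\Big) - 2 \sum_{x \in R} \sum_{d=1..n} a_d \Big(\sum_k 2^k x_{dk}\Big) + |R||a|^2$$
$$= \sum_{x,d,i,j} 2^{i+j} x_{di} x_{dj} - 2 \sum_{x,d,k} 2^k a_d x_{dk} + |R||a|^2$$
$$= \sum_{x,d,i,j} 2^{i+j} x_{di} x_{dj} - 2 \sum_d a_d \sum_{x,k} 2^k x_{dk} + |R||a|^2$$
$$= \sum_{x,d,i,j} 2^{i+j} x_{di} x_{dj} - 2|R| \sum_d a_d \mu_d + |R| \sum_d a_d a_d$$
$$= \sum_{x,d,i,j} 2^{i+j} x_{di} x_{dj} + |R|\Big(-2 \sum_d a_d \mu_d + \sum_d a_d a_d\Big)$$

In terms of Ptree root counts:

$$TV(a) = \sum_{i,j,d} 2^{i+j} |P_{di \wedge dj}| - \sum_k 2^{k+1} \sum_d a_d |P_{dk}| + |R||a|^2$$

The first term does not depend upon a. Thus the simpler derived attribute, $TV - TV(\mu)$, does not have that term but has contours identical to those of TV (the graph is just lowered by the constant $TV(\mu)$). We also find it useful to apply a log to this simpler Total Variation function (to reduce the number of bit slices). The resulting functional is called the High-Dimension-ready Total Variation, or HDTV(a).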
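The root-count form of TV(a) can be checked numerically against the direct definition. The sketch below is only a sanity check under assumptions: `tv_direct`, `tv_bitslice`, and `root_count` are hypothetical names, bit index 0 is taken to be the least significant bit, and root counts are computed by scanning integer records rather than by ANDing compressed Ptrees.

```python
def tv_direct(R, a):
    """TV(a) as the sum over x in R of (x - a) o (x - a)."""
    return sum((xd - ad) ** 2 for x in R for xd, ad in zip(x, a))

def root_count(R, d, i, j):
    """|P_di ^ P_dj|: number of records whose bits i and j of attribute d are both 1."""
    return sum(((x[d] >> i) & 1) & ((x[d] >> j) & 1) for x in R)

def tv_bitslice(R, a, bits):
    """TV(a) from root counts only; the first term is independent of a."""
    n = len(R[0])
    term1 = sum(2 ** (i + j) * root_count(R, d, i, j)
                for d in range(n) for i in range(bits) for j in range(bits))
    term2 = sum(2 ** (k + 1) * a[d] * root_count(R, d, k, k)
                for d in range(n) for k in range(bits))
    term3 = len(R) * sum(ad * ad for ad in a)
    return term1 - term2 + term3

# The eight 3-bit records of the earlier Ptree slide, written as integers.
R = [(2, 7, 6, 1), (3, 7, 6, 0), (2, 6, 5, 1), (2, 7, 5, 7),
     (5, 2, 1, 4), (2, 2, 1, 5), (7, 0, 1, 4), (7, 0, 1, 4)]
a = (3, 4, 2, 1)   # an arbitrary integer point, chosen so the comparison is exact
```

Because `term1` does not involve `a`, it can be computed once in preprocessing and reused for every point to classify.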
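From equation 7, $TV(a) = \sum_{x,d,i,j} 2^{i+j} x_{di} x_{dj} + |R|\big(-2\sum_d a_d \mu_d + \sum_d a_d a_d\big)$, so

$$f(a) = TV(a) - TV(\mu) = |R|\Big(-2\sum_d (a_d \mu_d - \mu_d \mu_d) + \sum_d (a_d a_d - \mu_d \mu_d)\Big)$$
$$= |R|\Big(\sum_d a_d^2 - 2\sum_d \mu_d a_d + \sum_d \mu_d^2\Big) = |R|\,|a-\mu|^2$$

so $f(\mu)=0$. Letting $g(a) \equiv HDTV(a) = \ln(f(a)) = \ln|R| + \ln|a-\mu|^2$ and taking

$$\frac{\partial g}{\partial a_d}(a) = \frac{2(a-\mu)_d}{|a-\mu|^2}$$

the gradient of g at a is $\frac{2}{|a-\mu|^2}(a-\mu)$.

The gradient of f is 0 iff $a=\mu$, and the gradient length depends only on the length of $a-\mu$, so the isobars are hyper-circles. The gradient function of g has the form $h(r) = 2/r$ along any ray from $\mu$. Integrating, we get that g(a) has the form $2\ln|a-\mu|$ along any coordinate direction (in fact, any radial direction from $\mu$), so the shape of graph(g) is a funnel.

What interval endpoints give an exact $\epsilon$-contour in feature space? To get an $\epsilon$-contour, we move in and out along $a-\mu$ by $\epsilon$ to the inner point, $b = \mu + (1-\epsilon/|a-\mu|)(a-\mu)$, and the outer point, $c = \mu + (1+\epsilon/|a-\mu|)(a-\mu)$. Then take f(b) and f(c) as the lower and upper endpoints of the red vertical interval. Then we use formulas on that interval to get a P-tree for the $\epsilon$-contour (which is a well-pruned superset of the $\epsilon$-neighborhood of a).

[Figure: funnel-shaped graph of g with the points b, a, c along a ray from $\mu$, the interval [f(b), f(c)] on the vertical axis, and the $\epsilon$-contour (radius $\epsilon$ about a)]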
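The interval endpoints f(b) and f(c) follow directly from f(a) = |R| |a-μ|², since b and c lie at distances |a-μ|-ε and |a-μ|+ε from μ. The sketch below computes them; the names `contour_interval` and `in_contour` are hypothetical, and clamping the inner radius at 0 when ε exceeds |a-μ| is an assumption the slides do not discuss.

```python
import math

def contour_interval(a, mu, n_records, eps):
    """Return (f(b), f(c)) for f(x) = |R| * |x - mu|^2, moving eps in and out from a."""
    r = math.dist(a, mu)
    inner = max(r - eps, 0.0)   # |b - mu|; clamped at 0 (assumption for eps > r)
    outer = r + eps             # |c - mu|
    return n_records * inner ** 2, n_records * outer ** 2

def in_contour(x, mu, n_records, lo, hi):
    """True when the derived attribute value f(x) falls inside [lo, hi]."""
    return lo <= n_records * math.dist(x, mu) ** 2 <= hi
```

By the triangle inequality, any point within ε of a has a distance to μ between |a-μ|-ε and |a-μ|+ε, so it passes `in_contour`. Points on the far side of μ at the same radius pass too, which is why the contour is a superset of the ε-neighborhood and further pruning may be needed.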
[Figure: the interval [f(b), f(c)] and the $\epsilon$-contour (radius $\epsilon$ about a), with the points b, a, c]
For additional vertical pruning we can use any other functional contours that can be easily computed (e.g., the dimension projection functionals). To classify a:
1. Calculate basic P-trees for the derived attribute column of each training point.
2. Calculate b and c (which depend upon a and $\epsilon$).
3. Get the feature space P-tree for those points with derived attribute value in [f(b), f(c)]. (Note: when the camera-ready paper was submitted, we were still doing this step by sorting TV(a) values and then forming the predicate tree. Now we use the contour approach, which speeds up that step considerably.)
4. Use that P-tree to prune out the candidate NNS.
5. If the root count of the candidate set is now small, proceed to scan and assign votes using Gaussian vote weights; else look for another pruning functional (e.g., the dimension projection function for the major $a-\mu$ dimensions).
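The slides do not give the exact form of the Gaussian vote weights used in step 5. One common choice, shown here purely as an assumption, weights each candidate by exp(-d²/(2σ²)), where d is its distance to the point being classified; the function name `gaussian_vote` and the bandwidth parameter `sigma` are hypothetical.

```python
import math
from collections import defaultdict

def gaussian_vote(candidates, query, sigma=1.0):
    """candidates: list of (feature_tuple, class_label); sigma is an assumed bandwidth."""
    weights = defaultdict(float)
    for features, label in candidates:
        d2 = sum((f - q) ** 2 for f, q in zip(features, query))
        # Closer candidates contribute more to their class's total vote.
        weights[label] += math.exp(-d2 / (2 * sigma ** 2))
    # The class with the largest accumulated weight wins.
    return max(weights, key=weights.get)
```

Under this weighting, a single very close candidate can outvote several distant ones, unlike an unweighted majority vote.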
[Figure: Graph of TV, TV-TV(μ) and HDTV over the X-Y plane (axes 1 to 5), marking TV(x15), TV(μ)=TV(x33), and TV(x15)-TV(μ)]
Dataset • KDDCUP-99 Dataset (Network Intrusion Dataset) • 4.8 million records, 32 numerical attributes • 6 classes, each containing >10,000 records • Class distribution: • Testing set: 120 records, 20 per class • 4 synthetic datasets (randomly generated): • 10,000 records (SS-I) • 100,000 records (SS-II) • 1,000,000 records (SS-III) • 2,000,000 records (SS-IV)
Dataset (Cont.) • OPTICS dataset • 8,000 points, 8 classes (CL-1, CL-2,…,CL-8) • 2 numerical attributes • Training set: 7,920 points • Testing set: 80 points, 10 per class
Dataset (Cont.) • IRIS dataset • 150 samples • 3 classes (iris-setosa, iris-versicolor, and iris-virginica) • 4 numerical attributes • Training set: 120 samples • Testing set: 30 samples, 10 per class
Speed and Scalability Speed and Scalability Comparison (k=5, hs=25) Machine used: Intel Pentium 4 CPU 2.6 GHz machine, 3.8GB RAM, running Red Hat Linux
Classification Accuracy (Cont.) Classification Accuracy Comparison (SS-III), k=5, hs=25
Overall Accuracy Overall Classification Accuracy Comparison
Summary • A nearest-neighbor-based classification algorithm that starts its classification steps by approximating a set of candidate nearest neighbors • The absolute difference of total variation between data points in the training set and the unclassified point is used to approximate the candidates • The algorithm is fast, and it scales well to very large datasets. The classification accuracy is comparable to that of the KNN algorithm.
Appendix: Image Preprocessing • We extracted color and texture features from the original pixels of the images • Color features: we used the HSV color space and quantized the images into 54 bins, i.e. (6 x 3 x 3) bins • Texture features: we used multi-resolution Gabor filters with two scales and four orientations (see B.S. Manjunath, IEEE Trans. on Pattern Analysis and Machine Intelligence, 1996)
Image Dataset Corel images (http://wang.ist.psu.edu/docs/related) • 10 categories • Originally, each category has 100 images • Number of feature attributes: - 54 from color features - 16 from texture features • We randomly generated several bigger size datasets to evaluate the speed and scalability of the algorithms. • 50 images for testing set, 5 for each category
Results Classification Time
Results Preprocessing Time
Appendix: Overview of SMART-TV • Preprocessing Phase: Large Training Set → Compute Root Count and Measure HDTV of each object → Store the root count and HDTV values • Classifying Phase: Unclassified Object → Approximate the candidate set of NNs → Search the K-nearest neighbors for the candidate set → Vote
Preprocessing Phase • Compute the root counts of each class Cj, 1 ≤ j ≤ number of classes. Store the results. Complexity: O(kdb²), where k is the number of classes, d is the number of dimensions, and b is the bit-width. • Compute the total variation of each object in Cj about Cj, 1 ≤ j ≤ number of classes. Retain the results. Complexity: O(n), where n is the cardinality of the training set.
Classifying Phase • Stored values of root count and TV + Unclassified Object → Approximate the candidate set of NNs → Search the K-nearest neighbors from the candidate set → Vote
Classifying Phase • For each class Cj with nj objects, 1 ≤ j ≤ number of classes, do the following: a. Compute the total variation of x about Cj, where x is the unclassified object b. Find hs objects in Cj such that the absolute difference between the total variation of the objects in Cj and the total variation of x about Cj is the smallest, i.e. let A be an array holding the hs objects of Cj with the smallest such gaps c. Store all objectIDs in A into TVGapList
Classifying Phase (Cont.) • For each objectID t, 1 ≤ t ≤ Len(TVGapList), where Len(TVGapList) is equal to hs times the total number of classes, retrieve the corresponding object features $x_t$ from the training set and measure the pair-wise Euclidean distance between the unclassified object $x$ and $x_t$, i.e. $d_2(x, x_t) = \sqrt{\sum_{i=1}^{n} (x_i - x_{t,i})^2}$, and determine the k nearest neighbors of $x$ • Vote the class label for $x$ using the k nearest neighbors
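The two phases can be sketched end to end as follows. This is a simplified illustration, not the authors' implementation: all names (`total_variation`, `preprocess`, `smart_tv_classify`) are hypothetical, total variations are recomputed by direct summation instead of from stored P-tree root counts (so this preprocessing is quadratic rather than O(n)), and the final vote is an unweighted majority.

```python
import math
from collections import Counter

def total_variation(points, a):
    """Total squared separation of the points about a."""
    return sum(math.dist(p, a) ** 2 for p in points)

def preprocess(classes):
    """classes: dict label -> list of feature tuples. Store the TV of each object about its class."""
    return {label: [total_variation(pts, p) for p in pts]
            for label, pts in classes.items()}

def smart_tv_classify(classes, stored_tv, x, k, hs):
    tv_gap_list = []
    for label, pts in classes.items():
        tv_x = total_variation(pts, x)          # TV of the unclassified object about Cj
        # Keep the hs objects whose stored TV is closest to tv_x.
        order = sorted(range(len(pts)), key=lambda i: abs(stored_tv[label][i] - tv_x))
        tv_gap_list += [(label, i) for i in order[:hs]]
    # Exact Euclidean search, restricted to the hs * (number of classes) candidates.
    nearest = sorted(tv_gap_list,
                     key=lambda c: math.dist(classes[c[0]][c[1]], x))[:k]
    return Counter(label for label, _ in nearest).most_common(1)[0][0]

# Hypothetical toy training set.
classes = {
    "a": [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (1.1, 1.2)],
    "b": [(5.0, 5.0), (5.2, 4.9), (4.8, 5.1), (5.1, 5.2)],
}
stored_tv = preprocess(classes)
```

The distance computation in the last step touches only hs candidates per class, which is where the speedup over a full linear scan comes from.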