Converting Categories to Numbers for Approximate Nearest Neighbor Search 郭煌政, Department of Computer Science and Information Engineering, National Chiayi University 2004/10/20
Outline • Introduction • Motivation • Measurement • Algorithms • Experiments • Conclusion
Introduction • Memory-Based Reasoning • Case-Based Reasoning • Instance-Based Learning • Given a training dataset and a new object, predict the class (target value) of the new object. • Focus on tabular data
Introduction • K Nearest Neighbor Search • Compute the similarity between the new object and each object in the training dataset. • Linear time in the size of the dataset • Similarity: Euclidean distance • Multi-dimensional Index • Spatial data structures, such as the R-tree • Numeric data only
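For reference, a brute-force KNN sketch in Python (NumPy); the function and variable names are illustrative, not from the talk:

```python
import numpy as np

def knn_predict(train_X, train_y, query, k):
    """Brute-force k-nearest-neighbor prediction: linear in the dataset size."""
    # Euclidean distance from the query to every training object
    dists = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(dists)[:k]  # indices of the k closest objects
    # predict the majority class among the k neighbors
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```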
Introduction • Indexing on Categorical Data? • Requires a linear order of the categories • Does a correct ordering exist? • What is the best ordering? • Store the mapped data in a multi-dimensional data structure as a filtering mechanism
Measurement for Ordering • Ordering Problem: Given an undirected weighted complete graph, a simple path is an ordering of the vertices. The edge weights are the distances between pairs of vertices. The ordering problem is to find a path, called the ordering path, of maximal value according to a given scoring function.
Measurement for Ordering • Relationship Scoring: Reasonable Ordering Score • In an ordering path <v1, v2, …, vn>, a 3-tuple <vi-1, vi, vi+1> is reasonable if and only if dist(vi-1, vi+1) ≥ dist(vi-1, vi) and dist(vi-1, vi+1) ≥ dist(vi, vi+1).
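A minimal sketch of the score, assuming it is normalized as the fraction of reasonable 3-tuples (consistent with the "No Ordering => 1/3" baseline in the experiments); `dist` is the category-to-category distance matrix:

```python
def reasonable_ordering_score(path, dist):
    """Fraction of interior 3-tuples <v[i-1], v[i], v[i+1]> that are reasonable:
    the end-to-end distance is at least as large as both consecutive distances.
    Normalizing by n - 2 is an assumption; a random ordering scores about 1/3."""
    hits = sum(
        1
        for i in range(1, len(path) - 1)
        if dist[path[i - 1]][path[i + 1]] >= dist[path[i - 1]][path[i]]
        and dist[path[i - 1]][path[i + 1]] >= dist[path[i]][path[i + 1]]
    )
    return hits / (len(path) - 2)
```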
Measurement for Mapping • Pairwise Difference Scoring • Normalized distance matrix • Mapping values of categories • Distm(vi, vj) = |mapping(vi) - mapping(vj)|
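A sketch of the pairwise-difference score, assuming it is the root mean squared error (the measurement named in the conclusion) between the mapped distances distm and the normalized distance matrix:

```python
import itertools
import math

def pairwise_rmse(mapping, dist):
    """RMSE between distm(vi, vj) = |mapping(vi) - mapping(vj)| and the
    normalized distances; `mapping` maps each category to a number and
    `dist` is assumed to be normalized to [0, 1]."""
    errors = [
        (abs(mapping[a] - mapping[b]) - dist[a][b]) ** 2
        for a, b in itertools.combinations(list(mapping), 2)
    ]
    return math.sqrt(sum(errors) / len(errors))
```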
Algorithms • Prim-like Ordering • Kruskal-like Ordering • Divisive Ordering • GA-based Ordering • A vertex is a category • A graph represents a distance matrix
Prim-like Ordering Algorithm • Prim's Minimum Spanning Tree • Initially, choose a least-weight edge (u, v) • Add the edge to the tree; S = {u, v} • Choose a least-weight edge connecting a vertex in S to a vertex w not in S • Add the edge to the tree; add w to S • Repeat until all vertices are in S
Prim-like Ordering Algorithm • Prim-like Ordering • Choose a least-weight edge (u, v) • Add the edge to the ordering path; S = {u, v} • Choose a least-weight edge connecting a vertex in S to a vertex w not in S • If the edge would branch the path (it attaches to an interior vertex rather than an endpoint), discard the edge and choose again • Else, add the edge to the ordering path; add w to S • Repeat until all vertices are in S
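A sketch of the Prim-like ordering under the reading above; restricting candidate edges to the two path endpoints is equivalent to repeatedly discarding edges that would branch the path:

```python
def prim_like_ordering(dist, n):
    """Grow a simple path over vertices 0..n-1 (all names illustrative).
    A new vertex may only attach to one of the two current endpoints;
    edges into interior vertices would branch the path and are skipped."""
    # start from the globally least-weight edge
    u, v = min(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda e: dist[e[0]][e[1]],
    )
    path, in_path = [u, v], {u, v}
    while len(path) < n:
        # least-weight edge joining a path endpoint to an outside vertex
        end, w = min(
            ((e, x) for e in (path[0], path[-1])
             for x in range(n) if x not in in_path),
            key=lambda p: dist[p[0]][p[1]],
        )
        if end == path[0]:
            path.insert(0, w)
        else:
            path.append(w)
        in_path.add(w)
    return path
```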
Kruskal-like Ordering Algorithm • Kruskal's Minimum Spanning Tree • Initially, choose a least-weight edge (u, v) • Add the edge to the tree; S = {u, v} • Choose a least-weight edge as long as the edge does not create a cycle in the tree • Add the edge to the tree; add the two vertices to S • Repeat until all vertices are in S
Kruskal-like Ordering Algorithm • Kruskal-like Ordering • Initially, choose a least-weight edge (u, v) and add it to the ordering path; S = {u, v} • Choose a least-weight edge as long as the edge does not create a cycle and the degree of each of its vertices on the path is <= 2 • Add the edge to the ordering path; add the two vertices to S • Repeat until all vertices are in S • A heap can be used to speed up choosing the least-weight edge
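A sketch of the Kruskal-like ordering, using union-find for cycle detection alongside the degree-<= 2 constraint (for brevity, edges are pre-sorted here rather than kept in a heap):

```python
def kruskal_like_ordering(dist, n):
    """Scan edges by increasing weight; accept an edge iff it creates no
    cycle (union-find) and keeps every vertex degree <= 2.  The n - 1
    accepted edges then form a single simple path, which is walked last."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    degree = [0] * n
    adj = [[] for _ in range(n)]
    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    taken = 0
    for _, u, v in edges:
        if taken == n - 1:
            break
        ru, rv = find(u), find(v)
        if ru != rv and degree[u] < 2 and degree[v] < 2:
            parent[ru] = rv
            degree[u] += 1
            degree[v] += 1
            adj[u].append(v)
            adj[v].append(u)
            taken += 1
    # walk the path from one endpoint (a vertex of degree 1)
    start = next(i for i in range(n) if degree[i] == 1)
    path, prev = [start], None
    while len(path) < n:
        nxt = next(x for x in adj[path[-1]] if x != prev)
        prev = path[-1]
        path.append(nxt)
    return path
```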
Divisive Ordering Algorithm • Idea: • Pick a central vertex, and split the remaining vertices into two groups • Build a binary tree: the vertices are the leaves • Central Vertex:
Divisive Ordering Algorithm • [Figure: binary tree with central vertex P; subtree A has children AL, AR; subtree B has children BL, BR] • AR is closer to P than AL is. • BL is closer to P than BR is.
Clustering • Splitting a Set of Vertices into Two Groups • Each group has at least one vertex • Close (similar) vertices in the same group; distant vertices in different groups • Clustering Algorithms • Two clusters
Clustering • Clustering • Grouping a set of objects into classes of similar objects • Agglomerative Hierarchical Clustering Algorithm • Start with singleton clusters • Repeatedly merge the most similar clusters
Clustering • Clustering Algorithm: Cluster Similarity • Single link: dist(Ci, Cj) = min(dist(p, q)), p in Ci, q in Cj • Complete link: dist(Ci, Cj) = max(dist(p, q)), p in Ci, q in Cj • Average link (adopted in our study): dist(Ci, Cj) = avg(dist(p, q)), p in Ci, q in Cj • Others
Clustering • Clustering Implementation Issues • Which pair of clusters to merge: keep a cluster-to-cluster similarity for each pair • Recursively partition sets of vertices while building the binary tree: a non-recursive version uses an explicit stack
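A sketch of the two-way split used at each step of the divisive ordering, assuming agglomerative merging under the average-link distance until exactly two clusters remain:

```python
import itertools

def split_two_groups(vertices, dist):
    """Start from singleton clusters and repeatedly merge the closest pair
    under the average-link distance, stopping when two clusters remain."""
    clusters = [[v] for v in vertices]

    def avg_link(ci, cj):
        return sum(dist[p][q] for p in ci for q in cj) / (len(ci) * len(cj))

    while len(clusters) > 2:
        i, j = min(
            itertools.combinations(range(len(clusters)), 2),
            key=lambda ij: avg_link(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i].extend(clusters.pop(j))
    return clusters
```

Recomputing the closest pair each round is quadratic per merge; keeping a table of cluster-to-cluster similarities, as the slide suggests, avoids the recomputation.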
GA Ordering Algorithm • Genetic Algorithms for Optimization Problems • Chromosome: a solution • Population: a pool of solutions • Genetic Operations • Crossover • Mutation
GA Ordering Algorithm • Encoding a Solution • Binary string • Ordered list of categories: used in our ordering problem • Fitness Function • Reasonable ordering score • Selecting Chromosomes for Crossover • High fitness value => high selection probability
GA Ordering Algorithm • Crossover • Single point • Multiple points • Mask • Crossing over AB | CDE and BD | AEC • Results in ABAEC and BDCDE => illegal (duplicated and missing categories)
GA Ordering Algorithm • Repairing Illegal Chromosome ABAEC • AB*EC => fill D into the * position • Repairing Illegal Chromosome ABABC • AB**C • D and E are missing • Whichever of D and E is closer to B is filled into the first * position
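A sketch of single-point crossover plus repair; filling blanks left-to-right with the missing category closest to the preceding category generalizes the slide's "closest to B" rule (the exact tie-breaking is an assumption):

```python
def crossover_and_repair(p1, p2, point, dist):
    """Single-point crossover of two orderings, then repair: later duplicates
    become blanks, and each blank is filled left-to-right with the missing
    category closest to the category just before it."""
    child = list(p1[:point]) + list(p2[point:])
    seen, blanks = set(), []
    for i, c in enumerate(child):
        if c in seen:
            blanks.append(i)  # a repeated category: mark its slot as a blank
        else:
            seen.add(c)
    missing = [c for c in p1 if c not in seen]
    for i in blanks:
        left = child[i - 1]  # e.g. B in AB*EC
        best = min(missing, key=lambda c: dist[left][c])
        missing.remove(best)
        child[i] = best
    return child
```

For the slide's example, crossing AB|CDE with BD|AEC yields ABAEC, which repairs to ABDEC.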
Mapping Function • Ordering Path <v1, v2, …, vn> • Mapping(vi) =
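A minimal sketch of the mapping, assuming the natural cumulative-distance form, under which consecutive categories on the path reproduce their matrix distance exactly (the talk's exact formula may differ, e.g. by normalization):

```python
def mapping_from_path(path, dist):
    """Map each category to its cumulative distance from v1 along the
    ordering path, so |mapping(v[i+1]) - mapping(v[i])| = dist(v[i], v[i+1])."""
    mapping = {path[0]: 0.0}
    total = 0.0
    for prev, cur in zip(path, path[1:]):
        total += dist[prev][cur]
        mapping[cur] = total
    return mapping
```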
Experiments • Synthetic Data (width/length = 5)
Experiments • Synthetic Data (width/length = 10)
Experiments • Synthetic Data: Reasonable Ordering Score for the Divisive Algorithm • width/length = 5 => 0.82 • width/length = 10 => 0.9 • No ordering => 1/3 (the chance that a random 3-tuple is reasonable) • The divisive algorithm is better than the Prim-like algorithm when the number of categories > 100
Experiments • Synthetic Data (width/length = 5)
Experiments • Synthetic Data (width/length = 10)
Experiments • Divisive ordering is the best among the three ordering algorithms • For the divisive ordering algorithm on > 100 categories, RMSE scores are around 0.07 when width/length = 5 and around 0.05 when width/length = 10. • Prim-like ordering algorithm: 0.12 and 0.1, respectively.
Experiments • "Census-Income" dataset from the University of California, Irvine (UCI) KDD Archive • 33 nominal attributes, 7 continuous attributes • 5000 records sampled for the training dataset • 2000 records sampled for the approximate KNN search experiment
Experiments • Distance Matrix: distance between two categories • V. Ganti, J. Gehrke, and R. Ramakrishnan, "CACTUS: Clustering Categorical Data Using Summaries," ACM KDD, 1999 • D = {d1, d2, …, dn} is a set of n tuples • D ⊆ D1 × D2 × … × Dk, where each Di is a categorical domain, 1 ≤ i ≤ k • di = <ci1, ci2, …, cik>
Experiments • Approximate KNN – nominal attributes
Experiments • Approximate KNN – all attributes
Conclusion • Developed Ordering Algorithms • Prim-like • Kruskal-like • Divisive • GA-based • Devised Measurements • Reasonable ordering score • Root mean squared error
Conclusion • What next? • New categories require a new mapping function • A new index structure? • Training a mapping function for a given ordering path