Using Trees to Depict a Forest Bin Liu, H. V. Jagadish EECS, University of Michigan, Ann Arbor
Motivation – Too Many Results • In interactive database querying, we often get more results than we can comprehend immediately • Try searching a popular keyword • How often do you actually click past 2-3 pages of results? • 85% of users never go to the second page [1,2]
Why IR Solutions Do NOT Apply • Sorting and ranking are standard IR techniques • Search engines show the most relevant hits on the first page • However, for a database query, all tuples in the result set are equally relevant • For example, Select * from Cars where price < 13,000 • All matching results should be available to the user • What to do when there are millions of results?
Make the First Page Count • If no user preference information is available, how do we best arrange results? • Sort by attribute? • Random selection? • Others? • Show the most “representative” results • They best help users learn what is in the result set • Users can decide further actions based on the representatives
Suppose a user wants a 2005 Civic but there are too many of them…
Challenges • Metric challenge • What is the best set of representatives? • Representative finding challenge • How to find them efficiently? • Query challenge • How to efficiently adapt to user’s query operations?
Finding a Suitable Metric • Users should be the ultimate judge • Which metric generates the representatives that I can learn the most from • User study • Use a set of candidates • Users observe the representatives • Users estimate more data points in the data • Representatives lead to best estimation wins
Metric Candidates • Sort by attributes • Uniform random sampling • Density-biased sampling [3] • Sort by typicality [4] • K-medoids • Average • Maximum
Density-biased Sampling • Proposed by C. R. Palmer and C. Faloutsos [3] • Sample more from sparse regions, less from dense regions • To counter the weakness of uniform sampling where small clusters are missed
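The idea can be sketched with a simple grid-based variant. This is a simplification for illustration only: Palmer and Faloutsos's method groups points by hashing and uses a tunable density exponent, and the names here are ours.

```python
import random

def density_biased_sample(points, cell, k, rng=random):
    # Bucket 2-D points into grid cells, then weight each point inversely
    # to its cell's population, so sparse regions are sampled more heavily
    # than under uniform sampling (countering the missed-small-cluster
    # weakness noted above).
    buckets = {}
    for x, y in points:
        buckets.setdefault((x // cell, y // cell), []).append((x, y))
    weights = [1.0 / len(buckets[(x // cell, y // cell)]) for x, y in points]
    return rng.choices(points, weights=weights, k=k)

random.seed(1)
pts = [(0.1, 0.1), (0.2, 0.2), (0.3, 0.1), (9.0, 9.0)]
sample = density_biased_sample(pts, 1.0, 2)
```

Here the lone point near (9, 9) carries as much weight as the entire dense cell near the origin.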
Sort by Typicality Proposed by Ming Hua, Jian Pei, et al [4] Figure source: slides from Ming Hua
Metric Candidates - K-medoids • A medoid of a cluster is the object whose average or maximum dissimilarity to the others is smallest • Average medoid and max medoid • K-medoids are k objects, each the medoid of its own cluster • Why not K-means? • K-means cluster centers do not exist in the database • We must present real objects to users
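A minimal sketch of the two medoid variants for a single cluster, using Euclidean distance (illustrative code, not the paper's implementation). Unlike a k-means center, the result is always a real object from the data:

```python
import math

def medoid(points, agg):
    # agg is sum(...) for the average medoid or max(...) for the max medoid;
    # dividing by the (constant) cluster size does not change the argmin,
    # so sum suffices for the average case.
    return min(points, key=lambda p: agg(math.dist(p, q) for q in points))

cluster = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (4.0, 5.0)]
avg_med = medoid(cluster, sum)   # object minimizing average distance
max_med = medoid(cluster, max)   # object minimizing maximum distance
```

For this cluster both variants pick (0.0, 1.0), but on skewed data they can differ: the max medoid is pulled toward outliers it must stay close to.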
Plotting the Candidates Data: Yahoo! Autos, 3,922 data points; price and mileage normalized to [0, 1].
User Study Procedure • Users are given • 7 sets of data, generated using the 7 candidate methods • Each set consists of 8 representative points • Users predict 4 more data points • that are most likely in the data set • without picking those already given • Measure the prediction error
Prediction Quality Measurement (figure: a data point S0 with predicted points P1, P2 at distances D1 ≤ D2) For data point S0: • MinDist: D1 • MaxDist: D2 • AvgDist: (D1 + D2) / 2
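The three measures can be computed directly; a minimal sketch in Python (the function name is ours):

```python
import math

def prediction_errors(s0, predictions):
    # Distances from a data point s0 to every user-predicted point;
    # with two predictions P1, P2 at distances D1 <= D2 this reproduces
    # the slide's MinDist = D1, MaxDist = D2, AvgDist = (D1 + D2) / 2.
    ds = sorted(math.dist(s0, p) for p in predictions)
    return {"MinDist": ds[0], "MaxDist": ds[-1], "AvgDist": sum(ds) / len(ds)}

errs = prediction_errors((0.0, 0.0), [(3.0, 4.0), (6.0, 8.0)])  # D1 = 5, D2 = 10
```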
Performance – AvgDist and MaxDist • For AvgDist: Avg-Medoid is the winner. • For MaxDist: Max-Medoid is the winner.
Performance – MinDist Avg-Medoid seems to be the winner
Verdict • Statistical significance of results • Although the MinDist result is not statistically significant, overall Avg-Medoid is better than Density • Based on AvgDist and MinDist: Avg-Medoid • Based on MaxDist: Max-Medoid • In this paper, we choose average k-medoids • Our algorithm extends to max-medoids with small changes
Challenges • Metric challenge • What is the best set of representatives? • Representative finding challenge • How to find them efficiently? • Query challenge • How to efficiently adapt to user’s query operations?
Cover Tree Based Algorithm • Cover Tree was proposed by Beygelzimer, Kakade, and Langford in 2006 [5] • Briefly discuss Cover Tree properties • Cover Tree based algorithms for computing k-medoids
Cover Tree Properties (1) • Assume all pair-wise distances <= 1 • Nesting: for all i, Ci ⊆ Ci+1 Points in the Data (One Dimension) Figure modified from slides of Cover Tree authors
Cover Tree Properties (2) • Covering: a node in Ci is within distance 1/2^i of its children in Ci+1 • The distance from a node at level i to any of its descendants is therefore less than 1/2^(i-1) • This value is called the “span” of the node
Cover Tree Properties (3) • Separation: nodes in Ci are separated by at least 1/2^i Points in the Data Figure modified from slides of Cover Tree authors
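Taken together, the three properties can be checked mechanically. A minimal sketch assuming a base of 1/2 and all pairwise distances <= 1, as above; the `check_cover_tree` helper and its level encoding (each level as a dict mapping a point to its parent in the level above) are our illustration, not from the paper:

```python
import math

def check_cover_tree(levels, base=0.5):
    # levels[i] represents C_i; keys are points (tuples), values are the
    # parent point in C_{i-1} (None for the root level).
    for i in range(len(levels) - 1):
        # Nesting: C_i is a subset of C_{i+1}.
        assert set(levels[i]) <= set(levels[i + 1])
    for i, level in enumerate(levels):
        pts = list(level)
        # Separation: points in C_i are at least base**i apart.
        for a in pts:
            for b in pts:
                if a != b:
                    assert math.dist(a, b) >= base ** i
        # Covering: a point in C_i lies within base**(i-1) of its parent in C_{i-1}.
        if i > 0:
            for p, parent in level.items():
                assert math.dist(p, parent) <= base ** (i - 1)
    return True

# A tiny 1-D example (points as 1-tuples):
C0 = {(0.0,): None}
C1 = {(0.0,): (0.0,), (0.6,): (0.0,)}
C2 = {(0.0,): (0.0,), (0.6,): (0.6,), (0.3,): (0.0,)}
ok = check_cover_tree([C0, C1, C2])
```

Note how (0.0,) reappears at every level: a cover tree node is implicitly its own child at the next level down.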
Additional Stats for Cover Tree (2D example; figure shows subtrees with DS = 10 and DS = 3 around a point p) • Density (DS): number of points in the subtree • Centroid (CT): geometric center of points in the subtree
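The two statistics can be maintained bottom-up in one pass. A sketch over a plain point hierarchy (ignoring the cover tree's self-children, which a real implementation must avoid double-counting; the `Node` class and `annotate` helper are illustrative):

```python
class Node:
    def __init__(self, point, children=()):
        self.point = point
        self.children = list(children)

def annotate(node):
    # Density (DS): number of points in the subtree.
    # Centroid (CT): geometric center of those points, accumulated as a
    # density-weighted running sum over the children's centroids.
    ds, sx, sy = 1, node.point[0], node.point[1]
    for c in node.children:
        cds, (cx, cy) = annotate(c)
        ds += cds
        sx += cx * cds
        sy += cy * cds
    node.ds, node.ct = ds, (sx / ds, sy / ds)
    return ds, node.ct

root = Node((0.0, 0.0), [Node((2.0, 0.0)), Node((0.0, 2.0))])
annotate(root)   # root.ds = 3, root.ct = (2/3, 2/3)
```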
k-medoid Algorithm Outline • Descend the cover tree to a level with more than k nodes • Choose an initial k points as the first set of medoids (seeds) • Bad seeds can lead to local minima with a high distance cost • Assign nodes and repeatedly update until the medoids converge
Cover Tree Based Seeding • Descend the cover tree to a level with more than k nodes (denote it level m) • Use the parent level (m-1) as the starting point for seeds • Each node has a weight, calculated as the product of span and density (the contribution of the subtree to the distance cost) • Expand nodes using a priority queue • Fetch the first k nodes from the queue as seeds
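The seeding procedure can be sketched with a max-heap keyed on span * density. The `TNode` structure and the expand-until-k-entries rule are our reading of the slides; the paper's details may differ:

```python
import heapq
from dataclasses import dataclass, field

@dataclass
class TNode:
    name: str
    span: float
    ds: int                               # density: points in the subtree
    children: list = field(default_factory=list)

    @property
    def weight(self):
        return self.span * self.ds        # subtree's distance-cost contribution

def seeds(level, k):
    # Max-heap via negated weights; expand the heaviest node into its
    # children until at least k entries exist, then return the k heaviest.
    # (Assumes expanded nodes have children; a sketch, not production code.)
    heap = [(-n.weight, n.name, n) for n in level]
    heapq.heapify(heap)
    while len(heap) < k:
        _, _, n = heapq.heappop(heap)
        for c in n.children:
            heapq.heappush(heap, (-c.weight, c.name, c))
    return [n for _, _, n in heapq.nsmallest(k, heap)]

level = [
    TNode("A", 2.0, 5, [TNode("A1", 1.0, 3), TNode("A2", 1.0, 2)]),
    TNode("B", 2.0, 3),
    TNode("C", 2.0, 1),
]
picked = seeds(level, 4)   # A (weight 10) is expanded into A1, A2
```

As in the slide's example, expanding the heaviest subtree trades one coarse seed for several finer ones whose halved spans lower their weights.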
A Simple Example: k = 4 (figure: a cover tree whose levels have spans 2, 1, 1/2, and 1/4) Priority queue on node weight (density * span): initially S3 (5), S8 (3), S5 (2); after expanding, S8 (3/2), S5 (1), S3 (1), S7 (1), S2 (1/2). The first four nodes in the queue form the final set of seeds.
Update Process • 1. Assign all nodes to the closest seed, forming k clusters • 2. For each cluster, calculate the geometric center, using centroid and density information to approximate each subtree • 3. Find the node closest to the geometric center and designate it as the new medoid • Repeat from step 1 until the medoids converge
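The steps above can be sketched over (centroid, density) pairs standing in for cover-tree nodes. A 2-D simplification with illustrative names; the paper's version operates on the tree itself:

```python
import math

def update_medoids(nodes, medoids):
    # nodes: (centroid, density) pairs approximating subtrees;
    # medoids: initial seed points drawn from the node centroids.
    while True:
        clusters = {m: [] for m in medoids}
        for ct, ds in nodes:                       # 1. assign to closest medoid
            nearest = min(medoids, key=lambda m: math.dist(m, ct))
            clusters[nearest].append((ct, ds))
        new = []
        for members in clusters.values():
            if not members:
                continue
            w = sum(ds for _, ds in members)       # 2. density-weighted center
            gc = tuple(sum(c[i] * ds for c, ds in members) / w for i in range(2))
            new.append(min(members, key=lambda n: math.dist(n[0], gc))[0])
        if set(new) == set(medoids):               # 3. converged
            return new
        medoids = new

nodes = [((0.0, 0.0), 1), ((0.0, 1.0), 1), ((10.0, 10.0), 1), ((10.0, 11.0), 1)]
result = update_medoids(nodes, [(0.0, 0.0), (0.0, 1.0)])
```

Even from two seeds in the same cluster, one iteration moves a medoid to the distant group and the next confirms convergence.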
Challenges • Metric challenge • What is the best set of representatives? • Representative finding challenge • How to find them efficiently? • Query challenge • How to efficiently adapt to user’s query operations?
Query Adaptation • Handle user actions • Zooming • Selection (filtering) • Zooming • Expand all nodes assigned to the medoid • Run k-medoid algorithm on the new set of nodes
Selection • Effect of selection on a node • Completely invalid • Fully valid • Partially valid • Estimate the validity percentage (VG) of each node • Multiply the VG by the weight of each node
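One plausible reading, for a 1-D range selection: estimate a node's VG as the fraction of its extent that the selection covers, assuming points are spread uniformly over the extent. The uniformity assumption and both helper functions are ours, not the paper's estimator:

```python
def validity(node_lo, node_hi, sel_lo, sel_hi):
    # VG estimate: overlap of the node's extent [node_lo, node_hi] with the
    # selection range [sel_lo, sel_hi], as a fraction of the extent.
    # 0.0 = completely invalid, 1.0 = fully valid, in between = partially valid.
    overlap = max(0.0, min(node_hi, sel_hi) - max(node_lo, sel_lo))
    extent = node_hi - node_lo
    return overlap / extent if extent > 0 else float(sel_lo <= node_lo <= sel_hi)

def adjusted_weight(span, density, vg):
    # The node's seeding weight (span * density) scaled by its VG.
    return span * density * vg

vg = validity(0.0, 10.0, 5.0, 20.0)   # selection keeps the upper half
w = adjusted_weight(2.0, 10.0, vg)
```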
Experiments – Initial Medoid Quality • Compare with an R-tree based method [6] • Data sets • Synthetic dataset: 2D points with a Zipf distribution • Real dataset: LA data set from the R-tree Portal, 130k points • Measurement • Time to compute the medoids • Average distance from a data point to its medoid
Results on Synthetic Data Distance Time For various sizes of data, Cover-tree based method outperforms R-tree based method
Result on Real Data For various k values, Cover-tree based method outperforms R-tree based method on real data
Query Adaptation Compare with re-building the cover tree and running the k-medoid algorithm from scratch. Synthetic Data Real Data Time cost of re-building is orders-of-magnitude higher than incremental computation.
Related Work • Classic/textbook k-medoid methods • Partition Around Medoids (PAM) and Clustering LARge Applications (CLARA), L. Kaufman and P. Rousseeuw, 1990 • CLARANS, R. T. Ng and J. Han, TKDE 2002 • Tree-based methods • Focusing on Representatives (FOR), M. Ester, H. Kriegel, and X. Xu, KDD 1996 • Tree-based Partitioning Querying (TPAQ), K. Mouratidis, D. Papadias, and S. Papadimitriou, VLDBJ 2008
Related Work (2) • Clustering methods • For example, BIRCH, T. Zhang, R. Ramakrishnan, and M. Livny, SIGMOD 1996 • Result presentation methods • Automatic result categorization, K. Chakrabarti, S. Chaudhuri, and S. Hwang, SIGMOD 2004 • DataScope, T. Wu, et al., VLDB 2007 • Other recent work • Finding representative set from massive data, ICDM 2005 • Generalized group by, C. Li, et al., SIGMOD 2007 • Query result diversification, E. Vee, et al., ICDE 2008
Conclusion • We proposed the MusiqLens framework for solving the many-answer problem • We conducted a user study to select a metric for choosing representatives • We proposed efficient methods for computing and maintaining the representatives under user actions • Part of the database usability project at the Univ. of Michigan • Led by Prof. H. V. Jagadish • http://www.eecs.umich.edu/db/usable/
Thank you. Questions? Bin Liu, binliu@umich.edu
References
[1] E. Agichtein, E. Brill, S. T. Dumais, and R. Ragno. Learning user interaction models for predicting web search result preferences. In SIGIR, 2006.
[2] B. J. Jansen and A. Spink. How are we searching the world wide web? A comparison of nine search engine transaction logs. Inf. Process. Manage., 42(1), 2006.
[3] C. R. Palmer and C. Faloutsos. Density biased sampling: An improved method for data mining and clustering. In SIGMOD Conference, 2000.
[4] M. Hua, J. Pei, A. W.-C. Fu, X. Lin, and H.-F. Leung. Efficiently answering top-k typicality queries on large databases. In VLDB, pages 890-901, 2007.
[5] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbor. In ICML, 2006.
[6] K. Mouratidis, D. Papadias, and S. Papadimitriou. Tree-based partition querying: a methodology for computing medoids in large spatial datasets. VLDB J., 17(4):923-945, 2008.