Using Trees to Depict a Forest Bin Liu, H. V. Jagadish EECS, University of Michigan, Ann Arbor Presented by Sergey Shepshelvich
Motivation • In interactive database querying, we often get more results than we can comprehend immediately • When do you actually click over 2-3 pages of results? • 85% of users never go to the second page! • What to display on the first page?
Standard solutions • Sorting by attributes • Computationally expensive • Similar results can be distributed many pages apart • Ranking • Hard to estimate the user's preference • In database queries, all tuples are equally relevant! • What to do when there are millions of results?
Make the First Page Count • Human beings are very capable of learning from examples • Show the most “representative” results • Best help users learn what is in the result set • User can decide further actions based on representatives
The Proposal: MusiqLens Experience (Model-driven Usable Systems for Information Querying)
Suppose a user wants a 2005 Civic but there are too many of them…
Challenges • Representation Modeling: finding a suitable metric • What is the best set of representatives? • Representative finding • How to find them efficiently? • Query Refinement • How to efficiently adapt to user’s query operations?
Finding a Suitable Metric • Users should be the ultimate judge • Which metric generates the representatives that I can learn the most from? • User study to evaluate different representation modeling
Metric Candidates • Sort by attributes • Uniform random sampling • Small clusters are missed • Density-biased sampling • Sample more from sparse regions, less from dense regions • Sort by typicality • Based on probabilistic modeling • K-medoids
Metric Candidates - K-medoids • A medoid of a cluster is the object whose dissimilarity to others is smallest • Average medoid and max medoid • K-medoids are k objects, each from a different cluster where the object is the medoid • Why not K-means? • K-means cluster centers do not exist in database • We must present real objects to users
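The two medoid variants named on this slide can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance and assuming that the "average medoid" minimizes the summed distance to the other objects while the "max medoid" minimizes the largest distance; the function names and the toy cluster are hypothetical, not taken from the paper.

```python
from math import dist  # Euclidean distance, Python 3.8+


def average_medoid(points):
    """Return the point whose total (hence average) distance to the others is smallest."""
    return min(points, key=lambda p: sum(dist(p, q) for q in points))


def max_medoid(points):
    """Return the point whose largest distance to any other point is smallest."""
    return min(points, key=lambda p: max(dist(p, q) for q in points))


# Hypothetical toy cluster: four points close together plus one outlier.
cluster = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]

avg = average_medoid(cluster)  # favors the dense part of the cluster
mx = max_medoid(cluster)       # pulled toward the outlier
```

Both functions return an object that actually exists in the input, which is the property that makes medoids preferable to k-means centers for display.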
Plotting the Candidates Data: Yahoo! Autos, 3922 data points. Price and mileage are normalized to 0..1
User Study Procedure • Users are given: • 7 sets of data, generated using the 7 candidate methods • Each set consists of 8 representative points • Users predict 4 more data points • That are most likely in the data set • Should not pick those already given • Measure the prediction error
Verdict • K-medoids is the winner • In this paper, the authors choose average k-medoids • The proposed algorithm can be extended to max-medoids with small changes
Challenges • Representation Modeling: finding a suitable metric • What is the best set of representatives? • Representative finding • How to find them efficiently? • Query Refinement • How to efficiently adapt to user’s query operations?
Cover Tree Based Algorithm • Cover Tree was proposed by Beygelzimer, Kakade, and Langford in 2006 • Briefly discuss Cover Tree properties • See Cover Tree based algorithms for computing k-medoids
Cover Tree Properties (1) • Let $C_i$ denote the set of nodes at level $i$ • Nesting: for all $i$, $C_i \subset C_{i+1}$ • Figure: points in the data (one dimension)
Cover Tree Properties (2) • Covering: a node in $C_i$ is within distance $1/2^i$ of its children in $C_{i+1}$ • The distance from a node to any descendant is less than $1/2^{i-1}$; this value is called the “span” of the node
Cover Tree Properties (3) • Separation: nodes in $C_i$ are separated by at least $1/2^i$ • Note: $i$ is allowed to be negative to satisfy the above conditions
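The three properties can be checked mechanically on a small one-dimensional example. The sketch below assumes the convention in which level $i$ uses the distance $1/2^i$ and deeper levels have larger indices; the function name `check_cover_tree_levels` and the sample levels are hypothetical.

```python
def check_cover_tree_levels(levels):
    """Check nesting, covering and separation for 1-D points.

    levels: dict mapping a level index i to the set C_i of points at that level.
    Assumes the distance bound at level i is 1 / 2**i.
    """
    for i in sorted(levels):
        radius = 2.0 ** (-i)
        pts = sorted(levels[i])
        # Separation: neighbouring nodes in C_i are at least 1/2^i apart.
        if any(b - a < radius for a, b in zip(pts, pts[1:])):
            return False
        if i + 1 in levels:
            nxt = levels[i + 1]
            # Nesting: every node of C_i also appears in C_{i+1}.
            if not levels[i] <= nxt:
                return False
            # Covering: every node of C_{i+1} has some node of C_i within 1/2^i.
            if any(min(abs(p - q) for q in levels[i]) > radius for p in nxt):
                return False
    return True


good = {0: {0.0}, 1: {0.0, 1.0}, 2: {0.0, 0.5, 1.0, 1.5}}
bad = {0: {0.0}, 1: {0.0, 3.0}}  # 3.0 is farther than 1 from every level-0 node

ok_good = check_cover_tree_levels(good)
ok_bad = check_cover_tree_levels(bad)
```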
Additional Stats for Cover Tree (2D Example) • Density (DS): number of points in the subtree • Centroid (CT): geometric center of points in the subtree • Figure: 2D example around a point p, with subtrees labeled DS = 10 and DS = 3
k-medoid Algorithm Outline • Descend the cover tree to a level with more than k nodes • Choose an initial k points as the first set of medoids (seeds) • Bad seeds can lead to local minima with a high distance cost • Assign nodes to medoids and update repeatedly until the medoids converge
Cover Tree Based Seeding • Descend the cover tree to a level with more than k nodes (denote it as level m) • Use the parent level as the starting point for seeds • Each node has a weight, calculated as the product of span and density (the contribution of the subtree to the distance cost) • Expand nodes using a priority queue • Fetch the first k nodes from the queue as seeds
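The seeding idea can be sketched with a max-priority queue keyed on weight = span × density. This is a simplified illustration under stated assumptions, not the authors' exact procedure: it expands the heaviest node into its children until at least k candidates exist and then takes the k heaviest. The `Node` class, `pick_seeds`, and the toy tree are all hypothetical.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Node:
    name: str
    span: float
    density: int
    children: list = field(default_factory=list)

    @property
    def weight(self):
        # Assumed weight: span times density, as described on the slide.
        return self.span * self.density


def pick_seeds(start_nodes, k):
    """Expand the heaviest node until k candidates exist, then return the k heaviest."""
    tie = count()  # unique tie-breaker so Node objects are never compared
    heap = [(-n.weight, next(tie), n) for n in start_nodes]
    heapq.heapify(heap)  # negated weights turn the min-heap into a max-heap
    leaves = []          # nodes that cannot be expanded any further
    while heap and len(heap) + len(leaves) < k:
        entry = heapq.heappop(heap)
        node = entry[2]
        if not node.children:
            leaves.append(entry)
            continue
        for child in node.children:
            heapq.heappush(heap, (-child.weight, next(tie), child))
    ranked = sorted(heap + leaves)  # heaviest first, insertion order on ties
    return [node for _, _, node in ranked[:k]]


# Hypothetical starting level with three nodes; only "a" has children.
a = Node("a", 1.0, 5, [Node("a1", 0.5, 2), Node("a2", 0.5, 2), Node("a3", 0.5, 1)])
b = Node("b", 1.0, 3)
c = Node("c", 1.0, 2)

seed_names = [n.name for n in pick_seeds([a, b, c], k=4)]
```

Expanding the heaviest node first spends the seed budget where a subtree contributes most to the distance cost.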
A Simple Example: k = 4 • Spans by level, from the top: 2, 1, 1/2, 1/4 • Priority queue on node weight (density * span): S3 (5), S8 (3), S5 (2) • After expansion: S8 (3/2), S5 (1), S3 (1), S7 (1), S2 (1/2) • Final set of seeds (shown in the figure)
Update Process • Initially, assign all nodes to closest seed to form clusters • For each cluster, calculate the geometric center • Use centroid and density information to approximate subtree • Find the node that is closest to the geometric center, designate as a new medoid • Repeat from step 1 until medoids converge
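The update loop on this slide can be sketched as follows. The sketch assumes Euclidean distance and represents each node as a hypothetical `(point, centroid, density)` tuple, so that a subtree is approximated by its centroid weighted by its density; `update_medoids` and the sample data are illustrative only.

```python
from math import dist


def update_medoids(nodes, medoids, max_iter=100):
    """Iterate assignment and medoid update until the medoid set stops changing.

    nodes:   list of (point, centroid, density) tuples
    medoids: list of seed points (each must be the point of some node)
    """
    for _ in range(max_iter):
        # Step 1: assign every node to its closest current medoid.
        clusters = {m: [] for m in medoids}
        for node in nodes:
            closest = min(medoids, key=lambda m: dist(node[0], m))
            clusters[closest].append(node)

        new_medoids = []
        for m, members in clusters.items():
            if not members:
                new_medoids.append(m)
                continue
            # Step 2: geometric center, approximating each subtree by
            # its centroid weighted by its density.
            total = sum(d for _, _, d in members)
            center = tuple(
                sum(c[i] * d for _, c, d in members) / total
                for i in range(len(m))
            )
            # Step 3: the member node closest to the center becomes the medoid.
            new_medoids.append(min(members, key=lambda n: dist(n[0], center))[0])

        if set(new_medoids) == set(medoids):
            return new_medoids
        medoids = new_medoids
    return medoids


# Hypothetical data: two groups of leaf nodes (centroid = point, density = 1).
pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
nodes = [(p, p, 1) for p in pts]
result = update_medoids(nodes, medoids=[(0.0, 0.0), (10.0, 0.0)])
```

Starting from the edge points of each group, the medoids move to the middle point of each group and then stay there.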
Challenges • Representation Modeling: finding a suitable metric • What is the best set of representatives? • Representative finding • How to find them efficiently? • Query Refinement • How to efficiently adapt to user’s query operations?
Query Adaptation • Handle user actions • Zooming • Selection (filtering)
Zooming • Expand all nodes assigned to the medoid • Run k-medoid algorithm on the new set of nodes
Selection • Effect of selection on a node • Completely invalid • Fully valid • Partially valid • Estimate the validity percentage (VG) of each node • Multiply the VG with weight of each node
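The weight adjustment under selection can be illustrated with a small sketch. How the validity percentage is actually estimated is not stated on the slide; the code below assumes, purely for illustration, that it is estimated from a sample of points in the node's subtree. The names `validity` and `adjusted_weight` and the sample values are hypothetical.

```python
def validity(sample, predicate):
    """Fraction of sampled subtree points that satisfy the selection predicate."""
    if not sample:
        return 0.0
    return sum(1 for p in sample if predicate(p)) / len(sample)


def adjusted_weight(span, density, sample, predicate):
    """Scale the node weight (span * density) by the estimated validity."""
    return validity(sample, predicate) * span * density


# Hypothetical node: normalized prices sampled from its subtree.
prices = [0.1, 0.2, 0.6, 0.9]


def cheap(price):
    # Hypothetical selection condition: price <= 0.5
    return price <= 0.5


vg = validity(prices, cheap)                                 # partially valid node
w = adjusted_weight(span=0.5, density=8, sample=prices, predicate=cheap)
```

A completely invalid node gets validity 0 and drops out of seeding, while a fully valid node keeps its original weight.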
Experiments – Initial Medoid Quality • Compare with R-tree based method by M. Ester, H. Kriegel, and X. Xu • Data sets • Synthetic dataset: 2D points with Zipf distribution • Real dataset: LA data set from R-tree Portal, 130k points • Measurement • Time to compute the medoids • Average distance from a data point to its medoid
Results on Synthetic Data • Figure panels: Distance, Time • For various sizes of data, the cover-tree based method outperforms the R-tree based method
Results on Real Data For various k values, Cover-tree based method outperforms R-tree based method on real data
Query Adaptation • Compare with re-building the cover tree and running the k-medoid algorithm from scratch • Figure panels: Synthetic Data, Real Data • Time cost of re-building is orders of magnitude higher than incremental computation
Conclusion • Authors proposed MusiqLens framework for solving the many-answer problem • Authors conducted user study to select a metric for choosing representatives • Authors proposed efficient method for computing and maintaining the representatives under user actions • Part of the database usability project at Univ. of Michigan • Led by Prof. H.V. Jagadish • http://www.eecs.umich.edu/db/usable/