Efficient Network Aware Search in Collaborative Tagging
Sihem Amer-Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich
Presented by: Ashish Chawla, CSE 6339, Spring 2009
Overview
• Opportunity: explore keyword search in a setting where query results are determined by the opinions of the network of taggers related to a seeker
• Incorporate social behavior into query processing
• Network-Aware Search: results are determined by the opinion of the seeker's network
• Existing top-k algorithms are too space-intensive because scores depend on the seeker's network
• Investigate clustering seekers based on the behavior of their networks
• Del.icio.us datasets were used for the experiments
Introduction
• What is Network-Aware Search?
• Examples: Flickr, YouTube, del.icio.us, photo tagging on Facebook
• Users
  • contribute content
  • annotate items (photos, videos, URLs, …) with tags
  • form social networks (friends/family, interest-based)
  • need help discovering relevant content
• What determines the relevance of an item?
Claims
• Define network-aware search
• Adapt top-k algorithms to network-aware search, using score upper-bounds and the EXACT strategy
• Refine score upper-bounds based on the user's network and tagging behavior
Data Model
• Link(u, v): a directed edge from user u to user v
• Tagged(v, i, t): user v tagged item i with tag t
  e.g., (Roger, i1, music), (Roger, i3, music), (Roger, i5, sports), …, (Hugo, i1, music), (Hugo, i22, music), …, (Minnie, i2, sports), …, (Linda, i2, football), (Linda, i28, news), …
• Network(u) = { v | Link(u, v) }: for a seeker u ∈ Seekers, Network(u) is the set of u's neighbors
• Seekers = πuser(Link), Taggers = πuser(Tagged)
What are Scores?
• A query is a set of tags: Q = {t1, t2, …, tn}, e.g., fashion, www, sports, artificial intelligence
• For a seeker u, a tag t, and an item i (score per tag):
  score(i, u, t) = f(|Network(u) ∩ {v | Tagged(v, i, t)}|)
• Overall score of the query:
  score(i, u, Q) = g(score(i, u, t1), score(i, u, t2), …, score(i, u, tn))
• f and g are monotone; here f = COUNT and g = SUM
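The scoring model above can be sketched in a few lines of Python; the Link/Tagged data below is a toy example echoing the earlier slide (the values are illustrative, not from the paper):

```python
# Toy data modeled on the earlier Data Model slide (illustrative values only).
Link = {"Roger": {"Hugo", "Minnie"}}            # Link(u, v) as adjacency sets
Tagged = {("Hugo", "i1", "music"),              # Tagged(v, i, t) triples
          ("Minnie", "i1", "music"),
          ("Minnie", "i2", "sports"),
          ("Hugo", "i22", "music")}

def network(u):
    return Link.get(u, set())

def score_tag(i, u, t):
    # f = COUNT: how many users in u's network tagged item i with tag t
    return len({v for v in network(u) if (v, i, t) in Tagged})

def score(i, u, query):
    # g = SUM over the query's tags
    return sum(score_tag(i, u, t) for t in query)

print(score("i1", "Roger", {"music"}))  # → 2  (both Hugo and Minnie tagged i1)
```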
Problem Statement
Given a user query Q = t1 … tn and a number k, efficiently determine the top k items, i.e., the k items with the highest overall score.
Standard Top-k Processing
• Q = {t1, t2, …, tn}
• One inverted list per tag: IL1, IL2, …, ILn, each sorted on score
• score(i) = g(score(i, IL1), score(i, IL2), …, score(i, ILn))
• Intuition: high-scoring items are close to the top of most lists
• Fagin-style processing: NRA (No Random Access)
  • access all lists sequentially in parallel
  • maintain a heap sorted on partial scores
  • stop when the partial score of the kth item exceeds the best-case score of any unseen/incomplete item
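The NRA loop described above can be sketched as follows; this is a simplified illustration of the sorted-access-only idea (missing items are assumed to score 0), not the paper's or Fagin et al.'s exact bookkeeping:

```python
def nra(lists, k):
    """Sketch of NRA over score-sorted lists of (item, score) pairs."""
    n = len(lists)
    worst = {}                    # partial (worst-case) scores of seen items
    seen_in = {}                  # lists in which each item has been seen
    last = [float("inf")] * n     # last score read from each list
    depth = 0
    while True:
        for j, lst in enumerate(lists):
            if depth < len(lst):
                item, s = lst[depth]
                last[j] = s
                worst[item] = worst.get(item, 0.0) + s
                seen_in.setdefault(item, set()).add(j)
            else:
                last[j] = 0.0     # list exhausted: unseen entries contribute 0
        depth += 1

        def best(item):           # best-case score of a partially seen item
            return worst[item] + sum(last[j] for j in range(n)
                                     if j not in seen_in[item])

        ranked = sorted(worst, key=worst.get, reverse=True)
        if len(ranked) >= k:
            min_topk = worst[ranked[k - 1]]
            # stop when neither unseen items (bounded by sum(last)) nor
            # partially seen candidates can still beat the current top-k
            if min_topk >= sum(last) and \
               all(min_topk >= best(i) for i in ranked[k:]):
                return ranked[:k]

lists = [[("a", 0.9), ("b", 0.6), ("c", 0.3)],
         [("b", 0.8), ("a", 0.7), ("d", 0.1)]]
print(nra(lists, 2))  # → ['a', 'b']
```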
NRA Example (figure: three inverted lists sorted by score; each candidate keeps a worst-case and a best-case score)
• Round 1: min top-2 score = 0.6; threshold (max score of unseen items) = 0.6 + 0.6 + 0.9 = 2.1 → continue
• Round 2: min top-2 score = 0.9; threshold = 1.8 → continue
• Round 3: min top-2 score = 1.3; threshold = 1.3 → no new item can enter the top-2, but extra candidates remain in the queue
• Round 4: min top-2 score = 1.3; threshold = 1.1 → extra candidates still remain in the queue
• Round 5: min top-2 score = 1.6; threshold = 0.8 → stop
• Pruning: drop a candidate once its best-case score falls below the min top-2 score
• Stopping condition: threshold < min top-2 score and no remaining candidate can beat the top-2
NRA
• NRA performs only sorted accesses (SA), no random accesses
• Random access (RA): look up the actual (final) score of an item; often very useful
• Problems with NRA
  • high bookkeeping overhead
  • for high values of k, the gain in access cost is not significant
TA (Threshold Algorithm)
(figure: three inverted lists (a1, a2, a3) sorted by score; TA reads one item from each list per round, then uses random accesses to fetch that item's scores in the other lists)
• Round 1: read one item from every list; min top-2 score = 1.6; maximum possible score for unseen items = 2.1
Computing Exact Scores: Naïve (EXACT)
(figure: separate inverted lists per (seeker, tag) pair, e.g., (Jane, music), (Ann, music), (Jane, photos), (Ann, photos), each ordering items i1, i2, … by that seeker's exact score)
• Typical approach: maintain a single inverted list per (seeker, tag), items ordered by score
• + can use standard top-k algorithms
• − high space overhead
Computing Score Upper-Bounds
• A space-saving strategy
• Maintain entries of the form (item, itemTaggers), where itemTaggers are all taggers who tagged the item with the tag
• Every item is stored at most once
• Question: what score do we store with each entry?
  • the maximum score the item can have across all possible seekers
  • this is the Global Upper-Bound strategy
• Limitation: exact scores must be computed dynamically at query time
Score Upper-Bounds: Global Upper-Bound (GUB)
(figure: one inverted list for tag = music with entries (item, taggers, upper-bound), e.g., (i1, {Miguel, …}, 73), (i2, {Kath, …}, 65), (i3, {Sam, …}, 62), (i5, {Miguel, …}, 53), (i4, {Peter, …}, 40), (i9, {Jane, …}, 36), (i6, {Mary, …}, 18), (i7, {Miguel, …}, 16), (i8, {Kath, …}, 16))
• One list per tag; each entry stores the maximum score the item can have across all possible seekers
• + low space overhead
• − item upper-bounds, and even the list order, may differ from EXACT for most users
• − exact scores must be computed dynamically at query time
• How do we do top-k processing with score upper-bounds?
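A minimal sketch of the Global Upper-Bound computation, assuming toy Link/Tagged data (the names echo the slides but the values are illustrative):

```python
# Hypothetical toy data; names echo the slides, values are illustrative.
Link = {"Jane": {"Miguel", "Sam"}, "Ann": {"Sam", "Kath"}}
Tagged = {("Miguel", "i1", "music"), ("Sam", "i1", "music"),
          ("Kath", "i2", "music")}

def global_upper_bound(i, t, seekers):
    # all taggers who tagged item i with tag t
    taggers = {v for (v, item, tag) in Tagged if item == i and tag == t}
    # maximum over all seekers u of |Network(u) ∩ taggers|
    return max(len(Link.get(u, set()) & taggers) for u in seekers)

print(global_upper_bound("i1", "music", ["Jane", "Ann"]))  # → 2 (via Jane)
```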
Top-k with Score Upper-Bounds
• gNRA: "generalized no random access"
  • access all lists sequentially in parallel
  • maintain a heap with partial exact scores
  • stop when the partial exact score of the kth item exceeds the highest possible score of unseen/incomplete items (computed using the current list upper-bounds)
Performance of Global Upper-Bound (GUB) and EXACT
• Space overhead: total number of entries in all inverted lists
• Query processing time: number of cursor moves
Clustering and Query Processing
• Goal: reduce the distance between the score upper-bound and the exact score
  • the greater the distance, the more processing may be required
• Core idea: cluster users into groups and compute an upper-bound per group
• Intuition: group users whose behavior is similar
Clustering Seekers
• Cluster seekers based on similarity in their scores (the score of an item depends on the seeker's network)
• Form an inverted list ILt,C for every tag t and cluster C; the score of an item is the maximum score over all seekers in the cluster
• Query processing for Q = t1 … tn and seeker u:
  • first find the cluster C(u)
  • then perform top-k aggregation over the lists ILt1,C(u), …, ILtn,C(u)
• Global Upper-Bound (GUB) is the special case where all seekers fall into a single cluster
Clustering Seekers: Example
(figure: the Global Upper-Bound list for a tag, e.g., (puma, 73), (gucci, 65), (adidas, 62), (diesel, 53), (versace, 40), (nike, 36), (chanel, 18), (prada, 16), next to two per-cluster lists with tighter bounds; C1 = seekers Bob & Alice, C2 = seekers Sam & Miguel)
• assign each seeker to a cluster
• compute an inverted list per cluster
• ub(i, t, C) = max u∈C |Network(u) ∩ {v | Tagged(v, i, t)}|
• + tighter bounds; item order is usually closer to the EXACT order than in Global Upper-Bound
• − space overhead is still high (trade-off)
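The per-cluster upper bound ub(i, t, C) and the corresponding per-cluster inverted list can be sketched as follows, with hypothetical toy data:

```python
# Hypothetical toy data; cluster membership and scores are illustrative.
Link = {"Bob": {"Peter"}, "Alice": {"Peter", "Mary"},
        "Sam": {"Miguel", "Kath"}, "Miguel": {"Sam"}}
Tagged = {("Peter", "i1", "fashion"), ("Mary", "i1", "fashion"),
          ("Miguel", "i2", "fashion"), ("Kath", "i2", "fashion")}

def cluster_upper_bound(i, t, cluster):
    # ub(i, t, C) = max over u in C of |Network(u) ∩ taggers(i, t)|
    taggers = {v for (v, item, tag) in Tagged if item == i and tag == t}
    return max((len(Link.get(u, set()) & taggers) for u in cluster), default=0)

def cluster_inverted_list(t, cluster, items):
    # one entry per item, ordered by the per-cluster upper bound
    entries = [(i, cluster_upper_bound(i, t, cluster)) for i in items]
    return sorted(entries, key=lambda e: e[1], reverse=True)

print(cluster_inverted_list("fashion", {"Bob", "Alice"}, ["i1", "i2"]))
# → [('i1', 2), ('i2', 0)]
```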
How do we cluster seekers?
• Finding a clustering that minimizes the worst-case or average computation time of the top-k algorithms is NP-hard
  • proofs by reduction from the independent task scheduling problem and the minimum sum of squares problem
• The authors present heuristics based on a form of Normalized Discounted Cumulative Gain (NDCG), a measure of the quality of a clustered list for a given seeker and keyword
  • the metric compares the ideal (exact score) order of inverted lists with the actual (score upper-bound) order
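A generic NDCG sketch is shown below, assuming the standard log2 rank discount; the authors use a variant of this measure, so their exact formula may differ:

```python
import math

def dcg(gains):
    # discounted cumulative gain with the standard log2 rank discount
    return sum(g / math.log2(r + 2) for r, g in enumerate(gains))

def ndcg(list_order, exact_scores):
    # compare the order induced by score upper-bounds against the ideal
    # (exact-score) order for one seeker and one tag
    gains = [exact_scores.get(i, 0.0) for i in list_order]
    ideal = sorted(exact_scores.values(), reverse=True)[:len(list_order)]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

scores = {"i1": 3.0, "i2": 2.0, "i3": 1.0}
print(ndcg(["i1", "i2", "i3"], scores))  # → 1.0 (perfect agreement)
```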
Clustering Taggers
• For each tag t, partition the taggers into separate clusters
• Form an inverted list per cluster; an item i in the list for cluster C gets the score
  max u∈Seekers |Network(u) ∩ C ∩ {v | Tagged(v, i, t)}|
• How to cluster taggers?
  • build a graph whose nodes are taggers, with an edge between v1 and v2 iff |Items(v1, t) ∩ Items(v2, t)| ≥ threshold
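Building the tagger-overlap graph can be sketched as follows, with a hypothetical threshold of 1 and toy data:

```python
from itertools import combinations

# Hypothetical toy data: which items each tagger applied the tag to.
Tagged = {("v1", "i1", "music"), ("v1", "i2", "music"),
          ("v2", "i2", "music"), ("v2", "i3", "music"),
          ("v3", "i9", "music")}

def items_of(v, t):
    # Items(v, t): items that tagger v tagged with t
    return {i for (u, i, tag) in Tagged if u == v and tag == t}

def tagger_graph(t, taggers, threshold=1):
    # edge (v1, v2) iff the taggers' item sets for t overlap enough
    edges = set()
    for v1, v2 in combinations(sorted(taggers), 2):
        if len(items_of(v1, t) & items_of(v2, t)) >= threshold:
            edges.add((v1, v2))
    return edges

print(tagger_graph("music", {"v1", "v2", "v3"}))  # → {('v1', 'v2')}
```

Clusters then correspond to components (or partitions) of this graph.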
Clustering Seekers: Metrics
• Space
  • Global Upper-Bound has the lowest overhead
  • ASC and NCT achieve an order of magnitude improvement in space overhead over EXACT
• Time
  • both gNRA and gTA outperform Global Upper-Bound
  • ASC outperforms NCT on both sequential and total accesses in all cases for gTA, and in all cases but one for gNRA
  • inverted lists are shorter
  • the score upper-bound order is similar to the exact score order for many users
• Average % improvement over Global Upper-Bound
  • Normalized Cut: 38–72%
  • Ratio Association: 67–87%
Clustering Seekers Cluster-Seekers improves query execution time over GUB by at least an order of magnitude, for all queries and all users
Clustering Taggers
• Idea: cluster taggers based on overlap in their tagging behavior
  • assign each tagger to a cluster
  • compute cluster upper-bounds: ub(i, t, C) = max u∈Seekers |Network(u) ∩ {v ∈ C | Tagged(v, i, t)}|
• Space: overhead is significantly lower than that of EXACT and of Cluster-Seekers
• Time
  • best case: all taggers relevant to a seeker reside in a single cluster
  • worst case: all taggers reside in separate clusters
Conclusion and Next Steps
• Cluster-Taggers worked best for seekers whose network fell into at most 3 × #tags clusters
  • for these seekers, Cluster-Taggers outperformed Cluster-Seekers in all cases, and outperformed Global Upper-Bound by 94–97%
  • for other seekers, query execution time degraded due to the number of inverted lists that had to be processed
• Extended traditional top-k algorithms to network-aware search
• Achieved a balance between time and space consumption
Thank You! Questions? WebCT / achawla@ieee.org