Presentation Transcript


  1. Efficient Network Aware Search in Collaborative Tagging. Sihem Amer-Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich. Presented by: Ashish Chawla, CSE 6339 Spring 2009

  2. Overview • Opportunity: • Explore keyword search in a context where query results are determined by the opinions of the network of taggers related to a seeker • Incorporate social behavior into the processing of search queries • Network-Aware Search: results are determined by the opinion of the seeker's network • Existing top-k algorithms are too space-intensive because scores depend on the seeker's network • Investigate clustering seekers based on the behavior of their networks • del.icio.us datasets were used for the experiments

  3. Introduction • What is Network-Aware Search? • Examples: Flickr, YouTube, del.icio.us, photo tagging on Facebook • Users • contribute content • annotate items (photos, videos, URLs, …) with tags • form social networks • friends/family, interest-based • need help discovering relevant content • What determines the relevance of an item?

  4. What is Network-Aware Search?

  5. Claims • Define network-aware search. • Adapt top-k algorithms to network-aware search by using score upper-bounds and the EXACT strategy. • Refine score upper-bounds based on the user's network and tagging behavior.

  6. Data Model • Two relations: Link(user u, user v) and Tagged(user u, item i, tag t) • Example Tagged tuples: (Roger, i1, music), (Roger, i3, music), (Roger, i5, sports), …, (Hugo, i1, music), (Hugo, i22, music), …, (Minnie, i2, sports), …, (Linda, i2, football), (Linda, i28, news), … • Link(u1,v1) is a directed edge • Network(u) = { v | Link(u,v) }, so for a seeker u1 ∈ Seekers, Network(u1) is the set of neighbors of u1 • Seekers = πu(Link), Taggers = πu(Tagged)
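A minimal sketch of this data model in Python; the relation names follow the slide, while the sample tuples and helper names are illustrative:

```python
# A sketch of the two relations and of Network(u); relation names follow the
# slide, the sample tuples and helper names are illustrative.
Link = {("u1", "Roger"), ("u1", "Hugo"), ("u2", "Minnie"), ("u2", "Linda")}
Tagged = {
    ("Roger", "i1", "music"), ("Roger", "i3", "music"), ("Roger", "i5", "sports"),
    ("Hugo", "i1", "music"), ("Hugo", "i22", "music"),
    ("Minnie", "i2", "sports"),
    ("Linda", "i2", "football"), ("Linda", "i28", "news"),
}

def network(u):
    """Network(u) = { v | Link(u, v) }."""
    return {v for (s, v) in Link if s == u}

# Seekers and Taggers are the projections of Link and Tagged onto the user column.
Seekers = {s for (s, _) in Link}
Taggers = {v for (v, _, _) in Tagged}

print(network("u1"))   # {'Roger', 'Hugo'} (set print order may vary)
```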

  7. What are Scores? • A query is a set of tags • Q = {t1, t2, …, tn}, e.g. fashion, www, sports, artificial intelligence • For a seeker u, a tag t, and an item i (score per tag): score(i,u,t) = f(|Network(u) ∩ {v | Tagged(v,i,t)}|) • Overall score of the query: score(i,u,Q) = g(score(i,u,t1), score(i,u,t2), …, score(i,u,tn)) • f and g are monotone; here f = COUNT and g = SUM
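A minimal sketch of these two formulas with f = COUNT and g = SUM; the data and the seeker name are illustrative:

```python
# Per-tag and overall query scores with f = COUNT and g = SUM, following the
# formulas on this slide. The data and the seeker name "jane" are illustrative.
Link = {("jane", "Roger"), ("jane", "Hugo")}
Tagged = {
    ("Roger", "i1", "music"), ("Hugo", "i1", "music"),
    ("Hugo", "i1", "sports"), ("Roger", "i5", "sports"),
}

def network(u):
    return {v for (s, v) in Link if s == u}

def score_tag(i, u, t):
    """score(i, u, t) = |Network(u) ∩ {v | Tagged(v, i, t)}|  (f = COUNT)."""
    taggers = {v for (v, item, tag) in Tagged if item == i and tag == t}
    return len(network(u) & taggers)

def score_query(i, u, Q):
    """score(i, u, Q) = sum of score(i, u, t) over the tags t in Q  (g = SUM)."""
    return sum(score_tag(i, u, t) for t in Q)

print(score_query("i1", "jane", ["music", "sports"]))   # 2 + 1 = 3
```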

  8. Problem Statement Given a user query Q = t1 … tn and a number k, we want to efficiently determine the top k items, i.e., the k items with the highest overall score.

  9. Standard Top-k Processing • Q = {t1, t2, …, tn} • One inverted list per tag, IL1, IL2, …, ILn, each sorted on score • score(i) = g(score(i, IL1), score(i, IL2), …, score(i, ILn)) • Intuition: high-scoring items are close to the top of most lists • Fagin-style processing: NRA (No Random Access) • access all lists sequentially in parallel • maintain a heap sorted on partial scores • stop when the partial score of the kth item > best-case score of unseen/incomplete items
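Since NRA drives everything that follows, here is a minimal runnable sketch of NRA-style top-k with SUM aggregation. The function name, list layout, and integer scores (tag counts, f = COUNT) are illustrative, and a plain sort stands in for the heap mentioned on the slide:

```python
# Minimal NRA sketch: sorted access only, per-item worst/best bookkeeping,
# SUM aggregation. All names and data are illustrative.
def nra_topk(lists, k):
    """lists: one [(item, score), ...] list per tag, each sorted by descending score."""
    seen = {}                                     # item -> {list index: partial score}
    depth = 0
    while True:
        thresholds = []
        for li, lst in enumerate(lists):
            if depth < len(lst):
                item, s = lst[depth]
                seen.setdefault(item, {})[li] = s
                thresholds.append(s)              # bounds anything deeper in list li
            else:
                thresholds.append(0)              # list exhausted
        depth += 1
        exhausted = all(depth >= len(lst) for lst in lists)

        def worst(i):                             # partial (worst-case) score
            return sum(seen[i].values())

        def best(i):                              # best-case score via list thresholds
            return worst(i) + sum(t for lj, t in enumerate(thresholds) if lj not in seen[i])

        ranked = sorted(seen, key=worst, reverse=True)
        if len(ranked) >= k:
            min_topk = worst(ranked[k - 1])
            others_best = max((best(i) for i in ranked[k:]), default=0)
            # stop when neither an unseen item nor a partially seen candidate
            # can still displace the current top-k
            if min_topk >= sum(thresholds) and min_topk >= others_best:
                return [(i, worst(i)) for i in ranked[:k]]
        if exhausted:
            return [(i, worst(i)) for i in ranked[:k]]

lists = [
    [("i1", 9), ("i2", 8), ("i3", 3)],
    [("i2", 7), ("i1", 6), ("i4", 2)],
    [("i3", 9), ("i1", 5), ("i2", 1)],
]
print(nra_topk(lists, 2))   # [('i1', 20), ('i2', 16)] (scores are the partial sums seen)
```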

  10. NRA, step 1 (figure: lists 1-3 with worst and best scores per candidate) • Min top-2 score: 0.6 • Threshold (max score of unseen tuples): 0.6 + 0.6 + 0.9 = 2.1 • Pruning: a candidate is kept only while min top-2 < its best score • Stopping condition: threshold < min top-2?

  11. NRA, step 2 (figure: lists 1-3 with worst and best scores per candidate) • Min top-2 score: 0.9 • Threshold (max score of unseen tuples): 1.8 • Pruning: a candidate is kept only while min top-2 < its best score • Stopping condition: threshold < min top-2?

  12. NRA, step 3 (figure: lists 1-3 with worst and best scores per candidate) • Min top-2 score: 1.3 • Threshold (max score of unseen tuples): 1.3 • Pruning: a candidate is kept only while min top-2 < its best score • Stopping condition: threshold < min top-2? No more new items can get into the top-2, but extra candidates are left in the queue

  13. NRA, step 4 (figure: lists 1-3 with worst and best scores per candidate) • Min top-2 score: 1.3 • Threshold (max score of unseen tuples): 1.1 • Pruning: a candidate is kept only while min top-2 < its best score • Stopping condition: threshold < min top-2? No more new items can get into the top-2, but extra candidates are left in the queue

  14. NRA, final step (figure: lists 1-3 with the candidate queue) • Min top-2 score: 1.6 • Threshold (max score of unseen tuples): 0.8 • Pruning: a candidate is kept only while min top-2 < its best score

  15. NRA • NRA performs only sorted accesses (SA), no random accesses • Random access (RA) • looks up the actual (final) score of an item • often very useful • Problems with NRA • high bookkeeping overhead • for high values of k, even the gain in access cost is not significant

  16. TA (a1, a2, a3) (figure: Lists 1, 2, 3, each sorted by score)

  17. TA, round 1: read one item from every list under sorted access, then fetch its remaining scores by random access into the other lists (figure: Lists 1-3 sorted by score, with the current candidates) • min top-2 score: 1.6 • maximum score for unseen items: 2.1
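A minimal sketch of TA with SUM aggregation over the same kind of lists; random access is simulated with dictionaries, and the function name, data, and integer scores are illustrative:

```python
# Minimal TA sketch: sorted access plus random access to complete each item's
# score, stopping once the k-th exact score reaches the threshold.
def ta_topk(lists, k):
    """lists: one [(item, score), ...] list per tag, each sorted by descending score."""
    maps = [dict(lst) for lst in lists]           # item -> score, for random access
    full = {}                                     # item -> exact aggregated score
    for depth in range(max(len(lst) for lst in lists)):
        threshold = 0
        for li, lst in enumerate(lists):
            if depth < len(lst):
                item, s = lst[depth]
                threshold += s                    # best possible score of unseen items
                if item not in full:              # random access into every other list
                    full[item] = sum(m.get(item, 0) for m in maps)
        topk = sorted(full.items(), key=lambda kv: kv[1], reverse=True)[:k]
        # stop once the k-th exact score is at least the threshold
        if len(topk) == k and topk[-1][1] >= threshold:
            return topk
    return sorted(full.items(), key=lambda kv: kv[1], reverse=True)[:k]

lists = [
    [("i1", 9), ("i2", 8), ("i3", 3)],
    [("i2", 7), ("i1", 6), ("i4", 2)],
    [("i3", 9), ("i1", 5), ("i2", 1)],
]
print(ta_topk(lists, 2))   # [('i1', 20), ('i2', 16)]
```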

  18. Computing Exact Scores: Naïve • Typical approach: maintain a single inverted list per (seeker, tag), with items ordered by exact score (figure: example lists for tag = music and tag = photos, for seekers Jane and Ann) • + can use standard top-k algorithms • -- high space overhead

  19. Computing Score Upper-Bounds • A space-saving strategy • Maintain entries of the form (item, itemTaggers), where itemTaggers is the set of all taggers who tagged the item with the tag • Here every item is stored at most once • Question: what score do we store with each entry? • We store the maximum score that the item can have across all possible seekers • This is the Global Upper-Bound strategy • Limitation: exact scores must be computed dynamically at query time
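A minimal sketch of building the Global Upper-Bound lists described here, reusing the relation names from the earlier data-model sketch; the data and helper names are illustrative:

```python
# Build one GUB inverted list per tag: each item's score is the maximum
# per-tag score over all seekers. Data and names are illustrative.
from collections import defaultdict

Link = {("jane", "Roger"), ("jane", "Hugo"), ("ann", "Hugo")}
Tagged = {("Roger", "i1", "music"), ("Hugo", "i1", "music"), ("Hugo", "i2", "music")}

def network(u):
    return {v for (s, v) in Link if s == u}

def build_gub_lists():
    taggers = defaultdict(set)                    # (tag, item) -> set of taggers
    for v, i, t in Tagged:
        taggers[(t, i)].add(v)
    seekers = {s for (s, _) in Link}
    gub = defaultdict(list)                       # tag -> [(item, upper_bound), ...]
    for (t, i), vs in taggers.items():
        ub = max(len(network(u) & vs) for u in seekers)   # max score over all seekers
        gub[t].append((i, ub))
    for t in gub:                                 # sort each list by descending bound
        gub[t].sort(key=lambda x: x[1], reverse=True)
    return gub

print(dict(build_gub_lists()))   # {'music': [('i1', 2), ('i2', 1)]}
```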

  20. Score Upper-Bounds, tag = music • Global Upper-Bound (GUB): one list per tag, over all seekers; each entry is (item, taggers, upper-bound) (figure: (i1, {Miguel, …}, 73), (i2, {Kath, …}, 65), (i3, {Sam, …}, 62), (i5, {Miguel, …}, 53), (i4, {Peter, …}, 40), (i9, {Jane, …}, 36), (i6, {Mary, …}, 18), (i7, {Miguel, …}, 16), (i8, {Kath, …}, 16)) • Q: what score to store with each entry? The maximum score the item can have across all possible seekers • + low space overhead • -- item upper-bounds, and even list order, may differ from EXACT for most users • -- exact scores must still be computed dynamically at query time • How do we do top-k processing with score upper-bounds?

  21. Top-k with Score Upper-Bounds gNRA, "generalized no random access" • access all lists sequentially in parallel • maintain a heap with partial exact scores • stop when the partial exact score of the kth item > highest possible score of unseen/incomplete items (computed using the current list upper-bounds)

  22. gNRA – NRA Generalization
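The algorithm listing on this slide is a figure that is not reproduced in the transcript. A minimal sketch of the gNRA idea from the previous slide, over upper-bound lists of (item, taggers, bound) entries, with the seeker's exact per-tag scores computed from the taggers as entries are read; all names and data are illustrative:

```python
# Minimal gNRA sketch: NRA over upper-bound lists; exact per-tag scores for the
# seeker come from intersecting the entry's taggers with Network(u).
def gnra_topk(lists, network_u, k):
    """lists: per tag, [(item, taggers, upper_bound), ...] sorted by descending bound."""
    exact = {}                                    # item -> {list index: exact per-tag score}
    depth = 0
    while True:
        bounds = []
        for li, lst in enumerate(lists):
            if depth < len(lst):
                item, taggers, ub = lst[depth]
                exact.setdefault(item, {})[li] = len(network_u & taggers)
                bounds.append(ub)                 # bounds anything deeper in list li
            else:
                bounds.append(0)
        depth += 1
        exhausted = all(depth >= len(lst) for lst in lists)

        def worst(i):
            return sum(exact[i].values())         # partial exact score

        def best(i):
            return worst(i) + sum(b for lj, b in enumerate(bounds) if lj not in exact[i])

        ranked = sorted(exact, key=worst, reverse=True)
        if len(ranked) >= k:
            min_topk = worst(ranked[k - 1])
            others = max((best(i) for i in ranked[k:]), default=0)
            # stop when neither unseen items nor other candidates can beat the top k
            if min_topk >= sum(bounds) and min_topk >= others:
                return [(i, worst(i)) for i in ranked[:k]]
        if exhausted:
            return [(i, worst(i)) for i in ranked[:k]]

lists = [
    [("i1", {"Roger", "Hugo"}, 2), ("i2", {"Hugo"}, 1)],     # upper-bound list for tag t1
    [("i1", {"Roger"}, 1), ("i2", {"Minnie"}, 1)],           # upper-bound list for tag t2
]
print(gnra_topk(lists, network_u={"Roger", "Hugo"}, k=1))    # [('i1', 3)]
```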

  23. gTA – TA Generalization
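Likewise, the gTA listing is a figure not in the transcript. A minimal sketch of a TA-style scan over the same upper-bound lists, where the seeker's exact scores are obtained by random access to each item's taggers in every list; this is an illustrative reconstruction, not the paper's pseudo-code:

```python
# Minimal gTA-style sketch: sorted access follows the upper-bound order, exact
# seeker scores are computed by random access to the item's taggers per list.
def gta_topk(lists, network_u, k):
    """lists: per tag, [(item, taggers, upper_bound), ...] sorted by descending bound."""
    maps = [{item: taggers for item, taggers, _ in lst} for lst in lists]
    exact = {}                                    # item -> exact aggregated score
    for depth in range(max(len(lst) for lst in lists)):
        threshold = 0
        for li, lst in enumerate(lists):
            if depth < len(lst):
                item, _, ub = lst[depth]
                threshold += ub                   # bound on any unseen item's score
                if item not in exact:             # random access: intersect per list
                    exact[item] = sum(len(network_u & m.get(item, set())) for m in maps)
        topk = sorted(exact.items(), key=lambda kv: kv[1], reverse=True)[:k]
        if len(topk) == k and topk[-1][1] >= threshold:
            return topk
    return sorted(exact.items(), key=lambda kv: kv[1], reverse=True)[:k]

lists = [
    [("i1", {"Roger", "Hugo"}, 2), ("i2", {"Hugo"}, 1)],
    [("i1", {"Roger"}, 1), ("i2", {"Minnie"}, 1)],
]
print(gta_topk(lists, network_u={"Roger", "Hugo"}, k=1))   # [('i1', 3)]
```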

  24. Performance of Global Upper-Bound (GUB) and EXACT (figure: space and time baselines) • Space overhead: total number of entries in all inverted lists • Query processing time: number of cursor moves

  25. Clustering and Query Processing • We want to reduce the distance between the score upper-bound and the exact score • The greater the distance, the more processing may be required • Core idea: cluster users into groups and compute an upper-bound per group • Intuition: group users whose behavior is similar

  26. Clustering Seekers • Cluster the seekers based on similarity in their scores (because the score of an item depends on the network) • Form an inverted list ILt,C for every tag t and cluster C, the score of an item being the maximum score over all seekers in the cluster • Query processing for Q = t1 … tn and seeker u: • first find the cluster C(u) • then perform the aggregation over the lists ILt1,C(u), …, ILtn,C(u) • Global Upper-Bound (GUB) is the special case where all seekers fall into the same cluster

  27. Clustering Seekers (figure: the Global Upper-Bound list vs. per-cluster lists for C1 = seekers Bob & Alice and C2 = seekers Sam & Miguel, over items such as puma, gucci, adidas, diesel, versace, nike, chanel, prada, each with taggers and an upper-bound) • assign each seeker to a cluster • compute an inverted list per cluster • ub(i,t,C) = maxu∈C |Network(u) ∩ {v | Tagged(v,i,t)}| • + tighter bounds; item order is usually closer to the EXACT order than in Global Upper-Bound • -- space overhead is still high (a trade-off)
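A minimal sketch of the per-cluster bound ub(i, t, C) defined on this slide; relation names follow the earlier sketches, and the data, seeker names, and clusters are illustrative:

```python
# ub(i, t, C): the maximum per-tag score of item i over the seekers in cluster C.
Link = {("bob", "Roger"), ("alice", "Hugo"), ("sam", "Roger"), ("sam", "Hugo"),
        ("miguel", "Roger")}
Tagged = {("Roger", "i1", "gucci"), ("Hugo", "i1", "gucci")}

def network(u):
    return {v for (s, v) in Link if s == u}

def ub_cluster(i, t, C):
    """ub(i, t, C) = max over u in C of |Network(u) ∩ {v | Tagged(v, i, t)}|."""
    taggers = {v for (v, item, tag) in Tagged if item == i and tag == t}
    return max((len(network(u) & taggers) for u in C), default=0)

print(ub_cluster("i1", "gucci", {"bob", "alice"}))    # C1: max(1, 1) = 1
print(ub_cluster("i1", "gucci", {"sam", "miguel"}))   # C2: max(2, 1) = 2
```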

  28. How do we cluster seekers? • Finding a clustering that minimizes the worst-case or average computation time of the top-k algorithms is NP-hard • Proofs are by reduction from the independent task scheduling problem and the minimum sum of squares problem • The authors present heuristics that use a form of Normalized Discounted Cumulative Gain (NDCG), a measure of the quality of a clustered list for a given seeker and keyword • The metric compares the ideal (exact-score) order of the inverted lists with the actual (score upper-bound) order

  29. NDCG - Example
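The worked example on this slide is a figure not reproduced in the transcript. As a stand-in, here is a minimal sketch of how an NDCG-style metric compares the order induced by score upper-bounds with the ideal exact-score order; the paper may use a different NDCG variant, and all values and names are illustrative:

```python
# NDCG sketch: ratio of the DCG of the upper-bound ordering to the DCG of the
# ideal (exact-score) ordering, for one seeker and one tag.
import math

def dcg(gains):
    """Discounted cumulative gain of a ranked list of gains (relevance scores)."""
    return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains))

def ndcg(upper_bound_order, exact_scores):
    """Compare the order induced by score upper-bounds with the ideal exact order."""
    actual = dcg([exact_scores[i] for i in upper_bound_order])
    ideal = dcg(sorted(exact_scores.values(), reverse=True))
    return actual / ideal if ideal > 0 else 1.0

exact = {"i1": 3, "i2": 2, "i3": 1}              # exact scores for one seeker and tag
print(ndcg(["i1", "i2", "i3"], exact))            # 1.0: upper-bound order matches ideal
print(ndcg(["i3", "i2", "i1"], exact))            # < 1.0: order diverges from ideal
```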

  30. Clustering Taggers • For each tag t we partition the taggers into disjoint clusters • We form one inverted list per cluster; an item i in the list for cluster C gets the score • maxu∈Seekers |Network(u) ∩ C ∩ {v1 | Tagged(v1,i,t)}| • How to cluster taggers? Build a graph whose nodes are the taggers, with an edge between nodes v1 and v2 iff • |Items(v1,t) ∩ Items(v2,t)| ≥ threshold
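A minimal sketch of the tagger-overlap graph for a single tag; the names, data, and threshold value are illustrative, and the actual clustering (e.g. graph partitioning) would run on top of this graph:

```python
# Build the overlap graph for one tag t: nodes are taggers, and an edge connects
# v1 and v2 when they tagged at least `threshold` common items with t.
from collections import defaultdict
from itertools import combinations

Tagged = {
    ("Roger", "i1", "music"), ("Roger", "i2", "music"),
    ("Hugo", "i1", "music"), ("Hugo", "i2", "music"),
    ("Minnie", "i9", "music"),
}

def tagger_graph(t, threshold=2):
    items = defaultdict(set)                       # tagger -> items tagged with t
    for v, i, tag in Tagged:
        if tag == t:
            items[v].add(i)
    edges = {(v1, v2) for v1, v2 in combinations(sorted(items), 2)
             if len(items[v1] & items[v2]) >= threshold}
    return set(items), edges

print(tagger_graph("music"))
# ({'Hugo', 'Minnie', 'Roger'}, {('Hugo', 'Roger')})  (set print order may vary)
```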

  31. Clustering Seekers: Metrics • Space • Global Upper-Bound has the lowest overhead • ASC and NCT achieve an order-of-magnitude improvement in space overhead over EXACT • Time • Both gNRA and gTA outperform Global Upper-Bound • ASC outperforms NCT on both sequential and total accesses in all cases for gTA, and in all cases except one for gNRA • Inverted lists are shorter • Score upper-bound order is similar to the exact score order for many users • Average % improvement over Global Upper-Bound • Normalized Cut: 38-72% • Ratio Association: 67-87%

  32. Clustering Seekers • Cluster-Seekers improves query execution time over GUB by at least an order of magnitude, for all queries and all users

  33. Clustering Taggers • Space • Overhead is significantly lower than that of EXACT and of Cluster-Seekers • Time • Best case: all taggers relevant to a seeker reside in a single cluster • Worst case: all taggers reside in separate clusters • Idea: cluster taggers based on overlap in tagging • assign each tagger to a cluster • compute cluster upper-bounds: • ub(i,t,C) = maxu∈Seekers |Network(u) ∩ {v ∈ C | Tagged(v,i,t)}|

  34. Clustering Taggers

  35. Conclusion and Next Steps • Cluster-Taggers worked best for seekers whose network fell into at most 3 × #tags clusters • For others, query execution time degraded due to the number of inverted lists that had to be processed • For these seekers • Cluster-Taggers outperformed Cluster-Seekers in all cases • Cluster-Taggers outperforms Global Upper-Bound by 94-97%, in all cases • Extended traditional top-k algorithms • Achieved a balance between time and space consumption

  36. Thank You! Questions? WebCT / achawla@ieee.org
