Efficient Network Aware Search in Collaborative Tagging
Sihem Amer-Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich
Presented by: Ashish Chawla, CSE 6339, Spring 2009
Overview
• Opportunity: explore keyword search in a setting where query results are determined by the opinions of the network of taggers related to a seeker
• Incorporate social behavior into query processing
• Network-Aware Search: results are determined by the opinion of the seeker's network
• Existing top-k algorithms are too space-intensive because scores depend on the seeker's network
• Investigate clustering seekers based on the behavior of their networks
• Del.icio.us datasets were used for the experiments
Introduction
• What is Network-Aware Search?
• Examples: Flickr, YouTube, del.icio.us, photo tagging on Facebook
• Users
  • contribute content
  • annotate items (photos, videos, URLs, …) with tags
  • form social networks (friends/family, interest-based)
  • need help discovering relevant content
• What determines the relevance of an item?
Claims
• Define network-aware search
• Adapt top-k algorithms to network-aware search, using score upper-bounds and the EXACT strategy
• Refine score upper-bounds based on the user's network and tagging behavior
Data Model
• Link(u, v): a directed edge from user u to user v
• Tagged(v, i, t): user v tagged item i with tag t
  e.g., (Roger, i1, music), (Roger, i3, music), (Roger, i5, sports), …, (Hugo, i1, music), (Hugo, i22, music), …, (Minnie, i2, sports), …, (Linda, i2, football), (Linda, i28, news), …
• Network(u) = { v | Link(u, v) }: for a seeker u ∈ Seekers, Network(u) is the set of u's neighbors
• Seekers = πuser(Link), Taggers = πuser(Tagged)
What are Scores?
• A query is a set of tags: Q = {t1, t2, …, tn}, e.g., fashion, www, sports, artificial intelligence
• For a seeker u, a tag t, and an item i (score per tag):
  score(i, u, t) = f(|Network(u) ∩ {v | Tagged(v, i, t)}|)
• Overall score of the query:
  score(i, u, Q) = g(score(i, u, t1), score(i, u, t2), …, score(i, u, tn))
• f and g are monotone; here f = COUNT and g = SUM
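The scoring model above can be sketched in a few lines of Python; the Link/Tagged data below is a toy example echoing the earlier slide (the values are illustrative, not from the paper):

```python
# Toy data modeled on the earlier Data Model slide (illustrative values only).
Link = {"Roger": {"Hugo", "Minnie"}}            # Link(u, v) as adjacency sets
Tagged = {("Hugo", "i1", "music"),              # Tagged(v, i, t) triples
          ("Minnie", "i1", "music"),
          ("Minnie", "i2", "sports"),
          ("Hugo", "i22", "music")}

def network(u):
    return Link.get(u, set())

def score_tag(i, u, t):
    # f = COUNT: how many users in u's network tagged item i with tag t
    return len({v for v in network(u) if (v, i, t) in Tagged})

def score(i, u, query):
    # g = SUM over the query's tags
    return sum(score_tag(i, u, t) for t in query)

print(score("i1", "Roger", {"music"}))  # → 2  (both Hugo and Minnie tagged i1)
```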
Problem Statement
Given a user query Q = t1 … tn and a number k, efficiently determine the top k items, i.e., the k items with the highest overall score.
Standard Top-k Processing
• Q = {t1, t2, …, tn}
• One inverted list per tag: IL1, IL2, …, ILn, each sorted on score
• score(i) = g(score(i, IL1), score(i, IL2), …, score(i, ILn))
• Intuition: high-scoring items are close to the top of most lists
• Fagin-style processing: NRA (No Random Access)
  • access all lists sequentially in parallel
  • maintain a heap sorted on partial scores
  • stop when the partial score of the kth item exceeds the best-case score of any unseen/incomplete item
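The NRA loop described above can be sketched as follows; this is a simplified illustration of the sorted-access-only idea (missing items are assumed to score 0), not the paper's or Fagin et al.'s exact bookkeeping:

```python
def nra(lists, k):
    """Sketch of NRA over score-sorted lists of (item, score) pairs."""
    n = len(lists)
    worst = {}                    # partial (worst-case) scores of seen items
    seen_in = {}                  # lists in which each item has been seen
    last = [float("inf")] * n     # last score read from each list
    depth = 0
    while True:
        for j, lst in enumerate(lists):
            if depth < len(lst):
                item, s = lst[depth]
                last[j] = s
                worst[item] = worst.get(item, 0.0) + s
                seen_in.setdefault(item, set()).add(j)
            else:
                last[j] = 0.0     # list exhausted: unseen entries contribute 0
        depth += 1

        def best(item):           # best-case score of a partially seen item
            return worst[item] + sum(last[j] for j in range(n)
                                     if j not in seen_in[item])

        ranked = sorted(worst, key=worst.get, reverse=True)
        if len(ranked) >= k:
            min_topk = worst[ranked[k - 1]]
            # stop when neither unseen items (bounded by sum(last)) nor
            # partially seen candidates can still beat the current top-k
            if min_topk >= sum(last) and \
               all(min_topk >= best(i) for i in ranked[k:]):
                return ranked[:k]

lists = [[("a", 0.9), ("b", 0.6), ("c", 0.3)],
         [("b", 0.8), ("a", 0.7), ("d", 0.1)]]
print(nra(lists, 2))  # → ['a', 'b']
```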
NRA Example (figure: three inverted lists sorted by score; each candidate keeps a worst-case and a best-case score)
• Round 1: min top-2 score = 0.6; threshold (max score of unseen items) = 0.6 + 0.6 + 0.9 = 2.1 → continue
• Round 2: min top-2 score = 0.9; threshold = 1.8 → continue
• Round 3: min top-2 score = 1.3; threshold = 1.3 → no new item can enter the top-2, but extra candidates remain in the queue
• Round 4: min top-2 score = 1.3; threshold = 1.1 → extra candidates still remain in the queue
• Round 5: min top-2 score = 1.6; threshold = 0.8 → stop
• Pruning: drop a candidate once its best-case score falls below the min top-2 score
• Stopping condition: threshold < min top-2 score and no remaining candidate can beat the top-2
NRA
• NRA performs only sorted accesses (SA), no random accesses
• Random access (RA): look up the actual (final) score of an item; often very useful
• Problems with NRA
  • high bookkeeping overhead
  • for high values of k, the gain in access cost is not significant
TA (Threshold Algorithm)
(figure: three inverted lists (a1, a2, a3) sorted by score; TA reads one item from each list per round, then uses random accesses to fetch that item's scores in the other lists)
• Round 1: read one item from every list; min top-2 score = 1.6; maximum possible score for unseen items = 2.1
Computing Exact Scores: Naïve (EXACT)
(figure: separate inverted lists per (seeker, tag) pair, e.g., (Jane, music), (Ann, music), (Jane, photos), (Ann, photos), each ordering items i1, i2, … by that seeker's exact score)
• Typical approach: maintain a single inverted list per (seeker, tag), items ordered by score
• + can use standard top-k algorithms
• − high space overhead
Computing Score Upper-Bounds
• A space-saving strategy
• Maintain entries of the form (item, itemTaggers), where itemTaggers are all taggers who tagged the item with the tag
• Every item is stored at most once
• Question: what score do we store with each entry?
  • the maximum score the item can have across all possible seekers
  • this is the Global Upper-Bound strategy
• Limitation: exact scores must be computed dynamically at query time
Score Upper-Bounds: Global Upper-Bound (GUB)
(figure: one inverted list for tag = music with entries (item, taggers, upper-bound), e.g., (i1, {Miguel, …}, 73), (i2, {Kath, …}, 65), (i3, {Sam, …}, 62), (i5, {Miguel, …}, 53), (i4, {Peter, …}, 40), (i9, {Jane, …}, 36), (i6, {Mary, …}, 18), (i7, {Miguel, …}, 16), (i8, {Kath, …}, 16))
• One list per tag; each entry stores the maximum score the item can have across all possible seekers
• + low space overhead
• − item upper-bounds, and even the list order, may differ from EXACT for most users
• − exact scores must be computed dynamically at query time
• How do we do top-k processing with score upper-bounds?
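A minimal sketch of the Global Upper-Bound computation, assuming toy Link/Tagged data (the names echo the slides but the values are illustrative):

```python
# Hypothetical toy data; names echo the slides, values are illustrative.
Link = {"Jane": {"Miguel", "Sam"}, "Ann": {"Sam", "Kath"}}
Tagged = {("Miguel", "i1", "music"), ("Sam", "i1", "music"),
          ("Kath", "i2", "music")}

def global_upper_bound(i, t, seekers):
    # all taggers who tagged item i with tag t
    taggers = {v for (v, item, tag) in Tagged if item == i and tag == t}
    # maximum over all seekers u of |Network(u) ∩ taggers|
    return max(len(Link.get(u, set()) & taggers) for u in seekers)

print(global_upper_bound("i1", "music", ["Jane", "Ann"]))  # → 2 (via Jane)
```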
Top-k with Score Upper-Bounds
• gNRA: "generalized no random access"
  • access all lists sequentially in parallel
  • maintain a heap with partial exact scores
  • stop when the partial exact score of the kth item exceeds the highest possible score of unseen/incomplete items (computed using the current list upper-bounds)
Performance of Global Upper-Bound (GUB) and EXACT
• Space overhead: total number of entries in all inverted lists
• Query processing time: number of cursor moves
Clustering and Query Processing
• Goal: reduce the distance between the score upper-bound and the exact score
  • the greater the distance, the more processing may be required
• Core idea: cluster users into groups and compute an upper-bound per group
• Intuition: group users whose behavior is similar
Clustering Seekers
• Cluster seekers based on similarity in their scores (the score of an item depends on the seeker's network)
• Form an inverted list ILt,C for every tag t and cluster C; the score of an item is the maximum score over all seekers in the cluster
• Query processing for Q = t1 … tn and seeker u:
  • first find the cluster C(u)
  • then perform top-k aggregation over the lists ILt1,C(u), …, ILtn,C(u)
• Global Upper-Bound (GUB) is the special case where all seekers fall into a single cluster
Clustering Seekers: Example
(figure: the Global Upper-Bound list for a tag, e.g., (puma, 73), (gucci, 65), (adidas, 62), (diesel, 53), (versace, 40), (nike, 36), (chanel, 18), (prada, 16), next to two per-cluster lists with tighter bounds; C1 = seekers Bob & Alice, C2 = seekers Sam & Miguel)
• assign each seeker to a cluster
• compute an inverted list per cluster
• ub(i, t, C) = max u∈C |Network(u) ∩ {v | Tagged(v, i, t)}|
• + tighter bounds; item order is usually closer to the EXACT order than in Global Upper-Bound
• − space overhead is still high (trade-off)
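The per-cluster upper bound ub(i, t, C) and the corresponding per-cluster inverted list can be sketched as follows, with hypothetical toy data:

```python
# Hypothetical toy data; cluster membership and scores are illustrative.
Link = {"Bob": {"Peter"}, "Alice": {"Peter", "Mary"},
        "Sam": {"Miguel", "Kath"}, "Miguel": {"Sam"}}
Tagged = {("Peter", "i1", "fashion"), ("Mary", "i1", "fashion"),
          ("Miguel", "i2", "fashion"), ("Kath", "i2", "fashion")}

def cluster_upper_bound(i, t, cluster):
    # ub(i, t, C) = max over u in C of |Network(u) ∩ taggers(i, t)|
    taggers = {v for (v, item, tag) in Tagged if item == i and tag == t}
    return max((len(Link.get(u, set()) & taggers) for u in cluster), default=0)

def cluster_inverted_list(t, cluster, items):
    # one entry per item, ordered by the per-cluster upper bound
    entries = [(i, cluster_upper_bound(i, t, cluster)) for i in items]
    return sorted(entries, key=lambda e: e[1], reverse=True)

print(cluster_inverted_list("fashion", {"Bob", "Alice"}, ["i1", "i2"]))
# → [('i1', 2), ('i2', 0)]
```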
How do we cluster seekers?
• Finding a clustering that minimizes the worst-case or average computation time of the top-k algorithms is NP-hard
  • proofs by reduction from the independent task scheduling problem and the minimum sum of squares problem
• The authors present heuristics based on a form of Normalized Discounted Cumulative Gain (NDCG), a measure of the quality of a clustered list for a given seeker and keyword
  • the metric compares the ideal (exact score) order of inverted lists with the actual (score upper-bound) order
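A generic NDCG sketch is shown below, assuming the standard log2 rank discount; the authors use a variant of this measure, so their exact formula may differ:

```python
import math

def dcg(gains):
    # discounted cumulative gain with the standard log2 rank discount
    return sum(g / math.log2(r + 2) for r, g in enumerate(gains))

def ndcg(list_order, exact_scores):
    # compare the order induced by score upper-bounds against the ideal
    # (exact-score) order for one seeker and one tag
    gains = [exact_scores.get(i, 0.0) for i in list_order]
    ideal = sorted(exact_scores.values(), reverse=True)[:len(list_order)]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

scores = {"i1": 3.0, "i2": 2.0, "i3": 1.0}
print(ndcg(["i1", "i2", "i3"], scores))  # → 1.0 (perfect agreement)
```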
Clustering Taggers
• For each tag t, partition the taggers into separate clusters
• Form an inverted list per cluster; an item i in the list for cluster C gets the score
  max u∈Seekers |Network(u) ∩ C ∩ {v | Tagged(v, i, t)}|
• How to cluster taggers?
  • build a graph whose nodes are taggers, with an edge between v1 and v2 iff |Items(v1, t) ∩ Items(v2, t)| ≥ threshold
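Building the tagger-overlap graph can be sketched as follows, with a hypothetical threshold of 1 and toy data:

```python
from itertools import combinations

# Hypothetical toy data: which items each tagger applied the tag to.
Tagged = {("v1", "i1", "music"), ("v1", "i2", "music"),
          ("v2", "i2", "music"), ("v2", "i3", "music"),
          ("v3", "i9", "music")}

def items_of(v, t):
    # Items(v, t): items that tagger v tagged with t
    return {i for (u, i, tag) in Tagged if u == v and tag == t}

def tagger_graph(t, taggers, threshold=1):
    # edge (v1, v2) iff the taggers' item sets for t overlap enough
    edges = set()
    for v1, v2 in combinations(sorted(taggers), 2):
        if len(items_of(v1, t) & items_of(v2, t)) >= threshold:
            edges.add((v1, v2))
    return edges

print(tagger_graph("music", {"v1", "v2", "v3"}))  # → {('v1', 'v2')}
```

Clusters then correspond to components (or partitions) of this graph.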
Clustering Seekers: Metrics
• Space
  • Global Upper-Bound has the lowest overhead
  • ASC and NCT achieve an order of magnitude improvement in space overhead over EXACT
• Time
  • both gNRA and gTA outperform Global Upper-Bound
  • ASC outperforms NCT on both sequential and total accesses in all cases for gTA, and in all cases but one for gNRA
  • inverted lists are shorter
  • the score upper-bound order is similar to the exact score order for many users
• Average % improvement over Global Upper-Bound
  • Normalized Cut: 38–72%
  • Ratio Association: 67–87%
Clustering Seekers Cluster-Seekers improves query execution time over GUB by at least an order of magnitude, for all queries and all users
Clustering Taggers
• Idea: cluster taggers based on overlap in their tagging behavior
  • assign each tagger to a cluster
  • compute cluster upper-bounds: ub(i, t, C) = max u∈Seekers |Network(u) ∩ {v ∈ C | Tagged(v, i, t)}|
• Space: overhead is significantly lower than that of EXACT and of Cluster-Seekers
• Time
  • best case: all taggers relevant to a seeker reside in a single cluster
  • worst case: all taggers reside in separate clusters
Conclusion and Next Steps
• Cluster-Taggers worked best for seekers whose network fell into at most 3 × #tags clusters
  • for these seekers, Cluster-Taggers outperformed Cluster-Seekers in all cases, and outperformed Global Upper-Bound by 94–97%
  • for other seekers, query execution time degraded due to the number of inverted lists that had to be processed
• Extended traditional top-k algorithms to network-aware search
• Achieved a balance between time and space consumption
Thank You! Questions? WebCT / achawla@ieee.org