Efficient Top-k Querying over Social Tagging Networks
Ralf Schenkel
Joint work with Tom Crecelius, Mouna Kacimi, Sebastian Michel, Thomas Neumann, Josiane Parreira, Gerhard Weikum
Social Tagging Networks
Common examples:
• Flickr (images)
• YouTube (videos)
• del.icio.us (bookmarks)
• LibraryThing (books)
• Discogs (CDs)
• CiteULike (papers)
• Facebook
• MySpace (media)
Definition: a social tagging network is a website where people
• publish and tag information
• review and rate information
• publish their interests
• maintain a network of friends
• interact with friends
SIGIR, Singapore
Outline
• Search in Social Tagging Networks
  • Graph Model
  • Different Information Needs
• Effective Query Scoring
• Efficient Query Evaluation
• Summary & Further Challenges
Social Network Model
[Figure: three-layer graph of USERS, TAGS, and ITEMS; example tags include travel, China, Norway, queueing, theory, queues, probability, trip, vldb, harry potter]
Components of a Social Tagging Network
Graph G = (U ∪ I, E_U ∪ E_I ∪ E_UI) with
• 2 types of nodes:
  • Users U (optionally weighted)
  • Items I (optionally weighted)
• 3 types of edges:
  • E_U: User-User (optionally weighted)
  • E_I: Item-Item (optionally weighted)
  • E_UI: User-Item (labeled with tags T, optionally weighted)
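The graph model above can be held in a few dictionaries. This is a minimal sketch, not the authors' data structure: the class name `TaggingNetwork`, its field names, and the default weight of 1.0 are all assumptions made for illustration, and edges are stored in the direction they are added.

```python
from dataclasses import dataclass, field

@dataclass
class TaggingNetwork:
    """G = (U ∪ I, E_U ∪ E_I ∪ E_UI); every weight is optional (default 1.0)."""
    users: dict = field(default_factory=dict)       # user -> node weight
    items: dict = field(default_factory=dict)       # item -> node weight
    user_edges: dict = field(default_factory=dict)  # (u, u') -> weight, E_U
    item_edges: dict = field(default_factory=dict)  # (i, i') -> weight, E_I
    taggings: dict = field(default_factory=dict)    # (u, i) -> {tag: weight}, E_UI

    def add_friendship(self, u, v, weight=1.0):
        # Register both users, then store the edge as given (u -> v).
        self.users.setdefault(u, 1.0)
        self.users.setdefault(v, 1.0)
        self.user_edges[(u, v)] = weight

    def add_tagging(self, u, i, tag, weight=1.0):
        # A user-item edge carries the set of tags the user put on the item.
        self.users.setdefault(u, 1.0)
        self.items.setdefault(i, 1.0)
        self.taggings.setdefault((u, i), {})[tag] = weight

# Hypothetical toy network
net = TaggingNetwork()
net.add_friendship("alice", "bob")
net.add_tagging("alice", "book1", "travel")
net.add_tagging("alice", "book1", "china")
net.add_tagging("bob", "book2", "probability")
```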
Information Need 1: Global
[Figure: USERS/TAGS/ITEMS graph with the example query "harry potter"]
Tags by all users are equally important
Information Need 2: Similar Users
[Figure: USERS/TAGS/ITEMS graph with the example query "travel"]
Tags by users with similar tags/items (“brothers in spirit”) are more important
Information Need 3: Trusted Friends
[Figure: USERS/TAGS/ITEMS graph with the example query "probability"]
Tags by closely related users are more important
Wishlist for Social-Aware Social Search
• Search results depend on
  • Global popularity of items
  • Collection context of the querying user (books, tags)
  • Social context of the querying user (trusted friends)
• Scalable query processing
(similar wishlist for social recommendations)
Outline
• Search in Social Tagging Networks
• Effective Query Scoring
  • Quantifying Friendship Strengths
  • User-specific Scoring Functions
  • Experimental Evaluation
• Efficient Query Evaluation
• Summary & Further Challenges
Notation
• U: set of users
• T: set of tags
• I: set of items
• tags(u): tags used by user u
• items(u): items tagged by user u
• items(t): items tagged with tag t by at least one user
• df(t): number of items tagged with tag t
• tf_u(i,t): number of times user u tagged item i with tag t
• tf(i,t): number of times item i was tagged with tag t
[Figure: user u_j tags item i_1 with tags t_11 … t_1m1, …, and item i_n with tags t_n1 … t_nmn]
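All of these quantities follow from a flat list of tagging actions. The sketch below assumes the input is a list of (user, item, tag) triples, one per tagging action; the toy data and the variable names are illustrative only.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: one (user, item, tag) triple per tagging action.
taggings = [
    ("u1", "i1", "travel"), ("u1", "i1", "china"),
    ("u2", "i1", "travel"), ("u2", "i2", "travel"),
    ("u2", "i3", "probability"),
]

tf_u = Counter(taggings)                      # tf_u(i,t), keyed by (u, i, t)
tf = Counter((i, t) for _, i, t in taggings)  # tf(i,t), summed over all users
tags = defaultdict(set)                       # tags(u)
items_of_user = defaultdict(set)              # items(u)
items_of_tag = defaultdict(set)               # items(t)
for u, i, t in taggings:
    tags[u].add(t)
    items_of_user[u].add(i)
    items_of_tag[t].add(i)
df = {t: len(s) for t, s in items_of_tag.items()}  # df(t)
```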
Quantifying Friendship Strengths
• Global “friendship” strength
• Content-based friendship strength
• Graph-based friendship strength
• Integrated friendship strength
Content-Based Friendship Strength
• Several alternatives:
  • based on overlap of tag usage
  • based on overlap of tagged items
• For both:
  • P_content(u,u) := 0
  • normalization such that P_content(u,·) sums to 1 over all users
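The slide's overlap formulas did not survive, so the following is only one plausible instantiation: it assumes Jaccard overlap between two users' tag sets (or item sets) as the raw measure, sets the self-strength to 0, and normalizes each user's row to sum to 1. The function name `content_strength` is hypothetical.

```python
def content_strength(profiles):
    """profiles: user -> set of tags (or set of tagged items).

    Returns P[u][v] with P[u][u] = 0 and each row summing to 1.
    A user with no overlap with anybody keeps an all-zero row.
    Jaccard overlap is an assumed choice of overlap measure.
    """
    P = {}
    for u, pu in profiles.items():
        raw = {}
        for v, pv in profiles.items():
            if u == v:
                raw[v] = 0.0  # P_content(u,u) := 0
            else:
                union = pu | pv
                raw[v] = len(pu & pv) / len(union) if union else 0.0
        total = sum(raw.values())
        P[u] = {v: (w / total if total else 0.0) for v, w in raw.items()}
    return P

# Hypothetical tag profiles
profiles = {
    "u1": {"travel", "china"},
    "u2": {"travel", "norway"},
    "u3": {"probability"},
}
P_content = content_strength(profiles)
```

Passing item sets instead of tag sets gives the second alternative named on the slide.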
Graph-Based Friendship Strength
P_graph(u,u')
[Figure: example friendship graph over users u1 … u7]
• Two variants: unweighted edges, or edges weighted with P_content
• For both:
  • P_graph(u,u) := 0
  • normalization such that P_graph(u,·) sums to 1 over all users
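The exact definition of P_graph is not recoverable from the slide text. As a hedged illustration of the unweighted variant, the sketch below assumes that strength decays geometrically with shortest-path distance in the friendship graph; the decay rule, the `decay` parameter, and the function name are assumptions, not the authors' formula.

```python
from collections import deque

def graph_strength(friends, u, decay=0.5):
    """friends: user -> iterable of direct friends (unweighted edges).

    Assumed rule: raw strength is decay ** distance(u, v), found by
    breadth-first search, then normalized so the strengths sum to 1.
    """
    dist = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        for y in friends.get(x, ()):
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    raw = {v: decay ** d for v, d in dist.items() if v != u}
    total = sum(raw.values())
    strengths = {v: w / total for v, w in raw.items()}
    strengths[u] = 0.0  # P_graph(u,u) := 0
    return strengths

# Hypothetical friendship graph
friends = {"u1": ["u2", "u3"], "u2": ["u1", "u4"], "u3": ["u1"], "u4": ["u2"]}
P_graph_u1 = graph_strength(friends, "u1")
```

Direct friends u2 and u3 end up with equal strength, and the friend-of-a-friend u4 with half of that.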
Integrated Friendship Similarity
P_int(u,u') is a mixture of
• content-based similarity
• graph-based friendship similarity
• background model (global)
with mixture weights between 0 and 1 that sum to 1
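A mixture of this kind can be sketched as a weighted sum. The parameter names `w_content` and `w_graph` are placeholders for the slide's weight symbols, and the uniform background model 1/|U| is an assumption.

```python
def integrated_strength(p_content, p_graph, users, w_content, w_graph):
    """Mix two strength rows of one querying user with a global background.

    w_content and w_graph lie in [0, 1] with w_content + w_graph <= 1;
    the remaining weight goes to the (assumed uniform) background model.
    """
    w_global = 1.0 - w_content - w_graph
    assert w_content >= 0 and w_graph >= 0 and w_global >= -1e-12
    background = 1.0 / len(users)
    return {
        v: w_content * p_content.get(v, 0.0)
           + w_graph * p_graph.get(v, 0.0)
           + w_global * background
        for v in users
    }

# Hypothetical inputs for querying user u1
users = ["u1", "u2", "u3", "u4"]
p_int = integrated_strength(
    {"u2": 1.0}, {"u2": 0.4, "u3": 0.4, "u4": 0.2}, users, 0.3, 0.5)
```

Because both input rows sum to 1 and the background is a distribution, the mixed row sums to 1 as well.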
Towards a User-specific Score
• Convert into a user-specific social frequency sf_u(i,t), combining a global part with a part weighted by friendship strength
• Define a user-specific social score score_u(i,t) based on sf_u(i,t)
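The formulas on this slide are lost, so the sketch below only shows the shape such a score could take. It assumes sf_u(i,t) = α·tf(i,t) + (1−α)·Σ P(u,u')·tf_u'(i,t) and a BM25-style saturation with an idf factor; both forms, the function names, and the constant `k1` are assumptions.

```python
import math

def social_frequency(tf, friend_tf, strength, alpha):
    """tf: global tf(i,t); friend_tf: u' -> tf_u'(i,t); strength: u' -> P(u,u')."""
    social = sum(strength.get(v, 0.0) * f for v, f in friend_tf.items())
    return alpha * tf + (1.0 - alpha) * social

def social_score(sf, df, n_items, k1=1.2):
    """BM25-style saturation of sf times an idf factor (assumed form)."""
    idf = math.log((n_items - df + 0.5) / (df + 0.5) + 1.0)
    return (k1 + 1.0) * sf / (k1 + sf) * idf

# Hypothetical numbers for one item and one tag
sf = social_frequency(tf=10, friend_tf={"u2": 2, "u3": 1},
                      strength={"u2": 0.6, "u3": 0.4}, alpha=0.5)
score = social_score(sf, df=50, n_items=1000)
```

With α = 1 the score ignores the social context entirely; with α = 0 only taggings by the user's friends count.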
Including Tag Expansion
Problem: users use different tags for similar things, which leads to poor recall (missing relevant results)
Example: MPI, MPII, MPI-INF, MPI-CS, Max-Planck-Institut, D5, AG5, DB&IS, UdS, Saarland University, …
Solution:
1. Define notion of similar tags
2. Expand queries with similar tags
3. Modify scoring function for expanded queries
Heuristics for finding similar tags
• Specialization heuristics: tag t2 is a specialization of t1 if t1 occurs (almost) whenever t2 occurs
• Co-occurrence heuristics: tags t1 and t2 are similar if they occur (almost) always together
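Both heuristics can be approximated with set overlaps over items(t). In this sketch the specialization test is read as a conditional frequency and co-occurrence as Jaccard overlap; these concrete measures and the function names are assumptions.

```python
def specialization(items_of_tag, t1, t2):
    """Fraction of items tagged t2 that are also tagged t1.

    A value near 1 means t1 occurs (almost) whenever t2 occurs,
    i.e. t2 looks like a specialization of t1.
    """
    a, b = items_of_tag[t1], items_of_tag[t2]
    return len(a & b) / len(b) if b else 0.0

def cooccurrence(items_of_tag, t1, t2):
    """Jaccard overlap: near 1 means the tags occur (almost) always together."""
    a, b = items_of_tag[t1], items_of_tag[t2]
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical items(t) sets
items_of_tag = {
    "travel": {"i1", "i2", "i3", "i4"},
    "norway": {"i1", "i2"},
    "trip": {"i1", "i2", "i3", "i4"},
}
```

In this toy data "norway" specializes "travel" but not the other way round, while "travel" and "trip" co-occur on every item.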
Scoring Expanded Queries
Naive approach: for query tag t, add similar tags t' with sim(t,t') > δ to the query
But:
• “transportation disaster” expanded by “train car bus plane …”
• “international crime” expanded by “mafia camorra yakuza …”
Result quality drops due to topic drift
Better: auto-tuning incremental expansion [SIGIR’05]
For query tag t, consider only the expansion with the highest combined score per item
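The difference between the two approaches can be shown per item: the naive variant sums over all expansions, the incremental variant keeps only the best one. The sketch assumes the combined score of an expansion is sim(t,t') times the item's score for t'; that product, the threshold default, and the names are illustrative.

```python
def expanded_score(item_scores, query_tag, similar, delta=0.5):
    """item_scores: tag -> score of one item for that tag.
    similar: tag -> {expansion tag: sim}.

    Only the single best alternative counts, so many weakly related
    expansions cannot pile up and cause topic drift.
    """
    candidates = {query_tag: 1.0}  # the original tag has similarity 1
    candidates.update(
        {t: s for t, s in similar.get(query_tag, {}).items() if s > delta})
    return max(sim * item_scores.get(t, 0.0) for t, sim in candidates.items())

# Hypothetical similarities and per-tag scores of one item
similar = {"transportation": {"train": 0.8, "car": 0.7, "bus": 0.6, "boat": 0.3}}
item = {"train": 1.0, "car": 1.0, "bus": 1.0}
best = expanded_score(item, "transportation", similar)
naive = sum(sim * item.get(t, 0.0)
            for t, sim in similar["transportation"].items())
```

Here the item never carries the tag "transportation" itself, yet the naive sum rewards it three times over, while the max-based score is bounded by its best single expansion.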
Experimental Evaluation: Effectiveness
Systematic evaluation of result quality is difficult
Three setups:
• Manual queries + human assessments
• Queries + assessments derived from external info (e.g., DMOZ categories)
• Automated assessments from context of user
  • Items tagged by user and/or friends
  • Items tagged in the future
Prototype Implementation
Preliminary User Study
LibraryThing user study: [Data Engineering Bulletin, June 2008]
• 6 LibraryThing users with reasonably large library and friend sets
• Overall 49 queries
• Crawled (part of) LibraryThing: ~1.3 million books, ~15 million tags, ~12,000 users, ~18,000 friends
• Measured NDCG[10]
• Result quality generally very high
• Limited social influence is best (not enough friends?)
• Tag expansion has limited influence on results
[Figure: NDCG[10] results; axis labels (1-α) (content) and (1-α) (graph)]
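For readers unfamiliar with the measure, NDCG at cutoff 10 can be computed as below. This is the common textbook form with a log2 rank discount and linear gains; the study's exact gain function is not stated on the slide, and taking the ideal ranking over the same judged list is a simplification.

```python
import math

def dcg(gains):
    """Discounted cumulative gain: rank r (1-based) is discounted by log2(r + 1)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

def ndcg_at_k(gains, k=10):
    """gains: graded relevance of the returned items, in ranked order.

    The ideal ranking is taken over the same judged items (a simplification).
    """
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / ideal if ideal > 0 else 0.0
```

A perfectly ordered result list scores 1.0, and any swap that moves a more relevant item down lowers the value.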
Outline
• Search in Social Tagging Networks
• Effective Query Scoring
• Efficient Query Evaluation
  • Threshold Algorithms
  • ContextMerge
  • Experimental Evaluation
• Summary & Further Challenges
Algorithmic Overview
• Input: query q = {t1…tn} for user u, plus α and the other scoring parameters
• Output: k items with highest scores
• Goals:
  • Avoid computing all results
  • Minimize disk I/O and CPU load
  • Utilize precomputed information on disk
Excursion: Threshold Algorithms for Text IR
Input:
• query q = {t1…tn}
• lists L(tp) with pairs <i, score(i,tp)>, sorted by score(i,tp)↓
Output: k items with highest aggregated score
Algorithm:
• scan lists in parallel
• maintain partial candidate results with score bounds
• terminate as soon as top-k results are stable
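The three algorithm steps can be sketched in the style of a no-random-access threshold algorithm. This is a simplified illustration, assuming summation as the aggregation function and round-robin scanning; the function name `top_k_nra` and the toy lists are hypothetical.

```python
def top_k_nra(lists, k):
    """lists: one list per query term of (item, score) pairs, sorted by
    score descending. Scores are aggregated by summation and the lists
    are read with sorted (round-robin) access only.
    """
    n = len(lists)
    seen = {}  # item -> {list index: score read so far}
    high = [lst[0][1] if lst else 0.0 for lst in lists]
    depth = 0
    while True:
        progressed = False
        for p, lst in enumerate(lists):
            if depth < len(lst):
                item, score = lst[depth]
                seen.setdefault(item, {})[p] = score
                high[p] = score  # upper bound for unread entries of list p
                progressed = True
            else:
                high[p] = 0.0    # list exhausted: nothing unread remains
        depth += 1
        # worstscore: what is known; bestscore: known plus per-list bounds
        worst = {i: sum(s.values()) for i, s in seen.items()}
        best = {i: worst[i] + sum(high[p] for p in range(n) if p not in s)
                for i, s in seen.items()}
        top = sorted(worst, key=lambda i: (-worst[i], i))[:k]
        min_k = worst[top[-1]] if len(top) == k else 0.0
        # no other candidate and no unseen item may still beat the top-k
        others_best = max([best[i] for i in seen if i not in top] + [sum(high)])
        if not progressed or (len(top) == k and min_k >= others_best):
            return [(i, worst[i]) for i in top]

# Hypothetical index lists for two query terms
lists = [
    [("a", 0.9), ("b", 0.8), ("c", 0.1)],
    [("b", 0.7), ("a", 0.3), ("c", 0.2)],
]
result = top_k_nra(lists, 1)
```

In this example the scan stops after two rounds without ever reading the entries for item c. The returned scores are lower bounds, which may be smaller than the final scores when the scan stops early.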
Excursion: Threshold Algorithms
Many powerful extensions:
• Probabilistic pruning of candidates with guarantees on result quality
• Random accesses to index lists
• Scheduling scans and random accesses
• Dynamic query expansion techniques
• Hierarchical top-k for phrases
• Structured queries for XML
Most variants are provably instance optimal
It is impossible to precompute score_u(i,t) (this would mean materializing a BM25 model per user + configuration), so Threshold Algorithms cannot be applied directly
Revisiting the Social Frequency
• sf_u(i,t) has one part that is independent of user u and one part that depends on user u
• Compute sf_u(i,t) on the fly from tf(i,t), the friends of u, and their tagged documents
ContextMerge (=0)
Precomputed lists:
• ITEMS(t): pairs <i, tf(i,t)>, sorted by tf(i,t)↓
• FRIENDS(u): pairs <u', P_graph(u,u')>, sorted by P_graph(u,u')↓
• USERITEMS(u',t): pairs <i, tf_u'(i,t)>, unsorted
Adapted Threshold Algorithm for query u, t1…tn:
• Scan ITEMS(tp) and n copies of FRIENDS(u), pick “best” list
• If ITEMS(tp): read next entry
• If FRIENDS(u,p): read USERITEMS(u',tp) for next friend u'
• Update candidates and top-k
• Check for termination
ContextMerge: Candidates
Each candidate item c maintains, for each query term t:
• tf(t): value read from ITEMS(t), or UNDEF
• tf_u(t): sum of values read from USERITEMS(u',t), weighted by P_graph(u,u')
• c(t): unweighted sum of values read from USERITEMS(u',t)
To compute worstscore(c):
• plug tf(t) and tf_u(t) into the definition of sf_u(t) (0 if UNDEF)
• plug sf_u(t) into the definition of score_u(t)
To compute bestscore(c):
• if tf(t) = UNDEF [not yet seen in ITEMS(t)]: use tf(t) = high_t and tf_u(t) = highF_t · (high_t − c(t))
• else [already seen in ITEMS(t)]: use tf_u(t) = highF_t · (tf(t) − c(t))
• plug the values into the definition of sf_u as before
high_t: current high score in ITEMS(t)
highF_t: current high score in FRIENDS(u,t)
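The per-term bounds can be written down compactly. The sketch below follows the two cases for bestscore, with one labeled assumption: it adds the friend-weighted part already accumulated in tf_u(t) to the slide's bound for the unread remainder, so that the best-case value can never fall below the worst-case value. The function name and the `UNDEF` sentinel are illustrative.

```python
UNDEF = None  # marker for "item not yet seen in ITEMS(t)"

def tf_bounds(tf, tfu, c, high_t, highF_t):
    """Return ((tf, tf_u) for worstscore, (tf, tf_u) for bestscore)
    for one candidate item and one query term t.

    tf:  value read from ITEMS(t) or UNDEF
    tfu: friend-weighted sum read so far from USERITEMS lists
    c:   unweighted sum read so far from USERITEMS lists
    """
    worst = (0.0 if tf is UNDEF else tf, tfu)
    # Unread taggings are bounded by high_t - c (tf unknown) or tf - c
    # (tf known); each can be weighted by at most highF_t.
    known_tf = high_t if tf is UNDEF else tf
    # Assumption: keep the accumulated tfu and add the bound for the rest.
    best = (known_tf, tfu + highF_t * (known_tf - c))
    return worst, best

# Hypothetical candidate state for one term
worst, best = tf_bounds(UNDEF, tfu=0.4, c=1, high_t=5, highF_t=0.3)
```

Both pairs would then be plugged into the definitions of sf_u(t) and score_u(t) to obtain worstscore(c) and bestscore(c).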
ContextMerge: List Selection
Lists are greedily selected by highest expected score:
• ITEMS(t): compute sf_u(t), score_u(t) with tf(t) = high_t, tf_u(t) = 0
• FRIENDS(u,t): compute sf_u(t), score_u(t) with tf(t) = 0, tf_u(t) = highF_t · maxtf
maxtf: maximum of tf_u(t) over all users u and tags t
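The greedy choice can be sketched as a comparison of expected scores across all lists. The sketch assumes sf_u(t) = α·tf(t) + (1−α)·tf_u(t) and, for brevity, uses the identity as the scoring function unless another one is passed in; the function name `pick_list` and both of these choices are assumptions.

```python
def pick_list(high, highF, maxtf, alpha, score=lambda sf: sf):
    """high: t -> current high score in ITEMS(t)
    highF: t -> current high score in FRIENDS(u,t)
    Returns the (kind, tag) pair of the list with the highest expected score.
    """
    expected = {}
    for t in high:
        # ITEMS(t): tf(t) = high_t, tf_u(t) = 0
        expected[("ITEMS", t)] = score(alpha * high[t])
        # FRIENDS(u,t): tf(t) = 0, tf_u(t) = highF_t * maxtf
        expected[("FRIENDS", t)] = score((1.0 - alpha) * highF[t] * maxtf)
    return max(expected, key=expected.get)

# Hypothetical list states for a two-term query
high = {"t1": 10, "t2": 2}
highF = {"t1": 0.5, "t2": 0.9}
social_choice = pick_list(high, highF, maxtf=3, alpha=0.2)
global_choice = pick_list(high, highF, maxtf=3, alpha=0.8)
```

A small α shifts the scan toward the friends lists, a large α toward the global ITEMS lists.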
ContextMerge: Schematic execution
[Figure: animated scan over the lists Items(t1), Items(t2), and two Friends lists (both labeled Friends(u,t1) on the slide), marking the considered USERITEMS(u',t1) and USERITEMS(u',t2) entries; friend u7 is highlighted in the intermediate frames]
Experimental Evaluation: Efficiency
• Testbed: 3 large crawls of real social networks
  • Flickr: 10 million pictures, ~50,000 users
  • Del.icio.us: ~175,000 bookmarks, ~12,000 users
  • LibraryThing: ~6.5 million books, ~10,000 users
• Queries:
  • ~150 frequent tag pairs in each set
  • for each query, pick a user with “enough” results & friends
• Cost measure: #sorted accesses + 100 · #random accesses
• Baseline: full join + sort
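The cost measure weights one random access like 100 sorted accesses. As a small worked example (the function name and the access counts are made up for illustration):

```python
def access_cost(sorted_accesses, random_accesses, random_weight=100):
    """Abstract cost: #sorted accesses + 100 * #random accesses."""
    return sorted_accesses + random_weight * random_accesses

# Hypothetical run: 5,000 list entries scanned, 40 random lookups
cost = access_cost(5000, 40)
```

Under this measure the 40 random lookups contribute 4,000 cost units, almost as much as the 5,000 scanned entries.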
Experimental Evaluation: Efficiency
[Figure: cost results plotted against α]
Outline
• Search in Social Tagging Networks
• Effective Query Scoring
• Efficient Query Evaluation
• Summary & Further Challenges
Summary
• Need for social-aware social search, supporting global, social, and spiritual information needs
• Social scoring
  • integrating global, collection, and social context
  • including dynamic tag expansion
• ContextMerge: scalable implementation
Further Challenges
• Meaningful & common benchmark
• Incremental maintenance for high dynamics
• Extend to ratings, user weights, item weights, …
• Extend to non-tags (like image features)
• Automatic query parameterization
• Meaningful explanations of results
• Exploit dynamics (hot topics, evolving groups, …)
Social-Aware Search & Recommendations at planet scale