Efficient Top-k Querying over Social Tagging Networks

Ralf Schenkel
Joint work with Tom Crecelius, Mouna Kacimi, Sebastian Michel, Thomas Neumann, Josiane Parreira, Gerhard Weikum

Social Tagging Networks
Common examples:
• Flickr (images)
• YouTube (videos)
• del.icio.us (bookmarks)
• LibraryThing (books)
• Discogs (CDs)
• CiteULike (papers)
• Facebook
• Myspace (media)
Definition: a social tagging network is a website where people
• publish and tag information
• review and rate information
• publish their interests
• maintain a network of friends
• interact with friends
SIGIR, Singapore

Outline
• Search in Social Tagging Networks
  • Graph Model
  • Different Information Needs
• Effective Query Scoring
• Efficient Query Evaluation
• Summary & Further Challenges

Social Network Model
[Figure: example graph in which users are connected to items through tags; layers labeled USERS, TAGS, ITEMS]

Components of a Social Tagging Network
Graph G = (U ∪ I, E_U ∪ E_I ∪ E_UI) with
• 2 types of nodes:
  • Users U (optionally weighted)
  • Items I (optionally weighted)
• 3 types of edges:
  • E_U: User-User (optionally weighted)
  • E_I: Item-Item (optionally weighted)
  • E_UI: User-Item (labeled with tags T, optionally weighted)
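
A minimal sketch of how such a graph could be held in memory, assuming plain Python dictionaries. The class name, field names, and helper methods are illustrative and do not come from the talk.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TaggingGraph:
    """Hypothetical in-memory form of G = (U ∪ I, E_U ∪ E_I ∪ E_UI)."""
    users: set = field(default_factory=set)    # U
    items: set = field(default_factory=set)    # I
    # E_U: user-user edges, stored as user -> {other user: weight}
    user_edges: dict = field(default_factory=lambda: defaultdict(dict))
    # E_I: item-item edges, stored as item -> {other item: weight}
    item_edges: dict = field(default_factory=lambda: defaultdict(dict))
    # E_UI: user-item edges labeled with tags, (user, item) -> {tag: count}
    taggings: dict = field(default_factory=lambda: defaultdict(dict))

    def add_friend(self, u, v, weight=1.0):
        """Add a directed, weighted user-user edge."""
        self.users.update((u, v))
        self.user_edges[u][v] = weight

    def add_tagging(self, user, item, tag):
        """Record that `user` tagged `item` with `tag` once more."""
        self.users.add(user)
        self.items.add(item)
        tags = self.taggings[(user, item)]
        tags[tag] = tags.get(tag, 0) + 1


g = TaggingGraph()
g.add_friend("alice", "bob", 0.8)
g.add_tagging("bob", "book42", "travel")
g.add_tagging("bob", "book42", "Norway")
```

Storing the tag counts on the user-item edge keeps the per-user information that the later scoring slides rely on.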

Information Need 1: Global
[Figure: tagging graph with USERS, TAGS, and ITEMS layers]
Tags by all users are equally important

Information Need 2: Similar Users
[Figure: tagging graph with USERS, TAGS, and ITEMS layers]
Tags by users with similar tags/items ("brothers in spirit") are more important

Information Need 3: Trusted Friends
[Figure: tagging graph with USERS, TAGS, and ITEMS layers]
Tags by closely related users are more important

Wishlist for Social-Aware Social Search
• Search results depend on
  • Global popularity of items
  • Collection context of the querying user (books, tags)
  • Social context of the querying user (trusted friends)
• Scalable query processing
(similar wishlist for social recommendations)

Outline
• Search in Social Tagging Networks
• Effective Query Scoring
  • Quantifying Friendship Strengths
  • User-specific Scoring Functions
  • Experimental Evaluation
• Efficient Query Evaluation
• Summary & Further Challenges

Notation
U: set of users
T: set of tags
I: set of items
tags(u): tags used by user u
items(u): items tagged by user u
items(t): items tagged with tag t by at least one user
df(t): number of items tagged with tag t
tf_u(i,t): number of times user u tagged item i with tag t
tf(i,t): number of times item i was tagged with tag t
[Figure: user u_j assigns tags t_11 … t_1m1 to item i_1, …, and tags t_n1 … t_nmn to item i_n]
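
The definitions above can be made concrete with a small sketch that derives each statistic from raw tagging actions. The toy triples and the variable names are assumptions for illustration only.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: one (user, item, tag) triple per tagging action.
taggings = [
    ("u1", "i1", "travel"),
    ("u1", "i2", "travel"),
    ("u2", "i1", "travel"),
    ("u2", "i1", "norway"),
]

tf_u = Counter(taggings)                        # tf_u(i,t), keyed by (u, i, t)
tf = Counter((i, t) for _, i, t in taggings)    # tf(i,t), summed over all users

tags_of = defaultdict(set)          # tags(u)
items_of_user = defaultdict(set)    # items(u)
items_of_tag = defaultdict(set)     # items(t)
for u, i, t in taggings:
    tags_of[u].add(t)
    items_of_user[u].add(i)
    items_of_tag[t].add(i)

# df(t): number of distinct items carrying tag t
df = {t: len(items) for t, items in items_of_tag.items()}
```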

Quantifying Friendship Strengths
• Global "friendship" strength:
• Content-based friendship strength
• Graph-based friendship strength
• Integrated friendship strength

Content-Based Friendship Strength
• Several alternatives:
  • based on overlap of tag usage
  • based on overlap of tagged items
• For both:
  • P_content(u,u) := 0
  • normalization such that …
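
The slide's formulas are not preserved here, so the following is only one plausible reading of "overlap of tag usage": the raw strength is the number of shared tags, and the values for each user are scaled to sum to 1. Both the overlap measure and the sum-to-1 normalization are assumptions, as is the function name.

```python
def content_strengths(tags_of, u):
    """Overlap-based strengths from u to every other user.

    Assumption: raw strength = number of shared tags, normalized so that
    the strengths of u sum to 1. This is not the talk's exact formula.
    """
    raw = {
        v: len(tags_of[u] & tags_of[v])
        for v in tags_of
        if v != u                 # P_content(u,u) := 0, so u itself is skipped
    }
    total = sum(raw.values())
    if total == 0:
        return {v: 0.0 for v in raw}
    return {v: w / total for v, w in raw.items()}


tags_of = {
    "u1": {"travel", "norway", "queues"},
    "u2": {"travel", "norway"},
    "u3": {"queues"},
    "u4": {"harrypotter"},
}
p = content_strengths(tags_of, "u1")   # u2 shares two tags, u3 one, u4 none
```

The item-based alternative would use items(u) in place of tags(u) with the same structure.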

Graph-Based Friendship Strength P_graph(u,u')
• Unweighted edges:
• Edges weighted with P_content:
• For both:
  • P_graph(u,u) := 0
  • normalization such that …
[Figure: example friendship graph over users u1 … u7]

Integrated Friendship Similarity P_int(u,u')
Mixture of
• content-based similarity
• graph-based friendship similarity
• background model (global)
with mixture weights in [0, 1] that sum to 1
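
A minimal sketch of such a mixture, assuming a plain linear combination of the three components. The weight names w_content, w_graph, and w_global are hypothetical stand-ins for the symbols on the slide.

```python
def integrated_strength(p_content, p_graph, p_global,
                        w_content, w_graph, w_global):
    """Linear mixture of three strengths; the weight names are hypothetical."""
    weights = (w_content, w_graph, w_global)
    if any(w < 0 or w > 1 for w in weights) or abs(sum(weights) - 1) > 1e-9:
        raise ValueError("weights must lie in [0, 1] and sum to 1")
    return w_content * p_content + w_graph * p_graph + w_global * p_global


# 0.5 * 0.4 + 0.3 * 0.1 + 0.2 * 0.01
p_int = integrated_strength(0.4, 0.1, 0.01,
                            w_content=0.5, w_graph=0.3, w_global=0.2)
```

Setting one weight to 0 switches the corresponding component off, which is how a purely graph-based or purely content-based variant would be obtained under this reading.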

Towards a User-specific Score
global friendship strength
Convert into user-specific social frequency:
Define user-specific social score:

Including Tag Expansion
Problem: Users use different tags for similar things → poor recall (missing relevant results)
Example: MPI, MPII, MPI-INF, MPI-CS, Max-Planck-Institut, D5, AG5, DB&IS, UdS, Saarland University, …
Solution:
1. Define notion of similar tags
2. Expand queries with similar tags
3. Modify scoring function for expanded queries

Heuristics for finding similar tags
Specialization heuristics: Tag t2 is a specialization of t1 if t1 occurs (almost) whenever t2 occurs
Co-occurrence heuristics: Tags t1 and t2 are similar if they (almost) always occur together
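
Both heuristics can be sketched over the item sets items(t). The concrete measures below (a conditional fraction for specialization, Jaccard overlap for co-occurrence) are assumptions chosen to match the wording, not formulas taken from the talk.

```python
def specialization(items_of_tag, t1, t2):
    """Fraction of items tagged t2 that also carry t1.

    A value close to 1 suggests that t2 is a specialization of t1.
    """
    a, b = items_of_tag[t1], items_of_tag[t2]
    return len(a & b) / len(b) if b else 0.0


def cooccurrence(items_of_tag, t1, t2):
    """Jaccard overlap; close to 1 when the two tags almost always co-occur."""
    a, b = items_of_tag[t1], items_of_tag[t2]
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy data
items_of_tag = {
    "travel": {"i1", "i2", "i3", "i4"},
    "norway": {"i1", "i2"},
    "trip": {"i1", "i2", "i3", "i5"},
}
spec = specialization(items_of_tag, "travel", "norway")   # every norway item is a travel item
rev = specialization(items_of_tag, "norway", "travel")    # but not the other way round
co = cooccurrence(items_of_tag, "travel", "trip")         # 3 shared out of 5 items
```

The specialization measure is asymmetric by design, while the co-occurrence measure is symmetric.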

Scoring Expanded Queries
Naive approach: For query tag t, add similar tags t' with sim(t,t') > δ to the query
But:
• "transportation disaster" expanded by "train car bus plane …"
• "international crime" expanded by "mafia camorra yakuza …"
Result quality drops due to topic drift
Better: auto-tuning incremental expansion [SIGIR'05]
For query tag t, consider only the expansion with the highest combined score per item
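
One way to read "highest combined score per item" is to take, for each query tag and item, the maximum of sim(t,t') times the item's score for t' instead of summing over all expansions. The sketch below assumes that reading; the function name, the default threshold, and the toy scores are hypothetical.

```python
def expanded_score(item, tag, sim, score, delta=0.5):
    """Score of one item for one query tag, using only its best expansion.

    sim:   {tag: {similar tag: similarity}}
    score: {(item, tag): score}
    """
    best = score.get((item, tag), 0.0)      # the original tag counts with weight 1
    for t2, s in sim.get(tag, {}).items():
        if s > delta:                       # only sufficiently similar tags expand
            best = max(best, s * score.get((item, t2), 0.0))
    return best


sim = {"transportation": {"train": 0.8, "car": 0.7, "bus": 0.6, "boat": 0.3}}
score = {
    ("d1", "train"): 0.5, ("d1", "car"): 0.5, ("d1", "bus"): 0.5,
    ("d2", "transportation"): 0.6,
}
s1 = expanded_score("d1", "transportation", sim, score)
s2 = expanded_score("d2", "transportation", sim, score)
```

Summing the three expansions would rank d1 above d2 although d1 never carries the query tag; taking the maximum keeps d2 ahead, which limits this kind of topic drift.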

Experimental Evaluation: Effectiveness
Systematic evaluation of result quality is difficult
Three setups:
• Manual queries + human assessments
• Queries + assessments derived from external info (e.g., DMOZ categories)
• Automated assessments from context of user
  • Items tagged by user and/or friends
  • Items tagged in the future

Prototype Implementation

Preliminary User Study
LibraryThing user study [Data Engineering Bulletin, June 2008]:
• 6 LibraryThing users with reasonably large library and friend sets
• Overall 49 queries
• Crawled (part of) LibraryThing: ~1.3 million books, ~15 million tags, ~12,000 users, ~18,000 friends
• Measured NDCG[10]
• Result quality generally very high
• Limited social influence is best (not enough friends?)
• Tag expansion has limited influence on results
[Figure: result quality over (1-α), once for content-based and once for graph-based friendship strength]

Outline
• Search in Social Tagging Networks
• Effective Query Scoring
• Efficient Query Evaluation
  • Threshold Algorithms
  • ContextMerge
  • Experimental Evaluation
• Summary & Further Challenges

Algorithmic Overview
• Input: query q = {t1 … tn} for user u, α, and the remaining scoring parameters
• Output: k items with highest scores
• Goals:
  • Avoid computing all results
  • Minimize disk I/O and CPU load
  • Utilize precomputed information on disk

Excursion: Threshold Algorithms for Text IR
Input:
• query q = {t1 … tn}
• lists L(tp) with pairs <i, score(i,tp)>, sorted by score(i,tp)↓
Output: k items with highest aggregated score
Algorithm:
• scan lists in parallel
• maintain partial candidate results with score bounds
• terminate as soon as top-k results are stable
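
The three algorithm steps can be sketched as follows. This is a simplified variant that uses sorted accesses only and sums the per-term scores; the function name and the toy lists are assumptions, and scores are assumed to be non-negative.

```python
import heapq


def nra_topk(lists, k):
    """Top-k by summed score using sorted accesses only (simplified sketch).

    lists: one list per query term holding (item, score) pairs, sorted by
    score in descending order.
    Returns (top-k as (item, lower-bound score) pairs, scan depth reached).
    """
    n = len(lists)
    seen = {}                                    # item -> {list index: score}
    high = [lst[0][1] if lst else 0.0 for lst in lists]
    top, worst, depth = [], {}, 0
    max_depth = max((len(lst) for lst in lists), default=0)
    while depth < max_depth:
        for p, lst in enumerate(lists):          # one round-robin scan step
            if depth < len(lst):
                item, s = lst[depth]
                seen.setdefault(item, {})[p] = s
                high[p] = s                      # bound for still unread entries
            else:
                high[p] = 0.0                    # list exhausted
        depth += 1
        worst = {i: sum(sc.values()) for i, sc in seen.items()}
        top = heapq.nlargest(k, worst, key=worst.get)
        if len(top) < k:
            continue
        min_k = worst[top[-1]]
        rivals_best = [
            worst[i] + sum(high[p] for p in range(n) if p not in seen[i])
            for i in seen if i not in top
        ]
        # Stable: neither an unseen nor a partially seen item can enter the top-k.
        if min_k >= sum(high) and all(min_k >= b for b in rivals_best):
            break
    return [(i, worst[i]) for i in top], depth


lists = [
    [("a", 0.9), ("b", 0.8), ("c", 0.1)],        # L(t1)
    [("b", 0.9), ("a", 0.7), ("d", 0.1)],        # L(t2)
]
result, depth = nra_topk(lists, k=1)
```

In this example the scan stops after two of the three entries per list, because no other item can still overtake "b".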

Excursion: Threshold Algorithms
Many powerful extensions:
• Probabilistic pruning of candidates with guarantees on result quality
• Random accesses to index lists
• Scheduling scans and random accesses
• Dynamic query expansion techniques
• Hierarchical top-k for phrases
• Structured queries for XML
Most variants provably instance optimal
Impossible to precompute score_u(i,t) (materialize BM25 model per user + config) → cannot directly apply Threshold Algorithms

Revisiting the Social Frequency
sf_u(i,t) splits into a part independent of user u and a part dependent on user u
Compute sf_u(i,t) on the fly from tf(i,t), the friends of u, and their tagged documents
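
A minimal sketch of computing the social frequency on the fly, assuming a linear blend with weight α between the user-independent tf(i,t) and the friendship-weighted tag frequencies of u's friends. The exact formula is not preserved in this transcript, so the blend and all names below are assumptions.

```python
def social_frequency(item, tag, tf, friends, user_tf, alpha):
    """Assumed form: alpha * tf(i,t) + (1 - alpha) * sum of strength * tf_u'(i,t).

    tf:      {(item, tag): count}          precomputed, same for every user
    friends: {friend: friendship strength} for the querying user u
    user_tf: {(friend, item, tag): count}  taggings of individual users
    """
    global_part = tf.get((item, tag), 0)             # independent of u
    social_part = sum(                               # dependent on u
        strength * user_tf.get((friend, item, tag), 0)
        for friend, strength in friends.items()
    )
    return alpha * global_part + (1 - alpha) * social_part


tf = {("i1", "travel"): 10}
friends = {"u2": 0.5, "u3": 0.25}
user_tf = {("u2", "i1", "travel"): 2, ("u3", "i1", "travel"): 4}
sf = social_frequency("i1", "travel", tf, friends, user_tf, alpha=0.5)
```

Only the first term can be stored in a global index list; the second term needs the friend list of the querying user, which is why it is evaluated at query time.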

ContextMerge (=0) Precomputed lists: • ITEMS(t): pairs <i,tf(i,t)>, sorted by tf(i,t)↓ • FRIENDS(u): pairs <u‘,Pgraph(u,u‘)>, sorted by Pgraph(u,u‘)↓ • USERITEMS(u‘,t): pairs <i,tfu‘(i,t)>, unsorted Adapted Threshold Algorithm for query u,t1…tn: • Scan ITEMS(tp) and n copies of FRIENDS(u),pick „best“ list • If ITEMS(tp): read next entry • If FRIENDS(u,p): read USERITEMS(u‘,tp) for next friend u‘ • Update candidates and topk • Check for termination SIGIR, Singapore

ContextMerge: Candidates
Candidate items c maintain for each query term t
• tf(t): value read from ITEMS(t) or UNDEF
• tf_u(t): sum of values read from USERITEMS(u',t), weighted by P_graph(u,u')
• c(t): unweighted sum of values read from USERITEMS(u',t)
To compute worstscore(c):
• plug tf(t) and tf_u(t) into the definition of sf_u(t) (0 if UNDEF)
• plug sf_u(t) into the definition of score_u(t)

ContextMerge: Candidates
To compute bestscore(c):
• if tf(t) = UNDEF [not yet seen in ITEMS(t)]: use tf(t) = high_t and tf_u(t) = highF_t · (high_t - c(t))
• else [already seen in ITEMS(t)]: use tf_u(t) = highF_t · (tf(t) - c(t))
and plug it into the definition of sf_u as before
high_t: current high score in ITEMS(t)
highF_t: current high score in FRIENDS(u,t)
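
The bounds for a single candidate and query term can be sketched as below. Two assumptions are made: the social frequency is taken to be a linear blend of tf and tf_u with weight α, and the slide's best-case expression is read as the largest possible contribution of friends not yet visited, which is added to the part already accumulated. All function and parameter names are illustrative.

```python
UNDEF = None


def sf(tf, tfu, alpha):
    # Assumed linear blend; the talk's exact definition may differ.
    return alpha * tf + (1 - alpha) * tfu


def term_bounds(tf, tfu, c, high_t, high_f, alpha):
    """Return (worst, best) social frequency for one candidate and one term.

    tf:     value read from ITEMS(t), or UNDEF if not yet seen there
    tfu:    friendship-weighted sum read so far from USERITEMS lists
    c:      unweighted sum read so far from USERITEMS lists
    high_t: current high score in ITEMS(t)
    high_f: current high score in FRIENDS(u,t)
    """
    worst = sf(0 if tf is UNDEF else tf, tfu, alpha)
    total = high_t if tf is UNDEF else tf       # upper bound on tf(t)
    remaining = max(total - c, 0)               # taggings by friends not yet visited
    best = sf(total, tfu + high_f * remaining, alpha)
    return worst, best


unseen = term_bounds(UNDEF, tfu=1.0, c=2, high_t=10, high_f=0.5, alpha=0.5)
seen = term_bounds(6, tfu=1.0, c=2, high_t=10, high_f=0.5, alpha=0.5)
```

Once the candidate has been read from ITEMS(t), the exact tf(t) replaces high_t and the interval between worst and best narrows.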

ContextMerge: List Selection
Lists are greedily selected by highest expected score
• ITEMS(t): compute sf_u(t), score_u(t) with tf(t) = high_t, tf_u(t) = 0
• FRIENDS(u,t): compute sf_u(t), score_u(t) with tf(t) = 0, tf_u(t) = highF_t · maxtf
maxtf: maximum of tf_u(t) over u, t

ContextMerge: Schematic execution
[Figure: animation of the scans over Items(t1), Items(t2), and the Friends(u,·) lists, marking the USERITEMS(u',t1) and USERITEMS(u',t2) lists considered so far]

Experimental Evaluation: Efficiency
• Testbed: 3 large crawls of real social networks
  • Flickr: 10 million pictures, ~50,000 users
  • Del.icio.us: ~175,000 bookmarks, ~12,000 users
  • LibraryThing: ~6.5 million books, ~10,000 users
• Queries:
  • ~150 frequent tag pairs in each set
  • for each query pick user with "enough" results & friends
• Cost measure: #sorted accesses + 100 · #random accesses
• Baseline: full join + sort
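
The cost measure weights one random access like 100 sorted accesses. As a small worked example (the function name and the access counts are made up for illustration):

```python
def access_cost(sorted_accesses, random_accesses, random_weight=100):
    """Abstract cost: #sorted accesses + random_weight * #random accesses."""
    return sorted_accesses + random_weight * random_accesses


# Hypothetical run: 2500 sorted and 30 random accesses
cost = access_cost(2500, 30)    # 2500 + 100 * 30
```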

Experimental Evaluation: Efficiency
[Figure: efficiency results plotted over α]

Outline
• Search in Social Tagging Networks
• Effective Query Scoring
• Efficient Query Evaluation
• Summary & Further Challenges

Summary
• Need for social-aware social search, supporting global, social, and spiritual information needs
• Social scoring
  • integrating global, collection, and social context
  • including dynamic tag expansion
• ContextMerge: scalable implementation

Further Challenges
• Meaningful & common benchmark
• Incremental maintenance for high dynamics
• Extend to ratings, user weights, item weights, …
• Extend to non-tags (like image features)
• Automatic query parameterization
• Meaningful explanations of results
• Exploit dynamics (hot topics, evolving groups, …)
Social-Aware Search & Recommendations at planet scale
