Ralf Schenkel: Information Search in Social Networks (Informationssuche in sozialen Netzen). Joint work with Tom Crecelius, Mouna Kacimi, Sebastian Michel, Thomas Neumann, Josiane Parreira, Marc Spaniol, Gerhard Weikum
Social Tagging Networks
Common examples:
• Flickr (images)
• YouTube (videos)
• del.icio.us (bookmarks)
• LibraryThing (books)
• Discogs (CDs)
• CiteULike (papers)
• Facebook
• Myspace (media)
Definition: Social Tagging Network
Website where people
• publish + tag information
• review + rate information
• publish their interests
• maintain network of friends
• interact with friends
Perspektivenvorlesung
Some Statistics
Flickr (as of Nov 2008):
• 3+ billion photos, 3 million new photos per day
Facebook (as of Nov 2008):
• 10+ billion photos, 30+ million new photos per day
• 120 million active users
• 150,000 new users per day
Myspace (as of Apr 2007):
• 135 million users (as a country, it would be the 6th largest on Earth)
• 2+ billion images (150,000 req/s), millions added daily
• 25 million songs
• 60 TB of videos
StudiVZ.net (as of Nov 2008):
• 11 million users
• 300 million images, 1 million added daily
Huge volume of highly dynamic data
Showcase: librarything.com [Screenshot labels: Tags, Ratings, Others, Books]
librarything.com: Social Interaction [Screenshot labels: Similar Users, Comments, Explicit Friends]
librarything.com: Tag Clouds
librarything.com: Search. Search results are independent of the querying user (and the social context)
librarything.com: Search. Search is automatically expanded with similar tags (synonyms)
librarything.com: Recommendations. Recommendations depend on user and tags (but not on social context)
librarything.com: Recommendations. Explanation for the recommendation
librarything.com: Explanations
Outline
• Search in Social Tagging Networks
  • Graph Model
  • Different Information Needs
• Effective Query Scoring
• Efficient Query Evaluation
• Summary & Further Challenges
Querying Social Tagging Networks [Figure: example tagging graph; tag sets shown include {travel, norway}, {travel, vldb}, {travel}, {travel, mexico}, {travel, trip}, {travel, icde}, {harry, potter}, {probability, data mining, foundations}]
Information Need 1: Globally Popular
Query: harry potter [Figure: example tagging graph with two candidate results marked "or ?"]
• Most frequently tagged items are "best"
• Tags by all users are equally important
Information Need 2: Similar Users
Query: travel [Figure: example tagging graph with two candidate results marked "or ?"]
• Tags by users with similar tags/items ("brothers in spirit") are more important
Information Need 3: Trusted Friends
Query: probability [Figure: example tagging graph, now also containing items tagged {probability, selling}, with two candidate results marked "or ?"]
• Tags by closely related and well-known users are more important
Towards Social-Aware Social Search
Search results may depend on
• Global popularity of items
• Spiritual context of the querying user (users with similar books and/or tags)
• Social context of the querying user (known and trusted friends)
Outline
• Search in Social Tagging Networks
• Effective Query Scoring
  • Quantifying Friendship Strengths
  • User-specific Scoring Functions
  • Experimental Evaluation
• Efficient Query Evaluation
• Summary & Further Challenges
Notation
• U: set of users
• T: set of tags
• I: set of items
• tags(u): tags used by user u
• items(u): items tagged by user u
• items(t): items tagged with tag t by at least one user
• df(t): number of items tagged with tag t
• tf_u(i,t): number of times user u tagged item i with tag t
• tf(i,t): number of times item i was tagged with tag t
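The notation can be made concrete with a small sketch. The tagging log below is hypothetical (the users, items, and tags are invented for illustration); the code only shows how the quantities defined above could be derived from (user, item, tag) triples.

```python
from collections import Counter, defaultdict

# Hypothetical tagging log: one (user, item, tag) triple per tagging action.
taggings = [
    ("alice", "book1", "travel"),
    ("alice", "book1", "norway"),
    ("bob", "book1", "travel"),
    ("bob", "book2", "harrypotter"),
    ("carol", "book2", "harrypotter"),
    ("carol", "book3", "travel"),
]

tf_u = Counter(taggings)                      # tf_u[(u, i, t)]
tf = Counter((i, t) for _, i, t in taggings)  # tf[(i, t)]

tags_of_user = defaultdict(set)   # tags(u)
items_of_user = defaultdict(set)  # items(u)
items_of_tag = defaultdict(set)   # items(t)
for u, i, t in taggings:
    tags_of_user[u].add(t)
    items_of_user[u].add(i)
    items_of_tag[t].add(i)

# df(t): number of distinct items carrying tag t.
df = {t: len(items) for t, items in items_of_tag.items()}
```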
Quantifying Friendship Strengths
• Global "friendship" strength
• Spiritual friendship strength
• Social friendship strength
• Integrated friendship strength
Spiritual Friendship Strength
P_spirit(u,u'): overlap in interests of u and u'
Several alternatives:
• based on overlap of tag usage (example tags: harrypotter, wizard, deathlyhallows, philosopherstone)
• based on overlap of tagged items
• based on overlap of behavior (tagging, searching, rating, …)
For all:
• P_spirit(u,u) := 0
• strengths are normalized over the other users u'
Recall: tags(u) are the tags used by user u, items(u) the items tagged by user u
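A minimal sketch of the tag-overlap alternative follows. The slide's exact formula is not preserved in this transcript, so two assumptions are made here: overlap is measured with the Jaccard coefficient of the two tag sets, and strengths are normalized to sum to 1 over all other users. The function name `spiritual_strengths` and the example users are hypothetical.

```python
def spiritual_strengths(u, tags_of_user):
    """Return {u': P_spirit(u, u')} based on tag-set overlap.

    Assumptions (not from the slides): Jaccard overlap, normalized to sum to 1.
    """
    raw = {}
    for v, tags_v in tags_of_user.items():
        if v == u:
            continue  # P_spirit(u, u) := 0, so u itself is left out
        union = tags_of_user[u] | tags_v
        raw[v] = len(tags_of_user[u] & tags_v) / len(union) if union else 0.0
    total = sum(raw.values())
    return {v: (s / total if total > 0 else 0.0) for v, s in raw.items()}


example_tags = {
    "u": {"harrypotter", "wizard", "deathlyhallows"},
    "v": {"harrypotter", "wizard", "philosopherstone"},
    "w": {"travel", "norway"},
}
p_spirit = spiritual_strengths("u", example_tags)
```

The same function works for the item-overlap alternative if items(u) is passed instead of tags(u).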
Graph-Based Friendship Strength
P_social(u,u'): based on the distance of u and u' in the user network
• set P_social(u,u) := 0
• strengths are normalized over the other users u'
[Figure: example user network with users u1 to u7 and a bar chart of P_social(·,u') for u' = u2, …, u7]
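A distance-based strength could be sketched as below. The slides do not state how distance is turned into a strength, so the inverse-distance decay (1/d) and the normalization to sum 1 are assumptions, as are the function name and the small friendship graph.

```python
from collections import deque


def social_strengths(u, friends):
    """Return {u': P_social(u, u')} from hop distances in the friendship graph.

    Assumptions (not from the slides): strength proportional to 1/distance,
    normalized to sum to 1; unreachable users get no entry (strength 0).
    """
    dist = {u: 0}
    queue = deque([u])
    while queue:  # breadth-first search yields shortest hop distances
        x = queue.popleft()
        for y in friends.get(x, ()):
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    raw = {v: 1.0 / d for v, d in dist.items() if v != u}  # P_social(u, u) := 0
    total = sum(raw.values())
    return {v: s / total for v, s in raw.items()}


# Hypothetical undirected friendship graph as adjacency lists.
friend_graph = {
    "u1": ["u2", "u3"],
    "u2": ["u1", "u4"],
    "u3": ["u1"],
    "u4": ["u2"],
}
p_social = social_strengths("u1", friend_graph)
```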
Integrated Friendship Strength
P_int(u,u'): query-dependent mixture of
• spiritual friendship strength
• social friendship strength
• background model (global)
with mixture parameters 0 ≤ α, β ≤ 1 and α + β ≤ 1
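One way to read the mixture is as a linear interpolation of the three strengths. The sketch below assumes that α weights the social part, β the spiritual part, and the remaining mass 1 - α - β goes to a uniform global strength 1/|U|; the exact formula is not preserved in this transcript, so all three choices are assumptions.

```python
def integrated_strength(p_social, p_spirit, users, alpha, beta):
    """Return {u': P_int(u, u')} as a linear mixture (assumed form)."""
    if not (0 <= alpha <= 1 and 0 <= beta <= 1 and alpha + beta <= 1):
        raise ValueError("need 0 <= alpha, beta <= 1 and alpha + beta <= 1")
    p_global = 1.0 / len(users)  # assumed uniform background model
    return {
        v: alpha * p_social.get(v, 0.0)
        + beta * p_spirit.get(v, 0.0)
        + (1 - alpha - beta) * p_global
        for v in users
    }


# Hypothetical input strengths for a querying user.
all_users = ["a", "b", "c"]
p_int = integrated_strength({"b": 1.0}, {"c": 1.0}, all_users, alpha=0.2, beta=0.5)
```

Setting α = β = 0 falls back to the purely global view, while α = 1 or β = 1 uses only the social or only the spiritual context.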
Excursion: Scoring in Text Retrieval
General scoring framework, combining for each query term t and item i:
• Importance of t in the collection (the less frequent, the better)
• Importance of t for item i (the more frequent, the better)
Linear combination for query scores
Hand-tuned instance: Okapi BM25
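As an illustration of the hand-tuned instance, here is a sketch of one common BM25 variant. The parameter defaults k1 = 1.2 and b = 0.75 and the "log(1 + …)" form of the collection-importance factor are conventional choices, not values taken from the slides.

```python
import math


def bm25(query, doc_tf, doc_len, avg_len, df, n_docs, k1=1.2, b=0.75):
    """Score one document for a query with a common BM25 variant."""
    score = 0.0
    for t in query:
        tf = doc_tf.get(t, 0)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        # Importance of t in the collection: rarer terms get a larger weight.
        idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
        # Importance of t for the document: grows with tf, but saturates,
        # and is dampened for documents longer than average.
        norm = tf + k1 * (1 - b + b * doc_len / avg_len)
        score += idf * tf * (k1 + 1) / norm
    return score


# Hypothetical collection statistics.
doc_freq = {"rare": 1, "common": 90}
s_rare = bm25(["rare"], {"rare": 2}, 10, 10, doc_freq, 100)
s_common = bm25(["common"], {"common": 2}, 10, 10, doc_freq, 100)
s_rare_more = bm25(["rare"], {"rare": 3}, 10, 10, doc_freq, 100)
```

The query score is the sum of the per-term scores, which is the linear combination mentioned on the slide.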
Towards a User-specific Score
• Starting point: global friendship strength
• Convert into user-specific social frequency
• Compute user-specific social score [SIGIR 2008]
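The conversion step can be sketched as a friendship-weighted sum of per-user tag frequencies. This form is an assumption made for illustration (the formula itself is not preserved in this transcript); `social_frequency` and the example data are hypothetical.

```python
def social_frequency(strength_of, tf_u, item, tag):
    """sf_u(i, t): per-user frequencies tf_u'(i, t) weighted by the
    friendship strength of u' from the querying user's point of view."""
    return sum(w * tf_u.get((v, item, tag), 0) for v, w in strength_of.items())


# Hypothetical strengths for one querying user and per-user tag frequencies.
strengths = {"a": 0.1, "b": 0.3, "c": 0.6}
per_user_tf = {
    ("b", "book1", "travel"): 1,
    ("c", "book1", "travel"): 2,
}
sf = social_frequency(strengths, per_user_tf, "book1", "travel")
```

With equal strengths for all users, this reduces to a scaled version of the global tf(i,t); the social frequency can then take the place of tf in a text-retrieval style scoring function.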
Including Tag Expansion
Problem: Users use different tags for similar things, which leads to poor recall (missing relevant results)
Example: MPI, MPII, MPI-INF, MPI-CS, Max-Planck-Institut, D5, AG5, DB&IS, MMCI, UdS, Saarland University, …
Solution:
1. Define notion of similar tags
2. Expand queries with similar tags
3. Modify scoring function for expanded queries
Heuristics for finding similar tags
• Specialization heuristics: Tag t2 is a specialization of t1 if t1 occurs (almost) whenever t2 occurs. Example: t1=Europe, t2=Germany
• Co-occurrence heuristics: Tags t1 and t2 are similar if they occur (almost) always together
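Both heuristics can be expressed over the item sets items(t). The sketch below uses a conditional fraction for specialization and the Jaccard coefficient for co-occurrence; these concrete measures are assumptions, since the slides only describe the heuristics informally. The item ids are hypothetical.

```python
def specialization(t1, t2, items_of_tag):
    """Fraction of t2's items that also carry t1.

    A value close to 1 suggests that t2 is a specialization of t1.
    """
    a, b = items_of_tag[t1], items_of_tag[t2]
    return len(a & b) / len(b) if b else 0.0


def cooccurrence(t1, t2, items_of_tag):
    """Jaccard overlap of the item sets; close to 1 if the tags (almost)
    always occur together."""
    a, b = items_of_tag[t1], items_of_tag[t2]
    return len(a & b) / len(a | b) if (a | b) else 0.0


# Hypothetical item sets per tag.
tagged_items = {
    "Europe": {1, 2, 3, 4},
    "Germany": {1, 2},
    "Deutschland": {1, 2},
}
```

Note that specialization is asymmetric (Germany specializes Europe, not the reverse), while co-occurrence is symmetric.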
Scoring Expanded Queries
Naive approach: For query tag t, add similar tags t' with sim(t,t') > δ to the query
But:
• "transportation disaster" is expanded by "train car bus plane …"
• "international crime" is expanded by "mafia camorra yakuza …"
Result quality drops due to topic drift
Better: auto-tuning incremental expansion. For query tag t, consider only the expansion with the highest combined score per item
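The "highest combined score per item" idea can be sketched as follows. It is assumed here that the combined score of an expansion t' is sim(t,t') times the item's score for t', and that the original tag counts with similarity 1; the function name and the numbers are hypothetical.

```python
def expanded_score(item, query, sim, score):
    """Per query tag, keep only the best-scoring expansion for this item,
    then sum over the query tags."""
    total = 0.0
    for t in query:
        # Candidate tags: the similar tags plus the original tag itself.
        candidates = {**sim.get(t, {}), t: 1.0}
        total += max(w * score.get((item, t2), 0.0) for t2, w in candidates.items())
    return total


# Hypothetical similarities and per-tag item scores.
similar = {"transportation": {"train": 0.8, "car": 0.7}}
tag_score = {
    ("doc1", "train"): 0.5,
    ("doc1", "car"): 0.6,
    ("doc1", "disaster"): 0.4,
    ("doc2", "transportation"): 0.3,
}
s_doc1 = expanded_score("doc1", ["transportation", "disaster"], similar, tag_score)
s_doc2 = expanded_score("doc2", ["transportation", "disaster"], similar, tag_score)
```

Because only one expansion per query tag contributes, an item that matches many expansions of "transportation" but not "disaster" cannot outscore items that cover both query tags merely through the number of matching expansions.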
Experimental Evaluation: Effectiveness
Systematic evaluation of result quality is difficult
Three possible setups:
• Manual queries + human assessments
• Queries + assessments derived from external info (e.g., DMOZ categories)
• Automated assessments from context of user
  • Items tagged by friends
  • Items tagged in the future?
Prototype [VLDB/SIGIR 2008 demo]
Preliminary User Study
LibraryThing user study [Data Engineering Bulletin, June 2008]:
• 6 LibraryThing users with reasonably large library and friend sets
• Overall 49 queries like "mystery magic", "wizard", "yakuza"
• Crawled (part of) LibraryThing: ~1.3 million books, ~15 million tags, ~12,000 users, ~18,000 friends
• Measured NDCG[10] for different weights of the spiritual and the social (α) friendship strength
• Result quality generally very high
• Combination of spiritual and social friends is best
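For readers unfamiliar with the metric, NDCG at rank 10 can be sketched as below. This is one common formulation (gain divided by log2 of rank + 1, ideal ranking built from the same assessed results); the study's exact variant is not stated in the slides.

```python
import math


def dcg(gains, k):
    """Discounted cumulative gain of the first k results (rank 1 is index 0)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg(gains, k=10):
    """DCG normalized by the DCG of the ideal ordering of the same gains."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0
```

Here `gains` is the list of relevance grades of the returned results in ranked order, for example `ndcg([3, 2, 0, 1])` for a four-item result list.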
Outline
• Search in Social Tagging Networks
• Effective Query Scoring
• Efficient Query Evaluation
  • Threshold Algorithms
  • ContextMerge
  • Experimental Evaluation
• Summary & Further Challenges
Algorithmic Overview
• Input: query q={t1…tn} for user u, parameters α, β
• Output: k items with highest scores
• Goals:
  • Avoid computing all results
  • Minimize disk I/O and CPU load
  • Utilize precomputed information on disk
[Figure: example query "harry potter" issued by a user]
Excursion: Threshold Algorithms for Text IR
Input:
• query q={t1…tn}
• lists L(tp) with pairs <i,score(i,tp)>, sorted by descending score(i,tp)
Output: k items with highest aggregated score
Family of Threshold Algorithms:
• scan lists in parallel
• maintain partial candidate results with score bounds
• terminate as soon as top-k results are stable
Example: Top-1 for 2-term query (NRA) [Figure: step-by-step scan of the two lists L1 and L2. Each step shows the current top-1 item, the threshold min-k, and the candidates (A, D, G, and the unseen item "?") with their score intervals [lower bound; upper bound]. Once no unseen item can beat min-k, no more new candidates are considered. In the last step min-k is 1.3 and the algorithm safely terminates.]
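The NRA scheme of the example can be sketched as follows. It is a simplified version under stated assumptions: scores are aggregated by summation, lists are read round-robin, an item missing from a list counts as score 0 there, and the two example lists are hypothetical data loosely modeled on the figure, not the slide's actual lists.

```python
def nra_top_k(lists, k):
    """No-random-access top-k over lists of (item, score), sorted descending."""
    m = len(lists)
    seen = {}            # item -> {list index: score read so far}
    top, worst = [], {}
    depth = 0
    max_depth = max((len(l) for l in lists), default=0)
    while depth < max_depth:
        for j, l in enumerate(lists):      # one sorted access per list
            if depth < len(l):
                item, s = l[depth]
                seen.setdefault(item, {})[j] = s
        depth += 1
        # high[j]: upper bound for any score still unread in list j.
        high = [l[depth - 1][1] if depth < len(l) else 0.0 for l in lists]
        worst = {i: sum(sc.values()) for i, sc in seen.items()}
        best = {i: worst[i] + sum(high[j] for j in range(m) if j not in sc)
                for i, sc in seen.items()}
        top = sorted(worst, key=worst.get, reverse=True)[:k]
        if len(top) < k:
            continue
        min_k = worst[top[-1]]
        candidates_done = all(best[i] <= min_k for i in seen if i not in top)
        unseen_done = sum(high) <= min_k   # no unseen item can still qualify
        if candidates_done and unseen_done:
            break
    return [(i, worst[i]) for i in top], depth


# Hypothetical index lists for a 2-term query.
L1 = [("A", 0.9), ("G", 0.3), ("H", 0.25), ("D", 0.1)]
L2 = [("D", 1.0), ("E", 0.6), ("A", 0.4), ("G", 0.2)]
result, depth_read = nra_top_k([L1, L2], k=1)
```

In this run the scan stops before the lists are exhausted, because every other candidate's upper bound and the bound for unseen items have dropped to or below min-k.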
Can we reuse this here? No, scores are specific to the querying user and the parameter setting! [Figure: score lists for the tags harry and travel, one list per querying user and parameter setting; the settings shown are (0.2, 0.5), (0.0, 0.8), (1.0, 0.0), (0.0, 1.0), and (0.5, 0.5)] Number of lists to precompute would explode! (#tags × #users × parameter space)
Revisiting the Social Frequency
The social frequency sf_u(i,t) consists of a part that is independent of user u and a part that depends on user u. Compute sf_u(i,t) on the fly from tf(i,t), the friends of u, and their tagged documents
Top-K in Social Networks: ContextMerge
Precomputed lists:
• ITEMS(t): pairs <i,tf(i,t)>, sorted by descending tf(i,t) (such lists already exist in systems)
• USERITEMS(u',t): pairs <i,tf_u'(i,t)>, unsorted
• FRIENDS(u): pairs <u',F(u,u')>, sorted by descending F(u,u')
[Figure: example lists ITEMS(harry) with tf values 47, 32, 26, …; USERITEMS(u', harry); FRIENDS(u) with strengths 0.12, 0.10, 0.085, …]
ContextMerge
Adapted Threshold Algorithm for query u,t:
• Scan ITEMS(t) and FRIENDS(u) in parallel
• Pick the "best" list
  • If ITEMS(t): read next entry
  • If FRIENDS(u): read USERITEMS(u',t) for next friend u'
• Maintain candidates with bounds for min and max score, and current results
• For each candidate, compute the min score bound and the max score bound from the user-independent part and the user-specific part of sf
[Figure: scan of ITEMS(harry) (47, 32, 26, …) and FRIENDS(u) (0.12, 0.10, 0.085, …), with the known and still unknown ("?") user-independent and user-specific parts of sf for a candidate]
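The following is a simplified sketch in the spirit of ContextMerge, not the published algorithm. Assumptions: a single query tag; the score is g·tf(i,t) (user-independent part, with a hypothetical global weight g) plus the sum of F(u,u')·tf_u'(i,t) over friends (user-specific part); both lists are read round-robin instead of picking the "best" list; all data are invented.

```python
def context_merge(items_t, friends_u, useritems_t, g, k):
    """Top-k items for one tag, merging ITEMS(t) and FRIENDS(u) with bounds."""
    tf, social, friend_tf = {}, {}, {}
    top, lower, rounds = [], {}, 0
    n_rounds = max(len(items_t), len(friends_u))
    while rounds < n_rounds:
        if rounds < len(items_t):            # read next ITEMS(t) entry
            item, f = items_t[rounds]
            tf[item] = f
        if rounds < len(friends_u):          # read USERITEMS(u', t) of next friend
            friend, w = friends_u[rounds]
            for item, f in useritems_t.get(friend, {}).items():
                social[item] = social.get(item, 0.0) + w * f
                friend_tf[item] = friend_tf.get(item, 0) + f
        rounds += 1
        # Bounds for anything still unread in the two sorted lists.
        high_tf = items_t[rounds - 1][1] if rounds < len(items_t) else 0
        high_f = friends_u[rounds - 1][1] if rounds < len(friends_u) else 0.0
        lower, upper = {}, {}
        for item in list(tf) + [i for i in social if i not in tf]:
            tf_up = tf.get(item, high_tf)    # tf unknown: bounded by scan position
            rest = max(tf_up - friend_tf.get(item, 0), 0)  # taggings by unread users
            lower[item] = g * tf.get(item, 0) + social.get(item, 0.0)
            upper[item] = g * tf_up + social.get(item, 0.0) + high_f * rest
        top = sorted(lower, key=lower.get, reverse=True)[:k]
        if len(top) == k:
            min_k = lower[top[-1]]
            unseen_upper = (g + high_f) * high_tf
            if unseen_upper <= min_k and all(
                    upper[i] <= min_k for i in upper if i not in top):
                break
    return [(i, lower[i]) for i in top], rounds


# Hypothetical lists for tag "harry" and one querying user.
ITEMS = [("b1", 3), ("b2", 2), ("b3", 1), ("b4", 1)]
FRIENDS = [("f1", 0.6), ("f2", 0.3), ("f3", 0.1)]
USERITEMS = {"f1": {"b1": 2}, "f2": {"b2": 1}, "f3": {"b4": 1}}
merged, rounds_read = context_merge(ITEMS, FRIENDS, USERITEMS, g=0.1, k=1)
```

The returned scores are lower bounds, which suffices to identify the top-k items. In this run the scan stops after two of four possible rounds; a real implementation would also choose which list to advance based on the expected gain rather than alternating.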