Community-Centric Information Exploration on Social Content Sites Sihem Amer-Yahia and Cong Yu Yahoo! Research New York M3SN, Shanghai, China March 29th, 2009
Acknowledgements (thanks to Facebook) • Michael Benedikt, Oxford • Alban Galland, INRIA • Jian Huang, Penn State • Laks Lakshmanan, UBC • Julia Stoyanovich, Columbia
Social Content Sites (SCS) • Web destinations that let users: • Consume content-oriented information: videos, photos, news articles, etc. • Engage in social activities with their friends and with people of similar interests • Two major driving factors: • Incorporating social activities improves the attractiveness of traditional content sites • The “similar traveler” feature significantly increases how long users stay on Yahoo! Travel • Incorporating content is critical to the value of social networking sites • A significant amount of user time is spent browsing other people’s photos, notes, etc. • Privacy: important, but not the focus here
Main Challenges in SCS • Social Content Management • Data must be maintained and analyzed at web scale: the underlying social content graph can be enormous (see Edward Chang’s keynote later) • Information has to be integrated from various sources • Information Discovery • Current: pure content-based search or simple popularity/rating-based browsing • Future: effectively and efficiently leverage the underlying social graph in more sophisticated ways • Information Presentation • Current: a single ranked list based on relevance • Future: metrics other than relevance / groups and clusters / explanations (Information Discovery and Presentation together constitute Information Exploration)
Talk Outline • Emerging Trend of Social Content Sites (SCS) • Information Exploration on SCS • Community Centric Model • Information Discovery: Improving Search Performance through Clustering • Information Presentation: Diversification via Explanation • SocialScope and Jelly • Further Challenges • Conclusion
Information Exploration on SCS • Major Paradigms • User-based: browsing the content of your friends or other users you simply follow (Facebook, Twitter) • Search: content-based keyword matching (most traditional content sites) • Recommendations: hotlists, tag-based hotlists (Amazon, many collaborative tagging sites) • Query: database-style querying with complex conditions (not many so far) But which ones are more frequent?
Search vs. Recommendation • The majority of these searches are in fact recommendations in disguise! • Users are usually NOT looking for specific items • The majority of sessions are seeking recommendations with geographical and topical constraints • Ranking the paradigms (for neurotic people who insist on orders): “Recommendation” ~ User-based > Search > Query
Recommendation 2.0 (or 1.5) • Users demand a lot more from us • Be able to specify the topic: “what are the good destinations for a romantic trip in Asia?” • Be able to specify the community: “what’s hot among my tech-savvy friends?” • Be able to present the results in ways that maximize the user’s understanding of them • But we also know a lot more about the users and the content (through the users) • Richer user profiles • Friendships and relationships • Content activities (rating movies, tagging travel destinations, etc.) • Communication activities (IM, email, Twitter follows, Facebook wall messages, etc.)
Data Model: Social Content Graph • A social content graph is a logical graph structure where the labeled nodes represent users and objects, and the labeled edges represent relations between users and items, as well as activities users perform on items or other users. Both nodes and edges can have structural attributes. • (Example figure: people nodes such as John (id 101), Amy (id 102), and Joe (id 103), and destination nodes such as NYC (id 301, state NY) and Ann Arbor (id 302, state MI, college umich), connected by friend edges between people and visit edges from people to destinations, annotated with tags like “Yankees” and “football, fun, club”.)
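As a rough illustration of this data model (not the authors' implementation), the example graph can be encoded with a general-purpose graph library. The node ids and attribute values below follow the figure; the library choice (networkx), the attribute names, and the exact edge assignments are assumptions.

```python
# A minimal sketch of the social content graph from the example figure.
import networkx as nx

G = nx.MultiDiGraph()

# People nodes with structural attributes.
G.add_node(101, kind="people", name="John")
G.add_node(102, kind="people", name="Amy")
G.add_node(103, kind="people", name="Joe")

# Item (destination) nodes.
G.add_node(301, kind="destination", name="NYC", state="NY")
G.add_node(302, kind="destination", name="Ann Arbor", state="MI", college="umich")

# Relation edges (user-user) and activity edges (user-item); edges can carry
# attributes too, e.g. the tags used in a tagging/visiting activity.
G.add_edge(101, 103, key="friend")
G.add_edge(103, 301, key="visit", tags=["Yankees"])
G.add_edge(101, 302, key="visit", tags=["football", "fun", "club"])

# All destinations John (101) has visited:
visited = [v for _, v, k in G.out_edges(101, keys=True) if k == "visit"]
print(visited)  # [302]
```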
A Community-Centric Model • Communities of users as the central concept • Community: a group of “similar” or “related” users • Recommendations are performed per community and results from each community are assembled together in the end • A user can belong to multiple communities: • A vector-based representation: user:<c1:w1, c2:w2, ...> • Challenge I: Community Generation • Topic based • Relationship based • Activity based • Challenge II: Community Selection • Given a user session, determine the appropriate set of communities to work with
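A minimal sketch of the vector-based user representation and of assembling per-community results. The weighted-sum merge rule and all community/score values are assumptions; the slide only says results from each community are "assembled together" in the end.

```python
# Sketch: a user as a weighted vector over communities, plus a simple
# weighted merge of per-community recommendation lists.
user_communities = {"romantic_luxury": 0.7, "seaside_water": 0.3}

# Hypothetical per-community recommendation scores.
per_community_recs = {
    "romantic_luxury": {"Paris": 0.9, "Cancun": 0.8},
    "seaside_water": {"Honolulu": 0.95, "Cancun": 0.6},
}

assembled = {}
for community, weight in user_communities.items():
    for item, score in per_community_recs[community].items():
        # Assumed merge rule: community weight times per-community score.
        assembled[item] = assembled.get(item, 0.0) + weight * score

for item, score in sorted(assembled.items(), key=lambda kv: -kv[1]):
    print(item, round(score, 3))
```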
Data in Yahoo! Travel • Users • Yahoo! users • Destinations • Cities, attractions, etc. • Tagging activities: • Destination tags: editorial + custom • Self tags: mostly custom • Goal: • Given a user session (user + tags + region), recommend travel destinations to the user • Approach: • Discover topics for users, tags, and destinations based on tagging actions • Construct communities based on topics
Topic Discovery through LDA • Joint work with Jian Huang • Intuition • Each destination is modeled as a document consisting of all tags applied to it • Tags are treated as words in the document • Apply Latent Dirichlet Allocation [BNJ03] • A probabilistic generative model for document topic discovery • Produces the following conditional probabilities • Prob(w|t): given a latent topic, the probability that the tag word belongs to that topic • Prob(t|d): given a document, the probability that the document is about a certain latent topic • Prob(t|u): obtained by aggregating Prob(t|di) over the destinations di that user u has tagged
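A minimal sketch of the destination-as-document setup. The slide does not name an LDA implementation, so gensim is assumed here, and a plain average is used as the aggregation behind Prob(t|u); the tiny tag lists are made up for illustration.

```python
# Sketch of destination-as-document LDA (library and parameters assumed).
from gensim import corpora, models

# Each destination is a "document" made of the tags applied to it.
destination_tags = {
    "Cancun":    ["romantic", "beach", "spa", "honeymoon"],
    "Paris":     ["romantic", "architecture", "art", "history"],
    "Las Vegas": ["nightlife", "gambling", "casino", "exciting"],
    "San Diego": ["beach", "scuba", "summer", "fishing"],
}

docs = list(destination_tags.values())
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=4,
                      passes=50, random_state=0)

# Prob(t|d): topic mixture of each destination.
prob_t_given_d = {name: dict(lda.get_document_topics(bow, minimum_probability=0.0))
                  for name, bow in zip(destination_tags, corpus)}

# Prob(t|u): aggregate Prob(t|d_i) over the destinations d_i the user tagged.
# A plain average is used; the slide leaves the aggregation unspecified.
def prob_t_given_u(tagged_destinations):
    topics = {}
    for d in tagged_destinations:
        for t, p in prob_t_given_d[d].items():
            topics[t] = topics.get(t, 0.0) + p / len(tagged_destinations)
    return topics

print(prob_t_given_u(["Cancun", "Paris"]))
```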
Examples of Topic Discovery • Topic 0: Romantic/Luxury • Tokens: romantic, 4-star, shopping, spa, golf, luxury, honeymoon … • Top cities: Cancun, San Juan, Grand Canyon, Bangkok … • Topic 1: Seaside/Water-related • Tokens: beach, scuba, summer, diving, fishing, snorkeling, island, lake … • Top cities: San Diego, Honolulu, Chicago, Miami, Seattle … • Topic 2: Arts/Historical/Cultural • Tokens: architecture, sightseeing, art, history, culture, cathedral, castle … • Top cities: Paris, London, Boston, Istanbul, Amsterdam, Rome, Hong Kong … • Topic 3: Nightlife/City-life • Tokens: nightlife, gambling, wine, drinking, exciting, casino … • Top cities: Las Vegas, New York City, Los Angeles, San Francisco …
Community Generation • Identify the destinations belonging to each topic • For each topic t: D(t) = { d | Prob(t|d) > threshold } • Generate topical communities • D(u) = { d | user u visited destination d } • Interest-based topical community: Cinterest(t) = { u | |D(u) ∩ D(t)| / |D(u)| > threshold } • Expertise-based topical community: Cexpert(t) = { u | |D(u) ∩ D(t)| / |D(t)| > threshold }
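A small sketch of this community-generation step under the overlap-ratio reading above (the exact normalization is a reconstruction, not confirmed by the slide); D(t) comes from Prob(t|d) and D(u) from visit history.

```python
# Sketch of topical community generation from Prob(t|d) and visit history.
def topical_destinations(prob_t_given_d, topic, threshold=0.5):
    """D(t): destinations whose Prob(t|d) exceeds the threshold."""
    return {d for d, topics in prob_t_given_d.items()
            if topics.get(topic, 0.0) > threshold}

def interest_community(visits, d_t, threshold=0.5):
    """Users for whom topic-t destinations make up a large share of D(u)."""
    return {u for u, d_u in visits.items()
            if d_u and len(d_u & d_t) / len(d_u) > threshold}

def expert_community(visits, d_t, threshold=0.5):
    """Users who have covered a large share of D(t)."""
    return {u for u, d_u in visits.items()
            if d_t and len(d_u & d_t) / len(d_t) > threshold}

# Toy inputs: D(u) per user, and Prob(t|d) as produced by the LDA step.
visits = {"john": {"Cancun", "Paris"}, "amy": {"San Diego", "Las Vegas"}}
prob_t_given_d = {"Cancun": {0: 0.9}, "Paris": {0: 0.8},
                  "San Diego": {0: 0.1}, "Las Vegas": {0: 0.2}}

d_t = topical_destinations(prob_t_given_d, topic=0)
print(interest_community(visits, d_t), expert_community(visits, d_t))
```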
Community Selection • Each user session consists of three pieces of information: (user, tag, region) • Region is treated as a Boolean constraint • We combine user and tag into a single topical vector: Prob(t|(u,w)) = w1·Prob(t|u) + w2·(Prob(w|t)·Prob(t)/Prob(w)) • A community C(t) is selected if Prob(t|(u,w)) is greater than a threshold • The choice between interest-based and expertise-based communities is decided heuristically • User is an expert: leverage the interest-based community • User is not an expert: leverage the expertise-based community
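A compact sketch of the selection formula from the slide; Prob(t|w) is expanded via Bayes' rule as shown, while the weights w1, w2, the 0.3 cutoff, and all numbers are illustrative assumptions.

```python
# Sketch of the session-to-topic mapping Prob(t|(u,w)).
def prob_t_given_session(prob_t_given_u, prob_w_given_t, prob_t, prob_w,
                         w1=0.5, w2=0.5):
    topics = set(prob_t_given_u) | set(prob_t)
    return {t: w1 * prob_t_given_u.get(t, 0.0)
               + w2 * (prob_w_given_t.get(t, 0.0) * prob_t.get(t, 0.0) / prob_w)
            for t in topics}

def select_communities(scores, threshold=0.3):
    return {t for t, p in scores.items() if p > threshold}

scores = prob_t_given_session(
    prob_t_given_u={0: 0.6, 1: 0.2},    # the user's topic vector
    prob_w_given_t={0: 0.4, 1: 0.05},   # Prob(w|t) for the session tag w
    prob_t={0: 0.3, 1: 0.3},            # topic priors
    prob_w=0.2,                         # Prob(w)
)
print(select_communities(scores))  # {0}
```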
Talk Outline • Emerging Trend of Social Content Sites (SCS) • Information Exploration on SCS • Community Centric Model • Information Discovery: Improving Search Performance through Clustering • Information Presentation: Diversification via Explanation • SocialScope and Jelly • Further Challenges • Conclusion
Querying Items within a Social Network • Problem Definition: • Given a user and a query, retrieve items relevant to the query based only on the user’s network • The score of an item depends on both the keyword and the user • E.g., score(u,q,i) = cnt(v) such that tag(v,i,q) and friend(u,v), i.e., the number of u’s friends who tagged item i with keyword q • A common scenario on social tagging sites like del.icio.us • Naïve Approach • One inverted index for each (user, keyword) pair • Apply top-k threshold algorithms at query time • Optimal efficiency • But high storage overhead
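For concreteness, a tiny sketch of the network-aware score above, assuming taggings are stored as (user, item, tag) triples (a data layout chosen here purely for illustration).

```python
# score(u, q, i) = number of friends v of u who tagged item i with keyword q.
taggings = {("jane", "i1", "music"), ("ann", "i1", "music"),
            ("ann", "i2", "music"), ("chris", "i1", "news")}
friends = {"john": {"jane", "ann"}}

def score(u, q, i):
    return sum(1 for v in friends.get(u, set()) if (v, i, q) in taggings)

print(score("john", "music", "i1"))  # 2
```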
(Figure: example per-user inverted lists, one per (user, tag) pair, e.g. Jane’s and Ann’s lists for the tags “music” and “news”, each holding (item, score) entries.) Space Overhead of Per-User Indexing • Conservative example: 100K users, 1M items, 1K tags • 20 tags/item from 5% of the taggers • 10 bytes per inverted list entry • 1M items × (5% of 100K users) × 20 tags × 10 bytes ≈ 10^12 bytes: a 1-Terabyte index!
Clustering to the Rescue! • Joint work with Michael Benedikt, Laks Lakshmanan, Julia Stoyanovich • Clustered Approach • Group users into communities based on shared behavior • Shared items • Shared items with the same tags • One inverted index for each (group, keyword) • Replace the item score with a score upper bound (over all users in the group) • Sacrifice a bit of query performance for index space efficiency • Community-based (seeker clustering): • Groups are keyword-independent • E.g., shared items, shared tags • Behavior-based (tagger clustering): • Groups are keyword-specific • E.g., Jane may be similar to John on “travel”, but very different from him on “sports”.
User Clustering (figure) • Community-based: one user, one cluster; one inverted index per cluster; upper-bound score per item • Behavior-based: one user, many clusters (e.g., Jane, Chris, Sarah, and Ann grouped differently per keyword); one inverted index per cluster; upper-bound score per item
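A sketch of how a clustered (group, keyword) index with per-item upper bounds could be queried: scan the cluster's list in upper-bound order, compute exact scores only for entries that can still make the top-k, and stop early, threshold-algorithm style. The data, the cluster assignment, and the exact-score lookup are illustrative assumptions (in practice the exact score would be computed from the user's network at query time rather than stored).

```python
from collections import defaultdict

# Exact per-user scores: conceptually the per-user index we want to avoid storing.
exact = {("jane", "music"): {"i1": 63, "i2": 50},
         ("ann", "music"):  {"i1": 60, "i2": 73}}
cluster_of = {"jane": "c1", "ann": "c1"}

# Build one inverted list per (cluster, keyword) with score upper bounds.
clustered = defaultdict(dict)
for (user, kw), items in exact.items():
    c = cluster_of[user]
    for item, s in items.items():
        clustered[(c, kw)][item] = max(s, clustered[(c, kw)].get(item, 0))

def top_k(user, kw, k=1):
    # Scan postings in decreasing upper-bound order; stop once the current
    # k-th exact score already beats the next upper bound.
    results = []
    postings = sorted(clustered[(cluster_of[user], kw)].items(),
                      key=lambda kv: -kv[1])
    for item, ub in postings:
        if len(results) >= k and results[k - 1][1] >= ub:
            break
        results.append((item, exact[(user, kw)].get(item, 0)))
        results.sort(key=lambda kv: -kv[1])
    return results[:k]

print(top_k("ann", "music"))  # [('i2', 73)]
```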
Talk Outline • Emerging Trend of Social Content Sites (SCS) • Information Exploration on SCS • Community Centric Model • Information Discovery: Improving Search Performance through Clustering • Information Presentation: Diversification via Explanation • SocialScope and Jelly • Further Challenges • Conclusion
Information Presentation • A broad definition: Given the set of relevant items, identify the appropriate subset of items to be shown to the user(s), in the right organization and with the right amount of information. • Appropriate Subset: • Timeliness • Geographical closeness • Diversity • Etc. • Right Organization: • Ranking • Grouping • Faceted Navigation • Right Amount of Information: • Explanation
A Case Study in Recommendation • Joint work with Laks Lakshmanan • While relevance is important to recommendation, other properties are critical too: • Novelty: avoid returning results that users are likely to know already. • Serendipity: aim to return less relevant results that might give users a pleasant surprise. • Diversity: avoid returning results that are too similar to each other. • Recommendation Diversification: from the pool of candidate items, identify a list of items that are dissimilar to each other while maintaining high cumulative relevance, i.e., strike a good balance between relevance and diversity.
Existing Solutions for Diversification • Attribute-Based Diversification • Diversity semantics: pair-wise distance functions based on item attributes (e.g., movie attributes). • Combining with relevance: • Threshold either relevance or distance and maximize the other • Optimize an overall score that is a weighted combination of relevance and distance • Algorithm • Perform traditional recommendation • Obtain the attributes of each candidate item and compute the pair-wise distances • Ad-hoc selection methods follow, depending on how diversity and relevance are combined (a sketch follows below) • A Major Problem: • Lack of attributes suitable for estimating the distance between pairs of items: e.g., URLs on del.icio.us, photos on Flickr, videos on Vimeo.
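As one concrete instance of the weighted-combination approach (not necessarily one of the exact methods surveyed above), a greedy MMR-style re-ranker that trades relevance against similarity to already-selected items; the genre attributes, weights, and scores are made up.

```python
# Greedy diversification: pick the item maximizing
#   lambda * relevance - (1 - lambda) * max similarity to already-picked items.
def diversify(candidates, relevance, distance, k=3, lam=0.7):
    selected = []
    while candidates and len(selected) < k:
        def gain(i):
            penalty = max((1 - distance(i, j) for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * penalty
        best = max(candidates, key=gain)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy attribute-based distance: 1 - Jaccard similarity of genre sets.
genres = {"m1": {"action"}, "m2": {"action"}, "m3": {"comedy"}, "m4": {"drama"}}
rel = {"m1": 0.9, "m2": 0.85, "m3": 0.6, "m4": 0.5}
jaccard = lambda a, b: len(genres[a] & genres[b]) / len(genres[a] | genres[b])
print(diversify(list(rel), rel, lambda i, j: 1 - jaccard(i, j)))  # ['m1', 'm3', 'm4']
```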
Explanation-Based Diversification • Intuition • The explanation of a recommended item is the set of objects because of which that item was recommended to the user. • Two items that share similar explanations are likely to be similar to each other. • Explanation for item-based strategies: the items the user has already acted on that are similar to the recommended item. • Explanation for collaborative filtering strategies (social!): the users similar or connected to the seeker whose actions led to the recommendation.
Explanation-Based Diversity • Pair-wise diversity distance between two recommended items, computed over their explanation sets • Standard similarity measures such as Jaccard similarity and cosine similarity • E.g., distance based on Jaccard similarity: dist(i1, i2) = 1 − |Expl(i1) ∩ Expl(i2)| / |Expl(i1) ∪ Expl(i2)| • Diversity of the set of recommended results S: aggregated from the pairwise distances (e.g., their average over all pairs in S)
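A small sketch of explanation-based diversity with Jaccard distance; using the average over all pairs as the set-level aggregate is an assumption, as are the example explanation sets.

```python
from itertools import combinations

def jaccard_distance(e1, e2):
    # 1 - Jaccard similarity of two explanation sets.
    return 1.0 - len(e1 & e2) / len(e1 | e2) if (e1 | e2) else 0.0

def set_diversity(explanations):
    # Average pairwise explanation distance over the recommended set S.
    pairs = list(combinations(explanations, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(explanations[a], explanations[b])
               for a, b in pairs) / len(pairs)

# Explanations: e.g., the friends whose taggings caused each recommendation.
explanations = {"Paris": {"jane", "ann"}, "Cancun": {"jane"}, "Tokyo": {"chris"}}
print(set_diversity(explanations))
```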
Benefits and Practicality of Explanation-Based Diversification • Applicable to items without attributes or whose attributes are difficult to analyze • Common on social content sites • Explanations are by-products of many recommendation processes • They can be maintained with little overhead See our poster on Monday for details
Talk Outline • Emerging Trend of Social Content Sites (SCS) • Information Exploration on SCS • Community Centric Model • Information Discovery: Improving Search Performance through Clustering • Information Presentation: Diversification via Explanation • SocialScope and Jelly • Further Challenges • Conclusion
SocialScope Platform Architecture (figure) • Content Management layer: Content Integrator, Data Manager, and Activity Manager maintain the social content graph and activities, integrating sources such as Facebook, Y! IM, and Y! Sports through OpenSocial APIs • Information Discovery layer: Social Content Analyzer and Social Query Evaluator process queries and results • Information Presentation layer: Social Content Ranker, Social Content Grouper, and the User Interface deliver results to users and content admins • Important information flow: (1) raw/derived social nodes and links, (2) social content sub-graph, (3) relevant social content sub-graph, (4) relevant social nodes and links
A Graph Based Logical Algebra Framework • Designed for information discovery on social content sites • Aim to provide a declarative way of specifying analysis and query tasks • Uniformity and flexibility • Opportunities for performance optimization • Basic operators: • Node Selection (σN), Link Selection (σL) • Composition, Semi-Join • Node Aggregation, Link Aggregation • Details in [CIDR09]
A Simple Search Task: find John’s friends who visited Denver, and their activities • (1) Select John’s friends • (2) Select people who visited Denver • (3) Combine the two: John’s friends who visited Denver • (4) Retrieve their activities (a sketch follows below)
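A toy sketch of the same task as a pipeline of selection and semi-join-like steps over an edge list; the operator names echo the SocialScope algebra, but the encoding and data are purely illustrative, not the actual implementation.

```python
# "John's friends who visited Denver, and their activities" as a pipeline.
nodes = {"john": "people", "jane": "people", "ann": "people",
         "denver": "destination", "nyc": "destination"}
links = [("john", "friend", "jane"), ("john", "friend", "ann"),
         ("jane", "visit", "denver"), ("ann", "visit", "nyc"),
         ("jane", "review", "denver")]

def link_select(links, label):                       # sigma_L: keep edges by label
    return [l for l in links if l[1] == label]

def targets(links, source):
    return {dst for src, _, dst in links if src == source}

# (1) John's friends.
friends = targets(link_select(links, "friend"), "john")
# (2) People who visited Denver.
visitors = {src for src, _, dst in link_select(links, "visit") if dst == "denver"}
# (3) Semi-join: John's friends who visited Denver.
friends_visited = friends & visitors
# (4) Their activities (all non-friend edges they originate).
activities = [l for l in links if l[0] in friends_visited and l[1] != "friend"]
print(friends_visited, activities)
```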
Jelly: A Language over Social Content Sites • Designed with a focus on community-centric information exploration, i.e., the most useful class of applications • A restricted implementation of the SocialScope algebra • Based on a nested relational model, instead of the full graph model • Built-in primitives for topic and community generation • Topic Generation, Community Extraction • Recommendation Generation • Group Generation, Explanation Generation
A Simple (Incomplete) Example
• Topic Generation and Community Extraction:
  generate topics for item into topics
    from tagging R
    using LDA (seed, th=0.8)
    seed R.item group R.tag weight-with count()
  generate communities into experts
    from topics T, tagging R
    where T.*.item = R.item
    using jaccard-similarity (seed, th=0.7)
    seed (R.user, T.topic) list R.item
• Recommendation Generation:
  generate recommendations into candidates
    given user u, query q
    from experts T
    where Selected (T.topic, u, q)
    using count-users (seed)
    seed T.*.item list T.user
• Information Presentation:
  generate explanations into results
    from candidates C, tagging T
    where C.item = T.item
    using identity (seed)
    seed C.item list T.user weight-with count()
Talk Outline • Emerging Trend of Social Content Sites (SCS) • Information Exploration on SCS • Community Centric Model • Information Discovery: Improving Search Performance through Clustering • Information Presentation: Diversification via Explanation • SocialScope and Jelly • Further Challenges • Conclusion
Scalability Challenges • Social content graphs are large • Tens of millions of users • Millions (movies, destinations) or billions (web links) of items • Challenge 1: being able to analyze the graph efficiently • We need to design Map-Reduce frameworks that are suitable for these analyses • Activities are generated at a very high rate • Millions of status updates per day (Facebook, Twitter) • A similar volume of content activities (browsing, tagging, etc.) • Challenge 2: being able to incrementally incorporate new activities into the existing model
Semantic Challenges and Questions • Connections are multi-dimensional • Two main categories on the surface: explicit (friendship) vs. implicit (shared activities) • However, not all friendship links are the same and not all activities are the same • Challenge: for a given user on a given topic, identify the appropriate set of explicit and implicit links • Expert- versus interest-based communities are one step toward that goal • Temporal, geographical, and other external influences • How does a community evolve as time goes on and as members move around? • Social graph interactions • How does one social graph (say Facebook) impact another (say MySpace)?
Conclusion • The emerging trend of social content sites presents many challenges • Social information integration • Information discovery • Information presentation • The notions of communities and topics are crucial for effective and efficient information exploration on these sites • Community discovery and analysis • Community-based information exploration • Lots more! • At Yahoo!, we are focusing on building systems that can address these challenges