Xin Li, Lei Guo, Eric Zhao Yahoo! International Social Search Tag-based Social Interest Discovery
Internet Social Networks Are Emerging! • Internet social networks are self-organized by online users • Del.icio.us, Facebook, Flickr, MySpace, YouTube • Users are driven by their interests • Fetch and bookmark contents • Create new contents • Share contents • Interest discovery is crucial to a social network • Discover interests of users in different contents • Locate users with similar interests • Link people with similar interests to form communities
Important Features of Social Networks • Organize users and contents • Cluster users into communities • Categorize contents into interesting topics • Provide search functions • Given a topic, locate all matching contents and all users that are interested in the topic • Given a user, locate all his fetched/created contents and the topics of his interests • Given a user, locate all other users that have similar interests
The Problem: Social Interest Discovery • Questions to answer • How to discover a user’s interests based on his fetched/created contents? • How to use individual users’ interests to find interesting topics shared by users? • How to use the topics to create interest-based user communities?
Existing Solutions and Limitations • User-centric • Using the social network graph to discover users with common interests • Problem: online/offline user connections are hard to identify • Object-centric • Detect common interests based on the common objects fetched by users • Problem: discovered interests are object-based, non-descriptive and implicit • Predefined categorization • Not flexible, cannot catch the most recent popular or hot user interests • Cannot reflect various user interest groups, which may keep changing over time
Our approach • Leverage user-generated tags • Compute frequent co-occurrences of tag patterns • Use the tag patterns as topics of interests • Cluster users and content around the topics to build communities
Overview • Motivation and Problem • Analysis of tags in a social network • ISID system design • Evaluation • Conclusion
Tags in Social Networks • User-generated labels for annotating the contents • Descriptive, summary, reflecting human judgment • Meta data between users and contents • Widely used in social networks • Del.icio.us: http://del.icio.us/help/tags • YouTube: http://www.google.com/support/youtube/bin/answer.py?hl=en&answer=55769 • Facebook: http://www.facebook.com/help.php?hq=tag
del.icio.us Social Network • A pioneering social bookmarking system • http://del.icio.us/ • Our Data Set • Dump for a limited period of time • 4.3 M public, tagged bookmarks, 0.2 M users, 1.4 M bookmarked URLs
URL Popularity Follows a Power Law • Figure: the distribution of URL bookmarking frequency • Most URLs are unpopular
User Activity Follows a Heavy-Tailed Distribution • Figure: the distribution of user bookmarking frequency • Most users are not very active
Tag Vocabulary • Figures: tag coverage for tf keywords; tag coverage for tf-idf keywords • User tags missed ≤ 20% of tf keywords for ≥ 98% of documents and ≤ 10% of tf-idf keywords for ≥ 90% of documents • Tags covered the most important keywords • But the total number of unique tags is ~10x smaller than the number of keywords
Tag Convergence • The total number of different tags users can use for a given document is limited, no matter how popular the URL is
Tags Capture Concepts of Contents • Nearly 50% of all URLs have tag match ratio 1 • 70% of all URLs have a tag match ratio > 0.5 • Only 10% of the URLs have no matched tags
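The slides do not define the tag match ratio. As a rough sketch only, assume it is the fraction of a URL's tags that also appear as words in the bookmarked page. The function name `tag_match_ratio` and the unweighted formula are assumptions made for illustration, not the authors' definition.

```python
def tag_match_ratio(tags, page_words):
    """Fraction of a URL's tags that also occur among the page's words.

    Assumption: unweighted, case-insensitive exact matching.
    """
    tags = {t.lower() for t in tags}
    if not tags:
        return 0.0
    words = {w.lower() for w in page_words}
    return len(tags & words) / len(tags)


# Toy example: both tags occur in the page text, so the ratio is 1.
ratio = tag_match_ratio({"python", "tutorial"}, ["A", "Python", "tutorial"])
```

Under this assumed definition, a ratio of 1 means every tag on the URL is found in its content, and a ratio of 0 means none is.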
From Tags to User Interests • Bookmarks reflect user interests • Tags summarize/describe bookmarked contents • Meta data between users and contents • Connect users and bookmarked contents • Frequently used tag patterns reflect user interests • The key is the co-occurrences of tags
Overview • Motivation and Problem • Analysis of tags in a social network • ISID system design • Evaluation • Conclusion
System Design • Find topics of interests • For a given set of tagged bookmarks, find all topics of interests, i.e., frequent co-occurrences of tags • Clustering • For each topic, find all the URLs and the users such that those users have labeled each of the URLs with all the tags in the topic. • Indexing • Import the topics and their user and URL clusters into an indexing system for application queries.
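The clustering step above can be sketched as follows. This is a minimal illustration, assuming each bookmark is a (user, URL, tags) record; the `Post` type and `cluster_for_topic` function are hypothetical names, not part of ISID.

```python
from collections import namedtuple

# Hypothetical record type for one tagged bookmark (not from the original system).
Post = namedtuple("Post", ["user", "url", "tags"])


def cluster_for_topic(posts, topic):
    """Collect users and URLs from posts whose tags include every tag in `topic`."""
    topic = frozenset(topic)
    users, urls = set(), set()
    for post in posts:
        if topic <= frozenset(post.tags):
            users.add(post.user)
            urls.add(post.url)
    return users, urls


posts = [
    Post("alice", "http://example.com/a", {"python", "programming", "tutorial"}),
    Post("bob", "http://example.com/a", {"python", "programming"}),
    Post("carol", "http://example.com/b", {"web", "design"}),
]
users, urls = cluster_for_topic(posts, {"python", "programming"})
```

Here the topic {python, programming} yields a cluster with two users and one URL; the third post is left out because it carries neither tag.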
ISID Architecture • Data Source → (posts) → Topic Discovery → (topics, posts) → Clustering → (topics, clusters) → Indexing
Topic Discovery • Use association rule algorithms to discover co-occurring tag patterns • Originally invented for identifying items frequently bought together in supermarkets • E.g., bread and milk • Use a support number to define the frequency threshold • Efficient at finding frequent patterns in a large set of transactions for a given support number (threshold) • The rule-building part is not used • One more step: remove pattern A if A is a sub-pattern of some other pattern B, and both A & B have the same support number • To remove duplicate clusters
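A small sketch of the two steps on this slide: counting tag sets that reach a support threshold, then dropping any pattern that has a super-pattern with the same support. For brevity it enumerates tag combinations by brute force up to an assumed `max_size`, rather than using an optimized association rule algorithm; all function and parameter names are illustrative.

```python
from collections import Counter
from itertools import combinations


def frequent_tag_patterns(posts, min_support, max_size=3):
    """Return {tag set: support} for tag sets found in at least `min_support` posts."""
    counts = Counter()
    for tags in posts:
        unique = sorted(set(tags))
        for size in range(1, min(max_size, len(unique)) + 1):
            for combo in combinations(unique, size):
                counts[frozenset(combo)] += 1
    return {p: n for p, n in counts.items() if n >= min_support}


def drop_redundant_subpatterns(patterns):
    """Remove pattern A when a strict super-pattern B has the same support."""
    return {
        a: n
        for a, n in patterns.items()
        if not any(a < b and n == m for b, m in patterns.items())
    }


posts = [
    {"python", "programming", "tutorial"},
    {"python", "programming"},
    {"python", "programming", "web"},
    {"web", "design"},
]
patterns = frequent_tag_patterns(posts, min_support=2)
topics = drop_redundant_subpatterns(patterns)
```

In this toy data, "python" and "programming" each appear in three posts, but so does the pair {python, programming}, so the two single-tag patterns are dropped and only the pair and {web} remain as topics.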
Indexing • Find all URLs that belong to a topic, i.e., are tagged with the topic's set of tags • Find all users interested in a topic • Find all topics containing a tag • Find all topics for a user • Find all topics for a URL • Combination of the above
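The queries listed above could be served by simple inverted indexes. The sketch below assumes the clusters are held in memory as a dictionary from topic to (users, URLs); `build_indexes` is a hypothetical helper, and the slides do not say which indexing system ISID actually uses.

```python
from collections import defaultdict


def build_indexes(clusters):
    """Build tag -> topics, user -> topics and URL -> topics lookups.

    `clusters` maps each topic (a frozenset of tags) to a (users, urls) pair.
    """
    by_tag, by_user, by_url = defaultdict(set), defaultdict(set), defaultdict(set)
    for topic, (users, urls) in clusters.items():
        for tag in topic:
            by_tag[tag].add(topic)
        for user in users:
            by_user[user].add(topic)
        for url in urls:
            by_url[url].add(topic)
    return dict(by_tag), dict(by_user), dict(by_url)


clusters = {
    frozenset({"python", "programming"}): ({"alice", "bob"}, {"http://example.com/a"}),
    frozenset({"web"}): ({"alice"}, {"http://example.com/b"}),
}
by_tag, by_user, by_url = build_indexes(clusters)

# Combined query: topics of user "alice" that contain the tag "python".
alice_python = by_user.get("alice", set()) & by_tag.get("python", set())
```

The users and URLs of a topic are read directly from `clusters[topic]`; combined queries reduce to set intersections over the three lookups.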
Overview • Motivation and Problem • Analysis of tags in a social network • ISID system design • Evaluation • Conclusion
Content Similarity of Topic Clusters • Similarity of two documents • Inner product of tf-idf document vectors • Keyword-based vector • Tag-based vector (comparison) • Intra-topic similarity • Average cosine similarity of all document pairs • Inter-topic similarity • Similarity of two topics • Average similarity of one topic to all other topics
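The similarity measures above can be illustrated in plain Python. This sketch assumes raw term counts for tf, log(N/df) for idf, and, for two topics, the average cosine similarity over all cross-topic document pairs; the slides do not spell out these details, so treat them as assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """Turn token lists into sparse {term: tf * log(N / df)} vectors."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: c * math.log(n / df[t]) for t, c in Counter(doc).items()}
        for doc in docs
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def intra_topic_similarity(vectors):
    """Average cosine similarity over all document pairs within one topic."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs) if pairs else 0.0


def inter_topic_similarity(vectors_a, vectors_b):
    """Average cosine similarity over all document pairs across two topics."""
    pairs = [(u, v) for u in vectors_a for v in vectors_b]
    return sum(cosine(u, v) for u, v in pairs) / len(pairs) if pairs else 0.0


docs = [
    ["python", "code"],
    ["python", "code", "tutorial"],
    ["recipe", "cake"],
    ["cake", "baking"],
]
vecs = tfidf_vectors(docs)
topic_a, topic_b = vecs[:2], vecs[2:]
intra_a = intra_topic_similarity(topic_a)
inter_ab = inter_topic_similarity(topic_a, topic_b)
```

The same functions apply whether the tokens are page keywords or user tags, which is how the keyword-based and tag-based vectors can be compared.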
Inter- and Intra-Topic Similarity • Figures: tag-based (tf-idf); keyword-based (tf-idf) • Intra-topic similarity is significantly higher than inter-topic similarity • Tag co-occurrence clusters similar content well • Tag-based similarity is quite close to keyword-based similarity
Inter-topic Similarity • Figures: similarity of two topics with different numbers of overlapping tags, tag-based (tf-idf) and keyword-based (tf-idf) • Inter-topic similarity increases with the number of co-occurring tags • Tag co-occurrences capture similar contents
User Interest Coverage • 90% of users have ≥ 90% of their top 5 tags covered • 87% of users have ≥ 90% of their top 10 tags covered • 90% of users have ≥ 80% of their tags covered • The topics discovered by ISID capture the interests of users
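One way such a coverage figure might be computed is sketched below. It assumes a tag counts as "covered" when it appears in at least one discovered topic, and ranks a user's tags by how often the user applied them; both choices and the name `top_tag_coverage` are assumptions, since the slides do not define the measure.

```python
from collections import Counter


def top_tag_coverage(user_tags, topics, k):
    """Fraction of a user's k most-used tags that appear in any topic.

    `user_tags` lists every tag the user applied (with repeats);
    `topics` is an iterable of tag sets.
    """
    top = [tag for tag, _ in Counter(user_tags).most_common(k)]
    if not top:
        return 0.0
    topic_tags = set().union(*topics)
    return sum(tag in topic_tags for tag in top) / len(top)


user_tags = ["python", "python", "web", "cake"]
topics = [frozenset({"python", "programming"}), frozenset({"web"})]
coverage_top2 = top_tag_coverage(user_tags, topics, k=2)
coverage_top3 = top_tag_coverage(user_tags, topics, k=3)
```

For this toy user, both of the two most-used tags appear in some topic, while "cake" does not, so coverage falls when the third tag is included.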
Human Reviews • Scores: 1 = highly unrelated, 2 = unrelated, 3 = not sure, 4 = related, 5 = highly related • In human judgment, ISID indeed groups related URLs into the cluster for each topic defined by user tags
Cluster Properties • Cluster size follows a power law • User interests follow a power law • There exist really hot topics!
Cluster Properties • Most topics have fewer than 6 tags • Beyond 6, the number of clusters drops quickly
Overview • Motivation and Problem • Data and Their Properties • ISID system • Evaluation • Conclusion
Conclusion • Tags reflect human judgments on contents • Co-occurring tags are effective at representing user interests • Reflect human understanding of different but similar web contents • Consensus of judgments among users • ISID system • Topic discovery, Clustering, Indexing • Evaluation results are promising