Optimizing Web Search Using Social Annotations

Optimizing Web Search Using Social Annotations Shenghua Bao, Xiaoyuan Wu, Guirong Xue, Yong Yu Shanghai JiaoTong University Ben Fei, Zhong Su IBM China Research Lab WWW 2007

Introduction (1/3) • Two general aspects on improving web search • Ordering the web pages according to the query-document similarity • Ex: Anchor text generation, search log mining …etc • Ordering the web pages according to their qualities • Static ranking • Ex: PageRank, HITS …etc

Introduction (2/3) • Social annotation service (= social bookmarking) • Developed for web users to organize and share their favorite web pages online by social annotations • Emergent useful information that has been explored for folksonomy, visualization, semantic web, etc • Delicious

Introduction (3/3) • Utilizing social annotations for better web search from the two aspects: • Similarity ranking • The annotations provided by web users are usually good summaries (new metadata) of the corresponding web pages • The annotation data may be sparse and incomplete • SocialSimRank (SSR) algorithm • Static ranking • The amount of annotations assigned to a page indicates its popularity and implies its quality in some sense • Different annotation may have different weights in indicating the popularity of web pages • SocialPageRank (SPR) algorithm

Search with Social Annotation • Web page annotators provide cleaner data for users’ browsing • Similar or closely related annotations are usually given to the same web pages

Similarity Ranking between the Query and Social Annotations • Term-Matching based similarity ranking • suffers from the synonymy problem q={q1,q2,…, qn}, A(p)={a1,a2,…, am} • Social Similarity Ranking (SSR) Observation 1:Similar (semantically-related) annotations are usually assigned to similar (semantically-related) web pages by users with common interests. In the social annotation environment, the similarity among annotations in various forms can further be identified by the common web pages they annotated.

Illustration of SocialSimRank A(a)={ubuntu}, A(b)={linux,ubuntu}, A(c)={gnome,linux,ubuntu} P(ubuntu)={a,b,c}, P(linux)={b,c}, P(gnome)={c} MAP(ubuntu, a)=1, MAP(linux, b)=1, MAP(gnome, c)=2

Page Quality Estimation Using Social Annotations • Observation 2:High quality web pages are usually popularly annotated. Popular web pages, up-to-date web users and hot social annotations usually have the following relations: 1) popular web pages are bookmarked by many up-to-date users and annotated by hot annotations; 2) up-to-date users like to bookmark popular pages and use hot annotations; 3) hot annotations are used to annotate popular web pages and used by up-to-date users. • Notations MPU: NP × NU association matrix between pages and users MUA: NU × NA association matrix between users and annotations MAP: NA × NP association matrix between annotations and pages P0: vector of randomly initialized SocialPageRank scores

SocialPageRank Algorithm

Illustration of Quality Transition in the SPR Algorithm

Dynamic Ranking with Social Information • Dynamic ranking method • RankSVM • Features

Experiment Data (1/2) • Delicious data • 1,736,268 web pages and 269.566 annotations are crawled from Delicious during May, 2006. • Split compound annotations into standard words with the help of WordNet ex: java.programming  java, programming

Experiment Data (2/2) • Test set for dynamic ranking with social annotation • Manual query set (MQ) • 50 queries and their corresponding ground truths in Delicious data manually created by CS students • Pooling: judge the top 100 documents returned by Lucene • Automatic query set (AQ) from Open Directory Project (ODP) • Merging Delicious data with ODP and discarding ODP categories that contain no Delicious URLs • Randomly sample 3000 ODP categories and extract the category paths as the query set and the corresponding web pages ex: extract path TOP/Computer/Software/Graphics as “Computer Software Graphics” • 5-fold cross validation for each query set

Evaluation of Annotation Similarities Table. Explored similar annotations based on SocialSimRank

PageRank vs. Average Count

SPR vs. PageRank • SPR is normalized into a scale of 0-10 so that SPR and PageRank have the same number of pages in each grade from 0 to 10 • The pages with each PageRank value diversify a lot on the number of annotations and users • SPR successfully characterizes the web pages’ popularity degrees among web annotators

Results of Dynamic Ranking (1/2) Table. Comparison of MAP between similarity features

Results of Dynamic Ranking (2/2) Figure. NDCG at K for comparison of baseline, baseline+TM, baseline+SSR, baseline+PR, and baseline+SPR on query set AQ

Discussions • There are still several problems to further address • Annotation Coverage • The user submitted queries may not match any social annotation • Many web pages may have no annotations: 1) newly emerging web pages; 2) key-page-associated web pages while users tend to annotate key pages only; 3) uninteresting web pages. • Annotation Ambiguity • SSR may find the similar terms to the query terms while fail to disambiguate terms that have more than one meanings • Annotation Spamming • As social annotation becomes more and more popular, the amount of spam could drastically increase in the near future

Conclusion • The problem of integrating social annotations into web search is studied. • We observed that social annotations could benefit web search in both similarity ranking and static ranking. • The experimental results showed that SSR can successfully find the latent semantic relations among annotations and SPR can provide the static ranking from the web annotators perspective. • In the future, we would optimize the proposed algorithms and explore more sophisticated social features.

Optimizing Web Search Using Social Annotations