Using Social Annotations to Improve Language Model for Information Retrieval
Shengliang Xu, Shenghua Bao, Yong Yu (Shanghai Jiao Tong University)
Yunbo Cao (Microsoft Research Asia)
CIKM ’07 poster
Introduction
• Language modeling for IR (LMIR) has proved to be an efficient and effective way of modeling the relevance between queries and documents
• Two critical problems in LMIR: data sparseness and the term independence assumption (the query-likelihood form is recalled below)
• In recent years, many web sites providing folksonomy services have emerged, e.g. del.icio.us
• This paper explores the use of social annotations to address these two problems
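For reference, the standard query-likelihood LMIR formulation underlying the paper scores a document d by the probability of generating the query q, one term at a time:

    P(q|d) = ∏i P(qi|d)

The per-term estimates P(qi|d) suffer from data sparseness, and the product form is exactly the term independence assumption; social annotations are brought in to attack both.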
Properties of Social Annotations
• The keyword property
  • Social annotations can be seen as good keywords for describing the respective documents from various aspects
  • The concatenation of all the annotations of a document is a summary of the document from the users' perspective
• The structure property
  • An annotation may be associated with multiple documents, and vice versa
  • The structure of social annotations can be used to explore two types of similarity: document-document similarity and annotation-annotation similarity
Deriving Data from Social Annotations
• On the basis of social annotations, three sets of data can be derived (a minimal code sketch of the derivation follows):
  • A summary dataset sumann = {ds1, ds2, …, dsn}, where dsi is the annotation summary of the ith document
  • A document-similarity dataset simdoc = {(doci, docj, simscore_docij) | 0 ≤ i ≤ j ≤ n}
  • An annotation-similarity dataset simann = {(anni, annj, simscore_annij) | 0 ≤ i ≤ j ≤ m}
• For a triple t in simdoc or simann, t[i] denotes the ith dimension of t
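A minimal Python sketch of how these three datasets could be derived from raw (document, annotation) bookmark pairs; the similarity functions are placeholders for SSR/SMM introduced later, and all names here are illustrative, not the authors' code:

    from collections import defaultdict

    def derive_datasets(bookmarks, doc_sim, ann_sim):
        # bookmarks: iterable of (doc_id, annotation) pairs.
        # doc_sim / ann_sim: externally supplied similarity functions (e.g., SSR or SMM).
        doc_anns = defaultdict(list)
        for doc, ann in bookmarks:
            doc_anns[doc].append(ann)
        # Summary dataset: concatenate all annotations of each document
        sum_ann = {doc: " ".join(anns) for doc, anns in doc_anns.items()}

        docs = sorted(doc_anns)
        anns = sorted({a for _, a in bookmarks})
        # Pairwise similarity datasets, stored as triples as on the slide
        # (only i < j pairs are kept; self-pairs are trivial)
        sim_doc = [(di, dj, doc_sim(di, dj)) for i, di in enumerate(docs) for dj in docs[i+1:]]
        sim_ann = [(ai, aj, ann_sim(ai, aj)) for i, ai in enumerate(anns) for aj in anns[i+1:]]
        return sum_ann, sim_doc, sim_ann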
Language Annotation Model (LAM)
[Figure: Bayesian network for generating a term in LAM]
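The figure is not reproduced here. Reading the network together with the CM and AM slides that follow, one plausible reconstruction of LAM's term probability is a two-level linear interpolation; the assignment of the mixture weights λc, λa, λd from the Parameter Estimation slide to these levels is an assumption, not stated in this extract:

    P(qi|d, ds) = λd·Pcm(qi|d) + (1 − λd)·Pam(qi|ds)
    Pcm(qi|d)  = λc·Pcum(qi|d) + (1 − λc)·Ptcm(qi|d)
    Pam(qi|ds) = λa·Paum(qi|ds) + (1 − λa)·Padm(qi|ds)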
Content Model (CM)
• Content Unigram Model (CUM)
  • Matches the query against the literal content of a document
• Topic Cluster Model (TCM)
  • Matches the query against the latent topic of a document
  • Assumes that documents similar to document d more or less share d's latent topic
  • The term distribution over d's topic cluster can therefore be used to smooth d's language model (see the reconstruction below)
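Under this reading, a hedged reconstruction of the TCM estimate from the simdoc dataset; the similarity threshold τ is an assumption, and the paper may form clusters differently:

    cluster(d) = { d' | (d, d', s) ∈ simdoc, s ≥ τ }
    Ptcm(qi|d) = P(qi | concatenation of cluster(d))   (unigram model over the cluster, smoothed like CUM)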
Annotation Model (AM)
• AM consists of two sub-models: an independence model and a dependency model
• Annotation Unigram Model (AUM)
  • A unigram language model that matches query terms against the annotation summaries
• Annotation Dependency Model (ADM); a reconstruction of its form follows
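Combining the two quantities estimated on the next slide, P(qi|a) and P(a|ds), the natural reading of ADM (a reconstruction; the slide does not spell it out) is term generation mediated by annotations:

    Padm(qi|ds) = Σa P(qi|a) · P(a|ds)

This is where the term independence assumption is relaxed: a query term can match a document through a shared annotation even if it never occurs literally in the document's content.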
Parameter Estimation
• Five model probabilities {Pcum(qi|d), Paum(qi|ds), Ptcm(qi|d), P(qi|a), P(a|ds)} and three mixture parameters (λc, λa, λd) have to be estimated
• The EM algorithm is used to estimate λc, λa, and λd
• Dirichlet prior smoothing is used for CUM, AUM, and TCM (the standard form is given below)
• Ptcm(qi|d) is estimated with a unigram language model on the topic clusters
• P(a|ds) is approximated by maximum likelihood estimation
• An approximation is also given for P(qi|a)
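For reference, the standard Dirichlet prior smoothing applied to CUM, AUM, and TCM, where c(w; d) is the count of term w in d, |d| the document length, P(w|C) the collection model, and μ the Dirichlet prior:

    P(w|d) = ( c(w; d) + μ·P(w|C) ) / ( |d| + μ )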
Experiment Setup
• 1,736,268 web pages with 269,566 distinct annotations were crawled from del.icio.us
• 80 queries with 497 relevant documents were manually collected by a group of CS students
• Merged Source Model (MSM) serves as the baseline
  • Each document's annotations are merged into its content, and a Dirichlet prior smoothed unigram language model is built on the merged source (a minimal sketch follows)
• SocialSimRank (SSR) and the Separable Mixture Model (SMM) are used to measure the similarity between documents and between annotations
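A minimal sketch of the MSM baseline as described: annotations are merged into the content and scored with a Dirichlet-smoothed unigram query log-likelihood. The function and parameter names are illustrative assumptions, not the authors' code:

    import math
    from collections import Counter

    def msm_score(query_terms, doc_text, annotations, collection_model, mu=2000):
        # Merge the document content with all of its annotations (the "merged source")
        merged = doc_text.split() + [t for ann in annotations for t in ann.split()]
        counts = Counter(merged)
        dlen = len(merged)
        score = 0.0
        for q in query_terms:
            p_c = collection_model.get(q, 1e-9)  # background P(q|C); the floor value is an assumption
            # Dirichlet prior smoothing: (c(q; d) + mu * P(q|C)) / (|d| + mu)
            score += math.log((counts[q] + mu * p_c) / (dlen + mu))
        return score

Ranking documents by this score per query yields the MSM baseline that LAM is compared against.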
SSR and SMM
[Table: Top 3 most similar annotations for 5 sample annotations, as found by SSR and SMM]
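The SSR column of the table comes from SocialSimRank. As a concreteness aid, here is a minimal Python sketch of a SimRank-style iteration on the annotation-document bipartite graph; the damping constants, iteration count, and update order are assumptions, not the paper's exact settings:

    from collections import defaultdict

    def social_simrank(bookmarks, iters=10, c_a=0.7, c_p=0.7):
        # bookmarks: iterable of (doc, annotation) pairs; returns annotation-annotation similarity
        ann_docs, doc_anns = defaultdict(set), defaultdict(set)
        for doc, ann in bookmarks:
            ann_docs[ann].add(doc)
            doc_anns[doc].add(ann)
        anns, docs = list(ann_docs), list(doc_anns)
        sim_a = {(a, b): float(a == b) for a in anns for b in anns}
        sim_p = {(p, q): float(p == q) for p in docs for q in docs}
        for _ in range(iters):
            # Annotation similarity from page similarity of the previous round
            sim_a = {(a, b): 1.0 if a == b else
                     c_a * sum(sim_p[p, q] for p in ann_docs[a] for q in ann_docs[b])
                     / (len(ann_docs[a]) * len(ann_docs[b]))
                     for a in anns for b in anns}
            # Page similarity from the freshly updated annotation similarity
            sim_p = {(p, q): 1.0 if p == q else
                     c_p * sum(sim_a[a, b] for a in doc_anns[p] for b in doc_anns[q])
                     / (len(doc_anns[p]) * len(doc_anns[q]))
                     for p in docs for q in docs}
        return sim_a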
Retrieval Performance
[Table: MAP of each model]
Conclusions and Future Work
• The problem of integrating social annotations into LMIR is studied
• Two properties of social annotations are identified and effectively utilized to alleviate the data sparseness problem and relax the term independence assumption
• In the future, we plan to explore more features of social annotations and more sophisticated ways of using them