A Generation Model to Unify Topic Relevance and Lexicon-based Sentiment for Opinion Retrieval. Min Zhang, Xinyao Ye, Tsinghua University, SIGIR 2008. 2009. 01. 14. Summarized & presented by Jung-Yeon Yang, IDS Lab.
Introduction • A growing interest in finding out people’s opinions from web data • Product surveys • Advertisement analysis • Political opinion polls • TREC started a special track on blog data in 2006 – blog opinion retrieval • It was the track with the most participants in 2007 • This paper focuses on the problem of searching opinions over general topics
Related Work • Popular opinion identification approaches • Text classification • Lexicon-based sentiment analysis • Opinion retrieval • To find sentimentally relevant documents according to a user’s query • One of the key problems • How to combine the opinion score with the relevance score of each document for ranking
Related Work (cont.) • Topic-relevance search is carried out using relevance ranking (e.g. TF*IDF ranking) • Ad hoc solutions combine relevance ranking with opinion detection results • 2 steps: rank with relevance, then re-rank with the sentiment score • Most existing approaches use a linear combination: α*Score_rel + β*Score_opn
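The two-step linear-combination re-ranking above can be sketched as follows; the weights α and β are illustrative, not values from the paper:

```python
def combine_scores(score_rel, score_opn, alpha=0.7, beta=0.3):
    """Linear combination used by most ad hoc approaches:
    alpha * Score_rel + beta * Score_opn (weights are illustrative)."""
    return alpha * score_rel + beta * score_opn

def rerank(docs):
    """docs: list of (doc_id, relevance_score, opinion_score) tuples,
    typically the top results of a first-pass relevance ranking."""
    return sorted(docs, key=lambda d: combine_scores(d[1], d[2]), reverse=True)
```

A highly relevant document can still outrank a strongly opinionated one depending on the weights, which is exactly the tuning problem this paper tries to avoid.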
Backgrounds • Statistical Language Model (LM) • A probability distribution over word sequences • p(“Today is Wednesday”) ≈ 0.001 • p(“Today Wednesday is”) ≈ 0.0000000000001 • Unigram: P(w1,w2,w3) = P(w1)*P(w2)*P(w3) • Bigram: P(w1,w2,w3) = P(w1)*P(w2|w1)*P(w3|w2) • Can also be regarded as a probabilistic mechanism for “generating” text, thus also called a “generative” model • An LM allows us to answer questions like: • Given that we see “John” and “feels”, how likely will we see “happy” as opposed to “habit” as the next word? (speech recognition) • Given that we observe “baseball” three times and “game” once in a news article, how likely is it about “sports”? (text categorization, information retrieval)
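A minimal sketch of the unigram and bigram estimates above, using maximum-likelihood counts on a toy corpus. It shows why only the bigram model distinguishes “Today is Wednesday” from the scrambled order:

```python
from collections import Counter

def unigram_prob(words, uni, total):
    # P(w1,...,wn) = product of P(wi): word order does not matter
    p = 1.0
    for w in words:
        p *= uni[w] / total
    return p

def bigram_prob(words, uni, bi, total):
    # P(w1,...,wn) = P(w1) * product of P(wi | w_{i-1}): word order matters
    p = uni[words[0]] / total
    for prev, cur in zip(words, words[1:]):
        p *= bi[(prev, cur)] / uni[prev]
    return p

# Toy corpus for illustration only
corpus = "today is wednesday today is friday".split()
uni = Counter(corpus)
bi = Counter(zip(corpus, corpus[1:]))
```

Under the unigram model both word orders get the same score; under the bigram model the unseen pair (“today”, “wednesday”) drives the scrambled order’s probability to zero, which is also a first glimpse of the sparseness problem that smoothing addresses later.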
Backgrounds (cont.) • The notion of relevance [taxonomy diagram] • Similarity between representations (Rep(q), Rep(d)) → vector space model, probabilistic distribution model • Probability of relevance P(r=1|q,d), r ∈ {0,1} → generative model (document generation: classical probabilistic model; query generation: LM approach) and regression model • Probabilistic inference P(d→q) or P(q→d) → different inference systems: inference network model, probabilistic concept space model
Backgrounds (cont.) • Retrieval as Language Model Estimation • Document ranking based on query likelihood • Retrieval problem → estimation of p(wi|d) from the document language model • Smoothing is an important issue • Problem: if tf(wi, d) = 0, then p(wi|d) = 0 • Smoothing methods try to • Discount the probability of words seen in the document (discounted LM estimate) • Re-allocate the extra probability so that unseen words have a non-zero probability • Most use a reference model (the collection language model) to estimate probabilities for unseen words
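A minimal query-likelihood sketch with Jelinek-Mercer smoothing, interpolating the document model with the collection model as p(w|d) = (1−λ)·p_ml(w|d) + λ·p(w|C); λ = 0.5 is an arbitrary choice here:

```python
import math
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """log p(q|d) under a unigram LM with Jelinek-Mercer smoothing.
    Words unseen in d still get non-zero probability from the
    collection (reference) model, so the product never collapses to 0."""
    d_counts = Counter(doc)
    c_counts = Counter(collection)
    score = 0.0
    for w in query:
        p_ml = d_counts[w] / len(doc)          # document language model
        p_c = c_counts[w] / len(collection)    # collection language model
        score += math.log((1 - lam) * p_ml + lam * p_c)
    return score
```

Documents that actually contain the query terms score higher, while documents missing a term are penalized but not eliminated outright.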
Generation Model for Opinion Retrieval • The Generation Model • To find both sentimental and relevant documents with ranks • Topic Relevance Ranking • Opinion Generation Model and Ranking • Ranking function of generation model for opinion retrieval
The Proposed Generation Model • Document generation model • How well the document d “fits” the particular query q: p(d|q) ∝ p(q|d)·p(d) • In opinion retrieval, rank by p(d|q, s) • The user’s information need is restricted to only an opinionated subset of the relevant documents • This subset is characterized by sentiment expressions s towards topic q
The Proposed Generation Model (cont.) quadratic relationship • In this work, discuss lexicon-based sentiment analysis • Assume • The latent variable s is estimated with a pre-constructed bag-of-word sentiment thesaurus • All sentiment words si are uniformly distributed • The final generation model
Topic Relevance Ranking Irel(d, q) • The Binary Independence Retrieval (BIR) model is one of the most famous models in the probabilistic retrieval branch; IDF(w) is the term weight it derives • Heuristic ranking function BM25 • TREC tests have shown this to be among the best of the known probabilistic weighting schemes
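The standard BM25 function can be sketched as below; k1 = 1.2 and b = 0.75 are the usual defaults, not parameters reported in the paper:

```python
import math
from collections import Counter

def bm25(query, doc, docs, k1=1.2, b=0.75):
    """Standard BM25 (Okapi) score of doc for query.
    docs is the whole collection, used for document frequencies and
    the average document length."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    tf = Counter(doc)
    score = 0.0
    for w in set(query):
        df = sum(1 for d in docs if w in d)
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))  # smoothed IDF(w)
        f = tf[w]
        # Term frequency saturation (k1) and length normalization (b)
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score
```

In this paper BM25 would play the role of Irel(d, q), the topic-relevance factor of the final ranking function.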
Opinion Generation Model Iop(d, q, s) • Iop(d, q, s) focuses on the problem: given query q, how probable is it that a document d generates a sentiment expression s • Sparseness problem → smoothing • Jelinek-Mercer smoothing is applied
Opinion Generation Model Iop(d, q, s) (cont.) • Use the co-occurrence of si and q inside d, within a window W, as the ranking measure for Pml(si | d, q)
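A sketch of the window-based co-occurrence count behind the ML estimate Pml(si | d, q); the function name and window size are illustrative, not from the paper:

```python
def windowed_cooccurrence(doc, sentiment_word, query_terms, window=5):
    """Count occurrences of sentiment word si that fall within `window`
    tokens of any query term in document doc (a token list).
    This raw count is the kind of quantity normalized into Pml(si|d,q)."""
    hits = 0
    for i, tok in enumerate(doc):
        if tok == sentiment_word:
            lo = max(0, i - window)
            hi = min(len(doc), i + window + 1)
            if any(t in query_terms for t in doc[lo:hi]):
                hits += 1
    return hits
```

The window restriction is what ties a sentiment word to the query topic: “great” right next to “camera” counts, while a sentiment word far from any query term does not.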
Ranking function of generation model for opinion retrieval • Baseline: only use the topic relevance • The final ranking function combines the topic-relevance and opinion-generation scores • To reduce the impact of the imbalance between the number of sentiment words and the number of query terms → logarithm normalization
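One way to sketch the quadratic (multiplicative) combination with logarithm normalization described above; this is an illustrative form, and the paper’s exact normalization may differ:

```python
import math

def opinion_retrieval_score(i_rel, i_op):
    """Combine the topic-relevance score I_rel(d,q) and the opinion score
    I_op(d,q,s) multiplicatively, with a logarithm damping I_op so that
    documents with many sentiment words do not swamp the relevance signal."""
    return i_rel * math.log(1 + i_op)
```

The score grows with both factors, but a document with zero opinion evidence scores zero regardless of relevance, which is the intended contrast with the additive linear combination.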
Experimental Setup • Data set • TREC Blog 06 & 07 • 100,649 blogs over 2.5 months • Strategy: find the top 1000 relevant documents, then re-rank the list with the proposed model • Models • General linear combination • Proposed generation model with smoothing • Proposed generation model with smoothing and normalization
Experimental Setup (cont.) • Sentiment lexicons
Conclusion • Proposed a formal generation model for opinion retrieval • Topic-relevance & sentiment scores are integrated with a quadratic combination • Opinion generation ranking functions are derived • Discussed the roles of the sentiment lexicon and the matching window • It is a general model for opinion retrieval • Uses domain-independent lexicons • No assumption is made on the nature of blog-structured text
My opinions • Good points • Proposed an opinion retrieval model and ranking function • Conducted various experiments • Lacking points • Only finds documents that contain some opinions • Does not identify what kinds of opinions the documents contain • Does not use the sentiment polarity of words