Opinion Retrieval from Blogs • Wei Zhang (1), Clement Yu (1), Weiyi Meng (2) • wzhang@cs.uic.edu, yu@cs.uic.edu, meng@cs.binghamton.edu • (1) Department of Computer Science, University of Illinois at Chicago • (2) Department of Computer Science, Binghamton University • CIKM 2007
Outline • Overview of opinion retrieval • Topic retrieval • Opinion identification • Ranking documents by opinion similarity • Experimental results
Overview of Opinion Retrieval • Opinion retrieval • Given a query, find documents that express subjective opinions about the query • Example query: “book” • Relevant: “This is a very good book.” • Irrelevant: “This book has 123 pages.”
Overview of Opinion Retrieval • Introduced at the TREC 2006 Blog Track • 14 groups, 57 submitted runs in TREC 2006 • 20 groups, 104 runs in TREC 2007 (ongoing) • Key problems • Identifying opinion features • Finding query-related opinions • Ranking the retrieved documents
Our Algorithm • Document set + Query → Retrieved documents → Opinionative documents → Query-related opinionative documents
Topic Retrieval • Retrieve query-relevant documents • No opinion involved • Features • Phrase recognition • Query expansion • Two document-query similarities
Topic Retrieval – Phrase Recognition • Captures the semantic relationship among the query words • Used for calculating phrase similarity • 4 types • Proper noun: “University of Lisbon” • Dictionary phrase: “computer science” • Simple phrase: “white car” • Complex phrase: “small white car”
Topic Retrieval – Query Expansion • Find the synonyms • “wto” → “world trade organization” • Same importance as the original query term • Add additional terms • “wto” → negotiate, agreements, tariffs
Topic Retrieval – Similarity • Sim(Query, Doc) = <Sim_P, Sim_T> • Phrase similarity • Depends on whether the document has each query phrase • Sim_P = sum of idf(P_i) over the query phrases P_i found in the document • Term similarity • Sim_T = sum of the Okapi scores of all the query terms • Document ranking • D1 is ranked higher than D2 if • (Sim_P1 > Sim_P2) OR (Sim_P1 = Sim_P2 AND Sim_T1 > Sim_T2)
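The ranking rule on this slide is a lexicographic comparison of the pair <Sim_P, Sim_T>. A minimal sketch in Python, assuming both scores have already been computed for each document; the `ScoredDoc` type, the function name, and the numbers are illustrative and not taken from the talk:

```python
from typing import NamedTuple


class ScoredDoc(NamedTuple):
    doc_id: str
    sim_p: float  # phrase similarity: sum of idf over matched query phrases
    sim_t: float  # term similarity: sum of Okapi scores over query terms


def rank_by_topic_similarity(docs):
    """Order documents by Sim_P first; Sim_T only breaks ties in Sim_P."""
    # Python compares tuples lexicographically, which matches the rule
    # (Sim_P1 > Sim_P2) OR (Sim_P1 = Sim_P2 AND Sim_T1 > Sim_T2).
    return sorted(docs, key=lambda d: (d.sim_p, d.sim_t), reverse=True)


# Made-up scores, for illustration only.
docs = [
    ScoredDoc("d1", sim_p=2.0, sim_t=1.5),
    ScoredDoc("d2", sim_p=3.1, sim_t=0.2),
    ScoredDoc("d3", sim_p=2.0, sim_t=4.0),
]
ranking = [d.doc_id for d in rank_by_topic_similarity(docs)]
print(ranking)  # ['d2', 'd3', 'd1']
```

Here d2 wins on phrase similarity despite its low term similarity, while d3 beats d1 only because the two tie on Sim_P.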
Opinion Identification • Training: subjective training data + objective training data → feature selection → SVM classifier • Application: retrieved documents (from topic retrieval) → SVM classifier → opinionative documents (to opinion ranking)
Opinion Identification – Training Data • Subjective training data • Review web sites • Documents having opinionative phrases • Objective training data • Dictionary entries • Documents not having opinionative phrases
Opinion Identification – Feature Selection • Select the words that express opinions • Pearson’s chi-square test • Tests the independence between the subjectivity label and a word via a contingency table • The table cells count the number of sentences • Features are unigrams and bigrams
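The chi-square test on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: it handles unigrams only, tokenizes on whitespace, and uses 3.84 (the 5% critical value for one degree of freedom) as an assumed cutoff. The function names and toy sentences are hypothetical.

```python
from collections import Counter


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 contingency table.

    a: subjective sentences containing the term
    b: objective sentences containing the term
    c: subjective sentences without the term
    d: objective sentences without the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a degenerate table carries no evidence
    return n * (a * d - b * c) ** 2 / denom


def sentence_counts(sentences):
    """Count, for each term, how many sentences contain it."""
    counts = Counter()
    for sentence in sentences:
        counts.update(set(sentence.lower().split()))
    return counts


def select_features(subjective, objective, threshold=3.84):
    """Keep the terms whose occurrence depends on the subjectivity label."""
    subj, obj = sentence_counts(subjective), sentence_counts(objective)
    n_subj, n_obj = len(subjective), len(objective)
    selected = {}
    for term in set(subj) | set(obj):
        a, b = subj[term], obj[term]
        score = chi_square_2x2(a, b, n_subj - a, n_obj - b)
        if score >= threshold:
            selected[term] = score
    return selected


# Toy training sentences, for illustration only.
subjective = ["this is a very good book", "i love this good movie",
              "a good but slow story", "terrible plot"]
objective = ["this book has 123 pages", "the movie runs two hours",
             "the story is set in paris", "the plot covers one year"]
features = select_features(subjective, objective)
```

The statistic is symmetric, so a term that appears mostly in objective sentences scores as high as one that appears mostly in subjective sentences. Bigrams would be counted the same way over adjacent word pairs.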
Opinion Identification – Classifier • A support vector machine (SVM) classifier • Training: subjective sentences + objective sentences, with the selected features → feature vector representation → training → SVM classifier
Opinion Identification – Classifier • Apply the SVM classifier • Document → SVM classifier → one label per sentence • Sentence 1 → Label 1: objective • Sentence 2 → Label 2: subjective • … • Sentence n → Label n: objective
Opinion Similarity – Query-Related Opinions • Find the query-related opinions • [Figure: a document showing a query, an opinionative sentence, and a text window]
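One way to read the text-window idea is that an opinionative sentence counts as query-related when it lies close enough to a mention of the query. The sketch below works under that assumption; the slide gives neither the window size nor its unit, so the `window` parameter (measured in sentences), the function name, and the sample document are all hypothetical.

```python
import re


def query_related_opinions(sentences, labels, query_terms, window=1):
    """Return indices of subjective sentences near a query mention.

    sentences:   the document, already split into sentences
    labels:      "subjective" or "objective", one per sentence
    query_terms: words of the query
    window:      assumed window size, in sentences, on either side
    """
    terms = {t.lower() for t in query_terms}
    # Sentences that mention at least one query term.
    hits = [i for i, s in enumerate(sentences)
            if terms & set(re.findall(r"\w+", s.lower()))]
    related = []
    for i, label in enumerate(labels):
        if label == "subjective" and any(abs(i - h) <= window for h in hits):
            related.append(i)
    return related


# Hypothetical document and sentence labels.
sentences = ["I bought a book yesterday.", "It is wonderful.",
             "The weather was cold.", "Lunch was awful."]
labels = ["objective", "subjective", "objective", "subjective"]
related = query_related_opinions(sentences, labels, ["book"], window=1)
print(related)  # [1]
```

The opinion about lunch is dropped because it sits three sentences away from the only mention of the query.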
Opinion Similarity – Similarity 1 • Assumption 1: higher topic relevance → higher rank • OSim_ir = Sim(Query, Doc)
Opinion Similarity – Similarity 2 • Assumption 2: more query-related opinions → higher rank • OSim_stcc: total number of query-related opinionative sentences • OSim_stcs: total score of query-related opinionative sentences
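The two measures differ only in whether each sentence contributes 1 or its score. A minimal sketch, assuming each query-related opinionative sentence comes with a non-negative score (for example a classifier confidence, which is an assumption here); the function names mirror the slide's symbols and the numbers are made up:

```python
def osim_stcc(sentence_scores):
    """Count-based similarity: number of query-related opinionative sentences."""
    return len(sentence_scores)


def osim_stcs(sentence_scores):
    """Score-based similarity: sum of the scores of those sentences."""
    return sum(sentence_scores)


# Hypothetical scores of the query-related opinionative sentences of one document.
scores = [0.5, 0.25, 1.0]
count_similarity = osim_stcc(scores)   # 3
score_similarity = osim_stcs(scores)   # 1.75
```

The count treats every sentence alike, while the score sum lets strongly subjective sentences weigh more.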
Opinion Similarity – Similarity 3 • A linear combination of Similarities 1 and 2 • a * OSim_ir + (1-a) * OSim_stcc • b * OSim_ir + (1-b) * OSim_stcs
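The linear combination can be sketched as below. The slide does not say how the weights a and b are chosen or whether the two scores are normalized first, so the sketch assumes scores on comparable scales and a weight in [0, 1]; the function name and the numbers are illustrative.

```python
def combined_opinion_similarity(osim_ir, osim_opinion, weight):
    """weight * OSim_ir + (1 - weight) * OSim_stcc (or OSim_stcs)."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return weight * osim_ir + (1.0 - weight) * osim_opinion


def rank_documents(doc_scores, weight):
    """Sort document ids by combined similarity, highest first."""
    return sorted(
        doc_scores,
        key=lambda doc_id: combined_opinion_similarity(*doc_scores[doc_id], weight),
        reverse=True,
    )


# Hypothetical (OSim_ir, OSim_stcc) pairs.
doc_scores = {"d1": (8.0, 4.0), "d2": (2.0, 10.0)}
opinion_heavy = rank_documents(doc_scores, weight=0.25)  # ['d2', 'd1']
topic_heavy = rank_documents(doc_scores, weight=0.75)    # ['d1', 'd2']
```

With a small weight the opinion-rich document d2 ranks first; with a large weight the more topically relevant d1 does.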
Opinion Similarity – Experimental Results • TREC 2006 Blog Track data • 50 queries, 3.2 million blog documents • UIC at TREC 2006 Blog Track • Title-only queries: ranked first • 28% - 32% higher than the best TREC 2006 scores • Lessons learned • More training data helps • The combined similarity function helps
Conclusions • Designed and implemented an opinion retrieval system • IR + text classification for opinion retrieval • The best known retrieval effectiveness on TREC 2006 blog data • Extend to polarity classification: positive/negative/mixed • Plan to improve feature selection
Questions? • wzhang@cs.uic.edu • http://www.cs.uic.edu/~wzhang/