Opinion Detection by Transfer Learning 11-742 Information Retrieval Lab Grace Hui Yang Advised by Prof. Yiming Yang
Outline • Introduction • The Problem • Transfer Learning by Constructing Informative Prior • Datasets • Evaluation Method • Experimental Results • Conclusion
Introduction • TREC 2006 Blog Track • Opinion Detection Task <num> Number: 851 <title> "March of the Penguins" <desc> Description: Provide opinion of the film documentary "March of the Penguins". <narr> Narrative: Relevant documents should include opinions concerning the film documentary "March of the Penguins". Articles or comments about penguins outside the context of this film documentary are not relevant.
Opinion Detection Literature Review • Researchers in the Natural Language Processing (NLP) community • Turney (2002): groups words from online text by their pointwise mutual information with "excellent" and "poor" • Riloff & Wiebe (2003): use a high-precision classifier to collect high-quality opinions and non-opinions, then extract syntactic patterns; repeat this process to bootstrap • Pang et al. (2002): treat opinion and sentiment detection as a text classification problem • Naive Bayes, Maximum Entropy, SVM with unigram presence features (82.9%) • Pang & Lee (2005): use minimum cuts to cluster sentences based on their subjectivity and sentiment orientation • Researchers from the data mining community • Morinaga et al. (2002): use word polarity and syntactic pattern-matching rules to extract opinions, and PCA to create a correspondence between product names and keywords
Existing System • Query Expansion • Document Retrieval • Binary Text Classification by Bayesian Logistic Regression
No Available Training Data • Transfer Learning • Transfer knowledge across similar tasks in different domains • Generalize knowledge from limited training data • Discover underlying general structures across domains
Transfer Learning Literature Review • Baxter (1997) and Thrun (1996): both used hierarchical Bayesian learning • Lawrence and Platt (2004), Yu et al. (2005): also use hierarchical Bayesian models to learn the hyper-parameters of a Gaussian process • Ando and Zhang (2005): proposed a framework for Gaussian logistic regression for text classification • Raina et al. (2006): continued this approach and built informative priors for Gaussian logistic regression
Transfer Learning • The approach presented in this project is inspired by the work of Raina, Ng & Koller (2006) on text classification • Transfer common knowledge (word dependence) across similar tasks by constructing an informative prior in a Bayesian logistic regression framework
Logistic Regression Framework • Logistic regression models the class probability with a sigmoid function of θ·x • To avoid overfitting, a multivariate Gaussian prior is placed on θ • Maximum a posteriori (MAP) estimation
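The MAP estimation named on this slide can be sketched as follows. This is a minimal illustration, assuming labels in {0, 1}, a zero-mean Gaussian prior N(0, Σ) on θ, and plain gradient descent; the function names, learning rate, and iteration count are hypothetical and not taken from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_posterior(theta, X, y, Sigma_inv):
    """Negative log posterior, up to an additive constant, for y in {0, 1}."""
    z = X @ theta
    # Bernoulli log-likelihood; logaddexp(0, z) = log(1 + exp(z)) computed stably.
    log_lik = np.sum(y * z - np.logaddexp(0.0, z))
    # Zero-mean Gaussian log prior with covariance Sigma.
    log_prior = -0.5 * theta @ Sigma_inv @ theta
    return -(log_lik + log_prior)

def fit_map(X, y, Sigma, lr=0.1, n_iter=2000):
    """Gradient descent on the negative log posterior (assumed optimizer)."""
    Sigma_inv = np.linalg.inv(Sigma)
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ theta) - y) + Sigma_inv @ theta
        theta -= lr * grad / len(y)
    return theta
```

A diagonal Σ = σ²I gives the usual ridge-style penalty, while a non-diagonal Σ lets the prior couple the coefficients of related words.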
Non-diagonal Covariance • Zero-mean, equal-variance prior • Cannot capture relationships among words • Zero-mean, non-diagonal covariance prior • Models word dependency in the off-diagonal entries of the covariance matrix
Pairwise Covariance • Covariance definition: Cov(θi, θj) = E[(θi − E[θi])(θj − E[θj])] • Given zero mean, Cov(θi, θj) = E[θi θj]
Get Covariance by MCMC • Markov Chain Monte Carlo (MCMC) • Sample V (V=4) small vocabularies of size S (S=5), each containing the two words wi and wj that correspond to θi and θj • From each vocabulary, sample T (T=4) training sets of size Z (Z=3) from the labeled datasets and train an ordinary logistic regression model on each
Get Covariance by MCMC • Subtract a bootstrap estimate of the covariance that is due to random variation in the training set
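The sampling procedure on these two slides can be sketched roughly as below. This is an illustration under stated assumptions, not the project's implementation: `X` is a hypothetical document-term matrix, `y` holds 0/1 labels, the covariance is estimated as the average of θi·θj (using the zero-mean form), a small ridge term is added purely for numerical stability on tiny training sets, and the bootstrap correction is left out.

```python
import numpy as np

def train_logreg(X, y, n_iter=200, lr=0.5, ridge=1e-2):
    """Logistic regression by gradient descent; the ridge term is an assumption."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))
        theta -= lr * (X.T @ (p - y) / len(y) + ridge * theta)
    return theta

def estimate_pair_covariance(X, y, i, j, V=4, S=5, T=4, Z=3, seed=0):
    """Monte Carlo estimate of E[theta_i * theta_j] for words i and j."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = X.shape
    others = np.array([w for w in range(n_words) if w not in (i, j)])
    products = []
    for _ in range(V):
        # Small vocabulary of size S that always contains words i and j.
        extra = rng.choice(others, size=S - 2, replace=False)
        vocab = np.concatenate(([i, j], extra))
        for _ in range(T):
            # Small training set of Z labeled documents.
            rows = rng.choice(n_docs, size=Z, replace=False)
            theta = train_logreg(X[np.ix_(rows, vocab)], y[rows])
            products.append(theta[0] * theta[1])  # theta_i * theta_j
    return float(np.mean(products))
```

Calling `estimate_pair_covariance(X, y, i, j)` for a sample of word pairs yields the pairwise targets that the later covariance-matrix learning step fits.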
Learning a Covariance Matrix • Learning individual covariances for pairs of regression coefficients is NOT all we need • Two challenges: (1) Valid covariance matrix • A valid covariance matrix needs to be positive semi-definite (PSD) • Hermitian matrix (square, self-adjoint) with nonnegative eigenvalues • Project the matrix onto the PSD cone
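Projection onto the PSD cone has a standard closed form for real symmetric matrices: eigendecompose and clip negative eigenvalues to zero. The sketch below assumes a real-valued input and symmetrizes it first; the function name is illustrative.

```python
import numpy as np

def project_to_psd(M):
    """Nearest (Frobenius norm) symmetric PSD matrix to a real square matrix M."""
    sym = (M + M.T) / 2.0                   # symmetrize
    eigvals, eigvecs = np.linalg.eigh(sym)  # real eigenvalues, orthonormal vectors
    clipped = np.clip(eigvals, 0.0, None)   # drop the negative eigenvalues
    return (eigvecs * clipped) @ eigvecs.T  # rebuild U diag(clipped) U^T
```

For example, `[[1, 2], [2, 1]]` has eigenvalues 3 and -1, and its projection is `[[1.5, 1.5], [1.5, 1.5]]`.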
Learning a Covariance Matrix (2) The number of pairwise calculations grows quadratically with vocabulary size • Represent the word dependence as a linear combination of underlying features • Learn the coefficients by least squares
Learning a Covariance Matrix by Joint Minimization • λ is the trade-off coefficient between the two objectives • As λ → 0, only the PSD constraint matters • As λ → 1, only the word-pair relationships matter • λ is set to 0.6
Solve the Joint Minimization • Convex problem, so alternating minimization converges to the global minimum • Fix Σ, minimize over ψ • Use a quadratic program (QP) solver • Fix ψ, minimize over Σ • A special case of semi-definite programming (SDP) • Eigendecomposition, keeping only the nonnegative eigenvalues
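The alternating scheme can be sketched as below. The objective itself is not reproduced in the slide text, so this sketch assumes the form used by Raina et al. (2006): λ · Σ over sampled pairs of (e_ij − ψ·f_ij)² plus (1 − λ) · Σ over all pairs of (Σ_ij − ψ·f_ij)², subject to Σ being PSD. The array names `F` (pair features), `E` (sampled covariances), and `mask` (which pairs were sampled) are hypothetical. Because the ψ-step here is an unconstrained quadratic problem, ordinary least squares stands in for a general QP solver.

```python
import numpy as np

def project_to_psd(M):
    sym = (M + M.T) / 2.0
    w, U = np.linalg.eigh(sym)
    return (U * np.clip(w, 0.0, None)) @ U.T

def objective(Sigma, psi, F, E, mask, lam):
    pred = F @ psi  # (n, n) matrix of psi . f_ij
    fit_pairs = np.sum((E[mask] - pred[mask]) ** 2)
    fit_sigma = np.sum((Sigma - pred) ** 2)
    return lam * fit_pairs + (1 - lam) * fit_sigma

def fit_covariance(F, E, mask, lam=0.6, n_iter=20):
    """F: (n, n, k) features, E: (n, n) targets, mask: (n, n) boolean."""
    n, _, k = F.shape
    Sigma = np.eye(n)
    history = []
    for _ in range(n_iter):
        # Fix Sigma, minimize over psi: weighted least squares on stacked rows.
        A = np.vstack([np.sqrt(lam) * F[mask],
                       np.sqrt(1 - lam) * F.reshape(-1, k)])
        b = np.concatenate([np.sqrt(lam) * E[mask],
                            np.sqrt(1 - lam) * Sigma.reshape(-1)])
        psi, *_ = np.linalg.lstsq(A, b, rcond=None)
        # Fix psi, minimize over Sigma: project the predicted matrix onto the PSD cone.
        Sigma = project_to_psd(F @ psi)
        history.append(objective(Sigma, psi, F, E, mask, lam))
    return Sigma, psi, history
```

Each half-step minimizes the objective over one block of variables, so the recorded objective values should not increase from one iteration to the next.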
Feature Design • Model word dependency • WordNet synsets • What else? • People do not always use the same general syntactic patterns to express opinions • "blah blah is good" • "awesome blah blah!"
Target-Opinion Word Pair • Different opinion targets are associated with different customary expressions • A person is knowledgeable • A computer processor is fast • A computer processor is knowledgeable (ill-formed) • A person is fast (ill-formed) • A computer processor is running like a horse (word polarity test fails)
Target-Opinion Word Pair • From each positive example in the training corpus, extract: • subject and object (excluding pronouns) • “Melvin, pig” • subject and BE-predicate • “lens, clear”, “base, heavy” • modifier and subject • “good, coffee”, “interesting, movie”
Word Synonym • Bridge the vocabulary gap between training and testing • "This movie is good" in the training corpus • "The film is really good" in the testing corpus
Feature Vector • Log co-occurrence • Target-Opinion pair • Synonym
Datasets • Training Corpus • Movie reviews [Pang & Lee, Cornell] • 10,000 sentences (5,000 opinions, 5,000 non-opinions) • Product reviews [Hu & Liu, UIC] • 4,000+ sentences (2,034 opinions, 2,173 non-opinions) • Digital camera, cell phone, DVD player, Jukebox, …
Datasets • Test Corpus: TREC 2006 Blog corpus • 3,201,002 articles (TREC reports 3,215,171) • December 2005 to February 2006 • Technorati, Bloglines, Blogpulse … • For each topic, 5,000 passages are retrieved • Using Lemur as the search engine • 132,399 passages in total • 2,648 passages per topic on average • Each passage has 1-10 sentences (fewer than 100 words)
Evaluation Method • Precision at 11 standard recall levels • Mean average precision (MAP) • Answers are provided by TREC qrels • Document ids of documents containing an opinion • Note that our system is developed for opinion detection at the sentence level • A passage's score is the average score of all the sentences in the retrieved passage • Extract unique document ids to compare with the TREC qrels
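The evaluation steps above can be sketched as follows. Average precision and 11-point interpolated precision follow their standard definitions. How several passages from the same document are combined is not stated on the slide, so taking the best passage score per document is an assumption, and all function names are illustrative.

```python
def rank_documents(passages):
    """passages: list of (doc_id, [sentence scores]). Returns unique doc ids, best first."""
    best = {}
    for doc_id, sentence_scores in passages:
        score = sum(sentence_scores) / len(sentence_scores)  # passage = mean of sentences
        best[doc_id] = max(score, best.get(doc_id, float("-inf")))  # assumed: keep best passage
    return sorted(best, key=best.get, reverse=True)

def average_precision(ranked_ids, relevant):
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def eleven_point_precision(ranked_ids, relevant):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    if not relevant:
        return [0.0] * 11
    points, hits = [], 0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))
    levels = [i / 10 for i in range(11)]
    return [max((p for r, p in points if r >= level), default=0.0) for level in levels]

def mean_average_precision(runs):
    """runs: list of (ranked_ids, relevant_set), one entry per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

Here `relevant` plays the role of the set of document ids that the qrels mark as containing an opinion for one topic.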
Experimental Results • Effects of Using a Non-diagonal Prior Covariance • Baseline: train the Gaussian logistic regression model on movie reviews with prior ~N(0, σ²) • Feature Selection: train the Gaussian logistic regression model on word features common to movie reviews and product reviews, with prior ~N(0, σ²) • Informative Prior: use movie reviews to calculate the prior covariance, then train the Gaussian logistic regression model with the informative prior ~N(0, Σ)
Experimental Results • Effects of Feature Design • Baseline: train the Gaussian logistic regression model on movie reviews with prior ~N(0, σ²), bigram model • Transfer Learning Using Synonyms: informative prior ~N(0, Σ) • Transfer Learning Using Target-Opinion Pairs: informative prior ~N(0, Σ) • Transfer Learning Using Both: informative prior ~N(0, Σ)
Experimental Results • Effects of External Dataset Selection • Negative effect of transfer learning
Why Does the Negative Effect Occur? • Movie reviews cover more general topics • Product reviews share only 23% of the topics
Conclusion • Applied transfer learning to opinion detection • Transfer learning with an informative prior improves over brute-force transfer learning by 32% • Discovered a good feature for opinion detection • Target-Opinion pair • External datasets must be chosen carefully for transfer to help