Building Sentiment Resources On Chinese Reviews Zhang Haochen
Self Introduction • Zhang Haochen (张昊辰) • Ph.D. student • THUIR, Tsinghua University, China • Football, cooking
Overview • Introduction • Related work • Issue description • Approach • Prototype design • Pre-processing • Feature extraction • Opinion extraction • Polarity classification • Evaluation • Conclusion • Future work
Introduction • Content: factual vs. subjective • UGC in Web 2.0 • Reviews on entities: product, movie, news … • Opinionated information: tweet, BBS, … • Applications • E-commerce • Public opinion • Recommendation
Related work • Typical tasks (Pang, 2008): • Extraction: feature / aspect, opinion • Classification: subjectivity, polarity • Summarization • Search and comparison • Approaches: • syntax-based • supervised vs. unsupervised • bootstrapping / propagation
Issue description • I/O • Input: reviews of a particular domain / product • Output: sentiment dictionary for that domain / product • Corpus • Chinese: segmentation, POS tagging • Internet: spam, OOV (out-of-vocabulary) words, colloquial language • Difficulties • Noise • Varied patterns • Colloquial language and OOV words • Solution • Syntax-based + OOV handling • Pruning
Pre-processing • Filter noise in POS tagging results • If A is a subset of B, take A • Completely unmatched tags: annotate as unknown (z) • Same segmentation but different tags: annotate as unknown (z) • Remove redundant sentences • Remove sentences with too many punctuation marks
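The last two pre-processing steps (dropping redundant sentences and punctuation-heavy sentences) could look roughly like the sketch below. The function names and the `max_punct_ratio` threshold are illustrative assumptions; the slides give no concrete threshold, and "redundant" is interpreted here as an exact duplicate.

```python
import unicodedata


def punctuation_ratio(sentence: str) -> float:
    """Fraction of non-space characters whose Unicode category is punctuation (P*)."""
    chars = [c for c in sentence if not c.isspace()]
    if not chars:
        return 1.0
    punct = sum(1 for c in chars if unicodedata.category(c).startswith("P"))
    return punct / len(chars)


def clean_sentences(sentences, max_punct_ratio=0.3):
    """Drop exact duplicates and punctuation-heavy sentences, keeping first-seen order.

    max_punct_ratio is a hypothetical threshold, not a value from the slides.
    """
    seen = set()
    kept = []
    for s in sentences:
        key = s.strip()
        if not key or key in seen:
            continue  # empty or already seen: treat as redundant
        seen.add(key)
        if punctuation_ratio(key) > max_punct_ratio:
            continue  # too many punctuation marks: likely noise
        kept.append(key)
    return kept


reviews = [
    "画质很好",          # "image quality is good"
    "画质很好",          # exact duplicate
    "!!!???。。。",      # punctuation only
    "价格高,但是值得",  # "pricey, but worth it"
]
cleaned = clean_sentences(reviews)
```

Using the Unicode category rather than an ASCII punctuation list means full-width Chinese punctuation is counted as well.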
Feature extraction • Specific patterns • not only nouns • verbs and morphemes involved • frequency greater than a given threshold • more noise • Verbal stop words • verb as part of a phrase • verb as predicate
Feature extraction • OOV words • context entropy gain: whether B should form a phrase with A • mutual information: whether AB should be merged into one unit • applied iteratively
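The two statistics named on this slide can be illustrated with a minimal sketch. It shows plain pointwise mutual information and plain right-context entropy; the slides do not define how the entropy "gain" is computed or which thresholds are used, so those parts are left out. Function names and the toy token sequence are assumptions for illustration.

```python
import math
from collections import Counter


def pmi(bigram_count, count_a, count_b, total_tokens):
    """Pointwise mutual information (in bits) of adjacent tokens A and B.

    Simplification: all probabilities are normalised by the same token total.
    """
    p_ab = bigram_count / total_tokens
    p_a = count_a / total_tokens
    p_b = count_b / total_tokens
    return math.log2(p_ab / (p_a * p_b))


def right_context_entropy(tokens, target):
    """Shannon entropy (in bits) of the distribution of tokens that follow `target`."""
    followers = Counter(nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == target)
    total = sum(followers.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in followers.values())


# Toy sequence with placeholder tokens: "a" is followed by "b" twice and "c" once.
tokens = ["a", "b", "a", "b", "a", "c", "d"]
entropy_a = right_context_entropy(tokens, "a")
pmi_ab = pmi(bigram_count=2, count_a=3, count_b=2, total_tokens=len(tokens))
```

A high PMI for AB together with a low right-context entropy for A would both point toward merging A and B into one phrase; repeating the merge on the re-segmented text gives the iterative behaviour the slide mentions.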
Feature extraction • Co-occurrence frequency with adjectives • Sectional threshold • Filter common words using a background corpus (from SogouT, 20M in size)
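One way to filter common words against a background corpus is to compare relative frequencies, as in the sketch below. This is an assumed scoring rule, not the one from the slides: the function name, the add-one smoothing, and the `min_ratio` threshold are all hypothetical, and the counts are toy numbers rather than SogouT statistics.

```python
def filter_common_words(domain_counts, background_counts, min_ratio=5.0):
    """Keep candidates that are much more frequent in the domain than in the background.

    min_ratio is a hypothetical threshold chosen for illustration.
    """
    domain_total = sum(domain_counts.values())
    background_total = sum(background_counts.values())
    kept = []
    for word, count in domain_counts.items():
        domain_rel = count / domain_total
        # Add-one smoothing so words unseen in the background do not divide by zero.
        background_rel = (background_counts.get(word, 0) + 1) / (background_total + 1)
        if domain_rel / background_rel >= min_ratio:
            kept.append(word)
    return kept


# Toy counts: "lens" is domain-specific, "thing" is common everywhere.
domain = {"lens": 50, "thing": 50}
background = {"lens": 10, "thing": 5000, "other": 4990}
specific = filter_common_words(domain, background)
```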
Opinion extraction • Syntax-based • adjacent adjectives • adverbs ignored • within a specific window • contributes about 70% of the final results
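The window rule on this slide can be sketched as follows. Several details are assumptions made for illustration: the tag letters (`a` adjective, `d` adverb, `w` punctuation), searching only to the right of the feature, stopping at punctuation, and the default window size.

```python
def adjacent_opinions(tagged, features, window=3):
    """Pair each feature with the first adjective found within `window` tokens to its right.

    tagged: list of (word, pos) tuples. Adverbs are skipped and do not use up the window.
    """
    pairs = []
    for i, (word, _) in enumerate(tagged):
        if word not in features:
            continue
        seen = 0
        for nxt_word, nxt_pos in tagged[i + 1:]:
            if nxt_pos == "d":   # adverb: ignore it entirely
                continue
            if nxt_pos == "w":   # punctuation: stop at the clause boundary
                break
            seen += 1
            if seen > window:
                break
            if nxt_pos == "a":   # adjective: take it as the opinion word
                pairs.append((word, nxt_word))
                break
    return pairs


# "价格很高,画质非常清晰" = "the price is very high, the image quality is very sharp"
tagged = [("价格", "n"), ("很", "d"), ("高", "a"), (",", "w"),
          ("画质", "n"), ("非常", "d"), ("清晰", "a")]
found = adjacent_opinions(tagged, features={"价格", "画质"})
```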
Opinion extraction • OOV words • assumption: F + adv. + O + func. (feature, adverb, opinion, function word) • adverb and function-word sets • between F and Adj. • between Adj. and Punc. • candidate phrases lie between adv. and func. • Pruning • by frequency • by co-occurrence with features
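The F + adv. + O + func. assumption can be read as a simple pattern match over segmented tokens. The sketch below is one possible reading: the function name, the `max_len` cap, and the tiny adverb and function-word sets are hypothetical, and the pruning steps are not shown.

```python
def oov_opinion_candidates(tokens, features, adverbs, function_words, max_len=4):
    """Collect (feature, phrase) pairs matching: feature, adverb, phrase, function word.

    The phrase is whatever lies between the adverb and the next function word,
    limited to max_len tokens (a hypothetical cap).
    """
    candidates = []
    for i in range(len(tokens) - 1):
        if tokens[i] in features and tokens[i + 1] in adverbs:
            start = i + 2
            for j in range(start, min(start + max_len + 1, len(tokens))):
                if tokens[j] in function_words:
                    if j > start:  # require a non-empty phrase
                        candidates.append((tokens[i], "".join(tokens[start:j])))
                    break
    return candidates


# "镜头很给力的": the segmenter splits the slang word 给力 ("awesome") into 给 + 力.
tokens = ["镜头", "很", "给", "力", "的"]
candidates = oov_opinion_candidates(
    tokens, features={"镜头"}, adverbs={"很"}, function_words={"的", "了"}
)
```

Candidates collected this way would still need the pruning listed on the slide, for example dropping phrases that are rare or that seldom co-occur with known features.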
Polarity classification • Feature-opinion pair vs. opinion word alone • "high": ? • "high price": negative • Initialize with the polarity of words from dictionaries • HowNet • Tsinghua • NTU Sentiment Dictionary
Polarity classification • Classify iteratively • Classify unlabeled FO (feature-opinion) pairs using adjacent FO pairs in the same sentence • Then classify FO pairs across the entire corpus
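A rough sketch of the iterative idea: unlabeled FO pairs collect votes from already-labeled neighbours in the same sentence, the votes are pooled over the whole corpus, and the loop repeats until nothing changes. The voting rule is an assumption (adjacent pairs are taken to share polarity, which ignores contrastive words such as "but"), and the function name and English toy data are hypothetical.

```python
from collections import Counter


def propagate_polarity(sentences, seed_labels, max_iters=10):
    """Spread polarity labels from seed FO pairs to adjacent unlabeled pairs.

    sentences: list of sentences, each a list of FO pairs in order of appearance.
    seed_labels: dict mapping some FO pairs to "pos" or "neg" (e.g. from a dictionary).
    """
    labels = dict(seed_labels)
    for _ in range(max_iters):
        votes = {}
        for pairs in sentences:
            for idx, pair in enumerate(pairs):
                if pair in labels:
                    continue
                for k in (idx - 1, idx + 1):  # neighbours in the same sentence
                    if 0 <= k < len(pairs) and pairs[k] in labels:
                        votes.setdefault(pair, Counter())[labels[pairs[k]]] += 1
        new_labels = {}
        for pair, counter in votes.items():
            (top, top_n), *rest = counter.most_common(2)
            if not rest or rest[0][1] < top_n:  # leave ties unlabeled
                new_labels[pair] = top
        if not new_labels:
            break  # converged: no pair gained a label this round
        labels.update(new_labels)
    return labels


sentences = [
    [("image", "clear"), ("lens", "sharp")],
    [("lens", "sharp"), ("screen", "bright")],
    [("battery", "short")],
]
result = propagate_polarity(sentences, {("image", "clear"): "pos"})
```

Pairs that never appear next to a labeled pair, such as the lone pair in the last sentence, stay unlabeled, which corresponds to the "unknown" class reported in the evaluation.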
Evaluation • Reviews in the camera domain • 100,000+ sentences • 769 feature phrases • 806 opinion phrases • 8640 feature-opinion pairs • 5745 positive • 315 neutral • 1948 negative • 632 unknown (treated as neutral in the final results) • Performance • feature extraction • opinion extraction • polarity classification
Conclusion • Chinese corpora differ from English corpora and are harder to process. • The syntax-based method proved simple yet effective for extracting explicit features and opinions from well-written text. • The syntax-based method may perform poorly on colloquial text.
Future work • A more accurate and appropriate model • Draw on approaches from other AI research work • Apply learning methods • Implicit features and opinions • Application across different domains