Justin Martineau Automatic Domain Adaptive Sentiment Analysis Phase 1
Outline • Introduction • Problem Definition • Thesis Statement • Motivation • Background and Related Work • Challenges • Approaches • Research Plan • Approach • Evaluation • Timeline • Conclusion
1. Intro- 2. Related Work - 3. Research Plan - 4. Conclusion Problem Definition • Sentiment Analysis is the automatic detection and measurement of sentiment in text segments by machines. • 3 Sub Tasks • Objective vs. Subjective • Topic Detection • Positive vs. Negative • Commonly applied to web data • Very Domain Dependent
Sentiment Analysis Example
Thesis Statement This dissertation will develop and evaluate techniques to discover and encode domain-specific, domain-independent, and semantic knowledge in order to improve performance on both single-domain and multiple-domain sentiment analysis problems on textual data when little labeled data is available.
Motivation: Private Sector • Market Research • Surveys • Focus Groups • Feature Analysis • Customer targeting (free samples, etc.) • Consumer Sentiment Search • Compare pros and cons • Overall opinion of products/services
Motivation: Public Sector • Political • Alternative Polling • Determine popular support for legislation • Choose campaign issues • National Security • Detect individuals at risk for radicalization • Determine local sentiment about US policy • Determine local values and sentimental icons • Portray actions positively using local flavor • Public Health • Detect potential suicide victims • Detect mentally unstable people
Challenges • Text Representation • Unedited Text • Sentiment Drift • Negation • Sarcasm • Sentiment Target Identification • Granularity • Domain Dependence
Domain Dependence 1: Domain-Dependent Sentiment • The same sentence can mean two very different things in different domains • Ex: “Read the book.” Good for books, bad for movies. • Ex: “Jolting, heart pounding, You’re in for one hell of a bumpy ride!” Good for movies and books, bad for cars. • Sentimental word associations change with domain • Fuzzy cameras are bad, but fuzzy teddy bears are good. • Big trucks are good, but big iPods are bad. • Bad is bad, but bad villains are good.
Domain Dependence 2: Endless Possibilities
Domain Dependence 3: Organization and Granularity
Theory of the Three Signals • Authors communicate messages using three types of signals • Domain-Specific Signals • Domain-Independent Signals • Semantic Signals • More specific signals are generally more powerful than more generic signals
Domain-Specific Signals • Examples: fuzzy teddy bears, sharp pictures, sharp knives, smooth rides, new ideas, fast servers, fast cars, slow-roasted burgers, slow motion, small cameras, big cars • Dependent on problem and domain • Considered more useful by readers • Tell what is good or bad about the topic • Domain knowledge determines sentiment orientation • Very strong in context, but weak or misleading out of context • Can cause overgeneralization errors when overvalued • New domain-specific signal words are ignored in CDT
Proposed Approach • Sentiment Search is more than just a classification problem • Detecting and using the three signals • Dynamic Domain Adapting Classifiers • Generic Feature Detection using unlabeled data • Semantic Feature Spaces
Dynamic Domain Adapting Classifiers • A (preferably domain-independent) model is built before query time on a set of labeled data, using computationally intensive algorithms. • Users interact at the query-box level • Query results define the domain of interest • Domain-specific adaptations are calculated • the calculation compares how the domain of interest differs from known cases • it uses semantic knowledge about word senses and relations • it must be a fast algorithm: users are waiting • Domain-specific adaptations are woven into the domain-independent model • the resulting model is temporary • it is used to classify documents as positive, negative, or objective • Sentimental search results are processed for significant components and presented for human consumption
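The weaving step above can be illustrated with a minimal sketch. It assumes the general model is a simple word-to-weight lexicon and that adaptations are additive weight shifts; the slide specifies neither, and the names `weave`, `classify`, and `threshold` are hypothetical.

```python
def weave(general_weights, domain_adjustments):
    """Build a temporary model: copy the general weights, then add domain shifts.

    Assumption: both arguments map a word to a sentiment weight. The general
    model is copied, so it is left untouched for the next query.
    """
    model = dict(general_weights)
    for term, delta in domain_adjustments.items():
        model[term] = model.get(term, 0.0) + delta
    return model


def classify(text, model, threshold=0.5):
    """Label text as positive, negative, or objective from summed word weights.

    Assumption: a score close to zero is treated as objective.
    """
    score = sum(model.get(token, 0.0) for token in text.lower().split())
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "objective"


# Hypothetical weights: "fuzzy" is mildly negative in general,
# but the teddy-bear domain shifts it to positive.
general = {"good": 1.0, "bad": -1.0, "fuzzy": -0.25}
teddy_bear_model = weave(general, {"fuzzy": 1.25})

print(classify("a fuzzy bear", general))
print(classify("a fuzzy bear", teddy_bear_model))
```

Because `weave` returns a new dictionary, the adapted model can be discarded after the query, which matches the temporary nature of the resulting model.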
Overview (architecture diagram; component labels only) • Query • Business Intelligence • Query Results Define a New Domain • Lucene Index • Labeled data from known domain • Dynamic Domain Adapter • Component Analysis • Semantic Knowledge • General Model • Context-Specific Model • Sentiment Classifier • Sentimental Search Results (- / +) • Key: User Level, Source Data, Knowledge, Labeled Data Algorithms, Search Results
Subjective Context Scoring • Multiply: • PMI(Word, Context) • IDF • Co-occurrence with known generic sentiment seed words times their bias (from movie reviews) • Seeds: • bad, worst, stupid, ridiculous, terrible, poorly • great, best, perfect, wonderful, excellent, effective
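The product above can be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: the context is taken to be the documents returned by the query, counts are document-level, and the +1/-1 seed biases stand in for the movie-review-derived biases, which the slide does not list. The name `context_score` is hypothetical.

```python
import math

# Seed words from the slide. The +1/-1 values are placeholder biases.
SEED_BIAS = {w: -1.0 for w in
             ["bad", "worst", "stupid", "ridiculous", "terrible", "poorly"]}
SEED_BIAS.update({w: 1.0 for w in
                  ["great", "best", "perfect", "wonderful", "excellent", "effective"]})


def context_score(word, context_docs, corpus_docs):
    """PMI(word, context) * IDF(word) * biased seed co-occurrence.

    Assumption: corpus_docs includes the context documents.
    """
    context = [set(d.lower().split()) for d in context_docs]
    corpus = [set(d.lower().split()) for d in corpus_docs]
    df_context = sum(word in d for d in context)
    df_corpus = sum(word in d for d in corpus)
    if df_context == 0 or df_corpus == 0:
        return 0.0
    # PMI(word, context) = log( P(word | context) / P(word) )
    pmi = math.log((df_context / len(context)) / (df_corpus / len(corpus)))
    idf = math.log(len(corpus) / df_corpus)
    # Count context documents where the word appears with each seed,
    # weighted by that seed's bias.
    seed_term = sum(bias * sum(word in d and seed in d for d in context)
                    for seed, bias in SEED_BIAS.items())
    return pmi * idf * seed_term


context_docs = ["fuzzy picture bad", "fuzzy lens terrible", "great zoom"]
corpus_docs = context_docs + ["great movie", "bad plot", "wonderful cast",
                              "the plot was stupid", "best film",
                              "poorly acted film"]
print(context_score("fuzzy", context_docs, corpus_docs))
print(context_score("zoom", context_docs, corpus_docs))
```

When PMI and IDF are both positive, the sign of the score comes from the seed term. A word that is less common in the context than in the corpus has negative PMI, which flips the sign of a plain product, so a real system would need to handle that case.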
Rocchio Baseline • Rocchio: a query expansion algorithm for search • Similar goals to ours: find more relevant words • Does not account for sentiment • The new query is a weighted sum of • Matching document vectors • Query vector • Non-matching document vectors (negative weight).
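The weighted sum can be sketched in a few lines. The sketch assumes sparse term-weight dictionaries as vectors, averages each document set, and uses commonly cited default weights (alpha = 1.0, beta = 0.75, gamma = 0.15). Dropping terms whose weight is not positive is also an assumption; none of these settings come from the slide.

```python
from collections import defaultdict


def rocchio(query, matching, non_matching, alpha=1.0, beta=0.75, gamma=0.15):
    """Return the expanded query as a dict of term -> weight.

    new_query = alpha * query
              + beta  * mean(matching document vectors)
              - gamma * mean(non-matching document vectors)
    """
    expanded = defaultdict(float)
    for term, weight in query.items():
        expanded[term] += alpha * weight
    for doc in matching:
        for term, weight in doc.items():
            expanded[term] += beta * weight / len(matching)
    for doc in non_matching:
        for term, weight in doc.items():
            expanded[term] -= gamma * weight / len(non_matching)
    # Keep only terms that end up with a positive weight.
    return {term: w for term, w in expanded.items() if w > 0}


query = {"ipod": 1.0}
matching = [{"ipod": 1.0, "battery": 1.0}, {"ipod": 1.0, "nano": 1.0}]
non_matching = [{"apple": 1.0, "fruit": 1.0}]
expanded_query = rocchio(query, matching, non_matching)
print(expanded_query)
```

The expansion adds "battery" and "nano" because they appear in matching documents, but nothing in the formula separates positive from negative sentiment words, which is the limitation noted above.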
iPod according to TF-IDF (figures not reproduced) • Positive Sentiment in Movie Reviews • Negative Sentiment in Movie Reviews
Sentimental Context • Components: • PMI(Word, Context) • TF • IDF • Log(actual co-occurrence of word and seed in the context / co-occurrence expected by chance) • Values: • Abnormality relative to other documents • Popular words in the context • Rare words in the corpus • Words that occur with sentiment words in the query documents
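One way to write the four components down, using their standard textbook forms, is given here. The notation is an assumption: w is a word, s a seed word, c the context (the documents returned by the query), N the number of documents in the corpus, and df(w) the number of corpus documents containing w. The slide does not state how the components are combined, so no combined score is shown.

```latex
\begin{aligned}
\mathrm{PMI}(w, c) &= \log \frac{P(w, c)}{P(w)\,P(c)}
  && \text{(abnormality relative to other documents)} \\
\mathrm{TF}(w, c) &= \sum_{d \in c} \mathrm{count}(w, d)
  && \text{(popular words in the context)} \\
\mathrm{IDF}(w) &= \log \frac{N}{\mathrm{df}(w)}
  && \text{(rare words in the corpus)} \\
\mathrm{Co}(w, s, c) &= \log \frac{P(w, s \mid c)}{P(w \mid c)\,P(s \mid c)}
  && \text{(words that occur with sentiment words)}
\end{aligned}
```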
Google Hits (Battery Related): • iPod battery good ~ 13.5 million • iPod battery bad ~ 900 K • iPod nano battery good ~ 3 million • iPod nano battery bad ~ 785 K • iPod shuffle battery good ~ 1.6 million • iPod shuffle battery bad ~ 230 K • iPod shuffle battery price good ~ 2.6 million (not a typo) • iPod shuffle battery price bad ~ 230 K • iPod battery price good ~ 13.5 million • iPod battery price bad ~ 850 K • iPod nano battery price good ~ 3 million • iPod nano battery price bad ~ 785 K
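One hedged way to read these counts is as a good-to-bad ratio per query. The calculation below is illustrative only and is not a step stated on the slide. The counts are the approximate figures listed above, and the names `HITS` and `good_bad_ratio` are hypothetical.

```python
# Approximate hit counts copied from the slide: query -> (good, bad).
HITS = {
    "iPod battery": (13_500_000, 900_000),
    "iPod nano battery": (3_000_000, 785_000),
    "iPod shuffle battery": (1_600_000, 230_000),
    "iPod shuffle battery price": (2_600_000, 230_000),
    "iPod battery price": (13_500_000, 850_000),
    "iPod nano battery price": (3_000_000, 785_000),
}


def good_bad_ratio(good, bad):
    """Ratio of 'good' hits to 'bad' hits; values above 1 mean 'good' dominates."""
    return good / bad


RATIOS = {query: good_bad_ratio(good, bad) for query, (good, bad) in HITS.items()}

for query, ratio in RATIOS.items():
    print(f"{query}: {ratio:.1f}")
```

Every query returns more "good" hits than "bad" hits, so raw hit counts alone give a coarse signal that leans positive across all of these queries.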
Summary • Interesting problem with many potential applications • Domain dependence is the core challenge • The keys to success are: • Vast quantities of unlabeled data • Semantic knowledge from freely available sources • Semantics must guide and influence but not overrule the statistics
PMI - Pointwise Mutual Information • a.k.a. Specific Mutual Information • Do two variables co-occur more often than would be expected by chance?
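A minimal sketch of the calculation, assuming document-level co-occurrence and the natural logarithm (the slide specifies neither); the function name `pmi` and the toy documents are hypothetical.

```python
import math


def pmi(docs, x, y):
    """Pointwise mutual information: log( P(x, y) / (P(x) * P(y)) ).

    Probabilities are estimated as the fraction of documents containing the
    word(s). Returns -inf when x and y never appear in the same document.
    """
    doc_sets = [set(d.lower().split()) for d in docs]
    n = len(doc_sets)
    p_x = sum(x in s for s in doc_sets) / n
    p_y = sum(y in s for s in doc_sets) / n
    p_xy = sum(x in s and y in s for s in doc_sets) / n
    if p_xy == 0:
        return float("-inf")
    return math.log(p_xy / (p_x * p_y))


docs = ["great battery", "great screen", "bad battery", "great battery life"]
print(pmi(docs, "great", "battery"))
print(pmi(docs, "battery", "life"))
```

A positive value means the two words appear together more often than chance would predict, a negative value means less often, and zero means they look independent.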