Topic-Dependent Sentiment Analysis of Financial Blogs Neil O’Hare, Michael Davy, Adam Bermingham, Paul Ferguson, Paraic Sheridan, Cathal Gurrin, Alan F. Smeaton Date: 2010/04/19 Speaker: Yu-Cheng Hsieh
Outline • Introduction • Glossary • Issues • Development of corpus • Analysis of corpus • Topic-based analysis • Experiment & Result • Conclusion
Introduction • No existing work has used blogs as a source; most prior work has used news articles. • News articles tend to report a stock’s past performance. • Blogs are more likely to express opinions and make predictions about the future performance of stocks.
Introduction (Cont.) • The aim is to… • Automatically extract the subjective opinions uniquely found on blogs. • Track the changing sentiment of the blogosphere towards individual stocks and the market in general. • A supervised learning approach is used.
Glossary • Document: a blog article. • Topic: the name of a stock. • Unique document: a document that contains only one topic. • Topic shift: a change of topic within a multi-topic document.
Glossary (Cont.) • Doc-topic pair: a topic within a non-unique document (also treated as a sub-document of that document). • Inter-annotator agreement: the degree to which different annotators assign the same label to the same item.
Issues • Topic shift: how should the text relevant to each topic be extracted from a document? • At what level should sentiment be analysed: document, paragraph, sentence, or word level? • How many labels should be used for annotation?
Extracting sub-documents • Uses a proximity approach (see the sketch below) • Steps • Find the topic word T • Set a window size N • Starting from T, expand N words to the left and to the right of T.
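A minimal sketch of this proximity-based window extraction, assuming the topic is matched as a single token; the function name `extract_window` and the handling of multi-word company names are illustrative, not from the paper.

```python
def extract_window(text, topic, n=30):
    """Return the n words to the left and right of each mention of `topic`.

    Assumes the topic can be matched within a single whitespace token;
    multi-word company names would need a more careful match.
    """
    words = text.split()
    snippets = []
    for i, word in enumerate(words):
        if topic.lower() in word.lower():
            start = max(0, i - n)            # clamp window at document start
            end = min(len(words), i + n + 1)  # clamp window at document end
            snippets.append(" ".join(words[start:end]))
    return snippets
```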
Development of corpus • The corpus is made up of financial blog articles from “blogged.com” • 232 financial blogs were identified • Articles were gathered in two crawls, by date - Crawl 1: 3 weeks in Feb. 2009 - Crawl 2: 5 weeks from May to Jun. 2009
Development of corpus (Cont.) • Noise removal - Uses the DiffPost algorithm - Concept: noise tends to be repeated across multiple articles - Steps (see the sketch below) • Break each article into HTML segments • Compare segments across articles • Remove repeated segments; only unique segments are kept
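A simplified sketch of the idea behind this step, assuming articles have already been split into HTML segments; it only mirrors the “repeated segments are noise” heuristic and is not the full DiffPost algorithm.

```python
from collections import Counter

def remove_repeated_segments(articles, min_repeats=2):
    """Drop segments that recur across articles (navigation, ads, footers).

    `articles` is a list of articles, each a list of HTML segments.
    A segment appearing in `min_repeats` or more articles is treated as noise.
    """
    # Count in how many distinct articles each segment appears.
    counts = Counter(seg for segments in articles for seg in set(segments))
    # Keep only the segments unique enough to be real content.
    return [[seg for seg in segments if counts[seg] < min_repeats]
            for segments in articles]
```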
Development of corpus (Cont.) • Labels - Very Positive / Very Negative - Positive / Negative - Neutral - Mixed - Not relevant - IDK (I Don’t Know)
Development of corpus (Cont.) • Topics and retrieval • The 500 stocks of the S&P 500 were chosen as topics. • Relevant articles must contain the whole company name in its exact case (a naive matching sketch follows below). • Unique annotations are identified by the combination of document and topic: the doc-topic pair.
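A naive sketch of such a relevance filter; a plain case-sensitive substring check is used here as a stand-in for the slide’s exact-case matching rule, and the function name `is_relevant` is illustrative.

```python
def is_relevant(article_text, company_name):
    """Keep only articles containing the whole company name.

    Case-sensitive match: requiring the name exactly as written helps avoid
    false matches when a company name is also a common word (e.g. "Apple").
    """
    return company_name in article_text
```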
Development of corpus (Cont.) • Topics and retrieval • A number of documents were also annotated for their sentiment towards stocks in general. => 1,526 unique doc-topic pairs in total - 167 of which were annotated for stocks in general - 164 of which were annotated by two annotators to facilitate inter-annotator agreement analysis
Analysis of corpus • Annotation statistics
Analysis of corpus (Cont.) • Inter-Annotator Agreement
Cohen’s Kappa • Example: 50 items, two annotators A and B; they agree on YES for 20 items and on NO for 15 (see the sketch below) • Observed agreement P(a) = (20+15)/50 = 0.7 • A said YES 30 times => 30/50 = 0.6; B said YES 25 times => 25/50 = 0.5 • Probability both say YES = 0.6*0.5 = 0.3; both say NO = 0.4*0.5 = 0.2 => Chance agreement P(e) = 0.3+0.2 = 0.5 • Kappa = (P(a)-P(e))/(1-P(e)) = (0.7-0.5)/(1-0.5) = 0.4
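A small sketch that reproduces the worked example above; the 2×2 agreement-table layout and the function name are assumptions for illustration.

```python
def cohens_kappa(table):
    """Cohen's kappa from a square agreement table.

    table[i][j] = number of items annotator A labelled i and annotator B labelled j
    (here 0 = YES, 1 = NO).
    """
    total = sum(sum(row) for row in table)
    # Observed agreement: proportion of items on the diagonal.
    p_a = sum(table[i][i] for i in range(len(table))) / total
    # Chance agreement: product of each annotator's marginal label rates.
    p_e = sum(
        (sum(table[i]) / total) * (sum(row[i] for row in table) / total)
        for i in range(len(table))
    )
    return (p_a - p_e) / (1 - p_e)

# Worked example from the slide: 20 YES/YES, 10 YES/NO, 5 NO/YES, 15 NO/NO.
print(cohens_kappa([[20, 10], [5, 15]]))  # 0.4
```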
Analysis of corpus (Cont.) • Topic Relevance
Topic-based sentiment analysis • Topic-based text extraction • Blog articles often contain multiple topics. • Topic-based extraction enables sentiment analysis at the sub-document level, which should alleviate the topic-shift problem.
Topic-based sentiment analysis (Cont.) • Topic-based text extraction • Three approaches to extracting sub-documents (a sentence-level sketch follows below) • N-word extraction • N-sentence extraction • N-paragraph extraction
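A rough sketch of one of these variants, N-sentence extraction, assuming it keeps the sentence containing the topic plus its neighbouring sentences; the paper’s exact definition of the window may differ, and the sentence splitter here is deliberately crude.

```python
import re

def extract_sentences(text, topic, n=2):
    """Keep each sentence mentioning `topic` plus n sentences on either side."""
    sentences = re.split(r"(?<=[.!?])\s+", text)  # naive sentence splitting
    keep = set()
    for i, sent in enumerate(sentences):
        if topic.lower() in sent.lower():
            keep.update(range(max(0, i - n), min(len(sentences), i + n + 1)))
    return " ".join(sentences[i] for i in sorted(keep))
```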
Topic-based sentiment analysis (Cont.) • Sentiment classification • The classification task attempts to model a function from a document to a sentiment label 1. Binary classification: f : D → {positive, negative} 2. 3-point classification: f : D → {positive, neutral, negative}
Experiment • Discarded doc-topic pairs with no label, or labelled inconsistently by more than one annotator • 687 labelled documents for binary classification • 917 labelled documents for 3-point classification • Three classifiers compared (see the sketch below) 1. Multinomial Naïve Bayes (MNB) 2. SVM 3. Trivial classifier as baseline • 10-fold cross-validation • Performance metric: classification accuracy • Sub-documents were used to train the classifiers
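A sketch of this evaluation setup using scikit-learn. Assumptions not in the original: `load_corpus()` is a placeholder for loading the annotated sub-documents and labels, bag-of-words counts are used as features, the SVM is linear, and the trivial baseline is taken to be a majority-class classifier.

```python
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts, labels = load_corpus()  # placeholder for the annotated doc-topic pairs

classifiers = {
    "MNB": MultinomialNB(),
    "SVM": LinearSVC(),
    "baseline": DummyClassifier(strategy="most_frequent"),  # assumed trivial baseline
}
for name, clf in classifiers.items():
    pipeline = make_pipeline(CountVectorizer(), clf)  # bag-of-words features
    scores = cross_val_score(pipeline, texts, labels, cv=10, scoring="accuracy")
    print(name, scores.mean())
```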
Results • Document level only
Results (Cont.) • Binary classification using MNB at N=30
Conclusion • Explored the use of blog sources for sentiment analysis in the financial domain • Developed a corpus of over 1,500 document-level annotations • Analysis of the annotation effort suggests that humans have particular difficulty annotating the degree of polarity • Proposed a text-extraction approach to address the topic-shift problem • Plan to explore the use of linguistic features and domain-independent experiments