Topic-Dependent Sentiment Analysis of Financial Blogs Neil O’Hare, Michael Davy, Adam Bermingham, Paul Ferguson, Paraic Sheridan, Cathal Gurrin, Alan F. Smeaton Date: 2010/04/19 Speaker: Yu-Cheng Hsieh
Outline • Introduction • Glossary • Issues • Development of corpus • Analysis of corpus • Topic-based analysis • Experiment & Result • Conclusion
Introduction • No existing work has used blogs as a source; most prior work has used news articles. • News articles tend to report a stock’s past performance. • Blogs are more likely to express opinions and make predictions about the future performance of stocks.
Introduction (Cont.) • The aim is to… • Automatically extract the subjective opinions uniquely found on blogs. • Track the changing sentiment of the blogosphere towards individual stocks and the market in general. • A supervised learning approach is used.
Glossary • Document: a blog article. • Topic: the name of a stock. • Unique document: a document that contains only one topic. • Topic shift: a change of topic within a multi-topic document.
Glossary (Cont.) • Doc-topic pair: a topic within a non-unique document (also treated as a sub-document of that document). • Inter-annotator agreement: the degree to which different annotators assign the same label to the same item.
Issues • Topic shift: how should the text relevant to each topic be extracted from a document? • At what level should sentiment be analysed: document, paragraph, sentence, or word level? • How many labels should be used for annotation?
Extracting sub-documents • Uses a proximity approach (see the sketch below) • Steps • Find the topic word T • Set a window size N • Starting from T, expand N words to the left and to the right of T.
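A minimal sketch of this proximity-based window extraction, assuming the topic is matched as a single token; the function name `extract_window` and the handling of multi-word company names are illustrative, not from the paper.

```python
def extract_window(text, topic, n=30):
    """Return the n words to the left and right of each mention of `topic`.

    Assumes the topic can be matched within a single whitespace token;
    multi-word company names would need a more careful match.
    """
    words = text.split()
    snippets = []
    for i, word in enumerate(words):
        if topic.lower() in word.lower():
            start = max(0, i - n)            # clamp window at document start
            end = min(len(words), i + n + 1)  # clamp window at document end
            snippets.append(" ".join(words[start:end]))
    return snippets
```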
Development of corpus • The corpus is made up of financial blog articles from “blogged.com” • 232 financial blogs were identified • Articles were gathered in two crawls, by date - Crawl 1: 3 weeks in Feb. 2009 - Crawl 2: 5 weeks from May to Jun. 2009
Development of corpus (Cont.) • Noise removal - Uses the DiffPost algorithm - Concept: noise tends to be repeated across multiple articles - Steps (see the sketch below) • Break each article into HTML segments • Compare segments across articles • Remove repeated segments; only unique segments are kept
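A simplified sketch of the idea behind this step, assuming articles have already been split into HTML segments; it only mirrors the “repeated segments are noise” heuristic and is not the full DiffPost algorithm.

```python
from collections import Counter

def remove_repeated_segments(articles, min_repeats=2):
    """Drop segments that recur across articles (navigation, ads, footers).

    `articles` is a list of articles, each a list of HTML segments.
    A segment appearing in `min_repeats` or more articles is treated as noise.
    """
    # Count in how many distinct articles each segment appears.
    counts = Counter(seg for segments in articles for seg in set(segments))
    # Keep only the segments unique enough to be real content.
    return [[seg for seg in segments if counts[seg] < min_repeats]
            for segments in articles]
```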
Development of corpus (Cont.) • Labels - Very Positive / Very Negative - Positive / Negative - Neutral - Mixed - Not relevant - IDK (I Don’t Know)
Development of corpus (Cont.) • Topics and retrieval • The 500 stocks of the S&P 500 were chosen as topics. • Relevant articles must contain the whole company name in its exact case (a naive matching sketch follows below). • Unique annotations are identified by the combination of document and topic: the doc-topic pair.
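A naive sketch of such a relevance filter; a plain case-sensitive substring check is used here as a stand-in for the slide’s exact-case matching rule, and the function name `is_relevant` is illustrative.

```python
def is_relevant(article_text, company_name):
    """Keep only articles containing the whole company name.

    Case-sensitive match: requiring the name exactly as written helps avoid
    false matches when a company name is also a common word (e.g. "Apple").
    """
    return company_name in article_text
```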
Development of corpus (Cont.) • Topics and retrieval • A number of documents were also annotated for their sentiment towards stocks in general. => 1,526 unique doc-topic pairs in total - 167 of which were annotated for stocks in general - 164 of which were annotated by two annotators to facilitate inter-annotator agreement analysis
Analysis of corpus • Annotation statistics
Analysis of corpus (Cont.) • Inter-Annotator Agreement
Cohen’s Kappa • Example: 50 items, two annotators A and B; they agree on YES for 20 items and on NO for 15 (see the sketch below) • Observed agreement P(a) = (20+15)/50 = 0.7 • A said YES 30 times => 30/50 = 0.6; B said YES 25 times => 25/50 = 0.5 • Probability both say YES = 0.6*0.5 = 0.3; both say NO = 0.4*0.5 = 0.2 => Chance agreement P(e) = 0.3+0.2 = 0.5 • Kappa = (P(a)-P(e))/(1-P(e)) = (0.7-0.5)/(1-0.5) = 0.4
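A small sketch that reproduces the worked example above; the 2×2 agreement-table layout and the function name are assumptions for illustration.

```python
def cohens_kappa(table):
    """Cohen's kappa from a square agreement table.

    table[i][j] = number of items annotator A labelled i and annotator B labelled j
    (here 0 = YES, 1 = NO).
    """
    total = sum(sum(row) for row in table)
    # Observed agreement: proportion of items on the diagonal.
    p_a = sum(table[i][i] for i in range(len(table))) / total
    # Chance agreement: product of each annotator's marginal label rates.
    p_e = sum(
        (sum(table[i]) / total) * (sum(row[i] for row in table) / total)
        for i in range(len(table))
    )
    return (p_a - p_e) / (1 - p_e)

# Worked example from the slide: 20 YES/YES, 10 YES/NO, 5 NO/YES, 15 NO/NO.
print(cohens_kappa([[20, 10], [5, 15]]))  # 0.4
```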
Analysis of corpus (Cont.) • Topic Relevance
Topic-based sentiment analysis • Topic-based text extraction • Blog articles often contain multiple topics. • Topic-based extraction enables sentiment analysis at the sub-document level, which should alleviate the topic-shift problem.
Topic-based sentiment analysis (Cont.) • Topic-based text extraction • Three approaches to extracting sub-documents (a sentence-level sketch follows below) • N-word extraction • N-sentence extraction • N-paragraph extraction
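A rough sketch of one of these variants, N-sentence extraction, assuming it keeps the sentence containing the topic plus its neighbouring sentences; the paper’s exact definition of the window may differ, and the sentence splitter here is deliberately crude.

```python
import re

def extract_sentences(text, topic, n=2):
    """Keep each sentence mentioning `topic` plus n sentences on either side."""
    sentences = re.split(r"(?<=[.!?])\s+", text)  # naive sentence splitting
    keep = set()
    for i, sent in enumerate(sentences):
        if topic.lower() in sent.lower():
            keep.update(range(max(0, i - n), min(len(sentences), i + n + 1)))
    return " ".join(sentences[i] for i in sorted(keep))
```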
Topic-based sentiment analysis (Cont.) • Sentiment classification • The classification task attempts to model a function from a document to a sentiment label 1. Binary classification: f : D → {positive, negative} 2. 3-point classification: f : D → {positive, neutral, negative}
Experiment • Discarded doc-topic pairs with no label, or labelled inconsistently by more than one annotator • 687 labelled documents for binary classification • 917 labelled documents for 3-point classification • Three classifiers compared (see the sketch below) 1. Multinomial Naïve Bayes (MNB) 2. SVM 3. Trivial classifier as baseline • 10-fold cross-validation • Performance metric: classification accuracy • Sub-documents were used to train the classifiers
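A sketch of this evaluation setup using scikit-learn. Assumptions not in the original: `load_corpus()` is a placeholder for loading the annotated sub-documents and labels, bag-of-words counts are used as features, the SVM is linear, and the trivial baseline is taken to be a majority-class classifier.

```python
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts, labels = load_corpus()  # placeholder for the annotated doc-topic pairs

classifiers = {
    "MNB": MultinomialNB(),
    "SVM": LinearSVC(),
    "baseline": DummyClassifier(strategy="most_frequent"),  # assumed trivial baseline
}
for name, clf in classifiers.items():
    pipeline = make_pipeline(CountVectorizer(), clf)  # bag-of-words features
    scores = cross_val_score(pipeline, texts, labels, cv=10, scoring="accuracy")
    print(name, scores.mean())
```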
Results • Document level only
Results (Cont.) • Binary classification using MNB at N=30
Conclusion • Explored the use of blog sources for sentiment analysis in the financial domain • Developed a corpus of over 1,500 document-level annotations • Analysis of the annotation effort suggests that humans have particular difficulty annotating the degree of polarity • Proposed a text-extraction approach to address the topic-shift problem • Plan to explore the use of linguistic features and domain-independent experiments