
Topic-Dependent Sentiment Analysis of Financial Blogs

Topic-Dependent Sentiment Analysis of Financial Blogs. Neil O’Hare, Michael Davy, Adam Bermingham, Paul Ferguson, Páraic Sheridan, Cathal Gurrin, Alan F. Smeaton. Date: 2010/04/19. Speaker: Yu-Cheng Hsieh. Outline: Introduction, Glossary, Issues, Development of corpus, Analysis of corpus



  1. Topic-Dependent Sentiment Analysis of Financial Blogs Neil O’Hare, Michael Davy, Adam Bermingham, Paul Ferguson, Páraic Sheridan, Cathal Gurrin, Alan F. Smeaton Date: 2010/04/19 Speaker: Yu-Cheng Hsieh

  2. Outline • Introduction • Glossary • Issues • Development of corpus • Analysis of corpus • Topic-based analysis • Experiment & Result • Conclusion

  3. Introduction • No prior work had used blogs as a source; most had used news articles. • News articles are more likely to report a stock’s past performance. • Blogs are more likely to express opinions and to make predictions about the performance of stocks.

  4. Introduction (Cont.) • The aim is to… • Automatically extract the subjective opinions uniquely found on blogs. • Track the changing sentiment from the blogosphere towards individual stocks and the market in general. • The approach is supervised.

  5. Glossary • Document: a blog article. • Topic: the name of a stock. • Unique document: a document that contains only one topic. • Topic shift: a problem that arises in documents covering multiple topics.

  6. Glossary (Cont.) • Doc-topic pair: a topic in a non-unique document (also a sub-document of that document). • Inter-annotator agreement: the degree to which annotators assign the same label to the same item.

  7. Issues • Topic shift: how can each topic be extracted from a document? • At what level should sentiment be analyzed: document, paragraph, sentence, or word? • How many labels should be used for annotation?

  8. Extract sub-document • Using a proximity approach • Steps • Find the topic word: T • Set a window size: N • Starting from T, expand N words to both the left and the right of T.
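
The proximity steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `extract_window`, whitespace tokenisation, and the punctuation-stripping rule are all assumptions made for the example.

```python
def extract_window(text, topic, n):
    """Keep the words within n words on either side of each topic mention.

    Assumptions (not from the slides): words are split on whitespace, and
    surrounding punctuation is ignored when matching the topic.
    """
    words = text.split()
    topic_words = topic.split()
    k = len(topic_words)
    keep = set()
    for i in range(len(words) - k + 1):
        # Strip common punctuation so "APPLE," still matches "APPLE".
        candidate = [w.strip(".,;:!?()\"'") for w in words[i:i + k]]
        if candidate == topic_words:
            start = max(0, i - n)
            end = min(len(words), i + k + n)
            keep.update(range(start, end))
    # Overlapping windows from several mentions are merged in document order.
    return " ".join(words[j] for j in sorted(keep))


sentence = ("Markets fell today but APPLE shares rose sharply "
            "after strong earnings were announced")
print(extract_window(sentence, "APPLE", 2))
```

With N = 2 the sketch keeps two words on each side of the topic word and drops the rest of the sentence.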

  9. Development of corpus • The corpus is made up of financial blog articles from “blogged.com” • 232 financial blogs were identified • Articles were collected in two crawls, separated by date - Crawl 1: 3 weeks in Feb. 2009 - Crawl 2: 5 weeks from May to Jun. 2009

  10. Development of corpus (Cont.) • Noise Removal - Using the DiffPost algorithm - Concept: noise tends to be repeated across multiple articles. - Steps • Break each article into HTML segments • Compare the segments across articles • Remove repeated segments; only unique segments are kept.
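
The idea behind the noise-removal steps can be illustrated with a loose sketch. It is not the DiffPost algorithm itself: splitting on HTML tags with a regular expression and requiring exact string equality between segments are simplifying assumptions, and the function names are hypothetical.

```python
import re
from collections import Counter


def split_segments(html):
    """Split an article into text segments at HTML tags (a simplification)."""
    parts = re.split(r"<[^>]+>", html)
    return [p.strip() for p in parts if p.strip()]


def remove_repeated_segments(articles):
    """Keep only the segments that occur in exactly one article."""
    segmented = [split_segments(a) for a in articles]
    doc_freq = Counter()
    for segs in segmented:
        # Count each segment once per article, however often it repeats inside it.
        doc_freq.update(set(segs))
    return [[s for s in segs if doc_freq[s] == 1] for segs in segmented]


# Made-up articles: the navigation bar repeats, the body text does not.
articles = [
    "<div>Home | About</div><p>ACME beat forecasts.</p>",
    "<div>Home | About</div><p>Oil prices slid.</p>",
]
print(remove_repeated_segments(articles))
```

The shared navigation segment is dropped from both articles because it appears in more than one of them, which is the intuition stated on the slide.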

  11. Development of corpus (Cont.) • Labels - Very Negative/Positive - Neutral - Negative/Positive - Mixed - Not relevant - IDK (I Don’t Know)

  12. Development of corpus (Cont.) • Topics and retrieval • The 500 stocks of the “S&P 500” were chosen as topics. • Relevant articles must contain the whole company name in upper case. • Unique annotations are identified by the combination of document and topic, i.e. a doc-topic pair.

  13. Development of corpus (Cont.) • Topics and retrieval • A number of documents were also annotated with respect to their sentiment towards stocks in general. => - 1526 unique doc-topic pairs. - 167 of which were annotated for stocks in general. - 164 of which were annotated by two annotators to facilitate inter-annotator agreement analysis.
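
A small sketch of how doc-topic pairs might be collected. It assumes, as one possible reading of the slide, that "in upper case" means the full company name must appear capitalised exactly as in the topic list, so matching is case-sensitive. The document ids, texts, and function names are made up for illustration.

```python
import re
from collections import Counter


def find_doc_topic_pairs(docs, topics):
    """Return (doc_id, topic) for every document that mentions a topic.

    Matching is case-sensitive and on whole words (an assumption).
    """
    pairs = []
    for doc_id, text in docs.items():
        for topic in topics:
            if re.search(r"\b" + re.escape(topic) + r"\b", text):
                pairs.append((doc_id, topic))
    return pairs


def unique_documents(pairs):
    """Documents that mention exactly one topic (the glossary's 'unique document')."""
    counts = Counter(doc_id for doc_id, _ in pairs)
    return [doc_id for doc_id, c in counts.items() if c == 1]


docs = {
    "d1": "Apple and Microsoft both rallied.",
    "d2": "I ate an apple.",
    "d3": "Microsoft slipped.",
}
pairs = find_doc_topic_pairs(docs, ["Apple", "Microsoft"])
print(pairs)
print(unique_documents(pairs))
```

Each pair is the unit of annotation, so a document that mentions two stocks contributes two separate items to the corpus.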

  14. Analysis of corpus • Annotation statistics

  15. Analysis of corpus(Cont.) • Inter-Annotator Agreement

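  16. Cohen’s Kappa • Example: annotators A and B each label 50 items YES or NO. • Both said YES 20 times and both said NO 15 times => probability of observed agreement P(a) = (20+15)/50 = 0.7 • A said YES 30 times => 30/50 = 0.6; B said YES 25 times => 25/50 = 0.5 • Probability that both say YES by chance = 0.6*0.5 = 0.3; that both say NO = 0.4*0.5 = 0.2 => probability of random agreement P(e) = 0.3+0.2 = 0.5 • Kappa = (P(a)-P(e))/(1-P(e)) = (0.7-0.5)/(1-0.5) = 0.4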
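
The kappa arithmetic on this slide can be checked with a few lines of code. The slide gives only the agreement counts (20 and 15) and the YES totals (30 for A, 25 for B); the two disagreement counts used below (10 and 5) are derived from those totals. The function name and argument layout are illustrative choices.

```python
def cohens_kappa(both_yes, a_yes_b_no, a_no_b_yes, both_no):
    """Cohen's kappa for two annotators and two labels, from a 2x2 count table."""
    total = both_yes + a_yes_b_no + a_no_b_yes + both_no
    p_a = (both_yes + both_no) / total        # observed agreement
    a_yes = (both_yes + a_yes_b_no) / total   # share of items A labelled YES
    b_yes = (both_yes + a_no_b_yes) / total   # share of items B labelled YES
    # Chance agreement: both say YES or both say NO independently.
    p_e = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    return (p_a - p_e) / (1 - p_e)


# Counts matching the slide: A says YES 20 + 10 = 30 times, B says YES 20 + 5 = 25 times.
print(round(cohens_kappa(20, 10, 5, 15), 3))
```

Up to floating-point rounding, this reproduces the value of 0.4 worked out on the slide.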

  17. Analysis of corpus(Cont.) • Topic Relevance

  18. Topic-based sentiment analysis • Topic-based text extraction • Blog articles often contain multiple topics. • Topic-based extraction enables sentiment analysis at the sub-document level; this should alleviate the topic-shift problem.

  19. Topic-based sentiment analysis(Cont.) • Topic-based text extraction • Three approaches to extracting sub-documents • N-word extraction • N-sentence extraction • N-paragraph extraction
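
As an illustration of the sentence-level variant, here is a minimal sketch. It assumes that N-sentence extraction keeps every sentence mentioning the topic plus N sentences on either side; the slides do not spell this out. The regex sentence splitter and the function name `extract_sentences` are also assumptions.

```python
import re


def extract_sentences(text, topic, n):
    """Keep sentences that mention the topic, plus n sentences on each side."""
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    keep = set()
    for i, sentence in enumerate(sentences):
        if topic in sentence:
            keep.update(range(max(0, i - n), min(len(sentences), i + n + 1)))
    return [sentences[i] for i in sorted(keep)]


post = ("Markets were mixed. Oil fell again. ACME raised its outlook. "
        "Analysts were impressed. Bonds were flat.")
print(extract_sentences(post, "ACME", 1))
```

The paragraph-level variant would follow the same pattern with paragraphs in place of sentences.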

  20. Topic-based sentiment analysis(Cont.) • Sentiment classification • The classification task attempts to model a function 1. For binary classification 2. For 3-point classification

  21. Experiment • Doc-topic pairs that had no label, or that were labelled inconsistently by more than one annotator, were discarded. • 687 labelled documents for binary classification • 917 labelled documents for 3-point classification • Compare three classifiers 1. Multinomial Naïve Bayes (MNB) 2. SVM 3. Trivial classifier as baseline • 10-fold cross-validation • Performance metric: classification accuracy • Sub-documents were used to train the classifiers
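
To make the evaluation setup concrete, here is a sketch of 10-fold cross-validated accuracy for a trivial baseline. It assumes the trivial classifier always predicts the majority class of the training folds, which the slides do not state. The round-robin fold assignment and the toy labels are made up for the example and are not the corpus data.

```python
from collections import Counter


def k_fold_indices(n_items, k):
    """Assign indices 0..n_items-1 to k folds in round-robin order."""
    return [list(range(fold, n_items, k)) for fold in range(k)]


def majority_baseline_accuracy(labels, k=10):
    """Cross-validated accuracy of always predicting the training majority class."""
    correct = 0
    for test_idx in k_fold_indices(len(labels), k):
        held_out = set(test_idx)
        train_labels = [lab for i, lab in enumerate(labels) if i not in held_out]
        majority = Counter(train_labels).most_common(1)[0][0]
        correct += sum(1 for i in test_idx if labels[i] == majority)
    return correct / len(labels)


# Toy labels only: 70 positive and 30 negative items.
toy_labels = ["pos"] * 70 + ["neg"] * 30
print(majority_baseline_accuracy(toy_labels))
```

Any real classifier, such as MNB or an SVM, would be scored with the same fold loop, so the baseline accuracy is the figure it has to beat.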

  22. Results • Document level only

  23. Results (Cont.)

  24. Results (Cont.)

  25. Results (Cont.) • Binary classification using MNB at N=30

  26. Conclusion • Explored the use of blog sources for sentiment analysis in the financial domain • Developed a corpus of over 1,500 document-level annotations • Analysis of the annotation effort suggests that humans have particular difficulty annotating for degree of polarity • Proposed a text-extraction approach to address the topic-shift problem • Plan to explore the use of linguistic features and domain-independent experiments

  27. Thank you for listening
