Compare different financial news sources, similar products for Chinese news, analyze sentiment using word2vec and doc2vec, and explore their relation to stock prices. Evaluate available Chinese news sources like HKEx, Wisenews, 華富財經, and more. Discuss advantages and disadvantages of each source. Examine online sources like 財華網, 阿思達克財經網, and 經濟通. Explore sentiment analysis tools such as 優礦 and Wordstat. Develop milestones for a delivery system of financial news analysis.
Financial news sources, product delivery, doc2vec 15-8-2017 David Ling
Contents • Comparing different news sources • Similar existing products for Chinese news • Expected milestone • Sentiment using word2vec and doc2vec • Relation to stock price
News • Available Chinese News sources: 3 types • Official news and announcements – HKEx • Traditional newspaper – Wisenews • Real time online news – 華富財經, 經濟通, 阿思達克財經網, 財華網
HKEx Disadvantages • Not frequent • Only official documents • Many announcements are numerical data and forms; the format varies between companies, which makes data cleaning difficult • Mostly numbers and tables, with many unrelated words and few relevant sentences of textual data to analyze Sample news and announcement 2 Sample news and announcement 1
Wisenews • Collection of news from traditional newspapers Advantages • Large database (dating back to the year 2000) • More frequent than HKEx • News content is richer than HKEx documents, with more descriptive terms like “急挫” (plunge), “落後” (lag behind), “增持” (increase holdings) Disadvantages • Not as frequent as online news • Needs segmentation and tagging of the related company • Not free • News is not categorized by financial topic 8297 3983, 3303
Online news 1: 華富財經 Advantages: • Updated frequently • Already tagged with the related company • Easy to crawl (I have written a crawler) Disadvantages: • Less reliable than newspapers • Less, but still sufficient, past data (dating back to ~2010)
Online news 2: 財華網 Advantages: • Updated very frequently • Used by Li Xiaodong (2014) • Doctoral thesis, City University • News Impact Analysis in Algorithmic Trading • English news articles from 2003-2008 were used • Dictionary-based Disadvantages: • Harder to crawl
Online news 3: 阿思達克財經網, 經濟通 Advantages • Similar to the previous sources Disadvantages • Less past data • Search results are limited to 15 pages (阿思達克) • Search results are limited to 200 items (經濟通)
News sources comparison • Given the data formats, we may start with online news • Online news analysis is in high demand for algorithmic and low-latency trading, as there is too much online financial news for traders to read • The final product should also be able to include newspapers
Similar works on sentiment • Last time: • DICTION (dictionary-based) • Thomson Reuters news sentiment (neural network, supervised learning) • This time: • 台股新聞情緒 (Chinese, dictionary-based) • 優礦 (Chinese) • Wordstat (dictionary-based)
台股新聞情緒 • Webpage format • Provides two kinds of index for a company: • SR (optimistic) • ITDC (risk) • Also lists the companies with the top SR index each week • News sentiment by counting keywords: “指標試算” (index trial calculation) • They have developed their own Chinese dictionary (聯合知識庫 + 銘傳大學)
News sentiment by counting keywords: • Incorrect: the tone should be negative. • Correct: the tone is negative. • The dictionary does not seem very sophisticated; for example, “蝕” (loss) is not detected.
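Keyword counting of this kind can be sketched as below. This is a minimal illustration only: the two word lists and the scoring formula are assumptions made for the example, not the dictionary or method used by 台股新聞情緒.

```python
# Hypothetical mini-dictionaries; a real system would use a curated financial lexicon.
POSITIVE = {"升", "增持", "盈利"}
NEGATIVE = {"跌", "急挫", "虧損"}

def keyword_tone(tokens):
    """Return a tone score in [-1, 1] from dictionary hits (0.0 if there are none)."""
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

# Two negative hits and no positive hit: the tone is fully negative.
print(keyword_tone(["股價", "急挫", "虧損", "擴大"]))
# "蝕" is missing from the dictionary, so the negative tone goes undetected.
print(keyword_tone(["公司", "蝕", "五億"]))
```

The second call shows the weakness noted on the slide: a word outside the dictionary contributes nothing to the score.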
優礦 • A mainland big data company • Provides news and a news sentiment API • 40k news threads per day • Sentiment score: [-1,1] • Provides trading strategy simulation • A user demonstrated a trading strategy on the company forum
By user cheng.li • Funny simulation result: both strategies are making profits
Wordstat • Non-free article sentiment analysis software • Dictionary-based • Trial version lasts for 30 days • Provides only descriptive statistics using the Loughran and McDonald Sentiment Word Lists
Milestone delivery: • Similar to 聯合知識庫 and 優礦 • A website, an API, and a financial news dictionary for Hong Kong can be built • Collect online news from multiple online sources • Provide news headlines, links, and sentiment scores • Calculate a score index for each company for the day • List companies with high/low scores for recommendation • May also attract mainland traders
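The daily score index and the high/low ranking could be computed roughly as follows. This is a sketch under assumptions: the index is taken to be the plain mean of the article scores for a company on a day, and the company names, dates, and scores are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

def daily_index(articles):
    """articles: iterable of (company, date, score) -> {(company, date): mean score}."""
    buckets = defaultdict(list)
    for company, date, score in articles:
        buckets[(company, date)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}

def rank(index, date, top_n=3):
    """Companies on a given date, highest index first."""
    rows = [(company, score) for (company, d), score in index.items() if d == date]
    return sorted(rows, key=lambda row: row[1], reverse=True)[:top_n]

# Hypothetical sentiment scores in [-1, 1].
articles = [
    ("CompanyA", "2017-08-15", 0.5),
    ("CompanyA", "2017-08-15", 1.0),
    ("CompanyB", "2017-08-15", -0.5),
]
index = daily_index(articles)
print(rank(index, "2017-08-15"))
```

Calling `rank` with the order reversed (or taking the tail of the list) would give the low-score companies.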
Sentiment methods • Classify texts into groups (e.g. optimistic or pessimistic) • Last time: • n-gram + 1-hot vector + SVM • Other possible ways: • Word2vec • Doc2vec • N-gram + 1-hot vector • Document 1: “cat sat mat” • Document 2: “cat hate cat”
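For the two example documents, the unigram case can be sketched as below. Each word is a 1-hot vector over the vocabulary, and a document vector is assumed here to be the sum of its word vectors (a word count); a binary presence vector would be an equally valid reading of the slide.

```python
def bag_of_words(docs):
    """Return the sorted vocabulary and one count vector per document."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    index = {word: i for i, word in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0] * len(vocab)
        for word in doc.split():
            vec[index[word]] += 1  # adding the word's 1-hot vector
        vectors.append(vec)
    return vocab, vectors

vocab, vectors = bag_of_words(["cat sat mat", "cat hate cat"])
print(vocab)    # ['cat', 'hate', 'mat', 'sat']
print(vectors)  # [[1, 0, 1, 1], [2, 1, 0, 0]]
```

These vectors are what a classifier such as an SVM would take as input features.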
word2vec • A model which turns a word into a vector • Method • Teach the machine to guess the context from the target word (skip-gram) • A mapping between the context and the target word • Example: • Doc1: The cat sat on the mat • Doc2: The dog sat on the mat • Teaching: • Given “cat” (target word), guess “the” and “sat” (context) • Given “dog” (target word), guess “the” and “sat” (context) • Outcome: • Similar words usually have very similar contexts • Their mapping parameters are therefore similar • And so are their word vectors (which are the mapping parameters)
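The training pairs in this example can be generated as below. This sketch only builds the (target, context) pairs with an assumed window of one word on each side; the neural network that learns the vectors from these pairs is not shown.

```python
def skipgram_pairs(tokens, window=1):
    """List (target, context) pairs for every token within the given window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

doc1 = "the cat sat on the mat".split()
doc2 = "the dog sat on the mat".split()

cat_context = [c for t, c in skipgram_pairs(doc1) if t == "cat"]
dog_context = [c for t, c in skipgram_pairs(doc2) if t == "dog"]
print(cat_context, dog_context)  # both are ['the', 'sat']
```

Because “cat” and “dog” are asked to predict the same context words, training pushes their vectors towards each other.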
Word2vec-practical • Jieba word segmentation (結巴分詞) • Tagging numerical terms (regexp) • Before: • 長和今天放榜,早前大摩預測長和中期比只升5%,主要受英英鎊貶值等外匯因素影響,預測長和上半年經營溢利同比升5%至309億元 • After: • 長和 今天 放榜 早前 大摩 預測 長和 中期 比 只升 xpercent 主要 受英 英鎊 貶值 等 外匯 因素 影響 預測 長和 上半年 經營 溢利 同比 升 xpercent 至 xmoney
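The numerical tagging step might look like the sketch below. The two regular expressions are assumptions chosen to reproduce the `xpercent` and `xmoney` placeholders in the example; the patterns actually used are not given on the slide, and the Jieba segmentation step is not shown.

```python
import re

# Assumed patterns: a number followed by "%", and a number followed by an
# optional 萬/億 multiplier and the currency unit 元.
PERCENT = re.compile(r"\d+(?:\.\d+)?%")
MONEY = re.compile(r"\d+(?:\.\d+)?[萬億]?元")

def tag_numbers(text):
    """Replace percentages and money amounts with placeholder tokens."""
    text = PERCENT.sub(" xpercent ", text)
    text = MONEY.sub(" xmoney ", text)
    return " ".join(text.split())  # collapse the extra spaces

print(tag_numbers("同比升5%至309億元"))  # 同比升 xpercent 至 xmoney
```

Replacing every figure with a shared token keeps the vocabulary small, so the model learns the role of "a percentage" rather than thousands of distinct numbers.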
Word2vec-practical • Using 30,000 crawled Quamnet news articles, 300 embedding dimensions • Results (word2vec in TensorFlow):
Nearest to 跌: 升, 挫, 倒跌, 微跌, 股亦收升, 現跌, 無升, 微升
Nearest to 同比: 按年, 去年同期, 之後高見, 僅減, 連特別息, 遠洋報, 此負
Nearest to 日: xdate, 日期, 日起, 日向, 昨日, 日止, 日終, 郭樹清
Nearest to 對: 認為, 家會員, 讓, 運費, 他們將, 將對, 令電能, 與
Nearest to 涉及: 共, 成交, xhand, 涉資約, 光啟, bcm_energy_partners, 對換
Nearest to 公布: 公佈, 宣布, 公告, 放榜, 發布, 止, 公在, arthur_h_del_prado
Nearest to 中: 指中, 內, 中解釋, 神華及, 耐, 港鐵學院, 其後再展, 遴選及
Nearest to 在: 於, 或, , 預期, 將在, 將於未來, 資源予, 與
Nearest to 而: 但, 另外, 表示, 或, 九鐵, 至於, 認為, 他稱
Nearest to 投資: 投資及, 金遠, 資產, 融資, 基金, 投資項, 阿拉斯加, 發展
Nearest to 公佈: 公布, 宣布, tank, 表示, 宣, 矽谷, 姜元, 刊發
Nearest to 虧損: 盈利, 溢利, 純利, 增長, 收益, , 收入, 錄純利
Nearest to 止: 止六個, 止全, 止首, 日止, 止九個, 月止, 止三個, 止將
Nearest to 會: 將會, 洪建, 只會, 會會, 可以, 起累, 匿名信, 希望
Nearest to 由: 則由, 為, 因為, 至, 因倫敦, 從, 自, 調莎莎
Nearest to 後: 簡俊傑, xindex, 前, 向國纜, 擴大後之, 港股, 已, 建市場
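A "nearest to" list like the one above is typically produced by ranking words by the similarity of their vectors. The sketch below assumes cosine similarity and uses made-up 3-dimensional vectors; the real model has 300 dimensions and its vectors are not reproduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, embeddings, k=2):
    """The k words whose vectors are most similar to the vector of `word`."""
    query = embeddings[word]
    scored = [(w, cosine(query, vec)) for w, vec in embeddings.items() if w != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [w for w, _ in scored[:k]]

# Hypothetical toy embeddings, for illustration only.
embeddings = {
    "跌": [1.0, 0.1, 0.0],
    "升": [0.9, 0.2, 0.0],
    "公布": [0.0, 1.0, 0.1],
    "宣布": [0.1, 0.9, 0.1],
}
print(nearest("跌", embeddings, k=1))
print(nearest("公布", embeddings, k=1))
```

Note that in the results above antonyms such as 升 and 跌, or 盈利 and 虧損, come out as neighbours: they occur in the same contexts, which matters if the vectors are later used for sentiment.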
Target word • Another method in word2vec is continuous bag of words (CBOW) • The opposite of skip-gram: given the context, guess the target word • The mapping parameters form the word vector • But for classification we need a vector for a document, not for a single word • Solutions: • Solution 1: Average all the word vectors: Doc1 vector = [0.42, 0.38, 0.22] • Bad, as averaging loses a lot of information • Solution 2: Extend word2vec to doc2vec context
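Solution 1 and its weakness can be illustrated as below. The 2-dimensional word vectors are hypothetical values chosen for the example; they are not taken from the trained model.

```python
def average_vectors(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Hypothetical word vectors.
word_vectors = {
    "cat": [1.0, 0.0],
    "hate": [0.0, 1.0],
    "dog": [0.5, 0.5],
}

def doc_vector(text):
    return average_vectors([word_vectors[w] for w in text.split()])

print(doc_vector("cat hate dog"))
print(doc_vector("dog hate cat"))  # same vector: word order is lost
```

Both sentences map to the same document vector even though they mean different things, which is one concrete way averaging loses information.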
doc2vec • Adds a 1-hot paragraph / document id vector • It is a constant input vector across the different input contexts within a particular paragraph • The weighting parameters for the paragraph id and for the words are updated at the same time during training • The weighting parameters for the paragraph id • represent the information missing from the current context • act as a memory of the topic of the paragraph • form the paragraph vector Quoc Le, Tomas Mikolov 2014 https://arxiv.org/pdf/1405.4053v2.pdf
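The role of the paragraph id can be sketched as below. This shows only how the input to the prediction layer might be formed, assuming the paragraph vector is averaged with the context word vectors (the paper also describes concatenation); the vectors are hypothetical and training is not shown.

```python
def pvdm_input(doc_id, context_words, doc_vectors, word_vectors):
    """Average the paragraph vector with the context word vectors."""
    rows = [doc_vectors[doc_id]] + [word_vectors[w] for w in context_words]
    return [sum(column) / len(rows) for column in zip(*rows)]

# Hypothetical learned parameters: one vector per paragraph id, one per word.
doc_vectors = {0: [1.0, 0.0], 1: [0.0, 1.0]}
word_vectors = {"the": [0.5, 0.5], "sat": [0.0, 0.5]}

context = ["the", "sat"]
print(pvdm_input(0, context, doc_vectors, word_vectors))
print(pvdm_input(1, context, doc_vectors, word_vectors))
```

The same context words give a different input in each paragraph, because the paragraph vector stays fixed for every context window taken from that paragraph.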
doc2vec Comparing sentiment accuracy (movie reviews): https://recurrentnull.wordpress.com/tag/sentiment-analysis/ • Doc2vec with logistic regression has the highest accuracy • But it is only about 1% higher than bag of words + 1-hot
Sentiment method and evaluation: • Proposed approach: supervised learning for optimistic and pessimistic tone: • Manually classify perhaps ~5k news articles • Very positive, positive, neutral, negative, very negative • Additional scores can be added at a later time • Separate into training data and testing data, evaluate accuracy and F1 score • Use SVM, FFNN, RNN, logistic regression, and even naive Bayes for performance comparison • Features will be 1-hot bag of words, doc2vec, part of speech • Comparison with the Loughran and McDonald financial dictionary may not be meaningful • Their word list comes from 10-K reports, while ours comes from news • Their word list is in English
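The two evaluation metrics can be computed as in the sketch below. The labels are made-up test data, and F1 is computed for one class at a time; how the per-class scores would be combined across the five classes is left open here.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels on a held-out test set.
y_true = ["positive", "positive", "negative", "negative"]
y_pred = ["positive", "negative", "negative", "negative"]
print(accuracy(y_true, y_pred))              # 0.75
print(f1_score(y_true, y_pred, "positive"))  # precision 1.0, recall 0.5
```

F1 is worth reporting alongside accuracy because the manually labelled classes are unlikely to be balanced.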
Relation to stock price • Calculate the correlation • 2 time series, corr(X,Y) • Event study (The Econometrics of Financial Markets, Ch. 4) • Impact of an event on the stock price • Abnormal return = return – normal return • Return: data in the event window (difference in daily closing prices) • Normal return: estimated from data in the estimation window (e.g. 60 days) • Null hypothesis: AR~N(0,var) • Stock buying simulation using simple strategies Very small correlation, as stock prices are also driven by factors not covered in the news
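The abnormal-return calculation can be sketched as below. It assumes the simplest choice of normal return, the mean return over the estimation window, and uses simple percentage returns; the prices are made up and the estimation window is far shorter than the 60 days suggested above.

```python
from statistics import mean, stdev

def daily_returns(prices):
    """Simple returns from consecutive daily closing prices."""
    return [(b - a) / a for a, b in zip(prices, prices[1:])]

def abnormal_returns(estimation_returns, event_returns):
    """Event-window returns minus the mean return of the estimation window."""
    normal = mean(estimation_returns)
    return [r - normal for r in event_returns]

def z_score(estimation_returns, abnormal):
    """Abnormal return in units of the estimation-window standard deviation."""
    return abnormal / stdev(estimation_returns)

# Hypothetical closing prices: estimation window first, then the event day.
estimation_prices = [100.0, 101.0, 100.0, 102.0, 101.0]
event_prices = [101.0, 95.0]

est = daily_returns(estimation_prices)
ar = abnormal_returns(est, daily_returns(event_prices))
print(ar, z_score(est, ar[0]))
```

Under the null hypothesis AR ~ N(0, var), a score far from zero suggests that the news event moved the price beyond its usual daily variation.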
The End and Thank You References • 技术分析【3】—— 众星拱月,众口铄金? https://uqer.io/community/share/55498c0af9f06c1c3d68806e • Sentiment Analysis of Movie Reviews (3): doc2vec https://recurrentnull.wordpress.com/tag/sentiment-analysis/ • Distributed Representations of Sentences and Documents https://arxiv.org/pdf/1405.4053v2.pdf