HKUST CSE Dept. 24th March 2009 Research Text Mining COMP630P Selected Paper Presentation Can the Content of Public News be used to Forecast Abnormal Stock Market Behavior? Presented by Louis Wong
Presentation Outline • Background Information • Main Idea of Papers • Overview of Data Set • Overview of Methodology • Result Interpretation • Pros & Cons of Proposed Methodology • Possible Future Work
Main Idea of Paper • The paper makes the following points: • Traditional statistical models process numerical data efficiently but fail on unstructured data such as news • Manual processing of news by financial investors is inefficient and inconsistent • Market reaction to asset-specific news should be studied extensively
Data Set – Data Source • Data Source: Bloomberg Professional Service • Studied Targets: S&P 500, FTSE 100 & ASX 100 (283 stocks in total) • Date Range: July 2005 ~ November 2006 • Number of Articles: 500,000 • Number of News Providers: around 200 • Features: • Largest-scale data set for financial text mining experiments • My comment: Data is king in data mining
Data Set – Data Preprocessing • The news articles are preprocessed as follows: • Numbers, URLs, e-mail addresses, meaningless symbols & formatting are removed • The Porter Stemmer algorithm is used to strip word suffixes: • For instance, Finance, Finances, Financed & Financing all refer to the word Finance. Without suffix removal, they would be treated as different words during processing.
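The cleaning steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the paper's pipeline: the regular expressions and the `crude_stem` helper (with its short `SUFFIXES` list) are assumptions made for this example, and `crude_stem` is only a stand-in for suffix stripping, not the Porter algorithm, which applies many more rules.

```python
import re

# Assumed patterns for this sketch; the paper does not give its exact rules.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
NON_LETTER_RE = re.compile(r"[^a-z\s]+")

# Hypothetical suffix list, checked longest first. NOT the Porter stemmer.
SUFFIXES = ("ing", "ed", "es", "s", "e")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop URLs, e-mail addresses, numbers and symbols, then stem."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    return [crude_stem(token) for token in text.split()]
```

With this stand-in, Finance, Finances, Financed and Financing all map to one shared stem, which is the effect the slide describes.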
Methodology: GARCH • When does news affect the stock time series? • In this paper, the authors use GARCH to model the volatility of the stock price series. When the forecasting error is larger than expected, the excess error is attributed to stock-specific news • Log return of the stock price series & realized volatility, where Delta-T = time interval (minutes), P = period, N = number of returns
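The formulas for the log return and realized volatility did not survive in this transcript. As a rough illustration, the sketch below uses the standard textbook definitions: the log return is the log of the ratio of consecutive prices, and realized volatility is the square root of the mean squared return, scaled to a year. The function names and the `intervals_per_year` parameter are assumptions for this example and may not match the paper's notation.

```python
import math


def log_returns(prices):
    """Log return between each pair of consecutive prices."""
    return [math.log(later / earlier) for earlier, later in zip(prices, prices[1:])]


def realized_volatility(returns, intervals_per_year):
    """Annualized realized volatility from intraday returns.

    intervals_per_year is an assumed scaling factor: the number of
    Delta-T intervals in one trading year.
    """
    mean_squared = sum(r * r for r in returns) / len(returns)
    return math.sqrt(mean_squared * intervals_per_year)
```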
Methodology: GARCH • After computing the (annualized) volatility, the GARCH model is used to forecast volatility based on previous returns & volatility. • When the forecasting error exceeds the mean by N S.D. (mean & S.D. are computed from the previous 20 trading days' data), that position on the time axis is identified as abnormal behavior. • GARCH parameters are optimized on 1 month of data
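The two steps on this slide can be sketched as below. The first function is the standard GARCH(1,1) variance recursion, shown only as the simplest case; the paper may use higher orders, and fitting `omega`, `alpha` and `beta` is out of scope here. The second function applies the "mean plus N standard deviations" rule over a trailing window. The function names and the `window` parameter, counted in observations rather than trading days, are assumptions for this example.

```python
import statistics


def garch11_next_variance(omega, alpha, beta, last_return, last_variance):
    """One-step-ahead conditional variance under GARCH(1,1)."""
    return omega + alpha * last_return ** 2 + beta * last_variance


def flag_abnormal(errors, window, n_sd):
    """Indices where the forecasting error exceeds mean + n_sd * S.D.

    The mean and S.D. come from the `window` errors before each point,
    so window must be at least 2.
    """
    flagged = []
    for i in range(window, len(errors)):
        history = errors[i - window:i]
        threshold = statistics.mean(history) + n_sd * statistics.stdev(history)
        if errors[i] > threshold:
            flagged.append(i)
    return flagged
```

The trailing window means a single spike raises the threshold for the points right after it, so this sketch tends to flag the start of a burst rather than every point inside it.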
Methodology: Alignment of Documents • After identifying the position of an abnormal forecasting error, all news published within the preceding Delta t minutes is labeled as “interesting documents”. These documents go into the training set. • A dictionary is created containing the unique words that occur in the documents. • At the same time, two counts are computed: the term count (dj), the frequency of term j across all documents, and the document frequency (dfj), the number of documents containing term j.
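A minimal sketch of the alignment and counting steps, assuming timestamps are given as minutes on a shared clock and documents are already tokenized. The function names are hypothetical and chosen for this example only.

```python
from collections import Counter


def interesting_indices(news_minutes, event_minutes, delta_t):
    """Indices of news items published at most delta_t minutes before any abnormal event."""
    return [
        i
        for i, published in enumerate(news_minutes)
        if any(0 <= event - published <= delta_t for event in event_minutes)
    ]


def build_dictionary(documents):
    """Return (term_count, doc_freq) over tokenized documents.

    term_count[t] plays the role of dj: occurrences of t across all documents.
    doc_freq[t] plays the role of dfj: number of documents containing t.
    """
    term_count = Counter()
    doc_freq = Counter()
    for tokens in documents:
        term_count.update(tokens)
        doc_freq.update(set(tokens))  # count each term once per document
    return term_count, doc_freq
```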
Methodology: Term Weighting • After the dictionary is established, term weighting is applied to measure how important each term is. • In total, 3 popular term weighting approaches are used: • Binary Version of Gain Ratio • ADBM25 • TFIDF (Term Frequency Inverse Document Frequency)
Methodology: Binary Version of Gain Ratio dfj is # of documents containing term j; dj is frequency of term j in all documents; N is total # of documents; R is number of interesting documents; r is number of interesting documents containing term j • This method selects the terms that provide the most information, that is, the terms that help us discriminate between documents most effectively • E(n, m) is a generic formula for computing an entropy value. For example, given the featured term “warrant”, if 20 out of 30 documents contain this word, substitute n = 20 and m = 30
Methodology: Binary Version of Gain Ratio dfj is # of documents containing term j; dj is frequency of term j in all documents; N is total # of documents; R is number of interesting documents; r is number of interesting documents containing term j Gain Ratio Formula: It consists of 3 parts: • 1st Part: Entropy of the ratio of interesting documents to uninteresting documents. • 2nd Part: Entropy of the ratio of interesting documents to uninteresting documents, both of which contain the desired term. • 3rd Part: Entropy of the ratio of uninteresting documents to documents, both of which contain the desired term.
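The gain ratio formula itself is missing from this transcript. The sketch below uses the standard textbook form as an assumption: binary entropy E(n, m), information gain from splitting documents on whether they contain the term, and division by the entropy of the split. The paper's exact expression may differ, in particular in its third part.

```python
import math


def entropy(n, m):
    """Binary entropy E(n, m) in bits for n items of one class out of m."""
    if m == 0:
        return 0.0
    total = 0.0
    for count in (n, m - n):
        p = count / m
        if p > 0:  # treat 0 * log(0) as 0
            total -= p * math.log2(p)
    return total


def gain_ratio(r, R, df, N):
    """Textbook gain ratio of a binary term (assumed form, see lead-in).

    r: interesting documents containing the term, R: interesting documents,
    df: documents containing the term, N: all documents.
    """
    split_info = entropy(df, N)
    if split_info == 0:
        return 0.0  # term occurs in all or none of the documents
    with_term = (df / N) * entropy(r, df)
    without_term = ((N - df) / N) * entropy(R - r, N - df)
    return (entropy(R, N) - with_term - without_term) / split_info
```

For the "warrant" example on the slide, `entropy(20, 30)` comes to about 0.918 bits.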
Methodology: TFIDF • A common term weighting formula • First part rewards frequent occurrence of the term • Second part penalizes non-featured words such as {a, an, the} • Computationally inexpensive (linear scan) dfj is # of documents containing term j; dj is frequency of term j in all documents; N is total # of documents
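The TFIDF formula is not preserved in this transcript. A common form, assumed here, multiplies the term frequency by the log of N over the document frequency. The function name and the choice of natural logarithm are assumptions for this sketch.

```python
import math


def tfidf(term_freq, n_docs, doc_freq):
    """TF-IDF weight: term frequency times log(N / df).

    term_freq corresponds to dj, doc_freq to dfj, n_docs to N.
    """
    return term_freq * math.log(n_docs / doc_freq)
```

A word that appears in every document, such as "the", gets log(N / N) = 0 and so contributes nothing however often it occurs, which is the penalty the second part provides.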
Methodology: ADBM25 K1 & b = constants; dj = frequency of term j; dl = document length; avdl = average document length; R = number of interesting documents; r = number of interesting documents containing term j; dfj = # of documents containing term j; N = total # of documents • 1st part: Normalized term frequency, taking into account the length of the document containing the term and the average document length • 2nd part: This factors in r, R, dfj & N. Normally, it penalizes less important words that nevertheless appear in many interesting documents.
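The ADBM25 formula is missing from this transcript. The sketch below shows the standard BM25 building blocks the slide appears to describe: a length-normalized term frequency and the Robertson/Spärck Jones relevance weight. How the paper combines or averages them is not recoverable here, so the product in `bm25_weight` and the defaults `k1=1.2`, `b=0.75` are assumptions.

```python
import math


def bm25_tf(tf, dl, avdl, k1=1.2, b=0.75):
    """Length-normalized term frequency (1st part on the slide)."""
    length_norm = k1 * ((1 - b) + b * dl / avdl)
    return (k1 + 1) * tf / (length_norm + tf)


def rsj_weight(r, R, df, N):
    """Robertson/Sparck Jones relevance weight (2nd part on the slide)."""
    in_interesting = (r + 0.5) / (R - r + 0.5)
    in_other = (df - r + 0.5) / (N - df - R + r + 0.5)
    return math.log(in_interesting / in_other)


def bm25_weight(tf, dl, avdl, r, R, df, N, k1=1.2, b=0.75):
    """Assumed combination: product of the two parts."""
    return bm25_tf(tf, dl, avdl, k1, b) * rsj_weight(r, R, df, N)
```

The 0.5 terms in `rsj_weight` keep the ratio finite when a count is zero.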
Term Weighting Ranking • After the importance of each term is computed, the terms are ranked. • For each interesting document, the presence or absence of the N most important words forms a binary vector. These vectors are used to train the following two models: • SVM light classifier, which learns from the training samples and classifies unseen documents in the testing phase. • Decision Tree (C4.5), an improvement on the older ID3 decision tree.
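The ranking and vectorization step can be sketched as follows, assuming term weights are held in a dictionary and documents are token lists. The function names are hypothetical, and training SVM light or C4.5 on the resulting vectors is not shown.

```python
def top_terms(weights, n):
    """The n highest-weighted terms; ties are broken alphabetically."""
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:n]]


def binary_vector(tokens, vocabulary):
    """1 where the vocabulary term occurs in the document, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]
```

Each document becomes a fixed-length vector over the same vocabulary, which is the input shape both classifiers expect.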
Result: Interesting Document Ratio • Interesting observation: The US has the largest document set, as it has the largest stock market in the world. The UK is second and AU is last.
Result: Parameter Selection • A shorter window size gives relatively higher classification accuracy • This is consistent with the EMH (Efficient Market Hypothesis) and RET (Rational Expectations Theory) • It also increases the amount of noise
Result: Effect of Term Weighting & Classifier • Trade-off between the number of terms considered and accuracy
Result: ROC Curve • The proposed method outperforms the no-discrimination line in each country
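One common way to summarize an ROC curve is the area under it (AUC): a classifier no better than chance scores 0.5, and a curve above the diagonal scores more. The sketch below computes AUC by pairwise comparison of classifier scores. It illustrates the metric only and does not reproduce the paper's evaluation; the function name is an assumption.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels are 1 (interesting) or 0 (uninteresting); ties count as half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```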
Result: Effect of Historical Window (Length of Training Set) • Parameters are fixed as follows: GARCH(P, Q) = (3, 3), Delta T = 5 minutes & SD = 6 • Observation: More historical knowledge can improve the classification result
Result: Best Classifiers • The following is a list of the best classifiers for each country's data:
Pros & Cons • Pros: • Large-scale experiments: 500,000 documents are involved • The impact of news from different countries on the classification results is considered • Different classifiers and term weighting methods are used • Detailed study of the parameters with full interpretation • Cons: • Relatively few newly proposed methods • The paper reads like an industrial paper or an empirical study
Future Work Suggested by Author • First Area: Determine whether news can affect the co-movement behavior of assets • Second Area: Group stocks by sector and observe whether macro-economic news can trigger abnormal behavior in several stocks at the same time, e.g. oil-price-related news with China Petrol and with Sinopec • Third Area: In this paper, all news is published in English. Will the results be influenced by the language? E.g. German with the DAX market and French with the CAC market
Q & A Session • Thank you for listening