Predicting Market Movements: From Breaking News to Emerging Social Media. Dr. Hsinchun Chen Director, Artificial Intelligence Lab University of Arizona hchen@eller.arizona.edu http://ai.arizona.edu Acknowledgements: NSF CRI; NSF EXP-LA; DOD DTRA, CTFP, NPS; (ARFL WMD, CIA, FBI).
Predicting Markets • Markets: international markets, emerging markets, import/export markets, financial markets, stock market, commodity market, retail market • Economics (macro), international relations (trade, geopolitics), finance (international/banking/stock), accounting (market return), marketing (sales/retailing) • US (NSF SBE: Social, Behavioral and Economic Sciences; governments, think tanks), Europe/Asia. Business school research is not science (cannot be funded by NSF in US)! • Economics, finance, accounting, political science, social science, marketing, computer science (small, no funding in US!), MIS (business intelligence) • Geopolitical/econ/finance/accounting models/theories, market metrics/parameters, analytical techniques, results interpretation, predicting markets • EMH (efficient market hypothesis), RWT (random walk theory), CAPM (capital asset pricing model), quant/algorithmic trading
Research Opportunities • Sophisticated econ/finance/accounting/marketing models/theories, established analytical techniques and metrics (numeric), abundant structured databases (financial metrics, economic indicators, stock quotes) • New, diverse unstructured (text) web-enabled business data sources, e.g., 10K/10Q SEC reports, mass media news, local news, Internet news, financial blogs, investor forums, tweets… • Topic extraction, named entity recognition, sentiment/affect analysis, multilingual language models, social network analysis, statistical machine learning, temporal data/text mining, time-series analysis…
Nerds on Wall Street “Future technological stars…(1) Advanced electronic market tools; (2) Understanding both quantitative and qualitative information…” “The Text Frontier, Collective Intelligence, Social Media, and Market Monitors” “Stocks are stories, bonds are mathematics.” David Leinweber, 2009
AZ BIZ INTEL: BUSINESS MASS MEDIA, SOCIAL MEDIA, TEXT ANALYTICS, SENTIMENT ANALYSIS, SPIKE DETECTION, FINANCE/ACCOUNTING/MARKETING MODELING, PREDICTING MARKET MOVEMENTS
Business Intelligence & Analytics • $3B BI revenue in 2009 (Gartner, 2006) • The Data Deluge (The Economist, March 2010); internet traffic of 667 exabytes by 2013 (Cisco); total amount of information in 2010: 1.2 zettabytes (KB-MB-GB-TB-PB-EB-ZB-YB) • $9.4B BI software M&A spending in 2010 and $14.1B by 2014 (Forrester) • IBM spent $14B on BI in five years; $9B BI revenue in 2010 (USA Today, November 2010); 24 acquisitions, 10,000 BI software developers, 8,000 BI consultants, 200 BI mathematicians; acquired i2/COPLINK in 2011
Business Intelligence & Analytics • BI: “skills, technologies, applications, and practices used to help an enterprise better understand its business and market.” • Technologies: data warehousing; Extraction, Transformation, and Load (ETL); Business Performance Management (BPM); visual dashboards; and advanced knowledge discovery using data and text mining • BI 2.0: web intelligence, web analytics, web 2.0, social media analytics, opinion mining; cloud computing and web services; real-time monitoring and mining; enterprise performance (marketing/accounting/finance/healthcare)
AZ BIZ INTEL • Mass media, social media contents • Text & social media analytics techniques • Finance/accounting/marketing models (Tetlock/Columbia, Antweiler/UBC, Das/Santa Clara), NYU (Dhar), Arizona (Dhaliwal, Kelly, Jiang, Lusch, Yong), National Taiwan U (Li, Hong, Lu) • Bag of words, named entities, proper nouns, topics (1-, 2-, 3-grams) • Sentiment/valence, lexicons, machine learning, stakeholder analysis, EFLS analysis • Time series models, spike detection, decaying function, trading windows, targeted sentiment • Econometrics/regression models (R-squared, p-value), 10-fold cross-validation (F-measure, accuracy), simulated trading (cost, frequency, exit)
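Spike detection is listed above without a definition. As one possible reading, the sketch below flags a day as a spike when its message or article volume sits well above the mean of a trailing window. The function name `detect_spikes`, the window length, and the z-score threshold are all illustrative assumptions, not the lab's actual method.

```python
from statistics import mean, pstdev

def detect_spikes(volumes, window=5, threshold=2.0):
    """Return indices whose value exceeds the trailing-window mean
    by more than `threshold` standard deviations.

    Hypothetical helper: the z-score rule and defaults are assumptions.
    """
    spikes = []
    for i in range(window, len(volumes)):
        history = volumes[i - window:i]      # the `window` values before day i
        mu, sigma = mean(history), pstdev(history)
        # Skip flat histories, where a z-score is undefined.
        if sigma > 0 and (volumes[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes

# Example: daily post counts with one burst of activity.
daily_posts = [10, 12, 11, 9, 10, 11, 40, 12, 10]
spike_days = detect_spikes(daily_posts)
```

A trailing window keeps the rule usable in a live setting, since each decision uses only data available before that day.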
Results • Evolution of online word of mouth (WOM) through the new-product lifecycle • WOM communication starts early in preproduction, becomes highly active before movie release, then diminishes gradually • Valence has a clear decreasing trend over time, indicating that WOM becomes more negative after movie release • Subjectivity, number of sentences, and number of valence words stay stable over time
Literature Review: Stock Performance Prediction • Theoretical perspectives on stock behavior • Efficient market hypothesis (Fama 1964) • Price of a stock reflects all available information • Market reacts instantaneously; impossible to outperform • Random walk theory (Malkiel 1973) • Price of a stock varies randomly over time • Future prediction, outperforming the market is impossible • Pessimistic assessments of the predictability of stock behavior refuted through empirical studies • Lo and MacKinlay 1988; Jaffe et al 1989; Pesaran and Timmermann 1995
Literature Review: Stock Performance Prediction • Predominant approaches to stock prediction • Fundamentalists utilize fundamental and financial measures of the economy, industry, and firm • Economy and sector indicators, financial ratios of the firm • Fama-French three-factor model (Fama and French 1993) • Market return, market capitalization, book-to-market ratio • Currency exchange rates, interest rates, dividends • Technicians utilize historical time-series information on stock and market behavior • Historical price, volatility, trading volume • Various machine learning models applied • Regression, ANN, ARIMA, support vector machines
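For reference, the Fama-French three-factor model named above is usually written as the regression below. The symbols are conventional textbook notation rather than anything taken from the slides: $R_{i,t}$ is the return on stock $i$, $R_{f,t}$ the risk-free rate, $R_{m,t}$ the market return, $\mathrm{SMB}_t$ the size (market capitalization) factor, and $\mathrm{HML}_t$ the book-to-market factor.

```latex
R_{i,t} - R_{f,t} = \alpha_i
  + \beta_i \,(R_{m,t} - R_{f,t})
  + s_i \,\mathrm{SMB}_t
  + h_i \,\mathrm{HML}_t
  + \varepsilon_{i,t}
```

The three regressors correspond to the market return, market capitalization, and book-to-market ratio listed on the slide.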
Literature Review: Stock Performance Prediction • In addition to financial and stock variables, researchers have incorporated firm-related news article measures • Developed trend-based language models for news articles • Lavrenko et al. 2000 • Categorized press releases (good, bad, neutral) • Mittermayer 2004 • Examined various textual representations of news articles • Schumaker and Chen, 2009a; 2009b • But few have incorporated firm-related web forums • Thomas and Sycara (2000) utilize text classifications of discussions on Raging Bull to inform stock trading strategies
Literature Review: Firm-Related Web Forums and Stock • Studies relating web forums and stock behavior • Examined firm-related web forums on major web portals • Early studies focused on activity, without content analysis • Supported market efficiency; only concurrent relationships identified • Wysocki 1998; Tumarkin and Whitelaw 2001 • Subsequently challenged; forum activity predicted stock behavior • Antweiler and Frank 2002; 2004; Das and Chen 2007 • Analysis advanced to measure opinions in discussions • ‘Bullishness’ classifiers to distinguish investment positions • Antweiler and Frank 2004; Das and Chen 2007 • Classified buy, hold, or sell positions with 60–70% accuracy • Identified predictive relationships between forum discussion sentiment and subsequent stock returns, volatility, trading volume • Shortcomings • Retrospective analyses, shareholder perspective of major forums
AZ FinText: numbers + text • Techniques: bag of words, named entities, proper nouns, past stock prices + support vector regression (SVR) • Testbed: S&P 500, 5 weeks (Oct-Nov 2005), 2,809 news articles, 10M stock quotes • GICS industry classification • Evaluation: return vs. quant funds; 20-minute prediction
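To make the "numbers + text" idea concrete, here is a minimal sketch of how an article and a price might be combined into one feature vector before being passed to a regressor such as SVR. The function `build_features`, the binary term encoding, the whitespace tokenizer, and the sample vocabulary are all assumptions for illustration. The regression step is omitted.

```python
def build_features(article, vocabulary, price_at_release):
    """Return binary bag-of-words indicators for `vocabulary`,
    followed by the stock price at the time the article appeared.

    Hypothetical helper: encoding and tokenization are assumptions.
    """
    tokens = set(article.lower().split())    # naive whitespace tokenizer
    vector = [1.0 if term in tokens else 0.0 for term in vocabulary]
    vector.append(float(price_at_release))   # the numeric ("numbers") part
    return vector

# Example with a made-up three-term vocabulary and a made-up price.
vocab = ["earnings", "lawsuit", "beats"]
features = build_features("Acme beats earnings forecast", vocab, 42.5)
```

A named-entity or proper-noun representation would change only how `vocabulary` is chosen; the vector layout stays the same.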
AZ FinText in the news • “AI That Picks Stocks Better Than the Pros: A computer science professor uses textual analysis of articles to beat the market.” (Thursday, June 10, 2010) • WSJ Technology News and Insights: “Using Artificial Intelligence to Digest News, Trade Stocks” (June 21, 2010, 1:45 PM ET)
AZ STOCK TRACKER I: mass, social media, topic, volume, sentiment • [System diagram] • Data collection: spider/parser over web forums and online news, feeding a database • Topic extraction: mutual information phrase extractor, discussion topics • Sentiment identification: sentiment grader, sentiment aggregator, message sentiments • Conversation analysis: traffic dynamics, topic correlation and evolution, sentiment correlation and evolution, active topics and sentiments • Units of analysis: topic, sentiment, author, message • Output: market prediction
User-Generated Contents (UGC): Conversations of 30,000 Wal-Mart Constituents and 500,000 Responses
Market Modeling • Correlation • Sentiment expressed in the forum contemporaneously correlates significantly with stock return • Disagreement, volume, and length expressed in the forum also hold significant correlations with volatility and trading volume
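The contemporaneous correlation described above can be illustrated with a plain Pearson coefficient between a daily sentiment series and same-day stock returns. This is a sketch only: the function name `pearson` and the sample numbers are invented for illustration, and the slides do not state which correlation statistic was used.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / sqrt(sxx * syy)

# Made-up daily forum sentiment scores and same-day stock returns.
sentiment = [0.2, -0.1, 0.4, 0.0, -0.3]
returns = [0.01, -0.005, 0.012, 0.001, -0.02]
rho = pearson(sentiment, returns)
```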
Market Predictive Results (cont’d) • Predictive regression (t-1) • The significant measures of forum discussions identified in contemporaneous regressions maintain their significance in the predictive regression models • Additionally, sentiment expressed in the web forum holds a significant relationship with the trading volume on the following day • Positive sentiment reduces trading volume; negative sentiment induces trading activity
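A predictive (t-1) regression relates today's market variable to yesterday's forum measure. The sketch below fits a single-regressor version by ordinary least squares. The models in the talk include many more variables, so `lagged_ols` and the toy data are simplifying assumptions.

```python
def lagged_ols(x, y, lag=1):
    """Fit y[t] = a + b * x[t - lag] by ordinary least squares.

    Returns (a, b). `lag` must be at least 1.
    """
    if lag < 1 or len(x) != len(y) or len(x) <= lag + 1:
        raise ValueError("need lag >= 1 and enough aligned observations")
    xs, ys = x[:-lag], y[lag:]               # pair x[t - lag] with y[t]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((u - mx) ** 2 for u in xs)
    sxy = sum((u - mx) * (v - my) for u, v in zip(xs, ys))
    b = sxy / sxx                            # slope on the lagged regressor
    a = my - b * mx                          # intercept
    return a, b

# Toy data built so that volume[t] = 10 - 2 * sentiment[t - 1].
sentiment = [1, 2, 3, 4, 5]
volume = [0, 8, 6, 4, 2]                     # volume[0] has no lagged input
intercept, slope = lagged_ols(sentiment, volume)
```

In this toy series the slope is negative, which matches the direction reported on the slide: higher sentiment on one day goes with lower trading volume on the next.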
Experimental Design: Description of Prediction Models • Baseline Model – Baseline-FF • Fundamental variables: Fama-French model • Baseline Model – Baseline-Tech • Technical variables: lagged stock returns, volatility, trading volume, day-of-week dummies • Baseline Model – Baseline-Comp • Comprehensive: all fundamental and technical variables • where t = days (t = 1, 2, …, n); day of the week (d = 1, …, 4)
Experimental Design: Description of Prediction Models • Forum models • Comprehensive baseline variables plus forum-level measures • where t = days (t = 1, 2, …, n); day of the week (d = 1, …, 4); stakeholder clusters (s = 1, 2, …, c)
Experimental Design: Description of Prediction Models • Stakeholder models • Comprehensive baseline variables plus stakeholder group-level forum measures • where t = days (t = 1, 2, …, n); day of the week (d = 1, …, 4); stakeholder clusters (s = 1, 2, …, c); index k = (((c - 1) * 6) + 15)
Experimental Design: Social Media Data • A 17-month period was utilized for analysis and experimentation • November 1, 2005 to March 31, 2007 • The first five months were utilized to calibrate the initial stock return prediction models • November 1, 2005 – March 31, 2006 • Calibrated models were applied for prediction during each trading day in the next month • Each subsequent month, new models were calibrated using the five previous months of time-series variables, for stock return prediction during the next month of trading • In total, stock return prediction was performed daily for one year (250 trading days) • April 1, 2006 – March 31, 2007
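The rolling calibration scheme above can be written out as a schedule of month pairs. The sketch below generates it from the dates on the slide; the helper names `add_months` and `rolling_schedule` are illustrative, and model fitting itself is not shown.

```python
from datetime import date

def add_months(d, k):
    """Return the first day of the month k months after the month of d."""
    idx = d.year * 12 + (d.month - 1) + k
    return date(idx // 12, idx % 12 + 1, 1)

def rolling_schedule(start, calibration_months, prediction_months):
    """Return (calibration_start, prediction_month_start) pairs.

    Each model is calibrated on `calibration_months` consecutive months
    and then applied to the single month that follows.
    """
    schedule = []
    for i in range(prediction_months):
        calibration_start = add_months(start, i)
        prediction_start = add_months(start, i + calibration_months)
        schedule.append((calibration_start, prediction_start))
    return schedule

# Five calibration months starting November 2005, twelve predicted months.
schedule = rolling_schedule(date(2005, 11, 1), 5, 12)
```

The first pair calibrates on November 2005 through March 2006 and predicts April 2006; the last calibrates from October 2006 and predicts March 2007, which together cover the 17-month period.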
Results and Discussion • Hypothesis testing results
Results and Discussion • Wal-Mart stock return prediction model results • Baseline models using fundamental and technical variables • Results across 250 trading days forecasted • Baselines for simulated trading (initial investment of $10,000): • Holding Wal-Mart stock for the year results in $10,096 • Holding S&P 500 for the year results in $11,012
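The simulated-trading baselines compound daily returns on a $10,000 stake. The sketch below shows a buy-and-hold calculation plus one simple model-driven rule. The rule (hold the stock only on days with a positive predicted return, otherwise stay in cash, no transaction costs) is an assumption for illustration, since the slides do not spell out the trading strategy.

```python
def buy_and_hold(initial, daily_returns):
    """Final portfolio value when fully invested every day."""
    value = initial
    for r in daily_returns:
        value *= 1.0 + r
    return value

def simulate_trading(initial, predicted, actual):
    """Final value when invested only on days with a positive prediction.

    Assumed rule: long if predicted return > 0, otherwise hold cash.
    Transaction costs are ignored.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    value = initial
    for p, r in zip(predicted, actual):
        if p > 0:
            value *= 1.0 + r
    return value

# Two made-up trading days: +1% then -1%.
actual_returns = [0.01, -0.01]
held = buy_and_hold(10_000, actual_returns)
traded = simulate_trading(10_000, [0.5, -0.5], actual_returns)
```

Comparing `traded` against `held` mirrors the comparison on the slide, where model-driven results are judged against simply holding Wal-Mart stock or the S&P 500.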
Results and Discussion • Wal-Mart stock return prediction model results • Incorporating the Wakeup Wal-Mart web forum • Results across 250 trading days forecasted • Pairwise t-test; improvement over best baseline model significant at * p < 0.10, ** p < 0.05
Introduction • Forward-looking statements (FLS) refer to • Projections, forecasts, or other predictive statements • Made by firm management • Section 21E of the Securities Exchange Act (1934) • Extended forward-looking statements (EFLS) • Statements that may have implications for a firm’s future development • Similar to FLS, but broader • Including information from information intermediaries (e.g., newspapers, newswires) and individuals (e.g., blogs)
Recognizing EFLS • EFLS: extends FLS to include statements about a firm’s future performance from other sources such as the financial press, analysts’ reports, and individuals
Summary of Annotation Results • High kappa values (>0.7) on risks support the empirical validity of the coding scheme • Agreement upper bound • 89% to 91% (for ALL, POS, and NEG) • Reference standard dataset: • 2,539 sentences in total • Note: 95% CI from 1,000 bootstrap resamples
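The kappa values above measure inter-annotator agreement corrected for chance. The sketch below computes Cohen's kappa for two annotators. Both the two-annotator setup and the choice of Cohen's variant are assumptions, since the slide does not say which kappa statistic was used, and the labels are made up.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)

# Made-up sentence labels from two annotators.
annotator_1 = ["POS", "POS", "NEG", "NEG"]
annotator_2 = ["POS", "NEG", "NEG", "NEG"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

A bootstrap confidence interval like the one noted on the slide could be obtained by resampling the items with replacement and recomputing kappa for each resample.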
EFLS Impacts: Hypotheses Development • Theoretical framework (Easley and O’Hara, 2004) • There are I_k signals for stock k • Private signals (α_k I_k) • Public signals ((1 − α_k) I_k) • α_k: the relative amount of private-versus-public information
Hypotheses Development (Cont’d.) • Hypothesis 1: Firms with lower EFLS intensity are associated with higher expected return.
Hypotheses Development (Cont’d.) • Hypothesis 2: Firms with lower EFLS intensity are associated with higher stock volatility. • If and then >0 • Intuition: if there are enough signals and the fraction of informed investors is larger than 41%, then firms with lower amounts of EFLS have higher volatility
Firm-Level Performance Evaluation (Cont’d.) • Empirical Model 1: • Empirical Model 2: • Hypothesis 1 predicts negative b1 • Hypothesis 2 predicts b1 ≠ 0
Experiment Two: Firm-Level Evaluation • Research Testbed: January 1986 to May 2008, 1,134,321 Wall Street Journal news articles • Merged with CRSP, Compustat, and IBES • Stock prices lower than $5 at the end of a month were removed (Cohen and Frazzini 2008; Fang and Peress 2009) • 1,274,711 firm-months, spanning 269 months
Expected Return and EFLS Intensity ***, **, * indicate statistical significance at the 0.01, 0.05, and 0.1 levels, respectively.
Volatility and EFLS Intensity ***, **, * indicate statistical significance at the 0.01, 0.05, and 0.1 levels, respectively.