410 likes | 428 Views
This article discusses the challenges faced in loading financial data, specifically stock market data, in a low-cost startup. It covers topics like data quality, market data rules, table designs, and the sources of financial data.
E N D
Financial Informatics: • Startup, low-cost, dataload Challenges and Solutions
What are we talking about today? • Financial Data, more specifically stock market data as an example • The basic design of a MySQL database that contains a daily history of stock prices • Building a stock machine and some of the challenges posed • Some large data ‘gotchas’ and solves • Some large mysql ‘gotchas’ and solves
Financial data • csv records of what happened that day or a signal • often have unexplaind anomalies • daily arrival of row data which doesn’t conform to spec
1. Big Picture • Who is your audience? Make your analytics and application work with a small dataset first • Market data rules: You can’t scrape Yahoo • QA is not a bad word: Data Quality is key • What’s a security? • What’s a corporate action? • OLAP: This is once a day processing • Take performance of your dev boxes seriously: Dell 2950 with 32GB of ram, 6 disks, RAID10.
Where does financial data come from? • Thompson / Reuters • McGraw Hill / Interactive Data • Securities and Exchange Commission • Dow Jones • Standard and Poors • Bloomberg • Lots of ‘boutique’ $100M companies
Market data rules • Information about a security’s trade on an exchange is owned by the exchange and distributed to those who have made a license agreement (Reuters, Interactive Data, et al.) Your license agreement with these 3rd parties will start at $20k-$50k a year • Scraping yahoo, msn money, Forbes or another site is infringement • There are different license levels with financial data providers, redistribution usually costs more than a quantitative black box • After three days most data is less valuable / expensive, you may get a bargain for dev phase • Working with financial data providers is a slow process, it may take you 8 weeks from your initial point of contact with a rep before securing a license agreement. Work with your business decision team to prepare for this • Even indexes like S&P 500 and industry data is under license.
What Data Do You Need? • Historical Price - Everyone needs this for charts, models, etc • Corporate Actions - Adjustments going forward for historical data • Real-time Price - You may want this for real-time charts (100’s of Megs a Day) • SEC Filings - You may want to decompose for quant models or present reports to users • 3rd Party Quant Data - Black box trading solution, quant box
Don’t load everything day 1 • AAPL, INTL, T, X, XOM, DVW, DELL, GE • S&P 500 • Russell 3000 • FTSE • APAC • OTC / PINK / BB • Mutual Funds • Money Market • Indexes
What’s a security? • Stocks, bonds, mutual funds and more • In this context traded on an exchange • A note held for you by your broker • Represents a debt to be paid by issuer -or- • Represents a share of the issuer -or- • Represents a bet on the issuer -or- • Represents an index of multiple securities -or- • Represents another abstraction of ownership or bet
What’s a corporate action? • A change to an attribute of a security or a security’s price • Split; reverse split • Dividend • Name change • Listing; delisting; • Exchange change • Notes change • Regional change • Currency change
QA is not a bad word • QA of financial data is much different than qa of software • row data can arrive empty, wrong, portions missing • row data can fail to arrive • stocks may be priced wrong • corporate actions may be for the wrong stock • Canadian stock can be listed in the us with Canadian dollar prices • all kinds of other fun • You must have Excel jockeys to identify and explain noise to: • Engineers • Your data provider • Your customers
2. Table Designs • SECURITY - Attributes of a security • RAW_PRICE - Attributes of a security’s trades from csv, unadjusted • PRICE - Attributes of a security’s trades, adjusted for corporate actions • CORPORATE_ACTIONS - Change records of a security or price attributes • JOBS - Attributes of a job • COUNTRY - A reference table for a security’s country • EXCHANGE - A reference table for a security’s exchange • REGION - A reference table for a security’s region • SOURCE - A reference table of the data provider for a security
SECURITY • security_id is your abstraction of data industry identifiers • SECURITY_ID, your identifier int NOT NULL AUTO_INCREMENT unsigned pk • SECURITY_NAME, exchanges name for company • SOURCE_ID, what data provides this char(1) • CUSIP, us and canada unique identifier, char(9) • TICKER, an identifier, a gotcha, varchar(14) • SYMBOL, an identifier, a gotcha, varchar(14) • EXCHANGE_ID, what exchange is it traded on • REGION_ID, what region does this trade in int • COUNTRY_ID, what country does this trade in int
SECURITY table • Uses internal identifier SECURITY_ID • If you’re experimenting with different providers, SOURCE_ID should be added to pk • Holds providers key for a security (ric, symbol, ticker, cusip) • 500k rows max
RAW_PRICE table (Load Everything) • The rows just as they’ve come from the provider with an artificial key • Price corrections with asof_date in the past may come in, check for these • Sometimes attributes don’t exist in source files, missing asof, open, etc, not null loses the whole row and it might take days to get another one resent
PRICE table • SECURITY_ID, your identifier int unsigned pk • ASOF_DATE, what data provides this char(1) pk • OPEN, the opening price decimal • LOW, the low price for the day decimal • CLOSE, the closing price for the day • HIGH, the high price for the day • VOLUME, how many shares sold that day • SPLIT ADJUSTMENT: (REUTERS, not COMSTOCK) multiplier decimal
PRICE table • Only one price per security per day • Validation happens from RAW_PRICE to PRICE • Instead of bouncing rows you may consider a suspect data flag which bubbles up to UI
CORPORATE_ACTIONS table • Comstock: Splits and Reverses are in this file • Reuters: Splits and Reverses are in price file • Denormalized - Boo! • Much of this information is display information • Changes to exchange or trading status are in here (bankruptcy, emerging from bankruptcy, changing from NASDAQ to OTC.BB, etc) • Dividend information is in here too
COUNTRY, REGION, EXCHNAGE tables • COUNTRY, keeps track of what country a security trades in • USA • CANADA • REGION, keeps track of what region a security trades in • NORTH AMERICA • APAC • EXCHANGE, keeps track of what Exchange a security is traded on • VANCOUVER • NASDAQ • NASDAQ OTC.BB
SOURCE table • Keeps track of who provides what data in the security table • Good to side-by-side comparisons where data comes from two different providers • Helps build organizational knowledge over what providers have good data-quality
Data Gotchas • Do: load everything, don’t build constraints based on provider specs prior to understanding the data • Do: use 5.0.31 or above with innodb • Do: wrap batches in BEGIN / END • Do: set innodb_rollback_on_timeout = ON • Do: stage feeds in raw tables b/c if you adjust for splits in the live history table and make mistakes you’re be loading millions of rows again • Don’t run things like: • exec(“mysql -u user -e “source /feed/load_statements.sql”); • Don’t: foreign keys until process is hardened or never
3. Gears • load_raw_prices(); • daily_price_clean(); • load_security(); • load_price(); • split(); Special sauce for you to write • undo_split(); Ditto
An approach to data loads • Daily load phase 1 • Get data from provider in csv or xml • Don’t translate • Import into raw tables • Run variance checks to throw alerts (~50k securities) • is ( yesterday n rows / today n rows ) between 99.99 and 100.01%? • Daily load phase 2 • Load data into live tables • Make adjustments for corporate actions • Run your models • Run variance checks to throw alerts
load_raw_prices() • function load_prices( $price_file ) { • $lines = file($price_file); • $counter = 0; • foreach ($lines as $line_num => $line ) { • $counter = $counter+1; • $row = explode(",",$line); • $cusip = $row[0]; • $ric = $row[1]; • $asof_date = $row[2]; • $open = $row[3]; • $high = $row[4]; • $low = $row[5]; • $close = $row[6]; • $volume = str_replace( "\n", "", $row[7] ); • $split_adjustment = str_replace( "\n", "", $row[8] ); • $today = date('Y-m-d'); • if($split_adjustment=='') { • $split_adjustment = '0.00000'; • }
load_raw_prices() (cont’d) • $query = "INSERT INTO RAW_PRICE ( CUSIP, RIC, ASOF_DATE, OPEN, HIGH, LOW, CLOSE, VOLUME, SPLIT_FACTOR, LOAD_DATE ) VALUES ( " • . "'" . $cusip . "'," • . "'" .$ric . "'," • . "'" .$asof_date . "'," • . $open . "," • . $high . "," • . $low . "," • . $close . "," • . $volume . "," • . $split_adjustment . "," • . "'" . $today . "')" ; • # put the rows in the raw_prices table • sm_query( $query ); • if (($counter%100)==0) { • echo $counter . " lines processed.\n"; • } • } • echo $counter . " total lines processed.\n"; • }
daily_price_clean() • function daily_price_clean( $source_file, $new_file ) { • $lines = file($source_file); • foreach ($lines as $line_num => $line ) { • # strip "-9,999,401" • $line = str_replace("\"-9,999,401\"","NULL",$line); • # strip volume quotes and commas • $pieces = explode("\"",$line); • $pieces[1] = str_replace(",","",$pieces[1]); • $fixed_line = implode("",$pieces); • # do some more funky stuff to get the date re-arranged • $date_repair = explode(",",$fixed_line); • $date_digits = explode("/",$date_repair[2]); • $date_repair[2] = "20" . $date_digits[2] . "-" . $date_digits[0] . "-" . $date_digits[1]; • $fixed_line2 = implode(",",$date_repair); • # write out new file • if ( !file_exists($new_file)) { • touch ($new_file); • } • $handle = fopen ($new_file, 'a'); • fwrite($handle, $fixed_line2); • fclose($handle); • } • }
load_secuirty() • function load_security( $security_file ) { • $lines = file($security_file); • $counter = 0; • foreach ($lines as $line_num => $line ) { • $counter = $counter+1; • $row = explode(",",$line); • $cusip = $row[0]; • $ric = $row[1]; • $ticker = $row[2]; • $today = date('Y-m-d'); • $query = "INSERT INTO SECURITY ( CUSIP, RIC, TICKER, CREATED_DATE ) VALUES ( " • . "'" . $cusip . "'," • . "'" . $ric . "'," • . "'" . $ticker . "'," • . "'" . $today . "')" ;
load_secuirty() (cont’d) • # put the rows in the raw_prices table • sm_query( $query ); • if (($counter%100)==0) { • echo $counter . " lines processed.\n"; • } • } • echo $counter . " total lines processed.\n"; • }
load_prices() • function load_prices( $date ) { • $query = "INSERT INTO PRICE • SELECT • S.SECURITY_ID, RP.ASOF_DATE, RP.OPEN, RP.HIGH, • RP.LOW, RP.CLOSE, RP.VOLUME, RP.SPLIT_FACTOR, • date(now()) • FROM • RAW_PRICE RP, • SECURITY S • WHERE • S.RIC = RP.RIC • AND • RP.ASOF_DATE = '" . $date . "'"; • echo $query ; • sm_query( $query ); • }
Dependency Task Scheduling • Php and shell scripts are useful tools to download and process price data • But cron doesn’t do a very good job of keeping track in a database of when something starts, finishes, fails, fails to start • If email is broken or cron isn’t reporting correctly you may not know of problems until it’s too late • Often a layer of metadata fails b/c of failed or weird market data, a missing price can make a graph or signal look weird to customers • You can’t load prices if the ftp or feed fails • You can’t process corporate actions until you know the price • You can’t get accurate calculations against time-series if there’s holes in the series • You can’t send signals or present accurate graphs if anything related to a security fails • Keeping track of failed jobs gives you a flag that can also tell your users what they’re seeing is questionable and will be corrected • You can report on a jobs list and throw alerts on failed jobs
Tracking variances in data quality • Price weirdness: • yesterday’s price / today’s price • Row weirdness: • num rows yesterday / num rows today • Range weirdness: • yesterday’s average of a sum / today’s average of a
Questions?Acknowledgements • Starmine: Tripp, Flanzer, Foster, Breffle, Miller • Cake Financial: Reed