170 likes | 314 Views
Web-Based Data Collection and Analysis. Andy Leone June 2014. Course Overview. Tools for extracting and analyzing text SEC Filings Websites PDF Files Software Perl – ActivePerl Perl “ Front End ” Komodo MySQL. Why Perl?. There Are Alternatives to Perl (e.g., Python, Ruby, R)
E N D
Web-Based Data Collection and Analysis Andy Leone June 2014
Course Overview • Tools for extracting and analyzing text • SEC Filings • Websites • PDF Files • Software • Perl – ActivePerl • Perl “Front End” Komodo • MySQL
Why Perl? • There Are Alternatives to Perl (e.g., Python, Ruby, R) • Perl – Practical Extraction and Reporting Language • It is a mature language with considerable support. • Countless packages (add-ons) • About the best there is when it comes to “Regular Expressions.”
The big picture- What I do SAS Connect WRDS SAS Pass-through Perl Edgar SQL ODBC ODSProc Export Excel
SEC FTP • Files can be obtained from the SEC’s FTP site: • ftp.sec.gov • Note that the files are not made available on the FTP site until 24 hours after they are filed. • It is possible to get real-time feeds. • You are supposed to do large downloads after 9:00PM Eastern.
Index Files • Index files are archived each quarter and the most recent one us updated every day: • /edgar/full-index/2006/QTR1/company.zip • The zip file contains a file “company.idx”
Index File, continued -First 10 lines - description then data- -Company Name -Form Type -CIK Number -Filing Date -Directory and File Name Description: Master Index of EDGAR Dissemination Feed by Company Name Last Data Received: September 6, 2006 Comments: webmaster@sec.gov Anonymous FTP: ftp://ftp.sec.gov/edgar/ Company Name Form Type CIK Date Filed File Name --------------------------------------------------------------------------------------------------------------------------------------------- 033 ASSET MANAGEMENT LLC / 13F-HR 1114831 2006-08-11 edgar/data/1114831/0001110550-06-000042.txt 1 800 CONTACTS INC 10-Q 1050122 2006-08-10 edgar/data/1050122/0001104659-06-053544.txt
Data Files • SGML format • All files contain basic header information tagged by <SEC-HEADER> • This is followed by each file submitted with the filing. • Example- • AMREP Corp. Submitted a 10-K/A(Text Version)
Challenge • Lack of uniformity • Another Example: • Smart-tek Solutions Inc. (Text)
Perl • Practical Extracting and Reporting Language • A powerful scripting language for working with unformatted text.
Perl • The key feature of Perl is Regular Expressions. • It is relatively easy to search through text files and match words or word patterns. • For example, to identify firm with an internal control deficiency when wording differs. • “We have identified a material weakness” • “A material weakness exists” • “A control deficiency exists” • “We identified a control deficiency” • “No material weakness exists” (Don’t want to match this one)
Expectations • Using PERL can be very helpful. • More refined measures (e.g., modified audit opinions). • Large samples relative to hand collection. • But.. • Extracting data from free-form text can be messy. Especially when formatting varies so much from file to file. A lot of trial and error. • It would take an infinite amount of time to get success rates up to 99-100%. • Need to accept 90-95% and live with some “noise”
Example: Audit Opinions • How do you find the opinion? • Look for something like “We have audited.” • But PWC has to start with “In our opinion.” • How do you find the end of the opinion? • Look for a date (e.g., March 15, 2005) a the start of a line • But some dates in the opinion happen to fall at start of line (e.g., for year ended December 31, 2005) • Then require the date to be followed by spaces. • But some opinions are dated with “except for.” • Foreign auditors might date the report in day- month-year order (15 March, 2005). • Some firms have multiple opinions. • Sometimes there are typos.
When is PERL most useful? • Extracting specific text and looking for keywords and phrases • Audit Opinions • Internal Controls • Footnotes • Not great for unstructured financial information. For example, use of proceeds in IPO prospectus. • Sometimes you still need to code the data by hand.
What I do • Download all index files and write them to a MYSQL database (via PERL). • Download filings I work with fairly often (e.g., 10-K, 10-Q, 8-K) • Create header header tables in MYSQL for each filing type. • Create project-specific tables as needed. • Note: MYSQL allows for easy access from SAS, STATA, excel, etc. via ODBC.
MYSQL, PERL and SAS • DBI Module • This makes it really easy to work with data via SQL or statistics software (SAS, STATA). • SAS Example: proc sql; connect to odbc (datasrc=Compustat user=me password=xxxx); create table myrestates as select * from connection to odbc (select gvkey, file, sic, hlink, company_name, to_CFO, TO_CFO_Date,cik, irreg,to_CEO, to_CEO_Date from 8kdata.restate_sample_jul2006 a); quit;
Software Installation Checklist • ActivePerl 5.10 • Let’s go to the Package Manager and install a couple of modules: • DBD::mysqlPP • HTML::Format • Komodo Edit 4.4 • MySQL 5.0 • MySQL GUI Tools • SAS 9.1.3 (or higher).