1 / 17

Web-Based Data Collection and Analysis

Web-Based Data Collection and Analysis. Andy Leone June 2014. Course Overview. Tools for extracting and analyzing text SEC Filings Websites PDF Files Software Perl – ActivePerl Perl “ Front End ” Komodo MySQL. Why Perl?. There Are Alternatives to Perl (e.g., Python, Ruby, R)

lorie
Download Presentation

Web-Based Data Collection and Analysis

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Web-Based Data Collection and Analysis Andy Leone June 2014

  2. Course Overview • Tools for extracting and analyzing text • SEC Filings • Websites • PDF Files • Software • Perl – ActivePerl • Perl “Front End” Komodo • MySQL

  3. Why Perl? • There Are Alternatives to Perl (e.g., Python, Ruby, R) • Perl – Practical Extraction and Reporting Language • It is a mature language with considerable support. • Countless packages (add-ons) • About the best there is when it comes to “Regular Expressions.”

  4. The big picture- What I do SAS Connect WRDS SAS Pass-through Perl Edgar SQL ODBC ODSProc Export Excel

  5. SEC FTP • Files can be obtained from the SEC’s FTP site: • ftp.sec.gov • Note that the files are not made available on the FTP site until 24 hours after they are filed. • It is possible to get real-time feeds. • You are supposed to do large downloads after 9:00PM Eastern.

  6. Index Files • Index files are archived each quarter and the most recent one us updated every day: • /edgar/full-index/2006/QTR1/company.zip • The zip file contains a file “company.idx”

  7. Index File, continued -First 10 lines - description then data- -Company Name -Form Type -CIK Number -Filing Date -Directory and File Name Description: Master Index of EDGAR Dissemination Feed by Company Name Last Data Received: September 6, 2006 Comments: webmaster@sec.gov Anonymous FTP: ftp://ftp.sec.gov/edgar/ Company Name Form Type CIK Date Filed File Name --------------------------------------------------------------------------------------------------------------------------------------------- 033 ASSET MANAGEMENT LLC / 13F-HR 1114831 2006-08-11 edgar/data/1114831/0001110550-06-000042.txt 1 800 CONTACTS INC 10-Q 1050122 2006-08-10 edgar/data/1050122/0001104659-06-053544.txt

  8. Data Files • SGML format • All files contain basic header information tagged by <SEC-HEADER> • This is followed by each file submitted with the filing. • Example- • AMREP Corp. Submitted a 10-K/A(Text Version)

  9. Challenge • Lack of uniformity • Another Example: • Smart-tek Solutions Inc. (Text)

  10. Perl • Practical Extracting and Reporting Language • A powerful scripting language for working with unformatted text.

  11. Perl • The key feature of Perl is Regular Expressions. • It is relatively easy to search through text files and match words or word patterns. • For example, to identify firm with an internal control deficiency when wording differs. • “We have identified a material weakness” • “A material weakness exists” • “A control deficiency exists” • “We identified a control deficiency” • “No material weakness exists” (Don’t want to match this one)

  12. Expectations • Using PERL can be very helpful. • More refined measures (e.g., modified audit opinions). • Large samples relative to hand collection. • But.. • Extracting data from free-form text can be messy. Especially when formatting varies so much from file to file. A lot of trial and error. • It would take an infinite amount of time to get success rates up to 99-100%. • Need to accept 90-95% and live with some “noise”

  13. Example: Audit Opinions • How do you find the opinion? • Look for something like “We have audited.” • But PWC has to start with “In our opinion.” • How do you find the end of the opinion? • Look for a date (e.g., March 15, 2005) a the start of a line • But some dates in the opinion happen to fall at start of line (e.g., for year ended December 31, 2005) • Then require the date to be followed by spaces. • But some opinions are dated with “except for.” • Foreign auditors might date the report in day- month-year order (15 March, 2005). • Some firms have multiple opinions. • Sometimes there are typos.

  14. When is PERL most useful? • Extracting specific text and looking for keywords and phrases • Audit Opinions • Internal Controls • Footnotes • Not great for unstructured financial information. For example, use of proceeds in IPO prospectus. • Sometimes you still need to code the data by hand.

  15. What I do • Download all index files and write them to a MYSQL database (via PERL). • Download filings I work with fairly often (e.g., 10-K, 10-Q, 8-K) • Create header header tables in MYSQL for each filing type. • Create project-specific tables as needed. • Note: MYSQL allows for easy access from SAS, STATA, excel, etc. via ODBC.

  16. MYSQL, PERL and SAS • DBI Module • This makes it really easy to work with data via SQL or statistics software (SAS, STATA). • SAS Example: proc sql; connect to odbc (datasrc=Compustat user=me password=xxxx); create table myrestates as select * from connection to odbc (select gvkey, file, sic, hlink, company_name, to_CFO, TO_CFO_Date,cik, irreg,to_CEO, to_CEO_Date from 8kdata.restate_sample_jul2006 a); quit;

  17. Software Installation Checklist • ActivePerl 5.10 • Let’s go to the Package Manager and install a couple of modules: • DBD::mysqlPP • HTML::Format • Komodo Edit 4.4 • MySQL 5.0 • MySQL GUI Tools • SAS 9.1.3 (or higher).

More Related