
10-K filing annual report word and document statistics

10-K filing annual report word and document statistics. 9-10-2017, David Ling. Document statistics: downloaded the 10-K filings of S&P 500 companies from 2011-1-1 to 2017-1-1; 1 filing per year, 6 reports per company (some have fewer because they joined recently); a regexp is used to extract Item 7.


Presentation Transcript


  1. 10-K filing annual report word and document statistics. 9-10-2017, David Ling

  2. Document statistics
     • Downloaded the 10-K filings of S&P 500 companies
     • Filing dates from 2011-1-1 to 2017-1-1
     • 1 filing per year, 6 reports per company (some companies have fewer because they joined recently)
     • A regexp is used to extract Item 7
     • Each extracted item is stored as a separate file
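The slides do not show the regexp itself, so the following is only a minimal sketch of how Item 7 might be pulled out of a filing's plain text. The patterns, the function name `extract_item7`, and the "keep the longest candidate" heuristic (to skip the table-of-contents entry) are all assumptions, not the author's method.

```python
import re

# Hypothetical patterns: the actual regexp used in the slides is not shown.
# A heading is assumed to start a line, e.g. "Item 7." or "ITEM 7:".
ITEM7_START = re.compile(r"^[ \t]*item\s+7\s*[.:]", re.IGNORECASE | re.MULTILINE)
# The section is assumed to end at the next "Item 7A" or "Item 8" heading.
ITEM7_END = re.compile(r"^[ \t]*item\s+(7a|8)\s*[.:]", re.IGNORECASE | re.MULTILINE)


def extract_item7(text):
    """Return the longest candidate Item 7 section, or None if no heading matches."""
    best = None
    for start in ITEM7_START.finditer(text):
        end = ITEM7_END.search(text, start.end())
        stop = end.start() if end else len(text)
        section = text[start.start():stop]
        # The table of contents also lists "Item 7."; keeping the longest
        # candidate is a heuristic for skipping that short entry.
        if best is None or len(section) > len(best):
            best = section
    return best


sample = (
    "Table of Contents\n"
    "Item 7. Management's Discussion and Analysis 25\n"
    "Item 7A. Quantitative Disclosures 40\n"
    "\n"
    "Item 7. Management's Discussion and Analysis\n"
    "Revenue increased during the year.\n"
    "Item 7A. Quantitative and Qualitative Disclosures About Market Risk\n"
    "Interest rate risk.\n"
)
item7 = extract_item7(sample)
```

A filing whose Item 7 only points to another document would still match here but yield very little text, which is one way a short extraction can arise.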

  3. Documents with fewer than 4000 words may be considered failed extractions:
     • Incomplete extraction (only part of the item is extracted)
     • The item refers to content somewhere else
     • The regexp finds no match

  4. Extracted document statistics
     • Total documents: 2859
     • Documents with words > 4000: 2459 (valid extraction)
     • Companies with valid extraction for the most recent 3 years: 409
     • Companies with valid extraction for the most recent 6 years: 369
     • We can rank those 409 companies

     Extracted number of words for some companies:

     CIK       2016    2015    2014    2013    2012    2011
     93751    40711   41958   41740   31540   28126   27087
     9389      8953    7578    7397    7615    7877    8162
     940944      89      89      89      89      89      89
     943819    7202    6636    6653    6714    6688    6712
     96021    18994   18870   22269   19989   18672   19268
     97476     4661    5477      80      69      69      69
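As an illustration of the validity rule, the sketch below applies the 4000-word threshold to the six sample companies listed on this slide. The function names are hypothetical, and a document with exactly 4000 words is treated as invalid here, which is an assumption since the slides only state "< 4000" and "> 4000".

```python
# Word counts from the slide: CIK -> counts for 2016, 2015, ..., 2011 (most recent first).
WORD_COUNTS = {
    "93751": [40711, 41958, 41740, 31540, 28126, 27087],
    "9389": [8953, 7578, 7397, 7615, 7877, 8162],
    "940944": [89, 89, 89, 89, 89, 89],
    "943819": [7202, 6636, 6653, 6714, 6688, 6712],
    "96021": [18994, 18870, 22269, 19989, 18672, 19268],
    "97476": [4661, 5477, 80, 69, 69, 69],
}
MIN_WORDS = 4000  # threshold from the slides


def is_valid(word_count):
    """A document counts as a valid extraction if it has more than MIN_WORDS words."""
    return word_count > MIN_WORDS


def companies_valid_for(counts_by_cik, recent_years):
    """Return the CIKs whose most recent `recent_years` filings are all valid."""
    return [
        cik
        for cik, counts in counts_by_cik.items()
        if len(counts) >= recent_years
        and all(is_valid(c) for c in counts[:recent_years])
    ]


valid_3y = companies_valid_for(WORD_COUNTS, 3)
valid_6y = companies_valid_for(WORD_COUNTS, 6)
valid_docs = sum(is_valid(c) for counts in WORD_COUNTS.values() for c in counts)
```

In this sample, CIK 97476 has valid extractions only for 2016 and 2015, so it is excluded from both the 3-year and the 6-year lists, while CIK 940944 fails in every year.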

  5. Top 50 frequent words among the valid extractions
     • 59290 distinct words in the valid extractions
     • Stemming and lemmatization were not applied (e.g. cat and cats, play and played, company and company’s are distinct)
     • These forms are also distinct in the downloaded GloVe data
     [Figure: Frequency in valid extractions]
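A minimal sketch of how word frequency and document frequency could be counted without stemming or lemmatization. The tokenization (lowercasing and splitting on whitespace) is an assumption; the slides do not say how the text was tokenized.

```python
from collections import Counter


def count_words(documents):
    """Return (term frequency, document frequency) counters over all documents.

    Tokenization is an assumption: lowercase, then split on whitespace.
    No stemming or lemmatization, so "cat" and "cats" stay distinct.
    """
    term_freq = Counter()
    doc_freq = Counter()
    for text in documents:
        tokens = text.lower().split()
        term_freq.update(tokens)      # every occurrence counts
        doc_freq.update(set(tokens))  # each document counts once per word
    return term_freq, doc_freq


docs = ["The cat sat", "the cats sat sat"]
term_freq, doc_freq = count_words(docs)
top_words = term_freq.most_common(2)
```

`Counter.most_common(50)` on the real corpus would give a top-50 list like the one plotted on this slide.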

  6. Frequency percentile
     • About 10% of the words appear only once
     • Total frequency is highly dominated by the most frequent 1% of words
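The two figures on this slide can be computed from a word-frequency table roughly as follows. This is a sketch with a hypothetical function name and toy counts, not the author's code; it assumes a non-empty table.

```python
def frequency_summary(term_freq, top_fraction=0.01):
    """Return (share of words seen once, share of all occurrences in the top words).

    `term_freq` maps each distinct word to its total count and must be non-empty.
    """
    counts = sorted(term_freq.values(), reverse=True)
    singletons = sum(1 for c in counts if c == 1)
    # Number of words in the top fraction, rounded down but at least one word.
    top_n = max(1, int(len(counts) * top_fraction))
    return singletons / len(counts), sum(counts[:top_n]) / sum(counts)


# Toy counts for illustration only; they are not taken from the corpus.
toy_freq = {"the": 90, "income": 6, "tax": 2, "amrisc": 1, "lncome": 1}
singleton_share, top_share = frequency_summary(toy_freq)
```

With the toy counts, two of the five words appear once and the single most frequent word accounts for 90 of the 100 occurrences.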

  7. Some selected uncommon words

     Rank    Word                  Freq.   Doc freq.
     58783   lncome                1       1
     58784   quality.              1       1
     58785   2.53x                 1       1
     58786   amrisc                1       1
     58787   1.85x                 1       1
     58788   2.09x                 1       1
     58789   1.36x                 1       1
     58790   mid-fifties           1       1
     58951   padding-bottom        1       1
     58952   post-january          1       1
     58953   disappear             1       1
     58954   low-point             1       1
     58955   -balance              1       1
     58956   earnings.we           1       1
     58957   non-deductible.our    1       1
     58958   decemberr             1       1

     Some are due to:
     • Numbers without spaces
     • A full stop not followed by a capital letter (‘…quality. table of …’)
     • Missing space (blue)
     • Hyphen
     • Wrong spelling

     As their frequency is small, we may just ignore them or regard them as noise at this stage.
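The causes listed above could be flagged with a few simple rules. The sketch below is purely illustrative: the rules, their order, and the label names are assumptions, and misspellings such as "lncome" fall into a catch-all category because detecting them would need a dictionary.

```python
import re


def noise_cause(token):
    """Guess why a rare token looks unusual, using hypothetical rules."""
    if re.search(r"\d", token):
        return "number"              # e.g. 2.53x
    if re.search(r"[a-z]\.[a-z]", token):
        return "missing space"       # e.g. earnings.we
    if token.endswith("."):
        return "trailing full stop"  # e.g. quality.
    if "-" in token:
        return "hyphen"              # e.g. mid-fifties, -balance
    return "other"                   # misspellings or rare but legitimate words


rare_tokens = ["lncome", "quality.", "2.53x", "mid-fifties",
               "earnings.we", "non-deductible.our", "disappear"]
causes = {token: noise_cause(token) for token in rare_tokens}
```

Because the rules are checked in order, "non-deductible.our" is labeled as a missing space rather than a hyphen.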

  8. Discussions
     • Next step: term weighting and stop words
     • Filtering stop words with a stop word list available online (Bill McDonald)
     • Examples: A ABOUT ABOVE ACROSS AFOREMENTIONED AFORESAID AFTER AFTERWARDS AGAIN AGAINST ALL ALMOST ALONE ALONG ALREADY ALSO ALTHOUGH ALWAYS AMONG AMONGST AN AND ANOTHER ANY ANYHOW ANYONE ANYTHING ANYWHERE ARE AROUND AS AT BE BECAME BECAUSE
     • Filtering stop words by inverse document frequency
     • Idf = log(1 / document frequency)
     • Because the documents are long, idf cannot differentiate frequent words from stop words, e.g. both ‘the’ and ‘income’ appear in all documents (same idf), but ‘income’ is much more meaningful than ‘the’
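The two filtering options can be compared in a short sketch. It assumes that "document frequency" in the idf formula means the fraction of documents containing the word, so that idf = log(N / n) for a word found in n of N documents; the function names and the lowercase stop word subset (taken from the examples on this slide) are hypothetical.

```python
import math

# A few entries from the example list on the slide, lowercased for matching.
STOP_WORDS = {"a", "about", "above", "across", "after", "again", "all", "an", "and"}

TOTAL_DOCS = 2459  # number of valid extractions reported in the slides


def idf(doc_count, total_docs=TOTAL_DOCS):
    """Idf = log(1 / document frequency), with frequency = doc_count / total_docs."""
    return math.log(total_docs / doc_count)


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t.lower() not in stop_words]


# A word present in every document gets idf 0, whatever its meaning.
idf_in_all_docs = idf(TOTAL_DOCS)
# A word present in a single document gets the largest idf.
idf_in_one_doc = idf(1)
filtered = remove_stop_words(["an", "increase", "in", "income", "and", "tax"])
```

Both ‘the’ and ‘income’ would receive the value `idf_in_all_docs`, which is why idf alone cannot separate them, whereas a list-based filter removes only the words it names.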
