
School of Computer Science Queen’s University Belfast

Zipf and Type-Token rules for the English and Irish languages. Le Quan Ha and F. J. Smith, School of Computer Science, Queen’s University Belfast.


Presentation Transcript


  1. Zipf and Type-Token rules for the English and Irish languages School of Computer Science Queen’s University Belfast Le Quan Ha and F. J. Smith

  2. Abstract
     • Natural language processing.
     • Zipf’s Law: relationship between the frequency of words and their rank.
     • Usage of English and Irish corpora.
     • Zipf curves for English and Irish corpora.
     • Type-token relationship.
     • Enhancement of the Smith-Devine law.
     • Recent results on Latin?

  3. Zipf’s Discovery
     • In 1949, Zipf proposed the following empirical law: $f = k / r^{b}$, where f is the frequency, r is the rank, k is a constant for the corpus and b = 1.
     • Zipf discovered the law by analysing James Joyce’s novel “Ulysses”, which contains 29,899 word types and 260,430 word tokens.
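
A minimal sketch of how such a Zipf curve can be checked numerically, assuming a crude lowercase regular-expression tokenizer (the slides do not say how the corpora were tokenized). The function names `rank_frequency` and `loglog_slope` are illustrative only. The slope of log f against log r should be close to –b when f = k / r^b holds.

```python
import math
import re
from collections import Counter


def rank_frequency(text):
    """Return word frequencies in descending order (index 0 is rank 1)."""
    # Crude tokenizer: runs of letters and apostrophes. This is an assumption,
    # not the tokenization used for the corpora in the slides.
    tokens = re.findall(r"[a-z']+", text.lower())
    return sorted(Counter(tokens).values(), reverse=True)


def loglog_slope(freqs):
    """Least-squares slope of log(f) against log(r); needs at least two ranks."""
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic check: frequencies that follow f = k / r exactly.
ideal = [1000.0 / r for r in range(1, 201)]
ideal_slope = loglog_slope(ideal)
```

On real text the slope would be fitted over a chosen range of ranks, since the later slides show that a single slope does not hold for a very large corpus.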

  4. Zipf curve for Small Text English 250,000 words

  5. Failure of Zipf’s Law
     Zipf curve for the Wall Street Journal (40 million words): for rank > 5,000, Zipf’s law fails.

  6. Very Large English Corpus
     • North American News Text (NANT) corpus:
       • Los Angeles Times & Washington Post (May 1994-Aug 1997): 71 million tokens and 253 thousand types.
       • New York Times News Syndicate (Jul 1994-Dec 1996): 249 million tokens and 461 thousand types.
       • Reuters News Service (Apr 1994-Dec 1996): General, 90 million tokens and 259 thousand types; Financial, 25 million tokens and 122 thousand types.
       • Wall Street Journal (Jul 1994-Dec 1996): 54 million tokens and 198 thousand types.

  7. Zipf curve for English NANT corpus
     Note: if n-grams (i.e. phrases) are included, the slope is –1 for all ranks.

  8. Irish Corpus
     • The Irish language is a highly inflected Indo-European Celtic language. Both the beginnings and the ends of words are regularly inflected, so it is very different from English.
     • The Irish corpus used in our experiments is a corpus of 17th and 18th century Irish from the Royal Irish Academy (http://www.ria.ie), comprising 7 million tokens and 450 thousand types.

  9. Comparison of Zipf curves for English and Irish

  10. The list of 20 words after rank r with frequency f for the English Zipf curve

  11. Type-Token relationship
      In 1967, Booth’s assumption for a word occurring once: [equation missing from transcript].
      Applying Zipf’s law f = k/r, where k is a constant, we get: [equation missing].
      So if N is the highest rank of any word in the corpus, then: [equation missing]. So: [equation missing].
      In 1985, Smith and Devine used the same logic to investigate the type-token distribution and proposed the integral for Zipf’s law: [equation missing].
      Solved as: [equation missing] (the Smith-Devine prediction).
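
The type-token relationship can also be measured directly from a corpus. A minimal sketch, assuming the corpus is already available as a sequence of word tokens; the function name `type_token_curve` and the `step` parameter are illustrative and not taken from the slides.

```python
def type_token_curve(tokens, step=1):
    """Return (tokens_read, distinct_types_seen) pairs, sampled every `step` tokens."""
    seen = set()
    curve = []
    for count, token in enumerate(tokens, start=1):
        seen.add(token)
        if count % step == 0:
            curve.append((count, len(seen)))
    return curve


# Tiny illustrative token stream (not data from the slides).
sample_curve = type_token_curve("a b a c b d".split())
```

Plotting the second column against the first gives the empirical curve that a prediction such as the Smith-Devine law is compared with.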

  12. Smith-Devine prediction

  13. Enhancement of Smith-Devine law
      We use an approximation based on a slope of –1 for rank ≤ N0 and a slope of –2 for rank ≥ N0.

  14. Enhancement of Smith-Devine law
      The direct sum from Zipf’s law gives the number of tokens, $T = \sum_{r=1}^{N} k/r$.
      The sum of this series is well-known and is given by $T \approx k(\ln N + \gamma)$ for large N, where γ = Euler’s constant = 0.577.
      The Smith-Devine equation is wrong by ~0.11k.
      Break the integral at N0, where the curve begins to turn down: $f = k_1/r$ for r ≤ N0 and $f = k_2/r^{2}$ for r ≥ N0, where k1 and k2 are constants.
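
The size of this discrepancy can be checked numerically. The sketch below compares the direct sum of k/r with the standard large-N approximation k(ln N + γ). It also evaluates k ln(2N + 1), the integral of k/r from 0.5 to N + 0.5. That integral is only an assumed form of the Smith-Devine result, chosen because it overshoots by ln 2 – γ ≈ 0.116 per unit k, which matches the ~0.11k quoted; the transcript does not preserve the formula itself. All function names are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler's constant


def tokens_direct(k, n):
    """Direct sum of k / r over ranks 1..n."""
    return k * sum(1.0 / r for r in range(1, n + 1))


def tokens_asymptotic(k, n):
    """Large-n approximation k * (ln n + gamma)."""
    return k * (math.log(n) + EULER_GAMMA)


def tokens_half_shifted_integral(k, n):
    """Integral of k / r from 0.5 to n + 0.5, i.e. k * ln(2n + 1).

    ASSUMPTION: used here as a stand-in for the Smith-Devine integral.
    """
    return k * math.log(2 * n + 1)


n = 100_000
gap_asymptotic = tokens_direct(1.0, n) - tokens_asymptotic(1.0, n)
gap_integral = tokens_half_shifted_integral(1.0, n) - tokens_direct(1.0, n)
```

Under this assumption `gap_integral` stays close to ln 2 – γ however large N becomes, while `gap_asymptotic` shrinks roughly like 1/(2N).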

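  15. Enhancement (continued)
      Noting that for the last rank N, f = [value missing from transcript]; so [equation missing], or [equation missing].
      At the rank r = N0, the two curves (f = k1/r below N0 and f = k2/r² above it) join, so $k_1/N_0 = k_2/N_0^{2}$, i.e. $k_2 = k_1 N_0$.
      Integrating and substituting N0 = 5,000 for English and N0 = 30,000 for Irish gives the extended law.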
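
A rough sketch of the two-slope model as a direct sum rather than an integral. It assumes that the last-ranked word occurs exactly once (f = 1 at r = N), which fixes k2 = N²; that value is an assumption, since the transcript does not preserve it. The join condition at N0 then gives k1 = k2 / N0. The function names are hypothetical.

```python
def two_slope_constants(n_types, n0):
    """Return (k1, k2) for f = k1/r (r <= n0) and f = k2/r**2 (r > n0)."""
    if n_types < n0:
        raise ValueError("this sketch only covers corpora with at least n0 types")
    k2 = float(n_types) ** 2  # ASSUMPTION: f = 1 at the last rank N
    k1 = k2 / n0              # join condition: k1 / n0 == k2 / n0**2
    return k1, k2


def frequency(r, k1, k2, n0):
    """Model frequency at rank r under the two-slope approximation."""
    return k1 / r if r <= n0 else k2 / (r * r)


def token_count(n_types, n0):
    """Predicted number of tokens: sum of model frequencies over all ranks."""
    k1, k2 = two_slope_constants(n_types, n0)
    return sum(frequency(r, k1, k2, n0) for r in range(1, n_types + 1))


# Illustrative values: N0 = 5,000 as quoted for English, N chosen arbitrarily.
k1, k2 = two_slope_constants(20_000, 5_000)
predicted_tokens = token_count(20_000, 5_000)
```

Evaluating `token_count` over a range of N traces a type-token curve for this model, which could be compared against a measured curve.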

  16. Extended law on English

  17. Extended law on Irish

  18. Conclusions
      • For a very large corpus, the Zipf curve for English has two slopes: b = 1 for ranks below 5,000 and b = 2 for ranks above 5,000.
      • The curve for Irish, an inflected language, is flatter, with a slope of –1 up to a rank of about 30,000.
      • An extended law for the type-token relationship is derived and tested.

  19. New result: Extended law on Latin
