English Corpora and Words Defined in Learner's Dictionaries

Yang Shouxun (yang.shx@fltrp.com), Corpus Development Section, FLTRP


Presentation Transcript


  1. Yang Shouxun yang.shx@fltrp.com Corpus Development Section, FLTRP English Corpora and Words Defined in Learner's Dictionaries

  2. English Corpora • A corpus is a collection of written texts and/or transcripts of spoken language. • The Brown Corpus • The British National Corpus (BNC) • Special software is needed to access and analyze a corpus. • FLTRP English/Chinese Parallel Corpus

  3. Frequency • A small number of high-frequency words cover a large proportion of the corpus. • A large number of low-frequency words cover only a disproportionately small part.
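
The coverage claim on this slide can be checked with a few lines of Python. This is a minimal sketch on a toy token list, not the tooling used in the study; the function name `coverage_of_top_words` and the sample sentence are assumptions made for illustration.

```python
from collections import Counter


def coverage_of_top_words(tokens, top_n):
    """Share of all running tokens accounted for by the top_n most frequent word types."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / total


# Toy "corpus" (invented): 13 tokens, 8 distinct word types.
tokens = "the cat sat on the mat and the dog sat on the log".split()

# The two most frequent types ("the" and "sat") account for 6 of the 13 tokens.
share = coverage_of_top_words(tokens, 2)
print(f"top 2 of {len(set(tokens))} types cover {share:.0%} of tokens")
```

On a real corpus the same function would be called with increasing `top_n` to trace how quickly coverage saturates.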

  4. Frequency • High-frequency words are most useful to learners of the language. • High-frequency words should be defined in a learner's dictionary. • More frequently used senses should come before less frequently used senses.

  5. Objectives • Are the words defined in a learner's dictionary really high-frequency words? • What high-frequency words are not defined in a learner's dictionary? • What low-frequency words are included in a learner's dictionary (and what not)?

  6. Research methods • Six learner's dictionaries by well-known international publishers • Three corpora for word frequency extraction • Brown Corpus • British National Corpus • FLTRP English/Chinese Parallel Corpus

  7. Research methods • A frequency table is computed for each corpus, and the frequencies are normalized to occurrences per million words. • Lists of the words defined in the six dictionaries are extracted. • Multi-word entries are excluded, e.g. “a priori”, “according to”.
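
The two steps on this slide might look as follows in Python. It is a sketch under simple assumptions: the corpus is already tokenized into a list of strings, and a multi-word entry is any headword containing a space. The names `per_million_frequencies` and `is_single_word` are hypothetical.

```python
from collections import Counter


def per_million_frequencies(tokens):
    """Map each word to its frequency per million running words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count * 1_000_000 / total for word, count in counts.items()}


def is_single_word(entry):
    """Assumption: an entry is multi-word if it contains a space."""
    return " " not in entry.strip()


# Toy data (invented) standing in for a corpus and a dictionary headword list.
tokens = ["the", "corpus", "the", "word"]
freq = per_million_frequencies(tokens)

entries = ["a priori", "according to", "corpus", "word"]
headwords = [entry for entry in entries if is_single_word(entry)]
```

Normalizing to a per-million scale is what makes frequencies from corpora of different sizes comparable.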

  8. Research methods • Words from the corpora are reduced to their base forms • dictionaries mostly contain words in their base forms • corpora contain words in all possible forms, including inflected forms and capitalization variants • some issues with this method • thought/think • case distinctions cannot be kept: A/a, China/china • many corpus entries contain numbers, but only a few numbers are dictionary entries.
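
The issues listed on this slide show up even in a very small base-form reducer. The sketch below is not the method used in the study: the irregular-form table, the suffix rules, and the function name `base_form` are all assumptions chosen to make the thought/think and China/china problems visible.

```python
# Hypothetical table of irregular forms; a real system would need a far larger one.
IRREGULAR = {"thought": "think", "went": "go", "children": "child"}

# Hypothetical suffix rules, tried in order: (suffix to strip, replacement).
SUFFIX_RULES = [("ies", "y"), ("es", ""), ("s", ""), ("ing", ""), ("ed", "")]


def base_form(token, vocabulary):
    """Reduce a corpus token to a candidate base form, checked against a headword set."""
    word = token.lower()  # case is discarded here, so "China" and "china" merge
    if word in IRREGULAR:
        return IRREGULAR[word]  # "thought" always becomes "think", even as a noun
    if word in vocabulary:
        return word
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            candidate = word[: -len(suffix)] + replacement
            if candidate in vocabulary:
                return candidate
    return word  # unknown forms are left as they are


# Toy headword set (invented).
vocabulary = {"think", "thought", "china", "word", "study", "box"}

print(base_form("China", vocabulary))    # the proper noun collapses into "china"
print(base_form("thought", vocabulary))  # the noun reading is lost
print(base_form("studies", vocabulary))
```

Checking each candidate against the headword set keeps the suffix rules from producing non-words, at the cost of leaving unknown inflected forms unreduced.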

  9. Computation • Distribution of word frequency in a dictionary • What percentage of high-frequency words are defined in a dictionary?

  10. Distribution of word frequency in dictionaries • Brown Corpus

  11. Distribution of word frequency in dictionaries • More than 45% of the words defined in dictionaries C, D, E, and F are not found in the Brown Corpus. • More than 25% of the words defined in dictionaries A and B are not found in the Brown Corpus. • Still good dictionaries • Even learner's dictionaries include far more words than a learner could possibly need.

  12. Distribution of word frequency in dictionaries

  13. Distribution of word frequency in dictionaries • The figure clearly shows that the dictionaries can be clustered into two categories • A, B • C, D, E, F • The denominator is the number of words defined in each dictionary • for learners • for advanced learners
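
The distribution described on slides 10 to 13 can be sketched as below, with the number of headwords in the dictionary as the denominator. The band thresholds, the function names, and the toy frequencies are assumptions for illustration; the study's actual bands are not given in the transcript.

```python
from collections import Counter


def band_label(freq, thresholds):
    """Assign a per-million frequency to a band; None means the word is absent from the corpus."""
    if freq is None:
        return "not found"
    for threshold in sorted(thresholds, reverse=True):
        if freq >= threshold:
            return f">={threshold}"
    return f"<{min(thresholds)}"


def distribution(headwords, corpus_freq, thresholds=(1, 10, 100)):
    """Share of a dictionary's headwords in each band (denominator: number of headwords)."""
    headwords = set(headwords)
    if not headwords:
        return {}
    tally = Counter(band_label(corpus_freq.get(word), thresholds) for word in headwords)
    return {label: count / len(headwords) for label, count in tally.items()}


# Toy per-million frequencies and headword list (invented).
corpus_freq = {"the": 60000.0, "corpus": 12.0, "lexeme": 0.4}
headwords = ["the", "corpus", "lexeme", "spammer"]

dist = distribution(headwords, corpus_freq)
print(dist)
```

Because the denominator differs from one dictionary to the next, a larger dictionary naturally shows a bigger "not found" share.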

  14. Distribution of word frequency in dictionaries • BNC

  15. Distribution of word frequency in dictionaries • Similar trend • Just a little smaller

  16. Distribution of word frequency in dictionaries

  17. Distribution of word frequency in dictionaries • FlecPara

  18. Distribution of word frequency in dictionaries

  19. How many high-frequency words are defined? • Brown Corpus

  20. How many high-frequency words are defined? • The denominator is constant across dictionaries. • Advanced dictionaries are rated higher, but the margin is very small. • The curves below a frequency of 8 are surprising and require an explanation.
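
The second measure uses a denominator that comes from the corpus and is therefore the same for every dictionary. A minimal sketch, assuming a per-million frequency table and one headword set per dictionary (the function name and all numbers are invented for illustration):

```python
def coverage_at_threshold(headwords, corpus_freq, threshold):
    """Share of corpus words with frequency >= threshold that the dictionary defines."""
    frequent = {word for word, freq in corpus_freq.items() if freq >= threshold}
    if not frequent:
        return 0.0
    return len(frequent & set(headwords)) / len(frequent)


# Toy per-million frequencies (invented values).
corpus_freq = {"the": 60000.0, "john": 150.0, "legislative": 19.5, "corpus": 12.0}

# Two toy dictionaries; the second one leaves out "legislative".
dict_a = {"the", "legislative", "corpus"}
dict_b = {"the", "corpus"}

for threshold in (100, 10):
    a = coverage_at_threshold(dict_a, corpus_freq, threshold)
    b = coverage_at_threshold(dict_b, corpus_freq, threshold)
    print(f"frequency >= {threshold}: A covers {a:.0%}, B covers {b:.0%}")
```

Since the set of frequent words depends only on the corpus and the threshold, the resulting curves can be compared directly across dictionaries.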

  21. How many high-frequency words are defined?

  22. How many high-frequency words are defined? • BNC

  23. How many high-frequency words are defined?

  24. How many high-frequency words are defined? • FlecPara

  25. How many high-frequency words are defined?

  26. High-frequency words not defined in dictionaries • As the frequency threshold is lowered, an increasing number of high-frequency words are not defined. • Place names, such as “Asia”, “Europe” • Personal names, such as “John”, “David” • Other cases, such as “ii”, “na”, “ca” • Words that probably should be included: “legislative” (>19), not defined in B • Some dictionaries make extensive use in their definitions of words that are not themselves defined.

  27. Why are not all high-frequency words defined? • Computational methods are good enough but not perfect: • how to reduce words to their base forms, spelling variations in the corpus • numbers • Some words are presumed unimportant or are simply left out by accident: “Soviet” (>119), “Unix” (>42) • Some vulgar words are probably avoided intentionally for elementary learners.

  28. Low-frequency words defined in dictionaries • Some lower-frequency words have to be chosen if the dictionary is a big one. • Words and expressions that come into wider use after the corpus is built may find their way into new dictionaries or updated editions. • “ISP”, “spammer”, “MP3”, and “e-commerce” • Affixes, e.g. “post-”, “-proof” • Why some low-frequency words are chosen and others are not is unclear.

  29. Concluding remarks • Take the numbers with a grain of salt. • The frequency principle is well observed in modern English dictionaries. • There may be occasional bugs. • The corpus should be kept up-to-date, or new words and expressions should be added from other sources if the dictionary is targeted at advanced learners.

  30. Concluding remarks • A learner's dictionary does not really need to cover so many low-frequency words. • A better metric for evaluating learner's dictionaries would be the coverage of high-frequency words in the dictionary texts; this is a topic for further study. • It would be interesting to include some dictionaries compiled without corpora in the study.

  31. Thanks!
