Yang Shouxun yang.shx@fltrp.com Corpus Development Section, FLTRP English Corpora and Words Defined in Learner's Dictionaries
English Corpora • A corpus is a collection of written texts and/or transcripts of spoken language • The Brown Corpus • The British National Corpus • Special software is used to access and analyze the corpus • FLTRP English/Chinese Parallel Corpus
Frequency • A small number of high-frequency words cover a large proportion of the corpus. • A large number of low-frequency words cover only a disproportionately small part of it.
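The coverage claim above can be checked on any tokenized text. The following is a minimal sketch, assuming the corpus is already available as a list of word tokens; the function name `coverage_of_top_words` and the toy sentence are illustrative and not part of the study.

```python
from collections import Counter


def coverage_of_top_words(tokens, top_n):
    """Return the share of all tokens accounted for by the top_n most frequent words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(freq for _, freq in counts.most_common(top_n))
    return covered / total


# Toy text standing in for a corpus (illustrative only).
tokens = "the cat saw the dog and the dog saw the bird".split()
share = coverage_of_top_words(tokens, 2)
print(f"Top 2 word types cover {share:.0%} of the tokens")
```

On a real corpus, plotting this share against `top_n` would give the usual steep coverage curve.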
Frequency • High-frequency words are most useful to learners of the language. • High-frequency words should be defined in a learner's dictionary. • More frequently used senses should come before less frequently used senses.
Objectives • Are the words defined in a learner's dictionary really high-frequency words? • What high-frequency words are not defined in a learner's dictionary? • What low-frequency words are included in a learner's dictionary (and what not)?
Research methods • Six learner's dictionaries by well-known international publishers • Three corpora for word frequency extraction • Brown Corpus • British National Corpus • FLTRP English/Chinese Parallel Corpus
Research methods • A frequency table is computed for each corpus, and frequencies are normalized to occurrences per million words. • Lists of defined words in the six dictionaries are extracted. • Multi-word entries, such as “a priori” and “according to”, are excluded.
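The normalization and the exclusion of multi-word entries might look roughly like this. It is a sketch under stated assumptions, not the tooling used in the study: `per_million` and `is_single_word` are hypothetical helper names, and whitespace is assumed to be what separates the parts of a multi-word entry.

```python
from collections import Counter


def per_million(tokens):
    """Map each word to its frequency per million tokens of the corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count * 1_000_000 / total for word, count in counts.items()}


def is_single_word(entry):
    """Assumption: an entry containing whitespace counts as a multi-word entry."""
    return len(entry.split()) == 1


# Toy data (illustrative only).
freq = per_million(["a", "a", "b", "c"])
headwords = [h for h in ["a priori", "according to", "corpus"] if is_single_word(h)]
```

Normalizing to a per-million scale makes frequencies comparable across corpora of different sizes.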
Research methods • Words from the corpora are reduced to their base forms • dictionaries mostly contain words in base forms • corpora contain words in all possible forms, including inflected and capitalized forms • some issues with this method • thought/think • case distinctions cannot be kept: A/a, China/china • many corpus entries contain numbers, but only a few numbers are dictionary entries.
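A reduction step of this kind can be sketched as follows. The lookup table `BASE_FORMS` is a tiny hand-made stand-in for a real lemmatizer, and dropping tokens that contain digits is an assumption made for this illustration; the slides do not say how numbers were handled.

```python
from collections import Counter

# Hypothetical, hand-made lookup table; a real study would need a full lemmatizer.
BASE_FORMS = {"thought": "think", "words": "word", "defined": "define"}


def to_base_form(token):
    """Lowercase the token, then map it to a base form if one is listed."""
    lowered = token.lower()  # case folding merges "China" and "china"
    return BASE_FORMS.get(lowered, lowered)


def base_form_counts(tokens):
    """Count tokens by base form, skipping tokens that contain digits (assumption)."""
    counts = Counter()
    for token in tokens:
        if any(ch.isdigit() for ch in token):
            continue
        counts[to_base_form(token)] += 1
    return counts


counts = base_form_counts(["China", "china", "thought", "think", "1984"])
```

The example reproduces both issues named above: the noun "thought" is counted as "think", and "China" can no longer be told apart from "china".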
Computation • Distribution of word frequency in a dictionary • What percentage of high-frequency words are defined in a dictionary?
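The two computations differ mainly in their denominators. The sketch below assumes a per-million frequency table and a set of headwords are already available; the function names, the band thresholds, and the toy values are all illustrative and not taken from the study.

```python
def band_distribution(headwords, freq_per_million, thresholds):
    """Share of a dictionary's headwords in each frequency band.

    thresholds must be sorted from highest to lowest.
    The denominator is the number of headwords in the dictionary.
    """
    headwords = set(headwords)
    lowest = f"<{thresholds[-1]}"
    tally = {f">={t}": 0 for t in thresholds}
    tally[lowest] = 0
    tally["not found"] = 0
    for word in headwords:
        freq = freq_per_million.get(word, 0)
        if freq == 0:
            tally["not found"] += 1
            continue
        for t in thresholds:
            if freq >= t:
                tally[f">={t}"] += 1
                break
        else:
            tally[lowest] += 1
    size = len(headwords)
    return {label: (n / size if size else 0.0) for label, n in tally.items()}


def high_frequency_coverage(headwords, freq_per_million, threshold):
    """Share of corpus words at or above threshold that the dictionary defines.

    The denominator depends only on the corpus, so it is the same for every dictionary.
    """
    frequent = {w for w, f in freq_per_million.items() if f >= threshold}
    if not frequent:
        return 0.0
    return len(frequent & set(headwords)) / len(frequent)


# Toy values (made up for illustration).
toy_freq = {"alpha": 5000.0, "beta": 400.0, "gamma": 20.0, "delta": 300.0}
toy_headwords = {"alpha", "beta", "gamma", "omega"}
distribution = band_distribution(toy_headwords, toy_freq, [100, 10])
coverage = high_frequency_coverage(toy_headwords, toy_freq, 100)
```

With the first function a larger dictionary naturally shows a larger "not found" share, while the second is directly comparable across dictionaries.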
Distribution of word frequency in dictionaries • Brown Corpus
Distribution of word frequency in dictionaries • More than 45% of the words defined in dictionaries C, D, E, and F are not found in the Brown Corpus. • More than 25% of the words defined in dictionaries A and B are not found in the Brown Corpus. • Still good dictionaries • Even learner's dictionaries include far more words than a learner could possibly need.
Distribution of word frequency in dictionaries • The figure clearly shows that the dictionaries can be clustered into two categories • A, B • C, D, E, F • The denominator is the number of words defined in each dictionary • for learners • for advanced learners
Distribution of word frequency in dictionaries • Similar trend • Just a little smaller
How many high-frequency words are defined? • Brown Corpus
How many high-frequency words are defined? • The denominator is constant across dictionaries. • Advanced dictionaries are rated higher, but the margin is very small. • The curves after frequency < 8 are surprising and require an explanation.
How many high-frequency words are defined? • FlecPara
High-frequency words not defined in dictionaries • As the frequency gets lower, an increasing number of high-frequency words are not defined. • Place names, such as “Asia”, “Europe” • Personal names, such as “John”, “David” • Other cases, such as “ii”, “na”, “ca” • Words that probably should be included: “legislative” (>19) is not defined in B • Some dictionaries make extensive use, in their definitions, of words that are themselves not defined.
Why are not all high-frequency words defined? • The computational methods are good enough but not perfect: • reducing words to their base forms, spelling variations in the corpus • numbers • Some words are presumed unimportant or are simply left out by accident: “Soviet” (>119), “Unix” (>42) • Some vulgar words are probably avoided intentionally for elementary learners.
Low-frequency words defined in dictionaries • Some lower-frequency words have to be chosen if the dictionary is a big one. • Words and expressions that come into wider use after the corpus is built may find their way into new dictionaries or updated versions. • "ISP", "spammer", "MP3", and "e-commerce" • Affixes, e.g. “post-”, “-proof” • Why some low-frequency words are chosen and others are not is not so clear.
Concluding remarks • Take the numbers with a grain of salt. • The frequency principle is well observed in modern English dictionaries. • There may be occasional bugs. • The corpus should be kept up-to-date, or new words and expressions should be added from other sources if the dictionary is targeted at advanced learners.
Concluding remarks • A learner's dictionary does not really need to cover so many low-frequency words. • A better metric for evaluating learner's dictionaries would be the coverage of high-frequency words in the dictionary texts; this is a topic for further study. • It would be interesting to include some dictionaries compiled without corpora in the study.