Dr. Charles Browne Professor of Applied Linguistics Meiji Gakuin University, Tokyo browne@ltr.meijigakuin.ac.jp The New General Service List: Celebrating 60 years of Vocabulary Learning
A few current Corpus Projects… • Business English Word List for NHK TV Show in Japan • EnglishCentral (a HUGE video corpus of authentic English) • New General Service List (CEC) • New Academic Word List (CEC) • TOEIC Vocabulary Study List (using past test materials)
EFL Vocabulary Learning in Japan… [Figure: word-frequency chart; axis: Frequency; labelled words include the, of, and, permission, chaos, torment, emigrate, abstain, digress, exasperate; high-frequency words marked HFW] • The Negative Effect of “Test English” • PROBLEM: Students NEED to learn the first 5000 words of English to use English in the real world… • But entrance exams and high school textbooks force students to memorize hundreds of low-frequency words… • RESULT? High school students can’t deal with real world English because they don’t know hundreds of the most important high frequency words…
When reading or listening to a text, students will of course not know many words… What percentage of words do you think must be known for them to be able to read easily? 50% ? 75% ? 85% ? 95% ?
75% Coverage 1000 high frequency words [ 19 missing words ] …another possible problem with _____ _____ is how to _____ learner _____ although research suggests that _____ are a very _____ way to learn new words (Leitner, 1972, Mondria, 1994, Nation, 1990, 2001), students may lose interest if _____ are the _____ _____ of doing _____ _____. There is a _____ _____ in the _____ classroom of using games with a _____ purpose to increase and _____ learner _____ (Ersoz , 2000, Uberman 1988, Wright, Betteridge & Buckby, 1984), as well as lower the learner _____ _____ (Asher, 1965, 1977, Dulay, Krashen & Burt, 1982)
85% Coverage 2000 high frequency words [ 13 missing words ] …another possible problem with _____ _____ is how to _____ learner _____ although research suggests that _____ are a very efficient way to learn new words (Leitner, 1972, Mondria, 1994, Nation, 1990, 2001), students may lose interest if _____ are the _____ method of doing _____ _____. There is a rich tradition in the _____ classroom of using games with a communicative purpose to increase and maintain learner _____ (Ersoz , 2000, Uberman 1988, Wright, Betteridge & Buckby, 1984), as well as lower the learner _____ _____ (Asher, 1965, 1977, Dulay, Krashen & Burt, 1982)
95% Coverage 5000 high frequency words [ 4 missing words ] …another possible problem with vocabulary _____ is how to sustain learner motivation although research suggests that _____ are a very efficient way to learn new words (Leitner, 1972, Mondria, 1994, Nation, 1990, 2001), students may lose interest if _____ are the sole method of doing vocabulary review. There is a rich tradition in the _____ classroom of using games with a communicative purpose to increase and maintain learner motivation (Ersoz , 2000, Uberman 1988, Wright, Betteridge & Buckby, 1984), as well as lower the learner affective filter (Asher, 1965, 1977, Dulay, Krashen & Burt, 1982)
Vocabulary Thresholds: • Below 80%, reading comprehension is almost impossible (Hu & Nation, 2001) • 95% coverage is the point at which learners can read without the help of dictionaries (Laufer, 1989)
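The coverage figures on the preceding slides can be reproduced in a few lines. The following is a minimal sketch, not the procedure used for the slides: it assumes coverage means the share of running words (tokens) found in a known-word list, and the function names, the toy sentence and the toy word list are all made up for illustration.

```python
import re

def coverage(text, known_words):
    """Share of tokens in `text` that appear in `known_words` (0.0 to 1.0)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in known_words)
    return known / len(tokens)

def blank_unknown(text, known_words):
    """Replace every word not in `known_words` with a blank, as in a cloze text."""
    def mask(match):
        word = match.group(0)
        return word if word.lower() in known_words else "_____"
    return re.sub(r"[A-Za-z']+", mask, text)

# Toy data (hypothetical): a learner who knows five words.
known = {"the", "cat", "sat", "on", "mat"}
sentence = "The cat sat on the ancient mat"

print(f"{coverage(sentence, known):.0%} coverage")
print(blank_unknown(sentence, known))
```

Running the same two functions with the first 1000, 2000 and 5000 words of a frequency list would give a rough picture of how a passage looks to learners at each level.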
Goals of the NGSL Project… • to update and greatly expand the size of the corpus used (273 million words) compared to the limited corpus behind the original GSL (about 2.5 million words), with the hope of increasing the generalizability and validity of the list • to create an NGSL of the most important high-frequency words useful for second language learners of English, giving the highest possible coverage of English texts with the fewest words possible • to make an NGSL that is based on a clearer definition of what constitutes a word • to be a starting point for discussion among interested scholars and teachers around the world, with the goal of updating and revising the list based on this input (in much the same way that West did with the original Interim version of the GSL)
Original GSL in a nutshell… • West’s 1953 GSL was actually a more fully developed version of Faucett’s 1936 “Interim Report on Vocabulary Selection” (sponsored by the Carnegie Corporation) • Contributors included many famous linguists such as Thorndike, Horn, Maki, Palmer and West • Based on a 2.5 million word hand collected corpus (later increased to 5 million words) • Combined objective (frequency) and subjective (teacher intuition) criteria • Approximately 2200 words giving about 80% coverage in general texts • No systematic attempt to define what a word was: “no attempt has been made to be rigidly consistent in the method used for displaying the words: each word has been treated as a separate problem, and the sole aim has been clearness” (West, 1953, page viii)
General Service Lists GSL (West, 1953) http://jbauman.com/aboutgsl.html#1953
Academic Word List AWL (Coxhead 2000) http://www.victoria.ac.nz/lals/resources/academicwordlist/
Getting AWL/GSL lists w/definitions & sound files… • I made a few GSL/AWL apps and have made all the content available for free to teachers and researchers. Please contact me if you need any of the following for the GSL or AWL: • Word lists • Parts of speech • Definitions in easy English • Definitions in Japanese • Sound files for pronunciation of words • browne@ltr.meijigakuin.ac.jp
Original GSL created in 1930s…2.5m corpus may have had too many agriculture and religion texts? AGRICULTURE • plow • mill • spade • cultivator SEA TRAVEL • sailor • oar • vessel • merchant RELIGION • kingdom • god • devil • mercy • bless • fellowship • preach • sacred • worship • holy • pray • heaven • grace • pupil • church • Lord NOT AS IN USE? • telegraph • chimney • coal • cottage • gaiety • shilling • headdress • saucer • woolen • amongst
Starting Point for NGSL… Access to Cambridge’s more modern 2 BILLION word corpus
CEC corpora used for preliminary analysis of NGSL

Corpus              Tokens
Newspaper      748,391,436
Academic       260,904,352
Learner         38,219,480
Fiction         37,792,168
Journals        37,478,577
Magazines       37,329,846
Non-Fiction     35,443,408
Radio           28,882,717
Spoken          27,934,806
Documents       19,017,236
TV              11,515,296
Total        1,282,909,322
Problems… • Newspaper subsection was too large and dominated the frequencies • Newspaper subsection in CEC had too much of a bias towards financial terms • Academic subcorpus of CEC not really related to needs of General English for 2nd language learners
Balancing the NGSL Corpus…
CEC corpora included in final analysis for NGSL

Corpus              Tokens
Learner         38,219,480
Fiction         37,792,168
Journals        37,478,577
Magazines       37,329,846
Non-Fiction     35,443,408
Radio           28,882,717
Spoken          27,934,806
Documents       19,017,236
TV              11,515,296
Total          273,613,534*

*273 million word subsection used is 100x larger than original GSL corpus…
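The arithmetic behind the balanced corpus can be checked directly from the token counts on the slides. This is only a sketch: the variable names are made up, and the 2.5 million figure for the original GSL corpus is the approximate size quoted earlier in the talk.

```python
# Token counts for the CEC sub-corpora, as listed on the slides.
CEC_TOKENS = {
    "Newspaper": 748_391_436,
    "Academic": 260_904_352,
    "Learner": 38_219_480,
    "Fiction": 37_792_168,
    "Journals": 37_478_577,
    "Magazines": 37_329_846,
    "Non-Fiction": 35_443_408,
    "Radio": 28_882_717,
    "Spoken": 27_934_806,
    "Documents": 19_017_236,
    "TV": 11_515_296,
}

# Sub-corpora dropped from the final analysis.
EXCLUDED = {"Newspaper", "Academic"}

# Approximate size of the original GSL corpus ("about 2.5 million words").
ORIGINAL_GSL_TOKENS = 2_500_000

full_total = sum(CEC_TOKENS.values())
balanced_total = sum(
    count for name, count in CEC_TOKENS.items() if name not in EXCLUDED
)
ratio = balanced_total / ORIGINAL_GSL_TOKENS

print(f"All sub-corpora:   {full_total:,}")
print(f"Balanced subset:   {balanced_total:,}")
print(f"Size vs. GSL 2.5m: about {ratio:.0f}x")
```

The two totals match the figures on the slides, and the ratio comes out a little above 100, which is consistent with the "100x larger" footnote.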
Next steps… • Removed proper nouns • Removed numbers, days of the week, months of the year, etc. • Used statistical procedures to combine the frequencies from the various sub-corpora while adjusting for differences in their relative sizes • Had meetings with Paul Nation to review the list in relation to other frequency lists and to add/delete words as deemed appropriate
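The slides do not say which statistical procedures were used to combine the sub-corpus frequencies. One common way to adjust for corpus size, shown here purely as an illustration and not as the NGSL method, is to convert each raw count to a rate per million tokens and then average the rates, so that a large sub-corpus cannot dominate. The function names and the raw counts below are hypothetical; the two corpus sizes are taken from the slides.

```python
def per_million(count, corpus_size):
    """Convert a raw count to occurrences per million tokens."""
    return count * 1_000_000 / corpus_size

def combined_frequency(counts, sizes):
    """Average the per-million rate of one word across all sub-corpora.

    `counts` maps sub-corpus name -> raw count of the word (missing = 0).
    `sizes` maps sub-corpus name -> total tokens in that sub-corpus.
    Every sub-corpus gets equal weight regardless of its size.
    """
    rates = [per_million(counts.get(name, 0), size) for name, size in sizes.items()]
    return sum(rates) / len(rates)

# Sizes from the slides; the raw counts are invented for the example.
sizes = {"Fiction": 37_792_168, "TV": 11_515_296}
counts = {"Fiction": 3_779, "TV": 2_303}

print(f"{combined_frequency(counts, sizes):.1f} per million (size-adjusted)")
```

In this toy case the word is rarer in raw terms in the TV sub-corpus but twice as frequent per million tokens, which is exactly the kind of difference that raw pooled counts would hide.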
Comparing the GSL and NGSL: Apples and Oranges? Word Families or Lemmas?
Comparing the GSL and NGSL: “To be or not to be, that is the question.” • 10 Tokens to, to, be, be, or, not, that, is, the, question • 8 Types to, be, or, not, that, is, the, question • 7 Lemmas to, be, or, not, that, the, question
Comparing the GSL and NGSL: “To be or not to be, that is the question.”

Rank  Word      Tokens  Coverage
1     be        3       30%
2     to        2       20%
3     not       1       10%
3     or        1       10%
3     question  1       10%
3     that      1       10%
3     the       1       10%
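The token, type and lemma counts for the Hamlet line can be reproduced with a short script. This is a minimal sketch: the lemma map is a hand-made, one-entry dictionary that only covers this sentence (a real project would use a full lemmatizer), and the names are made up for the example.

```python
import re
from collections import Counter

# Hand-made lemma map covering only this sentence: "is" is a form of "be".
LEMMA_OF = {"is": "be"}

def analyse(sentence):
    """Return (tokens, types, lemma counts) for a sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    types = set(tokens)
    lemmas = Counter(LEMMA_OF.get(token, token) for token in tokens)
    return tokens, types, lemmas

tokens, types, lemmas = analyse("To be or not to be, that is the question.")

print(len(tokens), "tokens,", len(types), "types,", len(lemmas), "lemmas")

# Coverage of each lemma = its token count divided by all tokens.
for lemma, count in sorted(lemmas.items(), key=lambda item: (-item[1], item[0])):
    print(f"{lemma:<10}{count:>3}{count / len(tokens):>6.0%}")
```

The counts agree with the slides: 10 tokens, 8 types and 7 lemmas, with "be" alone covering 3 of the 10 tokens.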
Comparing the GSL and NGSL: The assumption in Word Families is that if the headword is known, so are all derived forms… ACCEPT ACCEPTABILITY ACCEPTABLE UNACCEPTABLE ACCEPTANCE ACCEPTED ACCEPTING ACCEPTS
Comparing the GSL and NGSL: But are they?
Comparing the GSL and NGSL: THE WORD FAMILY APPROACH (Bauer and Nation, 1993) Level 1 A different form is a different word. Capitalization is ignored. Level 2 Regularly inflected words are part of the same family. Level 3 (10 affixes) -able, -er, -ish, -less, -ly, -ness, -th, -y, non-, un-, all with restricted uses Level 4 (10 affixes) -al, -ation, -ess, -ful, -ism, -ist, -ity, -ize, -ment, in-, all with restricted uses.
Comparing the GSL and NGSL: Level 5 (48 affixes) -age (leakage), -al (arrival), -an (American), -ance (clearance), -ant (consultant), -ary (revolutionary), -atory (confirmatory), -dom (kingdom; officialdom), -eer (black marketeer), -en (wooden), -en (widen), -ence (emergence), -ent (absorbent), -ery (bakery; trickery), -ese (Japanese; officialese), -esque (picturesque), -ette (usherette; roomette), -hood (childhood), -i (Israeli), -ian (phonetician; Johnsonian), -ite (Paisleyite; also chemical meaning), -let (coverlet), -ling (duckling), -ly (leisurely), -most (topmost), -ory (contradictory), -ship (studentship), -ward (homeward), -ways (crossways), -wise (endwise; discussion-wise), anti- (anti-inflation), ante- (anteroom), arch- (archbishop), bi- (biplane), circum- (circumnavigate), counter- (counter-attack), en- (encage; enslave), ex- (ex-president), fore- (forename), hyper- (hyperactive), inter- (interweave), mid- (mid-week), mis- (misfit), neo- (neo-colonialism), post- (post-date), pro- (pro-British), semi- (semi-automatic), sub- (subclassify; subterranean).
Comparing the GSL and NGSL: Level 6 (10 affixes) -able, -ee, -ic, -ify, -ion, -ist, -ition, -ive, -th, -y Level 7 Classical roots
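To make the idea of levels concrete, here is a toy sketch that reports the lowest level at which a form would join a headword's family. It is a heavy simplification, not Bauer and Nation's procedure: it covers Levels 1 to 4 only, handles a single affix attached by plain concatenation (no spelling changes, no stacked affixes), ignores the "restricted uses" conditions, and approximates Level 2 inflections with four endings chosen for the example. The affix lists for Levels 3 and 4 are the ones given on the slides.

```python
# Affixes per level. Levels 3 and 4 follow the slides; the Level 2
# inflection list is an assumption made for this sketch.
LEVEL_AFFIXES = {
    2: {"suffixes": ["s", "es", "ed", "ing"], "prefixes": []},
    3: {
        "suffixes": ["able", "er", "ish", "less", "ly", "ness", "th", "y"],
        "prefixes": ["non", "un"],
    },
    4: {
        "suffixes": ["al", "ation", "ess", "ful", "ism", "ist", "ity", "ize", "ment"],
        "prefixes": ["in"],
    },
}

def family_level(headword, form):
    """Lowest level (1-4) at which `form` joins `headword`'s family, else None."""
    headword, form = headword.lower(), form.lower()
    if form == headword:
        return 1  # Level 1: same form; capitalization is ignored
    for level in sorted(LEVEL_AFFIXES):
        affixes = LEVEL_AFFIXES[level]
        if any(form == headword + suffix for suffix in affixes["suffixes"]):
            return level
        if any(form == prefix + headword for prefix in affixes["prefixes"]):
            return level
    return None  # needs a higher level, stacked affixes, or a spelling change

for form in ["Accept", "accepts", "acceptable", "acceptance", "unacceptable"]:
    print(form, "->", family_level("accept", form))
```

Forms such as "acceptance" (a Level 5 suffix) and "unacceptable" (two affixes) fall outside this toy model and come back as None, which is itself a reminder of how much a learner is assumed to know when a whole family is counted as one word.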
Comparing the GSL and NGSL: • However, the GSL is not consistent in defining what to count as a word. • “no attempt has been made to be rigidly consistent in the method used for displaying the words: each word has been treated as a separate problem, and the sole aim has been clearness” (West, 1953, page viii) • To get some consistency, Bauman and Culligan (1995) grouped the original GSL headwords using Level 4 affixes. Then they ranked the words according to frequencies from the Brown Corpus. • Subsequently, Nation released a word list with the program Range that grouped words up to Level 6 affixes, and also included numbers, days of the week, months of the year, and metric units of measurement.
Comparing the GSL and NGSL: NGSL: A Modified Lexeme Approach • All inflected forms for all parts of speech plus the plural of the gerund • Includes both British & American spellings • Examples • accept: accepts, accepted, accepting, acceptings • acceptable: acceptables • paint: paints, painted, painting, paintings
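Counting under this approach amounts to mapping every listed form back to its headword before tallying. The sketch below uses the three example entries from the slide; the "color" entry is a hypothetical addition to illustrate the British/American spelling rule, and the function and variable names are made up for the example.

```python
from collections import Counter

# Headword -> member forms. The first three entries come from the slide;
# "color" is a hypothetical entry illustrating British/American spellings.
LEXEMES = {
    "accept": ["accepts", "accepted", "accepting", "acceptings"],
    "acceptable": ["acceptables"],
    "paint": ["paints", "painted", "painting", "paintings"],
    "color": ["colors", "colour", "colours"],
}

# Reverse lookup: every form (including the headword itself) -> headword.
FORM_TO_HEADWORD = {
    form: headword
    for headword, forms in LEXEMES.items()
    for form in [headword, *forms]
}

def lexeme_counts(tokens):
    """Tally tokens by headword; unlisted words count under themselves."""
    counts = Counter()
    for token in tokens:
        word = token.lower()
        counts[FORM_TO_HEADWORD.get(word, word)] += 1
    return counts

sample = ["Paint", "painted", "paintings", "accepts", "acceptable", "colour", "colors"]
print(lexeme_counts(sample))
```

Note that "acceptable" stays a separate headword from "accept" here, which is the key difference from the word family approach, where both would be counted together.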
Comparing the GSL and NGSL: Apples and Oranges no longer… When both lists are lemmatized, the NGSL provides far more coverage with far fewer words, one of the chief goals of this project…
List downloadable in many forms www.newgeneralservicelist.org • Headword list… • Lemmatized list… • List with definitions in easy English… • List with raw data… (coming soon!)