Tatsuhiko Matsushita PhD candidate, Victoria University of Wellington

Is the vocabulary learning burden of Japanese really heavier than that of English?
日本語の語彙学習負担は本当に英語よりも大きいか？
17th Biennial Conference of the Japanese Studies Association of Australia (JSAA)

Contents 本発表の内容
• Motives for the study 研究動機
• Goals and research questions 目的・研究課題
• Method 方法
• Results 結果
• Discussion 考察
• Conclusion まとめ
• References 引用文献

1. Motives for the study 研究動機 (1)
• Is the burden of learning Japanese vocabulary heavy? (Tamamura, 1984)
• Text coverage study テキストカバー率の研究
  Text coverage = coverage of word tokens （延べ語数）
• The top (= most frequent) 1000 words cover 60% of the text in Japanese magazines (NINJAL: The National Institute for Japanese Language, 1962; 2006)
• The top 1000 words cover over 70% in English (e.g., Carroll, Davies & Richman, 1971)
• To reach 95%/98% text coverage, 9500/20000 words (lexemes 語彙素) are required in Japanese, while only 5000/9000 word families are required in English (Matsushita, 2011; Nation, 2006)
Note: Word family (English) ≒ Lexeme (Japanese)?
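The notion of text coverage used above can be illustrated with a minimal sketch. It assumes the text is already segmented into word tokens; the function name `text_coverage` and the toy English token list are hypothetical and are not taken from the study.

```python
from collections import Counter

def text_coverage(tokens, top_n):
    """Share of running words (tokens) accounted for by the top_n most frequent word types."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(freq for _, freq in counts.most_common(top_n))
    return covered / len(tokens)

# Toy corpus (hypothetical data, for illustration only)
tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]
print(text_coverage(tokens, 1))  # "the" accounts for 3 of the 8 tokens
print(text_coverage(tokens, 2))  # "the" and "cat" account for 5 of the 8 tokens
```

Applied to a real frequency list, the same ratio with top_n = 1000 is what the 60% (Japanese) and 70% (English) figures describe.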

1. Motives for the study 研究動機 (2)
Word family (English) ≒ Lexeme 語彙素 (Japanese)?
• “Word family” as adopted by Nation (2006)
  • Level 6 of Bauer & Nation (1993), which includes derived words with frequent affixes and ‘regular but infrequent affixes’
    e.g. members of abbreviate: abbreviate, abbreviates, abbreviated, abbreviating, abbreviation, abbreviations
• Lexeme as defined by UniDic (Den et al., 2009)
  Members of the short unit 短単位 of a lexeme
  e.g. 読む-読み, やはり-やっぱり, 足-脚, 受け入れる cf. 短縮／する
• Why is the text coverage in Japanese and English so different?
• Possible explanation: many groups of words with different word origins 語種 but similar meanings (e.g., Akimoto, 2002)
  e.g. 旅館, 宿屋, ホテル

1. Motives for the study 研究動機 (3)
• Questions about the explanation
  • Method: magazine texts? Coverage not including function words?
  • English also has synonyms with different word origins, e.g. liberty-freedom, spirit-soul
• Nature of Japanese: many transparent compounds composed of Kanji
  e.g. 春季 /shunki/: low frequency word (ranked 28587 in Matsushita (2011))
  春 /haru/: high frequency word (1019, ibid.)
  季節 /kisetsu/: high frequency word (1955, ibid.)
  → It is not difficult to infer the meaning of 春季 if the meanings of 春 and 季節 are already known (春季 is transparent)
• For those words, learners normally only need to understand the meanings of the components and the word formation rules, either implicitly or explicitly. cf. Harlan (2011)

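1. Motives for the study 研究動機 (4)
cf. Harlan (2011), the comedian ‘Pakkun’ （パックン）:
“Once you have learned a certain number of Kanji, building up your vocabulary actually becomes very easy. Once you have memorized the basic set, you can often apply it to the rest; if you learn 100, learning the next 100 gets faster. If you learn 500, the next 500 or 1000 go twice or three times as fast.”
“Once you know Kanji, you can work out the meaning of a newly heard word by analyzing it in Kanji. In 冷蔵庫 (refrigerator), 冷 means ‘to chill’, 蔵 is ‘kura’ (storehouse), and 庫 is the 庫 of 車庫 (garage), which gives an image of some kind of storage space. Put the three characters together and you can more or less get the meaning.”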

2. Goals and Research Questions 目的・研究課題
Goals:
• To estimate the true learning burden of Japanese vocabulary
• To consider a more efficient order for learning Japanese vocabulary
Research Questions:
• RQ 1: How many ‘characters’ do learners need to learn to attain a certain level of text coverage of ‘words’?
  Note: this is not the simple text coverage by character, cf. Chikamatsu et al. (2000). Knowing the meaning of the single character 節 is NOT enough to understand the meaning of 季節.
• RQ 2: Do the characters which provide that level of text coverage (in RQ 1) cover all the high frequency words? If not, which Kanji are further required to cover those words? (Is there any discrepancy between the word frequencies and the character frequencies?)

3. Method 方法 (1) - 1
• Calculate character frequencies in the BCCWJ (Balanced Corpus of Contemporary Written Japanese 現代日本語書き言葉均衡コーパス, 2009 monitor version: NINJAL, 2009)
• Give a learning order ranking to each character
  • Rank the character types in the order alphabet, Hiragana, Katakana, Kanji/signs
  • Rank Kanji by frequency
• List all words in their orthographic forms （書字形） in the BCCWJ
• Separate each word into characters
• Give the learning order ranking to each character
• Calculate the text coverage by filtering the words according to the learning order ranking of their characters
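The steps above can be sketched roughly as follows. This is a minimal illustration and not the procedure actually run on the BCCWJ: it assumes a word list with token frequencies is already available, treats all phonographic characters as known, approximates "Kanji" by one Unicode block plus 々, and weights character frequency by word token frequency. The function names and the toy frequencies are hypothetical.

```python
from collections import Counter

def is_kanji(ch):
    # Assumption: CJK Unified Ideographs block plus the iteration mark 々.
    return "\u4e00" <= ch <= "\u9fff" or ch == "々"

def kanji_ranking(word_freq):
    """Rank Kanji by token-weighted frequency (rank 1 = most frequent)."""
    counts = Counter()
    for word, freq in word_freq.items():
        for ch in word:
            if is_kanji(ch):
                counts[ch] += freq
    return {ch: rank for rank, (ch, _) in enumerate(counts.most_common(), start=1)}

def coverage_by_kanji_level(word_freq, top_k):
    """Share of word tokens whose characters are all kana/alphabet or top_k Kanji."""
    ranks = kanji_ranking(word_freq)
    total = sum(word_freq.values())
    covered = sum(
        freq
        for word, freq in word_freq.items()
        if all(not is_kanji(ch) or ranks[ch] <= top_k for ch in word)
    )
    return covered / total

# Toy orthographic forms with made-up token frequencies (not BCCWJ figures)
word_freq = {"の": 50, "は": 30, "春": 8, "テレビ": 6, "季節": 5, "春季": 1}
for k in range(4):
    print(k, coverage_by_kanji_level(word_freq, k))
```

Note that a word counts as covered only when every Kanji in it falls within the top_k, which is what distinguishes this from simple text coverage by character.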

3. Method 方法 (1) - 2
BCCWJ 2009 monitor version (NINJAL, 2009)
• Book corpus (approx. 28 million running words) and
• Internet forum site corpus (approx. 5 million running words)
Unit of counting a ‘word’ used for this study:
• the short unit （短単位） defined by UniDic (Den et al., 2009)
• the orthographic form （書字形）, i.e. 書く / 書か / かく or 足 / 脚 are counted as different orthographic forms but as one lexeme （語彙素）

3. Method 方法 (2)
For RQ 2:
• Identify the relationship between the Kanji frequency levels and the former JLPT 旧日本語能力検定試験 Kanji levels to check whether the JLPT Kanji are ranked appropriately
• Identify the words which are not covered by the high frequency Kanji and check which Kanji are used in those words
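The second step for RQ 2 might look like the sketch below. It assumes a Kanji frequency ranking has already been computed; the function name `blocking_kanji` is hypothetical, and the ranks in the example are invented for illustration and are not values from the study.

```python
def blocking_kanji(words, kanji_rank, top_k):
    """Return {word: [Kanji ranked beyond top_k]} for words not fully covered."""
    blocked = {}
    for word in words:
        # Characters absent from kanji_rank are treated as kana/alphabet (already known).
        beyond = [ch for ch in word if kanji_rank.get(ch, 0) > top_k]
        if beyond:
            blocked[word] = beyond
    return blocked

# Made-up ranks for illustration only; they are NOT values from the study.
kanji_rank = {"気": 50, "雰": 1700, "囲": 1100, "卒": 1300, "業": 90}
high_freq_words = ["雰囲気", "卒業", "気", "すごい"]
result = blocking_kanji(high_freq_words, kanji_rank, top_k=1000)
print(result)  # {'雰囲気': ['雰', '囲'], '卒業': ['卒']}
```

Running this over the high frequency word list yields the Kanji that would have to be added beyond the frequency cut-off.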

4. Results 結果 (1) - 1
RQ 1: How many ‘characters’ do learners need to learn to attain a certain level of text coverage of ‘words’?
• 64% of the words (half of them are function words) are covered by the phonographic characters alone (Hiragana, Katakana and alphabet)
• 82%: by phonographic characters + top 300 Kanji
• Learning 100 Kanji within the top 1000 Kanji means potential understanding of 6000-7000 types 異なり語 of orthographic forms (3000-4000 lexemes)

4. Results 結果 (1) - 2
• 95-96%: by phonographic characters 表音文字 + top 1000-1100 Kanji
  → threshold level for reading comprehension? (Hu & Nation, 2000; Komori et al., 2004)
• 98%: by phonographic characters + top 1500 Kanji

4. Results 結果: Number/Ratio of Words (orthographic forms) and Text Coverage by Character Types (+ Level of Kanji) in Japanese 日本語の文字タイプ（＋漢字レベル）別の語の数／割合とテキストカバー率

4. Results 結果: Text Coverage of Japanese Words (by Kanji level / cumulative) 日本語の単語のテキストカバー率（漢字レベル別／累積）

4. Results 結果 (2) - 1
RQ 2: Do the characters which provide the text coverage in RQ 1 cover all the high frequency words? If not, which Kanji are further required to cover those words? (Is there any discrepancy between the word frequencies and the character frequencies?)
i.e. Can low frequency Kanji be a barrier to learning high frequency words?

Number of Kanji at Different Frequency Levels and the Former JLPT Levels

4. Results 結果 (2) - 2
• There is only a narrow gap between the Kanji frequency levels and the former JLPT Kanji levels
• Among the top 1000 Kanji, more than 800 are Kanji at the former JLPT levels 4, 3 and 2
• More than 96% of the word tokens （延べ語数） in general texts will be covered by the following 1200 Kanji:
  • All Kanji at the former JLPT levels 4, 3 and 2 (total: 1000)
  • + Top 200 Kanji at the former JLPT level 1

4. Results 結果 (2) - 3
Top 196 Kanji at the former JLPT level 1 and others outside the JLPT levels (級外)
• Within the top 300: 保義公価基条応態郎 & 々
• Within the top 1000: 張士氏視素護離証企異評提姿井統振吉策影紀為宮江派僕従系衛皇展案松隊施我整及織環響修遺宗昭撃株節源養項興故裁沢端障志激弁益嫌佐司眼密載己債訳症標健納請授挙恵貴徳推描崎抗属盛監傷創徴街善援衆康模敵津拠継隠称尾聖鮮厳攻妙融丈筋帝秘敷驚射壊刑壁染功訴討幕扱脱範契弾診詳房避酸倉充典繰儀至削博瞬仮縁憲択就聴握詩秀柄浜滅拡惑踏華闘微雄維隣如審誘賀郷霊釈黙魔携掲遣艦剣致 & 誰頃藤俺之岡伊阪

4. Results 結果 (2) - 4
• 95% text coverage requires
  • Top 9600 lexemes / top 20749 orthographic forms (types 異なり語数)
  • Top 1000 Kanji + Hiragana, Katakana + alphabet
• Within the top 9600 lexemes, 1700 lexemes are estimated to require Kanji beyond the top 1000
  e.g. 比較、記憶、批判、距離、指摘、希望、分析、韓国、基礎、誕生、監督、雰囲気、卒業、洗濯
• Many of them are often written in Hiragana/Katakana
  e.g. 即ち、駄目、奴、凄い、頑張る、挨拶、嘘、煙草、匂い、只、是非、無駄、喧嘩、噂、伺う

5. Discussion 考察 1)
• For general texts, learners can attain more than 70% comprehension with 95-96% coverage (for English, see Hu & Nation, 2000; for Japanese, see Komori, Mikuni & Kondo, 2004)
• Learning Kanji in order of frequency is a much more efficient way to gain higher text coverage (Zipf’s Law: Zipf, 1949)
• The top 300-500 Kanji seem especially essential
• The top 1000-1500 Kanji might be enough for general purposes (with occasional use of a dictionary)
• It may also mean that learning Kanji without reaching the threshold level is of little use…

5. Discussion 考察 2)
• Also, to attain 95% coverage, 1000 Kanji are required; however, some important words are not covered by the top 1000 Kanji
• In other words, some low frequency Kanji are used in high frequency words
• Many of those Kanji have low productivity, that is, they are rarely used in other words
  e.g. 雰囲気、卒業、洗濯
• To cover the top 9600 words (lexemes), a further 200-500 Kanji are estimated to be required

5. Discussion 考察 3)
• Certainly, the burden of learning characters is heavier in Japanese than in most other languages
• However, the burden of learning Japanese vocabulary may be rather lighter once the learner knows:
  • the 1000-1500 characters
  • word formation/compounding rules of Kanji
  • metaphors of Kanji compounds
    e.g. 入門: entering a gate → first step, to start training
• This holds despite the fact that the text coverage is lower than in English at all word frequency levels

5. Discussion 考察 4)
In other words, it is possible that
• the number of ‘units of learning’ for Japanese vocabulary is not as large as generally perceived
• It will also be important for students/teachers to learn/teach
  • the association of different readings (typically On-reading and Kun-reading) of each Kanji → to reduce the burden of learning Japanese vocabulary
    e.g. 入門 /nyuRmoN/: 入る /hairu/ + 門 /moN/
    入る (freq. ranking: 117) is more likely to be learned earlier than 入門 (freq. ranking: 6369) (Matsushita, 2011)
• Without this kind of association, learners have to learn more words separately

6. Conclusion まとめ
• 63% of BCCWJ texts are covered without Kanji (but half of those words are function words)
• To attain 95% coverage, 1000 Kanji are required; however, some important words are not covered by the top 1000 Kanji
• To cover those words, a few hundred further Kanji will be required
• Text coverage in Japanese is generally lower than in English, i.e. Japanese requires more words to be learned
• However, many Japanese words are composed of a limited number of Kanji; therefore, the burden of learning Japanese vocabulary may not be as heavy as expected from the text coverage studies, once the learner knows:
  • the 1000-1500 characters
  • form, meaning and compounding rules of Kanji
  • metaphors of Kanji compounds
  • association of different readings (e.g. On-reading and Kun-reading) of each Kanji

These slides will be uploaded to the site shown below.
「松下言語学習ラボ」 http://www.wa.commufa.jp/~tatsum/
You can find the site on Google with the keywords 松下 (Matsushita) and 言語 (language).

References 引用文献 1)
• Akimoto, M. （秋元美晴）. (2002). よくわかる語彙 [Understanding Vocabulary]. Tokyo: Alc （アルク）.
• Bauer, L., & Nation, P. (1993). Word families. International Journal of Lexicography, 6(4), 253-279.
• Carroll, J. B., Davies, P., & Richman, B. (1971). Word Frequency Book. New York: Houghton Mifflin, Boston American Heritage.
• Chikamatsu, N., Yokoyama, S., Nozaki, H., Long, E., & Fukuda, S. (2000). A Japanese logographic character frequency list for cognitive science research. Behavior Research Methods, Instruments, & Computers, 32(3), 482-500.
• Den, Y. （伝康晴）, Yamada, A. （山田篤）, Ogura, H. （小椋秀樹）, Koiso, H. （小磯花絵）, & Ogiso, T. （小木曽智信）. (2009). UniDic. Version 1.3.11. Downloaded from http://www.tokuteicorpus.jp/dist/

References 引用文献 2)
• Harlan, P. （パトリック・ハーラン）. (2011). ゼロからの日本語学習と僕の好きな日本のカルチャー [Learning Japanese from zero, and the Japanese culture I like]. Cited from http://www.wochikochi.jp/topstory/2011/04/packun.php
• Hu, M. H., & Nation, P. (2000). Vocabulary density and reading comprehension. Reading in a Foreign Language, 13(1), 403-430.
• Komori, K. （小森和子）, Mikuni, J. （三國純子）, & Kondo, A. （近藤安月子）. (2004). 文章理解を促進する語彙知識の量的側面 ―既知語率の閾値探索の試み― [What percentage of known words in a text facilitates reading comprehension: a case study for exploration of the threshold of known words coverage]. 日本語教育 [Teaching Japanese as a Foreign Language], 125, 83-92.

References 引用文献 3)
• Matsushita, T. （松下達彦）. (2011). 日本語を読むための語彙データベース [The Database for Reading Japanese]. Downloaded from http://www.geocities.jp/tatsum2003/
• Nation, I. S. P. (2006). How large a vocabulary is needed for reading and listening? Canadian Modern Language Review, 63(1), 59-82.
• NINJAL: The National Institute for Japanese Language （国立国語研究所）. (1962). 現代雑誌90種の用字・用語 第一分冊 総記および語彙表 [Vocabulary and Chinese characters in ninety magazines of today: (Volume I) General description & vocabulary frequency tables]. Tokyo: Shuuei Shuppan （秀英出版）.

References 引用文献 4)
• NINJAL: The National Institute for Japanese Language （国立国語研究所）. (2006). 現代雑誌200万字言語調査語彙表 [The vocabulary lists from the language survey of contemporary magazines with two million running characters]. Downloaded from http://www.kokken.go.jp/katsudo/seika/goityosa/index.html
• Tamamura, F. （玉村文郎）. (1984). 語彙の研究と教育（上） [Vocabulary research and education (Vol. 1)]. Tokyo: The National Institute for Japanese Language （国立国語研究所）.
• Zipf, G. (1949). Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. New York: Hafner.
