

  1. Analyzing a Japanese Reading Text as a Vocabulary Learning Resource by Lexical Profiling and Indices Tatsuhiko Matsushita (松下達彦) PhD candidate Victoria University of Wellington tatsuma2010@yahoo.co.jp The First Extensive Reading World Congress 4 September, Kyoto Sangyo University

2. Motive • How can we control the vocabulary of a reading text to maximize its vocabulary learning effect? • Too easy: few words left to learn • Too many unknown words: no learning or successful inference Goal • To show methods for assessing a (Japanese) reading text as a vocabulary learning resource by exploiting lexical profiling and indices

3. Conclusion = Main Points The simplest way to rewrite a reading text (of 2,000 words or fewer) into a better resource for vocabulary learning: • delete the one-timers (or the words occurring fewer times than the set level) at the lowest frequency level in the text, or • make them occur more often in the text by adding tokens of them or by replacing other words with them  The index (LEPIX) figure will improve: deleting one-timers shrinks the total token count without touching the target words, while making them recur turns them into target words • These methods make it possible to predict and compare the efficiency of second language vocabulary learning with a reading text.
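
A worked example with hypothetical numbers (LEPIX itself is defined on slide 9 as T × W × 100 / N): in a 500-token text with T = 5 target types and W = 15 target tokens, LEPIX = 5 × 15 × 100 / 500 = 15.0; deleting 20 one-timer tokens leaves T and W unchanged but makes N = 480, so LEPIX rises to 5 × 15 × 100 / 480 ≈ 15.6.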

4. Similar Previous Ideas and Attempts • Nation & Deweerdt (2001) • Ghadirian (2002) • Cobb (2007) *No integrated index is presented in these previous studies Lexical Profiling • Basically the same idea as Lexical Frequency Profiling (LFP) (Laufer, 1994): “the percentage of words … at different vocabulary frequency levels” (p. 23)

5. The Baseword Lists for Lexical Profiling • VDRJ: Vocabulary Database for Reading Japanese (Matsushita, 2010; 2011) http://www.geocities.jp/tatsum2003/ • All words are ranked by the Usage Coefficient (Juilland & Chang-Rodríguez, 1964): U = Frequency × Dispersion • Three types of word rankings: • For General Learners • For International Students (used for this study) • For General Written Japanese • Japanese Character Frequency List (Matsushita, unpublished) • Built from the same corpus (BCCWJ) as VDRJ *When analyzing Japanese texts, it is necessary to set a certain level of known characters (kanji) as well as vocabulary
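
To make the ranking criterion concrete, here is a minimal Python sketch of Juilland's usage coefficient. The subcorpus counts below are hypothetical, and VDRJ's exact normalization may differ; this only illustrates the U = Frequency × Dispersion relation named on the slide.

```python
import math

def juilland_usage(subcorpus_freqs):
    """Juilland's U = F * D, with D = 1 - (sd/mean) / sqrt(n - 1).

    subcorpus_freqs: occurrence counts of one word in n equal-sized
    subcorpora (hypothetical data; VDRJ derives its counts from BCCWJ).
    """
    n = len(subcorpus_freqs)
    total = sum(subcorpus_freqs)
    if total == 0 or n < 2:
        return 0.0
    mean = total / n
    sd = math.sqrt(sum((f - mean) ** 2 for f in subcorpus_freqs) / n)
    dispersion = max(1 - (sd / mean) / math.sqrt(n - 1), 0.0)
    return total * dispersion

# Evenly spread words outrank bursty ones with the same total frequency:
print(juilland_usage([10, 10, 10, 10]))  # D = 1 -> U = 40.0
print(juilland_usage([40, 0, 0, 0]))     # D = 0 -> U = 0.0
```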

6. Assumptions I. Required Level of Text Coverage • Words assumed known to the reader must reach a certain coverage level of the text (e.g. Hu & Nation, 2000) II. Minimum Occurrences of Target Words • Among the words assumed unknown, those which occur at least a certain number of times can serve as the learning target words (e.g. Waring & Takaki, 2003)

7. III. More Types of Target Words • A text in which more types of target words occur is a better text as a vocabulary learning resource IV. Density of Target Words (%) • A text in which the target words occur at a higher ratio (density) is a better text as a vocabulary learning resource

8. Methods The main software: AntWordProfiler Ver. 1.200W (Anthony, 2009) • To identify the lexical level of the text by lexical profiling, set the threshold level of (assumed) known words. In this study, the levels are: • 98% for an extensive reading text: Lexical Level of Text (LLT98) • 95% for an instructional material: Lexical Level of Text (LLT95) (Hu & Nation, 2000) • To identify the target words, set the minimum occurrences of target words. *6-10 occurrences are required for learning a word incidentally through reading (e.g. Waring & Takaki, 2003); however, a word is not learned from a single short text, so a lower level is set here (*the set occurrence level will depend on the text length): • twice or more for an extensive reading text • twice for a short instructional material A sketch of this identification step is given below.
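
The slides run this step with AntWordProfiler; purely as an illustration, a minimal Python sketch of the two thresholds (coverage of assumed-known words, minimum occurrences for targets) might look as follows. It assumes the text is already tokenized and lemmatized (for Japanese, by some morphological analyzer), and known_words is a hypothetical baseword set, not the VDRJ file format.

```python
from collections import Counter

def profile(tokens, known_words, min_occurrences=2):
    """Return the known-word coverage (%) and the target words of a text.

    tokens: running words of the text, already tokenized/lemmatized
    known_words: set of words assumed known (e.g. top VDRJ levels)
    min_occurrences: times an unknown word must occur to be a target
    """
    n = len(tokens)
    coverage = 100 * sum(1 for t in tokens if t in known_words) / n
    unknown = Counter(t for t in tokens if t not in known_words)
    targets = {w: c for w, c in unknown.items() if c >= min_occurrences}
    return coverage, targets

# Toy case: 19 known tokens out of 20 gives the 95% instructional level;
# the single unknown one-timer does not qualify as a target word.
cov, targets = profile(["語彙"] * 19 + ["指数"], known_words={"語彙"})
print(cov, targets)  # 95.0 {}
```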

9. Lexical Learning Possibility Index for a Reading Text (LEPIX) • Count T, the number of types of the target words • Calculate (W × 100) / N, where W is the number of tokens of the target words and N is the total number of tokens of the text • Simply multiply the factors of III & IV: LEPIX = T × (W × 100) / N = (T × W × 100) / N
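
Combining the thresholds with the formula, a self-contained sketch (hypothetical tokens and known-word set, not the author's tooling):

```python
from collections import Counter

def lepix(tokens, known_words, min_occurrences=2):
    """LEPIX = (T * W * 100) / N, as defined on slide 9.

    T: number of types of target words
    W: number of tokens of target words
    N: total number of running words in the text
    """
    counts = Counter(t for t in tokens if t not in known_words)
    targets = {w: c for w, c in counts.items() if c >= min_occurrences}
    t_types = len(targets)            # T
    w_tokens = sum(targets.values())  # W
    return t_types * w_tokens * 100 / len(tokens)

# Toy example: 20 tokens, two unknown types occurring twice each.
tokens = ["a"] * 8 + ["b"] * 8 + ["x", "y"] * 2
print(lepix(tokens, known_words={"a", "b"}))  # T=2, W=4, N=20 -> 40.0
```

Note how the conclusion's rewriting advice falls out of the formula: deleting a one-timer lowers N while leaving T and W unchanged, so the LEPIX figure rises.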

  10. Sample Text (original) 人知のシミュレーションが人工知能だとすれば、コンピュータのなかに「知をあつかうメカニズム」を作り込まなければならない。  ところでコンピュータとは、要するに〈記号処理マシン〉である。だからこの場合の〈知〉とは、「記号で表された知」ということになる。記号といっても色々あるが、人工知能が得意なのは、いわゆる言語記号である。たとえば、「今は五月だ」「五月は春だ」「楓の葉は、春と夏には緑色、秋には赤色である」などというのがその守備範囲ということになる。  ところでこういった例は、少しばかり興ざめではなかろうか? というのは、〈知〉とは、単なる知識の断片ではなく、それらを包括し、横断しながら世界に光を当てていく精神のダイナミズムのように思えるからである。〈知〉はイマジネーションの能力を持たなければならない。さらに〈知〉は、スポーツのような身体の所作にうめこまれている、明言化されない暗黙知の領域をもカバーしなければならない。それこそが、知の知たるゆえんではないだろうか?  残念ながら、現在の人工知能技術は、この期待に応えるすべを知らない。それはいまだに、図像さえ自由自在には扱えないのである。英語や日本語などの〈自然言語〉を操作するだけでも四苦八苦なのである。 (出典:西垣 通『秘術としてのAI思考』)

  11. Sample Text (modified) 人間の頭脳を模倣して作ったものが人工知能だとすれば、コンピュータの中に「知をあつかうメカニズム」をていねいに作っていかなければならない。しかしそこへの道はまだ程遠い。 コンピュータとは、要するに〈記号処理のメカニズム〉である。だからこの場合の知とは、「記号で表された知」ということになる。記号といってもいろいろあるが、人工知能が得意なのは、いわゆる言語記号である。例えば、「今は五月だ」「五月は春だ」「カエデの葉は、春と夏には緑、秋には赤である」などという人工言語的表現は処理しやすいのである。 しかし、こういった例は、少しばかりつまらないのではないだろうか? というのは、知とは、一つ一つの知識がバラバラに存在するのではなく、それらを一つにまとめたり、横断したりしながら、世界に光を当てていく精神の力強い働きのように思えるからである。知は想像力を持たなければならない。さらに知は、スポーツのような身体の動きの中にある、はっきりとした言葉にならない知の領域もカバーしなければならない。カエデといえば私たちが紅葉を見て感じる気持ちまで横断的にカバーしなければならないのだ。それこそが、知を知として成り立たせているものではないだろうか。 残念ながら、現在の人工知能技術は、この期待に応えるすべを知らない。人間の頭脳の模倣にはまだ程遠いレベルだ。英語や日本語などの〈自然言語〉を操作するだけでも非常に苦労しているのである。

12. Treatment for Low-Frequency Words *Check the level of the characters (kanji) as well and avoid low-frequency ones; a sketch of such a check follows below.
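
As an illustration of the character-level check, a small sketch; kanji_rank stands in for a character frequency list such as Matsushita's unpublished one, and the rank cutoff of 1000 is an arbitrary placeholder.

```python
def rare_kanji(text, kanji_rank, max_rank=1000):
    """Return the kanji in the text whose frequency rank exceeds the cutoff."""
    return sorted({ch for ch in text
                   if "\u4e00" <= ch <= "\u9fff"  # CJK unified ideographs
                   and kanji_rank.get(ch, 10 ** 9) > max_rank})

# Flagged characters are candidates for rewriting in kana or with a
# higher-frequency synonym, as in the modified sample text on slide 11.
```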

  13. Comparison between the Original and the Modified Texts

14. For Learning Domain-Specific Words • The target domain is set first • The domain-specific words in the text are identified by checking them against a list of domain-specific words • The levels of the identified domain-specific words are checked by lexical profiling to see how many unknown domain-specific words the text contains • The indices are calculated (a sketch follows below)
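
Under the same assumptions as the earlier sketches (pre-tokenized text, hypothetical word lists), the domain step only adds one more filter:

```python
from collections import Counter

def domain_targets(tokens, domain_words, known_words, min_occurrences=2):
    """Target words restricted to a given domain-specific word list."""
    counts = Counter(t for t in tokens
                     if t in domain_words and t not in known_words)
    return {w: c for w, c in counts.items() if c >= min_occurrences}
```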

  15. More Examples of Analysis

16. How does the text length affect LEPIX? [Charts: total number of tokens/types and LEPIX for texts of 500-4,000 running words, and for texts of 1,000-2,000 running words] • LEPIX figures cannot be compared when one text is more than twice as long as the other.

17. Remaining Issues • If a repeatedly used essential key word in the text is at the lowest frequency level, the index does not work well  there are solutions for this, but they make the procedure/calculation more complicated • The minimum occurrence level of target words will differ according to the text length: twice will be enough for a short instructional text, but the right level for a longer extensive reading text is not clear • The indices still need to be validated through empirical study

18. Conclusion = Main Points The simplest way to rewrite a reading text (of 2,000 words or fewer) into a better resource for vocabulary learning: • delete the one-timers (or the words occurring fewer times than the set level) at the lowest frequency level in the text, or • make them occur more often in the text by adding tokens of them or by replacing other words with them  The index (LEPIX) figure will improve: deleting one-timers shrinks the total token count without touching the target words, while making them recur turns them into target words • These methods make it possible to predict and compare the efficiency of second language vocabulary learning with a reading text.

19. References
Anthony, L. (2009). AntWordProfiler 1.200w [Computer software]. Downloaded from http://www.antlab.sci.waseda.ac.jp/software.html
Cobb, T. (2007). Computing the vocabulary demands of L2 reading. Language Learning and Technology, 11(3), 38-63.
Ghadirian, S. (2002). Providing controlled exposure to target vocabulary through the screening and arranging of texts. Language Learning and Technology, 6(1), 147-164.
Hu, M., & Nation, I. S. P. (2000). Unknown vocabulary density and reading comprehension. Reading in a Foreign Language, 13(1), 403-430.
Juilland, A., & Chang-Rodríguez, E. (1964). Frequency dictionary of Spanish words. London: Mouton & Co.
Laufer, B. (1994). The lexical profile of second language writing: Does it change over time? RELC Journal, 25(2), 21-33.
Matsushita, T. (松下達彦). (2010). 日本語を読むために必要な語彙とは?-書籍とインターネットの大規模コーパスに基づく語彙リストの作成- [What words are essential to read Japanese? Making word lists from a large corpus of books and internet forum sites]. 2010年度日本語教育学会春季大会予稿集 [Proceedings of the Conference of the Society for Teaching Japanese as a Foreign Language, Spring 2010], 335-336.
Matsushita, T. (松下達彦). (2011). 日本語を読むための語彙データベース (The Vocabulary Database for Reading Japanese) Ver. 1.01. Downloaded from http://www.geocities.jp/tatsum2003/
Nation, I. S. P., & Deweerdt, J. (2001). A defence of simplification. Prospect, 16(3), 55-67.
Waring, R., & Takaki, M. (2003). At what rate do learners learn and retain new vocabulary from reading a graded reader? Reading in a Foreign Language, 15(2), 130-163.
