Zero Pronoun Identification in Japanese Corpus

Zero pronoun identification in Japanese corpus OYA, Masanori 大矢　政徳 NCLT, School of Computing Dublin City University NCLT Seminar Series

Outline • Motivation • Zero pronouns; what are they? • Zero pronoun identification; related work • Methods of zero pronoun identification • Experiments • Results • Conclusion

Motivation • Japanese uses zero pronouns quite often in both written and spoken language. • Zero pronoun identification (and resolution) is an important issue in Japanese NLP. • Zero pronoun identification is required for long-distance dependency (LDD) resolution and for automatic case frame extraction from a corpus.

Motivation • Not all the sentences in the Kyoto Text Corpus (KTC) are annotated with zero pronouns. • The Kurohashi-Nagao Parser (KNP), a dependency parser for Japanese, does not parse zero pronouns. • Both need zero pronoun identification to become more accurate.

Zero pronouns; what are they?
• Also called “pro drop” or “ellipsis”
• Pronouns without sound
• There are several types of zero pronouns:
  • with an intrasentential antecedent
  • with an intersentential antecedent
  • with a definite antecedent
  • with no definite antecedent

Zero pronouns; what are they?
• Core grammatical functions in Japanese are represented by case particles on an NP:
  • An NP with the case particle “-ga” is the SUBJ of the verb on which the NP is dependent.
  • An NP with the case particle “-wo” is the OBJ of the verb on which the NP is dependent.
  • An NP with the case particle “-ni” can be the OBL of the verb on which the NP is dependent (but not always).

Zero pronouns; what are they?
• Zero pronouns with an intrasentential antecedent: the antecedent appears in the same sentence (or clause).
• Topicalization, e.g.:
  この本は私が読んだ。
  kono hon-wa watashi-ga Φ* yon-da
  this book-TOP I-NOM Φ read-PST
  “This book, I read.”
* Φ marks a zero pronoun.

Zero pronouns; what are they?
• Zero pronouns with an intrasentential antecedent: the antecedent appears in the same sentence (or clause).
• Relative clauses, e.g.:
  私が読んだ本
  [np[rel watashi-ga Φ yon-da]* hon]
  [np[rel I-NOM Φ read-PST] book]
  “The book (which) I read”
* Japanese does not have relative pronouns.

Zero pronouns; what are they?
• Zero pronouns with intersentential antecedents: the antecedent appears in the previous sentence, or the zero pronoun refers to one of the interlocutors, e.g.:
  Speaker A: この本、読んだ？
  Speaker B: うん、読んだよ。

Zero pronouns; what are they?
• Zero pronouns with intersentential antecedents, e.g.:
  Speaker A: Φ1 kono hon yon-da?
             Φ1 this book read-PST
             “Have you read this book?”
  Speaker B: Un, Φ1 Φ2 yon-da-yo.
             Yes, Φ1 Φ2 read-PST-end
             “Yes, I did.”
• Φ1 refers to Speaker B, Φ2 to “the book”.

Zero pronouns; what are they?
• Zero pronouns with no definite antecedent, e.g.:
  ダブリンに来て一年になる。
  Φ1 Daburin-ni ki-te Φ? ichinen-ni naru.
  Φ1 Dublin-OBL come-and Φ? one year-OBL become
  “One year has passed since I came to Dublin.”
• Φ1 refers to someone; its referent is intersentential.
• Φ? refers to the length of the speaker’s stay in Dublin, not to the speaker. A literal translation would be: “I came to Dublin, and the length of my stay in Dublin has become one year.”

Zero pronoun identification
• Related work:
  • Yamamoto et al. (1997, 1998): based on a decision-tree model
  • Seki et al. (2001, 2002): based on a probabilistic model
  • Kawahara et al. (2003): based on automatically extracted large-scale case frames
  • Johnson (2002): based on a pattern-matching algorithm (for English)

Zero pronoun identification • Among others, Seki et al. (2001, 2002) point out that zero pronouns must be identified before they can be resolved, for example in LDD resolution.

Zero pronoun identification
• Zero pronouns must be identified before resolution, e.g.:
  私が読んだ本
  [np[rel watashi-ga Φ yon-da] hon]
  [np[rel I-NOM Φ read-PST] book]
  “the book which I read Φ”

Zero pronoun identification
• Zero pronouns must be identified before resolution, e.g.:
  読んだ本
  [np[rel Φ1 Φ2 yon-da] hon]
  [np[rel Φ1 Φ2 read-PST] book]
  “The book which Φ1 read Φ2”

Zero pronoun identification • In the previous example, both zero pronouns must be identified before the antecedent of each can be resolved. • If a system fails to identify one or both of the zero pronouns, it also fails to resolve the corresponding antecedent.

Zero pronoun identification
• Seki et al. (2001, 2002) also stress the importance of taking different types of zero pronouns into consideration in order to improve the performance of zero pronoun resolution, e.g.:
  読んだ本
  [np[rel Φ1 Φ2 yon-da] hon]
  [np[rel Φ1 Φ2 read-PST] book]
  “The book which Φ1 read Φ2”
• One of the pros refers to “the book”, the other to someone who reads the book.

Zero pronoun identification
• To identify zero pronouns, Seki et al. (2001, 2002) and Kawahara et al. (2003) use case-frame dictionaries, e.g.:
  “yomu” is registered as a transitive verb in a case-frame dictionary
  → If the verb appears in a corpus without the subject, the object, or both, then the appropriate zero pronoun(s) are inserted.

Zero pronoun identification
  [np[rel Φ1 Φ2 yon-da] hon]
  [np[rel Φ1 Φ2 read-PST] book]
  ↓
  [np[rel pro-SUBJ pro-OBJ yon-da] hon]
  [np[rel pro-SUBJ pro-OBJ read-PST] book]
• The grammatical functions of these pros are now specified; neither LDD nor non-local dependencies are resolved yet.
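The dictionary-based insertion step can be illustrated with a minimal Python sketch. The dictionary contents, the function name `insert_pros` and the fallback for unknown verbs are assumptions made for illustration; they are not taken from the cited systems.

```python
# Hypothetical mini case-frame dictionary: verb lemma -> required grammatical functions.
CASE_FRAMES = {
    "yomu": ("SUBJ", "OBJ"),  # "read", registered as a transitive verb
}

def insert_pros(verb, overt_functions):
    """Return the zero pronouns (pros) needed to complete the verb's case frame."""
    # Assumption: a verb missing from the dictionary only requires a subject.
    frame = CASE_FRAMES.get(verb, ("SUBJ",))
    return [f"pro-{gf}" for gf in frame if gf not in overt_functions]

# 読んだ本: no overt argument inside the relative clause
print(insert_pros("yomu", set()))      # ['pro-SUBJ', 'pro-OBJ']
# 私が読んだ本: the subject is overt, so only the object is missing
print(insert_pros("yomu", {"SUBJ"}))   # ['pro-OBJ']
```

As on the slide, this only specifies the grammatical function of each pro; it does not say what the pro refers to.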

Zero pronoun identification • Problem with using a case-frame dictionary for zero pronoun identification: • If a verb has more than one case frame, it is sometimes uncertain which case frame applies to a given instance of the verb.

Zero pronoun identification
e.g.:
  将来を考える
  shorai-wo kangaeru
  future-ACC think
  “to think about the future”

  将来は明るいと考える
  [shorai-wa akarui]-to kangaeru
  [future-TOP bright]-COMP think
  “to think that the future is bright”

Zero pronoun identification
• What if the verb appears without OBJ or COMP? e.g.:
  私は考える
  watashi-wa kangaeru
  I-TOP think
  “I think”
• Is it just an intransitive use, or is there an OBJ pro?

Zero pronoun identification
• There are four possible solutions to this problem:
  • Add pro in all cases where SUBJ, OBJ, OBL, or all of them are missing (a simplistic method)
  • Zero pronoun identification using verbal morphology (a morphological method)
  • Zero pronoun identification with a case-frame dictionary that includes probabilistic information on each case frame of a verb (a probabilistic method)
  • Zero pronoun identification combining the morphological and probabilistic methods (a mixed method)

Method of zero pronoun identification 1. Zero pronoun identification using verbal morphology for OBJ • Some Japanese verbs form transitive-intransitive pairs, which are distinguished by verbal morphology. • If no NP with the case particle “-wo” is dependent on a verb that has transitive morphology, then the verb has an OBJ zero pronoun.

Method of zero pronoun identification
• Types of transitive-intransitive pair:
  • intr. root-aru / trans. root-eru
  • intr. root-u / trans. root-eru
  • intr. root-eru / trans. root-u
  • intr. root-iru / trans. root-osu
  • intr. root-eru / trans. root-yasu
  • intr. root-eru / trans. root-asu
  • intr. root-reru / trans. root-su
  • intr. root-eru / trans. root-su
  • irregular, e.g. きえる kieru / けす kesu: “extinguish”
• The suffix “-su” marks that the verb can take a “-wo”-marked NP.

Method of zero pronoun identification • In the same list of pair types, the suffix “-eru” is ambiguous: it appears on transitive verbs in some pairs and on intransitive verbs in others.

Method of zero pronoun identification • Extraction of transitive verbs based on their morphology: • Verbs with “-su” ending are automatically extracted • Other ambiguous verbs are manually extracted
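A minimal Python sketch of the morphological check for OBJ follows. It assumes romanized dictionary forms and a list of particle names for the NPs dependent on the verb; the function names and the single entry in `MANUAL_TRANSITIVES` are hypothetical and only stand in for the manually extracted list.

```python
# Hypothetical stand-in for the manually extracted list of transitive verbs
# whose ending is ambiguous (e.g. "-eru").
MANUAL_TRANSITIVES = {"akeru"}  # "open (something)", illustrative entry

def is_morphologically_transitive(lemma):
    """Treat '-su' verbs as transitive; other verbs must be in the manual list.

    Note that '-suru' verbs end in '-ru', so they are not matched by the '-su' test.
    """
    return lemma.endswith("su") or lemma in MANUAL_TRANSITIVES

def needs_pro_obj(lemma, dependent_particles):
    """True if the verb has transitive morphology but no '-wo' NP depends on it."""
    return is_morphologically_transitive(lemma) and "wo" not in dependent_particles

print(needs_pro_obj("kesu", ["ga"]))         # True: transitive '-su' verb, no '-wo' NP
print(needs_pro_obj("kesu", ["ga", "wo"]))   # False: the object is overt
print(needs_pro_obj("kieru", ["ga"]))        # False: '-eru' verb not in the manual list
```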

Verb types in KTC: 3506 (transitive verb types extracted so far: 1286)
• root-su: 507 types, all transitive
• root-eru: 1166 types
  • transitive with “-aru” intransitive: 71 types
  • transitive with “-u” intransitive: 176 types
  • intransitive with “-u” transitive: 121 types
  • transitive without intransitive: 292 types* (and more)
  • intransitive without transitive: ?
• root-(! -su || ! -eru), i.e. other endings: 1833 types
  • “-suru”: 140 types; 58 transitives (extracted manually)
  • “-ru”: 123 types; 61 transitives (extracted manually)
  • “-u”*: 1570 types; 121 transitives with “-eru” intransitive
* Compound verbs derived from “-eru” transitives with “-u” intransitives; there must be more transitives in this category.

Method of zero pronoun identification 1. Zero pronoun identification using verbal morphology for OBL • Causative, passive and benefactive constructions are represented by verbal morphology. • The causee in a causative, the agent of a passive, and the benefactor of a benefactive are represented by a noun in oblique case, viz. a noun with the particle “-ni”. • If the oblique case noun is missing in a clause whose root verb has the causative, passive or benefactive morphology, then it has an OBL zero pronoun.

Method of zero pronoun identification
e.g.:
  彼女は私に公園を走らせた。
  kanojo-wa watashi-ni kouen-wo hashir-ase-ta
  she-TOP I-OBL park-OBJ run-CAUS-PST
  “She made me run through the park.”

Method of zero pronoun identification
e.g.:
  彼女は公園を走らせた。
  kanojo-wa kouen-wo hashir-ase-ta
  she-TOP park-OBJ run-CAUS-PST
  “She made Φ run through the park.”
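The rule for OBL can be sketched in Python as follows. The morpheme labels and the function name are assumptions for illustration, and the check treats any “-ni” NP as the oblique argument, which is a simplification.

```python
# Hypothetical labels for the morphology that licenses an oblique argument.
OBL_TRIGGERS = {"CAUS", "PASS", "BEN"}

def needs_pro_obl(verb_morphemes, dependent_particles):
    """True if the verb is causative, passive or benefactive and no '-ni' NP depends on it."""
    has_trigger = any(m in OBL_TRIGGERS for m in verb_morphemes)
    return has_trigger and "ni" not in dependent_particles

# kanojo-wa watashi-ni kouen-wo hashir-ase-ta: the causee is overt
print(needs_pro_obl(["hashir", "CAUS", "PST"], ["wa", "ni", "wo"]))  # False
# kanojo-wa kouen-wo hashir-ase-ta: the causee is missing, so pro-OBL is added
print(needs_pro_obl(["hashir", "CAUS", "PST"], ["wa", "wo"]))        # True
```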

Method of zero pronoun identification
• Ambiguity of “-ni”
• “-ni” can also mark temporal or locative adverbials, and some verbs take a “-ni”-marked NP as an argument, e.g.:
  去年の2月にダブリンに来た。
  Φ kyonen-no nigatsu-ni daburin-ni ki-ta
  Φ last year-of February-TEMP Dublin-OBL come-PST
  “Φ came to Dublin in February last year.”

Method of zero pronoun identification 2. Zero pronoun identification with a case-frame dictionary that includes probabilistic information on each case frame of a verb • Currently available case-frame dictionaries do not include such information (except for the Automatically Constructed Case Frames (http://nlp.kuee.kyoto-u.ac.jp/nl-resource/caseframe.html)). • Therefore, (pseudo-)probabilistic case frames were extracted from KTC.

Method of zero pronoun identification
• Transitivity rate (following Seki et al. (2002))
• For each verb, its transitivity rate is calculated as follows:
  transitivity rate = (number of instances of the verb in transitive use*) / (number of tokens of the verb)
* Transitive use of a verb: an NP with the OBJ case particle “-wo” is dependent on the verb.

Method of zero pronoun identification
e.g.: 否定する hitei-suru “deny”
  instances of this verb in transitive use in KTC = 68
  tokens of this verb in KTC = 89
  transitivity rate of this verb = 68 / 89 ≈ 0.76
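The calculation can be reproduced with a short Python sketch. The input format, a list of (verb, has “-wo” dependent) observations, and the function name are assumptions for illustration; only the counts 68 and 89 come from the slide.

```python
from collections import Counter

def transitivity_rates(observations):
    """Compute, per verb, the share of its tokens that have a '-wo' NP dependent."""
    tokens, transitive = Counter(), Counter()
    for verb, has_wo in observations:
        tokens[verb] += 1
        if has_wo:
            transitive[verb] += 1
    return {verb: transitive[verb] / tokens[verb] for verb in tokens}

# 68 transitive uses out of 89 tokens, as reported for hitei-suru in KTC
observations = [("hitei-suru", True)] * 68 + [("hitei-suru", False)] * 21
rates = transitivity_rates(observations)
print(round(rates["hitei-suru"], 2))  # 0.76
```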

Method of zero pronoun identification • Assumption: if a verb with a “high” transitivity rate appears without a “-wo” NP in a sentence in KTC, then it is highly likely that there is a zero pronoun in the sentence; hence pro-OBJ is annotated in the f-structure. The same applies to pro-OBL.
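The assumption amounts to a threshold test, sketched below in Python. The slides do not say what counts as a “high” rate, so the cut-off `HIGH_RATE = 0.7`, the function name and the rate given for “kangaeru” are hypothetical.

```python
HIGH_RATE = 0.7  # hypothetical cut-off; the slides only say "high"

def annotate_pro_obj(verb, has_wo_np, rates, threshold=HIGH_RATE):
    """True if pro-OBJ should be annotated for this occurrence of the verb."""
    # Assumption: a verb with no recorded rate is treated as having rate 0.
    return (not has_wo_np) and rates.get(verb, 0.0) >= threshold

rates = {
    "hitei-suru": 68 / 89,  # rate reported for KTC
    "kangaeru": 0.4,        # made-up rate for illustration
}
print(annotate_pro_obj("hitei-suru", False, rates))  # True
print(annotate_pro_obj("hitei-suru", True, rates))   # False: the object is overt
print(annotate_pro_obj("kangaeru", False, rates))    # False: rate below the cut-off
```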

Experiment 1: zero pronoun identification in KTC
• 500 randomly chosen sentences from one half of the Kyoto Text Corpus are converted into f-structures and manually corrected.
• Zero pronouns are added manually to these Gold Standard f-structures, based on the context in which each sentence appeared in the original text, on verbal morphology, and on A Japanese Lexicon, a hand-coded Japanese case-frame dictionary.
• The Gold Standard sentences in KTC are converted into f-structures, using different methods of zero pronoun identification.
• The Gold Standard f-structures and the f-structures converted from KTC are compared, and precision, recall and f-score are calculated.
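The evaluation measures can be illustrated with a simplified Python sketch. It compares sets of (clause id, grammatical function) pairs standing for zero pronouns, which is an assumption made for illustration; the actual comparison is between f-structures, and the toy data below are made up.

```python
def precision_recall_f(gold, predicted):
    """Precision, recall and balanced f-score over two sets of annotations."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score

# Toy data: (clause id, function of the zero pronoun)
gold = {(1, "SUBJ"), (1, "OBJ"), (2, "SUBJ"), (3, "OBL")}
predicted = {(1, "SUBJ"), (2, "SUBJ"), (2, "OBJ"), (3, "OBL")}
print(precision_recall_f(gold, predicted))  # (0.75, 0.75, 0.75)
```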

Experiment 1: zero pronoun identification in KTC
• The numbers of SUBJ, OBJ and OBL in the Gold Standard f-structures (zero pronouns included):
  • SUBJ 1411
  • OBJ 536
  • OBL 568

Experiment 1: zero pronoun identification in KTC
• The numbers of zero pronouns of each grammatical function in the Gold Standard f-structures:
  • SUBJ 1121 (approx. 79% of all SUBJ)
  • OBJ 122 (approx. 23% of all OBJ)
  • OBL 199 (approx. 35% of all OBL)

Experiment 1: zero pronoun identification in KTC
• The Gold Standard sentences in KTC are converted into f-structures with zero pronoun identification, using the following methods:
  Method 1: Null; no zero pronoun identification
  Method 2*: Simplistic; add pro-OBJ and pro-OBL whenever missing, regardless of the case frame of the verb
  Method 3: Morphological; use the list of verbs whose morphology specifies their transitivity
  Method 4: Probabilistic; use the list of verbs with a high transitivity rate (extracted from the other half of KTC, which contains no Gold Standard sentences)
  Method 5: Mixed; combine Methods 3 and 4 by adding to the list of Method 3 those verbs whose morphology does not specify their transitivity and which have a high transitivity rate
* In all methods except Method 1, pro-SUBJ is added simplistically: since every verb subcategorises for a subject, pro-SUBJ is added to any clause that lacks a subject NP.
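How the verb list of Method 5 could be assembled from the lists of Methods 3 and 4 is sketched below in Python. The function name, the threshold value and the example verbs and rates are hypothetical; only the combination rule follows the description of Method 5.

```python
def mixed_verb_list(morph_transitives, morph_unspecified, rates, threshold=0.7):
    """Method 3 list plus morphologically unspecified verbs with a high transitivity rate."""
    # Assumption: 0.7 stands in for "high"; verbs without a rate are skipped.
    high_rate = {v for v in morph_unspecified if rates.get(v, 0.0) >= threshold}
    return set(morph_transitives) | high_rate

morph_transitives = {"kesu"}                    # transitivity specified by morphology
morph_unspecified = {"hitei-suru", "kangaeru"}  # morphology does not decide
rates = {"hitei-suru": 0.76, "kangaeru": 0.4}   # second value is made up

print(sorted(mixed_verb_list(morph_transitives, morph_unspecified, rates)))
# ['hitei-suru', 'kesu']
```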

Results • Method 2 yields higher results for SUBJ pro than the other two.

Results [Charts: Precision, Recall, F-score] • Method 3 yields better results for OBJ than for OBL, due to the lower frequency of passive, causative and benefactive constructions. • The lower recall for OBJ in Method 4 reflects the small size of the list extracted from KTC; extraction from a larger amount of text might improve the result.

Results [Charts: Precision, Recall, F-score] • Method 5 yields the best pred-only f-score so far. • From all the results, it seems that Method 3 is appropriate for OBJ zero pronoun identification. • The results of zero pronoun identification for OBL are lower than those for OBJ, because of the ambiguity of “-ni”-marked NPs.

Experiment 2: zero pronoun identification in the parser output • Strip the dependency tags from the 500 Gold Standard sentences and parse them with KNP. • Convert the KNP output into f-structures, and identify the zero pronouns in them using the same four methods. • Compare the Gold Standard f-structures with the f-structures of the parsed output with zero pronoun identification.

Results [Charts: Precision, Recall, F-score]

Results • Method 5 yields the best pred-only f-score. • Parser output itself must be improved before zero pronoun identification.

Conclusion • Zero pronouns of different grammatical functions can be identified effectively by different methods. • The result will be used for LDD (or non-local dependency) resolution for Japanese. • The methods can be applied to automatic extraction of case frames from a large Japanese corpus.

References
Ikehara, Satoru et al. 1999. Bunrui Goi Taikei, "A Japanese Lexicon". NTT Communication Science Laboratories, Iwanami shoten.
Johnson, Mark. 2002. A simple pattern-matching algorithm for recovering empty nodes and their antecedents. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 136-143.
Kawahara, Daisuke et al. 2002. Construction of a Japanese Relevance-tagged Corpus. In Proceedings of the 3rd International Conference on Language Resources and Evaluation, pp. 2008-2013.
Kawahara, Daisuke et al. 2002. Yogen to chokuzen no kakuyouso no kumi wo tan-i to suru kaku fureemu no jidou kouchiku, "Automatic construction of case frames using the pairs of a predicate and the previous case element as one unit". Journal of Natural Language Processing, vol. 9, no. 1, pp. 3-19.
Kawahara, Daisuke et al. 2003. Jidou kouchiku shita kaku fureemu jisho ni motoduku shoryaku kaiseki no daikibo hyouka, "Large-scale evaluation of ellipsis analysis based on an automatically constructed case frame dictionary". In Proceedings of the 9th Conference of the Association of Natural Language Processing, pp. 589-592.
Kurohashi, Sadao and Makoto Nagao. 1997. Kyoto daigaku text corpus project. In Proceedings of the 3rd Conference of the Association of Natural Language Processing, 115, 118.
Kurohashi, Sadao and Makoto Nagao. 1998. Building a Japanese Parsed Corpus while Improving the Parsing System. In Proceedings of the 1st International Conference on Language Resources and Evaluation, pp. 719-724.
Seki, Kazuhiro et al. 2001. Kakuritu moderu ni motoduku nihongo zero daimeishi no shouou kaishou, "Zero pronoun resolution based on a probabilistic model". In Proceedings of the 7th Conference of the Association of Natural Language Processing, pp. 510-513.
Seki, Kazuhiro et al. 2002. Zero daimeishi no kenshutu to hokan wo tougou shita kakuritu teki shouhou kaishou moderu, "A probabilistic model for zero pronoun resolution with integration of zero pronoun identification and complementation". In Proceedings of the 8th Conference of the Association of Natural Language Processing, pp. 591-594.
Yamamoto, Kazuhide et al. 1997. Ketteigi wo mochiita nihongo zero daimeishi hokan, "Zero pronoun complementation for Japanese using decision trees". The 55th Conference of Information Processing Society of Japan.
Yamamoto, Kazuhide et al. 1998. Ketteigi ni yoru nihongo zero daimeishi hokan no seinou hyouka, "Performance evaluation of zero pronoun complementation for Japanese using decision trees". In Proceedings of the 4th Conference of the Association of Natural Language Processing, pp. 19-22.
