290 likes | 413 Views
REFERENTIAL CHOICE AS A PROBABILISTIC MULTI-FACTORIAL PROCESS. Andrej A. Kibrik , Grigorij B. Dobrov , Natalia V. Loukachevitch , Dmitrij A. Zalmanov aakibrik@gmail.com. Referential choice in discourse.
E N D
REFERENTIAL CHOICE AS A PROBABILISTIC MULTI-FACTORIAL PROCESS Andrej A. Kibrik, Grigorij B. Dobrov, Natalia V. Loukachevitch, Dmitrij A. Zalmanov aakibrik@gmail.com
Referential choice in discourse • When a speaker needs to mention (or refer to) a specific, definite referent, s/he chooses between several options, including: • Full noun phrase (NP) • Proper name (e.g. Pushkin) • Common noun (with modifiers) = definite description (e.g. the poet) • Reduced NP, particularly a third person pronoun (e.g. he) 2 2
anaphors Example Full NP antecedent coreference Pronoun • Tandy said consumer electronics sales at itsRadio Shack stores have been slow, partly because a lack of hot, new products. Radio Shack continues to be lackluster, said Dennis Telzrow, analyst with Eppler, Guerin Turner in Dallas. He said Tandy has done <...> • How is this choice made?
Why is this important? • Reference is among the most basic cognitive operations performed by language users • It is the linguistic representation of what is known as attention and working memory in psychology • Reference constitutes a lion’s share of all information in natural communication • Consider text manipulation according to the method of Biber et al. 1999: 230-232
Referential expressions marked in green • Tandy said consumer electronics sales at its Radio Shack stores have been slow, partly because a lack of hot, new products. Radio Shack continues to be lackluster, said Dennis Telzrow, analyst with Eppler, Guerin Turner in Dallas. He said Tandy has done <...>
Referential expressions removed • Tandy said consumer electronics sales at its Radio Shack stores have been slow, partly because a lack of hot, new products. Radio Shack continues to be lackluster, said Dennis Telzrow, analyst with Eppler, Guerin Turner in Dallas. He said Tandy has done <...>
Referential expressions kept • Tandysaid consumer electronics salesat its Radio Shack storeshave been slow, partly because a lackof hot, new products. Radio Shackcontinues to be lackluster, saidDennis Telzrow, analyst with Eppler, Guerin Turner in Dallas. HesaidTandyhas done <...>
Plan of talk I. Referential choice as a multi-factorial process II. The RefRhet corpus and the machine learning-based approach III. The probabilistic character of referential choice 8 8
Multi-factorial character of referential choice • Many factors of referential choice • Distance to antecedent • Along the linear discourse structure • Along the hierarchical discourse structure • Antecedent role • Referent animacy • Protagonisthood ......................................... • None of these factors alone can explain referential choice 9 9
Factors integration • At every poing in discourse factors are somehow summed and give rise to an integral characterization – the referent’s activation score • Activation score is the referent’s status with respect to the speaker’s working memory • Activation score predetermines referential choice • Low full NP • Medium full or reduced NP • High reduced NP 10 10
Multi-factorial model of referential choice (Kibrik 1999) Various properties of the referent or discourse context Referent’s activation score Referential choice Activation factors 11 11
Modeling multi-factorialprocesses: machine learning-based methods • Neural networks approach(Gruening and Kibrik 2005) • Machine learning algorithm • Automatic selection of factors’ weights • Automatic reduction of the number of factors («pruning») • However: • Small data set • Single method of machine learning • Low interpretability of results • Hence a new study • Large corpus • Implementation of several machine learning methods • Statistical model of referential choice 12 12
TheRefRhet corpus • English • Business prose • Initial material – the RST Discourse Treebank • Annotated for hierarchical discourse structure • 385 articles from Wall Street Journal • The added component – referential annotation • The RefRhet corpus • Over 30 000 referential expressions 13 13
Scheme of referential annotation • The ММАХ2 program • Krasavina and Chiarcos 2007 • All markables are annotated, including: • Referential expressions • Their antecedents • Coreference relations are annotated • Features of referents and context are annotated that can potentially be factors of referential choice 15 15
16 16
Work on referential annotation • O. Krasavina • A. Antonova • D. Zalmanov • A. Linnik • M. Khudyakova • Students of the Department of Theoretical and Applied Linguistics, MSU 17 17
Current state of the RefRhet referential annotation • 2/3 completed • Further results are based on the following data: • 247 texts • 110 thousand words • 26 024 markables • 7097 proper names • 8560 definite descriptions • 1797 third person pronouns • 3756 reliable pairs «anaphor – antecedent» • Proper names —1623 (43%) • Definite descriptions — 971 (26%) • Pronouns — 1162 (31%) 18 18
Factors of referential choice • Properties of the referent: • Animacy • Protagonisthood • Properties of the antecedent: • Type of syntactic phrase (phrase_type) • Grammatical role (gramm_role) • Form of referential expression (np_form, def_np_form) • Whether it belongs to direct speech or not (dir_speech) 19 19
Factors of referential choice • Properties of the anaphor: • First vs. nonfirst mention in discourse (referentiality) • Type of syntactic phrase (phrase_type) • Grammatical role (gramm_role) • Whether it belongs to direct speech or not (dir_speech) • Distance between the anaphor and the antecedent: • Distance in words • Distance in markables • Linear distance in clauses • Hierarchical distance in elementary discourse units 20 20
Goals for the machine learning-base study • Dependent variable: • Form of referential expression (np_form) • Binary prediction: • Full NP vs. pronouns • Three-way prediction: • Definite description vs. proper name vs. pronoun • Accuracy maximization: • Ratio of correct predictions to the overall number of instances 21 21
Machine learning methods (Weka, a data mining system) • Easily interpretable methods: • Logical algorithms • Decision trees (C4.5) • Decision rules (JRip) • Higher quality: • Logistic regression • Quality control – the cross-validation method 22 22
Examples of decision rules generated by the JRip algorithm • (Antecedent’s grammatical role = subject) & (Hierarchical distance ≤ 1.5) & (Distance in words ≤ 7) => pronoun • (Animate) & (Distance in markables ≥ 2) & (Distance in words ≤ 11) => pronoun 23 23
Main results • Accuracy • Binary prediction: • logistic regression –86.1% • logical algorithms – 85% • Three-way prediction: • logistic regression –74% • logical algorithms – 72% 24 24
According toKibrik 1999 Referential choice is a probabilistic process 26
Probabilistic character of referential choice in the RefRhet study • Prediction of referential choice cannot be fully deterministic • There is a class of instances in which referential choice is random • It is important to tune up the model so that it could process such instances in a special manner • We are working on this problem • Logistic regression generates estimates of probability for each referential option • This estimate of probability can be interpreted as the activation score from the cognitive model 27
Probabilistic multi-factorial model of referential choice Various properties of the referent or discourse context Activation score = probability of using a certain referential expression Referential choice Activation factors 28 28
Conclusions about the RefRhet study • Quantity: Large corpus of referential expressions • Quality: A high level of accurate prediction is already attained • And this is not the limit • Theoretical significance: the following fundamental properties of referential choice are addressed: • Multi-factorial character of referential choice • Probabilistic character of referential choice • This approach can be applied to a wide range of linguistic and other behavioral choices 29