Towards a model of formal and informal address in English • Manaal Faruqui, Language Technologies Institute, CMU (work done at IIT Kharagpur, India) • Sebastian Padó, Univ. of Heidelberg, Germany
Formal and informal address • Most languages distinguish formal (V) and informal (T) address in direct speech (Brown & Gilman 1960) • Formal address: neutrality, distance, used for “superordinates” • Informal address: used for friends, “subordinates” • Variety of realizations across languages • Frequently pronoun choice (French vous/tu, German Sie/du) • Verbal inflection (e.g. Japanese)
T/V and English • Contemporary English is conspicuous by not realizing the T/V distinction • Pronoun “you” is both formal and informal • No differences in verbal inflection • Does English really differ in such a fundamental way from virtually all other related languages?
Main goals of this work • Goal 1: Determine whether English distinguishes V and T consistently, but using different indicators • If yes, what are these indicators? • Goal 2: Develop a computational model that labels English sentences as T or V • Ideally without spending effort on annotation
Methodology • Use a parallel corpus to analyze aligned sentences with overt (German) T/V choice and covert English T/V choice • For Goal 1: Compare German and English address • For Goal 2: Project German labels onto English sentences
Digression: Creation of a parallel corpus • Current parallel corpora are not suitable • EUROPARL: overwhelmingly formal (>99%) • Newswire: no dialogue • Creation of a new corpus: English-German literary texts • 106 19th-century novels and stories (Project Gutenberg) • Sentence-aligned: Gargantua (Braune & Fraser 2010) • POS-tagged (Schmid 1994) • German sentences can be labeled as T, V or NONE • Rules for labeling follow on the next slide
Labeling German Pronouns as T/V • Du/du: Singular T • Sie: Singular V (except in utterance-initial position) • sie: Ignored • Third person pronoun (she/they) • ihr: Ignored • Plural T address or archaic sing./plural V address • In principle distinguishable by capitalization, but the corpus contains capitalization errors • Dative form of 3rd person “she” pronoun sie • Neutral wrt T/V
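The labeling rules above can be sketched as a small rule-based function. This is a minimal illustration, not the authors' implementation: the function name, the regex tokenizer, the `BOTH` label and the set of punctuation marks that count as an utterance break are all assumptions of this sketch. Only the nominative forms named on the slide (Du/du, Sie) trigger a label.

```python
import re

# Punctuation after which a capitalised "Sie" may just be an utterance-initial
# third-person pronoun (assumption of this sketch, not from the slides).
UTTERANCE_BREAKS = {".", "!", "?", ":", '"', "„", "“", "”", "»", "«"}


def label_german_sentence(sentence):
    """Return 'T', 'V', 'BOTH' or 'NONE' for one German sentence."""
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    seen = set()
    for i, tok in enumerate(tokens):
        utterance_initial = i == 0 or tokens[i - 1] in UTTERANCE_BREAKS
        if tok in ("Du", "du"):
            seen.add("T")
        elif tok == "Sie" and not utterance_initial:
            seen.add("V")
        # "sie" and "ihr"/"Ihr" are deliberately ignored (ambiguous forms).
    if seen == {"T", "V"}:
        return "BOTH"  # such sentences would be dropped from the corpus
    return seen.pop() if seen else "NONE"
```

Sentences labeled `BOTH` or `NONE` would simply be excluded before any labels are copied to the aligned English side.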
Goal 1: Compare German and English address • Give English monolingual text to human annotators • Ask for T/V judgment • Their annotation provides the following information • How well do annotators agree on English text? • Does English monolingual text provide enough information to identify T/V? (1a) • How well do annotators agree with copied labels? • Is there a direct correspondence? (1b) • Only if this is the case is the copying of labels appropriate
Experiment 1: Human Annotation • 200 randomly drawn English sentences • Two annotators (“A1”, “A2”) • Two conditions: • No context: just one sentence • In context: three sentences pre- and post-context each
Results: Reliability • Context improves reliability • Many sentences cannot be tagged with T/V in isolation, e.g.: “And she is a sort of relation of your lordship’s,” said Dawson. “And perhaps sometime you may see her.” • Reliability in context is reasonable: • English does provide strong (if imperfect) clues on T/V • Goal 1a ✓
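The slides as extracted do not say which agreement measure underlies the reliability results. One common choice for two annotators is Cohen's kappa, which corrects observed agreement for chance agreement; the sketch below assumes that measure, and its function name and toy labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' label sequences of equal length."""
    n = len(labels_a)
    # Proportion of items on which both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical toy judgments, not data from the study.
a1 = ["T", "T", "V", "V"]
a2 = ["T", "V", "V", "V"]
kappa = cohen_kappa(a1, a2)
```

The same function can compare an annotator's labels with the labels projected from German, which corresponds to question (1b).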
Results: Correspondence • Agreement with German projected labels again reasonable, but not perfect • Error analysis showed strong influence of social norms • Example: Lovers in 19th cent. novels use V (!) • [...] she covered her face with the other to conceal her tears. “Corinne!”, said Oswald, “Dear Corinne! My absence has then rendered you unhappy!” Goal 1b ✓
Experiment 2: Prediction of T/V • Copy German T/V labels onto English: No annotation • Learn L2-regularized logit classifier on train set; optimize on dev set; evaluate on test set • Feature candidates: • Lexical features (bag-of-words, χ² feature selection) • Distributional semantic word classes • 200 word classes clustered with the algorithm by Clark (2003) • Politeness theory (Brown & Levinson 1987) • Polite speech has specific features, which are inherited by V
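To make the χ² feature selection step concrete, here is a minimal sketch that scores one word by the χ² statistic of its 2×2 contingency table (word present/absent against label V/T). The slides do not specify how the statistic was computed or thresholded; the function name and the toy sentences are hypothetical.

```python
def chi_square(sentences, labels, word):
    """Chi-square statistic of `word` presence against T/V labels.

    `sentences` is a list of token sets, `labels` a parallel list of
    'T' / 'V' strings.
    """
    a = b = c = d = 0
    for tokens, label in zip(sentences, labels):
        present = word in tokens
        if present and label == "V":
            a += 1  # word present, V
        elif present:
            b += 1  # word present, T
        elif label == "V":
            c += 1  # word absent, V
        else:
            d += 1  # word absent, T
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


# Hypothetical toy data, not taken from the corpus.
sents = [{"sir", "you"}, {"sir", "please"}, {"you", "dear"}, {"you", "come"}]
labs = ["V", "V", "T", "T"]
scores = {w: chi_square(sents, labs, w) for w in ("sir", "you")}
```

Words would then be ranked by score and only the highest-scoring ones kept as bag-of-words features for the classifier.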
Parallel Corpus: Some statistics • German • #Sent_V: 37K & #Sent_T: 28K • Around 270 (<0.5%) sentences were both T & V • Ignored! • No errors found in 300 randomly selected, manually verified German sentences • English • #Sent_V: 25K & #Sent_T: 18K • Training data: 74 novels (26K) • Development data: 19 novels (9K) • Test data: 13 novels (8K) • Corpus available at http://www.nlpado.de/
Context • As shown by human annotation: Individual sentences often insufficient for classification • Simplest solution: Compute features over a window of context sentences • Problem: context typically includes non-speech sentences “I am going to see his ghost!” Lorry quietly chafed the hands that held his arm.
Context • Our solution: A simple “direct speech” recognizer • CRF-based sequence tagger (Mallet) trained on 1000 sentences • Best results with 8 sentences of direct speech context: +5% accuracy over no context • Example tagging: • B-SP: “I am going to see his ghost!” • O: Lorry quietly chafed the hands that held his arm.
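Once each sentence carries a speech tag, restricting the context window to direct speech is straightforward. The sketch below assumes that the tagger output is already available as a list of tags, that any tag other than `O` marks direct speech, and that context is taken from the preceding sentences only; the slides do not state the direction of the window, and the function name is hypothetical.

```python
def speech_context(sentences, tags, index, max_context=8):
    """Return up to `max_context` direct-speech sentences preceding `index`.

    Narrative sentences (tag 'O') are skipped, so the window reaches
    further back than a plain sentence window of the same size.
    """
    context = []
    for i in range(index - 1, -1, -1):
        if tags[i] != "O":
            context.append(sentences[i])
            if len(context) == max_context:
                break
    context.reverse()  # restore document order
    return context


sentences = [
    "“I am going to see his ghost!”",
    "Lorry quietly chafed the hands that held his arm.",
    "“Are you ready?”",  # hypothetical follow-up sentence for illustration
]
tags = ["B-SP", "O", "B-SP"]
window = speech_context(sentences, tags, index=2)
```

Features for the target sentence would then be computed over the target plus the returned window rather than over the raw neighboring sentences.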
Quantitative results • Only lexical features yield significant improvement over frequency baseline • Goal 2 ✓
Qualitative analysis: Lexical Features • Top 10 most-associated words for V (left) and T (right) • V: Titles, formulaic language • T: mixed bag, mostly very infrequent
Qualitative analysis: Semantic classes • Only 3-4 of 200 classes are associated with T or V
Qualitative analysis: Politeness features • Politeness features failed to yield a good result • Problem 1: Hand-built lists have insufficient coverage • Difficult: what linguistic expressions convey “distance”? • Problem 2: Features (at least in their current version) do not distinguish well between T and V • p(f|V)/p(f|T) values for all classes between 0.9 and 1.3 • For 13 of 16 features, p(f|V)/p(f|T) > 1: indicative of V
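The ratio p(f|V)/p(f|T) used above can be computed directly from per-sentence feature indicators. A minimal sketch, assuming each sentence is reduced to a boolean saying whether feature f fires; the function name and the toy data are hypothetical.

```python
def likelihood_ratio(feature_flags, labels):
    """Return p(f|V) / p(f|T) from parallel lists of booleans and labels."""
    v_total = sum(label == "V" for label in labels)
    t_total = sum(label == "T" for label in labels)
    v_hits = sum(f and label == "V" for f, label in zip(feature_flags, labels))
    t_hits = sum(f and label == "T" for f, label in zip(feature_flags, labels))
    if v_total == 0 or t_total == 0 or t_hits == 0:
        raise ValueError("ratio undefined: empty class or feature never fires in T")
    return (v_hits / v_total) / (t_hits / t_total)


# Hypothetical toy data: the feature fires in 3 of 4 V and 2 of 4 T sentences.
flags = [True, True, True, False, True, True, False, False]
labels = ["V", "V", "V", "V", "T", "T", "T", "T"]
ratio = likelihood_ratio(flags, labels)
```

A ratio near 1, as reported for all politeness classes, means the feature fires about equally often in both classes and therefore carries little signal for the classifier.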
Conclusions • Formal and informal language exists in English as well • Indicators more dispersed across context • Bootstrapping a T/V classifier for English possible • Results still fairly modest • Asymmetry: V more marked than T → better features • Difficult to operationalize features with high recall (sociolinguistic features, first names, …)
Future Work • Learn social networks from the novel • Change the scope of T/V from the sentence level to a pair of interlocutors
References • Manaal Faruqui and Sebastian Padó. 2011. “I thou thee, thou traitor”: Predicting formal vs. informal address in English literature. ACL 2011. • Manaal Faruqui and Sebastian Padó. 2012. Towards a model of formal and informal address in English. EACL 2012. • Roger Brown and Albert Gilman. 1960. The pronouns of power and solidarity. In Thomas A. Sebeok, editor, Style in Language, pages 253–277. MIT Press, Cambridge, MA. • Penelope Brown and Stephen C. Levinson. 1987. Politeness: Some Universals in Language Usage. Number 4 in Studies in Interactional Sociolinguistics. Cambridge University Press. • Fabienne Braune and Alexander Fraser. 2010. Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora. COLING 2010. • Helmut Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proceedings of the International Conference on New Methods in Language Processing, pages 44–49, Manchester, UK. • Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.
Thank you! Questions? Please write to: mfaruqui@cs.cmu.edu pado@cl.uni-heidelberg.de