This article discusses the classification of particles as a part of speech in the Russian language and the challenges associated with their categorization. It provides an analysis of the extent and distribution of particles in the corpus and presents experimental findings on disambiguating their uses. The article also addresses the theoretical problems of defining parts of speech and overlapping categories.
Do “particles” deserve to be classified as a part of speech? A view from Russian
Francis M. Tyers, Anna Endresen, Robert Reynolds, Laura A. Janda
University of Tromsø: The Arctic University of Norway
The 5th Conference of the Scandinavian Association for Language and Cognition
Trondheim, August 19-21, 2015
Particles: Russian ЖЕ (ŽE)
Konečno, sgorel – nel’zja že v polden’ ležat’ na solncepeke [«Domovoj», 2002]
‘Of course, you got a sunburn! You can’t ŽE lie in the hot sun in the middle of the day!’
“You are wrong! And more than that, you are capable of arriving at the correct conclusion yourself, but nevertheless you are sticking to the wrong conclusion.” McCoy (2003: 125)
Russian particles:
• Small uninflected words lacking referential content (Švedova et al. 1980)
• Meaning: modal and pragmatic attitudes towards a proposition
• Multifunctional: allow multiple interpretations
• Overlap with other parts of speech (adverbs, conjunctions, interjections, predicatives)
• A part of speech?
Particles: small words, big problems
• “The wide use of particles is a typical feature of colloquial Russian” (Vasilyeva 1972: 6)
Emu ja mogu poverit’ – ‘I can trust him’
Ved’ emu ja mogu poverit’ – ‘I can trust him, you know this’
Emu-to ja mogu poverit’ – ‘I know, I can trust him’
Emu ja ešče mogu poverit’ – ‘Well, I suppose, I can trust him’
Tak emu ja mogu poverit’ – ‘So I can trust him’
Vot emu ja mogu poverit’ – ‘He is the one I can trust’
Emu ja i mogu poverit’ – ‘Therefore I can trust him’
Da emu ja mogu poverit’ – ‘Well, I can surely trust him’
Xot’ emu ja mogu poverit’ – ‘At least I can trust him’
Particles: small words, big problems
• Active use of particles distinguishes L1 speakers from L2 learners (Nikolaeva 1985: 7)
• Relevant for other languages too:
  • Heinrichs, W. 1981. Die Modalpartikeln im Deutschen und Schwedischen. Tübingen.
  • L2 German speaker:
    Bitte geben Sie mir das Buch.
  • L1 German speaker:
    Können Sie mir vielleicht mal das Buch da geben?
    Ach, geben Sie nur doch bitte mal das Buch.
Outline
• Particles in Russian
• Extent
• Distribution in the corpus
• Our data: 9 words
• Database
• Analysis
• Alternative annotation scheme & guidelines
• Experiments 1 and 2
• Training a tagger to disambiguate between uses
Theoretical problems with particles as a part of speech
• Langacker (2013: 96) on parts of speech: “Traditional terms lack precise definition, are inconsistent in their applications, and are generally inadequate”
• Croft (2001: 63-107): Parts of speech are partly language-specific: the “same” categories might not coincide exactly across languages, though the focal points of certain categories, such as noun, pronoun, and verb, are typologically similar
• Part of speech categories can be complex and can overlap
Russian words classed as “particles” are particularly prone to overlap across part of speech categories
How do we identify parts of speech?
• Formal characteristics: morphological classes, e.g., nouns inflected for case, verbs for tense and person
• Distributional characteristics: e.g., adpositions contiguous with noun phrases, pronouns substitute for nouns, conjunctions bind phrases
• Semantic characteristics: e.g., nouns signify entities, verbs signify situations
Ideally, a classification should take into consideration all three types of characteristics
Russian words classed as “particles” lack a coherent definition in terms of formal, distributional, and semantic characteristics
Practical problems with particles as a part of speech
• Automatic part-of-speech taggers are trained on a gold standard corpus
• A single part-of-speech error can derail the parsing of a whole sentence
• Manning (2011): taggers for the Penn Treebank of English reach 97% accuracy in automatic part-of-speech tagging, but
  • this yields only 57% accuracy at the sentence level
  • the main culprit is part-of-speech tagging errors
• Accurate tagging is important not only for Natural Language Processing, but for all tools built on NLP:
  • spelling and grammar checkers
  • intelligent computer-assisted language learning
  • linguistic corpora
  • machine translation
Russian words classed as “particles” are the most error-prone part of speech
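
A rough way to see how a high per-token accuracy can coexist with a much lower sentence-level figure is sketched below. It assumes that tagging errors are independent across tokens and that every token is tagged with the same accuracy; both are simplifying assumptions, and the sentence lengths are hypothetical. The function name is ours, and the sentence-level figure cited from Manning (2011) is an empirical result, not the output of this formula.

```python
def whole_sentence_accuracy(token_accuracy: float, sentence_length: int) -> float:
    """Probability that every token in a sentence is tagged correctly,
    assuming independent errors with the same per-token accuracy."""
    return token_accuracy ** sentence_length


if __name__ == "__main__":
    # Hypothetical sentence lengths, in tokens.
    for length in (10, 20, 30):
        share = whole_sentence_accuracy(0.97, length)
        print(f"{length} tokens: {share:.2f} of sentences fully correct")
```

Under these assumptions, a 20-token sentence is tagged without any error only a little over half the time, which is in the same range as the sentence-level figure quoted on the slide.
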
Our proposal
• “Particle” is not a valid category.
• Russian particles have no coherent profile.
• “Particle” looks like a garbage category that is used when one feels uncertain about how to classify a word.
• “Particle” is not a classification but rather a failure to classify a word.
• It is possible to reclassify the words commonly classed as “particles”.
• We offer improved annotation guidelines that eliminate the class of particles altogether.
• The result is a descriptively more precise analysis.
Extent of particles in Russian
Estimates of the number of Russian particles vary:
• Zaliznjak (1980) designates over 100 Russian words as particles.
• Nikolaeva (1985: 8) lists the following alternative counts:
  • 131 particles in the 17-volume Academy dictionary
  • 110 particles in the 4-volume Academy dictionary
  • 84 particles in Ušakov’s dictionary
  • 75 particles in Ožegov’s dictionary
• Starodumova (1997: 8-9) claims that Russian is among the most “particle-rich” languages in the world, with approximately 300 particles.
Only 42 particles appear in all four of these sources. 64 particles are found in only one source each.
Data: 9 particles with one additional part-of-speech reading
• Database: 100 random sentences for each word (900 rows in an Excel spreadsheet)
• Source of data: disambiguated part of the Russian National Corpus (gold standard RNC)
Can we re-tag the uses of these lexemes and avoid the tag ‘particle’?
Analysis: Particle-free annotation
Analysis: Particle ЖЕ (ŽE)
1. Adverbial conjunction (ADVCNJ) – syntactically optional, usually preposed.
   Konečno, sgorel – nel’zja že v polden’ ležat’ na solncepeke.
   ‘Of course, you got a sunburn! After all, you can’t lie in the hot sun in the middle of the day!’
2. Coordinating conjunction (CNJCOO) – usually postposed, obligatory for creating an explicit contrast between syntactic constituents.
   Satira i jumor. Odni ix rezko razdeljajut ... , drugie že vidjat v jumore ... raznovidnost’ satiry.
   ‘Satire and humor: some people keep them strictly distinct ... , others however see humor as a form of satire.’
3. Emphasizer (EMPH) – syntactically optional, follows a phrasal stress-bearing word and brings it into the focus of attention.
   Seli s kraju ― i tut že iz veščmeška Vovka izvlek butylku portvejna.
   ‘They sat down ― and right away then Vovka pulled a bottle of port wine out of the supply bag.’
[Chart labels: 13%, 6%, 81%]
Analysis: Particle ДА (DA)
• Interjection – ‘yes’
  Da, zavtra ja priedu. ‘Yes, I will come tomorrow’
• Coordinating conjunction – ‘and’, ‘but’
  Ded da baba ‘grandfather and grandmother’
• Adverb – ‘after all, well’
  Da v etix pal’to pol goroda xodit! ‘Well, half of the town is wearing these coats!’
• Predicative – stands for an entire proposition, carries stress:
  Neobxodimo ustanovit’, želaet li on vozvratit’sja. Esli da, to kogda. ‘It is necessary to find out whether he wants to come back. If so, when.’
• Modal verb – unstressed, used with imperatives, infinitives, present tense finite forms:
  Da budet svet! ‘Let there be light!’
[Chart labels: 50%, 25%, 19%, 3%, 3%]
Experiment 1: How badly does tagging of Russian particles perform?
• Source: Russian National Corpus gold standard, using the tags manually assigned there
• Database: 100 randomly selected sentences for each of nine high-frequency particles (= 900 sentences)
• Method: Hidden Markov Model (HMM), 10-fold cross-validation, each time using 90 sentences as the training set and 10 sentences as the test set
We measure performance as improvement in accuracy (correct guesses/total guesses) over baseline (frequency of the most common tag for each word)
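
The evaluation procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' setup: the tagger here is a simple context-lookup stand-in rather than an HMM, the toy data and tag shares are invented, all function names are ours, and "gain" is assumed to be the plain difference between accuracy and baseline (the slides do not spell out how the totals are aggregated).

```python
from collections import Counter


def k_fold_splits(examples, k=10):
    """Yield (train, test) pairs; every example is in exactly one test fold."""
    for fold in range(k):
        test = examples[fold::k]
        train = [ex for i, ex in enumerate(examples) if i % k != fold]
        yield train, test


def baseline_accuracy(examples):
    """Accuracy of always guessing the most common tag for the word."""
    tags = Counter(tag for _, tag in examples)
    return tags.most_common(1)[0][1] / len(examples)


def train_context_lookup(train):
    """Stand-in tagger (NOT an HMM): most frequent tag per preceding word."""
    by_context = {}
    for context, tag in train:
        by_context.setdefault(context, Counter())[tag] += 1
    fallback = Counter(tag for _, tag in train).most_common(1)[0][0]

    def predict(context):
        counts = by_context.get(context)
        return counts.most_common(1)[0][0] if counts else fallback

    return predict


def cross_validated_accuracy(examples, train_fn, k=10):
    """Correct guesses / total guesses, pooled over the k test folds."""
    correct = total = 0
    for train, test in k_fold_splits(examples, k):
        predict = train_fn(train)
        for context, gold in test:
            correct += predict(context) == gold
            total += 1
    return correct / total


# Invented toy data for one word: (preceding word, tag) pairs, 100 in total.
toy_examples = (
    [("tut", "EMPH")] * 60
    + [("drugie", "CNJCOO")] * 25
    + [("nel'zja", "ADVCNJ")] * 15
)

baseline = baseline_accuracy(toy_examples)
accuracy = cross_validated_accuracy(toy_examples, train_context_lookup)
gain = accuracy - baseline  # assumed here to be a simple difference
print(f"baseline={baseline:.2f} accuracy={accuracy:.2f} gain={gain:.2f}")
```

With 100 examples and k=10, each fold trains on 90 examples and tests on 10, matching the split described on the slide. The toy contexts predict the tag perfectly, so the numbers printed say nothing about real corpus data.
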
Distribution of tags across the 900 sentences according to the RNC gold standard Most common tag taken as “baseline” for each word
Experiment 1: So how did the HMM tagger do? Total gain: 51% These results confirm our suspicion that the tagging of Russian “particles” in the RNC gold standard is not consistent
Experiment 2: Life without particles • Same source, database, and method as Experiment 1, but using our tags for the nine words instead of those in the RNC gold standard
Distribution of tags across the 900 sentences according to our tagging scheme
Experiment 2: So how did the HMM tagger do? Total gain: 127% This is more than twice the gain over baseline seen in Experiment 1, despite a much more complex tagging scheme
Conclusions
• Can we eliminate particles from the part-of-speech classification in Russian?
  • Yes, “particle” is not a classification but a failure to classify a word.
  • It is possible to reclassify the words commonly classed as “particles”.
• What are the practical benefits of this approach?
  • Particle-free annotation, where all categories are meaningful and useful for further applications.
  • Analysis that is descriptively more precise.
• Our methods
  • Usage-based analysis of corpus data: 9 high-frequency “particles”.
  • Experiment: training an automatic tagger to disambiguate uses.
References
• Croft, William. 2001. Radical Construction Grammar: Syntactic Theory in Typological Perspective. Oxford: Oxford University Press.
• Heinrichs, W. 1981. Die Modalpartikeln im Deutschen und Schwedischen. Tübingen.
• Kasatkina, R. F. 2004. “Častica že v roli tekstovogo konnektora (na materiale russkoj dialektnoj reči).” In Nikolaeva, T. M. (ed.), Verbal’naja i neverbal’naja opory prostranstva mežfrazovyx svjazej. Moscow. Pp. 71-83.
• Langacker, Ronald W. 2013. Essentials of Cognitive Grammar. Oxford: Oxford University Press.
• Manning, C. D. 2011. “Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics?” In Gelbukh, A. (ed.), Computational Linguistics and Intelligent Text Processing, 12th International Conference, CICLing 2011, Proceedings, Part I. Lecture Notes in Computer Science 6608. Pp. 171-189.
• McCoy, S. 2003. “Unifying the meaning of multifunctional particles: The case of Russian ŽE.” In University of Pennsylvania Working Papers in Linguistics, Vol. 9.1. Pp. 123-135.
• Nikolaeva, T. M. 1985. Funkcii častic v vyskazyvanii (na materiale slavjanskix jazykov). Moscow.
• Nikolaeva, T. M. 2008. Neparadigmatičeskaja lingvistika. Istorija “bluždajuščix častic”. Moscow: Jazyki slavjanskix kul’tur.
• Sičinava, D. V. 2005. Obrabotka tekstov s grammatičeskoj razmetkoj: instrukcija razmetčika. http://ruscorpora.ru/sbornik2005/09sitch.pdf
• Starodumova, E. A. 1997. Russkie časticy (pis’mennaja monologičeskaja reč’). Avtoreferat doktorskoj dissertacii.
• Švedova, N. Ju. et al. 1980. Russkaja grammatika, tom I. Moscow: Nauka.
• Vasilyeva, N. A. 1972. Particles in Colloquial Russian. Moscow: Progress Publishers.
• Wierzbicka, Anna. 1997. Jazyk. Kul’tura. Poznanie. Moscow: Russkie slovari.
• Zaliznjak, A. A. 1980. Grammatičeskij slovar’ russkogo jazyka. Moscow: Russkij jazyk.
Particles in written vs. spoken subcorpora
• Common claim: heavier use of particles is characteristic of spontaneous spoken Russian (Vasilyeva 1972)
• Is it true for our data?
• The difference is statistically significant: chi-squared = 3709, degrees of freedom = 1, p-value < 2.2e-16
• The effect size is very small: Cramér’s V = 0.026 (the conventional minimum for a reportable small effect is 0.1)
• We therefore combine the results for both types of data.
• Possible explanation: underrepresentation of informal dialog (only 7.8%) in the spoken subcorpus.
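
The relationship between the two statistics above can be sketched in a few lines. This is an illustration under stated assumptions: the counts in `toy_table` are invented, the function name is ours, and the chi-squared value is computed without a continuity correction (the slide does not say whether one was applied). Cramér's V is taken in its standard form, the square root of chi-squared divided by N times the smaller of (rows − 1) and (columns − 1).

```python
import math


def chi2_and_cramers_v(table):
    """Pearson chi-squared (no continuity correction) and Cramér's V
    for a table of counts given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1  # equals 1 for a 2x2 table
    return chi2, math.sqrt(chi2 / (n * k))


# Invented counts: rows = written, spoken; columns = "particle" tokens, other tokens.
toy_table = [[300, 9700], [400, 9600]]
chi2, v = chi2_and_cramers_v(toy_table)
print(f"chi2={chi2:.2f}, V={v:.3f}")

# Rearranging V = sqrt(chi2 / N) for the values reported on the slide gives
# only a rough sample size, because V is rounded to two significant digits.
implied_n = 3709 / 0.026 ** 2
print(f"implied N is roughly {implied_n:,.0f}")
```

The back-of-the-envelope figure suggests a sample on the order of several million tokens, which is why a very small difference between subcorpora can still come out as highly significant, and why the effect size is the more informative number here.
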
Analysis: Particle ЖЕ (ŽE)
• Že never appears clause-initially.
• Že is a clitic that forms a prosodic unit with a stressed lexeme, to which it is either preposed or postposed.
• We suggest that the position of že relative to its prosodic head is associated with different functions of že.
• We differentiate between 3 uses of že:
  • Emphasizer
  • Adverbial conjunction
  • Coordinating conjunction