Are paradigms learnable? A view from Russian

This study examines whether paradigms are learnable, using Russian as a case study. It explores relationships among word forms and the existence of partially overlapping groups of forms that make it possible to produce any potential form.

Presentation Transcript


  1. Are paradigms learnable? A view from Russian. Laura A. Janda, UiT The Arctic University of Norway; Francis M. Tyers, Higher School of Economics, Moscow

  2. The point: Is partial input enough to learn a whole system? The Paradigm Cell Filling Problem: Native speakers of languages with complex inflectional morphology routinely recognize and produce forms that they have never heard or seen. How is this possible?

  3. Theoretical Background • Word and Paradigm Morphology (Blevins 2016) • The Paradigm Cell Filling Problem (Ackerman et al. 2009) • Generating paradigms with a recurrent neural network (SIGMORPHON 2016 & 2017 Shared Tasks; Malouf 2016, 2017)

  4. Part of the paradigm of contar ‘tell’ in Spanish • lemma: the dictionary form, here: contar • word form: the forms of the word, here: cuento, cuentas, etc. • paradigm: the set of word forms of a word • paradigm cell: one position in a paradigm, defined by features like 1st person (yo) Present • lexeme: an abstraction (meaning) that unifies the set of word forms of a word

  5. But word forms are not equally common, and some might be missing altogether (data from UD Spanish Corpus of 400,000 words). And what happens when a speaker needs to inflect a new verb? What if we ask a speaker to give us the forms of *trontar? Key: bold >30, plain >10, grey 1-9, (blank) unattested

  6. And different lexemes have different profiles: gustar ‘please, like’ (data from UD Spanish Corpus of 400,000 words) Key: bold >10, plain >5, grey 1-5, (blank) unattested

  7. Hypotheses and Evidence. Hypotheses: • Russian does not contain paradigms, neither in aggregate, nor in the minds of speakers • Instead there are relationships among forms that constitute partially overlapping groups making it possible to produce any potential form. Evidence: • Russian and the relationship between paradigm size and number of full paradigms for nouns • Russian nouns: correspondence analysis showing partially overlapping subsets of forms • Russian nouns, verbs, and adjectives: computational experiment comparing training on full paradigms vs. single forms

  8. Hypotheses and Evidence. All of our evidence is based on SynTagRus, with > 1M hand-annotated tokens. Hypotheses: • Russian does not contain paradigms, neither in aggregate, nor in the minds of speakers • Instead there are relationships among forms that constitute partially overlapping groups making it possible to guess any potential form. Evidence: • Russian and the relationship between paradigm size and number of full paradigms for nouns • Russian nouns: correspondence analysis showing partially overlapping subsets of forms • Russian nouns, verbs, and adjectives: computational experiment comparing training on full paradigms vs. single forms

  9. High-Frequency Russian Nouns: ‘fear’, ‘soldier’, ‘department’, ‘concept’, ‘memory’. Key: bold >20%, plain >10%, grey 1-9%, (blank) unattested

  10. More High-Frequency Russian Nouns: ‘background’, ‘champion’, ‘extent’, ‘frame’, ‘difficulty’. Key: bold >20%, plain >10%, grey 1-9%, (blank) unattested

  11. Paradigm size and number of full paradigms • Full paradigms of word forms are rarely encountered in corpora • As the size of the paradigm increases, the percentage of lexemes for which all possible word forms are attested decreases • Russian is somewhere in the middle of the scale • For languages for which linguists claim the existence of truly large paradigms, there may be no lexeme that is ever attested in all possible word forms and even some word forms that have no attestations at all
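The measure behind this claim, the share of lexemes attested in every cell of their paradigm, can be sketched as follows. This is a minimal illustration, not the authors' procedure: the cell labels, the function name `share_of_full_paradigms`, and the toy data are all assumptions, and the noun paradigm is simplified to 6 cases times 2 numbers.

```python
# Hypothetical cell inventory: 6 cases x 2 numbers = 12 cells for a Russian noun.
CASES = ["Nom", "Gen", "Dat", "Acc", "Ins", "Loc"]
NUMBERS = ["Sg", "Pl"]
NOUN_CELLS = {case + number for case in CASES for number in NUMBERS}


def share_of_full_paradigms(attested, cells=NOUN_CELLS):
    """Return the fraction of lexemes whose attested cells cover the whole paradigm.

    attested maps each lexeme to the set of paradigm cells found in the corpus.
    """
    full = sum(1 for forms in attested.values() if cells <= set(forms))
    return full / len(attested)


# Invented toy data: one lexeme with a full paradigm, two with partial ones.
toy_attestations = {
    "lexeme_a": set(NOUN_CELLS),
    "lexeme_b": {"NomSg", "GenSg", "InsSg"},
    "lexeme_c": {"NomPl"},
}

print(share_of_full_paradigms(toy_attestations))
```

With a larger cell inventory, the subset test `cells <= set(forms)` becomes harder to satisfy for any single lexeme, which is the relationship the slide describes.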

  12. Relationship between paradigm size and number of full paradigms for nouns

  13. Zipf’s Law

  14. Zipf’s Law • In a corpus of natural language utterances, the frequency of a word is inversely proportional to its frequency rank. For example, in the Brown Corpus of American English text, the word "the" is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf's Law, the second-place word "of" accounts for slightly over 3.5% of words (36,411 occurrences), followed by "and" (28,852). Only 135 vocabulary items are needed to account for half the Brown Corpus. • A corollary: in a corpus, about 50% or more of all unique words will be hapaxes (words that occur only once)
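The arithmetic on this slide can be checked with a short sketch. It compares the Brown Corpus counts quoted above with the frequency an idealized Zipf distribution (frequency proportional to 1/rank) would predict, and computes the share of hapaxes among word types. The function names and the toy token list are assumptions made for illustration.

```python
from collections import Counter


def zipf_expected(top_frequency, rank):
    """Frequency predicted for a rank if frequency is proportional to 1/rank."""
    return top_frequency / rank


def hapax_share(tokens):
    """Fraction of unique words (types) that occur exactly once."""
    counts = Counter(tokens)
    hapaxes = sum(1 for n in counts.values() if n == 1)
    return hapaxes / len(counts)


# Counts quoted on the slide for the three most frequent Brown Corpus words.
brown_top = [("the", 69971), ("of", 36411), ("and", 28852)]
top_frequency = brown_top[0][1]
for rank, (word, observed) in enumerate(brown_top, start=1):
    predicted = zipf_expected(top_frequency, rank)
    print(f"{word}: observed {observed}, predicted {predicted:.1f}")

# Invented toy text: 5 types, 3 of which occur only once.
print(hapax_share("a a a b b c d e".split()))
```

The observed counts for "of" and "and" sit somewhat above the idealized prediction, so the law is an approximation rather than an exact fit.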

  15. Relationship between paradigm size and number of complete paradigms for nouns. Because Zipf’s Law scales up, these numbers will never change substantially, no matter how large the corpus is

  16. Russian Nouns: Partially overlapping subsets of word forms. Grammatical profile: Frequency distribution of word forms • Grammatical profiles of five types of lexemes: • Masculine inanimate ending in consonant • Masculine animate ending in consonant • Neuter inanimate • Feminine inanimate (II) ending in -а/-я • Feminine inanimate (III) ending in -ь. Frequency threshold ≥ 50
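A grammatical profile as defined here can be computed from a list of (lemma, case/number tag) pairs. The sketch below is only an illustration: the tag labels and toy counts are invented, and it assumes that the frequency threshold applies to the total number of attestations of a lexeme, which the slide does not spell out.

```python
from collections import Counter, defaultdict


def grammatical_profiles(tokens, min_freq=50):
    """Map each sufficiently frequent lemma to its relative frequency per word form.

    tokens is an iterable of (lemma, tag) pairs, e.g. ("чемпион", "InsSg").
    min_freq is assumed to be a threshold on the lemma's total attestations.
    """
    counts = defaultdict(Counter)
    for lemma, tag in tokens:
        counts[lemma][tag] += 1

    profiles = {}
    for lemma, tag_counts in counts.items():
        total = sum(tag_counts.values())
        if total >= min_freq:
            profiles[lemma] = {tag: n / total for tag, n in tag_counts.items()}
    return profiles


# Invented toy data: one lemma above the threshold, one below it.
toy_tokens = (
    [("чемпион", "InsSg")] * 30
    + [("чемпион", "NomSg")] * 20
    + [("рамка", "LocPl")] * 10
)

print(grammatical_profiles(toy_tokens))
```

Each profile sums to 1, so lexemes of very different overall frequency can be compared on the shape of their distribution alone.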

  17. Examples of grammatical profiles We will look at correspondence analysis for each group, starting with masculine animates

  18. Correspondence Analysis of Grammatical Profiles. Input: 95 vectors (1 for each lexeme) of frequencies for word forms. Each vector tells how many attestations were found for each case/number value: Nominative Singular, Genitive Singular, etc. Rows are lexemes; columns are case/number values of word forms. Process: Matrices of distances are calculated for rows and columns and represented in a multidimensional space defined by factors that are mathematical constructs. Factor 1 is the mathematical dimension that accounts for the largest amount of variance in the data, followed by Factor 2, etc. Plot: the first two (most significant) Factors, with Factor 1 as the x-axis and Factor 2 as the y-axis. You can think of Factor 1 as the strongest parameter that splits the data into two groups (negative vs. positive values on the x-axis)
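For readers who want to see the computation, here is a minimal sketch of correspondence analysis in its standard textbook form, via a singular value decomposition of the standardized residuals of the contingency table. It is not necessarily the software or exact variant used for the plots in this presentation, and the 4 x 4 count matrix is invented; it assumes every row and column has at least one attestation.

```python
import numpy as np


def correspondence_analysis(counts):
    """Return row coordinates, column coordinates, and inertia per factor.

    counts: 2-D array with lexemes as rows and case/number values as columns.
    """
    N = np.asarray(counts, dtype=float)
    P = N / N.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * sigma) / np.sqrt(r)[:, None]     # principal coordinates
    col_coords = (Vt.T * sigma) / np.sqrt(c)[:, None]
    return row_coords, col_coords, sigma ** 2


# Invented toy counts: two lexemes favour the first two columns (say, singular
# forms), two favour the last two (say, plural forms).
toy_counts = np.array([
    [40, 30, 5, 5],
    [35, 25, 10, 5],
    [5, 10, 40, 30],
    [10, 5, 30, 45],
])

rows, cols, inertia = correspondence_analysis(toy_counts)
print("Factor 1 row coordinates:", rows[:, 0])
print("Share of inertia per factor:", inertia / inertia.sum())
```

In the toy data the sign of the Factor 1 coordinate separates the two groups of lexemes, which is the "negative vs. positive values on the x-axis" reading described on the slide. The sign itself is arbitrary; only the contrast matters.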

  19. Masculine animates

  20. Masculine animates: Typically a lexeme is found in only 1-3 word forms

  21. Masculine animates: Typically a lexeme is found in only 1-3 word forms. The typical word forms are motivated by constructions

  22. Masculine animates: Typically a lexeme is found in only 1-3 word forms. The typical word forms are motivated by constructions: NomPl аналитики отмечают ‘analysts make the point that’; InsSg стать/быть чемпионом ‘become/be the champion’

  23. Feminine III

  24. Neuter inanimates

  25. Feminine II (minus рамка)

  26. Masculine inanimates

  27. Computational experiment: nouns, verbs, adjectives • Based on an ordered list of the most frequent forms in SynTagRus • Machine learning: • Given the 100 most frequent forms, predict the next 100 most frequent forms • Given the 200 most frequent forms, predict the next 100 most frequent forms • Given the 300 most frequent forms, predict the next 100 most frequent forms • Given the 400 most frequent forms, predict the next 100 most frequent forms • Given the 500 most frequent forms, predict the next 100 most frequent forms • … until 5400, when SynTagRus runs out of data

  28. Computational experiment: nouns, verbs, adjectives. This is the training data • Based on an ordered list of the most frequent forms in SynTagRus • Machine learning: • Given the 100 most frequent forms, predict the next 100 most frequent forms • Given the 200 most frequent forms, predict the next 100 most frequent forms • Given the 300 most frequent forms, predict the next 100 most frequent forms • Given the 400 most frequent forms, predict the next 100 most frequent forms • Given the 500 most frequent forms, predict the next 100 most frequent forms • … until 5400, when SynTagRus runs out of data

  29. Computational experiment: nouns, verbs, adjectives. This is the testing data • Based on an ordered list of the most frequent forms in SynTagRus • Machine learning: • Given the 100 most frequent forms, predict the next 100 most frequent forms • Given the 200 most frequent forms, predict the next 100 most frequent forms • Given the 300 most frequent forms, predict the next 100 most frequent forms • Given the 400 most frequent forms, predict the next 100 most frequent forms • Given the 500 most frequent forms, predict the next 100 most frequent forms • … until 5400, when SynTagRus runs out of data

  30. Computational experiment: nouns, verbs, adjectives • Based on an ordered list of the most frequent forms in SynTagRus • Machine learning: • Given the 100 most frequent forms, predict the next 100 most frequent forms • Given the 200 most frequent forms, predict the next 100 most frequent forms • Given the 300 most frequent forms, predict the next 100 most frequent forms • Given the 400 most frequent forms, predict the next 100 most frequent forms • Given the 500 most frequent forms, predict the next 100 most frequent forms • … until 5400, when SynTagRus runs out of data • Comparison of learning with full paradigms vs. learning with single forms • No overlap between training and testing data • This means that testing is always on previously unseen lemmas
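The incremental design described on these slides can be sketched as a generator of training and testing splits. This is a hedged illustration only: the function name `make_splits` and the synthetic data are hypothetical, and the way unseen lemmas are enforced here (dropping test items whose lemma already occurs in the training slice) is an assumption, since the slides state the constraint but not how it was implemented.

```python
def make_splits(ranked_forms, step=100, max_train=5400):
    """Yield (train_size, train, test) for growing training sets.

    ranked_forms: list of (lemma, form) pairs, ordered from most to least frequent.
    The training set is the top-n forms; the test set is the next `step` forms.
    """
    for n in range(step, max_train + 1, step):
        if n >= len(ranked_forms):
            break  # the corpus has run out of data
        train = ranked_forms[:n]
        seen_lemmas = {lemma for lemma, _ in train}
        # Assumption: keep only test items whose lemma is unseen in training.
        test = [
            (lemma, form)
            for lemma, form in ranked_forms[n:n + step]
            if lemma not in seen_lemmas
        ]
        yield n, train, test


# Synthetic stand-in for the frequency-ranked list (not real SynTagRus data).
toy_ranked = [(f"lemma{i}", f"form{i}") for i in range(350)]
toy_ranked[150] = ("lemma0", "form150")  # a second form of an already frequent lemma

for n, train, test in make_splits(toy_ranked, step=100, max_train=300):
    print(n, len(train), len(test))
```

In the full-paradigm condition, the training items would presumably be expanded to all forms of each training lemma, while the single-form condition would use the ranked forms as they are; that expansion is left out of the sketch because the slides do not describe it.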

  31. Data for training and testing from SynTagRus

  32. 100-200: Both models fail completely

  33. 300-1100: Better performance with full paradigms, but accuracy is low for both

  34. 1200-1700: Both models perform equally

  35. 1800-5400: Single forms model outperforms full paradigms

  36. Conclusions • A given lexeme typically appears in only a handful of word forms • Word forms are likely learned as partially overlapping sets of related items • Learning is potentially enhanced by focus only on the most typical word forms attested for given lexemes • It is possible to extract patterns that relate to the meaning of the lexeme and the constructions that it appears in and use these to strategically target learning • For nouns, number is the most strongly distinguished dimension; locative and instrumental case are most distinct

  37. A vision of the future? • Dictionaries that cite the most frequent word forms of lexemes, along with the constructions in which they typically appear • Learning materials that focus on the typical word forms, avoiding word forms no one is ever likely to use