
Minimally Supervised Morphological Analysis by Multimodal Alignment


1. Minimally Supervised Morphological Analysis by Multimodal Alignment
Idit Shoham, Supervisor: Shlomo Yona

2. Contents: Introduction, Alignment Models, Data Structures, Conclusion

3. Introduction
• This project implements a unique morphological analyzer, based on the algorithm presented by D. Yarowsky and R. Wicentowski.
• Compared to other morphological analyzers, this method combines the following features:
• Handling BOTH regular and highly irregular forms.
• With NO direct supervision.
• Starting with NO examples for training.
• And NO prior seeding of legal morphological transformations.

4. Introduction Lemma Alignment
• An algorithm that extracts morphological rules relating roots (lemmas) and inflected forms of verbs.
• It performs unsupervised, but not completely knowledge-free, learning.
• The algorithm uses a combination of four probabilistic models to find pairs that are likely to be morphologically related.

5. Alignment Models
• Frequency Similarity
• Context Similarity
• Weighted Levenshtein Distance
• Morphological Transformation Probabilities

6. Alignment Models Frequency Similarity
• Morphologically related pairs show minimal disparity between the frequency distributions of the lemma and the inflection: <sang, sing> and <singed, singe> are likely pairs, while *<singed, sing> is not.
Frequency(sang) = 1427, Frequency(sing) = 1204, Frequency(singed) = 9, Frequency(singe) = 2
log(sang/sing) = 0.17, log(singed/singe) = 1.5, log(singed/sing) = -4.9, log(sang/singe) = 5.06
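A minimal C sketch of this log frequency ratio test (the frequencies are hard-coded from the slide's example; a real implementation would look them up in the corpus hash table):

```c
#include <math.h>
#include <stdio.h>

/* Log frequency ratio used to compare a candidate <inflection, lemma>
   pair; ratios that deviate strongly from the expected distribution
   penalize the pairing. */
static double log_freq_ratio(double infl_freq, double lemma_freq) {
    return log(infl_freq / lemma_freq);
}

int main(void) {
    printf("log(sang/sing)    = %.2f\n", log_freq_ratio(1427, 1204)); /* ~0.17  */
    printf("log(singed/singe) = %.2f\n", log_freq_ratio(9, 2));       /* ~1.50  */
    printf("log(singed/sing)  = %.2f\n", log_freq_ratio(9, 1204));    /* ~-4.90 */
    return 0;
}
```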

7. Alignment Models Frequency Similarity
• Lemma Frequency: considering the frequencies of all the parts of speech already found; each part-of-speech frequency F(POSi) is weighed against the rest of the lemma frequency, LF - F(POSi), on a log scale.

8. Alignment Models Frequency Similarity
• Problem: some inflections are relatively rare.
• Simplifying assumption: the frequency ratios are similar between regular and irregular morphological processes.
• Solution: quantifying how well an inflection fits (or deviates from) the expected frequency distributions.

9. Alignment Models Context Similarity
• Noise is reduced by comparing the contexts in which the verbs occur within their sentences:
The fat woman used to sing here
The young man has sang at the opera

10. Alignment Models Context Similarity
• Coarse regular expressions over part-of-speech tags are used to find comparable sentences:
((d?j*n)|o)(a|c)*vq?p?d?(n)
n - noun, v - verb, j - adjective, d - determiner, q - quantifier, a - auxiliary, p - preposition, o - object, c - contrary
The fat woman used to sing here -> d j n v p v n
The young man sang at the opera -> d j n v p d n
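As an illustration, the second sentence's tag string can be tested against the slide's expression with the POSIX regex API (the anchoring and the tag string are added here for the demo; the slides do not show the project's own matching code):

```c
#include <regex.h>
#include <stdio.h>

int main(void) {
    /* The slide's POS pattern, anchored so the whole tag string must match. */
    const char *pattern = "^((d?j*n)|o)(a|c)*vq?p?d?(n)$";
    /* "The young man sang at the opera" -> d j n v p d n */
    const char *tags = "djnvpdn";

    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0) {
        fprintf(stderr, "bad pattern\n");
        return 1;
    }
    printf("%s %s\n", tags,
           regexec(&re, tags, 0, NULL, 0) == 0 ? "matches" : "does not match");
    regfree(&re);
    return 0;
}
```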

11. Alignment Models Context Similarity
• A traditional cosine similarity is computed between the vectors that represent the sentences:
vec = (d, j, n, v, p)
vec1 = (FS_the, FS_fat, FS_woman + FS_here, FS_used + FS_sing, FS_to)
vec2 = (FS_the, FS_young, FS_man + FS_opera, FS_sang, FS_at)
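A minimal C sketch of the cosine computation (the frequency scores below are invented placeholders; real values come from the corpus):

```c
#include <math.h>
#include <stdio.h>

/* Cosine similarity between two equal-length vectors; each slot holds the
   summed frequency score of the words sharing one POS tag (d, j, n, v, p). */
static double cosine(const double *a, const double *b, int n) {
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < n; i++) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (sqrt(na) * sqrt(nb));
}

int main(void) {
    double vec1[5] = {4.1, 0.8, 2.5, 3.0, 1.2}; /* illustrative scores only */
    double vec2[5] = {4.1, 0.6, 2.2, 2.8, 1.0};
    printf("similarity = %.3f\n", cosine(vec1, vec2, 5));
    return 0;
}
```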

12. Alignment Models Weighted Levenshtein Distance
• Calculating a distance between an inflection and a candidate lemma.
• The naïve Levenshtein distance charges 1 for every insertion, deletion and substitution:
D(i, j) = min( D(i-1, j) + 1, D(i, j-1) + 1, D(i-1, j-1) + [a_i ≠ b_j] )
where [a_i ≠ b_j] is 1 for a substitution and 0 for a match.

13. Alignment Models Weighted Levenshtein Distance
• Weights are given to 4 cases:
• Vowel turns into another vowel (very common → lowest cost), e.g. i → a
• Vowel cluster turns into another vowel cluster (common → low cost), e.g. ea → au
• Consonant turns into another consonant (very rare → highest cost), e.g. m → r
• Consonant turns into a vowel cluster (rare → high cost), e.g. m → au

14. Alignment Models Weighted Levenshtein Distance
• Default costs:
• Vowel to vowel: 0.5
• Vowel cluster to vowel cluster: 0.6
• Consonant to consonant: 1
• Consonant to vowel cluster: 0.98
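A C sketch of a weighted Levenshtein distance using these default costs (insertions and deletions are assumed to cost 1, and vowel-cluster handling is omitted to keep the sketch short; the project's actual code also weights cluster changes at 0.6 and 0.98):

```c
#include <stdio.h>
#include <string.h>

static int is_vowel(char c) { return strchr("aeiou", c) != NULL; }

/* Substitution cost per the slide's defaults (single characters only). */
static double sub_cost(char a, char b) {
    if (a == b) return 0.0;
    if (is_vowel(a) && is_vowel(b))   return 0.5;  /* vowel -> vowel         */
    if (!is_vowel(a) && !is_vowel(b)) return 1.0;  /* consonant -> consonant */
    return 0.98;                                   /* consonant <-> vowel    */
}

static double weighted_lev(const char *s, const char *t) {
    size_t n = strlen(s), m = strlen(t);
    double d[64][64];                 /* assumes short verb forms */
    for (size_t i = 0; i <= n; i++) d[i][0] = (double)i;
    for (size_t j = 0; j <= m; j++) d[0][j] = (double)j;
    for (size_t i = 1; i <= n; i++)
        for (size_t j = 1; j <= m; j++) {
            double del = d[i-1][j] + 1.0;
            double ins = d[i][j-1] + 1.0;
            double sub = d[i-1][j-1] + sub_cost(s[i-1], t[j-1]);
            double best = del < ins ? del : ins;
            d[i][j] = sub < best ? sub : best;
        }
    return d[n][m];
}

int main(void) {
    printf("sang/sing:    %.2f\n", weighted_lev("sang", "sing")); /* one vowel sub: 0.5 */
    printf("taught/teach: %.2f\n", weighted_lev("taught", "teach"));
    return 0;
}
```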

15. Alignment Models Morphological Transformation Probabilities
• Generalizing the <inflection, lemma> mapping function via a generative probabilistic model:
P(inflection | root, suffix, POS) = P(stemchange | root, suffix, POS)
≈ λ1·P(α→β | last3(root), suffix, POS)
+ (1-λ1)·( λ2·P(α→β | last2(root), suffix, POS)
+ (1-λ2)·( λ3·P(α→β | last1(root), suffix, POS)
+ (1-λ3)·( λ4·P(α→β | suffix, POS)
+ (1-λ4)·P(α→β) )))
stemchange is the α→β rule. lastk(root) indicates the final k characters of the root. λi is a function of the relative sample size of the conditioning event.

16. Alignment Models Morphological Transformation Probabilities
• Example: <solidified, solidify>:
P(solidified | solidify, +ed, VBD) = P(y→i | solidify, +ed, VBD)
≈ λ1·P(y→i | ify, +ed)
+ (1-λ1)·( λ2·P(y→i | fy, +ed)
+ (1-λ2)·( λ3·P(y→i | y, +ed)
+ (1-λ3)·( λ4·P(y→i | +ed)
+ (1-λ4)·P(y→i) )))
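A C sketch of the backoff interpolation (the probabilities and lambda weights below are invented placeholders; in the real model each P(α→β | ...) is estimated from aligned pairs and each λi depends on the sample size of its conditioning event):

```c
#include <stdio.h>

/* Interpolated backoff estimate of the stem-change probability.
   p[0] = P(a->b | last3(root), suffix, POS)
   p[1] = P(a->b | last2(root), suffix, POS)
   p[2] = P(a->b | last1(root), suffix, POS)
   p[3] = P(a->b | suffix, POS)
   p[4] = P(a->b)                                */
static double backoff(const double p[5], const double lambda[4]) {
    double est = p[4];                 /* least specific estimate */
    for (int i = 3; i >= 0; i--)       /* fold in more specific contexts */
        est = lambda[i] * p[i] + (1.0 - lambda[i]) * est;
    return est;
}

int main(void) {
    double p[5]      = {0.9, 0.7, 0.4, 0.1, 0.01}; /* illustrative only */
    double lambda[4] = {0.8, 0.7, 0.6, 0.5};
    printf("P(stemchange) ~ %.3f\n", backoff(p, lambda));
    return 0;
}
```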

17. Alignment Models Morphological Transformation Probabilities
• Baseline model:
• The Levenshtein distance between α and β is used as the cost from which the stem-change probability is derived.
• The cost of a change increases geometrically as its distance from the end of the root increases.
• The model is improved by iterative re-estimation.

18. Data Structures corpusTable
• A hash table of the words in the corpus. Each hash node stores: word, filename, freq, posSigns, fileOccur, sentenceLength and a next pointer; nodes for Word1 through Word5 of Filename1 are chained together.
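A C sketch of this hash node, reconstructed from the field names on the slide (the concrete types are assumptions):

```c
/* Sketch of a corpusTable hash node; field names follow the slide. */
typedef struct HashNode {
    char  *word;           /* the corpus word                         */
    char  *filename;       /* file the word occurred in               */
    int    freq;           /* occurrence count                        */
    char  *posSigns;       /* POS signs observed for the word         */
    int   *fileOccur;      /* per-file occurrence data (assumed type) */
    int    sentenceLength; /* length of the containing sentence       */
    struct HashNode *next; /* collision chaining                      */
} HashNode;
```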

19. Data Structures infTable
• A table of inflections. Each Inf node stores: name, freq, candLemmasNum, posNames, candLemmaList, alignedLemma, selectedPos and a next pointer.
• Each CandLemma node in a candidate list stores: lemma, score, calculated, and next/prev pointers (e.g. Inf name1 with candidates lemma1 and lemma2; Inf name2 with candidates lemma3 and lemma4).
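A C sketch of the two node types, again with the field names taken from the slide and the types assumed:

```c
/* Sketch of the infTable node types. */
typedef struct CandLemma {
    char  *lemma;           /* candidate lemma string        */
    double score;           /* combined alignment score      */
    int    calculated;      /* flag: score already computed  */
    struct CandLemma *next; /* doubly linked candidate list  */
    struct CandLemma *prev;
} CandLemma;

typedef struct Inf {
    char       *name;          /* the inflected form                */
    int         freq;
    int         candLemmasNum; /* number of candidate lemmas        */
    char       *posNames;
    CandLemma  *candLemmaList; /* list of candidate lemmas          */
    CandLemma  *alignedLemma;  /* winning alignment (assumed type)  */
    char       *selectedPos;
    struct Inf *next;
} Inf;
```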

20. Data Structures Dictionary
• A hash table of words taken from the dictionary file.
• The fields fileOccur, sentenceLength, freq and next are unused here.
• posName is a field used only by the dictionary.
• Example nodes: sing, bring and teach from "dictionary.txt", each with posName V and Next = NULL.

21. Data Structures POS & Suffixes
• This data holds the canonical suffixes and their parts of speech in a linked list; each node stores posName, posNameLength, the suffixes and a next pointer:
• VBD: ed, t
• VBG: ing
• VBZ: s
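A C sketch of one list node (the suffix representation and the count field are assumptions; the other field names follow the slide):

```c
/* Sketch of a POS & Suffixes list node. */
typedef struct PosNode {
    char  *posName;       /* e.g. "VBD"                           */
    int    posNameLength;
    char **suffixes;      /* canonical suffixes, e.g. {"ed", "t"} */
    int    suffixCount;   /* assumed helper field                 */
    struct PosNode *next;
} PosNode;
```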

22. Data Structures Context Similarity - funcWords
• The functional signs are held in a linked list. The words of each sign are kept in one string each, separated by spaces:
• a: is are…
• n: not
• p: of, on…
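A C sketch of a funcWords node (types are assumptions; the one-string word list follows the slide's description):

```c
/* Sketch of a funcWords node: one functional sign with its words kept
   in a single space-separated string. */
typedef struct FuncWord {
    char  sign;      /* e.g. 'a' (auxiliary), 'n', 'p' */
    char *wordsList; /* e.g. "is are ..."              */
    struct FuncWord *next;
} FuncWord;
```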

23. Data Structures Regular expressions
• The regular expressions are stored in char** regexes.
• Struct Expr is used for generating all possible regular expressions from one sentence; its fields: expression, combinations, bases, posArrays, overflow, numWords.
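A C sketch of struct Expr built from the listed field names (all types and the comments on their roles are assumptions):

```c
/* Sketch of struct Expr. */
typedef struct Expr {
    char  *expression;   /* the regular expression being built          */
    int    combinations; /* number of tag combinations to expand        */
    char **bases;        /* base patterns for the sentence              */
    char **posArrays;    /* candidate POS tags per word                 */
    int    overflow;     /* flag: too many combinations (assumed)       */
    int    numWords;     /* number of words in the sentence             */
} Expr;
```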

24. Data Structures Transformation Probabilities
• Example data structures for the pairs <take, took> and <used, use>:
• StemChangeLemma: sc (stemChange*), globalCount: 2.
• <take, took>: stemChange from "ake" to "ook"; its lemmaContext list: (lastOfLemma: ake, lastNum: 3, suffix: "", count: 1) -> (lastOfLemma: ke, lastNum: 2, suffix: "", count: 1) -> (lastOfLemma: e, lastNum: 1, suffix: "", count: 1).
• <used, use>: stemChange from "e" to ""; one lemmaContext: (lastOfLemma: e, lastNum: 1, suffix: ed, count: 1).
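A C sketch of these structures (field names follow the slide; types are assumptions):

```c
/* Sketch of the transformation-probability structures. */
typedef struct LemmaContext {
    char *lastOfLemma; /* final characters of the lemma, e.g. "ake" */
    int   lastNum;     /* how many final characters, e.g. 3         */
    char *suffix;      /* canonical suffix of the inflection        */
    int   count;       /* observations of this context              */
    struct LemmaContext *next;
} LemmaContext;

typedef struct StemChange {
    char *from;           /* alpha, e.g. "ake"                     */
    char *to;             /* beta,  e.g. "ook"                     */
    LemmaContext *lcList; /* contexts in which the change was seen */
    struct StemChange *next;
} StemChange;

typedef struct StemChangeLemma {
    StemChange *sc;          /* observed stem changes for the lemma */
    int         globalCount; /* total observations                  */
} StemChangeLemma;
```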

25. Conclusion Demonstration
• working : work pos: VBG Stem Change: "" -> "" Suffix: ing
• liked : like pos: VBD VBN Stem Change: "" -> d Suffix: ""
• used : use pos: VBD VBN Stem Change: "" -> d Suffix: ""
• sang : sing pos: VBD Stem Change: ing -> ang Suffix: ""
• drank : drink pos: VBD VBZ VBG VBN Stem Change: ink -> ank Suffix: ""
• eaten : eat pos: VBN Stem Change: "" -> "" Suffix: en
• eats : eat pos: VBZ Stem Change: "" -> "" Suffix: s
• using : use pos: VBG Stem Change: e -> "" Suffix: ing
• likes : like pos: VBZ Stem Change: "" -> "" Suffix: s
• danced : dance pos: VBD VBN Stem Change: "" -> d Suffix: ""
• ate : eat pos: VBD Stem Change: eat -> ate Suffix: ""
• sings : sing pos: VBZ Stem Change: "" -> "" Suffix: s
• songs : use pos: VBZ Stem Change: use -> song Suffix: s
• singed : singe pos: VBD Stem Change: "" -> d Suffix: ""
• worked : work pos: VBD VBN Stem Change: "" -> "" Suffix: ed
• checks : check pos: VBZ Stem Change: "" -> "" Suffix: s
• eating : eat pos: VBG Stem Change: "" -> "" Suffix: ing
• teaching : teach pos: VBG Stem Change: "" -> "" Suffix: ing
• singes : singe pos: VBZ Stem Change: "" -> "" Suffix: s
• checking : check pos: VBG Stem Change: "" -> "" Suffix: ing
• taught : teach pos: VBD VBN Stem Change: each -> augh Suffix: t
• checked : check pos: VBD VBN Stem Change: "" -> "" Suffix: ed
• interested : singe pos: VBN Stem Change: singe -> interest Suffix: ed
• dancing : dance pos: VBG Stem Change: e -> "" Suffix: ing
• teaches : teach pos: VBZ Stem Change: "" -> "" Suffix: es
• singing : sing pos: VBG Stem Change: "" -> "" Suffix: ing

26. Conclusion Algorithm's Limitations
• This model works only on languages with inflectional morphology.
• In addition, it cannot be applied to a new target language without some a priori knowledge of that language's linguistic properties.
• Thus, it cannot be used when the grammar of the target language has not yet been properly described, or when the relevant information is unavailable for other reasons.

27. Conclusion Further Work
• The algorithm can be extended to morphological relations other than those between roots and their inflected forms.
• Further related reading: "Unsupervised discovery of morphologically related words based on orthographic and semantic similarity" by M. Baroni, J. Matiasek and H. Trost.

28. Conclusion
• Compared to other analyzers, this morphological analyzer:
• achieves strong results with hardly any supervision;
• performs fewer alignment comparisons;
• produces a small number of misalignments.
• New classes of inflections in the selected language may be discovered by this algorithm, a feature that can assist linguists in exploring languages.
