Unsupervised and Knowledge-free Morpheme Segmentation and Analysis Stefan Bordag University of Leipzig • Components • Detailing • Compound splitting • Iterated LSV • Split trie training • Morpheme Analysis • Results • Discussion
1. Components • The main components of the current LSV-based segmentation algorithm: • Compound splitter (new) • LSV component (new: iterated) • Trie classifier (new: split into two phases) • Morpheme analysis (entirely new), based on: • Morpheme segmentation (see above) • Clustering of morphs into morphemes • Contextual similarity of morphemes • Main focus on modularity: each module has a specific function and could be replaced by a better algorithm
2.1. Compound Splitter • Based on the observation that especially long words pose a problem for LSV • Simple heuristic: split a word whenever it is decomposable into several words which have • a minimum length of 4 • a minimum frequency of 10 (or other arbitrary thresholds) • This misses many divisions but yields at least some correct ones (Precision at this point being more important than Recall) • P=88% R=10% F=18% • Where several decompositions are possible, the one with more parts and higher part frequencies wins (see the sketch below)
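A minimal sketch of this heuristic: the length and frequency thresholds come from the slide, while the helper names, the toy frequencies, and the exact tie-breaking score are illustrative assumptions (one plausible reading of "more parts with higher frequencies win"):

```python
# Sketch of the compound-splitting heuristic described above.
# `freq` maps every known word to its corpus frequency.

def splits(word, freq, min_len=4, min_freq=10):
    """Enumerate all ways to split `word` into known words."""
    if not word:
        yield []
        return
    for i in range(min_len, len(word) + 1):
        head = word[:i]
        if freq.get(head, 0) >= min_freq:
            for rest in splits(word[i:], freq, min_len, min_freq):
                yield [head] + rest

def best_split(word, freq):
    candidates = list(splits(word, freq))
    if not candidates:
        return [word]
    # Prefer decompositions with more parts, then higher total frequency.
    return max(candidates, key=lambda parts: (len(parts), sum(freq[p] for p in parts)))

freq = {"foot": 500, "print": 300, "footprint": 40, "prints": 120}
print(best_split("footprints", freq))   # ['foot', 'prints']
```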
2.2. Original solution in two parts • [Pipeline figure: sentences ("The talk was very informative") → co-occurrence counts ("the talk": 1, "talk was": 1, "talk speech": 20, "was is": 15, …) → contextually similar words → compute LSV with s = LSV * freq * multiletter * bigram, yielding first segmentations (clear-ly, lately, early, …) → train trie classifier on these → apply classifier, yielding clear-ly, late-ly, early, …]
2.3. Original letter successor variety • Letter successor variety: Harris (1955); a split is made wherever the number of distinct letters that can follow a given character sequence surpasses a threshold • Input: the 150 contextually most similar words • Observe how many different letters occur after each part of the string: • after #cle- only 1 letter • reversed: before -ly# there are 16 different letters (16 different stems preceding the suffix -ly#) • For # c l e a r l y #: varieties from the left are 28 5 3 1 1 1 1 1 (thus 5 different letters after #cl), from the right 1 1 2 1 3 16 10 14 (thus 10 different letters before -y#)
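A toy version of the variety counts (in the real system the input is the ~150 contextually most similar words; the tiny word list here means the numbers differ from the slide):

```python
# Count, for every prefix, how many distinct letters can follow it;
# reversing the words gives the variety from the right.

def successor_variety(words):
    succ = {}
    for w in words:
        w = "#" + w + "#"                 # word boundary markers
        for i in range(1, len(w)):
            succ.setdefault(w[:i], set()).add(w[i])
    return {prefix: len(letters) for prefix, letters in succ.items()}

words = ["clearly", "lately", "early", "clear", "late"]
left = successor_variety(words)
right = successor_variety([w[::-1] for w in words])

word = "clearly"
for i in range(1, len(word)):
    lv = left.get("#" + word[:i], 0)                      # letters after the prefix
    rv = right.get("#" + word[::-1][:len(word) - i], 0)   # letters before the suffix
    print(f"{word[:i]}-{word[i:]}  left={lv}  right={rv}")
```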
2.4. Balancing factors • LSV score for each possible boundary is not normalized and needs to be weighted against several factors that otherwise add noise: • freq: Frequency differences between beginning and middle of word • multiletter: Representation of single phonemes with several letters • bigram: Certain fixed combinations of letters • Final score s for each possible boundary is then: s = LSV * freq * multiletter * bigram
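As a worked example: with a raw LSV of 16 at a boundary and hypothetical weights freq = 0.8, multiletter = 1.0 and bigram = 0.5, the boundary score would be s = 16 · 0.8 · 1.0 · 0.5 = 6.4 (the weight values here are made up for illustration; the slides name the factors but not their exact form).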
2.5. Iterated LSV • Iterated LSV makes use of previously found information • For example, when computing ignited, with the most similar words already analysed into: • caus-ed, struck, injur-ed, blazed, fire, … • there is more evidence for ignit-ed, because most words ending in -ed were found to have -ed as a morpheme • Implemented as a weight iterLSV: iterLSV = #wordsEndingIsMorph / #wordsSameEnding (see the sketch below) • hence: s = LSV * freq * multiletter * bigram * iterLSV
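A sketch of the iterLSV weight; `analyses` (word → morph list) is an illustrative stand-in for the algorithm's store of previous results:

```python
# iterLSV: the fraction of already-analysed words with the same ending
# in which that ending was actually found to be a morpheme.

def iter_lsv_weight(suffix, analyses):
    """#wordsEndingIsMorph / #wordsSameEnding for the given suffix."""
    same_ending = [w for w in analyses if w.endswith(suffix)]
    if not same_ending:
        return 1.0                      # no prior evidence either way
    ending_is_morph = [w for w in same_ending if analyses[w][-1] == suffix]
    return len(ending_is_morph) / len(same_ending)

analyses = {"caused": ["caus", "ed"], "injured": ["injur", "ed"],
            "blazed": ["blaz", "ed"], "bed": ["bed"]}
print(iter_lsv_weight("ed", analyses))  # 0.75: 3 of 4 words ending in -ed
```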
2.6. Pat. Comp. Trie as Classificator root ly clear late root late ear ¤ ¤ ly ly=2 clear ¤=1 late ¤=1 ¤ cl ¤ late ly=1 ear ly=1 ¤ ¤ ¤=1 ¤=1 ¤ ¤ ly=1 cl ly=1 ¤ ¤=1 Apply deepest found node retrieve known information ¤ ly=1 Amazing?ly add known information dear?ly clear-ly, late-ly, early, Clear, late amazing-ly dearly
2.7. Splitting trie application • The trie classifier could decide for ignit-ed based on the top node of the reversed trie, -d, with classes -ed:50; -d:10; -ted:5; … • hence without taking any context within the word into account • The new version save_trie (as opposed to rec_trie) trains one trie from the LSV data and decides only if at least one more letter, in addition to the letters of the proposed morpheme, matches in the word • save_trie and rec_trie are then trained and applied consecutively • [Trie fragment: node ed (ed=2) with children r (ed=1, from caus-ed) and s (ed=1, from injur-ed); save_trie leaves ignited undecided, rec_trie then yields ignit-ed]
2.8. Effect of the improvements • compounds • P=88% R=10% F=18% • compounds + recTrie • P=66% R=28% F=39% • compounds + lsv_0 + recTrie • P=71% R=58% F=64% • compounds + lsv_2 + recTrie • P=69% R=63% F=66% • compounds + lsv_2 + saveTrie + recTrie • P=69% R=66% F=67% • Most notably, these changes reach the same performance level as the original lsv_0 + recTrie (F=70%) on a corpus a third of the size • However, applying the algorithm to a three times bigger corpus only increases the number of words split, not the quality of the splits!
3. Morpheme Analysis • Assumes visible morphs (i.e. the output of a segmentation algorithm) • This makes it possible to compute co-occurrences of morphs • which enables computing the contextual similarity of morphs • which enables clustering morphs into morphemes (see the sketch below) • Traditional representation of morphemes: • barefooted → BARE FOOT +PAST • flying → FLY_V +PCP1 • footprints → FOOT PRINT +PL • Equivalent representation used for processing: • barefooted → bare 5foot.6foot.foot ed • flying → fly inag.ing.ingu.iong • footprints → 5foot.6foot.foot prints
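A toy sketch of the co-occurrence and similarity steps; sentence-level counting and cosine similarity are illustrative choices, since the slides name the steps but not the measures:

```python
# Morph co-occurrence profiles and contextual similarity.
from collections import Counter
from itertools import combinations
from math import sqrt

def cooccurrence_vectors(segmented_sentences):
    """One co-occurrence profile (Counter) per morph, counted per sentence."""
    vecs = {}
    for morphs in segmented_sentences:
        for a, b in combinations(set(morphs), 2):
            vecs.setdefault(a, Counter())[b] += 1
            vecs.setdefault(b, Counter())[a] += 1
    return vecs

def cosine(u, v):
    dot = sum(u[m] * v[m] for m in set(u) & set(v))
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

sents = [["bare", "foot", "ed"], ["foot", "print", "s"], ["5foot", "print", "s"]]
vecs = cooccurrence_vectors(sents)
# Morphs with similar profiles (e.g. foot and 5foot) become candidates
# for clustering into one morpheme.
print(round(cosine(vecs["foot"], vecs["5foot"]), 2))   # 0.71
```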
3.1. Computing alternations

for each morph m
    for each contextually similar morph s of m
        if LD_similar(s, m)
            r = makeRule(s, m)
            store(r -> s, m)

for each word w
    for each morph m of w
        if in_store(m)
            sig = createSignature(m)
            write sig
        else
            write m

Worked example: m = foot, s = {feet, 5foot, …}; LD(foot, 5foot) = 1 gives the rule _-5 -> foot,5foot. For barefooted = {bare, foot, ed}, foot matches the rules _-5 and _-6, so its signature is foot.5foot.6foot
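A runnable reading of the pseudocode above. LD_similar, makeRule and createSignature are not specified on the slide; here Levenshtein distance 1 counts as similar, and a signature simply joins the variant morphs, so the whole block is an assumption-laden sketch:

```python
def ld(a, b):
    """Plain Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def build_signatures(similar):
    """similar: morph -> contextually similar morphs. Returns morph -> signature."""
    groups = {}
    for m, sims in similar.items():
        variants = {m} | {s for s in sims if ld(m, s) == 1}
        if len(variants) > 1:
            groups[m] = ".".join(sorted(variants))
    return groups

similar = {"foot": ["5foot", "6foot", "feet"], "ing": ["inag", "ingu"]}
print(build_signatures(similar)["foot"])   # 5foot.6foot.foot
```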
3.2. Real examples Rules: • m-s : 49.0 barem,bares blum,blus erem,eres estem,estes etem,etes eurem,eures ifm,ifs igem,iges ihrem,ihres jedem,jedes lme,lse losem,loses mache,sache mai,sai • _-u : 46.0 bahn,ubahn bdi,bdiu boot,uboot bootes,ubootes cor,coru dejan,dejuan dem,demu dem,deum die,dieu em,eum en,eun en,uen erin,eurin • m-r : 44.0 barem,barer dem,der demselb,derselb einem,einer ertem,erter estem,ester eurem,eurer igem,iger ihm,ihr ihme,ihre ihrem,ihrer jedem,jeder Signatures: • muessen muess.muesst.muss en • ihrer ihre.ihrem.ihren.ihrer.ihres • werde werd.wird.wuerd e • Ihren ihre.ihrem.ihren.ihrer.ihres.ihrn
3.3. More examples • kabinettsaufteilung → kabinet.kabinett.kabinetts aauf.aeuf.auf.aufs.dauf.hauf tail.teil.teile.teils.teilt bung.dung.kung.rung.tung.ung.ungs • entwaffnungsbericht → enkt.ent.entf.entp waff.waffn.waffne.waffnet lungs.rungs.tungs.ung.ungn.ungs berich.bericht • grundstuecksverwaltung → gruend.grund stuecks nver.sver.veer.ver walt bung.dung.kung.rung.tung.ung.ungs • grundt → gruend.grund t
4. Results (competition 1)

GERMAN
Author     Method              Precision  Recall   F-measure
Bernhard   1                   63.20%     37.69%   47.22%
Bernhard   2                   49.08%     57.35%   52.89%
Bordag     5                   60.71%     40.58%   48.64%
Bordag     5a                  60.45%     41.57%   49.27%
McNamee    3                   45.78%      9.28%   15.43%
Zeman      -                   52.79%     28.46%   36.98%
Monson&co  Morfessor           67.16%     36.83%   47.57%
Monson&co  ParaMor             59.05%     32.81%   42.19%
Monson&co  ParaMor&Morfessor   51.45%     55.55%   53.42%
Morfessor  MAP                 67.56%     36.92%   47.75%

ENGLISH
Author     Method              Precision  Recall   F-measure
Bernhard   1                   72.05%     52.47%   60.72%
Bernhard   2                   61.63%     60.01%   60.81%
Bordag     5                   59.80%     31.50%   41.27%
Bordag     5a                  59.69%     32.12%   41.77%
McNamee    3                   43.47%     17.55%   25.01%
Zeman      -                   52.98%     42.07%   46.90%
Monson&co  Morfessor           77.22%     33.95%   47.16%
Monson&co  ParaMor             48.46%     52.95%   50.61%
Monson&co  ParaMor&Morfessor   41.58%     65.08%   50.74%
Morfessor  MAP                 82.17%     33.08%   47.17%
4.1. Results (competition 1)

TURKISH
Author     Method              Precision  Recall   F-measure
Bernhard   1                   78.22%     10.93%   19.18%
Bernhard   2                   73.69%     14.80%   24.65%
Bordag     5                   81.44%     17.45%   28.75%
Bordag     5a                  81.31%     17.58%   28.91%
McNamee    3                   65.00%     10.83%   18.57%
McNamee    4                   85.49%      6.59%   12.24%
McNamee    5                   94.80%      3.31%    6.39%
Zeman      -                   65.81%     18.79%   29.23%
Morfessor  MAP                 76.36%     24.50%   37.10%

FINNISH
Author     Method              Precision  Recall   F-measure
Bernhard   1                   75.99%     25.01%   37.63%
Bernhard   2                   59.65%     40.44%   48.20%
Bordag     5                   71.72%     23.61%   35.52%
Bordag     5a                  71.32%     24.40%   36.36%
McNamee    3                   45.53%      8.56%   14.41%
McNamee    4                   68.09%      5.68%   10.49%
McNamee    5                   86.69%      3.35%    6.45%
Zeman      -                   58.84%     20.92%   30.87%
Morfessor  MAP                 76.83%     27.54%   40.55%
5.1. Problems of Morpheme Analysis • Surprise #1: the analysis has nearly no effect on the evaluation results! Possible reasons: • rules: type frequency is not taken into account (hence errors are overvalued) • rules: context is not taken into account (a context-sensitive rule such as _5f-_fo would be better than the bare _-5) • segmentation: produces many errors, so the analysis has to put up with a lot of noise
5.2. Problems of Segmentation • Surprise #2: corpus size has no large influence on the quality of the segmentations • it influences only how many nearly perfect segmentations are found by LSV • but that is by far outweighed by the errors of the trie • The strength of LSV is segmenting irregular words properly • because they have high frequency and are usually short • The strength of most other proposed methods is segmenting long and infrequent words • A combination is evidently desirable
5.3. Further avenues? • The most notable problem currently is the assumption that the phonemes representing a morph / morpheme are contiguous, i.e. AAA + BBB usually becomes AAABBB, not ABABAB • For languages that interleave morphemes this is inappropriate • A better solution might be similar to U-DOP by Rens Bod: • generate all possible parse trees for each token • then collate them for the type and generate the optimal parses • possibly generate tries not just for the type but also for some context, for example with the relevant context highlighted: Yesterday we arrived by plane.