ParaMor Across Morphology Finding Paradigms Christian Monson
Turkish Morphology – Beads on a String • götür (take) + ül (passive) + m (negative) + üyor (present progressive) + sun (2nd person singular) • götürülmüyorsun: You are not being taken
Applications of Computational Morphology • Machine Translation • Turkish-English (Oflazer, 2007) • Czech-English (Goldwater and McClosky, 2005) • Speech Recognition • Finnish (Creutz, 2006) • Information Retrieval
Challenges of Computational Morphology • Time Consuming for a New Language • Kemal Oflazer estimates • 3-4 months to build basic Turkish analyzer • Plus lexicon development and maintenance • Expertise Needed • Greenlandic • Official language of Greenland • Agglutinative Inuit language • 50,000 speakers • Per Langaard
The Solution Raw Text Unsupervised Morphology Induction
ParaMor – Paradigm Morphology • ParaMor • Unsupervised morphology induction system • Paradigm • The natural structure of morphology
Paradigms – The Structure of Morphology • Slots: Stem, Voice, Polarity, Tense & Mood, Person & Number • götür (Stem: take) + ül (Voice: passive) + m (Polarity: negative) + üyor (Tense & Mood: present progressive) + sun (Person & Number: 2nd person singular)
Paradigms – The Structure of Morphology • Person & Number: sun (2nd person singular), um (1st person singular), Ø (3rd person singular), uz (1st person plural)
Paradigms – The Structure of Morphology • Tense & Mood: üyor (present progressive), yecek (future)
Paradigms – The Structure of Morphology • Paradigm • Set of mutually replaceable strings • Examples: Tense & Mood (üyor, yecek); Person & Number (sun, um, Ø, uz)
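A minimal sketch of "mutually replaceable strings", using only the Person & Number slot from the slides above. The names `inflect` and `PERSON_NUMBER` are illustrative, not part of ParaMor, and the empty string stands in for Ø. The other slots are held fixed because the slides' simplified segmentation does not model vowel harmony.

```python
# Fixed "beads" taken from the slides: stem, voice, polarity, tense & mood.
STEM, VOICE, POLARITY, TENSE = "götür", "ül", "m", "üyor"

# The Person & Number paradigm: any member can replace any other.
# The empty string represents the null suffix (Ø).
PERSON_NUMBER = {
    "sun": "2nd person singular",
    "um": "1st person singular",
    "": "3rd person singular",
    "uz": "1st person plural",
}


def inflect(person_number_suffix: str) -> str:
    """Concatenate the fixed beads with one Person & Number suffix."""
    return STEM + VOICE + POLARITY + TENSE + person_number_suffix


# One surface form per paradigm member.
forms = {inflect(suffix): label for suffix, label in PERSON_NUMBER.items()}

for form, label in forms.items():
    print(f"{form}: {label}")
```

Swapping one suffix for another changes only the final bead, which is the regularity ParaMor looks for in raw text.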
The ParaMor Algorithm • Identify suffix paradigms in 3 steps • 1. Search for candidate paradigms • 2. Cluster candidates modeling the same paradigm • 3. Filter • Segment words using the discovered paradigms
Search for Candidate Paradigms • All character boundaries are candidate morpheme boundaries
Search for Candidate Paradigms • Begin search with the most frequent word-final string • Spanish: s (10662 stems), e.g. autorizaciones, buscabamos, costas, importadoras, vallas, …
Search for Candidate Paradigms • Identify the most frequent mutually replaceable string • Stems that occur with one suffix in a paradigm will likely occur with other suffixes in that paradigm • Spanish: s (10662 stems), then Ø s (5501 stems)
Search for Candidate Paradigms • Stop adding suffixes when the most frequent mutually replaceable string severely decreases the stem count • Spanish: s (10662 stems), Ø s (5501 stems), Ø r s (287 stems)
Search for Candidate Paradigms • Move on to the next most frequent word-final string • Spanish: a (8981 stems), after s (10662), Ø s (5501), Ø r s (287)
Search for Candidate Paradigms • Search paths shown, listed from each seed upward (suffix set, stem count):
• s 10662; Ø s 5501; Ø r s 287
• a 8981; a o 2304; a o os 1410; a as o os 892
• n 6051; Ø n 1874; Ø n r 509; Ø do n r 354; Ø dadas do dos n ndo r ron 118
• es 2751; Ø es 874
• an 1786; a an 1049; a an ar 413; a an ar ó 353; a ada adas ado ados an ar aron ó 149
• rado 167; rada rado 89; rada rado rados 67; rada radas rado rados 53; ra rada radas rado rados ran rar raron ró 23
• strado 15; strada strado 12; strada strado stró 9; strada strado strar stró 8; strada stradas strado strar stró 7
• ...
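The search described above can be sketched as a greedy loop. This is only an illustration under stated assumptions, not ParaMor's actual implementation: the helper names `build_index` and `grow_candidate` are hypothetical, the stopping threshold `min_ratio` is an invented stand-in for "severely decreases the stem count", and the empty string represents Ø.

```python
from collections import Counter, defaultdict


def build_index(vocab):
    """Map every non-empty word prefix (candidate stem) to the set of
    strings that follow it. Every character boundary is treated as a
    candidate morpheme boundary; "" means the stem is itself a word (Ø)."""
    index = defaultdict(set)
    for word in vocab:
        for i in range(1, len(word) + 1):
            index[word[:i]].add(word[i:])
    return index


def grow_candidate(seed, index, min_ratio=0.25):
    """Grow a candidate paradigm upward from one word-final string.

    min_ratio is a hypothetical parameter: stop when the best new suffix
    would keep fewer than min_ratio of the current stems."""
    suffixes = {seed}
    stems = {stem for stem, sufs in index.items() if seed in sufs}
    while True:
        # How many current stems also occur with each other suffix?
        counts = Counter(
            suf for stem in stems for suf in index[stem] if suf not in suffixes
        )
        if not counts:
            break
        # Most frequent mutually replaceable string (ties broken by string).
        best, best_count = max(counts.items(), key=lambda kv: (kv[1], kv[0]))
        if best_count < min_ratio * len(stems):
            break  # the stem count would drop too sharply
        suffixes.add(best)
        stems = {stem for stem in stems if best in index[stem]}
    return suffixes, stems


vocab = {"costa", "costas", "valla", "vallas", "casa", "casas", "mesas"}
suffixes, stems = grow_candidate("s", build_index(vocab))
print(sorted(suffixes), sorted(stems))
```

On this toy vocabulary, the seed s picks up Ø, and the stem mesa drops out because mesa alone is not in the vocabulary. This mirrors the drop from 10662 to 5501 stems in the Spanish example.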
Cluster Candidates per Paradigm • Candidates (suffix count: suffixes; stems; covered types):
• 15: a aba aban ada adas ado ados an ando ar aron arse ará arán ó; 25 Stems: anunci, aplic, apoy, celebr, consider, …; 375 Covered Types
• 15: a aba ada adas ado ados an ando ar ara aron arse ará arán ó; 23 Stems: anunci, apoy, confirm, consider, declar, …; 345 Covered Types
• 15: a aba ada adas ado ados an ando ar aron arse ará arán aría ó; 22 Stems: anunci, aplic, apoy, celebr, concentr, …; 330 Covered Types
• First merge (the 345-type and 330-type candidates): 16: a aba ada adas ado ados an ando ar ara aron arse ará arán aría ó; Cosine Similarity: 0.664; 451 Covered Types
• Second merge (the 16-suffix cluster and the 375-type candidate): 17: a aba aban ada adas ado ados an ando ar ara aron arse ará arán aría ó; Cosine Similarity: 0.715; 532 Covered Types
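The similarity figures above appear consistent with a cosine measure over the sets of covered word types (stem + suffix). That reading is an assumption: the slides do not define the measure. The function names below are illustrative. For two sets A and B, set cosine is |A ∩ B| / sqrt(|A| · |B|), and the overlap can be recovered from the slide's counts as |A| + |B| − |A ∪ B|.

```python
import math


def covered_types(stems, suffixes):
    """All stem + suffix word types licensed by a candidate paradigm."""
    return {stem + suffix for stem in stems for suffix in suffixes}


def cosine(types_a, types_b):
    """Cosine similarity between two sets of covered types."""
    if not types_a or not types_b:
        return 0.0
    return len(types_a & types_b) / math.sqrt(len(types_a) * len(types_b))


def cosine_from_counts(size_a, size_b, size_union):
    """Same measure, computed from set sizes alone.

    The overlap follows from inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|."""
    shared = size_a + size_b - size_union
    return shared / math.sqrt(size_a * size_b)


# First merge on the slide: 345 and 330 covered types, 451 after merging.
print(round(cosine_from_counts(345, 330, 451), 3))
# Second merge: the 451-type cluster and the 375-type candidate, 532 after merging.
print(round(cosine_from_counts(451, 375, 532), 3))
```

Under this assumption both printed values match the slide (0.664 and 0.715), and each candidate's covered-type count equals its suffix count times its stem count (for example 15 × 23 = 345).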
Filter Candidate Paradigms • 2 types of filtering • Remove small unclustered candidate paradigms • Remove candidates modeling unlikely morpheme boundaries (Harris, 1955)
Segment Words Using Paradigms • Word to segment: administradas
Segment Words Using Paradigms • Paradigm: a ada adas ado ados an ar aron ó ... • administradas ends in adas, and swapping in ada gives the attested word administrada • Segmentation: administr +adas
Segment Words Using Paradigms • Paradigm: a as o os • administradas ends in as, and swapping in a gives administrada • Segmentation: administrad +as
Segment Words Using Paradigms • Old way: Separate alternative analyses • administr +adas, administrad +as
Segment Words Using Paradigms • New way: Augment the current segmentation • administr +ad +as
Segment Words Using Paradigms • Paradigm: Ø s • administradas ends in s, and swapping in Ø gives administrada • Separate analyses: administr +adas, administrad +as, administrada +s • Augmented segmentation: administr +ad +a +s
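The "augment the current segmentation" idea can be sketched as collecting every boundary that some paradigm supports and cutting the word at all of them. This is a simplified illustration, not ParaMor's implementation: the function name `segment`, the check against a `vocab` set, and the use of the empty string for Ø are assumptions made for the example.

```python
def segment(word, paradigms, vocab):
    """Cut `word` at every boundary licensed by some paradigm.

    A boundary before suffix f is licensed when the word ends in f and
    the same stem also occurs in `vocab` with another suffix from the
    same paradigm (the two suffixes are mutually replaceable)."""
    cuts = set()
    for paradigm in paradigms:
        for suffix in paradigm:
            # Ø marks no visible boundary; the stem must be non-empty.
            if not suffix or not word.endswith(suffix) or len(word) <= len(suffix):
                continue
            stem = word[: -len(suffix)]
            if any(stem + other in vocab for other in paradigm if other != suffix):
                cuts.add(len(stem))
    # Augment one segmentation with all cuts instead of listing alternatives.
    pieces, start = [], 0
    for cut in sorted(cuts):
        pieces.append(word[start:cut])
        start = cut
    pieces.append(word[start:])
    return " +".join(pieces)


vocab = {"administrada", "administradas"}
paradigms = [
    {"a", "ada", "adas", "ado", "ados", "an", "ar", "aron", "ó"},
    {"a", "as", "o", "os"},
    {"", "s"},  # "" stands for Ø
]
print(segment("administradas", paradigms, vocab))
```

With the three paradigms from the slides, the cuts fall after administr, administrad, and administrada, which reproduces the augmented analysis administr +ad +a +s.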
Morpho Challenge 2007 • Peer-operated competition • For unsupervised morphology induction algorithms • 4 languages • English • German • Finnish • Turkish
ParaMor in Morpho Challenge 2007 • Developed on Spanish • ParaMor’s free parameters were frozen
2 Methods of Evaluation • Linguistic Segmentations compared to a morphologically analyzed lexicon