
ParaMor


Presentation Transcript


  1. ParaMor Across Morphology Finding Paradigms Christian Monson

  2. Turkish Morphology – Beads on a String present progressive 2nd person singular take passive negative You are not being taken

  3. Turkish Morphology – Beads on a String götür ül m üyor sun present progressive 2nd person singular take passive negative You are not being taken

  4. Applications of Computational Morphology • Machine Translation • Turkish-English (Oflazer, 2007) • Czech-English (Goldwater and McClosky, 2005) • Speech Recognition • Finnish (Creutz, 2006) • Information Retrieval

  5. Challenges of Computational Morphology • Time Consuming for a New Language • Kemal Oflazer estimates • 3-4 months to build basic Turkish analyzer • Plus lexicon development and maintenance • Expertise Needed • Greenlandic • Official language of Greenland • Agglutinative Inuit language • 50,000 speakers • Per Langaard

  6. The Solution Raw Text Unsupervised Morphology Induction

  7. ParaMor – Paradigm Morphology • ParaMor • Unsupervised morphology induction system • Paradigm • The natural structure of morphology

  8. Paradigms – The Structure of Morphology Tense & Mood Person & Number Stem Voice Polarity ül m üyor sun götür present progressive 2nd person singular take passive negative

  9. Paradigms – The Structure of Morphology Tense & Mood Person & Number Stem Voice Polarity ül m üyor um götür um present progressive take passive negative 1st person singular

  10. Paradigms – The Structure of Morphology Tense & Mood Person & Number Stem Voice Polarity ül m üyor um götür um Ø present progressive take passive negative 3rd person singular

  11. Paradigms – The Structure of Morphology Tense & Mood Person & Number Stem Voice Polarity ül m üyor um götür um Ø uz present progressive take passive negative 1st person plural

  12. Paradigms – The Structure of Morphology Tense & Mood Person & Number Stem Voice Polarity ül m üyor um götür um Ø uz present progressive take passive negative

  13. Paradigms – The Structure of Morphology Tense & Mood Person & Number Stem Voice Polarity ül m üyor um götür yecek um Ø uz take passive negative future

  14. Paradigms – The Structure of Morphology Tense & Mood Person & Number Stem Voice Polarity ül m üyor um götür yecek um Ø uz take passive negative

  15. Paradigms – The Structure of Morphology Tense & Mood Person & Number Stem Voice Polarity ül m üyor um yecek um Ø uz

  16. Paradigms – The Structure of Morphology Paradigms ül m üyor um yecek um Ø uz

  17. Paradigms – The Structure of Morphology Paradigms • Paradigm • Set of mutually replaceable strings ül m üyor um yecek um Ø uz

  18. Paradigms – The Structure of Morphology Paradigm • Paradigm • Set of mutually replaceable strings ül m üyor um yecek um Ø uz

  19. The ParaMor Algorithm • Identify suffix paradigms in 3 steps

  20. The ParaMor Algorithm • Identify suffix paradigms in 3 steps • Search for candidate paradigms

  21. The ParaMor Algorithm • Identify suffix paradigms in 3 steps • Search for candidate paradigms • Cluster candidates modeling the same paradigm

  22. The ParaMor Algorithm • Identify suffix paradigms in 3 steps • Search for candidate paradigms • Cluster candidates modeling the same paradigm • Filter

  23. The ParaMor Algorithm • Identify suffix paradigms in 3 steps • Search for candidate paradigms • Cluster candidates modeling the same paradigm • Filter • Segment words • Using the discovered paradigms

  24. Search for Candidate Paradigms • All character boundaries are candidate morpheme boundaries
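The idea that every character boundary is a candidate morpheme boundary can be sketched as a table from each word-final string to the stems it attaches to. This is a minimal illustration, not ParaMor's own code: the function name `suffix_to_stems` and the toy word list are assumptions, as is the choice to require a non-empty stem while allowing the empty suffix (written Ø on the slides).

```python
from collections import defaultdict


def suffix_to_stems(words):
    """Map every word-final string to the set of stems it follows.

    Hypothetical helper: each position in a word is treated as a possible
    stem/suffix boundary. The empty string stands for the null suffix.
    """
    table = defaultdict(set)
    for word in set(words):
        # i is the length of the stem; the stem is never empty here.
        for i in range(1, len(word) + 1):
            table[word[i:]].add(word[:i])
    return table


# Toy vocabulary (an assumption, loosely modeled on the Spanish examples).
toy = ["costa", "costas", "valla", "vallas", "busca", "buscar"]
table = suffix_to_stems(toy)
print(sorted(table["s"]))   # stems that take the word-final string "s"
print(len(table[""]))       # every word is a stem of the null suffix
```

The size of each stem set is the count shown next to a suffix set on the slides (for example, 10662 for `s` in the Spanish corpus).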

  25. Search for Candidate Paradigms • Begin search with the most frequent word-final string Spanish autorizaciones buscabamos costas importadoras vallas … s 10662

  26. Search for Candidate Paradigms • Identify the most frequent mutually replaceable string • Stems that occur with one suffix in a paradigm will likely occur with other suffixes in that paradigm Spanish autorizaciones buscabamos costas importadoras vallas … Ø s 5501 s 10662

  27. Search for Candidate Paradigms • Stop adding suffixes • When the most frequent mutually replaceable string severely decreases the stem count Ø r s 287 autorizaciones buscabamos costas importadoras vallas … Ø s 5501 s 10662

  28. Search for Candidate Paradigms • Move on to the next most frequent word-final string Ø r s 287 Ø s 5501 a 8981 s 10662

  29. Search for Candidate Paradigms a as o os 892 a o os 1410 Ø r s 287 a o 2304 Ø s 5501 a 8981 s 10662

  30. Search for Candidate Paradigms Ø dadas do dos n ndo r ron 118 a as o os 892 Ø do n r 354 Ø n r 509 a o os 1410 Ø r s 287 Ø n 1874 a o 2304 Ø s 5501 n 6051 a 8981 s 10662

  31. Search for Candidate Paradigms Ø dadas do dos n ndo r ron 118 a as o os 892 Ø do n r 354 Ø n r 509 a o os 1410 Ø r s 287 Ø es 874 Ø n 1874 a o 2304 Ø s 5501 es 2751 n 6051 a 8981 s 10662

  32. Search for Candidate Paradigms a ada adas ado ados an ar aron ó 149 Ø dadas do dos n ndo r ron 118 a an ar ó 353 a as o os 892 Ø do n r 354 a an ar 413 Ø n r 509 a o os 1410 Ø r s 287 a an 1049 Ø es 874 Ø n 1874 a o 2304 Ø s 5501 an 1786 es 2751 n 6051 a 8981 s 10662

  33. Search for Candidate Paradigms ra rada radas rado rados ran rar raron ró 23 a ada adas ado ados an ar aron ó 149 Ø dadas do dos n ndo r ron 118 strada stradas strado strar stró 7 a an ar ó 353 a as o os 892 strada strado strar stró 8 rada radas rado rados 53 Ø do n r 354 strada strado stró 9 rada rado rados 67 a an ar 413 Ø n r 509 a o os 1410 Ø r s 287 strada strado 12 rada rado 89 a an 1049 Ø es 874 Ø n 1874 a o 2304 Ø s 5501 strado 15 rado 167 an 1786 es 2751 n 6051 a 8981 s 10662 ... ...
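The search walked through on the preceding slides can be approximated by a greedy loop: start from one word-final string, repeatedly add the suffix that keeps the most stems, and stop when the stem count would drop too sharply. The sketch below is written under assumptions: the slides give no numeric threshold for "severely decreases", so `min_ratio` and its default of 0.25 are illustrative only, and the names `grow_candidate` and `suffix_to_stems` are hypothetical.

```python
from collections import defaultdict


def suffix_to_stems(words):
    """Map each word-final string (including the null suffix "") to its stems."""
    table = defaultdict(set)
    for word in set(words):
        for i in range(1, len(word) + 1):
            table[word[i:]].add(word[:i])
    return table


def grow_candidate(table, seed, min_ratio=0.25):
    """Greedily grow a candidate paradigm from a seed suffix.

    min_ratio is an assumed stopping parameter: a suffix is added only if
    at least this fraction of the current stems also occur with it.
    """
    suffixes = {seed}
    stems = set(table[seed])
    while True:
        best, best_stems = None, set()
        # Sorted iteration makes tie-breaking deterministic.
        for suffix in sorted(table):
            if suffix in suffixes:
                continue
            shared = stems & table[suffix]
            if len(shared) > len(best_stems):
                best, best_stems = suffix, shared
        if best is None or len(best_stems) < min_ratio * len(stems):
            return suffixes, stems
        suffixes.add(best)
        stems = best_stems


# Toy vocabulary (an assumption): three stems with a/as/o/os, plus mesa/mesas.
words = [stem + suf for stem in ("gat", "perr", "chic")
         for suf in ("a", "as", "o", "os")] + ["mesa", "mesas"]
table = suffix_to_stems(words)

from_s = grow_candidate(table, "s")
from_a = grow_candidate(table, "a")
strict = grow_candidate(table, "a", min_ratio=0.8)
print(sorted(from_a[0]), sorted(from_a[1]))
```

On this toy data the seed `s` grows into the pair Ø, s and the seed `a` grows into a, as, o, os, mirroring the shape of the Spanish search paths; a stricter `min_ratio` stops earlier, at a, as.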

  34. Cluster Candidates per Paradigm 15: a aba ada adas ado ados an ando ar ara aron arse ará arán ó 23 Stems: anunci, apoy, confirm, consider, declar, … 345 Covered Types 15: a aba ada adas ado ados an ando ar aron arse ará arán aría ó 22 Stems: anunci, aplic, apoy, celebr, concentr, … 330 Covered Types

  35. Cluster Candidates per Paradigm 16: a aba ada adas ado ados an ando ar ara aron arse ará arán aría ó Cosine Similarity: 0.664 451 Covered Types 15: a aba ada adas ado ados an ando ar ara aron arse ará arán ó 23 Stems: anunci, apoy, confirm, consider, declar, … 345 Covered Types 15: a aba ada adas ado ados an ando ar aron arse ará arán aría ó 22 Stems: anunci, aplic, apoy, celebr, concentr, … 330 Covered Types

  36. Cluster Candidates per Paradigm 16: a aba ada adas ado ados an ando ar ara aron arse ará arán aría ó Cosine Similarity: 0.664 451 Covered Types 15: a aba aban ada adas ado ados an ando ar aron arse ará arán ó 25 Stems: anunci, aplic, apoy, celebr, consider, … 375 Covered Types 15: a aba ada adas ado ados an ando ar ara aron arse ará arán ó 23 Stems: anunci, apoy, confirm, consider, declar, … 345 Covered Types 15: a aba ada adas ado ados an ando ar aron arse ará arán aría ó 22 Stems: anunci, aplic, apoy, celebr, concentr, … 330 Covered Types

  37. Cluster Candidates per Paradigm 17: a aba aban ada adas ado ados an ando ar ara aron arse ará arán aría ó Cosine Similarity: 0.715 532 Covered Types 16: a aba ada adas ado ados an ando ar ara aron arse ará arán aría ó Cosine Similarity: 0.664 451 Covered Types 15: a aba aban ada adas ado ados an ando ar aron arse ará arán ó 25 Stems: anunci, aplic, apoy, celebr, consider, … 375 Covered Types 15: a aba ada adas ado ados an ando ar ara aron arse ará arán ó 23 Stems: anunci, apoy, confirm, consider, declar, … 345 Covered Types 15: a aba ada adas ado ados an ando ar aron arse ará arán aría ó 22 Stems: anunci, aplic, apoy, celebr, concentr, … 330 Covered Types
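The cosine similarities on these slides are consistent with treating each candidate as the set of word types it covers and assuming the merged cluster covers the union of those sets. Under that assumption the overlap follows from inclusion-exclusion, and the slide figures can be checked directly. The function names below are hypothetical.

```python
import math


def cosine(types_a, types_b):
    """Cosine similarity of two sets viewed as binary vectors."""
    return len(types_a & types_b) / math.sqrt(len(types_a) * len(types_b))


def cosine_from_counts(size_a, size_b, size_union):
    """Same measure, computed from set sizes alone.

    Assumes size_union is the size of the union of the two sets, so the
    overlap is size_a + size_b - size_union.
    """
    overlap = size_a + size_b - size_union
    return overlap / math.sqrt(size_a * size_b)


# Covered-type counts taken from the slides.
first_merge = cosine_from_counts(345, 330, 451)
second_merge = cosine_from_counts(451, 375, 532)
print(round(first_merge, 3), round(second_merge, 3))  # 0.664 0.715
```

Both values match the similarities printed on the slides, which supports (but does not prove) the reading that clustering compares covered word types rather than suffix lists alone.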

  38. Filter Candidate Paradigms • 2 types of filtering • Remove small unclustered candidate paradigms • Remove candidates modeling unlikely morpheme boundaries (Harris, 1955)
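The two filters might look roughly like the sketch below. The slides do not say how ParaMor sizes a cluster or how it applies Harris (1955), so everything here is an assumption: the dictionary keys `members` and `covered_types`, the `min_types` threshold, and the use of plain letter successor variety (the number of distinct letters that follow a given prefix in the vocabulary) as the boundary signal.

```python
def drop_small_unclustered(clusters, min_types):
    """Keep clusters that merged several candidates or cover enough types.

    Each cluster is a dict with hypothetical keys "members" (number of
    candidates merged) and "covered_types" (word types covered).
    """
    return [c for c in clusters
            if c["members"] > 1 or c["covered_types"] >= min_types]


def successor_variety(words, prefix):
    """Count distinct letters that follow prefix across the vocabulary.

    A high count suggests that a morpheme boundary after prefix is
    plausible; a count of 1 suggests the boundary falls inside a morpheme.
    """
    return len({w[len(prefix)] for w in words
                if w.startswith(prefix) and len(w) > len(prefix)})


# Toy data (assumptions, not figures from the talk).
clusters = [
    {"members": 3, "covered_types": 532},
    {"members": 1, "covered_types": 12},
]
kept = drop_small_unclustered(clusters, min_types=50)

verbs = ["habla", "hablar", "hablo", "hablé", "hablamos"]
after_habl = successor_variety(verbs, "habl")
after_hab = successor_variety(verbs, "hab")
print(len(kept), after_habl, after_hab)
```

In the toy verbs, three different letters follow `habl` but only one follows `hab`, so a candidate that places the boundary after `hab` would be the one to discard.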

  39. Segment Words Using Paradigms administradas

  40. Segment Words Using Paradigms administradas a ada adas ado ados an ar aron ó ...

  41. Segment Words Using Paradigms administrada administradas a ada adas ado ados an ar aron ó ...

  42. Segment Words Using Paradigms administrada administradas administr +adas a ada adas ado ados an ar aron ó ...

  43. Segment Words Using Paradigms administrada administradas administr +adas a as o os

  44. Segment Words Using Paradigms administrada administradas administr +adas, administrad +as Old way: Separate alternative analysis a as o os

  45. Segment Words Using Paradigms administrada administradas administr +adas, administrad +as New way: Augment the current segmentation administr +ad +as a as o os

  46. Segment Words Using Paradigms administrada Ø administradas administr +adas, administrad +as, administrada +s administr +ad +a +s Ø s
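The segmentation walk-through above can be sketched as follows, assuming the rule suggested by the slides: a boundary is proposed before a word-final suffix when swapping that suffix for another member of the same paradigm yields a word that is also in the vocabulary. The boundaries from all paradigms are then merged into a single analysis (the "new way"). The function name `segment` and the two-word vocabulary are illustrative assumptions.

```python
def segment(word, paradigms, vocabulary):
    """Return word with " +" inserted at every proposed morpheme boundary.

    paradigms is a list of suffix sets; "" stands for the null suffix Ø.
    """
    cuts = set()
    for paradigm in paradigms:
        for suffix in paradigm:
            if not suffix or not word.endswith(suffix):
                continue
            stem = word[:len(word) - len(suffix)]
            if not stem:
                continue
            # Evidence for the boundary: the stem occurs with a different
            # suffix from the same paradigm.
            if any(other != suffix and stem + other in vocabulary
                   for other in paradigm):
                cuts.add(len(stem))
    pieces, start = [], 0
    for cut in sorted(cuts):
        pieces.append(word[start:cut])
        start = cut
    pieces.append(word[start:])
    return " +".join(pieces)


paradigms = [
    {"a", "ada", "adas", "ado", "ados", "an", "ar", "aron", "ó"},
    {"a", "as", "o", "os"},
    {"", "s"},
]
vocabulary = {"administrada", "administradas"}
result = segment("administradas", paradigms, vocabulary)
print(result)  # administr +ad +a +s
```

With only the first paradigm the same call yields `administr +adas`; each additional paradigm contributes one more boundary, which is the augmenting behavior the slides contrast with listing separate alternative analyses.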

  47. Morpho Challenge 2007 • Peer operated competition • For unsupervised morphology induction algorithms • 4 languages • English • German • Finnish • Turkish

  48. ParaMor in Morpho Challenge 2007 • Developed on Spanish • ParaMor’s free parameters were frozen

  49. 2 Methods of Evaluation • Linguistic Segmentations compared to a morphologically analyzed lexicon

