Explore ParaMor, a morphological parsing algorithm for unsupervised learning in computational linguistics. Learn computational techniques, language structures, and tools to enhance morphological analysis and improve machine translation.
ParaMor & Morpho Challenge 2008
Christian Monson, Jaime Carbonell, Alon Lavie, Lori Levin
Turkish Morphology – Beads on a String
One Turkish word: götür + ül + m + üyor + sun
Glosses: take, passive, negative, present progressive, 2nd person singular
‘You are not being taken’
Computational Morphology Improves:
• Machine Translation: Turkish-English (Oflazer, 2007), Czech-English (Goldwater and McClosky, 2005)
• Information Retrieval: English, German, Finnish (Kurimo et al., 2008)
• Speech Recognition: Finnish (Creutz, 2006)
• Grapheme-to-Phoneme Conversion: German (Demberg, 2007)
Morphology is Complex – Operations: Prefixation, Suffixation, Reduplication, Infixation
Morphology is Complex – Morphophonology
götür + ül + m + üyor + sun (take, passive, negative, present progressive, 2nd person singular): ‘You are not being taken’
götür + ül + m + yecek + sun (take, passive, negative, future, 2nd person singular): ‘You will not be taken’
Before yecek, the negative m surfaces as me: götür + ül + me + yecek + sun
After yecek, the 2nd person singular sun surfaces as sin: götür + ül + me + yecek + sin
Morphology is Complex – Ambiguity
Hungarian: mentek
• men +tek = go +Present.2nd.Plural: ‘yinz go’
• men +t +ek = go +PastParticiple +Plural: ‘those who have gone’
In Morphology Systems for New Languages, Complexity Means Time + Expertise
• Expertise: Kemal Oflazer, expert on Turkish computational morphology
• Time: 3-4 months to manually build a basic Turkish analyzer, plus lexicon development and maintenance
The Solution: Unsupervised Morphology Induction, from Raw Text to Language Structure
Techniques for Unsupervised Morphology Induction
• Transition Likelihood: Harris (1955) – Finite State Automata; Bernhard (2007)
• Minimum Description Length: Goldsmith (2001, 2006); Creutz’s Morfessor (2006)
• Contextual Similarity: Wicentowski (2002); Schone (2002)
• The Paradigm: Snover (2002); ParaMor (2007)
What is a Paradigm?
götür + ül + m + üyor + sun (take, passive, negative, present progressive, 2nd person singular)
Paradigms Structure Inflectional Morphology
Person & Number: after götür + ül + m + üyor, the final slot can hold sun (2nd person singular), um (1st person singular), Ø (3rd person singular), or uz
Paradigm: a set of mutually substitutable morphological operations
Each slot of the word is a paradigm:
• Voice: ül
• Polarity: m
• Tense & Aspect: üyor, yecek
• Person & Number: um, Ø, uz
The ParaMor Algorithm
Paradigm: a set of mutually substitutable strings (ül; m; üyor, yecek; um, Ø, uz)
Candidate Stems + Paradigm, split at 1 Morpheme Boundary
The ParaMor Algorithm – Simplifying Assumptions
• Suffixes only: 70% of the World’s Languages are Suffixing (Dryer, 2005)
• Strict Concatenation
Only a High-Level Overview
The ParaMor Algorithm
Identify Paradigms in 3 Steps
• Search for candidate paradigms
• Cluster candidates modeling the same paradigm
• Filter least likely candidates
Segment Words
• Using the discovered paradigms
The ParaMor Algorithm Identify Paradigms in 3 Steps • Search for candidate paradigms • Cluster candidates modeling the same paradigm • Filter Segment Words Using the discovered paradigms Today
Search for Candidate Paradigms
• Propose a morpheme boundary at every character boundary in every word
• Consolidate identical candidate suffixes into paradigm seeds
Spanish Example: word list of 50,000 types (autorizaciones, buscabamos, costas, importadoras, vallas, …)
Paradigm seed: s (10697 stems)
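The two steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not ParaMor's actual implementation: the function name `candidate_suffix_stems`, the toy word list, and the requirement that a candidate stem be non-empty are all assumptions made for the example.

```python
from collections import defaultdict


def candidate_suffix_stems(words):
    """Map each candidate suffix to the set of candidate stems it follows."""
    stems_by_suffix = defaultdict(set)
    for word in set(words):  # work on word types, not tokens
        # Propose a boundary after every character, including the end of the
        # word, which yields the empty suffix (written "Ø" on the slides).
        # Assumption: the candidate stem must be non-empty.
        for i in range(1, len(word) + 1):
            stem, suffix = word[:i], word[i:]
            stems_by_suffix[suffix or "Ø"].add(stem)
    return stems_by_suffix


# Toy word list (hypothetical; the slides use 50,000 Spanish types).
words = ["costa", "costas", "costar", "valla", "vallas",
         "importadora", "importadoras"]
table = candidate_suffix_stems(words)

# Consolidated seeds: one entry per suffix, with its stem count.
seed_counts = {suffix: len(stems) for suffix, stems in table.items()}
print(sorted(table["s"]), seed_counts["s"])
```

On this toy list the seed s collects the stems costa, importadora, and valla, which mirrors how the slide's s seed collects 10697 stems from the full word list.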
Search for Candidate Paradigms
• Identify the most frequent mutually replaceable candidate suffix
• Stems that occur with one suffix in a paradigm will likely occur with other suffixes in that paradigm
Spanish Example: costaØ and costas, importadoraØ and importadoras, vallaØ and vallas
s (10697 stems) grows to Ø s (5513 stems)
Search for Candidate Paradigms
• A parameter halts the introduction of suffixes when the most frequent mutually replaceable candidate suffix severely decreases the stem count
Spanish Example: costar, costaØ, costas
s (10697 stems), then Ø s (5513 stems), then Ø r s (281 stems)
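The growth-and-halt behaviour described here can be sketched as a greedy loop. This is a rough illustration under stated assumptions, not ParaMor's actual search: `grow_paradigm`, the `min_ratio` threshold (a hypothetical stand-in for the halting parameter, whose real form and value the slides do not give), and the toy word list are all invented for the example.

```python
from collections import defaultdict


def candidate_suffix_stems(words):
    """Map each candidate suffix ("Ø" for empty) to its candidate stems."""
    table = defaultdict(set)
    for word in set(words):
        for i in range(1, len(word) + 1):
            table[word[i:] or "Ø"].add(word[:i])
    return table


def grow_paradigm(seed, table, min_ratio=0.5):
    """Greedily add suffixes to a seed until the stem count drops too far.

    min_ratio is a hypothetical halting parameter: a new suffix is accepted
    only if at least this fraction of the current stems also occur with it.
    """
    suffixes = [seed]
    stems = set(table[seed])
    while True:
        # Find the unused suffix shared by the most current stems, i.e. the
        # most frequent mutually replaceable candidate suffix.
        best, best_stems = None, set()
        for suffix in sorted(table):  # sorted for reproducible tie-breaking
            if suffix in suffixes:
                continue
            shared = stems & table[suffix]
            if len(shared) > len(best_stems):
                best, best_stems = suffix, shared
        # Halt when nothing is left or the stem count would fall too sharply.
        if best is None or len(best_stems) < min_ratio * len(stems):
            return suffixes, stems
        suffixes.append(best)
        stems = best_stems


# Toy word list (hypothetical; the slides use 50,000 Spanish types).
words = ["costa", "costas", "costar", "valla", "vallas",
         "importadora", "importadoras"]
table = candidate_suffix_stems(words)
suffixes, stems = grow_paradigm("s", table)
print(suffixes, sorted(stems))
```

On the toy list, s grows to Ø s with all three stems intact, and the loop stops before adding r because only costa occurs with r. Lowering `min_ratio` to 0.25 lets r in and leaves a single stem, which is the kind of drop the halting parameter is meant to guard against.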
Search for Candidate Paradigms
• Parameters were set to produce high-recall Spanish paradigms, and then frozen
Search for Candidate Paradigms
• Move on to the next most frequent paradigm seed
From seed s: s (10697 stems), Ø s (5513 stems), Ø r s (281 stems)
From seed a: a (9020 stems), a o (2325 stems), a o os (1418 stems), a as o os (899 stems)