Finite-State Methods in Natural Language Processing • Lauri Karttunen • LSA 2005 Summer Institute • August 3, 2005
August 1 • Non-concatenative morphotactics • Reduplication, interdigitation • Realizational morphology • Readings • Chapter 8. “Non-Concatenative Morphotactics” • Gregory T. Stump. Inflectional Morphology. A Theory of Paradigm Structure. Cambridge U. Press. 2001. (An excerpt) • Lauri Karttunen, “Computing with Realizational Morphology”, Lecture Notes in Computer Science, Volume 2588, Alexander Gelbukh (ed.), 205-216, Springer Verlag. 2003. • August 3 • Optimality theory • Readings • Paul Kiparsky, “Finnish Noun Inflection”, Generative Approaches to Finnic and Saami Linguistics, Diane Nelson and Satu Manninen (eds.), pp. 109-161, CSLI Publications, 2003. • Nine Elenbaas and René Kager, “Ternary rhythm and the lapse constraint”, Phonology 16, 273-329. 1999.
Background • Two old strains of finite-state (morpho)phonology • rewrite rules (Chomsky & Halle 1968) • two-level constraints (Koskenniemi 1983) • Optimality theory (Prince & Smolensky 1993) • two-level model with ranked, violable constraints • Formal Power • OT is not a finite-state system if it involves unlimited counting of constraint violations. (Ellison 1994, Eisner 1997, Frank & Satta 1998) • But a finite-state model can be useful for OT.
Optimality theory • Prince & Smolensky 1993 • eliminate • rules • derivations • introduce • violable ranked constraints • Instant success!
Brief Introduction to OT • Input • A language of underlying lexical forms. • GEN • A function that generates alternate surface realizations for each input form, possibly an infinite set. • Constraints • A finite set of principles, preferably universal, that filter out unwanted realizations. • Ranking • A language-specific ordering of the constraints.
Computational perspective • Ellison 1994 • OT deals with regular sets and relations: a finite-state system • constraint transducers mark violations, marks sorted and counted • Tesar 1995 • dynamic programming algorithm for optimal path computations • Eisner 1996 • two-level typology of optimality constraints: restrict, prohibit • “FootForm Decomposed” MIT Working Papers in Linguistics, 31:115-143 proposes Primitive Optimality Theory (no generalized alignment) • Karttunen 1998 • Introduces lenient composition • Frank & Satta 1998 • Prove that OT is regular if the number of violations is bounded.
Comparisons
                          Application                Merging
rewrite rules             composition                composition
two-level constraints     intersecting composition   intersection
optimality constraints    lenient composition        lenient composition
Finnish OT Prosody Lauri Karttunen CLS-41 April 7, 2005
Finnish Prosody: basic facts • The nucleus of a Finnish syllable must consist of a short vowel, a long vowel, or a diphthong. • Main stress is always on the first syllable, secondary stress occurs on non-initial syllables. • Adjacent syllables are never stressed. • Stressed syllable is initial in the foot. • ilmoittautuminen ‘registering’ (Nom Sg) • (íl.moit).(tàu.tu).(mì.nen)
Ternary feet in Finnish • Stress that would fall on a light syllable shifts to the following heavy syllable, creating a ternary foot. • (ká.las).te.(lèm.me) ‘we are fishing’ • (íl.moit).(tàu.tu).mi.(sès.ta) ‘registering’ (Ela Sg) • (rá.kas).ta.(jàt.ta).ri.(àn.sa) ‘his mistresses’ (Par Pl) • Can we get these facts to come out “for free”, from the interaction of independently motivated principles? • Yes! • Paul Kiparsky, “Finnish Noun Inflection”, Generative Approaches to Finnic and Saami Linguistics, Diane Nelson and Satu Manninen (eds.), pp. 109-161, CSLI Publications, 2003. • Nine Elenbaas and René Kager, “Ternary rhythm and the lapse constraint”, Phonology 16, 273-329. 1999.
Non-OT and OT solutions • It is possible to define a cascade of replace rules that produce the desired result. • http://www.stanford.edu/~laurik/fsmbook/examples/FinnishProsody.html • But, following Kiparsky, we are going to do OT today, and in a more elegant way than is shown at • http://www.stanford.edu/~laurik/fsmbook/examples/FinnishOTProsody.html
Prelude: Built-in Functions in fst • Case conversion • UpCase( OptUpCase( • DownCase( OptDownCase( • Cap( OptCap( • AnyCase( • Cap({hello}) is equivalent to {Hello} • OptUpCase(a:b, L) is equivalent to [a:B | a:b] ; • Symbol manipulation • Explode( Implode( • regex Explode("+Test") is equivalent to regex {+Test};
Functions: User-defined • The function definition is attached to a symbol ending with ( • The definition is any regular expression. • There may be any number of arguments. • define Redup(X) [X X]; • define Apply(X, Y) [X .o. Y].l ; • When the function is used in a regular expression, the arguments are bound and the function is evaluated. • regex Apply({abc}, a -> x || _ b); • print words • xbc • The definition of a function may contain other functions.
Pig Latin
# This script creates a function for translating from English to Pig Latin:
# pig -> igpay, brown -> ownbray, script -> iptscray
define C [b|c|d|f|g|h|j|k|l|m|n|p|q|r|s|t|v|w|x|y|z];
define V [a|e|i|o|u] ;
define Redup(X) [X "." X];
define DelCons(X) [X .o. C+ @-> 0 || .#. _ ];
define TailToAy(X) [X .o. V ?* @-> {ay} || "." C* _ ];
define DelMiddle(X) [X .o. "." -> 0];
define Pig(X) [DelMiddle(TailToAy(DelCons(Redup(X))))];
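The four steps of the Pig Latin script can be traced outside fst as well. The following is only a rough Python sketch that imitates each step with string operations and regular expressions; it does not build transducers, and the function names (redup, del_cons, tail_to_ay, del_middle, pig) are hypothetical stand-ins for the script's Redup, DelCons, TailToAy, DelMiddle, and Pig.

```python
import re

CONS = "bcdfghjklmnpqrstvwxyz"   # same consonant set as C in the script
VOWELS = "aeiou"                  # same vowel set as V in the script

def redup(word):
    # Redup(X): X "." X, e.g. pig -> pig.pig
    return f"{word}.{word}"

def del_cons(s):
    # DelCons: delete the word-initial consonant cluster, pig.pig -> ig.pig
    return re.sub(rf"^[{CONS}]+", "", s)

def tail_to_ay(s):
    # TailToAy: after "." and any consonants, replace the vowel and
    # everything that follows it with "ay", ig.pig -> ig.pay
    return re.sub(rf"(\.[{CONS}]*)[{VOWELS}].*$", r"\1ay", s)

def del_middle(s):
    # DelMiddle: delete the "." separator, ig.pay -> igpay
    return s.replace(".", "")

def pig(word):
    return del_middle(tail_to_ay(del_cons(redup(word))))

for w in ("pig", "brown", "script"):
    print(w, "->", pig(w))
```

Unlike the fst version, this sketch maps one string at a time; the point of defining Pig( as a function over regular expressions is that it applies to a whole language at once.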
Demo! • fst -l piglatin.script
Computing with OT • Compose the input language with GEN to produce a mapping from each input form to all of its output candidates: Input language .o. GEN • Eliminate suboptimal candidates by applying the constraints in the ranked order (Constraint 1, Constraint 2, ...). At least one output candidate always survives. • By what finite-state operation?
Priority union .P. • R = { a:x, b:y } • Q = { b:z, c:w } • R .P. Q = { a:x, b:y, c:w } • All pairs from R and those pairs from Q that do not conflict with the mapping established by R. • R .P. Q = [ R | [~[R.u] .o. Q] ] • Kaplan 1987
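For finite relations the definition can be checked directly. The following is a minimal Python sketch, assuming a relation is represented as a set of (upper, lower) string pairs; the function name priority_union is hypothetical, and real finite-state tools compute the same result on transducers rather than on enumerated pairs.

```python
def priority_union(r, q):
    """R .P. Q: every pair of R, plus the pairs of Q whose upper side
    is not in the upper language of R (the ~[R.u] .o. Q part)."""
    upper_r = {upper for upper, _ in r}
    return set(r) | {(upper, lower) for upper, lower in q
                     if upper not in upper_r}

R = {("a", "x"), ("b", "y")}
Q = {("b", "z"), ("c", "w")}

print(sorted(priority_union(R, Q)))
# [('a', 'x'), ('b', 'y'), ('c', 'w')]
```

The pair b:z from Q is dropped because R already maps b, which is what "do not conflict with the mapping established by R" amounts to.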
Lenient Composition .O. • Let R be a relation that maps each input string to one or more outputs. • Let C be a constraint that eliminates some outputs. • R .O. C is the relation that maps each input string that can meet the constraint C to the outputs that meet C and leaves the rest of the relation R unchanged. (Karttunen 1998) • R .O. C = [ [R .o. C] .P. R ] • Is constraint ranking rule ordering in disguise? Yes.
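The definition R .O. C = [ [R .o. C] .P. R ] can be illustrated on a small finite relation. This is a sketch under simplifying assumptions: relations are sets of (input, output) pairs, the constraint C is given as a Python predicate on output strings, and the names restrict_outputs and lenient_compose are hypothetical.

```python
def priority_union(r, q):
    # R .P. Q: all of R, plus pairs of Q whose input R does not map
    upper_r = {upper for upper, _ in r}
    return set(r) | {(u, l) for u, l in q if u not in upper_r}

def restrict_outputs(r, allowed):
    # R .o. C, with the constraint C given as a predicate on outputs
    return {(u, l) for u, l in r if allowed(l)}

def lenient_compose(r, allowed):
    # R .O. C = [R .o. C] .P. R
    return priority_union(restrict_outputs(r, allowed), r)

# Toy data: in1 has one output that satisfies the constraint, in2 has none.
R = {("in1", "aa"), ("in1", "ab"), ("in2", "bb")}
no_b = lambda s: "b" not in s

print(sorted(lenient_compose(R, no_b)))
# [('in1', 'aa'), ('in2', 'bb')]
```

The input in2 keeps its only output even though that output violates the constraint; this is why at least one candidate always survives.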
Need a prolific GEN • kala ‘fish’ (Nom Sg): 33 candidates (☜ marks the optimal candidate)
ka.la ka.lá ka.là ka.(là) ka.(lá)
ká.la ká.lá ká.là ká.(là) ká.(lá)
kà.la kà.lá kà.là kà.(là) kà.(lá)
(ká).la (ká).lá (ká).là (ká).(là) (ká).(lá)
(kà).la (kà).lá (kà).là (kà).(là) (kà).(lá)
(ká.la) ☜ (ká.là) (ká.lá)
(kà.la) (kà.là) (kà.lá)
(ka.là) (ka.lá)
Basic definitions 1
• Using Parc/XRCE regular expression syntax:
define C [b | c | d | f | g | h | j | k | l | m |
          n | p | q | r | s | t | v | w | x | z];   # Consonant
define HighV [u | y | i];                           # High vowel
define MidV [e | o | ö];                            # Mid vowel
define LowV [a | ä] ;                               # Low vowel
define USV [HighV | MidV | LowV];                   # Unstressed Vowel
define MSV [á | é | í | ó | ú | ý | ä´ | ö´];
define SSV [à | è | ì | ò | ù | y` | ä` | ö`];
define SV [MSV | SSV];                              # Stressed vowel
define V [USV | SV] ;                               # Vowel
Basic definitions 2
define P [V | C];            # Phone
define B [[\P+] | .#.];      # Boundary
define E .#. | ".";          # Edge
define Light [C* V];         # Light syllable
define Heavy [Light P+];     # Heavy syllable
define S [Heavy | Light];    # Syllable
define SS [S & $SV];         # Stressed syllable
define US [S & ~$SV];        # Unstressed syllable
define MSS [S & $MSV] ;      # Syllable with main stress
GEN 1
define MarkNonDiphthongs [
  [. .] -> "." || [HighV | MidV] _ LowV,   # i.a, e.a
                  LowV _ MidV,             # a.e
                  i _ [MidV - e],          # i.o, i.ö
                  u _ [MidV - o],          # u.e
                  y _ [MidV - ö],          # y.e
                  $V i _ e,                # poiki.en
                  $V u _ o,                #
                  $V y _ ö ];              #
• Insert a syllable boundary between vowels that cannot form a diphthong: i.a, e.a, a.e, i.o, u.e, y.e, etc.
define Syllabify C* V+ C* @-> ... "." || _ C V ;
• Insert a syllable boundary after a maximal C* V+ C* pattern that is followed by C V. For example, strukturalismi -> struk.tu.ra.lis.mi.
GEN 2
define Stress a (->) á|à, e (->) é|è, i (->) í|ì,
              o (->) ó|ò, u (->) ú|ù, y (->) "y´"|"y`",
              ä (->) "ä´"|"ä`", ö (->) "ö´"|"ö`";
• Optionally stress any vowel with a primary or secondary stress.
define Scan [[S ("." S ("." S)) & $SS] (->) "(" ... ")" || E _ E] ;
• Optionally group syllables into unary, binary, or ternary feet when there is at least one stressed syllable.
define Gen [MarkNonDiphthongs .o. Syllabify .o.
            Stress .o. Scan];
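To get a feel for how prolific this GEN is, the Stress and Scan steps can be imitated by brute-force enumeration. The sketch below is not the fst implementation: it assumes the input is already syllabified, it makes a single stress decision per syllable (on the first vowel), and the names stress_options, footings, and gen are hypothetical.

```python
from itertools import product

ACUTE = {"a": "á", "e": "é", "i": "í", "o": "ó", "u": "ú",
         "y": "y´", "ä": "ä´", "ö": "ö´"}
GRAVE = {"a": "à", "e": "è", "i": "ì", "o": "ò", "u": "ù",
         "y": "y`", "ä": "ä`", "ö": "ö`"}

def stress_options(syl):
    """Unstressed, primary-stressed, and secondary-stressed variants of
    one syllable, each paired with a flag saying whether it is stressed."""
    i = next(k for k, ch in enumerate(syl) if ch in ACUTE)  # first vowel
    v = syl[i]
    return [(syl, False),
            (syl[:i] + ACUTE[v] + syl[i + 1:], True),
            (syl[:i] + GRAVE[v] + syl[i + 1:], True)]

def footings(syls):
    """Yield every grouping of the syllables into feet of one to three
    syllables, where each foot contains at least one stressed syllable
    and any syllable may also stay unfooted."""
    if not syls:
        yield []
        return
    for rest in footings(syls[1:]):            # first syllable unfooted
        yield [syls[0][0]] + rest
    for k in (1, 2, 3):                        # first k syllables as a foot
        group = syls[:k]
        if len(group) == k and any(stressed for _, stressed in group):
            foot = "(" + ".".join(s for s, _ in group) + ")"
            for rest in footings(syls[k:]):
                yield [foot] + rest

def gen(syllabified):
    syls = syllabified.split(".")
    out = set()
    for combo in product(*(stress_options(s) for s in syls)):
        for parse in footings(list(combo)):
            out.add(".".join(parse))
    return out

candidates = gen("ka.la")
print(len(candidates))            # 33
print("(ká.la)" in candidates)    # True
```

Under these assumptions the enumeration yields the same 33 candidates for kala as the fst network; for longer words the set grows too quickly for enumeration to be practical, which is one reason to keep the candidates in a network.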
Demo! • fst -utf8 -l gen.script • regex {kala} .o. Gen (compose) • print lower-words (show output candidates) • print size (count them)
Kiparsky's nine constraints • Clash • AlignLeft • MainStress • FootBin • Lapse • NonFinal • StressToWeight • Parse • AllFeetFirst
Counting constraint violations
• We use asterisks to mark constraint violations. We need a way to prefer the candidates with the fewest violation marks.
define Viol ${*};
define Viol0 ~Viol;       # No violations
define Viol1 ~[Viol^2];   # At most one violation
define Viol2 ~[Viol^3];   # At most two violations
define Viol3 ~[Viol^4];   # At most three violations
• Pardon eliminates the violation marks after the candidate set has been pruned by a constraint.
define Pardon {*} -> 0;
Defining OT Constraints • Three types: • Unviolable constraints • Primary stress in Finnish • Ordinary violable constraints • Lapse • Gradient alignment constraints • All-Feet-First • Strategy: • We define an evaluation template for each of the three types and then define the individual constraints with the help of the templates.
Evaluation Template for Unviolable Constraints
define Unviolable(Candidates, Constraint) [
  Candidates
  .o.
  Constraint ];
• Example:
define MainStress(X) Unviolable(X, B MSS ~$MSS);
# B is the left edge of the word or "(".
# MSS is a syllable with a primary stress.
Evaluation Template for Ordinary Constraints
define Eval(Candidates, Violation, Left, Right) [
  Candidates
  .o.
  Violation -> ... {*} || Left _ Right
  .O.
  Viol3 .O. Viol2 .O. Viol1 .O. Viol0
  .o.
  Pardon ];
• where Viol0 is ~${*}, Viol1 is ~[[${*}]^2], etc., and Pardon is {*} -> 0, deleting all violation marks.
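The effect of the Viol3 .O. Viol2 .O. Viol1 .O. Viol0 cascade can be seen on a plain list of candidates for a single input. The following is an illustrative Python sketch, not the fst template: lenient composition is reduced to "filter, but keep everything if nothing passes", and the marking function mark_unparsed (a rough stand-in for the Parse constraint) and the other names are hypothetical.

```python
def lenient_filter(cands, ok):
    # Lenient composition for one input: keep the candidates that pass,
    # unless none pass, in which case keep them all.
    kept = [c for c in cands if ok(c)]
    return kept if kept else cands

def evaluate(cands, mark, max_viol=3):
    marked = [mark(c) for c in cands]            # insert "*" per violation
    for n in range(max_viol, -1, -1):            # Viol3, Viol2, Viol1, Viol0
        marked = lenient_filter(marked, lambda m, n=n: m.count("*") <= n)
    return [m.replace("*", "") for m in marked]  # Pardon: delete the marks

def mark_unparsed(cand):
    # Mark every syllable that is outside a foot with a trailing "*".
    out, inside = [], False
    for syl in cand.split("."):
        if syl.startswith("("):
            inside = True
        out.append(syl if inside else syl + "*")
        if syl.endswith(")"):
            inside = False
    return ".".join(out)

print(evaluate(["ká.la", "(ká).la", "(ká.la)"], mark_unparsed))
# ['(ká.la)']
print(evaluate(["ká.la", "(ká).la"], mark_unparsed))
# ['(ká).la']
```

In the second call no candidate is violation-free, so the Viol0 step leaves the set untouched and the candidate with one mark wins. Candidates that all have more than max_viol marks are not distinguished, which is the bounded counting that keeps the system finite-state.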
Evaluation Template for Left-Oriented Gradient Alignment
define EvalGradientLeft(Candidates, Violation, Left, Right) [
  Candidates .o.
  Violation -> {*} ... || .#. Left _ Right
  .o.
  Violation -> {*}^2 ... || .#. Left^2 _ Right
  .o.
  Violation -> {*}^3 ... || .#. Left^3 _ Right
  .o.
  Violation -> {*}^4 ... || .#. Left^4 _ Right
  .o.
  Violation -> {*}^5 ... || .#. Left^5 _ Right
  .o.
  Violation -> {*}^6 ... || .#. Left^6 _ Right
  .o.
  Violation -> {*}^7 ... || .#. Left^7 _ Right
  .o.
  Violation -> {*}^8 ... || .#. Left^8 _ Right
  .O.
  Viol12 .O. Viol11 .O. Viol10 .O. Viol9 .O. Viol8 .O. Viol7 .O.
  Viol6 .O. Viol5 .O. Viol4 .O. Viol3 .O. Viol2 .O. Viol1 .O.
  Viol0 .o. Pardon ];
Clash, AlignLeft, MainStress
• Clash: No stress on adjacent syllables.
define Clash(X) Eval(X, SS, SS B, ?*);
• Align-Left: The stressed syllable is initial in the foot.
define AlignLeft(X) Eval(X, SV, .#. ~[?* "(" C*], ?*);
• Main Stress: The primary stress in Finnish is on the first syllable.
define MainStress(X) Unviolable(X, B MSS ~$MSS);
FootBin, Lapse, NonFinal
• Foot-Bin: Feet are minimally bimoraic and maximally bisyllabic.
define FootBin(X) Eval(X, "(" Light ")" | "(" S ["." S]^>1,
                       ?*, ?*);
• Lapse: Every unstressed syllable must be adjacent to a stressed syllable or to the word edge.
define Lapse(X) Eval(X, US, [B US B], [B US B]);
• Non-Final: The final syllable is not stressed.
define NonFinal(X) Eval(X, SS, ?*, ~$S .#.);
StressToWeight, Parse, AllFeetFirst
• Stress-To-Weight: Stressed syllables are heavy.
define StressToWeight(X) Eval(X, SS & Light, ?*, ")" | E);
• License-σ: Syllables are parsed into feet.
define Parse(X) Eval(X, S, E, E);
• All-Ft-Left: The left edge of every foot coincides with the left edge of some prosodic word.
define AllFeetFirst(X) [
  EvalGradientLeft(X, "(", $".", ?*) ];
Finnish Prosody
• Kiparsky 2003:
define FinnishProsody(Input) [
  AllFeetFirst( Parse( StressToWeight(
    NonFinal( Lapse( FootBin( MainStress(
      AlignLeft( Clash( Input .o. Gen)))))))))];
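In the nested definition the innermost constraint applies first, so nesting depth encodes the ranking. The sketch below illustrates why the order matters, using two candidates for perijä and rough Python stand-ins for NonFinal and Parse. It uses the textbook "keep the candidates with the minimum count" evaluation rather than the bounded Viol cascade, and all function names are hypothetical.

```python
STRESS_MARKS = set("áéíóúýàèìòù´`")

def non_final(cand):
    # One violation if the final syllable carries a stress mark.
    last = cand.split(".")[-1]
    return 1 if any(ch in STRESS_MARKS for ch in last) else 0

def parse(cand):
    # One violation per syllable that is outside a foot.
    count, inside = 0, False
    for syl in cand.split("."):
        if syl.startswith("("):
            inside = True
        if not inside:
            count += 1
        if syl.endswith(")"):
            inside = False
    return count

def optimal(cands, ranked):
    # Apply constraints from highest to lowest rank; each step keeps
    # only the candidates with the fewest violations.
    for constraint in ranked:
        best = min(constraint(c) for c in cands)
        cands = [c for c in cands if constraint(c) == best]
    return cands

cands = ["(pé.ri).jä", "(pé.ri).(jä`)"]
print(optimal(cands, [non_final, parse]))   # ['(pé.ri).jä']
print(optimal(cands, [parse, non_final]))   # ['(pé.ri).(jä`)']
```

With NonFinal ranked above Parse, as in FinnishProsody, the final syllable stays unfooted; reversing the two constraints changes the winner.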
FinnWords
regex FinnishProsody( {kalastelet} | {kalasteleminen} |
  {ilmoittautuminen} | {järjestelmättömyydestänsä} |
  {kalastelemme} | {ilmoittautumisesta} |
  {järjestelmällisyydelläni} | {järjestelmällistämätöntä} |
  {voimisteluttelemasta} | {opiskelija} | {opettamassa} |
  {kalastelet} | {strukturalismi} | {onnittelemanikin} |
  {mäki} | {perijä} | {repeämä} | {ergonomia} |
  {puhelimellani} | {matematiikka} | {puhelimistani} |
  {rakastajattariansa} | {kuningas} | {kainostelijat} |
  {ravintolat} | {merkonomin} ) ;
• Demo!
Result
(ér.go).(nò.mi).a
(íl.moit).(tàu.tu).mi.(sès.ta)
(íl.moit).(tàu.tu).(mì.nen)
(ón.nit).(tè.le).(mà.ni).kin
(ó.pis).(kè.li).ja
(ó.pet).ta.(màs.sa)
(vói.mis).te.(lùt.te).le.(màs.ta)
(strúk.tu).ra.(lìs.mi)
(rá.vin).(tò.lat)
(rá.kas).ta.(jàt.ta).ri.(àn.sa)
(ré.pe).(ä`.mä)
(pé.ri).jä
(pú.he).li.(mèl.la).ni
(pú.he).li.(mìs.ta).ni
(mä´.ki)
(má.te).ma.(tìik.ka)
(mér.ko).(nò.min)
(kái.nos).(tè.li).jat
(ká.las).te.(lèm.me)
(ká.las).te.(lè.mi).nen
(ká.las).(tè.let)
(kú.nin).gas
(jä´r.jes).tel.(mä`l.li).syy.(dèl.lä).ni
(jä´r.jes).(tèl.mät).tö.(my`y.des).(tä`n.sä)
(jä´r.jes).(tèl.mäl).(lìs.tä).mä.(tö`n.tä)
Two Errors • (ká.las).te.(lè.mi).nen • (jä´r.jes).tel.(mä`l.li).syy.(dèl.lä).ni • The interaction of Lapse and StressToWeight does not produce the desired result in these cases.
What is wrong?
define Debug(Input) [
  DebugStressToWeight(
    NonFinal( Lapse( FootBin( MainStress( AlignLeft(
      Clash( Input .o. Gen))))))) ];
regex Debug({kalasteleminen});
(ká*.las).te.(lè*.mi).nen                      <-- actual winner
(ká*.las).(tè*.le).(mì*.nen)                   <-- desired output
(jä´r.jes).tel.(mä`l.li).syy.(dèl.lä).ni       <-- actual winner
(jä´r.jes).(tèl.mäl).li.(sy`y.del).(lä`*.ni)   <-- desired output
• The StressToWeight constraint eliminates some of the desired winning candidates.
Nine Elenbaas
• A unified account of binary and ternary stress. Ph.D. dissertation. University of Utrecht. 1999. Based on Kiparsky & Hanson 1996. The only difference is that Elenbaas has a special constraint *(L’ H) or AntiLStressH( in place of Kiparsky’s more general StressToWeight constraint.
define FinnishProsody(Input) [
  AllFeetFirst( Parse( AntiLStressH(
    NonFinal( Lapse( AlignLeft( FootBin(
      MainStress( Clash( Input .o. Gen))))))))) ];
define AntiLStressH(X) Eval(X, SS & Light, "(" , "." Heavy);
Result
(ér.go).(nò.mi).a
(íl.moit).(tàu.tu).mi.(sès.ta)
(íl.moit).(tàu.tu).(mì.nen)
(ón.nit).(tè.le).(mà.ni).kin
(ó.pis).(kè.li).ja
(ó.pet).ta.(màs.sa)
(vói.mis).te.(lùt.te).le.(màs.ta)
(strúk.tu).ra.(lìs.mi)
(rá.vin).(tò.lat)
(rá.kas).ta.(jàt.ta).ri.(àn.sa)
(ré.pe).(ä`.mä)
(pé.ri).jä
(pú.he).li.(mèl.la).ni
(pú.he).li.(mìs.ta).ni
(mä´.ki)
(má.te).ma.(tìik.ka)
(mér.ko).(nò.min)
(kái.nos).(tè.li).jat
(ká.las).te.(lèm.me)
(ká.las).te.(lè.mi).nen
(ká.las).(tè.let)
(kú.nin).gas
(jä´r.jes).(tèl.mäl).li.(sy`y.del).(lä`.ni)
(jä´r.jes).(tèl.mät).tö.(my`y.des).(tä`n.sä)
(jä´r.jes).(tèl.mäl).(lìs.tä).mä.(tö`n.tä)
Did She Know?
• Six syllables (Appendix of Elenbaas thesis), pattern X X L L L L:
áterìanàni áteriànani 'meal (Ess 1SG)'
érgonòmiàna 'ergonomics (Ess)'
káinostèlijàna 'shy person (Ess)'
káinostèlijàni 'shy person (Nom 1SG)'
kúnnallìsenàni 'council (Ess 1SG)'
kúnnallìsiàni 'councils (Part 1SG)'
kúnnallìsinàni 'councils (Ess 1SG)'
mérkonòmiàni 'degree in economics (Part 1SG)'
mérkonòminàni 'degree in economics (Ess 1SG)'
ópiskèlijàni 'student (Nom 1SG)'
púhelìmenàni 'telephone (Ess 1SG)'
púhelìmiàni 'telephone (Part 1SG)'
• Missing pattern: X X L L L H
Conclusion • Can we get ternary feet in Finnish “for free”, from the interaction of independently motivated principles? • We don’t know. • We know that the Kiparsky and Elenbaas accounts fail. • Optimality Prosody is computationally very difficult. • The number of initial candidates is huge: • kalasteleminen 70653 • järjestelmällisyydelläni 21767579 • Simple tableau methods do not work. • Finite-state implementation guards against errors made by a human GEN and EVAL. • But even when an error can be pinpointed, the fix is not obvious. • Debugging OT constraints is as hard as debugging two-level rules, in practice more difficult than rewrite systems.
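The candidate counts quoted above can be reproduced with a short recurrence. This is a sketch under an assumption that is not stated on the slides: each syllable gets exactly one three-way stress choice (unstressed, primary, secondary), and feet span one to three syllables with at least one stressed syllable. The function name count_candidates is hypothetical.

```python
def count_candidates(n_syllables):
    """Number of GEN outputs for a word of n syllables, assuming three
    stress choices per syllable and feet of one to three syllables that
    contain at least one stressed syllable."""
    f = [1]                                   # f[0]: the empty word
    for n in range(1, n_syllables + 1):
        total = 3 * f[n - 1]                  # next syllable left unfooted
        for k in (1, 2, 3):                   # next k syllables form a foot
            if n >= k:
                total += (3 ** k - 1) * f[n - k]
        f.append(total)
    return f[n_syllables]

print(count_candidates(2))   # 33        (ka.la)
print(count_candidates(6))   # 70653     (ka.las.te.le.mi.nen)
print(count_candidates(9))   # 21767579  (jär.jes.tel.mäl.li.syy.del.lä.ni)
```

Under that assumption the count grows by roughly a factor of seven per added syllable, which is why hand-built tableaux cover only a tiny fraction of the candidate space.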
Final Thoughts • Morphology is a regular relation. • The composition of words (morphosyntax), morphological alternations, and prosody can be described in finite-state terms. • A complex relation can be decomposed in different ways. • There are many flavors of finite-state morphology: Item-and-Arrangement, Rewrite rules, Two-level rules, Realizational Morphology, Classical optimality constraints. • Computing with finite-state tools is fun and easy. • We have a sophisticated formalism for describing regular relations, efficient compilers, and runtime software. • ‘Pencil-and-paper’ morphology badly needs computational support. • It is difficult to get globally correct results relying on a handful of interesting words, rules, and constraints.