Finite-State Methods in Natural Language Processing

Lauri Karttunen
LSA 2005 Summer Institute
July 25, 2005

Course Outline
• July 18
  • Intro to computational morphology
  • XFST
  • Readings
    • Lauri Karttunen, “Finite-State Constraints”, The Last Phonological Rule, J. Goldsmith (ed.), pages 173-194, University of Chicago Press, 1993.
    • Karttunen and Beesley, “25 Years of Finite-State Morphology”
    • Chapter 1: “Gentle Introduction” (B&K)
• July 20
  • Regular expressions
  • More on XFST
  • Readings
    • Chapter 2: “Systematic Introduction”
    • Chapter 3: “The XFST interface”

• July 25
  • More on XFST: Date Parser
  • Concatenative morphotactics: The LEXC language
  • Readings
    • Chapter 4: “The LEXC Language”
• July 27
  • Constraining non-local dependencies: Flag Diacritics
  • Non-concatenative morphotactics
    • Reduplication, interdigitation
  • Readings
    • Chapter 5: “Flag Diacritics”
    • Chapter 8: “Non-Concatenative Morphotactics”

• August 1
  • Realizational morphology
  • Readings
    • Gregory T. Stump, Inflectional Morphology: A Theory of Paradigm Structure, Cambridge University Press, 2001. (An excerpt)
    • Lauri Karttunen, “Computing with Realizational Morphology”, Lecture Notes in Computer Science, Volume 2588, Alexander Gelbukh (ed.), pages 205-216, Springer Verlag, 2003.
• August 3
  • Optimality theory
  • Readings
    • Paul Kiparsky, “Finnish Noun Inflection”, Generative Approaches to Finnic and Saami Linguistics, Diane Nelson and Satu Manninen (eds.), pages 109-161, CSLI Publications, 2003.
    • Nine Elenbaas and René Kager, “Ternary rhythm and the lapse constraint”, Phonology 16, pages 273-329.

Solution to Assignment 1, Part 1
define Hundreds [OneToNine { hundred}
                 ({ } OneToNinetyNine)];
define OneTo999 [OneToNine | Teens | Tens |
                 Hundreds ];
define Thousands [OneTo999 { thousand}
                  ({ } OneTo999)];
define UpToMillion [OneToNine | Teens | Tens |
                    Hundreds | Thousands ];

What is this?
xfst[0]: source Dutch.script
xfst[1]: print random-lower 3
tweeennegentig
vierenveertig
eenennegentig
xfst[1]: define Dutch
xfst[0]: source English.script
xfst[1]: print random-lower 3
twenty-seven
ninety-one
forty-five
xfst[1]: define English
xfst[0]: regex Dutch.i .o. English ;

Syllabification
define C [ b | c | d | f ...
define V [ a | e | i | o | u ];
[C* V+ C*] @-> ... "." || _ [C V]
“Insert a period after the longest instance of the C* V+ C* pattern in front of a C V pattern.”
Input:  strukturalismi
Output: struk.tu.ra.lis.mi
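The effect of the rule on the example word can be imitated in Python as a rough sketch. It assumes that the elided consonant list is the one given for Finnish on a later slide, and it relies on the fact that, for this particular pattern, Python's leftmost greedy matching with backtracking finds the same spans as left-to-right longest match (C and V are disjoint). The name `syllabify` is illustrative and not part of XFST.

```python
import re

# Assumption: the consonant inventory elided on the slide ("...") is the
# same as the one spelled out for Finnish later in the deck.
CONSONANTS = "bcdfghjklmnpqrstvwxz"
VOWELS = "aeiou"

C = "[" + CONSONANTS + "]"
V = "[" + VOWELS + "]"

# C* V+ C* captured as group 1, with the right context "_ C V" expressed
# as a lookahead so that it is checked but not consumed.
PATTERN = re.compile("(" + C + "*" + V + "+" + C + "*)(?=" + C + V + ")")


def syllabify(word):
    """Insert "." after each match, scanning left to right."""
    return PATTERN.sub(r"\1.", word)


print(syllabify("strukturalismi"))
```

The lookahead plays the role of the right context in the replace rule: the final consonant cluster is shortened until a C V sequence follows it.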

Finnish Syllabification
# -*- coding: utf8 -*-
define FinnWords [{kala}|{riippuu}|{tietoinen}|{sataa}|
                  {satoi}|{saata}|{saatoin}|{auta}|{laiva}|
                  {leipä}|{häijy}|{koulu}|{köyhä}|{lea}|
                  {viestien}|{tuote}|{virtu.ositeetti}|
                  {laukaus}|{lakkautan}|{voimistelijoiden}|
                  {heittäen}|{heittäisin}|{laulaen}];
define HighV [u | y | i];    # High vowel
define MidV [e | o | ö];     # Mid vowel
define LowV [a | ä];         # Low vowel
define V [HighV | MidV | LowV];    # Vowel
define LongV [{aa}|{ee}|{ii}|{oo}|{uu}|{yy}|{ää}|{öö}];
define Diph [[[MidV | LowV] HighV]|{ie}|{uo}|{yö}];

Syllabification (Continued)
define Nuc [V | LongV | Diph];
define C [b | c | d | f | g | h | j | k | l | m |
          n | p | q | r | s | t | v | w | x | z];
define Syllabify [ C* Nuc C* @-> ... "." || _ C V ] ;
regex FinnWords .o. Syllabify ;
print lower-words

Syllabification (Continued)
Problem cases
    Incorrect    Correct
    lea          le.a
    lau.laen     lau.la.en
    lau.kaus     lau.ka.us
define Syllabify [ C* Nuc C* @-> ... "." || _ C V
                   .o.
                   [. .] -> "." || [a | ä | i] _ [e | u | y] (C) .#. ,
                                   e _ a ] ;
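The two-step Finnish rule can be approximated in Python as a sketch, with the composition modelled by applying two substitutions in sequence. The function names are illustrative. The nucleus alternatives are ordered so that two-letter nuclei are tried before single vowels, which makes Python's backtracking matcher agree with longest match for this pattern; that equivalence is an assumption that holds for these definitions, not a general property of Python regular expressions.

```python
import re

HIGH, MID, LOW = "uyi", "eoö", "aä"
V = "[" + HIGH + MID + LOW + "]"
C = "[bcdfghjklmnpqrstvwxz]"
LONG_V = "aa|ee|ii|oo|uu|yy|ää|öö"
DIPH = "[" + MID + LOW + "][" + HIGH + "]|ie|uo|yö"
# Two-letter nuclei first, then a single vowel.
NUC = "(?:" + LONG_V + "|" + DIPH + "|" + V + ")"

# Step 1: C* Nuc C* @-> ... "." || _ C V
STEP1 = re.compile("(" + C + "*" + NUC + C + "*)(?=" + C + V + ")")

# Step 2: [. .] -> "." || [a | ä | i] _ [e | u | y] (C) .#. , e _ a
STEP2 = re.compile("(?<=[aäi])(?=[euy]" + C + "?$)|(?<=e)(?=a)")


def syllabify_step1(word):
    """First rule only; reproduces the problem cases."""
    return STEP1.sub(r"\1.", word)


def syllabify(word):
    """Both rules, applied in sequence as a stand-in for composition."""
    return STEP2.sub(".", syllabify_step1(word))


for w in ["lea", "laulaen", "laukaus"]:
    print(w, syllabify_step1(w), syllabify(w))
```

Running the loop prints, for each problem word, the output of the first rule alone next to the output of the full cascade.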

Parsing Dates
Best result
    Today is [Monday, July 25, 2005].
Bad results
    Today is Monday, [July 25, 2005].
    Today is [Monday, July 25], 2005.
    Today is Monday, [July 25], 2005.
    Today is [Monday], July 25, 2005.
Need left-to-right, longest-match constraints.

Defining the Language of Dates
define OneToNine [1|2|3|4|5|6|7|8|9];
define ZeroToNine ["0"|OneToNine];
define Day [{Monday} | {Tuesday} | {Wednesday} | {Thursday} |
            {Friday} | {Saturday} | {Sunday}] ;
define Month29 {February};
define Month30 [{April} | {June} | {September} | {November}];
define Month31 [{January} | {March} | {May} | {July} | {August} |
                {October} | {December}] ;
define Month [Month29 | Month30 | Month31];

Language of Dates (Continued)
# Date is a number from 1 to 31
define Date [OneToNine | [1 | 2] ZeroToNine | 3 [%0 | 1]];
# Year is a number from 1 to 9999 (watch out for the Y10K bug!)
define Year [OneToNine ZeroToNine^<4];
# A date expression consists of a Day (Monday) or a Month and a Date
# (July 25) with an optional Day (Monday, July 25) and Year
# (July 25, 2005) or both (Monday, July 25, 2005).
define AllDates [Day | (Day {, }) Month { } Date ({, } Year)];

All Dates from 1.1.1 to 31.12.9999
[Figure: the AllDates network, with arcs for the day names (Mon ... Sun), the month names (Jan ... Dec), the digits 0-9, and commas.]
13 states, 96 arcs
29 760 007 date expressions
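The figure of 29 760 007 date expressions can be checked by counting the choices that the AllDates definition allows. The following is a small sketch of that arithmetic; the variable names are illustrative.

```python
# Counts follow the definitions of Day, Month, Date and Year.
n_days = 7                      # Monday ... Sunday
n_months = 12                   # Month29 | Month30 | Month31
n_dates = 9 + 2 * 10 + 2        # 1-9, 10-29, 30-31
n_years = 9 * (1 + 10 + 100 + 1000)   # OneToNine ZeroToNine^<4

# AllDates = Day | (Day ", ") Month " " Date (", " Year)
n_optional_day = 1 + n_days     # absent, or one of seven day names
n_optional_year = 1 + n_years   # absent, or one of the years

total = n_days + n_optional_day * n_months * n_dates * n_optional_year
print(n_dates, n_years, total)
```

The bare day names contribute 7 expressions; everything else comes from the product of the optional day, the month, the date, and the optional year.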

Parser for Dates
AllDates @-> "<DT>" ... "</DT>"
Compiles into an unambiguous transducer (136 states, 2798 arcs).
Today is <DT>Monday, July 25, 2005</DT> because yesterday was <DT>Sunday</DT> and it was <DT>July 24</DT> so tomorrow must be <DT>Tuesday, July 26</DT> and not <DT>July 27</DT> as it says on the program.
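As a rough Python analogue of the longest-match markup rule, the AllDates language can be written as an ordinary regular expression and applied with `re.sub`. This is only a sketch: Python does not build a transducer, and it matches greedily with backtracking rather than by true longest match, although for this pattern the two give the same spans. `tag_dates` is an illustrative name.

```python
import re

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

DAY = "(?:" + "|".join(DAYS) + ")"
MONTH = "(?:" + "|".join(MONTHS) + ")"
DATE = "(?:[12][0-9]|3[01]|[1-9])"     # two-digit alternatives first
YEAR = "(?:[1-9][0-9]{0,3})"

# AllDates = (Day ", ") Month " " Date (", " Year) | Day
ALL_DATES = re.compile(
    "(?:" + DAY + ", )?" + MONTH + " " + DATE + "(?:, " + YEAR + ")?"
    + "|" + DAY
)


def tag_dates(text):
    """Wrap every date expression found in <DT> ... </DT>."""
    return ALL_DATES.sub(lambda m: "<DT>" + m.group(0) + "</DT>", text)


print(tag_dates("Today is Monday, July 25, 2005 because yesterday was Sunday."))
```

The full-date alternative is listed before the bare day name so that "Monday, July 25, 2005" is marked as one unit rather than as "Monday" alone.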

Problem of Reference
Valid dates
    Monday, July 25, 2005
    Tuesday, February 29, 2000
    Monday, September 16, 1996
Invalid dates
    Wednesday, April 31, 1996
    Thursday, February 29, 1900
    Tuesday, July 25, 2005
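The six examples can be checked against a calendar. The sketch below uses Python's `datetime.date`, which implements the proleptic Gregorian calendar; it is a convenient oracle for the examples, not the finite-state method of these slides. `is_real_date` is an illustrative name.

```python
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
# date.weekday() returns 0 for Monday through 6 for Sunday.
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def is_real_date(day_name, month_name, day, year):
    """True if the date exists and falls on the given weekday."""
    try:
        d = date(year, MONTHS.index(month_name) + 1, day)
    except ValueError:
        # Raised for April 31, February 29 in a non-leap year, and so on.
        return False
    return DAYS[d.weekday()] == day_name


print(is_real_date("Monday", "July", 25, 2005))
print(is_real_date("Thursday", "February", 29, 1900))
```

The three invalid examples fail for three different reasons: too many days in the month, a leap day in a non-leap year, and a weekday that does not match the date.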

Refinement by Intersection
[Figure: AllDates is intersected with the constraints MaxDaysInMonth (for example ~$[Month29 { 30}]), LeapYears (Feb 29 => _ ...) and WeekdayDate to yield Valid Dates.]

MaxDays
define MaxDays30 ~$[Month29 { 30}];
define MaxDays31 ~$[[Month29 | Month30] { 31}];
define MaxDays [MaxDays30 & MaxDays31];

LeapYear Constraint
define Even [{0} | 2 | 4 | 6 | 8] ;
define Odd [1 | 3 | 5 | 7 | 9] ;
define N [Even | Odd];
define Div4 [4 | 8 | N* [Even [%0 | 4 | 8] | Odd [2 | 6]]];
define LeapYear [Div4 - [[N+ - Div4] {00}]] ;
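One way to convince oneself that Div4 and LeapYear describe the Gregorian leap years is to transcribe them into Python and compare the result with `calendar.isleap` for every year from 1 to 9999. This is a sketch; the function names are illustrative.

```python
import calendar
import re

# Div4 = 4 | 8 | N* [Even [0 | 4 | 8] | Odd [2 | 6]]
DIV4 = re.compile(r"(?:4|8|[0-9]*(?:[02468][048]|[13579][26]))")


def in_div4(digits):
    return DIV4.fullmatch(digits) is not None


def in_leap_year(digits):
    """LeapYear = Div4 - [[N+ - Div4] {00}]"""
    if not in_div4(digits):
        return False
    # Subtract strings of the form X00 where X is a non-empty digit
    # string that is not itself in Div4 (1900, 2100, 100, ...).
    if len(digits) > 2 and digits.endswith("00") and not in_div4(digits[:-2]):
        return False
    return True


mismatches = [y for y in range(1, 10000)
              if in_leap_year(str(y)) != calendar.isleap(y)]
print(mismatches)
```

The subtraction step works because a century year X00 is divisible by 400 exactly when X is divisible by 4.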

LeapYear Constraint (Continued)
• Bad Solution 1
    define LeapDates {February 29, } => _ LeapYear ;
• Bad Solution 2
    define NotLeapYear [Year - LeapYear];
    define LeapDates ~$[{February 29, } NotLeapYear];
• Almost Correct
    define LeapDates [
      {February 29, } => _ [?* - [NotLeapYear [\N]*]]];
• Good Solution
    define LeapDates [
      {February 29, } => _ [?* - [NotLeapYear [\N]*]] .#. ];

Vacuous Context Conditions
• A context condition L _ R is compiled as ?* L _ R ?*.
• Any expression that contains the empty string is “swallowed up” when concatenated with ?*.
    (a) ?* == ?* (a) == ?*
    [?* - a] ?* == ?* [?* - a] == ?*
    ~a ?* == ?* ~a == ?*
• Not vacuous:
    a -> b || _ c* [.#.| \c] ;
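The implicit ?* at the edge of a context is what makes a right context such as LeapYear too weak when nothing marks the end of the year. The sketch below models this for a year string that ends the input: a right context R behaves like "some prefix of the rest is in R". It uses `calendar.isleap` as a stand-in for the LeapYear language, and the function names are illustrative.

```python
import calendar


def is_leap(digits):
    # Stand-in for membership in the LeapYear language.
    return calendar.isleap(int(digits))


def prefixes(digits):
    return [digits[:i] for i in range(1, len(digits) + 1)]


def leap_year_as_open_context(year):
    """{February 29, } => _ LeapYear, read as _ LeapYear ?*:
    it is enough that some prefix of the year is a leap year."""
    return any(is_leap(p) for p in prefixes(year))


def no_non_leap_year_after(year):
    """~$[{February 29, } NotLeapYear]: rejected as soon as any
    prefix of the year is a non-leap year."""
    return not any(not is_leap(p) for p in prefixes(year))


def leap_year_with_boundary(year):
    """Context anchored at the end of the year: the whole string counts."""
    return is_leap(year)


print(leap_year_as_open_context("2001"))   # accepted: "20" is a leap year
print(no_non_leap_year_after("2000"))      # rejected: "2" is not a leap year
print(leap_year_with_boundary("2000"), leap_year_with_boundary("2001"))
```

The first check is too permissive and the second too strict; only a context that pins down where the year ends tests the year as a whole.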

DateParsers
define ValidDates [AllDates & MaxDays & LeapDates];
define ValidDateParser [ValidDates @-> "<DATE>" ... "</DATE>"
                        || _ [.#. | \N]];
define InvalidDates [AllDates - ValidDates];
define InvDateParser [InvalidDates @-> "<INV-DATE>" ... "</INV-DATE>"
                      || _ [.#. | \N]];
define DateParser [InvDateParser .o. ValidDateParser];
Result: <INV-DATE><DATE>February 29</DATE>, 1900</INV-DATE>

Date/NonDate parser 1
define DateParser [ValidDateParser .o. InvDateParser];
Result: <DATE>February 29</DATE>, 1900
No nested tags for the input "February 29, 1900" because InvDateParser does not apply to strings that have been tagged already.

Date/NonDate parser 2
define DateParser [ValidDates @-> "<DATE>" ... "</DATE>",
                   InvalidDates @-> "<NON-DATE>" ... "</NON-DATE>"
                   || _ [\N | .#.]];
Parallel replacement of two patterns with the same constraint on the right context.
<NON-DATE>February 29, 1900</NON-DATE>
<DATE>February 29, 2000</DATE>
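The parallel rule can be approximated in Python by finding each date expression once and then choosing the tag according to whether the match is valid. This is a sketch under several assumptions: the shared right context is written as a negative lookahead for a digit, validity covers only MaxDays and the leap-year condition (as in ValidDates), and `calendar.isleap` stands in for the LeapYear language. `tag` and `is_valid` are illustrative names.

```python
import calendar
import re

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]
MONTH_DAYS = {"January": 31, "February": 29, "March": 31, "April": 30,
              "May": 31, "June": 30, "July": 31, "August": 31,
              "September": 30, "October": 31, "November": 30, "December": 31}

DAY = "(?:" + "|".join(DAYS) + ")"
MONTH = "(" + "|".join(MONTH_DAYS) + ")"        # group 1
DATE = "([12][0-9]|3[01]|[1-9])"                # group 2
YEAR = "([1-9][0-9]{0,3})"                      # group 3
NO_DIGIT_NEXT = "(?![0-9])"                     # || _ [\N | .#.]

FULL = "(?:" + DAY + ", )?" + MONTH + " " + DATE + "(?:, " + YEAR + ")?"
ALL_DATES = re.compile(FULL + NO_DIGIT_NEXT + "|" + DAY + NO_DIGIT_NEXT)


def is_valid(match):
    month, day, year = match.group(1), match.group(2), match.group(3)
    if month is None:                  # a bare day name
        return True
    if int(day) > MONTH_DAYS[month]:   # MaxDays
        return False
    if (month, day) == ("February", "29") and year is not None:
        return calendar.isleap(int(year))   # leap-year condition
    return True


def tag(text):
    def mark(match):
        name = "DATE" if is_valid(match) else "NON-DATE"
        return "<" + name + ">" + match.group(0) + "</" + name + ">"
    return ALL_DATES.sub(mark, text)


print(tag("February 29, 1900"))
print(tag("February 29, 2000"))
```

Because each span is matched once and tagged once, nested tags of the kind produced by the composed parsers cannot arise.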

Observations
• For some subsets of natural language, such as dates, a finite-state description is more appropriate than a phrase structure grammar.
• Regular languages and relations can be modified directly with the finite-state calculus without rewriting the grammars that describe them.
• This is a fundamental advantage over higher-level formalisms.

The LEXC Formalism

What is LEXC?
• A special application for making lexical transducers (on the B&K book CD).
• A language for describing morphotactic constraints by way of sublexicons and continuation classes.
Why another regular expression formalism?
• The general regular expression compiler in XFST is oriented towards compiling networks from symbols and symbol pairs, not from words. LEXC is word-based.
• Compiling large lexicons (tens of thousands of words) with the standard union operator is inefficient. LEXC uses a different, more efficient algorithm for building networks from lists of words, stems, and affixes.

LEXC Syntax
Multichar_Symbols +Noun +Sg +Pl

Lexicon Root
cat SgPl ;
dog SgPl ;
goose Sg ;
goose:geese Pl ;

Lexicon SgPl
Sg ;
0:s Pl ;

Lexicon Sg
+Noun+Sg:0 # ;

Lexicon Pl
+Noun+Pl:0 # ;

• Multicharacter symbols need to be declared.
• There must be a sublexicon called ‘Root’.
• Entries consist of an optional string or string pair followed by an obligatory continuation class.
• Every continuation class must refer to a sublexicon, except for #, the termination class.
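The way sublexicons and continuation classes spell out word forms can be illustrated with a small Python sketch that walks the example lexicon from Root to the termination class # and collects upper:lower pairs. This only enumerates paths; it does not build or minimize a network as LEXC does, and it assumes the lexicon has no cycles. The dictionary layout and the name `expand` are illustrative.

```python
# Each entry is (upper string, lower string, continuation class).
# The LEXC symbol 0 (empty string) is written here as "".
LEXICON = {
    "Root": [("cat", "cat", "SgPl"),
             ("dog", "dog", "SgPl"),
             ("goose", "goose", "Sg"),
             ("goose", "geese", "Pl")],
    "SgPl": [("", "", "Sg"),
             ("", "s", "Pl")],
    "Sg": [("+Noun+Sg", "", "#")],
    "Pl": [("+Noun+Pl", "", "#")],
}


def expand(lexicon, state="Root", upper="", lower=""):
    """Yield every (upper, lower) pair reachable from the given sublexicon."""
    if state == "#":            # termination class: a complete word
        yield upper, lower
        return
    for entry_upper, entry_lower, continuation in lexicon[state]:
        yield from expand(lexicon, continuation,
                          upper + entry_upper, lower + entry_lower)


pairs = list(expand(LEXICON))
for upper, lower in pairs:
    print(upper, lower)
```

Each printed pair corresponds to one path through the lexical transducer, with the analysis on the upper side and the surface form on the lower side.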

Esperanto chart
[Figure: network over the morphemes ne, mal, bon, eg, et, a, ec, ge, hund, in, eg, et, o, j, n.]

Esperanto chart 2
[Figure: the same network over the morphemes ne, mal, bon, eg, et, a, ec, ge, hund, in, eg, et, o, j, n.]

Esperanto chart 5
[Figure: the same network with its parts labeled as sublexicons: Root, Adjective, APrefix, AStem, ADeriv, AMod, ATag, ASuff, AtoN, Noun, NPrefix, NStem, NDeriv, NMod, NTag, NSuff, Infl, Plur, Acc.]

Esperanto.lexc
LEXICON Root
Adjective ;
Noun ;

LEXICON Adjective
APrefix ;
AStem ;

LEXICON APrefix
Neg+:ne AStem ;
Op+:mal AStem ;

LEXICON AStem
bon ADeriv ;

LEXICON ADeriv
ATag ;
AMod ;

LEXICON AMod
+Aug:eg ATag ;
+Dim:et ATag ;

LEXICON ATag
+Adj:0 ASuff ;
+Adj:0 AtoN ;

LEXICON ASuff
+ASuff:a Infl ;

... etc.

Constraints
[Figure: the Esperanto network with the tags MF+, +Fem and +Pl marked on the arcs they belong to.]

Constraints 2
MF%+ => _ ~$[%+Fem] %+Pl ;
[Figure: the Esperanto network with the tags MF+, +Fem and +Pl marked on the arcs they belong to.]
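The restriction says that every MF+ must be followed by a stretch that contains no +Fem and then by +Pl. As a sketch, the same condition can be tested on individual analysis strings in Python. The example strings are hypothetical (the slides do not show the exact tag sequences of the lexicon), and `satisfies_constraint` is an illustrative name.

```python
import re

# A violation is an MF+ that is NOT followed by: any characters that do
# not start a +Fem tag, and then +Pl.
VIOLATION = re.compile(r"MF\+(?!(?:(?!\+Fem).)*\+Pl)")


def satisfies_constraint(analysis):
    """True if every MF+ in the string meets the right-context condition."""
    return VIOLATION.search(analysis) is None


# Hypothetical analysis strings, for illustration only.
examples = ["MF+hund+Noun+Pl",
            "MF+hund+Fem+Noun+Pl",
            "MF+hund+Noun+Sg",
            "hund+Fem+Noun+Sg"]
for analysis in examples:
    print(analysis, satisfies_constraint(analysis))
```

In XFST the constraint is not applied string by string; it is compiled into a network and composed with the lexicon, which removes the offending paths all at once.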

Constraints 3
xfst[0]: read lexc < esperanto.lexc
Reading from 'adj-noun.lexc'
Root...2, Nouns...2, NounRoots...4, Nmf...5, ....
Building lexicon...Minimizing...Done!
2.7 Kb. 45 states, 70 arcs, Circular.
Closing 'adj-noun-tags.lexc'
xfst[1]: regex MF%+ => _ ~$[%+Fem] %+Pl ;
1.2 Kb, 2 states, 7 arcs, Circular
xfst[2]: compose
3.2 Kb, 61 states, 89 arcs, Circular
Fewer words, bigger network!
