ACL 4 NCLT Seminar Presentation, 7th June 2006
John Tinsley
Morphological Analysis of Spanish Using Finite-State Transducers
Introduction
• What is this project about?
  • Provide morphological information on Spanish strings
  • Generate strings from morphological descriptions
• What were my aims?
  • A robust, fast application that is easily integrated into other systems
  • 80% token coverage on unrestricted text
  • 100% coverage of Spanish morphology
Design Methodology
• Formalisation
  • Discovery of Spanish morphological rules
• Implementation
  • Coding of morphological model with Xerox Finite-State Tools
• Evaluation
  • Check for accuracy & well-formedness
  • Assess language coverage
Spanish Morphology - Verbs
• Inflected for person, tense/mood, number
• Regular verbs
  • 3 regular conjugations identified by infinitive endings
  • ‘-ar’, ‘-er’, and ‘-ir’
• Irregular verbs
  • 66 distinct irregularities
  • Varying degrees of irregularity
Spanish Morphology - Nouns
• Inflected for number, gender
• 7 types of noun
  • Feminine, masculine, neutral, derivative, profession, number invariant, proper
• Irregularities
  • All arise via pluralisation
  • Accentuation, character alterations
Spanish Morphology - Adjectives
• Inflected for number, gender
• 4 types of adjective
  • Neutral, derivative, profession, irregular
• Adverbs derived from adjectives by addition of suffix ‘-mente’
Xerox Finite-State Tools - lexc
• Lexicon compiler
• Compiles ‘continuation classes’ into lexical transducers
Xerox Finite-State Tools - xfst
• Xerox finite-state tool
• Compiles regular expressions into networks
• Regular expression replace rules
[ String -> Replacement || left-context _ right-context ]
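The replace-rule template above can be approximated outside xfst. The following is a minimal Python sketch, not the xfst implementation: it assumes literal (non-regular-expression) contexts, and the helper name `replace_rule` is hypothetical. It rewrites `target` only where it stands between the given left and right contexts.

```python
import re


def replace_rule(text, target, replacement, left="", right=""):
    """Approximate [ target -> replacement || left _ right ] for literal strings.

    Assumption: contexts are plain strings, not regular expressions as in xfst.
    """
    pattern = ""
    if left:
        # The left context must precede the target but is not consumed.
        pattern += "(?<=" + re.escape(left) + ")"
    pattern += re.escape(target)
    if right:
        # The right context must follow the target but is not consumed.
        pattern += "(?=" + re.escape(right) + ")"
    # A function replacement avoids backslash interpretation in `replacement`.
    return re.sub(pattern, lambda match: replacement, text)


if __name__ == "__main__":
    # Only the "b" between "a" and "c" is rewritten.
    print(replace_rule("abc abd", "b", "X", left="a", right="c"))  # aXc abd
```

Unlike a compiled network, this sketch works in one direction only (analysis-side strings in, rewritten strings out); an xfst transducer can be applied in both directions.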
Xerox Finite-State Tools - example
• conocer - ‘to know’
  • 1st person, pres. ind. ‘conozco’
• Lexical transducer mappings
  • conoc:conoc
  • er+Verb:ε
  • +PresInd:^PresInd
  • +1P+Sg:o
Xerox Finite-State Tools - example cont…
• Composed replace rule
[ c -> {zc} || _ %^PresInd ]
• Triggered by the ^PresInd tag
• Makes the required changes, then the trigger is removed
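The conocer example can be traced step by step. This is a rough Python simulation under stated assumptions, not the xfst network itself: the upper:lower pairs are copied from the slide with ε written as the empty string, and the function names are hypothetical.

```python
import re

# Upper:lower pairs from the slide; the empty string stands for ε.
MAPPINGS = [
    ("conoc", "conoc"),
    ("er+Verb", ""),
    ("+PresInd", "^PresInd"),
    ("+1P+Sg", "o"),
]

# Trigger symbols to delete at the end (only the one used in this example).
TRIGGERS = ["^PresInd"]


def concatenate(mappings):
    """Join the upper sides (analysis) and the lower sides (intermediate form)."""
    upper = "".join(u for u, _ in mappings)
    lower = "".join(l for _, l in mappings)
    return upper, lower


def apply_c_to_zc(form):
    """[ c -> {zc} || _ ^PresInd ]: rewrite c only directly before the trigger."""
    return re.sub(r"c(?=\^PresInd)", "zc", form)


def remove_triggers(form):
    """Delete each known trigger symbol as a whole, leaving the surface string."""
    for trigger in TRIGGERS:
        form = form.replace(trigger, "")
    return form


if __name__ == "__main__":
    upper, lower = concatenate(MAPPINGS)
    print(upper)                                   # conocer+Verb+PresInd+1P+Sg
    print(lower)                                   # conoc^PresIndo
    print(remove_triggers(apply_c_to_zc(lower)))   # conozco
```

The trigger is removed by exact match because `^PresInd` is a single multicharacter symbol; a pattern such as "caret followed by letters" would also swallow the ending `o`.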
Verb Lexicon
• Coded in lexc
• Model has 3 regular paths
• 66 varieties of irregularity
• e.g. poder ‘to be able to’
LEXICON Irreg43
0:^UE^VSoue^PRET1^FR ErV ;
[o -> {ue} || _ Consonant^<4 [%^UE ?* [[%^PresInd | %^PresSubj] ?* [%^1PSg | %^2PSg | %^3PSg | %^3PPl]]]]
Noun Lexicon
LEXICON NounFem ! Feminine Nouns
!STEM     !CONT. CLASS     ! GLOSS
acción    fIsNounEs ;      ! action

LEXICON fIsNounEs ! feminine pluralised with 'es'
+Noun:0 fNounPluralES ;

LEXICON fNounPluralES
+Sg+Fem:0 # ;
+Pl+Fem:^NZ^NOes # ;

[z -> c || _ %^NZ]
[ó -> o || _ ?^<5 %^NO ]
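The two pluralisation rules above can be followed on the intermediate string that the lexicon produces for the plural of acción. The Python sketch below is an illustration only: it treats `^NZ` and `^NO` as single symbols, as xfst does with multicharacter symbols, and reads `?^<5` as "fewer than five arbitrary symbols". All function names are hypothetical, and the second example, luz, is an assumed entry that does not appear in the lexicon excerpt.

```python
TRIGGERS = ("^NZ", "^NO")


def tokenize(form):
    """Split a string into symbols, keeping each trigger as one symbol."""
    symbols, i = [], 0
    while i < len(form):
        for trigger in TRIGGERS:
            if form.startswith(trigger, i):
                symbols.append(trigger)
                i += len(trigger)
                break
        else:
            symbols.append(form[i])
            i += 1
    return symbols


def z_to_c(symbols):
    """[z -> c || _ %^NZ]: z becomes c when the next symbol is ^NZ."""
    return [
        "c" if s == "z" and symbols[i + 1:i + 2] == ["^NZ"] else s
        for i, s in enumerate(symbols)
    ]


def drop_accent(symbols):
    """[ó -> o || _ ?^<5 %^NO]: ó loses its accent when ^NO follows
    after at most four intervening symbols."""
    return [
        "o" if s == "ó" and "^NO" in symbols[i + 1:i + 6] else s
        for i, s in enumerate(symbols)
    ]


def pluralise(stem):
    """Build stem + ^NZ^NOes (the +Pl+Fem lower side), apply both rules,
    then delete the triggers."""
    symbols = tokenize(stem + "^NZ^NOes")
    symbols = drop_accent(z_to_c(symbols))
    return "".join(s for s in symbols if s not in TRIGGERS)


if __name__ == "__main__":
    print(pluralise("acción"))  # acciones
    print(pluralise("luz"))     # luces (assumed entry, not in the excerpt)
```

Working on symbol lists rather than raw characters matters for the distance bound: in `acción^NZ^NOes` only two symbols (`n` and `^NZ`) separate `ó` from `^NO`, although they span four characters.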
Adjective Lexicon
• Same process as noun lexicon
• Uses the same replace rules
• One exception for adverbs
LEXICON nIsAdjS
+Adj:0 nAdjPluralS ;
+Adj|+Adv:^AAOmente # ;
[o -> a || _ %^NAO %^AAO {mente}]
Other Transducers
• Overgeneration Filter
  • llover ‘to rain’
• Capitalisation
• Trigger Remover
• Execution script
~[ $[{llov} ?* [[%+1P | %+2P] [%+Sg | %+Pl] | [%+3P %+Pl] ] ] ]
[ a (->) A || .#. _ ]
[ %^IE -> 0 ]
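The overgeneration filter and the optional capitalisation rule above can be pictured with ordinary string operations. This is a hedged Python sketch rather than the compiled transducers: it assumes analyses are plain tag strings such as `llover+Verb+PresInd+3P+Sg`, and the function names are hypothetical.

```python
import re

# ~[ $[ {llov} ?* [ [+1P | +2P] [+Sg | +Pl] | [+3P +Pl] ] ] ]
# Reject any analysis that contains "llov" followed later by a first- or
# second-person tag pair, or by third person plural.
BLOCKED = re.compile(r"llov.*(?:\+(?:1P|2P)\+(?:Sg|Pl)|\+3P\+Pl)")


def passes_filter(analysis):
    """True when the analysis survives the overgeneration filter."""
    return BLOCKED.search(analysis) is None


def capitalised_variants(surface):
    """[ a (->) A || .#. _ ]: optionally capitalise a word-initial 'a'.

    The rule is optional, so both spellings are returned. Only the letter
    shown on the slide is handled here.
    """
    if surface.startswith("a"):
        return [surface, "A" + surface[1:]]
    return [surface]


if __name__ == "__main__":
    for analysis in (
        "llover+Verb+PresInd+3P+Sg",
        "llover+Verb+PresInd+1P+Sg",
        "llover+Verb+PresInd+3P+Pl",
    ):
        print(analysis, passes_filter(analysis))
    print(capitalised_variants("abordo"))
```

With this filter only the third person singular forms of llover remain, which matches its use as an impersonal weather verb.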
Testing
• Accuracy
  • Maintaining integrity of existing rules
  • Projection
  • Subtraction
• Well-formedness
  • Ensuring tag order
Assessing Coverage
• Aim – 80% on unrestricted text
• Statistical predictions (Crystal 1997)
• Corpus compilation and processing
  • Europarl, 3 corpora (http://people.csail.mit.edu/koehn/publications/europarl/)
• Phase 1 – augmentation
• Phase 2 – 81% coverage
• Final assessment – 84.15% coverage
Further Details
• Generates approx. 44,000 unique morphological descriptions
• Evaluation corpus – 1.26 analyses per input token on average
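Both figures reported above, token coverage and average analyses per token, are simple ratios over a tokenised corpus. The Python sketch below shows one way they might be computed. It is illustrative only: the toy lexicon, its tags and the function names are hypothetical, and dividing the analysis count by all input tokens (rather than only the analysed ones) is an assumption about how the 1.26 figure was obtained.

```python
def coverage_stats(tokens, analyse):
    """Return (coverage, average analyses per input token).

    `analyse` maps a token to a list of analyses; an empty list means
    the token was not recognised.
    """
    if not tokens:
        return 0.0, 0.0
    analysed = 0
    total_analyses = 0
    for token in tokens:
        results = analyse(token)
        if results:
            analysed += 1
            total_analyses += len(results)
    # Assumption: the average is taken over every input token.
    return analysed / len(tokens), total_analyses / len(tokens)


# Toy stand-in for the analyser (hypothetical entries and tags).
TOY_LEXICON = {
    "abordo": ["abordar+Verb+PresInd+1P+Sg"],
    "acciones": ["acción+Noun+Pl+Fem"],
    "como": ["comer+Verb+PresInd+1P+Sg", "como+Adv"],
}


def toy_analyse(token):
    return TOY_LEXICON.get(token.lower(), [])


if __name__ == "__main__":
    corpus = ["abordo", "como", "acciones", "xyz"]
    coverage, ambiguity = coverage_stats(corpus, toy_analyse)
    print(f"coverage: {coverage:.2%}, analyses per token: {ambiguity:.2f}")
```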
Possible improvements
• Increase coverage
  • Lexicon augmentation
• Disambiguation using POS tagger
• More derivational morphology
• Deal with different dialects of Spanish
References
• (Beesley & Karttunen 2003) Beesley, K. and Karttunen, L. Finite State Morphology. CSLI Publications, United States, 2003.
• (Claret 2005) Los Verbos Castellanos Conjugados, Sexta Edición. Editorial Claret, Barcelona, 2005.
• (Crystal 1997) Crystal, D. The Cambridge Encyclopedia of Language (2nd ed.). Cambridge University Press, 1997.
• Europarl: Europarl Parallel Corpus. http://people.csail.mit.edu/koehn/publications/europarl/ (last accessed 19/05/2006)
• (Kendris 1990) Kendris, C. Spanish Grammar. Barron’s, 1990.
• (Mateo & Rojo Sastre 1997) Mateo, F. and Rojo Sastre, A.J. Collection Bescherelle - Les verbes espagnols. Hatier, 1997.
• Real Academia Española. http://www.rae.es/ (last accessed 25/05/2006)
Conclusions
Demonstration
LEXICON ArVerbs
!STEM     !CONT. CLASS     !GLOSS
abord     ArV ;            !to approach

LEXICON ArV
ar+Verb:0 ArConj ;

LEXICON ArConj
!TAGS                 !CONT.CLASS
+PresInd:^PresInd     ArPresInd ;
+PretInd:^PretInd     ArPretInd ;

LEXICON ArPresInd ! Present Indicative
+1P+Sg:o^1PSg # ;
+2P+Sg:as^2PSg # ;
+3P+Sg:a^3PSg # ;
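The continuation classes listed above can be walked by hand to see which upper:lower pairs the lexicon encodes. The following Python sketch mimics that walk. It is not how lexc compiles a network: the dictionary layout and the function name `expand` are hypothetical, and `ArPretInd` is left undefined because its entries are not shown in the excerpt.

```python
# Each lexicon maps to a list of (upper, lower, continuation class) entries.
# "#" marks the end of a word, as in lexc; 0 is written as the empty string.
LEXICONS = {
    "ArVerbs": [("abord", "abord", "ArV")],
    "ArV": [("ar+Verb", "", "ArConj")],
    "ArConj": [
        ("+PresInd", "^PresInd", "ArPresInd"),
        ("+PretInd", "^PretInd", "ArPretInd"),  # ArPretInd is not in the excerpt
    ],
    "ArPresInd": [
        ("+1P+Sg", "o^1PSg", "#"),
        ("+2P+Sg", "as^2PSg", "#"),
        ("+3P+Sg", "a^3PSg", "#"),
    ],
}


def expand(lexicon, upper="", lower=""):
    """Yield every complete (upper, lower) pair reachable from `lexicon`.

    Paths that lead into an undefined continuation class yield nothing.
    """
    for entry_upper, entry_lower, continuation in LEXICONS.get(lexicon, []):
        new_upper = upper + entry_upper
        new_lower = lower + entry_lower
        if continuation == "#":
            yield new_upper, new_lower
        else:
            yield from expand(continuation, new_upper, new_lower)


if __name__ == "__main__":
    for analysis, intermediate in expand("ArVerbs"):
        print(analysis, "->", intermediate)
```

The lower sides still carry trigger symbols such as `^PresInd` and `^1PSg`; in the full system the replace rules act on these before a trigger remover deletes them, which for the first path would leave `abordo`.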