Suppletive Morphology: How Far Can You Go?

Jorge BAPTISTA, Universidade do Algarve – FCHS, Laboratório de Engenharia da Linguagem – CAUTL – IST, Campus de Gambelas, Faro, Portugal, P-8005-139, jbaptis@ualg.pt

For many NLP applications, fully tagged texts are required. • Even if statistical methods may be used to tag texts, electronic dictionaries are essential tools for high-quality tagging of large texts. • Large electronic dictionaries of both simple and multiword lexical units have been built for European Portuguese.

In spite of the size of these dictionaries, a non-trivial number of tokens in large corpora remain untagged. • Suppletive morphological parsing rules can be used to fill many of these lacunae, especially for regularly derived words. • However, there are empirical limits to morphological parsers, so other methods for automatic lexical analysis must also be envisaged.

Introduction • Automatic lexical analysis of texts can be carried out using different methods (see Ranchhod 2001 for an overview). • Most systems, however, even if they rely predominantly on statistical methods, also use an electronic dictionary to some degree, where lexical information, idiosyncratic by nature, is stored.

Large electronic dictionaries of both simple and compound words have been built for several languages, including Portuguese (Eleutério et al. 1995, Ranchhod et al. 1999). • In spite of their size, when these lexical resources are applied to a large corpus, a non-trivial number of tokens remain untagged. • Since the lexicon is an evolving object, one cannot expect dictionaries to be so comprehensive and exhaustive that they contain all possible words. This is particularly the case for regularly derived words (-ly adverbs, -ize verbs, for example).

Morphological parsers have been built, which can be used with or without dictionaries. • With such tools it is possible to fill the dictionary's lacunae, that is, to formalize morphological rules so that the system can recognize (and tag) words that have not previously been included in the dictionaries. • These rules may be used in a suppletive way (or in connection with the dictionary), and the results of their application can then be manually checked by linguists and used to extend the coverage of the initial dictionary.

In this paper, an attempt is made to estimate how many of the unknown, hence untagged, tokens of a large corpus can be adequately recognized by a morphological parser • using only regular derivational rules, • and to evaluate the precision and determine empirically the limitations of this methodology. • A set of morphological rules was built, focusing on a list of unknown tokens. • Results from this morphological module are described here and its precision is evaluated.

Methods • The CETEMPúblico corpus, fragment 1 (text file of ~57.6 Mb; ~9.6 M simple word forms, 177,368 of them different): http://www.linguateca.pt/CETEMPublico/ • INTEX 4.33 (Silberztein 1993, 2004): http://www.nyu.edu/pages/linguistics/intex/ • Portuguese DELAF, public lexical resources built by LabEL (Eleutério et al. 1995, Ranchhod et al. 1999): http://label.ist.utl.pt

Table 1: Lexical analysis of the training corpus. WF = word forms (in millions, M); DWF = different word forms; DLF = simple word entries; ERR-0 = unknown word forms; NProp = candidates to the status of proper names; ERR-1 = remaining unknown word forms, i.e. the ERR list to be tested.

NProp: 40,957 DWF; 23.09 % of DWF; 56.65 % of ERR-0
N+Sigla: 1,074 DWF; 0.6 % of DWF; 1.49 % of ERR-0

Overview of ERR list • Many forms in ERR are perfectly 'normal' words that were simply missing from the dictionary. • With the help of an inverse list of ERR (also obtained with the INTEX dictionary tools), it was possible to determine some of the most productive derivational rules at stake:

• adjectives formed from verbs (partially homographic with past participles): electrizada (electrified-fs) from electrizar (electrify); • adverbs in -mente (-ly), formed from adjectives: controladamente (controlled-ly, in a controlled manner) from controlado (controlled); • nouns derivationally related to verbs, with suffixes -ção (-ation/-ing) and -mento (-ment/-ing): agilização ('agilization') from agilizar (to 'agilize', make something more agile, swift); silenciamento (silencing) from silenciar (to silence);

• nouns formed with the suffix -ismo (-ism): vegetarianismo (vegetarianism), from vegetariano (vegetarian); • nouns and adjectives formed with the suffixes -logia (-logy), -ólogo (-logue), -ologista (-logist), and -lógico (-logic), designating scientific/technological domains, the professionals in those domains and the relational adjective associated with them: paleontologia (paleontology), paleontólogo or paleontologista (paleontologist), paleontológico ('paleontologic', related to paleontology);

• nouns and adjectives formed with the suffixes -mancia and -mância (-mancy), -mante (-mant), and -mântico (-mantic), designating divinatory arts, their practitioners and the relational adjective associated with them: quiromancia / quiromância (chiromancy, palmistry), quiromante ('chiromant', palmist, psychic who reads palms to divine the future), quiromântico ('chiromantic', related to chiromancy, palmistry).
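The inverse list mentioned above can be approximated in a few lines of Python. This is only a sketch of the idea, not the INTEX dictionary tool itself: the function names `inverse_list` and `ending_counts` and the five-word sample are illustrative assumptions. Sorting words by their reversed spelling places forms with the same ending next to each other, which makes productive suffixes easy to spot.

```python
from collections import Counter


def inverse_list(words):
    """Sort word forms by their reversed spelling, so shared endings cluster."""
    return sorted(set(words), key=lambda w: w[::-1])


def ending_counts(words, length):
    """Count how many different word forms share each final `length` letters."""
    return Counter(w[-length:] for w in set(words) if len(w) > length)


# Hypothetical sample of unknown forms (taken from the examples in the text).
sample = [
    "controladamente",
    "umbilicalmente",
    "agilização",
    "silenciamento",
    "vegetarianismo",
]

if __name__ == "__main__":
    for form in inverse_list(sample):
        print(form)
    print(ending_counts(sample, 5).most_common(3))
```

On a real ERR list, the most frequent endings returned by `ending_counts` would be the candidates a linguist inspects before writing a derivational rule.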

Besides these, many derived words formed with prefixes (Pfx) were found; for these, a list of the 170 most common prefixes was established, based on the lists available in grammars and on new prefixes found in the text, e.g.: anti-, auto-, bi-, contra-, des-, equi-, etno-, extra-, farmaco-, foto- (photo), geo-, hepta-, hidro-, hipo-, homo-, in- (and variants: i-, im-, ir-), inter-, macro-, mega-, micro-, mono-, neo-, opto-, pluri-, proto-, pseudo-, psico- (psych-), radio-, re-, retro-, semi-, socio-, super-, tele-, tetra-, trans-, tri-, ultra-, uni-, video-, xeno-, zoo-, etc.

Obviously, many words can be polysynthetic, i.e. formed by simultaneous prefixation and suffixation: descontroladamente <des- Pfx + controlar V + -ada Sfx-a + -mente Sfx-adv> (uncontrolledly). • However, after a certain point, derivation rules have very low productivity, i.e. the number of words regularly formed becomes negligible.
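The layered analysis of descontroladamente can be sketched as follows. This is a plain-Python approximation under stated assumptions, not the transducers used in the paper: the toy verb list, the two-prefix list and the function name `analyse_adverb` are hypothetical, and only the single pattern (prefix) + verb in -ar + -ada + -mente is handled.

```python
from typing import Optional

# Toy resources (assumptions for illustration only).
TOY_VERBS = {"controlar", "silenciar"}
TOY_PREFIXES = ("des", "re")


def analyse_adverb(word: str) -> Optional[str]:
    """Analyse (prefix) + verb + -ada + -mente; return the analysis or None."""
    if not word.endswith("mente"):
        return None
    base = word[: -len("mente")]        # e.g. descontrolada
    if not base.endswith("ada"):
        return None
    stem = base[: -len("ada")]          # e.g. descontrol
    # Try the unprefixed reading first, then each known prefix.
    for prefix in ("",) + TOY_PREFIXES:
        if not stem.startswith(prefix):
            continue
        verb = stem[len(prefix):] + "ar"  # rebuild the infinitive
        if verb in TOY_VERBS:
            parts = [f"{prefix}-Pfx"] if prefix else []
            parts += [f"{verb} V", "-ada Sfx-a", "-mente Sfx-adv"]
            return " + ".join(parts)
    return None


if __name__ == "__main__":
    print(analyse_adverb("descontroladamente"))
    print(analyse_adverb("controladamente"))
```

Each layer is only accepted if the innermost base is attested in the lexicon, which is what keeps this kind of rule from matching arbitrary strings that merely end in -adamente.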

Morphological rules • The new morphological parser of INTEX was used (version 4.33, February 24, 2004; Silberztein 2004:130-142). • A set of morphological rules was built. • These rules are enhanced finite-state transducers (FST).

Example: • Faced with the new, unknown word form umbilicalmente, formed from the adjective umbilical (idem, related to the navel), the system checks whether there is an adjective umbilical in the lexicon (in fact, there is) and, if so, produces the lexical entry: umbilicalmente,umbilicalmente.ADV+A=umbilical+Sfx=mente • The context of this new word in the corpus is: "a imagem de um PS umbilicalmente ligado ao modelo jurídico-penal" (the image of a Socialist Party umbilically connected to the juridical-penal model). • From this context, the meaning of this adverb should be something like 'closely, intimately, or inextricably'.
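The behaviour of this rule can be mimicked in Python as a minimal sketch. It is not the INTEX transducer: the adjective set, the function name `tag_mente_adverb` and the feminine-to-masculine lemma step (-a to -o) are assumptions made for illustration; only the output entry format is copied from the example above.

```python
from typing import Optional, Set

# Toy adjective lexicon (an assumption; the real rule consults the DELAF).
TOY_ADJECTIVES = {"umbilical", "controlado"}


def tag_mente_adverb(token: str, adjectives: Set[str]) -> Optional[str]:
    """Return a dictionary-style entry for an adverb in -mente, or None.

    The rule only fires when the base left after removing -mente is an
    adjective already known to the lexicon.
    """
    suffix = "mente"
    if not token.endswith(suffix):
        return None
    base = token[: -len(suffix)]
    candidates = [base]
    if base.endswith("a"):
        # -mente attaches to the feminine form; try the masculine lemma too.
        candidates.append(base[:-1] + "o")
    for adjective in candidates:
        if adjective in adjectives:
            return f"{token},{token}.ADV+A={adjective}+Sfx={suffix}"
    return None


if __name__ == "__main__":
    print(tag_mente_adverb("umbilicalmente", TOY_ADJECTIVES))
    print(tag_mente_adverb("controladamente", TOY_ADJECTIVES))
```

The lexicon check is the important design choice: without it, any string ending in -mente (including misspellings and foreign words) would be tagged as an adverb.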

A set of morphological rules was built to analyze and tag the most frequent, derivationally well-formed, unrecognized word forms in ERR. • These rules were grouped in different FSTs by 'derivative families'. Remark: • We have built a small module of rules to deal with the derivation of diminutive, augmentative and superlative forms. • The number of rules built by us is given here as a mere indication. • C. Mota (2003) has built a larger and more complex module of rules for the same derivational processes. We did not use her work here; therefore, the results ignore this module.

Table 2: Morphological rules

The number of prefixes used (approximately 170) significantly influences the number of rules in each family. • As we will see below, some of these prefixes give rise to a significant number of erroneous analyses, so it is possible that in a future version some of them will be removed and only used in more constrained rules. • As this is ongoing research, the number of rule families will surely increase.

The morphological module that integrates all these rules is a 20 Kb FST with 595 states and 990 transitions. • It takes 35 seconds to analyze the 31,337 word forms of the ERR list of the training corpus and to produce the 3,533 entries of the resulting DLF.

Results • First, the application of the set of FSTs to the training corpus was evaluated. • We first present the lexical coverage of the morphological module and then assess its success rate.

Table 3: Lexical coverage of morphological rules: results from training corpus. WF = word forms; DLF=simple word entries; n‑tuples=different entries produced for the same word‑forms; %DLF=n‑tuples’ percentage of DLF entries; %ERR=percentage of ERR list.

Table 4: Results from training corpus

Some comments • As we can see, the global success rate is high (approx. 93 %). • The most important cause of error consists of initial strings incorrectly analysed as prefixes. • Some adverbs ending in -mente are analysed in spite of the fact that they contain an (incorrectly spelled) accented vowel: diáriamente,diáriamente.ADV+ADJ=diária+SFX=mente (the correct form, diariamente, 'daily', does not have any accent).

Testing corpus: Lexical analysis • Table 5: Lexical analysis of the test corpus. WF = word forms (in millions, M); DWF = different word forms; DLF = simple word entries; ERR-0 = unknown word forms; NProp = candidates to the status of proper names; ERR-1 = remaining unknown word forms, i.e. the ERR list to be tested.

Comparison CP1 vs. CP2 • The two fragments do not have exactly the same size: the testing corpus is 6.2 Mb larger • and has 1.3 million more words (1,175 more different word forms) than the training corpus. • The DLF is also 1,180 entries larger. • However, the number of unknown word forms (ERR-0) and of proper names (NProp) is almost the same. • The remaining ERR list (ERR-1), after the proper names have been discarded, is also of comparable size.

Testing corpus: Lexical coverage • Table 6: Lexical coverage of morphological rules: results from testing corpus. WF = word forms; DLF = simple word entries; n-tuples = different entries produced for the same word forms; %DLF = n-tuples' percentage of DLF entries; %ERR = percentage of ERR list.

Table 7: Results from testing corpus.

Comparing results

Table 8: Comparison of results from training and testing corpora. WF = word forms; DLF=simple word entries; n‑tuples=different entries produced for the same word-forms; %DLF=n‑tuples’ percentage of DLF entries; %ERR=percentage of ERR list.

Comparing results • In spite of the different sizes of the two corpora, the number of word forms in each ERR list is almost the same. • The results from the morphological rules are also equivalent: both the number of different word forms recognized by the FSTs and the number of entries in the two DLFs are very close. • The combined DLF obtained by applying the morphological rules to both corpora contains 6,058 different entries, corresponding to 4,253 different word forms (including diminutives, augmentatives and superlatives).

There is a slightly greater number of n-tuples in the testing corpus, but their percentage of the DLF is practically the same. • The first major difference is the lexical coverage of the morphological module (%ERR), i.e. the percentage of matched word forms of each ERR list: • while in the training corpus this was about 16 %, it drops to a little less than 9 % in the testing corpus. • Secondly, the success rate diminishes significantly, from 92.68 % to 77.93 %.

Remaining ERR • Even if the morphological rules may constitute an effective tool to analyze candidate words, which can then be manually checked by linguists in order to extend the lexical coverage of electronic dictionaries, it is clear that a very substantial part of the texts' different words remain to be tagged: • 28,841 in the training corpus and 28,808 in the testing corpus; • if all uppercase words were ignored, these could be reduced to about half: 13,469 in the training corpus and 13,641 in the testing corpus.

It is possible that new morphological rules, yet to be built, may help to increase the lexical coverage of unknown words. • But as we include new rules, these apply to an increasingly small number of word forms. • We estimate that the number of correctly formed but unknown Portuguese words analyzable by the method of suppletive morphological rules could still be increased up to 20 %. • Furthermore, as new rules interact with previously made rules, the number of words with multiple analyses increases, thus diminishing the precision of the results.

What is the nature of the remaining unknown words? The most common cases found were: • spelling errors: many unknown words are simply due to typing or spelling errors: abanadonaria (abandonaria, 'would abandon'), abastenção (abstenção, 'abstention'), abatecimento (abastecimento, 'supplying'), etc. Some errors are due to conversion between character sets: bonificaÁões (bonificações, 'bonifications'), consÛrcios (consórcios, 'consortia');

• words derived from proper names (mainly adjectives): aladinescos (from Aladin), balzaquiano (Balzaquian), hartleyano (from Hartley), hitchcockiana, hitchcoquiana (from Hitchcock; notice the orthographic adaptation to Portuguese spelling rules: -ck- > -qu-), deskhomeinização (des- Pfx + Khomeini Nprop + -iz Sfx-v + -ation Sfx-n, from Khomeini, Nprop);

• foreign words: in real texts, and in particular in journalistic texts, there are many foreign words. Mostly, these come from English and French, but other languages can also be found: accelerating access […] brick bricoleur brie briefing briefings bright brit british britpop broadcast broadsheet broken broker brokers […] chief child childhood children chill chills […] destroyed destroyer […] dégradable déja déjà délégations délire déluge démarche démarches démocratie démodé déplacement désir désire désordre […] engineer engineering engines english englishman […] fatwa fatwas faune faut faute fauteuil fauves faux […] grillons grind griots grip grisaille grizzly groove grossen grotesk […] handicapés handicappées handicaps handling hands handy hankseana hantavirus hants happen happened happy harakiri harandjita hard hardback hardcover […] international internazzionale internet […] jazzman jazzmen jazzy […] killer killers killing kills kilobits kilohertz kilowatts […] laid laird laisser laissez lait […] mailing mailings maillots main mainframe mainframes mainland mains mainstream […] notebook notebooks nothing nothingness […] opinion opium opposed opting option options […] partenaire partenaires partenership partial […] queries quest question […] rappel rapping rapport […] sell seller sellers selling selon […] talk talkie talkies talking talks tall […] und under underacting underground underplaying understand understanding underworld unfinished […] yiddish yields yodel yodelling yoga yokozuna yop yorker yoruba yorubas you young youngboy your yourself yuppie yuppies yuppy […] zappeur zapping […]

From the examples above, it is clear that no real corpus is free from many of these problems. • For robust (non-statistical) lexical analysis, several strategies must be used in combination with dictionaries and a suppletive morphological analyzer: • (1) error detection and correction, comparing unknown forms with lexicalized forms by letter changing, permutation, and so on; • (2) development of morphological rules based on dictionaries of proper names; • (3) language identification procedures, enabling the system to work with texts in mixed languages.
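The first strategy, comparing unknown forms with lexicalized forms by letter changing and permutation, can be sketched with a standard single-edit candidate generator. This is a generic technique chosen for illustration, not a procedure described in the paper; the four-word lexicon and the names `edits1` and `suggest` are assumptions.

```python
from typing import Iterable, List, Set


def edits1(word: str, alphabet: Iterable[str]) -> Set[str]:
    """All strings one deletion, transposition, substitution or insertion away."""
    letters = list(alphabet)
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    swaps = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in letters}
    inserts = {a + c + b for a, b in splits for c in letters}
    return deletes | swaps | replaces | inserts


def suggest(unknown: str, lexicon: Set[str]) -> List[str]:
    """Lexicon entries reachable from the unknown form with a single edit."""
    # The alphabet is taken from the lexicon itself, so accented letters
    # such as ç or ó are available as substitution candidates.
    alphabet = {ch for entry in lexicon for ch in entry}
    return sorted(edits1(unknown, alphabet) & lexicon)


# Toy lexicon (an assumption); the unknown forms are those cited in the text.
TOY_LEXICON = {"abandonaria", "abstenção", "abastecimento", "consórcios"}

if __name__ == "__main__":
    for form in ("abanadonaria", "abastenção", "abatecimento", "consÛrcios"):
        print(form, "->", suggest(form, TOY_LEXICON))
```

With a full dictionary a single unknown form will often have several candidates, so a real system would need a ranking step (for instance by word frequency) or manual validation, as with the morphological rules.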

While strategies (1) and (3) have already been put in place independently in the spelling checkers of text editors (MS-Word, for instance) and web browsers, strategy (2) has not seen much effort from (Portuguese) lexicographers, especially in view of automatic lexical analysis. • It combines encyclopedic dictionaries with morphological analyzers, an approach similar to the one shown here. • However, to our knowledge, the combination of these strategies (and possibly others) in the same system has not been done yet.

Conclusion • From the results obtained so far, the precision of the morphological rules is high (90 % on average). • However, it is clear that the goal of zero unknown tokens is still far from being achieved: • less than 20 % of ERR was matched by means of suppletive morphological rules. • In real life, there is no such thing as a 'clean' corpus: typos, foreign words, and derivatives of proper names are the sets of unknown tokens most responsible for this insufficiency in automatic lexical analysis.

For robust lexical analysis of these forms, other strategies must be found. • These may involve not only language identification procedures (and use of the corresponding dictionaries) but also correction of deviating or erroneous forms. • The combination of different strategies in a single system may constitute both a linguistic and a computational challenge in the near future.

Acknowledgements • Research for this paper was partially funded by Fundação para a Ciência e a Tecnologia (project grant POSI/PLP/34729/99). Thanks are due to C. Mota for making available her DimAum module.

References

Eleutério, S.; Ranchhod, E.; Freire, H.; Baptista, J. 1995. A system of electronic dictionaries of Portuguese. Linguisticae Investigationes 17-2: 57-82. Amsterdam: John Benjamins B. V.

Ranchhod, E.; Mota, C.; Baptista, J. 1999. A Computational Lexicon of Portuguese for Automatic Text Parsing. SIGLEX'99: Standardizing Lexical Resources. Proceedings of a Workshop Sponsored by the Special Interest Group on the Lexicon of the Association for Computational Linguistics and the National Science Foundation (June 21-22, 1999, University of Maryland, College Park, Maryland, USA), pp. 74-80. Maryland: University of Maryland.

Baptista, J.; Faísca, J. 2001. Um filtro para palavras exóticas frequentes do Português. Seminários de Linguística 4: 65-86. Faro: UALG-FCHS/CELL.

Baptista, J.; Faísca, J. 2003. Mapping, filtering and measuring impact of ambiguous words of Portuguese. 6th Intex Workshop, Sofia, Bulgaria (May 28-30, 2003).

Silberztein, M. 2004. Intex Manual. http://intex.univ-fcomte.fr/downloads/Manual.pdf

Mota, C. 2003. A Renewed Portuguese Module for Intex 4.3x. 6th Intex Workshop, Sofia, Bulgaria (May 28-30, 2003).

Mota, C. 2000. Analysis of Derivational Morphology by Finite State Transducers. In Dister, A. (ed.), Actes des Troisièmes Journées INTEX, Revue Informatique et Statistique dans les Sciences Humaines 36, pp. 273-287. Université de Liège.

Ranchhod, E. 2001. O uso de dicionários e de autómatos finitos na representação lexical das línguas naturais. In Ranchhod, E. (Org.), Tratamento das línguas por computador. Uma introdução à Linguística Computacional e suas aplicações, pp. 13-47. Lisboa: Caminho.
