Constructing a Romanian Electronic Dictionary
Andrei Filip
Universitat Autònoma de Barcelona
1. The Format of the Romanian Electronic Dictionary
1.1. The Macrostructure
1.2. The Microstructure
2. The Noun Inflection System. NooJ Graphs Implementation
2.1. The Gender and Determination Issues
2.2. The Grammatical Category of Number
2.3. The Grammatical Category of Case

1. The Format of the Romanian Electronic Dictionary
1.1. The Macrostructure
The macrostructure is composed of the lexical units that make up the dictionary (in our case about 30,738 entries). What makes it different from paper dictionaries? In what we call traditional dictionaries, each entry generally corresponds to a basic form. This implies separating syntax (the structures in which the units can be combined) from the lexicon (the inventory of forms associated with one or more meanings).
At least two major problems arise from this treatment as far as natural language processing is concerned:
a) polysemy
b) idiomatic expressions
Traditional entries therefore describe either only part of a lexical unit or several lexical units at once. The strategy we adopt is to treat the entry not as a form but as a lexical unit, which is made up of a form a, a meaning ‘a’ and a combinatory ∑a.
e.g. Este o veste însemnată. (Fr. une nouvelle importante, "an important piece of news")
Vaca care este însemnată îi aparţine. (Fr. marquée, "marked")
Este un om însemnat. (Fr. personne estropiée, "a crippled person")
In the previous sentences, each use of the adjective “însemnat” is characterized by its own combinatory and a single meaning, and therefore corresponds to an independent lexical unit. Moreover, as we have already seen, each lexical unit corresponds to a different translation unit in the target language. If we define lexical units in this way, lexical ambiguity is no longer a problem, as each unit corresponds to a single meaning. We should also distinguish between simple and compound lexical units. For the time being we concentrate only on the Romanian dictionary of simple forms and leave the dictionary of compound lexical forms for further research.
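The three uses of “însemnat” discussed above can be sketched as three separate records sharing one form. This is only an illustration, not the dictionary's actual storage format: the class name, the field names and the English meaning labels are assumptions, and the combinatory ∑a is reduced here to the noun the adjective combines with in each example sentence.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LexicalUnit:
    form: str         # the written form a
    meaning: str      # label for the meaning 'a' (English labels are illustrative)
    combinatory: str  # simplified stand-in for ∑a: the noun modified in the example
    translation: str  # French equivalent given in the examples

# One form, three independent lexical units.
UNITS = [
    LexicalUnit("însemnat", "important", "veste", "une nouvelle importante"),
    LexicalUnit("însemnat", "marked", "vacă", "marquée"),
    LexicalUnit("însemnat", "crippled", "om", "personne estropiée"),
]

def units_for(form: str) -> list:
    """Return every lexical unit that shares the given form."""
    return [unit for unit in UNITS if unit.form == form]

for unit in units_for("însemnat"):
    print(unit.combinatory, "->", unit.meaning, "/", unit.translation)
```

Because each record carries exactly one meaning and one translation, the ambiguity sits in the lookup by form, not inside any single entry.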
Spelling variants are treated separately: each variant is given its own entry and description in the dictionary.
e.g. atunci/atuncea; acum/acuma; flutur/fluture
We have also taken a different approach to gender. Masculine and feminine nouns of the same pair (what we could also call correlative nouns) are given separate entries:
e.g. bunic – bunică; copil – copilă; cuscru – cuscră; cumnat – cumnată; profesor – profesoară; italian – italiancă; leu – leoaică; ţăran – ţărancă; doctor – doctoriţă; cârciumar – cârciumăreasă; păun – păuniţă etc.
They therefore correspond to different inflection graphs, and the feminine nouns do not come out as inflected forms of the corresponding masculine noun. The aim is also to facilitate the lexicographical treatment of natural gender.
1.2. The Microstructure
The microstructure of an electronic dictionary is made up of the lexicographic information recorded for each entry: information on the lemma, on its possible arguments, and on the lexical units that are semantically related to the lemma (i.e. lexical restrictions and translation equivalents). All this information is distributed across the descriptive fields of the database.
Each entry is characterised first of all by its morphological description (G field). This field corresponds to the inflection graphs that characterise the parts of speech: N, A, V, ADV, PREP, DET, PRO and Residual. The inflection code attached to each entry also lets us recover further information, such as gender.
The next field, T, provides information about the syntactico-semantic features of each entry. These features concern mainly nouns. We distinguish between Hum, Inc, Anl, Veg, Loc, Tps, and Abs (which is further subdivided into states, actions and events).
The fourth field, C, is reserved for the “classes d’objets” (object classes; Gross, G. 1994; Le Pesant and Mathieu-Colas, 1998). These classes are established from the syntactic characteristics of the lexical units. A class of elementary arguments is defined by the predicates that select arguments belonging to the same object class. Higher-order predicates, which accept other predicates in their argument domain, are also grouped into object classes. For the time being 59 classes have been implemented in our dictionary.
e.g. cântăreaţă: C: artist
To make our description more precise we also include the D field, which corresponds to the domains (about 91) that have been accurately described by the Laboratoire de Linguistique Informatique of Paris 13. A domain is defined as:
“un ensemble d’expressions dénommant dans une langue naturelle des notions relevant d’un domaine de connaissance thématisé” (Lerat, 1995) (a set of expressions that name, in a natural language, notions belonging to a thematised field of knowledge)
This kind of description allows us to disambiguate polysemous lexical units. For further precision, the field SD (subdomain) has been introduced.
e.g. cineast D: cinema-photography SD: cinema
Our next field corresponds to the translation equivalent (Fr/Es). It is important to state that we do not consider this field as metalinguistic information about a lexical unit, but rather as a pointer to another lexical unit which has its own linguistic description in the target language dictionary. Our aim is to create coordinated monolingual electronic dictionaries (cf. Blanco 2001), since in most cases the morphological and syntactic descriptions differ from one language to the other.
We have also introduced a further field, P (cf. Garrigues 1997), to account for how a speaker would actually use a given lexical entry. Two criteria are taken into consideration for this field:
- whether or not a mental image of the word exists in a person's mental lexicon;
- how often the word would occur in everyday speech (we refer here not to the form but to the association form/meaning).
A final field remains to be introduced; it concerns what Hausmann (1989) calls “diasystematics”.
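The fields described in this section can be pictured as one record per entry. The sketch below is an assumption about layout only: the class name, the use of `None` for fields the slides do not fill in, and the helper `filled_fields` are all hypothetical. The only values used are the ones given in the examples (cântăreaţă, cineast).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Entry:
    lemma: str
    G: Optional[str] = None   # morphological description (inflection code)
    T: Optional[str] = None   # syntactico-semantic feature: Hum, Inc, Anl, Veg, Loc, Tps, Abs
    C: Optional[str] = None   # object class ("classe d'objets")
    D: Optional[str] = None   # domain
    SD: Optional[str] = None  # subdomain
    Fr: Optional[str] = None  # pointer to the French lexical unit
    Es: Optional[str] = None  # pointer to the Spanish lexical unit
    P: Optional[str] = None   # usage field (cf. Garrigues 1997)

def filled_fields(entry: Entry) -> dict:
    """Return only the descriptive fields that have a value."""
    return {name: value for name, value in vars(entry).items()
            if name != "lemma" and value is not None}

# Values taken from the examples in the text; all other fields are left empty.
singer = Entry("cântăreaţă", C="artist")
filmmaker = Entry("cineast", D="cinema-photography", SD="cinema")

print(filled_fields(singer))
print(filled_fields(filmmaker))
```

Leaving unknown fields empty rather than guessing keeps the record faithful to what has actually been described for each entry.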
2. The Noun Inflection System. NooJ Graphs Implementation
2.1. The Gender and Determination Issues
As far as gender is concerned, we distinguish three main classes in Romanian:
• Masculine: un frate – doi fraţi
• Feminine: o colegă – două colege
• Neuter: un drum – două drumuri
From a morphological point of view, neuter nouns behave like masculine nouns in the singular and like feminine nouns in the plural. They therefore select different operators according to number.
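The behaviour of the neuter can be sketched as a small agreement rule: a neuter noun takes masculine agreement in the singular and feminine agreement in the plural. The function names and the one-letter gender codes are assumptions made for this illustration; the determiners and noun forms are the ones in the three examples above.

```python
# Singular indefinite article and the numeral "two", by agreement gender.
SINGULAR_DET = {"m": "un", "f": "o"}
PLURAL_DET = {"m": "doi", "f": "două"}

def agreement_gender(gender: str, number: str) -> str:
    """Map the lexical gender (m, f, n) to the gender used for agreement."""
    if gender == "n":
        # Neuter: masculine in the singular, feminine in the plural.
        return "m" if number == "sg" else "f"
    return gender

def with_determiner(noun_form: str, gender: str, number: str) -> str:
    """Prefix an already inflected noun form with the agreeing determiner."""
    agr = agreement_gender(gender, number)
    det = SINGULAR_DET[agr] if number == "sg" else PLURAL_DET[agr]
    return f"{det} {noun_form}"

print(with_determiner("drum", "n", "sg"))     # un drum
print(with_determiner("drumuri", "n", "pl"))  # două drumuri
```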
From a semantic point of view, the neuter is quite a homogeneous class, as it includes mostly Inc nouns (e.g. ciocan – ciocane), HumColl nouns (e.g. popor, trib, grup, colectiv etc.) and Anl nouns which denote species (e.g. mamifer, gasteropod, dobitoc).
As far as the grammatical category of determination is concerned, we concentrate here only on the definite article. All the other Det have their own inflection system, which depends on case and on whether or not they precede the NG. The definite article in Romanian is an adjoined enclitic morpheme which needs to be described in the inflection graph:
e.g. studentul, steaua, cartea, regele, codrul
ţară – ţara, popă – popa, poezie – poezia etc.
For plural nouns, the definite article morpheme depends only on the gender of the noun:
• “i” for masculine nouns: e.g. studenţi – studenţii; fraţi – fraţii; copaci – copacii
• “le” for feminine and neuter nouns: e.g. studente – studentele; poezii – poeziile; popoare – popoarele; sigilii – sigiliile.
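The plural rule above is simple enough to write down directly. The following is a minimal sketch, not the NooJ graph itself; the function name and the gender codes are assumptions, and the function expects the unarticulated plural form as input.

```python
# Plural definite article as stated above:
# masculine -> "i"; feminine and neuter -> "le".
PLURAL_DEFINITE_SUFFIX = {"m": "i", "f": "le", "n": "le"}

def plural_definite(plural_form: str, gender: str) -> str:
    """Attach the enclitic definite article to an unarticulated plural form."""
    return plural_form + PLURAL_DEFINITE_SUFFIX[gender]

for form, gender in [("studenţi", "m"), ("studente", "f"), ("popoare", "n")]:
    print(form, "->", plural_definite(form, gender))
```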
2.2. The Grammatical Category of Number
With respect to the inflectional morphemes that mark the singular-plural opposition, we can distinguish three main classes of nouns in Romanian:
a) Variable nouns with a regular inflection paradigm: e.g. casă – case; şcolar – şcolari; drum – drumuri;
b) Variable nouns with an irregular inflection paradigm: e.g. om – oameni; soră – surori;
c) Invariable nouns: e.g. tei – tei; învăţătoare; pronume
So far we have created 11 different inflection graphs for masculine nouns, 16 for feminine nouns and 9 for neuter nouns.
We need to add that several nouns have two plural forms (especially feminine and neuter ones):
e.g. coală – coli/coale; vreme – vremi/vremuri; chibrit – chibrituri/chibrite; hotel – hoteluri/hotele.
However, in some cases the two plurals call for different lexico-semantic descriptions. Here we are dealing with the same singular form but with a different meaning and combinatory, so the nouns are treated under different entries in our dictionary.
e.g. corn – coarne vs. corn – cornuri
mâncare – mâncări vs. mâncare – mâncăruri
Special attention should be paid to Singularia Tantum and Pluralia Tantum nouns. Our strategy is to record, in the code assigned to the entry in the G field, that the noun lacks the number opposition in its inflection graph.
e.g. ochelari N11P
moaşte N23P
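Codes such as N11P can be split mechanically into their parts. The reading used below is an assumption drawn only from the two examples: a part-of-speech label, a graph number, and an optional final P for Pluralia Tantum. The function name and the returned tuple layout are hypothetical, and no marker for Singularia Tantum is assumed because none is shown.

```python
import re

# Assumed shape: category letters, graph number, optional "P" (Pluralia Tantum).
G_CODE = re.compile(r"([A-Z]+)(\d+)(P?)")

def parse_g_code(code: str) -> tuple:
    """Split a G-field code into (category, graph number, is_pluralia_tantum)."""
    match = G_CODE.fullmatch(code)
    if match is None:
        raise ValueError(f"unrecognised G code: {code!r}")
    category, number, plural_only = match.groups()
    return category, int(number), plural_only == "P"

print(parse_g_code("N11P"))  # ochelari
print(parse_g_code("N23P"))  # moaşte
```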
2.3. The Grammatical Category of Case
A third main factor we have to consider when building our inflection graphs is case. From the point of view of their internal structure, nouns can be grouped into the same three main classes determined by the number opposition:
• Variable nouns with a regular inflection pattern;
• Variable nouns with an irregular inflection pattern;
• Invariable nouns.
Let us first consider nouns in the Nominative and the Accusative. They may or may not be inflected with the enclitic definite article (om – omul, oameni – oamenii, casă – casa, case – casele etc.). The uninflected noun may or may not be accompanied by the indefinite article or any other determiner, which then takes over the inflection pattern.
The main noun forms in the Nominative/Accusative are:
• With the definite article:
• With the indefinite article:
There were many orthographic constraints that we had to consider when designing our inflection graphs, but for the sake of concision we do not go into detail here.
As far as the Genitive and the Dative are concerned, we distinguish the following main forms:
a) Articulated forms:
b) Unarticulated forms:
We have to note that the only nouns that change their form in the Dative and Genitive are feminine nouns in the singular:
e.g. casă – casei / unei case
basma – basmalei / unei basmale
vulpe – vulpii / unei vulpi.
In this case the Dative and the Genitive singular are indicated both by the form that the noun takes and by the form of the inflected article (the definite article “-i” or the indefinite article “unei”).
As with nouns in the Nominative/Accusative, we also have to deal here with orthographic exceptions. We can identify four main types, but we discuss only one example here.
Feminine nouns whose Nominative singular ends in a vowel or diphthong are written with final “-ei” or “-ii” when they are inflected with the definite article. To avoid confusion, we start from the form of the unarticulated noun in the Nominative plural.
e.g.
N. pl. unart.     D/G sg. unart.     D/G sg. art.
(nişte) case      (unei) case        casei
vulpi             vulpi              vulpii
femei             femei              femeii
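Read column by column, the table above suggests a simple derivation for these feminine nouns: the unarticulated Dative/Genitive singular is identical to the unarticulated Nominative plural, and the articulated form adds the definite article “-i” to it. The sketch below encodes only that pattern; the function name is an assumption, and the other orthographic types mentioned in the text are not covered.

```python
def feminine_dg_singular(nom_pl_unart: str) -> tuple:
    """Derive the D/G singular forms of a feminine noun from its
    unarticulated Nominative plural.

    Returns (unarticulated form, articulated form).
    """
    unarticulated = nom_pl_unart      # (unei) case, vulpi, femei
    articulated = nom_pl_unart + "i"  # casei, vulpii, femeii
    return unarticulated, articulated

for plural in ["case", "vulpi", "femei"]:
    print(plural, "->", feminine_dg_singular(plural))
```

Starting from the plural avoids having to decide between “-ei” and “-ii” separately: the choice follows from whether the plural ends in “-e” or “-i”.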
Finally, for nouns in the Vocative, we can distinguish four different cases:
1. Some nouns have specific forms for the Vocative:
e.g. bărbate; cumetre; bunicule (masculine)
bunico; cuscro (feminine)
2. Some nouns have specific Vocative forms but also accept an alternative form which is identical to the inflected Nominative/Accusative form:
e.g. bunico – bunica
3. The majority of nouns have specific forms for the Vocative case, but when the speaker wants to emphasize the appellative function, the form used is the same as that of the uninflected Nominative/Accusative noun:
e.g. frate; tată; mamă
4. For masculine and feminine plural nouns in the Vocative we use the same forms as for the uninflected Nominative/Accusative or the inflected Genitive/Dative:
e.g. Veniţi, fraţi!
Staţi, fraţilor/vecinilor/fetelor!

With the support of the Universitat Autònoma de Barcelona