Compositional vs. Frozen Sequences Jorge Baptista, University of Algarve, Portugal jbaptis@ualg.pt http://w3.ualg.pt/~jbaptis Lexicon-Grammar Workshop, Beijing, 16-17 Oct. 2004
1. Introduction • Compound words and frozen expressions constitute a major part of the lexicon of many languages. • Their definition is not easy, and conceptual and terminological discussions abound in the literature.
Compounds are traditionally defined on semantic grounds: • criterion of non-compositionality: the global meaning of a multiword expression cannot be calculated from the meanings of its individual elements when they are used separately in the language. • They can also be defined by formal, syntactic (or combinatorial) constraints.
semantically ‘opaque’ compound words: dog-collar, dogfight • only ‘half opaque’ compound words: dogfish, fish knife, half-life • semantically ‘transparent’ compound words: heavy element, <date> before present (present = 1950) • spelling rules are just writing conventions (orthography consecrates writing habits): fish knife / fish-knife, fish finger / fish-finger
Formal constraints on word combinations (not semantically motivated): e.g. the set of time‑related nouns (dawn, morning, afternoon, sunset, evening, night) and the prepositions, determiners or modifiers they combine with:
at noon / *at morning
in the evening / *on the evening
in the morning / *in morning
by morning / by the morning
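Because these constraints are unpredictable, they have to be listed rather than computed. A minimal sketch of such a listing, assuming a plain lookup table: the names `JUDGMENTS` and `judge` are hypothetical, and the table holds only the eight judgments shown on this slide. Any other combination has no recorded judgment and returns `None`.

```python
# Acceptability judgments copied from the slide (True = acceptable,
# False = starred). Keys are (preposition, determiner, noun); an empty
# string means "no determiner".
JUDGMENTS = {
    ("at", "", "noon"): True,
    ("at", "", "morning"): False,
    ("in", "the", "evening"): True,
    ("on", "the", "evening"): False,
    ("in", "the", "morning"): True,
    ("in", "", "morning"): False,
    ("by", "", "morning"): True,
    ("by", "the", "morning"): True,
}


def judge(phrase):
    """Return True/False for a recorded judgment, or None if none is recorded."""
    words = phrase.lower().split()
    if len(words) == 2:
        key = (words[0], "", words[1])
    elif len(words) == 3:
        key = tuple(words)
    else:
        return None
    return JUDGMENTS.get(key)


for phrase in ["at noon", "in morning", "at night"]:
    print(phrase, "->", judge(phrase))
```

The point of the sketch is that nothing in the code predicts the judgments: each acceptable combination is an entry, which is what treating it as a lexical unit amounts to.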
On the meaning of individual, isolated words: • the meaning of a word is related to the word’s syntax, i.e. to the words it co-occurs with; • the meaning of a given word is determined by inserting it in several different sentences and, by carefully controlling formal changes to those sentences, looking for changes (or invariance) in meaning.
There is disagreement about ‘transparent’, ‘half-transparent’ or even ‘opaque’ word combinations. • Intuitions about meaning are almost always too vague and imprecise to be used in a reproducible way. • It is preferable to use syntactic, formal criteria to identify compounds, • and to show that words are ‘frozen’ together, even if the meaning of the combination is relatively ‘transparent’.
‘frozen’ = two or more elements of the expression do not show any distributional variation, e.g. the set of time‑related nouns • the blocking of distributional variation is unpredictable • the acceptable combinations therefore have to be included in the lexicon and should be treated as compound lexical units.
Every part-of-speech (PoS) shows both simple and compound words. • For example, word combinations such as the man in the street could very well be analysed as an indefinite pronoun (similar to everyone):
Politicians always cared about the opinion of the man in the street
Many compound prepositions and conjunctions are usually already included in current dictionaries:
John stopped in the middle of the street
John came to Paris by way of Madrid
John came to Paris in spite of my warnings against it
John came to Paris because of my warnings
There are some (productive?) rules to produce compound adjectives: -like: to be life-like, Algol-like languages; -proof: to be (bullet + water + …)-proof • Other compound adjectives are frozen in purely combinatorial ways:
John is (sick and tired + *tired and sick) of that
Moreover, in English, verb + particle combinations forming phrasal verbs can be considered a special case of compound verb:
John ran (for a mile)
John ran away (to Brazil)
The batteries are running down
John ran into Mary
John ran off to Brazil
John ran off with a book
John’s lecture ran on
The printer ran out of paper
The truck ran over the dog
John ran through the entire proceeding
Some compound words can be described in a regular way, by means of finite-state transducers, as, for example, the (potentially infinite) set of compound numerals: twenty-one, one hundred and twenty-one, twenty-one thousand two hundred and twenty-one …
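To illustrate how regular these numerals are, here is a small sketch. It is a plain rule-based generator written for this illustration, not an actual finite-state transducer and not the author's grammar; the function names are hypothetical, and the range is limited to 1 to 999,999. It reproduces the three numerals quoted above from a handful of rules.

```python
UNITS = ["", "one", "two", "three", "four", "five", "six", "seven", "eight",
         "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def below_hundred(n):
    """Spell 1..99, hyphenating tens and units (twenty-one)."""
    if n < 20:
        return UNITS[n]
    tens, unit = divmod(n, 10)
    return TENS[tens] + ("-" + UNITS[unit] if unit else "")


def below_thousand(n):
    """Spell 1..999, joining hundreds and the rest with 'and'."""
    hundreds, rest = divmod(n, 100)
    parts = []
    if hundreds:
        parts.append(UNITS[hundreds] + " hundred")
    if rest:
        parts.append(below_hundred(rest))
    return " and ".join(parts)


def numeral(n):
    """Spell 1..999999 in the style of the examples on the slide."""
    if not 0 < n < 1_000_000:
        raise ValueError("this sketch only covers 1..999999")
    thousands, rest = divmod(n, 1000)
    parts = []
    if thousands:
        parts.append(below_thousand(thousands) + " thousand")
    if rest:
        parts.append(below_thousand(rest))
    return " ".join(parts)


for n in (21, 121, 21221):
    print(n, "->", numeral(n))
```

The sketch ignores variants such as one thousand and five; a real grammar of numerals would have to decide how to treat them.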
There is a high number of compound words in texts, particularly in scientific and technical texts. • They are meaning units • and must be identified as a block, not as a string of simple words. • Their overall meaning is unpredictable: it cannot be directly calculated from the meaning of their internal elements.
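Identifying a compound "as a block" can be sketched as a greedy longest-match lookup against a compound dictionary. This is an assumption about one simple way to do it, not a description of any particular system; `COMPOUNDS` and `tokenize` are hypothetical names, and the entries are compounds quoted elsewhere in these slides.

```python
# A toy compound dictionary; each entry is a tuple of lower-cased words.
COMPOUNDS = {
    ("square", "root"),
    ("round", "table"),
    ("hot", "dog"),
    ("in", "spite", "of"),
    ("in", "the", "middle", "of"),
}
MAX_LEN = max(len(entry) for entry in COMPOUNDS)


def tokenize(sentence):
    """Split on spaces, grouping the longest listed compound at each position."""
    words = sentence.split()
    units = []
    i = 0
    while i < len(words):
        # Try the longest possible candidate first, down to two words.
        for size in range(min(MAX_LEN, len(words) - i), 1, -1):
            candidate = tuple(w.lower() for w in words[i:i + size])
            if candidate in COMPOUNDS:
                units.append(" ".join(words[i:i + size]))
                i += size
                break
        else:
            # No compound starts here: keep the simple word.
            units.append(words[i])
            i += 1
    return units


print(tokenize("John calculated the square root of 9"))
print(tokenize("John came to Paris in spite of my warnings"))
```

A lookup like this always groups a listed sequence, so it cannot tell the compound round table (an event) from the free combination (a piece of furniture); that decision needs the syntactic context.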
In this lecture, we will focus on syntactic properties that can be used to identify compounds. • Since compounds are a major part of the lexicon of many languages, retrieving them and describing them in dictionaries is not a trivial task, especially if these dictionaries are meant to be used in natural language processing.
There are many statistical methods to retrieve compound (or multiword) lexical units from texts. • The linguist’s task is to validate those word combinations as compound lexical units and to build the dictionaries for them. • In order to do this, linguists have to rely on syntactic properties, which requires knowing the general syntactic rules of the language. • It is only then that linguists can find out the combinatorial constraints on those rules shown by multiword expressions.
This presentation is structured in two parts: • first we will present some of the major syntactical properties distinguishing compound nouns from ordinary noun phrases; and • in the second part we will give some examples of how the same methodology can be applied to the identification of compound adverbs.
1. Compound nouns. • Probably the best-known case of compounding, compound nouns constitute the largest of all compound word classes. • In every domain (scientific, technical, economic, political, etc.) there is a constant need to coin new denominations for new objects, tools, concepts, products and so on, nouns being the most natural part-of-speech (PoS) to accommodate such new designations.
Compound nouns are formed by sequences of grammatical categories identical to those appearing in ordinary (i.e. not frozen) noun phrases:
a nice dog (a dog) / a hot dog (a sandwich)
a square table (a table) / a square root (a mathematical function)
Adam’s orange (an orange) / Adam’s apple (a part of the human body)
Differences between compounds and free word combinations: • this distinction is not as clear-cut as dictionaries and grammars sometimes lead one to believe. • This presentation will show some of the basic syntactic properties that can help distinguish compounds from free word combinations.
Traditional grammar: compounding is studied within Morphology. • Lexicon-grammar approach: compounds are described with the very same tools used to describe the syntax of noun phrases.
In order to identify a compound as such, it is necessary to check whether that particular word combination shows any constraints on the combinatorial properties that one would expect to find in a noun phrase (NP) formed by the same internal PoS sequence (G. Gross 1988, 1989).
We compare the grammar of noun phrases with the syntactic properties of a word combination that is a candidate for the status of compound word. • Our examples here will consist of already well-known compound nouns. • By analogy, the same methodology can be extended to other, more complex word combinations.
Let’s take the examples square table / square root. • In a free NP with the internal structure Adjective + Noun (AN), the adjective is often a free modifier of the noun; • the predicative function of the adjective on the noun is made explicit by a paraphrase with a relative clause and the auxiliary verb be:
a square table : a table that is square
• This is not the case with the compound square root:
a square root : *a root that is square
nor with many other compound nouns, where we say that the adjective loses its predicativity.
Also, free adjectives can be further modified by an adverb:
a square table : a perfectly square table / a table that is perfectly square
but:
a square root : *a perfectly square root / *a root that is perfectly square
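The two tests seen so far (relative clause with be, adverb insertion) can be applied mechanically to any AN candidate; only the acceptability judgment needs a speaker. A minimal sketch, assuming a hypothetical helper `diagnostic_frames` that just builds the test phrases for the linguist to judge; the default adverb and determiner are taken from the examples above.

```python
def diagnostic_frames(adjective, noun, adverb="perfectly", determiner="a"):
    """Build the test phrases for an Adjective + Noun candidate.

    The function only generates strings; it does not decide whether
    they are acceptable.
    """
    return {
        "relative_clause": f"{determiner} {noun} that is {adjective}",
        "adverb_in_np": f"{determiner} {adverb} {adjective} {noun}",
        "adverb_in_relative": f"{determiner} {noun} that is {adverb} {adjective}",
    }


def looks_free(judgments):
    """A candidate passes as a free NP only if every frame was judged acceptable."""
    return all(judgments.values())


for adjective, noun in [("square", "table"), ("square", "root")]:
    for test, phrase in diagnostic_frames(adjective, noun).items():
        print(f"{adjective} {noun} | {test}: {phrase}")

# Judgments as given on the slides: all frames fail for square root.
print(looks_free({"relative_clause": False, "adverb_in_np": False,
                  "adverb_in_relative": False}))
```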
When the AN combination is free, both the adjective and the noun can vary, provided that basic distributional constraints are respected. • Therefore, table can be replaced by other nouns: a square (table + door + carpet + …) in the same way as square can be replaced by other distributionally similar adjectives: a (square + oval + triangular + oblong + …) table • However, when an AN combination forms a compound noun, distributional variation is blocked: a square (root + *twig + *branch + …) a (square + *oval + *triangular + *oblong + …) root
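The substitution test above can be sketched as a check over recorded judgments. This is an illustration under stated assumptions: `ACCEPTABLE`, `REJECTED` and `variation_blocked` are hypothetical names, and the two sets contain only the judgments given on this slide.

```python
# Judgments from the slide: (adjective, noun) pairs.
ACCEPTABLE = {
    ("square", "table"), ("square", "door"), ("square", "carpet"),
    ("oval", "table"), ("triangular", "table"), ("oblong", "table"),
    ("square", "root"),
}
REJECTED = {
    ("square", "twig"), ("square", "branch"),
    ("oval", "root"), ("triangular", "root"), ("oblong", "root"),
}


def variation_blocked(adjective, noun, other_adjectives, other_nouns):
    """True if every substitution on either slot was judged unacceptable."""
    substitutes = [(a, noun) for a in other_adjectives]
    substitutes += [(adjective, n) for n in other_nouns]
    if not substitutes:
        raise ValueError("no substitutes to test")
    unjudged = [s for s in substitutes
                if s not in ACCEPTABLE and s not in REJECTED]
    if unjudged:
        # Refuse to guess: a missing judgment is not a rejection.
        raise ValueError(f"no recorded judgment for {unjudged}")
    return all(s in REJECTED for s in substitutes)


adjectives = ["oval", "triangular", "oblong"]
print(variation_blocked("square", "root", adjectives, ["twig", "branch"]))
print(variation_blocked("square", "table", adjectives, ["door", "carpet"]))
```

Requiring every substitution to fail is a deliberately strict choice for this sketch; it would not by itself handle the small closed paradigms discussed later.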
Ambiguous strings: round table (free combination or compound noun). • Only the syntactic environment may help to disambiguate it:
I have bought a round table for my dining room (a piece of furniture)
I have attended a round table on French syntax (an event)
• Even if many compound nouns are ambiguous with free word combinations, they are usually much less ambiguous than simple words.
In a free NP, adjectives are just optional modifiers of the noun. • They can be deleted without changing the overall meaning of the NP (or the meaning of the sentence where the NP is inserted), E denoting the empty string:
John bought a (E + square) table
However, with some abstract nouns that express predicates and are hence called predicative nouns (M. Gross 1981; see below), the presence of a modifier is often obligatory (Meunier 1981; Giry-Schneider 1995; Laporte 1997):
He had an immense esteem for tradition (Henry James, Portrait of a Lady)
*He had esteem for tradition
*He had an esteem for tradition
When the adjective is not a mere modifier of the noun, it usually cannot be deleted, for it is the AN combination that forms a compound lexical unit. • This is particularly clear with semantically opaque compound nouns:
John attended a round table on Chinese Syntax
*John attended a table on Chinese Syntax
John calculated the square root of 9
*John calculated the root of 9
But in some compounds, even frozen adjectives can be deleted. • For example, most of the time people calculate square roots, so that in some languages – Portuguese, for instance – unless otherwise stated, the adjective quadrada (equivalent to square) can be zeroed without loss of information:
O João calculou a raiz (E + quadrada) de 9 (John calculated the (E + square) root of 9)
In many other cases, however, the adjective in a compound noun functions as a classifier of the noun, distinguishing a particular type of object: John likes to drink (red + white + … ) wine In this case, the adjective can be zeroed, with some loss of information: John likes to drink (E + red) wine
The classifying function of an adjective can be detected by means of classifying sentences: A red wine is a type of wine NP with free modifiers cannot enter classifying sentences: *A square table is a type of table Of course, compound nouns cannot enter these sentences either: *A square root is a type of root
When an adjective functions as a classifier, it is sometimes possible to see a (usually) small distributional paradigm:
John calculated the (square + cubic) root of that value
John likes to drink (red + white + …) wine
which is closed to further distributional variation:
John calculated the (square + cubic + *triangular + *spherical) root of that value
John likes to drink (red + white + *yellow + *blue + …) wine
In this sense, AN combinations where the adjective is a classifier can be described as compound nouns. • The distributional paradigm of the classifier adjective can be rather large (acids) and open to the coining of new terms, or relatively small (teeth and vertebrae) and closed to further additions:
John poured some (ascorbic + citric + nitric + …) acid into the solution
The dentist repaired one of my (incisive + canine + molar + …) teeth
John was injured in one of his (cervical + lumbar + …) vertebrae
In the compounds with wine, one finds that many toponyms (Ntop) designating wine-producing regions can replace wine:
John likes to drink a glass of (wine + Porto + Bordeaux + …)
• These combinations can be derived from a deleted occurrence of wine:
John likes to drink a glass of (E + Porto + Bordeaux + …) wine
• The number of Ntop wine combinations is very large (every wine region), but highly conventional, determined by extra-linguistic factors. Extensive lists can be made, but they are of little linguistic interest.
Some adjectives combine in a highly exclusive way with a very short set of nouns (often only one):
This noun is inflected in the nominative case
• In these cases, the noun of some AN compounds (but not all) can be zeroed, leaving the adjective in a (superficial) noun slot:
This noun is inflected in the nominative (E + case)
The dentist repaired my (canine + molar + …) (E + tooth)
• With less ‘exclusive’ adjectives, N can be zeroed depending on the syntactic context:
John prefers to drink red (E + wine) to white (E + wine)
This is probably one of the reasons why dictionaries have classified so many adjectives both as adjectives and as nouns (see M. Gross 1998 for further discussion of this subject). • However, zeroing of the noun is not always possible:
John was injured in a (*cervical + *lumbar + …)
• or it may depend on the language and the AN (or NA) combination involved. In Portuguese, for instance, zeroing of N in a similar case is observed with some adjectives but not others:
O João ficou ferido numa (E + vértebra) (cervical + *dorsal + *lombar + *sacra)
A particular case of AN combinations: relational adjectives, i.e. adjectives derived from nouns, such as presidential (from President). • These adjectives never allow the formation of the relative clause, nor the insertion of an adverbial modifier:
The presidential address to the Congress <was very disturbing>
*The address to the Congress that was presidential <was very disturbing>
*The very presidential address to the Congress <was very disturbing>
Nouns such as address express predicates and are therefore called predicative nouns (M. Gross 1981). • Relational adjectives, such as presidential, when combined with predicative nouns, do not function as mere modifiers of the noun. Instead, they are derived from a complement NP:
The President’s address to the Congress <was very disturbing>
In this sentence, President is interpreted as an argument (in this case, the subject) of the predicative noun address. • This syntactic and semantic relation between the two nouns (President – address) is of the same nature as the relation between a subject and verb, and it has a formal counterpart in the sentence: The President made an address to the Congress
We consider this to be an elementary sentence, whose predicative node is the noun address, which selects its two arguments (President, Congress). • In this sentence, to make is a support verb (Vsup; also called light verb): • it is devoid of meaning and functions as a morphological tool to actualize the predicative noun, carrying the tense morphemes that the noun cannot express.
Now, the adjective presidential can enter many other AN combinations, involving predicative nouns: The presidential campaign <…> However, some of these combinations cannot be derived from the reduction of support verb sentences.
In fact, the NP The presidential campaign above is ambiguous:
(a) ‘the campaign that the President is making’; in this reading, the NP is equivalent to:
The president’s campaign <has been extremely violent>
(b) ‘a campaign where many people run for the office of President’ (and not necessarily the President himself); in this reading, the NP can appear in sentences such as:
The presidential campaign <takes place in September>
Notice that the regularly derived NP cannot appear in this context:
*The president’s campaign takes place in September
It is therefore necessary to study in detail the properties of all AN combinations where Adj is a relational adjective and N a predicative noun, in order to determine whether the combination can be regularly derived from an elementary sentence with a support verb or whether this derivation is blocked in some way and the combination has become a compound noun (A. Monceaux 1999).
The next case illustrates a curious type of blocking involving relational adjectives such as solar (sun) or lunar (moon). • Some AN noun phrases are regularly derived from elementary sentences where moon or sun is an argument of a predicative noun, such as eclipse:
the eclipse of the (moon + sun) <lasted 20 minutes>
the (lunar + solar) eclipse <lasted 20 minutes>
?*the (moon + sun)’s eclipse <lasted 20 minutes>
*the (moon + sun) eclipse <lasted 20 minutes>
There are, however, many AN combinations that one cannot derive from moon or sun:
the lunar month <lasts 28 days>
*the moon’s month <lasts 28 days>
*the month of the moon <lasts 28 days>
*the moon month <lasts 28 days>
the solar year <lasts 365.25 days>
*the sun’s year <lasts 365.25 days>
*the year of the sun <lasts 365.25 days>
?*the sun year <lasts 365.25 days>
Finally, some compounds show morphosyntactic constraints: while their elements can vary in gender and/or number when used independently, together they do not show any variation. For example, national waters is always used in the plural, in spite of the uncountable nature of water:
They prevented the ship from entering (national waters + *national water)
There is a certain degree of institutionalization in compounding. • Sometimes several different structures may be available in the language to designate the same concept or object, but the language retains only one of them.
‘machine used to take photographs’:
• photographic machine (AN)
• photographing machine (V-ing N, as in washing machine)
• photo(graph) machine (NN, as in copy machine)
• photographier (N-er, as in photocopier)
Instead, it is the simple word camera that is used to name this object.
When comparing different languages, one finds that each may adopt a different strategy, hence:
FR: appareil photo (NN) ‘photo apparatus’; *appareil à photographier (N à V), *appareil photographique (NA), *photograph(i)euse / *photograph(i)eur (N-eur)
PT: máquina fotográfica (NA) ‘photographic machine’; *máquina de fotografar (N de V), *foto-máquina (NN), *fotografiadora (N-ora) / *fotografadora (V-ora)
In view of these differences between languages, many dictionaries used in machine translation may have to include some word combinations regardless of their semantic transparency.
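What "including the word combination in the dictionary" could look like for machine translation is sketched below. The structure is an assumption made for illustration (the names `CONCEPTS` and `translate` are hypothetical); the forms are the ones given on the slides.

```python
# One entry per concept; each language maps to the institutionalized form,
# whether it is a simple word or a compound.
CONCEPTS = [
    {
        "en": "camera",               # simple word
        "fr": "appareil photo",       # NN compound
        "pt": "máquina fotográfica",  # NA compound
    },
]


def translate(form, source, target):
    """Translate a whole lexical unit; return None if it is not listed."""
    for entry in CONCEPTS:
        if entry.get(source) == form:
            return entry.get(target)
    return None


print(translate("máquina fotográfica", "pt", "en"))
print(translate("appareil photo", "fr", "pt"))
print(translate("máquina", "pt", "en"))
```

The lookup works on the whole unit: a word-by-word rendering of máquina fotográfica would give photographic machine, which the previous slide shows is not the form English retained.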