Morphology, Morphological Processes and Morphological Processing John Barnden School of Computer Science University of Birmingham Natural Language Processing 1 2010/11 Semester 2
Overview • Morphology is to do with the “shape” (internal structure) of words and how the shape changes to reflect certain common, fairly systematic changes of meaning. E.g.: • forming plurals of many nouns by adding an “s”; but also irregular plurals (e.g. “goose” to “geese”); • forming the past tense of many verbs by adding “ed”; • going from “buy” to “buyer”; • going from “happy” to “unhappy”; • forming “doghouse”. • Many such changes (not all) involve adjustments of substructure. • Such substructure is a matter of decomposition into meaning-bearing/affecting subunits (as opposed, e.g., to individually meaningless letters or letter-strings).
Overview, contd • Morphological processes are the ways, alluded to on the previous slide, in which words can, so to speak, change [more exactly: certain ways in which words are related to each other]. • Morphological processing is about how to computationally convert between words according to morphological processes, how to analyse words into their components if any, and how to create words from such components. • Lectures will cover a start on this, with further detail left to the textbook. • The basic tools are regular expressions or (equivalently) finite state automata.
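As a rough illustration of the regular-expression view, here is a minimal Python sketch that recognises a few regular verb forms and splits them into stem and suffix. The stem list and the name `analyse` are assumptions made up for this example; they do not come from the module materials.

```python
import re

# Hypothetical mini-lexicon of regular verb stems (an assumption for illustration).
STEMS = ["walk", "jump", "talk"]

# A stem followed by an optional regular inflectional suffix.
PATTERN = re.compile(
    r"^(?P<stem>" + "|".join(STEMS) + r")(?P<suffix>s|ed|ing)?$"
)

def analyse(word):
    """Return (stem, suffix) for a recognised form, or None otherwise."""
    m = PATTERN.match(word)
    if m is None:
        return None
    return m.group("stem"), m.group("suffix") or ""

print(analyse("walked"))  # ('walk', 'ed')
print(analyse("talk"))    # ('talk', '')
print(analyse("sang"))    # None: irregular forms need separate treatment
```

The same pattern could be drawn as a small finite state automaton with one path per stem and a shared set of suffix arcs.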
Morphemes • Morphemes are the components of words that we will be considering. • They are variously described as (for any particular language): • the minimal units of grammatical analysis • the minimal units of meaning • the minimal units that bear meaning [J&M p.81] • Possibly better: • the minimal units that bear or affect meaning.
Examples of Morphemes • Consider the word “unhappiness”: • composed of three morphemes, each carrying a certain amount of meaning: • “un” here means opposite of [or not in many other cases] • “ness” means being in a state or condition • “happy”: the familiar word (slightly modified by being combined on the right). • One classification of morphemes: • “happy” is a free morpheme because it can appear on its own and still mean the same as in the word above. • “un” and “ness” are bound morphemes as they have to be attached to a free morpheme – they can’t mean what they do above when standing on their own. • But: • There is a completely unrelated word “ness”. • There is a rather informal word “un” derived from the “un” morpheme, meaning something like “unimportant, characterless, ineffectual, ...” • “happy” can act as part of a bigger word in other ways, as in “trigger-happy”.
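The decomposition of “unhappiness” can be sketched in Python as follows. This is only an illustration under stated assumptions: the affix lists, the one-word stem set and the names `restore_stem` and `segment` are invented for the example, and the only spelling adjustment handled is the y/i change seen in “happi” + “ness”.

```python
PREFIXES = ["un"]      # bound morphemes tried at the front (assumed list)
SUFFIXES = ["ness"]    # bound morphemes tried at the end (assumed list)
STEMS = {"happy"}      # free morphemes known to this sketch (assumed)

def restore_stem(s):
    """Return the stored stem for s, undoing the y -> i spelling change if needed."""
    if s in STEMS:
        return s
    if s.endswith("i") and s[:-1] + "y" in STEMS:
        return s[:-1] + "y"
    return None

def segment(word):
    """Return the list of morphemes in word, or None if no analysis is found."""
    for prefix in [""] + PREFIXES:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in [""] + SUFFIXES:
            if suffix and not rest.endswith(suffix):
                continue
            core = rest[:len(rest) - len(suffix)]
            stem = restore_stem(core)
            if stem is not None:
                return [m for m in (prefix, stem, suffix) if m]
    return None

print(segment("unhappiness"))  # ['un', 'happy', 'ness']
print(segment("uncle"))        # None: "un" here is not the prefix
```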
Affixes • Affixes are an important type of (usually bound) morpheme, usually small, and making a largely predictable meaning change that’s largely independent of what they are applied to. • In “unhappiness”, there are two kinds of affix: • a prefix, “un” • a suffix, “ness”. • There are also infixes in some languages (and perhaps, in special cases, in English). • Bontoc [from the Philippines] uses the infix “um” to change adjectives and nouns into verbs. So the word “fikas”, which means strong, is transformed into “fumikas”, meaning be strong. • English: “unhappy” to “unbloodyhappy” [slang] – and similarly with some other favourite swear words. • See textbook for circumfixes.
Affixes, contd • Other examples of prefixes: • “re”: conveys repetition or renewal, as in “redevelop” or “retile” or “re-examine” • “mis”: conveys doing something wrongly, as in “misremember” • “de” and “dis”: convey removal, undoing or reversal, as in “depopulate”, “disembowel”, “disappear” • “in” [or “im” before some consonants]: often conveys negation, as in “indecisive”, “immeasurable”, “imperfect” • “in” [or “im”]: can also convey being/putting into a state, as in “invigorate”, “inflammable” • “anti”: indicates opposition, as in “antisemitic” • “ante”: indicates beforeness [spatial or temporal], as in “antenatal” • “pre”: indicates beforeness [spatial or temporal], as in “prefix”!
Affixes, contd • Other examples of suffixes: • “ing”: changes a verb infinitive into progressive form, as in “buying” • “s” [changed to “es” after some consonants]: makes a noun plural or changes a verb infinitive into 3rd-person singular • “ed” [or “d” or “t”]: changes a verb infinitive into past tense or past participle form • “ity”: makes a noun out of an adjective, as in “activity”, “purity” • “less”: indicates lack of something [could perhaps be considered a free morpheme because so close in meaning to the word “less”] • “ish”: indicates likeness, closeness or somewhatness, as in “bluish” and “city-ish” • I invented “somewhatness” by freeish application of morphology!
Cautions re Affixes, etc. • Letters at the beginning or end of a word can of course look like, but not be, a particular affix. Two examples: • “re” is sometimes not the abovementioned prefix, as in “regal”, “ready”, “region” • “ly” is sometimes not the adverb-creating suffix, as in “holy”, “lily”, “hilly” • A word need not contain any free morpheme: • E.g., “inhere”, “cohere” and “adhere” are each formed from a common prefix (a bound morpheme) plus the morpheme “here” – the latter means to stick (from a Latin verb) but is not itself usable as a word of English with such a meaning. • Affixes can be concatenated (strung together) to some extent: • E.g.: “morphologically”, “antidisestablishmentarianism”, “moralizing” • Some languages, e.g. Turkish, allow concatenation much more extensively (see textbook).
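The first caution can be made concrete with a small Python sketch: strip “re” only when what remains is a known word. The three-word lexicon and the name `strip_re` are assumptions for illustration only.

```python
# Hypothetical lexicon of known base words (an assumption for this sketch).
LEXICON = {"tile", "develop", "examine"}

def strip_re(word):
    """Return ('re', base) if word is 're' plus a known word, else (None, word)."""
    if word.startswith("re"):
        rest = word[2:].lstrip("-")   # also accept hyphenated forms such as "re-examine"
        if rest in LEXICON:
            return ("re", rest)
    return (None, word)

print(strip_re("retile"))      # ('re', 'tile')
print(strip_re("re-examine"))  # ('re', 'examine')
print(strip_re("regal"))       # (None, 'regal')
```

The lexicon check is itself fallible: if the lexicon contained “gal” or “ad”, this sketch would wrongly split “regal” or “read”, so a fuller treatment would need more information than string matching.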
Cautions re Affixes, etc., contd. • Affixes can adjust the meaning of what they’re affixed to in somewhat subtler ways than you might expect: • E.g., “entomb” comes from “en” [meaning put in] and “tomb”, but it usually has a broad metaphorical meaning, and less usually means put in a [literal] tomb. • Such formation of verbs from nouns often uses metaphorical meanings of the main morpheme rather than literal meanings.
Word Stems • A word often has one intuitively-main morpheme: • “unhappiness”: main morpheme is “happy” • The main morpheme is called the “stem” of the word, and may be the whole word. • The stem is often a free morpheme, but need not be. • We’ll see below that a word can contain more than one free morpheme, and in such cases the idea of a stem may be more difficult.
Types of Morphological Process • It’s convenient to divide morphological processes into four rough types: • Inflection • Derivation • Cliticization • Compounding • It’s difficult to devise a precise definition of these types, even within a single language. • There’s some overlap.
Inflection • Inflection is a morphological process that varies a word • in certain very limited, standard, predictable ways, • typically via affixes, • keeping some large part of meaning intact, • but changing the values of certain standard parameters. • The variations are usually tightly related to the grammatical structure of the surrounding expression. • Examples in English on next slide ...
Inflection: Examples • Nouns: • Pluralizing a singular noun (a basic example: “cat” to “cats”) • Forming possessive forms of a noun (a basic example: “cat’s” and “cats’”) • Pronouns and related adjectives: • Setting the case/number (e.g., in varying between nominative “I/we/who”, accusative “me/us/whom”, possessive forms “my/mine/our/ours/whose”). Also demonstrative pronouns and adjectives: “this/these”, “that/those”. • Verbs: • Setting the person, number, tense, etc. (infinitive “eat” to “eats/ate/eaten”; “be” to “am/is/are/was/were”) • Forming the present participle by adding “ing”, used in progressive constructions (“I am / was / will be buying”) and as a gerund (a form of noun, as in “the cutting of the cake”). • Adjectives and adverbs: • Forming the comparative and superlative forms (e.g., “big” to “bigger” and “biggest”; “fast” [as adverb] to “faster” and “fastest”; and in colloquial English “quickly” to “quicker” and “quickest”).
Inflection, contd. • Other languages may do things not done, or not done much, in English, e.g.: • inflect nouns for case (nominative, accusative, etc.) • inflect a definite article for case and number (whereas in English it’s always just “the”). • Conversely, other languages do not do some things English does (e.g., Japanese nouns don’t have plural or possessive forms).
Inflection, contd. • Inflection often involves certain systematic spelling changes to the stem, e.g. • Final “c” becomes “ck” as in “picnic” to “picnicking” • Dropping of a single, not separately pronounced “e” when adding “ing” (but not when the word ends in “ee”) • Doubling of the final consonant when adding a suffix starting with “e” or “i” (as in “beg” to “begging”, and “big” to “bigger”). • Inflection includes cases where meaning variations of the sorts on the previous slide are reflected by irregular word forms (consider irregular verbs such as “be”, irregular plurals such as “geese” and “mice”). • So inflection is not just about systematic lexical changes. (The textbook is slightly inconsistent on this.) • Inflection includes the case where the word form is actually unchanged (e.g. “hit”: infinitive and past-tense form and past participle).
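The three spelling changes can be sketched as a Python function for adding “ing”. This is a deliberately rough approximation: the function name `add_ing` is invented here, and the doubling test (exactly one vowel letter, ending consonant-vowel-consonant) is an assumption that stands in for the real, stress-dependent rule, so forms such as “dying” or “quizzing” are not handled.

```python
VOWELS = "aeiou"

def add_ing(verb):
    """Attach 'ing' to a verb, applying three common spelling adjustments."""
    # Final "c" becomes "ck": picnic -> picnicking
    if verb.endswith("c"):
        return verb + "king"
    # Keep a final "ee": see -> seeing
    if verb.endswith("ee"):
        return verb + "ing"
    # Drop a single final "e" that follows a consonant: make -> making
    if verb.endswith("e") and len(verb) > 2 and verb[-2] not in VOWELS:
        return verb[:-1] + "ing"
    # Double the final consonant of a one-vowel word ending
    # consonant-vowel-consonant: beg -> begging, stop -> stopping
    one_vowel = sum(ch in VOWELS for ch in verb) == 1
    if (one_vowel and len(verb) >= 3
            and verb[-1] not in VOWELS + "wxy"
            and verb[-2] in VOWELS
            and verb[-3] not in VOWELS):
        return verb + verb[-1] + "ing"
    return verb + "ing"

for v in ["picnic", "see", "make", "beg", "buy", "trust"]:
    print(v, "->", add_ing(v))
```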
Derivation • Derivation is a morphological process that varies a word • in ways not covered by inflection, and less systematic • but still by means of relatively small changes in form such as adding affixes (or no change at all), • usually involving a bigger shift of meaning • that is somewhat unpredictable • Examples in English ....
Derivation: Examples • Making adjectives into adverbs by suffixing with “ly”. • Making nouns (etc.) into adverbs by suffixing with “wards”, as in “sidewards”. • Nominalizing (= “nounifying”) verbs by suffixing with “ation” or “ment” (as in “payment”), “ee” (as in “payee”), “er” (as in “payer”). • Making nouns into verbs without changing the spelling (as in “pencil”, “book”, “impact”, “carpet”, “bus”, “powerpoint”). • Verbifying nouns by suffixing with “ify”! Or with “ise/ize”. • Nominalizing adjectives by suffixing with “ness”, “ity”. • Making nouns into adjectives by suffixing with “ish”, “y” (as in “frilly”), “[-]like”, “less”, “[e]d”. • Making verbs into adjectives by suffixing with “able/ible”. • Other more ad hoc cases such as in “iffy” (from “if”), use of “big” as a verb (in “big it up”), ... Can you think of any other cases?
Unclear inflection/derivation Boundary • Inflection usually doesn’t change the [traditional] POS of the affected word (e.g. verbs stay as verbs) whereas derivation usually does change it, but there are exceptions. • E.g. The textbook includes within inflection the formation of the gerund (i.e. noun) form of a verb by adding “ing”, even though this changes the POS. • Adding the affix “dom” (as in “kingdom” and “martyrdom”) makes too big and unpredictable a difference in meaning to fit with inflection, but doesn’t change the POS (still a noun). • Adding “er” to get a noun indicating the doer of something is a derivation process that can be done not only on verbs (“baker”) but also on some nouns (“philosophy” to “philosopher”). • Similarly the suffix “ist” converts between nouns (“art” to “artist”).
Unclear Boundary contd • Example: is adding “ish” an act of inflection or derivation or both? • It can deliver an adjective from an adjective or a noun, but it seems odd to say it’s inflection in one case and derivation in the other. • When modifying an adjective, it’s not obviously making a more major meaning change than forming a comparative or superlative, or going from one tense to another of a verb.
Compounding • Compounding is the morphological process whereby two or more words are combined to form a word, as in • football, basketball, racquetball • doghouse • blackboard • winklepicker • bargain-hunter • postman • catch-all, know-all • pear-shaped • yes-man • eggbeater • cheeseburger [compound of “cheese” and the abbreviated word “burger”, with some confusion about “ham”!]
Compounding, contd. • When an affix such as “able” is a free morpheme (i.e., also a word, with a very similar meaning) should we also view the affixing as compounding?
Compounding, contd • Joining words without a hyphen is largely a matter of convention (i.e. implicit agreement) in English – you can’t just freely form such compounds. • But in some other languages, e.g. German, you have much more freedom. • And in English, compounding with a hyphen is relatively free, and putting nouns next to each other without joining or hyphenating to form so-called “noun-noun compounds” gets a similar effect and is extremely free. • telephone-licker • telephone licker • telephone licker defender • telephone licker defender scandal • Joined-up compounds, hyphenated compounds and noun-noun compounds don’t have meanings that are predictable in any simple, uniform way from the individual words: • Postman: man who delivers post • Fireman: man who delivers fire?! • But note famous South of France forest firemen arson scandals
Cliticization • Cliticization is • [in my view] a special, exceptional form of (joined-up) compounding, where • the joining up is done for brevity or ease of pronunciation of a phrase rather than specifically to create a word: • the resulting word acts like the phrase it would have been had the joining not been done, or like similar phrases, rather than like a normal single word. • But despite the mere brevity/ease motive, the cliticization can be obligatory. • Cliticization is unlike normal compounding in that it can act on a whole phrase, not some separate words. • But so can fairly normal compounding (see South of France example and these words here!) • Examples in English on next slide ...
Cliticization: Examples • Adding “not” in the form “ n’t ” to certain verbs: to be, to have or auxiliaries: • “isn’t”, “mustn’t”, “don’t”, “didn’t”, “haven’t”, “can’t”, “shouldn’t”, etc. • NB: special cases “can’t”, “won’t”, “shan’t”, “ain’t”. • Exercise: in what ways are these special?? • NB also: “cannot”. • Can’t usually do the above when the verb is not to be, to have, or an auxiliary. Can’t say: • “I don’t my push-ups any more” to mean “I don’t do my push-ups any more”. • “I didn’t him in” to mean “I didn’t do him in” i.e. “I did not kill him.” • “I can’t my tomatoes on Saturdays” to mean “I don’t can my tomatoes on Saturdays”.
Cliticization: Examples, contd • Adding “is”, “are”, “will”, “would” and “am” to the previous word as in • “It’s in the garden” • “The cat’s in the garden” • “They’re in the garden” • “The horse’ll be in the garden” • “I’ll be in the garden” • “I’m in the garden” • “I’d be in the garden if three cats, five dogs, a horse and a strange professor weren’t already there.” • Adding “has”, “have”, “had” as in • “The cat’s already been in the garden for five hours”, • “The cats’ve already been in the garden for five hours”, • “You’ve ten minutes to get that horse out of there”, • “I thought you’d already got it out”.
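A minimal Python sketch of splitting off English enclitics, covering the cases on the last two slides. The tables and the name `split_clitic` are assumptions for illustration. Note that “’s” (is / has / possessive) and “’d” (would / had) are ambiguous, so the sketch separates them without expanding them, and it leaves “ain’t” unsplit because its expansion depends on context.

```python
import re

# Irregular negative forms whose host cannot be found by cutting off "n't".
IRREGULAR_NEG = {"can't": "can", "won't": "will", "shan't": "shall"}
UNSPLIT = {"ain't"}

# Enclitics with a single expansion.
ENCLITICS = {"'re": "are", "'ll": "will", "'m": "am", "'ve": "have"}

def split_clitic(token):
    """Return (host, clitic) for one token, or (token, None) if no clitic is found."""
    t = token.replace("\u2019", "'")      # normalise curly apostrophes
    low = t.lower()
    if low in UNSPLIT:
        return t, None
    if low in IRREGULAR_NEG:
        return IRREGULAR_NEG[low], "not"
    if low.endswith("n't"):
        return t[:-3], "not"
    for form, full in ENCLITICS.items():
        if low.endswith(form):
            return t[:-len(form)], full
    m = re.fullmatch(r"(.+)('s|'d)", t)
    if m:
        return m.group(1), m.group(2)     # ambiguous clitic, left unexpanded
    return t, None

for tok in ["isn't", "won't", "They're", "cats've", "cat's", "garden"]:
    print(tok, "->", split_clitic(tok))
```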
Cliticization: Some Special Cases • “of” in the form “o’” in “o’clock”, • and in proper nouns: “O’Connell”, “O’Gaunt”. • “my” in the form “mi” or “m’” in “milord”, “m’lord”, “m’lud”, “m’boy”. • “to” in the form “a” in some verbs as in colloquial “gonna”, “wanna”. • “the” in the form “th’” as in “th’morn” in older English esp. poetry. • “the” in the form “t’” as in “t’cat” in some Northern English dialects. • “it” in the form “’t” as in “’twas / ’twere / ’twill / ’twould”. • “one” in the form “un” as in “biggun”, “smallun”. • “and” in the form “’n’” as in “pick’n’mix”. • Remember, AI systems have to deal with dialect, slang, etc. not just proper Queen’s English (or Kate Middleton’s)!
Cliticization, contd. • The small added word is called a “clitic”. • Proclitic if before the other word • Enclitic if afterwards • Clitics in (modern everyday) English are almost all enclitics. You find some proclitics in isolated forms and in dialect or older versions of English as above. • Proclitics are common in other languages, e.g. French: • adding “le” and “la” to the next word in the form “ l’ ” when it starts with a vowel or an “h” (usually), as in “l’arbre” and “l’homme”. • In English, the clitic is almost always separated off by an apostrophe and abbreviated. This doesn’t necessarily carry over to other languages. (See textbook.) • The notion of clitic is not easy to define clearly – see textbook and dictionaries. Lack of stress on the clitic in the pronunciation of the word is typically mentioned, but I’m not convinced that this is a valid criterion.
Cliticization and Affixes • The textbook says that a clitic is somewhere between being a word and being an affix. • You might ask why clitics aren’t just classified as a particular form of affix. • One reason: words involving clitics act grammatically like the phrases they came from rather than like single words (even when the compounding is obligatory): • E.g. “You’ve” acts grammatically like the phrase “You have”. • Although in French we can’t actually use “le arbre”, nevertheless “l’arbre” acts grammatically as that phrase would have done had it been allowed, and acts like analogous phrases such as “le chat”. • And a clitic can be added to a whole phrase, as in • “The man I was speaking of’s horse”
Special Practical Aspects of Morphology • In informal English and particularly in computer-mediated chat, repetition of letters is used for emphasis of meaning, as in “baaaad”, “grrrrrand”, “grrrrr”, “hmmmm”, “oooooh”. • Although the repeated letter is not itself a morpheme, letter repetition could be said to be a morphological process as it fairly systematically changes meaning. • Capitalization of all or parts of words for emphasis could perhaps be said to be a morphological process, though this would probably cause arguments! • Exercise: when only parts are capitalized, what sort of part do they tend to be? • The phenomena on this slide (plus repetition of exclamation and question marks) are very important for the practical processing of internet chat, etc.
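For chat-style text, emphatic letter repetition can be undone with a regular expression before dictionary lookup. A minimal Python sketch, with the function names `squeeze` and `candidates` invented for illustration: runs of three or more identical characters are cut down to two and to one, giving candidate forms that a lexicon could then check.

```python
import re

def squeeze(word, keep=2):
    """Collapse every run of 3 or more identical characters to `keep` copies."""
    return re.sub(r"(.)\1{2,}", lambda m: m.group(1) * keep, word)

def candidates(word):
    """Candidate normal forms: long runs reduced to two characters and to one."""
    return {squeeze(word, 2), squeeze(word, 1)}

print(squeeze("baaaad", 1))          # bad
print(sorted(candidates("oooooh")))  # ['oh', 'ooh']
print(squeeze("what!!!!", 1))        # what!
```

Both candidates are kept because either may be the intended word: “oooooh” could stand for “ooh” or “oh”. Ordinary double letters, as in “good”, are left alone.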
Apostrophes are Complicated • Apostrophes are often used in abbreviations to indicate missing letters, as in “’phone” [old-fashioned], “B’ham” on road signs and “mornin’” in dialect. • Apostrophes are often left out in computer-mediated chat, texting (SMS), etc. • There’s a modern tendency to miss out apostrophes in possessive forms of nouns in official building signs, street signs, etc. (“Snooker Players Convalescent Home”). • Many people – including students who should know better – often write “it’s” instead of “its”. • “It’s” is the cliticized form of “it is”. “Its” is the adjective meaning “of it”. • Small-shop owners often wrongly include apostrophes in plurals – “potato’s”. • There’s a strong but misguided tendency to insert an apostrophe when pluralizing unusual words such as acronyms, as in “PDF’s”. It’s perfectly fine to write “PDFs”! • Many people think that the possessive of (e.g.) “James” is “James’”, when really it should be “James’s” normally.
Observations • General observation: • NLP systems are going to have to cope with many sorts of “error” and short-cut. (Cf. also strange spelling used in texting.) • Systems can’t afford to insist on compliance with “correct” spelling, morphology, grammar, etc. • And: language is created and evolved by ordinary people using it, and ultimately there is no point in talking about what’s “correct” in any absolute sense. • But as a basis for work on developing liberal, robust systems, you need to know about techniques for handling traditionally grammatical language. • These techniques are often used (or used in extended form) in present-day practical NLP applications. • Quite a lot of language does actually conform to “correct” grammar, etc.!
Towards Morphological Processing • NLP systems for English often don't include any or much morphological processing, especially if they are small-scale systems or systems with specialized purposes such as information retrieval. • They may just list all the different word forms separately. • Or they may just do “stemming” (finding the stem of each word), a limited form of morphological processing. • Inflectional morphology • For other languages, e.g. French and German, NLP systems often include inflection analysers. • When inflectional analysis is done: • A standard technique is Finite State Automata – see below and textbook. • FSAs are powerful and economical for the regular cases, and exceptions (the irregular words) are just included as extra regions in the network of states.
Morphological Processing, contd. • Derivational morphology • Can be used to reduce the number of separate word forms to be stored. • E.g., given an entry for the base form of the verb sing, then use rules to map the nouns singer and singers onto the same entry. • Derivational morphology is particularly useful for Machine Translation (MT) ...
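The singer/singers example can be sketched in Python as a lookup that stores only the base form and reaches the other forms by rule. The one-entry lexicon, the suffix labels and the name `lookup` are assumptions for illustration.

```python
# Only the base form is stored (a one-entry lexicon, assumed for this sketch).
LEXICON = {"sing": "verb"}

def lookup(word):
    """Map a word onto a stored base form; return (base, suffixes) or None."""
    if word in LEXICON:
        return word, []
    suffixes = []
    rest = word
    if rest.endswith("s"):                 # inflection: singers -> singer, sings -> sing
        rest, suffixes = rest[:-1], ["-s"]
        if rest in LEXICON:
            return rest, suffixes
    if rest.endswith("er") and rest[:-2] in LEXICON:   # derivation: singer -> sing
        return rest[:-2], ["-er"] + suffixes
    return None

print(lookup("singers"))  # ('sing', ['-er', '-s'])
print(lookup("singer"))   # ('sing', ['-er'])
print(lookup("finger"))   # None: "fing" is not a stored base form
```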
Morphological Processing, contd. • In either single-language or MT systems, words may actually be previously-unseen words or actual neologisms (newly invented words). E.g.: • Neologisms often have a proper name as their root. A knowledge of how Thatcherite and Blairism were formed from proper names could, e.g., enable an MT system to translate them into an idiomatic equivalent in the target language.
Morphological Processing, contd. • Previously-unseen words or neologisms in MT, contd: • Analyser reduces these words to their base form. • It may be able to translate the base form • It can then (in effect) coin a word in the target language by simply following rules.
A Small Morphological Analyser [courtesy of Dr Peter Hancox] • Designed for a tiny fragment of English, and treats even that fragment incompletely. • Covers just the determiner the, the nouns girl, girls, cat, cats • and the verbs trust, trusts, trusting, trusted. • Produces two types of information: • Syntactic category: noun, verb, determiner. • Grammatical features: • Number: i.e. singular, plural • Person: i.e. first, second, third • Tense: i.e. past, present • Participle: i.e. yes, no. • Takes the form of an FSN (Finite State Network) ...
Small Morphological Analyser: the FSN • SEE DIAGRAM IN SEPARATE FILE available via the module slides page • Here: http://www.cs.bham.ac.uk/~jab/Modules/NLP1/10-11/Slides/morphology.FSNdgm.pdf
Small Morphological Analyser: Prolog Implementation of the FSN • Syntactic category result for a word is e.g.: noun(cat), verb(trust) • Features result for a word is of form • features(Number, Person, Tense, Participle, Extra) • where e.g. Number is the term: number(singular) • The Extra parameter was included for future expansion. • Parameters that are inappropriate for a word are left unbound in the result. • The syntactic category and appropriate features info must be returned every time the FSN gets to a final state. • States are identified by integers in the program, with 1 being the initial state and 9999 the final state. • States from 8000 onwards are used for cases where a regular word stem has been found, so arcs from it deal with regular endings.
Small Morphological Analyser: The Controller Code • Here: http://www.cs.bham.ac.uk/~jab/Modules/NLP1/10-11/Tools/ file: fsn1morph.pl
%% - We can use this controller as follows:
%%   | ?- morph(1,Word,Features,[g,i,r,l], []).
%%   | ?- morph(1,Word,Features,[t,r,u,s,t,e,d],[]).

final_state(9999).

% 1 - terminating condition: a single arc leads from State to the final state
morph(State, Word, Features, S, S0) :-
    final_state(Final),
    arc(State, Final, S, S0, Word, Features).
% 2 - recursive condition: follow one arc, then carry on from the next state
morph(State, Word, Features, S, S0) :-
    arc(State, Next, S, S1, Word, Features),
    morph(Next, Word, Features, S1, S0).
Small Morphological Analyser: Examples of Coding of the FSN’s Arcs
% arc(FromState, ToState, Input, RemainingInput, Word, Features)
arc(1, 2, [c|S], S, _Word, _Features).
arc(1, 4, [g|S], S, _Word, _Features).
arc(1, 10, [t|S], S, _Word, _Features).
arc(2, 3, [a|S], S, _Word, _Features).
arc(4, 5, [i|S], S, _Word, _Features).
arc(5, 6, [r|S], S, _Word, _Features).
arc(3, 8000, [t|S], S, noun(cat), _Features).
arc(6, 8000, [l|S], S, noun(girl), _Features).
arc(8000, 9999, S, S0, _Word,
    features(number(singular),person(_),_Ten,_Part, _)) :-
    punctuation(S, S0).
arc(11, 9999, [e|S], S0, det(the),
    features(_Numb, person(third),_Ten,_Part,_)) :-
    punctuation(S, S0).
Small Morphological Analyser: Examples of Coding of the FSN’s Arcs, contd
% Coding for the plural form:
arc(8000, 8001, [s|S], S, _Word, _Features).
arc(8001, 9999, S, S0, _Word,
    features(number(plural), person(third),_Ten, _Part, _)) :-
    punctuation(S, S0).
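For readers who do not use Prolog, the noun part of the network can be approximated in Python as a transition table. This is a loose port and not the module's code: it includes only the noun arcs shown on the slides, it treats states 8000 and 8001 as accepting at end of input instead of calling punctuation/2 (whose definition is not shown here), and it records only the number feature. The names `ARCS`, `FINAL` and `morph` are chosen for this sketch.

```python
# (state, letter) -> (next_state, category or None); noun arcs from the slides only.
ARCS = {
    (1, "c"): (2, None),
    (1, "g"): (4, None),
    (2, "a"): (3, None),
    (4, "i"): (5, None),
    (5, "r"): (6, None),
    (3, "t"): (8000, "noun(cat)"),
    (6, "l"): (8000, "noun(girl)"),
    (8000, "s"): (8001, None),
}

# Assumption: accept at end of input in these states, with this number feature.
FINAL = {8000: "singular", 8001: "plural"}

def morph(word):
    """Run the network over word; return (category, features) or None."""
    state, category = 1, None
    for ch in word:
        if (state, ch) not in ARCS:
            return None
        state, cat = ARCS[(state, ch)]
        category = cat or category      # remember the category once a stem is found
    if state not in FINAL:
        return None
    return category, {"number": FINAL[state]}

print(morph("cats"))   # ('noun(cat)', {'number': 'plural'})
print(morph("girl"))   # ('noun(girl)', {'number': 'singular'})
print(morph("gir"))    # None: input ends before a stem is complete
```

Because this fragment is deterministic, a simple loop is enough; the Prolog controller relies on backtracking, which would matter once several arcs can leave a state on the same letter.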
More Detail, and Transduction • Please read sections 3.1 to 3.7 of J&M. • The above program is merely a recognizer. • J&M 3.4 etc. goes into transduction, i.e. conversion of forms (e.g. from singular to plural and back). • Also involves a somewhat different way of specifying the result of recognition.
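To give a feel for what transduction adds over recognition, here is a minimal Python sketch that maps in both directions between a lexical form such as “fox+N+PL” and a surface form such as “foxes”. It is a pair of plain functions, not a finite state transducer, and the tag names, the noun list and the function names `generate` and `parse` are assumptions for illustration; see J&M for the real machinery.

```python
# Endings after which the regular plural is spelled "es" (a simplification).
ES_ENDINGS = ("s", "x", "z", "ch", "sh")

# Hypothetical noun list for this sketch.
NOUNS = {"cat", "girl", "fox"}

def generate(lexical):
    """Lexical to surface: 'fox+N+PL' -> 'foxes', 'cat+N+SG' -> 'cat'."""
    stem, _category, number = lexical.split("+")
    if number == "SG":
        return stem
    return stem + ("es" if stem.endswith(ES_ENDINGS) else "s")

def parse(surface):
    """Surface to lexical: 'foxes' -> 'fox+N+PL'; None for unknown words."""
    if surface in NOUNS:
        return surface + "+N+SG"
    for stem in NOUNS:
        if generate(stem + "+N+PL") == surface:
            return stem + "+N+PL"
    return None

print(generate("fox+N+PL"))  # foxes
print(parse("cats"))         # cat+N+PL
print(parse("dogs"))         # None: "dog" is not in the noun list
```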