Introduction to Computational Linguistics: The Lexicon
Introduction • An inventory of words is an essential component of programs for a wide variety of language-sensitive applications, such as: • Spellchecking, stylechecking • IR, IE, message understanding • parsing, generation, MT • TTS and STT • Such an inventory is usually called a dictionary or lexicon.
Dictionaries • The purpose of a dictionary is to provide a wide range of information about words. • Some of this is linguistic information, e.g. syntactic category, pronunciation, distribution. • But dictionaries also contain definitions of word senses, thus providing knowledge not just about language but about the world itself.
What is "dog"? • dog (ANIMAL) noun [C] a common four-legged animal, especially kept by people as a pet or to hunt or guard things: my pet dog; wild dogs; dog food; We could hear dogs barking in the distance. (from Cambridge Advanced Learner's Dictionary)
"Dictionary" versus "Lexicon" • A dictionary is a collection of words • A lexicon is a collection of lexemes. • A lexeme roughly corresponds to a set of words that are different forms of "the same word". • For example, English run, runs, ran and running are forms of the same lexeme. • A lexeme can also be regarded as a single word sense of a word.
Senses of Dog • dog was found in the Cambridge Advanced Learner's Dictionary at the entries listed below. • dog (ANIMAL) • dog (PERSON) • dog (FOLLOW) • dog (PROBLEM) • These are different senses, or lexemes, for dog.
Two Views of the Lexicon give rise to different issues • Lexicon as word database • How to represent the word collection • Access: given an arbitrary word, how to access the relevant entries • What information to provide and how to express it. • Lexicon as database about word senses • What are the relations between word senses? • How do word senses hook up with conceptual knowledge?
Representing the Word Collection • Some possible representations: • Text file, 1 entry per line • Finite state automaton. • Other specialised data structure which allows for common prefixes, e.g. letter tree • Full form vs. lexeme + morphological analysis
FSA for Sublexicon Fragment [Figure: finite state automaton whose arcs are labelled with single letters: o, t, h, e, s, a, e, i, t, s]
Letter Tree
ltree([
  [b, [a, [r, [k, bark]]]],
  [c, [a, [r, [r, [y, carry]]],
          [t, cat, [e, [g, [o, [r, [y, category]]]]]]]],
  [d, [e, [l, [a, [y, delay]]]]],
  [h, [e, [l, [p, help]]],
      [o, [p, hop, [e, hope]]]],
  [q, [u, [a, [r, [r, [y, quarry]]]],
          [i, [z, quiz]],
          [o, [t, [e, quote]]]]]
]).
Informal Definition of a Letter Tree • Tree is a list of branches • Each branch is a list • whose first element is a letter • whose remaining elements are either • another branch, or • a lexical entry for a word • These elements are in a specific order. Lexical entry (if any) comes first, and branches are in alphabetical order by their first letters.
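The informal definition above can be sketched in Python, assuming the Prolog term is mirrored as nested lists in which a branch is a list and a lexical entry is a plain string. The names LTREE and lookup are hypothetical and chosen only for this illustration; only part of the tree is reproduced.

```python
# Letter tree as nested lists: each branch is [letter, optional_entry, sub-branch, ...].
# Assumption for this sketch: entries are strings, branches are lists.
LTREE = [
    ["c", ["a", ["r", ["r", ["y", "carry"]]],
                ["t", "cat", ["e", ["g", ["o", ["r", ["y", "category"]]]]]]]],
    ["h", ["e", ["l", ["p", "help"]]],
          ["o", ["p", "hop", ["e", "hope"]]]],
]


def lookup(tree, word):
    """Return the lexical entry stored for word, or None if there is none."""
    branches = tree
    entry = None
    for letter in word:
        # Find the branch whose first element is the current letter.
        branch = next((b for b in branches if b[0] == letter), None)
        if branch is None:
            return None
        rest = branch[1:]
        # By the ordering convention, the entry (if any) precedes the sub-branches.
        entry = rest[0] if rest and isinstance(rest[0], str) else None
        branches = [b for b in rest if isinstance(b, list)]
    return entry


print(lookup(LTREE, "cat"))       # entry stored at the end of c-a-t
print(lookup(LTREE, "category"))  # shares the prefix c-a-t with "cat"
print(lookup(LTREE, "ca"))        # a prefix with no entry of its own
```

Words with a common prefix share the same path from the root, so the lookup cost depends on the length of the word rather than on the number of entries.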
Branch representing cat, category and cook • [c, [a, [t, cat, [e, [g, [o, [r, [y, category]]]]]]], [o, [o, [k, cook]]]]
Full Form Dictionary • There is an entry for every possible word form. • No need for morphological processing • Exceptions are handled automatically • OK when the number of entries is not too large. • Drawback: information is repeated across the forms of the same lexeme. • Because languages have different morphological properties, full form is better for some languages than for others.
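As a minimal sketch of the full-form approach, the forms of the lexeme run can each be given their own entry. The attribute names (lexeme, pos, form) and the function name are assumptions made for this illustration, not a fixed format.

```python
# Full-form dictionary: one entry per word form, no morphological analysis.
# The irregular form "ran" needs no special treatment: it is just another key.
FULL_FORM = {
    "run":     {"lexeme": "run", "pos": "verb", "form": "base"},
    "runs":    {"lexeme": "run", "pos": "verb", "form": "3sg-present"},
    "ran":     {"lexeme": "run", "pos": "verb", "form": "past"},
    "running": {"lexeme": "run", "pos": "verb", "form": "present-participle"},
}


def lookup_full_form(word):
    """Return the entry for a word form, or None if the form is not listed."""
    return FULL_FORM.get(word.lower())


print(lookup_full_form("ran"))
print(lookup_full_form("runned"))  # unlisted forms are simply not found
```

The repeated "lexeme" and "pos" values show the cost of this representation: every shared property is stored once per form.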
Morphological Analysis + Lexicon [Figure: the input word "cats" is passed to Morphological Analysis]
Morphological Analysis • Very roughly, morphological analysis of a word involves 2 subproblems: • A segmentation problem: how to get from the written text to the sequence of morphemes that make it up. • A morphotactic problem: how to combine the individual morphemes together in a legitimate way.
Segmentation/Morphotactic Subproblems • Segmentation problem: • enlargement => en + large + ment • Morphotactic problem: given what we know about en, large and ment, how can they be legitimately combined? • enlargement => (en + large) + ment • enlargement =/> en + (large + ment) • en + ADJ => V • V + ment => N
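The morphotactic half of the problem can be sketched as follows, assuming segmentation has already been done. The two rules are taken from the slide; the table names and the function category are hypothetical, and the tiny rule set is for illustration only.

```python
# Category of each root, and the category change each affix performs.
ROOTS = {"large": "ADJ"}
PREFIXES = {"en": ("ADJ", "V")}    # en + ADJ => V
SUFFIXES = {"ment": ("V", "N")}    # V + ment => N


def category(morphemes):
    """Return the set of categories derivable for a sequence of morphemes."""
    morphemes = tuple(morphemes)
    if len(morphemes) == 1:
        root = morphemes[0]
        return {ROOTS[root]} if root in ROOTS else set()
    cats = set()
    first, last = morphemes[0], morphemes[-1]
    # Bracketing prefix + (rest): the rest must have the category the prefix needs.
    if first in PREFIXES:
        needs, gives = PREFIXES[first]
        if needs in category(morphemes[1:]):
            cats.add(gives)
    # Bracketing (rest) + suffix: the rest must have the category the suffix needs.
    if last in SUFFIXES:
        needs, gives = SUFFIXES[last]
        if needs in category(morphemes[:-1]):
            cats.add(gives)
    return cats


print(category(["en", "large"]))          # (en + large) is a verb
print(category(["en", "large", "ment"]))  # ((en + large) + ment) is a noun
print(category(["large", "ment"]))        # ment cannot attach to an adjective
```

The bracketing en + (large + ment) is rejected because large + ment has no derivable category, so only (en + large) + ment succeeds.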
2-Level Morphology • In 1981 the four Ks (Kimmo Koskenniemi, Lauri Karttunen, Ronald M. Kaplan and Martin Kay) were working on morphological analysis (MA) • Basic idea was that MA is about computing a relation between sets of strings at two levels: • Surface Level (the word as written, a string over the surface alphabet) • Lexical Level (a string of morphemes over the lexical alphabet). • The relation can be computed using finite state transducers. • The finite-state model is reversible: the same transducer can be used for analysis and for generation.
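The idea of a relation between a lexical and a surface string, and its reversibility, can be illustrated with a tiny hand-built transducer. This is only a sketch under stated assumptions: it is not the two-level rule formalism itself, and the arc table, the "+" boundary symbol and the function transduce are hypothetical names for this example.

```python
# Each arc is (source state, lexical symbol, surface symbol, target state).
# An empty string stands for "no symbol on this level".
FST_ARCS = [
    (0, "c", "c", 1),
    (1, "a", "a", 2),
    (2, "t", "t", 3),
    (3, "+", "", 4),   # the morpheme boundary has no surface counterpart
    (4, "s", "s", 5),
]
FINAL_STATES = {3, 5}


def transduce(text, generate=True):
    """Map lexical to surface (generate=True) or surface to lexical (generate=False)."""
    results = set()
    stack = [(0, 0, "")]  # (state, position in input, output so far)
    while stack:
        state, pos, out = stack.pop()
        if pos == len(text) and state in FINAL_STATES:
            results.add(out)
        for src, lex, surf, dst in FST_ARCS:
            if src != state:
                continue
            # Reversibility: the same arcs are read with the two sides swapped.
            inp, outp = (lex, surf) if generate else (surf, lex)
            if inp == "":
                # Consumes no input; safe here because the arc table has no cycles.
                stack.append((dst, pos, out + outp))
            elif text.startswith(inp, pos):
                stack.append((dst, pos + len(inp), out + outp))
    return results


print(transduce("cat+s"))                 # generation: lexical to surface
print(transduce("cats", generate=False))  # analysis: surface to lexical
```

The same arc table serves both directions, which is the sense in which the finite-state model is reversible.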
What Information to Provide • Specific Information, e.g. "kicks" • Syntactic Information • POS = verb • Tense = pres • Number = singular • Person = 3 • Type = Transitive • Semantic Information • event-type = Physical Action • type-of subject = animate • type-of object = physical
What Information to Provide • General Information • Class Attributes • Agreement has (Number, Gender) • Enumeration of possible values • Gender = [masc, fem] • Number = [sing, plur] • Class Relationships • Transitive isa Verb • Common isa Noun
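One way to picture how specific and general information fit together is the sketch below. The isa links and value enumerations come from the slides; the dictionary layout, the entry for "kicks" and the helper names are assumptions for illustration, and the entry uses the abbreviated value "sing" so that it matches the enumeration.

```python
# General information: class relationships and the possible values of attributes.
ISA = {"Transitive": "Verb", "Common": "Noun"}
VALUES = {"Gender": {"masc", "fem"}, "Number": {"sing", "plur"}}

# Specific information: one lexical entry.
KICKS = {"form": "kicks", "Type": "Transitive", "Tense": "pres",
         "Number": "sing", "Person": 3}


def is_a(cls, ancestor):
    """Follow isa links upwards from cls and report whether ancestor is reached."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = ISA.get(cls)
    return False


def invalid_attributes(entry):
    """List attributes whose value is outside the enumerated possibilities."""
    return [attr for attr, value in entry.items()
            if attr in VALUES and value not in VALUES[attr]]


print(is_a(KICKS["Type"], "Verb"))   # Transitive isa Verb
print(invalid_attributes(KICKS))     # no violations expected for this entry
```

Keeping the general information separate means that an entry only has to state what is specific to the word, while class membership and legal values are checked against shared tables.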
Two Views of the Lexicon give rise to different issues • Lexicon as word database • How to represent the word collection • Access: given an arbitrary word, how to access the relevant entries • What information to provide and how to express it. • Lexicon as database about word senses • What are the relations between word senses? • How do word senses hook up with conceptual knowledge?
WordNet • In 1985 a group of psychologists and linguists at Princeton had the idea of searching dictionaries conceptually rather than alphabetically. • Attempt to organise a dictionary in terms of word meanings rather than word forms. • What is the nature and organisation of the lexicalised concepts that words can express? • Distinction between word forms, word meanings, and entries.
Lexical Matrix [Figure: lexical matrix, with labels: synonymy, entries, polysemy]
WordNet • A key aspect of WordNet is that a given meaning or word sense is represented as the set of words that can be used to express it. • These meanings are called synsets – sets of words with synonymous readings. • Synsets are established empirically according to a principle of substitutability that is relativised to context.
The Principle of Substitutability • Two expressions are synonymous if the substitution of one for another never alters the truth value of a sentence in which the substitution is made. • Two expressions are synonymous in linguistic context C if the substitution of one for the other in C does not alter the truth value. • e.g. plank/board in carpentry contexts
WordNet • In WordNet, the synonymy relation between words is fundamental. • Synsets can be thought of as representing concepts which stand in various semantic relations to each other. • X Antonym Y: meaning (synset) X is opposite to meaning (synset) Y (e.g. big, small) • X Hyponym Y: like isa (e.g. dog, mammal) • X Meronym Y: X is a part of Y (e.g. leg, man)
Lexicon as a Concept Graph • We can thus imagine the WordNet Lexicon as a gigantic graph whose nodes are synsets and whose arcs are semantic relations between synsets. • Such a structure can be regarded as a semantic map of the concepts used in a given language. • Many applications can be created using the WordNet graph as a resource
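The graph view can be sketched with a toy fragment. For brevity each node is named by a single word, whereas a real synset is a set of words; the arc from mammal to animal is an assumed extra link added for illustration, and RELATIONS and reachable are hypothetical names. This is not the WordNet database format or API.

```python
from collections import deque

# Labelled arcs: (source node, relation) -> list of target nodes.
RELATIONS = {
    ("dog", "hyponym"): ["mammal"],
    ("mammal", "hyponym"): ["animal"],   # assumed arc, not from the slides
    ("leg", "meronym"): ["man"],
    ("big", "antonym"): ["small"],
    ("small", "antonym"): ["big"],
}


def reachable(start, relation):
    """Return the nodes reachable from start by following arcs of one relation."""
    seen = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in RELATIONS.get((node, relation), []):
            if nxt != start and nxt not in seen:
                seen.append(nxt)
                queue.append(nxt)
    return seen


print(reachable("dog", "hyponym"))   # everything a dog "is a" kind of
print(reachable("leg", "meronym"))   # everything a leg is a part of
```

Applications built on such a graph typically reduce to traversals of this kind, for instance following hyponym arcs to generalise a word or measuring path lengths between nodes.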
Using WordNet to Measure Semantic Orientations of Adjectives • Jaap Kamps, Maarten Marx, Robert J. Mokken, Maarten de Rijke
Conclusion • Lexicon is a central building block of language-sensitive systems • Schizophrenic status of lexical information: linguistic versus world knowledge. • As a wordlist, lexicon has to solve problem of representation and access. Morphological analysis can help to keep number of entries to a manageable level. • As a collection of definitions, lexicon has to deal with relationships between word meanings.