Simulation Models in the Cognitive Sciences (Modelli simulativi nelle Scienze Cognitive)
The lexicon: linguistic models, WordNet, lexical acquisition
Massimo Poesio
PART I: LEXICON AND LEXICAL SEMANTICS. WORDNET
What’s in a lexicon
• A lexicon is a repository of lexical knowledge
• The simplest form of lexicon is a list of words
• But even for English, let alone languages with a more complex morphology such as Italian, it makes sense to separate WORD FORMS from LEXICAL ENTRIES, or LEXEMES:
  • LEXEME: BANK
    • POS: N
  • WORD: BANKS
    • LEXEME: BANK
    • SYN:
      • NUM: PLUR
• Lexical knowledge also includes information about the MEANING of words
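The split between word forms and lexemes can be sketched as a small data structure. This is only an illustration: the class names `Lexeme` and `WordForm`, the `lookup` helper, and the singular entry for "bank" are assumptions made for the example, not part of any existing lexicon format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lexeme:
    """A lexical entry: what all inflected forms of a word share."""
    lemma: str
    pos: str


@dataclass(frozen=True)
class WordForm:
    """An inflected form, pointing to its lexeme plus syntactic features."""
    form: str
    lexeme: Lexeme
    syn: tuple = ()  # (feature, value) pairs, e.g. ("NUM", "PLUR")


# The BANK example from the slide; the singular form is added for illustration.
BANK = Lexeme(lemma="BANK", pos="N")
WORD_FORMS = {
    "bank": WordForm("bank", BANK, (("NUM", "SING"),)),
    "banks": WordForm("banks", BANK, (("NUM", "PLUR"),)),
}


def lookup(form):
    """Return the WordForm for a string, or None if it is not in the lexicon."""
    return WORD_FORMS.get(form.lower())
```

Both "bank" and "banks" resolve to the same `Lexeme`, so information stored once on the lexeme (part of speech, and later its senses) does not have to be repeated for every inflected form.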
Meaning
• Characterizing the meaning of words is not easy
• Most of the methods considered in these lectures characterize the meaning of a word by stating its relations with other words
• This method, however, doesn’t say much about what the word ACTUALLY means (e.g., what you can do with a car)
An example lexical entry: VICINO (from it.wiktionary.org)
vicino, sostantivo [noun] m (vicina f, vicini pl m, vicine pl f)
1. Colui che abita accanto [one who lives next door]. (“I miei vicini vengono da Frosinone”)
vicino, aggettivo [adjective] m (vicina f, vicini pl m, vicine pl f) (“La più vicina stella a neutroni è RX J185635-3754”)
vicino, avverbio [adverb] (invariabile) (“Itunes visto da vicino”)
Lexical resources for computers: MACHINE READABLE DICTIONARIES
• A traditional DICTIONARY is a database containing information about
  • the PRONUNCIATION of a certain word
  • its possible PARTS OF SPEECH
  • its possible SENSES (or MEANINGS)
• In recent years, most dictionaries have appeared in machine-readable form (MRD)
  • English:
    • Oxford English Dictionary
    • Collins
    • Longman Dictionary of Contemporary English (LDOCE)
  • Italian:
    • Garzanti
    • Zanichelli
    • Paravia
    • it.wiktionary.org
An example LEXICAL ENTRY from a machine-readable dictionary: STOCK, from the LDOCE
• 0100 a supply (of something) for use: a good stock of food
• 0200 goods for sale: Some of the stock is being taken without being paid for
• 0300 the thick part of a tree trunk
• 0400 (a) a piece of wood used as a support or handle, as for a gun or tool (b) the piece which goes across the top of an ANCHOR^1 (1) from side to side
• 0500 (a) a plant from which CUTTINGs are grown (b) a stem onto which another plant is GRAFTed
• 0600 a group of animals used for breeding
• 0700 farm animals usu. cattle; LIVESTOCK
• 0800 a family line, esp. of the stated character
• 0900 money lent to a government at a fixed rate of interest
• 1000 the money (CAPITAL) owned by a company, divided into SHAREs
• 1100 a type of garden flower with a sweet smell
• 1200 a liquid made from the juices of meat, bones, etc., used in cooking
• …..
Homonymy
• Word-strings like STOCK are used to express apparently unrelated senses / meanings, even in contexts in which their part of speech has been determined
• Other well-known examples: BANK, LIME, RIGHT, SET, SCALE
• Italian: CALCIO, OBBIETTIVO
• An example of the problems homonymy may cause for IR systems
  • Search for 'West Bank' with Google
CALCIO, from “Il grande dizionario Garzanti”
• calcio1 [càl-cio] s.m. 1. colpo dato con il piede o con la zampa; pedata; dare, assestare, ricevere un _ 2. (sport) gioco che si svolge tra due squadre di undici giocatori ciascuna … 3. nel football, colpo dato con il piede al pallone: - di punizione, … - di rigore …. - d’angolo …. - piazzato [a kick; the game of football; a kick of the ball in football]
• calcio2 parte inferiore della cassa di un fucile … derivato del lat. calx calcis …. [the stock of a rifle]
• calcio3 elemento chimico il cui simbolo è Ca; metallo alcalinoterroso …… [the chemical element calcium]
Homonymy in an MRD for Italian (ItalWordNet)
obbiettivo, Nome [noun]
[1] - scopo di un'operazione militare [goal of a military operation]. (obbiettivo [1], obiettivo [1])
[2] - bersaglio nel tiro di artiglieria [target in artillery fire] (obbiettivo [2], obiettivo [2])
[4] - sistema di lenti per proiettare l'immagine reale di un oggetto [lens system for projecting the real image of an object] (obbiettivo [4], obiettivo [4])
Homonymy and machine translation
Meaning in MRDs, 2: SYNONYMY
• Two words are SYNONYMS if they have the same meaning at least in some contexts
  • E.g., PRICE and FARE; CHEAP and INEXPENSIVE; LAPTOP and NOTEBOOK; HOME and HOUSE
  • I’m looking for a CHEAP FLIGHT / INEXPENSIVE FLIGHT
• From Roget’s thesaurus:
  • OBLITERATION, erasure, cancellation, deletion
• But few words are truly synonymous in ALL contexts:
  • I wanna go HOME / ?? I wanna go HOUSE
  • The flight was CANCELLED / ?? OBLITERATED / ??? DELETED
• Knowing about synonyms may help in IR:
  • NOTEBOOK (get LAPTOPs as well)
  • CHEAP PRICE (get INEXPENSIVE FARE)
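The IR use of synonyms can be sketched as simple query expansion. The `SYNONYMS` table below is a toy thesaurus built from the word pairs on this slide, and `expand_query` is a hypothetical helper, not the behaviour of any particular search engine.

```python
# Toy thesaurus built from the slide's examples (symmetric by construction).
SYNONYMS = {
    "notebook": {"laptop"},
    "laptop": {"notebook"},
    "cheap": {"inexpensive"},
    "inexpensive": {"cheap"},
    "price": {"fare"},
    "fare": {"price"},
}


def expand_query(query):
    """Return one set of alternatives per query term.

    A retrieval system could treat each set as an OR-group and the list
    as an AND over those groups.
    """
    return [{term} | SYNONYMS.get(term, set()) for term in query.lower().split()]
```

For the query "CHEAP PRICE" this yields the groups {cheap, inexpensive} and {price, fare}. Because few words are synonymous in all contexts, blind expansion of this kind is likely to retrieve some irrelevant documents as well.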
Synonymy in Italian
scorza, Nome [noun]
[1] - (corteccia [1], scorza [1]) [bark]
[2] - parte esterna, involucro dei frutti [outer part, covering of fruit] (buccia [1], scorza [2])
[4] - (scorza [4]) "sotto la sua scorza scortese si nasconde un animo nobile" [beneath his rude exterior hides a noble soul]
Problems and limitations of MRDs
• Identifying distinct senses is always difficult
  • Sense distinctions are often subjective
• Definitions are often circular
• Very limited characterization of the meaning of words
Homonymy vs polysemy
• 0100 a supply (of something) for use: a good stock of food
• 0200 goods for sale: Some of the stock is being taken without being paid for
• 0300 the thick part of a tree trunk
• 0400 (a) a piece of wood used as a support or handle, as for a gun or tool (b) the piece which goes across the top of an ANCHOR^1 (1) from side to side
• 0500 (a) a plant from which CUTTINGs are grown (b) a stem onto which another plant is GRAFTed
• 0600 a group of animals used for breeding
• 0700 farm animals usu. cattle; LIVESTOCK
• 0800 a family line, esp. of the stated character
• 0900 money lent to a government at a fixed rate of interest
• 1000 the money (CAPITAL) owned by a company, divided into SHAREs
• 1100 a type of garden flower with a sweet smell
• 1200 a liquid made from the juices of meat, bones, etc., used in cooking
• …..
POLYSEMY vs HOMONYMY
• In cases like BANK, it’s fairly easy to identify two distinct senses (the etymology is also different). But in other cases, the distinctions are more questionable
  • E.g., senses 0100 and 0200 of stock are clearly related, as are 0600 and 0700, or 0900 and 1000
• In some cases, syntactic tests may help. E.g., KEEP (Hirst, 1987):
  • Ross KEPT staring at Nadia’s decolletage
  • Nadia KEPT calm and made a cutting remark
  • Ross wrote of his embarrassment in the diary that he KEPT.
• POLYSEMOUS WORDS: meanings are related to each other
  • Cf. a human’s foot vs. a mountain’s foot
• In general, the distinction between HOMONYMY and POLYSEMY is not always easy to draw (especially with VERBS)
Other aspects of lexical meaning not captured by MRDs
• Other semantic relations:
  • HYPONYMY
  • ANTONYMY
• A lot of other information typically considered part of ENCYCLOPEDIAs:
  • Trees grow bark and twigs
  • Adult trees are much taller than human beings
Hyponymy and Hypernymy
• HYPONYMY is the relation between a subclass and a superclass:
  • CAR and VEHICLE
  • DOG and ANIMAL
  • BUNGALOW and HOUSE
• Generally speaking, a hyponymy relation holds between X and Y whenever it is possible to substitute Y for X:
  • That is an X -> That is a Y
  • E.g., That is a CAR -> That is a VEHICLE.
• HYPERNYMY is the opposite relation
• Knowledge about TAXONOMIES is useful to classify web pages
  • E.g., Semantic Web
  • Automatically (e.g., Udo Kruschwitz’s system)
• This information is not generally contained in MRDs
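A taxonomy of this kind can be sketched as a table of direct hypernym links plus a function that follows them. The table `HYPERNYM` and the function `is_a` are illustrative names; the sketch assumes each word has a single direct hypernym and that the relation is transitive, and it adds a few links of the WordNet chain for "robin" so that there is a multi-step path to follow.

```python
# Direct hypernym links: hyponym -> hypernym (toy data, one parent per word).
HYPERNYM = {
    "car": "vehicle",
    "dog": "animal",
    "bungalow": "house",
    "robin": "thrush",
    "thrush": "oscine",
    "oscine": "passerine",
    "passerine": "bird",
}


def is_a(x, y):
    """True if y can be reached from x by following hypernym links."""
    seen = set()
    node = x
    while node in HYPERNYM and node not in seen:
        seen.add(node)  # guard against accidental cycles in the table
        node = HYPERNYM[node]
        if node == y:
            return True
    return False
```

With this table, `is_a("car", "vehicle")` and `is_a("robin", "bird")` hold, while the reverse direction, `is_a("vehicle", "car")`, does not: that would be the hypernymy relation.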
The organization of the lexicon
[Figure: three columns labelled WORD-FORMS, LEXEMES and SENSES. Word forms: “eat”, “eats”, “ate”, “eaten”. Lexeme: EAT-LEX-1. Senses: eat0600, eat0700.]
The organization of the lexicon
[Figure: three columns labelled WORD-STRINGS, LEXEMES and SENSES. Word-string: “stock”. Lexemes: STOCK-LEX-1, STOCK-LEX-2, STOCK-LEX-3. Senses: stock0100, stock0200, stock0600, stock0700, stock0900, stock1000.]
Synonymy
[Figure: three columns labelled WORD-STRINGS, LEXEMES and SENSES. Word-strings: “cheap”, “inexpensive”. Lexemes: CHEAP-LEX-1, CHEAP-LEX-2, INEXP-LEX-3. Senses: cheap0100, …, cheapXXXX, inexp0900, inexpYYYY.]
A more advanced lexical resource: WordNet
• A lexical database created at Princeton
• Freely available for research from the Princeton site
  • http://www.cogsci.princeton.edu/~wn/
• Information about a variety of SEMANTIC RELATIONS
• Three sub-databases (supported by psychological research as early as Fillenbaum and Jones, 1965)
  • NOUNS
  • VERBS
  • ADJECTIVES and ADVERBS
• Each database is organized around SYNSETS
The noun database
• About 90,000 forms, 116,000 senses
• Relations:
Synsets
• Senses (or ‘lexicalized concepts’) are represented in WordNet by the set of words that can be used in AT LEAST ONE CONTEXT to express that sense / lexicalized concept: the SYNSET
• E.g., {chump, fish, fool, gull, mark, patsy, fall guy, sucker, shlemiel, soft touch, mug} (gloss: person who is gullible and easy to take advantage of)
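The idea that a sense is identified by a set of words can be sketched with plain Python sets. This is a toy model and not the WordNet database format or API: the names `SYNSETS`, `senses_of` and `synonyms` are made up for the example, and the second synset (the drinking-vessel sense of "mug") and its gloss are assumptions added so that one word has two senses.

```python
# Each synset: the words that can express the sense, plus a gloss.
SYNSETS = [
    {
        "words": frozenset({
            "chump", "fish", "fool", "gull", "mark", "patsy",
            "fall guy", "sucker", "shlemiel", "soft touch", "mug",
        }),
        "gloss": "person who is gullible and easy to take advantage of",
    },
    {
        # Hypothetical second entry, for illustration only.
        "words": frozenset({"mug"}),
        "gloss": "large drinking cup with a handle",
    },
]


def senses_of(word):
    """All synsets that contain the word: one per sense of the word."""
    return [s for s in SYNSETS if word in s["words"]]


def synonyms(word):
    """Words sharing at least one synset with the given word."""
    result = set()
    for synset in senses_of(word):
        result |= synset["words"]
    result.discard(word)
    return result
```

In this representation, homonymy and polysemy show up as a word belonging to several synsets, and synonymy as two words belonging to the same synset.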
Hypernyms
2 senses of robin
Sense 1
robin, redbreast, robin redbreast, Old World robin, Erithacus rubecola -- (small Old World songbird with a reddish breast)
  => thrush -- (songbirds characteristically having brownish upper plumage with a spotted breast)
    => oscine, oscine bird -- (passerine bird having specialized vocal apparatus)
      => passerine, passeriform bird -- (perching birds mostly small and living near the ground with feet having 4 toes arranged to allow for gripping the perch; most are songbirds; hatchlings are helpless)
        => bird -- (warm-blooded egg-laying vertebrates characterized by feathers and forelimbs modified as wings)
          => vertebrate, craniate -- (animals having a bony or cartilaginous skeleton with a segmented spinal column and a large brain enclosed in a skull or cranium)
            => chordate -- (any animal of the phylum Chordata having a notochord or spinal column)
              => animal, animate being, beast, brute, creature, fauna -- (a living organism characterized by voluntary movement)
                => organism, being -- (a living thing that has (or can develop) the ability to act or function independently)
                  => living thing, animate thing -- (a living (or once living) entity)
                    => object, physical object --
                      => entity, physical thing --
Meronymy
wn beak -holon
Holonyms of noun beak
1 of 3 senses of beak
Sense 2
beak, bill, neb, nib
  PART OF: bird
The verb database
• About 10,000 forms, 20,000 senses
• Relations between verb meanings:
Relations between verbal meanings
• V1 ENTAILS V2 when “Someone V1” (logically) entails “Someone V2”
  • e.g., snore entails sleep
• TROPONYMY holds when “To do V1” is “To do V2 in some manner”
  • e.g., limp is a troponym of walk
The adjective and adverb database
• About 20,000 adjective forms, 30,000 senses
• 4,000 adverbs, 5,600 senses
• Relations:
How to use
• Online: http://cogsci.princeton.edu/cgi-bin/webwn
• Command line:
  • Get synonyms:
    • wn bank -synsn
  • Get hypernyms:
    • wn robin -hypen
  • Get antonyms (also available for adjectives and verbs):
    • wn right -antsa
ItalWordNet (a local production)
• EuroWordNet: created by a European consortium
• ItalWordNet: created by ITC
  • http://www.ilc.cnr.it/iwndb_php/
Other machine-readable lexical resources
• Machine readable dictionaries:
  • LDOCE
• Roget’s Thesaurus
• The biggest encyclopedia: CYC
• Italian:
  • http://multiwordnet.itc.it/ (IRST)
Readings
• WordNet online manuals
• C. Fellbaum (ed.), WordNet: An Electronic Lexical Database, The MIT Press
PART II: VECTOR-BASED MODELS OF THE LEXICON AND LEXICAL ACQUISITION
VECTOR-BASED LEXICAL MODELS
• In both Linguistics and Psychology, researchers have developed theories of the lexicon in which concepts are characterized in terms of FEATURES
  • E.g., Smith and Medin, 1981; Sartori and Job, 1988
• This type of approach leads to a ‘geometrical’ view of lexical entries as points, or VECTORS, in FEATURE SPACE
• This type of model can account for which words ‘mean the same’
• A particularly simple version of this theory is the one in which the ‘features’ are simply other words
• Vector-space models have been shown to correlate well with the results of psychological experiments, particularly on SEMANTIC PRIMING
VECTOR-BASED MODELS AND LEXICAL ACQUISITION
• Vector-based models (both the feature-based and the word-based variety) are also interesting because they can serve as the basis for models of lexical acquisition
• These models are interesting
  • From a psychological point of view, to explain how concepts are stored in memory
  • In neuroscience, where they are being used to investigate SEMANTIC CATEGORY DEFICITS (e.g., Caramazza, Tyler et al, Vigliocco et al)
  • From a linguistic point of view, because they can address the problems encountered by lexicographers when trying to specify word senses
  • From a practical point of view: most MRDs these days contain at least some information derived by computational means
Feature-based lexical semantics
• Very old idea in Linguistics: the meaning of a word can be specified in terms of the values of certain ‘features’ (‘DECOMPOSITIONAL SEMANTICS’)
  • dog : ANIMATE= +, EAT=MEAT, SOCIAL=+
  • horse : ANIMATE= +, EAT=GRASS, SOCIAL=+
  • cat : ANIMATE= +, EAT=MEAT, SOCIAL=-
• E.g., Katz and Fodor, 1968
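The three feature specifications can be written down directly as dictionaries and compared. The table name `FEATURES` and the helper `shared_features` are illustrative; counting shared feature values is just one simple way to compare entries, chosen here for the sketch.

```python
# Feature values exactly as listed for dog, horse and cat.
FEATURES = {
    "dog":   {"ANIMATE": "+", "EAT": "MEAT",  "SOCIAL": "+"},
    "horse": {"ANIMATE": "+", "EAT": "GRASS", "SOCIAL": "+"},
    "cat":   {"ANIMATE": "+", "EAT": "MEAT",  "SOCIAL": "-"},
}


def shared_features(w1, w2):
    """Names of the features on which two words have the same value."""
    f1, f2 = FEATURES[w1], FEATURES[w2]
    return {name for name in f1 if name in f2 and f1[name] == f2[name]}
```

On these three features, dog shares two values with cat (ANIMATE, EAT) and two with horse (ANIMATE, SOCIAL), while horse and cat share only ANIMATE. The result depends entirely on which features one chooses to specify, which is one of the practical difficulties of the decompositional approach.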
PSYCHOLOGY: THE FUSS MODEL (Vinson and Vigliocco, 2002, 2003)
Vector-based lexical semantics
[Figure: CAT, DOG and HORSE shown as points in feature space]
WORD-BASED VECTOR-SPACE LEXICAL MODELS, I
WORD-BASED VECTOR-SPACE MODELS, II
WORD-BASED VECTOR-SPACE MODELS, III
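Measures of semantic similarity (for n-dimensional vectors x and y)
• Euclidean distance: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
• Cosine: $\cos(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$
• Manhattan metric: $d_1(x, y) = \sum_{i=1}^{n} |x_i - y_i|$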
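The three measures can be sketched in a few lines using their standard textbook definitions. The function names are chosen for this example, and the vectors are assumed to be equal-length sequences of numbers.

```python
import math


def euclidean(x, y):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """City-block distance: sum of the absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def cosine(x, y):
    """Cosine of the angle between x and y (1 means same direction)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_x * norm_y)
```

Euclidean and Manhattan are distances (smaller means more similar), whereas cosine is a similarity (larger means more similar). Cosine ignores vector length, which matters when the vectors hold raw frequencies and some words are simply much more frequent than others.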
DIMENSIONALITY REDUCTION
Concept clustering (aka: automatic taxonomy discovery)
[Figure: the words Year, Month, Day, Joy, Car, Van, Love, Fear and Airplane grouped under the clusters Time, Vehicle and Feeling]
Some psychological evidence for vector-space representations
• Burgess and Lund (1996, 1997): the clusters found with HAL correlate well with those observed using semantic priming experiments.
• Landauer, Foltz, and Laham (1997): scores overlap with those of humans on standard vocabulary and topic tests; mimic human scores on category judgments; etc.
• Evidence about ‘prototype theory’ (Rosch et al, 1976)
  • Posner and Keele, 1968
    • subjects were presented with patterns of dots that had been obtained by variations from a single pattern (‘prototype’)
    • Later, they recalled prototypes better than samples they had actually seen
  • Rosch et al, 1976: ‘basic level’ categories (apple, orange, potato, carrot) have higher ‘cue validity’ than elements higher in the hierarchy (fruit, vegetable) or lower (red delicious, cox)
General characterization of vector-based semantics (from Charniak)
• Vectors as models of concepts
• The CLUSTERING approach to lexical semantics:
  • Define properties one cares about, and give values to each property (generally, numerical)
  • Create a vector of length n for each item to be classified
  • Viewing the n-dimensional vector as a point in n-space, cluster points that are near one another
• What changes between models:
  • The properties used in the vector
  • The distance metric used to decide if two points are ‘close’
  • The algorithm used to cluster
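The three steps of the clustering approach can be sketched as follows. Every concrete choice here is an assumption made for the example: the two-dimensional vectors are invented values, the metric is Euclidean distance, and the algorithm is single-link agglomerative clustering with a distance threshold. Other models would swap in different properties, metrics or algorithms.

```python
import math

# Steps 1 and 2: a vector of (made-up) property values for each item.
POINTS = {
    "car": (1.0, 0.1),
    "van": (0.9, 0.2),
    "joy": (0.1, 1.0),
    "fear": (0.2, 0.9),
}


def distance(x, y):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cluster(points, threshold):
    """Step 3: repeatedly merge the two closest clusters.

    Cluster distance is the distance between their closest members
    (single link); merging stops once no pair is within the threshold.
    """
    clusters = [[name] for name in points]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(
                    distance(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if d <= threshold and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
```

Calling `cluster(POINTS, 0.5)` groups car with van and joy with fear, since each pair lies close together while the two pairs are far apart. The threshold plays the role of deciding when two points count as "close".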
Using words as features in a vector-based semantics
• The old decompositional semantics approach requires
  • Specifying the features
  • Characterizing the value of these features for each lexeme
• Simpler approach: use as features the WORDS that occur in the proximity of that word / lexical entry
  • Intuition: “You can tell a word’s meaning from the company it keeps”
• More specifically, you can use as ‘values’ of these features
  • The FREQUENCIES with which these words occur near the words whose meaning we are defining
  • Or perhaps the PROBABILITIES that these words occur next to each other
• Alternative: use the DOCUMENTS in which these words occur (e.g., LSA)
Using neighboring words to specify the meaning of words
• Take, e.g., the following corpus:
  • John ate a banana.
  • John ate an apple.
  • John drove a lorry.
• We can extract the following co-occurrence matrix:
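A co-occurrence matrix for this corpus can be computed with a short script. The sketch below assumes that two words co-occur when they appear in the same sentence, and that words are lowercased with punctuation stripped; the original matrix may have used a different notion of "neighboring". The names `tokenize` and `cooccurrence` are chosen for the example.

```python
from collections import Counter, defaultdict

CORPUS = ["John ate a banana.", "John ate an apple.", "John drove a lorry."]


def tokenize(sentence):
    """Lowercase the sentence and strip punctuation from each token."""
    tokens = (t.strip(".,;:!?").lower() for t in sentence.split())
    return [t for t in tokens if t]


def cooccurrence(sentences):
    """counts[w][v] = number of times v appears in a sentence with w."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i, w in enumerate(tokens):
            for j, v in enumerate(tokens):
                if i != j:
                    counts[w][v] += 1
    return counts


counts = cooccurrence(CORPUS)
```

Each row `counts[w]` is the vector for the word w. Under this counting, "banana" and "apple" both have "john" and "ate" as neighbours, while "lorry" has "john" and "drove", which is the kind of difference a similarity measure can then pick up.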
Acquiring lexical vectors from a corpus (Schuetze, 1991; Burgess and Lund, 1997)
• To construct vectors C(w) for each word w:
  • Scan a text
  • Whenever a word w is encountered, increment all cells of C(w) corresponding to the words v that occur in the vicinity of w, typically within a window of fixed size
• Differences among methods:
  • Size of window
  • Weighted or not
  • Whether every word in the vocabulary counts as a dimension (including function words such as the or and) or whether instead only some specially chosen words are used (typically, the m most common content words in the corpus; or perhaps modifiers only). The words chosen as dimensions are often called CONTEXT WORDS
  • Whether dimensionality reduction methods are applied
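The scanning procedure and three of the listed design choices (window size, weighting, context words) can be sketched as one function. The function name `build_vectors` and its parameters are assumptions for the example, and the weighting scheme (closer neighbours count more, on a linear ramp) is one plausible choice rather than the scheme of any specific published model.

```python
from collections import Counter, defaultdict


def build_vectors(tokens, window=2, context_words=None, weighted=False):
    """Build a co-occurrence vector C(w) for every word w in tokens.

    window        -- how many positions to the left and right count as 'near'
    context_words -- if given, only these words are used as dimensions
    weighted      -- if True, a neighbour at distance d adds (window - d + 1)
                     instead of 1, so adjacent words weigh the most
    """
    vectors = defaultdict(Counter)
    for i, w in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j == i:
                continue
            v = tokens[j]
            if context_words is not None and v not in context_words:
                continue
            dist = abs(i - j)
            vectors[w][v] += (window - dist + 1) if weighted else 1
    return vectors


tokens = "john ate a banana john ate an apple john drove a lorry".split()
plain = build_vectors(tokens, window=2)
ramped = build_vectors(tokens, window=2, weighted=True)
verbs_only = build_vectors(tokens, window=2, context_words={"ate", "drove"})
```

Note that this toy token stream ignores sentence boundaries, so the window around "banana" reaches into the next sentence. Restricting the dimensions to chosen context words (here, two verbs) shrinks the vectors considerably, which is one reason that option is popular. Dimensionality reduction is not covered by this sketch.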