Lexical Relations and WordNet Ray Larson & Warren Sack University of California, Berkeley School of Information Management and Systems SIMS 202: Information Organization and Retrieval Lecture author: Warren Sack IS202: Information Organization and Retrieval
Last Time • What is Cognitive Science? • What is Artificial Intelligence? • Knowledge Representation • Languages and Programming Paradigms • Representing Common Sense • Common Sense Interfaces • Story Understanding, Story Generation, and Common Sense
Cognitive Science • 10/30/01 – AI, knowledge representation and common sense • 11/01/01 – Computational Linguistics, Cognitive Psychology and Lexical Knowledge • 11/06/01 – AI and information extraction • 11/08/01 – Linguistics, Philosophy, Psychology, categories, and cognition
Today • Lexical relations • Linguistics • Two approaches to semantics: • Compositional • Relational • Psycholinguistics • WordNet • Description • Structure • Applications
Levels of Linguistic Analysis • Sentences • Phonological/Morphological analysis • Syntactic analysis • Semantic analysis • More than one sentence • Pragmatic analysis
Phonology/Morphology • Phonology: The study of the systems of sounds which are manifested in natural languages; the significant contrasts between sounds that are relevant to meaning. • E.g., consonants, vowels, stress, intonation, etc. • Morphology: the forms of words • E.g., word=watched; morphs=watch+ed; morphemes=watch+past
Syntax The syntax of a language is to be understood as a set of rules which accounts for the distribution of word forms throughout the sentences of a language. These rules codify permissible combinations of classes of word forms.
Semantics • Semantics is the study of linguistic meaning. • Two standard approaches to lexical semantics (cf. sentential semantics and logical semantics): • (1) compositional • (2) relational • Other approaches…
Pragmatics • Deixis • E.g., “I’ll be back in an hour” depends upon the time of the utterance. • Conversational implicature • A: “Can you tell me the time?” • B: “Well, the milkman has come.” [I don’t know exactly, but perhaps you can deduce it from some extra information I give you.] • Presupposition • “Are you still such a bad driver?” • Speech acts • Constatives vs. performatives • e.g., “I second the motion.” • Conversational Structure • E.g., turn-taking rules
Lexical Semantics: Compositional Approach • Compositional lexical semantics, introduced by Katz & Fodor (1963), analyzes the meaning of a word in much the same way a sentence is analyzed into semantic components. The semantic components of a word are not themselves considered to be words, but are abstract elements (semantic atoms) postulated in order to describe word meanings (semantic molecules) and to explain the semantic relations between words. For example, the representation of bachelor might be ANIMATE and HUMAN and MALE and ADULT and NEVER MARRIED. The representation of man might be ANIMATE and HUMAN and MALE and ADULT; because all the semantic components of man are included in the semantic components of bachelor, it can be inferred that bachelor → man. In addition, there are implicational rules between semantic components, e.g. HUMAN → ANIMATE, which also look very much like meaning postulates. • George Miller, “On Knowing a Word,” 1999
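The inference described on this slide can be sketched in a few lines of Python. This is only an illustration, assuming each word is modeled as a set of semantic components; the dictionary `LEXICON` and the function `entails` are hypothetical names introduced here, and the component labels are taken from the bachelor/man example.

```python
# Toy componential lexicon: each word is a set of abstract semantic components.
# The component inventory follows the slide's bachelor/man example.
LEXICON = {
    "man": frozenset({"ANIMATE", "HUMAN", "MALE", "ADULT"}),
    "bachelor": frozenset({"ANIMATE", "HUMAN", "MALE", "ADULT", "NEVER_MARRIED"}),
}


def entails(word_a, word_b, lexicon=LEXICON):
    """Return True if every component of word_b is also a component of word_a."""
    return lexicon[word_b] <= lexicon[word_a]


if __name__ == "__main__":
    print(entails("bachelor", "man"))  # all components of man are in bachelor
    print(entails("man", "bachelor"))  # man lacks NEVER_MARRIED
```

Under this reading, "bachelor implies man" holds because the components of man are a subset of the components of bachelor, while the reverse check fails.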
Lexical Semantics: Relational Approach • Relational lexical semantics was first introduced by Carnap (1956) in the form of meaning postulates, where each postulate stated a semantic relation between words. A meaning postulate might look something like dog → animal (if x is a dog then x is an animal) or, adding logical constants, bachelor → man and never married [if x is a bachelor then x is a man and not(x has married)] or tall → not short [if x is tall then not(x is short)]. The meaning of a word was given, roughly, by the set of all meaning postulates in which it occurs. • George Miller, “On Knowing a Word,” 1999
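A minimal sketch of how meaning postulates might be stored and chained, assuming each postulate is a plain "if x is A then x is B" pair. The table `POSTULATES` and the function `consequences` are hypothetical; the entries man → human and human → animal are added purely for illustration and do not come from the slide.

```python
# Meaning postulates as "if x is A then x is B" pairs.
# "never_married" stands in for the slide's not(x has married).
POSTULATES = {
    "bachelor": {"man", "never_married"},
    "dog": {"animal"},
    "man": {"human"},      # illustrative assumption, not from the slide
    "human": {"animal"},   # illustrative assumption, not from the slide
}


def consequences(word, postulates=POSTULATES):
    """Collect everything that follows from `word` by chaining postulates."""
    seen = set()
    stack = [word]
    while stack:
        current = stack.pop()
        for implied in postulates.get(current, ()):
            if implied not in seen:
                seen.add(implied)
                stack.append(implied)
    return seen


if __name__ == "__main__":
    print(sorted(consequences("bachelor")))
    print(sorted(consequences("dog")))
```

In this toy form the "meaning" of a word is approximated by the postulates reachable from it, which is roughly the idea the slide attributes to the relational approach.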
Psycholinguistics • The introduction of Noam Chomsky’s theory of syntax to psychologists: • Miller, G.A., Galanter, E., Pribram, K.H. (1960) Plans and the Structure of Behavior. • Some areas of psycholinguistics: • Children’s acquisition of language • First and second language learning • Artificial intelligence? (see Lyons, 1981)
WordNet • Started in 1985 by George Miller, students, and colleagues at the Cognitive Science Laboratory, Princeton University • Can be downloaded for free: www.cogsci.princeton.edu/~wn/ • In terms of coverage, WordNet’s goals differ little from those of a good standard college-level dictionary, and the semantics of WordNet is based on the notion of word sense that lexicographers have traditionally used in writing dictionaries. It is in the organization of that information that WordNet aspires to innovation. (Miller, 1998, chapter 1)
Presuppositions of WordNet project • Separability hypothesis: The lexical component of language can be separated and studied in its own right. • Patterning hypothesis: People have knowledge of the systematic patterns and relations between word meanings. • Comprehensiveness hypothesis: Computational linguistics programs need a store of lexical knowledge that is as extensive as that which people have.
WordNet structure • Synsets versus Words
WordNet: Size
POS         Unique Strings   Synsets
Noun        107930           74488
Verb        10806            12754
Adjective   21365            18523
Adverb      4583             3612
Totals      144684           109377
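The distinction between synsets and word strings can be illustrated with a toy data structure. This is a sketch under stated assumptions: the three synsets, their labels such as "bank.n.1", and the helper `index_by_word` are hypothetical and are not taken from the WordNet database itself.

```python
from collections import defaultdict

# Hypothetical miniature lexicon: each synset is a set of word strings
# that share one sense. The synset labels are made up for this example.
SYNSETS = {
    "bank.n.1": {"bank", "depository"},
    "bank.n.2": {"bank", "riverbank"},
    "car.n.1": {"car", "auto", "automobile"},
}


def index_by_word(synsets):
    """Invert the synset table: map each word string to the synsets containing it."""
    index = defaultdict(set)
    for synset_id, words in synsets.items():
        for word in words:
            index[word].add(synset_id)
    return dict(index)


WORD_INDEX = index_by_word(SYNSETS)

if __name__ == "__main__":
    print(len(WORD_INDEX), "unique strings,", len(SYNSETS), "synsets")
    print(sorted(WORD_INDEX["bank"]))  # a polysemous string belongs to several synsets
```

Because one string can belong to several synsets and one synset can hold several strings, the two counts in a size table like the one above need not match.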
Structure of WordNet
Unique Beginners • { entity, something, (anything having existence (living or nonliving)) } • { psychological_feature, (a feature of the mental life of a living organism) } • { abstraction, (a general concept formed by extracting common features from specific examples) } • { state, (the way something is with respect to its main attributes; "the current state of knowledge"; "his state of health"; "in a weak financial state") } • { event, (something that happens at a given place and time) } • { act, human_action, human_activity, (something that people do or cause to happen) } • { group, grouping, (any number of entities (members) considered as a unit) } • { possession, (anything owned or possessed) } • { phenomenon, (any state or process known through the senses rather than by intuition or reasoning) }
Roget’s “Unique Beginners” The ontology of Roget’s is headed by six Classes. The first three Classes cover the external world: Abstract Relations deals with such ideas as number, order and time; Space is concerned with movement, shapes and sizes, while Matter covers the physical world and humankind’s perception of it by means of five senses. The remaining Classes deal with the internal world of human beings: the mind (Intellect), the will (Volition), the heart and soul (Emotion, Religion and Morality). There is a logical progression from abstract concepts, through the material universe, to mankind itself, culminating in what Roget saw as mankind’s highest achievements: morality and religion (Kirkpatrick, 1998). Class Four, Intellect, is divided into Formation of ideas and Communication of ideas, and Class Five, Volition, into Individual volition and Social volition. In practice, therefore, the Thesaurus is headed by eight Classes. A path in Roget’s ontology always begins with one of the Classes. It branches to one of the 39 Sections and then to one of the 990 Heads. Each Head is divided into paragraphs grouped by parts of speech: nouns, adjectives, verbs and adverbs. From Mario Jarmasz, Stan Szpakowicz, “Roget’s Thesaurus as an Electronic Lexical Knowledge Base,” 2000.
WordNet Browsers • http://www.cogsci.princeton.edu/cgi-bin/webwn • http://bogart.sip.ucm.es/~jorge/browser.htm • http://www.visualthesaurus.com/
Other WordNets http://www.hum.uva.nl/~ewn/gwa/wordnet_table.htm • Dutch • Spanish • Italian • German • French • Czech • Estonian
Forthcoming WordNets http://www.hum.uva.nl/~ewn/gwa/wordnet_table.htm • Bengali • Bulgarian • Danish • Greek • Hebrew • Hindi • Kannada • Latvian • Moldavian • Romanian • Russian • Slovenian • Swedish • Tamil • Thai • Turkish • Yugoslavian • Norwegian • Icelandic
Psycholinguistic evidence for WordNet’s structure • Bever and Rosenbaum, 1970: • A pistol is more dangerous than a rifle. • * A pistol is more dangerous than a gun. • * A gun is more dangerous than a pistol. • Resnik, 1993 • The direct object of the verb drink can be any hyponym of the noun beverage. • Collins and Quillian, 1969 • The time required to verify the statement “A robin is a bird” is shorter than the time required to verify the statement “A robin is an animal.”
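The Collins and Quillian observation can be related to path length in a hypernym hierarchy. The sketch below assumes a deliberately tiny, hypothetical tree; the table `HYPERNYM` and the functions `hypernym_path` and `is_a_distance` are illustrative names, and the real WordNet chains are longer than the ones shown here.

```python
# Hypothetical single-parent hypernym table (child -> parent).
HYPERNYM = {
    "robin": "bird",
    "chicken": "bird",
    "bird": "animal",
    "animal": "entity",
}


def hypernym_path(word, hypernym=HYPERNYM):
    """Follow hypernym links from `word` up to the top of the hierarchy."""
    path = [word]
    while path[-1] in hypernym:
        path.append(hypernym[path[-1]])
    return path


def is_a_distance(word, ancestor, hypernym=HYPERNYM):
    """Number of hypernym links from `word` up to `ancestor`, or None if unrelated."""
    path = hypernym_path(word, hypernym)
    return path.index(ancestor) if ancestor in path else None


if __name__ == "__main__":
    print(hypernym_path("robin"))
    print(is_a_distance("robin", "bird"), is_a_distance("robin", "animal"))
```

In this toy tree "robin is a bird" needs one link and "robin is an animal" needs two, which matches the direction of the reported verification times.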
Psycholinguistic evidence against WordNet’s structure • Smith and Medin, 1981 • The time required to verify that a chicken is a bird is significantly longer than the time required to verify that a robin is a bird, even though chicken and robin stand in the same taxonomic relation to bird. • Rosch, 1973 • Ratings of “typicality” have little to do with frequency or familiarity. • Lakoff, 1987 • Concepts are represented, not by a list of distinguishing features, but by the focal instances (or prototypes) that are the best examples of the concept.
WordNet Applications • Using WordNet as a data structure. Many languages used by computational linguists and natural language processing researchers now have WordNet packages. E.g., for Perl • Lingua::Wordnet, and • Lingua::Wordnet::Analysis by Dan Brian, http://search.cpan.org/search?dist=Lingua-Wordnet
WordNet Applications • Information Retrieval: Voorhees, 1998 • Query expansion via synsets • “sense-based” rather than “stem-based” vectors • Unfortunately, in both cases, the inability to automatically resolve word senses prevented any improvement from being made.
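Query expansion via synsets can be sketched as follows. This is not Voorhees's system; it is a toy illustration that assumes a hypothetical list of synsets (`SYNSETS`) and a hypothetical helper `expand_query`, and it expands every sense of a term because no sense resolution is attempted.

```python
# Hypothetical synsets; "bank" appears in two of them with different senses.
SYNSETS = [
    {"car", "auto", "automobile"},
    {"bank", "depository"},
    {"bank", "riverbank"},
]


def expand_query(terms, synsets=SYNSETS):
    """Add every synonym of every query term, with no sense disambiguation."""
    expanded = []
    for term in terms:
        candidates = {term}
        for synset in synsets:
            if term in synset:
                candidates |= synset
        # Sort for a stable order and skip terms that were already added.
        for candidate in sorted(candidates):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


if __name__ == "__main__":
    print(expand_query(["car", "loan"]))
    print(expand_query(["bank"]))  # both senses of "bank" leak into the query
```

Expanding "bank" pulls in both "depository" and "riverbank", which shows in miniature why unresolved word senses can cancel out the benefit of expansion.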
WordNet Applications • Textual Cohesion and the correction of Malapropisms: Hirst and St-Onge, 1998 • Malapropism = the confounding of an intended word with another word of similar sound or similar spelling that has a quite different meaning; e.g., “Super bowl → Superb owl”
WordNet Applications • Temporal Indexing through lexical chaining: Al-Halimi and Kazman, 1998 • Indexing transcripts of conference meetings by topic.
WordNet Applications • Conversation themes in Usenet: Sack, 2000
Next Time • Information Extraction, Artificial Intelligence, and “Story Understanding” Revisited