Word Association Thesaurus as a Resource for Extending Semantic Networks. Anna Sinopalnikova 1 , 2 , Pavel Smrz 1 1 Faculty of Informatics, Masaryk University Botanicka 68a, 602 00 Brno, Czech Republic 2 Saint-Petersburg State University Universitetskaya 11, Saint-Petersburg, Russia
Word Association Thesaurus as a Resource for Extending Semantic Networks
Anna Sinopalnikova (1, 2), Pavel Smrz (1)
(1) Faculty of Informatics, Masaryk University, Botanicka 68a, 602 00 Brno, Czech Republic
(2) Saint-Petersburg State University, Universitetskaya 11, Saint-Petersburg, Russia
{anna, smrz}@fi.muni.cz
Overview
• Motivation
• Word Association and other notions of psycholinguistics
• WAT vs. Corpus
• Semantic information from WAT: core concepts, semantic primitives, syntagmatic and paradigmatic relations, domain information
Motivation
• There is still a need for an empirical basis for semantic network construction.
• Semantic Web initiatives.
• WATs are available for many languages, yet it remains unclear what they are good for and how to use them.
Word Association and other notions of psycholinguistics
• Word Association
• Word Association Test
• Word Association Norms
• Word Association Thesaurus
Example
Needle stimulates -> thread: 41, pin: 13, sharp: 6, sew: 5, cotton: 2, dressmaker: 1, fix: 1, prick: 1, sewing: 1, sow: 1, spring: 1, stitch: 1, etc.
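A WAT entry of this kind can be held as a mapping from response to count. The sketch below is only an illustration: the function name `relative_strengths` is an assumption, and because the slide ends in "etc.", the shares are relative to the listed responses only, not to the full entry.

```python
from collections import Counter

# Responses to the stimulus "needle" exactly as listed on the slide.
# The trailing "etc." is not recoverable, so the total below is partial.
needle = Counter({
    "thread": 41, "pin": 13, "sharp": 6, "sew": 5, "cotton": 2,
    "dressmaker": 1, "fix": 1, "prick": 1, "sewing": 1, "sow": 1,
    "spring": 1, "stitch": 1,
})


def relative_strengths(responses):
    """Turn raw response counts into shares of the listed total."""
    total = sum(responses.values())
    return {word: count / total for word, count in responses.items()}


strengths = relative_strengths(needle)
```

Reading the counts as shares makes entries for different stimuli comparable even when the number of subjects differs.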
WATs explored
• RAT - Russian WAT by Karaulov et al. (1994-1998): 8000 stimuli, 23000 words covered, 1000 subjects
• EAT - Edinburgh WAT by Kiss et al. (1972): 8400 stimuli, 54000 words covered, 1000 subjects
• Czech WAN (Novak et al., 1996): 150 stimuli, 4000 words covered, 250 subjects
Experience gained in projects:
• RussNet (a wordnet-like database for Russian linking lexical semantics with derivational morphology)
• Czech part of the BalkaNet project (a multilingual wordnet-like network for 5 Balkan languages and Czech)
WAT vs. Corpus
History: Church & Hanks, 1990; Wettler & Rapp, 1993; Willners, 2001
• Bokrjonok 3.0 - balanced corpus for Russian (16 million words)
• BNC - British National Corpus (112 million words)
• CNC - Czech National Corpus (160 million words) and its unbalanced version (630 million words)
Research procedure: 5000 pairs (e.g. cheese – mouse, dark – alley) were extracted at random from each WAN and then searched for in the corpora. The window span was fixed at -10 to +10 words.
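The search step of the procedure can be sketched as a window test over a tokenized corpus. This is a minimal illustration, not the authors' implementation: the function names `cooccurs` and `missing_pairs` are hypothetical, and the sketch assumes exact token matching on an already tokenized, lowercased text (the slides do not say whether lemmatization was used).

```python
def cooccurs(tokens, stimulus, response, span=10):
    """Return True if `response` occurs within `span` tokens of `stimulus`."""
    for i, token in enumerate(tokens):
        if token != stimulus:
            continue
        # Up to `span` tokens to the left and to the right of the stimulus.
        left = tokens[max(0, i - span):i]
        right = tokens[i + 1:i + 1 + span]
        if response in left or response in right:
            return True
    return False


def missing_pairs(tokens, pairs, span=10):
    """Return the (stimulus, response) pairs never found inside the window."""
    return [(s, r) for s, r in pairs if not cooccurs(tokens, s, r, span)]


# Toy sentence standing in for a corpus.
tokens = "the mouse ate the cheese in the dark".split()
pairs = [("cheese", "mouse"), ("dark", "alley")]
absent = missing_pairs(tokens, pairs)
```

On a corpus of hundreds of millions of words a positional index would be needed instead of a linear scan; the scan is kept here only to make the window logic visible.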
WAN vs. Corpus: Russian
Quantitative analysis (Sinopalnikova, 2004):
- 64% of word associations do not occur in the corpus,
- 49% when unique associations (those with absolute frequency = 1) are excluded.
Qualitative analysis:
- a high proportion of the absent associations are syntagmatic,
- for verbs this figure was up to 84%.
WAN vs. Corpus: English
Quantitative analysis:
- 31% of word associations do not occur in the BNC
Qualitative analysis (breakdown by association type, %):
PARADIGMATIC 57.1
SYNTAGMATIC 8.4
DOMAIN 21.7
OTHER 12.8
WAN vs. Corpus: English (2)
• acquiring synonymy and hyponymy, e.g. sex – fornicate (archaic or humorous), ire (poetic) – anger, cowardly – yellow (slang)
• acquiring information about low-frequency words, e.g. perambulate (N_BNC = 3), fornicate (N_BNC = 6); cf. EAT: perambulate - walk: 30, pram: 17, baby: 9, push: 8, about: 1, dawdle: 1, move: 1, promenade: 1, slowly: 1, stroll: 1, through: 1, wander: 1, etc.
• acquiring domain relations; the absent portion of them was surprisingly large for a corpus such as the BNC, e.g. ink-pot – pen: 24, non-violence – peace: 29, offside – soccer: 2
WAN vs. Corpus: Czech
Quantitative analysis:
- 514 associations missing (10.28%)
Qualitative analysis:
- the proportion of syntagmatic and paradigmatic associations among them was similar to that for English
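The percentages on the three comparison slides are shares of sampled pairs that were not found in the corpus. A minimal sketch of that calculation, including the variant that drops unique associations (frequency = 1), might look as follows. The function name `missing_share`, its parameters, and the toy `found` set are assumptions for illustration; only the needle frequencies and the Czech figure of 514 out of 5000 pairs come from the slides.

```python
def missing_share(pairs, found, min_freq=1):
    """Share of pairs with WAT frequency >= min_freq that are absent from `found`.

    pairs: dict mapping (stimulus, response) -> frequency in the WAT
    found: set of (stimulus, response) pairs located in the corpus
    """
    kept = [pair for pair, freq in pairs.items() if freq >= min_freq]
    if not kept:
        return 0.0
    missing = [pair for pair in kept if pair not in found]
    return len(missing) / len(kept)


# Frequencies from the "needle" example; the `found` set is invented.
pairs = {
    ("needle", "thread"): 41,
    ("needle", "pin"): 13,
    ("needle", "sow"): 1,
    ("needle", "spring"): 1,
}
found = {("needle", "thread")}

share_all = missing_share(pairs, found)                    # all pairs
share_non_unique = missing_share(pairs, found, min_freq=2)  # unique ones dropped

# Arithmetic check of the Czech slide: 514 missing out of 5000 sampled pairs.
czech_share = 514 / 5000
```

Dropping unique associations lowers the missing share in this toy case, which is the same direction as the Russian figures (64% versus 49%).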
Extracting semantic information from WAT
• Associations:
  • by form – 10% (e.g. know – no, yellow – mellow)
  • by meaning – 90% (e.g. needle – sew, yellow – sun)
• core concepts,
• semantic primitives,
• syntagmatic and paradigmatic relations,
• domain information
Core concepts
A WAT contains words that have an above-average number of direct links to other words.
Russian: человек (person), мир (world/peace), дом (house), жизнь (life), есть (eat/be), думать (think), жить (live), идти (go), большой (big), хорошо (well), плохо (badly), нет (не) (no (not)), новый (new), дерево (tree) etc. (295 words with more than 100 relations);
English: man, sex, no (not), love, house; work, eat, think, go, live; good, old, small etc. (586 words with more than 100 relations);
Czech: člověk (person), dům (house), strom (tree); jíst (eat), jít (go), myslet (think); moc (much/power), starý (old), velký (big), bílý (white), hezký (pretty) etc.
These words determine the fundamental concepts of a particular language system, and thus should be incorporated into an ontology as its core components (e.g., SUMO upper concepts or EWN Base Concepts).
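Selecting core concepts by link count can be sketched as a degree computation over the association graph. The sketch assumes that a "direct link" counts in both directions (a word as stimulus and as response), which the slides do not specify; the function name `core_concepts` and the toy counts other than those for needle are hypothetical.

```python
from collections import defaultdict


def core_concepts(wat, threshold=100):
    """Return words linked to more than `threshold` distinct other words.

    wat: dict mapping stimulus -> dict of response -> count
    """
    neighbours = defaultdict(set)
    for stimulus, responses in wat.items():
        for response in responses:
            if response == stimulus:
                continue  # ignore self-associations
            # Count the link for both ends (assumption, see lead-in).
            neighbours[stimulus].add(response)
            neighbours[response].add(stimulus)
    return sorted(w for w, linked in neighbours.items() if len(linked) > threshold)


# Toy thesaurus; a real WAT would be used with the threshold of 100.
toy_wat = {
    "needle": {"thread": 41, "pin": 13},
    "pin": {"needle": 20, "sharp": 3},
    "sew": {"needle": 9},
}
hubs = core_concepts(toy_wat, threshold=2)
```

With a full WAT loaded in the same shape, `core_concepts(wat, threshold=100)` would correspond to the "more than 100 relations" criterion quoted on the slide.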
Semantic primitives
• A WAT can also provide a list of basic concepts associated with each separate word, thus revealing the semantics of a word (situation) as a list of semantic constituents, i.e. separate pieces of information.
• Abstract words (verbs, adjectives or nouns denoting complex situations or emotional states) are difficult to decompose by means of logic and intuition.
• E.g. depression could be reduced to its constituents: sad 7, low 5, black 4, manic 4, sadness 3, bored 3, misery 2, tiredness 2, despair 1, gloom 1, grey 1, hopelessness 1, monotony 1, sick 1, mood 1, nerves 1, etc.; its probable causes: rain 3, guilt 1, pain 1, unemployment 1; its probable effects: suicide 1; its antipodes: elation 3, fun 1, happiness 1 etc.
Syntagmatic and paradigmatic relations
• “Linguistic substitutes for reality”
• Word associations reflect the order of events in reality, the way objects are organized in space, and the way human beings experience them.
• Associations by contiguity, e.g. cry – baby, may be treated as a manifestation of the syntagmatic relation between a verb and its subject, while take – hand may be treated as a ROLE_INSTRUMENT relation.
• Generalization! E.g. drink – water, beer, milk, ale, Coca-cola, coffee, juice, etc. found in a WAT should be generalized as a drink ROLE_OBJECT beverage relation and incorporated into the semantic network in that form.
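The generalization step can be sketched as looking up a shared hypernym for the responses to one stimulus. This is only one possible reading of the slide: the function `generalize`, the `min_share` cutoff, and the hand-written `hypernym_of` table are all hypothetical (in practice the hypernyms might be taken from a wordnet).

```python
from collections import Counter


def generalize(responses, hypernym_of, min_share=0.5):
    """Return the hypernym covering at least `min_share` of responses, else None."""
    counts = Counter(hypernym_of[r] for r in responses if r in hypernym_of)
    if not counts:
        return None
    label, covered = counts.most_common(1)[0]
    return label if covered / len(responses) >= min_share else None


# Responses to "drink" as listed on the slide.
responses = ["water", "beer", "milk", "ale", "Coca-cola", "coffee", "juice"]
# Hypothetical hypernym table, written by hand for this example.
hypernym_of = {word: "beverage" for word in responses}

target = generalize(responses, hypernym_of)
relation = ("drink", "ROLE_OBJECT", target)
```

Storing the single generalized triple instead of seven word-level links keeps the network compact, at the cost of losing the individual response counts.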
Syntagmatic and paradigmatic relations (2)
• The law of contiguity cannot explain all associations.
• Law of similarity, e.g. inanimate – dead: 39 (SYNONYMY), seek – find: 56 (CAUSE relation), buy – sell: 56 (CONVERSIVE relation).
• One of the main benefits of a WAT: paradigmatic relations are given explicitly, as opposed to other sources of empirical data (e.g. text corpora).
Domain information
• A WAT explicitly presents the way common words are grouped together according to the fragments of reality they describe. E.g., hospital –> nurse, doctor, pain, ill, injury, load…
• Types of domain relations:
  • name of domain (situation) – domain member, e.g. hospital – nurse: 8, finance – money: 61, football – player: 4, marriage – husband: 2;
  • participant – participant, e.g. pepper – salt: 58, tamer – lion: 69, needle – thread: 41, mouse – cat: 22;
  • participant – circumstance, e.g. umbrella – rain: 58, actor – stage: 23;
  • participant – pointer to its function/role in the situation, e.g. larder – food: 58, envelope – letter: 60, actor – play: 15 etc.
• Open question: should the types of domain relations be differentiated within the semantic network, or included as a uniform IS_ASSOCIATED_TO relation?
Conclusions
Advantages of using a WAT in constructing a semantic network:
• Simplicity of data acquisition.
• Broad variety of semantic information to acquire.
• Empirical nature of the data extracted (as opposed to theoretical data, cf. conventional ontologies, taxonomies or classification schemes, which rely on the researcher's introspection and intuition and hence lead to over- and under-estimation of the phenomena under consideration).
• Probabilistic nature of the data presented (the data reflect the relative rather than absolute relevance of semantic relations in each particular case).