CS460/626: Natural Language Processing/Speech, NLP and the Web (Lecture 9–IWSD; start of MT)
Pushpak Bhattacharyya, CSE Dept., IIT Bombay
24th Jan, 2011
WordNet Sub-Graph
[Figure: wordnet sub-graph centred on the synset {house, home}]
• Gloss: a place that serves as the living quarters of one or more families
• Hypernymy: {dwelling, abode}
• Hyponymy: hermitage, cottage
• Meronymy: kitchen, backyard, bedroom, veranda, study, guestroom
Pioneering work at IITB on Multilingual WSD
• Mitesh Khapra, Saurabh Sohoney, Anup Kulkarni and Pushpak Bhattacharyya, Value for Money: Balancing Annotation Effort, Lexicon Building and Accuracy for Multilingual WSD, Computational Linguistics Conference (COLING 2010), Beijing, China, August 2010.
• Mitesh Khapra, Anup Kulkarni, Saurabh Sohoney and Pushpak Bhattacharyya, All Words Domain Adapted WSD: Finding a Middle Ground between Supervision and Unsupervision, Conference of Association of Computational Linguistics (ACL 2010), Uppsala, Sweden, July 2010.
• Mitesh Khapra, Sapan Shah, Piyush Kedia and Pushpak Bhattacharyya, Domain-Specific Word Sense Disambiguation Combining Corpus Based and Wordnet Based Parameters, 5th International Conference on Global Wordnet (GWC2010), Mumbai, Jan, 2010.
• Mitesh Khapra, Sapan Shah, Piyush Kedia and Pushpak Bhattacharyya, Projecting Parameters for Multilingual Word Sense Disambiguation, Empirical Methods in Natural Language Processing (EMNLP09), Singapore, August, 2009.
Motivation
• Parallel corpora, wordnets and sense-annotated corpora are scarce resources.
• Challenges: lack of resources, multiplicity of Indian languages.
• Can we do annotation work in one language and find ways of reusing it for other languages?
• Can a more resource-fortunate language help a less resource-fortunate language?
CFILT - IITB
Introduction
• Aim: perform WSD in a multilingual setting involving Hindi, Marathi, Bengali and Tamil
• The wordnet and sense-marked corpora of Hindi are used for all these languages
• The methodology rests on a novel multilingual dictionary framework
• Parameters are projected from Hindi to the other languages
• The domains of interest are Tourism and Health
Related Work (1/2)
• Knowledge Based Approaches
  • Lesk's algorithm, Walker's algorithm, Conceptual density, PageRank
  • Fundamentally overlap-based algorithms
  • Suffer from data sparsity, since dictionary definitions are generally short
  • Broad-coverage algorithms, but they suffer from poor accuracy
• Supervised Approaches
  • WSD using SVM, k-NN, Decision Lists
  • Typically word-specific classifiers with high accuracy
  • Need large training corpora, so they are unsuitable for resource-scarce languages
Related Work (2/2)
• Semi-Supervised/Unsupervised Approaches
  • Hyperlex, Decision Lists
  • Do not need large annotated corpora, but are word-specific classifiers
  • Not suited for broad coverage
• Hybrid approaches (motivation for our work)
  • Structural Semantic Interconnections
  • Combine more than one knowledge source (wordnet as well as a small amount of tagged corpora)
  • Suitable for broad coverage
No single existing solution to WSD completely meets our requirements of multilinguality, high domain accuracy and good performance in the face of not-so-large annotated corpora.
Parameters for WSD (1/4)
• Motivating example
  • The river flows through this region to meet the sea.
  • S1: (n) sea (a division of an ocean or a large body of salt water partially enclosed by land)
  • S2: (n) ocean, sea (anything apparently limitless in quantity or volume)
  • S3: (n) sea (turbulent water with swells of considerable size) "heavy seas"
What are the parameters that influence the choice of the correct sense for the word sea?
Parameters for WSD (2/4)
• Domain-specific distributions
  • In the Tourism domain the “water-body” sense is more prevalent than the other senses
  • Domain-specific sense distribution information should be harnessed
• Dominance of senses in a domain
  • {place, country, city, area}, {flora, fauna}, {mode of transport}, {fine arts} are dominant senses in the Tourism domain
  • A sense which belongs to the sub-tree of a dominant sense should be given a higher score than the other senses
A synset node in the wordnet hypernymy hierarchy is called dominant if the synsets in the sub-tree below it occur frequently in the domain corpora.
Parameters for WSD (3/4)
• Corpus co-occurrence statistics
  • Co-occurring monosemous and/or already disambiguated words in the context help in disambiguation.
  • Example: the frequency of co-occurrence of river (monosemous) with the “water-body” sense of sea is high
• Semantic distance
  • Shortest path length between two synsets in the wordnet graph
  • An edge on this shortest path can be any semantic relation (hypernymy, hyponymy, meronymy, holonymy, etc.)
• Conceptual distance between noun synsets
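The semantic-distance parameter can be illustrated with a breadth-first search. This is a minimal sketch, assuming a toy undirected graph loosely based on the house/home sub-graph from the earlier slide; the edge list and the function names `build_graph` and `semantic_distance` are illustrative and not taken from the lecture.

```python
from collections import deque

# Toy edges; each one stands for some semantic relation between two synsets.
# The relation labels in the comments are assumptions made for illustration.
EDGES = [
    ("house", "dwelling"),   # hypernymy
    ("house", "kitchen"),    # meronymy
    ("house", "bedroom"),    # meronymy
    ("house", "cottage"),    # hyponymy
    ("house", "hermitage"),  # hyponymy
]

def build_graph(edges):
    """Build an undirected adjacency map from a list of (a, b) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def semantic_distance(graph, source, target):
    """Number of edges on the shortest path, or None if no path exists."""
    if source not in graph or target not in graph:
        return None
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph[node]:
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

GRAPH = build_graph(EDGES)
print(semantic_distance(GRAPH, "kitchen", "cottage"))  # 2 (via "house")
```

Because any relation type counts as an edge, the path from kitchen to cottage mixes a meronymy link with a hyponymy link.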
Parameters for WSD (4/4)
Summarizing the parameters:
• Wordnet-dependent parameters
  • belongingness-to-dominant-concept
  • conceptual-distance
  • semantic-distance
• Corpus-dependent parameters
  • sense distributions
  • corpus co-occurrence
Building a case for Parameter Projection
• Wordnet-dependent parameters depend on the graph-based structure of the wordnet
• Corpus-dependent parameters depend on various statistics learnt from sense-marked corpora
• Both tasks are tedious and expensive:
  • Constructing a wordnet from scratch
  • Collecting sense-marked corpora for multiple languages
Can the effort required in constructing semantic graphs for multiple wordnets and collecting sense-marked corpora in multiple languages be avoided?
Synset based Multilingual Dictionary (1/2)
Rajat Mohanty, Pushpak Bhattacharyya, Prabhakar Pande, Shraddha Kalele, Mitesh Khapra and Aditya Sharma. 2008. Synset Based Multilingual Dictionary: Insights, Applications and Challenges. Global Wordnet Conference, Szeged, Hungary, January 22-25.
• Unlike in a traditional dictionary, the synsets are linked first, and then the words inside the synsets are linked
• Hindi is used as the central language: the synsets of all languages link to the corresponding Hindi synset.
Advantage: The synsets in a particular column automatically inherit the various semantic relations of the Hindi wordnet, so the wordnet-based parameters get projected.
Synset based Multilingual Dictionary (2/2)
[Figure: cross-linked synsets for one concept]
  • Hindi synset: लड़का /HW1 ladakaa, बालक /HW2 baalak, बच्चा /HW3 bachcha, छोरा /HW4 choraa
  • Marathi synset: मुलगा /MW1 mulagaa, पोरगा /MW2 poragaa, पोर /MW3 pora
  • English synset: male-child /HW1, boy /HW2
• Cross-linkages are set up manually from the words of a synset to the words of the linked synset of the central language
• Such cross-linkages actually solve the problem of lexical choice when translating text from one language to another.
Sense Marked Corpora
[Figure: snapshot of a Marathi sense-tagged paragraph]
Parameter Projection using MultiDict - P(Sense|Word) parameter (1/2)
[Figure: cross-linked sense entries. Sense_2650: saagar (sea) {water body} linked to samudra (sea) {water body}. Sense_8231: saagar (sea) {abundance} linked to saagar (sea) {abundance}.]
• P({water-body}|saagar) is given by [equation not preserved]
• Using the cross-linked Hindi words we get P({water-body}|saagar) as [equation not preserved]
• In general, [equation not preserved]
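The formulas for this slide are not reproduced in the text, so the following is only a sketch under a stated assumption: that P(sense|word) is a relative frequency over the senses of the word, and that projection replaces the target-language counts with the counts of the cross-linked Hindi words in the Hindi sense-marked corpus. The function name, the counts and the uniform fallback are all hypothetical.

```python
def projected_sense_distribution(cross_links, hindi_counts):
    """Estimate P(sense | word) for a target-language word from Hindi counts.

    cross_links:  sense id -> Hindi word cross-linked to the target word
                  for that sense
    hindi_counts: (sense id, Hindi word) -> count in the Hindi corpus
    """
    counts = {s: hindi_counts.get((s, w), 0) for s, w in cross_links.items()}
    total = sum(counts.values())
    if total == 0:
        # Assumption: fall back to a uniform distribution when nothing is seen.
        return {s: 1.0 / len(counts) for s in counts}
    return {s: c / total for s, c in counts.items()}

# Marathi "saagar": the water-body sense links to Hindi "samudra",
# the abundance sense links to Hindi "saagar".
cross_links = {"water-body": "samudra", "abundance": "saagar"}

# Invented counts, for illustration only.
hindi_counts = {("water-body", "samudra"): 90, ("abundance", "saagar"): 10}

dist = projected_sense_distribution(cross_links, hindi_counts)
print(dist)  # {'water-body': 0.9, 'abundance': 0.1}
```

No Marathi sense-marked corpus is consulted here; only the cross-links and the Hindi counts are needed.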
Parameter Projection using MultiDict - P(Sense|Word) parameter (2/2)
• For Hindi to Marathi
  • Average KL Divergence = 0.29
  • Spearman’s Correlation Coefficient = 0.77
• For Hindi to Bengali
  • Average KL Divergence = 0.05
  • Spearman’s Correlation Coefficient = 0.82
There is a high degree of similarity between the distributions learnt using projection and those learnt from the self corpus.
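The two similarity measures named on this slide can be computed as follows. This is a minimal sketch: the direction of the KL divergence (self-corpus distribution against projected distribution), the smoothing constant `eps`, and the example numbers are assumptions, since the slide does not state how the averages were obtained.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two distributions over the same sense ids."""
    return sum(
        pi * math.log(pi / max(q.get(s, 0.0), eps))
        for s, pi in p.items() if pi > 0
    )

def ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Invented distributions for one word, for illustration only.
true_dist = {"water-body": 0.8, "abundance": 0.2}
projected = {"water-body": 0.9, "abundance": 0.1}
print(kl_divergence(true_dist, projected))
print(spearman([0.8, 0.2, 0.5], [0.9, 0.1, 0.4]))
```

A KL divergence near zero and a rank correlation near one both indicate that the projected statistics track the self-corpus statistics closely.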
Comparison of projected and true sense distribution statistics for some Marathi words
Parameter Projection using MultiDict - Co-occurrence parameter
• Within a domain, the statistics of co-occurrence of senses remain the same across languages.
• The co-occurrence of the synsets {cloud} and {sky} is almost the same in the Marathi and Hindi corpora.
Comparison of projected and true sense co-occurrence statistics for some Marathi words
Algorithms for WSD – Iterative WSD
Motivated by the energy expression in a Hopfield network
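The algorithm itself is not reproduced in the text of this slide. The following is a sketch under stated assumptions: monosemous words are tagged first, the remaining words are handled in increasing order of polysemy, and each candidate sense is scored with a Hopfield-style expression made of a self term (theta times V) plus pairwise interaction terms (W times V times V) with the senses already chosen. All names, the scoring form and the numbers are illustrative.

```python
def iterative_wsd(words, senses, prior, theta, weight):
    """Greedy, iterative sense assignment.

    words:  list of context words
    senses: word -> list of candidate sense ids
    prior:  (sense, word) -> P(sense | word)             (assumed V_i)
    theta:  sense -> belongingness to a dominant concept (assumed theta_i)
    weight: (sense_a, sense_b) -> pairwise affinity      (assumed W_ij)
    """
    chosen = {}
    # Step 1: monosemous words need no disambiguation.
    for w in words:
        if len(senses[w]) == 1:
            chosen[w] = senses[w][0]
    # Step 2: disambiguate the rest, least polysemous first.
    pending = sorted(
        (w for w in words if w not in chosen),
        key=lambda w: len(senses[w]),
    )
    for w in pending:
        def score(s):
            v_i = prior.get((s, w), 0.0)
            interaction = sum(
                weight.get((s, t), 0.0) * v_i * prior.get((t, u), 0.0)
                for u, t in chosen.items()
            )
            return theta.get(s, 0.0) * v_i + interaction
        chosen[w] = max(senses[w], key=score)
    return chosen

# Invented parameters for "The river flows ... to meet the sea."
senses = {
    "river": ["river#1"],
    "sea": ["sea#water-body", "sea#abundance", "sea#turbulent-water"],
}
prior = {
    ("river#1", "river"): 1.0,
    ("sea#water-body", "sea"): 0.6,
    ("sea#abundance", "sea"): 0.3,
    ("sea#turbulent-water", "sea"): 0.1,
}
theta = {"sea#water-body": 1.0}
weight = {("sea#water-body", "river#1"): 0.8}

result = iterative_wsd(["river", "sea"], senses, prior, theta, weight)
print(result)  # {'river': 'river#1', 'sea': 'sea#water-body'}
```

The already-tagged monosemous word river pulls sea towards the water-body sense through the pairwise term, which mirrors the co-occurrence argument made earlier in the lecture.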
Algorithms for WSD – Modified PageRank
Modification: instead of using the overlap in dictionary definitions as edge weights, the wordnet and corpus based parameters are used to calculate the edge weights.
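A weighted PageRank over a graph of candidate senses can be sketched as below. This is a generic power-iteration version, not the lecture's exact formulation: the damping factor, the iteration count and the edge weights are assumptions, and the weights merely stand in for values that would be derived from the wordnet and corpus parameters.

```python
def weighted_pagerank(weights, damping=0.85, iterations=50):
    """PageRank on an undirected weighted graph.

    weights: (u, v) -> non-negative weight of the edge between u and v
    Returns: node -> rank (ranks sum to 1 when every node has a
             positive total edge weight).
    """
    adj = {}
    for (u, v), w in weights.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    nodes = list(adj)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    out_sum = {u: sum(adj[u].values()) for u in nodes}
    for _ in range(iterations):
        new_rank = {}
        for v in nodes:
            # Each neighbour passes on rank in proportion to the edge weight.
            incoming = sum(
                rank[u] * adj[u][v] / out_sum[u]
                for u in adj[v] if out_sum[u] > 0
            )
            new_rank[v] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank

# Invented edge weights between candidate senses.
edge_weights = {
    ("river#1", "sea#water-body"): 0.8,
    ("river#1", "sea#abundance"): 0.1,
}
rank = weighted_pagerank(edge_weights)
best_sea_sense = max(["sea#water-body", "sea#abundance"], key=rank.get)
print(best_sea_sense)  # sea#water-body
```

For each ambiguous word, the highest-ranked candidate sense is then selected.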
Experimental Setup
• Datasets
  • Tourism corpora in 4 languages (viz., Hindi, Marathi, Bengali and Tamil)
  • Health corpora in 2 languages (Hindi and Marathi)
• A 4-fold cross validation was done for all the languages in both the domains
Results
What is the wordnet baseline?
• Pick the first sense as given in the wordnet
• The sense ordering can be based on a corpus (needs a sense-marked corpus)
• The sense ordering can be based on the lexicographer’s intuition (more common)
Senses of bank: as in the wordnet
• 1. (883) depository financial institution, bank, banking concern, banking company -- (a financial institution that accepts deposits and channels the money into lending activities; "he cashed a check at the bank"; "that bank holds the mortgage on my home")
• 2. (99) bank -- (sloping land (especially the slope beside a body of water); "they pulled the canoe up on the bank"; "he sat on the bank of the river and watched the currents")
• 3. (76) bank -- (a supply or stock held in reserve for future use (especially in emergencies))
• 4. (54) bank, bank building -- (a building in which the business of banking transacted; "the bank is on the corner of Nassau and Witherspoon")
Senses of water: as in the wordnet
Surprising!!
• 1. (744) water, H2O -- (binary compound that occurs at room temperature as a clear colorless odorless tasteless liquid; freezes into ice below 0 degrees centigrade and boils above 100 degrees centigrade; widely used as a solvent)
• 2. (219) body of water, water -- (the part of the earth's surface covered with water (such as a river or lake or ocean); "they invaded our territorial waters"; "they were sitting by the water's edge")
• 3. (50) water system, water supply, water -- (a facility that provides a source of water; "the town debated the purification of the water supply"; "first you have to cut off the water")
• 4. (3) water -- (once thought to be one of four elements composing the universe (Empedocles))
• 5. (1) urine, piss, pee, piddle, weewee, water -- (liquid excretory product; "there was blood in his urine"; "the child had to make water")
• 6. water -- (a fluid necessary for the life of most animals and plants; "he asked for a drink of water")
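The first-sense baseline can be written in a few lines. This sketch hard-codes the sense order and the bracketed counts from the two listings for bank and water; the short sense labels, the dictionary layout and the function names are illustrative, and the sixth sense of water, which has no count in the listing, is treated as 0 here by assumption.

```python
# word -> list of (short sense label, listed count), in wordnet order.
WORDNET_SENSES = {
    "bank": [
        ("depository financial institution", 883),
        ("sloping land", 99),
        ("supply held in reserve", 76),
        ("bank building", 54),
    ],
    "water": [
        ("H2O", 744),
        ("body of water", 219),
        ("water system", 50),
        ("classical element", 3),
        ("urine", 1),
        ("fluid necessary for life", 0),  # no count listed; 0 is an assumption
    ],
}

def first_sense_baseline(word, lexicon=WORDNET_SENSES):
    """Wordnet baseline: always return the first listed sense."""
    return lexicon[word][0][0]

def most_frequent_sense(word, lexicon=WORDNET_SENSES):
    """Corpus-based variant: return the sense with the highest count."""
    return max(lexicon[word], key=lambda pair: pair[1])[0]

for w in ("bank", "water"):
    print(w, "->", first_sense_baseline(w))
# bank -> depository financial institution
# water -> H2O
```

The baseline ignores context and domain entirely, so it returns H2O for water even in a sentence where the body-of-water sense is intended.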
Observations on our experiment
• IWSD performs better than PageRank
• There is a drop in performance when we use parameter projection instead of using self corpora
• Despite the drop in accuracy, the performance is still better than the wordnet baseline
• The performance is consistent in both the domains
• One could trade accuracy against the cost of creating sense-annotated corpora
Start of MT: An exercise
To translate …
• I will carry.
• They drive.
• He swims.
• They will drive.
Czech-English data
• [nesu] “I carry”
• [ponese] “He will carry”
• [nese] “He carries”
• [nesou] “They carry”
• [yedu] “I drive”
• [plavou] “They swim”
Hindi-English data
• [dhoUMgA] “I carry”
• [dhoegA] “He will carry”
• [dhotAhAi] “He carries”
• [dhotehAi] “They carry”
• [chalAtAhuM] “I drive”
• [tErtehEM] “They swim”
Bangla-English data
• [bai] “I carry”
• [baibe] “He will carry”
• [bay] “He carries”
• [bay] “They carry”
• [chAlAi] “I drive”
• [sAMtrAy] “They swim”