CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 15–Language Divergence)

Presentation Transcript


  1. CS460/626 : Natural Language Processing/Speech, NLP and the Web (Lecture 15–Language Divergence). Pushpak Bhattacharyya, CSE Dept., IIT Bombay, 8th Feb 2011

  2. Key difference between Statistical/ML-based NLP and Knowledge-based/linguistics-based NLP • Stat NLP: speed and robustness are the main concerns • KB NLP: phenomenon-based • Example: boys, toys, toes • To get the root, remove the “s” • But what about foxes, boxes, ladies? • Understanding the phenomenon means going deeper • Slower processing
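The contrast between surface-pattern stripping and phenomenon-aware analysis can be made concrete with a small sketch. This is illustrative only (the rule set and function names are not from the lecture); real morphological analysers use far richer rules.

```python
# Naive "drop the final s" stemming vs. a few phenomenon-aware rules.
def naive_stem(word):
    """Surface pattern only: strip a trailing 's'."""
    return word[:-1] if word.endswith("s") else word

def rule_based_stem(word):
    """Handle -ies and -xes/-ses/-ches/-shes before falling back to -s."""
    if word.endswith("ies"):
        return word[:-3] + "y"              # ladies -> lady
    if word.endswith(("xes", "ses", "ches", "shes")):
        return word[:-2]                    # foxes -> fox, boxes -> box
    if word.endswith("s"):
        return word[:-1]                    # boys -> boy, toys -> toy
    return word

for w in ["boys", "toys", "toes", "foxes", "boxes", "ladies"]:
    print(w, naive_stem(w), rule_based_stem(w))
```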

  3. Perspective on Statistical MT • What is a good translation? • Faithful to source (faithfulness) • Fluent in target (fluency)
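The two criteria map onto the two factors of the standard noisy-channel formulation of SMT; this decomposition is standard background, not the slide's wording:

```latex
\hat{e} = \arg\max_{e} P(e \mid f)
        = \arg\max_{e} \underbrace{P(f \mid e)}_{\text{faithfulness (translation model)}} \cdot \underbrace{P(e)}_{\text{fluency (language model)}}
```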

  4. Word-alignment example • English: Ram (1) has (2) an (3) apple (4) • Hindi: राम (1) के (2) पास (3) एक (4) सेब (5) है (6) • Gloss: Ram of near an apple is
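One convenient way to make such an alignment explicit is to store it as source-to-target index pairs. The pairs below are simply read off the example above (a sketch, not the lecture's notation); note that English "has" maps onto the split construction "के पास … है".

```python
# Word alignment for "Ram has an apple" <-> "राम के पास एक सेब है"
english = ["Ram", "has", "an", "apple"]
hindi = ["राम", "के", "पास", "एक", "सेब", "है"]

# Each English position maps to one or more Hindi positions (0-based).
alignment = [
    (0, [0]),        # Ram   -> राम
    (1, [1, 2, 5]),  # has   -> के पास ... है
    (2, [3]),        # an    -> एक
    (3, [4]),        # apple -> सेब
]

for e_idx, h_idxs in alignment:
    print(english[e_idx], "->", " ".join(hindi[i] for i in h_idxs))
```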

  5. Kinds of MT Systems (point of entry from the source to the target text)

  6. Why is MT difficult? Classical NLP problems • Ambiguity • Lexical: Went to the bank to withdraw money • Structural: Saw the boy with a telescope • Ellipsis: I wanted a book and John a pen • Co-reference • Anaphora: John said he likes music • Hypernymic: John’s house is a robust structure

  7. Why is MT difficult? Language Divergence • Lexico-Semantic Divergence • Structural Divergence

  8. Language Divergence (English-Hindi: Noun to Adjective) • The demands on sportsmen today can lead to burnout at an early age. (noun – the state of being extremely tired or ill, either physically or mentally, because you have worked too hard) • खिलाड़ियों से जो आज अपेक्षाएं हैं, वे उन्हें कम उम्र में अक्रियाशील कर सकती हैं • Sportsmen-from, which today demands exist, that (correlative) them early age in inactive do can (aspectual) V-AUX.

  9. Language Divergence (English-Hindi: Noun to Verb) • Every concert they gave us was a sell-out. (an event for which all the tickets have been sold) • उनके हर संगीत-कार्यक्रम के सभी टिकट बिक गए थे। • Their every concert-of all ticket sell-past-passive-plural (were sold out).

  10. Language Divergence (English-Hindi: Adjective to Adverb) • The children were watching in wide-eyed amazement. (with eyes fully open because of fear, great surprise, etc.) • बच्चे आश्चर्य से आँखें फाड़े देख रहे थे। • Children amazement-with eyes opening widely seeing were.

  11. Language Divergence (English-Hindi: Adjective to Verb) • He was in a bad mood at breakfast and wasn't very communicative. (able and willing to talk and give information to other people) • नाश्ते के समय वह खराब मूड में था और ज्यादा बात-चीत नहीं कर रहा था। • Breakfast-of time he bad mood-in was and much conversation not do-past-progressive-sing (was doing).

  12. Language Divergence (English-Hindi: Preposition to Adverb) • It gets cooler toward evening. (near a point in time) • शाम होते-होते ठंडक बढ़ जाती है। • Evening happening-happening (reduplication; a typical Indian-language phenomenon) cold increase-goes (verb compound; polar vector).

  13. Language Divergence (English-Hindi: idiomatic usage) • Given her interest in children, teaching seems the right job for her. (when you consider sth) • बच्चों के प्रति (में) उसकी दिलचस्पी देखते हुए, अध्यापन उसके लिए उचित लगता है। • Children-towards her interest having seen, teaching for her appropriate seems.

  14. Language Divergence is ubiquitous (Marathi-Hindi-English: case marking and postpositions transfer: works!) • Not only for languages from distant families, but also within close cousins • प्रथम ताख्यात • वर्तमान (simple present) • तो जातो. • वह जाता है। • He goes. • स्थिर सत्य (universal truth) • पृथ्वी सूर्याभोवती फिरते. • पृथ्वी सूर्य के चारों ओर घूमती है। • The earth revolves round the sun.

  15. Language Divergence(Marathi-Hindi-English: case marking and postpositions: works again!) • ऐतिहासिक सत्य (historical truth) • कृष्ण अर्जुनास सांगतो... • कृष्ण अर्जुन से कहते हैं... • Krushna says to Arjuna… • अवतरण (quoting) • दामले म्हणतात, ... • दामले कहते हैं, ... • Damle says,...

  16. Language Divergence(Marathi-Hindi-English: case marking and postpositions: does not work!) • संनिहित भूत (immediate past) • कधी आलास? हा येतो इतकाच ! • कब आये? बस अभी आया । • When did you come? Just now (I came). • निःसंशय भविष्य (certainty in future) • आता तो मार खातो खास ! • अब वह मार खायगा ही ! • He is in for a thrashing. • आश्वासन (assurance) • मी तुम्हाला उद्या भेटतो. • मैं आप से कल मिलता हूँ। • I will see you tomorrow.

  17. Language Divergence Theory: Lexico-Semantic Divergences (ref: Dave, Parikh, Bhattacharyya, Journal of MT, 2002) • Conflational divergence • E: stab; H: churaa se maaranaa (knife-with hit) • Swedish: utrymningsplan; E: escape plan • Structural divergence • E: SVO; H: SOV • Categorial divergence • Change in POS category (many examples discussed above) • Head swapping divergence • E: Prime Minister of India; H: bhaarat ke pradhaan mantrii (India-of Prime Minister) • Lexical divergence • E: advise; H: paraamarsh denaa (advice give): Noun Incorporation, a very common Indian-language phenomenon

  18. Language Divergence Theory: Syntactic Divergences • Constituent Order divergence • E: Singh, the PM of India, will address the nation today; H: bhaarat ke pradhaan mantrii, Singh, … (India-of PM, Singh…) • Adjunction Divergence • E: She will visit here in the summer; H: vah yahaa garmii meM aayegii (she here summer-in will come) • Preposition-Stranding divergence • E: Who do you want to go with?; H: kisake saath aap jaanaa chaahate ho? (who with…) • Null Subject Divergence • E: I will go; H: jaauMgaa (subject dropped) • Pleonastic Divergence • E: It is raining; H: baarish ho rahii hai (rain happening is: no translation of “it”)

  19. Entropy considerations (ongoing work of Chirag and Venkatesh)

  20. Language Typology

  21. Language Typology

  22. Parallel Corpora

  23. Phrase Table Entries • Hindi-English Phrase Table Entries • प्रस्तुत ||| a ||| 0.1 • प्रस्तुत ||| afford ||| 0.1 • प्रस्तुत ||| offer ||| 0.5 • प्रस्तुत ||| offers ||| 0.3 Contribution to entropy = 0.507 • Hindi-Marathi Phrase Table Entries • प्रस्तुत ||| अधिक असे देऊ ||| 0.05 • प्रस्तुत ||| उपलब्ध ||| 0.2 • प्रस्तुत ||| काहींचे ||| 0.05 • प्रस्तुत ||| देऊ ||| 0.6 • प्रस्तुत ||| सादर ||| 0.1 Contribution to entropy = 0.503

  24. Entropy Evaluation • The phrase table gives a probability distribution over the possible translations for each source phrase. • We use the probability of the source phrase itself to get a distribution over the entire phrase table. • Entropy is evaluated as per the standard formula H = - Σ p(x) log p(x) • Hindi-Marathi Phrase Table Entropy: 9.671 • Hindi-English Phrase Table Entropy: 9.770
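A minimal sketch of the per-source-phrase part of this computation, using the Moses-style entries from the previous slide. Note that the slide's table-level figures additionally weight each source phrase by its own probability, which is not reproduced here; the input format and renormalisation are assumptions.

```python
import math
from collections import defaultdict

# Phrase-table entries as (source, target, P(target|source)) triples,
# e.g. parsed from Moses-style "src ||| tgt ||| prob" lines.
entries = [
    ("प्रस्तुत", "a", 0.1),
    ("प्रस्तुत", "afford", 0.1),
    ("प्रस्तुत", "offer", 0.5),
    ("प्रस्तुत", "offers", 0.3),
]

# Group translation probabilities by source phrase.
by_source = defaultdict(list)
for src, tgt, p in entries:
    by_source[src].append(p)

def entropy(probs):
    """H = -sum p log2 p over a renormalised distribution."""
    z = sum(probs)
    return -sum((p / z) * math.log2(p / z) for p in probs if p > 0)

# Conditional entropy of the translation choices for each source phrase.
for src, probs in by_source.items():
    print(src, round(entropy(probs), 3))
```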

  25. Handling Divergence through Indicative Translation (Microsoft Techvista Award, Ananthakrishnan 2007)

  26. Indicative Translation – what and why? • Native-speaker-acceptable translation is not possible • especially considering English-Hindi (Indian languages) divergence • Compromises • human-aided translation (post-editing) • narrow domain (weather reports) • rough translation → Indicative MT Goal: understandable rather than perfect output Purpose: assimilation rather than dissemination (translation on the web)

  27. Divergence between English and Hindi • Divergence: differences in lexical and syntactic choices that languages make in expressing ideas • MaTra: • Structural transfer • SVO to SOV • post-modifiers to pre-modifiers • Lexical transfer: • WSD + lexicon lookup • inflections • case-markers.

  28. Divergence between Natural and Indicative Hindi: some examples E: We eat the rotten canteen food every night. H: हम हर रात कैन्टीन का सड़ा हुआ खाना खाते हैं I: हम हर रात सड़ा हुआ कैन्टीन खाना खाते हैं E: The batsman who had been scoring heavily against them has to be removed early. H: जो बल्लेबाज़ उनके विरुद्ध ज़ोरदार स्कोर कर रहा था उसे जल्दी निकालना होगा I: बल्लेबाज़, जो उनके विरुद्ध ज़ोरदार स्कोर कर रहा था, जल्दी निकालना होगा

  29. Categorial divergence E: I am feeling hungry H: मुझे भूख लग रही है I: मैं भूखा महसूस कर रहा हूँ • n-gram matches: unigrams: 0/6; bigrams: 0/5; trigrams: 0/4; 4-grams: 0/3
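The n-gram match counts on this and the following slides can be reproduced with a simple clipped-count routine, taking the indicative output as the candidate and the natural Hindi as the reference. A sketch (whitespace tokenisation is an assumption):

```python
from collections import Counter

def ngram_matches(candidate, reference, n):
    """Clipped n-gram matches of candidate against reference (BLEU-style)."""
    cand, ref = candidate.split(), reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    matched = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    return matched, max(len(cand) - n + 1, 0)

indicative = "मैं भूखा महसूस कर रहा हूँ"   # candidate (system output)
natural = "मुझे भूख लग रही है"            # reference (natural Hindi)

for n in range(1, 5):
    m, total = ngram_matches(indicative, natural, n)
    print(f"{n}-grams: {m}/{total}")       # 0/6, 0/5, 0/4, 0/3
```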

  30. Relation between words in noun-noun compounds E: The ten best Aamir Khan performances H: आमिर ख़ान की दस सर्वोत्तम पर्फ़ार्मन्सस I: दस सर्वोत्तम आमिर ख़ान पर्फ़ार्मन्सस • n-gram matches: unigrams: 5/5; bigrams: 2/4; trigrams: 0/3; 4-grams: 0/2

  31. Lexical divergence E: Food, clothing and shelter are a man's basic needs. H: रोटी, कपड़ा और मकान एक मनुष्य की बुनियादी ज़रूरतें हैं I: खाना, कपड़ा, और आश्रय एक मनुष्य की बुनियादी ज़रूरतें हैं • n-gram matches: unigrams: 8/10; bigrams: 6/9; trigrams: 4/8; 4-grams: 3/7

  32. Pleonastic Divergence E: It is raining H: बारिश हो रही है I: यह बारिश हो रही है • n-gram matches: unigrams: 4/5; bigrams: 3/4; trigrams: 2/3; 4-grams: 1/2 E: There was a great king H: एक महान राजा था I: वहाँ एक महान राजा था

  33. Stylistic differences E: The Lok Sabha has 545 members. H: लोक सभा में ५४५ सदस्य हैं I: लोक सभा के पास ५४५ सदस्य हैं • n-gram matches: unigrams: 5/7; bigrams: 3/6; trigrams: 1/5; 4-grams: 0/4 Other differences: word order, sentence length

  34. Transliteration and WSD errors E: I purchased a bat. H: मैंने एक बल्ला खरीदा I: मैंने एक बैट खरीदा / मैंने एक चमगादड़ खरीदा • n-gram matches: unigrams: 3/4; bigrams: 1/3; trigrams: 0/2; 4-grams: 0/1

  35. Advantages of a hybrid Rule-based + SMT system • What SMT brings to the table • If data is available, then no need for linguistic resources • Quick adaptation to • new domains (tourism, health) • new language pairs (English-Gujarati/Marathi) • See improvements by adding data • What rule-based systems bring to the table • Capture a small set of systematic differences well • SVO → SOV (do we need to learn this?) • Better handle on correcting specific cases

  36. Preprocessing rules + SMT for English-Indian language MT • Lack of linguistic resources for Indian languages • Lots of resources available for English • Morphology is rich for Indian languages • Wider systematic syntactic differences between English and Indian languages
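A toy illustration of the kind of source-side preprocessing rule the last two slides refer to: reordering an English SVO clause into Indian-language SOV order before SMT training and decoding. The chunking, roles, and example sentence are illustrative assumptions, not the actual system.

```python
# Toy SVO -> SOV reordering as a source-side preprocessing step for SMT.
# Assumes the sentence is already chunked into (chunk, role) pairs;
# a real system would obtain these from a parser.
def reorder_svo_to_sov(chunks):
    """Move the verb chunk to the end, after the object."""
    subj = [c for c, role in chunks if role == "S"]
    obj = [c for c, role in chunks if role == "O"]
    verb = [c for c, role in chunks if role == "V"]
    rest = [c for c, role in chunks if role not in ("S", "O", "V")]
    return subj + obj + rest + verb

chunks = [("Ram", "S"), ("ate", "V"), ("an apple", "O")]
print(" ".join(reorder_svo_to_sov(chunks)))  # -> "Ram an apple ate"
```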

  37. Placed within the Vauquois Triangle

  38. Previous work on factored MT

  39. Previous work • Niessen and Ney {ney:04} show that the use of morpho-syntactic information drastically reduces the need for bilingual training data • Popovic and Ney {ney:06} report the use of morphological and syntactic restructuring information for Spanish-English and Serbian-English translation

  40. Previous work (contd) • Koehn and Hoang {koehn:07} propose factored translation models that combine feature functions to handle syntactic, morphological, and other linguistic information in a log-linear model • Experiments in translating from English to German, Spanish, and Czech, including the use of morphological factors

  42. Previous work (contd) • Avramidis and Koehn {koehn:08} report work on translating from poor to rich morphology, namely, English to Greek and Czech translation • Factored models with case and verb conjugation related factors determined by heuristics on parse trees • Used only on the source side, and not on the target side

  43. Previous work (contd) • Melamed {melamed:04} proposes methods based on tree-to-tree mappings • Imamura et al. {imamura:05} present a similar method that achieves significant improvements over a phrase-based baseline model for Japanese-English translation

  44. Previous work (contd) • Target language does not have parsing/clause-detection tools • Niessen and Ney {ney:04}: reorder the source language data prior to the SMT training and decoding cycles for German-English SMT • Popovic and Ney {ney:06}: simple local transformation rules for Spanish-English and Serbian-English translation • Collins et al. {collins:05}: German clause restructuring to improve German-English SMT • Wang et al. {wang:07}: similar work for Chinese-English SMT • Ananthakrishnan and Bhattacharyya {anand:08}: syntactic reordering and morphological suffix separation for English-Hindi SMT
