NLP and IR: Coming Closer or Moving Apart. Pushpak Bhattacharyya, Computer Science and Engineering Department, IIT Bombay, www.cse.iitb.ac.in/~pb. Acknowledgement: Manoj Chinnakotla, Arjun Atreya, Mitesh Khapra and many others. NLP and IR Perspective.
Classical Information Retrieval (Simplified)
Query → retrieval model (a.k.a. ranking algorithm) → document representation → relevant documents
40+ years of work (late 1960s to 2010) in designing better models:
• Vector space models
• Binary independence models
• Network models
• Logistic regression models
• Bayesian inference models
• Hyperlink retrieval models
Simple analysis (parsing, tokenization, applying synonyms, …) on queries and documents; complex and intricate ranking algorithms do all the heavy lifting.
(Courtesy: Dr. Sriram Raghavan, IBM India Research Lab)
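The listed models share one core idea: represent queries and documents in a common space and score by similarity. As an illustrative sketch (not from the talk), a minimal vector space model with TF-IDF weighting and cosine ranking could look like this in Python; the toy documents and query are invented for the example:

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build TF-IDF vectors for a list of tokenized documents."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency of each term
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs], idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy collection, invented for illustration.
docs = [["opium", "poppy", "afghanistan"],
        ["poppy", "flower", "garden"],
        ["stock", "market", "crash"]]
vecs, idf = tf_idf_vectors(docs)
query = ["opium", "poppy"]
qvec = {t: Counter(query)[t] * idf.get(t, 0.0) for t in set(query)}
ranking = sorted(range(len(docs)), key=lambda i: cosine(qvec, vecs[i]), reverse=True)
```

The document sharing both query terms ranks first, the one sharing only "poppy" second, and the unrelated document last.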
The elusive user satisfaction
• Ranking
• Correctness of query processing
• Coverage
• Indexing
• Crawling
• NER
• Stemming
• MWE
Example: Semantically precise search for relations/events. Query: afghans destroying opium poppies
How NLP has used IR
• The web looked upon as a huge repository of evidence for language phenomena
• PMI measure used for co-occurrence and collocation, leading to interesting NLP problems like malapropism detection (Bolshakov and Gelbukh, 2003) and evaluation of synset quality (outlier detection)
• PageRank used for WSD
• Recent important problem: getting dictionaries from comparable corpora, which abound on the web
• (Bo Li & Eric Gaussier: Improving corpus comparability for bilingual lexicon extraction from comparable corpora. COLING 2010)
How IR has used NLP
• Query disambiguation
• WSD: very necessary, but does it make a difference? Different technique than in NLP; very small context
• Morphology: better than a statistical stemmer? (McNamee et al., SIGIR 2009)
• Capturing term relationships
• Random walks (Kevyn Collins-Thompson and Jamie Callan, CIKM 2005)
• Latent Semantic Indexing (Deerwester et al., 1990)
• Search quality management: understanding queries (Fagin et al., PODS 2010)
Road map
1. A perspective on the relationship between NLP and IR: how NLP has used IR, how IR has used NLP
2. NLP using IR: into the heartland of NLP, malapropism detection
3. IR using NLP: MultiPRF, a way of disambiguation in IR leveraging multilinguality
4. Conclusions and future directions
Into the heartland of NLP: Malapropism detection (Bolshakov and Gelbukh, 2003)
• Unintended replacement of one content word by another existing content word similar in sound but semantically incompatible
• Immortalized by Mrs. Malaprop in Sheridan's The Rivals
• "Why, murder's the matter! Slaughter's the matter! Killing's the matter! But he can tell you the perpendiculars." (particulars)
• "He is the very pineapple of politeness." (pinnacle)
Different from…
• Spelling error
• They travel around the workd ("workd" for "world": k and l are adjacent on the keyboard)
Needs detection and correction: solved problem
Different from…
• Eggcorn (idiosyncratic but plausible substitution)
• ex-patriot instead of expatriate
• on the spurt of the moment instead of on the spur of the moment
Needs detection and correction, but not critical: needs to be solved
Different from…
• Spoonerism (error in speech, or deliberate play on words, in which corresponding consonants, vowels, or morphemes are switched)
• "The Lord is a shoving leopard." (a loving shepherd)
• "A blushing crow." (crushing blow)
• "You have hissed all my mystery lectures." (missed all my history lectures)
Needs detection and correction: needs to be solved
Different from… • Pun • Is life worth living? That depends a lot on the liver (both meanings plausible) • We're not getting anywhere in geometry class. It feels like we're going in circles. Should NOT be corrected
Motivation for Malapropism Detection
• Interactive (manual) editing: detect and correct errors
• Spell checking and correction is practically a solved problem
• Grammar checking and correction still needs vast improvement
• Semantic incoherence is very difficult to detect and correct
Solution proposed using Google search and mutual information
• A pair (V, W) is a collocation if it satisfies the inequality N(V,W) > N(V) · N(W) / Nmax, i.e., the pair co-occurs more often than chance predicts
• N(V,W) is the number of web pages where V and W co-occur
• N(V) and N(W) are the page counts of V and W evaluated separately
• Nmax is the total number of web pages managed by Google for the given language
• As a lower-bound approximation to Nmax, the count N(MFW) of pages containing the most frequent word MFW can be taken
• For English, MFW is 'the', and N('the') is estimated at 2.5 billion pages
• This inequality rejects a potential collocation if its components co-occur in a statistically insignificant quantity
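A minimal sketch of this web-count collocation test, taking the inequality in its equivalent PMI form log[N(V,W) · Nmax / (N(V) · N(W))] > 0. The page counts are stubbed values standing in for live search-engine queries; they and the zero threshold are illustrative assumptions, not the paper's exact numbers:

```python
import math

# Stubbed web page counts (a real system would query a search engine).
# The values are invented for illustration.
counts = {
    ("travel", "world"): 55400,   # plausible collocation
    ("travel", "word"): 20,       # malapropos pairing
    "travel": 2_000_000,
    "world": 50_000_000,
    "word": 40_000_000,
}
N_MAX = 2_500_000_000  # approximated by N('the'), per the slide

def pmi(v, w):
    """Pointwise mutual information estimated from web page counts."""
    nvw = counts.get((v, w), 0)
    if nvw == 0:
        return float("-inf")
    return math.log((nvw * N_MAX) / (counts[v] * counts[w]))

def is_collocation(v, w, threshold=0.0):
    # Accept (v, w) only if it co-occurs more than chance predicts.
    return pmi(v, w) > threshold
```

With these counts, ("travel", "world") passes the test while ("travel", "word") is rejected, mirroring the detection of row 1 in the experimental table.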
Experimental Results (web page counts)

#      Possible collocation        Correct version   Malapropos version
1      travel around the word      55,400            20
2      swim to spoon               23                0
3      take for granite            340,000           15
4      bowels are pronounced       767               0
5      loose vowels                2,750             1,320
6(a)   wear turbines               3,640             30
6(b)   turbines on the heads       25                0
7      ingenuous machine           805               6
8      affordable germs            1,840             9
9      dielectric materialism      1,080             4
10(a)  equal excess                457,000           990
10(b)  excess to school            19,100            4
11     ironic columns              5,560             28
12     activated by irritation     22                10
13     histerical center           90,000            7
14     scientific hypotenuse       7,050             0
Cross Lingual Search: Motivation
• English is still the most dominant language on the web, contributing 72% of the content
• The number of non-English users is steadily rising all over the world
• English penetration in India is estimated at around 3-4%, mostly the urban educated class
• Need to enable access to this English content through local languages
CLIR example (Hindi query, English collection)
• Hindi query: तिरूपति यात्रा ("Tirupati travel")
• CLIR engine, using language resources, searches the target-language index in English, built from crawled and indexed web pages
• Target information retrieved in English; ranked list of results returned
• Result snippets rendered in Hindi, e.g.: तिरूपति आने के लिए रेल साधन: तिरूपति पुण्य नगर पहुँचने के लिए बहुत रेल उपलब्ध हैं। अगर मुंबई से यात्रा कर रहे है तो मुंबई-चेन्नई एक्सप्रेस गाड़ी से प्रवास कर सकते है। ("Rail options for reaching Tirupati: many trains are available to reach the holy city of Tirupati. If travelling from Mumbai, you can take the Mumbai-Chennai Express.")
CLIA system architecture
• Input query processing: language analysis (stemming and stopword removal), NER identification, MWE identification
• Query translation: translation disambiguation and machine transliteration using a multilingual dictionary, yielding a translated and disambiguated query
• Crawling and indexing of the Web (WWW): focused crawler, language identifier, font transcoder, NE and MWE identification, CMLifier, producing the MLIR index
• Search and ranking over the MLIR index; template-based information extraction
• Output presentation: snippet generation and translation, document summarization; results with summary and snippet shown in the CLIA GUI
IR using Language Typology and NLP: Multilingual Pseudo Relevance Feedback (Manoj Chinnakotla, Karthik Raman and Pushpak Bhattacharyya, Multilingual Relevance Feedback: One Language Can Help Another, Conference of the Association for Computational Linguistics (ACL 2010), Uppsala, Sweden, July 2010.)
User Information Need
• Expressed as a short query (average length 2.5 words)
• Needs query expansion
• Expansion based on lexical resources did not deliver (Voorhees, 1994)
• Paradigmatic association (synonyms, antonyms, hyponyms and hypernyms)
• Introduces severe topic drift through unrelated senses of expansion terms
• Also through irrelevant senses of query terms
Illustration
Query word: "case" in "Madrid bomb blast case"
• Intended sense: {case, suit, lawsuit}
• Drifted topic due to the inapplicable sense {case, container}!
• Drifted topic due to the expanded term "suit" via its unrelated sense {suit, apparel}!
Query Expansion: Current Dominant Practice
• Syntagmatic expansion through pseudo-relevance feedback
• We show
• Multilingual PRF helps
• A familially related language helps still more
• A result of insight from linguistics and NLP
• Disambiguation by leveraging multilinguality
Language Modeling Approach to IR
• Offers a principled approach to IR
• Each document is modeled as a probability distribution: the document language model P(w|D)
• The user information need is modeled as a probability distribution: the query model P(w|ΘR)
• Ranking function: KL divergence
• Problem of retrieval ↔ problem of estimating P(w|ΘR) and P(w|D)
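Since KL(ΘR || D) = Σw P(w|ΘR) log[P(w|ΘR)/P(w|D)] and the query-model entropy is constant per query, ranking by negative KL divergence is rank-equivalent to ranking by the cross-entropy term Σw P(w|ΘR) log P(w|D). A toy Python sketch; the epsilon floor is a crude stand-in for proper smoothing (e.g., Dirichlet), and the models are invented for illustration:

```python
import math

def kl_score(query_model, doc_model, epsilon=1e-9):
    """Rank-equivalent to -KL(theta_R || theta_D): the cross-entropy
    sum over w of P(w|theta_R) * log P(w|theta_D). Higher is better.
    epsilon floors unseen terms in place of real smoothing."""
    return sum(p * math.log(doc_model.get(w, epsilon))
               for w, p in query_model.items())

# Invented toy models for illustration.
query_model = {"asthma": 0.6, "bronchial": 0.4}
doc_a = {"asthma": 0.3, "bronchial": 0.2, "treatment": 0.5}  # on topic
doc_b = {"stock": 0.5, "market": 0.5}                        # off topic
score_a = kl_score(query_model, doc_a)
score_b = kl_score(query_model, doc_b)
```

The on-topic document scores higher because its model assigns real probability mass to the query terms rather than the epsilon floor.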
The Challenge: Estimating the Query Model ΘR
• Average query length: 2.5 words
• Relevance feedback to the rescue: the user marks some documents from the initial ranked list as "relevant"
• Such judgments are usually difficult to obtain
Pseudo-Relevance Feedback (PRF)
1. Query Q is run through the IR engine over the document collection; initial results are ranked (d1: 2.4, d2: 2.1, d3: 1.8, d4: 0.7, …, dm: 0.01)
2. Assume the top k documents (d1 … dk) are relevant
3. Learn a feedback model from these documents and update the query relevance model
4. Re-rank the corpus with the updated query model to get the final results (d2: 2.3, d1: 2.2, d3: 1.8, d5: 0.6, …, dm: 0.01)
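The feedback loop above can be sketched as follows. The maximum-likelihood feedback estimator, the term cutoff, and the interpolation weight alpha = 0.5 are simplifying assumptions for illustration, not the exact model used in the talk:

```python
from collections import Counter

def estimate_feedback_model(top_docs, num_terms=5):
    """Maximum-likelihood unigram model over the top-k pseudo-relevant
    documents, truncated to the most frequent terms. A simplification of
    model-based feedback estimators."""
    tf = Counter()
    for doc in top_docs:
        tf.update(doc)
    top = dict(tf.most_common(num_terms))
    total = sum(top.values())
    return {w: c / total for w, c in top.items()}

def interpolate(query_model, feedback_model, alpha=0.5):
    """Updated query model: alpha * original + (1 - alpha) * feedback."""
    vocab = set(query_model) | set(feedback_model)
    return {w: alpha * query_model.get(w, 0.0)
               + (1 - alpha) * feedback_model.get(w, 0.0)
            for w in vocab}

# Toy top-k documents, echoing the "Accession to European Union" example.
top_docs = [["europe", "union", "membership", "member"],
            ["europe", "union", "country", "member"]]
query_model = {"europe": 0.5, "union": 0.5}
fb = estimate_feedback_model(top_docs)
updated = interpolate(query_model, fb)
```

The updated model now gives non-zero weight to terms like "member" that never appeared in the query, which is exactly what the re-ranking step exploits.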
Limitations of PRF: Lexical and Semantic Non-Inclusion
• Query: "Accession to European Union"; stemmed query: "access europe union"
• Final expanded query after initial retrieval: europe, union, access, nation, russia, presid, getti, year, state
• Relevant documents with terms like "Membership", "Member", "Country" are not ranked high enough
• Previous attempts
• Voorhees et al. used WordNet: negative results
• Random-walk models on WordNet, morphological variants, co-occurrence terms (Zhai et al.; Collins-Thompson and Callan)
Limitations of PRF: Lack of Robustness
• Query: "Olive Oil Production in Mediterranean"; stemmed query: "oliv oil mediterranean"
• Initial retrieved documents are about cooking; final expanded query: oil, oliv, mediterranean, produc, cook, salt, pepper, serv, cup
• Causes query drift
• Previous attempts
• Refining the top document set
• Refining the initial terms obtained through PRF
• Selective query expansion
• TREC Robust Track: improving robustness
Can both semantic non-inclusion and lack of robustness be solved?
• Harness "multilinguality": take the help of a collection in a different language, called the "assisting language"
• Expect increased robustness, since we search in two collections
• An attractive proposition for languages with poor monolingual performance due to
• Resource constraints, like inadequate coverage
• Morphological complexity
Related Work
• Gao et al. (2009) use English to improve Chinese-language ranking
• Demonstrated only on a subset of queries
• Experimentation on a small dataset
• Uses cross-document similarity
Multilingual PRF: System Flow
1. Query in L1: initial retrieval on the L1 index; get feedback model θL1 from the top k results
2. Translate the query into L2: initial retrieval on the L2 index; get feedback model θL2 from the top k results
3. Translate the feedback model θL2 back into L1, giving θL1Trans
4. Interpolate the models and rank using the final model
Feedback Model Translation (Back-Translation Step)
• The feedback model estimated in L2 (ΘFL2) is translated back into L1 (ΘTransL1)
• Uses a probabilistic bilingual dictionary from L2 to L1
• Learnt from parallel sentence-aligned corpora
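A sketch of this back-translation step: each L1 word inherits probability mass from every L2 feedback word it is aligned to, P(f|ΘTrans) = Σe P(f|e) · P(e|ΘFL2). The toy Spanish-to-German dictionary rows and their probabilities are hypothetical stand-ins for values learnt from parallel corpora:

```python
def translate_model(feedback_l2, bilingual_dict):
    """Translate an L2 feedback model into L1 via a probabilistic
    bilingual dictionary: P(f|theta_trans) = sum_e P(f|e) * P(e|theta_L2)."""
    translated = {}
    for e, p_e in feedback_l2.items():
        for f, p_fe in bilingual_dict.get(e, {}).items():
            translated[f] = translated.get(f, 0.0) + p_fe * p_e
    # Renormalize in case dictionary rows do not sum to one.
    z = sum(translated.values())
    return {f: p / z for f, p in translated.items()} if z else {}

# Hypothetical Spanish -> German dictionary rows (invented probabilities).
bilingual_dict = {
    "asma": {"asthma": 0.9, "atemnot": 0.1},
    "alergia": {"allergie": 1.0},
}
feedback_l2 = {"asma": 0.7, "alergia": 0.3}
theta_trans = translate_model(feedback_l2, bilingual_dict)
```

The final MultiPRF model then interpolates θL1 and this translated model θTrans (together with the original query model), so terms supported by both collections get reinforced.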
Semantically Related Terms through Feedback Model Translation
• English-French word alignments map "Nation" to {Nation, Country, State, UN, United}
• German-English word alignments map "Flugzeug" to {Aircraft, Plane, Aeroplane, Air, Flight}
• The feedback model translation step thus brings in semantically related terms: Nation, Country, State, UN, United; Aircraft, Plane, Aeroplane, Air, Flight
SIGIR Findings: English Lends a Helping Hand!
• English used as the assisting language: good monolingual performance, ease of processing
• MultiPRF consistently and significantly outperforms the monolingual PRF baseline
(Manoj Chinnakotla, Karthik Raman and Pushpak Bhattacharyya, Multilingual PRF: English Lends a Helping Hand, SIGIR 2010, Geneva, Switzerland, July 2010.)
Performance Study of Assisting Languages
• Do the results hold for languages other than English?
• What are the characteristics of a good assisting language?
• Can any language be used to improve the PRF performance of another language?
• Can this be extended to multiple assisting languages?
Experimental Setup
• European languages chosen: Europarl corpora, CLEF dataset
• Six languages from different language families: French, Spanish (Romance); German, English, Dutch (West Germanic); Finnish (Baltic-Finnic)
• More than 600 topics
• Google Translate used for query translation
Example: German assisted by Spanish
• Query in German: "Bronchial asthma"; translated into Spanish: "El asma bronquial"
• Monolingual feedback model θL1 (German, from top k results on the L1 index): chronisch (chronic), pet, athlet (athlete), erkrank (ill), gesund (healthy), tuberkulos (tuberculosis), patient, reis (rice), person
• Spanish feedback model θL2 (from top k results on the L2 index): asma, bronquial, contamin, ozon, cient, enfermed, alerg, alergi, air
• Translated and interpolated model θL1Multi: asthma, allergi, krankheit (disease), allerg (allergenic), chronisch, hauterkrank (skin disease), arzt (doctor), erkrank (ill)
• MAP improves from 0.062 to 0.636!
Example: French assisted by Dutch
• Query in French: "Ingénierie Génétique" (genetic engineering); translated into Dutch: "Genetische Manipulatie"
• Monolingual feedback model θL1 (French, from top k results on the L1 index): génet, ingénier, manipul, animal, pêcheur (fisherman), développ (developed), gen
• Dutch feedback model θL2 (from top k results on the L2 index): genetisch, manipulatie, exxon, dier (animal), visser (fisherman), gen
• Translated and interpolated model θL1Multi: développ (developed), évolu (evolved), product, produit (product), moléculair (molecular)
• MAP improves from 0.145 to 0.357!
More than one assisting language
• Tried parallel composition of two assisting languages
• Uniform interpolation weights used
• Exhaustively tried all 60 combinations
• Improvements reported over the best-performing PRF of L1 or L2
Conclusions
• NLP and IR seem to need each other
• NLP needs IR for large-scale evidence of language phenomena
• IR needs NLP for high-quality, sophisticated search
• NLP is very likely useful in extended-text query situations: question answering, topic-description-narration
• MultiPRF uses another language to improve robustness and performance; it can be looked upon as a way of disambiguation
• Robust, large-scale, language-independent WSD methods are needed in the face of resource constraints (not-so-large amounts of sense-marked corpora)
URLs • For resources www.cfilt.iitb.ac.in • For publications www.cse.iitb.ac.in/~pb
Thank you Questions and comments?
Tracing the Development of IR Models: from Term Presence-Absence to Query Meaning. Ranking: can NLP help IR?