Natural Language Processing for Information Retrieval


  1. Natural Language Processing for Information Retrieval • KVMV Kiran (04005031) • Neeraj Bisht (04005035) • L. Srikanth (04005029)

  2. OUTLINE • What is Information Retrieval (IR)? • Approaches to IR • Evaluation of IR methods • Statistical IR methods • Linguistic IR methods • Conclusion • Q&A

  3. What is Information Retrieval? • Retrieving information media with information content that is relevant to a user's information need. • Information media can be • Text, documents, images, videos • Used for • Searching • Organization

  5. Approaches to IR • Two types of retrieval • By metadata (subject, heading, keywords, etc.) • By content • Metadata can be • Manually assigned • Automatically assigned • Content-based IR is the more successful of the two.

  7. Evaluation of IR methods • Precision: the proportion of the retrieved set that is relevant • Precision = |relevant ∩ retrieved| / |retrieved| = P(relevant | retrieved) • Recall: the probability that a relevant document is retrieved by the query • Recall = |relevant ∩ retrieved| / |relevant| = P(retrieved | relevant)

  8. Example • A collection of 1000 documents: 400 relevant and 600 non-relevant to a query. • An IR procedure retrieves 100 documents: 75 relevant and 25 non-relevant. • Precision = 75/100 = 0.75 • Recall = 75/400 = 0.1875
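
The arithmetic in this example can be checked with a short Python sketch. The function and variable names are illustrative assumptions, not part of the original slides.

```python
def precision(relevant_retrieved: int, retrieved: int) -> float:
    """Fraction of the retrieved set that is relevant."""
    return relevant_retrieved / retrieved


def recall(relevant_retrieved: int, relevant: int) -> float:
    """Fraction of all relevant documents that were retrieved."""
    return relevant_retrieved / relevant


# Numbers from the example: 400 relevant documents in the collection,
# 75 relevant and 25 non-relevant documents retrieved.
relevant_total = 400
relevant_retrieved = 75
retrieved_total = 75 + 25

p = precision(relevant_retrieved, retrieved_total)
r = recall(relevant_retrieved, relevant_total)
print(f"precision = {p:.4f}, recall = {r:.4f}")  # precision = 0.7500, recall = 0.1875
```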

  9. Evaluating IR methods • A recall of one is trivial to achieve: retrieve every document in the collection • Precision tends to decrease as recall increases • A good IR procedure should score high on both.
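
A minimal sketch of why recall alone is not enough, assuming the 1000-document collection with 400 relevant documents from the earlier example. Document identifiers are arbitrary integers chosen for illustration.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Return (precision, recall) for a retrieved set against a relevant set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: documents 0..999, of which 0..399 are relevant.
collection = set(range(1000))
relevant = set(range(400))

# Retrieving everything gives perfect recall but poor precision.
p_all, r_all = precision_recall(collection, relevant)

# A selective run: 75 relevant and 25 non-relevant documents.
selective = set(range(75)) | set(range(400, 425))
p_sel, r_sel = precision_recall(selective, relevant)

print(p_all, r_all)
print(p_sel, r_sel)
```

Returning the whole collection reaches a recall of one while precision falls to the share of relevant documents in the collection, which is why the two measures have to be read together.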

  10. Content-based IR • Two approaches • Statistical • Linguistic

  12. Statistical IR • A simple approach based on the "bag of words" model • All words in a document are treated as its index terms • Each term is assigned a weight according to its importance, usually determined by how often it appears • Retrieval works by matching the document's words against the query's words

  13. Statistical IR (cont.) • Stages in statistical IR: • Document preprocessing • prepares the documents for parameterisation by eliminating any elements considered superfluous. • Parameterisation • performed once the relevant terms have been identified; it quantifies the document's characteristics (that is, its terms).

  14. Statistical IR (cont.) • An example: an XML document.

  15. Statistical IR (cont.) • Preprocessing phases • Remove elements that are not meant for indexing, such as tags and headers

  16. Statistical IR (cont.) • Text standardising • Convert to lowercase • Remove numerals and dates • Remove words found in stopword lists • a stopword list contains empty words (prepositions, determiners, pronouns, etc.) considered to have little semantic value • Identify n-grams • identify words that usually occur together (compound words, proper nouns, etc.) so they can be processed as a single conceptual unit • done by estimating the probability that two words that often appear together make up a single (compound) term, e.g. Artificial Intelligence, European Union
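
The standardising steps above can be sketched in Python as follows. The stopword list, the sample text, and the bigram score (count of the pair divided by count of its first word) are all assumptions made for illustration; the slides do not specify a particular estimator.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "is", "to", "for", "by"}


def standardise(text: str) -> list:
    """Lowercase the text, drop numerals, and remove stopwords."""
    # Keeping only alphabetic runs discards digits, dates, and punctuation.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def frequent_bigrams(tokens: list, min_count: int = 2) -> dict:
    """Score adjacent word pairs that occur at least min_count times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scored = {}
    for (w1, w2), n in bigrams.items():
        if n >= min_count:
            # Estimate P(w2 follows w1) as count(w1 w2) / count(w1).
            scored[(w1, w2)] = n / unigrams[w1]
    return scored


sample = (
    "The European Union funds Artificial Intelligence research. "
    "In 2007 the European Union expanded artificial intelligence programmes."
)
tokens = standardise(sample)
print(tokens)
print(frequent_bigrams(tokens))
```

Pairs with a high score are candidates for being indexed as one term; a production system would use a larger corpus and a more robust association measure.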

  18. Statistical IR (cont.) • Stemming • Remove suffixes (and sometimes prefixes) to find the root of each word.
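
A deliberately naive suffix stripper illustrates the idea. This is not the Porter algorithm or any other established stemmer; the suffix list and the minimum stem length are assumptions chosen for the example.

```python
# Suffixes are checked in order; a hypothetical minimal list.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["retrieving", "retrieved", "retrieves", "documents", "bus"]:
    print(w, "->", naive_stem(w))
```

The three forms of "retrieve" collapse to the same stem, so a query using one form can match documents using another. The stem need not be a real word.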

  19. Statistical IR (cont.) • Parameterising the document • Assign a weight to each of the relevant terms associated with a document (usually based on how often it appears)

  20. Statistical IR (cont.) • Estimate the importance of a term • TF*IDF (Term Frequency * Inverse Document Frequency) • Term frequency • a term that appears often in one document is likely to be representative of its content • Inverse document frequency • a term that appears frequently in all documents has no discriminatory value
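
A minimal sketch of TF*IDF weighting, assuming one common variant: raw term counts multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. The slides do not fix a formula, and the toy documents are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(docs: list) -> list:
    """Return one {term: weight} dict per tokenised document."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


docs = [
    ["stock", "market", "stock", "price"],
    ["stock", "soup", "recipe"],
    ["stock", "farm", "animal"],
]
w = tf_idf(docs)
print(w[0])
```

Here "stock" occurs in every document, so its weight is zero everywhere despite being the most frequent term in the first document, while "market" keeps a positive weight.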

  21. Drawbacks of Statistical IR • Linguistic variance: • Synonyms: different words convey the same meaning • May cause document silence • Relevant documents might not be retrieved, so recall decreases • Linguistic ambiguity: • Homographs: the same word has different meanings • Causes document noise • Too many documents might be retrieved, one group for each meaning of the word, so precision decreases
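
Both drawbacks show up even in a tiny exact-match retriever. The three documents and the matching rule below are invented for illustration.

```python
def matches(query: str, document: str) -> bool:
    """Exact bag-of-words match: true if any query word occurs in the document."""
    return bool(set(query.lower().split()) & set(document.lower().split()))


docs = {
    "d1": "cheap automobile insurance",
    "d2": "the river bank flooded",
    "d3": "the bank raised interest rates",
}

# Silence: "car" and "automobile" are synonyms, yet d1 is not retrieved.
silence = [name for name, text in docs.items() if matches("car", text)]

# Noise: "bank" retrieves both the river sense and the financial sense.
noise = [name for name, text in docs.items() if matches("bank", text)]

print(silence)  # []
print(noise)    # ['d2', 'd3']
```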

  22. Summary • Statistical IR treats documents as bags of words. • It does not take the linguistics of the language into consideration. • This motivates a more linguistics-based approach using complex NLP techniques.

  24. Linguistic IR • Documents are analysed at different linguistic levels by linguistic tools that add each level's own annotations to the text • The techniques involved are: • Morphological analysis • taggers assign each word to a grammatical category

  25. Linguistic IR (cont.) • Syntactic analysis • examines how words relate to one another and combine into larger grammatical units: phrases and sentences • usually restricted to identifying the most meaningful structures: nominal structures (noun phrases).

  26. Linguistic IR (cont.) • Word sense disambiguation • Index by concept rather than by word • e.g. bank as a financial institution vs. bank as the edge of a river. Disambiguation helps for queries like “Runs on a bank” • One of the most often used tools for word sense disambiguation is the lexicographic database WordNet • an annotated semantic lexicon in different languages made up of synonym groups called synsets.

  27. Linguistic IR (cont.) • Synsets provide short definitions along with the different semantic relationships between synonym groups • 23 synsets for stock, including • broth, stock • livestock, stock, farm animal • stock certificate, stock • stock, gillyflower • stock, carry, stockpile (verb) • standard, stock (adjective)

  28. Linguistic IR (cont.) • Use of synsets • For each query word, find its synsets • Query “punch recipes” • punch (3 synsets), recipe (1 synset) • Expand each synset into its “neighborhood” • Grow it along WordNet hyponym (is a kind of) relationships until any additional growth would include a different sense of any word in the core synset • To disambiguate words in a document • Look at all synset neighborhoods for the words in the document • Compare them with the way they overlap throughout the collection

  29. Linguistic IR (cont.) • Choose the neighborhoods where local activity is greater than expected global activity
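
The local-versus-global comparison can be sketched with toy data. The neighborhoods below are hand-written stand-ins rather than sets derived from WordNet, and "activity" is assumed to mean the fraction of tokens that fall inside a neighborhood; the method described in the slides may define both differently.

```python
# Hand-written sense neighborhoods (hypothetical; not taken from WordNet).
NEIGHBORHOODS = {
    "bank/finance": {"bank", "money", "loan", "deposit", "interest"},
    "bank/river": {"bank", "river", "shore", "water", "stream"},
}


def activity(tokens: list, neighborhood: set) -> float:
    """Fraction of tokens that fall inside the neighborhood."""
    return sum(t in neighborhood for t in tokens) / len(tokens)


def active_senses(doc_tokens: list, collection_tokens: list) -> list:
    """Senses whose local activity exceeds their activity in the collection."""
    return [
        sense
        for sense, hood in NEIGHBORHOODS.items()
        if activity(doc_tokens, hood) > activity(collection_tokens, hood)
    ]


doc = "the bank approved the loan and paid interest".split()
others = [
    "the river bank was under water".split(),
    "the stream eroded the bank near the shore".split(),
]
collection = doc + [t for d in others for t in d]

print(active_senses(doc, collection))  # ['bank/finance']
```

In this document the finance neighborhood is more active than it is across the collection, while the river neighborhood is less active, so "bank" is indexed under its financial sense.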

  30. Problems with Linguistic techniques in IR • Linguistic techniques must be essentially perfect to help • Queries are difficult • Non-linguistic techniques implicitly exploit linguistic knowledge

  31. Conclusion • Statistical IR methods have some drawbacks • Linguistic IR methods that try to solve those problems have been fairly unsuccessful • Effective IR depends upon properties of queries that make some NLP techniques redundant • Current NLP techniques are not of much help in strict document retrieval.

  32. Q&A

  33. References • Ellen M. Voorhees, Natural Language Processing and Information Retrieval • Mari Vallez and Rafael Pedraza-Jimenez, Natural Language Processing in Textual Information Retrieval and Related Topics (http://www.hipertext.net/english/pag1025.htm) • James Allan, NLP for IR (http://citeseer.ist.psu.edu/308641.html)

  34. References (contd.) • Douglas W. Oard, “A lecture on information retrieval” (http://www.glue.umd.edu/~oard/papers/CMSC723.ppt)