Natural Language Processing for Information Retrieval • KVMV Kiran (04005031) • Neeraj Bisht (04005035) • L. Srikanth (04005029)
OUTLINE • What is Information Retrieval(IR)? • Approaches to IR • Evaluation of IR methods • Statistical IR methods • Linguistic IR methods • Conclusion • Q&A
What is Information Retrieval? • Retrieving information media with information content that is relevant to a user's information need. • Information media can be • Text, documents, images, videos • Used for • Searching • Organization
Approaches to IR • Two types of retrieval • By metadata (subject, heading, keywords, etc.) • By content • Metadata can be • Manually assigned • Automatically assigned • Content-based IR is the more successful of the two.
Evaluation of IR methods • Precision: proportion of the retrieved set that is relevant • Precision = |relevant ∩ retrieved| / |retrieved| = P(relevant | retrieved) • Recall: proportion of the relevant documents that the query retrieves • Recall = |relevant ∩ retrieved| / |relevant| = P(retrieved | relevant)
Example • 1000 documents, 400 relevant and 600 non-relevant to a query. • An IR procedure retrieves 75 relevant and 25 non-relevant documents. • Precision = 75/100 = 0.75 • Recall = 75/400 = 0.1875
Evaluating IR methods • A recall of one is trivial to achieve: retrieve every document in the collection • Precision tends to decrease as recall increases • A good IR procedure should score high on both.
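The arithmetic in the example can be checked with a few lines of Python. This is only a sketch; the function name `precision_recall` and its parameter names are assumptions made for illustration.

```python
def precision_recall(relevant_retrieved, retrieved, relevant):
    """Return (precision, recall) computed from raw document counts."""
    precision = relevant_retrieved / retrieved if retrieved else 0.0
    recall = relevant_retrieved / relevant if relevant else 0.0
    return precision, recall


# Figures from the slide: 75 relevant + 25 non-relevant documents retrieved,
# 400 relevant documents in the whole collection.
p, r = precision_recall(relevant_retrieved=75, retrieved=75 + 25, relevant=400)
print(p, r)  # 0.75 0.1875
```

Retrieving all 1000 documents would give a recall of 400/400 = 1 but a precision of only 400/1000 = 0.4, which is why the two measures have to be read together.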
Content based IR • Two approaches • Statistical • Linguistic
Statistical IR • A simple approach based on the "bag of words" model • All words in a document are treated as its index terms • Each term is assigned a weight as a function of its importance, usually determined by how often it appears • Retrieval matches the document's words against those of the query
Statistical IR(cont..) • Stages in statistical IR: • Document preprocessing • Prepare the documents for parameterisation by eliminating any elements considered superfluous • Parameterisation • Once the relevant terms have been identified, quantify the document's characteristics (that is, its terms)
Statistical IR(cont..) • An example: an XML document.
Statistical IR(cont..) • Preprocessing phases • Remove elements that are not meant for indexing, such as tags and headers
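One way this first phase might look for an XML document is sketched below with Python's standard `xml.etree.ElementTree`. The sample document, the element names (`doc`, `header`, `body`) and the helper `indexable_text` are all hypothetical; they are not taken from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical input; the element names are assumptions for illustration.
raw = "<doc><header>ID 42</header><body>Information <b>Retrieval</b> with NLP</body></doc>"


def _collect(elem, skip, out):
    """Append the text of elem to out, ignoring any element whose tag is in skip."""
    if elem.tag in skip:
        return
    if elem.text:
        out.append(elem.text)
    for child in elem:
        _collect(child, skip, out)
        # Tail text belongs to the parent, so it is kept even if the child was skipped.
        if child.tail:
            out.append(child.tail)


def indexable_text(xml_string, skip=("header",)):
    """Return the text content of the document without tags or skipped elements."""
    pieces = []
    _collect(ET.fromstring(xml_string), skip, pieces)
    return " ".join("".join(pieces).split())  # normalise whitespace


text = indexable_text(raw)
print(text)  # Information Retrieval with NLP
```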
Statistical IR(cont..) • Text standardising • Convert to lowercase • Remove numerals and dates • Remove words that appear in stopword lists • A stopword list contains "empty" words (prepositions, determiners, pronouns, etc.) considered to have little semantic value • Identify n-grams • Identify words that usually occur together (compound words, proper nouns, etc.) so that they can be processed as a single conceptual unit • Done by estimating the probability that two words that often occur together make up a single term (compound), e.g. Artificial Intelligence, European Union
Statistical IR(cont..) • Stemming • Remove suffixes (and prefixes) to find the root of each word.
Statistical IR(cont..) • Parameterising the document • Assign a weight to each of the relevant terms associated with a document (usually by frequency of appearance)
Statistical IR(cont..) • Estimate the importance of a term • TF*IDF (Term Frequency * Inverse Document Frequency) • Term frequency • A term that appears often in a document is likely to be representative of its content • Inverse document frequency • A term that appears frequently in all documents has no discriminatory value
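The standardising steps can be sketched as follows. The stopword list is a tiny illustrative one, and the collocation score used here is pointwise mutual information (PMI) with a minimum-frequency cut-off; the slides only say that a probability is estimated, so PMI, the threshold and all function names are assumptions.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "of", "in", "a", "and", "is", "to", "on", "was"}


def standardise(text):
    """Lowercase, tokenise, then drop numerals and stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens
            if not any(c.isdigit() for c in t) and t not in STOPWORDS]


def collocations(token_lists, min_count=2):
    """Score adjacent word pairs by PMI (an assumed measure, see lead-in)."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in token_lists:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:  # rare pairs give unreliable estimates
            continue
        p_pair = count / n_bi
        p_independent = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
        scores[(w1, w2)] = math.log(p_pair / p_independent)
    return scores


docs = [
    "The European Union was founded in 1993.",
    "Members of the European Union trade freely.",
    "Trade in the union of states is free.",
]
tokenised = [standardise(d) for d in docs]
print(tokenised[0])  # ['european', 'union', 'founded']
scores = collocations(tokenised)
```

With this toy collection only the pair ("european", "union") occurs at least twice, and its score is positive, meaning the two words co-occur more often than independence would predict.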
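A deliberately naive suffix stripper illustrates the idea. The suffix list and the minimum stem length are made up for this sketch; practical systems use a proper algorithm such as the Porter stemmer.

```python
# Illustrative suffix list, ordered longest first; not a real stemmer's rule set.
SUFFIXES = ("ations", "ation", "ing", "ed", "es", "s")


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for w in ("retrieved", "retrieving", "retrieves", "documents"):
    print(w, "->", naive_stem(w))
# retrieved -> retriev
# retrieving -> retriev
# retrieves -> retriev
# documents -> document
```

The three forms of "retrieve" collapse to a single index term, so a query containing one form can match documents containing another.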
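A minimal sketch of TF*IDF weighting follows. The slides do not give a formula, so this assumes one common variant: term frequency normalised by document length, multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. The function name `tf_idf` and the toy documents are illustrative.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


docs = [
    ["retrieval", "query", "retrieval"],
    ["query", "language"],
    ["query", "parsing", "language"],
]
w = tf_idf(docs)
print(w[0]["query"])  # 0.0
```

"query" appears in every document, so its weight is zero everywhere, while "retrieval" is frequent in the first document and absent from the others, so it receives the highest weight there.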
Drawbacks of Statistical IR • Linguistic Variance : • Synonyms - Different words convey the same meaning • Might provoke document silence • Relevant documents might not be retrieved, recall decreased • Linguistic Ambiguity : • Homograph - Same word different meaning • Will provoke document noise • Might retrieve too many documents, relating to each meaning of the word, precision decreased
Summary • Statistical IR treats documents as a bag of words. • It does not take the linguistics of the language into consideration. • Hence the need for a more linguistics-based approach using complex NLP techniques.
Linguistic IR • The documents are analysed at different linguistic levels by linguistic tools, each of which adds its own level's annotations to the text • The techniques involved are: • Morphological analysis • Taggers assign each word to a grammatical category
Linguistic IR (cont..) • Syntax analysis • Determine how words are related and used together to form larger grammatical units: phrases and sentences • Usually restricted to identifying the most meaningful structures: noun phrases.
Linguistic IR (cont..) • Word Sense Disambiguation • Index by concept rather than by word • e.g. bank as a financial institution, bank as the edge of a river. Disambiguation helps for queries like "Runs on a bank" • One of the most often used tools for word sense disambiguation is the lexicographic database WordNet • An annotated semantic lexicon, available in different languages, made up of groups of synonyms called synsets.
Linguistic IR (cont..) • Synsets provide short definitions along with the different semantic relationships between synonym sets • 23 synsets for stock, including • broth, stock • livestock, stock, farm animal • stock certificate, stock • stock, gillyflower • stock, carry, stockpile (verb) • standard, stock (adjective)
Linguistic IR (cont..) • Use of synsets • For each query word, find its synsets • Query "punch recipes" • punch (3 synsets), recipe (1 synset) • Expand each synset into its "neighborhood" • Grow it along WordNet hyponym (is a kind of) relationships until any additional growth would include a different sense of any word in the core synset • To disambiguate words in a document • Look at all synset neighborhoods for the words in the document • Compare how they overlap in the document with how they overlap throughout the collection
Linguistic IR (cont..) • Choose the neighborhoods where local activity is greater than expected global activity
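The selection rule can be illustrated with a toy sketch. It does not use WordNet: the two neighborhoods below are hand-made stand-ins for synset neighborhoods, "activity" is assumed to be the fraction of tokens that fall inside a neighborhood, and the function names are hypothetical. It is a simplification of the idea, not the procedure from the cited work.

```python
# Hypothetical, hand-made neighborhoods standing in for WordNet synset neighborhoods.
NEIGHBORHOODS = {
    "bank/finance": {"bank", "money", "loan", "deposit", "account"},
    "bank/river": {"bank", "river", "shore", "water", "stream"},
}


def activity(tokens, hood):
    """Fraction of tokens that fall inside a neighborhood (assumed measure)."""
    return sum(t in hood for t in tokens) / len(tokens) if tokens else 0.0


def disambiguate(doc_tokens, collection_tokens, hoods=NEIGHBORHOODS):
    """Pick the sense whose local activity most exceeds its global activity."""
    best, best_gain = None, 0.0
    for sense, hood in hoods.items():
        gain = activity(doc_tokens, hood) - activity(collection_tokens, hood)
        if gain > best_gain:
            best, best_gain = sense, gain
    return best  # None if no neighborhood is more active locally than globally


river_doc = "the river bank flooded when water rose".split()
loan_doc = "the bank approved the loan and opened an account".split()
other_doc = "she made a deposit of money at the bank".split()
collection = river_doc + loan_doc + other_doc

print(disambiguate(river_doc, collection))  # bank/river
print(disambiguate(loan_doc, collection))   # bank/finance
```

In the first document the river neighborhood covers 3 of 7 tokens against 5 of 25 in the collection, so the river sense of "bank" is chosen even though the financial sense is more common overall.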
Problems with Linguistic techniques in IR • Linguistic techniques must be essentially perfect to help • Queries are difficult to analyse: they are typically short and give little linguistic context • Non-linguistic techniques implicitly exploit linguistic knowledge
Conclusion • Statistical IR methods have some drawbacks • Linguistic IR methods that try to solve those problems have been fairly unsuccessful • Effective IR depends upon properties of queries that make some NLP techniques redundant • Current NLP techniques are not of much help in strict document retrieval.
References • "Natural Language Processing and Information Retrieval" by Ellen M. Voorhees • "Natural Language Processing in Textual Information Retrieval and Related Topics" by Mari Vallez and Rafael Pedraza-Jimenez (http://www.hipertext.net/english/pag1025.htm) • "NLP for IR" by James Allan (http://citeseer.ist.psu.edu/308641.html)
References (Contd..) • "A lecture on information retrieval" by Douglas W. Oard (http://www.glue.umd.edu/~oard/papers/CMSC723.ppt)