Understand how the query is constructed, processed, and ranked in the context of information retrieval. Learn about the relevance of documents, document representation, types of queries, and models like Boolean and Vector Space. Explore term weights and associations for optimal retrieval.
Information Retrieval Process [Figure: process diagram with the components Information need, text input, Parse, Query, Collections, Pre-process, Index, and Rank] • How is the query constructed? • How is the text processed?
Example: Information Needs • Sometimes very specific • <title> Falkland petroleum exploration • <desc> Description: What information is available on petroleum exploration in the South Atlantic near the Falkland Islands? • <narr> Narrative: Any document discussing petroleum exploration in the South Atlantic near the Falkland Islands is considered relevant. Documents discussing petroleum exploration in continental South America are not relevant. • Sometimes very vague • I am going to Kyoto, Japan for a conference in two months. What should I know?
Relevance • In what ways can a document be relevant to a query? • Answer precise question precisely. • Partially answer question. • Suggest a source for more information. • Give background information. • Remind the user of other knowledge. • Others ...
Relevance • How relevant is the document • for this user for this information need. • Subjective, but • Measurable to some extent • How often do people agree a document is relevant to a query • How well does it answer the question? • Complete answer? Partial? • Background Information? • Hints for further exploration?
Document Representation • Information needs and documents are usually represented as sets/bags of terms. • Bag: allow multiple instances of the same element • Terms: words, phrases • To stem or not to stem • Annotation with location information: title, heading
Bag of Words Example (courtesy of Philip Resnik)
Document 1: The quick brown fox jumped over the lazy dog’s back.
Document 2: Now is the time for all good men to come to the aid of their party.
Stop Word List: for, is, of, ‘s, the, to
Indexed Term   Document 1   Document 2
aid            0            1
all            0            1
back           1            0
brown          1            0
come           0            1
dog            1            0
fox            1            0
good           0            1
jump           1            0
lazy           1            0
men            0            1
now            0            1
over           1            0
party          0            1
quick          1            0
their          0            1
time           0            1
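The bag-of-words example above can be reproduced with a short sketch. The stop list and the two documents come from the slide; `crude_stem` is a hypothetical stand-in for a real stemmer (it only strips a trailing "ed"), and the tokenizer is an assumption.

```python
import re
from collections import Counter

# Stop list from the slide; the possessive "'s" shows up as the token "s".
STOP_WORDS = {"for", "is", "of", "s", "the", "to"}

def crude_stem(token):
    # Hypothetical stand-in for a real stemmer: strips a trailing "ed" only.
    if token.endswith("ed") and len(token) > 4:
        return token[:-2]
    return token

def bag_of_words(text):
    # Lowercase, split on non-letters, drop stop words, stem, and count.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in STOP_WORDS)

doc1 = "The quick brown fox jumped over the lazy dog's back."
doc2 = "Now is the time for all good men to come to the aid of their party."
bag1 = bag_of_words(doc1)
bag2 = bag_of_words(doc2)
```

Each `Counter` maps an indexed term to its count, so the same structure serves as a set (presence only) or as a bag (with multiplicity).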
Types of Queries • Boolean Query • Does the document satisfy the Boolean expression? • “java” AND “compilers” AND (“unix” OR “linux”) • Vector Query • How similar is the document to the query? • [(java 3) (compiler 2) (unix 1) (linux 1)] • Probabilistic Query • What is the probability that the document is generated by the query?
Boolean Model of Retrieval • Pros • Easy to understand/clear semantics • AND means ‘all’, OR means ‘any’ • Usually computationally efficient • Cons • Difficult to rank results • Rigid: either get too much or too little • AND means ‘all’, OR means ‘any’ • When the information need is complex, it is hard to formulate it as a Boolean query.
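As a minimal sketch of Boolean retrieval, the example query “java” AND “compilers” AND (“unix” OR “linux”) can be evaluated against documents represented as term sets. The document ids and contents below are hypothetical.

```python
def matches(doc_terms):
    # "java" AND "compilers" AND ("unix" OR "linux")
    return ("java" in doc_terms
            and "compilers" in doc_terms
            and ("unix" in doc_terms or "linux" in doc_terms))

# Hypothetical collection: each document is a set of terms.
collection = {
    "d1": {"java", "compilers", "linux", "kernel"},
    "d2": {"java", "compilers", "windows"},
    "d3": {"java", "unix"},
}

# The result is an unordered set: a document either satisfies the query or not.
result = {doc_id for doc_id, terms in collection.items() if matches(terms)}
```

The result carries no scores, which illustrates why ranking is difficult in this model.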
Vector Space Model • A collection of n documents with t distinct terms can be represented by a (sparse) matrix. • A query can also be represented as a vector, like a document.
        T1     T2     …    Tt
D1      w11    w21    …    wt1
D2      w12    w22    …    wt2
:       :      :           :
Dn      w1n    w2n    …    wtn
Docs as Vectors [Figure: a doc about movie stars, a doc about astronomy, and a doc about mammal behavior drawn as vectors on the axes “Star” and “Diet”]
Geometric Interpretation • Example: D1 = 2T1 + 3T2 + 5T3, D2 = 3T1 + 7T2 + T3, Q = 0T1 + 0T2 + 2T3 [Figure: D1, D2, and Q drawn as vectors on the axes T1, T2, T3] • Is D1 or D2 more similar to Q? • How to measure the degree of similarity? Distance? Angle? Projection? • Assumption: Documents that are “close together” in space are similar in meaning.
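One way to answer the question is to measure the angle, via the cosine of the angle between the vectors. The sketch below applies it to D1, D2, and Q from the example; choosing cosine here is an assumption, one of the options the slide lists.

```python
import math

def cosine(u, v):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

D1 = (2, 3, 5)
D2 = (3, 7, 1)
Q = (0, 0, 2)

sim_d1 = cosine(D1, Q)  # 10 / (sqrt(38) * 2), about 0.81
sim_d2 = cosine(D2, Q)  # 2 / (sqrt(59) * 2), about 0.13
```

Under this measure D1 is considerably closer to Q than D2, because most of D1's weight lies on T3, the only term in the query.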
Term Weights • The weight wij reflects the importance of the term Ti in document Dj. • Intuitions: • A term that appears in many documents is not important: e.g., the, going, come, … • If a term is frequent in a document, it is probably important in that document.
Assigning Weights to Terms • Binary Weights • Raw term frequency • tf x idf • Recall the Zipf distribution • Want to weight terms highly if they are • frequent in relevant documents … BUT • infrequent in the collection as a whole • Pointwise Mutual Information
Binary Weights • Only the presence (1) or absence (0) of a term is included in the vector
Raw Term Weights • The frequency of occurrence for the term in each document is included in the vector
Inverse Document Frequency • IDF provides high values for rare words and low values for common words For a collection of 10000 documents
Term Weights: tf x idf • Term frequency (tf) • the frequency count of a term in a document • Inverse document frequency (idf) • The amount of information contained in the statement “Document X contains the term Ti”. • Assign a tf * idf weight to each term in each document
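A minimal sketch of tf x idf weighting follows. The slide's exact idf formula is not shown here, so the sketch assumes the common form idf = log10(N / df), where N is the number of documents and df is the number of documents containing the term. The toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    # docs: list of token lists. Returns one {term: weight} dict per document.
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        # Assumed idf form: log10(N / df).
        weights.append({t: count * math.log10(n / df[t]) for t, count in tf.items()})
    return weights

# Hypothetical toy collection.
docs = [
    ["star", "diet", "star"],
    ["star", "movie"],
    ["star", "astronomy", "astronomy"],
]
weights = tf_idf(docs)
```

With this form, a term that occurs in every document ("star" above) gets weight 0, and a term that occurs in a single document out of 10000 would get idf = 4.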
Term Weights: Pointwise Mutual Information • Pointwise Mutual Information measures the strength of association between two elements (a document and a term). • Observed frequency vs. expected frequency • MI weight is insensitive to stemming and the use of a stop word list [Pantel and Lin 02]
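The comparison of observed and expected frequency can be sketched as follows, assuming the usual definition pmi(d, t) = log( P(d, t) / (P(d) P(t)) ) with probabilities estimated from raw counts. The count table is hypothetical, and no smoothing or discounting is applied.

```python
import math
from collections import Counter

def pmi_weights(counts):
    # counts: one {term: frequency} dict per document.
    total = sum(sum(row.values()) for row in counts)
    doc_totals = [sum(row.values()) for row in counts]
    term_totals = Counter()
    for row in counts:
        term_totals.update(row)
    weights = []
    for d, row in enumerate(counts):
        w = {}
        for term, c in row.items():
            observed = c / total  # P(d, t)
            expected = (doc_totals[d] / total) * (term_totals[term] / total)  # P(d) * P(t)
            w[term] = math.log(observed / expected)
        weights.append(w)
    return weights

# Hypothetical counts for two documents.
counts = [{"star": 2, "diet": 2}, {"star": 2}]
pmi = pmi_weights(counts)
```

A positive weight means the term occurs in the document more often than independence would predict; a negative weight means less often.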
What Else can be Terms? • Letter n-grams • Phrases • Relations • Semantic categories
Similarity Measure • Define a similarity measure between a query and a document • Cosine • Dice • Return the documents that are the most similar to the query
Similarity Measures • Simple matching (coordination level match) • Dice’s Coefficient • Jaccard’s Coefficient • Cosine Coefficient • Overlap Coefficient
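The formulas for these measures are not reproduced here, so the sketch below assumes their standard set-based forms, in which X and Y are the term sets of the query and the document. The example sets are hypothetical.

```python
import math

def simple_matching(x, y):
    # Number of shared terms.
    return len(x & y)

def dice(x, y):
    return 2 * len(x & y) / (len(x) + len(y))

def jaccard(x, y):
    return len(x & y) / len(x | y)

def cosine_coefficient(x, y):
    return len(x & y) / math.sqrt(len(x) * len(y))

def overlap(x, y):
    return len(x & y) / min(len(x), len(y))

# Hypothetical query and document term sets (2 shared terms).
query = {"java", "compiler", "unix"}
doc = {"java", "compiler", "linux", "kernel"}
```

All but simple matching normalize the number of shared terms by the sizes of the two sets, so long documents are not favored merely for containing more terms.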
Implementation of VS Model • Does the query vector need to be compared with EVERY document vector? • Does the query vector need to be compared with the vectors of documents that contain any of query terms?
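A common answer to both questions is an inverted index: only documents that share at least one term with the query can have a nonzero dot product, so only their postings need to be visited. The sketch below is one possible implementation under that assumption; the function names and the toy vectors are hypothetical.

```python
from collections import defaultdict

def build_index(doc_vectors):
    # Map each term to the list of (doc_id, weight) pairs that contain it.
    index = defaultdict(list)
    for doc_id, vector in doc_vectors.items():
        for term, weight in vector.items():
            index[term].append((doc_id, weight))
    return index

def score(query, index):
    # Accumulate dot products, touching only postings of the query terms.
    scores = defaultdict(float)
    for term, q_weight in query.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return dict(scores)

# Hypothetical weighted document vectors.
doc_vectors = {
    "d1": {"java": 3.0, "compiler": 2.0},
    "d2": {"diet": 1.0},
    "d3": {"java": 1.0, "linux": 4.0},
}
index = build_index(doc_vectors)
scores = score({"java": 1.0, "linux": 1.0}, index)
```

Document d2 never appears in `scores` because it shares no term with the query. Dividing each score by the vector lengths would turn the dot product into a cosine.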
What to Evaluate? What can be measured that reflects users’ ability to use the system? (Cleverdon 66) • Coverage of Information • Form of Presentation • Effort required/Ease of Use • Time and Space Efficiency • Effectiveness • Recall: proportion of relevant material actually retrieved • Precision: proportion of retrieved material actually relevant
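The two effectiveness measures follow directly from the definitions of recall and precision above. In this minimal sketch the document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    # retrieved, relevant: sets of document ids.
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: 4 documents retrieved, 5 relevant, 2 in common.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d3", "d5", "d6", "d7"}
precision, recall = precision_recall(retrieved, relevant)
```

Here precision is 2/4 and recall is 2/5.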
Relevant vs. Retrieved [Figure: Venn diagram of the Retrieved and Relevant sets within All docs]
Precision vs. Recall [Figure: Venn diagram of the Retrieved and Relevant sets within All docs]
Why Precision and Recall? • Get as much good stuff while at the same time getting as little junk as possible.
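Retrieved vs. Relevant Documents • Very high precision, very low recall [Figure: retrieved set drawn against the Relevant set]
Retrieved vs. Relevant Documents • High recall, but low precision [Figure: retrieved set drawn against the Relevant set]
Retrieved vs. Relevant Documents • High precision, high recall (at last!) [Figure: retrieved set drawn against the Relevant set]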
Precision/Recall Curves • There is a tradeoff between Precision and Recall • So measure Precision at different levels of Recall • Note: this is an AVERAGE over MANY queries [Figure: precision plotted against recall]
Precision/Recall Curves • Difficult to determine which of these two hypothetical results is better [Figure: two precision/recall curves, precision plotted against recall]
Average Precision • IR systems typically output a ranked list of documents • For each relevant document, compute the precision up to that point • Average over all precision values computed this way.
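The procedure above can be sketched as follows. The sketch averages over the relevant documents that actually appear in the ranked list, as the slide describes; some definitions divide by the total number of relevant documents instead. The ranked list is hypothetical.

```python
def average_precision(ranked, relevant):
    # ranked: list of doc ids in rank order; relevant: set of relevant doc ids.
    hits = 0
    precisions = []
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precisions.append(hits / rank)  # precision up to this point
    return sum(precisions) / len(precisions) if precisions else 0.0

# Hypothetical ranked output; relevant documents sit at ranks 1, 3, and 5.
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d4"}
ap = average_precision(ranked, relevant)
```

For this list the precision values are 1/1, 2/3, and 3/5, so the average is about 0.76.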
Interpolated Average Precision • Precision may go up when going down the ranked list. • Intuitively, this should only go down. • Interpolated Average Precision • for each recall level in 0%, 10%, 20%, … • compute the highest precision after recall reached that point • take the average of the max precision scores
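A minimal sketch of the interpolated version, assuming the usual 11 recall levels (0%, 10%, …, 100%) and taking the interpolated precision at a level to be the highest precision observed at any rank whose recall is at or above that level. The ranked list is hypothetical.

```python
def interpolated_average_precision(ranked, relevant):
    # Record (recall, precision) after each rank position.
    points = []
    hits = 0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))
    levels = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    # Highest precision at any point whose recall has reached the level;
    # 0.0 if that recall level is never reached.
    interpolated = [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]
    return sum(interpolated) / len(levels)

# Hypothetical ranked output; relevant documents sit at ranks 1, 3, and 5.
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d4"}
iap = interpolated_average_precision(ranked, relevant)
```

Because each level takes a maximum over all later points, the interpolated precision never increases as recall grows, which matches the intuition stated above. The sketch assumes `relevant` is non-empty.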
F-Measure • Sometimes only one pair of precision and recall is available. • e.g., filtering task • F-Measure • >1: precision is more important • <1: recall is more important • Usually =1
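The slide's formula is not reproduced here, so this sketch assumes the widely used form F = (1 + β²)PR / (β²P + R). Under this particular form, β > 1 weights recall more heavily and β < 1 weights precision more heavily; the directions listed on the slide may correspond to a parameterization that attaches the weight to precision instead. With β = 1 both forms reduce to the harmonic mean of precision and recall.

```python
def f_measure(precision, recall, beta=1.0):
    # Assumed form: F = (1 + beta^2) * P * R / (beta^2 * P + R).
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical system with low precision and perfect recall.
f1 = f_measure(0.5, 1.0)          # harmonic mean, 2/3
f2 = f_measure(0.5, 1.0, 2.0)     # leans toward recall under this form
f_half = f_measure(0.5, 1.0, 0.5) # leans toward precision under this form
```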
Text Categorization • Goal: classify documents into predefined categories • Approaches: • Naïve Bayes • Nearest Neighbor • SVM
Naïve Bayes Method • Knowledge Base contains • A set of hypotheses • A set of evidences • Probability of an evidence given a hypothesis • Given • A subset of the evidences known to be present in a situation • Find • the hypothesis with the highest posterior probability: P(H|E1, E2, …, Ek). • The probability itself does not matter so much.
Naïve Bayes Method • Assumptions • Hypotheses are exhaustive and mutually exclusive • H1 ∨ H2 ∨ … ∨ Hk • ¬(Hi ∧ Hj) for any i≠j • Evidences are conditionally independent given a hypothesis • P(E1, E2,…, Ek|H) = P(E1|H)…P(Ek|H) • P(H | E1, E2,…, Ek) = P(E1, E2,…, Ek, H)/P(E1, E2,…, Ek) = P(E1, E2,…, Ek|H)P(H)/P(E1, E2,…, Ek)
Naïve Bayes Method • The goal is to find the H that maximizes P(H|E1, E2,…, Ek) • Since P(H|E1, E2,…, Ek) = P(E1, E2,…, Ek|H)P(H)/P(E1, E2,…, Ek) and P(E1, E2,…, Ek) is the same for different hypotheses, • Maximizing P(H|E1, E2,…, Ek) is equivalent to maximizing P(E1, E2,…, Ek|H)P(H) = P(E1|H)…P(Ek|H)P(H) • Naïve Bayes Method • Find a hypothesis that maximizes P(E1|H)…P(Ek|H)P(H)
Example: Play Tennis? Predict playing tennis when <sunny, cool, high, strong> What probability should be used to make the prediction? How to compute the probability?
Probabilities of Individual Attributes • Given the training set, we can compute the probabilities
Example: Play Tennis P(+| sunny, cool, high, strong) vs. P(−| sunny, cool, high, strong) P(sunny|+)P(cool|+)P(high|+)P(strong|+)P(+) vs. P(sunny|−)P(cool|−)P(high|−)P(strong|−)P(−)
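The comparison can be worked through in code. The training set is not reproduced in these slides, so the sketch assumes the standard 14-example PlayTennis table (outlook, temperature, humidity, wind, label) that usually accompanies this example; if the slides used different data, the numbers change but the procedure does not.

```python
# Assumed training set: the standard 14-example PlayTennis table.
# Columns: outlook, temperature, humidity, wind, label (+ = play, - = don't).
DATA = [
    ("sunny", "hot", "high", "weak", "-"),
    ("sunny", "hot", "high", "strong", "-"),
    ("overcast", "hot", "high", "weak", "+"),
    ("rain", "mild", "high", "weak", "+"),
    ("rain", "cool", "normal", "weak", "+"),
    ("rain", "cool", "normal", "strong", "-"),
    ("overcast", "cool", "normal", "strong", "+"),
    ("sunny", "mild", "high", "weak", "-"),
    ("sunny", "cool", "normal", "weak", "+"),
    ("rain", "mild", "normal", "weak", "+"),
    ("sunny", "mild", "normal", "strong", "+"),
    ("overcast", "mild", "high", "strong", "+"),
    ("overcast", "hot", "normal", "weak", "+"),
    ("rain", "mild", "high", "strong", "-"),
]

def naive_bayes_score(instance, label):
    # P(E1|H) ... P(Ek|H) P(H), estimated by relative frequencies.
    rows = [r for r in DATA if r[-1] == label]
    score = len(rows) / len(DATA)  # prior P(H)
    for i, value in enumerate(instance):
        score *= sum(1 for r in rows if r[i] == value) / len(rows)
    return score

x = ("sunny", "cool", "high", "strong")
pos = naive_bayes_score(x, "+")  # (9/14) * (2/9)(3/9)(3/9)(3/9)
neg = naive_bayes_score(x, "-")  # (5/14) * (3/5)(1/5)(4/5)(3/5)
prediction = "+" if pos > neg else "-"
```

With the assumed table the negative score is larger, so the prediction is not to play. The two scores are not probabilities until they are divided by their sum, which is unnecessary for choosing the larger one.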
Application: Spam Detection • Spam • Dear sir, We want to transfer to overseas ($ 126,000.000.00 USD) One hundred and Twenty six million United States Dollars) from a Bank in Africa, I want to ask you to quietly look for a reliable and honest person who will be capable and fit to provide either an existing …… • Legitimate email • Ham: for lack of better name.
Hypotheses: {Spam, Ham} • Evidence: a document • The document is treated as a set (or bag) of words • Knowledge • P(Spam) • The prior probability of an e-mail message being spam. • How to estimate this probability? • P(w|Spam) • the probability that a word is w if we know it is chosen from a spam message. • How to estimate this probability?
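One possible answer to the two estimation questions is to use relative frequencies from labeled training mail: P(Spam) as the fraction of messages labeled spam, and P(w|Spam) as the share of spam word occurrences that are w. The sketch below adds add-one (Laplace) smoothing, which is an assumption not stated on the slide, and works in log space to avoid underflow. The training messages are hypothetical.

```python
import math
from collections import Counter

def train(messages):
    # messages: list of (label, list_of_words) pairs.
    label_counts = Counter(label for label, _ in messages)
    word_counts = {label: Counter() for label in label_counts}
    for label, words in messages:
        word_counts[label].update(words)
    vocab = {w for _, words in messages for w in words}
    return label_counts, word_counts, vocab

def classify(words, model):
    label_counts, word_counts, vocab = model
    total_messages = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # log P(label), estimated as the fraction of messages with that label.
        score = math.log(label_counts[label] / total_messages)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            if w in vocab:  # words never seen in training are skipped
                # log P(w | label) with add-one smoothing (assumption).
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled training messages.
messages = [
    ("spam", ["transfer", "million", "dollars", "bank"]),
    ("spam", ["honest", "person", "transfer", "dollars"]),
    ("ham", ["conference", "schedule", "kyoto"]),
    ("ham", ["meeting", "schedule", "bank"]),
]
model = train(messages)
```

Smoothing keeps a single unseen word from forcing the whole product to zero for one hypothesis.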
Other Text Categorization Algorithms • Support Vector Machine • often has the best performance. • K-Nearest Neighbor