These slides examine the problem of obtaining relevance feedback automatically with minimal effort in supervised information retrieval. Topics include incremental user feedback, initial fixed training sets, and user tagging of relevant and irrelevant documents, along with pattern matching between query and document vectors, Rocchio relevance feedback, and probabilistic IR and text classification. The goal is to compute the probability of relevance (or of a class) given a representation of a document.
Supervised IR

Refined computation of the relevant set based on:
• Incremental user feedback (relevance feedback), OR
• Initial fixed training sets
• User tags documents as relevant/irrelevant
• Routing problem: an initial class is given

Big open question: how do we obtain feedback automatically with minimal effort?
"Unsupervised" IR

Predicting relevance without user feedback.
Pattern matching: query vector/set vs. document vector/set.
Co-occurrence of terms is assumed to be an indication of relevance.
Relevance Feedback

Incremental feedback in the vector model (Rocchio, 71):

Q0 = initial query
Q1 = a·Q0 + b·(1/NRel) Σ(i=1..NRel) Ri − d·(1/NIrrel) Σ(i=1..NIrrel) Si

where the Ri are the vectors of documents judged relevant and the Si those judged irrelevant.
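As a concrete illustration, here is a minimal sketch of one Rocchio update in Python, assuming dense NumPy term vectors; the weight values a, b, d are placeholders, since the slide does not fix them:

```python
import numpy as np

def rocchio_update(q0, rel_docs, irrel_docs, a=1.0, b=0.75, d=0.15):
    """One round of Rocchio relevance feedback: move the query toward the
    centroid of the relevant documents and away from the irrelevant centroid."""
    rel_centroid = np.mean(rel_docs, axis=0) if len(rel_docs) else np.zeros_like(q0)
    irrel_centroid = np.mean(irrel_docs, axis=0) if len(irrel_docs) else np.zeros_like(q0)
    return a * q0 + b * rel_centroid - d * irrel_centroid

# Toy example: 3-term vocabulary, one relevant and one irrelevant document
q1 = rocchio_update(np.array([1.0, 0.0, 0.0]),
                    rel_docs=[np.array([0.5, 0.5, 0.0])],
                    irrel_docs=[np.array([0.0, 0.0, 1.0])])
```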
Probabilistic IR / Text Classification

Document retrieval:
If P(Rel|Doci) > P(Irrel|Doci) then Doci is "relevant", else Doci is "not relevant".

-OR-

If P(Rel|Doci) / P(Irrel|Doci) > 1 then Doci is "relevant"…
The magnitude of the ratio indicates our confidence.
Text Classification

Select Classj (Bowling, DogBreeding, etc.) such that P(Classj | Doci) is maximized, where Doci is, e.g., an incoming mail message.

Alternately, select Classj such that P(Classj | Doci) / P(NOT Classj | Doci) is maximized.
General Formulation

Compute P(Classj | Evidence), where Classj is one of a fixed K *disjoint* classes (an item can't be a member of more than one) and the evidence is a set of feature values (e.g. words in a language, medical test results, etc.).

Uses:
• Rel/Irrel: document retrieval
• Work/Bowling/DogBreeding: text classification/routing
• Spanish/Italian/English: language ID
• Sick/Well: medical diagnosis
• Herpes/Not Herpes: medical diagnosis
Feature Set

Goal: to compute P(Classj | Doci).

Abstract formulation: P(Classj | representation of Doci), i.e. the probability of the class given a representation of Doci.

One representation: P(Classj | W1, W2, …, Wk), a vector of the words in the document.
-OR-
More generally: P(Classj | F1, F2, …, Fk), a list of document features.
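For instance, one way to obtain the word-vector representation W1…Wk is a simple bag-of-words count; this is only a sketch of the idea, not a prescribed representation:

```python
from collections import Counter

def bag_of_words(doc_text):
    """Represent Doc_i by the counts of the words it contains (W1 ... Wk)."""
    return Counter(doc_text.lower().split())

features = bag_of_words("Collie pup wins the show")
# Counter({'collie': 1, 'pup': 1, 'wins': 1, 'the': 1, 'show': 1})
```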
Problem: Sparse Feature Set

In medical diagnosis there are few enough features that it is worth considering all possible feature combinations: P(Evidence|Classi) can be computed directly from data for every evidence pattern, e.g. P(T,T,F|Herpes) = 12 / (total Herpes cases).

In IR there are far too many combinations of feature values to estimate the class distribution for every combination.
Bayes Rule

P(Classi|Evidence) = P(Evidence|Classi) × P(Classi) / P(Evidence)

where P(Classi|Evidence) is the posterior probability of the class given the evidence and P(Classi) is the prior probability of the class.

Uninformative prior: P(Classi) = 1 / (total # of classes)
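A small sketch of the rule with hypothetical numbers and a uniform (uninformative) prior over two classes:

```python
def posterior(priors, likelihoods):
    """Bayes rule: P(Class|Evidence) = P(Evidence|Class) * P(Class) / P(Evidence),
    where P(Evidence) is obtained by summing over all classes."""
    evidence = sum(likelihoods[c] * priors[c] for c in priors)
    return {c: likelihoods[c] * priors[c] / evidence for c in priors}

# Hypothetical likelihoods with an uninformative prior (1 / number of classes)
print(posterior({"Rel": 0.5, "Irrel": 0.5}, {"Rel": 0.08, "Irrel": 0.02}))
# {'Rel': 0.8, 'Irrel': 0.2}
```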
Example in Medical Diagnosis

A single blood test.

P(Herpes|Evidence) = P(Evidence|Herpes) × P(Herpes) / P(Evidence)

• P(Herpes|Evidence): probability of herpes given a test result
• P(Evidence|Herpes): probability of the test result if the patient has herpes
• P(Herpes): prior probability of the patient having herpes
• P(Evidence): probability of a (pos/neg) test result

P(Herpes|Positive Test) = .9      P(Not Herpes|Positive Test) = .1
P(Herpes|Negative Test) = .001    P(Not Herpes|Negative Test) = .999
Evidence Decomposition

P(Classj | Evidence), where the evidence is a given combination of feature values.

[Slide figure: the decomposition illustrated for medical diagnosis and for text classification/routing.]
Example in Text Classification / Routing

Example evidence for Dog Breeding: (collie, groom, show).

P(Classi|Evidence) = P(Evidence|Classi) × P(Classi) / P(Evidence)

• P(Classi): prior chance that a mail message is about dog breeding
• P(Evidence|Classi): observed directly from training data

Training data:
• Class 1 (Dog Breeding): fur, collie, groom, show, poodle, sire, breed, akita, pup
• Class 2 (Work): compiler, X86, C++, lex, YACC, computer, Java
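A sketch of how P(word|class) can be estimated from such training word lists; the counts are toy values taken from the slide's lists, and the add-alpha smoothing constant is an assumption, not something the slide specifies:

```python
from collections import Counter

training = {
    "DogBreeding": "fur collie groom show poodle sire breed akita pup".split(),
    "Work": "compiler x86 c++ lex yacc computer java".split(),
}
counts = {c: Counter(ws) for c, ws in training.items()}
totals = {c: sum(cnt.values()) for c, cnt in counts.items()}
vocab = {w for ws in training.values() for w in ws}

def p_word_given_class(word, cls, alpha=1.0):
    """Estimate P(word|class) from training counts, with add-alpha smoothing
    so unseen words do not get probability zero."""
    return (counts[cls][word] + alpha) / (totals[cls] + alpha * len(vocab))

print(p_word_given_class("collie", "DogBreeding"))   # relatively high
print(p_word_given_class("collie", "Work"))          # small but nonzero
```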
Probabilistic IR

Target/goal: for a given model of relevance to the user's needs:
• Document retrieval
• Document routing / classification
Multiple Binary Splits vs. Flat K-Way Classification

[Slide figure: a query Q1 can be classified by a sequence of binary splits (A vs. B, then A1/A2 or B1/B2) or by a single flat K-way decision among classes A through G.]
Likelihood Ratios

P(Class1|Evidence) = P(Evidence|Class1) × P(Class1) / P(Evidence)
P(Class2|Evidence) = P(Evidence|Class2) × P(Class2) / P(Evidence)

P(Class1|Evidence) / P(Class2|Evidence) = [P(Evidence|Class1) / P(Evidence|Class2)] × [P(Class1) / P(Class2)]
Likelihood Ratios

1. Binary classifications:
   • Document retrieval: P(Rel|Doci) / P(Irrel|Doci), the two options being Rel and Irrel
   • Binary routing task (2 possible classes): P(Work|Doci) / P(Personal|Doci)
2. A K-way classification can be treated as a series of binary classifications.
3. Compute P(Classj|Doci) / P(NOT Classj|Doci) for all classes and choose the class j for which this ratio is greatest.
Independence Assumption

Evidence = w1, w2, w3, …, wk

Final odds = initial odds × likelihood ratio:

P(Class1|Evidence) / P(Class2|Evidence) = [P(Class1) / P(Class2)] × [P(Evidence|Class1) / P(Evidence|Class2)]
                                        = [P(Class1) / P(Class2)] × Π(i=1..k) [P(wi|Class1) / P(wi|Class2)]
Using Independence Assumption

P(Personal|Akita,pup,fur,show) / P(Work|Akita,pup,fur,show)
  = [P(Personal)/P(Work)] × [P(Akita|Personal)/P(Akita|Work)] × [P(pup|Personal)/P(pup|Work)]
    × [P(fur|Personal)/P(fur|Work)] × [P(show|Personal)/P(show|Work)]

With example counts:

P(Personal|Evidence) / P(Work|Evidence) = (1/9) × (27/2) × (18/0) × (36/2) × (3/5)

i.e. the prior odds times the product of likelihood ratios for each word. A zero count in a denominator (the 18/0 term) is handled by substituting some constant a1.
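The same computation can be done in log space; this is a sketch assuming per-class word-probability dictionaries, where the small constant eps plays the role of the slide's a1, standing in for zero probability estimates so every ratio stays finite:

```python
import math

def log_odds(words, prior_odds, p_w_class1, p_w_class2, eps=1e-6):
    """Log of the final odds: log prior odds plus the sum of per-word
    log likelihood ratios (the independence assumption)."""
    score = math.log(prior_odds)
    for w in words:
        score += math.log(p_w_class1.get(w, eps)) - math.log(p_w_class2.get(w, eps))
    return score  # > 0 favours Class1 (Personal), < 0 favours Class2 (Work)
```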
Note: Ratios Are (Partially) Self-Weighting

e.g. P(The|Personal) / P(The|Work) ≈ 1   (5137/100,000 vs. 5238/100,000)
e.g. P(Akita|Personal) / P(Akita|Work) = 37   (37/100,000 vs. 1/100,000)

Common words contribute ratios near 1, while discriminative words contribute large ratios.
Bayesian Model Applications

Authorship identification:
P(Hamilton|Evidence) / P(Madison|Evidence) = [P(Evidence|Hamilton) / P(Evidence|Madison)] × [P(Hamilton) / P(Madison)]

Sense disambiguation:
P(Tank-Container|Evidence) / P(Tank-Vehicle|Evidence) = [P(Evidence|Tank-Container) / P(Evidence|Tank-Vehicle)] × [P(Tank-Container) / P(Tank-Vehicle)]
Dependence Trees (Hierarchical Bayesian Models)

P(w1,w2,…,w6) = P(w1) × P(w2|w1) × P(w3|w2) × P(w4|w2) × P(w5|w2) × P(w6|w5)

[Slide figure: tree with edges w1 → w2; w2 → w3, w4, w5; w5 → w6, where arrows give the direction of dependence.]
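A sketch of how a joint probability is evaluated over such a tree, assuming a parent map and a conditional probability table keyed by (node, value, parent value); these data structures are illustrative, not from the slide:

```python
def tree_joint(assignment, parents, cpt):
    """Multiply P(node | parent) along the dependence tree; the root node
    (parent None) contributes its marginal probability."""
    p = 1.0
    for node, value in assignment.items():
        parent = parents[node]
        parent_value = assignment[parent] if parent is not None else None
        p *= cpt[(node, value, parent_value)]
    return p

# Tree from the slide: w1 -> w2; w2 -> w3, w4, w5; w5 -> w6
parents = {"w1": None, "w2": "w1", "w3": "w2", "w4": "w2", "w5": "w2", "w6": "w5"}
```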
Full Probability Decomposition (overview)

• Using a simplifying (Markov) assumption: each word is conditioned only on the probability of the previous word
• Assumption of full independence
• Graphical models: partial decomposition into dependence trees
Full Probability Decomposition

P(w) = P(w1) × P(w2|w1) × P(w3|w2,w1) × P(w4|w3,w2,w1) × …

Using simplifying (Markov) assumptions, where P(word) is conditional only on the probability of the previous word:
P(w) = P(w1) × P(w2|w1) × P(w3|w2) × P(w4|w3) × …

Assumption of full independence:
P(w) = P(w1) × P(w2) × P(w3) × P(w4) × …

Graphical models, partial decomposition into dependence trees:
P(w) = P(w1) × P(w2|w1) × P(w3) × P(w4|w1,w3) × …
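A sketch contrasting the full-independence and Markov decompositions, assuming unigram and bigram probability tables are available:

```python
import math

def log_prob(words, unigram, bigram=None):
    """Log P(w1..wn): with bigram=None, uses full independence
    (product of unigram probabilities); otherwise uses the Markov
    assumption (each word conditioned only on the previous word)."""
    lp = math.log(unigram[words[0]])
    for prev, cur in zip(words, words[1:]):
        if bigram is None:
            lp += math.log(unigram[cur])
        else:
            lp += math.log(bigram[(prev, cur)])
    return lp
```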