This article explores methods to obtain user feedback automatically with minimal effort in supervised information retrieval systems, using incremental user feedback and initial fixed training sets.
Supervised IR
Refined computation of the relevant set based on:
• Incremental user feedback (Relevance Feedback), OR
• Initial fixed training sets
• User tags documents relevant/irrelevant
• Routing problem: an initial class is given
Big open question: how do we obtain feedback automatically with minimal effort?
“Unsupervised” IR
Predicting relevance without user feedback.
Pattern matching: query vector/set vs. document vector/set.
Co-occurrence of terms is assumed to be an indication of relevance.
Relevance Feedback
Incremental feedback in the vector model (see Rocchio, 71):
Q0 = initial query
Q1 = a·Q0 + b·(1/N_Rel)·Σ_{i=1}^{N_Rel} R_i − d·(1/N_Irrel)·Σ_{i=1}^{N_Irrel} S_i
where the R_i are the vectors of documents tagged relevant and the S_i the vectors of documents tagged irrelevant.
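A minimal sketch of this update in Python, assuming numpy term vectors; the function name and the default weight values a, b, d are illustrative, not taken from the slides:

```python
import numpy as np

def rocchio_update(q0, relevant, irrelevant, a=1.0, b=0.75, d=0.15):
    """One round of Rocchio relevance feedback (Rocchio, 71).

    q0         -- initial query vector Q0
    relevant   -- list of vectors the user tagged relevant (R_i)
    irrelevant -- list of vectors the user tagged irrelevant (S_i)
    a, b, d    -- illustrative weights; tune per collection
    """
    r_centroid = np.mean(relevant, axis=0) if relevant else 0.0
    s_centroid = np.mean(irrelevant, axis=0) if irrelevant else 0.0
    return a * q0 + b * r_centroid - d * s_centroid
```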
Probabilistic IR / Text Classification
Document retrieval:
If P(Rel|Doc_i) > P(Irrel|Doc_i), then Doc_i is “relevant”, else Doc_i is “not relevant”.
Equivalently: if P(Rel|Doc_i) / P(Irrel|Doc_i) > 1, then Doc_i is “relevant”. The magnitude of the ratio indicates our confidence.
Text Classification
Select Class_j such that P(Class_j | Doc_i) is maximized, where the classes are e.g. Bowling, DogBreeding, etc. and Doc_i is e.g. an incoming mail message.
Alternately: select Class_j such that P(Class_j | Doc_i) / P(NOT Class_j | Doc_i) is maximized. (Both decision rules are sketched below.)
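Both rules fit in a few lines of Python; the function names and example class names are hypothetical:

```python
def classify(posteriors):
    """Flat rule: pick the Class_j maximizing P(Class_j | Doc_i).

    posteriors -- dict mapping class name to P(class | doc),
                  assumed already computed (e.g. via Bayes rule below)
    """
    return max(posteriors, key=posteriors.get)

def is_relevant(p_rel, p_irrel):
    """Retrieval rule from the previous slide: relevant iff
    P(Rel|Doc) / P(Irrel|Doc) > 1; the ratio's size is our confidence."""
    return p_rel / p_irrel > 1

# e.g. classify({"Bowling": 0.2, "DogBreeding": 0.7, "Work": 0.1})
# returns "DogBreeding"
```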
General Formulation
Compute: P(Class_j | Evidence)
• Class_j is one of a fixed K *disjoint* classes: an item can't be a member of more than one.
• Evidence is a set of feature values (e.g. words in a language, medical test results, etc.)
Uses:
• Rel/Irrel: Document Retrieval
• Work/Bowling/DogBreeding: Text Classification/Routing
• Spanish/Italian/English: Language ID
• Sick/Well: Medical Diagnosis
• Herpes/Not Herpes: Medical Diagnosis
Feature Set
Goal: compute P(Class_j | Doc_i).
Abstract formulation: P(Class_j | representation of Doc_i), the probability of the class given a representation of Doc_i.
One representation: P(Class_j | W1, W2, …, Wk), a vector of the words in the document.
More generally: P(Class_j | F1, F2, …, Fk), a list of document features.
Problem – Sparse Feature Set
In medical diagnosis, it is worth considering all possible feature combinations: we can compute P(Evidence|Class_i) directly from the data for every evidence pattern, e.g. P(T,T,F|Herpes) = 12 / (total Herpes cases).
In IR there are far too many combinations of feature values to estimate the class distribution over all combinations.
Bayes Rule
P(Class_i|Evidence) = P(Evidence|Class_i) × P(Class_i) / P(Evidence)
• P(Class_i|Evidence): posterior probability of the class given the evidence
• P(Class_i): prior probability of the class
Uninformative prior: P(Class_i) = 1 / (total # of classes)
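A small Python sketch of the rule over K disjoint classes; note that P(Evidence) is just a normalizing constant, so it never has to be estimated on its own:

```python
def posteriors(likelihoods, priors):
    """Bayes rule over K disjoint classes.

    likelihoods[j] = P(Evidence | Class_j), priors[j] = P(Class_j).
    P(Evidence) is the normalizer sum_j P(Evidence|Class_j) * P(Class_j).
    """
    joint = [l * p for l, p in zip(likelihoods, priors)]
    z = sum(joint)                        # P(Evidence)
    return [j / z for j in joint]

# Uninformative prior over K classes: priors = [1.0 / k] * k
```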
Example in Medical Diagnosis
A single blood test:
P(Herpes|Evidence) = P(Evidence|Herpes) × P(Herpes) / P(Evidence)
• P(Herpes|Evidence): probability of herpes given a test result
• P(Evidence|Herpes): probability of the test result if the patient has herpes
• P(Herpes): prior probability of the patient having herpes
• P(Evidence): probability of a (pos/neg) test result
P(Herpes|Positive Test) = .9    P(Not Herpes|Positive Test) = .1
P(Herpes|Negative Test) = .001  P(Not Herpes|Negative Test) = .999
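As a worked check, the hypothetical test characteristics below (assumed numbers, not given on the slide) roughly reproduce the quoted posteriors when run through the posteriors() helper sketched above:

```python
# Hypothetical test characteristics (assumptions, not from the slide),
# chosen so the posteriors land near the values quoted above.
p_herpes = 0.02             # prior P(Herpes)
p_pos_given_h = 0.95        # P(Positive Test | Herpes)
p_pos_given_not_h = 0.002   # P(Positive Test | Not Herpes)

post = posteriors([p_pos_given_h, p_pos_given_not_h],
                  [p_herpes, 1 - p_herpes])
print(post[0])   # P(Herpes | Positive Test), roughly 0.91
```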
Evidence Decomposition
P(Class_j | Evidence), where the evidence is a given combination of feature values. The same decomposition applies to both medical diagnosis and text classification/routing.
Example in Text Classification / Routing
P(Class_i|Evidence) = P(Evidence|Class_i) × P(Class_i) / P(Evidence)
• P(Class_i): prior chance that a mail message is about, e.g., dog breeding
• P(Evidence|Class_i): observed directly through training data (e.g. the words collie, groom, show)
Training data:
• Class 1, Dog Breeding: collie groom show; poodle sire breed; akita pup; fur collie
• Class 2, Work: compiler x86 C++; lex YACC computer; java
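A sketch of how P(word|class) can be estimated from such training data. Add-alpha smoothing is an assumption added here (the slides do not specify a scheme) so that unseen words keep a nonzero probability:

```python
from collections import Counter

def word_likelihoods(docs, vocab, alpha=1.0):
    """Estimate P(word | class) from the training documents of one class,
    with add-alpha (Laplace) smoothing.

    docs  -- list of token lists belonging to the class
    vocab -- full vocabulary across all classes
    """
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

# Toy training data echoing the slide:
dog_breeding = [["collie", "groom", "show"], ["poodle", "sire", "breed"],
                ["akita", "pup"], ["fur", "collie"]]
work = [["compiler", "x86", "c++"], ["lex", "yacc", "computer"], ["java"]]
vocab = {t for d in dog_breeding + work for t in d}
p_w_dog = word_likelihoods(dog_breeding, vocab)
```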
Probabilistic IR
Target/Goal: for a given model of relevance to the user's needs:
• Document Retrieval
• Document Routing / Classification
Multiple Binary Splits vs. Flat K-Way Classification
[Diagram: in the binary case, question Q1 first splits into A and B, which split further into A1/A2 and B1/B2; in the flat case, Q1 splits directly into the K classes A through G.]
Likelihood Ratios
P(Class1|Evidence) = P(Evidence|Class1) × P(Class1) / P(Evidence)
P(Class2|Evidence) = P(Evidence|Class2) × P(Class2) / P(Evidence)
Dividing one by the other, P(Evidence) cancels:
P(Class1|Evidence) / P(Class2|Evidence) = [P(Evidence|Class1) / P(Evidence|Class2)] × [P(Class1) / P(Class2)]
Likelihood Ratios
1. Binary classifications:
• Document retrieval: P(Rel|Doc_i) / P(Irrel|Doc_i), where the options are Rel and Irrel
• Binary routing task (2 possible classes): P(Work|Doc_i) / P(Personal|Doc_i)
2. Treat K-way classification as a series of binary classifications:
• Compute P(Class_j|Doc_i) / P(NOT Class_j|Doc_i) for all classes
• Choose the class j for which this ratio is greatest (one-vs-rest sketch below)
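A minimal one-vs-rest sketch of option 2, assuming a hypothetical posterior(class, doc) function; since the classes are disjoint, P(NOT Class_j|Doc) = 1 − P(Class_j|Doc):

```python
def one_vs_rest(doc, classes, posterior):
    """Score each class j by the binary ratio
    P(Class_j|Doc) / P(NOT Class_j|Doc) and pick the largest.

    posterior -- hypothetical function(class, doc) -> P(class | doc)
    """
    def ratio(c):
        p = posterior(c, doc)
        return p / max(1.0 - p, 1e-12)   # guard against p == 1
    return max(classes, key=ratio)
```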
Independence Assumption
Evidence = w1, w2, w3, …, wk
Final odds = initial odds × product of per-word likelihood ratios:
P(Class1|Evidence) / P(Class2|Evidence) = [P(Class1) / P(Class2)] × Π_{i=1}^{k} [P(w_i|Class1) / P(w_i|Class2)]
Using the Independence Assumption
P(Personal|Akita,pup,fur,show) / P(Work|Akita,pup,fur,show)
= [P(Personal)/P(Work)] × [P(Akita|Personal)/P(Akita|Work)] × [P(pup|Personal)/P(pup|Work)] × [P(fur|Personal)/P(fur|Work)] × [P(show|Personal)/P(show|Work)]
= (1/9) × (27/2) × (18/0) × (36/2) × (3/5)
This is the product of the likelihood ratios for each word. The zero count in one denominator (pup never appears in Work) must be replaced by some constant a1 so the ratio stays finite.
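Putting the pieces together, a sketch of the full ratio computation in log space (a standard trick to avoid underflow, not shown on the slides); the floor argument plays the role of the a1 constant:

```python
import math

def log_odds(words, p_w_class1, p_w_class2, prior1, prior2, floor=1e-6):
    """Log of P(Class1|Evidence) / P(Class2|Evidence) under the
    independence assumption:
        log(prior odds) + sum_i log(P(w_i|Class1) / P(w_i|Class2))

    floor -- replaces any zero probability (e.g. P(pup|Work) = 0)
             so the ratio stays finite; plays the slide's a1 role.
    """
    total = math.log(prior1 / prior2)
    for w in words:
        p1 = max(p_w_class1.get(w, 0.0), floor)
        p2 = max(p_w_class2.get(w, 0.0), floor)
        total += math.log(p1 / p2)
    return total   # > 0 favors Class1

# e.g. log_odds(["akita", "pup", "fur", "show"],
#               p_w_personal, p_w_work, 0.1, 0.9)
# where p_w_personal / p_w_work are hypothetical tables built
# with word_likelihoods() above.
```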
Note: Ratios Are (Partially) Self-Weighting
e.g. P(The|Personal) / P(The|Work) = (5137/100,000) / (5238/100,000) ≈ 1
e.g. P(Akita|Personal) / P(Akita|Work) = (37/100,000) / (1/100,000) ≈ 37
Common words contribute ratios near 1, so the discriminative words dominate the product.
Bayesian Model Applications
Authorship identification (e.g. Hamilton vs. Madison in the Federalist Papers):
P(Hamilton|Evidence) / P(Madison|Evidence) = [P(Evidence|Hamilton) / P(Evidence|Madison)] × [P(Hamilton) / P(Madison)]
Sense disambiguation:
P(Tank-Container|Evidence) / P(Tank-Vehicle|Evidence) = [P(Evidence|Tank-Container) / P(Evidence|Tank-Vehicle)] × [P(Tank-Container) / P(Tank-Vehicle)]
Dependence Trees (Hierarchical Bayesian Models)
P(w1,w2,…,wk) = P(w1) × P(w2|w1) × P(w3|w2) × P(w4|w2) × P(w5|w2) × P(w6|w5)
[Diagram: tree with arrows giving the direction of dependence: w1 → w2; w2 → w3, w4, w5; w5 → w6. A richer model could condition w6 on both w5 and w4, i.e. P(w6|w5,w4).]
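A sketch of evaluating this particular tree, assuming the conditional probability tables are stored as plain dicts (a hypothetical layout):

```python
def tree_joint(p, w):
    """Joint probability under the dependence tree shown above:
    w1 -> w2 -> {w3, w4, w5}, and w5 -> w6.

    p -- conditional probability tables, e.g.
         p["w2|w1"][(v2, v1)] = P(w2 = v2 | w1 = v1)
    w -- dict of observed values for w1..w6
    """
    return (p["w1"][w["w1"]]
            * p["w2|w1"][(w["w2"], w["w1"])]
            * p["w3|w2"][(w["w3"], w["w2"])]
            * p["w4|w2"][(w["w4"], w["w2"])]
            * p["w5|w2"][(w["w5"], w["w2"])]
            * p["w6|w5"][(w["w6"], w["w5"])])
```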
Full Probability Decomposition
P(w) = P(w1) × P(w2|w1) × P(w3|w2,w1) × P(w4|w3,w2,w1) × …
Using simplifying (Markov) assumptions:
P(w) = P(w1) × P(w2|w1) × P(w3|w2) × P(w4|w3) × …
(assume each word is conditioned only on the previous word)
Assumption of full independence:
P(w) = P(w1) × P(w2) × P(w3) × P(w4) × …
Graphical models: partial decomposition into dependence trees, e.g.
P(w) = P(w1) × P(w2|w1) × P(w3) × P(w4|w1,w3) × …
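For contrast, sketches of the Markov and full-independence decompositions, assuming hypothetical unigram and bigram probability tables estimated from a corpus (smoothing omitted for brevity):

```python
import math

def markov_logprob(words, p_uni, p_bi):
    """First-order Markov decomposition:
    log P(w) = log P(w1) + sum_i log P(w_i | w_{i-1}).

    p_uni -- dict word -> P(word)
    p_bi  -- dict (word, prev_word) -> P(word | prev_word)
    """
    lp = math.log(p_uni[words[0]])
    for prev, cur in zip(words, words[1:]):
        lp += math.log(p_bi[(cur, prev)])
    return lp

def independence_logprob(words, p_uni):
    """Full independence: log P(w) = sum_i log P(w_i)."""
    return sum(math.log(p_uni[w]) for w in words)
```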