These slides examine the problem of obtaining relevance feedback automatically with minimal effort in supervised information retrieval. Topics include incremental user feedback, initial fixed training sets, and user tagging of relevant and irrelevant documents, along with pattern matching between query and document vectors, Rocchio relevance feedback, and probabilistic IR and text classification. The goal is to compute the probability of relevance (or of a class) given a representation of a document.
Supervised IR

Refined computation of the relevant set based on:
• Incremental user feedback (relevance feedback), OR
• Initial fixed training sets
• User tags documents as relevant/irrelevant
• Routing problem: an initial class is given

Big open question: how do we obtain feedback automatically with minimal effort?
"Unsupervised" IR

Predicting relevance without user feedback.
Pattern matching: query vector/set vs. document vector/set.
Co-occurrence of terms is assumed to be an indication of relevance.
Relevance Feedback

Incremental feedback in the vector model (Rocchio, 71):

Q0 = initial query
Q1 = a·Q0 + b·(1/NRel) Σ(i=1..NRel) Ri − d·(1/NIrrel) Σ(i=1..NIrrel) Si

where the Ri are the vectors of documents judged relevant and the Si those judged irrelevant.
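As a concrete illustration, here is a minimal sketch of one Rocchio update in Python, assuming dense NumPy term vectors; the weight values a, b, d are placeholders, since the slide does not fix them:

```python
import numpy as np

def rocchio_update(q0, rel_docs, irrel_docs, a=1.0, b=0.75, d=0.15):
    """One round of Rocchio relevance feedback: move the query toward the
    centroid of the relevant documents and away from the irrelevant centroid."""
    rel_centroid = np.mean(rel_docs, axis=0) if len(rel_docs) else np.zeros_like(q0)
    irrel_centroid = np.mean(irrel_docs, axis=0) if len(irrel_docs) else np.zeros_like(q0)
    return a * q0 + b * rel_centroid - d * irrel_centroid

# Toy example: 3-term vocabulary, one relevant and one irrelevant document
q1 = rocchio_update(np.array([1.0, 0.0, 0.0]),
                    rel_docs=[np.array([0.5, 0.5, 0.0])],
                    irrel_docs=[np.array([0.0, 0.0, 1.0])])
```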
Probabilistic IR / Text Classification

Document retrieval:
If P(Rel|Doci) > P(Irrel|Doci) then Doci is "relevant", else Doci is "not relevant".

-OR-

If P(Rel|Doci) / P(Irrel|Doci) > 1 then Doci is "relevant"…
The magnitude of the ratio indicates our confidence.
Text Classification

Select Classj (Bowling, DogBreeding, etc.) such that P(Classj | Doci) is maximized, where Doci is, e.g., an incoming mail message.

Alternately, select Classj such that P(Classj | Doci) / P(NOT Classj | Doci) is maximized.
General Formulation

Compute P(Classj | Evidence), where Classj is one of a fixed K *disjoint* classes (an item can't be a member of more than one) and the evidence is a set of feature values (e.g. words in a language, medical test results, etc.).

Uses:
• Rel/Irrel: document retrieval
• Work/Bowling/DogBreeding: text classification/routing
• Spanish/Italian/English: language ID
• Sick/Well: medical diagnosis
• Herpes/Not Herpes: medical diagnosis
Feature Set

Goal: to compute P(Classj | Doci).

Abstract formulation: P(Classj | representation of Doci), i.e. the probability of the class given a representation of Doci.

One representation: P(Classj | W1, W2, …, Wk), a vector of the words in the document.
-OR-
More generally: P(Classj | F1, F2, …, Fk), a list of document features.
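For instance, one way to obtain the word-vector representation W1…Wk is a simple bag-of-words count; this is only a sketch of the idea, not a prescribed representation:

```python
from collections import Counter

def bag_of_words(doc_text):
    """Represent Doc_i by the counts of the words it contains (W1 ... Wk)."""
    return Counter(doc_text.lower().split())

features = bag_of_words("Collie pup wins the show")
# Counter({'collie': 1, 'pup': 1, 'wins': 1, 'the': 1, 'show': 1})
```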
Problem: Sparse Feature Set

In medical diagnosis there are few enough features that it is worth considering all possible feature combinations: P(Evidence|Classi) can be computed directly from data for every evidence pattern, e.g. P(T,T,F|Herpes) = 12 / (total Herpes cases).

In IR there are far too many combinations of feature values to estimate the class distribution for every combination.
Bayes Rule

P(Classi|Evidence) = P(Evidence|Classi) × P(Classi) / P(Evidence)

where P(Classi|Evidence) is the posterior probability of the class given the evidence and P(Classi) is the prior probability of the class.

Uninformative prior: P(Classi) = 1 / (total # of classes)
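A small sketch of the rule with hypothetical numbers and a uniform (uninformative) prior over two classes:

```python
def posterior(priors, likelihoods):
    """Bayes rule: P(Class|Evidence) = P(Evidence|Class) * P(Class) / P(Evidence),
    where P(Evidence) is obtained by summing over all classes."""
    evidence = sum(likelihoods[c] * priors[c] for c in priors)
    return {c: likelihoods[c] * priors[c] / evidence for c in priors}

# Hypothetical likelihoods with an uninformative prior (1 / number of classes)
print(posterior({"Rel": 0.5, "Irrel": 0.5}, {"Rel": 0.08, "Irrel": 0.02}))
# {'Rel': 0.8, 'Irrel': 0.2}
```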
Example in Medical Diagnosis

A single blood test.

P(Herpes|Evidence) = P(Evidence|Herpes) × P(Herpes) / P(Evidence)

• P(Herpes|Evidence): probability of herpes given a test result
• P(Evidence|Herpes): probability of the test result if the patient has herpes
• P(Herpes): prior probability of the patient having herpes
• P(Evidence): probability of a (pos/neg) test result

P(Herpes|Positive Test) = .9      P(Not Herpes|Positive Test) = .1
P(Herpes|Negative Test) = .001    P(Not Herpes|Negative Test) = .999
Evidence Decomposition

P(Classj | Evidence), where the evidence is a given combination of feature values.

[Slide figure: the decomposition illustrated for medical diagnosis and for text classification/routing.]
Example in Text Classification / Routing

Example evidence for Dog Breeding: (collie, groom, show).

P(Classi|Evidence) = P(Evidence|Classi) × P(Classi) / P(Evidence)

• P(Classi): prior chance that a mail message is about dog breeding
• P(Evidence|Classi): observed directly from training data

Training data:
• Class 1 (Dog Breeding): fur, collie, groom, show, poodle, sire, breed, akita, pup
• Class 2 (Work): compiler, X86, C++, lex, YACC, computer, Java
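A sketch of how P(word|class) can be estimated from such training word lists; the counts are toy values taken from the slide's lists, and the add-alpha smoothing constant is an assumption, not something the slide specifies:

```python
from collections import Counter

training = {
    "DogBreeding": "fur collie groom show poodle sire breed akita pup".split(),
    "Work": "compiler x86 c++ lex yacc computer java".split(),
}
counts = {c: Counter(ws) for c, ws in training.items()}
totals = {c: sum(cnt.values()) for c, cnt in counts.items()}
vocab = {w for ws in training.values() for w in ws}

def p_word_given_class(word, cls, alpha=1.0):
    """Estimate P(word|class) from training counts, with add-alpha smoothing
    so unseen words do not get probability zero."""
    return (counts[cls][word] + alpha) / (totals[cls] + alpha * len(vocab))

print(p_word_given_class("collie", "DogBreeding"))   # relatively high
print(p_word_given_class("collie", "Work"))          # small but nonzero
```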
Probabilistic IR

Target/goal: for a given model of relevance to the user's needs:
• Document retrieval
• Document routing / classification
Multiple Binary Splits vs. Flat K-Way Classification

[Slide figure: a query Q1 can be classified by a sequence of binary splits (A vs. B, then A1/A2 or B1/B2) or by a single flat K-way decision among classes A through G.]
Likelihood Ratios

P(Class1|Evidence) = P(Evidence|Class1) × P(Class1) / P(Evidence)
P(Class2|Evidence) = P(Evidence|Class2) × P(Class2) / P(Evidence)

P(Class1|Evidence) / P(Class2|Evidence) = [P(Evidence|Class1) / P(Evidence|Class2)] × [P(Class1) / P(Class2)]
Likelihood Ratios

1. Binary classifications:
   • Document retrieval: P(Rel|Doci) / P(Irrel|Doci), the two options being Rel and Irrel
   • Binary routing task (2 possible classes): P(Work|Doci) / P(Personal|Doci)
2. A K-way classification can be treated as a series of binary classifications.
3. Compute P(Classj|Doci) / P(NOT Classj|Doci) for all classes and choose the class j for which this ratio is greatest.
Independence Assumption

Evidence = w1, w2, w3, …, wk

Final odds = initial odds × likelihood ratio:

P(Class1|Evidence) / P(Class2|Evidence) = [P(Class1) / P(Class2)] × [P(Evidence|Class1) / P(Evidence|Class2)]
                                        = [P(Class1) / P(Class2)] × Π(i=1..k) [P(wi|Class1) / P(wi|Class2)]
Using Independence Assumption

P(Personal|Akita,pup,fur,show) / P(Work|Akita,pup,fur,show)
  = [P(Personal)/P(Work)] × [P(Akita|Personal)/P(Akita|Work)] × [P(pup|Personal)/P(pup|Work)]
    × [P(fur|Personal)/P(fur|Work)] × [P(show|Personal)/P(show|Work)]

With example counts:

P(Personal|Evidence) / P(Work|Evidence) = (1/9) × (27/2) × (18/0) × (36/2) × (3/5)

i.e. the prior odds times the product of likelihood ratios for each word. A zero count in a denominator (the 18/0 term) is handled by substituting some constant a1.
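The same computation can be done in log space; this is a sketch assuming per-class word-probability dictionaries, where the small constant eps plays the role of the slide's a1, standing in for zero probability estimates so every ratio stays finite:

```python
import math

def log_odds(words, prior_odds, p_w_class1, p_w_class2, eps=1e-6):
    """Log of the final odds: log prior odds plus the sum of per-word
    log likelihood ratios (the independence assumption)."""
    score = math.log(prior_odds)
    for w in words:
        score += math.log(p_w_class1.get(w, eps)) - math.log(p_w_class2.get(w, eps))
    return score  # > 0 favours Class1 (Personal), < 0 favours Class2 (Work)
```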
Note: Ratios Are (Partially) Self-Weighting

e.g. P(The|Personal) / P(The|Work) ≈ 1   (5137/100,000 vs. 5238/100,000)
e.g. P(Akita|Personal) / P(Akita|Work) = 37   (37/100,000 vs. 1/100,000)

Common words contribute ratios near 1, while discriminative words contribute large ratios.
Bayesian Model Applications

Authorship identification:
P(Hamilton|Evidence) / P(Madison|Evidence) = [P(Evidence|Hamilton) / P(Evidence|Madison)] × [P(Hamilton) / P(Madison)]

Sense disambiguation:
P(Tank-Container|Evidence) / P(Tank-Vehicle|Evidence) = [P(Evidence|Tank-Container) / P(Evidence|Tank-Vehicle)] × [P(Tank-Container) / P(Tank-Vehicle)]
Dependence Trees (Hierarchical Bayesian Models)

P(w1,w2,…,w6) = P(w1) × P(w2|w1) × P(w3|w2) × P(w4|w2) × P(w5|w2) × P(w6|w5)

[Slide figure: tree with edges w1 → w2; w2 → w3, w4, w5; w5 → w6, where arrows give the direction of dependence.]
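A sketch of how a joint probability is evaluated over such a tree, assuming a parent map and a conditional probability table keyed by (node, value, parent value); these data structures are illustrative, not from the slide:

```python
def tree_joint(assignment, parents, cpt):
    """Multiply P(node | parent) along the dependence tree; the root node
    (parent None) contributes its marginal probability."""
    p = 1.0
    for node, value in assignment.items():
        parent = parents[node]
        parent_value = assignment[parent] if parent is not None else None
        p *= cpt[(node, value, parent_value)]
    return p

# Tree from the slide: w1 -> w2; w2 -> w3, w4, w5; w5 -> w6
parents = {"w1": None, "w2": "w1", "w3": "w2", "w4": "w2", "w5": "w2", "w6": "w5"}
```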
Full Probability Decomposition (overview)

• Using a simplifying (Markov) assumption: each word is conditioned only on the probability of the previous word
• Assumption of full independence
• Graphical models: partial decomposition into dependence trees
Full Probability Decomposition

P(w) = P(w1) × P(w2|w1) × P(w3|w2,w1) × P(w4|w3,w2,w1) × …

Using simplifying (Markov) assumptions, where P(word) is conditional only on the probability of the previous word:
P(w) = P(w1) × P(w2|w1) × P(w3|w2) × P(w4|w3) × …

Assumption of full independence:
P(w) = P(w1) × P(w2) × P(w3) × P(w4) × …

Graphical models, partial decomposition into dependence trees:
P(w) = P(w1) × P(w2|w1) × P(w3) × P(w4|w1,w3) × …
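A sketch contrasting the full-independence and Markov decompositions, assuming unigram and bigram probability tables are available:

```python
import math

def log_prob(words, unigram, bigram=None):
    """Log P(w1..wn): with bigram=None, uses full independence
    (product of unigram probabilities); otherwise uses the Markov
    assumption (each word conditioned only on the previous word)."""
    lp = math.log(unigram[words[0]])
    for prev, cur in zip(words, words[1:]):
        if bigram is None:
            lp += math.log(unigram[cur])
        else:
            lp += math.log(bigram[(prev, cur)])
    return lp
```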