A Review of Information Filtering, Part I: Adaptive Filtering Chengxiang Zhai Language Technologies Institute School of Computer Science Carnegie Mellon University
Outline • The Problem of Adaptive Information Filtering (AIF) • The TREC Work on AIF • Evaluation Setup • Main Approaches • Sample Results • The Importance of Learning • Summary & Research Directions
Adaptive Information Filtering (AIF) • Dynamic information stream • (Relatively) stable user interest • System “blocks” non-relevant information according to user’s interest • User provides feedback on the received items • System learns from user’s feedback • Performance measured by the utility of the filtering decisions
A Typical AIF Application: News Filtering • Given a news stream and users • Each user expresses interest by a text “query” • For each news article, the system makes a yes/no filtering decision for each user interest • User provides feedback on the received news • System learns from feedback • Utility = 3*|Good| - 2*|Bad| (Good = relevant articles delivered, Bad = non-relevant articles delivered)
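To make the decision loop and the utility concrete, here is a minimal Python sketch (an illustration, not any particular TREC system); the score, update, and feedback functions are assumed placeholders.

```python
# Minimal sketch of an adaptive news-filtering loop (illustrative placeholders,
# not any particular TREC system).

def filter_stream(news_stream, profile, score, update, feedback, threshold=0.5):
    """score(profile, doc) -> float; feedback(doc) -> bool (asked only for
    delivered docs); update(profile, doc, relevant) -> new profile."""
    good = bad = 0
    for article in news_stream:
        if score(profile, article) >= threshold:          # yes/no filtering decision
            relevant = feedback(article)                   # user judges delivered items only
            if relevant:
                good += 1
            else:
                bad += 1
            profile = update(profile, article, relevant)   # learn from the judgment
    return 3 * good - 2 * bad                              # utility = 3*|Good| - 2*|Bad|
```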
AIF vs. Retrieval, Categorization, Topic tracking etc. • AIF is like retrieval over a dynamic stream of information items, but ranking is impossible • AIF is like online binary categorization without initial training data and with limited feedback • AIF is like tracking user interest over a news stream
Evaluation of AIF • Primary measure: linear utility (equivalent to a fixed probability-of-relevance cutoff) • E.g., the LF utilities used in TREC7 & 8 and T9U used in TREC9 • Problems with the linear utility • Unbounded • Not comparable across topics/profiles • Average utility may be dominated by one topic
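The "linear utility -> probability cutoff" remark can be made explicit. The derivation below is a standard reconstruction using generic coefficients a and b; it is consistent with the cutoffs quoted later in the talk (0.4 for F1, 0.33 for T9U) but is not copied from a TREC report.

```latex
% Linear utility over delivered documents, with gain a > 0 per relevant and
% penalty b > 0 per non-relevant delivered document:
\[ U = a\,|R^{+}| - b\,|N^{+}| \]
% Delivering a document with relevance probability p changes the expected
% utility by a p - b(1-p), so the optimal policy delivers exactly when
\[ p \;\ge\; \frac{b}{a+b}. \]
% E.g., a=3, b=2 gives p >= 0.4 (TREC-7 F1); a=2, b=1 gives p >= 1/3 (TREC-9 T9U).
```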
Other Measures • Nonlinear utility (e.g., “early” relevant doc is worth more) • Normalized utility • More meaningful for averaging • But can be inversely correlated with precision/recall! • Other measures that reflect a trade-off between precision and recall
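As one illustration of what a normalized utility can look like (an assumed form for illustration only, not necessarily the official TREC-9 definition): scale each topic's utility into [0,1] before averaging, so no single topic dominates.

```latex
% Illustrative normalization (assumed form, not the official TREC definition):
\[ U_{\text{norm}} \;=\; \frac{\max(U,\,U_{\min}) - U_{\min}}{U_{\max} - U_{\min}} \]
% where U_max is the utility of delivering exactly the relevant documents for
% the topic and U_min is a per-topic lower bound; the result lies in [0,1],
% which makes per-topic averages more meaningful.
```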
A Typical AIF System (architecture diagram): documents from the Doc Source are scored by a Binary Classifier; the classifier is initialized from the user's profile text; accepted docs go to the User, whose feedback, together with the accumulated docs, the user-interest profile, and the utility function, drives the Learning component that updates the classifier.
Three Basic Problems in AIF • Making the filtering decision (binary classifier): doc text + profile text → yes/no • Initialization: initialize the filter based on only the profile text or very few examples • Learning from • Limited relevance judgments (only on “yes” docs) • Accumulated documents • All three aim to maximize the utility
The TREC Work on AIF • The Filtering Track of TREC • Major Approaches to AIF • Sample Results
The Filtering Track (TREC7, 8, & 9) (Hull 99, Hull & Robertson 00, Robertson & Hull 01) • Encourage development and evaluation of techniques for text filtering • Tasks • Adaptive filtering (start with little/no training; online filtering with limited feedback) • Batch filtering (start with many training examples; online filtering with limited feedback) • Routing (start with many training examples; ranking of test documents)
AIF Evaluation Setup • TREC7: LF1, LF3 utility functions • AP88-90 + 50 topics • No training initially • TREC8: LF1, LF2 utility functions • Financial Times 92-94 + 50 topics • No training initially • TREC9: T9U, Precision@50, etc. • OHSUMED + 63 original topics + 4903 MeSH topics • 2 initial (positive) training examples available
Major Approaches to AIF • “Extended” retrieval systems • “Reuse” retrieval techniques to score documents • Use a score threshold for filtering decision • Learn to improve scoring with traditional feedback • New approaches to threshold setting and learning • “Modified” categorization systems • Adapt to binary, unbalanced categorization • New approaches to initialization • Train with “censored” training examples
A General Vector-Space AIF Approach (diagram): each incoming doc vector is scored against the profile vector; Thresholding compares the score to the threshold to produce the yes/no decision; delivered (“yes”) docs generate Feedback Information and Utility Evaluation, which drive Vector Learning (updating the profile vector) and Threshold Learning (updating the threshold).
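A minimal sketch of the loop in the diagram above; the Rocchio-style profile update and the simple additive threshold adjustment are placeholder choices, not the method of any specific TREC system.

```python
import numpy as np

# Sketch of the vector-space AIF loop: score, threshold, learn from feedback.
# The Rocchio-style profile update and the additive threshold update are
# illustrative placeholders.

def run_filter(doc_vectors, labels, profile, threshold,
               alpha=0.5, step=0.05, gain=3.0, penalty=2.0):
    utility = 0.0
    for x, is_relevant in zip(doc_vectors, labels):
        score = float(profile @ x)                 # scoring
        if score < threshold:                      # thresholding: "no" -> skip
            continue
        # "yes": the document is delivered, so the judgment is observed
        utility += gain if is_relevant else -penalty
        # vector learning: move the profile toward/away from the judged doc
        profile = profile + (alpha if is_relevant else -alpha) * x
        # threshold learning: lower after a correct delivery, raise after a mistake
        threshold += -step if is_relevant else step
    return profile, threshold, utility
```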
Extended Retrieval Systems • City Univ./Microsoft (Okapi): Prob. IR • Univ. of Massachusetts (Inquery): Inference Net • Queens College, CUNY (Pirc): Prob. IR • Clairvoyance Corp. (Clarit): Vector Space • Univ. of Nijmegen (KUN): Vector Space • Univ. of Twente (TNO): Language Model • And many others …
Threshold Setting in Extended Retrieval Systems • Utility-independent approaches (generally not working well, not covered in this talk) • Indirect (linear) utility optimization • Logistic regression (score->prob. of relevance) • Direct utility optimization • Empirical utility optimization • Expected utility optimization given score distributions • All try to learn the optimal threshold
Difficulties in Threshold Learning • Censored data • Little/no labeled data • Scoring bias due to vector learning • Example: scores 36.5 (R), 33.4 (N), 32.1 (R), 29.9 (?), 27.3 (?), … with threshold θ = 30.0; documents scoring below the threshold are never delivered, so their relevance is never observed
Logistic Regression • General idea: convert score of D to p(R|D) • Fit the model using feedback data • Linear utility is optimized with a fixed prob. cutoff • But, • Possibly incorrect parametric assumptions • No positive examples initially • Censored data and limited positive feedback
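A small sketch of the general idea (illustrative; not the exact Okapi parameterization discussed next): fit a logistic model p(R|s) on the judged documents and translate the fixed probability cutoff of a linear utility into a score threshold.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch of score-to-probability calibration by logistic regression
# (illustrative; not the exact Okapi parameterization).

def fit_logistic(scores, labels):
    """Fit p(R|s) = 1 / (1 + exp(-(a + b*s))) on judged (delivered) docs."""
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=float)          # 1 = relevant, 0 = not

    def neg_log_likelihood(params):
        a, b = params
        p = 1.0 / (1.0 + np.exp(-(a + b * s)))
        eps = 1e-12
        return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

    a, b = minimize(neg_log_likelihood, x0=[0.0, 1.0]).x
    return a, b

def score_threshold(a, b, p_cut):
    """Score threshold equivalent to a fixed probability cutoff p_cut,
    e.g. p_cut = 0.4 for the 3/-2 linear utility."""
    return (np.log(p_cut / (1 - p_cut)) - a) / b
```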
Logistic Regression in Okapi (Robertson & Walker 2000) • Motivation: recover the probability of relevance from the score of the original prob. IR model • Need to estimate two calibration parameters of the logistic model and ast1 (avg. score of the top 1% of docs) • One calibration parameter is shared by all topics; it is set initially and never updated
Logistic Regression in Okapi (cont.) • Initially, all topics share the same calibration parameters, and ast1 is estimated with a linear regression: ast1 = a1 + a2 * maxscore • After one week, ast1 is re-estimated from the documents seen during that week • Threshold learning • One calibration parameter is fixed all the time; the other is updated with gradient descent • A heuristic “ladder” is used to allow “exploration”
Logistic Regression in Okapi (cont.) • Pros • Well-motivated method for the Okapi system • Based on a principled approach • Cons • Limited adaptation • Exploration is ad hoc (over-explores initially) • Some nonlinear utilities may not correspond to a fixed probability cutoff
Direct Utility Optimization • Given • A utility function U(CR+, CR-, CN+, CN-) (counts of relevant/non-relevant, delivered/not delivered) • Training data D = {⟨si, ri⟩}, ri ∈ {R, N, ?} • Formulate utility as a function of the threshold and the training data: U = F(θ, D) • Choose the threshold by optimizing F(θ, D), i.e., θ* = argmaxθ F(θ, D)
Empirical Utility Optimization • Basic idea • Compute the utility on the training data for each candidate threshold (i.e., each training doc score) • Choose the threshold that gives the maximum utility • Difficulty: biased training sample! • We can only get an upper bound for the true optimal threshold • Solution: heuristic adjustment (lowering) of the threshold • This leads to “beta-gamma threshold learning”
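A minimal sketch of empirical utility optimization on the judged training scores, using the 3/-2 coefficients from the earlier example.

```python
# Sketch of empirical utility optimization: evaluate the training utility at
# every candidate threshold (each judged score) and keep the best one.
# Coefficients follow the 3*|Good| - 2*|Bad| example used earlier in the talk.

def empirical_optimal_threshold(scores, labels, gain=3.0, penalty=2.0):
    judged = sorted(zip(scores, labels), reverse=True)   # highest score first
    best_utility, best_threshold = 0.0, float("inf")     # deliver nothing by default
    utility = 0.0
    for score, is_relevant in judged:
        utility += gain if is_relevant else -penalty     # utility if threshold == score
        if utility > best_utility:
            best_utility, best_threshold = utility, score
    return best_threshold, best_utility
```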
The Beta-Gamma Threshold Learning Method in CLARIT (Zhai et al. 00) • Basic idea • Extend empirical utility optimization by putting a lower bound on the threshold • β corrects the score bias; γ controls exploration • β and γ are relatively stable and can be tuned on independent data • Can optimize any utility function (given an appropriate “zero-utility” lower bound)
Illustration of Beta-Gamma Threshold Learning: the learned threshold lies between the zero-utility cutoff position and the empirically optimal cutoff, with an interpolation parameter in [0,1]; exploration is encouraged down to the zero-utility cutoff, and the more training examples (N) there are, the less the exploration (the threshold moves closer to the empirical optimum).
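One plausible instantiation of the interpolation pictured above; the exact formula below is an assumption chosen to reproduce the described behaviour (beta keeps a permanent bias correction, the gamma term shrinks exploration as the number of judged examples grows), not a quotation from the CLARIT papers.

```python
import math

# Assumed instantiation of beta-gamma interpolation (illustrative, not a quote
# from the CLARIT papers): interpolate between the zero-utility cutoff and the
# empirically optimal cutoff, exploring less as more examples N are seen.

def beta_gamma_threshold(theta_zero, theta_optimal, n_examples, beta=0.1, gamma=0.05):
    alpha = beta + (1.0 - beta) * math.exp(-n_examples * gamma)   # in [beta, 1]
    # alpha = 1 -> maximum exploration (threshold at the zero-utility cutoff);
    # alpha -> beta as N grows -> threshold approaches the empirical optimum,
    # with beta keeping a small permanent correction for the upward score bias.
    return alpha * theta_zero + (1.0 - alpha) * theta_optimal
```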
Beta-Gamma Threshold Learning (cont.) • Pros • Explicitly addresses exploration-exploitation tradeoff (“Safe” exploration) • Arbitrary utility (with appropriate lower bound) • Empirically effective and robust • Cons • Purely heuristic • Zero utility lower bound often too conservative
Score Distribution Approaches (Arampatzis & van Hameren 01; Zhang & Callan 01) • Assume a generative model of scores: p(s|R), p(s|N) • Estimate the model with training data • Find the threshold by optimizing the expected utility under the estimated model • Specific methods differ in how they define and estimate the score distributions
A General Formulation of Score Distribution Approaches • Given p(R), p(s|R), and p(s|N), the expected utility E[U] for sample size n is a function of the threshold θ and n, i.e., E[U] = F(n, θ) • The optimal threshold for sample size n is θ*(n) = argmaxθ F(n, θ)
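For the linear utility a|R+| - b|N+|, the expected utility F(n, θ) can be written out explicitly; this expansion is a standard reconstruction consistent with the slide's definitions rather than a quotation from a specific paper.

```latex
% Expected utility of threshold \theta over n incoming documents, for the
% linear utility a|R+| - b|N+| (standard expansion, consistent with the slides):
\[
F(n,\theta) \;=\; n\Big[\, a\,p(R)\!\int_{\theta}^{\infty}\! p(s\mid R)\,ds
                \;-\; b\,p(N)\!\int_{\theta}^{\infty}\! p(s\mid N)\,ds \,\Big],
\qquad \theta^{*}(n) \;=\; \arg\max_{\theta} F(n,\theta).
\]
```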
Solution for Linear Utility & Continuous p(s|R), p(s|N) • Linear utility: U = a·|R+| - b·|N+| • The optimal threshold θ is the solution of the following equation (independent of n): a·p(R)·p(θ|R) = b·p(N)·p(θ|N), equivalently p(R|s=θ) = b/(a+b)
Gaussian-Exponential Distributions • p(s|R) ~ N(μ, σ²), p(s - s0 | N) ~ Exp(λ) (from Zhang & Callan 2001)
Parameter Estimation in KUN (Arampatzis & van Hameren 01) • μ, σ² estimated using ML on relevant docs • λ estimated using the top 50 non-relevant docs • Some recent “improvements”: • Compute p(s) based on p(wi) • Initial distribution: the query q treated as the only relevant doc • Soft probabilistic threshold: sampling with p(R|s)
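A minimal Python sketch of the score-distribution recipe under the Gaussian/exponential assumption above, choosing the threshold that maximizes expected linear utility; the shift s0 and the prior p(R) are handled naively and the function names are illustrative.

```python
import numpy as np
from scipy.stats import norm, expon

# Sketch of the score-distribution recipe under the Gaussian/exponential
# assumption: fit p(s|R) ~ N(mu, sigma^2), p(s - s0 | N) ~ Exp(lambda), then
# pick the threshold maximizing expected linear utility a|R+| - b|N+|.
# (Simplified illustration; the shift s0 and the prior p(R) are handled naively.)

def fit_score_model(rel_scores, nonrel_scores):
    mu, sigma = np.mean(rel_scores), np.std(rel_scores)        # ML fit on relevant docs
    s0 = np.min(nonrel_scores)                                  # naive shift estimate
    lam = 1.0 / (np.mean(np.asarray(nonrel_scores) - s0) + 1e-9)
    return mu, sigma, s0, lam

def expected_utility_threshold(mu, sigma, s0, lam, prior_rel, a=3.0, b=2.0):
    candidates = np.linspace(s0, mu + 4 * sigma, 1000)
    # P(deliver | R) and P(deliver | N) as a function of the candidate threshold
    p_deliver_rel = norm.sf(candidates, loc=mu, scale=sigma)
    p_deliver_non = expon.sf(candidates - s0, scale=1.0 / lam)
    expected_u = a * prior_rel * p_deliver_rel - b * (1 - prior_rel) * p_deliver_non
    return candidates[np.argmax(expected_u)]
```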
Maximum Conditional Likelihood (Zhang & Callan 01) • Explicit modeling of the censored data • Data: {⟨si, ri, δi⟩}, ri ∈ {R, N}, δi = delivered or not • Maximize the likelihood of the observed judgments conditioned on the documents having been delivered (score above the threshold) • Optimization by conjugate gradient descent • A prior is introduced for smoothing (making it Bayesian?) • A minimum “delivery ratio” is used to ensure exploration
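One way such a censored conditional likelihood can be written (an assumed reconstruction for illustration, not quoted from Zhang & Callan): condition each delivered document's observation on its score having exceeded the threshold in effect at delivery time.

```latex
% Illustrative form of a censored conditional likelihood (assumed reconstruction,
% not a quotation): for each delivered document i with score s_i, judgment r_i,
% and threshold \theta_i in effect at delivery time,
\[
L \;=\; \prod_{i:\,\delta_i = 1}
   \frac{p(r_i)\, p(s_i \mid r_i)}
        {p(R)\,P(s \ge \theta_i \mid R) + p(N)\,P(s \ge \theta_i \mid N)},
\]
% maximized over the parameters of p(s|R), p(s|N), and p(R).
```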
Score Distribution Approaches (cont.) • Pros • Principled approach • Arbitrary utility • Empirically effective • Cons • May be sensitive to the scoring function • Exploration not addressed
“Modified” Categorization Methods • Mostly applied to batch filtering or routing, and sometimes combined with Rocchio • K-Nearest Neighbor (CMU) • Naïve Bayes (Seoul) • Neural Network (ICDC, DSO, IRIT) • Decision Tree (NTT) • Only K-Nearest Neighbor was applied to AIF (CMU), with special thresholding strategies
The State of the Art Performance • For high-precision utilities, systems can hardly beat the zero-return baseline (i.e., they achieve negative utility)! • Direct/indirect utility optimization methods generally performed much better than utility-independent threshold tuning • Hard to compare different threshold learning methods, due to too many other factors (e.g., scoring)
TREC7 • No initial example • No system beats the zero-return baseline for F1 (pr>=0.4) • Several systems beat the zero-return baseline for F3 (pr>=0.2) (from Hull 99)
TREC7 • The learning effect is clear in some systems • But the stream is not “long” enough for systems to benefit much from learning (from Hull 99)
TREC8 • Again, learning effect is clear • But, systems still couldn’t beat the zero-return baseline! (from Hull & Robertson 00)
TREC9 • 2 initial examples • Amplifying learning effect • T9U (prob >=0.33) • Systems clearly beat the zero-return baseline! (from Robertson & Hull 01)
The Importance of Learning in AIF (results from Zhai et al. 00) • Learning and initial inaccuracies: learning compensates for initial inaccuracies • Exploitation vs. exploration: exploration (lowering the threshold) pays off in the long run • (Figure: threshold score over time for four strategies: ideal adaptive, ideal fixed, actual adaptive, actual fixed)
Learning Effect 1: Correction of Inappropriate Initial Threshold Setting • (Figure: two runs compared, a bad initial threshold without updating vs. a bad initial threshold with updating)
Tradeoff between Exploration and Exploitation • (Figure: two runs compared, one that under-explores and one that over-explores)
Summary • AIF is a very interesting and challenging online learning problem • As a learning task, it has extremely sparse training data • Initially no training data • Later, limited and censored training examples • Practically, learning must also be efficient
Summary(cont.) • Evaluation of AIF is challenging • Good performance (utility) is achieved by • Direct/indirect utility optimization • Learning the optimal score threshold from feedback • Appropriate tradeoff between exploration and exploitation • Several different threshold methods can all be effective
Research Directions • Threshold learning • Non-parametric score density estimation? • Controlled comparison of threshold methods • Integrated AIF model • Bayesian decision theory + EM? • Exploration-exploitation tradeoff • Reinforcement learning? • User model & evaluation measures • Users care about more factors than the linear utility • Users’ interest may drift over time • Redundancy reduction & novelty detection
References • General papers on TREC filtering evaluation • D. Hull, The TREC-7 Filtering Track: Description and Analysis, TREC-7 Proceedings. • D. Hull and S. Robertson, The TREC-8 Filtering Track Final Report, TREC-8 Proceedings. • S. Robertson and D. Hull, The TREC-9 Filtering Track Final Report, TREC-9 Proceedings. • Papers on specific adaptive filtering methods • Stephen Robertson and Stephen Walker, Threshold Setting in Adaptive Filtering, Journal of Documentation, 56:312-331, 2000. • Chengxiang Zhai, Peter Jansen, and David A. Evans, Exploration of a Heuristic Approach to Threshold Learning in Adaptive Filtering, SIGIR'00, 2000 (poster presentation). • Avi Arampatzis and Andre van Hameren, The Score-Distributional Threshold Optimization for Adaptive Binary Classification Tasks, SIGIR'01, 2001. • Yi Zhang and Jamie Callan, Maximum Likelihood Estimation for Filtering Threshold, SIGIR'01, 2001.