Filtering Semi-Structured Documents Based on Faceted Feedback
Lanbo Zhang, Yi Zhang, Qianli Xing
Information Retrieval and Knowledge Management (IRKM) Lab
University of California, Santa Cruz
Personalized Information Filtering
• Identify user-desired documents from a document stream
• Two families of filtering approaches
  • Collaborative Filtering (CF)
  • Content-Based Filtering (CBF)
• Applications: news feeds, email spam filters, etc.
[Diagram: document streams (emails, news, blogs, …) enter the filtering system, which outputs the passed documents]
Semi-Structured Documents
• Increasingly prevalent on the Internet
  • Emails, news, movies, tweets, etc.
• Plenty of metadata available
Definitions
• Facet: a metadata field
  • Date, Topic, Location, Director, Genre, etc.
• Facet-Value Pair (FVP): a metadata field assigned a particular value
  • Topic: Royal wedding
  • Date: 04-29-2011
  • Location: London, UK
Motivation
• Existing filtering approaches learn user interests from users' relevance judgments of documents
• Users may have prior knowledge of which facet-value pairs are relevant
  • English-only readers: "Language: English"
  • Social network analysts: "Company: Facebook"
Can we exploit users' prior knowledge on facet-value pairs for filtering?
A New User Interaction Mechanism: Faceted Feedback
[Diagram: the filtering system presents FVP candidates (Lang: …, Topic: …, Date: …) to the user; the user returns the relevant FVPs (Topic: …, Lang: …)]
Research Questions
• Question 1
  • How to select facet-value pair candidates?
• Question 2
  • How to learn user profiles based on faceted feedback?
Q1: Possible Methods
• Feature selection methods for text classification
  • E.g., Mutual Information, Chi-Square measure, etc.
  • Usually assume that a large number of labeled documents is available
• Query expansion methods for retrieval
  • E.g., TFIDF score on pseudo-relevant documents
  • Assume that no labeled documents are available
FVP Selection: Our Approach
• In a filtering task
  • A large number of unlabeled documents
  • Possibly a small number of labeled documents
• We rank facet-value pairs by a score computed over two document sets
  [Scoring formula not preserved; its annotations mark the two sets: pseudo-relevant (positively classified) documents and user-labeled relevant documents]
• Intuition: features that occur frequently among relevant docs but rarely in the whole corpus are very likely to be relevant
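The intuition on this slide can be sketched in a few lines. The scoring formula itself is not preserved here, so the smoothed frequency ratio below is an assumption chosen only to illustrate "frequent among relevant docs, rare in the corpus"; the function name `rank_fvps`, the `smoothing` parameter, and the toy documents are all hypothetical.

```python
from collections import Counter

def rank_fvps(relevant_docs, corpus_docs, smoothing=1.0):
    """Rank facet-value pairs by how over-represented they are among relevant docs.

    Each document is a set of (facet, value) tuples. The score is a smoothed
    ratio of relative frequencies: an illustrative stand-in, not the formula
    from the slides.
    """
    rel_counts = Counter(fvp for doc in relevant_docs for fvp in set(doc))
    all_counts = Counter(fvp for doc in corpus_docs for fvp in set(doc))
    n_rel, n_all = len(relevant_docs), len(corpus_docs)

    scores = {}
    for fvp, rel_count in rel_counts.items():
        p_rel = rel_count / n_rel                             # frequency among relevant docs
        p_all = (all_counts[fvp] + smoothing) / (n_all + smoothing)  # frequency in the corpus
        scores[fvp] = p_rel / p_all
    return sorted(scores, key=scores.get, reverse=True)

# Toy corpus: the first two documents play the role of relevant
# (user-labeled or pseudo-relevant) documents.
corpus = [
    {("Language", "English"), ("Topic", "Royal wedding")},
    {("Language", "English"), ("Topic", "Royal wedding")},
    {("Language", "English"), ("Topic", "Sports")},
    {("Language", "English"), ("Topic", "Finance")},
    {("Language", "French"), ("Topic", "Sports")},
    {("Language", "English"), ("Topic", "Finance")},
]
ranked = rank_fvps(corpus[:2], corpus)
```

In this toy corpus "Topic: Royal wedding" ranks above "Language: English": both occur in every relevant document, but the language value is common across the whole corpus.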
Content-Based Filtering (CBF)
• Treated as a binary text classification task
• User profile: a feature vector that represents a user's information needs (interests/preferences)
• Given the user profile θ, a document is judged relevant or not according to a classification rule
  [Equation not preserved; its annotations label the document label and the document vector]
• The core of CBF is learning the user profile!
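A minimal sketch of profile-based classification follows. The slide's decision rule is not preserved, so a logistic (sigmoid) model over a linear score is assumed here purely for illustration; `relevance_probability`, `is_relevant`, the `threshold`, and the example vectors are hypothetical names and values.

```python
import math

def relevance_probability(theta, x, bias=0.0):
    """Probability that document vector x is relevant under profile theta.

    Assumes a logistic model over the linear score theta . x + bias.
    """
    score = bias + sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-score))

def is_relevant(theta, x, threshold=0.5, bias=0.0):
    """Deliver the document when its estimated relevance reaches the threshold."""
    return relevance_probability(theta, x, bias) >= threshold

# Hypothetical three-feature profile and two document vectors.
theta = [2.0, -1.0, 0.5]
doc_a = [1.0, 0.0, 1.0]   # linear score 2.5, so the document is delivered
doc_b = [0.0, 1.0, 0.0]   # linear score -1.0, so the document is filtered out
```

Under this assumption, learning the user profile amounts to fitting `theta`, which is what the later slides address.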
Q2: Possible Methods
• Simple methods
  • Boolean strategy (AND, OR)
  • Feature selection
  • Pseudo-relevant document
• Sophisticated methods
  • Bayesian logistic regression with an adjusted prior (Dayanik et al. 06)
  • Generalized Expectation Criteria (Druck et al. 08)
Our Approach
• The assumption
  • A user selects a feature because it is highly correlated with the document label (R/NR)
• Generalized Constraint Model (GCM)
Correlation Decomposition
• Sufficiency
  • The probability of a document being relevant given that the feature has occurred: P(R+|f=1)
  • P(R+|f=1)=1: sufficient features
  • E.g., "Company: Facebook" for social network analysts
• Necessity
  • The probability of the feature having occurred given that a document is relevant: P(f=1|R+)
  • P(f=1|R+)=1: necessary features
  • E.g., "Language: English" for English-only readers
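The two definitions can be checked on a small labeled collection. This is only a counting sketch under the assumption that true labels are known; the dictionary layout and the five toy documents are hypothetical.

```python
def sufficiency(docs, feature):
    """P(R+ | f=1): share of documents containing the feature that are relevant."""
    covered = [d for d in docs if feature in d["features"]]
    if not covered:
        return 0.0
    return sum(d["relevant"] for d in covered) / len(covered)

def necessity(docs, feature):
    """P(f=1 | R+): share of relevant documents that contain the feature."""
    relevant = [d for d in docs if d["relevant"]]
    if not relevant:
        return 0.0
    return sum(feature in d["features"] for d in relevant) / len(relevant)

# Toy collection for a single hypothetical user.
docs = [
    {"features": {"Language: English", "Company: Facebook"}, "relevant": True},
    {"features": {"Language: English", "Company: Facebook"}, "relevant": True},
    {"features": {"Language: English"}, "relevant": True},
    {"features": {"Language: English"}, "relevant": False},
    {"features": {"Language: French"}, "relevant": False},
]

suff_fb = sufficiency(docs, "Company: Facebook")   # 2 of 2 covered docs are relevant: 1.0
nec_fb = necessity(docs, "Company: Facebook")      # 2 of 3 relevant docs contain it
suff_en = sufficiency(docs, "Language: English")   # 3 of 4 covered docs are relevant: 0.75
nec_en = necessity(docs, "Language: English")      # 3 of 3 relevant docs contain it: 1.0
```

In this toy data "Company: Facebook" behaves as a sufficient feature and "Language: English" as a necessary one.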
Examples: Highly-Correlated Features
[Venn diagram over the whole corpus showing the relevant set R+ and the regions f1=1, f2=1, f3=1]
1) f1 is a sufficient feature since P(R+|f1=1)=1
2) f2 is a necessary feature since P(f2=1|R+)=1
3) f3 is neither necessary nor sufficient, but both its sufficiency and necessity are high (>0.5)
Estimating Sufficiency
[Equation not preserved; its annotations label the document label, the user profile vector, the feature, the estimated label of document di, and the set of documents covered by feature f]
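One plausible reading of the annotations on this slide is that sufficiency is estimated by averaging the model's predicted relevance over the documents covered by the feature. The sketch below assumes that reading and a logistic prediction model; neither is confirmed by the slide text, and `predict` and `estimate_sufficiency` are hypothetical names.

```python
import math

def predict(theta, x):
    """Estimated probability that document vector x is relevant (logistic model assumed)."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

def estimate_sufficiency(theta, docs, feature_index):
    """Average predicted relevance over the documents covered by one feature.

    docs is a list of feature vectors; a document is "covered" by the feature
    when its entry at feature_index is positive.
    """
    covered = [x for x in docs if x[feature_index] > 0]
    if not covered:
        return 0.0
    return sum(predict(theta, x) for x in covered) / len(covered)

# Hypothetical two-feature profile and three unlabeled documents.
theta = [3.0, -3.0]
docs = [[1, 0], [1, 1], [0, 1]]
suff_f0 = estimate_sufficiency(theta, docs, 0)   # averages predictions for [1, 0] and [1, 1]
```

Because the estimate depends on `theta`, it can be used as a term in a loss that is minimized over the user profile.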
Estimating Necessity
• Derived with Bayes' theorem
[Equation not preserved; its annotations label the feature sufficiency and the prior distribution]
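Bayes' theorem links the two quantities: P(f=1|R+) = P(R+|f=1) · P(f=1) / P(R+). The sketch below applies that identity directly. How the slides estimate the priors is not preserved, so passing them in as plain numbers is an assumption, and `necessity_from_sufficiency` is a hypothetical name.

```python
def necessity_from_sufficiency(sufficiency, p_feature, p_relevant):
    """Apply Bayes' theorem: P(f=1 | R+) = P(R+ | f=1) * P(f=1) / P(R+).

    The result is clamped to 1.0 because independently estimated inputs
    can be slightly inconsistent with each other.
    """
    if p_relevant <= 0:
        raise ValueError("p_relevant must be positive")
    return min(1.0, sufficiency * p_feature / p_relevant)

# Hypothetical numbers: a feature present in 40% of documents, fully
# sufficient, in a stream where 60% of documents are relevant.
nec_a = necessity_from_sufficiency(1.0, 0.4, 0.6)

# A feature present in 80% of documents with sufficiency 0.75 in the same stream.
nec_b = necessity_from_sufficiency(0.75, 0.8, 0.6)
```

With these numbers the first feature covers two thirds of the relevant documents, while the second covers all of them and therefore acts as a necessary feature.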
Reference Distributions
• Our assumption
  • A user selects a feature because it has high sufficiency and/or high necessity
• Reference distributions: two Bernoulli distributions
  • The sufficiency/necessity of a user-selected feature should be close to the corresponding reference distribution
  • KL-divergence measures the closeness
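For two Bernoulli distributions the KL-divergence has a closed form, KL(p‖q) = p·log(p/q) + (1−p)·log((1−p)/(1−q)). A minimal sketch follows; the reference parameter 0.9 and the direction of the divergence (reference first, estimate second) are assumptions, since the slides do not preserve either.

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    """KL(Bernoulli(p) || Bernoulli(q)), with clipping to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

# Hypothetical reference distribution for "high sufficiency".
reference = 0.9

far = bernoulli_kl(reference, 0.5)    # estimated sufficiency far from the reference
near = bernoulli_kl(reference, 0.8)   # estimated sufficiency closer to the reference
same = bernoulli_kl(reference, 0.9)   # estimate equal to the reference
```

The divergence shrinks as the estimated sufficiency (or necessity) approaches the reference value, which is what makes it usable as a penalty term.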
User Profile Learning
• A unified loss function combines two types of feedback
  [Equation not preserved; its annotations mark terms for user-labeled documents, sufficient features, and necessary features]
• Ts, Tn: reference distributions
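To make the shape of such a loss concrete, here is a sketch that adds a log-loss term on user-labeled documents to KL penalties for user-selected sufficient and necessary features. It is not the GCM loss from the slides, whose equation is not preserved: the logistic model, the way sufficiency and necessity are estimated from predictions on unlabeled documents, the weights, and the reference values `t_s` and `t_n` are all assumptions.

```python
import math

def predict(theta, x):
    """Estimated relevance probability under an assumed logistic model."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

def bernoulli_kl(p, q, eps=1e-12):
    """KL(Bernoulli(p) || Bernoulli(q)) with clipping."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def combined_loss(theta, labeled, unlabeled, sufficient, necessary,
                  t_s=0.9, t_n=0.9, weight_s=1.0, weight_n=1.0):
    """Log loss on labeled docs plus KL penalties on selected features."""
    loss = 0.0
    # Term 1: user-labeled documents, given as (vector, label) pairs.
    for x, y in labeled:
        p = min(max(predict(theta, x), 1e-12), 1.0 - 1e-12)
        loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)

    preds = [predict(theta, x) for x in unlabeled]
    total = sum(preds)

    def covered_preds(f):
        return [p for p, x in zip(preds, unlabeled) if x[f] > 0]

    # Term 2: sufficiency = mean predicted relevance over covered documents.
    for f in sufficient:
        cov = covered_preds(f)
        if cov:
            loss += weight_s * bernoulli_kl(t_s, sum(cov) / len(cov))
    # Term 3: necessity = share of predicted relevance mass on covered documents.
    for f in necessary:
        if total > 0:
            loss += weight_n * bernoulli_kl(t_n, sum(covered_preds(f)) / total)
    return loss

# Hypothetical data: feature 0 was selected as both sufficient and necessary.
labeled = [([1, 0], 1), ([0, 1], 0)]
unlabeled = [[1, 0], [1, 1], [0, 1], [0, 1]]
loss_aligned = combined_loss([2.0, -2.0], labeled, unlabeled, [0], [0])
loss_opposed = combined_loss([-2.0, 2.0], labeled, unlabeled, [0], [0])
```

A profile that agrees with both the labels and the selected features yields the lower loss, so minimizing over `theta` pulls the profile toward both kinds of feedback.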
User Interaction Mechanisms
Two mechanisms
• Mechanism 1: ask users to select the features they think are relevant
• Mechanism 2: ask users to select, separately, the features they think are sufficient and the features they think are necessary
Outline
• Introduction
• Faceted Feedback
  • Facet-Value Pair Candidate Selection
  • Learning from Faceted Feedback
• Experiments
  • Settings
  • Results
• Summary
Data Sets
• Two data sets from the TREC filtering track
  • TREC 2000: OHSUMED (348,566 medical articles) + 63 topics (information needs)
    • Metadata field: MeSH (Medical Subject Headings)
  • TREC 2002: RCV1 (~800,000 news articles) + 50 topics defined by human assessors
    • Metadata fields: Topic, Industry, Region
• Each topic set is split into two equal-size subsets
  • One for parameter tuning, the other for testing
Faceted Feedback Collection
• Recruit subjects on Mechanical Turk
  • Five subjects per topic
  • Performance is averaged over subjects and the average is reported
• For each topic, we show subjects
  • The topic description (information need)
  • A group of facet-value pair candidates
Evaluation Metrics
• Precision (macro)
• Recall (macro)
• T11U = 2 * Nrd - Nnd
  • Nrd: the number of relevant docs delivered
  • Nnd: the number of non-relevant docs delivered
• T11SU = (max(T11U / MaxU, MinNU) - MinNU) / (1 - MinNU)
  • MinNU = -0.5
  • MaxU: the maximum possible utility (T11U)
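The utility metrics are easy to compute directly. The sketch below uses the TREC 2002 filtering track definition of the scaled utility, T11SU = (max(T11U / MaxU, MinNU) − MinNU) / (1 − MinNU), and takes MaxU as the utility of delivering every relevant document and nothing else (2 × the number of relevant documents). The function and argument names are hypothetical.

```python
def t11u(n_rel_delivered, n_nonrel_delivered):
    """Linear utility: +2 per relevant doc delivered, -1 per non-relevant doc."""
    return 2 * n_rel_delivered - n_nonrel_delivered

def t11su(n_rel_delivered, n_nonrel_delivered, n_rel_total, min_nu=-0.5):
    """Scaled utility in [0, 1]; n_rel_total must be positive."""
    max_u = 2 * n_rel_total   # utility of a perfect filter for this topic
    normalized = t11u(n_rel_delivered, n_nonrel_delivered) / max_u
    return (max(normalized, min_nu) - min_nu) / (1 - min_nu)

# A topic with 10 relevant documents in the stream.
perfect = t11su(10, 0, 10)      # every relevant doc delivered, nothing else
silent = t11su(0, 0, 10)        # nothing delivered at all
flooded = t11su(0, 100, 10)     # only non-relevant docs delivered
```

The lower bound MinNU caps the damage a single very bad topic can do to the average: the flooded filter scores 0, a silent filter scores 1/3, and a perfect filter scores 1.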
Results 1: With vs. Without Faceted Feedback (FF)
[Figure: filtering performance with and without FF; x-axis: # relevant docs initially known]
• Faceted feedback improves filtering performance, especially when fewer relevant documents are initially known.
Results 2: Different Learning Algorithms
[Figure: comparison of existing approaches with our approach]
• BOOL(A), BOOL(O): Boolean strategy
• FS: feature selection based on FF
• Pseudo-D/Q: pseudo-relevant doc/query
• Prior: logistic regression with Bayesian prior
• GEC: generalized expectation criteria
Summary
• Faceted feedback is useful for filtering, especially in cold-start scenarios
• The Generalized Constraint Model (GCM) is a robust user profile learning algorithm
• In future work, we will evaluate our methods on data sets where faceted features are more important
  • Movie, music, product, etc.
Questions?
Filtering Semi-Structured Documents Based on Faceted Feedback
Lanbo Zhang, Yi Zhang, Qianli Xing
Information Retrieval and Knowledge Management (IRKM) Lab, University of California, Santa Cruz
lanbo@soe.ucsc.edu, yiz@soe.ucsc.edu, xingqianli@gmail.com