Filtering Semi-Structured Documents Based on Faceted Feedback

Filtering Semi-Structured Documents Based on Faceted Feedback. Lanbo Zhang, Yi Zhang, Qianli Xing. Information Retrieval and Knowledge Management (IRKM) Lab, University of California, Santa Cruz

Personalized Information Filtering • Identify user-desired documents from a document stream • Two families of filtering approaches • Collaborative Filtering (CF) • Content-Based Filtering (CBF) • Applications: news feeds, email spam filters, etc. [Diagram: emails, news, blogs, … flow into the filtering system, which outputs the passed documents]

Semi-Structured Documents • Increasingly prevalent over the Internet • Emails, news, movies, tweets, etc. • Plenty of metadata available

Definitions • Facet: a metadata field • Date, Topic, Location, Director, Genre, etc. • Facet-Value Pair (FVP): a metadata field assigned with a particular value • Topic: Royal wedding • Date: 04-29-2011 • Location: London, UK Lanbo Zhang, Yi Zhang, Qianli Xing. IRKM Lab at University of California, Santa Cruz
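As a minimal sketch of these definitions, a facet-value pair can be modeled as a `(facet, value)` tuple attached to a document. The `Document` class, its field names, and the sample text are hypothetical and chosen only for illustration; the slides do not prescribe a data structure.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Tuple

# A facet-value pair (FVP): a metadata field together with one value,
# e.g. ("Topic", "Royal wedding").
FVP = Tuple[str, str]


@dataclass(frozen=True)
class Document:
    """Hypothetical semi-structured document: free text plus FVP metadata."""
    doc_id: str
    text: str
    fvps: FrozenSet[FVP] = field(default_factory=frozenset)

    def has(self, facet: str, value: str) -> bool:
        """Return True if the document carries the given facet-value pair."""
        return (facet, value) in self.fvps


doc = Document(
    doc_id="d1",
    text="Coverage of the royal wedding in London.",
    fvps=frozenset({
        ("Topic", "Royal wedding"),
        ("Date", "04-29-2011"),
        ("Location", "London, UK"),
    }),
)
```

Storing the pairs in a set makes membership checks such as `doc.has("Topic", "Royal wedding")` cheap, which is convenient when FVPs are later used as binary features.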

Motivation • Existing filtering approaches learn user interests based on users’ relevance judgments of documents • Users may have prior knowledge of which facet-value pairs are relevant • English-only readers: “Language: English” • Social network analysts: “Company: Facebook”

Can we exploit users’ prior knowledge on facet-value pairs for filtering?

A New User Interaction Mechanism: Faceted Feedback [Diagram: the filtering system presents FVP candidates (Lang: …, Topic: …, Date: …) to the user; the user returns the relevant FVPs (Topic: …, Lang: …)]

Research Questions • Question 1: How to select facet-value pair candidates? • Question 2: How to learn user profiles based on faceted feedback?

Q1: Possible Methods • Feature selection methods for text classification • E.g., Mutual Information, Chi-Square measure, etc. • Usually assume a large number of labeled documents are available • Query expansion methods for retrieval • E.g., TF-IDF score on pseudo-relevant documents • Assume no labeled documents are available

FVP Selection: Our Approach • In a filtering task, there are • A large number of unlabeled documents • Possibly a small number of labeled documents • We rank facet-value pairs by a score [formula; annotated inputs: pseudo-relevant (positively classified) documents and user-labeled relevant documents] • Intuition: features that occur frequently among relevant docs but rarely in the whole corpus are very likely to be relevant
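The intuition on this slide can be sketched in code. The scoring rule below (frequency among relevant documents divided by smoothed frequency in the whole corpus) is an assumption chosen to match the stated intuition, not the authors' actual ranking formula; the function name `rank_fvps` and its parameters are hypothetical. Here `relevant_docs` stands for the union of pseudo-relevant and user-labeled relevant documents, each given as a set of FVPs.

```python
from collections import Counter
from typing import List, Sequence, Set, Tuple

FVP = Tuple[str, str]


def rank_fvps(
    relevant_docs: Sequence[Set[FVP]],
    corpus_docs: Sequence[Set[FVP]],
    top_k: int = 10,
    smoothing: float = 1.0,
) -> List[Tuple[FVP, float]]:
    """Rank FVPs that are frequent among relevant docs but rare in the corpus.

    Assumed score: P(f | relevant) / smoothed P(f | corpus).
    """
    if not relevant_docs or not corpus_docs:
        return []
    rel_counts = Counter(f for d in relevant_docs for f in d)
    corpus_counts = Counter(f for d in corpus_docs for f in d)
    n_rel, n_all = len(relevant_docs), len(corpus_docs)

    scores = {}
    for fvp, count in rel_counts.items():
        p_rel = count / n_rel
        # Additive smoothing keeps the denominator away from zero.
        p_all = (corpus_counts[fvp] + smoothing) / (n_all + smoothing)
        scores[fvp] = p_rel / p_all
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


corpus = [
    {("Lang", "English"), ("Topic", "Wedding")},
    {("Lang", "English"), ("Topic", "Wedding")},
    {("Lang", "English"), ("Topic", "Sports")},
    {("Lang", "English"), ("Topic", "Politics")},
]
ranked = rank_fvps(relevant_docs=corpus[:2], corpus_docs=corpus)
```

In this toy corpus, "Topic: Wedding" outranks "Lang: English" because the latter appears in every document and so does not discriminate relevant documents from the rest.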

Research Questions • Question 1: How to select facet-value pair candidates? • Question 2: How to learn user profiles based on faceted feedback?

Content-Based Filtering (CBF) • Treated as a binary text classification task • User profile: a feature vector that represents a user’s information needs (interests/preferences) • Given the user profile θ, a document can be determined as relevant or not according to: [equation; annotated terms: document label, document vector] • The core of CBF is learning the user profile!
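To make the classification step concrete, here is a minimal sketch assuming a logistic model over sparse feature vectors. The slides do not show which classifier the authors use, so the logistic form, the 0.5 threshold, and the function names are all assumptions for illustration.

```python
import math
from typing import Dict


def relevance_probability(theta: Dict[str, float], doc: Dict[str, float]) -> float:
    """Assumed logistic model: P(relevant | doc, theta) = sigmoid(theta . doc)."""
    score = sum(theta.get(feature, 0.0) * value for feature, value in doc.items())
    return 1.0 / (1.0 + math.exp(-score))


def is_relevant(theta: Dict[str, float], doc: Dict[str, float],
                threshold: float = 0.5) -> bool:
    """Deliver the document if its estimated relevance reaches the threshold."""
    return relevance_probability(theta, doc) >= threshold


# Hypothetical profile: a positive weight on one FVP plus a negative bias term.
theta = {"Company: Facebook": 2.0, "bias": -1.0}
doc_a = {"Company: Facebook": 1.0, "bias": 1.0}
doc_b = {"bias": 1.0}
```

With this profile, `doc_a` scores above the threshold and `doc_b` below it, so learning the profile θ amounts to choosing the weights.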

Q2: Possible Methods • Simple methods • Boolean strategy (AND, OR) • Feature selection • Pseudo relevant document • Sophisticated methods • Bayesian logistic regression with an adjusted prior (Dayanik et al. 06) • Generalized Expectation Criteria (Druck et al. 08)
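The simplest of these baselines, the Boolean strategy, might look like the sketch below. This is one natural reading of "AND" and "OR" over user-selected FVPs; the slides do not spell out the exact semantics, and the function names are hypothetical.

```python
from typing import Iterable, Set


def bool_and(doc_fvps: Set[str], selected: Iterable[str]) -> bool:
    """Deliver the document only if it contains every user-selected FVP."""
    return all(fvp in doc_fvps for fvp in selected)


def bool_or(doc_fvps: Set[str], selected: Iterable[str]) -> bool:
    """Deliver the document if it contains at least one user-selected FVP."""
    return any(fvp in doc_fvps for fvp in selected)


selected = ["Language: English", "Company: Facebook"]
doc = {"Language: English", "Topic: Sports"}
```

The AND variant tends to favor precision and the OR variant recall, which is one reason such strategies serve only as simple baselines.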

Our Approach • The assumption • A user selects a feature because it has a high correlation with the document label (R/NR) • Generalized Constraint Model (GCM)

Correlation Decomposition • Sufficiency • The probability of a document being relevant given that the feature has occurred: P(R+|f=1) • P(R+|f=1)=1 : sufficient features • E.g., “Company: Facebook” for social network analysts • Necessity • The probability of the feature having occurred given that a document is relevant: P(f=1|R+) • P(f=1|R+)=1 : necessary features • E.g., “Language: English” for English-only readers
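Both quantities can be computed directly when document labels are known. The sketch below uses empirical counts over a small labeled set; the function name and the toy data are hypothetical, and in the filtering setting the labels are mostly unknown, so this only illustrates the definitions.

```python
from typing import List, Set, Tuple


def sufficiency_and_necessity(
    docs: List[Tuple[Set[str], bool]], feature: str
) -> Tuple[float, float]:
    """Return empirical (P(R+ | f=1), P(f=1 | R+)) from labeled documents."""
    n_feature = sum(1 for feats, _ in docs if feature in feats)
    n_relevant = sum(1 for _, relevant in docs if relevant)
    n_both = sum(1 for feats, relevant in docs if relevant and feature in feats)
    sufficiency = n_both / n_feature if n_feature else 0.0
    necessity = n_both / n_relevant if n_relevant else 0.0
    return sufficiency, necessity


# Each entry: (set of features in the document, is the document relevant?)
labeled = [
    ({"Language: English", "Company: Facebook"}, True),
    ({"Language: English"}, True),
    ({"Language: English"}, False),
    ({"Language: French"}, False),
]
facebook = sufficiency_and_necessity(labeled, "Company: Facebook")
english = sufficiency_and_necessity(labeled, "Language: English")
```

In this toy set, "Company: Facebook" is fully sufficient but only half necessary, while "Language: English" is fully necessary but not sufficient, mirroring the two examples on the slide.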

Examples: Highly-Correlated Features [Venn diagram over the whole corpus showing the relevant set R+ and the sets f1=1, f2=1, f3=1] 1) f1 is a sufficient feature since P(R+|f1=1)=1 2) f2 is a necessary feature since P(f2=1|R+)=1 3) f3 is neither necessary nor sufficient, but both its sufficiency and necessity are high (>0.5)

Estimating Sufficiency [equation; annotated terms:] • Document label • User profile vector • The feature • Estimation of the label of document di • The set of documents covered by feature f
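The equation itself did not survive in this transcript. Going only by the annotated terms, one plausible reading is that sufficiency is estimated by averaging the model's predicted relevance over the documents that contain the feature. The form below is a hedged reconstruction under that assumption, with $D_f$ (the set of documents covered by feature $f$) and $y_i$ (the label of document $d_i$) as introduced notation, not necessarily the authors' exact formula:

```latex
\hat{P}(R^{+} \mid f = 1; \theta)
  = \frac{1}{|D_f|} \sum_{d_i \in D_f} P(y_i = +1 \mid d_i, \theta)
```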

Estimating Necessity • Bayes’ Theorem! [equation; annotated terms:] • Feature sufficiency • Prior distribution
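Applying Bayes' theorem to the definition of necessity gives the identity below, which expresses necessity in terms of sufficiency. Which factor the slide's "prior distribution" annotation points to is not recoverable from the transcript, so that mapping is left open here:

```latex
P(f = 1 \mid R^{+})
  = \frac{P(R^{+} \mid f = 1)\, P(f = 1)}{P(R^{+})}
```

Here $P(R^{+} \mid f = 1)$ is the feature's sufficiency, $P(f = 1)$ is the fraction of documents containing the feature, and $P(R^{+})$ is the overall probability that a document is relevant.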

Reference Distributions • Our assumption • A user selects a feature because it has a high sufficiency and/or a high necessity • Reference distributions: two Bernoulli distributions • The sufficiency/necessity of a user-selected feature should be close to the reference distribution • KL divergence as the similarity measure
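For two Bernoulli distributions, the KL divergence has a simple closed form. The sketch below computes it; the argument order (reference first, estimate second), the clamping constant, and the sample reference value of 0.9 are assumptions, since the slides do not give these details.

```python
import math


def bernoulli_kl(p: float, q: float, eps: float = 1e-12) -> float:
    """KL( Bernoulli(p) || Bernoulli(q) ) in nats.

    p is treated as the reference parameter and q as the estimate; q is
    clamped into (0, 1) so the logarithms stay finite.
    """
    q = min(max(q, eps), 1.0 - eps)
    total = 0.0
    if p > 0.0:
        total += p * math.log(p / q)
    if p < 1.0:
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total


reference = 0.9  # hypothetical reference parameter for a user-selected feature
far = bernoulli_kl(reference, 0.5)
near = bernoulli_kl(reference, 0.8)
```

An estimated sufficiency or necessity closer to the reference value yields a smaller divergence, so minimizing this term pulls the model toward agreeing with the user's feature selections.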

User Profile Learning • The unified loss function to combine two types of feedback: [equation; annotated terms:] • User-labeled documents • Sufficient features • Necessary features • Ts, Tn: reference distributions
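The loss function is missing from this transcript. Based only on the annotated terms, a loss of the following shape would be consistent with the slide: a log-likelihood term on user-labeled documents plus KL-divergence penalties for the user-selected sufficient and necessary features. The sets $D_L$, $F_s$, $F_n$, the weights $\lambda_s$, $\lambda_n$, and the direction of the KL terms are all assumptions introduced for illustration, not the authors' stated formula:

```latex
L(\theta) =
  - \sum_{(d_i, y_i) \in D_L} \log P(y_i \mid d_i, \theta)
  + \lambda_s \sum_{f \in F_s}
      \mathrm{KL}\!\left( T_s \,\middle\|\, \hat{P}(R \mid f = 1; \theta) \right)
  + \lambda_n \sum_{f \in F_n}
      \mathrm{KL}\!\left( T_n \,\middle\|\, \hat{P}(f \mid R^{+}; \theta) \right)
```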

User Interaction Mechanisms • Two mechanisms • Mechanism 1: ask users to select features they think are relevant • Mechanism 2: ask users to specifically select features they think are sufficient and features they think are necessary

Outline • Introduction • Faceted Feedback • Facet-Value Pair Candidate Selection • Learning from Faceted Feedback • Experiments • Settings • Results • Summary

Data Sets • Two data sets from the TREC filtering track • TREC 2000: OHSUMED (348,566 medical articles) + 63 topics (information needs) • Metadata field: MeSH (Medical Subject Headings) • TREC 2002: RCV1 (~800,000 news articles) + 50 topics defined by human assessors • Metadata fields: Topic, Industry, Region • Each topic set is split into two equal-size subsets • One for parameter tuning, the other for testing

Faceted Feedback Collection • Recruit subjects on Mechanical Turk • Five subjects per topic • Average performance across subjects is reported • For each topic, we show subjects • The topic description (information need) • A group of facet-value pair candidates

Evaluation Metrics • Precision (macro) • Recall (macro) • T11U = 2 * Nrd - Nnd • Nrd: the number of relevant docs delivered • Nnd: the number of non-relevant docs delivered • T11SU = (max(T11U / MaxU, MinNU) - MinNU) / (1 - MinNU) • MinNU = -0.5 • MaxU: the maximum possible utility (T11U)
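The utility metrics can be computed as in the sketch below. It follows the TREC 2002 filtering track definition, in which T11SU = (max(T11U / MaxU, MinNU) - MinNU) / (1 - MinNU) and MaxU is the utility of delivering every relevant document and nothing else, that is, twice the number of relevant documents. The function and parameter names are hypothetical.

```python
def t11u(n_rel_delivered: int, n_nonrel_delivered: int) -> int:
    """Linear utility: +2 per relevant doc delivered, -1 per non-relevant doc."""
    return 2 * n_rel_delivered - n_nonrel_delivered


def t11su(n_rel_delivered: int, n_nonrel_delivered: int,
          n_rel_total: int, min_nu: float = -0.5) -> float:
    """Scaled utility in [0, 1]; delivering nothing scores 1/3 when min_nu=-0.5."""
    if n_rel_total <= 0:
        raise ValueError("n_rel_total must be positive")
    max_u = 2 * n_rel_total  # utility of delivering exactly the relevant docs
    normalized = t11u(n_rel_delivered, n_nonrel_delivered) / max_u
    return (max(normalized, min_nu) - min_nu) / (1.0 - min_nu)


score = t11su(n_rel_delivered=10, n_nonrel_delivered=4, n_rel_total=20)
```

The lower bound MinNU caps the penalty for a topic on which the system delivers many non-relevant documents, so a single bad topic cannot dominate the average.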

Results 1: With/Without Faceted Feedback (FF) [Figure: filtering performance with and without FF; x-axis: # relevant docs initially known] • Faceted feedback improves filtering performance, especially when fewer relevant documents are initially known.

Results 2: Different Learning Algorithms [Figure: existing approaches compared with our approach] • BOOL(A), BOOL(O): Boolean strategy • FS: feature selection based on FF • Pseudo-D/Q: pseudo relevant doc/query • Prior: logistic regression with Bayesian prior • GEC: generalized expectation criteria

Summary • Faceted feedback is useful for filtering, especially in cold-start scenarios • The Generalized Constraint Model (GCM) is a robust user profile learning algorithm • In future work, we will evaluate our methods on data sets where faceted features are more important • Movies, music, products, etc.

Questions? Filtering Semi-Structured Documents Based on Faceted Feedback Lanbo Zhang, Yi Zhang, Qianli Xing Information Retrieval and Knowledge Management (IRKM) Lab University of California, Santa Cruz lanbo@soe.ucsc.edu yiz@soe.ucsc.edu xingqianli@gmail.com
