Entity-Centric Document Filtering: Boosting Feature Mapping through Meta-Features
Mianwei Zhou, Kevin Chen-Chuan Chang
University of Illinois at Urbana-Champaign
Much of the Information Sought on the Web Nowadays Is about Entities.
• The Web: a huge entity database
• Businesses: "How to improve our products’ quality?"
• Fans: "We love George!!", "OMG! iPad Air is coming out~~"
• Editors (TREC-KBA task): "How to help Wikipedia editors enrich Wikipedia?"
Proposal: Entity-Centric Document Filtering System
Entity-Centric Document Filtering System: Automatically Identify Relevant Documents for Entities
• Input: billions of news articles, blogs, forums, tweets..., plus the entities of interest
• Output: relevant documents, separated from irrelevant documents
INPUT: The Entity Name Alone Is Usually Insufficient.
• Example: "Michael Jordan"
INPUT: Use an Identification Page to Characterize the Target Entity.
Entity identification pages:
• Resolve the ambiguity problem
• Provide more information about the entity
OUTPUT: Relevant/Irrelevant Documents for Target Entities.
Target entity: Bill Gates
• Relevant: "Bill Gates, speaking as co-founder of Microsoft, will give a talk next Tuesday ..."
• Irrelevant: "Steve Jobs’ story is completely different from Bill Gates ..."
Target entity: Michael Jordan (NBA Player)
• Relevant: "Michael Jordan is considered by many the best basketball player in NBA history"
• Irrelevant: "Michael Jordan is a leading researcher in machine learning and AI."
Problem: Entity-Centric Learning to Filter
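Problem: Entity-Centric Learning to Filter
• Training phase: entities with Wiki pages and documents labeled relevant/irrelevant are used to learn an entity-centric document filter
• Testing phase: for a new entity with a Wiki page, the filter predicts the labels (?) of unlabeled documents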
How to Predict Document Relevance for an Entity Characterized by an Identification Page?
• Traditional IR models, such as BM25 and language models, do not work well here:
• They are designed for short queries
• Entity pages contain many noisy keywords
Our Idea: Check whether the document mentions the most basic information about the entity.
• Example keywords: Seattle, Microsoft, Philanthropist, Windows
For an Entity with Labeled Documents, Learning Its Important Keywords Is Simple.
• Relevant document: "Bill Gates, speaking as co-founder of Microsoft, will give a talk next Tuesday ..."
• Irrelevant document: "Steve Jobs’ story is completely different from Bill Gates ..."
Relevance of document d for entity e = sum over keywords w of (importance of w) × (how many times w appears in d)
• The keyword counts serve as features; the keyword importance serves as the feature weighting
• High importance: Microsoft, founder, software, ...
• Low importance: Apple, from, ipad, ...
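The weighted keyword count on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `relevance_score`, the naive tokenizer, and the importance values are assumptions, not taken from the slides.

```python
from collections import Counter

def relevance_score(document, keyword_importance):
    """Sum of (keyword importance x keyword count) over the given keywords."""
    # Naive tokenizer (assumption): lowercase, split on whitespace,
    # strip basic punctuation from both ends of each token.
    tokens = [t.strip(".,!?'\"") for t in document.lower().split()]
    counts = Counter(tokens)  # missing words count as 0
    return sum(weight * counts[word] for word, weight in keyword_importance.items())

# Hypothetical importance weights for one entity (illustrative values only).
importance = {"microsoft": 2.0, "founder": 1.5, "software": 1.0,
              "apple": -1.0, "ipad": -0.5}

relevant = "Bill Gates, speaking as co-founder of Microsoft, will give a talk next Tuesday"
irrelevant = "Steve Jobs' story is completely different from Bill Gates"

print(relevance_score(relevant, importance))
print(relevance_score(irrelevant, importance))
```

With these made-up weights the first document scores higher than the second, because only the first one contains a high-importance keyword.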
However, Such Keyword Importance Is Not Adaptable to Other Entities.
• Training entities (with labeled documents): Seattle, Microsoft, Philanthropist, Windows
• New entities (without labeled documents): UNC, NBA, Chicago Bull, MVP
• Question: how can keyword importance be transferred from training entities to new entities?
Insight: Meta-feature Based Keyword Mapping
Two Keywords for Two Entities: Similar Properties, Similar Importance
• Keyword: Chicago Bull
• Keyword: Microsoft
Both of them...
• are mentioned a lot in their Wiki pages
• are organizations
• appear in the info-box
• ...
Meta-Feature (“Features of Features”): Properties That Are Related to Keyword Importance
• General meta-features: IDF, IsNoun, InEntity, ...
• ID-page-related meta-features:
• Wiki page: InInfobox, InOpenPara, ...
• Amazon page: InSpec, InReview, ...
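To make the idea of meta-features concrete, here is a minimal sketch that computes a few of them for one keyword. The ID page layout (a dict with `opening_paragraph`, `infobox`, and `body` strings), the smoothed IDF formula, and the function name `meta_features` are assumptions made for illustration; the slides do not specify how each meta-feature is computed.

```python
import math

def meta_features(keyword, id_page, doc_freq, num_docs):
    """Return a dict of meta-features for one keyword of one entity."""
    kw = keyword.lower()
    body_tokens = id_page["body"].lower().split()
    opening_tokens = id_page["opening_paragraph"].lower().split()
    infobox_tokens = id_page["infobox"].lower().split()
    return {
        # General meta-feature: smoothed inverse document frequency (assumed formula).
        "IDF": math.log((num_docs + 1) / (doc_freq.get(kw, 0) + 1)),
        # ID-page-related meta-features.
        "IDPageTF": body_tokens.count(kw),   # term frequency in the ID page body
        "InInfobox": kw in infobox_tokens,
        "InOpenPara": kw in opening_tokens,
    }

# Hypothetical, hand-written ID page and collection statistics.
page = {
    "opening_paragraph": "Bill Gates is a co-founder of Microsoft",
    "infobox": "Occupation philanthropist Known for Microsoft",
    "body": "Gates co-founded Microsoft . Microsoft released Windows .",
}
doc_freq = {"microsoft": 9, "the": 99}
num_docs = 99

print(meta_features("Microsoft", page, doc_freq, num_docs))
print(meta_features("the", page, doc_freq, num_docs))
```

Because every keyword of every entity is described in the same meta-feature space, keywords of different entities become comparable even though the words themselves differ.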
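Clustering-based Keyword Mapping
• Training phase: keywords of the training entities (Hollywood, Microsoft, Harvard, NKU, CFR, Cascade, ..., as well as words such as here, the, is, as, this, a) are grouped into clusters, and a keyword weighting is learned for each cluster
• Testing phase: keywords from a new entity's Wiki page (NBA, UNC, Bobcats, ..., as well as words such as the, there, must) are mapped to the same clusters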
Document Relevance Based on Keyword Clusters
• Equation labels: keyword importance, keyword clusters
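The equation on this slide did not survive in the transcript. One plausible reading, stated here as an assumption, is that every keyword inherits the importance of its cluster, so relevance becomes a sum over clusters of (cluster importance) × (number of occurrences of that cluster's keywords in the document). A minimal sketch under that assumption, with hypothetical cluster names and weights:

```python
from collections import Counter

def cluster_relevance(doc_tokens, keyword_to_cluster, cluster_importance):
    """Score a document using cluster-level importance instead of per-keyword weights."""
    counts = Counter(doc_tokens)
    score = 0.0
    for word, n in counts.items():
        cluster = keyword_to_cluster.get(word)  # None if the word is not an entity keyword
        if cluster is not None:
            score += cluster_importance[cluster] * n
    return score

# Hypothetical mapping for a new entity; cluster names and weights are made up.
keyword_to_cluster = {"nba": "organization", "unc": "organization",
                      "bobcats": "organization",
                      "the": "function_word", "there": "function_word",
                      "must": "function_word"}
cluster_importance = {"organization": 1.0, "function_word": 0.0}

doc = ["the", "bobcats", "nba", "must", "win"]
print(cluster_relevance(doc, keyword_to_cluster, cluster_importance))
```

The point of the design is that `cluster_importance` is learned on training entities, while `keyword_to_cluster` can be built for any new entity from meta-features alone.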
Traditional Clustering Algorithms Might Fail
1. Irrelevant meta-features might lead to useless clusters.
• Example keywords: Occupation, Oscar, the, October, Hollywood, actor, WA, programmer, is, screenwriter, for, MS, consistently, ...
2. There are different possible ways of clustering. Which one is better?
BoostMapping: Boosting Effective Clusters
• Objective of clustering: boosting the prediction accuracy of relevance, guided by the document labels
• Example keywords: Hollywood, Microsoft, Harvard, NKU, CFR, Cascade, here, the, is, this, as, a, ...
• Only useful clusters are generated.
BoostMapping: 2. Enumerate Conditions to Generate the Most Predictive Cluster.
• Goal: achieve the highest prediction accuracy
• Cluster: IDF<1.45, Is_Organization, IDPageTF>=3, ...
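The enumeration step can be sketched as a search over candidate conditions on meta-features. Several simplifications here are assumptions rather than details from the slides: each candidate is a single condition (the slide shows conjunctions of conditions), a document is predicted relevant if and only if it contains at least one keyword from the cluster, and accuracy is weighted by the current document distribution. All names and values are hypothetical.

```python
def best_condition(docs, labels, weights, candidates):
    """Pick the candidate condition whose cluster gives the highest weighted accuracy.

    docs:       list of documents; each is a list of meta-feature dicts, one per
                entity keyword occurring in that document.
    labels:     +1 (relevant) or -1 (irrelevant) for each document.
    weights:    current document distribution (non-negative, sums to 1).
    candidates: list of (name, predicate) pairs; a predicate maps a meta-feature
                dict to True when the keyword belongs to the cluster.
    """
    best = None
    for name, in_cluster in candidates:
        # Weak rule (assumption): relevant iff the document has a cluster keyword.
        preds = [1 if any(in_cluster(k) for k in doc) else -1 for doc in docs]
        acc = sum(w for w, p, y in zip(weights, preds, labels) if p == y)
        if best is None or acc > best[1]:
            best = (name, acc, preds)
    return best

# Toy data: four documents described by the meta-features of their keywords.
docs = [
    [{"IDF": 3.2, "Is_Organization": True, "IDPageTF": 5}],
    [{"IDF": 0.4, "Is_Organization": False, "IDPageTF": 7}],
    [{"IDF": 2.9, "Is_Organization": True, "IDPageTF": 4},
     {"IDF": 0.3, "Is_Organization": False, "IDPageTF": 9}],
    [],
]
labels = [1, -1, 1, -1]
weights = [0.25, 0.25, 0.25, 0.25]
candidates = [
    ("IDF<1.45", lambda k: k["IDF"] < 1.45),
    ("Is_Organization", lambda k: k["Is_Organization"]),
    ("IDPageTF>=3", lambda k: k["IDPageTF"] >= 3),
]

name, acc, preds = best_condition(docs, labels, weights, candidates)
print(name, acc)
```

On this toy data the `Is_Organization` condition separates the relevant from the irrelevant documents, so it is selected over the two threshold conditions.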
BoostMapping: 3. Update the Document Distribution
• Current cluster: IDF<1.45, Is_Organization, IDPageTF>=3, ...
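The slides do not state the update rule. As a stand-in, the sketch below uses the standard AdaBoost reweighting, which is an assumption about what "update the document distribution" means here: documents that the current cluster misclassifies gain weight, so the next cluster focuses on them.

```python
import math

def update_distribution(weights, preds, labels):
    """AdaBoost-style reweighting (assumed rule, not confirmed by the slides)."""
    # Weighted error of the current cluster's predictions.
    err = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
    err = min(max(err, 1e-10), 1 - 1e-10)    # clamp to keep the logarithm finite
    alpha = 0.5 * math.log((1 - err) / err)  # vote strength of this cluster
    # p * y is +1 for a correct prediction and -1 for a mistake.
    new = [w * math.exp(-alpha * p * y) for w, p, y in zip(weights, preds, labels)]
    total = sum(new)
    return [w / total for w in new], alpha

weights = [0.25, 0.25, 0.25, 0.25]
preds = [1, 1, 1, -1]     # the second document is misclassified
labels = [1, -1, 1, -1]

new_weights, alpha = update_distribution(weights, preds, labels)
print(new_weights, alpha)
```

In this example the single misclassified document ends up with half of the total weight, while the three correctly classified documents share the other half.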
BoostMapping: 4. Generate the Next Cluster Under the Current Document Distribution
• New cluster: IDF <= 1.45, Is_Infobox = False, ....
• Previous cluster: IDF<1.45, Is_Organization, IDPageTF>=3, ...
BoostMapping: 5. Repeat the Process Until the Prediction Accuracy Converges
• Update the document distribution again
• Clusters so far: (IDF <= 1.45, Is_Infobox = False, ....) and (IDF<1.45, Is_Organization, IDPageTF>=3, ...)
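Steps 2 to 5 can be put together in one loop. The sketch below is not the authors' implementation; it is an AdaBoost-style loop under stated assumptions: a uniform initial document distribution, single-condition candidates, the weak rule "relevant iff the document contains a cluster keyword", the standard AdaBoost weight update, and "convergence" interpreted as the ensemble's training accuracy no longer improving. The function name `boost_clusters` and the toy data are hypothetical.

```python
import math

def boost_clusters(docs, labels, candidates, max_rounds=10):
    """Greedily select (condition name, vote strength) pairs, one per round."""
    n = len(docs)
    weights = [1.0 / n] * n   # uniform initial distribution (assumption)
    scores = [0.0] * n        # running ensemble score per document
    clusters, prev_acc = [], -1.0
    for _ in range(max_rounds):
        # Enumerate conditions; keep the one with the highest weighted accuracy.
        best_name, best_acc, best_preds = None, -1.0, None
        for name, in_cluster in candidates:
            preds = [1 if any(in_cluster(k) for k in doc) else -1 for doc in docs]
            acc = sum(w for w, p, y in zip(weights, preds, labels) if p == y)
            if acc > best_acc:
                best_name, best_acc, best_preds = name, acc, preds
        if best_acc <= 0.5:
            break             # no candidate beats chance under this distribution
        err = min(max(1.0 - best_acc, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        # Convergence check: would adding this cluster improve training accuracy?
        trial = [s + alpha * p for s, p in zip(scores, best_preds)]
        acc = sum(1 for s, y in zip(trial, labels) if (1 if s > 0 else -1) == y) / n
        if acc <= prev_acc:
            break
        scores, prev_acc = trial, acc
        clusters.append((best_name, alpha))
        # Update the document distribution: misclassified documents gain weight.
        weights = [w * math.exp(-alpha * p * y)
                   for w, p, y in zip(weights, best_preds, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return clusters

# Toy data: each document lists the meta-features of the entity keywords it contains.
docs = [
    [{"Is_Organization": True, "IDPageTF": 5}],
    [{"Is_Organization": True, "IDPageTF": 5}],
    [{"Is_Organization": False, "IDPageTF": 4}],
    [{"Is_Organization": False, "IDPageTF": 9}],
    [],
    [{"Is_Organization": True, "IDPageTF": 1}],
]
labels = [1, 1, 1, -1, -1, -1]
candidates = [
    ("Is_Organization", lambda k: k["Is_Organization"]),
    ("IDPageTF>=3", lambda k: k["IDPageTF"] >= 3),
]

print(boost_clusters(docs, labels, candidates))
```

On this toy data the first round selects `IDPageTF>=3` (five of six documents correct); the second round proposes `Is_Organization`, but adding it does not raise training accuracy, so the loop stops with one cluster. A real implementation would likely use a held-out set or a tolerance for the stopping test.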
Three Datasets
• TREC-KBA
• 29 person entities, 52,238 documents
• Wikipedia pages as ID pages
• Product
• 39 product entities, 2,398 documents
• Amazon pages as ID pages
• MilQuery (from the Million Query Track)
• 143 general entities, 8,208 documents
• Wikipedia pages as ID pages
• Examples: Dinosaur, Hostage Rescue, Kodak
Performance Comparison with Baselines
• QBD-TFIDF: use TF-IDF to select important keywords as queries
• QueryByName: use entity names as queries
• VectorSim: measure relevance based on query-document similarity
• LinearMapping: keyword mapping based on a linear function
Thanks! Q&A