120 likes | 245 Views
Entity Categorization Over Large Document Collections. Presenter : Shu-Ya Li Authors : Venkatesh Ganti , Arnd Christian König , Rares Vernica. KDD, 2008 . Outline. Motivation Objective Methodology Experiments and Results Conclusion Comments. Motivation. Prior approaches.
E N D
Entity Categorization Over Large Document Collections Presenter : Shu-Ya Li Authors : VenkateshGanti, Arnd Christian König, RaresVernica KDD, 2008
Outline • Motivation • Objective • Methodology • Experiments and Results • Conclusion • Comments
Motivation • Prior approaches • But… Entity • companies • [Entity] • present results • … Donald Knuth • works in research … is-a-researcher (Donald_Knuth) is-a-researcher (Entity)? Context [Entity] publish • newspapers • Going from unstructured data to structured data • Extracting entities (people, movies) from documents and identifying the categories (painter, writer, actor) • Most prior approaches (unary relation extraction) • only analyzed the local document context within which entities occur.
Objectives } “…[Entity]’s paper…” [Entity], ‘paper’ [Entity], ‘talk’ [Entity], ‘published’ ([Entity], is-a-researcher) “…[Entity] gave a talk…” “…[Entity] published…” Multi-Feature Relation Extractor • In this paper, we improve the accuracy of entity categorization by • considering an entity’s context across multiple documents • exploiting existing large lists of related entities
Methodology … Julia Roberts starred in Pretty Woman in 1988 … (Yao_Ming, is-a-athlete) Actor-List Feature: Co-occurrence between entityand actor name in context. Ex: Extraction of is-a-movie relation Alan Alba Richard Gere Julia Roberts … actor name Entity (Pretty Woman , is-a-movie)
Methodology - Processing large Document Collections Classification Classifiers C Aggregation List-Member Extraction Context Feature Extraction Entity-List Pairs .retaining the most important list members Verification (Delete false Positives) Entity-Feature Pairs a known set of directors (as ε) a list of actors (as ) 3.2 million documents from Wiki Entity – Candidate Context Pairs } E1: Pretty Woman E2: Mystic Pizza E3:Doubt E4: Duplicity E5:Enchanted … Amy Adams ElizabethReaser JuliaRoberts TaraReid JudyReyes … Actors list n-gram Extraction Rule-based Extraction List-Member Detection wiki Co-Occurrence List corpus L Document Corpus D Synopsis of L
Methodology - Processing large Document Collections Classification Classifiers C … Julia Roberts starred in Pretty Woman in 1988 … Aggregation List-Member Extraction Context Feature Extraction .Scanning D once {Julia, Roberts, starred, Pretty, Woman, Julia Roberts, Pretty Woman, … } Entity-List Pairs 1. the large amount of data written 2. not expected to contain an entity is a member of a list Verification (Delete false Positives) .Our Approach – Bloom Filter {starred, Pretty, Woman, Pretty Woman, … } Entity-Feature Pairs Entity – Candidate Context Pairs (Julia Robert, starred) (Julia Robert, Pretty) (Julia Robert, Woman) (Julia Robert, Pretty Woman) Verification n-gram Extraction Rule-based Extraction List-Member Detection Co-Occurrence List corpus L Document Corpus D Synopsis of L
Conclusion Studied the effect of aggregate context in relation extraction. Proposed efficient processing techniques for large text corpora. Both aggregate and co-occurrence features provide significant increase in extraction accuracy compared to single-context classifiers.
Comments • Advantage • The first half of this paper is clear. • Drawback • But the first half of this paper isn’t clear. • Application • Entity categorization