
Entity Categorization Over Large Document Collections

Entity Categorization Over Large Document Collections. Presenter: Shu-Ya Li. Authors: Venkatesh Ganti, Arnd Christian König, Rares Vernica. KDD, 2008.


Presentation Transcript


  1. Entity Categorization Over Large Document Collections. Presenter: Shu-Ya Li. Authors: Venkatesh Ganti, Arnd Christian König, Rares Vernica. KDD, 2008

  2. Outline • Motivation • Objective • Methodology • Experiments and Results • Conclusion • Comments

  3. Motivation • Going from unstructured data to structured data • Extracting entities (people, movies) from documents and identifying their categories (painter, writer, actor) • Most prior approaches (unary relation extraction) analyzed only the local document context within which an entity occurs. • [Figure: prior approaches. The context “… Donald Knuth works in research …” yields is-a-researcher(Donald_Knuth). But… for other contexts, such as companies: “[Entity] present results”, or newspapers: “[Entity] publish”, is-a-researcher(Entity)?]

  4. Objectives • In this paper, we improve the accuracy of entity categorization by • considering an entity’s context across multiple documents • exploiting existing large lists of related entities • [Figure: Multi-Feature Relation Extractor. The contexts “…[Entity]’s paper…”, “…[Entity] gave a talk…”, and “…[Entity] published…” yield the features ([Entity], ‘paper’), ([Entity], ‘talk’), ([Entity], ‘published’), which together give ([Entity], is-a-researcher).]
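
The idea of considering an entity's context across multiple documents can be illustrated with a minimal sketch. It is not the authors' extractor: the function name `aggregate_context_features`, the whitespace tokenization, and the fixed cue-word set are all assumptions made for illustration. It only shows how per-mention evidence ('paper', 'talk', 'published') could be pooled into one feature vector per entity before classification.

```python
from collections import defaultdict


def aggregate_context_features(mentions, cue_words):
    """Pool cue-word counts per entity over all of its mentions.

    mentions:  iterable of (entity, context_text) pairs, one per occurrence
               of the entity in any document (hypothetical input format).
    cue_words: set of lowercase words treated as features (an assumption).
    Returns a dict: entity -> {cue_word: number of contexts containing it}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for entity, context in mentions:
        tokens = set(context.lower().split())
        for word in cue_words:
            if word in tokens:
                counts[entity][word] += 1
    return {entity: dict(feats) for entity, feats in counts.items()}


# Three mentions of the same entity, each from a different document.
mentions = [
    ("Donald Knuth", "[Entity] 's paper on algorithms"),
    ("Donald Knuth", "[Entity] gave a talk yesterday"),
    ("Donald Knuth", "[Entity] published a new volume"),
]
features = aggregate_context_features(mentions, {"paper", "talk", "published"})
print(features)
```

No single context above is conclusive on its own, but the pooled vector carries all three cues, which is the kind of aggregate evidence a multi-feature classifier could use.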

  5. Methodology • Actor-List Feature: co-occurrence between an entity and an actor name in context. • Ex: extraction of the is-a-movie relation • Actor list: Alan Alba, Richard Gere, Julia Roberts, … • Context: “… Julia Roberts starred in Pretty Woman in 1988 …” (actor name: Julia Roberts; entity: Pretty Woman) • Result: (Pretty Woman, is-a-movie) • Example pair for another relation: (Yao_Ming, is-a-athlete)
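
A minimal sketch of the actor-list co-occurrence feature from this slide, under simplifying assumptions: plain substring matching stands in for real entity and list-member detection, and the function name `cooccurrence_count` is hypothetical. It counts the contexts in which a candidate entity appears together with any member of a related list.

```python
def cooccurrence_count(candidate, contexts, related_list):
    """Count contexts that mention the candidate and at least one list member.

    Substring matching is a simplification for illustration only.
    """
    hits = 0
    for text in contexts:
        if candidate in text and any(member in text for member in related_list):
            hits += 1
    return hits


actors = {"Richard Gere", "Julia Roberts"}
contexts = [
    "Julia Roberts starred in Pretty Woman in 1988",  # example from the slide
    "Pretty Woman is playing downtown",               # no actor mentioned
]
count = cooccurrence_count("Pretty Woman", contexts, actors)
print(count)  # 1
```

A high count relative to the number of mentions would be one signal, among others, that the candidate belongs to the is-a-movie category.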

  6. Methodology - Processing large Document Collections • [Figure: processing pipeline. Inputs: Document Corpus D (3.2 million documents from Wiki), List corpus L, Synopsis of L (retaining the most important list members). Stages: n-gram Extraction, Rule-based Extraction, List-Member Detection; Entity – Candidate Context Pairs; Verification (delete false positives); Entity-Feature Pairs and Entity-List Pairs; Context Feature Extraction, List-Member Extraction, Co-Occurrence; Aggregation; Classification with Classifiers C.] • Example: a known set of directors (as ε), a list of actors (as L) • Entities: E1: Pretty Woman, E2: Mystic Pizza, E3: Doubt, E4: Duplicity, E5: Enchanted, … • Actors list: Amy Adams, Elizabeth Reaser, Julia Roberts, Tara Reid, Judy Reyes, …

  7. Methodology - Processing large Document Collections • Scanning D once • Context: “… Julia Roberts starred in Pretty Woman in 1988 …” • n-gram Extraction: {Julia, Roberts, starred, Pretty, Woman, Julia Roberts, Pretty Woman, …} • 1. the large amount of data written 2. not expected to contain an entity that is a member of a list • Our Approach – Bloom Filter (Synopsis of L), followed by Verification (delete false positives) • Candidate contexts: {starred, Pretty, Woman, Pretty Woman, …} • Entity – Candidate Context Pairs: (Julia Roberts, starred), (Julia Roberts, Pretty), (Julia Roberts, Woman), (Julia Roberts, Pretty Woman) • [Figure: same pipeline as on the previous slide: Document Corpus D, List corpus L, Synopsis of L; n-gram Extraction, Rule-based Extraction, List-Member Detection; Verification; Entity-Feature Pairs, Entity-List Pairs; Context Feature Extraction, List-Member Extraction, Co-Occurrence; Aggregation; Classification with Classifiers C.]
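
The slide names a Bloom filter as the synopsis of the list corpus and a later verification step that deletes false positives. The sketch below illustrates that combination under assumed details: the filter size, the number of hash functions, the SHA-256 based hashing, and the names `BloomFilter` and `ngrams` are choices made for this example, not taken from the paper.

```python
import hashlib


class BloomFilter:
    """Compact set synopsis: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def ngrams(tokens, max_n=2):
    """Yield all word n-grams of length 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


actors = ["Amy Adams", "Julia Roberts", "Tara Reid"]
synopsis = BloomFilter()
for name in actors:
    synopsis.add(name)

tokens = "Julia Roberts starred in Pretty Woman".split()
# List-member detection: cheap in-memory test against the synopsis.
candidates = [g for g in ngrams(tokens) if g in synopsis]
# Verification: exact lookup removes any Bloom-filter false positives.
actor_set = set(actors)
verified = [g for g in candidates if g in actor_set]
print(verified)  # ['Julia Roberts']
```

The point of the design is that most n-grams are rejected by the small synopsis during the single scan of D, so only a few candidates need the exact and more expensive membership check.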

  8. Experiments

  9. Conclusion • Studied the effect of aggregate context in relation extraction. • Proposed efficient processing techniques for large text corpora. • Both aggregate and co-occurrence features provide a significant increase in extraction accuracy compared to single-context classifiers.

  10. Comments • Advantage • The first half of this paper is clear. • Drawback • But the first half of this paper isn’t clear. • Application • Entity categorization
