Entity Categorization Over Large Document Collections

Arnd Christian König, Venkatesh Ganti, Rares Vernica (Microsoft Research). Relationship Extraction from Text. Task: Given a corpus of documents and entity-recognition logic, extract structured relations between entities from text.

Presentation Transcript


  1. Entity Categorization Over Large Document Collections
     Arnd Christian König, Venkatesh Ganti, Rares Vernica (Microsoft Research)

  2. Relationship Extraction from Text
     Task: Given a corpus of documents and entity-recognition logic, extract structured relations between entities from text.
     Entity and context examples:
     • “… Donald Knuth works in research …” => is-a-researcher(Donald_Knuth)
     • “… Yao Ming plays for the Houston Rockets …” => works-for(Yao_Ming, Houston_Rockets)
     Motivation: Going from unstructured data to structured data
     • Applications in search, business intelligence, etc.
     Focus:
     • Open relationship extraction vs. targeted extraction

  3. Relationship Extraction from Text
     Task: Given a corpus of documents and entity-recognition logic, extract structured relations between entities from text.
     Entity and context examples:
     • “… Donald Knuth works in research …” => is-a-researcher(Donald_Knuth)
     • “… Yao Ming plays for the Houston Rockets …” => works-for(Yao_Ming, Houston_Rockets)
     Motivation: Going from unstructured data to structured data
     • Applications in search, business intelligence, etc.
     Focus:
     • Open relationship extraction vs. targeted extraction
     • Large document collections (> 10^7 documents)

  4. Using Aggregate Context
     Single-context extraction:
     • Extraction logic: ‘[E] works … research’
     • “…[Entity] works in research…” => ([Entity], is-a-researcher)
     Multi-context extraction:
     • We track an entity across contexts, allowing us to combine less predictive features.
     • Contexts: “…[Entity]’s paper…”, “…[Entity] gave a talk…”, “…[Entity] published…”
     • Aggregate Context Features: [Entity], ‘paper’; [Entity], ‘talk’; [Entity], ‘published’
     • Multi-Feature Relation Extractor => ([Entity], is-a-researcher)
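The multi-context idea on this slide can be illustrated with a minimal sketch. It is not the authors' extractor: the cue-word set, the function names, and the "at least two cues" decision rule are assumptions made for illustration, standing in for the learned multi-feature extractor.

```python
import re
from collections import defaultdict

# Hypothetical cue words, taken from the slide's example features.
CUE_WORDS = {"paper", "talk", "published"}

def aggregate_context_features(contexts):
    """Map each entity to the set of cue words seen in any of its contexts.

    `contexts` is an iterable of (entity, text) pairs, one per mention.
    """
    features = defaultdict(set)
    for entity, text in contexts:
        tokens = set(re.findall(r"[a-z]+", text.lower()))
        features[entity] |= tokens & CUE_WORDS
    return dict(features)

def is_researcher(feature_set, min_features=2):
    # Toy stand-in for a trained classifier: fire when enough weak cues agree.
    return len(feature_set) >= min_features

contexts = [
    ("Donald_Knuth", "[Entity]'s paper on algorithms"),
    ("Donald_Knuth", "[Entity] gave a talk yesterday"),
    ("Donald_Knuth", "[Entity] published a book"),
    ("Yao_Ming", "[Entity] gave a talk to fans"),
]
features = aggregate_context_features(contexts)
```

No single context here is decisive on its own; the decision only fires once several weak cues for the same entity have been collected.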

  5. Using Co-occurrence Features
     Leverage co-occurrence of entity classes (e.g. directors likely co-occur with actors) for extraction.
     Example: Extraction of is-a-director relation:
     • “… Julia Roberts starred in a Robert Altman film in 1994 …”
     • Actor-List: Alan Alba, Richard Gere, Julia Roberts, …
     • Aggregate Context Features: Robert_Altman, co-occurs with actor name …
     Co-occurrence features can be between
     • Entities of different classes.
     • Entities of one class.
     Combination with text-features possible:
     • e.g., ‘[Entity] plays for [Team_Name]’.
     Two Questions:
     • What difference do the aggregate contexts make for extraction accuracy?
     • This means keeping track of contexts across documents - can we make this efficient?
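A minimal sketch of the co-occurrence feature from the is-a-director example follows. The function name, the feature label, and the naive substring matching are assumptions for illustration; the talk assumes separate entity-recognition logic that this sketch does not reproduce.

```python
def cooccurrence_features(sentence, entities, actor_list):
    """Return (entity, feature) pairs for entities that share a sentence
    with at least one *other* member of the actor list."""
    present = [e for e in entities if e in sentence]       # naive substring match
    actors_here = [a for a in actor_list if a in sentence]
    pairs = []
    for entity in present:
        if any(actor != entity for actor in actors_here):
            pairs.append((entity, "co-occurs-with-actor"))
    return pairs

sentence = "Julia Roberts starred in a Robert Altman film in 1994"
entities = ["Robert Altman", "Julia Roberts"]
actor_list = ["Richard Gere", "Julia Roberts"]
pairs = cooccurrence_features(sentence, entities, actor_list)
```

Here only Robert Altman receives the feature, because Julia Roberts is the sole list member in the sentence and an entity is not counted as co-occurring with itself.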

  6. Processing large Document Collections

  7. Architecture
     [Architecture diagram; recoverable labels follow.]
     • Inputs: Document Corpus D; Co-Occurrence List corpus L
     • Extraction: Single-Context Extraction; Agg. Feature Extraction; Co-Occurrence Detection
     • Intermediate output: Entity-Relation Pairs; Entity-Feature Pairs
     • Later stages: Context Feature Extraction; Aggregation; Classification; COUNT(entity, relation) > Δ
     • Duplicated overhead from document scanning, document processing, and entity extraction.
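One plausible reading of the `COUNT(entity, relation) > Δ` label is that single-context extractions are aggregated per (entity, relation) pair and a pair is accepted once its count exceeds a threshold. The sketch below illustrates that reading only; the function name and the parameter name `delta` are assumptions.

```python
from collections import Counter

def classify_by_count(extractions, delta):
    """Accept (entity, relation) pairs extracted more than `delta` times."""
    counts = Counter(extractions)
    return {pair for pair, n in counts.items() if n > delta}

extractions = (
    [("Donald_Knuth", "is-a-researcher")] * 3
    + [("Yao_Ming", "is-a-researcher")]
)
accepted = classify_by_count(extractions, delta=2)
```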

  8. New Architecture
     [Architecture diagram; recoverable labels: Document Corpus D; Co-Occurrence List corpus L; Synopsis of L; Rule-based Extraction; Agg. Feature Extraction; Co-Occurrence Detection; Entity – Candidate Context Pairs; Entity-Feature Pairs; Entity-List Pairs; Delete false Positives; List-Member Extraction; Context Feature Extraction; Aggregation; Classification.]
     • Frequency distribution of entities is very skewed.
     • Pruning based on retaining the most frequent entities and list members in memory.
     • Challenge: Determining frequencies online.
     • => Compact hash-synopses of frequencies (CM-Sketch) perform well.
     Challenges:
     1. Fast & accurate co-occurrence detection using the synopsis.
     2. Pruning of redundant output.
     • Fast identification of candidate matches through 2-stage filtering.
     • Use of Bloom filters to trade off memory footprint against false-positive rate.
     Potentially very large output:
     • Duplication, e.g. Entity: “George Bush”, Feature: ‘President’.
     • Duplication via very many co-occurrences, e.g. actor-actor.
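The slide names the Count-Min sketch as the hash synopsis used to estimate entity frequencies online. The following is a generic textbook Count-Min sketch, not the authors' implementation; the class name, the default `width` and `depth`, and the use of salted BLAKE2b as the hash family are assumptions.

```python
import hashlib

class CountMinSketch:
    """Fixed-size frequency synopsis; estimates never undercount."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One independent hash per row, derived by salting with the row number.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum is the tightest bound.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )

sketch = CountMinSketch()
for _ in range(5):
    sketch.add("George Bush")
sketch.add("Yao Ming")
```

A pruning step could then keep an entity in memory only when `sketch.estimate(entity)` exceeds some threshold. Because the estimate can overcount but never undercount, frequent entities are never dropped by mistake, while a few infrequent ones may be retained unnecessarily.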

  9. Experiments

  10. Experimental Evaluation
     • Task: Categorization of entities into professions (actor, writer, painter, etc.)
     • Document corpus: 3.2 million Wikipedia pages
     • Training data generated using Wikipedia lists of famous painters, writers, etc.
     • Aggregate-context classifier: linear SVM using text n-gram & co-occurrence features (binary)
     • Single-context classifier: 100K extraction rules (incl. gaps) derived from training data (algorithm of [König and Brill, KDD’06]).
     • Co-occurrence list: contains 10% of entity strings in training data.

  11. Experimental Evaluation: Accuracy

  12. Experimental Evaluation: Overhead
     • Main remaining overhead: writing of entity-feature pairs.
     • Simple caching strategy reduces this overhead by an order of magnitude.
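The transcript does not say what the caching strategy is. One simple possibility, sketched here purely as an assumption, is a bounded cache of recently written pairs that suppresses repeated writes of the same (entity, feature) pair, which loses nothing when features are binary. The class name, the LRU eviction policy, and the list standing in for disk output are all hypothetical.

```python
from collections import OrderedDict

class PairWriteCache:
    """Suppress repeated writes of recently seen (entity, feature) pairs."""

    def __init__(self, sink, capacity=10000):
        self.sink = sink            # anything with append(); a list stands in for disk
        self.capacity = capacity
        self.seen = OrderedDict()   # insertion order doubles as recency order

    def write(self, entity, feature):
        key = (entity, feature)
        if key in self.seen:
            self.seen.move_to_end(key)      # refresh recency, skip the write
            return False
        self.sink.append(key)
        self.seen[key] = True
        if len(self.seen) > self.capacity:
            self.seen.popitem(last=False)   # evict the least recently used pair
        return True

output = []
cache = PairWriteCache(output, capacity=2)
for _ in range(3):
    cache.write("George Bush", "President")
cache.write("Yao Ming", "plays")
```

With a skewed entity distribution, a small cache of this kind would absorb most duplicate writes, since the same frequent pairs recur constantly.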

  13. Conclusions
     • Studied the effect of aggregate context in relation extraction.
     • Proposed efficient processing techniques for large text corpora.
     • Both aggregate and co-occurrence features provide a significant increase in extraction accuracy compared to single-context classifiers.
     • The use of pruning techniques and approximate filters results in a significant reduction in the overall extraction overhead.

  14. Questions?
