Entity Categorization Over Large Document Collections

Arnd Christian König, Venkatesh Ganti, Rares Vernica (Microsoft Research)


Relationship Extraction from Text
Task: Given a corpus of documents and entity-recognition logic, extract structured relations between entities from text.
Examples (entity mention plus its context):
• "… Donald Knuth works in research …" yields is-a-researcher(Donald_Knuth)
• "… Yao Ming plays for the Houston Rockets …" yields works-for(Yao_Ming, Houston_Rockets)
Motivation: Going from unstructured data to structured data.
• Applications in search, business intelligence, etc.
Focus:
• Open relationship extraction vs. targeted extraction
• Large document collections (> 10^7 documents)

Using Aggregate Context
Extraction logic: '[E] works … research'
• Single-context extraction: "…[Entity] works in research…" yields ([Entity], is-a-researcher).
• Multi-context extraction: we track an entity across contexts, allowing us to combine less predictive features.
  - Contexts: "…[Entity]'s paper…", "…[Entity] gave a talk…", "…[Entity] published…"
  - Aggregate context features: ([Entity], 'paper'), ([Entity], 'talk'), ([Entity], 'published')
  - A multi-feature relation extractor maps these features to ([Entity], is-a-researcher).
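The multi-context idea can be illustrated with a minimal Python sketch. It is not the authors' extractor: the function name `aggregate_context_features`, the whitespace tokenizer, and the sample contexts are assumptions made for illustration. It only shows how weak per-context cues are pooled per entity before any classification happens.

```python
from collections import defaultdict

def aggregate_context_features(contexts, cue_words):
    """Pool, per entity, the cue words seen across all of its contexts."""
    features = defaultdict(set)
    for entity, text in contexts:
        tokens = set(text.lower().split())   # naive tokenizer (assumption)
        features[entity] |= tokens & cue_words
    return dict(features)

# Hypothetical (entity, context) pairs, as if produced by an entity recognizer.
contexts = [
    ("Donald_Knuth", "[Entity]'s paper on algorithms"),
    ("Donald_Knuth", "[Entity] gave a talk"),
    ("Donald_Knuth", "[Entity] published a book"),
    ("Yao_Ming", "[Entity] plays for the Houston Rockets"),
]
cue_words = {"paper", "talk", "published"}
agg = aggregate_context_features(contexts, cue_words)
```

No single context above matches the rule '[E] works … research', yet the pooled feature set for Donald_Knuth contains all three cues, which is what a multi-feature extractor would consume.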

Using Co-occurrence Features
Leverage co-occurrence of entity classes (e.g., directors likely co-occur with actors) for extraction.
Example: extraction of the is-a-director relation.
• Actor list: Alan Alba, Richard Gere, Julia Roberts, …
• Text: "… Julia Roberts starred in a Robert Altman film in 1994 …"
• Aggregate context feature: (Robert_Altman, co-occurs with actor name)
Co-occurrence features can be between:
• Entities of different classes.
• Entities of one class.
Combination with text features is possible, e.g., '[Entity] plays for [Team_Name]'.
Two questions:
• What difference do the aggregate contexts make for extraction accuracy?
• This means keeping track of contexts across documents. Can we make this efficient?
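As a rough illustration of the is-a-director example, the sketch below derives a co-occurrence feature for every entity that shares a document with a member of a known class list. The function name, the feature-string format, and the per-document entity lists are assumptions for illustration; the slides do not specify how co-occurrence is scoped or encoded.

```python
def cooccurrence_features(doc_entities, class_lists):
    """For each entity, record which list classes it shares a document with."""
    features = {}
    for entities in doc_entities:
        for entity in entities:
            others = set(entities) - {entity}
            for cls, members in class_lists.items():
                if others & members:
                    features.setdefault(entity, set()).add("co-occurs-with-" + cls)
    return features

# Hypothetical inputs: one class list and the entities found in two documents.
class_lists = {"actor": {"Richard Gere", "Julia Roberts"}}
doc_entities = [["Julia Roberts", "Robert Altman"], ["Robert Altman"]]
feats = cooccurrence_features(doc_entities, class_lists)
```

Here only Robert Altman receives the feature, because he appears alongside a listed actor in the first document; Julia Roberts has no listed actor other than herself nearby.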

Processing Large Document Collections

Architecture
[Diagram components: Document Corpus D; Co-Occurrence List corpus L; Single-Context Extraction; Agg. Feature Extraction; Co-Occurrence Detection; Context Feature Extraction; Entity-Relation Pairs; Entity-Feature Pairs; Aggregation; Classification with COUNT(entity, relation) > Δ]
• Duplicated overhead from:
  - Document scanning
  - Document processing
  - Entity extraction
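The diagram labels the classification step with COUNT(entity, relation) > Δ. One plausible reading, sketched below under that assumption, is that single-context extractions are aggregated per (entity, relation) pair and a pair is accepted only when its count exceeds the threshold. The function name and sample data are hypothetical.

```python
from collections import Counter

def classify_by_count(extractions, delta):
    """Accept an (entity, relation) pair only if it was extracted more than delta times."""
    counts = Counter(extractions)
    return {pair for pair, n in counts.items() if n > delta}

# Hypothetical output of single-context extraction over a corpus.
extractions = (
    [("Donald_Knuth", "is-a-researcher")] * 3
    + [("Yao_Ming", "is-a-researcher")]      # a single, likely spurious match
)
accepted = classify_by_count(extractions, delta=2)
```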

New Architecture
[Diagram components: Document Corpus D; Co-Occurrence List corpus L; Synopsis of L; Rule-based Extraction; Agg. Feature Extraction; Co-Occurrence Detection; Context Feature Extraction; List-Member Extraction; Entity-Feature Pairs; Entity-List Pairs; Entity-Candidate Context Pairs; Delete False Positives; Aggregation; Classification]
Challenges:
1. Fast & accurate co-occurrence detection using the synopsis.
2. Pruning of redundant output.
• Frequency distribution of entities is very skewed.
• Pruning is based on retaining the most frequent entities and list members in memory.
• Challenge: determining frequencies online.
• Compact hash synopses of frequencies (CM-Sketch) perform well.
• Fast identification of candidate matches through 2-stage filtering.
• Use of Bloom filters to trade off memory footprint against false-positive rate.
Potentially very large output:
• Duplication of entity-feature pairs, e.g., Entity: "George Bush", Feature: 'President'.
• Duplication via very many co-occurrences, e.g., actor-actor.
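The two synopsis structures named on this slide can be sketched in a few lines of Python. This is a textbook Count-Min sketch and Bloom filter, not the authors' implementation; the class names, sizes, the seeded BLAKE2 hashing, and the helper `find_list_members` are assumptions for illustration. The in-memory `exact_list` is also a simplification: it stands in for whatever later step deletes false positives.

```python
import hashlib

def _hash(item, seed, modulus):
    """Deterministic seeded hash of a string into range(modulus)."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little") % modulus

class CountMinSketch:
    """Approximate frequency counts in fixed memory."""
    def __init__(self, width, depth):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][_hash(item, row, self.width)] += count

    def estimate(self, item):
        # Never underestimates: collisions can only inflate a counter.
        return min(self.table[row][_hash(item, row, self.width)]
                   for row in range(self.depth))

class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""
    def __init__(self, num_bits, num_hashes):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def add(self, item):
        for seed in range(self.num_hashes):
            pos = _hash(item, seed, self.num_bits)
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        for seed in range(self.num_hashes):
            pos = _hash(item, seed, self.num_bits)
            if not self.bits[pos // 8] & (1 << (pos % 8)):
                return False
        return True

def find_list_members(candidates, bloom, exact_list):
    """Stage 1: cheap Bloom test. Stage 2: exact check removes false positives."""
    return [c for c in candidates if c in bloom and c in exact_list]

# Hypothetical usage: count entity frequencies online, then filter candidates.
sketch = CountMinSketch(width=1024, depth=4)
for name in ["George Bush"] * 5 + ["Julia Roberts"] * 2:
    sketch.add(name)

exact_list = {"Julia Roberts", "Richard Gere"}
bloom = BloomFilter(num_bits=4096, num_hashes=3)
for member in exact_list:
    bloom.add(member)
matches = find_list_members(["Julia Roberts", "Robert Altman"], bloom, exact_list)
```

Increasing `num_bits` lowers the Bloom filter's false-positive rate at the cost of memory, which is the trade-off the slide refers to. The sketch's estimates can serve as the online frequencies that decide which entities and list members stay in memory.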

Experiments

Experimental Evaluation
• Task: Categorization of entities into professions (actor, writer, painter, etc.).
• Document corpus: 3.2 million Wikipedia pages.
• Training data generated using Wikipedia lists of famous painters, writers, etc.
• Aggregate-context classifier: linear SVM using text n-gram and co-occurrence features (binary).
• Single-context classifier: 100K extraction rules (incl. gaps) derived from training data (algorithm of [König and Brill, KDD'06]).
• Co-occurrence list: contains 10% of entity strings in training data.
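To make "binary features with a linear classifier" concrete, the sketch below encodes an entity's aggregated features as a 0/1 vector and applies a linear decision function of the kind a trained linear SVM produces. The vocabulary, weights, and bias are invented for illustration and are not values from the experiments; no SVM training is shown.

```python
def encode_binary(features, vocabulary):
    """Map a feature set to a 0/1 vector ordered by the vocabulary."""
    return [1 if f in features else 0 for f in vocabulary]

def linear_decision(x, weights, bias):
    """Linear decision value w.x + b; a positive value means the class is assigned."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

# Hypothetical vocabulary mixing n-gram and co-occurrence features.
vocabulary = ["ngram:paper", "ngram:talk", "cooc:actor"]
weights = [1.0, 0.5, -1.0]   # made-up weights for an "is-a-researcher" classifier
bias = -0.75

x = encode_binary({"ngram:paper", "ngram:talk"}, vocabulary)
score = linear_decision(x, weights, bias)
```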

Experimental Evaluation: Accuracy

Experimental Evaluation: Overhead
• Main remaining overhead: writing of entity-feature pairs.
• A simple caching strategy reduces this overhead by an order of magnitude.
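The slides do not describe the caching strategy itself. One plausible form, sketched here purely as an assumption, is a bounded LRU cache of recently written (entity, feature) pairs that suppresses repeated writes; since the features are binary, a repeated pair carries no new information. The class name `DedupWriter` and the list used as an output sink are hypothetical.

```python
from collections import OrderedDict

class DedupWriter:
    """Write (entity, feature) pairs, skipping pairs still held in an LRU cache."""
    def __init__(self, sink, capacity):
        self.sink = sink              # any object with append(); stands in for disk output
        self.capacity = capacity
        self.cache = OrderedDict()

    def write(self, entity, feature):
        key = (entity, feature)
        if key in self.cache:
            self.cache.move_to_end(key)      # refresh recency, skip the write
            return False
        self.sink.append(key)
        self.cache[key] = None
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict the least recently used pair
        return True

sink = []
writer = DedupWriter(sink, capacity=2)
pairs = [
    ("George Bush", "President"),
    ("George Bush", "President"),
    ("George Bush", "Texas"),
    ("George Bush", "President"),
]
results = [writer.write(entity, feature) for entity, feature in pairs]
```

Because entity frequencies are highly skewed, even a small cache of this kind would catch many repeats of frequent pairs; pairs evicted from the cache may still be written more than once and would need deduplication during aggregation.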

Conclusions
• Studied the effect of aggregate context in relation extraction.
• Proposed efficient processing techniques for large text corpora.
• Both aggregate and co-occurrence features provide a significant increase in extraction accuracy compared to single-context classifiers.
• The use of pruning techniques and approximate filters results in a significant reduction in the overall extraction overhead.

Questions?
