Entity Categorization Over Large Document Collections
Arnd Christian König, Venkatesh Ganti, Rares Vernica
Microsoft Research

Relationship Extraction from Text
Task: Given a corpus of documents and entity-recognition logic, extract structured relations between entities from text.
• "… Donald Knuth works in research …" (entity: Donald Knuth; context: works in research) => is-a-researcher(Donald_Knuth)
• "… Yao Ming plays for the Houston Rockets …" => works-for(Yao_Ming, Houston_Rockets)
Motivation: Going from unstructured data to structured data
• Applications in search, business intelligence, etc.
Focus:
• Open relationship extraction vs. targeted extraction
• Large document collections (> 10^7 documents)

Using Aggregate Context
Single-context extraction:
• Extraction logic: '[E] works … research'
• "…[Entity] works in research…" => ([Entity], is-a-researcher)
Multi-context extraction:
• We track an entity across contexts, allowing us to combine less predictive features.
• Contexts: "…[Entity]'s paper…", "…[Entity] gave a talk…", "…[Entity] published…"
• Aggregate context features: ([Entity], 'paper'), ([Entity], 'talk'), ([Entity], 'published')
• Multi-feature relation extractor => ([Entity], is-a-researcher)

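The multi-context idea above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes entity mentions have already been tagged as single underscore-joined tokens (e.g. `donald_knuth`), and the function name `aggregate_context_features` and the `window` parameter are hypothetical.

```python
from collections import defaultdict

def aggregate_context_features(documents, entities, window=3):
    """Collect, per entity, the set of words seen near ANY of its mentions.

    Assumption: entity mentions are pre-tagged as single lower-case tokens.
    """
    features = defaultdict(set)
    for doc in documents:
        # Crude tokenizer: split on whitespace, strip punctuation, lower-case.
        tokens = [t.strip(".,;:!?\"'").lower() for t in doc.split()]
        for i, tok in enumerate(tokens):
            if tok not in entities:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j]:
                    # Binary feature: the word occurred near the entity at least once.
                    features[tok].add(tokens[j])
    return features

docs = [
    "donald_knuth published a new volume.",
    "Yesterday donald_knuth gave a talk.",
    "The paper by donald_knuth won an award.",
]
feats = aggregate_context_features(docs, {"donald_knuth"})
```

No single document here matches a rule like '[E] works … research', but the union of weak features ('published', 'talk', 'paper') accumulated across documents is the kind of evidence a multi-feature classifier could use.
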
Using Co-occurrence Features
Leverage co-occurrence of entity classes (e.g., directors likely co-occur with actors) for extraction.
Example: Extraction of the is-a-director relation:
• Text: "… Julia Roberts starred in a Robert Altman film in 1994 …"
• Actor list: Alan Alba, Richard Gere, Julia Roberts, …
• Aggregate context feature: (Robert_Altman, co-occurs with actor name)
Co-occurrence features can be between:
• Entities of different classes.
• Entities of one class.
Combination with text features is possible, e.g., '[Entity] plays for [Team_Name]'.
Two questions:
• What difference do the aggregate contexts make for extraction accuracy?
• This means keeping track of contexts across documents. Can we make this efficient?

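As a rough sketch of the is-a-director example, the snippet below emits a binary co-occurrence feature whenever an entity is mentioned in the same context as a member of a known class list. It assumes entity mentions per context have already been extracted; the function name `cooccurrence_features` and the feature-string format are illustrative choices, not taken from the slides.

```python
def cooccurrence_features(context_mentions, class_lists):
    """Map each entity to the set of class lists its co-mentioned entities belong to.

    context_mentions: list of lists, the entities found in one context (e.g. a sentence).
    class_lists: dict from class name to a set of known member names.
    """
    features = {}
    for mentions in context_mentions:
        for entity in mentions:
            for other in mentions:
                if other == entity:
                    continue
                for cls, members in class_lists.items():
                    if other in members:
                        # Binary feature: entity co-occurred with a member of this class.
                        features.setdefault(entity, set()).add(f"co-occurs-with-{cls}")
    return features

class_lists = {"actor": {"Julia Roberts", "Richard Gere"}}
mentions = [["Julia Roberts", "Robert Altman"]]
cooc = cooccurrence_features(mentions, class_lists)
```

Here only `Robert Altman` receives the `co-occurs-with-actor` feature, because the other mention in the context is on the actor list. The same function covers co-occurrence within one class if the entity itself may appear on the list.
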
Architecture
[Diagram. Inputs: Document Corpus D, Co-Occurrence List corpus L. Components: Single-Context Extraction, Agg. Feature Extraction, Co-Occurrence Detection, Context Feature Extraction, Aggregation, Classification. Intermediate outputs: Entity-Relation Pairs, Entity-Feature Pairs. Classification rule shown: COUNT(entity, relation) > Δ.]
• Duplicated overhead from:
  - Document scanning
  - Document processing
  - Entity extraction

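One plausible reading of the rule COUNT(entity, relation) > Δ is that single-context extraction rules fire on individual mentions and a pair is accepted once it has been extracted more than Δ times. The sketch below illustrates that reading only; the regular expression standing in for entity recognition, the function name, and the `delta` parameter are assumptions.

```python
import re
from collections import Counter

def single_context_extract(documents, rules, delta):
    """Count (entity, relation) pairs produced by rules; keep those with count > delta."""
    counts = Counter()
    for doc in documents:
        for pattern, relation in rules:
            for match in pattern.finditer(doc):
                counts[(match.group("entity"), relation)] += 1
    return {pair for pair, count in counts.items() if count > delta}

# Hypothetical rule: two capitalized words followed by a fixed context.
rules = [
    (re.compile(r"(?P<entity>[A-Z]\w+ [A-Z]\w+) works in research"), "is-a-researcher"),
]
docs = [
    "Donald Knuth works in research.",
    "We heard that Donald Knuth works in research at Stanford.",
    "Jane Doe works in research.",
]
accepted = single_context_extract(docs, rules, delta=1)
```

With `delta=1`, only the pair extracted from two separate contexts survives the threshold.
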
New Architecture
[Diagram. Inputs: Document Corpus D, Co-Occurrence List corpus L, Synopsis of L. Components: Rule-based Extraction, Agg. Feature Extraction, Co-Occurrence Detection, Context Feature Extraction, List-Member Extraction, Aggregation, Delete False Positives, Classification. Intermediate outputs: Entity – Candidate Context Pairs, Entity-Feature Pairs, Entity-List Pairs.]
• Frequency distribution of entities is very skewed.
• Pruning based on retaining the most frequent entities and list members in memory.
• Challenge: Determining frequencies online.
• => Compact hash synopses of frequencies (CM-Sketch) perform well.
Challenges:
1. Fast & accurate co-occurrence detection using the synopsis.
2. Pruning of redundant output.
• Fast identification of candidate matches through 2-stage filtering.
• Use of Bloom filters to trade off memory footprint against false-positive rate.
• Potentially very large output from duplication, e.g., Entity: "George Bush", Feature: 'President'.
• Potentially very large output from duplication via very many co-occurrences, e.g., actor-actor.

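The two synopsis structures named above can be sketched in a few lines each. This is a generic textbook Count-Min sketch and Bloom filter, not the authors' code; the sizes, the BLAKE2b-based hashing, and all class and helper names are assumptions chosen for illustration.

```python
import hashlib

def _positions(item, k, width):
    """k hash positions in [0, width), from BLAKE2b with a different salt per hash."""
    out = []
    for i in range(k):
        digest = hashlib.blake2b(
            item.encode("utf-8"), digest_size=8, salt=i.to_bytes(8, "little")
        ).digest()
        out.append(int.from_bytes(digest, "little") % width)
    return out

class CountMinSketch:
    """Approximate frequency counts; estimates never fall below the true count."""
    def __init__(self, depth=4, width=1024):
        self.depth, self.width = depth, width
        self.rows = [[0] * width for _ in range(depth)]

    def add(self, item, count=1):
        for row, pos in zip(self.rows, _positions(item, self.depth, self.width)):
            row[pos] += count

    def estimate(self, item):
        positions = _positions(item, self.depth, self.width)
        return min(row[pos] for row, pos in zip(self.rows, positions))

class BloomFilter:
    """Set membership with possible false positives but no false negatives."""
    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def add(self, item):
        for pos in _positions(item, self.num_hashes, self.num_bits):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in _positions(item, self.num_hashes, self.num_bits)
        )

# Online frequency estimation over a skewed stream of entity mentions.
sketch = CountMinSketch()
for name in ["George Bush"] * 50 + ["Jane Doe"]:
    sketch.add(name)

# Two-stage filtering: a cheap Bloom-filter test, then an exact check that
# deletes any false positives.
exact_list = {"Julia Roberts", "Richard Gere"}
members = BloomFilter()
for name in exact_list:
    members.add(name)
mentions = ["Julia Roberts", "Robert Altman"]
candidates = [m for m in mentions if m in members]
confirmed = [m for m in candidates if m in exact_list]
```

A larger `num_bits` lowers the Bloom filter's false-positive rate at the cost of memory, which is the trade-off the slide refers to; the exact check afterwards corresponds to the "Delete False Positives" step.
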
Experimental Evaluation
• Task: Categorization of entities into professions (actor, writer, painter, etc.)
• Document corpus: 3.2 million Wikipedia pages
• Training data generated using Wikipedia lists of famous painters, writers, etc.
• Aggregate-context classifier: linear SVM using text n-gram & co-occurrence features (binary)
• Single-context classifier: 100K extraction rules (incl. gaps) derived from training data (algorithm of [König and Brill, KDD'06]).
• Co-occurrence list: contains 10% of entity strings in training data.

Experimental Evaluation: Overhead
• Main remaining overhead: writing of entity-feature pairs.
• A simple caching strategy reduces this overhead by an order of magnitude.

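The slide does not say what the caching strategy is. One simple possibility, sketched here purely as an assumption, is to keep a bounded cache of recently written entity-feature pairs and skip a write when the pair is already cached; because the features are binary, a repeated pair carries no new information. The class name `DedupWriter` and the LRU eviction policy are hypothetical.

```python
from collections import OrderedDict

class DedupWriter:
    """Suppress repeated (entity, feature) writes using a bounded LRU cache."""
    def __init__(self, sink, capacity=10000):
        self.sink = sink              # any object with an append() method
        self.capacity = capacity
        self.recent = OrderedDict()   # keys ordered from least to most recently used

    def write(self, entity, feature):
        key = (entity, feature)
        if key in self.recent:
            self.recent.move_to_end(key)   # refresh recency, skip the write
            return False
        self.sink.append(key)
        self.recent[key] = None
        if len(self.recent) > self.capacity:
            self.recent.popitem(last=False)  # evict the least recently used pair
        return True

out = []
writer = DedupWriter(out, capacity=2)
results = [writer.write("George Bush", "President") for _ in range(3)]
```

Because entity frequencies are heavily skewed, a small cache of this kind would be expected to absorb most duplicate pairs for frequent entities; pairs evicted from the cache can still be written twice, so a later aggregation step has to tolerate duplicates.
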
Conclusions
• Studied the effect of aggregate context in relation extraction.
• Proposed efficient processing techniques for large text corpora.
• Both aggregate and co-occurrence features provide a significant increase in extraction accuracy compared to single-context classifiers.
• The use of pruning techniques and approximate filters results in a significant reduction in the overall extraction overhead.