Using Wikipedia for Hierarchical Finer Categorization of Named Entities

Using Wikipedia for Hierarchical Finer Categorizationof Named Entities Aasish Pappu Language Technologies Institute Carnegie Mellon University PACLIC 2009

outline 1 Introduction 2 Related Work 3 Corpus Creation 4 Features 5 Experiments 6 Conclusion

Structured and organized encyclopedic corpus is a suitable training corpus. • a wide range of topics • provides hyperlinks 1 Introduction

In this paper • Discuss the usability of Wikipedia • Induce WordNet and Wikipedia domain taxonomy into the feature space • Using Maximum Entropy and SVM classifier 1 Introduction

Kazama and Torisawa (2007) • extracted gloss text • Dakka and Cucerzan (2008) • tagging the Wikipedia data • Bunescu and Pasca (2006) • built a disambiguation system 2 Related Work

10-18-2007 English version of Wikipedia • 2 million articles • 292,384 categories • a taxonomy with a depth about 10 • 5882 Wikipedia Stub categories • 105 domains 3 Corpus Creation

3 Corpus Creation 3.1 Categories in Wikipedia 3.2 Named entity categories 3.3 Procedure

taxonomy • constituted by categories • linked to other categories across depth and breadth • contains cycles • Tackled by Zesch and Gurevych, 2007 • wikipedia taxonomy is not a tree 3.1 Categories in Wikipedia

the domain hierarchy • 17 basic domains • 88 sub-domains 3.2 Named entity categories

to avoid the bias towards any particular domain • rules to choose set of categories • To ensure diversity in the categorization task • To ensure we select balanced categories • consider category with each parameter closest to mean value under that domain 3.2 Named entity categories

extract named entity phrases • using Stanford POS tagger • extract typed dependency relationships • extract the content words around a named entity • collect the NPs (noun phrases) and VPs (verb phrases) 3.3 Procedure

Firstly, we look for redirected and disambiguated article titles matching with first name of the named entity. • If, there are more than one such titles, consider the target title using minimum edit distance metric. • Pick all articles that fall under the same category as the target article. • Look for those articles that fall under the special categories that are chosen for the classification task. • Find the article that shares maximum number of categories with the target article and label the target article with the its special category. 3.3 Procedure

About 10,000 samples • Training 75% • Testing 25% 3.3 Procedure

3.3 Procedure

four types of feature sets • a syntactic feature set • three semantic features 4 Features

4 Features 4.1 Typed Dependency Feature 4.2 Hypernyms 4.3 Domain based features

phrase structure parse • nesting of multi-word constituents • dependency parse • dependencies between individual words • dependency relations gives a clue about probable semantic relations that can be associated with the named entity. 4.1 Typed Dependency Feature

preferred to have a hypernym feature which is semantically specific • hypernyms of all synsets are inversely ordered according to their depth in the hypernymtree • deepest hypernymin the lot is choosen as the target feature for that content word 4.2 Hypernyms

4 Features 4.1 Typed Dependency Feature 4.2 Hypernyms 4.3 Domain based features 4.3.1 Wordnet domains 4.3.2 Wikipedia domains 4.3.3 WDH vsWikipedia Domain System

Every synset in WordNet is associated a domain label in Wordnet Domain Hierarchy (WDH) • There are 5 top-level domains and 46 basic domains in WDH. 4.3.1 Wordnet domains

indexed Wikipedia • search content words in the index for the categories that contain more number of pages containing a content word • Especially, pages with links are weighed double the pages that contains the word without a hyperlink. 4.3.2 Wikipedia domains

4.3.3 WDH vsWikipedia Domain System

5 Experiments

outline 1 Introduction 2 Related Work 3 Corpus Creation 4 Features 5 Experiments 5.1 Experiment 1: Feature wise model 5.2 Experiment 2: Feature combination model 5.3 Experiment 3: Error analysis

presented a named entity categorization system • employs Wikipedia categories as classes • adapted hierachial categorization of Wikipedia • mine relations among named entities

Using Wikipedia for Hierarchical Finer Categorization of Named Entities

Using Wikipedia for Hierarchical Finer Categorization of Named Entities

Presentation Transcript

Towards a semantic extraction of named entities

Semantic Relation Extraction for Linking Named Entities to Biomedical Databases

Indexing concepts and/or named entities

Exploiting Wikipedia Categorization for Predicting Age and Gender of Blog Authors

Collective annotation of wikipedia entities in web text

Augmenting Wikipedia with Named Entity Tags

Probabilistic Semantic Similarity Measurements for Noisy Short Texts Using Wikipedia Entities

Wiki3C : Exploiting Wikipedia for Context-aware Concept Categorization

Exploiting Wikipedia Categorization for Predicting Age and Gender of Blog Authors

Named Entities in Domain Unlimited Speech Translation

Alignment of Bilingual Named Entities in Parallel Corpora Using Statistical Model

Identification of bilingual named entities from Wikipedia using a pair Hidden Markov Model

Exploiting Wikipedia Inlinks for Linking Entities in Queries

Acclimatizing Taxonomic Semantics for Hierarchical Content Categorization

Hierarchical Text Categorization and its Application to Bioinformatics

Face Categorization using SIFT features

Learning Formulation and Transformation Rules for Multilingual Named Entities

Text Classification and Named Entities for New Event Detection

Identification of Composite Named Entities in a Spanish Textual Database

Named Entity Disambiguation by Leveraging Wikipedia Semantic Knowledge

Named Entities in Czech Texts and Their Processing

Functional Annotation of Genes Using Hierarchical Text Categorization