A Graph-based Approach to Named Entity Categorization in Wikipedia Using Conditional Random Fields
Yotaro Watanabe, Masayuki Asahara and Yuji Matsumoto
Nara Institute of Science and Technology
EMNLP-CoNLL 2007, 29th June, Prague, Czech Republic
Background
• Named Entity
  • Proper nouns (e.g. Shinzo Abe (Person), Prague (Location)), time/date expressions (e.g. June 29 (Date)) and numerical expressions (e.g. 10%)
  • Named Entities play an important role in many NLP applications (e.g. IE, QA)
• Named Entity Recognition (NER)
  • Usually treated as a sequential tagging problem, and many machine learning methods have been proposed
  • Recall is usually low
  • A large-scale NE dictionary is useful for NER, so semi-automatic methods for compiling NE dictionaries are in demand
Resource for NE dictionary construction
• Wikipedia
  • Multi-lingual encyclopedia on the Web
  • 382,613 gloss articles (Japanese version, as of June 20, 2007)
  • Gloss indices consist of nouns or proper nouns
  • HTML (semi-structured text): lists (<LI>) and tables (<TABLE>) can be used as clues for NE type categorization
  • Linked articles are referred to by anchor texts within articles
  • Each article has one or more categories
• Wikipedia thus has useful information for NE categorization and can be considered a suitable resource
Objective
• Extract Named Entities by assigning proper NE labels to the gloss indices of Wikipedia
• (figure: example indices labeled Person, Location, Natural Object, Organization, Product)
Use of Wikipedia features
• Features of Wikipedia articles
  • Anchors of an article refer to other related articles
  • Anchors in list elements have dependencies on each other
• => Make 3 assumptions about dependencies between anchors
  • Assumption 1: the latter element in a list item tends to be in an attribute relation to the former element (e.g. "Burt Bacharach … composer": PERSON followed by VOCATION)
  • Assumption 2: the elements in the same itemization tend to be in the same NE category (e.g. "Dillard & Clark" and "Carpenters": both ORGANIZATION)
  • Assumption 3: the nested element tends to be in a part-of relation to the upper element (e.g. "Karen Carpenter" nested under "Carpenters": PERSON)
Overview of our approach
• Focus on the HTML list structure in Wikipedia
  • Make 3 assumptions about dependencies between anchors
  • Formalize the NE categorization problem as labeling anchors in lists with NE classes
  • Define 3 kinds of cliques (edges: Sibling, Cousin and Relative) between anchors based on the 3 assumptions
  • Construct graphs based on the 3 defined cliques
• CRFs for NE categorization in Wikipedia
  • Define potential functions over the 3 edge types (and nodes) to provide a conditional distribution over the graphs
  • Estimate the MAP label assignment over the graphs using Conditional Random Fields
Conditional Random Fields (CRFs)
• (figure: linear-chain CRF with label nodes y1, y2, y3, …, yn conditioned on the input x)
• Conditional Random Fields [Lafferty 2001]
  • Discriminative, undirected models
  • Define the conditional distribution p(y|x)
• Features
  • Arbitrary features can be used
• Globally optimized over all possible label assignments
• Can deal with label dependencies by defining potential functions over cliques (of 2 or more nodes)
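The conditional distribution takes the standard CRF form (the textbook definition, not anything specific to this work): the probability of a label assignment is a normalized product of potential functions over the cliques of the graph,

    p(y | x) = \frac{1}{Z(x)} \prod_{c \in C} \psi_c(y_c, x),  \qquad  Z(x) = \sum_{y'} \prod_{c \in C} \psi_c(y'_c, x)

where C is the set of cliques and Z(x) is the partition function.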
Use of dependencies for categorization
• The NE categorization problem is formalized as labeling anchors with NE classes
• Each edge of the constructed graphs corresponds to a particular dependency
• Estimate the MAP label assignment over the constructed graphs using Conditional Random Fields
• Our formulation can also extract anchors that have no gloss article of their own
  • (figure: example list "Dillard & Clark … country rock", "Carpenters", "Karen Carpenter", with markers indicating which anchors have gloss articles and which do not)
Clique definition based on the HTML tree structure
• Example list: "Dillard & Clark … country rock", "Carpenters", and a nested item "Karen Carpenter" (figure: the corresponding <UL>/<LI>/<A> tree)
• Sibling: the latter element tends to be an attribute or a concept of the former element (e.g. "Dillard & Clark" → "country rock")
• Cousin: the elements tend to have a common NE category (e.g. "Dillard & Clark" and "Carpenters": ORGANIZATION)
• Relative: the latter element tends to be a constituent part of the former element (e.g. "Karen Carpenter" under "Carpenters")
• Use these 3 relations as cliques of the CRFs
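As a rough illustration of how such edges could be enumerated from one list subtree, the sketch below walks a small, well-formed <ul>/<li> fragment and emits Sibling, Cousin and Relative pairs between anchor texts. The fragment, the helper name extract_edges and the traversal details are illustrative assumptions, not the authors' implementation.

    import xml.etree.ElementTree as ET

    # Toy, well-formed list fragment standing in for a Wikipedia list.
    LIST_HTML = """
    <ul>
      <li><a>Dillard &amp; Clark</a> ... <a>country rock</a></li>
      <li><a>Carpenters</a>
        <ul>
          <li><a>Karen Carpenter</a></li>
        </ul>
      </li>
    </ul>
    """

    def extract_edges(ul):
        # Enumerate (relation, anchor1, anchor2) pairs from one <ul> element.
        edges = []
        first_anchors = []                    # first anchor of each <li>, for Cousin edges
        for li in ul.findall("li"):
            anchors = [a.text for a in li.findall("a")]
            if anchors:
                first_anchors.append(anchors[0])
            # Sibling: consecutive anchors inside the same list item
            for a1, a2 in zip(anchors, anchors[1:]):
                edges.append(("Sibling", a1, a2))
            # Relative: an anchor in a nested list vs. the first anchor of its parent item
            for sub in li.findall("ul"):
                for sub_li in sub.findall("li"):
                    sub_anchors = [a.text for a in sub_li.findall("a")]
                    if anchors and sub_anchors:
                        edges.append(("Relative", anchors[0], sub_anchors[0]))
                edges.extend(extract_edges(sub))   # recurse into nested lists
        # Cousin: consecutive items of the same itemization
        for a1, a2 in zip(first_anchors, first_anchors[1:]):
            edges.append(("Cousin", a1, a2))
        return edges

    for edge in extract_edges(ET.fromstring(LIST_HTML)):
        print(edge)

On this fragment the sketch prints one Sibling, one Relative and one Cousin edge, matching the three assumptions above.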
A graph constructed from the 3 clique definitions
• Example list: "Burt Bacharach … 'On my own' … 1986", "Dillard & Clark", a nested "Gene Clark", "Carpenters … 'As Time Goes By' … 2000", and a nested "Karen Carpenter" (figure: the graph over these anchors with S (Sibling), C (Cousin) and R (Relative) edges)
• Sibling: the latter element tends to be an attribute or a concept of the former element
• Cousin: the elements tend to have a common NE category (e.g. ORGANIZATION)
• Relative: the latter element tends to be a constituent part of the former element
• Estimate the MAP label assignment over the graph
Model
• Potential functions are defined for Sibling, Cousin and Relative cliques, and for individual nodes
• The constructed graphs include cycles, so exact inference is computationally expensive
• => Introduce Tree-based Reparameterization (TRP) [Wainwright 2003] for approximate inference
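The slide's formulas do not survive in this transcript; a plausible reconstruction of the model, consistent with the description above (one node potential plus one pairwise potential per edge type), is

    p(y | x) = \frac{1}{Z(x)} \prod_{i \in V} \phi(y_i, x) \prod_{(i,j) \in E_S} \psi_S(y_i, y_j, x) \prod_{(i,j) \in E_C} \psi_C(y_i, y_j, x) \prod_{(i,j) \in E_R} \psi_R(y_i, y_j, x)

where V is the set of anchor nodes and E_S, E_C and E_R are the Sibling, Cousin and Relative edge sets.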
Experiments
• The aims of the experiments are:
  • Compare the graph-based (relational) approach to the node-wise (independent) approach, to investigate how relational classification improves classification accuracy
  • Investigate the effect of the defined cliques
  • Compare the CRF models to baseline models based on SVMs
  • Show the effectiveness of using marginal probability for filtering NE candidates
Dataset
• Randomly sampled 2,300 articles (Japanese version as of October 2005)
• Anchors in list elements (<LI>) are hand-annotated with NE class labels
• We used the Extended Named Entity Hierarchy (Sekine et al. 2002), reducing the number of classes from the original 200+ to 13 in order to avoid data sparseness
• Classification targets: 16,136 anchors (14,285 of which are NEs)
Experiments (CRFs)
• To investigate which clique type contributes to classification accuracy, we construct models from all possible combinations of the defined cliques
  • 8 models: SCR, SC, SR, CR, S, C, R, I (figure: the graph structure used by each model)
• Classification is performed on each connected subgraph
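The eight models are simply all subsets of the three edge types, with the empty subset corresponding to the independent I model; a purely illustrative snippet:

    from itertools import combinations

    edge_types = ["S", "C", "R"]
    # Every subset of {Sibling, Cousin, Relative}; the empty subset is the node-wise I model.
    models = ["".join(c) or "I"
              for r in range(len(edge_types), -1, -1)
              for c in combinations(edge_types, r)]
    print(models)   # ['SCR', 'SC', 'SR', 'CR', 'S', 'C', 'R', 'I']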
Experimental settings (baselines), evaluation
• Baseline: Support Vector Machines (SVMs) [Vapnik 1998]
• We use two baseline models:
  • I model: each anchor text is classified independently
  • P model: anchor texts are ordered by their linear position in the HTML, and history-based classification is performed (the (j-1)-th classification result is used in the j-th classification)
• Multi-class classification: one-versus-rest
• Evaluation: 5-fold cross validation, F1-value
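A minimal sketch of such a one-versus-rest SVM baseline (in the style of the I model), assuming scikit-learn; the anchors, labels and feature template below are toy placeholders, not the paper's feature set:

    from sklearn.svm import LinearSVC
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.feature_extraction import DictVectorizer

    # Toy anchors and NE labels standing in for the annotated Wikipedia data.
    anchors = ["Burt Bacharach", "Carpenters", "Karen Carpenter", "country rock"]
    labels  = ["PERSON", "ORGANIZATION", "PERSON", "OTHER"]

    def features(anchor):
        # Placeholder feature template; the paper uses richer lexical and structural features.
        return {"text=" + anchor: 1.0, "last_token=" + anchor.split()[-1]: 1.0}

    vec = DictVectorizer()
    X = vec.fit_transform([features(a) for a in anchors])

    clf = OneVsRestClassifier(LinearSVC())   # one-versus-rest multi-class SVM
    clf.fit(X, labels)
    print(clf.predict(X))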
Results (F1-value)
• (table: F1 of the SVM baselines (I, P models) and the 8 CRF models (SCR, SC, SR, CR, S, C, R, I), on the whole dataset (ALL) and on anchors without articles (no article))
Results (F1-value): 1. Graph-based vs. node-wise
• Performed a McNemar paired test on labeling disagreements; the difference was significant (p < 0.01)
Results (F1-value): 2. Which clique contributes most? => Cousin
• Cousin cliques provided the highest accuracy improvements compared to Sibling and Relative cliques
Results (F1-value): 3. CRFs vs. SVMs
• Significance test: McNemar paired test on labeling disagreements
Filtering NE candidates using marginal probability
• We construct dictionaries from the extracted NE candidates; methods with lower cost are desirable
• Extract only confident NE candidates -> use the marginal probability provided by the CRF
• Marginal probability: the probability of a particular label assignment for a node, which can be regarded as the classifier's confidence
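A minimal sketch of this filtering step, assuming we already have (anchor, MAP label, marginal probability) triples from the CRF decoder; the triples and threshold below are made-up values for illustration:

    def filter_candidates(candidates, threshold=0.9):
        # Keep only anchors whose MAP label has a marginal probability above the threshold.
        return [(anchor, label) for anchor, label, prob in candidates if prob >= threshold]

    # Hypothetical decoder output: (anchor text, MAP label, marginal probability).
    candidates = [("Karen Carpenter", "PERSON", 0.98),
                  ("Carpenters", "ORGANIZATION", 0.93),
                  ("country rock", "PRODUCT", 0.41)]

    print(filter_candidates(candidates, threshold=0.9))

Raising the threshold trades recall for precision, which is exactly the trade-off shown by the precision-recall curve on the next slide.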
Precision-Recall curve
• (figure: precision-recall curve obtained by thresholding the marginal probability of the MAP estimation in the CR model of the CRFs)
• At the marked operating point, recall is about 0.57 and precision is about 0.97
• With a proper threshold on the marginal probability, an NE dictionary can be constructed at lower cost
Summary and future work
• Summary
  • Proposed a method for categorizing NEs in Wikipedia
  • Defined 3 kinds of cliques (Sibling, Cousin and Relative) over the HTML tree
  • The graph-based model achieved significant improvements compared to the node-wise model and the SVM baselines
  • NEs can be extracted at lower cost by exploiting the marginal probability
Summary and future work
• Future work
  • Use fine-grained NE classes
    • For many NLP applications (e.g. QA, IE), an NE dictionary with a fine-grained label set will be a useful resource
    • Classification with statistical methods becomes difficult when the label set is large, because of insufficient positive examples per class
  • Incorporate the hierarchical structure of the label set into our models (hierarchical classification)
    • Previous work suggests that exploiting the hierarchical structure of label sets improves classification accuracy