Exploiting Named Entity Taggers in a Second Language

Exploiting Named Entity Taggers in a Second Language Thamar Solorio Computer Science Department National Institute of Astrophysics, Optics and Electronics ACL 2005 Student ResearchWorkshop

Abstract • Named Entity Recognition • (X) Complex linguistic resources • (X) A hand coded system • (X) Any language dependent tools • The only information we use is automatically extracted from the documents, without human intervention. • Our approach even outperformed the hand coded system on NER in Spanish, and it achieved high accuracies in Portuguese.

Introduction • Most NER approaches have very low portability • Many NE extractor systems rely heavily on complex linguistic resources, which are typically hand coded, for example regular expressions, grammars, gazetteers and the like. • Adapting a system to a different collection or language requires a lot of human effort: rewriting the grammars, acquiring new dictionaries, searching trigger words, and so on. • (O) NE extractor system for Spanish + Portuguese corpus • (X) developing linguistic tools such as parsers, POS taggers, grammars and the like.

Related Work (1/2) • Hidden Markov Models • Zhou and Su, 2002 • Internal features: gazetteer information • External features: the context of other NEs already recognized • Bikel et al., 1997 and Bikel et al., 1999 • Maximum entropy • Borthwick, 1999: dictionaries and other orthographic information • Carreras et al.,2003 • presented results of a NER system for Catalan using Spanish resources • using cross-linguistic features

Related Work (2/2) • Petasis et al., 2000 • extending a proper noun dictionary • an inductive decision-tree classifier • unsupervised probabilistic learning of syntactic and semantic context • POS tags and morphological information • Arévalo et al., 2002 • External information provided by gazetteers and lists of trigger words • A context free grammar, manually coded, is used for recognizing syntactic patterns

Data sets • The corpus in Spanish is that used in the CoNLL 2002 competitions for the NE extraction task. • training: 20,308 NEs • testa: 4,634 NEs • to tune the parameters of the classifiers (development set) • performed experiments with testa only • testb: 3,948 NEs • to compare the results of the competitors • The corpus in Portuguese is “HAREM: Evaluation contest on named entity recognition for Portuguese”. This corpus contains newspaper articles and consists of 8,551 words with 648 NEs.

Two-step Named Entity Recognition • Named Entity Delimitation (NED) • Determining boundaries of named entities • Named Entity Classification (NEC) • Classifying the named entities into categories

Named Entity Delimitation (1/3) • BIO scheme • We used a modified version of C4.5 algorithm (Quinlan, 1993) implemented within the WEKA environment (Witten and Frank, 1999). • For each word we combined two types of features: • Internal: word itself, orthographic information(1~6 possible states) and the position in the sentence. • External: POS tag and BIO tag • A given word w are extracted using a window of five words anchored in the word w, each word described by the internal and external features mentioned previously. • Our classifier learns to discriminate among the three classes and assigns labels to all the words, processing them sequentially.

Named Entity Delimitation (2/3)

Named Entity Delimitation (3/3) • The hand coded system used in this work was developed by the TALP research center (Carreras and Padró, 2002). • NLP analyzers for Spanish, English and Catalan • Include practical tools such as POS taggers, semantic analyzers and NE extractors. • This NER system is based on hand-coded grammars, lists of trigger words and gazetteer information.

NED - Experimental Results • Average of several runs of 10-fold cross-validation

Named Entity Classification • To build a training set for the NEC learner: • the same attributes as for the NED task • suffix of each word. maximum size of 5 characters. • Spanish: person, organization, location and miscellaneous. • Portuguese: person, object, quantity, event, organization, artifact, location, date, abstraction and miscellaneous. • We believe that the learner will be capable of achieving good accuracies in the learning task.

NEC - Experimental Results (1/2) • Similarly to the NED case we trained C4.5 classifiers for the NEC task

NEC - Experimental Results (2/2) Only 4 instances

Conclusions • NER systems must be easy to port and robust, given the great variety of documents and languages for which it is desirable to have these tools available. • Our method does not require any language dependent features. The only information used in this approach is automatically extracted from the documents, without human intervention. • Our method has shown to be robust and easy to port to other languages. The only requirement for using our method is a tokenizer for languages.

Exploiting Named Entity Taggers in a Second Language

Exploiting Named Entity Taggers in a Second Language

Presentation Transcript

Named Entity Recognition

Named Entity Classification

Exploiting Domain Structure for Named Entity Recognition

Named Entity Recognition

Extended Named Entity ver. 6.1.2

Named Entity Recognition in Tweets: TwitterNLP

A Framework for Named Entity Recognition in the Open Domain

Biomedical Named Entity Recognition

Exploiting domain and task regularities for robust named entity recognition

VI.3 Named Entity Reconciliation

Named Entity Recognition

Named Entity Recognition without Training Data on a Language you don’t speak

Named Entity Recognition

Named Entity Tagging

NAMED ENTITY RECOGNITION

An Effective Two-Stage Model for Exploiting Non-Local Dependencies in Named Entity Recognition

Named Entity Extraction

Named Entity Recognition

Named Entity Tagging