170 likes | 362 Views
Improving NER in Arabic using a Morphological Tagger. Benjamin Farber, Dayne Freitag FairIsaac Nizar Habash, Owen Rambow Columbia-CCLS habash@ccls.columbia.edu. Overview. Named Entity Recognition (NER) NER for Arabic: the Challenges Using Morphological Analysis and Disambiguation
E N D
Improving NER in Arabic using a Morphological Tagger Benjamin Farber, Dayne Freitag FairIsaac Nizar Habash, Owen Rambow Columbia-CCLS habash@ccls.columbia.edu
Overview • Named Entity Recognition (NER) • NER for Arabic: the Challenges • Using Morphological Analysis and Disambiguation • Error Analysis
Named Entity Recognition:Mention Detection Geo-political entity (GPE) Organization Person Sample below from Adaptive Content Extraction (ACE) Our objective: Find name mentions NEW YORK, March 19 (AFP) Media tycoon Barry Diller on Wednesday quit as chief of Vivendi Universal Entertainment…
NER: The Approach Structured perceptron label sequence model BIO-encoded name mentions Token stream O O B:person I:person O O O O O O B:org … Media tycoon Barry Diller on Wednesday quit as chief of Vivendi … Mentions Person: “Barry Diller” Organization: “Vivendi Universal Entertainment”
The Role of Features Media tycoon Barry Diller • State-of-the-art methods rely on word-local features • Typically have form F(S, i) {0,1}, for sequence S and position i • Classes of feature • Word identity (e.g., word at index is “the”) • Orthographic (e.g., word is capitalized) • Lexical (e.g., word is a noun, or in a list of cities) word_diller:1 word_the:0 capitalized:1 numeric:0 in_name_list:0 … word_barry:1 word_the:0 capitalized:1 numeric:0 in_name_list:1 … word_tycoon:1 word_the:0 capitalized:1 numeric:0 … word_media:1 word_the:0 capitalized:1 numeric:0 …
Challenges of Arabic NER للارمن ل+ ال+ ارمن “for the Armenians” Orthographic ambiguity Dearth of lexical features Clitics, affixation Arabic rich in affixes best tokenization? Omission of short vowels Increased lexical ambiguity Word identity less reliable feature English features Word identity Orthographic Gazetteers والحكومة و+ ال+ حكومة Arabic features Word identity “and the government” Future work Addressed in this study Addressed in this study
Features Based on Term Clusters • Distributional term clustering using unlabeled corpora • Source of features for NER (Miller, et al, 2004; Freitag, 2004) • Boolean features reflecting cluster membership Example Arabic clusters Example English clusters
Morphological Analysis and Disambiguation for Arabic (MADA) • Buckwalter Arabic Morphological Analysis (BAMA): ;;WORD byn bay˜ana=[bay˜an_1 POS:V +PV +S:3MS BW:+bay˜an/PV+a/PVSUFF_SUBJ:3MS] = declare/demonstrate bayonu=[bayona_1 POS:N +NOM +DEF BW:+bayon/NOUN+u/CASE_DEF_NOM] = between/among biyn=[biyn_1 POS:PN BW:+biyn/NOUN_PROP+] = Ben • MADA (Habash and Rambow, 2005; Roth et al 2008): Which BAMA analysis is correct? • Combination of classifiers on orthogonal dimensions of Arabic morphology • 96% disambiguation accuracy • 99.3% word-level PATB tokenization accuracy
The Initial Experiment • Two new features: • Capitalized gloss (GlossCap) • No entry exists for a word in our morphological database (OOV) ;;WORD byn bay˜ana=[bay˜an_1 POS:V +PV +S:3MS BW:+bay˜an/PV+a/PVSUFF_SUBJ:3MS] = declare/demonstrate bayonu=[bayona_1 POS:N +NOM +DEF BW:+bayon/NOUN+u/CASE_DEF_NOM] = between/among biyn=[biyn_1 POS:PN BW:+biyn/NOUN_PROP+] = Ben • Two enhanced NER models, OOV feature in both: • BAMA only. A GlossCap feature is true if the gloss of any analysis returned by BAMA is capitalized. • MADA. A GlossCap feature is true only if the gloss of the analysis selected by MADA is capitalized.
Results Base: Recall limited BAMA: Marginal improvement MADA: 7% Improvement in recall while also improving precision!
Overview • Named Entity Recognition (NER) • NER for Arabic: the Challenges • Using Morphological Analysis and Disambiguation • Error Analysis
Spans or Tags? • Correct NER = span is correct AND tag is correct • Question: how hard is each component of the problem? • Evaluate performance • On Span AND Tag (S&T) • (same evaluation as before) • On Span only (S) • Note: different evaluation set, thus different numbers for S&T compared to earlier • ConclusionThe harder problem in NER is the correct identification of the spans.
Errors by Type • Categorizing all errors in development set by type • 44% are recall errors (we miss an NE) • 16% are precision errors (we propose a false NE) • 25% are span errors (we propose one or more false span that overlap(s) with a gold span) • only 15% are label errors (the span is correct, the label is not) • Confirms previous result that labels are not an important source of errors • Recall errors (we do not find NE): these are often very common entities • Major way to improve results: improve recall, perhaps by use of gazeteer
System Combination Experiments • We have three systems – can we combine? • Baseline without morphology • System with analyzer only (BAMA) • System with disambiguated analysis (MADA) • Three combinations: • Oracle: choose the correct system • Union: choose NE if any system chooses NE • Intersection: choose NE only if all systems choose NE • Precision – Recall tradeoff • Oracle shows high potential
Conclusion • Morphological disambiguation in-context using MADA helps NER • More precisely, what helps NER is the case (uppercase/lowercase) of the English gloss of the MADA selected entry! • Ideas for improvement: • Gazetteer for recall improvement • Use of lexemes (lemmatization, performed by MADA) in clustering • Can other MT-based techniques help NER in a NER-resource-poorer language?