Improving NER in Arabic using a Morphological Tagger

Benjamin Farber, Dayne Freitag (FairIsaac)
Nizar Habash, Owen Rambow (Columbia-CCLS)
habash@ccls.columbia.edu

Overview
• Named Entity Recognition (NER)
• NER for Arabic: the Challenges
• Using Morphological Analysis and Disambiguation
• Error Analysis

Named Entity Recognition: Mention Detection
• Our objective: Find name mentions
• Mention types: Geo-political entity (GPE), Organization, Person
• Sample below from Automatic Content Extraction (ACE)
NEW YORK, March 19 (AFP) Media tycoon Barry Diller on Wednesday quit as chief of Vivendi Universal Entertainment…

NER: The Approach
• Structured perceptron label sequence model
• BIO-encoded name mentions
Token stream: Media tycoon Barry Diller on Wednesday quit as chief of Vivendi …
BIO tags: O O B:person I:person O O O O O O B:org …
Mentions
• Person: “Barry Diller”
• Organization: “Vivendi Universal Entertainment”
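The BIO encoding on this slide can be turned back into mentions with a short decoder. The following is a minimal sketch, not the authors' code: the function name `decode_bio` is hypothetical, and the two trailing `I:org` tags are an assumption inferred from the listed mention “Vivendi Universal Entertainment” (the slide's tag stream is cut off after `B:org`).

```python
from typing import List, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (mention_type, mention_text) pairs from a BIO-tagged token stream."""
    mentions: List[Tuple[str, str]] = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B:"):
            # A B: tag always starts a new mention, closing any open one.
            if current_tokens:
                mentions.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I:") and current_tokens and tag[2:] == current_type:
            # An I: tag of the same type extends the open mention.
            current_tokens.append(token)
        else:
            # "O" (or an inconsistent I: tag) closes the open mention.
            if current_tokens:
                mentions.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        mentions.append((current_type, " ".join(current_tokens)))
    return mentions


tokens = ("Media tycoon Barry Diller on Wednesday quit as chief of "
          "Vivendi Universal Entertainment").split()
# Tags up to B:org are from the slide; the final two I:org tags are assumed.
tags = ["O", "O", "B:person", "I:person", "O", "O", "O", "O", "O", "O",
        "B:org", "I:org", "I:org"]
mentions = decode_bio(tokens, tags)
print(mentions)
```

With these inputs the decoder recovers the two mentions listed on the slide, one person and one organization.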

The Role of Features
• State-of-the-art methods rely on word-local features
• Typically have form F(S, i) ∈ {0,1}, for sequence S and position i
• Classes of feature
  • Word identity (e.g., word at index i is “the”)
  • Orthographic (e.g., word is capitalized)
  • Lexical (e.g., word is a noun, or in a list of cities)
Example: Media tycoon Barry Diller
Media: word_media:1 word_the:0 capitalized:1 numeric:0 …
tycoon: word_tycoon:1 word_the:0 capitalized:0 numeric:0 …
Barry: word_barry:1 word_the:0 capitalized:1 numeric:0 in_name_list:1 …
Diller: word_diller:1 word_the:0 capitalized:1 numeric:0 in_name_list:0 …
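Word-local binary features of this kind can be sketched in a few lines. This is an illustration under stated assumptions: the function `word_features` and the one-entry `name_list` are hypothetical, and the feature names simply mirror those shown on the slide. Features that are absent from a dictionary are implicitly 0 (for example `word_the`).

```python
from typing import Dict, List, Set


def word_features(sentence: List[str], i: int, name_list: Set[str]) -> Dict[str, int]:
    """Binary features F(S, i) for the word at position i of sentence S."""
    word = sentence[i]
    return {
        f"word_{word.lower()}": 1,                      # word identity
        "capitalized": int(word[:1].isupper()),         # orthographic
        "numeric": int(word.isdigit()),                 # orthographic
        "in_name_list": int(word.lower() in name_list), # lexical (list lookup)
    }


sentence = ["Media", "tycoon", "Barry", "Diller"]
name_list = {"barry"}  # hypothetical first-name list, for illustration only
feats = [word_features(sentence, i, name_list) for i in range(len(sentence))]
for word, f in zip(sentence, feats):
    print(word, f)
```

A sequence model such as the structured perceptron then learns one weight per (feature, label) pair over these sparse vectors.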

Challenges of Arabic NER
• Orthographic ambiguity
• Dearth of lexical features
• Clitics, affixation
  • Arabic rich in affixes
  • Best tokenization?
• Omission of short vowels
  • Increased lexical ambiguity
  • Word identity less reliable feature
• English features: word identity, orthographic, gazetteers
• Arabic features: word identity
• Examples
  • للارمن = ل+ ال+ ارمن “for the Armenians”
  • والحكومة = و+ ال+ حكومة “and the government”
• Slide status labels: Future work; Addressed in this study; Addressed in this study

Features Based on Term Clusters
• Distributional term clustering using unlabeled corpora
• Source of features for NER (Miller et al., 2004; Freitag, 2004)
• Boolean features reflecting cluster membership
[Tables: example Arabic clusters; example English clusters]
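Boolean cluster-membership features can be sketched as a lookup from word to cluster ID. This is an assumption-laden illustration: the mapping `word_to_cluster` and its cluster IDs are invented, and the clustering step itself (run on unlabeled corpora) is not shown.

```python
from typing import Dict, List


def cluster_features(tokens: List[str], word_to_cluster: Dict[str, int]) -> List[Dict[str, int]]:
    """Emit one boolean feature per token naming the cluster it belongs to."""
    features = []
    for token in tokens:
        cluster_id = word_to_cluster.get(token.lower())
        # Words never seen during clustering get no cluster feature at all.
        features.append({f"cluster_{cluster_id}": 1} if cluster_id is not None else {})
    return features


# Hypothetical clustering output: weekday names share cluster 17.
word_to_cluster = {"monday": 17, "wednesday": 17, "paris": 4}
cluster_feats = cluster_features(["on", "Wednesday"], word_to_cluster)
print(cluster_feats)
```

Because all members of a cluster fire the same feature, a word that is rare in the labeled data can still benefit from what the model learned about its cluster neighbors.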

Morphological Analysis and Disambiguation for Arabic (MADA)
• Buckwalter Arabic Morphological Analyzer (BAMA):
;;WORD byn
bayãna=[bayãn_1 POS:V +PV +S:3MS BW:+bayãn/PV+a/PVSUFF_SUBJ:3MS] = declare/demonstrate
bayonu=[bayona_1 POS:N +NOM +DEF BW:+bayon/NOUN+u/CASE_DEF_NOM] = between/among
biyn=[biyn_1 POS:PN BW:+biyn/NOUN_PROP+] = Ben
• MADA (Habash and Rambow, 2005; Roth et al., 2008): Which BAMA analysis is correct?
  • Combination of classifiers on orthogonal dimensions of Arabic morphology
  • 96% disambiguation accuracy
  • 99.3% word-level PATB tokenization accuracy

The Initial Experiment
• Two new features:
  • Capitalized gloss (GlossCap)
  • No entry exists for a word in our morphological database (OOV)
;;WORD byn
bayãna=[bayãn_1 POS:V +PV +S:3MS BW:+bayãn/PV+a/PVSUFF_SUBJ:3MS] = declare/demonstrate
bayonu=[bayona_1 POS:N +NOM +DEF BW:+bayon/NOUN+u/CASE_DEF_NOM] = between/among
biyn=[biyn_1 POS:PN BW:+biyn/NOUN_PROP+] = Ben
• Two enhanced NER models, OOV feature in both:
  • BAMA only. A GlossCap feature is true if the gloss of any analysis returned by BAMA is capitalized.
  • MADA. A GlossCap feature is true only if the gloss of the analysis selected by MADA is capitalized.
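The difference between the two GlossCap variants can be made concrete with the `byn` example. The sketch below is illustrative only: the dictionary layout, the function names, and the `selected` index standing in for MADA's choice are all hypothetical, not the BAMA or MADA interfaces.

```python
from typing import Dict, List, Optional


def is_capitalized(gloss: str) -> bool:
    """True if the English gloss starts with an uppercase letter."""
    return gloss[:1].isupper()


def gloss_cap_bama(analyses: List[Dict[str, str]]) -> bool:
    """BAMA-only variant: true if ANY returned analysis has a capitalized gloss."""
    return any(is_capitalized(a["gloss"]) for a in analyses)


def gloss_cap_mada(analyses: List[Dict[str, str]], selected: Optional[int]) -> bool:
    """MADA variant: true only if the analysis chosen in context has a capitalized gloss."""
    if selected is None:
        return False
    return is_capitalized(analyses[selected]["gloss"])


def oov(analyses: List[Dict[str, str]]) -> bool:
    """OOV feature: the morphological database has no entry for the word."""
    return len(analyses) == 0


# The three analyses of "byn" from the slide, reduced to lemma, POS and gloss.
byn = [
    {"lemma": "bayãn_1", "pos": "V", "gloss": "declare/demonstrate"},
    {"lemma": "bayona_1", "pos": "N", "gloss": "between/among"},
    {"lemma": "biyn_1", "pos": "PN", "gloss": "Ben"},
]

print(gloss_cap_bama(byn))     # fires because of the "Ben" reading
print(gloss_cap_mada(byn, 1))  # assumed context where "between/among" is selected
print(gloss_cap_mada(byn, 2))  # assumed context where "Ben" is selected
```

The BAMA-only feature fires on every occurrence of `byn`, including the frequent preposition reading, whereas the MADA feature fires only when the proper-noun reading is selected in context.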

Results
• Base: Recall limited
• BAMA: Marginal improvement
• MADA: 7% improvement in recall while also improving precision!

Error Analysis

Spans or Tags?
• Correct NER = span is correct AND tag is correct
• Question: how hard is each component of the problem?
• Evaluate performance
  • On Span AND Tag (S&T) (same evaluation as before)
  • On Span only (S)
• Note: different evaluation set, thus different numbers for S&T compared to earlier
• Conclusion: The harder problem in NER is the correct identification of the spans.

Errors by Type
• Categorizing all errors in development set by type
  • 44% are recall errors (we miss an NE)
  • 16% are precision errors (we propose a false NE)
  • 25% are span errors (we propose one or more false spans that overlap with a gold span)
  • only 15% are label errors (the span is correct, the label is not)
• Confirms previous result that labels are not an important source of errors
• Recall errors (we do not find NE): these are often very common entities
• Major way to improve results: improve recall, perhaps by use of a gazetteer
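The four error types can be sketched as a comparison of predicted and gold mentions. The slides do not state the exact counting rules, so the following is one plausible interpretation: mentions are assumed to be `(start, end_exclusive, label)` triples, and `categorize_errors` is a hypothetical name.

```python
from collections import Counter
from typing import List, Tuple

Mention = Tuple[int, int, str]  # (start, end_exclusive, label)


def overlaps(a: Mention, b: Mention) -> bool:
    """True if the two token spans share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def categorize_errors(gold: List[Mention], predicted: List[Mention]) -> Counter:
    """Count label, span, precision and recall errors."""
    counts: Counter = Counter()
    for p in predicted:
        if p in gold:
            continue  # exact match on span and label: not an error
        if any(p[:2] == g[:2] for g in gold):
            counts["label"] += 1      # span is correct, label is not
        elif any(overlaps(p, g) for g in gold):
            counts["span"] += 1       # false span overlapping a gold span
        else:
            counts["precision"] += 1  # false NE with no gold counterpart
    for g in gold:
        if not any(overlaps(g, p) for p in predicted):
            counts["recall"] += 1     # gold NE missed entirely
    return counts


gold = [(2, 4, "person"), (10, 13, "org"), (20, 21, "gpe"), (30, 32, "person")]
predicted = [(2, 4, "person"), (10, 12, "org"), (20, 21, "org"), (40, 41, "gpe")]
errors = categorize_errors(gold, predicted)
print(errors)
```

In this toy example each category occurs once: a truncated organization span, a mislabeled GPE, a spurious mention, and a missed person.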

System Combination Experiments
• We have three systems: can we combine?
  • Baseline without morphology
  • System with analyzer only (BAMA)
  • System with disambiguated analysis (MADA)
• Three combinations:
  • Oracle: choose the correct system
  • Union: choose NE if any system chooses NE
  • Intersection: choose NE only if all systems choose NE
• Precision-recall tradeoff
• Oracle shows high potential
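The three combinations can be sketched with set operations over each system's predicted mentions. The mention triples and system outputs below are invented for illustration, and the oracle is implemented under one assumed reading of the slide (keep a mention when at least one system proposed it and it is correct); the slides do not spell out the granularity at which the oracle chooses.

```python
from typing import List, Set, Tuple

Mention = Tuple[int, int, str]  # (start, end_exclusive, label)


def union(systems: List[Set[Mention]]) -> Set[Mention]:
    """Choose an NE if any system chooses it."""
    return set().union(*systems)


def intersection(systems: List[Set[Mention]]) -> Set[Mention]:
    """Choose an NE only if all systems choose it."""
    return set.intersection(*systems)


def oracle(systems: List[Set[Mention]], gold: Set[Mention]) -> Set[Mention]:
    """Assumed oracle: keep every correct NE proposed by at least one system."""
    return union(systems) & gold


def precision_recall(predicted: Set[Mention], gold: Set[Mention]) -> Tuple[float, float]:
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Invented outputs for the three systems on a toy document.
gold = {(0, 2, "person"), (5, 6, "org"), (9, 10, "gpe")}
base = {(0, 2, "person")}
bama = {(0, 2, "person"), (5, 6, "org"), (7, 8, "gpe")}
mada = {(0, 2, "person"), (9, 10, "gpe"), (12, 13, "org")}
systems = [base, bama, mada]

for name, combined in [("union", union(systems)),
                       ("intersection", intersection(systems)),
                       ("oracle", oracle(systems, gold))]:
    print(name, precision_recall(combined, gold))
```

In the toy data, union raises recall at the cost of precision, intersection does the opposite, and the oracle gives an upper bound on what a perfect selector could reach.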

Conclusion
• Morphological disambiguation in-context using MADA helps NER
• More precisely, what helps NER is the case (uppercase/lowercase) of the English gloss of the MADA-selected entry!
• Ideas for improvement:
  • Gazetteer for recall improvement
  • Use of lexemes (lemmatization, performed by MADA) in clustering
• Can other MT-based techniques help NER in a NER-resource-poorer language?
