1 / 16

Improving NER in Arabic using a Morphological Tagger

Improving NER in Arabic using a Morphological Tagger. Benjamin Farber, Dayne Freitag FairIsaac Nizar Habash, Owen Rambow Columbia-CCLS habash@ccls.columbia.edu. Overview. Named Entity Recognition (NER) NER for Arabic: the Challenges Using Morphological Analysis and Disambiguation

lorant
Download Presentation

Improving NER in Arabic using a Morphological Tagger

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Improving NER in Arabic using a Morphological Tagger Benjamin Farber, Dayne Freitag FairIsaac Nizar Habash, Owen Rambow Columbia-CCLS habash@ccls.columbia.edu

  2. Overview • Named Entity Recognition (NER) • NER for Arabic: the Challenges • Using Morphological Analysis and Disambiguation • Error Analysis

  3. Named Entity Recognition:Mention Detection Geo-political entity (GPE) Organization Person Sample below from Adaptive Content Extraction (ACE) Our objective: Find name mentions NEW YORK, March 19 (AFP) Media tycoon Barry Diller on Wednesday quit as chief of Vivendi Universal Entertainment…

  4. NER: The Approach Structured perceptron label sequence model BIO-encoded name mentions Token stream O O B:person I:person O O O O O O B:org … Media tycoon Barry Diller on Wednesday quit as chief of Vivendi … Mentions Person: “Barry Diller” Organization: “Vivendi Universal Entertainment”

  5. The Role of Features Media tycoon Barry Diller • State-of-the-art methods rely on word-local features • Typically have form F(S, i)  {0,1}, for sequence S and position i • Classes of feature • Word identity (e.g., word at index is “the”) • Orthographic (e.g., word is capitalized) • Lexical (e.g., word is a noun, or in a list of cities) word_diller:1 word_the:0 capitalized:1 numeric:0 in_name_list:0 … word_barry:1 word_the:0 capitalized:1 numeric:0 in_name_list:1 … word_tycoon:1 word_the:0 capitalized:1 numeric:0 … word_media:1 word_the:0 capitalized:1 numeric:0 …

  6. Challenges of Arabic NER للارمن ل+ ال+ ارمن “for the Armenians” Orthographic ambiguity Dearth of lexical features Clitics, affixation Arabic rich in affixes best tokenization? Omission of short vowels Increased lexical ambiguity Word identity less reliable feature English features Word identity Orthographic Gazetteers والحكومة و+ ال+ حكومة Arabic features Word identity “and the government” Future work Addressed in this study Addressed in this study

  7. Features Based on Term Clusters • Distributional term clustering using unlabeled corpora • Source of features for NER (Miller, et al, 2004; Freitag, 2004) • Boolean features reflecting cluster membership Example Arabic clusters Example English clusters

  8. Morphological Analysis and Disambiguation for Arabic (MADA) • Buckwalter Arabic Morphological Analysis (BAMA): ;;WORD byn bay˜ana=[bay˜an_1 POS:V +PV +S:3MS BW:+bay˜an/PV+a/PVSUFF_SUBJ:3MS] = declare/demonstrate bayonu=[bayona_1 POS:N +NOM +DEF BW:+bayon/NOUN+u/CASE_DEF_NOM] = between/among biyn=[biyn_1 POS:PN BW:+biyn/NOUN_PROP+] = Ben • MADA (Habash and Rambow, 2005; Roth et al 2008): Which BAMA analysis is correct? • Combination of classifiers on orthogonal dimensions of Arabic morphology • 96% disambiguation accuracy • 99.3% word-level PATB tokenization accuracy

  9. The Initial Experiment • Two new features: • Capitalized gloss (GlossCap) • No entry exists for a word in our morphological database (OOV) ;;WORD byn bay˜ana=[bay˜an_1 POS:V +PV +S:3MS BW:+bay˜an/PV+a/PVSUFF_SUBJ:3MS] = declare/demonstrate bayonu=[bayona_1 POS:N +NOM +DEF BW:+bayon/NOUN+u/CASE_DEF_NOM] = between/among biyn=[biyn_1 POS:PN BW:+biyn/NOUN_PROP+] = Ben • Two enhanced NER models, OOV feature in both: • BAMA only. A GlossCap feature is true if the gloss of any analysis returned by BAMA is capitalized. • MADA. A GlossCap feature is true only if the gloss of the analysis selected by MADA is capitalized.

  10. Results Base: Recall limited BAMA: Marginal improvement MADA: 7% Improvement in recall while also improving precision!

  11. Overview • Named Entity Recognition (NER) • NER for Arabic: the Challenges • Using Morphological Analysis and Disambiguation • Error Analysis

  12. Error Analysis

  13. Spans or Tags? • Correct NER = span is correct AND tag is correct • Question: how hard is each component of the problem? • Evaluate performance • On Span AND Tag (S&T) • (same evaluation as before) • On Span only (S) • Note: different evaluation set, thus different numbers for S&T compared to earlier • ConclusionThe harder problem in NER is the correct identification of the spans.

  14. Errors by Type • Categorizing all errors in development set by type • 44% are recall errors (we miss an NE) • 16% are precision errors (we propose a false NE) • 25% are span errors (we propose one or more false span that overlap(s) with a gold span) • only 15% are label errors (the span is correct, the label is not) • Confirms previous result that labels are not an important source of errors • Recall errors (we do not find NE): these are often very common entities • Major way to improve results: improve recall, perhaps by use of gazeteer

  15. System Combination Experiments • We have three systems – can we combine? • Baseline without morphology • System with analyzer only (BAMA) • System with disambiguated analysis (MADA) • Three combinations: • Oracle: choose the correct system • Union: choose NE if any system chooses NE • Intersection: choose NE only if all systems choose NE • Precision – Recall tradeoff • Oracle shows high potential

  16. Conclusion • Morphological disambiguation in-context using MADA helps NER • More precisely, what helps NER is the case (uppercase/lowercase) of the English gloss of the MADA selected entry! • Ideas for improvement: • Gazetteer for recall improvement • Use of lexemes (lemmatization, performed by MADA) in clustering • Can other MT-based techniques help NER in a NER-resource-poorer language?

More Related