Adapting ANNIE for Named Entity Recognition in Cebuano: A Surprising Success Story

This presentation describes how a named entity recognition system for the Cebuano language was developed in just 4 days, with no native speakers and no training data. It covers the adaptation of the ANNIE system, the rule-based approach, resource collection, and the evaluation results.

czerwinski

Presentation Transcript


  1. Sheffield -- Victims of Mad Cow Disease???? Or is it really possible to develop a named entity recognition system in 4 days on a surprise language with no native speakers and no training data?

  2. Named Entity Recognition
     • Decided to work on resource collection and development for NE
     • Wanted to test our claims that ANNIE is easy to adapt to new languages and tasks
     • A rule-based approach meant we didn’t need training data
     • But could we write rules without knowing any Cebuano?

  3. Adapting ANNIE for Cebuano
     • Default IE system is for English, but some modules can be used directly
     • Used tokeniser, splitter, POS tagger, gazetteer, NE grammar, orthomatcher
     • Splitter and orthomatcher left unmodified
     • Added tokenisation post-processing, a new lexicon for the POS tagger and new gazetteers
     • Modified the POS tagger implementation and the NE grammars

  4. Gazetteer
     • Perhaps surprisingly, very little information on the Web
     • Mined English texts about the Philippines for names of cities, first names, organisations ...
     • Used bilingual dictionaries to create “finite” lists such as days of the week, months of the year ...
     • Mined Cebuano texts for “clue words” by a combination of bootstrapping, guessing and bilingual dictionaries
     • Kept the English gazetteer because there were many English proper nouns and little ambiguity
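
To picture how such lists are used, here is a minimal sketch of a gazetteer lookup in Python. It is not the GATE gazetteer implementation; the list names and entries are illustrative assumptions, since the presentation does not show the actual lists.

```python
import re

# Illustrative entries only; the lists built for the Cebuano system are not shown
# in the presentation, so these names are assumptions.
GAZETTEER = {
    "location": {"Cebu", "Manila", "Davao City"},
    "day": {"Lunes", "Martes"},
}

def build_index(gazetteer):
    """Map each lower-cased entry (as a tuple of tokens) to its list name."""
    index = {}
    for major_type, entries in gazetteer.items():
        for entry in entries:
            index[tuple(entry.lower().split())] = major_type
    return index

def lookup(text, gazetteer):
    """Return (start_token, end_token, matched_text, major_type) tuples.

    At each position the longest matching entry wins, so "Davao City"
    is preferred over any shorter entry starting with "Davao".
    """
    index = build_index(gazetteer)
    max_len = max(len(key) for key in index)
    tokens = re.findall(r"\w+", text)
    matches, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in index:
                matches.append((i, i + n, " ".join(tokens[i:i + n]), index[key]))
                i += n
                break
        else:
            # No entry starts at this token; move on.
            i += 1
    return matches
```

Calling `lookup("Flights from Davao City to Cebu", GAZETTEER)` yields one match per list entry found, each carrying the name of the list it came from, which is the information a downstream NE rule would test.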

  5. NE grammars
     • Most English JAPE rules are based on POS tags and gazetteer lookup
     • Grammars can be reused for languages with similar word order, orthography etc.
     • Most of the rules were left as for English, with a few minor adjustments
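
The idea of a rule that fires on POS tags plus gazetteer lookups can be sketched in Python as follows. This is not JAPE and not one of the actual ANNIE rules; the `Token` class, the `first_name` list name and the `NNP` tag are assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    pos: str            # POS tag, e.g. "NNP" for a proper noun (assumed tagset)
    lookup: str = ""    # name of the gazetteer list the token matched, if any

def person_rule(tokens):
    """Tag 'first-name lookup followed by one or more proper nouns' as Person.

    Returns (start_index, end_index, label) spans over the token list.
    """
    spans, i = [], 0
    while i < len(tokens):
        if tokens[i].lookup == "first_name":
            j = i + 1
            # Extend the span over the following proper nouns.
            while j < len(tokens) and tokens[j].pos == "NNP":
                j += 1
            if j > i + 1:
                spans.append((i, j, "Person"))
                i = j
                continue
        i += 1
    return spans
```

Because the rule refers only to tags and list names, not to specific words, it can in principle be reused for another language once the tagger lexicon and the gazetteer lists have been replaced, which is the reuse the slide describes.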

  6. Evaluation (1)
     • System annotated 10 news texts, output as colour-coded HTML
     • Evaluated on paper by a native Cebuano speaker from UMD
     • Evaluation not perfect due to lack of annotator training
     • 85.1% Precision, 58.2% Recall, 71.65% F-measure
     • Non-reusable
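
As a worked check of how the three figures relate, the sketch below computes two ways of combining precision and recall. The slide does not say which formula was used; the function name and the `beta` parameter are assumptions.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F-beta)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

precision, recall = 85.1, 58.2  # figures from the slide, in percent

harmonic = f_measure(precision, recall)   # balanced F1
arithmetic = (precision + recall) / 2     # plain average

print(f"harmonic mean:   {harmonic:.2f}")
print(f"arithmetic mean: {arithmetic:.2f}")
```

With these inputs the balanced harmonic mean comes to about 69.1%, while the plain average of precision and recall is 71.65%, which coincides with the figure on the slide.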

  7. Evaluation (2)
     • 2nd evaluation used 21 news texts, hand-tagged on paper and later converted to GATE annotations
     • System annotations compared with the “gold standard”
     • Reusable
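
Comparing system output with a gold standard can be sketched as a set comparison over annotations. This is a minimal illustration under strict matching, not GATE's own comparison tooling; the `(start, end, type)` tuple representation and the sample annotations are assumptions.

```python
def score(system, gold):
    """Strict matching: an annotation counts as correct only if its
    start offset, end offset and type all agree with a gold annotation."""
    system, gold = set(system), set(gold)
    correct = len(system & gold)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical annotations as (start_offset, end_offset, type).
gold = {(0, 5, "Person"), (10, 14, "Location"), (20, 28, "Organization")}
system = {(0, 5, "Person"), (10, 14, "Organization")}

precision, recall = score(system, gold)
```

Here one of the two system annotations is correct and one of the three gold annotations is found. Once the gold standard exists in machine-readable form, the same comparison can be rerun after every change to the rules or lists, which is what makes this second evaluation reusable.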

  8. Results

  9. What did we learn?
     • Even the most bizarre (and simple) ideas are worth trying
     • Trying a variety of different approaches from the outset is fundamental
     • Communication is vital (being nocturnal helps too if you’re in the UK)
     • Good mechanisms for evaluation need to be factored in