20th of May 2004

20th of May 2004 Mixed-Lingual Entity Recognition Beatrice Alex, School of Informatics, The University of Edinburgh

Named Entity Recognition • What is a named entity (NE)? A string that refers to a particular kind of object in the world, e.g. “John Lennon” = NE of type person; “T-Mobile” = NE of type organisation; “Edinburgh” = NE of type location • How are they recognised? Use of internal and external context

NER Methods • Rule-based • hand-written patterns • rely on punctuation, capitalisation and other features in the text • Statistical • data-driven approaches • exploit the statistical properties of real language to learn models • Hybrid methods

PhD Proposal • Supervisors: Claire Grover, Stephen Clark • Proposed research topic: mixed-lingual NER, i.e. the detection and classification of NEs in a different language from the base language of the text • Examples: „Das Central Command erklärte, das Schicksal des Piloten sei noch ungeklärt.“ (“Central Command explained, the fate of the pilot is still unclear.”) “Germany's Die Welt reports that four people died in the heat wave last week.”

Background and Motivation • Multi-lingual and language-independent NER - active research area in NLP circles (MET-1/2, CoNLL02/03) • Many errors in German NER due to amount of foreign language material in German articles (Rössler, 2002) • Mixed-lingual NER - unspecified or beyond capabilities of existing approaches

Beneficiaries • Performance improvements of applications where NER is standardly applied (IE, QA, text summarisation, topic identification) • Valuable information to polyglot TTS synthesis • Pre-processing tool for MT systems

Denglish • English: dominant language of science & technology, air-traffic control, advertising • Increasing influence on German: “The live event was really cool. There were tickets, fast food, drinks in the basement.”

Preliminary Research • Analysis of English inclusions in German newspaper articles from three domains: (1) Internet & Telecoms, (2) EU and (3) space travel • Corpus: 16,000 tokens per domain from a German newspaper (FAZ) • Automatic classification of English tokens (NN and FM) by means of a simple lookup procedure • More than 90% of all English inclusions are nouns (Yang, 1999; Yeandle, 2001; Corr, 2003)

1. Lookup Procedure • CELEX lookup (NN|FM) in German and English databases • only in German database > DE • only in English database > EN • in both databases: • Computer, Trend, Monster • Generation, Union, Mission • Art, Tag, Rat, Fall, All • in neither database > 2. lookup procedure
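The first lookup step can be sketched roughly as below. This is only an illustration under stated assumptions: the two small sets stand in for the German and English CELEX databases, the function name `first_lookup` and the labels `AMBIGUOUS` and `UNKNOWN` are hypothetical, and the slide does not say how tokens found in both databases are resolved.

```python
# Toy stand-ins for the German and English CELEX lexicons (assumption:
# the real databases are far larger and are not plain Python sets).
GERMAN_LEXICON = {"schicksal", "piloten", "computer", "tag", "rat"}
ENGLISH_LEXICON = {"central", "command", "computer", "tag", "rat"}


def first_lookup(token: str) -> str:
    """Classify a noun (NN) or foreign-material (FM) token by lexicon membership."""
    word = token.lower()
    in_german = word in GERMAN_LEXICON
    in_english = word in ENGLISH_LEXICON
    if in_german and not in_english:
        return "DE"
    if in_english and not in_german:
        return "EN"
    if in_german and in_english:
        # Found in both databases (e.g. Computer, Tag, Rat); the slide
        # lists such cases but does not state how they are resolved.
        return "AMBIGUOUS"
    # Found in neither database: hand over to the second lookup procedure.
    return "UNKNOWN"


for token in ["Schicksal", "Command", "Computer", "Mausklick"]:
    print(token, first_lookup(token))
```

Tokens that come back as `UNKNOWN` correspond to the "in neither database" case, which the slides pass on to the second lookup procedure.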

2. Lookup Procedure • Google lookup with language preference • German compounds: Mausklick (mouse click) • English unhyphenated compounds: Homepage • Mixed-lingual unhyphenated compounds: Shuttleflug (shuttle flight) • English nouns with German inflections: Receivern • Abbreviations and acronyms: GPS, UKW • Words with spelling mistakes: Abruch (abortion) • English words with American spelling: Center • Classification based on number of hits
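A minimal sketch of classification by hit counts follows. Everything named here is an assumption for illustration: `hit_count` is a hypothetical callable standing in for a web search restricted to one language, the counts in `FAKE_HITS` are invented, and the tie rule (fall back to German, the base language) is a guess, since the slide only says that classification is based on the number of hits.

```python
from typing import Callable


def second_lookup(token: str, hit_count: Callable[[str, str], int]) -> str:
    """Label a token EN or DE by comparing search hits per language preference."""
    german_hits = hit_count(token, "de")
    english_hits = hit_count(token, "en")
    # Assumption: ties go to the base language of the text (German).
    return "EN" if english_hits > german_hits else "DE"


# Invented hit counts, used only so the example runs without a search engine.
FAKE_HITS = {
    ("Homepage", "de"): 100,
    ("Homepage", "en"): 900,
    ("Mausklick", "de"): 500,
    ("Mausklick", "en"): 5,
}


def fake_hit_count(token: str, language: str) -> int:
    """Hypothetical replacement for a language-restricted search query."""
    return FAKE_HITS.get((token, language), 0)


for token in ["Homepage", "Mausklick"]:
    print(token, second_lookup(token, fake_hit_count))
```

Passing the hit counter in as a parameter keeps the comparison logic separate from whichever search service supplies the counts.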

Results • Output: Das <EN>Central</EN> <EN>Command</EN> erklärte, das <DE>Schicksal</DE> des <DE>Piloten</DE> sei noch ungeklärt. EN: Central Command explained, the fate of the pilot is still unclear. MT: CentralCommand explained, the fate of the pilot was still unsettled.
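The tagged output format shown above could be produced from per-token labels along these lines. This is a sketch, not the author's implementation: the function name `render_tagged` is hypothetical, and the input is assumed to be a list of (token, label) pairs in which only classified tokens carry a label.

```python
from typing import List, Optional, Tuple


def render_tagged(tokens: List[Tuple[str, Optional[str]]]) -> str:
    """Wrap each labelled token in <LABEL>...</LABEL>; leave other tokens as-is."""
    parts = []
    for token, label in tokens:
        if label is None:
            parts.append(token)
        else:
            parts.append(f"<{label}>{token}</{label}>")
    return " ".join(parts)


# Assumption: punctuation stays attached to the unlabelled tokens.
sentence = [
    ("Das", None), ("Central", "EN"), ("Command", "EN"), ("erklärte,", None),
    ("das", None), ("Schicksal", "DE"), ("des", None), ("Piloten", "DE"),
    ("sei", None), ("noch", None), ("ungeklärt.", None),
]
print(render_tagged(sentence))
```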

English Inclusions

Error Analysis • Sources of Error: • Wrong POS tags • Mixed-lingual unhyphenated compounds • New internationalisms • Abbreviations with several expansions • Unreliable Google hits • Inclusions from other languages • Need for better handling of NEs • Morpheme level analysis for compounds • Extension to other POS tags

Future Work • Collection of more data and annotation for training and evaluation • Development of a sequence modelling classifier, e.g. maximum entropy • Extension to other languages • Application-based evaluation (e.g. MT)
