Named Entity Discovery from Multilingual Corpora
Alexandre Klementiev, Dan Roth
NIPS 2006 • MLIA Workshop
Supported by ARDA’s AQUAINT program and a DOI grant under the Reflex program
Named Entity Recognition
Identify entities of specific types in text (e.g. people, locations, dates, organizations, etc.)
After receiving his M.B.A. from [ORG Harvard Business School], [PER Richard F. America] accepted a faculty position at the [ORG McDonough School of Business] in [LOC Washington].
Motivation • Most approaches to NER are targeted toward a specific domain: language, topic, set of tags, annotation convention, etc. • Classifiers learned in one domain are very brittle when used in another (even related) domain • Many domains (e.g. less commonly used languages) lack the required resources, or those resources are expensive to obtain
Motivation
• Adaptation: How can we reduce the requirements on the resources needed to produce a classifier for a new domain?
• Some approaches:
  • Exploit hypotheses learned in one domain to help learning in another
  • Transfer resources across languages (this work)
• Example (comparable English and Russian news): an English NER system finds Supreme Court, Romano Prodi, and Berlusconi in “The Supreme Court has confirmed a narrow win for the centre-left opposition led by Romano Prodi. But after a meeting with his advisers, Berlusconi let it be known he was considering a further legal challenge.” A Russian NER system should find Берлускони and Верховного суда in the corresponding Russian story: “Премьер-министр Италии Берлускони отказался признать решение Верховного суда, который подтвердил победу левоцентристской коалиции на всеобщих выборах.”
Outline • Introduction and Motivation • Resources • Multilingual Comparable Corpora • Transliteration • Temporal Alignment • Contextual Similarity • Topic Similarity • Named Entity Discovery • Algorithm • Details • Experiments / Results • Future Work and Summary
Resources • Comparable multilingual corpora are increasingly available • E.g., multilingual news streams, movie subtitles • Named entities in such corpora have a number of properties which can be exploited
Multilingual Comparable Corpora: Transliteration • NEs are often transliterated or share an etymological origin • Lilic → Лилич • Parliament → Парламент
Multilingual Comparable Corpora: Temporal Alignment • NEs in one language tend to co-occur in time with their counterparts in the other
Multilingual Comparable Corpora: Contextual Similarity
• NEs tend to occur in similar contexts; a dictionary can be used to score their contextual similarity
• English: “Oscar-winning Spanish filmmaker Pedro Almodovar is being sued by the Popular Party for suggesting it was fomenting a coup d'etat on the eve of the general election.” … “Almodovar's next film will be titled "El Piel Que Habito" and the cast will include actress Penelope Cruz.”
• Russian: “Народная партия Испании собирается подать в суд на известного испанского режиссера Педро Альмодовара.” … “Испанский режиссер Педро Альмодовар займет актрису Пенелопу Крус в главной роли в своем будущем фильме.”
• A dictionary links shared context words such as filmmaker ↔ режиссер, actress ↔ актриса, film ↔ фильм, sue ↔ подать в суд
Multilingual Comparable Corpora: Topic Similarity
• We expect NEs to appear in documents from a particular set of topics
• E.g., Kennedy is likely to appear in articles about politics and travel
(Figure: English (E) and Russian (R) documents grouped into topics such as Sports, Travel, Politics, and Medicine)
Multilingual Comparable Corpora: Approach • Key insight: make use of data/domain properties to drive supervision • Can use the four properties/observations to (independently) score pairs (<Source Language NE>, <Target Language Candidate>) • Training: iterative algorithm to learn a transliteration model using temporal alignment, contextual similarity and topic similarity as supervision signals • Discovery: combine scores (re-rank) • Given a bilingual corpus one side of which is tagged, discover NEs in the other language • Find single- and multi-word Named Entities • Optionally, use a dictionary to discover (partially) translated NEs (e.g. “Mount Rainier”)
Algorithm: Training
Input: bilingual comparable corpus (S, T); set of single-word Named Entities in S
Output: transliteration model M
Initialization: initialize transliteration model M and example set D
Repeat until D stops changing:
  For each NE in S:
    Collect a candidate list in T with high score (according to the current M)
    Re-rank the candidate list (e.g. by temporal similarity)
    Add the top-ranked candidate (threshold θ) to D
  Use D to re-train M
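A minimal Python sketch of this training loop; score_candidates, rerank, and retrain are hypothetical callables standing in for the transliteration scoring, temporal re-ranking, and perceptron training described on the later slides, so this is an illustration of the control flow rather than the authors' implementation.

```python
# Sketch of the iterative training loop; the three callables are assumed helpers.

def train(source_nes, target_words, seed_pairs, score_candidates, rerank, retrain,
          theta=0.9):
    """Grow the example set D until it stops changing, re-training M each pass."""
    D = set(seed_pairs)                          # small seed set of transliteration pairs
    M = retrain(D)                               # initialize transliteration model M
    while True:
        new_D = set(seed_pairs)
        for ne in source_nes:
            cands = score_candidates(M, ne, target_words)  # candidates with high M score
            ranked = rerank(ne, cands)                     # re-rank, e.g. by temporal similarity
            if ranked and ranked[0][1] >= theta:           # keep only confident top candidates
                new_D.add((ne, ranked[0][0]))
        if new_D == D:                           # D stopped changing: training is done
            return M
        D = new_D
        M = retrain(D)                           # use D to re-train M
```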
Algorithm: Discovery
Input: bilingual comparable corpus (S, T); set of Named Entities in S; transliteration model M; (optional) dictionary
Output: set of NE pairs D from S and T
For each NE in S:
  For each constituent word in the NE:
    Collect a candidate list with high M score
    (optional) Add dictionary translations to the candidate list
    Re-rank the candidate list (e.g. by temporal similarity)
    Select the top-ranked candidate (threshold θ)
  If the combined NE candidate appears in T, add it to D
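A corresponding sketch of the discovery pass over (possibly multi-word) source NEs. As above, score_candidates, rerank, and translate (the optional dictionary lookup) are hypothetical callables, and the final membership check against T is deliberately crude.

```python
# Sketch of the discovery pass; helper callables are assumptions, not the authors' API.

def discover(source_nes, target_text_words, M, score_candidates, rerank,
             translate=None, theta=0.9):
    """Return discovered (source NE, target NE) pairs D."""
    D = []
    target_text = " ".join(target_text_words)
    for ne in source_nes:
        parts = []
        for word in ne.split():
            cands = score_candidates(M, word, target_text_words)  # high M-score candidates
            if translate is not None:
                cands = list(cands) + translate(word)              # optional dictionary translations
            ranked = rerank(word, cands)                           # re-rank, e.g. temporally
            if ranked and ranked[0][1] >= theta:
                parts.append(ranked[0][0])                         # select top-ranked candidate
        candidate_ne = " ".join(parts)
        if len(parts) == len(ne.split()) and candidate_ne in target_text:
            D.append((ne, candidate_ne))                           # keep only NEs that appear in T
    return D
```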
Transliteration Model • Linear discriminative approach for the transliteration model M • Use the Perceptron algorithm to train M • M(E_S, E_T) gives the transliteration score for a source/target pair • Initialize M with: • a small (~20) set of transliterations as positive examples • non-NEs paired with random words from T as negative examples
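As a rough illustration, a mistake-driven perceptron over sparse substring-pair features (the feature representation is described on the next slide) could look like the following. This is a generic perceptron sketch, not the authors' exact training setup.

```python
from collections import defaultdict

def score(weights, features):
    """Linear score M(E_S, E_T): sum of the weights of the active features."""
    return sum(weights.get(f, 0.0) for f in features)

def train_perceptron(examples, epochs=10):
    """examples: list of (features, label) pairs with label in {+1, -1}."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, label in examples:
            if label * score(weights, features) <= 0:   # mistake-driven update
                for f in features:
                    weights[f] += label
    return dict(weights)
```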
Transliteration Model: Features
• For a pair of an NE and a candidate (E_S, E_T), partition E_S and E_T into substrings of length 0 to n (the null substring is written _)
• Each feature is a pair of substrings
• For example, (E_S, E_T) = (powell, pauel), n = 2
  • E_S → {_, p, o, w, e, l, l, po, ow, we, el, ll}
  • E_T → {_, p, a, u, e, l, pa, au, ue, el}
  • The feature vector is thus ((p, _), (p, a), … (w, au), … (el, el), … (ll, el))
• The phonetic sequence is preserved, so we can limit the number of features
  • E.g. disallow couplings whose starting positions are too far apart ((p, ue) in the above example)
• Features are extracted from examples automatically
• New features are discovered and used during training
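A small sketch of such substring-pair feature extraction. The max_offset constraint is an illustrative stand-in for the "starting positions too far apart" rule, and the null-substring pairings (e.g. (p, _)) are omitted for brevity.

```python
def substrings(word, n):
    """All substrings of length 1..n, paired with their start position."""
    return [(word[i:i + k], i) for k in range(1, n + 1)
            for i in range(len(word) - k + 1)]

def extract_features(e_s, e_t, n=2, max_offset=1):
    """Couple substrings of the two strings whose start positions are close,
    so the phonetic sequence is roughly preserved (max_offset is illustrative)."""
    feats = []
    for s, i in substrings(e_s, n):
        for t, j in substrings(e_t, n):
            # disallow couplings whose start positions are too far apart, e.g. (p, ue)
            if abs(i - j) <= max_offset:
                feats.append((s, t))
    return feats

# extract_features("powell", "pauel") includes ("p", "p"), ("ow", "au"), ("el", "el"), ("ll", "el")
```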
Transliteration Model: What Happens During Training
• The algorithm iteratively refines the transliteration model with the help of time sequence similarity scoring
• The current transliteration model chooses a list of candidates
• The best temporally aligned candidate is used for the next round of training
(Example: transliteration candidate lists for the NE forsyth over two iterations; the correct transliteration is форсайт)
Temporal Similarity: Equivalence Classes
• For languages with rich morphology, a (simplistic) assumption has to be made to group morphological variants, e.g. Мичигана, Мичигане, Мичиганский → Мичиган [-а, -е, -ский]
• Equivalence classes in our experiments:
  • Russian: common prefix of 5 letters or more, e.g. Мичиган [-а, -е, -ский, -]
  • English: unique strings, e.g. Michigan
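A minimal sketch of this grouping, keyed on the first five characters as a simplification of the "common prefix of 5 letters or more" rule:

```python
from collections import defaultdict

def equivalence_classes(tokens, prefix_len=5):
    """Group word forms that share their first prefix_len characters
    (a crude stand-in for morphological analysis)."""
    classes = defaultdict(set)
    for tok in tokens:
        classes[tok[:prefix_len]].add(tok)
    return dict(classes)

# equivalence_classes(["Мичигана", "Мичигане", "Мичиганский"])
# -> {"Мичиг": {"Мичигана", "Мичигане", "Мичиганский"}}
```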
Temporal Similarity
• Similarity of time distributions is computed using a Discrete Fourier Transform based metric
• Euclidean distance between the vectors of Fourier expansion coefficients of the time distributions
• About 41% accuracy using this metric alone
• More robust to misalignment than Cosine and Pearson similarity
(Figure: accuracy vs. level of alignment for the three metrics)
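A sketch of such a DFT-based score, assuming numpy; the normalization and any truncation of the coefficient vector are illustrative choices rather than the exact metric used in the experiments.

```python
import numpy as np

def temporal_similarity(counts_s, counts_t, k=None):
    """Negative Euclidean distance between Fourier coefficients of the two
    (normalized) mention-count time series; larger means better aligned."""
    def coeffs(counts):
        x = np.asarray(counts, dtype=float)
        x = x / x.sum() if x.sum() > 0 else x     # turn counts into a distribution
        c = np.fft.rfft(x)
        return c if k is None else c[:k]          # optionally keep first k coefficients
    return -np.linalg.norm(coeffs(counts_s) - coeffs(counts_t))

# Usage (hypothetical count tables): temporal_similarity(counts_en["forsyth"], counts_ru["форсайт"])
```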
Contextual Similarity • Collect a ‘contextual signature’ for each NE equivalence class • Collect and count context words around each mention of an NE / candidate • Compute a TF-IDF-like score for each context word • Contextual signature = 20 context words with the highest score • Compute the contextual score of (E_S, E_T) as the number of dictionary translations between their contextual signatures
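A sketch of the signature construction and dictionary-based scoring; the exact TF-IDF-style weighting here is an assumption, and dictionary is assumed to map a word to a list of its translations.

```python
import math
from collections import Counter

def contextual_signature(context_words, doc_freq, n_docs, top_k=20):
    """context_words: all words seen near mentions of one NE equivalence class.
    Keep the top_k words by a TF-IDF-like weight as the signature."""
    tf = Counter(context_words)
    weight = {w: tf[w] * math.log(n_docs / (1.0 + doc_freq.get(w, 0))) for w in tf}
    return set(sorted(weight, key=weight.get, reverse=True)[:top_k])

def contextual_score(sig_s, sig_t, dictionary):
    """Count source-signature words with a dictionary translation in the target signature."""
    return sum(1 for w in sig_s if any(t in sig_t for t in dictionary.get(w, [])))
```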
Experiments: Discovery • Almost 5 years of (short) BBC news articles and loose Russian translations • 20 pairs of NEs and their transliterations to initialize the transliteration model
Experiments: Discovery [single-word NEs]
• Complete algorithm: correct candidate ranked top: 63.8%; in top 5: 70.3%; in top 10: 76.5%; in top 20: 80.5%; in top 30: 81.9%
• Temporal similarity only: 41.0%
• Transliteration model only: 44.2%
Experiments: Discovery [multi-word NEs]
• 66% accuracy on multi-word (two or more words) NEs during discovery
• Examples fall into three classes: transliterated, translated, and partially translated
Experiments: Initial Example Set Size
• M initialized with 80, 20, and 5 transliteration pairs
• Smaller initialization sets take longer to converge
• Noise slows down convergence
(Figure: convergence for the 80-, 20-, and 5-pair initializations)
Temporal vs. Context Supervision [work in progress]
• Temporal supervision: 63.8%
• Context supervision: 40.9%
Future / Current Work • Scoring functions provide independent sources of supervision: combine them • Learn the transliteration model and the scoring function weights simultaneously (in a co-training fashion) • Adapt to a new corpus without supervision (e.g. use transliteration as supervision to re-weight the scoring functions) • Add a topic similarity score; multilingual clustering may be interesting in its own right • More languages • Train an NER system on the automatically tagged target corpus and compare to training on a hand-tagged corpus
Summary • Key insight: make use of domain properties to drive supervision • Algorithm for Named Entity discovery in multilingual comparable corpora • Little supervision • Small initialization set of transliteration pairs • Simplistic morphological assumptions for target language • Discriminative transliteration model • Three scoring functions Group’s web site: http://l2r.cs.uiuc.edu/~cogcomp/ Demo: http://l2r.cs.uiuc.edu/~cogcomp/demo.php?dkey=MNED
Corpus/Experiment Specifics • 2,327 documents [spanning 1/1/2001 through 10/05/2005] • 14,781 equivalence classes • Single-word: 978 NEs, 727 were checked • Multi-word (2 or more): random 282 verified by a human