Named Entity Discovery from Multilingual Corpora
Alexandre Klementiev, Dan Roth
NIPS 2006 • MLIA Workshop
Supported by ARDA’s AQUAINT program and a DOI grant under the Reflex program
Named Entity Recognition
Identify entities of specific types in text (e.g. people, locations, dates, organizations, etc.)
After receiving his M.B.A. from [ORG Harvard Business School], [PER Richard F. America] accepted a faculty position at the [ORG McDonough School of Business] in [LOC Washington].
Motivation • Most approaches to NER are targeted toward a specific domain: language, topic, set of tags, annotation convention, etc. • Classifiers learned in one domain are very brittle when used in another (even related) domain • Many domains (e.g. less commonly used languages) lack the required resources, or those resources are expensive to obtain
Motivation
• Adaptation: How can we reduce the requirements on the resources needed to produce a classifier for a new domain?
• Some approaches:
  • Exploit hypotheses learned in one domain to help learning in another
  • Transfer resources across languages (this work)
• Example (comparable English and Russian news): an English NER system finds Supreme Court, Romano Prodi, and Berlusconi in “The Supreme Court has confirmed a narrow win for the centre-left opposition led by Romano Prodi. But after a meeting with his advisers, Berlusconi let it be known he was considering a further legal challenge.” A Russian NER system should find Берлускони and Верховного суда in the corresponding Russian story: “Премьер-министр Италии Берлускони отказался признать решение Верховного суда, который подтвердил победу левоцентристской коалиции на всеобщих выборах.”
Outline • Introduction and Motivation • Resources • Multilingual Comparable Corpora • Transliteration • Temporal Alignment • Contextual Similarity • Topic Similarity • Named Entity Discovery • Algorithm • Details • Experiments / Results • Future Work and Summary
Resources • Comparable multilingual corpora are increasingly available • E.g., multilingual news streams, movie subtitles • Named entities in such corpora have a number of properties which can be exploited
Multilingual Comparable Corpora: Transliteration • NEs are often transliterated or share an etymological origin • Lilic → Лилич • Parliament → Парламент
Multilingual Comparable Corpora: Temporal Alignment • NEs in one language tend to co-occur in time with their counterparts in the other
Multilingual Comparable Corpora: Contextual Similarity
• NEs tend to occur in similar contexts; a dictionary can be used to score their contextual similarity
• English: “Oscar-winning Spanish filmmaker Pedro Almodovar is being sued by the Popular Party for suggesting it was fomenting a coup d'etat on the eve of the general election.” … “Almodovar's next film will be titled "El Piel Que Habito" and the cast will include actress Penelope Cruz.”
• Russian: “Народная партия Испании собирается подать в суд на известного испанского режиссера Педро Альмодовара.” … “Испанский режиссер Педро Альмодовар займет актрису Пенелопу Крус в главной роли в своем будущем фильме.”
• A dictionary links shared context words such as filmmaker ↔ режиссер, actress ↔ актриса, film ↔ фильм, sue ↔ подать в суд
Multilingual Comparable Corpora: Topic Similarity
• We expect NEs to appear in documents from a particular set of topics
• E.g., Kennedy is likely to appear in articles about politics and travel
(Figure: English (E) and Russian (R) documents grouped into topics such as Sports, Travel, Politics, and Medicine)
Multilingual Comparable Corpora: Approach • Key insight: make use of data/domain properties to drive supervision • Can use the four properties/observations to (independently) score pairs (<Source Language NE>, <Target Language Candidate>) • Training: iterative algorithm to learn a transliteration model using temporal alignment, contextual similarity and topic similarity as supervision signals • Discovery: combine scores (re-rank) • Given a bilingual corpus one side of which is tagged, discover NEs in the other language • Find single- and multi-word Named Entities • Optionally, use a dictionary to discover (partially) translated NEs (e.g. “Mount Rainier”)
Algorithm: Training
Input: bilingual comparable corpus (S, T); set of single-word Named Entities in S
Output: transliteration model M
Initialization: initialize transliteration model M and example set D
Repeat until D stops changing:
  For each NE in S:
    Collect a candidate list in T with high score (according to the current M)
    Re-rank the candidate list (e.g. by temporal similarity)
    Add the top-ranked candidate (threshold θ) to D
  Use D to re-train M
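A minimal Python sketch of this training loop; score_candidates, rerank, and retrain are hypothetical callables standing in for the transliteration scoring, temporal re-ranking, and perceptron training described on the later slides, so this is an illustration of the control flow rather than the authors' implementation.

```python
# Sketch of the iterative training loop; the three callables are assumed helpers.

def train(source_nes, target_words, seed_pairs, score_candidates, rerank, retrain,
          theta=0.9):
    """Grow the example set D until it stops changing, re-training M each pass."""
    D = set(seed_pairs)                          # small seed set of transliteration pairs
    M = retrain(D)                               # initialize transliteration model M
    while True:
        new_D = set(seed_pairs)
        for ne in source_nes:
            cands = score_candidates(M, ne, target_words)  # candidates with high M score
            ranked = rerank(ne, cands)                     # re-rank, e.g. by temporal similarity
            if ranked and ranked[0][1] >= theta:           # keep only confident top candidates
                new_D.add((ne, ranked[0][0]))
        if new_D == D:                           # D stopped changing: training is done
            return M
        D = new_D
        M = retrain(D)                           # use D to re-train M
```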
Algorithm: Discovery
Input: bilingual comparable corpus (S, T); set of Named Entities in S; transliteration model M; (optional) dictionary
Output: set of NE pairs D from S and T
For each NE in S:
  For each constituent word in the NE:
    Collect a candidate list with high M score
    (optional) Add dictionary translations to the candidate list
    Re-rank the candidate list (e.g. by temporal similarity)
    Select the top-ranked candidate (threshold θ)
  If the combined NE candidate appears in T, add it to D
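A corresponding sketch of the discovery pass over (possibly multi-word) source NEs. As above, score_candidates, rerank, and translate (the optional dictionary lookup) are hypothetical callables, and the final membership check against T is deliberately crude.

```python
# Sketch of the discovery pass; helper callables are assumptions, not the authors' API.

def discover(source_nes, target_text_words, M, score_candidates, rerank,
             translate=None, theta=0.9):
    """Return discovered (source NE, target NE) pairs D."""
    D = []
    target_text = " ".join(target_text_words)
    for ne in source_nes:
        parts = []
        for word in ne.split():
            cands = score_candidates(M, word, target_text_words)  # high M-score candidates
            if translate is not None:
                cands = list(cands) + translate(word)              # optional dictionary translations
            ranked = rerank(word, cands)                           # re-rank, e.g. temporally
            if ranked and ranked[0][1] >= theta:
                parts.append(ranked[0][0])                         # select top-ranked candidate
        candidate_ne = " ".join(parts)
        if len(parts) == len(ne.split()) and candidate_ne in target_text:
            D.append((ne, candidate_ne))                           # keep only NEs that appear in T
    return D
```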
Transliteration Model • Linear discriminative approach for the transliteration model M • Use the Perceptron algorithm to train M • M(E_S, E_T) gives the transliteration score for a source/target pair • Initialize M with: • a small (~20) set of transliterations as positive examples • non-NEs paired with random words from T as negative examples
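As a rough illustration, a mistake-driven perceptron over sparse substring-pair features (the feature representation is described on the next slide) could look like the following. This is a generic perceptron sketch, not the authors' exact training setup.

```python
from collections import defaultdict

def score(weights, features):
    """Linear score M(E_S, E_T): sum of the weights of the active features."""
    return sum(weights.get(f, 0.0) for f in features)

def train_perceptron(examples, epochs=10):
    """examples: list of (features, label) pairs with label in {+1, -1}."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, label in examples:
            if label * score(weights, features) <= 0:   # mistake-driven update
                for f in features:
                    weights[f] += label
    return dict(weights)
```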
Transliteration Model: Features
• For a pair of an NE and a candidate (E_S, E_T), partition E_S and E_T into substrings of length 0 to n (the null substring is written _)
• Each feature is a pair of substrings
• For example, (E_S, E_T) = (powell, pauel), n = 2
  • E_S → {_, p, o, w, e, l, l, po, ow, we, el, ll}
  • E_T → {_, p, a, u, e, l, pa, au, ue, el}
  • The feature vector is thus ((p, _), (p, a), … (w, au), … (el, el), … (ll, el))
• The phonetic sequence is preserved, so we can limit the number of features
  • E.g. disallow couplings whose starting positions are too far apart ((p, ue) in the above example)
• Features are extracted from examples automatically
• New features are discovered and used during training
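A small sketch of such substring-pair feature extraction. The max_offset constraint is an illustrative stand-in for the "starting positions too far apart" rule, and the null-substring pairings (e.g. (p, _)) are omitted for brevity.

```python
def substrings(word, n):
    """All substrings of length 1..n, paired with their start position."""
    return [(word[i:i + k], i) for k in range(1, n + 1)
            for i in range(len(word) - k + 1)]

def extract_features(e_s, e_t, n=2, max_offset=1):
    """Couple substrings of the two strings whose start positions are close,
    so the phonetic sequence is roughly preserved (max_offset is illustrative)."""
    feats = []
    for s, i in substrings(e_s, n):
        for t, j in substrings(e_t, n):
            # disallow couplings whose start positions are too far apart, e.g. (p, ue)
            if abs(i - j) <= max_offset:
                feats.append((s, t))
    return feats

# extract_features("powell", "pauel") includes ("p", "p"), ("ow", "au"), ("el", "el"), ("ll", "el")
```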
Transliteration Model: What Happens During Training
• The algorithm iteratively refines the transliteration model with the help of time sequence similarity scoring
• The current transliteration model chooses a list of candidates
• The best temporally aligned candidate is used for the next round of training
(Example: transliteration candidate lists for the NE forsyth over two iterations; the correct transliteration is форсайт)
Temporal Similarity: Equivalence Classes
• For languages with rich morphology, a (simplistic) assumption has to be made to group morphological variants, e.g. Мичигана, Мичигане, Мичиганский → Мичиган [-а, -е, -ский]
• Equivalence classes in our experiments:
  • Russian: common prefix of 5 letters or more, e.g. Мичиган [-а, -е, -ский, -]
  • English: unique strings, e.g. Michigan
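A minimal sketch of this grouping, keyed on the first five characters as a simplification of the "common prefix of 5 letters or more" rule:

```python
from collections import defaultdict

def equivalence_classes(tokens, prefix_len=5):
    """Group word forms that share their first prefix_len characters
    (a crude stand-in for morphological analysis)."""
    classes = defaultdict(set)
    for tok in tokens:
        classes[tok[:prefix_len]].add(tok)
    return dict(classes)

# equivalence_classes(["Мичигана", "Мичигане", "Мичиганский"])
# -> {"Мичиг": {"Мичигана", "Мичигане", "Мичиганский"}}
```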
Temporal Similarity
• Similarity of time distributions is computed using a Discrete Fourier Transform based metric
• Euclidean distance between the vectors of Fourier expansion coefficients of the time distributions
• About 41% accuracy using this metric alone
• More robust to misalignment than Cosine and Pearson similarity
(Figure: accuracy vs. level of alignment for the three metrics)
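A sketch of such a DFT-based score, assuming numpy; the normalization and any truncation of the coefficient vector are illustrative choices rather than the exact metric used in the experiments.

```python
import numpy as np

def temporal_similarity(counts_s, counts_t, k=None):
    """Negative Euclidean distance between Fourier coefficients of the two
    (normalized) mention-count time series; larger means better aligned."""
    def coeffs(counts):
        x = np.asarray(counts, dtype=float)
        x = x / x.sum() if x.sum() > 0 else x     # turn counts into a distribution
        c = np.fft.rfft(x)
        return c if k is None else c[:k]          # optionally keep first k coefficients
    return -np.linalg.norm(coeffs(counts_s) - coeffs(counts_t))

# Usage (hypothetical count tables): temporal_similarity(counts_en["forsyth"], counts_ru["форсайт"])
```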
Contextual Similarity • Collect a ‘contextual signature’ for each NE equivalence class • Collect and count context words around each mention of an NE / candidate • Compute a TF-IDF-like score for each context word • Contextual signature = 20 context words with the highest score • Compute the contextual score of (E_S, E_T) as the number of dictionary translations between their contextual signatures
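A sketch of the signature construction and dictionary-based scoring; the exact TF-IDF-style weighting here is an assumption, and dictionary is assumed to map a word to a list of its translations.

```python
import math
from collections import Counter

def contextual_signature(context_words, doc_freq, n_docs, top_k=20):
    """context_words: all words seen near mentions of one NE equivalence class.
    Keep the top_k words by a TF-IDF-like weight as the signature."""
    tf = Counter(context_words)
    weight = {w: tf[w] * math.log(n_docs / (1.0 + doc_freq.get(w, 0))) for w in tf}
    return set(sorted(weight, key=weight.get, reverse=True)[:top_k])

def contextual_score(sig_s, sig_t, dictionary):
    """Count source-signature words with a dictionary translation in the target signature."""
    return sum(1 for w in sig_s if any(t in sig_t for t in dictionary.get(w, [])))
```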
Experiments: Discovery • Almost 5 years of (short) BBC news articles and loose Russian translations • 20 pairs of NEs and their transliterations to initialize the transliteration model
Experiments: Discovery [single-word NEs]
• Complete algorithm: correct candidate ranked top: 63.8%; in top 5: 70.3%; in top 10: 76.5%; in top 20: 80.5%; in top 30: 81.9%
• Temporal similarity only: 41.0%
• Transliteration model only: 44.2%
Experiments: Discovery [multi-word NEs]
• 66% accuracy on multi-word (two or more words) NEs during discovery
• Examples fall into three classes: transliterated, translated, and partially translated
Experiments: Initial Example Set Size
• M initialized with 80, 20, and 5 transliteration pairs
• Smaller initialization sets take longer to converge
• Noise slows down convergence
(Figure: convergence for the 80-, 20-, and 5-pair initializations)
Temporal vs. Context Supervision [work in progress]
• Temporal supervision: 63.8%
• Context supervision: 40.9%
Future / Current Work • Scoring functions provide independent sources of supervision: combine them • Learn the transliteration model and the scoring function weights simultaneously (in a co-training fashion) • Adapt to a new corpus without supervision (e.g. use transliteration as supervision to re-weight the scoring functions) • Add a topic similarity score; multilingual clustering may be interesting in its own right • More languages • Train an NER system on the automatically tagged target corpus and compare to training on a hand-tagged corpus
Summary • Key insight: make use of domain properties to drive supervision • Algorithm for Named Entity discovery in multilingual comparable corpora • Little supervision • Small initialization set of transliteration pairs • Simplistic morphological assumptions for target language • Discriminative transliteration model • Three scoring functions Group’s web site: http://l2r.cs.uiuc.edu/~cogcomp/ Demo: http://l2r.cs.uiuc.edu/~cogcomp/demo.php?dkey=MNED
Corpus/Experiment Specifics • 2,327 documents [spanning 1/1/2001 through 10/05/2005] • 14,781 equivalence classes • Single-word: 978 NEs, 727 were checked • Multi-word (2 or more): random 282 verified by a human