
Named Entity Discovery from Multilingual Corpora

Named Entity Discovery from Multilingual Corpora. Alexandre Klementiev, Dan Roth. NIPS 2006 • MLIA Workshop. Supported by ARDA’s AQUAINT program and a DOI grant under the Reflex program.

Presentation Transcript


  1. Named Entity Discovery from Multilingual Corpora • Alexandre Klementiev, Dan Roth • NIPS 2006 • MLIA Workshop • Supported by ARDA’s AQUAINT program and a DOI grant under the Reflex program

  2. Named Entity Recognition • Identify entities of specific types in text (e.g. people, locations, dates, organizations) • Example: After receiving his M.B.A. from [ORG Harvard Business School], [PER Richard F. America] accepted a faculty position at the [ORG McDonough School of Business] in [LOC Washington].

  3. Motivation • Most approaches in NER are targeted toward a specific domain: language, topic, set of tags, annotation convention, etc. • Classifiers learned in one domain are very brittle when used in another (even related) domain • Many domains (e.g. less commonly used languages) do not have the required resources / they are expensive to obtain

  4. Motivation • Adaptation: How can we reduce the requirements on the resources needed to produce a classifier for a new domain? • Some approaches: • Exploit hypotheses learned in one domain to help learning in another • Transfer resources across languages (this work)
     [Figure: an English NER system tags "Supreme Court", "Romano Prodi", and "Berlusconi" in the English text; a Russian NER system tags "Верховного суда" and "Берлускони" in the comparable Russian text.]
     English text: The Supreme Court has confirmed a narrow win for the centre-left opposition led by Romano Prodi. But after a meeting with his advisers, Berlusconi let it be known he was considering a further legal challenge.
     Russian text: Премьер-министр Италии Берлускони отказался признать решение Верховного суда, который подтвердил победу левоцентристской коалиции на всеобщих выборах.

  5. Outline • Introduction and Motivation • Resources • Multilingual Comparable Corpora • Transliteration • Temporal Alignment • Contextual Similarity • Topic Similarity • Named Entity Discovery • Algorithm • Details • Experiments / Results • Future Work and Summary

  6. Resources • Comparable multilingual corpora are increasingly available • E.g., multilingual news streams, movie subtitles • Named entities in such corpora have a number of properties which can be exploited

  7. Multilingual Comparable Corpora: Transliteration • NEs are often transliterated or share etymological origin • Lilic → Лилич • Parliament → Парламент

  8. Multilingual Comparable Corpora: Temporal Alignment • NEs in one language co-occur with their counterparts in the other

  9. Multilingual Comparable Corpora: Contextual Similarity • NEs tend to occur in similar contexts; a dictionary can be used to score their contextual similarity
     [Figure: context words around mentions of Almodovar in English and Russian, linked through a dictionary: film, actress, filmmaker, sued ↔ фильме, актрису, режиссер, подать в суд.]
     English: Almodovar's next film will be titled "El Piel Que Habito" and the cast will include actress Penelope Cruz. . . . Oscar-winning Spanish filmmaker Pedro Almodovar is being sued by the Popular Party for suggesting it was fomenting a coup d'etat on the eve of the general election.
     Russian: Испанский режиссер Педро Альмодовар займет в главной роли в своем будущем фильме актрису Пенелопу Крус. . . . Народная партия Испании собирается подать в суд на известного испанского режиссера Педро Альмодовара.

  10. Multilingual Comparable Corpora: Topic Similarity • We expect NEs to appear in documents from a particular set of topics • Kennedy is likely to appear in articles about politics, travel
      [Figure: documents marked E and R grouped under the topics Sports, Travel, Politics, and Medicine.]

  11. Multilingual Comparable Corpora: Approach • Key insight: make use of data/domain properties to drive supervision • Can use the four properties/observations to (independently) score pairs (<Source Language NE>, <Target Language Candidate>) • Training: iterative algorithm to learn a transliteration model using temporal alignment, contextual similarity, and topic similarity as supervision signals • Discovery: Combine scores (re-rank) • Given a bilingual corpus one side of which is tagged, discover NEs in the other language • Find single- and multi-word Named Entities • Optionally, use a dictionary to discover (partially) translated NEs (e.g. “Mount Rainier”)

  12. Algorithm: Training
      Input: Bilingual comparable corpus (S, T); set of single-word Named Entities in S
      Output: Transliteration model M
      Initialization: Initialize transliteration model M
      Repeat
        D ← ∅
        For each NE in S
          Collect candidate list in T with high score (according to current M)
          Re-rank candidate list (e.g. temporal similarity)
          Add top ranked (θ) candidate to D
        Use D to train M
      Until D stops changing
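The training loop on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `bootstrap_training`, `toy_train`, `toy_temporal`, the `top_k` cut-off, and the reading of θ as a minimum re-ranking score are all assumptions, and the toy data uses upper-case strings as a stand-in "target language".

```python
def bootstrap_training(source_nes, target_words, seed_pairs, train_model,
                       temporal_score, top_k=2, theta=0.5, max_iters=20):
    """Iteratively refine a transliteration model M (hypothetical sketch)."""
    model = train_model(seed_pairs)              # initialize M from the seed pairs
    previous, discovered = None, set()
    for _ in range(max_iters):
        discovered = set()                       # D <- empty set
        for ne in source_nes:
            # Candidate list: target words with the highest score under current M.
            candidates = sorted(target_words, key=lambda w: model(ne, w),
                                reverse=True)[:top_k]
            # Re-rank by temporal similarity and keep the best candidate.
            best = max(candidates, key=lambda w: temporal_score(ne, w))
            if temporal_score(ne, best) >= theta:
                discovered.add((ne, best))
        if discovered == previous:               # until D stops changing
            break
        model = train_model(set(seed_pairs) | discovered)   # use D to train M
        previous = discovered
    return model, discovered


def toy_train(pairs):
    """Toy stand-in for M: remembers position-aligned character pairs."""
    known = {(a, b) for s, t in pairs for a, b in zip(s, t)}

    def model(s, t):
        return sum((a, b) in known for a, b in zip(s, t)) / max(len(s), len(t))
    return model


# Toy temporal scores: a made-up lookup table, not real corpus statistics.
alignment = {("cab", "CAB"): 0.9, ("bad", "BAD"): 0.8}


def toy_temporal(s, t):
    return alignment.get((s, t), 0.1)


model, pairs = bootstrap_training(["cab", "bad"], ["CAB", "BAD", "XYZ"],
                                  {("abc", "ABC")}, toy_train, toy_temporal)
```

In the toy run, the pair (bad, BAD) is only partially supported by the seed model; once it enters D, retraining teaches the model the d/D correspondence.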

  13. Algorithm: Discovery
      Input: Bilingual comparable corpus (S, T); set of Named Entities in S; transliteration model M; (optional) dictionary
      Output: Set of NE pairs D from S and T
      D ← ∅
      For each NE in S
        For each constituent word in NE
          Collect candidate list with high M score
          (optional) Add dictionary translations to candidate list
          Re-rank candidate list (e.g. temporal similarity)
          Select top ranked (θ) candidate
        If combined NE candidate appears in T, add it to D
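A rough sketch of the discovery step for multi-word NEs, under several assumptions that are not in the slides: the function name `discover`, the `top_k` cut-off, θ read as a minimum re-ranking score, target word order equal to source word order, and a naive substring test for "appears in T". The toy data again uses upper-case strings as a stand-in target language.

```python
def discover(source_nes, target_text, target_words, model, temporal_score,
             dictionary=None, top_k=1, theta=0.5):
    """Pair each source NE with a combined target candidate (hypothetical sketch)."""
    dictionary = dictionary or {}
    pairs = set()                                    # D <- empty set
    for ne in source_nes:
        chosen = []
        for word in ne.split():                      # each constituent word
            # Candidates with a high score under the transliteration model M.
            candidates = sorted(target_words, key=lambda w: model(word, w),
                                reverse=True)[:top_k]
            # Optionally add dictionary translations to the candidate list.
            candidates += [t for t in dictionary.get(word, [])
                           if t not in candidates]
            # Re-rank (here: temporal similarity) and select the top candidate.
            best = max(candidates, key=lambda w: temporal_score(word, w))
            if temporal_score(word, best) < theta:
                chosen = None
                break
            chosen.append(best)
        if chosen:
            combined = " ".join(chosen)
            if combined in target_text:              # naive "appears in T" check
                pairs.add((ne, combined))
    return pairs


# Toy inputs: made-up scores, not corpus statistics.
def toy_model(s, t):
    return 1.0 if s.upper() == t else 0.0


toy_alignment = {("mount", "GORA"): 0.9, ("rainier", "RAINIER"): 0.8}


def toy_temporal(s, t):
    return toy_alignment.get((s, t), 0.1)


words = ["RAINIER", "GORA", "SEATTLE"]
text = "GORA RAINIER is near SEATTLE"
with_dict = discover(["mount rainier"], text, words, toy_model, toy_temporal,
                     dictionary={"mount": ["GORA"]})
without_dict = discover(["mount rainier"], text, words, toy_model, toy_temporal)
```

With the dictionary, the partially translated NE is recovered; without it, the translated word "mount" has no good transliteration candidate and the NE is skipped.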

  14. Transliteration Model • Linear discriminative approach for transliteration model M • Use the Perceptron algorithm to train M • M(ES, ET) → transliteration score • Initialize M with: • Small (~20) set of transliterations as positive examples • Non-NEs paired with random words from T as negative examples
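As an illustration of a linear model trained with the Perceptron update, here is a minimal sketch over sparse binary features. The class name `SparsePerceptron`, the +1/-1 label encoding, and the toy feature sets are assumptions for this example; the slides do not specify these details.

```python
class SparsePerceptron:
    """Linear scorer over sets of binary features (hypothetical sketch)."""

    def __init__(self):
        self.weights = {}

    def score(self, features):
        # Linear score: sum of the weights of the active features.
        return sum(self.weights.get(f, 0.0) for f in features)

    def train(self, examples, epochs=10):
        for _ in range(epochs):
            mistakes = 0
            for features, label in examples:         # label is +1 or -1
                if label * self.score(features) <= 0:
                    # Perceptron update: move active weights toward the label.
                    for f in features:
                        self.weights[f] = self.weights.get(f, 0.0) + label
                    mistakes += 1
            if mistakes == 0:                        # training set is separated
                break


# Toy features: pairs of (source substring, target substring).
positive = {("p", "p"), ("o", "o")}
negative = {("p", "x"), ("o", "z")}
m = SparsePerceptron()
m.train([(positive, +1), (negative, -1)])
```

A new feature simply gets a weight entry the first time it takes part in an update, which matches the idea that features are discovered during training.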

  15. Transliteration Model: Features • For a pair of an NE and a candidate (ES, ET), partition ES and ET into substrings of length 0 to n • Each feature is a pair of substrings • For example, (ES, ET) = (powell, pouel), n = 2 • ES → {_, p, o, w, e, l, l, po, ow, we, el, ll} • ET → {_, p, o, u, e, l, po, ou, ue, el} • Feature vector is thus ((p, _), (p, o), … (w, ou), … (el, el), … (ll, el)) • Phonetic sequence is preserved, so we can limit the number of features • E.g. disallow couplings whose starting positions are too far apart ((p, ue) in the above example) • Extract features from examples automatically • New features are discovered and used during training
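The feature extraction for the (powell, pouel) example can be sketched as below. The function names and the `max_offset` limit of one position are assumptions chosen so that the (p, ue) coupling from the slide is rejected; the slides do not give the actual limit.

```python
from itertools import product


def substrings(word, n):
    """Substrings of length 1..n with their start positions, plus the empty '_'."""
    subs = [("_", None)]
    for length in range(1, n + 1):
        for start in range(len(word) - length + 1):
            subs.append((word[start:start + length], start))
    return subs


def pair_features(source, target, n=2, max_offset=1):
    """Couple source and target substrings whose start positions are close."""
    features = set()
    for (s, i), (t, j) in product(substrings(source, n), substrings(target, n)):
        if i is None and j is None:
            continue                                 # skip the empty/empty pair
        if i is not None and j is not None and abs(i - j) > max_offset:
            continue                                 # starting positions too far apart
        features.add((s, t))
    return features


features = pair_features("powell", "pouel")
```

Here ("el", "el") and ("ll", "el") are kept, while ("p", "ue") is dropped because "p" starts at position 0 and "ue" at position 2.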

  16. Transliteration Model: What happens • Algorithm iteratively refines the transliteration model with the help of time sequence similarity scoring • Current transliteration model chooses a list of candidates • Best temporally aligned candidate is used for next round of training • Example transliteration candidate lists for the NE forsyth for two iterations [correct is форсайт]

  17. Temporal Similarity: Equivalence Classes • Languages with rich morphology: a (simplistic) assumption has to be made to group morphological variants • Мичигана + Мичигане + Мичиганский → Мичиган [-а, -е, -ский] • Equivalence classes in our experiments • Russian: Common prefix of 5 letters or more • Мичиган [-а, -е, -ский, -] • English: Unique strings • Michigan
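The Russian grouping rule (common prefix of 5 letters or more) can be sketched by keying each word on its first five letters. The function name, the lower-casing, and the handling of words shorter than five letters (each forms its own class) are assumptions of this sketch.

```python
from collections import defaultdict


def equivalence_classes(words, min_prefix=5):
    """Group words that share their first `min_prefix` letters."""
    classes = defaultdict(set)
    for word in words:
        lowered = word.lower()
        # Words of at least min_prefix letters are keyed by their prefix;
        # shorter words keep their full form as the key.
        key = lowered[:min_prefix] if len(lowered) >= min_prefix else lowered
        classes[key].add(word)
    return dict(classes)


classes = equivalence_classes(
    ["Мичиган", "Мичигана", "Мичигане", "Мичиганский", "Москва"])
```

Mentions of all members of a class can then be pooled into one time distribution before computing temporal similarity.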

  18. Temporal Similarity • Similarity of time distributions is computed using a Discrete Fourier Transform based metric • Euclidean distance between vectors of Fourier expansion coefficients of time distributions • About 41% accuracy using this metric alone • More robust to misalignment than Cosine and Pearson [Figure: comparison plotted against level of alignment]
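A minimal sketch of a DFT-based distance between two time distributions, using only the standard library. The normalization of counts to sum to one and the optional truncation to the first `num_coeffs` coefficients are assumptions; the slides only state that the Euclidean distance between Fourier coefficient vectors is used.

```python
import cmath
import math


def dft(values):
    """Naive discrete Fourier transform (O(n^2)), fine for short sequences."""
    n = len(values)
    return [sum(values[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]


def normalize(counts):
    """Turn raw mention counts per time bin into a distribution."""
    total = sum(counts)
    return [c / total for c in counts] if total else list(counts)


def dft_distance(counts_a, counts_b, num_coeffs=None):
    """Euclidean distance between DFT coefficient vectors (lower = more similar)."""
    fa, fb = dft(normalize(counts_a)), dft(normalize(counts_b))
    k = num_coeffs or len(fa)                        # optionally keep first k only
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(fa[:k], fb[:k])))


# Toy mention counts per time bin (made-up numbers).
burst = [0, 1, 4, 1, 0, 0, 0, 0]
shifted = [0, 0, 1, 4, 1, 0, 0, 0]
unrelated = [0, 0, 0, 0, 0, 1, 4, 1]
d_same = dft_distance(burst, burst)
d_shifted = dft_distance(burst, shifted)
d_unrelated = dft_distance(burst, unrelated)
```

In this toy case the slightly shifted burst is closer to the original than the unrelated one, so a counterpart mentioned a day later would still rank above an unrelated word.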

  19. Contextual Similarity • Collect a ‘contextual signature’ for each NE equivalence class • Collect and count context words around each mention of an NE / candidate • Compute a TFIDF-like score for each context word • Contextual signature = 20 context words with highest score • Compute contextual score of (ES, ET) as the number of dictionary translations between their contextual signatures
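The signature and score computation might look like the sketch below. The exact TFIDF-like weighting is not given in the slides, so the formula here (count times log of inverse document frequency) is an assumption, as are the function names and the choice to count each source context word that has at least one dictionary translation in the target signature.

```python
import math
from collections import Counter


def contextual_signature(contexts, doc_freq, num_docs, size=20):
    """Top `size` context words by a TFIDF-like score (assumed weighting)."""
    counts = Counter(word for context in contexts for word in context)
    scores = {word: count * math.log(num_docs / (1 + doc_freq.get(word, 0)))
              for word, count in counts.items()}
    return set(sorted(scores, key=scores.get, reverse=True)[:size])


def contextual_score(signature_s, signature_t, dictionary):
    """Number of source signature words with a translation in the target signature."""
    return sum(1 for word in signature_s
               if dictionary.get(word, set()) & signature_t)


# Toy data: made-up contexts, document frequencies, and dictionary entries.
signature = contextual_signature(
    contexts=[["film", "the"], ["film", "actress"]],
    doc_freq={"the": 99, "film": 4, "actress": 1},
    num_docs=100, size=2)
toy_dictionary = {"film": {"фильм", "фильме"},
                  "actress": {"актриса", "актрису"},
                  "sued": {"подать"}}
score = contextual_score({"film", "actress", "sued"},
                         {"фильме", "актрису", "партия"}, toy_dictionary)
```

The toy dictionary lists inflected forms explicitly; a real system would need lemmatization or the equivalence classes from the earlier slide to match morphological variants.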

  20. Experiments: Discovery • Almost 5 years of (short) BBC news articles and loose Russian translations • 20 pairs of NEs and their transliterations to initialize the transliteration model

  21. Experiments: Discovery [single word NEs]
      Method                  Accuracy
      Complete Algorithm      63.8%
      Temporal Only           41.0%
      Transliteration Only    44.2%

      Complete algorithm, correct NE within the ranked candidates:
      Top          63.8%
      In top 5     70.3%
      In top 10    76.5%
      In top 20    80.5%
      In top 30    81.9%
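The "in top k" figures are presumably computed by checking whether the correct counterpart appears among the k highest-ranked candidates. A minimal sketch of that metric, with a hypothetical function name and made-up toy candidate lists:

```python
def top_k_accuracy(ranked_candidates, gold, k):
    """Fraction of NEs whose correct counterpart is among the top k candidates."""
    hits = sum(1 for ne, candidates in ranked_candidates.items()
               if gold.get(ne) in candidates[:k])
    return hits / len(ranked_candidates)


# Toy ranked lists (illustrative only, not the experimental output).
ranked = {"forsyth": ["форсайт", "форсайта"],
          "powell": ["паула", "пауэлл"]}
gold = {"forsyth": "форсайт", "powell": "пауэлл"}
top1 = top_k_accuracy(ranked, gold, k=1)
top2 = top_k_accuracy(ranked, gold, k=2)
```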

  22. Experiments: Discovery [multi-word NEs] • 66% accuracy on multi-word (two or more words) NEs during discovery • Examples fall into three groups: Transliterated, Translated, Partially Translated

  23. Experiments: Initial example set size • M initialized with 80, 20, and 5 transliteration pairs • Initial set size affects convergence time • Noise slows down convergence [Figure: curves for 80, 20, and 5 initial pairs]

  24. Temporal vs. Context Supervision [work in progress] Temporal Supervision 63.8% Context Supervision 40.9%

  25. Future/current work • Scoring functions provide independent sources of supervision - combine them • Learn transliteration model and scoring function weights simultaneously (in a co-training fashion) • Adapt to a new corpus without supervision (e.g. use transliteration as supervision to re-weigh scoring functions) • Add topic similarity score; multilingual clustering may be interesting in its own right • More languages • Train NER on the automatically tagged target corpus, compare to training on hand tagged corpus

  26. Summary • Key insight: make use of domain properties to drive supervision • Algorithm for Named Entity discovery in multilingual comparable corpora • Little supervision • Small initialization set of transliteration pairs • Simplistic morphological assumptions for target language • Discriminative transliteration model • Three scoring functions Group’s web site: http://l2r.cs.uiuc.edu/~cogcomp/ Demo: http://l2r.cs.uiuc.edu/~cogcomp/demo.php?dkey=MNED

  27. Corpus/Experiment Specifics • 2,327 documents [spanning 1/1/2001 through 10/05/2005] • 14,781 equivalence classes • Single-word: 978 NEs, 727 were checked • Multi-word (2 or more): random 282 verified by a human
