Georgios Kontonatsios georgios.kontonatsios@cs.man.ac.uk 14th October 2014

A hybrid approach to compiling bilingual dictionaries of medical terms from parallel corpora
Georgios Kontonatsios
georgios.kontonatsios@cs.man.ac.uk
14th October 2014

Overview
Background
• Parallel Corpus
• Problem
• Motivation
Methods
• Random Forest Classifier
• Statistical Phrase Alignment
• Hybrid Approach
Experiments
• English-Greek & English-Romanian
• Error Analysis
Conclusions
• Discussion
• Future Work

Background: Parallel Corpus
“A parallel corpus is a collection of documents in a source language paired with their direct translation in a target language”
English: Abraxane monotherapy is indicated for the treatment of metastatic breast cancer
Greek: η μονοθεραπεία με abraxane ενδείκνυται για τη θεραπεία μεταστατικού καρκίνου μαστού

Background: Parallel Corpus
English: Abraxane monotherapy is indicated for the treatment of metastatic breast cancer
Greek: η μονοθεραπεία με abraxane ενδείκνυται για τη θεραπεία μεταστατικού καρκίνου του μαστού
• 1) Useful for SMT
  • Koehn (2005) trained 110 SMT systems (11 languages) in three weeks
• 2) Relatively scarce resources
  • Available for domains such as finance, law and medicine
• 3) Excellent resources for mining bilingual terminologies
  • Exact translations => no missing translations of terms
  • Sentence aligned => limited search space of candidate translations
  • Same size => term frequencies are comparable

Background: Problem
Parallel Corpus → Term Alignment → Dictionary of multi-word terms (MWTs)
English: Abraxane monotherapy is indicated for the treatment of metastatic breast cancer
Greek: η μονοθεραπεία με abraxane ενδείκνυται για τη θεραπεία μεταστατικού καρκίνου μαστού
Aligned term pair: metastatic breast cancer ↔ μεταστατικού καρκίνου μαστού

Background: Biomedical Domain
Existing resources in the biomedical domain remain incomplete
UMLS
• A multilingual terminological resource (more than 20 languages)
• Indexes ~7.6M English terms
• ~6.3M missing translations
Goal: expand UMLS for English-Greek and English-Romanian

Methodology: Term Alignment Pipeline
Parallel Corpus → Link to UMLS (MetaMap) → Term Alignment
English: Abraxane monotherapy is indicated for the treatment of metastatic breast cancer
C0278488, Neoplastic Process
C0278488, Neoplastic Process
Greek: η μονοθεραπεία με abraxane ενδείκνυται για τη θεραπεία μεταστατικού καρκίνου μαστού

Methodology: Term Alignment Algorithms
Random Forest Classifier (EACL 2014, EMNLP 2014)
• Supervised machine learning method
• Exploits internal structure of terms (character n-gram feature representation)
• Requires positive and negative instances for training
• Out-of-domain seed dictionary (i.e. BabelNet)
Statistical Phrase Alignment (Koehn et al., 2003)
• Unsupervised approach
• Part of Moses SMT (Koehn et al., 2007); an out-of-the-box solution
• Exploits co-occurrences of source and target terms
• Works well for frequently occurring terms
• Performance decreases for rare terms
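The character n-gram representation mentioned for the Random Forest classifier can be sketched as follows. This is a minimal illustration, not the authors' implementation: the n-gram range (2 to 5), the boundary markers `^` and `$`, and the function names `char_ngrams` and `pair_features` are all assumptions made for the example.

```python
from collections import Counter


def char_ngrams(term, n_min=2, n_max=5):
    """Count the character n-grams of one term (n-gram range is an assumption)."""
    # Mark term boundaries so prefixes and suffixes become distinct n-grams.
    padded = f"^{term.lower()}$"
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams


def pair_features(source_term, target_term, n_min=2, n_max=5):
    """Build one feature dict for a (source, target) candidate pair.

    N-grams are prefixed with their language side so that source and
    target n-grams never collide in the same feature space.
    """
    features = {}
    for gram, count in char_ngrams(source_term, n_min, n_max).items():
        features["src:" + gram] = count
    for gram, count in char_ngrams(target_term, n_min, n_max).items():
        features["tgt:" + gram] = count
    return features


features = pair_features("metastatic breast cancer", "μεταστατικού καρκίνου μαστού")
print(len(features), "features, e.g. src:meta =", features["src:meta"])
```

A feature dict like this could then be vectorised and passed to any random forest implementation, with seed-dictionary pairs as positive instances and mismatched pairs as negatives.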

Methodology: Hybrid Approach
• For a source term s to be translated, RF and SPA each suggest N ranked candidate translations
• SPA ranks candidates by translation probability; RF ranks candidates by classification margin
Example: type 2 diabetes mellitus
SPA candidates:
• του σακχαρώδη διαβήτη τύπου 2
• σακχαρώδη διαβήτη τύπου 2
• σακχαρώδους διαβήτη τύπου 2
RF candidates:
• διαβήτη τύπου 2
• διαβήτη τύπου 2 και καρδιακή
• σακχαρώδη διαβήτη τύπου 2

Methodology: Hybrid Approach
• Dictionaries containing N candidate translations have a limited number of applications (e.g., SMT)
• To enrich existing terminologies, human curators need to post-edit the output of term alignment methods
• Objective: improve the precision of the top-ranked candidate (precision at N=1)
• Voting: take the intersection of the RF and SPA candidate lists and rank the shared candidates by SPA translation probability
Example: type 2 diabetes mellitus
SPA candidates:
• του σακχαρώδη διαβήτη τύπου 2
• σακχαρώδη διαβήτη τύπου 2
• σακχαρώδους διαβήτη τύπου 2
RF candidates:
• διαβήτη τύπου 2
• διαβήτη τύπου 2 και καρδιακή
• σακχαρώδη διαβήτη τύπου 2
Voting output:
• σακχαρώδη διαβήτη τύπου 2
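The voting step described on this slide (intersect the two candidate lists, then rank by SPA translation probability) can be sketched as below. The function name `vote` is hypothetical, every score is invented for illustration, and the split of the six slide candidates between SPA and RF is our reading of the flattened slide rather than something the source states explicitly.

```python
def vote(spa_candidates, rf_candidates):
    """Keep candidates proposed by both methods, ranked by SPA probability.

    spa_candidates: list of (translation, translation_probability)
    rf_candidates:  list of (translation, classification_margin)
    """
    rf_translations = {translation for translation, _ in rf_candidates}
    shared = [(t, p) for t, p in spa_candidates if t in rf_translations]
    # Highest SPA translation probability first; RF margins are only
    # used for membership, not for ranking.
    return sorted(shared, key=lambda pair: pair[1], reverse=True)


# Candidates for "type 2 diabetes mellitus"; the scores are made up.
spa = [
    ("του σακχαρώδη διαβήτη τύπου 2", 0.40),
    ("σακχαρώδη διαβήτη τύπου 2", 0.35),
    ("σακχαρώδους διαβήτη τύπου 2", 0.10),
]
rf = [
    ("διαβήτη τύπου 2", 0.80),
    ("διαβήτη τύπου 2 και καρδιακή", 0.60),
    ("σακχαρώδη διαβήτη τύπου 2", 0.55),
]

print(vote(spa, rf))
```

Because only candidates proposed by both methods survive, the output list can be empty for some source terms, which is consistent with the low recall noted in the Discussion.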

Experiments: Corpora
• EMEA (Tiedemann, 2009), a biomedical parallel corpus from the European Medicines Agency
• 1.5K sentence-aligned documents in 22 languages
• Drug usage guidelines
• en-el: 372K sentences, 17,907 unique English MWTs
• en-ro: 321K sentences, 16,625 unique English MWTs

Experiments: Evaluation
• Randomly sampled 1,000 English MWTs
• For each English MWT, we selected the top 20 translation candidates
[Table: RF, SPA and Voting compared on en-el and en-ro; values not recoverable]
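The slides report top-1 and top-20 precision without defining the metric. A minimal sketch, assuming top-N precision means the fraction of sampled source terms that have at least one correct translation among their first N candidates; the function name `top_n_precision` and the toy data are hypothetical.

```python
def top_n_precision(ranked_candidates, gold, n):
    """Fraction of source terms with a correct translation in the top n.

    ranked_candidates: dict mapping source term -> ranked list of candidates
    gold:              dict mapping source term -> set of correct translations
    """
    if not ranked_candidates:
        return 0.0
    hits = 0
    for term, candidates in ranked_candidates.items():
        correct = gold.get(term, set())
        if any(candidate in correct for candidate in candidates[:n]):
            hits += 1
    return hits / len(ranked_candidates)


# Toy data for illustration only; these are not results from the experiments.
ranked = {
    "metastatic breast cancer": ["μεταστατικού καρκίνου μαστού"],
    "type 2 diabetes mellitus": ["διαβήτη τύπου 2", "σακχαρώδη διαβήτη τύπου 2"],
}
gold = {
    "metastatic breast cancer": {"μεταστατικού καρκίνου μαστού"},
    "type 2 diabetes mellitus": {"σακχαρώδη διαβήτη τύπου 2"},
}

print(top_n_precision(ranked, gold, 1))
print(top_n_precision(ranked, gold, 2))
```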

Experiments: Results English-Greek dataset

Experiments: Results English-Romanian dataset

Experiments: Results English-Greek dataset

Experiments: Results English-Romanian dataset

Error Analysis
RF
• Partial matches: urea cycle disorder ↔ διαταραχών (disorder) του κύκλου (cycle) της ουρίας (urea)
• Discontinuous translations: metabolic diseases ↔ boli (diseases) ereditare (hereditary) de metabolism (metabolic)
SPA
• Statistically-based tool
• Performance largely affected by term frequency
• Top-20 precision on terms having varying frequency

Error Analysis English-Greek dataset
• Performance decreases for lower frequency terms

Error Analysis English-Romanian dataset

Discussion
• Hybrid approach
  • Compilation of bilingual terminologies from parallel corpora
  • Enriches UMLS with two under-resourced languages
• Observations:
  • Substantially improves top-1 precision of RF and SPA
  • Outperforms SPA when translating low-frequency terms
  • Low recall

Future Work
• Investigate integration of bilingual terminologies with SMT
[Diagram: SMT pipeline with components Parallel corpus, SPA, Phrase table, LM and RF; annotations: "Lower top-1 precision", "Poor performance for low-frequency terms"]

Questions?
