A hybrid approach to compiling bilingual dictionaries of medical terms from parallel corpora Georgios Kontonatsios georgios.kontonatsios@cs.man.ac.uk 14th October 2014
Overview
• Background
  • Parallel Corpus
  • Problem
  • Motivation
• Methods
  • Random Forest Classifier
  • Statistical Phrase Alignment
  • Hybrid Approach
• Experiments
  • English-Greek & English-Romanian
  • Error Analysis
• Conclusions
  • Discussion
  • Future Work
Background: Parallel Corpus
“A parallel corpus is a collection of documents in a source language paired with their direct translation in a target language”
• English: Abraxane monotherapy is indicated for the treatment of metastatic breast cancer
• Greek: η μονοθεραπεία με abraxane ενδείκνυται για τη θεραπεία μεταστατικού καρκίνου μαστού
Background: Parallel Corpus
• English: Abraxane monotherapy is indicated for the treatment of metastatic breast cancer
• Greek: η μονοθεραπεία με abraxane ενδείκνυται για τη θεραπεία μεταστατικού καρκίνου του μαστού
• 1) Useful for SMT
  • Koehn (2005) trained 110 SMT systems (11 languages) in three weeks
• 2) Relatively scarce resources
  • Available for finance, law, medicine, etc.
• 3) Excellent resources for mining bilingual terminologies
  • Exact translations => no missing translations of terms
  • Sentence aligned => limited search space of candidate translations
  • Same size => term frequencies are comparable
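The "limited search space" point can be illustrated with a minimal sketch. It assumes whitespace tokenisation, simple substring matching of the source term, and a cap of four tokens per candidate; the function names and these settings are hypothetical and are not taken from the slides.

```python
def word_ngrams(sentence, max_len=4):
    """Return all word n-grams of up to max_len tokens (max_len is an assumed cap)."""
    tokens = sentence.split()
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(tokens) - n + 1)
    }


def candidate_translations(source_term, sentence_pairs, max_len=4):
    """Collect target-side n-grams only from sentences aligned to a source
    sentence containing the term; this is what bounds the search space."""
    candidates = set()
    for source_sentence, target_sentence in sentence_pairs:
        # Naive substring match, used here only for illustration.
        if source_term.lower() in source_sentence.lower():
            candidates |= word_ngrams(target_sentence, max_len)
    return candidates


# The sentence pair shown on the slide.
pairs = [
    ("Abraxane monotherapy is indicated for the treatment of metastatic breast cancer",
     "η μονοθεραπεία με abraxane ενδείκνυται για τη θεραπεία μεταστατικού καρκίνου μαστού"),
]
candidates = candidate_translations("metastatic breast cancer", pairs)
```

With sentence alignment, the candidate set for a term is drawn from a handful of aligned sentences rather than from the whole target-side corpus.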
Background: Problem
Parallel Corpus → Term Alignment → Dictionary of multi-word terms (MWTs)
• English: Abraxane monotherapy is indicated for the treatment of metastatic breast cancer
• Greek: η μονοθεραπεία με abraxane ενδείκνυται για τη θεραπεία μεταστατικού καρκίνου μαστού
• Aligned term pair: metastatic breast cancer ↔ μεταστατικού καρκίνου μαστού
Background: Biomedical Domain
Existing resources in the biomedical domain remain incomplete
• UMLS
  • A multilingual terminological resource (more than 20 languages)
  • Indexes ~7.6M English terms
• Expand UMLS for English-Greek and English-Romanian
  • ~6.3M missing translations
Methodology: Term Alignment Pipeline
Parallel Corpus → MetaMap (link to UMLS) → Term Alignment
• English: Abraxane monotherapy is indicated for the treatment of metastatic breast cancer (C0278488, Neoplastic Process)
• Greek: η μονοθεραπεία με abraxane ενδείκνυται για τη θεραπεία μεταστατικού καρκίνου μαστού (C0278488, Neoplastic Process)
Methodology: Term Alignment Algorithms
• Random Forest (RF) Classifier (EACL 2014, EMNLP 2014)
  • Supervised machine learning method
  • Exploits the internal structure of terms (character n-gram feature representation)
  • Requires positive and negative instances for training
  • Out-of-domain seed dictionary (i.e. BabelNet)
• Statistical Phrase Alignment (SPA) (Koehn et al., 2003)
  • Unsupervised approach
  • Part of Moses SMT (Koehn et al., 2007); an out-of-the-box solution
  • Exploits co-occurrences of source and target terms
  • Works well for frequently occurring terms
  • Performance decreases for rare terms
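A character n-gram representation of a candidate term pair might look like the sketch below. The n-gram range (2 to 4), the lowercasing, and the "src:"/"tgt:" prefixing scheme are assumptions made for illustration; the slides do not specify how the features are built.

```python
from collections import Counter


def char_ngrams(term, n_min=2, n_max=4):
    """Count the character n-grams of a term (the n range is an assumed setting)."""
    text = term.lower()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def pair_features(source_term, target_term):
    """Build one feature dict for a candidate pair. Source and target n-grams
    are prefixed so that the two languages occupy separate feature spaces."""
    features = {}
    for gram, count in char_ngrams(source_term).items():
        features["src:" + gram] = count
    for gram, count in char_ngrams(target_term).items():
        features["tgt:" + gram] = count
    return features


features = pair_features("breast cancer", "καρκίνου μαστού")
```

Feature dicts of this kind, labelled as correct or incorrect translations using a seed dictionary, could then be vectorised and passed to any random forest implementation.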
Methodology: Hybrid Approach
• For a source term s to be translated, RF and SPA each suggest N ranked candidate translations
  • SPA ranks by translation probability
  • RF ranks by classification margin
• Example: type 2 diabetes mellitus
  • SPA: του σακχαρώδη διαβήτη τύπου 2; σακχαρώδη διαβήτη τύπου 2; σακχαρώδους διαβήτη τύπου 2
  • RF: διαβήτη τύπου 2; διαβήτη τύπου 2 και καρδιακή; σακχαρώδη διαβήτη τύπου 2
Methodology: Hybrid Approach
• Dictionaries containing N candidate translations have a limited number of applications (e.g., SMT)
• To enrich existing terminologies, human curators need to post-edit the output of term alignment methods
• Objective is to improve the precision of higher-ranking candidates (precision@N=1)
• Voting: intersection of RF and SPA, ranking candidates according to the translation probability given by SPA
• Example: type 2 diabetes mellitus
  • SPA: του σακχαρώδη διαβήτη τύπου 2; σακχαρώδη διαβήτη τύπου 2; σακχαρώδους διαβήτη τύπου 2
  • RF: διαβήτη τύπου 2; διαβήτη τύπου 2 και καρδιακή; σακχαρώδη διαβήτη τύπου 2
  • Voting: σακχαρώδη διαβήτη τύπου 2
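The voting step (intersect the two candidate lists, then rank by SPA translation probability) can be sketched as follows. The function name, the data layout, and all numeric scores are hypothetical; only the candidate strings come from the slide.

```python
def vote(spa_candidates, rf_candidates, top_n=20):
    """Keep candidates proposed by both methods, ordered by SPA probability.

    spa_candidates: list of (translation, translation_probability)
    rf_candidates:  list of (translation, classification_margin)
    """
    by_score = lambda item: item[1]
    rf_top = {t for t, _ in sorted(rf_candidates, key=by_score, reverse=True)[:top_n]}
    spa_top = sorted(spa_candidates, key=by_score, reverse=True)[:top_n]
    # Intersection: an SPA candidate survives only if RF also proposed it.
    return [(t, p) for t, p in spa_top if t in rf_top]


# Candidates for "type 2 diabetes mellitus"; the scores are made up.
spa = [
    ("του σακχαρώδη διαβήτη τύπου 2", 0.40),
    ("σακχαρώδη διαβήτη τύπου 2", 0.35),
    ("σακχαρώδους διαβήτη τύπου 2", 0.10),
]
rf = [
    ("διαβήτη τύπου 2", 0.9),
    ("διαβήτη τύπου 2 και καρδιακή", 0.7),
    ("σακχαρώδη διαβήτη τύπου 2", 0.6),
]
voted = vote(spa, rf)
```

Because a candidate must be proposed by both methods, the output list can be empty for some terms, which is consistent with a gain in precision at the cost of recall.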
Experiments: Corpora
• EMEA (Tiedemann, 2009), a biomedical parallel corpus from the European Medicines Agency
  • 1.5K sentence-aligned documents in 22 languages
  • Drug usage guidelines
• en-el: 372K sentences, 17,907 unique English MWTs
• en-ro: 321K sentences, 16,625 unique English MWTs
Experiments: Evaluation
• Randomly sampled 1,000 English MWTs
• For each English MWT, we selected the top 20 translation candidates
• Methods compared on en-el and en-ro: RF, SPA, Voting
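Top-N precision for such an evaluation could be computed roughly as below. The definition used here (the share of evaluated source terms with at least one correct translation among the first N candidates) is an assumption, since the slides do not spell out the metric; the function name and toy data are hypothetical.

```python
def top_n_precision(ranked, gold, n):
    """Fraction of source terms with a correct translation in the top n.

    ranked: dict mapping source term -> list of candidates, best first
    gold:   dict mapping source term -> set of correct translations
    """
    if not ranked:
        return 0.0
    hits = sum(
        1
        for term, candidates in ranked.items()
        if any(c in gold.get(term, set()) for c in candidates[:n])
    )
    # Assumed denominator: every evaluated term, including those with no candidates.
    return hits / len(ranked)


# Toy data, not results from the experiments.
ranked = {"t1": ["a", "b"], "t2": ["x", "y"], "t3": [], "t4": ["q"]}
gold = {"t1": {"b"}, "t2": {"x"}, "t3": {"z"}, "t4": {"q"}}
p_at_1 = top_n_precision(ranked, gold, 1)
p_at_2 = top_n_precision(ranked, gold, 2)
```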
Experiments: Results English-Greek dataset
Experiments: Results English-Romanian dataset
Experiments: Results English-Greek dataset
Experiments: Results English-Romanian dataset
Error Analysis
• RF
  • Partial matches: urea cycle disorder ↔ διαταραχών του κύκλου της ουρίας, glossed (disorder) (cycle) (urea)
  • Discontinuous translations: metabolic diseases ↔ boli ereditare de metabolism, glossed (diseases) (hereditary) (metabolic)
• SPA
  • Statistically-based tool
  • Performance largely affected by term frequency
  • Top-20 precision measured on terms of varying frequency
Error Analysis: performance decreases for lower-frequency terms (English-Greek dataset)
Error Analysis (English-Romanian dataset)
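The frequency breakdown described in the error analysis can be approximated by bucketing terms by corpus frequency and computing precision per bucket. The bin edges (1, 10, 100), the function name, and the toy data are assumptions; the slides do not state which frequency ranges were used.

```python
from collections import defaultdict


def precision_by_frequency(results, bin_edges=(1, 10, 100)):
    """Precision per frequency bin.

    results:   list of (term_frequency, hit); hit is True when a correct
               translation appears among the top-N candidates for the term
    bin_edges: ascending lower bounds of the bins (assumed values)
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for frequency, hit in results:
        # Assign the term to the highest bin whose lower bound it reaches.
        lower = max((edge for edge in bin_edges if edge <= frequency), default=None)
        if lower is None:
            continue  # below the first bin edge
        totals[lower] += 1
        hits[lower] += int(hit)
    return {lower: hits[lower] / totals[lower] for lower in sorted(totals)}


# Toy data, not results from the experiments.
results = [(1, False), (3, True), (12, True), (50, True), (200, True), (150, False)]
per_bin = precision_by_frequency(results)
```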
Discussion
• Hybrid approach
  • Compilation of bilingual terminologies from parallel corpora
  • Enriches UMLS with two under-resourced languages
• Observations
  • Substantially improves on the top-1 precision of RF and SPA
  • Outperforms SPA when translating low-frequency terms
  • Low recall
Future Work
• Investigate integration of bilingual terminologies with SMT
• Diagram labels: SMT, SPA, parallel corpus, phrase table, LM, RF
• Limitations noted in the diagram: lower top-1 precision; poor performance for low-frequency terms