The PERTOMed Project: Exploiting and validating terminological resources of comparable Russian-French-English corpora within Pharmacovigilance
Cedric BOUSQUET, INSERM U729 (Faculté de Médecine - Paris 5), cedric.bousquet@spim.jussieu.fr
Maria ZIMINA, EA2290 SYLED (Paris 3) / CRIM-INaLCO (Paris), zimina@msh-paris.fr
International Conference on Terminology “Terminology and Society”.

Outline
• Introduction. The PERTOMed Project:
  • Background
  • Objectives
• Material: parallel French/English vs. comparable Russian corpora
• Research methodology:
  • SYNTEX
  • Repeated Segments extraction
  • Multiple co-occurrences
  • Collaboration domain expert / corpus linguist
• Discussion:
  • Positive results
  • Limits
• Conclusions and future work

The PERTOMed Project:
• Project Directors:
  • Marie-Christine Jaulent (INSERM, Paris).
  • Jean Charlet (INSERM, Paris).
• Partners:
  • INSERM U729, Faculté de Médecine - Paris 5 (France).
  • ERSS: Equipe de Recherche en Syntaxe et Sémantique, UMR 5610 CNRS and Toulouse le Mirail University (France).
  • CRIM: Centre de Recherche en Ingénierie Multilingue, INaLCO (Paris, France).

The PERTOMed Project:
• Developing terminological resources in medicine is essential for collecting data and browsing knowledge bases.
• The objective of the PERTOMed project (Production et évaluation de ressources terminologiques et ontologiques dans le domaine de la médecine: production and evaluation of terminological and ontological resources in the medical domain) was to build terminological or ontological resources from texts in the medical domain.
• Potential applications concern several fields:
  • Pharmacovigilance
  • Pneumology
  • Drug-drug interactions
  • Multilingual terminologies

Pharmacy-related issues in Russia
• Several pharmaceutical companies are present in Russia: medicines produced in the EU or the USA are also marketed in Russia.
• Required qualities of product information translations: precision, reproducibility, exactness…
• High-quality translation of drug product information is vital for:
  • Pharmaceutical companies
  • Russian regulatory authorities
  • Medical doctors
  • Pharmacists
  • Consumers

Pharmacovigilance
According to the World Health Organization (WHO), pharmacovigilance is “the science and activities relating to the detection, assessment, understanding and prevention of adverse effects or any other drug-related problems.”

Available international terminologies in Pharmacovigilance
• WHO-ART (World Health Organization – Adverse Reaction Terminology) was developed in English with translations into French, German, Spanish, Portuguese and Italian.
• MedDRA (Medical Dictionary for Regulatory Activities) defines fully equivalent medical terms in different languages, including English, French, German, Japanese and Spanish.

Objectives
• To propose methods for creating terminological resources from comparable French-English-Russian corpora on adverse drug reactions.
• To build a trilingual French-English-Russian terminological resource describing adverse drug reactions.

Available resources:
• Parallel French-English medical text corpora: Summaries of Product Characteristics (SPC).
• Comparable medical corpora on Russian Web sites.

SPC: Summary of Product Characteristics
• The European Medicines Agency (EMEA) is a decentralised EU body with headquarters in London.
• Companies submit a single marketing authorisation application to the EMEA.
• If the Committee for Medicinal Products for Human Use (CHMP) approves the application, the applicant receives a single marketing authorisation valid for the entire EU.
• The SPCs are provided in all EU languages (undesirable effects are described in Section 4.8).

French-English corpus from the PERTOMed Project (C. Bousquet)
• 156 SPCs in French and English downloaded as PDF files.
• NLP processing by SYNTEX (French/English parser and term extractor).

SYNTEX (D. Bourigault, S. Ozdowska)
• Step 1: Sentence alignment (JAPA)
• Step 2: Part-of-Speech tagging (TreeTagger)
• Step 3: Parsing (Syntax):
  • syntactic dependencies are identified (subjects, direct and indirect objects of verbs…)
• Step 4: Identification of anchor pairs:
  • cognates, translation equivalents within aligned sentences…
• Step 5: Alignment by syntactic propagation (the Subject relation is marked in both aligned sentences):
  EN: …the two medicinal products are used concurrently.
  FR: …ces deux produits sont administrés de manière concomitante.

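Step 4 can be illustrated with a minimal sketch of cognate-based anchor detection. This is not SYNTEX's implementation: the function names, the use of Python's `difflib.SequenceMatcher` as the similarity measure, and the 0.7 threshold are all assumptions chosen for illustration.

```python
import unicodedata
from difflib import SequenceMatcher


def strip_accents(word):
    """Lower-case a word and drop combining accents (é -> e)."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def cognate_score(a, b):
    """Surface similarity in [0, 1] between two accent-stripped words."""
    return SequenceMatcher(None, strip_accents(a), strip_accents(b)).ratio()


def anchor_pairs(src_tokens, tgt_tokens, threshold=0.7):
    """For each source token, keep its best target match if it looks like a cognate.

    The threshold is a hypothetical value, not one taken from the project.
    """
    pairs = []
    for s in src_tokens:
        best = max(tgt_tokens, key=lambda t: cognate_score(s, t), default=None)
        if best is not None and cognate_score(s, best) >= threshold:
            pairs.append((s, best))
    return pairs


# Tokens from one aligned English/French sentence pair (toy example).
en = ["oedema", "peripheral", "disorders"]
fr = ["oedème", "périphérique", "troubles"]
print(anchor_pairs(en, fr))
```

Pairs found this way would only serve as starting points; non-cognate equivalents such as "disorders" / "troubles" would still need a dictionary or the syntactic propagation of Step 5.
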
Comparable resources on medicinal products in Russia (J. Ivanova and I. Nuk)
• Russian Websites selected for the Project:
  • RECIPE: http://www.recipe.ru
  • RLS: http://www.rlsnet.ru
  • Russian Vidal: http://www.vidal.ru
• Criteria for comparability with SPC:
  • Degree of specialisation
  • Clarity and precision
  • Recognition by domain experts in Russia
  • Information granularity
  • Style (summarization)
• Possible text-to-text alignment: direct search in Russian by active component or medicinal product

The RECIPE Website:
A site for legal pharmacological documentation: Medline user manual, index of Russian biomedical Websites, several search criteria for medical products (including ICD-10): http://www.recipe.ru

The RLS Website:
RLS is the acronym of Регистр Лекарственных Средств России [Register of Medical Substances of Russia]: http://www.rlsnet.ru/. Encyclopaedia of medical products and product descriptions.

The Russian Vidal Website:
Russian Vidal (http://www.vidal.ru) is edited and regularly updated by the private company AstraPharmService in accordance with the Industrial Standard of the Russian Federation.

Russian corpus from the PERTOMed Project: results of Correspondence Analysis (Lexico3)
Regardless of their various origins (different Websites used to collect information), the descriptions of medical products in Russian gathered within the corpus tend to share common lexical characteristics…

Material: parallel vs. comparable corpora
[Figure: corpora labelled "comparable" and "parallel"]
• Difficulties:
  • Corpus size differences…
  • Information coverage?
  • Degree of comparability?
  • NLP tools/methods for comparable multilingual text processing?
Delimiting characters: .,:;!?/_-\"'()[]{}§$

Methods for building terminologies from comparable corpora (1/2)
• If two words are mutual translations, their collocates are likely to correspond as well…
• Collocation is defined as a co-occurrence relation.
• Domain-specific words co-occur with general words (so general bilingual dictionaries can be used).
• Mapping through a bilingual dictionary:
  • Build context vectors for source and target words.
  • Translate context vectors.
  • Compute similarity between source and target context vectors.

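The three dictionary-mapping steps above can be sketched as follows. This is a toy illustration under stated assumptions, not the project's code: the window size, the cosine measure, the function names and the four-entry dictionary are all hypothetical.

```python
from collections import Counter
from math import sqrt


def context_vector(tokens, word, window=2):
    """Count the forms found within `window` positions of each occurrence of `word`."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == word:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            vec.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return vec


def translate_vector(vec, dictionary):
    """Map each context word through a bilingual dictionary; unknown words are dropped."""
    translated = Counter()
    for w, count in vec.items():
        for t in dictionary.get(w, []):
            translated[t] += count
    return translated


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


# Toy comparable "corpora" and a small general-language dictionary (invented data).
fr = "fièvre et syndrome grippal avec toux".split()
en = "fever and flu syndrome with cough".split()
general_dict = {"fièvre": ["fever"], "et": ["and"], "avec": ["with"], "toux": ["cough"]}

source = translate_vector(context_vector(fr, "syndrome"), general_dict)
for candidate in ("syndrome", "fever"):
    print(candidate, round(cosine(source, context_vector(en, candidate)), 3))
```

On this toy data the English candidate "syndrome" scores higher than "fever", because more of its neighbours match the translated French context.
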
Methods for building terminologies from comparable corpora (2/2)
• Statistical Machine Translation:
  • A translation model is learned from existing translations (parallel corpora).
  • Alignment probabilities are introduced to refine the model.
  • Limits: considerable amounts of training data, several heuristics possible.
• Mixed approaches:
  • Syntactic relations transfer, co-occurrence relations, dictionary mapping, alignment probabilities…
• Problems:
  • Lack of equivalence between tools performing similar tasks on different languages.
  • Term extraction from comparable corpora is not yet satisfactory.

Repeated Segments extraction
Repeated Segments (SALEM 1987): series of consecutive forms whose frequency is greater than or equal to 2 in the corpus:

FR:
  syndrome de stevens johnson  27
  syndrome pseudo grippal  23
  de syndrome de  13
  syndrome de lyell  11
  syndrome de détresse respiratoire  8
  syndrome de stevens johnson et  8
  de syndrome de stevens johnson  7
  un syndrome de  7
  syndrome d’hyperstimulation ovarienne  7
  un syndrome grippal  5
  un syndrome pseudo grippal  5
  syndrome de turner  5

EN:
  stevens johnson syndrome  25
  respiratory distress syndrome  8
  ovarian hyperstimulation syndrome  7
  flu like syndrome  6
  multiforme stevens johnson syndrome  6
  adult respiratory distress syndrome  5
  erythema multiforme stevens johnson syndrome  5

RU:
  гриппоподобный синдром  13
  синдром стивенса джонсона  7
  или синдром  2
  синдром гиперстимуляции яичников  3
  синдром и интерстициальная пневмония  2
  синдром лайелла  2
  синдром лизиса опухоли  2
  синдром высвобождения цитокинов  2

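The definition above (consecutive forms occurring at least twice) can be sketched in a few lines. This is a simplified illustration, not Lexico3's algorithm: the function names and the length bounds are assumptions, the delimiter set is the one listed on the Material slide, and this sketch lets segments run across delimiters.

```python
import re
from collections import Counter

# Delimiting characters as listed for the corpus segmentation.
DELIMITERS = ".,:;!?/_-\"'()[]{}§$"


def tokenize(text):
    """Split lower-cased text into forms on delimiters and whitespace."""
    pattern = "[" + re.escape(DELIMITERS) + r"\s]+"
    return [t for t in re.split(pattern, text.lower()) if t]


def repeated_segments(tokens, min_len=2, max_len=5, min_freq=2):
    """Return every run of min_len..max_len consecutive forms seen at least min_freq times."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {" ".join(seg): freq for seg, freq in counts.items() if freq >= min_freq}


# Invented sample text, for illustration only.
sample = "Stevens-Johnson syndrome; flu-like syndrome. Stevens-Johnson syndrome."
for segment, freq in sorted(repeated_segments(tokenize(sample)).items()):
    print(freq, segment)
```

Because the hyphen is a delimiter, "Stevens-Johnson" is counted as the two forms "stevens johnson", which matches the way the segments are listed above.
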
Multiple co-occurrences
MARTINEZ (2003): The method is based on iterative calculation of lexical attractions. Filtering techniques reduce the number of contextual explorations.
[Figure: network of forms A to I with the paths A B E G, A B E H, A B F I, A C and A D; labelled paths: ABEG, ABEH, ABFI]
Only non-inclusive paths are selected.

Choosing comparable textual units as starting points for exploration…
со стороны (F=216)
CONTEXT:
FR: troubles du système nerveux : insomnie ; hypoesthésie ; paresthésies.
EN: nervous system disorders: dizziness, paraesthesia, hyperaesthesia.
RU: со стороны нервной системы: головокружение, головная боль, тревога, депрессия, парестезии, гиперстезии, возбуждение, нарушения сна, нервозность, астения.

Exploring collocation networks: French
(Tool output headers read "Contexte n°N (X formes dont Y vedettes) Densité info.": context no. N, X forms of which Y are headwords, information density.)
Contexte n°1292 (15 formes dont 3 vedettes) Densité info.=0.20
  troubles du métabolisme et de la nutrition : augmentation des triglycérides sériques, augmentation du cholestérol sérique.
Contexte n°1414 (12 formes dont 3 vedettes) Densité info.=0.25
  troubles du métabolisme et de la nutrition : augmentation de la créatinine, hypokaliémie.
Contexte n°1425 (12 formes dont 3 vedettes) Densité info.=0.25
  troubles du métabolisme et de la nutrition : élévation de l'urée sanguine.
Contexte n°3180 (10 formes dont 3 vedettes) Densité info.=0.30
  troubles du métabolisme et de la nutrition : oedèmes, oedèmes périphériques
Contexte n°4667 (12 formes dont 3 vedettes) Densité info.=0.25
  troubles du métabolisme et de la nutrition : élévation de l'urée sanguine.
Contexte n°6157 (10 formes dont 3 vedettes) Densité info.=0.30
  troubles du métabolisme et de la nutrition : fréquent : hypertriglycéridémie, hyperglycémie
Contexte n°8334 (13 formes dont 3 vedettes) Densité info.=0.23
  troubles du métabolisme et de la nutrition : prise de poids ou amaigrissement, oedèmes.
Contexte n°10151 (19 formes dont 3 vedettes) Densité info.=0.16
  troubles du métabolisme et de la nutrition très fréquents perte de poids fréquents perte d'appétit, prise de poids
Legend: f [s] c, where f = co-frequency, s = specificity, c = number of contexts

Exploring collocation networks: English
Contexte n°32 (4 formes dont 3 vedettes) Densité info.=0.75
  metabolism and nutrition disorders
Contexte n°3715 (7 formes dont 3 vedettes) Densité info.=0.43
  metabolism and nutrition disorders : oedema, peripheral oedema
Contexte n°3763 (5 formes dont 3 vedettes) Densité info.=0.60
  metabolism and nutrition disorders : hypokalaemia.
Contexte n°5392 (23 formes dont 3 vedettes) Densité info.=0.13
  metabolism and nutrition disorders : very common : hypercholesterolemia, hypertriglyceridemia (hyperlipemia) ; hypokalaemia ; increased lactic dehydrogenase (ldh) common : liver function tests abnormal ; increased sgot, increased sgpt.
Contexte n°7698 (7 formes dont 3 vedettes) Densité info.=0.43
  metabolism and nutrition disorders : common : hypertriglyceridaemia, hyperglycaemia
Contexte n°8856 (11 formes dont 3 vedettes) Densité info.=0.27
  metabolism and nutrition disorders : abnormal renal function tests (increased creatinine, bun)
Contexte n°9771 (9 formes dont 3 vedettes) Densité info.=0.33
  metabolism and nutrition disorders : weight gain or loss, oedema.
Contexte n°11578 (13 formes dont 3 vedettes) Densité info.=0.23
  metabolism and nutrition disorders very common weight loss common decreased appetite, weight increase
Legend: f [s] c, where f = co-frequency, s = specificity, c = number of contexts

Exploring collocation networks: Russian
Contexte n°150 (10 formes dont 4 vedettes) Densité info.=0.40
  со стороны обмена веществ : обострение сахарного диабета, отеки или обезвоживание.
Contexte n°832 (37 formes dont 4 vedettes) Densité info.=0.11 c=10500
  побочное действие : со стороны обмена веществ : после назначения натеглинида, как и при применении других гипогликемических препаратов, были отмечены симптомы, предположительно свидетельствующие о развитии гипогликемии, такие как повышенная потливость, тремор, головокружение, повышенный аппетит, сердцебиение, тошнота, слабость, недомогание.
Contexte n°930 (18 formes dont 4 vedettes) Densité info.=0.22
  со стороны обмена веществ : гиперпролактинемия, увеличение (редко уменьшение) массы тела, сахарный диабет, гипергликемия, диабетический кетоацидоз, диабетическая кома, зоб.
Contexte n°1439 (10 formes dont 4 vedettes) Densité info.=0.40
  со стороны обмена веществ : гипергликемия, сахарный диабет, кетоацидоз ожирение, дегидратация.
Contexte n°1459 (12 formes dont 4 vedettes) Densité info.=0.33
  со стороны обмена веществ : обезвоживание, обострение сахарного диабета, ожирение по центральному типу.
/
Contexte n°647 (16 formes dont 4 vedettes) Densité info.=0.25
  со стороны пищеварительной системы : возможны тошнота, рвота, повышение активности act, алт и ггт ; описаны случаи гепатита.
Contexte n°718 (23 formes dont 4 vedettes) Densité info.=0.17 c=752
  побочное действие : со стороны пищеварительной системы : возможны боли и дискомфорт в эпигастральной области, тошнота, рвота, диарея, снижение аппетита, повышение активности печеночных трансаминаз.
Contexte n°873 (31 formes dont 4 vedettes) Densité info.=0.13
  со стороны пищеварительной системы : часто-повышение активности ггт ; возможны-повышение активности алт, аст, щф и уровня общего билирубина, тошнота, рвота, диарея, боли в животе ; в единичных случаях-желтуха, тяжелые гепатотоксические реакции.
Legend: f [s] c, where f = co-frequency, s = specificity, c = number of contexts

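In the context listings of these collocation-network slides, every printed "Densité info." value equals the number of headwords ("vedettes") divided by the number of forms ("formes"), rounded to two decimals (for example 3/15 = 0.20 and 4/37 ≈ 0.11). The sketch below checks that reading against a header line. The formula is inferred from the displayed figures only, not from the tool's documentation, and the function names are hypothetical.

```python
import re

# Matches the counts and the printed density in a context header line.
HEADER = re.compile(r"\((\d+) formes dont (\d+) vedettes\) Densité info\.=(\d+\.\d+)")


def info_density(n_forms, n_headwords):
    """Share of headword forms among all forms of a context, rounded to 2 decimals."""
    if n_forms <= 0:
        raise ValueError("a context must contain at least one form")
    return round(n_headwords / n_forms, 2)


def check_header(header):
    """Return True if the density printed in a header matches headwords / forms."""
    match = HEADER.search(header)
    if match is None:
        raise ValueError("unrecognised context header")
    n_forms, n_headwords = int(match.group(1)), int(match.group(2))
    return abs(info_density(n_forms, n_headwords) - float(match.group(3))) < 1e-9


print(check_header("Contexte n°1292 (15 formes dont 3 vedettes) Densité info.=0.20"))
print(check_header("Contexte n°832 (37 formes dont 4 vedettes) Densité info.=0.11"))
```

Under this reading, short contexts made up mostly of headwords get the highest density, which is consistent with the bare heading "metabolism and nutrition disorders" scoring 0.75.
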
Combining with French-English lexicon extracted from parallel corpus…
• Automatic segmentation into textual units (Lexico3):
  • Forms, Repeated Segments (Russian-French-English)
• Identification of anchor pairs (starting points):
  • Frequency counts, cognates, general words, French/English terminology… (syndrome / syndrome / синдром)
• Trilingual collocation networks (COOCS):
  • Identification of similar context vectors.
• Semi-automatic segmentation into terminological units.
• Cross-language check.
• Expert validation.

Building trilingual terminology: collaboration domain expert / corpus linguist
Two different kinds of knowledge / skills:
• From corpus linguist:
  • Methodological knowledge: tools and methods for text exploration.
  • Quantitative results on corpora.
• From domain expert:
  • Domain-specific knowledge on ADRs (adverse drug reactions).
  • Choice of relevant terms/contexts when several variants are attested in texts.

Results: trilingual lexicon on the Web
PERTOMed Server: http://baneyx.net/SPIP/
Each trilingual entry comprises the following fields:
- Simple term (with possible variants)
- Abbreviation (if applicable)
- Related composed term(s)
- Domain(s)
- Medical product(s) concerned

Results: choosing terms…
гриппоподобный синдром / syndrome pseudo-grippal / influenza-like symptom, flu-like symptom

Results: choosing domains…

Discussion: positive results
• 430 validated trilingual terminological entries in XML format
• 2002 simple terminological records (single-word terms)
• 1006 complex terminological records (~50%) (multiword terms)
• Co-occurrence relations:

Discussion: limits
• Lexical coverage.
• Contextual access.
• Presentation (no visual aids for navigation yet…).
• Evaluation difficulties:
  • Choice of criteria.
  • Comparable resources needed.

Conclusions
• Creating terminological resources from comparable corpora faces the intrinsic heterogeneity of texts.
• The challenge of exploring texts from different cultural and linguistic sources should be taken into account in the feasibility study of a terminology project.
• Creating a Russian Internet corpus in the field of pharmacovigilance is pioneering work.
• Using a textometric approach to explore comparable corpora gives encouraging results.
• Our methods should be improved as new tools and resources for processing Russian texts become available.

Future work…
• Intertextual exploration on the document level based on visual aids:

DISTRIBUTION INVENTORY OF REPEATED SEGMENTS
216 ---- ---- ---- ---- ---- со стороны
2 ---- со стороны центральной и периферической нервной системы
16 ---- ---- ---- ---- со стороны цнс
3 ---- со стороны цнс головная боль головокружение
3 ---- со стороны цнс и периферической нервной системы
4 ---- ---- ---- со стороны цнс возможны
2 ---- со стороны цнс возможны повышенная утомляемость головная боль
5 ---- ---- ---- со стороны дыхательной системы
2 ---- ---- со стороны дыхательной системы возможны
10 ---- ---- ---- со стороны кожных покровов
2 ---- ---- со стороны кожных покровов алопеция
2 ---- ---- со стороны кожных покровов сыпь
4 ---- ---- со стороны костно мышечной системы
2 ---- ---- ---- ---- со стороны крови
2 ---- ---- ---- со стороны лабораторных показателей
4 ---- ---- ---- со стороны мочевыделительной системы

Publications on the PERTOMed Project:
• Baneyx A., Charlet J., Jaulent M.-C. (2005) "Building medical ontologies based on terminology extraction from texts: methodological propositions". In Proceedings of the 10th Conference on Artificial Intelligence in Medicine in Europe, Lecture Notes in Computer Science, Aberdeen, GB, July 2005. Springer.
• Jaulent M.-C., Charlet J. (2006) "PERTOMed : Production et évaluation de ressources terminologiques et ontologiques dans le domaine de la médecine". PERTOMed : Rapport de fin de projet, INSERM U729.
• Nuk I., Ivanova J. (2005) "Création d’une terminologie français/russe dans le domaine de la pharmacovigilance". Mémoire de DESS (dir. Monique Slodzian), Centre de Recherche en Ingénierie Multilingue, INaLCO.
• Ozdowska S., Névéol A., Thirion B. (2005) "Traduction compositionnelle automatique de bitermes dans des corpus anglais/français alignés". Actes de la Conférence Terminologie et Intelligence Artificielle, TIA'05, Rouen, France.