Evaluation of different benchmark sets and evaluation methods for automatic extraction of chemical entities from text and image. 5th Meeting on U.S. Government Chemical Databases and Open Chemistry, August 2011. Marc Zimmermann, Martin Hofmann-Apitius
Evaluation of different benchmark sets and evaluation methods for automatic extraction of chemical entities from text and image
5th Meeting on U.S. Government Chemical Databases and Open Chemistry, August 2011
Marc Zimmermann, Martin Hofmann-Apitius
Bonn-Aachen Center for Information Technology, University of Bonn
SCAIView: gene + protein index (Lucene + semantic entities)
• Link out to biological reference databases
• Entities handled by ProMiner
• Semantic tagging
Automatic Binning of Images
[Figure: images sorted into the bins Database, Curation, and Trash]
Challenge
• Problem
  • Predict the quality of the reconstruction result without a reference molecule
• Solution
  • Machine learning
• Expected results
  • Quality of new reconstructions estimated by trained models
The Evaluation Concept of SCAI and InfoChem
[Workflow diagram; labels in the order they appear]
• 1: Manual abstraction of chemical names; Comparison (quantitative)
• 2: NER, N2S, ICANNOTATOR
• 5: PDF to text; Automatic chemical verification; Database
• 3: Page segmentation; Chemical recognition; Image classifier; chemoCR (Fraunhofer); Comparison (quantitative)
• 4: "Similarity MCD"; Manual abstraction of structures from images
Quality Measure: Graph Matching
[Figure: example reconstructions labeled "OK" and "bad"]
• SimilarityMCD (Minimal Chemical Distance)
  • Module from InfoChem
• Graph matching on
  • The reconstruction result of chemoCR
  • The reference molecule
• Results in
  • A numerical value in [0,1]
Chemical Error Classification Scheme
• MISSED
  • BOND_MISSED
    • COMPLETE_BOND_MISSED
    • ORDER_BOND_MISSED
    • CHIRAL_BOND_MISSED
  • SYMBOL_MISSED
    • ATOM_SYMBOL_MISSED
    • ISOTOPE_SYMBOL_MISSED
    • CHARGE_SYMBOL_MISSED
    • RADICAL_SYMBOL_MISSED
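One possible way to work with such a scheme is to store it as a tree and roll observed errors up to their parent classes. The sketch below is only an illustration: the nesting is inferred from the label names on the slide, and `ERROR_SCHEME`, `path_to`, and `roll_up` are hypothetical names, not part of chemoCR or any InfoChem module.

```python
from collections import Counter

# Assumed nesting, inferred from the label names on the slide.
ERROR_SCHEME = {
    "MISSED": {
        "BOND_MISSED": [
            "COMPLETE_BOND_MISSED",
            "ORDER_BOND_MISSED",
            "CHIRAL_BOND_MISSED",
        ],
        "SYMBOL_MISSED": [
            "ATOM_SYMBOL_MISSED",
            "ISOTOPE_SYMBOL_MISSED",
            "CHARGE_SYMBOL_MISSED",
            "RADICAL_SYMBOL_MISSED",
        ],
    }
}


def path_to(label, tree=ERROR_SCHEME, prefix=()):
    """Return the tuple of classes from the root down to `label`, or None."""
    for key, sub in tree.items():
        here = prefix + (key,)
        if key == label:
            return here
        if isinstance(sub, dict):
            found = path_to(label, sub, here)
            if found:
                return found
        elif label in sub:
            return here + (label,)
    return None


def roll_up(errors):
    """Count each observed error at its own level and at every parent level."""
    counts = Counter()
    for error in errors:
        for level in path_to(error) or ():
            counts[level] += 1
    return counts


observed = ["ORDER_BOND_MISSED", "CHARGE_SYMBOL_MISSED", "ORDER_BOND_MISSED"]
summary = roll_up(observed)
```

Rolling counts up the tree lets a report show both the coarse picture (how many bond errors in total) and the specific failure type.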
Mapping of Reaction Schemes with Spatial Constraints
[Figure: reconstruction compared with reference]
Mining of Chemical Names
• Chemical names should be found in the text
• Synonyms and spelling variations in different databases
• Several text mining techniques developed
Example: Sodium lauryl sulfate (DB00815, DrugBank): 230 brand names and 26 synonyms
Compounds Sharing a Synonym: "Livesan"
• Entry DB00436, Bendroflumethiazide, from DrugBank
• Entry C07586, Procetofen, from KEGG Compound
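Ambiguous synonyms of this kind can be found by inverting the name lists into a name-to-entries index. The following is a minimal sketch that uses only the two entries named on the slide; `ambiguous_synonyms` is a hypothetical helper, and the case-insensitive comparison is an assumption.

```python
from collections import defaultdict

# Only the names stated on the slide; real entries carry many more synonyms.
entries = {
    "DB00436": {"Bendroflumethiazide", "Livesan"},  # DrugBank
    "C07586": {"Procetofen", "Livesan"},            # KEGG Compound
}


def ambiguous_synonyms(entries):
    """Return names (lower-cased) that occur in more than one entry."""
    index = defaultdict(set)
    for entry_id, names in entries.items():
        for name in names:
            index[name.strip().lower()].add(entry_id)
    return {name: ids for name, ids in index.items() if len(ids) > 1}


shared = ambiguous_synonyms(entries)
```

Names flagged this way are candidates for manual review before any synonym-based merge of the two entries.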
Task: Generating a Dictionary
• Example names that refer to one structure each:
  • (-)-Epiafzelechin, epi-Afzelechin → UID1
  • 5-(1-cycloheptenyl)-5-ethyl-1,3-diazinane-2,4,6-trione, Heptabarbital
• Built from rather reliable data sources
• Recognize different chemical names that refer to the same structure and map them to a unique identifier
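The dictionary task can be sketched as grouping names by a structure key and issuing one identifier per key. In the sketch below the structure keys (`STRUCT_A`, `STRUCT_B`) are placeholders, not real structure encodings, and `build_dictionary` and the `UID<n>` numbering are assumptions made for illustration.

```python
# (name, structure_key) pairs; the keys are placeholders standing in for
# whatever canonical structure representation a real pipeline would compute.
records = [
    ("(-)-Epiafzelechin", "STRUCT_A"),
    ("epi-Afzelechin", "STRUCT_A"),
    ("5-(1-cycloheptenyl)-5-ethyl-1,3-diazinane-2,4,6-trione", "STRUCT_B"),
    ("Heptabarbital", "STRUCT_B"),
]


def build_dictionary(records):
    """Map every name to the identifier assigned to its structure key."""
    uid_of_structure = {}
    name_to_uid = {}
    for name, structure_key in records:
        # A new identifier is issued only the first time a key is seen.
        uid = uid_of_structure.setdefault(
            structure_key, f"UID{len(uid_of_structure) + 1}"
        )
        name_to_uid[name] = uid
    return name_to_uid


name_to_uid = build_dictionary(records)
```

With this layout, a text-mining hit on any listed name resolves to the same identifier as its synonyms.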
Different Mapping Approaches
Example entries:
• C02265: D-Phenylalanine; D-alpha-Amino-beta-phenylpropionic acid
• DB02556: D-Phenylalanine; (2R)-2-amino-3-phenylpropanoic acid
Approaches:
• Synonym based
• Interlink based
• Structure based
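A synonym-based mapping can be sketched as clustering entries that share at least one name. The example below uses a small union-find over the entries quoted in these slides; `synonym_clusters` is a hypothetical helper, and matching names case-insensitively is an assumption rather than the authors' stated procedure.

```python
entries = {
    "C02265": ["D-Phenylalanine", "D-alpha-Amino-beta-phenylpropionic acid"],
    "DB02556": ["D-Phenylalanine", "(2R)-2-amino-3-phenylpropanoic acid"],
    "DB00815": ["Sodium lauryl sulfate"],
}


def synonym_clusters(entries):
    """Group entry IDs that are connected through at least one shared name."""
    parent = {entry_id: entry_id for entry_id in entries}

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}
    for entry_id, names in entries.items():
        for name in names:
            key = name.strip().lower()
            if key in first_seen:
                parent[find(entry_id)] = find(first_seen[key])
            else:
                first_seen[key] = entry_id

    clusters = {}
    for entry_id in entries:
        clusters.setdefault(find(entry_id), set()).add(entry_id)
    return sorted(sorted(cluster) for cluster in clusters.values())


clusters = synonym_clusters(entries)
```

Because one shared name is enough to merge two entries, a single ambiguous synonym can join structurally different compounds, which is why the result would normally be checked against a structure-based comparison.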
Interlink Based Approach
• D02592 from KEGG Drug
• DB01234 from DrugBank
• Non-unified approach towards parametric isomers
• Links structurally different compounds
Problem: Merging Data Sources to UID
Identity problem ("parametric isomers"):
• Stereochemistry
• Tautomerism
• Charges
• Isotopes
• Mixtures
• Polymers
• Aromaticity
• Markush structures
Importing SDF into SQL Schema
• Sources: DrugBank [1], KEGG COMPOUND [2], KEGG DRUG [2]
• Inputs: SDF files, Drugcard files
[1] http://www.drugbank.ca/, last accessed August 2010
[2] http://kegg.jp/, last accessed August 2010
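An import of this kind could look roughly as follows, using Python's standard `sqlite3` module. This is a sketch under assumptions: the `compound` table layout, the `ID` and `NAME` field tags, and the one-atom sample record are all made up for illustration and do not reflect the actual DrugBank or KEGG files or the authors' schema.

```python
import sqlite3

# Made-up single-record SDF: a molblock, two data items, and the "$$$$" delimiter.
SAMPLE_SDF = """\
placeholder
  sketch

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
M  END
> <ID>
X1

> <NAME>
example compound

$$$$
"""


def parse_sdf(text):
    """Split SDF text into (molblock, {field: value}) records."""
    records, mol_lines, fields, current, in_mol = [], [], {}, None, True
    for line in text.splitlines():
        if line.strip() == "$$$$":  # end of record
            data = {k: "\n".join(v) for k, v in fields.items()}
            records.append(("\n".join(mol_lines), data))
            mol_lines, fields, current, in_mol = [], {}, None, True
        elif in_mol:
            mol_lines.append(line)
            if line.startswith("M  END"):
                in_mol = False
        elif line.startswith(">") and "<" in line:  # data header, e.g. "> <NAME>"
            current = line[line.index("<") + 1 : line.rindex(">")]
            fields[current] = []
        elif not line.strip():  # a blank line closes the data item
            current = None
        elif current is not None:
            fields[current].append(line)
    return records


def load(conn, source, records, id_field="ID", name_field="NAME"):
    """Insert parsed records into a hypothetical `compound` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS compound ("
        "source TEXT, source_id TEXT, name TEXT, molblock TEXT, "
        "PRIMARY KEY (source, source_id))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO compound VALUES (?, ?, ?, ?)",
        [(source, f.get(id_field), f.get(name_field), mol) for mol, f in records],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, "demo", parse_sdf(SAMPLE_SDF))
rows = conn.execute("SELECT source, source_id, name FROM compound").fetchall()
```

Keeping the source name in the primary key lets entries from several databases share one table without identifier collisions.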
Dictionary Comparison
Entries are transformed into binary correspondences: all possible pairs between the compounds of one entry. Each pair is then marked as present or absent in the other dictionary.
Dictionary 1, Entry 1: Compound1, Compound2, Compound3
• Compound1, Compound2: present
• Compound1, Compound3: absent
• Compound2, Compound3: absent
Dictionary 2, Entry 1: Compound1, Compound2, Compound4
• Compound1, Compound2: present
• Compound1, Compound4: absent
• Compound2, Compound4: absent
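The pair-based comparison can be reproduced in a few lines. The sketch below uses the compound names from the slide; `binary_correspondences` is a hypothetical helper name, and representing each pair as an unordered `frozenset` is a design assumption.

```python
from itertools import combinations

dictionary_1 = {"Entry1": ["Compound1", "Compound2", "Compound3"]}
dictionary_2 = {"Entry1": ["Compound1", "Compound2", "Compound4"]}


def binary_correspondences(dictionary):
    """Return all unordered compound pairs that share an entry."""
    pairs = set()
    for compounds in dictionary.values():
        unique = sorted(set(compounds))
        pairs.update(frozenset(pair) for pair in combinations(unique, 2))
    return pairs


pairs_1 = binary_correspondences(dictionary_1)
pairs_2 = binary_correspondences(dictionary_2)

present_in_both = pairs_1 & pairs_2  # pairs marked "present"
only_in_1 = pairs_1 - pairs_2        # pairs of dictionary 1 marked "absent"
only_in_2 = pairs_2 - pairs_1        # pairs of dictionary 2 marked "absent"
```

Comparing pairs rather than whole entries means two dictionaries can still be scored against each other when they group the same compounds into entries of different sizes.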
Overlap of Binary Correspondences: DrugBank & KEGG
The Open Pharmacological Concepts Triple Store
• Develop a set of robust standards…
• Implement the standards in a semantic integration hub ("Open Pharmacological Space")…
• Deliver services to support on-going drug discovery programs in pharma and public domain…
Prototype: www.openphacts.org
Conclusions
http://trec.nist.gov/
• Chemical information extraction is an ongoing effort
• Task is challenging
• In need of critical assessments and gold standards
  • Structure reconstruction
  • Database mapping
  • Retrieval tasks
• In need of strategies
  • Deal with reconstruction errors
  • Extended file formats & search algorithms
  • Result visualizations