Information Extraction, Service Discovery and Semantic Services in HealthGrid Applications

Martin Hofmann, Department of Bioinformatics

Challenges in HealthGrids • Information explosion in the Life Sciences • Highly parallel experimental procedures (e.g. 30,000 genes on one microarray) • Dominance of descriptive information, mostly in poorly structured text • Insufficient representation of knowledge in databases / knowledge bases • Insufficient integration of biomedical data

Data Integration in HealthGRIDs • Excursus on the SIMDAT Project • Pharma Activity • LION bioscience Ltd. • NEC Research Laboratories (CCRLE) • Fraunhofer SCAI Dept. of Bioinformatics • Free University of Brussels / EMBnet Node Belgium • GlaxoSmithKline (GSK) • University of Karlsruhe

SRS Screenshot

SRS Servers Within a Multi-Site Organisation • Central Server: low maintenance, no independence, no fault tolerance • Cooperative Servers: maximum independence, data exchange necessary, high maintenance • Federated Servers: max. resource sharing, max. cooperation, technologically more difficult • Figure legend: site with SRS server; site without SRS server

Characteristics of SRS Federation • One server knows all resources shared among servers: databanks; tools; canned queries and data views; users (access rights, user sessions, personalization) • Fail-safe mechanism if a local databank or tool fails • Servers in an SRS federation can automatically synchronize and exchange meta information, indices (optional) and flat files (optional) • An SRS federation behaves like a single SRS server • Redundancy can be configured to improve performance and fail safety

Technical Challenges and Opportunities • Collaboration of servers to follow and execute cross-database queries and to build composite objects and reports with information from several databases • Optimization by finding optimal paths for database linking (cross-database queries) and by using 'clever' algorithms that minimize the number of transactions between servers when building composite objects and reports
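One way to read "finding optimal paths for database linking" is as a shortest-path search over a graph whose nodes are databanks and whose edges are the cross-references between them. The sketch below is only an illustration under that assumption: the link graph `LINKS` and the function name `shortest_link_path` are hypothetical and do not describe how SRS itself plans queries.

```python
from collections import deque


def shortest_link_path(links, source, target):
    """Breadth-first search for the shortest chain of linked databases.

    links maps a database name to the databases it cross-references.
    Returns the list of databases on the path, or None if unreachable.
    """
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in links.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Hypothetical link graph, invented for illustration only.
LINKS = {
    "EMBL": ["UniProt", "MEDLINE"],
    "UniProt": ["PDB", "InterPro"],
    "MEDLINE": ["UniProt"],
    "InterPro": ["PDB"],
}

path = shortest_link_path(LINKS, "EMBL", "PDB")
print(path)
```

Breadth-first search minimizes the number of link hops. If hops had different costs (for example, links that cross servers), a weighted search such as Dijkstra's algorithm would be the natural replacement.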

Information Extraction for HealthGRIDs • Towards a "Semantic Hub" Covering the Biomedical Name Space

Growth of life science data (here: entries in EMBL) exceeds the growth rate of compute (CPU) power as described by Moore's law • An update every second • Figure labels: Megabases; Moore's Law • Schema taken (with permission) from Graham Cameron, Deputy Director of the European Bioinformatics Institute (EBI)

Growth rate of Medline • Updates: Since 2002, between 1,500-3,500 completed references are added each day Tuesday through Saturday; over 571,000 total added during 2004. Source: http://www.nlm.nih.gov/pubs/factsheets/medline.html

Effort for Information (better: knowledge) Retrieval as Defined by PubMed Searches Source: http://www.nlm.nih.gov/pubs/factsheets/medline.html

A Basic Observation • The more complex a subject, the more likely you will find it only in unstructured text

One Possible Solution: Information Extraction • Computer-Aided, Automated Information Extraction

Some Specific Problems of Information Extraction from Life Science Publications • Multiple names for one gene • Ambiguous names in databases • Common word names • Multi-word terms • Spelling variants • Permutations • Nested names • Example names shown on the slide: WAS, STEP, iCE, StAR; Interleukin 1 alpha, Tumor necrosis factor beta; p21, EPO, large T antigen; TNF receptor 1; collagen, type I, alpha receptor; Neuronectin, GMEM, tenascin, HXB, cytotactin, hexabrachion; F12A; Collagen, type I, alpha 1 / Collagen alpha 1(I) chain / Alpha 1 collagen / Alpha-1 type I collagen / COL1A1

Protein and Gene Name Recognition • Semi-automated generation of biomedical dictionaries. Example: Human Protein Dictionary with ~20,000 objects and ~160,000 synonyms • Mapping tables allow linkage of extracted protein and gene objects to experimental data (e.g. gene expression data, Gene Ontology, …) • Scoring algorithm for multi-word term disambiguation based on token classes (Hanisch et al., 2003)* • Fast, approximate matching algorithm for rapid, distributed entity recognition in scientific text • Fraunhofer SCAI ProMiner and the Biological Entity Recognition Module search ~13,000,000 MEDLINE abstracts for all human protein and gene names overnight on an 8-CPU parallel computer** * Hanisch D, Fluck J, Mevissen HT, Zimmer R. Pac Symp Biocomput. 2003:403-14. ** On a 36-node Sun cluster, the same search takes about 90 minutes.
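The dictionary-based recognition described above can be sketched as follows. This is a deliberately simplified stand-in, not the ProMiner algorithm: it only normalises case, punctuation and letter/digit boundaries and then does a greedy longest-match lookup, with no token-class scoring and no approximate matching. The two-entry `DICTIONARY`, its identifiers and all function names are assumptions made for illustration.

```python
import re


def normalise(text):
    """Lower-case the text and split it into letter runs and digit runs."""
    return tuple(re.findall(r"[a-z]+|[0-9]+", text.lower()))


def build_index(dictionary):
    """Map every normalised synonym to the identifier of its dictionary object."""
    index = {}
    for identifier, synonyms in dictionary.items():
        for synonym in synonyms:
            index[normalise(synonym)] = identifier
    return index


def find_entities(text, index, max_len=6):
    """Greedy longest-match lookup of dictionary synonyms in tokenised text."""
    tokens = normalise(text)
    hits, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tokens[i:i + n]
            if key in index:
                hits.append((" ".join(key), index[key]))
                i += n
                break
        else:
            # No synonym starts at this token; move on by one token.
            i += 1
    return hits


# Tiny illustrative dictionary; the identifiers are placeholders.
DICTIONARY = {
    "COL1A1": ["COL1A1", "collagen, type I, alpha 1", "alpha-1 type I collagen"],
    "TENASCIN": ["tenascin", "cytotactin", "hexabrachion"],
}
INDEX = build_index(DICTIONARY)

hits = find_entities("Alpha-1 type-I collagen and Tenascin were up-regulated.", INDEX)
print(hits)
```

Because matching runs on normalised tokens, spelling variants such as "Alpha-1 type-I collagen" and "alpha 1 type I collagen" fall onto the same key. Permutations and ambiguous or common-word names would still need the scoring and disambiguation steps that the slide attributes to Hanisch et al.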

Critical Evaluation of Information Extraction Approaches in Molecular Biology and Genome Research: BioCreative

Overview of Results of the BioCreative Competition • Figure panels: mouse, yeast, fly

Example of the Application of Text Mining

Gene-Disease Associations: Osteoarthritis • Relationship between a specific disease and protein names • Used co-occurrence of disease terms (MeSH) and genes • Used a statistical measure to determine significant associations • Figure: protein-protein interaction networks representing the 70 top-scoring proteins associated with osteoarthritis (red: significant association; white: no significant association)
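The slide does not say which statistical measure was used. As one plausible choice, the sketch below computes a hypergeometric upper-tail p-value (equivalent to a one-sided Fisher's exact test) for the number of abstracts in which a disease term and a gene name co-occur. The function name and all counts are invented for illustration and are not results from the talk.

```python
from math import comb


def cooccurrence_p_value(n_total, n_disease, n_gene, n_both):
    """Probability of seeing at least n_both co-occurrences by chance.

    n_total   : abstracts in the corpus
    n_disease : abstracts mentioning the disease term
    n_gene    : abstracts mentioning the gene
    n_both    : abstracts mentioning both
    Assumes the gene abstracts are a random draw from the corpus
    (hypergeometric null model).
    """
    upper = min(n_disease, n_gene)
    tail = sum(
        comb(n_disease, k) * comb(n_total - n_disease, n_gene - k)
        for k in range(n_both, upper + 1)
    )
    return tail / comb(n_total, n_gene)


# Made-up counts: 10,000 abstracts, 200 on the disease, 50 on the gene, 12 on both.
p = cooccurrence_p_value(10_000, 200, 50, 12)
print(f"p = {p:.3g}")
```

When many genes are tested against the same disease, the resulting p-values would also need a multiple-testing correction before any association is called significant.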

Osteoarthritis Sub-Network • Zooming into a protein-protein interaction sub-network and the relationships between proteins involved in osteoarthritis

Use of a Concept-Based Semantic Hub in HealthGRIDs • Using the named-entity recognition machinery for distributed indexing of databases and documents • Information extraction from distributed documents • Large-scale information extraction NOT limited to MEDLINE • Semantic mediation through populated ontologies

Reconstruction of Pharmaceutical Information • Recognition and Reconstruction of Chemical Structures

Information on Chemical Structures in Scientific Text

Aim: Multimodal Extraction and Reconstruction of Chemical Structure Information from Patents and other Scientific Text • Image Analysis / Structure Reconstruction (fragments shown: -CH3, -CH2-CH3, -CH2-CNHS, -COOH) • Text Analysis / Entity Recognition (fragment shown: -COOH) • Reconstruction of Published ChemSpace including PatentSpace

Reconstruction of Chemical Structure Information from Images • Design of the "chemOCR" System

Structure Reconstruction Workflow • Components shown in the diagram: BMP / PDF filter; SVG converter; line filter module; approx. graph matcher; molecular graph converter; chemical cartridge; rules; common fragments module; chemical rules module; molecule database; machine learning tool; manual curation tool

Character Recognition and Resolution of Superatoms:

Correction of conversion errors (BMP to SVG) • Identifying disconnected bonds: Relative Neighbourhood Graph (RNG)
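For reference, the Relative Neighbourhood Graph connects two points p and q exactly when no third point r is closer to both of them than they are to each other, i.e. when there is no r with max(d(p, r), d(q, r)) < d(p, q). The brute-force sketch below implements only this standard definition. How chemOCR applies the RNG to line endpoints is not stated on the slide, so the function name and the sample points are illustrative assumptions.

```python
from itertools import combinations
from math import dist


def relative_neighbourhood_graph(points):
    """Return RNG edges as index pairs (i, j) with i < j.

    Brute-force O(n^3) construction, adequate for the small point sets
    of a single structure drawing.
    """
    edges = []
    for i, j in combinations(range(len(points)), 2):
        d_ij = dist(points[i], points[j])
        # The edge is blocked if some third point is closer to both ends.
        blocked = any(
            max(dist(points[i], points[k]), dist(points[j], points[k])) < d_ij
            for k in range(len(points))
            if k not in (i, j)
        )
        if not blocked:
            edges.append((i, j))
    return edges


# Three collinear points: the long edge (0, 2) is blocked by point 1.
print(relative_neighbourhood_graph([(0, 0), (1, 0), (2, 0)]))
```

Under this reading, candidate bonds would be RNG edges between line endpoints that the vectorisation left unconnected; any distance threshold applied on top of that would be a further design choice.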

Most common fragment patterns:

Graph Matching Example • Input graph • Subgraph isomorphisms found • Fragments used for the reconstruction • Decomposition network
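Finding known fragments in an input graph is a subgraph isomorphism search. The sketch below is a minimal exact backtracking matcher over labelled, undirected graphs. It is not the approximate graph matcher named in the workflow, and the graph encoding, function name and example molecule are assumptions chosen for illustration.

```python
def find_subgraph_matches(pattern, target):
    """Yield mappings pattern node -> target node that preserve node labels
    and every pattern edge.

    Each graph is a pair (labels, edges): labels maps node -> atom label,
    edges is a set of frozenset({u, v}) pairs.
    """
    p_labels, p_edges = pattern
    t_labels, t_edges = target
    order = list(p_labels)

    def extend(mapping):
        if len(mapping) == len(order):
            yield dict(mapping)
            return
        node = order[len(mapping)]
        for cand in t_labels:
            if cand in mapping.values() or t_labels[cand] != p_labels[node]:
                continue
            # Every pattern edge to an already mapped node must exist in the target.
            ok = all(
                frozenset({cand, mapping[other]}) in t_edges
                for other in mapping
                if frozenset({node, other}) in p_edges
            )
            if ok:
                mapping[node] = cand
                yield from extend(mapping)
                del mapping[node]

    yield from extend({})


# Pattern: a carbon bonded to two oxygens (bond orders ignored in this sketch).
fragment = ({0: "C", 1: "O", 2: "O"}, {frozenset({0, 1}), frozenset({0, 2})})
# Target: heavy atoms of acetic acid, CH3-C(=O)-OH.
molecule = (
    {"a": "C", "b": "C", "c": "O", "d": "O"},
    {frozenset({"a", "b"}), frozenset({"b", "c"}), frozenset({"b", "d"})},
)

matches = list(find_subgraph_matches(fragment, molecule))
print(matches)
```

The two matches differ only in which oxygen is assigned to which pattern node. A production matcher would remove such symmetric duplicates and, for noisy image-derived graphs, tolerate missing or spurious edges.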

chemOCR Prototype • SVG graphics • Line filtering and matching • Graph editor and file conversion

First Results:

Grid Service Annotation and Discovery • A Universal, Easy-to-Use Tool for Grid Service and Data Annotation (a first result of our work in the SIMDAT project)

Top level classes • Scientific Domain • Scientific Theory • Method • Tool • Workflow • Experiment • Data • Repository Methodology

Domains

method Scientific Methodology

Data

From Domain to Data

Grid Service Annotation • Ontology Classes • Grid Services • Just like ontologies, semantic annotations built on those ontologies have to be stored centrally • Annotations should ideally not disturb the annotated entities (GS, data, ...) • => non-destructive annotation, stored safely in a third place • Annotations as Subject-Predicate-Object (S, P, O) triples, e.g. "Service 'xyz' provides BLASTX search"
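The subject-predicate-object idea can be illustrated with a small in-memory triple store that lives apart from the services and data it describes. This is a sketch only: the class names, the `service:` identifiers and the predicate vocabulary are assumptions, not the SIMDAT tool's actual data model.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


class AnnotationStore:
    """Keeps annotations separate from the annotated services and data."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add(Triple(subject, predicate, obj))

    def query(self, subject=None, predicate=None, obj=None):
        """Return matching triples in sorted order; None acts as a wildcard."""
        return sorted(
            t for t in self._triples
            if (subject is None or t.subject == subject)
            and (predicate is None or t.predicate == predicate)
            and (obj is None or t.obj == obj)
        )


store = AnnotationStore()
store.add("service:xyz", "provides", "BLASTX search")
store.add("service:xyz", "isA", "Tool")
store.add("service:abc", "provides", "BLASTX search")

# Service discovery: which services provide a BLASTX search?
providers = [t.subject for t in store.query(predicate="provides", obj="BLASTX search")]
print(providers)
```

Because the store only refers to services by identifier, annotations can be added, corrected or removed without touching the services themselves, which is the non-destructive property the slide asks for.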

Acknowledgement • Chemical Structure Reconstruction • Le Thuy Bui Thi • Marc Zimmermann • Tanja Fey • Grid Service Discovery and Annotation • Kai Kumpf • Extraction of Biological Information • Juliane Fluck • Theo Mevissen • Hartwig Deneke • Daniel Hanisch • Prof. Ralf Zimmer* • Florian Sohler* • Katrin Fundel*
