Life Science Knowledge Collider Vassil Momtchev (Ontotext)
Presentation Outline
• Life Sciences Domain Integration Problems
• Pathway and Interaction Knowledge Base
• Linked Life Data
• LifeSKIM Application to Showcase the Platform
ESTC
Andy Law’s First Law
“The first step in developing a new genetic analysis algorithm is to decide how to make the input data file format different from all pre-existing analysis data file formats.”
The problem!
• The data is maintained by different organizations
• The information is highly distributed and redundant
• There are tons of flat file formats with special semantics
• The knowledge is locked in vast data silos
• There are many isolated communities that cannot reach cross-domain understanding
Andy Law’s Second Law
“The second step in developing a new genetic analysis algorithm is to decide how to make the output data file format incompatible with all pre-existing analysis data file input formats.”
Take Your Best Guess
PIKB Overview
• PIKB stands for Pathway and Interaction Knowledge Base
• Interactions in the cell unveil the molecular mechanisms
  • Which molecular function or biological process is affected after the administration of a given drug?
  • What is the involvement of chemical compounds in a specific biological process or disease?
• The work is developed in the context of LarKC and refined with AstraZeneca researchers
• The use case “Semantic Integration for Early Clinical and Drug Development” will be assessed with clinical data from AstraZeneca
LarKC Project
• “Web Scale and Style Reasoning”
• Giving up 100% correctness:
  • trading quality for size
  • often completeness is not needed
  • sometimes even soundness is not needed
[Figure: precision (soundness) plotted against recall (completeness), positioning logic, Semantic Web, and IR]
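The trade-off on this slide can be made concrete with the standard IR definitions of precision and recall. The sketch below is only an illustration: the fact labels and the two "reasoner" outputs are hypothetical and do not come from LarKC.

```python
def precision_recall(returned, correct):
    """Precision and recall of a returned answer set against the correct one."""
    returned, correct = set(returned), set(correct)
    hits = returned & correct
    precision = len(hits) / len(returned) if returned else 1.0
    recall = len(hits) / len(correct) if correct else 1.0
    return precision, recall


# Hypothetical facts that a sound and complete reasoner would derive.
entailed = {"f1", "f2", "f3", "f4"}

# Sound but incomplete: everything returned is correct, some answers are missing.
incomplete = {"f1", "f2"}

# Neither sound nor complete: one wrong answer ("f9"), one missing answer ("f4").
approximate = {"f1", "f2", "f3", "f9"}

print(precision_recall(incomplete, entailed))
print(precision_recall(approximate, entailed))
```

Read this way, giving up completeness lowers recall and giving up soundness lowers precision.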
PIKB Objectives
• Easily integrate pathway and interaction data from different sources
• Allow straightforward updates of the information
• Provide scientists with computational support to conceptualize the breadth and depth of relationships between data
• Scale up to billions of statements
PIKB Data Sources

Type of data source                   Database name
Gene and gene annotations             Entrez-Gene
Protein sequences                     Uniprot
Protein cross references              iProClass
Gene and gene product annotations     GeneOntology
Organisms                             NCBI Taxonomy
Molecular interaction and pathways    BioGRID, NCI, Reactome, BioCarta, KEGG, BioCyc

Example questions:
• Give me all human genes that are located on the X chromosome.
• List all protein identifiers encoded by gene IL2.
• List all cross references to the protein Interleukin-2.
• List all articles where the protein Interleukin-2 is mentioned.
• Give me all human proteins associated with the endoplasmic reticulum.
• Give all terms more specific than “cell signaling” (e.g., synaptic transmission, transmission of nerve impulse).
• List all subcategories of primates.
• Give me all interactions of cell division protein kinase.

Sometimes we need to ask far more complex questions efficiently:
Give me all proteins that interact in the nucleus, are annotated as repressors, and have at least one participant encoded by a gene that is annotated with a specific term and located on chromosome X. Filter the results for Mammalia organisms!
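A question such as "give all terms more specific than cell signaling" amounts to a transitive walk down a term hierarchy. The following is a minimal sketch under stated assumptions: the `PARENTS` table and the intermediate term "cell-cell signaling" are made up for illustration and are not copied from GeneOntology.

```python
# Toy child -> parents edges; assumed for illustration only.
PARENTS = {
    "synaptic transmission": ["cell-cell signaling"],
    "transmission of nerve impulse": ["cell-cell signaling"],
    "cell-cell signaling": ["cell signaling"],
    "cell signaling": ["biological process"],
}


def more_specific_than(term, parents=PARENTS):
    """Return every term that lies below `term`, directly or transitively."""
    # Invert the child -> parents table into parent -> children.
    children = {}
    for child, direct_parents in parents.items():
        for parent in direct_parents:
            children.setdefault(parent, set()).add(child)

    # Depth-first walk over the inverted hierarchy.
    seen, stack = set(), [term]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


print(sorted(more_specific_than("cell signaling")))
```

A semantic repository can answer the same question by treating the hierarchy relation as transitive, so the walk does not have to be written by hand.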
Possible Solutions
• Classical data integration with:
  • data warehouses
  • federation middleware frameworks
  • database middleware technology
• Not really...
  • Mapping works efficiently only on a small scale
  • Different design paradigms can be a real challenge
  • Direct mapping usually does not work
  • No standard way to integrate textual information
Our Approach
• Convert all data sources to an RDF representation (if not already distributed as RDF)
• Collide the data in a scalable semantic repository
• Apply light-weight reasoning to specify formal interpretations of the data (e.g., remove redundancy)
• Derive new implicit knowledge
Try to Visualise It
[Figure: RDF graph of example resources. Nodes: urn:biogrid:Interaction, urn:intact:Interaction, urn:uniprot:Protein, urn:biogrid:15904, urn:intact:1007, urn:pubmed:15904, urn:biogrid:FBgn0068575, urn:uniprot:FBgn0068575, urn:biogrid:FBgn00134235, urn:uniprot:FBgn00134235, urn:uniprot:Q709356, urn:uniprot:P104172. Edges: rdf:type, sameAs, rdf:seeAlso, hasParticipant, interactsWith.]
• Resolve the syntactic differences in the identifiers
• Use relationships to derive new implicit knowledge
These are only example resource names.
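The two steps named on this slide can be sketched in a few lines of plain Python. This is not how OWLIM does it. The resource names echo the slide, but which edge connects which pair of nodes is an assumption, as is the rule that two participants of the same interaction interact with each other.

```python
from itertools import permutations

# Hypothetical triples: node names follow the slide, the edges are assumed.
TRIPLES = {
    ("urn:biogrid:15904", "rdf:type", "urn:biogrid:Interaction"),
    ("urn:biogrid:15904", "hasParticipant", "urn:biogrid:FBgn0068575"),
    ("urn:biogrid:15904", "hasParticipant", "urn:biogrid:FBgn00134235"),
    ("urn:intact:1007", "rdf:type", "urn:intact:Interaction"),
    ("urn:intact:1007", "hasParticipant", "urn:uniprot:FBgn0068575"),
    ("urn:intact:1007", "hasParticipant", "urn:uniprot:Q709356"),
    ("urn:biogrid:FBgn0068575", "sameAs", "urn:uniprot:FBgn0068575"),
    ("urn:biogrid:FBgn00134235", "sameAs", "urn:uniprot:FBgn00134235"),
}


def canonical_ids(triples):
    """Return a function mapping each resource to one name per sameAs group."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for s, p, o in triples:
        if p == "sameAs":
            root_s, root_o = find(s), find(o)
            # The lexicographically smallest name represents the group.
            parent[max(root_s, root_o)] = min(root_s, root_o)
    return find


def collide(triples):
    """Merge sameAs-equivalent identifiers, then derive interactsWith edges."""
    find = canonical_ids(triples)
    merged = {(find(s), p, find(o)) for s, p, o in triples if p != "sameAs"}

    participants = {}
    for s, p, o in merged:
        if p == "hasParticipant":
            participants.setdefault(s, set()).add(o)

    # Assumed rule: participants of one interaction interact with each other.
    inferred = {
        (a, "interactsWith", b)
        for members in participants.values()
        for a, b in permutations(sorted(members), 2)
    }
    return merged | inferred


for triple in sorted(collide(TRIPLES)):
    print(triple)
```

After merging, the BioGRID identifier for FBgn0068575 also carries the interaction that was only stated against its UniProt twin, which is the kind of implicit knowledge the slide refers to.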
Linked Life Data Overview
• Platform to automate the process:
  • Infrastructure to store the data and perform inference
  • Transform the structured data sources to RDF
  • Provide a web interface to access the data
• Currently operates over the OWLIM semantic repository
• LinkedLifeData - PIKB statistics:
  • Number of statements: 1,159,857,602
  • Number of explicit statements: 403,361,589
  • Number of entities: 128,948,564
• Publicly available at: http://www.linkedlifedata.com
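The statistics give a rough sense of how much the reasoning step adds. The arithmetic below assumes that every statement not counted as explicit was produced by inference, which the slide does not state outright.

```python
TOTAL_STATEMENTS = 1_159_857_602
EXPLICIT_STATEMENTS = 403_361_589

# Assumption: total = explicit + inferred.
inferred = TOTAL_STATEMENTS - EXPLICIT_STATEMENTS
expansion = TOTAL_STATEMENTS / EXPLICIT_STATEMENTS

print(f"{inferred:,} inferred statements")
print(f"{expansion:.2f} total statements per explicit statement")
```

Under that assumption, well over half of the repository content is derived rather than loaded.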
LifeSKIM Application
• A platform offering software infrastructure for:
  • automatic semantic annotation of text
  • ontology population
• Store the extracted facts and reason on top of them
• Semantic indexing and retrieval of content
• Query and navigation involving structured knowledge
• Based on Information Extraction (i.e. text-mining) technology
How Does LifeSKIM Search Better?
• LifeSKIM can match a query:
  Documents about interleukin 6 (interferon, beta 2) where it is connected to apoptosis of neutrophils.
• With a document containing:
  … the same effect was not observed for IFNB2, IL-8 and TNF-alpha …
  … is induced neutrophil programmed cell death by apoptosis …
How Does LifeSKIM Search Better?
Classical IR could not match:
• interleukin 6 with HGF; HSF; BSF2; IL-6; IFNB2
  Interleukin 6 is an entity in Entrez-Gene with GeneID: 3569, and HGF; HSF; BSF2; IL-6; IFNB2 are aliases for the same gene entity.
• apoptosis of neutrophils with neutrophil apoptosis; programmed cell death of neutrophils by apoptosis; programmed cell death, neutrophils; neutrophil programmed cell death by apoptosis
  The GeneOntology thesaurus lists the above terms as synonyms of the term apoptosis of neutrophils.
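The matching described on these two slides can be approximated by annotating both the query and the document with entities before comparing them. This is a rough sketch, not LifeSKIM's actual pipeline: the `LEXICON` keys, the `annotate` and `matches` helpers, and the whole-word regular-expression lookup are all assumptions; only the aliases and the GeneID come from the slides.

```python
import re

# Entity -> surface forms. The labels come from the slides; the keys are made up.
LEXICON = {
    "EntrezGene:3569": ["interleukin 6", "HGF", "HSF", "BSF2", "IL-6", "IFNB2"],
    "apoptosis of neutrophils": [
        "apoptosis of neutrophils",
        "neutrophil apoptosis",
        "programmed cell death of neutrophils by apoptosis",
        "programmed cell death, neutrophils",
        "neutrophil programmed cell death by apoptosis",
    ],
}


def annotate(text):
    """Return the set of entities whose label occurs in the text as a whole word."""
    lowered = text.lower()
    found = set()
    for entity, labels in LEXICON.items():
        for label in labels:
            pattern = r"(?<!\w)" + re.escape(label.lower()) + r"(?!\w)"
            if re.search(pattern, lowered):
                found.add(entity)
                break
    return found


def matches(query, document):
    """True when every entity recognised in the query also occurs in the document."""
    query_entities = annotate(query)
    return bool(query_entities) and query_entities <= annotate(document)


query = "interleukin 6 connected to apoptosis of neutrophils"
document = (
    "the same effect was not observed for IFNB2, IL-8 and TNF-alpha ... "
    "is induced neutrophil programmed cell death by apoptosis"
)

print("keyword match:", "interleukin 6" in document.lower())
print("entity match:", matches(query, document))
```

The keyword test fails because the document never spells out "interleukin 6", while the entity-level comparison succeeds through the alias IFNB2 and the GeneOntology synonym.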
Thanks
AstraZeneca
• Bosse Andersson
• Elisabet Söderhielm
• Kaushal Desai
Ontotext
• Deyan Peychev
• Georgi Georgiev
• OWLIM team
• KIM team
The development of PIKB and Linked Life Data is partially funded by FP7 215535 LarKC