Gene Analytics: Discovery and Contextualization of Enriched Gene Groups N. Lavrač, I. Mozetič, V. Podpečan, P. Kralj Novak (Jožef Stefan Institute, Ljubljana) H. Motaln, M. Petek, K. Gruden (National Institute of Biology, Ljubljana)
Talk outline
• Relational data mining and subgroup discovery
• Semantic data mining: Using ontologies in SEGS
• BISON challenge
• Experimental use case: Glioma cancer treatment
• BISON methodology: combining SEGS+Biomine
• Gene analytics services and future work
BISON Bled WK, Aug. 2009
Data Mining: knowledge discovery from data
[Diagram: data → Data Mining → model, patterns, …]
• Given: transaction data table, relational database, text documents, Web pages
• Find: a classification model, a set of interesting patterns
Subgroup discovery task definition (Kloesgen, Wrobel 1997)
• Given: a population of individuals and a property of interest (e.g. the AML class, in the task of finding genes differentially expressed in AML leukemia as opposed to ALL leukemia)
• Find: the 'most interesting' descriptions of population subgroups, i.e. subgroups that
  • are as large as possible (high target class coverage)
  • have the most unusual distribution of the target property (high TP/FP ratio, high significance)
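The two criteria above can be made concrete with a small sketch. It assumes a subgroup description is represented simply by the set of individuals it covers; the function name and the toy identifiers are illustrative, not part of any tool named in the talk.

```python
def subgroup_quality(covered, target):
    """Score one subgroup description.

    covered: set of individuals matched by the description (assumed representation)
    target:  set of individuals that have the property of interest
    """
    tp = len(covered & target)   # covered individuals with the target property
    fp = len(covered - target)   # covered individuals without it
    coverage = tp / len(target) if target else 0.0     # target class coverage
    precision = tp / (tp + fp) if covered else 0.0     # share of covered that are TP
    return {"tp": tp, "fp": fp, "coverage": coverage, "precision": precision}


# Toy example: 4 target individuals, a description covering 3 of them plus 1 other.
quality = subgroup_quality(covered={1, 2, 3, 9}, target={1, 2, 3, 4})
print(quality)
```

A large subgroup with a skewed TP/FP balance scores well on both numbers; how the two are traded off against each other is left open here.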
Sample microarray analysis tasks
• Two-class diagnosis problem of distinguishing between acute lymphoblastic leukemia (ALL, 27 samples) and acute myeloid leukemia (AML, 11 samples), with 34 samples in the test set. Every sample is described with gene expression values for 7129 genes.
• Multi-class cancer diagnosis problem with 14 different cancer types, in total 144 samples in the training set and 54 samples in the test set. Every sample is described with gene expression values for 16063 genes.
• Subgroup discovery (SD) results in simple IF-THEN rules, interpretable by biologists:
  IF (KIAA0128_gene DIFF-EXPRESSED) AND (prostaglandin_d2_synthase_gene NOT-DIFF-EXP) THEN Leukemia
Relational Data Mining (Inductive Logic Programming): knowledge discovery from data
[Diagram: data → Relational Data Mining → model, patterns, …]
• Given: a relational database, a set of tables, sets of logical facts, a graph, …
• Find: a classification model, a set of interesting patterns
Relational Data Mining (ILP)
• Learning from multiple tables
• Complex relational problems with structured data: representation of molecules and their properties in protein engineering, biochemistry, ...
• Semantic relational data mining: using domain ontologies as background knowledge for relational data mining
Gene Ontology (GO)
• GO is a database of terms for genes:
  • Function - What does the gene product do?
  • Process - Why does it perform these activities?
  • Component - Where does it act?
• Known genes are annotated to GO terms (www.ncbi.nlm.nih.gov)
• Terms are connected as a directed acyclic graph (is_a, part_of)
• Levels represent specificity of the terms
• Term counts: 12093 biological process, 1812 cellular component, 7459 molecular function
Ontology encoded as relational background knowledge
Prolog facts:
predicate(geneID, CONSTANT).
interaction(geneID, geneID).
component(2532,'GO:0016020').
component(2532,'GO:0005886').
component(2534,'GO:0008372').
function(2534,'GO:0030554').
function(2534,'GO:0005524').
process(2534,'GO:0007243').
interaction(2534,5155).
interaction(2534,4803).
Basic, plus generalized background knowledge using GO:
zinc ion binding -> metal ion binding, ion binding, binding
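The generalization step (a gene annotated with a specific term also counts as annotated with every more general term) can be sketched as follows. The `parents` map is a toy hierarchy copied from the zinc ion binding example on the slide, and `gene_x` is a hypothetical gene; real GO has more intermediate terms and also part_of edges.

```python
def ancestors(term, parents):
    """Return all terms reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def generalize(annotations, parents):
    """Extend each gene's annotations with the ancestors of its terms."""
    generalized = {}
    for gene, terms in annotations.items():
        closed = set(terms)
        for term in terms:
            closed |= ancestors(term, parents)
        generalized[gene] = closed
    return generalized


# Toy hierarchy taken from the slide's example (assumed, simplified).
parents = {
    "zinc ion binding": ["metal ion binding"],
    "metal ion binding": ["ion binding"],
    "ion binding": ["binding"],
}
print(generalize({"gene_x": {"zinc ion binding"}}, parents))
```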
Multi-Relational representation
[Diagram: GENE (main table, class labels) linked to GENE-GENE INTERACTION, GENE-FUNCTION, GENE-PROCESS and GENE-COMPONENT; FUNCTION, PROCESS and COMPONENT each carry is_a and part_of relations]
Propositionalization in RDM
Propositionalization through first-order feature construction (KARDIO 1994, LINUS 1991, RSD 2006)
Novelty of SEGS (2008): feature construction from ontology information, features with support > min_support
f(7,A):-function(A,'GO:0046872').
f(8,A):-function(A,'GO:0004871').
f(11,A):-process(A,'GO:0007165').
f(14,A):-process(A,'GO:0044267').
f(15,A):-process(A,'GO:0050874').
f(20,A):-function(A,'GO:0004871'),process(A,'GO:0050874').
f(26,A):-component(A,'GO:0016021').
f(29,A):-function(A,'GO:0046872'),component(A,'GO:0016020').
f(122,A):-interaction(A,B),function(B,'GO:0004872').
f(223,A):-interaction(A,B),function(B,'GO:0004871'),process(B,'GO:0009613').
f(224,A):-interaction(A,B),function(B,'GO:0016787'),component(B,'GO:0043231').
(The interaction features, which introduce a second gene B, are existential.)
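A minimal sketch of support-filtered feature construction, assuming each gene is described by a set of annotation strings. The gene IDs 1 to 4 and their annotations are made up for illustration; only the GO identifiers are taken from the slide. It builds single terms and two-term conjunctions and does not cover the existential interaction features.

```python
from itertools import combinations


def build_features(annotations, min_support, max_len=2):
    """Enumerate conjunctions of terms covering more than min_support genes.

    annotations: {gene: set of annotation strings} (assumed representation)
    Returns {tuple_of_terms: frozenset of covered genes}.
    """
    terms = sorted(set().union(*annotations.values()))
    features = {}
    for length in range(1, max_len + 1):
        for combo in combinations(terms, length):
            covered = frozenset(
                gene for gene, gene_terms in annotations.items()
                if set(combo) <= gene_terms
            )
            if len(covered) > min_support:   # keep features with support > min_support
                features[combo] = covered
    return features


# Hypothetical genes 1-4 with annotations in the style of the slide's features.
annotations = {
    1: {"function:GO:0004871", "process:GO:0050874"},
    2: {"function:GO:0004871", "process:GO:0050874"},
    3: {"function:GO:0004871"},
    4: {"function:GO:0046872"},
}
for feature, genes in build_features(annotations, min_support=1).items():
    print(" AND ".join(feature), sorted(genes))
```

The brute-force enumeration is only workable for toy inputs; an actual system would need to prune the search over the ontology.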
Gene set enrichment analysis with SEGS
• A gene set is enriched if the genes that are members of that gene set are statistically significantly differentially expressed compared to the rest of the genes.
• New gene set enrichment method: SEGS - Searching for Enriched Gene Sets (JSI) (Trajkovski et al. JBI 2008)
• SEGS approach: using GO, KEGG and ENTREZ ontologies as background knowledge for semantic subgroup discovery
Ontologies
• Gene Ontology (GO): standardized biological terms used to annotate gene products
  • Molecular Function
  • Biological Process
  • Cellular Component
• Kyoto Encyclopedia of Genes and Genomes (KEGG): manually drawn pathway maps representing the knowledge on molecular interaction and reaction networks
• ENTREZ: gene annotations with GO and KO terms, and gene-gene interaction data
Identifying differentially expressed genes in data preprocessing
[Figure: gene expression matrix indexed by Gene i and Sample j]
To identify genes that display a large difference in gene expression between groups (class A and class B) and are homogeneous within groups, statistical tests (e.g. t-test) and p-values (e.g. from a permutation test) are computed. The two-sample t-statistic is used to test the equality of the group means mA and mB.
Ranking of differentially expressed genes
The genes can be ordered in a ranked list L, according to their differential expression between the classes. The challenge is to extract meaning from this list, i.e. to describe the genes in it. The terms of the Gene Ontology were used as a vocabulary for the description of the genes.
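As an illustration of how such a ranked list L might be produced, the sketch below scores each gene with a two-sample t-statistic and sorts by absolute value. The unequal-variance (Welch) form is an assumption, since the slide does not say which variant was used, and the gene names and expression values are invented.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Two-sample t-statistic (Welch form, assumed) for testing mA == mB.

    Needs at least two values per group and non-zero pooled spread.
    """
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def rank_genes(expr_a, expr_b):
    """Order genes by |t|, most differentially expressed first.

    expr_a, expr_b: {gene: list of expression values} for class A and class B.
    """
    scores = {gene: welch_t(expr_a[gene], expr_b[gene]) for gene in expr_a}
    return sorted(scores, key=lambda gene: abs(scores[gene]), reverse=True)


# Invented expression values for two hypothetical genes.
class_a = {"g1": [1.0, 2.0, 3.0], "g2": [1.0, 2.0, 3.0]}
class_b = {"g1": [4.0, 5.0, 6.0], "g2": [1.0, 2.0, 4.0]}
print(rank_genes(class_a, class_b))
```

P-values, for example from a permutation test as mentioned on the previous slide, would be computed on top of these statistics and are omitted here.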
Gene expression data: Positive and negative examples for data mining
fact(class, geneID, weight).
fact('diffexp',64499, 5.434).
fact('diffexp',2534, 4.423).
fact('diffexp',5199, 4.234).
fact('diffexp',1052, 2.990).
fact('diffexp',6036, 2.500).
…
fact('random',7443, 1.0).
fact('random',9221, 1.0).
fact('random',23395, 1.0).
fact('random',9657, 1.0).
fact('random',19679, 1.0).
…
Ontology encoded as relational background knowledge + gene expression data
fact(class, geneID, weight).
fact('diffexp',64499, 5.434).
fact('diffexp',2534, 4.423).
fact('diffexp',5199, 4.234).
fact('diffexp',1052, 2.990).
fact('diffexp',6036, 2.500).
…
fact('random',7443, 1.0).
fact('random',9221, 1.0).
fact('random',23395, 1.0).
fact('random',9657, 1.0).
fact('random',19679, 1.0).
…
Prolog facts:
predicate(geneID, CONSTANT).
interaction(geneID, geneID).
component(2532,'GO:0016020').
component(2532,'GO:0005886').
component(2534,'GO:0008372').
function(2534,'GO:0030554').
function(2534,'GO:0005524').
process(2534,'GO:0007243').
interaction(2534,5155).
interaction(2534,4803).
Basic, plus generalized background knowledge using GO:
zinc ion binding -> metal ion binding, ion binding, binding
Ontology encoded as relational features + gene expression data
fact(class, geneID, weight).
fact('diffexp',64499, 5.434).
fact('diffexp',2534, 4.423).
fact('diffexp',5199, 4.234).
fact('diffexp',1052, 2.990).
fact('diffexp',6036, 2.500).
…
fact('random',7443, 1.0).
fact('random',9221, 1.0).
fact('random',23395, 1.0).
fact('random',9657, 1.0).
fact('random',19679, 1.0).
…
f(7,A):-function(A,'GO:0046872').
f(8,A):-function(A,'GO:0004871').
f(11,A):-process(A,'GO:0007165').
f(14,A):-process(A,'GO:0044267').
f(15,A):-process(A,'GO:0050874').
f(20,A):-function(A,'GO:0004871'),process(A,'GO:0050874').
f(26,A):-component(A,'GO:0016021').
f(29,A):-function(A,'GO:0046872'),component(A,'GO:0016020').
f(122,A):-interaction(A,B),function(B,'GO:0004872').
f(223,A):-interaction(A,B),function(B,'GO:0004871'),process(B,'GO:0009613').
f(224,A):-interaction(A,B),function(B,'GO:0016787'),component(B,'GO:0043231').
Propositionalization
Propositional subgroup discovery
f2 and f3 [4,0]
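Reading the annotation [4,0] as "4 positive and 0 negative examples covered", a search over feature conjunctions in a propositional table could look like the sketch below. The table, the example names and the ranking score (positives minus negatives) are all assumptions made for illustration; they are not the data or the quality measure behind the slide.

```python
from itertools import combinations


def best_conjunctions(table, labels, max_len=2):
    """Score every conjunction of up to max_len features.

    table:  {example: set of features that are true for it}
    labels: {example: True for positive, False for negative}
    Returns a list of (features, positives_covered, negatives_covered),
    best first under the assumed score positives - negatives.
    """
    features = sorted(set().union(*table.values()))
    results = []
    for length in range(1, max_len + 1):
        for combo in combinations(features, length):
            covered = [ex for ex, true_feats in table.items() if set(combo) <= true_feats]
            pos = sum(1 for ex in covered if labels[ex])
            neg = len(covered) - pos
            results.append((combo, pos, neg))
    results.sort(key=lambda r: (r[1] - r[2], r[1]), reverse=True)
    return results


# Invented table: five positive (p*) and three negative (n*) examples.
table = {
    "p1": {"f2", "f3"}, "p2": {"f2", "f3"}, "p3": {"f2", "f3"}, "p4": {"f2", "f3"},
    "p5": {"f1"}, "n1": {"f2"}, "n2": {"f3"}, "n3": {"f1"},
}
labels = {ex: ex.startswith("p") for ex in table}
features, pos, neg = best_conjunctions(table, labels)[0]
print(" and ".join(features), [pos, neg])
```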
Summary: SEGS Method and Results
• SEGS method:
  • Through semantic subgroup discovery, SEGS generates candidate gene set descriptions as conjunctions of first-order features, combining individual GO, KEGG and ENTREZ terms
  • SEGS combines Fisher, GSEA and PAGE enrichment tests to select the most interesting groups of differentially expressed genes
• SEGS results:
  • Descriptions of subgroups of genes that are differentially expressed (e.g., belong to class DIFF-EXP of the top 300 most differentially expressed genes) in contrast with RANDOM genes (randomly selected genes with low differential expression).
  • Sample subgroup description: diffexp(A) :- interaction(A,B) & function(B,'GO:0004871') & process(B,'GO:0009613')
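Of the three enrichment tests named above, Fisher's exact test is the easiest to sketch. The version below computes the one-sided hypergeometric tail with the standard library only; the parameter names and the example counts are illustrative, and how SEGS combines this score with GSEA and PAGE is not shown.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher exact p-value for over-representation.

    N: total number of genes
    K: number of differentially expressed genes among them
    n: size of the candidate gene set
    k: number of differentially expressed genes inside the gene set
    Returns P(X >= k) for X ~ Hypergeometric(N, K, n).
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


# Illustrative counts: 15 of the 20 genes in a set are among 300 DIFF-EXP genes out of 1000.
print(fisher_enrichment_p(k=15, n=20, K=300, N=1000))
```

A small p-value means the gene set contains more differentially expressed genes than expected by chance; with many candidate sets, a multiple-testing correction would also be needed.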
SEGS implementation
[Screenshots: Query; Results]
BISON project
• The challenge: support humans in finding new, interesting links across domains, named bisociations
  • across different contexts
  • across different types of data and knowledge sources
• Open problems:
  • Fusion of heterogeneous data/knowledge sources into a joint representation format: a large information network named BisoNet (consisting of nodes and relationships between nodes)
  • Finding unexpected, previously unknown links between BisoNet nodes belonging to different contexts
Heterogeneous data sources (BISON, M. Berthold, 2008)
Bridging concepts (BISON, M. Berthold, 2008)
Use Case: Glioma Cancer (investigated at NIB)
• Glioma: a type of brain cancer, with different types
• Glioblastoma: life expectancy less than 1 year
• Glioma treatment: no efficient treatment available
• Testing new hypotheses for treatment: using stem cells for drug transport to the brain?
• New insights into stem cell behavior and brain cancer mechanisms?
Glioma treatment
• Biological questions:
  • Are stem cells efficient/effective for drug transport?
  • What are the associated risks?
• Regarding risks: evaluation of BM-hMSC stem cell stability
• Biological experiments: 4 stem cell lines; RNA was isolated and sent for transcriptome analysis
• hMSC growth curves reveal "fast" and "slow" growing clones
  • Slow: hMSC-1, hMSC-3 (5-6 passages/6 weeks)
  • Fast: hMSC-2, hMSC-4 (8 passages/6 weeks)
• Risk of malignant transformation: two lines (hMSC-1 and hMSC-2) transformed into cancer cells
• Microarray analysis is performed to find groups of differentially expressed genes in several experiments:
  • slow vs. fast growing cell lines
  • normal vs. cancerous cell lines
SEGS+Biomine Methodology
[Diagram labels: Microarray; Gene sets; Contextualization, Exploratory link discovery; e.g. slow-vs-fast cell growth]
Biomine
• In Biomine (UH) (DILS 2006), data from numerous public databases are merged into a large graph, currently consisting of 1,968,951 vertices and 7,008,607 edges.
• Vertices correspond to entities and concepts.
• Edges represent known, annotated relationships between vertices.
• A link (a relation between two entities) is manifested as a path or a subgraph connecting the corresponding vertices.
• A bisociative link is a path traversing nodes belonging to different domains/contexts.
• In Biomine, a method for link discovery between the entities in a query was developed for graph exploration.
Biomine Information fusion
• Biomine graph integrates numerous databases
SEGS+Biomine Methodology
• Biomine information fusion into a BisoNet information network
• Interesting node discovery and contextualisation with SEGS
  • Information fusion of GO, KEGG, ENTREZ
  • Identify conjunctions of concepts from different domains (ontologies)
• Interesting cross-context link discovery with Biomine
  • create bisociative links as paths in the Biomine subgraph connecting the concepts proposed by SEGS
• BisoNet Exploration/Explanation
  • explore BisoNet paths ranked according to weights/probabilities (as currently implemented in Biomine)
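One simple way to rank paths by probability, assuming every edge carries an independent probability-like weight, is to look for the path that maximizes the product of edge weights. The sketch below does this with Dijkstra's algorithm over negative log-probabilities and then checks whether the path crosses contexts. It is not Biomine's actual algorithm; the graph, node names, weights and context labels are invented.

```python
import heapq
from math import exp, log


def best_path(graph, source, target):
    """Most probable path under an assumed independent-edge model.

    graph: {node: {neighbor: probability in (0, 1]}}
    Returns (path_probability, list_of_nodes), or (0.0, []) if unreachable.
    """
    heap = [(0.0, source, [source])]   # (sum of -log p, node, path so far)
    done = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == target:
            return exp(-cost), path
        if node in done:
            continue
        done.add(node)
        for neighbor, p in graph.get(node, {}).items():
            if neighbor not in done and p > 0:
                heapq.heappush(heap, (cost - log(p), neighbor, path + [neighbor]))
    return 0.0, []


def is_bisociative(path, context):
    """A path is bisociative here if its nodes come from more than one context."""
    return len({context[node] for node in path}) > 1


# Invented toy network: two genes linked directly and via a shared GO term.
graph = {
    "geneA": {"GO_term": 0.9, "geneB": 0.5},
    "GO_term": {"geneB": 0.8},
}
context = {"geneA": "ENTREZ", "geneB": "ENTREZ", "GO_term": "GO"}
probability, path = best_path(graph, "geneA", "geneB")
print(path, round(probability, 2), is_bisociative(path, context))
```

In this toy network the indirect path through the GO term (0.9 × 0.8 = 0.72) outranks the direct edge (0.5), and it is also the one that crosses contexts.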
Biomine: Bisociative link discovery
[Screenshots: Query; Result]
SEGS+Biomine: Information fusion
SEGS merges GO, KEGG and ENTREZ; BisoNet is used for concept visualization
SEGS+Biomine: Creative knowledge discovery
Identify interesting concepts (BisoNet nodes) from different contexts (different databases)
SEGS+Biomine: Creative link discovery
Create bisociative cross-context links/paths linking BisoNet concepts from different contexts
SEGS+Biomine: Exploration and explanation
Explore and interpret the most interesting cross-context BisoNet links/paths between concepts
Summary
• SEGS discovers interesting descriptions of differentially expressed gene groups as conjunctions of concepts from different contexts
• Biomine finds cross-context links (paths) between concepts discovered by SEGS
• The SEGS+Biomine approach has the potential for creative knowledge discovery and bisociative link discovery
• Preliminary results in stem cell microarray data analysis (EMBC 2009) indicate that the SEGS+Biomine methodology may lead to new insights; in vitro experiments are being planned at NIB to verify and validate the preliminary insights