Mining the Genome Filip Železný ČVUT FEL, Prague Dept. of Cybernetics Gerstner Laboratory
Intro • Research at ČVUT FEL Dept. of Cybernetics • Nature Inspired Technologies • machine learning • evolutionary computation • Agent Computing • Robotics • Computer Vision • EU Projects (6 FP) • 14 running in 2005, 9 new starting 2006
Machine Learning & Data Mining • Supervised learning • given examples and their class labels • find a model for predicting class labels of new examples • also: “concept learning”, “predictive classification”, ... • Example • Given: • Discover: size=small & luxury=low → affordable
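A minimal sketch of how a rule like the one on this slide could be found from labeled examples. The slide's original example table is not available, so the `EXAMPLES` data, the function names, and the brute-force search strategy are all assumptions made for illustration; real rule learners use heuristic search rather than exhaustive enumeration.

```python
from itertools import combinations

# Hypothetical toy data (attribute values, "affordable" label).
EXAMPLES = [
    ({"size": "small", "luxury": "low"}, True),
    ({"size": "small", "luxury": "high"}, False),
    ({"size": "large", "luxury": "low"}, False),
    ({"size": "large", "luxury": "high"}, False),
    ({"size": "small", "luxury": "low"}, True),
]


def covers(rule, example):
    """A rule is a tuple of (attribute, value) conditions joined by AND."""
    return all(example.get(attr) == val for attr, val in rule)


def learn_rule(examples, max_conditions=2):
    """Return the conjunction that covers no negative example and the most positives."""
    conditions = sorted({(a, v) for ex, _ in examples for a, v in ex.items()})
    best, best_cover = None, 0
    for k in range(1, max_conditions + 1):
        for rule in combinations(conditions, k):
            # Skip rules that test the same attribute twice.
            if len({attr for attr, _ in rule}) < k:
                continue
            pos = sum(1 for ex, label in examples if label and covers(rule, ex))
            neg = sum(1 for ex, label in examples if not label and covers(rule, ex))
            if neg == 0 and pos > best_cover:
                best, best_cover = rule, pos
    return best


rule = learn_rule(EXAMPLES)
print(" & ".join(f"{a}={v}" for a, v in rule), "-> affordable")
```

On this toy data no single condition separates the classes, so the search settles on the two-condition conjunction of small size and low luxury.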
Machine Learning • Plethora of paradigms • Decision trees (“symbolic”) • Artificial neural networks (“subsymbolic”) • Support vector machines (“statistical”) • Learning = optimization in structure / parameter space • Learning = search • AI techniques employed (gradient descent, heuristic search)
Relational Learning • What if examples have a structure? • Not an attribute tuple! • Description spread across multiple tables of a relational database
Relational Learning • Relational learning • Representing data and rules in relational logic (Prolog) • Exploits background knowledge (e.g. “charge”) • Inductive Logic Programming • carcinogenic(Compound) IF has_atom(Compound, Atom1) & type(Atom1, carbon) & charge(Atom1, Charge) & Charge > 0.0133 & has_atom(Compound, Atom2) & double_bond(Atom1, Atom2)
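To make the relational rule concrete, here is a small sketch that checks it against a toy database. The tables, compound and atom identifiers, and charge values are hypothetical; only the rule body and the 0.0133 threshold come from the slide. An ILP system would evaluate such a rule in Prolog, so this Python version is just one way to spell out the same logic.

```python
# Hypothetical relational tables (one Python object per database relation).
HAS_ATOM = {("c1", "a1"), ("c1", "a2"), ("c2", "a3"), ("c2", "a4")}
ATOM_TYPE = {"a1": "carbon", "a2": "oxygen", "a3": "carbon", "a4": "carbon"}
CHARGE = {"a1": 0.02, "a2": -0.3, "a3": 0.001, "a4": 0.005}
# Stored as ordered pairs here; a real database might store bonds symmetrically.
DOUBLE_BOND = {("a1", "a2")}


def atoms_of(compound):
    """All atoms related to the compound through has_atom."""
    return sorted(atom for comp, atom in HAS_ATOM if comp == compound)


def carcinogenic(compound, threshold=0.0133):
    """True if some carbon atom with charge above the threshold is
    double-bonded to another atom of the same compound."""
    for atom1 in atoms_of(compound):
        if ATOM_TYPE.get(atom1) != "carbon":
            continue
        if CHARGE.get(atom1, 0.0) <= threshold:
            continue
        for atom2 in atoms_of(compound):
            if (atom1, atom2) in DOUBLE_BOND:
                return True
    return False


for compound in ("c1", "c2"):
    print(compound, carcinogenic(compound))
```

The nested loops mirror the existentially quantified variables of the rule: the compound is classified as soon as one pair of atoms satisfies every literal.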
Applications of Interest • Intersection of 3 hot fields • BIOtechnologies (genomics) • INFORMATION technologies (machine learning) • NANOtechnologies (microarray chips)
Background: GENETICS How does a cell know what to do?
Chromosomes • Chromosomes get copied during mitosis • They carry the assembly instructions? How? • Chromosomes = proteins + DNA • Where is the information?
DNA • 1953: Jim Watson & Francis Crick discover the DNA structure • That is where the information is • 4-symbol alphabet: guanine, adenine, cytosine, thymine • Double-helix pairing: C-G, A-T
The CENTRAL DOGMA of Molecular Biology • Gene = DNA subsequence • Genes code for proteins • Gene expression • DNA piece transcribes to RNA • RNA translates into a protein • Proteins “do the job” • - enzymes • - building blocks • - ...
Protein Coding (figure: DNA strand → codon (3 bases) → amino acid → protein)
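The coding scheme can be illustrated with a short sketch: transcribe a DNA coding strand to mRNA, then read it three bases at a time. The codon table below is a small excerpt of the standard genetic code, and the input sequence and function names are made up for the example.

```python
# Excerpt of the standard genetic code (mRNA codon -> amino acid).
# None marks a stop codon. A full table has 64 entries.
CODON_TABLE = {
    "AUG": "Met", "UUU": "Phe", "GGC": "Gly", "AAA": "Lys",
    "GCU": "Ala", "UGG": "Trp",
    "UAA": None, "UAG": None, "UGA": None,
}


def transcribe(coding_strand):
    """mRNA has the sequence of the DNA coding strand with T replaced by U."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Read codons (3 bases each) from the start until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon not in CODON_TABLE:
            raise KeyError(f"codon {codon} is not in this excerpt of the table")
        amino_acid = CODON_TABLE[codon]
        if amino_acid is None:  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return protein


mrna = transcribe("ATGTTTGGCAAATAA")  # hypothetical toy gene
protein = translate(mrna)
print(mrna, "-".join(protein))
```

The sketch ignores everything that makes real translation harder, such as locating the reading frame or splicing.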
Protein structures “resolution”
Secondary structure prediction • Two common secondary structures: β-sheet, α-helix • Primary structure determines secondary structure • Computational problem: given primary structure, predict if β-sheet or α-helix • NOBODY CAN DO THAT!
Secondary structure prediction • Secondary structure prediction with ILP [Muggleton 1992] • Using ILP, obtained rules such as alpha0(A,B) ... position(A,D,O) & not_aromatic(O) & small_or_polar(O) & position(A,B,C) & very_hydrophobic(C) & not_aromatic(C) ... etc. (22 literals) • Note the incorporation of background knowledge • Accuracy 81%, best at the time • Published in the journal Protein Engineering
The Genome project • 1993 – 2003 • All human genes sequenced • Celera vs. NIH race • Challenge NOW: annotate the genes • discover functions • interactions • dynamic pathways
Genomics research • Traditional functional genomics research • Hypothesis-driven • e.g. a gene is suspected to be responsible for ... • then tracing its expression in relevant tissues • “First hypothesize, then measure” • (diagram: Human intuition → Hypotheses → Verification (targeted assay))
Gene Expression Microarrays • Microarray chip: • Measures expression of tens of thousands of genes simultaneously: “high-throughput” • pioneering technology (mid to late 1990s) • A grid carrying synthesized DNA probes • Breakthrough in genomics research?
Genomics Research • High-throughput approach to functional genomics? • Data-driven, unbiased, “First measure, then hypothesize” • Might reveal never-thought-of relationships • Expression of almost the entire genome (tens of thousands of genes) • (diagram: Microarray data → Human analysis → Hypotheses: IMPOSSIBLE, TOO MUCH DATA)
Genomics Research through Machine Learning • AI-based high-throughput functional genomics? • High-throughput screening + high-performance computing • (diagram: Microarray data → Machine Learning → Hypotheses → Interpretation)
Genomics Research with AI • This concept has recently been proven to work • Golub et al., Science 286:531-537, 1999 • leukemia classification model (AML vs. ALL) • voting of informative attributes (genes) • Discovery of new classes (clustering) • Ramaswamy et al., PNAS 98:15149-54, 2001 • Tumor classification • 14 classes of cancer • used Support Vector Machines
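The "voting of informative genes" idea can be sketched roughly as follows. This is an illustration in the spirit of a signal-to-noise weighted vote, not a reproduction of the published procedure: the expression values are hypothetical, every gene votes (no selection of the most informative genes is done), and the exact weighting formula is an assumption.

```python
from statistics import mean, stdev

# Hypothetical toy expression data: one row per sample, one column per gene.
TRAIN = {
    "AML": [[9.0, 2.0, 5.0], [8.0, 3.0, 5.5], [10.0, 2.5, 4.5]],
    "ALL": [[3.0, 8.0, 5.2], [2.0, 9.0, 4.8], [4.0, 7.0, 5.0]],
}


def gene_stats(train, class_a="AML", class_b="ALL"):
    """Per gene: a signal-to-noise weight and the midpoint between class means.
    Assumes every gene varies within each class (non-zero standard deviation)."""
    stats = []
    for g in range(len(train[class_a][0])):
        a = [sample[g] for sample in train[class_a]]
        b = [sample[g] for sample in train[class_b]]
        mu_a, mu_b = mean(a), mean(b)
        weight = (mu_a - mu_b) / (stdev(a) + stdev(b))
        stats.append((weight, (mu_a + mu_b) / 2))
    return stats


def classify(sample, stats, class_a="AML", class_b="ALL"):
    """Each gene votes with weight * (distance from the midpoint);
    a positive total favors class_a, otherwise class_b."""
    total = sum(w * (x - mid) for x, (w, mid) in zip(sample, stats))
    return class_a if total > 0 else class_b


stats = gene_stats(TRAIN)
print(classify([8.5, 3.0, 5.0], stats))
print(classify([2.5, 8.5, 5.0], stats))
```

A gene whose class means coincide, like the third column here, gets a weight near zero and barely influences the vote, which is what makes the scheme favor informative genes.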
Interpretable classifiers • Comprehensibility Pursuit: Rule Based Models • Models interpretable by biologists • Our work • D. Gamberger, N. Lavrač, F. Železný, J. Tolar, J. Biomed. Informatics 37(5):269-284, 2004 • IF gene_20056 EXPRESSED AND gene_23984 NOT_EXPRESSED THEN cancer_class = AML
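A rule of this form is easy to apply mechanically once expression levels are discretized. The sketch below assumes a single fixed cutoff for calling a gene "expressed"; the threshold value, the sample values, and the function names are hypothetical, and only the two gene identifiers and the AML conclusion come from the slide.

```python
# Hypothetical cutoff separating EXPRESSED from NOT_EXPRESSED.
EXPRESSION_THRESHOLD = 100.0


def is_expressed(level, threshold=EXPRESSION_THRESHOLD):
    """Discretize a numeric expression level into a boolean."""
    return level >= threshold


def predict(sample):
    """Apply the rule; return 'AML' when it fires, None when it says nothing."""
    fires = (is_expressed(sample["gene_20056"])
             and not is_expressed(sample["gene_23984"]))
    return "AML" if fires else None


# Hypothetical samples.
patient_1 = {"gene_20056": 250.0, "gene_23984": 12.0}
patient_2 = {"gene_20056": 250.0, "gene_23984": 400.0}
print(predict(patient_1), predict(patient_2))
```

Returning `None` rather than a second class reflects that a single rule describes one subgroup; a full model would combine several such rules.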
Exploiting Background Knowledge • Tons of genomic background knowledge available • Relational learning would allow us to exploit it!
Relational Genomic Data Mining • Our current work • Combining expression & gene annotation data • Rule Based Model
Relational Genomic Data Mining • Example rule algorithmically discovered • expressed_in_all(Gene) IF has_location(Gene, integral_to_membrane) & has_function(Gene, receptor_activity) • Expression of genes coding for proteins located in the “integral to membrane” cell component, whose functions include receptor activity, has a high correlation with the BCR class of acute lymphoblastic leukemia (ALL) and a low correlation with other classes of ALL. • ... open end, no conclusions