Bioinformation Technology: Case Studies in Bioinformatics and Biocomputing with DNA Chips • Byoung-Tak Zhang • Center for Bioinformation Technology (CBIT), Seoul National University • btzhang@bi.snu.ac.kr • http://bi.snu.ac.kr/~btzhang
Outline • Bioinformation Technology • Bioinformatics • DNA Chip Data Analysis: IT for BT • DNA Computing: BT for IT • DNA Computing with DNA Chips • Outlook
Human Genome Project [Fig.] Genome health implications: a new disease encyclopedia, new genetic fingerprints, new diagnostics, new treatments. Goals • Identify the approximately 40,000 genes in human DNA • Determine the sequences of the 3 billion bases that make up human DNA • Store this information in databases • Develop tools for data analysis • Address the ethical, legal and social issues that arise from genome research
Bioinformatics vs. Biocomputing [Fig.] BT and IT linked by bioinformatics and biocomputing.
What is Bioinformatics? • Bio: molecular biology • Informatics: computer science • Bioinformatics: solving problems arising from biology using methodology from computer science • Bioinformatics vs. Computational Biology • Bioinformatik (in German): biology-based computer science as well as bioinformatics (in English)
Molecular Biology: Flow of Information • DNA → RNA → Protein → Function [Fig.] A DNA strand (ACTGG) and a protein chain of amino acids (Leu, Ala, Ser, Arg, Phe, Cys, Lys, Asp).
DNA (Gene) → RNA → Protein • Gene: control statements, TATA, start, stop, termination • Transcription (RNA polymerase): gene to mRNA (5’ UTR, ribosome binding, 3’ UTR) • Translation (ribosome): mRNA to protein
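The two steps on this slide, transcription and translation, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the input is the coding (sense) strand, the standard genetic code applies, and the function names `transcribe` and `translate` as well as the example sequence are hypothetical, not taken from the slides.

```python
# Standard genetic code, built from the compact table string in which
# codons are enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding (sense) DNA strand: T becomes U."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate from the first AUG up to the first stop codon (one-letter codes)."""
    start = mrna.find("AUG")
    if start < 0:
        return ""  # no start codon, nothing to translate
    protein = []
    for pos in range(start, len(mrna) - 2, 3):
        codon = mrna[pos:pos + 3].replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical example gene: untranslated flank, start codon, three codons, stop.
mrna = transcribe("ccATGCTGGCGAGCTAAgg")
protein = translate(mrna)
print(mrna, protein)
```

Here `protein` is the one-letter string for Met-Leu-Ala-Ser. Real genes also involve control regions and untranslated regions, which this sketch ignores.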
Nucleotide and Protein Sequence
DNA (Nucleotide) Sequence: SQ sequence 1344 BP; 291 A; C; 401 G; 278 T; 0 other
aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggctgc tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg ccccccgggc ccgtgcccgc cggagacccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg cggagacccc
gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg ccccccgggc ccgtgcccgc aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg agttaaaact ttcaacaatg gatctcttgg ttccggctgc tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg ccccccgggc ccgtgcccgc cggagacccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg cggagacccc gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg cgcttgtcgg ccgccggggg ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg aacctgcgga ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg agttaaaact ttcaacaatg gatctcttgg ttccggctgc tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg ccccccgggc ccgtgcccgc cggagacccc tgttgcttcg
Protein (Amino Acid) Sequence: CG2B_MARGL Length: 388 April 2, 1997 14:55 Type: P Check: 9613 ..
1 MLNGENVDSR IMGKVATRAS SKGVKSTLGT RGALENISNV ARNNLQAGAK KELVKAKRGM TKSKATSSLQ SVMGLNVEPM EKAKPQSPEP MDMSEINSAL EAFSQNLLEG VEDIDKNDFD NPQLCSEFVN DIYQYMRKLE REFKVRTDYM TIQEITERMR SILIDWLVQV HLRFHLLQET LFLTIQILDR YLEVQPVSKN KLQLVGVTSM LIAAKYEEMY PPEIGDFVYI TDNAYTKAQI RSMECNILRR LDFSLGKPLC IHFLRRNSKA GGVDGQKHTM AKYLMELTLP EYAFVPYDPS EIAAAALCLS SKILEPDMEW GTTLVHYSAY SEDHLMPIVQ KMALVLKNAP TAKFQAVRKK YSSAKFMNVS TISALTSSTV MDLADQMC
Some Facts • 10^14 cells in the human body. • 3 × 10^9 letters in the DNA code in every cell in your body. • DNA differs between humans by 0.2% (1 in 500 bases). • Human DNA is 98% identical to that of chimpanzees. • 97% of DNA in the human genome has no known function.
Topics in Bioinformatics • Sequence analysis: sequence alignment, structure and function prediction, gene finding • Structure analysis: protein structure comparison, protein structure prediction, RNA structure modeling • Expression analysis: gene expression analysis, gene clustering • Pathway analysis: metabolic pathways, regulatory networks
Extension of Bioinformatics Concept • Genomics • Functional genomics • Structural genomics • Proteomics: large scale analysis of the proteins of an organism • Pharmacogenomics: developing new drugs that will target a particular disease • Microarray: DNA chip, protein chip
Applications of Bioinformatics • Drug design • Identification of genetic risk factors • Gene therapy • Genetic modification of food crops and animals • Biological warfare, crime etc. • Personal Medicine? • E-Doctor?
Bioinformatics as Information Technology • Database: GenBank, SWISS-PROT • Information retrieval: biomedical text analysis • Hardware: supercomputing • Algorithm: sequence alignment • Agent: information filtering, monitoring agent • Machine learning: clustering, rule discovery, pattern recognition
Background of Bioinformatics • Biological information infrastructure: biological information management systems, analysis software tools, communication networks for biological research • Massive biological databases: DNA/RNA sequences, protein sequences, genetic map linkage data, biochemical reactions and pathways • These resources need to be integrated to model biological reality and to exploit the biological knowledge that is being gathered.
Areas and Workflow of Bioinformatics [Fig.] Areas: microarray (biochip), structural genomics, functional genomics, proteomics and pharmacogenomics, on top of the infrastructure of bioinformatics; sequence data shown as AGCTAGTTCAGTACA TGGATCCATAAGGTA CTCAGTCATTACTGC AGGTCACTTACGATA TCAGTCGATCACTAG CTGACTTACGAGAGT.
Microarray Analysis: cDNA Microarray [Fig.] Workflow: cDNA clones (probes) • PCR product amplification and purification • Printing (0.1 nl/spot) • Hybridize mRNA target to microarray • Excitation with laser 1 and laser 2, emission, scanning • Overlay images and normalize
The Complete Microarray Bioinformatics Solution • Databases • Cluster analysis • Data management • Statistical analysis • Data mining • Image processing • Automation
DNA Chip Applications • Gene discovery: gene/mutated gene • Growth, behavior, homeostasis … • Disease diagnosis • Cancer classification • Drug discovery: pharmacogenomics • Toxicological research: toxicogenomics
Disease Diagnosis: Cancer Classification with DNA Microarray • cDNA microarray data of 6567 gene expression levels [Khan ’01]. • Filter genes that are correlated with the cancer classification using PCA and ANN learning. • Hierarchical clustering of the DNA chip samples based on the 96 filtered genes. • Disease diagnosis based on DNA chip. [Fig.] Flowchart of the experimental procedure.
Disease Diagnosis: Hierarchical Clustering Based on Gene Expression Levels • Hierarchical clustering of cancer samples by the expression levels of 96 genes. • The relation between gene expression and cancer category. • Four cancer diagnostic categories [Fig.] The dendrogram of four cancer clusters and gene expression levels (row: genes, column: samples).
AI Methods for DNA Chip Data Analysis • Classification and prediction: ANNs, support vector machines, etc. (application: disease diagnosis) • Cluster analysis: hierarchical clustering, probabilistic clustering, etc. (application: functional genomics) • Genetic network analysis: differential models, relevance networks, Bayesian networks, etc. (applications: functional genomics, drug design, etc.)
Cluster Analysis [Fig.] A DNA microarray dataset partitioned into gene clusters 1 to 4.
Methods for Cluster Analysis • Hierarchical clustering [Eisen ’98] • Self-organizing maps [Tamayo ’99] • Bayesian clustering [Barash ’01] • Probabilistic clustering using latent variables [Shin ’00] • Non-negative matrix factorization [Shin ’00] • Generative topographic mapping [Shin ’00]
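As an illustration of the first method in the list, the following is a minimal sketch of agglomerative hierarchical clustering in plain Python. It assumes average linkage and a distance of one minus the Pearson correlation, a common choice for expression profiles; the gene names and values are hypothetical, and the cited papers may use different settings.

```python
from math import sqrt


def pearson_distance(x, y):
    """1 - Pearson correlation between two expression profiles.

    Assumes neither profile is constant (non-zero variance).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def average_linkage(profiles, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    names = list(profiles)
    dist = {(a, b): pearson_distance(profiles[a], profiles[b])
            for a in names for b in names if a != b}
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        return sum(dist[a, b] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical expression profiles (one list of time-step values per gene).
profiles = {
    "gene_a": [1.0, 2.0, 3.0, 4.0],
    "gene_b": [2.0, 4.1, 6.0, 8.2],
    "gene_c": [4.0, 3.0, 2.0, 1.0],
    "gene_d": [8.0, 6.1, 4.0, 2.1],
}
clusters = average_linkage(profiles, n_clusters=2)
print(clusters)
```

The rising profiles (`gene_a`, `gene_b`) end up in one cluster and the falling profiles in the other. Recording the merge order instead of stopping at `n_clusters` would give the dendrogram.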
Clustering of Cell Cycle-regulated Genes in S. cerevisiae (Yeast) • Identify cell cycle-regulated genes by cluster analysis. • 104 genes are already known to be cell cycle-regulated. • Known genes are clustered into 6 clusters. • Cluster the 104 known genes and other genes together. • Genes in the same cluster tend to share similar functional categories. [Fig.] Expression levels of the 104 known genes over the cell cycle (row: time step, column: gene).
Probabilistic Clustering Using Latent Variables • g_i: i-th gene • z_k: k-th cluster • t_j: j-th time step • p(g_i|z_k): generating probability of the i-th gene given the k-th cluster • v_k = p(t|z_k): prototype of the k-th cluster • (*): objective function (maximized by EM)
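For orientation, one common objective for latent-variable clustering of this kind is the log-likelihood of an aspect model. The form below is an assumption, not necessarily the objective marked (*) on the slide: it presumes a PLSA-style model with mixing weights p(z_k) and observed expression levels n(g_i, t_j), neither of which is defined in the slide text.

```latex
\mathcal{L} \;=\; \sum_{i} \sum_{j} n(g_i, t_j)\,
  \log \sum_{k} p(z_k)\, p(g_i \mid z_k)\, p(t_j \mid z_k)
```

Under this assumed model, EM alternates between computing the posterior p(z_k | g_i, t_j), which is proportional to p(z_k) p(g_i|z_k) p(t_j|z_k), and re-estimating the three probability tables from those posteriors.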
Experimental Result: Identify Cell Cycle-Regulated Genes • Clustering result [Table] Clustering result with α-factor arrest data. Genes with a high probability of being cell cycle-regulated were found in 4 clusters.
Experimental Result: Prototype Expression Levels of Found Clusters • The genes in the same cluster show similar expression patterns during the cell cycle. • The genes with similar expression patterns are likely to have correlated functions. [Fig.] Prototype expression levels of genes found to be cell cycle-regulated (4 clusters).
Clustering Using Non-negative Matrix Factorization (NMF) • NMF (non-negative matrix factorization) • NMF as a latent variable model • G: gene expression data matrix • W: basis matrix (prototypes) • H: encoding matrix (in low dimension) [Fig.] Latent variables h1, h2, …, hr connected through W to genes g1, g2, …, gn.
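A minimal sketch of the factorization G ≈ WH in plain Python follows. The slide does not say which update rule was used; this example assumes the widely used multiplicative updates for the squared Frobenius error, and the data matrix, the rank and the function names are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(g, w, h):
    """Squared Frobenius norm of G - WH."""
    wh = matmul(w, h)
    return sum((g[i][j] - wh[i][j]) ** 2
               for i in range(len(g)) for j in range(len(g[0])))


def nmf(g, rank, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative n x m matrix G as W (n x rank) times H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(g), len(g[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    initial_error = frobenius_error(g, w, h)
    for _ in range(iterations):
        # H <- H * (W^T G) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, g), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(rank)]
        # W <- W * (G H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(g, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(rank)]
             for i in range(n)]
    return w, h, initial_error


# Hypothetical non-negative data matrix of rank 2.
g = [[1.0, 2.0, 3.0],
     [2.0, 4.0, 6.0],
     [3.0, 1.0, 0.0],
     [6.0, 2.0, 0.0]]
w, h, initial_error = nmf(g, rank=2)
final_error = frobenius_error(g, w, h)
print(initial_error, final_error)
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative throughout, which is what lets the columns of W be read as additive prototypes.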
Experimental Result: Five Clusters Found by NMF • 5 prototype expression levels during the cell cycle. [Fig.] Axes: expression level, time step in cell cycle.
Clustering Using Generative Topographic Mapping (GTM) • GTM: a nonlinear, parametric mapping y(x;W) from a latent space to a data space. [Fig.] A grid in the latent space (x1, x2) is mapped by y(x;W) into the data space (t1, t2, t3); generation runs from latent space to data space, visualization from data space back to latent space.
Experimental Result: Clusters Found by GTM • Three cell cycle-regulated clusters found by GTM
Experimental Result: Comparison with Other Methods • Comparison of prototype expression levels
Genetic Network Analysis • Discover the complex regulatory interaction among genes. • Disease diagnosis, pharmacogenomics and toxicogenomics • Boolean networks • Differential equations • Relevance networks [Butte ’97] • Bayesian networks [Friedman ’00] [Hwang ’00] [Fig.] Basin of attraction of 12-gene Boolean genetic network model [Somogyi ’96].
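The idea of attractors and basins of attraction in a Boolean network can be sketched with a toy example. The 3-gene network and its update rules below are hypothetical (they are not the 12-gene model of [Somogyi ’96]), and synchronous updating is assumed.

```python
from itertools import product

# Hypothetical 3-gene network: each rule gives a gene's next value
# as a Boolean function of the current state.
RULES = {
    "a": lambda s: s["b"] and not s["c"],
    "b": lambda s: s["a"],
    "c": lambda s: s["a"] or s["c"],
}
GENES = sorted(RULES)


def step(state):
    """Synchronous update: all genes switch at the same time."""
    current = dict(zip(GENES, state))
    return tuple(bool(RULES[g](current)) for g in GENES)


def attractor_of(state):
    """Follow the trajectory until a state repeats; return the cycle reached."""
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state)
    cycle = seen[seen.index(state):]
    # Rotate so that the same cycle always has the same representation.
    k = cycle.index(min(cycle))
    return tuple(cycle[k:] + cycle[:k])


# Group all 2^3 states by the attractor they eventually fall into.
basins = {}
for state in product([False, True], repeat=len(GENES)):
    basins.setdefault(attractor_of(state), []).append(state)

for attractor, members in basins.items():
    print(attractor, len(members))
```

In this toy network there are two fixed-point attractors: the all-off state, which is reached only from itself, and the state with only gene c on, whose basin contains the other seven states.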
Bayesian Networks • Represent the joint probability distribution among random variables efficiently using the concept of conditional independence. • An edge denotes a possible causal relationship between nodes. • A, C and D are independent given B. • C asserts dependency between A and B. • A, B and E are independent given C. [Fig.] Example network over nodes A, B, C, D, E.
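The efficiency gain from conditional independence can be made explicit with the general factorization of a Bayesian network. The notation pa(x_i) for the parents of node x_i in the graph is an assumption introduced here; the slide's own example structure is not reproduced.

```latex
p(x_1, \dots, x_n) \;=\; \prod_{i=1}^{n} p\bigl(x_i \mid \mathrm{pa}(x_i)\bigr)
```

Each factor depends only on a node and its parents, so a sparse graph over binary variables needs far fewer parameters than the 2^n - 1 of an unrestricted joint table.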
Bayesian Networks Learning • Dependence analysis [Margaritis ’00] • Mutual information and χ² test • Score-based search • D: data, S: Bayesian network structure • NP-hard problem • Greedy search • Heuristics to find good massive network structures quickly (local to global search algorithm)
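Dependence analysis relies on measuring how strongly two variables are associated. Below is a minimal sketch of the empirical mutual information between two discretized expression profiles; the function name and the data are hypothetical, and a real analysis would add a significance test such as the χ² test.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())


# Hypothetical discretized expression levels (0 = low, 1 = high) in 8 samples.
gene_a = [0, 0, 0, 0, 1, 1, 1, 1]
gene_b = [0, 0, 0, 0, 1, 1, 1, 1]  # identical to gene_a: fully dependent
gene_c = [0, 1, 0, 1, 0, 1, 0, 1]  # balanced against gene_a: independent

mi_ab = mutual_information(gene_a, gene_b)
mi_ac = mutual_information(gene_a, gene_c)
print(mi_ab, mi_ac)
```

Pairs with mutual information near zero are candidates for having no edge between them, which prunes the structure search space.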
The Small Bayesian Network for Classification of Cancer • The Bayesian network was learned by full search using the BD (Bayesian Dirichlet) score with an uninformative prior [Heckerman ’95] from the DNA microarray data for cancer classification (http://waldo.wi.mit.edu/MPR/). [Fig.] Network nodes: Leukemia class, Zyxin, C-myb, LTC4S, MB-1. [Table] Comparison of the classification performance with other methods [Hwang ’00].
Large-Scale Bayesian Network with 1171 Genes • Genetic networks for understanding the regulatory interaction among genes and their derivatives • Pharmacogenomics and toxicogenomics [Fig.] The Bayesian network structure constructed from DNA microarray data for cancer classification (partial view).
DNA Computing: BioMolecules as Computer [Fig.] A binary string (011001101010001) alongside a DNA sequence (ATGCTCGAAGCT).
Why DNA Computing? • 6.022 × 10^23 molecules / mole • Immense, brute force search of all possibilities • Desktop: 10^9 operations / sec • Supercomputer: 10^12 operations / sec • 1 mmol of DNA: 10^26 reactions • Favorable energetics: Gibbs free energy • 1 J for 2 × 10^19 operations • Storage capacity: 1 bit per cubic nanometer
Flow of DNA Computing • Steps: HPP (Hamiltonian path problem) → Encoding → Ligation → PCR (polymerase chain reaction) → Gel electrophoresis → Affinity column → Decoding → Solution • Node encoding: Node 0: ACG, Node 1: CGA, Node 2: GCA, Node 3: TAA, Node 4: ATG, Node 5: TGC, Node 6: CGT • Ligation produces strands for many candidate paths, e.g. TAAACG, ATGTGCTAACGAACG, ACGCGAGCATAAATGTGCACGCGT • The surviving strand ACGCGAGCATAAATGTGCCGT decodes to the path 0 → 1 → 2 → 3 → 4 → 5 → 6. [Fig.] Seven-node graph (nodes 0 to 6) shown at the encoding and decoding stages.
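The generate-and-filter flow can be mimicked in software. The sketch below uses the node codes from this slide, but the edge list is hypothetical because the graph figure is not recoverable, and exhaustive enumeration stands in for the random, massively parallel ligation that happens in the test tube. Function names are illustrative only.

```python
# Node codes as given on the slide.
NODE_CODE = {0: "ACG", 1: "CGA", 2: "GCA", 3: "TAA", 4: "ATG", 5: "TGC", 6: "CGT"}
CODE_NODE = {code: node for node, code in NODE_CODE.items()}

# Hypothetical directed edges (assumption: not taken from the slide).
EDGES = {0: [1, 3, 6], 1: [2, 3], 2: [1, 3], 3: [2, 4], 4: [1, 5], 5: [2, 6], 6: []}


def ligate(max_nodes):
    """Stand-in for ligation: every walk of up to max_nodes nodes, as a strand."""
    strands = []

    def extend(path):
        strands.append("".join(NODE_CODE[n] for n in path))
        if len(path) < max_nodes:
            for nxt in EDGES[path[-1]]:
                extend(path + [nxt])

    for start in NODE_CODE:
        extend([start])
    return strands


def decode(strand):
    """Read a strand three bases at a time and map each code back to its node."""
    return [CODE_NODE[strand[i:i + 3]] for i in range(0, len(strand), 3)]


def solve_hpp(start, end):
    n = len(NODE_CODE)
    pool = ligate(max_nodes=n + 1)
    # PCR: amplify only strands that begin and end with the right node codes.
    pool = [s for s in pool
            if s.startswith(NODE_CODE[start]) and s.endswith(NODE_CODE[end])]
    # Gel electrophoresis: keep strands whose length equals n node codes.
    pool = [s for s in pool if len(s) == 3 * n]
    # Affinity column: for each node in turn, keep strands that contain it.
    for node in NODE_CODE:
        pool = [s for s in pool if node in decode(s)]
    return pool


solutions = solve_hpp(start=0, end=6)
print(solutions, [decode(s) for s in solutions])
```

With this edge list the only strand that survives all three filters is the one visiting nodes 0 through 6 in order. The software version enumerates walks one by one, whereas the molecular version forms them all at once, which is the source of the parallelism the slides emphasize.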
Biointelligence on a Chip? • Computing models: the limit of conventional computing models • Computing devices: the limit of silicon semiconductor technology [Fig.] Diagram relating Information Technology, Biotechnology, Bioinformation Technology, Molecular Electronics, Biological Computer and Biointelligence Chip.
Intelligent Biomolecular Information Processing [Fig.] Biocomputing architecture: theoretical models, input A, controller, reaction chamber (calculating), bio-memory, bio-processor, output.
Evolvable Biomolecular Hardware • Sequence programmable and evolvable molecular systems have been constructed as cell-free chemical systems using biomolecules such as DNA and proteins.
Molecular Operators for DNA Computing • Hybridization: complementary pairing of two single-stranded polynucleotides, e.g. 5’-AGCATCCA-3’ + 3’-TCGTAGGT-5’ form the duplex 5’-AGCATCCA-3’ / 3’-TCGTAGGT-5’ • Ligation: attaching sticky ends to a blunt-ended molecule [Fig.] Ligation example with a sticky end; strands shown: ATGCATGC, TACG, ATGCATGCTGAC, TACGTACGTGAC, TGAC, TACGACTG.
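Hybridization follows the Watson-Crick pairing rules (A with T, G with C), which are easy to check in code. The sketch below is a simplified model that only accepts perfect complementarity; real hybridization also depends on temperature, strand length and mismatches. The function names are illustrative.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand_5to3):
    """Base-by-base complement; the result reads 3'->5' from left to right."""
    return "".join(COMPLEMENT[base] for base in strand_5to3)


def reverse_complement(strand_5to3):
    """Complementary strand written in the conventional 5'->3' direction."""
    return complement(strand_5to3)[::-1]


def hybridizes(strand_a_5to3, strand_b_5to3):
    """True if two single strands (both written 5'->3') pair perfectly."""
    return strand_b_5to3 == reverse_complement(strand_a_5to3)


# The 5'-AGCATCCA-3' strand from the slide and its partner.
partner_3to5 = complement("AGCATCCA")          # written 3'->5'
partner_5to3 = reverse_complement("AGCATCCA")  # same strand, written 5'->3'
print(partner_3to5, partner_5to3, hybridizes("AGCATCCA", partner_5to3))
```

The antiparallel orientation is why the partner strand is reversed when both strands are written 5' to 3'.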
Research Groups • MIT, Caltech, Princeton University, Bell Labs • EMCC (European Molecular Computing Consortium) is composed of national groups from 11 European countries • BioMIP Institute (BioMolecular Information Processing) at the German National Research Center for Information Technology (GMD) • Molecular Computer Project (MCP) in Japan • Leiden Center for Natural Computation (LCNC)
Applications of Biomolecular Computing • Massively parallel problem solving • Combinatorial optimization • Molecular nano-memory with fast associative search • AI problem solving • Medical diagnosis • Cryptography • Drug discovery • Further impact in biology and medicine: wet biological databases, processing of DNA labeled with digital data, sequence comparison, fingerprinting