BK21 BT·IT Integrationist Program Omics data integration & mining The Sixth Sino-Japan-Korea Bioinformatics Training Course Shanghai, March 27-30, 2007 2007. 3. 29 Sangsoo Kim & KOBIC Omics Team
What is the goal of Biosciences? • Ultimately, the complete understanding of life phenomena • Complex organization • Regulatory mechanism (homeostasis) • Growth & development • Energy utilization • Response to environmental stimuli • Reproduction (DNA guarantees exact replication) • Evolution (capacity of species to change over time)
Spider Silk: Stronger than Steel • Life’s diversity results from the variety of molecules in cells • A spider’s web-building skill depends on its DNA molecules • DNA also determines the structure of silk proteins • These make a spiderweb strong and resilient
The capture strand contains a single coiled silk fiber coated with a sticky fluid • The coiled fiber unwinds to capture prey and then recoils rapidly • [Figure labels: coiled fiber of silk protein; coating of capture strand]
Evidence from flagelliform silk cDNA for the structural basis of elasticity and modular nature of spider silks J Mol Biol. 1998 Feb 6;275(5):773-84 • They report the cloning of substantial cDNA for flagelliform gland silk protein, which forms the core fiber of the catching spiral • The dominant repeat of this protein is Gly-Pro-Gly-Gly-X, which can appear up to 63 times in tandem arrays • They propose that the spring-like helix is the basis for the elasticity of silk
Central dogma of molecular biology: DNA → RNA → protein
Paradigm Shift in Biosciences • So far, biologists have focused on certain phenotypes and hunted the genes responsible, one at a time • The new trend is to • Catalog all the parts: genes and proteins (Genomics & Proteomics) • Understand how each part works (Functional Genomics) • Model & simulate the collective behavior of the parts (Systems Biology)
Central dogma of molecular biology: DNA → RNA → protein • Central dogma of bioinformatics and genomics: genome → transcriptome → proteome
[Figure: base pairs of DNA (billions) and number of sequences (millions) by year, 1982 to 2002]
With $1,000 genome sequencing technologies in 10 years coupled with functional data, we need better IT solutions!
Proliferation of Genomics • Explosion of data • Human genes: 25,000 • Human genome: 3 × 10^9 bp • DNA-protein or protein-protein interactions could increase data dramatically • Chimpanzee, mouse, rat, dog, cow, chicken, insects, worms, plants, fungi, algae, bacteria, archaea, viruses …
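A quick count shows why interaction data could outgrow the gene catalog. The sketch below assumes, purely for illustration, one protein per gene and unordered pairs with no self-interactions. It counts candidate pairs, not observed interactions, and the function name is hypothetical.

```python
from math import comb

HUMAN_GENES = 25_000  # gene count quoted on the slide


def possible_pairs(n: int) -> int:
    """Number of unordered pairs among n items, i.e. n * (n - 1) / 2."""
    return comb(n, 2)


pairs = possible_pairs(HUMAN_GENES)
print(f"{pairs:,} candidate protein-protein pairs")  # prints 312,487,500 candidate ...
```

Even under these simplifying assumptions, the search space is four orders of magnitude larger than the gene list itself.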
Genome Projects (385 finished) as of June 4, 2006 • Ongoing projects: 608 eukaryotes, 989 prokaryotes
Top ten challenges for bioinformatics [1] Precise models of where and when transcription will occur in a genome (initiation and termination) [2] Precise, predictive models of alternative RNA splicing [3] Precise models of signal transduction pathways; ability to predict cellular responses to external stimuli [4] Determining protein:DNA, protein:RNA, protein:protein recognition codes [5] Accurate ab initio protein structure prediction
Top ten challenges for bioinformatics [6] Rational design of small molecule inhibitors of proteins [7] Mechanistic understanding of protein evolution [8] Mechanistic understanding of speciation [9] Development of effective gene ontologies: systematic ways to describe gene and protein function [10] Education: development of bioinformatics curricula Source: Ewan Birney, Chris Burge, Jim Fickett
Functional Genomics & Systems Biology • New data types: • Sequences • Structures • High throughput expression profiles in (10,000 x 100) matrix forms • Interactions, Pathways, Networks • Mathematical modeling & simulation of biological processes • Algorithms • Graphical visualization
Terminology • DNA: genome, genomics • RNA: transcriptome, transcriptomics • Protein: proteome, proteomics • Metabolite: metabolome, metabolomics • More than 50 "-omes", including the “Unknownome”
Omics data • In the Omics era, we see a proliferation of genome- and proteome-wide high-throughput data available in public archives • Comparative genome sequences • Sequence variation & phenotypes • Epigenetics & chromatin structure • Regulatory elements & gene expression • Protein expression, modification & localization • Protein domain, structure, interaction • Metabolic, signal, regulatory pathways • Drug, toxicogenomics, toxicoproteomics
Joyce et al. Nature Reviews Molecular Cell Biology 7, 198–210 (March 2006) | doi:10.1038/nrm1857
As an example, • Suppose you are interested in how well the transcriptional control of CDK2 is conserved; you may need • Orthologs in various model organisms • Genome alignments of promoter regions among phylogenetic cousins • Among mammals or vertebrates • Among yeast subspecies • A Transfac-type TF binding database • ChIP-chip data for each organism • An orthology map of the TFs, and so on • You may add proteome and interactome data • Only some of these are available at NCBI • The rest are available in the public domain as supplementary materials or at the authors’ web sites
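The CDK2 question above boils down to a join across three resources: an ortholog table, per-species TF binding data, and a TF orthology map. Below is a minimal sketch of that join. All TF names and binding calls are made-up placeholders, and the function name is hypothetical; real inputs would come from the sources listed on the slide.

```python
def conserved_regulators(orthologs, bound_tfs, tf_groups):
    """Return TF orthology groups bound at the gene's promoter in every species.

    orthologs: species -> gene symbol of the ortholog in that species
    bound_tfs: (species, gene) -> set of TFs bound at the promoter
    tf_groups: species-specific TF name -> shared orthology group label
    """
    per_species = []
    for species, gene in orthologs.items():
        tfs = bound_tfs.get((species, gene), set())
        # Translate species-specific TF names into shared orthology groups.
        per_species.append({tf_groups[tf] for tf in tfs if tf in tf_groups})
    return set.intersection(*per_species) if per_species else set()


# Placeholder data, for illustration only (not real binding results).
orthologs = {"human": "CDK2", "mouse": "Cdk2"}
bound_tfs = {
    ("human", "CDK2"): {"hTF_A", "hTF_B"},
    ("mouse", "Cdk2"): {"mTF_A", "mTF_C"},
}
tf_groups = {"hTF_A": "TF_A", "mTF_A": "TF_A", "hTF_B": "TF_B", "mTF_C": "TF_C"}

print(conserved_regulators(orthologs, bound_tfs, tf_groups))  # {'TF_A'}
```

The set intersection is the strictest possible definition of "conserved"; a looser rule (for example, bound in most species) may suit sparse ChIP-chip coverage better.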
Integration of Omics data • Systematic mining • Cross-knowledge domain validation • Cross-species interpolation • Generation of hypotheses that can be tested • Biologically very interesting queries • Requires cross-functional knowledge • The way to go
Where to look • Nature provides an omics section • www.nature.com/omics • Science • Cell • PLoS Biology • Genes & Development • Stem Cell • Relevant articles (PubMed, Google Scholar)
Phase 1 of ENCODE • NHGRI’s ENCODE project generates such data at a pilot scale • The data are deposited and integrated into the UCSC Genome Browser • It offers data mining capability via the Table Browser • There are no ‘biological links’ among the 3,000+ tables (Ensembl’s BioMart is more ‘biological’) • It is up to the users how to combine the tables • It is limited to genomic coordinates and is not intended for proteome work
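Because the tables share only genomic coordinates, combining two of them usually means an interval-overlap join. The sketch below shows the idea on two tiny in-memory tracks. It assumes BED-style half-open coordinates (0-based start, exclusive end); the track names, feature names, and positions are invented for illustration and are not taken from any real table.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    name: str


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals on the same chromosome share a base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def intersect(track_a, track_b):
    """Naive O(n*m) join: every (a, b) name pair whose intervals overlap."""
    return [(a.name, b.name) for a in track_a for b in track_b if overlaps(a, b)]


# Invented example tracks.
conserved = [
    Interval("chr7", 100, 2100, "conserved_1"),
    Interval("chr7", 5000, 5200, "conserved_2"),
]
pol2_peaks = [
    Interval("chr7", 1500, 1800, "pol2_peak_1"),
    Interval("chr1", 150, 300, "pol2_peak_2"),
]

print(intersect(conserved, pol2_peaks))  # [('conserved_1', 'pol2_peak_1')]
```

The nested loop is fine for a handful of regions; genome-scale tracks would call for sorted sweeps or an interval tree instead.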
A ~2kb conserved, transcribable, Ac-histone, pol2-binding element in the 1st intron of ST7
Application Examples • Joyce et al. Nature Reviews Molecular Cell Biology 7, 198–210 (March 2006) | doi:10.1038/nrm1857
Protein-DNA Interaction & Transcriptomics • Yeast rich medium gene modules network • ChIP-chip location and expression data • 106 modules containing 655 genes regulated by 68 TFs
Predicting Protein-Protein Interaction by combining multiple datasets
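The slides do not state which method was used to combine the datasets. One widely used approach, assumed here only for illustration, is a naive Bayes scheme: each dataset contributes a likelihood ratio for a candidate protein pair, and the ratios are multiplied with the prior odds on the assumption that the datasets are conditionally independent. The prior and the likelihood ratios below are hypothetical numbers.

```python
def posterior_odds(prior_odds: float, likelihood_ratios) -> float:
    """Multiply prior odds by each dataset's likelihood ratio.

    Assumes the evidence sources are conditionally independent (naive Bayes).
    """
    odds = prior_odds
    for lr in likelihood_ratios:
        odds *= lr
    return odds


def odds_to_probability(odds: float) -> float:
    """Convert odds O into a probability O / (1 + O)."""
    return odds / (1.0 + odds)


# Hypothetical values: a rare-event prior and three supporting datasets,
# e.g. co-expression, shared annotation, and a two-hybrid hit.
prior = 1 / 600
evidence = [10.0, 8.0, 15.0]

odds = posterior_odds(prior, evidence)
print(f"posterior odds ~ {odds:.2f}, probability ~ {odds_to_probability(odds):.2f}")
```

No single dataset lifts this pair above even odds, but the three together do, which is the point of combining them. Correlated datasets break the independence assumption and would overstate the result.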
How to participate • Domain knowledge group • Monitoring papers and websites of relevant data • Collect the omics data and transform into common formats • Develop hypotheses & mining strategies • Data integration group • Develop DB schema • Integration with bio-matrix & bio-engine • Querying biological concepts • Graphic visualization
Practice Session - Cytoscape • Installation • One of the most widely used and broadly accessible software packages designed to facilitate omics data integration and analysis • Tutorials • Interaction network display • Expression analysis • Literature searching