
Bioinformatics & Computational Biology

Bioinformatics & Computational Biology. Thanks to Mark Gerstein (Yale) & Eric Green (NIH) for many borrowed & modified PPTs. Drena Dobbs Iowa State University. What is Bioinformatics? (& What is Computational Biology?). Wikipedia :


Presentation Transcript


  1. Bioinformatics & Computational Biology. Thanks to Mark Gerstein (Yale) & Eric Green (NIH) for many borrowed & modified PPTs. Drena Dobbs, Iowa State University

  2. What is Bioinformatics? (& What is Computational Biology?) Wikipedia: • Bioinformatics & computational biology involve the use of techniques from mathematics, informatics, statistics, and computer science (& engineering) to solve biological problems Gerstein: • (Molecular) Bioinformatics is conceptualizing biology in terms of molecules & applying "informatics" techniques - derived from disciplines such as mathematics, computer science, and statistics - to organize and understand information associated with these molecules, on a large scale

  3. Central Dogma of Molecular Biology: DNA sequence -> RNA -> Protein -> Phenotype. Molecules: Sequence, Structure, Function. Processes: Mechanism, Specificity, Regulation. Central Paradigm for Bioinformatics: Genomic (DNA) Sequence -> mRNAs & other RNA sequences -> Protein sequences -> RNA & Protein Structures -> RNA & Protein Functions -> Phenotype. Large Amounts of Information: Standardized, Statistical. What is the Information? Biological Sequences, Structures, Processes. (Modified from Mark Gerstein; idea from D Brutlag, Stanford; graphics from S Strobel)

  4. Explosion of "Omes" & "Omics!" Genome, Transcriptome, Proteome • Genome - the complete collection of DNA (genes and "non-genes") of an organism • Transcriptome - the complete collection of RNAs (mRNAs & others) expressed in an organism • Proteome - the complete collection of proteins expressed in an organism

  5. Genome = Constant; Transcriptome & Proteome = Variable • Genome - the complete collection of DNA (genes and "non-genes") of an organism * Note: Although the DNA is "identical" in all cells of an organism, the sets of RNAs or proteins expressed in different cells & tissues of a single organism vary greatly -- and depend on variables such as environmental conditions, age, developmental stage, disease state, etc. • Transcriptome - the complete collection of RNAs (mRNAs & others) expressed in an organism* • Proteome - the complete collection of proteins expressed in an organism*

  6. Molecular Biology Information: DNA & RNA Sequences Functions: • Genetic material • Information transfer (mRNA) • Protein synthesis (tRNA/mRNA) • Catalytic & regulatory activities (some very new!) Information: • 4 letter alphabet (DNA nucleotides: AGCT) • ~1,000 base pairs in a small gene • ~3 × 10^9 bp in a genome (human) DNA sequence: atggcaattaaaattggtatcaatggttttggtcgtat gcacaacaccgtgatgacattgaagttgtaggtattaa atggcttatatgttgaaatatgattcaactcacggtcg aaagatggtaacttagtggttaatggtaaaactatccg Gcaaacttaaactggggtgcaatcggtgttgatatcgctttaactgatgaaactgctcgtaaacatatcactgcaggcgcaaaaaaagtt RNA sequence has "U" instead of "T" • Where are the genes? • Which DNA sequences encode mRNA? • Which DNA sequences are "junk"? • Which RNA sequences encode protein? Modified from Mark Gerstein
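The points on this slide (a 4-letter alphabet, "U" instead of "T" in RNA, and the question of which sequences encode protein) can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: the fragment is the first line of the DNA sequence shown on the slide, it is treated as the coding strand read in frame from its first base, and the function names (`transcribe`, `translate`, `genome_bytes`) are hypothetical helpers, not part of any library.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))  # e.g. "ATG" -> "M", "TAA" -> "*"


def transcribe(dna):
    """Return the RNA version of a coding-strand DNA sequence (T -> U)."""
    return dna.upper().replace("T", "U")


def translate(dna):
    """Translate DNA in frame 0 until a stop codon or the last full codon.

    Assumes the sequence contains only A, C, G, T; any other symbol
    raises KeyError.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


def genome_bytes(n_bases, bits_per_base=2):
    """Minimum storage for a sequence over a 4-letter alphabet (2 bits/base)."""
    return n_bases * bits_per_base / 8


# First line of the DNA sequence shown on the slide.
fragment = "atggcaattaaaattggtatcaatggttttggtcgtat"

if __name__ == "__main__":
    print(transcribe(fragment))
    print(translate(fragment))
    print(genome_bytes(3 * 10**9))  # human genome, ~3 x 10^9 bp
```

Finding genes in real genomic DNA is much harder than this: the reading frame, the strand, and intron/exon boundaries are all unknown in advance, which is exactly what the slide's questions point at.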

  7. Molecular Biology Information: Protein Sequences Functions: Most cellular functions are performed or facilitated by proteins • Biocatalysis • Cofactor transport/storage • Mechanical motion/support • Immune protection • Regulation of growth and differentiation Information: • 20 letter alphabet (amino acids) • ACDEFGHIKLMNPQRSTVWY (but not BJOUXZ) • ~300 aa in an average protein (in bacteria) • ~3 × 10^6 known protein sequences Protein sequences: d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTT d8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTS d4dfra_ ISLIAALAVDRVIGMENAMPWN-LPADLAWFKRNTL d3dfr__ TAFLWAQDRDGLIGKDGHLPWH-LPDDLHYFRAQTV • What is this protein? • Which amino acids are most important -- for folding, activity, interaction with other proteins? • Which sequence variations are harmful (or beneficial)? Modified from Mark Gerstein
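One simple way to start on "which amino acids are most important?" is to compare the aligned sequences column by column. The sketch below, in Python, computes pairwise percent identity for the four aligned fragments shown on the slide. The function name `percent_identity` and the choice to skip gap columns are assumptions made for illustration; real tools use substitution matrices and more careful gap handling.

```python
# The four aligned fragments shown on the slide ("-" marks an alignment gap).
ALIGNED = {
    "d1dhfa_": "LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTT",
    "d8dfr__": "LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTS",
    "d4dfra_": "ISLIAALAVDRVIGMENAMPWN-LPADLAWFKRNTL",
    "d3dfr__": "TAFLWAQDRDGLIGKDGHLPWH-LPDDLHYFRAQTV",
}

AMINO_ACID_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")  # the 20-letter alphabet


def percent_identity(a, b, gap="-"):
    """Percent of aligned, non-gap columns in which a and b are identical."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # assumption: columns containing a gap are not scored
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


if __name__ == "__main__":
    names = list(ALIGNED)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            identity = percent_identity(ALIGNED[first], ALIGNED[second])
            print(f"{first} vs {second}: {identity:.1f}% identical")
```

Columns that stay identical across all four sequences are natural first candidates for residues that matter for folding or activity, though conservation alone does not prove function.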

  8. Molecular Biology Information: Macromolecular Structures DNA/RNA/Protein Structures • How does a protein (or RNA) sequence fold into an active 3-dimensional structure? • Can we predict structure from sequence? • Can we predict function from structure (or perhaps, from sequence alone)? Modified from Mark Gerstein

  9. We don't yet understand the protein folding code - but we try to engineer proteins anyway! Modified from Mark Gerstein

  10. Molecular Biology Information: Biological Processes Functional Genomics • How do patterns of gene expression determine phenotype? • Which genes and proteins are required for differentiation during development? • How do proteins interact in biological networks? • Which genes and pathways have been most highly conserved during evolution?

  11. On a Large Scale? Whole Genome Sequencing "Genome sequences now accumulate so quickly that, in less than a week, a single laboratory can produce more bits of data than Shakespeare managed in a lifetime, although the latter make better reading." -- G A Petsko, Nature 401: 115-116 (1999) Modified from Mark Gerstein

  12. Automated Sequencing for Genome Projects Another recent improvement: rapid & high resolution separation of fragments in capillaries instead of gels (E Yeung, Ames Lab, ISU) More recently? Pyro-sequencing: 454 sequencing http://www.454.com/ $1000 genomes? Modified from Eric Green

  13. 1st Draft Human Genome - "Finished" in 2001 Modified from Eric Green

  14. Human Genome Sequencing • Two approaches: • Public (government) - International Consortium (6 countries, NIH-funded in US) • "Hierarchical" cloning & BAC-by-BAC sequencing • Map-based assembly • Private (industry) - Celera (Craig Venter) • Whole genome random "shotgun" sequencing • Computational assembly (took advantage of public maps & sequences, too) Guess which human genome they sequenced? Craig's. How many genes? ~20,000 (Science May 2007)
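The "shotgun sequencing + computational assembly" idea can be sketched with a toy greedy overlap assembler in Python. This is a teaching sketch, not the method used by Celera or the public consortium: the reads, the 16-base "genome", the minimum overlap of 3, and the function names `overlap` and `greedy_assemble` are all made up for illustration. It assumes error-free reads from one strand and ignores repeats and reads contained inside other reads, which are the problems that make real assembly hard.

```python
from itertools import permutations


def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the best overlap is shorter than min_len.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of contigs with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_k, best_a, best_b = 0, None, None
        for a, b in permutations(contigs, 2):
            k = overlap(a, b, min_len)
            if k > best_k:
                best_k, best_a, best_b = k, a, b
        if best_k == 0:
            break  # nothing left to merge; return the separate contigs
        contigs.remove(best_a)
        contigs.remove(best_b)
        contigs.append(best_a + best_b[best_k:])
    return contigs


# Hypothetical toy data: three overlapping 8-base reads from a 16-base sequence.
toy_genome = "AGCTTAGGCATCCGTA"
toy_reads = ["CATCCGTA", "AGCTTAGG", "TAGGCATC"]

if __name__ == "__main__":
    print(greedy_assemble(toy_reads))
```

Map-based (BAC-by-BAC) assembly reduces the same problem to many small, separately assembled pieces whose order is already known from the physical map, which is why the two approaches on the slide differ mainly in how much they rely on computation.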

  15. Public Sequencing - International Consortium Modified from Eric Green

  16. Comparison of Sequenced Genome Sizes Plants? Some have much larger genomes than human! Modified from Eric Green

  17. "Complete" Human Genome Sequence - What next? from Eric Green

  18. Understanding Gene Function on a Genomic Scale Next Step after the Sequence? • Expression Analysis • Structural Genomics • Protein Interactions • Pathway Analysis • Systems Biology • Evolutionary Implications of: • Introns & Exons • Intergenic Regions as "Gene Graveyard" Modified from Mark Gerstein

  19. Interpreting the Human Genome Sequence! from Eric Green

  20. Comparative Genomics: compare entire genomic sequences from Eric Green

  21. Comparing Genomes: Functional Elements from Eric Green

  22. Gene Expression Data: the Transcriptome (& Proteome) MicroArray Data • Yeast Expression Data: • Levels for all 6,000 genes! • Experiments to investigate how genes respond to changes in environment or how patterns of expression change in normal vs cancerous tissue ISU's Biotechnology Facilities include state-of-the-art Microarray & Proteomics instrumentation Modified from Mark Gerstein (courtesy of J Hager)

  23. Other Whole-Genome Experiments Systematic Knockouts: Make "knockout" (null) mutations in every gene - one at a time - and analyze the resulting phenotypes! For yeast: 6,000 KO mutants! 2-hybrid Experiments: For each (and every) protein, identify every other protein with which it interacts! For yeast: 6000 x 6000 / 2 ~ 18M interactions!! Modified from Mark Gerstein
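The "6000 x 6000 / 2 ~ 18M" estimate on this slide is the number of unordered protein pairs, rounded. A small Python check makes the arithmetic explicit; the function name `unordered_pairs` and the `include_self` option (for proteins that interact with themselves) are assumptions added for illustration.

```python
from math import comb


def unordered_pairs(n, include_self=False):
    """Number of distinct unordered pairs among n proteins.

    With include_self=True, self-pairs (a protein binding another copy
    of itself) are counted as well.
    """
    return comb(n, 2) + (n if include_self else 0)


if __name__ == "__main__":
    n_yeast_proteins = 6000  # approximate figure used on the slide
    print(unordered_pairs(n_yeast_proteins))
    print(unordered_pairs(n_yeast_proteins, include_self=True))
```

Either way the count is within a fraction of a percent of the slide's 18 million, so the simpler n x n / 2 approximation is fine at this scale.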

  24. Molecular Biology Information: Integrating Data • Understanding the function of genomes requires integration of many diverse and complex types of information: • Metabolic pathways • Regulatory networks • Whole organism physiology • Evolution, phylogeny • Environment, ecology • Literature (MEDLINE) Modified from Mark Gerstein

  25. Storing & Analyzing Large-scale Information: Exponential Growth of Data Matched by Development of Computer Technology CPU vs Disk & Net • Both the increase in computer speed and the ability to store large amounts of information on computers have been crucial • Improved computing resources have been a driving force in Bioinformatics ISU's supercomputer "CyBlue" is among the 100 most powerful in the world Modified from Mark Gerstein (Internet picture adapted from D Brutlag, Stanford)

  26. Weber Cartoon from Mark Gerstein

  27. Challenges in Organizing & Understanding High-throughput Data: Redundancy and Multiplicity • Different sequences can have the same structure • Organism has many similar genes • Single gene may have multiple functions • Genes and proteins function in genetic and regulatory pathways • How do we organize all this information so that we can make sense of it? Integrative Genomics: genes <> structures <> functions <> pathways <> expression <> regulatory systems <> …. Modified from Mark Gerstein

  28. "Simple" example? Proteins: Molecular Parts = Conserved Domains Modified from Mark Gerstein

  29. "Parts List" approach to bike maintenance: How many roles can these play? How flexible and adaptable are they mechanically? What are the shared parts (bolt, nut, washer, spring, bearing), unique parts (cogs, levers)? What are the common parts -- types of parts (nuts & washers)? Where are the parts located? Modified from Mark Gerstein

  30. World of protein structures is also finite, providing a valuable simplification! (human) ~20,000 genes ~2,000 folds (T. pallidum) ~2,000 genes Global Surveys of a Finite Set of Parts from Many Perspectives Same logic for pathways, functions, sequence families, blocks, motifs.... Modified from Mark Gerstein. Functions picture from www.fruitfly.org/~suzi (Ashburner); Pathways picture from ecocyc.pangeasystems.com/ecocyc (Karp, Riley). Related resources: COGS, ProDom, Pfam, Blocks, Domo, WIT, CATH, Scop....

  31. BUT, what actually happens in cells & in whole organisms is much more complex, providing a challenging complication!! Exploring the Virtual Cell at ISU Virtual Cell projects elsewhere... NCBI's Bookshelf - a great resource!

  32. So, having a list of parts is not enough! BIG QUESTION? How do parts work together to form a functional system? SYSTEMS BIOLOGY What is a system? Macromolecular complex, pathway, network, cell, tissue, organism, ecosystem…

  33. Is this Bioinformatics? (#1, with Answers) YES • Creating digital libraries • Automated bibliographic search and textual comparison • Knowledge bases for biological literature • Motif discovery using Gibbs sampling • Methods for structure determination • Computational X-ray crystallography • NMR structure determination • Distance Geometry • Metabolic pathway simulation YES YES YES Modified from Mark Gerstein

  34. Is this Bioinformatics? #2 YES • Gene identification by sequence inspection • Prediction of splice sites, promoters, etc. • DNA methods in forensics • Modeling populations of organisms • Ecological Modeling • Genomic sequencing methods • Assembling contigs • Physical and genetic mapping • Linkage analysis • Linking specific genes to various traits YES YES YES YES Modified from Mark Gerstein

  35. Is this Bioinformatics? #3 • Rational drug design • RNA structure prediction • Protein structure prediction • Radiological image processing • Computational representations for human anatomy • (e.g., Visible Human) • Artificial life simulations • Artificial immunology • Virtual cells YES Maybe Yes Modified from Mark Gerstein

  36. So, this is Bioinformatics What is it good for?

  37. EXAMPLES OF BIOINFORMATICS RESEARCH A few general ones & a few personal favorites!

  38. Designing New Drugs • Understanding how proteins bind other molecules • Structural modeling & ligand docking • Designing inhibitors or modulators of key proteins (Figures adapted from Olsen Group Docking Page at Scripps, Dyson NMR Group Web page at Scripps, and from Computational Chemistry Page at Cornell Theory Center). Modified from Mark Gerstein

  39. Finding homologs of "new" human genes Modified from Mark Gerstein

  40. Finding WHAT? Homologs - "same genes" in different organisms (actually, orthologs) • Human vs. Mouse vs. Yeast • Much easier to do experiments on yeast to determine function • Often, function of an ortholog in at least one organism is known
      Best Sequence Similarity Matches to Date Between Positionally Cloned Human Genes and S. cerevisiae Proteins
      Human Disease | MIM # | Human Gene | GenBank Acc# for Human cDNA | BLASTX P-value | Yeast Gene | GenBank Acc# for Yeast cDNA | Yeast Gene Description
      Hereditary Non-polyposis Colon Cancer | 120436 | MSH2 | U03911 | 9.2e-261 | MSH2 | M84170 | DNA repair protein
      Hereditary Non-polyposis Colon Cancer | 120436 | MLH1 | U07418 | 6.3e-196 | MLH1 | U07187 | DNA repair protein
      Cystic Fibrosis | 219700 | CFTR | M28668 | 1.3e-167 | YCF1 | L35237 | Metal resistance protein
      Wilson Disease | 277900 | WND | U11700 | 5.9e-161 | CCC2 | L36317 | Probable copper transporter
      Glycerol Kinase Deficiency | 307030 | GK | L13943 | 1.8e-129 | GUT1 | X69049 | Glycerol kinase
      Bloom Syndrome | 210900 | BLM | U39817 | 2.6e-119 | SGS1 | U22341 | Helicase
      Adrenoleukodystrophy, X-linked | 300100 | ALD | Z21876 | 3.4e-107 | PXA1 | U17065 | Peroxisomal ABC transporter
      Ataxia Telangiectasia | 208900 | ATM | U26455 | 2.8e-90 | TEL1 | U31331 | PI3 kinase
      Amyotrophic Lateral Sclerosis | 105400 | SOD1 | K00065 | 2.0e-58 | SOD1 | J03279 | Superoxide dismutase
      Myotonic Dystrophy | 160900 | DM | L19268 | 5.4e-53 | YPK1 | M21307 | Serine/threonine protein kinase
      Lowe Syndrome | 309000 | OCRL | M88162 | 1.2e-47 | YIL002C | Z47047 | Putative IPP-5-phosphatase
      Neurofibromatosis, Type 1 | 162200 | NF1 | M89914 | 2.0e-46 | IRA2 | M33779 | Inhibitory regulator protein
      Choroideremia | 303100 | CHM | X78121 | 2.1e-42 | GDI1 | S69371 | GDP dissociation inhibitor
      Diastrophic Dysplasia | 222600 | DTD | U14528 | 7.2e-38 | SUL1 | X82013 | Sulfate permease
      Lissencephaly | 247200 | LIS1 | L13385 | 1.7e-34 | MET30 | L26505 | Methionine metabolism
      Thomsen Disease | 160800 | CLC1 | Z25884 | 7.9e-31 | GEF1 | Z23117 | Voltage-gated chloride channel
      Wilms Tumor | 194070 | WT1 | X51630 | 1.1e-20 | FZF1 | X67787 | Sulphite resistance protein
      Achondroplasia | 100800 | FGFR3 | M58051 | 2.0e-18 | IPL1 | U07163 | Serine/threonine protein kinase
      Menkes Syndrome | 309400 | MNK | X69208 | 2.1e-17 | CCC2 | L36317 | Probable copper transporter
      Modified from Mark Gerstein

  41. Comparative Genomics: Genome/Transcriptome/Proteome/Metabolome Databases, statistics • Occurrence of specific genes or features in a genome • How many kinases in yeast? • Compare Tissues • Which proteins are expressed in cancer vs normal tissues? • Diagnostic tools • Drug target discovery Modified from Mark Gerstein

  42. Molecular Recognition: Analyzing & Predicting Macromolecular Interfaces (in DNA, RNA & protein complexes) Drena Dobbs, GDCB: Jae-Hyung Lee, Michael Terribilini, Jeff Sander, Pete Zaback. Vasant Honavar, Com S: Feihong Wu, Cornelia Caragea. Robert Jernigan, BBMB: Taner Sen, Andrzej Kloczkowski. Kai-Ming Ho, Physics

  43. Designing Zinc Finger DNA-binding proteins to recognize specific sites in genomic DNA Drena Dobbs, GDCB: Jeff Sander, Pete Zaback. Dan Voytas, GDCB: Fenglli Fu. Les Miller, ComS. Vasant Honavar, ComS. Keith Joung, Harvard

  44. Structure & function of human telomerase: Predicting structure & functional sites in a clinically important but "recalcitrant" RNP Cell Biologist: Biochemist: Imagined structure: www.intl-pag.org/ www.chemicon.com Lingner et al (1997) Science 276: 561-567. How would a systems biologist study telomerase?

  45. Resources for Bioinformatics & Computational Biology • Wikipedia: Bioinformatics • NCBI - National Center for Biotechnology Information • ISCB - International Society for Computational Biology • JCB - Jena Center for Bioinformatics • UBC - Bioinformatics Links Directory

  46. ISU Resources & Experts ISU Research Centers & Graduate Training Programs: BCB - Bioinformatics & Computational Biology Baker Center - Bioinformatics & Biological Statistics CIAG - Center for Integrated Animal Genomics CILD - Computational Intelligence, Learning & Discovery ISU Facilities: Biotech - Instrumentation Facilities CIAG - Center for Integrated Animal Genomics PSI - Plant Sciences Institute PSI Centers

  47. For fun: DNA Interactive: "Genomes" A tutorial on genomic sequencing, gene structure, gene prediction Howard Hughes Medical Institute (HHMI) Cold Spring Harbor Laboratory (CSHL) http://www.dnai.org/c/index.html
