
Genomics, Bioinformatics & Systems Biology

Genomics, Bioinformatics & Systems Biology. Steve Sontum & Jeremy Ward Middlebury College. What We Hope to Learn. Course structure, projects and policies Overview of Bioinformatics Computational Biology, definition, subdisciplines Genomic and Proteomic data bases


Presentation Transcript


  1. Genomics, Bioinformatics & Systems Biology Steve Sontum & Jeremy Ward Middlebury College

  2. What We Hope to Learn • Course structure, projects and policies • Overview of Bioinformatics • Computational Biology, definition, subdisciplines • Genomic and Proteomic data bases • OMICS and Molecular Biology OMES • Bike maintenance metaphor • OMIM Data base – Focus of Homework • Topics in Bioinformatics (Applications) • Read Mark Gerstein's paper Core

  3. MBBC324 Syllabus: Genomics, Bioinformatics, & Systems Biology • Instructors • Class Time • Texts • Web Sites • Objectives • Grading • Homework & Notebook • Discussion & Preparation • Midterm Exam • Term Project • Etiquette • Project Topics

  4. http://s11.middlebury.edu/MBBC0324A/

  5. What is Life? Reproduction Metabolism Evolution ? Life Performs Computation

  6. Genome as Program actcttctggtccccacagactcagagagaacccaccatggtgctgtctcctgccgacaagaccaacgtcaaggccgcctggggtaaggtcggcgcgcacgctggcgagtatggtgcggaggccctggagaggatgttcctgtccttccccaccaccaagacctacttcccgcacttcgacctgagccacggctctgcccaggttaagggccacggcaagaaggtggccgacgcgctgaccaacgccgtggcgcacgtgg... A computer program for …

  7. Genome as Program • A sensitivity to small changes: a single mutation can result in amplified large changes • Changing a single bit in your bank balance: 0 0 0 0 0 0 0 0 0 0 1 = $1 becomes 1 0 0 0 0 0 0 0 0 0 1 = $1025
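The bank-balance analogy on this slide can be checked with a minimal Python sketch. The function name `flip_bit` is hypothetical, introduced only for illustration; the 11-bit balance and the flipped position follow the slide's numbers.

```python
def flip_bit(value, position):
    """Return value with the bit at the given position (0 = least significant) inverted."""
    return value ^ (1 << position)

# The slide's 11-bit balance: 0 0 0 0 0 0 0 0 0 0 1 = $1
balance = 0b00000000001

# "Mutating" the highest of the 11 bits turns $1 into $1025.
mutated = flip_bit(balance, 10)
print(balance, mutated)
```

Flipping the lowest bit instead would change the balance by only $1, which is the point of the analogy: the effect of a single change depends heavily on where it lands.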

  8. Are 30,000 Human Genes Too Few? • Revealed: The secret of human behavior: Environment, not genes, key to our acts. London Observer, February 11, 2001 • Craig Venter, on breaking the news that the human genome had only 30,000 genes, said: “We simply do not have enough genes for the idea of biological determinism to be right. The wonderful diversity of the human species is not hard-wired in our genetic code. Our environments are critical.”

  9. Are 30,000 Human Genes Too Few? • Venter’s statement was the making of a new myth, that 30,000 genes were “too few” to explain human nature. • How many genes (each coming in only two varieties, on or off) would we need to make each of the 7 billion people in the world unique?

  10. Are 30,000 Human Genes Too Few? • Venter’s statement was the making of a new myth, that 30,000 genes were “too few” to explain human nature. • How many genes (each coming in only two varieties, on or off) would we need to make each of the 7 billion people in the world unique? • 2^33 = 8.6 billion! • Only 33 are needed!
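The counting argument on this slide can be reproduced with a short Python sketch. The function name `binary_genes_needed` is an assumption made for illustration; the model (each gene is simply on or off) is the slide's own simplification.

```python
def binary_genes_needed(population):
    """Smallest n such that 2**n >= population, i.e. n on/off genes
    give every individual a distinct combination."""
    return (population - 1).bit_length()

world_population = 7_000_000_000  # figure used on the slide

n = binary_genes_needed(world_population)
print(n, 2 ** n)
```

With 32 on/off genes there are only about 4.3 billion combinations, fewer than 7 billion people, so 33 is the smallest number that works under this model.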

  11. 21,000 Human Genes is Not Too Few • Proteins, the products of genes, are like Legos: they interact and connect in various ways. • Bioinformatics deals with the information that organizes these pieces into a Human. • Systems Biology deals with the interactions between these pieces. • And finally, Genomics is the study of genes.

  12. What Does the Dark Matter Do? • Is it “junk” DNA? • 2% Translated into proteins • 5% Evolutionarily conserved between Mouse and Human • 80% Transcribed into non-coding RNA (E. Pennisi, “Shining a light on the Genome’s Dark Matter”, Science 330, 1614 (2010))

  13. What is Bioinformatics? Biological Data + Computer Analysis • National Center for Biotechnology Information receives 5 terabytes/day! • Mouse Genome: 2.5 billion base pairs • Human Genome: 3 billion base pairs

  14. What is “informatics”? • Derived from the French word informatique • Referring to automated information processing and storage • The name is a contraction of “information” and “automatic” • Definition: Informatics is the science that deals with information, its structure, its storage, and its use • Tends to be associated with specific application areas • Medical informatics (applied to clinical medicine) • Bioinformatics (applied to biological research) • Business informatics (applied to management and information systems) (Musen)

  15. Where does Bioinformatics come from? Information Theory Artificial Intelligence Systems Analysis Graph Theory Robotics Algorithms Statistics Methods Bioinformatics Genomics Systems Biology Data from the Human Genome Project has fueled the development of new bioinformatics methods Computational Biology

  16. Definition for Bioinformatics? Core • (Molecular) Bio - informatics • One idea for a definition? Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical-chemistry) and then applying “informatics” techniques (derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules, on a large-scale. • Bioinformatics is a practical discipline with many applications. Read Mark Gerstein's paper; it will help you formulate a project proposal.

  17. What is Bioinformatics? • (Molecular) Bio - informatics • One idea for a definition? Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical-chemistry) and then applying “informatics” techniques (derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules, on a large-scale. • Bioinformatics is a practical discipline with many applications. (Gerstein)

  18. What is the Information? Molecular Biology as an Information Science • Central Dogma of Molecular Biology: DNA -> RNA -> Protein -> Phenotype -> DNA • Molecules: Sequence, Structure, Function, Interaction • Processes: Mechanism, Specificity, Regulation • Central Paradigm for Bioinformatics: Genomic Sequence Information -> mRNA (level) -> Protein Sequence -> Protein Structure -> Protein Function -> Protein Interaction -> Phenotype • Large Amounts of Information • Standardized and Statistical Computer Processing • Information Flow • Data Structures Mimic Biology (idea from D Brutlag) Core

  19. NCBI Data Bases

  20. Molecular Biology Information - DNA • Raw DNA Sequence • Coding or Not? • Open Reading Frame • Promoter sites? • Exons • Introns
      FASTA format
      >16 dna:chromosome:NM_000558:HBA1
      actcttctgg tccccacaga ctcagagaga acccaccatg gtgctgtctc ctgccgacaa gaccaacgtc aaggccgcct ggggtaaggt cggcgcgcac gctggcgagt atggtgcgga ggccctggag aggatgttcc tgtccttccc caccaccaag acctacttcc cgcacttcga cctgagccac ggctctgccc aggttaaggg ccacggcaag aaggtggccg acgcgctgac caacgccgtg gcgcacgtgg acgacatgcc caacgcgctg tccgccctga gcgacctgca cgcgcacaag cttcgggtgg acccggtcaa cttcaagctc ctaagccact gcctgctggt gaccctggcc gcccacctcc ccgccgagtt cacccctgcg gtgcacgcct ccctggacaa gttcctggct tctgtgagca ccgtgctgac ctccaaatac cgttaagctg gagcctcggt ggccatgctt cttgcccctt gggcctcccc ccagcccctc ctccccttcc tgcacccgta cccccgtggt ctttgaataa agtctgagtg ggcggc

  21. Molecular Biology Information - DNA • Raw DNA Sequence • Coding or Not? HBA1 gene • Open Reading Frame: atg - taa • Promoter sites? • Exons: Three • Introns: none (mRNA) • Parse into genes? • 4 bases: agct • ~1 Kb in a gene • ~2 Mb in genome • ~3 Gb Human
      >16 dna:chromosome:NM_000558:HBA1
      actcttctgg tccccacaga ctcagagaga acccaccatg gtgctgtctc ctgccgacaa gaccaacgtc aaggccgcct ggggtaaggt cggcgcgcac gctggcgagt atggtgcgga ggccctggag aggatgttcc tgtccttccc caccaccaag acctacttcc cgcacttcga cctgagccac ggctctgccc aggttaaggg ccacggcaag aaggtggccg acgcgctgac caacgccgtg gcgcacgtgg acgacatgcc caacgcgctg tccgccctga gcgacctgca cgcgcacaag cttcgggtgg acccggtcaa cttcaagctc ctaagccact gcctgctggt gaccctggcc gcccacctcc ccgccgagtt cacccctgcg gtgcacgcct ccctggacaa gttcctggct tctgtgagca ccgtgctgac ctccaaatac cgttaagctg gagcctcggt ggccatgctt cttgcccctt gggcctcccc ccagcccctc ctccccttcc tgcacccgta cccccgtggt ctttgaataa agtctgagtg ggcggc
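As a rough illustration of "parse into genes" and of the atg - taa open reading frame marked on this slide, here is a minimal Python sketch. The names `parse_fasta` and `first_orf` and the toy record are assumptions introduced for this example, not part of the course material, and real gene finding involves much more than locating the first start and stop codon.

```python
STOP_CODONS = {"taa", "tag", "tga"}

def parse_fasta(text):
    """Return a dict mapping each FASTA header (without '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Drop the spaces that group bases into blocks of ten.
            records[header].append(line.replace(" ", "").lower())
    return {h: "".join(parts) for h, parts in records.items()}

def first_orf(seq):
    """Return (start, end) of the first atg...stop open reading frame, or None."""
    start = seq.find("atg")
    while start != -1:
        # Walk codon by codon in the frame set by this atg.
        for i in range(start, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                return start, i + 3
        start = seq.find("atg", start + 1)
    return None

# Toy record (hypothetical), formatted like the slide's example.
toy = ">toy example\nccatggtg ctgtaagg\n"
records = parse_fasta(toy)
orf = first_orf(records["toy example"])
print(records, orf)
```

The same two functions could be pointed at the HBA1 record above; the sketch only reports the first in-frame start-to-stop span and says nothing about promoters, exons, or introns.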

  22. Molecular Biology Information - mRNA • Expressed Sequence Tags • Cloned mRNA • cDNA library • Mapped to genome • Expressed genes • Tissue specific • One-shot sequencing • 4 bases: AGCU • 300-800 bases • ~52 M ESTs Poly-T primer + reverse transcriptase 3' EST cDNA

  23. The Genetic Code • ATG start codon • TAG, TAA, TGA stop codons
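The start and stop codons named on this slide are enough to sketch a translation routine. The Python below is a minimal illustration, assuming the standard genetic code written in the common compact form (bases ordered T, C, A, G); the function name `translate` is hypothetical, and the input is the first 40 bases of the HBA1 record shown on the earlier DNA slides.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate from the first ATG up to (not including) the first in-frame stop codon."""
    dna = dna.upper().replace(" ", "")
    start = dna.find("ATG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

# First four ten-base blocks of the HBA1 sequence from the DNA slides.
fragment = "actcttctgg tccccacaga ctcagagaga acccaccatg"
fragment = "acccaccatg gtgctgtctc ctgccgacaa gaccaacgtc"
print(translate(fragment))
```

The sketch assumes the input contains only the letters A, C, G, and T; an ambiguous base such as N would raise a `KeyError` and would need separate handling.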

  24. Molecular Biology Information: Protein Sequence • 20 letter alphabet • ACDEFGHIKLMNPQRSTVWY but not BJOUXZ • Strings of ~300 aa in an average protein (in bacteria), ~200 aa in a domain • ~200 K known protein sequences
      FASTA format, Insulin
      > Insulin:Homo Sapiens: AAA59172
      1 malwmrllpl lallalwgpd paaafvnqhl cgshlvealy lvcgergffy tpktrreaed
      61 lqvgqvelgg gpgagslqpl alegslqkrg iveqcctsic slyqlenycn
      Dihydrofolate Reductase
      d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI
      d8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI
      d4dfra_ ISLIAALAVDRVIGMENAMPWN-LPADLAWFKRNTL--------NKPVIMGRHTWESI
      d3dfr__ TAFLWAQDRDGLIGKDGHLPWH-LPDDLHYFRAQTV--------GKIMVVGRRTYESF

      d1dhfa_ VPEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHP
      d8dfr__ VPEKNRPLKDRINIVLSRELKEAPKGAHYLSKSLDDALALLDSPELKSKVDMVWIVGGTAVYKAAMEKP
      d4dfra_ ---G-RPLPGRKNIILS-SQPGTDDRV-TWVKSVDEAIAACGDVP------EIMVIGGGRVYEQFLPKA
      d3dfr__ ---PKRPLPERTNVVLTHQEDYQAQGA-VVVHDVAAVFAYAKQHLDQ----ELVIAGGAQIFTAFKDDV

  25. Molecular Biology Information: Macromolecular Structure • DNA/RNA/Protein • Almost all protein (RNA Adapted From D Soll Web Page, Right Hand Top Protein from M Levitt web page)

  26. Molecular Biology Information: Protein Structure Details • Statistics on Number of XYZ triplets • 200 residues/domain -> 200 CA atoms, separated by 3.8 A • Avg. Residue size is Leu: 4 backbone atoms + 4 sidechain atoms, 150 cubic A or a bead of diameter 6.6 A • => ~1500 xyz triplets (=8x200) per protein domain
      PDB format (columns labeled on the slide: Atom #, Residue, Residue #, X, Y, Z)
      ATOM 1 C ACE 0 9.401 30.166 60.595 1.00 49.88 1GKY 67
      ATOM 2 O ACE 0 10.432 30.832 60.722 1.00 50.35 1GKY 68
      ATOM 3 CH3 ACE 0 8.876 29.767 59.226 1.00 50.04 1GKY 69
      ATOM 4 N SER 1 8.753 29.755 61.685 1.00 49.13 1GKY 70
      ATOM 5 CA SER 1 9.242 30.200 62.974 1.00 46.62 1GKY 71
      ATOM 6 C SER 1 10.453 29.500 63.579 1.00 41.99 1GKY 72
      ATOM 7 O SER 1 10.593 29.607 64.814 1.00 43.24 1GKY 73
      ATOM 8 CB SER 1 8.052 30.189 63.974 1.00 53.00 1GKY 74
      ATOM 9 OG SER 1 7.294 31.409 63.930 1.00 57.79 1GKY 75
      ATOM 10 N ARG 2 11.360 28.819 62.827 1.00 36.48 1GKY 76
      ATOM 11 CA ARG 2 12.548 28.316 63.532 1.00 30.20 1GKY 77
      ATOM 12 C ARG 2 13.502 29.501 63.500 1.00 25.54 1GKY 78
      ...
      ATOM 1444 CB LYS 186 13.836 22.263 57.567 1.00 55.06 1GKY1510
      ATOM 1445 CG LYS 186 12.422 22.452 58.180 1.00 53.45 1GKY1511
      ATOM 1446 CD LYS 186 11.531 21.198 58.185 1.00 49.88 1GKY1512
      ATOM 1447 CE LYS 186 11.452 20.402 56.860 1.00 48.15 1GKY1513
      ATOM 1448 NZ LYS 186 10.735 21.104 55.811 1.00 48.41 1GKY1514
      ATOM 1449 OXT LYS 186 16.887 23.841 56.647 1.00 62.94 1GKY1515
      TER 1450 LYS 186 1GKY1516
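The claim that consecutive CA atoms sit about 3.8 A apart can be checked against the two CA records shown on this slide. The Python below is a minimal sketch: the helper names are hypothetical, and it splits on whitespace because the transcript has lost the fixed column positions of real PDB files (a production parser should read fixed-width columns or use an established library).

```python
import math

def parse_atom_line(line):
    """Parse one whitespace-separated ATOM record as printed on the slide.
    Assumed field order: ATOM, atom #, atom name, residue, residue #, X, Y, Z, ..."""
    fields = line.split()
    if not fields or fields[0] != "ATOM":
        return None
    return {
        "serial": int(fields[1]),
        "name": fields[2],
        "residue": fields[3],
        "res_num": int(fields[4]),
        "xyz": tuple(float(v) for v in fields[5:8]),
    }

def distance(atom_a, atom_b):
    """Euclidean distance between two parsed atoms, in the file's units (angstroms)."""
    return math.dist(atom_a["xyz"], atom_b["xyz"])

# The two CA records from the slide (SER 1 and ARG 2 of 1GKY).
ca_ser = parse_atom_line("ATOM 5 CA SER 1 9.242 30.200 62.974 1.00 46.62 1GKY 71")
ca_arg = parse_atom_line("ATOM 11 CA ARG 2 12.548 28.316 63.532 1.00 30.20 1GKY 77")

ca_spacing = distance(ca_ser, ca_arg)
print(round(ca_spacing, 2))
```

The computed spacing comes out close to the 3.8 A quoted in the bullet list, which is what makes "one CA atom per residue" a useful coarse description of a protein chain.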

  27. Molecular Biology Information: Whole Genomes • The Revolution Driving Everything: Fleischmann, R. D., Adams, M. D., White, O., Clayton, R. A., Kirkness, E. F., Kerlavage, A. R., Bult, C. J., Tomb, J. F., Dougherty, B. A., Merrick, J. M., McKenney, K., Sutton, G., Fitzhugh, W., Fields, C., Gocayne, J. D., Scott, J., Shirley, R., Liu, L. I., Glodek, A., Kelley, J. M., Weidman, J. F., Phillips, C. A., Spriggs, T., Hedblom, E., Cotton, M. D., Utterback, T. R., Hanna, M. C., Nguyen, D. T., Saudek, D. M., Brandon, R. C., Fine, L. D., Fritchman, J. L., Fuhrmann, J. L., Geoghagen, N. S. M., Gnehm, C. L., McDonald, L. A., Small, K. V., Fraser, C. M., Smith, H. O. & Venter, J. C. (1995). "Whole-genome random sequencing and assembly of Haemophilus influenzae Rd." Science 269: 496-512. • “Genome sequences now accumulate so quickly that, in less than a week, a single laboratory can produce more bits of data than Shakespeare managed in a lifetime, although the latter make better reading.” -- G A Petsko, Nature 401: 115-116 (1999) • Integrative Data • 1995, HI (bacteria): 1.6 Mb & 1600 genes done • 1997, yeast: 13 Mb & ~6000 genes • 1998, worm: ~100 Mb with 19 K genes • 1999: >30 completed genomes! • 2003, human: 3 Gb & $3 billion, 14 years • 2007, Watson: 3 Gb & $1 million, 2 months

  28. Molecular Biology Information - Microarray Scan Hybridize cDNAs Biotin Label Affymetrix GeneChip® 55,000 transcripts Gives levels of gene expression but the signal is noisy

  29. Molecular Biology Information - Epigenetics • Raw DNA Sequence • Coding or Not? • Promoter sites? • Exons • Introns • Parse into genes? • 4 bases: AGCT • ~1 K in a gene • ~2 M in genome • ~3 Gb Human

  30. Molecular Biology Information: Other Integrative Data • Information to understand genomes • Metabolic Pathways (glycolysis), traditional biochemistry • Regulatory Networks • Whole Organisms Phylogeny, traditional zoology • Environments, Habitats, ecology • The Literature (MEDLINE) • The Future.... (Pathway drawing from P Karp’s EcoCyc, Phylogeny from S J Gould, Dinosaur in a Haystack)

  31. What is Bioinformatics? • (Molecular) Bio - informatics • One idea for a definition? Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical-chemistry) and then applying “informatics” techniques (derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules, on a large-scale. • Bioinformatics is a practical discipline with many applications. (Gerstein)

  32. Large-scale Information: GenBank Growth Core

  33. Large-scale Information: GenBank Growth Core • Bioinformatics Is Essential Now

  34. Large-scale Information: Exponential Growth of Data Matched by Development of Computer Technology • CPU vs Disk & Net • As important as the increase in computer speed has been, the ability to store large amounts of information on computers is even more crucial • Driving Force in Bioinformatics (Internet picture adapted from D Brutlag, Stanford) [Plot labels: Internet Hosts; Num. Protein Domain Structures] Core

  35. Large-scale: Features per Slide • 65,000 genes x 10^6 probes = 65,000,000,000 features [Plot labels: features per chip; transistors; oligo features]

  36. Large-scale:Bioinformatics is born! (courtesy of Finn Drablos)

  37. What is Bioinformatics? • (Molecular) Bio - informatics • One idea for a definition? Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical-chemistry) and then applying “informatics” techniques (derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules, on a large-scale. • Bioinformatics is a practical discipline with many applications. (Gerstein)

  38. Organizing Molecular Biology Information: Redundancy and Multiplicity • Different Sequences Have the Same Structure • Organism has many similar genes • Single Gene May Have Multiple Functions • Genes are grouped into Pathways • Genomic Sequence Redundancy due to the Genetic Code • How do we find the similarities? Core • Integrative Genomics: genes, structures, functions, pathways, expression levels, regulatory systems, …

  39. 'Omics: studying populations of molecules in a database framework

  40. 'Omics: studying populations of molecules in a database framework

  41. Proteome PubMed Hits Core 'Omics: studying populations of molecules in a database framework

  42. How Do We Organize Information in an OME? • Controlled Vocabulary • Ranked Tree Structure • Biota • Eukarya • Animalia • Chordata • Mammalia • Primates • Hominidae • Homo @ Homo sapiens • Google: Brin and Page’s (PageRank)
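One way to picture a controlled vocabulary arranged as a ranked tree is to store each organism as a root-to-leaf list of terms. The Python below is a minimal sketch: the human lineage is taken from this slide, while the mouse lineage and the function name `last_shared_term` are additions made here purely for illustration.

```python
# Lineage from the slide, ordered from the root of the tree to the species.
HUMAN = ["Biota", "Eukarya", "Animalia", "Chordata", "Mammalia",
         "Primates", "Hominidae", "Homo", "Homo sapiens"]

# Second lineage added for illustration (not on the slide).
MOUSE = ["Biota", "Eukarya", "Animalia", "Chordata", "Mammalia",
         "Rodentia", "Muridae", "Mus", "Mus musculus"]

def last_shared_term(lineage_a, lineage_b):
    """Return the deepest term shared by two root-to-leaf lineages, or None."""
    shared = None
    for term_a, term_b in zip(lineage_a, lineage_b):
        if term_a != term_b:
            break
        shared = term_a
    return shared

print(last_shared_term(HUMAN, MOUSE))
```

Because every entry uses the same fixed vocabulary and the same ranking, a question such as "what group do these two organisms share?" reduces to comparing two lists position by position.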

  43. A Parts List Approach to Bike Maintenance Extra

  44. A Parts List Approach to Bike Maintenance How many roles can these play? How flexible and adaptable are they mechanically? Core What are the shared parts (bolt, nut, washer, spring, bearing), unique parts (cogs, levers)? What are the common parts -- types of parts (nuts & washers)? Extra Where are the parts located?

  45. Organize: ENCODE Project The Encyclopedia of DNA Elements (ENCODE) Consortium is an international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). The goal of the consortium is to build a comprehensive parts list of the functional elements of the human genome, including elements that act at the protein level (coding genes) and RNA level (non-coding genes), and regulatory elements that control the cells and circumstances in which a gene is active. The results of ENCODE experiments, collected in the ENCODE DCC database, are displayed on the UCSC Genome Browser.

  46. Organize: Parts = Disease Genes http://www.ncbi.nlm.nih.gov/OMIM/

  47. Organizing Information: Cross-references http://www.ncbi.nlm.nih.gov/Database/ Core • Homework this week: OMIM

  48. End of First Lecture 2009 Remaining Slides Lecture 2

  49. What is Bioinformatics? • (Molecular) Bio - informatics • One idea for a definition? Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical-chemistry) and then applying “informatics” techniques (derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules, on a large-scale. • Bioinformatics is a practical discipline with many applications. (Gerstein)

  50. General Types of “Informatics” Techniques in Bioinformatics • Databases: Building, Querying; Object DB • Text String Comparison: Controlled Vocabulary; 1D Alignment (Blast); Significance Statistics • Finding Patterns: AI / Machine Learning; Statistical Models (HMM); Clustering; Datamining • Geometry: Robotics; Graphics (Surfaces, Volumes); Comparison and 3D Matching (Vision, recognition) • Physical Simulation: Newtonian Mechanics; Electrostatics; Numerical Algorithms; Simulation
