
Introduction

Introduction, based on Chapter 1 of Lesk, Introduction to Bioinformatics. Contents: molecular biology primer; the role of computer science; phylogeny; sequence searching; protein structure; clinical implications. Read chapter 1. 26 June 2000: draft of the human genome sequenced.

alcina


  1. Introduction based on Chapter 1 Lesk, Introduction to Bioinformatics

  2. Contents • Molecular biology primer • The role of computer science • Phylogeny • Sequence Searching • Protein structure • Clinical implications • Read chapter 1

  3. 26 June 2000: Draft of Human genome sequenced! • 1953: Watson and Crick discover the structure of DNA • 2000: Draft of human genome is published • “The most wondrous map ever produced by human kind” • “One of the most significant scientific landmarks of all time, comparable with the invention of the wheel or the splitting of the atom”

  4. High-throughput biomedicine • Microarrays • Measure activity of thousands of genes at the same time • Example: • Cancer • Compare activity with and without drug treatment • Result: Hundreds of candidate drug targets • RNAi (Nobel Prize 2006, Fire and Mello) • Knock down genes and observe effect • Example: • Infectious diseases • Which proteins orchestrate entry into cell? • Result: Hundreds of candidate proteins • Atomic force microscopes (invented by Nobel laureate Binnig) • Pull protein out of membrane and measure force • Example: • Eye diseases resulting from misfolding • Result: Hundreds of candidate residues

  5. Drug Discovery • Challenge: Longer time to market, fewer drugs, exploding costs • Approach: Use of compound libraries and high-throughput screening

  6. HTS and Bioinformatics • High-throughput technologies have completely changed the work of biomedical researchers • Challenge: Interpret (often large) results of screens • Approach: Before running secondary assays use bioinformatics and IT to assemble all possible information

  7. Good News >1.000.000 Sequences >16.000.000 Articles >700 DBs/Tools >30.000 3D Structures

  8. Bad News: Data != Knowledge • How to analyse data, how to integrate data? • Computer science to the rescue…

  9. Example: computer science is key for sequencing • Human genome is a string of length 3.200.000.000 • Shotgun sequencing: Break multiple copies of string into shorter substrings • Example: • shotgunsequencing shotgunsequencing shotgunsequencing • cing en encing equ gun ing ns otgu seq sequ sh sho shot tg uenc un • Computing problem: Assemble strings

  10. Computer science key for sequencing • sh • sho • shot • otgu • tg • gun • un • ns • seq • sequ • equ • uenc • encing • en • cing • ing QUESTION: How can you handle long repetitive sequences? Heeeeelllllllllllooooooo QUESTION: Why was a draft announced? When was the final version ready?
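
The assembly problem on the two slides above can be sketched as a greedy overlap merge. This is only a toy illustration under simple assumptions (error-free fragments, exact overlaps); the function names `overlap` and `greedy_assemble` are hypothetical and do not come from the lecture, and real assemblers are considerably more involved.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments):
    """Toy assembler: repeatedly merge the pair with the largest overlap."""
    # Drop duplicates and fragments fully contained in a longer fragment.
    kept = []
    for frag in sorted(set(fragments), key=len, reverse=True):
        if not any(frag in other for other in kept):
            kept.append(frag)
    while len(kept) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(kept):
            for j, b in enumerate(kept):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Glue the two fragments together, writing the shared part once.
        merged = kept[best_i] + kept[best_j][best_k:]
        kept = [f for n, f in enumerate(kept) if n not in (best_i, best_j)]
        kept.append(merged)
    return kept[0] if kept else ""


# Fragments from the slide.
FRAGMENTS = "cing en encing equ gun ing ns otgu seq sequ sh sho shot tg uenc un".split()
assembled = greedy_assemble(FRAGMENTS)
```

The result always contains every fragment as a substring, but ties between equally long overlaps are broken arbitrarily, so the greedy merge is not guaranteed to return the original string. Long repeats make this ambiguity worse, which is the point of the first question on the slide.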

  11. Arabidopsis thaliana • Buchnera sp. APS • Yersinia pestis • Aquifex aeolicus • Archaeoglobus fulgidus • Borrelia burgdorferi • Mycobacterium tuberculosis • Vibrio cholerae • Caenorhabditis elegans • Campylobacter jejuni • Chlamydia pneumoniae • Drosophila melanogaster • Escherichia coli • Neisseria meningitidis Z2491 • Plasmodium falciparum • Ureaplasma urealyticum • Helicobacter pylori • Mycobacterium leprae • Pseudomonas aeruginosa • mouse • Bacillus subtilis • Thermotoga maritima • Xylella fastidiosa • Rickettsia prowazekii • Saccharomyces cerevisiae • Salmonella enterica • rat • Thermoplasma acidophilum

  12. Breakthrough of the year 2000 • Next quest: Sequencing a genome for $1000

  13. Quantity and quality of data lead to ambitious goals • Understand integrative aspects of the biology of organisms • Interrelate sequence, three-dimensional structure, interactions, function of proteins, nucleic acids and protein-nucleic acid complexes • Travel in time • backward (deduce events in evolutionary history) and • forward (deliberate modification of biological systems) • Applications in medicine, agriculture, and other scientific fields

  14. Scenario: New virus (e.g. SARS) and goal to develop treatment • Scientists isolate genetic material of virus • Screen genome for relationships with previously studied viruses [10] • From virus’ DNA they compute the proteins it produces [1] • Compute proteins’ three-dimensional structure and thereby obtain clues about their functions • Screen for similar protein sequences with known structure [15] • If any are found, then interpret differences (homology modelling) [25] • Else predict structure from sequence [55] • Identify or design small molecule blocking relevant active sites of the protein [50] • Design antibodies to neutralize the virus [50] • Index of problem difficulty: <30: solution exists already, >30: we cannot solve this (yet)

  15. Life in Time and Space • Life • A biological organism is a naturally-occurring, self-reproducing device that effects controlled manipulations of matter, energy and information • Time • Species evolve through • natural mutation, • recombination of genes in sexual reproduction, or • direct gene transfer • Read the past in contemporary genomes • Space • Species occupy local ecosystems • Species are composed of organisms • Organisms are composed of cells • Cells are composed of molecules

  16. DNA – the molecule of life http://www.ornl.gov/hgmis

  17. Proteins • 20 naturally occurring amino acids in proteins • Non-polar • G glycine, A alanine, P proline, V valine • I isoleucine, L leucine, F phenylalanine, M methionine • Polar • S serine, C cysteine, T threonine, N asparagine • Q glutamine, H histidine, Y tyrosine, W tryptophan • Charged • D aspartic acid, E glutamic acid, K lysine, R arginine • Other classification • H,F,Y,W are aromatic and play role in membrane proteins • Distinguish • atg = adenine-thymine-guanine and • ATG = Alanine-Threonine-Glycine

  18. The genetic code

  19. Protein Structure • DNA: • Nucleotides are very similar and hence the structure of DNA is very uniform • Proteins: • Great variety in three-dimensional conformation to support diverse structure and functions • If heated, protein “unfolds” to biologically-inactive structure; in normal conditions protein folds

  20. Paradox • Translation from DNA sequence to amino acid sequence • is very simple to describe, • but requires immensely complicated machinery (ribosome, tRNA) • The folding of the protein sequence into its three-dimensional structure • is very difficult to describe • But occurs spontaneously
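
To illustrate how simple the translation step is to describe, here is a minimal sketch that applies the standard genetic code to a DNA string. The names `CODON_TABLE` and `translate` are hypothetical, the input is assumed to contain only a, c, g and t, and the reading frame is assumed to start at the first base. Lower-case input is treated as nucleotides and the upper-case output as one-letter amino acid codes.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"   # TTT .. TGG
               "LLLLPPPPHHQQRRRR"   # CTT .. CGG
               "IIIMTTTTNNKKSSRR"   # ATT .. AGG
               "VVVVAAAADDEEGGGG")  # GTT .. GGG
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate DNA codon by codon until a stop codon ('*') or the end."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

For example, `translate("atggcctaa")` reads the codons atg, gcc, taa and stops at the stop codon. The whole procedure is a table lookup, in contrast to folding, for which no comparably short description exists.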

  21. Central Dogma • DNA sequence determines protein sequence • Protein sequence determines protein structure • Protein structure determines protein function

  22. Observables and Data Archives • Databases in molecular biology cover • Nucleic acid and protein sequences, • Macromolecular structures and functions • Archival databanks of biological information • DNA and protein sequences including annotations • Nucleic acid and protein structures including annotations • Protein expression patterns • Derived Databases • Sequence motifs (“signatures” of protein families) • Mutations and variants in DNA and protein sequences • Classification or relationships (e.g. hierarchy of structures) • Bibliographic databases (PubMed with 17M abstracts) • Collections • of links to web sites • of databases

  23. What is Bioinformatics • Bioinformatics is the marriage of biology and information technology • Bioinformatics is an integrated multidisciplinary field • Covers computational tools and methods for managing, analysing and manipulating sets of biological data • Disciplines include: • biochemistry, genetics, structural biology, artificial intelligence, machine learning, software engineering, statistics, database theory, information visualisation, algorithm design

  24. Bioinformatics • Has three components • Creation of databases • Development of algorithms to analyse data • Use of these tools for analysing biological data

  25. Databases: Types of Queries 1/2 • 1. Given a sequence (fragment), find sequences in the database that are similar to it • 2. Given a protein structure (or fragment), find protein structures in the database that are similar to it • 3. Given sequence of a protein of unknown structure, find structures in the database that adopt similar three-dimensional structures • 4. Given a protein structure, find sequences in the database that correspond to similar structures.

  26. Databases: Given sequence, find structure • 3. Given sequence of a protein of unknown structure, find structures in the database that adopt similar three-dimensional structures. But how? • Easy: Find similar sequences with known structure! • But: There might be similar structures whose sequences are not similar! • 4. Given a protein structure, find sequences in the database that correspond to similar structures. But how? • Easy: Find similar structures and hence sequences • But: There are so many more sequences with unknown structure that the above method will have only very limited success • 1 and 2 are solved, 3 and 4 are active fields of research

  27. Databases: Types of Queries 2/2 • E.g. for which proteins of known structure involved in disease of disrupted purine biosynthesis in humans, are there related proteins in yeast? • Solution: Virtual databases that provide transparent access to a number of underlying data sources and query and analysis tools

  28. Databases: Curation and Quality • Problems: • Given that there are primary and secondary databases, • how to control updates, • how to propagate change, • how to maintain consistency? • Contents (experimental results, annotations, supplementary information) all have their own sources of error • Older data were limited by older techniques

  29. Databases: Annotation • Experimental data (e.g. raw DNA sequence) needs to be enriched with annotations • Source of data • Investigators responsible • Relevant publication • Feature tables (e.g. coding regions) • Problems: • (often) lack of controlled and coherent vocabulary • Computer parseable • Automated annotation needed • SwissProt = ca. 540.000 annotated sequences • TrEMBL = ca. 40 million unannotated sequences • Maintenance of annotations (what if error detected?)

  30. Computers and Computer Science • Relevant areas: • Artificial Intelligence • Machine Learning: neural networks, rule-based learning • Data mining: association rules • Software Engineering: design, implementation, testing of software • Programming • Object-oriented: C++, Java • Imperative: C, Modula, Pascal, Cobol, Fortran • Logic: Prolog • Functional: ML • Scripting: Perl, Python • Statistics • Database theory: design and maintenance of databases; how to index sequences, time series, 3D structures • Information Visualisation: graph drawing, diagrams, cartoons, 3D graphics • Algorithm design: complexity of algorithms, efficient data structures

  31. Programming • We will use Python • Scripting language • Supports string processing well • Widely used in bioinformatics

  32. Biological Classification and Nomenclature • Back in the 18th century, Linnaeus, a Swedish naturalist, classified living things according to a hierarchy: Kingdom, Phylum, Class, Order, Family, Genus, Species • Generally only genus and species are used for identification • Homo sapiens • Drosophila melanogaster • Bos taurus • Linnaeus’ classification is based on observed similarity • It widely reflects biological ancestry

  33. Classification of Humans and Fruit Flies

                Human       Fruit fly
      Kingdom:  Animalia    Animalia
      Phylum:   Chordata    Arthropoda
      Class:    Mammalia    Insecta
      Order:    Primates    Diptera
      Family:   Hominidae   Drosophilidae
      Genus:    Homo        Drosophila
      Species:  sapiens     melanogaster

  34. Homology = derived from common ancestor • Characteristics derived from a common ancestor are called homologous • E.g. eagle’s wing and human’s arm • Other apparently similar characteristics may have arisen independently by convergent evolution • E.g. eagle’s wing and bee’s wing. The most recent common ancestor of eagles and bees did not have wings • Homologous characters may diverge functionally • E.g. bones in the human middle ear and jaws of primitive fish

  35. Sequence analysis and Homology • Sequence analysis gives unambiguous evidence for relationship of species • For higher organisms sequence analysis and the classical tools of comparative anatomy, palaeontology, and embryology are often consistent • For microorganisms there are problems • Classical methods: how to describe features • Sequence analysis: lateral gene transfer

  36. Domains of Life • Ribosomal RNA is present in all organisms • Based on 16S (small-subunit) ribosomal RNAs life is divided into three domains • Bacteria • No nucleus (prokaryote) • E.g. Mycobacterium tuberculosis and E. coli • Archaea • No nucleus (prokaryote) • few organisms living in hostile environments (thermophiles, halophiles, sulphur reducers, methanogens) • Eukarya • Has a nucleus contained in membrane • Nucleus contains chromosomes • Internal compartments called organelles for specialised biological processes • Area outside nucleus and organelles called cytoplasm • E.g. yeast and human beings

  37. Eukaryotic cell

  38. Domains of Life

  39. Example: Use of sequences to determine phylogenetic relationships • Use ExPASy (www.expasy.ch) to search for pancreatic ribonuclease for • horse (Equus caballus), • minke whale (Balaenoptera acutorostrata), • red kangaroo (Macropus rufus) • >sp|P00674|RNP_HORSE Ribonuclease pancreatic (EC 3.1.27.5) (RNase 1) (RNase A) - Equus caballus (Horse). KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEPLADVQAICLQKNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTSQKERHIIVACEGNPYVPVHFDASVEVST • Use sequence alignment to determine evolutionary relationship

  40. Sequence alignment • Global match: align all of one sequence with all of the other (mismatches, insertions, deletions)

      And.--so,.from.hour.to.hour.we.ripe.and.ripe
      ||||    ||||||||||||||||||||||||   ||||||
      And.then,.from.hour.to.hour.we.rot-.and.rot-

      • Local match: find a region in one sequence that matches the other (mismatches, insertions, deletions; ends can be ignored)

        My.care.is.loss.of.care,.by.old.care.done,
          |||||||||    |||||||||||||   |||||| ||
      Your.care.is.gain.of.care,.by.new.care.won
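
A global match of this kind can be computed by dynamic programming. The sketch below uses the classic Needleman-Wunsch scheme, which the slides do not name; the function name `global_align` and the scores (match +1, mismatch -1, gap -1) are assumptions chosen only for illustration.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for placing a[i-1] opposite b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

Calling `global_align("ABC", "AC")` aligns `ABC` over `A-C`. With other scoring parameters the gaps may be placed differently, which is why the penalties matter in practice.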

  41. Sequence alignment 3. Motif search: • find matches of a short sequence in a long sequence • Options: • perfect match, • 1 mismatch, • mismatches + gaps + insertions + deletions

              match
               ||||
      for the watch to babble and to talk is most tolerable
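
The first two options (perfect match, or a bounded number of mismatches) can be sketched with a sliding window that counts differing positions. The function name `find_motif` is hypothetical, and gaps, insertions and deletions are deliberately not handled in this sketch.

```python
def find_motif(motif, text, max_mismatches=0):
    """Return (start, mismatches) for every window within the mismatch limit."""
    hits = []
    width = len(motif)
    for start in range(len(text) - width + 1):
        window = text[start:start + width]
        # Count positions where motif and window differ.
        mismatches = sum(1 for x, y in zip(motif, window) if x != y)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


TEXT = "for the watch to babble and to talk is most tolerable"
exact = find_motif("match", TEXT)          # no perfect match in this line
one_off = find_motif("match", TEXT, 1)     # finds "watch" at position 8
```

Allowing one mismatch recovers the hit shown on the slide, where `match` lines up with `watch` and only the first letter differs.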

  42. Sequence alignment 4. Multiple sequence alignment

      No.sooner.---met.--------.but.they.look’d
      No.sooner.look’d.--------.but.they.lo-v’d
      No.sooner.lo-v’d.--------.but.they.sigh’d
      No.sooner.sigh’d.--------.but.they.--asked.one.another.the.reason
      No.sooner.knew.the.reason.but.they.-------------sought.the.remedy
      No.sooner.               .but.they.

  43. Example: Multiple alignment • Use sequence alignment to determine evolutionary relationship… • Example: horse, whale and kangaroo • Expected: horse and whale are placental mammals, kangaroo is marsupial • Multiple alignment with CLUSTAL-W (http://www.genome.jp/tools/clustalw) • multiple sequence alignment computer program • main parameters: gap opening/extension penalty

  44. FASTA format

      >sp|P00674|RNP_HORSE Ribonuclease pancreatic (EC 3.1.27.5) (RNase 1) (RNase A) - Equus caballus (Horse).
      KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEPLADVQAICLQ
      KNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTSQKERHIIVACEGNPYVPVHF
      DASVEVST
      >sp|P00673|RNP_BALAC Ribonuclease pancreatic (EC 3.1.27.5) (RNase 1) (RNase A) - Balaenoptera acutorostrata (Minke whale) (Lesser rorqual).
      RESPAMKFQRQHMDSGNSPGNNPNYCNQMMMRRKMTQGRCKPVNTFVHESLEDVKAVCSQ
      KNVLCKNGRTNCYESNSTMHITDCRQTGSSKYPNCAYKTSQKEKHIIVACEGNPYVPVHF
      DNSV
      >sp|P00686|RNP_MACRU Ribonuclease pancreatic (EC 3.1.27.5) (RNase 1) (RNase A) - Macropus rufus (Red kangaroo) (Megaleia rufa).
      ETPAEKFQRQHMDTEHSTASSSNYCNLMMKARDMTSGRCKPLNTFIHEPKSVVDAVCHQE
      NVTCKNGRTNCYKSNSRLSITNCRQTGASKYPNCQYETSNLNKQIIVACEGQYVPVHFDA
      YV
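
A FASTA record is a header line starting with `>` followed by one or more sequence lines. A minimal parser can be sketched as follows; the function name `parse_fasta` is hypothetical, and libraries such as Biopython offer more complete readers.

```python
def parse_fasta(text):
    """Map each FASTA header (without '>') to its concatenated sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines are wrapped; collect them for joining later.
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


EXAMPLE = """
>sp|P00674|RNP_HORSE Ribonuclease pancreatic (EC 3.1.27.5) (RNase 1) (RNase A) - Equus caballus (Horse).
KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEPLADVQAICLQ
KNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTSQKERHIIVACEGNPYVPVHF
DASVEVST
"""
horse = parse_fasta(EXAMPLE)
```

The dictionary `horse` then holds one entry whose value is the horse ribonuclease sequence as a single string, ready to be passed to an alignment routine.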

  45. Multiple Alignment with ClustalW (http://www.genome.jp/tools/clustalw)

      CLUSTAL W (1.82) multiple sequence alignment

      sp|P00674|RNP_HORSE  KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEPLADVQAICLQ 60
      sp|P00673|RNP_BALAC  RESPAMKFQRQHMDSGNSPGNNPNYCNQMMMRRKMTQGRCKPVNTFVHESLEDVKAVCSQ 60
      sp|P00686|RNP_MACRU  -ETPAEKFQRQHMDTEHSTASSSNYCNLMMKARDMTSGRCKPLNTFIHEPKSVVDAVCHQ 59
                            *:** **:*****:  :......*** **  *.**.* ***:***:**.   *.*:* *

      sp|P00674|RNP_HORSE  KNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTSQKERHIIVACEGNPYVPVHF 120
      sp|P00673|RNP_BALAC  KNVLCKNGRTNCYESNSTMHITDCRQTGSSKYPNCAYKTSQKEKHIIVACEGNPYVPVHF 120
      sp|P00686|RNP_MACRU  ENVTCKNGRTNCYKSNSRLSITNCRQTGASKYPNCQYETSNLNKQIIVACEG-QYVPVHF 118
                           :*: ****::***:*.* : **:** *..****** *:**: :::*******  ******

      sp|P00674|RNP_HORSE  DASVEVST 128
      sp|P00673|RNP_BALAC  DNSV---- 124
      sp|P00686|RNP_MACRU  DAYV---- 122
                           *  *

  46. Example: Number of Aligned Residues • Horse and Minke whale: 95 • Minke whale and Red kangaroo: 82 • Horse and Red kangaroo: 75 • Conclusion: Horse and whale share the most identical residues
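
These counts can be reproduced by counting identical columns in the aligned rows of the ClustalW output on the previous slide. The sketch below copies the three aligned rows, gaps included; the function name `count_identities` is hypothetical.

```python
# Aligned rows from the ClustalW output (gaps shown as '-').
HORSE = ("KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEPLADVQAICLQ"
         "KNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTSQKERHIIVACEGNPYVPVHF"
         "DASVEVST")
WHALE = ("RESPAMKFQRQHMDSGNSPGNNPNYCNQMMMRRKMTQGRCKPVNTFVHESLEDVKAVCSQ"
         "KNVLCKNGRTNCYESNSTMHITDCRQTGSSKYPNCAYKTSQKEKHIIVACEGNPYVPVHF"
         "DNSV----")
KANGAROO = ("-ETPAEKFQRQHMDTEHSTASSSNYCNLMMKARDMTSGRCKPLNTFIHEPKSVVDAVCHQ"
            "ENVTCKNGRTNCYKSNSRLSITNCRQTGASKYPNCQYETSNLNKQIIVACEG-QYVPVHF"
            "DAYV----")


def count_identities(a, b):
    """Count alignment columns in which both rows show the same residue."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    # Columns where both rows have a gap are not counted as identities.
    return sum(1 for x, y in zip(a, b) if x == y and x != "-")


horse_whale = count_identities(HORSE, WHALE)
whale_kangaroo = count_identities(WHALE, KANGAROO)
horse_kangaroo = count_identities(HORSE, KANGAROO)
```

Run on these rows, the three values agree with the numbers on the slide, so horse and whale, the two placental mammals, share the most identical residues.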

  47. New Example: Elephant and Mammoth • Mitochondrial cytochrome b from • Siberian woolly mammoth (Mammuthus primigenius) preserved in arctic permafrost • African elephant (Loxodonta africana) • Indian elephant (Elephas maximus) Q: To which one is the mammoth more closely related?

  48. African elephant: sp|P24958|CYB_LOXAF • Mammoth: sp|P92658|CYB_MAMPR • Indian elephant: sp|O47885|CYB_ELEMA

      sp|P24958|CYB_LOXAF  MTHIRKSHPLLKIINKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
      sp|P92658|CYB_MAMPR  MTHIRKSHPLLKILNKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
      sp|O47885|CYB_ELEMA  MTHTRKFHPLFKIINKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
                           *** ** ***:**:**********************************************

      sp|P24958|CYB_LOXAF  TAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLL 120
      sp|P92658|CYB_MAMPR  TAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLL 120
      sp|O47885|CYB_ELEMA  TAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLL 120
                           ************************************************************

      sp|P24958|CYB_LOXAF  LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFA 180
      sp|P92658|CYB_MAMPR  LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTDLVEWIWGGFSVDKATLNRFFA 180
      sp|O47885|CYB_ELEMA  LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFA 180
                           **************************************:*********************

      sp|P24958|CYB_LOXAF  LHFILPFTMIALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLL 240
      sp|P92658|CYB_MAMPR  LHFILPFTMIALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILFLL 240
      sp|O47885|CYB_ELEMA  FHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLL 240
                           :********:***********************************************:**

      sp|P24958|CYB_LOXAF  LLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALLLSILI 300
      sp|P92658|CYB_MAMPR  LLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALLLSILI 300
      sp|O47885|CYB_ELEMA  LLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSILI 300
                           ******************************************************:*****

      sp|P24958|CYB_LOXAF  LGLMPLLHTSKHRSMMLRPLSQVLFWTLTMDLLTLTWIGSQPVEYPYIIIGQMASILYFS 360
      sp|P92658|CYB_MAMPR  LGIMPLLHTSKHRSMMLRPLSQVLFWTLATDLLMLTWIGSQPVEYPYIIIGQMASILYFS 360
      sp|O47885|CYB_ELEMA  LGLMPLLHTSKHRSMMLRPLSQVLFWTLTMDLLTLTWIGSQPVEHPYIIIGQMASILYFS 360
                           **:*************************: *** **********:***************

      sp|P24958|CYB_LOXAF  IILAFLPIAGVIENYLIK 378
      sp|P92658|CYB_MAMPR  IILAFLPIAGMIENYLIK 378
      sp|O47885|CYB_ELEMA  IILAFLPIAGMIENYLIK 378
                           **********:*******

  49. Example: Elephant and Mammoth • Mammoth and African elephant have 10 mismatches, • Mammoth and Indian elephant have 14. • Significant? Q1: Can we tell from these sequences alone that they are closely related? Q2: The differences are small: do they come from selection, random noise or drift? • Strategies are needed for judging the significance of similarities

  50. Excursion: Similarity and Homology • Important difference: • Similarity is the measurement of resemblance of sequences • Homology: common ancestor • Similarity is gradual, homology is either true or false • Similarity = now, homology = past events • Homology is only very rarely directly observed (e.g. lab population, clinical study of viral infection) • Homology is inferred from sequence similarity
