Introduction based on Chapter 1 Lesk, Introduction to Bioinformatics
Contents • Molecular biology primer • The role of computer science • Phylogeny • Sequence Searching
26 June 2000: Draft of Human genome sequenced! • 1953: Watson and Crick discover the structure of DNA • 2000: Draft of human genome is announced (published in February 2001) • “The most wondrous map ever produced by humankind” • “One of the most significant scientific landmarks of all time, comparable with the invention of the wheel or the splitting of the atom”
High-throughput biomedicine • Microarrays • Measure activity of thousands of genes at the same time • Example: • Cancer • Compare activity with and without drug treatment • Result: Hundreds of candidate drug targets • RNAi (Nobel Prize 2006, Fire and Mello) • Knock down genes and observe effect • Example: • Infectious diseases • Which proteins orchestrate entry into cell? • Result: Hundreds of candidate proteins • Atomic force microscopes (co-invented by Binnig, Nobel Prize 1986 for the scanning tunneling microscope) • Pull protein out of membrane and measure force • Example: • Eye diseases resulting from misfolding • Result: Hundreds of candidate residues
Drug Discovery • Challenge: Longer time to market, fewer drugs, exploding costs • Approach: Use of compound libraries and high-throughput screening
HTS and Bioinformatics • High-throughput technologies have completely changed the work of biomedical researchers • Challenge: Interpret (often large) results of screens • Approach: Before running secondary assays use bioinformatics and IT to assemble all possible information
Good News • Millions of Sequences • Millions of Articles • Hundreds of DBs/Tools • Tens of thousands of 3D Structures
Bad News: Data != Knowledge • How to analyse data, how to integrate data? • Computer science to the rescue…
Example: computer science is key for sequencing • Human genome is a string of length 3,200,000,000 • Shotgun sequencing: Break multiple copies of string into shorter substrings • Example: • shotgunsequencing shotgunsequencing shotgunsequencing • cing en encing equ gun ing ns otgu seq sequ sh sho shot tg uenc un • Computing problem: Assemble strings
Computer science key for sequencing • sh • sho • shot • otgu • tg • gun • un • ns • seq • sequ • equ • uenc • encing • en • cing • ing QUESTION: How can you handle long repetitive sequences? Heeeeelllllllllllooooooo QUESTION: Why was a draft announced? When was the final version ready?
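The assembly problem on the last two slides can be tried out with a toy program. The sketch below is only one simple heuristic (greedy merging by largest suffix/prefix overlap), not the method used for the human genome; the function names `overlap` and `greedy_assemble` are made up for this illustration.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: list[str]) -> str:
    """Toy assembler: repeatedly merge the pair with the largest overlap."""
    # Remove duplicates and fragments fully contained in another fragment.
    unique = sorted(set(fragments))
    frags = [f for f in unique if not any(f != g and f in g for g in unique)]
    while len(frags) > 1:
        best = (0, 0, 1)  # (overlap length, left index, right index)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = frags[i] + frags[j][k:]
        frags = [f for n, f in enumerate(frags) if n not in (i, j)] + [merged]
    return frags[0]


# Fragments from the slide.
FRAGMENTS = ("cing en encing equ gun ing ns otgu seq sequ "
             "sh sho shot tg uenc un").split()
ASSEMBLED = greedy_assemble(FRAGMENTS)
```

On the slide's fragments this happens to recover the original string, but several merges are decided by ties between one-letter overlaps. With long repeats (the "Heeeeelllllllllllooooooo" question) a greedy choice like this can easily pick the wrong merge.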
Arabidopsis thaliana • Buchnera sp. APS • Yersinia pestis • Aquifex aeolicus • Archaeoglobus fulgidus • Borrelia burgdorferi • Mycobacterium tuberculosis • Vibrio cholerae • Caenorhabditis elegans • Campylobacter jejuni • Chlamydia pneumoniae • Drosophila melanogaster • Escherichia coli • Neisseria meningitidis Z2491 • Plasmodium falciparum • Ureaplasma urealyticum • Helicobacter pylori • Mycobacterium leprae • Pseudomonas aeruginosa • mouse • Bacillus subtilis • Thermotoga maritima • Xylella fastidiosa • Rickettsia prowazekii • Saccharomyces cerevisiae • Salmonella enterica • rat • Thermoplasma acidophilum
DNA – the molecule of life http://www.ornl.gov/hgmis
Protein Structure • DNA: • Nucleotides are very similar and hence the structure of DNA is very uniform • Proteins: • Great variety in three-dimensional conformation to support diverse structures and functions • If heated, a protein “unfolds” into a biologically inactive structure; under normal conditions the protein folds
Paradox • Translation from DNA sequence to amino acid sequence • is very simple to describe, • but requires immensely complicated machinery (ribosome, tRNA) • The folding of the protein sequence into its three-dimensional structure • is very difficult to describe, • but occurs spontaneously
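To illustrate how simple the translation step is to describe, here is a minimal sketch that builds the standard genetic code as a lookup table and translates a DNA coding strand codon by codon. The names `BASES`, `AMINO_ACIDS`, `CODON_TABLE` and `translate` are chosen for this example; the sketch assumes the reading frame starts at position 0 and skips the mRNA intermediate.

```python
# Standard genetic code, codons enumerated in T, C, A, G order
# (first base varies slowest). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def translate(dna: str) -> str:
    """Translate a coding-strand DNA string until the first stop codon."""
    dna = dna.upper()
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

The whole "algorithm" is a 64-entry table lookup. Nothing comparably short exists for the second half of the paradox, predicting the folded structure from the amino acid sequence.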
Central Dogma • DNA sequence determines protein sequence • Protein sequence determines protein structure • Protein structure determines protein function
Sequence vs. structure similarity Picture from www.jenner.ac.uk/YBF/DanielleTalbot.ppt
Sequence vs. structure similarity High sequence similarity = high structure similarity Picture from www.jenner.ac.uk/YBF/DanielleTalbot.ppt
Sequence vs. structure similarity Low sequence similarity usually low structure similarity Picture from www.jenner.ac.uk/YBF/DanielleTalbot.ppt
Sequence vs. structure similarity Low sequence similarity possibly still high structure similarity Picture from www.jenner.ac.uk/YBF/DanielleTalbot.ppt
Sequence similarity is key concept Similar sequences are a hint for common ancestry and possibly similar function
Sequence similarity is key concept • Example: v-sis vs. PDGF • Example from the early 80s: • v-sis in simian sarcoma virus leads to cancer in infected cells • PDGF in humans is a normal growth factor for cells • v-sis and PDGF are 85% similar • Alignment from: http://pdf.aminer.org/000/244/500/design_and_implementation_of_a_dna_sequence_processor.pdf
Sequence similarity is key concept • If an unknown sequence is found, deduce its function/structure indirectly by finding similar sequences whose function/structure is known • Assumption: Evolution changes sequences “slowly”, often maintaining main features of a sequence’s function/structure
>sp|P00674|RNP_HORSE Ribonuclease pancreatic (EC 3.1.27.5) (RNase 1) (RNase A) - Equus caballus (Horse).
KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEPLADVQAICLQ
KNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTSQKERHIIVACEGNPYVPVHF
DASVEVST
>sp|P00673|RNP_BALAC Ribonuclease pancreatic (EC 3.1.27.5) (RNase 1) (RNase A) - Balaenoptera acutorostrata (Minke whale) (Lesser rorqual).
RESPAMKFQRQHMDSGNSPGNNPNYCNQMMMRRKMTQGRCKPVNTFVHESLEDVKAVCSQ
KNVLCKNGRTNCYESNSTMHITDCRQTGSSKYPNCAYKTSQKEKHIIVACEGNPYVPVHF
DNSV
>sp|P00686|RNP_MACRU Ribonuclease pancreatic (EC 3.1.27.5) (RNase 1) (RNase A) - Macropus rufus (Red kangaroo) (Megaleia rufa).
ETPAEKFQRQHMDTEHSTASSSNYCNLMMKARDMTSGRCKPLNTFIHEPKSVVDAVCHQE
NVTCKNGRTNCYKSNSRLSITNCRQTGASKYPNCQYETSNLNKQIIVACEGQYVPVHFDA
YV
How similar are the sequences?
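The three ribonuclease records above are in FASTA format: a header line starting with ">" followed by sequence lines. A minimal reader might look like the sketch below. The function name `parse_fasta` is hypothetical, and using the first word of the header as the record key is an assumption made here for brevity.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Map each record identifier (first word after '>') to its sequence."""
    chunks: dict[str, list[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # New record: key it by the identifier, e.g. "sp|P00674|RNP_HORSE".
            name = line[1:].split()[0]
            chunks[name] = []
        elif name is not None:
            # Sequence lines of the current record are concatenated later.
            chunks[name].append(line)
    return {key: "".join(parts) for key, parts in chunks.items()}
```

For real work an established library parser is the safer choice; this sketch ignores malformed input and duplicate identifiers.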
Multiple Alignment with ClustalW (www.ebi.ac.uk/clustalw)
CLUSTAL W (1.82) multiple sequence alignment

sp|P00674|RNP_HORSE  KESPAMKFERQHMDSGSTSSSNPTYCNQMMKRRNMTQGWCKPVNTFVHEPLADVQAICLQ 60
sp|P00673|RNP_BALAC  RESPAMKFQRQHMDSGNSPGNNPNYCNQMMMRRKMTQGRCKPVNTFVHESLEDVKAVCSQ 60
sp|P00686|RNP_MACRU  -ETPAEKFQRQHMDTEHSTASSSNYCNLMMKARDMTSGRCKPLNTFIHEPKSVVDAVCHQ 59
                      *:** **:*****:  :......*** **  *.**.* ***:***:**.   *.*:* *

sp|P00674|RNP_HORSE  KNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTSQKERHIIVACEGNPYVPVHF 120
sp|P00673|RNP_BALAC  KNVLCKNGRTNCYESNSTMHITDCRQTGSSKYPNCAYKTSQKEKHIIVACEGNPYVPVHF 120
sp|P00686|RNP_MACRU  ENVTCKNGRTNCYKSNSRLSITNCRQTGASKYPNCQYETSNLNKQIIVACEG-QYVPVHF 118
                     :*: ****::***:*.* : **:** *..****** *:**: :::*******  ******

sp|P00674|RNP_HORSE  DASVEVST 128
sp|P00673|RNP_BALAC  DNSV---- 124
sp|P00686|RNP_MACRU  DAYV---- 122
                     *  *
Example: Number of Identical Aligned Residues • Horse and Minke whale: 95 • Minke whale and Red kangaroo: 82 • Horse and Red kangaroo: 75 • Conclusion: Horse and whale share the most identical residues • Horse and whale are placental, kangaroo is marsupial
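Counts like the ones above can be obtained by walking down the columns of two aligned rows. This is a minimal sketch, assuming both rows come from the same alignment (equal length, "-" for gaps); the function name `count_identities` is made up for this example.

```python
def count_identities(row_a: str, row_b: str, gap: str = "-") -> tuple[int, int]:
    """Return (identical residues, aligned residue pairs) for two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    identical = 0
    aligned = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            continue  # a gap column is neither a match nor a mismatch here
        aligned += 1
        if a == b:
            identical += 1
    return identical, aligned


# Last block of the horse / minke whale alignment.
LAST_BLOCK = count_identities("DASVEVST", "DNSV----")
```

For the last alignment block this gives 3 identical residues out of 4 aligned pairs. Summing over all blocks of two rows yields the pairwise totals.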
Example: Elephant and Mammoth • Mitochondrial cytochrome b from • Siberian woolly mammoth (Mammuthus primigenius) preserved in arctic permafrost • African elephant (Loxodonta africana) • Indian elephant (Elephas maximus)
African elephant: sp|P24958|CYB_LOXAF
Mammoth: sp|P92658|CYB_MAMPR
Indian elephant: sp|O47885|CYB_ELEMA

CYB_LOXAF  MTHIRKSHPLLKIINKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
CYB_MAMPR  MTHIRKSHPLLKILNKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
CYB_ELEMA  MTHTRKFHPLFKIINKSFIDLPTPSNISTWWNFGSLLGACLITQILTGLFLAMHYTPDTM 60
           *** ** ***:**:**********************************************

CYB_LOXAF  TAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLL 120
CYB_MAMPR  TAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLL 120
CYB_ELEMA  TAFSSMSHICRDVNYGWIIRQLHSNGASIFFLCLYTHIGRNIYYGSYLYSETWNTGIMLL 120
           ************************************************************

CYB_LOXAF  LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFA 180
CYB_MAMPR  LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTDLVEWIWGGFSVDKATLNRFFA 180
CYB_ELEMA  LITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFA 180
           **************************************:*********************

CYB_LOXAF  LHFILPFTMIALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLL 240
CYB_MAMPR  LHFILPFTMIALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILFLL 240
CYB_ELEMA  FHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLL 240
           :********:***********************************************:**

CYB_LOXAF  LLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALLLSILI 300
CYB_MAMPR  LLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALLLSILI 300
CYB_ELEMA  LLALLSPDMLGDPDNYMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSILI 300
           ******************************************************:*****

CYB_LOXAF  LGLMPLLHTSKHRSMMLRPLSQVLFWTLTMDLLTLTWIGSQPVEYPYIIIGQMASILYFS 360
CYB_MAMPR  LGIMPLLHTSKHRSMMLRPLSQVLFWTLATDLLMLTWIGSQPVEYPYIIIGQMASILYFS 360
CYB_ELEMA  LGLMPLLHTSKHRSMMLRPLSQVLFWTLTMDLLTLTWIGSQPVEHPYIIIGQMASILYFS 360
           **:*************************: *** **********:***************

CYB_LOXAF  IILAFLPIAGVIENYLIK 378
CYB_MAMPR  IILAFLPIAGMIENYLIK 378
CYB_ELEMA  IILAFLPIAGMIENYLIK 378
           **********:*******
Example: Elephant and Mammoth • Mammoth and African elephant have 10 mismatches, • mammoth and Indian elephant 14. • Significant?
Similarity and Homology • Important difference: • Similarity is the measurement of resemblance of sequences • Homology: common ancestor • Similarity is gradual, homology is either true or false • Similarity = now, homology = past events • Homology is only very rarely directly observed (e.g. lab population, clinical study of viral infection) • Homology is inferred from sequence similarity
Homology = derived from common ancestor • Characteristics derived from a common ancestor are called homologous • E.g. eagle’s wing and human’s arm • Other apparently similar characteristics may have arisen independently by convergent evolution • E.g. eagle’s wing and bee’s wing. The most recent common ancestor of eagles and bees did not have wings • Homologous characters may diverge functionally • E.g. bones in the human middle ear and jaws of primitive fish
Example: Homology/Similarity • The assertion that the cytochrome b sequences are homologues means that there is a common ancestor • BUT: • 1. Maybe cytochrome b functionally requires so many conserved residues that similar sequences will occur in many species (in fact, this is not the case here) • 2. Maybe cytochrome b has to function this way in elephant-like species, but in fact started out from different ancestors (i.e. convergent evolution) • 3. Maybe mammoth and African elephant have fewer mismatches only because the Indian elephant’s DNA mutated faster • 4. Maybe all of them acquired cytochrome b through a virus (horizontal gene transfer)
Similarity vs. Homology • Any two sequences can be similar • Sequences are homologous if they evolved from a common ancestor • Homologous sequences: • Orthologs: separated by speciation, usually similar biological function • Paralogs: separated by gene duplication, often different biological function, e.g. lysozyme and α-lactalbumin, a mammalian regulatory protein • Assumption: Similarity is an indicator of homology • Note: the altered function of the expressed protein determines whether the organism survives to reproduce, and hence passes on the altered gene
Sequence similarity is key concept • How similar are two sequences? • How to align the sequences? • How to align multiple sequences? • How to find motifs?
Sequence alignment
• Global match: align all of one with all of the other sequence (mismatches, insertions, deletions)
And.--so,.from.hour.to.hour.we.ripe.and.ripe
||||    ||||||||||||||||||||||||   ||||||
And.then,.from.hour.to.hour.we.rot-.and.rot-
• Local match: find region in one sequence that matches the other (mismatches, insertions, deletions; ends can be ignored)
  My.care.is.loss.of.care,.by.old.care.done,
    ||||||||| ||||||||||||| |||||| ||
Your.care.is.gain.of.care,.by.new.care.won
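Both kinds of match are commonly scored with dynamic programming: Needleman-Wunsch for global and Smith-Waterman for local alignment. The slides do not name a method, so the sketch below is an illustration under assumed scores (match +1, mismatch -1, gap -1); the function name `alignment_score` is hypothetical, and only the optimal score is computed, not the alignment itself.

```python
def alignment_score(a: str, b: str, local: bool = False,
                    match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Optimal global (default) or local alignment score of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global: leading gaps are penalised, so the borders are not zero.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # Local: a negative prefix is dropped, so scores never go below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    # Global score is in the bottom-right cell, local score is the table maximum.
    return best if local else score[-1][-1]
```

The only differences between the two modes are the border initialisation, the floor at zero, and where the final score is read from. That is exactly the "ends can be ignored" remark on the slide.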
Sequence alignment
• Motif search: find matches of short sequence in long sequence
• Option: perfect, 1 mismatch, mismatches+gaps+insertions+deletions
        match
         ||||
for the watch to babble and to talk is most tolerable
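The "perfect" and "1 mismatch" options can be sketched as a sliding window that counts mismatching positions. The function name `find_motif` is invented for this example, and gaps, insertions and deletions are deliberately not handled.

```python
def find_motif(text: str, motif: str, max_mismatches: int = 0) -> list[int]:
    """Start positions (0-based) where motif matches text with few mismatches."""
    hits = []
    for start in range(len(text) - len(motif) + 1):
        window = text[start:start + len(motif)]
        # Count positions where the window and the motif disagree.
        mismatches = sum(1 for x, y in zip(window, motif) if x != y)
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits


LINE = "for the watch to babble and to talk is most tolerable"
ONE_MISMATCH_HITS = find_motif(LINE, "match", max_mismatches=1)
```

With one mismatch allowed, "match" is found at the position of "watch". With zero mismatches there is no hit in this line.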
Sequence alignment • Multiple sequence alignment
No.sooner.---met.--------.but.they.look’d
No.sooner.look’d.--------.but.they.lo-v’d
No.sooner.lo-v’d.--------.but.they.sigh’d
No.sooner.sigh’d.--------.but.they.--asked.one.another.the.reason
No.sooner.knew.the.reason.but.they.-------------sought.the.remedy
No.sooner.               .but.they.
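The last row of the multiple alignment shows the columns on which all sequences agree. A minimal sketch of how such a row can be derived is given below. The function name `consensus_row` is hypothetical, shorter rows are assumed to be padded with gaps on the right, and the alignment itself is taken as given rather than computed.

```python
def consensus_row(rows: list[str], gap: str = "-") -> str:
    """Keep a character where every row agrees on a non-gap, else a space."""
    width = max(len(row) for row in rows)
    padded = [row.ljust(width, gap) for row in rows]
    out = []
    for column in zip(*padded):
        first = column[0]
        conserved = first != gap and all(c == first for c in column)
        out.append(first if conserved else " ")
    # Trailing unconserved columns are dropped for readability.
    return "".join(out).rstrip()


ROWS = [
    "No.sooner.---met.--------.but.they.look’d",
    "No.sooner.look’d.--------.but.they.lo-v’d",
    "No.sooner.lo-v’d.--------.but.they.sigh’d",
    "No.sooner.sigh’d.--------.but.they.--asked.one.another.the.reason",
    "No.sooner.knew.the.reason.but.they.-------------sought.the.remedy",
]
CONSENSUS = consensus_row(ROWS)
```

Conserved columns like these are the starting point for the motif question: regions that survive in every sequence are candidates for functionally important patterns.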
Quick check • By now you should • Know the main data sources (sequence and structure) • Know the role that bioinformatics plays • Understand the difference between homology and similarity • Understand what sequence comparison and alignment are