
The Place of Phylogeny in Bioinformatics

The place of phylogeny in bioinformatics. Jean-Michel CLAVERIE, Structural & Genetic Information Laboratory, CNRS UPR2589, Luminy, France. http://www.igs.cnrs-mrs.fr. Journées RNG « Phylogénie ». Sequence bioinformatics: a natural progression.


Presentation Transcript


  1. The place of phylogeny in bioinformatics. Jean-Michel CLAVERIE, Structural & Genetic Information Laboratory, CNRS UPR2589, Luminy, France. http://www.igs.cnrs-mrs.fr. Journées RNG « Phylogénie »

  2. Sequence bioinformatics: a natural progression • One sequence: find where the motifs and genes are • Two sequences: align them • Three or more: build a tree

  3. Gene finding: the general principles behind all current methods • Gene-encoded peptide sequences have to fold into a compact structure in an aqueous environment • Real proteins are thus made of a balanced mixture of hydrophilic, hydrophobic, rigid, flexible, neutral, or charged residues • This constraint induces a recognizable statistical bias in coding DNA sequences, which we use for detection • Around the promising regions, we look for optional additional signals • The optimal prediction is then the best-scoring one according to a « standard » gene model

  4. Non-coding vs. protein-coding DNA sequence: why is there a bias? 64 codons, including 3 STOPs: TGA, TAA, TAG. Simple situation like E. coli (A = T = G = C = 0.25): between two successive STOP codons, each of the 61 codons has an expected relative frequency of 1.6%. In contrast, codon distributions within bona fide genes have to result in 'foldable' proteins, hence balancing hydrophilic and hydrophobic, + and - charged, rigid and flexible residues. Consequence: coding segments have to be compatible with the typical amino acid composition of natural proteins

  5. Amino acid   % in proteins   N codons   % expected at random   Deviation
     Ala           9.46           4          6.4                    +
     Arg           5.84           6          9.6                    -
     Asn           3.99           2          3.2                    +
     Asp           5.10           2          3.2                    +
     Cys           1.20           2          3.2                    -
     Gln           4.57           2          3.2                    +
     Glu           5.50           2          3.2                    +
     Gly           7.41           4          6.4                    +
     His           2.44           2          3.2                    -
     Ile           5.95           3          4.8                    +
     Leu          10.5            6          9.6                    +
     Lys           4.14           2          3.2                    +
     Met           2.72           1          1.6                    +
     Phe           4.19           2          3.2                    +
     Pro           4.39           4          6.4                    -
     Ser           5.71           6          9.6                    -
     Thr           5.50           4          6.4                    -
     Trp           1.48           1          1.6
     Tyr           2.76           2          3.2
     Val           7.16           4          6.4
     Deviation column: amino acids with a + or - 20% deviation from random
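The "N codons" and "% expected at random" columns can be recomputed from the standard genetic code. The following is a minimal sketch, not the author's own calculation: the names `CODON_TABLE` and `expected_random_percent` are hypothetical, and it divides by the 61 sense codons exactly, so it gives 6.56% for a four-codon amino acid where the slide rounds to 4 x 1.6 = 6.4%.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (first base varies slowest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def expected_random_percent():
    """Return {amino acid: (number of codons, % expected at random)}.

    Assumes every sense codon is equally likely (A = T = G = C = 0.25)
    and ignores the 3 STOP codons, as on the previous slide.
    """
    counts = Counter(CODON_TABLE.values())
    n_sense = sum(n for aa, n in counts.items() if aa != "*")
    return {
        aa: (n, 100.0 * n / n_sense)
        for aa, n in counts.items()
        if aa != "*"
    }


if __name__ == "__main__":
    for aa, (n, pct) in sorted(expected_random_percent().items()):
        print(f"{aa}  {n} codons  {pct:.2f} % expected at random")
```

Comparing these expectations with the observed "% in proteins" column is what reveals the bias exploited by gene finders.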

  6. A variety of methods take advantage of this bias. [Figure: Chi2 (obs-rand) computed in a sliding window along the sequence, with the coding region marked.]
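One way to picture the Chi2 (obs-rand) sliding window is sketched below. It is an illustration under stated assumptions, not the method used for the slide: the function names, the window size, the single reading frame, and the uniform expectation over all 64 codons are choices made here for simplicity.

```python
from collections import Counter
from itertools import product

ALL_CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_chi2(window):
    """Chi-square of observed codon counts against a uniform random expectation.

    Assumes an uppercase A/C/G/T sequence read in frame from its first base.
    """
    codons = [window[i:i + 3] for i in range(0, len(window) - 2, 3)]
    if not codons:
        return 0.0
    expected = len(codons) / len(ALL_CODONS)  # same expectation for each codon
    observed = Counter(codons)
    return sum((observed[c] - expected) ** 2 / expected for c in ALL_CODONS)


def chi2_profile(sequence, window_codons=60, step=3):
    """Slide a window along the sequence; return (start, chi2) pairs."""
    size = 3 * window_codons
    return [
        (start, codon_chi2(sequence[start:start + size]))
        for start in range(0, len(sequence) - size + 1, step)
    ]
```

Windows with a strongly non-random codon composition score high; a real gene finder would also scan the other reading frames and both strands.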

  7. The Caesar (Julius) Cipher
     Plaintext      : veni, vidi, vici
     Ciphertext     : YHQL, YLGL, YLFL
     Plain alphabet : abcdefghijklmnopqrstuvwxyz
     Cipher alphabet: DEFGHIJKLMNOPQRSTUVWXYZABC
     The Caesar cipher is based on a cipher alphabet that is shifted a certain number of places relative to the plain alphabet
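The shifted-alphabet idea fits in a few lines. This is a minimal sketch; the function name `caesar_encipher` is chosen here for illustration, and it follows the slide's convention of lowercase plaintext and uppercase ciphertext.

```python
import string


def caesar_encipher(plaintext, shift=3):
    """Encipher with a cipher alphabet shifted `shift` places.

    Lowercase letters are mapped to uppercase cipher letters;
    punctuation and spaces pass through unchanged.
    """
    shift %= 26
    plain = string.ascii_lowercase
    cipher = (plain[shift:] + plain[:shift]).upper()
    return plaintext.lower().translate(str.maketrans(plain, cipher))


if __name__ == "__main__":
    print(caesar_encipher("veni, vidi, vici"))  # YHQL, YLGL, YLFL
```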

  8. Frequency of letters in English, from 100,362 characters (Beker & Piper, 1982)

  9. Frequency analysis of an enciphered message
     Cipher alphabet: XZAVOIDBYGERSPCFHJKLMNQTUW
     Plain alphabet : abcdefghijklmnopqrstuvwxyz
     Most frequent letter in English: e = 12.7%
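Frequency analysis starts by tabulating how often each letter occurs in the enciphered message. A minimal sketch follows; the function name `letter_frequencies` is hypothetical and only the letters a to z are counted.

```python
from collections import Counter


def letter_frequencies(text):
    """Return {letter: percentage}, most frequent letter first."""
    letters = [c for c in text.lower() if "a" <= c <= "z"]
    if not letters:
        return {}
    counts = Counter(letters)
    return {
        letter: 100.0 * n / len(letters)
        for letter, n in counts.most_common()
    }
```

In a simple substitution cipher, the cipher letter with a frequency near 12.7% is a natural candidate for plaintext e.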

  10. 'Al-kin-dee-u' Al-Kindī: the founding father of « gene-finding » bioinformatics. Al-Kindī (0850), Deciphering Cryptographic Messages, PMID: 0000001

  11. Homophonic substitution cipher: frequent letters, many codes

  12. « Homophonic » substitution in the genetic code
      Amino acid   % in proteins   N codons   % expected at random   Deviation
      Leu          10.5            6          9.6                    +
      Arg           5.84           6          9.6                    -
      Ser           5.71           6          9.6                    -
      Ala           9.46           4          6.4                    +
      Gly           7.41           4          6.4                    +
      Val           7.16           4          6.4
      Thr           5.50           4          6.4                    -
      Pro           4.39           4          6.4                    -
      Ile           5.95           3          4.8                    +
      Glu           5.50           2          3.2                    +
      Asp           5.10           2          3.2                    +
      Gln           4.57           2          3.2                    +
      Phe           4.19           2          3.2                    +
      Lys           4.14           2          3.2                    +
      Asn           3.99           2          3.2                    +
      Tyr           2.76           2          3.2
      His           2.44           2          3.2                    -
      Cys           1.20           2          3.2                    -
      Met           2.72           1          1.6                    +
      Trp           1.48           1          1.6

  13. From old to new methods • Methods: • Codon usage (Staden & McLachlan, 1982) : profile • Differential k-tuple frequency (1986) : profile

  14. K-tuple concepts and methods. Sequence: ATGCTAGCATAGCTGCATGACATGCATGCA. Overlapping 4-tuples (tetramers) read along the sequence: ATGC, TGCT, GCTA, CTAG, TAGC, AGCA, …, ATGC, TGCA, GCAT, CATG, ATGC, TGCA. Pre-compute, for every 4-tuple: Fcoding(ATGC), Fncoding(ATGC), …, Fcoding(°°°°), Fncoding(°°°°)

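  15. K-tuple profile: at each position along the sequence, plot Fc / (Fc + Fnc) for the local k-tuple, e.g. $\dfrac{F_{coding}(ATGC)}{F_{coding}(ATGC) + F_{ncoding}(ATGC)}$. [Figure: k-tuple profile along the sequence, with a reference line at 0.5 and the coding region marked.]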
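The pre-computation and the profile can be sketched as follows. This is an illustration only: the function names are hypothetical, the training sets of known coding and non-coding sequences are assumed to be supplied by the user, and returning 0.5 for a k-tuple unseen in both sets is a choice made here.

```python
from collections import Counter


def ktuple_frequencies(sequences, k=4):
    """Relative frequency of every overlapping k-tuple in a training set."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    return {kt: n / total for kt, n in counts.items()}


def ktuple_profile(sequence, f_coding, f_ncoding, k=4):
    """Fc / (Fc + Fnc) for the k-tuple starting at each position."""
    profile = []
    for i in range(len(sequence) - k + 1):
        kt = sequence[i:i + k]
        fc = f_coding.get(kt, 0.0)
        fnc = f_ncoding.get(kt, 0.0)
        # 0.5 means "no evidence either way" (k-tuple unseen in both sets).
        profile.append(fc / (fc + fnc) if fc + fnc > 0 else 0.5)
    return profile
```

Values above 0.5 mark k-tuples seen more often in coding than in non-coding training sequences; in practice the raw profile is usually smoothed over a window before a region is called.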

  16. Homophonic deciphering: looking for 2-tuples that occur often, rarely, or never: je, jx, wa. K-tuple frequency analysis was probably known to the Arabs (1000) and rediscovered during the Renaissance

  17. Blaise de Vigenère (1586) invented his undecipherable cipher to resist k-tuple analysis

  18. The Vigenère square
      Plain  abcdefghijklmnopqrstuvwxyz
       1     BCDEFGHIJKLMNOPQRSTUVWXYZA
       2     CDEFGHIJKLMNOPQRSTUVWXYZAB
       3     DEFGHIJKLMNOPQRSTUVWXYZABC
       4     EFGHIJKLMNOPQRSTUVWXYZABCD
       5     FGHIJKLMNOPQRSTUVWXYZABCDE
       6     GHIJKLMNOPQRSTUVWXYZABCDEF
       7     HIJKLMNOPQRSTUVWXYZABCDEFG
       8     IJKLMNOPQRSTUVWXYZABCDEFGH
       9     JKLMNOPQRSTUVWXYZABCDEFGHI
      10     KLMNOPQRSTUVWXYZABCDEFGHIJ
      11-20  …
      21     VWXYZABCDEFGHIJKLMNOPQRSTU
      22     WXYZABCDEFGHIJKLMNOPQRSTUV
      23     XYZABCDEFGHIJKLMNOPQRSTUVW
      24     YZABCDEFGHIJKLMNOPQRSTUVWX
      25     ZABCDEFGHIJKLMNOPQRSTUVWXY
      26     ABCDEFGHIJKLMNOPQRSTUVWXYZ

  19. The Vigenère square: example
      Keyword    : WHITE
      Repeated   : WHITEWHITEWHITEWHITEWHI
      Plaintext  : diverttroopstoeastridge
      Ciphertext : ZPDXVPAZHSLZBHIWZBKMZNM
      Rows of the square selected by the keyword WHITE:
      Plain  abcdefghijklmnopqrstuvwxyz
       4     EFGHIJKLMNOPQRSTUVWXYZABCD
       7     HIJKLMNOPQRSTUVWXYZABCDEFG
       8     IJKLMNOPQRSTUVWXYZABCDEFGH
      19     TUVWXYZABCDEFGHIJKLMNOPQRS
      22     WXYZABCDEFGHIJKLMNOPQRSTUV
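The example can be reproduced with a short sketch in which each keyword letter selects one row of the square, that is, one Caesar shift. The function name `vigenere_encipher` is chosen here for illustration.

```python
def vigenere_encipher(plaintext, keyword):
    """Encipher letters a-z with the Vigenère square; other characters are dropped.

    Each keyword letter picks a row of the square: W is a shift of 22,
    H a shift of 7, and so on, cycling through the keyword.
    """
    key = keyword.upper()
    out = []
    j = 0  # index of the next keyword letter to use
    for c in plaintext.lower():
        if "a" <= c <= "z":
            shift = ord(key[j % len(key)]) - ord("A")
            out.append(chr((ord(c) - ord("a") + shift) % 26 + ord("A")))
            j += 1
    return "".join(out)


if __name__ == "__main__":
    print(vigenere_encipher("diverttroopstoeastridge", "WHITE"))
    # ZPDXVPAZHSLZBHIWZBKMZNM
```

Because the same plaintext letter is enciphered with different rows depending on its position, simple letter-frequency counts over the whole message no longer reveal the substitution.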

  20. Charles Babbage: another pioneer in bioinformatics. Cryptanalysis of the Vigenère cipher (1854), PMID: 0000002

  21. The invention of inhomogeneous (and hidden) Markov Models • Look for repeats in the cipher text • Infer the keyword length from consistent repeat distance: L • Then analyze the character frequency at each position independently: 1, 2, 3, …, L • Deduce (« call ») the corresponding Caesar shift
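The four steps above can be sketched as four small functions. This is a simplified illustration, not a robust attack: all function names are hypothetical, taking the greatest common divisor of every repeat distance assumes there are no chance repeats, and calling the shift from the single most frequent letter (assumed to stand for plaintext e) needs a reasonably long message.

```python
from collections import Counter, defaultdict
from functools import reduce
from math import gcd


def repeat_distances(ciphertext, n=3):
    """Distances between successive occurrences of each repeated n-gram."""
    positions = defaultdict(list)
    for i in range(len(ciphertext) - n + 1):
        positions[ciphertext[i:i + n]].append(i)
    distances = []
    for pos in positions.values():
        distances.extend(b - a for a, b in zip(pos, pos[1:]))
    return distances


def guess_key_length(distances):
    """Keyword length L as the common divisor of the repeat distances."""
    return reduce(gcd, distances) if distances else 0


def split_by_position(ciphertext, key_length):
    """Letters enciphered with the same keyword letter: positions 1, 2, ..., L."""
    return [ciphertext[i::key_length] for i in range(key_length)]


def call_shift(column):
    """Caesar shift of one column, assuming its most frequent letter is plaintext e."""
    most_common = Counter(column).most_common(1)[0][0]
    return (ord(most_common) - ord("E")) % 26
```

The parallel with gene finding is that each keyword position, like each codon position, has its own letter statistics, which is the idea behind position-dependent (inhomogeneous) Markov models.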

  22. From old to new methods • Methods: • Codon usage (Staden & McLachlan, 1982) : profile • Differential k-tuple frequency (1986) : profile • Differential in-phase k-tuples (1988) : profile • Non-homogeneous Markov chains (Borodovsky, 1986; Tavare & Song, 1989) : profile/call • Neural-net (Mural & Uberbacher, 1991) : call • Hidden Markov chain (Kulp et al., 1996) : call

  23. Bioinformatics vs. Biology • 1951 1st protein sequence (Insulin, Sanger) • 1960 Sequence-structure relationship (Globins, Perutz) • 1965 "Evolutionary divergence & convergence in Proteins" Zuckerkandl & Pauling • 1967 "Construction of Phylogenetic Trees" Fitch & Margoliash • 1968 Atlas of Protein Sequences (M. Dayhoff, Georgetown) • 1970 "A general method applicable to the search for similarities in amino-acid sequences of two proteins" Needleman & Wunsch • 1973 Genetic engineering (Cohen, Boyer et al.) • 1974 "Prediction of Protein Conformation" Chou & Fasman • 1977 DNA sequencing (Sanger, Maxam, Gilbert) • 1977 1st bioinformatic "package" (Staden; DB/assembly, analysis) • 1978 Databases: ACNUC, PIR, EMBL, GenBank • 1980 Database access via telephone lines (PIR) • 1981 Los Alamos-GenBank: 270 seqs, 370,000 nt

  24. Bioinformatics vs. Computers • 1965 First « industrial » computer IBM/360 • "Evolutionary divergence and convergence in Proteins" Zuckerkandl & Pauling • 1967 "Construction of Phylogenetic Trees" Fitch & Margoliash • 1968 First « mini » computer DEC PDP-8 (floor top) • Atlas of Protein Sequences (M. Dayhoff, Georgetown) • 1970 "A general method applicable to the search for similarities" Needleman & Wunsch • 1971 1st work on RNA folding (Ninio) • 1972 First « micro » processor Intel 8008 • 1973 Genetic engineering (Cohen et al.) • 1974 "Prediction of Protein Conformation" Chou & Fasman • 1975 Intel 8080, Altair kit • 1977 1st bioinformatic package (Staden; PDP 8/11) • DEC VAX mini-computers • Micro-computers (Apple, Commodore, Radioshack) • 1978 Databases: ACNUC, PIR, EMBL, GenBank • 1980 Telephone access to the PIR database • 1981 Los Alamos-GenBank: 270 seqs, 370,000 nt • IBM-PC (8088), 16-32 kb • 1983 IBM-XT hard disk (10 Mbytes) • 1984 Macintosh: graphic/mouse interface

  25. Historical perspective (continued). Bioinformatics & Genomics: more recent past • 1981 Local alignment (Smith-Waterman, JMB) • 1985 "Fasta" (Pearson-Lipman, PNAS) • 1989 ARPANET --> INTERNET • 1990 "Blast" (Altschul et al., JMB) • 1990 Positional cloning of NF-1 • 1991 "Grail", first practical gene « finder » (Mural et al., PNAS) • 1991 "EST" (Venter et al., Matsubara et al.) • 1992 Complete sequence of yeast chr. 3 • 1995 Complete sequence of H. influenzae • 1996 Complete sequence of S. cerevisiae • 1997 "Gapped Blast" (Altschul et al., NAR) / Genscan (Burge & Karlin) • 1997 11 complete bacterial genomes available • 1998 2 Mbase/day of new public sequence data, C. elegans • 2000 Human Chr 22, 21, 90% draft, Drosophila, 30 bacterial genomes

  26. Have a good workshop!

  27. Bioinformatics: Genes & Genomes. [Diagram: DNA (Gene 1, Gene 2, Gene 3, Gene 4) → transcription → RNA (transcripts) → translation + folding → Proteins (or RNAs) → Function 1, Function 2, Function 3, Function 4]
