The place of phylogeny in bioinformatics
Jean-Michel CLAVERIE
Structural & Genetic Information Laboratory, CNRS UPR2589, Luminy, France
http://www.igs.cnrs-mrs.fr
Journées RNG « Phylogénie »
Sequence bioinformatics: a natural progression
• One sequence: find where the motifs and genes are
• Two sequences: align them
• Three or more: build a tree
Gene finding: the general principles behind all current methods
• Gene-encoded peptide sequences have to lead to a foldable, compact structure in a water environment.
• Real proteins are thus made of a balanced mixture of hydrophilic, hydrophobic, rigid, flexible, neutral, or charged residues.
• This constraint induces a recognizable statistical bias in coding DNA sequences, which we use for detection.
• Around the good-looking regions, we look for optional additional signals.
• The optimal prediction is then the best-scoring one according to a « standard » gene model.
Non-coding vs. protein-coding DNA sequence: why is there a bias?
64 codons, including 3 STOPs: TGA, TAA, TAG
Simple situation like E. coli (A = T = G = C = 0.25):
Between two successive STOP codons, each of the 61 codons has an expected relative frequency of about 1.6%.
In contrast, codon distributions within bona fide genes have to result in ‘foldable’ proteins, hence balancing hydrophilic and hydrophobic, + and - charged, rigid and flexible residues.
Coding segments have to be compatible with the typical amino acid composition of natural proteins.
Amino acid   % in proteins   N codons   % expected at random   Deviation
Ala          9.46            4          6.4                    +
Arg          5.84            6          9.6                    -
Asn          3.99            2          3.2                    +
Asp          5.10            2          3.2                    +
Cys          1.20            2          3.2                    -
Gln          4.57            2          3.2                    +
Glu          5.50            2          3.2                    +
Gly          7.41            4          6.4                    +
His          2.44            2          3.2                    -
Ile          5.95            3          4.8                    +
Leu          10.5            6          9.6                    +
Lys          4.14            2          3.2                    +
Met          2.72            1          1.6                    +
Phe          4.19            2          3.2                    +
Pro          4.39            4          6.4                    -
Ser          5.71            6          9.6                    -
Thr          5.50            4          6.4                    -
Trp          1.48            1          1.6
Tyr          2.76            2          3.2
Val          7.16            4          6.4
+ / -: amino acids with a + or - 20% deviation from random
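The "% expected at random" column follows directly from the number of codons per amino acid. The sketch below recomputes it and the relative deviation of the observed values. It assumes the rounded figure of 1.6% per codon used on the slide; the names `CODONS_PER_AA`, `OBSERVED`, `expected_percent` and `relative_deviation` are illustrative, not part of the talk.

```python
# Number of codons per amino acid in the standard genetic code (61 sense codons).
CODONS_PER_AA = {
    "Ala": 4, "Arg": 6, "Asn": 2, "Asp": 2, "Cys": 2, "Gln": 2, "Glu": 2,
    "Gly": 4, "His": 2, "Ile": 3, "Leu": 6, "Lys": 2, "Met": 1, "Phe": 2,
    "Pro": 4, "Ser": 6, "Thr": 4, "Trp": 1, "Tyr": 2, "Val": 4,
}

# Observed amino acid percentages in proteins, copied from the table above.
OBSERVED = {
    "Ala": 9.46, "Arg": 5.84, "Asn": 3.99, "Asp": 5.10, "Cys": 1.20,
    "Gln": 4.57, "Glu": 5.50, "Gly": 7.41, "His": 2.44, "Ile": 5.95,
    "Leu": 10.5, "Lys": 4.14, "Met": 2.72, "Phe": 4.19, "Pro": 4.39,
    "Ser": 5.71, "Thr": 5.50, "Trp": 1.48, "Tyr": 2.76, "Val": 7.16,
}

PER_CODON_PERCENT = 1.6  # assumption: the rounded per-codon value from the slide


def expected_percent(aa):
    """Expected % of an amino acid if all sense codons were equally likely."""
    return CODONS_PER_AA[aa] * PER_CODON_PERCENT


def relative_deviation(aa):
    """(observed - expected) / expected, e.g. 0.25 means 25% above random."""
    exp = expected_percent(aa)
    return (OBSERVED[aa] - exp) / exp


if __name__ == "__main__":
    for aa in sorted(OBSERVED, key=relative_deviation, reverse=True):
        print(f"{aa}  obs={OBSERVED[aa]:5.2f}  exp={expected_percent(aa):4.1f}  "
              f"dev={100 * relative_deviation(aa):+6.1f}%")
```

Sorting by deviation puts Met and Glu at the over-represented end and Cys and Arg at the under-represented end.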
A variety of methods take advantage of this bias
[Figure: Chi2 (obs - rand) profile computed in a sliding window along the sequence, with the coding region marked]
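One possible reading of the Chi2 sliding-window profile is sketched below. It makes assumptions the slide does not state: the statistic is computed on codon counts in a single reading frame, against a uniform expectation over the 61 sense codons. The window size and the function names are illustrative.

```python
from collections import Counter
from itertools import product

STOPS = {"TGA", "TAA", "TAG"}
# The 61 sense codons of the standard genetic code.
SENSE_CODONS = [c for c in ("".join(p) for p in product("ACGT", repeat=3))
                if c not in STOPS]
SENSE_SET = set(SENSE_CODONS)


def chi2_window(codons):
    """Chi-square of observed codon counts against a uniform expectation."""
    counts = Counter(c for c in codons if c in SENSE_SET)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    expected = total / len(SENSE_CODONS)
    return sum((counts[c] - expected) ** 2 / expected for c in SENSE_CODONS)


def chi2_profile(seq, window=30, step=1, frame=0):
    """Chi-square value for each window of `window` codons along `seq`."""
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    return [chi2_window(codons[i:i + window])
            for i in range(0, len(codons) - window + 1, step)]
```

Under this reading, windows with a strongly non-uniform codon composition give high values. A real gene finder would typically compare against measured coding and non-coding distributions rather than a flat one.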
The Caesar (Julius) Cipher
Plaintext:       veni, vidi, vici
Ciphertext:      YHQL, YLGL, YLFL
Plain alphabet:  abcdefghijklmnopqrstuvwxyz
Cipher alphabet: DEFGHIJKLMNOPQRSTUVWXYZABC
The Caesar cipher is based on a cipher alphabet that is shifted a certain number of places relative to the plain alphabet.
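The shifted-alphabet rule can be written in a few lines. This is a minimal sketch using the slide's convention of lowercase plaintext and uppercase ciphertext; the function names are illustrative.

```python
import string


def caesar_encrypt(plaintext, shift=3):
    """Shift each letter `shift` places; non-letters pass through unchanged."""
    out = []
    for ch in plaintext.lower():
        if ch in string.ascii_lowercase:
            idx = (ord(ch) - ord("a") + shift) % 26
            out.append(string.ascii_uppercase[idx])
        else:
            out.append(ch)
    return "".join(out)


def caesar_decrypt(ciphertext, shift=3):
    """Inverse of caesar_encrypt: shift each letter back by `shift` places."""
    out = []
    for ch in ciphertext.upper():
        if ch in string.ascii_uppercase:
            idx = (ord(ch) - ord("A") - shift) % 26
            out.append(string.ascii_lowercase[idx])
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(caesar_encrypt("veni, vidi, vici"))  # shift of 3, as on the slide
```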
Frequency of letters in English From 100,362 characters (Beker & Piper, 1950)
e = 12.7%
Frequency analysis of an enciphered message
Plain alphabet:  abcdefghijklmnopqrstuvwxyz
Cipher alphabet: XZAVOIDBYGERSPCFHJKLMNQTUW
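Frequency analysis starts from a simple letter count. The sketch below computes letter percentages and, for the simpler Caesar case only, guesses the shift by assuming that the most frequent cipher letter stands for "e". That assumption only holds for reasonably long English text. A general substitution alphabet like the one above needs the full ranking rather than a single shift. Function names are illustrative.

```python
from collections import Counter


def letter_frequencies(text):
    """Percentage of each letter A-Z in `text`, most frequent first."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    total = len(letters)
    if total == 0:
        return {}
    return {c: 100 * n / total for c, n in Counter(letters).most_common()}


def guess_caesar_shift(ciphertext):
    """Guess a Caesar shift by mapping the most frequent letter to 'E'."""
    freqs = letter_frequencies(ciphertext)
    if not freqs:
        return 0
    most_frequent = next(iter(freqs))
    return (ord(most_frequent) - ord("E")) % 26
```

For example, `guess_caesar_shift("PHHW PH KHUH WKHVH WKUHH HYHQLQJV")` returns 3, because "H" dominates the ciphertext of "meet me here these three evenings".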
Al-Kindī (‘Al-kin-dee-u’)
The founding father of « gene-finding » bioinformatics
Al-Kindī (0850) Deciphering Cryptographic Messages, PMID: 0000001
Homophonic substitution cipher Frequent letters, many codes
« Homophonic » substitution in the genetic code
Amino acid   % in proteins   N codons   % expected at random   Deviation
Leu          10.5            6          9.6                    +
Arg          5.84            6          9.6                    -
Ser          5.71            6          9.6                    -
Ala          9.46            4          6.4                    +
Gly          7.41            4          6.4                    +
Val          7.16            4          6.4
Thr          5.50            4          6.4                    -
Pro          4.39            4          6.4                    -
Ile          5.95            3          4.8                    +
Glu          5.50            2          3.2                    +
Asp          5.10            2          3.2                    +
Gln          4.57            2          3.2                    +
Phe          4.19            2          3.2                    +
Lys          4.14            2          3.2                    +
Asn          3.99            2          3.2                    +
Tyr          2.76            2          3.2
His          2.44            2          3.2                    -
Cys          1.20            2          3.2                    -
Met          2.72            1          1.6                    +
Trp          1.48            1          1.6
From old to new methods
Methods:
• Codon usage (Staden & McLachlan, 1982): profile
• Differential k-tuple frequency (1986): profile
K-tuple concepts and methods
Sequence: ATGCTAGCATAGCTGCATGACATGCATGCA
Overlapping 4-tuples (tetramers): ATGC, TGCT, GCTA, CTAG, TAGC, AGCA, ..., ATGC, TGCA, GCAT, CATG, ATGC, TGCA
Pre-compute, for every 4-tuple:
Fcoding(ATGC), Fncoding(ATGC)
...
Fcoding(°°°°), Fncoding(°°°°)
K-tuple profile
$$\frac{F_{coding}(\mathrm{ATGC})}{F_{coding}(\mathrm{ATGC}) + F_{ncoding}(\mathrm{ATGC})}$$
[Figure: profile of Fc / (Fc + Fnc) along the sequence, with a 0.5 baseline and the coding region marked]
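The k-tuple profile can be sketched as follows. The frequency tables would be pre-computed from known coding and non-coding training sequences, which are not given in the talk. The function names, and the choice of 0.5 for a k-tuple absent from both tables, are assumptions made for illustration.

```python
from collections import Counter


def ktuples(seq, k=4):
    """All overlapping k-tuples of `seq`, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def ktuple_frequencies(training_seqs, k=4):
    """Relative frequency of each k-tuple over a set of training sequences."""
    counts = Counter()
    for s in training_seqs:
        counts.update(ktuples(s, k))
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()} if total else {}


def coding_profile(seq, f_coding, f_noncoding, k=4):
    """Fc / (Fc + Fnc) for each k-tuple along `seq`; 0.5 means uninformative."""
    profile = []
    for t in ktuples(seq, k):
        fc = f_coding.get(t, 0.0)
        fnc = f_noncoding.get(t, 0.0)
        profile.append(fc / (fc + fnc) if fc + fnc > 0 else 0.5)
    return profile
```

On the example sequence from the slide, `ktuples("ATGCTAGCATAGCTGCATGACATGCATGCA")` yields 27 tetramers, three of which are ATGC. In practice the profile would be smoothed over a window before deciding where it stays above 0.5.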
Homophonic deciphering: looking for 2-tuples that occur often, rarely, or never: je, jx, wa
k-tuple frequency analysis was probably known to the Arabs (1000) and rediscovered during the Renaissance.
Blaise de Vigenère (1586) invented his undecipherable cipher to resist k-tuple analysis
The Vigenère square
Plain  abcdefghijklmnopqrstuvwxyz
1      BCDEFGHIJKLMNOPQRSTUVWXYZA
2      CDEFGHIJKLMNOPQRSTUVWXYZAB
3      DEFGHIJKLMNOPQRSTUVWXYZABC
4      EFGHIJKLMNOPQRSTUVWXYZABCD
5      FGHIJKLMNOPQRSTUVWXYZABCDE
6      GHIJKLMNOPQRSTUVWXYZABCDEF
7      HIJKLMNOPQRSTUVWXYZABCDEFG
8      IJKLMNOPQRSTUVWXYZABCDEFGH
9      JKLMNOPQRSTUVWXYZABCDEFGHI
10     KLMNOPQRSTUVWXYZABCDEFGHIJ
11-20  ...
21     VWXYZABCDEFGHIJKLMNOPQRSTU
22     WXYZABCDEFGHIJKLMNOPQRSTUV
23     XYZABCDEFGHIJKLMNOPQRSTUVW
24     YZABCDEFGHIJKLMNOPQRSTUVWX
25     ZABCDEFGHIJKLMNOPQRSTUVWXY
26     ABCDEFGHIJKLMNOPQRSTUVWXYZ
The Vigenère square: example
Keyword:          WHITE
Repeated keyword: WHITEWHITEWHITEWHITEWHI
Plaintext:        diverttroopstoeastridge
Ciphertext:       ZPDXVPAZHSLZBHIWZBKMZNM
Rows of the square used for the keyword WHITE:
Plain  abcdefghijklmnopqrstuvwxyz
4      EFGHIJKLMNOPQRSTUVWXYZABCD
7      HIJKLMNOPQRSTUVWXYZABCDEFG
8      IJKLMNOPQRSTUVWXYZABCDEFGH
19     TUVWXYZABCDEFGHIJKLMNOPQRS
22     WXYZABCDEFGHIJKLMNOPQRSTUV
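Looking up a row of the square is the same as adding the keyword letter's position in the alphabet to the plaintext letter, modulo 26. The sketch below encodes that rule and can be checked against the WHITE example; the function names are illustrative.

```python
def vigenere_encrypt(plaintext, keyword):
    """Encrypt lowercase plaintext with a repeating keyword (A = shift 0)."""
    key = keyword.upper()
    out = []
    i = 0  # index into the keyword, advanced only on letters
    for ch in plaintext.lower():
        if "a" <= ch <= "z":
            shift = ord(key[i % len(key)]) - ord("A")
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("A")))
            i += 1
        else:
            out.append(ch)
    return "".join(out)


def vigenere_decrypt(ciphertext, keyword):
    """Inverse of vigenere_encrypt: subtract the keyword shifts."""
    key = keyword.upper()
    out = []
    i = 0
    for ch in ciphertext.upper():
        if "A" <= ch <= "Z":
            shift = ord(key[i % len(key)]) - ord("A")
            out.append(chr((ord(ch) - ord("A") - shift) % 26 + ord("a")))
            i += 1
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(vigenere_encrypt("diverttroopstoeastridge", "WHITE"))
```

Because the same plaintext letter is shifted differently depending on its position, single-letter and k-tuple frequencies of the whole message are flattened.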
Charles Babbage: another pioneer in bioinformatics
Cryptanalysis of the Vigenère cipher (1854), PMID: 0000002
The invention of inhomogeneous (and hidden) Markov models
• Look for repeats in the ciphertext
• Infer the keyword length L from consistent repeat distances
• Then analyze the character frequency at each position independently: 1, 2, 3, ..., L
• Deduce (« call ») the corresponding Caesar shift
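The first three steps above can be sketched as follows. Taking the greatest common divisor of all repeat distances is a simplification chosen here for brevity. On real messages some repeats are coincidental, so one would look for the most consistent factor instead. Function names are illustrative.

```python
from collections import defaultdict
from functools import reduce
from math import gcd


def repeat_distances(ciphertext, n=3):
    """Distances between successive occurrences of every repeated n-gram."""
    positions = defaultdict(list)
    for i in range(len(ciphertext) - n + 1):
        positions[ciphertext[i:i + n]].append(i)
    distances = []
    for pos in positions.values():
        distances.extend(b - a for a, b in zip(pos, pos[1:]))
    return distances


def guess_key_length(ciphertext, n=3):
    """GCD of the repeat distances, or None when nothing repeats."""
    distances = repeat_distances(ciphertext, n)
    return reduce(gcd, distances) if distances else None


def split_by_position(ciphertext, key_length):
    """One sub-message per keyword position; each is a plain Caesar cipher."""
    return [ciphertext[i::key_length] for i in range(key_length)]
```

On the short WHITE example, the only repeated 2-gram is ZB (both occurrences encrypt "st"), 5 letters apart, so `guess_key_length("ZPDXVPAZHSLZBHIWZBKMZNM", n=2)` returns 5, the length of the keyword. Each of the five sub-messages returned by `split_by_position` can then be attacked by ordinary letter-frequency analysis, provided the message is long enough.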
From old to new methods
Methods:
• Codon usage (Staden & McLachlan, 1982): profile
• Differential k-tuple frequency (1986): profile
• Differential in-phase k-tuples (1988): profile
• Non-homogeneous Markov chains (Borodovsky, 1986; Tavaré & Song, 1989): profile/call
• Neural net (Mural & Uberbacher, 1991): call
• Hidden Markov chain (Kulp et al., 1996): call
Bioinformatics vs. Biology
• 1951 1st protein sequence (Insulin, Sanger)
• 1960 Sequence-structure relationship (Globins, Perutz)
• 1965 "Evolutionary divergence & convergence in Proteins" Zuckerkandl & Pauling
• 1967 "Construction of Phylogenetic Trees" Fitch & Margoliash
• 1968 Atlas of Protein Sequences (M. Dayhoff, Georgetown)
• 1970 "A general method applicable to the search for similarities in amino-acid sequences of two proteins" Needleman & Wunsch
• 1973 Genetic engineering (Cohen, Boyer et al.)
• 1974 "Prediction of Protein Conformation" Chou & Fasman
• 1977 DNA sequencing (Sanger, Maxam, Gilbert)
• 1977 1st bioinformatic "package" (Staden; DB/assembly, analysis)
• 1978 Databases: ACNUC, PIR, EMBL, GenBank
• 1980 Database access via telephone lines (PIR)
• 1981 Los Alamos-GenBank: 270 seqs, 370,000 nt
Bioinformatics vs. Computers
• 1965 First « industrial » computer IBM/360
• "Evolutionary divergence and convergence in Proteins" Zuckerkandl & Pauling
• 1967 "Construction of Phylogenetic Trees" Fitch & Margoliash
• 1968 First « mini » computer DEC PDP-8 (floor top)
• Atlas of Protein Sequences (M. Dayhoff, Georgetown)
• 1970 "A general method applicable to the search for similarities" Needleman & Wunsch
• 1971 1st work on RNA folding (Ninio)
• 1972 First « micro » processor Intel 8008
• 1973 Genetic engineering (Cohen et al.)
• 1974 "Prediction of Protein Conformation" Chou & Fasman
• 1975 Intel 8080, Altair kit
• 1977 1st bioinformatic package (Staden; PDP 8/11)
• DEC VAX mini-computers
• Micro-computers (Apple, Commodore, RadioShack)
• 1978 Databases: ACNUC, PIR, EMBL, GenBank
• 1980 Telephone access to the PIR database
• 1981 Los Alamos-GenBank: 270 seqs, 370,000 nt
• IBM-PC (8088), 16-32 KB
• 1983 IBM-XT hard disk (10 Mbytes)
• 1984 Macintosh: graphic/mouse interface
Historical perspective (continued)
Bioinformatics & Genomics: more recent past
• 1981 Local alignment (Smith-Waterman, JMB)
• 1985 "Fasta" (Pearson-Lipman, PNAS)
• 1989 ARPANET --> INTERNET
• 1990 "Blast" (Altschul et al., JMB)
• 1990 Positional cloning of NF-1
• 1991 "Grail", first practical gene « finder » (Mural et al., PNAS)
• 1991 "EST" (Venter et al., Matsubara et al.)
• 1992 Complete sequence of yeast chr. 3
• 1995 Complete sequence of H. influenzae
• 1996 Complete sequence of S. cerevisiae
• 1997 "Gapped Blast" (Altschul et al., NAR) / Genscan (Burge & Karlin)
• 1997 11 complete bacterial genomes available
• 1998 2 Mbase/day of new public sequence data, C. elegans
• 2000 Human Chr 22, 21, 90% draft, Drosophila, 30 bacterial genomes
BioInformatics: Genes & Genomes
DNA: Gene 1, Gene 2, Gene 3, Gene 4
  transcription
RNA (transcripts)
  translation + folding
Proteins (or RNAs): Function 1, Function 2, Function 3, Function 4