Textual Analysis of DNA Sequences and Molecular Evolution

Textual Analysis of DNA Sequences and Molecular Evolution. HC Lee Dept Physics & Dept Life Science National Central University 2000 March 1. INTRODUCTION THE HUMAN GENOME PROJECT DNA - TEXT WRITTEN W/ 4 LETTERS MOLECULAR EVOLUTION ALIGNMENT OLIGO FREQUENCY – a new method


Presentation Transcript


  1. Textual Analysis of DNA Sequences and Molecular Evolution HC Lee Dept Physics & Dept Life Science National Central University 2000 March 1

  2. INTRODUCTION • THE HUMAN GENOME PROJECT • DNA - TEXT WRITTEN W/ 4 LETTERS • MOLECULAR EVOLUTION • ALIGNMENT • OLIGO FREQUENCY – a new method • RESULTS • CONCLUSION

  3. INTRODUCTION • NCTS & the BITS Program • Biophysics • X-Ray Crystallography • Protein Structure • Biosequence Analysis • Biocomputing • Bioinformatics

  4. Human Genome Project • Started 1989 • First complete genome • Haemophilus influenzae (1.8 Mb) 1995 • Human genome (~3 Gb) • Working draft 2000 June • Complete draft 2001 Feb 16

  5. 2001 Feb 10

  6. The Human Genome • Humans have 23 pairs of chromosomes

  7. DNA is a 4-letter Text A,C,G,T

  8. Molecular Evolution & Phylogeny • Organism represented by (DNA) Genome • There is a Universal Ancestor • Random mutation of DNA sequence leads to divergence and new species • Requirement of fitness causes conservation of sequence • Sequence similarity → phylogeny

  9. SEQUENCE ALIGNMENT • Most important in studying sequence homology
     Seq a:  TACCATCGCAAACAT  GG   (17 bases)
             -||||-|-|||-_|-__-|
     Seq b:  AACCACCACAAG ACCTCG   (18 bases)
     Key: | match, - mismatch, _ gap
     • Total length 19: matches 10, mismatches 6, gaps: 1 single (SG), 1 extended (EG, length 2)
     • Score = matches – (SG+EG)*P – (EG-1)*PE, where P: penalty for SG (1), PE: penalty for EG (2)
     • Score = 10 – 2 – 1 = 7
     • Similarity = matches/total length = 10/19 = 53%
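
The bookkeeping on this slide can be checked with a short sketch. It assumes gaps are written as "-" in the aligned strings, that each gap run costs P once, and that each additional column of an extended gap costs PE. The defaults p=1 and pe=1 are an assumption chosen because they reproduce the slide's 10 – 2 – 1 = 7; the function names are illustrative, not from the talk.

```python
from itertools import groupby


def alignment_stats(a, b, gap="-"):
    """Count matches, mismatches and gap runs in two equal-length aligned strings."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Classify every column as match (M), mismatch (X) or gap (G).
    kinds = []
    for x, y in zip(a, b):
        if x == gap or y == gap:
            kinds.append("G")
        elif x == y:
            kinds.append("M")
        else:
            kinds.append("X")
    # Lengths of maximal runs of consecutive gap columns.
    gap_runs = [len(list(run)) for kind, run in groupby(kinds) if kind == "G"]
    return {
        "length": len(kinds),
        "matches": kinds.count("M"),
        "mismatches": kinds.count("X"),
        "gap_runs": gap_runs,
    }


def score(stats, p=1, pe=1):
    """Matches minus p per gap run minus pe per extra column of an extended gap.

    p and pe are assumed values, not confirmed by the slide.
    """
    runs = stats["gap_runs"]
    return stats["matches"] - len(runs) * p - sum(n - 1 for n in runs) * pe


if __name__ == "__main__":
    seq_a = "TACCATCGCAAACAT--GG"
    seq_b = "AACCACCACAAG-ACCTCG"
    stats = alignment_stats(seq_a, seq_b)
    # 10 matches, 6 mismatches, gap runs [1, 2]; score 7; similarity 10/19
    print(stats, score(stats), stats["matches"] / stats["length"])
```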

  10. ALIGNMENT (II) • Results are intuitive and evolution-based • Widely used in sequence analysis – homology search, phylogeny, etc. • Parameter dependent – many alignments possible (Needleman-Wunsch algorithm) • Applies to DNA & protein sequences • Good software exists, e.g., BLAST, GCG, ... • Fast for length < 2000 • Costly for long and remotely related sequences; optimal multiple alignment is NP-hard
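
To illustrate the parameter dependence mentioned here, below is a minimal sketch of Needleman-Wunsch global alignment scoring. It uses a simple linear gap penalty rather than the single/extended gap scheme of the previous slide, and the default values of match, mismatch and gap are assumptions for illustration only.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=0, gap=-1):
    """Optimal global alignment score of a and b with a linear gap penalty.

    Keeps only two rows of the dynamic-programming table, so memory is
    proportional to len(b) while time is proportional to len(a) * len(b).
    """
    # Row 0: aligning the empty prefix of a against prefixes of b costs only gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # column 0: prefix of a against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap       # gap in b
            left = cur[j - 1] + gap  # gap in a
            cur.append(max(diag, up, left))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    x, y = "TACCATCGCAAACATGG", "AACCACCACAAGACCTCG"
    # Changing the parameters can change the optimal score and alignment.
    print(needleman_wunsch_score(x, y))
    print(needleman_wunsch_score(x, y, mismatch=-1, gap=-2))
```

The two nested loops make the cost grow with the product of the sequence lengths, which is why the method becomes expensive for long sequences.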

  11. OLIGO FREQUENCY • Oligonucleotide (oligo): sequence several nucleotides long • There are 4^n oligos of length n • Complete set of frequencies of oligos characterizes a DNA sequence • Very fast to compute; scales with seq length • For multiple seqs, scales w/ no. of seqs • Related to alignment
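
A minimal sketch of how such a frequency set might be computed, assuming overlapping windows of length n and normalization by the number of windows (the slide does not specify either choice). The function name is illustrative.

```python
from collections import Counter


def oligo_frequencies(seq, n):
    """Relative frequency of each length-n oligo found in seq.

    Uses overlapping sliding windows; oligos that never occur are omitted,
    so at most 4**n keys appear for a DNA sequence.
    """
    total = len(seq) - n + 1
    if total <= 0:
        raise ValueError("sequence is shorter than the oligo length")
    counts = Counter(seq[i:i + n] for i in range(total))
    return {oligo: count / total for oligo, count in counts.items()}


if __name__ == "__main__":
    for oligo, freq in sorted(oligo_frequencies("ACGTACGT", 2).items()):
        print(oligo, freq)
```

A single pass over the sequence suffices, which is consistent with the slide's point that the computation scales with sequence length.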

  12. n-distance

  13. OLIGO FREQUENCY (II) • Relation w/ alignment • D: oligo distance • Oligo similarity S = 1- D/2 • In seq alignment, if % of matched bases is p, • Then % similarity is by def X = p • If oligo length is n, then, if all mutations (replace, delete and insert) are single, % similarity of each oligo is p^n • Hence, excluding correlations, S = p^n = X^n • With correlation, S = X^n + (1 – X^n)/4^n • For 1% accuracy, n >=7
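
The relations on this slide can be sketched numerically. The definition of D used below, the sum of absolute differences between the two oligo frequency sets, is an assumption: the formula on the n-distance slide did not survive in this transcript, and this choice is simply one that ranges from 0 to 2, so that S = 1 - D/2 ranges from 1 to 0. All function names are illustrative.

```python
from collections import Counter


def oligo_frequencies(seq, n):
    """Relative frequency of each overlapping length-n oligo in seq."""
    total = len(seq) - n + 1
    if total <= 0:
        raise ValueError("sequence is shorter than the oligo length")
    counts = Counter(seq[i:i + n] for i in range(total))
    return {oligo: count / total for oligo, count in counts.items()}


def oligo_distance(seq_a, seq_b, n):
    """Assumed D: sum of absolute frequency differences (0 = identical, 2 = disjoint)."""
    fa = oligo_frequencies(seq_a, n)
    fb = oligo_frequencies(seq_b, n)
    return sum(abs(fa.get(k, 0.0) - fb.get(k, 0.0)) for k in fa.keys() | fb.keys())


def oligo_similarity(seq_a, seq_b, n):
    """S = 1 - D/2, as stated on the slide."""
    return 1 - oligo_distance(seq_a, seq_b, n) / 2


def predicted_similarity(x, n):
    """The slide's relation with correlation: S = X^n + (1 - X^n)/4^n."""
    return x ** n + (1 - x ** n) / 4 ** n


if __name__ == "__main__":
    print(oligo_similarity("ACGTACGTAC", "ACGTTCGTAC", 3))
    print(predicted_similarity(0.9, 7))
```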

  14. log S vs. log X • Simulated Random Mutations • Oligo length = 9 • [Figure; labels: oligo, align]

  15. log S vs. log X • Tree of Life (35 organisms) • Oligo length = 9 • [Figure; labels: oligo, align]

  16. OLIGO FREQUENCY (III) • Empirical: S = X^(kn), k ~ 2/3 • Why? The relation S = X^n assumes gaps (delete + insert) are single and uncorrelated. • Fact – most gaps are EXTENDED: if there is one, there will be more. • Because non-conserved parts are NOT protected by evolution. • A model that generates extended gaps can give k ~ 2/3 • Note: for two random seqs, X < 0.5
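
Since S = X^(kn) means log S = k n log X, the exponent k is the slope of log S against log X divided by n. The sketch below estimates k with a least-squares line through the origin; this fitting choice is an assumption, not a procedure stated in the talk, and the data in the example are synthetic, generated from the formula itself rather than taken from the slides.

```python
import math


def fit_k(pairs, n):
    """Estimate k in S = X**(k*n) from (X, S) pairs with 0 < X < 1 and S > 0.

    Fits log S = slope * log X through the origin and returns slope / n.
    """
    num = sum(math.log(x) * math.log(s) for x, s in pairs)
    den = sum(math.log(x) ** 2 for x, _ in pairs)
    if den == 0:
        raise ValueError("need at least one pair with X != 1")
    return num / (n * den)


if __name__ == "__main__":
    n = 9
    # Synthetic points that follow S = X**(k*n) exactly, with k = 2/3.
    synthetic = [(x, x ** (n * 2 / 3)) for x in (0.6, 0.7, 0.8, 0.9)]
    print(fit_k(synthetic, n))
```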

  17. log S vs. log X • Simulated Random Mutations with Extended gaps • Oligo length = 9 • [Figure; labels: oligo, align]

  18. Tree of Life (35 organisms) • [Tree figure; labels: Bacteria, A. aeolicus, T. maritima, Eukarya, Archaea]

  19. Oligo method is Robust • Three tests • Random truncation of 16S rRNA to 800 to 1200 bases (Bacteria and Archaea) • Random inversion of 16S rRNA (splice, reverse order and reconnect) • Random concatenation of 23S, 16S and 5S rRNA sequences (12 Bacteria and 6 Archaea)
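
The three perturbations could be generated along the following lines. This is a sketch under stated assumptions: truncation is taken to mean keeping one contiguous window of random length, inversion reverses the order of one randomly chosen segment (no complementing), and concatenation joins the sequences in random order. The talk does not give these details, and the function names are illustrative.

```python
import random


def truncate(seq, min_len=800, max_len=1200, rng=random):
    """Keep a random contiguous window whose length lies in [min_len, max_len]."""
    low = min(min_len, len(seq))
    high = min(max_len, len(seq))
    length = rng.randint(low, high)
    start = rng.randint(0, len(seq) - length)
    return seq[start:start + length]


def invert(seq, rng=random):
    """Splice out a random segment, reverse its order, and reconnect."""
    i, j = sorted(rng.sample(range(len(seq) + 1), 2))
    return seq[:i] + seq[i:j][::-1] + seq[j:]


def concatenate(parts, rng=random):
    """Join several sequences (e.g. 23S, 16S, 5S rRNA) in random order."""
    parts = list(parts)
    rng.shuffle(parts)
    return "".join(parts)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded for repeatability
    toy = "".join(rng.choice("ACGT") for _ in range(1500))  # stand-in, not real rRNA
    print(len(truncate(toy, rng=rng)))
    print(invert(toy, rng=rng) != toy)
    print(len(concatenate([toy[:100], toy[100:300], toy[300:350]], rng=rng)))
```

Each perturbed sequence can then be compared with the others by oligo frequency or by alignment to see which method still recovers the same grouping.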

  20. 16S rRNA, Truncated, Oligo • [Tree figure; labels: Bacteria, A. aeolicus, T. maritima, Archaea]

  21. 16S rRNA, Truncated, Align • [Tree figure; labels: A. aeolicus, T. maritima]

  22. 16S rRNA, Inverted, Oligo • [Tree figure; labels: Bacteria, A. aeolicus, T. maritima, Archaea]

  23. 16S rRNA, Inverted, Align • [Tree figure; labels: T. maritima, A. aeolicus]

  24. 5S + 16S + 23S rRNAs, Mixed, Oligo • [Tree figure; labels: Bacteria, A. aeolicus, Archaea, T. maritima]

  25. 5S + 16S + 23S rRNAs, Mixed, Align • [Tree figure; labels: A. aeolicus, T. maritima]

  26. CONCLUSION • Oligo frequency characterizes DNA seqs • Oligo similarity is related to alignment similarity • Oligo vs alignment gives a handle on mechanism of generation of extended gaps • Oligo method is robust to truncation and inversions • Can be developed into a tool for analysis and comparison of long and multiple seqs