Textual Analysis of DNA Sequences and Molecular Evolution • HC Lee • Dept Physics & Dept Life Science, National Central University • 2000 March 1
INTRODUCTION • THE HUMAN GENOME PROJECT • DNA - TEXT WRITTEN W/ 4 LETTERS • MOLECULAR EVOLUTION • ALIGNMENT • OLIGO FREQUENCY – a new method • RESULTS • CONCLUSION
INTRODUCTION • NCTS & the BITS Program • Biophysics • X-Ray Crystallography • Protein Structure • Biosequence Analysis • Biocomputing • Bioinformatics
Human Genome Project • Started 1989 • First complete genome: Haemophilus influenzae (1.8 Mb), 1995 • Human genome (3 Gb) • Working draft 2000 June • Complete draft 2001 Feb 16
The Human Genome • Human has 23 pairs of chromosomes
DNA is a 4-letter Text A,C,G,T
Molecular Evolution & Phylogeny • Organism represented by (DNA) Genome • There is a Universal Ancestor • Random mutation of DNA sequence leads to divergence and new species • Requirement of fitness causes conservation of sequence • Sequence similarity → phylogeny
SEQUENCE ALIGNMENT • Most important in studying sequence homology
Seq a: TACCATCGCAAACAT--GG (17 bases)
       -||||-|-|||-_|-__-| (| match, - mismatch, _ gap)
Seq b: AACCACCACAAG-ACCTCG (18 bases)
• Total length 19, matches 10, mismatches 6 • Gaps: 1 single (SG), 1 extended (EG, length 2) • Score = matches – (SG+EG)*P – (EG–1)*PE • P: penalty for SG (1) • PE: penalty for EG (2) • Score = 10 – 2 – 1 = 7 • Similarity = matches/total length = 10/19 = 53%
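The counts on this slide can be checked with a short Python sketch. It takes the two gapped rows of the example, tallies matches, mismatches and gap runs, and evaluates a score. The function names and the penalty arguments `p_open` and `p_extend` are illustrative assumptions, not the author's code. The sketch reads the slide's formula as one penalty per gap run plus one extension penalty per additional gap position; with both penalties set to 1 it reproduces the slide's score of 7.

```python
def alignment_stats(a, b, gap="-"):
    """Count matches, mismatches and gap runs in two gapped rows of equal length."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    matches = mismatches = 0
    gap_runs = []        # length of each run of consecutive gap columns
    prev_gap_row = None  # row that held the gap in the previous column
    for x, y in zip(a, b):
        if x == gap or y == gap:
            row = "a" if x == gap else "b"
            if row == prev_gap_row:
                gap_runs[-1] += 1      # the current gap run continues
            else:
                gap_runs.append(1)     # a new gap run opens
            prev_gap_row = row
        else:
            prev_gap_row = None
            if x == y:
                matches += 1
            else:
                mismatches += 1
    return {
        "length": len(a),
        "matches": matches,
        "mismatches": mismatches,
        "single_gaps": sum(1 for g in gap_runs if g == 1),
        "extended_gaps": sum(1 for g in gap_runs if g > 1),
        "gap_runs": gap_runs,
    }


def alignment_score(stats, p_open=1, p_extend=1):
    """Assumed reading: matches - (gap runs)*p_open - (extra gap positions)*p_extend."""
    runs = stats["gap_runs"]
    return stats["matches"] - len(runs) * p_open - sum(g - 1 for g in runs) * p_extend


def similarity(stats):
    """Fraction of alignment columns that are matches."""
    return stats["matches"] / stats["length"]


seq_a = "TACCATCGCAAACAT--GG"
seq_b = "AACCACCACAAG-ACCTCG"
stats = alignment_stats(seq_a, seq_b)
print(stats, alignment_score(stats), round(similarity(stats), 3))
```

Mismatches carry no penalty here, matching the slide's formula; a different scoring scheme would change the score but not the counts.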
ALIGNMENT (II) • Result intuitive, evolution based • Widely used in sequence analysis – homology search, phylogeny, etc. • Parameter dependent – many alignments possible (Needleman-Wunsch algorithm) • Applies to DNA & protein sequences • Good software, e.g., BLAST, GCG, ... • Fast for length < 2000 • Costly for long and remotely related sequences (pairwise dynamic programming scales with the product of the two lengths) • NP-complete problem for multiple alignments
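As a rough illustration of the parameter dependence and cost noted on this slide, here is a minimal Python sketch of the Needleman-Wunsch global alignment score. The scoring values (`match=1`, `mismatch=0`, `gap=-1`) and the linear gap penalty are assumptions chosen for illustration; the slide does not specify them, and production tools use more elaborate schemes.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=0, gap=-1):
    """Best global alignment score of a and b under a linear gap penalty."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr[j] = max(diag, up, left)
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("ACGT", "AGT"))          # one gap is needed
print(needleman_wunsch_score("ACGT", "AGT", gap=-5))  # same pair, harsher gap penalty
```

The double loop visits every pair of positions, so the work grows with `len(a) * len(b)`; changing the penalties changes both the score and which alignment is optimal.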
OLIGO FREQUENCY • Oligonucleotide (oligo): sequence several nucleotides long • There are 4^n oligos of length n • Complete set of frequencies of oligos characterizes a DNA sequence • Very fast to compute; scales with seq length • For multiple seqs, scales w/ no. of seqs • Related to alignment
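A minimal Python sketch of the oligo frequency count described above: slide a window of length n along the sequence and normalize the counts. The function name `oligo_frequencies` is illustrative, and counting overlapping windows is an assumption; the slide does not say whether windows overlap.

```python
from collections import Counter


def oligo_frequencies(seq, n):
    """Relative frequency of every length-n oligo seen in seq (overlapping windows)."""
    seq = seq.upper()
    windows = len(seq) - n + 1
    if windows <= 0:
        return {}  # sequence shorter than the oligo length
    counts = Counter(seq[i:i + n] for i in range(windows))
    return {oligo: c / windows for oligo, c in counts.items()}


freqs = oligo_frequencies("ACGTACGT", 2)
print(freqs)
```

Only oligos that actually occur are stored, so memory stays bounded by the sequence length even though there are 4^n possible oligos; the single pass over the sequence is what makes the count scale with sequence length.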
OLIGO FREQUENCY (II) • Relation w/ alignment • D: oligo distance • Oligo similarity S = 1- D/2 • In seq alignment, if % of matched bases is p, • Then % similarity is by def X = p • If oligo length is n, then, if all mutations (replace, delete and insert) are single, % similarity of each oligo is p^n • Hence, excluding correlations, S = p^n = X^n • With correlation, S = X^n + (1 – X^n)/4^n • For 1% accuracy, n >=7
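The relations on this slide can be sketched in Python as follows. The slide does not define the oligo distance D; since S = 1 – D/2 must lie between 0 and 1, this sketch assumes D is the sum of absolute differences between the two oligo frequency vectors, which ranges from 0 to 2. That choice, and all function names, are assumptions made for illustration.

```python
from collections import Counter


def oligo_frequencies(seq, n):
    """Relative frequency of each length-n oligo in seq (overlapping windows)."""
    windows = len(seq) - n + 1
    if windows <= 0:
        raise ValueError("sequence is shorter than the oligo length")
    counts = Counter(seq[i:i + n] for i in range(windows))
    return {oligo: c / windows for oligo, c in counts.items()}


def oligo_distance(a, b, n):
    """ASSUMED definition of D: L1 distance between the two frequency vectors."""
    fa, fb = oligo_frequencies(a, n), oligo_frequencies(b, n)
    return sum(abs(fa.get(o, 0.0) - fb.get(o, 0.0)) for o in set(fa) | set(fb))


def oligo_similarity(a, b, n):
    """S = 1 - D/2, as on the slide."""
    return 1.0 - oligo_distance(a, b, n) / 2.0


def predicted_similarity(x, n):
    """Slide's relation with correlation: S = X^n + (1 - X^n) / 4^n."""
    return x ** n + (1.0 - x ** n) / 4 ** n


print(oligo_similarity("ACGTACGT", "ACGTACGT", 3))
print(oligo_similarity("AAAA", "CCCC", 2))
print(predicted_similarity(0.9, 9))
```

Identical sequences give S = 1 and sequences sharing no oligo give S = 0, which is the behaviour the relation S = 1 – D/2 requires of D.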
Figure: log S vs. log X • Simulated random mutations • Oligo length = 9 • Labels: oligo, align
Figure: log S vs. log X • Tree of Life (35 organisms) • Oligo length = 9 • Labels: oligo, align
OLIGO FREQUENCY (III) • Empirical: S = X^(kn), k ~ 2/3 • Why? The earlier relation assumed gaps (delete + insert) are single and uncorrelated • Fact – most gaps are EXTENDED: if there is one, there will be more • Because non-conserved parts are NOT protected by evolution • A model that generates extended gaps can give k ~ 2/3 • Note: for two random seqs, X < 0.5
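One way to estimate the exponent k from paired measurements is a straight-line fit in log space, since S = X^(kn) implies log S = k · n · log X. The Python sketch below is an assumption about how such a fit could be done, not the author's procedure; the data points are synthetic, generated from the relation itself purely to show that the fit recovers k.

```python
import math


def fit_k(pairs, n):
    """Least-squares slope through the origin of log S against n * log X."""
    num = den = 0.0
    for x, s in pairs:
        u = n * math.log(x)  # predictor: n * log X
        v = math.log(s)      # response: log S
        num += u * v
        den += u * u
    if den == 0.0:
        raise ValueError("need at least one pair with 0 < X < 1")
    return num / den


# Synthetic (X, S) pairs built from S = X^(kn); NOT measured data.
n = 9
k_true = 2 / 3
pairs = [(x, x ** (k_true * n)) for x in (0.6, 0.7, 0.8, 0.9)]
k_hat = fit_k(pairs, n)
print(k_hat)
```

With real (X, S) pairs the points scatter around the line, and the fitted slope plays the role of the empirical k on the slide.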
Figure: log S vs. log X • Simulated random mutations with extended gaps • Oligo length = 9 • Labels: oligo, align
Figure: Tree of Life (35 organisms) • Labels: Bacteria, Eukarya, Archaea, A. aeolicus, T. maritima
Oligo method is Robust • Three tests • Random truncation of 16S rRNA to 800 to 1200 bases (Bacteria and Archaea) • Random inversion of 16S rRNA (splice, reverse order and reconnect) • Random concatenation of 23S, 16S and 5S rRNA sequences (12 Bacteria and 6 Archaea)
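The three perturbations can be sketched in Python as below. This is an illustrative reading of the slide, not the author's code: truncation keeps one random contiguous window, inversion reverses one random segment in place (plain reversal, not reverse complement), and concatenation joins the sequences in random order. The function names and the toy sequence are assumptions.

```python
import random


def random_truncation(seq, rng, min_len=800, max_len=1200):
    """Keep one random contiguous window of min_len..max_len bases."""
    length = rng.randint(min(min_len, len(seq)), min(max_len, len(seq)))
    start = rng.randint(0, len(seq) - length)
    return seq[start:start + length]


def random_inversion(seq, rng):
    """Splice out a random segment, reverse its order, and reconnect."""
    i, j = sorted(rng.sample(range(len(seq) + 1), 2))
    return seq[:i] + seq[i:j][::-1] + seq[j:]


def random_concatenation(seqs, rng):
    """Join the given sequences end to end in a random order."""
    order = list(seqs)
    rng.shuffle(order)
    return "".join(order)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
toy = "".join(rng.choice("ACGT") for _ in range(1500))  # stand-in, not a real rRNA
truncated = random_truncation(toy, rng)
inverted = random_inversion(toy, rng)
mixed = random_concatenation([toy[:500], toy[500:1000], toy[1000:]], rng)
print(len(truncated), len(inverted), len(mixed))
```

Each perturbed sequence can then be compared with the others by oligo frequency and by alignment to see which method still recovers the original grouping.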
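Figure: 16S rRNA, truncated, oligo method • Labels: Bacteria, Archaea, A. aeolicus, T. maritima
Figure: 16S rRNA, truncated, alignment • Labels: A. aeolicus, T. maritima
Figure: 16S rRNA, inverted, oligo method • Labels: Bacteria, Archaea, A. aeolicus, T. maritima
Figure: 16S rRNA, inverted, alignment • Labels: T. maritima, A. aeolicus
Figure: 5S + 16S + 23S rRNAs, mixed, oligo method • Labels: Bacteria, Archaea, A. aeolicus, T. maritima
Figure: 5S + 16S + 23S rRNAs, mixed, alignment • Labels: A. aeolicus, T. maritima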
CONCLUSION • Oligo frequency characterizes DNA seqs • Oligo similarity is related to alignment similarity • Oligo vs alignment gives a handle on mechanism of generation of extended gaps • Oligo method is robust to truncation and inversions • Can be developed into a tool for analysis and comparison of long and multiple seqs