Looking at Whole Genomes: Frequency of Occurrence of Oligonucleotides. Lecture I, Winter School on Modern Biophysics, National Taiwan University, December 16-18, 2002. HC Lee, Dept Physics & Dept Life Science, National Central University
Growth of sequenced genome data (GenBank, as of 2002 January 13) • Genome data exploded after 1995 [Figure: GenBank growth curve, in millions of sequences] CBL@NCU
The Human Genome • 23 pairs of chromosomes (24 distinct types: 22 autosomes, X and Y) • About 3 billion base pairs (bp) • First working draft published February 15-16, 2001
First working draft of the Human Genome • Sequence of the first working draft published in February 2001 • Nature 409, 860-921 (February 15, 2001) • Science 291, 1304-1351 (February 16, 2001)
Genome: the Book of Life, written in four letters • DNA: a polymer of nucleotides • Nucleotide: backbone + base • Four types of bases: A, C, G, T (the four letters) • Gene: coded sequence of bases • Genome: set of all genes; set of all chromosomes • Chromosome: packaged pair of DNA strands with double-helix structure
Central Dogma • Genome (DNA): genetic information (genes) • RNA polymerase: transcribes genes into mRNA • Ribosomes: translate mRNA (nucleotide sequence) into proteins (amino-acid sequence) • Proteins: expression and function
New way to do Life Science Research • in vivo (in the living organism) • in vitro (in the test tube) • in silico (in the computer)
Frequency of occurrence of oligonucleotides A simple first look at whole genomes
Oligo (or k-mer) Frequency • Oligonucleotide (oligo): short sequence of k nucleotides (k ~ 2-30); a k-mer • There are 4^k different kinds of k-mers • Frequencies of occurrence of all k-mers in a sequence can be obtained by reading with a “sliding window” • The complete set of k-mer frequencies characterizes a DNA sequence • Very fast to compute; cost scales with sequence length • For multiple sequences, cost scales with the number of sequences • Related to alignment
Counting k-mers with Sliding Window • At each window position, increment the count of the oligo read, e.g. N(GTTACCC) = N(GTTACCC) + 1 • Sum over all N(oligo) = sequence length (for a circular sequence) • Sequence is represented by the set {N(oligo) | all oligos} • Or: for each k, sequence is represented by a 4^k-component vector
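The sliding-window count can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `count_kmers` and the toy sequence are assumptions made here, and the sequence is treated as circular so that the counts sum to the sequence length.

```python
from collections import Counter

def count_kmers(seq: str, k: int) -> Counter:
    """Count every k-mer in a circular sequence with a sliding window."""
    n = len(seq)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the sequence length")
    # Append the first k-1 bases so that windows can wrap around the end.
    extended = seq + seq[:k - 1]
    counts = Counter()
    for i in range(n):                      # one window per start position
        counts[extended[i:i + k]] += 1      # N(oligo) = N(oligo) + 1
    return counts

# Toy example (hypothetical sequence): two copies of GTTACCC on a circle.
counts = count_kmers("GTTACCCGTTACCC", 7)
print(counts["GTTACCC"])        # 2
print(sum(counts.values()))     # 14, the circular sequence length
```

The loop visits each start position once, so the cost grows in proportion to the sequence length; oligos that never occur simply have count 0 in the `Counter`.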
Frequency distribution of 6-mers [Figure: number of oligos vs. frequency of oligo] • More about this in Lecture II
Making a portrait • Divide a square into 2^k by 2^k cells, each cell corresponding to one of the 4^k different kinds of k-mers • Write in each cell the frequency of the k-mer • Color-code ranges of frequencies
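One way to lay out the 2^k by 2^k grid is sketched below. The slide does not say which cell belongs to which k-mer, so the mapping here is an assumption: each base contributes one bit to the row index and one bit to the column index. The names `BITS`, `kmer_cell` and `portrait` are hypothetical.

```python
# Assumed (row_bit, col_bit) for each base; the lecture may use another layout.
BITS = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}

def kmer_cell(kmer: str) -> tuple[int, int]:
    """Map a k-mer to a (row, col) cell of a 2^k by 2^k grid."""
    row = col = 0
    for base in kmer:
        r, c = BITS[base]
        row = (row << 1) | r    # append this base's row bit
        col = (col << 1) | c    # append this base's column bit
    return row, col

def portrait(counts: dict[str, int], k: int) -> list[list[int]]:
    """Fill the grid with k-mer frequencies; absent k-mers stay at 0."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for kmer, n in counts.items():
        # Skip entries of the wrong length or with non-ACGT symbols.
        if len(kmer) == k and set(kmer) <= set(BITS):
            row, col = kmer_cell(kmer)
            grid[row][col] = n
    return grid

grid = portrait({"AA": 3, "TG": 1, "TT": 5}, 2)
for line in grid:
    print(line)
```

With this mapping every k-mer gets its own cell, and k-mers sharing a prefix fall in the same sub-square. Color-coding then amounts to binning the grid values into frequency ranges before display.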
Mycoplasma genitalium Length 0.58 Mb G+C content 32% Bacteria, Firmicutes Pathogen from the human urogenital tract
Mycoplasma pneumoniae Length 0.816 Mb G+C content 40% Bacteria Firmicutes Parasite of the human respiratory tract.
Borrelia burgdorferi Length 0.911 Mb G+C content 30% Bacteria Spirochaetales Causative agent of Lyme disease (neurologic complications, arthritis)
Rhizobium sp. NGR234 Length 0.53 Mb G+C content 59% Bacteria Proteobacteria Representative bacterium that fixes nitrogen in symbiosis with many plants.
Aquifex aeolicus Length 1.55 Mb G+C content 40% Bacteria Aquificales One of the earliest diverging and most thermophilic bacteria known. Can grow on hydrogen, oxygen, carbon dioxide.
Haemophilus influenzae Length 1.83 Mb G+C content 38% Bacteria Proteobacteria Blood-loving bacterium once mistakenly thought to cause influenza; causes respiratory infections and meningitis.
Methanococcus jannaschii Length 1.66 Mb G+C content 31% Archaea Euryarchaeota Anaerobic, methane-producing hyperthermophile; grows at > 200 atm with an optimum temperature of 85 °C. Note: fractals
Helicobacter pylori Length 1.67 Mb G+C content 40% Bacteria Proteobacteria Acid-loving causative agent of chronic gastric diseases. Note: fractals
Archaeoglobus fulgidus Length 2.18 Mb G+C content 49% Archaea, Euryarchaeota Hyperthermophilic sulphur-reducer; causes havoc by souring oil wells.
Synechocystis sp. PCC 6803 Length 3.587 Mb G+C content 48% Bacteria Cyanobacteria Unicellular cyanobacterium widely used to study the mechanism of oxygen-producing photosynthesis. Exceptionally wide distribution of frequency of occurrence of short oligos.
Molecular Evolution & Phylogeny • Organism is represented by its genome • A universal ancestor is believed to have existed • Random mutation of DNA sequence leads to divergence and new species • Pressure from fitness causes conservation of sequence
Phylogeny & Sequence similarity • Fitness exerts pressure on functional sequences to be conserved. If the rate of change induced by mutation is assumed constant, the dissimilarity between two homologous sequences indicates the time elapsed since they diverged. Hence sequence similarity can be used to study phylogeny. • E.g. phylogeny based on 16S/18S rRNA
Sequence Alignment • Most important method for studying sequence homology • Example: alignment of two sequences a and b
  Seq a: TACCATCGCAAACAT--GG (length 17 b)
         x||||x|x|||x-|x--x|
  Seq b: AACCACCACAAG-ACCTCG (length 18 b)
• Aligned length 19: 10 matches (|), 6 mismatches (x), 1 single gap (-, SG), 1 extended gap (--, EG) • Score = matches - (SG + EG)*P - (L_EG - 1)*PE, where SG and EG are the numbers of single and extended gaps, L_EG is the length of the extended gap, P is the penalty per gap and PE is the penalty per additional position in an extended gap • With P = PE = 1: Score = 10 - 2 - 1 = 7 • Similarity = matches / aligned length = 10/19 = 53%
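The bookkeeping in this example can be checked with a short script. It is a sketch under assumptions: the gapped strings are reconstructed from the match/mismatch/gap marker line of the slide, each gap costs P, each additional position inside an extended gap costs PE, and mismatches are not penalized. The function name `score_alignment` is hypothetical.

```python
def score_alignment(a: str, b: str, p: int = 1, pe: int = 1):
    """Score two gapped, equal-length strings; '-' marks a gap position."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = 0
    gap_runs = []          # length of each run of consecutive gap columns
    run, side = 0, None    # current run length and which sequence has the gap
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            this_side = "a" if x == "-" else "b"
            if run and this_side == side:
                run += 1                   # the current gap is extended
            else:
                if run:
                    gap_runs.append(run)   # a gap in the other sequence ended
                run, side = 1, this_side
        else:
            if run:
                gap_runs.append(run)
                run, side = 0, None
            if x == y:
                matches += 1
            else:
                mismatches += 1
    if run:
        gap_runs.append(run)
    # One penalty P per gap, plus PE per extra position in an extended gap.
    score = matches - len(gap_runs) * p - sum(r - 1 for r in gap_runs) * pe
    return matches, mismatches, gap_runs, score, matches / len(a)

seq_a = "TACCATCGCAAACAT--GG"
seq_b = "AACCACCACAAG-ACCTCG"
m, mm, runs, score, sim = score_alignment(seq_a, seq_b)
print(m, mm, runs, score, round(sim, 3))   # 10 6 [1, 2] 7 0.526
```

Changing `p` or `pe` changes the score of the same alignment, which is why the best alignment depends on the parameters chosen.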
Sequence Alignment (II) • Result is intuitive and evolution based • Widely used in sequence analysis: homology search, phylogeny, etc. • Parameter dependent: many alignments possible (Needleman-Wunsch algorithm) • Applies to DNA and protein sequences • Good software exists, e.g. BLAST, GCG, ... • Fast for length < 2000 • Pairwise cost grows with the product of the two lengths, so it is slow for long sequences; optimal multiple alignment is NP-hard
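A minimal sketch of the Needleman-Wunsch recursion shows where the parameter dependence comes from. The scoring values (`match=1`, `mismatch=0`, `gap=-1`) and the function name are illustrative assumptions, and the sketch uses a simple linear gap penalty rather than separate single-gap and extended-gap penalties. It returns only the optimal global score, not the alignment itself.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = 0, gap: int = -1) -> int:
    """Optimal global alignment score by dynamic programming."""
    # prev[j] = best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]                       # a[:i] against the empty prefix
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap                # a[i-1] aligned to a gap
            left = cur[j - 1] + gap           # b[j-1] aligned to a gap
            cur.append(max(diag, up, left))
        prev = cur
    return prev[-1]

print(needleman_wunsch_score("ACGT", "AGT"))           # 2: three matches, one gap
print(needleman_wunsch_score("ACGT", "AGT", gap=-3))   # 0: same gap, costlier
```

The table has (len(a)+1) x (len(b)+1) cells, each filled in constant time, so the work grows with the product of the two sequence lengths.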
The Ribosome • E.g. phylogeny based on 16S/18S rRNA • 16S (Prokaryotes): about 1550 bases; 18S (Eukaryotes): about 1800 bases • Ribosomal enzyme • Gene expression: transcription, then translation at the ribosome • Among the most ancient and best conserved biological machines • In the genome of EVERY organism • Two subunits: 30S + 50S • 30S (small subunit): 16S/18S rRNA + 20 proteins • Translates mRNA
“Cartoon” of 16S rRNA [Figure: domains labeled Head, Body, Platform]
E. coli 16S rRNA secondary structure [Figure: domains labeled Head, Platform, Body, 3'm]
16S rRNA alignment tree • 35 organisms: 19 bacteria, 9 archaea, 7 eukarya [Figure: tree with domains Bacteria, Archaea, Eukarya; labeled taxa: E. coli, Bacillus, Aquifex, Herpetosiphon, Thermotoga, Methanococcus, Archaeoglobus, Mouse, Homo sapiens, C. elegans]
16S/18S rRNA k-mer tree as function of k [Figure: trees with domains labeled Bacteria, Archaea, Eukarya]
Oligo frequency and sequence alignment distances are correlated • If sequences evolve ONLY by uncorrelated single mutations, then S = X^n (because the chance of any one base not changing is X, so an n-mer survives unchanged with probability X^n) • X: alignment similarity • S: oligo frequency similarity • n: oligo length • In practice there is more than single mutation, e.g. extended gaps. Then S = X^(kn) with k < 1 (here k is an exponent factor, not the k-mer length). Empirically: k = 2/3.
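The single-mutation relation can be checked numerically. The sketch below rests on assumptions of its own: it tracks only which positions were substituted (every substitution is assumed to change the base), takes X as the fraction of unchanged bases, and uses the fraction of n-mer windows left intact as a simple stand-in for the oligo frequency similarity S. The lecture's exact definition of S may differ, and `simulate` and its defaults are hypothetical.

```python
import random

def simulate(length: int = 200_000, n: int = 9, p: float = 0.05, seed: int = 0):
    """Return (X, S) after independent point substitutions at rate p."""
    rng = random.Random(seed)
    # mutated[i] is True when base i was substituted.
    mutated = [rng.random() < p for _ in range(length)]
    x = 1 - sum(mutated) / length            # fraction of unchanged bases
    # Prefix sums give the number of substitutions in any window in O(1).
    prefix = [0]
    for m in mutated:
        prefix.append(prefix[-1] + m)
    windows = length - n + 1
    intact = sum(1 for i in range(windows) if prefix[i + n] == prefix[i])
    s = intact / windows                     # fraction of intact n-mers
    return x, s

x, s = simulate()
print(f"X = {x:.4f}, S = {s:.4f}, X^9 = {x ** 9:.4f}")
```

Because an n-mer is intact only when all n of its bases are unchanged, S should come out close to X^n for independent substitutions. Insertions and deletions are not modeled here.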
Simulated Random Mutations • log S vs. log X • Oligo length = 9 • S = X^9 [Figure: axes labeled oligo, align]
Simulated Random Mutations with Extended gaps • log S vs. log X • h = 4, ng = 3, k_th = 0.625 • Oligo length = 9 • S = X^6.3 [Figure: axes labeled oligo, align]
Tree of Life (35 organisms) • log S vs. log X • h = 5, ng = 2.5, k_th = 0.8, k_ex = 0.66 • Oligo length = 9 [Figure: axes labeled oligo, align]
Oligo frequency tree [Figure: tree with labels Eukarya, Archaea, Bacteria, Aquifex, Thermotoga]
Alignment tree [Figure: tree with labels Aquifex, Thermotoga]
Comparison of 16S/18S rRNA Trees of Life (35 organisms) • Similar topology • Differences in detail • Black: oligo frequency; Red: sequence alignment [Figure: overlaid trees with labels Bacteria, Archaea, Eukarya, Aquifex, Thermotoga]
Oligo method is Robust • Three tests (Bacteria and Archaea) • Random truncation of 16S rRNA to between 800 and 1200 bases • Random inversion of 16S rRNA (splice, reverse order and reconnect) • Random concatenation of 23S, 16S and 5S rRNA sequences
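The three perturbations can be sketched as small helper functions. This is an illustration only: the function names are hypothetical, the slide does not say how the cut points are drawn (uniform choice is assumed here), and "reverse order" is taken literally as reversing the segment, not reverse-complementing it.

```python
import random

def truncate(seq: str, rng: random.Random, lo: int = 800, hi: int = 1200) -> str:
    """Keep a random contiguous piece of between lo and hi bases."""
    length = min(len(seq), rng.randint(lo, hi))
    start = rng.randint(0, len(seq) - length)
    return seq[start:start + length]

def invert_segment(seq: str, rng: random.Random) -> str:
    """Splice out a random segment, reverse its order and reconnect it."""
    i, j = sorted(rng.sample(range(len(seq) + 1), 2))
    return seq[:i] + seq[i:j][::-1] + seq[j:]

def concatenate(parts: list[str], rng: random.Random) -> str:
    """Join several sequences (e.g. 23S, 16S, 5S rRNA) in random order."""
    shuffled = list(parts)
    rng.shuffle(shuffled)
    return "".join(shuffled)

rng = random.Random(1)
# Made-up stand-in for a 16S rRNA sequence of about 1550 bases.
toy_16s = "".join(rng.choice("ACGT") for _ in range(1550))
print(len(truncate(toy_16s, rng)))
print(len(invert_segment(toy_16s, rng)))
print(concatenate(["AAA", "CC", "G"], rng))
```

Each perturbed sequence would then be passed through the same k-mer counting and tree building as the original, and the resulting trees compared.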
16S rRNA truncated: alignment tree and oligo tree [Figure: two trees, scale bar 0.1; leaves labeled with lower-case letters and capitals A-H; named taxa: Aquifex, Thermotoga, Sulfolobus, Aeropyrum]
16S rRNA truncated: alignment tree and oligo tree [Figure: positions of Aquifex (A) and Thermotoga (H) in the two trees]