Comparative Genomics & Annotation
• The Foundation of Comparative Genomics
• The main methodological tasks of CG
• Annotation: Protein Gene Finding, RNA Structure Prediction, Signal Finding
• Overlapping Annotations: Protein Genes, Protein-RNA
• Combining Grammars
Ab Initio Gene Prediction
Ab initio gene prediction: predicting the location of genes (and the amino acid sequences they encode) given a raw DNA sequence.
[Figure: gene structure from 5' to 3', with alternating exons and introns]
....tttttgcagtactcccgggccctctgttggggcctccccttcctctccagggtggagtcgaggaggcggggtgcgggcctccttatctctagagccggccctggctctctggcgcggggccccttagtccgggctttttgccatggggtctctgttccctctgtcgctgctgttttttttggcggccgcctacccgggagttgggagcgcgctgggacgccggactaagcgggcgcaaagccccaagggtagccctctcgcgccctccgggacctcagtgcccttctgggtgcgcatgagcccggagttcgtggctgtgcagccggggaagtcagtgcagctcaattgcagcaacagctgtccccagccgcagaattccagcctccgcaccccgctgcggcaaggcaagacgctcagagggccgggttgggtgtcttaccagctgctcgacgtgagggcctggagctccctcgcgcactgcctcgtgacctgcgcaggaaaaacacgctgggccacctccaggatcaccgcctacagtgagggacaggggctcggtcccggctggggtgaggggagggggctggaagaggtgggggaagggtagttgacagtcgctctatagggagcgcccgcggacctcactcagaggctcccccttgccttagaaccgccccacagcgtgattttggagcctccggtcttaaagggcaggaaatacactttgcgctgccacgtgacgcaggtgttcccggtgggctacttggtggtgaccctgaggcatggaagccgggtcatctattccgaaagcctggagcgcttcaccggcctggatctggccaacgtgaccttgacctacgagtttgctgctggaccccgcgacttctggcagcccgtgatctgccacgcgcgcctcaatctcgacggcctggtggtccgcaacagctcggcacccattacactgatgctcggtgaggcacccctgtaaccctggggactaggaggaagggggcagagagagttatgaccccgagagggcgcacagaccaagcgtgagctccacgcgggtcgacagacctccctgtgttccgttcctaattctcgccttctgctcccagcttggagccccgcgcccacagctttggcctccggttccatcgctgcccttgtagggatcctcctcactgtgggcgctgcgtacctatgcaagtgcctagctatgaagtcccaggcgtaaagggggatgttctatgccggctgagcgagaaaaagaggaatatgaaacaatctggggaaatggccatacatggtgg....
Input data: raw DNA sequence.
Output: the annotated sequence, with exons in upper case and UTR, intron and intergenic sequence in lower case.
5'....tttttgcagtactcccgggccctctgttggggcctccccttcctctccagggtggagtcgaggaggcggggctgcgggcctccttatctctagagccggccctggctctctggcgcggggccccttagtccgggctttttgccATGGGGTCTCTGTTCCCTCTGTCGCTGCTGTTTTTTTTGGCGGCCGCCTACCCGGGAGTTGGGAGCGCGCTGGGACGCCGGACTAAGCGGGCGCAAAGCCCCAAGGGTAGCCCTCTCGCGCCCTCCGGGACCTCAGTGCCCTTCTGGGTGCGCATGAGCCCGGAGTTCGTGGCTGTGCAGCCGGGGAAGTCAGTGCAGCTCAATTGCAGCAACAGCTGTCCCCAGCCGCAGAATTCCAGCCTCCGCACCCCGCTGCGGCAAGGCAAGACGCTCAGAGGGCCGGGTTGGGTGTCTTACCAGCTGCTCGACGTGAGGGCCTGGAGCTCCCTCGCGCACTGCCTCGTGACCTGCGCAGGAAAAACACGCTGGGCCACCTCCAGGATCACCGCCTACAgtgagggacaggggctcggtcccggctggggtgaggggagggggctggaagaggtggggaagggtagttgacagtcgctctatagggagcgcccgcggacctcactcagaggctcccccttgccttagAACCGCCCCACAGCGTGATTTTGGAGCCTCCGGTCTTAAAGGGCAGGAAATACACTTTGCGCTGCCACGTGACGCAGGTGTTCCCGGTGGGCTACTTGGTGGTGACCCTGAGGCATGGAAGCCGGGTCATCTATTCCGAAAGCCTGGAGCGCTTCACCGGCCTGGATCTGGCCAACGTGACCTTGACCTACGAGTTTGCTGCTGGACCCCGCGACTTCTGGCAGCCCGTGATCTGCCACGCGCGCCTCAATCTCGACGGCCTGGTGGTCCGCAACAGCTCGGCACCCATTACACTGATGCTCGgtgaggcacccctgtaaccctggggactaggaggaagggggcagagagagttatgaccccgagagggcgcacagaccaagcgtgagctccacgcgggtcgacagacctccctgtgttccgttcctaattctcgccttctgctcccagCTTGGAGCCCCGCGCCCACAGCTTTGGCCTCCGGTTCCATCGCTGCCCTTGTAGGGATCCTCCTCACTGTGGGCGCTGCGTACCTATGCAAGTGCCTAGCTATGAAGTCCCAGGCGTAAagggggatgttctatgccggctgagcgagaaaaagaggaatatgaaacaatctggggaaatggccatacatggtgg.... 3'
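A case-annotated sequence like the one above can be turned back into exon coordinates. The following is a minimal sketch, assuming that upper-case runs mark exons and everything in lower case is UTR, intron or intergenic sequence; the function names and the toy sequence are hypothetical and not part of the slides.

```python
import re

def exon_intervals(annotated):
    """Return 0-based, half-open (start, end) intervals of upper-case runs."""
    return [(m.start(), m.end()) for m in re.finditer(r"[ACGT]+", annotated)]

def coding_sequence(annotated):
    """Concatenate the upper-case (exon) parts in order."""
    return "".join(annotated[s:e] for s, e in exon_intervals(annotated))

# Hypothetical toy input: two exons separated by one lower-case intron.
toy = "ttgccATGGGGTCTgtgagggacagAACCGCTAAaggg"
exons = exon_intervals(toy)
cds = coding_sequence(toy)
```

The same two functions can be applied to the full annotated output sequence to list its exons.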
Levels of Annotation
"Annotation": tagging regions and nucleotides with information about function, structure, knowledge, additional data, ...
[Figure: alignment columns of homologous genomes]
Annotation levels
• Protein coding genes, including alternative splicing
• RNA structure
• Regulatory signals: fast/slow, prediction of TF, binding constants, ...
• Selection strength, ...
• Epigenomics: methylation, histone modification
Further complications
• Integration of levels: RNA structure of mRNA, signals in coding regions, ...
• Knowledge and annotation transfer: experimental knowledge might be present in other species
• Evolution of features: regulatory signals > RNA > protein
• Combining with non-homologous analysis: tests for common regulation
• Combining species and population perspectives
Observables, Hidden Variables, Evolution & Knowledge
[Figure: observables x, hidden variable, evolution, and knowledge (constraints)]
If knowledge deterministic
Genscan
[Figure: Genscan state diagram; legend: state with length distribution]
• Exons of phase 0, 1 or 2
• Introns of phase 0, 1 or 2
• Initial exon
• Terminal exon
• Exon of single exon genes
• 5' UTR
• 3' UTR
• Poly-A signal
• Promoter
• Intergenic sequence
Omitted: reverse strand part of the HMM
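The phase bookkeeping behind the exon and intron states can be illustrated with a few lines of arithmetic. This is a minimal sketch under one common convention (intron phase = number of coding nucleotides upstream of the intron, modulo 3); the slides do not state which convention Genscan uses, and the function name is hypothetical.

```python
def intron_phases(coding_exon_lengths):
    """Phase of the intron following each exon except the last.

    Phase 0: intron falls between two codons.
    Phase 1: intron falls after the first base of a codon.
    Phase 2: intron falls after the second base of a codon.
    """
    phases = []
    coding_bases_so_far = 0
    for length in coding_exon_lengths[:-1]:
        coding_bases_so_far += length
        phases.append(coding_bases_so_far % 3)
    return phases

# Hypothetical gene with three coding exons of 100, 50 and 30 nt.
example = intron_phases([100, 50, 30])
```

A single-exon gene has no introns, so the function returns an empty list for it.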
Comparative Gene Annotation
AGGTATATAATGCG.....
AGCCATTTAGTGCG.....
P_coding{ATG-->GTG} or P_non-coding{ATG-->GTG}
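The contrast between a coding and a non-coding explanation of ATG-->GTG rests on how a substitution looks at the protein level. The sketch below does not compute the two probabilities on the slide; it only classifies aligned codon pairs as identical, synonymous or nonsynonymous under the standard genetic code, assuming ungapped, upper-case DNA. All function names are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def classify_codon_pair(codon1, codon2):
    """Compare two aligned codons at the nucleotide and amino acid level."""
    if codon1 == codon2:
        return "identical"
    if CODON_TABLE[codon1] == CODON_TABLE[codon2]:
        return "synonymous"
    return "nonsynonymous"

def frame_summary(seq1, seq2, frame):
    """Count codon-pair classes when reading both sequences in one frame."""
    counts = {"identical": 0, "synonymous": 0, "nonsynonymous": 0}
    usable = min(len(seq1), len(seq2))
    for pos in range(frame, usable - 2, 3):
        kind = classify_codon_pair(seq1[pos:pos + 3], seq2[pos:pos + 3])
        counts[kind] += 1
    return counts

# The two aligned fragments shown on the slide, read in frame 0.
summary = frame_summary("AGGTATATAATGCG", "AGCCATTTAGTGCG", 0)
```

A probabilistic gene finder would replace these counts with substitution probabilities under a coding and a non-coding model and compare the two.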
Gene Finding & Protein Homology (Gelfand, Mironov & Pevzner, 1996)
Spliced Alignment:
1. Define set of potential exons in new genome.
2. Make exon ordering graph (EOG).
3. Align EOG to protein database.
[Figure: protein database and exon ordering graph, with example sequences T Y G H L P T Y G H L P T Y - - L P M Y L P M T W Q]
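Step 2 of the spliced alignment procedure can be sketched as follows. This is a simplified illustration, assuming candidate exons are given as inclusive (start, end) coordinates and that one exon may follow another whenever it starts after the first one ends; the real method also scores each path against a protein, which is not shown. Function names are hypothetical.

```python
from functools import lru_cache

def exon_ordering_graph(exons):
    """Map each candidate exon to the exons that may follow it."""
    ordered = sorted(exons)
    return {e: [f for f in ordered if f[0] > e[1]] for e in ordered}

def count_chains(graph):
    """Count non-empty exon chains (paths) in the ordering graph."""
    @lru_cache(maxsize=None)
    def chains_from(exon):
        # The chain may stop here (1) or continue with any successor.
        return 1 + sum(chains_from(nxt) for nxt in graph[exon])
    return sum(chains_from(e) for e in graph)

# Hypothetical candidates: the first two overlap, so no chain uses both.
candidates = [(0, 10), (5, 20), (25, 40)]
graph = exon_ordering_graph(candidates)
n_chains = count_chains(graph)
```

The number of chains grows quickly with the number of candidates, which is why the alignment to the protein is done by dynamic programming over the graph rather than by enumerating chains.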
Simultaneous Alignment & Gene Finding
Bafna & Huson, 2000; T. Scharling, 2001; Blayo, 2002.
Align by minimizing distance / maximizing similarity.
Align genes with structure known/unknown:
Secondary Structure Generators
S --> LS (.869) | L (.131)
F --> dFd (.788) | LS (.212)
L --> s (.895) | dFd (.105)
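The grammar can be run forwards to generate random secondary structures. The sketch below reads the rules as S --> LS | L, F --> dFd | LS and L --> s | dFd with the probabilities on the slide, and writes the paired d...d as "(" and ")" and the unpaired s as "." (dot-bracket notation). The sampler and its names are illustrative assumptions, not code from the cited work.

```python
import random

# Each nonterminal has two alternatives: (probability, right-hand side).
RULES = {
    "S": ((0.869, ("L", "S")), (0.131, ("L",))),
    "F": ((0.788, ("(", "F", ")")), (0.212, ("L", "S"))),
    "L": ((0.895, (".",)), (0.105, ("(", "F", ")"))),
}

def sample_structure(rng):
    """Expand S left to right with an explicit stack; return dot-bracket text."""
    out = []
    stack = ["S"]
    while stack:
        symbol = stack.pop()
        if symbol not in RULES:
            out.append(symbol)  # terminal: "(", ")" or "."
            continue
        (p_first, first), (_, second) = RULES[symbol]
        rhs = first if rng.random() < p_first else second
        stack.extend(reversed(rhs))  # leftmost symbol is expanded next
    return "".join(out)

def is_balanced(structure):
    """Check that every "(" is closed by a later ")"."""
    depth = 0
    for ch in structure:
        depth += (ch == "(") - (ch == ")")
        if depth < 0:
            return False
    return depth == 0

structure = sample_structure(random.Random(0))
```

Because brackets are only ever emitted in pairs around an F, every sampled string is a valid nested structure; structure prediction runs the same grammar in reverse to find likely parses of an observed sequence.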
RNA Structure Application
From Knudsen et al. (1999); Knudsen & Hein, 2003
Observing Evolution has 2 parts
[Figure: sequence x (C C A A G C A U U) and its further history]
P(x):
P(Further history of x):
http://www.stats.ox.ac.uk/research/genome/projects/currentprojects
Hidden Markov Model for Overlapping Genes
• Only starts in AUG (0.06)
• Will stop in “STOP” (1.0)
Hidden states (annotation): NC, 1, 2, 3, (1,2), (1,3), (2,3), (1,2,3)
[Figure: scanning the virus genome in the 1st, 2nd and 3rd reading frames; state diagram with nodes NC, S [1], S [2], S [3], D [1,2], D [2,3], D [3,1], TC [1,2,3]]
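The eight hidden states are the subsets of the three reading frames that are coding at a position. A minimal sketch of that state space, and of how a set of known genes maps to a per-position annotation, is given below; gene coordinates are assumed to be 0-based and half-open, and all names are hypothetical.

```python
from itertools import combinations

FRAMES = (1, 2, 3)

def hidden_states():
    """All subsets of reading frames: NC, 1, 2, 3, (1,2), (1,3), (2,3), (1,2,3)."""
    return [
        frozenset(subset)
        for size in range(len(FRAMES) + 1)
        for subset in combinations(FRAMES, size)
    ]

def label(state):
    """Human-readable name of a hidden state."""
    return ",".join(str(f) for f in sorted(state)) if state else "NC"

def annotate(length, genes):
    """Hidden state at each position, given genes as (start, end, frame)."""
    coding = [set() for _ in range(length)]
    for start, end, frame in genes:
        for pos in range(start, end):
            coding[pos].add(frame)
    return [frozenset(frames) for frames in coding]

# Hypothetical toy genome of 6 positions with two overlapping genes.
path = [label(s) for s in annotate(6, [(0, 4, 1), (2, 6, 3)])]
```

In the HMM itself the path is unknown and is inferred from the sequence, with a frame switching on only at a start codon and switching off at a stop codon.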
Molecular Evolution: Known Reading Frames
Known fixed context throughout phylogeny
[Figure: phylogeny with nucleotides A G T C T]
• Simplify genetic code: site classes 4 (4-fold), 2-2 (2-fold) and 1-1-1-1
• Assume multiplicativity of selection factors
Selection on rates, by site class in the 1st reading frame (columns) and 2nd reading frame (rows):

2nd \ 1st    1-1-1-1           2-2               4
1-1-1-1      (f1f2a, f1f2b)    (f2a, f1f2b)      (f2a, f2b)
2-2          (f1a, f1f2b)      (a, f1f2b)        (a, f2b)
4            (f1a, f1b)        (a, f1b)          (a, b)
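The multiplicativity assumption can be written as a small rule. This sketch assumes that a and b are the transition and transversion rates, that f1 and f2 are the selection factors of the 1st and 2nd reading frame, and that a frame's factor applies to a change only if that change alters the amino acid in that frame: both kinds of change at 1-1-1-1 sites, transversions only at 2-2 sites, none at 4-fold sites. These readings and the function name are assumptions made for illustration.

```python
# For each site class: is a (transition, transversion) under selection?
UNDER_SELECTION = {
    "1-1-1-1": (True, True),
    "2-2": (False, True),
    "4": (False, False),
}

def site_rates(class_frame1, class_frame2, a, b, f1, f2):
    """Return (transition rate, transversion rate) for a site in two frames."""
    transition, transversion = a, b
    for site_class, factor in ((class_frame1, f1), (class_frame2, f2)):
        ts_selected, tv_selected = UNDER_SELECTION[site_class]
        if ts_selected:
            transition *= factor
        if tv_selected:
            transversion *= factor
    return transition, transversion

# Hypothetical parameter values, chosen only to make the arithmetic visible.
rates = site_rates("1-1-1-1", "2-2", a=2.0, b=1.0, f1=0.5, f2=0.25)
```

With a single reading frame, the same function applies by passing "4" for the other frame, which leaves the rates untouched by that frame's factor.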
Unknown Reading Frames and Varying Selection
[Figure: aligned sequences (A G T C T ...) with two hidden layers, coding status and selection levels]
Selection levels (8): 0.01, 0.1, 0.2, 0.4, 0.6, 0.8, 1.5, 2.0
Transitions between selection levels: a (.95) for staying, (1-a)/7 for each other level
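The transition structure over selection levels suggested by the figure labels can be written out explicitly. This is a sketch under the assumption that a = 0.95 is the probability of keeping the current level from one position to the next and that the remaining mass is split evenly over the other seven levels; the function name is hypothetical.

```python
SELECTION_LEVELS = [0.01, 0.1, 0.2, 0.4, 0.6, 0.8, 1.5, 2.0]

def selection_transition_matrix(stay=0.95, levels=SELECTION_LEVELS):
    """Row-stochastic matrix: `stay` on the diagonal, the rest spread evenly."""
    k = len(levels)
    move = (1.0 - stay) / (k - 1)
    return [[stay if i == j else move for j in range(k)] for i in range(k)]

matrix = selection_transition_matrix()
```

A high value of a makes neighbouring positions share a selection level, so the inferred selection strength varies smoothly along the genome.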
HIV2 of 14 genomes: Evolution/Selection
Genes: POL, REV, VPX, NEF, TAT, GAG, VIF, VPR, ENV
A. Phylogeny and Evolutionary Parameters.
B. Selection Strengths for Genes and Positions
HIV2 of 14 genomes: Annotation
GenBank: Rev, Pol, Vpx, Tat, Nef, Gag, Vif, Vpr, Env

                 Single Sequence    Phylo-HMM
Sensitivity      0.9308             0.9542
Specificity      0.9939             0.9965
LogLikelihood    -34939.32          -75939.18
ViterbiCont.     -34949.41          -75945.77
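Sensitivity and specificity of an annotation can be computed per nucleotide from a reference and a prediction. The sketch below uses sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP); the slides do not say which definitions were used, and some gene-finding papers report TP / (TP + FP) as specificity instead. The function name and the toy data are hypothetical.

```python
def sensitivity_specificity(reference, predicted):
    """Per-position comparison of two coding/non-coding annotations.

    Both arguments are equal-length sequences of truthy (coding)
    and falsy (non-coding) values.
    """
    if len(reference) != len(predicted):
        raise ValueError("annotations must have the same length")
    tp = fn = fp = tn = 0
    for ref, pred in zip(reference, predicted):
        if ref and pred:
            tp += 1
        elif ref:
            fn += 1
        elif pred:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity

# Hypothetical annotations over 8 positions (1 = coding).
ref = [1, 1, 1, 0, 0, 0, 0, 0]
pred = [1, 1, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(ref, pred)
```

For overlapping genes the comparison would be made per reading frame rather than on a single coding/non-coding track.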
HMM extension: Stop/Start Skidding
• Same evolutionary model as before, but different HMM topology
• 64 states
• 3 different types of transitions
[Figure: HMM topology; legend marks ATG]
Annotation Results: HIV1 vs. HIV2
de novo annotation:
• 81.5% sensitivity (without non-homologous genes)
• 98.5% specificity
a = 0.23, b = 0.06, g = 0.71
Knowing HIV1 (fixing the Viterbi path for one cube):
• 97.6% sensitivity (without non-homologous genes)
• 99.9% specificity
HMM Extension II: Introns
Single Sequence HMM
• Introns will almost always be 3k long
• 27 states
Pair HMM
• 729 states
Conserved RNA Structure in Protein Coding Genes
Problem: Gene Structure Known, RNA Structure Unknown.
[Figure: genome, exons and RNA structure]
Protein-RNA Evolution: Singlet, Doublets, Contagious Dependence
RNA + Protein Evolution
[Figure: prediction of stem-pairing regions for different numbers of sequences (8, 5, 3); labels: Non-structural, Structural, Codon, Nucleotide]
Independence Heuristic
• Singlet: $R_{i,j} = f \cdot q_{i,j}$
• Doublet: $R_{(i_1,i_2),(j_1,j_2)} = f_1 \cdot f_2 \cdot q_{(i_1,i_2),(j_1,j_2)}$
Structure/non-Structure Grammars
Combining Grammars: Multiple Hidden Layers
Present Approach: Two “independent” annotations
• SCFG: RNA Structure
• HMM: Protein Structure
• Combine SCFG & HMM: RNA, Gene Structure
Ideal Approach: Combined Annotation
Joanna Davies
Combining Grammars: Solution Attempts
Independence is non-trivial to define, as the HMM and the SCFG are in principle competing alternative models.
Let X be the stochastic variable giving the HMM annotation.
Let Y be the stochastic variable giving the SCFG annotation.
Are X and Y independent? No.
• Combined grammars (HMM, SCFG) --> SCFG have been devised, but they do not work well, have arbitrary designs and are very large.
• Combinations of Viterbi and posterior decoding arise.
Joanna Davies
http://www.stats.ox.ac.uk/__data/assets/file/0016/3328/combinedHMMartifact.pdf