
Comparative Genomics & Annotation


dooley


  1. Comparative Genomics & Annotation
     • The Foundation of Comparative Genomics
     • The main methodological tasks of CG
     • Annotation: Protein Gene Finding; RNA Structure Prediction; Signal Finding
     • Overlapping Annotations: Protein Genes; Protein-RNA
     • Combining Grammars

  2. 5' 3' Exon Intron Ab Initio Gene prediction Ab initio gene prediction: prediction of the location of genes (and the amino acid sequence it encodes) given a raw DNA sequence. ....tttttgcagtactcccgggccctctgttggggcctccccttcctctccagggtggagtcgaggaggcggggtgcgggcctccttatctctagagccggccctggctctctggcgcggggccccttagtccgggctttttgccatggggtctctgttccctctgtcgctgctgttttttttggcggccgcctacccgggagttgggagcgcgctgggacgccggactaagcgggcgcaaagccccaagggtagccctctcgcgccctccgggacctcagtgcccttctgggtgcgcatgagcccggagttcgtggctgtgcagccggggaagtcagtgcagctcaattgcagcaacagctgtccccagccgcagaattccagcctccgcaccccgctgcggcaaggcaagacgctcagagggccgggttgggtgtcttaccagctgctcgacgtgagggcctggagctccctcgcgcactgcctcgtgacctgcgcaggaaaaacacgctgggccacctccaggatcaccgcctacagtgagggacaggggctcggtcccggctggggtgaggggagggggctggaagaggtgggggaagggtagttgacagtcgctctatagggagcgcccgcggacctcactcagaggctcccccttgccttagaaccgccccacagcgtgattttggagcctccggtcttaaagggcaggaaatacactttgcgctgccacgtgacgcaggtgttcccggtgggctacttggtggtgaccctgaggcatggaagccgggtcatctattccgaaagcctggagcgcttcaccggcctggatctggccaacgtgaccttgacctacgagtttgctgctggaccccgcgacttctggcagcccgtgatctgccacgcgcgcctcaatctcgacggcctggtggtccgcaacagctcggcacccattacactgatgctcggtgaggcacccctgtaaccctggggactaggaggaagggggcagagagagttatgaccccgagagggcgcacagaccaagcgtgagctccacgcgggtcgacagacctccctgtgttccgttcctaattctcgccttctgctcccagcttggagccccgcgcccacagctttggcctccggttccatcgctgcccttgtagggatcctcctcactgtgggcgctgcgtacctatgcaagtgcctagctatgaagtcccaggcgtaaagggggatgttctatgccggctgagcgagaaaaagaggaatatgaaacaatctggggaaatggccatacatggtgg.... 
Input data Output: UTR and intergenic sequence 5'....tttttgcagtactcccgggccctctgttggggcctccccttcctctccagggtggagtcgaggaggcggggctgcgggcctccttatctctagagccggccctggctctctggcgcggggccccttagtccgggctttttgccATGGGGTCTCTGTTCCCTCTGTCGCTGCTGTTTTTTTTGGCGGCCGCCTACCCGGGAGTTGGGAGCGCGCTGGGACGCCGGACTAAGCGGGCGCAAAGCCCCAAGGGTAGCCCTCTCGCGCCCTCCGGGACCTCAGTGCCCTTCTGGGTGCGCATGAGCCCGGAGTTCGTGGCTGTGCAGCCGGGGAAGTCAGTGCAGCTCAATTGCAGCAACAGCTGTCCCCAGCCGCAGAATTCCAGCCTCCGCACCCCGCTGCGGCAAGGCAAGACGCTCAGAGGGCCGGGTTGGGTGTCTTACCAGCTGCTCGACGTGAGGGCCTGGAGCTCCCTCGCGCACTGCCTCGTGACCTGCGCAGGAAAAACACGCTGGGCCACCTCCAGGATCACCGCCTACAgtgagggacaggggctcggtcccggctggggtgaggggagggggctggaagaggtggggaagggtagttgacagtcgctctatagggagcgcccgcggacctcactcagaggctcccccttgccttagAACCGCCCCACAGCGTGATTTTGGAGCCTCCGGTCTTAAAGGGCAGGAAATACACTTTGCGCTGCCACGTGACGCAGGTGTTCCCGGTGGGCTACTTGGTGGTGACCCTGAGGCATGGAAGCCGGGTCATCTATTCCGAAAGCCTGGAGCGCTTCACCGGCCTGGATCTGGCCAACGTGACCTTGACCTACGAGTTTGCTGCTGGACCCCGCGACTTCTGGCAGCCCGTGATCTGCCACGCGCGCCTCAATCTCGACGGCCTGGTGGTCCGCAACAGCTCGGCACCCATTACACTGATGCTCGgtgaggcacccctgtaaccctggggactaggaggaagggggcagagagagttatgaccccgagagggcgcacagaccaagcgtgagctccacgcgggtcgacagacctccctgtgttccgttcctaattctcgccttctgctcccagCTTGGAGCCCCGCGCCCACAGCTTTGGCCTCCGGTTCCATCGCTGCCCTTGTAGGGATCCTCCTCACTGTGGGCG CTGCGTACCTATGCAAGTGCCTAGCTATGAAGTCCCAGGCGTAAagggggatgttctatgccggctgagcgagaaaaagaggaatatgaaacaatctgg ggaaatggccatacatggtgg.... 3'
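The output on this slide appears to mark predicted coding exons in upper case and UTR, intron and intergenic sequence in lower case. Assuming that convention, a minimal Python sketch for recovering exon coordinates from such a case-annotated string might look as follows. The function name `exon_intervals` and the toy sequence are illustrative and not taken from the presentation.

```python
import re

def exon_intervals(annotated):
    """Return 0-based, half-open (start, end) spans of upper-case runs."""
    return [(m.start(), m.end()) for m in re.finditer(r"[ACGT]+", annotated)]

# Toy input (hypothetical): lower case = non-coding, upper case = exon.
toy = "ttgccATGGGGTCTgtgagggacagAACCGCTAAaggg"
spans = exon_intervals(toy)

# Concatenating the exon spans gives the spliced coding sequence.
coding = "".join(toy[start:end] for start, end in spans)
print(spans)
print(coding)
```

Whitespace inside an exon would split it into two spans, so a real annotated sequence should be stripped of spaces and line breaks first.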

  3. Levels of Annotation
     “Annotation”: tagging regions and nucleotides with information about function, structure, knowledge, additional data, …
     [Figure: homologous genomes]
     Annotation levels:
     • Protein-coding genes, including alternative splicing
     • RNA structure
     • Regulatory signals – fast/slow, prediction of TF, binding constants, …
     • Selection strength, …
     • Epigenomics – methylation, histone modification
     Further complications:
     • Integration of levels – RNA structure of mRNA, signals in coding regions, …
     • Knowledge and annotation transfer – experimental knowledge might be present in other species
     • Evolution of features – regulatory signals > RNA > protein
     • Combining with non-homologous analysis – tests for common regulation
     • Combining species and population perspectives

  4. Observables, Hidden Variables, Evolution & Knowledge
     [Figure labels: Observables x; Hidden Variable; Evolution; Knowledge (Constraints)]
     If knowledge deterministic

  5. Genscan
     States:
     • Exons of phase 0, 1 or 2
     • Introns of phase 0, 1 or 2
     • Initial exon
     • Terminal exon
     • Exon of single-exon genes
     • 5' UTR
     • 3' UTR
     • Poly-A signal
     • Promoter
     • Intergenic sequence
     Figure legend: state with length distribution
     Omitted: reverse-strand part of the HMM
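To make the HMM idea behind this slide concrete, here is a minimal Viterbi sketch in Python. It is a plain two-state HMM (intergenic `N` versus coding `C`), not the Genscan model itself, which has many more states and explicit length distributions. All probabilities below are made-up illustrative values.

```python
import math

def viterbi(obs, states, start, trans, emit):
    """Return the most probable state path for obs (log-space Viterbi)."""
    score = {s: math.log(start[s] * emit[s][obs[0]]) for s in states}
    pointers = []
    for symbol in obs[1:]:
        new_score, ptr = {}, {}
        for s in states:
            # Best predecessor of state s at this position.
            best = max(states, key=lambda r: score[r] + math.log(trans[r][s]))
            ptr[s] = best
            new_score[s] = score[best] + math.log(trans[best][s] * emit[s][symbol])
        score = new_score
        pointers.append(ptr)
    # Trace back from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(pointers):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))

# Hypothetical parameters: coding regions are GC-rich, intergenic AT-rich.
STATES = "NC"
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT = {"N": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
        "C": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4}}

print(viterbi("AAAAGGGGGGAAAA", STATES, START, TRANS, EMIT))
```

A gene finder in the Genscan style would replace the two states with the exon, intron, UTR, promoter and intergenic states listed on the slide, and would need a generalized (semi-Markov) decoder to handle state durations.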

  6. Comparative Gene Annotation
     AGGTATATAATGCG.....
     AGCCATTTAGTGCG.....
     P_coding{ATG-->GTG} or P_non-coding{ATG-->GTG}
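The comparison on this slide can be illustrated with a toy log-odds score for an aligned codon pair such as ATG and GTG. The sketch below is an assumption-laden illustration, not the model used in the presentation: each differing nucleotide gets a fixed weight `p_change`, and under the coding hypothesis an amino-acid change is additionally scaled by a hypothetical selection factor `f_coding`. The weights are unnormalized, so only their ratio is meaningful.

```python
import math
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def substitution_weight(codon_a, codon_b, p_change=0.1, f=1.0):
    """Toy weight for observing codon_b aligned to codon_a.

    Each differing nucleotide contributes p_change, each identical one
    1 - p_change. If the encoded amino acid changes, the weight is
    multiplied by the selection factor f.
    """
    w = 1.0
    for x, y in zip(codon_a, codon_b):
        w *= p_change if x != y else 1.0 - p_change
    if CODE[codon_a] != CODE[codon_b]:
        w *= f
    return w

def log_odds_coding(codon_a, codon_b, f_coding=0.2):
    """log(coding weight / non-coding weight); negative values disfavor coding."""
    coding = substitution_weight(codon_a, codon_b, f=f_coding)
    noncoding = substitution_weight(codon_a, codon_b, f=1.0)
    return math.log(coding / noncoding)

print(CODE["ATG"], CODE["GTG"], log_odds_coding("ATG", "GTG"))
```

Summing such scores over all codons of a candidate reading frame gives a simple measure of whether the observed substitutions look like they were filtered by selection on the protein.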

  7. Gene Finding & Protein Homology (Gelfand, Mironov & Pevzner, 1996)
     Spliced Alignment:
     1. Define set of potential exons in new genome.
     2. Make exon ordering graph (EOG).
     3. Align EOG to protein database.
     [Figure: Protein Database; Exon Ordering Graph; T Y G H L P T Y G H L P T Y - - L P M Y L P M T W Q]
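Steps 1 and 2 of the spliced alignment procedure can be sketched in Python as follows. This is only an illustration under simple assumptions: candidate exons are given as half-open `(start, end)` intervals, and an edge is drawn whenever one exon ends before another begins. The function names are hypothetical, and step 3 (aligning paths of the graph to a protein database) is not implemented here.

```python
def exon_ordering_graph(exons):
    """Build a DAG with an edge u -> v when exon u ends before exon v starts."""
    exons = sorted(exons)
    return {u: [v for v in exons if u[1] <= v[0]] for u in exons}

def chains(graph):
    """Yield every non-empty chain of compatible exons (a path in the DAG)."""
    def extend(path):
        yield path
        for nxt in graph[path[-1]]:
            yield from extend(path + [nxt])
    for first in graph:
        yield from extend([first])

# Hypothetical candidate exons; the first two overlap and cannot co-occur.
candidates = [(0, 10), (5, 15), (20, 30)]
eog = exon_ordering_graph(candidates)
for chain in chains(eog):
    print(chain)
```

Enumerating all chains grows exponentially with the number of candidates, which is why spliced alignment uses dynamic programming over the graph rather than explicit enumeration.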

  8. Simultaneous Alignment & Gene Finding (Bafna & Huson, 2000; T. Scharling, 2001; Blayo, 2002)
     Align by minimizing distance / maximizing similarity.
     Align genes with structure known/unknown.

  9. Secondary Structure Generators
     S --> LS (.869) | L (.131)
     F --> dFd (.788) | LS (.212)
     L --> s (.895) | dFd (.105)
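The grammar on this slide can be used generatively. The Python sketch below samples dot-bracket structures from it, reading the rules as S --> LS | L, F --> dFd | LS and L --> s | dFd with the probabilities shown (this reading of the flattened slide text is an assumption). The paired terminals `d…d` are written as `(` and `)`, and the unpaired terminal `s` as `.`. An explicit stack is used instead of recursion so that long structures do not hit Python's recursion limit.

```python
import random

# Rule probabilities as read from the slide; "(" and ")" stand for d ... d.
RULES = {
    "S": [(0.869, ["L", "S"]), (0.131, ["L"])],
    "F": [(0.788, ["(", "F", ")"]), (0.212, ["L", "S"])],
    "L": [(0.895, ["s"]), (0.105, ["(", "F", ")"])],
}

def sample_structure(rng):
    """Sample one secondary structure as a dot-bracket string."""
    out, stack = [], ["S"]
    while stack:
        symbol = stack.pop()
        if symbol == "s":
            out.append(".")
        elif symbol in ("(", ")"):
            out.append(symbol)
        else:
            probs, bodies = zip(*RULES[symbol])
            body = rng.choices(bodies, weights=probs)[0]
            # Push in reverse so the leftmost symbol is expanded first.
            stack.extend(reversed(body))
    return "".join(out)

rng = random.Random(0)
print(sample_structure(rng))
```

Because F can only terminate through LS, every stem encloses at least two further positions, which rules out empty hairpin loops.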

  10. RNA Structure Application (from Knudsen et al., 1999; Knudsen & Hein, 2003)

  11. Observing Evolution has 2 parts
      • P(x)
      • P(Further history of x)
      [Figure labels: x; C C A A G C A U U]
      http://www.stats.ox.ac.uk/research/genome/projects/currentprojects

  12. Hidden Markov Model for Overlapping Genes
      Hidden states: NC; 1; 2; 3; 1,2; 1,3; 2,3; 1,2,3
      • Only starts in AUG (0.06)
      • Will stop in “STOP” (1.0)
      [Figure: scanning the virus genome in the 1st, 2nd and 3rd reading frames, with the hidden-state annotation below; transition labels TC [1,2,3], D [1,2], D [2,3], D [3,1], S [1], S [2], S [3]]
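The eight hidden states listed on this slide look like all subsets of the three reading frames, with the empty subset as the non-coding state NC. Assuming that interpretation, the state space can be enumerated in a few lines of Python. The names `hidden_states` and `label` are illustrative.

```python
from itertools import combinations

FRAMES = (1, 2, 3)

def hidden_states():
    """All subsets of reading frames; the empty set is the non-coding state."""
    return [frozenset(c) for k in range(len(FRAMES) + 1)
            for c in combinations(FRAMES, k)]

def label(state):
    """Render a state the way the slide does: NC, 1, 1,2, 1,2,3, ..."""
    return "NC" if not state else ",".join(str(f) for f in sorted(state))

labels = [label(s) for s in hidden_states()]
print(labels)
```

A state such as `1,2` then means that the position is coding in both the first and second reading frame, which is how overlapping genes enter the annotation.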

  13. Molecular Evolution: Known Reading Frames
      • Known fixed context throughout phylogeny [Figure: A G T C T]
      • Simplify genetic code: 4-fold, 2-fold (2-2), 1-1-1-1 sites
      • Assume multiplicativity of selection factors
      Selection factors on rates, given as (transition, transversion), by site type in the 1st reading frame (columns) and 2nd reading frame (rows):

                       1st: 1-1-1-1      1st: 2-2          1st: 4
      2nd: 1-1-1-1     (f1f2a, f1f2b)    (f2a, f1f2b)      (f2a, f2b)
      2nd: 2-2         (f1a, f1f2b)      (a, f1f2b)        (a, f2b)
      2nd: 4           (f1a, f1b)        (a, f1b)          (a, b)

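  14. Unknown Reading Frames and Varying Selection
      Hidden variables along the alignment of sequences 1, …, k: Coding Status and Selection Levels
      Selection levels: (1) 0.01, (2) 0.1, (3) 0.2, (4) 0.4, (5) 0.6, (6) 0.8, (7) 1.5, (8) 2.0
      Transition probabilities between selection levels: a (.95), (1-a)/7
      [Figure: alignment columns A G T C T … annotated with Coding Status and Selection Levels]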
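One plausible reading of the figure labels on this slide is that the selection level stays the same from one position to the next with probability a = 0.95 and jumps to each of the other seven levels with probability (1-a)/7. Under that assumption, the transition matrix over the eight levels can be built as below. The names `LEVELS` and `level_transitions` are illustrative.

```python
# The eight selection levels read from the slide.
LEVELS = [0.01, 0.1, 0.2, 0.4, 0.6, 0.8, 1.5, 2.0]

def level_transitions(a=0.95, k=len(LEVELS)):
    """k x k matrix: stay with probability a, move to each other level
    with probability (1 - a) / (k - 1)."""
    off = (1.0 - a) / (k - 1)
    return [[a if i == j else off for j in range(k)] for i in range(k)]

matrix = level_transitions()
print(matrix[0][:3])
```

Combining this matrix with the coding-status transitions would give the full hidden state space of (coding status, selection level) pairs.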

  15. HIV2 of 14 genomes: Evolution/Selection
      A. Phylogeny and Evolutionary Parameters
      B. Selection Strengths for Genes and Positions
      Genes: POL, REV, VPX, NEF, TAT, GAG, VIF, VPR, ENV

  16. HIV2 of 14 genomes: Annotation
      GenBank genes: Rev, Pol, Vpx, Tat, Nef, Gag, Vif, Vpr, Env
      Single Sequence: Sensitivity 0.9308; Specificity 0.9939; LogLikelihood -34939.32; ViterbiCont. -34949.41
      Phylo-HMM: Sensitivity 0.9542; Specificity 0.9965; LogLikelihood -75939.18; ViterbiCont. -75945.77
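The slide does not say how sensitivity and specificity were computed. A common per-nucleotide definition, assumed here, is sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP); note that some gene-finding papers instead report TP / (TP + FP) under the name specificity. The Python sketch below uses the first definition on a toy annotation in which each position is simply coding or not, which also ignores the multi-frame states of the overlapping-gene model.

```python
def sensitivity_specificity(truth, predicted):
    """Per-position sensitivity and specificity for boolean annotations."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    return tp / (tp + fn), tn / (tn + fp)

# Toy annotations (hypothetical): "C" = coding, "N" = non-coding.
reference = [c == "C" for c in "CCCCNNNNNN"]
prediction = [c == "C" for c in "CCCNCNNNNN"]
sens, spec = sensitivity_specificity(reference, prediction)
print(sens, spec)
```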

  17. HMM Extension: Stop/Start Skidding
      • Same evolutionary model as before, but different HMM topology
      • 64 states
      • 3 different types of transitions
      [Figure legend: = ATG]

  18. Annotation Results: HIV1 vs. HIV2
      • De novo annotation: 81.5% sensitivity (without non-homologous genes), 98.5% specificity
      • Knowing HIV1 (fixing the Viterbi path for one cube): 97.6% sensitivity (without non-homologous genes), 99.9% specificity
      a = 0.23, b = 0.06, g = 0.71

  19. HMM Extension II: Introns
      Single Sequence HMM
      • Introns will almost always be 3k long
      • 27 states
      Pair HMM
      • 729 states

  20. Conserved RNA Structure in Protein Coding Genes
      Problem: gene structure known, RNA structure unknown.
      [Figure labels: RNA Structure; Exons; Genome]
      Protein-RNA evolution: Singlet; Doublets; Contagious Dependence

  21. RNA + Protein Evolution
      Prediction of stem-pairing regions for different numbers of sequences (8, 5, 3)
      Codon Nucleotide Independence Heuristic:
      • Singlet: $R_{i,j} = f \cdot q_{i,j}$
      • Doublet: $R_{(i_1,i_2),(j_1,j_2)} = f_1 \cdot f_2 \cdot q_{(i_1,i_2),(j_1,j_2)}$
      Structure/non-Structure Grammars
      [Figure labels: Non-structural; Structural]

  22. Combining Grammars: Multiple Hidden Layers (Joanna Davies)
      Present approach: two “independent” annotations
      • SCFG: RNA Structure
      • HMM: Protein Structure
      Ideal approach: combined annotation
      • Combine SCFG & HMM: RNA, Gene Structure

  23. Combining Grammars: Solution Attempts (Joanna Davies)
      Independence is non-trivial to define, as the HMM and the SCFG are in principle competing alternative models.
      Let X be the stochastic variable giving the HMM annotation.
      Let Y be the stochastic variable giving the SCFG annotation.
      Is No.
      • Combined grammars (HMM, SCFG) --> SCFG have been devised, but they do not work well, have arbitrary designs and are very large.
      • Combinations of Viterbi and posterior decoding arise.

  24. http://www.stats.ox.ac.uk/__data/assets/file/0016/3328/combinedHMMartifact.pdf
