
Comparative Genomics & Annotation



Presentation Transcript


  1. Comparative Genomics & Annotation
     • The Foundation of Comparative Genomics
     • The main methodological tasks of CG
     • Annotation: Protein Gene Finding, RNA Structure Prediction, Signal Finding
     • Overlapping Annotations: Protein Genes, Protein-RNA
     • Combining Grammars

  2. 5' 3' Exon Intron Ab Initio Gene prediction Ab initio gene prediction: prediction of the location of genes (and the amino acid sequence it encodes) given a raw DNA sequence. ....tttttgcagtactcccgggccctctgttggggcctccccttcctctccagggtggagtcgaggaggcggggtgcgggcctccttatctctagagccggccctggctctctggcgcggggccccttagtccgggctttttgccatggggtctctgttccctctgtcgctgctgttttttttggcggccgcctacccgggagttgggagcgcgctgggacgccggactaagcgggcgcaaagccccaagggtagccctctcgcgccctccgggacctcagtgcccttctgggtgcgcatgagcccggagttcgtggctgtgcagccggggaagtcagtgcagctcaattgcagcaacagctgtccccagccgcagaattccagcctccgcaccccgctgcggcaaggcaagacgctcagagggccgggttgggtgtcttaccagctgctcgacgtgagggcctggagctccctcgcgcactgcctcgtgacctgcgcaggaaaaacacgctgggccacctccaggatcaccgcctacagtgagggacaggggctcggtcccggctggggtgaggggagggggctggaagaggtgggggaagggtagttgacagtcgctctatagggagcgcccgcggacctcactcagaggctcccccttgccttagaaccgccccacagcgtgattttggagcctccggtcttaaagggcaggaaatacactttgcgctgccacgtgacgcaggtgttcccggtgggctacttggtggtgaccctgaggcatggaagccgggtcatctattccgaaagcctggagcgcttcaccggcctggatctggccaacgtgaccttgacctacgagtttgctgctggaccccgcgacttctggcagcccgtgatctgccacgcgcgcctcaatctcgacggcctggtggtccgcaacagctcggcacccattacactgatgctcggtgaggcacccctgtaaccctggggactaggaggaagggggcagagagagttatgaccccgagagggcgcacagaccaagcgtgagctccacgcgggtcgacagacctccctgtgttccgttcctaattctcgccttctgctcccagcttggagccccgcgcccacagctttggcctccggttccatcgctgcccttgtagggatcctcctcactgtgggcgctgcgtacctatgcaagtgcctagctatgaagtcccaggcgtaaagggggatgttctatgccggctgagcgagaaaaagaggaatatgaaacaatctggggaaatggccatacatggtgg.... 
Input data Output: UTR and intergenic sequence 5'....tttttgcagtactcccgggccctctgttggggcctccccttcctctccagggtggagtcgaggaggcggggctgcgggcctccttatctctagagccggccctggctctctggcgcggggccccttagtccgggctttttgccATGGGGTCTCTGTTCCCTCTGTCGCTGCTGTTTTTTTTGGCGGCCGCCTACCCGGGAGTTGGGAGCGCGCTGGGACGCCGGACTAAGCGGGCGCAAAGCCCCAAGGGTAGCCCTCTCGCGCCCTCCGGGACCTCAGTGCCCTTCTGGGTGCGCATGAGCCCGGAGTTCGTGGCTGTGCAGCCGGGGAAGTCAGTGCAGCTCAATTGCAGCAACAGCTGTCCCCAGCCGCAGAATTCCAGCCTCCGCACCCCGCTGCGGCAAGGCAAGACGCTCAGAGGGCCGGGTTGGGTGTCTTACCAGCTGCTCGACGTGAGGGCCTGGAGCTCCCTCGCGCACTGCCTCGTGACCTGCGCAGGAAAAACACGCTGGGCCACCTCCAGGATCACCGCCTACAgtgagggacaggggctcggtcccggctggggtgaggggagggggctggaagaggtggggaagggtagttgacagtcgctctatagggagcgcccgcggacctcactcagaggctcccccttgccttagAACCGCCCCACAGCGTGATTTTGGAGCCTCCGGTCTTAAAGGGCAGGAAATACACTTTGCGCTGCCACGTGACGCAGGTGTTCCCGGTGGGCTACTTGGTGGTGACCCTGAGGCATGGAAGCCGGGTCATCTATTCCGAAAGCCTGGAGCGCTTCACCGGCCTGGATCTGGCCAACGTGACCTTGACCTACGAGTTTGCTGCTGGACCCCGCGACTTCTGGCAGCCCGTGATCTGCCACGCGCGCCTCAATCTCGACGGCCTGGTGGTCCGCAACAGCTCGGCACCCATTACACTGATGCTCGgtgaggcacccctgtaaccctggggactaggaggaagggggcagagagagttatgaccccgagagggcgcacagaccaagcgtgagctccacgcgggtcgacagacctccctgtgttccgttcctaattctcgccttctgctcccagCTTGGAGCCCCGCGCCCACAGCTTTGGCCTCCGGTTCCATCGCTGCCCTTGTAGGGATCCTCCTCACTGTGGGCG CTGCGTACCTATGCAAGTGCCTAGCTATGAAGTCCCAGGCGTAAagggggatgttctatgccggctgagcgagaaaaagaggaatatgaaacaatctgg ggaaatggccatacatggtgg.... 3'
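
The output listing above appears to mark predicted exons in upper case and everything else (UTR, intron, intergenic sequence) in lower case. Assuming that convention, a minimal Python sketch for recovering exon coordinates and the spliced coding sequence could look like this. The function names and the short `toy` string are hypothetical, invented for illustration, and are not taken from the slide.

```python
import re

def exon_intervals(annotated):
    """Return half-open (start, end) intervals of upper-case runs.

    Assumption: upper case marks predicted exons, lower case marks
    UTR, intron and intergenic sequence.
    """
    return [m.span() for m in re.finditer(r"[ACGT]+", annotated)]

def spliced_coding_sequence(annotated):
    """Concatenate the upper-case (exon) segments in order."""
    return "".join(annotated[s:e] for s, e in exon_intervals(annotated))

# Hypothetical toy input: two exons separated by one intron.
toy = "ttgccATGGGGTCTgtgagggacagAACCGCTAAaggg"
print(exon_intervals(toy))
print(spliced_coding_sequence(toy))
```

Whitespace inside a real listing would split an exon into two runs, so it should be stripped before calling these functions.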

  3. Levels of Annotation
     “Annotation”: tagging regions and nucleotides with information about function, structure, knowledge, additional data, ...
     [Figure: aligned nucleotides from homologous genomes]
     Annotation levels:
     • Protein coding genes, including alternative splicing
     • RNA structure
     • Regulatory signals: fast/slow, prediction of TF, binding constants, ...
     • Selection strength, ...
     • Epigenomics: methylation, histone modification
     Further complications:
     • Integration of levels: RNA structure of mRNA, signals in coding regions, ...
     • Knowledge and annotation transfer: experimental knowledge might be present in other species
     • Evolution of features: regulatory signals > RNA > protein
     • Combining with non-homologous analysis: tests for common regulation
     • Combining species and population perspectives

  4. Observables, Hidden Variables, Evolution & Knowledge
     [Diagram: observables x, hidden variable, evolution, knowledge (constraints); special case: if knowledge is deterministic]

  5. Co-Modelling and Conditional Modelling
     [Diagrams: observable and unobservable parts of each model]
     • Co-modelling: Goldman, Thorne & Jones, 96; Knudsen.., 99; Eddy & co.; Meyer and Durbin 02; Pedersen …, 03; Siepel & Haussler 03; Pedersen, Meyer, Forsberg…, Simmonds 2004a,b
       AGGTATATAATGCG.....
       AGCCATTTAGTGCG.....
       Pcoding{ATG-->GTG} or Pnon-coding{ATG-->GTG}
     • Conditional modelling: Footprinting - Signals (Blanchette); McCauley ….; Firth & Brown
     Needs:

  6. Grammars: Finite Set of Rules for Generating Strings
     Two kinds of symbols: variables and ordinary letters.
     i. A starting symbol.
     ii. A set of substitution rules applied to variables in the present string.
     A string is finished when it contains no variables.
     Grammar classes: Regular, Context Free, Context Sensitive, General (also erasing).

  7. Simple String Generators
     Variables (capital), letters (small).
     Regular grammar: start with S.
       S --> aT | bS
       T --> aS | bT | ε
     One sentence (odd # of a's): S --> aT --> aaS --> aabS --> aabaT --> aaba
     Context-free grammar:
       S --> aSa | bSb | aa | bb
     One sentence (even-length palindromes): S --> aSa --> abSba --> abaaba
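
A minimal Python sketch of the two generators on this slide, replaying each example derivation by rewriting the leftmost variable. The dictionary encoding, the helper name `derive`, and the use of the empty string for the erasing rule of T are assumptions made for illustration.

```python
# Rules as variable -> list of right-hand sides; "" is the empty (erasing) rule.
REGULAR = {"S": ["aT", "bS"], "T": ["aS", "bT", ""]}
CONTEXT_FREE = {"S": ["aSa", "bSb", "aa", "bb"]}

def derive(grammar, start, choices):
    """Apply each chosen right-hand side to the leftmost variable.

    Returns every intermediate string, from the start symbol to the end.
    """
    current = start
    forms = [current]
    for rhs in choices:
        pos = next((i for i, ch in enumerate(current) if ch.isupper()), None)
        if pos is None:
            raise ValueError("string is finished: no variables left")
        if rhs not in grammar[current[pos]]:
            raise ValueError(f"no rule {current[pos]} --> {rhs!r}")
        current = current[:pos] + rhs + current[pos + 1:]
        forms.append(current)
    return forms

regular_forms = derive(REGULAR, "S", ["aT", "aS", "bS", "aT", ""])
palindrome_forms = derive(CONTEXT_FREE, "S", ["aSa", "bSb", "aa"])
print(" --> ".join(regular_forms))
print(" --> ".join(palindrome_forms))
```

In the regular grammar the variable always sits at the right end of the string, while the context-free rules grow the string on both sides of the variable, which is what lets them produce palindromes.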

  8. Stochastic Grammars
     The grammars above classify every string as belonging to the language or not. Every variable has a finite set of substitution rules. Assigning a probability to each rule assigns probabilities to the strings in the language. If a string has a unique derivation, its probability is the product of the probabilities of the rules applied.
     i. Start with S.
        S --> (0.3) aT | (0.7) bS
        T --> (0.3) aS | (0.4) bT | (0.3) ε
        S --> aT --> aaS --> aabS --> aabaT --> aaba has probability 0.3 * 0.3 * 0.7 * 0.3 * 0.3 = 0.00567
     ii. S --> (0.3) aSa | (0.5) bSb | (0.1) aa | (0.1) bb
        S --> aSa --> abSba --> abaaba has probability 0.3 * 0.5 * 0.1 = 0.015
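
The product rule on this slide can be checked with a few lines of Python. This is a sketch only: the dictionary layout and the name `derivation_probability` are hypothetical, and the empty string stands for the rule of T that erases the variable.

```python
from math import prod

# Rule probabilities as given on the slide; "" is the erasing rule for T.
REGULAR = {
    "S": {"aT": 0.3, "bS": 0.7},
    "T": {"aS": 0.3, "bT": 0.4, "": 0.3},
}
CONTEXT_FREE = {"S": {"aSa": 0.3, "bSb": 0.5, "aa": 0.1, "bb": 0.1}}

def derivation_probability(grammar, derivation):
    """Multiply the probabilities of the (variable, right-hand side) steps."""
    return prod(grammar[var][rhs] for var, rhs in derivation)

# S --> aT --> aaS --> aabS --> aabaT --> aaba
p_regular = derivation_probability(
    REGULAR, [("S", "aT"), ("T", "aS"), ("S", "bS"), ("S", "aT"), ("T", "")]
)
# S --> aSa --> abSba --> abaaba
p_palindrome = derivation_probability(
    CONTEXT_FREE, [("S", "aSa"), ("S", "bSb"), ("S", "aa")]
)
print(p_regular, p_palindrome)
```

When a string has several derivations, its probability is the sum of these products over all of them, which is why the slide restricts the simple product to the unique-derivation case.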

  9. Hidden Markov Models in Bioinformatics
     [Figure: observations O1, ..., O10 and hidden states H1, H2, H3]
     • Definition
     • Four Key Algorithms
       • Summing over Unknown States
       • Most Probable Unknown States
       • Marginalizing Unknown States
       • Optimizing Parameters

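  10. What is the probability of the data?
     [Figure: observations O1, ..., O10 and hidden states H1, H2]
     The probability of the observed sequence is a sum over all hidden paths, $P(O) = \sum_{H} P(O, H)$, which could be hard to calculate directly because the number of paths grows exponentially with the sequence length. However, these calculations can be considerably accelerated. For each position k and state j, consider the probability of the observations (O1, ..., Ok) conditional on Hk = j. These quantities obey a recursion in k, so each one is computed from the values at position k-1.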
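
A minimal Python sketch of this recursion in its usual textbook form, the forward algorithm. It is written for the joint quantity F_k(j) = P(O1, ..., Ok, Hk = j), updated as F_k(j) = e_j(Ok) * sum_i F_{k-1}(i) * a_ij, which may differ in normalisation from the notation on the original slide. The two-state model and all its numbers are hypothetical toy values, and a brute-force sum over paths is included only to check the result.

```python
from itertools import product

def forward(obs, init, trans, emit):
    """P(obs), summing over hidden paths by the forward recursion.

    init[j]: P(H1 = j); trans[i][j]: P(H_k = j | H_{k-1} = i);
    emit[j][o]: P(O_k = o | H_k = j).
    """
    f = {j: init[j] * emit[j][obs[0]] for j in init}
    for symbol in obs[1:]:
        f = {j: emit[j][symbol] * sum(f[i] * trans[i][j] for i in f)
             for j in init}
    return sum(f.values())

def brute_force(obs, init, trans, emit):
    """Same probability by enumerating every hidden path (exponential)."""
    total = 0.0
    for path in product(init, repeat=len(obs)):
        p = init[path[0]] * emit[path[0]][obs[0]]
        for k in range(1, len(obs)):
            p *= trans[path[k - 1]][path[k]] * emit[path[k]][obs[k]]
        total += p
    return total

# Hypothetical toy model: "C" (coding-like) and "N" (non-coding-like) states.
init = {"C": 0.5, "N": 0.5}
trans = {"C": {"C": 0.9, "N": 0.1}, "N": {"C": 0.2, "N": 0.8}}
emit = {"C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
        "N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}
print(forward("ATGGCG", init, trans, emit))
```

The recursion costs on the order of (sequence length) x (number of states)^2 operations, whereas the brute-force sum grows exponentially with the sequence length.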

  11. HMM Examples
     • Simple Prokaryotic
     • Simple Eukaryotic Gene Finding: Burge and Karlin, 1996
       • Intron length > 50 bp required for splicing
       • Length distribution is not geometric
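
The last two bullets matter because a plain HMM state that loops back to itself with a fixed probability stays for a geometrically distributed number of steps. A small Python sketch of that fact follows. The function names and the self-transition value 0.999 are hypothetical choices for illustration.

```python
def geometric_length_pmf(stay, length):
    """P(a state is occupied for exactly `length` steps).

    `stay` is the self-transition probability of an ordinary HMM state.
    """
    if length < 1:
        return 0.0
    return stay ** (length - 1) * (1.0 - stay)

def prob_shorter_than(stay, min_length):
    """P(length < min_length) under the same geometric distribution."""
    return 1.0 - stay ** (min_length - 1)

stay = 0.999                      # hypothetical value; mean length is 1 / (1 - stay)
mean_length = 1.0 / (1.0 - stay)
too_short = prob_shorter_than(stay, 50)
print(mean_length, too_short)
```

Under a geometric distribution the most probable length is always 1 and lengths below 50 keep positive probability, so neither a 50 bp minimum nor a peaked length distribution can be represented without giving states an explicit length distribution.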

  12. Genscan
     [Figure: Genscan state diagram; legend: state with length distribution]
     • Exons of phase 0, 1 or 2
     • Introns of phase 0, 1 or 2
     • Initial exon
     • Terminal exon
     • Exon of single exon genes
     • 5' UTR
     • 3' UTR
     • Poly-A signal
     • Promoter
     • Intergenic sequence
     Omitted: reverse strand part of the HMM

  13. Comparative Gene Annotation
     AGGTATATAATGCG.....
     AGCCATTTAGTGCG.....
     Pcoding{ATG-->GTG} or Pnon-coding{ATG-->GTG}

  14. SCFG Analogue to HMM Calculations
     • HMM / Stochastic Regular Grammar [figure: observations O1, ..., O10 and hidden states H1, H2, H3]
     • SCFG (Stochastic Context Free Grammars) [figure: variable W with parts WL and WR; position labels 1, i, i', j', j, L]

  15. Secondary Structure Generators
     S --> LS (.869) | L (.131)
     F --> dFd (.788) | LS (.212)
     L --> s (.895) | dFd (.105)
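
A minimal Python sketch that samples secondary structures from this grammar. It assumes that each d...d pair stands for a base pair, written here as matching brackets, and that s stands for an unpaired position, written as a dot. Nucleotide emission is left out, and the function names are hypothetical.

```python
import random

# Rule probabilities from the slide; "(" and ")" replace the paired d...d,
# "." replaces the unpaired s (an assumption about the notation).
RULES = {
    "S": [(0.869, ["L", "S"]), (0.131, ["L"])],
    "F": [(0.788, ["(", "F", ")"]), (0.212, ["L", "S"])],
    "L": [(0.895, ["."]), (0.105, ["(", "F", ")"])],
}

def sample_structure(rng):
    """Expand S by leftmost derivation, using an explicit stack."""
    stack = ["S"]
    out = []
    while stack:
        symbol = stack.pop()
        if symbol not in RULES:          # terminal: emit it
            out.append(symbol)
            continue
        weights = [w for w, _ in RULES[symbol]]
        bodies = [rhs for _, rhs in RULES[symbol]]
        rhs = rng.choices(bodies, weights=weights)[0]
        stack.extend(reversed(rhs))      # leftmost symbol ends up on top
    return "".join(out)

def is_balanced(structure):
    """True if every bracket is matched, as a valid structure requires."""
    depth = 0
    for ch in structure:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

rng = random.Random(0)
samples = [sample_structure(rng) for _ in range(20)]
print(samples[0])
```

The explicit stack avoids deep recursion on long stems. Because F can only end through L S, every stem encloses at least two further L symbols.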

  16. RNA Structure Application
     From Knudsen et al. (1999); Knudsen & Hein, 2003

  17. Observing Evolution has 2 parts
     [Figure: an example sequence x]
     • P(x)
     • P(Further history of x)
     http://www.stats.ox.ac.uk/research/genome/projects/currentprojects
