Comparative Genomics & Annotation
• The Foundation of Comparative Genomics
• The main methodological tasks of CG
• Annotation: Protein Gene Finding, RNA Structure Prediction, Signal Finding
• Overlapping Annotations: Protein Genes, Protein-RNA
• Combining Grammars
Ab Initio Gene Prediction
[Figure: gene structure drawn 5' to 3', with exons and introns]
Ab initio gene prediction: prediction of the location of genes (and the amino acid sequences they encode) given a raw DNA sequence.
Input data:
....tttttgcagtactcccgggccctctgttggggcctccccttcctctccagggtggagtcgaggaggcggggtgcgggcctccttatctctagagccggccctggctctctggcgcggggccccttagtccgggctttttgccatggggtctctgttccctctgtcgctgctgttttttttggcggccgcctacccgggagttgggagcgcgctgggacgccggactaagcgggcgcaaagccccaagggtagccctctcgcgccctccgggacctcagtgcccttctgggtgcgcatgagcccggagttcgtggctgtgcagccggggaagtcagtgcagctcaattgcagcaacagctgtccccagccgcagaattccagcctccgcaccccgctgcggcaaggcaagacgctcagagggccgggttgggtgtcttaccagctgctcgacgtgagggcctggagctccctcgcgcactgcctcgtgacctgcgcaggaaaaacacgctgggccacctccaggatcaccgcctacagtgagggacaggggctcggtcccggctggggtgaggggagggggctggaagaggtgggggaagggtagttgacagtcgctctatagggagcgcccgcggacctcactcagaggctcccccttgccttagaaccgccccacagcgtgattttggagcctccggtcttaaagggcaggaaatacactttgcgctgccacgtgacgcaggtgttcccggtgggctacttggtggtgaccctgaggcatggaagccgggtcatctattccgaaagcctggagcgcttcaccggcctggatctggccaacgtgaccttgacctacgagtttgctgctggaccccgcgacttctggcagcccgtgatctgccacgcgcgcctcaatctcgacggcctggtggtccgcaacagctcggcacccattacactgatgctcggtgaggcacccctgtaaccctggggactaggaggaagggggcagagagagttatgaccccgagagggcgcacagaccaagcgtgagctccacgcgggtcgacagacctccctgtgttccgttcctaattctcgccttctgctcccagcttggagccccgcgcccacagctttggcctccggttccatcgctgcccttgtagggatcctcctcactgtgggcgctgcgtacctatgcaagtgcctagctatgaagtcccaggcgtaaagggggatgttctatgccggctgagcgagaaaaagaggaatatgaaacaatctggggaaatggccatacatggtgg....
Output: the same sequence with the predicted gene structure marked. In this transcript the marking survives only as letter case: UTR, intergenic and intron sequence in lowercase, coding exons in uppercase.
5'....tttttgcagtactcccgggccctctgttggggcctccccttcctctccagggtggagtcgaggaggcggggctgcgggcctccttatctctagagccggccctggctctctggcgcggggccccttagtccgggctttttgccATGGGGTCTCTGTTCCCTCTGTCGCTGCTGTTTTTTTTGGCGGCCGCCTACCCGGGAGTTGGGAGCGCGCTGGGACGCCGGACTAAGCGGGCGCAAAGCCCCAAGGGTAGCCCTCTCGCGCCCTCCGGGACCTCAGTGCCCTTCTGGGTGCGCATGAGCCCGGAGTTCGTGGCTGTGCAGCCGGGGAAGTCAGTGCAGCTCAATTGCAGCAACAGCTGTCCCCAGCCGCAGAATTCCAGCCTCCGCACCCCGCTGCGGCAAGGCAAGACGCTCAGAGGGCCGGGTTGGGTGTCTTACCAGCTGCTCGACGTGAGGGCCTGGAGCTCCCTCGCGCACTGCCTCGTGACCTGCGCAGGAAAAACACGCTGGGCCACCTCCAGGATCACCGCCTACAgtgagggacaggggctcggtcccggctggggtgaggggagggggctggaagaggtggggaagggtagttgacagtcgctctatagggagcgcccgcggacctcactcagaggctcccccttgccttagAACCGCCCCACAGCGTGATTTTGGAGCCTCCGGTCTTAAAGGGCAGGAAATACACTTTGCGCTGCCACGTGACGCAGGTGTTCCCGGTGGGCTACTTGGTGGTGACCCTGAGGCATGGAAGCCGGGTCATCTATTCCGAAAGCCTGGAGCGCTTCACCGGCCTGGATCTGGCCAACGTGACCTTGACCTACGAGTTTGCTGCTGGACCCCGCGACTTCTGGCAGCCCGTGATCTGCCACGCGCGCCTCAATCTCGACGGCCTGGTGGTCCGCAACAGCTCGGCACCCATTACACTGATGCTCGgtgaggcacccctgtaaccctggggactaggaggaagggggcagagagagttatgaccccgagagggcgcacagaccaagcgtgagctccacgcgggtcgacagacctccctgtgttccgttcctaattctcgccttctgctcccagCTTGGAGCCCCGCGCCCACAGCTTTGGCCTCCGGTTCCATCGCTGCCCTTGTAGGGATCCTCCTCACTGTGGGCGCTGCGTACCTATGCAAGTGCCTAGCTATGAAGTCCCAGGCGTAAagggggatgttctatgccggctgagcgagaaaaagaggaatatgaaacaatctggggaaatggccatacatggtgg.... 3'
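The annotated output above can be turned back into coordinates. The following is a minimal sketch, assuming (as the sequence suggests, with its ATG start, gt...ag intron boundaries and TAA stop) that uppercase runs are coding exons. The function name `exon_intervals` and the toy sequence are hypothetical and are not part of any gene finder.

```python
import re

def exon_intervals(annotated):
    """Return 0-based, half-open (start, end) intervals of uppercase runs."""
    return [(m.start(), m.end()) for m in re.finditer(r"[ACGT]+", annotated)]

# Toy example (not the slide's sequence): UTR, exon, intron, exon, UTR.
toy = "ttgcc" + "ATGGGGTCT" + "gtgagggacag" + "AACCGCTAA" + "aggg"

exons = exon_intervals(toy)
# Concatenating the exons gives the spliced coding sequence.
coding = "".join(toy[start:end] for start, end in exons)

print(exons)
print(coding)
```

Run on the full annotated sequence, the same function would list the predicted exons, provided the case convention assumed here holds.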
Levels of Annotation
“Annotation”: tagging regions and nucleotides with information about function, structure, knowledge, additional data, ...
[Figure: homologous genomes]
Annotation levels
• Protein coding genes, including alternative splicing
• RNA structure
• Regulatory signals – fast/slow, prediction of TF, binding constants, ...
• Selection strength, ...
• Epigenomics – methylation, histone modification
Further complications
• Integration of levels – RNA structure of mRNA, signals in coding regions, ...
• Knowledge and annotation transfer – experimental knowledge might be present in other species
• Evolution of features – regulatory signals > RNA > protein
• Combining with non-homologous analysis – tests for common regulation
• Combining species and population perspectives
Observables, Hidden Variables, Evolution & Knowledge
[Figure: observables x, hidden variable, evolution, knowledge (constraints)]
If knowledge is deterministic:
Co-Modelling and Conditional Modelling
[Figure: observable and unobservable parts of the model; RNA structure sketch]
AGGTATATAATGCG..... Pcoding{ATG-->GTG} or AGCCATTTAGTGCG..... Pnon-coding{ATG-->GTG}
References on the slide: Goldman, Thorne & Jones, 96; Knudsen.., 99; Eddy & co.; Meyer and Durbin 02; Pedersen …, 03; Siepel & Haussler 03; Pedersen, Meyer, Forsberg…, Simmonds 2004a,b
• Conditional Modelling
Footprinting - Signals (Blanchette); McCauley ….; Firth & Brown
Needs:
Grammars: Finite Set of Rules for Generating Strings
Ordinary letters and variables.
i. A starting symbol.
ii. A set of substitution rules applied to variables in the present string.
A string is finished when it contains no variables.
Classes of grammars: Regular, Context Free, Context Sensitive, General (also erasing).
Simple String Generators
Variables (capital), letters (small).
Regular Grammar: start with S.
S --> aT | bS
T --> aS | bT | (empty)
One sentence (odd # of a's): S -> aT -> aaS -> aabS -> aabaT -> aaba
Context Free Grammar:
S --> aSa | bSb | aa | bb
One sentence (even length palindromes): S --> aSa --> abSba --> abaaba
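To see the two generators in action, here is a small sketch that rewrites the leftmost variable until none remain. The rule tables restate the two grammars of this slide, with the empty alternative for T taken from the last step of the example derivation. The uniform choice among alternatives, the function name `generate` and the `max_steps` guard are assumptions made only for illustration.

```python
import random

# Alternatives per variable; "" is the empty right-hand side.
REGULAR = {"S": ["aT", "bS"], "T": ["aS", "bT", ""]}
CONTEXT_FREE = {"S": ["aSa", "bSb", "aa", "bb"]}

def generate(rules, rng, start="S", max_steps=200):
    """Rewrite the leftmost variable until the string has no variables.

    Returns None if the string is still unfinished after max_steps rewrites.
    """
    s = start
    for _ in range(max_steps):
        pos = next((i for i, ch in enumerate(s) if ch in rules), None)
        if pos is None:
            return s  # finished: no variables left
        s = s[:pos] + rng.choice(rules[s[pos]]) + s[pos + 1:]
    return s if all(ch not in rules for ch in s) else None

rng = random.Random(0)
regular_samples = [s for s in (generate(REGULAR, rng) for _ in range(20)) if s is not None]
cf_samples = [s for s in (generate(CONTEXT_FREE, rng) for _ in range(20)) if s is not None]

print(regular_samples[:5])
print(cf_samples[:5])
```

Every string from the regular grammar should contain an odd number of a's, and every string from the context free grammar should be an even length palindrome, which is easy to check on the samples.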
Stochastic Grammars
The grammars above classify every string as belonging to the language or not. Each variable has a finite set of substitution rules. Assigning probabilities to the use of each rule will assign probabilities to the strings in the language. If a string has a unique derivation, the probability of the string is the product of the probabilities of the applied rules.
i. Start with S.
S --> (0.3) aT | (0.7) bS
T --> (0.3) aS | (0.4) bT | (0.3) (empty)
S -> aT -> aaS -> aabS -> aabaT -> aaba, with rule probabilities 0.3, 0.3, 0.7, 0.3, 0.3
ii.
S --> (0.3) aSa | (0.5) bSb | (0.1) aa | (0.1) bb
S -> aSa -> abSba -> abaaba, with rule probabilities 0.3, 0.5, 0.1
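The product rule can be checked directly on the two example derivations. The sketch below uses the rule probabilities given on this slide; the dictionary layout and the function name `derive` are illustrative choices, not part of the original material.

```python
# Rule probabilities keyed by (variable, right-hand side); "" is the empty string.
REGULAR = {
    ("S", "aT"): 0.3, ("S", "bS"): 0.7,
    ("T", "aS"): 0.3, ("T", "bT"): 0.4, ("T", ""): 0.3,
}
CONTEXT_FREE = {
    ("S", "aSa"): 0.3, ("S", "bSb"): 0.5, ("S", "aa"): 0.1, ("S", "bb"): 0.1,
}

def derive(rule_probs, derivation, start="S"):
    """Apply the rules in order; return the final string and the product of rule probabilities."""
    s, p = start, 1.0
    for var, rhs in derivation:
        pos = s.index(var)              # leftmost occurrence of the variable
        s = s[:pos] + rhs + s[pos + 1:]
        p *= rule_probs[(var, rhs)]
    return s, p

s1, p1 = derive(REGULAR, [("S", "aT"), ("T", "aS"), ("S", "bS"), ("S", "aT"), ("T", "")])
s2, p2 = derive(CONTEXT_FREE, [("S", "aSa"), ("S", "bSb"), ("S", "aa")])

print(s1, p1)  # aaba, about 0.00567
print(s2, p2)  # abaaba, about 0.015
```

Note that the probabilities of the alternatives for each variable sum to 1, so each grammar defines a probability distribution over derivations.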
Hidden Markov Models in Bioinformatics
[Figure: HMM with observations O1..O10 and hidden states H1, H2, H3]
• Definition
• Four Key Algorithms
  • Summing over Unknown States
  • Most Probable Unknown States
  • Marginalizing Unknown States
  • Optimizing Parameters
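What is the probability of the data?
The probability of the observed data is a sum over all possible hidden state paths, which could be hard to calculate directly. However, the calculation can be considerably accelerated. For each position k and hidden state j, consider the probability of the observations (O1,..,Ok) conditional on Hk=j. These quantities obey a recursion in k, so the sum can be built up one position at a time.
[Figure: HMM with observations O1..O10 and hidden states H1, H2]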
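The slide's own formula did not survive, so the following is a sketch of the standard forward recursion rather than a reconstruction of the author's notation. It uses the usual textbook variable f_k(j) = P(O1,..,Ok, Hk=j), a joint rather than conditional probability, with f_1(j) = init[j] * emit[j][O1], f_k(j) = emit[j][Ok] * sum_i f_{k-1}(i) * trans[i][j], and P(O) = sum_j f_n(j). The names `init`, `trans`, `emit` and the toy parameters are assumptions; a brute force sum over all paths is included only as a check.

```python
from itertools import product

def forward(obs, init, trans, emit):
    """P(obs), summed over all hidden paths, in O(n * states^2) time."""
    n_states = len(init)
    f = [init[j] * emit[j][obs[0]] for j in range(n_states)]
    for o in obs[1:]:
        f = [emit[j][o] * sum(f[i] * trans[i][j] for i in range(n_states))
             for j in range(n_states)]
    return sum(f)

def brute_force(obs, init, trans, emit):
    """Same quantity by explicit enumeration of every hidden path (exponential time)."""
    total = 0.0
    for path in product(range(len(init)), repeat=len(obs)):
        p = init[path[0]] * emit[path[0]][obs[0]]
        for k in range(1, len(obs)):
            p *= trans[path[k - 1]][path[k]] * emit[path[k]][obs[k]]
        total += p
    return total

# Toy two-state model with a two-letter alphabet (hypothetical numbers).
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.2, 0.8]]
emit = [[0.5, 0.5], [0.1, 0.9]]
obs = [0, 1, 1, 0, 1]

p_fast = forward(obs, init, trans, emit)
p_slow = brute_force(obs, init, trans, emit)
print(p_fast, p_slow)
```

The enumeration visits states^n paths, while the recursion needs only n * states^2 multiplications, which is the acceleration referred to above.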
HMM Examples
Simple Prokaryotic
Simple Eukaryotic
Gene Finding: Burge and Karlin, 1996
• Intron length > 50 bp required for splicing
• Length distribution is not geometric
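Why the length distribution matters: in a plain HMM, a state that returns to itself with probability p at each step is occupied for a geometrically distributed number of steps, P(L = n) = p^(n-1) (1 - p). The sketch below illustrates this with a hypothetical self-transition probability; the value 0.98 and the function name are assumptions chosen only to make the point.

```python
def geometric_length_pmf(p_stay, length):
    """Probability that a self-looping HMM state is occupied for exactly `length` steps."""
    if length < 1:
        return 0.0
    return p_stay ** (length - 1) * (1.0 - p_stay)

p_stay = 0.98  # hypothetical self-transition probability (mean length 50)
pmf = [geometric_length_pmf(p_stay, n) for n in range(1, 2001)]

# The most probable length is always 1, and the probability only decreases from there.
most_probable_length = 1 + pmf.index(max(pmf))
# Probability mass on lengths 1..49, which the 50 bp minimum would forbid.
mass_below_50 = sum(pmf[:49])

print(most_probable_length, mass_below_50)
```

Even with a mean length of 50, more than half of the probability sits on lengths below 50, so a geometric state cannot represent a minimum intron length. This is one motivation for states with explicit length distributions.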
Genscan
[Figure: Genscan state diagram; the legend marks states with a length distribution]
States in the diagram:
• Exons of phase 0, 1 or 2
• Introns of phase 0, 1 or 2
• Initial exon
• Terminal exon
• Exon of single exon genes
• 5' UTR
• 3' UTR
• Poly-A signal
• Promoter
• Intergenic sequence
Omitted: reverse strand part of the HMM
Comparative Gene Annotation
AGGTATATAATGCG..... Pcoding{ATG-->GTG}
or
AGCCATTTAGTGCG..... Pnon-coding{ATG-->GTG}
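SCFG Analogue to HMM calculations
HMM/Stochastic Regular Grammar: [Figure: observations O1..O10 with hidden states H1, H2, H3]
SCFG - Stochastic Context Free Grammars: [Figure: variable W over positions i..j of a sequence 1..L, rewritten as WL WR with inner positions i', j']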
Secondary Structure Generators
S --> LS (.869) | L (.131)
F --> dFd (.788) | LS (.212)
L --> s (.895) | dFd (.105)
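A sketch of how this grammar generates secondary structures, using the rule probabilities above. Writing each unpaired position s as "." and each pair d...d as "(" and ")" is a presentation choice made here, as are the function names and the `max_symbols` guard; nucleotide emissions are left out entirely.

```python
import random

# Right-hand sides in dot-bracket form: s -> ".", dFd -> "(F)".
RULES = {
    "S": (["LS", "L"], [0.869, 0.131]),
    "F": (["(F)", "LS"], [0.788, 0.212]),
    "L": ([".", "(F)"], [0.895, 0.105]),
}

def sample_structure(rng, max_symbols=10000):
    """Expand S left to right; return a dot-bracket string, or None if it grows past max_symbols."""
    stack, out = ["S"], []
    while stack:
        sym = stack.pop()
        if sym in RULES:
            options, weights = RULES[sym]
            rhs = rng.choices(options, weights=weights)[0]
            stack.extend(reversed(rhs))  # leftmost symbol ends up on top of the stack
        else:
            out.append(sym)
            if len(out) > max_symbols:
                return None
    return "".join(out)

def is_balanced(structure):
    """True if every ")" closes an earlier "(" and nothing is left open."""
    depth = 0
    for ch in structure:
        depth += (ch == "(") - (ch == ")")
        if depth < 0:
            return False
    return depth == 0

rng = random.Random(1)
structures = [s for s in (sample_structure(rng) for _ in range(50)) if s is not None]
for s in structures[:5]:
    print(s)
```

Because F can only end through F --> LS, every stem encloses at least two positions, so an empty hairpin "()" never appears in the samples.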
RNA Structure Application
From Knudsen et al. (1999); Knudsen & Hein, 2003
Observing Evolution has 2 parts
[Figure: observed sequence x with an RNA structure sketch]
• P(x)
• P(Further history of x)
http://www.stats.ox.ac.uk/research/genome/projects/currentprojects