1 / 53

Gene Prediction: Past, Present, and Future

Gene Prediction: Past, Present, and Future. Sam Gross. Genes. ATG . TAG TAA TGA. Gene  RNA  Protein Proteins are about 500 AA long Genes are about 1500bp long. ORF Scanning. In “lower” organisms, genes are contiguous We expect about 1 stop codon per 64bp

kioko
Download Presentation

Gene Prediction: Past, Present, and Future

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Gene Prediction:Past, Present, and Future Sam Gross

  2. Genes ATG TAG TAA TGA • Gene  RNA  Protein • Proteins are about 500 AA long • Genes are about 1500bp long

  3. ORF Scanning • In “lower” organisms, genes are contiguous • We expect about 1 stop codon per 64bp • If we see a long ORF, it’s probably a gene! • And conversely, all genes are long ORFs

  4. Introns AG GT AG GT TGA TAA TAG ATG • Drosophila: • 3.4 introns per gene on average • mean intron length 475, mean exon length 397 • Human: • 8.8 introns per gene on average • mean intron length 4400, mean exon length 165 • ORF scanning is defeated

  5. Splicing AG GT AG GT TGA TAA TAG ATG AG GT AG GT AG TGA TAA TAG ATG

  6. Needles in a Haystack • Human genome is about 3.2Gbp • 20,000 – 25,000 genes • 78% intergenic, 20% introns, 2% coding

  7. Gotta Find ‘Em All • 60-85% of all human genes have been found, mostly by random EST sequencing • This probably won’t work for the rest • For most genes, only one splice variant is known • If we can computationally predict a gene, we have a cheap experiment (RT-PCR) to verify

  8. Looking For Clues • Signals used by the cell • 99% of introns begin with GT, end with AG • 0.8% of introns begin with GC, end with AG • Gene begins with ATG • Gene ends with TAG, TAA, or TGA • Other properties of genes • Exons have characteristic lengths • Base composition of exons is characteristic due to genetic code • Exons tend to be conserved between species • Pattern of conservation is three-periodic

  9. Three-Periodicity • Most amino acids can be coded for by more than one DNA triplet (codon) • Usually, the degeneracy is in the last position Human CCTGTT (Proline, Valine) Mouse CCAGTC (Proline, Valine) Rat CCAGTC (Proline, Valine) Dog CCGGTA (Proline, Valine) Chicken CCCGTG (Proline, Valine)

  10. Hidden Markov Models • The de facto standard for gene prediction • Probabilistic finite state machine • Transition to a state, emit a character, transition to a new state • Many independence assumptions CDS NC ACG

  11. HMMs For Gene Prediction • Generative model • Define P(X, Y) as a product of many independent terms P(ACG) = P(start in noncoding) * P(noncoding emits A) * P(noncoding transitions to noncoding) * P(noncoding emits C) * P(noncoding transitions to coding) * P(coding emits A) • Terms are of the forms P(yi | yi-1) and P(xi | yi) • Trained by collecting counts

  12. HMMs For Gene Prediction • To predict genes given a sequence X, calculate argmaxY P(Y | X) = argmaxY P(X, Y) / P(X) = argmaxY P(X, Y)

  13. Generalized Hidden Markov Models • Like a HMM, but state durations are explicit • Transition to a state, pick a duration d, emit d characters, transition to a new state • Dynamic programming algorithm complexity goes from O(N2L) to O(N2LK) • K is the maximum state duration • Not so bad in practice

  14. Predicting Genes With HMMs • Given a sequence, we can calculate the most likely annotation Intron Initial Exon Internal Exon Final Exon Single Exon Inter- genic GGTGAGGTGACCAAGAACGTGTTGACAGTA GGTGAGGTGACCAAGAACGTGTTGACAGTA GGTGAGGTGACCAAGAACGTGTTGACAGTA GGTGAGGTGACCAAGAACGTGTTGACAGTA

  15. The Past: GENSCAN • Chris Burge, Stanford, 1997 • Before the Human Genome Project • No alignments available • People still thought there were 100,000 human genes

  16. The GENSCAN Model

  17. The GENSCAN Model • Output probabilities for NC and CDS depend on previous 5 bases (5th-order) • P(Xi | Xi-1, Xi-2, Xi-3, Xi-4, Xi-5) • Each CDS frame has its own model • Special 2nd-order positional models for start codon, stop codon, and acceptor site • Even fancier model for donor sites • Maximal dependence decomposition (MDD) • Long-range dependencies • Separate model for different isochores

  18. GENSCAN Performance • First program to do well on realistic sequences • Multiple genes in both orientations • Pretty good sensitivity, poor specificity • 70% exon Sn, 40% exon Sp • Not enough exons per gene • Was the best gene predictor for about 4 years

  19. Mouse A A G G T G Mouse A A T G T G Chicken A A G G T G Chicken A A _ A C G Comparative Gene Prediction A B Intron Intron Exon Exon -3 -2 -1 +1 +2 +3 -3 -2 -1 +1 +2 +3 Human A A G G T G Human A A G G T G

  20. The Recent Past: TWINSCAN • Korf, Flicek, Duan, Brent, Washington University in St. Louis, 2001 • Uses an informant sequence to help predict genes • For human, informant is normally mouse • Informant sequence consists of three characters • Match: | • Mismatch: : • Unaligned: . • Informant sequence assumed independent of target sequence

  21. The TWINSCAN Model • Just like GENSCAN, except adds models for conservation sequence • 5th-order models for CDS and NC, 2nd-order models for start and stop codons and splice sites • One CDS model for all frames • Many informants tried, but mouse seems to be at the “sweet spot”

  22. TWINSCAN Performance • Slightly more sensitive than GENSCAN, much more specific • Exon sensitivity/specificity about 75% • Much better at the gene level • Most genes are mostly right, about 25% exactly right • Was the best gene predictor for about 4 years

  23. The Present: N-SCAN • Gross and Brent, Washington University in St. Louis, 2005 • If one informant sequence is good, let’s try more! • Also several other improvements on TWINSCAN

  24. N-SCAN Improvements • Multiple informants • Richer models of sequence evolution • Frame-specific CDS conservation model • Conserved noncoding sequence model • 5’ UTR structure model

  25. HMM Outputs • GENSCAN • TWINSCAN • N-SCAN Target GGTGAGGTGACCAAGAACGTGTTGACAGTA Target GGTGAGGTGACCAAGAACGTGTTGACAGTA Conservation |||:||:||:|||||:||||||||...... sequence Target GGTGAGGTGACCAAGAACGTGTTGACAGTA Informant1 GGTCAGC___CCAAGAACGTGTAG...... Informant2 GATCAGC___CCAAGAACGTGTAG...... Informant3 GGTGAGCTGACCAAGATCGTGTTGACACAA ...

  26. N-SCAN State Diagram

  27. Two-Component Output Distributions • Target sequence model • Phylogenetic model for informants • Product gives the probability of a multiple alignment column

  28. Phylogenetic Bayesian Network Models

  29. Graph Transformation

  30. Inference • Slightly-modified version of Felsenstein’s algorithm • At each of the O(N) nodes, we calculate 6o+1 summations over 6o+1 values • Total time complexity is O(N • 62(o+1))

  31. Training • Simple with labeled multiple alignment of all sequences • Can use known genes as a labeling • Don’t know ancestral genome sequences • Treat them as missing data and use EM

  32. CPD Parameterizations • Each Bayesian network of order o has (2N-1)(6o+1)(6o+1-1) free parameters • We can reduce this number by restricting the form of the CPDs • Partially reversible models • Relative frequency of DNA k-mers remains constant as sequence evolves • Gaps and unaligned regions introduced over time

  33. N-SCAN Phylogenetic Models vs. Traditional Phylogenetic Models • Root (target) node is observed • Can use existing single-sequence models • Can use higher-order models • Can estimate target sequence model optimally

  34. N-SCAN Phylogenetic Models vs. Traditional Phylogenetic Models • No assumption of homogeneous substitution process • Gaps and unaligned regions can be treated naturally • Robust against • Function-changing mutation • Alignment error • Sequencing error • The price is many more parameters

  35. Conservation Score Coefficient • N-SCAN uses log-likelihood scores internally. The score of a position i under state S is • Values of k between 0.3 and 0.6 result in the best performance • Performance is roughly constant in this range

  36. Whole-Genome Human Gene Prediction • Annotations used were cleaned RefSeqs • 16,259 genes • 20,837 transcripts • N-SCAN used human, mouse, rat, chicken alignment

  37. Exact Exon Accuracy

  38. Exact Gene Accuracy

  39. Intron Sensitivity By Length

  40. Human Informant Effectiveness

  41. Drosophila Informant Effectiveness

  42. The Future(?): CONTRAST • New gene predictor currently in the works • Based not on a generalized HMM, but a semi-Markov conditional random field (SCRF)

  43. HMMs For Gene Prediction • Generative model • Define P(X, Y) as a product of many independent terms P(ACG) = P(start in noncoding) * P(noncoding emits A) * P(noncoding transitions to noncoding) * P(noncoding emits C) * P(noncoding transitions to coding) * P(coding emits A) • Terms are of the forms P(yi | yi-1) and P(xi | yi) • Trained by collecting counts

  44. HMMs For Gene Prediction • To predict genes given a sequence X, calculate argmaxY P(Y | X) = argmaxY P(X, Y) / P(X) = argmaxY P(X, Y) • Advantage: simplicity • Extremely fast training, efficient inference • Disadvantage: simplicity • Makes many unwarranted independence assumptions • Inaccurate model will get us into trouble

  45. When HMMs Go Wrong • Normal HMM training optimizes wrong function • We use P(Y | X) for prediction, but we’re optimizing P(X, Y) = P(Y | X) P(X) • This means we may prefer parameters that lead to worse predictions if they assign a higher probability to the sequence

  46. NNC A 2% B 2% C 96% NC A 3% B 2% C 95% CDS A 49% B 49% C 2% CNS A 96% B 2% C 2% CDS A 49% B 49% C 2% NC A 3% B 2% C 95% CDS A 3% B 95% C 2% When HMMs Go Wrong …CCCCCCCCCCCCCAAAAAAAAAACCCC…CCCCCCCBBABAAABBABBABCC… A = Conserved triplet B = Synonymous substitution C = Nonsynonymous substitution

  47. Can We Fix It? • Directly optimize • No closed form solution • But function and gradient can be calculated efficiently using DP • If we’re going to numerically optimize anyway, might as well switch to a more expressive model

  48. CRFs For Gene Prediction • Discriminative model • Define P(Y | X) as a product of many terms • Individual terms are not probabilities! • Terms are of the form fj(yi-1, yi, X, i) wj • The Good • Independence assumptions much weaker than in HMMs • Inference complexity is the same as for HMM • The Bad • Training requires numerical optimization of (convex) likelihood function

  49. The Math CRFs HMMs

  50. HMMs vs. CRFs y1 y2 y3 y4 y5 y6 … HMM x1 x2 x3 x4 x5 x6 y1 y2 y3 y4 y5 y6 … CRF x1 x2 x3 x4 x5 x6

More Related