Gene Prediction: Past, Present, and Future
Sam Gross
Genes • Gene → RNA → Protein • Proteins are about 500 AA long • Genes are about 1500 bp long • (diagram: an open reading frame runs from an ATG start codon to a TAG, TAA, or TGA stop codon)
ORF Scanning • In “lower” organisms, genes are contiguous • We expect about 1 stop codon per 64 bp of random sequence • If we see a long ORF, it’s probably a gene! • And conversely, all genes are long ORFs
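To make the idea concrete, here is a minimal ORF-scanning sketch (forward strand only, hypothetical helper names, not from the slides); it reports ATG-initiated open reading frames longer than a cutoff in each of the three reading frames.

```python
# Minimal ORF-scanning sketch (forward strand only); names are illustrative.
STOPS = {"TAG", "TAA", "TGA"}

def find_orfs(seq, min_len=300):
    """Yield (start, end, frame) for ATG-initiated ORFs of at least min_len bp."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

# Toy example with a tiny cutoff; real scans use cutoffs of a few hundred bp.
print(find_orfs("CCATGAAAAAAAAAAAATGAGG", min_len=15))  # -> [(2, 20, 2)]
```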
Introns • Drosophila: 3.4 introns per gene on average; mean intron length 475 bp, mean exon length 397 bp • Human: 8.8 introns per gene on average; mean intron length 4400 bp, mean exon length 165 bp • ORF scanning is defeated • (diagram: a gene from ATG to a TAG/TAA/TGA stop interrupted by introns bounded by GT…AG)
Splicing • (diagram: a pre-mRNA from ATG to a TAG/TAA/TGA stop containing GT…AG introns; splicing removes the introns and joins the exons into a contiguous coding sequence)
Needles in a Haystack • Human genome is about 3.2Gbp • 20,000 – 25,000 genes • 78% intergenic, 20% introns, 2% coding
Gotta Find ‘Em All • 60-85% of all human genes have been found, mostly by random EST sequencing • This probably won’t work for the rest • For most genes, only one splice variant is known • If we can computationally predict a gene, we have a cheap experiment (RT-PCR) to verify
Looking For Clues • Signals used by the cell • 99% of introns begin with GT, end with AG • 0.8% of introns begin with GC, end with AG • Gene begins with ATG • Gene ends with TAG, TAA, or TGA • Other properties of genes • Exons have characteristic lengths • Base composition of exons is characteristic due to genetic code • Exons tend to be conserved between species • Pattern of conservation is three-periodic
Three-Periodicity • Most amino acids can be coded for by more than one DNA triplet (codon) • Usually, the degeneracy is in the last position • Example (Proline, Valine codons):
Human    CCT GTT
Mouse    CCA GTC
Rat      CCA GTC
Dog      CCG GTA
Chicken  CCC GTG
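As a small illustration of three-periodicity, the sketch below (a hypothetical helper, not from the slides) computes the mismatch rate at each codon position of an in-frame pairwise coding alignment; the third position typically varies most.

```python
# Illustrative sketch: mismatch rate per codon position in a coding alignment.
def mismatch_by_codon_position(target, informant):
    """Return mismatch fraction at codon positions 1, 2, 3 (assumes in-frame, equal-length strings)."""
    counts = [0, 0, 0]
    totals = [0, 0, 0]
    for i, (a, b) in enumerate(zip(target, informant)):
        pos = i % 3
        totals[pos] += 1
        counts[pos] += (a != b)
    return [c / t for c, t in zip(counts, totals)]

# The third position varies most (synonymous changes), as in the slide's example.
print(mismatch_by_codon_position("CCTGTT", "CCAGTC"))  # -> [0.0, 0.0, 1.0]
```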
Hidden Markov Models • The de facto standard for gene prediction • Probabilistic finite state machine • Transition to a state, emit a character, transition to a new state • Many independence assumptions • (diagram: a two-state model with coding (CDS) and noncoding (NC) states emitting the sequence ACG)
HMMs For Gene Prediction • Generative model • Define P(X, Y) as a product of many independent terms: P(X = ACG, Y = NC NC CDS) = P(start in noncoding) * P(noncoding emits A) * P(noncoding transitions to noncoding) * P(noncoding emits C) * P(noncoding transitions to coding) * P(coding emits G) • Terms are of the forms P(y_i | y_{i-1}) and P(x_i | y_i) • Trained by collecting counts
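A sketch of the generative calculation, using toy two-state (NC/CDS) parameters; all numbers and helper names below are illustrative, not GENSCAN's. Transition and emission probabilities are estimated by counting labeled training data, and P(X, Y) is the product of those terms.

```python
from collections import defaultdict
from math import log, exp

# Train an HMM by collecting counts from labeled sequences (toy sketch).
def train_hmm(labeled_seqs):
    """labeled_seqs: list of (sequence, labels) pairs with one label per character."""
    trans, emit, start = defaultdict(float), defaultdict(float), defaultdict(float)
    for x, y in labeled_seqs:
        start[y[0]] += 1
        for i in range(len(x)):
            emit[(y[i], x[i])] += 1
            if i > 0:
                trans[(y[i - 1], y[i])] += 1
    # Normalization omitted for brevity: divide each count by its row total.
    return start, trans, emit

def joint_log_prob(x, y, start_p, trans_p, emit_p):
    """log P(X, Y) = log P(y_1) + sum log P(y_i | y_{i-1}) + sum log P(x_i | y_i)."""
    lp = log(start_p[y[0]]) + log(emit_p[(y[0], x[0])])
    for i in range(1, len(x)):
        lp += log(trans_p[(y[i - 1], y[i])]) + log(emit_p[(y[i], x[i])])
    return lp

# Toy parameters mirroring the slide's NC/CDS example (values are made up).
start_p = {"NC": 0.9, "CDS": 0.1}
trans_p = {("NC", "NC"): 0.9, ("NC", "CDS"): 0.1, ("CDS", "CDS"): 0.9, ("CDS", "NC"): 0.1}
emit_p = {("NC", b): 0.25 for b in "ACGT"}
emit_p.update({("CDS", "A"): 0.3, ("CDS", "C"): 0.2, ("CDS", "G"): 0.3, ("CDS", "T"): 0.2})

print(exp(joint_log_prob("ACG", ["NC", "NC", "CDS"], start_p, trans_p, emit_p)))
```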
HMMs For Gene Prediction • To predict genes given a sequence X, calculate argmax_Y P(Y | X) = argmax_Y P(X, Y) / P(X) = argmax_Y P(X, Y)
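The argmax over labelings is found with the Viterbi dynamic program; a compact sketch, again with made-up two-state parameters:

```python
from math import log

# Toy two-state parameters (illustrative values only).
start_p = {"NC": 0.9, "CDS": 0.1}
trans_p = {("NC", "NC"): 0.9, ("NC", "CDS"): 0.1, ("CDS", "CDS"): 0.9, ("CDS", "NC"): 0.1}
emit_p = {("NC", b): 0.25 for b in "ACGT"}
emit_p.update({("CDS", "A"): 0.3, ("CDS", "C"): 0.2, ("CDS", "G"): 0.3, ("CDS", "T"): 0.2})

def viterbi(x, states, start_p, trans_p, emit_p):
    """Return the most probable label sequence argmax_Y P(X, Y) for the observed x."""
    # v[i][s] = best log probability of any labeling of x[:i+1] that ends in state s
    v = [{s: log(start_p[s]) + log(emit_p[(s, x[0])]) for s in states}]
    back = [{}]
    for i in range(1, len(x)):
        v.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: v[i - 1][p] + log(trans_p[(p, s)]))
            v[i][s] = v[i - 1][prev] + log(trans_p[(prev, s)]) + log(emit_p[(s, x[i])])
            back[i][s] = prev
    # Trace back from the best final state.
    best = max(states, key=lambda s: v[-1][s])
    path = [best]
    for i in range(len(x) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

print(viterbi("ACGTTT", ["NC", "CDS"], start_p, trans_p, emit_p))
```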
Generalized Hidden Markov Models • Like an HMM, but state durations are explicit • Transition to a state, pick a duration d, emit d characters, transition to a new state • Dynamic programming algorithm complexity goes from O(N^2 L) to O(N^2 L K) • K is the maximum state duration • Not so bad in practice
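A sketch of the extra duration loop in semi-Markov (GHMM) Viterbi; the duration and segment scoring functions are stand-ins, and the nested state/duration maximization is where the extra factor of K in O(N^2 L K) comes from.

```python
def ghmm_viterbi(x, states, trans_logp, dur_logp, seg_logp, max_dur):
    """Semi-Markov Viterbi sketch: states emit whole segments of explicit duration.

    trans_logp[(p, s)]   - log probability of transitioning from state p to state s
    dur_logp(s, d)       - log probability that state s has duration d
    seg_logp(s, x, i, j) - log probability that state s emits the segment x[i:j]
    """
    L = len(x)
    NEG = float("-inf")
    # best[j][s] = best log prob of a parse of x[:j] whose last segment has state s
    best = [{s: NEG for s in states} for _ in range(L + 1)]
    back = [{} for _ in range(L + 1)]
    for j in range(1, L + 1):
        for s in states:
            for d in range(1, min(max_dur, j) + 1):     # the extra factor of K
                i = j - d
                seg = dur_logp(s, d) + seg_logp(s, x, i, j)
                if i == 0:
                    prev, cand = None, seg              # uniform start, for brevity
                else:
                    prev = max(states, key=lambda p: best[i][p] + trans_logp[(p, s)])
                    cand = best[i][prev] + trans_logp[(prev, s)] + seg
                if cand > best[j][s]:
                    best[j][s] = cand
                    back[j][s] = (i, prev)
    return best, back   # trace back through `back` to recover the best segmentation
```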
Predicting Genes With HMMs • Given a sequence, we can calculate the most likely annotation • (figure: the sequence GGTGAGGTGACCAAGAACGTGTTGACAGTA shown under several candidate labelings using the states Intergenic, Initial Exon, Internal Exon, Final Exon, Single Exon, and Intron)
The Past: GENSCAN • Chris Burge, Stanford, 1997 • Before the Human Genome Project • No alignments available • People still thought there were 100,000 human genes
The GENSCAN Model • Output probabilities for NC and CDS depend on the previous 5 bases (5th-order): P(X_i | X_{i-1}, X_{i-2}, X_{i-3}, X_{i-4}, X_{i-5}) • Each CDS frame has its own model • Special 2nd-order positional models for start codon, stop codon, and acceptor site • Even fancier model for donor sites • Maximal dependence decomposition (MDD) • Long-range dependencies • Separate models for different isochores
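A sketch of how a 5th-order emission model can be estimated by counting 6-mers in training sequence; this is a simplified stand-in for GENSCAN's CDS/NC models (no pseudocounts, frame-specific models, or isochore splits).

```python
from collections import defaultdict

def train_kth_order_model(seqs, k=5):
    """Estimate P(x_i | previous k bases) by counting (k+1)-mers."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i in range(k, len(s)):
            counts[s[i - k:i]][s[i]] += 1
    probs = {}
    for ctx, nxt in counts.items():
        total = sum(nxt.values())
        probs[ctx] = {base: c / total for base, c in nxt.items()}
    return probs

# Toy training data; real models are trained on large sets of known coding sequence.
cds_model = train_kth_order_model(["ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"], k=5)
print(cds_model.get("ATGGC"))   # distribution over the base following the context ATGGC
```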
GENSCAN Performance • First program to do well on realistic sequences • Multiple genes in both orientations • Pretty good sensitivity, poor specificity • 70% exon Sn, 40% exon Sp • Not enough exons per gene • Was the best gene predictor for about 4 years
Comparative Gene Prediction • (figure: two alignment panels, A and B, showing human, mouse, and chicken bases at positions -3 -2 -1 | +1 +2 +3 spanning an exon/intron boundary; panel A: Human AAG|GTG, Mouse AAG|GTG, Chicken AAG|GTG; panel B: Human AAG|GTG, Mouse AAT|GTG, Chicken AA_|ACG)
The Recent Past: TWINSCAN • Korf, Flicek, Duan, Brent, Washington University in St. Louis, 2001 • Uses an informant sequence to help predict genes • For human, the informant is normally mouse • Informant (conservation) sequence consists of three characters • Match: | • Mismatch: : • Unaligned: . • Conservation sequence is assumed conditionally independent of the target sequence given the hidden state
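A sketch of turning a pairwise target/informant alignment into TWINSCAN-style conservation symbols (| match, : mismatch, . unaligned); treating gap characters as unaligned is an assumption made here for illustration.

```python
def conservation_sequence(target, informant):
    """Map each aligned target position to '|' (match), ':' (mismatch), or '.' (unaligned)."""
    symbols = []
    for t, m in zip(target, informant):
        if m in "-.":                 # treat gap / no alignment as unaligned (assumption)
            symbols.append(".")
        elif t.upper() == m.upper():
            symbols.append("|")
        else:
            symbols.append(":")
    return "".join(symbols)

print(conservation_sequence("GGTGAGGTGACC", "GGTCAGC--ACC"))  # -> '|||:||:..|||'
```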
The TWINSCAN Model • Just like GENSCAN, except it adds models for the conservation sequence • 5th-order models for CDS and NC, 2nd-order models for start and stop codons and splice sites • One CDS conservation model for all frames • Many informants were tried, but mouse seems to be at the “sweet spot”
TWINSCAN Performance • Slightly more sensitive than GENSCAN, much more specific • Exon sensitivity/specificity about 75% • Much better at the gene level • Most genes are mostly right, about 25% exactly right • Was the best gene predictor for about 4 years
The Present: N-SCAN • Gross and Brent, Washington University in St. Louis, 2005 • If one informant sequence is good, let’s try more! • Also several other improvements on TWINSCAN
N-SCAN Improvements • Multiple informants • Richer models of sequence evolution • Frame-specific CDS conservation model • Conserved noncoding sequence model • 5’ UTR structure model
HMM Outputs
• GENSCAN emits the target sequence only:
Target        GGTGAGGTGACCAAGAACGTGTTGACAGTA
• TWINSCAN emits the target plus a conservation sequence:
Target        GGTGAGGTGACCAAGAACGTGTTGACAGTA
Conservation  |||:||:||:|||||:||||||||......
• N-SCAN emits the target plus a multiple alignment with several informants:
Target      GGTGAGGTGACCAAGAACGTGTTGACAGTA
Informant1  GGTCAGC___CCAAGAACGTGTAG......
Informant2  GATCAGC___CCAAGAACGTGTAG......
Informant3  GGTGAGCTGACCAAGATCGTGTTGACACAA
...
Two-Component Output Distributions • Target sequence model • Phylogenetic model for informants • Product gives the probability of a multiple alignment column
Inference • Slightly modified version of Felsenstein’s algorithm • At each of the O(N) nodes, we calculate 6^(o+1) summations over 6^(o+1) values • Total time complexity is O(N · 6^(2(o+1)))
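A compact sketch of Felsenstein's pruning algorithm for a single alignment column over a 6-character alphabet (order 0 only); the toy tree, substitution probabilities, and column below are illustrative assumptions, not N-SCAN's trained model.

```python
# Felsenstein pruning for one alignment column over a 6-character alphabet.
ALPHABET = "ACGT_."   # bases, gap, unaligned (illustrative)

def pruning_likelihood(node, column, subst):
    """Return a dict L[c] = P(observed leaves below node | node has character c).

    node   - ("leaf", name) or ("internal", left_child, right_child)
    column - dict mapping leaf name -> observed character
    subst  - dict mapping (parent_char, child_char) -> substitution probability
    """
    if node[0] == "leaf":
        obs = column[node[1]]
        return {c: 1.0 if c == obs else 0.0 for c in ALPHABET}
    _, left, right = node
    L_left = pruning_likelihood(left, column, subst)
    L_right = pruning_likelihood(right, column, subst)
    out = {}
    for c in ALPHABET:
        down_left = sum(subst[(c, d)] * L_left[d] for d in ALPHABET)
        down_right = sum(subst[(c, d)] * L_right[d] for d in ALPHABET)
        out[c] = down_left * down_right
    return out

# Toy tree ((mouse, rat), chicken) hanging below the observed human (target) base.
tree = ("internal", ("internal", ("leaf", "mouse"), ("leaf", "rat")), ("leaf", "chicken"))
subst = {(a, b): (0.9 if a == b else 0.02) for a in ALPHABET for b in ALPHABET}
column = {"mouse": "G", "rat": "G", "chicken": "A"}
L = pruning_likelihood(tree, column, subst)
print(L["G"])   # likelihood of the informant column given that the target base is G
```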
Training • Simple with labeled multiple alignment of all sequences • Can use known genes as a labeling • Don’t know ancestral genome sequences • Treat them as missing data and use EM
CPD Parameterizations • Each Bayesian network of order o has (2N−1) · 6^(o+1) · (6^(o+1) − 1) free parameters • We can reduce this number by restricting the form of the CPDs • Partially reversible models • Relative frequency of DNA k-mers remains constant as sequence evolves • Gaps and unaligned regions introduced over time
N-SCAN Phylogenetic Models vs. Traditional Phylogenetic Models • Root (target) node is observed • Can use existing single-sequence models • Can use higher-order models • Can estimate target sequence model optimally
N-SCAN Phylogenetic Models vs. Traditional Phylogenetic Models • No assumption of homogeneous substitution process • Gaps and unaligned regions can be treated naturally • Robust against • Function-changing mutation • Alignment error • Sequencing error • The price is many more parameters
Conservation Score Coefficient • N-SCAN uses log-likelihood scores internally • The score of position i under state S combines the target and conservation components: score_S(i) = log P_S(x_i) + k · log P_S(c_i), where x_i is the target character, c_i is the informant alignment column, and k is the conservation score coefficient • Values of k between 0.3 and 0.6 result in the best performance • Performance is roughly constant in this range
Whole-Genome Human Gene Prediction • Annotations used were cleaned RefSeqs • 16,259 genes • 20,837 transcripts • N-SCAN used human, mouse, rat, chicken alignment
The Future(?): CONTRAST • New gene predictor currently in the works • Based not on a generalized HMM, but a semi-Markov conditional random field (SCRF)
HMMs For Gene Prediction • Generative model • Define P(X, Y) as a product of many independent terms: P(X = ACG, Y = NC NC CDS) = P(start in noncoding) * P(noncoding emits A) * P(noncoding transitions to noncoding) * P(noncoding emits C) * P(noncoding transitions to coding) * P(coding emits G) • Terms are of the forms P(y_i | y_{i-1}) and P(x_i | y_i) • Trained by collecting counts
HMMs For Gene Prediction • To predict genes given a sequence X, calculate argmax_Y P(Y | X) = argmax_Y P(X, Y) / P(X) = argmax_Y P(X, Y) • Advantage: simplicity • Extremely fast training, efficient inference • Disadvantage: simplicity • Makes many unwarranted independence assumptions • An inaccurate model will get us into trouble
When HMMs Go Wrong • Normal HMM training optimizes the wrong function • We use P(Y | X) for prediction, but we’re optimizing P(X, Y) = P(Y | X) P(X) • This means we may prefer parameters that lead to worse predictions if they assign a higher probability to the sequence
When HMMs Go Wrong • (figure: a toy example over the conservation alphabet A = conserved triplet, B = synonymous substitution, C = nonsynonymous substitution; two symbol sequences, …CCCCCCCCCCCCCAAAAAAAAAACCCC… and …CCCCCCCBBABAAABBABBABCC…, are shown with alternative state labelings using NC (A 2-3%, B 2%, C 95-96%), CDS (A 49%, B 49%, C 2% or A 3%, B 95%, C 2%), and CNS (A 96%, B 2%, C 2%) emission models)
Can We Fix It? • Directly optimize the conditional likelihood P(Y | X) • No closed-form solution • But the function and its gradient can be calculated efficiently using DP • If we’re going to numerically optimize anyway, might as well switch to a more expressive model
CRFs For Gene Prediction • Discriminative model • Define P(Y | X) as a product of many terms • Individual terms are not probabilities! • Terms are of the form w_j · f_j(y_{i-1}, y_i, X, i) • The Good • Independence assumptions are much weaker than in HMMs • Inference complexity is the same as for an HMM • The Bad • Training requires numerical optimization of a (convex) likelihood function
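A sketch of the conditional model with hypothetical features: a labeling's unnormalized score sums the weighted features w_j · f_j(y_{i-1}, y_i, X, i), and P(Y | X) divides its exponential by a partition function Z(X). Z is computed here by brute force for clarity; real CRF inference uses the same dynamic programs as HMMs.

```python
from math import exp
from itertools import product

STATES = ["NC", "CDS"]

# Hypothetical feature functions f_j(y_prev, y_cur, X, i) with made-up weights w_j.
def f_gc_in_cds(y_prev, y_cur, x, i):
    return 1.0 if y_cur == "CDS" and x[i] in "GC" else 0.0

def f_stay(y_prev, y_cur, x, i):
    return 1.0 if y_prev == y_cur else 0.0

FEATURES = [(f_gc_in_cds, 1.5), (f_stay, 0.8)]   # (feature, weight) pairs

def score(x, y):
    """Unnormalized log score: sum over positions and features of w_j * f_j(y_{i-1}, y_i, X, i)."""
    total = 0.0
    for i in range(len(x)):
        y_prev = y[i - 1] if i > 0 else None
        total += sum(w * f(y_prev, y[i], x, i) for f, w in FEATURES)
    return total

def conditional_prob(x, y):
    """P(Y | X) = exp(score(x, y)) / Z(x), with Z(x) summed over all labelings (brute force)."""
    z = sum(exp(score(x, list(lab))) for lab in product(STATES, repeat=len(x)))
    return exp(score(x, y)) / z

print(conditional_prob("ACG", ["NC", "NC", "CDS"]))
```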
The Math • HMMs: P(X, Y) = P(y_1) · P(x_1 | y_1) · ∏_{i>1} P(y_i | y_{i-1}) · P(x_i | y_i) • CRFs: P(Y | X) = (1 / Z(X)) · exp( Σ_i Σ_j w_j · f_j(y_{i-1}, y_i, X, i) ), where Z(X) sums the same expression over all labelings Y
HMMs vs. CRFs • (figure: graphical models of the two formulations over labels y_1 … y_6 and observations x_1 … x_6; in the HMM each label y_i generates its observation x_i and depends only on y_{i-1}, while in the CRF the label chain is conditioned on the entire observed sequence X)