520 likes | 795 Views
Bacterial Gene Finding and Glimmer (also Archaeal and viral gene finding). Arthur L. Delcher and Steven Salzberg Center for Bioinformatics and Computational Biology University of Maryland. Outline. A (very) brief overview of microbial gene-finding Codon composition methods
E N D
Bacterial Gene Finding and Glimmer(also Archaeal and viral gene finding) Arthur L. Delcher and Steven SalzbergCenter for Bioinformatics and Computational Biology University of Maryland
Outline • A (very) brief overview of microbial gene-finding • Codon composition methods • GeneMark: Markov models • Glimmer1 & 2 • Interpolated Markov Model (IMM) • Interpolated Context Model (ICM) • Glimmer3 • Reducing false positives • Improving coding initiation site predictions • Running Glimmer3
Step One • Find open reading frames (ORFs). …TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA… Stopcodon Stopcodon
Step One • Find open reading frames (ORFs). • But ORFs generally overlap … Reversestrand Stopcodon …ATCTTTTTACCGAGAAATCTATTTAAAGTACTTTTTATAACT… …TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA… Stopcodon Stopcodon ShiftedStop
Campylobacter jejuni RM1221 30.3%GC All ORFs on both strands shown - color indicates reading frame Longest ORFs likely to be protein-coding genes Note the low GC content
Campylobacter jejuni RM1221 30.3%GC Purple lines are the predicted genes Purple ORFs show annotated (“true”) genes
Campylobacter jejuni RM1221 30.3%GC Mycobacterium smegmatis MC2 67.4%GC Note what happens in a high-GC genome
Campylobacter jejuni RM1221 30.3%GC Mycobacterium smegmatis MC2 67.4%GC Purple lines show annotated genes
The Problem • Need to decide which orfs are genes. • Then figure out the coding start sites • Can do homology searches but that won’t find novel genes • Besides, there are errors in the databases • Generally can assume that there are some known genes to use as training set. • Or just find the obvious ones
Codon Composition • Find patterns of nucleotides in known coding regions (assumed to be available). • Nucleotide distribution at 3 codon positions • Hexamers • GC-skew • (G-C)/(G+C) computed in windows of size N • Amino-acid composition • Use these to decide which orfs are genes. • Prefer longer orfs • Must deal with overlaps
Bacterial Replication Early replication Theta structure
Termination of Replication B. subtilis E. coli
Codon Composition Nucleotide variation at codon position:
Codon-Composition Gene Finders • ZCURVE • Guo, Ou & Zhang, NAR 31, 2003 • Based on nucleotide and di-nucleotide frequency in codons • Uses Z-transform and Fisher linear discriminant • MED • Ouyang, Zhu, Wang & She, JBCB 2(2) 2004 • Based on amino-acid frequencies • Uses nearest-neighbor classification on entropies
Probabilistic Methods • Create models that have a probability of generating any given sequence. • Train the models using examples of the types of sequences to generate. • The “score” of an orf is the probability of the model generating it. • Can also use a negative model (i.e., a model of non-orfs) and make the score be the ratio of the probabilities (i.e., the odds) of the two models. • Use logs to avoid underflow
Fixed-Order Markov Models • k th-order Markov model bases the probability of an event on the preceding k events. • Example: With a 3rd-order model the probability of this sequence:would be: Target Target
Fixed-Order Markov Models • Advantages: • Easy to train. Count frequencies of (k+1)-mers in training data. • Easy to compute probability of sequence. • Disadvantages: • Many (k+1)-mers may be undersampled in training data. • Models data as fixed-length chunks. Fixed-LengthContext Target
GeneMark • Borodovsky & McIninch, Comp. Chem 17, 1993. • Uses 5th-order Markov model. • Model is 3-periodic, i.e., a separate model for each nucleotide position in the codon. • DNA region gets 7 scores: 6 reading frames & non-coding―high score wins. • Lukashin & Borodovsky, Nucl. Acids Res. 26, 1998 is the HMM version.
Interpolated Markov Models (IMM) • Introduced in Glimmer 1.0Salzberg, Delcher, Kasif & White, NAR 26, 1998. • Probability of the target position depends on a variable number of previous positions (sometimes 2 bases, sometimes 3, 4, etc.) • How many is determined by the specific context. • E.g., for context ggtta the next position might depend on previous 3 bases tta . But for context catta all 5 bases might be used. ggtta
Real IMMs • Model has additional probabilities, λ, that determine which parts of the context to use. • E.g., the probability of g occurring after context atca is:
Real IMMs • Result is a linear combination of different Markov orders:where • Can view this as interpolating the results of different-order models. • The probability of a sequence is still the probability of the bases in the sequence.
Real IMMs • Problem: How to determine the λ’s (or equivalently the bj’s)? • Traditionally done with EM algorithm using cross-validation (deleted estimation). • Slow • Hard to understand results • Overtraining can be a problem • We will cover EM later as part of HMMs
Glimmer IMM • Glimmer assumes: • Longer context is always better • Only reason not to use it is undersampling in training data. • If sequence occurs frequently enough in training data, use it, i.e., • Otherwise, use frequency and χ2 significance to set λ. • Interpolation is always between only 2 adjacent model lengths.
More Precisely • Suppose context of length k+1 occurs a times, a<400, with target distribution D1; and context of length k occurs ≥400 times, with target distribution D2 • Let p be the χ2probability that D1 differs from D2 • Then the interpolated distribution will be: This formula computes the probability of a base B at any position in a sequence, given the context preceding that position This formula is repeatedly invoked for all positions in an ORF
IMMs vs Fixed-Order Models • Performance • IMM generally should do at least as well as a fixed-order model. • Some risk of overtraining. • IMM result can be stored and used like a fixed-order model. • IMM will be somewhat slower to train and will use more memory. Variable-LengthContext Target
Interpolated Context Model (ICM) • Introduced in Glimmer 2.0Delcher, Harmon, et al., Nucl. Acids Res. 27, 1999. • Doesn’t require adjacent bases in the window preceding the target position. • Choose set of positions that are most informative about the target position. Variable-PositionContext Target
ICM • For all windows compare distribution at each context position with target position • Choose position with max mutual information Compare
ICM • Continue for windows with fixed base at chosen positions • Recurse until too few training windows • Result is a tree—depth is # of context positions used • Then same interpolation as IMM, between node and parent in tree Compare
Sample ICM Model ver = 2.00 len = 12 depth = 7 periodicity = 3 nodes = 21845 0 ---|---|---|*-? 0.0260 0.225 0.240 0.358 0.177 1 ---|---|---|a*? 0.0769 0.243 0.165 0.501 0.092 2 ---|---|---|c*? 0.0243 0.237 0.294 0.264 0.205 3 ---|---|---|g*? 0.0597 0.219 0.240 0.407 0.134 4 ---|---|---|t*? 0.0277 0.209 0.245 0.291 0.256 5 ---|---|--*|aa? 0.0051 0.320 0.202 0.478 0.000 6 ---|---|--*|ac? 0.0046 0.195 0.118 0.541 0.146 7 ---|---|--*|ag? 0.0101 0.180 0.207 0.613 0.000 8 ---|---|*--|at? 0.0037 0.236 0.147 0.400 0.217 9 ---|---|--*|ca? 0.0019 0.199 0.246 0.303 0.253 10 ---|---|--*|cc? 0.0090 0.297 0.189 0.308 0.207 11 ---|---|--*|cg? 0.0035 0.180 0.406 0.267 0.148 12 ---|---|--*|ct? 0.0083 0.279 0.327 0.189 0.206 13 ---|---|--*|ga? 0.0057 0.277 0.326 0.397 0.000 14 ---|---|--*|gc? 0.0057 0.233 0.158 0.473 0.136 15 ---|---|--*|gg? 0.0056 0.119 0.252 0.417 0.213 16 ---|---|--*|gt? 0.0096 0.207 0.232 0.359 0.203 17 ---|---|--*|ta? 0.0034 0.205 0.170 0.400 0.224 18 ---|---|--*|tc? 0.0062 0.179 0.238 0.290 0.293 19 ---|---|--*|tg? 0.0052 0.183 0.338 0.323 0.156 20 ---|---|-*-|tt? 0.0076 0.244 0.244 0.194 0.318 21 ---|---|-*a|aa? 0.0035 0.376 0.199 0.425 0.000 22 ---|---|-*c|aa? 0.0037 0.313 0.197 0.490 0.000 23 ---|---|-*g|aa? 0.0052 0.283 0.199 0.517 0.000 24 ---|--*|--t|aa? 0.0060 0.259 0.231 0.510 0.000 25 ---|---|-*a|ac? 0.0031 0.176 0.138 0.519 0.166
ICM • ICM’s not just for modeling genes • Can use to model any set of training sequences • Doesn’t even need to be DNA • Have used recently to finding contaminant sequences in shotgun sequencing projects. • Used a non-periodic (i.e., homogeneous) model
Fixed-Length Sequences and ICMs • Array of ICM’s―a different one for each position in fixed-length sequence. • Can model fixed regions around transcription start sites or splice sites. • Positions in region can be permuted.
Overlapping Orfs • Glimmer1 & 2 used rules. • For overlapping orfs A and B, the overlap region AB is scored separately: • If AB scores higher in A’s reading frame, and A is longer than B, then reject B. • If AB scores higher in B’s reading frame, and B is longer than A, then reject A. • Otherwise, output both A and B with a “suspicious” tag. • Also try to move start site to eliminate overlaps. • Leads to high false-positive rate, especially in high-GC genomes.
Glimmer 2.0 Overlap Comments OrfID Start End Frame Len Score Comments 2 499 1689 [+1 L=1191 r=-1.151] [ShadowedBy #3] 3 1977 418 [-1 L=1560 r=-1.317] [Contains #2] [LowScoreBy #4 L=257 S=0] 4 1721 2611 [+2 L= 891 r=-1.156] [OlapWith #3 L=257 S=99] 5 2624 3775 [+2 L=1152 r=-1.184] 7 3775 4356 [+1 L= 582 r=-1.223] [DelayedBy #5 L=216] 8 4501 6615 [+1 L=2115 r=-1.181] [OlapWith #9 L=2103 S=99] 9 6633 4513 [-1 L=2121 r=-1.365] [LowScoreBy #8 L=2103 S=0] 10 6648 9173 [+3 L=2526 r=-1.155] 11 9229 10008 [+1 L= 780 r=-1.217] 12 10411 11208 [+1 L= 798 r=-1.192] 13 11215 12243 [+1 L=1029 r=-1.165] [ShorterThan #15 L=865 S=99] 15 12827 11379 [-3 L=1449 r=-1.295] [LowScoreBy #13 L=865 S=0] [LowScoreBy #16 16 12243 13298 [+3 L=1056 r=-1.228] [OlapWith #15 L=585 S=99] [DelayedBy #13 L= 17 13298 14137 [+2 L= 840 r=-1.161] 18 15143 14133 [-3 L=1011 r=-1.230] 19 15193 15519 [+1 L= 327 r=-1.284] 20 15519 17249 [+3 L=1731 r=-1.175] 21 17249 19015 [+2 L=1767 r=-1.266] [DelayedBy #20 L=1263] 23 19052 41620 [+2 L=22569 r=-1.144] 25 42693 41692 [-1 L=1002 r=-1.200] [ShadowedBy #26] 26 41623 43065 [+1 L=1443 r=-1.401] [Contains #25] [LowScoreBy #27 L=167 S=0] 27 42899 43351 [+2 L= 453 r=-1.261] [ShorterThan #26 L=167 S=99] 28 43351 44685 [+1 L=1335 r=-1.169] 29 45223 45747 [+1 L= 525 r=-1.153] 30 45261 44695 [-1 L= 567 r=-1.298] [DelayedBy #29 L=363] 31 46261 45857 [-2 L= 405 r=-1.275] 32 46701 46420 [-1 L= 282 r=-1.273] 33 46752 47543 [+3 L= 792 r=-1.178]
Glimmer3 • Uses a dynamic programming algorithm to choose the highest-scoring set of orfs and start sites. • Not quite an HMM • Allows small overlaps between genes • “small” is user-defined • Scores of genes are not necessarily probabilities. • Score includes component for likelihood of start site
Reverse Scoring • In Glimmer3 orfs are scored from 3’ end to 5’ end, i.e., from stop codon back toward start codon. • Helps find the start site. • The start should appear near the peak of the cumulative score in this direction. • Keeps the context part of the model entirely in the coding portion of gene, which it was trained on.
Finding Start Sites • Can specify a position weight matrix (PWM) for the ribosome binding site. • Used to boost the scoreof potential start sites that have a match. • Would like to try: • A fixed-length ICM model of start sites. • A decision-tree model of start sites. • Hard to get enough reliable data. • Glimmer2 had a hybridization-energy calculation with 16sRNA tail sequence that didn’t work well.
Glimmer3 vs. Glimmer2 Table 3. Probability models trained on the output of the long-orfs program. Predictions compared to genes with annotated function. Table 4. Probability models trained on the output of the long-orfs program. Predictions compared to all annotated genes.
Other Glimmer3 Features • Can specify any set of start or stop codons • Including by NCBI translation table # • Can specify list of “guaranteed” genes • Or just orfs and let Glimmer choose the starts • Can specify genome regions to avoid • Guarantee no prediction will overlap these regions • Useful for RNA genes (tRNAs, rRNAs) • Output more meaningful orf scores. • Allow genes to extend beyond end of sequence • important for annotation of incomplete genomes
Glimmer3 Output ----- Start ----- --- Length ---- -------------- Scores ------------ ID Frame of Orf of Gene Stop of Orf of Gene Raw Frame +1 +2 +3 -1 -2 -3 NC ... ... ... ... ... ... ... ... -2 10516 10459 10337 177 120 1.14 0 - - 99 - 0 - 0 -3 10787 10781 10629 156 150 -10.69 0 - - 99 - - 0 0 -3 11132 11120 11004 126 114 -13.54 0 - - 99 - - 0 0 -2 12058 12055 11936 120 117 -2.25 0 - - 99 - 0 - 0 -3 12437 12371 12249 186 120 -13.68 0 - - 99 0 - 0 0 0007 +3 8109 8142 12632 4521 4488 16.05 99 - - 99 - - - 0 0008 -2 12763 12724 12545 216 177 4.17 99 - - - - 99 - 0 +1 12850 12877 12996 144 117 -16.13 0 0 - 99 - - - 0 -2 13165 13036 12905 258 129 -1.31 0 - - 99 - 0 - 0 -3 13649 13634 13473 174 159 -7.11 0 - - 99 - 0 0 0 -2 13675 13663 13256 417 405 2.83 0 - - 99 - 0 - 0 -2 13972 13972 13850 120 120 4.25 0 - - 99 - 0 - 0 0009 +3 12633 12642 14087 1452 1443 15.20 99 - - 99 - - - 0 +1 14194 14323 14454 258 129 1.62 0 0 - - - - 99 0 0010 -3 14669 14663 14088 579 573 13.36 99 - - - - - 99 0 +1 14482 14548 14673 189 123 3.07 0 0 - - - - 99 0 +2 14570 14642 14806 234 162 -4.35 0 - 0 - - - - 99 0011 -2 14947 14935 14696 249 237 16.17 99 - - 0 - 99 - 0 0012 +3 14655 14718 14954 297 234 3.06 99 - - 99 - - - 0 -2 15115 15100 14975 138 123 -7.02 0 - - - - 0 - 99 +1 15004 15070 15288 282 216 -2.46 0 0 - - - - 99 0 +1 15364 15475 15594 228 117 -5.48 0 0 - - - - 99 0 0013 -3 15701 15671 15000 699 669 13.07 99 - - - - - 99 0 +3 15531 15591 15728 195 135 -8.84 0 - - 0 - - - 99 +1 15631 15649 15762 129 111 -0.69 3 3 - - - - - 96 -2 15922 15898 15719 201 177 -0.41 6 - - - - 6 - 93 -2 16561 16561 16307 252 252 1.73 0 - - 99 - 0 - 0 -2 16741 16711 16562 177 147 1.72 0 - - 99 - 0 - 0
Finding Initial Training Set • Glimmer1 & 2 have program long-orfs • It finds orfs longer than min-length that do not overlap other such orfs. • Current version automatically finds min-length. • Works OK for low- to medium-GC genomes • Terrible for high-GC genomes • Up to half the orfs can be wrong. • Uses first possible start site―even if orf is correct much of sequence isn’t. • Can use codon composition information in Glimmer3 version (similar to MED).
Running Glimmer3 #1 Find long, non-overlapping orfs to use as a training set long-orfs --no_header -t 1.0 tpall.1con tp.longorfs #2 Extract the training sequences from the genome file extract --nostop tpall.1con tp.longorfs > tp.train #3 Build the icm from the training sequences build-icm -r tp.icm < tp.train #4 Run first Glimmer glimmer3 -o50 -g110 -t30 tpall.1con tp.icm tp.run1 #5 Get training coordinates from first predictions tail +2 tp.run1.predict > tp.coords #6 Create a position weight matrix (PWM) from the regions # upstream of the start locations in tp.coords upstream-coords.awk 25 0 tp.coords | extract tpall.1con - > tp.upstream elph tp.upstream LEN=6 | get-motif-counts.awk > tp.motif #7 Determine the distribution of start-codon usage in tp.coords set startuse = `start-codon-distrib --3comma tpall.1con tp.coords` #8 Run second Glimmer glimmer3 -o50 -g110 -t30 -b tp.motif -P $startuse tpall.1con tp.icm tp
A novel application of Glimmer • P. didemni is a photosynthetic microbe that lives as an endosymbiont in the sea squirt > patella • P. didemni can only be cultured in L. patella cells L. patella (sea squirt)
A novel application of Glimmer • Generated 82,337 shotgun reads • Bacterial genome 5 Mb • Host genome estimated at 160 Mb • Depth of coverage therefore much greater for bacterial contigs • Singleton reads primarily belong to host
A novel application of Glimmer • Create training sets by classifying reads from scaffolds > 10kb as bacterial • 36,920 reads • Reads where both read and mate were singletons were treated as sea squirt • 21,276 reads • 21,141 reads unclassified
A novel application of Glimmer • Train a non-periodic IMM on both sets of data • 2 IMMs created • Then classify reads using the ratio of scores from the two IMMs • In a 5-fold cross-validation, classification accuracy was • 98.9% on P. didemni reads • 99.9% on L. patella reads
A novel application of Glimmer • Finally, re-assemble using ONLY bacterial reads • Original assembly: • 65 scaffolds of 20 Kb or longer • total scaffold length 5.74 Mb • Improved assembly: • 58 scaffolds of 20 Kb or longer • total scaffold length 5.84 Mb