The Motivation
• A certain genome that shall remain anonymous was sequenced
• One base was missed
[Figure labels: True ORF; Sequencing error, one base dropped]
What happened then?
[Figure labels: True ORF; True start; True stop; Sequencing error, one base dropped; Out-of-frame stop; False ORF; False ORF; Out-of-frame start]
• Software to predict ORFs uses single-frame start-stop analysis
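The effect of a dropped base on single-frame analysis can be sketched in a few lines of Scheme. This is only an illustration on a made-up 18-base sequence; the helper names (`codons`, `first-stop-index`, `drop-base`) are assumptions and do not come from the original software.

```scheme
;; Illustrative sketch: a dropped base shifts the reading frame downstream.
;; All helper names and the toy sequence are hypothetical.

(define STOP-CODONS '("TAG" "TAA" "TGA"))

;; Split a DNA string into consecutive codons, starting at offset `start`.
;; A trailing fragment shorter than three bases is ignored.
(define (codons dna start)
  (if (> (+ start 3) (string-length dna))
      '()
      (cons (substring dna start (+ start 3))
            (codons dna (+ start 3)))))

;; Index (counted in codons) of the first in-frame stop, or #f if none.
(define (first-stop-index dna)
  (let loop ((cs (codons dna 0)) (i 0))
    (cond ((null? cs) #f)
          ((member (car cs) STOP-CODONS) i)
          (else (loop (cdr cs) (+ i 1))))))

;; Remove the base at position `pos`, simulating the sequencing error.
(define (drop-base dna pos)
  (string-append (substring dna 0 pos)
                 (substring dna (+ pos 1) (string-length dna))))

(define toy-orf "ATGCCCCTGACCGTTTAA")   ; ATG CCC CTG ACC GTT TAA
```

With the intact sequence, `(first-stop-index toy-orf)` finds the true stop at codon 5. After `(drop-base toy-orf 3)` the frame reads ATG CCC TGA, so the same scan reports an out-of-frame stop at codon 2 and the ORF appears truncated.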
Predict coding region start sites
• Test bed: E. coli forward strand
• Well-studied
• Extensive set of “verified ORFs”
• Save reverse strand for testing

How well is a predictor doing?
• For each verified stop
• Compare predicted start to verified start
• Glimmer was hitting about 87% on the E. coli forward strand
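The evaluation described above can be sketched as a small function. The data layout is an assumption made for illustration: both the verified set and the predictions are association lists mapping a stop position to a start position, and the positions in the usage note are invented.

```scheme
;; Sketch of the accuracy measure: for each verified stop, check whether
;; the predicted start equals the verified start.
;; Assumed layout (hypothetical): lists of (stop . start) pairs.

(define (start-accuracy predicted verified)
  (let loop ((vs verified) (hits 0) (total 0))
    (if (null? vs)
        (if (= total 0) 0 (/ hits total))
        (let* ((stop   (car (car vs)))
               (vstart (cdr (car vs)))
               (p      (assv stop predicted)))   ; prediction for this stop, if any
          (loop (cdr vs)
                (if (and p (= (cdr p) vstart)) (+ hits 1) hits)
                (+ total 1))))))
```

For example, with verified pairs `'((900 . 100) (2400 . 1500) (5100 . 4200) (7000 . 6400))` and predictions that differ only at stop 2400, the function returns the exact fraction 3/4. A stop with no prediction counts as a miss.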
Early programs
Long-lost, but here’s a middle-era effort:

(define startscore
  (lambda (thisStart previousStop)
    (- (+ [Shine Dalgarno score]
          [3 points for ATG, 2 for GTG, 1 for TTG])
       (+ (quotient (- thisStart previousStop) 100)
          (if (gcrich (string->list (spacer-region thisStart))) 2 0)))))

Hmm
Later programs
• More sophisticated Shine Dalgarno computations, score stored in parameter s together with start location
• And more:

(define startscore
  (lambda (s n)
    (let ((sr (string->list (spacer-region (car s)))))
      (- (+ (cadr s)
            (if (string=? (substring str (car s) (+ (car s) 3)) "ATG")
                5                              ; more pref for ATG
                (if (string=? (substring str (car s) (+ (car s) 3)) "GTG")
                    2
                    1)))
         (+ (* 0.1 (abs (- (length sr) 8)))    ; 8 is average spacer length in e.coli verified
            (/ (- (car s) n) 60.0)             ; 1 per 60 wasted orf
            (* 3 (/ (gcmajority sr)            ; exaggerate the gc thing less
                    (length sr)))
            (if (string=? (substring str (+ (car s) 4) (+ (car s) 6)) "TG")
                3
                0)                             ; punish .TG in second codon
            (if (hasstart sr)
                (if (lastmeth sr) -1 2)        ; reward ATG in -1th codon
                0))))))                        ; punish starts elsewhere in spacer
Rough Translation

(- (+ (* scaling-factor [Shine Dalgarno energy score])
      5 for ATG, 2 for GTG or 1 for TTG)
   (+ (* 0.1 divergence of spacer length from norm)
      1 point per 60 wasted bp before start
      a score for gc richness in spacer region
      a score for XTG in second codon
      a score for having another start in the spacer region))
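To make the arithmetic concrete, the rough translation can be written as a function of plain numbers. This is a sketch under assumptions, not the author's scorer: the Shine Dalgarno score is passed in already scaled, `gc-fraction` stands in for the ratio of `gcmajority` to spacer length, and `spacer-start` is a hypothetical symbol (`'none`, `'last` or `'elsewhere`) summarising the `hasstart` and `lastmeth` tests. The constants are the hand-picked ones from the later program.

```scheme
;; Sketch: the rough translation with numeric inputs (parameter names are
;; illustrative; constants are taken from the later startscore).

(define (codon-points codon)
  (cond ((string=? codon "ATG") 5)
        ((string=? codon "GTG") 2)
        (else 1)))                      ; TTG

(define (spacer-start-penalty where)
  (cond ((eq? where 'last) -1)          ; start in the last spacer codon: rewarded
        ((eq? where 'elsewhere) 2)      ; start elsewhere in the spacer: punished
        (else 0)))                      ; no start in the spacer

(define (rough-score sd-score codon spacer-len wasted-bp
                     gc-fraction second-codon-tg? spacer-start)
  (- (+ sd-score (codon-points codon))
     (+ (* 0.1 (abs (- spacer-len 8)))  ; divergence from the average spacer length
        (/ wasted-bp 60.0)              ; 1 point per 60 wasted bp
        (* 3 gc-fraction)               ; gc richness of the spacer
        (if second-codon-tg? 3 0)       ; .TG in the second codon
        (spacer-start-penalty spacer-start))))
```

As a worked case, `(rough-score 4.0 "ATG" 8 120 0.5 #f 'none)` collects 4.0 + 5 = 9.0 in rewards and 0 + 2.0 + 1.5 + 0 + 0 = 3.5 in penalties, giving 5.5.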
So many numbers!
• Just plucked from the air
• Nevertheless
• We’re already outpacing Glimmer on E. coli forward strand
• How to fine-tune the numbers?
Genetic Algorithm

(define POPULATION 40)    ; size of initial population (and subsequent -- this is constant)
(define MUTPROB 10)       ; There is a 1 in MUTPROB chance of mutation each generation
(define EXPDATASIZE 50)   ; you would like each generation to work on a data set (taken from ocs) of about EXPDATASIZE
(define DATASETSIZECONTROL (round (/ 42000 EXPDATASIZE))) ; used by makedataset to aim for about EXPDATASIZE orfs
(define MINSDSIZE 3)      ; the shortest Shine Dalgarno we are willing to contemplate
(define START-CODONS '("ATG" "GTG" "TTG"))
(define STOP-CODONS '("TAG" "TAA" "TGA"))
(define SD-TARGET (string->list "ATTCCTCC"))
; other global constants modified in the dna
(define BIGNEG -10.0)
(define SPACER-MIN 4)
(define SPACER-MAX 18)
(define SD-REGION-LEN (+ SPACER-MAX (length SD-TARGET)))
(define HALFWINDOWSIZE 4000)
Those numbers placed in a “genome”

(define defaultdna
  (list SPACER-MIN SPACER-MAX HALFWINDOWSIZE BIGNEG
        1.0 5.0 2.0 1.0 -0.1 8.0 60.0 -3.0 -.7 1.0 37.5 14.75))

• And that “genome” can…
Mutate!

(define mutate
  (lambda (dna)
    (let ((loc (random (length dna))))
      (setnth dna loc (mutilate (nth dna loc))))))

• And …
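The slide does not show `nth`, `setnth` or `mutilate`. One plausible reading, offered only as a sketch, is that `nth` and `setnth` are plain list helpers with the argument order used in `mutate`, and that `mutilate` perturbs a single number. The deterministic `mutilate-by` and `mutate-at` below are hypothetical stand-ins: they take the location and scale factor as arguments instead of drawing them at random, so the effect of one mutation is easy to inspect.

```scheme
;; Hypothetical helpers matching the call shapes in mutate.

;; Element at zero-based position n.
(define (nth lst n) (list-ref lst n))

;; Copy of lst with position n replaced by val (the input list is not modified).
(define (setnth lst n val)
  (cond ((null? lst) '())
        ((= n 0) (cons val (cdr lst)))
        (else (cons (car lst) (setnth (cdr lst) (- n 1) val)))))

;; Assumed stand-in for mutilate: scale one gene by a given factor.
(define (mutilate-by x factor) (* x factor))

;; Deterministic variant of mutate: caller chooses the location and factor.
(define (mutate-at dna loc factor)
  (setnth dna loc (mutilate-by (nth dna loc) factor)))
```

For instance, `(mutate-at '(4 18 1.0 5.0) 1 2)` returns `(4 36 1.0 5.0)`: only the gene at position 1 changes, and the original list is left untouched.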
Breed!

(define cross
  (lambda (l m)
    (let ((a (random (length l)))
          (b (random (length l))))
      (let ((aa (min a b))
            (bb (max a b)))
        (list (append ((take aa) l)
                      ((take (- bb aa)) (nthcdr aa m))
                      (nthcdr bb l))
              (append ((take aa) m)
                      ((take (- bb aa)) (nthcdr aa l))
                      (nthcdr bb m)))))))
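`cross` is a two-point crossover: the two parents swap the segment between two cut points. The sketch below fixes the cut points so the result can be checked by hand. `ctake` is a hypothetical stand-in for the curried `take` the slide relies on, `nthcdr` is defined here on the assumption that it drops the first n elements, and `cross-at` is an illustrative variant of `cross`, not the original.

```scheme
;; Curried take: ((ctake n) lst) returns the first n elements of lst.
(define (ctake n)
  (lambda (lst)
    (if (or (= n 0) (null? lst))
        '()
        (cons (car lst) ((ctake (- n 1)) (cdr lst))))))

;; Drop the first n elements of lst.
(define (nthcdr n lst)
  (if (or (= n 0) (null? lst))
      lst
      (nthcdr (- n 1) (cdr lst))))

;; Two-point crossover with explicit cut points a and b (in either order).
(define (cross-at l m a b)
  (let ((aa (min a b))
        (bb (max a b)))
    (list (append ((ctake aa) l)
                  ((ctake (- bb aa)) (nthcdr aa m))
                  (nthcdr bb l))
          (append ((ctake aa) m)
                  ((ctake (- bb aa)) (nthcdr aa l))
                  (nthcdr bb m)))))
```

With cut points 4 and 2, `(cross-at '(1 2 3 4 5 6) '(a b c d e f) 4 2)` returns `((1 2 c d 5 6) (a b 3 4 e f))`. Both children keep the parents' length, so every gene stays at the position its meaning depends on.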
Over many Generations

(define generation
  (lambda (pop)
    (let ((newpop (sort (lambda (x y) (> (car x) (car y)))
                        (map (lambda (x) (cons (fitness x) x)) pop))))
      (print-out-information)
      (makenewdataset)
      (generation (map cdr newpop)))))
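The loop as shown ranks the population by fitness; the slide does not show how the next population is formed from that ranking. The sketch below covers only the ranking idea plus one assumed selection rule (keep the top k), using a hand-written insertion sort so that it does not depend on any particular `sort` argument order. The names `rank-by-fitness` and `keep-top` and the toy fitness function are hypothetical.

```scheme
;; Insert a (score . genome) pair into a list kept in descending score order.
(define (insert-ranked item ranked)
  (cond ((null? ranked) (list item))
        ((> (car item) (car (car ranked))) (cons item ranked))
        (else (cons (car ranked) (insert-ranked item (cdr ranked))))))

;; Pair each genome with its fitness, sort best-first, then strip the scores.
(define (rank-by-fitness fitness pop)
  (let loop ((rest pop) (ranked '()))
    (if (null? rest)
        (map cdr ranked)
        (loop (cdr rest)
              (insert-ranked (cons (fitness (car rest)) (car rest)) ranked)))))

;; Assumed selection rule: keep the k best genomes of a ranked population.
(define (keep-top k ranked)
  (if (or (= k 0) (null? ranked))
      '()
      (cons (car ranked) (keep-top (- k 1) (cdr ranked)))))

;; Toy fitness for illustration only: the sum of the genes.
(define (toy-fitness genome) (apply + genome))
```

Here `(rank-by-fitness toy-fitness '((1 2) (5 5) (3 1)))` gives `((5 5) (3 1) (1 2))`, and `keep-top 2` of that keeps `((5 5) (3 1))`. In the real program the survivors would presumably be refilled to `POPULATION` by crossing and mutating, but that step is not shown in the source.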
Results
• Kept getting better and better until
• 94% of predictions correct
• 98% of predictions had the verified start as runner-up
Run on reverse strand
• Similar percentages!!