360 likes | 553 Views
Algorithms for Finding Genes. Rhys Price Jones Anne R. Haake. There’s something about Genes . The human genome consists of about 3 billion base pairs Some of that can be transcribed and translated to produce proteins Most of it does not
E N D
Algorithms for Finding Genes Rhys Price Jones Anne R. Haake
There’s something about Genes • The human genome consists of about 3 billion base pairs • Some of that can be transcribed and translated to produce proteins • Most of it does not • What is it that makes some sequences of DNA encode protein?
The Genetic Code • Pick a spot in your sequence • Group nucleotides from that point on in triplets. • Each triplet encodes for an amino acid, or a “stop”, or a “start” • TGA, TAG, TAA code for “stop” • But there are exceptions!
Complications • In the 1960s it was thought that a gene was a linear structure of nucleotides directly linked in triplets to amino acids in the resulting protein • This view did not last long with the discovery in the late 1960s of genes within genes and overlapping genes • In 1977, split genes were discovered
Introns and Exons • Most eukaryotic genes are interrupted by non-coding sections and broken into pieces called exons. The interrupting sequences are called introns • Some researchers have used a biological approach and searched for splicing sites at intron-exon junctions • Catalogs of splice sites were created in the 1980s • But unreliable
Can you tell if it’s a gene? • What is it about some sections of DNA that make them encode for proteins? • Compare • TTCTTCTCCAAGAGCAGGGCTTAATTCTATGCTTCCAGGCGAAAGACTGCATGGCTAACAAAGCAACGCCTAACACATTCCTAAGCAATTGGCTTGCACC • GTCTTCCCGAGGGTGTTTCTCCAATGGAAAGAGGCGTCGCTGGGCACCCGCCGGGAACGGCCGGGTGACCACCCGGTCATTGTGAACGGAAGTTTCGAGA • Is there anything that distinguishes the second from the first? • We’ll return to this issue
Before we embark on the biological problem • We will look at an easier analogy • Sometimes called the Crooked Casino • Basic idea: • sequence of values generated by one or more processes • from the values, try to deduce which process produced which parts of the sequence
A useful analogy • The loaded die • I rolled a die 80 times and got23636626516113441416663242646331422242145353556233136631321454662631116546622542 • Some of the time it was a fair die • Some of the time the die produced 6s half the time and 1-5 randomly the rest of the time
Confession • (append (fair 20) (loaded 10) (fair 20) (loaded 10) (fair 20)) • (define fair (lambda (n) (if (zero? n) '() (cons (1+ (random 6)) (fair (1- n)))))) • (define loaded (lambda (n) (if (zero? n) '() (cons (if (< (random 2) 1) 6 (1+ (random 5))) (loaded (1- n)))))) • 23636626516113441416663242646331422242145353556233136631321454662631116546622542
There are no computer algorithms to recognize genes reliably • Statistical approaches • Similarity-based approaches • Others
Similarity Methods • Searching sequences for regions that have the “look and feel” of a gene (Dan Gusfield) • An early step is the investigation of ORFs • (in Eukaryotes) look for possible splicing sites and try to assemble exons • Combine sequence comparison and database search, seeking similar genes
Statistical Approaches • Look for features that appear frequently in genes but not elsewhere • As for the loaded die sequence above, that is not 100% reliable • But then, what is? This is Biology and it’s had billions of years to figure out how to outfox us! • If you want to learn about nature, to appreciate nature, it is necessary to understand the language that she speaks in. She offers her information only in one form; we are not so unhumble as to demand that she change before we pay any attention. (R. Feynman)
Relative Frequency of Stop Codons • 3 of the possible 64 nucleotide triplets are stop codons • We would thus expect stop codons to form approximately 5% of any stretch of random DNA, giving an average distance between stop codons of about 20 codons • Very long stretches of triplets containing no stop codons (long ORFs) might, therefore, indicate gene possibilities
Rather like looking for regions where my die was fair • By seeking a shortage of 6s • Similarly, there is (organism-specific) codon bias of various kinds, and this can be exploited by gene-finders • People and programs have looked for such bias features as: • regularity in codon frequencies • regularity in bicodon frequencies • periodicity • homogeneity vs. complexity
CpG Islands • CG dinucleotides are often written CpG to avoid confusion with the base pair C-G • In the human genome CpG is a rarer bird than would be expected in purely random sequences (there are chemical reasons for this involving methylation) • In the start regions of many genes, however, the methylation process is suppressed, and CpG dinucleotides appear more frequently than elsewhere
Rather like looking for stretches where my die was loaded • By seeking a plethora of 6s • But now we’ll look for high incidences of CpG
How can we detect shortage or plethora of CpGs? • Markov Models • Assume that the next state depends only on the previous state • If the states are x1 x2 ... xn • We have a matrix of probabilities pij • pij is the probability that state xj will follow xi
Transition matrices • CpG islands A C G T A 18 27 43 12C 17 37 27 19G 16 34 37 13T 8 35 38 18 • elsewhere A C G T A 30 20 28 21C 32 30 8 30G 25 25 30 20T 18 24 29 29 (adapted from Durbin et al, 1998)
How do we get those numbers? • empirically • lots of hard wet and dry lab work • We can think of these numbers as parameters of our model • later we’ll pay more attention to model parameters
For our loaded die example we have • 0 means non-6 • loaded 0 60 50 50 6 50 50 • unloaded 0 60 83 17 6 83 17
Discrimination • Given a sequence x we can calculate a log odds score that it was generated by one of two models. • S(x) = log (P(x | model A) / P(x | model B))= sum of log((prob xi-1 to xi in A) / (prob xi-1 to xi in B)) • Demonstration (define beta (lambda (x y) (cond ((and (= x 6) (< y 6)) (log (/ .83 .5))) ((and (= x 6) (= y 6)) (log (/ .17 .5))) ((and (< x 6) (< y 6)) (log (/ .83 .5))) ((and (< x 6) (= y 6)) (log (/ .17 .5))))))(define s (lambda (l) (cond ((null? (cdr l)) 0) (else (+ (beta (car l) (cadr l)) (s (cdr l)))))))
Similar Analysis for CpG Islands • Visit the code • (show mary) • (show gene) • (scorecpg mary) • (scorecpg gene) • Sliding Window • (slidewindow 20 mary) • (show (classify mary))
Hidden Markov Models • The Markov model we have just seen could be used for locating CpG islands • For example you could calculate the log odds score of a moving window of, say 50 nucleotides • Hidden Markov Models try to combine the two models in a single model • For each state, say x, in the Markov model, we imagine two states xA and xB for the two models A and B • Both new states “emit” the same character • But we don’t know whether it was CA or CB that “emitted” the nucleotide C
HMMs • Hidden Markov Models are so prevalent in bioinformatics, we’ll just refer to them as HMMs • Given • a sequence of “emitted signals” -- the observed sequence • and a matrix of transition probabilities and emission probabilities • For a path of hidden states, calculate the conditional probability of the sequence given the path • Then try to find an optimal path for the sequence such that the conditional probability is maximized
HMM “combines” Markov Models A+ 27 C+ 17 16 34 37 27 43 G+ 8 35 18 37 38 13 19 12 T+ 18 29 T- 21 30 20 29 30 18 G- 24 28 8 25 25 30 A- C- 32 20 30
Must be able to move between models • Figure out these transition probabilities A+ C+ G+ T+ T- G- A- C-
Add a begin and an end state • and figure even more transition probabilities A+ C+ G+ B T+ E T- G- A- C-
Basic Idea • Each state emits a nucleotide (except B and E) • A+ and A- both emit A • C+ and C- both emit C • You know the sequence of emissions • You don’t know the exact sequences of states • Was it A+ C+ C+ G- that produced ACCG • Or was it A- C- C+ G- • You want to find the most likely sequence
Parameters of the HMM • How do we know the transition probabilities between hidden states? • empirical, hard lab work • as above for the CpG island transition probabilities • imperfect data will lead to imperfect model • Is there an alternative?
Machine Learning • Grew out of other disciplines, including statistical model fitting • Tries to automate the process as much as possible • use very flexible models • lots of parameters • let the program figure them out • Based on ... • known data and properties • training sets • Induction and inference
Bayesian Modeling • Understand the hypotheses (model) • Assign prior probabilities to the hypotheses • Use Bayes’ theorem (and the rest of “probability calculus”) to evaluate posterior probabilities for the hypotheses in light of actual data
Neural Networks • Very flexible • Inputs • Thresholds • Weights • Output • Variations: • additional layers • loops
Training Neural Networks • Need set of inputs and desired outputs • Run the network with a given input • Compare to desired output • Adjust weights to improve • Repeat...
Training HMMs • Similar idea • Begin with guesses for the parameters of the HMM • Run it on your training set, and see how well it predicts whatever you designed it for • Use the results to adjust the parameters • Repeat...
Hope that • When you provide brand new inputs • The trained network will produce useful outputs
Some gene finder programs • GRAIL – neural net discriminant • Genie – HMM • GENSCAN – “Semi Markov Model” • GeneParser – neural net • FGENEH etc (Baylor) – Classification by Linear Discrimination Analysis (LDA) a well-known statistical technique • GeneID – rule-based