Finding Genes based on Comparative Genomics
Robin Raffard
November 30th, 2004
CS 374
References
Main references
• Multiple-sequence functional annotation and the generalized hidden Markov phylogeny. McAuliffe J., Pachter L., Jordan M. 2004.
• Computational identification of evolutionarily conserved exons. Siepel A., Haussler D. 2004.
Additional references
• Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Boffelli D., McAuliffe J., Ovcharenko D., Lewis K., Ovcharenko I., Pachter L., Rubin E.
• A hidden Markov model approach to variation among sites in rate of evolution. Felsenstein J., Churchill G.
• Statistics for Biology and Health. Ewens W., Grant G.
Problem formulation
DNA consists of genes (functional sequences) separated by intergenic regions (nonfunctional sequences).
[Figure: DNA strand ATCATTACGCGGCTTAGCCCTTATAGCGATACGATGACAGATGACAA with Gene 1, Gene 2 and Gene 3 marked, separated by intergenic regions]
Problem formulation
Problem: Find genes using comparative genomics.
Key: Exons are conserved through evolution.
In Practice
>human
AGTGAGACACGACGAGCCTACTATCAGGACGAGAGCAGGAGAGTGATGATGAGTAGCGCACAGCGACGATCATCACGAGAGAGTAAGAAGCAGTGATGATGTAGAGCGACGAGAGCACAGCGGCGACTACTACTAGG
>mouse
AGTGTGTCTCGTCGTGCCTACTTTCAGGACGAGAGCAGGTGAGTGTTGATGAGTTGCGCTCTGCGACGTTCATCTCGAGTGAGTTAGAAAGTGAAGGTATAACACAAGGTGTGAAGGCAGTGATGATGTAGAGCGACGAGAGCACAGCGGCGGGATGATATATCTAGGAGGATGCCCAATTTTTTTTT
>platypus
CTCTGCGGCGTTCGTCTCGGGTGGGTTGGGGGGTGGGGGTGTGGCGCAAGGTGTGAAGCACGACGACGATCTACGACGAGCGAGTGATGAGAGTGATGAGCGACGACGAGCACTAGAAGCGACGACTACTATCGACGAGCAGCCGAGATGATGATGAAAGAGAGAGAA
2 Questions
• 1st question: Which genomes to compare: human/mouse or human/primates?
• 2nd question: How to extract genes from this comparison?
Outline
• Human/Mouse vs Human/Primate
  • Advantages of Human/Mouse
  • Advantages of Human/Primate
  • Conclusion
• Gene Finding
  • Phylogenetic tree
  • Hidden Markov Chain
  • Hidden Markov Phylogeny
  • Contributions of the 2 papers
Functional sequences in Human/Mouse/Primates
[Figure: % of similarity plotted along the DNA sequence]
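A similarity profile like the one plotted on this slide can be approximated with a sliding window over a pairwise alignment. The following is a minimal sketch, assuming the two sequences are already aligned to the same length; the function name `window_identity`, the window size and the toy sequences are illustrative and do not come from the talk.

```python
def window_identity(seq_a, seq_b, window=10):
    """Percent identity of two aligned sequences in each sliding window."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # 1 where the two aligned characters match (a gap never counts as a match)
    matches = [int(a == b and a != "-") for a, b in zip(seq_a, seq_b)]
    profile = []
    for start in range(len(matches) - window + 1):
        profile.append(100.0 * sum(matches[start:start + window]) / window)
    return profile

# Toy aligned pair (hypothetical data): conserved start, diverged end.
human = "ACGTACGTACGTACGTACGT"
mouse = "ACGTACGTACTTGCATTCAA"
profile = window_identity(human, mouse, window=5)
```

Windows with high identity are candidate functional regions; windows where identity drops are candidate nonfunctional sequence.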
Advantage of Human/Mouse
Easy to identify the functional sequences: they stand out as the regions that remain similar between the two genomes.
Disadvantage of Human/Mouse
Some human genes are not present in the mouse genome. It is therefore impossible to extract them from a Mouse/Human comparison.
[Figure: Human vs Mouse]
Phylogenetic shadowing on real data
[Figure: log-likelihood of mutation plotted along the DNA sequence]
Motivating Example: Gene apo(a)
• Plasma protein
• Important cardiovascular disease risk predictor
[Figure labels: Absent, Present]
Phylogenetic shadowing of apo(a)
[Figure: log-likelihood of mutation plotted along the DNA sequence]
So Human/Mouse or Human/Primate?
• Old genes: Human/Mouse (noncoding sequences are strongly different)
• New genes: Human/Primate (straightforward alignment of coding sequences)
Naive way of extracting genes
Drawbacks:
• Is not flexible/probabilistic.
• Does not respect gene structure.
1st step: Phylogenetic tree
Given a nucleotide, is it functional or not?
[Figure: alignment with one row per species and one column per nucleotide (Nucleotide 1, Nucleotide 2)]
Primate phylogeny
[Figure: primate tree with nucleotides T T A A G A]
Primate phylogeny
[Figure: primate tree with observed nucleotides A A T A G A and further node labels A A C A]
• Which nucleotide?
• Which rate α?
Algorithm
• Given the observed nucleotides, find the most likely rate α.
• Mathematically, [equation not recovered from the slide]
• Therefore, [equation not recovered from the slide]
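Finding the most likely rate can be read as maximising the likelihood of the observed column over α, i.e. picking the α that maximises P(observed nucleotides | tree, α). The sketch below shows one way to do this, under assumptions that are not in the slides: a Jukes-Cantor substitution model, a uniform distribution at the root, a made-up four-species tree with invented branch lengths, and a plain grid search in place of a proper numerical optimiser. All function names are hypothetical.

```python
import math

BASES = "ACGT"

def jc_prob(x, y, t):
    """Jukes-Cantor probability that base x becomes base y over branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def partial(node, column, alpha):
    """Felsenstein pruning: P(observed leaves below node | base at node), per base."""
    if isinstance(node, str):  # leaf: node is a species name
        return {b: 1.0 if column[node] == b else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, length in node:  # internal node: list of (child, branch length)
        child_partial = partial(child, column, alpha)
        for b in BASES:
            result[b] *= sum(jc_prob(b, c, alpha * length) * child_partial[c]
                             for c in BASES)
    return result

def column_likelihood(tree, column, alpha):
    """P(observed column | tree, rate alpha) with a uniform root distribution."""
    root = partial(tree, column, alpha)
    return sum(0.25 * root[b] for b in BASES)

def most_likely_rate(tree, column, grid):
    """Grid search for the rate alpha that maximises the column likelihood."""
    return max(grid, key=lambda alpha: column_likelihood(tree, column, alpha))

# Hypothetical 4-species tree; branch lengths are made up for illustration.
TREE = [([("human", 0.1), ("chimp", 0.1)], 0.1),
        ([("baboon", 0.2), ("macaque", 0.2)], 0.1)]
GRID = [0.01 * k for k in range(1, 301)]

conserved = {"human": "A", "chimp": "A", "baboon": "A", "macaque": "A"}
variable = {"human": "A", "chimp": "G", "baboon": "T", "macaque": "C"}
rate_conserved = most_likely_rate(TREE, conserved, GRID)
rate_variable = most_likely_rate(TREE, variable, GRID)
```

A column that is identical in every species gets the smallest rate on the grid, while a column that differs in every species gets a larger one. A low estimated rate is the signal that the position may be functional.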
Phylogenetic tree: Results
Drawback: No biological model built in.
Gene structure
A gene finder should satisfy:
• Promoter region about 50 bases upstream of the gene
• TATA: start of transcription
• 5' untranslated region
• 3' untranslated region
Gene Model
[Figure: state diagram with states S1 to S6, labelled TATA, Exon and Intron]
Hidden Markov Chain Model
Composed of:
• A sequence of unobservable states S1, S2, S3, …, Sn, where each Si is exon or intron. The jump from Si to Si+1 follows a Markov chain: P(Si+1 | Si).
• A sequence of letters (or of sequences of letters) O1, O2, O3, …, On, which are emitted by the states according to P(Oi | Si) and which are observed.
[Figure: chain S1 … S7 emitting O1 … O7 = ACGTACG…, with example transition P(S5 | S4) and example emission P(O1 | S1)]
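The two ingredients above determine the joint probability of a state path and the observed letters. Written out, assuming an initial distribution P(S_1) that the slide does not state explicitly:

```latex
P(S_1,\dots,S_n,\,O_1,\dots,O_n)
  = P(S_1)\,P(O_1 \mid S_1)\,\prod_{i=1}^{n-1} P(S_{i+1} \mid S_i)\,P(O_{i+1} \mid S_{i+1})
```

Each factor is either a transition probability between consecutive hidden states or the emission probability of one observed letter.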
Viterbi Algorithm
• Given a sequence of observed letters O1, …, On, find the most probable sequence of unobservable states S1, …, Sn.
• Mathematically, find the S1, …, Sn that maximise Prob(S1,…,Sn,O).
• 2 steps:
  • Compute max Prob(S,O) via dynamic programming: max Prob(S1,…,Si+1,O) = f ( max Prob(S1,…,Si,O) )
  • Find a sequence of states which achieves the optimum: Si = argmax max Prob(S1,…,Si,O).
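The two steps can be sketched as a forward dynamic-programming pass followed by a traceback. This is a minimal illustration in log space; the two-state exon/intron model and all of its probabilities are invented for the example and are not taken from the talk or the papers.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most probable state path for an HMM, computed in log space."""
    # best[i][s]: log-probability of the best path ending in state s at position i
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = []  # back[i-1][s]: best predecessor of state s at position i
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[-1][r] + math.log(trans_p[r][s]))
            pointers[s] = prev
            scores[s] = (best[-1][prev] + math.log(trans_p[prev][s])
                         + math.log(emit_p[s][obs]))
        best.append(scores)
        back.append(pointers)
    # Traceback from the best final state
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical toy model: GC-rich "exon" state, AT-rich "intron" state.
STATES = ["exon", "intron"]
START = {"exon": 0.5, "intron": 0.5}
TRANS = {"exon": {"exon": 0.9, "intron": 0.1},
         "intron": {"exon": 0.1, "intron": 0.9}}
EMIT = {"exon": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

path = viterbi("GCGCGCATATAT", STATES, START, TRANS, EMIT)
```

On this toy input the decoded path labels the GC-rich half as exon and the AT-rich half as intron. Working with logarithms avoids numerical underflow on long sequences.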
Generalized hidden Markov phylogeny
Combines the 2 concepts:
Hidden Markov chain + Phylogenetic tree = Generalized hidden Markov phylogeny
Global Method • Get a series of DNA sequences • Align them • Build the Generalized Hidden Markov Model • Train the parameters on sample genes • Find the hidden states: Si • The coding sequences are the exons
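The decoding step of this pipeline can be sketched by letting each hidden state emit a whole alignment column, scored with a phylogenetic likelihood at a state-specific evolutionary rate. The sketch below is an illustration only, not the models of either paper: it assumes a Jukes-Cantor substitution model, a star-shaped three-species tree, two states whose rates are fixed by hand instead of trained, and ordinary (non-generalized) state durations. All names and numbers are hypothetical.

```python
import math

BASES = "ACGT"

def jc_prob(x, y, t):
    """Jukes-Cantor probability that base x becomes base y over branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def column_likelihood(tree, column, rate):
    """Felsenstein pruning for one alignment column; branch lengths scaled by rate."""
    def partial(node):
        if isinstance(node, int):  # leaf: index of the species row in the alignment
            return [1.0 if column[node] == b else 0.0 for b in BASES]
        out = [1.0] * 4
        for child, length in node:  # internal node: list of (child, branch length)
            cp = partial(child)
            for i, b in enumerate(BASES):
                out[i] *= sum(jc_prob(b, c, rate * length) * cp[j]
                              for j, c in enumerate(BASES))
        return out
    return sum(0.25 * p for p in partial(tree))

def phylo_hmm_viterbi(alignment, tree, rates, start_p, trans_p):
    """Viterbi decoding where each state emits a whole alignment column."""
    states = list(rates)
    columns = list(zip(*alignment))  # one tuple of bases per alignment position

    def emit(s, col):
        return math.log(column_likelihood(tree, col, rates[s]))

    best = {s: math.log(start_p[s]) + emit(s, columns[0]) for s in states}
    back = []
    for col in columns[1:]:
        new_best, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[r] + math.log(trans_p[r][s]))
            pointers[s] = prev
            new_best[s] = best[prev] + math.log(trans_p[prev][s]) + emit(s, col)
        best = new_best
        back.append(pointers)
    state = max(states, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical data: 3 aligned species, conserved first half, variable second half.
ALIGNMENT = ["ACGTACACGTAC",
             "ACGTACCGTACG",
             "ACGTACGTACGT"]
STAR_TREE = [(0, 0.3), (1, 0.3), (2, 0.3)]   # made-up branch lengths
RATES = {"exon": 0.1, "intron": 2.0}         # slow rate = conserved state
START = {"exon": 0.5, "intron": 0.5}
TRANS = {"exon": {"exon": 0.9, "intron": 0.1},
         "intron": {"exon": 0.1, "intron": 0.9}}

labels = phylo_hmm_viterbi(ALIGNMENT, STAR_TREE, RATES, START, TRANS)
```

On this toy alignment the conserved columns are labelled exon and the variable columns intron. In the pipeline above, the rates and transition probabilities would come from the training step on sample genes.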
Contributions of the 1st paper
• 1st to implement the Hidden Markov Phylogeny on the Primate/Human phylogeny.
• Requires only 5 primate species.
• Able to find the apo(a) gene.
[Figure: Gene Finders]
Contributions of the 2nd paper
Implements a sophisticated Hidden Markov Phylogeny on the Human/Mouse phylogeny:
• Context-dependent phylogenetic models (high-order Markov chain: the emission of one state also depends on the neighboring states). More computationally expensive but better.
• Explicit modeling of conserved non-coding sequences.
• Modeling of insertions and deletions.
Results of the 2nd paper
[Figure: comparison of Gene Finders]
Conclusion
• Genes are found by comparing genomes.
• Mouse/Human for old genes
• Primate/Human for recent genes
• In either case, the same tool extracts the coding sequences: Hidden Markov Phylogeny
• Future: Improve the Markov model, sequence more genomes.
Thank you! Questions?