HMM Sampling and Applications to Gene Finding and Alignment. European Conference on Computational Biology 2003. Simon Cawley* and Lior Pachter+, with thanks to Eli Rusman. (* Affymetrix, + UC Berkeley Mathematics Dept.)
Conservation of alternative splicing between human and mouse
• Modrek and Lee: 40-60% of human genes have alternative splice forms. Nature Genetics 2002.
• Nurtdinov et al.: 75% of human alternative splice forms are conserved in mouse. Human Molecular Genetics 2003.
Can we develop ab-initio methods for detecting conserved alternative splice sites?
Sequence Alignment A C A T T A G A A A G A T T A C C A C A
Finding the optimal alignment max A C A T T A G A A A G A T T A C C A C A
Sampling to find alternative alignments
$a_{i,j} = w\,a_{i-1,j} + w\,a_{i,j-1} + s_{i,j}\,a_{i-1,j-1}$
• $a_{i,j}$: alignment forward variables for positions [1,i] and [1,j] in each sequence
• $w$: gap probabilities
• $s_{i,j}$: match/mismatch probabilities for positions i,j in each sequence
A C A T T A G A A A G A T T A C C A C A
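The forward recursion on this slide can be turned into a sampler: fill the forward table, then trace back from the final cell, choosing each step with probability proportional to its term in the recursion. The following is a minimal sketch under stated assumptions, not the authors' implementation. The boundary condition a[0][0] = 1, the function names `forward_table` and `sample_alignment`, the gap weight `W`, the `score` function, and the two example sequences are all hypothetical choices made for illustration.

```python
import random

W = 0.1  # hypothetical gap probability (the slide's w)

def score(p, q):
    """Hypothetical match/mismatch probability (the slide's s_ij)."""
    return 0.9 if p == q else 0.1

def forward_table(x, y, w, s):
    """a[i][j] = total weight of all alignments of x[:i] with y[:j]."""
    T, U = len(x), len(y)
    a = [[0.0] * (U + 1) for _ in range(T + 1)]
    a[0][0] = 1.0  # assumed boundary condition: the empty alignment
    for i in range(T + 1):
        for j in range(U + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if i > 0:
                total += w * a[i - 1][j]          # gap in y
            if j > 0:
                total += w * a[i][j - 1]          # gap in x
            if i > 0 and j > 0:
                total += s(x[i - 1], y[j - 1]) * a[i - 1][j - 1]
            a[i][j] = total
    return a

def sample_alignment(x, y, a, w, s, rng):
    """Stochastic traceback: pick each step in proportion to its forward term."""
    i, j = len(x), len(y)
    top, bottom = [], []
    while i > 0 or j > 0:
        kinds, weights = [], []
        if i > 0:
            kinds.append("up")
            weights.append(w * a[i - 1][j])
        if j > 0:
            kinds.append("left")
            weights.append(w * a[i][j - 1])
        if i > 0 and j > 0:
            kinds.append("diag")
            weights.append(s(x[i - 1], y[j - 1]) * a[i - 1][j - 1])
        move = rng.choices(kinds, weights=weights)[0]
        if move == "up":
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        elif move == "left":
            top.append("-"); bottom.append(y[j - 1]); j -= 1
        else:
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))

if __name__ == "__main__":
    x, y = "ACATTA", "GATTACA"  # hypothetical example sequences
    table = forward_table(x, y, W, score)
    rng = random.Random(0)
    for _ in range(3):
        print(*sample_alignment(x, y, table, W, score, rng), sep="\n", end="\n\n")
```

This sketch stores the full table, so it uses O(TU) memory. Each sample costs one traceback of at most T+U steps, which matches the k(T+U) term in the sampling cost.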
Linear Space Sampling
Sequences of length T, U; to obtain k samples:
• Time complexity: O(TU + k(T+U))
• Memory requirements: O(T+U)
Hirschberg’s divide and conquer algorithm:
• Time complexity: O(TU)
• Memory requirements: O(T+U)
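To show where the memory saving starts, the sketch below computes the total forward weight while keeping only two rows of the table. It covers the forward pass only and is not the linear-space sampling algorithm from the talk, which needs extra machinery to trace back without the full table. The function name `forward_total_linear_space`, the boundary condition, and the parameters `w` and `s` are assumptions for illustration.

```python
def forward_total_linear_space(x, y, w, s):
    """Total weight of all alignments of x with y, using two rows of memory.

    w is a gap weight and s(p, q) a match/mismatch weight (both hypothetical).
    """
    U = len(y)
    prev = [0.0] * (U + 1)
    prev[0] = 1.0  # assumed boundary condition: the empty alignment
    for j in range(1, U + 1):
        prev[j] = w * prev[j - 1]          # first row: only gaps are possible
    for i in range(1, len(x) + 1):
        curr = [0.0] * (U + 1)
        curr[0] = w * prev[0]              # first column: only gaps are possible
        for j in range(1, U + 1):
            curr[j] = (w * prev[j]
                       + w * curr[j - 1]
                       + s(x[i - 1], y[j - 1]) * prev[j - 1])
        prev = curr                        # discard the older row
    return prev[U]
```

Only the previous and current rows are ever live, so the working memory grows with the length of one sequence instead of with the product TU.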
Alternative Splicing in Mammalian Genomes
[Figure: a pre-mRNA undergoes splicing or alternative splicing; each resulting transcript is translated, giving Protein I and Protein II.]
Cross-species simultaneous gene finding and alignment
M. Alexandersson, S. Cawley, L. Pachter, "SLAM: Cross-species gene finding and alignment with a generalized pair hidden Markov model", Genome Research 13 (2003), pp. 496-502.
Modeling gene features
[Figure: human and mouse gene structures drawn 5’ to 3’, with Exon 1, Intron 1, Exon 2, Intron 2, Exon 3 and three CNS regions labeled.]
SLAM components
• Splice site detector
  ◦ VLMM
• Intron and intergenic regions
  ◦ 2nd order Markov chain
  ◦ independent geometric lengths
• Coding sequence
  ◦ PHMM on protein level
  ◦ generalized length distribution
• Conserved non-coding sequence
  ◦ PHMM on DNA level
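As an illustration of the intron and intergenic component, the sketch below scores a sequence under a 2nd order Markov chain and scores a region length under a geometric distribution. It is a generic sketch, not SLAM's code or parameters: the function names, the layout of the `cond_prob` table, and the stop probability `p_stop` are all hypothetical.

```python
import math

def second_order_log_prob(seq, cond_prob):
    """log P(seq[2:] | seq[:2]) under a 2nd order Markov chain.

    cond_prob[(a, b)][c] is the (hypothetical) probability of base c
    given that the two preceding bases are a then b.
    """
    total = 0.0
    for k in range(2, len(seq)):
        total += math.log(cond_prob[(seq[k - 2], seq[k - 1])][seq[k]])
    return total

def geometric_length_log_prob(length, p_stop):
    """log P(L = length) for a geometric length with P(L = n) = (1 - p_stop)^(n-1) * p_stop."""
    if length < 1:
        raise ValueError("length must be at least 1")
    return (length - 1) * math.log(1.0 - p_stop) + math.log(p_stop)
```

A geometric length arises from a state with a fixed probability of exiting at each position. A "generalized length distribution", as listed for coding sequence, instead allows an arbitrary distribution over lengths.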
SLAM input and output
• Input:
  ◦ Pair of homologous sequences.
• Output:
  ◦ CDS and CNS predictions in both sequences.
  ◦ Protein predictions.
  ◦ Protein and CNS alignment.
Methodology for identifying alternative splice sites
• Compiled SLAM gene predictions for the human, mouse and rat genomes.
• Identified a set of 3400 human/mouse/rat gene triples with consistent predictions from the human/mouse (hs/mm) and human/rat (hs/rn) analyses.
• For each triple, sampled sub-optimal parses from the hs/mm and hs/rn runs.
• Collected alternative exons (non-Viterbi exons) that appeared in both the hs/mm and hs/rn runs.
• Examined overlap with RefSeq genes, mRNAs and ESTs.
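The collection step can be sketched as set operations on exon coordinates. This is a hypothetical illustration, not the authors' pipeline. It assumes each parse is a list of (start, end) exon intervals in human coordinates, and that "non-Viterbi" means absent from the Viterbi parse of the same run. The function names and data layout are invented for the example.

```python
def alternative_exons(viterbi_exons, sampled_parses):
    """Exons seen in at least one sampled parse but absent from the Viterbi parse."""
    seen = set()
    for parse in sampled_parses:
        seen.update(parse)
    return seen - set(viterbi_exons)

def conserved_alternative_exons(viterbi_mm, samples_mm, viterbi_rn, samples_rn):
    """Alternative exons (human coordinates) found in both the hs/mm and hs/rn runs."""
    alt_mm = alternative_exons(viterbi_mm, samples_mm)
    alt_rn = alternative_exons(viterbi_rn, samples_rn)
    return alt_mm & alt_rn

if __name__ == "__main__":
    # Hypothetical exon intervals for one gene triple.
    viterbi = [(100, 200), (300, 400)]
    samples_mm = [[(100, 200), (300, 400)],
                  [(100, 200), (320, 400)],
                  [(100, 200), (500, 560)]]
    samples_rn = [[(100, 200), (320, 400)],
                  [(100, 200), (300, 400), (700, 750)]]
    print(conserved_alternative_exons(viterbi, samples_mm, viterbi, samples_rn))
```

Requiring exact coordinate matches is the strictest choice. A real analysis might instead match on a shared splice site or allow partial overlap.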
SLAM whole genome predictions
• Built a whole genome homology map (Colin Dewey)
  http://baboon.math.berkeley.edu/~cdewey/homologyMaps/
• Pre-aligned the homologous blocks to reduce the SLAM search space (Nicolas Bray using AVID)
  http://baboon.math.berkeley.edu/mavid/
  http://hanuman.math.berkeley.edu/kbrowser/
• Ran SLAM on the resulting blocks
  http://bio.math.berkeley.edu/slam/mouse/
  http://bio.math.berkeley.edu/slam/rat/
[human] [mouse] [rat]
Conclusions
• Sampling is memory efficient and fast, and should be used routinely for alignment applications.
• Conserved alternative splice forms can be detected ab-initio.
• The extent of alternative splicing conservation is currently unclear. Sampling provides an alternative approach for investigating this problem, one that is not sensitive to biases in EST data.
• Problem: design effective and scalable validation strategies for alternative splice sites.