Multiple Species Gene Finding using Gibbs Sampling. Sourav Chatterji, Lior Pachter, University of California, Berkeley. Multiple Species Comparative Gene Finding (with Alignment). McAuliffe et al. (2004), Siepel et al. (2004).
Multiple Species Gene Finding using Gibbs Sampling • Sourav Chatterji, Lior Pachter • University of California, Berkeley
Multiple Species Comparative Gene Finding (with Alignment) • McAuliffe et al. (2004), Siepel et al. (2004)
Multiple Species Comparative Gene Finding (without Alignment)
Gibbs Sampling for Biological Sequence Analysis • Introduced by Lawrence et al. 1993 • Motif Detection • Extensions • Multiple Motifs in a Sequence • Multiple Types of Motifs • Applications • Alignment • Linkage Analysis
Gibbs Sampling • Aim: To sample from the joint distribution $p(x_1, x_2, \ldots, x_n)$ when it is easy to sample from the conditional distributions $p(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$ but not from the joint distribution. • Method: Iteratively sample $x_i^t$ from the conditional distribution $p(x_i \mid x_1^t, \ldots, x_{i-1}^t, x_{i+1}^{t-1}, \ldots, x_n^{t-1})$ • Theorem: For discrete distributions, the distribution of $(x_1^t, x_2^t, \ldots, x_n^t)$ converges to $p(x_1, x_2, \ldots, x_n)$
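The method above can be illustrated on a toy problem. The following is a minimal sketch, assuming a made-up joint distribution over two binary variables (the table `JOINT` and the function names are hypothetical and not from the talk). Each sweep resamples one coordinate at a time from its conditional, using the already updated values of the earlier coordinates, and the empirical frequencies are then compared with the joint.

```python
import random
from collections import Counter

# Hypothetical joint distribution over two binary variables (x1, x2).
JOINT = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}


def sample_conditional(i, state, rng):
    """Draw x_i from p(x_i | all other coordinates), read off from JOINT."""
    weights = []
    for v in (0, 1):
        candidate = list(state)
        candidate[i] = v
        weights.append(JOINT[tuple(candidate)])
    # rng.choices normalizes the weights, so no explicit division is needed.
    return rng.choices((0, 1), weights=weights)[0]


def gibbs(n_sweeps, burn_in=1000, seed=0):
    """Run the sampler and return the empirical frequency of each state."""
    rng = random.Random(seed)
    state = [0, 0]
    counts = Counter()
    for t in range(n_sweeps + burn_in):
        # One sweep: coordinate i sees the new values of coordinates < i
        # and the previous-sweep values of coordinates > i.
        for i in range(len(state)):
            state[i] = sample_conditional(i, state, rng)
        if t >= burn_in:
            counts[tuple(state)] += 1
    return {s: c / n_sweeps for s, c in counts.items()}


if __name__ == "__main__":
    estimate = gibbs(50_000)
    for s in sorted(JOINT):
        print(s, JOINT[s], round(estimate.get(s, 0.0), 3))
```

With enough sweeps the printed frequencies should approach the entries of `JOINT`, which is the convergence statement of the theorem in this small discrete case.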
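Connection to HMMs [Figure: HMM with hidden states $Z_1, Z_2, \ldots, Z_m$, observations $Y_1, Y_2, \ldots, Y_m$, and parameters $\theta_s$, $\theta_t$] • $\theta_t$ = output probabilities • $\theta_s$ = transition probabilities • Difficult to sample from $P(\theta, Z \mid Y)$ • Easy to sample $\theta$ from $P(\theta \mid Z, Y)$ • Easy to sample $Z$ from $P(Z \mid \theta, Y)$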
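The two "easy" conditionals can be sketched as follows. This is an illustrative toy, not the authors' gene finder: it assumes symmetric Dirichlet priors on the rows of the transition and output tables, a uniform initial state distribution, and integer-coded observations, and all function names are hypothetical. Parameters are drawn from Dirichlet posteriors built from counts, and the state path is drawn by forward filtering followed by backward sampling.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw from Dirichlet(alphas) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_params(Z, Y, n_states, n_symbols, rng, prior=1.0):
    """Sample parameters given Z and Y (assumed Dirichlet priors)."""
    trans = [[prior] * n_states for _ in range(n_states)]
    emit = [[prior] * n_symbols for _ in range(n_states)]
    for a, b in zip(Z, Z[1:]):
        trans[a][b] += 1  # count observed state transitions
    for z, y in zip(Z, Y):
        emit[z][y] += 1  # count observed (state, symbol) pairs
    theta_s = [sample_dirichlet(row, rng) for row in trans]  # transitions
    theta_t = [sample_dirichlet(row, rng) for row in emit]   # outputs
    return theta_s, theta_t


def sample_states(Y, theta_s, theta_t, rng):
    """Sample Z given parameters and Y: forward filter, backward sample."""
    k = len(theta_s)
    alpha = []  # alpha[m][j] is proportional to P(Z_m = j | Y_1..Y_m)
    prev = [1.0 / k] * k  # assumed uniform initial distribution
    for m, y in enumerate(Y):
        if m == 0:
            cur = [prev[j] * theta_t[j][y] for j in range(k)]
        else:
            cur = [theta_t[j][y] * sum(prev[i] * theta_s[i][j] for i in range(k))
                   for j in range(k)]
        norm = sum(cur)
        cur = [c / norm for c in cur]  # normalize to avoid underflow
        alpha.append(cur)
        prev = cur
    Z = [0] * len(Y)
    Z[-1] = rng.choices(range(k), weights=alpha[-1])[0]
    for m in range(len(Y) - 2, -1, -1):
        # P(Z_m = j | Z_{m+1}, Y) is proportional to alpha[m][j] * theta_s[j][Z_{m+1}]
        w = [alpha[m][j] * theta_s[j][Z[m + 1]] for j in range(k)]
        Z[m] = rng.choices(range(k), weights=w)[0]
    return Z


def gibbs_hmm(Y, n_states, n_symbols, n_iter=200, seed=0):
    """Alternate the two conditional draws, starting from a random path."""
    rng = random.Random(seed)
    Z = [rng.randrange(n_states) for _ in Y]
    for _ in range(n_iter):
        theta_s, theta_t = sample_params(Z, Y, n_states, n_symbols, rng)
        Z = sample_states(Y, theta_s, theta_t, rng)
    return Z, theta_s, theta_t
```

For example, `gibbs_hmm([0, 0, 1, 0, 2, 3, 3, 2, 3, 0, 0, 1], n_states=2, n_symbols=4)` returns one sampled state path together with the last sampled parameter tables. A real gene finder would use a much richer state space (exons, introns, splice sites) than this two-state toy.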
Gibbs Sampling for Gene Finding Initial Predictions
Gibbs Sampling for Gene Finding Sample $Z_1$ from $P(Z_1 \mid Z_{[-1]}, Y)$
Gibbs Sampling for Gene Finding Sample $Z_2$ from $P(Z_2 \mid Z_{[-2]}, Y)$
Additional Details • Issues in the Gibbs Sampling Method • Gibbs sampling assumes the sequences are generated independently by an HMM: need to generalize the method to a tree topology. • Learn parameters from a subset of sequences roughly equidistant from each other: human, mouse, dog and cow • Things get messy when there are multiple genes; need to handle multiple sets of parameters. • Make use of an approximate alignment • Boost scores using a phylo-HMM
Results • 2060 exons predicted • Exon level Sensitivity : 23.2% • Exon level Specificity : 46.7% • 28.5% of predicted exons partially overlap with true exons. • Nucleotide Level Sensitivity : 42.8% • Nucleotide Level Specificity : 82.1%
Results • Nucleotide-level results are much better than exon-level results • Need for better splice site models, probably multiple-species splice site models. • Low sensitivity • Is it the alignment?
Analysis of results (novel genes) • Statistics of transcripts overlapping with novel VEGA genes • 223 exons predicted • Exon level Sensitivity : 24.8% (78 of 315 true exons are predicted correctly) • Exon level Specificity : 35.0% (78 of the 223 predicted exons are correct) • Additionally, 24.7% of predicted exons partially overlap with the true exons. • Nucleotide level Sensitivity : 56.6% • Nucleotide level Specificity : 62.9%
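The exon-level figures for the novel genes follow directly from the counts on the slide. As a small check, the sketch below recomputes them; the function name is hypothetical. Note that "specificity" here follows the gene-finding convention used on the slide, the fraction of predicted exons that are correct, which is what other fields call precision.

```python
def exon_level_accuracy(n_correct, n_true, n_predicted):
    """Return (sensitivity, specificity) in the gene-finding sense.

    sensitivity = correctly predicted exons / true exons
    specificity = correctly predicted exons / predicted exons
    """
    sensitivity = n_correct / n_true
    specificity = n_correct / n_predicted
    return sensitivity, specificity


# Counts from the slide: 78 correct, 315 true exons, 223 predicted exons.
sn, sp = exon_level_accuracy(78, 315, 223)
print(f"sensitivity = {sn:.1%}, specificity = {sp:.1%}")
# prints "sensitivity = 24.8%, specificity = 35.0%"
```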