190 likes | 524 Views
A Very Basic Gibbs Sampler for Motif Detection . Frances Tong July 28, 2004 Southern California Bioinformatics Summer Institute. What?. What is a motif? What is the biological point? What is the goal? What is a limitation with dynamic programming? What is Gibbs sampling?
E N D
A Very Basic Gibbs Sampler for Motif Detection Frances Tong July 28, 2004 Southern California Bioinformatics Summer Institute
What? • What is a motif? • What is the biological point? • What is the goal? • What is a limitation with dynamic programming? • What is Gibbs sampling? • What is the program going to do? • What is the program missing? • What is a much better program?
What is a motif? • A motif is a sequence pattern that occurs repeatedly in a group of related dna or protein sequences.
What is the biological point? • Encoding the “structural motif” of a protein (dna in exons) • Prediction of function (protein) • Protein binding sites (dna) • Transcription binding factors
What is the goal? • Perform local multiple sequence alignment to find consensus sequences (motifs)
What is a limitation with dynamic programming? • Memory and time complexity issues • Size of search space = (L-W+1)^N L = length of a sequence W = width of motif N = number of sequences • ex: L=30, W=7, N=10 (30 -7+1)^10 = 1333735776850284124449081472843776 • Alignment of even four such sequences will take a few hours ~10^4 seconds
What is Gibbs sampling? • Stochastic optimization method • Works well with local multiple alignment without gaps (motif searching) • Searches for the statistically most probable motifs by sampling random positions instead of going through entire search space
What is the program going to do? • Ask user for : • file containing multiple dna or protein sequences • motif width • how many motifs wanted • Calculate the background frequencies of A,C,G,T from all the sequences. [0.34951456310679613, 0.17799352750809061, 0.21035598705501618, 0.23300970873786409]
What is the program going to do? • Generate random start positions for the motif in each sequence. ex: 10 sequences, 30 bp in length, motif width of 7 start = [2, 6, 9, 14, 5, 7, 20, 20, 6, 22] >> random.uniform(0,ceiling) where ceiling=len(sequence)-width
What is the program going to do? 4. Construct position specific score matrix from all sequences except one.
What is the program going to do? 5. Score the left-out sequence according to the position specific score matrix:
What is the program going to do? Example: Use the position specific matrix and background from before: [A: 0.34951456310679613, C: 0.17799352750809061, G: 0.21035598705501618, T: 0.23300970873786409] GATTACA:
What is the program going to do? 6. Randomly generate another start position of the motif for that left-out sequence. 7. Score that sequence with its new start position. 8. Compare this new score with its original score. 9. If newscore >= oldscore, then jump to that new start position, else jump to that new start position with probability =
What is the program going to do? 10. Start all over again with this updated start position with another sequence left out Do this many many times! ~ 1000 iterations Gibbs will converge to a stationary distribution of the start positions => a probable alignment of the multiple sequences
What is the program missing? • Doesn’t do reinitializations in the middle to get out of local maxima • Doesn’t optimize the width (you have to specify width explicitly) • Doesn’t do the Bayesian approach – just frequentist (easier for me and for you to understand!) • Doesn’t read in fasta files • Doesn’t do error checking! • And other things that don’t know they are missing yet!
What is a much better program? • Gibbs Motif Sampler • http://bayesweb.wadsworth.org/gibbs/gibbs.html • AlignAce • http://atlas.med.harvard.edu/cgi-bin/alignace.pl