
Motif identification with Gibbs Sampler

Not enough material. Motif identification with Gibbs Sampler. Xuhua Xia xxia@uottawa.ca http://dambe.bio.uottawa.ca. Background. Named after Josiah Willard Gibbs (February 11, 1839 – April 28, 1903), winner of the Copley Medal of the Royal Society of London in 1901.


Presentation Transcript


  1. Not enough material Motif identification with Gibbs Sampler Xuhua Xia xxia@uottawa.ca http://dambe.bio.uottawa.ca

  2. Background
     • The Gibbs sampler is named after Josiah Willard Gibbs (February 11, 1839 – April 28, 1903), winner of the Copley Medal of the Royal Society of London in 1901.
     • It is one of the Markov chain Monte Carlo (MCMC) algorithms.
     • Biological applications:
        • Identification of regulatory sequences of genes (Aerts et al., 2005; Coessens et al., 2003; Lawrence et al., 1993; Qin et al., 2003; Thijs et al., 2001; Thijs et al., 2002a; Thijs et al., 2002b; Thompson et al., 2004; Thompson et al., 2003) and functional motifs in proteins (Mannella et al., 1996; Neuwald et al., 1995; Qu et al., 1998)
        • Classification of biological images (Samso et al., 2002)
        • Pairwise sequence alignment (Zhu et al., 1998) and multiple sequence alignment (Holmes and Bruno, 2001; Jensen and Hein, 2005)

  3. Motif identification by Gibbs sampler
     Other outputs of the Gibbs sampler:
     • A position weight matrix (PWM) that can be used to scan other sequences for motifs, with the associated significance tests
     • PWM scores for the identified motifs

  4. Gibbs sampler in motif finding • Site sampler • Motif sampler

  5. Algorithm details: Initialization
                   1         2         3         4
          1234567890123456789012345678901234567890123
     S1   TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
     S2   CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG
     S3   TCGACCCTCTGAACCTATCAGGGACCACAGTCAGCCAGGCAAG
     S4   AAAACACTTGAGGGAGCAGATAACTGGGCCAACCATGACTC
     S5   GGGTGAATGGTACTGCTGATTACAACCTCTGGTGCTGC..
     ...
     S11  CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG..
     ...
     Nucleotide counts: FA: 325, FC: 316, FG: 267, FT: 301; Sum: 1209
     Randomly choose a motif start Ai in each sequence.
     Table 7-1. Site-specific distribution of nucleotides from the 29 random motifs of length 6. The second column lists the distribution of nucleotides outside the 29 random motifs.
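The initialization step can be sketched as follows. This is a minimal illustration, not the presenter's implementation: the function names `initialize` and `count_sites` are hypothetical, only the four complete sequences shown on the slide are used, and start positions are 0-based.

```python
import random
from collections import Counter

# The four sequences that appear in full on the slide (the rest are truncated there).
seqs = [
    "TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT",
    "CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG",
    "TCGACCCTCTGAACCTATCAGGGACCACAGTCAGCCAGGCAAG",
    "AAAACACTTGAGGGAGCAGATAACTGGGCCAACCATGACTC",
]
MOTIF_LEN = 6


def initialize(seqs, motif_len, rng):
    """Pick one random 0-based motif start A_i per sequence."""
    return [rng.randrange(len(s) - motif_len + 1) for s in seqs]


def count_sites(seqs, starts, motif_len):
    """Return (site_counts, background_counts) for the current starts.

    site_counts[k] counts nucleotides at motif position k across sequences;
    background_counts counts every nucleotide outside the chosen motifs.
    """
    site_counts = [Counter() for _ in range(motif_len)]
    background = Counter()
    for s, a in zip(seqs, starts):
        for pos, base in enumerate(s):
            if a <= pos < a + motif_len:
                site_counts[pos - a][base] += 1
            else:
                background[base] += 1
    return site_counts, background


rng = random.Random(0)  # fixed seed so the sketch is repeatable
starts = initialize(seqs, MOTIF_LEN, rng)
site_counts, background = count_sites(seqs, starts, MOTIF_LEN)
```

Each entry of `site_counts` corresponds to one row of a table like Table 7-1, and `background` to its "outside the motifs" column.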

  6. Algorithm details: Predictive update S11 CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG

  7. Predictive update: Frequencies
     Table 7-3. Site-specific distribution of nucleotide frequencies derived from data in Table 7-2, with  = 0.0001. The second column lists the distribution of nucleotide frequencies outside the 28 random motifs.
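The slide does not show how counts become frequencies. One common choice, used here purely as an assumption, adds a small pseudocount weighted by the background frequency so that a nucleotide with zero count still gets a nonzero frequency. In the sketch below, the name `alpha` for the constant 0.0001, the function `site_frequencies`, and the example column of counts are all hypothetical; the background is taken from the total counts on slide 5.

```python
def site_frequencies(counts, background_freq, alpha):
    """Pseudocount-smoothed frequencies for one motif position.

    Assumed form: q_j = (c_j + alpha * p_j) / (n + alpha), where n is the
    number of motifs counted and p_j the background frequency.
    """
    n = sum(counts.values())
    return {
        b: (counts.get(b, 0) + alpha * background_freq[b]) / (n + alpha)
        for b in background_freq
    }


# Total nucleotide counts from slide 5 (sum 1209).
total = {"A": 325, "C": 316, "G": 267, "T": 301}
grand = sum(total.values())
bg = {b: c / grand for b, c in total.items()}

# Hypothetical counts at one motif position across 28 motifs.
column = {"A": 7, "C": 9, "G": 0, "T": 12}
freqs = site_frequencies(column, bg, alpha=0.0001)
```

With this form the frequencies at a position still sum to 1, and the zero count for G maps to a very small positive value rather than zero, which keeps the later logarithm defined.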

  8. Predictive update: PWM
     S11 CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG
     Odds ratio for CATGCC $= e^{-0.9113 - 0.0693 + 0.1731 - 0.4469 - 0.2228 - 0.4042} = 0.153$
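The arithmetic on this slide can be checked directly: the six PWM entries for C, A, T, G, C, C are log-odds, so summing them and exponentiating gives the odds ratio. The variable names are illustrative only.

```python
import math

# PWM entries (natural-log odds) for C, A, T, G, C, C at motif positions 1..6,
# copied from the slide.
log_scores = [-0.9113, -0.0693, 0.1731, -0.4469, -0.2228, -0.4042]

# Adding log-odds multiplies the per-site odds, so exp(sum) is the odds ratio.
odds_ratio = math.exp(sum(log_scores))
```

With the rounded entries shown, the result agrees with the slide's 0.153 to two decimal places.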

  9. Predictive update
     S11 CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG
     Table 7-4. Possible locations of the 6-mer motif along S11, together with the corresponding motifs and their position weight matrix scores expressed as odds ratios. The last column lists the odds ratios scaled to sum to 1.
     • S11 is 40 nucleotides long, so a 6-mer can start at 40 – 6 + 1 = 35 positions.
     • Pick the location with the largest odds ratio to replace the originally picked one, update the Ai value, and generate a new frequency matrix and a new PWM.
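The scan described on this slide can be sketched as below. Table 7-4 and the intermediate PWM are not reproduced in the transcript, so as an assumption this sketch borrows the final PWM from the report on slide 14 and the RNA form of S11 from slide 15. The function name `scan` is hypothetical, and start positions are 0-based.

```python
import math

# Final PWM from slide 14, one dict per motif position, entries ln(Qij/Q0).
PWM = [
    {"A": -0.93304, "C": 0.31199, "G": -5.57384, "U": 0.86909},
    {"A": -5.46894, "C": -5.54337, "G": 0.13202, "U": 1.20499},
    {"A": 1.00364, "C": -5.54337, "G": 0.13202, "U": -5.34419},
    {"A": -5.46894, "C": -5.54337, "G": -5.57384, "U": 1.52737},
    {"A": 0.26340, "C": 0.80335, "G": -5.57384, "U": -1.81131},
    {"A": 0.79269, "C": -5.54337, "G": -1.92440, "U": 0.55966},
]
S11 = "CAUGCCCUCAAGUGUGCAGAUUGGUCACAGCAUUUCAAGG"


def scan(seq, pwm):
    """Score every window; return (start, motif, odds, normalized odds) rows."""
    width = len(pwm)
    rows = []
    for start in range(len(seq) - width + 1):
        motif = seq[start:start + width]
        odds = math.exp(sum(pwm[i][b] for i, b in enumerate(motif)))
        rows.append((start, motif, odds))
    total = sum(odds for _, _, odds in rows)
    return [(s, m, o, o / total) for s, m, o in rows]


table = scan(S11, PWM)
best = max(table, key=lambda row: row[2])  # window with the largest odds ratio
```

Under these assumptions `table` has 35 rows and `best` is the window UGGUCA at 0-based start 21, matching the S11 entry on slide 16.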

  10. Algorithm details: Predictive update
      S11 CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG
      A new PWM is generated; use it to scan another sequence.

  11. F as a criterion
      • Once all sequences are updated and a new set of Ai values is obtained, compute F.
      • Update all the sequences again to obtain a new set of Ai and a new F. If the new F is greater than the old F, replace the old set of Ai values with the new set. Repeat until the F value no longer increases or the maximum number of local iterations is reached.
      • This (from initialization to this slide) completes one global cycle of iteration.
      • Repeat a number of global cycles until F does not increase.
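The formula for F is not included in this transcript. A form often used with Gibbs samplers for motifs, following Lawrence et al. (1993), sums each site-specific count times its log-odds PWM entry; the sketch below assumes that form and is not confirmed by the slides. It uses the final counts and PWM from slide 14 as input, with columns in the order A, C, G, U, and the function name `objective_f` is hypothetical.

```python
# Final site-specific counts from slide 14 (rows: positions 1..6; columns: A, C, G, U).
COUNTS = [
    [3, 11, 0, 15],
    [0, 0, 8, 21],
    [21, 0, 8, 0],
    [0, 0, 0, 29],
    [10, 18, 0, 1],
    [17, 0, 1, 11],
]

# Final PWM from slide 14, entries ln(Qij/Q0), same layout as COUNTS.
PWM = [
    [-0.93304, 0.31199, -5.57384, 0.86909],
    [-5.46894, -5.54337, 0.13202, 1.20499],
    [1.00364, -5.54337, 0.13202, -5.34419],
    [-5.46894, -5.54337, -5.57384, 1.52737],
    [0.26340, 0.80335, -5.57384, -1.81131],
    [0.79269, -5.54337, -1.92440, 0.55966],
]


def objective_f(counts, pwm):
    """Assumed criterion: F = sum over sites i and bases j of C_ij * ln(Q_ij / Q_0j)."""
    return sum(
        c * w
        for count_row, pwm_row in zip(counts, pwm)
        for c, w in zip(count_row, pwm_row)
    )


f_value = objective_f(COUNTS, PWM)
```

Under this assumed definition, F grows when the chosen motifs agree with each other and differ from the background, which is why an increase in F is taken as an improvement.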

  12. F as a criterion

  13. Summary of the algorithms
      • To find a motif of length L from a set of N sequences, randomly pick an L-mer from each sequence.
      • From the N L-mers, produce a PWM.
      • Randomly pick a sequence and scan along it with the PWM to obtain a PWM score (PWMS) for each L-mer in the sequence.
      • Use the L-mer with the highest PWMS to update the PWM.
      • Repeat this scanning and updating until all sequences have been used.
      • Calculate F1.
      • Repeat the entire process and calculate F2.
      • Continue the process until Fi no longer increases.
      • Output:
         • The final PWM, as well as the PWMS for each sequence
         • The aligned motifs
         • Associated statistics
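The steps summarized above can be put together in a compact sketch. It is an illustration under stated assumptions rather than the presenter's program: the pseudocount `alpha`, the use of whole-dataset base frequencies as background, the assumed form of F (sum of PWM scores of the chosen motifs), and all function names are hypothetical. It follows the slides in picking the highest-scoring L-mer at each update; a classical Gibbs sampler would instead draw a position at random in proportion to the normalized odds.

```python
import math
import random

BASES = "ACGT"


def build_pwm(motifs, background, alpha=0.1):
    """Log-odds PWM, ln(q_ij / q_0j), from equal-length motifs (alpha is assumed)."""
    n = len(motifs)
    pwm = []
    for i in range(len(motifs[0])):
        column = [m[i] for m in motifs]
        pwm.append({
            b: math.log((column.count(b) + alpha * background[b]) / (n + alpha) / background[b])
            for b in BASES
        })
    return pwm


def score(motif, pwm):
    """PWM score (log-odds) of one L-mer."""
    return sum(pwm[i][b] for i, b in enumerate(motif))


def best_start(seq, pwm):
    """0-based start of the highest-scoring window in seq."""
    width = len(pwm)
    return max(range(len(seq) - width + 1), key=lambda s: score(seq[s:s + width], pwm))


def site_sampler(seqs, width, rng, max_iter=50):
    """One global cycle: random starts, then local updates until F stops increasing."""
    text = "".join(seqs)
    background = {b: (text.count(b) + 1) / (len(text) + 4) for b in BASES}
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    best_f, best_starts = -math.inf, list(starts)
    for _ in range(max_iter):
        for i in rng.sample(range(len(seqs)), len(seqs)):
            # Build the PWM from all motifs except the one being updated.
            others = [s[a:a + width] for k, (s, a) in enumerate(zip(seqs, starts)) if k != i]
            starts[i] = best_start(seqs[i], build_pwm(others, background))
        motifs = [s[a:a + width] for s, a in zip(seqs, starts)]
        pwm = build_pwm(motifs, background)
        f = sum(score(m, pwm) for m in motifs)  # assumed form of F
        if f <= best_f:
            break
        best_f, best_starts = f, list(starts)
    return best_starts, best_f


def gibbs_motif_search(seqs, width, restarts=20, seed=0):
    """Run several global cycles and keep the result with the largest F."""
    rng = random.Random(seed)
    return max((site_sampler(seqs, width, rng) for _ in range(restarts)), key=lambda r: r[1])


SEQS = [
    "TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT",
    "CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG",
    "TCGACCCTCTGAACCTATCAGGGACCACAGTCAGCCAGGCAAG",
    "AAAACACTTGAGGGAGCAGATAACTGGGCCAACCATGACTC",
]
starts, f = gibbs_motif_search(SEQS, 6, restarts=5, seed=1)
motifs = [s[a:a + 6] for s, a in zip(SEQS, starts)]
```

Restarting from several random initializations corresponds to the global cycles on slide 11; the greedy local updates can stall in a local optimum, which is the reason for keeping the best F across restarts.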

  14. Final Report: Final Frequency
      Final site-specific counts:
      Site         A         C         G         U
         1         3        11         0        15
         2         0         0         8        21
         3        21         0         8         0
         4         0         0         0        29
         5        10        18         0         1
         6        17         0         1        11
      Final site-specific frequencies:
      Site         A         C         G         U
         1   0.10413   0.37882   0.00092   0.51613
         2   0.00112   0.00109   0.27563   0.72217
         3   0.72225   0.00109   0.27563   0.00103
         4   0.00112   0.00109   0.00092   0.99688
         5   0.34451   0.61920   0.00092   0.03537
         6   0.58489   0.00109   0.03526   0.37877
      Final PWM [ln(Qij/Q0)]:
      Site         A         C         G         U
         1  -0.93304   0.31199  -5.57384   0.86909
         2  -5.46894  -5.54337   0.13202   1.20499
         3   1.00364  -5.54337   0.13202  -5.34419
         4  -5.46894  -5.54337  -5.57384   1.52737
         5   0.26340   0.80335  -5.57384  -1.81131
         6   0.79269  -5.54337  -1.92440   0.55966
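The relation between the frequency table and the PWM in this report can be sketched as follows. The slide gives the formula ln(Qij/Q0) but not the background values, so this sketch assumes that Q0 is the nucleotide frequency outside the 29 motif sites, obtained by subtracting the final site-specific counts from the totals on slide 5 (with T written as U). That assumption is not stated in the slides, although it reproduces the well-supported PWM entries to about three decimals.

```python
import math

BASES = "ACGU"

# Total nucleotide counts from slide 5 (T written as U here).
TOTAL = {"A": 325, "C": 316, "G": 267, "U": 301}

# Final site-specific counts and frequencies from the report (columns: A, C, G, U).
COUNTS = [
    [3, 11, 0, 15],
    [0, 0, 8, 21],
    [21, 0, 8, 0],
    [0, 0, 0, 29],
    [10, 18, 0, 1],
    [17, 0, 1, 11],
]
FREQS = [
    [0.10413, 0.37882, 0.00092, 0.51613],
    [0.00112, 0.00109, 0.27563, 0.72217],
    [0.72225, 0.00109, 0.27563, 0.00103],
    [0.00112, 0.00109, 0.00092, 0.99688],
    [0.34451, 0.61920, 0.00092, 0.03537],
    [0.58489, 0.00109, 0.03526, 0.37877],
]

# Assumed background: counts outside the motifs divided by their total.
in_motif = {b: sum(row[j] for row in COUNTS) for j, b in enumerate(BASES)}
outside = {b: TOTAL[b] - in_motif[b] for b in BASES}
n_outside = sum(outside.values())
q0 = {b: outside[b] / n_outside for b in BASES}

# PWM entry for site i, base j: ln(Q_ij / Q_0j).
pwm = [[math.log(q / q0[b]) for q, b in zip(row, BASES)] for row in FREQS]
```

Entries derived from very small frequencies (such as 0.00092) differ slightly more from the report because the printed frequencies are rounded to five decimals.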

  15. Motif alignment
      Seq
       1  UCAGAACCAGUUAUAAAUUUAUCAUUUCCUUCUCCACUCCU
       2  CCCACGCAGCCGCCCUCCUCCCCGGUCACUGACUGGUCCUG
       3  UCGACCCUCUGAACCUAUCAGGGACCACAGUCAGCCAGGCAAG
       4  AAAACACUUGAGGGAGCAGAUAACUGGGCCAACCAUGACUC
       5  GGGUGAAUGGUACUGCUGAUUACAACCUCUGGUGCUGC
       6  AGCCUAGAGUGAUGACUCCUAUCUGGGUCCCCAGCAGGA
       7  GCCUCAGGAUCCAGCACACAUUAUCACAAACUUAGUGUCCA
       8  CAUUAUCACAAACUUAGUGUCCAUCCAUCACUGCUGACCCU
       9  UCGGAACAAGGCAAAGGCUAUAAAAAAAAUUAAGCAGC
      10  GCCCCUUCCCCACACUAUCUCAAUGCAAAUAUCUGUCUGAAACGGUUCC
      11  CAUGCCCUCAAGUGUGCAGAUUGGUCACAGCAUUUCAAGG
      12  GAUUGGUCACAGCAUUUCAAGGGAGAGACCUCAUUGUAAG
      13  UCCCCAACUCCCAACUGACCUUAUCUGUGGGGGAGGCUUUUGA
      14  CCUUAUCUGUGGGGGAGGCUUUUGAAAAGUAAUUAGGUUUAGC
      15  AUUAUUUUCCUUAUCAGAAGCAGAGAGACAAGCCAUUUCUCUUUCCUCCC
      23  GAAAAAAAAUAAAUGAAGUCUGCCUAUCUCCGGGCCAGAGCCCCU
      24  UGCCUUGUCUGUUGUAGAUAAUGAAUCUAUCCUCCAGUGACU
      25  GGCCAGGCUGAUGGGCCUUAUCUCUUUACCCACCUGGCUGU
      26  CAACAGCAGGUCCUACUAUCGCCUCCCUCUAGUCUCUG
      27  CCAACCGUUAAUGCUAGAGUUAUCACUUUCUGUUAUCAAGUGGCUUCAGC
      28  GGGAGGGUGGGGCCCCUAUCUCUCCUAGACUCUGUG
      29  CUUUGUCACUGGAUCUGAUAAGAAACACCACCCCUGC

  16. Motif scores
      SeqName  Motif   Start  PWMS
      S1       UUAUCA     18  493.3101
      S2       CGGUCA     22   40.4251
      S3       CUAUCA     14  282.6008
      S4       AGAUAA     17   16.2174
      S5       UGAUUA     16   12.3482
      S6       CUAUCU     18  223.8567
      S7       UUAUCA     20  493.3101
      S8       UUAUCA      2  493.3101
      S9       CUAUAA     17  164.6933
      S10      CUAUCU     14  223.8567
      S11      UGGUCA     21   70.5663
      S12      UUGUAA     33  120.2498
      S13      UUAUCU     20  390.7660
      S14      UUAUCU      2  390.7660
      S15      UUAUCA     10  493.3101
      ...      ...       ...  ...
      S27      UUAUCA     19  493.3101
      S28      CUAUCU     15  223.8567
      S29      UUGUCA      2  206.3393

  17. Motif sampler output
