Position Weight Matrices for Representing Signals in Sequences Triinu Tasa, Koke 04.02.05
Definitions
• Sequence, string – an ordered arrangement of letters from {'A', 'C', 'G', 'T'}
• Pattern – a simplified regular expression over the alphabet {'A', 'C', 'G', 'T', '.'}, where '.' is a wild card of length 1 (matching 'A', 'C', 'G' or 'T')
What is a weight matrix?
GATGAG
GATGAT
TGATAT
GATGAT or [GT][AG][TA][GT]A[GT]
What is a weight matrix?
Better:
GATGAG
GATGAT
TGATAT
Alignment matrix C:
A  0  2  1  0  3  0
C  0  0  0  0  0  0
G  2  1  0  2  0  1
T  1  0  2  1  0  2
Frequency matrix F:
A  0    0.7  0.3  0    1    0
C  0    0    0    0    0    0
G  0.7  0.3  0    0.7  0    0.3
T  0.3  0    0.7  0.3  0    0.7
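The two matrices above can be recomputed directly from the three example sequences. The following is a minimal sketch; the function names `count_matrix` and `frequency_matrix` are illustrative and not from the slides, and the frequencies are rounded to one decimal only to match the slide.

```python
SEQS = ["GATGAG", "GATGAT", "TGATAT"]
ALPHABET = "ACGT"

def count_matrix(seqs):
    """Alignment matrix C: how often each letter occurs in each column."""
    return {x: [col.count(x) for col in zip(*seqs)] for x in ALPHABET}

def frequency_matrix(seqs):
    """Frequency matrix F: column counts divided by the number of sequences."""
    n = len(seqs)
    return {x: [c / n for c in row] for x, row in count_matrix(seqs).items()}

C = count_matrix(SEQS)
F = frequency_matrix(SEQS)
for x in ALPHABET:
    print(x, C[x], [round(f, 1) for f in F[x]])
```

Here 0.7 and 0.3 on the slide are the rounded values of 2/3 and 1/3.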
What is a weight matrix?
Or weight matrix W:
[formula not preserved in the source]
where
N – number of sequences used
[symbol not preserved] – a priori probability of letter i
What is a weight matrix?
Importance matrix I:
I(i, j) = C(i, j) * F(i, j)
A  0    1.4  0.3  0    3    0
C  0    0    0    0    0    0
G  1.4  0.3  0    1.4  0    0.3
T  0.3  0    1.4  0.3  0    1.4
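The importance values listed above match the product of the count C(i, j) and the frequency F(i, j) for the sequences GATGAG, GATGAT and TGATAT, with F rounded to one decimal (2 * 0.7 = 1.4, 1 * 0.3 = 0.3, 3 * 1 = 3). A minimal sketch under that reading; the function name and the `digits` parameter are illustrative assumptions.

```python
SEQS = ["GATGAG", "GATGAT", "TGATAT"]
ALPHABET = "ACGT"

def importance_matrix(seqs, digits=None):
    """I(i, j) = C(i, j) * F(i, j); F is optionally rounded to `digits` decimals."""
    n = len(seqs)
    matrix = {}
    for x in ALPHABET:
        row = []
        for col in zip(*seqs):
            count = col.count(x)
            freq = count / n
            if digits is not None:
                freq = round(freq, digits)  # assumption: the slide rounds F first
            row.append(count * freq)
        matrix[x] = row
    return matrix

I = importance_matrix(SEQS, digits=1)
for x in ALPHABET:
    print(x, [round(v, 1) for v in I[x]])
```

Without rounding (`digits=None`) the entry for a letter seen in two of three sequences is 4/3 rather than 1.4.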
Applications - Clustering
• Pattern clustering
 1. G.GATGAG.T   62/75    1:39/49   2:23/26    R:17.3026  BP:1.12008e-37
 2. G.GATGAG     89/110   1:45/60   2:44/50    R:10.436   BP:1.61764e-34
 3. GATGAG.T     124/148  1:52/70   2:72/78    R:7.36961  BP:2.79148e-33
 4. TG.AAA.TTT   132/145  1:53/61   2:79/84    R:6.84578  BP:1.83509e-32
 5. AAAATTTT     200/231  1:63/77   2:137/154  R:4.69239  BP:1.19109e-30
 6. TGAAAA.TTT   104/114  1:45/53   2:59/61    R:7.78277  BP:3.86086e-29
 7. AAA.TTTT     343/537  1:79/145  2:264/392  R:3.05349  BP:5.66833e-29
 8. G.AAA.TTTT   135/156  1:51/62   2:84/94    R:6.19534  BP:5.69933e-29
 9. TG.GATGAG    49/57    1:30/35   2:19/22    R:16.1117  BP:9.35765e-28
10. TG.AAA.TTTT  86/91    1:40/43   2:46/48    R:8.87311  BP:1.1124e-27
...
Applications - Clustering
G.GATGAG.T:
GAGATGAGAT
GTGATGAGAT
GAGATGAGGT
...
A  -6.9   0.98  -6.9   1.38  -6.9  -6.9   1.38  -6.9   0.98  -6.9
C  -6.9  -6.9   -6.9  -6.9   -6.9  -6.9  -6.9   -6.9  -6.9   -6.9
G   1.38 -6.9    1.38 -6.9   -6.9   1.38 -6.9    1.38  0.29  -6.9
T  -6.9   0.29  -6.9  -6.9    1.38 -6.9  -6.9   -6.9  -6.9    1.38
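The entries of this matrix are close to ln(F / p) with a uniform a priori probability p = 0.25, computed from the three sequences shown: ln(1 / 0.25) ≈ 1.386, ln((2/3) / 0.25) ≈ 0.98 and ln((1/3) / 0.25) ≈ 0.29. The slides do not say how zero frequencies are handled, so the sketch below simply assumes a fixed floor of -6.9 (roughly ln 0.001); `PRIOR`, `FLOOR` and `weight_matrix` are illustrative names, not the author's.

```python
import math

SEQS = ["GAGATGAGAT", "GTGATGAGAT", "GAGATGAGGT"]
ALPHABET = "ACGT"
PRIOR = 0.25   # assumption: uniform a priori probability for every letter
FLOOR = -6.9   # assumption: score given to letters never seen in a column

def weight_matrix(seqs, prior=PRIOR, floor=FLOOR):
    """Log-odds weights ln(F / prior), with a fixed floor for unseen letters."""
    n = len(seqs)
    return {
        x: [math.log(col.count(x) / n / prior) if x in col else floor
            for col in zip(*seqs)]
        for x in ALPHABET
    }

W = weight_matrix(SEQS)
for x in ALPHABET:
    print(x, [round(v, 2) for v in W[x]])
```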
Applications - Clustering
Compare matrices with each other using the dynamic programming approach:
[formula not preserved in the source]
where
A, B – matrices
i, j – columns
If D(m, n) > threshold => matrices are different
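Since the recurrence itself is missing, the following is only one plausible shape for such a comparison, not the author's formula: an edit-distance style dynamic program over columns, where aligning column i of A with column j of B costs their Euclidean distance and skipping a column costs a fixed gap penalty. The column distance, the gap penalty, and the helper `columns_from_pattern` (a letter gets 1, '.' gives 0.25 to each letter) are all assumptions made for illustration.

```python
import math

ALPHABET = "ACGT"

def columns_from_pattern(pattern):
    """Hypothetical helper: one column per character, '.' spread evenly."""
    return [
        {x: 0.25 if ch == "." else float(x == ch) for x in ALPHABET}
        for ch in pattern
    ]

def column_distance(a, b):
    """Assumed column cost: Euclidean distance between two columns."""
    return math.sqrt(sum((a[x] - b[x]) ** 2 for x in ALPHABET))

def matrix_distance(A, B, gap=1.0):
    """D(m, n) for matrices given as lists of columns (assumed recurrence)."""
    m, n = len(A), len(B)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i * gap
    for j in range(1, n + 1):
        D[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + column_distance(A[i - 1], B[j - 1]),
                D[i - 1][j] + gap,   # column of A left unmatched
                D[i][j - 1] + gap,   # column of B left unmatched
            )
    return D[m][n]

near = matrix_distance(columns_from_pattern("G.GATGAG.T"),
                       columns_from_pattern("G.GATGAG"))
far = matrix_distance(columns_from_pattern("G.GATGAG.T"),
                      columns_from_pattern("AAAATTTT"))
print(near, far)
```

With any such distance, a threshold between `near` and `far` would keep the first pair in one cluster and separate the second.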
Applications - Clustering
G.GATGAG.T    TG.AAA.TTT     AAAATTTT
G.GATGAG      TGAAAA.TTT     AAA.TTTT
GATGAG.T      TG.AAA.TTTT
We want to represent the clusters by logos:
[logo images not preserved]
We need to align the patterns first, that is, position the similar parts of the patterns above each other:
G.GATGAG.T
G.GATGAG--
--GATGAG.T
or the logo will look like this:
[logo image not preserved]
Applications – Multiple alignment
• Multiple alignment
Importance matrix I – represents the aligned patterns.
Example:
G.GATGAG.T
GATGAG.T
G.GATGAG
1. Insert the first pattern into I ('.' gives 0.25 to each letter):
A  0  0.25  0  1  0  0  1  0  0.25  0
C  0  0.25  0  0  0  0  0  0  0.25  0
G  1  0.25  1  0  0  1  0  1  0.25  0
T  0  0.25  0  0  1  0  0  0  0.25  1
2. Align the second pattern with I using a dynamic programming approach:
Applications – Multiple alignment
Dynamic programming matrix:
         G     .     G     A     T     G     A     G     .     T
G  0.00  0.10  0.01  0.10  0.00  0.00  0.10  0.00  0.10  0.01  0.00
A  0.00  0.00  0.11  0.00  0.20  0.00  0.00  0.20  0.00  0.11  0.00
T  0.00  0.00  0.01  0.00  0.00  0.30  0.00  0.00  0.00  0.01  0.21
G  0.00  0.10  0.01  0.11  0.00  0.00  0.40  0.00  0.10  0.01  0.00
A  0.00  0.00  0.11  0.00  0.21  0.00  0.00  0.50  0.00  0.11  0.00
G  0.00  0.10  0.01  0.21  0.00  0.00  0.10  0.00  0.60  0.01  0.00
.  0.00  0.00  0.10  0.01  0.21  0.00  0.00  0.10  0.00  0.60  0.01
T  0.00  0.00  0.01  0.00  0.00  0.31  0.00  0.00  0.00  0.01  0.70
G.GATGAG.T
--GATGAG.T
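The table above is consistent with a simple gap-free diagonal rule, which is inferred from the numbers and not stated on the slide: a letter on a matching letter column adds 0.10 to the diagonal predecessor, a letter on a '.' column adds 0.01, a '.' in the pattern carries the diagonal value unchanged, and a mismatch resets the score to 0. The sketch below reproduces the table under those assumptions and places the pattern so that it ends at the best-scoring cell of the last row; it scores against the first pattern's characters as a stand-in for the matrix I, and it does not handle patterns that would need extra columns.

```python
MATCH = 0.10     # assumption read off the table: letter on a matching column
WILDCARD = 0.01  # assumption read off the table: letter on a '.' column

def dp_table(reference, pattern):
    """One row per pattern character, with a leading all-zero column."""
    table = []
    previous = [0.0] * (len(reference) + 1)
    for x in pattern:
        row = [0.0] * (len(reference) + 1)
        for j, c in enumerate(reference, start=1):
            diag = previous[j - 1]
            if x == ".":
                row[j] = diag             # wild card in the pattern: carry over
            elif c == ".":
                row[j] = diag + WILDCARD
            elif x == c:
                row[j] = diag + MATCH
            # any other case is a mismatch and leaves the cell at 0
        table.append(row)
        previous = row
    return table

def align(reference, pattern):
    """Pad the pattern with '-' so it ends at the best cell of the last row."""
    last = dp_table(reference, pattern)[-1]
    end = max(range(len(last)), key=last.__getitem__)
    start = end - len(pattern)
    return "-" * start + pattern + "-" * (len(reference) - end)

for row in dp_table("G.GATGAG.T", "GATGAG.T"):
    print(" ".join(f"{v:.2f}" for v in row))
print(align("G.GATGAG.T", "GATGAG.T"))
```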
Applications – Multiple alignment
3. Add the pattern '--GATGAG.T' to I; if necessary, add columns to the matrix.
4. Repeat the procedure for every pattern.
Output:
G.GATGAG.T
G.GATGAG--
--GATGAG.T
Why importance matrix?
Applications – Multiple alignment
Example:
Pattern: GATG
So far aligned:
GATGATGTA-
---GATGTGG
We want: w(G, 4) > w(G, 1) > w(G, 9)
Solution – importance matrix
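The desired ordering can be checked numerically if the importance weight is read as count times frequency, with the frequency taken over the non-gap letters of a column. Both points are assumptions: the first matches the importance values shown earlier in the talk, the second is a guess about gap handling. Frequency alone cannot separate positions 4 and 1 (both are 1), and the count alone cannot separate positions 1 and 9 (both are 1), while the product separates all three.

```python
ALIGNED = ["GATGATGTA-", "---GATGTGG"]

def importance(aligned, letter, position):
    """Count times frequency of `letter` at a 1-based column position.

    Assumption: gap characters '-' are ignored when computing the frequency.
    """
    column = [s[position - 1] for s in aligned if s[position - 1] != "-"]
    count = column.count(letter)
    freq = count / len(column) if column else 0.0
    return count * freq

for pos in (4, 1, 9):
    print(pos, importance(ALIGNED, "G", pos))
```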
Applications – Weight matrix matching
• Weight matrix matching
Purpose: find the sequences that the weight matrix describes best in a given text file
...CATAGGAAATTCCACCTCTTTGGCTTTGCCCAGTCTTCCCTTGAGGATGCCTACGTTC...
1. Calculate the score for each position
2. If score > threshold => signal
Problem: finding a good threshold
• Threshold – 99.5% quantile
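The two steps above can be sketched as a sliding-window scan. Several details are assumptions, since the slide does not fix them: the weight matrix is built as ln(F / 0.25) with a floor of -6.9 for unseen letters, the 99.5% quantile is taken over the scores of all windows in the scanned text itself (nearest-rank method), and the text is a randomly generated stand-in rather than a real sequence file. All function names are illustrative.

```python
import math
import random

ALPHABET = "ACGT"

def weight_matrix(seqs, prior=0.25, floor=-6.9):
    """Assumed weights: ln(F / prior), with a fixed floor for unseen letters."""
    n = len(seqs)
    return {
        x: [math.log(col.count(x) / n / prior) if x in col else floor
            for col in zip(*seqs)]
        for x in ALPHABET
    }

def window_scores(text, w):
    """Step 1: sum of the matching weights for every start position."""
    width = len(w["A"])
    return [
        sum(w[text[pos + k]][k] for k in range(width))
        for pos in range(len(text) - width + 1)
    ]

def quantile(values, q):
    """Nearest-rank quantile of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

def find_signals(text, w, q=0.995):
    """Step 2: positions whose score exceeds the q-quantile of all scores."""
    scores = window_scores(text, w)
    threshold = quantile(scores, q)
    return threshold, [pos for pos, s in enumerate(scores) if s > threshold]

W = weight_matrix(["GATGAG", "GATGAT", "TGATAT"])
rng = random.Random(0)
TEXT = "".join(rng.choice(ALPHABET) for _ in range(20000))  # hypothetical input
threshold, hits = find_signals(TEXT, W)
print(round(threshold, 2), len(hits))
```

By construction at most 0.5% of the windows are reported. On a text as short as the excerpt on the slide, the 99.5% quantile sits at the maximum score, so a threshold estimated from a longer text is the more useful choice.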
Questions?