280 likes | 416 Views
Welcome to Research Simulation 1 PSSMs & Search for Repeated Sequences Monday, 9 June 2003. heterocysts. sucrose. Cyanobacteria. Free-living Nostoc. Anabaena/Nostoc grown on NO 3 - , air. N 2. CO 2. O 2. Matveyev and Elhai (unpublished). heterocysts. sucrose. NH 3. Cyanobacteria.
E N D
Welcome toResearch Simulation 1PSSMs & Search for Repeated SequencesMonday, 9 June 2003
heterocysts sucrose Cyanobacteria Free-living Nostoc Anabaena/Nostoc grown on NO3-, air N2 CO2 O2 Matveyev and Elhai (unpublished)
heterocysts sucrose NH3 Cyanobacteria Free-living Nostoc Anabaena/Nostoc grown on NO3-, air NH3 N2 O2 CO2 Matveyev and Elhai (unpublished)
Tandem Heptameric RepeatsDo they come in complementary pairs?
Tandem Heptameric RepeatsDo they come in complementary pairs? C2. Consider the sequence below of one strand of a DNA fragment. 5'-AGAGAGAGCTAAGGTCTCTCC-3' Which of the following is a likely structure for the single-stranded fragment to assume? A B C
Tandem Heptameric RepeatsDo they come in complementary pairs? 5'-AGAGAGAGCTAAGGTCTCTCC-3' A B C
Tandem Heptameric RepeatsDo they come in complementary pairs? 5'-AGAGAGAGCTAAGGTCTCTCC-3' A B C
Tandem Heptameric RepeatsDo they come in complementary pairs? 5'-AGAGAGAGCTAAGGTCTCTCC-3' A B C
Tandem Heptameric RepeatsDo they come in complementary pairs? 5'-AGAGAGAGCTAAGGTCTCTCC-3' A B C
Tandem Heptameric RepeatsDo they come in complementary pairs? 5'-AGAGAGAGCTAAGGTCTCTCC-3' A B C
Tandem Heptameric RepeatsDo they come in complementary pairs? 5'-AGAGAGAGCTAAGGTCTCTCC-3' A B C
Tandem Heptameric RepeatsDo they come in complementary pairs? 5'-AGAGAGAGCTAAGGTCTCTCC-3' A B C
Tandem Heptameric RepeatsDo they come in complementary pairs? 5'-AGAGAGAGCTAAGGTCTCTCC-3' A B C
Tandem Heptameric RepeatsDo they come in complementary pairs? IF:
Tandem Heptameric RepeatsDo they come in complementary pairs? TCATTGGTCATTGGTCATTGGTCATTTGTCCTTTGT AACAGTAACAGGAAACAGTAAACAATAAACAGGAAACAGTAAAC
Tandem Heptameric RepeatsDo they come in complementary pairs? TCATTGGTCATTGGTCATTGGTCATTTGTCCTTTGT AACAGTAACAGGAAACAGTAAACAATAAACAGGAAACAGTAAAC
hetQ 5’-GTA ..(8).. TACNNNNNNNNNNTANNNTNNNNNNNNNNNNNNNNNNNNNNNNNNNNATGNNNNNNNNNNNNNNNN3’-CAT ..(8).. ATGNNNNNNNNNNATNNNANNNNNNNNNNNNNNNNNNNNNNNNNNNNTACNNNNNNNNNNNNNNNN NtcA N RNA Polymerase Regulatory Protein and their Binding Sites GTA ..(8).. TAC
mRNA GTA…(8)…TAC …(20-24)…TAnnnT Differentiation in cyanobacteriaWhat does NtcA bind to? Herrero et al (2001) J Bacteriol 183:411-425
Table 1: Examples of position-specific scoring matrices from sequence alignment A. Sequence alignmenta A T T T A G T A T C A A A A A T A A C A A T T C G T T C T G T A A C A A A G A C T A C A A A A C A T T T T G T A G C T A C T T A T A C T A T T T A A G C T G T A A C A A A A T C T A C C A A A T C A T T T G T A C A G T C T G T T A C C T T T A Position-specific scoring matrices
A. Sequence alignmenta A T T T A G T A T C A A A A A T A A C A A T T C G T T C T G T A A C A A A G A C T A C A A A A C A T T T T G T A G C T A C T T A T A C T A T T T A A G C T G T A A C A A A A T C T A C C A A A T C A T T T G T A C A G T C T G T T A C C T T T A B. Table of occurrencesa A 3 2 0 0 1 0 0 5 2 1 3 4 3 2 2 1 1 5 0 2 4 2 2 1 C 1 0 0 2 0 0 0 0 1 4 0 0 2 0 0 2 0 0 5 2 0 0 0 2 G 1 0 1 0 0 5 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 T 0 3 4 3 4 0 5 0 1 0 1 1 0 2 2 2 4 0 0 1 1 3 3 2 Position-specific scoring matrices
B. Table of occurrencesa A 0 1 0 0 5 2 1 3 4 3 C 2 0 0 0 0 1 4 0 0 2 G 0 0 5 0 0 1 0 1 0 0 T 3 4 0 5 0 1 0 1 1 0 C. Position-specific scoring matrix (B = 0)b A 0 .20 0 0 1.0 .40 .20 .60 .80 .60 C .40 0 0 0 0 .20 .80 0 0 .40 G 0 0 1.0 0 0 .20 0 .20 0 0 T .60 .80 0 1.0 0 .20 0 .20 .20 0 Position-specific scoring matrices
Table 2: Scoring a sequence with a PSSM urt-71 T A G T A T C A A A Scorea .60 .20 1.0 1.0 1.0 .20 .80 .60 .80 .60 w/ps’countsb .51 .24 .75 .79 .79 .24 .61 .51 .65 .51 Normal’db 1.6 .75 4.2 2.5 2.5 .75 3.4 1.6 2.0 1.6 Position-specific scoring matrices Score = .60 * .20 * 1.0 * …
A. Sequence alignmenta A T T T A G T A T C A A A A A T A A C A A T T C G T T C T G T A A C A A A G A C T A C A A A A C A T T T T G T A G C T A C T T A T A C T A T T T A A G C T G T A A C A A A A T C T A C C A A A T C A T T T G T A C A G T C T G T T A C C T T T A Position-specific scoring matricesIntroduction of pseudocounts A? qG,6 = 5 real counts pG = ? pseudocounts
Position-specific scoring matricesIntroduction of pseudocounts Score(position,nucleotide) = (q + p) / (N + B) p = pseudocounts = B * (overall frequency of nucleotide) [A] = 0.36[T] = 0.36[C] = 0.18[G] = 0.18 B = Total number of pseudocounts = Square root (N) ? or = 0.1 ?
C. Position-specific scoring matrix (B = 0)b A 0 .20 0 0 1.0 .40 .20 C .40 0 0 0 0 .20 .80 G 0 0 1.0 0 0 .20 0 T .60 .80 0 1.0 0 .20 0 D. Position-specific scoring matrix (B = N = 2.2)c A .099 .24 .099 .099 .79 .38 .24 C .33 .056 .056 .056 .056 .19 .61 G .056 .056 .75 .056 .056 .19 .056 T .51 .65 .099 .79 .099 .24 .099 Position-specific scoring matricesIntroduction of pseudocounts
A. Sequence alignmenta A T T T A G T A T C A A A A A T A A C A A T T C G T T C T G T A A C A A A G A C T A C A A A A C A T T T T G T A G C T A C T T A T A C T A T T T A A G C T G T A A C A A A A T C T A C C A A A T C A T T T G T A C A G T C T G T T A C C T T T A B. Table of occurrencesa A 3 2 0 0 1 0 0 5 2 1 3 4 3 2 2 1 1 5 0 2 4 2 2 1 C 1 0 0 2 0 0 0 0 1 4 0 0 2 0 0 2 0 0 5 2 0 0 0 2 G 1 0 1 0 0 5 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 T 0 3 4 3 4 0 5 0 1 0 1 1 0 2 2 2 4 0 0 1 1 3 3 2 Position-specific scoring matricesNormalization How to account for similarity due to similar base composition? Compare ScorePSSM / Scorebackground frequency 0.79 / 0.32 = 2.2
E. Position-specific scoring matrix (B = 0.1)c A .006 .20 .006 .006 .99 .40 .20 .59 C .40 .004 .004 .004 .004 .20 .79 .004 G .004 .004 .98 .004 .004 .20 .004 .20 T .59 .79 .006 .99 .006 .20 .006 .20 F. Position-specific scoring matrix: Log-odds form (B = 0.1)c,d A 2.2 0.7 2.2 2.2 0.0 0.4 0.7 0.2 C 0.4 2.5 2.5 2.5 2.5 0.7 0.1 2.5 G 2.5 2.5 0.0 2.5 2.5 0.7 2.5 0.7 T 0.2 0.1 2.2 0.0 2.2 0.7 2.2 0.7 Position-specific scoring matricesLog odds form Log odds = -log(score) Score * score * score … log + log + log …
Position-specific scoring matricesDecrease complexity through info analysis Uncertainty (Hc) = - Sum [piclog2(pic)] H1= -{[4/11 log2(4/11)] + [3/11 log2(3/11)] + [1/11 log2(1/11)] + [3/11 log2(3/11)]} = 1.87 H31= -{[1/11 log2(1/11)] + [1/11 log2(1/11)] + [1/11 log2(1/11)] + [8/11 log2(8/11)]} = 1.28 Information content = Sum (Hmax– Hc) (summed over all columns)