Exploring PSSMs in Bioinformatics: Regulatory Site Prediction

Welcome toIntroduction to Bioinformatics • Characteristics of PSSMs • How to make a PSSM • Uncertainty and information • How to score a sequence

Scenario 1 Prediction of regulatory site

heterocysts sucrose N2 fixation in cyanobacteria N2 CO2 O2 Matveyev and Elhai (unpublished)

mRNA GTA…(8)…TAC …(20-24)…TAnnnT Differentiation in cyanobacteriaWhat does NtcA bind to? Herrero et al (2001) J Bacteriol 183:411-425

Differentiation in cyanobacteria Sequence upstream from hetQ ttctatgagaatataaaattttccttaagtttct aaaaccgaccattctgatgaataagtccggtttt tgctttttcgctttatttatctatatttccaagt ggggtgacaactatcttgccaatattgtcgttat gaaaaaatctGTAacatgagaTACacaatagcatttatatttgcttTAgtaTctctctcttgggtggg …(20-24)…TAnnnT GTA…(8)…TACNtcA binding site Promoter

Differentiation in cyanobacteriaIntegration of signals through HetR HetQ -N NtcA ??? Genes needed for differentiation Position in cell cycle HetR Level of PatS Level of HetN Master regulator Stockholm

Scenario 1: The aftermath • Did you go for it? YES • Did it bind NtcA? YES • Did killing the site prevent heterocysts? NO Stockholm

Scenario 1: The aftermath • Did you go for it? YES • Did it bind NtcA? YES • Did killing the site prevent heterocysts? NO • Fame and fortune? NO • Reasonable paper? YES

Scenario 1: The aftermath If hetQ isn’t the golden link, then what is? -N NtcA ??? Genes needed for differentiation HetR • Gene preceded by NtcA-binding site • Blocking NtcA-binding affects gene expression • Gene product required for hetR expression

Thousands of candidate hits Regexps may also “overfit” the model – be too strict and miss real binding sites Scenario 1: The aftermath If hetQ isn’t the golden link, then what is? -N NtcA ??? Genes needed for differentiation HetR • Gene preceded by NtcA-binding site How to find? • Search for GTA…(N8)…TAC…(N20-24)…TA…T?

Table 1: Examples of position-specific scoring matrices from sequence alignment A. Sequence alignmenta A T T T A G T A T C A A A A A T A A C A A T T C G T T C T G T A A C A A A G A C T A C A A A A C A T T T T G T A G C T A C T T A T A C T A T T T A A G C T G T A A C A A A A T C T A C C A A A T C A T T T G T A C A G T C T G T T A C C T T T A Position-specific scoring matrices: A better way

A. Sequence alignmenta A T T T A G T A T C A A A A A T A A C A A T T C G T T C T G T A A C A A A G A C T A C A A A A C A T T T T G T A G C T A C T T A T A C T A T T T A A G C T G T A A C A A A A T C T A C C A A A T C A T T T G T A C A G T C T G T T A C C T T T A B. Table of occurrencesa A 3 2 0 0 1 0 0 5 2 1 3 4 3 2 2 1 1 5 0 2 4 2 2 1 C 1 0 0 2 0 0 0 0 1 4 0 0 2 0 0 2 0 0 5 2 0 0 0 2 G 1 0 1 0 0 5 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 T 0 3 4 3 4 0 5 0 1 0 1 1 0 2 2 2 4 0 0 1 1 3 3 2 Position-specific scoring matrices: A better way

NtcA ??? HetR Good match to NtcA-site Good match to NtcA-site Good match to NtcA-site Good match to NtcA-site NtcA-based PSSM Position-specific scoring matrices: A better way TCTACTTATA TTCAATCCAC AGGGCTACAC CTAGTTCTTG AAGAGTCTGT TGAATGAACA CATACATGGT TTATCTGTTT TTCTGTCTGC TCTGACCTCT GGCAGCTTTC CACTAGTTTC TGGATTTCGG AACTCTAGCC TGCCCCACTC TTAGATAAAC GAACCTTAGT GACTTCTGCT ATACCAAAGT CTCCACGCCC CTCCGTAAAC CTCTAACATG ATGTCAGCAA ATATTAAAAA TGAATAAACT TTGTTAAAGG TACAAATGAA AATTAGCAAA AAGAGTTTAA AGTTAAAAAC GAATTGCAGT CATTCTAGGG AAACCTGTAT GGTTACATGA ACTGCCTAAA AAACAAGCTA TTATATATTT TAAGAAATTA ATTGCAATTA ATTTCCTGGG CCCCAGCTGT CATTAAAAAG AGGCAAATAC AGCCAAGGAC GACAGCACTG ACCCTCAAGA AGGCACCGGC TGACAGACAG GCTGAAATTC CGCTGAGAGC AGAGTGGTAC ATTGAACCCT CCCTGCACCA GGTCTTTCCT GTGGGCACTG AGTGCAGACA ATGAATGACT GAACGAACGA TTGAATGAAA AGAAATGAGA TATGAGGCAA TCACAGCATC AGGTGACCTT AGTATCTATT CTCGGGAGCG CACGGCTCTA AAGAGGCCCA TATCCAGGCA CCTTTAGATG CAAGAAGGAG GAAACAGCTC GAAATCCCTG AGGCCGGAGG GTCAAGAACT CTCCACCGGC GGCAGCGGCC CCCCGGCCTA AGGCTGCCTG TGCTATAAAT ACGCGGCCCA TTCCCTGGGC TCGGCGGGAC AGATAACATG AATGTGCCCT Anabaena genome

Table 1: Examples of position-specific scoring matrices from sequence alignment A. Sequence alignmenta A T T T A G T A T C A A A A A T A A C A A T T C G T T C T G T A A C A A A G A C T A C A A A A C A T T T T G T A G C T A C T T A T A C T A T T T A A G C T G T A A C A A A A T C T A C C A A A T C A T T T G T A C A G T C T G T T A C C T T T A Position-specific scoring matrices: A better way

A. Sequence alignmenta A T T T A G T A T C A A A A A T A A C A A T T C G T T C T G T A A C A A A G A C T A C A A A A C A T T T T G T A G C T A C T T A T A C T A T T T A A G C T G T A A C A A A A T C T A C C A A A T C A T T T G T A C A G T C T G T T A C C T T T A B. Table of occurrencesa A 3 2 0 0 1 0 0 5 2 1 3 4 3 2 2 1 1 5 0 2 4 2 2 1 C 1 0 0 2 0 0 0 0 1 4 0 0 2 0 0 2 0 0 5 2 0 0 0 2 G 1 0 1 0 0 5 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 T 0 3 4 3 4 0 5 0 1 0 1 1 0 2 2 2 4 0 0 1 1 3 3 2 Position-specific scoring matrices: A better way

B. Table of occurrencesa A 0 1 0 0 5 2 1 3 4 3 C 2 0 0 0 0 1 4 0 0 2 G 0 0 5 0 0 1 0 1 0 0 T 3 4 0 5 0 1 0 1 1 0 C. Position-specific scoring matrix (B = 0)b A 0 .20 0 0 1.0 .40 .20 .60 .80 .60 C .40 0 0 0 0 .20 .80 0 0 .40 G 0 0 1.0 0 0 .20 0 .20 0 0 T .60 .80 0 1.0 0 .20 0 .20 .20 0 Position-specific scoring matrices: A better way

Table 2: Scoring a sequence with a PSSM urt-71 T A G T A T C A A A Score .6 .2 1 1 1 .2 .8 .6 .8 .6 w/ps’countsb .51 .24 .75 .79 .79 .24 .61 .51 .65 .51 Normal’db 1.6 .75 4.2 2.5 2.5 .75 3.4 1.6 2.0 1.6 Position-specific scoring matrices: A better way Score = .60 * .20 * 1.0 * …

A. Sequence alignmenta A T T T A G T A T C A A A A A T A A C A A T T C G T T C T G T A A C A A A G A C T A C A A A A C A T T T T G T A G C T A C T T A T A C T A T T T A A G C T G T A A C A A A A T C T A C C A A A T C A T T T G T A C A G T C T G T T A C C T T T A Position-specific scoring matricesIntroduction of pseudocounts A? qG,6 = 5 real counts pG = ? pseudocounts

Position-specific scoring matricesIntroduction of pseudocounts Score(position,nucleotide) = (q + p) / (N + B) p = pseudocounts = B * (overall frequency of nucleotide) [A] = 0.32[T] = 0.32[C] = 0.18[G] = 0.18 B = Total number of pseudocounts = Square root (N) ? or = 0.1 ?

C. Position-specific scoring matrix (B = 0)b A 0 .20 0 0 1.0 .40 .20 C .40 0 0 0 0 .20 .80 G 0 0 1.0 0 0 .20 0 T .60 .80 0 1.0 0 .20 0 D. Position-specific scoring matrix (B = N = 2.2)c A .099 .24 .099 .099 .79 .38 .24 C .33 .056 .056 .056 .056 .19 .61 G .056 .056 .75 .056 .056 .19 .056 T .51 .65 .099 .79 .099 .24 .099 Position-specific scoring matricesIntroduction of pseudocounts

A. Sequence alignmenta A T T T A G T A T C A A A A A T A A C A A T T C G T T C T G T A A C A A A G A C T A C A A A A C A T T T T G T A G C T A C T T A T A C T A T T T A A G C T G T A A C A A A A T C T A C C A A A T C A T T T G T A C A G T C T G T T A C C T T T A B. Table of occurrencesa A 3 2 0 0 1 0 0 5 2 1 3 4 3 2 2 1 1 5 0 2 4 2 2 1 C 1 0 0 2 0 0 0 0 1 4 0 0 2 0 0 2 0 0 5 2 0 0 0 2 G 1 0 1 0 0 5 0 0 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 T 0 3 4 3 4 0 5 0 1 0 1 1 0 2 2 2 4 0 0 1 1 3 3 2 Position-specific scoring matricesNormalization How to account for similarity due to similar base composition? Compare ScorePSSM / Scorebackground frequency 0.79 / 0.32 = 2.2

E. Position-specific scoring matrix (B = 0.1)c A .006 .20 .006 .006 .99 .40 .20 .59 C .40 .004 .004 .004 .004 .20 .79 .004 G .004 .004 .98 .004 .004 .20 .004 .20 T .59 .79 .006 .99 .006 .20 .006 .20 F. Position-specific scoring matrix: Log-odds form (B = 0.1)c,d A 2.2 0.7 2.2 2.2 0.0 0.4 0.7 0.2 C 0.4 2.5 2.5 2.5 2.5 0.7 0.1 2.5 G 2.5 2.5 0.0 2.5 2.5 0.7 2.5 0.7 T 0.2 0.1 2.2 0.0 2.2 0.7 2.2 0.7 Position-specific scoring matricesLog odds form Log odds = -log(score) Score * score * score … log + log + log …

Position-specific scoring matricesExpand training set through orthologs Table 3: Training set including sequences from two Nostocsa 71-devB CATTACTCCTTCAATCCCTCGCCCCTCATTTGTACAGTCTGTTACCTTTACCTGAAACAGATGAATGTAGAATTTA Np-devB CCTTGACATTCATTCCCCCATCTCCCCATCTGTAGGCTCTGTTACGTTTTCGCGTCACAGATAAATGTAGAATTCA 71-glnA AGGTTAATATTACCTGTAATCCAGACGTTCTGTAACAAAGACTACAAAACTGTCTAATGTTTAGAATCTACGATAT Np-glnA AGGTTAATATAACCTGATAATCCAGATATCTGTAACATAAGCTACAAAATCCGCTAATGTCTACTATTTAAGATAT 71-hetC GTTATTGTTAGGTTGCTATCGGAAAAAATCTGTAACATGAGATACACAATAGCATTTATATTTGCTTTAGTATCTC 71-nirA TATTAAACTTACGCATTAATACGAGAATTTTGTAGCTACTTATACTATTTTACCTGAGATCCCGACATAACCTTAG Np-nirA CATCCATTTTCAGCAATTTTACTAAAAAATCGTAACAATTTATACGATTTTAACAGAAATCTCGTCTTAAGTTATG 71-ntcB ATTAATGAAATTTGTGTTAATTGCCAAAGCTGTAACAAAATCTACCAAATTGGGGAGCAAAATCAGCTAACTTAAT Np-ntcB TTATACAAATGTAAATCACAGGAAAATTACTGTAACTAACTATACTAAATTGCGGAGAATAAACCGTTAACTTAGT 71-urt ATTAATTTTTATTTAAAGGAATTAGAATTTAGTATCAAAAATAACAATTCAATGGTTAAATATCAAACTAATATCA Np-urt TTATTCTTCTGTAACAAAAATCAGGCGTTTGGTATCCAAGATAACTTTTTACTAGTAAACTATCGCACTATCATCA

Position-specific scoring matricesDecrease complexity through info analysis Uncertainty (Hc) = - Sum [piclog2(pic)]

Position-specific scoring matricesDecrease complexity through info analysis Uncertainty (Hc) = - Sum [piclog2(pic)] H1= -{[4/11 log2(4/11)] + [3/11 log2(3/11)] + [1/11 log2(1/11)] + [3/11 log2(3/11)]} = 1.87 H31= -{[1/11 log2(1/11)] + [1/11 log2(1/11)] + [1/11 log2(1/11)] + [8/11 log2(8/11)]} = 1.28 Information content = Sum (Hmax– Hc) (summed over all columns)

Exploring PSSMs in Bioinformatics: Regulatory Site Prediction

Exploring PSSMs in Bioinformatics: Regulatory Site Prediction

Presentation Transcript

Welcome to Introduction to Bioinformatics Monday, 11 October

Introduction to Bioinformatics

Introduction to Bioinformatics

Introduction to BioInformatics

Introduction to Bioinformatics

Introduction to Bioinformatics

Introduction to Bioinformatics

Welcome to Introduction to Bioinformatics Friday, 12 September

Introduction to Bioinformatics

Introduction to Bioinformatics

Welcome to Introduction to Bioinformatics Computing

Introduction to Bioinformatics

Introduction to Bioinformatics

Introduction to Bioinformatics

Introduction to Bioinformatics

Introduction to Bioinformatics

Introduction to Bioinformatics

Welcome to Bioinformatics!

Welcome to Introduction to Bioinformatics

Introduction to Bioinformatics