Bioinformatics Approaches to Identifying Candidate Effector Molecules of S. typhimurium
Matthew Sylvester
12/1/03
[Figure: Endocytic trafficking and the Salmonella-containing vacuole. Labels: Lamp-1, H+ ATPase, cathepsins, transferrin R, Man. 6-PR, lysosome, SPI-1, SPI-2, "?", bacterial effector proteins (SseJ, SifA, SseXs, and several others).]
Selection of S. typhimurium Proteins
• Salmonella effectors are secreted into the host cell via the type three secretion system (TTSS) of either Salmonella pathogenicity island 1 (SPI1) or SPI2
• We chose only those proteins shown experimentally in the literature to be secreted through one or both of these systems (see PubMed at http://ncbi.nlm.nih.gov)
• The seventeen identified SPI1- and SPI2-associated effectors were treated as one group for subsequent analysis
• Because the N-terminal 150 amino acids have been shown to contain conserved sequences in several SPI2 effectors (Miao and Miller, 2000), we compared this region
Alignment of SPI-2 Effector Proteins
Published alignment of known and putative SPI2 effectors, identified by a BLAST (Basic Local Alignment Search Tool) search and then aligned using ClustalW. Note the presence of the WEK(I/M)XXFF motif at approximately aa 31-38.
Miao E and Miller S. A conserved amino acid sequence directing intracellular type III secretion by Salmonella typhimurium. PNAS. 2000, 97(13), pp. 7539-7544.
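As an illustration of what the WEK(I/M)XXFF motif means as a search pattern, here is a minimal Python sketch that scans the N-terminal 150 residues of a sequence with a regular expression. The function name `find_motif`, the `window` parameter, and the toy sequence are assumptions made for this example; this is not the search method used in the talk.

```python
import re

# WEK(I/M)XXFF: W, E, K, then I or M, then any two residues, then F, F.
MOTIF = re.compile(r"WEK[IM]..FF")

def find_motif(seq, window=150):
    """Return 1-based start positions of the motif within the first `window` residues."""
    return [m.start() + 1 for m in MOTIF.finditer(seq[:window])]

# Hypothetical toy sequence (not a real effector): the motif starts at residue 31.
toy = "M" + "A" * 29 + "WEKIQVFF" + "G" * 20
print(find_motif(toy))
```

An exact pattern like this tolerates no substitutions or gaps, which is one reason profile-based methods are also useful.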
BLAST
• Tries to find the most “similar” proteins
• Compares a query to sequences in a database; each comparison is given a score (higher scores indicate greater similarity)
• Substitution-based scoring matrices assign a score to each residue substitution based on its probability
• Gap penalties are negative scores
• The alignment score is the sum of the scores at each position
• The significance of the overall alignment is given as a p-value or an e-value
• e-value = expectation value: the number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the e-value, the more significant the score.
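The statement that an alignment score is the sum of per-position scores can be sketched in a few lines of Python. The substitution scores, the mismatch score, and the gap penalty below are made-up values chosen for illustration, not entries from a real matrix, and the function names are hypothetical.

```python
# Hypothetical substitution scores (not a real BLOSUM or PAM matrix).
SCORES = {("A", "A"): 4, ("A", "S"): 1, ("S", "S"): 4,
          ("D", "D"): 6, ("D", "E"): 2, ("E", "E"): 5}
MISMATCH = -1  # assumed score for any pair not listed above
GAP = -4       # assumed linear gap penalty (a negative score)

def pair_score(a, b):
    """Look up a symmetric substitution score for two residues."""
    if (a, b) in SCORES:
        return SCORES[(a, b)]
    if (b, a) in SCORES:
        return SCORES[(b, a)]
    return MISMATCH

def alignment_score(x, y):
    """Sum the per-position scores of two aligned strings ('-' marks a gap)."""
    if len(x) != len(y):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(x, y):
        total += GAP if "-" in (a, b) else pair_score(a, b)
    return total

# A/S + D/D + E/E + gap + S/S = 1 + 6 + 5 - 4 + 4
print(alignment_score("ADE-S", "SDEAS"))
```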
Building Substitution Matrices: Part I
Blocks: local ungapped alignments in which rows are protein segments and columns are amino acid positions.
 1  A D E P Q D A
 2  A C E P D D A
 …
10  S D E P Q D A
New sequence: A D E P Q R A
• Count the number of matches and mismatches between the new sequence and every other sequence in the block
• In position 1, we have 9 AA matches and 1 AS mismatch
Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. PNAS (1992), pp. 10915-10919.
Building Substitution Matrices: Part II
• Next, sum the results for each column, store them in a table, and add the new sequence to the group
• By successively adding new sequences, we get a table with all possible pairs
• If we have 9 A’s and 1 S in the first column, we get 1 + 2 + … + 8 = 36 AA pairs, 9 AS or SA pairs, and 0 SS pairs
• If w = width of the block in amino acids and s = number of sequences, we have w*s*(s-1)/2 total possible pairs. Here, we have 36 + 9 = 45, or 1*10*9/2 = 45
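The pair counts for the first column can be checked with a short Python sketch. The function name `count_pairs` is an assumption for this example; it counts every unordered pair of residues in one column, treating AS and SA as the same pair.

```python
from collections import Counter
from itertools import combinations

def count_pairs(column):
    """Count unordered residue pairs among all sequences in one block column."""
    pairs = Counter()
    for a, b in combinations(column, 2):
        pairs["".join(sorted((a, b)))] += 1  # "SA" and "AS" share the key "AS"
    return pairs

# First column of the example block: nine A's and one S.
column = "A" * 9 + "S"
pairs = count_pairs(column)
print(dict(pairs), sum(pairs.values()))
```

For a full block, the same count would be repeated for each of the w columns and the results summed into one table.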
Calculating the Lod (log-odds) Matrix
• Let f_ij be the total number of i,j amino acid pairs in the frequency table (1 <= j <= i <= 20)
• Then the observed proportion for each amino acid pairing is: $q_{ij} = f_{ij} \big/ \sum_{i=1}^{20} \sum_{j=1}^{i} f_{ij}$
• We have f_AA = 36 and f_AS = 9, so q_AA = 36/45 and q_AS = 9/45
Calculating the Lod Matrix II
• Now we need the expected probability of occurrence for each amino acid pair
• If we assume that the observed frequencies of each amino acid are the population frequencies, we have: $p_i = q_{ii} + \sum_{j \neq i} q_{ij}/2$
• For our example, p_A = 36/45 + (9/45)/2 = 0.9 and p_S = (9/45)/2 = 0.1
• Then the expected probability of occurrence e_ij is p_i p_j for i = j and p_i p_j + p_j p_i = 2 p_i p_j for i != j
• We have expected probabilities e_AA = 0.9*0.9 = 0.81, e_AS = 2*0.9*0.1 = 0.18, and e_SS = 0.1*0.1 = 0.01
Calculating the Lod Matrix III
• Then we calculate the log-odds score in bits as s_ij = log2(q_ij/e_ij): if we see more pairs than expected, s_ij > 0; if we see as many as expected, s_ij = 0; and if we see fewer than expected, s_ij < 0
• Multiplying s by 2 and rounding to the nearest integer, we obtain our values for the block substitution matrix (BLOSUM)
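The whole chain from pair counts to rounded scores can be followed for the one-column example in a short Python sketch. Variable names are chosen for this example, and returning `None` for a pair that was never observed (where the logarithm is undefined) is a choice made here, not something stated in the slides.

```python
import math

# Pair counts from the example column (nine A's, one S).
f = {"AA": 36, "AS": 9, "SS": 0}
total = sum(f.values())

# Observed pair proportions q_ij.
q = {pair: n / total for pair, n in f.items()}

# Residue frequencies p_i = q_ii + sum over j != i of q_ij / 2.
p = {"A": q["AA"] + q["AS"] / 2, "S": q["SS"] + q["AS"] / 2}

# Expected pair probabilities e_ij.
e = {"AA": p["A"] ** 2, "AS": 2 * p["A"] * p["S"], "SS": p["S"] ** 2}

def lod(pair):
    """Log-odds score in bits; None when the pair was never observed."""
    if q[pair] == 0:
        return None
    return math.log2(q[pair] / e[pair])

s = {pair: lod(pair) for pair in f}
# Multiply by 2 and round to the nearest integer, as for BLOSUM entries.
rounded = {pair: None if v is None else round(2 * v) for pair, v in s.items()}
print(p, e, s, rounded)
```

In this toy column, AA is seen slightly less often than expected (0.80 versus 0.81) and AS slightly more often (0.20 versus 0.18), so s_AA is just below zero and s_AS just above; both round to 0. A real matrix pools counts over many blocks, so the scores are far more informative.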
Clustering
• To prevent “double-counting” amino acid contributions from closely related proteins, sequences are clustered and each cluster is counted as a single sequence when counting amino acids
• Thus, if two sequences are identical at >X% of their aligned positions, their contributions are averaged
• In our example, if we were to cluster 8 of our sequences with A in the first position, we would have 2 A’s and 1 S
• These matrices are denoted BLOSUM X, such as BLOSUM 62
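One way to realize the clustering idea in code is to give each sequence a weight of 1/(cluster size) and to count pairs only between different clusters. The sketch below is written under that assumption; the function name `weighted_pairs` and the cluster labels are hypothetical. It reproduces the "2 A’s and 1 S" example.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import combinations

def weighted_pairs(column, cluster_ids):
    """Count residue pairs between clusters, weighting each sequence by 1/cluster size."""
    sizes = defaultdict(int)
    for c in cluster_ids:
        sizes[c] += 1
    pairs = defaultdict(Fraction)  # Fraction() starts at 0
    for (a, ca), (b, cb) in combinations(zip(column, cluster_ids), 2):
        if ca == cb:
            continue  # sequences in the same cluster are not paired with each other
        key = "".join(sorted((a, b)))
        pairs[key] += Fraction(1, sizes[ca]) * Fraction(1, sizes[cb])
    return dict(pairs)

# Nine A's and one S; eight of the A sequences fall into one cluster (label 0).
column = "A" * 9 + "S"
cluster_ids = [0] * 8 + [1, 2]
result = weighted_pairs(column, cluster_ids)
print(result)
```

The column now behaves like three sequences (A, A, S): one AA pair and two AS pairs, for 3*2/2 = 3 pairs in total.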
Substitution Matrix (log-odds)
Based on observed frequencies of substitutions in related proteins: identical amino acids are given high positive scores, frequently observed substitutions get lower positive scores, and seldom observed substitutions get negative scores.
Related Calculations
• Relative entropy measures the average information, in bits, that distinguishes an alignment from chance
• Expected score in bit units
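The formula for relative entropy did not survive in this transcript. Using the q_ij and s_ij defined on the preceding slides, the standard form is given below as a reconstruction, followed by its value for the one-column example. A pair with q_ij = 0 is taken to contribute nothing, which is an assumption made here.

```latex
H = \sum_{i=1}^{20} \sum_{j=1}^{i} q_{ij}\, s_{ij}

% One-column example (q_AA = 0.8, e_AA = 0.81, q_AS = 0.2, e_AS = 0.18, q_SS = 0):
H = 0.8 \log_2\!\frac{0.8}{0.81} + 0.2 \log_2\!\frac{0.2}{0.18}
  \approx 0.8(-0.0179) + 0.2(0.1520)
  \approx 0.016 \text{ bits}
```

A value this close to zero says the toy column barely distinguishes a true alignment from chance; matrices built from many blocks have much higher relative entropy.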
Primary Sequence Search Methodology
Hmmer search of aligned sequences:
• Hmmer uses hidden Markov models to make a profile probability matrix of amino acids from aligned sequences
• The matrix is searched against the appropriate genome database
TRVI search allowing for gaps and substitutions:
• A motif is developed by allowing for a flexible number of gaps wherever there are gaps in the alignment
• Substitutions of amino acids with similar properties are allowed
• The motif is searched against the appropriate genome database
MEME/MAST search of unaligned sequences:
• Identifies a specified number of domains (probability matrices) across a subset of the input sequences
• The domains are searched against the appropriate genome database
How Hmmer Works: Profile Hidden Markov Models for Protein Sequence Analysis
• http://hmmer.wustl.edu/
Hmmer Architecture • Squares are match states (consensus positions), diamonds are insertions, circles are deletions and beginning/end. Arrows indicate state transitions.
Hidden Markov Model Background From PMMB—Sandrine Dudoit See also http://www.ai.mit.edu/~murphyk/Bayes/rabiner.pdf
Hmmer Intro
• Each M/D/I group is a node; the nodes are determined by the data and the multiple sequence alignment
• Each M state aligns with a single amino acid and carries a vector of 20 probabilities, determined by the proportion of times each amino acid appears at that position in the multiple sequence alignment
• Capable of handling gapped alignments
• At each node either the M state (amino acid aligned) or the D state is used; I states occur between nodes and can self-transition
• Arrows are transition probabilities, estimated from the residues in each column of the multiple sequence alignment
• S, N, C, T, J are “special states” that are algorithm-dependent and controlled externally
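The 20-value probability vector carried by each match state can be sketched as plain column proportions. This is a simplified illustration, not Hmmer's actual estimation procedure (real profile builders typically also apply sequence weighting and pseudocounts); the function name `match_emissions` and the toy alignment are assumptions.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def match_emissions(alignment):
    """For each alignment column, return the proportion of each amino acid (gaps ignored)."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(r for r in column if r in AMINO_ACIDS)
        n = sum(counts.values())
        profile.append({aa: (counts[aa] / n if n else 0.0) for aa in AMINO_ACIDS})
    return profile

# Hypothetical four-sequence alignment; '-' marks a gap.
alignment = ["WEKI", "WEKM", "WDKI", "W-KM"]
profile = match_emissions(alignment)
print(profile[1]["E"], profile[1]["D"], profile[3]["I"])
```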
Intermediate Hmmer
• We want to calculate P(S|M), where the sum over the space of all sequences should be 1
• …The rules of the HMM allow us to do this
• The model implies that insertion lengths follow a geometric distribution
• From a multiple sequence alignment “seed”, Hmmer makes a consensus sequence and searches databases against it
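The geometric distribution of insertion lengths follows from the self-transition on the I state: staying k-1 times and then leaving has probability t^(k-1) * (1 - t). A minimal sketch, assuming a made-up self-transition probability `t_ii` (not a value from any real model):

```python
def insert_length_prob(k, t_ii):
    """Probability of exactly k inserted residues, given the insert state was entered."""
    if k < 1:
        raise ValueError("k must be at least 1 once the insert state is entered")
    return (t_ii ** (k - 1)) * (1 - t_ii)

t_ii = 0.4  # hypothetical I -> I self-transition probability
probs = [insert_length_prob(k, t_ii) for k in range(1, 200)]

# The probabilities sum to (almost exactly) 1, and each is t_ii times the previous one.
print(sum(probs), probs[1] / probs[0])
```

Because each extra residue costs the same factor t_ii, long insertions become exponentially unlikely under this model.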
Analysis of TRVI-Putative Cytoplasmic Proteins
• Literature search
  - YciE not found
  - YciF classified as a putative structural protein by Blattner et al.
• BLAST searches
  - STM0274 is nearly identical to SciI (S. typhimurium); other homologies to ImpC and ImpD (Rhizobium leguminosarum) and to conserved hypotheticals. There is no literature on SciI, ImpC, or ImpD
  - YciF has homologies to other putative structural proteins in Shigella and E. coli, and is also homologous to several conserved hypotheticals
  - YciE has homologies to YciE from E. coli and to other putative cytoplasmic/structural proteins in other species (YciE and YciF do not hit each other)
  - STM3767 is homologous to a 4-hydroxy-2-oxoglutarate aldolase and several hypothetical proteins
  - STM4192 is homologous to a nucleoprotein/polynucleotide-associated enzyme, hypothetical protein YaiL from E. coli, and other hypotheticals (YaiL is not in the literature)
Analysis of TRVI-Microarray Proteins
• SseJ and YciE appear in both
• fruF is part of the phosphoenolpyruvate:fructose phosphotransferase system
• STM1181 is a putative flagellar basal body component
MEME MAST Analysis
MEME search results using MAST and searched by domain:
Domain 1: SseI, SlrP, SopA (putative effector proteins), YebE
Domain 2: SseI, SlrP, YeeY, YeaH (putative cytoplasmic protein)
Domain 3: SseI, HepA/RapA, putative inner membrane protein (STM1698)
Domain 4: YfeC, putative periplasmic proteins (STM3783 and STM3605)
Domain 5: RffG, OmpR (regulatory protein), PrpA, SirC (invasion regulator)
Domain 6: SseI, SlrP, YadF, YaiB, PrpC (protein phosphatase), InvB (part of needle complex)
Domain 7: CitC (citrate carrier), YcfN, YjeQ, STM0611, STM2406
Domain 8: DdlA (d-alanine ligase), GlyS, PgtA (phosphoglycerate transporter), STM4502
• Domains 1, 3, and 5 appear to be important for SPI2 secretion
• The other domains are important for small, related subsets of proteins
S. typhimurium Search Results Summary
Hmmer search of aligned sequences: Only the input sequences (+ 2 theoretically secreted proteins) were returned. SPI1 and SPI2 effectors both have significant e-values from a combined matrix.
TRVI search allowing for gaps and substitutions: 56 hits returned. Potentially interesting hits include SseI, 5 LysR family proteins, 5 putative cytoplasmic proteins, 1 putative periplasmic protein, 2 inner membrane proteins, and 3 flagellar proteins. 4 proteins (FruF, SseJ, YciE, and a putative flagellar protein) were also identified in a DNA microarray screen under SPI2-inducing conditions with cholesterol.
MEME search results using MAST and searched by domain:
Domain 1: SseI, SlrP, SopA (putative effector proteins), YebE
Domain 2: SseI, SlrP, YeeY, YeaH (putative cytoplasmic protein)
Domain 3: SseI, HepA/RapA, putative inner membrane protein (STM1698)
Domain 4: YfeC, putative periplasmic proteins (STM3783 and STM3605)
Domain 5: RffG, OmpR (regulatory protein), PrpA, SirC (invasion regulator)
Domain 6: SseI, SlrP, YadF, YaiB, PrpC (protein phosphatase), InvB (part of needle complex)
Domain 7: CitC (citrate carrier), YcfN, YjeQ, STM0611, STM2406
Domain 8: DdlA (d-alanine ligase), GlyS, PgtA (phosphoglycerate transporter), STM4502
Primary Structure Conclusions
• The best lead may be YciE, a putative cytoplasmic protein found with two different search methods
• The methods did not give the same output
• Hypothetical proteins found in the literature, such as SipD and SptP (SPI1) and SpiC, SrfJ, and SseB, C, D (SPI2), were not found
• Not all proteins secreted via SPI2 have the WEK(I/M)XXFF motif
• There is no clear SPI1 motif
Secondary Structure Prediction
• Psipred structure prediction server used
• Predictions made by two feed-forward neural networks based on PSI-BLAST output
• N-terminal motif (MEME 3): random coil in all SPI2 proteins
• First SPI2 motif at aa 31-38 (MEME 1): examples are SseJ, SifA, SifB(+F), SlrP(+F), SseI, SspH1(+F)
• Second SPI2 motif at aa 105-120 (no MEME): entirely random coil except for a small segment of SspH2
Alpha-helical Wheel (SifA, SifB)
WEK(I/M)XXFF is the conserved motif among SPI2 effectors, at aa 34-41 (positions 1, 2, 3, 4, 7). All show this profile except SseJ (position 7 is polar, but the face is still hydrophobic).