GROUP MEMBERS: MUHAMMAD KHAIRULANWAR IZZAT BIN HUSSIN (AC100076), MURNIYANTI BINTI MALIK (AC100078), NG SHEE TING (AC100079), SCHEE XIN LIN (AC100086), AW MEI YEE (AC100062)
INTRODUCTION @Ng Shee Ting
INTRODUCTION(cont..) @Ng Shee Ting
INTRODUCTION(cont..) @Ng Shee Ting
PURPOSES @Ng Shee Ting
WHAT IS PSSM?? @Ng Shee Ting
PSSM CONT.. The PSSM is generated by calculating position-specific scores for each position in the alignment. Highly conserved positions receive high scores and weakly conserved positions receive scores near zero. *Note: A profile is a table of observed frequencies of amino acids (or nucleotides) at each position in a multiple alignment. @Ng Shee Ting
PSSM CONT.. • PSI-BLAST PSSM is derived from local alignments • Only positions present in the query sequence are used • If the query has L positions(length), PSSM will also have L positions and generate a 20 X L matrix. @Ng Shee Ting
Basic concept of the calculation (this example counts nucleotides) @Izzat
CALCULATION cont… Row Column (Positions) @Izzat
CALCULATION cont… Refer back to Table A. Shading indicates the fraction of occurrences of that base at that position: red (1.0), orange (0.8), yellow (0.6). @Izzat
CALCULATION cont… Note c: The background frequencies used to calculate the scores are A = T = 0.32 and C = G = 0.18. Table 1D was calculated with the default scoring system used by the Gibbs Sampler. @Izzat
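CALCULATION cont… • In the example shown in Table 1D, the score for an adenine in position one is calculated as: • Score(position 1, A) = [3 + √5 (0.32)] / [5 + √5] = 0.51, where 3 is the number of A's observed at position 1, 5 is the number of sequences, and 0.32 is the background frequency of A. @Izzat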
CALCULATION cont… • Score(position 1, A) = [3 + 0.1 (0.32)] / [5 + 0.1] = 0.59. Note c: The background frequencies used to calculate the scores are A = T = 0.32 and C = G = 0.18. Table 1E used the default scoring system of MEME. @Izzat
CALCULATION cont… Note d: Each element of the table is equal to the negative log10 of the corresponding element of Table 1E (-log10). @Izzat
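The two worked scores above share one form and differ only in the weight given to the background frequency. A minimal Python sketch, assuming the general form (count + weight × background) / (N + weight) as read off those two examples; the function name and its arguments are illustrative and are not taken from the Gibbs Sampler or MEME software:

```python
import math

def smoothed_frequency(count, n_seqs, background, weight):
    """Return (count + weight * background) / (n_seqs + weight)."""
    return (count + weight * background) / (n_seqs + weight)

# Adenine at position 1: observed 3 times in 5 sequences, background 0.32.
gibbs = smoothed_frequency(3, 5, 0.32, math.sqrt(5))  # weight sqrt(5), as in the Table 1D example
meme = smoothed_frequency(3, 5, 0.32, 0.1)            # weight 0.1, as in the Table 1E example
neg_log = -math.log10(meme)                           # negative log10 transform of the Table 1E value

print(round(gibbs, 2), round(meme, 2), round(neg_log, 2))
```

A larger weight pulls the score toward the background frequency, which is why the first scheme gives 0.51 and the second 0.59 for the same counts.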
EXAMPLE 20X L matrix Position 1 Position 15 L positions Y appear twice in this position @Izzat
PSSM CALCULATION Column 1: frequency(A, 1) = 0 / 5 = 0, frequency(G, 1) = 5 / 5 = 1, ... Column 2: frequency(A, 2) = 0 / 5 = 0, frequency(H, 2) = 5 / 5 = 1, ... Column 15: frequency(A, 15) = 2 / 5 = 0.4, frequency(C, 15) = 1 / 5 = 0.2, ... Some frequencies are equal to 0 because the multiple alignment contains only a small number of sequences. Such a zero frequency could lead to the "exclusion" of the amino acid concerned at that position. @Izzat
CONT.. One way around this is to add a small value to all observed frequencies. This small value, which stands in for unobserved occurrences, is called a "pseudo-count". In the previous example with a pseudo-count of 1: Column 1: f'(A, 1) = (0 + 1) / (5 + 20) = 0.04, f'(G, 1) = (5 + 1) / (5 + 20) = 0.24, ... Column 2: f'(A, 2) = (0 + 1) / (5 + 20) = 0.04, f'(H, 2) = (5 + 1) / (5 + 20) = 0.24, ... Column 15: f'(A, 15) = (2 + 1) / (5 + 20) = 0.12, f'(C, 15) = (1 + 1) / (5 + 20) = 0.08, ... @Izzat
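The pseudo-count correction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `column_frequencies` is hypothetical, and the column string "GGGGG" is reconstructed from the statement that G occurs in all five sequences at position 1.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def column_frequencies(column, pseudocount=1):
    """Return pseudo-count-corrected frequencies for one alignment column."""
    counts = Counter(column)
    # Denominator: number of sequences plus one pseudo-count per residue type.
    total = len(column) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}

# Column 1 of the example: G in all five sequences (assumed column content).
col1 = column_frequencies("GGGGG")
print(col1["A"], col1["G"])
```

With a pseudo-count of 1 no residue has frequency 0, so no amino acid is excluded outright at any position.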
PSSM CONT.. The frequency of each amino acid at each position is compared with the frequency at which that amino acid is expected in a random sequence. It is assumed that each amino acid is observed with the same frequency in a random sequence. Score_ij = log(f'_ij / q_i), where: - Score_ij is the score for residue i at position j - f'_ij is the relative frequency of residue i at position j, corrected by the pseudo-count - q_i is the relative frequency expected for residue i in a random sequence @Izzat
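As a rough illustration of the log-odds score, the sketch below assumes a base-10 logarithm (the slide only writes "log") and a uniform background q_i = 1/20, following the statement that every amino acid is equally frequent in a random sequence. The function name `log_odds` is hypothetical.

```python
import math

def log_odds(f_corrected, q_background):
    """Score = log10(f' / q); the base 10 is an assumption."""
    return math.log10(f_corrected / q_background)

q_uniform = 1 / 20  # equal expected frequency for each of the 20 amino acids

conserved = log_odds(0.24, q_uniform)   # residue seen in all 5 sequences
unobserved = log_odds(0.04, q_uniform)  # residue never seen, pseudo-count only

print(round(conserved, 2), round(unobserved, 2))
```

A residue seen more often than expected gets a positive score, and one seen less often than expected gets a negative score.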
PSSM full calculated Score ij @Izzat
Exercise: The fully calculated scores and the values of f' are given in the diagram above, so you can calculate q_i using the formula Score_ij = log(f'_ij / q_i). @Izzat
Solution • Rearranging the formula gives q_i = f'_ij / 10^(Score_ij) - For any entry of -0.2 in the table, q_i = 0.0634 - For any entry of 2.3 in the table, q_i = 1.203 × 10^(-3) - For any entry of 0.7 in the table, q_i = 0.015 - For any entry of 1.3 in the table, q_i = 6.014 × 10^(-3) @Izzat
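The rearranged formula can be checked numerically. In this sketch the pairing of each table score with a corrected frequency (-0.2 with f' = 0.04, 2.3 with f' = 0.24, 1.3 with f' = 0.12) is an assumption inferred from the pseudo-count example, since the table itself is not reproduced here; the function name is hypothetical.

```python
def background_from_score(f_corrected, score):
    """Invert Score = log10(f' / q) to recover q = f' / 10**Score."""
    return f_corrected / 10 ** score

q_a = background_from_score(0.04, -0.2)
q_b = background_from_score(0.24, 2.3)
q_c = background_from_score(0.12, 1.3)

print(f"{q_a:.4f} {q_b:.3e} {q_c:.3e}")
```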
Why PSSM? A PSSM is a matrix used for biological sequence data, and its main role in a PSI-BLAST search is to increase the sensitivity of the results. The PSSM (profile) is used to search the database again for new matches in a second BLAST search, and it is updated in subsequent iterations with the newly detected sequences. The results of each "iteration" are used to refine the profile, and this iterative search strategy results in increased sensitivity. @Izzat
E Value? • An abbreviated term for "Expected Value" or "Expectation Value". • A parameter that describes the number of hits one can "expect" to see just by chance when searching a database of a particular size. • The E value applies to the longest run of matches in an alignment of length L. @Schee Xin Lin
E Value cont It decreases exponentially with the Score (S) that is assigned to a match between two sequences. Shorter sequences have a high probability of occurring in the database purely by chance. @Schee Xin Lin
E Value cont For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance. This means that the lower the E-value, or the closer it is to "0" the more "significant" the match is. @Schee Xin Lin
EQUATION E = K m n e^(-λS) • This is the equation for calculating the E value. • m: the length of the query sequence • n: the length of the database (total number of residues) • S: the score • The parameters K and λ are constants that depend on the scoring system. @Schee Xin Lin
Example of calculation Constants • λ = 0.219 • K = 0.082 • S = 103 • m = 100 • n = 2 × 10^8 • λS = 0.219 × 103 = 22.6 • e^(-λS) = 1.6 × 10^(-10) • E = K m n e^(-λS) = 0.082 × 100 × 2 × 10^8 × 1.6 × 10^(-10) = 0.2624 @Schee Xin Lin
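The worked example can be reproduced with a short Python sketch. The function name `expected_hits` is illustrative; the constants are the ones listed in the example.

```python
import math

def expected_hits(K, m, n, lam, S):
    """E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * S)

e_value = expected_hits(K=0.082, m=100, n=2e8, lam=0.219, S=103)
print(round(e_value, 2))
```

Computed without intermediate rounding, the result is about 0.262, which agrees with the 0.2624 obtained above after rounding e^(-λS) to 1.6 × 10^(-10).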
In a typical current database search, a protein of length 250 might be compared to a protein database of 50 000 000 total residues. @Schee Xin Lin
• Doubling the length of either sequence doubles the expected number of HSPs. • Increasing the score S reduces the expected number of HSPs exponentially (the higher the score, the lower the expected number of HSPs). • Thus, we anticipate that E is proportional to mn and also proportional to e^(-λS). @Schee Xin Lin
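These two proportionalities can be checked numerically. The sketch below uses the query length and database size from the typical search mentioned earlier (250 and 50 000 000 residues); reusing K = 0.082 and λ = 0.219 from the worked example and picking S = 60 are assumptions made purely for illustration.

```python
import math

def expected_hits(K, m, n, lam, S):
    """E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * S)

K, lam = 0.082, 0.219          # constants reused from the worked example (assumption)
m, n, S = 250, 50_000_000, 60  # S = 60 is an arbitrary illustrative score

base = expected_hits(K, m, n, lam, S)
ratio_m = expected_hits(K, 2 * m, n, lam, S) / base  # effect of doubling the query length
ratio_S = expected_hits(K, m, n, lam, 2 * S) / base  # effect of doubling the score

print(ratio_m, ratio_S)
```

Doubling m multiplies E by 2, while doubling S multiplies E by a further factor of e^(-λS), which is a far larger change.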
[Figures: relationship between E and mn; relationship between E and e^(-λS)] @Schee Xin Lin
HOW PSI BLAST WORKS? @Aw Mei Yee
PSI BLAST FLOW CHART [figure: steps 1 to 4] @Aw Mei Yee
PRINCIPLES PSI-BLAST principle: 1. A standard BLAST search is performed against a database using a substitution matrix (e.g. BLOSUM62). 2. A PSSM is constructed automatically from a multiple alignment of the highest-scoring hits of the initial BLAST search. Highly conserved positions receive high scores and weakly conserved positions receive scores near zero. @Aw Mei Yee
PRINCIPLES cont.. 3. The PSSM replaces the initial matrix (e.g. BLOSUM62) to perform a second BLAST search. 4. Steps 2 and 3 can be repeated, with the newly found sequences included to build a new PSSM. 5. We say that PSI-BLAST has converged if no new sequences are included in the last cycle. @Aw Mei Yee
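The iterate-until-converged logic of these steps can be sketched as a small Python loop. This is not the PSI-BLAST implementation: `search` and `build_pssm` are hypothetical callables standing in for the BLAST search and the PSSM construction, and the toy versions below exist only to make the example runnable.

```python
def psi_blast_loop(query, search, build_pssm, max_iterations=10):
    """Repeat search -> build profile -> search until no new sequences appear.

    search(query, profile) returns a set of hit identifiers; profile is None
    in the first round (standard substitution matrix).
    build_pssm(query, hits) returns the profile used in the next round.
    """
    included = set()
    profile = None
    for iteration in range(1, max_iterations + 1):
        hits = search(query, profile)
        new_hits = hits - included
        if not new_hits:
            return included, iteration  # converged: nothing new this cycle
        included |= new_hits
        profile = build_pssm(query, included)
    return included, max_iterations

# Toy stand-ins: the profile-based search finds one extra sequence.
def toy_search(query, profile):
    return {"seqA", "seqB"} if profile is None else {"seqA", "seqB", "seqC"}

def toy_build_pssm(query, hits):
    return frozenset(hits)

found, rounds_used = psi_blast_loop("QUERY", toy_search, toy_build_pssm)
print(sorted(found), rounds_used)
```

In this toy run the third cycle returns no new sequences, so the loop stops there and reports convergence.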
Sequence in FASTA format @Aw Mei Yee
Example of FASTA format:
>gi|18892811|gb|AAL80910.1| transposase [Pyrococcus furiosus DSM 3638]
MVVLSFQRKILIIKSEIYPIVSKHYPKNTRREVISLYDLITFAILAHLHFNGVYKHAYRVLIEEMKLFPK
IRYNKLTERLNRHEKLLLLAQEELFKKHAREYVRILDSKPIQTKELARKNRKDKEGSSEVISEKPAVGFV
PSKKKFYYGYKLTCYSDGNLLALLSVDPANKHDVSVVREKFWVIVEEFSGCFLFLDKGYVSRGLEEEFLR
FGVVYTPVKRGNQISNLEEKKFYKYLSDFRRRIETLFSKFSEFLLRPSRSVSLRGLAVRILGAILAVNLD
RLYNFTGGGN
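A FASTA record is a header line starting with ">" followed by one or more sequence lines. A minimal Python sketch of reading such text is shown below; the function name `parse_fasta` is illustrative, and only the first two sequence lines of the example record are used to keep the block short.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

example = """>gi|18892811|gb|AAL80910.1| transposase [Pyrococcus furiosus DSM 3638]
MVVLSFQRKILIIKSEIYPIVSKHYPKNTRREVISLYDLITFAILAHLHFNGVYKHAYRVLIEEMKLFPK
IRYNKLTERLNRHEKLLLLAQEELFKKHAREYVRILDSKPIQTKELARKNRKDKEGSSEVISEKPAVGFV
"""

records = parse_fasta(example)
print(records[0][0])
print(len(records[0][1]))
```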
Peptide Sequence Databases: try choosing refseq. @Aw Mei Yee
Choose PSI BLAST @Aw Mei Yee
PSI BLAST USES TWO E-VALUES: • the threshold E-value for the initial BLAST search • the inclusion E-value for accepting sequences into the PSSM construction (default is 0.005). @Aw Mei Yee
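The role of the inclusion E-value can be illustrated with a simple filter. The hit names and E-values below are hypothetical, the helper `select_for_pssm` is not a PSI-BLAST function, and treating the threshold as inclusive (less than or equal) is an assumption.

```python
INCLUSION_E_VALUE = 0.005  # default inclusion threshold quoted above

def select_for_pssm(hits, inclusion=INCLUSION_E_VALUE):
    """Keep the names of hits whose E-value is at or below the threshold."""
    return [name for name, e_value in hits if e_value <= inclusion]

# Hypothetical hits as (name, E-value) pairs.
hits = [("seqA", 1e-30), ("seqB", 0.003), ("seqC", 0.2)]

default_set = select_for_pssm(hits)
strict_set = select_for_pssm(hits, inclusion=0.0001)
print(default_set, strict_set)
```

A stricter inclusion value admits fewer sequences into the PSSM, which lowers the risk of building the profile from chance matches.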
Try setting it to 0.0001. The threshold (cut-off) can be changed as desired for the next iteration. Lastly, click BLAST to start the search. @Aw Mei Yee
OUTPUT @Aw Mei Yee
FIRST ITERATION Click Go for 2nd iteration @Aw Mei Yee
SECOND ITERATION Click Go for 3rd iteration @Aw Mei Yee
THIRD ITERATION Click Go for 4th iteration @Aw Mei Yee
FOURTH ITERATION @Aw Mei Yee
From the second iteration onward, PSI-BLAST E values are not directly comparable to those calculated by BLAST. • This is because BLAST scores the query sequence against each database sequence using a substitution matrix (e.g. BLOSUM62) that contains a fixed value for each amino acid pair, whereas PSI-BLAST scores with a PSSM whose values are specific to each position. @Aw Mei Yee
Sequences derived from the previous iteration. Newly found sequences that are homologous in the new iteration. @Aw Mei Yee
SAMPLE ALIGNMENT (HIT TABLE) In the line between the query and the database sequence, identical matches are shown by the residue letter and conservative substitutions are marked by a "+" symbol. Gaps are introduced with a "-" symbol. The hit sequence is presented in the Sbjct: line, and the query sequence in the Query: line. @Aw Mei Yee