Heuristic approaches & scoring matrices
M. Prasad Naidu, MSc Medical Biochemistry, Ph.D.
Introduction
• Two heuristic algorithms are covered here: BLAST and FASTA.
• FASTA is an algorithm developed by Pearson and Lipman. It is often described as more sensitive than BLAST, at the cost of speed.
• BLAST is an algorithm developed by Altschul et al. in 1990. It finds high-scoring local alignments between two sequences. Gapped versions are now available.
BLASTP algorithm
• The BLAST algorithm involves the following steps:
• Breaking the query sequence into words of a defined size.
• Finding a match, or HSP (high-scoring segment pair).
• Extending the alignment outward from the matched word.
Breaking of the sequence into defined word size
Query: AILDTGATGDA
Word size: 4
Words: AILD ILDT LDTG DTGA TGAT GATG ATGD TGDA
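The word-splitting step above can be sketched in a few lines of Python. This is only an illustration of the sliding-window idea; the function name `split_into_words` is made up for this example and is not part of any BLAST package.

```python
def split_into_words(query: str, word_size: int) -> list[str]:
    """Return every overlapping word of length `word_size`, left to right."""
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]


# Query and word size taken from the slide.
words = split_into_words("AILDTGATGDA", 4)
print(words)
# ['AILD', 'ILDT', 'LDTG', 'DTGA', 'TGAT', 'GATG', 'ATGD', 'TGDA']
```

A query of length 11 with word size 4 gives 11 - 4 + 1 = 8 words, matching the list on the slide.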
Finding a high-scoring pair
Subject: MQVWGWAILDTVATDAAMLL
Word:          AILD
Extending the alignment
Subject: MQVWGWAILDTVATDAAMLL
Query:         AILDTGATGDA
Parameters in a BLAST result
• Percentage identity (often loosely called percentage homology)
• Score of the alignment
• Number of residues aligned
• E-value
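The seed-and-extend idea from the slides can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual BLASTP implementation: it uses exact word matches and a +1/-1 score, whereas BLASTP scores neighbourhood words with a substitution matrix and extends in both directions. The function names and the `max_drop` cut-off value are hypothetical.

```python
def find_seeds(word, subject):
    """Return the 0-based positions where `word` occurs exactly in `subject`."""
    k = len(word)
    return [i for i in range(len(subject) - k + 1) if subject[i:i + k] == word]


def extend_right(query, subject, q_start, s_start, match=1, mismatch=-1, max_drop=2):
    """Ungapped extension to the right of a seed.

    Returns (best_score, best_length). Extension stops once the running
    score falls more than `max_drop` below the best score seen so far.
    """
    score = best_score = best_length = 0
    limit = min(len(query) - q_start, len(subject) - s_start)
    for k in range(limit):
        score += match if query[q_start + k] == subject[s_start + k] else mismatch
        if score > best_score:
            best_score, best_length = score, k + 1
        elif best_score - score > max_drop:
            break
    return best_score, best_length


query = "AILDTGATGDA"
subject = "MQVWGWAILDTVATDAAMLL"
seeds = find_seeds(query[:4], subject)            # seed word "AILD"
score, length = extend_right(query, subject, 0, seeds[0])
hsp_query = query[:length]
hsp_subject = subject[seeds[0]:seeds[0] + length]
identities = sum(a == b for a, b in zip(hsp_query, hsp_subject))
print(seeds, score, hsp_query, hsp_subject, f"{100 * identities / length:.1f}% identity")
```

With the slide's sequences the seed is found at position 6 (0-based) and the best-scoring segment is AILDTGAT against AILDTVAT, with 7 identities over 8 aligned residues. Only rightward extension is shown because the seed sits at the start of the query.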
FASTA algorithm
• In the FASTA algorithm the word size is called the k-tuple (ktup).
• The k-tuple is typically 4 to 6 for nucleotide sequences and 1 or 2 for protein sequences.
• FASTA follows steps similar to those of BLAST, but the alignment is generated differently.
Breaking of the sequence into defined k-tuple
Residue:   F  A  M  L  G  F  I  K  Y  L  P  G  C  M
Position:  1  2  3  4  5  6  7  8  9 10 11 12 13 14
The most frequently occurring offset between matching residues in the two sequences is 3, so the alignment starts after skipping the first three residues of the first sequence.
Alignment of the sequences
F A M L G F I K Y L P G C M
      T G F I K Y L P G A C T
Parameters in a FASTA result
• Percentage identity (often loosely called percentage homology)
• Score of the alignment
• Number of residues aligned
• P-score
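The offset counting behind the FASTA example can be sketched as below. It is a minimal illustration of the lookup-table step only, assuming a k-tuple of 1 and 1-based positions as on the slide; the later FASTA stages (rescoring and joining diagonals, then the optimized alignment) are not shown, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def most_common_offset(seq1, seq2, ktup=1):
    """Return (offset, count) for the most frequent k-tuple offset.

    The offset is the position in seq1 minus the position in seq2
    (both 1-based) for every k-tuple the two sequences share.
    """
    # Lookup table: k-tuple -> list of 1-based positions in seq1.
    positions = defaultdict(list)
    for i in range(len(seq1) - ktup + 1):
        positions[seq1[i:i + ktup]].append(i + 1)

    offsets = Counter()
    for j in range(len(seq2) - ktup + 1):
        for i in positions.get(seq2[j:j + ktup], []):
            offsets[i - (j + 1)] += 1
    return offsets.most_common(1)[0]


offset, count = most_common_offset("FAMLGFIKYLPGCM", "TGFIKYLPGACT")
print(offset, count)   # 3 8
```

Offset 3 is supported by eight matching residues (G, F, I, K, Y, L, P, G), which is why the second sequence is placed three residues to the right of the first.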
Scoring schemes
Identity scoring matrix
• Residue-to-residue scores are assigned purely on identity: two residues either match or they do not.
• A 4 × 4 matrix is built for nucleotides and a 20 × 20 matrix for amino acids.
• A match scores +1 and a mismatch scores -1.
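An identity matrix of this kind is easy to build programmatically. The sketch below uses the +1/-1 values from the slide and stores the matrix as a dictionary keyed by residue pairs; the function names are illustrative only.

```python
def identity_matrix(alphabet, match=1, mismatch=-1):
    """Build an identity scoring matrix as a dict keyed by (residue, residue)."""
    return {(a, b): (match if a == b else mismatch) for a in alphabet for b in alphabet}


def score_ungapped(s1, s2, matrix):
    """Sum the pairwise scores of two equal-length, ungapped sequences."""
    return sum(matrix[(a, b)] for a, b in zip(s1, s2))


dna = identity_matrix("ACGT")                      # 4 x 4 = 16 entries
protein = identity_matrix("ACDEFGHIKLMNPQRSTVWY")  # 20 x 20 = 400 entries

# Query from the BLAST example against the subject segment it aligns to.
print(score_ungapped("AILDTGATGDA", "AILDTVATDAA", protein))   # 8 matches, 3 mismatches -> 5
```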
PAM matrices
• These were first developed by Margaret Dayhoff and co-workers in 1978.
• The model assumes that evolutionary change follows a Markov model, i.e. each residue change occurs independently of previous mutations at that position.
• One PAM (point accepted mutation) is a unit of evolutionary divergence in which 1% of amino acids have changed. This does not imply that after 100 PAMs every amino acid is different, because a position can mutate more than once or revert to its original residue.
• Dayhoff and co-workers calculated the frequencies of accepted mutations for 1 PAM by analyzing families of closely related sequences.
• The scores are expressed as log-odds ratios.
• The PAM1 matrix can be extrapolated to any number of PAMs: the PAM N matrix is obtained by raising the PAM1 mutation probability matrix to the Nth power.
• Lower-numbered PAM matrices are used for closely related protein sequences and higher-numbered ones for more divergent proteins.
• For example, PAM30 is used for closely related proteins and PAM250 for divergent ones.
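The extrapolation from 1 PAM to N PAMs can be illustrated with repeated matrix multiplication. The sketch below uses a made-up 3-letter alphabet whose mutation probabilities are invented for illustration (they are not Dayhoff's data); each row keeps its residue with probability 0.99, mimicking 1% change per PAM unit.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical 1-PAM mutation probability matrix (rows sum to 1).
toy_pam1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.005, 0.005, 0.990],
]

toy_pam100 = mat_pow(toy_pam1, 100)
unchanged = [toy_pam100[i][i] for i in range(3)]
print([round(p, 3) for p in unchanged])
```

The diagonal of the 100-PAM matrix stays well above zero, which illustrates the point made above: 100 PAMs does not mean every residue has changed, because repeated and reverse mutations at the same position are counted too.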
BLOSUM matrices
• These matrices were developed by Henikoff and Henikoff in 1992.
• Like PAM matrices, they are log-odds matrices, but the substitution frequencies are counted directly from observed alignments rather than extrapolated from closely related sequences.
• The data were derived from ungapped local alignments (blocks) of related proteins deposited in the BLOCKS database.
• BLOSUM30 is used for comparing highly divergent sequences and BLOSUM90 for closely related proteins.
• The most commonly used matrix is BLOSUM62, which is built from blocks in which sequences sharing 62% identity or more are clustered together.
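Both PAM and BLOSUM entries are log-odds scores: the observed frequency of a residue pair is compared with the frequency expected by chance. A minimal sketch follows, assuming base-2 logarithms and a scale factor of 2 (half-bit units, as used for BLOSUM62); the frequencies below are invented for illustration and are not taken from the BLOCKS database.

```python
import math


def log_odds_score(pair_freq, freq_a, freq_b, scale=2.0):
    """Rounded log-odds score: scale * log2(observed / expected by chance)."""
    return round(scale * math.log2(pair_freq / (freq_a * freq_b)))


# Hypothetical frequencies: each residue has background frequency 0.05,
# so the chance expectation for the pair is 0.05 * 0.05 = 0.0025.
conserved = log_odds_score(0.02, 0.05, 0.05)      # seen 8x more often than chance
disfavoured = log_odds_score(0.00125, 0.05, 0.05) # seen half as often as chance
print(conserved, disfavoured)   # 6 -2
```

Positive scores mark pairs seen more often than expected by chance, and negative scores mark pairs seen less often.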