Bioinformatics 01 Part 3: Pairwise Alignments and Database Searches


rhoda

Presentation Transcript


  1. Bioinformatics 01 Part 3: Pairwise Alignments and Database Searches • Similarity and homology • Gap penalties and scoring matrices in pairwise alignments • Alignment algorithms • Database searching: BLAST and FASTA

  2. Similarity and Homology • If proteins that are similar share a common ancestor, they are said to be homologous • Homology can be inferred, but not confirmed, from similarity • Biological data can be used to support the case that two or more similar proteins arose from a common ancestor and are therefore homologous • Proteins can be similar but not homologous; homologous proteins generally show some similarity, although in highly diverged proteins it may be too weak to detect from sequence alone

  3. Examples of Simple Pairwise Alignments

     Ungapped alignment (Score: 8, Identity: 53%)
     Sequence 1  VLKAHLIDGGSKLTS
                       ||||| |||
     Sequence 2   VLKAHIDGGSRLTS

     Gapped alignment (Score: 13, Identity: 86.7%)
     Sequence 1  VLKAHLIDGGSKLTS
                 ||||| ||||| |||
     Sequence 2  VLKAH-IDGGSRLTS
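The scores and identities on this slide appear to be simple counts of identical columns. A minimal sketch of that count follows. It assumes the ungapped example places Sequence 2 one position to the right of Sequence 1 (written here with a leading "-" as padding), which is the only ungapped placement that gives 8 matches; the function name `identity_stats` is hypothetical.

```python
def identity_stats(row1, row2, gap="-"):
    """Return (identical columns, percent identity) for two aligned rows.

    Percent identity is taken over the full alignment length, which is
    an assumption that reproduces the percentages quoted on the slide.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    matches = sum(1 for a, b in zip(row1, row2) if a == b and a != gap)
    return matches, 100.0 * matches / len(row1)


# Ungapped: Sequence 2 shifted right by one position (assumed placement).
ungapped = identity_stats("VLKAHLIDGGSKLTS", "-VLKAHIDGGSRLTS")
# Gapped: one gap inserted in Sequence 2 opposite the extra L.
gapped = identity_stats("VLKAHLIDGGSKLTS", "VLKAH-IDGGSRLTS")

print(ungapped)  # match count and percent identity, ungapped
print(gapped)    # match count and percent identity, gapped
```

A single well-placed gap raises the number of identical columns from 8 to 13, which is why gapped alignments need a penalty scheme to keep gaps from being inserted freely.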

  4. Scoring Penalties in Pairwise Alignments • Penalties are imposed to prevent the unrestricted insertion of gaps • Gap (opening) penalty: a penalty for introducing a new gap • Extension penalty: a penalty for lengthening an existing gap • In protein evolution, it is more likely that an existing gap would be extended than a new gap introduced • Consequently, the gap opening penalty is larger than the gap extension penalty
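The two penalties are usually combined into an affine gap cost. The sketch below is illustrative only: the function name and the default values (10 to open, 0.5 to extend) are assumptions, and tools differ on whether the first gap position is charged the extension penalty as well.

```python
def affine_gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Cost of one gap of `length` positions under an affine scheme.

    Convention assumed here: the first gap position costs `gap_open`,
    and each further position costs `gap_extend`.
    """
    if length <= 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend


one_long_gap = affine_gap_penalty(3)          # a single 3-residue gap
three_short_gaps = 3 * affine_gap_penalty(1)  # three separate 1-residue gaps
print(one_long_gap, three_short_gaps)
```

Because opening costs more than extending, one long gap is penalized less than several short gaps covering the same number of positions.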

  5. Dot Matrix Analysis and Dot Plots • Compares two sequences in the form of a matrix, with each sequence lying along one axis • A match between residues is indicated by a dot • A sliding window is used to cut down “noise” and produce clearer results • Dot plot reveals diagonal lines where there is sufficient similarity between the sequences
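A dot plot with a sliding window can be sketched in a few lines. This is a minimal illustration, not the method of any particular program: the names `dot_plot` and `render`, and the rule "place a dot when at least `min_matches` of `window` residues are identical", are assumptions.

```python
def dot_plot(seq_x, seq_y, window=1, min_matches=1):
    """Return the set of (i, j) window start positions that earn a dot.

    A dot is placed when the windows seq_x[i:i+window] and
    seq_y[j:j+window] have at least `min_matches` identical positions.
    """
    dots = set()
    for i in range(len(seq_x) - window + 1):
        for j in range(len(seq_y) - window + 1):
            matches = sum(
                1 for k in range(window) if seq_x[i + k] == seq_y[j + k]
            )
            if matches >= min_matches:
                dots.add((i, j))
    return dots


def render(dots, seq_x, seq_y):
    """Draw the plot as text: seq_x along the top, seq_y down the side."""
    lines = ["  " + seq_x]
    for j, residue in enumerate(seq_y):
        row = "".join(
            "*" if (i, j) in dots else "." for i in range(len(seq_x))
        )
        lines.append(residue + " " + row)
    return "\n".join(lines)


noisy = dot_plot("ABA", "ABA")                            # every single match
filtered = dot_plot("ABA", "ABA", window=2, min_matches=2)  # windowed
print(render(noisy, "ABA", "ABA"))
```

With a window of 1 the off-diagonal A/A matches appear as noise; requiring two matches in a window of 2 leaves only the main diagonal.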

  6. Dot Plot: Human α- and β-Globin

  7. Scoring Matrices in Pairwise Alignments • A scoring matrix takes into account the significance of matches and mismatches between aligned amino acids • In theory, a scoring matrix could be based on the different chemical and physical properties of amino acids • In practice, scoring matrices are based on observed differences between proteins (or parts of proteins)

  8. PAM Scoring Matrices • Based on the analysis of 1,572 changes in 71 groups of closely related proteins (>85% identity) • Mutation probabilities were determined for each amino acid and scaled to an overall substitution rate of 1% (one accepted change per 100 residues) • These were used to construct the PAM 1 (point [or percent] accepted mutation) matrix • The PAM 250 matrix (often used as a default in pairwise alignments) is extrapolated from PAM 1 and provides scores equivalent to about 20% matches remaining between two sequences
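Higher-numbered PAM matrices are obtained by multiplying the PAM 1 mutation probability matrix by itself, so PAM 250 corresponds to PAM 1 raised to the 250th power. The sketch below shows only that extrapolation step on a made-up two-letter alphabet; `TOY_PAM1` is not real amino acid data, and the real matrices are 20 x 20 and are converted to log-odds scores afterwards.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Toy "PAM 1": each of two letters has a 1% chance of changing per step.
# Hypothetical values chosen for illustration only.
TOY_PAM1 = [
    [0.99, 0.01],
    [0.01, 0.99],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
print(toy_pam250[0])  # probabilities of keeping / changing letter 0
```

After 250 steps the probability of finding the original letter is far below 0.99^1 but does not fall to zero, because repeated and reverse substitutions are folded into the matrix product.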

  9. BLOSUM Scoring Matrices • Based on amino acid substitutions in a large set of ungapped amino acid patterns called blocks, derived from several hundred groups of related proteins • BLOSUM matrices take distant but significant relationships between proteins into account, because only conserved protein segments are considered • Over-representation of amino acid substitutions in closely related protein segments was reduced by combining those segments into one sequence • Example: segments showing 62% or more identity were grouped (and counted as a single sequence) to produce the BLOSUM62 matrix

  10. Alignments and Dynamic Programming • Complete search of all possible alignments is computationally demanding and frequently impossible • Algorithms that use dynamic programming have been developed to obtain alignments between sequences • Algorithms may produce either global or local alignments

  11. Global Alignment: Needleman-Wunsch • A matrix is constructed that shows matches between the two sequences • Moving from the top left of the matrix, a process of summation is carried out taking penalties into account • For any given cell in the matrix, the maximum score for that cell is entered • Needleman-Wunsch attempts to align all residues in the two sequences, and is therefore a global alignment algorithm
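The summation and traceback described above can be sketched as follows. This is a simplified version under stated assumptions: a flat match/mismatch score instead of a PAM or BLOSUM matrix, a single linear gap penalty instead of separate opening and extension penalties, and hypothetical parameter values.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to the top-left corner.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


nw_score, nw_a, nw_b = needleman_wunsch("VLKAHLIDGGSKLTS", "VLKAHIDGGSRLTS")
print(nw_score)
print(nw_a)
print(nw_b)
```

Run on the two example sequences from the simple alignment slide, this places a single gap in the shorter sequence opposite the extra L. The numerical score differs from the slide's 13 because mismatches and gaps are penalized here rather than ignored.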

  12. Local Alignment: Smith-Waterman • Takes into account that two relatively dissimilar sequences may exhibit short regions of local similarity • Smith-Waterman is a local alignment algorithm designed to detect these similarities • Each cell in the matrix is considered as the end point of a potential alignment • A value for each cell is calculated using a similarity score, taking matches, mismatches and gaps into account; negative values are replaced by zero • A backtracking procedure from the highest scoring cell is then used to trace the alignment back through the matrix until a cell with a score of zero is reached
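A compact sketch of the local alignment procedure follows. As with any short example it simplifies: it uses a flat match/mismatch score and a linear gap penalty, the parameter values are illustrative, and only one best-scoring local alignment is returned.

```python
def smith_waterman(a, b, match=3, mismatch=-3, gap=-2):
    """Local alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b) for one best local alignment.
    """
    n, m = len(a), len(b)
    h = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            h[i][j] = max(
                0,                            # start a new local alignment
                h[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                h[i - 1][j] + gap,            # gap in b
                h[i][j - 1] + gap,            # gap in a
            )
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and h[i][j] > 0:
        if h[i][j] == h[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


sw_score, sw_a, sw_b = smith_waterman("GGTTGACTA", "TGTTACGG")
print(sw_score)
print(sw_a)
print(sw_b)
```

The only structural differences from a global alignment are the zero floor in each cell and the traceback, which starts at the best cell rather than the corner and stops at the first zero.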

  13. Pairwise Database Searching • Use of the Needleman-Wunsch or Smith-Waterman algorithms in pairwise database searching requires enormous computational power • Heuristic approximations of these algorithms are therefore used in database searches • Examples of search tools are BLAST and FASTA • Both BLAST and FASTA first identify short word matches between the query and database sequences, which are then extended to produce local alignments

  14. BLAST • The query sequence is broken into regions of short length (words or k-tuples), and the database is searched for sequences containing matching words (hits) • Hits are extended in both directions to produce high-scoring segment pairs (HSPs), which are kept if they score above a certain threshold • A scoring matrix (default is BLOSUM62), gap and gap extension penalties are taken into account in determining alignments • Alignments are then reported in order of decreasing score
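The seed-and-extend idea can be sketched on a single query/subject pair. This is a deliberately simplified illustration, not BLAST itself: it seeds on exact word matches only (protein BLAST also accepts similar words scored with the matrix), extends without gaps, and uses made-up function names, scores and drop-off value.

```python
def find_word_hits(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a k-letter word matches."""
    index = {}
    for j in range(len(subject) - k + 1):
        index.setdefault(subject[j:j + k], []).append(j)
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            hits.append((i, j))
    return hits


def extend_hit(query, subject, i, j, k=3, match=1, mismatch=-1, x_drop=2):
    """Extend an exact word hit in both directions without gaps.

    Extension in a direction stops when the running score falls more
    than `x_drop` below the best score seen so far.
    Returns (score, query_start, query_end, subject_start), end exclusive.
    """
    best = k * match  # the seed itself is an exact match
    # Extend to the right.
    running, q_end = best, i + k
    qi, sj = i + k, j + k
    while qi < len(query) and sj < len(subject):
        running += match if query[qi] == subject[sj] else mismatch
        qi += 1
        sj += 1
        if running > best:
            best, q_end = running, qi
        elif best - running > x_drop:
            break
    # Extend to the left, continuing from the best score so far.
    running, q_start = best, i
    qi, sj = i - 1, j - 1
    while qi >= 0 and sj >= 0:
        running += match if query[qi] == subject[sj] else mismatch
        if running > best:
            best, q_start = running, qi
        elif best - running > x_drop:
            break
        qi -= 1
        sj -= 1
    return best, q_start, q_end, j - (i - q_start)


query, subject = "VLKAHLIDGGSKLTS", "VLKAHIDGGSRLTS"
hits = find_word_hits(query, subject)
hsp = extend_hit(query, subject, 6, 5)  # extend the "IDG" hit
print(hits)
print(hsp, query[hsp[1]:hsp[2]])
```

The word index is what makes the search fast: only positions that share a word are examined, and the expensive extension step is run on those alone.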

  15. http://www.ncbi.nlm.nih.gov/BLAST/

  16. FASTA • Regions of short length (words) in the query that match a target sequence are determined • High scoring regions (best initial regions) are used to rank matches for further analysis • Longer high scoring regions, including gaps, are generated by joining best initial regions • A full Smith-Waterman alignment is then performed between the high scoring regions • FASTA is slower than BLAST but may, in some cases, be more sensitive

  17. http://www.ebi.ac.uk/fasta33/

  18. A Final Few Words of Advice • Protein-protein searches are more informative than nucleotide-nucleotide searches (when the query is known to contain a protein-coding nucleotide sequence) • When performing a pairwise database search with a new, protein-coding nucleotide sequence, always use a translation of the nucleotide sequence in all six frames as the query • This can be done by using, for example, a translated BLAST search (such as tblastx, which translates both the query sequence and a nucleotide database)
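Six-frame translation means translating the three reading frames of the sequence as given plus the three frames of its reverse complement. A minimal sketch, assuming the standard genetic code, an uppercase DNA sequence containing only A, C, G and T, and hypothetical function names:

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(dna):
    """Return the reverse complement of an A/C/G/T sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def translate(dna):
    """Translate complete codons from the start of `dna`; '*' marks a stop."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def six_frame_translations(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to peptides."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[strand + str(offset + 1)] = translate(seq[offset:])
    return frames


frames = six_frame_translations("ATGGCCAAA")
for label, peptide in frames.items():
    print(label, peptide)
```

Each of the six peptides can then be used as a protein query; translated search programs such as tblastx perform this step internally.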
