Biological Sequence Comparison / Database Homology Searching. Aoife McLysaght, Summer Intern, Compaq Computer Corporation, Ballybrit Business Park, Galway, Ireland.
Database Homology Searching • Uses algorithms to increase efficiency and to provide a mathematical basis for searches, which can be translated into statistical significance • Assumes that sequence, structure and function are inter-related • BLAST (Basic Local Alignment Search Tool) and FASTA (FAST-All) • heuristic approximations of the Needleman-Wunsch and Smith-Waterman algorithms • reduce computation
Needleman-Wunsch Algorithm • General algorithm for sequence comparison • Maximise a similarity score, to give ‘maximum match’ • Maximum match = largest number of residues of one sequence that can be matched with another allowing for all possible deletions • Finds the best GLOBAL alignment of any two sequences • N-W involves an iterative matrix method of calculation • All possible pairs of residues (bases or amino acids) - one from each sequence - are represented in a 2-dimensional array • All possible alignments (comparisons) are represented by pathways through this array
Needleman-Wunsch Algorithm (cont.) • Three main steps 1. Assign similarity values 2. For each cell, look at all possible pathways back to the beginning of the sequence (allowing insertions and deletions) and give that cell the value of the maximum scoring pathway 3. Construct an alignment (pathway) back from the highest scoring cell to give the highest scoring alignment
Needleman-Wunsch Algorithm (cont.) Similarity values • A numerical value is assigned to every cell in the array depending on the similarity/dissimilarity of the two residues • These may be simple scores or more complicated, e.g. related to chemical similarities or frequency of observed substitutions • The example shown has match = +1, mismatch = 0
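The similarity-value step can be sketched in a few lines of Python. This is a minimal illustration using the slide's simple scores (match = +1, mismatch = 0); the function name `similarity_matrix` and the short test sequences are hypothetical and not from the original slides.

```python
def similarity_matrix(seq_a, seq_b, match=1, mismatch=0):
    """Return a len(seq_a) x len(seq_b) grid of per-cell similarity values.

    Rows correspond to residues of seq_a, columns to residues of seq_b.
    """
    return [[match if a == b else mismatch for b in seq_b] for a in seq_a]


# Hypothetical toy sequences, chosen only to show the layout of the array.
for row in similarity_matrix("ABC", "AXC"):
    print(row)
```

A substitution table (for example one based on chemical similarity or observed substitution frequencies) could replace the `match if a == b else mismatch` expression without changing the rest of the procedure.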
Needleman-Wunsch Algorithm (cont.) Score pathways through array • For each cell, we want the maximum possible score for an alignment ending at that point • Search the subrow (row i-1) and subcolumn (column j-1) preceding the cell for the highest score • Add this to the score for the current cell • Proceed row by row through the array • A gap penalty may be charged for introducing gaps into the alignment (presumed insertions or deletions in one sequence); in this example the penalty = 0
$$H_{ij} = \max\left\{ H_{i-1,j-1} + s(a_i, b_j),\ \max_{k}\{H_{i-k,j-1} - W_k + s(a_i, b_j)\},\ \max_{l}\{H_{i-1,j-l} - W_l + s(a_i, b_j)\} \right\}$$
where s(a_i, b_j) is the similarity value for residues a_i and b_j, and W_k, W_l are the gap penalties.
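The scoring pass described above might look like the following Python sketch. It follows the slide's recurrence directly (diagonal cell, then the rest of column j-1 and row i-1), with two assumptions that are not in the slides: the gap penalty is a single constant `gap` rather than a length-dependent W_k, and a path may start in any cell of the first row or column at no cost. The names `nw_fill` and `maximum_match` are illustrative, and both sequences are assumed to be non-empty.

```python
def nw_fill(seq_a, seq_b, match=1, mismatch=0, gap=0):
    """Fill the array so H[i][j] is the best score of a pathway ending at (i, j)."""
    n, m = len(seq_a), len(seq_b)
    H = [[0] * m for _ in range(n)]
    for i in range(n):                      # proceed row by row
        for j in range(m):
            s = match if seq_a[i] == seq_b[j] else mismatch
            best_prev = 0                   # first row/column: pathway starts here
            if i > 0 and j > 0:
                candidates = [H[i - 1][j - 1]]                                  # no gap
                candidates += [H[i - k][j - 1] - gap for k in range(2, i + 1)]  # subcolumn
                candidates += [H[i - 1][j - l] - gap for l in range(2, j + 1)]  # subrow
                best_prev = max(candidates)
            H[i][j] = s + best_prev
    return H


def maximum_match(H):
    """Highest score in the last row or last column of the filled array."""
    return max(max(H[-1]), max(row[-1] for row in H))


H = nw_fill("MPRCLCQRJNCBA", "PBRCKCRNJCJA")
print(maximum_match(H))
```

The two sequences in the usage lines are the ones from the alignment on the "Construct alignment" slide with the gap characters removed.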
Needleman-Wunsch Algorithm (cont.) Construct alignment • The alignment score is cumulative: it is obtained by adding cell values along a path through the array • The best alignment has the highest score, i.e. the maximum match • Maximum match = the largest sum of cell values over all pathways • The maximum match will ALWAYS be somewhere in the outer row or column shown • The alignment is constructed by working backwards from the maximum match
MP-RCLCQR-JNCBA
 | || | | | | |
-PBRCKC-RNJ-CJA
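As a rough sketch of the full procedure including the backward construction step, the Python below uses the common three-neighbour formulation of global alignment with a linear gap penalty, which is a simplification swapped in for the slide's subrow/subcolumn search. In this formulation the final score sits in the corner cell, which lies in both the outer row and the outer column. The function name and the tie-breaking order (diagonal first) are assumptions, so the alignment returned may differ from the one on the slide while having the same score.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=0):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        H[i][0] = H[i - 1][0] - gap
    for j in range(1, m + 1):
        H[0][j] = H[0][j - 1] - gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(H[i - 1][j - 1] + s, H[i - 1][j] - gap, H[i][j - 1] - gap)

    # Work backwards from the corner cell, preferring the diagonal on ties.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if H[i][j] == H[i - 1][j - 1] + s:
                top.append(a[i - 1]); bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and H[i][j] == H[i - 1][j] - gap:
            top.append(a[i - 1]); bottom.append("-")   # residue of a against a gap
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])   # residue of b against a gap
            j -= 1
    return H[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, row_a, row_b = needleman_wunsch("MPRCLCQRJNCBA", "PBRCKCRNJCJA")
print(score)
print(row_a)
print(row_b)
```

With match = +1, mismatch = 0 and no gap penalty, the score is simply the number of matched columns, which is 8 for these two sequences, the same count as the vertical bars in the slide's alignment.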
Needleman-Wunsch Algorithm (cont.) Statistical Significance • Maximum match is a function of sequence relationship and composition • We would like to know the probability of obtaining the result (maximum match) from a pair of random sequences • Estimate this experimentally • build a set of random sequences for each protein (i.e. with the same composition as the real protein) and form pairs by randomly drawing one member from each set • if the value found for the real proteins is significantly different from that for the random pairs, then the difference is a function of the sequences alone and not of their composition
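The experimental estimate could be sketched as below. This is an illustration under stated assumptions rather than the procedure from the slides: random sequences are produced by shuffling each real sequence (which preserves composition), each random pair is generated fresh instead of being drawn from two prebuilt sets, the score uses match = +1, mismatch = 0 and no gap penalty, and the names `max_match`, `shuffled`, `random_baseline`, the trial count and the seed are all hypothetical.

```python
import random
import statistics


def max_match(a, b):
    """Maximum match with match = +1, mismatch = 0 and gap penalty = 0."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(max(prev[j - 1] + (x == y), prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def shuffled(seq, rng):
    """Return a random permutation of seq (same composition, random order)."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


def random_baseline(a, b, trials=200, seed=0):
    """Maximum-match scores for `trials` pairs of shuffled sequences."""
    rng = random.Random(seed)
    return [max_match(shuffled(a, rng), shuffled(b, rng)) for _ in range(trials)]


seq_a, seq_b = "MPRCLCQRJNCBA", "PBRCKCRNJCJA"
real = max_match(seq_a, seq_b)
baseline = random_baseline(seq_a, seq_b)
mean = statistics.mean(baseline)
spread = statistics.pstdev(baseline)
print("real:", real, "random mean:", mean, "random sd:", spread)
if spread > 0:
    print("standard deviations above random mean:", (real - mean) / spread)
```

How large a difference counts as significant is a separate choice that this sketch does not make.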
Smith-Waterman Algorithm • Instead of looking at each sequence in its entirety this compares segments of all possible lengths (LOCAL alignments) and chooses whichever maximise the similarity measure • For every cell the algorithm calculates ALL possible paths leading to it. These paths can be of any length and can contain insertions and deletions
Smith-Waterman Algorithm (cont.) • Only works effectively when gap penalties are used • In the example shown: match = +1, mismatch = -1/3, gap penalty W_k = 1 + k/3 (k = extent of gap) • Start with all cell values = 0 • For each cell, look back along the cell's column and row (the subcolumn and subrow) and at the direct diagonal, and take the score that is highest once the similarity score or gap penalty is taken into account
$$H_{ij} = \max\left\{ H_{i-1,j-1} + s(a_i, b_j),\ \max_{k \ge 1}\{H_{i-k,j} - W_k\},\ \max_{l \ge 1}\{H_{i,j-l} - W_l\},\ 0 \right\}$$
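The recurrence on this slide can be sketched in Python as follows. The sketch reads the slide's gap term as a penalty W_k = 1 + k/3 that is subtracted from the score, which is an interpretation of the slide's notation; exact fractions are used so that thirds do not accumulate rounding error. The names `gap_penalty` and `sw_fill` are illustrative, and the test sequences are the two aligned segments from the later "Construct Alignment" slide with the gap removed.

```python
from fractions import Fraction


def gap_penalty(k):
    """Penalty W_k for a gap of length k (assumed reading: 1 + k/3)."""
    return 1 + Fraction(k, 3)


def sw_fill(a, b, match=Fraction(1), mismatch=Fraction(-1, 3)):
    """Fill H, where H[i][j] is the best local score ending at a[i-1], b[j-1]."""
    n, m = len(a), len(b)
    H = [[Fraction(0)] * (m + 1) for _ in range(n + 1)]   # all cells start at 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            diagonal = H[i - 1][j - 1] + s
            deletion = max(H[i - k][j] - gap_penalty(k) for k in range(1, i + 1))
            insertion = max(H[i][j - l] - gap_penalty(l) for l in range(1, j + 1))
            H[i][j] = max(diagonal, deletion, insertion, 0)   # 0 = stop
    return H


H = sw_fill("GCCAUUG", "GCCUCG")
print(max(max(row) for row in H))
```

Scanning the whole column and row for every cell makes this cubic in sequence length, which is one reason heuristic tools are used for database searches.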
Smith-Waterman Algorithm (cont.) • Four possible ways of forming a path For every residue in the query sequence 1. Align with next residue of db sequence … score is previous score plus similarity score for the two residues 2. Deletion (i.e. match residue of query with a gap) … score is previous score minus gap penalty dependent on size of gap 3. Insertion (i.e. match residue of db sequence with a gap) … score is previous score minus gap penalty dependent on size of gap 4. Stop … score is zero • Choose whichever of these is the highest
Smith-Waterman Algorithm (cont.) Construct Alignment • The score in each cell is the maximum possible score for an alignment of ANY LENGTH ending at those coordinates • Trace the pathway back from the highest scoring cell • This cell can be anywhere in the array • Align the highest scoring segment
GCC-UCG
GCCAUUG
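A possible sketch of the trace-back step is shown below. It assumes the example scores from the earlier slide (match = +1, mismatch = -1/3) and reads the gap term as a penalty W_k = 1 + k/3; the back-pointer table, the function names and the tie-breaking rule (diagonal preferred) are illustrative choices, not part of the original slides.

```python
from fractions import Fraction


def gap_penalty(k):
    """Penalty for a gap of length k (assumed reading: 1 + k/3)."""
    return 1 + Fraction(k, 3)


def smith_waterman(a, b, match=Fraction(1), mismatch=Fraction(-1, 3)):
    """Return (score, segment_a, segment_b) for the best local alignment."""
    n, m = len(a), len(b)
    H = [[Fraction(0)] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]   # None marks "stop"
    best, best_cell = Fraction(0), (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            options = [(H[i - 1][j - 1] + s, (i - 1, j - 1))]
            options += [(H[i - k][j] - gap_penalty(k), (i - k, j)) for k in range(1, i + 1)]
            options += [(H[i][j - l] - gap_penalty(l), (i, j - l)) for l in range(1, j + 1)]
            score, prev = max(options, key=lambda option: option[0])
            if score > 0:                      # otherwise the cell stays at 0
                H[i][j], back[i][j] = score, prev
            if H[i][j] > best:
                best, best_cell = H[i][j], (i, j)

    # Trace back from the highest scoring cell until a stop cell is reached.
    top, bottom = "", ""
    i, j = best_cell
    while back[i][j] is not None:
        pi, pj = back[i][j]
        if pi == i - 1 and pj == j - 1:        # aligned pair of residues
            top, bottom = a[i - 1] + top, b[j - 1] + bottom
        elif pj == j:                          # residues of a against a gap
            top, bottom = a[pi:i] + top, "-" * (i - pi) + bottom
        else:                                  # residues of b against a gap
            top, bottom = "-" * (j - pj) + top, b[pj:j] + bottom
        i, j = pi, pj
    return best, top, bottom


score, seg_a, seg_b = smith_waterman("GCCAUUG", "GCCUCG")
print(score)
print(seg_a)
print(seg_b)
```

For these two segments the sketch recovers the slide's alignment (GCCAUUG over GCC-UCG) with a score of 10/3: five matches, one mismatch and one gap of length 1.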
Differences
Needleman-Wunsch: 1. Global alignments 2. Requires the alignment score for a pair of residues to be >= 0 3. No gap penalty required 4. Score cannot decrease between two cells of a pathway
Smith-Waterman: 1. Local alignments 2. Residue alignment score may be positive or negative 3. Requires a gap penalty to work effectively 4. Score can increase, decrease or stay level between two cells of a pathway