Multiple alignment: Feng-Doolittle algorithm

Why multiple alignments? • Alignment of more than two sequences • Usually gives better information about conserved regions and function (more data) • Better estimate of significance when using a sequence of unknown function • Must use multiple alignments when establishing phylogenetic relationships

Dynamic programming extended to many dimensions? • No – uses up too much computer time and space • E.g. 200 amino acids in a pairwise alignment – must evaluate 4 × 10^4 matrix elements • If 3 sequences, 8 × 10^6 matrix elements • If 6 sequences, 6.4 × 10^13 matrix elements
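The matrix sizes quoted here follow from one cell per combination of positions, i.e. L^N cells for N sequences of length L. A small Python check, assuming every sequence is exactly 200 residues long and ignoring the extra boundary row and column that a real implementation adds (the function name `dp_cells` is made up for this illustration):

```python
def dp_cells(seq_len: int, n_seqs: int) -> int:
    """Cells in a full dynamic-programming matrix for n_seqs sequences of equal length."""
    return seq_len ** n_seqs

# Reproduce the three cases from the slide: 2, 3 and 6 sequences of 200 residues.
for n in (2, 3, 6):
    print(f"{n} sequences: {dp_cells(200, n):.1e} matrix elements")
```

Each added sequence multiplies the work by the sequence length, which is why the exact method stops being practical after a handful of sequences.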

Need to find a more efficient method • Give up the guarantee of an optimal alignment in exchange for a good alignment found much faster

Feng-Doolittle algorithm • Does all pairwise alignments and scores them • Converts pairwise scores to “distances” • D = -log Seff = -log [(Sobs - Srand)/(Smax - Srand)] • Sobs = pairwise alignment score • Srand = expected score for random alignment • Smax = average of self-alignments of the two sequences

As Sobs approaches Srand (increasing evolutionary distance), Seff goes down toward zero; taking the -log makes the distance measure positive, so that it grows as similarity falls
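The score-to-distance conversion can be sketched in a few lines of Python. This is only an illustration: the function name and the example scores are invented, and the natural logarithm is an assumption, since the slides do not state the base.

```python
import math

def feng_doolittle_distance(s_obs: float, s_rand: float, s_max: float) -> float:
    """D = -log(S_eff), with S_eff = (S_obs - S_rand) / (S_max - S_rand)."""
    s_eff = (s_obs - s_rand) / (s_max - s_rand)
    if s_eff <= 0:
        # The observed score is no better than random, so the log is undefined.
        raise ValueError("S_eff must be positive to take the logarithm")
    return -math.log(s_eff)

# Made-up scores: identical sequences score S_max, more diverged pairs score lower.
for s_obs in (100, 60, 30):
    print(s_obs, feng_doolittle_distance(s_obs, s_rand=20, s_max=100))
```

With these numbers the distance is zero when the observed score equals Smax and rises as the observed score drops toward Srand.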

Once the distances have been calculated, construct a guide tree (more in the phylogeny class) – tells what order to group the sequences • Sequences can be aligned with sequences or groups; groups can be aligned with groups
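To make the idea of a merge order concrete, here is a minimal Python sketch that turns a distance table into a list of grouping steps. It uses plain average-linkage clustering as a stand-in, because the slides do not say which tree-building method is used; the function names and the example distances are hypothetical.

```python
from itertools import combinations

def average_distance(a, b, dist):
    """Mean of all sequence-to-sequence distances between clusters a and b."""
    return sum(dist[frozenset((x, y))] for x in a for y in b) / (len(a) * len(b))

def guide_tree_order(names, dist):
    """Return the merges (cluster_a, cluster_b) in the order they are made.

    dist maps frozenset({name_i, name_j}) to the distance between two sequences.
    Clusters are tuples of sequence names.
    """
    clusters = [(n,) for n in names]
    merges = []
    while len(clusters) > 1:
        # Join the two closest clusters next.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_distance(pair[0], pair[1], dist))
        merges.append((a, b))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Made-up distances for three sequences.
dist = {
    frozenset(("Seq1", "Seq2")): 0.1,
    frozenset(("Seq1", "Seq3")): 0.9,
    frozenset(("Seq2", "Seq3")): 0.8,
}
for step in guide_tree_order(["Seq1", "Seq2", "Seq3"], dist):
    print(step)
```

The first merge is a sequence-sequence step; later merges can be sequence-group or group-group steps, depending on what the two clusters contain.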

Sequence-sequence alignments: dynamic programming • Sequence-group alignments: all possible pairwise alignments between sequence and group are tried, highest scoring pair is how it gets aligned to group • Group-group alignments: all possible pairwise alignments of sequences between groups are tried; highest scoring pair is how groups get aligned
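The "highest scoring pair" rule can be sketched as follows. This is an assumption-laden illustration, not the Pileup implementation: it scores pairs with a basic Needleman-Wunsch recursion using invented match, mismatch and linear gap values, and it strips existing gap characters before scoring, which is a simplification chosen for this example.

```python
from itertools import product

def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Global alignment score (Needleman-Wunsch) with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def best_pair(group_a, group_b):
    """Pick the highest-scoring pair with one sequence taken from each group."""
    return max(
        product(group_a, group_b),
        key=lambda p: nw_score(p[0].replace("-", ""), p[1].replace("-", "")),
    )

# Toy groups; a single sequence is simply a group of size one.
group_a = ["ACGT", "AC-T"]
group_b = ["TTTT", "ACGA"]
print(best_pair(group_a, group_b))
```

The chosen pair then determines where gaps are placed when the two groups are joined.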

Example [figure: diagram with sequences Seq5, Seq3, Seq4, Seq1 and Seq2; intermediate steps labelled Alignment 1, Alignment 2 and Alignment 3; result labelled Final alignment]

Notice that this method does not guarantee the optimum alignment; just a good one. Gaps are preserved from alignment to alignment: “once a gap, always a gap”
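One way to picture "once a gap, always a gap": when a later alignment step needs a new gap in a group, the gap is added as a whole column to every row, and gaps that are already there are never removed. A minimal Python sketch, with a hypothetical helper name and toy sequences:

```python
def insert_gap_columns(group, positions):
    """Insert a gap column before each listed column index in every row of the group."""
    rows = []
    for row in group:
        chars = list(row)
        # Work from right to left so earlier insertions do not shift later indices.
        for pos in sorted(positions, reverse=True):
            chars.insert(pos, "-")
        rows.append("".join(chars))
    return rows

group = ["AC-GT", "ACTGT"]              # already aligned; row 1 carries an earlier gap
print(insert_gap_columns(group, [2]))   # new gap column added to both rows
```

The earlier gap in the first row is still present after the new column is added, and all rows keep the same length.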

In-class exercise • Retrieve sequences from multalign.apr into BioScout • Run Gap in BioScout on all combinations of the sequences in multalign.apr; use a gap penalty of 6 and an extension penalty of 2 • Record alignment scores of each pairwise comparison • Save pairwise alignments

In-class exercise, cont. • Use raw alignment scores as distance measures; make a guide tree based on these scores • In Vector NTI, select all sequences in multalign.apr (in the sequence pane); choose Alignment from the toolbar at the top; choose Alignment Setup from the pulldown; choose multiple alignment; take the defaults, choose ok; choose Alignment again, this time choose Align Selected Sequences from the pulldown

In-class exercise, cont. • Note that ClustalW does some other things that the Pileup program discussed on the tape does not; we are going to ignore those things for the moment • Compare ClustalW’s guide tree (visible in the Phylogenetic Tree Pane – tab at bottom of window) with yours

In-class exercise, cont. • Carefully examine ClustalW’s alignment; compare it to the individual pairwise alignments you saved. Are there differences?

Start refining alignment: • Use structural info if you have it • Find patterns if you don’t • Use amino acid structure handout from beginning of class for substitution decisions!

ClustalW • Most widely used multiple alignment method • Similar strategy to the Feng-Doolittle approach implemented as Pileup, but more complex and gives generally superior results • Ad hoc nature of the program can be mysterious

Advantageous differences • Gap penalties vary locally: • By observed frequency (in database) after each residue • By simple structure prediction – lower gap penalties in probable loop regions • By proximity to existing gaps – higher gap penalties when within 8 residues of an existing gap
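The third rule, raising the penalty near existing gaps, can be illustrated with a short Python sketch. The linear scaling and the `max_boost` parameter are invented for this example and are not ClustalW's actual formula; only the window of 8 residues comes from the slide, and "within 8" is read here as a distance of fewer than 8 positions.

```python
def local_gap_open(base_penalty, pos, gap_positions, window=8, max_boost=2.0):
    """Gap-opening penalty at column pos, raised when an existing gap is nearby.

    Illustrative scaling only: the penalty grows linearly from base_penalty
    (at distance >= window) to base_penalty * (1 + max_boost) at distance 0.
    """
    if not gap_positions:
        return base_penalty
    distance = min(abs(pos - g) for g in gap_positions)
    if distance >= window:
        return base_penalty
    return base_penalty * (1.0 + max_boost * (window - distance) / window)

existing_gaps = [5]
for pos in (5, 9, 20):
    print(pos, local_gap_open(10, pos, existing_gaps))
```

The effect is to discourage opening many separate gaps close together, so gaps tend to be spaced apart along the alignment.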

Advantages, cont. • Change in substitution matrix choice depending on distance computed for guide tree • Substitution matrix families • Profile construction (more later) • Weighting of sequences in profiles depending on evolutionary distance computed for guide tree • More similar sequences get less weight than less similar sequences

In-class exercise II • Change a few parameters in the ClustalW program (gap, gap extension, substitution matrix, etc.) one at a time: this is done in Alignment Setup. After each run with a different change, save the alignment project with some descriptive name that you can remember (e.g., gap20 or blosum) • Compare alignment results with different parameters changed

MultAlin • MultAlin is also a heuristic algorithm that builds up a multiple alignment from a group of pairwise alignments • It differs from Pileup and Clustal in that the guide tree is recalculated based on the results of each alignment step • Because this leads to cycles of tree building and alignment, MultAlin can take a long time to run. It stops once the overall alignment score no longer improves

Scoring a multiple sequence alignment • Assumptions: • Sequences (rows) independent • Positions (columns) independent • Neither assumption is true … • Score of a column is the (possibly weighted) sum of all the pairwise comparisons (i.e., substitution matrix values) within that column • Score of a multiple alignment is the sum of scores for all columns
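The column-by-column scoring described here can be sketched in Python as an unweighted sum over all pairs. The toy substitution function (+1 for identity, -1 otherwise) and the choice to score any pair involving a gap as 0 are assumptions made for this example; a real run would use a substitution matrix and a gap scheme.

```python
from itertools import combinations

def column_score(column, subst, gap_score=0):
    """Sum of substitution scores over all pairs of residues in one column."""
    total = 0
    for a, b in combinations(column, 2):
        if a == "-" or b == "-":
            total += gap_score   # assumption: pairs involving a gap get a flat score
        else:
            total += subst(a, b)
    return total

def sum_of_pairs(alignment, subst, gap_score=0):
    """Score of a multiple alignment: the sum of its column scores."""
    return sum(column_score(col, subst, gap_score) for col in zip(*alignment))

def toy_subst(a, b):
    """Stand-in for a substitution matrix lookup."""
    return 1 if a == b else -1

alignment = ["ACG",
             "ACT",
             "A-G"]
print(sum_of_pairs(alignment, toy_subst))
```

Sequence weights, where used, would multiply each pairwise term before it is added to the column total.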
