
Multiple alignment: Feng-Doolittle algorithm



  1. Multiple alignment: Feng-Doolittle algorithm

  2. Why multiple alignments? • Alignment of more than two sequences • Usually gives better information about conserved regions and function (more data) • Better estimate of significance when using a sequence of unknown function • Must use multiple alignments when establishing phylogenetic relationships

  3. Dynamic programming extended to many dimensions? • No – uses up too much computer time and space • E.g. 200 amino acids in a pairwise alignment – must evaluate 4 × 10^4 matrix elements • If 3 sequences, 8 × 10^6 matrix elements • If 6 sequences, 6.4 × 10^13 matrix elements
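The growth in matrix size can be checked with a few lines of Python. This is only a sketch of the arithmetic on the slide: it assumes all sequences have the same length and counts length^n cells, ignoring the extra boundary row and column that a real dynamic programming matrix carries. The function name `dp_cells` is made up for illustration.

```python
def dp_cells(seq_len: int, n_seqs: int) -> int:
    """Cells in a full DP matrix for n_seqs sequences of equal length.

    Simplification (assumption): counts seq_len ** n_seqs, as the slide does,
    and ignores the extra boundary row/column of a real DP matrix.
    """
    return seq_len ** n_seqs


for n in (2, 3, 6):
    print(f"{n} sequences of 200 residues: {dp_cells(200, n):.1e} cells")
```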

  4. Need to find a more efficient method • Give up the guarantee of an optimal alignment in exchange for a good alignment that is found much faster

  5. Feng-Doolittle algorithm • Does all pairwise alignments and scores them • Converts pairwise scores to “distances” • D = -log S_eff = -log [(S_obs – S_rand)/(S_max – S_rand)] • S_obs = pairwise alignment score • S_rand = expected score for a random alignment • S_max = average of the self-alignment scores of the two sequences

  6. As S_obs approaches S_rand (increasing evolutionary distance), S_eff falls toward 0; taking the -log turns this into a positive distance that grows as the sequences diverge
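The score-to-distance conversion can be sketched as below. The function name and the example scores are invented for illustration; the formula is the one given on the slide. The sketch assumes the observed score lies above the random score, since otherwise the logarithm is undefined.

```python
import math


def feng_doolittle_distance(s_obs: float, s_rand: float, s_max: float) -> float:
    """D = -log S_eff, with S_eff = (S_obs - S_rand) / (S_max - S_rand)."""
    if s_max <= s_rand:
        raise ValueError("s_max must exceed s_rand")
    s_eff = (s_obs - s_rand) / (s_max - s_rand)
    if s_eff <= 0:
        # The pair scores no better than random, so the log is undefined.
        raise ValueError("s_obs must exceed s_rand")
    return -math.log(s_eff)


# Made-up scores: the lower the observed score, the larger the distance.
for s_obs in (100, 60, 25):
    print(s_obs, round(feng_doolittle_distance(s_obs, s_rand=20, s_max=100), 3))
```

Identical sequences give S_obs = S_max, so S_eff = 1 and the distance is 0.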

  7. Once the distances have been calculated, construct a guide tree (more in the phylogeny class) – tells what order to group the sequences • Sequences can be aligned with sequences or groups; groups can be aligned with groups
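How a guide tree fixes the grouping order can be illustrated with a small clustering sketch. The slides do not say which tree-building method is used, so average-linkage clustering (UPGMA-style) is swapped in here purely as a simple stand-in. The function name, the input format, and the distances are all assumptions.

```python
from itertools import combinations


def guide_tree_order(names, dist):
    """Return the merge order from average-linkage clustering (an assumption).

    names: unique sequence names.
    dist:  dict mapping frozenset({a, b}) -> distance between sequences a and b.
    Each merge is a pair of clusters; a cluster is a tuple of sequence names.
    """
    clusters = [(name,) for name in names]
    merges = []

    def average_distance(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Join the two closest clusters next.
        c1, c2 = min(combinations(clusters, 2), key=lambda p: average_distance(*p))
        merges.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Made-up distances for four sequences.
d = {frozenset(p): v for p, v in {
    ("A", "B"): 1.0, ("C", "D"): 2.0,
    ("A", "C"): 8.0, ("A", "D"): 8.0, ("B", "C"): 8.0, ("B", "D"): 8.0,
}.items()}
for left, right in guide_tree_order(["A", "B", "C", "D"], d):
    print(left, "+", right)
```

With these distances the order is sequence with sequence (A with B, then C with D), followed by group with group.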

  8. Sequence-sequence alignments: dynamic programming • Sequence-group alignments: all possible pairwise alignments between sequence and group are tried, highest scoring pair is how it gets aligned to group • Group-group alignments: all possible pairwise alignments of sequences between groups are tried; highest scoring pair is how groups get aligned
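The two ideas on this slide can be sketched together: a global dynamic programming score for one pair, and a search over a group for the highest-scoring partner. The scoring values (match 1, mismatch -1, linear gap -2) are placeholders rather than a real substitution matrix, and group members are passed as ungapped strings for simplicity. Both function names are hypothetical.

```python
def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Global alignment score (Needleman-Wunsch) with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]  # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def best_pair(seq: str, group: list) -> str:
    """Group member whose pairwise alignment with seq scores highest."""
    return max(group, key=lambda member: nw_score(seq, member))


print(nw_score("ACGT", "AGT"))
print(best_pair("ACGT", ["TTTT", "ACGA", "AC"]))
```

For a group-group step the same search would run over every pair drawn from the two groups.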

  9. Example [figure: guide tree with leaves Seq5, Seq3, Seq4, Seq1, Seq2; nodes labeled Alignment 1, Alignment 2, Alignment 3 and Final alignment]

  10. Notice that this method does not guarantee the optimum alignment, only a good one. Gaps are preserved from alignment to alignment: “once a gap, always a gap”
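One way to see “once a gap, always a gap” in action is the sketch below. It assumes a convention in which gaps already fixed in a group are written as the neutral character X, while gaps opened by the newest pairwise alignment are written as -. The function name and the input format are illustrative, not taken from the slides.

```python
def add_to_group(group_rows, anchor_index, aligned_anchor, aligned_new):
    """Add a new sequence to a group while keeping every existing gap.

    group_rows:     equal-length rows; earlier gaps are frozen as 'X' (assumption).
    anchor_index:   which group row was aligned against the new sequence.
    aligned_anchor: that row after the pairwise step; '-' marks new gap columns.
    aligned_new:    the new sequence after the pairwise step.
    """
    if len(aligned_anchor) != len(aligned_new):
        raise ValueError("pairwise alignment rows differ in length")
    if aligned_anchor.replace("-", "") != group_rows[anchor_index]:
        raise ValueError("aligned_anchor does not match the group row")
    merged = [[] for _ in group_rows]
    col = 0
    for ch in aligned_anchor:
        if ch == "-":
            # New gap column: every group row receives the gap.
            for row in merged:
                row.append("-")
        else:
            # Existing column, copied unchanged (old X gaps included).
            for row, src in zip(merged, group_rows):
                row.append(src[col])
            col += 1
    rows = ["".join(r) for r in merged] + [aligned_new]
    # Freeze all gaps so later steps cannot remove them.
    return [r.replace("-", "X") for r in rows]


print(add_to_group(["ACXGT", "ACTGT"], 0, "ACXG-T", "ACTGAT"))
```

Columns are only ever added, never removed, so an early misplaced gap persists into the final alignment.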

  11. In-class exercise • Retrieve sequences from multalign.apr into BioScout • Run Gap in BioScout on all combinations of the sequences in multalign.apr; use a gap penalty of 6 and an extension penalty of 2 • Record alignment scores of each pairwise comparison • Save pairwise alignments

  12. In class exercise, cont. • Use raw alignment scores as distance measures; make a guide tree based on these scores • In Vector NTI, select all sequences in multalign.apr (in the sequence pane); choose Alignment from the toolbar at the top; choose Alignment Setup from the pulldown; choose multiple alignment; take the defaults, choose ok; choose Alignment again, this time choose Align Selected Sequences from the pulldown

  13. In class exercise, cont. • Note that ClustalW does some other things that the Pileup program discussed on the tape does not; we are going to ignore those things for the moment • Compare ClustalW’s guide tree (visible in the Phylogenetic Tree Pane – tab at bottom of window) with yours

  14. In class exercise, cont • Carefully examine ClustalW’s alignment; compare it to the individual pairwise alignments you saved. Are there differences?

  15. Start refining alignment: • Use structural info if you have it • Find patterns if you don’t • Use amino acid structure handout from beginning of class for substitution decisions!

  16. ClustalW • Most widely used multiple alignment method • Similar strategy to the Feng-Doolittle approach implemented as Pileup, but more complex and gives generally superior results • Ad hoc nature of the program can be mysterious

  17. Advantageous differences • Gap penalties vary locally: • By observed frequency (in database) after each residue • By simple structure prediction – lower gap penalties in probable loop regions • By proximity to existing gaps – higher gap penalties when within 8 residues of an existing gap
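The third rule, higher gap penalties near existing gaps, can be illustrated with a toy function. This is not ClustalW's actual formula: the base penalty, the flat doubling, and the choice to leave gap-containing columns at the base penalty are all assumptions for illustration. Only the 8-residue window comes from the slide.

```python
def local_gap_open_penalties(column_has_gap, base_penalty=10.0, window=8, boost=2.0):
    """Per-column gap-opening penalties, raised near existing gaps.

    column_has_gap: one bool per alignment column (True if any row has a gap).
    Columns without a gap that lie within `window` columns of a gap get
    base_penalty * boost; all other columns keep base_penalty.
    The flat boost is a placeholder, not ClustalW's real scaling.
    """
    gap_columns = [i for i, has_gap in enumerate(column_has_gap) if has_gap]
    penalties = []
    for i, has_gap in enumerate(column_has_gap):
        near_gap = (not has_gap) and any(abs(i - g) <= window for g in gap_columns)
        penalties.append(base_penalty * boost if near_gap else base_penalty)
    return penalties


columns = [False] * 20
columns[10] = True  # a single existing gap at column 10
print(local_gap_open_penalties(columns))
```

The other two rules (residue-specific and loop-region adjustments) would modify the same per-column values using lookup tables that are not reproduced here.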

  18. Advantages, cont. • Change in substitution matrix choice depending on distance computed for guide tree • Substitution matrix families • Profile construction (more later) • Weighting of sequences in profiles depending on evolutionary distance computed for guide tree • More similar sequences get less weight than less similar sequences

  19. In class exercise II • Change a few parameters in the ClustalW program (gap, gap extension, substitution matrix, etc.) one at a time: this is done in Alignment Setup. After each run with a different change, save the alignment project with some descriptive name that you can remember (e.g., gap20 or blosum) • Compare alignment results with different parameters changed

  20. MultAlin • MultAlin is also a heuristic algorithm that builds up a multiple alignment from a group of pairwise alignments • It differs from Pileup and Clustal in that the guide tree is recalculated based on the results of each alignment step • Because this leads to cycles of tree building and alignment, MultAlin can take a long time to run. It stops after the overall alignment score stops improving

  21. Scoring a multiple sequence alignment • Assumptions: • Sequences (rows) independent • Positions (columns) independent • Neither assumption is true … • Score of a column is the (possibly weighted) sum of all the pairwise comparisons (i.e., substitution matrix values) within that column • Score of a multiple alignment is the sum of scores for all columns
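The column-by-column scoring described above is commonly called sum-of-pairs scoring, and a minimal sketch follows. The match/mismatch/gap values stand in for a real substitution matrix, and weighting each pair by the product of its two sequence weights is one possible choice rather than something the slide specifies.

```python
from itertools import combinations


def pair_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Placeholder for a substitution-matrix lookup (values are assumptions)."""
    if a == "-" and b == "-":
        return 0  # two gaps facing each other contribute nothing
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(rows, weights=None) -> float:
    """Sum over columns of the weighted sum over all row pairs."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("alignment rows must have equal length")
    if weights is None:
        weights = [1.0] * len(rows)
    total = 0.0
    for column in zip(*rows):
        for i, j in combinations(range(len(rows)), 2):
            total += weights[i] * weights[j] * pair_score(column[i], column[j])
    return total


alignment = ["AC-", "ACG", "A-G"]
print(sum_of_pairs(alignment))
print(sum_of_pairs(alignment, weights=[1.0, 1.0, 0.0]))
```

Setting a sequence's weight to 0 removes every pair that involves it, which makes the effect of the weights easy to check by hand.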
