BIC I, Week 6 lectures • Rhys Price Jones and Anne Haake • Rochester Institute of Technology • rpjavp@rit.edu, arh@it.rit.edu
Multiple Alignments • The computational aspects • Textbook Chapter 14 • An excellent gentle online tutorial: http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/mulali.html
Similarity to Pairwise Alignment • Pairwise alignment can be achieved via a dynamic programming technique that fills a 2-dimensional matrix • An alignment of three sequences can be achieved by applying the same technique and filling a 3-dimensional matrix • A Java visualization tool: http://bibiserv.techfak.uni-bielefeld.de/visualign/ • Aligning k sequences would require filling a k-dimensional matrix.
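As a reminder of the 2-dimensional case, here is a minimal sketch of the pairwise matrix fill (global alignment, score only). The function name `nw_score` and the default scores (match 2, mismatch -1, gap -2) are assumptions chosen for illustration, not something this slide fixes.

```python
def nw_score(s, t, match=2, mismatch=-1, gap=-2):
    """Fill the (len(s)+1) x (len(t)+1) matrix and return the best global score."""
    rows, cols = len(s) + 1, len(t) + 1
    F = [[0] * cols for _ in range(rows)]
    # First column and first row: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap
    # Each cell depends on its three already-filled neighbors.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = F[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            F[i][j] = max(diag, F[i - 1][j] + gap, F[i][j - 1] + gap)
    return F[-1][-1]

print(nw_score("ACTG", "ACTG"))  # 8
print(nw_score("ACTG", "AGG"))   # 1
```

With three sequences each cell would have seven predecessors instead of three, and with k sequences it would have 2^k - 1, which is part of why the k-dimensional version gets expensive so quickly.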
How do you score a multiple alignment? • Unaligned: ACTG, AGG, CCTG • Aligned: ACTG / A-GG / CCTG • Score each column as the sum of its pairwise scores, with, say, match 2, mismatch -1, gap -2 • column 1: AA AC AC, 2 + -1 + -1 = 0 • column 2: C- -C CC, -2 + -2 + 2 = -2 • column 3: TG GT TT, -1 + -1 + 2 = 0 • column 4: GG GG GG, 2 + 2 + 2 = 6 • overall score is then 0 + -2 + 0 + 6 = 4. • New feature: a column can contain two gaps, so a gap-gap pair needs a score too!
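The column-by-column arithmetic above can be sketched in a few lines of Python. The aligned rows ACTG, A-GG, CCTG are read off the column pairs on the slide. The function names are made up for illustration, and the gap-gap score of 0 is an assumption: the slide raises the question but does not give a value.

```python
from itertools import combinations

MATCH, MISMATCH, GAP = 2, -1, -2
GAP_GAP = 0  # assumption: the slide does not say how two aligned gaps score

def pair_score(a, b):
    """Score one pair of symbols taken from the same column."""
    if a == "-" and b == "-":
        return GAP_GAP
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH

def column_score(column):
    """Sum the scores of every pair of symbols in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))

def sum_of_pairs(alignment):
    """Sum the column scores over all columns of the alignment."""
    return sum(column_score(col) for col in zip(*alignment))

alignment = ["ACTG", "A-GG", "CCTG"]
print([column_score(col) for col in zip(*alignment)])  # [0, -2, 0, 6]
print(sum_of_pairs(alignment))                         # 4
```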
How do you score a multiple alignment? • Scoring Along a Tree: only the pairwise alignment scores of sequences that are neighbors in an (evolutionary) tree are summed to give the overall score of the multiple alignment.
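A minimal sketch of the idea, under loud assumptions: the tree is given as a hypothetical list of edges between row indices of the alignment, any ancestral sequences are assumed to be included as rows, and the pairwise scores (match 2, mismatch -1, gap -2, gap-gap 0) are illustrative values rather than anything this slide specifies.

```python
def pair_score(a, b, match=2, mismatch=-1, gap=-2, gap_gap=0):
    """Score one pair of aligned symbols (all values are assumed)."""
    if a == "-" and b == "-":
        return gap_gap
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def row_pair_score(row_a, row_b):
    """Score the pairwise alignment induced by two rows of the alignment."""
    return sum(pair_score(a, b) for a, b in zip(row_a, row_b))

def tree_score(alignment, edges):
    """Sum pairwise scores only over rows that are neighbors in the tree."""
    return sum(row_pair_score(alignment[i], alignment[j]) for i, j in edges)

alignment = ["ACTG", "A-GG", "CCTG"]
# Hypothetical tree: row 0 is adjacent to rows 1 and 2; rows 1 and 2 are not neighbors.
print(tree_score(alignment, [(0, 1), (0, 2)]))          # 6
# Using every pair as an "edge" gives the sum-of-pairs score instead.
print(tree_score(alignment, [(0, 1), (0, 2), (1, 2)]))  # 4
```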
How do you score a multiple alignment? • Minimum Entropy • Basic idea is that the fewer bits necessary to specify a column, the better the score
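One common way to make "bits per column" concrete is the Shannon entropy of the symbol frequencies in each column; the sketch below assumes that reading. Treating the gap as just another symbol and summing the column entropies are both assumptions made for illustration. Lower totals mean better alignments.

```python
from collections import Counter
from math import log2

def column_entropy(column):
    """Shannon entropy (bits per symbol) of the symbols in one column."""
    total = len(column)
    return sum(-(n / total) * log2(n / total) for n in Counter(column).values())

def alignment_entropy(alignment):
    """Sum of column entropies; lower means more conserved columns."""
    return sum(column_entropy(col) for col in zip(*alignment))

alignment = ["ACTG", "A-GG", "CCTG"]
for col in zip(*alignment):
    print("".join(col), round(column_entropy(col), 3))
print(round(alignment_entropy(alignment), 3))
```

A fully conserved column such as GGG needs 0 bits, while a column split evenly between two symbols needs 1 bit per symbol.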
How do you score a multiple alignment? • In summary • There are many scoring functions used to evaluate combinations of residues either in single edit operations, or whole pairwise comparisons, or whole multiple columns. • Be sure you know what you’re doing. • We’ll look soon at options for CLUSTAL-W
The Dynamic Programming Approach • Analysis: k sequences of length n require filling an n^k-element multidimensional array. • To compare 1000-nucleotide putative genes in 12 species, the array would have 1000^12 = 10^36 entries. • mega, giga, tera, peta, exa, zetta, yotta... (and yotta is only 10^24) • http://www.sdsc.edu/GatherScatter/gsq394/gsq3_f1.html
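The arithmetic is easy to check, reading the array size as n to the power k. The helper name `dp_cells` is made up for this sketch, and it ignores the extra row per axis that a real matrix needs for the empty prefix.

```python
def dp_cells(n, k):
    """Cells in a k-dimensional array with n entries along each axis."""
    return n ** k

YOTTA = 10 ** 24  # the largest prefix in the slide's list

cells = dp_cells(1000, 12)
print(cells == 10 ** 36)  # True
print(cells // YOTTA)     # 1000000000000
```

Even counted in units of yotta-cells, the array for 12 sequences of length 1000 has a trillion of them.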
Heuristic • Try to find a smallish area within that huge multidimensional array and fill only that • The Carrillo-Lipman bound is discussed on page 19 of http://www.techfak.uni-bielefeld.de/bcd/Curric/MulAli/mulali.html • But even so...
Practical Algorithms • like ClustalW or T-Coffee • compute all pairwise alignment scores • from those, create a guide tree • following the guide tree, successively align pairs of sequences and already-computed alignments until one large multiple alignment remains
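The first two steps can be sketched as follows. This is a simplification under stated assumptions: pairwise scores come from a plain global alignment with illustrative scores (match 2, mismatch -1, gap -2), and the guide tree is built by greedy average-linkage clustering on those scores. ClustalW itself builds its guide tree with neighbor joining, so this shows the shape of the procedure, not that program's method. The names `nw_score` and `guide_tree` are made up here.

```python
from itertools import combinations

def nw_score(s, t, match=2, mismatch=-1, gap=-2):
    """Global pairwise alignment score via a 2-dimensional matrix fill."""
    F = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        F[i][0] = i * gap
    for j in range(1, len(t) + 1):
        F[0][j] = j * gap
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            diag = F[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            F[i][j] = max(diag, F[i - 1][j] + gap, F[i][j - 1] + gap)
    return F[-1][-1]

def guide_tree(seqs):
    """Return a nested tuple of sequence indices giving the merge order."""
    # Step 1: all pairwise alignment scores.
    score = {frozenset((i, j)): nw_score(seqs[i], seqs[j])
             for i, j in combinations(range(len(seqs)), 2)}

    def average_score(pair):
        (_, members_a), (_, members_b) = pair
        total = sum(score[frozenset((i, j))] for i in members_a for j in members_b)
        return total / (len(members_a) * len(members_b))

    # Step 2: repeatedly join the two most similar clusters.
    clusters = [(i, [i]) for i in range(len(seqs))]  # (subtree, member indices)
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=average_score)
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append(((a[0], b[0]), a[1] + b[1]))
    return clusters[0][0]

print(guide_tree(["ACTG", "AGG", "CCTG"]))  # (1, (0, 2))
```

The third step would walk this tree from the leaves upward, aligning the two children at each node (sequence to sequence, sequence to alignment, or alignment to alignment) until the root holds the full multiple alignment.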
Pro: • fast, even with many sequences and long sequences • produce good alignments of subfamily motifs
Con: • early errors persist – they don’t go away • a possibly erroneous guide tree retains its perfidious influence in the alignment and may prejudice use of the alignment for phylogeny studies • it’s hard to know how good or bad the alignment is
We will return to these issues
In the meantime • Let’s look at the help for ClustalX • http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/Top.html • In particular we’ll look at the options and why they exist