Multiple sequence comparison (MSC) • Reading: Setubal/Meidanis, section 3.4; Gusfield, Algorithms on Strings, Trees and Sequences, chapter 14
Why care about similarity? • Similar sequences have similar structure
Similar structure -> similar sequence? • No, the converse is not true! • Convergent evolution. Outwardly similar solutions to similar problems may be internally different. • Tiger and ‘Tasmanian tiger’. Fish and dolphin. Bat and bird. • Same is true of molecular ‘species’ and ‘anatomies’!
Sequence --> function • Similar sequences have similar function • ‘[T]he same genes that work in flies are the ones that work in humans.’ -- Eric Wieschaus, 1995 Nobel Prize for Drosophila work
Common origins • Similar sequences have common origins • ‘Descent with modification’ is Nature’s design mechanism • Strong similarity may imply recent common origin (what do we mean by ‘strong’ and ‘recent’?) • Strong similarity may imply strong conservation of sequence or motif
Is multiple sequence comparison a generalization? • From a CS point of view, we’re going from two strings to many strings, a generalization • Yes, in that it helps detect faint similarities • No, in that we go from known biological similarity to suspected sequence similarity
‘Big’ uses for MSC • Represent protein families • Identify conserved sequence features • Deduce evolutionary history
Profile representation • Definition: Given a multiple alignment of a set of strings, a profile specifies for each column the frequency of each character
Profile example
Alignment:
a b c - a
a b a b a
a c c b -
c b - b c
Profile:
   C1   C2   C3   C4   C5
a  .75       .25       .50
b       .75       .75
c  .25  .25  .50       .25
-            .25  .25  .25
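The profile above can be recomputed mechanically from the alignment. The following is a minimal sketch, not taken from the slides: the function name `build_profile` is hypothetical, and the gap character `-` is counted like any other character, as in the example.

```python
from collections import Counter

def build_profile(rows):
    """Return one {character: frequency} dict per alignment column."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all alignment rows must have the same length")
    profile = []
    for column in zip(*rows):          # walk the alignment column by column
        counts = Counter(column)       # the gap '-' is counted like a letter
        profile.append({ch: n / len(rows) for ch, n in counts.items()})
    return profile

# The four-row alignment from the slide.
rows = ["abc-a", "ababa", "accb-", "cb-bc"]
for j, col in enumerate(build_profile(rows), start=1):
    print(f"C{j}", dict(sorted(col.items())))
```

Each column's frequencies sum to 1, which is a quick sanity check on a profile built this way.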
Fit string S to profile P • Given a profile P and a string S, what is the best alignment (fit) of S to P? • Example:
S: a a b - b c
P: 1 - 2 3 4 5
Two key issues • How to score an alignment of a string to a profile • How to compute an optimal alignment, given a scoring system
Scoring and alignment of profile • Scoring: Assuming letter-to-letter scores are given, score a character against a column as the frequency-weighted sum of its letter-to-letter scores • Optimal alignment: By DP, similar to string-to-string (S-S) optimal alignment • Q: How would you do profile-to-profile scoring and alignment?
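A minimal sketch of string-to-profile scoring and the DP fit follows. The letter-to-letter scores (match +2, gap against gap 0, everything else -1) are an assumption made only for illustration, and the names `pair_score`, `column_score` and `fit_score` are hypothetical. The recurrence is the usual two-string one, with a profile column standing in for the second string's character.

```python
GAP = "-"

def pair_score(x, y):
    # Assumed scores for illustration: match +2, (-,-) 0, anything else -1.
    if x == y:
        return 0.0 if x == GAP else 2.0
    return -1.0

def column_score(x, column):
    # Weighted sum: each character in the column contributes in
    # proportion to its frequency.
    return sum(freq * pair_score(x, y) for y, freq in column.items())

def fit_score(s, profile):
    """Best score of aligning string s to the profile (a list of columns)."""
    n, m = len(s), len(profile)
    V = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):              # s[i-1] opposite a space in P
        V[i][0] = V[i - 1][0] + pair_score(s[i - 1], GAP)
    for j in range(1, m + 1):              # column j opposite a space in s
        V[0][j] = V[0][j - 1] + column_score(GAP, profile[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + column_score(s[i - 1], profile[j - 1]),
                V[i - 1][j] + pair_score(s[i - 1], GAP),
                V[i][j - 1] + column_score(GAP, profile[j - 1]),
            )
    return V[n][m]

# The five-column profile from the earlier example slide.
PROFILE = [
    {"a": 0.75, "c": 0.25},
    {"b": 0.75, "c": 0.25},
    {"a": 0.25, "c": 0.50, "-": 0.25},
    {"b": 0.75, "-": 0.25},
    {"a": 0.50, "c": 0.25, "-": 0.25},
]
print(fit_score("aabbc", PROFILE))
```

Under these assumed scores, the alignment shown on the "Fit string S to profile P" slide scores 1.75, so the optimal fit returned here is at least that. For profile-to-profile alignment, one plausible extension is to replace `column_score` with a double weighted sum over both columns.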
Signature (motif) representation • A motif is a regular expression (re) • Example: a helicase motif [&H][&A]D[DE]xn[TSN][x4][QK]Gx7[&A], where • [abc] = any of a, b, c • & = [ILVMFYW] • x = any amino acid • a3 = up to 3 a’s • an = any number of a’s • Find a motif by grep-ing
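The motif notation above maps almost directly onto ordinary regular expressions. The sketch below is a hypothetical translator (`motif_to_regex` is not from the slides) and it assumes amino acids are written as uppercase one-letter codes, that `x` matches any uppercase letter, and that a count follows its unit unbracketed (as in `x7`). It is demonstrated on a short made-up motif, not on the helicase motif itself.

```python
import re

HYDROPHOBIC = "ILVMFYW"                        # the '&' class from the slide
UNIT = re.compile(r"\[([A-Z&]+)\]|([A-Zx&])")  # [class], letter, x or &
COUNT = re.compile(r"\d+|n")                   # 'a3' or 'an' style suffix

def motif_to_regex(motif):
    """Translate the slide's motif notation into a Python regex string."""
    parts, i = [], 0
    while i < len(motif):
        m = UNIT.match(motif, i)
        if m is None:
            raise ValueError(f"unexpected character at position {i}")
        if m.group(1) is not None:             # bracketed class, expand '&'
            unit = "[" + m.group(1).replace("&", HYDROPHOBIC) + "]"
        elif m.group(2) == "x":                # any amino acid (assumed A-Z)
            unit = "[A-Z]"
        elif m.group(2) == "&":
            unit = "[" + HYDROPHOBIC + "]"
        else:                                  # a literal amino acid
            unit = m.group(2)
        i = m.end()
        c = COUNT.match(motif, i)
        if c is not None:                      # 'n' = any number, k = up to k
            unit += "*" if c.group() == "n" else "{0,%s}" % c.group()
            i = c.end()
        parts.append(unit)
    return "".join(parts)

# A made-up motif and sequence, used only to exercise the translator.
pattern = motif_to_regex("[&H]D[DE]x3[QK]G")
hit = re.search(pattern, "MSLDEAAKGT")
print(pattern, hit.group() if hit else None)
```

Once translated, "grep-ing" for the motif is a plain `re.search` over each sequence in the database.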
Finding optimal MS alignment • Need a scoring system • Given a scoring system, an (efficient) method of calculation • If no efficient method of getting the right answer, an efficient way of getting a plausible answer
Need MSC measure • Desirable characteristics: • variable number of sequences • column-wise calculation • order independence • Example alignment:
MQPILLL
MLR-LL-
MK-ILLL
MPPVLIL
Sum-of-pairs (SP) measure • Column score = sum of pairwise scores • k choose 2 pairs per column • Reduces to pairwise alignment when k = 2 • Need to assign (-,-) value • May compute in either row or column order
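A minimal sketch of the SP measure, applied to the four-row alignment from the preceding slide. The unit costs are an assumption for illustration: identical characters cost 0 (so the (-,-) pair is assigned 0) and every other pair costs 1. The two functions compute the same total in column order and in row order.

```python
from itertools import combinations

def pair_cost(x, y):
    # Assumed costs: identical characters (including '-' with '-') cost 0,
    # any mismatch or letter-versus-gap pair costs 1.
    return 0 if x == y else 1

def sp_by_columns(rows):
    """Sum over columns of the k-choose-2 pairwise costs in each column."""
    return sum(
        pair_cost(x, y)
        for column in zip(*rows)
        for x, y in combinations(column, 2)
    )

def sp_by_rows(rows):
    """Sum over the k-choose-2 row pairs of their induced pairwise cost."""
    return sum(
        sum(pair_cost(x, y) for x, y in zip(r1, r2))
        for r1, r2 in combinations(rows, 2)
    )

rows = ["MQPILLL", "MLR-LL-", "MK-ILLL", "MPPVLIL"]
print(sp_by_columns(rows), sp_by_rows(rows))  # both orders give 22
```

With k = 2 there is a single pair per column, so the measure reduces to the ordinary pairwise alignment cost.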
DP approach • Generalization of two-sequence comparison • k-dimensional array • space complexity is O(n^k) for k sequences of length n • MSC with SP measure is NP-complete
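To make the k-dimensional array concrete, here is a sketch of the DP for k = 3 under an SP cost. The unit costs (0 for identical characters including (-,-), 1 otherwise) and the name `msa3_cost` are assumptions for illustration. Each cell has 2^3 - 1 = 7 predecessors, one for every non-empty choice of which strings contribute a character to the last column.

```python
from itertools import combinations, product

def sp_column_cost(column):
    # Assumed unit costs: identical characters (including '-','-') cost 0.
    return sum(0 if x == y else 1 for x, y in combinations(column, 2))

def msa3_cost(s1, s2, s3):
    """Optimal SP cost of aligning three strings, using a 3-D DP table."""
    n1, n2, n3 = len(s1), len(s2), len(s3)
    INF = float("inf")
    D = [[[INF] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    D[0][0][0] = 0
    # Every non-zero 0/1 vector says which strings advance in this column.
    moves = [mv for mv in product((0, 1), repeat=3) if any(mv)]
    # Lexicographic order guarantees predecessors are filled in first.
    for i, j, k in product(range(n1 + 1), range(n2 + 1), range(n3 + 1)):
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            column = (
                s1[pi] if di else "-",
                s2[pj] if dj else "-",
                s3[pk] if dk else "-",
            )
            D[i][j][k] = min(D[i][j][k], D[pi][pj][pk] + sp_column_cost(column))
    return D[n1][n2][n3]

print(msa3_cost("ab", "b", "ab"))
```

The table has (n+1)^3 cells here and (n+1)^k in general, which is why this exact approach is only practical for a handful of short sequences.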
MSA speedup heuristic • This ‘heuristic’ guarantees the right answer! • But it doesn’t guarantee the speedup • General idea: • find a bound L on the value of an optimal alignment • if the best value achievable through a cell is provably worse than L, the cell cannot be part of an optimal solution
Common method: iterative • Simplest implementation • Begin with the pair Si and Sj that are pairwise closest • Iteratively merge in the additional string with the smallest edit distance from any string already in the multiple alignment • Equivalent to finding a minimum spanning tree (MST) on the edit-distance graph
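The merge order described above can be sketched as follows. This only computes the order in which strings would be merged (a Prim-style growth of a spanning tree over edit distances); it does not build the alignment itself. Unit-cost edit distance and the function names are assumptions for illustration.

```python
def edit_distance(s, t):
    """Unit-cost edit distance, computed row by row."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        cur = [i]
        for j, b in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # match or substitute
        prev = cur
    return prev[-1]

def merge_order(strings):
    """Indices of the strings in the order the iterative method merges them."""
    n = len(strings)
    d = [[edit_distance(a, b) for b in strings] for a in strings]
    # Start with the closest pair.
    i0, j0 = min(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda p: d[p[0]][p[1]],
    )
    order = [i0, j0]
    remaining = set(range(n)) - {i0, j0}
    while remaining:
        # Next: the string closest to any string already merged.
        nxt = min(sorted(remaining), key=lambda r: min(d[r][m] for m in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Made-up strings, chosen so that all the relevant distances differ.
print(merge_order(["aaaaaa", "bbbbbb", "aaaaab", "bbbbaa"]))  # [0, 2, 3, 1]
```

Ties are broken by lowest index here; a real implementation would need some tie-breaking rule as well, and the choice can change the resulting alignment.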
Clustering method • Almost any clustering algorithm can be adapted to MSC • Usually start with small clusters and build big ones • Also possible to start with one big cluster and divide and conquer • Not clear which method is best