Bioinformatics Algorithms and Data Structures Chapter 14.1-5: Multiple String Comparisons Lecturer: Dr. Rose Slides by: Dr. Rose March 1, 2007
Multiple String Comparisons Q: Why are we interested in multiple string comparisons? A: At one level we are data-mining. • Looking for similarities • Common evolution • Common functionality • Significance of similarity may not be clear with only two strings. Multiple string comparison is accomplished by multiple alignment.
Multiple String Comparisons Defn. A global multiple alignment of k > 2 strings is: • A generalization of the alignment of 2 strings • Strings S1, S2, …, Sk are inflated with spaces to obtain strings S’1, S’2, …, S’k of uniform length l. • The inflated strings are arrayed in k rows of l columns.
Example
AGT..CTT.ACGCG
AGTAGCTT...GCG
..TAGC.T..GGCG
.CTA.C.TAACCCG
ACTA...TAAC...
Multiple String Comparisons Consider the relation between two-string comparison and biological function: • two-string alignments are used to find unsuspected biological relationship from apparent string similarity. • This follows from the first fact of biological sequence comparison: sequence similarity implies functional or structural similarity.
Multiple String Comparisons Consider the relation between multiple string comparison and biological function: • Multiple string alignments are used to find unknown string similarities from known biological relationships. • This isn’t as obvious since there is the tendency to focus on one-dimensional sequences and not the corresponding three-dimensional structures or two-dimensional substructures.
Multiple String Comparisons This follows from the second fact of biological sequences: Strings that are functionally related can appear very different and yet preserve the same important three-dimensional and two-dimensional features. There are several levels of abstraction entailed: • Three-dimensional structure • Functionality • Amino-acid sequence
Multiple String Comparisons These different levels of abstraction are preserved/conserved to different degrees: • Three-dimensional structure is most preserved • Functionality is somewhat conserved • Amino-acid sequence less likely to be conserved Q: What point are we trying to make? A: The significance is that similarity of structure may not be blatantly apparent at the sequence level. Comparison of multiple sequences highlights less apparent similarity.
Multiple String Comparisons Example from text: Hemoglobin • 4 chains of ~140 amino acids apiece • Found in organisms ranging from insects to mammals • Insects and vertebrates diverged ~600 million years BP • Large number of amino acid mutations (~100) per chain between the two sequences (insect & vertebrate)
Multiple String Comparisons Comparison of two mammalian hemoglobin sequences: • Exhibit high amino-acid similarity (Our cousin the chimpanzee shares the identical sequence) • Suggest similar functionality Comparison of mammalian and insect hemoglobin sequences: • Exhibits little amino-acid similarity • However, has similar functionality
Multiple String Comparisons The important point is that while: sequence similarity ⇒ functional & structural similarity, the converse, functional & structural similarity ⇒ sequence similarity, is not true, i.e., functional & structural similarity ⇏ sequence similarity.
Family & Superfamily Representation Data Mining Problem: • Given a set of biologically similar strings find the commonalities that characterize the family. Why would we want to do this? • Conserved features may explain function & structure. • Characterization of the family may make it easy to recognize new members. • Characterization may also make it easier to exclude nonmembers.
Family & Superfamily Representation Example: protein families • The similarity may be functionality or • Two- or three-dimensional structure Specific Examples: • globins (hemoglobins, myoglobins) • immunoglobulin (antibody) proteins
Family & Superfamily Representation Q: Why would we be interested in identifying the family to which a protein belongs? A: Family membership immediately clues us in on: • Physical structure • Biological functionality Text suggests there are ~100,000 proteins in humans but only ~1000 or fewer protein families
Family & Superfamily Representation Q: If we suspect that a new protein belongs to some family how do we check? • Align the new protein sequence with a representative member of the family? • Align the new protein sequence with several representative members of the family? • Align the new protein sequence with a generalization of members of the family? A: Align the new protein sequence with a generalization of members of the family.
Family & Superfamily Representation Q: What is the representation of the generalization of members of the family? Consider: • We want to match family members while • Excluding non-family members This is an established area in machine learning. In general, the key is that the representation language must be sufficiently expressive to distinguish between + & - examples. Conjecture: amino acid strings lack sufficient expressiveness
Family & Superfamily Representation Three common currently used representations: • Profile (based on multiple alignment) • Consensus sequence (based on multiple alignment) • Signature (some based on multiple alignment, some not)
Profile Representation Defn. A profile (aka weight matrix) for a multiple alignment specifies the frequency of each character in each column. Consider the following multiple alignment:
a b c - a
a b a b a
a c c b -
c b - b c
The corresponding extracted profile (0 marks a character that does not occur in the column):
    C1   C2   C3   C4   C5
a  .75    0  .25    0  .50
b    0  .75    0  .75    0
c  .25  .25  .50    0  .25
-    0    0  .25  .25  .25
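The profile extraction above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's code: the function name `build_profile` and the list-of-dicts layout are assumptions made for the example, and characters that never occur in a column are simply absent from that column's dict.

```python
from collections import Counter

def build_profile(rows):
    """Return one dict per alignment column mapping character -> relative frequency."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows of a multiple alignment must have the same length")
    k = len(rows)
    # zip(*rows) walks the alignment column by column.
    return [{ch: n / k for ch, n in Counter(col).items()} for col in zip(*rows)]

# The four-row alignment from the slide, with '-' as the space character.
alignment = ["abc-a", "ababa", "accb-", "cb-bc"]
profile = build_profile(alignment)

for j, column in enumerate(profile, start=1):
    print(f"C{j}", dict(sorted(column.items())))
```

Each printed column should agree with the corresponding column of the profile table on the slide.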
Profile Representation Log-odds ratios: profile entries are sometimes expressed in this form. Let p(y, j) denote the frequency of character y in column j. Let p(y) denote the frequency of character y anywhere in the multiply aligned sequences. Then $\log \frac{p(y, j)}{p(y)}$ is the log-odds ratio for cell (y, j) of the profile (weight matrix).
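As a rough sketch of the log-odds form, the snippet below converts the example alignment into log-odds entries. Several choices here are assumptions, since the slide does not fix them: the logarithm is base 2, the background frequency p(y) is taken over every cell of the alignment (spaces included), and characters that never occur in a column are left out rather than assigned minus infinity. The name `log_odds_profile` is hypothetical.

```python
import math
from collections import Counter

def log_odds_profile(rows):
    """Return one dict per column mapping character -> log2(p(y, j) / p(y))."""
    k = len(rows)
    total_cells = k * len(rows[0])
    # Background frequency p(y): share of all alignment cells holding y (assumption).
    background = {y: n / total_cells for y, n in Counter("".join(rows)).items()}
    table = []
    for col in zip(*rows):
        # Column frequency p(y, j) is n / k; unobserved characters are skipped.
        table.append({y: math.log2((n / k) / background[y])
                      for y, n in Counter(col).items()})
    return table

alignment = ["abc-a", "ababa", "accb-", "cb-bc"]
log_odds = log_odds_profile(alignment)
print(round(log_odds[0]["a"], 3))  # 'a' is over-represented in column 1, so this is positive
```

A positive entry means the character is more frequent in that column than in the alignment overall; a negative entry means it is less frequent.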
Profile Representation Alignment of string S with profile P • Insertion of spaces into S is allowed • Use regular string alignment? • Let C be a string of profile column positions • Align S by inserting spaces into S and C.
Profile Representation Example: S = aabbc, P is the profile from the previous slide:
    C1   C2   C3   C4   C5
a  .75    0  .25    0  .50
b    0  .75    0  .75    0
c  .25  .25  .50    0  .25
-    0    0  .25  .25  .25
Alignment of S and C:
S: a a b - b c
C: 1 - 2 3 4 5
Q: How do we score such an alignment?
Profile Representation Q: How do we score profile alignments? • Assume we have an alphabet-weight scoring scheme, e.g.,
     a   b   c   -
a    2  -1  -3  -1
b   -1   2  -1  -1
c   -3  -1   2  -1
-   -1  -1  -1   0
• Column score: compute the weighted sum of scores based on the frequency of characters in the column. • Alignment score: sum the column scores.
Profile Representation Alphabet-weight scoring scheme:
     a   b   c   -
a    2  -1  -3  -1
b   -1   2  -1  -1
c   -3  -1   2  -1
-   -1  -1  -1   0
Profile:
    C1   C2   C3   C4   C5
a  .75    0  .25    0  .50
b    0  .75    0  .75    0
c  .25  .25  .50    0  .25
-    0    0  .25  .25  .25
Compute the weighted sum of scores based on the frequency of characters in the column.
S: a a b - b c
C: 1 - 2 3 4 5
Column1 = 0.75 * 2 + 0.25 * (-3) = 0.75
Column2 = 0.75 * 2 + 0.25 * (-1) = 1.25
Column3 = 0.25 * 0 + 0.50 * (-1) + 0.25 * (-1) = -0.75
Column4 = 0.75 * 2 + 0.25 * (-1) = 1.25
Column5 = 0.50 * (-3) + 0.25 * 2 + 0.25 * (-1) = -1.25
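The column arithmetic can be checked with a short Python sketch. The names `SCORE`, `PROFILE`, `column_score` and `alignment_score` are hypothetical, and one step is an assumption about how to read the slide: the slide lists only the five profile columns, so the second `a` of S, which sits opposite a space in C, is charged s(a, -) = -1 here, in line with the gap term used by the dynamic-programming recurrence.

```python
# Alphabet-weight scoring scheme from the slide; '-' is the space character.
SCORE = {
    "a": {"a": 2, "b": -1, "c": -3, "-": -1},
    "b": {"a": -1, "b": 2, "c": -1, "-": -1},
    "c": {"a": -3, "b": -1, "c": 2, "-": -1},
    "-": {"a": -1, "b": -1, "c": -1, "-": 0},
}

# Profile columns C1..C5; characters with frequency 0 are omitted.
PROFILE = [
    {"a": 0.75, "c": 0.25},
    {"b": 0.75, "c": 0.25},
    {"a": 0.25, "c": 0.50, "-": 0.25},
    {"b": 0.75, "-": 0.25},
    {"a": 0.50, "c": 0.25, "-": 0.25},
]

def column_score(x, j):
    """Frequency-weighted score of aligning character x with profile column j (1-based)."""
    return sum(SCORE[x][y] * p for y, p in PROFILE[j - 1].items())

def alignment_score(s_row, c_row):
    """Sum the scores of an alignment; None in c_row marks a space inserted into C."""
    total = 0.0
    for x, j in zip(s_row, c_row):
        # Assumption: a character opposite a space in C costs s(x, '-').
        total += SCORE[x]["-"] if j is None else column_score(x, j)
    return total

s_row = "aab-bc"
c_row = [1, None, 2, 3, 4, 5]
print([column_score(x, j) for x, j in zip(s_row, c_row) if j is not None])
print(alignment_score(s_row, c_row))
```

The first line printed lists the five column scores in profile order; the second adds the gap term for the unmatched `a` to give the score of the whole alignment under the stated assumption.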
Profile Representation Q: How do we find optimal alignments? A: Use dynamic programming to maximize similarity. As before: s(x, y) denotes the alphabet-weight score for aligning x & y, and p(y, j) denotes the frequency of letter y in column j. Then let $S(x, j) = \sum_{y} s(x, y) \, p(y, j)$ denote the score for aligning x with column j.
Profile Representation Defn. Let V(i, j) denote the value of the optimal alignment of S[1..i] with the first j columns of C. Then $V(0, j) = \sum_{k \le j} S(\_, k)$ and $V(i, 0) = \sum_{k \le i} S(S_1(k), \_)$. Here $S_1(k)$ denotes the kth character of the first string argument, i.e., S[k].
Profile Representation The general recurrence is then:
$$V(i, j) = \max \begin{cases} V(i-1, j-1) + S(S_1(i), j) & \text{match ith and jth letters} \\ V(i-1, j) + S(S_1(i), \_) & \text{insert a gap in the profile} \\ V(i, j-1) + S(\_, j) & \text{insert a gap in } S_1 \end{cases}$$
Q: What is the time complexity for solving this recurrence using DP?
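A minimal Python sketch of this recurrence follows, using the example scoring scheme and profile from the earlier slides. The function name `profile_alignment_value` and the data layout are assumptions for illustration; the sketch returns only the optimal value V(n, m), not the alignment itself, and it takes S(x, _) to be the alphabet-weight score s(x, -).

```python
SCORE = {
    "a": {"a": 2, "b": -1, "c": -3, "-": -1},
    "b": {"a": -1, "b": 2, "c": -1, "-": -1},
    "c": {"a": -3, "b": -1, "c": 2, "-": -1},
    "-": {"a": -1, "b": -1, "c": -1, "-": 0},
}
PROFILE = [
    {"a": 0.75, "c": 0.25},
    {"b": 0.75, "c": 0.25},
    {"a": 0.25, "c": 0.50, "-": 0.25},
    {"b": 0.75, "-": 0.25},
    {"a": 0.50, "c": 0.25, "-": 0.25},
]

def profile_alignment_value(S, profile=PROFILE, score=SCORE):
    """Value of an optimal alignment of string S with all columns of the profile."""
    def col(x, j):
        # S(x, j): frequency-weighted score of x against column j (1-based).
        return sum(score[x][y] * p for y, p in profile[j - 1].items())

    n, m = len(S), len(profile)
    V = [[0.0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):          # V(0, j): every column faces a space in S
        V[0][j] = V[0][j - 1] + col("-", j)
    for i in range(1, n + 1):          # V(i, 0): every character faces a space in C
        V[i][0] = V[i - 1][0] + score[S[i - 1]]["-"]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            x = S[i - 1]
            V[i][j] = max(
                V[i - 1][j - 1] + col(x, j),      # match ith letter with jth column
                V[i - 1][j] + score[x]["-"],      # insert a gap in the profile
                V[i][j - 1] + col("-", j),        # insert a gap in S
            )
    return V[n][m]

print(profile_alignment_value("aabbc"))
```

Because each table cell evaluates column scores that sum over the alphabet, the nested loops give a feel for where the alphabet-size factor in the running time comes from.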
Profile Representation The time complexity is O(smn) for DP, since each of the O(mn) table cells evaluates column scores that sum over the s characters of the alphabet. Here: • n is the length of the string S, • m is the length of the profile, and • s is the size of the alphabet. O(smn) is more costly than sequence-to-sequence alignment. (Do you recall what that cost was?)
Signature Representation This representation is used by protein databases such as: • PROSITE • BLOCKS The core idea is that families of proteins are characterized by motifs or sequence signatures. Q: What is a motif? A: (Webster) A usu. repeating salient thematic element
Signature Representation Example from text: [&H][&A]D[DE]x_n[TSN]x_4[QK]Gx_7[&A] Where • A bracket indicates alternative amino acids • & = {I, L, V, M, F, Y, W} • x denotes any amino acid • The subscript denotes the length of the run of x's; n denotes an arbitrary length.
Signature Representation Example from text: [&H][&A]D[DE]x_n[TSN]x_4[QK]Gx_7[&A] Observations: • The representation is a generalization • The generalization is a regular expression
Signature Representation Signature: [&H][&A]D[DE]x_n[TSN]x_4[QK]Gx_7[&A] Matches:
HADDITIIIIQGIIIIIIIA
IADDITIIIIQGIIIIIIIA
LADDITIIIIQGIIIIIIIA
VADDITIIIIQGIIIIIIIA
MADDITIIIIQGIIIIIIIA
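Since the signature is a regular expression, it can be written directly with Python's `re` module. The sketch below is one possible encoding, under stated assumptions: `&` is expanded to the set {I, L, V, M, F, Y, W} given in the lecture, `x` is restricted to the 20 standard amino-acid letters, and the arbitrary-length run is allowed to be empty. The test strings are rebuilt piece by piece in the shape of the matches listed on the slide, rather than copied, so that each run length is explicit.

```python
import re

HYDROPHOBIC = "ILVMFYW"            # the class written '&' in the signature
AMINO = "ACDEFGHIKLMNPQRSTVWY"     # 'x': any standard amino acid (assumption)

SIGNATURE = re.compile(
    rf"[{HYDROPHOBIC}H][{HYDROPHOBIC}A]D[DE]"   # [&H][&A]D[DE]
    rf"[{AMINO}]*"                              # x of arbitrary length (may be empty here)
    rf"[TSN][{AMINO}]{{4}}"                     # [TSN] followed by four arbitrary residues
    rf"[QK]G[{AMINO}]{{7}}"                     # [QK]G followed by seven arbitrary residues
    rf"[{HYDROPHOBIC}A]"                        # [&A]
)

def make_candidate(first):
    """Build a string shaped like the slide's matches, varying only the first residue."""
    return first + "ADD" + "I" + "T" + "I" * 4 + "QG" + "I" * 7 + "A"

for first in "HILVMG":
    candidate = make_candidate(first)
    print(candidate, bool(SIGNATURE.fullmatch(candidate)))
```

`fullmatch` requires the whole string to fit the signature; to look for the motif inside a longer protein sequence, `SIGNATURE.search` would be the natural call instead.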
Signature Representation Regular expression representation • use regular expression pattern matching. • no need to worry about mismatches/errors.
Computing Multiple Alignments Recall: two-string local alignment was defined in terms of global alignment of substrings. • We take the same approach for multiple string local alignment. Defn. A local multiple alignment of a set S of strings is obtained by selecting one substring S'i from each Si ∈ S and then globally aligning these substrings.
Computing Multiple Alignments Q: Global vs Local alignment: which should we prefer? Wait for someone to respond! Gusfield notes for: • Pairs of sequences and • Multiple sequences there are biological justifications for preferring local over global alignment of multiple sequences. But…….
Computing Multiple Alignments But……. The best (computer science) theoretical results are for global alignment. Like the joke about the lost wallet, Gusfield chooses to emphasize global alignment.
Computing Multiple Alignments Q: How can we generalize the concept of score to multiple alignments? IOW, what objective function should we use? We will consider three types of objective functions: • Sum-of-pairs • Consensus • Tree
Computing Multiple Alignments First we define the concepts of induced pairwise alignment and its corresponding score. Defn. The induced pairwise alignment of strings Si and Sj is obtained from the global alignment M by removing all other rows. Note: instances of matching spaces can be removed from the induced alignment. Note: to score an induced pairwise alignment any two-string alignment scoring scheme can be used.
Computing Multiple Alignments Consider the following pairwise scoring scheme: score = #mismatches + #spaces. In the following example:
1  A A T - G G T T T
2  A A - C G T T A T
3  T A T C G - A A T
score(1,2) = 4
score(1,3) = 5
score(2,3) = 4
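The three induced pairwise scores can be reproduced with a small Python sketch. The helper name `induced_score` is hypothetical; it counts one unit for each mismatch and one for each column where exactly one of the two rows has a space, and it skips columns where both rows have a space, following the note that matching spaces can be removed from an induced alignment.

```python
from itertools import combinations

def induced_score(row_a, row_b):
    """Score the pairwise alignment induced by two rows: #mismatches + #spaces."""
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue            # matching spaces are dropped from the induced alignment
        if x == "-" or y == "-":
            score += 1          # a space opposite a character
        elif x != y:
            score += 1          # a mismatch
    return score

# Rows of the three-string alignment from the slide.
rows = {1: "AAT-GGTTT", 2: "AA-CGTTAT", 3: "TATCG-AAT"}

for i, j in combinations(sorted(rows), 2):
    print(f"score({i},{j}) = {induced_score(rows[i], rows[j])}")
```

Any other two-string scoring scheme could be substituted inside `induced_score` without changing how the pairs are enumerated.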