Composition Alignment Gary Benson Departments of Computer Science and Biology Boston University
Outline of Talk • Sequence composition and composition match • Composition alignment algorithm • Composition match scoring functions • Growth of local composition alignment scores • Limiting the length of a composition match • Biological examples
Goal Identify features in DNA sequences that are not accurately described by position specific patterns. A position specific pattern, P, has the form: P = p1 p2 p3 ...pk where pi is either a single specific character or a choice (weighted or unweighted) of characters. In DNA there are features that are characterized by composition rather than by position specific patterns.
Sequence Composition Composition is a vector quantity describing the frequency of occurrence of each alphabet letter in a particular string. Let S be a string over Σ. Then, C(S)=(fσ1 , fσ2 , fσ3 , … , fσ|Σ|) is the composition of S, where fσi is the fraction of the characters in S that are σi.
Composition Example S = ACTGTACCTGGCGCTATT C(S) = (0.17, 0.28, 0.22, 0.33), with components in the order (A, C, G, T). Note that the order of letters in S is irrelevant, as it has no effect on the composition.
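The definition above can be checked with a few lines of Python. This is a minimal sketch, not code from the talk: the function name `composition` and the fixed letter order "ACGT" are assumptions made for illustration, and the string is assumed to be non-empty.

```python
from collections import Counter

def composition(s, alphabet="ACGT"):
    """Return the fraction of characters in s equal to each alphabet letter.

    Hypothetical helper; assumes s is non-empty and uses only letters
    from `alphabet`.
    """
    counts = Counter(s)
    return tuple(counts[ch] / len(s) for ch in alphabet)

comp = composition("ACTGTACCTGGCGCTATT")
print([round(f, 2) for f in comp])  # [0.17, 0.28, 0.22, 0.33]
```

Shuffling the letters of the input leaves the result unchanged, which is the sense in which letter order is irrelevant.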
Composition and Sequence Features • Isochores – Multi-megabase, specifically GC-rich or GC-poor. GC-rich isochores have greater gene density. • CpG Islands – Several hundred nucleotides, rich in the dinucleotide CG, which is underrepresented in eukaryotic genomes. Methylation of the cytosine (C) in these dinucleotides affects gene expression. • Protein binding regions – Tens of nucleotides; dinucleotide composition contributes to DNA flexibility, allowing the helix to change shape during protein binding.
Composition Match We hope to identify common features in sequences using a new alignment algorithm. The main new idea is the use of composition matching. Two strings, S and T, have a composition match if their lengths are equal and C(S) = C(T). For example, S and T below have a composition match: S = ACTGTACCTGGCGCTATT T = AAACCCCCGGGGTTTTTT
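For strings of equal length, equal compositions are the same thing as equal letter counts, so a composition match can be tested by comparing counts. The sketch below assumes this reading of the definition; the name `composition_match` is illustrative only.

```python
from collections import Counter

def composition_match(s, t):
    """True if s and t have equal length and the same letter counts.

    With equal lengths, equal counts imply equal compositions C(s) = C(t).
    """
    return len(s) == len(t) and Counter(s) == Counter(t)

S = "ACTGTACCTGGCGCTATT"
T = "AAACCCCCGGGGTTTTTT"
print(composition_match(S, T))  # True
```

Comparing integer counts rather than floating-point fractions avoids rounding issues in the equality test.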
Composition Alignment Problem Given: Two sequences, S and T, of lengths m and n, over an alphabet Σ, and a scoring function cm(s, t) for the score of a composition match between substrings s and t. Find: The best scoring alignment (global or local) of S with T such that the allowed scoring options include composition match between substrings of S and T as well as the standard options of 1) single character match, 2) single character mismatch, 3) insertion and deletion.
Example of composition alignment S = AACGTCTTTGAGCTC T = AGCCTGACTGCCTA
Alignment ("|" marks a single character match, "<->" and "<--->" mark composition matches):
  AACGTCTTTGAGCTC
  | |<->  | <--->
  AGCCTGACT-GCCTA
Related Work • Alignment allowing adjacent letter swap. O(nm), Lowrance and Wagner (1975) • All swapped matchings of a pattern in a text. O(n m^(1/3) log m log |Σ|), Amir, Aumann, Landau, Lewenstein, Lewenstein (2000); O(n log m log |Σ|), Amir, Cole, Hariharan, Lewenstein, Porat (2001) • Composition naming. O(n log m log |Σ|), Amir, Apostolico, Landau, Satta (2003)
Composition Alignment using Dynamic Programming Given two sequences, S and T, the best alignment of the prefix strings S[1, i] = s1 …si and T[1, j] = t1 …tj ends in one of four ways: • single character match or mismatch, • insertion, • deletion, or • composition match
Ways an Alignment Can End
mismatch:
  S: C G T
  T: C G A
insertion or deletion:
  S: C A T      S: C A -
  T: C A -      T: C A A
composition match:
  X: C G T A C
  Y: C G C T A
Note that the suffixes in a composition match will have a length l where 1 ≤ l ≤ min(i, j, limit)
Time Complexity Computing the optimal composition alignment with dynamic programming is similar to standard alignment, except for the composition match scoring option. The overall time complexity is O(nmZ) where Z is the time required per (i, j) pair to find the best length l for the composition match.
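The recurrence described above can be sketched as a global alignment in which every cell also tries composition matches of each allowed suffix length. This is an illustrative, unoptimized version with Z = O(limit) work per (i, j) pair, not the implementation from the talk. The function name, the default scores (match = 1, mismatch = -1, gap = -1), the default cm(k) = k, and the choice to try composition matches only for lengths l ≥ 2 (a length-1 composition match is just a single character match) are all assumptions.

```python
def composition_alignment_score(S, T, match=1, mismatch=-1, gap=-1,
                                cm=lambda k: k, limit=10):
    """Global composition alignment score (hypothetical sketch).

    D[i][j] is the best score for aligning S[:i] with T[:j]. Besides the
    three standard options, each cell tries a composition match between
    the length-l suffixes of S[:i] and T[:j] for 2 <= l <= min(i, j, limit).
    """
    m, n = len(S), len(T)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i * gap
    for j in range(1, n + 1):
        D[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if S[i - 1] == T[j - 1] else mismatch
            best = max(D[i - 1][j - 1] + sub,   # match or mismatch
                       D[i - 1][j] + gap,       # gap in T
                       D[i][j - 1] + gap)       # gap in S
            diff = {}      # letter -> count in S suffix minus count in T suffix
            nonzero = 0    # number of letters whose difference is not zero
            for l in range(1, min(i, j, limit) + 1):
                for ch, d in ((S[i - l], 1), (T[j - l], -1)):
                    old = diff.get(ch, 0)
                    new = old + d
                    diff[ch] = new
                    nonzero += (new != 0) - (old != 0)
                if nonzero == 0 and l >= 2:     # suffixes have equal composition
                    best = max(best, D[i - l][j - l] + cm(l))
            D[i][j] = best
    return D[m][n]

print(composition_alignment_score("GTC", "CTG", limit=3))  # 3
print(composition_alignment_score("GTC", "CTG", limit=1))  # -1
```

With limit = 1 the composition option is never used and the function reduces to standard global alignment, which makes the effect of the extra scoring option easy to compare on small inputs.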
Computing length of the shortest composition match Our goal here is to start with two strings, S and T, of equal length, and for each prefix pair S[1, k], T[1, k], find the length of the shortest suffixes that have a composition match.
For example, let S = AACGTCTTTGAGCT and T = AGCCTGACTGCCTA. For k = 6, the shortest suffixes that have a composition match have length 3 (GTC in S and CTG in T): S = AACGTC... T = AGCCTG...
Composition difference We find the matching suffix lengths using composition difference, a vector quantity for two strings x and y: CD(x, y) = (cσ1, … , cσ|Σ|) where cσi is the difference between the number of times σi occurs in x and in y.
Using composition difference Key observation: two identical composition differences at prefix lengths k and g indicate a composition match of length k – g.
Sorting to find shortest composition matches Sort on composition difference using stable sort. Adjacent tuples with the same composition difference identify shortest composition matches.
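The sorting step can be sketched for a single pair of equal-length strings as follows. This is a minimal illustration under stated assumptions: the function name, the "ACGT" letter order, and the inclusion of the empty prefix (k = 0, zero difference) are choices made here, not details given on the slides. Python's built-in sort is stable, so tuples with the same composition difference stay ordered by prefix length.

```python
def shortest_suffix_matches(S, T, alphabet="ACGT"):
    """For each prefix length k, return the length of the shortest suffixes
    of S[:k] and T[:k] that have a composition match, or None if there is
    none. Hypothetical sketch; assumes len(S) == len(T).
    """
    assert len(S) == len(T)
    index = {ch: a for a, ch in enumerate(alphabet)}
    cd = [0] * len(alphabet)
    tuples = [(tuple(cd), 0)]            # empty prefix: zero difference
    for k in range(1, len(S) + 1):
        cd[index[S[k - 1]]] += 1
        cd[index[T[k - 1]]] -= 1
        tuples.append((tuple(cd), k))    # CD(S[:k], T[:k]) with its prefix length
    tuples.sort(key=lambda t: t[0])      # stable: equal CDs remain ordered by k
    shortest = [None] * (len(S) + 1)
    for (cd_prev, g), (cd_cur, k) in zip(tuples, tuples[1:]):
        if cd_prev == cd_cur:            # identical differences at g < k
            shortest[k] = k - g          # composition match of length k - g
    return shortest

lengths = shortest_suffix_matches("AACGTCTTTGAGCT", "AGCCTGACTGCCTA")
print(lengths[6])  # 3
```

Handling all (i, j) index pairs of two strings would repeat this idea along each diagonal of the alignment matrix; that extension is not shown here.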
Time complexity for composition matches O(nm|Σ|) to find the shortest composition match lengths for all index pairs of two strings of lengths n and m. In our work, |Σ| is a small constant (4 for DNA, 16 for dinucleotides). For larger alphabets, the method of Amir, Apostolico, Landau and Satta (2003) can be used.
Composition match scoring functions We have explored: Functions based on match length, k: • Function 1: cm(k) = ck • Function 2: cm(k) = c√k where c is a constant. Functions based on substring composition: • Function 4: cm(C, B, k) = ck · H(C, B) where H is the relative entropy function, C is the composition of the matching substrings and B is a background composition.
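The three scoring functions can be written down directly. The sketch below is illustrative: the function names are invented here, and the slides do not give a formula for H, so the code assumes the standard relative entropy H(C, B) = Σ C_i log(C_i / B_i) with logarithm base 2 (the base is an assumption).

```python
import math

def cm_linear(k, c=1.0):
    """Function 1: cm(k) = c * k."""
    return c * k

def cm_sqrt(k, c=1.0):
    """Function 2: cm(k) = c * sqrt(k)."""
    return c * math.sqrt(k)

def relative_entropy(C, B):
    """Relative entropy of composition C against background B.

    Assumes base-2 logarithms and B_i > 0 wherever C_i > 0; terms with
    C_i = 0 contribute nothing.
    """
    return sum(p * math.log2(p / q) for p, q in zip(C, B) if p > 0)

def cm_entropy(C, B, k, c=1.0):
    """Function 4: cm(C, B, k) = c * k * H(C, B)."""
    return c * k * relative_entropy(C, B)

uniform = (0.25, 0.25, 0.25, 0.25)
print(cm_sqrt(4, c=3))                         # 6.0
print(cm_entropy((1.0, 0.0, 0.0, 0.0), uniform, 5))  # 10.0
```

Under this definition a matching substring whose composition equals the background scores zero with Function 4, while a strongly skewed composition scores highest.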
Additive and subadditive scoring functions The functions based on length are additive or subadditive: cm(i + j) ≤ cm(i) + cm(j) Lemma: For additive or subadditive composition match scoring functions, any best scoring alignment is equivalent in score to an alignment which contains only shortest composition matches. Theorem: Composition alignment with additive or subadditive match scoring functions and finite alphabet has time complexity O(nm).
The limit parameter Intuitively, allowing scrambled letters to match should increase the amount of matching between sequences. If too much matching occurs, alignments will not be meaningful. The limit parameter is an upper bound on the length l of the longest single composition match, used to prevent excessive matching.
(Figure: randomly generated sequences, sequence length = 100.)
Global score as a predictor of local parameter suitability: Function 1
Global score as a predictor of local parameter suitability: Function 2
Limit values for DNA • Function 1: cm(k) = ck: Limit ≤ 3. • Function 2: cm(k) = c√k: Limit ≤ 10. • Function 4: cm(C, B, k) = ck ·H(C, B): Limit ≤ 50.
Biological examples Composition alignment was tested on a set of 1796 promoter sequences from the Eukaryotic Promoter Database. Each sequence is 600 nucleotides long: 500 bases upstream and 100 downstream of the transcription initiation site. Two local alignment scores were produced using function 1: W, using composition alignment, and S, using standard alignment. The examples shown have statistically significant W with W ≥ 3 · S to exclude good standard alignments.
Example 1 Composition alignment and standard alignment of the same two promoters. Standard alignment is not statistically significant. Sequences are characteristic of CpG islands.
Composition Alignment:
  GCCCGCCCGCCGCGCTCCCGCCCGCCGCTCTCCGTGGCCC-CGCCG-CGCTGCCGCCGCCGCCGCTGC
  <->||||<>|<>||<>| ||||<>||<> |<-> |||||| <>|<> ||||<><> |<>| ||<->||
  CCGCGCCGCCGCCGTCCGCGCCGCCCCG-CCCT-TGGCCCAGCCGCTCGCTCGGCTCCGCTCCCTGGC
Standard Alignment:
  CGCCGCCGCCG
  CGCCGCCGCCG
Example 2 Composition alignment of two promoter sequences. Composition changes at vertical line.
Composition (A, C, G, T):
  Left: (0.01, 0.61, 0.30, 0.08)
  Right: (0.19, 0.16, 0.56, 0.09)
  GCCCCGCGCCCCGCGCCCCGCGCCCCGCGCGCCTC-CGCCCGCCCCT-GCTCCGGC---C-TTGCGCCTGC-GCACAGTGGGATGCGCGGGGAG
  <->|<><>|||| <>|||||| ||<->|<>||||| <>|||| |||| || ||<->   |  |<><>|<-> | |<>|<>|<>||||<-><->|
  CCGCGCGCCCCC-GCCCCCGCCCCGCCCCGGCCTCGGCCCCGGCCCTGGC-CCCGGGGGCAGTCGCGCCTGTG-AACGGTGAGTGCGGGCAGGG
Conclusion We • define a new alignment problem based on composition matching and test several scoring functions • show how to find all-pairs shortest composition match lengths in linear time per pair for a fixed alphabet • show that alignment using scoring functions based only on match length requires finding only shortest composition matches • give biological examples where composition alignment finds statistically (and functionally) significant sequence similarity in the absence of significant standard alignments