Multiple-Alignment • 藍永倫、陳婕妤、文國煒、劉宗灝、戴邦炘
Reference • Michael Brudno, Chuong B. Do, Gregory M. Cooper, Michael F. Kim, Eugene Davydov, NISC Comparative Sequencing Program, Eric D. Green, Arend Sidow, and Serafim Batzoglou. LAGAN and Multi-LAGAN: Efficient Tools for Large-Scale Multiple Alignment of Genomic DNA. Genome Res., Apr 2003; 13: 721-731; doi:10.1101/gr.926603 • Michael Brudno, Alexander Poliakov, Asaf Salamov, Gregory M. Cooper, Arend Sidow, Edward M. Rubin, Victor Solovyev, Serafim Batzoglou, and Inna Dubchak. Automated Whole-Genome Multiple Alignment of Rat, Mouse, and Human. Genome Res., Apr 2004; 14: 685-692; doi:10.1101/gr.2067704
Introduction • Multiple Sequence Alignment, for example:
AAA----CTGCAC----AG
A--CTG-CT--ACTG---G
---CTGACTGC----TTA-
• The problem is NP-complete
LAGAN toolkit • http://lagan.stanford.edu/ • LAGAN • Multi-LAGAN • Shuffle-LAGAN
LAGAN (Overview) • Performs a global alignment of two sequences at a time • Find local alignments (seeds). • Compute a rough global map. • Restricted DP.
Multi-LAGAN (Overview) • There are K sequences in total • Similar to K-clustering • Assumes the phylogenetic tree is known • Each step merges the two closest sequences, so LAGAN is run K-1 times in total.
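Performance Evaluation • Test data • ROSETTA set • 129 genes • Human and mouse • 10 Kbp on average • The CFTR region of the 12 species mentioned above • 1 Mbp on average • The annotated human exons are first aligned to the other 11 species, and this result is used as the "reference answer".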
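Discussion • Multiple alignments do better than pairwise alignments when comparing more distant species. • Local alignment vs. global alignment • MLAGAN is slower, but it has the highest accuracy. • LAGAN / MLAGAN are suited not only to closely related sequences; they also perform well on highly divergent sequences.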
Three main steps 1. Generation of local alignments. 2. Construction of a rough global map. 3. Computation of the final global alignment.
1. Generation of Local Alignment • LAGAN uses CHAOS to find local homologies between two sequences. • Reference: Michael Brudno, Michael Chapman, Berthold Gottgens, Serafim Batzoglou, and Burkhard Morgenstern. Fast and sensitive multiple alignment of long genomic sequences. BMC Bioinformatics, 4:66, 2003. • CHAOS works by chaining short words (the seeds) that match between the two sequences. • Anchor: a chain of seeds, i.e. a local alignment.
1. Generation of Local Alignment • k: word length; c: degeneracy • A (k, c)-seed is a pair of k-long words, one from each sequence, that match with at most c differences. • d: maximum distance; s: maximum shift. • Suppose two seeds are x letters apart in one sequence and y letters apart in the other. They can be chained together if: • x <= d and y <= d • | x - y | <= s
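The seed definition and the chaining rule above can be sketched in a few lines of Python. This is only an illustration, not the CHAOS implementation: the function names are hypothetical, a seed is represented by the start positions of its word in each sequence, and x and y are assumed to be measured from the end of the earlier word to the start of the later one.

```python
def is_seed(w1: str, w2: str, c: int) -> bool:
    """True if two equal-length words differ in at most c positions."""
    return len(w1) == len(w2) and sum(a != b for a, b in zip(w1, w2)) <= c


def can_chain(prev, nxt, k: int, d: int, s: int) -> bool:
    """Check the chaining rule for two seeds given as (pos1, pos2) word starts.

    Assumption: x and y are the gaps between the end of the earlier
    k-long word and the start of the later one, in each sequence.
    """
    x = nxt[0] - (prev[0] + k)
    y = nxt[1] - (prev[1] + k)
    if x < 0 or y < 0:          # the later seed must not overlap the earlier one
        return False
    return x <= d and y <= d and abs(x - y) <= s
```

For instance, with k = 4, d = 5 and s = 1, a seed at (0, 0) can be chained with one at (6, 7), since x = 2 and y = 3.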
1. Generation of Local Alignment • [Figure: search box in seq2 for a seed at the current location in seq1; labels: gap cutoff, distance cutoff, range of search] • Find seeds at the current location in seq1. • Find the previous seeds that fall into the search box. • Do a range query: seeds are indexed by their diagonal. • Pick the previous seed that maximizes the score of the chain. • Time: O(n log n), where n is the number of seeds.
1. Generation of Local Alignment • Scoring of chains • Match and mismatch penalties for each pair of chained seeds. • Gap penalties proportional to | x - y | for each pair of chained seeds. • Chains that score below a threshold t are discarded. • Rapid rescoring • Applied to the chains that score above t. • Rescore them by performing ungapped extensions in both directions from each seed, finding the optimal location to insert exactly one gap of size | x - y |.
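A minimal sketch of seed chaining with a gap penalty proportional to | x - y | follows. It is a simplification under stated assumptions: every seed contributes a fixed score of k * match (exact matches only), the predecessor search is a plain O(n^2) scan rather than the diagonal-indexed range query, and the function name and default penalties are hypothetical.

```python
def chain_seeds(seeds, k, d, s, match=1.0, gap_penalty=0.5):
    """Return (score, chain) for the best-scoring chain of seeds.

    seeds: list of (pos1, pos2) word start positions.
    A quadratic scan stands in for the range query over the search box.
    """
    seeds = sorted(seeds)
    n = len(seeds)
    if n == 0:
        return 0.0, []
    score = [k * match] * n      # best chain score ending at each seed
    prev = [-1] * n              # back-pointer to the previous seed in the chain
    for j in range(n):
        p1, p2 = seeds[j]
        for i in range(j):
            q1, q2 = seeds[i]
            x = p1 - (q1 + k)
            y = p2 - (q2 + k)
            # search-box test: distance cutoff d and shift (gap) cutoff s
            if 0 <= x <= d and 0 <= y <= d and abs(x - y) <= s:
                cand = score[i] + k * match - gap_penalty * abs(x - y)
                if cand > score[j]:
                    score[j] = cand
                    prev[j] = i
    end = max(range(n), key=score.__getitem__)
    best = score[end]
    chain = []
    while end != -1:
        chain.append(seeds[end])
        end = prev[end]
    return best, chain[::-1]
```

A caller would then keep only the chains whose score reaches the threshold t before rescoring them.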
2. Construction of a Rough Global Map • (b, e, b’, e’, s) represents a local alignment (anchor) • It runs from (b, b’) to (e, e’) • s is the score of the alignment • For A1 = (b1, e1, b1’, e1’, s1) and A2 = (b2, e2, b2’, e2’, s2): A1 < A2 iff e1 < b2 and e1’ < b2’ • A chain of local alignments A1 < A2 < … < Ak has score s1 + s2 + … + sk. • The optimal rough global map is the highest-scoring chain. • It is computed using sparse dynamic programming (as in longest increasing subsequence, LIS) in time O(n log n), where n is the total number of local alignments.
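The highest-scoring chain can be illustrated with a short dynamic program. This sketch uses a straightforward O(n^2) recurrence instead of the O(n log n) sparse method named on the slide, and it assumes b <= e for every anchor; the function name is hypothetical.

```python
def rough_global_map(anchors):
    """Highest-scoring chain of anchors (b, e, b2, e2, s).

    A1 < A2 iff e1 < b2 and e1' < b2'. Quadratic DP for clarity;
    the slide's sparse DP reaches O(n log n).
    """
    anchors = sorted(anchors, key=lambda a: a[0])   # predecessors come first
    n = len(anchors)
    if n == 0:
        return 0, []
    best = [a[4] for a in anchors]   # best chain score ending at each anchor
    prev = [-1] * n
    for j, (bj, _, b2j, _, sj) in enumerate(anchors):
        for i in range(j):
            _, ei, _, e2i, _ = anchors[i]
            if ei < bj and e2i < b2j and best[i] + sj > best[j]:
                best[j] = best[i] + sj
                prev[j] = i
    end = max(range(n), key=best.__getitem__)
    total = best[end]
    chain = []
    while end != -1:
        chain.append(anchors[end])
        end = prev[end]
    return total, chain[::-1]
```

In the test data below, the anchor (5, 20, 50, 60, 10) is off the main diagonal and conflicts with its neighbours, so it is left out of the map.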
2. Construction of a Rough Global Map • Recursive anchoring • The choice of the parameters k (seed length), c (maximum degeneracy of seeds), and t (score threshold) is a tradeoff between speed and sensitivity. • Speed: higher k, lower c. • Sensitivity: lower k, higher c. • To combine speed and sensitivity, LAGAN calls CHAOS recursively: after a first pass with a restrictive set of parameters, it calls CHAOS again in the regions between adjacent anchors (local alignments) of the global map.
3. Computation of Global Alignment • Limits the search area for each anchor, here running from (i, j) to (i’, j’) in sequences of lengths M and N: • The rectangle (0, 0) to (i+r, j+r). • The rectangle (i’-r, j’-r) to (M, N). • The band enclosed by the two diagonals • (i-r, j+r) to (i’-r, j’+r) • (i+r, j-r) to (i’+r, j’-r) • r is a parameter, typically 15.
3. Computation of Global Alignment • Run Needleman-Wunsch dynamic programming restricted to this limited area. • In this sense the anchors in LAGAN are more flexible than the anchors in MUMmer, AVID, and GLASS. • LAGAN's anchors provide only approximate locations through which the alignment should pass.
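To make the idea of restricted dynamic programming concrete, here is a hedged sketch of Needleman-Wunsch limited to a band. It is a deliberate simplification: the limited area is a single band of radius r around the main diagonal (as if one anchor spanned both sequences), gaps are linear rather than affine, and the scores and the function name are illustrative assumptions.

```python
def banded_nw(a: str, b: str, r: int, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score restricted to cells with |i - j| <= r.

    Returns None when the end cell lies outside the band.
    Only O(r * len(a)) cells are visited instead of len(a) * len(b).
    """
    m, n = len(a), len(b)
    score = {(0, 0): 0}
    for i in range(m + 1):
        for j in range(max(0, i - r), min(n, i + r) + 1):
            if (i, j) == (0, 0):
                continue
            cands = []
            if i > 0 and j > 0 and (i - 1, j - 1) in score:
                sub = match if a[i - 1] == b[j - 1] else mismatch
                cands.append(score[(i - 1, j - 1)] + sub)
            if i > 0 and (i - 1, j) in score:
                cands.append(score[(i - 1, j)] + gap)   # gap in b
            if j > 0 and (i, j - 1) in score:
                cands.append(score[(i, j - 1)] + gap)   # gap in a
            if cands:
                score[(i, j)] = max(cands)
    return score.get((m, n))
```

A full implementation would replace the band test with membership in the union of rectangles and anchor bands, and would keep back-pointers to recover the alignment itself.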
Memory-efficient Implementation • LAGAN performs the entire computation with memory proportional to the size of the largest rectangle. • LAGAN achieves this memory efficiency as follows: 1. Allocate working memory for one rectangle and the neck that follows it, and compute the Needleman-Wunsch matrix. 2. Trace back all optimal alignments ending in the cells of the rightmost column of the neck; in practice these soon converge upon a single optimal alignment. 3. Deallocate all working memory except what is needed to keep the traced-back alignments. 4. Repeat steps 1 to 3 for the next rectangle and neck.
LAGAN Running Time Analysis • The running time of LAGAN is dominated by the “rectangles”. • The running time of the “necks” is O(r(M+N)), which is linear in the sequence lengths. • Suppose there are n anchors, and let (x_0, y_0), …, (x_n, y_n) be the dimensions of the n+1 rectangles. Let $L_1 = \sum_{i=0}^{n} x_i$ and $L_2 = \sum_{i=0}^{n} y_i$ denote the total length of the inter-anchor segments in each sequence. We can assume the anchors themselves will be aligned in linear time and therefore ignore their length.
LAGAN Running Time Analysis • The total number of cells in these rectangles is $\sum_{i=0}^{n} x_i y_i \approx \frac{L_1 L_2}{n} + \sum_{i=0}^{n} \left(x_i - \frac{L_1}{n}\right)\left(y_i - \frac{L_2}{n}\right)$ (the identity is exact when n is replaced by the number of rectangles, n+1). • The first term depends only on the effective lengths of the sequences and the total number of anchors. • If we assume a lower bound on acceptable anchor density, then $L_1 L_2 / n$ behaves linearly in sequence length, because $L_1/n$ and $L_2/n$ are O(1).
LAGAN Running Time Analysis • The total number of cells in these rectangles is $\sum_{i=0}^{n} x_i y_i \approx \frac{L_1 L_2}{n} + \sum_{i=0}^{n} \left(x_i - \frac{L_1}{n}\right)\left(y_i - \frac{L_2}{n}\right)$ • The second term is at most $n \sigma_x \sigma_y$, where σ denotes the standard deviation. • Assuming constant anchor density (a reasonable assumption for a fixed pair of organisms), this term is linear in sequence length provided the standard deviations are constant. • If the anchors are spaced evenly and with a constant density, the running time will be linear in sequence length.
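The split of the rectangle cell count into a mean term and a deviation term can be checked numerically. The sketch below is an illustration with made-up rectangle dimensions. It uses the exact number of rectangles N (that is, n+1 for n anchors) and the population standard deviation, and the function name is hypothetical.

```python
from statistics import pstdev


def cell_count_terms(xs, ys):
    """Return (total, first, second, bound) for rectangle dimensions xs, ys.

    total  = sum of x_i * y_i
    first  = L1 * L2 / N                      (mean term)
    second = sum of (x_i - L1/N)(y_i - L2/N)  (deviation term)
    bound  = N * sigma_x * sigma_y            (Cauchy-Schwarz bound on second)
    """
    N = len(xs)
    L1, L2 = sum(xs), sum(ys)
    total = sum(x * y for x, y in zip(xs, ys))
    first = L1 * L2 / N
    second = sum((x - L1 / N) * (y - L2 / N) for x, y in zip(xs, ys))
    bound = N * pstdev(xs) * pstdev(ys)
    return total, first, second, bound
```

With rectangles of sizes 10x12, 20x18 and 30x33, the total of 1470 cells splits into 1260 from the mean term and 210 from the deviation term, which stays below the bound.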
References • LAGAN online • http://genome.lbl.gov/cgi-bin/VistaInput?align_pgm=lagan&num_seqs=2 • http://ai.stanford.edu/~serafim/CS262_2005/index.html • LAGAN • http://lagan.stanford.edu/lagan_web/citing.shtml • Michael Brudno, “Algorithms for Alignment of Genomic Sequences”, Department of Computer Science, Stanford University, PGA Workshop, 07/16/2004
LAGAN and Multi-LAGAN : Efficient Tools for Large-Scale Multiple Alignment of Genomic DNA
Outline • LAGAN • Multi-LAGAN • Performance Evaluation
Multiple Alignment • A natural extension of 2-sequence comparisons • More difficult than pairwise: the running time scales as the product of the lengths of all sequences • NP-complete problem (need heuristic approaches)
“Progressive Alignment” • The most widely used heuristic approach • Successive applications of a pairwise alignment algorithm • Examples: CLUSTALW (the best known) and MLAGAN
MLAGAN (Multi-LAGAN) • A multiple aligner based on progressive alignment with LAGAN • 2 main phases: • (1) Progressive alignment with LAGAN • (2) (Optional) Iterative improvement: successively remove each sequence and realign it
Algorithm MLAGAN • Input : • K sequences X1,…,XK • A phylogenetic binary tree between them
Algorithm MLAGAN (cont.) 3 main steps: • (1): Generation of rough global maps. Find the rough global map between each pair of sequences. (step 1, 2 of LAGAN)
Algorithm MLAGAN (cont.) • (2): Progressive multiple alignment with anchors. 2.1 Perform a global alignment between the 2 closest sequences according to the phylogenetic tree using step 3 of LAGAN.
Algorithm MLAGAN (cont.) 2.2 Find the rough global maps of the new multi-sequence to all other multi-sequences. (Details and scoring metric later.) 2.3 Iterate steps 2.1 and 2.2 (K-1 times), until a multiple alignment of all sequences remains.
Algorithm MLAGAN (cont.) • (3): (Optional) Iterative refinement with anchors. For each sequence Xi in the multiple alignment: 3.1 Find anchors between Xi & the other sequences that align better than a given cutoff.
Algorithm MLAGAN (cont.) 3.2 Align Xi to the multiple alignment of the other sequences with LAGAN. (Details later.)
Align 2 Multi-sequences • In the order of the given phylogenetic tree. E.g. 1. (human, chimpanzee) 2. (mouse, rat) 3. (human/chimpanzee, mouse/rat) 4. (human/chimpanzee/mouse/rat, chicken)
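The merge order implied by a phylogenetic tree can be sketched as a post-order traversal. This is only an illustration: the tree is assumed to be a binary tree written as nested 2-tuples of species names, and the function name and the "/" naming of multi-sequences are conventions chosen here, mirroring the example above.

```python
def merge_order(tree, merges=None):
    """Return (name, merges) for a binary tree given as nested 2-tuples.

    Each entry of merges is one pairwise alignment of two (multi-)sequences,
    listed in the order a progressive aligner would perform them.
    """
    if merges is None:
        merges = []
    if isinstance(tree, str):            # a leaf: a single sequence
        return tree, merges
    left, _ = merge_order(tree[0], merges)
    right, _ = merge_order(tree[1], merges)
    merges.append((left, right))         # children are aligned before parents
    return left + "/" + right, merges


tree = ((("human", "chimpanzee"), ("mouse", "rat")), "chicken")
root, merges = merge_order(tree)
```

For K = 5 leaves this yields K-1 = 4 merges, in the same order as the example on the slide.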
Align 2 Multi-sequences (cont.) • Step 2.2 of MLAGAN. E.g. compute the rough global map of the 2-sequence X/Y and the 1-sequence Z: • (1) Take the anchors in the rough global maps between X & Z and between Y & Z. • (2) Reweigh overlapping anchors: (s1+s2)*I/U
Align 2 Multi-sequences (cont.) • I: length of the intersection; U: length of the union • (3) Find the highest-weight chain, by LIS.
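The reweighing rule (s1+s2)*I/U can be illustrated on intervals. This is a sketch under an assumption: each anchor is reduced to a half-open interval (start, end) on the shared sequence Z plus its score, and the overlap is measured on Z only. The function name is hypothetical.

```python
def reweigh(a1, a2):
    """Combined weight of two overlapping anchors, each (start, end, score).

    I is the length of the intersection and U the length of the union
    of the two intervals. Returns None when the anchors do not overlap.
    """
    (b1, e1, s1), (b2, e2, s2) = a1, a2
    inter = min(e1, e2) - max(b1, b2)
    if inter <= 0:
        return None
    union = max(e1, e2) - min(b1, b2)   # contiguous because the intervals overlap
    return (s1 + s2) * inter / union
```

Two anchors covering exactly the same interval keep their full combined score, while a partial overlap scales it down in proportion to I/U.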
Scoring with Affine Gaps • An open research area (T-COFFEE) • 2 classical models : (1) sum-of-pairs model (2) consensus model
Scoring with Affine Gaps (cont.) • sum-of-pairs model :Sum of scores of all pairwise alignments • consensus model : • Create a “consensus string” by a majority vote at each position. • Sum of pairwise scores between the consensus and each individual sequence
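The two classical models can be compared on a toy alignment. This sketch simplifies heavily: it uses linear gap costs rather than affine ones, the match, mismatch and gap values are arbitrary assumptions, ties in the majority vote go to the character seen first in the column, and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Column-by-column score of two aligned rows (linear gap cost)."""
    total = 0
    for p, q in zip(a, b):
        if p == "-" and q == "-":
            continue                      # a shared gap column costs nothing
        if p == "-" or q == "-":
            total += gap
        elif p == q:
            total += match
        else:
            total += mismatch
    return total


def sum_of_pairs(rows):
    """Sum of the scores of all pairwise alignments induced by the rows."""
    return sum(pair_score(a, b) for a, b in combinations(rows, 2))


def consensus(rows):
    """Majority vote at each column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def consensus_score(rows):
    """Sum of pairwise scores between the consensus and each row."""
    c = consensus(rows)
    return sum(pair_score(c, r) for r in rows)
```

For the rows ACGT, ACGA and AC-T, the consensus string is ACGT. Sum-of-pairs compares every pair of rows, so it needs K(K-1)/2 comparisons per column, while the consensus model needs only K.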
Scoring with Affine Gaps (cont.) • Each scoring scheme has advantages and disadvantages. E.g. consensus • We use a “combination” of both: • sum-of-pairs for substitutions. • consensus for gaps. • Note: this is most similar to CLUSTALW, which uses heuristically weighted per-sequence penalties for gaps.
Scoring with Affine Gaps (cont.) • Stacking effect (consensus affine-gap) : Because gap-open penalties are large compared to match & mismatch scores, often it is favorable to artificially open additional gaps in order to stack the gap openings. • Solution: use “gap-end” penalty (== “gap-open” penalty)
Scoring with Affine Gaps (cont.) • consensus string : ATCTGT---CAG
Scoring with Affine Gaps (cont.) • Define: • (Aij): K × L alignment matrix, Aij belongs to {A, C, G, T, -} • (Bij): K × L alignment matrix, Bij belongs to {N, O, G, C}
Scoring with Affine Gaps (cont.) • Bij = ‘O’ (gap-open): the positions opening a gap. • Bij = ‘G’ (gap-continue): Aij = ‘-’, except gap-open positions. • Bij = ‘C’ (gap-close): the positions closing a gap. • Bij = ‘N’ (nucleotide): Aij ≠ ‘-’, except gap-close positions.
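One way to read these definitions is sketched below for a single row of the alignment. The interpretation is an assumption drawn from the wording above: 'O' is the first '-' of a run of gaps, 'G' is any later '-' in that run, 'C' is the first nucleotide after a run, and 'N' is every other nucleotide. The function name is hypothetical.

```python
def gap_states(row: str) -> str:
    """Map one aligned row over {A, C, G, T, -} to states over {N, O, G, C}."""
    states = []
    prev_gap = False
    for ch in row:
        if ch == "-":
            states.append("G" if prev_gap else "O")   # open, then continue
            prev_gap = True
        else:
            states.append("C" if prev_gap else "N")   # close right after a gap
            prev_gap = False
    return "".join(states)
```

Applying this function to every row of (Aij) gives the matrix (Bij). A gap that runs to the end of a row has an 'O' but no matching 'C'.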