Northeastern University, China. ALAE: Accelerating Local Alignment with Affine Gap Exactly in Biosequence Databases. Xiaochun Yang, Honglei Liu, Bin Wang.
Local Alignment • Similar over short conserved regions • Dissimilar over remaining regions • Applications • Comparing long stretches of anonymous DNA • Searching for unknown domains or motifs within proteins from different families • …
Related Work • Smith-Waterman algorithm (1981) • An exact approach but very slow • Not used for search • BLAST: an efficient but approximate approach • OASIS: an exact approach, efficient only for short query sequences (fewer than 60 characters) • BWT-SW: an exact approach but inefficient • Our target • An efficient and exact approach: ALAE (Accelerating Local Alignment with Affine gap Exactly)
Local Alignment • Input: two sequences (P and T), a similarity function, and a threshold H • Output: alignments with score >= H
Measure Similarity • Scoring scheme <sa, sb, sg, ss> • An identical mapping: positive score sa • A mismatch: negative score sb • A gap of length r: negative score sg + r×ss, where sg is the gap opening penalty and ss is the gap extension penalty • Example: S1 = TGCGC-ATGGATTGACCGA, S2 = TGCGCCATTGAT--ACCGA, scoring scheme <1, -3, -2, -1>: sim(S1, S2) = 15×1 + (-3) + (-2 - 1) + (-2 + 2×(-1)) = 5
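The scoring rule on this slide can be checked with a few lines of Python. This is only a sketch: the function name `alignment_score` and the convention that both sequences are passed already aligned (with `-` marking gaps) are assumptions made for illustration, not part of the slides.

```python
def alignment_score(s1, s2, scheme):
    """Score two already-aligned sequences under <sa, sb, sg, ss>.

    A gap of length r costs sg + r*ss (one opening penalty plus one
    extension penalty per gap character), as stated on the slide.
    """
    sa, sb, sg, ss = scheme
    if len(s1) != len(s2):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    gap_row = None  # which sequence the currently open gap is in (1 or 2)
    for a, b in zip(s1, s2):
        if a == "-" or b == "-":
            row = 1 if a == "-" else 2
            if gap_row != row:
                score += sg  # a new gap opens
            score += ss      # every gap character pays the extension penalty
            gap_row = row
        else:
            score += sa if a == b else sb
            gap_row = None
    return score


S1 = "TGCGC-ATGGATTGACCGA"
S2 = "TGCGCCATTGAT--ACCGA"
print(alignment_score(S1, S2, (1, -3, -2, -1)))  # 5, matching the slide
```

The result decomposes as on the slide: 15 matches, one mismatch, one gap of length 1, and one gap of length 2.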
A Basic Approach • Build a matrix for X (taken from T) against P • Entry (i, j): the best alignment score of X[1,i] and any substring of P ending at position j.
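The matrix described above can be sketched with a standard Gotoh-style affine-gap recurrence. The slides do not spell out the recurrence, so the details here are assumptions: row 0 is initialised to 0 (the empty prefix of X aligns with an empty substring ending anywhere in P), `Ga` holds scores of alignments ending with a character of X against a gap, and `Gb` holds scores ending with a character of P against a gap. The function name `dp_matrix` is hypothetical.

```python
NEG = float("-inf")


def dp_matrix(X, P, scheme):
    """M[i][j] = best score of X[1..i] against any substring of P ending at j."""
    sa, sb, sg, ss = scheme
    n, m = len(X), len(P)
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    Ga = [[NEG] * (m + 1) for _ in range(n + 1)]  # X[i] aligned to a gap
    Gb = [[NEG] * (m + 1) for _ in range(n + 1)]  # P[j] aligned to a gap
    for j in range(m + 1):
        M[0][j] = 0  # assumption: empty prefix of X scores 0 everywhere
    for i in range(1, n + 1):
        Ga[i][0] = sg + i * ss  # all of X[1..i] aligned to one gap
        M[i][0] = Ga[i][0]
        for j in range(1, m + 1):
            # extend an existing gap (ss) or open a new one (sg + ss)
            Ga[i][j] = max(Ga[i - 1][j] + ss, M[i - 1][j] + sg + ss)
            Gb[i][j] = max(Gb[i][j - 1] + ss, M[i][j - 1] + sg + ss)
            diag = M[i - 1][j - 1] + (sa if X[i - 1] == P[j - 1] else sb)
            M[i][j] = max(diag, Ga[i][j], Gb[i][j])
    return M


# X = GCTA is a substring of T = AAAGCTA from the example slide; P = GCTAG.
M = dp_matrix("GCTA", "GCTAG", (1, -3, -5, -2))
print(M[4][4])  # 4: GCTA matches P[1..4] exactly
```

Running this once per candidate X is the naive baseline; the filtering and reuse techniques in the rest of the talk aim to skip most of these entries.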
An Example of a DP Matrix • P = GCTAG, T = AAAGCTA • Scoring scheme = <1, -3, -5, -2> • (Figure: transitions into the gap states Ga and Gb, labelled -2 and -5-2)
A Basic Approach (figure: matrices over P and T, with entries at row i = i1+t1 = i2+t2)
Challenges • Speed • Each matrix contains m ~ m×n entries • n matrices • How can we avoid calculating most entries without impairing the accuracy of the alignment results? • In-memory algorithm • Long sequences: both T and P are long
Contributions • Speed • Prune unnecessary calculations • Avoid duplicate calculations • In-memory algorithm • Use compressed suffix array • Mathematical analysis
Outline • Local filterings • Global filtering • Reusing calculations • A hybrid algorithm
Local filterings • Length Filtering (figure: pruned region)
Local filterings • Score Filtering (figure: pruned region)
Local filterings • q-Prefix Filtering • Simpler function (figure: pruned regions)
Comparison of Calculating One Matrix • P = GCTAAGCTAAGCTGC (positions 1 to 15) • X = GCTAAGCTAGT (positions 1 to 11) • Scoring scheme <1, -3, -5, -2> • H = 3
Outline • Local filterings • Global filtering • Reusing calculations • A hybrid algorithm
Global Filtering (figure: pruned entries at row i = i1+t1 = i2+t2)
Global Filtering • Pruned fork areas • Using X’: alignment score >= Sa • It is unnecessary to calculate the fork area in the matrix of X and P • Question: how can calculations be safely avoided based on already calculated matrices?
Global Filtering • Update and check unnecessary calculations on-the-fly with a Boolean matrix for X and X’ (scoring scheme <1, -3, -5, -2>) • (1) Space consuming: m×n space • (2) Calculation order
Global Filtering • q-prefix domination: X’ dominates X
Global Filtering • q-prefix domination: X’ dominates X • Text T: construct dominations offline in O(n) time • Query P: check useless calculations on-the-fly • A particular calculation order is unnecessary.
Outline • Local filterings • Global filtering • Reusing calculations • A hybrid algorithm
Reusing score calculations for P • Reusable alignment entries • Entries with a common prefix Ps can share alignment scores.
Reusing score calculations for P • Reusable alignment entries • If two forks have equivalent scores for their FGOEs, their entries with common substring Ps can share alignment scores.
Outline • Local filterings • Global filtering • Reusing calculations • A hybrid algorithm
A Hybrid Algorithm • Row by row • Column by column
Mathematical Analysis • Upper bound on the number of calculated entries for representative scoring schemes specified by BLAST (http://blast.ncbi.nlm.nih.gov/Blast.cgi) • DNA: 4.50mn^0.520 ~ 9.05mn^0.896 • Proteins: 8.28mn^0.364 ~ 7.49mn^0.723
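To get a feel for these bounds, the sketch below compares each one with the m×n entries of a single full matrix. It assumes m is the query length and n the text length, and uses text lengths taken from the experiment slides (50 million for DNA, 10 million for proteins); the helper name `fraction_of_full_matrix` is made up for this illustration. Because every bound has the form c·m·n^e, m cancels in the ratio.

```python
def fraction_of_full_matrix(coef, exponent, n):
    """Ratio (coef * m * n**exponent) / (m * n); m cancels out."""
    return coef * n ** (exponent - 1.0)


# (label, coefficient, exponent, assumed text length n)
bounds = [
    ("DNA, best case", 4.50, 0.520, 50_000_000),
    ("DNA, worst case", 9.05, 0.896, 50_000_000),
    ("Protein, best case", 8.28, 0.364, 10_000_000),
    ("Protein, worst case", 7.49, 0.723, 10_000_000),
]

for label, coef, exponent, n in bounds:
    ratio = fraction_of_full_matrix(coef, exponent, n)
    print(f"{label}: bound / (m*n) = {ratio:.3g}")
```

The ratio shrinks as n grows whenever the exponent is below 1, which is why the exponent matters more than the constant for long texts.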
Experiments • Data sets • Human genome data set • Length of a text: 50 million ~ 1 billion • Mouse genome data set • Length of each query: 1 thousand ~ 1 million • Protein data set • Length of a text: 10 million ~ 50 million • Length of each query: 200 ~ 100,000 • Threshold: E-value • Scoring scheme: the same parameters as BLAST • Environment: GNU C++, Intel Core i7 quad-core CPU at 2.93 GHz, 8GB memory, and a 500GB disk, running an Ubuntu (Linux) operating system.
Alignment Time and Number of Results • 76 times faster than BWT-SW • 16 times faster than BWT-SW
Conclusions • High efficiency of ALAE • Improves BWT-SW significantly • Accelerates BLAST for most of the scoring schemes • In-memory approach using compressed suffix array • Mathematical analysis • Upper bound on calculated entries
Thank you! Source code to be available at http://faculty.neu.edu.cn/yangxc/project
Simulating Searches Using Compressed Suffix Array • Match a q-length substring in text • Identify forks • Find occurrences of a substring in text • Calculate end positions of alignments • Get all suffixes with the same prefix as Xq
Review of Compressed Suffix Array • T = GCTAGC (positions 1 to 6), T’ = GCTAGC$ ($ at position 7) • Conceptual matrix (sorted rotations of T’): $GCTAGC, AGC$GCT, C$GCTAG, CTAGC$G, GC$GCTA, GCTAGC$, TAGC$GC • SA[0,6] = 7 4 6 2 5 1 3 • BWT = CTGGA$C • X = GC • Positions of GC in T: SA[4] = 5, SA[5] = 1
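The numbers on this slide can be reproduced with a small, uncompressed sketch. It builds the suffix array by plain sorting and finds the SA range of X with the standard BWT backward search; a real compressed suffix array would store sampled values and rank structures instead. The function names are illustrative, and positions are 1-based to match the slide.

```python
from collections import Counter


def suffix_array_and_bwt(text):
    """Naive construction: sort all suffixes of text + '$'."""
    t = text + "$"  # '$' sorts before A, C, G, T in ASCII
    order = sorted(range(len(t)), key=lambda k: t[k:])
    sa = [k + 1 for k in order]             # 1-based start positions
    bwt = "".join(t[k - 1] for k in order)  # t[-1] is '$' when k == 0
    return sa, bwt


def backward_search(bwt, pattern):
    """Return the half-open SA range [lo, hi) of suffixes prefixed by pattern."""
    counts = Counter(bwt)
    first, total = {}, 0  # first[c] = number of characters smaller than c
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):
        if c not in first:
            return 0, 0
        lo = first[c] + bwt[:lo].count(c)  # rank queries done by scanning
        hi = first[c] + bwt[:hi].count(c)
        if lo >= hi:
            return lo, lo
    return lo, hi


sa, bwt = suffix_array_and_bwt("GCTAGC")
lo, hi = backward_search(bwt, "GC")
print(sa, bwt)    # [7, 4, 6, 2, 5, 1, 3] CTGGA$C
print(sa[lo:hi])  # [5, 1]: the positions of GC in T
```

The scanning `count` calls keep the sketch short; an actual index would answer these rank queries in constant or logarithmic time.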
Compressed Suffix Array: reverse T to T^-1 • T = GCTAGC (positions 1 to 6), T^-1 = CGATCG$ (characters labelled with their positions in T: 6 5 4 3 2 1, and $ with 0), T’ = $GCTAGC • Conceptual matrix (sorted rotations of T^-1): $CGATCG, ATCG$CG, CG$CGAT, CGATCG$, G$CGATC, GATCG$C, TCG$CGA • SA[0,6] = 0 4 2 6 1 5 3 • BWT = GGT$CCA • X = GC, reversed: CG • Positions of CG in T^-1: SA[2] = 2, SA[3] = 6 • Therefore, positions of GC in T: SA[2]-|X|+1 = 1, SA[3]-|X|+1 = 5
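The position mapping on this slide can be illustrated as follows. This is a naive sketch under the slide's labelling convention (each suffix of the reversed text is labelled with the position in T of its first character, and the sentinel with 0); the function names are hypothetical and the lookup uses a linear scan rather than a compressed index.

```python
def reversed_suffix_array(text):
    """Suffix array of reversed text + '$', labelled with positions in T."""
    n = len(text)
    rev = text[::-1] + "$"
    order = sorted(range(n + 1), key=lambda k: rev[k:])
    # the suffix starting at 0-based index k of rev begins with T[n - k]
    return rev, [n - k for k in order]


def occurrences_via_reverse(text, x):
    """Start positions (1-based) of x in text, found through the reversed index."""
    rev, sa = reversed_suffix_array(text)
    n = len(text)
    x_rev = x[::-1]
    starts = []
    for label in sa:
        if rev[n - label:].startswith(x_rev):
            # label is where the match ends in T; shift back to its start
            starts.append(label - len(x) + 1)
    return starts


rev, sa = reversed_suffix_array("GCTAGC")
print(rev, sa)                                # CGATCG$ [0, 4, 2, 6, 1, 5, 3]
print(occurrences_via_reverse("GCTAGC", "GC"))  # [1, 5]
```

Each label marks the end position of an occurrence in T, which is why subtracting |X| and adding 1 recovers the start position.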
Align Distinct Substring in T with P (figure: a distinct substring X of T aligned against P; matrix entry at row i, column j)
Alignment Time • T = 50 million characters • P = 10 thousand characters