ALAE: Accelerating Local Alignment with Affine Gap Exactly in Biosequence Databases

Xiaochun Yang, Honglei Liu, Bin Wang • Northeastern University, China

Local Alignment • Similar over short conserved regions • Dissimilar over remaining regions • Applications • Comparing long stretches of anonymous DNA • Searching for unknown domains or motifs within proteins from different families • …

Related Work • Smith-Waterman algorithm (1981) • An exact approach but very slow • Not used for search • BLAST: an efficient but approximate approach • OASIS: an exact approach, efficient only for short query sequences (fewer than 60 characters) • BWT-SW: an exact approach but inefficient • Our target • An efficient and exact approach: ALAE (Accelerating Local Alignment with affine gap Exactly)

Local Alignment • Input: two sequences P and T, a similarity function, a threshold H • Output: alignments with score >= H

Measure Similarity • Scoring scheme <sa, sb, sg, ss> • An identical mapping: positive score sa • A mismatch: negative score sb • A gap of length r: negative score sg + r×ss, where sg is the gap opening penalty and ss is the gap extension penalty
Example with scoring scheme <1, -3, -2, -1>:
S1: TGCGC-ATGGATTGACCGA
S2: TGCGCCATTGAT--ACCGA
sim(S1,S2) = 15×1 + (-3) + (-2-1) + (-2 + 2×(-1)) = 5
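The arithmetic in the example above can be checked with a minimal sketch. The function name `affine_alignment_score` and its parameter names are hypothetical (they are not from the slides); the sketch assumes the two rows are already aligned, with '-' marking a gap position.

```python
def affine_alignment_score(s1, s2, sa=1, sb=-3, sg=-2, ss=-1):
    """Score two pre-aligned rows under the scheme <sa, sb, sg, ss>.

    A gap of length r contributes sg + r * ss (one opening penalty plus
    one extension penalty per gap position).
    """
    if len(s1) != len(s2):
        raise ValueError("aligned rows must have equal length")
    score = 0
    open_gap_row = None  # which row (1 or 2) holds the gap currently open
    for a, b in zip(s1, s2):
        if a == "-" and b == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if a == "-" or b == "-":
            row = 1 if a == "-" else 2
            if open_gap_row != row:
                score += sg          # a new gap starts: opening penalty
            score += ss              # every gap position: extension penalty
            open_gap_row = row
        else:
            score += sa if a == b else sb
            open_gap_row = None
    return score


S1 = "TGCGC-ATGGATTGACCGA"
S2 = "TGCGCCATTGAT--ACCGA"
print(affine_alignment_score(S1, S2))  # 15 matches, 1 mismatch, gaps of length 1 and 2
```

With the default scheme <1, -3, -2, -1> this reproduces sim(S1,S2) = 5 from the slide.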

A Basic Approach • Entry (i, j) of the matrix for X and P: the best alignment score of X[1,i] and any substring of P ending at position j.

A DP Algorithm
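The recurrence on this slide was an image and is not recoverable from the transcript. As a stand-in, the sketch below uses the textbook Gotoh-style local alignment with affine gaps (a Smith-Waterman variant), which is not necessarily the exact formulation used by ALAE. The function name `local_affine_best_score` is hypothetical; the scoring scheme follows the <sa, sb, sg, ss> convention, so a gap of length r costs sg + r×ss.

```python
def local_affine_best_score(p, t, sa=1, sb=-3, sg=-5, ss=-2):
    """Best local alignment score of p and t with affine gaps.

    Textbook Gotoh-style DP, kept to two rows of memory.
    m_*  : best score of an alignment ending at cell (i, j), floored at 0
    ga_* : best score ending with a gap that consumes t[i-1]
    gb   : best score ending with a gap that consumes p[j-1]
    """
    neg_inf = float("-inf")
    n = len(p)
    prev_m = [0] * (n + 1)
    prev_ga = [neg_inf] * (n + 1)
    best = 0
    for i in range(1, len(t) + 1):
        cur_m = [0] * (n + 1)
        cur_ga = [neg_inf] * (n + 1)
        gb = neg_inf
        for j in range(1, n + 1):
            # extend an existing gap, or open a new one from the cell above
            cur_ga[j] = max(prev_ga[j] + ss, prev_m[j] + sg + ss)
            # same choice for a gap along the row
            gb = max(gb + ss, cur_m[j - 1] + sg + ss)
            diag = prev_m[j - 1] + (sa if t[i - 1] == p[j - 1] else sb)
            cur_m[j] = max(0, diag, cur_ga[j], gb)
            best = max(best, cur_m[j])
        prev_m, prev_ga = cur_m, cur_ga
    return best


# P and T from the DP-matrix example slide, scoring scheme <1, -3, -5, -2>
print(local_affine_best_score("GCTAG", "AAAGCTA"))
```

For P = GCTAG and T = AAAGCTA the best local alignment is the shared substring GCTA, so the score is 4. The full computation touches every entry of the matrix, which is the cost the filtering techniques in the following slides aim to avoid.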

An Example of a DP Matrix • P = GCTAG, T = AAAGCTA • Scoring scheme = <1,-3,-5,-2> • Gap matrices Ga and Gb: extending a gap costs -2, opening a gap costs -5-2

A Basic Approach (figure: DP matrices over P and T, with i = i1+t1 = i2+t2)

Challenges • Speed • Each matrix contains m ~ m×n entries • n matrices • How to avoid calculating most of the entries without impairing the accuracy of the alignment results? • In-memory algorithm • Long sequences: both T and P are long

Contributions • Speed • Prune unnecessary calculations • Avoid duplicate calculations • In-memory algorithm • Use compressed suffix array • Mathematical analysis

Outline • Local filterings • Global filtering • Reusing calculations • A hybrid algorithm

Local filterings • Length Filtering (figure: pruned entries)

Local filterings • Score Filtering (figure: pruned entries)

Local filterings • q-Prefix Filtering • Simpler function (figure: pruned entries)

Comparison of Calculating One Matrix • P = GCTAAGCTAAGCTGC (positions 1-15) • X = GCTAAGCTAGT (positions 1-11) • Scoring scheme <1, -3, -5, -2> • H = 3

Global Filtering (figure: pruned entries, with i = i1+t1 = i2+t2)

Global Filtering • Pruned fork areas • Using X’: alignment score >= Sa • It is unnecessary to calculate the fork area in the matrix of X and P • Question: can calculations be safely avoided based on already calculated matrices?

Global Filtering • Update and check unnecessary calculations on-the-fly with a Boolean matrix (example: X and X’, scoring scheme <1, -3, -5, -2>) • (1) Space consuming: m×n space • (2) Calculation order

Global Filtering • q-prefix domination: X’ dominates X

Global Filtering • q-prefix domination: X’ dominates X • Text T: construct dominations offline in O(n) time • Query P: check useless calculations on-the-fly • No particular calculation order is required

Reusing score calculations for P • Reusable alignment entries: entries with a common prefix Ps can share alignment scores.

Reusing score calculations for P • Reusable alignment entries: if two forks have equivalent scores for their FGOEs, their entries with common substring Ps can share alignment scores.

A Hybrid Algorithm • Row by row • Column by column

Mathematical Analysis • Upper bound on the number of calculated entries for representative scoring schemes specified by BLAST (http://blast.ncbi.nlm.nih.gov/Blast.cgi) • DNA: 4.50mn^0.520 ~ 9.05mn^0.896 • Proteins: 8.28mn^0.364 ~ 7.49mn^0.723

Experiments • Data sets • Human genome data set • Length of a text: 50 million ~ 1 billion • Mouse genome data set • Length of each query: 1 thousand ~ 1 million • Protein data set • Length of a text: 10 million ~ 50 million • Length of each query: 200 ~ 100,000 • E-value: threshold • Scoring scheme: the same parameters as BLAST • Environment: GNU C++, Intel Core i7 2.93GHz quad-core CPU, 8GB memory, 500GB disk, Ubuntu (Linux) operating system

Alignment Time and Number of Results • 76 times faster than BWT-SW • 16 times faster than BWT-SW

Filtering Ratio

Reusing Ratio

Index Size

Conclusions • High efficiency of ALAE • Improves BWT-SW significantly • Accelerates BLAST for most of the scoring schemes • In-memory approach using compressed suffix array • Mathematical analysis • Upper bound on calculated entries

Thank you! Source code to be available at http://faculty.neu.edu.cn/yangxc/project

Simulating Searches Using Compressed Suffix Array • Match a q-length substring in text • Identify forks • Find occurrences of a substring in text • Calculate end positions of alignments • Get all suffixes with the same prefix as Xq

Review of Compressed Suffix Array • T = GCTAGC (positions 1-6) • T’ = GCTAGC$ ($ at position 7) • X = GC
Conceptual matrix (sorted rotations of T’) with SA[0,6]:
SA[0] = 7: $GCTAGC
SA[1] = 4: AGC$GCT
SA[2] = 6: C$GCTAG
SA[3] = 2: CTAGC$G
SA[4] = 5: GC$GCTA
SA[5] = 1: GCTAGC$
SA[6] = 3: TAGC$GC
BWT = CTGGA$C
• Positions of GC in T • SA[4] = 5 • SA[5] = 1
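The suffix array and BWT of this example can be reproduced with a small, uncompressed sketch. The helper names (`suffix_array`, `bwt_from_sa`, `occurrences`) are hypothetical, and the naive sorting and linear scan stand in for the compressed index and its search procedure, which the slides do not spell out.

```python
def suffix_array(text):
    """1-based start positions of all suffixes of text, in sorted order.

    Naive construction: fine for a toy example, far too slow for genomes.
    """
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def bwt_from_sa(text, sa):
    """Last column of the sorted rotations: the character preceding
    each suffix, wrapping around for the suffix at position 1."""
    return "".join(text[i - 2] for i in sa)


def occurrences(text, sa, x):
    """Rows of the suffix array whose suffix starts with x, and the
    corresponding 1-based positions in the text."""
    rows = [k for k, i in enumerate(sa) if text[i - 1:].startswith(x)]
    return rows, sorted(sa[k] for k in rows)


T_PRIME = "GCTAGC$"          # '$' sorts before A, C, G, T in ASCII
SA = suffix_array(T_PRIME)
BWT = bwt_from_sa(T_PRIME, SA)
ROWS, POSITIONS = occurrences(T_PRIME, SA, "GC")
print(SA, BWT, ROWS, POSITIONS)
```

The rows matching GC are contiguous (SA[4] and SA[5]), which is what lets an index report all occurrences of a substring as a single interval.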

Compressed Suffix Array: reverse T to T^-1 • T = GCTAGC (positions 1-6) • T^-1 = CGATCG$ (characters labeled with their positions in T: 6, 5, 4, 3, 2, 1, and 0 for $) • T’ = $GCTAGC ($ at position 0) • X = GC, P^-1 = CG
Conceptual matrix (sorted rotations of T^-1) with SA[0,6]:
SA[0] = 0: $CGATCG
SA[1] = 4: ATCG$CG
SA[2] = 2: CG$CGAT
SA[3] = 6: CGATCG$
SA[4] = 1: G$CGATC
SA[5] = 5: GATCG$C
SA[6] = 3: TCG$CGA
BWT = GGT$CCA
• Positions of CG in T^-1 • SA[2] = 2 • SA[3] = 6
Therefore, • Positions of GC in T • SA[2]-|X|+1 = 1 • SA[3]-|X|+1 = 5
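The position conversion on this slide can be sketched as follows. The function names are hypothetical, and the naive sort again stands in for the compressed index. Each suffix of the reversed text is labeled with the position in T of its first character, so a match of the reversed pattern yields the end position of X in T, and subtracting |X| - 1 gives the start.

```python
def reversed_suffix_array(t):
    """Suffix array of reverse(t) + '$', labeled with positions in t.

    The suffix of the reversed text starting at 0-based offset k begins
    with the character at 1-based position len(t) - k of t; the sentinel
    '$' gets label 0. Returns the reversed text, the sorted offsets and
    the labels in sorted order.
    """
    rev = t[::-1] + "$"
    order = sorted(range(len(rev)), key=lambda k: rev[k:])
    labels = [len(t) - k for k in order]
    return rev, order, labels


def start_positions(t, x):
    """1-based start positions of x in t, found through the reversed text."""
    rev, order, labels = reversed_suffix_array(t)
    rev_x = x[::-1]
    ends = [lab for k, lab in zip(order, labels) if rev[k:].startswith(rev_x)]
    # an occurrence ending at position e starts at e - |x| + 1
    return sorted(e - len(x) + 1 for e in ends)


REV, ORDER, LABELS = reversed_suffix_array("GCTAGC")
REV_BWT = "".join(REV[k - 1] for k in ORDER)  # k = 0 wraps to '$'
print(LABELS, REV_BWT, start_positions("GCTAGC", "GC"))
```

Indexing the reversed text means that extending X by one character to the right in T corresponds to prepending one character to the reversed pattern, which is the direction in which a BWT-based search proceeds.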

Align Distinct Substring in T with P (figure: matrix of a distinct substring X of T against P, with row i and column j)

Alignment Time • T = 50 million characters • P = 10 thousand characters
