
Fast Finds: Making Google & BLAST Faster


Presentation Transcript


  1. Fast Finds: Making Google & BLAST Faster Dr. Laurence Boxer (w. Stephen Englert, NU CIS/MAT ’05) Dept. of Computer & Information Sciences Presented to Niagara University Research Council September, 2005

  2. The problem: Given two character strings, a “pattern” P and a “text” T (with the text typically much larger than the pattern), find all matching copies of the pattern in the text. Examples using exact matches:
  P: agtacagtac
  T: actaactagtacagtacagtacaactgtccatccg
  Output: copies of P starting at positions 8 and 13 of T (notice finds may overlap – these two copies share the characters agtac).
  Input: T: Welcome! You’ve come from …. P: come
  Output: two finds – “come” inside “Welcome” and the word “come” itself.
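To make the problem concrete, here is a minimal brute-force reference solution in C++ (a sketch for illustration, not from the slides; it checks every alignment, which is exactly the Θ(mn) worst-case behavior that slide 10 returns to):

```cpp
#include <iostream>
#include <string>
#include <vector>

// Report every starting index (0-based) at which P occurs in T.
// Worst case Θ(mn): every alignment may be checked character by character.
std::vector<std::size_t> bruteForceMatch(const std::string& T, const std::string& P) {
    std::vector<std::size_t> starts;
    if (P.empty() || P.size() > T.size()) return starts;
    for (std::size_t i = 0; i + P.size() <= T.size(); ++i) {
        std::size_t j = 0;
        while (j < P.size() && T[i + j] == P[j]) ++j;
        if (j == P.size()) starts.push_back(i);   // full (possibly overlapping) match at i
    }
    return starts;
}

int main() {
    for (std::size_t s : bruteForceMatch("actaactagtacagtacagtacaactgtccatccg", "agtacagtac"))
        std::cout << s << '\n';                   // prints 7 and 12 (0-based)
}
```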

  3. Further, we want our solution to run quickly.
  • Users typically will wait a few seconds. They won’t wait a few hours, days, or months.
  • Today’s databases may be HUGE, e.g., the entire Web, or multiple genomes. Since, usually, the time of solution grows with the amount of data processed, it’s important to use efficient algorithms.
  • Thus, the growth rate of a solution’s running time as a function of the amount of data processed is an important consideration for a software developer.

  4. Notation: we commonly use
  • T(n) for the time required to solve a problem of size n, where n is typically the number of data items processed.
  • T(n) = Θ(f(n)) means T(n) is approximately proportional to f(n) (within a strictly defined technical rule).
  • T(n) = O(f(n)) means T(n) grows no faster than proportionally to f(n); that is, f(n) is an upper bound on the growth rate, not necessarily an exact match (again within a strictly defined technical rule).
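For completeness, the “strictly defined technical rule” behind this notation is the standard one (these definitions are supplied here; the slide does not spell them out):

```latex
% Standard definitions of Theta and big-O.
T(n) = \Theta(f(n)) \;\iff\; \exists\, c_1, c_2 > 0,\ n_0 \text{ such that }
  c_1 f(n) \le T(n) \le c_2 f(n) \ \text{ for all } n \ge n_0

T(n) = O(f(n)) \;\iff\; \exists\, c > 0,\ n_0 \text{ such that }
  T(n) \le c\, f(n) \ \text{ for all } n \ge n_0
```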

  5. Simpler search problem: Sequential Search for a given value
  • Start at the beginning of the list.
  • Examine each item until either you find what you seek (stop at success), or you reach the end of the list without finding it (stop at failure).
  • Example: list of schools: Niagara, Canisius, UB, RIT, St. Bonaventure
  • Search for Niagara succeeds at the 1st item (best case: Θ(1) time).
  • Search for St. Bonaventure succeeds at the 5th (last) item.
  • Search for NCCC fails at the last item.
  • Worst cases require examining every item, hence T(n) = Θ(n); in general, then, T(n) = O(n).
  • Would you search a phone book for Zielinski this way?
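A minimal C++ sketch of sequential search over the slide’s school list (function and variable names are illustrative):

```cpp
#include <iostream>
#include <string>
#include <vector>

// Sequential search: examine items in order; Θ(1) best case, Θ(n) worst case.
// Returns the index of the first occurrence of key, or -1 on failure.
int sequentialSearch(const std::vector<std::string>& list, const std::string& key) {
    for (std::size_t i = 0; i < list.size(); ++i)
        if (list[i] == key) return (int)i;   // success: stop immediately
    return -1;                               // failure: every item was examined
}

int main() {
    std::vector<std::string> schools{"Niagara", "Canisius", "UB", "RIT", "St. Bonaventure"};
    std::cout << sequentialSearch(schools, "Niagara") << '\n';  // 0 (best case)
    std::cout << sequentialSearch(schools, "NCCC") << '\n';     // -1 (fails at last item)
}
```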

  6. Simpler search problem: Binary Search for a given value
  • Requires the list to be ordered.
  • Mimics the way we search a phone book – each item examined eliminates a large portion of the data. Start in the middle; decide which half to continue searching; repeat this pattern on the portion of the data not yet eliminated. Thus, the 1st item examined, if not the sought value, eliminates ½ of the data; the 2nd item examined, if not the sought value, eliminates ½ of the remaining ½, leaving ¼ of the original data, etc.; the k-th item examined, if not the sought value, leaves n/2^k of the original data, until only 1 item remains.
  • Thus, in the worst case, we stop when n/2^k = 1, i.e., when k = log₂ n.
  • Since k reflects T(n), worst case: T(n) = Θ(log n).
  • In general, T(n) = O(log n).

  7. Examples of binary search (sorted list of 8 schools, with Niagara at position [4])
  • Search for Rochester:
  • Start in the middle – [4]. Compare: Niagara < Rochester, so the range narrows to [5-8].
  • Middle of [5-8] is [6], so success at the 2nd item examined.
  • Search for Ithaca:
  • Start in the middle – [4]. Niagara > Ithaca, so the range narrows to [1-3].
  • Middle of [1-3] is [2]. Compare: Canisius < Ithaca, so the range narrows to [3-3].
  • Range [3-3] has one item. Compare: D’Youville < Ithaca, so the search fails at the 3rd item examined.
  • Note: with n = 8 items, no search examines more than log₂ 8 = 3 items.
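A minimal C++ sketch of binary search. The sorted 8-item list below is hypothetical, chosen to be consistent with the comparisons on slide 7; the midpoint convention here may examine items in a slightly different order than the slide’s hand-worked example:

```cpp
#include <iostream>
#include <string>
#include <vector>

// Binary search on a sorted list: each comparison halves the remaining
// range, so at most about log2(n) items are examined — O(log n) time.
int binarySearch(const std::vector<std::string>& sorted, const std::string& key) {
    std::size_t lo = 0, hi = sorted.size();
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        if (sorted[mid] == key) return (int)mid;   // success
        if (sorted[mid] < key) lo = mid + 1;       // continue in right half
        else hi = mid;                             // continue in left half
    }
    return -1;                                     // failure: range is empty
}

int main() {
    std::vector<std::string> schools{
        "Buffalo State", "Canisius", "D'Youville", "Niagara",
        "RIT", "Rochester", "St. Bonaventure", "UB"};
    std::cout << binarySearch(schools, "Rochester") << '\n';  // 5 (success)
    std::cout << binarySearch(schools, "Ithaca") << '\n';     // -1 (failure)
}
```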

  8. Measures of running times for sequential and binary searches
  • Sequential search – always applicable, but T(n) = O(n).
  • Binary search – only applicable when the data is sorted, but T(n) = O(log n).
  • If the constant of proportionality in both cases is 0.01 second (unlikely), then for roughly a million items the comparison is more than 2 hours 54 minutes (sequential) to 0.2 seconds (binary).
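A worked instance of that comparison; the problem size n = 2^20 ≈ 10^6 is an assumption, chosen because it reproduces the times quoted on the slide:

```latex
% Sequential vs. binary search at 0.01 s per item examined, n = 2^{20}.
\text{sequential: } 0.01\,\text{s} \times 2^{20} \approx 10{,}486\,\text{s}
  \approx 2\,\text{h}\,54\,\text{min}
\qquad
\text{binary: } 0.01\,\text{s} \times \log_2 2^{20} = 0.01 \times 20 = 0.2\,\text{s}
```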

  9. Approaches to speedup of string pattern matching algorithms
  • Use parallel computers – made up of multiple processors acting in parallel. Each processor processes a portion of the data in parallel with all the other processors, so the solution comes faster than by using just 1 processor. This approach was discussed in L. Boxer and R. Miller, “Coarse Grained Gather and Scatter Operations with Applications,” Journal of Parallel and Distributed Computing 64 (2004), 1297-1320, presented at the March 2005 NU Bioinformatics Seminar.
  • Devise faster sequential algorithms. This is the approach we discuss now.

  10. Previous sequential solutions to exact string pattern matching
  • It’s known that in the worst case, all characters of T must be considered. (Simple example: P has 1 character – any character of T not considered might be an overlooked match, so we must check them all.) Hence, in the worst case, a solution must take time at least proportional to n.
  • Algorithms that take O(n) time are known. By the previous remark, such algorithms are “optimal” in the worst case.
  • The Boyer-Moore algorithm runs, for many examples, much faster than linear time – often, in Θ(n/m) time, where n and m are the lengths of T and P – but in the worst case it runs in very slow Θ(mn) time. It has been modified by several authors to get a linear-time worst case, but these modifications are complex.
  • Our goal: a simple such modification.

  11. Our modification of Boyer-Moore 1 In Θ(m) time, scan the characters of P to note which characters of the “alphabet” (character set) are used in P. This will enable us to make Θ(1) time decisions on whether any given character of T is a “bad character” (one that appears nowhere in P). Now, we seek matches of the last character of P with characters of T. Boyer-Moore is often fast because it recognizes blocks of characters in T that can be skipped over, as follows.
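A minimal sketch of this Θ(m) preprocessing step. The 256-entry table (one flag per single-byte character) and the function name are assumptions for illustration, not taken from the paper:

```cpp
#include <array>
#include <iostream>
#include <string>

// Θ(m) scan of P: mark which characters of the alphabet occur in P.
// Afterward, "is this character of T a bad character?" is a Θ(1) lookup.
std::array<bool, 256> occursInPattern(const std::string& P) {
    std::array<bool, 256> occurs{};            // value-initialized to all false
    for (unsigned char c : P) occurs[c] = true;
    return occurs;
}

int main() {
    auto occurs = occursInPattern("agtacagtac");
    std::cout << occurs['g'] << ' ' << occurs['x'] << '\n';  // 1 0: 'x' is "bad"
}
```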

  12. Our modification of Boyer-Moore 2 Boyer-Moore “bad character” rule: if the character of T aligned with the last character of P isn’t in P, then none of the m characters of T starting with this one can align with the last character of P in a substring match. For example, suppose the character of T under the last character of P is a “g” that doesn’t match any character of P. Any occurrence of P ending at this “g” or at any of the following m-1 characters of T would have to contain the “g”, which is impossible; so we can skip the m-1 characters of T following “g” as possible positions for matching the last character of P.
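In code, one application of the rule might look like this (a hypothetical helper, reusing the occurs table from the previous sketch; i indexes the character of T currently aligned with the last character of P):

```cpp
#include <array>
#include <string>

// One application of the bad-character rule (slide 12).
std::size_t nextProbe(std::size_t i, const std::string& T, std::size_t m,
                      const std::array<bool, 256>& occurs) {
    if (!occurs[(unsigned char)T[i]])
        return i + m;   // bad character: m alignments are impossible, skip them all
    return i + 1;       // otherwise this position must be examined further
}
```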

  13. Our modification of Boyer-Moore 3
  • Boyer-Moore is slow in the worst case because it’s too eager to look for pattern matches. If no bad character is recognized, and hence no data is eliminated, Boyer-Moore may search for a match “everywhere”, hence compare (almost) every character of P with (almost) every character of T, resulting in Θ(mn) comparisons, hence Θ(mn) time.
  • Instead, we separate the data-elimination step. First, use the “bad character” (and other Boyer-Moore data-elimination) rule(s) to determine what data of T must be considered. This can be done in:
  • Worst case: Θ(n) time, leaving all of T (no data eliminated) – thus, a (small) waste of time.
  • Often: much faster due to skips, leaving only a fraction of T to consider – precisely the cases in which Boyer-Moore works quickly.
  • Then …

  14. Our modification of Boyer-Moore 4
  • Apply a linear-time algorithm (thus, other than Boyer-Moore) to the uneliminated data (which, often, is greatly reduced).
  • Summary:
  • In the worst case, we apply a linear-time algorithm to the original n data after wasting a linear amount of time pre-processing. This still gives a linear-time algorithm, about 12%-16% slower than had we not done this pre-processing.
  • Often, we apply a linear-time algorithm to a greatly reduced (sublinear) amount of data after spending a sublinear amount of time pre-processing, netting sublinear time altogether. A sketch of the whole two-phase approach follows below.
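A self-contained sketch of the two-phase approach, under stated assumptions: the elimination pass uses only the bad-character rule from slide 12 (the authors may use additional rules), and KMP stands in for the unnamed linear-time matcher of this slide; all names and details are illustrative, not the authors’ code:

```cpp
#include <array>
#include <cstddef>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// Θ(m) preprocessing (slide 11): which characters occur in P?
std::array<bool, 256> occursInPattern(const std::string& P) {
    std::array<bool, 256> occurs{};
    for (unsigned char c : P) occurs[c] = true;
    return occurs;
}

// Phase 1 (slides 12-13): probe T at strides of m. A bad character cannot be
// covered by any match, so it rules out a block of m alignments at Θ(1) cost.
// What survives is a set of maximal bad-character-free segments of length >= m.
// Θ(n/m) time when bad characters abound; Θ(n) in the worst case.
std::vector<std::pair<std::size_t, std::size_t>>
survivingSegments(const std::string& T, const std::string& P) {
    std::vector<std::pair<std::size_t, std::size_t>> segments;
    const std::size_t n = T.size(), m = P.size();
    if (m == 0 || n < m) return segments;
    const auto occurs = occursInPattern(P);
    std::size_t i = m - 1;                     // align last character of P
    while (i < n) {
        if (!occurs[(unsigned char)T[i]]) {    // bad character: skip m positions
            i += m;
            continue;
        }
        // Good character: expand to the maximal segment free of bad characters.
        std::size_t lo = i, hi = i;
        while (lo > 0 && occurs[(unsigned char)T[lo - 1]]) --lo;
        while (hi + 1 < n && occurs[(unsigned char)T[hi + 1]]) ++hi;
        if (hi - lo + 1 >= m) segments.emplace_back(lo, hi);
        i = hi + 1 + m;                        // no match can cover T[hi+1]
    }
    return segments;
}

// Phase 2 (slide 14): run a linear-time matcher on the surviving data only.
// KMP is used here purely as a stand-in; the slides do not name one.
void kmpReport(const std::string& T, std::size_t lo, std::size_t hi,
               const std::string& P, const std::vector<std::size_t>& fail) {
    const std::size_t m = P.size();
    std::size_t k = 0;                         // characters of P matched so far
    for (std::size_t t = lo; t <= hi; ++t) {
        while (k > 0 && T[t] != P[k]) k = fail[k - 1];
        if (T[t] == P[k]) ++k;
        if (k == m) {                          // match ends at position t
            std::cout << "match at " << t + 1 - m << '\n';
            k = fail[k - 1];                   // allow overlapping matches
        }
    }
}

int main() {
    const std::string T = "actaactagtacagtacagtacaactgtccatccg";
    const std::string P = "agtacagtac";

    std::vector<std::size_t> fail(P.size(), 0);   // standard KMP failure function
    for (std::size_t q = 1; q < P.size(); ++q) {
        std::size_t k = fail[q - 1];
        while (k > 0 && P[q] != P[k]) k = fail[k - 1];
        fail[q] = (P[q] == P[k]) ? k + 1 : 0;
    }

    for (auto [lo, hi] : survivingSegments(T, P))
        kmpReport(T, lo, hi, P, fail);            // prints 7 and 12 (0-based)
}
```

Note the cost accounting matches the slides: each bad character costs Θ(1) and eliminates m alignments, while expanding a good segment costs time proportional to the segment’s length, which phase 2 must examine anyway; so the whole pipeline is O(n) in the worst case and sublinear when skips dominate.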

  15. Experimental results
  • Stephen Englert (now in grad school at SUNY Albany) wrote most of the code to time our experiments.
  • Time units are “clocks” as reported by the C++ library’s timing facility – real time will vary with factors (that may have nothing to do with the quality of the algorithm) such as the speed of the computer hardware and the quality of the code generated by the compiler.

  16. A best case experiment
  • P = “%%%%…” (copies of “%”); T is a file not containing “%”; n = 2,350,367.
  • Since no character of T appears in P, every probe hits a bad character. Therefore, preprocessing takes Θ(n/m) time and eliminates all of T.

  17. Experiment 2
  • T: the same file as in the previous experiment.
  • Superlinear speedup is observed, likely due to matches vs. no matches.

  18. Experiment 3
  • The difference in timings (9 vs. 41 clocks for “algorithm”) is likely due to more “bad” characters: “parallel” uses fewer distinct letters than “algorithm”, so more characters of T fail to appear in it, allowing more skips.

  19. Experiment 4 – worst case – preprocessing doesn’t reduce data
  • T = “#” ^ n (n copies of “#”), with n = 2 ^ k; P = “#” ^ m (m copies of “#”).
  • Every character of T occurs in P, so nothing is eliminated; here, preprocessing slows the running time (by about 12%-16%).

  20. Notes
  • A worst case example is often artificial, not of practical interest. In such a case, our pre-processing algorithm seems to waste up to 16% of the running time. However, the result is still a linear-time algorithm – O(n); better than Boyer-Moore’s worst case – O(mn).
  • In cases of practical interest, our pre-processing yields dramatically improved performance for the linear-time algorithm it is combined with.
  • Our algorithm is fast when Boyer-Moore is fast, and much faster when Boyer-Moore is slow.
  • Our modification of Boyer-Moore is simpler than others in the literature and appears to perform at least as well.
