Semi-numerical String Matching • All the methods we’ve seen so far have been based on comparisons. • We propose alternative methods of computation such as: • Arithmetic. • Bit operations. • The fast Fourier transform.
Semi-numerical String Matching • We will survey three examples of such methods: • The Random Fingerprint method due to Karp and Rabin. • The Shift-And method due to Baeza-Yates and Gonnet, and its extension to agrep due to Wu and Manber. • A solution to the match count problem using the fast Fourier transform due to Fischer and Paterson, and an improvement due to Abrahamson.
Karp-Rabin fingerprint - exact match • Exact match problem: we want to find all the occurrences of the pattern P in the text T. • The pattern P is of length n. • The text T is of length m.
Karp-Rabin fingerprint - exact match • Arithmetic replaces comparisons. • The basic version is an efficient randomized algorithm that makes an error with small probability. • A variant never errs, and its expected running time is efficient. • We will consider a binary alphabet: {0,1}.
Arithmetic replaces comparisons. • Strings are also numbers, H: strings → numbers. • Let s be a string of length n; then $H(s) = \sum_{i=1}^{n} 2^{n-i} s(i)$, that is, s read as a binary number. • Definition: let $T_r$ denote the length-n substring of T starting at position r.
Arithmetic replaces comparisons. • Strings are also numbers, H: strings → numbers. • T = 1 0 1 1 0 1 0 1, P = 0 1 0 1, H(P) = 5. • $T_5$ = 0 1 0 1, so $H(T_5) = 5 = H(P)$: P occurs at position 5. • $T_2$ = 0 1 1 0, so $H(T_2) = 6 \ne H(P) = 5$: P does not occur at position 2.
Arithmetic replaces comparisons. • Theorem: There is an occurrence of P starting at position r of T if and only if $H(P) = H(T_r)$. • Proof: Follows immediately from the unique representation of a number in base 2.
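Arithmetic replaces comparisons. • We can compute $H(T_r)$ from $H(T_{r-1})$: $H(T_r) = 2H(T_{r-1}) - 2^n T(r-1) + T(r+n-1)$. • Example: T = 1 0 1 1 0 1 0 1, $T_1$ = 1 0 1 1 with $H(T_1) = 11$, $T_2$ = 0 1 1 0 with $H(T_2) = 2 \cdot 11 - 2^4 \cdot 1 + 0 = 6$.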
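Arithmetic replaces comparisons. • A simple efficient algorithm: • Compute H(P) and $H(T_1)$. • Run over T: compute $H(T_r)$ from $H(T_{r-1})$ in constant time, and make the comparisons. • Total running time O(m)? Only if each arithmetic operation takes constant time, which fails once n is large and the numbers need n bits.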
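The simple algorithm above can be sketched as follows. This is an illustration only: the function names are hypothetical, T and P are assumed to be lists of 0/1 integers, and the reported positions are 1-based as on the slides. Python integers are unbounded, so the sketch works for any n, but each update then costs time proportional to n rather than a constant.

```python
def H(bits):
    """Read a list of 0/1 values as a binary number (most significant bit first)."""
    value = 0
    for b in bits:
        value = 2 * value + b
    return value


def exact_matches(T, P):
    """Return all 1-based positions r with H(T_r) == H(P), using the rolling update."""
    n, m = len(P), len(T)
    if n == 0 or n > m:
        return []
    hp = H(P)
    h = H(T[:n])          # H(T_1)
    top = 1 << n          # 2^n, the weight of the bit that leaves the window
    found = []
    for r in range(1, m - n + 2):
        if h == hp:
            found.append(r)
        if r + n <= m:
            # H(T_{r+1}) = 2*H(T_r) - 2^n * T(r) + T(r+n), written with 0-based indices
            h = 2 * h - top * T[r - 1] + T[r + n - 1]
    return found
```

With the slides' example, `exact_matches([1, 0, 1, 1, 0, 1, 0, 1], [0, 1, 0, 1])` reports the single occurrence at position 5.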
Karp-Rabin • Let’s use modular arithmetic; this will help us keep the numbers small. • For some integer p, the fingerprint of P is defined by $H_p(P) = H(P) \bmod p$.
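Karp-Rabin • Lemma: $H_p(P) = \{[\cdots((P(1) \cdot 2 \bmod p + P(2)) \cdot 2 \bmod p + P(3)) \cdot 2 \bmod p + \cdots] \cdot 2 \bmod p + P(n)\} \bmod p$, and during this computation no number ever exceeds 2p.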
An example • P = 1 0 1 1 1 1 • H(P) = 47 • p = 7 • $H_p(P) = 47 \bmod 7 = 5$
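A minimal sketch of the left-to-right fingerprint computation, reducing modulo p after every step so that intermediate values stay below 2p. The function name and the returned `peak` value are assumptions added for illustration; `peak` only makes the size bound visible.

```python
def fingerprint(bits, p):
    """Return (H(bits) mod p, largest intermediate value seen before reduction)."""
    value = 0
    peak = 0
    for b in bits:
        step = 2 * value + b      # value < p, so step <= 2*(p - 1) + 1 < 2p
        peak = max(peak, step)
        value = step % p
    return value, peak
```

For the example above, `fingerprint([1, 0, 1, 1, 1, 1], 7)` walks through the intermediate values 1, 2, 5, 11, 9, 5 and ends with fingerprint 5, never reaching 2p = 14.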
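Karp-Rabin • Intermediate numbers are also kept small. • We can still compute $H_p(T_r)$ from $H_p(T_{r-1})$. • Arithmetic: $H(T_r) = 2H(T_{r-1}) - 2^n T(r-1) + T(r+n-1)$. • Modular arithmetic: $H_p(T_r) = [2H_p(T_{r-1}) - (2^n \bmod p)\,T(r-1) + T(r+n-1)] \bmod p$, where $2^n \bmod p = 2\,(2^{n-1} \bmod p) \bmod p$.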
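As a worked check of the modular update, take the earlier text T = 1 0 1 1 0 1 0 1 with n = 4 and, as an assumption chosen for illustration, p = 7. The update rule used is $H_p(T_r) = [2H_p(T_{r-1}) - (2^n \bmod p)\,T(r-1) + T(r+n-1)] \bmod p$:

```latex
\begin{aligned}
H(T_1) &= 1011_2 = 11, \qquad H_7(T_1) = 11 \bmod 7 = 4, \qquad 2^4 \bmod 7 = 2 \\
H_7(T_2) &= [\,2 \cdot 4 - 2 \cdot T(1) + T(5)\,] \bmod 7 = [\,8 - 2 + 0\,] \bmod 7 = 6 \\
H(T_2) &= 0110_2 = 6, \qquad 6 \bmod 7 = 6
\end{aligned}
```

The rolled fingerprint agrees with the fingerprint computed directly from $T_2$.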
Karp-Rabin • How about the comparisons? • Arithmetic: There is an occurrence of P starting at position r of T if and only if $H(P) = H(T_r)$. • Modular arithmetic: If there is an occurrence of P starting at position r of T then $H_p(P) = H_p(T_r)$. • There are values of p for which the converse is not true!
Karp-Rabin • Definition: If $H_p(P) = H_p(T_r)$ but P doesn’t occur in T starting at position r, we say there is a false match between P and T at position r. If there is some position r such that there is a false match between P and T at position r, we say there is a false match between P and T.
Karp-Rabin • Our goal will be to choose a modulus p such that • p is small enough to keep computations efficient. • p is large enough so that the probability of a false match is kept small.
Prime moduli limit false matches • Definition: For a positive integer u, π(u) is the number of primes that are less than or equal to u. • Prime number theorem (without proof): $\frac{u}{\ln u} \le \pi(u) \le 1.26\,\frac{u}{\ln u}$.
Prime moduli limit false matches • Lemma (without proof): If u ≥ 29, then the product of all the primes that are less than or equal to u is greater than $2^u$. • Example: u = 29; the prime numbers less than or equal to 29 are 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, and their product is 6,469,693,230 > 536,870,912 = $2^{29}$.
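The lemma can be spot-checked numerically. The sketch below is a check over a small range, not a proof; the helper names are hypothetical and u is assumed to be at least 2.

```python
import math


def primes_up_to(u):
    """Sieve of Eratosthenes: all primes <= u (assumes u >= 2)."""
    sieve = [True] * (u + 1)
    sieve[0:2] = [False, False]
    for i in range(2, int(u ** 0.5) + 1):
        if sieve[i]:
            for j in range(i * i, u + 1, i):
                sieve[j] = False
    return [i for i, is_p in enumerate(sieve) if is_p]


def primorial(u):
    """Product of all primes <= u."""
    return math.prod(primes_up_to(u))
```

Here `primorial(29)` reproduces the product quoted in the example, and comparing `primorial(u)` with `2 ** u` for a range of u ≥ 29 is a quick way to see the gap widen.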
Prime moduli limit false matches • Corollary: If u ≥ 29 and x is any number less than or equal to $2^u$, then x has fewer than π(u) distinct prime divisors. • Proof: Assume x has k ≥ π(u) distinct prime divisors $q_1, \ldots, q_k$. Then $2^u \ge x \ge q_1 \cdots q_k$, but $q_1 \cdots q_k$ is at least as large as the product of the first π(u) prime numbers, which by the lemma is greater than $2^u$: a contradiction.
Prime moduli limit false matches • Theorem: Let I be a positive integer, and p a randomly chosen prime less than or equal to I. If nm ≥ 29, then the probability of a false match between P and T is less than or equal to π(nm) / π(I).
Prime moduli limit false matches • Proof: • Let R be the set of positions in T where P doesn’t begin. • We have $\prod_{r \in R} |H(P) - H(T_r)| \le 2^{nm}$, since each factor is at most $2^n$ and there are at most m factors. • By the corollary the product has at most π(nm) distinct prime divisors. • If there is a false match at position r then p divides $|H(P) - H(T_r)|$, and thus also divides the product. • So p must be in a set of size at most π(nm), but p was chosen randomly out of a set of size π(I).
Random fingerprint algorithm • Choose a positive integer I. • Pick a random prime p less than or equal to I, and compute P’s fingerprint $H_p(P)$. • For each position r in T, compute $H_p(T_r)$ and test whether it equals $H_p(P)$. If the numbers are equal, either declare a probable match or check explicitly and declare a definite match. • Running time, excluding verification: O(m).
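The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not a tuned implementation: the names are hypothetical, T and P are lists of 0/1 integers, the default I = nm² is the choice discussed on the next slides, and the random prime is found by rejection sampling with trial division, which is only practical for small I.

```python
import random


def is_prime(x):
    """Trial division; adequate for the small moduli used in this sketch."""
    if x < 2:
        return False
    i = 2
    while i * i <= x:
        if x % i == 0:
            return False
        i += 1
    return True


def random_prime(I, rng):
    """Uniformly random prime <= I by rejection sampling (assumes I >= 2)."""
    while True:
        candidate = rng.randint(2, I)
        if is_prime(candidate):
            return candidate


def karp_rabin(T, P, I=None, verify=True, rng=None):
    """Return 1-based positions where the fingerprints agree.

    With verify=True each candidate is checked explicitly, so only true
    occurrences are returned; with verify=False false matches may appear.
    """
    n, m = len(P), len(T)
    if n == 0 or n > m:
        return []
    rng = rng or random.Random()
    p = random_prime(I or n * m * m, rng)
    hp = ht = 0
    for i in range(n):                  # fingerprints of P and T_1
        hp = (2 * hp + P[i]) % p
        ht = (2 * ht + T[i]) % p
    top = pow(2, n, p)                  # 2^n mod p
    found = []
    for r in range(1, m - n + 2):
        if ht == hp and (not verify or T[r - 1:r - 1 + n] == P):
            found.append(r)
        if r + n <= m:                  # roll the window from T_r to T_{r+1}
            ht = (2 * ht - top * T[r - 1] + T[r + n - 1]) % p
    return found
```

Because a true occurrence always produces equal fingerprints, the unverified variant can only report extra positions, never miss one. Checking every candidate explicitly can cost more than O(m) in the worst case, which is what the linear-time run verification later in the deck addresses.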
How to choose I • The smaller I is, the more efficient the computations. • The larger I is, the smaller the probability of a false match. • Proposition: When $I = nm^2$: 1. The largest number used in the algorithm requires at most 4(log(n)+log(m)) bits. 2. The probability of a false match is at most 2.53/m.
How to choose I • Proof:
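One way to see part 2 is sketched below. It assumes the prime number theorem bounds $u/\ln u \le \pi(u) \le 1.26\,u/\ln u$ and the false-match bound π(nm)/π(I) from the theorem; the intermediate steps are a reconstruction rather than the slide's own derivation.

```latex
\frac{\pi(nm)}{\pi(nm^2)}
  \le 1.26\,\frac{nm}{\ln(nm)} \cdot \frac{\ln(nm^2)}{nm^2}
  = \frac{1.26}{m} \cdot \frac{\ln n + 2\ln m}{\ln n + \ln m}
  \le \frac{1.26}{m} \cdot 2
  \le \frac{2.53}{m}
```

The ratio of logarithms is at most 2 because $\ln n \ge 0$.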
Extensions • An idea: why not choose k primes? • Proposition: When k primes are chosen randomly and independently between 1 and I, the probability of a false match is at most $[\pi(nm)/\pi(I)]^k$. • Proof: We saw that if p allows an error, it is in a set of at most π(nm) integers. A false match can occur only if each of the k independently chosen primes is in a set of at most π(nm) integers.
An illustration • k = 4, n = 250, m = 4000 • $I = 250 \cdot 4000^2 < 2^{32}$
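To get a feel for the numbers in this illustration, the bound $[\pi(nm)/\pi(I)]^k$ can be estimated by replacing π(nm) with its upper bound 1.26·nm/ln(nm) and π(I) with its lower bound I/ln(I). The function below is a hypothetical helper for that estimate only; it is valid as an upper bound under the prime number theorem bounds quoted earlier.

```python
import math


def false_match_bound(n, m, I, k=1):
    """Upper estimate of [pi(nm) / pi(I)]^k using 1.26*u/ln(u) and u/ln(u)."""
    upper_pi_nm = 1.26 * (n * m) / math.log(n * m)   # pi(nm) is at most this
    lower_pi_I = I / math.log(I)                     # pi(I) is at least this
    return (upper_pi_nm / lower_pi_I) ** k


n, m, k = 250, 4000, 4
I = n * m * m
bound = false_match_bound(n, m, I, k)
```

For these values the estimate is well below one in a trillion, while I still fits in a 32-bit word.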
Even lower limits on the error • When k primes are used, the probability of a false match is at most $m\,[\pi(n)/\pi(I)]^k$. • Proof: Suppose a false match occurs at position r. Then each of the k primes must divide $|H(P) - H(T_r)| \le 2^n$. At most π(n) primes divide it. Each prime is chosen from a set of size π(I) and must by chance fall in a set of size at most π(n), so the probability of a false match at a fixed position r is at most $[\pi(n)/\pi(I)]^k$. Summing over at most m positions gives the bound.
Checking for error in linear time • Consider the list L of locations in T where the Karp-Rabin algorithm declares P to be found. • A run is a maximal interval of consecutive starting locations l1, l2, …, lr in L such that every two successive locations differ by at most n/2. • Let’s verify a run.
Checking for error in linear time • Check the first two declared occurrences, l1 and l2, explicitly. • Example: P = abbabbabbabbab, T = abbabbabbabbabbabbabbabbabbax… • If there is a false match, stop. • Otherwise P is semi-periodic with period d = l2 – l1.
Checking for error in linear time • d is the minimal period of P: with a smaller period there would be a true occurrence between l1 and l2, and the fingerprint test never misses a true occurrence. • P = abbabbabbabbab, T = abbabbabbabbabbabbabbabbabbax…
Checking for error in linear time • P = abbabbabbabbab, T = abbabbabbabbabbabbabbabbabbax… • For each i check that li+1 – li = d. • For each i check the last d characters of the declared occurrence starting at li.
Checking for error in linear time • P = abbabbabbabbab, T = abbabbabbabbabbabbabbabbabbax… • Check l1 explicitly.
Checking for error in linear time • Check l2 explicitly. • P is semi-periodic with period 3.
Checking for error in linear time • For each i check that li+1 – li = 3.
Checking for error in linear time • For each i check the last 3 characters of the occurrence declared at li against the last 3 characters of P (bab). • Report a false match or approve the run.
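The run check walked through above can be sketched as follows. This is an illustration under assumptions: `verify_run` is a hypothetical name, positions are 0-based, and `run` is assumed to be a sorted list of declared start positions whose successive gaps are at most n/2. The text used in the usage note is built in the style of the slides' example.

```python
def verify_run(T, P, run):
    """Return True if every declared occurrence in the run is real, else False."""
    n = len(P)

    def occurs_at(l):
        return T[l:l + n] == P

    # Check the first two declared occurrences explicitly.
    if not occurs_at(run[0]):
        return False
    if len(run) == 1:
        return True
    if not occurs_at(run[1]):
        return False

    d = run[1] - run[0]          # P is semi-periodic with period d
    tail = P[n - d:]             # last d characters of P
    for prev, cur in zip(run[1:], run[2:]):
        if cur - prev != d:      # a gap other than d signals a false match
            return False
        # The first n - d characters overlap the previous, already verified
        # occurrence, so only the last d characters need to be compared.
        if T[cur + n - d:cur + n] != tail:
            return False
    return True


P = "abb" * 4 + "ab"             # abbabbabbabbab
T = "abb" * 9 + "ax"
```

With this text, `verify_run(T, P, [0, 3, 6, 9, 12])` approves the run, while appending the location 15 makes the last-3-characters check hit the trailing `x` and report a false match.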
Time analysis • No character of T is examined more than twice during a single run. • Two runs are separated by at least n/2 positions and each run is at least n positions long. Thus no character of T is examined in more than two consecutive runs. • Total verification time O(m).
Time analysis • When a false match is detected, we start again with a different prime. • The probability of a false match is O(1/m). • We have converted the algorithm into one that never errs and has expected linear running time.
Why use Karp-Rabin? • It is efficient and simple. • It is space efficient. • It can be generalized to solve harder problems such as 2-dimensional string matching. • Its performance is backed up by a concrete theoretical analysis.
The Shift-And Method
The Shift-And Method • We start with the exact match problem. • Define M to be a binary n by m matrix such that:M(i,j) = 1 iff the first i characters of P exactly match the i characters of T ending at character j. M(i,j) = 1 iff P[1 .. i] ≡ T[j-i+1 .. j]
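The Shift-And Method • Let T = california. • Let P = for. • M = row 1: 0 0 0 0 1 0 0 0 0 0; row 2: 0 0 0 0 0 1 0 0 0 0; row 3: 0 0 0 0 0 0 1 0 0 0 (columns j = 1, …, 10). • M(i,j) = 1 iff the first i characters of P exactly match the i characters of T ending at character j. • How does M solve the exact match problem?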
How to construct M • We will construct M column by column. • Two definitions are in order: • Bit-Shift(j-1) is the vector derived by shifting the vector for column j-1 down by one and setting the first bit to 1. • Example:
How to construct M • We define the n-length binary vector U(x) for each character x in the alphabet. U(x) is set to 1 at the positions in P where character x appears. • Example: P = abaac gives U(a) = 1 0 1 1 0, U(b) = 0 1 0 0 0, U(c) = 0 0 0 0 1.
How to construct M • Initialize column 0 of M to all zeros. • For j ≥ 1, column j is obtained by M(j) = Bit-Shift(j-1) AND U(T(j)).
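A minimal sketch of the column-by-column construction, using the update M(j) = Bit-Shift(j-1) AND U(T(j)). Each column is held in one Python integer, with bit i-1 standing for row i; shifting the column "down by one and setting the first bit to 1" then becomes `(col << 1) | 1`. The function name and this bit layout are assumptions made for illustration.

```python
def shift_and(T, P):
    """Return the 1-based end positions j in T where an occurrence of P ends."""
    n = len(P)
    if n == 0:
        return []
    # U[x] has bit i set iff P[i] == x (0-based i, i.e. row i + 1 of M).
    U = {}
    for i, ch in enumerate(P):
        U[ch] = U.get(ch, 0) | (1 << i)

    col = 0                                      # column 0 of M: all zeros
    last_row = 1 << (n - 1)                      # bit for row n
    ends = []
    for j, ch in enumerate(T, start=1):
        col = ((col << 1) | 1) & U.get(ch, 0)    # Bit-Shift(j-1) AND U(T(j))
        if col & last_row:                       # M(n, j) = 1: P ends at j
            ends.append(j)
    return ends
```

For T = california and P = for this reports the end position 7, matching the single 1 in the last row of M. As long as n fits in a machine word, each column update is a constant number of word operations.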
An example: j = 1 • T = x a b x a b a a x a (positions 1 to 10) • P = a b a a c (positions 1 to 5)