Semi-Numerical String Matching

Semi-numerical String Matching • All the methods we’ve seen so far have been based on character comparisons. • Here we consider alternative methods of computation, such as: • Arithmetic. • Bit operations. • The fast Fourier transform.

Semi-numerical String Matching • We will survey three examples of such methods: • The Random Fingerprint method due to Karp and Rabin. • The Shift-And method due to Baeza-Yates and Gonnet, and its extension to agrep due to Wu and Manber. • A solution to the match count problem using the fast Fourier transform due to Fischer and Paterson, and an improvement due to Abrahamson.

Karp-Rabin fingerprint - exact match • Exact match problem: we want to find all the occurrences of the pattern P in the text T. • The pattern P is of length n. • The text T is of length m.

Karp-Rabin fingerprint - exact match • Arithmetic replaces comparisons. • An efficient randomized algorithm that makes an error with small probability. • A randomized algorithm that never errs and whose expected running time is efficient. • We will consider a binary alphabet: {0,1}.

Arithmetic replaces comparisons. • Strings are also numbers, H: strings → numbers. • Let s be a binary string of length n; then $H(s) = \sum_{i=1}^{n} 2^{n-i} s(i)$, i.e. s read as a base-2 number. • Definition: let $T_r$ denote the length-n substring of T starting at position r.

Arithmetic replaces comparisons. • Strings are also numbers, H: strings → numbers. • Example: T = 1 0 1 1 0 1 0 1 and P = 0 1 0 1, so $H(P) = 5$. • $T_5$ = 0 1 0 1, so $H(T_5) = 5 = H(P)$: P occurs at position 5. • $T_2$ = 0 1 1 0, so $H(T_2) = 6 \neq H(P) = 5$: P does not occur at position 2.

Arithmetic replaces comparisons. • Theorem: There is an occurrence of P starting at position r of T if and only if $H(P) = H(T_r)$. • Proof: Follows immediately from the unique representation of a number in base 2.
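As a minimal sketch of the idea above, the following Python computes H for binary strings and compares H(P) with the value of every length-n window of T. The names `H` and `exact_matches_by_value` are illustrative and do not come from the slides; positions are 1-based to match the slides.

```python
def H(s):
    """Value of the binary string s read as a base-2 number."""
    value = 0
    for ch in s:
        value = 2 * value + int(ch)
    return value


def exact_matches_by_value(P, T):
    """1-based positions r with H(P) == H(T_r); H is recomputed for each window."""
    n, m = len(P), len(T)
    target = H(P)
    return [r for r in range(1, m - n + 2) if H(T[r - 1:r - 1 + n]) == target]


T = "10110101"
P = "0101"
print(H(P), H(T[4:8]), H(T[1:5]))    # H(P), H(T_5), H(T_2)
print(exact_matches_by_value(P, T))  # positions where the values agree
```

Recomputing H for every window costs O(n) per position, so this version only illustrates the theorem; it is not yet faster than direct comparison.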

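Arithmetic replaces comparisons. • We can compute $H(T_r)$ from $H(T_{r-1})$: $H(T_r) = 2 H(T_{r-1}) - 2^n T(r-1) + T(r+n-1)$. • Example: T = 1 0 1 1 0 1 0 1, n = 4. $T_1$ = 1 0 1 1, so $H(T_1) = 11$; $T_2$ = 0 1 1 0, so $H(T_2) = 2 \cdot 11 - 2^4 \cdot 1 + 0 = 6$.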

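Arithmetic replaces comparisons. • A simple efficient algorithm: • Compute $H(P)$ and $H(T_1)$. • Scan T, computing $H(T_r)$ from $H(T_{r-1})$ with a constant number of arithmetic operations, and compare it with $H(P)$. • Total running time O(m)? Only if each operation on n-bit numbers takes constant time, which is unrealistic for large n.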
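The scan described above can be sketched as follows, assuming the update rule $H(T_r) = 2 H(T_{r-1}) - 2^n T(r-1) + T(r+n-1)$. The function name `rolling_exact_matches` is hypothetical. Python integers have arbitrary precision, so the code is exact, but each operation touches up to n bits, which is why the O(m) claim is in doubt.

```python
def rolling_exact_matches(P, T):
    """1-based positions of P in the binary string T, using exact arithmetic."""
    n, m = len(P), len(T)
    if n == 0 or n > m:
        return []
    bits = [int(c) for c in T]
    hp = int(P, 2)       # H(P)
    ht = int(T[:n], 2)   # H(T_1)
    top = 1 << n         # 2^n, the weight of the bit that leaves the window
    matches = [1] if ht == hp else []
    for r in range(2, m - n + 2):
        # bits[r - 2] is T(r-1) and bits[r + n - 2] is T(r+n-1) in 1-based terms
        ht = 2 * ht - top * bits[r - 2] + bits[r + n - 2]
        if ht == hp:
            matches.append(r)
    return matches


print(rolling_exact_matches("0101", "10110101"))
```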

Karp-Rabin • Let’s use modular arithmetic; this will help us keep the numbers small. • For a positive integer p, the fingerprint of P is defined by $H_p(P) = H(P) \bmod p$.

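Karp-Rabin • Lemma: Let $h_1 = P(1) \bmod p$ and $h_i = (2 h_{i-1} + P(i)) \bmod p$ for $i = 2, \ldots, n$. Then $H_p(P) = h_n$, and during this computation no number ever exceeds 2p.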

An example • P = 1 0 1 1 1 1, so $H(P) = 47$. • With p = 7: $H_p(P) = 47 \bmod 7 = 5$.
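A small sketch of the fingerprint computation on this example, reducing modulo p after every doubling step so that intermediate values stay below 2p. The function name `fingerprint` and the extra return value (the largest intermediate number seen) are illustrative additions, not part of the slides.

```python
def fingerprint(s, p):
    """Return (H_p(s), largest intermediate value) for a binary string s."""
    h = 0
    largest = 0
    for ch in s:
        t = 2 * h + int(ch)   # at most 2(p - 1) + 1 < 2p
        largest = max(largest, t)
        h = t % p
    return h, largest


print(fingerprint("101111", 7))
```

For P = 101111 and p = 7 the running values of h are 1, 2, 5, 4, 2, 5, so the fingerprint is 5 and the largest intermediate value is 11, which is below 2p = 14.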

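Karp-Rabin • Intermediate numbers are also kept small. • We can still compute $H_p(T_r)$ from $H_p(T_{r-1})$. • Arithmetic: $H(T_r) = 2 H(T_{r-1}) - 2^n T(r-1) + T(r+n-1)$. • Modular arithmetic: $H_p(T_r) = [\, 2 H_p(T_{r-1}) - (2^n \bmod p)\, T(r-1) + T(r+n-1) \,] \bmod p$, where $2^n \bmod p$ is computed once by repeated doubling, $2^n \bmod p = 2\,(2^{n-1} \bmod p) \bmod p$.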

Karp-Rabin • How about the comparisons? • Arithmetic: there is an occurrence of P starting at position r of T if and only if $H(P) = H(T_r)$. • Modular arithmetic: if there is an occurrence of P starting at position r of T, then $H_p(P) = H_p(T_r)$. • There are values of p for which the converse is not true!

Karp-Rabin • Definition: If $H_p(P) = H_p(T_r)$ but P does not occur in T starting at position r, we say there is a false match between P and T at position r. If there is some position r such that there is a false match between P and T at position r, we say there is a false match between P and T.
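To make the definition concrete, the sketch below lists the false-match positions for a given modulus, using the strings from the earlier example (T = 10110101, P = 0101). The function name `false_matches` is hypothetical, and the fingerprints are computed directly rather than incrementally to keep the example short.

```python
def false_matches(P, T, p):
    """1-based positions r where H_p(P) == H_p(T_r) although P does not occur at r."""
    n, m = len(P), len(T)
    hp = int(P, 2) % p
    result = []
    for r in range(1, m - n + 2):
        window = T[r - 1:r - 1 + n]
        if int(window, 2) % p == hp and window != P:
            result.append(r)
    return result


for p in (2, 3, 5, 7):
    print(p, false_matches("0101", "10110101", p))
```

Here the windows have values 11, 6, 13, 10, 5 and H(P) = 5, so a false match at r appears exactly when p divides $|H(P) - H(T_r)|$: p = 2 and p = 3 divide 6 (position 1), p = 2 divides 8 (position 3), p = 5 divides 5 (position 4), and p = 7 divides none of the differences.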

Karp-Rabin • Our goal will be to choose a modulus p such that • p is small enough to keep computations efficient. • p is large enough so that the probability of a false match is kept small.

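Prime moduli limit false matches • Definition: For a positive integer u, $\pi(u)$ is the number of primes that are less than or equal to u. • Prime number theorem (without proof): $\frac{u}{\ln u} \le \pi(u) \le 1.26\,\frac{u}{\ln u}$ for $u \ge 17$, where ln is the natural logarithm.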

Prime moduli limit false matches • Lemma (without proof): if u ≥ 29, then the product of all the primes that are less than or equal to u is greater than $2^u$. • Example: u = 29; the primes less than or equal to 29 are 2, 3, 5, 7, 11, 13, 17, 19, 23, 29; their product is 6,469,693,230 > 536,870,912 = $2^{29}$.

Prime moduli limit false matches • Corollary: If u ≥ 29 and x is any number less than or equal to $2^u$, then x has fewer than $\pi(u)$ distinct prime divisors. • Proof: Assume x has $k \ge \pi(u)$ distinct prime divisors $q_1, \ldots, q_k$. Then $2^u \ge x \ge q_1 \cdots q_k$, but $q_1 \cdots q_k$ is at least as large as the product of the first $\pi(u)$ primes, which by the lemma is greater than $2^u$, a contradiction.

Prime moduli limit false matches • Theorem: Let I be a positive integer, and p a randomly chosen prime less than or equal to I. If nm ≥ 29, then the probability of a false match between P and T is less than or equal to $\pi(nm) / \pi(I)$.

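Prime moduli limit false matches • Proof: • Let R be the set of positions in T where P does not begin. • We have $\prod_{r \in R} |H(P) - H(T_r)| \le 2^{nm}$, since each factor is at most $2^n$ and there are at most m factors. • By the corollary, the product has at most $\pi(nm)$ distinct prime divisors. • If there is a false match at position r, then p divides $|H(P) - H(T_r)|$ and thus also divides $\prod_{r \in R} |H(P) - H(T_r)|$. • So p must be in a set of size at most $\pi(nm)$, but p was chosen randomly out of a set of size $\pi(I)$.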

Random fingerprint algorithm • Choose a positive integer I. • Pick a random prime p less than or equal to I, and compute P’s fingerprint $H_p(P)$. • For each position r in T, compute $H_p(T_r)$ and test whether it equals $H_p(P)$. If the numbers are equal, either declare a probable match or check explicitly and declare a definite match. • Running time, excluding verification: O(m).
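The steps above can be sketched as follows for a binary text. This is an illustration under stated assumptions rather than a reference implementation: the names `primes_up_to`, `probable_matches` and `karp_rabin` are hypothetical, the prime is drawn from a simple sieve (practical only for small I), and no verification is done, so the result may contain false matches.

```python
import random


def primes_up_to(limit):
    """All primes <= limit, by the sieve of Eratosthenes (requires limit >= 2)."""
    sieve = bytearray([1]) * (limit + 1)
    sieve[0:2] = b"\x00\x00"
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = bytearray(len(range(i * i, limit + 1, i)))
    return [i for i, flag in enumerate(sieve) if flag]


def probable_matches(P, T, p):
    """1-based positions r with H_p(P) == H_p(T_r); may include false matches."""
    n, m = len(P), len(T)
    if n == 0 or n > m:
        return []
    bits = [int(c) for c in T]
    hp = ht = 0
    for i in range(n):
        hp = (2 * hp + int(P[i])) % p
        ht = (2 * ht + bits[i]) % p
    top = pow(2, n, p)   # 2^n mod p
    found = [1] if ht == hp else []
    for r in range(2, m - n + 2):
        # Python's % always returns a value in [0, p), even for negative operands.
        ht = (2 * ht - top * bits[r - 2] + bits[r + n - 2]) % p
        if ht == hp:
            found.append(r)
    return found


def karp_rabin(P, T, I, rng=random):
    """Pick a random prime p <= I and return the probable match positions."""
    p = rng.choice(primes_up_to(I))
    return probable_matches(P, T, p)


print(karp_rabin("0101", "10110101", I=4 * 8 * 8))
```

Every true occurrence is always reported, because equal strings have equal fingerprints; only the extra positions depend on the random prime.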

How to choose I • The smaller I is, the more efficient the computations are. • The larger I is, the smaller the probability of a false match. • Proposition: When $I = nm^2$: 1. The largest number used in the algorithm requires at most 4(log(n) + log(m)) bits. 2. The probability of a false match is at most 2.53/m.

How to choose I • Proof:
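One derivation consistent with the proposition can be sketched as follows. It assumes nm ≥ 29, that every number used by the algorithm stays below 2p ≤ 2I, and the prime number theorem bounds $u/\ln u \le \pi(u) \le 1.26\,u/\ln u$; the constant 1.26 is an assumption of this sketch and is not stated in the transcript.

```latex
\begin{align*}
% Part 1: every number used is below 2p <= 2I = 2nm^2
\log_2(2nm^2) &= 1 + \log_2 n + 2\log_2 m \;\le\; 4(\log_2 n + \log_2 m), \\
% Part 2: apply the upper bound to pi(nm) and the lower bound to pi(nm^2)
\frac{\pi(nm)}{\pi(nm^2)}
  &\le 1.26\,\frac{nm}{\ln(nm)} \cdot \frac{\ln(nm^2)}{nm^2}
   = \frac{1.26}{m} \cdot \frac{\ln n + 2\ln m}{\ln n + \ln m}
   \le \frac{2.52}{m} \le \frac{2.53}{m}.
\end{align*}
```

The last step of Part 2 uses $\ln n + 2\ln m \le 2(\ln n + \ln m)$.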

Extensions • An idea: why not choose k primes? • Proposition: when k primes are chosen randomly and independently between 1 and I, and a match is declared only when all k fingerprints agree, the probability of a false match is at most $[\pi(nm)/\pi(I)]^k$. • Proof: We saw that if p allows an error, it is in a set of at most $\pi(nm)$ integers. A false match can occur only if each of the k independently chosen primes is in that set.

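An illustration • k = 4, n = 250, m = 4000, $I = nm^2 = 250 \cdot 4000^2 < 2^{32}$. • With this choice of I each prime gives a false match with probability at most 2.53/m, so the probability of a false match is at most $(2.53/4000)^4 < 10^{-12}$.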

Even lower limits on the error • When k primes are used, the probability of a false match is at most $m\,[\pi(n)/\pi(I)]^k$. • Proof: Suppose a false match occurs at position r. Then each of the k primes must divide $|H(P) - H(T_r)| \le 2^n$, and at most $\pi(n)$ primes divide it. Each prime is chosen from a set of size $\pi(I)$ and falls in that set of size at most $\pi(n)$ with probability at most $\pi(n)/\pi(I)$, so all k do with probability at most $[\pi(n)/\pi(I)]^k$. Summing over the at most m positions gives the bound.
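To get a feel for the two k-prime bounds, the sketch below evaluates them for k = 4, n = 250, m = 4000 and $I = nm^2$. It counts primes exactly for the small arguments and, as a labeled assumption, replaces $\pi(I)$ by the lower bound $I/\ln I$ (valid for I ≥ 17), so the printed values are upper estimates of the bounds. The helper name `prime_count` is hypothetical.

```python
import math


def prime_count(u):
    """pi(u): number of primes <= u, by the sieve of Eratosthenes (u >= 2)."""
    sieve = bytearray([1]) * (u + 1)
    sieve[0:2] = b"\x00\x00"
    for i in range(2, math.isqrt(u) + 1):
        if sieve[i]:
            sieve[i * i::i] = bytearray(len(range(i * i, u + 1, i)))
    return sum(sieve)


k, n, m = 4, 250, 4000
I = n * m * m

# Assumption: pi(I) >= I / ln(I); I is too large to sieve comfortably here.
pi_I_lower = I / math.log(I)

bound_whole_text = (prime_count(n * m) / pi_I_lower) ** k     # [pi(nm)/pi(I)]^k
bound_per_position = m * (prime_count(n) / pi_I_lower) ** k   # m [pi(n)/pi(I)]^k

print(I < 2 ** 32)
print(bound_whole_text)
print(bound_per_position)
```

Under these assumptions the first estimate is below $10^{-12}$ and the second is below $10^{-22}$, which shows how much the per-position argument tightens the bound.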

Checking for error in linear time • Consider the list L of locations in T where the Karp-Rabin algorithm declares P to be found. • A run is a maximal interval of consecutive starting locations $l_1, l_2, \ldots, l_r$ in L such that every two successive locations differ by at most n/2. • Let’s verify a run.

Checking for error in linear time • Check the first two declared occurrences, at $l_1$ and $l_2$, explicitly. • Example: P = abbabbabbabbab, T = abbabbabbabbabbabbabbabbabbax… • If there is a false match, stop. • Otherwise P is semi-periodic with period $d = l_2 - l_1$.

Checking for error in linear time • d is the minimal period of P: a smaller period would force a true occurrence of P strictly between $l_1$ and $l_2$, and L contains every true occurrence. • P = abbabbabbabbab, T = abbabbabbabbabbabbabbabbabbax…

Checking for error in linear time • P = abbabbabbabbab, T = abbabbabbabbabbabbabbabbabbax… • For each i check that $l_{i+1} - l_i = d$. • Check the last d characters of the declared occurrence at $l_i$, for each i.

Checking for error in linear time • P = abbabbabbabbab, T = abbabbabbabbabbabbabbabbabbax… • Check the declared occurrence at $l_1$ explicitly.

Checking for error in linear time • P = abbabbabbabbab, T = abbabbabbabbabbabbabbabbabbax… • Check the declared occurrence at $l_2$ explicitly. • P is semi-periodic with period 3.

Checking for error in linear time • T = abbabbabbabbabbabbabbabbabbax… • Check that $l_{i+1} - l_i = 3$ for each i.

Checking for error in linear time • The last 3 characters of P are bab; T = abbabbabbabbabbabbabbabbabbax… • For each i check the last 3 characters of the declared occurrence at $l_i$ against them. • Report a false match or approve the run.
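The run check walked through on these slides can be sketched as below. It assumes that L is sorted and contains every true occurrence (fingerprints never miss a real match), and it uses 0-based positions, unlike the slides. The names `split_runs` and `verify_run` are hypothetical.

```python
def split_runs(L, n):
    """Group sorted declared positions into runs: successive gaps of at most n/2."""
    runs = []
    for pos in L:
        if runs and pos - runs[-1][-1] <= n // 2:
            runs[-1].append(pos)
        else:
            runs.append([pos])
    return runs


def verify_run(P, T, run):
    """Return True if every declared position in the run is a real occurrence."""
    n = len(P)
    # Check the first two declared occurrences character by character.
    for pos in run[:2]:
        if T[pos:pos + n] != P:
            return False
    if len(run) < 3:
        return True
    d = run[1] - run[0]   # period of P implied by the two verified occurrences
    tail = P[n - d:]      # last d characters of P
    for prev, cur in zip(run[1:], run[2:]):
        if cur - prev != d:   # a gap other than d implies a false match in the run
            return False
        if T[cur + n - d:cur + n] != tail:   # only the last d characters are new
            return False
    return True


P = "abbabbabbabbab"
T = "abbabbabbabbabbabbabbabbabbax"
print(split_runs([0, 3, 6, 9, 12, 15], len(P)))
print(verify_run(P, T, [0, 3, 6, 9, 12]))       # all real occurrences
print(verify_run(P, T, [0, 3, 6, 9, 12, 15]))   # position 15 is a false match
```

Once an occurrence at $l_{i-1}$ is confirmed and the gap is d, the first n - d characters at $l_i$ are already known to match by periodicity, which is why checking the last d characters is enough.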

Time analysis • No character of T is examined more than twice during a single run. • Two runs are separated by at least n/2 positions and each run is at least n positions long. Thus no character of T is examined in more than two consecutive runs. • Total verification time O(m).

Time analysis • When we have a false match we start again with a different prime. • The probability of a false match is O(1/m). • We have converted the algorithm to one that never errs, with expected linear running time.
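The retry loop described above can be sketched as follows for a binary text. For brevity this sketch verifies candidates by direct comparison instead of the linear-time run check, so it illustrates the control flow only, not the O(m) verification bound. All names (`is_prime`, `random_prime`, `probable_matches`, `exact_matches`) are hypothetical.

```python
import random


def is_prime(x):
    """Trial-division primality test; adequate for the small moduli used here."""
    if x < 2:
        return False
    i = 2
    while i * i <= x:
        if x % i == 0:
            return False
        i += 1
    return True


def random_prime(I, rng):
    """Uniformly random prime <= I, by rejection sampling."""
    while True:
        candidate = rng.randint(2, I)
        if is_prime(candidate):
            return candidate


def probable_matches(P, T, p):
    """1-based positions r with H_p(P) == H_p(T_r)."""
    n, m = len(P), len(T)
    bits = [int(c) for c in T]
    hp = int(P, 2) % p
    ht = int(T[:n], 2) % p
    top = pow(2, n, p)
    found = [1] if ht == hp else []
    for r in range(2, m - n + 2):
        ht = (2 * ht - top * bits[r - 2] + bits[r + n - 2]) % p
        if ht == hp:
            found.append(r)
    return found


def exact_matches(P, T, I, rng):
    """Repeat with a fresh prime until no declared match is false."""
    n = len(P)
    attempts = 0
    while True:
        attempts += 1
        candidates = probable_matches(P, T, random_prime(I, rng))
        # Simplification: direct comparison stands in for the run-based check.
        if all(T[r - 1:r - 1 + n] == P for r in candidates):
            return candidates, attempts


print(exact_matches("0101", "10110101", 256, random.Random(1)))
```

The returned positions are always exactly the true occurrences; only the number of attempts depends on the random primes drawn.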

Why use Karp-Rabin? • It is efficient and simple. • It is space efficient. • It can be generalized to solve harder problems such as 2-dimensional string matching. • Its performance is backed up by a concrete theoretical analysis.

The Shift-And Method

The Shift-And Method • We start with the exact match problem. • Define M to be a binary n by m matrix such that M(i,j) = 1 iff the first i characters of P exactly match the i characters of T ending at character j, i.e. M(i,j) = 1 iff P[1 .. i] ≡ T[j-i+1 .. j].

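The Shift-And Method • Let T = california • Let P = for • M(i,j) = 1 iff the first i characters of P exactly match the i characters of T ending at character j. • M =
    j:       1 2 3 4 5 6 7 8 9 10
    T:       c a l i f o r n i a
    i=1 (f)  0 0 0 0 1 0 0 0 0 0
    i=2 (o)  0 0 0 0 0 1 0 0 0 0
    i=3 (r)  0 0 0 0 0 0 1 0 0 0
• How does M solve the exact match problem? An occurrence of P ends at position j of T if and only if M(n,j) = 1, so the last row of M marks all occurrences.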
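As a small illustration of the definition, the sketch below fills M directly from the rule M(i,j) = 1 iff P[1 .. i] = T[j-i+1 .. j] and then reads the occurrences off the last row. The function name `match_matrix` is hypothetical; the code stores the 1-based entry M(i,j) at `M[i - 1][j - 1]`, and it is deliberately the slow, definition-based construction.

```python
def match_matrix(P, T):
    """Build M by definition: M[i-1][j-1] == 1 iff P[:i] equals T[j-i:j]."""
    n, m = len(P), len(T)
    M = [[0] * m for _ in range(n)]
    for i in range(1, n + 1):
        for j in range(i, m + 1):
            if P[:i] == T[j - i:j]:
                M[i - 1][j - 1] = 1
    return M


P, T = "for", "california"
M = match_matrix(P, T)
for row in M:
    print(*row)

# Occurrences of P end exactly where the last row of M holds a 1 (1-based j).
ends = [j for j in range(1, len(T) + 1) if M[len(P) - 1][j - 1] == 1]
print(ends, [j - len(P) + 1 for j in ends])
```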

How to construct M • We will construct M column by column. • Two definitions are in order: • Bit-Shift(j-1) is the vector derived by shifting the vector for column j-1 down by one and setting the first bit to 1. • Example: if column j-1 is (0, 1, 0, 1, 1) read top to bottom, then Bit-Shift(j-1) is (1, 0, 1, 0, 1).

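How to construct M • We define the n-length binary vector U(x) for each character x in the alphabet. U(x) is set to 1 for the positions in P where character x appears. • Example: P = abaac gives U(a) = (1, 0, 1, 1, 0), U(b) = (0, 1, 0, 0, 0), U(c) = (0, 0, 0, 0, 1); U(x) is all zeros for any other character x.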
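The two definitions can be written down directly with Python lists standing in for column vectors (index 0 is the top of the column). The names `build_U` and `bit_shift` are illustrative and do not come from the slides.

```python
def build_U(P):
    """Map each character of P to its vector U(x): 1 where x occurs in P."""
    U = {}
    for i, ch in enumerate(P):
        U.setdefault(ch, [0] * len(P))[i] = 1
    return U


def bit_shift(column):
    """Shift the column down by one position and set the first bit to 1."""
    return [1] + column[:-1]


U = build_U("abaac")
print(U["a"], U["b"], U["c"])
print(bit_shift([0, 1, 0, 1, 1]))
print(bit_shift([0, 0, 0, 0, 0]))   # Bit-Shift of the all-zero column 0
```

Characters that do not occur in P have no entry in `U`; a caller would treat their vector as all zeros.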

How to construct M • Initialize column 0 of M to all zeros. • For j ≥ 1, column j is obtained by $M(j) = \text{Bit-Shift}(j-1) \;\text{AND}\; U(T(j))$.

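An example • j = 1
        1 2 3 4 5 6 7 8 9 10
    T = x a b x a b a a x a
        1 2 3 4 5
    P = a b a a c
• Bit-Shift(0) = (1, 0, 0, 0, 0) and U(T(1)) = U(x) = (0, 0, 0, 0, 0), so column 1 of M is (0, 0, 0, 0, 0).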
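The column-by-column construction can be sketched with one machine integer per column, which is where the method gets its speed when n fits in a word. In this sketch bit i-1 of an integer (counting from the least significant bit) stands for row i, so "shift down and set the first bit" becomes `(column << 1) | 1`. The function name `shift_and` and the choice to return all columns are illustrative, not from the slides.

```python
def shift_and(P, T):
    """Return (1-based end positions of P in T, list of columns 1..m as integers)."""
    n = len(P)
    U = {}
    for i, ch in enumerate(P):
        U[ch] = U.get(ch, 0) | (1 << i)   # bit i is set where ch occurs in P
    last_row = 1 << (n - 1)               # bit for row n
    column = 0                            # column 0 is all zeros
    ends, columns = [], []
    for j, ch in enumerate(T, start=1):
        # M(j) = Bit-Shift(j-1) AND U(T(j)); the AND also drops any bit above row n.
        column = ((column << 1) | 1) & U.get(ch, 0)
        columns.append(column)
        if column & last_row:
            ends.append(j)
    return ends, columns


ends, columns = shift_and("abaac", "xabxabaaxa")
for j, col in enumerate(columns, start=1):
    # Print each column top to bottom (row 1 first).
    print(j, format(col, "05b")[::-1])
print(ends)
print(shift_and("for", "california")[0])
```

On the slide's example the pattern abaac never occurs in xabxabaaxa, so no column has a 1 in row 5; column 8 has 1s in rows 1 and 4 because both a and abaa end at position 8.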
