An Efficient Index Structure for String Databases. Tamer Kahveci, Ambuj K. Singh. Presented by Atul Ugalmugale / Nikita Rasam. Issue: quickly find substrings in a large database that are similar to a given query string, using a small index structure.
An Efficient Index Structure for String Databases
Tamer Kahveci, Ambuj K. Singh
Presented by Atul Ugalmugale / Nikita Rasam
Issue
• Quickly find substrings in a large database that are similar to a given query string, using a small index structure.
• In some applications we store, search, and analyze long sequences of discrete characters, which we call “strings”.
• There is a frequent need to find similarities in genetic data, web data, and event sequences.
Applications
• Information retrieval: a typical application is text searching. Given a large collection of documents and some keywords, we want to find the documents that contain these keywords.
• Searching keywords on the web: a query such as “mtallica” usually means “metallica”.
• Computational biology: the problem is similar here. We have a long DNA sequence and want to find subsequences in it that approximately match a query sequence.
…ATGCATACGATCGATT…
…TGCAATGGCTTAGCTA…
Animal species from the same family tend to have more similar DNA.
• Video data can be viewed as an event sequence if a pre-specified set of events is detected and stored as a sequence. Searching for similar event subsequences can then be used to find related video segments.
String search algorithms proposed so far are in-memory algorithms:
• They scan the whole database for each query.
• The size of string databases grows faster than the available memory capacity, and the extensive memory requirements make these search techniques impractical.
• They suffer from disk I/Os when the database is too large.
• Their performance deteriorates for long query patterns.
Similarity Metrics
• The difference between two strings s1 and s2 is generally defined as the minimum number of edit operations needed to transform s1 into s2, called the “edit distance” (ED).
• Edit operations:
  • Insert
  • Delete
  • Replace
Suppose we have two strings x and y, e.g. x = kitten and y = sitting, and we want to transform x into y. A closer look:
• 1st step: kitten → sitten (replace k with s)
• 2nd step: sitten → sittin (replace e with i)
• 3rd step: sittin → sitting (insert g)
• Edit distance = 3
What is the edit distance between “survey” and “surgery”?
• survey → surgey (replace v with g, +1)
• surgey → surgery (insert r, +1)
• Edit distance = 2
• In the general version of edit distance, different operations may have different costs, or the costs may depend on the characters involved.
• For example, replacement could be more expensive than insertion, or replacing “a” with “o” could be less expensive than replacing “a” with “k”.
• This is called weighted edit distance.
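A minimal sketch of weighted edit distance, assuming the costs are supplied by the caller. The function name, the parameter names, and the example cost values are illustrative and do not come from the slides.

```python
def weighted_edit_distance(x, y, ins_cost=1, del_cost=1, sub_cost=None):
    """Edit distance where each operation can carry its own cost.

    sub_cost(a, b) returns the cost of replacing character a with b
    (hypothetical interface; defaults to 1 for any pair of distinct characters).
    """
    if sub_cost is None:
        sub_cost = lambda a, b: 1
    m, n = len(x), len(y)
    # C[i][j] = cheapest way to turn x[:i] into y[:j]
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        C[i][0] = C[i - 1][0] + del_cost
    for j in range(1, n + 1):
        C[0][j] = C[0][j - 1] + ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = x[i - 1] == y[j - 1]
            replace = C[i - 1][j - 1] + (0 if same else sub_cost(x[i - 1], y[j - 1]))
            insert = C[i][j - 1] + ins_cost
            delete = C[i - 1][j] + del_cost
            C[i][j] = min(replace, insert, delete)
    return C[m][n]


# Assumed example costs: replacing "a" with "o" is cheaper than other replacements.
cheap_vowels = lambda a, b: 0.5 if {a, b} == {"a", "o"} else 1
```

With unit costs this reduces to the plain edit distance; making replacement cost 2 makes a replacement exactly as expensive as a delete followed by an insert.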
Global Alignment
• The global alignment (or similarity) of s1 and s2 is defined as the maximum-valued alignment of s1 and s2.
• Given two strings S1 and S2, their global alignment is obtained by inserting spaces into S1 or S2, including at the ends, so that they are of the same length, and then writing them one against the other.
• Example: qacdbd and qawdb
  qac_dbd
  qa_wdb_
• Edits and alignments are dual:
  • A sequence of edits can be converted into a global alignment.
  • An alignment can be converted into a sequence of edits.
Local Alignment
• Given two strings X and Y, find two substrings x and y from X and Y, respectively, such that their alignment score (in the global sense) is maximum over all pairs of such substrings. Empty substrings are allowed.
• Scoring function:
  $$S(x, y) = \begin{cases} +2, & x = y \\ -2, & x \neq y \\ -1, & x = \text{‘\_’ or } y = \text{‘\_’} \end{cases}$$
• Example: X = pqraxabcstvq, Y = yxaxbacsll, x = axabcs, y = axbacs
  a x a b _ c s
  a x _ b a c s
  +2 +2 -1 +2 -1 +2 +2 = +8
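The local alignment score can be sketched with the standard Smith-Waterman recurrence, which is an assumption here since the slides do not name an algorithm. The scores +2, -2 and -1 are the ones on the slide; the function name is illustrative.

```python
def local_alignment_score(X, Y, match=2, mismatch=-2, gap=-1):
    """Best alignment score over all pairs of substrings of X and Y.

    Smith-Waterman style dynamic programming: every cell is clamped at 0,
    which corresponds to starting a fresh (possibly empty) alignment.
    """
    prev = [0] * (len(Y) + 1)
    best = 0
    for i in range(1, len(X) + 1):
        curr = [0] * (len(Y) + 1)
        for j in range(1, len(Y) + 1):
            s = match if X[i - 1] == Y[j - 1] else mismatch
            curr[j] = max(
                0,                 # empty alignment
                prev[j - 1] + s,   # align X[i-1] with Y[j-1]
                prev[j] + gap,     # X[i-1] against a space
                curr[j - 1] + gap, # Y[j-1] against a space
            )
            best = max(best, curr[j])
        prev = curr
    return best


print(local_alignment_score("pqraxabcstvq", "yxaxbacsll"))  # 8, as in the slide example
```

Only two rows of the table are kept, since the score alone does not require a traceback.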
String Matching Problem
• Whole matching: find the edit distance ED(q, s) between a data string s and a query string q.
• Substring matching: consider all substrings s[i:j] of s that are close to the query string.
• Two types of queries:
  • Range search seeks all the substrings of s that are within an edit distance of r from a given query q (r is the range of the query).
  • k-nearest neighbor search seeks the k closest substrings of s to q.
Challenges in solving the substring matching problem
• Finding the edit distance is very costly in terms of both time and space.
• The strings in the database may be very long.
• The database size for most applications grows exponentially.
New approach to overcome these challenges
• Define a lower-bound distance for substring searching.
• Improve this lower bound by using the idea of wavelet transformation.
• Use the MRS-index structure, based on the aforementioned distance formulations.
A dynamic programming algorithm for computing the edit distance
• Problem: find the edit distance between strings x and y.
• Create a (|x|+1)×(|y|+1) matrix C, where $C_{i,j}$ is the minimum number of operations needed to match $x_{1..i}$ with $y_{1..j}$. The matrix is constructed as follows:
  • $C_{i,0} = i$
  • $C_{0,j} = j$
  • $C_{i,j} = \min\{\, C_{i-1,j-1} + \text{cost}\ (\text{replace}),\ C_{i,j-1} + 1\ (\text{insert}),\ C_{i-1,j} + 1\ (\text{delete}) \,\}$
  • cost = 0 if $x_i = y_j$, else 1
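The recurrence above can be sketched directly in Python. The function name is illustrative; the matrix C follows the slide's definition, with Python's 0-based string indexing shifted by one.

```python
def edit_distance(x, y):
    """Minimum number of insert/delete/replace operations turning x into y."""
    m, n = len(x), len(y)
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        C[i][0] = i  # delete all i characters of x[:i]
    for j in range(n + 1):
        C[0][j] = j  # insert all j characters of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            C[i][j] = min(
                C[i - 1][j - 1] + cost,  # replace (or keep a matching character)
                C[i][j - 1] + 1,         # insert
                C[i - 1][j] + 1,         # delete
            )
    return C[m][n]


print(edit_distance("kitten", "sitting"))  # 3
print(edit_distance("survey", "surgery"))  # 2
```

The table has (|x|+1)·(|y|+1) cells, which is the time and space cost that the challenges slide refers to.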
How do we perform substring search?
• The same dynamic programming algorithm can be used to find the substrings most similar to a query string q.
• The difference is that we set $C_{0,j} = 0$ for all j, since any text position could be the start of a match.
• If the similarity distance bound is k, we report all positions j where $C_{m,j} \le k$ (m is the last row, m = |q|).
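A minimal sketch of this variant, assuming we only need the end positions of matches and therefore keep a single previous row. The function name and the 1-based position convention are choices made for this example.

```python
def approximate_match_ends(q, text, k):
    """Return 1-based positions j in text where some substring ending at j
    is within edit distance k of the query q."""
    m, n = len(q), len(text)
    prev = [0] * (n + 1)  # row 0: C[0][j] = 0, a match may start anywhere
    for i in range(1, m + 1):
        curr = [i] + [0] * n  # C[i][0] = i, as in the whole-matching case
        for j in range(1, n + 1):
            cost = 0 if q[i - 1] == text[j - 1] else 1
            curr[j] = min(prev[j - 1] + cost, curr[j - 1] + 1, prev[j] + 1)
        prev = curr
    # prev is now the last row (row m); report positions within the bound.
    return [j for j in range(1, n + 1) if prev[j] <= k]


print(approximate_match_ends("abc", "xxabcxx", 0))  # [5]
print(approximate_match_ends("abc", "xxabcxx", 1))  # [4, 5, 6]
```

This still scans the whole text for every query, which is the cost the index structure is meant to avoid.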
Frequency Vector
• Let s be a string from the alphabet $\Sigma = \{\alpha_1, ..., \alpha_\sigma\}$. Let $n_i$ be the number of occurrences of the character $\alpha_i$ in s for $1 \le i \le \sigma$; then the frequency vector is $f(s) = [n_1, ..., n_\sigma]$.
• Example:
  • s = AATGATAG
  • $f(s) = [n_A, n_C, n_G, n_T] = [4, 0, 2, 2]$
• Let $f(s) = [v_1, ..., v_\sigma]$ be the frequency vector of s; then $\sum_{i=1}^{\sigma} v_i = |s|$.
• An edit operation on s has one of the following effects on f(s), for $1 \le i, j \le \sigma$ and $i \ne j$:
  • $v_i := v_i + 1$
  • $v_i := v_i - 1$
  • $v_i := v_i + 1$ and $v_j := v_j - 1$
Effect of Edit Operations on Frequency Vector
• Delete: decreases an entry by 1.
• Insert: increases an entry by 1.
• Replace: insert + delete.
• Example:
  • s = AATGATAG => f(s) = [4, 0, 2, 2]
  • (del. G), s = AAT.ATAG => f(s) = [4, 0, 1, 2]
  • (ins. C), s = AACTATAG => f(s) = [4, 1, 1, 2]
  • (A→C), s = ACCTATAG => f(s) = [3, 2, 1, 2]
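The example above can be reproduced with a short sketch. The function name and the default alphabet order A, C, G, T (taken from the slide's example) are assumptions of this snippet.

```python
def frequency_vector(s, alphabet="ACGT"):
    """Count how often each alphabet character occurs in s."""
    return [s.count(ch) for ch in alphabet]


s = "AATGATAG"
print(frequency_vector(s))           # [4, 0, 2, 2]
print(frequency_vector("AATATAG"))   # after deleting a G:  [4, 0, 1, 2]
print(frequency_vector("AACTATAG"))  # after inserting a C: [4, 1, 1, 2]
print(frequency_vector("ACCTATAG"))  # after A -> C:        [3, 2, 1, 2]
```

Each single edit changes at most two entries by one, which is what makes a distance on frequency vectors usable as a lower bound for the edit distance.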
Frequency Distance
• Let u and v be integer points in $\sigma$-dimensional space. The frequency distance $FD_1(u, v)$ between u and v is defined as the minimum number of steps needed to go from u to v (or equivalently from v to u) by moving to a neighbor point at each step.
• Let $s_1$ and $s_2$ be two strings from the alphabet $\Sigma = \{\alpha_1, ..., \alpha_\sigma\}$, with frequency vectors $f(s) = [n_1, ..., n_\sigma]$; then
  • $FD_1(f(s_1), f(s_2)) \le ED(s_1, s_2)$
An Approximation to ED: Frequency Distance (FD1)
• s = AATGATAG => f(s) = [4, 0, 2, 2]
• q = ACTTAGC => f(q) = [2, 2, 1, 2]
• pos = (4-2) + (2-1) = 3
• neg = (2-0) = 2
• FD1(f(s), f(q)) = 3
• ED(q, s) = 4
• In general, FD1(f(s1), f(s2)) = max{pos, neg}.
• FD1(f(s1), f(s2)) ≤ ED(s1, s2).
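Frequency Distance Calculation
/* u and v are σ-dimensional integer points */
Algorithm: FD1(u, v)
  posDistance := negDistance := 0
  For i := 1 to σ
    if u_i > v_i then posDistance := posDistance + (u_i - v_i)
    else negDistance := negDistance + (v_i - u_i)
  FD1(u, v) := max{posDistance, negDistance}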
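A runnable sketch of the same calculation, checked against the s and q example from the previous slide. The function name is illustrative.

```python
def frequency_distance(u, v):
    """FD1: max of the total surplus of u over v and of v over u."""
    pos = sum(a - b for a, b in zip(u, v) if a > b)  # entries where u exceeds v
    neg = sum(b - a for a, b in zip(u, v) if b > a)  # entries where v exceeds u
    return max(pos, neg)


f_s = [4, 0, 2, 2]  # f("AATGATAG")
f_q = [2, 2, 1, 2]  # f("ACTTAGC")
print(frequency_distance(f_s, f_q))  # 3
```

The calculation touches each of the σ entries once, so it is far cheaper than the quadratic dynamic programming table for the edit distance.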
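Wavelet Vector Computation
Let $s = c_0 c_1 \dots c_{n-1}$ be a string from the alphabet $\Sigma = \{\alpha_1, ..., \alpha_\sigma\}$. The k-th level wavelet transformation $\psi_k(s)$, $0 \le k \le \log_2 n$, of s is defined as
$$\psi_k(s) = [v_{k,0}, ..., v_{k,n/2^k - 1}], \quad v_{k,i} = [A_{k,i}, B_{k,i}], \quad 0 \le i \le n/2^k - 1$$
$$A_{k,i} = \begin{cases} f(c_i), & k = 0 \\ A_{k-1,2i} + A_{k-1,2i+1}, & 0 < k \le \log_2 n \end{cases}$$
$$B_{k,i} = \begin{cases} 0, & k = 0 \\ A_{k-1,2i} - A_{k-1,2i+1}, & 0 < k \le \log_2 n \end{cases}$$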
Using Local Information: Wavelet Decomposition of Strings
• s = AATGATAC => f(s) = [4, 1, 1, 2]
• s = AATG + ATAC = s1 + s2
• f(s1) = [2, 0, 1, 1]
• f(s2) = [2, 1, 0, 1]
• ψ1(s) = f(s1) + f(s2) = [4, 1, 1, 2]
• ψ2(s) = f(s1) - f(s2) = [0, -1, 1, 0]
Wavelet Decomposition of a String: General Idea
• $A_{i,j} = f(s(j 2^i : (j+1) 2^i - 1))$ (first wavelet coefficient)
• $B_{i,j} = A_{i-1,2j} - A_{i-1,2j+1}$ (second wavelet coefficient)
Wavelet Transformation: Example
• s = TCAC, n = |s| = 4
• ψ0(s) = [v0,0, v0,1, v0,2, v0,3]
  = [(A0,0, B0,0), (A0,1, B0,1), (A0,2, B0,2), (A0,3, B0,3)]
  = [(f(T), 0), (f(C), 0), (f(A), 0), (f(C), 0)]
  = [([0,0,1], 0), ([0,1,0], 0), ([1,0,0], 0), ([0,1,0], 0)]
• ψ1(s) = [([0,1,1], [0,-1,1]), ([1,1,0], [1,-1,0])]
• ψ2(s) = [([1,2,1], [-1,0,1])]
• In each pair, the first entry is the first wavelet coefficient and the second entry is the second wavelet coefficient.
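The levels in this example can be recomputed with a small sketch. It assumes the string length is a power of two and represents the level-0 second coefficient as a zero vector rather than the scalar 0; the function name is illustrative.

```python
def wavelet_levels(s, alphabet):
    """Return [psi_0(s), psi_1(s), ...], each a list of (A, B) pairs.

    Assumes len(s) is a power of two (an assumption of this sketch).
    """
    n = len(s)
    if n == 0 or n & (n - 1):
        raise ValueError("length of s must be a power of two")
    # Level 0: A is the frequency vector of a single character, B is zero.
    A = [[1 if ch == a else 0 for a in alphabet] for ch in s]
    B = [[0] * len(alphabet) for _ in s]
    levels = [list(zip(A, B))]
    while len(A) > 1:
        half = len(A) // 2
        next_A = [[p + q for p, q in zip(A[2 * i], A[2 * i + 1])] for i in range(half)]
        next_B = [[p - q for p, q in zip(A[2 * i], A[2 * i + 1])] for i in range(half)]
        A, B = next_A, next_B
        levels.append(list(zip(A, B)))
    return levels


levels = wavelet_levels("TCAC", "ACT")
print(levels[1])  # [([0, 1, 1], [0, -1, 1]), ([1, 1, 0], [1, -1, 0])]
print(levels[2])  # [([1, 2, 1], [-1, 0, 1])]
```

Each level halves the number of entries, so the top level holds one pair: the frequency vector of the whole string and the difference between its two halves.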
Maximum Frequency Distance Calculation
• FD(s1, s2) = max{FD1(f(s1), f(s2)), FD2(ψ(s1), ψ(s2))}
• FD1 is the frequency distance.
• FD2 is the wavelet distance.
MRS-Index Structure Creation
• [Figure: a window of length w = 2^a is placed on string s1 and its contents are transformed.]
• [Figure: the window slides c times, where c is the box capacity.]
• [Figure: the resulting boxes for s1 form row Ta,1 of the index, for w = 2^a.]
Using Different Resolutions
• [Figure: row Ta,1 is built with w = 2^a and row Ta+1,1 with w = 2^(a+1).]
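A rough sketch of how one row of such an index might be built, under simplifying assumptions that are not taken from the slides: only the frequency vector (first coefficient) of each window is used, a box is stored as a tuple of its starting window position and its lowest and highest corner, and all names are hypothetical.

```python
def build_index_row(s, w, c, alphabet="ACGT"):
    """Slide a window of length w over s and group every c consecutive
    window vectors into one minimum bounding rectangle (MBR).

    Returns a list of (start, low, high) tuples. This layout is an
    assumption of this sketch, not the authors' storage format.
    """
    # One frequency vector per window position.
    vectors = [
        [s[i:i + w].count(ch) for ch in alphabet]
        for i in range(len(s) - w + 1)
    ]
    boxes = []
    for start in range(0, len(vectors), c):
        group = vectors[start:start + c]
        low = [min(col) for col in zip(*group)]   # per-dimension minimum
        high = [max(col) for col in zip(*group)]  # per-dimension maximum
        boxes.append((start, low, high))
    return boxes


# Two resolutions over the same (toy) string; real window sizes would be much larger.
row_a = build_index_row("AATGATAG", w=4, c=2)
row_a_plus_1 = build_index_row("AATGATAG", w=8, c=2)
print(row_a)
```

Consecutive windows differ by one character leaving and one entering, so their vectors are close together and the boxes stay small.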
MRS-index properties
• Relative MBR volume (precision) decreases when
  • c increases.
  • w decreases.
• MBRs are highly clustered.
• [Plot: box volume versus box capacity.]
Frequency Distance to an MBR
• Let q be the query string of length $2^i$, where $a \le i \le a + l - 1$. Given an MBR B, we define
  $$FD(q, B) = \min_{s \in B} FD(q, s)$$
• [Figure: the distance FD(f(q), f(s)) from f(q) to a point f(s), and the distance FD(f(q), B) from f(q) to the box B.]
Range Queries
1. Partition the query string into subqueries at the various resolutions available in our index.
2. Perform a partial range query for each subquery on the corresponding row of the index structure, and refine ε.
3. Read the disk pages corresponding to the last result set, and postprocess them to eliminate false retrievals.
[Figure: index rows for w = 2^4, 2^5, 2^6, 2^7 over strings s1, s2, ..., sd; a query q is partitioned into subqueries q1, q2, q3. The figure shows the lengths 208, 16, 64 and 128, where 208 = 16 + 64 + 128.]
k-Nearest Neighbor Query
• [Figure: k-nearest neighbor search with k = 3.]
• r = edit distance to the 3rd closest substring.
Experimental Settings
• w = {128, 256, 512, 1024}.
• Human chromosomes from www.ncbi.nlm.nih.gov:
  • chr02, chr18, chr21, chr22
  • Plotted results are from the chr18 dataset.
• Queries are selected randomly from the data set, with 512 ≤ |q| ≤ 10000.
• An NFA-based technique [BYN99] is implemented for comparison.
Experimental Results 1: Effect of Box Capacity (10-NN)
• The cost of the MRS-index increases as the box capacity increases.
• The cost of the MRS-index is much lower than that of the NFA technique for all these box capacities.
• Although using the second wavelet coefficient slightly improves performance for the same box capacity, it doubles the size of the index structure. For the same amount of memory, the single-coefficient version performs better.
Experimental Results 2: Effect of Window Size (10-NN)
• The MRS-index structure outperforms the NFA technique for all the window sizes.
• The performance of the MRS-index structure itself improves as the window size increases.
Experimental Results 3: k-NN Queries
• Although the performance of the MRS-index structure drops for large values of k, it still performs better than the NFA technique.
• Speedups of up to 45 were achieved for 10 nearest neighbors. The speedup for 200 nearest neighbors is 3.
• As the number of nearest neighbors increases, the performance of the MRS-index structure approaches that of the NFA technique.
Experimental Results 4: Range Queries
• The MRS-index structure performed up to 12 times faster than the NFA technique.
• The performance of the MRS-index structure improved when the queries were selected from different data strings. This is because the DNA strings have a high self-similarity.
• The performance of the MRS-index structure deteriorates as the error rate increases. This is because the size of the candidate set increases as the error rate increases.
Discussion
• In-memory (index size is 1-2% of the database size).
• Lossless search.
• 3 to 45 times faster than the NFA technique for k-NN queries.
• 2 to 12 times faster than the NFA technique for range queries.
• Can be used to speed up any previously defined technique.