This research presents a compact index structure for quick substring matching in large databases, showcasing frequency vector and wavelet transform techniques. Discover our novel multi-resolution index structure and experimental results.
An Efficient Index Structure for String Databases Tamer Kahveci Ambuj K. Singh Department of Computer Science University of California Santa Barbara http://www.cs.ucsb.edu/~tamer
Whole/Substring Matching Problem • Quickly find the substrings in a database that are similar to a given query string, using a small index structure (1-2% of the database size). • [Figure: a database string and a query string]
String Similarity • Motivation: • Applications • Genetic sequence databases, NCBI • Text databases, spell checkers, web search. • Video databases (e.g. VIRAGE, MEDIA360) • Database size is too large. Most of the techniques available are in-memory. • Space requirement of current indexes is too large. • [Plot: base pairs (millions) by year]
Outline • Motivation & background • Our contribution • Frequency vector, frequency distance & wavelet transform • Multi-resolution index structure • k-NN & range queries • Experimental results • Conclusion
Notation • q : query string. • m, n : lengths of strings. • r : range query radius. • ε = r/|q| : error rate.
String Similarity: an example • A C T - - T A G C • A A T G A T A G - • Edit operations by column: R (column 2), I I (columns 4 and 5), D (column 9)
Background • Edit operations: • Insert • Delete • Replace • Edit distance (ED) between s1 and s2 = minimum number of edit operations needed to transform s1 into s2. • Finding the edit distance is costly: • O(mn) time and space with dynamic programming [NW70, SW81], where m and n are the lengths of s1 and s2.
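The O(mn) dynamic program mentioned above can be sketched as follows. This is the standard textbook recurrence with unit costs for insert, delete, and replace; the function name `edit_distance` is chosen here for illustration and is not taken from the slides.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Unit-cost edit distance using an (m+1) x (n+1) table."""
    m, n = len(s1), len(s2)
    # dist[i][j] holds the edit distance between s1[:i] and s2[:j].
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete all i characters of s1[:i]
    for j in range(n + 1):
        dist[0][j] = j  # insert all j characters of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete
                dist[i][j - 1] + 1,         # insert
                dist[i - 1][j - 1] + cost,  # match or replace
            )
    return dist[m][n]


print(edit_distance("ACTTAGC", "AATGATAG"))
```

The table has (m+1)(n+1) cells and each cell is filled in constant time, which is where the O(mn) time and space bound comes from.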
Related Work • Lossless search • Online • [Mye86] (Myers) reduces space requirement to O(rn), where r is query radius. • [WM92] (Wu, Manber) binary masks, O(rn). • [BYN99] (Baeza-Yates, Navarro) NFA • Offline (index based) • [Mye94] (Myers) condensed r-neighborhood. • [BYN97] (Baeza-Yates, Navarro) dictionary. • Lossy search • [AG90] (Altschul, Gish) BLAST. • FASTA, SENSEI, MegaBLAST, WU-BLAST, PHI-BLAST, FLASH, QUASAR, REPuter, MUMmer. • [GWWV00] (Giladi, Walker, Wang, Volkmuth) SST-Tree
Outline • Motivation & background • Our contribution • Frequency vector, frequency distance & wavelet transform • Multi-resolution index structure • k-NN & range queries • Experimental results • Conclusion
Frequency Vector • Let s be a string from the alphabet Σ = {α1, ..., ασ}. Let ni be the number of occurrences of the character αi in s for 1 ≤ i ≤ σ; then the frequency vector is f(s) = [n1, ..., nσ]. • Example: • s = AATGATAG • f(s) = [nA, nC, nG, nT] = [4, 0, 2, 2]
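As a minimal sketch of the definition above, the frequency vector can be computed in one pass over the string. The function name `frequency_vector` and the default alphabet order "ACGT" (matching the [nA, nC, nG, nT] example) are assumptions made for this illustration.

```python
def frequency_vector(s: str, alphabet: str = "ACGT") -> list:
    """Count how often each alphabet character occurs in s."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    counts = [0] * len(alphabet)
    for ch in s:
        counts[index[ch]] += 1  # raises KeyError for characters outside the alphabet
    return counts


print(frequency_vector("AATGATAG"))
```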
Effect of Edit Operations on Frequency Vector • Delete : decreases an entry by 1. • Insert : increases an entry by 1. • Replace : Insert + Delete • Example: • s = AATGATAG => f(s) = [4, 0, 2, 2] • (del. G), s = AAT.ATAG => f(s) = [4, 0, 1, 2] • (ins. C), s = AACTATAG => f(s) = [4, 1, 1, 2] • (A→C), s = ACCTATAG => f(s) = [3, 2, 1, 2]
An Approximation to ED: Frequency Distance (FD1) • s = AATGATAG => f(s)=[4, 0, 2, 2] • q = ACTTAGC => f(q)=[2, 2, 1, 2] • pos = (4-2) + (2-1) = 3 (entries where f(s) exceeds f(q)) • neg = (2-0) = 2 (entries where f(q) exceeds f(s)) • FD1(f(s),f(q)) = 3 • ED(q,s) = 4 • FD1(f(s1),f(s2)) = max{pos, neg}. • FD1(f(s1),f(s2)) ≤ ED(s1,s2). • [Figure: f(s), f(q), and FD1(f(q),f(s))]
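The pos/neg computation above can be sketched directly on two frequency vectors. The function name `frequency_distance` is chosen for this illustration; the vectors are assumed to have the same length and alphabet order.

```python
def frequency_distance(u: list, v: list) -> int:
    """FD1: the larger of the total surplus of u over v and of v over u."""
    pos = sum(a - b for a, b in zip(u, v) if a > b)  # entries where u exceeds v
    neg = sum(b - a for a, b in zip(u, v) if b > a)  # entries where v exceeds u
    return max(pos, neg)


f_s = [4, 0, 2, 2]  # f(AATGATAG)
f_q = [2, 2, 1, 2]  # f(ACTTAGC)
print(frequency_distance(f_s, f_q))
```

Because one edit operation changes pos or neg by at most 1, this value never exceeds the edit distance, which is what makes it usable as a cheap filter.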
An Illustration of Frequency Distance & Edit Distance • [Figure: set of strings 1 and set of strings 2, with frequency vectors v1 and v2; labels: frequency distance, edit distance]
Using Local Information: Wavelet Decomposition of Strings • s = AATGATAC => f(s)=[4, 1, 1, 2] • s = AATG + ATAC = s1 + s2 • f(s1) = [2, 0, 1, 1] • f(s2) = [2, 1, 0, 1] • ψ1(s) = f(s1)+f(s2) = [4, 1, 1, 2] • ψ2(s) = f(s1)-f(s2) = [0, -1, 1, 0]
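Wavelet Decomposition of a String: General Idea • A_{i,j} = f(s(j·2^i : (j+1)·2^i − 1)) • B_{i,j} = A_{i−1,2j} − A_{i−1,2j+1} • For |s| = n = 2^k: first wavelet coefficient A_{log2 n, 0}, second wavelet coefficient B_{log2 n, 0} • ψ(s) = [A_{log2 n, 0}, B_{log2 n, 0}]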
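A minimal sketch of the recurrence above, assuming the string length is a power of two and the alphabet order is "ACGT". It builds the A vectors bottom-up, using the fact that the frequency vector of a concatenation is the sum of the two halves, and keeps the difference of the two halves at the top level as the second coefficient. The function name `wavelet_coefficients` is hypothetical.

```python
def wavelet_coefficients(s: str, alphabet: str = "ACGT"):
    """Return (first, second) wavelet coefficients of s, with len(s) a power of two."""
    n = len(s)
    if n < 2 or n & (n - 1):
        raise ValueError("length must be a power of two and at least 2")
    index = {ch: i for i, ch in enumerate(alphabet)}
    # Level 0: one frequency vector per single character.
    level = []
    for ch in s:
        vec = [0] * len(alphabet)
        vec[index[ch]] = 1
        level.append(vec)
    diffs = []
    # Each pass merges neighbouring pairs: sums give the next A level,
    # differences give the B values of that level.
    while len(level) > 1:
        sums, diffs = [], []
        for j in range(0, len(level), 2):
            left, right = level[j], level[j + 1]
            sums.append([a + b for a, b in zip(left, right)])
            diffs.append([a - b for a, b in zip(left, right)])
        level = sums
    return level[0], diffs[0]


print(wavelet_coefficients("AATGATAC"))
```

For s = AATGATAC the two returned vectors correspond to f(s1)+f(s2) and f(s1)-f(s2) for the halves s1 = AATG and s2 = ATAC.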
Wavelet Decomposition & ED • Define FD(s1,s2)=max{FD1, FD2}.
Outline • Motivation & background • Our contribution • Frequency vector, frequency distance & wavelet transform • Multi-resolution index structure • k-NN and range queries • Experimental results • Conclusion
MRS-Index Structure Creation • [Figure sequence: a window of length w = 2^a is placed on string s1 and transformed; the window is slid c times (c = box capacity); the process repeats along s1, producing the row T_{a,1} for w = 2^a.]
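The construction steps can be sketched as below, under loud assumptions: each window is represented only by its frequency vector (the first wavelet coefficient), the window slides one character at a time, and every c consecutive window vectors are enclosed in one box stored as a (lo, hi) pair. The names `window_vectors` and `build_row` are hypothetical and the slides may store more per box than this sketch does.

```python
def window_vectors(s: str, w: int, alphabet: str = "ACGT") -> list:
    """Frequency vectors of all length-w windows of s, updated incrementally."""
    if len(s) < w:
        return []
    index = {ch: i for i, ch in enumerate(alphabet)}
    vec = [0] * len(alphabet)
    for ch in s[:w]:
        vec[index[ch]] += 1
    vectors = [vec[:]]
    for start in range(1, len(s) - w + 1):
        vec[index[s[start - 1]]] -= 1      # character leaving the window
        vec[index[s[start + w - 1]]] += 1  # character entering the window
        vectors.append(vec[:])
    return vectors


def build_row(s: str, w: int, c: int, alphabet: str = "ACGT") -> list:
    """Group every c consecutive window vectors into a bounding box (lo, hi)."""
    points = window_vectors(s, w, alphabet)
    boxes = []
    for k in range(0, len(points), c):
        group = points[k:k + c]
        lo = [min(col) for col in zip(*group)]
        hi = [max(col) for col in zip(*group)]
        boxes.append((lo, hi))
    return boxes


print(build_row("AATGATAG", w=4, c=2))
```

Consecutive windows differ in at most two vector entries, which is why boxes built from neighbouring windows tend to stay small.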
Using Different Resolutions • [Figure: s1 indexed at two resolutions: row T_{a,1} for w = 2^a and row T_{a+1,1} for w = 2^(a+1)]
MRS-index properties • Relative MBR volume (Precision) decreases when • c increases. • w decreases. • MBRs are highly clustered. • [Plot: box volume vs. box capacity]
Outline • Motivation & background • Our contribution • Frequency vector, frequency distance & wavelet transform • Multi-resolution index structure • k-NN & range queries • Experimental results • Conclusion
Range Queries [KS01] • [Figure: a query of length 208 shown with pieces of lengths 16, 64, and 128, searched against index rows for w = 2^4, 2^5, 2^6, and 2^7 over database strings s1, s2, ..., sd]
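The figure's numbers (208 = 16 + 64 + 128) suggest that a query is cut into pieces whose lengths match the available window sizes. A hedged sketch of one way to pick such lengths is a greedy split, largest window first; the greedy rule and the function name `partition_query_length` are assumptions of this sketch, not something the slides state.

```python
def partition_query_length(query_length: int, resolutions: list):
    """Greedily split a length into available window sizes, largest first.

    Returns the chosen piece lengths and any leftover shorter than the
    smallest window.
    """
    pieces = []
    remaining = query_length
    for w in sorted(resolutions, reverse=True):
        while remaining >= w:
            pieces.append(w)
            remaining -= w
    return pieces, remaining


print(partition_query_length(208, [16, 32, 64, 128]))
```

With windows 2^4 to 2^7, the split of 208 coincides with its binary representation (128 + 64 + 16), and no piece of length 32 is needed.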
k-Nearest Neighbor Query • k = 3 • r = edit distance to the 3rd closest substring • [Figure sequence: the query, its k = 3 candidates, and the radius r]
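One reading of these slides is a two-phase search: first fix a radius r from k candidates, then run a range query with that radius. The sketch below follows that reading on a plain list of candidate strings rather than on the MRS index, and uses FD1 as the lower-bound filter; the function names and the choice of initial candidates (smallest FD1 first) are assumptions of this sketch.

```python
def frequency_vector(s, alphabet="ACGT"):
    return [s.count(ch) for ch in alphabet]


def frequency_distance(u, v):
    pos = sum(a - b for a, b in zip(u, v) if a > b)
    neg = sum(b - a for a, b in zip(u, v) if b > a)
    return max(pos, neg)


def edit_distance(s1, s2):
    prev = list(range(len(s2) + 1))  # row for the empty prefix of s1
    for i, a in enumerate(s1, start=1):
        cur = [i]
        for j, b in enumerate(s2, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def knn(query, candidates, k, alphabet="ACGT"):
    """Return the k (distance, string) pairs closest to query in edit distance."""
    fq = frequency_vector(query, alphabet)
    scored = sorted(
        (frequency_distance(fq, frequency_vector(c, alphabet)), c) for c in candidates
    )
    # Phase 1: exact distances for the k most promising candidates give radius r.
    best = sorted((edit_distance(query, c), c) for _, c in scored[:k])
    if not best:
        return []
    r = best[-1][0]
    # Phase 2: range query of radius r; FD1 <= ED, so FD1 > r rules a string out.
    survivors = [(edit_distance(query, c), c) for fd, c in scored[k:] if fd <= r]
    return sorted(best + survivors)[:k]


strings = ["AATGATAG", "ACTTAGC", "ACTTAGG", "GGGGGGGG", "ACTTAG"]
print(knn("ACTTAGC", strings, k=3))
```

The filter is lossless: every true neighbour has edit distance at most r, hence FD1 at most r, so it is never discarded in phase 2.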
Outline • Motivation & background • Our contribution • Experimental results • Conclusion
Experimental Settings • w={128, 256, 512, 1024}. • Human chromosomes from NCBI (www.ncbi.nlm.nih.gov) • chr02, chr18, chr21, chr22 • Plotted results are from the chr18 dataset. • Queries are selected randomly from the data set, with 512 ≤ |q| ≤ 10000. • An NFA-based technique [BYN99] is implemented for comparison.
Outline • Motivation & background • Our Contribution • Experimental results • Discussion & conclusion
Discussion • In-memory (index size is 1-2% of the database size). • Lossless search. • 3 to 45 times faster than the NFA technique for k-NN queries. • 2 to 12 times faster than the NFA technique for range queries. • Can be used to speed up any previously defined technique.
Future Work • Extend to weighted edit distance and affine gaps. • Extend to local similarity (substring/substring) search. • Compare the quality of answers and speed to BLAST (lossy search). • Use as a preprocessing step to BLAST. • Apply the MRS index structure to larger alphabets (e.g. protein sequences).
Related Work (Similar problems) • [BYP92] (Baeza-Yates, Perleberg) only replace is allowed. • [Gus97] (Gusfield) exact matching, suffix trees. • [JKS00] (Jagadish, Koudas, Srivastava) exact matching with wild-cards for multidimensional strings, elided trees and R-tree.
Frequency Distance to an MBR • [Figure: a box B containing f(s), with the query point f(q) outside it; labels: FD(f(q),f(s)) and FD(f(q),B)]
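A hedged sketch of a distance from a query vector to a box, assuming the box is stored as per-dimension bounds (lo, hi) and that FD1 is the distance being bounded. It counts only the surplus that every point in the box is forced to have relative to the query, so it never exceeds FD1 to any vector inside the box. It ignores the fact that all window vectors sum to the window length, so the computation behind the slide may give a tighter value; the function name is hypothetical.

```python
def frequency_distance_to_box(q: list, lo: list, hi: list) -> int:
    """Lower bound on FD1 between q and any vector v with lo <= v <= hi."""
    # Dimensions where the whole box lies above q: every v exceeds q there.
    pos = sum(l - x for x, l in zip(q, lo) if x < l)
    # Dimensions where the whole box lies below q: q exceeds every v there.
    neg = sum(x - h for x, h in zip(q, hi) if x > h)
    return max(pos, neg)


f_q = [2, 2, 1, 2]  # f(ACTTAGC)
print(frequency_distance_to_box(f_q, lo=[1, 0, 1, 1], hi=[2, 0, 1, 2]))
```

A box whose bound already exceeds the query radius can be skipped without examining the strings it covers.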