Efficient Indexing for String Databases: A Novel Approach

An Efficient Index Structure for String Databases • Tamer Kahveci, Ambuj K. Singh • Department of Computer Science, University of California, Santa Barbara • http://www.cs.ucsb.edu/~tamer

Whole/Substring Matching Problem • Quickly find the substrings in a database that are similar to a given query string, using a small index structure (1-2% of the database size). • [Figure: database string and query string]

String Similarity • Motivation: • Applications • Genetic sequence databases, NCBI • Text databases, spell checkers, web search. • Video databases (e.g. VIRAGE, MEDIA360) • Database size is too large; most of the available techniques are in-memory. • Space requirement of current indexes is too large. • [Figure: base pairs (millions) by year]

Outline • Motivation & background • Our contribution • Frequency vector, frequency distance & wavelet transform • Multi-resolution index structure • k-NN & range queries • Experimental results • Conclusion

Notation • q : query string. • m, n : lengths of strings. • r : range query radius. • ε = r/|q| : error rate.

String Similarity: an example (R = replace, I = insert, D = delete)
• A C T - - T A G C
•   R   I I       D
• A A T G A T A G -

Background • Edit operations: • Insert • Delete • Replace • Edit distance (ED) between s1 and s2 = minimum number of edit operations needed to transform s1 into s2. • Computing the edit distance is costly: • O(mn) time and space with dynamic programming, where m and n are the lengths of s1 and s2 [NW70, SW81].
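
As an illustration of the O(mn) dynamic program mentioned on this slide, here is a minimal sketch of unit-cost edit distance. The function name `edit_distance` is chosen here for illustration and does not come from the talk.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Unit-cost edit distance (insert, delete, replace) via a full DP table."""
    m, n = len(s1), len(s2)
    # dp[i][j] = edit distance between s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all i characters of s1[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all j characters of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            replace_cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,                 # delete from s1
                dp[i][j - 1] + 1,                 # insert into s1
                dp[i - 1][j - 1] + replace_cost,  # match or replace
            )
    return dp[m][n]


print(edit_distance("ACTTAGC", "AATGATAG"))
```

The table has (m+1)(n+1) cells and each cell takes constant work, which is where the O(mn) time and space come from.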

Related Work • Lossless search • Online • [Mye86] (Myers) reduces space requirement to O(rn), where r is query radius. • [WM92] (Wu, Manber) binary masks, O(rn). • [BYN99] (Baeza-Yates, Navarro) NFA • Offline (index based) • [Mye94] (Myers) condensed r-neighborhood. • [BYN97] (Baeza-Yates, Navarro) dictionary. • Lossy search • [AG90] (Altschul, Gish) BLAST. • FASTA, SENSEI, MegaBLAST, WU-BLAST, PHI-BLAST, FLASH, QUASAR, REPUTER, MUMmer. • [GWWV00] (Giladi, Walker, Wang, Volkmuth) SST-Tree

Frequency Vector • Let s be a string over the alphabet Σ = {α1, ..., ασ}. Let ni be the number of occurrences of the character αi in s for 1 ≤ i ≤ σ. Then the frequency vector of s is f(s) = [n1, ..., nσ]. • Example: • s = AATGATAG • f(s) = [nA, nC, nG, nT] = [4, 0, 2, 2]
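
A minimal sketch of the frequency vector, assuming a DNA alphabet in the order A, C, G, T as in the example on this slide. The name `frequency_vector` and the `alphabet` parameter are illustrative.

```python
from collections import Counter


def frequency_vector(s: str, alphabet: str = "ACGT") -> list[int]:
    """Count how often each alphabet character occurs in s, in alphabet order."""
    counts = Counter(s)
    return [counts[ch] for ch in alphabet]


print(frequency_vector("AATGATAG"))
```

The vector has one entry per alphabet character, so its size depends on the alphabet and not on the length of the string.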

Effect of Edit Operations on Frequency Vector • Delete : decreases an entry by 1. • Insert : increases an entry by 1. • Replace : Insert + Delete • Example: • s = AATGATAG => f(s) = [4, 0, 2, 2] • (del. G), s = AAT.ATAG => f(s) = [4, 0, 1, 2] • (ins. C), s = AACTATAG => f(s) = [4, 1, 1, 2] • (A→C), s = ACCTATAG => f(s) = [3, 2, 1, 2]

An Approximation to ED: Frequency Distance (FD1) • s = AATGATAG => f(s) = [4, 0, 2, 2] • q = ACTTAGC => f(q) = [2, 2, 1, 2] • pos = (4-2) + (2-1) = 3 • neg = (2-0) = 2 • FD1(f(s),f(q)) = 3 • ED(q,s) = 4 • FD1(f(s1),f(s2)) = max{pos, neg}. • FD1(f(s1),f(s2)) ≤ ED(s1,s2). • [Figure: f(s), f(q) and FD1(f(q),f(s))]
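
The pos/neg computation on this slide can be sketched as follows. The function name `fd1` is illustrative; it takes two frequency vectors of equal length.

```python
def fd1(u: list[int], v: list[int]) -> int:
    """Frequency distance: max of total surplus (pos) and total deficit (neg)."""
    pos = sum(a - b for a, b in zip(u, v) if a > b)  # entries where u exceeds v
    neg = sum(b - a for a, b in zip(u, v) if b > a)  # entries where v exceeds u
    return max(pos, neg)


f_s = [4, 0, 2, 2]  # f(AATGATAG)
f_q = [2, 2, 1, 2]  # f(ACTTAGC)
print(fd1(f_s, f_q))
```

Because FD1 never exceeds the edit distance, a candidate whose FD1 to the query is larger than the search radius can be discarded without running the O(mn) edit distance computation.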

An Illustration of Frequency Distance & Edit Distance • [Figure: set of strings 1 and set of strings 2 with frequency vectors v1 and v2, showing frequency distance versus edit distance]

Using Local Information: Wavelet Decomposition of Strings • s = AATGATAC => f(s) = [4, 1, 1, 2] • s = AATG + ATAC = s1 + s2 • f(s1) = [2, 0, 1, 1] • f(s2) = [2, 1, 0, 1] • ψ1(s) = f(s1) + f(s2) = [4, 1, 1, 2] • ψ2(s) = f(s1) - f(s2) = [0, -1, 1, 0]

Wavelet Decomposition of a String: General Idea • A_{i,j} = f(s(j·2^i : (j+1)·2^i - 1)) • B_{i,j} = A_{i-1,2j} - A_{i-1,2j+1} • For |s| = n: ψ(s) = [A_{log2 n,0}, B_{log2 n,0}] (first and second wavelet coefficients)
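
The recurrences for A and B can be sketched bottom-up as below, assuming the string length is a power of two (at least 2) and a DNA alphabet. The helper names `freq`, `wavelet_levels` and `wavelet_transform` are illustrative. The sketch relies on the fact that the frequency vector of a block is the sum of the frequency vectors of its two halves.

```python
def freq(s: str, alphabet: str = "ACGT") -> list[int]:
    """Frequency vector of s in alphabet order."""
    return [s.count(ch) for ch in alphabet]


def wavelet_levels(s: str, alphabet: str = "ACGT"):
    """Return (A, B): A[i][j] is the frequency vector of the j-th block of
    length 2**i, and B[i][j] = A[i-1][2j] - A[i-1][2j+1] for i >= 1."""
    n = len(s)
    if n < 2 or n & (n - 1):
        raise ValueError("string length must be a power of two, at least 2")
    A = [[freq(ch, alphabet) for ch in s]]  # level 0: single characters
    B = [[]]                                # no differences at level 0
    while len(A[-1]) > 1:
        prev = A[-1]
        pairs = [(prev[2 * j], prev[2 * j + 1]) for j in range(len(prev) // 2)]
        A.append([[x + y for x, y in zip(l, r)] for l, r in pairs])
        B.append([[x - y for x, y in zip(l, r)] for l, r in pairs])
    return A, B


def wavelet_transform(s: str, alphabet: str = "ACGT"):
    """First and second wavelet coefficients of the whole string."""
    A, B = wavelet_levels(s, alphabet)
    return A[-1][0], B[-1][0]


print(wavelet_transform("AATGATAC"))
```

The first coefficient is the plain frequency vector; the second records how the characters are split between the two halves, which is the local information the frequency vector alone discards.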

Wavelet Decomposition & ED • Define FD(s1,s2)=max{FD1, FD2}.

Outline • Motivation & background • Our contribution • Frequency vector, frequency distance & wavelet transform • Multi-resolution index structure • k-NN and range queries • Experimental results • Conclusion

MRS-Index Structure Creation • [Figure sequence: a window of length w = 2^a slides over string s1 and each window is transformed. After the window has slid c times (c = box capacity), the transformed windows are grouped into one box. The sequence of boxes for s1 at this resolution is labeled T_{a,1}.]

Using Different Resolutions • [Figure: sliding a window over s1 with w = 2^a yields T_{a,1}, and with w = 2^(a+1) yields T_{a+1,1}.]

MRS-Index Structure
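
A rough sketch of how one row of such an index might be built. This is a simplification under stated assumptions: it uses only the frequency vector of each window, not the full wavelet transform shown on the slides, and the names `build_row` and `build_index` and the `(start, lo, hi)` box layout are hypothetical.

```python
def freq(s: str, alphabet: str = "ACGT") -> list[int]:
    """Frequency vector of s in alphabet order."""
    return [s.count(ch) for ch in alphabet]


def build_row(s: str, w: int, c: int, alphabet: str = "ACGT"):
    """Slide a window of length w over s and group every c consecutive
    window vectors into one minimum bounding rectangle (MBR).
    Each box is (index of its first window, lower corner, upper corner)."""
    vectors = [freq(s[i:i + w], alphabet) for i in range(len(s) - w + 1)]
    boxes = []
    for start in range(0, len(vectors), c):
        group = vectors[start:start + c]
        lo = [min(col) for col in zip(*group)]  # per-dimension minimum
        hi = [max(col) for col in zip(*group)]  # per-dimension maximum
        boxes.append((start, lo, hi))
    return boxes


def build_index(s: str, window_sizes, c: int, alphabet: str = "ACGT"):
    """One row of boxes per resolution (window size)."""
    return {w: build_row(s, w, c, alphabet) for w in window_sizes if w <= len(s)}


for box in build_row("AATGATAG", w=4, c=2):
    print(box)
```

Recomputing each window from scratch keeps the sketch short; since consecutive windows differ by one character leaving and one entering, an incremental update of the vector would be the natural optimization.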

MRS-index properties • Relative MBR volume (Precision) decreases when • c increases. • w decreases. • MBRs are highly clustered. • [Figure: box volume versus box capacity]

1= 2 1 3 2 208 16 64 128 Range Queries [KS01] s1 s2 sd ... ... ... ... w=24 ... ... ... ... w=25 ... ... ... ... w=26 ... ... ... ... w=27

k-Nearest Neighbor Query [KSF+96, SK98] • Example with k = 3. • r = edit distance to the 3rd closest substring. • [Figure sequence: the query, k = 3 candidates, and the search radius r]
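
A simplified sketch of a two-phase k-NN search in the spirit of these slides: pick k candidates using the cheap lower bound, set r to the largest true edit distance among them, then run a range query with radius r. It scans a plain list of candidate strings instead of an index, uses FD1 on frequency vectors as the lower bound, and all function names are illustrative.

```python
def freq(s, alphabet="ACGT"):
    """Frequency vector of s in alphabet order."""
    return [s.count(ch) for ch in alphabet]


def fd1(u, v):
    """Frequency distance, a lower bound on edit distance."""
    pos = sum(a - b for a, b in zip(u, v) if a > b)
    neg = sum(b - a for a, b in zip(u, v) if b > a)
    return max(pos, neg)


def edit_distance(s1, s2):
    """Unit-cost edit distance, keeping only two DP rows."""
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, start=1):
        cur = [i]
        for j, b in enumerate(s2, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def knn(query, candidates, k):
    """Return the k (distance, string) pairs closest to query."""
    fq = freq(query)
    bounds = [(fd1(fq, freq(s)), s) for s in candidates]
    # Phase 1: the k candidates with the smallest lower bound give a radius r.
    first = sorted(bounds, key=lambda t: t[0])[:k]
    r = max(edit_distance(query, s) for _, s in first)
    # Phase 2: range query. Anything with FD1 > r cannot be within distance r.
    survivors = [s for bound, s in bounds if bound <= r]
    return sorted((edit_distance(query, s), s) for s in survivors)[:k]


data = ["AATGATAG", "ACTTAGC", "ACTTAGG", "TTTTTTT", "ACTTAG"]
print(knn("ACTTAGC", data, k=2))
```

The result is exact: the k true nearest neighbors all lie within distance r, and since FD1 never exceeds the edit distance they all survive the filter in phase 2.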

Outline • Motivation & background • Our contribution • Experimental results • Conclusion

Experimental Settings • w = {128, 256, 512, 1024}. • Human chromosomes from www.ncbi.nlm.nih.gov: • chr02, chr18, chr21, chr22 • Plotted results are from the chr18 dataset. • Queries are selected randomly from the data set, with 512 ≤ |q| ≤ 10000. • An NFA-based technique [BYN99] is implemented for comparison.

Experimental Results 1: Effect of Box Capacity (10-NN)

Experimental Results 2: Effect of Window Size (10-NN)

Experimental Results 3: k-NN queries

Experimental Results 4: Range Queries

Outline • Motivation & background • Our Contribution • Experimental results • Discussion & conclusion

Discussion • In-memory (index size is 1-2% of the database size). • Lossless search. • 3 to 45 times faster than the NFA technique for k-NN queries. • 2 to 12 times faster than the NFA technique for range queries. • Can be used to speed up any previously defined technique.

Future Work • Extend to weighted edit distance and affine gaps. • Extend to local similarity (substring/substring) search. • Compare the quality of answers and speed to BLAST (lossy search). • Use as a preprocessing step to BLAST. • Apply the MRS index structure to larger alphabets (e.g. protein sequences).

Related Work (Similar problems) • [BYP92] (Baeza-Yates, Perleberg) only replace is allowed. • [Gus97] (Gusfield) exact matching, suffix trees. • [JKS00] (Jagadish, Koudas, Srivastava) exact matching with wild-cards for multidimensional strings, elided trees and R-tree.

THANK YOU

Frequency Distance to an MBR • [Figure: box B, vectors f(s) and f(q), and the distances FD(f(q),f(s)) and FD(f(q),B)]
