Similarity Searches in Sequence Databases Sang-Hyun Park KMeD Research Group Computer Science Department University of California, Los Angeles
Contents • Introduction • Whole Sequence Searches • Subsequence Searches • Segment-Based Subsequence Searches • Multi-Dimensional Subsequence Searches • Conclusion
What is a Sequence? • A sequence is an ordered list of elements. S = ⟨14.3, 18.2, 22.0, 22.4, 19.5, 17.1, 15.8, 15.1⟩ • Sequences are a principal data format in many applications. [Figure: temperature (°C) sampled from 8AM to 10PM]
What is Similarity Search? • Similarity search finds sequences whose changing patterns are similar to that of a query sequence. • Example • Detect stocks with similar growth patterns • Find persons with similar voice clips • Find patients whose brain tumors have similar evolution patterns • Similarity search helps in clustering, data mining, and rule discovery.
Classification of Similarity Search • Similarity searches are classified as: • Whole sequence searches • Subsequence searches • Example • S = ⟨1,2,3⟩ • Subsequences(S) = { ⟨1⟩, ⟨2⟩, ⟨3⟩, ⟨1,2⟩, ⟨2,3⟩, ⟨1,2,3⟩ } • In whole sequence searches, the sequence S itself is compared with a query sequence Q. • In subsequence searches, every possible subsequence of S can be compared with a query sequence q.
Similarity Measure • Lp Distance Metric • L1 : Manhattan distance or city-block distance • L2 : Euclidean distance • L∞ : maximum distance over all element pairs • Requires that the two sequences have the same length (see the sketch below)
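A minimal sketch of these metrics (function and variable names are ours, not from the slides); both assume equal-length sequences:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// General Lp distance: the p-th root of the sum of |S[i] - Q[i]|^p.
double lp_distance(const std::vector<double>& s,
                   const std::vector<double>& q, double p) {
    double sum = 0.0;
    for (size_t i = 0; i < s.size(); ++i)          // requires |s| == |q|
        sum += std::pow(std::fabs(s[i] - q[i]), p);
    return std::pow(sum, 1.0 / p);
}

// L-infinity: the maximum distance over all element pairs.
double linf_distance(const std::vector<double>& s,
                     const std::vector<double>& q) {
    double m = 0.0;
    for (size_t i = 0; i < s.size(); ++i)
        m = std::max(m, std::fabs(s[i] - q[i]));
    return m;
}
```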
Similarity Measure (2) • Time Warping Distance • Originally introduced in the area of speech recognition • Allows sequences to be stretched along the time axis: ⟨3,5,6⟩ → ⟨3,3,5,6⟩ → ⟨3,3,3,5,6⟩ → ⟨3,3,3,5,5,6⟩ → … • Each element of a sequence can be mapped to one or more neighboring elements of another sequence. • Useful in applications where sequences may have different lengths or different sampling rates, e.g., Q = ⟨10, 15, 20⟩ and S = ⟨10, 15, 16, 20⟩
Similarity Measure (3) • Time Warping Distance (2) • Defined recursively (S[2:-] denotes S with its first element removed) • Computed by the dynamic programming technique in O(|S||Q|) time: DTW(S, Q) = DBASE(S[1], Q[1]) + min { DTW(S, Q[2:-]), DTW(S[2:-], Q), DTW(S[2:-], Q[2:-]) }, where DBASE(S[1], Q[1]) = | S[1] − Q[1] |
Similarity Measure (4) • Time Warping Distance (3) • S = ⟨4,5,6,7,6,6⟩, Q = ⟨3,4,3⟩ • When using L1 as DBASE, DTW(S, Q) = 12 • Each cell of the cumulative distance table is | S[i] − Q[j] | + min(V1, V2, V3), where V1, V2, V3 are the three adjacent cells already computed. Cumulative table (rows S[i], columns Q[j]; DTW is the bottom-right cell):

S[i] \ Q[j] |  3   4   3
4           |  1   1   2
5           |  3   2   3
6           |  6   4   5
7           | 10   7   8
6           | 13   9  10
6           | 16  11  12

A dynamic-programming sketch follows.
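The same computation in code (a sketch; names are ours). It fills the cumulative table row by row exactly as in the example above:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Time warping distance by dynamic programming, with L1 as DBASE.
// g[i][j] = DTW between the first i elements of s and the first j of q.
double dtw(const std::vector<double>& s, const std::vector<double>& q) {
    const size_t n = s.size(), m = q.size();
    const double INF = 1e100;
    std::vector<std::vector<double>> g(n + 1, std::vector<double>(m + 1, INF));
    g[0][0] = 0.0;
    for (size_t i = 1; i <= n; ++i)
        for (size_t j = 1; j <= m; ++j)
            g[i][j] = std::fabs(s[i - 1] - q[j - 1]) +   // DBASE (L1)
                      std::min({g[i - 1][j],             // stretch q
                                g[i][j - 1],             // stretch s
                                g[i - 1][j - 1]});       // advance both
    return g[n][m];    // O(|s||q|) time and space
}
// From the slide: dtw({4,5,6,7,6,6}, {3,4,3}) == 12.
```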
False Alarm and False Dismissal • False alarms: candidates that are not actually similar to the query; minimize false alarms for efficiency. • False dismissals: similar sequences not retrieved by the index search; avoid false dismissals for correctness. [Figure: within the set of data sequences, the candidate set and the set of truly similar sequences overlap; candidates outside the overlap are false alarms, and similar sequences outside the candidate set are false dismissals.]
Contents • Introduction • Whole Sequence Searches • Subsequence Searches • Segment-Based Subsequence Searches • Multi-Dimensional Subsequence Searches • Conclusion
Problem Definition • Input • Set of data sequences {S} • Query sequence Q • Distance tolerance ε • Output • Set of data sequences whose distances to Q are within ε • Similarity Measure • Time warping distance function DTW • L∞ as the distance function for each element pair • If the distance of every element pair is within ε, then DTW(S,Q) ≤ ε.
Previous Approaches • Naïve Scan [Ber96] • Read every data sequence from the database • Apply the dynamic programming technique • For m data sequences with average length L, O(mL|Q|) • FastMap-Based Technique [Yi98] • Uses the FastMap technique for feature extraction • Maps features into multi-dimensional points • Uses Euclidean distance in index space for filtering • Cannot guarantee "no false dismissal"
Previous Approaches (2) • LB-Scan [Yi98] • Read every data sequence from the database • Apply the lower-bound distance function Dlb, which satisfies the following lower-bounding theorem: Dlb(S,Q) ≤ DTW(S,Q) • Faster than the original time warping distance function (O(|S|+|Q|) vs. O(|S||Q|)) • Guarantees no false dismissals • Still based on sequential scanning
Proposed Approach • Goal • No false dismissal • High query processing performance • Sketch • Extract a time-warping invariant feature vector • Build a multi-dimensional index • Use a lower-bound distance function for filtering
Proposed Approach (2) • Feature Extraction • F(S) = ⟨First(S), Last(S), Max(S), Min(S)⟩ • F(S) is invariant to the time warping transformation. • Distance Function for Feature Vectors: DFT(F(S), F(Q)) = max { | First(S) − First(Q) |, | Last(S) − Last(Q) |, | Max(S) − Max(Q) |, | Min(S) − Min(Q) | }
Proposed Approach (3) • Distance Function for Feature Vectors (2) • Satisfies the lower-bounding theorem: DFT(F(S), F(Q)) ≤ DTW(S,Q) • Tighter (more accurate) than the Dlb proposed in LB-Scan • Faster than Dlb (O(1) vs. O(|S|+|Q|)) • A sketch of F(S) and DFT follows.
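A minimal sketch of the feature vector and its lower-bound distance (names are ours). Warping only repeats elements, so the first, last, maximum, and minimum values never change:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Time-warping-invariant features <First, Last, Max, Min>.
struct Feature { double first, last, max, min; };

Feature extract(const std::vector<double>& s) {   // assumes s is non-empty
    return { s.front(), s.back(),
             *std::max_element(s.begin(), s.end()),
             *std::min_element(s.begin(), s.end()) };
}

// DFT: the max of the coordinate-wise differences; lower-bounds DTW.
double dft(const Feature& a, const Feature& b) {
    return std::max({ std::fabs(a.first - b.first),
                      std::fabs(a.last  - b.last),
                      std::fabs(a.max   - b.max),
                      std::fabs(a.min   - b.min) });
}
```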
Proposed Approach (4) • Indexing • Build a multi-dimensional index from the set of feature vectors • Index entry: ⟨First(S), Last(S), Max(S), Min(S), Identifier(S)⟩ • Query Processing • Extract the feature vector F(Q) • Perform a range query in index space to find the data points inside the query rectangle [First(Q) − ε, First(Q) + ε], [Last(Q) − ε, Last(Q) + ε], [Max(Q) − ε, Max(Q) + ε], [Min(Q) − ε, Min(Q) + ε] • Perform post-processing to discard false alarms (the rectangle test is sketched below)
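Continuing the sketch above: a feature point lies inside the query rectangle exactly when DFT(F(S), F(Q)) ≤ ε, so the range query never drops a true answer (this helper is ours, not the paper's API):

```cpp
#include <cmath>

// True iff F(S) falls inside the query rectangle around F(Q); by the
// lower-bounding theorem, sequences rejected here cannot be false dismissals.
bool in_query_rectangle(const Feature& s, const Feature& q, double eps) {
    return std::fabs(s.first - q.first) <= eps &&
           std::fabs(s.last  - q.last)  <= eps &&
           std::fabs(s.max   - q.max)   <= eps &&
           std::fabs(s.min   - q.min)   <= eps;
}
```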
Performance Evaluation • Implementation • Implemented in C++ on the UNIX operating system • An R-tree is used as the multi-dimensional index. • Experimental Setup • S&P 500 stock data set (m=545, L=232) • Random walk synthetic data set • Sun SPARC Ultra-5
Performance Evaluation (2) • Filtering Ratio • Better than LB-Scan
Performance Evaluation (3) • Query Processing Time • Faster than LB-Scan and Naïve-Scan
Contents • Introduction • Whole Sequence Searches • Subsequence Searches • Segment-Based Subsequence Searches • Multi-Dimensional Subsequence Searches • Conclusion
Problem Definition • Input • Set of data sequences {S} • Query sequence q • Distance tolerance ε • Output • Set of subsequences whose distances to q are within ε • Similarity Measure • Time warping distance function DTW • Any Lp metric as the distance function for element pairs
Previous Approaches • Naïve-Scan [Ber96] • Read every data subsequence from the database • Apply the dynamic programming technique • For m data sequences with average length L, O(mL²|q|)
Previous Approaches (2) • ST-Index [Fal94] • Assumes that the minimum query length w is known in advance • Places a sliding window of size w at every possible location • Extracts a feature vector from the elements inside the window • Maps each feature vector to a point and groups the trails of points into MBRs (Minimum Bounding Rectangles) • Uses Euclidean distance in index space for filtering • Cannot guarantee "no false dismissal"
Proposed Approach • Goal • No false dismissals • High performance • Support for diverse similarity measures • Sketch • Convert sequences into sequences of discrete symbols • Build a sparse suffix tree • Use a lower-bound distance function for filtering • Apply branch-pruning to reduce the search space
Proposed Approach (2) • Conversion • Generate categories from the distribution of element values • Maximum-entropy method • Equal-interval method • DISC method • Convert each element to the symbol of its corresponding category • Example (a conversion sketch follows): A = [0, 1.0], B = [1.1, 2.0], C = [2.1, 3.0], D = [3.1, 4.0]; S = ⟨1.3, 1.6, 2.9, 3.3, 1.5, 0.1⟩ → SC = ⟨B, B, C, D, B, A⟩
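A sketch of the conversion step (names are ours). The boundaries below mirror the equal-interval example; a real build would derive them with the maximum-entropy or DISC method:

```cpp
#include <string>
#include <vector>

struct Category { char symbol; double lo, hi; };  // value range per symbol

// Map each element to the symbol of the category containing it.
std::string to_symbols(const std::vector<double>& s,
                       const std::vector<Category>& cats) {
    std::string out;
    for (double v : s)
        for (const Category& c : cats)
            if (v >= c.lo && v <= c.hi) { out += c.symbol; break; }
    return out;
}
// to_symbols({1.3, 1.6, 2.9, 3.3, 1.5, 0.1},
//            {{'A',0,1.0}, {'B',1.1,2.0}, {'C',2.1,3.0}, {'D',3.1,4.0}})
// yields "BBCDBA", matching the example above.
```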
Proposed Approach (3) • Indexing • Extract suffixes from the sequences of discrete symbols. • Example: from S1C = ⟨A, B, B, A⟩, we extract four suffixes: ABBA, BBA, BA, A
Proposed Approach (4) • Indexing (2) • Build a suffix tree. • The suffix tree was originally proposed to retrieve substrings exactly matching a query string. • A suffix tree consists of nodes and edges. • Each suffix is represented by the path from the root node to a leaf node. • The labels on the path from the root to an internal node Ni represent the longest common prefix of the suffixes under Ni. • The suffix tree is built with O(mL) computation and space complexity.
Proposed Approach (4) • Indexing (3) • Example: the suffix tree built from S1C = ⟨A, B, B, A⟩ and S2C = ⟨A, B⟩. [Figure: six leaves, one per suffix (S1C[1:-], S1C[2:-], S1C[3:-], S1C[4:-], S2C[1:-], S2C[2:-]), with edges labeled by symbols and the terminator $.] A simplified construction sketch follows.
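For illustration only: a naive trie over all suffixes, which is O(L²) per sequence rather than the O(mL) suffix-tree construction cited above (structure and names are ours; the '$' terminator is implicit in the leaf records):

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Node {
    std::map<char, Node*> children;               // one symbol per edge
    std::vector<std::pair<int, int>> suffixes;    // (sequence id, offset)
};

// Insert every suffix of seq into the trie rooted at root.
void insert_suffixes(Node* root, const std::string& seq, int seq_id) {
    for (size_t i = 0; i < seq.size(); ++i) {
        Node* cur = root;
        for (size_t j = i; j < seq.size(); ++j) {
            Node*& child = cur->children[seq[j]];
            if (!child) child = new Node();       // ownership omitted in sketch
            cur = child;
        }
        cur->suffixes.push_back({seq_id, (int)i + 1});  // 1-based, as SC[i:-]
    }
}
// insert_suffixes(&root, "ABBA", 1); insert_suffixes(&root, "AB", 2);
// stores the six suffixes of the example above.
```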
Proposed Approach (5) • Query Processing • Given a query (q, ε): index searching traverses the suffix tree and produces candidates; post-processing checks the candidates against the data sequences and returns the answers.
Proposed Approach (6) • Index Searching • Visit each node of the suffix tree by depth-first traversal. • Build a lower-bound distance table for q and the edge labels. • Inspect the last columns of newly added rows to find candidates. • Apply branch-pruning to reduce the search space. • Branch-pruning theorem: if all columns of the last row of the distance table have values larger than the distance tolerance ε, adding more rows to this table cannot yield new values less than or equal to ε.
Proposed Approach (7) • Index Searching (2) • Example: q = ⟨2, 2, 1⟩, ε = 1.5 [Figure: distance tables accumulated along the tree edges. The row for edge label A is (1, 2, 2). Appending a row for edge label B gives (1, 1, 1.1): the last column is within ε, so the suffixes below are candidates. Appending a row for edge label D instead gives (2.1, 2.1, 4.1): every column exceeds ε, so the branch under D is pruned.] The row-extension step is sketched below.
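A sketch of that row-extension step (names and row layout are ours). Rows carry m+1 entries whose first entry is the DTW boundary column; the traversal starts from {0, ∞, …, ∞} at the root, and base holds the per-element lower bounds DBASE-LB between the edge's category and each query element (defined on the next slides):

```cpp
#include <algorithm>
#include <limits>
#include <vector>

const double INF = std::numeric_limits<double>::infinity();

// Append one cumulative row for the next edge symbol; base[j-1] is
// DBASE-LB(symbol, q[j]) for column j.
std::vector<double> extend_row(const std::vector<double>& prev,
                               const std::vector<double>& base) {
    std::vector<double> next(prev.size(), INF);   // next[0]: boundary column
    for (size_t j = 1; j < prev.size(); ++j)
        next[j] = base[j - 1] +
                  std::min({ prev[j], next[j - 1], prev[j - 1] });
    return next;
}

// Branch-pruning test: if every column exceeds eps, no further rows can
// drop back to eps or below, so the whole subtree is skipped.
bool can_prune(const std::vector<double>& row, double eps) {
    return std::all_of(row.begin() + 1, row.end(),
                       [eps](double d) { return d > eps; });
}
// Candidate test: row.back() <= eps sends the suffix to post-processing.
```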
Proposed Approach (8) • Lower-Bound Distance Function DTW-LB • For a category A with value range [A.min, A.max] and an element value v:

DBASE-LB(A, v) =
  0, if A.min ≤ v ≤ A.max (possible minimum distance = 0)
  (A.min − v)^P, if v < A.min (possible minimum distance = (A.min − v)^P)
  (v − A.max)^P, if v > A.max (possible minimum distance = (v − A.max)^P)
Proposed Approach (9) • Lower-Bound Distance Function DTW-LB (2) • Defined by the same recursion as DTW, with DBASE-LB in place of DBASE: DTW-LB(sC, q) = DBASE-LB(sC[1], q[1]) + min { DTW-LB(sC, q[2:-]), DTW-LB(sC[2:-], q), DTW-LB(sC[2:-], q[2:-]) } • Satisfies the lower-bounding theorem: DTW-LB(sC, q) ≤ DTW(s, q) • Computation complexity O(|sC||q|) • (DBASE-LB is sketched below)
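The per-element lower bound in code (a sketch with our names); it plugs into the extend_row step shown earlier as the base values:

```cpp
#include <cmath>

struct Range { double min, max; };   // value range of a category

// Smallest possible |x - v|^P over any element value x in the category's
// range: zero inside the range, boundary distance raised to P outside.
double dbase_lb(const Range& a, double v, double P) {
    if (v < a.min) return std::pow(a.min - v, P);
    if (v > a.max) return std::pow(v - a.max, P);
    return 0.0;
}
```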
Proposed Approach (10) • Computation Complexity • m is the number of data sequences; L is the average length of the data sequences. • The left expression is for index searching; the right expression is for post-processing. • RP (≥ 1) is the reduction factor from branch-pruning. • RD (≥ 1) is the reduction factor from sharing distance tables. • n is the number of subsequences requiring post-processing.
Proposed Approach (11) • Sparse Indexing • The index size is linear in the number of suffixes stored. • To reduce the index size, we build a sparse suffix tree (SST): we store the suffix SC[i:-] only if SC[i] ≠ SC[i−1]. • Compaction ratio C = (length of SC) / (number of stored suffixes) • Example • SC = ⟨A, A, A, A, C, B, B⟩ • Store only three suffixes (SC[1:-], SC[5:-], and SC[6:-]) • Compaction ratio C = 7/3
Proposed Approach (12) • Sparse Indexing (2) • When traversing the suffix tree, we need to find the non-stored suffixes and compute their distances to q. • Assume the first k elements of sC have the same value: then sC[1:-] is stored, but sC[i:-] (i = 2, 3, …, k) is not. • For the non-stored suffixes, we introduce another lower-bound distance function: DTW-LB2(sC[i:-], q) = DTW-LB(sC, q) − (i − 1) · DBASE-LB(sC[1], q[1]) • DTW-LB2 satisfies the lower-bounding theorem. • DTW-LB2 is O(1) when DTW-LB(sC, q) is given. (A one-line transcription follows.)
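A direct transcription of the formula above (names ours); given the stored suffix's bound, each non-stored suffix costs O(1):

```cpp
// Lower bound for the non-stored suffix sC[i:-], whose first i-1 elements
// all equal sC[1]; reuses the already-computed DTW-LB(sC, q).
double dtw_lb2(double dtw_lb_full,      // DTW-LB(sC, q)
               int i,                   // suffix offset, 2 <= i <= k
               double dbase_lb_first) { // DBASE-LB(sC[1], q[1])
    return dtw_lb_full - (i - 1) * dbase_lb_first;
}
```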
Proposed Approach (13) • Sparse Indexing (3) • With sparse indexing, the complexity becomes: • m is the number of data sequences. • L is the average length of the data sequences. • C is the compaction ratio. • n is the number of subsequences requiring post-processing. • RP (≥ 1) is the reduction factor from branch-pruning. • RD (≥ 1) is the reduction factor from sharing distance tables.
Performance Evaluation • Implementation • Implemented in C++ on the UNIX operating system • Experimental Setup • S&P 500 stock data set (m=545, L=232) • Random walk synthetic data set • Maximum-Entropy (ME) categorization • Disk-based suffix tree construction algorithm • Sun SPARC Ultra-5
Performance Evaluation (2) • Comparison with Naïve-Scan • Increasing distance tolerances • S&P 500 stock data set, |q| = 20
Performance Evaluation (3) • Scalability Test • Increasing average length of data sequences • Random-walk data set, |q| = 20, m = 200
Performance Evaluation (4) • Scalability Test (2) • Increasing total number of data sequences • Random-walk data set, |q| = 20, L = 200
Contents • Introduction • Whole Sequence Searches • Subsequence Searches • Segment-Based Subsequence Searches • Multi-Dimensional Subsequence Searches • Conclusion
Introduction • We extend the proposed subsequence searching method to very large sequence databases. • For the retrieval of similar subsequences with the time warping distance function: • Sequential scanning is O(mL²|q|). • The proposed method is O(mL²|q| / R) (R ≥ 1). • The quadratic dependence on L makes both search algorithms suffer severe performance degradation when L is very large. • For a database with long sequences, we need a new searching scheme linear in L.
SBASS • We propose a new searching scheme: the Segment-Based Subsequence Searching scheme (SBASS). • Sequences are divided into a series of piece-wise segments. • When a query sequence q with k segments is submitted, q is compared with the subsequences that consist of k consecutive data segments. • The lengths of the segments may differ. • SS denotes the segmented sequence of S: S = ⟨4,5,8,9,11,8,4,3⟩, |S| = 8; SS = ⟨4,5,8,9,11⟩, ⟨8,4,3⟩, |SS| = 2
SBASS (2) • With |SS| = 5 and |qS| = 2, only four subsequences of SS are compared with qS: ⟨SS[1],SS[2]⟩, ⟨SS[2],SS[3]⟩, ⟨SS[3],SS[4]⟩, ⟨SS[4],SS[5]⟩ [Figure: S divided into segments SS[1]–SS[5]; the query qS consists of segments qS[1] and qS[2].]
SBASS (3) • For the SBASS scheme, we define the piece-wise time warping distance function over the k corresponding segment pairs (where k = |qS| = |sS|); see the sketch below. • Sequential scanning for the SBASS scheme is O(mL|q|). • We introduce an indexing technique with O(mL|q| / R) (R ≥ 1).
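The slide does not reproduce the formula, so the sketch below assumes the natural reading: the piece-wise distance is the sum of ordinary DTW distances over corresponding segment pairs (reusing the dtw() routine sketched earlier):

```cpp
#include <vector>

double dtw(const std::vector<double>& s,
           const std::vector<double>& q);   // from the earlier sketch

// Piece-wise time warping distance between segmented sequences with
// k = |sS| = |qS| segments; an assumed definition, not quoted from the slides.
double piecewise_dtw(const std::vector<std::vector<double>>& sS,
                     const std::vector<std::vector<double>>& qS) {
    double total = 0.0;
    for (size_t i = 0; i < sS.size(); ++i)
        total += dtw(sS[i], qS[i]);
    return total;
}
```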
Sketch of Proposed Approach • Indexing • Convert sequences to segmented sequences. • Extract a feature vector from each segment. • Categorize the feature vectors. • Convert the segmented sequences to sequences of symbols. • Construct a suffix tree from the sequences of symbols. • Query Processing • Traverse the suffix tree to find candidates. • Discard false alarms in post-processing.
Segmentation • Approach (a sketch follows) • Divide at peak points. • Divide further if the maximum deviation from the interpolation line is too large. • Eliminate noise. • Compaction Ratio (C) = |S| / |SS| [Figure: a segment is split where the deviation from its interpolation line is too large; short noisy segments are eliminated.]
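A sketch of the first step only, cutting at peak points; the deviation-based subdivision and the noise elimination are omitted, and all names are ours:

```cpp
#include <vector>

// Split s into segments ending at local extrema (peak points).
std::vector<std::vector<double>> segment_at_peaks(const std::vector<double>& s) {
    std::vector<std::vector<double>> segments;
    std::vector<double> cur;
    for (size_t i = 0; i < s.size(); ++i) {
        cur.push_back(s[i]);
        bool peak = i > 0 && i + 1 < s.size() &&
                    ((s[i] > s[i - 1] && s[i] > s[i + 1]) ||   // local max
                     (s[i] < s[i - 1] && s[i] < s[i + 1]));    // local min
        if (peak) { segments.push_back(cur); cur.clear(); }
    }
    if (!cur.empty()) segments.push_back(cur);
    return segments;
}
// From the earlier slide: 4,5,8,9,11,8,4,3 peaks at 11, giving the
// segments <4,5,8,9,11> and <8,4,3>, so C = |S| / |SS| = 8 / 2 = 4.
```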