Similarity Searches in Sequence Databases Sang-Hyun Park KMeD Research Group Computer Science Department University of California, Los Angeles
Contents • Introduction • Whole Sequence Searches • Subsequence Searches • Segment-Based Subsequence Searches • Multi-Dimensional Subsequence Searches • Conclusion
What is a Sequence? • A sequence is an ordered list of elements. S = ⟨14.3, 18.2, 22.0, 22.4, 19.5, 17.1, 15.8, 15.1⟩ • Sequences are a principal data format in many applications. [Figure: temperature (°C) sampled from 8AM to 10PM]
What is Similarity Search? • Similarity search finds sequences whose changing patterns are similar to that of a query sequence. • Example • Detect stocks with similar growth patterns • Find persons with similar voice clips • Find patients whose brain tumors have similar evolution patterns • Similarity search helps in clustering, data mining, and rule discovery.
Classification of Similarity Search • Similarity searches are classified as: • Whole sequence searches • Subsequence searches • Example • S = ⟨1,2,3⟩ • Subsequences(S) = { ⟨1⟩, ⟨2⟩, ⟨3⟩, ⟨1,2⟩, ⟨2,3⟩, ⟨1,2,3⟩ } • In whole sequence searches, the sequence S itself is compared with a query sequence Q. • In subsequence searches, every possible subsequence of S can be compared with a query sequence q.
Similarity Measure • Lp Distance Metric • L1 : Manhattan distance or city-block distance • L2 : Euclidean distance • L∞ : maximum distance over all element pairs • Requires that the two sequences have the same length (see the sketch below)
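A minimal sketch of these metrics (function and variable names are ours, not from the slides); both assume equal-length sequences:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// General Lp distance: the p-th root of the sum of |S[i] - Q[i]|^p.
double lp_distance(const std::vector<double>& s,
                   const std::vector<double>& q, double p) {
    double sum = 0.0;
    for (size_t i = 0; i < s.size(); ++i)          // requires |s| == |q|
        sum += std::pow(std::fabs(s[i] - q[i]), p);
    return std::pow(sum, 1.0 / p);
}

// L-infinity: the maximum distance over all element pairs.
double linf_distance(const std::vector<double>& s,
                     const std::vector<double>& q) {
    double m = 0.0;
    for (size_t i = 0; i < s.size(); ++i)
        m = std::max(m, std::fabs(s[i] - q[i]));
    return m;
}
```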
Similarity Measure (2) • Time Warping Distance • Originally introduced in the area of speech recognition • Allows sequences to be stretched along the time axis: ⟨3,5,6⟩ → ⟨3,3,5,6⟩ → ⟨3,3,3,5,6⟩ → ⟨3,3,3,5,5,6⟩ → … • Each element of a sequence can be mapped to one or more neighboring elements of another sequence. • Useful in applications where sequences may have different lengths or different sampling rates, e.g., Q = ⟨10, 15, 20⟩ and S = ⟨10, 15, 16, 20⟩
Similarity Measure (3) • Time Warping Distance (2) • Defined recursively (S[2:-] denotes S with its first element removed) • Computed by the dynamic programming technique in O(|S||Q|) time: DTW(S, Q) = DBASE(S[1], Q[1]) + min { DTW(S, Q[2:-]), DTW(S[2:-], Q), DTW(S[2:-], Q[2:-]) }, where DBASE(S[1], Q[1]) = | S[1] − Q[1] |
Similarity Measure (4) • Time Warping Distance (3) • S = ⟨4,5,6,7,6,6⟩, Q = ⟨3,4,3⟩ • When using L1 as DBASE, DTW(S, Q) = 12 • Each cell of the cumulative distance table is | S[i] − Q[j] | + min(V1, V2, V3), where V1, V2, V3 are the three adjacent cells already computed. Cumulative table (rows S[i], columns Q[j]; DTW is the bottom-right cell):

S[i] \ Q[j] |  3   4   3
4           |  1   1   2
5           |  3   2   3
6           |  6   4   5
7           | 10   7   8
6           | 13   9  10
6           | 16  11  12

A dynamic-programming sketch follows.
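The same computation in code (a sketch; names are ours). It fills the cumulative table row by row exactly as in the example above:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Time warping distance by dynamic programming, with L1 as DBASE.
// g[i][j] = DTW between the first i elements of s and the first j of q.
double dtw(const std::vector<double>& s, const std::vector<double>& q) {
    const size_t n = s.size(), m = q.size();
    const double INF = 1e100;
    std::vector<std::vector<double>> g(n + 1, std::vector<double>(m + 1, INF));
    g[0][0] = 0.0;
    for (size_t i = 1; i <= n; ++i)
        for (size_t j = 1; j <= m; ++j)
            g[i][j] = std::fabs(s[i - 1] - q[j - 1]) +   // DBASE (L1)
                      std::min({g[i - 1][j],             // stretch q
                                g[i][j - 1],             // stretch s
                                g[i - 1][j - 1]});       // advance both
    return g[n][m];    // O(|s||q|) time and space
}
// From the slide: dtw({4,5,6,7,6,6}, {3,4,3}) == 12.
```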
False Alarm and False Dismissal • False alarms: candidates that are not actually similar to the query; minimize false alarms for efficiency. • False dismissals: similar sequences not retrieved by the index search; avoid false dismissals for correctness. [Figure: within the set of data sequences, the candidate set and the set of truly similar sequences overlap; candidates outside the overlap are false alarms, and similar sequences outside the candidate set are false dismissals.]
Contents • Introduction • Whole Sequence Searches • Subsequence Searches • Segment-Based Subsequence Searches • Multi-Dimensional Subsequence Searches • Conclusion
Problem Definition • Input • Set of data sequences {S} • Query sequence Q • Distance tolerance ε • Output • Set of data sequences whose distances to Q are within ε • Similarity Measure • Time warping distance function DTW • L∞ as the distance function for each element pair • If the distance of every element pair is within ε, then DTW(S,Q) ≤ ε.
Previous Approaches • Naïve Scan [Ber96] • Read every data sequence from the database • Apply the dynamic programming technique • For m data sequences with average length L, O(mL|Q|) • FastMap-Based Technique [Yi98] • Uses the FastMap technique for feature extraction • Maps features into multi-dimensional points • Uses Euclidean distance in index space for filtering • Cannot guarantee "no false dismissal"
Previous Approaches (2) • LB-Scan [Yi98] • Read every data sequence from the database • Apply the lower-bound distance function Dlb, which satisfies the following lower-bounding theorem: Dlb(S,Q) ≤ DTW(S,Q) • Faster than the original time warping distance function (O(|S|+|Q|) vs. O(|S||Q|)) • Guarantees no false dismissals • Still based on sequential scanning
Proposed Approach • Goal • No false dismissal • High query processing performance • Sketch • Extract a time-warping invariant feature vector • Build a multi-dimensional index • Use a lower-bound distance function for filtering
Proposed Approach (2) • Feature Extraction • F(S) = ⟨First(S), Last(S), Max(S), Min(S)⟩ • F(S) is invariant to the time warping transformation. • Distance Function for Feature Vectors: DFT(F(S), F(Q)) = max { | First(S) − First(Q) |, | Last(S) − Last(Q) |, | Max(S) − Max(Q) |, | Min(S) − Min(Q) | }
Proposed Approach (3) • Distance Function for Feature Vectors (2) • Satisfies the lower-bounding theorem: DFT(F(S), F(Q)) ≤ DTW(S,Q) • Tighter (more accurate) than the Dlb proposed in LB-Scan • Faster than Dlb (O(1) vs. O(|S|+|Q|)) • A sketch of F(S) and DFT follows.
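A minimal sketch of the feature vector and its lower-bound distance (names are ours). Warping only repeats elements, so the first, last, maximum, and minimum values never change:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Time-warping-invariant features <First, Last, Max, Min>.
struct Feature { double first, last, max, min; };

Feature extract(const std::vector<double>& s) {   // assumes s is non-empty
    return { s.front(), s.back(),
             *std::max_element(s.begin(), s.end()),
             *std::min_element(s.begin(), s.end()) };
}

// DFT: the max of the coordinate-wise differences; lower-bounds DTW.
double dft(const Feature& a, const Feature& b) {
    return std::max({ std::fabs(a.first - b.first),
                      std::fabs(a.last  - b.last),
                      std::fabs(a.max   - b.max),
                      std::fabs(a.min   - b.min) });
}
```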
Proposed Approach (4) • Indexing • Build a multi-dimensional index from the set of feature vectors • Index entry: ⟨First(S), Last(S), Max(S), Min(S), Identifier(S)⟩ • Query Processing • Extract the feature vector F(Q) • Perform a range query in index space to find the data points inside the query rectangle [First(Q) − ε, First(Q) + ε], [Last(Q) − ε, Last(Q) + ε], [Max(Q) − ε, Max(Q) + ε], [Min(Q) − ε, Min(Q) + ε] • Perform post-processing to discard false alarms (the rectangle test is sketched below)
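Continuing the sketch above: a feature point lies inside the query rectangle exactly when DFT(F(S), F(Q)) ≤ ε, so the range query never drops a true answer (this helper is ours, not the paper's API):

```cpp
#include <cmath>

// True iff F(S) falls inside the query rectangle around F(Q); by the
// lower-bounding theorem, sequences rejected here cannot be false dismissals.
bool in_query_rectangle(const Feature& s, const Feature& q, double eps) {
    return std::fabs(s.first - q.first) <= eps &&
           std::fabs(s.last  - q.last)  <= eps &&
           std::fabs(s.max   - q.max)   <= eps &&
           std::fabs(s.min   - q.min)   <= eps;
}
```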
Performance Evaluation • Implementation • Implemented in C++ on the UNIX operating system • An R-tree is used as the multi-dimensional index. • Experimental Setup • S&P 500 stock data set (m=545, L=232) • Random walk synthetic data set • Sun SPARC Ultra-5
Performance Evaluation (2) • Filtering Ratio • Better than LB-Scan
Performance Evaluation (3) • Query Processing Time • Faster than LB-Scan and Naïve-Scan
Contents • Introduction • Whole Sequence Searches • Subsequence Searches • Segment-Based Subsequence Searches • Multi-Dimensional Subsequence Searches • Conclusion
Problem Definition • Input • Set of data sequences {S} • Query sequence q • Distance tolerance ε • Output • Set of subsequences whose distances to q are within ε • Similarity Measure • Time warping distance function DTW • Any Lp metric as the distance function for element pairs
Previous Approaches • Naïve-Scan [Ber96] • Read every data subsequence from the database • Apply the dynamic programming technique • For m data sequences with average length L, O(mL²|q|)
Previous Approaches (2) • ST-Index [Fal94] • Assumes that the minimum query length w is known in advance • Places a sliding window of size w at every possible location • Extracts a feature vector from the elements inside the window • Maps each feature vector to a point and groups the trails of points into MBRs (Minimum Bounding Rectangles) • Uses Euclidean distance in index space for filtering • Cannot guarantee "no false dismissal"
Proposed Approach • Goal • No false dismissals • High performance • Support for diverse similarity measures • Sketch • Convert sequences into sequences of discrete symbols • Build a sparse suffix tree • Use a lower-bound distance function for filtering • Apply branch-pruning to reduce the search space
Proposed Approach (2) • Conversion • Generate categories from the distribution of element values • Maximum-entropy method • Equal-interval method • DISC method • Convert each element to the symbol of its corresponding category • Example (a conversion sketch follows): A = [0, 1.0], B = [1.1, 2.0], C = [2.1, 3.0], D = [3.1, 4.0]; S = ⟨1.3, 1.6, 2.9, 3.3, 1.5, 0.1⟩ → SC = ⟨B, B, C, D, B, A⟩
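A sketch of the conversion step (names are ours). The boundaries below mirror the equal-interval example; a real build would derive them with the maximum-entropy or DISC method:

```cpp
#include <string>
#include <vector>

struct Category { char symbol; double lo, hi; };  // value range per symbol

// Map each element to the symbol of the category containing it.
std::string to_symbols(const std::vector<double>& s,
                       const std::vector<Category>& cats) {
    std::string out;
    for (double v : s)
        for (const Category& c : cats)
            if (v >= c.lo && v <= c.hi) { out += c.symbol; break; }
    return out;
}
// to_symbols({1.3, 1.6, 2.9, 3.3, 1.5, 0.1},
//            {{'A',0,1.0}, {'B',1.1,2.0}, {'C',2.1,3.0}, {'D',3.1,4.0}})
// yields "BBCDBA", matching the example above.
```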
Proposed Approach (3) • Indexing • Extract suffixes from the sequences of discrete symbols. • Example: from S1C = ⟨A, B, B, A⟩, we extract four suffixes: ABBA, BBA, BA, A
Proposed Approach (4) • Indexing (2) • Build a suffix tree. • The suffix tree was originally proposed to retrieve substrings exactly matching a query string. • A suffix tree consists of nodes and edges. • Each suffix is represented by the path from the root node to a leaf node. • The labels on the path from the root to an internal node Ni represent the longest common prefix of the suffixes under Ni. • The suffix tree is built with O(mL) computation and space complexity.
Proposed Approach (4) • Indexing (3) • Example: the suffix tree built from S1C = ⟨A, B, B, A⟩ and S2C = ⟨A, B⟩. [Figure: six leaves, one per suffix (S1C[1:-], S1C[2:-], S1C[3:-], S1C[4:-], S2C[1:-], S2C[2:-]), with edges labeled by symbols and the terminator $.] A simplified construction sketch follows.
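For illustration only: a naive trie over all suffixes, which is O(L²) per sequence rather than the O(mL) suffix-tree construction cited above (structure and names are ours; the '$' terminator is implicit in the leaf records):

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Node {
    std::map<char, Node*> children;               // one symbol per edge
    std::vector<std::pair<int, int>> suffixes;    // (sequence id, offset)
};

// Insert every suffix of seq into the trie rooted at root.
void insert_suffixes(Node* root, const std::string& seq, int seq_id) {
    for (size_t i = 0; i < seq.size(); ++i) {
        Node* cur = root;
        for (size_t j = i; j < seq.size(); ++j) {
            Node*& child = cur->children[seq[j]];
            if (!child) child = new Node();       // ownership omitted in sketch
            cur = child;
        }
        cur->suffixes.push_back({seq_id, (int)i + 1});  // 1-based, as SC[i:-]
    }
}
// insert_suffixes(&root, "ABBA", 1); insert_suffixes(&root, "AB", 2);
// stores the six suffixes of the example above.
```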
Proposed Approach (5) • Query Processing • Given a query (q, ε): index searching traverses the suffix tree and produces candidates; post-processing checks the candidates against the data sequences and returns the answers.
Proposed Approach (6) • Index Searching • Visit each node of the suffix tree by depth-first traversal. • Build a lower-bound distance table for q and the edge labels. • Inspect the last columns of newly added rows to find candidates. • Apply branch-pruning to reduce the search space. • Branch-pruning theorem: if all columns of the last row of the distance table have values larger than the distance tolerance ε, adding more rows to this table cannot yield new values less than or equal to ε.
Proposed Approach (7) • Index Searching (2) • Example: q = ⟨2, 2, 1⟩, ε = 1.5 [Figure: distance tables accumulated along the tree edges. The row for edge label A is (1, 2, 2). Appending a row for edge label B gives (1, 1, 1.1): the last column is within ε, so the suffixes below are candidates. Appending a row for edge label D instead gives (2.1, 2.1, 4.1): every column exceeds ε, so the branch under D is pruned.] The row-extension step is sketched below.
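A sketch of that row-extension step (names and row layout are ours). Rows carry m+1 entries whose first entry is the DTW boundary column; the traversal starts from {0, ∞, …, ∞} at the root, and base holds the per-element lower bounds DBASE-LB between the edge's category and each query element (defined on the next slides):

```cpp
#include <algorithm>
#include <limits>
#include <vector>

const double INF = std::numeric_limits<double>::infinity();

// Append one cumulative row for the next edge symbol; base[j-1] is
// DBASE-LB(symbol, q[j]) for column j.
std::vector<double> extend_row(const std::vector<double>& prev,
                               const std::vector<double>& base) {
    std::vector<double> next(prev.size(), INF);   // next[0]: boundary column
    for (size_t j = 1; j < prev.size(); ++j)
        next[j] = base[j - 1] +
                  std::min({ prev[j], next[j - 1], prev[j - 1] });
    return next;
}

// Branch-pruning test: if every column exceeds eps, no further rows can
// drop back to eps or below, so the whole subtree is skipped.
bool can_prune(const std::vector<double>& row, double eps) {
    return std::all_of(row.begin() + 1, row.end(),
                       [eps](double d) { return d > eps; });
}
// Candidate test: row.back() <= eps sends the suffix to post-processing.
```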
Proposed Approach (8) • Lower-Bound Distance Function DTW-LB • For a category A with value range [A.min, A.max] and an element value v:

DBASE-LB(A, v) =
  0, if A.min ≤ v ≤ A.max (possible minimum distance = 0)
  (A.min − v)^P, if v < A.min (possible minimum distance = (A.min − v)^P)
  (v − A.max)^P, if v > A.max (possible minimum distance = (v − A.max)^P)
Proposed Approach (9) • Lower-Bound Distance Function DTW-LB (2) • Defined by the same recursion as DTW, with DBASE-LB in place of DBASE: DTW-LB(sC, q) = DBASE-LB(sC[1], q[1]) + min { DTW-LB(sC, q[2:-]), DTW-LB(sC[2:-], q), DTW-LB(sC[2:-], q[2:-]) } • Satisfies the lower-bounding theorem: DTW-LB(sC, q) ≤ DTW(s, q) • Computation complexity O(|sC||q|) • (DBASE-LB is sketched below)
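The per-element lower bound in code (a sketch with our names); it plugs into the extend_row step shown earlier as the base values:

```cpp
#include <cmath>

struct Range { double min, max; };   // value range of a category

// Smallest possible |x - v|^P over any element value x in the category's
// range: zero inside the range, boundary distance raised to P outside.
double dbase_lb(const Range& a, double v, double P) {
    if (v < a.min) return std::pow(a.min - v, P);
    if (v > a.max) return std::pow(v - a.max, P);
    return 0.0;
}
```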
Proposed Approach (10) • Computation Complexity • m is the number of data sequences; L is the average length of the data sequences. • The left expression is for index searching; the right expression is for post-processing. • RP (≥ 1) is the reduction factor from branch-pruning. • RD (≥ 1) is the reduction factor from sharing distance tables. • n is the number of subsequences requiring post-processing.
Proposed Approach (11) • Sparse Indexing • The index size is linear in the number of suffixes stored. • To reduce the index size, we build a sparse suffix tree (SST): we store the suffix SC[i:-] only if SC[i] ≠ SC[i−1]. • Compaction ratio C = (length of SC) / (number of stored suffixes) • Example • SC = ⟨A, A, A, A, C, B, B⟩ • Store only three suffixes (SC[1:-], SC[5:-], and SC[6:-]) • Compaction ratio C = 7/3
Proposed Approach (12) • Sparse Indexing (2) • When traversing the suffix tree, we need to find the non-stored suffixes and compute their distances to q. • Assume the first k elements of sC have the same value: then sC[1:-] is stored, but sC[i:-] (i = 2, 3, …, k) is not. • For the non-stored suffixes, we introduce another lower-bound distance function: DTW-LB2(sC[i:-], q) = DTW-LB(sC, q) − (i − 1) · DBASE-LB(sC[1], q[1]) • DTW-LB2 satisfies the lower-bounding theorem. • DTW-LB2 is O(1) when DTW-LB(sC, q) is given. (A one-line transcription follows.)
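A direct transcription of the formula above (names ours); given the stored suffix's bound, each non-stored suffix costs O(1):

```cpp
// Lower bound for the non-stored suffix sC[i:-], whose first i-1 elements
// all equal sC[1]; reuses the already-computed DTW-LB(sC, q).
double dtw_lb2(double dtw_lb_full,      // DTW-LB(sC, q)
               int i,                   // suffix offset, 2 <= i <= k
               double dbase_lb_first) { // DBASE-LB(sC[1], q[1])
    return dtw_lb_full - (i - 1) * dbase_lb_first;
}
```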
Proposed Approach (13) • Sparse Indexing (3) • With sparse indexing, the complexity becomes: • m is the number of data sequences. • L is the average length of the data sequences. • C is the compaction ratio. • n is the number of subsequences requiring post-processing. • RP (≥ 1) is the reduction factor from branch-pruning. • RD (≥ 1) is the reduction factor from sharing distance tables.
Performance Evaluation • Implementation • Implemented in C++ on the UNIX operating system • Experimental Setup • S&P 500 stock data set (m=545, L=232) • Random walk synthetic data set • Maximum-Entropy (ME) categorization • Disk-based suffix tree construction algorithm • Sun SPARC Ultra-5
Performance Evaluation (2) • Comparison with Naïve-Scan • Increasing distance tolerances • S&P 500 stock data set, |q| = 20
Performance Evaluation (3) • Scalability Test • Increasing average length of data sequences • Random-walk data set, |q| = 20, m = 200
Performance Evaluation (4) • Scalability Test (2) • Increasing total number of data sequences • Random-walk data set, |q| = 20, L = 200
Contents • Introduction • Whole Sequence Searches • Subsequence Searches • Segment-Based Subsequence Searches • Multi-Dimensional Subsequence Searches • Conclusion
Introduction • We extend the proposed subsequence searching method to very large sequence databases. • For the retrieval of similar subsequences with the time warping distance function: • Sequential scanning is O(mL²|q|). • The proposed method is O(mL²|q| / R) (R ≥ 1). • The quadratic dependence on L makes both search algorithms suffer severe performance degradation when L is very large. • For a database with long sequences, we need a new searching scheme linear in L.
SBASS • We propose a new searching scheme: the Segment-Based Subsequence Searching scheme (SBASS). • Sequences are divided into a series of piece-wise segments. • When a query sequence q with k segments is submitted, q is compared with the subsequences that consist of k consecutive data segments. • The lengths of the segments may differ. • SS denotes the segmented sequence of S: S = ⟨4,5,8,9,11,8,4,3⟩, |S| = 8; SS = ⟨4,5,8,9,11⟩, ⟨8,4,3⟩, |SS| = 2
SBASS (2) • With |SS| = 5 and |qS| = 2, only four subsequences of SS are compared with qS: ⟨SS[1],SS[2]⟩, ⟨SS[2],SS[3]⟩, ⟨SS[3],SS[4]⟩, ⟨SS[4],SS[5]⟩ [Figure: S divided into segments SS[1]–SS[5]; the query qS consists of segments qS[1] and qS[2].]
SBASS (3) • For the SBASS scheme, we define the piece-wise time warping distance function over the k corresponding segment pairs (where k = |qS| = |sS|); see the sketch below. • Sequential scanning for the SBASS scheme is O(mL|q|). • We introduce an indexing technique with O(mL|q| / R) (R ≥ 1).
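The slide does not reproduce the formula, so the sketch below assumes the natural reading: the piece-wise distance is the sum of ordinary DTW distances over corresponding segment pairs (reusing the dtw() routine sketched earlier):

```cpp
#include <vector>

double dtw(const std::vector<double>& s,
           const std::vector<double>& q);   // from the earlier sketch

// Piece-wise time warping distance between segmented sequences with
// k = |sS| = |qS| segments; an assumed definition, not quoted from the slides.
double piecewise_dtw(const std::vector<std::vector<double>>& sS,
                     const std::vector<std::vector<double>>& qS) {
    double total = 0.0;
    for (size_t i = 0; i < sS.size(); ++i)
        total += dtw(sS[i], qS[i]);
    return total;
}
```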
Sketch of Proposed Approach • Indexing • Convert sequences to segmented sequences. • Extract a feature vector from each segment. • Categorize the feature vectors. • Convert the segmented sequences to sequences of symbols. • Construct a suffix tree from the sequences of symbols. • Query Processing • Traverse the suffix tree to find candidates. • Discard false alarms in post-processing.
Segmentation • Approach (a sketch follows) • Divide at peak points. • Divide further if the maximum deviation from the interpolation line is too large. • Eliminate noise. • Compaction Ratio (C) = |S| / |SS| [Figure: a segment is split where the deviation from its interpolation line is too large; short noisy segments are eliminated.]
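A sketch of the first step only, cutting at peak points; the deviation-based subdivision and the noise elimination are omitted, and all names are ours:

```cpp
#include <vector>

// Split s into segments ending at local extrema (peak points).
std::vector<std::vector<double>> segment_at_peaks(const std::vector<double>& s) {
    std::vector<std::vector<double>> segments;
    std::vector<double> cur;
    for (size_t i = 0; i < s.size(); ++i) {
        cur.push_back(s[i]);
        bool peak = i > 0 && i + 1 < s.size() &&
                    ((s[i] > s[i - 1] && s[i] > s[i + 1]) ||   // local max
                     (s[i] < s[i - 1] && s[i] < s[i + 1]));    // local min
        if (peak) { segments.push_back(cur); cur.clear(); }
    }
    if (!cur.empty()) segments.push_back(cur);
    return segments;
}
// From the earlier slide: 4,5,8,9,11,8,4,3 peaks at 11, giving the
// segments <4,5,8,9,11> and <8,4,3>, so C = |S| / |SS| = 8 / 2 = 4.
```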