140 likes | 154 Views
A Sequential Pattern Query Language for Supporting Instant Data Mining for e-Services. Reza Sadri Carlo Zaniolo sadri@procom.com zaniolo@cs.ucla.edu Amir Zarkesh Jafar Adibi azarkesh@u4cast.com jabibi@u4cast.com.
E N D
A Sequential Pattern Query Language for Supporting Instant Data Mining for e-Services Reza Sadri Carlo Zaniolo sadri@procom.com zaniolo@cs.ucla.edu Amir Zarkesh Jafar Adibi azarkesh@u4cast.com jabibi@u4cast.com
Sequential Patterns in E-Commerce Applications • Applications: • Targeted Advertising • Instant analysis of stock market trends • Mining web access logs • Fraud detection • Requirements: • Expressive query language for finding complex patterns in database sequences • Efficient and scalable implementation: Query Optimization
State of The Art • ADT (e.g.. Informix Datablades): Not flexible enough, no Optimization • SEQ: Enhanced ADTs (e. g. sets and sequences) with their own query language • SRQL: Adding sequence algebra operators to relational model
SQL-TS • A query language for finding complex patterns in sequences • Minimal extension of SQL—only the from clause affected • A new Query optimization technique based on extensions of the Knuth, Morris & Pratt (KMP) string-search algorithm
Example in Mining Weblogs Consider a table: Sessions(SessNo, ClickTime, PageNo, PageType) That keeps track of pages visited in a session (sequence of requests from the same user) Possible page type: content product description purchase
Example in Mining Weblogs • SQL-TS queries to find the ideal scenarios: SELECT B.PageNo, C.ClickTime FROM Sessions CLUSTER BY SessNo SEQUENCE BY ClickTime AS (A, B, C) WHERE A.PageType=‘content’ AND B.PageType=‘product’ AND C.PageType=‘purchase’
Example in Fraud Detection Consider a Table log that keeps track of credit card transactions: Spending(Date, AccountNo, Amount) A surge in average spending might be sign of credit card theft.
Example in Fraud Detection Track 30-day average spending and when it increases considerably for two consecutive days: Select Z.AccountNo, Z.Date FROM Spending CLUSTER BY AccountNo SEQUENCE BY Date AS (*X, Y, Z) WHERE COUNT(*X)=30 AND Y.Amount > 5 * AVG(*X.Amount) AND Z.Amount > 5 * AVG(*X.Amount) Notice the Use of star and aggregates.
Optimized string search:KMP Consider text array text and pattern array p: i 1 2 3 4 5 6 7 8 9 10 11 text[i] a b a b a b c a b c a j 1 2 3 4 5 6 pattern[j] a b a b c a • After failing, use the information acquired so to: • - backtrack to shift(j), rather than i+1, and • - only check pattern values after next(j) But in SQL-TS we have general predicates & star patterns
i – j + 1 i – j + shift(j) + 1 i - j + shift(j) + next(j) i Input shift(j) + 1 shift(j) + next(j) 1 Pattern j next(j) j - shift(j) Shifted Pattern 1 shift(j) shift and next • Success for first j-1 elements of pattern. Failure for jth element (when input is at i) • Any shift less than shift(j) is guaranteed to lead to failure, • Match elements in the pattern starting at next(j)
Optimal Pattern Search (OPS) Search path for naive algorithm vs. optimized algorithm:
*U (less than 2% change) *T *V *Y *R *Z (less than 2% change) *W (less than 2% change) Optimizing Star Patterns Relaxed Double Bottom: • Only considering increases and decreases that are more than 2%
Conclusions • Disjunctive queries, partial ordered domains, aggregates also treated in this approach • Old applications—more power & flexibility than Datablades ADTs of commercial DBMSs • Ongoing implementation, by building on the user-defined aggregates supported in AXL.