Mining Time-Series Databases Mohamed G. Elfeky
Introduction • A Time-Series Database is a database that contains data for each point in time. • Examples: • Weather Data • Stock Prices
What to Mine? • Full Periodic Patterns • Every point in time contributes to the cyclic behavior of the time-series for each period. • e.g., describing the weekly stock price pattern considering all the days of the week. • Partial Periodic Patterns • Describing the behavior of the time-series at some but not all points in time. • e.g., discovering that stock prices are high every Saturday and low every Tuesday.
Mining Partial Periodic Patterns • Problem Definition • Methods • Apriori • Max-Subpattern Hit Set • (Jiawei Han, Guozhu Dong, and Yiwen Yin, ICDE98)
Problem Definition • The time-series is: S = D1 D2 … Dn, where each Di is a set of features. • A pattern is: s = s1 … sp, where each si is a subset of the feature set L or the letter *. • |s| = p is the period of the pattern s. • The L-length of s is the number of si that are not *. • If s has L-length j, it is called a j-pattern. • A subpattern of s is: s’ = s’1 … s’p such that for each position i, s’i is either * or a subset of si.
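A minimal Python sketch of these definitions, assuming a pattern is stored as a tuple of sets in which the empty set stands for *. The helper names `parse`, `l_length`, and `is_subpattern` are illustrative and do not come from the slides.

```python
def parse(text):
    """Parse a string such as 'a*{a,c}de' into a tuple of frozensets.

    '*' becomes the empty set; '{a,c}' becomes the set {a, c}.
    """
    out, i = [], 0
    while i < len(text):
        if text[i] == "{":
            j = text.index("}", i)
            out.append(frozenset(text[i + 1:j].split(",")))
            i = j + 1
        else:
            out.append(frozenset() if text[i] == "*" else frozenset(text[i]))
            i += 1
    return tuple(out)


def l_length(s):
    """Number of positions of s that are not '*'."""
    return sum(1 for si in s if si)


def is_subpattern(sub, s):
    """True if every position of sub is '*' or a subset of the same position of s."""
    return len(sub) == len(s) and all(a <= b for a, b in zip(sub, s))


s = parse("a*{a,c}de")
print(len(s), l_length(s))                      # period and L-length
print(is_subpattern(parse("a*{a,c}**"), s))
print(is_subpattern(parse("ab***"), s))
```

Because the empty set is a subset of every set, the "* or subset" rule collapses into a single subset test per position.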
Problem Definition (Cont.) • Each segment of the form Di|s|+1 … Di|s|+|s| is called a period segment. • A period segment matches s if, for each position j, sj is either * or a subset of Di|s|+j. • The frequency count of s in a time-series S is the number of period segments of S that match s. • The confidence of s is its frequency count divided by the maximum number of periods of length |s| in S. • A pattern is called frequent if its confidence is not less than a minimum threshold.
Problem Definition (Example) • The pattern a*{a,c}de is of length 5 and of L-length 4, so it is called a 4-pattern. • The patterns a*{a,c}** and **cde are subpatterns of the above pattern. • In the series a{b,c}baebaced, the pattern a*b, whose period is 3, has frequency count 2: it matches the period segments a{b,c}b and aeb but not ace. Its confidence is 2/3, where 3 is the maximum number of periods of length 3.
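The frequency count and confidence from this example can be checked with a short Python sketch. It assumes the series and the pattern are both stored as tuples of sets (empty set for *); `parse`, `frequency_count`, and `confidence` are illustrative names, not part of the original material.

```python
from fractions import Fraction


def parse(text):
    """Parse 'a{b,c}b...' into a tuple of frozensets; '*' becomes the empty set."""
    out, i = [], 0
    while i < len(text):
        if text[i] == "{":
            j = text.index("}", i)
            out.append(frozenset(text[i + 1:j].split(",")))
            i = j + 1
        else:
            out.append(frozenset() if text[i] == "*" else frozenset(text[i]))
            i += 1
    return tuple(out)


def frequency_count(series, s):
    """Count the whole period segments of the series that match pattern s."""
    p = len(s)
    m = len(series) // p  # maximum number of periods of length p
    return sum(
        all(sj <= series[i * p + j] for j, sj in enumerate(s))
        for i in range(m)
    )


def confidence(series, s):
    """Frequency count divided by the maximum number of periods."""
    return Fraction(frequency_count(series, s), len(series) // len(s))


series = parse("a{b,c}baebaced")
pattern = parse("a*b")
print(frequency_count(series, pattern))  # 2
print(confidence(series, pattern))       # 2/3
```

The trailing `d` of the series does not fill a whole period of length 3, so it is ignored.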
Apriori Method • Apriori Property: Each subpattern of a frequent pattern of period p is itself a frequent pattern of period p. • Method: • Find F1, the set of frequent 1-patterns of period p. • Find all frequent i-patterns of period p, for i from 2 to p, based on the idea of Apriori, and terminate when the candidate i-pattern set is empty.
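The level-wise search can be sketched in Python as follows. This is a sketch under stated assumptions, not the authors' implementation: a pattern is represented as a set of (position, feature) pairs, and each level adds one such pair, which is one possible reading of how i is counted when a position holds several features. The function name `apriori_partial_periodic` is hypothetical.

```python
from itertools import combinations


def apriori_partial_periodic(series, p, min_conf):
    """series: list of feature sets, one per time point; p: the period."""
    # Whole period segments only.
    segs = [series[i:i + p] for i in range(0, len(series) - p + 1, p)]
    m = len(segs)

    def count(items):
        # Number of segments containing every (position, feature) pair.
        return sum(all(x in seg[pos] for pos, x in items) for seg in segs)

    # F1: frequent patterns made of a single (position, feature) pair.
    singles = {(pos, x) for seg in segs for pos, d in enumerate(seg) for x in d}
    level = {frozenset([it]) for it in singles if count([it]) >= min_conf * m}

    frequent, k = {}, 1
    while level:  # terminate when no candidate survives
        for pat in level:
            frequent[pat] = count(pat)
        k += 1
        # Join step: unions of two frequent (k-1)-patterns that have size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subpattern must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
        level = {c for c in candidates if count(c) >= min_conf * m}
    return frequent


series = [{"a"}, {"b", "c"}, {"b"}, {"a"}, {"e"}, {"b"}, {"a"}, {"c"}, {"e"}, {"d"}]
for pat, n in apriori_partial_periodic(series, 3, 0.6).items():
    print(sorted(pat), n)
```

Each level rescans the segments to count its candidates, which is the repeated-scan cost that the max-subpattern hit set method is designed to avoid.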
Max-Subpattern Hit Set Method • Definitions • Algorithm • Implementation Data Structure
Definitions • A candidate max-pattern Cmax is the maximal pattern which can be generated from F1 (the set of frequent 1-patterns). • Example: • If F1 = {a***, *b**, *c**, **d*}, • Then Cmax = a{b,c}d*
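A small Python sketch of how Cmax could be assembled from F1, assuming each frequent 1-pattern is given as a (position, feature) pair with 0-based positions. The names `candidate_max_pattern` and `show` are illustrative.

```python
def candidate_max_pattern(f1, p):
    """Merge frequent 1-patterns, given as (position, feature) pairs, into Cmax."""
    cmax = [set() for _ in range(p)]
    for pos, feature in f1:
        cmax[pos].add(feature)
    return tuple(frozenset(c) for c in cmax)


def show(pattern):
    """Render a tuple of sets in the slide notation, e.g. a{b,c}d*."""
    parts = []
    for s in pattern:
        if not s:
            parts.append("*")
        elif len(s) == 1:
            parts.append(next(iter(s)))
        else:
            parts.append("{" + ",".join(sorted(s)) + "}")
    return "".join(parts)


# F1 = {a***, *b**, *c**, **d*} written as (position, feature) pairs.
f1 = {(0, "a"), (1, "b"), (1, "c"), (2, "d")}
print(show(candidate_max_pattern(f1, 4)))  # a{b,c}d*
```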
Definitions (Cont.) • A subpattern of Cmax is hit in a period segment Si if it is the maximal subpattern of Cmax in Si. • Example: • For Cmax = a{b,c}d* and Si = a{b,c}ce, • The hit subpattern is: a{b,c}** • The hit set H is the set of all hit subpatterns of Cmax in S.
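With patterns and segments stored as tuples of sets (empty set for *), the hit subpattern is a position-wise intersection. The Python sketch below reproduces the example above; `parse`, `show`, and `hit_subpattern` are illustrative helper names.

```python
def parse(text):
    """Parse 'a{b,c}d*' into a tuple of frozensets; '*' becomes the empty set."""
    out, i = [], 0
    while i < len(text):
        if text[i] == "{":
            j = text.index("}", i)
            out.append(frozenset(text[i + 1:j].split(",")))
            i = j + 1
        else:
            out.append(frozenset() if text[i] == "*" else frozenset(text[i]))
            i += 1
    return tuple(out)


def show(pattern):
    """Render a tuple of sets back into the slide notation."""
    return "".join(
        "*" if not s else next(iter(s)) if len(s) == 1
        else "{" + ",".join(sorted(s)) + "}"
        for s in pattern
    )


def hit_subpattern(cmax, segment):
    """Maximal subpattern of cmax that matches the segment."""
    return tuple(c & d for c, d in zip(cmax, segment))


hit = hit_subpattern(parse("a{b,c}d*"), parse("a{b,c}ce"))
print(show(hit))  # a{b,c}**
```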
Algorithm • Scan S once to find F1 and form the candidate max-pattern Cmax. • Scan S again, and for each period segment, add its max-subpattern to the hit set with a count of 1 if it is not already there; otherwise increase its count by 1. • Derive the frequent patterns from the hit set.
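The three steps above can be sketched in Python as follows. This is an illustrative sketch, not the published implementation: the hit set is a plain `Counter` instead of a tree, and the derivation step enumerates every subpattern of Cmax by brute force, which is only practical for small Cmax. The function names are assumptions.

```python
from collections import Counter
from itertools import combinations


def max_subpattern_hit_set(series, p, min_conf):
    """series: list of feature sets. Returns (Cmax, hit set with counts, m)."""
    segs = [series[i:i + p] for i in range(0, len(series) - p + 1, p)]
    m = len(segs)
    # Scan 1: count each (position, feature) pair to find F1, then form Cmax.
    c1 = Counter((pos, x) for seg in segs for pos, d in enumerate(seg) for x in d)
    f1 = {item for item, n in c1.items() if n >= min_conf * m}
    cmax = tuple(frozenset(x for pos, x in f1 if pos == j) for j in range(p))
    # Scan 2: record the max-subpattern hit by each period segment.
    hits = Counter(tuple(c & d for c, d in zip(cmax, seg)) for seg in segs)
    return cmax, hits, m


def derive_frequent(cmax, hits, m, min_conf):
    """Frequent subpatterns of Cmax, counted from the hit set alone."""
    items = [(j, x) for j, c in enumerate(cmax) for x in sorted(c)]
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            pat = tuple(
                frozenset(x for j, x in combo if j == pos)
                for pos in range(len(cmax))
            )
            # A pattern's frequency is the total count of the hits that contain it.
            n = sum(cnt for hit, cnt in hits.items()
                    if all(a <= b for a, b in zip(pat, hit)))
            if n >= min_conf * m:
                result[pat] = n
    return result


series = [{"a"}, {"b", "c"}, {"b"}, {"a"}, {"e"}, {"b"}, {"a"}, {"c"}, {"e"}, {"d"}]
cmax, hits, m = max_subpattern_hit_set(series, 3, 0.6)
print(derive_frequent(cmax, hits, m, 0.6))
```

Only two passes touch the series; the derivation step works purely on the hit set.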
Implementation Data Structure: Max-Subpattern Tree • The root node is Cmax. • A child node is a subpattern of the parent node with one non-* letter missing. The link is labeled by this letter. • A node containing only 2 non-* letters has no children, since its children would be 1-patterns, which are already in F1. • Each node has a count field which registers its number of hits.
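The parent-child rule can be illustrated with a short Python sketch that lists the children of a node. It assumes patterns are tuples of sets (empty set for *) and labels each link with a (position, letter) pair instead of the bare letter, so that the same letter at two positions stays unambiguous; that labeling and the helper names are assumptions made for the example.

```python
def parse(text):
    """Parse 'a{b,c}d*' into a tuple of frozensets; '*' becomes the empty set."""
    out, i = [], 0
    while i < len(text):
        if text[i] == "{":
            j = text.index("}", i)
            out.append(frozenset(text[i + 1:j].split(",")))
            i = j + 1
        else:
            out.append(frozenset() if text[i] == "*" else frozenset(text[i]))
            i += 1
    return tuple(out)


def show(pattern):
    """Render a tuple of sets back into the slide notation."""
    return "".join(
        "*" if not s else next(iter(s)) if len(s) == 1
        else "{" + ",".join(sorted(s)) + "}"
        for s in pattern
    )


def children(pattern):
    """Map each link label (position, letter) to the child pattern it leads to."""
    letters = [(j, x) for j, s in enumerate(pattern) for x in sorted(s)]
    if len(letters) <= 2:
        return {}  # children would be 1-patterns, already known from F1
    return {
        (j, x): tuple(s - {x} if i == j else s for i, s in enumerate(pattern))
        for j, x in letters
    }


for label, child in children(parse("a{b,c}d*")).items():
    print(label, show(child))
```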
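Max-Subpattern Tree (Example) • [Figure: max-subpattern tree rooted at a{b,c}d* with count 10. Links are labeled with the missing letter (a, b, c, or d). Second-level nodes: acd*, abd*, a{b,c}**, *{b,c}d*; counts as laid out in the figure: 0, 50, 40, 32. Third-level nodes: *bd*, *{b,c}**, a*d*, ac**, ab**, *cd*; counts as laid out in the figure: 2, 18, 8, 0, 5, 19.]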
Max-Subpattern Tree (Construction) • Find w, the max-subpattern in the current segment. • Search for w in the tree, starting from the root and following the path that corresponds to the missing non-* letters in order. • If the node w is found, increase its count by 1. Otherwise, create a new node w (with count 1) and its missing ancestors on the followed path (with count 0).
Max-Subpattern Tree (Construction, Example) • Inserting w = *cd*: • Root a{b,c}d* (count 0) • Link a leads to *{b,c}d* (count 0) • Link b leads to *cd* (count 1)
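A minimal Python sketch of the insertion procedure, assuming patterns are tuples of sets (empty set for *), links are keyed by (position, letter), and "in order" means sorted by position and then by letter. The `Node` class and `insert` function are hypothetical names for illustration.

```python
def parse(text):
    """Parse 'a{b,c}d*' into a tuple of frozensets; '*' becomes the empty set."""
    out, i = [], 0
    while i < len(text):
        if text[i] == "{":
            j = text.index("}", i)
            out.append(frozenset(text[i + 1:j].split(",")))
            i = j + 1
        else:
            out.append(frozenset() if text[i] == "*" else frozenset(text[i]))
            i += 1
    return tuple(out)


class Node:
    def __init__(self, pattern, count=0):
        self.pattern = pattern
        self.count = count     # number of hits on exactly this pattern
        self.children = {}     # (position, letter) -> Node


def insert(root, w):
    """Register one hit of max-subpattern w, creating missing nodes on the way."""
    # Letters of Cmax that are absent from w, in a fixed order.
    missing = sorted(
        (j, x) for j, (c, s) in enumerate(zip(root.pattern, w)) for x in c - s
    )
    node = root
    for j, x in missing:
        if (j, x) not in node.children:
            child = tuple(s - {x} if i == j else s
                          for i, s in enumerate(node.pattern))
            node.children[(j, x)] = Node(child)  # new ancestors start at count 0
        node = node.children[(j, x)]
    node.count += 1
    return node


root = Node(parse("a{b,c}d*"))
leaf = insert(root, parse("*cd*"))
print(root.count, root.children[(0, "a")].count, leaf.count)  # 0 0 1
```

Following the missing letters in a fixed order means each subpattern is stored at exactly one node, even though it could be reached by removing the same letters in a different order.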
Max-Subpattern Tree (Traversal) • After the second scan, the tree will contain all the max subpatterns of the time-series. • Now the tree must be traversed to compute the confidence value of each subpattern.
Max-Subpattern Tree (Traversal) • The frequency count of each node is the sum of its own count and the counts of all its reachable ancestors. • For example, in the example tree: • The frequency count of *cd* is 78. • The frequency count of a*d* is 105.
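The two totals can be reproduced with a short Python sketch. The node-to-count assignment below is an assumption: it is inferred as the assignment of the figure's counts that yields 78 and 105, not read directly from the figure, and the four third-level nodes that do not affect these two sums are omitted. Instead of walking parent links, the sketch sums over every stored node that is a superpattern of the target, which visits the same set of reachable ancestors.

```python
def parse(text):
    """Parse 'a{b,c}d*' into a tuple of frozensets; '*' becomes the empty set."""
    out, i = [], 0
    while i < len(text):
        if text[i] == "{":
            j = text.index("}", i)
            out.append(frozenset(text[i + 1:j].split(",")))
            i = j + 1
        else:
            out.append(frozenset() if text[i] == "*" else frozenset(text[i]))
            i += 1
    return tuple(out)


# Assumed hit counts per node (inferred, see the note above).
counts = {
    "a{b,c}d*": 10,
    "*{b,c}d*": 0, "acd*": 50, "abd*": 40, "a{b,c}**": 32,
    "*cd*": 18, "a*d*": 5,
}


def frequency_count(counts, target):
    """Own count plus the counts of all stored superpatterns (ancestors)."""
    t = parse(target)
    return sum(
        n for node, n in counts.items()
        if all(a <= b for a, b in zip(t, parse(node)))
    )


print(frequency_count(counts, "*cd*"))  # 18 + 50 + 0 + 10 = 78
print(frequency_count(counts, "a*d*"))  # 5 + 50 + 40 + 10 = 105
```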