CEMiner – An Efficient Algorithm for Mining Closed Patterns from Time Interval-based Data. Yi-Cheng Chen, Wen-Chih Peng and Suh-Yin Lee. ICDM 2011.
Outline • Motivation • Preliminaries • Endpoint representation • CEMiner algorithm • Experimental results • Conclusion
Motivation • Existing studies only focus on mining closed sequential patterns from time point-based data.
Cont. • In this paper, we discuss and design an efficient method to discover closed temporal patterns from interval-based data. • Three contributions: • We simplify the processing of complex relations to only three: “before”, “after” and “equal”. • We propose the endpoint representation. • We present a novel algorithm, CEMiner (Closed Endpoint Temporal Miner).
Preliminaries • Definition 1. Event interval and event sequence • Let E = {e1, e2, …, ek} be the set of event symbols, e.g., {A, B, C, D, E}. • A triplet (ei, si, fi), where si is the start time and fi the finish time of event ei, is an event interval, e.g., (A, 2, 7). • An event sequence is a series of event interval triplets, e.g., <(A, 2, 7), (B, 5, 10), …, (E, 18, 20)>.
Cont. • Definition 2. Temporal database • A database DB = {r1, r2, …, rm}, in which each record ri consists of a sequence-id (SID) and an event, is called a temporal database.
Endpoint representation • When describing relationships among more than three events, Allen’s temporal logic may suffer from several problems. • A suitable representation is very important for describing a temporal pattern. • A new expression, the endpoint representation, is proposed to address the ambiguity and scalability problems.
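Cont. • Definition 3. Endpoint sequence • Event sequence: q = <(A, 2, 7), (B, 5, 10), (C, 5, 12), (D, 16, 22), (E, 18, 20)> • Tq = {2, 7, 5, 10, 5, 12, 16, 22, 18, 20} • Endpoint sequence: qe = <2, 5, 5, 7, 10, 12, 16, 18, 20, 22> • Endpoint representation: <A+ (B+ C+) A- B- C- D+ E+ E- D->, where X+ and X- denote the starting and finishing endpoints of event X, and endpoints that occur at the same time are grouped in parentheses.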
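The conversion in Definition 3 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `endpoint_representation` and `format_groups` are hypothetical, and the alphabetical ordering of endpoints that share a time stamp is an assumption made here for determinism.

```python
from itertools import groupby


def endpoint_representation(event_sequence):
    """Turn (symbol, start, finish) triplets into time-ordered endpoint groups."""
    endpoints = []
    for symbol, start, finish in event_sequence:
        endpoints.append((start, symbol + "+"))   # starting endpoint
        endpoints.append((finish, symbol + "-"))  # finishing endpoint
    # Sort by time; ties are broken by label (an assumption of this sketch).
    endpoints.sort()
    # Endpoints with the same time stamp form one group.
    return [[label for _, label in group]
            for _, group in groupby(endpoints, key=lambda e: e[0])]


def format_groups(groups):
    """Render groups as text, with simultaneous endpoints in parentheses."""
    parts = [g[0] if len(g) == 1 else "(" + " ".join(g) + ")" for g in groups]
    return "<" + " ".join(parts) + ">"


q = [("A", 2, 7), ("B", 5, 10), ("C", 5, 12), ("D", 16, 22), ("E", 18, 20)]
print(format_groups(endpoint_representation(q)))
```

For the event sequence q above, B and C both start at time 5, so their starting endpoints land in the same group.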
Cont. • The endpoint representation has several benefits : • Scalability • Nonambiguity • Simplicity
CEMiner algorithm • CEMiner (standing for Closed Endpoint temporal Miner) utilizes the arrangement of endpoints to accomplish closed temporal pattern mining. • Closure Checking • Subsequence and supersequence • Ex. Given two sequences 𝛼 = <A, B, C> and 𝛽 = <A, D, B, C, E>, we say 𝛼 is a subsequence of 𝛽, and 𝛽 is a supersequence of 𝛼.
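The subsequence relation in the example can be checked with a short Python sketch. The helper name `is_subsequence` is hypothetical, and sequences are treated as flat lists of symbols (no simultaneous groups), which is a simplification.

```python
def is_subsequence(alpha, beta):
    """Return True if alpha's items occur in beta in the same order."""
    remaining = iter(beta)
    # Each membership test consumes the iterator up to the match,
    # so later items must be found after earlier ones.
    return all(item in remaining for item in alpha)


alpha = ["A", "B", "C"]
beta = ["A", "D", "B", "C", "E"]
print(is_subsequence(alpha, beta))  # alpha is a subsequence of beta
print(is_subsequence(beta, alpha))  # beta is not a subsequence of alpha
```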
Cont. • Definition 4. Closed temporal pattern • CTP = {𝛼 | (𝛼 ∈ TP) ∧ (∄𝛽 ∈ TP such that (𝛼 ⊂ 𝛽) ∧ (support(𝛼) = support(𝛽)))} • Given two sequences 𝛼 and 𝛽: • If 𝛼 is a closed temporal pattern, then • 𝛼 is a temporal pattern and • there does not exist a proper supersequence 𝛽 with support(𝛼) = support(𝛽).
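Definition 4 can be read as a filter over an already mined pattern set TP. The sketch below is a naive post-filter for illustration only; it is not how CEMiner works, since CEMiner checks closure during mining. The names `closed_patterns` and `tp` are hypothetical, and the supports in `tp` are hand-made example values.

```python
def is_subsequence(alpha, beta):
    """Return True if alpha's items occur in beta in the same order."""
    remaining = iter(beta)
    return all(item in remaining for item in alpha)


def closed_patterns(patterns):
    """Keep patterns with no proper supersequence of equal support.

    `patterns` maps a pattern (tuple of symbols) to its support.
    """
    closed = {}
    for alpha, sup_a in patterns.items():
        absorbed = any(
            len(beta) > len(alpha) and sup_b == sup_a and is_subsequence(alpha, beta)
            for beta, sup_b in patterns.items()
        )
        if not absorbed:
            closed[alpha] = sup_a
    return closed


# Hand-made supports, for illustration only.
tp = {("A",): 3, ("B",): 2, ("A", "B"): 2, ("A", "B", "C"): 2}
print(closed_patterns(tp))
```

Here <B> and <A, B> are dropped because <A, B, C> is a supersequence with the same support, while <A> survives because its support is higher.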
Cont. • Ex. • min_sup = 2 • The endpoint sequence <> is a temporal pattern but not a closed temporal pattern, • because <> ⊂ <> and both have support 2.
Cont. • Closure Checking • To verify a new closed temporal pattern p, we must check whether p is a subsequence or supersequence of an existing temporal pattern p’ and whether the projected databases of p and p’ are equal. • This paper borrows BI-Directional Extension (BIDE) [WH04] to check pattern closure. • Forward-extension • Backward-extension
Cont. • Definition 5. Forward-extension and backward-extension • If 𝛼 = <> is non-closed, there must exist at least one endpoint x that can be used to extend 𝛼 to a new endpoint sequence 𝛼’ with support(𝛼) = support(𝛼’). • 𝛼 can be extended in five ways: (1) 𝛼’ = 〈〉 (2) 𝛼’ = 〈〉 • In cases (1) and (2), 𝛼’ is a forward-extension sequence. (3) 𝛼’ = 〈〉 (4) 𝛼’ = 〈〉 (5) 𝛼’ = 〈〉 • In cases (3) to (5), 𝛼’ is a backward-extension sequence.
Cont. • If there exists neither a forward-extension endpoint nor a backward-extension endpoint, 𝛼 must be a closed endpoint sequence. • CEMiner checks closure in two directions: • Forward directional checking • Backward directional checking
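The idea behind bi-directional checking can be illustrated with a brute-force sketch: try to insert one extra endpoint at every position of the pattern and see whether the support stays the same. This is not the efficient procedure used by CEMiner or BIDE; it only shows what a forward or backward extension means. The names `support` and `find_extension`, the flat treatment of endpoint sequences (no simultaneous groups), and the toy database are all assumptions of this sketch.

```python
def is_subsequence(alpha, beta):
    """Return True if alpha's items occur in beta in the same order."""
    remaining = iter(beta)
    return all(item in remaining for item in alpha)


def support(pattern, db):
    """Number of database sequences that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in db)


def find_extension(pattern, db):
    """Return ('forward' | 'backward', extended pattern), or None if closed."""
    base = support(pattern, db)
    symbols = sorted({x for seq in db for x in seq})
    # Try appending at the end first (forward), then every earlier slot (backward).
    for pos in range(len(pattern), -1, -1):
        for x in symbols:
            extended = pattern[:pos] + (x,) + pattern[pos:]
            if support(extended, db) == base:
                kind = "forward" if pos == len(pattern) else "backward"
                return kind, extended
    return None


# Toy database of flat endpoint sequences (illustrative values only).
db = [("A+", "B+", "A-", "B-"), ("A+", "B+", "B-", "A-"), ("C+", "C-")]
print(find_extension(("A+", "A-"), db))
print(find_extension(("C+", "C-"), db))
```

In this toy database, B+ lies between A+ and A- in both supporting sequences, so <A+, A-> has a backward extension and is not closed, while <C+, C-> has no extension at all.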
Cont. • Definition. First instance of a prefix sequence • Ex. • The first instance of the prefix sequence AB in sequence CAABC is CAAB.
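The example can be reproduced with a greedy left-to-right scan. This is a small sketch for single-symbol sequences; the function name `first_instance` is hypothetical.

```python
def first_instance(prefix, sequence):
    """Shortest leading part of `sequence` that contains `prefix` as a subsequence.

    Returns None if the sequence does not contain the prefix.
    """
    pos = 0
    for item in prefix:
        try:
            # Earliest occurrence of the item at or after the current position.
            pos = sequence.index(item, pos) + 1
        except ValueError:
            return None
    return sequence[:pos]


print(first_instance("AB", "CAABC"))
```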
Cont. • Definition 6. The i-th last-in-first appearance • Ex. • Sequence: 〈ABAB(AB)(AB)〉 • p = 〈〉 • 1. The last-in-first appearance w.r.t. prefix p in the sequence, case (1): 1 ≤ i < n, with n = 4 and i = 2; first instance: 〈ABAB(AB)(AB)〉 • 2. The last-in-first appearance w.r.t. prefix p in the sequence, case (2): i = n = 4; first instance: 〈ABAB(AB)(AB)〉
Cont. • Definition 7. The i-th semi-maximum period • Ex. • Sequence: 〈ABAB(AB)(AB)〉 • p = 〈〉 • 1. Semi-maximum period of prefix p in the sequence, case (1): i = 1; the piece before the last-in-first appearance: 〈ABAB(AB)(AB)〉 • 2. Semi-maximum period of prefix p in the sequence, case (2): 1 < i ≤ n, with n = 4 and i = 2; a. end of the first instance of 〈〉: 〈AB〉 b. the 2nd last-in-first appearance w.r.t. p: B; 〈ABAB(AB)(AB)〉
Cont. • EbackScan search • Given a prefix endpoint sequence, if there exists an i, 1 ≤ i ≤ n, and an endpoint x that appears in each of the i-th semi-maximum periods of the prefix in the database, • we can derive a new endpoint sequence and stop growing the prefix endpoint sequence. • Ex. • Prefix sequence p = <A, C> • B appears in the 2nd semi-maximum period of the prefix p in the database • We can derive a new prefix sequence p’ = <A, B, C>
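The check described above can be sketched for plain single-symbol sequences, following the last-in-first appearance and semi-maximum period definitions as they are usually stated for BIDE. This is an illustrative reading, not the authors' EbackScan: all function names are hypothetical, simultaneous endpoint groups are ignored, and the two-sequence database is made up to mirror the <A, C> example.

```python
def first_instance_ends(prefix, seq):
    """ends[i] = index where the first instance of prefix[:i+1] ends, or None."""
    ends, pos = [], 0
    for item in prefix:
        idx = seq.find(item, pos)
        if idx < 0:
            return None
        ends.append(idx)
        pos = idx + 1
    return ends


def last_in_first(prefix, seq, ends):
    """lif[i] = index of the (i+1)-th last-in-first appearance of the prefix."""
    lif = [0] * len(prefix)
    bound = ends[-1] + 1  # stay inside the first instance of the whole prefix
    for i in range(len(prefix) - 1, -1, -1):
        lif[i] = seq.rfind(prefix[i], 0, bound)  # last occurrence before bound
        bound = lif[i]
    return lif


def semi_maximum_periods(prefix, seq):
    """One period per prefix position, or None if seq lacks the prefix."""
    ends = first_instance_ends(prefix, seq)
    if ends is None:
        return None
    lif = last_in_first(prefix, seq, ends)
    periods = []
    for i in range(len(prefix)):
        start = 0 if i == 0 else ends[i - 1] + 1
        periods.append(seq[start:lif[i]])
    return periods


def backscan(prefix, db):
    """Return (i, symbols) if some symbol occurs in every i-th period, else None."""
    all_periods = [semi_maximum_periods(prefix, s) for s in db]
    all_periods = [p for p in all_periods if p is not None]
    if not all_periods:
        return None
    for i in range(len(prefix)):
        common = set.intersection(*(set(p[i]) for p in all_periods))
        if common:
            return i + 1, sorted(common)
    return None


print(semi_maximum_periods("AB", "CAABC"))
print(backscan("AC", ["ABC", "DABBC"]))  # made-up database
```

In the made-up database, B occurs in the 2nd semi-maximum period of prefix <A, C> in both sequences, so growing <A, C> can stop in favor of <A, B, C>.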
CEMiner Algorithm • We use three pruning strategies to reduce the search space efficiently and effectively. • (1) pre-pruning • (2) post-pruning • (3) pair-pruning
CEMiner Algo. • Pair-pruning: • If the endpoint is a starting endpoint, we can omit the closure checking, • because starting and finishing endpoints always occur in pairs in an endpoint sequence.
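One way to read the pair-pruning rule is sketched below: a prefix that has just been extended with a starting endpoint still has an open interval, so it cannot yet be a complete pattern and its closure check can be skipped. The helper names `unmatched_starts` and `needs_closure_check` and the "+"/"-" label convention are assumptions of this sketch, not part of the original slides.

```python
def unmatched_starts(endpoint_seq):
    """Symbols whose starting endpoint appeared without its finishing endpoint."""
    open_symbols = []
    for endpoint in endpoint_seq:
        symbol, sign = endpoint[:-1], endpoint[-1]
        if sign == "+":
            open_symbols.append(symbol)
        elif symbol in open_symbols:
            open_symbols.remove(symbol)
    return open_symbols


def needs_closure_check(prefix):
    """Skip the closure check when the newest endpoint is a starting endpoint."""
    return not prefix[-1].endswith("+")


print(unmatched_starts(["A+", "B+", "A-"]))   # B is still open
print(needs_closure_check(["A+", "B+"]))       # last endpoint is a start
print(needs_closure_check(["A+", "B+", "A-"]))
```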
CEMiner Algo. • Ex. • Prefix p = <> • Endpoint B+ is a backward-extension endpoint of p, • so we can stop growing p.
CEMiner Algo. • Pre-pruning: • If y is a finishing endpoint and it has a corresponding starting endpoint in.
CEMiner Algo. • Post-pruning: • A finishing endpoint is called significant if it has a corresponding starting endpoint in the projected postfix or in.
Conclusion • We develop an efficient algorithm, CEMiner, to discover closed temporal patterns without candidate generation, based on the proposed endpoint representation. • The algorithm further employs three pruning methods to reduce the search space effectively.