Mining Serial Episode Rules with Time Lags over Multiple Data Streams
Tung-Ying Lee, En Tzu Wang, Dept. of CS, National Tsing Hua Univ. (Taiwan)
Arbee L.P. Chen, Dept. of CS, National Chengchi Univ. (Taiwan)
DaWaK’08

Outline
• Introduction
• Related work
• Preliminaries
  • Support of a serial episode
  • Support/confidence of a serial episode rule
  • Data structure used in the algorithms
• Algorithms
  • LossyDL
  • TLT
• Experiments
• Conclusions

Introduction
• In many applications, data are generated in the form of continuous data streams
  • Continuously measuring the flow and occupancy of a road to assess its congestion produces data streams
  • Example: when roads A and B have heavy traffic, road C will most likely be congested 5 minutes later
• The flow and occupancy values coming from the roads are regarded as a multi-stream environment, from which serial episode rules are mined
• Serial episode rule with time lag (SER): X →[lag] Y

Related Work
• Finding episodes/episode rules from static time series data has been studied for decades
• Finding episodes over data streams
  • Serial episodes [SSDBM04]
  • Episodes [KDD07]
[Figure: examples of a serial episode, an episode, and a serial episode rule with its precursor and successor]

Preliminaries
• Environment: a centralized system collecting n synchronized data streams DS1, DS2, …, DSn
• n-tuple event: the set of items arriving from all streams at the same time point
• itemset: a subset of an n-tuple event
• serial episode: an ordered list of itemsets, e.g. serial episode (aA)(bB)

Example streams (each column is one n-tuple event; {gA} at time 5 is an itemset):
time: 1  2  3  4  5  6  7  8
DS1:  a  b  b  c  g  a  b  f
DS2:  A  B  S  G  A  B  A  F
DSn:  …

Preliminaries (cont.)
• Minimal occurrence: given a serial episode S, a time interval [a, b] is a minimal occurrence of S if
  • S occurs in [a, b]
  • S does not occur in any proper subinterval of [a, b]
• If (b − a + 1) ≤ T, where T is a time bound given by the user, [a, b] is valid
• MO(S): the set of all minimal occurrences of S
• Supp(S): the number of valid minimal occurrences of S
[Figure: example over DS1 and DS2 with time bound T = 3]

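
The definitions of MO(S) and Supp(S) can be sketched in a few lines of Python. This is only an offline illustration, not the streaming algorithm of the talk: the function names (`earliest_end`, `minimal_occurrences`, `support`) are made up for this sketch, and it assumes that consecutive itemsets of a serial episode must occur at strictly increasing time points. The data are the DS1/DS2 streams of the Preliminaries example.

```python
def earliest_end(events, episode, start):
    """Earliest 1-based time at which `episode` completes when matching
    may begin at time `start`; None if it never completes."""
    t = start
    for itemset in episode:
        # advance to the first event that contains this itemset
        while t <= len(events) and not itemset <= events[t - 1]:
            t += 1
        if t > len(events):
            return None
        t += 1  # assumption: the next itemset must occur strictly later
    return t - 1


def minimal_occurrences(events, episode):
    """All minimal occurrences [a, b] of `episode`, as 1-based tuples."""
    result = []
    for a in range(1, len(events) + 1):
        b = earliest_end(events, episode, a)
        if b is None:
            break  # no occurrence starts at or after time a
        later = earliest_end(events, episode, a + 1)
        # [a, b] is minimal iff starting one step later ends strictly later
        if later is None or later > b:
            result.append((a, b))
    return result


def support(events, episode, time_bound):
    """Supp(S): number of minimal occurrences with b - a + 1 <= T."""
    return sum(1 for a, b in minimal_occurrences(events, episode)
               if b - a + 1 <= time_bound)


# The two example streams; each zipped pair is one 2-tuple event.
events = [frozenset(pair) for pair in zip("abbcgabf", "ABSGABAF")]
print(minimal_occurrences(events, [{"a", "A"}, {"b", "B"}]))  # [(1, 2)]
print(minimal_occurrences(events, [{"a"}, {"b"}]))            # [(1, 2), (6, 7)]
```

With T = 3, both occurrences of (a)(b) span two time points and are therefore valid, so Supp((a)(b)) = 2 on this data.
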
Preliminaries (cont.)
• A SER is R: S1 →[Lag = L] S2
• Supp(R) = |{[a, b] | [a, b] ∈ MO(S1) ∧ [a, b] is valid ∧ ∃ [c, d] ∈ MO(S2) s.t. [c, d] is valid ∧ (c − a) = L}|
• Conf(R) = Supp(R)/Supp(S1)
[Figure: example over DS1 and DS2 with time bound T = 3]

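
Read literally, Supp(R) counts the valid minimal occurrences of S1 for which some valid minimal occurrence of S2 starts exactly L time points later. A minimal sketch, assuming the minimal-occurrence lists are already available as (start, end) tuples; the function names and the interval values are hypothetical and only illustrate the arithmetic:

```python
def is_valid(interval, time_bound):
    a, b = interval
    return b - a + 1 <= time_bound


def rule_support(mo_s1, mo_s2, lag, time_bound):
    """Supp(R) for R: S1 ->[Lag = lag] S2, from minimal-occurrence lists."""
    valid_starts_s2 = {c for (c, d) in mo_s2 if is_valid((c, d), time_bound)}
    return sum(1 for (a, b) in mo_s1
               if is_valid((a, b), time_bound) and a + lag in valid_starts_s2)


def rule_confidence(mo_s1, mo_s2, lag, time_bound):
    """Conf(R) = Supp(R) / Supp(S1); 0.0 when S1 has no valid occurrence."""
    supp_s1 = sum(1 for iv in mo_s1 if is_valid(iv, time_bound))
    if supp_s1 == 0:
        return 0.0
    return rule_support(mo_s1, mo_s2, lag, time_bound) / supp_s1


# Made-up minimal occurrences, not taken from the slides.
mo_s1 = [(1, 2), (6, 7), (10, 14)]   # (10, 14) spans 5 points: invalid for T = 3
mo_s2 = [(5, 5), (11, 12)]
print(rule_support(mo_s1, mo_s2, lag=4, time_bound=3))     # 1
print(rule_confidence(mo_s1, mo_s2, lag=4, time_bound=3))  # 0.5
```
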
Preliminaries (cont.)
• Problem formulation: given 4 parameters
  • the maximum time lag (Lmax)
  • the minimum support (minsup)
  • the minimum confidence (minconf)
  • the time bound (T)
• Find all SERs, e.g. R: S1 →[Lag = L] S2, satisfying
  • L ≤ Lmax
  • Supp(R) ≥ N × minsup (N: the number of received n-tuple events)
  • Conf(R) ≥ minconf
• Calculating supports for serial episodes and SERs must take T into account

Preliminaries (cont.)
• Using a prefix tree to keep serial episodes
• S: a serial episode, X: an item
  • S+X: X follows S
  • S+_X: X and the last itemset in S appear at the same time

Example prefix tree:
Level 0: Root
Level 1: A, B
Level 2 (children of A): _B, giving serial episode (AB); B, giving serial episode (A)(B)

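
One possible way to encode such a prefix tree is sketched below. The `Node` class, the `insert` helper, and the convention of storing the items of an itemset in sorted order are assumptions of this sketch, not details given on the slide. Each edge is labeled with an item and a flag telling whether it is an "_X" edge (same time as the previous item) or an "X" edge (a new, later itemset).

```python
class Node:
    """One prefix-tree node; the path from the root spells a serial episode."""

    def __init__(self):
        self.children = {}  # (same_time, item) -> Node
        self.support = 0    # per-episode bookkeeping would live here


def insert(root, episode):
    """Insert a serial episode (a list of itemsets) and return its node."""
    node = root
    for itemset in episode:
        for position, item in enumerate(sorted(itemset)):
            same_time = position > 0  # True encodes an "_X" edge
            node = node.children.setdefault((same_time, item), Node())
    return node


root = Node()
insert(root, [{"A", "B"}])    # (AB):   Root -> A -> _B
insert(root, [{"A"}, {"B"}])  # (A)(B): Root -> A -> B
```

Both episodes share the level-1 node A and diverge at level 2, which is what makes the tree a compact store for episodes with common prefixes.
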
LossyDL
• The concept of LossyDL: keeping the valid minimal occurrences of a serial episode for generating rules
• Example: at time point 3, a 2-tuple event (BC) arrives, T = 3
  • Each item in the current 2-tuple event needs to be processed (traversing the prefix tree in a bottom-up order)
  • Processing C can generate (B)(C): [2, 3] and (BC): [3, 3]
  • The last two minimal occurrences need to be checked
• Using Lossy Counting [VLDB02] with error parameter ε: whenever N ≡ 0 (mod ⌈1/ε⌉), the oldest minimal occurrence is removed
[Figure: prefix tree whose nodes (A, B) carry minimal-occurrence lists; interval labels shown include [1, 1], [2, 2], [3, 3], [1, 2], [2, 3], and [1, 3] (marked as not minimal)]

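
The Lossy Counting step can be sketched as follows. This covers only the periodic removal, not the bottom-up update of the tree; the dictionary layout, the function name `prune_oldest`, and the choice to drop an episode once its list becomes empty are assumptions made for illustration.

```python
import math
from collections import deque


def prune_oldest(occurrence_lists, n_events, epsilon):
    """When N is a multiple of ceil(1/epsilon), remove the oldest minimal
    occurrence of every episode; episodes left with none are dropped
    (assumption: this mirrors Lossy Counting's deletion of empty entries)."""
    width = math.ceil(1 / epsilon)
    if n_events % width != 0:
        return
    for episode in list(occurrence_lists):
        occurrences = occurrence_lists[episode]
        occurrences.popleft()
        if not occurrences:
            del occurrence_lists[episode]


# Hypothetical state after a few events; intervals are (start, end).
lists = {"(B)(C)": deque([(1, 2), (2, 3)]), "(BC)": deque([(3, 3)])}
prune_oldest(lists, n_events=3, epsilon=0.25)  # 3 is not a multiple of 4: no change
prune_oldest(lists, n_events=4, epsilon=0.25)  # bucket boundary: prune
print(lists)  # {'(B)(C)': deque([(2, 3)])}
```
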
LossyDL (Rule Generation)
• Mining SERs (ε: the Lossy Counting error parameter)
  • Every pair of serial episodes whose supports are at least (minsup − ε) × N is checked to see whether any of their minimal occurrences can be combined; Supp(R) can then be computed
  • Each R: S1 →[Lag = L] S2 is returned if
    • Supp(R) ≥ (minsup − ε) × N, and
    • (Supp(R) + εN)/Supp(S1) ≥ minconf

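
The two output conditions translate directly into a predicate. The function name and the numbers below are illustrative only; the slack terms compensate for the occurrences that Lossy Counting may have removed.

```python
def passes_lossydl_thresholds(supp_rule, supp_s1, n_events,
                              minsup, minconf, epsilon):
    """Return True if a rule meets both relaxed thresholds."""
    if supp_s1 == 0:
        return False  # confidence is undefined without a precursor occurrence
    enough_support = supp_rule >= (minsup - epsilon) * n_events
    enough_confidence = (supp_rule + epsilon * n_events) / supp_s1 >= minconf
    return enough_support and enough_confidence


# Made-up counts: N = 1000 events, minsup = 5%, epsilon = 0.5%, minconf = 60%.
print(passes_lossydl_thresholds(50, 90, 1000, 0.05, 0.6, 0.005))  # True
print(passes_lossydl_thresholds(40, 90, 1000, 0.05, 0.6, 0.005))  # False
```

In the first call the relaxed support threshold is about 45 and the adjusted confidence is 55/90 ≈ 0.61; in the second call the support of 40 already falls below the threshold.
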
TLT
• Many minimal occurrences are kept in LossyDL, but only the last two are used while updating
• Idea: keep supports instead of the minimal occurrence lists
• How can rules be generated without the minimal occurrence lists?
  • Answer: use the following observations to prune the insignificant rules
• Observations
  • For X →[L] (AB) and X →[L] A, obviously Supp(X →[L] A) ≥ Supp(X →[L] (AB)); X →[L] (AB) is not significant if X →[L] A fails either minsup or minconf
  • For (AB) →[L] (CD) and A →[L] C, obviously Supp(A →[L] C) ≥ Supp((AB) →[L] (CD)); (AB) →[L] (CD) is not significant if Supp(A →[L] C) < Supp(AB) × minconf

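
The second observation is a one-line bound on the confidence. Written out (a sketch that uses only Conf(R) = Supp(R)/Supp(S1) and the support inequality stated above; it needs `amsmath` for `\xrightarrow`):

```latex
\[
\mathrm{Conf}\bigl((AB)\xrightarrow{L}(CD)\bigr)
  = \frac{\mathrm{Supp}\bigl((AB)\xrightarrow{L}(CD)\bigr)}{\mathrm{Supp}(AB)}
  \le \frac{\mathrm{Supp}\bigl(A\xrightarrow{L}C\bigr)}{\mathrm{Supp}(AB)}
  < \mathit{minconf}
  \quad\text{whenever}\quad
  \mathrm{Supp}\bigl(A\xrightarrow{L}C\bigr) < \mathrm{Supp}(AB)\cdot\mathit{minconf}.
\]
```

So a rule whose reduced counterpart already falls short of Supp(AB) × minconf can be discarded without knowing its own support.
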
TLT (cont.)
• Observations (cont.)
  • Given a SER (A)(B) →[5] (CD) and T = 3
  • A →[1] B or A →[2] B, that is, A →[p] B with 0 < p < T (T − 1 types)
  • A →[1] B →[4] (CD), A →[2] B →[3] (CD), that is, A →[p] B →[L − p] (CD)
  • Supp(A →[p] B →[L − p] (CD)) ≤ min(Supp(A →[p] B), Supp(B →[L − p] C))
  • (A)(B) →[5] (CD) is not significant if Σ_p min(Supp(A →[p] B), Supp(B →[L − p] C)) < Supp((A)(B)) × minconf
  • These observations are used to prune insignificant rules
• Time lag table (TLT)
  • A →[L] B is a reduced SER if A and B are single items
  • For finding S1 →[Lmax] S2, the reduced SERs with a time lag of at most Lmax + T − 1 (from the first itemset of the precursor to the last itemset of the successor) are needed
  • Lmax + T − 1 Time Lag Tables are used to keep the supports of the reduced SERs

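
The upper bound for a two-itemset precursor can be sketched as below. The layout of the tables, a dictionary keyed by (first item, second item, lag), and the function names are assumptions of this sketch, and the support values are made up.

```python
def support_upper_bound(tables, a, b, c, lag, time_bound):
    """Upper bound on Supp((a)(b) ->[lag] (c ...)): sum over the T - 1
    possible gaps p between a and b of min(Supp(a ->[p] b), Supp(b ->[lag-p] c))."""
    return sum(
        min(tables.get((a, b, p), 0), tables.get((b, c, lag - p), 0))
        for p in range(1, time_bound)
    )


def can_prune(tables, a, b, c, lag, time_bound, supp_precursor, minconf):
    """True if the rule cannot reach minconf according to the bound."""
    bound = support_upper_bound(tables, a, b, c, lag, time_bound)
    return bound < supp_precursor * minconf


# Hypothetical reduced-SER supports for the example (A)(B) ->[5] (CD), T = 3.
tables = {("A", "B", 1): 10, ("B", "C", 4): 6,
          ("A", "B", 2): 3, ("B", "C", 3): 8}
print(support_upper_bound(tables, "A", "B", "C", lag=5, time_bound=3))  # 9
print(can_prune(tables, "A", "B", "C", 5, 3, supp_precursor=20, minconf=0.5))  # True
```

Here the bound is min(10, 6) + min(3, 8) = 9, which is below 20 × 0.5 = 10, so the candidate is pruned.
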
TLT (cont.)
• The support and the last two minimal occurrences of a serial episode are kept in the prefix tree
  • Keeping supports instead of minimal occurrence lists
  • Keeping the last two minimal occurrences for updating the supports
• Whenever N ≡ 0 (mod ⌈1/ε⌉), all supports are decreased by 1
• In addition, the last Lmax + T − 1 n-tuple events are kept for updating the Time Lag Tables

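
Keeping a window of recent events is enough to maintain the reduced-SER supports incrementally. A minimal sketch, assuming lags start at 1 and that the support of a reduced SER x →[lag] y is the number of time points t with x at t and y at t + lag; the class name and table layout are hypothetical, and the periodic decrement of supports is left out.

```python
from collections import Counter, deque


class TimeLagTables:
    """Supports of reduced SERs x ->[lag] y for lag = 1 .. Lmax + T - 1."""

    def __init__(self, max_lag, time_bound):
        # only the last Lmax + T - 1 events are needed for the updates
        self.window = deque(maxlen=max_lag + time_bound - 1)
        self.support = Counter()  # (x, y, lag) -> count

    def add_event(self, event):
        # the most recent stored event is 1 step back, the next one 2 steps, ...
        for lag, old_event in enumerate(reversed(self.window), start=1):
            for x in old_event:
                for y in event:
                    self.support[(x, y, lag)] += 1
        self.window.append(event)


# The DS1/DS2 example streams, with Lmax = 2 and T = 2 (window of 3 events).
tlt = TimeLagTables(max_lag=2, time_bound=2)
for pair in zip("abbcgabf", "ABSGABAF"):
    tlt.add_event(frozenset(pair))
print(tlt.support[("a", "b", 1)])  # 2  (a at 1 and 6, b at 2 and 7)
print(tlt.support[("a", "b", 2)])  # 1  (a at 1, b at 3)
```
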
TLT (Rule Generation)
• Mining SERs
  • Any two serial episodes with supports of at least (minsup − ε) × N form a candidate SER
  • A candidate SER is returned if it passes the pruning rules derived from the above observations

Experiments
• Two real datasets
  • PDOMEI: dryness and climate indices derived by experts, usually used to predict droughts
    • Four streams, 28 distinct items
  • Traffic: Twin Cities' traffic data near 50th St. during the first week of Feb. 2006
    • Three streams, 55 distinct items
• Parameter setting
  • ε = 0.1 × minsup
  • Lmax = 10

Conclusions
• We address the problem of finding significant serial episode rules with time lags over multiple data streams and propose two methods to solve it: TLT is more space-efficient, whereas LossyDL is more precise
• In the near future, we will combine the two methods into a hybrid method to investigate the balance between memory space and precision
• Moreover, we will try to extend the problem from finding serial episode rules to finding general episode rules