220 likes | 244 Views
Finding Maximal Frequent Itemsets over Online Data Streams Adaptively. Daesu Lee,Wonsuk Lee IEEE,ICDM ’ 05 報告者:林靜怡 2006/05/05. Introduction. Confine the memory usage of a data mining process estDec - prefix tree estDec+ - CP-tree. CP-tree. Compressed-prefix tree
E N D
Finding Maximal Frequent Itemsets over Online Data Streams Adaptively Daesu Lee,Wonsuk Lee IEEE,ICDM’05 報告者:林靜怡 2006/05/05
Introduction • Confine the memory usage of a data mining process • estDec - prefix tree • estDec+ - CP-tree
CP-tree • Compressed-prefix tree • Effectively used in finding frequent or maximal frequent itemsets • a node of a CP-tree can maintain the information of several itemsets together • size of a CP-tree can be flexibly controlled by merging or splitting nodes
CP-tree(Conti.) • Two consecutive nodes by a prefix tree are merged in a CP-tree when the current support difference between their corresponding itemsets is less than or equal to a merging gap threshold δ (0,1)
Definition • :a prefix tree • S :a subtree of • :the itemset represented by the root of S • :an itemset represented by a node of S • δ:merging gap threshold • |S|:number of nodes in S • :the total number of transactions in the current data stream • subtree S is a mergeable subtree and compressed into a node of a CP-tree that is equivalent to
CP-node structure • A node m of CP-tree maintains four entries m(τ, π, , ) • Item-listτ: - m.τ[1]:root node of S,the shortest itemset of the node m and denoted by - m.τ[|S|]:the right-most leaf node in the lowest level of S,the longest itemset of the node m and denoted by
CP-node structure • Parent-index list π - y’s parent is x - • Largest count - the current count of the shortest itemset • Smallest count - the current count of the longest itemset
CP-tree |10-9|/10 = 0.1 0.1 <= 0.2 |10-5|/10 = 0.5 0.5 > 0.2
Merged-count Estimation • Given the item-list of a node m in a CP-tree • :the itemset represented by (1<=j<=n) • the current counts of the remaining itemsets can be estimated by a formula • f(m, j):a count estimation function that can model the count in terms of the counts and of the shortest and longest itemsets
CP-tree Maintenance • :the parent node of m • Node-merge - - a new significant itemset e is identified by the inserting-count estimation process, so that a new node for the itemset needs to be inserted as a child of the node m. • Node-split -
Finding Maximal Frequent Itemsets • estDec+ Method • Adaptive Memory Utilization
estDec+ Method • Parameter updating phase • Count updating & node restructuring phase • Itemset insertion phase • Maximal frequent itemset selection phase
Parameter updating phase The total number of transactions in the current data stream is updated. • Count updating & node restructuring phase :m’s parent prune - split - merge -
Itemset insertion phase • insert any new significant itemset which has not been maintained in • any insignificant item whose current support is less than is filtered out in the transaction => • Traversed to find out any new significant itemset induced by the items in
Maximal frequent itemset selection phase • retrieves all the currently frequent or maximal frequent itemsets by traversing the monitoring tree • Force-pruning - all the nodes whose largest counts are less than - performed periodically
Adaptive Memory Utilization • In order to minimize the estimation error caused by the merged-count estimation, it is very important to keep the value ofδas small as possible. • The size of a CP-tree is inversely proportional to the value of δ. • the value of should be dynamically changed in the parameter update phase
Adaptive Memory Utilization • :the upper bound of desired memory usage • :the lower bound of desired memory usage • :Confined memory space • :current memory usage
Experiment • Data sets:T10.I4.D1000K and WebLog • 1.8 GHz Pentium PC • 512 MB Memory • Linux 7.3 • = 0.001 • Count estimation function f(m,j)