Efficient and Effective Itemset Pattern Summarization: Regression-based Approaches
Ruoming Jin, Kent State University
Joint work with Muad Abu-Ata, Yang Xiang, and Ning Ruan (KSU)
Problem Definition
• Given a large collection of frequent itemsets and their supports, how can we concisely represent them?
• Coverage criterion
  • The Spanning Set Approach [F. Afrati, A. Gionis, and H. Mannila, Approximating a collection of frequent sets, KDD'04]
• Frequency criterion
  • The Profile-based Approach [X. Yan, H. Cheng, J. Han, and D. Xin, Summarizing itemset patterns: a profile-based approach, KDD'05]
  • The Markov Random Field Approach [C. Wang and S. Parthasarathy, Summarizing itemset patterns using probabilistic models, KDD'06]
Frequency Criterion
• The restoration function of a set of itemsets S is a function $\hat{f}: S \to [0,1]$ that maps each itemset $\alpha \in S$ to an estimate of its true support $f(\alpha)$.
• The restoration error measures how far the estimates deviate from the true supports. We use the 2-norm in this study:
  $E(S) = \left( \sum_{\alpha \in S} \big(\hat{f}(\alpha) - f(\alpha)\big)^2 \right)^{1/2}$
Probabilistic Restoration Function
• Applying the independence probabilistic model to a set of itemsets S, the restoration function takes the product form
  $\hat{f}(\alpha) = p(S) \cdot \prod_{a \in \alpha} p(a)$,
  where $p(S)$ is a scaling parameter for the set and $p(a)$ is the probability assigned to item $a$.
• An example: $\hat{f}(\{a,c,d\}) = p(S) \cdot p(a) \cdot p(c) \cdot p(d)$.
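To make the model concrete, here is a minimal Python sketch of the independence-model restoration function and the 2-norm restoration error; the itemsets, supports, and parameter values below are illustrative stand-ins, not figures from the paper.

```python
import math

# True supports of some frequent itemsets (illustrative values, not from the paper)
true_support = {
    frozenset("a"): 0.60,
    frozenset("c"): 0.50,
    frozenset("d"): 0.40,
    frozenset("acd"): 0.15,
}

def restore(itemset, p_S, p_item):
    """Independence-model estimate: f_hat(alpha) = p(S) * prod_{a in alpha} p(a)."""
    est = p_S
    for item in itemset:
        est *= p_item[item]
    return est

def restoration_error(itemsets, p_S, p_item, support):
    """2-norm restoration error E(S) over a collection of itemsets."""
    return math.sqrt(sum((restore(s, p_S, p_item) - support[s]) ** 2
                         for s in itemsets))

# Hypothetical parameter values p(S), p(a), p(c), p(d)
p_S = 1.0
p_item = {"a": 0.6, "c": 0.5, "d": 0.4}

print(restore(frozenset("acd"), p_S, p_item))                     # 0.12
print(restoration_error(true_support, p_S, p_item, true_support))  # 0.03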
Problem 1: Optimal Parameters
• What are the optimal parameters $p(S), p(a), p(c), p(d)$ minimizing the restoration error
  $E(S) = \left( \sum_{\alpha \in S} \big(p(S) \prod_{a \in \alpha} p(a) - f(\alpha)\big)^2 \right)^{1/2}$?
Non-Linear Regression
• For each itemset $\alpha \in S$ we introduce independent indicator variables $x_a \in \{0,1\}$, with $x_a = 1$ iff item $a \in \alpha$, so the model becomes $\hat{f}(\alpha) = p(S) \cdot \prod_a p(a)^{x_a}$.
• We have |S| data points, one per itemset, and fitting the parameters is a non-linear regression problem.
Linear Regression Approximation
• Taking logarithms linearizes the model: $\log \hat{f}(\alpha) = \log p(S) + \sum_a x_a \log p(a)$.
• Using a Taylor expansion, we show that the restoration error from linear regression is very close to the error obtained by the non-linear regression!
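A minimal sketch of the linear-regression approximation under the log-transform above: one binary indicator column per item plus an intercept for $\log p(S)$, fitted by ordinary least squares. The toy itemset collection and supports are made up for illustration.

```python
import numpy as np

items = ["a", "c", "d"]
# Toy collection of itemsets with their true supports (illustrative values)
data = [
    (frozenset("a"), 0.60),
    (frozenset("c"), 0.50),
    (frozenset("d"), 0.40),
    (frozenset("ac"), 0.32),
    (frozenset("ad"), 0.25),
    (frozenset("cd"), 0.22),
    (frozenset("acd"), 0.15),
]

# Design matrix: one row per itemset, columns = [1, x_a, x_c, x_d],
# fitting log f(alpha) = log p(S) + sum_a x_a * log p(a)
X = np.array([[1.0] + [1.0 if it in s else 0.0 for it in items] for s, _ in data])
y = np.log(np.array([supp for _, supp in data]))

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
p_S = np.exp(coef[0])
p = dict(zip(items, np.exp(coef[1:])))
print(p_S, p)  # fitted p(S), p(a), p(c), p(d)

# Restoration error (2-norm) of the fitted model
est = np.exp(X @ coef)
true = np.array([supp for _, supp in data])
print(np.sqrt(np.sum((est - true) ** 2)))
```

Exponentiating the fitted coefficients recovers p(S) and the item probabilities. Because least squares minimizes error in log-space rather than the original 2-norm objective, the Taylor-expansion argument on the slide is what justifies treating this fit as a close proxy for the non-linear regression.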
Problem 2: Optimal Partition
• To reduce the restoration error, we adopt a partition strategy:
• Partition the entire collection of frequent itemsets into K disjoint subsets, and build a restoration function for each subset
• How can we optimally partition a set of itemsets into K disjoint subsets so that the total restoration error is minimized?
Our Approaches
• The optimal partition problem is NP-hard
• Two heuristic algorithms:
  • K-Regression
  • Tree Regression
K-Regression
• A k-means-type clustering procedure:
1. Randomly partition the set of itemsets S into K subsets
2. [Regression Step] Apply regression to find the optimal parameters on each subset
3. [Re-assignment Step] Assign each itemset to the subset that minimizes its restoration error, based on the optimal parameters discovered in Step 2
4. Repeat Steps 2 and 3 until the total restoration error no longer decreases or the improvement is small
• Just like k-means, K-Regression is guaranteed to converge!
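A minimal sketch of the K-Regression loop, reusing the log-linear least-squares fit from the previous sketch. The helper names fit_subset and point_error are hypothetical, and the convergence test here is the simple "no itemset changed subsets" check.

```python
import random
import numpy as np

def fit_subset(itemsets, support, items):
    """Regression step: fit log f = log p(S) + sum_a x_a log p(a) on one subset."""
    X = np.array([[1.0] + [1.0 if it in s else 0.0 for it in items] for s in itemsets])
    y = np.log(np.array([support[s] for s in itemsets]))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def point_error(s, coef, support, items):
    """Squared restoration error of a single itemset under a fitted model."""
    x = np.array([1.0] + [1.0 if it in s else 0.0 for it in items])
    return (np.exp(x @ coef) - support[s]) ** 2

def k_regression(itemsets, support, items, K, iters=20):
    # Step 1: random initial partition into K subsets
    parts = [[] for _ in range(K)]
    for s in itemsets:
        parts[random.randrange(K)].append(s)
    for _ in range(iters):
        # Step 2 (regression): fit parameters on each non-empty subset
        coefs = [fit_subset(p, support, items) for p in parts if p]
        # Step 3 (re-assignment): move each itemset to its best-fitting subset
        new_parts = [[] for _ in coefs]
        for s in itemsets:
            best = min(range(len(coefs)),
                       key=lambda k: point_error(s, coefs[k], support, items))
            new_parts[best].append(s)
        if new_parts == parts:  # converged: no itemset changed subsets
            break
        parts = new_parts
    return parts
```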
Tree Regression
• Example: S = {{a},{b},{c},{d},{a,b},{a,c},{b,c},{a,d},{c,d},{a,b,c},{a,b,d},{a,c,d}}
• Regression is used to find the optimal parameters for each subset of itemsets produced by the tree
Tree Regression Construction
• A decision-tree-style construction algorithm
• Question 1: How do we find the K subsets of itemsets?
• Answer: Maintain a queue of the current leaf nodes, and always pick the leaf node with the maximal average restoration error to split
• Question 2: How do we find the optimal split?
• Answer: Choose the split that maximally reduces the total restoration error, i.e., $\min\, E(S_1) + E(S_2)$, equivalently $\max\, E(S) - E(S_1) - E(S_2)$
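A minimal sketch of one greedy split step. The slide does not spell out how candidate splits are generated, so splitting on the presence or absence of a single item is an assumption made for illustration; the selection rule itself follows the slide: keep the split minimizing E(S1) + E(S2), which maximizes the reduction E(S) − E(S1) − E(S2).

```python
import numpy as np

def subset_error(itemsets, support, items):
    """Fit the log-linear model on a subset and return its squared restoration error."""
    X = np.array([[1.0] + [1.0 if it in s else 0.0 for it in items] for s in itemsets])
    y = np.log(np.array([support[s] for s in itemsets]))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    true = np.array([support[s] for s in itemsets])
    return float(np.sum((np.exp(X @ coef) - true) ** 2))

def best_split(node, support, items):
    """Try splitting the node on each item (assumed candidate-split scheme);
    keep the split minimizing E(S1) + E(S2)."""
    parent = subset_error(node, support, items)
    best = (None, parent)  # (splitting item, resulting total error)
    for it in items:
        s1 = [s for s in node if it in s]
        s2 = [s for s in node if it not in s]
        if not s1 or not s2:
            continue  # degenerate split, skip
        child_err = (subset_error(s1, support, items) +
                     subset_error(s2, support, items))
        if child_err < best[1]:
            best = (it, child_err)
    return best
```

The full construction would maintain a queue of the current leaves and repeatedly apply best_split to the leaf with the maximal average restoration error until K leaves are obtained, per the slide's answer to Question 1.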
An Interesting Connection
• Jerome H. Friedman's 1977 paper, "A Tree-Structured Approach to Nonparametric Multiple Regression".
• Unfortunately, this work seems never to have received enough attention. However, it appears to have been part of the inspiration for CART (regression trees) and MARS (Multivariate Adaptive Regression Splines).
Conclusion
• Using linear regression to identify the optimal parameters of the probabilistic restoration function (based on the independence assumption) for a set of itemsets
• Two algorithms to optimally partition the set of itemsets into K parts:
  • K-Regression
  • Tree Regression