This presentation explores the use of model trees, which extend regression trees with linear models at the leaves, for solving regression problems first in classical and then in relational data mining. The proposed tree structure captures both global and local effects of regression variables, and the multiple regression models at the leaves can be built stepwise. The presentation also discusses techniques for evaluating candidate nodes and for filtering useless splitting nodes.
Mining Relational Model Trees
Annalisa Appice
Dipartimento di Informatica, Università di Bari (Department of Computer Science, University of Bari)
Knowledge Acquisition & Machine Learning Lab
Regression problem in classical data mining
Given
• m independent (or predictor) variables Xi (both continuous and discrete)
• a continuous dependent (or response) variable Y to be predicted
• a set of n training cases (x1, x2, …, xm, y)
Build
• a function y = g(x) that correctly predicts the value of the response variable for each m-tuple (x1, x2, …, xm)
Regression trees and model trees
• Regression trees: approximation by means of a piecewise constant function
• Model trees: approximation by means of a piecewise multiple (linear) function
• Both combine a partitioning of the observations with local regression models
[Figure: example trees with tests on X1 and X2 and leaves such as Y = 0.5, Y = 1.9, Y = 0.9 (constant) and Y = 3 + 1.1X1, Y = 3X1 + 1.1X2 (linear)]
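The difference between the two kinds of leaves can be sketched in a few lines. The split threshold and the pairing of leaves below are illustrative assumptions, not values read off the slide's figure; only the leaf expressions Y = 0.5, Y = 1.9 and Y = 3 + 1.1X1 appear in the slide.

```python
def regression_tree_predict(x1: float) -> float:
    """Piecewise constant: every leaf returns a single number."""
    if x1 <= 0.3:  # illustrative threshold (assumption)
        return 0.5
    return 1.9


def model_tree_predict(x1: float) -> float:
    """Piecewise linear: a leaf may return a linear function of the inputs."""
    if x1 <= 0.3:  # same illustrative threshold (assumption)
        return 0.5
    return 3 + 1.1 * x1


for x1 in (0.0, 1.0, 2.0):
    print(x1, regression_tree_predict(x1), model_tree_predict(x1))
```

Within one region the regression tree cannot react to changes in X1, while the model tree can.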
Model trees: state of the art
The tree structure is generated according to a top-down strategy.
• Phase 1: partitioning of the training set
• Phase 2: association of models to the leaves
Data mining
• Karalic (1992): RETIS
• Quinlan (1992): M5
• Wang & Witten (1997): M5’
• Lubinsky (1994): TSIR
• Torgo (1997): HTL
• …
Statistics
• Ciampi (1991): RECPAM
• Siciliano & Mola (1994)
[Figure: a one-split tree on X1 (threshold 3) with a leaf model Y = 3 + 2X1]
Model trees: state of the art
• Models in the leaves have only a “local” validity: they are built on the basis of the training cases falling in the corresponding partition of the feature space.
• “Global” effects can be represented by variables that are introduced in the regression models at higher levels of the model tree.
A different tree structure is required! Internal nodes can
• either define a further partitioning of the feature space
• or introduce some regression variables in the models to be associated with the leaves.
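Two types of nodes
• Splitting nodes perform a Boolean test: a threshold test on Xi for a continuous variable, or Xi ∈ {xi1, …, xih} for a discrete variable. They have two children, tL and tR.
• Regression nodes compute only a straight-line regression, Y = a + bXi. They have only one child.
[Figure: a regression node t (Y = a + bXi) whose child t’ splits on Xj into t’L and t’R with models Y = c + dXu and Y = e + fXw; splitting nodes t on a continuous and on a discrete variable Xi, with children tL and tR and models Y = a + bXu and Y = c + dXw]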
What is passed down?
• Splitting nodes pass down to each child only a subgroup of training cases, without any change to the variables.
• Regression nodes pass down to their unique child all training cases. The values of the variables not included in the model are transformed to remove the linear effect of the variables already included.
Example: a regression node Y = a1 + b1X1 on the variables Y, X1, X2 passes down Y’ = Y - (a1 + b1X1) and X’2 = X2 - (a2 + b2X1); a later regression node can then fit Y’ = a3 + b3X’2.
An example of model tree
• Leaves are associated with a straight-line regression function.
• The multiple regression model associated with a leaf is the composition of the straight-line regression functions found along the path from the root to the leaf.
[Figure: a model tree T with nodes numbered 0 to 7, starting from Y = a + bX1 at node 0, with splitting nodes on X3, X2 and X4 and straight-line models Y = c + dX3, Y = e + fX2, Y = g + hX3 and Y = i + lX4]
Building a regression model stepwise: some tricks
Example: build a multiple regression model with two independent variables, Y = a + bX1 + cX2, through a sequence of straight-line regressions.
• Build: Y = a1 + b1X1
• Build: X2 = a2 + b2X1
• Compute the residuals on X2: X’2 = X2 - (a2 + b2X1)
• Compute the residuals on Y: Y’ = Y - (a1 + b1X1)
• Regress Y’ on X’2 alone: Y’ = a3 + b3X’2
By substituting the equations of X’2 and Y’ in the last equation: Y = a3 + a1 - a2b3 + b3X2 - (b2b3 - b1)X1.
It can be proven that a = a3 - a2b3 + a1, b = -b2b3 + b1 and c = b3.
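The identities on this slide can be checked numerically. The sketch below is only an illustration under stated assumptions: the helper names `fit_line` and `stepwise_two_variable` and the toy data are hypothetical, and the data are generated exactly from Y = 1 + 2X1 + 3X2 so that the expected coefficients are known in advance.

```python
def fit_line(xs, ys):
    """Least-squares straight line ys ~ a + b*xs; returns (a, b).

    Assumes xs is not constant (otherwise the slope is undefined).
    """
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def stepwise_two_variable(x1, x2, y):
    """Recover Y = a + b*X1 + c*X2 from three straight-line regressions."""
    a1, b1 = fit_line(x1, y)       # Y  = a1 + b1*X1
    a2, b2 = fit_line(x1, x2)      # X2 = a2 + b2*X1
    x2_res = [v - (a2 + b2 * u) for u, v in zip(x1, x2)]   # X'2
    y_res = [v - (a1 + b1 * u) for u, v in zip(x1, y)]     # Y'
    a3, b3 = fit_line(x2_res, y_res)                       # Y' = a3 + b3*X'2
    return a3 - a2 * b3 + a1, b1 - b2 * b3, b3


# Toy data (assumption): generated exactly from Y = 1 + 2*X1 + 3*X2.
x1 = [0.0, 1.0, 2.0, 3.0, 4.0]
x2 = [1.0, 0.0, 2.0, 1.0, 3.0]
y = [1 + 2 * u + 3 * v for u, v in zip(x1, x2)]

a, b, c = stepwise_two_variable(x1, x2, y)
print(a, b, c)
```

Up to floating-point rounding, the three straight-line fits reproduce the coefficients of the two-variable model, which is what allows a leaf model to be assembled from the regression nodes on its path.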
The global effect of regression nodes
• Both regression models associated with the leaves include Xi.
• The contribution of Xi to Y can be different for each leaf, but it can be reliably estimated on the whole region R.
[Figure: a regression node t (Y = a + bXi) followed by a splitting node t’ on Xj with children t’L and t’R (Y = c + dXu and Y = e + fXw); in the (Xj, Y) plane the region R is divided into R1 and R2]
Advantages of the proposed tree structure
• It captures both the “global” and the “local” effects of regression variables.
• Multiple regression models at the leaves can be efficiently built stepwise.
• The multiple regression model at a leaf can be easily computed, so the heuristic function for the selection of regression and splitting nodes can take it into account.
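Evaluating splitting and regression nodes
• Splitting node t on Xi, with children tL and tR and models Y = a + bXu and Y = c + dXv: the evaluation measure σ(Xi, Y) is computed from R(tL) and R(tR), where R(tL) (R(tR)) is the resubstitution error of the left (right) child.
• Regression node t with model Y = a + bXi, followed by a candidate splitting node t’ on Xj with children t’L and t’R (Y = c + dXu and Y = e + fXv): ρ(Xi, Y) = min { R(t), σ(Xj, Y) for all possible variables Xj }.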
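A minimal sketch of the two evaluation measures follows. Two things are assumptions, since the slide's formulas did not survive: the resubstitution error is taken to be the root mean squared error of the best straight line on a node's cases, and the children of a split are combined by weighting each error with its share of the cases. All function names are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Least-squares straight line ys ~ a + b*xs; assumes xs is not constant."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def resubstitution_error(xs, ys):
    """Assumed R(t): RMSE of the best straight line on the node's own cases."""
    a, b = fit_line(xs, ys)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / len(xs))


def split_score(left, right):
    """Assumed splitting measure: child errors weighted by number of cases.

    Each argument is an (xs, ys) pair holding the cases sent to that child.
    """
    n_left, n_right = len(left[0]), len(right[0])
    n = n_left + n_right
    return (n_left / n) * resubstitution_error(*left) + \
           (n_right / n) * resubstitution_error(*right)


def regression_score(error_here, following_split_scores):
    """min { R(t), scores of the candidate splits that could follow t }."""
    return min([error_here] + list(following_split_scores))


left = ([0, 1, 2], [0, 1, 2])    # perfectly linear child
right = ([0, 1, 2], [0, 1, 0])   # child that a line fits poorly
s = split_score(left, right)
print(s, regression_score(0.5, [s, 0.9]))
```

The `min` in `regression_score` is what lets a regression node be credited for a good split that only becomes available after its variable has been regressed out.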
Filtering useless splitting nodes
Problem: a splitting node whose children are associated with identical straight-line regressions is really modelling a regression step.
How to recognize it?
Solution: compare the two regression lines associated with the children of a splitting node according to a statistical test for coincident regression lines (Weisberg, 1985).
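One standard way to test whether two regression lines coincide is an F statistic that compares a single pooled line against two separate lines. The sketch below uses that general formulation; whether it matches the exact variant used in the system is an assumption, and the function names and toy data are hypothetical. It returns only the statistic, which would then be compared with a critical value of the F distribution with 2 and n - 4 degrees of freedom.

```python
def fit_line(xs, ys):
    """Least-squares straight line ys ~ a + b*xs; assumes xs is not constant."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def rss(xs, ys):
    """Residual sum of squares of the best straight line."""
    a, b = fit_line(xs, ys)
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def coincidence_f(left, right):
    """F statistic for H0: both children share one intercept and one slope.

    Assumes the separate fits are not both perfect (non-zero residuals).
    """
    (xl, yl), (xr, yr) = left, right
    rss_separate = rss(xl, yl) + rss(xr, yr)   # two lines, 4 parameters
    rss_pooled = rss(xl + xr, yl + yr)         # one line, 2 parameters
    n = len(xl) + len(xr)
    return ((rss_pooled - rss_separate) / 2) / (rss_separate / (n - 4))


left = ([0, 1, 2, 3, 4], [0.1, 0.9, 2.1, 2.9, 4.1])
same = ([5, 6, 7, 8, 9], [4.9, 6.1, 6.9, 8.1, 8.9])    # continues y = x
other = ([5, 6, 7, 8, 9], [5.1, 3.9, 3.1, 1.9, 1.1])   # follows y = 10 - x

f_same = coincidence_f(left, same)
f_other = coincidence_f(left, other)
print(f_same, f_other)
```

A small statistic, as for `f_same`, suggests the split adds nothing beyond a regression step; a large one, as for `f_other`, supports keeping the split.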
Stopping criteria
• The first performs the partial F-test to evaluate the contribution of a new independent variable to the model.
• The second requires the number of cases in each node to be greater than a minimum value.
• The third operates when all continuous variables along the path from the root to the current node are used in regression steps and there are no discrete variables in the training set.
• The fourth creates a leaf if the error in the current node is below a fraction of the error in the root node.
• The fifth stops the growth when the coefficient of determination is greater than a minimum value.
Related works … and problems
In principle, the optimal split should be chosen on the basis of the fit of each regression model to the data.
Problem: in some systems (M5, M5’ and HTL) the heuristic function does not take into account the models associated with the leaves of the tree.
• The evaluation function is incoherent with respect to the model tree being built.
• Some simple regression models are not correctly discovered.
Related works … and problems
Example: Cubist splits the data at -0.1 and builds the following models:
• X ≤ -0.1: Y = 0.78 + 0.175*X
• X > -0.1: Y = 1.143 - 0.281*X
[Figure: scatter plot of the data (x from -1.5 to 2.5, y from 0 to 1.8) with a model tree that tests x against 0.4 (True/False branches) and has the leaf models y = 0.963 + 0.851x and y = 1.909 - 0.868x]
Related works … and problems
RETIS solves this problem by computing the best multiple regression model at the leaves for each splitting node. The problem is theoretically solved, but …
• The approach is computationally expensive: a multiple regression model is built for each possible test. The choice of the first split is O(m^3 N^2).
• All continuous variables are involved in the multiple linear models associated with the leaves. So, when some of the independent variables are linearly related to each other, several problems may occur (collinearity).
Related works … and problems
TSIR induces model trees with regression nodes and splitting nodes, but …
• The effect of the regressed variable in a regression node is not removed when cases are passed down.
• As a result, the multiple regression model associated with each leaf cannot be correctly interpreted from a statistical viewpoint.
Computational complexity
• It can be proved that SMOTI has an O(m^3 n^2) worst-case complexity for the selection of any node (splitting or regression).
• RETIS has the same complexity for node selection, although RETIS does not select a subset of variables to solve collinearity problems.
Empirical evaluation
• For pairwise comparison with RETIS and M5’, which are the state-of-the-art model tree induction systems, the non-parametric Wilcoxon two-sample paired signed rank test is used.
• Experiments (Malerba et al., 2004):
• laboratory-sized data sets
• UCI datasets
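For readers unfamiliar with the test, the sketch below computes the two rank sums of the Wilcoxon signed rank test for paired error measurements. It is a plain illustration of the statistic only: the function name and the error values are hypothetical, zero differences are dropped, tied absolute differences share an average rank, and no p-value is computed.

```python
def wilcoxon_signed_rank(a, b):
    """Return (W+, W-): rank sums of positive and negative paired differences."""
    diffs = [x - y for x, y in zip(a, b) if x != y]   # drop zero differences
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        # Find the run of tied absolute differences starting at position i.
        j = i
        while j + 1 < len(order) and \
                abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1   # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus


# Hypothetical per-fold errors of two systems on the same folds.
errors_system_a = [1.0, 2.0, 3.0, 4.0]
errors_system_b = [0.5, 2.5, 1.0, 3.9]
w_plus, w_minus = wilcoxon_signed_rank(errors_system_a, errors_system_b)
print(w_plus, w_minus)
```

The smaller of the two sums is the value usually compared against the critical value for the chosen significance level.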
Empirical evaluation on laboratory-sized data
[Figure: comparison of RETIS, M5 and SMOTI]
Empirical evaluation on laboratory-sized data
[Figure: running time in seconds against the number of examples for RETIS, M5 and SMOTI]
Empirical evaluation on UCI data
Empirical evaluation on UCI data
For some datasets SMOTI mines interesting patterns that no previous study on model trees has ever revealed. This illustrates the easy interpretability of the model trees induced by SMOTI.
Example: Abalone (marine molluscs). The goal is to predict the age (number of rings). SMOTI builds a model tree with a regression node in the root. The straight-line regression selected at the root is almost invariant for all model trees and expresses a linear dependence between the number of rings (dependent variable) and the shucked weight (independent variable). This is a clear example of global effect.
SMOTI: open issues
• The data mining system KDB2000 (http://www.di.uniba.it/~malerba/software/kdb2000/index.htm), which implements SMOTI, is not tightly integrated with the DBMS. Goal: tighter integration with a DBMS.
• SMOTI cannot be applied directly to multi-relational data mining tasks: the unit of analysis is an individual described by a set of random variables, each of which takes just one single value.
From classical to relational data mining
... while in most real-world applications complex objects are described in terms of properties and relations.
Example: in spatial domains the effect of a predictor variable at any site may not be limited to the specified site (spatial autocorrelation).
• E.g.: there is no communal establishment (schools, hospitals) in an ED, but many of them are located in the nearby EDs.
Multi-relational representation
• Augment the data table with information about neighboring units.
[Figure: target objects and relevant objects]
Regression problem in relational data mining
Given
• a training set O stored in relational tables S = {T0, T1, …, Th} of a relational database D,
• a set of v primary key constraints PK on relations in S,
• a set of w foreign key constraints FK on relations in S,
• a target relation T(X1, …, Xn, Y) ∈ S,
• a target continuous attribute Y in T, different from the primary key or foreign keys in T.
Find
• a multi-relational regression model which predicts the value of Y for some object represented as a tuple in T and related tuples in S according to foreign key paths.
How to work with (multi-)relational data?
• Mould the relational database into a single table so that traditional attribute-value algorithms can work on it:
• create a single relation by deriving attributes from other joined tables
• construct a single relation that summarizes and/or aggregates information found in other tables
• Solve mining problems in their original representation:
• FORS (Karalic, 1997)
• SRT (Kramer, 1996), S-CART (Kramer, 1999), TILDE-RT (Blockeel, 1998)
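The first option, summarizing related tables into a single relation, can be sketched with an in-memory SQLite database. The schema borrows the Customer and Order tables used later in the slides, but the rows, the reduced column set and the choice of COUNT and AVG as aggregates are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (Id INTEGER PRIMARY KEY, Sale REAL, CreditLine REAL);
    CREATE TABLE "Order" (Id INTEGER PRIMARY KEY,
                          Client INTEGER REFERENCES Customer(Id),
                          Pieces REAL);
    INSERT INTO Customer VALUES (1, 100.0, 500.0), (2, 200.0, 900.0);
    INSERT INTO "Order" VALUES (10, 1, 5.0), (11, 1, 7.0), (12, 2, 3.0);
""")

# One row per customer: the one-to-many Order table is collapsed into
# aggregate columns that an attribute-value learner can consume.
rows = conn.execute("""
    SELECT c.Id, c.Sale, c.CreditLine,
           COUNT(o.Id) AS NumOrders,
           AVG(o.Pieces) AS AvgPieces
    FROM Customer AS c
    LEFT JOIN "Order" AS o ON o.Client = c.Id
    GROUP BY c.Id
    ORDER BY c.Id
""").fetchall()

for row in rows:
    print(row)
```

The aggregation discards which order contributed what, which is the information loss that motivates working in the original representation.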
Strengths and weaknesses of current multi-relational regression methods
Strengths
• solve relational regression problems in their original representation
• able to exploit background knowledge in the mining process
• learn multi-relational patterns
Weaknesses
• knowledge of the data model is not used to guide the search process
• data is stored as Prolog facts
• not integrated with the database
• do not differentiate global vs. local effects of variables in a regression model
Idea: combine the achievements of the KDD field on the integration of data mining with database systems with the results reported in the ILP field on how to upgrade propositional data mining algorithms to multi-relational representations.
Mr-SMOTI
Global/local effect + multi-relational model = Mr-SMOTI
• Upgrading SMOTI to multi-relational representations
• Tightly integrating the data mining engine with a relational DBMS
Mr-SMOTI is the relational extension of SMOTI. It outputs relational model trees such that
• each node corresponds to a subset of the training data and is associated with a portion of D intensionally described by a relational pattern,
• each leaf is associated with a (multiple) regression function which may involve predictor variables from several tables in D,
• each variable that may be introduced in the left branch of a node must not occur in the right branch of that node,
• relational patterns associated with nodes are represented with regression selection graphs, which extend the selection graph definition (Knobbe, 99),
• regression selection graphs are translated into SQL expressions stored in XML format.
What is a regression selection graph?
• It corresponds to the tuples describing a subset of the instances from the database, possibly modified by removing the effect of regression steps
• Nodes correspond to the tables from the database, whose attributes are replaced by the corresponding residuals
• Arcs correspond to foreign key associations between tables
• Open arcs = “have at least one”
• Closed arcs = “have none of”
[Figure: a selection graph over Customer, Order and Detail, labelled with CreditLine, a condition on Quantity (threshold 70) and the condition Date in {02/09/02} on Order]
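One plausible SQL reading of open and closed arcs uses EXISTS and NOT EXISTS. This is a sketch under assumptions, not the translation produced by the system: the schema is reduced to two tables, the rows are invented, and the queries are hand-written rather than generated from a graph.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (Id INTEGER PRIMARY KEY);
    CREATE TABLE "Order" (Id INTEGER PRIMARY KEY, Client INTEGER, Date TEXT);
    INSERT INTO Customer VALUES (1), (2), (3);
    INSERT INTO "Order" VALUES (10, 1, '02/09/02'), (11, 2, '05/10/02');
""")

# Open arc: customers that have at least one order dated 02/09/02.
open_arc = """
    SELECT c.Id FROM Customer AS c
    WHERE EXISTS (SELECT 1 FROM "Order" AS o
                  WHERE o.Client = c.Id AND o.Date IN ('02/09/02'))
    ORDER BY c.Id
"""

# Closed arc: customers that have no order dated 02/09/02.
closed_arc = """
    SELECT c.Id FROM Customer AS c
    WHERE NOT EXISTS (SELECT 1 FROM "Order" AS o
                      WHERE o.Client = c.Id AND o.Date IN ('02/09/02'))
    ORDER BY c.Id
"""

with_order = [row[0] for row in conn.execute(open_arc)]
without_order = [row[0] for row in conn.execute(closed_arc)]
print(with_order, without_order)
```

The two queries are complementary: together they cover every customer exactly once, which is the property a binary split on a join condition needs.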
Relational splitting nodes
Each refinement is paired with its complement:
• add condition + add negative condition (split condition)
• add present arc and open node + add absent arc and closed node (join condition)
1st case
[Figure: a selection graph over Customer, Order and Detail refined with a split condition on Sale, giving two complementary graphs with Sale ≤ 120 and Sale > 120]
Relational splitting nodes
2nd case
[Figure: a selection graph over Customer, Order and Detail refined with a condition on Quantity (threshold 22) and with its complementary refinement]
Relational splitting nodes
[Figure: selection graphs over Customer, Order and Detail illustrating a refinement and its complementary refinement]
Relational splitting nodes with look-ahead
[Figure: a selection graph containing only Customer refined in one step by adding Order and Detail with a condition on Quantity (threshold 22), together with the complementary refinement]
Relational regression nodes
Refinement: add regression condition
• Before: Customer(Id, Sale, CreditLine, Agent), Order(Id, Date, Client, Pieces)
• Regression step: CreditLine’ = CreditLine - (5Sale - 0.5), Pieces’ = Pieces - (-2.5Sale - 3.2)
• After: Customer(Id, Sale, CreditLine - 5Sale + 0.5, Agent), Order(Id, Date, Client, Pieces + 2.5Sale + 3.2)
Relational model trees: an example
[Figure: a relational model tree whose nodes are regression selection graphs over Customer and Order, some with the condition Date in {02/09/02} on Order]
• After the first regression step: Customer(Id, Sale, CreditLine - 5Sale + 0.5, Agent), Order(Id, Date, Client, Pieces + 2.5Sale + 3.2)
• After the second regression step: Customer(Id, Sale, CreditLine - 5Sale + 0.5 - 0.1(Pieces + 2.5Sale + 3.2) + 2, Agent), Order(Id, Date, Client, Pieces + 2.5Sale + 3.2)
• Model at the leaf: Select Customer.Id, avg(5.25*Sale + 0.1*Pieces - 2.18) as CreditLine From Customer, "Order" Where Customer.Id = "Order".Client Group by Customer.Id
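The coefficients of the leaf query follow from composing the two regression steps, which can be checked with a short sketch. The step coefficients are read off the slide; the table rows, the variable names and the use of SQLite instead of the system's own DBMS are assumptions made for illustration.

```python
import sqlite3

# Straight-line regressions along the path, as read off the slide.
a1, b1 = -0.5, 5.0     # CreditLine  ~ a1 + b1*Sale
a2, b2 = -3.2, -2.5    # Pieces      ~ a2 + b2*Sale
a3, b3 = -2.0, 0.1     # CreditLine' ~ a3 + b3*Pieces'

# Composition into CreditLine ~ const + sale_coef*Sale + pieces_coef*Pieces.
const = a3 - a2 * b3 + a1
sale_coef = b1 - b2 * b3
pieces_coef = b3
print(const, sale_coef, pieces_coef)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (Id INTEGER PRIMARY KEY, Sale REAL,
                           CreditLine REAL, Agent TEXT);
    CREATE TABLE "Order" (Id INTEGER PRIMARY KEY, Date TEXT,
                          Client INTEGER, Pieces REAL);
    INSERT INTO Customer VALUES (1, 10.0, NULL, 'a');
    INSERT INTO "Order" VALUES (100, '02/09/02', 1, 4.0),
                               (101, '03/09/02', 1, 6.0);
""")

# Same shape as the query on the slide, with the coefficients as parameters.
rows = conn.execute("""
    SELECT Customer.Id, AVG(? * Sale + ? * Pieces + ?) AS CreditLine
    FROM Customer, "Order"
    WHERE Customer.Id = "Order".Client
    GROUP BY Customer.Id
""", (sale_coef, pieces_coef, const)).fetchall()
print(rows)
```

Because a customer may have several orders, the per-order predictions are averaged to give one value per target object.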
How to choose the best relational node?
• Start with a root node that is associated with a selection graph containing only the target node
• Use greedy heuristics to choose regression selection graph refinements
• use binary splits for simplicity
• for each refinement get the complementary refinement
• store the regression coefficients in order to compute residuals on continuous attributes
• choose the best refinement based on evaluation functions
Evaluating relational splitting node
[Figure: selection graphs over Customer and Order for a candidate split and its complement]
Evaluating relational regression node
ρ(t) = min { R(t), σ(t’) }
where
• R(t) is the resubstitution error computed on the tuples extracted by the regression selection graph associated with t,
• t’ is the best splitting node following t.
Stopping criteria
• The first requires the number of target objects in each node to be greater than a minimum value.
• The second operates when all continuous attributes along the path from the root to the current node are used in regression steps and no “add open node and present arc” refinement can introduce new continuous attributes.
• The third stops the growth when the coefficient of determination is greater than a minimum value.
Mr-SMOTI: some details
Mr-SMOTI has been implemented as a component of the KDD system MURENA. MURENA has been implemented in Java and interfaces with an Oracle database.
http://www.di.uniba.it/%7Ececi/micFiles/systems/The%20MURENA%20project.html
Empirical evaluation on laboratory-sized data
Empirical evaluation on laboratory-sized data
Wilcoxon test (alpha = 0.05) …
Empirical evaluation on real data
Improving efficiency by materializing intermediate results
Questions?