A Multi-Relational Decision Tree Learning Algorithm – Implementation and Experiments
Anna Atramentov
Major: Computer Science
Program of Study Committee: Vasant Honavar (Major Professor), Drena Leigh Dobbs, Yan-Bin Jia
Iowa State University, Ames, Iowa, 2003
KDD and Relational Data Mining
• The term KDD stands for Knowledge Discovery in Databases
• Traditional KDD techniques work with instances represented by a single table
• Relational Data Mining is a subfield of KDD in which instances are represented by several tables
Motivation
Importance of relational learning:
• Growth of data stored in multi-relational databases (MRDBs)
• Techniques for learning from unstructured data often extract the data into an MRDB
Promising approach to relational learning:
• MRDM (Multi-Relational Data Mining) framework developed by Knobbe et al. (1999)
• MRDTL (Multi-Relational Decision Tree Learning) algorithm implemented by Leiva (2002)
Goals:
• Speed up the MRDM framework, and in particular the MRDTL algorithm
• Incorporate handling of missing values
• Perform a more extensive experimental evaluation of the algorithm
Relational Learning Literature
• Inductive Logic Programming (Dzeroski and Lavrac, 2001; Dzeroski et al., 2001; Blockeel, 1998; De Raedt, 1997)
• First order extensions of probabilistic models
• Relational Bayesian Networks (Jaeger, 1997)
• Probabilistic Relational Models (Getoor, 2001; Koller, 1999)
• Bayesian Logic Programs (Kersting et al., 2000)
• Combining First Order Logic and Probability Theory
• Multi-Relational Data Mining (Knobbe et al., 1999)
• Propositionalization methods (Krogel and Wrobel, 2001)
• PRMs extension for cumulative learning, for learning and reasoning as agents interact with the world (Pfeffer, 2000)
• Approaches for mining data in the form of graphs (Holder and Cook, 2000; Gonzalez et al., 2000)
Problem Formulation
Given: data stored in a relational database
Goal: build a decision tree for predicting the target attribute in the target table
[Figure: example of a multi-relational database schema with instances]
Propositional decision tree algorithm. Construction phase
Tree_induction(D: data)
  A = optimal_attribute(D)
  if stopping_criterion(D)
    return leaf(D)
  else
    Dleft := split(D, A)
    Dright := splitcomplement(D, A)
    childleft := Tree_induction(Dleft)
    childright := Tree_induction(Dright)
    return node(A, childleft, childright)
[Figure: decision tree over instances {d1, d2, d3, d4}: Outlook = sunny gives {d1, d2} (Yes); Outlook = not sunny gives {d3, d4}, split further on Temperature into hot {d3} (No) and not hot {d4} (No)]
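The Tree_induction pseudocode above can be sketched as runnable Python. The dataset representation (a list of (attributes, label) pairs) and the helper names are illustrative assumptions, not taken from the thesis.

```python
import math
from collections import Counter

def entropy(data):
    """Shannon entropy of the class labels in data."""
    counts = Counter(label for _, label in data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def stopping_criterion(data):
    # Stop when all instances share one label (pure node).
    return len({label for _, label in data}) <= 1

def optimal_attribute(data):
    # Greedy choice: the (attribute, value) test with the highest information gain.
    best, best_gain = None, -1.0
    tests = {(a, v) for attrs, _ in data for a, v in attrs.items()}
    for a, v in tests:
        left = [d for d in data if d[0].get(a) == v]
        right = [d for d in data if d[0].get(a) != v]
        if not left or not right:
            continue  # test does not split the data
        gain = entropy(data) - (len(left) * entropy(left)
                                + len(right) * entropy(right)) / len(data)
        if gain > best_gain:
            best, best_gain = (a, v), gain
    return best

def tree_induction(data):
    test = None if stopping_criterion(data) else optimal_attribute(data)
    if test is None:
        return Counter(label for _, label in data).most_common(1)[0][0]  # leaf(D)
    a, v = test
    d_left = [d for d in data if d[0].get(a) == v]    # split(D, A)
    d_right = [d for d in data if d[0].get(a) != v]   # splitcomplement(D, A)
    return ((a, v), tree_induction(d_left), tree_induction(d_right))  # node(A, ...)

def classify(tree, attrs):
    # Leaves are plain labels; internal nodes are ((attribute, value), left, right).
    while isinstance(tree, tuple):
        (a, v), left, right = tree
        tree = left if attrs.get(a) == v else right
    return tree
```

For the slide's toy data, `tree_induction` first splits on Outlook and then classifies each branch, mirroring the figure.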
MR setting. Splitting data with Selection Graphs
[Figure: instances of the target table Staff split among Department, Graduate Student, and Staff by a selection graph (Grad.Student with GPA > 2.0) and its complement selection graph]
What is a selection graph?
• It corresponds to a subset of the instances in the target table
• Nodes correspond to tables in the database
• Edges correspond to associations between tables
• Open edge = “have at least one”
• Closed edge = “have none of”
[Figure: selection graph over Staff, Grad.Student (GPA > 3.9), and Department (Specialization = math)]
Transforming selection graphs into SQL queries
Generic query:
select distinct T0.primary_key
from table_list
where join_list and condition_list
Staff with Position = Professor:
select distinct T0.id
from Staff T0
where T0.Position = 'Professor'
Staff with at least one Grad.Student:
select distinct T0.id
from Staff T0, Graduate_Student T1
where T0.id = T1.Advisor
Staff with no Grad.Student:
select distinct T0.id
from Staff T0
where T0.id not in (select T1.Advisor from Graduate_Student T1)
Staff with at least one Grad.Student, but none with GPA > 3.9:
select distinct T0.id
from Staff T0, Graduate_Student T1
where T0.id = T1.Advisor
and T0.id not in (select T2.Advisor from Graduate_Student T2 where T2.GPA > 3.9)
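The translation above can be sketched as a small generator that produces the generic query from a flat selection graph (a target node plus directly attached children). The Node/Edge encoding here is a hypothetical simplification: each edge is a (table, foreign key, open?, subconditions) tuple, with open edges becoming joins and closed edges becoming NOT IN subqueries.

```python
def to_sql(target, key, conditions, edges):
    """Build 'select distinct T0.<key> from <tables> where <joins and conditions>'.

    target:     name of the target table (aliased T0)
    key:        primary key column of the target table
    conditions: SQL conditions on T0, e.g. ["T0.Position = 'Professor'"]
    edges:      list of (table, fk, is_open, subconds); subconds use {a} for the alias
    """
    tables, where = [f"{target} T0"], list(conditions)
    for i, (table, fk, is_open, subconds) in enumerate(edges, start=1):
        alias = f"T{i}"
        if is_open:  # open edge: "have at least one" -> ordinary join
            tables.append(f"{table} {alias}")
            where.append(f"T0.{key} = {alias}.{fk}")
            where.extend(c.format(a=alias) for c in subconds)
        else:        # closed edge: "have none of" -> NOT IN subquery
            sub = f"select {alias}.{fk} from {table} {alias}"
            sub_where = " and ".join(c.format(a=alias) for c in subconds)
            if sub_where:
                sub += f" where {sub_where}"
            where.append(f"T0.{key} not in ({sub})")
    sql = f"select distinct T0.{key} from {', '.join(tables)}"
    if where:
        sql += f" where {' and '.join(where)}"
    return sql
```

For example, the "Staff having no Grad.Student with GPA > 3.9" graph becomes a closed-edge NOT IN clause over Graduate_Student.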
MR decision tree
• Each node contains a selection graph
• Each child selection graph is a supergraph of the parent selection graph
[Figure: tree of selection graphs over Staff, growing from Staff alone to Staff with Grad.Student (GPA > 3.9) and its complement]
How to choose selection graphs in nodes?
Problem: there are too many supergraph selection graphs to choose from at each node
Solution:
• start with an initial selection graph
• use a greedy heuristic to choose supergraph selection graphs: refinements
• use binary splits for simplicity
• for each refinement, get the complement refinement
• choose the best refinement based on the information gain criterion
Problem: some potentially good refinements may give no immediate benefit
Solution:
• look-ahead capability
[Figure: tree of selection graphs over Staff and Grad.Student (GPA > 3.9)]
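The greedy choice described above can be sketched by modelling each candidate refinement as a predicate over instances: the refinement and its complement form a binary split, and the pair with the highest information gain wins. This predicate view is an illustrative simplification of selection-graph refinements.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(instances, labels, refinement):
    # Binary split: instances covered by the refinement vs. by its complement.
    match = [l for x, l in zip(instances, labels) if refinement(x)]
    rest = [l for x, l in zip(instances, labels) if not refinement(x)]
    n = len(labels)
    return entropy(labels) - (len(match) * entropy(match)
                              + len(rest) * entropy(rest)) / n

def best_refinement(instances, labels, refinements):
    # Greedy heuristic: pick the refinement whose split has the highest gain.
    return max(refinements, key=lambda r: information_gain(instances, labels, r))
```

Here a refinement like "GPA > 2.0" that covers every instance yields zero gain, which is exactly the "no immediate benefit" situation the look-ahead capability addresses.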
Refinements of selection graph
• add condition to the node – explore attribute information in the tables
• add present edge and open node – explore relational properties between the tables
[Figure: selection graph over Staff, Grad.Student (GPA > 3.9), and Department (Specialization = math)]
Refinements of selection graph
• add condition to the node
• add present edge and open node
[Figure: add-condition refinement adds Position = Professor to the Staff node; the complement refinement adds Position != Professor]
Refinements of selection graph
• add condition to the node
• add present edge and open node
[Figure: add-condition refinement adds GPA > 2.0 to the Grad.Student node; the complement refinement is shown]
Refinements of selection graph
• add condition to the node
• add present edge and open node
[Figure: add-condition refinement adds #Students > 200 to the Department node; the complement refinement is shown]
Refinements of selection graph
• add condition to the node
• add present edge and open node
[Figure: add-edge refinement opens a Department node; the complement refinement closes the edge. Note: information gain = 0]
Refinements of selection graph
• add condition to the node
• add present edge and open node
[Figure: add-edge refinement opens a Staff node; the complement refinement is shown]
Refinements of selection graph
• add condition to the node
• add present edge and open node
[Figure: add-edge refinement opens another Staff node; the complement refinement is shown]
Refinements of selection graph
• add condition to the node
• add present edge and open node
[Figure: add-edge refinement opens another Grad.Student node; the complement refinement is shown]
Look ahead capability
[Figure: refinement adds a present edge to an open Department node for the selection graph over Staff, Grad.Student (GPA > 3.9), and Department (Specialization = math); the complement refinement is shown]
Look ahead capability
[Figure: two-step look-ahead refinement adds the Department node together with the condition #Students > 200 in a single step; the complement refinement is shown]
MRDTL algorithm. Construction phase
For each non-leaf node:
• consider all possible refinements of the node's selection graph, and their complements
• choose the best ones based on the information gain criterion
• create children nodes
[Figure: tree of selection graphs over Staff and Grad.Student (GPA > 3.9)]
MRDTL algorithm. Classification phase
For each leaf:
• apply the selection graph of the leaf to the test data
• classify the resulting instances with the classification of the leaf
[Figure: decision tree of selection graphs over Staff, Grad.Student (GPA > 3.9), and Department (Spec = math / Spec = physics, Position = Professor), with leaf classifications 70-80k and 80-100k]
The most time consuming operations of MRDTL
Entropy associated with this selection graph:
E = -Σi (ni / N) log (ni / N)
Query associated with the counts ni:
select distinct Staff.Salary, count(distinct Staff.ID)
from Staff, Graduate_Student, Department
where join_list and condition_list
group by Staff.Salary
The result of the query is a list of pairs (ci, ni).
[Figure: selection graph over Staff, Grad.Student (GPA > 3.9), and Department (Specialization = math)]
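Given the (ci, ni) list the grouped query returns, computing E = -Σi (ni / N) log (ni / N) is a one-liner; a minimal sketch:

```python
import math

def entropy_from_counts(class_counts):
    """class_counts: list of (c_i, n_i) pairs, as returned by the group-by query."""
    total = sum(n for _, n in class_counts)  # N
    return -sum((n / total) * math.log2(n / total) for _, n in class_counts)
```

A 50/50 split over two salary classes gives the maximum entropy of 1 bit, and a single class gives 0.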
The most time consuming operations of MRDTL
Entropy associated with each of the refinements:
select distinct Staff.Salary, count(distinct Staff.ID)
from table_list
where join_list and condition_list
group by Staff.Salary
[Figure: the add-condition refinement GPA > 2.0 and its complement for the selection graph over Staff, Grad.Student (GPA > 3.9), and Department (Specialization = math)]
A way to speed up: eliminate redundant calculations
Problem: for a selection graph with 162 nodes, the time to execute a query is more than 3 minutes!
Redundancy in calculation: for this selection graph, the tables Staff and Grad.Student will be joined over and over for all the children refinements of the tree
A way to fix it: calculate the join only once and save it for all further calculations
[Figure: selection graph over Staff, Grad.Student (GPA > 3.9), and Department (Specialization = math)]
Speed Up Method. Sufficient tables
[Figure: the sufficient table S built for the selection graph over Staff, Grad.Student (GPA > 3.9), and Department (Specialization = math)]
Speed Up Method. Sufficient tables
Entropy associated with this selection graph:
E = -Σi (ni / N) log (ni / N)
Query associated with the counts ni:
select S.Salary, count(distinct S.Staff_ID)
from S
group by S.Salary
The result of the query is a list of pairs (ci, ni).
Speed Up Method. Sufficient tables
Query associated with the add condition refinement:
select S.Salary, X.A, count(distinct S.Staff_ID)
from S, X
where S.X_ID = X.ID
group by S.Salary, X.A
Calculation for the complement refinement:
count(ci, Rcomp(S)) = count(ci, S) - count(ci, R(S))
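The complement-refinement identity above means no second query is needed: the complement's class counts are just the parent's counts minus the refinement's counts. A minimal sketch, with counts held in plain dicts:

```python
def complement_counts(parent_counts, refinement_counts):
    """count(c_i, Rcomp(S)) = count(c_i, S) - count(c_i, R(S)).

    Both arguments map class label c_i -> count; classes absent from the
    refinement's result simply keep the parent's full count.
    """
    return {c: n - refinement_counts.get(c, 0) for c, n in parent_counts.items()}
```

This subtraction is what lets MRDTL evaluate each refinement and its complement at the cost of a single grouped query.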
Speed Up Method. Sufficient tables
Query associated with the add edge refinement:
select S.Salary, count(distinct S.Staff_ID)
from S, X, Y
where S.X_ID = X.ID and e.cond
group by S.Salary
Calculation for the complement refinement:
count(ci, Rcomp(S)) = count(ci, S) - count(ci, R(S))
Speed Up Method
• Significant speed up in obtaining the counts needed for the calculation of entropy and information gain
• The speed up comes at the cost of additional space used by the algorithm
Handling Missing Values
For each attribute that has missing values we build a Naïve Bayes model:
[Figure: schema with Graduate Student, Department, and Staff; evidence attributes such as Staff.Position, Staff.Name, Staff.Dep, Department.Spec, …]
Handling Missing Values
Then the most probable value for the missing attribute is calculated by the formula:
P(vi | X1.A1, X2.A2, X3.A3, …) = P(X1.A1, X2.A2, X3.A3, … | vi) P(vi) / P(X1.A1, X2.A2, X3.A3, …)
= P(X1.A1 | vi) P(X2.A2 | vi) P(X3.A3 | vi) … P(vi) / P(X1.A1, X2.A2, X3.A3, …)
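The imputation formula above can be sketched as follows. Counts are estimated from rows where the target attribute is present, and a missing value is filled with argmax over v of P(v) ∏j P(xj | v); the function names, the dict-based row encoding, and the Laplace smoothing are illustrative assumptions, not details from the thesis.

```python
from collections import Counter, defaultdict

def naive_bayes_impute(rows, target, predictors):
    """rows: list of dicts; fills None values of `target` via Naive Bayes."""
    observed = [r for r in rows if r[target] is not None]
    prior = Counter(r[target] for r in observed)            # counts of each v_i
    cond = defaultdict(Counter)                             # (predictor, v_i) -> value counts
    for r in observed:
        for p in predictors:
            cond[(p, r[target])][r[p]] += 1

    def fill(r):
        def score(v):
            # P(v) * prod_j P(x_j | v), with add-one (Laplace) smoothing.
            s = prior[v] / len(observed)
            for p in predictors:
                s *= (cond[(p, v)][r[p]] + 1) / (prior[v] + len(cond[(p, v)]))
            return s
        return max(prior, key=score)

    return [dict(r, **{target: fill(r)}) if r[target] is None else r for r in rows]
```

For example, if professors in the observed rows all belong to the CS department, a staff row with a missing Position and Dep = CS is imputed as Professor.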
Experimental results. Mutagenesis
• The most widely used database in ILP.
• Describes molecules of certain nitroaromatic compounds.
• Goal: predict their mutagenic activity (label attribute), i.e. the ability to cause DNA to mutate. High mutagenic activity can cause cancer.
• Two subsets: regression friendly (188 molecules) and regression unfriendly (42 molecules). We used only the regression friendly subset.
• 5 levels of background knowledge: B0, B1, B2, B3, B4, providing increasingly richer descriptions of the examples. We used the B2 level.
Experimental results. Mutagenesis
• Schema of the mutagenesis database [figure]
• Results of 10-fold cross-validation for the regression friendly set. Best-known reported accuracy is 86%.
Experimental results. KDD Cup 2001
• Consists of a variety of details about the various genes of one particular type of organism.
• Genes code for proteins, and these proteins tend to localize in various parts of cells and interact with one another in order to perform crucial functions.
• 2 tasks: prediction of gene/protein localization and function.
• 862 training genes, 381 test genes.
• Many attribute values are missing: 70% of the CLASS attribute, 50% of COMPLEX, and 50% of MOTIF in the composition table.
Experimental results. KDD Cup 2001
[Table: localization task results] Best-known reported accuracy is 72.1%
[Table: function task results] Best-known reported accuracy is 93.6%
Experimental results. PKDD 2001 Discovery Challenge
• Consists of 5 tables
• The target table consists of 1239 records
• The task is to predict the degree of the thrombosis attribute from the ANTIBODY_EXAM table
• The results for 5:2 cross-validation: best-known reported accuracy is 99.28%
Summary
• the new implementation significantly outperforms the original MRDTL implementation in terms of running time
• the accuracy results are comparable with the best reported results obtained using other data mining algorithms
Future work
• Incorporation of more sophisticated techniques for handling missing values
• Incorporation of more sophisticated pruning techniques or complexity regularization
• More extensive evaluation of MRDTL on real-world data sets
• Development of ontology-guided multi-relational decision tree learning algorithms to generate classifiers at multiple levels of abstraction [Zhang et al., 2002]
• Development of variants of MRDTL that can learn from heterogeneous, distributed, autonomous data sources, based on recently developed techniques for distributed learning and ontology-based data integration
Thanks to
• Dr. Honavar for providing guidance, help and support throughout this research
• Colleagues from the Artificial Intelligence Lab for various helpful discussions
• My committee members, Drena Dobbs and Yan-Bin Jia, for their help
• Professors and lecturers of the Computer Science department for the knowledge that they gave me through lectures and discussions
• Iowa State University and the Computer Science department for funding this research in part