Speeding Up Multi-Relational Data Mining. Anna Atramentov and Vasant Honavar* Artificial Intelligence Laboratory Department of Computer Science Iowa State University Ames, IA 50011, USA www.cs.iastate.edu/~honavar/aigroup.html.
* Support provided in part by National Science Foundation, Carver Foundation, and Pioneer Hi-Bred, Inc.
Motivation
Importance of relational learning:
• Growth of data stored in multi-relational databases (MRDB)
• Techniques for learning from unstructured data often extract the data into an MRDB
One of the promising approaches to relational learning:
• MRDM (Multi-Relational Data Mining) framework developed by Knobbe et al. (1999)
• MRDTL (Multi-Relational Decision Tree Learning) algorithm implemented by Leiva et al. (2002)
Goal:
• Speed up the MRDM framework and, in particular, the MRDTL algorithm
Problem Formulation
Given: Data stored in a relational database
Goal: Learn a predictive model for the instances in the target table
[Figure: example of a multi-relational database schema and its instances]
MRDM overview. Selection graphs
• Nodes correspond to the tables from the database
• Edges correspond to the associations between tables
• A selection graph corresponds to the subset of the instances from the target table having some property
• It is a way of specifying attributes in the relational setting
[Figure: selection graph with nodes Staff, Department, and two Grad.Student nodes, with the conditions GPA > 3.9 and Specialization = math]
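As a rough illustration of the bullets above, a selection graph can be held in a small data structure: nodes name tables and carry conditions, edges name the join columns. This is only a sketch. The class names, the `present` flag, the Staff/Department join columns, and the assignment of each condition to a node are assumptions made for the example, not details taken from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    table: str                                       # database table this node stands for
    conditions: list = field(default_factory=list)   # attribute conditions, e.g. "GPA > 3.9"


@dataclass
class Edge:
    parent: int            # index of the parent node in SelectionGraph.nodes
    child: int             # index of the child node
    join: tuple            # (parent_column, child_column) of the association
    present: bool = True   # False would mean "no matching child row may exist"


@dataclass
class SelectionGraph:
    nodes: list                                  # nodes[0] is the target table
    edges: list = field(default_factory=list)

    def tables(self):
        """Tables referenced by the graph, in node order."""
        return [n.table for n in self.nodes]


# Hypothetical graph: staff who advise a student with GPA > 3.9
# and who belong to a department specialising in math.
graph = SelectionGraph(
    nodes=[
        Node("Staff"),
        Node("Grad_Student", ["GPA > 3.9"]),
        Node("Department", ["Specialization = 'math'"]),
    ],
    edges=[
        Edge(0, 1, ("id", "Advisor")),
        Edge(0, 2, ("Department", "id")),   # join columns assumed for illustration
    ],
)
```

Keeping the target table at index 0 mirrors the `T0` alias used in the SQL on the next slide.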
MRDM overview. Transforming selection graphs into SQL queries

Generic query:
    select distinct T0.primary_key
    from table_list
    where join_list
    and condition_list

Staff with an edge to Grad. Student:
    select distinct T0.id
    from Staff T0, Graduate_Student T1
    where T0.id = T1.Advisor

Staff with an absent edge to Grad. Student:
    select distinct T0.id
    from Staff T0
    where T0.id not in
        (select T1.id
         from Graduate_Student T1)

Staff with an edge to Grad. Student and an absent edge to Grad. Student (GPA > 3.9):
    select distinct T0.id
    from Staff T0, Graduate_Student T1
    where T0.id = T1.Advisor
    and T0.id not in
        (select T1.id
         from Graduate_Student T1
         where T1.GPA > 3.9)
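The translation above can be sketched as a small function that fills in the generic query's table_list, join_list, and condition_list. This is a minimal illustration, not the MRDM implementation: the tuple encoding of nodes and edges, the sample rows, and the `Salary` column are assumptions. The subquery for an absent edge selects the foreign-key column (`Advisor`) so that it can be compared with `T0.id`; that choice is also an assumption about the intended semantics. Only leaf nodes are supported behind an absent edge.

```python
import sqlite3


def to_sql(nodes, edges):
    """Translate a tiny selection graph into SQL.

    nodes: list of (table, [condition templates using '{a}' for the alias])
    edges: list of (parent_idx, child_idx, parent_col, child_col, present)
    """
    absent = {child for (_, child, _, _, present) in edges if not present}
    tables, joins, conds = [], [], []
    for i, (table, node_conds) in enumerate(nodes):
        if i in absent:
            continue                      # absent children live in a subquery
        tables.append(f"{table} T{i}")
        conds += [c.format(a=f"T{i}") for c in node_conds]
    for p, c, pcol, ccol, present in edges:
        if present:
            joins.append(f"T{p}.{pcol} = T{c}.{ccol}")
        else:
            table, node_conds = nodes[c]
            sub = f"SELECT T{c}.{ccol} FROM {table} T{c}"
            sub_conds = [x.format(a=f"T{c}") for x in node_conds]
            if sub_conds:
                sub += " WHERE " + " AND ".join(sub_conds)
            conds.append(f"T{p}.{pcol} NOT IN ({sub})")
    sql = f"SELECT DISTINCT T0.id FROM {', '.join(tables)}"
    where = joins + conds
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql


# Hypothetical sample data (no NULL advisors, so NOT IN behaves as expected).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Staff (id INTEGER PRIMARY KEY, Salary TEXT);
CREATE TABLE Graduate_Student (id INTEGER PRIMARY KEY, Advisor INTEGER, GPA REAL);
INSERT INTO Staff VALUES (1, 'high'), (2, 'low'), (3, 'low');
INSERT INTO Graduate_Student VALUES (10, 1, 4.0), (11, 1, 3.5), (12, 2, 3.0);
""")


def run(nodes, edges):
    """Execute the translated query and return the selected staff ids, sorted."""
    return sorted(row[0] for row in conn.execute(to_sql(nodes, edges)))


staff = ("Staff", [])
student = ("Graduate_Student", [])
top_student = ("Graduate_Student", ["{a}.GPA > 3.9"])

advises = run([staff, student], [(0, 1, "id", "Advisor", True)])
advises_nobody = run([staff, student], [(0, 1, "id", "Advisor", False)])
advises_no_top = run(
    [staff, student, top_student],
    [(0, 1, "id", "Advisor", True), (0, 2, "id", "Advisor", False)],
)
```

With this sample data, `advises` holds the staff who advise at least one student, `advises_nobody` those who advise none, and `advises_no_top` those who advise students but none with GPA above 3.9.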
MRDM overview. Refinements of selection graphs
[Figure: a selection graph over Staff, Department, and Grad.Student (conditions GPA > 3.9 and Specialization = math) shown with a refinement that adds the condition GPA > 2.0 to a Grad.Student node, and with the corresponding complement refinement]
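A refinement and its complement are meant to split the instances covered by the parent selection graph into two disjoint groups. The sketch below checks this on a toy database. The table contents are invented, and expressing the complement as a `NOT IN` subquery over the foreign-key column `Advisor` is an assumption in the spirit of the SQL translation shown earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Staff (id INTEGER PRIMARY KEY);
CREATE TABLE Graduate_Student (id INTEGER PRIMARY KEY, Advisor INTEGER, GPA REAL);
INSERT INTO Staff VALUES (1), (2), (3);
INSERT INTO Graduate_Student VALUES (10, 1, 4.0), (11, 1, 3.5), (12, 2, 3.0), (13, 3, 1.8);
""")

# Parent graph: staff who advise at least one graduate student.
PARENT = ("SELECT DISTINCT T0.id FROM Staff T0, Graduate_Student T1 "
          "WHERE T0.id = T1.Advisor")

# Refinement: add the condition GPA > 2.0 to the Grad.Student node.
REFINED = PARENT + " AND T1.GPA > 2.0"

# Complement: staff covered by the parent who advise no student with GPA > 2.0.
COMPLEMENT = PARENT + (" AND T0.id NOT IN (SELECT T2.Advisor "
                       "FROM Graduate_Student T2 WHERE T2.GPA > 2.0)")


def ids(sql):
    """Return the set of staff ids selected by a query."""
    return {row[0] for row in conn.execute(sql)}


parent, refined, complement = ids(PARENT), ids(REFINED), ids(COMPLEMENT)
```

Here the two refinements do not overlap and together cover exactly the parent's instances, which is the property a decision tree needs when it splits a node.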
The most time consuming operations of MRDTL
Query associated with the selection graph:
    select distinct Staff.Salary, count(distinct Staff.ID)
    from Staff, Grad.Student, Department
    where join_list
    and condition_list
    group by Staff.Salary
[Figure: selection graph over Staff, Department, and Grad.Student with the conditions GPA > 3.9 and Specialization = math]
A way to speed up - eliminate redundant calculations
Problem: For a selection graph with 160 nodes, the time to execute a query is more than 3 minutes!
Redundancy in calculation: Tables Staff and Grad.Student will be joined for all the children refinements
A way to fix: make the join only once and save the necessary information for all further calculations
[Figure: selection graph over Staff, Department, and Grad.Student with the conditions GPA > 3.9 and Specialization = math]
Speed Up Method. Sufficient tables
Query associated with the selection graph:
    select S.Salary, count(distinct S.Staff_ID)
    from S
    group by S.Salary
[Figure: selection graph over Staff, Department, and Grad.Student with the conditions GPA > 3.9 and Specialization = math]
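The idea of joining once and reusing the result can be sketched as follows. The slides only show that the sufficient table S exposes `Salary` and `Staff_ID`; the extra `Student_ID` column, the sample rows, and the way a refinement is evaluated against S are assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Staff (ID INTEGER PRIMARY KEY, Salary TEXT);
CREATE TABLE Graduate_Student (ID INTEGER PRIMARY KEY, Advisor INTEGER, GPA REAL);
INSERT INTO Staff VALUES (1, 'high'), (2, 'low'), (3, 'low');
INSERT INTO Graduate_Student VALUES (10, 1, 4.0), (11, 1, 3.5), (12, 2, 3.0), (13, 3, 3.95);

-- Join once and keep only what later queries need (column choice is an assumption).
CREATE TABLE S AS
SELECT T0.ID AS Staff_ID, T0.Salary AS Salary, T1.ID AS Student_ID
FROM Staff T0, Graduate_Student T1
WHERE T0.ID = T1.Advisor;
""")


def counts(sql):
    """Run a (class value, count) query and return it as a dict."""
    return dict(conn.execute(sql).fetchall())


# Class counts for the current selection graph: full join versus sufficient table.
direct = counts("""
    SELECT T0.Salary, COUNT(DISTINCT T0.ID)
    FROM Staff T0, Graduate_Student T1
    WHERE T0.ID = T1.Advisor
    GROUP BY T0.Salary""")
from_s = counts("""
    SELECT S.Salary, COUNT(DISTINCT S.Staff_ID)
    FROM S
    GROUP BY S.Salary""")

# Class counts for a child refinement (GPA > 3.9): S replaces the original join,
# so only the table that holds the refined attribute is joined again.
refined_direct = counts("""
    SELECT T0.Salary, COUNT(DISTINCT T0.ID)
    FROM Staff T0, Graduate_Student T1
    WHERE T0.ID = T1.Advisor AND T1.GPA > 3.9
    GROUP BY T0.Salary""")
refined_from_s = counts("""
    SELECT S.Salary, COUNT(DISTINCT S.Staff_ID)
    FROM S, Graduate_Student T1
    WHERE S.Student_ID = T1.ID AND T1.GPA > 3.9
    GROUP BY S.Salary""")
```

With two tables the saving is negligible; the benefit would be expected when the selection graph is large and S stands in for a join over many tables that every child refinement would otherwise repeat.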
Summary
• A general approach for speeding up the MRDM framework
• MRDTL is a competitive algorithm for learning from relational databases in terms of both accuracy and time
Future work
• Techniques for handling missing values
• Pruning techniques or complexity regularization
• Use of aggregates for the attribute values
• More extensive evaluation of MRDTL on real-world data sets