CrossMine: Efficient Classification Across Multiple Database Relations
Xiaoxin Yin, Jiawei Han, Jiong Yang (University of Illinois at Urbana-Champaign)
Philip S. Yu (IBM T. J. Watson Research Center)
Roadmap • Introduction, definitions • Problem definition - preliminaries • Tuple ID Propagation • Rule Generation • Negative Tuple Sampling • Performance Study
Introduction, definitions • Most real-world data are stored in relational databases • Multi-relational classification: building a classifier that automatically classifies objects using information stored in multiple relations of a relational database • ILP approaches are the most widely used, but they are not scalable
An Example: Loan Applications • A customer applies for a loan; the bank queries its backend database to decide whether to approve the application or not
The Backend Database (loan application database schema)
• Loan (target relation): loan-id, account-id, date, amount, duration, payment; each tuple has a class label, indicating whether the loan is paid on time
• Account: account-id, district-id, frequency, date
• Order: order-id, account-id, bank-to, account-to, amount, type
• Transaction: trans-id, account-id, date, type, operation, amount, balance, symbol
• Card: card-id, disp-id, type, issue-date
• Disposition: disp-id, account-id, client-id, type
• Client: client-id, birth-date, gender, district-id
• District: district-id, dist-name, region, #people, #lt-500, #lt-2000, #lt-10000, #gt-10000, #city, ratio-urban, avg-salary, unemploy95, unemploy96, den-enter, #crime95, #crime96
How to make decisions on loan applications?
Roadmap • Motivation • Problem definition - preliminaries • Tuple ID Propagation • Rule Generation • Negative Tuple Sampling • Performance Study
Preliminaries • Target relation • Class labels • Predicates • Rules • Decision Trees • Searching for Predicates by Joins
The Problem • The joined relation of Loan, Account, Order, and Transaction ("x-y" represents attribute y in relation x)
Rule Generation • Search for good predicates across multiple relations: each applicant in Loan Applications is linked to tuples in Accounts, Orders, Districts, and other relations
Previous Approaches • Inductive Logic Programming (ILP) • To build a rule, repeatedly find the best predicate • To evaluate a predicate on relation R, first join the target relation with R • Not scalable because • the search space is huge (numerous candidate predicates) • evaluating each predicate is not efficient: to evaluate the predicate Loan(L, +) :- Loan(L, A, ?, ?, ?, ?), Account(A, ?, 'monthly', ?), the Loan relation must first be joined with the Account relation • CrossMine is more scalable and more than one hundred times faster on datasets of reasonable size
CrossMine: An Efficient and Accurate Multi-relational Classifier • Tuple-ID propagation: an efficient and flexible method for virtually joining relations • Confine the rule search process in promising directions • Look-one-ahead: a more powerful search strategy • Negative tuple sampling: improve efficiency while maintaining accuracy
Roadmap • Motivation • Problem definition - preliminaries • Tuple ID Propagation • Rule Generation • Negative Tuple Sampling • Performance Study
Tuple ID Propagation • Instead of performing a physical join, the IDs and class labels of target tuples can be propagated to the Account relation
Tuple ID Propagation
Account ID | Frequency | Open date | Propagated ID | Labels
124        | monthly   | 02/27/93  | 1, 2          | 2+, 0–
108        | weekly    | 09/23/97  | 3             | 0+, 1–
45         | monthly   | 12/09/96  | 4             | 0+, 1–
67         | weekly    | 01/01/97  | Null          | 0+, 0–
• Possible predicates: • Frequency = 'monthly': 2+, 1– • Open date < 01/01/95: 2+, 0–
• Propagate tuple IDs of the target relation to non-target relations
• Virtually join relations to avoid the high cost of physical joins
Tuple ID Propagation (cont.) • Efficient • Only propagate the tuple IDs • Time and space usage is low • Flexible • Can propagate IDs among non-target relations • Many sets of IDs can be kept on one relation, which are propagated from different join paths
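The propagation step can be illustrated with a short sketch. The following Python is a minimal, hypothetical illustration, not the authors' implementation: relations are held as in-memory lists of dictionaries, and the field names (account-id, label) are taken from the loan example above.

    from collections import defaultdict

    def propagate_ids(target_rows, other_rows, key):
        """Attach the IDs and class labels of target tuples to each tuple of a
        non-target relation that joins with them on `key`."""
        # Index target tuples (with their class labels) by the join key.
        by_key = defaultdict(list)
        for tid, row in enumerate(target_rows, start=1):
            by_key[row[key]].append((tid, row["label"]))
        # Attach the propagated IDs and +/- counts to each tuple of the other relation.
        propagated = []
        for row in other_rows:
            matches = by_key.get(row[key], [])
            ids = [tid for tid, _ in matches]
            pos = sum(1 for _, lab in matches if lab == "+")
            new_row = dict(row)
            new_row.update({"ids": ids, "pos": pos, "neg": len(ids) - pos})
            propagated.append(new_row)
        return propagated

    # Loans 1-4 from the example; loans 1 and 2 (account 124) are paid on time.
    loans = [{"account-id": 124, "label": "+"}, {"account-id": 124, "label": "+"},
             {"account-id": 108, "label": "-"}, {"account-id": 45,  "label": "-"}]
    accounts = [{"account-id": 124, "frequency": "monthly"},
                {"account-id": 108, "frequency": "weekly"},
                {"account-id": 45,  "frequency": "monthly"},
                {"account-id": 67,  "frequency": "weekly"}]
    accounts_with_ids = propagate_ids(loans, accounts, "account-id")

After this step each Account tuple carries the IDs and the +/- counts of the loans that join with it, so a predicate such as Frequency = 'monthly' can be evaluated without a physical join.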
Roadmap • Motivation • Problem definition - preliminaries • Tuple ID Propagation • Rule Generation • Negative Tuple Sampling • Performance Study
Overall Procedure • Sequential covering algorithm (a minimal sketch follows):
while (enough target tuples left)
    generate a rule
    remove positive target tuples satisfying this rule
(Illustration: positive examples are successively covered by Rule 1, Rule 2, and Rule 3)
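A minimal sketch of the sequential covering loop, assuming in-memory lists of target tuples, an assumed stopping fraction min_uncovered, and hypothetical helpers generate_rule() (sketched after the rule-search slides) and rule_covers(); it is not the authors' implementation.

    def sequential_covering(db, pos_tuples, neg_tuples, min_uncovered=0.05):
        """Generate rules until too few positive target tuples remain uncovered."""
        rules = []
        remaining_pos = list(pos_tuples)
        while len(remaining_pos) > min_uncovered * len(pos_tuples):
            rule = generate_rule(db, remaining_pos, neg_tuples)  # hypothetical, sketched later
            if rule is None:
                break
            rules.append(rule)
            # Remove the positive target tuples satisfying the new rule.
            remaining_pos = [t for t in remaining_pos if not rule_covers(rule, t)]  # hypothetical test
        return rules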
Rule Generation • To generate a rule:
while (true)
    find the best predicate p
    if foil-gain(p) > threshold then add p to the current rule
    else break
(Illustration: the rule grows from A3=1 to A3=1 && A1=2 to A3=1 && A1=2 && A8=5, separating more positive from negative examples at each step)
Evaluating Predicates • All predicates in a relation can be evaluated based on the propagated IDs • Use foil-gain to evaluate predicates • Suppose the current rule is r, r+p is r extended with predicate p, and P(r), N(r) are the numbers of positive and negative target tuples satisfying r. Then
foil-gain(p) = P(r+p) × [ log2( P(r+p) / (P(r+p) + N(r+p)) ) - log2( P(r) / (P(r) + N(r)) ) ]
• Categorical attributes: compute foil-gain directly • Numerical attributes: discretize by considering every possible value as a split point
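A minimal sketch of the foil-gain computation above; the coverage counts are assumed to come from the IDs and labels propagated to the relation being searched.

    import math

    def foil_gain(p_old, n_old, p_new, n_new):
        """Gain of extending rule r (covering p_old positives, n_old negatives)
        with a predicate so that the extended rule covers p_new / n_new tuples."""
        if p_new == 0 or p_old == 0:
            return 0.0
        return p_new * (math.log2(p_new / (p_new + n_new)) -
                        math.log2(p_old / (p_old + n_old)))

    # Example from the Account slide: the empty rule covers 2 +, 2 -;
    # adding Frequency = 'monthly' leaves 2 +, 1 - covered.
    gain = foil_gain(2, 2, 2, 1)    # about 0.83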
Rule Generation • Start from the target relation • Only the target relation is active • Repeat • Search in all active relations • Search in all relations joinable to active relations • Add the best predicate to the current rule • Set the involved relation to active • Until • The best predicate does not have enough gain • Current rule is too long
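To make the search procedure above concrete, here is a minimal sketch of the greedy loop over active relations. The schema helpers (db.target_relation, db.joinable, db.candidate_predicates, db.coverage) are hypothetical stand-ins for operations performed on the propagated IDs; the thresholds are the ones listed in the performance study. This is a sketch, not the authors' implementation.

    MIN_FOIL_GAIN = 2.5
    MAX_RULE_LENGTH = 6

    def generate_rule(db, pos_tuples, neg_tuples):
        """Grow one rule by repeatedly adding the best predicate found in the
        active relations and the relations joinable to an active relation."""
        rule = []
        active = {db.target_relation}                      # only the target relation is active at first
        p0, n0 = len(pos_tuples), len(neg_tuples)
        while len(rule) < MAX_RULE_LENGTH:
            candidates = set(active)
            for rel in active:
                candidates |= set(db.joinable(rel))        # relations joinable to an active relation
            best = None                                    # (gain, predicate, relation)
            for rel in candidates:
                for pred in db.candidate_predicates(rel):  # evaluated on propagated IDs
                    p1, n1 = db.coverage(rule + [pred], pos_tuples, neg_tuples)
                    g = foil_gain(p0, n0, p1, n1)
                    if best is None or g > best[0]:
                        best = (g, pred, rel)
            if best is None or best[0] < MIN_FOIL_GAIN:
                break                                      # the best predicate does not have enough gain
            _, pred, rel = best
            rule.append(pred)
            active.add(rel)                                # the involved relation becomes active
            p0, n0 = db.coverage(rule, pos_tuples, neg_tuples)
        return rule or None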
Rule Generation: Example • On the loan database schema above, the search starts from the target relation (Loan); the first predicate is found in a relation joinable to Loan, the second predicate in a relation joinable to the now-active relations; at each step the range of search expands along join paths and the best predicate is added to the rule
Look-one-ahead in Rule Generation • Two types of relations: entity relations and relationship relations • Useful predicates often cannot be found on relationship relations reached from the target relation • Solution of CrossMine: when propagating IDs to a relationship relation, propagate one more step to the next entity relation
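A minimal sketch of the look-one-ahead step, under stated assumptions: db.joins_from(), db.rows(), db.is_relationship(), and propagate() are hypothetical helpers (propagate() forwards ID sets, analogous to propagate_ids() sketched earlier); it is not the authors' code.

    def propagate_with_lookahead(db, from_rel, rows_with_ids):
        """Propagate ID sets from `from_rel` to each joinable relation; if that
        relation is a relationship relation (essentially only keys), push the
        IDs one more join ahead to the entity relation behind it."""
        reached = {}
        for rel, key in db.joins_from(from_rel):
            rows = propagate(rows_with_ids, db.rows(rel), key)
            if db.is_relationship(rel):
                # Look one step ahead: useful predicates live on the next entity relation.
                for ent_rel, ent_key in db.joins_from(rel):
                    if ent_rel != from_rel:
                        reached[ent_rel] = propagate(rows, db.rows(ent_rel), ent_key)
            else:
                reached[rel] = rows
        return reached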
Roadmap • Motivation • Problem definition - preliminaries • Tuple ID Propagation • Rule Generation • Negative Tuple Sampling • Performance Study
Negative Tuple Sampling • A rule covers some positive examples • Positive examples are removed once covered • After many rules have been generated, far fewer positive examples remain than negative ones (illustration: the remaining examples are mostly negative)
Negative Tuple Sampling (cont.) • When there are many more negative examples than positive ones • good rules cannot be built (low support) • rule generation is still time consuming (large number of negative examples) • Therefore, sample the negative examples • Improves efficiency without affecting rule quality • Keep T(−) < Neg_Pos_Ratio × T(+) and T(−) < Max_Num_Negative
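A minimal sketch of the sampling step, assuming simple uniform random sampling and the Neg_Pos_Ratio / Max_Num_Negative values reported in the performance study.

    import random

    def sample_negatives(pos_tuples, neg_tuples, neg_pos_ratio=1, max_num_negative=600):
        """Down-sample the negatives so that |neg| <= neg_pos_ratio * |pos|
        and |neg| <= max_num_negative."""
        limit = min(int(neg_pos_ratio * len(pos_tuples)), max_num_negative)
        if len(neg_tuples) <= limit:
            return list(neg_tuples)
        return random.sample(neg_tuples, limit)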
Roadmap • Motivation • Problem definition - preliminaries • Tuple ID Propagation • Rule Generation • Negative Tuple Sampling • Performance Study
Performance study • 1.7 GHz Pentium 4 PC running Windows 2000 • CrossMine-Rule parameters: • Min_Foil_Gain = 2.5 • Max_Rule_Length = 6 • Neg_Pos_Ratio = 1 • Max_Num_Negative = 600
Performance study • Synthetic relational databases are generated, varying • the number of relations • the number of tuples in each relation • the number of foreign keys • Running time and accuracy are compared • CrossMine also runs efficiently on data stored on disk, as in real applications
Synthetic datasets • Scalability w.r.t. the number of relations • Scalability w.r.t. the number of tuples (charts omitted)
Real Datasets • PKDD Cup 99 dataset – loan application • Mutagenesis dataset (4 relations)
References • H. Blockeel, L. De Raedt, and J. Ramon. Top-down induction of logical decision trees. In Proc. of the Fifteenth Int. Conf. on Machine Learning, Madison, WI, 1998. • C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 1998. • L. Dehaspe and H. Toivonen. Discovery of relational association rules. In Relational Data Mining, Springer-Verlag, 2000. • L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic models of relational structure. In Proc. of the 18th Int. Conf. on Machine Learning, Williamstown, MA, 2001. • H. A. Leiva. MRDTL: A multi-relational decision tree learning algorithm. M.S. thesis, Iowa State Univ., 2002. • T. Mitchell. Machine Learning. McGraw-Hill, 1996. • S. Muggleton. Inverse entailment and Progol. New Generation Computing, special issue on Inductive Logic Programming, 1995. • S. Muggleton and C. Feng. Efficient induction of logic programs. In Proc. of the First Conf. on Algorithmic Learning Theory, Tokyo, Japan, 1990. • A. Popescul, L. Ungar, S. Lawrence, and M. Pennock. Towards structural logistic regression: Combining relational and statistical learning. In Proc. of the Multi-Relational Data Mining Workshop, Alberta, Canada, 2002. • J. R. Quinlan. FOIL: A midterm report. In Proc. of the Sixth European Conf. on Machine Learning, Springer-Verlag, 1993. • J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Series in Machine Learning, Morgan Kaufmann, 1993. • B. Taskar, E. Segal, and D. Koller. Probabilistic classification and clustering in relational data. In Proc. of the 17th Int. Joint Conf. on Artificial Intelligence, Seattle, WA, 2001.