CrossMine: Efficient Classification Across Multiple Database Relations
Xiaoxin Yin, Jiawei Han, Jiong Yang (University of Illinois at Urbana-Champaign)
Philip S. Yu (IBM T. J. Watson Research Center)
Roadmap • Introduction, definitions • Problem definition - preliminaries • Tuple ID Propagation • Rule Generation • Negative Tuple Sampling • Performance Study
Introduction, definitions • Most real-world data are stored in relational databases • Multi-relational classification: building a classifier that automatically classifies objects using information stored in multiple relations of a relational database • ILP approaches are the most widely used, but they are not scalable
An Example: Loan Applications • A customer applies for a loan; the bank queries its backend database to decide whether to approve the application or not
The Backend Database (loan application database schema)
• Loan (target relation): loan-id, account-id, date, amount, duration, payment; each tuple has a class label, indicating whether the loan is paid on time
• Account: account-id, district-id, frequency, date
• Order: order-id, account-id, bank-to, account-to, amount, type
• Transaction: trans-id, account-id, date, type, operation, amount, balance, symbol
• Card: card-id, disp-id, type, issue-date
• Disposition: disp-id, account-id, client-id, type
• Client: client-id, birth-date, gender, district-id
• District: district-id, dist-name, region, #people, #lt-500, #lt-2000, #lt-10000, #gt-10000, #city, ratio-urban, avg-salary, unemploy95, unemploy96, den-enter, #crime95, #crime96
How to make decisions on loan applications?
Roadmap • Motivation • Problem definition - preliminaries • Tuple ID Propagation • Rule Generation • Negative Tuple Sampling • Performance Study
Preliminaries • Target relation • Class labels • Predicates • Rules • Decision Trees • Searching for Predicates by Joins
The Problem • The joined relation of Loan, Account, Order, and Transaction ("x-y" represents attribute y in relation x)
Rule Generation • Search for good predicates across multiple relations: each applicant in Loan Applications is linked to tuples in Accounts, Orders, Districts, and other relations
Previous Approaches • Inductive Logic Programming (ILP) • To build a rule, repeatedly find the best predicate • To evaluate a predicate on relation R, first join the target relation with R • Not scalable because • the search space is huge (numerous candidate predicates) • evaluating each predicate is not efficient: to evaluate the predicate Loan(L, +) :- Loan(L, A, ?, ?, ?, ?), Account(A, ?, 'monthly', ?), the Loan relation must first be joined with the Account relation • CrossMine is more scalable and more than one hundred times faster on datasets of reasonable size
CrossMine: An Efficient and Accurate Multi-relational Classifier • Tuple-ID propagation: an efficient and flexible method for virtually joining relations • Confine the rule search process in promising directions • Look-one-ahead: a more powerful search strategy • Negative tuple sampling: improve efficiency while maintaining accuracy
Roadmap • Motivation • Problem definition - preliminaries • Tuple ID Propagation • Rule Generation • Negative Tuple Sampling • Performance Study
Tuple ID Propagation • Instead of performing a physical join, the IDs and class labels of target tuples can be propagated to the Account relation
Tuple ID Propagation
Account ID | Frequency | Open date | Propagated ID | Labels
124        | monthly   | 02/27/93  | 1, 2          | 2+, 0–
108        | weekly    | 09/23/97  | 3             | 0+, 1–
45         | monthly   | 12/09/96  | 4             | 0+, 1–
67         | weekly    | 01/01/97  | Null          | 0+, 0–
• Possible predicates: • Frequency = 'monthly': 2+, 1– • Open date < 01/01/95: 2+, 0–
• Propagate tuple IDs of the target relation to non-target relations
• Virtually join relations to avoid the high cost of physical joins
Tuple ID Propagation (cont.) • Efficient • Only propagate the tuple IDs • Time and space usage is low • Flexible • Can propagate IDs among non-target relations • Many sets of IDs can be kept on one relation, which are propagated from different join paths
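The propagation step can be illustrated with a short sketch. The following Python is a minimal, hypothetical illustration, not the authors' implementation: relations are held as in-memory lists of dictionaries, and the field names (account-id, label) are taken from the loan example above.

    from collections import defaultdict

    def propagate_ids(target_rows, other_rows, key):
        """Attach the IDs and class labels of target tuples to each tuple of a
        non-target relation that joins with them on `key`."""
        # Index target tuples (with their class labels) by the join key.
        by_key = defaultdict(list)
        for tid, row in enumerate(target_rows, start=1):
            by_key[row[key]].append((tid, row["label"]))
        # Attach the propagated IDs and +/- counts to each tuple of the other relation.
        propagated = []
        for row in other_rows:
            matches = by_key.get(row[key], [])
            ids = [tid for tid, _ in matches]
            pos = sum(1 for _, lab in matches if lab == "+")
            new_row = dict(row)
            new_row.update({"ids": ids, "pos": pos, "neg": len(ids) - pos})
            propagated.append(new_row)
        return propagated

    # Loans 1-4 from the example; loans 1 and 2 (account 124) are paid on time.
    loans = [{"account-id": 124, "label": "+"}, {"account-id": 124, "label": "+"},
             {"account-id": 108, "label": "-"}, {"account-id": 45,  "label": "-"}]
    accounts = [{"account-id": 124, "frequency": "monthly"},
                {"account-id": 108, "frequency": "weekly"},
                {"account-id": 45,  "frequency": "monthly"},
                {"account-id": 67,  "frequency": "weekly"}]
    accounts_with_ids = propagate_ids(loans, accounts, "account-id")

After this step each Account tuple carries the IDs and the +/- counts of the loans that join with it, so a predicate such as Frequency = 'monthly' can be evaluated without a physical join.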
Roadmap • Motivation • Problem definition - preliminaries • Tuple ID Propagation • Rule Generation • Negative Tuple Sampling • Performance Study
Overall Procedure • Sequential covering algorithm (a minimal sketch follows):
while (enough target tuples left)
    generate a rule
    remove positive target tuples satisfying this rule
(Illustration: positive examples are successively covered by Rule 1, Rule 2, and Rule 3)
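A minimal sketch of the sequential covering loop, assuming in-memory lists of target tuples, an assumed stopping fraction min_uncovered, and hypothetical helpers generate_rule() (sketched after the rule-search slides) and rule_covers(); it is not the authors' implementation.

    def sequential_covering(db, pos_tuples, neg_tuples, min_uncovered=0.05):
        """Generate rules until too few positive target tuples remain uncovered."""
        rules = []
        remaining_pos = list(pos_tuples)
        while len(remaining_pos) > min_uncovered * len(pos_tuples):
            rule = generate_rule(db, remaining_pos, neg_tuples)  # hypothetical, sketched later
            if rule is None:
                break
            rules.append(rule)
            # Remove the positive target tuples satisfying the new rule.
            remaining_pos = [t for t in remaining_pos if not rule_covers(rule, t)]  # hypothetical test
        return rules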
Rule Generation • To generate a rule:
while (true)
    find the best predicate p
    if foil-gain(p) > threshold then add p to the current rule
    else break
(Illustration: the rule grows from A3=1 to A3=1 && A1=2 to A3=1 && A1=2 && A8=5, separating more positive from negative examples at each step)
Evaluating Predicates • All predicates in a relation can be evaluated based on the propagated IDs • Use foil-gain to evaluate predicates • Suppose the current rule is r, r+p is r extended with predicate p, and P(r), N(r) are the numbers of positive and negative target tuples satisfying r. Then
foil-gain(p) = P(r+p) × [ log2( P(r+p) / (P(r+p) + N(r+p)) ) - log2( P(r) / (P(r) + N(r)) ) ]
• Categorical attributes: compute foil-gain directly • Numerical attributes: discretize by considering every possible value as a split point
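A minimal sketch of the foil-gain computation above; the coverage counts are assumed to come from the IDs and labels propagated to the relation being searched.

    import math

    def foil_gain(p_old, n_old, p_new, n_new):
        """Gain of extending rule r (covering p_old positives, n_old negatives)
        with a predicate so that the extended rule covers p_new / n_new tuples."""
        if p_new == 0 or p_old == 0:
            return 0.0
        return p_new * (math.log2(p_new / (p_new + n_new)) -
                        math.log2(p_old / (p_old + n_old)))

    # Example from the Account slide: the empty rule covers 2 +, 2 -;
    # adding Frequency = 'monthly' leaves 2 +, 1 - covered.
    gain = foil_gain(2, 2, 2, 1)    # about 0.83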
Rule Generation • Start from the target relation • Only the target relation is active • Repeat • Search in all active relations • Search in all relations joinable to active relations • Add the best predicate to the current rule • Set the involved relation to active • Until • The best predicate does not have enough gain • Current rule is too long
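To make the search procedure above concrete, here is a minimal sketch of the greedy loop over active relations. The schema helpers (db.target_relation, db.joinable, db.candidate_predicates, db.coverage) are hypothetical stand-ins for operations performed on the propagated IDs; the thresholds are the ones listed in the performance study. This is a sketch, not the authors' implementation.

    MIN_FOIL_GAIN = 2.5
    MAX_RULE_LENGTH = 6

    def generate_rule(db, pos_tuples, neg_tuples):
        """Grow one rule by repeatedly adding the best predicate found in the
        active relations and the relations joinable to an active relation."""
        rule = []
        active = {db.target_relation}                      # only the target relation is active at first
        p0, n0 = len(pos_tuples), len(neg_tuples)
        while len(rule) < MAX_RULE_LENGTH:
            candidates = set(active)
            for rel in active:
                candidates |= set(db.joinable(rel))        # relations joinable to an active relation
            best = None                                    # (gain, predicate, relation)
            for rel in candidates:
                for pred in db.candidate_predicates(rel):  # evaluated on propagated IDs
                    p1, n1 = db.coverage(rule + [pred], pos_tuples, neg_tuples)
                    g = foil_gain(p0, n0, p1, n1)
                    if best is None or g > best[0]:
                        best = (g, pred, rel)
            if best is None or best[0] < MIN_FOIL_GAIN:
                break                                      # the best predicate does not have enough gain
            _, pred, rel = best
            rule.append(pred)
            active.add(rel)                                # the involved relation becomes active
            p0, n0 = db.coverage(rule, pos_tuples, neg_tuples)
        return rule or None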
Rule Generation: Example • On the loan database schema above, the search starts from the target relation (Loan); the first predicate is found in a relation joinable to Loan, the second predicate in a relation joinable to the now-active relations; at each step the range of search expands along join paths and the best predicate is added to the rule
Look-one-ahead in Rule Generation • Two types of relations: entity relations and relationship relations • Useful predicates often cannot be found on relationship relations reached from the target relation • Solution of CrossMine: when propagating IDs to a relationship relation, propagate one more step to the next entity relation
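A minimal sketch of the look-one-ahead step, under stated assumptions: db.joins_from(), db.rows(), db.is_relationship(), and propagate() are hypothetical helpers (propagate() forwards ID sets, analogous to propagate_ids() sketched earlier); it is not the authors' code.

    def propagate_with_lookahead(db, from_rel, rows_with_ids):
        """Propagate ID sets from `from_rel` to each joinable relation; if that
        relation is a relationship relation (essentially only keys), push the
        IDs one more join ahead to the entity relation behind it."""
        reached = {}
        for rel, key in db.joins_from(from_rel):
            rows = propagate(rows_with_ids, db.rows(rel), key)
            if db.is_relationship(rel):
                # Look one step ahead: useful predicates live on the next entity relation.
                for ent_rel, ent_key in db.joins_from(rel):
                    if ent_rel != from_rel:
                        reached[ent_rel] = propagate(rows, db.rows(ent_rel), ent_key)
            else:
                reached[rel] = rows
        return reached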
Roadmap • Motivation • Problem definition - preliminaries • Tuple ID Propagation • Rule Generation • Negative Tuple Sampling • Performance Study
Negative Tuple Sampling • A rule covers some positive examples • Positive examples are removed once covered • After many rules have been generated, far fewer positive examples remain than negative ones (illustration: the remaining examples are mostly negative)
Negative Tuple Sampling (cont.) • When there are many more negative examples than positive ones • good rules cannot be built (low support) • rule generation is still time consuming (large number of negative examples) • Therefore, sample the negative examples • Improves efficiency without affecting rule quality • Keep T(−) < Neg_Pos_Ratio × T(+) and T(−) < Max_Num_Negative
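A minimal sketch of the sampling step, assuming simple uniform random sampling and the Neg_Pos_Ratio / Max_Num_Negative values reported in the performance study.

    import random

    def sample_negatives(pos_tuples, neg_tuples, neg_pos_ratio=1, max_num_negative=600):
        """Down-sample the negatives so that |neg| <= neg_pos_ratio * |pos|
        and |neg| <= max_num_negative."""
        limit = min(int(neg_pos_ratio * len(pos_tuples)), max_num_negative)
        if len(neg_tuples) <= limit:
            return list(neg_tuples)
        return random.sample(neg_tuples, limit)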
Roadmap • Motivation • Problem definition - preliminaries • Tuple ID Propagation • Rule Generation • Negative Tuple Sampling • Performance Study
Performance study • 1.7 GHz Pentium 4 PC running Windows 2000 • CrossMine-Rule parameters: • Min_Foil_Gain = 2.5 • Max_Rule_Length = 6 • Neg_Pos_Ratio = 1 • Max_Num_Negative = 600
Performance study • Synthetic relational databases are generated, varying • the number of relations • the number of tuples in each relation • the number of foreign keys • Running time and accuracy are compared • CrossMine also runs efficiently on data stored on disk, as in real applications
Synthetic datasets • Scalability w.r.t. the number of relations • Scalability w.r.t. the number of tuples (charts omitted)
Real Datasets • PKDD Cup 99 dataset – loan application • Mutagenesis dataset (4 relations)
References • H. Blockeel, L. De Raedt, and J. Ramon. Top-down induction of logical decision trees. In Proc. of the Fifteenth Int. Conf. on Machine Learning, Madison, WI, 1998. • C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 1998. • L. Dehaspe and H. Toivonen. Discovery of relational association rules. In Relational Data Mining, Springer-Verlag, 2000. • L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic models of relational structure. In Proc. of the 18th Int. Conf. on Machine Learning, Williamstown, MA, 2001. • H. A. Leiva. MRDTL: A multi-relational decision tree learning algorithm. M.S. thesis, Iowa State Univ., 2002. • T. Mitchell. Machine Learning. McGraw-Hill, 1996. • S. Muggleton. Inverse entailment and Progol. New Generation Computing, special issue on Inductive Logic Programming, 1995. • S. Muggleton and C. Feng. Efficient induction of logic programs. In Proc. of the First Conf. on Algorithmic Learning Theory, Tokyo, Japan, 1990. • A. Popescul, L. Ungar, S. Lawrence, and M. Pennock. Towards structural logistic regression: Combining relational and statistical learning. In Proc. of the Multi-Relational Data Mining Workshop, Alberta, Canada, 2002. • J. R. Quinlan. FOIL: A midterm report. In Proc. of the Sixth European Conf. on Machine Learning, Springer-Verlag, 1993. • J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Series in Machine Learning, Morgan Kaufmann, 1993. • B. Taskar, E. Segal, and D. Koller. Probabilistic classification and clustering in relational data. In Proc. of the 17th Int. Joint Conf. on Artificial Intelligence, Seattle, WA, 2001.