Boosting tuple propagation in multi-relational classification
Lucantonio Ghionna, Gianluigi Greco
Dept. of Mathematics, University of Calabria, Italy
15th International Database Engineering & Applications Symposium, Lisbon, Portugal, 21-23 September 2011
Outline
• Background
  • Multi-Relational Classification
• Problem Complexity
  • Tractability Islands
  • Heuristic Approaches
• DBMS Implementation
  • System Design
  • Experiments
• Concluding Remarks
Multi-Relational Classification
• Loan(loan-id, account-id, date, amount, duration, payment). Target relation: each tuple has a class label indicating whether a loan is paid on time.
• Account(account-id, district-id, frequency, date)
• Card(card-id, disp-id, type, issue-date)
• Transaction(trans-id, account-id, date, type, operation, amount, balance, symbol)
• Disposition(disp-id, account-id, client-id)
• Order(order-id, account-id, bank-to, account-to, amount, type)
• Client(client-id, birth-date, gender, district-id)
• District(district-id, dist-name, region, #people, #lt-500, #lt-2000, #lt-10000, #gt-10000, #city, ratio-urban, avg-salary, unemploy95, unemploy96, den-enter, #crime95, #crime96)
How to make decisions on loan granting?
Multi-Relational Classification
• Search for good predicates across multiple relations
• Example: do good payers access their account with a "monthly" frequency?
[Figure: loan applications (Applicant #1 to Applicant #4) linked to Orders, Accounts, Districts, and other relations]
Solving CLP: State of the Art
• Flattening approach [Krogel03]
  • Build the universal relation through joins
  • Combinatorial explosion of data, large tables with many attributes [Mugg92]
• Upgrading approach [Xu06]
  • Keep the universal relation virtual by propagating labels through foreign keys
  • Global perspective [Xu06]
  • Local perspective [Blockeel03, Yin04, Xu06]
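The upgrading idea can be illustrated with a minimal sketch, assuming a toy in-memory representation in which each relation is a list of dictionaries. Instead of materializing the join, every non-target tuple is annotated with the class-label counts of the target tuples that reach it through a foreign key. The function name `propagate_labels`, the `label` attribute, and the sample rows are hypothetical and are not taken from the presentation.

```python
from collections import defaultdict

def propagate_labels(target, other, fk):
    """Annotate each row of `other` with the class-label counts of the
    target rows that join with it on the foreign key `fk`."""
    counts_by_key = defaultdict(lambda: defaultdict(int))
    for row in target:
        counts_by_key[row[fk]][row["label"]] += 1
    # .get() avoids creating empty entries for keys with no target rows
    return [
        {**row, "label_counts": dict(counts_by_key.get(row[fk], {}))}
        for row in other
    ]

# Hypothetical sample data in the spirit of the Loan/Account schema
loans = [
    {"loan_id": 1, "account_id": 10, "label": "paid"},
    {"loan_id": 2, "account_id": 10, "label": "late"},
    {"loan_id": 3, "account_id": 11, "label": "paid"},
]
accounts = [
    {"account_id": 10, "frequency": "monthly"},
    {"account_id": 11, "frequency": "weekly"},
    {"account_id": 12, "frequency": "monthly"},
]

annotated = propagate_labels(loans, accounts, "account_id")
for row in annotated:
    print(row)
```

The joined table is never built: each account row carries only a small summary of the labels that would have been attached to it in the universal relation.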
Contributions
• We show that the propagation problem can be solved efficiently on databases whose hypergraphs are nearly acyclic
• We design effective algorithms for the global and local perspectives
• We provide an implementation of a complete JDBC-based system for tuple propagation
• Experiments
Problem Complexity • Tractability Islands • Heuristic Approaches
Global Perspective: Tractability Islands of CLP
• Good news: exponentially large universal relations do not imply that CLP is intractable [Xu06]
• Example atoms: p1(X,Y), p2(X,Z,W), p5(Y,T,X), processed first bottom up and then top down
• CLP is tractable on dependency graphs whose undirected versions are (forests of) trees [Xu06]
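The bottom-up and top-down passes can be sketched as a two-pass semijoin reduction over a tree of relations, in the style of the classical algorithms for acyclic schemas [Yannakakis81]. This is only an illustration under stated assumptions: the tree shape (p1 as root with children p2 and p5), the sample tuples, and the names `semijoin` and `full_reduce` are hypothetical, and the sketch is not the procedure of [Xu06].

```python
def semijoin(left, right):
    """Keep the rows of `left` that agree with at least one row of `right`
    on every attribute the two relations share."""
    if not left or not right:
        return []
    shared = sorted(set(left[0]) & set(right[0]))
    keys = {tuple(r[a] for a in shared) for r in right}
    return [r for r in left if tuple(r[a] for a in shared) in keys]

def full_reduce(relations, children, root):
    """Two-pass semijoin reduction over a tree of relations.
    `children` maps a node name to the list of its child names."""
    relations = dict(relations)  # do not modify the caller's mapping

    def bottom_up(node):
        # Reduce the children first, then filter the parent with each child.
        for child in children.get(node, []):
            bottom_up(child)
            relations[node] = semijoin(relations[node], relations[child])

    def top_down(node):
        # Filter each child with its already reduced parent, then recurse.
        for child in children.get(node, []):
            relations[child] = semijoin(relations[child], relations[node])
            top_down(child)

    bottom_up(root)
    top_down(root)
    return relations

# Hypothetical instance for the atoms p1(X,Y), p2(X,Z,W), p5(Y,T,X)
relations = {
    "p1": [{"X": 1, "Y": 1}, {"X": 2, "Y": 2}, {"X": 3, "Y": 3}],
    "p2": [{"X": 1, "Z": 0, "W": 0}, {"X": 2, "Z": 0, "W": 0}],
    "p5": [{"Y": 1, "T": 0, "X": 1}, {"Y": 3, "T": 0, "X": 3},
           {"Y": 9, "T": 0, "X": 9}],
}
children = {"p1": ["p2", "p5"]}

reduced = full_reduce(relations, children, "p1")
for name, rows in reduced.items():
    print(name, rows)
```

After both passes every remaining tuple takes part in at least one answer of the full join, yet no intermediate result is ever larger than an input relation.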
Tractability Islands of CLP. Are trees enough?
Q = R1(B1, A1, …, Am), R2(B2, A1, …, Am), …, Rn(Bn, A1, …, Am), R’1(A1), R’2(A2), …, R’m(Am)
• The (undirected) dependency graph is a bipartite clique of size m × n; hence it is not a tree, and the result in [Xu06] does not apply
• CLP is still tractable!
Tractability Islands of CLP. Hypertree Decompositions
Q = R1(B1, A1, …, Am), R2(B2, A1, …, Am), …, Rn(Bn, A1, …, Am), R’1(A1), R’2(A2), …, R’m(Am)
[Figure: hypergraph of Q and a hypertree decomposition; node labels include {B1, …, Bm, A1, …, Am} with R1, R2, …, Rm, and {A1} R’1, {A2} R’2, …, {Am} R’m]
• For fixed k:
  • deciding whether hw(Q) ≤ k is in P [Gottlob02]
  • computing hypertree decompositions is in P [Gottlob02]
Tractability Islands of CLP. Hypertree Decompositions
• Cyclic dependency graph …
• … bounded width!
CLPonDBk algorithm
• Bottom-up phase
[Figure: decomposition tree with nodes Loan, Transaction, Order, Account, {Account, Disposition}, Card, Client, District]
CLPonDBk algorithm
• Top-down phase
[Figure: decomposition tree with nodes Loan, Transaction, Order, Account, {Account, Disposition}, Card, Client, District; each node is annotated with <1,1> as the top-down pass reaches it]
CLPonDBk solves CLP in time O(|D| × max_{Ri ∈ D} ||Ri||^(k+3)) on the class of instances whose associated hypergraphs have hypertree width bounded by k.
L-CLP: Local Perspective on the Propagation Problem
In several multi-relational approaches, CLP is heuristically restricted to portions of the database.
• Reducing the search space can speed up the computation in practice
• Still, joining many relations may be computationally challenging
L-CLP: NTtoT_onDBMS and TtoNT_onDBMS
A propagation path from R1 to Rm only requires joining pairs of “adjacent” relations.
• “Target to Non-Target” propagation (TtoNT_onDBMS): propagate information from R1 to Rm, then evaluate C on the result
• “Non-Target to Target” propagation (NTtoT_onDBMS): start by filtering Rm with the condition C, join the result with Rm-1, and iterate the process back to R1
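The two propagation directions can be compared on a small example. The sketch below is not the authors' JDBC system: it uses Python's built-in `sqlite3` module as a stand-in DBMS, and the three-relation path loan → account → district, the column names, the sample rows, and the functions `t_to_nt` and `nt_to_t` are all hypothetical. The condition C is taken to be `region = ?` on the last relation of the path.

```python
import sqlite3

def build_db():
    """Create a tiny in-memory database with a path loan -> account -> district."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE loan(loan_id INTEGER, account_id INTEGER, label TEXT);
        CREATE TABLE account(account_id INTEGER, district_id INTEGER);
        CREATE TABLE district(district_id INTEGER, region TEXT);
        INSERT INTO loan VALUES (1, 10, 'paid'), (2, 11, 'paid'), (3, 12, 'late');
        INSERT INTO account VALUES (10, 100), (11, 101), (12, 100);
        INSERT INTO district VALUES (100, 'north'), (101, 'south');
    """)
    return con

def t_to_nt(con, label, region):
    """Target to non-target: carry loan ids forward along the path,
    then evaluate the condition on the last relation."""
    rows = con.execute("""
        WITH step1 AS (
            SELECT DISTINCT l.loan_id, a.district_id
            FROM loan l JOIN account a ON l.account_id = a.account_id
            WHERE l.label = ?
        )
        SELECT DISTINCT s.loan_id
        FROM step1 s JOIN district d ON s.district_id = d.district_id
        WHERE d.region = ?
        ORDER BY s.loan_id
    """, (label, region)).fetchall()
    return [r[0] for r in rows]

def nt_to_t(con, label, region):
    """Non-target to target: filter the last relation with the condition,
    then walk back to the target using semi-joins (IN subqueries)."""
    rows = con.execute("""
        WITH f_district AS (
            SELECT district_id FROM district WHERE region = ?
        ),
        f_account AS (
            SELECT account_id FROM account
            WHERE district_id IN (SELECT district_id FROM f_district)
        )
        SELECT loan_id FROM loan
        WHERE label = ? AND account_id IN (SELECT account_id FROM f_account)
        ORDER BY loan_id
    """, (region, label)).fetchall()
    return [r[0] for r in rows]

con = build_db()
print(t_to_nt(con, "paid", "north"))
print(nt_to_t(con, "paid", "north"))
```

Both functions return the same set of target keys; they differ only in which intermediate results are built, which is why their relative cost depends on the relation sizes and on the selectivity of C.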
L-CLP: NTtoT_onDBMS and TtoNT_onDBMS
[Figure: TtoNT_onDBMS and NTtoT_onDBMS]
DBMS Implementation • System Design • Experiments
Experimentation Settings
Scenarios:
• CrossMine + NTtoT_onDBMS
• CrossMine + TtoNT_onDBMS
• CrossMine + TupleIDPropagation
Parameters:
• The number m of relations
• The number ||target|| of tuples in the target relation
• The “propagation ratio” ||target||/||R||
• The selectivity s of each join attribute
Environment: 2.1 GHz Centrino PC, 1 GB RAM, 5400 rpm hard disk (Windows XP Professional)
Computation Time and Propagation Time
Setting: m = 5; ||target||/||R|| = 1; s = 50%
• Dramatic improvements w.r.t. standard CrossMine
• Effective scaling for large relations
Gains w.r.t. CrossMine
Setting: m = 5; s = 50%
NTtoT_onDBMS or TtoNT_onDBMS?
• Gain on propagation up to 95%
• Gain on computation time up to 90%
NTtoT_onDBMS vs TtoNT_onDBMS
Settings: ||target|| = 100000; m = 5; s = 50%; ||target||/||R|| = 1
• TtoNT_onDBMS performs best with a low propagation ratio
• NTtoT_onDBMS performs best when the target relation is much larger than the other relations
• Semi-join operators are a winning choice in practical database applications
Conclusion and Discussion
CLP is a challenging task that can be effectively tackled using state-of-the-art query-optimization methods.
• Propagation over a large class of nearly-acyclic database schemas is in fact tractable (polynomial upper bound guarantee)
• The result in [Xu06] emerges as a special case
• A database implementation of local-perspective methods shows tremendous benefits w.r.t. standard in-memory strategies
Potential benefits for many classification algorithms, such as Bayesian classifiers [Getoor01], probabilistic models [Taskar02], and decision tree learning methods [Leiva03].
References
• P. A. Bernstein and N. Goodman. Power of natural semijoins. SIAM Journal on Computing, 10(4):751–771, 1981.
• H. Blockeel and L. De Raedt. Top-down Induction of First-Order Logical Decision Trees. Artificial Intelligence, 101(1-2):285–297, 1998.
• H. Blockeel and M. Sebag. Scalability and Efficiency in Multi-relational Data Mining. SIGKDD Explorations Newsletter, 5(1):17–30, 2003.
• M. Ceci and D. Malerba. Mr-SBC: a Multi-Relational Naive Bayes Classifier. In Proc. of PKDD’03, pages 95–106, 2003.
• S. Džeroski. Multi-relational Data Mining: an Introduction. SIGKDD Explorations Newsletter, 5(1):1–16, 2003.
• P. A. Flach and N. Lachiche. 1BC2: A True First-Order Bayesian Classifier. In Proc. of ILP’02, pages 133–148, 2002.
• R. Frank and F.M.M. Ester. A Method for Multi-relational Classification Using Single and Multi-feature Aggregation Functions. In Proc. of PKDD’07, pages 430–437, 2007.
• L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning Probabilistic Models of Relational Structure. In Proc. of ICML’01, pages 170–177, 2001.
• G. Gottlob, N. Leone, and F. Scarcello. Hypertree decomposition and tractable queries. Journal of Computer and System Sciences, 64:579–627, 2002.
• G. Gottlob, Z. Miklos, and T. Schwentick. Generalized hypertree decompositions: NP-hardness and tractable variants. In Proc. of PODS’07, pages 13–22, 2007.
• H. Guo and H. L. Viktor. Multirelational classification: a multiple view approach. Knowledge and Information Systems, 17(3):287–312, 2008.
• G. Jing-Feng, L. Jing, and B. Wei-Feng. An Efficient Relational Decision Tree Classification Algorithm. In Proc. of ICNC’07, pages 530–534, 2007.
• M. A. Krogel, S. Rawles, F. Zelezny, P. A. Flach, N. Lavrac, and S. Wrobel. Comparative Evaluation of Approaches to Propositionalization. In Proc. of ILP’03, pages 197–214, 2003.
• H. Leiva, A. Atramentov, and V. Honavar. A Multi-relational Decision Tree Learning Algorithm. In Proc. of ILP’03, pages 97–112, 2002.
• H. Liu, X. Yin, and J. Han. An efficient Multi-relational Naïve Bayesian classifier based on Semantic Relationship Graph. In Proc. of MRDM’05, pages 39–48, 2005.
• S. Muggleton. Inductive Logic Programming. Academic Press, New York, 1992.
• J. Neville, D. Jensen, L. Friedland, and M. Hay. Learning Relational Probability Trees. In Proc. of KDD’03, pages 625–630, 2003.
• J. Neville, D. Jensen, and B. Gallagher. Simple Estimators for Relational Bayesian Classifiers. In Proc. of ICDM’03, page 609, 2003.
• U. Pompe and I. Kononenko. Naive Bayesian Classifier within ILP-R. In Proc. of ILP’95, pages 417–436, 1995.
• B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In Proc. of UAI’02, 2002.
• K. Wang, Y. Xu, P.S. Yu, and R. She. Building Decision Trees on Records Linked through Key References. In Proc. of SDM’05, 2005.
• Y. Xu, K. Wang, A. Wai-Chee Fu, R. She, and J. Pei. Classification Spanning Correlated Data Streams. In Proc. of CIKM’06, pages 132–141, 2006.
• M. Yannakakis. Algorithms for acyclic database schemes. In Proc. of VLDB’81, pages 82–94.
• X. Yin, J. Han, J. Yang, and P.S. Yu. CrossMine: Efficient Classification Across Multiple Database Relations. In Proc. of ICDE’04, page 399, 2004.
Multi-Relational Classification: Formal Framework
• Input: D (with target having attribute CL), I, a class label ‘l’, and a condition C over the attributes of some relation R ∈ D
• Output: π_key[target]( σ_{C ∧ target.CL=‘l’}( R(D, I) ) )
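As a point of reference, the output above can be computed naively by building the universal relation and then selecting and projecting. The sketch below assumes one reading of the definition, namely that the output is the set of target keys whose tuples carry label ‘l’ and join with some tuple satisfying C. The function names, the `CL` and `loan_id` attribute choices, and the sample rows are hypothetical, and this flattening baseline is exactly what the propagation methods try to avoid.

```python
def natural_join(left, right):
    """Nested-loop natural join of two relations given as lists of dicts."""
    out = []
    for l in left:
        for r in right:
            if all(l[a] == r[a] for a in l.keys() & r.keys()):
                out.append({**l, **r})
    return out

def clp_naive(relations, key, label, condition):
    """Materialize the universal relation, keep rows with the given class
    label that satisfy `condition`, and project onto the target key."""
    universal = relations[0]
    for rel in relations[1:]:
        universal = natural_join(universal, rel)
    return sorted({row[key] for row in universal
                   if row["CL"] == label and condition(row)})

# Hypothetical data: the first relation plays the role of the target.
loan = [
    {"loan_id": 1, "account_id": 10, "CL": "paid"},
    {"loan_id": 2, "account_id": 11, "CL": "late"},
    {"loan_id": 3, "account_id": 11, "CL": "paid"},
]
account = [
    {"account_id": 10, "frequency": "monthly"},
    {"account_id": 11, "frequency": "weekly"},
]

result = clp_naive([loan, account], "loan_id", "paid",
                   lambda row: row["frequency"] == "monthly")
print(result)
```

The intermediate `universal` list grows with every join, which is the combinatorial explosion that motivates keeping the universal relation virtual.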
• {account-id, district-id} {Account}
• {transaction-id, account-id} {Transaction}
• {account-id, disp-id, client-id, district-id} {Account, Disposition}
• {loan-id, account-id} {Loan}
• {disp-id, card-id} {Card}
• {client-id, district-id} {Client}
• {order-id, account-id} {Order}
• {district-id} {District}