
ADBIS 2007 Discretization Numbers for Multiple-Instances Problem in Relational Database

Rayner Alfred, Dimitar Kazakov. Artificial Intelligence Group, Computer Science Department, York University (30th September, 2007).


Presentation Transcript


  1. ADBIS 2007: Discretization Numbers for Multiple-Instances Problem in Relational Database. Rayner Alfred, Dimitar Kazakov. Artificial Intelligence Group, Computer Science Department, York University (30th September, 2007)

  2. Overview
     • Introduction
     • Objectives
     • Experimental Design
     • Data Pre-processing: Discretization
     • Data Summarization (DARA)
     • Experimental Evaluation
     • Experimental Results
     • Conclusions
     ADBIS 2007, Varna, Bulgaria

  3. Introduction
     • Handling numerical data stored in a relational database is unique because of
       • the multiple occurrences of an individual record in the non-target table and
       • non-determinate relations between tables.
     • Most traditional data mining methods deal with a single table, and their discretization process is based on a single table.
     • In a relational database, multiple records with numerical attributes in one table are associated with a single structured individual stored in the target table.
     • Numbers in multi-relational data mining (MRDM) are often discretized after considering the schema of the relational database.

  4. Introduction
     • This paper considers different alternatives for dealing with continuous attributes in MRDM.
     • The discretization procedures considered include algorithms
       • that do not depend on the multi-relational structure and
       • that are sensitive to this structure.
     • Several discretization methods are implemented, including the proposed entropy-instance-based discretization, embedded in the DARA algorithm.

  5. Objectives
     • To study the effects of taking the one-to-many association issue into consideration when discretizing continuous numbers.
     • To propose the entropy-instance-based discretization method, which is embedded in the DARA algorithm.
     • In the DARA algorithm, we employ several discretization methods in conjunction with the C4.5 classifier as the induction algorithm.
     • We demonstrate empirically that discretization can be improved by taking the multiple-instance problem into consideration.

  6. Experimental Design
     • Data Pre-processing
       • Discretization of continuous attributes in a multi-relational setting using the entropy-instance-based algorithm
     • Data Aggregation
       • Data summarization using DARA, based on cluster dispersion and impurity
     • Evaluation of the discretization methods using C4.5 classifiers
     [Diagram: Relational Data, Categorical Data and Summarized Data, connected by two steps: "Discretization of Continuous Attributes Using Entropy-Instance-Based Algorithm" and "Data Summarization using DARA based on Cluster Dispersion and Impurity". Note on the diagram: learning can be done using any traditional attribute-value (AV) data mining method.]

  7. Data Pre-processing: Discretization
     • To study the effects of the one-to-many association issue when discretizing continuous numbers.
     • To propose the entropy-instance-based discretization method, which is embedded in the DARA algorithm.
     • In the DARA algorithm, we employ several discretization methods in conjunction with the C4.5 classifier as the induction algorithm:
       • Equal Height: each bin has the same number of samples
       • Equal Weight: considers the distribution of the numeric values present and the groups they appear in
       • Entropy-Based: uses the class information entropy
       • Entropy-Instance-Based: uses the class information entropy and the individual information entropy
     • We demonstrate that discretization can be improved by considering the one-to-many problem.
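
The Equal Height method listed above can be illustrated with a minimal Python sketch. The function names `equal_height_cut_points` and `assign_bin` are hypothetical, and placing each cut point midway between neighbouring sorted values is an assumption; the slides only state that each bin holds the same number of samples.

```python
def equal_height_cut_points(values, bins):
    """Return bins-1 cut points so that each bin holds roughly the same
    number of samples (equal-height / equal-frequency binning)."""
    ordered = sorted(values)
    n = len(ordered)
    if bins < 1 or n < bins:
        raise ValueError("need at least one value per bin")
    cuts = []
    for j in range(1, bins):
        idx = j * n // bins  # index of the first value belonging to bin j
        # Assumption: place the boundary midway between the two neighbours.
        cuts.append((ordered[idx - 1] + ordered[idx]) / 2)
    return cuts


def assign_bin(value, cuts):
    """Return the 0-based index of the bin that `value` falls into."""
    return sum(value > c for c in cuts)


if __name__ == "__main__":
    data = list(range(1, 13))            # twelve samples: 1..12
    cuts = equal_height_cut_points(data, 3)
    print(cuts)                          # two boundaries, four samples per bin
    print([assign_bin(v, cuts) for v in data])
```

With ties in the data the bins can only be approximately equal in size, since identical values always land in the same bin.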

  8. Entropy-Instance-Based (EIB) Discretization
     • Background
       • Based on the entropy-based multi-interval discretization method (Fayyad and Irani 1993)
       • Given a set of instances S, two subsets of S, S1 and S2, a feature A, and a partition boundary T, the class information entropy is
         $E(A,T;S) = \frac{|S_1|}{|S|}\,\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Ent}(S_2)$
         where Ent(Sj) is the class entropy of subset Sj.
       • So, for k bins, the class information entropy for multi-interval entropy-based discretization is
         $I(A,T,S,k) = \sum_{j=1}^{k} \frac{|S_j|}{|S|}\,\mathrm{Ent}(S_j)$
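
The class information entropy of a k-bin partition, in the sense of Fayyad and Irani's criterion, can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, entropy is taken in base 2, and a value equal to a cut point is assigned to the lower bin.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a non-empty list of labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def partition(values, labels, cuts):
    """Split the labels into len(cuts)+1 bins using sorted cut points."""
    bins = [[] for _ in range(len(cuts) + 1)]
    for v, y in zip(values, labels):
        bins[sum(v > c for c in cuts)].append(y)
    return bins


def class_information_entropy(values, labels, cuts):
    """Size-weighted sum of the class entropies of the bins."""
    n = len(labels)
    return sum(len(b) / n * entropy(b)
               for b in partition(values, labels, cuts) if b)


if __name__ == "__main__":
    values = [1, 2, 3, 4]
    labels = ["a", "a", "b", "b"]
    # A boundary that separates the classes scores lower than one that mixes them.
    print(class_information_entropy(values, labels, [2.5]))
    print(class_information_entropy(values, labels, [1.5]))
```

A lower value means purer bins, so a search over boundaries would prefer the first cut in the usage example.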

  9. Entropy-Instance-Based (EIB) Discretization
     • In EIB, besides the class information entropy, a second measure based on individual information entropy is used to select multi-interval boundaries for discretization.
     • Given n individuals, the individual information entropy of a subset S is
       $\mathrm{IndEnt}(S) = -\sum_{i=1}^{n} p(I_i, S)\,\log_2 p(I_i, S)$
       where p(Ii, S) is the probability of the i-th individual in the subset S.
     • The total individual information entropy for all partitions is
       $\mathrm{Ind}(A,T,S,k) = \sum_{j=1}^{k} \frac{|S_j|}{|S|}\,\mathrm{IndEnt}(S_j)$

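  10. Entropy-Instance-Based (EIB) Discretization
      • By minimizing the function Ind_I(A,T,S,k), which is the sum of two sub-functions, I(A,T,S,k) and Ind(A,T,S,k), we discretize the attribute's values based on both the class and the individual information entropy:
        $\mathrm{Ind\_I}(A,T,S,k) = I(A,T,S,k) + \mathrm{Ind}(A,T,S,k) = \sum_{j=1}^{k} \frac{|S_j|}{|S|}\,\mathrm{Ent}(S_j) + \sum_{j=1}^{k} \frac{|S_j|}{|S|}\,\mathrm{IndEnt}(S_j)$
        where Ent(Sj) and IndEnt(Sj) are the class entropy and the individual entropy of bin Sj.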
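
A rough Python sketch of the combined criterion is given here. It assumes that Ind_I is the plain sum of the two terms, that both terms weight each bin by its share of the rows, and that p(Ii, S) is the fraction of rows in S belonging to individual Ii; the name `ind_i` and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (base 2) of a non-empty list of symbols."""
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())


def ind_i(values, classes, individuals, cuts):
    """Class information entropy plus individual information entropy,
    each weighted by bin size and summed over the len(cuts)+1 bins."""
    k = len(cuts) + 1
    class_bins = [[] for _ in range(k)]
    indiv_bins = [[] for _ in range(k)]
    for v, c, i in zip(values, classes, individuals):
        j = sum(v > t for t in cuts)      # bin index of this row
        class_bins[j].append(c)
        indiv_bins[j].append(i)
    n = len(values)
    class_term = sum(len(b) / n * entropy(b) for b in class_bins if b)
    indiv_term = sum(len(b) / n * entropy(b) for b in indiv_bins if b)
    return class_term + indiv_term


if __name__ == "__main__":
    # Four rows of a non-target table: two rows per individual (molecule).
    values = [1.0, 2.0, 3.0, 4.0]
    classes = ["+", "+", "-", "-"]
    individuals = ["m1", "m1", "m2", "m2"]
    print(ind_i(values, classes, individuals, [2.5]))  # keeps individuals together
    print(ind_i(values, classes, individuals, [1.5]))  # splits individual m1
```

The second boundary is penalised twice: once for mixing classes in a bin and once for mixing individuals, which is the intuition behind taking the one-to-many association into account.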

  11. Entropy-Instance-Based (EIB) Discretization
      • One of the main problems with this discretization criterion is that it is relatively expensive to compute.
      • A genetic algorithm (GA) is used to obtain a multi-interval discretization for continuous attributes. It consists of
        • an initialization step
        • iterative generations of
          • the reproduction phase,
          • the crossover phase and
          • the mutation phase

  12. Entropy-Instance-Based (EIB) Discretization
      • Initialization step
        • A set of strings (chromosomes) is randomly generated, where each string consists of b-1 continuous values, representing the b partitions, drawn between the attribute's minimum and maximum values.
        • For instance, given minimum and maximum values of 1.5 and 20.5 for a continuous field, we have (2.5, 5.5, 9.3, 12.6, 15.5, 20.5)
      • The fitness function for genetic entropy-instance-based discretization is defined as f = 1 / Ind_I(A,T,S,k)

  13. Entropy-Instance-Based (EIB) Discretization
      • Iterative generations of
        • the reproduction phase
          • roulette wheel selection is used
        • the crossover phase
          • a crossover probability pc of 0.50 is used
        • the mutation phase
          • a fixed probability pm of 0.10 is used
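
The GA loop described on these slides might look roughly like the Python sketch below. Only roulette wheel selection, pc = 0.50, pm = 0.10 and the fitness f = 1/cost come from the slides. Everything else is an assumption made for illustration: the hypothetical function name `ga_discretize`, single-point crossover, mutation by redrawing a gene uniformly, the population size, the number of generations, the small epsilon guarding against division by zero, and remembering the best chromosome seen so far. The `cost` argument is a stand-in for Ind_I(A,T,S,k).

```python
import random


def ga_discretize(cost, lo, hi, bins, pop_size=20, generations=30,
                  pc=0.50, pm=0.10, seed=0):
    """Search for bins-1 cut points in [lo, hi] that minimise `cost`."""
    rng = random.Random(seed)
    n_genes = bins - 1

    def random_chromosome():
        return sorted(rng.uniform(lo, hi) for _ in range(n_genes))

    def fitness(chromosome):
        # f = 1 / cost; the epsilon is an assumption to avoid dividing by zero.
        return 1.0 / (cost(chromosome) + 1e-9)

    # Initialization step: random cut points between the min and max values.
    population = [random_chromosome() for _ in range(pop_size)]
    best = max(population, key=fitness)

    for _ in range(generations):
        # Reproduction phase: roulette wheel selection.
        weights = [fitness(ch) for ch in population]
        parents = rng.choices(population, weights=weights, k=pop_size)

        # Crossover phase: single-point crossover with probability pc.
        children = []
        for a, b in zip(parents[0::2], parents[1::2]):
            a, b = a[:], b[:]
            if n_genes > 1 and rng.random() < pc:
                point = rng.randrange(1, n_genes)
                a[point:], b[point:] = b[point:], a[point:]
            children += [a, b]

        # Mutation phase: redraw each gene with probability pm.
        for ch in children:
            for g in range(n_genes):
                if rng.random() < pm:
                    ch[g] = rng.uniform(lo, hi)
            ch.sort()

        population = children
        best = max(population + [best], key=fitness)
    return best


if __name__ == "__main__":
    # Toy cost: a single boundary should end up near 10.0.
    print(ga_discretize(lambda cuts: abs(cuts[0] - 10.0), 1.5, 20.5, bins=2))
```

Passing the cost as a callable keeps the search independent of the criterion, so the same loop could be run with the entropy-based or the entropy-instance-based measure.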

  14. Data Summarization (DARA)
      • Data summarization based on Information Retrieval (IR) theory
      • Dynamic Aggregation of Relational Attributes (DARA): categorizes objects with similar patterns based on tf-idf weights, borrowed from IR theory
      • Scalable and produces interpretable rules
      [Diagram: a target table (T) linked to several non-target tables (NT); the legend marks T = target table, NT = non-target table, and a symbol for data summarization.]

  15. Data Summarization (DARA)
      • Data summarization based on Information Retrieval (IR) theory
      • TF-IDF (term frequency-inverse document frequency): a weight often used in information retrieval and text mining
      • A statistical measure used to evaluate how important a word is to a document in a corpus
      • The importance of a term increases proportionally to the number of times it appears in the document but is offset by its frequency in the corpus.

  16. Data Summarization (DARA)
      • In a multi-relational setting,
        • an object (a single record) is considered as a document
        • all corresponding attribute values stored in multiple tables are considered as terms that describe the characteristics of the object (the record)
      • DARA transforms the data representation from a relational model into a vector space model and employs the TF-IDF weighting scheme to cluster and summarize the data

  17. Data Summarization (DARA)
      • tf_i · idf_i (term frequency-inverse document frequency)
      • The term frequency is
        $tf_i = \frac{n_i}{\sum_k n_k}$
        where n_i is the number of occurrences of the considered term, and the denominator is the number of occurrences of all terms.
      • The inverse document frequency is a measure of the general importance of the term:
        $idf_i = \log \frac{|D|}{|\{d : t_i \in d\}|}$
        where |D| is the total number of documents in the corpus and |{d : t_i ∈ d}| is the number of documents in which the term t_i appears.
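
As an illustration of the weighting described above, the following Python sketch computes tf-idf weights for a small corpus in which each "document" is the bag of attribute values attached to one record. The function name `tf_idf`, the toy terms and the use of the natural logarithm are assumptions; the slides do not state the log base.

```python
from collections import Counter
from math import log


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    tf  = occurrences of the term / occurrences of all terms in the document
    idf = log(number of documents / number of documents containing the term)
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        weights.append({t: (c / total) * log(n_docs / df[t])
                        for t, c in counts.items()})
    return weights


if __name__ == "__main__":
    # Hypothetical records described by discretized attribute values.
    records = [["c", "c", "o"], ["c", "n"]]
    for w in tf_idf(records):
        print(w)
```

A term that occurs in every record, such as "c" in the usage example, receives weight zero because it does not help to tell records apart.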

  18. Data Summarization (DARA)
      Data Summarization Stages
      • Information Propagation Stage
        • Propagates the record ID and classes from the target concepts to the non-target tables
      • Data Aggregation Stage
        • Summarizes each record to become a single tuple
        • Uses a clustering technique based on the TF-IDF weight, in which each record can be represented as
          (tf1 log(n/df1), tf2 log(n/df2), . . . , tfm log(n/dfm))
        • The cosine similarity method is used to compute the similarity between two records Ri and Rj:
          cos(Ri, Rj) = Ri·Rj / (||Ri||·||Rj||)
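
The cosine similarity between two record vectors can be sketched in a few lines of Python. The function name `cosine` and the choice to return 0.0 for an all-zero vector are assumptions made for this example.

```python
from math import sqrt


def cosine(r_i, r_j):
    """Cosine similarity Ri·Rj / (||Ri||·||Rj||) of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(r_i, r_j))
    norm = sqrt(sum(a * a for a in r_i)) * sqrt(sum(b * b for b in r_j))
    # Assumption: treat an all-zero vector as having similarity 0.
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    print(cosine([1.0, 0.0], [0.0, 1.0]))   # orthogonal records
    print(cosine([1.0, 2.0], [2.0, 4.0]))   # same direction, different length
```

Because the measure depends only on direction, two records with the same mix of weighted terms are treated as similar even if one has many more rows in the non-target table.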

  19. Experimental Evaluation
      • The discretization methods are implemented in the DARA algorithm, in conjunction with the C4.5 classifier as an induction algorithm that is run on DARA's discretized and transformed data representation.
      • Three varieties of a well-known dataset, the Mutagenesis relational database, were chosen.
      • The data describes 188 molecules falling into two classes, mutagenic (active) and non-mutagenic (inactive); 125 of these molecules are mutagenic.

  20. Experimental Evaluation
      • Three different sets of background knowledge are used (referred to as experiments B1, B2 and B3).
        • B1: The atoms in the molecule are given, as well as the bonds between them, the type of each bond, and the element and type of each atom.
        • B2: Besides B1, the charges of the atoms are added.
        • B3: Besides B2, the log of the compound's octanol/water partition coefficient (logP) and the energy of the compound's lowest unoccupied molecular orbital (εLUMO) are added.
      • Leave-one-out cross-validation is performed using C4.5 for different numbers of bins, b, tested on B1, B2 and B3.

  21. Experimental Results
      [Table: performance (%) of leave-one-out cross-validation of C4.5 on the Mutagenesis dataset]
      • The predictive accuracy of EqualHeight and EqualWeight is lower on datasets B1 and B2 when the number of bins is smaller.
      • The accuracy of entropy-based and entropy-instance-based discretization is lower on dataset B3 when the number of bins is smaller.
      • The results of entropy-based (EB) and entropy-instance-based (EIB) discretization on B1, B2 and B3 are virtually identical (EIB performs better than EB in five out of nine tests).

  22. Conclusions
      • We presented a method called dynamic aggregation of relational attributes (DARA), with entropy-instance-based discretization, to propositionalise a multi-relational database.
      • The DARA method has shown good performance on three well-known datasets in terms of predictive accuracy.
      • The entropy-instance-based and entropy-based discretization methods are recommended for discretizing attribute values in multi-relational datasets.
      • Disadvantage: computation is expensive when the number of bins is large.

  23. Thank You. Discretization Numbers for Multiple-Instances Problem in Relational Database
