Incremental Maintenance of Ontology-Exploiting Association Rules. Ming-Cheng Tseng 1, Wen-Yang Lin 2 and Rong Jeng 3. 1, 3 Institute of Information Engineering, I-Shou University, Taiwan. 2 Dept. of Comp. Sci. & Info. Eng., National University of Kaohsiung, Taiwan. August 20, 2007.
Outline • Introduction • Problem description • The proposed algorithm • Performance evaluation • Conclusions
Introduction • Motivation • In general, there exist many semantic relationships (domain knowledge) among items • It is natural to incorporate a domain ontology into the data mining process to explore more innovative rules • The source databases change over time • E.g., insertion, deletion, modification • The discovered knowledge (rules) has to be updated to reflect the new situation
Introduction (cont.) • Association rules • Given: • A database of customer transactions • Each transaction is a set of items • Find all rules X => Y that correlate the presence of one set of items X with another set of items Y • Example: Sony VAIO => HP LaserJet 1300 (Sup. = 30%, Conf. = 60%)
Introduction (cont.) • Strong association rules • Given: • User-specified constraints • Minimum support (min_sup) • Minimum confidence (min_conf) • Find rules X => Y with support and confidence larger than the user-specified minimum values • Example: • min_sup = 25%, min_conf = 50% • Sony VAIO => HP LaserJet 1300 (Sup. = 30%, Conf. = 60%)
Introduction (cont.) • Frequent itemsets (patterns) mining • The association mining problem can be reduced to the problem of mining frequent itemsets, i.e., itemsets with support larger than min_sup • Example • min_sup = 25%, min_conf = 50% • sup({Sony VAIO, HP LaserJet 1300}) = 30% • sup({Sony VAIO}) = 50% • Sony VAIO => HP LaserJet 1300 (Sup. = 30%, Conf. = 30% / 50% = 60%)
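The support and confidence figures in the example can be reproduced with a minimal Python sketch. The ten-transaction toy database and the helper names `support`, `confidence` and `is_strong` are assumptions made for illustration, not data or code from the slides. The thresholds are treated as inclusive here, which is also an assumption.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def support(itemset: FrozenSet[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(x: FrozenSet[str], y: FrozenSet[str],
               transactions: List[Transaction]) -> float:
    """conf(X => Y) = sup(X union Y) / sup(X)."""
    return support(x | y, transactions) / support(x, transactions)


def is_strong(x, y, transactions, min_sup, min_conf) -> bool:
    """A rule is strong if it meets both thresholds (inclusive: an assumption)."""
    return (support(x | y, transactions) >= min_sup
            and confidence(x, y, transactions) >= min_conf)


# Hypothetical toy database chosen so that the slide's numbers come out.
vaio, laserjet = "Sony VAIO", "HP LaserJet 1300"
transactions = (
    [frozenset({vaio, laserjet})] * 3   # both items: 3 of 10 transactions
    + [frozenset({vaio})] * 2           # VAIO alone: 2 more
    + [frozenset({"other item"})] * 5   # unrelated purchases
)

x, y = frozenset({vaio}), frozenset({laserjet})
print(support(x | y, transactions), confidence(x, y, transactions))
print(is_strong(x, y, transactions, min_sup=0.25, min_conf=0.50))
```

Because the confidence of every rule is a ratio of two itemset supports, finding the frequent itemsets is enough to derive all strong rules.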
Introduction (cont.) • Ontology • W3C Web Ontology Working Group: “An ontology formally defines a common set of terms that are used to describe and represent a domain knowledge.” • E.g., taxonomy: a kind of ontology presenting classification relationships among objects
Introduction (cont.) • Ontology-exploiting association rules IBM 60GB HD => HP DeskJet
Problem Description • Incremental maintenance of ontology-exploiting association rules • Given: • A database of customer transactions, DB • An incremental database, db • An item ontology, T • The frequent itemsets discovered in DB, L • Minimum support, ms, and minimum confidence, mc • Find all frequent itemsets in UD = DB + db w.r.t. ms • Construct all strong rules from the frequent itemsets w.r.t. mc
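One part of this problem can be sketched directly: for itemsets already in L, whose support counts in DB are known, only the increment db has to be scanned. The sketch below is a simplification under stated assumptions. It ignores the ontology-based extension of transactions, it does not find itemsets outside L that become frequent, and the function name `update_known_itemsets` and its parameters are hypothetical.

```python
from typing import Dict, FrozenSet, List

Itemset = FrozenSet[str]


def update_known_itemsets(old_counts: Dict[Itemset, int],
                          db_size: int,
                          increment: List[Itemset],
                          ms: float) -> Dict[Itemset, int]:
    """Return the itemsets of L that remain frequent in UD = DB + db.

    old_counts: support count in DB of each itemset in L
    db_size:    |DB|, the number of original transactions
    increment:  the new transactions db
    ms:         minimum support as a fraction (inclusive: an assumption)
    """
    total = db_size + len(increment)           # |UD| = |DB| + |db|
    still_frequent = {}
    for itemset, count_in_DB in old_counts.items():
        # Only db is scanned; the count in DB is reused.
        count_in_db = sum(1 for t in increment if itemset <= t)
        count_in_UD = count_in_DB + count_in_db
        if count_in_UD >= ms * total:
            still_frequent[itemset] = count_in_UD
    return still_frequent


# Hypothetical example: |DB| = 8, two new transactions, ms = 50%.
L = {frozenset({"A"}): 6, frozenset({"A", "B"}): 4}
db = [frozenset({"A"}), frozenset({"C"})]
print(update_known_itemsets(L, db_size=8, increment=db, ms=0.5))
```

Itemsets that were infrequent in DB have no stored count, so deciding their status may still require going back to the original database.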
Problem Description (cont.) • Example [Figure: customer transactions DB and item ontology G, mined with algorithms AROC and AROS at minsup = 70%, giving the discovered frequent itemsets L]
Problem Description (cont.) • Example [Figure: item ontology G, customer transactions DB and incremental transactions db, with minsup = 70%; the updated frequent itemsets L’ are to be determined]
The Proposed Algorithm – IMARO • Basic scheme • An Apriori-based maintenance algorithm • Employing a bottom-up, level-wise searching strategy • Starting from the frequent 1-itemsets, L1, then L2, …, Lk, etc. [Figure: itemset lattice over items A, B, C, D, from the 1-itemsets A, B, C, D through AB, AC, AD, BC, BD, CD and ABC, ABD, ACD, BCD up to ABCD]
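The bottom-up, level-wise search can be illustrated with a plain Apriori sketch. This is the textbook scheme that IMARO is described as building on, not IMARO itself: it has no ontology and no incremental step, and the names `apriori_gen` and `apriori` as well as the toy transactions are assumptions.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set

Itemset = FrozenSet[str]


def apriori_gen(prev_level: Set[Itemset], k: int) -> Set[Itemset]:
    """Join frequent (k-1)-itemsets into k-candidates and drop any
    candidate that has an infrequent (k-1)-subset."""
    candidates = set()
    for a in prev_level:
        for b in prev_level:
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in prev_level
                for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
    return candidates


def apriori(transactions: List[Itemset], min_count: int) -> Dict[Itemset, int]:
    """Level-wise search: L1, then L2, ..., until no candidates remain."""
    candidates = {frozenset([i]) for t in transactions for i in t}
    frequent: Dict[Itemset, int] = {}
    k = 1
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c for c, n in counts.items() if n >= min_count}
        frequent.update({c: counts[c] for c in level})
        k += 1
        candidates = apriori_gen(level, k)
    return frequent


# Hypothetical transactions over the items A, B, C.
data = [frozenset(s) for s in ("ABC", "AB", "AC", "BC", "ABC")]
for itemset, n in sorted(apriori(data, min_count=3).items(),
                         key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

Each level only generates candidates from the frequent itemsets of the level below, which is what keeps the walk up the lattice tractable.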
The Proposed Algorithm – IMARO (cont.) • Terminology
The Proposed Algorithm – IMARO (cont.) • Example
The Proposed Algorithm – IMARO (cont.) • Note on database extension • A component item may also exist as a primitive item itself • To clarify the meaning of associations involving such an item, we have to differentiate the role this item plays • E.g., IBM TP => Ink Cartridge has two possible readings • Customers who buy an IBM TP notebook also buy an Ink Cartridge • Customers who buy an IBM TP notebook also buy a product composed of Ink Cartridge
The Proposed Algorithm – IMARO (cont.) • Process flow for updating frequent k-itemsets e.g., AROC or AROS
The Proposed Algorithm – IMARO (cont.) • Frequent/infrequent itemsets inference
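The inference table from this slide is not recoverable here, but the two cases that can be decided without rescanning follow from simple arithmetic. The derivation below is a sketch in notation introduced for illustration: c_DB(X), c_db(X) and c_UD(X) denote the support counts of an itemset X in DB, db and UD, and the threshold is treated as inclusive.

```latex
% Counts and sizes add up over the two parts of the updated database.
c_{UD}(X) = c_{DB}(X) + c_{db}(X), \qquad |UD| = |DB| + |db|

% Frequent in both parts implies frequent in UD.
c_{DB}(X) \ge ms \cdot |DB| \;\wedge\; c_{db}(X) \ge ms \cdot |db|
  \;\Longrightarrow\; c_{UD}(X) \ge ms \cdot |UD|

% Infrequent in both parts implies infrequent in UD.
c_{DB}(X) < ms \cdot |DB| \;\wedge\; c_{db}(X) < ms \cdot |db|
  \;\Longrightarrow\; c_{UD}(X) < ms \cdot |UD|
```

In the two mixed cases the status in UD is undetermined until both counts are known. If X was frequent in DB, its count there is already stored and only db needs scanning; if X was infrequent in DB, the original database may have to be rescanned.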
The Proposed Algorithm – IMARO (cont.) • Optimization 1: Candidate pruning • Any candidate itemset that contains both an item and any one of its extensions (generalized item or component) is pruned • E.g., {Epson EPL, Printer}, {Epson EPL, Toner Cartridge*}
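Candidate pruning can be sketched as a filter over the candidate set. The ontology encoding below is an assumption: `parent_of` maps an item to its generalized items, `components_of` maps an item to its components, and a component role is marked with a trailing "*" as in the slide's example. Only direct components are considered, which may be narrower than the original algorithm.

```python
from typing import Dict, FrozenSet, List, Set

Itemset = FrozenSet[str]


def ancestors(item: str, parent_of: Dict[str, List[str]]) -> Set[str]:
    """All generalized items reachable from `item` in the classification hierarchy."""
    found: Set[str] = set()
    stack = [item]
    while stack:
        current = stack.pop()
        for parent in parent_of.get(current, ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def extensions_of(item: str,
                  parent_of: Dict[str, List[str]],
                  components_of: Dict[str, List[str]]) -> Set[str]:
    """Generalized items plus direct components, the latter marked with '*'."""
    marked = {c + "*" for c in components_of.get(item, ())}
    return ancestors(item, parent_of) | marked


def prune_candidates(candidates: List[Itemset],
                     parent_of: Dict[str, List[str]],
                     components_of: Dict[str, List[str]]) -> List[Itemset]:
    """Drop every candidate holding both an item and one of its extensions."""
    return [
        c for c in candidates
        if not any(extensions_of(i, parent_of, components_of) & c for i in c)
    ]


# Hypothetical fragment of the item ontology.
parent_of = {"Epson EPL": ["Printer"], "Sony VAIO": ["PC"]}
components_of = {"Epson EPL": ["Toner Cartridge"]}

candidates = [
    frozenset({"Epson EPL", "Printer"}),           # item + generalized item
    frozenset({"Epson EPL", "Toner Cartridge*"}),  # item + its component
    frozenset({"Epson EPL", "Toner Cartridge"}),   # cartridge bought on its own
    frozenset({"Epson EPL", "PC"}),
]
print(prune_candidates(candidates, parent_of, components_of))
```

Such candidates can be dropped because every transaction that contains the item also contains its extensions after extension, so the pair carries no more information than the item alone. The unmarked Toner Cartridge survives because, as a primitive item, it plays a different role.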
The Proposed Algorithm – IMARO (cont.) • Optimization 2: Extension filtering • The extension of an item can be added only if that item does appear in at least one candidate itemset being counted currently [Figure: item ontology with classes Printer (HP DeskJet, Epson EPL) and PC (Sony VAIO, IBM TP) and the component items Ink Cartridge, Photo Conductor, Toner Cartridge, S 60GB, RAM 256MB, IBM 60GB]
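Extension filtering can be sketched as follows. The sketch reads "that item" as the extension item itself, so an extension is added to a transaction only when it occurs in some current candidate; this is one interpretation of the slide, not a confirmed detail of IMARO. The precomputed mapping `extensions_of` and the function name `extend_transaction` are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List

Itemset = FrozenSet[str]


def extend_transaction(transaction: Itemset,
                       extensions_of: Dict[str, List[str]],
                       candidates: Iterable[Itemset]) -> Itemset:
    """Add to a transaction only those extensions that can affect a count.

    extensions_of: item -> its generalized items and '*'-marked components
    candidates:    the candidate itemsets being counted in this pass
    """
    # Items that occur in at least one current candidate.
    needed = set().union(*candidates)
    extended = set(transaction)
    for item in transaction:
        for ext in extensions_of.get(item, ()):
            if ext in needed:        # other extensions cannot change any count
                extended.add(ext)
    return frozenset(extended)


# Hypothetical fragment of the item ontology.
extensions_of = {
    "Epson EPL": ["Printer", "Toner Cartridge*"],
    "Sony VAIO": ["PC"],
}
current_candidates = [frozenset({"Printer", "PC"})]
t = frozenset({"Epson EPL", "Sony VAIO"})
print(sorted(extend_transaction(t, extensions_of, current_candidates)))
```

Skipping extensions that no candidate mentions keeps the extended transactions short, which reduces the cost of each subset test during counting.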
Performance Evaluation • IMARO is compared with applying our proposed algorithms, AROC and AROS, to the whole updated database DB + db with T • Test data • A synthetic dataset generated by the IBM data generator with an artificially built ontology
Performance Evaluation (cont.) • Varying minimum supports (|db| = 40,000)
Performance Evaluation (cont.) • Varying incremental transaction size (ms = 1.5%)
Conclusions • We have investigated the problem of updating ontology-exploiting association rules when new transactions are inserted into the database • An Apriori-based algorithm is proposed • Other issues • More complicated semantic relationships and knowledge • Non-uniform minimum support • Generalized or composite items occur more frequently • Towards a total solution for evolving environments • Ontology evolution, database update • Interactive refinement of support constraints • …
Conclusions (cont.) • Taxonomy of semantic relationships* (*source: Veda C. Storey, VLDB Journal, 1993)
Related Work • Comparison with previous work