Optimization of Association Rules Extraction Through Exploitation of Context Dependent Constraints Arianna Gallo, Roberto Esposito, Rosa Meo, Marco Botta Dipartimento di Informatica, Università di Torino
Outline
• Motivations
  • Knowledge Discovery in Databases (KDD), Inductive Databases
  • Constraint-Based Mining
  • Incremental Constraint Evaluation
• Association Rule Mining
• Incremental Algorithms
• Constraint properties
  • Item Dependent Constraints (IDC)
  • Context Dependent Constraints (CDC)
• Incremental Algorithms for IDC and CDC
• Performance results and Conclusions
Motivations: KDD process and Inductive Databases (IDB)
• The KDD process consists of the nontrivial extraction of implicit, previously unknown, and potentially useful information from data
• KDD is an interactive and iterative process
• Inductive Databases have been proposed by Mannila and Imielinski [CACM'96] as a support for KDD
• Inductive Databases contain both data and inductive generalizations (e.g., patterns, models) extracted from the data
• Users can query the inductive database with an advanced, ad hoc data mining query language, by means of constraint-based queries
Motivations: Constraint-Based Mining and Incrementality
• Why constraints?
  • they can be pushed into the pattern computation, pruning the search space;
  • they give the user a tool to express her interests (both in the data and in the knowledge).
• In an IDB, constraint-based queries are very often a refinement of previous ones
  • Explorative process
  • Reconciling background and extracted knowledge
• Why execute each query from scratch? The new query can be executed incrementally! [Baralis et al., DaWaK'99]
A Generic Mining Language
• A very generic constraint-based mining query R = Q(T, G, I, M) requests:
  • extraction from a source table T
  • of sets of items (itemsets), on some schema I
  • from the groups of the database (grouping constraints G)
  • satisfying some user-defined constraints (mining constraints M)
  • the number of such groups must be sufficient (user-defined statistical evaluation measures, such as support)
• In our case R contains association rules
An Example
• Mining query: R = Q(purchase, customer, product, price>100, support_count>=2)

Frequent itemsets (from the purchase table):

  itemset              support_count
  {jacket}             3
  {ski pants}          2
  {jacket, ski pants}  2

Result R:

  body       head       frequency  confidence
  jacket     ski pants  2/3        2/3
  ski pants  jacket     2/3        1
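The numbers on this slide can be reproduced with a small sketch. The data and prices below are hypothetical, chosen only so that the constraint price > 100 keeps jacket and ski pants and the three customers yield the supports shown above; this is an illustration of the query semantics, not the paper's mining algorithm.

```python
from itertools import combinations

# Hypothetical toy data: each customer (group) maps to the set of
# items purchased; prices are assumed for illustration only.
price = {"jacket": 120, "ski pants": 150, "gloves": 30}
groups = {
    "c1": {"jacket", "ski pants"},
    "c2": {"jacket", "ski pants", "gloves"},
    "c3": {"jacket", "gloves"},
}

min_support_count = 2
items = [i for i in price if price[i] > 100]  # mining constraint: price > 100

def support_count(itemset):
    """Number of groups whose purchases contain the whole itemset."""
    return sum(1 for g in groups.values() if itemset <= g)

rules = []  # (body, head, frequency, confidence)
for k in range(2, len(items) + 1):
    for iset in combinations(sorted(items), k):
        if support_count(set(iset)) < min_support_count:
            continue  # grouping/support constraint not met
        for r in range(1, k):
            for body in combinations(iset, r):
                head = tuple(i for i in iset if i not in body)
                freq = support_count(set(iset)) / len(groups)
                conf = support_count(set(iset)) / support_count(set(body))
                rules.append((body, head, freq, conf))
```

Running this yields exactly the two rules of the slide: jacket → ski pants with frequency 2/3 and confidence 2/3, and ski pants → jacket with frequency 2/3 and confidence 1.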
Incremental Algorithms
• We studied an incremental approach to answering new constraint-based queries which makes use of the information (rules with support and confidence) contained in previous results
• We identified two classes of query constraints:
  • item dependent (IDC)
  • context dependent (CDC)
• We propose two newly developed incremental algorithms which allow the exploitation of past results in the two cases (IDC and CDC)
Relationships between two queries
We can speed up the execution of a new query using the results of previous queries. Which previous queries?
• Query equivalence: R1 = R2, no computation is needed [FQAS'04]
• Query containment: [this paper]
  • Inclusion: R2 ⊆ R1 and the common elements have the same statistical measures: R2 = C(R1), i.e., the new result is obtained by applying the new constraint C to R1
  • Dominance: R2 ⊆ R1 but the common elements do not have the same statistical measures: R2 ≠ C(R1)
How can we recognize inclusion or dominance between two constraint-based queries?
IDC vs CDC
• Item Dependent Constraints (IDC)
  • are functionally dependent on the items extracted
  • are satisfied for a given itemset either by all the groups in the database or by none
  • if an itemset is common to R1 and R2, it has the same support: inclusion
  • Example IDC: price > 150
• Context Dependent Constraints (CDC)
  • depend on the transactions in the database
  • might be satisfied for a given itemset only in some groups of the database
  • an itemset common to R1 and R2 might not have the same support: dominance
  • Example CDC: qty > 1
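The distinction can be made concrete with a toy sketch (all data hypothetical): an IDC such as price > 150 evaluates identically for an item in every group, so any itemset that survives keeps the support it had in the unconstrained run; a CDC such as qty > 1 filters individual transaction rows, so the same itemset can lose occurrences in some groups only.

```python
# Item table: prices are fixed per item (item dependent).
price = {"a": 200, "b": 160, "c": 100}

# Transactions: group -> list of (item, qty); qty varies per row (context dependent).
transactions = {
    1: [("a", 2), ("b", 1)],
    2: [("a", 1), ("b", 2)],
    3: [("a", 2), ("b", 2)],
}

def support(itemset, row_pred):
    """Count groups whose constraint-filtered rows contain the whole itemset."""
    return sum(
        1 for rows in transactions.values()
        if itemset <= {i for i, q in rows if row_pred(i, q)}
    )

itemset = {"a", "b"}

idc = lambda i, q: price[i] > 150  # same outcome for an item in every group
cdc = lambda i, q: q > 1           # outcome depends on the row's qty

print(support(itemset, lambda i, q: True))  # 3 (unconstrained)
print(support(itemset, idc))                # 3: support unchanged -> inclusion
print(support(itemset, cdc))                # 1: support changed -> dominance
```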
Incremental Algorithm for IDC
• Previous query Q1: constraint price > 5. Current query Q2: constraint price > 10.
• Item domain table:

  item  category  price
  A     hi-tech   12
  B     hi-tech   14
  C     housing   8    ← fails the new constraint

• Rules in memory (R1):

  body  head  supp  conf
  A     B     2     1
  A     C     …     …

• Item C belongs to a row that does not satisfy the new IDC constraint: delete from R1 all rules containing item C
• The new result R2 is obtained from the previous one (R2 = P(R1)):

  body  head  supp  conf
  A     B     2     1
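The IDC incremental step on this slide can be sketched in a few lines (variable names are illustrative, not the paper's implementation): the new constraint is evaluated once against the item-domain table, and rules from the previous result that mention a failing item are dropped; every surviving rule is reused with its support and confidence intact.

```python
# Item-domain table from the slide: item -> (category, price).
item_domain = {"A": ("hi-tech", 12), "B": ("hi-tech", 14), "C": ("housing", 8)}

# Previous result R1: (body, head, support, confidence).
R1 = [({"A"}, {"B"}, 2, 1.0), ({"A"}, {"C"}, 2, 0.5)]

# Evaluate the new IDC (price > 10) once on the item domain.
ok = {i for i, (cat, p) in item_domain.items() if p > 10}

# Keep only rules whose items all satisfy the new constraint;
# their measures are unchanged (inclusion), so no DB scan is needed.
R2 = [r for r in R1 if (r[0] | r[1]) <= ok]
```

Here item C (price 8) fails, so the rule A → C is deleted and R2 contains only A → B with its original support 2 and confidence 1. The support and confidence of A → C in R1 are assumed values for illustration.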
Incremental Algorithm for CDC
• Previous query Q1: constraint qty > 5. Current query Q2: constraint qty > 10.
• Rules in memory (R1): body, head, supp, conf
• Steps:
  • build the BHF from the rules in R1
  • read the DB and find the groups
    • in which the new constraints are satisfied
    • containing items belonging to the BHF
  • update the support counters in the BHF
  • obtain the new result R2 (body, head, supp, conf)
Body-Head Forest (BHF)
• the body (head) tree contains itemsets which are candidates for being in the body (head) part of a rule
• an itemset is represented as a single path in the tree, and vice versa
• each path in the body (head) tree is associated with a counter representing the body (rule) support
• Example: for the rule a → f g, the body node a has counter 4 (body support) and the head path f, g has counter 3 (rule support), giving support = 3 and confidence = 3/4
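A minimal sketch of the structure described above, assuming a trie-like layout: each body-tree path carries a body-support counter and points to its own head tree, whose paths carry rule-support counters. Class and method names are illustrative, not the paper's implementation.

```python
class Node:
    def __init__(self):
        self.children = {}    # item -> Node
        self.count = 0        # body support (body tree) or rule support (head tree)
        self.head_root = None # set on body nodes that terminate a rule body

def insert(root, itemset):
    """Materialize the path for a sorted itemset; return its last node."""
    node = root
    for item in sorted(itemset):
        node = node.children.setdefault(item, Node())
    return node

class BHF:
    def __init__(self):
        self.body_root = Node()

    def add_rule(self, body, head):
        body_node = insert(self.body_root, body)
        if body_node.head_root is None:
            body_node.head_root = Node()
        insert(body_node.head_root, head)

    def update(self, group_items):
        """Walk the forest for one DB group, bumping matching counters."""
        self._walk_body(self.body_root, group_items)

    def _walk_body(self, node, items):
        for item, child in node.children.items():
            if item in items:
                child.count += 1               # body (prefix) supported here
                if child.head_root is not None:
                    self._bump_heads(child.head_root, items)
                self._walk_body(child, items)

    def _bump_heads(self, node, items):
        for item, child in node.children.items():
            if item in items:
                child.count += 1               # body + head prefix supported
                self._bump_heads(child, items)
```

For instance, inserting the rule a → f g and updating with four groups, three of which contain {a, f, g} and one only {a}, leaves counter 4 on the body node a and counter 3 at the end of the head path f, g; confidence is then recovered as 3/4 by dividing the two counters, matching the example on the slide.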
Experiments (1): ID vs CD algorithm
[Figures (a), (b): ID algorithm; (c), (d): CD algorithm — execution time vs constraint selectivity, and execution time vs volume of the previous result]
Experiments (2): CARE vs Incremental
[Figure (a): execution time vs cardinality of the previous result; (b): execution time vs support threshold; (c): execution time vs selectivity of constraints]
Conclusions and future work
• We proposed two incremental algorithms for constraint-based mining which make use of the information contained in previous results to answer new queries.
• The first algorithm deals with item dependent constraints, the second with context dependent ones.
• We evaluated the incremental algorithms on a fairly large dataset. The results show that the approach drastically reduces execution time.
• An interesting direction for future research: integration of condensed representations with these incremental techniques
The End — Questions?