Optimization of Association Rules Extraction Through Exploitation of Context Dependent Constraints Arianna Gallo, Roberto Esposito, Rosa Meo, Marco Botta Dipartimento di Informatica, Università di Torino
Outline
• Motivations
  • Knowledge Discovery in Databases (KDD), Inductive Databases
  • Constraint-Based Mining
  • Incremental Constraint Evaluation
• Association Rule Mining
• Incremental Algorithms
• Constraint properties
  • Item Dependent Constraints (IDC)
  • Context Dependent Constraints (CDC)
• Incremental Algorithms for IDC and CDC
• Performance results and Conclusions
Motivations: KDD process and Inductive Databases (IDB)
• The KDD process consists of the nontrivial extraction of implicit, previously unknown, and potentially useful information from data
• KDD is an interactive and iterative process
• Inductive Databases have been proposed by Mannila and Imielinski [CACM'96] as a support for KDD
• Inductive Databases contain both data and inductive generalizations (e.g., patterns, models) extracted from the data
• Users can query the inductive database with an advanced, ad hoc data mining query language, by means of constraint-based queries
Motivations: Constraint-Based Mining and Incrementality
• Why constraints?
  • they can be pushed into the pattern computation, pruning the search space;
  • they give the user a tool to express her interests (both in the data and in the knowledge).
• In an IDB, constraint-based queries are very often a refinement of previous ones
  • Explorative process
  • Reconciling background and extracted knowledge
• Why execute each query from scratch? The new query can be executed incrementally! [Baralis et al., DaWaK'99]
A Generic Mining Language
• A very generic constraint-based mining query R = Q(T, G, I, M) requests:
  • extraction from a source table T
  • of sets of items (itemsets), on some schema I
  • from the groups of the database (grouping constraints G)
  • satisfying some user-defined constraints (mining constraints M)
  • the number of such groups must be sufficient (user-defined statistical evaluation measures, such as support)
• In our case R contains association rules
An Example
• Mining query: R = Q(purchase, customer, product, price>100, support_count>=2)

Frequent itemsets (from the purchase table):

  itemset              support_count
  {jacket}             3
  {ski pants}          2
  {jacket, ski pants}  2

Result R:

  body       head       frequency  confidence
  jacket     ski pants  2/3        2/3
  ski pants  jacket     2/3        1
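The numbers on this slide can be reproduced with a small sketch. The data and prices below are hypothetical, chosen only so that the constraint price > 100 keeps jacket and ski pants and the three customers yield the supports shown above; this is an illustration of the query semantics, not the paper's mining algorithm.

```python
from itertools import combinations

# Hypothetical toy data: each customer (group) maps to the set of
# items purchased; prices are assumed for illustration only.
price = {"jacket": 120, "ski pants": 150, "gloves": 30}
groups = {
    "c1": {"jacket", "ski pants"},
    "c2": {"jacket", "ski pants", "gloves"},
    "c3": {"jacket", "gloves"},
}

min_support_count = 2
items = [i for i in price if price[i] > 100]  # mining constraint: price > 100

def support_count(itemset):
    """Number of groups whose purchases contain the whole itemset."""
    return sum(1 for g in groups.values() if itemset <= g)

rules = []  # (body, head, frequency, confidence)
for k in range(2, len(items) + 1):
    for iset in combinations(sorted(items), k):
        if support_count(set(iset)) < min_support_count:
            continue  # grouping/support constraint not met
        for r in range(1, k):
            for body in combinations(iset, r):
                head = tuple(i for i in iset if i not in body)
                freq = support_count(set(iset)) / len(groups)
                conf = support_count(set(iset)) / support_count(set(body))
                rules.append((body, head, freq, conf))
```

Running this yields exactly the two rules of the slide: jacket → ski pants with frequency 2/3 and confidence 2/3, and ski pants → jacket with frequency 2/3 and confidence 1.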
Incremental Algorithms
• We studied an incremental approach to answering new constraint-based queries which makes use of the information (rules with support and confidence) contained in previous results
• We identified two classes of query constraints:
  • item dependent (IDC)
  • context dependent (CDC)
• We propose two newly developed incremental algorithms which allow the exploitation of past results in the two cases (IDC and CDC)
Relationships between two queries
We can speed up the execution of a new query using the results of previous queries. Which previous queries?
• Query equivalence: R1 = R2, no computation is needed [FQAS'04]
• Query containment: [this paper]
  • Inclusion: R2 ⊆ R1 and the common elements have the same statistical measures: R2 = C(R1), i.e., the new result is obtained by applying the new constraint C to R1
  • Dominance: R2 ⊆ R1 but the common elements do not have the same statistical measures: R2 ≠ C(R1)
How can we recognize inclusion or dominance between two constraint-based queries?
IDC vs CDC
• Item Dependent Constraints (IDC)
  • are functionally dependent on the items extracted
  • are satisfied for a given itemset either by all the groups in the database or by none
  • if an itemset is common to R1 and R2, it has the same support: inclusion
  • Example IDC: price > 150
• Context Dependent Constraints (CDC)
  • depend on the transactions in the database
  • might be satisfied for a given itemset only in some groups of the database
  • an itemset common to R1 and R2 might not have the same support: dominance
  • Example CDC: qty > 1
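The distinction can be made concrete with a toy sketch (all data hypothetical): an IDC such as price > 150 evaluates identically for an item in every group, so any itemset that survives keeps the support it had in the unconstrained run; a CDC such as qty > 1 filters individual transaction rows, so the same itemset can lose occurrences in some groups only.

```python
# Item table: prices are fixed per item (item dependent).
price = {"a": 200, "b": 160, "c": 100}

# Transactions: group -> list of (item, qty); qty varies per row (context dependent).
transactions = {
    1: [("a", 2), ("b", 1)],
    2: [("a", 1), ("b", 2)],
    3: [("a", 2), ("b", 2)],
}

def support(itemset, row_pred):
    """Count groups whose constraint-filtered rows contain the whole itemset."""
    return sum(
        1 for rows in transactions.values()
        if itemset <= {i for i, q in rows if row_pred(i, q)}
    )

itemset = {"a", "b"}

idc = lambda i, q: price[i] > 150  # same outcome for an item in every group
cdc = lambda i, q: q > 1           # outcome depends on the row's qty

print(support(itemset, lambda i, q: True))  # 3 (unconstrained)
print(support(itemset, idc))                # 3: support unchanged -> inclusion
print(support(itemset, cdc))                # 1: support changed -> dominance
```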
Incremental Algorithm for IDC
• Previous query Q1: constraint price > 5. Current query Q2: constraint price > 10.
• Item domain table:

  item  category  price
  A     hi-tech   12
  B     hi-tech   14
  C     housing   8    ← fails the new constraint

• Rules in memory (R1):

  body  head  supp  conf
  A     B     2     1
  A     C     …     …

• Item C belongs to a row that does not satisfy the new IDC constraint: delete from R1 all rules containing item C
• The new result R2 is obtained from the previous one (R2 = P(R1)):

  body  head  supp  conf
  A     B     2     1
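The IDC incremental step on this slide can be sketched in a few lines (variable names are illustrative, not the paper's implementation): the new constraint is evaluated once against the item-domain table, and rules from the previous result that mention a failing item are dropped; every surviving rule is reused with its support and confidence intact.

```python
# Item-domain table from the slide: item -> (category, price).
item_domain = {"A": ("hi-tech", 12), "B": ("hi-tech", 14), "C": ("housing", 8)}

# Previous result R1: (body, head, support, confidence).
R1 = [({"A"}, {"B"}, 2, 1.0), ({"A"}, {"C"}, 2, 0.5)]

# Evaluate the new IDC (price > 10) once on the item domain.
ok = {i for i, (cat, p) in item_domain.items() if p > 10}

# Keep only rules whose items all satisfy the new constraint;
# their measures are unchanged (inclusion), so no DB scan is needed.
R2 = [r for r in R1 if (r[0] | r[1]) <= ok]
```

Here item C (price 8) fails, so the rule A → C is deleted and R2 contains only A → B with its original support 2 and confidence 1. The support and confidence of A → C in R1 are assumed values for illustration.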
Incremental Algorithm for CDC
• Previous query Q1: constraint qty > 5. Current query Q2: constraint qty > 10.
• Rules in memory (R1): body, head, supp, conf
• Steps:
  • build the BHF from the rules in R1
  • read the DB and find the groups
    • in which the new constraints are satisfied
    • containing items belonging to the BHF
  • update the support counters in the BHF
  • obtain the new result R2 (body, head, supp, conf)
Body-Head Forest (BHF)
• the body (head) tree contains itemsets which are candidates for being in the body (head) part of a rule
• an itemset is represented as a single path in the tree, and vice versa
• each path in the body (head) tree is associated with a counter representing the body (rule) support
• Example: for the rule a → f g, the body node a has counter 4 (body support) and the head path f, g has counter 3 (rule support), giving support = 3 and confidence = 3/4
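A minimal sketch of the structure described above, assuming a trie-like layout: each body-tree path carries a body-support counter and points to its own head tree, whose paths carry rule-support counters. Class and method names are illustrative, not the paper's implementation.

```python
class Node:
    def __init__(self):
        self.children = {}    # item -> Node
        self.count = 0        # body support (body tree) or rule support (head tree)
        self.head_root = None # set on body nodes that terminate a rule body

def insert(root, itemset):
    """Materialize the path for a sorted itemset; return its last node."""
    node = root
    for item in sorted(itemset):
        node = node.children.setdefault(item, Node())
    return node

class BHF:
    def __init__(self):
        self.body_root = Node()

    def add_rule(self, body, head):
        body_node = insert(self.body_root, body)
        if body_node.head_root is None:
            body_node.head_root = Node()
        insert(body_node.head_root, head)

    def update(self, group_items):
        """Walk the forest for one DB group, bumping matching counters."""
        self._walk_body(self.body_root, group_items)

    def _walk_body(self, node, items):
        for item, child in node.children.items():
            if item in items:
                child.count += 1               # body (prefix) supported here
                if child.head_root is not None:
                    self._bump_heads(child.head_root, items)
                self._walk_body(child, items)

    def _bump_heads(self, node, items):
        for item, child in node.children.items():
            if item in items:
                child.count += 1               # body + head prefix supported
                self._bump_heads(child, items)
```

For instance, inserting the rule a → f g and updating with four groups, three of which contain {a, f, g} and one only {a}, leaves counter 4 on the body node a and counter 3 at the end of the head path f, g; confidence is then recovered as 3/4 by dividing the two counters, matching the example on the slide.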
Experiments (1): ID vs CD algorithm
[Figures (a), (b): ID algorithm; (c), (d): CD algorithm — execution time vs constraint selectivity, and execution time vs volume of the previous result]
Experiments (2): CARE vs Incremental
[Figure (a): execution time vs cardinality of the previous result; (b): execution time vs support threshold; (c): execution time vs selectivity of constraints]
Conclusions and future work
• We proposed two incremental algorithms for constraint-based mining which make use of the information contained in previous results to answer new queries.
• The first algorithm deals with item dependent constraints, the second with context dependent ones.
• We evaluated the incremental algorithms on a fairly large dataset. The results show that the approach drastically reduces execution time.
• An interesting direction for future research: integration of condensed representations with these incremental techniques
The End — Questions?