Data Mining in Clinical Databases by using Association Rules Department of Computing Charles Lo
Outline • What is an Association Rule? • Previous Work • Target Problems • Methodology and Algorithm • Experiment and Discussion • Q & A
What is an Association Rule? (1) Association rules were introduced in Agrawal, Imielinski, & Swami (1993). In a database of transactions, a rule A, B ⇒ C says that, for example, 30% of the transactions that contain A and B also contain C, while 5% of all the transactions contain all three items.
What is an Association Rule? (2) • In a supermarket, 20% of the transactions that contain Coca-Cola also contain Pepsi, and 3% of all transactions contain both items. • 20% is the confidence of the rule • 3% is the support of the rule • Association rules can be applied in • Decision Support • Market Strategy • Financial Forecast
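For clarity, a minimal sketch of how these two measures are computed over a transaction list (the items and transactions below are illustrative, not taken from the slides' data):

```python
# Minimal sketch: support and confidence of an association rule X => Y.
transactions = [
    {"coke", "pepsi", "chips"},
    {"coke", "chips"},
    {"coke", "pepsi"},
    {"bread", "milk"},
    {"coke", "bread", "pepsi"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs, transactions):
    """support(lhs ∪ rhs) / support(lhs)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

print(support({"coke", "pepsi"}, transactions))       # support of {coke, pepsi}: 0.6
print(confidence({"coke"}, {"pepsi"}, transactions))  # confidence of coke => pepsi: 0.75
```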
Related Work (1) In 1993, Agrawal, Imielinski and Swami • Generate all significant association rules between items • Algorithm Apriori • Pruning Techniques • Buffer management • An association rule is significant if support ≥ min support and confidence ≥ min confidence
Related Work (2) • Pruning Technique • Frequency Constraint: a candidate such as ABC is generated from AB and AC and pruned unless BC is frequent as well • Memory Management • Memory to store any itemset and all its 1-extensions
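The frequency constraint can be sketched as follows: a size-k candidate is kept only if every one of its (k-1)-subsets is already frequent, so ABC survives only when AB, AC and BC are all frequent. A generic Apriori-style sketch (not the authors' original code):

```python
from itertools import combinations

def apriori_gen(frequent_prev, k):
    """Generate size-k candidates from the frequent (k-1)-itemsets and
    prune any candidate that has an infrequent (k-1)-subset."""
    prev = set(frequent_prev)           # frozensets of size k-1
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) == k:
                candidates.add(union)
    # frequency constraint: every (k-1)-subset of a surviving candidate
    # must itself be a frequent (k-1)-itemset
    return {c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))}

# ABC is generated from AB and AC, and kept only because BC is frequent too
L2 = {frozenset("AB"), frozenset("AC"), frozenset("BC")}
print(apriori_gen(L2, 3))   # {frozenset({'A', 'B', 'C'})}
```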
Related Work (3) In 1997, Srikant, Vu and Agrawal • Consider constraints that are boolean expressions over the presence or absence of items in the rules • Incomplete candidate generation: under a boolean constraint such as (B ∧ C), the subsets AB and AC do not satisfy the constraint even though ABC does, so the usual generation from frequent subsets becomes incomplete
Related Work (4) • Selected Items approach 1. generate a set of selected items, e.g., for B = (1 ∧ 2) ∨ 3 over the items 1, 2, 3, 4, 5, the selected set can be {1, 3} or {2, 3} 2. only count candidates that contain a selected item 3. discard frequent itemsets that do not satisfy the boolean expression • any (non-empty) itemset that satisfies B will contain an item from this set
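A small sketch of the selected-items idea, assuming the boolean constraint is given in disjunctive normal form over item presence only (no negations): picking one item from each disjunct yields a valid selected set, and only candidates that contain a selected item need to be counted.

```python
from itertools import product

def selected_item_sets(dnf):
    """dnf: the constraint as a list of disjuncts, each a set of items that
    must all be present.  Choosing one item from every disjunct gives a set
    such that any itemset satisfying the constraint contains one of them."""
    return [set(choice) for choice in product(*dnf)]

def worth_counting(candidate, selected):
    """Step 2: only candidates containing a selected item are counted."""
    return bool(candidate & selected)

# B = (1 and 2) or 3, over the items {1, 2, 3, 4, 5}
B = [{1, 2}, {3}]
print(selected_item_sets(B))            # [{1, 3}, {2, 3}]
print(worth_counting({4, 5}, {1, 3}))   # False: never counted
print(worth_counting({1, 4}, {1, 3}))   # True: counted
```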
Related Work (5) In 1998, Ng, Lakshmanan, Han and Pang • Achieve a maximal degree of pruning for different categories of constraints • Two properties critical to pruning • Anti-monotonicity • Succinctness • Algorithm CAP handles four categories of constraints: 1. both anti-monotone and succinct 2. succinct but non-anti-monotone 3. anti-monotone but non-succinct 4. neither anti-monotone nor succinct
Related Work (6) • Anti-Monotone Constraint • if S ⊆ S' and S' satisfies C, then S satisfies C • Domain Constraints, e.g., S ⊆ V, S θ v with θ ∈ {=, ≤, ≥} • Aggregate Constraints, e.g., min(S) ≥ v, max(S) ≤ v, count(S) ≤ v, sum(S) ≤ v
Related Work (7) • Succinct Constraint • pruning can be done once and for all, before any iteration takes place • Domain Constraints, e.g., S ⊆ V, S ⊇ V, S = V, S θ v • Aggregate Constraints, e.g., min(S) ≤ v, min(S) ≥ v, max(S) ≤ v, max(S) ≥ v, count(S) θ v, sum(S) θ v
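As a toy illustration of how the two properties are exploited (not the CAP algorithm itself): an anti-monotone constraint such as sum(S) ≤ v lets an itemset and all its supersets be pruned as soon as the check fails, while a succinct constraint such as min(S) ≥ v restricts the candidate space once, before any database pass. The item values below are made up.

```python
def violates_sum_le(itemset, value, v):
    """Anti-monotone check for sum(S) <= v with non-negative item values:
    once an itemset violates the constraint, so does every superset."""
    return sum(value[i] for i in itemset) > v

value = {"A": 40, "B": 80, "C": 10}              # illustrative item values
print(violates_sum_le({"A", "B"}, value, 100))   # True -> prune {A, B}
# every superset of {A, B}, e.g. {A, B, C}, is skipped without being counted

# A succinct constraint such as min(S) >= v lets the candidate space be
# restricted once-and-for-all to items whose value is at least v
eligible = {i for i, val in value.items() if val >= 30}
print(eligible)                                  # {'A', 'B'}
```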
Target Problems (1) • Associations of quantitative items that satisfy a given inequality constraint composed of either (+, −) or (*, /) • (Ii1 ⊕ Ii2 ⊕ . . . ⊕ Iim) ⊖ (Ij1 ⊕ Ij2 ⊕ . . . ⊕ Ijn) θ C, specified by 1. size m 2. size n 3. ⊕ : + (or *) 4. ⊖ : − (or /) 5. θ : one of <, >, =, ≤, ≥ 6. constant C • Examples: (3, 2, +, −, >, 100) and (1, 1, *, /, =, 2)
Target Problems (2) • Temporal aspect of the data • serial, parallel and sequence patterns over items A, B, C, D (e.g., A → B → C as a serial pattern) • Hierarchies over the data • e.g., PolyU → Engineering, Arts; Engineering → Computer, Civil
Problem Statement • V = { I1, I2, . . . , IM }, a set of quantitative items • T, the transactions of a database D • t[k] > 0 means t contains item Ik; t[k] = 0 means Ik does not occur in t • Find associations of items that satisfy (Ii1 ⊕ Ii2 ⊕ . . . ⊕ Iim) ⊖ (Ij1 ⊕ Ij2 ⊕ . . . ⊕ Ijn) θ C, where ⊕ is + (or *), ⊖ is − (or /), θ is one of <, >, =, ≤, ≥, and C is a scalar value
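A sketch of evaluating such a constraint over one transaction's quantities; the tuple encoding (m, n, ⊕, ⊖, θ, C), the helper name satisfies, and the sample figures are assumptions for illustration.

```python
import operator
from math import prod

AGG     = {"+": sum, "*": prod}                       # ⊕ applied within each side
COMBINE = {"-": operator.sub, "/": operator.truediv}  # ⊖ between the two sides
COMPARE = {"<": operator.lt, ">": operator.gt, "=": operator.eq,
           "<=": operator.le, ">=": operator.ge}      # θ

def satisfies(lhs_vals, rhs_vals, oplus, ominus, theta, C):
    """Check (Ii1 ⊕ ... ⊕ Iim) ⊖ (Ij1 ⊕ ... ⊕ Ijn) θ C for one transaction,
    where lhs_vals / rhs_vals are the quantities of the two item groups."""
    a = AGG[oplus](lhs_vals)
    b = AGG[oplus](rhs_vals)
    return COMPARE[theta](COMBINE[ominus](a, b), C)

# (3, 2, +, -, >, 100): (I1 + I2 + I3) - (I4 + I5) > 100
print(satisfies([150, 40, 30], [70, 20], "+", "-", ">", 100))  # 220 - 90 > 100 -> True
# (1, 1, *, /, =, 2): I1 / I2 = 2
print(satisfies([80], [40], "*", "/", "=", 2))                 # 80 / 40 == 2 -> True
```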
Application in Clinical Databases • Relationship between treatments and clinical diagnoses • nursing: 100, clinical test: 30, pharmacies: 165, . . . • nursing: 120, injection: 130, pharmacies: 100, . . . • operation: 220, injection: 542, clinical test: 60, . . . • (X + Y) − Z > 100 • X / Y = 2
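For instance, the first constraint could be checked against the first record listed above (the assignment of X, Y, Z to treatment items is purely illustrative, and the same test could be run through the hypothetical satisfies() sketch shown earlier):

```python
# Checking (X + Y) - Z > 100 on the first record, taking (illustratively)
# X = pharmacies, Y = nursing, Z = clinical test
record = {"nursing": 100, "clinical test": 30, "pharmacies": 165}
lhs = record["pharmacies"] + record["nursing"] - record["clinical test"]
print(lhs > 100)   # 265 - 30 = 235 > 100 -> True
```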
QMIC (1) • QMIC (Quantitative Mining under Inequality Constraints) • Candidate generation • reduce the number of candidate itemsets • Max_Min pruning • Support counting • reduce the number of database scans • Generation sequence • Memory requirement • work within the limited memory available
QMIC (2) • Skip generation steps based on the pre-defined sizes m and n • Generation Steps • Algorithm Apriori: Lk-1 → Lk • Algorithm QMIC: Lk/2 → Lk • Generation sequence: L1, L2, L3, L4, L5, L8, . . .
QMIC (3) • Candidate itemsets generation
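Since the generation figure is not reproduced here, the following is only a rough sketch of one plausible reading of the Lk/2 → Lk step, namely joining two frequent itemsets of size k/2; the authors' actual join and pruning conditions may differ.

```python
def qmic_gen(frequent_half, k):
    """Rough sketch: form size-k candidates by unioning two frequent
    itemsets of size k/2 (k assumed even here for simplicity)."""
    half = list(frequent_half)
    candidates = set()
    for i in range(len(half)):
        for j in range(i, len(half)):
            union = half[i] | half[j]
            if len(union) == k:          # the two halves must not overlap
                candidates.add(union)
    return candidates

# two disjoint frequent 2-itemsets join into one 4-itemset candidate
L2 = {frozenset("AB"), frozenset("CD"), frozenset("AC")}
print(qmic_gen(L2, 4))   # {frozenset({'A', 'B', 'C', 'D'})}
```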
QMIC (4) • Why this sequence? • What about using a factor of 3, 4 or larger? • Or even a power series? • Memory Management • keep the previous L's in order to generate the next level of large itemsets • only limited memory is available • in QMIC, only three previous L's are needed to generate the next level of large itemsets in the generation sequence
QMIC (5) • What is the trade-off of the generation sequence? • a larger number of candidate itemsets • longer processing time in pruning • Max_Min Pruning • incorporates the inequality constraint into the pruning • Maximum value itemset list (maxlst) • sorted in descending order by the maximum value of the sum (product) • Minimum value itemset list (minlst) • sorted in ascending order by the minimum value of the sum (product)
QMIC (6) • Max_Pruning • applies when θ ∈ {≥, >} • A ⊖ B θ C, where A = (Ii1 ⊕ . . . ⊕ Iim) and B = (Ij1 ⊕ . . . ⊕ Ijn) • Minimum value of A • Over-pruning? • Using maxlst • slide a window of size m over maxlst • stop sliding once the total sum of the items inside the window is smaller than C
QMIC (7) • Max_Pruning procedure
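The procedure itself is not reproduced on this slide, so the following is only a rough sketch of the sliding-window test described above, under the assumptions that θ ∈ {>, ≥}, ⊕ is +, and the window size is m; maxlst is simplified here to single items with their maximum observed values, sorted in descending order, and the figures are illustrative.

```python
def max_pruning_cut(maxlst, m, C):
    """Sketch of the sliding-window test for theta in {>, >=} and ⊕ = +.
    maxlst: [(item, max_value)] sorted by max_value in descending order.
    Returns the first window start whose m-item sum drops below C; any
    m-itemset drawn only from positions at or after that start can never
    make the left-hand side A exceed C, so those combinations are pruned."""
    for start in range(len(maxlst) - m + 1):
        window_sum = sum(v for _, v in maxlst[start:start + m])
        if window_sum < C:
            return start           # stop sliding here
    return None                    # nothing can be pruned by this test

# illustrative maximum values per item
maxlst = [("operation", 220), ("injection", 130), ("pharmacies", 100),
          ("nursing", 60), ("clinical test", 30)]
print(max_pruning_cut(maxlst, m=2, C=200))   # 2: prune pairs drawn from the tail
```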
Experiments (1) • Number of items
Experiments (2) • Number of transactions
Future Plan • Association Rules of Sequence Patterns • Time constraint • Association Rules of Multi-layer data