360 likes | 529 Views
FINDING FUZZY SETS FOR QUANTITATIVE ATTRIBUTES FOR MINING OF FUZZY ASSOCIATE RULES. By H.N.A. Pham, T.W. Liao, and E. Triantaphyllou Department of Industrial Engineering 3128 CEBA Building Louisiana State University Baton Rouge, LA 70803-6409
E N D
FINDING FUZZY SETS FOR QUANTITATIVE ATTRIBUTES FOR MINING OF FUZZY ASSOCIATE RULES By H.N.A. Pham, T.W. Liao, and E. Triantaphyllou Department of Industrial Engineering 3128 CEBA Building Louisiana State University Baton Rouge, LA 70803-6409 Email: hpham15@lsu.edu, ieliao@lsu.edu, and trianta@lsu.edu
Outline • Introduction • Background • A fuzzy approach for mining associate rules • Experimental evaluation • Conclusions
Introduction • Associate analysis is a new and attractive research area in data mining • The Apriori algorithm (R. Agrawal, IBM 1993) is a key technique for Associate analysis • Though the Apriori principle allows us to considerably reduce the search space, the technique still requires a huge computation, particularly for large databases • This research proposes an approach for finding fuzzy sets for quantitative attributes in a database by using clustering techniques and then employs techniques for mining of fuzzy Associate rules .
Outline • Introduction • Background • Associate rules and the Apriori algorithm • Necessity to find fuzzy sets for quantitative attributes • A fuzzy approach for fuzzy mining associate rules • Experimental evaluation • Conclusions
Associate rules: Market basket analysis • Analyzes customer buying habits by finding associations between the different items that customers place in their “shopping baskets” (in the form X Y, where X and Y are sets of items) • I = {I1=beer, I2=cake, I3=onigiri} • A transactional database • An Associate rule: {I1} {I3} How often people buy candy and beer together? TID1: {I1, I2, I3} TID2: {I1, I2} TID3: {I2, I3} TID4: {I2} TID5: {I1, I2}
Rule measures: Support and Confidence • Associate rule: X Y • support s = probability that a transaction contains X and Y • confidence c =conditional probability that a transaction having X also contains Y • A C (s=50%, c=66.6%) • C A (s=50%, c=100%) Customer buys both Customer buys beer Customer buys onigiri
Associate mining: the Apriori algorithm It is composed of two steps: • Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a pre-determined minimum support count • Generate strong Associate rules from the frequent itemsets: By definition, these rules must satisfy minimum support and minimum confidence (Agrawal, 1993)
Associate mining: the Apriori principle Min. support 50% Min. confidence 50% For rule A C support = support({A and C}) = 50% confidence = support({A and C})/support({A}) = 66.6% The Apriori principle: Any subset of a frequent itemset must be frequent (if an itemset is not frequent, neither are its supersets)
The Apriori algorithm: Finding frequent itemsets using candidate generation • Find the frequent itemsets: the sets of items that have support higher than the minimum support • A subset of a frequent itemset must also be a frequent itemset i.e., if {AB} is a frequent itemset, both {A} and {B} should be a frequent itemsets • Iteratively find frequent itemsets Lk with cardinality from 1 to k (k-itemset) from candidate itemsets Ck (Lk Ck) • Use the frequent itemsets to generate Associate rules. C1 … Li-1 Ci Li Ci+1 … Lk
Example (min_sup_count = 2) Scan D for count of each candidate Compare candidate support count with minimum support count Transactional data TID List of items_IDs T100 I1, I2, I5 T200 I2, I4 T300 I2, I3 T400 I1, I2, I4 T500 I1, I3 T600 I2, I3 T700 I1, I3 T800 I1, I2, I3, I5 T900 I1, I2, I3 C1 L1 Itemset Sup.Count {I1} 6 {I2} 7 {I3} 6 {I4} 2 {I5} 2 Itemset Sup.Count {I1} 6 {I2} 7 {I3} 6 {I4} 2 {I5} 2
Example (min_sup_count = 2) C2 C2 Itemset S.count {I1, I2} 4 {I1, I3} 4 {I1, I4} 1 {I1, I5} 2 {I2, I3} 4 {I2, I4} 2 {I2, I5} 2 {I3, I4} 0 {I3, I5} 1 {I4, I5} 0 Compare candidate support count with minimum support count Itemset {I1, I2} {I1, I3} {I1, I4} {I1, I5} {I2, I3} {I2, I4} {I2, I5} {I3, I4} {I3, I5} {I4, I5} Generate candidates C2 from L1 by using the Apriori principle Scan D for count of each candidate L2 Itemset S.count {I1, I2} 4 {I1, I3} 4 {I1, I5} 2 {I2, I3} 4 {I2, I4} 2 {I2, I5} 2 Compare candidate support count with minimum support count Generate candidates C3 from L2 by using the Apriori principle C3 L3 Scan D for count of each candidate Itemset {I1, I2, I3} {I1, I2, I5} Itemset Sc {I1, I2, I3} 2 {I1, I2, I5} 2 Itemset Sc {I1, I2, I3} 2 {I1, I2, I5} 2
A quantitative associate rule with min_sup= min_conf =50% (Age = 33 or 39) and (Married = Yes) -> (NumCars =2) A quantitative associate rule with min_sup= min_conf=50% (Age = 33..39) and (Married = Yes) -> (NumCars =2) A fuzzy associate rule with min_sup= min_conf =50% (Age = middle-aged) and (Married = Yes) -> (NumCars =2) Necessity to find fuzzy sets for quantitative attributes
Solution: Shape boundary intervals • It is composed of two steps: • Partition the attribute domains into small intervals and combine adjacent intervals into larger ones such that the combined intervals will have enough supports • Replace the original attribute by its attribute-interval pairs, the quantitative problem can be transformed to a Boolean one. (Srikant and Agrawal, 1996)
Transaction ID Age: 18-30 Age: 31-39 Married NumCars:0-1 NumCars:2-3 100 No Yes Yes No Yes 200 No Yes Yes No Yes 300 No Yes No Yes No 400 Yes No No Yes No • Algorithms ignore or over-emphasize the elements near the boundary of the intervals in the mining process • The use of shape boundary interval is also not intuitive with respect to human perception Example: Shape boundary intervals
Fuzzy sets and their corresponding membership functions provided by experts may not be suitable for mining fuzzy Associate rules in the database Solution: Experts • An user or expert must provide to this algorithm the required fuzzy sets of the quantitative attributes and their corresponding membership functions
Lose association between attributes in the mining approach Solution: Fuzzy sets for quantitative attributes It is composed of three steps: Step 1: Transform the original database into positive integer Step 2: For each attribute Cluster values of the attribute ith into k medoids Classify the attribute ith into k fuzzy sets Generate membership functions for each fuzzy set End for Step 3: Transform the database based on fuzzy sets (Ada, 1998)
Outline • Introduction • Background • A fuzzy approach for fuzzy mining associate rules • Fuzzy approach • Fuzzy mining associate rules • Experimental evaluation • Conclusions
Fuzzy approach It is composed of five steps: Step 1: Transform the original database into one with positive integers Step 2: Cluster values of attributes into k medoids. Step 3: Classify attributes into k fuzzy sets Step 4: Generate membership functions for each fuzzy set Step 5: Transform the database based on fuzzy sets
Do not lose association between attributes in the mining approach Fuzzy approach: Step 2 • Clustering: • The clustering method considers the search space of a database with n attributes as an n-dimensional space • Use the Matlab fuzzy tool box
Fuzzy approach: Step 3 Classify: • Let {m1, m2, …, mk} be k medoids found from step 2, where mi = {ai1, ai2, …, ain} is the medoid ith. • Let the attribute jth have a range [minj, maxj] and {a1j, a2j, …, akj} be set of mid-points of the attribute jth. The k fuzzy sets of this attribute will be ranged in [minj, a2j], [a1j, a3j], …, [a(i-1)j, a(i+1)j], …, and [a(k-1)j, maxj] a(i-1)j aij a(i+1)j minj maxj Fuzzy set
Fuzzy approach: Step 4 Generate membership functions (triangular function):
Fuzzy approach: Step 5 Transform the database based on fuzzy sets: • Let Tij be the value of the ith transaction at the jth attribute Tij = fuzzy label ith if fij(Tij) = max(fkj(Tij))
Fuzzy label Range Mid-point Fuzzy label Range Mid-point Low_I 50 – 120 100 Low_S 4000 – 10000 7000 Medium_I 100 – 165 140 Medium_S 7000 – 20000 15000 High_I 140 – 200 183 High_S 15000 – 32000 30000 ID Salary’s membership IQ’s membership ID Salary IQ 1 Low_S Low_I 1 0.71 0.8 2 0.71 0.83 2 Low_S Low_I 3 High_S High_I 3 0.37 0.67 4 0.86 0.86 4 Low_S Low_I 5 0.83 0.74 5 Medium_S Medium_I 6 0.56 0.74 6 Medium_S Medium_I 7 0.14 0.31 7 Low_S Low_I Example of fuzzy approach Step 2 Steps 3, 4, 5
Fuzzy mining Associate rules • It is composed of two steps: • Find all itemsets that have fuzzy support (FS<X,A>) above the user specified minimum support. These itemsets are called frequent itemsets. • Use the frequent itemsets to generate the desired rules. Let X and Y be frequent itemsets. We can determine if the rule X => Y holds by computing the fuzzy confidence FC<<X,A>,<Y,B>> and this value is larger than the user specified minimum confidence value. (Attilia, 2000)
Fuzzy mining Associate rules - cont • D = {t1, t2, …, tn}: transactions • <X,A> with X is attributes and A is the corresponding fuzzy sets in X • Z = X U Y, C = A U B
Outline • Introduction • Background • A fuzzy approach for fuzzy mining associate rules • Experimental evaluation • Conclusions
Name |D| |T| Size (MB) D100k.T10 100K 10 3M D100k.T20 100K 20 6M D320k.T30 320K 30 18M |D| = Number of transactions |T| = Average amount of items on transactions Experiments:Synthetic datasets • Using synthetic datasets of varying sizes:
Experiment environment • Software • Database : Microsoft Access 2003 • Language: C++ and Visual Basic, Matlab • Platform: Windows • Hardware • PC Pentium IV-2.66 GMhz, RAM 1GB
Evaluate mean of rules From database Salary and IQ, we have rules from the approach with minimum support=43% and minimum confidence = 50% as follows: Rule 1: If 1st variable is low approximately 7000 [ 4000, 10000] then 2nd variable is low approximately 100 [50, 120] Rule 2: If 1st variable is medium approximately 15000 [7000, 20000] then 2nd variable is medium approximately 140 [ 100, 165]
ID ID Salary’s membership Salary’s membership IQ’s membership IQ’s membership 1 1 0.74 0.71 0.85 0.8 2 2 0.71 0.91 0.83 0.93 3 3 0.37 0.57 0.67 0.67 4 4 0.9 0.86 0.9 0.86 5 5 0.83 0.83 0.74 0.84 6 6 0.56 0.66 0.74 0.84 7 7 0.34 0.14 0.51 0.31 The new approach is fuzzier than Ada Evaluate fuzziness Ada New approach • Using the Yager’s fuzziness with p = 1 • Ada_fuzziness_Salary ≈ 0.357 ≤ NewApproach_fuzziness_Salary ≈ 0.425 • Ada_fuzziness_IQ ≈ 0.51 ≤ NewApproach_fuzziness_IQ ≈ 0.59
In Ada’s Approach, mid points of ranges are moved out centre values. This leads to change mean of frequent itemsets. Evaluate fuzziness - cont
Execution time (sec.) with different minimum support thresholds *: do not include the transfer time
Execution time (sec.) with different minimum support thresholds - cont • Execution time (transfer + mining time) of the fuzzy method is better than the Apriori. • Moreover, mean of rules is more “Understandable”
Conclusions • Proposed an approach to find fuzzy sets for quantitative attributes for mining associate rules • An experimental evaluation shows that the mean of rules and execution time when using the fuzzy approach in mining Associate rules are better than that of other algorithms • Future work: • Improve the fuzzy mining approach • Develop incremental algorithms for associate analysis using Support Vector Machines
THANK YOU H.N.A. Pham, T.W. Liao, and E. Triantaphyllou Department of Industrial Engineering 3128 CEBA Building Louisiana State University Baton Rouge, LA 70803-6409 Email: hpham15@lsu.edu, ieliao@lsu.edu, and trianta@lsu.edu