Chapter 3: Association Rule Mining
Motivation • Grocery shopping: a store manager may want to understand customers' purchasing habits, for example "Which items is a customer likely to buy together in a single trip?" The results can be used for market planning, advertising, and shelf or catalog design. • Text classification: a personalized news recommender wants to categorize articles and push news in the categories a user cares about. By mining which keywords frequently co-occur with a given category, the system can derive classification criteria for documents. • Information recommendation: an e-commerce site recommends items a user may need, e.g., what characteristics do users who download a certain genre of music typically share? • An effective way to address these problems is Association Rule Mining.
Association Rule Mining • Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction. Market-Basket transactions. Example of Association Rules: {Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk}. Implication means co-occurrence, not causality!
Definition: Frequent Itemset • Itemset: a collection of one or more items. Example: {Milk, Bread, Diaper} • k-itemset: an itemset that contains k items • Support count (σ): frequency of occurrence of an itemset. E.g. σ({Milk, Bread, Diaper}) = 2 • Support (s): fraction of transactions that contain an itemset. E.g. s({Milk, Bread, Diaper}) = 2/5 • Frequent Itemset: an itemset whose support is greater than or equal to a minsup threshold
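A minimal Python sketch of support count and support. The five transactions below are an assumption (the market-basket table appears only as a figure in the slides), but they reproduce the σ and s values quoted here and the rule metrics quoted a few slides later.

```python
# Assumed market-basket transactions; chosen to match the counts on these slides.
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions containing every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """s(X): fraction of transactions containing X."""
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"Milk", "Bread", "Diaper"}, TRANSACTIONS))  # 2
print(support({"Milk", "Bread", "Diaper"}, TRANSACTIONS))        # 0.4
```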
Definition: Association Rule • Association Rule: an implication expression of the form X → Y, where X and Y are itemsets. Example: {Milk, Diaper} → {Beer} • Rule Evaluation Metrics • Support (s): fraction of transactions that contain both X and Y • Confidence (c): measures how often items in Y appear in transactions that contain X
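Written out in LaTeX, the two metrics and the worked example (the 2/5 and 2/3 values agree with the s = 0.4, c ≈ 0.67 quoted on the following slides):

```latex
% Support and confidence of a rule X -> Y, and the worked example
% {Milk, Diaper} -> {Beer} over the five market-basket transactions.
\[
  s(X \Rightarrow Y) = \frac{\sigma(X \cup Y)}{|T|}, \qquad
  c(X \Rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)}
\]
\[
  s = \frac{\sigma(\{\text{Milk},\text{Diaper},\text{Beer}\})}{5} = \frac{2}{5} = 0.4, \qquad
  c = \frac{\sigma(\{\text{Milk},\text{Diaper},\text{Beer}\})}{\sigma(\{\text{Milk},\text{Diaper}\})} = \frac{2}{3} \approx 0.67
\]
```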
Association Rule Mining Task • Given a set of transactions T, the goal of association rule mining is to find all rules having • support ≥ minsup threshold • confidence ≥ minconf threshold • Brute-force approach: • List all possible association rules • Compute the support and confidence for each rule • Prune rules that fail the minsup and minconf thresholds Computationally prohibitive!
Mining Association Rules. Example of Rules: {Milk, Diaper} → {Beer} (s=0.4, c=0.67), {Milk, Beer} → {Diaper} (s=0.4, c=1.0), {Diaper, Beer} → {Milk} (s=0.4, c=0.67), {Beer} → {Milk, Diaper} (s=0.4, c=0.67), {Diaper} → {Milk, Beer} (s=0.4, c=0.5), {Milk} → {Diaper, Beer} (s=0.4, c=0.5) • Observations: • All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer} • Rules originating from the same itemset have identical support but can have different confidence • Thus, we may decouple the support and confidence requirements
Mining Association Rules • Two-step approach: • Frequent Itemset Generation: generate all itemsets whose support ≥ minsup • Rule Generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset • Frequent itemset generation is still computationally expensive
Frequent Itemset Generation. Given d items, there are 2^d possible candidate itemsets.
Frequent Itemset Generation • Brute-force approach: • Each itemset in the lattice is a candidate frequent itemset • Count the support of each candidate by scanning the database • Match each transaction against every candidate • Complexity ~ O(NMw), where N is the number of transactions, M is the number of candidates, and w is the maximum transaction width => expensive, since M = 2^d!
Computational Complexity • Given d unique items: • Total number of itemsets = 2^d • Total number of possible association rules: R = 3^d − 2^(d+1) + 1 • If d = 6, R = 602 rules
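The closed form counts a non-empty antecedent and then a non-empty consequent drawn from the remaining items; the d = 6 case checks out:

```latex
% Number of possible rules over d items, and the d = 6 evaluation.
\[
  R \;=\; \sum_{k=1}^{d-1} \binom{d}{k} \sum_{j=1}^{d-k} \binom{d-k}{j}
    \;=\; 3^{d} - 2^{d+1} + 1
\]
\[
  d = 6:\quad R = 3^{6} - 2^{7} + 1 = 729 - 128 + 1 = 602
\]
```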
Frequent Itemset Generation Strategies • Reduce the number of candidates (M) • Complete search: M=2d • Use pruning techniques to reduce M • Reduce the number of transactions (N) • Reduce size of N as the size of itemset increases • Used by DHP and vertical-based mining algorithms • Reduce the number of comparisons (NM) • Use efficient data structures to store the candidates or transactions • No need to match every candidate against every transaction
Reducing Number of Candidates • Apriori principle: • If an itemset is frequent, then all of its subsets must also be frequent • Apriori principle holds due to the following property of the support measure: • Support of an itemset never exceeds the support of its subsets • This is known as the anti-monotone property of support
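Formally, the anti-monotone property can be stated as:

```latex
% Anti-monotone property of support: a superset can never be more
% frequent than any of its subsets.
\[
  \forall X, Y:\; (X \subseteq Y) \;\Rightarrow\; s(X) \ge s(Y)
\]
```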
Illustrating Apriori Principle (itemset-lattice figure: once an itemset is found to be infrequent, all of its supersets are pruned).
Illustrating Apriori Principle (figure: tables of item, pair, and triplet counts with Minimum Support = 3; no need to generate candidates involving Coke or Eggs). If every subset is considered, C(6,1) + C(6,2) + C(6,3) = 41 candidates; with support-based pruning, only 6 + 6 + 1 = 13.
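Spelled out, the two candidate counts being compared are:

```latex
% Candidate counts with and without pruning (6 items, minsup = 3).
\[
  \binom{6}{1} + \binom{6}{2} + \binom{6}{3} = 6 + 15 + 20 = 41
  \qquad\text{vs.}\qquad 6 + 6 + 1 = 13
\]
```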
Apriori Algorithm • Method: • Let k=1 • Generate frequent itemsets of length 1 • Repeat until no new frequent itemsets are identified • Generate length (k+1) candidate itemsets from length k frequent itemsets • Prune candidate itemsets containing subsets of length k that are infrequent • Count the support of each candidate by scanning the DB • Eliminate candidates that are infrequent, leaving only those that are frequent
Apriori Algorithm • L1 = {frequent 1-itemsets}; • for (k = 2; Lk-1 ≠ ∅; k++) do begin • Ck = apriori-gen(Lk-1); // generate the set Ck of candidate k-itemsets from Lk-1 • for all transactions t ∈ D do begin • Ct = subset(Ck, t); // candidate itemsets contained in transaction t • for all candidates c ∈ Ct do • c.count++; • end • Lk = {c ∈ Ck | c.count ≥ minsup} • end • Answer = ∪k Lk
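A runnable Python sketch of this loop (a sketch, not the textbook implementation): apriori_gen joins two frequent (k−1)-itemsets that agree on everything but their last item, then prunes candidates that have an infrequent (k−1)-subset. The nine-transaction table D is an assumption; it is the standard AllElectronics-style example and reproduces every count in the walkthrough that follows (minimum support count = 2).

```python
from itertools import combinations

def apriori_gen(prev_frequent, k):
    """Join step + prune step: candidate k-itemsets from L_{k-1}."""
    prev = sorted(sorted(itemset) for itemset in prev_frequent)
    prev_set = {frozenset(itemset) for itemset in prev_frequent}
    candidates = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            if prev[i][:-1] == prev[j][:-1]:           # join: same first k-2 items
                cand = frozenset(prev[i]) | frozenset(prev[j])
                if all(frozenset(sub) in prev_set       # prune: every (k-1)-subset frequent
                       for sub in combinations(sorted(cand), k - 1)):
                    candidates.add(cand)
    return candidates

def apriori(transactions, minsup_count):
    """Return {k: set of frequent k-itemsets} for an absolute support count."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    counts = {frozenset([i]): sum(i in t for t in transactions) for i in items}
    L = {1: {c for c, n in counts.items() if n >= minsup_count}}
    k = 2
    while L[k - 1]:
        Ck = apriori_gen(L[k - 1], k)
        counts = {c: sum(c <= t for t in transactions) for c in Ck}
        L[k] = {c for c, n in counts.items() if n >= minsup_count}
        k += 1
    del L[k - 1]                                        # drop the final empty level
    return L

# Assumed nine-transaction DB, consistent with the walkthrough below.
D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]
print(apriori(D, minsup_count=2)[3])
# -> {frozenset({'I1', 'I2', 'I3'}), frozenset({'I1', 'I2', 'I5'})}
```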
Generating Candidate Itemsets • The Apriori algorithm is named after the Apriori principle, the prior knowledge about the properties of frequent itemsets that it exploits. It uses an iterative, level-wise search in which k-itemsets are used to generate (k+1)-itemsets. • Concretely: first find the set of frequent 1-itemsets, denoted L1; then use L1 to find the set of frequent 2-itemsets L2; use L2 to find L3, and so on, until no frequent k-itemsets can be found. Finding each Lk requires one full scan of the database.
The join step produces the candidate itemsets; the prune step determines the frequent itemsets. Generating Candidate Itemsets: the whole process consists of two steps, join and prune: • (1) Join step: to find Lk, join Lk-1 with itself to produce a set of candidate k-itemsets; this set of candidates is denoted Ck.
Generating Candidate Itemsets: let l1 and l2 be itemsets in Lk-1, and let li[j] denote the j-th item of li. For convenience, assume the items within a transaction or itemset are sorted in lexicographic order. The join Lk-1 ⋈ Lk-1 is performed, where two elements of Lk-1 are joinable if their first (k−2) items agree.
Generating Candidate Itemsets: that is, elements l1 and l2 of Lk-1 are joinable if (l1[1] = l2[1] ∧ l1[2] = l2[2] ∧ … ∧ l1[k-2] = l2[k-2] ∧ l1[k-1] < l2[k-1]). The condition l1[k-1] < l2[k-1] ensures that no duplicate itemsets are generated.
Generating Candidate Itemsets • (2) Prune step: Ck is a superset of Lk, i.e., its members are not necessarily all frequent, but every frequent k-itemset is contained in Ck. Scanning the database to obtain the count of each candidate in Ck determines Lk; however, Ck can be very large, so the amount of computation involved can be substantial.
Generating Candidate Itemsets: to shrink Ck, the Apriori property can be used: no infrequent (k−1)-itemset can be a subset of a frequent k-itemset. Hence, if any (k−1)-subset of a candidate k-itemset is not in Lk-1, the candidate cannot be frequent and can be deleted from Ck.
Generating Candidate Itemsets • [Example] A concrete Apriori run, based on the transaction DB of a store (shown as a figure in the slides). The DB contains 9 transactions, and Apriori assumes the items within each transaction are stored in lexicographic order.
Scan D and count each candidate. (1) In the first iteration of the algorithm, every item is a member of C1, the set of candidate 1-itemsets. The algorithm simply scans all transactions and counts the occurrences of each item. (Table C1.)
Compare each candidate's support count with the minimum support count. (2) With the minimum support count set to 2, the set of frequent 1-itemsets L1 can be determined: it consists of the candidate 1-itemsets that have at least minimum support. (Table L1.)
Generate candidates C2 from L1. (3) To discover the set of frequent 2-itemsets L2, the algorithm uses the join L1 ⋈ L1 to generate the set of candidate 2-itemsets C2. (Table C2.)
Scan D and count each candidate. (4) Scan the transactions in D and compute the support count of every candidate itemset in C2. (Table C2.)
Compare each candidate's support count with the minimum support count. (5) Determine the set of frequent 2-itemsets L2: it consists of the candidate 2-itemsets in C2 that have at least minimum support. (Table L2.)
(6) The set of candidate 3-itemsets C3 is generated as follows: ① Join: C3 = L2 ⋈ L2 = {{I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}} ⋈ {{I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}} = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}
② Prune using the Apriori property: every subset of a frequent itemset must be frequent, so for each candidate itemset we check whether its subsets are frequent. • The 2-item subsets of {I1, I2, I3} are {I1, I2}, {I1, I3}, and {I2, I3}; all are members of L2, so {I1, I2, I3} is kept in C3. • The 2-item subsets of {I1, I2, I5} are {I1, I2}, {I1, I5}, and {I2, I5}; all are members of L2, so {I1, I2, I5} is kept in C3. • The 2-item subsets of {I1, I3, I5} are {I1, I3}, {I1, I5}, and {I3, I5}; {I3, I5} is not in L2 and therefore not frequent, so {I1, I3, I5} is deleted from C3.
• The 2-item subsets of {I2, I3, I4} are {I2, I3}, {I2, I4}, and {I3, I4}; {I3, I4} is not in L2 and therefore not frequent, so {I2, I3, I4} is deleted from C3. • The 2-item subsets of {I2, I3, I5} are {I2, I3}, {I2, I5}, and {I3, I5}; {I3, I5} is not in L2 and therefore not frequent, so {I2, I3, I5} is deleted from C3. • The 2-item subsets of {I2, I4, I5} are {I2, I4}, {I2, I5}, and {I4, I5}; {I4, I5} is not in L2 and therefore not frequent, so {I2, I4, I5} is deleted from C3.
Generate candidates C3 from L2; scan D and count each candidate. ③ Thus, after pruning, C3 = {{I1, I2, I3}, {I1, I2, I5}}. (7) Scan the transactions in D to determine L3, which consists of the candidate 3-itemsets in C3 that have at least minimum support. (Table C3.)
Compare each candidate's support count with the minimum support count. (8) The algorithm uses the join L3 ⋈ L3 to generate the set of candidate 4-itemsets C4. Although the join produces {{I1, I2, I3, I5}}, this itemset is pruned because its subset {I2, I3, I5} is not frequent. Therefore C4 = ∅, the algorithm terminates, and all frequent itemsets have been found.
The candidate-generation function apriori-gen • Takes Lk-1 as input and outputs a superset of the set of all frequent k-itemsets. The function performs two operations, join and prune. The join operation combines the frequent itemsets in Lk-1 as follows: • insert into Ck • select p.item1, p.item2, …, p.itemk-1, q.itemk-1 • from Lk-1 p, Lk-1 q • where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1;
Prune operation • For any candidate itemset c in Ck, if some (k−1)-subset of c does not belong to Lk-1, delete c from Ck. forall itemsets c ∈ Ck do forall (k−1)-subsets s of c do if (s ∉ Lk-1) then delete c from Ck;
Example • Suppose L3 = {{1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}}. • The join yields C4 = {{1 2 3 4}, {1 3 4 5}}. • Since the itemset {1 4 5} is not in L3, the prune operation removes {1 3 4 5} from C4.
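A minimal Python rendering of the join SQL and the prune loop above (apriori_gen is a hypothetical helper name; itemsets are kept as sorted tuples so the p.item1 … p.itemk-1 comparisons become prefix comparisons). On the L3 of this example it reproduces C4 = {{1 2 3 4}}:

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Join (as in the SQL above) plus prune (as in the loop above).
    L_prev holds the frequent (k-1)-itemsets as sorted tuples."""
    L_prev = sorted(L_prev)
    Ck = []
    # Join: p and q agree on items 1..k-2 and p.item_{k-1} < q.item_{k-1}.
    for p in L_prev:
        for q in L_prev:
            if p[:k - 2] == q[:k - 2] and p[k - 2] < q[k - 2]:
                Ck.append(p + (q[k - 2],))
    # Prune: drop c if any (k-1)-subset of c is not in L_{k-1}.
    frequent = set(L_prev)
    return [c for c in Ck
            if all(s in frequent for s in combinations(c, k - 1))]

L3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
print(apriori_gen(L3, 4))   # [(1, 2, 3, 4)] -- (1, 3, 4, 5) is pruned
```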
Reducing Number of Comparisons • Candidate counting: • Scan the database of transactions to determine the support of each candidate itemset • To reduce the number of comparisons, store the candidates in a hash structure • Instead of matching each transaction against every candidate, match it against candidates contained in the hashed buckets
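A sketch of the counting idea, with a plain Python dict standing in for the hash structure (the hash tree on the next slides refines this so that a transaction only descends into a few buckets). The transaction table is the same assumed market-basket data as before; the counts it prints are consistent with the confidences quoted earlier.

```python
from itertools import combinations

def count_candidates(transactions, candidates, k):
    """Support counting via a hash structure: instead of matching every
    transaction against every candidate, enumerate the k-subsets of each
    transaction and probe them in a dict keyed by candidate itemset."""
    counts = {frozenset(c): 0 for c in candidates}
    for t in transactions:
        for subset in combinations(sorted(t), k):
            key = frozenset(subset)
            if key in counts:          # only candidates contained in t are touched
                counts[key] += 1
    return counts

# Assumed market-basket transactions (same assumption as earlier).
D = [{"Bread", "Milk"},
     {"Bread", "Diaper", "Beer", "Eggs"},
     {"Milk", "Diaper", "Beer", "Coke"},
     {"Bread", "Milk", "Diaper", "Beer"},
     {"Bread", "Milk", "Diaper", "Coke"}]
C2 = [{"Milk", "Diaper"}, {"Milk", "Beer"}, {"Diaper", "Beer"}]
print(count_candidates(D, C2, 2))
# sigma({Milk, Diaper}) = 3, sigma({Milk, Beer}) = 2, sigma({Diaper, Beer}) = 3
```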
Generate Hash Tree • Suppose you have 15 candidate itemsets of length 3: {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8} • You need: • A hash function (here MOD 3, mapping items 1, 4, 7 / 2, 5, 8 / 3, 6, 9 to the three branches) • A max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets exceeds the max leaf size, split the node). (The resulting hash tree is shown as a figure.)
Association Rule Discovery: Hash Tree (figure: the candidate hash tree built from the 15 candidates above, with hash function 1, 4, 7 / 2, 5, 8 / 3, 6, 9). The three views highlight, in turn, the subtree reached by hashing the current item to 1, 4 or 7; to 2, 5 or 8; and to 3, 6 or 9, until a leaf holding at most the max leaf size of candidates is reached.
Subset Operation Given a transaction t, what are the possible subsets of size 3?
Subset Operation Using Hash Tree (figure: the transaction t = {1, 2, 3, 5, 6} is recursively decomposed as 1 + {2 3 5 6}, 2 + {3 5 6}, 3 + {5 6}, then 1 2 + {3 5 6}, 1 3 + {5 6}, 1 5 + {6}, and so on, and each prefix is hashed down the candidate hash tree). Following only the branches that the transaction's own items can reach, the transaction is matched against 11 out of the 15 candidates instead of all of them.
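For intuition, a brute-force version of the subset operation: enumerate every size-3 subset of t = {1, 2, 3, 5, 6} (there are C(5,3) = 10 of them) and probe each against the candidate list. The hash tree finds the same matches while touching only 11 of the 15 candidates.

```python
from itertools import combinations

# The 15 candidate 3-itemsets from the hash-tree slides.
candidates = [{1, 4, 5}, {1, 2, 4}, {4, 5, 7}, {1, 2, 5}, {4, 5, 8}, {1, 5, 9},
              {1, 3, 6}, {2, 3, 4}, {5, 6, 7}, {3, 4, 5}, {3, 5, 6}, {3, 5, 7},
              {6, 8, 9}, {3, 6, 7}, {3, 6, 8}]

t = {1, 2, 3, 5, 6}

# Brute-force subset operation: all size-3 subsets of the transaction ...
subsets = [set(s) for s in combinations(sorted(t), 3)]
print(len(subsets))           # 10

# ... and which candidates are actually contained in t (counts to increment).
matched = [c for c in candidates if c <= t]
print(matched)                # [{1, 2, 5}, {1, 3, 6}, {3, 5, 6}]
```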
Factors Affecting Complexity • Choice of minimum support threshold • lowering support threshold results in more frequent itemsets • this may increase number of candidates and max length of frequent itemsets • Dimensionality (number of items) of the data set • more space is needed to store support count of each item • if number of frequent items also increases, both computation and I/O costs may also increase • Size of database • since Apriori makes multiple passes, run time of algorithm may increase with number of transactions • Average transaction width • transaction width increases with denser data sets • This may increase max length of frequent itemsets and traversals of hash tree (number of subsets in a transaction increases with its width)