Chapter 3: Association Rule Mining
Motivation • Grocery shopping: a store manager may want to understand customers' purchasing habits, for example "Which items is a customer likely to buy together in a single trip?" The results can be used for market planning, advertising, and shelf or catalog design. • Text classification: a personalized news recommender wants to categorize articles and push news in the categories a user cares about. By mining which keywords frequently co-occur with a given category, the system can derive classification criteria for documents. • Information recommendation: an e-commerce site recommends items a user may need, e.g., what characteristics do users who download a certain genre of music typically share? • An effective way to address these problems is Association Rule Mining.
Association Rule Mining • Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction. Market-Basket transactions. Example of Association Rules: {Diaper} → {Beer}, {Milk, Bread} → {Eggs, Coke}, {Beer, Bread} → {Milk}. Implication means co-occurrence, not causality!
Definition: Frequent Itemset • Itemset: a collection of one or more items. Example: {Milk, Bread, Diaper} • k-itemset: an itemset that contains k items • Support count (σ): frequency of occurrence of an itemset. E.g. σ({Milk, Bread, Diaper}) = 2 • Support (s): fraction of transactions that contain an itemset. E.g. s({Milk, Bread, Diaper}) = 2/5 • Frequent Itemset: an itemset whose support is greater than or equal to a minsup threshold
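A minimal Python sketch of support count and support. The five transactions below are an assumption (the market-basket table appears only as a figure in the slides), but they reproduce the σ and s values quoted here and the rule metrics quoted a few slides later.

```python
# Assumed market-basket transactions; chosen to match the counts on these slides.
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions containing every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """s(X): fraction of transactions containing X."""
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"Milk", "Bread", "Diaper"}, TRANSACTIONS))  # 2
print(support({"Milk", "Bread", "Diaper"}, TRANSACTIONS))        # 0.4
```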
Definition: Association Rule • Association Rule: an implication expression of the form X → Y, where X and Y are itemsets. Example: {Milk, Diaper} → {Beer} • Rule Evaluation Metrics • Support (s): fraction of transactions that contain both X and Y • Confidence (c): measures how often items in Y appear in transactions that contain X
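Written out in LaTeX, the two metrics and the worked example (the 2/5 and 2/3 values agree with the s = 0.4, c ≈ 0.67 quoted on the following slides):

```latex
% Support and confidence of a rule X -> Y, and the worked example
% {Milk, Diaper} -> {Beer} over the five market-basket transactions.
\[
  s(X \Rightarrow Y) = \frac{\sigma(X \cup Y)}{|T|}, \qquad
  c(X \Rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)}
\]
\[
  s = \frac{\sigma(\{\text{Milk},\text{Diaper},\text{Beer}\})}{5} = \frac{2}{5} = 0.4, \qquad
  c = \frac{\sigma(\{\text{Milk},\text{Diaper},\text{Beer}\})}{\sigma(\{\text{Milk},\text{Diaper}\})} = \frac{2}{3} \approx 0.67
\]
```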
Association Rule Mining Task • Given a set of transactions T, the goal of association rule mining is to find all rules having • support ≥ minsup threshold • confidence ≥ minconf threshold • Brute-force approach: • List all possible association rules • Compute the support and confidence for each rule • Prune rules that fail the minsup and minconf thresholds Computationally prohibitive!
Mining Association Rules. Example of Rules: {Milk, Diaper} → {Beer} (s=0.4, c=0.67), {Milk, Beer} → {Diaper} (s=0.4, c=1.0), {Diaper, Beer} → {Milk} (s=0.4, c=0.67), {Beer} → {Milk, Diaper} (s=0.4, c=0.67), {Diaper} → {Milk, Beer} (s=0.4, c=0.5), {Milk} → {Diaper, Beer} (s=0.4, c=0.5) • Observations: • All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer} • Rules originating from the same itemset have identical support but can have different confidence • Thus, we may decouple the support and confidence requirements
Mining Association Rules • Two-step approach: • Frequent Itemset Generation: generate all itemsets whose support ≥ minsup • Rule Generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset • Frequent itemset generation is still computationally expensive
Frequent Itemset Generation. Given d items, there are 2^d possible candidate itemsets.
Frequent Itemset Generation • Brute-force approach: • Each itemset in the lattice is a candidate frequent itemset • Count the support of each candidate by scanning the database • Match each transaction against every candidate • Complexity ~ O(NMw), where N is the number of transactions, M is the number of candidates, and w is the maximum transaction width => expensive, since M = 2^d!
Computational Complexity • Given d unique items: • Total number of itemsets = 2^d • Total number of possible association rules: R = 3^d − 2^(d+1) + 1 • If d = 6, R = 602 rules
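The closed form counts a non-empty antecedent and then a non-empty consequent drawn from the remaining items; the d = 6 case checks out:

```latex
% Number of possible rules over d items, and the d = 6 evaluation.
\[
  R \;=\; \sum_{k=1}^{d-1} \binom{d}{k} \sum_{j=1}^{d-k} \binom{d-k}{j}
    \;=\; 3^{d} - 2^{d+1} + 1
\]
\[
  d = 6:\quad R = 3^{6} - 2^{7} + 1 = 729 - 128 + 1 = 602
\]
```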
Frequent Itemset Generation Strategies • Reduce the number of candidates (M) • Complete search: M=2d • Use pruning techniques to reduce M • Reduce the number of transactions (N) • Reduce size of N as the size of itemset increases • Used by DHP and vertical-based mining algorithms • Reduce the number of comparisons (NM) • Use efficient data structures to store the candidates or transactions • No need to match every candidate against every transaction
Reducing Number of Candidates • Apriori principle: • If an itemset is frequent, then all of its subsets must also be frequent • Apriori principle holds due to the following property of the support measure: • Support of an itemset never exceeds the support of its subsets • This is known as the anti-monotone property of support
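Formally, the anti-monotone property can be stated as:

```latex
% Anti-monotone property of support: a superset can never be more
% frequent than any of its subsets.
\[
  \forall X, Y:\; (X \subseteq Y) \;\Rightarrow\; s(X) \ge s(Y)
\]
```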
Illustrating Apriori Principle (itemset-lattice figure: once an itemset is found to be infrequent, all of its supersets are pruned).
Illustrating Apriori Principle (figure: tables of item, pair, and triplet counts with Minimum Support = 3; no need to generate candidates involving Coke or Eggs). If every subset is considered, C(6,1) + C(6,2) + C(6,3) = 41 candidates; with support-based pruning, only 6 + 6 + 1 = 13.
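Spelled out, the two candidate counts being compared are:

```latex
% Candidate counts with and without pruning (6 items, minsup = 3).
\[
  \binom{6}{1} + \binom{6}{2} + \binom{6}{3} = 6 + 15 + 20 = 41
  \qquad\text{vs.}\qquad 6 + 6 + 1 = 13
\]
```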
Apriori Algorithm • Method: • Let k=1 • Generate frequent itemsets of length 1 • Repeat until no new frequent itemsets are identified • Generate length (k+1) candidate itemsets from length k frequent itemsets • Prune candidate itemsets containing subsets of length k that are infrequent • Count the support of each candidate by scanning the DB • Eliminate candidates that are infrequent, leaving only those that are frequent
Apriori Algorithm • L1 = {frequent 1-itemsets}; • for (k = 2; Lk-1 ≠ ∅; k++) do begin • Ck = apriori-gen(Lk-1); // generate the set Ck of candidate k-itemsets from Lk-1 • for all transactions t ∈ D do begin • Ct = subset(Ck, t); // candidate itemsets contained in transaction t • for all candidates c ∈ Ct do • c.count++; • end • Lk = {c ∈ Ck | c.count ≥ minsup} • end • Answer = ∪k Lk
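A runnable Python sketch of this loop (a sketch, not the textbook implementation): apriori_gen joins two frequent (k−1)-itemsets that agree on everything but their last item, then prunes candidates that have an infrequent (k−1)-subset. The nine-transaction table D is an assumption; it is the standard AllElectronics-style example and reproduces every count in the walkthrough that follows (minimum support count = 2).

```python
from itertools import combinations

def apriori_gen(prev_frequent, k):
    """Join step + prune step: candidate k-itemsets from L_{k-1}."""
    prev = sorted(sorted(itemset) for itemset in prev_frequent)
    prev_set = {frozenset(itemset) for itemset in prev_frequent}
    candidates = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            if prev[i][:-1] == prev[j][:-1]:           # join: same first k-2 items
                cand = frozenset(prev[i]) | frozenset(prev[j])
                if all(frozenset(sub) in prev_set       # prune: every (k-1)-subset frequent
                       for sub in combinations(sorted(cand), k - 1)):
                    candidates.add(cand)
    return candidates

def apriori(transactions, minsup_count):
    """Return {k: set of frequent k-itemsets} for an absolute support count."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    counts = {frozenset([i]): sum(i in t for t in transactions) for i in items}
    L = {1: {c for c, n in counts.items() if n >= minsup_count}}
    k = 2
    while L[k - 1]:
        Ck = apriori_gen(L[k - 1], k)
        counts = {c: sum(c <= t for t in transactions) for c in Ck}
        L[k] = {c for c, n in counts.items() if n >= minsup_count}
        k += 1
    del L[k - 1]                                        # drop the final empty level
    return L

# Assumed nine-transaction DB, consistent with the walkthrough below.
D = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
     {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
     {"I1", "I2", "I3"}]
print(apriori(D, minsup_count=2)[3])
# -> {frozenset({'I1', 'I2', 'I3'}), frozenset({'I1', 'I2', 'I5'})}
```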
Generating Candidate Itemsets • The Apriori algorithm is named after the Apriori principle, the prior knowledge about the properties of frequent itemsets that it exploits. It uses an iterative, level-wise search in which k-itemsets are used to generate (k+1)-itemsets. • Concretely: first find the set of frequent 1-itemsets, denoted L1; then use L1 to find the set of frequent 2-itemsets L2; use L2 to find L3, and so on, until no frequent k-itemsets can be found. Finding each Lk requires one full scan of the database.
The join step produces the candidate itemsets; the prune step determines the frequent itemsets. Generating Candidate Itemsets: the whole process consists of two steps, join and prune: • (1) Join step: to find Lk, join Lk-1 with itself to produce a set of candidate k-itemsets; this set of candidates is denoted Ck.
Generating Candidate Itemsets: let l1 and l2 be itemsets in Lk-1, and let li[j] denote the j-th item of li. For convenience, assume the items within a transaction or itemset are sorted in lexicographic order. The join Lk-1 ⋈ Lk-1 is performed, where two elements of Lk-1 are joinable if their first (k−2) items agree.
Generating Candidate Itemsets: that is, elements l1 and l2 of Lk-1 are joinable if (l1[1] = l2[1] ∧ l1[2] = l2[2] ∧ … ∧ l1[k-2] = l2[k-2] ∧ l1[k-1] < l2[k-1]). The condition l1[k-1] < l2[k-1] ensures that no duplicate itemsets are generated.
Generating Candidate Itemsets • (2) Prune step: Ck is a superset of Lk, i.e., its members are not necessarily all frequent, but every frequent k-itemset is contained in Ck. Scanning the database to obtain the count of each candidate in Ck determines Lk; however, Ck can be very large, so the amount of computation involved can be substantial.
Generating Candidate Itemsets: to shrink Ck, the Apriori property can be used: no infrequent (k−1)-itemset can be a subset of a frequent k-itemset. Hence, if any (k−1)-subset of a candidate k-itemset is not in Lk-1, the candidate cannot be frequent and can be deleted from Ck.
Generating Candidate Itemsets • [Example] A concrete Apriori run, based on the transaction DB of a store (shown as a figure in the slides). The DB contains 9 transactions, and Apriori assumes the items within each transaction are stored in lexicographic order.
Scan D and count each candidate. (1) In the first iteration of the algorithm, every item is a member of C1, the set of candidate 1-itemsets. The algorithm simply scans all transactions and counts the occurrences of each item. (Table C1.)
Compare each candidate's support count with the minimum support count. (2) With the minimum support count set to 2, the set of frequent 1-itemsets L1 can be determined: it consists of the candidate 1-itemsets that have at least minimum support. (Table L1.)
Generate candidates C2 from L1. (3) To discover the set of frequent 2-itemsets L2, the algorithm uses the join L1 ⋈ L1 to generate the set of candidate 2-itemsets C2. (Table C2.)
Scan D and count each candidate. (4) Scan the transactions in D and compute the support count of every candidate itemset in C2. (Table C2.)
Compare each candidate's support count with the minimum support count. (5) Determine the set of frequent 2-itemsets L2: it consists of the candidate 2-itemsets in C2 that have at least minimum support. (Table L2.)
(6) The set of candidate 3-itemsets C3 is generated as follows: ① Join: C3 = L2 ⋈ L2 = {{I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}} ⋈ {{I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}} = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}
② Prune using the Apriori property: every subset of a frequent itemset must be frequent, so for each candidate itemset we check whether its subsets are frequent. • The 2-item subsets of {I1, I2, I3} are {I1, I2}, {I1, I3}, and {I2, I3}; all are members of L2, so {I1, I2, I3} is kept in C3. • The 2-item subsets of {I1, I2, I5} are {I1, I2}, {I1, I5}, and {I2, I5}; all are members of L2, so {I1, I2, I5} is kept in C3. • The 2-item subsets of {I1, I3, I5} are {I1, I3}, {I1, I5}, and {I3, I5}; {I3, I5} is not in L2 and therefore not frequent, so {I1, I3, I5} is deleted from C3.
• The 2-item subsets of {I2, I3, I4} are {I2, I3}, {I2, I4}, and {I3, I4}; {I3, I4} is not in L2 and therefore not frequent, so {I2, I3, I4} is deleted from C3. • The 2-item subsets of {I2, I3, I5} are {I2, I3}, {I2, I5}, and {I3, I5}; {I3, I5} is not in L2 and therefore not frequent, so {I2, I3, I5} is deleted from C3. • The 2-item subsets of {I2, I4, I5} are {I2, I4}, {I2, I5}, and {I4, I5}; {I4, I5} is not in L2 and therefore not frequent, so {I2, I4, I5} is deleted from C3.
Generate candidates C3 from L2; scan D and count each candidate. ③ Thus, after pruning, C3 = {{I1, I2, I3}, {I1, I2, I5}}. (7) Scan the transactions in D to determine L3, which consists of the candidate 3-itemsets in C3 that have at least minimum support. (Table C3.)
Compare each candidate's support count with the minimum support count. (8) The algorithm uses the join L3 ⋈ L3 to generate the set of candidate 4-itemsets C4. Although the join produces {{I1, I2, I3, I5}}, this itemset is pruned because its subset {I2, I3, I5} is not frequent. Therefore C4 = ∅, the algorithm terminates, and all frequent itemsets have been found.
The candidate-generation function apriori-gen • Takes Lk-1 as input and outputs a superset of the set of all frequent k-itemsets. The function performs two operations, join and prune. The join operation combines the frequent itemsets in Lk-1 as follows: • insert into Ck • select p.item1, p.item2, …, p.itemk-1, q.itemk-1 • from Lk-1 p, Lk-1 q • where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1;
Prune operation • For any candidate itemset c in Ck, if some (k−1)-subset of c does not belong to Lk-1, delete c from Ck. forall itemsets c ∈ Ck do forall (k−1)-subsets s of c do if (s ∉ Lk-1) then delete c from Ck;
Example • Suppose L3 = {{1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4}}. • The join yields C4 = {{1 2 3 4}, {1 3 4 5}}. • Since the itemset {1 4 5} is not in L3, the prune operation removes {1 3 4 5} from C4.
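A minimal Python rendering of the join SQL and the prune loop above (apriori_gen is a hypothetical helper name; itemsets are kept as sorted tuples so the p.item1 … p.itemk-1 comparisons become prefix comparisons). On the L3 of this example it reproduces C4 = {{1 2 3 4}}:

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Join (as in the SQL above) plus prune (as in the loop above).
    L_prev holds the frequent (k-1)-itemsets as sorted tuples."""
    L_prev = sorted(L_prev)
    Ck = []
    # Join: p and q agree on items 1..k-2 and p.item_{k-1} < q.item_{k-1}.
    for p in L_prev:
        for q in L_prev:
            if p[:k - 2] == q[:k - 2] and p[k - 2] < q[k - 2]:
                Ck.append(p + (q[k - 2],))
    # Prune: drop c if any (k-1)-subset of c is not in L_{k-1}.
    frequent = set(L_prev)
    return [c for c in Ck
            if all(s in frequent for s in combinations(c, k - 1))]

L3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
print(apriori_gen(L3, 4))   # [(1, 2, 3, 4)] -- (1, 3, 4, 5) is pruned
```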
Reducing Number of Comparisons • Candidate counting: • Scan the database of transactions to determine the support of each candidate itemset • To reduce the number of comparisons, store the candidates in a hash structure • Instead of matching each transaction against every candidate, match it against candidates contained in the hashed buckets
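A sketch of the counting idea, with a plain Python dict standing in for the hash structure (the hash tree on the next slides refines this so that a transaction only descends into a few buckets). The transaction table is the same assumed market-basket data as before; the counts it prints are consistent with the confidences quoted earlier.

```python
from itertools import combinations

def count_candidates(transactions, candidates, k):
    """Support counting via a hash structure: instead of matching every
    transaction against every candidate, enumerate the k-subsets of each
    transaction and probe them in a dict keyed by candidate itemset."""
    counts = {frozenset(c): 0 for c in candidates}
    for t in transactions:
        for subset in combinations(sorted(t), k):
            key = frozenset(subset)
            if key in counts:          # only candidates contained in t are touched
                counts[key] += 1
    return counts

# Assumed market-basket transactions (same assumption as earlier).
D = [{"Bread", "Milk"},
     {"Bread", "Diaper", "Beer", "Eggs"},
     {"Milk", "Diaper", "Beer", "Coke"},
     {"Bread", "Milk", "Diaper", "Beer"},
     {"Bread", "Milk", "Diaper", "Coke"}]
C2 = [{"Milk", "Diaper"}, {"Milk", "Beer"}, {"Diaper", "Beer"}]
print(count_candidates(D, C2, 2))
# sigma({Milk, Diaper}) = 3, sigma({Milk, Beer}) = 2, sigma({Diaper, Beer}) = 3
```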
Generate Hash Tree • Suppose you have 15 candidate itemsets of length 3: {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8} • You need: • A hash function (here MOD 3, mapping items 1, 4, 7 / 2, 5, 8 / 3, 6, 9 to the three branches) • A max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets exceeds the max leaf size, split the node). (The resulting hash tree is shown as a figure.)
Association Rule Discovery: Hash Tree (figure: the candidate hash tree built from the 15 candidates above, with hash function 1, 4, 7 / 2, 5, 8 / 3, 6, 9). The three views highlight, in turn, the subtree reached by hashing the current item to 1, 4 or 7; to 2, 5 or 8; and to 3, 6 or 9, until a leaf holding at most the max leaf size of candidates is reached.
Subset Operation Given a transaction t, what are the possible subsets of size 3?
Subset Operation Using Hash Tree (figure: the transaction t = {1, 2, 3, 5, 6} is recursively decomposed as 1 + {2 3 5 6}, 2 + {3 5 6}, 3 + {5 6}, then 1 2 + {3 5 6}, 1 3 + {5 6}, 1 5 + {6}, and so on, and each prefix is hashed down the candidate hash tree). Following only the branches that the transaction's own items can reach, the transaction is matched against 11 out of the 15 candidates instead of all of them.
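For intuition, a brute-force version of the subset operation: enumerate every size-3 subset of t = {1, 2, 3, 5, 6} (there are C(5,3) = 10 of them) and probe each against the candidate list. The hash tree finds the same matches while touching only 11 of the 15 candidates.

```python
from itertools import combinations

# The 15 candidate 3-itemsets from the hash-tree slides.
candidates = [{1, 4, 5}, {1, 2, 4}, {4, 5, 7}, {1, 2, 5}, {4, 5, 8}, {1, 5, 9},
              {1, 3, 6}, {2, 3, 4}, {5, 6, 7}, {3, 4, 5}, {3, 5, 6}, {3, 5, 7},
              {6, 8, 9}, {3, 6, 7}, {3, 6, 8}]

t = {1, 2, 3, 5, 6}

# Brute-force subset operation: all size-3 subsets of the transaction ...
subsets = [set(s) for s in combinations(sorted(t), 3)]
print(len(subsets))           # 10

# ... and which candidates are actually contained in t (counts to increment).
matched = [c for c in candidates if c <= t]
print(matched)                # [{1, 2, 5}, {1, 3, 6}, {3, 5, 6}]
```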
Factors Affecting Complexity • Choice of minimum support threshold • lowering support threshold results in more frequent itemsets • this may increase number of candidates and max length of frequent itemsets • Dimensionality (number of items) of the data set • more space is needed to store support count of each item • if number of frequent items also increases, both computation and I/O costs may also increase • Size of database • since Apriori makes multiple passes, run time of algorithm may increase with number of transactions • Average transaction width • transaction width increases with denser data sets • This may increase max length of frequent itemsets and traversals of hash tree (number of subsets in a transaction increases with its width)