CSG230 Summary Donghui Zhang Data Mining: Concepts and Techniques
What we learned? • Frequent pattern & association • Clustering • Classification • Data warehousing • Additional
What we learned? • Frequent pattern & association • frequent itemsets (Apriori, FP-growth) • max and closed itemsets • association rules • essential rules • generalized itemsets • Sequential pattern • Clustering • Classification • Data warehousing • Additional
What we learned? • Frequent pattern & association • Clustering • k-means • Birch (based on CF-tree) • DBSCAN • CURE • Classification • Data warehousing • Additional
What we learned? • Frequent pattern & association • Clustering • Classification • decision tree • naïve Bayesian classifier • Bayesian network • neural net and SVM • Data warehousing • Additional
What we learned? • Frequent pattern & association • Clustering • Classification • Data warehousing • concept, schema • data cube & operations (rollup, …) • cube computation: multi-way array aggregation • iceberg cube • dynamic data cube • Additional
What we learned? • Frequent pattern & association • Clustering • Classification • Data warehousing • Additional • lattice (of itemsets, g-itemsets, rules, cuboids) • distance-based indexing
Frequent pattern & association • frequent itemsets (Apriori, FP-growth) • max and closed itemsets • association rules • essential rules • generalized itemsets • Sequential pattern
Basic Concepts: Frequent Patterns and Association Rules • Itemset X = {x1, …, xk} • Find all the rules X → Y with min confidence and support • support, s: probability that a transaction contains X ∪ Y • confidence, c: conditional probability that a transaction containing X also contains Y • Let min_support = 50%, min_conf = 50%: • A → C (50%, 66.7%) • C → A (50%, 100%) (Figure: customers buying beer, customers buying diapers, and their overlap.)
From Mining Association Rules to Mining Frequent Patterns (i.e. Frequent Itemsets) • Given a frequent itemset X, how do we find association rules? • Examine every non-empty proper subset S of X • confidence(S → X − S) = support(X) / support(S) • Compare with min_conf • An optimization is possible (refer to exercises 6.1, 6.2).
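As a sketch, rule generation from a single frequent itemset can be written directly from the formula above. The function name `association_rules` and the toy support table (four transactions, matching the A → C / C → A example) are illustrative assumptions, not from the course materials.

```python
from itertools import combinations

def association_rules(X, support, min_conf):
    """Generate rules S -> X-S from a frequent itemset X.

    `support` maps frozensets to support counts; all non-empty proper
    subsets of X are assumed present (Apriori guarantees they are frequent).
    """
    rules = []
    items = sorted(X)
    for r in range(1, len(items)):
        for S in combinations(items, r):
            S = frozenset(S)
            conf = support[frozenset(X)] / support[S]
            if conf >= min_conf:
                rules.append((S, frozenset(X) - S, conf))
    return rules

# Toy counts over 4 transactions: sup(A)=3, sup(C)=2, sup(AC)=2
support = {frozenset('A'): 3, frozenset('C'): 2, frozenset('AC'): 2}
for lhs, rhs, conf in association_rules({'A', 'C'}, support, 0.5):
    print(sorted(lhs), '->', sorted(rhs), round(conf, 3))
```

With these counts the sketch reproduces the two rules on the slide: A → C with confidence 2/3 and C → A with confidence 1.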
The Apriori Algorithm: An Example (Figure: database TDB is scanned three times, alternating candidate generation and support counting: C1 → L1 after the 1st scan, C2 → L2 after the 2nd, C3 → L3 after the 3rd.)
Important Details of Apriori • How to generate candidates? • Step 1: self-joining Lk • Step 2: pruning • How to count supports of candidates? • Example of candidate generation • L3 = {abc, abd, acd, ace, bcd} • Self-joining: L3*L3 • abcd from abc and abd • acde from acd and ace • Pruning: • acde is removed because ade is not in L3 • C4 = {abcd}
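A minimal sketch of the candidate-generation step, assuming itemsets are stored as Python frozensets. For brevity it joins any two k-itemsets whose union has k+1 items rather than the textbook prefix-based self-join; the prune step makes the resulting candidate set identical.

```python
def apriori_gen(Lk):
    """Generate C_{k+1} from L_k: self-join, then prune.

    Lk is a set of frozensets, each of size k.
    """
    Lk = set(Lk)
    k = len(next(iter(Lk)))
    candidates = set()
    for a in Lk:
        for b in Lk:
            union = a | b
            if len(union) == k + 1:  # join step (size-(k+1) unions only)
                # prune: every k-subset of the candidate must be frequent
                if all(union - {x} in Lk for x in union):
                    candidates.add(union)
    return candidates

# The slide's example: acde is pruned because ade is not in L3
L3 = {frozenset(s) for s in ('abc', 'abd', 'acd', 'ace', 'bcd')}
print(sorted(''.join(sorted(c)) for c in apriori_gen(L3)))
```

On the slide's L3 this yields exactly C4 = {abcd}.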
Construct FP-tree from a Transaction Database

TID | Items bought | (ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o} | {f, c, a, b, m}
300 | {b, f, h, j, o, w} | {f, b}
400 | {b, c, k, s, p} | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

min_support = 3; F-list = f-c-a-b-m-p; header table: f:4, c:4, a:3, b:3, m:3, p:3

• Scan the DB once to find the frequent 1-itemsets (single-item patterns) • Sort the frequent items in descending frequency order to get the f-list • Scan the DB again, inserting each ordered transaction to construct the FP-tree

(Figure sequence: the tree grows one transaction at a time; after all five insertions the root has children f:4 and c:1, with paths f → c:3 → a:3 → m:2 → p:2, f → c → a → b:1 → m:1, f → b:1, and c:1 → b:1 → p:1.)
Find Patterns Having p From p's Conditional Database

(Figure: the completed FP-tree with its header table f:4, c:4, a:3, b:3, m:3, p:3, as on the previous slide.)

• Start at the frequent-item header table of the FP-tree • Traverse the FP-tree by following the links of each frequent item p • Accumulate all transformed prefix paths of item p to form p's conditional pattern base

Conditional pattern bases:
item | cond. pattern base
c | f:3
a | fc:3
b | fca:1, f:1, c:1
m | fca:2, fcab:1
p | fcam:2, cb:1
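The conditional pattern bases in the table can be checked with a short sketch that reads them straight off the ordered transactions rather than off the FP-tree (the tree merely compresses the same prefix paths). The function name and the string encoding of transactions are illustrative.

```python
def conditional_pattern_bases(transactions, flist):
    """Each item's conditional pattern base: the ordered prefix that
    precedes the item in every transaction containing it."""
    order = {item: i for i, item in enumerate(flist)}
    bases = {item: [] for item in flist}
    for t in transactions:
        # keep only frequent items, in f-list order
        t = sorted((x for x in t if x in order), key=order.get)
        for i, item in enumerate(t):
            if t[:i]:  # empty prefixes contribute nothing
                bases[item].append(tuple(t[:i]))
    return bases

# The five ordered transactions from the FP-tree slide
txns = ['fcamp', 'fcabm', 'fb', 'cbp', 'fcamp']
bases = conditional_pattern_bases(txns, 'fcabmp')
print(bases['m'])  # m's base: fca twice, fcab once
```

This reproduces the table, e.g. m: fca:2, fcab:1 and p: fcam:2, cb:1.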
Max-patterns • A frequent pattern {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27×10^30 frequent sub-patterns! • Max-pattern: a frequent pattern with no proper frequent super-pattern • BCDE and ACD are max-patterns • BCD is not a max-pattern (Min_sup = 2)
Example: a set-enumeration tree over (ABCDEF), Min_sup = 2, with nodes A (BCDE), B (CDE), C (DE), D (E), E (). (Figure sequence: the tree is searched node by node, first node A, then node B, accumulating the max patterns found so far.)
A Critical Observation • A → BC has support and confidence no greater than the other rules, independent of the TDB. • Rules AB → C, AC → B, A → B, and A → C are redundant with regard to A → BC. • While mining association rules, a large percentage of the rules may be redundant.
Formal Definition of Essential Rule • Definition 1: Rule r1 implies another rule r2 if support(r1) ≤ support(r2) and confidence(r1) ≤ confidence(r2), independent of the TDB. Denote this as r1 ⇒ r2. • Definition 2: Rule r1 is an essential rule if r1 is strong and there is no other rule r2 such that r2 ⇒ r1.
Example of a Lattice of Rules (Figure: the lattice of rules derived from itemset ABC, including A → BC, B → AC, C → AB, AB → C, AC → B, and BC → A.) • Generate the child nodes: move an item from the consequent to the antecedent, or delete it from the consequent. • To find essential rules: start from each max itemset; browse top-down; prune a sub-tree whenever a rule is confident.
Frequent generalized itemsets • A taxonomy of items; the TDB involves only leaf items of the taxonomy. • A g-itemset may contain g-items, but cannot contain both an ancestor and a descendant. • Note: a descendant g-item is a “superset”! • Anyone who bought {milk, bread} also bought {milk}. • Anyone who bought {A} also bought {W}. • How to find frequent g-itemsets? Browse (and prune) a lattice of g-itemsets! • To get a node's children, replace one item by its ancestor (on conflict, remove it instead).
What Is Sequential Pattern Mining? • Given a set of sequences, find the complete set of frequent subsequences. (Table: a sequence database.) Given support threshold min_sup = 2, <(ab)c> is a sequential pattern.
Mining Sequential Patterns by Prefix Projections • Step 1: find length-1 sequential patterns • <a>, <b>, <c>, <d>, <e>, <f> • Step 2: divide search space. The complete set of seq. pat. can be partitioned into 6 subsets: • The ones having prefix <a>; • The ones having prefix <b>; • … • The ones having prefix <f>
Finding Seq. Patterns with Prefix <a> • Only need to consider projections w.r.t. <a> • <a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc> • Find all the length-2 seq. pat. having prefix <a>: <aa>, <ab>, <(ab)>, <ac>, <ad>, <af>, by checking the frequencies of items like a and _a • Further partition into 6 subsets • Having prefix <aa>; • … • Having prefix <af>
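A hedged sketch of the projection step, deliberately simplified to sequences of single items: it ignores itemset elements such as (ab) and the `_`-prefixed partial elements shown above, so it only illustrates the suffix-projection idea, not the full PrefixSpan algorithm.

```python
def project(db, prefix_item):
    """<x>-projected database over single-item sequences: for each
    sequence containing the item, keep the suffix after its first
    occurrence; sequences without the item are dropped."""
    out = []
    for seq in db:
        if prefix_item in seq:
            out.append(seq[seq.index(prefix_item) + 1:])
    return out

# Toy single-item sequences (strings); 'bc' is dropped: it has no 'a'
print(project(['abc', 'acb', 'bca', 'bc'], 'a'))
```

Each projected suffix is then mined recursively for the next frequent item, which is how the search space is divided into the per-prefix subsets listed above.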
2. Clustering • k-means • Birch (based on CF-tree) • DBSCAN • CURE
The K-Means Clustering Method • Pick k objects as the initial seed points • Assign each object to the cluster with the nearest seed point • Re-compute each seed point as the centroid (mean point) of its cluster • Go back to Step 2; stop when no assignment changes • Not optimal. A counter-example?
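The four steps above can be sketched as a small self-contained program (plain Python, 2-D points); the seeding and stopping details here are one reasonable choice among several, not the course's prescribed ones.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means on 2-D points: seed, assign, re-center, repeat."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # step 1: pick k seed points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                     # step 2: nearest-seed assignment
            j = min(range(k), key=lambda i: (p[0] - centers[i][0]) ** 2
                                           + (p[1] - centers[i][1]) ** 2)
            clusters[j].append(p)
        new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
               for i, cl in enumerate(clusters)]  # step 3: re-compute centroids
        if new == centers:                   # step 4: stop when stable
            break
        centers = new
    return centers, clusters

# Two well-separated blobs of three points each
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(pts, 2)
```

Because the result depends on the initial seeds, k-means can converge to a local optimum, which is the "not optimal" caveat on the slide.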
BIRCH (1996): Balanced Iterative Reducing and Clustering using Hierarchies, by Zhang, Ramakrishnan, and Livny (SIGMOD'96) • Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering • Phase 1: scan the DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve its inherent clustering structure) • Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree • Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans • Weakness: handles only numeric data, and is sensitive to the order of the data records.
Clustering Feature Vector • Clustering Feature: CF = (N, LS, SS) • N: number of data points • LS: Σ_{i=1}^{N} X_i (linear sum) • SS: Σ_{i=1}^{N} X_i² (square sum) • Example: the points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16,30), 244)
Some Characteristics of CF • Two CFs can be aggregated: given CF1 = (N1, LS1, SS1) and CF2 = (N2, LS2, SS2), if the two clusters are combined into one, CF = (N1 + N2, LS1 + LS2, SS1 + SS2). • The centroid and radius can both be computed from the CF. • The centroid is the center of the cluster. • The radius is the average distance between an object and the centroid. • How?
Some Characteristics of CF • centroid: X0 = LS / N • radius: R = √(SS/N − ‖LS/N‖²)
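Both properties (additivity, and centroid/radius from the CF alone) can be verified numerically against the CF example (5, (16,30), 244) from the earlier slide; the helper function names are illustrative.

```python
import math

def cf_merge(cf1, cf2):
    """Additivity: the CFs of two disjoint clusters add component-wise."""
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return (n1 + n2, tuple(a + b for a, b in zip(ls1, ls2)), ss1 + ss2)

def cf_centroid(cf):
    n, ls, ss = cf
    return tuple(x / n for x in ls)

def cf_radius(cf):
    """R = sqrt(SS/N - ||LS/N||^2): root-mean-square distance of the
    cluster's points to the centroid, computable from the CF alone."""
    n, ls, ss = cf
    return math.sqrt(ss / n - sum((x / n) ** 2 for x in ls))

# Rebuild the slide's CF from its five points
points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
cf = (len(points),
      tuple(map(sum, zip(*points))),
      sum(x * x + y * y for x, y in points))
print(cf)               # the slide's (5, (16, 30), 244)
print(cf_centroid(cf))  # (3.2, 6.0)
```

For this CF the radius works out to 1.6, with no need to revisit the raw points, which is the point of the data structure.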
CF-Tree in BIRCH • Clustering feature: • a summary of the statistics for a given subcluster: the 0th, 1st, and 2nd moments of the subcluster from the statistical point of view • registers crucial measurements for computing clusters and utilizes storage efficiently • A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering • A nonleaf node in the tree has descendants or “children” • The nonleaf nodes store sums of the CFs of their children • A CF tree has two parameters • Branching factor: the maximum number of children per node • Threshold T: the maximum radius of the sub-clusters stored at the leaf nodes
Insertion in a CF-Tree • To insert an object o into a CF-tree, start at the root node. • To insert o into an index node, descend into the child node whose centroid is closest to o. • To insert o into a leaf node: • if an existing leaf entry can “absorb” it (i.e. the new radius ≤ T), let it; • otherwise, create a new leaf entry. • Split: • choose the two entries whose centroids are farthest apart; • assign them to two different groups; • assign each remaining entry to one of the two groups.
Density-Based Clustering: Background (II) • Density-reachable: a point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn with p1 = q, pn = p such that pi+1 is directly density-reachable from pi • Density-connected: a point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts (Figure: a chain from q to p, and points p, q both reachable from o.)
DBSCAN: Density-Based Spatial Clustering of Applications with Noise • Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points • Discovers clusters of arbitrary shape in spatial databases with noise (Figure: core, border, and outlier points; Eps = 1 cm, MinPts = 5.)
DBSCAN: The Algorithm • Arbitrarily select a point p. • Retrieve all points density-reachable from p w.r.t. Eps and MinPts. • If p is a core point, a cluster is formed. • If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database. • Continue the process until all points have been processed.
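A compact sketch of this loop, assuming Euclidean distance and a brute-force neighborhood query (no spatial index); cluster ids start at 0 and noise points get label -1. All naming choices here are illustrative.

```python
def dbscan(points, eps, min_pts):
    """Grow a cluster from each unvisited core point; points that are
    neither core nor density-reachable end up labeled -1 (noise)."""
    def neighbors(i):
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:   # not core; may be claimed later as border
            labels[i] = -1
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former "noise" turns out to be border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            js = neighbors(j)
            if len(js) >= min_pts:   # j is core: keep expanding
                queue.extend(js)
        cluster += 1
    return labels

# A tight 4-point square plus one faraway noise point
labels = dbscan([(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)], eps=1.5, min_pts=3)
```

The brute-force neighbor query makes this O(n²); the original paper uses an R*-tree to answer the region queries efficiently.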
Motivation for CURE • k-means does not perform well on this; • AGNES + dmin has the single-link effect!
CURE: The Basic Version • Initially, insert every object into PQ as its own cluster. • Every cluster in PQ has: • (up to) C representative points • a pointer to its closest cluster (distance between two clusters = min{dist(rep1, rep2)}) • While PQ has more than k clusters, merge the top cluster with its closest cluster.
Representative Points • Step 1: choose up to C points. • If a cluster has no more than C points, take all of them. • Otherwise, choose the first point as the one farthest from the mean; choose each subsequent point as the one farthest from those already chosen. • Step 2: shrink each point toward the mean: • p' = p + α·(mean − p), with α ∈ [0, 1]; larger α means more shrinking. • Reason to shrink: it damps outliers, as faraway objects are shrunk more.
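The two steps can be sketched as follows; the function name, the shrink formula's rendering in code, and the toy square cluster are illustrative, not taken from the CURE paper.

```python
def cure_representatives(cluster, c, alpha):
    """Pick up to c scattered points, then shrink each toward the mean
    by alpha (alpha in [0,1]; larger alpha shrinks more)."""
    mean = tuple(sum(xs) / len(cluster) for xs in zip(*cluster))

    def d2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    if len(cluster) <= c:
        reps = list(cluster)
    else:
        reps = [max(cluster, key=lambda p: d2(p, mean))]  # farthest from mean
        while len(reps) < c:
            # next rep: the point farthest from all reps chosen so far
            reps.append(max((p for p in cluster if p not in reps),
                            key=lambda p: min(d2(p, r) for r in reps)))
    # shrink: p' = p + alpha * (mean - p)
    return [tuple(x + alpha * (m - x) for x, m in zip(p, mean)) for p in reps]

# Square cluster with mean (2, 2): opposite corners get picked and
# pulled halfway toward the mean with alpha = 0.5
reps = cure_representatives([(0, 0), (4, 0), (0, 4), (4, 4)], c=2, alpha=0.5)
```

Scattered-then-shrunk representatives let merged clusters track non-spherical shapes while keeping outliers from dragging the merge decisions.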
3. Classification • decision tree • naïve Bayesian classifier • Bayesian network • neural net and SVM
Training Dataset (Table: 14 training tuples with attributes age, income, student, credit_rating and class label buys_computer.) This follows an example from Quinlan's ID3.
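Since the table itself was rendered as an image, the sketch below uses the widely published 14-tuple buys_computer data that this slide follows (Quinlan's ID3 example as reproduced in Han & Kamber); the helper functions compute ID3's information gain and confirm that age is the best root split, matching the tree on the next slide.

```python
import math

def entropy(labels):
    """Info(D) = -sum p_i * log2(p_i) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(v) for v in set(labels)))

def info_gain(rows, labels, attr):
    """Gain(A) = Info(D) - sum_v (|D_v|/|D|) * Info(D_v)."""
    n = len(labels)
    parts = {}
    for row, y in zip(rows, labels):
        parts.setdefault(row[attr], []).append(y)
    return entropy(labels) - sum(len(p) / n * entropy(p)
                                 for p in parts.values())

# (age, income, student, credit_rating -> buys_computer), 9 yes / 5 no
data = [
    ('<=30',   'high',   'no',  'fair',      'no'),
    ('<=30',   'high',   'no',  'excellent', 'no'),
    ('30..40', 'high',   'no',  'fair',      'yes'),
    ('>40',    'medium', 'no',  'fair',      'yes'),
    ('>40',    'low',    'yes', 'fair',      'yes'),
    ('>40',    'low',    'yes', 'excellent', 'no'),
    ('30..40', 'low',    'yes', 'excellent', 'yes'),
    ('<=30',   'medium', 'no',  'fair',      'no'),
    ('<=30',   'low',    'yes', 'fair',      'yes'),
    ('>40',    'medium', 'yes', 'fair',      'yes'),
    ('<=30',   'medium', 'yes', 'excellent', 'yes'),
    ('30..40', 'medium', 'no',  'excellent', 'yes'),
    ('30..40', 'high',   'yes', 'fair',      'yes'),
    ('>40',    'medium', 'no',  'excellent', 'no'),
]
rows = [d[:4] for d in data]
ys = [d[4] for d in data]
gains = [info_gain(rows, ys, a) for a in range(4)]  # age wins (~0.247)
```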
Output: A Decision Tree for “buys_computer”

age?
├─ <=30 → student?
│   ├─ no → no
│   └─ yes → yes
├─ 30..40 → yes
└─ >40 → credit_rating?
    ├─ excellent → no
    └─ fair → yes