This guide covers the topics for the final exam: evaluation of classification (predicting performance), association analysis (the Apriori algorithm, FP-Growth, maximal and closed frequent itemsets, the interest factor), mining sequences, and cluster analysis. It reviews the two-step approach for mining association rules, the Apriori principle, the rule generation process, and the computational requirements of frequent itemset generation, along with the other algorithms covered in the course.
Topics to review for the final exam
• Evaluation of classification
  • Predicting performance, confidence intervals
  • ROC analysis
  • Precision, recall, F-measure
• Association analysis
  • APRIORI
  • FP-Tree/FP-Growth
  • Maximal, closed frequent itemsets
  • Cross-support, h-measure
  • Confidence vs. interestingness
  • Mining sequences
  • Mining graphs
• Cluster analysis
  • K-means, bisecting K-means
  • SOM
  • DBSCAN
  • Hierarchical clustering
• Web search
  • IR
  • Reputation ranking
A single-side help sheet is allowed.
Mining Association Rules
Two-step approach:
• Frequent itemset generation
  • Generate all itemsets whose support ≥ minsup (these itemsets are called frequent itemsets).
• Rule generation
  • Generate high-confidence rules from each frequent itemset.
• Frequent itemset generation is computationally more expensive than rule generation: candidate itemsets are generated and then tested against the database to see whether they are frequent.
• Apriori principle:
  • If an itemset is frequent, then all of its subsets must also be frequent.
• Apriori principle, stated as its contrapositive:
  • If an itemset is infrequent, then all of its supersets must be infrequent too.
Illustrating the Apriori Principle
(Figure: an itemset lattice in which one itemset is found to be infrequent and all of its supersets are pruned.)
Apriori Algorithm
• Method:
  • Let k = 1.
  • Generate frequent itemsets of length 1.
  • Repeat until no new frequent itemsets are identified:
    • k = k + 1
    • Generate length-k candidate itemsets from the length-(k-1) frequent itemsets.
    • Prune candidate itemsets containing subsets of length k-1 that are infrequent.
    • Count the support of each candidate by scanning the DB and eliminate candidates that are infrequent, leaving only those that are frequent.
F(k-1) × F(k-1) Method
• Merge a pair of frequent (k-1)-itemsets only if their first k-2 items are identical.
• E.g., the frequent itemsets {Bread, Diapers} and {Bread, Milk} are merged to form the candidate 3-itemset {Bread, Diapers, Milk}.
F(k-1) × F(k-1)
• Completeness
  • We don't merge {Beer, Diapers} with {Diapers, Milk} because the first item in the two itemsets is different.
  • Do we lose {Beer, Diapers, Milk}? No: if it is frequent, then {Beer, Diapers} and {Beer, Milk} are both frequent, share their first item, and are merged to generate it.
• Pruning
  • Before checking a candidate against the DB, a candidate pruning step ensures that all of its other (k-1)-item subsets are frequent.
• Counting
  • Finally, the surviving candidates are tested (counted) against the DB.
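The candidate generation, pruning, and counting steps above fit together as in the following minimal Python sketch (not from the slides). It assumes transactions are given as Python sets of items and represents itemsets as sorted tuples; the function name apriori and the toy transactions at the end are illustrative only.

from itertools import combinations

def apriori(transactions, minsup):
    """Sketch of Apriori frequent itemset mining.

    transactions: list of sets of items; minsup: minimum support count.
    Returns a dict mapping each frequent itemset (sorted tuple) to its support count.
    """
    # F1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    frequent = {iset: c for iset, c in counts.items() if c >= minsup}
    all_frequent = dict(frequent)

    k = 2
    while frequent:
        # Candidate generation: F(k-1) x F(k-1) -- merge two frequent
        # (k-1)-itemsets whose first k-2 items are identical.
        prev = sorted(frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                a, b = prev[i], prev[j]
                if a[:-1] == b[:-1]:                       # first k-2 items match
                    cand = tuple(sorted(set(a) | set(b)))
                    # Candidate pruning: every (k-1)-subset must be frequent
                    if all(sub in frequent for sub in combinations(cand, k - 1)):
                        candidates.add(cand)
        # Support counting: one pass over the database per level
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if set(c) <= t:
                    counts[c] += 1
        frequent = {c: n for c, n in counts.items() if n >= minsup}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

# Illustrative market-basket data (hypothetical), minimum support count of 3
tx = [{"Bread", "Milk"},
      {"Bread", "Diapers", "Beer", "Eggs"},
      {"Milk", "Diapers", "Beer", "Cola"},
      {"Bread", "Milk", "Diapers", "Beer"},
      {"Bread", "Milk", "Diapers", "Cola"}]
print(apriori(tx, minsup=3))   # frequent itemsets with their support counts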
Rule Generation
• Computing the confidence of an association rule does not require additional scans of the transactions.
• Consider {1, 2} → {3}.
  • The rule confidence is σ({1, 2, 3}) / σ({1, 2}).
  • Because {1, 2, 3} is frequent, the anti-monotone property of support ensures that {1, 2} must be frequent too, and we already know the supports of all frequent itemsets.
• Initially, all the high-confidence rules that have only one item in the rule consequent are extracted.
• These rules are then used to generate new candidate rules.
  • For example, if {a, c, d} → {b} and {a, b, d} → {c} are high-confidence rules, then the candidate rule {a, d} → {b, c} is generated by merging the consequents of both rules.
• The candidate rules are then checked for confidence.
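A minimal sketch of this level-wise rule generation, assuming the supports of all frequent itemsets are already available (e.g., from an Apriori run) as a dict keyed by sorted tuples; generate_rules and minconf are illustrative names, not library functions.

from itertools import combinations

def generate_rules(supports, minconf):
    """Sketch: extract high-confidence rules from frequent itemsets.

    Confidence is a ratio of stored supports, so no extra database scans are needed.
    Returns (antecedent, consequent, confidence) triples.
    """
    rules = []
    for itemset in supports:
        if len(itemset) < 2:
            continue
        # Start with all rules having a single item in the consequent,
        # then grow the consequents level by level.
        consequents = [(c,) for c in itemset]
        while consequents:
            survivors = set()
            for cons in consequents:
                ante = tuple(sorted(set(itemset) - set(cons)))
                if not ante:
                    continue
                conf = supports[itemset] / supports[ante]
                if conf >= minconf:
                    rules.append((ante, cons, conf))
                    survivors.add(cons)      # only high-confidence consequents are extended
            # Merge surviving consequents of size m into candidate consequents of size m+1
            level = sorted(survivors)
            consequents = {tuple(sorted(set(a) | set(b)))
                           for a, b in combinations(level, 2)
                           if a[:-1] == b[:-1]}
    return rules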
Other Concepts and Algorithms
• FP-Tree/FP-Growth: see the corresponding slide set and the Assignment 2 solution.
• Maximal frequent itemsets
• Closed itemsets
• Interest factor
• Mining sequences
Maximal Frequent Itemsets
• An itemset is maximal frequent if none of its immediate supersets is frequent.
• Maximal frequent itemsets form the smallest set of itemsets from which all frequent itemsets can be derived.
(Figure: itemset lattice showing the border between frequent and infrequent itemsets; the maximal frequent itemsets lie just inside the border.)
Closed Itemsets
• Despite providing a compact representation, maximal frequent itemsets do not contain the support information of their subsets.
  • An additional pass over the data set is needed to determine the support counts of the non-maximal frequent itemsets.
• It might be desirable to have a minimal representation of frequent itemsets that preserves the support information.
  • Such a representation is the set of closed frequent itemsets.
• An itemset is closed if none of its immediate supersets has the same support as the itemset.
• An itemset is a closed frequent itemset if it is closed and its support is greater than or equal to minsup.
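A small sketch of how the two definitions translate into code, assuming a dict that maps every frequent itemset (as a frozenset) to its support; the function name maximal_and_closed is illustrative.

def maximal_and_closed(supports):
    """Sketch: flag maximal and closed frequent itemsets.

    An itemset is maximal if it has no frequent immediate superset;
    it is closed if no immediate superset has the same support.
    Checking only frequent supersets suffices here: an immediate superset
    with the same support as a frequent itemset would itself be frequent.
    """
    items = set().union(*supports) if supports else set()
    maximal, closed = [], []
    for iset, sup in supports.items():
        supersets = [iset | {x} for x in items - iset if iset | {x} in supports]
        if not supersets:
            maximal.append(iset)
        if all(supports[s] < sup for s in supersets):
            closed.append(iset)
    return maximal, closed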
Maximal vs. Closed Frequent Itemsets
(Figure: itemset lattice annotated with the supporting transaction IDs, with minimum support = 2; closed-but-not-maximal and closed-and-maximal itemsets are highlighted, and itemsets not supported by any transaction are marked.)
• # Closed = 9
• # Maximal = 4
Deriving Frequent Itemsets from Closed Frequent Itemsets
• E.g., consider the frequent itemset {a, d}. Because this itemset is not closed, its support count must be identical to that of one of its immediate supersets.
• The key is to determine which superset among {a, b, d}, {a, c, d}, and {a, d, e} has exactly the same support count as {a, d}.
• Because support is anti-monotone, the support of {a, d} is at least that of each of its supersets; since it equals one of them, it must equal the largest support among its immediate supersets.
• So the support of {a, d} must be identical to the support of {a, c, d}.
(Figure: itemset lattice annotated with the transaction IDs supporting each itemset; {a, c, d} is the superset with the same support as {a, d}.)
Support Counting Using Closed Frequent Itemsets

Let C denote the set of closed frequent itemsets
Let kmax denote the maximum length of the closed frequent itemsets
F_kmax = { f | f ∈ C, |f| = kmax }          // frequent itemsets of size kmax
for k = kmax − 1 downto 1 do
    Set F_k to be all sub-itemsets of length k of the frequent itemsets in F_(k+1)
    for each f ∈ F_k do
        if f ∉ C then
            f.support = max{ f'.support | f' ∈ F_(k+1), f ⊂ f' }
        end if
    end for
end for
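A Python sketch of the same idea, assuming the closed frequent itemsets and their supports are given as a dict keyed by frozensets. As a safeguard, this sketch also keeps closed k-itemsets that are not subsets of any itemset in the level above, which the pseudocode leaves implicit.

from itertools import combinations

def supports_from_closed(closed):
    """Sketch: derive the supports of all frequent itemsets from the closed ones.

    A non-closed frequent itemset inherits the largest support among its
    frequent supersets of the next size up.
    """
    supports = dict(closed)
    kmax = max(len(f) for f in closed)
    level = [f for f in supports if len(f) == kmax]          # F_kmax
    for k in range(kmax - 1, 0, -1):
        # F_k: all k-item subsets of the (k+1)-itemsets, plus closed k-itemsets
        level_k = {frozenset(c) for f in level for c in combinations(f, k)}
        level_k |= {f for f in closed if len(f) == k}
        for f in level_k:
            if f not in supports:                            # frequent but not closed
                supports[f] = max(supports[s] for s in level if f < s)
        level = list(level_k)
    return supports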
Contingency Table
• Given a rule X → Y, the information needed to compute rule interestingness can be obtained from the contingency table for X → Y:
  • f11: number of transactions containing both X and Y
  • f10: number of transactions containing X but not Y
  • f01: number of transactions containing Y but not X
  • f00: number of transactions containing neither X nor Y
Pitfall of Confidence
• Consider the association rule Tea → Coffee.
• Confidence = P(Coffee, Tea) / P(Tea) = P(Coffee | Tea) = 150/200 = 0.75 (seems quite high).
• But P(Coffee) = 900/1100 ≈ 0.82.
• Thus knowing that a person is a tea drinker actually decreases his/her probability of being a coffee drinker from about 82% to 75%!
• Although the confidence is high, the rule is misleading.
• In fact P(Coffee | ¬Tea) = P(Coffee, ¬Tea) / P(¬Tea) = 750/900 ≈ 0.83.
Interest Factor
• A measure that takes statistical dependence into account:
  Interest(A, B) = P(A, B) / ( P(A) × P(B) ) = ( N × f11 ) / ( f1+ × f+1 )
• f11 / N is an estimate of the joint probability P(A, B).
• f1+ / N and f+1 / N are estimates of P(A) and P(B), respectively.
• If A and B are statistically independent, then P(A, B) = P(A) × P(B), and thus the Interest is 1.
Example: Interest
• Association rule: Tea → Coffee
• Interest = (150 × 1100) / (200 × 900) ≈ 0.92
• Since this value is < 1, Tea and Coffee are negatively correlated.
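A short sketch that computes both measures from the four contingency counts; the function name confidence_and_interest is illustrative, and the counts in the example are the ones implied by the Tea/Coffee numbers above.

def confidence_and_interest(f11, f10, f01, f00):
    """Sketch: confidence and interest factor (lift) of X -> Y from a contingency table."""
    n = f11 + f10 + f01 + f00            # total number of transactions
    support_x = f11 + f10                # transactions containing X
    support_y = f11 + f01                # transactions containing Y
    confidence = f11 / support_x
    interest = (f11 * n) / (support_x * support_y)   # = P(X, Y) / (P(X) P(Y))
    return confidence, interest

# Tea -> Coffee example
conf, interest = confidence_and_interest(f11=150, f10=50, f01=750, f00=150)
print(round(conf, 2), round(interest, 2))   # 0.75 and 0.92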
Cross-Support Patterns
• These are patterns that relate a high-frequency item, such as milk, to a low-frequency item, such as caviar.
• They are likely to be spurious because their correlations tend to be weak.
  • E.g., the confidence of {caviar} → {milk} is likely to be high, yet the pattern is spurious, since there is probably no correlation between caviar and milk.
  • Observation: on the other hand, the confidence of {milk} → {caviar} is very low.
• Cross-support patterns can be detected and eliminated by examining the lowest-confidence rule that can be extracted from a given itemset.
  • This confidence must exceed a certain threshold for the pattern not to be a cross-support pattern.
Finding the Lowest Confidence
• Recall the anti-monotone property of confidence:
  conf( {i1, i2} → {i3, i4, …, ik} ) ≤ conf( {i1, i2, i3} → {i4, …, ik} )
• This property says that confidence never increases as we shift more items from the left-hand side to the right-hand side of an association rule.
• Hence, the lowest-confidence rule that can be extracted from a frequent itemset contains only one item on its left-hand side.
Finding the Lowest Confidence
• Given a frequent itemset {i1, i2, …, ik}, the rule
  {ij} → {i1, …, i(j-1), i(j+1), …, ik}
  has the lowest confidence if s(ij) = max{ s(i1), s(i2), …, s(ik) }.
• This follows directly from the definition of confidence as the ratio between the support of the rule and the support of the rule's antecedent.
Finding the Lowest Confidence
• Summarizing, the lowest confidence attainable from a frequent itemset {i1, i2, …, ik} is
  h-confidence({i1, …, ik}) = s({i1, i2, …, ik}) / max{ s(i1), s(i2), …, s(ik) }
• This is also known as the h-confidence or all-confidence measure.
• Cross-support patterns can be eliminated by requiring that the h-confidence values of the patterns exceed a user-specified threshold hc.
• h-confidence is anti-monotone, i.e.,
  h-confidence({i1, i2, …, ik}) ≥ h-confidence({i1, i2, …, i(k+1)}),
  and can therefore be incorporated directly into the mining algorithm.
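A tiny sketch of the h-confidence computation; the function name and the milk/caviar supports are hypothetical, chosen only to show why a cross-support pattern gets a low score.

def h_confidence(itemset, item_supports, itemset_support):
    """Sketch: h-confidence (all-confidence) of an itemset.

    Cross-support patterns score low because the denominator
    (the most frequent item's support) dwarfs the numerator.
    """
    return itemset_support / max(item_supports[i] for i in itemset)

# Hypothetical supports: milk is very frequent, caviar is rare
item_supports = {"milk": 0.7, "caviar": 0.0004}
print(h_confidence({"milk", "caviar"}, item_supports, itemset_support=0.0003))
# ~ 0.00043 -> eliminated by any reasonable h-confidence threshold hc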
Examples of Sequences
• Web sequence: < {Homepage} {Electronics} {Digital Cameras} {Canon Digital Camera} {Shopping Cart} {Order Confirmation} {Return to Shopping} >
• Purchase history of a given customer: < {Java in a Nutshell, Intro to Servlets} {EJB Patterns} … >
• Sequence of classes taken by a computer science major: < {Algorithms and Data Structures, Introduction to Operating Systems} {Database Systems, Computer Architecture} {Computer Networks, Software Engineering} {Computer Graphics, Parallel Programming} … >
Formal Definition of a Sequence
• A sequence is an ordered list of elements (transactions): s = < e1 e2 e3 … >
• Each element contains a collection of events (items): ei = {i1, i2, …, ik}
• Each element is attributed to a specific time or location.
• A k-sequence is a sequence that contains k events (items).
(Figure: a sequence drawn as a timeline of elements/transactions, e.g. E1E2, E1E3, E2, E3E4, E2, with the events/items inside each element.)
Formal Definition of a Subsequence
• A sequence < a1 a2 … an > is contained in another sequence < b1 b2 … bm > (m ≥ n) if there exist integers i1 < i2 < … < in such that a1 ⊆ b_i1, a2 ⊆ b_i2, …, an ⊆ b_in.
• The support of a subsequence w is the fraction of data sequences that contain w.
• A sequential pattern is a frequent subsequence (i.e., a subsequence whose support is ≥ minsup).
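A minimal sketch of this containment test and of support counting, assuming sequences are represented as lists of Python sets (one set per element); contains, support, and the toy database db are illustrative names and data.

def contains(data_seq, sub_seq):
    """Sketch: is sub_seq contained in data_seq?

    Greedily match each element a_i of sub_seq to the earliest later element
    b_j of data_seq with a_i being a subset of b_j.
    """
    i = 0
    for element in data_seq:
        if i < len(sub_seq) and sub_seq[i] <= element:   # a_i is a subset of b_j
            i += 1
    return i == len(sub_seq)

def support(database, sub_seq):
    """Fraction of data sequences that contain sub_seq."""
    return sum(contains(s, sub_seq) for s in database) / len(database)

# Toy check: <{1} {3}> is contained in the first two data sequences only
db = [[{1}, {2, 3}, {4}], [{1, 2}, {3}], [{2}, {4}]]
print(support(db, [{1}, {3}]))   # 2/3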
APRIORI-like Algorithm
• Make the first pass over the sequence database to yield all the 1-element frequent sequences.
• Repeat until no new frequent sequences are found:
  • Candidate generation: merge pairs of frequent subsequences found in the (k-1)-th pass to generate candidate sequences that contain k items.
  • Candidate pruning: prune candidate k-sequences that contain infrequent (k-1)-subsequences.
  • Support counting: make a new pass over the sequence database to find the support for these candidate sequences.
  • Eliminate candidate k-sequences whose actual support is less than minsup.
Candidate Generation
• Base case (k = 2):
  • Merging two frequent 1-sequences <{i1}> and <{i2}> produces three candidate 2-sequences: <{i1} {i2}>, <{i2} {i1}>, and <{i1 i2}> (an element is a set, so <{i1 i2}> and <{i2 i1}> are the same sequence).
• General case (k > 2):
  • A frequent (k-1)-sequence w1 is merged with another frequent (k-1)-sequence w2 to produce a candidate k-sequence if the subsequence obtained by removing the first event in w1 is the same as the subsequence obtained by removing the last event in w2.
  • The resulting candidate is the sequence w1 extended with the last event of w2:
    • If the last two events in w2 belong to the same element, the last event in w2 becomes part of the last element in w1.
    • Otherwise, the last event in w2 becomes a separate element appended to the end of w1.
Candidate Generation Examples
• Merging w1 = <{1} {2 3} {4}> and w2 = <{2 3} {4 5}> produces the candidate sequence <{1} {2 3} {4 5}>, because the last two events in w2 (4 and 5) belong to the same element.
• Merging w1 = <{1} {2 3} {4}> and w2 = <{2 3} {4} {5}> produces the candidate sequence <{1} {2 3} {4} {5}>, because the last two events in w2 (4 and 5) do not belong to the same element.
• Finally, the sequences <{1} {2} {3}> and <{1} {2 5}> do not have to be merged. Why?
  • Because removing the first event from the first sequence does not give the same subsequence as removing the last event from the second sequence.
  • If <{1} {2 5} {3}> is a viable candidate, it will be generated by merging a different pair of sequences, <{1} {2 5}> and <{2 5} {3}>.
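A sketch of this merge step, assuming sequences are tuples of elements and each element is a tuple of events kept in sorted order (so "first event" and "last event" are well defined); the function names are illustrative.

def drop_first_event(seq):
    """Remove the first event of the first element of a sequence."""
    head, rest = seq[0], seq[1:]
    return rest if len(head) == 1 else (head[1:],) + rest

def drop_last_event(seq):
    """Remove the last event of the last element of a sequence."""
    last, rest = seq[-1], seq[:-1]
    return rest if len(last) == 1 else rest + (last[:-1],)

def merge(w1, w2):
    """Sketch: merge two frequent (k-1)-sequences into a candidate k-sequence.

    w1 and w2 merge only if dropping the first event of w1 yields the same
    subsequence as dropping the last event of w2.
    """
    if drop_first_event(w1) != drop_last_event(w2):
        return None
    last = w2[-1]
    if len(last) > 1:
        # Last two events of w2 share an element: extend w1's last element
        return w1[:-1] + (w1[-1] + (last[-1],),)
    # Otherwise w2's last event becomes a new element appended to w1
    return w1 + (last,)

# The first example above: <{1} {2 3} {4}> merged with <{2 3} {4 5}>
w1 = ((1,), (2, 3), (4,))
w2 = ((2, 3), (4, 5))
print(merge(w1, w2))   # ((1,), (2, 3), (4, 5))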
Example
• Frequent 3-sequences: <{1} {2} {3}>, <{1} {2 5}>, <{1} {5} {3}>, <{2} {3} {4}>, <{2 5} {3}>, <{3} {4} {5}>, <{5} {3 4}>
• Candidate generation: <{1} {2} {3} {4}>, <{1} {2 5} {3}>, <{1} {5} {3 4}>, <{2} {3} {4} {5}>, <{2 5} {3 4}>
• After candidate pruning: <{1} {2 5} {3}>
Timing Constraints
(Figure: the sequence <{A B} {C} {D E}> with the gap between consecutive elements required to be ≤ max-gap and the overall window required to be ≤ max-span; here max-gap = 2 and max-span = 4.)
Mining Sequential Patterns with Timing Constraints • Approach 1: • Mine sequential patterns without timing constraints • Postprocess the discovered patterns • Approach 2: • Modify algorithm to directly prune candidates that violate timing constraints • Question: • Does APRIORI principle still hold?
APRIORI Principle for Sequence Data
• Suppose max-gap = 1 and max-span = 5.
• In the example data set (not shown), <{2} {5}> has support 40% but <{2} {3} {5}> has support 60%, so the APRIORI principle does not hold.
• The problem exists because of the max-gap constraint.
• It can be avoided by using the concept of a contiguous subsequence.
Contiguous Subsequences
• s is a contiguous subsequence of w = <e1 e2 … ek> if any of the following conditions holds:
  • s is obtained from w by deleting an item from either e1 or ek;
  • s is obtained from w by deleting an item from any element ei that contains at least 2 items;
  • s is a contiguous subsequence of s' and s' is a contiguous subsequence of w (recursive definition).
• Examples: s = <{1} {2}>
  • is a contiguous subsequence of <{1} {2 3}>, <{1 2} {2} {3}>, and <{3 4} {1 2} {2 3} {4}>;
  • is not a contiguous subsequence of <{1} {3} {2}> and <{2} {1} {3} {2}>.
Modified Candidate Pruning Step
• Modified APRIORI principle:
  • If a k-sequence is frequent, then all of its contiguous (k-1)-subsequences must also be frequent.
• Candidate generation doesn't change; only pruning changes.
• Without the max-gap constraint:
  • A candidate k-sequence is pruned if at least one of its (k-1)-subsequences is infrequent.
• With the max-gap constraint:
  • A candidate k-sequence is pruned if at least one of its contiguous (k-1)-subsequences is infrequent.
Cluster Analysis
• Find groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
• Intra-cluster distances are minimized; inter-cluster distances are maximized.
K-means Clustering
• Partitional clustering approach.
• Each cluster is associated with a centroid (center point), typically the mean of the points in the cluster.
• Each point is assigned to the cluster with the closest centroid.
• The number of clusters, K, must be specified.
• The basic algorithm is very simple (see the sketch below).
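A minimal NumPy sketch of the basic (Lloyd's) K-means loop, assuming the data is a 2-D array of points; the random initialization and the toy blobs are illustrative only.

import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Sketch of basic K-means: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct points at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its closest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Toy example: two well-separated blobs
pts = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, labels = kmeans(pts, k=2)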
Solutions to Initial Centroids Problem • Multiple runs • Helps, but probability is not on your side • Bisecting K-means • Not as susceptible to initialization issues
Bisecting K-means
• A straightforward extension of the basic K-means algorithm.
• Simple idea: to obtain K clusters, split the set of points into two clusters, select one of these clusters to split, and so on, until K clusters have been produced.

Algorithm:
Initialize the list of clusters to contain the cluster consisting of all points.
repeat
    Remove a cluster from the list of clusters.
    // Perform several "trial" bisections of the chosen cluster.
    for i = 1 to number of trials do
        Bisect the selected cluster using basic K-means (i.e., 2-means).
    end for
    Select the two clusters from the bisection with the lowest total SSE.
    Add these two clusters to the list of clusters.
until the list of clusters contains K clusters.
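A sketch of this algorithm built on top of scikit-learn's KMeans (assuming scikit-learn is available). Which cluster to split is left open in the pseudocode; this sketch splits the one with the largest SSE, cluster size being another common choice.

import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(points, k, n_trials=5, seed=0):
    """Sketch of bisecting K-means: returns a list of k point arrays."""
    clusters = [points]                      # start with one cluster containing all points
    while len(clusters) < k:
        # Remove the cluster with the largest SSE from the list
        sses = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
        target = clusters.pop(int(np.argmax(sses)))
        # Perform several trial bisections and keep the one with the lowest total SSE
        best = None
        for trial in range(n_trials):
            km = KMeans(n_clusters=2, n_init=1, random_state=seed + trial).fit(target)
            if best is None or km.inertia_ < best.inertia_:
                best = km
        clusters.append(target[best.labels_ == 0])
        clusters.append(target[best.labels_ == 1])
    return clusters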
Limitations of K-means • K-means has problems when clusters are of differing • Sizes • Densities • Non-globular shapes • K-means has problems when the data contains outliers.
Exercise (the figures (a)–(d) are not reproduced here)
• For each figure, could you use K-means to find the patterns represented by the nose, eyes, and mouth?
  • Only for (b) and (d).
  • For (b), K-means would find the nose, eyes, and mouth, but the lower-density points would also be included.
  • For (d), K-means would find the nose, eyes, and mouth straightforwardly as long as the number of clusters was set to 4.
• What limitation does clustering have in detecting all the patterns formed by the points in figure (c)?
  • Clustering techniques can only find patterns of points, not of empty spaces.
Agglomerative Clustering Algorithm

Compute the proximity matrix.
Let each data point be a cluster.
repeat
    Merge the two closest clusters.
    Update the proximity matrix.
until only a single cluster remains.

• The key operation is the computation of the proximity of two clusters.
• Different approaches to defining the distance between clusters distinguish the different algorithms.
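In practice this is available off the shelf; a short sketch using SciPy (an assumption: SciPy is installed) that builds both the MIN (single-link) and MAX (complete-link) hierarchies discussed on the following slides, on illustrative random points.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy points
pts = np.random.rand(10, 2)

# MIN (single link) and MAX (complete link) agglomerative clusterings
single = linkage(pts, method="single")      # cluster proximity = closest pair of points
complete = linkage(pts, method="complete")  # cluster proximity = farthest pair of points

# Cut each dendrogram into 3 flat clusters
labels_min = fcluster(single, t=3, criterion="maxclust")
labels_max = fcluster(complete, t=3, criterion="maxclust")
print(labels_min, labels_max)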
Cluster Similarity: MIN • Similarity of two clusters is based on the two most similar (closest) points in the different clusters • Determined by one pair of points
Hierarchical Clustering: MIN
(Figure: nested clusters of six points and the corresponding dendrogram produced by single link.)
Strength of MIN
• Can handle non-globular shapes.
(Figure: original points and the two clusters found by single link.)
Limitations of MIN
• Sensitive to noise and outliers.
(Figure: original points, a four-cluster result, and a three-cluster result in which the yellow points are wrongly merged with the red ones rather than with the green ones.)
Cluster Similarity: MAX • Similarity of two clusters is based on the two least similar (most distant) points in the different clusters • Determined by all pairs of points in the two clusters
Hierarchical Clustering: MAX
(Figure: nested clusters of six points and the corresponding dendrogram produced by complete link.)