240 likes | 258 Views
This article discusses the evaluation of association patterns generated by association analysis algorithms. It explores the need for well-accepted criteria and presents both statistical and subjective arguments for evaluating the quality of association patterns. It also highlights the pitfalls of relying solely on confidence measures and introduces measures like interest factor and Simpson's paradox. Proper stratification and the consideration of support distribution are emphasized in order to avoid generating spurious patterns.
E N D
Evaluation of Association Patterns • Association analysis algorithms have the potential to generate a large number of patterns. • In real commercial databases we could easily end up with thousands or even millions of patterns, many of which might not be interesting. • Very important to establish a set of wellaccepted criteria for evaluating the quality of association patterns. • First set of criteria can be established through statistical arguments. • Second set of criteria can be established through subjective arguments.
Subjective Arguments • A pattern is considered subjectively uninteresting unless it reveals unexpected information about the data. • E.g., the rule {Butter} {Bread} isn’t interesting, despite having high support and confidence values. • On the other hand, the rule {Diapers} {Beer} is interesting because the relationship is quite unexpected and may suggest a new crossselling opportunity for retailers. • Drawback: Incorporating subjective knowledge into pattern evaluation is a difficult task because it requires a considerable amount of prior information from the domain experts.
f11: support of X and Yf10: support of X and Yf01: support of X and Yf00: support of X and Y Computing Interestingness Measures • Given a rule X Y, the information needed to compute rule interestingness can be obtained from a contingency table Contingency table for X Y Used to define various measures
Pitfall of Confidence The pitfall of confidence can be traced to the fact that the measure ignores the support of the itemset in the rule consequent. • Consider association rule:Tea Coffee • Confidence= • P(Coffee,Tea)/P(Tea) = P(Coffee|Tea) = 150/200 = 0.75 (seems quite high) • But, P(Coffee) = 0.9 • Thus knowing that a person is a tea drinker actually decreases his/her probability of being a coffee drinker from 90% to 75%! • Although confidence is high, rule is misleading In fact P(Coffee|Tea) = P(Coffee, Tea)/P(Tea) = 750/900 = 0.83
Statistical Independence • Population of 1000 students • 600 students know how to swim (S) • 700 students know how to bike (B) • 420 students know how to swim and bike (S,B) • P(S|B) = P(S) ( P(SB)/P(B) = .42 / .7 = .6 = P(S)) • P(SB)/P(B) = P(S) • P(SB) = P(S) P(B) => Statistical independence • P(SB) > P(S) P(B) => Positively correlated • i.e. if someone knows how to swim, then it is more probable he knows how to bike, and vice versa • P(SB) < P(S) P(B) => Negatively correlated • i.e. if someone knows how to swim, then it is less probable he/she knows how to bike, and vice versa
Interest Factor • Measure that takes into account statistical dependence • Interest factor compares the frequency of a pattern against a baseline frequency computed under the statistical independence assumption. • The baseline frequency for a pair of mutually independent variables is: Or equivalently
Interest Equation • Fraction f11/N is an estimate for the joint probability P(A,B), while f1+ /N and f+1 /N are the estimates for P(A) and P(B), respectively. • If A and B are statistically independent, then P(AB)=P(A)×P(B), thus the Interest is 1.
Example: Interest Association Rule: Tea Coffee Interest = 150*1100 / (200*900)= 0.92 (< 1, therefore they are negatively correlated)
Some other example… • What’s the confidence of the following rules: (rule 1) {HDTV=Yes} {Exercise machine = Yes} (rule 2) {HDTV=No} {Exercise machine = Yes} ? Confidence of rule 1 = 99/180 = 55% Confidence of rule 2 = 54/120 = 45% • So, “Customers who buy high-definition televisions are more likely to buy exercise machines that those who don’t buy high-definition televisions.” Right? • Well, maybe not…
Stratification: Simpson paradox • Consider this more detailed table: • What’s the confidence of the rules for each strata: • (rule 1) {HDTV=Yes} {Exercise machine = Yes} • (rule 2) {HDTV=No} {Exercise machine = Yes} ? • College students: • Confidence of rule 1 = 1/10 = 10% • Confidence of rule 2 = 4/34 = 11.8% • Working Adults: • Confidence of rule 1 = 98/170 = 57.7% • Confidence of rule 2 = 50/86 = 58.1% The rules suggest that, for each group, customers who don’t buy HDTV are more likely to buy exercise machines, which contradict the previous conclusion when data from the two customer groups are pooled together.
Importance of Stratification • The lesson here is that proper stratification is needed to avoid generating spurious patterns resulting from Simpson's paradox. For example • Market basket data from a major supermarket chain should be stratified according to store locations, while • Medical records from various patients should be stratified according to confounding factors such as age and gender.
Effect of Support Distribution • Many real data sets have skewed support distribution where most of the items have relatively low to moderate frequencies, but a small number of them have very high frequencies.
Skewed distribution • Tricky to choose the right support threshold for mining such data sets. • If we set the threshold too high (e.g., 20%), then we may miss many interesting patterns involving the low support items from G1. • Such low support items may correspond to expensive products (such as jewelry) that are seldom bought by customers, but whose patterns are still interesting to retailers. • Conversely, when the threshold is set too low, there is the risk of generating spurious patterns that relate a highfrequency item such as milk to a lowfrequency item such as caviar.
Crosssupport patterns • Cross-support patterns are those that relate a highfrequency item such as milk to a lowfrequency item such as caviar. • Likely to be spurious because their correlations tend to be weak. • E.g. the confidence of {caviar}{milk} is likely to be high, but still the pattern is spurious, since there isn’t probably any correlation between caviar and milk. • However, we don’t want to use the Interest Factor during the computation of frequent itemsets because it doesn’t have the antimonotone property. • Interest factor is rather used as a post-processing step. • So, we want to detect cross-support pattern by looking at some antimonotone property.
Crosssupport patterns Definition A crosssupport pattern is an itemset X = {i1, i2 ,…, ik} whose support ratio is less than a userspecified threshold hc. Example Suppose the support for milk is 70%, while the support for sugar is 10% and caviar is 0.04% Given hc = 0.01, the frequent itemset {milk, sugar, caviar} is a crosssupport pattern because its support ratio is r = min {0.7, 0.1, 0.0004} / max {0.7, 0.1, 0.0004} = 0.0004 / 0.7 = 0.00058 < 0.01
Detecting crosssupport patterns • E.g. assuming that hc = 0.3, the itemsets {p,q}, {p,r}, and {p,q,r} are crosssupport patterns. • Because their support ratios, being equal to 0.2, are less than threshold hc. • We can apply a high support threshold, say, 20%, to eliminate the crosssupport patterns…but, this may come at the expense of discarding other interesting patterns such as the strongly correlated itemset {q,r} that has support equal to 16.7%.
Detecting crosssupport patterns • Confidence pruning also doesn’t help. • Confidence for {q}{p} is 80% even though {p, q} is a crosssupport pattern. • Meanwhile, rule {q} {r} also has high confidence even though {q, r} is not a crosssupport pattern. • These demonstrate the difficulty of using the confidence measure to distinguish between rules extracted from crosssupport and noncrosssupport patterns.
Lowest confidence rule • Notice that the rule {p}{q} has very low confidence because most of the transactions that contain p do not contain q. • This observation suggests that: Crosssupport patterns can be detected by examining the lowest confidence rule that can be extracted from a given itemset.
Finding lowest confidence • Recall the antimonotone property of confidence: conf( i1 ,i2i3,i4,…,ik ) conf( i1 ,i2 ,i3i4,…,ik ) • This property suggests that confidence never increases as we shift more items from the left to the righthand side of an association rule. • Hence, the lowest confidence rule that can be extracted from a frequent itemset contains only one item on its lefthand side.
Finding lowest confidence • Given a frequent itemset {i1,i2,i3,i4,…,ik}, the rule ij i1 ,i2 ,i3, ij-1, ij+1, i4,…,ik has the lowest confidence if ? s(ij) = max {s(i1), s(i2),…,s(ik)} • Follows directly from the definition of confidence as the ratio between the rule's support and the support of the rule antecedent.
Finding lowest confidence • Summarizing, the lowest confidence attainable from a frequent itemset {i1,i2,i3,i4,…,ik}, is • This is also known as the h-confidence measure or all-confidence measure.
hconfidence • Clearly, crosssupport patterns can be eliminated by ensuring that the hconfidence values for the patterns exceed some threshold hc. • Observe that the measure is also antimonotone, i.e., hconfidence({i1,i2,…, ik}) hconfidence({i1,i2,…, ik+1 }) and thus can be incorporated directly into the mining algorithm.