210 likes | 227 Views
Detecting Group Differences: Mining Contrast Sets. Author: Stephen D. Bay Advisor: Dr. Hsu Graduate: Yan-Cheng Lin. Outline. Motivation Objective Research Review Search for Contrast Sets Filtering for Summarizing Contrast Set Evaluation Conclusion. Motivation.
E N D
Detecting Group Differences: Mining Contrast Sets Author: Stephen D. Bay Advisor: Dr. Hsu Graduate: Yan-Cheng Lin
Outline • Motivation • Objective • Research Review • Search for Contrast Sets • Filtering for Summarizing Contrast Set • Evaluation • Conclusion
Motivation • Learning group differences a central problem in many domains • Contrasting groups especially important in social science research
Objective • Automatically detect differences between contrasting groups from observational multivariate data
Research Review • time series research • multiple observations • traditional statistical methods • rule learner and decision tree • miss group differences • association rule mining • multiple group and different search criteria
Problem Definition • itemset concept extends to contrast set Definition 1: Let A1,A2,...,Ak be a set of k variables called attributes. Each Ai can take on values from the set {Vi1,Vi2,...Vim}. Contrast set a conjunction of attribute –value pairs defined on groups G1,G2,...,Gn with no Ai occurring more than once.
Define support of contrast set • Definition 2: • The support of a contrast set with respect to a group G is the percentage of examples in G where the contrast set is true. • minimum support difference δ user defined threshold
Search for Contrast Sets • find contrast sets meet our criteria though search • explore all possible contrast sets return only sets meet our criteria • STUCCO (Search and Testing for Understandable Consistent Contrasts): breadth-first search incorporates several efficiently mining techniques
Framework • use set-enumeration trees • use breadth-first search • counting phase organize nodes into candidate groups
Finding Significant Contrast Sets • testing the null hypothesis across all groups • support counts from contingency tables
Controlling Search Error • data mining test many hypotheses • family of tests control Type I error • Bonferroni inequality:given any set of events e1,e2,...,en, the probability of their union is less than or equal to the sum of the individual probabilities
Pruning • prune when contrast sets fail to meet effect size or statistical significance criteria • prune when lead to uninteresting contrast sets • Effect Size Pruning • prune nodes when bound maximum support difference groups below δ • Statistical Significance Pruning • pruned when too few data or maximum value X2 too small
Interest Based Pruning • contrast sets are not interesting when have identical support or relation between groups is fixed • Specializations with Identical Support • marital-status=husband • marital-status=husband ^ Sex = male
Fixed Relations • Fixed Relations • prune node as contrast set specializations do not add new information
Relation to Itemset Mining • minimum support difference criterion implies constraints support levels in individual groups • eliminate large portions of the search space based on: • subset infrequency pruning • effect size pruning • superset frequency pruning • interest based pruning ab abc
Filtering for Summarizing Contrast Set • past approaches • limit the rules shown by constraint the variables or items • compare discovered rules, show only unexpected results • new methods • expectation based statistical approach • identify and select linear trend contrast sets
Statistical Surprise • show most general contrast sets first, more complicated conjunctions if surprising based on previously shown sets • IPF(Iterative Proportional Fitting) find maximum likelihood estimates
Detecting Linear Trends • identical to finding change over time • detect significant contrast set by using the chi-square test • use regression techniques to find the portion of the x2
Evaluation • three research points: • low support difference • few high support attribute-value pairs, lower bounds can’t take advantage • pruning rules • δ -> 0 statistical significance pruning is more important • filtering rules
Conclusion • STUCCO algorithm combined statistical hypothesis testing with search for mining contrast sets • STUCOO has • pruning rules efficient mining at low support differences • guaranteed control over false positives • linear trend detection • compact summarization of result