Detecting Group Differences: Mining Contrast Sets

Author: Stephen D. Bay • Advisor: Dr. Hsu • Graduate: Yan-Cheng Lin

Outline • Motivation • Objective • Research Review • Search for Contrast Sets • Filtering for Summarizing Contrast Set • Evaluation • Conclusion

Motivation • Learning group differences is a central problem in many domains • Contrasting groups is especially important in social science research

Objective • Automatically detect differences between contrasting groups from observational multivariate data

Research Review • time series research • relies on multiple observations • traditional statistical methods • rule learners and decision trees • can miss group differences • association rule mining • contrast sets involve multiple groups and different search criteria

Problem Definition • The itemset concept extends to contrast sets • Definition 1: Let A1, A2, ..., Ak be a set of k variables called attributes. Each Ai can take on values from the set {Vi1, Vi2, ..., Vim}. A contrast set is a conjunction of attribute-value pairs defined on groups G1, G2, ..., Gn, with no Ai occurring more than once.

Support of a Contrast Set • Definition 2: The support of a contrast set with respect to a group G is the percentage of examples in G where the contrast set is true • The minimum support difference δ is a user-defined threshold
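Definition 2 can be illustrated with a minimal sketch. It assumes each example is a dict of attribute values and a contrast set is a dict of attribute-value pairs. The function names, the toy data, and the choice to treat a difference of exactly δ as large enough are assumptions made for illustration, not details taken from the slides.

```python
def support(contrast_set, group):
    """Fraction of examples in `group` where every attribute-value pair holds."""
    if not group:
        return 0.0
    hits = sum(all(ex.get(a) == v for a, v in contrast_set.items()) for ex in group)
    return hits / len(group)


def max_support_difference(contrast_set, groups):
    """Largest difference in support between any two groups."""
    supports = [support(contrast_set, g) for g in groups.values()]
    return max(supports) - min(supports)


def meets_min_support_difference(contrast_set, groups, delta):
    """Assumption: a difference of at least delta counts as large."""
    return max_support_difference(contrast_set, groups) >= delta


# Made-up toy data: two groups of four examples each.
groups = {
    "group_a": [{"occupation": "sales"}] * 3 + [{"occupation": "clerical"}],
    "group_b": [{"occupation": "sales"}] + [{"occupation": "clerical"}] * 3,
}
cset = {"occupation": "sales"}
support_a = support(cset, groups["group_a"])
support_b = support(cset, groups["group_b"])
difference = max_support_difference(cset, groups)
```

Here the contrast set holds for three of four examples in one group and one of four in the other, so it would pass any threshold δ up to the difference between those two supports.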

Search for Contrast Sets • find the contrast sets that meet our criteria through search • explore all possible contrast sets and return only the sets that meet our criteria • STUCCO (Search and Testing for Understandable Consistent Contrasts): a breadth-first search that incorporates several efficient mining techniques

Framework • use set-enumeration trees • use breadth-first search • in the counting phase, organize nodes into candidate groups
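The combination of a set-enumeration tree and breadth-first search can be sketched as follows. This is only an illustration under stated assumptions: attributes are given in a fixed order, a node is extended only with attributes that come later in that order, and no pruning or counting is performed. The function name `enumerate_contrast_sets` and the toy attributes are hypothetical.

```python
from collections import deque


def enumerate_contrast_sets(attribute_values, max_depth):
    """Yield candidate contrast sets level by level (breadth-first).

    attribute_values: list of (attribute, [values]) in a fixed order.
    Each candidate is a tuple of (attribute, value) pairs. A node is only
    extended with attributes later in the order, so every combination is
    generated exactly once and no attribute occurs twice.
    """
    queue = deque([((), 0)])  # (candidate, index of first attribute still allowed)
    while queue:
        candidate, start = queue.popleft()
        if candidate:
            yield candidate
        if len(candidate) == max_depth:
            continue
        for i in range(start, len(attribute_values)):
            attribute, values = attribute_values[i]
            for value in values:
                queue.append((candidate + ((attribute, value),), i + 1))


# Made-up attributes for illustration.
attributes = [("sex", ["male", "female"]), ("degree", ["bs", "phd"])]
candidates = list(enumerate_contrast_sets(attributes, max_depth=2))
```

A real search would test and prune each node before expanding it; the sketch only shows the order in which candidates are visited.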

Finding Significant Contrast Sets • test the null hypothesis that contrast set support is equal across all groups • obtain support counts from contingency tables
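As a hedged illustration of the test, the sketch below computes the Pearson chi-square statistic for a contingency table whose rows are "contrast set true" and "contrast set false" and whose columns are the groups. The counts are made up, the function names are hypothetical, and the p-value helper is valid only for one degree of freedom (a two-group comparison); it assumes every expected count is positive.

```python
import math


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a table given as rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


def p_value_one_dof(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Made-up counts: rows = contrast set true / false, columns = two groups.
table = [[30, 10],
         [20, 40]]
statistic, dof = chi_square(table)
p_value = p_value_one_dof(statistic)
```

With more than two groups the degrees of freedom exceed one, and the p-value would need a general chi-square distribution function from a statistics library.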

Controlling Search Error • data mining tests many hypotheses • control the Type I error over the whole family of tests • Bonferroni inequality: given any set of events e1, e2, ..., en, the probability of their union is less than or equal to the sum of the individual probabilities
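The Bonferroni inequality implies that if each of n tests is run at level α/n, the probability of at least one false positive in the family is at most α. The sketch below applies that plain correction to a single pool of tests. How STUCCO divides α across the levels of its search is not stated on this slide, so the sketch makes no attempt to reproduce it; the function names and p-values are hypothetical.

```python
def bonferroni_alpha(alpha, num_tests):
    """Per-test significance level that keeps the family-wise error at or below alpha."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return alpha / num_tests


def significant_after_correction(p_values, alpha=0.05):
    """Flag each p-value that passes the Bonferroni-corrected cutoff."""
    cutoff = bonferroni_alpha(alpha, len(p_values))
    return [p <= cutoff for p in p_values]


# Made-up p-values for three candidate contrast sets.
flags = significant_after_correction([0.001, 0.02, 0.04], alpha=0.05)
```

Two of the three made-up p-values fall below 0.05 on their own but no longer pass once the cutoff is divided by the number of tests.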

Pruning • prune when contrast sets fail to meet the effect size or statistical significance criteria • prune when specializations would lead to uninteresting contrast sets • Effect Size Pruning • prune nodes when the maximum support difference between groups can be bounded below δ • Statistical Significance Pruning • prune nodes when there is too little data or the maximum possible value of χ² is too small
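One simple bound that justifies effect size pruning follows from the fact that adding an attribute-value pair can never raise support: every specialization of a node has, in each group, support no larger than the node itself. If the node's support is below δ in every group, no specialization can reach a support difference of δ. The sketch below encodes only this bound, which is an assumption about how the rule might be applied; the slide does not spell out the exact bound STUCCO uses, and the function name is hypothetical.

```python
def can_prune_by_effect_size(group_supports, delta):
    """True when no specialization of the node can reach a support difference of delta.

    group_supports: the node's support in each group, as fractions in [0, 1].
    The support of any specialization lies between 0 and the node's support,
    so its largest possible difference between groups is at most the node's
    largest per-group support.
    """
    return max(group_supports) < delta


# Made-up supports for two nodes, with delta = 0.05.
prune_rare_node = can_prune_by_effect_size([0.02, 0.01], delta=0.05)
prune_common_node = can_prune_by_effect_size([0.30, 0.02], delta=0.05)
```

The first node can be dropped together with its whole subtree, while the second must still be expanded.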

Interest Based Pruning • contrast sets are not interesting when a specialization has identical support or the relation between groups is fixed • Specializations with Identical Support • marital-status=husband • marital-status=husband ∧ sex=male
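The husband example can be checked mechanically: a specialization is redundant when it matches exactly the same number of examples as its parent in every group. The sketch below assumes examples are dicts of attribute values; the records and function names are made up for illustration.

```python
def count(contrast_set, group):
    """Number of examples in `group` that satisfy every attribute-value pair."""
    return sum(all(ex.get(a) == v for a, v in contrast_set.items()) for ex in group)


def has_identical_support(parent, specialization, groups):
    """True when the specialization matches as many examples as its parent in every group."""
    return all(count(parent, g) == count(specialization, g) for g in groups.values())


# Made-up records in which every husband is male.
groups = {
    "group_1": [
        {"marital-status": "husband", "sex": "male"},
        {"marital-status": "wife", "sex": "female"},
        {"marital-status": "unmarried", "sex": "male"},
    ],
    "group_2": [
        {"marital-status": "husband", "sex": "male"},
        {"marital-status": "unmarried", "sex": "female"},
    ],
}
parent = {"marital-status": "husband"}
specialization = {"marital-status": "husband", "sex": "male"}
redundant = has_identical_support(parent, specialization, groups)
```

Because sex=male adds nothing once marital-status=husband is known, the specialization carries no new information and can be skipped.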

Fixed Relations • prune the node when the relation between groups is fixed, since specializations of the contrast set add no new information

Relation to Itemset Mining • the minimum support difference criterion implies constraints on the support levels in individual groups • eliminate large portions of the search space based on: • subset infrequency pruning • effect size pruning • superset frequency pruning • interest based pruning

Filtering for Summarizing Contrast Set • past approaches • limit the rules shown by constraining the variables or items • compare discovered rules and show only unexpected results • new methods • expectation based statistical approach • identify and select linear trend contrast sets

Statistical Surprise • show the most general contrast sets first, then more complicated conjunctions only if they are surprising given the previously shown sets • IPF (Iterative Proportional Fitting) finds the maximum likelihood estimates
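Iterative Proportional Fitting can be sketched for the simplest case, a two-way table fitted to known row and column totals. The real use described on the slide involves expected values for conjunctions of several attribute-value pairs, so this two-dimensional version is a simplification; the function name, seed table, and totals are assumptions made for illustration.

```python
def ipf_2d(seed, row_targets, col_targets, max_iterations=100, tol=1e-9):
    """Rescale `seed` until its row and column sums match the target totals."""
    table = [[float(x) for x in row] for row in seed]
    for _ in range(max_iterations):
        # Scale each row to match its target total.
        for i, row in enumerate(table):
            total = sum(row)
            if total > 0:
                table[i] = [x * row_targets[i] / total for x in row]
        # Scale each column to match its target total.
        for j in range(len(table[0])):
            total = sum(row[j] for row in table)
            if total > 0:
                for row in table:
                    row[j] *= col_targets[j] / total
        # Columns are exact after the last step; stop once rows also agree.
        if all(abs(sum(row) - t) < tol for row, t in zip(table, row_targets)):
            break
    return table


# With a uniform seed, the fit is the independence table for the given totals.
expected = ipf_2d([[1, 1], [1, 1]], row_targets=[40, 60], col_targets=[50, 50])
```

A contrast set would then count as surprising when its observed count differs markedly from the fitted expectation.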

Detecting Linear Trends • identical to finding change over time • detect significant contrast sets using the chi-square test • use regression techniques to find the portion of the χ² statistic explained by a linear trend

Evaluation • three research points: • low support difference • with few high-support attribute-value pairs, the lower bounds cannot be exploited • pruning rules • as δ → 0, statistical significance pruning becomes more important • filtering rules

Conclusion • The STUCCO algorithm combines statistical hypothesis testing with search for mining contrast sets • STUCCO has • pruning rules for efficient mining at low support differences • guaranteed control over false positives • linear trend detection • compact summarization of results
