Discriminative Pattern Mining

Discriminative Pattern Mining By Mohammad Hossain

Based on the paper "Mining Low-Support Discriminative Patterns from Dense and High-Dimensional Data" by Gang Fang, Gaurav Pandey, Wen Wang, Manish Gupta, Michael Steinbach, and Vipin Kumar

What is a Discriminative Pattern?
• A pattern is said to be discriminative when its occurrence in two data sets (or in two different classes of a single data set) is significantly different.
• One way to measure the discriminative power of a pattern is the difference between the supports of the pattern in the two data sets.
• When this support difference (DiffSup) is at least a given threshold, the pattern is called discriminative.

An example
If we set the DiffSup threshold to 2, then the patterns C and ABC become interesting patterns.

Importance
• Discriminative patterns have been shown to be useful for improving classification performance on data sets where combinations of features have better discriminative power than the individual features.
• For example, for biomarker discovery from case-control data (e.g. disease vs. normal samples), it is important to identify groups of biological entities, such as genes and single-nucleotide polymorphisms (SNPs), that are collectively associated with a certain disease or other phenotypes.

P1 = {i1, i2, i3}, P2 = {i5, i6, i7}, P3 = {i9, i10}, P4 = {i12, i13, i14}.
DiffSup is NOT anti-monotonic. As a result, it cannot be used directly for pruning in an Apriori-like framework.

Apriori: A Candidate Generation-and-Test Approach
• Apriori pruning principle: if an itemset is infrequent, none of its supersets should be generated or tested.
• Method:
  • Initially, scan the DB once to get the frequent 1-itemsets.
  • Generate length-(k+1) candidate itemsets from the length-k frequent itemsets.
  • Test the candidates against the DB.
  • Terminate when no frequent or candidate set can be generated.
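The candidate generation-and-test loop above can be sketched in Python as follows. This is a minimal illustration, not an optimized implementation; the function name `apriori` and the four-transaction toy database are assumptions made for this example and are not taken from the slides.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every itemset with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]

    # First scan: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 1
    while frequent:
        # Generate (k+1)-candidates by joining pairs of frequent k-itemsets.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) != k + 1:
                continue
            # Apriori pruning: every k-subset of a candidate must be frequent.
            if all(frozenset(sub) in frequent for sub in combinations(union, k)):
                candidates.add(union)
        # Test the surviving candidates against the database.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# Hypothetical toy database (illustrative only).
toy_db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
frequent_itemsets = apriori(toy_db, min_support=2)
for itemset, count in sorted(frequent_itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The pruning step is the part that relies on anti-monotonicity: a candidate is only counted if all of its k-subsets were already found frequent.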

The Apriori Algorithm: An Example
[Figure: Apriori run on database TDB with Supmin = 2, showing candidate sets C1, C2, C3 and frequent sets L1, L2, L3 across three scans.]

But here we see that, although the patterns AB and AC both have DiffSup below the threshold (2), their superset ABC has DiffSup = 2, which equals the threshold, so ABC is interesting. Therefore AB and AC cannot be pruned.
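The same effect can be reproduced on a tiny made-up data set. The sketch below assumes absolute support counts (as the threshold of 2 suggests) and uses hypothetical transactions chosen only to show that a superset can reach the threshold while its subsets do not; it is not the data from the slide.

```python
def support(pattern, transactions):
    """Number of transactions that contain every item of the pattern."""
    items = set(pattern)
    return sum(1 for t in transactions if items <= set(t))


def diff_sup_count(pattern, class1, class2):
    """Absolute difference of support counts between the two classes."""
    return abs(support(pattern, class1) - support(pattern, class2))


# Hypothetical data: each string is one transaction, each letter one item.
class1 = ["ABC", "ABC"]
class2 = ["AB", "AC"]

scores = {p: diff_sup_count(p, class1, class2) for p in ("AB", "AC", "ABC")}
print(scores)  # {'AB': 1, 'AC': 1, 'ABC': 2}
```

Here AB and AC fall below a threshold of 2 while ABC meets it, so pruning AB or AC would lose a discriminative superset.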

BASIC TERMINOLOGY AND PROBLEM DEFINITION
• Let D be a dataset with a set of m items, I = {i1, i2, ..., im}, and two class labels S1 and S2. The instances of classes S1 and S2 are denoted by D1 and D2, so |D| = |D1| + |D2|.
• For a pattern (itemset) α = {α1, α2, ..., αl}, the sets of instances in D1 and D2 that contain α are denoted by Dα1 and Dα2.
• The relative supports of α in classes S1 and S2 are RelSup1(α) = |Dα1| / |D1| and RelSup2(α) = |Dα2| / |D2|.
• The absolute difference of the relative supports of α in D1 and D2 is denoted as DiffSup(α) = |RelSup1(α) − RelSup2(α)|.

New function
• Some new functions are proposed that have the anti-monotonic property and can be used for pruning in an Apriori-like framework.
• One of them is BiggerSup, defined as BiggerSup(α) = max(RelSup1(α), RelSup2(α)).
• BiggerSup is anti-monotonic and is an upper bound of DiffSup, so it can be used for pruning in an Apriori-like framework.

BiggerSup is a weak upper bound of DiffSup
• For instance, suppose we use it in the previous example to find discriminative patterns with threshold 4.
• P3 can be pruned, because it has a BiggerSup of 3.
• P2 cannot be pruned (BiggerSup(P2) = 6), even though it is not discriminative (DiffSup(P2) = 0).
• More generally, BiggerSup-based pruning can only prune infrequent non-discriminative patterns with relatively low support, but not frequent non-discriminative patterns.
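The pruning behaviour described above can be illustrated with a short sketch that works on per-class support counts. The pair (6, 6) for P2 follows from BiggerSup(P2) = 6 and DiffSup(P2) = 0; the second count for P3 is not given in the text, so the value used here is an assumption, as is the convention that a pattern is kept when its score is at least the threshold.

```python
def bigger_sup(s1, s2):
    """Upper bound of DiffSup: the larger of the two class supports."""
    return max(s1, s2)


def diff_sup_from_supports(s1, s2):
    """DiffSup computed from the two class supports."""
    return abs(s1 - s2)


def can_prune(s1, s2, threshold):
    """A pattern (and its supersets) can be pruned if BiggerSup is below the threshold."""
    return bigger_sup(s1, s2) < threshold


threshold = 4
supports = {
    "P2": (6, 6),  # implied by BiggerSup = 6 and DiffSup = 0
    "P3": (3, 3),  # only BiggerSup = 3 is given; the second count is assumed
}
for name, (s1, s2) in supports.items():
    print(name, bigger_sup(s1, s2), diff_sup_from_supports(s1, s2), can_prune(s1, s2, threshold))
```

P2 survives the pruning test even though its DiffSup is 0, which is exactly the weakness of the bound.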

A new measure: SupMaxK
• The SupMaxK of an itemset α in D1 and D2 is defined as SupMaxK(α) = RelSup1(α) − max_{β⊆α, |β|=K} RelSup2(β), i.e. the maximum is taken over all subsets β of α with exactly K items.
• If K = 1, it is called SupMax1 and is defined as SupMax1(α) = RelSup1(α) − max_{a∈α} RelSup2({a}).
• Similarly, with K = 2 we can define SupMax2, which is also called SupMaxPair.
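The definition of SupMaxK can be sketched directly by enumerating the size-K subsets of α. This is a brute-force illustration under the assumption that instances are sets of item names; `sup_max_k`, `rel_sup` and the toy data are hypothetical names and values chosen for the example.

```python
from itertools import combinations


def rel_sup(pattern, instances):
    """Fraction of instances that contain the whole pattern."""
    pattern = frozenset(pattern)
    return sum(1 for inst in instances if pattern <= frozenset(inst)) / len(instances)


def sup_max_k(pattern, d1, d2, k):
    """RelSup1(pattern) minus the largest RelSup2 over all size-k subsets of pattern."""
    pattern = frozenset(pattern)
    if not 1 <= k <= len(pattern):
        raise ValueError("k must be between 1 and the pattern size")
    best_subset_sup = max(rel_sup(sub, d2) for sub in combinations(pattern, k))
    return rel_sup(pattern, d1) - best_subset_sup


# Hypothetical instances for the two classes.
d1 = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b", "c"}, {"a"}]
d2 = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]

alpha = {"a", "b", "c"}
print(sup_max_k(alpha, d1, d2, 1))  # SupMax1
print(sup_max_k(alpha, d1, d2, 2))  # SupMaxPair
```

In this toy data SupMax1 is 0.0 and SupMaxPair is 0.25, while DiffSup of the same pattern is 0.5, so the larger K gives the tighter estimate here.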

Properties of the SupMaxK Family

Relationship between DiffSup, BiggerSup and the SupMaxK Family

SupMaxPair: A Special Member Suitable for High-Dimensional Data
• In SupMaxK, as K increases we get a more complete set of discriminative patterns.
• But as K increases, the complexity of calculating SupMaxK also increases.
• In fact, the complexity of calculating SupMaxK is O(m^K).
• So for high-dimensional data (where m is large), a high value of K (K > 2) makes it infeasible.
• In that case SupMaxPair can be used.
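To get a feel for the O(m^K) growth, the sketch below counts how many size-K item subsets exist among m items, which is the number of class-2 supports that would have to be available for SupMaxK. The value m = 10,000 is an arbitrary assumption for illustration.

```python
from math import comb

m = 10_000  # hypothetical number of items

# Number of size-K item subsets, for K = 1, 2, 3.
subset_counts = {k: comb(m, k) for k in (1, 2, 3)}
for k, n in subset_counts.items():
    print(f"K={k}: {n:,} subsets")
```

Going from K = 2 to K = 3 multiplies the count by roughly m / 3, which is why pairs are the practical limit when m is large.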
