Assessing the Quantitative Significance of Sequential Patterns

Yifan Cao, Dr. Byron Gao 5/31/11-8/1/11

Project Statement • We seek to find a method to quantitatively describe the significance of general sequential patterns

What is “significant” or “interesting?” What makes a pattern an interesting one? Naive answer: depth and length

P-Values • P-value(Pattern p) = Probability(p occurs by chance at least as often as it does in our data) • A smaller p-value means a more significant pattern

Why would we care? • Almost all significance measures deal with non-sequential data • Those dealing with sequential data are highly data-specific • A general measure separates patterns that matter from patterns that are merely products of the data set’s structure

Sequential vs. Non-sequential Data • Examples of Non-sequential Data: • Groceries purchased • Facebook friends • Top 5 favorite exotic fruits and vegetables • Examples of Sequential Data: • Words • DNA Sequences • Number of hours you sleep per night • Unclear/Could be both • Products purchased on Amazon (student prime!) • Books read

Structural Differences • Non-sequential Data: • Easily expressed as a matrix of supports • No problems with subsets having different sizes • Easy to construct similar data sets through randomization • Sequential Data: • Cannot be expressed as a 2-D matrix of supports • Subsets of different lengths are problematic for a matrix • Cannot carry out randomization on a matrix of items

Solution? Think Simpler! • We’re looking for a method for general sequential patterns • Proposal- • Randomize the ordering of items in each sequence • Obtain a probability of a pattern occurring for each sequence • Use such probabilities to generate a distribution for total number of pattern occurrences
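The randomization step in the proposal can be sketched as a small simulation. This is only an illustration under assumptions not stated on the slide: each item is a single character, a pattern "occurs" when it appears as a contiguous run, and the function name, trial count, and seed are hypothetical.

```python
import random


def estimate_containment(sequence, pattern, trials=20000, seed=0):
    """Estimate the chance that a random reordering of `sequence`
    contains `pattern` as a contiguous run (assumed matching rule)."""
    rng = random.Random(seed)       # fixed seed so the sketch is repeatable
    items = list(sequence)
    hits = 0
    for _ in range(trials):
        rng.shuffle(items)          # randomize the ordering of the items
        if pattern in "".join(items):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # For the sequence ABCBA and pattern ABC the estimate should land near 0.2.
    print(estimate_containment("ABCBA", "ABC"))
```

A simulation like this is slow compared with a closed-form count, but it is a handy cross-check on any formula for the per-sequence probability.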

Computing p-values • For each sequence in the data set, find the probability that if its ordering is randomized, the pattern will occur • With each sequence having a probability of containing a given pattern, construct the overall distribution of times said pattern occurs in the data set
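One way to build the overall distribution is sketched below, assuming each sequence is randomized independently of the others, so the total count is a sum of independent yes/no outcomes with different probabilities. The function names are hypothetical, and the per-sequence probabilities are taken as given inputs.

```python
def occurrence_distribution(probs):
    """Return dist where dist[k] is the probability that exactly k
    sequences contain the pattern, given per-sequence probabilities."""
    dist = [1.0]                        # with no sequences, the count is 0
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)  # this sequence lacks the pattern
            nxt[k + 1] += mass * p      # this sequence contains the pattern
        dist = nxt
    return dist


def p_value(probs, observed):
    """Probability of at least `observed` occurrences under randomization."""
    return sum(occurrence_distribution(probs)[observed:])


if __name__ == "__main__":
    # Hypothetical data set of three sequences, pattern observed in two of them.
    print(p_value([0.2, 0.05, 0.5], observed=2))
```

The dynamic-programming update costs time quadratic in the number of sequences; for very large data sets an approximation to this distribution may be preferable.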

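Use combinatorics to compute the probability that a random ordering of a given sequence will contain pattern P • N = # of unique orderings = $\frac{(\text{sequence length})!}{\prod (\text{dictionary value})!}$, where the dictionary values are the counts of each distinct item • For ABCDE (sequence length 5, dictionary values 1,1,1,1,1): $N = \frac{5!}{1!\,1!\,1!\,1!\,1!} = 120$ • For ABCBA (sequence length 5, dictionary values 2,2,1): $N = \frac{5!}{2!\,2!\,1!} = 30$ • M = (sequence length − pattern length + 1) × $\frac{(\text{surplus length})!}{\prod (\text{surplus dictionary value})!}$, where the surplus is what remains of the sequence after the pattern's items are removed • For P=ABC and sequence ABCBA (surplus length 2, surplus dictionary values 1,1): $M = 3 \cdot \frac{2!}{1!\,1!} = 6$ • So the probability that a random ordering of ABCBA contains pattern P=ABC is M/N = 6/30 = 1/5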
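The count above can be sketched in a few lines of Python. The function names are hypothetical, items are assumed to be single characters, and the pattern is assumed to occur contiguously. Note that M counts (position, arrangement) pairs, so when a pattern can appear more than once in a single ordering the ratio M/N overstates the true probability; the sketch does not correct for that.

```python
from collections import Counter
from fractions import Fraction
from math import factorial


def multiset_orderings(counts):
    """Number of distinct orderings of a multiset with the given item counts."""
    total = factorial(sum(counts))
    for c in counts:
        total //= factorial(c)
    return total


def containment_probability(sequence, pattern):
    """M / N for a random ordering of `sequence` containing `pattern`."""
    seq_counts = Counter(sequence)
    surplus = seq_counts.copy()
    surplus.subtract(Counter(pattern))      # items left after removing the pattern
    if any(v < 0 for v in surplus.values()):
        return Fraction(0)                  # sequence lacks the pattern's items
    n = multiset_orderings(seq_counts.values())
    positions = len(sequence) - len(pattern) + 1
    m = positions * multiset_orderings(surplus.values())
    return Fraction(m, n)


if __name__ == "__main__":
    print(containment_probability("ABCBA", "ABC"))  # 1/5
```

Using `Fraction` keeps the result exact, which makes it easy to compare against hand calculations.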

Advantages • All work is probabilistic, so finding p-values is a very fast operation • Longer patterns’ significance can be built off of shorter patterns’ significance • Allows large, comprehensive sets of patterns to be judged for significance • Could lead to a significance-based closed-frequent pattern-finding algorithm

Related Works • Randomization of real-valued matrices for assessing the significance of data mining results by Markus Ojala • Ranking Sequential Patterns with Respect to Significance by Robert Gwadera • Frequent Pattern Mining with Uncertain Data by Charu Aggarwal

Further Study • Dealing with patterns occurring multiple times within one sequence • Modifying significance calculation to allow for more flexibility while maintaining overall structure of data • Algorithmic applications, especially in closed-frequent types of pattern finding algorithms

In Conclusion • Our method provides great accessibility to the field of sequential patterns • The combinatorial approach means it runs very fast • The significance calculation is highly scalable to huge sets of patterns

Thank you for listening!
