Frequent Pattern Mining

Toon Calders and Bart Goethals, ADReM research group

Outline • What is data mining? • Definition • local patterns vs global models • Supervised vs Unsupervised • What do we do? • Frequent set mining • More complex data types

What is data mining? “the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets.” [Figure labels: Data, Information, $ $ $]

Supervised vs Unsupervised • Supervised: • data has been annotated • well-defined task: learn to annotate new data • e.g., examples of good/bad customers • Unsupervised: • only data has been given • no annotation • « find knowledge »

Local vs Global • Local pattern: • tells something about a small subset of the data • e.g., « 90% of the customers who purchase beer also buy chips » • Global model: • fits a single model to all of the data, as a summary • e.g., there is a linear relationship between $ spent and the income of the customers

What do we do? • Pattern mining • Local • Unsupervised • Useful for • large datasets • exploration: « what is this data like? » • Less suitable for • well-studied and understood problem domains

Outline • What is data mining? • Frequent set mining • Market Basket analysis • Association rules • Interestingness measures • Numerical attributes • More complex data types

Market Basket Analysis • Data: a collection of customer transactions • Goal: find sets of products that frequently occur together
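The goal above can be illustrated with a brute-force sketch that counts, for every set of products, the number of transactions containing it. The transactions and the `min_support` threshold are hypothetical and not taken from the slides; the function name `frequent_itemsets` is also an assumption.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions (not from the slides).
transactions = [
    {"beer", "chips", "nuts"},
    {"beer", "chips"},
    {"beer", "diapers"},
    {"chips", "nuts"},
    {"beer", "chips", "nuts"},
]


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets in at least min_support transactions."""
    counts = Counter()
    for t in transactions:
        items = sorted(t)
        # Enumerate every non-empty subset of the transaction.
        for k in range(1, len(items) + 1):
            for subset in combinations(items, k):
                counts[frozenset(subset)] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


frequent = frequent_itemsets(transactions, min_support=3)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Enumerating all subsets of each transaction is exponential in the transaction length, so this is only workable for tiny examples.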

Applications • Supermarket • product placement • special promotions • Web search • which keywords often occur together in web pages? • Health care • frequent sets of symptoms for a disease

Applications • Basically works for all data that can be represented as a set of examples/objects having certain properties • patient / symptoms • movies / ratings • web pages / keywords • basket / products • …

Algorithms • Computationally a very hard problem • with n products, there are 2^n sets of products • Hundreds of algorithms have been proposed • for sparse/dense data • many rows/columns • data fits/does not fit in memory • …
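One well-known way to avoid enumerating all 2^n sets is a level-wise search in the style of Apriori. The slides do not name a specific algorithm, so the sketch below is only an illustration of the idea: a set can be frequent only if all of its subsets are frequent, so candidates of size k+1 are built from frequent sets of size k. The toy data and the function name `apriori` are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Level-wise search: count a k-set only if all its (k-1)-subsets are frequent."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while candidates:
        # Count the support of every candidate of size k.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join frequent k-sets into (k+1)-candidates, then prune by monotonicity.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


# Hypothetical toy transactions (not from the slides).
toy = [
    {"beer", "chips", "nuts"},
    {"beer", "chips"},
    {"beer", "diapers"},
    {"chips", "nuts"},
    {"beer", "chips", "nuts"},
]
result = apriori(toy, min_support=3)
```

In this toy run the candidate {beer, chips, nuts} is never counted, because its subset {beer, nuts} is not frequent.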

Association Rules • Conditional probabilities X → Y (c%): if X is in the transaction, then there is a probability of c% that Y is in it as well. • Based on the frequent sets, associations can be computed easily: { Beer, Chips } → { Snack nuts } (75%); { adrem.html, cnts.html } → { islab.html } (80%); { rain } → { overcast } (100%)
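As a minimal sketch of how the percentage of a rule X → Y follows from set counts: the confidence is the number of transactions containing both X and Y divided by the number containing X. The transactions below are hypothetical and were chosen so that the rule { beer, chips } → { snack nuts } comes out at 75%; the helper names are assumptions.

```python
def support(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(body, head, transactions):
    """Estimate P(head in transaction | body in transaction)."""
    body_count = support(body, transactions)
    if body_count == 0:
        return None  # the rule never applies
    return support(body | head, transactions) / body_count


# Hypothetical transactions (not from the slides).
transactions = [
    {"beer", "chips", "snack nuts"},
    {"beer", "chips", "snack nuts"},
    {"beer", "chips", "snack nuts"},
    {"beer", "chips"},
    {"milk"},
]

c = confidence({"beer", "chips"}, {"snack nuts"}, transactions)
print(f"{{beer, chips}} -> {{snack nuts}}: {c:.0%}")
```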

Interestingness Measures • Not all association rules are interesting • Domain knowledge: pregnant → female, rain → overcast • Redundancy: if A → B (100%), then also AC → B, AD → B, … • Independence: if 70% of customers buy product A, then X → A (70%), Y → A (70%) • Too many rules
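The independence problem can be made concrete by comparing a rule's confidence with the base rate of its head. The sketch below uses lift, a standard measure that the slides do not name, with hypothetical counts: a lift close to 1 means the rule says nothing beyond "70% buy A".

```python
def lift(n_total, n_body, n_head, n_both):
    """Confidence of body -> head divided by the base rate of head."""
    confidence = n_both / n_body
    base_rate = n_head / n_total
    return confidence / base_rate


# Hypothetical counts: 14 of 20 customers buy A (70%), and 7 of the 10
# customers who buy X also buy A (70%), so X tells us nothing about A.
independent = lift(n_total=20, n_body=10, n_head=14, n_both=7)

# Hypothetical counts for a rule that does carry information.
associated = lift(n_total=20, n_body=10, n_head=10, n_both=9)

print(independent, associated)
```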

Interestingness Measures • Incorporating background knowledge • e.g., via Bayesian network • only produce rules that deviate from background knowledge • Redundancies • Condensed representations: produce only a non-redundant subset of patterns

Interestingness Measures • Independence • statistical significance tests • χ² • Careful with conclusions! 1000 tests with significance level 0.05 … (Bonferroni correction) • Too many rules • Constraints • Top-k mining
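The warning about 1000 tests can be worked through numerically. The sketch below, under assumed counts, computes the Pearson χ² statistic for a 2x2 table (rule body present/absent versus head present/absent), its p-value for one degree of freedom, and the Bonferroni-corrected threshold. The function names and the contingency table are hypothetical.

```python
import math


def chi_square_2x2(n_both, n_body_only, n_head_only, n_neither):
    """Pearson chi-square statistic for a 2x2 table (all marginals assumed nonzero)."""
    observed = [[n_both, n_body_only], [n_head_only, n_neither]]
    n = n_both + n_body_only + n_head_only + n_neither
    rows = [sum(r) for r in observed]
    cols = [observed[0][j] + observed[1][j] for j in range(2)]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat


def p_value_1dof(stat):
    """Upper tail of the chi-square distribution with one degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))


alpha, n_tests = 0.05, 1000
expected_false_positives = alpha * n_tests  # about 50 rules pass by chance alone
bonferroni_threshold = alpha / n_tests      # each test must reach 0.00005

stat = chi_square_2x2(40, 10, 10, 40)       # hypothetical counts
p = p_value_1dof(stat)
print(stat, p, p < bonferroni_threshold)
```

With 1000 independent rules and no real associations, about 50 would still pass an uncorrected 0.05 test, which is why the per-test threshold is tightened.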

Numerical Attributes • Association rule mining is also possible for numerical attributes • discretization: make continuous attributes ordinal • information loss • not appropriate if the order between the values is important • other methods: • a recent method based on rank correlation measures
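A minimal sketch of discretization, assuming equal-width binning (the slides do not say which scheme is meant): each numerical value is replaced by a bin index, which can then be used as an ordinary item in a transaction. The attribute name `age` and the values are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Map each value to a bin index in 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)  # all values identical
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [18, 25, 33, 47, 52, 68]  # hypothetical attribute values
bins = equal_width_bins(ages, n_bins=3)
items = [f"age=bin{b}" for b in bins]
print(items)
```

The resulting items are plain labels: a set miner treats `age=bin0` and `age=bin2` as unrelated, which is the information loss the slide refers to.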

Complex Patterns • Sets • Sequences • Graphs • Relational Structures • Generating and counting such patterns becomes much more complex too!

Sequences CGATGGGCCAGTCGATACGTCGATGCCGATGTCACGA

Patterns in Sequences • Substrings • Regular expressions (bb|[^b]{2}) • Partial orders • Directed Acyclic Graphs
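For the simplest pattern class, substrings, frequency can be sketched as the number of sequences that contain a given substring of length k. The sequences, `k`, and `min_support` below are hypothetical; the regular expression is the one shown on the slide, matched here against made-up strings.

```python
import re
from collections import Counter


def frequent_substrings(sequences, k, min_support):
    """Return {substring: count} for length-k substrings in >= min_support sequences."""
    counts = Counter()
    for seq in sequences:
        # Use a set so each sequence contributes at most once per substring.
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(seen)
    return {s: c for s, c in counts.items() if c >= min_support}


# Hypothetical short sequences (not from the slides).
sequences = ["CGATG", "TCGAT", "GATAC"]
frequent = frequent_substrings(sequences, k=3, min_support=2)
print(frequent)

# Support of the slide's regular expression over hypothetical strings.
pattern = re.compile(r"(bb|[^b]{2})")
regex_support = sum(1 for s in ["bb", "ab", "ca"] if pattern.fullmatch(s))
print(regex_support)
```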

Graphs

Patterns in Graphs

Rules [Figure: graph patterns annotated with frequencies (f: 7, f: 8, f: 5; f: 4, f: 4, f: 4) and rule confidences (0.5, 0.8, 0.57)]

Relational Databases

Patterns in RDBs • Queries • Query 1: Select L.drinker, V.bar From Likes L, Visits V Where V.drinker = L.drinker And L.beer = 'Duvel'

Patterns in RDBs • Query 2: Select L.drinker, V.bar From Likes L, Visits V, Serves S Where V.drinker = L.drinker And L.beer = 'Duvel' And S.bar = V.bar And S.beer = 'Duvel'

Patterns in RDBs • Association Rule: Query 1 => Query 2 • If a person who likes Duvel visits a bar, then that bar serves Duvel
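The rule Query 1 => Query 2 can be scored like an ordinary association rule: the confidence is the number of answers to Query 2 divided by the number of answers to Query 1. The sketch below runs both queries with Python's `sqlite3` module on a hypothetical in-memory database; the table contents, the drinker and bar names, and the use of `DISTINCT` for set semantics are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Likes (drinker TEXT, beer TEXT);
    CREATE TABLE Visits (drinker TEXT, bar TEXT);
    CREATE TABLE Serves (bar TEXT, beer TEXT);
""")
# Hypothetical data (not from the slides).
conn.executemany("INSERT INTO Likes VALUES (?, ?)",
                 [("ann", "Duvel"), ("bob", "Duvel"), ("cat", "Orval")])
conn.executemany("INSERT INTO Visits VALUES (?, ?)",
                 [("ann", "Bar A"), ("ann", "Bar B"), ("bob", "Bar A"), ("cat", "Bar B")])
conn.executemany("INSERT INTO Serves VALUES (?, ?)",
                 [("Bar A", "Duvel"), ("Bar B", "Orval")])

query1 = """
    SELECT DISTINCT L.drinker, V.bar
    FROM Likes L, Visits V
    WHERE V.drinker = L.drinker AND L.beer = 'Duvel'
"""
query2 = """
    SELECT DISTINCT L.drinker, V.bar
    FROM Likes L, Visits V, Serves S
    WHERE V.drinker = L.drinker AND L.beer = 'Duvel'
      AND S.bar = V.bar AND S.beer = 'Duvel'
"""

answers1 = set(conn.execute(query1).fetchall())
answers2 = set(conn.execute(query2).fetchall())

# Query 2 only adds conditions to Query 1, so its answers are a subset.
confidence = len(answers2) / len(answers1)
print(f"Query 1 => Query 2: {confidence:.2f}")
conn.close()
```

Here two of the three (drinker, bar) pairs from Query 1 also satisfy Query 2, so the rule holds with confidence 2/3 on this toy data.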
