I don’t need a title slide for a lecture

Long long ago, in a galaxy far, far away…

Outline • Background • Data mining • Association Rules • Classification • Clustering • Sequential Patterns • Sequence Similarity

Knowledge Discovery in Databases (KDD) • What is it? • Finding useful patterns in data • Why do we need it? • Terabytes of data • Impractical to manually search for patterns • Where does data mining come in?

Steps of a KDD process • Learn the application domain • Create a target dataset • Clean and preprocess data • Choose type of data mining • Pick an algorithm • Perform data mining • Interpret results

Databases vs. Data warehousing • Data warehousing • Storage of all data • Details or summaries • Metadata • Data cleaning, integration • Databases • Queries over current data • Persistent storage • Atomic updates

Databases vs. Data warehouses • Databases provide for: • Queries over current data • Persistent storage • Atomic updates • Data warehouses provide for: • Storage of all data • Metadata • Data cleaning, integration • Fast access to data

Who’s interested? • Databases - large amounts of data • Artificial Intelligence - search, planning, machine learning • Information Retrieval - searching for similar documents • Image Processing - finding similar images

Types of data mining • Association Rules • Classification • Clustering • Sequential Patterns • Sequence Similarity

Association rules • What are they? • Looking for common co-occurrence relationships (baskets that contain a tend to contain b) in basket data; these are correlations, not proof of causation • Where are they used? • Store layout • Catalog design • Customer segmentation

Association rules example Find all itemsets that occur at least twice, and the rules implied by each

Association rules metrics For a rule a ⇒ b • support = a and b occur together in at least s% of the n baskets • confidence = of all of the baskets containing a, at least c% also contain b
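The two metrics can be computed directly from basket data. The following is a minimal sketch; the function names and the basket contents are illustrative assumptions, not taken from the slides.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= set(b) for b in baskets) / len(baskets)


def confidence(baskets, a, b):
    """Of the baskets containing `a`, the fraction that also contain `b`.

    Raises ZeroDivisionError if no basket contains `a`.
    """
    return support(baskets, set(a) | set(b)) / support(baskets, a)


# Hypothetical basket data, invented for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

print(support(baskets, {"bread", "milk"}))       # 0.5 (2 of 4 baskets)
print(confidence(baskets, {"bread"}, {"milk"}))  # 2 of the 3 bread baskets
```

A rule a ⇒ b is reported only when both values meet the chosen thresholds s and c.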

Association rules algorithms • Focus on finding support for “itemsets” • The naïve method: • Combine large itemsets of size k-1 that differ only in the last item to form Candidates_k • Measure the support of each candidate from step 1 to form the large itemsets of size k • Increase k and repeat until no new large itemsets are found

Itemsets of size 1 Looking for support of 2

Finding candidate set 2

Finding candidate set 3

Apriori algorithm • An itemset cannot be a large itemset unless all of its subsets are large itemsets • Reduces number of candidate itemsets considered
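A compact sketch of the level-wise search with the Apriori pruning rule might look like the following. It assumes items are sortable (strings here), support is given as an absolute basket count, and the basket data is invented; it is meant to illustrate the join and prune steps, not to reproduce the implementation in the cited paper.

```python
from itertools import combinations


def apriori(baskets, min_count):
    """Return {itemset as sorted tuple: count} for itemsets in >= min_count baskets."""
    baskets = [frozenset(b) for b in baskets]

    # Level 1: count single items.
    counts = {}
    for b in baskets:
        for item in b:
            counts[(item,)] = counts.get((item,), 0) + 1
    large = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(large)

    k = 2
    while large:
        prev = sorted(large)
        candidates = []
        for i, x in enumerate(prev):
            for y in prev[i + 1:]:
                # Join step: same first k-2 items, different last item.
                if x[:-1] != y[:-1]:
                    break  # sorted order: no later y shares the prefix
                cand = x + (y[-1],)
                # Prune step: every (k-1)-subset must itself be large.
                if all(sub in large for sub in combinations(cand, k - 1)):
                    candidates.append(cand)
        # Count support of the surviving candidates.
        counts = {c: sum(set(c) <= b for b in baskets) for c in candidates}
        large = {c: n for c, n in counts.items() if n >= min_count}
        result.update(large)
        k += 1
    return result


# Hypothetical baskets, invented for illustration.
example = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
for itemset, count in sorted(apriori(example, 2).items()):
    print(itemset, count)
```

The prune step is what separates this from generating every possible combination: a candidate is never counted against the data if one of its subsets already failed the support threshold.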

Research directions • Online construction of rules • CARMA (Berkeley) • Pre-filtering the data • a posteriori (Limburgs Universitair Centrum)

Classification • What is it? • Rules that partition data into separate groups. • Where is it used? • to classify people as good/bad credit risks • weather prediction • fraud detection • Variation: best k of n (who to send flyers to)

Classification example

Possible solutions • Bayesian classification • Neural networks • Genetic algorithms • Decision Trees

Decision trees [Figure: example decision tree. Tests: “Salary < 25,000” and “Graduate education?”, each with no/yes branches. Leaves: Accept, Accept, Reject.]

Decision trees • Build the tree in two steps • Build a perfect tree on sample data • At each node, pick a “good” attribute • Split data according to attribute • Recursively build tree on children • Prune the tree • Minimum Description Length • Cost of encoding tree structure • Cost of encoding split attribute • Cost of encoding leaf data records
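The building step can be sketched as below. The slides do not say how a “good” attribute is chosen, so this sketch assumes Gini impurity as the criterion and categorical attributes; the function names and the training rows are invented, and the pruning step (MDL) is not shown.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_attribute(rows, labels, attributes):
    """Pick the attribute whose split gives the lowest weighted Gini impurity."""
    def weighted_gini(attr):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        return sum(len(g) / len(labels) * gini(g) for g in groups.values())
    return min(attributes, key=weighted_gini)


def build_tree(rows, labels, attributes):
    """Recursively grow an unpruned tree. Leaves are labels, nodes are tuples."""
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)
    remaining = [a for a in attributes if a != attr]
    children = {}
    for value in {row[attr] for row in rows}:
        subset = [(r, l) for r, l in zip(rows, labels) if r[attr] == value]
        sub_rows, sub_labels = zip(*subset)
        children[value] = build_tree(list(sub_rows), list(sub_labels), remaining)
    return (attr, children)


def classify(tree, row):
    """Walk from the root to a leaf, following the row's attribute values."""
    while isinstance(tree, tuple):
        attr, children = tree
        tree = children[row[attr]]
    return tree


# Hypothetical training data, invented for illustration.
rows = [
    {"low_salary": True, "graduate": False},
    {"low_salary": True, "graduate": True},
    {"low_salary": False, "graduate": False},
    {"low_salary": False, "graduate": True},
    {"low_salary": False, "graduate": False},
]
labels = ["reject", "accept", "accept", "accept", "accept"]
tree = build_tree(rows, labels, ["low_salary", "graduate"])
print(tree)
```

A tree grown this way fits the sample perfectly wherever the attributes allow it, which is why a separate pruning pass is needed before it is used on new data.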

Research directions • Integrate building and pruning • PUBLIC (Bell Labs) • Incremental Updates • BOAT (University of Wisconsin)

Clustering • What is it? • Given n points, separate them into k clusters • Where is it used? • Information retrieval - text classification • Identify similar web documents • Mapping the universe

Clustering example

Traditional clustering algorithms • Partitional • Determine k partitions that optimize a function • Common function is the “square error function” • Hierarchical • Each point starts as a cluster • Clusters are merged until k clusters remain
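Both ideas fit in a few lines. The sketch below assumes points are numeric tuples, measures cluster distance between centroids (one of several possible choices, not specified in the slides), and uses invented points.

```python
from itertools import combinations


def centroid(cluster):
    """Component-wise mean of the points in a cluster."""
    dims = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dims))


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def square_error(clusters):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(sq_dist(p, centroid(c)) for c in clusters for p in c)


def agglomerative(points, k):
    """Start with one cluster per point; merge the two closest until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: sq_dist(centroid(clusters[ij[0]]),
                                   centroid(clusters[ij[1]])),
        )
        clusters[i] += clusters[j]  # i < j, so deleting j keeps i valid
        del clusters[j]
    return clusters


# Hypothetical 2-D points, invented for illustration.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters = agglomerative(points, 2)
print(clusters)
print(square_error(clusters))
```

A partitional method would instead search for the k groups that minimize `square_error` directly; the hierarchical version here only uses it to report the quality of the result.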

Clustering difficulties

Research directions • Higher dimension subspace clustering • CLIQUE (IBM Almaden) • Incremental clustering • Incremental DBScan (University of Munich) • Remove problems with outliers • CURE (Bell Labs)

Sequential patterns • What is it? • Given a set of events, find frequently occurring patterns • Where is it used? • Analyzing basket data • Medical diagnosis

Sequential patterns example

AprioriAll • Create all large events that occur once • Map each subset to numbers • While there still are large itemsets: • Find candidate itemsets of length k • Find large itemsets of length k • Increase k
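The “find large itemsets of length k” step requires counting how many customer sequences contain a candidate pattern. A minimal sketch of that containment test follows; it assumes each sequence is an ordered list of itemsets, and the function names and data are invented for illustration.

```python
def contains(sequence, pattern):
    """True if `pattern` occurs in `sequence` in order, not necessarily contiguously.

    Each pattern element must be a subset of some transaction in the sequence,
    and the matched transactions must appear in the same order as the pattern.
    """
    pos = 0
    for itemset in sequence:
        if pos < len(pattern) and set(pattern[pos]) <= set(itemset):
            pos += 1
    return pos == len(pattern)


def sequence_support(sequences, pattern):
    """Number of customer sequences that contain the pattern."""
    return sum(contains(s, pattern) for s in sequences)


# Hypothetical customer sequences, invented for illustration.
sequences = [
    [{"tv"}, {"vcr"}, {"tapes"}],
    [{"tv", "stand"}, {"tapes"}],
    [{"vcr"}, {"tv"}],
]
print(sequence_support(sequences, [{"tv"}, {"tapes"}]))  # 2
print(sequence_support(sequences, [{"tv"}, {"vcr"}]))    # 1
```

Order matters here, unlike in plain association rules: the third customer bought both a TV and a VCR but in the opposite order, so the pattern tv then vcr is not supported by that sequence.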

Mapping the itemsets

Research directions • Time limitations • WINEPI (Helsinki/Microsoft) • Itemsets over multiple transactions • CSP (IBM Almaden)

Sequence Similarity • What is it? • Given a number of data sets, look for similar trends • Where is it used? • Find stocks with similar price movements • Find geological irregularities

Example • Are the two sequences similar?

Basic algorithm • Scale data • Match all gap-free sequences • Form pairs of large similar sequences • Find the longest common subsequence
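A much simplified sketch of the first and last steps (scaling, then longest common subsequence) is shown below. It skips the window matching and stitching steps, and it assumes min-max scaling and a tolerance `eps` for deciding when two values match; both choices and the sample series are assumptions for illustration.

```python
def scale(seq):
    """Rescale values to [0, 1] so series with different ranges are comparable."""
    lo, hi = min(seq), max(seq)
    if hi == lo:
        return [0.0 for _ in seq]
    return [(x - lo) / (hi - lo) for x in seq]


def lcs_length(a, b, eps=0.05):
    """Length of the longest common subsequence; values match if within eps."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if abs(x - y) <= eps:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


# Hypothetical price series with the same shape at different scales.
stock_a = [10, 20, 30, 20, 10]
stock_b = [100, 200, 300, 200, 100]
print(lcs_length(scale(stock_a), scale(stock_b)))  # 5
```

Two sequences can then be called similar when the common subsequence covers a large enough fraction of both; that threshold is another parameter the user has to choose.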

Research directions • Finding surprising patterns • IBM Almaden

Data mining directions • Sampling • Fractals • Pre-partitioning data • Making data mining more accessible • User defined aggregation support

References • General Data mining: http://www.almaden.ibm.com/cs/quest, www.bell-labs.com/project/serendip • Association Rules: “Fast Algorithms for Mining Association Rules”, Agrawal and Srikant; VLDB 94. • Classification: “PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning”, Rastogi and Shim; VLDB 98.

References (cont.) • Clustering: “CURE: An Efficient Clustering Algorithm for Large Databases”, Guha, Rastogi, and Shim; SIGMOD 98. • Sequential Patterns: “Mining Sequential Patterns: Generalizations and Performance Improvements”, Srikant and Agrawal; EDBT 96. • Similarity Search: “Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases”, Agrawal, Lin, Sawhney, and Shim; VLDB 95.
