Data Mining with Statistical Learning
Theodoros Evgeniou, Massachusetts Institute of Technology
Outline
• What is data mining?
  - Industry: why data mining?
• Data mining projects
  - E-support system
  - Detecting patterns in multimedia data
• Mathematics for complex data mining
  - Statistical Learning Theory
  - Data mining tools
• Concluding remarks
Part I
• What is data mining?
  - Industry: why data mining?
• Data mining projects
  - E-support system
  - Detecting patterns in multimedia data
• Mathematics for complex data mining
  - Statistical Learning Theory
  - Data mining tools
• Concluding remarks
What is Data Mining?
Goal: to classify or find trends in data in order to improve future decisions.
Examples:
- financial data modeling
- forecasting
- customer profiling
- fraud detection
Example: Fraud Detection
Historical transactions with known outcomes:
  Age: 24, Occ.: student,  Spend: $100,  Buy: …  → OK
  Age: 39, Occ.: engineer, Spend: $5000, Buy: …  → OK
  Age: 27, Occ.: ???????,  Spend: $400,  Buy: …  → FRAUD
  Age: 53, Occ.: small b., Spend: $1300, Buy: …  → OK
Data mining builds a fraud system that, given a new record (Age: .., Occ.: .., …), predicts FRAUD? or OK?
Example: Customer Profiling
Historical customer records with known outcomes:
  Age: 24, Occ.: student,  Spend: $100,  Buy: …  → NO
  Age: 39, Occ.: engineer, Spend: $5000, Buy: …  → NO
  Age: 27, Occ.: ???????,  Spend: $400,  Buy: …  → BUY
  Age: 53, Occ.: small b., Spend: $1300, Buy: …  → NO
Data mining builds a profiling system that, given a new customer (Age: .., Occ.: .., …), predicts BUY? or NO?
Data Mining: More Examples
• Sales analysis for inventory control
• Diagnostics (manufacturing, health, …)
• Information filtering/retrieval (e.g. emails, multimedia)
• E-Customer Relationship Management:
  - E-customer profiling (personalization, marketing, …)
  - E-customer support
Market Interest
• US 1999: $12b in credit card fraud, 50% of it on the internet (IDC) (also insurance, telecom, …)
  Fraud detection using data mining: HNC/eHNC, ~$500m M.C. in 1999, ~$2b M.C. in 2000
• 20% of e-companies use internet customer info, 70% by 2001 (Forrester R.)
  Personalization, targeted marketing, collaborative filtering … (privacy?): engage, netperceptions, …: ~$10b
• Only 30% of Fortune 500 companies using email respond to it on time (IDC)
  Email filtering/response software: $20M now, $350M in 2003 (IDC): Kana, eGain, aptex, …: ~$10b
Part II
• What is data mining?
  - Industry: why data mining?
• Data mining projects
  - E-support system
  - Detecting patterns in multimedia data
• Mathematics for complex data mining
  - Statistical Learning Theory
  - Data mining tools
• Concluding remarks
An E-Support System
Companies need to respond efficiently and accurately to customers' emails. How can they manage this when they receive thousands of emails a day?
1 trillion emails/year in 1999, 5 trillion by 2003 (IDC)
An Email Classification System
Historical emails with known categories:
  "…bought a piece of… some broken part…"       → PROBLEM
  "…would like to return… not satisfied with…"  → PROBLEM
  "…send a receipt… previous payment…"          → ACCOUNT
  "…request a copy of the report… balance of…"  → ACCOUNT
Data mining builds an e-support system that routes a new email to PROBLEM or ACCOUNT.
An Image Mining System
How can we detect objects in an image?
An Image Mining System
[Diagram: example images feed a data mining process that builds an image system; a new image goes into the system, which outputs labels such as "Pedestrian" or "Car".]
General System Architecture
[Diagram: example data feed a data mining process that builds a system; new data go into the system, which outputs Decision A or Decision B.]
A Data Mining Process
Data exist in many different forms (text, images, web clicks, …).
STEP 1: Represent data in numerical form (feature vectors).
  Raw data (text, images) → feature extraction (problem specific) → feature vector, e.g. (12, 3, …)
A Data Mining Process (cont.)
STEP 2: Statistical analysis of the numerical data (feature vectors): regression, clustering, classification.
Step 1: Text Representation
Example: "…drive..far..see.. later… left.. drive.." → (2, 0, 1, 1, 1, 1, …)
What is the representation?
• Bag of words
• Bag of combinations of words
• Natural language processing features
(Yang, McCallum, Joachims, …)
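To make the bag-of-words option concrete, here is a minimal sketch (not from the original deck) using scikit-learn's CountVectorizer; the two example emails are invented:

```python
# A minimal bag-of-words sketch with scikit-learn (illustrative;
# the example emails are invented, not from the original slides).
from sklearn.feature_extraction.text import CountVectorizer

emails = [
    "I would like to return a broken part",
    "Please send a receipt for my previous payment",
]

vectorizer = CountVectorizer()          # one feature per distinct word
X = vectorizer.fit_transform(emails)    # sparse matrix: documents x vocabulary

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # word counts per email
```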
Step 1: Image Representation
Example: an image → (12, 92, 74, 0, 12, …, 124)
What is the representation?
• Pixel values
• Projections on filters (wavelets)
• PCA
• Feature selection
(Papageorgiou et al., 1999; Evgeniou et al., 2000)
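As a hedged sketch of the PCA option, the snippet below flattens synthetic "images" into pixel-value vectors and projects them onto their principal components:

```python
# Illustrative sketch: flatten images to pixel vectors, then reduce
# dimensionality with PCA (the random images stand in for real data).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(100, 16, 16))  # 100 fake 16x16 grayscale images

X = images.reshape(100, -1).astype(float)  # pixel-value feature vectors (100 x 256)

pca = PCA(n_components=10)                 # keep the 10 strongest components
features = pca.fit_transform(X)            # compact feature vectors (100 x 10)
print(features.shape)
```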
Step 2: "Learn" a Decision Surface
[Plot: labeled feature vectors, e.g. (4,24,…), (7,33,…), (1,13,…), (41,11,…), (4,71,…), (92,10,…), (19,3,…), separated in feature space by a decision surface.]
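A minimal illustration of this step, assuming made-up 2-D feature vectors and labels, and using logistic regression as one way to fit a linear decision surface:

```python
# Hedged sketch of step 2: fit a linear decision surface to labeled
# feature vectors (the points and labels are made up).
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[4, 24], [7, 33], [1, 13], [41, 11], [92, 10], [19, 3]])
y = np.array([1, 1, 1, 0, 0, 0])        # two classes

clf = LogisticRegression().fit(X, y)    # learns w, b for the surface w.x + b = 0
print(clf.predict([[10, 20]]))          # classify a new feature vector
```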
Learning Methods
Other approaches:
• Bayesian methods
• Nearest Neighbor
• Neural Networks
• Decision Trees
• Expert systems
New approach:
• The Statistical Learning approach
Part III
• What is data mining?
  - Industry: why data mining?
• Data mining projects
  - E-support system
  - Detecting patterns in multimedia data
• Mathematics for complex data mining
  - Statistical Learning Theory
  - Data mining tools
• Concluding remarks
Roadmap
• Formal setting of learning from examples
• Standard learning methods
• The Statistical Learning approach
• Tools and contributions
Formal Setting of the Problem
Given a set of ℓ examples (data)

    D = {(x_1, y_1), (x_2, y_2), …, (x_ℓ, y_ℓ)}

Question: find a function f such that f(x) is a good predictor of y for a future input x.
The Ideal Solution
What is a "good predictor"? If data (x, y) appear according to an (unknown) probability distribution P(x, y), then we want our solution to minimize the expected error

    I[f] = ∫ V(y, f(x)) dP(x, y)

V(y, f(x)): loss function measuring the "cost" of predicting f(x) instead of y (e.g. (y − f(x))²).
(I) Empirical Error Minimization
We only have example data, so go for the obvious: minimize the empirical error

    I_emp[f] = (1/ℓ) Σ_{i=1..ℓ} V(y_i, f(x_i))

…and hope that the solution also has a small expected error.
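A small worked illustration (invented data, squared loss) of computing the empirical error of a candidate predictor:

```python
# Illustrative: empirical error of a candidate predictor under squared
# loss, I_emp[f] = (1/l) * sum of (y_i - f(x_i))^2 (data made up).
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 0.9, 2.2, 2.8])

f = lambda t: t                      # candidate predictor f(x) = x
empirical_error = np.mean((y - f(x)) ** 2)
print(empirical_error)
```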
(II) Function Space
Where do we choose f from?
• f can be any constant function?
• f can be any polynomial?
Standard Learning Methods
A standard way of building learning methods:
• Step 1: define a function space H
• Step 2: define the loss function V(y, f(x))
• Step 3: find the f in H that minimizes the empirical error
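A toy sketch of the three steps, assuming H is the space of degree-3 polynomials and V is the squared loss; the data are synthetic:

```python
# Sketch of the standard recipe (assumptions: H = degree-3 polynomials,
# V = squared loss; synthetic data).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)

# Step 1: H = polynomials of degree 3.  Step 2: V(y, f(x)) = (y - f(x))^2.
# Step 3: least squares finds the f in H minimizing the empirical error.
coeffs = np.polyfit(x, y, deg=3)
f = np.poly1d(coeffs)

empirical_error = np.mean((y - f(x)) ** 2)
print(empirical_error)
```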
Standard Learning Methods
A standard way of building learning methods:
• Step 1: define a function space H (How?)
• Step 2: define the loss function V(y, f(x)) (Ok?)
• Step 3: find the f in H that minimizes the empirical error (Enough?)
The Central Questions
• How do we choose the function space H?
• What if there are many solutions in H minimizing the empirical error (an ill-posed problem)?
• Does a function f that minimizes the empirical error in H also minimize the expected error?
Statistical Learning Approach (Vapnik, Chervonenkis, 1968– )
• Choose the function space H according to its complexity. Formal measures are provided (e.g. the VC-dimension).
• With appropriate control of the complexity of the function space, the problem becomes well-posed: there is a unique solution.
• The theory provides necessary and sufficient conditions for the uniform convergence of the empirical error to the expected error in a function space, in terms of the complexity of the space.
Important Bound (Vapnik, Chervonenkis, 1971)
The theory provides bounds on the distance between the expected and the empirical error. In the standard form, with probability at least 1 − η,

    I[f] ≤ I_emp[f] + sqrt( ( h (ln(2ℓ/h) + 1) − ln(η/4) ) / ℓ )

where h is the VC-dimension of the function space H. These bounds can be used to choose the function space H.
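As an illustrative plug-in of numbers (invented, not from the slides), take h = 10, ℓ = 1000, η = 0.05:

```latex
% Illustrative numbers (not from the slides): h = 10, l = 1000, eta = 0.05.
\[
\sqrt{\frac{h\left(\ln\tfrac{2\ell}{h} + 1\right) - \ln\tfrac{\eta}{4}}{\ell}}
= \sqrt{\frac{10(\ln 200 + 1) + \ln 80}{1000}}
\approx \sqrt{0.067} \approx 0.26
\]
% With probability 0.95, the expected error exceeds the empirical error
% by at most about 0.26; for fixed h the gap shrinks as l grows.
```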
Using the Bound
[Plot: two fits to the same data; a function space that is too simple underfits, one that is too complex overfits.]
Using the Bound
[Plot: error vs. complexity h. The empirical error decreases as h grows while the expected error first decreases and then increases; the bound picks the optimal complexity h_opt at the minimum of the expected error.]
Standard Approaches
A standard way of building learning methods:
• Step 1: define a function space H (How?)
• Step 2: define the loss function V(y, f(x)) (Ok?)
• Step 3: find the f in H that minimizes the empirical error (Enough?)
The Statistical Learning Approach
The new way of building learning methods:

    Minimize: Empirical Error + Complexity

by trying many function spaces H.
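One familiar instance of this recipe, offered only as a sketch, is ridge regression, which trades empirical error against the size of the coefficients; the data are synthetic and the trade-off weight alpha is a free choice:

```python
# Hedged sketch of "empirical error + complexity": ridge regression
# penalizes the size of the coefficients (synthetic data).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + 0.1 * rng.standard_normal(50)

# Minimizes  sum_i (y_i - w.x_i)^2  +  alpha * ||w||^2
model = Ridge(alpha=1.0).fit(X, y)
print(model.coef_)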
The Statistical Learning Approach
Solves the problems of the standard methods:
• Step 1: define a function space H
• Step 2: define the loss function V(y, f(x))
• Step 3: find the f in H that minimizes the empirical error
Statistical Learning Approach
What if we restrict the set of lines, i.e. the function space (and therefore control complexity)?
Benefits of Statistical Learning
• The problem becomes well-posed
• The solution has a smaller expected error
Empirical Error vs. Complexity
What if we further restrict complexity?
Benefits of Statistical Learning
Avoid overfitting (important for high-dimensional data!)
Support Vector Machines (Vapnik, Cortes, 1995)
Minimize: Empirical Error + Complexity

    (1/ℓ) Σ_{i=1..ℓ} |1 − y_i f(x_i)|₊  +  λ ‖f‖²
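A hedged sketch of training an SVM with scikit-learn on toy data; the parameter C plays roughly the role of 1/λ in the trade-off above:

```python
# Illustrative SVM sketch (toy data; C ~ 1/lambda balances empirical
# error against complexity).
import numpy as np
from sklearn.svm import SVC

X = np.array([[4, 24], [7, 33], [1, 13], [41, 11], [92, 10], [19, 3]])
y = np.array([1, 1, 1, -1, -1, -1])

svm = SVC(kernel="linear", C=1.0).fit(X, y)  # trained via quadratic programming
print(svm.support_vectors_)                  # the support vectors found
print(svm.predict([[10, 20]]))
```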
Non-linear Function Spaces
Generally f can be any "linear" function in some very complex feature space:

    f(x) = Σ_n w_n φ_n(x)
Second Order Polynomials
Using more complex features (second order features), e.g.

    φ(x) = (x₁², x₂², …, x₁x₂, x₁x₃, …, x₁, x₂, …)
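For illustration, scikit-learn's PolynomialFeatures computes exactly this kind of second-order feature map (the input vectors are toy values):

```python
# Illustrative: second-order features with scikit-learn (toy vectors).
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0], [3.0, 4.0]])

poly = PolynomialFeatures(degree=2)  # features: 1, x1, x2, x1^2, x1*x2, x2^2
print(poly.fit_transform(X))
```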
Reproducing Kernel Hilbert Space
RKHS: a space of linear functions in a feature space satisfying some conditions (functional analysis…)
Standard examples include the polynomial kernel K(x, y) = (1 + x·y)^d and the Gaussian kernel K(x, y) = exp(−‖x − y‖² / σ²).
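A small sketch of these two kernels written directly in NumPy; the degree d and width sigma are free parameters:

```python
# Hedged sketch: two standard kernels in NumPy (d and sigma are
# free parameters chosen for illustration).
import numpy as np

def polynomial_kernel(x, y, d=2):
    """K(x, y) = (1 + x.y)^d"""
    return (1.0 + np.dot(x, y)) ** d

def gaussian_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / sigma^2)"""
    return np.exp(-np.sum((x - y) ** 2) / sigma**2)

x, y = np.array([1.0, 2.0]), np.array([0.5, 1.5])
print(polynomial_kernel(x, y), gaussian_kernel(x, y))
```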
Support Vector Machines: General
Minimize: Empirical Error + Complexity

    (1/ℓ) Σ_{i=1..ℓ} |1 − y_i f(x_i)|₊  +  λ ‖f‖²_K

Training: Quadratic Programming
Kernel Machines
Minimize: Empirical Error + Complexity

    (1/ℓ) Σ_{i=1..ℓ} V(y_i, f(x_i))  +  λ ‖f‖²_K

Choices to make: the loss V, the kernel K, and the regularization parameter λ.
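As one concrete, purely illustrative instance of these choices, kernel ridge regression takes V = squared loss with a Gaussian (RBF) kernel; alpha plays the role of λ, and the data below are synthetic:

```python
# Illustrative kernel machine: kernel ridge regression, choosing
# V = squared loss, K = RBF kernel, and alpha ~ lambda (synthetic data).
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(40)

model = KernelRidge(kernel="rbf", alpha=0.1, gamma=0.5).fit(X, y)
print(model.predict([[0.0]]))
```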