Machine Learning & Data Mining, Part 1: The Basics Jaime Carbonell (with contributions from Tom Mitchell, Sebastian Thrun and Yiming Yang) Carnegie Mellon University jgc@cs.cmu.edu © 2008, Jaime G. Carbonell
Some Definitions (KBS vs ML) • Knowledge-Based Systems • Rules, procedures, semantic nets, Horn clauses • Inference: matching, inheritance, resolution • Acquisition: manually from human experts • Machine Learning • Data: tables, relations, attribute lists, … • Inference: rules, trees, decision functions, … • Acquisition: automated from data • Data Mining • Machine learning applied to large real problems • May be augmented with KBS © 2008, Jaime G. Carbonell
Ingredients for Machine Learning • “Historical” data (e.g. DB tables) • E.g. products (features, marketing, support, …) • E.g. competition (products, pricing, customers) • E.g. customers (demographics, purchases, …) • Objective function (to be predicted or optimized) • E.g. maximize revenue per customer • E.g. minimize manufacturing defects • Scalable machine learning method(s) • E.g. decision-tree induction, logistic regression • E.g. “active” learning, clustering © 2008, Jaime G. Carbonell
Sample ML/DM Applications I • Credit Scoring • Training: past applicant profiles, how much credit given, payback or default • Input: applicant profile (income, debts, …) • Objective: credit-score + max amount • Fraud Detection (e.g. credit-card transactions) • Training: past known legitimate & fraudulent transactions • Input: proposed transaction (loc, cust, $$, …) • Objective: approve/block decision © 2008, Jaime G. Carbonell
Sample ML/DM Applications II • Demographic Segmentation • Training: past customer profiles (age, gender, education, income,…) + product preferences • Input: new product description (features) • Objective: predict market segment affinity • Marketing/Advertisement Effectiveness • Training: past advertisement campaigns, demographic targets, product categories • Input: proposed advertisement campaign • Objective: project effectiveness (sales increase modulated by marketing cost) © 2008, Jaime G. Carbonell
Sample ML/DM Applications III • Product (or Part) Reliability • Training: past products/parts + specs at manufacturing + customer usage + maint rec • Input: new part + expected usage • Objective: mean-time-to-failure (replacement) • Manufacturing Tolerances • Training: past product/part manufacturing process, tolerances, inspections, … • Input: new part + expected usage • Objective: optimal manufacturing precision (minimize costs of failure + manufacture) © 2008, Jaime G. Carbonell
Sample ML/DM Applications IV • Mechanical Diagnosis • Training: past observed symptoms at (or prior to) breakdown + underlying cause • Input: current symptoms • Objective: predict cause of failure • Mechanical Repair • Training: cause of failure + product usage + repair (or PM) effectiveness • Input: new failure cause + product usage • Objective: recommended repair (or preventive maintenance operation) © 2008, Jaime G. Carbonell
Sample ML/DM Applications V • Billeting (job assignments) • Training: employee profiles, position profiles, employee performance in assigned position • Input: new employee or new position profile • Objective: predict performance in position • Text Mining & Routing (e.g. customer centers) • Training: electronic problem reports, customer requests + who should handle them • Input: new incoming texts • Objective: Assign category + route or reply © 2008, Jaime G. Carbonell
Preparing Historical Data • Extract a DB table with all the needed information • Select, join, project, aggregate, … • Filter out rows with significant missing data • Determine predictor attributes (columns) • Ask domain expert for relevant attributes, or • Start with all attributes and automatically sub-select most predictive ones (feature selection) • Determine to-be-predicted attribute (column) • Objective of the DM (number, decision, …) © 2008, Jaime G. Carbonell
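A minimal pandas sketch of this preparation step (not from the original slides); the file name and column names are hypothetical stand-ins for the credit table shown below:

```python
import pandas as pd

# Hypothetical extract of the credit history table (file and column names are illustrative).
df = pd.read_csv("credit_history.csv")

# Filter out rows with significant missing data (here: more than 2 missing fields).
df = df[df.isna().sum(axis=1) <= 2]

# Determine predictor attributes and the to-be-predicted (objective) attribute.
predictors = ["income_k", "job_now", "delinq_accts", "delinq_cycles",
              "owns_home", "credit_years"]
objective = "good_customer"

X = df[predictors]
y = df[objective]
```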
Sample DB Table
[predictor attributes] [objective]

Acct.  Income   Job   Tot num        Max num         Owns   Num credit  Good
numb.  in K/yr  now?  delinq accts   delinq cycles   home?  years       cust.?
-------------------------------------------------------------------------------
1001   85       Y     1              1               N      2           Y
1002   60       Y     3              2               Y      5           N
1003   ?        N     0              0               N      2           N
1004   95       Y     1              2               N      9           Y
1005   110      Y     1              6               Y      3           Y
1006   29       Y     2              1               Y      1           N
1007   88       Y     6              4               Y      8           N
1008   80       Y     0              0               Y      0           Y
1009   31       Y     1              1               N      1           Y
1011   ?        Y     ?              0               ?      7           Y
1012   75       ?     2              4               N      2           N
1013   20       N     1              1               N      3           N
1014   65       Y     1              3               Y      1           Y
1015   65       N     1              2               N      8           Y
1016   20       N     0              0               N      0           N
1017   75       Y     1              3               N      2           N
1018   40       N     0              0               Y      1           Y
© 2008, Jaime G. Carbonell
Supervised Learning on DB Table • Given: DB table • With identified predictor attributes x1, x2, …, xn • And objective attribute y • Find: prediction function y = f(x1, x2, …, xn) • Subject to: error minimization on data table M • Least-squares error, or L1-norm, or L∞-norm, … © 2008, Jaime G. Carbonell
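As a concrete illustration (not part of the original slides), one way to find such a prediction function under the least-squares criterion is an ordinary linear fit; the toy numbers below are made up:

```python
import numpy as np

# Toy data table: rows = examples, columns = predictor attributes x1, x2.
X = np.array([[85, 2], [60, 5], [95, 9], [29, 1], [80, 0]], dtype=float)
y = np.array([1, 0, 1, 0, 1], dtype=float)   # objective attribute

# Fit a linear prediction function f(x) = w . x + b by least-squares error,
# one of the error-minimization criteria listed above.
A = np.hstack([X, np.ones((X.shape[0], 1))])   # add an intercept column
w, *_ = np.linalg.lstsq(A, y, rcond=None)

def f(x):
    """Predict the objective attribute for a new predictor vector x."""
    return np.dot(w[:-1], x) + w[-1]

print(f([65, 3]))
```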
Popular Predictor Functions • Linear Discriminators (next slides) • k-Nearest-Neighbors (lecture #2) • Decision Trees (lecture #5) • Linear & Logistic Regression (lecture #4) • Probabilistic Methods (lecture #3) • Neural Networks • 2-layer → logistic regression • Multi-layer → difficult to scale up • Classification Rule Induction (in a few slides) © 2008, Jaime G. Carbonell
Linear Discriminator Functions [figure sequence: a two-class problem plotted in (x1, x2) space; several candidate separating lines are drawn, and the final panel shows a new point to be classified] © 2008, Jaime G. Carbonell
Issues with Linear Discriminators • What is the “best” placement of the discriminator? • Maximize the margin • In general → Support Vector Machines • What if there are k classes (k > 2)? • Must learn k different discriminators • Each discriminates class ki vs. all other classes kj≠i • What if the classes are not linearly separable? • Minimal-error (L1 or L2) placement (regression) • Give up on linear discriminators (→ other forms of f) © 2008, Jaime G. Carbonell
Maximizing the Margin [figure: two-class problem in (x1, x2) space; the separating line is placed to maximize the margin between the two classes] © 2008, Jaime G. Carbonell
Nearly-Separable Classes [figures: two-class problem in (x1, x2) space where no line separates the classes perfectly; a few points fall on the wrong side of any candidate separator] © 2008, Jaime G. Carbonell
Minimizing Training Error • Optimal placing of maximum-margin separator • Quadratic programming (Support Vector Machines) • Slack variables to accommodate training errors • Minimizing error metrics • Number of errors • Magnitude of error • Squared error • Chebyshev (L∞) norm © 2008, Jaime G. Carbonell
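A small scikit-learn sketch of a maximum-margin linear classifier with slack variables, added for illustration; the toy points and the specific choice of LinearSVC with C=1.0 are assumptions, not part of the slides:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Two-class toy problem in (x1, x2); the classes overlap slightly, so a few
# training errors must be absorbed by slack variables.
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6], [3, 3]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1, 0])

# C trades margin width against slack: large C tolerates few training errors,
# small C allows a wider margin with more tolerated errors.
clf = LinearSVC(C=1.0)
clf.fit(X, y)

print(clf.coef_, clf.intercept_)     # the learned linear discriminator
print(clf.predict([[4.0, 4.0]]))     # classify a new point
```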
Symbolic Rule Induction General idea • Labeled instances are DB tuples • Rules are generalized tuples • Generalization occurs at terms in tuples • Generalize on new E+ not correctly predicted • Specialize on new E- not correctly predicted • Ignore predicted E+ or E- (error-driven learning) © 2008, Jaime G. Carbonell
Symbolic Rule Induction (2) Example term generalizations • Constant => disjunction e.g. if small portion value set seen • Constant => least-common-generalizer class e.g. if large portion of value set seen • Number (or ordinal) => range e.g. if dense sequential sampling © 2008, Jaime G. Carbonell
Symbolic Rule Induction Example (1)

Age  Gender  Temp  b-cult  c-cult  loc  Skin    disease
65   M       101   +       .23     USA  normal  strep
25   M       102   +       .00     CAN  normal  strep
65   M       102   -       .78     BRA  rash    dengue
36   F        99   -       .19     USA  normal  *none*
11   F       103   +       .23     USA  flush   strep
88   F        98   +       .21     CAN  normal  *none*
39   F       100   +       .10     BRA  normal  strep
12   M       101   +       .00     BRA  normal  strep
15   F       101   +       .66     BRA  flush   dengue
20   F        98   +       .00     USA  rash    *none*
81   M        98   -       .99     BRA  rash    ec-12
87   F       100   -       .89     USA  rash    ec-12
12   F       102   +       ??      CAN  normal  strep
14   F       101   +       .33     USA  normal
67   M       102   +       .77     BRA  rash
Symbolic Rule Induction Example (2)

Candidate Rules:

IF   age    = [12,65]
     gender = *any*
     temp   = [100,103]
     b-cult = +
     c-cult = [.00,.23]
     loc    = *any*
     skin   = (normal,flush)
THEN: strep

IF   age    = (15,65)
     gender = *any*
     temp   = [101,102]
     b-cult = *any*
     c-cult = [.66,.78]
     loc    = BRA
     skin   = rash
THEN: dengue

Disclaimer: These are not real medical records or rules
Types of Data Mining • “Supervised” Methods (this DM course) • Training data has both predictor attributes & objective (to-be-predicted) attributes • Predict discrete classes → classification • Predict continuous values → regression • Duality: classification ↔ regression • “Unsupervised” Methods • Training data without objective attributes • Goal: find novel & interesting patterns • Cutting-edge research, fewer success stories • Semi-supervised methods: market-basket, … © 2008, Jaime G. Carbonell
Machine Learning Application Process in a Nutshell • Choose problem where • Prediction is valuable and non-trivial • Sufficient historical data is available • The objective is measurable (incl. in past data) • Prepare the data • Tabular form, clean, divide training & test sets • Select a Machine Learning algorithm • Human-readable decision fn → rules, trees, … • Robust with noisy data → kNN, logistic reg, … © 2008, Jaime G. Carbonell
Machine Learning Application Process in a Nutshell (2) • Train ML Algorithm on Training Data Set • Each ML method has different training process • Training uses both predictor & objective att’s • Run Trained ML Algorithm on Test Data Set • Test uses only predictor att’s & outputs predictions on objective attributes • Compare predictions vs actual objective att’s (see lecture 2 for evaluation metrics) • If accuracy ≥ threshold, done. • Else, try different ML algorithm, different parameter settings, get more training data, … © 2008, Jaime G. Carbonell
Sample DB Table (same)
[predictor attributes] [objective]

Acct.  Income   Job   Tot num        Max num         Owns   Num credit  Good
numb.  in K/yr  now?  delinq accts   delinq cycles   home?  years       cust.?
-------------------------------------------------------------------------------
1001   85       Y     1              1               N      2           Y
1002   60       Y     3              2               Y      5           N
1003   ?        N     0              0               N      2           N
1004   95       Y     1              2               N      9           Y
1005   100      Y     1              6               Y      3           Y
1006   29       Y     2              1               Y      1           N
1007   88       Y     6              4               Y      8           N
1008   80       Y     0              0               Y      0           Y
1009   31       Y     1              1               N      1           Y
1011   ?        Y     ?              0               ?      7           Y
1012   75       ?     2              4               N      2           N
1013   20       N     1              1               N      3           N
1014   65       Y     1              3               Y      1           Y
1015   65       N     1              2               N      8           Y
1016   20       N     0              0               N      0           N
1017   75       Y     1              3               N      2           N
1018   40       N     0              0               Y      10          Y
© 2008, Jaime G. Carbonell
Feature Vector Representation • Predictor-attribute rows in DB tables can be represented as vectors. For instance, the 2nd & 4th rows of predictor attributes in our DB table are: R2 = [60 Y 3 2 Y 5] R4 = [95 Y 1 2 N 9] Converting to numbers (Y = 1, N = 0), we get: R2 = [60 1 3 2 1 5] R4 = [95 1 1 2 0 9] © 2008, Jaime G. Carbonell
Vector Similarity • Suppose we have a new credit applicant R-new = [65 1 1 2 0 10] To which of R2 or R4 is she closer? R2 = [60 1 3 2 1 5] R4 = [95 1 1 2 0 9] • What should we use as a SIMILARITY METRIC? • Should we first NORMALIZE the vectors? • If not, the largest component will dominate © 2008, Jaime G. Carbonell
Normalizing Vector Attributes • Linear Normalization (often sufficient) • Find max & min values for each attribute • Normalize each attribute by: x′ = (x − min) / (max − min) • Apply to all vectors (historical + new), normalizing each attribute separately, as in the example on the next slide © 2008, Jaime G. Carbonell
Normalizing Full Vectors • Normalizing the new applicant vector: R-new = [65 1 1 2 0 10] → [.56 1 .17 .33 0 1] • And normalizing the two past customer vectors: R2 = [60 1 3 2 1 5] → [.50 1 .50 .33 1 .50] R4 = [95 1 1 2 0 9] → [.94 1 .17 .33 0 .90] • How about if some attributes are known to be more important, say salary (A1) & delinquencies (A3)? • Weight accordingly, e.g. ×2 for each • E.g., R-new-weighted: [1.12 1 .34 .33 0 1] © 2008, Jaime G. Carbonell
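A short sketch that reproduces these normalized and weighted vectors, assuming linear (min-max) normalization with the per-attribute minima and maxima read off the sample DB table; small differences from the slide values are rounding:

```python
import numpy as np

# Per-attribute minima and maxima taken from the sample table
# (income 20-100 K/yr, delinquent accounts 0-6, delinquency cycles 0-6,
#  credit years 0-10; Y/N attributes are already 0/1).
mins = np.array([20, 0, 0, 0, 0, 0], dtype=float)
maxs = np.array([100, 1, 6, 6, 1, 10], dtype=float)

def normalize(v):
    """Min-max normalize each attribute into [0, 1]."""
    return (np.asarray(v, dtype=float) - mins) / (maxs - mins)

r_new = normalize([65, 1, 1, 2, 0, 10])   # -> approx [.56 1 .17 .33 0 1]
r2    = normalize([60, 1, 3, 2, 1, 5])    # -> approx [.50 1 .50 .33 1 .50]
r4    = normalize([95, 1, 1, 2, 0, 9])    # -> approx [.94 1 .17 .33 0 .90]

# Optional attribute weighting (e.g. double salary and delinquent accounts):
weights = np.array([2, 1, 2, 1, 1, 1], dtype=float)
print(np.round(weights * r_new, 2))       # -> roughly [1.12 1 .34 .33 0 1]
```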
Similarity Functions (inverse dist) • Now that we have weighted, normalized vectors, how do we tell exactly their degree of similarity? • Inverse sum of differences (L1): sim(A, B) = 1 / Σi |ai − bi| • Inverse Euclidean distance (L2): sim(A, B) = 1 / sqrt( Σi (ai − bi)² ) © 2008, Jaime G. Carbonell
Similarity Functions (direct) • Dot-Product Similarity: sim(A, B) = A · B = Σi ai bi • Cosine Similarity (dot product of unit vectors): sim(A, B) = (A · B) / (|A| |B|) © 2008, Jaime G. Carbonell
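The slides showed these formulas as figures; the sketch below implements one common reading of each similarity measure (a small epsilon is added to the inverse-distance forms to avoid division by zero):

```python
import numpy as np

def sim_l1(a, b):
    """Inverse sum of absolute differences (L1); larger = more similar."""
    return 1.0 / (1e-9 + np.sum(np.abs(a - b)))

def sim_l2(a, b):
    """Inverse Euclidean distance (L2)."""
    return 1.0 / (1e-9 + np.sqrt(np.sum((a - b) ** 2)))

def sim_dot(a, b):
    """Dot-product similarity."""
    return float(np.dot(a, b))

def sim_cos(a, b):
    """Cosine similarity: dot product of unit-length vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

r_new = np.array([.56, 1, .17, .33, 0, 1])
r2    = np.array([.50, 1, .50, .33, 1, .50])
r4    = np.array([.94, 1, .17, .33, 0, .90])

# R-new comes out closer to R4 than to R2 under all four measures.
for sim in (sim_l1, sim_l2, sim_dot, sim_cos):
    print(sim.__name__, round(sim(r_new, r2), 3), round(sim(r_new, r4), 3))
```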
Alternative: Similarity Matrix for Non-Numeric Attributes

         tiny  little  small  medium  large  huge
tiny     1.0   0.8     0.7    0.5     0.2    0.0
little         1.0     0.9    0.7     0.3    0.1
small                  1.0    0.7     0.3    0.2
medium                        1.0     0.5    0.3
large                                 1.0    0.8
huge                                         1.0

• Diagonal must be 1.0
• Monotonicity property must hold
• Triangle inequality must hold
• Transitive property must hold
• Additivity/compositionality need not hold
© 2008, Jaime G. Carbonell
k-Nearest Neighbors Method • No explicit “training” phase • When new case arrives (vector of predictor att’s) • Find nearest k neighbors (max similarity) among previous cases (row vectors in DB table) • k neighbors vote for objective attribute • Unweighted majority vote, or • Similarity-weighted vote • Works for both discrete and continuous objective attributes © 2008, Jaime G. Carbonell
Similarity-Weighted Voting in kNN • If the Objective Attribute is Discrete: predict the class c maximizing Σi∈kNN sim(x, xi) · 1[yi = c], i.e. each neighbor votes for its class with weight equal to its similarity to the query • If the Objective Attribute is Continuous: predict the similarity-weighted average ŷ = Σi∈kNN sim(x, xi) · yi / Σi∈kNN sim(x, xi) © 2008, Jaime G. Carbonell
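A compact sketch of similarity-weighted kNN voting, assuming inverse Euclidean distance as the similarity function (any of the measures above could be substituted); the example query and history rows are the normalized vectors from the earlier slides:

```python
import numpy as np
from collections import defaultdict

def knn_predict(query, X, y, k=3, discrete=True):
    """Similarity-weighted k-nearest-neighbor prediction."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Similarity = inverse Euclidean distance (epsilon avoids division by zero).
    sims = 1.0 / (1e-9 + np.linalg.norm(X - np.asarray(query, dtype=float), axis=1))
    top = np.argsort(-sims)[:k]              # indices of the k most similar rows

    if discrete:
        votes = defaultdict(float)
        for i in top:
            votes[y[i]] += sims[i]           # each neighbor votes with weight = its similarity
        return max(votes, key=votes.get)
    # Continuous objective: similarity-weighted average of the neighbors' values.
    return float(np.sum(sims[top] * y[top].astype(float)) / np.sum(sims[top]))

history = [[.50, 1, .50, .33, 1, .50],   # R2 (not a good customer)
           [.94, 1, .17, .33, 0, .90]]   # R4 (good customer)
labels = ["N", "Y"]
print(knn_predict([.56, 1, .17, .33, 0, 1], history, labels, k=1))   # -> "Y"
```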
Applying kNN to Real Problems 1 • How does one choose the vector representation? • Easy: Vector = predictor attributes • What if attributes are not numerical? • Convert: (e.g. High=2, Med=1, Low=0), • Or, use similarity function over nominal values • E.g. equality or edit-distance on strings • How does one choose a distance function? • Hard: No magic recipe; try simpler ones first • This implies a need for systematic testing (discussed in coming slides) © 2008, Jaime G. Carbonell
Applying kNN to Real Problems 2 • How does one determine whether data should be normalized? • Normalization is usually a good idea • One can try kNN both ways to make sure • How does one determine “k” in kNN? • k is often determined empirically • Good start is: © 2008, Jaime G. Carbonell
Evaluating Machine Learning • Accuracy = Correct-Predictions/Total-Predictions • Simplest & most popular metric • But misleading on very-rare event prediction • Precision, recall & F1 • Borrowed from Information Retrieval • Applicable to very-rare event prediction • Correlation (between predicted & actual values) for continuous objective attributes • R2, kappa-coefficient, … © 2008, Jaime G. Carbonell
Sample Confusion Matrix [figure: confusion matrix of predicted diagnoses vs. true diagnoses] © 2008, Jaime G. Carbonell
Measuring Accuracy • Accuracy = correct/total • Error = incorrect/total • Hence: accuracy = 1 – error • For the diagnosis example: • A = 340/386 = 0.88, E = 1 – A = 0.12 © 2008, Jaime G. Carbonell
What About Rare Events? [figure: confusion matrix of predicted vs. true diagnoses, including the rare “shorted power supply” class] © 2008, Jaime G. Carbonell
Rare Event Evaluation • Accuracy for example = 0.88 • …but NO correct predictions for “shorted power supply”, 1 of 4 diagnoses • Alternative: Per-diagnosis (per-class) accuracy: • A(“shorted PS”) = 0/22 = 0 • A(“not plugged in”) = 160/184 = 0.87 © 2008, Jaime G. Carbonell
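A small sketch of overall vs. per-class accuracy computed from a confusion matrix; the counts below are illustrative only, chosen to be consistent with the totals quoted on these slides (340/386 overall, 160/184 for “not plugged in”, 0/22 for “shorted power supply”):

```python
import numpy as np

def accuracies(conf):
    """Overall and per-class accuracy from a confusion matrix
    (rows = true class, columns = predicted class)."""
    conf = np.asarray(conf, dtype=float)
    overall = np.trace(conf) / conf.sum()
    per_class = np.diag(conf) / conf.sum(axis=1)   # accuracy on each true class
    return overall, per_class

# Illustrative counts (the slide's actual confusion matrix was a figure):
conf = np.array([
    [160, 10, 10,  4],   # "not plugged in": 160 of 184 correct
    [  0, 90,  0,  0],
    [  0,  0, 90,  0],
    [ 12,  5,  5,  0],   # "shorted power supply": 0 of 22 correct
])
overall, per_class = accuracies(conf)
print(round(overall, 2), np.round(per_class, 2))   # -> 0.88, [0.87 1. 1. 0.]
```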
ROC Curves (ROC = Receiver Operating Characteristic) [figure: ROC curve] © 2008, Jaime G. Carbonell
ROC Curves (ROC=Receiver Operating Characteristic) Sensitivity = TP/(TP+FN) Specificity = TN/(TN+FP) © 2008, Jaime G. Carbonell
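A minimal scikit-learn example of computing an ROC curve from classifier scores; the labels and scores are made up for illustration:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Toy two-class example: true labels and a classifier's scores.
y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])
y_score = np.array([.1, .4, .35, .8, .2, .7, .6, .3, .9, .55])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("sensitivity (TPR):", np.round(tpr, 2))
print("1 - specificity (FPR):", np.round(fpr, 2))
print("area under the ROC curve:", round(auc(fpr, tpr), 2))
```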
If Plenty of Data, Evaluate with a Holdout Set [figure: the data is split into a training portion and a held-out portion; the model is trained on the first and its error measured on the second] • Often also used for parameter optimization © 2008, Jaime G. Carbonell
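A holdout-set evaluation sketch in scikit-learn; the dataset and the choice of logistic regression are stand-ins, since the slides' credit data is not available:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in dataset with a binary objective attribute.
X, y = load_breast_cancer(return_X_y=True)

# Hold out 30% of the rows; train only on the remaining 70%.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
holdout_error = 1 - accuracy_score(y_test, model.predict(X_test))
print("holdout error:", round(holdout_error, 3))
```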
Finite Cross-Validation Set • True error (true risk): error_D(h) = Pr_{x in D}[ f(x) ≠ h(x) ], where D = all data • Test error (empirical risk): error_S(h) = (1/m) · Σ_{x in S} δ( f(x) ≠ h(x) ), where S = test data and m = # test samples © 2008, Jaime G. Carbonell
Confidence Intervals If • S contains m examples, drawn independently • m ≥ 30 Then • With approximately 95% probability, the true error error_D lies in the interval error_S ± 1.96 · sqrt( error_S · (1 − error_S) / m ) © 2008, Jaime G. Carbonell
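A tiny helper that computes this approximate 95% interval; the example numbers (12% test error on 386 test cases) are illustrative:

```python
import math

def confidence_interval_95(test_error, m):
    """Approximate 95% confidence interval for the true error, given the
    error measured on m independently drawn test examples (m >= 30)."""
    half_width = 1.96 * math.sqrt(test_error * (1 - test_error) / m)
    return test_error - half_width, test_error + half_width

print(confidence_interval_95(0.12, 386))   # roughly (0.088, 0.152)
```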