Machine Learning & Data Mining, Part 1: The Basics Jaime Carbonell (with contributions from Tom Mitchell, Sebastian Thrun and Yiming Yang) Carnegie Mellon University jgc@cs.cmu.edu © 2008, Jaime G. Carbonell
Some Definitions (KBS vs ML) • Knowledge-Based Systems • Rules, procedures, semantic nets, Horn clauses • Inference: matching, inheritance, resolution • Acquisition: manually from human experts • Machine Learning • Data: tables, relations, attribute lists, … • Inference: rules, trees, decision functions, … • Acquisition: automated from data • Data Mining • Machine learning applied to large real problems • May be augmented with KBS © 2008, Jaime G. Carbonell
Ingredients for Machine Learning • “Historical” data (e.g. DB tables) • E.g. products (features, marketing, support, …) • E.g. competition (products, pricing, customers) • E.g. customers (demographics, purchases, …) • Objective function (to be predicted or optimized) • E.g. maximize revenue per customer • E.g. minimize manufacturing defects • Scalable machine learning method(s) • E.g. decision-tree induction, logistic regression • E.g. “active” learning, clustering © 2008, Jaime G. Carbonell
Sample ML/DM Applications I • Credit Scoring • Training: past applicant profiles, how much credit given, payback or default • Input: applicant profile (income, debts, …) • Objective: credit-score + max amount • Fraud Detection (e.g. credit-card transactions) • Training: past known legitimate & fraudulent transactions • Input: proposed transaction (loc, cust, $$, …) • Objective: approve/block decision © 2008, Jaime G. Carbonell
Sample ML/DM Applications II • Demographic Segmentation • Training: past customer profiles (age, gender, education, income,…) + product preferences • Input: new product description (features) • Objective: predict market segment affinity • Marketing/Advertisement Effectiveness • Training: past advertisement campaigns, demographic targets, product categories • Input: proposed advertisement campaign • Objective: project effectiveness (sales increase modulated by marketing cost) © 2008, Jaime G. Carbonell
Sample ML/DM Applications III • Product (or Part) Reliability • Training: past products/parts + specs at manufacturing + customer usage + maint rec • Input: new part + expected usage • Objective: mean-time-to-failure (replacement) • Manufacturing Tolerances • Training: past product/part manufacturing process, tolerances, inspections, … • Input: new part + expected usage • Objective: optimal manufacturing precision (minimize costs of failure + manufacture) © 2008, Jaime G. Carbonell
Sample ML/DM Applications IV • Mechanical Diagnosis • Training: past observed symptoms at (or prior to) breakdown + underlying cause • Input: current symptoms • Objective: predict cause of failure • Mechanical Repair • Training: cause of failure + product usage + repair (or PM) effectiveness • Input: new failure cause + product usage • Objective: recommended repair (or preventive maintenance operation) © 2008, Jaime G. Carbonell
Sample ML/DM Applications V • Billeting (job assignments) • Training: employee profiles, position profiles, employee performance in assigned position • Input: new employee or new position profile • Objective: predict performance in position • Text Mining & Routing (e.g. customer centers) • Training: electronic problem reports, customer requests + who should handle them • Input: new incoming texts • Objective: Assign category + route or reply © 2008, Jaime G. Carbonell
Preparing Historical Data • Extract a DB table with all the needed information • Select, join, project, aggregate, … • Filter out rows with significant missing data • Determine predictor attributes (columns) • Ask domain expert for relevant attributes, or • Start with all attributes and automatically sub-select most predictive ones (feature selection) • Determine to-be-predicted attribute (column) • Objective of the DM (number, decision, …) © 2008, Jaime G. Carbonell
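A minimal pandas sketch of this preparation step (not from the original slides); the file name and column names are hypothetical stand-ins for the credit table shown below:

```python
import pandas as pd

# Hypothetical extract of the credit history table (file and column names are illustrative).
df = pd.read_csv("credit_history.csv")

# Filter out rows with significant missing data (here: more than 2 missing fields).
df = df[df.isna().sum(axis=1) <= 2]

# Determine predictor attributes and the to-be-predicted (objective) attribute.
predictors = ["income_k", "job_now", "delinq_accts", "delinq_cycles",
              "owns_home", "credit_years"]
objective = "good_customer"

X = df[predictors]
y = df[objective]
```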
Sample DB Table
[predictor attributes] [objective]

Acct.  Income   Job   Tot num        Max num         Owns   Num credit  Good
numb.  in K/yr  now?  delinq accts   delinq cycles   home?  years       cust.?
-------------------------------------------------------------------------------
1001   85       Y     1              1               N      2           Y
1002   60       Y     3              2               Y      5           N
1003   ?        N     0              0               N      2           N
1004   95       Y     1              2               N      9           Y
1005   110      Y     1              6               Y      3           Y
1006   29       Y     2              1               Y      1           N
1007   88       Y     6              4               Y      8           N
1008   80       Y     0              0               Y      0           Y
1009   31       Y     1              1               N      1           Y
1011   ?        Y     ?              0               ?      7           Y
1012   75       ?     2              4               N      2           N
1013   20       N     1              1               N      3           N
1014   65       Y     1              3               Y      1           Y
1015   65       N     1              2               N      8           Y
1016   20       N     0              0               N      0           N
1017   75       Y     1              3               N      2           N
1018   40       N     0              0               Y      1           Y
© 2008, Jaime G. Carbonell
Supervised Learning on DB Table • Given: DB table • With identified predictor attributes x1, x2, …, xn • And objective attribute y • Find: prediction function y = f(x1, x2, …, xn) • Subject to: error minimization on data table M • Least-squares error, or L1-norm, or L∞-norm, … © 2008, Jaime G. Carbonell
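As a concrete illustration (not part of the original slides), one way to find such a prediction function under the least-squares criterion is an ordinary linear fit; the toy numbers below are made up:

```python
import numpy as np

# Toy data table: rows = examples, columns = predictor attributes x1, x2.
X = np.array([[85, 2], [60, 5], [95, 9], [29, 1], [80, 0]], dtype=float)
y = np.array([1, 0, 1, 0, 1], dtype=float)   # objective attribute

# Fit a linear prediction function f(x) = w . x + b by least-squares error,
# one of the error-minimization criteria listed above.
A = np.hstack([X, np.ones((X.shape[0], 1))])   # add an intercept column
w, *_ = np.linalg.lstsq(A, y, rcond=None)

def f(x):
    """Predict the objective attribute for a new predictor vector x."""
    return np.dot(w[:-1], x) + w[-1]

print(f([65, 3]))
```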
Popular Predictor Functions • Linear Discriminators (next slides) • k-Nearest-Neighbors (lecture #2) • Decision Trees (lecture #5) • Linear & Logistic Regression (lecture #4) • Probabilistic Methods (lecture #3) • Neural Networks • 2-layer → logistic regression • Multi-layer → difficult to scale up • Classification Rule Induction (in a few slides) © 2008, Jaime G. Carbonell
Linear Discriminator Functions [figure sequence: a two-class problem plotted in (x1, x2) space; several candidate separating lines are drawn, and the final panel shows a new point to be classified] © 2008, Jaime G. Carbonell
Issues with Linear Discriminators • What is the “best” placement of the discriminator? • Maximize the margin • In general → Support Vector Machines • What if there are k classes (k > 2)? • Must learn k different discriminators • Each discriminates class ki vs. all other classes kj≠i • What if the classes are not linearly separable? • Minimal-error (L1 or L2) placement (regression) • Give up on linear discriminators (→ other forms of f) © 2008, Jaime G. Carbonell
Maximizing the Margin [figure: two-class problem in (x1, x2) space; the separating line is placed to maximize the margin between the two classes] © 2008, Jaime G. Carbonell
Nearly-Separable Classes [figures: two-class problem in (x1, x2) space where no line separates the classes perfectly; a few points fall on the wrong side of any candidate separator] © 2008, Jaime G. Carbonell
Minimizing Training Error • Optimal placing of maximum-margin separator • Quadratic programming (Support Vector Machines) • Slack variables to accommodate training errors • Minimizing error metrics • Number of errors • Magnitude of error • Squared error • Chebyshev (L∞) norm © 2008, Jaime G. Carbonell
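A small scikit-learn sketch of a maximum-margin linear classifier with slack variables, added for illustration; the toy points and the specific choice of LinearSVC with C=1.0 are assumptions, not part of the slides:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Two-class toy problem in (x1, x2); the classes overlap slightly, so a few
# training errors must be absorbed by slack variables.
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6], [3, 3]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1, 0])

# C trades margin width against slack: large C tolerates few training errors,
# small C allows a wider margin with more tolerated errors.
clf = LinearSVC(C=1.0)
clf.fit(X, y)

print(clf.coef_, clf.intercept_)     # the learned linear discriminator
print(clf.predict([[4.0, 4.0]]))     # classify a new point
```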
Symbolic Rule Induction General idea • Labeled instances are DB tuples • Rules are generalized tuples • Generalization occurs at terms in tuples • Generalize on new E+ not correctly predicted • Specialize on new E- not correctly predicted • Ignore predicted E+ or E- (error-driven learning) © 2008, Jaime G. Carbonell
Symbolic Rule Induction (2) Example term generalizations • Constant => disjunction e.g. if small portion value set seen • Constant => least-common-generalizer class e.g. if large portion of value set seen • Number (or ordinal) => range e.g. if dense sequential sampling © 2008, Jaime G. Carbonell
Symbolic Rule Induction Example (1)

Age  Gender  Temp  b-cult  c-cult  loc  Skin    disease
65   M       101   +       .23     USA  normal  strep
25   M       102   +       .00     CAN  normal  strep
65   M       102   -       .78     BRA  rash    dengue
36   F        99   -       .19     USA  normal  *none*
11   F       103   +       .23     USA  flush   strep
88   F        98   +       .21     CAN  normal  *none*
39   F       100   +       .10     BRA  normal  strep
12   M       101   +       .00     BRA  normal  strep
15   F       101   +       .66     BRA  flush   dengue
20   F        98   +       .00     USA  rash    *none*
81   M        98   -       .99     BRA  rash    ec-12
87   F       100   -       .89     USA  rash    ec-12
12   F       102   +       ??      CAN  normal  strep
14   F       101   +       .33     USA  normal
67   M       102   +       .77     BRA  rash
Symbolic Rule Induction Example (2)

Candidate Rules:

IF   age    = [12,65]
     gender = *any*
     temp   = [100,103]
     b-cult = +
     c-cult = [.00,.23]
     loc    = *any*
     skin   = (normal,flush)
THEN: strep

IF   age    = (15,65)
     gender = *any*
     temp   = [101,102]
     b-cult = *any*
     c-cult = [.66,.78]
     loc    = BRA
     skin   = rash
THEN: dengue

Disclaimer: These are not real medical records or rules
Types of Data Mining • “Supervised” Methods (this DM course) • Training data has both predictor attributes & objective (to-be-predicted) attributes • Predict discrete classes → classification • Predict continuous values → regression • Duality: classification ↔ regression • “Unsupervised” Methods • Training data without objective attributes • Goal: find novel & interesting patterns • Cutting-edge research, fewer success stories • Semi-supervised methods: market-basket, … © 2008, Jaime G. Carbonell
Machine Learning Application Process in a Nutshell • Choose problem where • Prediction is valuable and non-trivial • Sufficient historical data is available • The objective is measurable (incl. in past data) • Prepare the data • Tabular form, clean, divide training & test sets • Select a Machine Learning algorithm • Human-readable decision fn → rules, trees, … • Robust with noisy data → kNN, logistic reg, … © 2008, Jaime G. Carbonell
Machine Learning Application Process in a Nutshell (2) • Train ML Algorithm on Training Data Set • Each ML method has different training process • Training uses both predictor & objective att’s • Run Trained ML Algorithm on Test Data Set • Test uses only predictor att’s & outputs predictions on objective attributes • Compare predictions vs actual objective att’s (see lecture 2 for evaluation metrics) • If accuracy ≥ threshold, done. • Else, try different ML algorithm, different parameter settings, get more training data, … © 2008, Jaime G. Carbonell
Sample DB Table (same)
[predictor attributes] [objective]

Acct.  Income   Job   Tot num        Max num         Owns   Num credit  Good
numb.  in K/yr  now?  delinq accts   delinq cycles   home?  years       cust.?
-------------------------------------------------------------------------------
1001   85       Y     1              1               N      2           Y
1002   60       Y     3              2               Y      5           N
1003   ?        N     0              0               N      2           N
1004   95       Y     1              2               N      9           Y
1005   100      Y     1              6               Y      3           Y
1006   29       Y     2              1               Y      1           N
1007   88       Y     6              4               Y      8           N
1008   80       Y     0              0               Y      0           Y
1009   31       Y     1              1               N      1           Y
1011   ?        Y     ?              0               ?      7           Y
1012   75       ?     2              4               N      2           N
1013   20       N     1              1               N      3           N
1014   65       Y     1              3               Y      1           Y
1015   65       N     1              2               N      8           Y
1016   20       N     0              0               N      0           N
1017   75       Y     1              3               N      2           N
1018   40       N     0              0               Y      10          Y
© 2008, Jaime G. Carbonell
Feature Vector Representation • Predictor-attribute rows in DB tables can be represented as vectors. For instance, the 2nd & 4th rows of predictor attributes in our DB table are: R2 = [60 Y 3 2 Y 5] R4 = [95 Y 1 2 N 9] Converting to numbers (Y = 1, N = 0), we get: R2 = [60 1 3 2 1 5] R4 = [95 1 1 2 0 9] © 2008, Jaime G. Carbonell
Vector Similarity • Suppose we have a new credit applicant R-new = [65 1 1 2 0 10] To which of R2 or R4 is she closer? R2 = [60 1 3 2 1 5] R4 = [95 1 1 2 0 9] • What should we use as a SIMILARITY METRIC? • Should we first NORMALIZE the vectors? • If not, the largest component will dominate © 2008, Jaime G. Carbonell
Normalizing Vector Attributes • Linear Normalization (often sufficient) • Find max & min values for each attribute • Normalize each attribute by: x′ = (x − min) / (max − min) • Apply to all vectors (historical + new), normalizing each attribute separately, as in the example on the next slide © 2008, Jaime G. Carbonell
Normalizing Full Vectors • Normalizing the new applicant vector: R-new = [65 1 1 2 0 10] → [.56 1 .17 .33 0 1] • And normalizing the two past customer vectors: R2 = [60 1 3 2 1 5] → [.50 1 .50 .33 1 .50] R4 = [95 1 1 2 0 9] → [.94 1 .17 .33 0 .90] • How about if some attributes are known to be more important, say salary (A1) & delinquencies (A3)? • Weight accordingly, e.g. ×2 for each • E.g., R-new-weighted: [1.12 1 .34 .33 0 1] © 2008, Jaime G. Carbonell
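A short sketch that reproduces these normalized and weighted vectors, assuming linear (min-max) normalization with the per-attribute minima and maxima read off the sample DB table; small differences from the slide values are rounding:

```python
import numpy as np

# Per-attribute minima and maxima taken from the sample table
# (income 20-100 K/yr, delinquent accounts 0-6, delinquency cycles 0-6,
#  credit years 0-10; Y/N attributes are already 0/1).
mins = np.array([20, 0, 0, 0, 0, 0], dtype=float)
maxs = np.array([100, 1, 6, 6, 1, 10], dtype=float)

def normalize(v):
    """Min-max normalize each attribute into [0, 1]."""
    return (np.asarray(v, dtype=float) - mins) / (maxs - mins)

r_new = normalize([65, 1, 1, 2, 0, 10])   # -> approx [.56 1 .17 .33 0 1]
r2    = normalize([60, 1, 3, 2, 1, 5])    # -> approx [.50 1 .50 .33 1 .50]
r4    = normalize([95, 1, 1, 2, 0, 9])    # -> approx [.94 1 .17 .33 0 .90]

# Optional attribute weighting (e.g. double salary and delinquent accounts):
weights = np.array([2, 1, 2, 1, 1, 1], dtype=float)
print(np.round(weights * r_new, 2))       # -> roughly [1.12 1 .34 .33 0 1]
```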
Similarity Functions (inverse dist) • Now that we have weighted, normalized vectors, how do we tell exactly their degree of similarity? • Inverse sum of differences (L1): sim(A, B) = 1 / Σi |ai − bi| • Inverse Euclidean distance (L2): sim(A, B) = 1 / sqrt( Σi (ai − bi)² ) © 2008, Jaime G. Carbonell
Similarity Functions (direct) • Dot-Product Similarity: sim(A, B) = A · B = Σi ai bi • Cosine Similarity (dot product of unit vectors): sim(A, B) = (A · B) / (|A| |B|) © 2008, Jaime G. Carbonell
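The slides showed these formulas as figures; the sketch below implements one common reading of each similarity measure (a small epsilon is added to the inverse-distance forms to avoid division by zero):

```python
import numpy as np

def sim_l1(a, b):
    """Inverse sum of absolute differences (L1); larger = more similar."""
    return 1.0 / (1e-9 + np.sum(np.abs(a - b)))

def sim_l2(a, b):
    """Inverse Euclidean distance (L2)."""
    return 1.0 / (1e-9 + np.sqrt(np.sum((a - b) ** 2)))

def sim_dot(a, b):
    """Dot-product similarity."""
    return float(np.dot(a, b))

def sim_cos(a, b):
    """Cosine similarity: dot product of unit-length vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

r_new = np.array([.56, 1, .17, .33, 0, 1])
r2    = np.array([.50, 1, .50, .33, 1, .50])
r4    = np.array([.94, 1, .17, .33, 0, .90])

# R-new comes out closer to R4 than to R2 under all four measures.
for sim in (sim_l1, sim_l2, sim_dot, sim_cos):
    print(sim.__name__, round(sim(r_new, r2), 3), round(sim(r_new, r4), 3))
```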
Alternative: Similarity Matrix for Non-Numeric Attributes

         tiny  little  small  medium  large  huge
tiny     1.0   0.8     0.7    0.5     0.2    0.0
little         1.0     0.9    0.7     0.3    0.1
small                  1.0    0.7     0.3    0.2
medium                        1.0     0.5    0.3
large                                 1.0    0.8
huge                                         1.0

• Diagonal must be 1.0
• Monotonicity property must hold
• Triangle inequality must hold
• Transitive property must hold
• Additivity/compositionality need not hold
© 2008, Jaime G. Carbonell
k-Nearest Neighbors Method • No explicit “training” phase • When new case arrives (vector of predictor att’s) • Find nearest k neighbors (max similarity) among previous cases (row vectors in DB table) • k neighbors vote for objective attribute • Unweighted majority vote, or • Similarity-weighted vote • Works for both discrete and continuous objective attributes © 2008, Jaime G. Carbonell
Similarity-Weighted Voting in kNN • If the Objective Attribute is Discrete: predict the class c maximizing Σi∈kNN sim(x, xi) · 1[yi = c], i.e. each neighbor votes for its class with weight equal to its similarity to the query • If the Objective Attribute is Continuous: predict the similarity-weighted average ŷ = Σi∈kNN sim(x, xi) · yi / Σi∈kNN sim(x, xi) © 2008, Jaime G. Carbonell
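A compact sketch of similarity-weighted kNN voting, assuming inverse Euclidean distance as the similarity function (any of the measures above could be substituted); the example query and history rows are the normalized vectors from the earlier slides:

```python
import numpy as np
from collections import defaultdict

def knn_predict(query, X, y, k=3, discrete=True):
    """Similarity-weighted k-nearest-neighbor prediction."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Similarity = inverse Euclidean distance (epsilon avoids division by zero).
    sims = 1.0 / (1e-9 + np.linalg.norm(X - np.asarray(query, dtype=float), axis=1))
    top = np.argsort(-sims)[:k]              # indices of the k most similar rows

    if discrete:
        votes = defaultdict(float)
        for i in top:
            votes[y[i]] += sims[i]           # each neighbor votes with weight = its similarity
        return max(votes, key=votes.get)
    # Continuous objective: similarity-weighted average of the neighbors' values.
    return float(np.sum(sims[top] * y[top].astype(float)) / np.sum(sims[top]))

history = [[.50, 1, .50, .33, 1, .50],   # R2 (not a good customer)
           [.94, 1, .17, .33, 0, .90]]   # R4 (good customer)
labels = ["N", "Y"]
print(knn_predict([.56, 1, .17, .33, 0, 1], history, labels, k=1))   # -> "Y"
```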
Applying kNN to Real Problems 1 • How does one choose the vector representation? • Easy: Vector = predictor attributes • What if attributes are not numerical? • Convert: (e.g. High=2, Med=1, Low=0), • Or, use similarity function over nominal values • E.g. equality or edit-distance on strings • How does one choose a distance function? • Hard: No magic recipe; try simpler ones first • This implies a need for systematic testing (discussed in coming slides) © 2008, Jaime G. Carbonell
Applying kNN to Real Problems 2 • How does one determine whether data should be normalized? • Normalization is usually a good idea • One can try kNN both ways to make sure • How does one determine “k” in kNN? • k is often determined empirically • Good start is: © 2008, Jaime G. Carbonell
Evaluating Machine Learning • Accuracy = Correct-Predictions/Total-Predictions • Simplest & most popular metric • But misleading on very-rare event prediction • Precision, recall & F1 • Borrowed from Information Retrieval • Applicable to very-rare event prediction • Correlation (between predicted & actual values) for continuous objective attributes • R2, kappa-coefficient, … © 2008, Jaime G. Carbonell
Sample Confusion Matrix [figure: confusion matrix of predicted diagnoses vs. true diagnoses] © 2008, Jaime G. Carbonell
Measuring Accuracy • Accuracy = correct/total • Error = incorrect/total • Hence: accuracy = 1 – error • For the diagnosis example: • A = 340/386 = 0.88, E = 1 – A = 0.12 © 2008, Jaime G. Carbonell
What About Rare Events? [figure: confusion matrix of predicted vs. true diagnoses, including the rare “shorted power supply” class] © 2008, Jaime G. Carbonell
Rare Event Evaluation • Accuracy for example = 0.88 • …but NO correct predictions for “shorted power supply”, 1 of 4 diagnoses • Alternative: Per-diagnosis (per-class) accuracy: • A(“shorted PS”) = 0/22 = 0 • A(“not plugged in”) = 160/184 = 0.87 © 2008, Jaime G. Carbonell
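A small sketch of overall vs. per-class accuracy computed from a confusion matrix; the counts below are illustrative only, chosen to be consistent with the totals quoted on these slides (340/386 overall, 160/184 for “not plugged in”, 0/22 for “shorted power supply”):

```python
import numpy as np

def accuracies(conf):
    """Overall and per-class accuracy from a confusion matrix
    (rows = true class, columns = predicted class)."""
    conf = np.asarray(conf, dtype=float)
    overall = np.trace(conf) / conf.sum()
    per_class = np.diag(conf) / conf.sum(axis=1)   # accuracy on each true class
    return overall, per_class

# Illustrative counts (the slide's actual confusion matrix was a figure):
conf = np.array([
    [160, 10, 10,  4],   # "not plugged in": 160 of 184 correct
    [  0, 90,  0,  0],
    [  0,  0, 90,  0],
    [ 12,  5,  5,  0],   # "shorted power supply": 0 of 22 correct
])
overall, per_class = accuracies(conf)
print(round(overall, 2), np.round(per_class, 2))   # -> 0.88, [0.87 1. 1. 0.]
```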
ROC Curves (ROC = Receiver Operating Characteristic) [figure: ROC curve] © 2008, Jaime G. Carbonell
ROC Curves (ROC=Receiver Operating Characteristic) Sensitivity = TP/(TP+FN) Specificity = TN/(TN+FP) © 2008, Jaime G. Carbonell
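A minimal scikit-learn example of computing an ROC curve from classifier scores; the labels and scores are made up for illustration:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Toy two-class example: true labels and a classifier's scores.
y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])
y_score = np.array([.1, .4, .35, .8, .2, .7, .6, .3, .9, .55])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("sensitivity (TPR):", np.round(tpr, 2))
print("1 - specificity (FPR):", np.round(fpr, 2))
print("area under the ROC curve:", round(auc(fpr, tpr), 2))
```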
If Plenty of Data, Evaluate with a Holdout Set [figure: the data is split into a training portion and a held-out portion; the model is trained on the first and its error measured on the second] • Often also used for parameter optimization © 2008, Jaime G. Carbonell
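A holdout-set evaluation sketch in scikit-learn; the dataset and the choice of logistic regression are stand-ins, since the slides' credit data is not available:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in dataset with a binary objective attribute.
X, y = load_breast_cancer(return_X_y=True)

# Hold out 30% of the rows; train only on the remaining 70%.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
holdout_error = 1 - accuracy_score(y_test, model.predict(X_test))
print("holdout error:", round(holdout_error, 3))
```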
Finite Cross-Validation Set • True error (true risk): error_D(h) = Pr_{x in D}[ f(x) ≠ h(x) ], where D = all data • Test error (empirical risk): error_S(h) = (1/m) · Σ_{x in S} δ( f(x) ≠ h(x) ), where S = test data and m = # test samples © 2008, Jaime G. Carbonell
Confidence Intervals If • S contains m examples, drawn independently • m ≥ 30 Then • With approximately 95% probability, the true error error_D lies in the interval error_S ± 1.96 · sqrt( error_S · (1 − error_S) / m ) © 2008, Jaime G. Carbonell
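A tiny helper that computes this approximate 95% interval; the example numbers (12% test error on 386 test cases) are illustrative:

```python
import math

def confidence_interval_95(test_error, m):
    """Approximate 95% confidence interval for the true error, given the
    error measured on m independently drawn test examples (m >= 30)."""
    half_width = 1.96 * math.sqrt(test_error * (1 - test_error) / m)
    return test_error - half_width, test_error + half_width

print(confidence_interval_95(0.12, 386))   # roughly (0.088, 0.152)
```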