180 likes | 199 Views
Overview of Methods. Data mining techniques What techniques do, examples, Advantages & disadvantages. History. Statistics AI: genetic algorithms, neural networks analogies with biology memory-based reasoning link analysis from graph theory. Techniques. Statistical
E N D
Overview of Methods Data mining techniques What techniques do, examples, Advantages & disadvantages
History • Statistics • AI: • genetic algorithms, neural networks • analogies with biology • memory-based reasoning • link analysis from graph theory
Techniques • Statistical • Market-Basket Analysis - find groups of items • Memory-Based Reasoning- case based • Cluster Detection - undirected (quantitative MBA) • Artificial Intelligence • Link Analysis - MCI’s Friends & Family • Decision Trees, Rule Induction - production rule • Neural Networks - automatic pattern detection • Genetic Algorithms- keep best parameters
Models • Regression: Y = a + bX • Classification: assign new record to class • Predictive: assign value to new record • Clustering: groups for data • Time-series: assign future value • Links: patterns in data
Fitting • Underfitting: not enough detail • leave out important variables • Overfitting: too much detail • memorizes training set, but doesn’t help with new data • data set too small • redundancy in data
Data Mining Functions • Classification • Identify categories in data • Prediction • Formula to predict future observations • Association • Rules using relationships among entities • Detection • Anomalies & irregularities (fraud detection)
Data Sets • Loan Applications • classification • Job Applications • classification • Insurance Fraud • detection • Expenditure Data • prediction
Loan Data • 650 observations • OUTCOMES (binary): • On-time cost of error: $300 • Late (default) cost of error: $2,000 • Variables • Age, Income, Assets, Debts, Want, Credit • Credit ordinal • Transform: Assets, Debts, & Want →Risk
Job Application Data • 500 observations • OUTCOMES (ordinal): • Unacceptable • Minimal • Acceptable • Excellent • Variables • Age, State, Degree, Major, Experience • State nominal; degree & major ordinal • State is superfluous
Insurance Claim Data • 5000 observations • OUTCOMES (binary): • OK cost of error $500 • Fraudulent cost of error $2,500 • Variables • Age, Gender, Claim, Tickets, Prior claims, Attorney • Gender & attorney nominal, tickets & prior claims categorical
Expenditure Data • 10,000 observations • OUTCOMES: • Could predict response in a number of categories • Others • Variables: • Age, Gender, Marital, Dependents, Income, Job years, Town years, Education years, Drivers license, Own home, Number of credit cards • Churn, proportion of income spent on seven categories