180 likes | 203 Views
Discover a comprehensive overview of data mining methods, including statistical techniques, artificial intelligence applications, and models like regression and clustering. Learn about the advantages and drawbacks of various data mining approaches, along with the history, classification, and comparison of features. Explore real-world applications in finance, telecommunications, marketing, and web domains. The text provides detailed insights into using data mining for fraud detection, prediction, and classification in different industries, along with examples of data sets and outcomes analysis.
E N D
Overview of Methods Data mining techniques What techniques do, examples, Advantages & disadvantages
History • Statistics • AI: • genetic algorithms, neural networks • analogies with biology • memory-based reasoning • link analysis from graph theory
Techniques • Statistical • Market-Basket Analysis - find groups of items • Memory-Based Reasoning- case based • Cluster Detection - undirected (quantitative MBA) • Artificial Intelligence • Link Analysis - MCI’s Friends & Family • Decision Trees, Rule Induction - production rule • Neural Networks - automatic pattern detection • Genetic Algorithms- keep best parameters
Models • Regression: Y = a + bX • Classification: assign new record to class • Predictive: assign value to new record • Clustering: groups for data • Time-series: assign future value • Links: patterns in data
Fitting • Underfitting: not enough detail • leave out important variables • Overfitting: too much detail • memorizes training set, but doesn’t help with new data • data set too small • redundancy in data
Data Mining Functions • Classification • Identify categories in data • Prediction • Formula to predict future observations • Association • Rules using relationships among entities • Detection • Anomalies & irregularities (fraud detection)
Data Sets • Loan Applications • classification • Job Applications • classification • Insurance Fraud • detection • Expenditure Data • prediction
Loan Data • 650 observations • OUTCOMES (binary): • On-time cost of error: $300 • Late (default) cost of error: $2,000 • Variables • Age, Income, Assets, Debts, Want, Credit • Credit ordinal • Transform: Assets, Debts, & Want →Risk
Job Application Data • 500 observations • OUTCOMES (ordinal): • Unacceptable • Minimal • Acceptable • Excellent • Variables • Age, State, Degree, Major, Experience • State nominal; degree & major ordinal • State is superfluous
Insurance Claim Data • 5000 observations • OUTCOMES (binary): • OK cost of error $500 • Fraudulent cost of error $2,500 • Variables • Age, Gender, Claim, Tickets, Prior claims, Attorney • Gender & attorney nominal, tickets & prior claims categorical
Expenditure Data • 10,000 observations • OUTCOMES: • Could predict response in a number of categories • Others • Variables: • Age, Gender, Marital, Dependents, Income, Job years, Town years, Education years, Drivers license, Own home, Number of credit cards • Churn, proportion of income spent on seven categories