Section 1.1 Background
Objectives
• Discuss some of the history of data mining.
• Define data mining and its uses.
Defining Characteristics
1. The Data
   • Massive, operational, and opportunistic
2. The Users and Sponsors
   • Business decision support
3. The Methodology
   • Computer-intensive “ad hockery”
   • Multidisciplinary lineage
Data Mining, circa 1963
• IBM 7090
• 600 cases
• “Machine storage limitations restricted the total number of variables which could be considered at one time to 25.”
Since 1963
• Moore’s Law: The information density on silicon integrated circuits doubles every 18 to 24 months.
• Parkinson’s Law: Work expands to fill the time available for its completion.
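As a rough illustration of what the doubling rule implies, the sketch below counts how many doublings fit into a span of years and the growth factor that follows. The function names and the 36-year span are illustrative assumptions, not figures from the slides.

```python
def doublings(years: float, months_per_doubling: float) -> float:
    """Number of doublings in `years` when density doubles every `months_per_doubling` months."""
    if months_per_doubling <= 0:
        raise ValueError("months_per_doubling must be positive")
    return years * 12.0 / months_per_doubling


def growth_factor(years: float, months_per_doubling: float) -> float:
    """Multiplicative growth implied by that many doublings."""
    return 2.0 ** doublings(years, months_per_doubling)


if __name__ == "__main__":
    # Compare the two ends of the 18-to-24-month range over an assumed 36-year span.
    for months in (18, 24):
        print(months, doublings(36, months), growth_factor(36, months))
```

The gap between the 18-month and 24-month assumptions compounds quickly, so the range quoted on the slide covers very different long-run outcomes.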
Data Deluge
• hospital patient registries
• electronic point-of-sale data
• remote sensing images
• tax returns
• stock trades
• OLTP
• telephone calls
• airline reservations
• credit card charges
• catalog orders
• bank transactions
The Data
             Experimental          Opportunistic
Purpose      Research              Operational
Value        Scientific            Commercial
Generation   Actively controlled   Passively observed
Size         Small                 Massive
Hygiene      Clean                 Dirty
State        Static                Dynamic
Business Decision Support
• Database Marketing
  • Target marketing
  • Customer relationship management
• Credit Risk Management
  • Credit scoring
• Fraud Detection
• Healthcare Informatics
  • Clinical decision support
Multidisciplinary
[Diagram relating Data Mining to Statistics, Pattern Recognition, Neurocomputing, Machine Learning, AI, Databases, and KDD]
Tower of Babel
• “Bias”
  • STATISTICS: the expected difference between an estimator and what is being estimated
  • NEUROCOMPUTING: the constant term in a linear combination
  • MACHINE LEARNING: a reason for favoring any model that does not fit the data perfectly
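The first two senses of “bias” can be written down directly. The notation below is an assumption made for illustration (an estimator \hat{\theta} of a quantity \theta, inputs x_i with weights w_i); the slides themselves give no symbols. The machine learning sense is a preference among models, not a quantity, so it has no comparable formula.

```latex
% Statistics: expected difference between the estimator and what it estimates
\operatorname{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta

% Neurocomputing: the bias is the constant term w_0 in the linear combination
y = w_0 + \sum_{i=1}^{p} w_i x_i
```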
Steps in Data Mining/Analysis
1. Specific Objectives
   • In terms of the subject matter
2. Translation into Analytical Methods
3. Data Examination
   • Data capacity
   • Preliminary results
4. Refinement and Reformulation
Required Expertise
• Domain
• Data
• Analytical Methods
Nuggets “If you’ve got terabytes of data, and you’re relying on data mining to find interesting things in there for you, you’ve lost before you’ve even begun.” — Herb Edelstein
What Is Data Mining?
• IT
  • Complicated database queries
• ML
  • Inductive learning from examples
• Stat
  • What we were taught not to do
Problem Translation
• Predictive Modeling
  • Supervised classification
• Cluster Analysis
• Association Rules
• Something Else
Predictive Modeling
[Figure: a data table of cases, with input columns and a target column]
Types of Targets
• Supervised Classification
  • Event/no event (binary target)
  • Class label (multiclass problem)
• Regression
  • Continuous outcome
• Survival Analysis
  • Time-to-event (possibly censored)
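The target types above can be told apart, roughly, by looking at the values in the target column. The sketch below is a hypothetical heuristic for illustration only: the `TimeToEvent` class, the function name, and the decision rules are assumptions, not part of the slides or of any tool. In practice the analyst declares the target type; for instance, integer class labels with more than two levels would be misread as a continuous outcome here.

```python
from dataclasses import dataclass
from numbers import Real
from typing import Sequence


@dataclass(frozen=True)
class TimeToEvent:
    """Hypothetical container for a survival target."""
    time: float
    censored: bool  # True if the event had not occurred by `time`


def task_for_target(values: Sequence[object]) -> str:
    """Guess the modeling task from the target values (illustrative heuristic)."""
    if not values:
        raise ValueError("target column is empty")
    if all(isinstance(v, TimeToEvent) for v in values):
        return "survival analysis"
    if len(set(values)) == 2:
        # Exactly two distinct values: treat as event/no event.
        return "supervised classification (binary target)"
    if all(isinstance(v, str) for v in values):
        return "supervised classification (multiclass)"
    if all(isinstance(v, Real) for v in values):
        return "regression"
    raise ValueError("mixed or unsupported target values")


if __name__ == "__main__":
    print(task_for_target([0, 1, 1, 0]))
    print(task_for_target(["low", "mid", "high"]))
    print(task_for_target([12.5, 3.0, 7.25]))
    print(task_for_target([TimeToEvent(5.0, False), TimeToEvent(9.0, True)]))
```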
Section 1.2 SEMMA
Objectives
• Define SEMMA.
• Introduce the tools available in Enterprise Miner.
SEMMA
• Sample
• Explore
• Modify
• Model
• Assess
Sample
• Input Data Source
• Sampling
• Data Partition
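As an illustration of what a data partition step generally does, the sketch below splits a set of cases at random into training, validation, and test subsets. It is a minimal standard-library example, not Enterprise Miner's implementation; the function name, the 40/30/30 proportions, and the seed are arbitrary assumptions.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def partition(
    cases: Sequence[T],
    train: float = 0.4,
    validate: float = 0.3,
    seed: int = 12345,
) -> Tuple[List[T], List[T], List[T]]:
    """Randomly split cases into train, validation, and test lists.

    The test set receives whatever remains after the first two fractions.
    """
    if train < 0 or validate < 0 or train + validate > 1:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    shuffled = list(cases)
    # A seeded generator makes the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_validate = int(len(shuffled) * validate)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_validate],
        shuffled[n_train + n_validate:],
    )


if __name__ == "__main__":
    tr, va, te = partition(range(100))
    print(len(tr), len(va), len(te))
```

Fixing the seed keeps the three subsets stable across runs, which matters when models built at different times are compared on the same validation data.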
Explore
• Distribution Explorer
• Multiplot
• Insight
• Association
• Variable Selection
• Link Analysis
Modify
• Data Set Attributes
• Transform Variables
• Filter Outliers
• Replacement
• Clustering
• SOM/Kohonen
• Time Series
Model
• Regression
• Tree
• Neural Network
• Princomp/Dmneural
• User Defined Model
• Ensemble
• Memory Based Reasoning
• Two-Stage Model
Assess
• Assessment
• Reporter
Other Types of Nodes – Scoring Nodes
• Score
• C*Score
Other Types of Nodes – Utility Nodes
• Group Processing
• Data Mining Database
• SAS Code
• Control Point
• Subdiagram