Data Mining Introduction TYNE SYSTEM Chun-hung, Chou 2003.12.09
Outline 1. Data Mining Overview 2. Functionalities 3. Software 4. R function 5. Example 6. Q & A
Knowledge Discovery Process 1. Data cleaning - remove noise and inconsistent data 2. Data integration - combine multiple data sources 3. Data selection - retrieve the data relevant to the analysis task 4. Data transformation - convert the data into forms appropriate for mining 5. Data mining - apply methods to extract patterns 6. Pattern evaluation - identify the interesting patterns 7. Knowledge presentation - present the mined knowledge to the user
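The first four steps above can be sketched as small functions chained together. This is only an illustrative sketch: the record layout (`tool`, `value`), the tool names `T1`/`T2`, and the cleaning rule (drop missing or negative readings) are assumptions made for the example, not part of the original slides.

```python
def clean(records):
    """Step 1: drop records whose reading is missing or negative (assumed rule)."""
    return [r for r in records if r.get("value") is not None and r["value"] >= 0]

def integrate(*sources):
    """Step 2: combine several data sources into one list of records."""
    return [r for source in sources for r in source]

def select(records, tool):
    """Step 3: keep only the records relevant to the analysis task."""
    return [r for r in records if r["tool"] == tool]

def transform(records):
    """Step 4: convert records into the form used for mining (plain floats)."""
    return [float(r["value"]) for r in records]

# Hypothetical input data from two sources.
source_a = [{"tool": "T1", "value": 1}, {"tool": "T2", "value": 5},
            {"tool": "T1", "value": None}]
source_b = [{"tool": "T1", "value": 3}, {"tool": "T1", "value": -1}]

prepared = transform(select(integrate(clean(source_a), clean(source_b)), "T1"))
```

The list `prepared` is what a mining step (step 5) would then consume.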
What is Data Mining? • Viewed as part of the Knowledge Discovery process. • Data mining is a process that uses a variety of data analysis tools to discover patterns and relationships in data. • Uses tools from Computer Science and Artificial Intelligence as well as Statistics.
Why do we need data mining? • Large number of records (cases) (10^8-10^12 bytes) • High dimensional data (variables) (10-10^4 attributes) • Only a small portion, typically 5% to 10%, of the collected data is ever analyzed. • Data that may never be explored continues to be collected for fear that something that proves important in the future would otherwise be missing. • Magnitude of data precludes most traditional analysis ANOVA/PC/
Potential Applications • Fraud Detection • Manufacturing Processes • Targeting Markets • Scientific Data Analysis • Risk Management • Web Intelligence • Bioinformatics • …
Data Mining Myths • Data mining tools need no guidance. • Data mining models explain behavior. • Data mining requires no data analysis skill. • Data mining tools are “different” from statistics. • Data mining eliminates the need to understand your business and your data.
Data Mining Functionalities • Concept/Class Description • Association Analysis • Classification Analysis • Cluster Analysis • Outlier Analysis • Evolution Analysis
Concept Description Generate descriptions for characterization and comparison of data. Characterization: summarizes and describes a collection of data, e.g. mean, distribution, percentile, … Comparison: summarizes and distinguishes one collection of data from other collection(s) of data.
Concept Description Method: Visualization: e.g. boxplot, bar chart, histogram, … Statistics/tabulation: e.g. mean, std, proportion, contingency table, …
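As a rough illustration of the statistics side of concept description, the sketch below characterizes one collection with a few summary statistics and compares two collections by their difference in means. The function names `characterize` and `compare` are made up for this example, and a difference in means is only one of many possible comparisons.

```python
import statistics

def characterize(values):
    """Characterization: summarize one collection of numeric data."""
    q1, median, q3 = statistics.quantiles(values, n=4)  # quartiles
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation
        "q1": q1,
        "median": median,
        "q3": q3,
    }

def compare(a, b):
    """Comparison: distinguish two collections by their difference in means."""
    return statistics.mean(a) - statistics.mean(b)

summary = characterize([1, 2, 3, 4, 5, 6, 7])
gap = compare([1, 2, 3], [2, 3, 4])
```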
Association Analysis • Goal: find interesting relationships among items in a given data set
Association Analysis Example: • Market Basket Analysis - An example of Rule-based Machine Learning • Customer Analysis • Market Basket Analysis uses the information about what a customer purchases to give us insight into who they are and why they make certain purchases • Product Analysis • Market Basket Analysis gives us insight into the merchandise by telling us which products tend to be purchased together and which are most amenable to purchase
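One common way to quantify "which products tend to be purchased together" is with the support and confidence of a rule. These two measures are not named on the slide and are introduced here as an assumption about what the analysis would compute. The baskets below are invented sample data.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    base = support(transactions, antecedent)
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base if base else 0.0

# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

pair_support = support(baskets, {"diapers", "beer"})
beer_to_diapers = confidence(baskets, {"beer"}, {"diapers"})
diapers_to_beer = confidence(baskets, {"diapers"}, {"beer"})
```

In this toy data every basket with beer also contains diapers, while only two of the three baskets with diapers contain beer, so the rule is stronger in one direction than the other.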
Classification Analysis Goal: Build a model that describes a predetermined set of data classes or concepts, and use the model for prediction
Classification Analysis Method: decision tree, Bayesian network, Bayesian belief network, neural network, k-nearest neighbor, case-based reasoning, genetic algorithm, rough sets, fuzzy logic
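To make one of the listed methods concrete, here is a minimal k-nearest-neighbor sketch: a query point receives the majority label of its k closest training points. The Euclidean distance, the default `k=3`, and the labels `normal`/`abnormal` are assumptions chosen for illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (features, label) pairs; query: a feature tuple."""
    # Sort training rows by Euclidean distance to the query and keep the k nearest.
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    # Majority vote among the labels of those neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled training data.
train = [
    ((1, 1), "normal"), ((1, 2), "normal"), ((2, 1), "normal"),
    ((8, 8), "abnormal"), ((9, 8), "abnormal"), ((8, 9), "abnormal"),
]

low = knn_predict(train, (1.5, 1.5))
high = knn_predict(train, (8.5, 8.5))
```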
Cluster Analysis Goal: group a set of physical or abstract objects into classes of similar objects
Cluster Analysis • Method: Partitioning methods: k-means; Hierarchical methods: top-down, bottom-up; Density-based methods: arbitrary shapes; Grid-based methods: cells; Model-based methods: best fit of a given model
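As an example of a partitioning method, the sketch below is a bare-bones k-means. To stay deterministic it takes the first k points as initial centers, which is a simplification assumed for this example; practical implementations usually pick initial centers more carefully.

```python
import math

def kmeans(points, k, iterations=20):
    """Return (centers, clusters) after at most `iterations` rounds."""
    centers = [tuple(p) for p in points[:k]]  # naive init: first k points
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # assignments are stable, so stop
            break
        centers = new_centers
    return centers, clusters

# Hypothetical 2-D points forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(points, k=2)
```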
Outlier Analysis Outlier: data that can be considered inconsistent with the rest of a given data set Goal: find an efficient method to mine the outliers
Outlier Analysis Method: - Statistical-Based Outlier Detection - Distance-Based Outlier Detection - Deviation-Based Outlier Detection
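Two of the approaches above can be sketched in a few lines for one-dimensional data. The z-score rule stands in for statistical-based detection, and the neighbor-count rule stands in for distance-based detection. The threshold, radius, and minimum neighbor count are arbitrary values assumed for the example, and the readings are invented.

```python
import statistics

def zscore_outliers(values, threshold=2.5):
    """Statistical-based: flag values far from the mean in standard deviations."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []
    return [v for v in values if abs(v - mean) / std > threshold]

def distance_outliers(values, radius, min_neighbors):
    """Distance-based: flag values with too few other values within `radius`."""
    flagged = []
    for i, v in enumerate(values):
        neighbors = sum(
            1 for j, w in enumerate(values) if j != i and abs(v - w) <= radius
        )
        if neighbors < min_neighbors:
            flagged.append(v)
    return flagged

# Hypothetical readings with one obviously inconsistent value.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
by_zscore = zscore_outliers(readings)
by_distance = distance_outliers(readings, radius=3, min_neighbors=2)
```

Note that a single extreme value inflates the standard deviation, so in small samples the z-score threshold has to be chosen with care.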
Evolution Analysis • Goal: Describe and model regularities or trends for objects whose behavior changes over time
Evolution Analysis • Method: Statistical Method, Trend Analysis, Similarity Search in Time-Series Analysis, Sequential Pattern Mining, Periodicity Analysis
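Trend analysis, one of the methods listed, is often approached by smoothing a series and estimating its overall slope. The sketch below shows a simple moving average and a least-squares slope over equally spaced time steps. Both are assumed here as representative techniques and are not taken from the slides.

```python
def moving_average(series, window):
    """Smooth a series by averaging each run of `window` consecutive values."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

def linear_trend(series):
    """Least-squares slope per time step, with time steps 0, 1, 2, ..."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = sum(series) / n
    num = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den

smoothed = moving_average([1, 2, 3, 4, 5], window=3)
slope = linear_trend([2, 4, 6, 8])
```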
Commercial Software • Full Suite
Example: Decision Tree • Decision tree for tool abnormality detection (AWD030, AWD050, AWD080)