Data Mining Dr. Mohsen Kahani Email: kahani@um.ac.ir http://www.um.ac.ir/~kahani/
Overview • Introduction • Data Mining Functions and Models • Data Mining Methodologies • Data Mining Case Studies • Final Remarks
Motivation: “Necessity is the Mother of Invention” • Data explosion problem: • Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories • We are drowning in data, but starving for knowledge!
Data pyramid (from top to bottom) • Wisdom = Knowledge + experience • Knowledge = Information + rules • Information = Data + context • Data
Related Fields (figure): Data Mining and Knowledge Discovery, surrounded by • Machine Learning • Visualization • Statistics • Databases
Knowledge Discovery Process (figure) • Steps: Integration, Selection & Cleaning, Transformation, Data Mining, Interpretation & Evaluation • Intermediate results: Raw Data, Data Warehouse, Target Data, Transformed Data, Patterns and Rules, Knowledge, Understanding
Data Mining and Business Intelligence
Layers, from top to bottom, with increasing potential to support business decisions towards the top:
• Making Decisions
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts: OLAP, MDA
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA
Definition of Data Mining “…The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data…” Fayyad, Piatetsky-Shapiro, Smyth [1996]
The Evolution of Data Analysis
Columns: Evolutionary Step | Business Question | Enabling Technologies | Product Providers | Characteristics
• Data Collection (1960s) | "What was my total revenue in the last five years?" | Computers, tapes, disks | IBM, CDC | Retrospective, static data delivery
• Data Access (1980s) | "What were unit sales in New England last March?" | Relational databases (RDBMS), Structured Query Language (SQL), ODBC | Oracle, Sybase, Informix, IBM, Microsoft | Retrospective, dynamic data delivery at record level
• Data Warehousing & Decision Support (1990s) | "What were unit sales in New England last March? Drill down to Boston." | On-line analytic processing (OLAP), multidimensional databases, data warehouses | SPSS, Comshare, Arbor, Cognos, Microstrategy, NCR | Retrospective, dynamic data delivery at multiple levels
• Data Mining (Emerging Today) | "What’s likely to happen to Boston unit sales next month? Why?" | Advanced algorithms, multiprocessor computers, massive databases | SPSS/Clementine, Lockheed, IBM, SGI, SAS, NCR, Oracle, numerous startups | Prospective, proactive information delivery
Need for Data Mining • Data accumulate and double every 9 months • There is a big gap from stored data to knowledge; and the transition won’t occur automatically. • Manual data analysis is not new but a bottleneck • Fast developing Computer Science and Engineering generates new demands • Seeking knowledge from massive data • Any personal experience?
When is DM useful • Data rich world • Large data (dimensionality and size) • Image data (size) • Gene chip data (dimensionality) • Little knowledge about data (exploratory data analysis) • What if we have some knowledge?
Challenges • Increasing data dimensionality and data size • Various data forms • New data types • Streaming data, multimedia data • Efficient search and access to data/knowledge • Intelligent update and integration
Data Mining Survey
Industry Pioneers • 23% Manufacturing • 19% Financial Services • 17% Tele/Data communication • 13% Media • 12% Retail/Wholesaler
Objectives • 21.4% Understanding Customer Segments and Preferences • 19.5% Identifying Profitable Customers and Acquiring New Ones • 14.1% Increasing Revenue From Customers
Source: World Data Mining Survey, 6 August 2002.
Results of Data Mining Include: • Forecasting what may happen in the future • Classifying people or things into groups by recognizing patterns • Clustering people or things into groups based on their attributes • Associating what events are likely to occur together • Sequencing what events are likely to lead to later events
Data Mining versus OLAP • OLAP: On-line Analytical Processing • Provides a very good view of what is happening, but cannot predict what will happen in the future or explain why it is happening
Data Mining Versus Statistical Analysis
Data Analysis • Tests for statistical correctness of models • Are the statistical assumptions of the models correct? E.g. is the R-square good? • Hypothesis testing • Is the relationship significant? Use a t-test to validate significance • Tends to rely on sampling • Techniques are not optimised for large amounts of data • Requires strong statistical skills
Data Mining • Originally developed to act as expert systems to solve problems • Less interested in the mechanics of the technique • If it makes sense, then let’s use it • Does not require assumptions to be made about the data • Can find patterns in very large amounts of data • Requires understanding of the data and the business problem
Data Mining Taxonomy Predictive Method - …predict the value of a particular attribute… Descriptive Method - …foundation of human-interpretable patterns that describe the data…
Data Mining Tasks... • Classification [Predictive] • Clustering [Descriptive] • Association Rule Discovery [Descriptive] • Sequential Pattern Discovery [Descriptive] • Deviation Detection [Predictive]
Data Mining Tasks: Classification Learn a method for predicting the instance class from pre-labeled (classified) instances Many approaches: Statistics, Decision Trees, Neural Networks, ...
Classification: Linear Regression • Linear Regression w0 + w1 x + w2 y >= 0 • Regression computes wi from data to minimize squared error to ‘fit’ the data • Not flexible enough
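The fitting step on this slide can be sketched in plain Python. This is only an illustration under assumptions: the six 2-D points, their +1/-1 targets, and the helper names are made up, and the weights are obtained by solving the normal equations, which is one standard way to minimize squared error.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_least_squares(points, targets):
    """Return [w0, w1, w2] minimizing the squared error of w0 + w1*x + w2*y."""
    rows = [[1.0, x, y] for x, y in points]
    # Normal equations: (A^T A) w = A^T t
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atb = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(3)]
    return solve_linear_system(ata, atb)


def classify(w, x, y):
    """Class +1 if w0 + w1*x + w2*y >= 0, otherwise -1."""
    return 1 if w[0] + w[1] * x + w[2] * y >= 0 else -1


# Hypothetical training data: two well-separated groups of points.
points = [(1, 1), (2, 1), (1, 2), (5, 5), (6, 5), (5, 6)]
targets = [-1, -1, -1, 1, 1, 1]

w = fit_least_squares(points, targets)
predictions = [classify(w, x, y) for x, y in points]
```

Because the decision boundary is a single straight line, data whose classes are not linearly separable cannot be fitted this way, which is the sense of "not flexible enough" above.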
Classification: Decision Trees • if X > 5 then blue • else if Y > 3 then blue • else if X > 2 then green • else blue (figure: the X-Y plane split at X = 2, X = 5 and Y = 3)
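The rule chain on this slide translates directly into code. A minimal sketch; the function name `classify_point` is an assumption, while the thresholds and colours are the ones from the slide.

```python
def classify_point(x, y):
    """Apply the slide's rules in order; the first matching rule wins."""
    if x > 5:
        return "blue"
    if y > 3:
        return "blue"
    if x > 2:
        return "green"
    return "blue"


# A point in the middle band (2 < x <= 5, y <= 3) is the only green case.
example = classify_point(3, 1)
```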
Decision Trees • a way of representing a series of rules that lead to a class or value • basic components of a decision tree: decision nodes, branches and leaves • Example (figure): decision nodes Income > 40,000, Job > 5 and High Debt, with Yes/No branches leading to the leaves Low Risk and High Risk
Decision Trees (cont.) • handle non-numeric data very well • work best when the predictor variables are categorical
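Example Decision Tree
Training data attributes: Refund (categorical), MarSt (categorical), TaxInc (continuous), plus the class label. The splitting attributes are Refund, MarSt and TaxInc.
• Refund = Yes: NO
• Refund = No: test MarSt
• MarSt = Married: NO
• MarSt = Single, Divorced: test TaxInc
• TaxInc < 80K: NO
• TaxInc > 80K: YES
The splitting attribute at a node is determined based on the Gini index.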
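To make the Gini index concrete, here is a small sketch in plain Python. The label counts (7 NO, 3 YES, split into one pure group and one mixed group) are hypothetical, and the function names are illustrative. The Gini index of a node is 1 minus the sum of squared class fractions; a candidate split is scored by the size-weighted average over its child nodes, and lower is better.

```python
from collections import Counter


def gini(labels):
    """Gini index of one node: 1 - sum of squared class fractions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_of_split(groups):
    """Weighted Gini index of a split, given the labels in each child node."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * gini(g) for g in groups)


# Hypothetical node with 7 NO and 3 YES records.
parent = ["NO"] * 7 + ["YES"] * 3

# Hypothetical split on a categorical attribute into two child nodes.
left = ["NO"] * 3                 # pure child
right = ["NO"] * 4 + ["YES"] * 3  # mixed child

parent_gini = gini(parent)
split_gini = gini_of_split([left, right])
```

A tree builder would evaluate `gini_of_split` for every candidate attribute and choose the one with the lowest value.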
Classification: Neural Networks • efficiently model large and complex problems • may be used for classification or for regression • start with an input layer, followed by a hidden layer and an output layer (figure: six numbered nodes arranged as Inputs, Hidden Layer and Output)
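The input layer => hidden layer => output layer flow can be sketched as a single forward pass. This is an illustration only: the layer sizes (2 inputs, 3 hidden units, 1 output), the sigmoid activation and all weight values are assumptions, and no training is shown.

```python
import math


def sigmoid(z):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """One fully connected layer: weighted sum plus bias, then sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Propagate inputs through the hidden layer and then the output layer."""
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)


# Hypothetical weights: one row per unit, one column per incoming value.
hidden_weights = [[0.5, -0.4], [0.3, 0.8], [-0.6, 0.1]]
hidden_biases = [0.0, 0.0, 0.0]
output_weights = [[0.7, -0.2, 0.5]]
output_biases = [0.1]

output = forward([1.0, 0.0], hidden_weights, hidden_biases,
                 output_weights, output_biases)
```

Training would adjust the weights to reduce the error between `output` and the known class, which is where the long training times mentioned on the next slide come from.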
Neural Networks (cont.) • can be easily implemented to run on massively parallel computers • cannot be easily interpreted • require an extensive amount of training time • require a lot of data preparation (very careful data cleansing, selection, preparation, and pre-processing) • require a sufficiently large data set and a high signal-to-noise ratio
Kohonen Network Description • unsupervised • seeks to describe dataset in terms of natural clusters of cases
Classification Example (figure): a Training Set with two categorical attributes, one continuous attribute and a class label is used to learn a classifier; the resulting Model is then applied to a Test Set.
Classification Application • Direct Marketing • Fraud Detection • Customer Attrition/Churn • Sky Survey Cataloging
Data Mining Tasks: Clustering • Goal is to identify categories • Natural grouping of customers by processing all the available data about them. • Other applications • market segmentation, discovering affinity groups, and defect analysis
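As one concrete way to find such natural groupings, the sketch below uses k-means. The slide does not name an algorithm, so k-means is a swapped-in choice; the six 2-D points are made up, and the first k points are used as starting centroids purely to keep the example deterministic.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_centroid(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, k, iterations=20):
    """Plain k-means: assign points to the nearest centroid, then recompute."""
    centroids = [points[i] for i in range(k)]  # simple deterministic start
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest_centroid(p, centroids)].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    labels = [nearest_centroid(p, centroids) for p in points]
    return centroids, labels


# Hypothetical customer data, e.g. two numeric attributes per customer.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

centroids, labels = kmeans(points, k=2)
```

No class labels are supplied; the groups emerge from the attribute values alone, which is what makes clustering a descriptive task.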
Data Mining Tasks: Association Rule Discovery • Given a set of records, each of which contains some number of items from a given collection • Produce dependency rules which predict the occurrence of an item based on the occurrences of other items • Rules discovered: {Milk} --> {Coke}, {Diaper, Milk} --> {Beer}
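Rules like these are usually judged by their support and confidence. The sketch below computes both for the two rules on the slide; the five transactions are hypothetical illustration data, and the function names are assumptions.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base


# Hypothetical market-basket records.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

milk_coke_support = support(transactions, {"Milk", "Coke"})
milk_coke_confidence = confidence(transactions, {"Milk"}, {"Coke"})
diaper_milk_beer_confidence = confidence(
    transactions, {"Diaper", "Milk"}, {"Beer"})
```

A rule miner would enumerate candidate item sets and keep only the rules whose support and confidence exceed chosen thresholds.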
Association Rule Discovery Application • Marketing and Sales Promotion • Supermarket Shelf Management • Inventory Management
Deviation Detection & Pattern Discovery Deviation Detection: …discovering most significant changes in data from previously measured or normative values… V. Kumar, M. Joshi, Tutorial on High Performance Data Mining. Sequential Pattern Discovery: …process of looking for patterns and rules that predict strong sequential dependencies among different events… V. Kumar, M. Joshi, Tutorial on High Performance Data Mining.
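One simple reading of deviation detection is to flag new values that lie far from previously measured ones. The sketch below does this with a z-score against a baseline; the threshold of 3 standard deviations, the numbers and the function name are all assumptions, and real systems use considerably richer models.

```python
import statistics


def find_deviations(baseline, new_values, threshold=3.0):
    """Return the new values lying more than `threshold` standard
    deviations away from the baseline mean."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    if stdev == 0:
        # A constant baseline: any different value counts as a deviation.
        return [v for v in new_values if v != mean]
    return [v for v in new_values if abs(v - mean) / stdev > threshold]


# Hypothetical previously measured values and a batch of new readings.
baseline = [100, 102, 98, 101, 99, 100, 103, 97]
new_values = [101, 99, 140, 100, 92]

deviations = find_deviations(baseline, new_values)
```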
Sequential Patterns • Identify frequently occurring sequences from given records • 40 percent of female customers buy a gray skirt six months after buying a red jacket
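A statement such as the 40 percent figure above is a ratio over ordered purchase histories. The sketch below computes that ratio on hypothetical data; it checks only the order of purchases and ignores the time gap (the "six months" in the example), and the customer names and function name are assumptions.

```python
def sequence_confidence(histories, first, then):
    """Among customers who bought `first`, the fraction who bought
    `then` at some later point in their purchase history."""
    bought_first = 0
    followed = 0
    for purchases in histories.values():
        if first in purchases:
            bought_first += 1
            later = purchases[purchases.index(first) + 1:]
            if then in later:
                followed += 1
    return followed / bought_first if bought_first else 0.0


# Hypothetical purchase histories, each in chronological order.
histories = {
    "c1": ["red jacket", "scarf", "gray skirt"],
    "c2": ["red jacket", "gray skirt"],
    "c3": ["red jacket", "boots"],
    "c4": ["gray skirt", "red jacket"],  # skirt came first, so no match
    "c5": ["red jacket"],
    "c6": ["boots", "scarf"],            # never bought the jacket
}

ratio = sequence_confidence(histories, "red jacket", "gray skirt")
```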
Data Mining Methodology: SAS (SEMMA) • Sample • Extract a portion of the dataset for data mining • Explore • Modify • Create, select and transform variables with the intention of building a model • Model • Specify a relationship of variables that reliably predicts a desired goal • Assess • Evaluate the practical value of the findings and the model resulting from the data mining effort
Data Mining Methodology: CRISP-DM • Business understanding • Data understanding • Data preparation • Modeling • Evaluation • Deployment
Phases and Tasks (each task is followed by its outputs)
Business Understanding
• Determine Business Objectives: Background, Business Objectives, Business Success Criteria
• Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
• Determine Data Mining Goal: Data Mining Goals, Data Mining Success Criteria
• Produce Project Plan: Project Plan, Initial Assessment of Tools and Techniques
Data Understanding
• Collect Initial Data: Initial Data Collection Report
• Describe Data: Data Description Report
• Explore Data: Data Exploration Report
• Verify Data Quality: Data Quality Report
Data Preparation
• Data Set: Data Set Description
• Select Data: Rationale for Inclusion / Exclusion
• Clean Data: Data Cleaning Report
• Construct Data: Derived Attributes, Generated Records
• Integrate Data: Merged Data
• Format Data: Reformatted Data
Modeling
• Select Modeling Technique: Modeling Technique, Modeling Assumptions
• Generate Test Design: Test Design
• Build Model: Parameter Settings, Models, Model Description
• Assess Model: Model Assessment, Revised Parameter Settings
Evaluation
• Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria, Approved Models
• Review Process: Review of Process
• Determine Next Steps: List of Possible Actions, Decision
Deployment
• Plan Deployment: Deployment Plan
• Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
• Produce Final Report: Final Report, Final Presentation
• Review Project: Experience Documentation
Major Application Areas for Data Mining Solutions • Fraud/Non-Compliance, Anomaly detection • Isolate the factors that lead to fraud, waste and abuse • Target auditing and investigative efforts more effectively • Credit/Risk Scoring • Intrusion detection • Parts failure prediction • Recruiting/Attracting customers • Maximizing profitability (cross selling, identifying profitable customers) • Service Delivery and Customer Retention • Build profiles of customers likely to use which services • Web Mining • Health Care
Case Study: Search Engines • Early search engines relied mainly on the keywords on a page and were therefore subject to manipulation • Google's success is due to its algorithm, which relies mainly on the links to a page • Google founders Sergey Brin and Larry Page were students at Stanford doing research in databases and data mining in 1998, which led to Google
Case Study: Direct Marketing and CRM • Most major direct marketing companies are using modeling and data mining • Most financial companies are using customer modeling • Modeling is easier than changing customer behaviour • Some successes • Verizon Wireless reduced churn rate from 2% to 1.5%
Biology: Molecular Diagnostics • Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML) • 72 samples, about 7,000 genes • Results: 33 correct (97% accuracy), 1 error (sample suspected mislabelled) • Outcome predictions?
Case Study: Security and Fraud Detection • Credit card fraud detection • Money laundering: FAIS (US Treasury) • Securities fraud: NASDAQ Sonar system • Phone fraud: AT&T, Bell Atlantic, British Telecom/MCI • Bio-terrorism detection at the Salt Lake Olympics 2002
Data Mining and Privacy • Data Mining looks for patterns, not people! • Technical solutions can limit privacy invasion • Replacing sensitive personal data with anonymous IDs • Giving randomized outputs • Multi-party computation over distributed data • …
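The first technical measure, replacing sensitive personal data with an anonymous ID, can be sketched with a keyed hash from the Python standard library. This is an illustration under assumptions: the record layout, the field name `name`, the key and the 16-character ID length are hypothetical, and a keyed hash is only one possible way to generate such IDs.

```python
import hashlib
import hmac


def anonymous_id(value, secret_key):
    """Derive a stable anonymous ID from a sensitive value with HMAC-SHA256.
    The same value and key always give the same ID."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def anonymize(records, sensitive_field, secret_key):
    """Return copies of the records with the sensitive field replaced."""
    result = []
    for record in records:
        copy = dict(record)  # leave the original record untouched
        copy[sensitive_field] = anonymous_id(record[sensitive_field],
                                             secret_key)
        result.append(copy)
    return result


# Hypothetical records and key; a real key would be stored securely.
key = b"example-secret-key"
records = [
    {"name": "Alice Example", "purchase": "Milk"},
    {"name": "Bob Example", "purchase": "Beer"},
    {"name": "Alice Example", "purchase": "Coke"},
]

anonymized = anonymize(records, "name", key)
```

Because the same person always maps to the same ID, patterns across records can still be mined without the name appearing in the mined data.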
The Hype Curve for Data Mining and Knowledge Discovery (figure) • Rising expectations • Over-inflated expectations • Disappointment • Growing acceptance and mainstreaming
Final Remarks • Data Mining can be used in any field that needs to find patterns or relationships in its data.
Special Data Types • Spatial Data • Streamed Data • Multimedia data
Spatial Mining • Spatial Data and Structures • Images • Spatial Mining Algorithms