CS-470: Data Mining Fall 2009
Organizational Details
Class Meeting: 4:00-6:45pm, Tuesday, Room SCIT215
Instructor: Dr. Igor Aizenberg
Office: Science and Technology Building, 104C
Phone: (903) 334-6654
e-mail: igor.aizenberg@tamut.edu
Office hours: Monday, Wednesday 10am-6pm; Tuesday 11am-3pm
Class Web Page: http://www.eagle.tamut.edu/faculty/igor/CS-470.htm
Textbook R. J. Roiger, M. W. Geatz, Data Mining: A Tutorial-Based Primer, Addison Wesley, 2003, ISBN 0-201-74128-8
Control • Exams (open book, open notes): • Exam 1: October 6, 2009 • Exam 2: November 10, 2009 • Exam 3: December 8, 2009 • Homework
Grading
Grading Method:
Homework and preparation: 10%
Exam 1: 30%
Exam 2: 30%
Exam 3: 30%
Grading Scale:
90%+ A
80%+ B
70%+ C
60%+ D
less than 60% F
Data Mining: A Definition The process of employing one or more machine learning techniques to automatically analyze and extract knowledge from data. The exploration and analysis of large quantities of data in order to discover meaningful patterns and rules.
What Is Data Mining? • Data mining (knowledge discovery in databases) is the process of discovering interesting knowledge from large amounts of data stored either in databases, data warehouses, or other information repositories. • Machine learning and data mining are interested in the process of discovering knowledge that may be structurally or semantically more complex: models, graphs, new theorems or theories … in particular to assist scientific discovery.
Why Data Mining? Potential Applications • Database analysis and decision support • Market analysis and management • target marketing, customer relationship management, market basket analysis, cross selling, market segmentation • Risk analysis and management • Forecasting, customer retention, improved underwriting, quality control, competitive analysis • Fraud detection and management • Other Applications • Text mining (newsgroups, email, documents) and Web analysis. • Intelligent query answering. • Medical decision support.
Market Analysis and Management (1) • Where are the data sources for analysis? • Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies • Target marketing • Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc. • Determine customer purchasing patterns over time • Conversion of a single to a joint bank account: marriage, etc. • Cross-market analysis • Associations/correlations between product sales • Prediction based on the association information
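The cross-market analysis bullet above can be illustrated with a minimal sketch that counts how often two products appear in the same transaction and derives the support and confidence of each pairwise association. The transactions, product names, and the function name `pair_rule_stats` are hypothetical, made up for illustration; the slides do not prescribe a specific algorithm.

```python
from collections import Counter
from itertools import combinations


def pair_rule_stats(transactions):
    """Return {(a, b): (support, confidence)} for every rule a -> b over item pairs."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))          # ignore duplicates within one basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    stats = {}
    for (a, b), together in pair_counts.items():
        support = together / n               # share of baskets containing both items
        stats[(a, b)] = (support, together / item_counts[a])  # confidence of a -> b
        stats[(b, a)] = (support, together / item_counts[b])  # confidence of b -> a
    return stats


# Hypothetical transactions (illustrative data only)
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk"],
]
stats = pair_rule_stats(transactions)
print(stats[("butter", "bread")])
```

A rule with high confidence, such as butter -> bread in this toy data, is the kind of association that could feed the prediction step mentioned on the slide.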
Market Analysis and Management (2) • Customer profiling • data mining can tell you what types of customers buy what products (clustering or classification) • Identifying customer requirements • identifying the best products for different customers • use prediction to find what factors will attract new customers • Provides summary information • various multidimensional summary reports • statistical summary information (data central tendency and variation)
Corporate Analysis and Risk Management • Finance planning and asset evaluation • cash flow analysis and prediction • contingent claim analysis to evaluate assets • cross-sectional and time series analysis (financial-ratio, trend analysis, etc.) • Resource planning: • summarize and compare the resources and spending • Competition: • monitor competitors and market directions • group customers into classes and apply a class-based pricing procedure • set pricing strategy in a highly competitive market
Fraud Detection and Management (1) • Applications • widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc. • Approach • use historical data to build models of fraudulent behavior and use data mining to help identify similar instances • Examples • auto insurance: detect a group of people who stage accidents to collect on insurance • money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network) • medical insurance: detect professional patients, rings of doctors, and rings of references
Fraud Detection and Management (2) • Detecting inappropriate medical treatment • Australian Health Insurance Commission identifies that in many cases blanket screening tests were requested (save Australian $1m/yr). • Detecting telephone fraud • Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm. • British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud. • Retail • Analysts estimate that 38% of retail shrink is due to dishonest employees.
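The idea of analyzing patterns that deviate from an expected norm can be sketched as a simple outlier check on call durations. This is only an illustration under stated assumptions: the call history, the three-standard-deviation threshold, and the function name `flag_unusual_calls` are all hypothetical, and a real telephone call model would also use destination and time of day or week.

```python
from statistics import mean, stdev


def flag_unusual_calls(history_minutes, new_calls_minutes, threshold=3.0):
    """Return the new call durations that lie more than `threshold`
    sample standard deviations from the mean of the caller's history."""
    mu = mean(history_minutes)
    sigma = stdev(history_minutes)
    if sigma == 0:
        # No variation in the history: treat any different duration as unusual.
        return [d for d in new_calls_minutes if d != mu]
    return [d for d in new_calls_minutes if abs(d - mu) / sigma > threshold]


# Hypothetical call durations in minutes (illustrative data only)
history = [4, 5, 6, 5, 4, 6, 5, 5]
flagged = flag_unusual_calls(history, [5, 7, 45])
print(flagged)
```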
Other Applications • Sports • IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat • Astronomy • JPL and the Palomar Observatory discovered 22 quasars with the help of data mining • Internet Web Surf-Aid • IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.
Induction-based Learning The process of forming general concept definitions by observing specific examples of concepts to be learned.
Four Levels of Learning Facts Concepts Procedures Principles
Facts A fact is a simple statement of truth.
Concepts A concept is a set of objects, symbols, or events grouped together because they share certain characteristics.
Procedures A procedure is a step-by-step course of action to achieve a goal.
Principles Principles are general truths or laws that are basic to other truths.
Computers & Learning Computers are good at learning concepts. Concepts are the output of a data mining session.
Three Concept Views Classical View Probabilistic View Exemplar View
Classical View All concepts have definite defining properties.
Probabilistic View People store and recall concepts as generalizations created by observations.
Exemplar View People store and recall likely concept exemplars that are used to classify unknown instances.
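The exemplar view maps naturally onto nearest-neighbor classification: stored exemplars are compared with an unknown instance, and the label of the closest exemplar is returned. The sketch below assumes numeric features and Euclidean distance; the exemplars, labels, and the function name `classify_by_exemplar` are hypothetical.

```python
import math


def classify_by_exemplar(exemplars, instance):
    """exemplars: list of (feature_tuple, label) pairs.
    Returns the label of the stored exemplar closest to `instance`."""
    nearest = min(exemplars, key=lambda e: math.dist(e[0], instance))
    return nearest[1]


# Hypothetical stored exemplars for two concepts "A" and "B"
exemplars = [((1.0, 1.0), "A"), ((5.0, 5.0), "B")]
print(classify_by_exemplar(exemplars, (1.5, 2.0)))
print(classify_by_exemplar(exemplars, (4.0, 4.5)))
```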
Supervised Learning Build a learner model using data instances of known origin. Use the model to determine the outcome of new instances of unknown origin.
Supervised Learning: A Decision Tree Example
Decision Tree A tree structure where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.
Production Rules IF Swollen Glands = Yes THEN Diagnosis = Strep Throat IF Swollen Glands = No & Fever = Yes THEN Diagnosis = Cold IF Swollen Glands = No & Fever = No THEN Diagnosis = Allergy
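The three production rules above translate directly into code. The following minimal sketch encodes them as a function; the function name `diagnose` and the boolean encoding of the Yes/No attribute values are assumptions made for illustration.

```python
def diagnose(swollen_glands: bool, fever: bool) -> str:
    """Apply the three production rules from the slide, in order."""
    if swollen_glands:
        return "Strep Throat"   # IF Swollen Glands = Yes
    if fever:
        return "Cold"           # IF Swollen Glands = No & Fever = Yes
    return "Allergy"            # IF Swollen Glands = No & Fever = No


print(diagnose(swollen_glands=False, fever=True))
```

The order of the tests mirrors a path from the root of the decision tree to a leaf: the swollen-glands test comes first, and fever is checked only when that test is negative.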
Unsupervised Clustering A data mining method that builds models from data without predefined classes.
The “Acme Investors” Dataset of customers maintaining a brokerage account
The “Acme Investors” Dataset & Supervised Learning Can I develop a general profile of an online investor? Can I determine if a new customer is likely to open a margin account? Can I build a model to predict the average number of trades per month for a new investor? What characteristics differentiate female and male investors?
The “Acme Investors” Dataset & Supervised Learning Can I develop a general profile of an online investor? Output attribute: transaction method. Can I determine if a new customer is likely to open a margin account? Output attribute: margin account. Can I build a model to predict the average number of trades per month for a new investor? Output attribute: trades/month. What characteristics differentiate female and male investors? Output attribute: sex.
Alternative: The “Acme Investors” Dataset & Unsupervised Clustering
The “Acme Investors” Dataset & Unsupervised Clustering What attribute similarities group customers of Acme Investors together? What differences in attribute values segment the customer database?
Clustering • Clustering is the task of segmenting a heterogeneous population into a number of more homogeneous subgroups (clusters).
Clustering: Two Approaches • A clustering algorithm requires us to provide an initial best estimate about the total number of clusters in the data (supervised). • A clustering algorithm uses some method in an attempt to determine a best number of clusters (unsupervised).
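The first approach, where the number of clusters is supplied up front, can be sketched with a small k-means routine on one-dimensional data. The slides do not name k-means; it is used here only as a familiar example of this approach. The data values (imagined as trades per month), the seeding strategy, and the function name `kmeans_1d` are hypothetical.

```python
def kmeans_1d(values, k, iterations=20):
    """Cluster numbers into k groups; returns (centers, clusters)."""
    data = sorted(values)
    # Assumption: seed the centers with k evenly spaced values from the sorted data.
    step = (len(data) - 1) / (k - 1) if k > 1 else 0
    centers = [data[round(i * step)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster with the nearest center.
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers
    return centers, clusters


# Hypothetical trades-per-month values (illustrative data only)
centers, clusters = kmeans_1d([1, 2, 3, 20, 22, 24], k=2)
print(centers, clusters)
```

The second approach on the slide would wrap a routine like this in a search over candidate values of k, using some quality measure to pick the best one.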
Classification • Classification deals with discrete outcomes: yes or no; big or small; strange or not strange; yellow, green or red; etc. • Estimation is often used to perform a classification task: estimating the number of children in a family; estimating a family’s total household income; etc. • Neural networks and regression models are the best tools for classification/estimation
Prediction • Prediction is the same as classification or estimation, except that the records are classified according to some predicted future behavior or estimated future value. • Any of the techniques used for classification and estimation can be adapted for use in prediction.