DSCI 4520/5240 Data Mining
Some slide material taken from or inspired by: Groth, Han and Kamber, Cerrito, SAS
Introduction to DM “It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.” (Sir Arthur Conan Doyle: Sherlock Holmes, "A Scandal in Bohemia")
Nobel Laureate Calls Data Mining "A Must"
In a January 1999 interview with ComputerWorld, Dr. Penzias (winner of the 1978 Nobel Prize in Physics and vice president and chief scientist at Bell Laboratories) named large-scale data mining of very large databases as the key application for corporations in the next few years. Asked ComputerWorld's age-old question, "What will be the killer applications in the corporation?", Dr. Penzias replied: "Data mining." He then added: "Data mining will become much more important and companies will throw away nothing about their customers because it will be so valuable. If you're not doing this, you're out of business."
What Is Data Mining?
Data mining (knowledge discovery in databases):
• A process of identifying hidden patterns and relationships within data (Groth)
Data mining:
• Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
Motivation: “Necessity is the Mother of Invention”
Data explosion problem
• Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
We are drowning in data, but starving for knowledge!
Solution: Data warehousing and data mining
• Data warehousing and on-line analytical processing
• Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
Data Deluge
• hospital patient registries
• electronic point-of-sale data
• remote sensing images
• tax returns
• stock trades
• OLTP
• telephone calls
• airline reservations
• credit card charges
• catalog orders
• bank transactions
Data Mining, circa 1963
• IBM 7090
• 600 cases
“Machine storage limitations restricted the total number of variables which could be considered at one time to 25.”
Business Decision Support
• Database Marketing
• Target marketing
• Customer relationship management
• Credit Risk Management
• Credit scoring
• Fraud Detection
• Healthcare Informatics
• Clinical decision support
Required Expertise
• Domain
• Data
• Analytical Methods
Multidisciplinary
[Figure: diagram relating Data Mining to Statistics, Pattern Recognition, Neurocomputing, Machine Learning, AI, Databases and KDD]
What Is Data Mining?
• IT: Complicated database queries
• ML: Inductive learning from examples
• Stat: What we were taught not to do
Predictive Modeling
[Figure: data table with cases as rows, input variables as columns, and a target column]
Types of Targets
• Supervised Classification: event/no event (binary target) or class label (multiclass problem)
• Regression: continuous outcome
• Survival Analysis: time-to-event (possibly censored)
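The target types listed above can be illustrated with a tiny modeling table. This is a minimal sketch only: the `Case` class, its field names, and the credit-style values are all hypothetical and do not come from the course material.

```python
from dataclasses import dataclass


@dataclass
class Case:
    """One row of the modeling table: inputs plus several possible targets."""
    inputs: dict            # input variables for this case
    defaulted: bool         # binary target (event / no event)
    risk_class: str         # multiclass target (class label)
    loss_amount: float      # continuous target (regression)
    months_observed: int    # survival target: time to event or to censoring
    event_observed: bool    # False means the time is censored


# Hypothetical cases, invented purely for illustration.
cases = [
    Case({"income": 42_000, "age": 31}, True,  "high",   1800.0, 14, True),
    Case({"income": 85_000, "age": 47}, False, "low",       0.0, 36, False),
    Case({"income": 58_000, "age": 39}, False, "medium",    0.0, 36, False),
    Case({"income": 30_000, "age": 25}, True,  "high",   2500.0,  7, True),
]

# Simple summaries of each target type.
event_rate = sum(c.defaulted for c in cases) / len(cases)      # binary target
mean_loss = sum(c.loss_amount for c in cases) / len(cases)     # continuous target
censored = [c for c in cases if not c.event_observed]          # survival target
```

The same set of inputs can feed a classifier, a regression model, or a survival model; the choice of target column determines which kind of model applies.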
Why Data Mining? Potential Applications
Database analysis and decision support
• Market analysis and management: target marketing, customer relationship management, market basket analysis, cross selling, market segmentation
• Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
• Fraud detection and management
Other Applications
• Text mining (newsgroups, email, documents) and Web analysis
• Intelligent query answering
Market Analysis and Management (1)
Where are the data sources for analysis?
• Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
Target marketing
• Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
Cross-market analysis
• Associations/correlations between product sales
• Prediction based on the association information
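Associations between product sales are commonly summarized with support, confidence and lift. The following is a minimal sketch under stated assumptions: the baskets are invented, `pair_stats` is a hypothetical helper, and only item pairs are considered rather than a full association-rule algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each set is one customer's basket.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def pair_stats(transactions):
    """Return {(lhs, rhs): (support, confidence, lift)} for every item pair."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in t)
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    stats = {}
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]        # P(rhs | lhs)
            lift = confidence / (item_counts[rhs] / n)   # vs. baseline P(rhs)
            stats[(lhs, rhs)] = (support, confidence, lift)
    return stats


rules = pair_stats(transactions)
support, confidence, lift = rules[("diapers", "beer")]
```

A lift above 1 suggests the two products sell together more often than their individual popularity would predict, which is the kind of association that cross-market analysis looks for.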
Market Analysis and Management (2)
Customer profiling
• Data mining can tell you what types of customers buy what products (clustering or classification)
Identifying customer requirements
• Identifying the best products for different customers
• Use prediction to find what factors will attract new customers
Corporate Analysis and Risk Management
Finance planning and asset evaluation
• Cash flow analysis and prediction
• Contingent claim analysis to evaluate assets
• Cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
Resource planning
• Summarize and compare the resources and spending
Competition
• Monitor competitors and market directions
• Group customers into classes and apply a class-based pricing procedure
• Set pricing strategy in a highly competitive market
Fraud Detection and Management (1)
Applications
• Widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
Approach
• Use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
Examples
• Auto insurance: detect a group of people who stage accidents to collect on insurance
• Money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
• Medical insurance: detect professional patients, rings of doctors and rings of references
Fraud Detection and Management (2)
Detecting inappropriate medical treatment
• The Australian Health Insurance Commission found that in many cases blanket screening tests were requested (saving A$1m/yr).
Detecting telephone fraud
• Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
• British Telecom identified discrete groups of callers with frequent intra-group calls, especially on mobile phones, and broke a multimillion dollar fraud.
Retail
• Analysts estimate that 38% of retail shrink is due to dishonest employees.
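One simple way to look for patterns that deviate from an expected norm is to compare new observations with a caller's own history. The sketch below is an illustration only, not the method used in any of the cases above: the call durations are invented, and `flag_deviations` and its three-standard-deviation threshold are assumptions.

```python
from statistics import mean, stdev


def flag_deviations(history, new_values, threshold=3.0):
    """Return new values lying more than `threshold` sample standard
    deviations away from the mean of the historical values."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # No variation in the history: flag anything that differs from it.
        return [v for v in new_values if v != mu]
    return [v for v in new_values if abs(v - mu) / sigma > threshold]


# Hypothetical call durations (minutes) for one caller.
history = [3, 5, 4, 6, 5, 4, 5, 3, 6, 4]
new_calls = [5, 4, 95, 6]

suspicious = flag_deviations(history, new_calls)
```

A real call model would combine several attributes (destination, duration, time of day or week) rather than a single one, and flagged calls would go to an investigator rather than being treated as confirmed fraud.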
On the News: Can Data Mining save America’s schools?
On the News: State Agency Nabs Abuse of Food Stamps
JUNE 21, 2004 (COMPUTERWORLD) - The state of Louisiana issues food stamp purchase cards to 600,000 people a year -- but the recipients don't always use them to buy food. So program administrators have started using BI tools to detect suspicious activity for follow-up investigations. When swiped at the point of sale, the purchase card creates a transactional record that's forwarded to the Louisiana Department of Social Services in Baton Rouge. This information is offloaded to a SQL Server data warehouse, where it can be sliced and diced using Information Builders Inc.'s WebFocus query software. Investigators can scan the data by geography, purchase amount and other variables to detect "signatures of fraud," says Duane Fontenot, the department's IT director. The system has been feeding data to 40 fraud investigators for the past eight months and has information about every participating store. For instance, agents using the digital map can see where certain transactions are taking place by parish, city or even larger areas. If a food stamp recipient frequently travels 60 miles to use the card at one store -- passing 30 other stores on the way -- that could indicate a scheme to sell the cards for cash, Fontenot says. In one instance, investigators uncovered a criminal network that was converting the stamps into currency that was then wired to overseas banks. When culprits are faced with such evidence, Fontenot says, "usually, they just confess."
Other Applications
Sports
• IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for the New York Knicks and Miami Heat
Astronomy
• JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
Internet Web Surf-Aid
• IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.
Data Mining: A KDD Process
• Data mining: the core of the knowledge discovery process.
[Figure: KDD process flow: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge]
Steps of a KDD Process
Learning the application domain
• Relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing (may take 60% of effort!)
Data reduction and transformation
• Find useful features, dimensionality/variable reduction, invariant representation
Choosing functions of data mining
• Summarization, classification, regression, association, clustering
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
• Visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
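The middle steps above (selection, cleaning, transformation, mining, evaluation) can be sketched as a chain of small functions. This is a toy illustration under stated assumptions: the records, field names, function names and the 0.5 cutoff are all hypothetical, and the "mining" step is a simple summarization rather than a real mining algorithm.

```python
# Hypothetical raw records; None marks a missing value.
raw = [
    {"customer": "A", "region": "north", "spend": 120.0},
    {"customer": "B", "region": "north", "spend": None},
    {"customer": "C", "region": "south", "spend": 300.0},
    {"customer": "D", "region": "south", "spend": 280.0},
    {"customer": "E", "region": "north", "spend": 100.0},
]


def select(records, fields):
    """Create the target data set: keep only task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]


def clean(records):
    """Data cleaning: drop records that have missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def transform(records):
    """Transformation: derive a binary feature from a numeric one."""
    cutoff = sum(r["spend"] for r in records) / len(records)
    return [{**r, "high_spender": r["spend"] > cutoff} for r in records]


def mine(records):
    """Data mining (summarization): share of high spenders per region."""
    by_region = {}
    for r in records:
        by_region.setdefault(r["region"], []).append(r["high_spender"])
    return {region: sum(flags) / len(flags) for region, flags in by_region.items()}


def evaluate(patterns, min_share=0.5):
    """Pattern evaluation: keep only regions with a notable share."""
    return {region: share for region, share in patterns.items() if share >= min_share}


patterns = evaluate(mine(transform(clean(select(raw, ["region", "spend"])))))
```

Even in this toy version most of the code prepares the data rather than mines it, which echoes the note above that cleaning and preprocessing can take most of the effort.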
Data Mining and Business Intelligence
[Figure: layered diagram, with potential to support business decisions increasing toward the top. Layers, top to bottom:]
• Making Decisions
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts: OLAP, MDA
• Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles shown alongside the layers, top to bottom: End User, Business Analyst, Data Analyst, DBA