
Chapter 1 Initial Description of Data Mining in Business

Chapter 1 Initial Description of Data Mining in Business. Prepared by: Dr. Tsung-Nan Tsai. Contents: introduces data mining concepts; presents typical business data applications; explains the meaning of key concepts; gives a brief overview of data mining tools.


Presentation Transcript


  1. Chapter 1 Initial Description of Data Mining in Business Prepared by: Dr. Tsung-Nan Tsai

  2. Contents • Introduces data mining concepts • Presents typical business data applications • Explains the meaning of key concepts • Gives a brief overview of data mining tools • Outlines the remaining chapters of the book

  3. Definition • DATA MINING: exploration & analysis • by automatic means • of large quantities of data • to discover actionable patterns & rules • Refers to the analysis of the large quantities of data that are stored in computers. • Data mining is a way to use the massive quantities of data that businesses generate • GOAL - improve marketing, sales, customer support through better understanding of customers

  4. Retail Outlets • Bar coding & scanning generate masses of data • customer service (Grocery stores can quickly process the purchases and accurately determine product prices) • inventory control (Determine the quantity of items of each product on hand, supply chain management) • MICROMARKETING • CUSTOMER PROFITABILITY ANALYSIS • MARKET-BASKET ANALYSIS

  5. Political Data Mining Grossman et al., 10/18/2004, Time, 38 • 2004 Election • Republicans: VoterVault • From Mid-1990s • About 165 million voters • Massive get-out-the-vote drive for those expected to vote Republican • Democrats: Demzilla • Also about 165 million voters • Names typically have 200 to 400 information items

  6. Medical Diagnosis J. Morris, Health Management Technology Nov 2004, 20, 22-24 • Electronic Medical Records • Associated Cardiovascular Consultants • 31 physicians • 40,000 patients per year, southern New Jersey • Data mined to identify efficient medical practice • Enhance patient outcomes • Reduced medical liability insurance

  7. Mayo Clinic Swartz, Information Management Journal Nov/Dec 2004, 8 • IBM developed EMR program • Complete records on almost 4.4 million patients. • Doctors can ask for how last 100 Mayo patients with same gender, age, medical history responded to particular treatments.

  8. Business Uses of Data Mining • Toyota used data mining of its data warehouse to determine more efficient transportation routes, reducing time-to-market by an average of 19 days. • Banks used data mining in soliciting credit card customers. • Insurance and telecommunication companies used DM to detect fraud. • Manufacturing firms used DM in quality control. • Many …

  9. Business Uses of Data Mining 1. Customer profiling • Identify the profitability of customer subsets 2. Targeting • Determine characteristics of most profitable customers 3. Market-Basket Analysis • Determine correlation of purchases by profile (customers) 4. Cross-selling • Part of Customer Relationship Management

  10. What is needed to do DM? • DM requires the identification of a problem, along with data collection that can lead to a better understanding of the market. • Computer models provide statistical or other means of analysis. • Two general types of DM studies: • Hypothesis testing: expressing a theory about the relationship between actions and outcomes. • Knowledge discovery: a preconceived notion may not be present; rather, relationships are identified by looking at the data (correlation analysis).

  11. Reasons why Data Mining is now effective • Data are there • Data are warehoused (computerized) • Walmart: 35 thousand queries per week • Computing economically available • Competitive pressure • Commercial products available

  12. Trends • Every business is service • hotel chains record your preferences • car rental companies the same • service versus price • credit card companies • long distance providers • airlines • computer retailers

  13. Trends • Information as Product • Custom Clothing Technology Corporation • fit jeans, other clothing • INFORMATION BROKERING • IMS - collects prescription data from pharmacies, sells to drug firms • AC Nielsen - TV

  14. Trends • Commercial Software Available • using statistical, artificial intelligence tools that have been developed • Enterprise Miner SAS • Intelligent Miner IBM • Clementine SPSS • PolyAnalyst Megaputer • Specialty products

  15. Fingerhut’s DM models • Fingerhut used segmentation, decision tree, regression analysis, and neural network modeling tools, with SAS supplying the regression analysis tools and SPSS the neural network tools. • The segmentation model combines order and basic demographic data with Fingerhut’s product offerings. • Neural network models were used to identify mailing patterns and order-filling telephone call orders. • Goal: • Create new mailings targeted at customers with the greatest potential payoff. • Create a catalog containing products that those customers are interested in, such as furniture, telephones…

  16. How Data Mining Is Being Used • U.S. Government • track down Oklahoma City bombers, Unabomber, many others • Treasury department - international funds transfers, money laundering • Internal Revenue Service

  17. How Data Mining Is Used • Firefly • asks members to rate music and movies • subscribers clustered • clusters get custom-designed recommendations

  18. Warranty Claims Routing • Diesel engine manufacturer • stream of warranty claims • examine each by expert • determine whether charges are reasonable & appropriate • think of expert system to automate claims processing

  19. Data mining application area

  20. Retailing • Affinity positioning is based upon the identification of products that the same customer is likely to want. • Cold medicine → tissues • Cross-selling: The knowledge of products that go together can be used to market the complementary product. • Grocery stores do that through product shelf location. • Grocery stores generate mountains of cash register data. Current technology enables grocers to look at customers who have defected from a store, their purchase history, and characteristics of other potential defectors.
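
The affinity idea on this slide (which products does the same customer tend to want?) can be sketched as a count of how often items share a basket. This is a minimal illustration, not the method used by any retailer named in the deck: the function names `pair_support` and `confidence` and the four baskets are assumptions invented for the example.

```python
from collections import Counter
from itertools import combinations


def pair_support(baskets):
    """Return {(item_a, item_b): fraction of baskets containing both items}."""
    counts = Counter()
    for basket in baskets:
        # Sort so each unordered pair is always stored under the same key.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    total = len(baskets)
    return {pair: count / total for pair, count in counts.items()}


def confidence(baskets, antecedent, consequent):
    """Fraction of baskets holding `antecedent` that also hold `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


# Hypothetical cash register baskets, invented for illustration only.
baskets = [
    {"cold medicine", "tissues", "juice"},
    {"cold medicine", "tissues"},
    {"bread", "juice"},
    {"cold medicine", "bread"},
]

support = pair_support(baskets)
print(support[("cold medicine", "tissues")])
print(confidence(baskets, "cold medicine", "tissues"))
```

A pair with high support and high confidence is a candidate for shelf placement or cross-selling; real cash register data would need far more baskets before such a rule could be trusted.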

  21. Cross-selling • USAA • insurance • doubled number of products held by average customer due to data mining • detailed records on customers • predict products they might need • Fidelity Investments • regression - what makes customer loyal

  22. Banking • CRM involves the application of technology to monitor customer service, a function that is enhanced through data mining support. • DM applications in finance include predicting the prices of equities involving a dynamic environment with surprise information, some of which might be inaccurate … • Only 3% of the customers at Norwest bank provided 44% of their profits. • CRM products enable banks to define and identify customer and household relationships.

  23. Retaining Good Customers • Customer loss: • Banks - Attrition • Cellular Phone Companies - Churn • study who might leave, why • Southern California Gas • customer usage, credit information • direct mail contact - most likely best billing plan • who is price sensitive • Who should get incentives, whom to keep

  24. Credit card management • Bank credit card marketing promotions typically generate 1,000 responses to mailed solicitations, a response rate of about 1%. The rate is improved significantly through data mining analysis. • DM tools used by banks include credit scoring, which is a quantified analysis of credit applicants with respect to predictions of on-time loan repayment. (Data covering deposits, savings, loans, credit card, insurance…). • These credit scores can be used to make accept/reject recommendations, as well as to establish the size of a credit line. • ATMs could be set up with electronic sales pitches for products that a particular customer is likely to be interested in.

  25. Fairbank & Morris • Credit card company’s most valuable asset: • INFORMATION ABOUT CUSTOMERS • Signet Banking Corporation • obtained behavioral data from many sources • built predictive models • aggressively marketed balance transfer card • First Union • who will move soon - improve retention

  26. Telecommunications • Retention of customers for telemarketing is very difficult. The phenomenon of a customer switching carriers is referred to as churn, a fundamental concept in telemarketing as well as in other fields. • A communications company considered that 1/3 of churn is due to poor call quality, and up to ½ is due to poor equipment. • A cellular fraud prevention system monitors traffic to spot problems with faulty telephones. When a telephone begins to go bad, telemarketing personnel are alerted to contact the customer and suggest bringing the equipment in for service. • Another way to reduce churn is to protect customers from subscription and cloning (duplication) fraud. Fraud prevention systems provide verification that is transparent to legitimate subscribers.

  27. Human resource management • Business intelligence is a way to truly understand markets, competitors, and processes. • Software technology such as data warehouses, data marts, online analytical processing (OLAP), and data mining can be used to improve a firm’s profitability. • In HRM, the analysis can lead to the identification of individuals who are liable to leave the company unless additional compensation or benefits are provided. • HRM would identify the right people so that organizations could treat them well and retain them (reduce churn).

  28. Methodology and Tools Analyzing data Given management goals and that management can translate knowledge into action

  29. Basic Styles • Top-Down: HYPOTHESIS TESTING • SUPERVISED • have a theory, experiment to prove or disprove • SCIENCE • Bottom-Up: KNOWLEDGE DISCOVERY • UNSUPERVISED • start with data, see new patterns • CREATIVITY

  30. Hypothesis Testing • Generate theory • Determine data needed • Get data • Prepare data • Build computer model • Evaluate model results • confirm or reject hypotheses

  31. Generate Theory • Systematically tie different input sources together (MENTAL MODEL) • What causes sales volume? • sales rep performance • economy, seasonality • product quality, price, promotion, location

  32. Generate Theory • Brainstorm: • diverse representatives for broad coverage of perspectives (electronic) • keep under control (keep positive) • generate testable hypotheses

  33. Define Data Needed • Determine data needed to test hypothesis • Lucky - query existing database • More often - gather • pull together from diverse databases, survey, buy

  34. Locate Data • Usually scattered or unavailable • Sources: warranty claims • point-of-sale data (cash register records) • medical insurance claims • telephone call detail records • direct mail response records • demographic data, economic data • PROFILE: counts, summary statistics, cross-tabs, cleanup

  35. Prepare Data for Analysis • Summarize: too much - no discriminant information; too little - swamped with useless detail • Process for computer: ASCII, spreadsheet • Data encoding: how data are recorded can vary - may have been collected with specific purpose • Textual data: avoid if possible (may need to code) • Missing values: missing salary - use mean?
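
The slide raises mean substitution as one possible treatment for a missing salary. A minimal sketch of that option follows, assuming missing entries are recorded as `None`; the helper name `impute_mean` and the salary figures are hypothetical.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the known entries."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("no known values to average")
    fill = mean(known)
    return [fill if v is None else v for v in values]


# Hypothetical salary column with one missing entry.
salaries = [40000, None, 60000, 50000]
filled = impute_mean(salaries)
print(filled)
```

Mean substitution keeps the record usable but shrinks the column's variance, which is one reason the slide poses it as a question rather than a rule.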

  36. Build and Evaluate Model • Build Computer Model • Choose the appropriate modeling tools and algorithms • Training and test data sets. • Determine if hypotheses supported • statistical practice • test rule-based systems for accuracy • Requires both business and analytic knowledge

  37. SUPERVISED Dorn, National Underwriter Oct 18, 2004, 34,39 • Health care fraud • Use statistics to identify indicators of fraud or abuse • Can rapidly sort through large databases • Identify patterns different from norm • Moderately successful • But only effective on schemes already detected • To benefit firm, need to identify fraud before paying claim

  38. Knowledge Discovery • Machine learning? • Usually need intelligent analyst • Directed: explain value of some variable • Undirected: no dependent variable selected • identify patterns • Use undirected to recognize relationships; use directed to explain once found

  39. Directed • Goal-oriented • Examples: If discount applies, impact on products - who is likely to purchase credit insurance? Predicted profitability of new customer - what to bundle with a particular package • Identify sources of preclassified data • Prepare data for analysis • Build & train computer model • Evaluate

  40. Identify Data Sources • Best - existing corporate data warehouse • data clean, verified, consistent, aggregated • Usually need to generate • most data in form most efficient for designed purpose • historical sales data often purged for dormant customers (but you need that information)

  41. Prepare Data • Put in needed format for computer • Make consistent in meaning • Need to recognize what data are missing • change in balance = new – old • add missing but known-to-be-important data • Divide data into training, test, evaluation • Decide how to treat outliers • statistically biasing, but may be most important
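
Dividing the data into training, test, and evaluation sets can be sketched as a shuffled three-way split. The 60/20/20 proportions, the fixed seed, and the name `three_way_split` are assumptions chosen for illustration; the slides do not prescribe any particular proportions.

```python
import random


def three_way_split(records, train_frac=0.6, test_frac=0.2, seed=0):
    """Shuffle records and cut them into training, test, and evaluation sets."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    evaluation = shuffled[n_train + n_test:]  # whatever remains
    return train, test, evaluation


# Hypothetical record ids 0..99 standing in for customer rows.
train, test, evaluation = three_way_split(range(100))
print(len(train), len(test), len(evaluation))
```

Every record lands in exactly one set, so the evaluation set stays untouched while the model is trained and tuned.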

  42. Build & Train Model • Regression - human builds (selects IVs) • Automatic systems train • give it data, let it hammer • OVERFITTING: • fit the data • TEST SET a means to evaluate model against data not used in training • tune weights before using to evaluate

  43. Evaluate Model • ERROR RATE: proportion of classifications in evaluation set that were wrong • too little training: poor fit on training data and poor error rate • optimal training: good fit on both • too much training: great fit on training data and poor error rate
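
The error rate defined on this slide is simply the share of wrong classifications in the evaluation set. A minimal sketch, with hypothetical labels and an assumed function name `error_rate`:

```python
def error_rate(actual, predicted):
    """Proportion of predictions that differ from the actual class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("evaluation set is empty")
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)


# Hypothetical evaluation-set labels and model output.
actual = ["buy", "no", "buy", "no", "no"]
predicted = ["buy", "buy", "buy", "no", "buy"]
print(error_rate(actual, predicted))
```

Computing the same figure on the training set and on the evaluation set gives the comparison the slide describes: a low training error paired with a high evaluation error points to too much training.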

  44. Undirected Discovery • What items sell together? Strawberries & cream • Directed: What items sell with tofu? tabasco • Long distance caller market segmentation • Uniform usage - weekday & weekend, spikes on holidays • After segmentation: • high & uniform except for several months of nothing

  45. UNSUPERVISED Dorn, National Underwriter Oct 18, 2004, 34,39 • Health care fraud • Look at historical claim submissions • Build ad hoc model to compare with current claims • Assign similarity score to fraudulent claims • Predict fraud potential

  46. Undirected Process • Identify data sources • Prepare data • Build & train computer model • Evaluate model • Apply model to new data • Identify potential targets for undirected • Generate new hypotheses to test

  47. Generate hypotheses • Any commonalities in data? • Are they useful? • Many adults watch children’s movies • chaperones are an important market segment • they probably make final decision • When hypothesis is generated, that determines data needed

  48. Bank Case Study • Directed knowledge discovery to recognize likely prospects for home equity loan • training set - current loan holders • developed model for propensity to borrow • got continuous scores, ranked customers • sent top 11% material • Undirected: segmented market into clusters • in one, 39% had both business & personal accounts • cluster had 27% of the top 11% • Hypothesis: people use home equity to start business
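
The two steps in the bank case (rank customers by propensity score and keep the top slice, then check how much of that slice falls in one cluster) can be sketched as follows. All names, scores, and cluster labels are hypothetical, and the example keeps the top 30% only because it has ten customers; the case study used 11%.

```python
def top_percent(scores, percent):
    """Return the ids of the highest-scoring `percent` of customers."""
    k = len(scores) * percent // 100  # integer arithmetic avoids float rounding
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


def cluster_share(selected, cluster_of, cluster):
    """Fraction of the selected customers that belong to `cluster`."""
    return sum(cluster_of[c] == cluster for c in selected) / len(selected)


# Hypothetical propensity-to-borrow scores from a directed model.
scores = {
    "ann": 0.91, "bob": 0.35, "cho": 0.78, "dee": 0.12, "eli": 0.66,
    "fay": 0.83, "gus": 0.40, "hal": 0.05, "ida": 0.59, "jon": 0.27,
}
# Hypothetical cluster labels from an undirected segmentation.
cluster_of = {
    "ann": "business+personal", "bob": "personal", "cho": "business+personal",
    "dee": "personal", "eli": "personal", "fay": "personal",
    "gus": "business+personal", "hal": "personal", "ida": "personal",
    "jon": "personal",
}

selected = top_percent(scores, 30)
print(selected)
print(cluster_share(selected, cluster_of, "business+personal"))
```

A cluster that is over-represented among the top-ranked customers, as in the slide's 27% figure, is what prompts a new hypothesis to test.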

  49. Data mining products and data sets • A good source to view current DM products is www.KDNuggets.com. • The UCI Machine Learning Repository is a source of very good data mining datasets at www.ics.uci.edu/~mlearn/MLOther.html. • Weka DM software at http://www.cs.waikato.ac.nz/ml/weka/ • Tanagra DM software at http://eric.univ-lyon2.fr/~ricco/tanagra/index.html
