Chapter 1 Initial Description of Data Mining in Business. Prepared by: Dr. Tsung-Nan Tsai. Contents. Introduces data mining concepts Presents typical business data applications Explains the meaning of key concepts Gives a brief overview of data mining tools
Chapter 1 Initial Description of Data Mining in Business Prepared by: Dr. Tsung-Nan Tsai
Contents • Introduces data mining concepts • Presents typical business data applications • Explains the meaning of key concepts • Gives a brief overview of data mining tools • Outlines the remaining chapters of the book
Definition • DATA MINING: exploration & analysis • by automatic means • of large quantities of data • to discover actionable patterns & rules • Refers to the analysis of the large quantities of data that are stored in computers • Data mining is a way to use the massive quantities of data that businesses generate • GOAL - improve marketing, sales, and customer support through better understanding of customers
Retail Outlets • Bar coding & scanning generate masses of data • customer service (grocery stores can quickly process purchases and accurately determine product prices) • inventory control (determine the quantity of each product on hand; supply chain management) • MICROMARKETING • CUSTOMER PROFITABILITY ANALYSIS • MARKET-BASKET ANALYSIS
Political Data Mining (Grossman et al., Time, 10/18/2004, 38) • 2004 Election • Republicans: VoterVault • From Mid-1990s • About 165 million voters • Massive get-out-the-vote drive for those expected to vote Republican • Democrats: Demzilla • Also about 165 million voters • Names typically have 200 to 400 information items
Medical Diagnosis (J. Morris, Health Management Technology, Nov 2004, 20, 22-24) • Electronic Medical Records • Associated Cardiovascular Consultants • 31 physicians • 40,000 patients per year, southern New Jersey • Data mined to identify efficient medical practice • Enhance patient outcomes • Reduced medical liability insurance
Mayo Clinic (Swartz, Information Management Journal, Nov/Dec 2004, 8) • IBM developed EMR program • Complete records on almost 4.4 million patients • Doctors can ask how the last 100 Mayo patients with the same gender, age, and medical history responded to a particular treatment.
Business Uses of Data Mining • Toyota used data mining of its data warehouse to determine more efficient transportation routes, reducing time-to-market by an average of 19 days. • Banks used data mining in soliciting credit card customers. • Insurance and telecommunication companies used DM to detect fraud. • Manufacturing firms used DM in quality control. • Many …..
Business Uses of Data Mining • Customer profiling • Identify the most profitable subsets of customers • Targeting • Determine characteristics of most profitable customers • Market-Basket Analysis • Determine correlation of purchases by profile (customers) • Cross-selling • Part of Customer Relationship Management
What is needed to do DM? • DM requires the identification of a problem, along with data collection that can lead to a better understanding of the market. • Computer models provide statistical or other means of analysis. • Two general types of DM studies: • Hypothesis testing: expressing a theory about the relationship between actions and outcomes, then testing it. • Knowledge discovery: a preconceived notion may not be present; rather, relationships are identified by looking at the data (correlation analysis).
Reasons why Data Mining is now effective • Data are there • Data are warehoused (computerized) • Walmart: 35 thousand queries per week • Computing economically available • Competitive pressure • Commercial products available
Trends • Every business is a service • hotel chains record your preferences • car rental companies the same • service versus price • credit card companies • long distance providers • airlines • computer retailers
Trends • Information as Product • Custom Clothing Technology Corporation • fit jeans, other clothing • INFORMATION BROKERING • IMS - collects prescription data from pharmacies, sells to drug firms • AC Nielsen - TV
Trends • Commercial Software Available • using statistical, artificial intelligence tools that have been developed • Enterprise Miner SAS • Intelligent Miner IBM • Clementine SPSS • PolyAnalyst Megaputer • Specialty products
Fingerhut’s DM models • Fingerhut used segmentation, decision tree, regression analysis, and neural modeling tools, with SAS supplying the regression analysis tools and SPSS the neural network tools. • The segmentation model combines order and basic demographic data with Fingerhut’s product offerings. • Neural network models are used to identify patterns in mailings and in filling telephone orders. • Goals: • Create new mailings targeted at customers with the greatest potential payoff. • Create catalogs containing products the targeted customers are interested in, such as furniture, telephones…
How Data Mining Is Being Used • U.S. Government • track down Oklahoma City bombers, Unabomber, many others • Treasury department - international funds transfers, money laundering • Internal Revenue Service
How Data Mining Is Used • Firefly • asks members to rate music and movies • subscribers clustered • clusters get custom-designed recommendations
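The clustering step on this slide can be sketched with a small k-means routine. This is only an illustration under stated assumptions: the ratings, the two item categories, and the starting centroids are invented, and nothing here claims to be the method Firefly actually used.

```python
# Hypothetical ratings on a 1-5 scale, one tuple per subscriber:
# (rating for jazz, rating for action films). Invented for illustration.
ITEMS = ("jazz", "action films")
ratings = [(5, 1), (4, 1), (5, 2), (1, 5), (2, 4), (1, 4)]

def sq_dist(p, q):
    """Squared Euclidean distance between two rating vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Component-wise average of a non-empty list of rating vectors."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean_point(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

def recommend(centroid):
    """Recommend the item category the cluster rates highest on average."""
    return ITEMS[max(range(len(ITEMS)), key=lambda i: centroid[i])]

centroids, clusters = kmeans(ratings, [(5, 1), (1, 5)])
for centroid, members in zip(centroids, clusters):
    print(len(members), "subscribers ->", recommend(centroid))
```

Each cluster then receives the recommendation derived from its own centroid, which mirrors the "clusters get custom-designed recommendations" bullet.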
Warranty Claims Routing • Diesel engine manufacturer • stream of warranty claims • examine each by expert • determine whether charges are reasonable & appropriate • think of expert system to automate claims processing
Retailing • Affinity positioning is based upon the identification of products that the same customer is likely to want. • Cold medicine & tissues • Cross-selling: knowledge of products that go together can be used to market the complementary product. • Grocery stores do that through product shelf positioning. • Grocery stores generate mountains of cash register data. Current technology enables grocers to look at customers who have defected from a store, their purchase history, and characteristics of other potential defectors.
Cross-selling • USAA • insurance • doubled number of products held by average customer due to data mining • detailed records on customers • predict products they might need • Fidelity Investments • regression - what makes customer loyal
Banking • CRM involves the application of technology to monitor customer service, a function that is enhanced through data mining support. • DM applications in finance include predicting the prices of equities in a dynamic environment with surprise information, some of which might be inaccurate … • Only 3% of the customers at Norwest Bank provided 44% of its profits. • CRM products enable banks to define and identify customer and household relationships.
Retaining Good Customers • Customer loss: • Banks - Attrition • Cellular Phone Companies - Churn • study who might leave, why • Southern California Gas • customer usage, credit information • direct mail contact - most likely best billing plan • who is price sensitive • Who should get incentives, whom to keep
Credit card management • Bank credit card marketing promotions typically generate 1,000 responses to mailed solicitations, a response rate of about 1%. The rate is improved significantly through data mining analysis. • DM tools used by banks include credit scoring, a quantified analysis of credit applicants with respect to predictions of on-time loan repayment (data covering deposits, savings, loans, credit cards, insurance…). • These credit scores can be used to make accept/reject recommendations, as well as to establish the size of a credit line. • ATMs could be set up to deliver electronic sales pitches for products that a particular customer is likely to be interested in.
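The credit scoring idea above (a quantified score, an accept/reject recommendation, and a credit line size) can be sketched as a simple points-based scorecard. Everything specific here is a labeled assumption: the field names, point weights, cutoff, and line formula are hypothetical, whereas a real scorecard would be fitted to historical repayment data.

```python
# Hypothetical point weights, invented for illustration only.
WEIGHTS = {"years_as_customer": 5, "on_time_payments": 2, "late_payments": -10}

def credit_score(applicant, weights=WEIGHTS):
    """Weighted sum of the applicant's attributes."""
    return sum(w * applicant[field] for field, w in weights.items())

def decide(score, cutoff=50, base_line=1000, line_per_point=100):
    """Return (recommendation, credit line); cutoff and amounts are assumed."""
    if score < cutoff:
        return "reject", 0
    return "accept", base_line + line_per_point * (score - cutoff)

good = {"years_as_customer": 4, "on_time_payments": 24, "late_payments": 1}
risky = {"years_as_customer": 1, "on_time_payments": 3, "late_payments": 4}

for name, applicant in (("good", good), ("risky", risky)):
    score = credit_score(applicant)
    print(name, score, decide(score))
```

The same score drives both decisions: whether to accept, and how large a line to offer.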
Fairbank & Morris • Credit card company’s most valuable asset: • INFORMATION ABOUT CUSTOMERS • Signet Banking Corporation • obtained behavioral data from many sources • built predictive models • aggressively marketed balance transfer card • First Union • who will move soon - improve retention
Telecommunications • Retention of customers for telemarketing is very difficult. The phenomenon of a customer switching carriers is referred to as churn, a fundamental concept in telemarketing as well as in other fields. • One communications company estimated that 1/3 of churn is due to poor call quality, and up to 1/2 is due to poor equipment. • A cellular fraud prevention system monitors traffic to spot problems with faulty telephones. When a telephone begins to go bad, telemarketing personnel are alerted to contact the customer and suggest bringing the equipment in for service. • Another way to reduce churn is to protect customers from subscription and cloning (duplication) fraud. Fraud prevention systems provide verification that is transparent to legitimate subscribers.
Human resource management • Business intelligence is a way to truly understand markets, competitors, and processes. • Software technology such as data warehouses, data marts, online analytical processing (OLAP), and data mining can be used to improve a firm’s profitability. • In HRM, the analysis can lead to the identification of individuals who are liable to leave the company unless additional compensation or benefits are provided. • HRM would identify the right people so that organizations could treat them well and retain them (reduce churn).
Methodology and Tools • Analyzing data • Given management goals, and given that management can translate knowledge into action
Basic Styles • Top-Down: HYPOTHESIS TESTING • SUPERVISED • have a theory, experiment to prove or disprove • SCIENCE • Bottom-Up: KNOWLEDGE DISCOVERY • UNSUPERVISED • start with data, see new patterns • CREATIVITY
Hypothesis Testing • Generate theory • Determine data needed • Get data • Prepare data • Build computer model • Evaluate model results • confirm or reject hypotheses
Generate Theory • Systematically tie different input sources together (MENTAL MODEL) • What causes sales volume? • sales rep performance • economy, seasonality • product quality, price, promotion, location
Generate Theory • Brainstorm: • diverse representatives for broad coverage of perspectives (electronic) • keep under control (keep positive) • generate testable hypotheses
Define Data Needed • Determine data needed to test hypothesis • Lucky - query existing database • More often - gather • pull together from diverse databases, survey, buy
Locate Data • Usually scattered or unavailable • Sources: warranty claims • point-of-sale data (cash register records) • medical insurance claims • telephone call detail records • direct mail response records • demographic data, economic data • PROFILE: counts, summary statistics, cross-tabs, cleanup
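The PROFILE step (counts, cross-tabs, cleanup) can be sketched with the standard library. The records and field names below are hypothetical, made up to show how a quick profile exposes a value that needs cleanup.

```python
from collections import Counter

# Hypothetical direct mail response records, invented for illustration.
records = [
    {"region": "north", "responded": "yes"},
    {"region": "north", "responded": "no"},
    {"region": "south", "responded": "no"},
    {"region": "south", "responded": "no"},
    {"region": "north", "responded": "yes"},
    {"region": "south", "responded": None},   # missing value to clean up
]

def counts(records, field):
    """Frequency of each value of one field, missing values included."""
    return Counter(r[field] for r in records)

def crosstab(records, row, col):
    """Frequency of each (row value, column value) combination."""
    return Counter((r[row], r[col]) for r in records)

print(counts(records, "responded"))
print(crosstab(records, "region", "responded"))
```

The `None` entry shows up as its own category in both outputs, which is the kind of issue profiling is meant to surface before modeling.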
Prepare Data for Analysis • Summarize: • too much - no discriminant information • too little - swamped with useless detail • Process for computer: ASCII, spreadsheet • Data encoding: how data are recorded can vary - may have been collected with a specific purpose • Textual data: avoid if possible (may need to code) • Missing values: missing salary - use mean?
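The "missing salary - use mean?" question can be made concrete with a short sketch. The salary figures are invented, and mean imputation is shown only as one option: it keeps the average unchanged but understates the spread, so the sketch also returns a flag marking which values were filled in.

```python
from statistics import mean

def impute_mean(values):
    """Replace None with the mean of the known values.

    Returns (filled values, flags), where a True flag marks an imputed entry.
    """
    known = [v for v in values if v is not None]
    fill = mean(known)
    filled = [fill if v is None else v for v in values]
    flags = [v is None for v in values]
    return filled, flags

# Hypothetical salaries with two missing entries.
salaries = [40000, None, 60000, 50000, None]
filled, was_imputed = impute_mean(salaries)
print(filled)
print(was_imputed)
```

Keeping the flags lets a later model treat "salary was missing" as information in its own right.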
Build and Evaluate Model • Build Computer Model • Choose the appropriate modeling tools and algorithms • Training and test data sets • Determine if hypotheses supported • statistical practice • test rule-based systems for accuracy • Requires both business and analytic knowledge
SUPERVISED (Dorn, National Underwriter, Oct 18, 2004, 34, 39) • Health care fraud • Use statistics to identify indicators of fraud or abuse • Can rapidly sort through large databases • Identify patterns different from norm • Moderately successful • But only effective on schemes already detected • To benefit firm, need to identify fraud before paying claim
Knowledge Discovery • Machine learning? • Usually need intelligent analyst • Directed: explain value of some variable • Undirected: no dependent variable selected • identify patterns • Use undirected to recognize relationships; use directed to explain once found
Directed • Goal-oriented • Examples: • If discount applies, impact on products • Who is likely to purchase credit insurance? • Predicted profitability of new customer • What to bundle with a particular package • Identify sources of preclassified data • Prepare data for analysis • Build & train computer model • Evaluate
Identify Data Sources • Best - existing corporate data warehouse • data clean, verified, consistent, aggregated • Usually need to generate • most data in form most efficient for designed purpose • historical sales data often purged for dormant customers (but you need that information)
Prepare Data • Put in needed format for computer • Make consistent in meaning • Need to recognize what data are missing • change in balance = new – old • add missing but known-to-be-important data • Divide data into training, test, evaluation • Decide how to treat outliers • statistically biasing, but may be most important
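Two of the steps above, adding the derived "change in balance" field and dividing the data into training, test, and evaluation sets, can be sketched as follows. The record layout, the 60/20/20 proportions, and the fixed seed are assumptions made for the example, not values from the slides.

```python
import random

def add_balance_change(record):
    """Return a copy of the record with change in balance = new - old."""
    enriched = dict(record)
    enriched["balance_change"] = record["new_balance"] - record["old_balance"]
    return enriched

def split_three_ways(records, train=0.6, test=0.2, seed=0):
    """Shuffle, then cut into training, test, and evaluation sets.

    The evaluation set receives whatever remains after the first two cuts.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_test = round(len(shuffled) * test)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_test],
            shuffled[n_train + n_test:])

# Hypothetical customer balances, invented for illustration.
customers = [{"id": i, "old_balance": 100 * i,
              "new_balance": 100 * i + 10 * (i % 5)} for i in range(100)]
prepared = [add_balance_change(c) for c in customers]
training, test, evaluation = split_three_ways(prepared)
print(len(training), len(test), len(evaluation))
```

Shuffling before cutting avoids a split that simply follows the order in which the records were stored.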
Build & Train Model • Regression - human builds (selects IVs) • Automatic systems train • give it data, let it hammer • OVERFITTING: model fits the training data but not new data • TEST SET: a means to evaluate model against data not used in training • tune weights before using to evaluate
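A minimal sketch of the regression case, where the analyst picks the independent variable, fits on training data, and then checks the fit against a test set that was not used in training. The variable names (promotion spend, sales) and all numbers are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one independent variable: y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_abs_error(slope, intercept, xs, ys):
    """Average absolute gap between predicted and actual values."""
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: promotion spend (IV chosen by the analyst) vs. sales.
train_x, train_y = [1, 2, 3, 4], [3, 5, 7, 9]
test_x, test_y = [5, 6], [11, 14]

slope, intercept = fit_line(train_x, train_y)
print("training error:", mean_abs_error(slope, intercept, train_x, train_y))
print("test error:", mean_abs_error(slope, intercept, test_x, test_y))
```

A test error well above the training error is the signal that the model has fitted the training data more closely than the underlying relationship warrants.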
Evaluate Model • ERROR RATE: proportion of classifications in evaluation set that were wrong • too little training: poor fit on training data and poor error rate • optimal training: good fit on both • too much training: great fit on training data and poor error rate
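The error rate defined above, and the contrast between a good training fit and a poor evaluation result, can be shown on a toy example. The data are invented; the nearest-neighbour model stands in for "too much training" because it reproduces every training label exactly, and the threshold rule stands in for a simpler model.

```python
def error_rate(predicted, actual):
    """Proportion of classifications that were wrong."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum(p != a for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical (feature, label) pairs; x=3 and x=4 carry noisy labels.
train = [(1, "no"), (2, "no"), (3, "yes"), (4, "no"), (5, "yes"), (6, "yes")]
evaluation = [(1.4, "no"), (2.6, "no"), (3.4, "no"), (4.4, "yes"), (5.6, "yes")]

def nearest_neighbour(x):
    """Memorizing model: copy the label of the closest training point."""
    return min(train, key=lambda rec: abs(rec[0] - x))[1]

def threshold_rule(x):
    """Simple model: predict yes at or above an assumed cut point of 3.5."""
    return "yes" if x >= 3.5 else "no"

for model in (nearest_neighbour, threshold_rule):
    fit = error_rate([model(x) for x, _ in train], [y for _, y in train])
    held_out = error_rate([model(x) for x, _ in evaluation],
                          [y for _, y in evaluation])
    print(model.__name__, "training:", fit, "evaluation:", held_out)
```

The memorizing model has no training errors but misclassifies most of the evaluation set, while the simpler rule misses the two noisy training points and classifies the evaluation set correctly.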
Undirected Discovery • What items sell together? Strawberries & cream • Directed: What items sell with tofu? tabasco • Long distance caller market segmentation • Uniform usage - weekday & weekend, spikes on holidays • After segmentation: • high & uniform except for several months of nothing
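The two questions on this slide can be sketched with basket counts: the undirected question scans every pair of items, while the directed question fixes one item (tofu) and asks what accompanies it. The baskets are invented for illustration, and lift is used here as one common measure of how much more often two items appear together than chance would suggest.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets, invented for illustration.
baskets = [
    {"strawberries", "cream", "milk"},
    {"strawberries", "cream"},
    {"tofu", "tabasco", "rice"},
    {"tofu", "tabasco"},
    {"milk", "bread"},
    {"strawberries", "cream", "bread"},
]

def pair_counts(baskets):
    """Undirected: count how often every pair of items is bought together."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    return counts

def lift(baskets, a, b):
    """P(a and b) / (P(a) * P(b)); values above 1 suggest affinity."""
    n = len(baskets)
    p_a = sum(a in basket for basket in baskets) / n
    p_b = sum(b in basket for basket in baskets) / n
    p_ab = sum(a in basket and b in basket for basket in baskets) / n
    return p_ab / (p_a * p_b)

def sells_with(baskets, item):
    """Directed: which items appear in baskets that contain `item`?"""
    counts = Counter()
    for basket in baskets:
        if item in basket:
            counts.update(basket - {item})
    return counts.most_common()

print(pair_counts(baskets).most_common(1))
print(lift(baskets, "strawberries", "cream"))
print(sells_with(baskets, "tofu"))
```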
UNSUPERVISED (Dorn, National Underwriter, Oct 18, 2004, 34, 39) • Health care fraud • Look at historical claim submissions • Build ad hoc model to compare with current claims • Assign similarity score to fraudulent claims • Predict fraud potential
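One simple way to compare current claims with historical submissions is to summarize the history as a norm and score each new claim by how far it sits from that norm. This is a stand-in chosen for illustration, not the scoring method the cited article describes; the claim amounts and the threshold of 3 standard deviations are assumptions.

```python
from statistics import mean, pstdev

def build_baseline(historical_amounts):
    """Summarize historical claims as (mean, population standard deviation)."""
    return mean(historical_amounts), pstdev(historical_amounts)

def deviation_score(amount, baseline):
    """Distance from the historical norm, in standard deviations."""
    centre, spread = baseline
    return abs(amount - centre) / spread

def flag_claims(current_amounts, baseline, threshold=3.0):
    """Return current claims whose score exceeds the assumed threshold."""
    return [a for a in current_amounts
            if deviation_score(a, baseline) > threshold]

# Hypothetical claim amounts, invented for illustration.
historical = [100, 110, 95, 105, 90, 100]
current = [102, 98, 180]

baseline = build_baseline(historical)
print(flag_claims(current, baseline))
```

Because the score is computed before payment, flagged claims can be routed for review rather than recovered after the fact.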
Undirected Process • Identify data sources • Prepare data • Build & train computer model • Evaluate model • Apply model to new data • Identify potential targets for undirected • Generate new hypotheses to test
Generate hypotheses • Any commonalities in data? • Are they useful? • Many adults watch children’s movies • chaperones are an important market segment • they probably make final decision • When hypothesis is generated, that determines data needed
Bank Case Study • Directed knowledge discovery to recognize likely prospects for home equity loan • training set - current loan holders • developed model for propensity to borrow • got continuous scores, ranked customers • sent top 11% material • Undirected: segmented market into clusters • in one, 39% had both business & personal accounts • cluster had 27% of the top 11% • Hypothesis: people use home equity to start business
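The two steps in this case study, ranking customers by propensity score to pick the top 11% and then checking how much of that group falls in one cluster, can be sketched as below. The scores and cluster labels are synthetic and will not reproduce the 27% or 39% figures from the slide.

```python
def top_fraction(scores, fraction):
    """IDs of the highest-scoring `fraction` of customers, best first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:round(len(ranked) * fraction)]

def cluster_share(selected, cluster_of, cluster):
    """Share of the selected customers that belong to the given cluster."""
    return sum(cluster_of[c] == cluster for c in selected) / len(selected)

# Synthetic propensity scores and cluster labels for 100 customers.
scores = {f"c{i}": i / 100 for i in range(100)}
cluster_of = {f"c{i}": ("A" if i % 4 == 0 else "B") for i in range(100)}

mailing_list = top_fraction(scores, 0.11)
print(len(mailing_list), mailing_list[:3])
print(cluster_share(mailing_list, cluster_of, "A"))
```

A cluster that holds a much larger share of the top-ranked group than of the customer base as a whole is the kind of overlap that prompted the home-equity hypothesis.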
Data mining products and data sets • A good source for viewing current DM products is www.KDnuggets.com. • The UCI Machine Learning Repository is a source of very good data mining datasets at www.ics.uci.edu/~mlearn/MLOther.html. • Weka DM software at http://www.cs.waikato.ac.nz/ml/weka/ • Tanagra DM software at http://eric.univ-lyon2.fr/~ricco/tanagra/index.html