BI – what is it again??? • “Business Intelligence” is making purposeful use of data in decision making. • The goals of BI are: • To support human decision making by providing as much understandable, complete, relevant, well-organized information as necessary and helpful. • To automate some decisions to relieve humans of routine decision-making tasks. • To discover new issues/relationships/correlations that humans may not readily conceive on their own.
Online analytical processing tools • The vast majority of output from BI is OLAP-related. • Provide information to support both ad-hoc and consistent queries for managerial decision making. • Provide multi-dimensional data analysis techniques. • Perform data aggregations. • Provide introductory through advanced statistical analysis. • Support access to very large databases through additional data structures such as cubes (e.g., SQL Server Analysis Services). • Contain enhanced query optimization algorithms to speed up query processing (SQL Server Analysis Services).
OLAP Results • Continuum: output ranges from relatively standardized reports to ad-hoc queries. • Answers questions such as: • Which products sold the most quantity, by type of product and geographic region? • Which stores are currently most profitable? Which are least profitable? • Used frequently to support short- and long-term managerial decision making.
OLAP Visualization • Presented in standard displays that are accessed frequently. • Dashboard format used to provide a quick and comprehensive overview of business status. • Presented in Excel or other spreadsheet format. • Display the output using a standard report generator (Crystal Reports, Access, etc.). • Display the output graphically. • Give the data to the decision maker for further processing.
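A minimal sketch of the kind of aggregation behind an OLAP question such as "which products sold the most quantity, by type of product and geographic region?" The sales rows, the column layout, and the `rollup` helper are hypothetical and invented for illustration; a real OLAP tool would run this against a cube or data warehouse rather than a Python list.

```python
from collections import defaultdict

# Hypothetical sales facts: (product_type, region, quantity)
SALES = [
    ("bikes", "West", 120),
    ("bikes", "East", 80),
    ("helmets", "West", 200),
    ("helmets", "East", 150),
    ("bikes", "West", 30),
]

def rollup(rows, dims):
    """Sum quantity over the chosen dimensions (0 = product type, 1 = region)."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dims)
        totals[key] += row[2]
    return dict(totals)

# Two views of the same facts: a detailed one and a coarser roll-up
by_type_and_region = rollup(SALES, dims=(0, 1))
by_region = rollup(SALES, dims=(1,))

print(by_type_and_region)
print(by_region)
```

Changing `dims` is a rough stand-in for the drill-down and roll-up operations that multi-dimensional analysis tools provide.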
Data mining tools • Data mining is the set of activities used to find new, “hidden” or unexpected patterns in data. • Data mining tools: • use large sets of data; • uncover patterns based on statistical and artificial intelligence algorithms; • form computer models based on the findings; and • use the models to predict business behavior. • Common synonyms for data mining include knowledge discovery, information harvesting, & pattern analysis. • Proactive tools, used for discovery and prediction.
Data Mining Results • Generates information about patterns in data. • Data mining provides answers to previously ambiguous questions, but a question area must be defined. • May produce information such as: • Which products should be promoted to a pre-defined type/category of customer? • Which patients have the greatest likelihood of being hospitalized within the next year? • Which securities are the most profitable to buy/sell in a particular environment?
Data Mining Visualization • Focus is on discovery and analysis, rather than reporting, monitoring or communicating a message. • Uses primarily graphical output to display the patterns. • Included as part of the data mining tool. • Can also incorporate the results in standardized reporting tools and/or dashboards, but information is already “discovered” by that time.
Is it OLAP or Data Mining?? • How many people in a given geographic area between the ages of 15-30 were diagnosed with type 2 diabetes last year? • What is the quantity breakdown by county in the U.S. for people diagnosed with type 2 diabetes? • What demographic factors are related to type 2 diabetes? • What is the relationship between weight, exercise, age, smoking, and the prevalence of type 2 diabetes? • How many people in a given geographic region will be diagnosed with type 2 diabetes within the next two years?
Is it OLAP or data mining?? (TEC) • How many different customers did we serve? How many applicants did we place? • Which customer was our most profitable? • Which customers have the greatest likelihood of increasing their number of temporary employees next year? • Which geographic region was our most profitable last quarter? • Which geographic region has the fastest growth rate measured by number of employees placed over the last 3 years?
What are some OLAP and data mining questions for the Renown data set??
Financial Institution: Business Problem • Business problem: $10 million in ATM fees. • Goal: • Identify customers who use other banks’ ATMs.
Data Preparation • Detailed ATM Transaction Data • Data Exploration • Created Analytic Data • Model Development • Factor Analysis • Clustering • Location Analysis Cluster Analysis
New Intelligence • Segment 10: 16% of customers incurring 50% of reciprocity fees (college students). • [Chart: percentage of cost and percentage of customers, by customer segment ID] • Saved $8.2 million in expenses!
Case from book: Data Mining Goes to Hollywood! • A typical classification problem. • [Figure: dependent variable and independent variables]
Data Mining Applications • Customer Relationship Management • Maximize return on marketing campaigns • Improve customer retention (churn analysis) • Maximize customer value (cross- or up-selling) • Identify and treat most valued customers • Banking & Other Financial • Detect fraudulent transactions • Maximize customer value (cross- and up-selling) • Optimize cash reserves with forecasting
Data Mining Applications (cont.) • Retailing and Logistics • Optimize inventory levels at different locations • Improve the store layout and sales promotions • Optimize logistics by predicting seasonal effects • Manufacturing and Maintenance • Predict/prevent machinery failures • Identify anomalies in production systems to optimize manufacturing capacity • Discover novel patterns to improve product quality
Data Mining Applications (cont.) • Brokerage and Securities Trading • Predict changes on certain bond prices • Forecast the direction of stock fluctuations • Assess the effect of events on market movements • Identify and prevent fraudulent activities in trading • Insurance • Forecast claim costs for better business planning • Determine optimal rate plans • Optimize marketing to specific customers
Data Mining Process • Select a topic: understand the business need. • Identify target dataset: understand the available data. • Transform data: clean, derive, extract. • Mine the data: use the software. • Build an analytical model: may require more than one model. • Interpret and refine: test and evaluate results from the model. • Predict (if prediction is the goal).
Videos of data mining in SAS and SQL Server Example of predicting wine quality with decision trees - SAS http://www.youtube.com/watch?v=Nj4L5RFvkMg Example of SQL Server http://www.youtube.com/watch?v=nvZ3OAY_-9g
Categories of data mining tasks • Data mining tasks are generally divided into two major categories: • Descriptive tasks: The objective of these tasks is to discern patterns (correlations, trends, clusters, anomalies) that describe the relationships in data. • Predictive tasks: The objective of these tasks is to predict the value of a particular attribute based on the values of other attributes. The attribute to be predicted is called the “dependent variable” and the attributes used to help make the prediction are called the “independent variables.” • Data mining tasks are usually exploratory and require further work by people to validate and explain the results.
Learning methods for data mining • The goal of data mining is to identify and report patterns. • The statistical strategies that help data mining software “learn” fall at two ends of a spectrum: • Supervised statistical methods: Data is known and labeled prior to running the statistical evaluation. Benchmark data is available. • Unsupervised statistical methods: The method itself figures out the patterns from the data. No benchmark data is available. • Semi-supervised methods are commonly used.
Type of Data Mining Task: Prediction • Refers to the task of building a model for the dependent variable (target attribute) as a function of independent variables (explanatory attributes). • Is called classification for discrete dependent variables and is called regression when used for continuous dependent variables. Examples: • Predicting whether a Web user will make a purchase at an online bookstore is a classification task because the dependent variable is yes/no. • Forecasting the future price of a stock is a regression task because price is a continuous valued attribute. • Contains rules to make determinations. Rules may be exact, strong and/or probabilistic.
Type of Data Mining Task: Association • Used to discover patterns that describe strongly associated features in the data. • Correlates one set of events or items with another set of events or items. • Most frequent use is “market basket analysis.” Market basket analysis determines what products customers purchase together. A set of point-of-sale transactions can be analyzed to uncover patterns in purchases that show what people purchase at the same time in a given geographic location or at a particular time/day.
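Market basket analysis can be sketched with two standard association measures: support (how often a pair of items appears together) and confidence (how often one item appears given the other). The baskets below and the `pair_stats` helper are hypothetical; commercial tools use more scalable algorithms than this brute-force count.

```python
from collections import Counter
from itertools import combinations

# Hypothetical point-of-sale transactions, one set of items per basket
BASKETS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def pair_stats(baskets):
    """Return support and confidence for every pair of items bought together."""
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        item_counts.update(basket)
        # Sort so each pair is always stored in the same (a, b) order
        pair_counts.update(combinations(sorted(basket), 2))
    n = len(baskets)
    stats = {}
    for (a, b), count in pair_counts.items():
        stats[(a, b)] = {
            "support": count / n,                      # share of baskets with both
            "confidence_a_to_b": count / item_counts[a],  # share of a-baskets with b
            "confidence_b_to_a": count / item_counts[b],  # share of b-baskets with a
        }
    return stats

stats = pair_stats(BASKETS)
for pair, values in sorted(stats.items()):
    print(pair, values)
```

In this toy data, bread and milk appear together in 3 of 5 baskets, and 3 of the 4 bread baskets also contain milk.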
Type of Data Mining Task: Association • Sequence (a type of association): • Used to relate events and/or attributes over time. • Can be used to analyze the buying sequences of a particular group of customers. • Can be used to uncover groups of timed sequences of particular promotions (advertising) aimed at a particular group of customers. • Can be used to analyze the sequence of page hits on a web site.
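As a rough illustration of sequence analysis on web page hits, the sketch below counts how often one page is followed directly by another. The session data and the `transition_counts` helper are hypothetical, and real sequence mining also handles gaps and longer patterns.

```python
from collections import Counter

# Hypothetical page-hit sequences, one list per visitor session
SESSIONS = [
    ["home", "search", "product", "cart"],
    ["home", "product", "cart", "checkout"],
    ["home", "search", "product"],
]

def transition_counts(sessions):
    """Count each (page, next_page) step across all sessions."""
    counts = Counter()
    for pages in sessions:
        # zip pairs every page with the page that follows it
        counts.update(zip(pages, pages[1:]))
    return counts

counts = transition_counts(SESSIONS)
for step, count in counts.most_common():
    print(step, count)
```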
Type of Data Mining Task: Clustering • Seeks to find groups of closely related observations. • Based on identifiable attributes in the groups. • Clustering has been used to do such tasks as: • group sets of related customers (male/female, age groups); • find areas of the ocean that have significant problems; and • identify particular days/times of significant purchases. • Clustering is used to identify related keywords in news articles so that a web site can present related groups of articles in the same place.
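One common clustering technique is k-means, which the slides do not name; it is used here only as an assumed example. The sketch groups hypothetical customer ages into two clusters by repeatedly assigning each value to its nearest center and moving each center to the mean of its group.

```python
def kmeans_1d(values, centers, iterations=10):
    """Very small k-means for a single numeric attribute."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Assign the value to the closest current center
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster; leave empty clusters in place
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical customer ages and arbitrary starting centers
AGES = [19, 21, 22, 24, 45, 47, 50, 52]
centers, clusters = kmeans_1d(AGES, centers=[20, 60])
print(centers)
print(clusters)
```

The starting centers are a free choice; different starting points can give different groupings on less clearly separated data.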
Type of Data Mining Task: Clustering • Anomaly detection (a type of clustering): • The task of identifying observations whose characteristics are significantly different from the rest of the data. Such observations are known as anomalies or outliers. • The goal of an anomaly detection algorithm is to discover “real” anomalies and avoid falsely labeling normal objects as anomalous. • Applications include fraud detection, network intrusions, unusual patterns of disease, and changes in environment/weather. • Credit card fraud detection is the most common use of anomaly detection. When a new transaction is made, it is compared against the profile of the user (age, gender, credit limit, annual income, address) to see whether the characteristics of the transaction are very different from previous transactions.
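The comparison of a new transaction against previous ones can be sketched with a z-score test: flag the transaction if its amount is many standard deviations away from the customer's history. This is one simple approach among many, and the history values, the threshold of 3, and the `is_anomalous` helper are assumptions made for illustration.

```python
from statistics import mean, stdev

def is_anomalous(history, new_amount, threshold=3.0):
    """Flag new_amount if it lies more than `threshold` standard
    deviations from the mean of the previous amounts."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past amounts identical: anything different stands out
        return new_amount != mu
    return abs(new_amount - mu) / sigma > threshold

# Hypothetical previous transaction amounts for one cardholder
HISTORY = [42.0, 38.5, 51.0, 45.25, 40.0, 47.5]

print(is_anomalous(HISTORY, 48.0))
print(is_anomalous(HISTORY, 900.0))
```

A lower threshold catches more fraud but also labels more normal transactions as anomalous, which is the trade-off the slide describes.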
Assessment Methods for Data Mining • Predictive accuracy • Hit rate • Speed • Computational costs in generating and using the model • Robustness • Model’s ability to make accurate predictions with noisy data • Scalability • Model’s ability to make accurate predictions with more or less data • Interpretability • Transparency; ease of understanding
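Predictive accuracy as a hit rate is simply the share of cases the model predicted correctly. A minimal sketch, with hypothetical actual and predicted labels:

```python
def hit_rate(actual, predicted):
    """Fraction of predictions that match the actual outcomes."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one observation is required")
    hits = sum(a == p for a, p in zip(actual, predicted))
    return hits / len(actual)

# Hypothetical outcomes: did the customer make a purchase?
ACTUAL = ["yes", "no", "yes", "yes", "no"]
PREDICTED = ["yes", "no", "no", "yes", "no"]

rate = hit_rate(ACTUAL, PREDICTED)
print(rate)
```

Hit rate should be measured on data the model did not see during training; otherwise it overstates how well the model will do on new cases.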
Models supporting data mining • Decision trees • Regression and correlation • Exploratory factor analysis • Artificial neural networks (ANN) • Support vector machines • Case-based reasoning • Bayesian classifiers: Naïve Bayes • Genetic algorithms
Decision Trees • Decision trees are tools used for classification and prediction. • Decision trees represent rules. • Decision trees recursively divide a training set until each division consists of examples from one class
Decision tree process • Create a root node and assign all of the training data to it. • Select significant independent variables • Identify category groupings or interval breaks to create groups most different with respect to the dependent variable • Select as the primary independent variable the one identifying groups with the most different values of the dependent variable • Select additional variables to extend each branch if there are further significant differences
Decision tree algorithms • Decision tree algorithms mainly differ based on: • Splitting criteria • Which variable, what value, etc. • Stopping criteria • When to stop building the tree • Pruning (generalization method) • Pre-pruning versus post-pruning
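A sketch of one possible splitting criterion: Gini impurity, which measures how mixed the classes are within a group. The slides do not say which criterion is meant, so this choice is an assumption, as are the toy rows and the `best_split` helper. The function picks the independent variable whose groups are purest with respect to the dependent variable, which corresponds to selecting the primary variable at the root.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 when all labels are the same class."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, target):
    """Return (attribute, weighted_gini) for the categorical attribute
    whose groups are purest with respect to the target."""
    best = None
    attributes = [k for k in rows[0] if k != target]
    for attr in attributes:
        # Group the target values by each category of this attribute
        groups = {}
        for row in rows:
            groups.setdefault(row[attr], []).append(row[target])
        weighted = sum(len(g) / len(rows) * gini(g) for g in groups.values())
        if best is None or weighted < best[1]:
            best = (attr, weighted)
    return best

# Hypothetical training data; "buys" is the dependent variable
ROWS = [
    {"age": "young", "income": "low", "buys": "no"},
    {"age": "young", "income": "high", "buys": "no"},
    {"age": "old", "income": "low", "buys": "yes"},
    {"age": "old", "income": "high", "buys": "yes"},
    {"age": "old", "income": "low", "buys": "yes"},
    {"age": "young", "income": "high", "buys": "yes"},
]

attribute, impurity = best_split(ROWS, target="buys")
print(attribute, round(impurity, 3))
```

A full tree builder would apply `best_split` recursively to each resulting group until a stopping criterion is met, then prune.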
Requirements to use a decision tree • The situation must be expressible in terms of a fixed collection of properties (e.g. young/old, hot/warm/cold, love/like/sort-of-like/dislike/hate). • There is a target value. • There is enough data to provide training cases for the model. • Not suitable for prediction of continuous variables. • Performs poorly with many independent variables (predictors) and small datasets.
Correlation and regression analysis • Regression analysis describes the way in which one variable is related to another (or to more than one other variable in multiple regression). • Correlation analysis describes the strength of the relationship defined by regression analysis. • The principal goals of regression and correlation analysis are: • Predict. Regression analysis provides estimates of the dependent variable for given values of the independent variable(s). • Disclose. Regression analysis provides measures of the errors that are involved with those estimates. • Relate. Correlation analysis provides estimates of how strong the relationship is between the dependent and independent variables. The coefficient of correlation and the coefficient of determination are two measures of the strength of the relationship.
Regression: what kind of relationship? • Is the relationship between dependent and independent variables direct (positive) or inverse (negative)? • Is the relationship linear or nonlinear? • How strong is the relationship? • Next page provides examples.
Correlation • The descriptive statistic that measures the degree of linear association between two variables is called the correlation coefficient (r). • The measure of relative closeness used by statisticians for evaluating the “goodness of fit” of the regression line is called the coefficient of determination (R²).
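A worked sketch of simple linear regression and correlation using the standard least-squares formulas. The data points and the `linear_fit` helper are hypothetical. With a single independent variable, the coefficient of determination equals the square of the correlation coefficient.

```python
from math import sqrt

def linear_fit(xs, ys):
    """Least-squares line y = intercept + slope * x, plus r and r squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)  # correlation coefficient
    return slope, intercept, r, r ** 2

# Hypothetical observations (independent variable, dependent variable)
XS = [1, 2, 3, 4, 5]
YS = [2, 4, 5, 4, 5]

slope, intercept, r, r_squared = linear_fit(XS, YS)
print(slope, intercept, r, r_squared)

# Predict: estimate the dependent variable for a new value of x
predicted = intercept + slope * 6
print(predicted)
```

Here the positive slope indicates a direct relationship, and an R² of about 0.6 means the line accounts for roughly 60% of the variation in the dependent variable.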
Issues in regression and correlation • Correlation is not causation. • Does not always give a clear picture of the direction of the relationship. • Can show spurious relationships. • Is VERY dependent on having a reasonable hypothesis. • Does not do a good job predicting individual behavior.
Artificial Intelligence for Data Mining • Neural networks are useful for data mining and decision-support applications. • People are good at generalizing from experience. • Computers excel at following explicit instructions over and over. • Neural networks bridge this gap by modeling, on a computer, the neural behavior of human brains. • Neural networks are useful for pattern recognition or data classification.
Von Neumann vs. Artificial neural networks • Von Neumann program: • Follows pre-defined rules. • Solution can be formally specified. • Solution must be formally specified. • Cannot generalize. • Not especially error-tolerant. • ANN program: • Learns rules from data. • Rules on the data are not visible; not pre-defined. • Able to generalize. • Able to recognize patterns. • Copes well with errors and “noise.”
Artificial neural networks • Neural networks are particularly effective for predicting events when the networks have a large database of prior examples to draw on. But not too large... • Neural networks, like people, learn by example. An ANN is configured for a specific application, such as pattern recognition or data classification, through a learning process. • Student video: http://www.youtube.com/watch?v=DG5-UyRBQD4&feature=related
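Learning by example can be illustrated with a single artificial neuron (a perceptron), the simplest building block of a neural network. This is a sketch under assumptions: the slides do not describe a specific network, and the training samples, learning rate, and helper names are invented for illustration. The neuron is never told the rule for logical AND; it adjusts its weights whenever its output is wrong.

```python
def predict(weights, bias, inputs):
    """Fire (1) if the weighted sum of inputs exceeds zero, else 0."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0

def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Adjust weights after each example the neuron gets wrong."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            # No change when error is 0; otherwise nudge toward the target
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias

# Hypothetical labeled examples: the logical AND of two inputs
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(AND_SAMPLES)
predictions = [predict(weights, bias, inputs) for inputs, _ in AND_SAMPLES]
print(weights, bias, predictions)
```

The learned rule lives only in the numeric weights, which is why rules in an ANN are described as not visible and not pre-defined.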
Fuzzy logic in AI • Fuzzy logic, used by some AI techniques, is a method of reasoning that allows for the partial or incomplete (“fuzzy”) description of a rule. • Computer logic is usually based on answers of yes/no (on/off) to if-then-else questions. • Fuzzy logic is based on overlapping sets of data. An instance has a degree of membership (between 0 and 1) in one set and/or another.
Example of fuzzy logic • For example, let’s say that a person is old. • Is a person either young or old? Is it a yes/no switch? • What are the categories of “young” vs. “old”? • How can a person be “very” old vs. just plain old? • Need a sizable sample set to define categories and characteristics of the categories. • Need to determine the boundaries of the categories. • Need to assign probabilities to the categories.
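The young/old question can be sketched with a membership function that returns a graded value between 0 and 1 instead of a yes/no switch. The boundaries (30 and 70) and the linear ramp between them are arbitrary assumptions; in practice they would be determined from a sizable sample set.

```python
def membership_old(age, young_until=30, old_from=70):
    """Degree (0.0 to 1.0) to which an age belongs to the set 'old'."""
    if age <= young_until:
        return 0.0
    if age >= old_from:
        return 1.0
    # Linear ramp between the two hypothetical boundaries
    return (age - young_until) / (old_from - young_until)

def membership_young(age, young_until=30, old_from=70):
    """Complement of 'old' in this simple two-category sketch."""
    return 1.0 - membership_old(age, young_until, old_from)

for age in (25, 50, 65, 80):
    print(age, membership_old(age), membership_young(age))
```

Under these assumed boundaries a 50-year-old belongs equally to both sets, while a 65-year-old is mostly, but not entirely, "old".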
What does data mining technology deliver? What must be done with the information delivered by data mining to make it usable for an organization? What kinds of skills should a data analyst possess to be able to make use of data mining?