Data Mining
Dr. Awad Khalil
Computer Science Department, AUC
Data Mining, by Dr. Khalil
Content
• What and Why Data Mining
• Data Mining Applications
• Data Mining Operations & associated Techniques
• Predictive Modeling
• Database Segmentation
• Link Analysis
• Deviation Detection
• The Data Mining Process
• The CRISP-DM Model
What and Why Data Mining?
• Data Mining is the process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions.
• Data mining is concerned with the analysis of data and the use of software techniques for finding hidden and unexpected patterns and relationships in sets of data.
• The focus of data mining is to reveal information that is hidden and unexpected.
• Data mining requires a single, separate, clean, integrated, and self-consistent source of data. A data warehouse is well equipped for providing data for data mining.
• Data mining can provide huge paybacks for companies who have made a significant investment in data warehousing.
Data Mining Applications
• Retail/Marketing:
  • Identifying buying patterns of customers
  • Finding associations among customer demographic characteristics
  • Predicting response to mailing campaigns
  • Market basket analysis
• Banking:
  • Detecting patterns of fraudulent credit card use
  • Identifying loyal customers
  • Predicting customers likely to change their credit card affiliation
  • Determining credit card spending by customer groups
• Insurance:
  • Claims analysis
  • Predicting which customers will buy new policies
• Medicine:
  • Characterizing patient behavior to predict surgery visits
  • Identifying successful medical therapies for different illnesses
Data Mining Operations & Associated Techniques
• Predictive Modeling:
  • Classification
  • Value prediction
• Database Segmentation:
  • Demographic clustering
  • Neural clustering
• Link Analysis:
  • Association discovery
  • Sequential pattern discovery
  • Similar time sequence discovery
• Deviation Detection:
  • Statistics
  • Visualization
Predictive Modeling
• Predictive Modeling is similar to the human learning experience in using observations to form a model of the important characteristics of some phenomenon.
• This approach uses generalization of the “real world” and the ability to fit new data into a general framework.
• Predictive modeling can be used to analyze an existing database to determine some essential characteristics (model) about the data set.
• Applications of predictive modeling include customer retention management, credit approval, cross-selling, and direct marketing.
• There are two techniques associated with predictive modeling: classification and value prediction.
Classification
• Classification is used to establish a specific predetermined class for each record in a database from a finite set of possible class values.
• There are two specializations of classification:
  • Tree induction
  • Neural induction
Classification – Tree Induction
• In the example shown, we are interested in predicting whether a customer who is currently renting property is likely to be interested in buying property.
• A predictive model has determined that only two variables are of interest: the length of time the customer has rented property and the age of the customer.
• The decision tree presents the analysis in an intuitive way.
• The model predicts that those customers who have rented for more than two years and are over 25 years old are the most likely to be interested in buying property.
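
The rule read off the decision tree can be sketched as a small function. This is only a minimal sketch in Python: the function name `likely_to_buy`, the sample renters, and the choice to group every other branch as "not most likely" are illustrative assumptions, since the remaining leaves of the tree are not given in the text.

```python
def likely_to_buy(years_rented: float, age: int) -> bool:
    """Return True for the leaf described as most likely to buy.

    Assumption: all other branches are grouped as False, because the
    remaining leaves of the tree are not given in the text.
    """
    # Root split: length of time the customer has rented property.
    if years_rented > 2:
        # Second split: age of the customer.
        if age > 25:
            return True
    return False


# Hypothetical renters, made up for illustration.
renters = [
    {"name": "A", "years_rented": 3.0, "age": 30},
    {"name": "B", "years_rented": 3.0, "age": 22},
    {"name": "C", "years_rented": 1.0, "age": 40},
]
prospects = [r["name"] for r in renters if likely_to_buy(r["years_rented"], r["age"])]
print(prospects)
```

Only renter A satisfies both conditions, so only A is selected as a prospect. In practice the split variables and thresholds would be induced from the data rather than written by hand.
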
Classification – Neural Network
• A Neural Network contains collections of connected nodes with input, output, and processing at each node.
• Between the visible input and output layers may be a number of hidden processing layers.
• Each processing unit (circle) in one layer is connected to each processing unit in the next layer by a weighted value, expressing the strength of the relationship.
• The network attempts to mirror the way the human brain works in processing patterns by arithmetically combining all the variables associated with a given data point.
• In this way, it is possible to develop nonlinear predictive models that “learn” by studying combinations of variables and how different combinations of variables affect different data sets.
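
The arithmetic combination of variables through weighted connections can be sketched as a forward pass through a small fully connected network. This is a minimal sketch in Python under stated assumptions: the sigmoid activation, the bias terms, the layer sizes, and all weight values are hypothetical and hand-picked, and the training step that would actually "learn" the weights is not shown.

```python
import math


def sigmoid(x: float) -> float:
    """Squash a weighted sum into the range (0, 1); this is the nonlinearity."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(inputs, weights, biases):
    """Compute the outputs of one fully connected layer.

    weights[j][i] is the weight on the connection from input unit i
    to unit j of this layer.
    """
    outputs = []
    for unit_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(unit_weights, inputs)) + bias
        outputs.append(sigmoid(total))
    return outputs


def network_forward(inputs, layers):
    """Pass the inputs through each layer in turn and return the final outputs."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Hypothetical weights: 2 inputs -> 2 hidden units -> 1 output unit.
layers = [
    ([[0.5, -0.4], [0.3, 0.8]], [0.0, -0.1]),
    ([[1.2, -0.7]], [0.05]),
]
score = network_forward([0.6, 0.9], layers)[0]
print(round(score, 3))
```

The output is a score between 0 and 1 that could be thresholded to assign a class. Every unit in one layer feeds every unit in the next, which matches the fully connected structure described above.
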
Value Prediction
• Value prediction is used to estimate a continuous numeric value that is associated with a database record.
• This technique uses the traditional statistical techniques of linear regression and nonlinear regression.
• Linear regression attempts to fit a straight line through a plot of the data, such that the line is the best representation of the average of all observations at that point in the plot.
• Linear regression works well with linear data and is sensitive to the presence of outliers (that is, data values which do not conform to the expected norm).
• Although nonlinear regression avoids the main problems of linear regression, it is still not flexible enough to handle all possible shapes of the data plot.
• Applications of value prediction include credit card fraud detection and target mailing list identification.
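
The sensitivity of linear regression to outliers can be illustrated with a small ordinary least-squares fit. This is a minimal sketch in Python: the function name `fit_line` and the data values are made up for illustration, and ordinary least squares is assumed as the fitting method.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-style and variance-style sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]          # exactly y = 2x
clean_slope, clean_intercept = fit_line(xs, ys)

ys_outlier = [2, 4, 6, 8, 40]  # last value replaced by an outlier
out_slope, out_intercept = fit_line(xs, ys_outlier)

print(clean_slope, clean_intercept)
print(out_slope, out_intercept)
```

With the clean data the fitted line has slope 2 and intercept 0. Replacing a single observation with an outlier moves the fit to slope 8 and intercept -12, so one record is enough to distort the predictions for all the others.
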
Database Segmentation
• The aim of database segmentation is to partition a database into an unknown number of segments, or clusters, of similar records, that is, records that share a number of properties and so are considered to be homogeneous.
• This approach uses unsupervised learning to discover homogeneous sub-populations in a database to improve the accuracy of the profiles.
• Database segmentation is less precise than other operations and is therefore less sensitive to redundant and irrelevant features.
• Applications of database segmentation include customer profiling, direct marketing, and cross-selling.
• Database segmentation is associated with demographic or neural clustering techniques, which are distinguished by the allowable data inputs, the methods used to calculate the distance between records, and the presentation of the resulting segments for analysis.
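
The idea of partitioning records into a number of segments that is not known in advance can be sketched with a simple distance-threshold clustering. This is a minimal sketch in Python and not the demographic or neural clustering technique named above: the function name `segment`, the Euclidean distance measure, the `max_distance` threshold, and the customer records are all assumptions chosen for illustration.

```python
import math


def segment(records, max_distance):
    """Assign each record to the nearest existing segment, or open a new one.

    A segment is kept as a list of its member records; its centre is the
    mean of its members. The number of segments is not fixed in advance.
    """
    segments = []
    for record in records:
        best, best_dist = None, None
        for members in segments:
            # Centre of the segment: column-wise mean of its members.
            centre = [sum(col) / len(members) for col in zip(*members)]
            dist = math.dist(record, centre)
            if best_dist is None or dist < best_dist:
                best, best_dist = members, dist
        if best is not None and best_dist <= max_distance:
            best.append(record)
        else:
            # No segment is close enough, so this record starts a new one.
            segments.append([record])
    return segments


# Hypothetical customer records: (age, years as customer).
customers = [(25, 1), (27, 2), (26, 1), (60, 20), (62, 22), (40, 10)]
segments = segment(customers, max_distance=5.0)
for members in segments:
    print(members)
```

On this data the sketch produces three segments: the three younger customers, the two older ones, and the record (40, 10) on its own. The result depends on the order of the records and on the threshold, which is one reason segmentation is described as less precise than other operations.
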
Link Analysis
• Link analysis aims to establish links, called associations, between the individual records, or sets of records, in a database.
• There are three specializations of link analysis:
  • Association discovery: finds items that imply the presence of other items in the same event. These affinities between items are represented by association rules. For example, “when a customer rents a property for more than two years and is more than 25 years old, in 40% of cases, the customer will buy a property. This association happens in 35% of all customers who rent properties.”
  • Sequential pattern discovery: finds patterns between events such that the presence of one set of items is followed by another set of items in a database of events over a period of time. For example, this approach can be used to understand long-term customer buying behavior.
  • Similar time sequence discovery: is used, for example, in the discovery of links between two sets of data that are time-dependent, and is based on the degree of similarity between the patterns that both time series demonstrate. For example, within three months of buying property, new home owners will purchase goods such as cookers, freezers, and washing machines.
• Applications of link analysis include product affinity analysis, direct marketing, and stock price movement.
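
The two percentages in the association rule example are read here as the rule's confidence (40%) and support (35%), following the common definitions of those terms. A minimal sketch in Python of how they could be computed is given below; the function name `rule_metrics` and the ten renter records are hypothetical, so the support value does not reproduce the 35% quoted above.

```python
def rule_metrics(records, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    support    = fraction of all records matching both antecedent and consequent
    confidence = fraction of antecedent-matching records that also match
                 the consequent
    """
    matching_antecedent = [r for r in records if antecedent(r)]
    matching_both = [r for r in matching_antecedent if consequent(r)]
    support = len(matching_both) / len(records)
    if not matching_antecedent:
        return support, 0.0
    confidence = len(matching_both) / len(matching_antecedent)
    return support, confidence


# Hypothetical renters: (years rented, age, bought a property).
renters = [
    (3, 30, True), (4, 28, True), (5, 40, False), (3, 35, False),
    (6, 26, False), (1, 30, False), (3, 22, False), (2, 45, True),
    (1, 20, False), (4, 25, False),
]

support, confidence = rule_metrics(
    renters,
    antecedent=lambda r: r[0] > 2 and r[1] > 25,
    consequent=lambda r: r[2],
)
print(support, confidence)
```

Five of the ten renters satisfy the antecedent and two of those bought a property, giving a confidence of 0.4 and a support of 0.2 on this made-up data. Note that some tools define support over the antecedent alone, so the definition in use should be checked before comparing figures.
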
Deviation Detection
• Deviation detection is a relatively new technique in terms of commercially available data mining tools.
• It identifies outliers, which express deviation from some previously known expectation and norm.
• This operation can be performed using statistics and visualization techniques. For example, linear regression facilitates the identification of outliers in data while modern visualization techniques display summaries and graphical representations that make deviations easy to detect.
• Applications of deviation detection include fraud detection in the use of credit cards and insurance claims, quality control, and defects tracing.
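
One simple statistical way to identify outliers is to flag values that lie far from the mean in units of standard deviation. This is a minimal sketch in Python, assuming a z-score test as the statistic; the function name `find_deviations`, the threshold of 2.0, and the claim amounts are hypothetical.

```python
import statistics


def find_deviations(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        # All values are identical, so nothing deviates from the norm.
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical insurance claim amounts; one is far larger than the rest.
claims = [120, 135, 110, 125, 130, 115, 128, 122, 900]
print(find_deviations(claims))
```

Here only the claim of 900 is flagged. Because a large outlier inflates both the mean and the standard deviation, a robust alternative such as a median-based measure may be preferable on real data.
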
The Data Mining Process
• In 1996 a consortium of vendors and users developed a specification called the Cross Industry Standard Process for Data Mining (CRISP-DM).
• CRISP-DM specifies a data mining process that is not specific to any particular industry or tool.
• CRISP-DM has evolved from the knowledge discovery processes used widely in industry and in direct response to user requirements.
• The major aims of CRISP-DM are to make large data mining projects run more efficiently and to make them cheaper, more reliable, and more manageable.
The CRISP-DM Model
• The CRISP-DM methodology is a hierarchical process model.
• At the top level, the process is divided into six different generic phases, ranging from business understanding to deployment of project results.
• The next level elaborates each of these phases as comprising several generic tasks. At this level, the description is generic enough to cover all the DM scenarios.
• The third level specializes these tasks for specific situations. For example, the generic task might be cleaning data, and the specialized task could be cleaning of numeric or categorical values.
• The fourth level is the process instance, that is, a record of actions, decisions, and results of an actual execution of a DM project.
• The model also discusses relationships between different DM tasks.
The CRISP-DM Phases
• Business understanding – determine business objectives, assess situation, determine data mining goals, and produce a project plan.
• Data understanding – collect initial data, describe data, explore data, and verify data quality.
• Data preparation – select data, clean data, construct data, integrate data, and format data.
• Modeling – select modeling technique, generate test design, build model, and assess model.
• Evaluation – evaluate results, review process, and determine next steps.
• Deployment – plan deployment, plan monitoring and maintenance, produce final report, and review project.
Data Mining Tools
• There is a growing number of commercial data mining tools in the marketplace.
• The important features of data mining tools include:
  • Data preparation
  • Selection of data mining operations (algorithms)
  • Product scalability and performance
  • Facilities for understanding results
Thank you