Section 5

Section 5 Data Mining

Section Content • 5.1 Introduction • 5.2 Knowledge Discovery • 5.3 Association Rules • 5.4 Sequential Patterns • 5.5 Classification and Regression • 5.6 Other Forms of Data Mining • 5.7 Applications of Data Mining CA306 Data Mining

5.1 Data Mining Introduction • Data mining: • the discovery of new information in terms of patterns or rules from huge amounts of data • mining tools should identify these patterns, rules and trends with minimal user input • data mining is related to • statistics: exploratory data analysis • artificial intelligence: knowledge discovery and machine learning • techniques from machine learning, statistics, neural networks and genetic algorithms are used • due to the vastness of the amount of data, efficiency/scalability of data mining algorithms is a key issue CA306 Data Mining

Data Mining and Data Warehousing • The goal of data warehousing is to support decision making with data. • In conjunction with a data warehouse, data mining can help with certain types of decisions. • Data mining helps to extract new patterns/rules that cannot be found by merely querying or processing data. • Aggregated or summarised collections of data in warehouses improve the efficiency of data mining in these cases. • The potential use of data mining needs to be considered early in the design of a data warehouse.

5.2 Knowledge Discovery • Data mining is part of the knowledge discovery process: • data selection • data cleansing • enrichment • data transformation / encoding • data mining • reporting and display • Example: • Database: Transaction database for a goods retailer • Client data: name, zip code, phone, date of purchase, item code, price, quantity, total amount CA306 Data Mining

Knowledge Discovery - Example • New knowledge can be discovered from the client data • data selection: • data about specific items or categories of items • items from stores in specific regions • data cleansing: • correct incorrect zip codes • eliminate records with incorrect phone numbers • enrichment: add additional information • age, income, credit rating of client • data transformation: reduce the amount of data • group items into product categories • group zip codes into regions CA306 Data Mining

Data Mining - Knowledge Discovery • Data mining might discover • co-occurrences - items that are typically bought together • association rules - when a customer buys video equipment, he/she also buys another electronic gadget • sequential patterns - when a customer buys a camera, then within 3 months he/she buys photographic supplies • classification trees - customers can be classified by frequency of visits, types of finance used, etc., combined with statistics about the classes • This information can then be used, for example, to • optimise store locations • run promotions • plan seasonal marketing strategies

Goals of Data Mining • Prediction • show how certain attributes within the data will behave in the future • example: predict what customers will buy under certain discounts • example: predict sales volume for some period • Identification • data patterns can be used to identify the existence of an item, an event, or an activity • example: detecting intruders by the commands they execute CA306 Data Mining

Goals of Data Mining • Classification • partition data such that different classes or categories can be identified • example: customers can be categorised into regular shoppers, infrequent shoppers, discount-seeking customers, etc. • categorisation - e.g. into food categories - can reduce the complexity of data mining • Optimisation • optimise the use of limited resources (time, space, money, etc.) • example: what are the best products to spend our money on over the next three months?

Types of Knowledge Discovered • Co-occurrences • collection of items/actions/events that occur together • example: items that are bought together by a consumer in a shop • Association rules • correlation of the presence of a set of items with a range of values for another set of variables • example: when someone buys bread, he/she is likely to buy cheese • Classification hierarchies • create a hierarchy of classes from an existing set of events or transactions • example: customers might be divided into a creditworthiness hierarchy based on their previous credit transactions

Types of Knowledge Discovered • Sequential patterns • search for a sequence of events or actions • example: a patient who underwent cardiac surgery and later developed high blood urea is likely to suffer from kidney problems • Patterns within time series • detection of similarities within positions of the time series • example: a pattern in a time series of stock market prices may be used to predict employment rates • Categorisation and segmentation • partition a set of events or items into segments/categories/classes • example: treatment data on a disease can be partitioned into groups based on the side effects that are caused

Counting Co-occurrences • The problem is to count co-occurring itemsets - motivated by market basket analysis. • A database of consumer transactions forms the basis • transaction: a single visit to a store, an order at a virtual store (Web site), or a single order through a mail-order catalog • a transaction consists of a transaction ID, customer ID, date, item and quantity • The goal is to identify items that are typically purchased together. • This can be used to improve the layout of shops or catalogs. CA306 Data Mining

Frequent Itemsets (1) • Consider the following transaction table:

Transaction  Customer  Date   Items bought
101          12        11/09  milk, bread, juice
792          13        12/09  milk, juice
1130         14        14/09  milk, eggs
1735         13        14/09  bread, coffee, biscuits

• Items bought in one visit are already grouped together into itemsets. • Support of an itemset: the fraction of transactions that contain all items in the itemset • Examples • {milk, juice} has a support of 50% • {bread, coffee} has a support of 25%
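The support definition can be checked with a few lines of Python. This is a minimal sketch: the transactions are copied from the table on this slide, while the names `TRANSACTIONS` and `support` are our own and not part of the course material.

```python
# Transactions from the slide's table, keyed by transaction ID.
TRANSACTIONS = {
    101: {"milk", "bread", "juice"},
    792: {"milk", "juice"},
    1130: {"milk", "eggs"},
    1735: {"bread", "coffee", "biscuits"},
}

def support(itemset, transactions=TRANSACTIONS):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

print(support({"milk", "juice"}))    # 0.5
print(support({"bread", "coffee"}))  # 0.25
```

The subset test `itemset <= items` is what "contains all items in the itemset" means in set terms.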

Frequent Itemsets (2) • Large itemsets are itemsets that have a certain minimum support, i.e. are itemsets that occur frequently. • Example: • for a minimum support of 40%, the large itemsets are {milk, juice}, {milk}, {juice}, {bread} • Proposition: • every subset of a large itemset is also a large itemset • Algorithm: • large itemsets can be computed incrementally • start with itemsets of cardinality 1 that have the required support CA306 Data Mining

5.3 Association Rules • A database can be regarded as a collection of transactions. • Each transaction involves a set of items. • Example: the items in a basket that a shopper uses in a supermarket

Transaction  Time  Items bought
101          6:35  milk, bread, juice
792          7:38  milk, juice
1130         8:05  milk, eggs
1735         8:40  bread, coffee, biscuits

Association Rules • An association rule has the form X => Y, where X and Y are two disjoint sets of items. • Example: • for sets of goods as itemsets X and Y, the expression X => Y means that if a customer buys X, he/she is also likely to buy Y. • if the customer buys milk, he/she is also likely to buy juice. • The support for a rule X => Y is the percentage of transactions that contain all of the items in the union X ∪ Y. • Examples: • Milk => Juice has 50% support • Bread => Juice has 25% support

Association Rules • The confidence of a rule X => Y is the percentage (fraction) of all transactions including X that also include Y. • Example: • the rule Milk => Juice has confidence 66.7% • that means that 2/3 of all transactions with milk also include juice • Note that support and confidence might be different. • The goal is to discover rules with a certain minimum support and confidence. • These rules can be used for prediction: for a rule Pen => Ink, offering discounts on pens might increase ink sales.
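As a rough illustration, confidence can be computed as support(X ∪ Y) / support(X). The sketch below reuses the four example transactions; the function names `support` and `confidence` are our own choice.

```python
# The four example transactions (IDs omitted, only the itemsets matter here).
TRANSACTIONS = [
    {"milk", "bread", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "coffee", "biscuits"},
]

def support(itemset):
    """Fraction of transactions containing all items of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in TRANSACTIONS) / len(TRANSACTIONS)

def confidence(lhs, rhs):
    """Fraction of transactions containing `lhs` that also contain `rhs`.

    Assumes `lhs` occurs in at least one transaction.
    """
    return support(set(lhs) | set(rhs)) / support(lhs)

print(round(confidence({"milk"}, {"juice"}), 3))  # 0.667
print(support({"milk", "juice"}))                 # 0.5
```

The two printed values show the difference noted on the slide: Milk => Juice has 50% support but 66.7% confidence.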

Association Rules • How to compute these rules? • Generate large itemsets (itemsets with a certain minimum support). • For each large itemset X, generate all rules with a certain minimum confidence (mconf): • for each non-empty Y ⊂ X, let Z = X - Y (divide X into Y and Z) • if support(X) / support(Y) > mconf, then Y => Z is a valid rule • the confidence of rule Y => Z is defined as support(X) / support(Y) • Example: • for X = {milk, juice} and Y = {milk} ⊂ {milk, juice}, let Z = {juice} • X, Y and Z have support 50%, 75% and 50%, respectively (support for itemsets: 5.14) • for mconf = 40%, {milk} => {juice} is a valid rule with confidence 66.7% (50/75)
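The rule-generation step can be sketched as follows. This is an illustrative reading of the slide, not a reference implementation: `rules_from` is a hypothetical helper name, and every non-empty proper subset Y of X is tried as a left-hand side.

```python
from itertools import combinations

TRANSACTIONS = [
    {"milk", "bread", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "coffee", "biscuits"},
]

def support(itemset):
    """Fraction of transactions containing all items of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in TRANSACTIONS) / len(TRANSACTIONS)

def rules_from(large_itemset, mconf):
    """Return (Y, Z, confidence) for every split of X into Y => Z above mconf."""
    x = frozenset(large_itemset)
    rules = []
    for size in range(1, len(x)):              # non-empty proper subsets only
        for y in combinations(sorted(x), size):
            y = frozenset(y)
            conf = support(x) / support(y)     # confidence of Y => X - Y
            if conf > mconf:
                rules.append((set(y), set(x - y), conf))
    return rules

for y, z, conf in rules_from({"milk", "juice"}, mconf=0.4):
    print(y, "=>", z, round(conf, 3))
```

For X = {milk, juice} this yields both {milk} => {juice} (66.7%) and {juice} => {milk} (100%), since each clears the 40% threshold.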

Generating Association Rules • In principle, generating rules based on large itemsets and their support is straightforward. • Computing all large itemsets and their support creates an efficiency problem if the number of items is very high. • If m is the number of items, then 2^m is the number of different itemsets. • Example: a typical supermarket might have several thousand items. • Computing the support of all itemsets might take a long time. • Reducing the combinatorial search space is therefore important - the following properties can be used: • subsets of large itemsets are large • extensions of small itemsets are small

Association Rules - Algorithms • Outline of an algorithm that finds large itemsets: • Step 1: • test the support for itemsets of length 1 - called 1-itemsets - by scanning the database; • discard those that do not meet the minimum support. • Step 2: • extend large 1-itemsets into 2-itemsets by appending one item each time (this generates all candidate itemsets of length two); • test the support and eliminate all 2-itemsets that do not meet the minimum support. • Step 3: • repeat the above steps: extend (k-1)-itemsets into k-itemsets.
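The three steps can be sketched as a level-wise loop. This is a simplified illustration under our own assumptions: `large_itemsets` is a hypothetical name, "minimum support" is read as "at least `min_support`", and candidates are built by appending any item that is itself large, without the further pruning a production algorithm would apply.

```python
def large_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    n = len(transactions)

    def sup(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Step 1: keep the 1-itemsets that meet the minimum support.
    items = sorted(set().union(*transactions))
    level = [frozenset([i]) for i in items if sup(frozenset([i])) >= min_support]

    result = {}
    while level:
        for itemset in level:
            result[itemset] = sup(itemset)
        # Steps 2 and 3: extend each large (k-1)-itemset by one large item,
        # then discard candidates below the minimum support.
        large_items = sorted(set().union(*level))
        candidates = {s | {i} for s in level for i in large_items if i not in s}
        level = [c for c in candidates if sup(c) >= min_support]
    return result

TRANSACTIONS = [
    {"milk", "bread", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "coffee", "biscuits"},
]

for itemset, s in large_itemsets(TRANSACTIONS, 0.4).items():
    print(sorted(itemset), s)
```

With a minimum support of 40% this returns {milk}, {juice}, {bread} and {milk, juice}, matching the frequent-itemset example earlier in the section.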

Association Rules among Hierarchies • Items might be divided among disjoint hierarchies based on some classification, e.g. Beverage can be divided into Juice and Milk. • Associations might occur among the hierarchies of items. • Example: healthy frozen yoghurt => bottled water • Particularly interesting are associations across hierarchies. • This kind of information can be used to arrange different kinds of items in a supermarket.

Negative Associations • Negative associations are more difficult to detect than positive associations. • Example: 60% of customers who buy crisps do not buy bottled water. • There are usually more negative associations than positive ones. • The majority of itemset combinations do not occur in databases. • Finding interesting negative associations can be difficult. CA306 Data Mining

Association Rules - Additional Considerations • Sampling: • For very large databases, sampling improves efficiency. • Truly representative samples can help to find most of the rules. • The danger is that • false positives might be discovered (large itemsets that are not truly large); • true positives might be missing. • Other problems: • Cardinality of itemsets and volume of transactions can be very high. • Variability of transactions (geographical, seasonal) makes sampling difficult. • Multiple classifications along different dimensions.

5.4 Sequential Patterns • Sequential patterns are based on sequences of itemsets. • Assume transactions to be ordered by time. • Example: • transactions in a supermarket • {milk, bread, juice} ; {bread, eggs} ; {milk, coffee, biscuits} may be based on three visits of a customer • A subsequence of a sequence is obtained by deleting one or more itemsets. • Example: • let {milk, bread, juice} ; {bread, eggs} ; {milk, coffee, biscuits} be the original sequence • {milk, bread, juice} ; {bread, eggs} is a subsequence • {milk, bread, juice} ; {milk, coffee, biscuits} is a subsequence

Support for Sequences • A sequence of itemsets a1 ; ... ; am is contained in another sequence S if S has a subsequence b1 ; ... ; bm such that ai ⊆ bi for 1 <= i <= m. • Example: • {milk, bread} ; {coffee, biscuits} is contained in {milk, bread, juice} ; {bread, eggs} ; {milk, coffee, biscuits} • The support of a sequence S is the percentage of the given sequences that contain S.
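Containment and support for sequences can be sketched as below. The names `contains` and `sequence_support` are our own, and the second customer sequence is invented purely for illustration. The check scans S left to right and matches each pattern itemset against the earliest itemset of S that includes it, which is sufficient for order-preserving containment.

```python
def contains(sequence, pattern):
    """True if `pattern` is contained in `sequence` (both are lists of sets)."""
    pos = 0  # index of the next pattern itemset still to be matched
    for itemset in sequence:
        if pos < len(pattern) and pattern[pos] <= itemset:
            pos += 1
    return pos == len(pattern)

def sequence_support(pattern, sequences):
    """Fraction of the given sequences that contain `pattern`."""
    return sum(contains(s, pattern) for s in sequences) / len(sequences)

# First sequence is the slide's example; the second is hypothetical.
customer_a = [{"milk", "bread", "juice"}, {"bread", "eggs"}, {"milk", "coffee", "biscuits"}]
customer_b = [{"coffee", "biscuits"}, {"milk", "bread"}]

pattern = [{"milk", "bread"}, {"coffee", "biscuits"}]
print(contains(customer_a, pattern))                        # True
print(contains(customer_b, pattern))                        # False (wrong order)
print(sequence_support(pattern, [customer_a, customer_b]))  # 0.5
```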

Discovery of Patterns in Time Series • Time series are sequences of events. • An event might be a fixed type of transaction. • Example: • closing price of a stock or fund each day. • Analysis of time series: • find period of time in which the stock did not fluctuate more than 1% • find period (week/month/quarter) with the greatest loss • identify stocks with similar behaviour CA306 Data Mining

5.5 Classification and Regression • Classification Rules • Regression • Tree-structured Rules CA306 Data Mining

Discovery of Classification Rules • Classification means defining/identifying a function that maps an object into one of many possible classes. • Example: a bank wants to classify loan applicants into "loanworthy" and "not loanworthy" • a classification rule could define the classification • not loanworthy: current monthly debt obligation exceeds 25% of monthly net income • loanworthy: otherwise • loanworthiness is a dependent, categorical attribute • In general there is one rule (set) per class: (var1 in range1) and ... and (varn in rangen) => object O in class C1 • var1, ..., varn are the predictor attributes
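The loan example can be written as a small function that maps an applicant to a class. This is only a sketch of the rule stated on the slide; the function and parameter names are hypothetical.

```python
def classify_applicant(monthly_debt, monthly_net_income, threshold=0.25):
    """Apply the slide's rule: debt above 25% of net income => not loanworthy."""
    if monthly_debt > threshold * monthly_net_income:
        return "not loanworthy"
    return "loanworthy"

print(classify_applicant(300, 1000))  # not loanworthy
print(classify_applicant(200, 1000))  # loanworthy
```

Here the two predictor attributes are the monthly debt and the monthly net income; the returned string is the value of the dependent attribute.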

Support and Confidence • Again we can define support and confidence for these rules. • The support for a classification condition C is the percentage of tuples that satisfy C. • The support for a rule C1 => C2 is the support for the condition C1 AND C2. (C1 AND C2 is satisfied by the objects that satisfy both C1 and C2.) • Consider those tuples that satisfy condition C1. The confidence for a rule C1 => C2 is the percentage of such tuples that also satisfy condition C2.
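A small sketch of these definitions, treating conditions as predicates over tuples. The four tuples, the attribute names and the conditions `c1` and `c2` are invented for illustration only.

```python
# Hypothetical tuples: debt-to-income ratio and whether the loan defaulted.
TUPLES = [
    {"debt_ratio": 0.30, "defaulted": True},
    {"debt_ratio": 0.40, "defaulted": True},
    {"debt_ratio": 0.35, "defaulted": False},
    {"debt_ratio": 0.10, "defaulted": False},
]

def c1(t):
    return t["debt_ratio"] > 0.25   # condition C1

def c2(t):
    return t["defaulted"]           # condition C2

def condition_support(condition, tuples=TUPLES):
    """Fraction of tuples that satisfy `condition`."""
    return sum(1 for t in tuples if condition(t)) / len(tuples)

def rule_support(cond1, cond2, tuples=TUPLES):
    """Support of cond1 => cond2: fraction satisfying both conditions."""
    return condition_support(lambda t: cond1(t) and cond2(t), tuples)

def rule_confidence(cond1, cond2, tuples=TUPLES):
    """Among tuples satisfying cond1, the fraction that also satisfy cond2."""
    matching = [t for t in tuples if cond1(t)]
    return sum(1 for t in matching if cond2(t)) / len(matching)

print(condition_support(c1))              # 0.75
print(rule_support(c1, c2))               # 0.5
print(round(rule_confidence(c1, c2), 3))  # 0.667
```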

Regression • Regression is similar to classification, except that the dependent variable is numerical (and not categorical). • Rules (such as classification rules) can be regarded as functions. • A regression rule is a function that maps the predictor variables to a numerical target variable. • Example: LabTest(patientID, test1, ... , testn) • the values in that relation result from a series of lab tests • the target variable P is the probability of survival - a numerical variable • the regression rule: (test1 in range1) and ... and (testn in rangen) => P = x • the regression function is P = f(test1, ... , testn)

Regression (2) • If P appears as a function y = f(x1, ... , xn) and f is linear in the domain variables, then the process of deriving f from a given set of tuples <x1, ... , xn, y> is called linear regression. • Linear regression is a common statistical technique. CA306 Data Mining
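For the simplest case of a single domain variable, y = f(x) = slope * x + intercept, the function can be derived from the tuples <x, y> by ordinary least squares. The sketch below covers only this one-variable special case, and the function name `fit_line` and the sample points are our own.

```python
def fit_line(points):
    """Least-squares fit of y = slope * x + intercept to (x, y) tuples.

    Assumes at least two points with distinct x values.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Illustrative tuples lying exactly on y = 2x + 1.
slope, intercept = fit_line([(0, 1), (1, 3), (2, 5), (3, 7)])
print(slope, intercept)  # 2.0 1.0
```

With several domain variables the same idea applies, but the coefficients are usually obtained with a linear-algebra routine rather than these closed-form sums.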

Tree-Structured Rules • Specific classification and regression rules shall now be examined. • These are rules that can be represented as trees - called classification trees or decision trees. • These trees are typically the output of the data mining activity. • Each path from the root to a leaf node represents one classification rule. • Example: Insurance risk determination for motor insurance

Age
  <= 25: Car Type
    sports: YES
    family: NO
  > 25: NO

Decision Trees • A decision tree is a graphical representation of a collection of classification rules. • Each internal node in the tree is labelled with a predictor or splitting attribute. • Each outgoing edge of an internal node is labelled with a predicate that involves the splitting attribute. • Each leaf node is labelled with a value of the dependent attribute. • A classification rule can be associated with each leaf node - constructed as the conjunction of the predicates on the path from the root: • Age <= 25 and Car Type = sports for the YES-leaf • Decision trees are constructed in two phases: • growth phase: create tree based on specialised rules from an input database (relation) • pruning phase: reduce tree size by generalising rules
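The motor insurance tree can be encoded as nested tuples to show how a record is classified and how one rule per leaf falls out of the paths. The representation, the record keys `age` and `car_type`, and the function names are assumptions made for this sketch; it covers neither the growth nor the pruning phase.

```python
# Internal node: (attribute, [(edge label, predicate, child), ...]); leaf: a string.
TREE = ("Age", [
    ("<= 25", lambda r: r["age"] <= 25, ("Car Type", [
        ("= sports", lambda r: r["car_type"] == "sports", "YES"),
        ("= family", lambda r: r["car_type"] == "family", "NO"),
    ])),
    ("> 25", lambda r: r["age"] > 25, "NO"),
])

def classify(node, record):
    """Follow the edges whose predicates hold until a leaf is reached."""
    while not isinstance(node, str):
        _, branches = node
        for _, predicate, child in branches:
            if predicate(record):
                node = child
                break
        else:
            raise ValueError("no branch matches the record")
    return node

def rules(node, path=()):
    """One (condition, class) pair per leaf: the conjunction along its path."""
    if isinstance(node, str):
        return [(" and ".join(path), node)]
    attribute, branches = node
    found = []
    for label, _, child in branches:
        found += rules(child, path + (f"{attribute} {label}",))
    return found

print(classify(TREE, {"age": 22, "car_type": "sports"}))  # YES
for condition, label in rules(TREE):
    print(condition, "=>", label)
```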

5.6 Other Types of Data Mining • Neural Networks • Genetic Algorithms • Clustering and Segmentation CA306 Data Mining

Neural Networks • Techniques from artificial intelligence can be used to generalise regression. • Neural networks provide an iterative method to carry out this generalised regression. • Neural networks use a curve-fitting approach to infer a function from a set of samples. • This process is based on learning: a test sample is the initial input, the system then incrementally infers functions based on more samples • Neural networks can be applied to classification problems. • Modelling time series with neural networks is difficult. CA306 Data Mining

Genetic Algorithms (1) • Genetic algorithms (GA) are a class of randomised search procedures for adaptive and robust search over a wide range of search topologies. • Principle: • Genetic algorithms extend the idea of characterising human DNA by a four-letter alphabet (A,C,T,G). • Construction: • Devise an alphabet that allows the encoding of a solution to the decision problem in terms of strings of that alphabet. • Usage: • Study the cutting and combination of strings (compare natural reproduction and evolution). • New generations of individuals (solutions) are generated and assessed - survival of the fittest. CA306 Data Mining

Genetic Algorithms (2) • Generation of solutions - comparison with other techniques. • GA search uses a set of solutions during each generation rather than a single solution. • The search in the string-space represents a much larger parallel search in the space of encoded solutions. • The memory of the search completed is represented solely by the set of solutions available for generation. • A GA is a randomised algorithm since search mechanisms use probabilistic operators. • While progressing from one generation to the next, a GA finds near-optimal balance between knowledge acquisition and exploitation by manipulating encoded solutions. CA306 Data Mining
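A toy genetic algorithm makes these ideas concrete. Everything in this sketch is an assumption chosen for illustration: a two-letter alphabet {0, 1}, the "count the ones" fitness function, single-point cutting and recombination, a small mutation rate, and keeping the best individual unchanged in each generation.

```python
import random

def one_max(bits):
    """Toy fitness: the number of 1s in the encoded solution."""
    return sum(bits)

def evolve(fitness, length=20, population_size=30, generations=40,
           mutation_rate=0.02, seed=0):
    """Return the best individual and the best fitness seen per generation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(population_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: population_size // 2]  # survival of the fittest
        children = [population[0][:]]                 # keep the current best
        while len(children) < population_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, length)            # cut both strings here
            child = a[:cut] + b[cut:]                 # and recombine them
            child = [1 - bit if rng.random() < mutation_rate else bit
                     for bit in child]                # probabilistic mutation
            children.append(child)
        population = children
    return max(population, key=fitness), history

best, history = evolve(one_max)
print(one_max(best), history[0], history[-1])
```

Each generation holds a whole set of solutions rather than a single one, and the only memory of the search so far is the current population, as described above.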

Clustering and Segmentation • Clustering is about identification and classification. • Clustering tries to identify categories (or clusters) to which a data object can be mapped. • The categories can be disjoint or might overlap; they might be organised into trees. • A related problem: multivariate probability density functions. CA306 Data Mining

5.7 Applications of Data Mining • Decision-making contexts: • marketing: • analysis of customer behaviour based on buying patterns; • determination of marketing strategies (store locations, advertising campaigns, etc); • segmentation of customers, stores, products. • finance: • analysis of creditworthiness of clients; • performance analysis of finance investments; • evaluation of financing options; • fraud detection. CA306 Data Mining

Applications • Manufacturing: • optimisation of resources (machines, manpower, material); • optimal design of manufacturing process, shop-floor layout, etc. • Health care: • analysis of effectiveness of certain treatments; • optimisation of processes in a hospital; • analysing side effects of drugs; • relating patient wellness and doctor qualifications. CA306 Data Mining
