
COT5230 Data Mining


Presentation Transcript


  1. COT5230 Data Mining, Week 2: Data Mining Tasks. MONASH, Australia's International University

  2. Lecture Outline
  • Market Basket Analysis
  • Machine Learning: Basic Concepts

  3. Data Mining Tasks 1
  • Various taxonomies exist. Berry & Linoff define six tasks:
  • Classification
  • Estimation
  • Prediction
  • Affinity Grouping
  • Clustering
  • Description

  4. Data Mining Tasks 2
  • The tasks are also referred to as operations. Cabena et al. define four operations:
  • Predictive Modeling
  • Database Segmentation
  • Link Analysis
  • Deviation Detection

  5. Affinity Grouping
  • Affinity grouping is also referred to as Market Basket Analysis.
  • A common example is the discovery of which items are frequently sold together at a supermarket. If this is known, decisions can be made about:
  • arranging items on shelves
  • which items should be promoted together
  • which items should not be discounted simultaneously

  6. Market Basket Analysis
  • An example rule: "When a customer buys a shirt, in 70% of cases, he or she will also buy a tie! We find this happens in 13.5% of all purchases."
  • Rule body: the customer buys a shirt. Rule head: the customer also buys a tie.
  • Confidence: 70% (how often the head holds when the body holds). Support: 13.5% (how often body and head occur together across all purchases).
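
To make these definitions concrete, here is a minimal Python sketch. The transactions are made up for illustration; the shirt/tie data behind the slide's 70%/13.5% figures is not given in the lecture:

```python
def support_and_confidence(transactions, body, head):
    """Return (support, confidence) for the rule body -> head."""
    n = len(transactions)
    body_count = sum(1 for t in transactions if body <= t)
    both_count = sum(1 for t in transactions if (body | head) <= t)
    support = both_count / n              # fraction of ALL transactions
    confidence = both_count / body_count  # fraction of body-transactions
    return support, confidence

# Hypothetical toy data, not the lecture's.
transactions = [
    {"shirt", "tie"}, {"shirt", "tie", "belt"}, {"shirt"},
    {"tie"}, {"socks"}, {"shirt", "tie"}, {"belt"},
]
sup, conf = support_and_confidence(transactions, {"shirt"}, {"tie"})
print(f"support = {sup:.3f}, confidence = {conf:.3f}")
```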

  7. The Usefulness of Market Basket Analysis
  • Some rules are useful: unknown, unexpected, and indicative of some action to take.
  • Some rules are trivial: known by anyone familiar with the business.
  • Some rules are inexplicable: they seem to have no explanation and do not suggest a course of action.
  • "The key to success in business is to know something that nobody else knows." Aristotle Onassis

  8. Co-Occurrence Table

  Customer  Items
  1         orange juice (OJ), cola
  2         milk, orange juice, window cleaner
  3         orange juice, detergent
  4         orange juice, detergent, cola
  5         window cleaner, cola

             OJ  Cleaner  Milk  Cola  Detergent
  OJ          4        1     1     2          2
  Cleaner     1        2     1     1          0
  Milk        1        1     1     0          0
  Cola        2        1     0     3          1
  Detergent   2        0     0     1          2
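
The counts in that table can be reproduced directly from the five transactions listed above. A short Python sketch:

```python
from itertools import combinations
from collections import Counter

# The five transactions from the slide.
transactions = [
    {"OJ", "cola"},
    {"milk", "OJ", "window cleaner"},
    {"OJ", "detergent"},
    {"OJ", "detergent", "cola"},
    {"window cleaner", "cola"},
]

counts = Counter()
for t in transactions:
    for item in t:
        counts[(item, item)] += 1    # diagonal: item frequency
    for a, b in combinations(sorted(t), 2):
        counts[(a, b)] += 1          # off-diagonal: co-occurrence
        counts[(b, a)] += 1          # keep the matrix symmetric

print(counts[("OJ", "OJ")])    # -> 4, as in the table
print(counts[("OJ", "cola")])  # -> 2, as in the table
```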

  9. The Process for Market Basket Analysis
  • A co-occurrence cube would show associations in three dimensions; more than that is hard to visualize.
  • We must:
  • choose the right set of items
  • generate rules by deciphering the counts in the co-occurrence matrix
  • overcome the practical limits imposed by many items in large numbers of transactions

  10. Choosing the Right Set of Items
  • Choosing the right level of detail (the creation of classes and a taxonomy)
  • Virtual items may be added to take advantage of information that goes beyond the taxonomy
  • Anonymous versus signed transactions

  11. What is a Rule?
  • If condition then result.
  • Note: "If nappies and Thursday then beer" is usually better than "If Thursday then nappies and beer" (in the sense that it is more actionable) because it has just one item in the result.
  • If a 3-way combination is the most common, then consider rules with just one item in the result, e.g.:
  • If A and B, then C
  • If A and C, then B
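
A minimal sketch of the rule shape the slide recommends: given one frequent 3-way combination, enumerate every rule with a single item in the result. The function name is mine, not from the lecture:

```python
def single_head_rules(itemset):
    """All rules from `itemset` with exactly one item in the result."""
    items = set(itemset)
    rules = []
    for head in items:
        body = items - {head}   # everything else goes in the condition
        rules.append((frozenset(body), head))
    return rules

for body, head in single_head_rules({"nappies", "Thursday", "beer"}):
    print(f"if {' and '.join(sorted(body))} then {head}")
```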

  12. Is the Rule a Useful Predictor? - 1
  • Confidence is the ratio of the number of transactions with all the items in the rule to the number of transactions with just the items in the condition.
  • Consider: if B and C then A. If this rule has a confidence of 0.33, it means that when B and C occur in a transaction, there is a 33% chance that A also occurs.

  13. Is the Rule a Useful Predictor? - 2
  • Consider the following table of probabilities of items and their combinations:
  [Table of probabilities for items A, B and C and their combinations; not reproduced in the transcript.]

  14. Is the Rule a Useful Predictor? - 3
  • Now consider the following rules. It is tempting to choose "If B and C then A", because it is the most confident (33%), but there is a problem.
  [Table of candidate rules and their confidences; not reproduced in the transcript.]

  15. Is the Rule a Useful Predictor? - 4
  • This rule is actually worse than just saying that A randomly occurs in the transaction, which happens 45% of the time.
  • A measure called improvement indicates whether the rule predicts the result better than just assuming the result in the first place:

  improvement = p(condition and result) / (p(condition) × p(result))
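
Since confidence = p(condition and result) / p(condition), the formula above reduces to improvement = confidence / p(result). A small sketch using the two figures that appear on these slides:

```python
def improvement(confidence, p_result):
    # improvement = p(cond and result) / (p(cond) * p(result))
    #             = confidence / p(result)
    return confidence / p_result

# Figures recoverable from the surrounding slides:
print(improvement(0.33, 0.45))  # "if B and C then A"     -> ~0.73 (< 1: worse than chance)
print(improvement(0.67, 0.55))  # "if B and C then not A" -> ~1.22 (> 1: better than chance)
```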

  16. Is the Rule a Useful Predictor? - 5
  • Improvement measures how much better a rule is at predicting a result than just assuming the result in the first place.
  • When improvement > 1, the rule is better at predicting the result than random chance.

  17. Is the Rule a Useful Predictor? - 6
  • Consider the improvement for our rules:
  • None of the rules with three items shows any improvement. The best rule in the data actually has only two items: "if A then B". A predicts the occurrence of B 1.31 times better than chance.

  18. Is the Rule a Useful Predictor? - 7
  • When improvement < 1, negating the result produces a better rule. For example, "if B and C then not A" has a confidence of 0.67 and thus an improvement of 0.67/0.55 = 1.22.
  • Negated rules may not be as useful as the original association rules when it comes to acting on the results.

  19. Strengths and Weaknesses
  • Strengths:
  • clear, understandable results
  • supports undirected data mining
  • works on variable-length data
  • is simple to understand
  • Weaknesses:
  • requires exponentially more computational effort as the problem size grows (see the sketch below)
  • suits items in transactions, but not all problems fit this description
  • it can be difficult to determine the right set of items to analyse
  • it does not handle rare items well; simply considering the level of support will exclude these items
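
As flagged in the weaknesses above, a brute-force search illustrates both problems at once: the candidate space grows exponentially with the number of items, and a minimum-support cut-off silently drops rare items. A sketch under those assumptions (names are mine; practical systems prune candidates level by level, as in the Apriori family, rather than enumerating everything):

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=3):
    """Brute-force frequent-itemset search with a minimum-support cut-off."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    # Enumerating every candidate is exponential in the number of items.
    for size in range(1, max_size + 1):
        for cand in combinations(items, size):
            count = sum(1 for t in transactions if set(cand) <= t)
            if count / n >= min_support:
                frequent[cand] = count / n
    return frequent

transactions = [{"OJ", "cola"}, {"milk", "OJ", "window cleaner"},
                {"OJ", "detergent"}, {"OJ", "detergent", "cola"},
                {"window cleaner", "cola"}]
# Note how milk (support 0.2) vanishes below the cut-off: the rare-item problem.
print(frequent_itemsets(transactions, min_support=0.4))
```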

  20. Machine Learning
  • "A general law can never be verified by a finite number of observations. It can, however, be falsified by only one observation." Karl Popper
  • The patterns that machine learning algorithms find can never be definitive theories.
  • Any results discovered must be tested for statistical significance.

  21. The Empirical Cycle
  • Observation → Analysis → Theory → Prediction → (back to Observation)

  22. Concept Learning - 1
  • Example: the concept of a wombat.
  • A learning algorithm could consider many animals and be advised in each case whether it is a wombat or not. From this a definition would be deduced.
  • The definition is:
  • complete if it recognizes all instances of a concept (in this case, a wombat)
  • consistent if it does not classify any negative examples as falling under the concept

  23. Concept Learning - 2
  • An incomplete definition is too narrow and would not recognize some wombats.
  • An inconsistent definition is too broad and would classify some non-wombats as wombats.
  • A bad definition could be both inconsistent and incomplete.
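
Both properties are easy to check mechanically. A small sketch, assuming a hypothetical attribute encoding of animals (the encoding and the predicate are illustrative, not from the lecture):

```python
def evaluate_definition(definition, positives, negatives):
    """Complete: recognises every positive example (every wombat).
    Consistent: accepts no negative example (no non-wombat)."""
    complete = all(definition(x) for x in positives)
    consistent = not any(definition(x) for x in negatives)
    return complete, consistent

def is_wombat(animal):
    # Hypothetical learned definition over toy attributes.
    return animal["legs"] == 4 and animal["marsupial"]

positives = [{"legs": 4, "marsupial": True}]
negatives = [{"legs": 4, "marsupial": False}, {"legs": 2, "marsupial": True}]
print(evaluate_definition(is_wombat, positives, negatives))  # (True, True)
```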

  24. Hypothesis Characteristics - 1
  • Classification Accuracy: 1 in a million wrong is better than 1 in 10 wrong.
  • Transparency: a person is able to understand the hypothesis generated. It is then much easier to take action.

  25. Hypothesis Characteristics - 2
  • Statistical Significance: the hypothesis must perform better than the naïve prediction. (Imagine 80% of the animals considered are wombats and the theory is that all animals are wombats; the theory is right 80% of the time, but nothing has been learnt.)
  • Information Content: we look for a rich hypothesis. The more information contained (while still being transparent), the more understanding is gained and the easier it is to formulate an action plan.
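
The wombat example can be turned into a one-line significance check: compare a hypothesis against the majority-class baseline. A minimal sketch:

```python
def beats_naive_baseline(y_true, y_pred):
    """A hypothesis is only interesting if it outperforms always
    predicting the majority class ('every animal is a wombat')."""
    majority = max(set(y_true), key=y_true.count)
    baseline_acc = y_true.count(majority) / len(y_true)
    model_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return model_acc, baseline_acc, model_acc > baseline_acc

# 80% wombats: the constant rule scores 0.8 yet has learnt nothing.
y_true = ["wombat"] * 8 + ["other"] * 2
print(beats_naive_baseline(y_true, ["wombat"] * 10))  # (0.8, 0.8, False)
```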

  26. Complexity of Search Space
  • Machine learning can be considered a search problem: we wish to find the correct hypothesis from among many.
  • If there are only a few hypotheses we could try them all, but if there are an infinite number we need a better strategy.
  • If we have a measure of the quality of a hypothesis, we can use that measure to select potentially good hypotheses and, based on the selection, try to improve the theories (hill-climbing search).
  • Consider the metaphor of the kangaroo in the mist.
  • This demonstrates that it is important to know the complexity of the search space, and that some pattern recognition problems are almost impossible to solve.

  27. Learning as Compression
  • We have learnt something if we have an algorithm that creates a description of the data that is shorter than the original data set.
  • A knowledge representation is required that is incrementally compressible, along with an algorithm that can achieve that incremental compression.
  • The file-in could be a relational table and the file-out a prediction or a suggested clustering: File-in → Algorithm → File-out.
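
A quick way to see "learning as compression" in practice is to run a general-purpose compressor over the data; this is an illustration of the idea, not the lecture's method:

```python
import os
import zlib

def looks_learnable(data: bytes) -> bool:
    """If a general-purpose compressor shortens the data, the data contains
    regularity that, in principle, a learner could exploit."""
    return len(zlib.compress(data)) < len(data)

print(looks_learnable(b"ABAB" * 250))     # True: a highly regular message
print(looks_learnable(os.urandom(1000)))  # almost certainly False: no structure
```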

  28. Types of Input Message (File-in)
  • Unstructured or random messages
  • Highly structured messages with patterns that are easy to find
  • Highly structured messages that are difficult to decipher
  • Partly structured messages: most data sets considered by data mining are in this class. There are patterns to be found, but the data sets are not highly regular.

  29. Minimum Message Length Principle
  • The best theory to explain a set of data is the one that minimizes the sum of the length, in bits, of the description of the theory, plus the length of the data when encoded with the help of the theory.
  [The slide illustrates this with bit strings: a long original data set versus a shorter pair of strings, the theory plus the data set coded with the theory.]
  • Put another way, if regularity is found in a data set and the description of this regularity together with the description of the exceptions is still shorter than the original data set, then something of value has been found.
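
A toy two-part code makes the principle concrete. Assume the theory "bits are drawn independently with P(1) = p" costs a fixed number of bits to state (the 10-bit figure below is an arbitrary assumption); the data then costs n·H(p) bits, where H is the binary entropy:

```python
import math

def two_part_length(n_ones, n, theory_bits=10):
    """Total description length in bits: theory + data encoded with it."""
    p = n_ones / n
    if p in (0.0, 1.0):
        data_bits = 0.0  # a degenerate source needs no further bits
    else:
        h = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
        data_bits = n * h
    return theory_bits + data_bits

# 1000 bits, 90% ones: encoding them raw costs 1000 bits, but the two-part
# code is far shorter, so the 'mostly ones' theory has found real regularity.
print(two_part_length(n_ones=900, n=1000))  # ~479 bits < 1000
```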

  30. Noise and Redundancy
  • The distortion or mutation of a message is the number of bits that are corrupted.
  • Making the message longer by including redundant information can ensure that a message is received correctly even in the presence of noise (see the sketch below).
  • Some pattern recognition algorithms cope well with the presence of noise; others do not.
  • We could consider a database which lacks integrity to contain a large amount of noise: patterns may exist for a small percentage of the data due solely to noise.
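
A classic illustration of redundancy defeating noise is a repetition code with majority-vote decoding; a minimal sketch, not from the lecture:

```python
from collections import Counter

def encode(bits, k=3):
    """Repeat every bit k times: redundancy that survives noise."""
    return [b for bit in bits for b in [bit] * k]

def decode(received, k=3):
    """Majority vote per block corrects up to (k-1)//2 flips per block."""
    return [Counter(received[i:i + k]).most_common(1)[0][0]
            for i in range(0, len(received), k)]

msg = [1, 0, 1, 1]
sent = encode(msg)
sent[4] ^= 1                # one bit corrupted in transit
print(decode(sent) == msg)  # True: the redundancy absorbed the noise
```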
