Issues with Data Mining
Data Mining involves Generalization
• Data mining learns generalizations of the instances in the training data
  • E.g. a decision tree learnt from the weather data captures generalizations about how to predict values of the Play attribute
• This means that generalizations predict (or describe) the behaviour of instances beyond the training data
• In other words, data mining extracts knowledge from raw data
• This knowledge drives the end-user's decision-making process
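A minimal sketch of the idea, assuming scikit-learn; the numerically encoded weather-like data here is hypothetical, not the lecture's exact dataset:

from sklearn.tree import DecisionTreeClassifier

# Hypothetical encoding: outlook (0=sunny, 1=overcast, 2=rainy), windy (0/1)
X_train = [[0, 0], [0, 1], [1, 0], [2, 0], [2, 1]]
y_train = ["no", "no", "yes", "yes", "no"]  # values of the Play attribute

tree = DecisionTreeClassifier().fit(X_train, y_train)

# The instance [1, 1] (overcast, windy) was not in the training data; the
# tree's prediction for it is a generalization from the instances it has seen.
print(tree.predict([[1, 1]]))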
Generalization as Search
• The process of generalization can be viewed as searching a space of all possible patterns or models for one that fits the data
• This view provides a standard framework for understanding all data mining techniques
  • E.g. decision tree learning involves searching through the space of all possible decision trees
• Lecture 4 shows two example decision trees that fit the weather data
  • One of them (Example 2) is a better generalization than the other
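A toy illustration of this search view in plain Python: enumerate a tiny space of candidate patterns – here, all single-attribute decision stumps – and keep the one that best fits the training data. The data is made up; real learners search far larger spaces and must rely on heuristics instead of enumeration.

from collections import Counter

def stump_errors(X, y, attr):
    """Training errors of a stump predicting the majority class per value of attr."""
    majority = {}
    for xi, yi in zip(X, y):
        majority.setdefault(xi[attr], Counter())[yi] += 1
    return sum(yi != majority[xi[attr]].most_common(1)[0][0]
               for xi, yi in zip(X, y))

X = [["sunny", "high"], ["sunny", "normal"], ["rainy", "high"], ["rainy", "normal"]]
y = ["no", "yes", "no", "yes"]

# Search the space of stumps for the one that fits the data best
best_attr = min(range(len(X[0])), key=lambda a: stump_errors(X, y, a))
print("best attribute to split on:", best_attr)  # index 1 fits perfectly here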
Bias
• Important choices made in a data mining system are
  • the representation language,
  • the search method, and
  • the model pruning method
• This means each data mining scheme involves
  • Language bias – the language chosen to represent the patterns or models
  • Search bias – the order in which the space is searched
  • Overfitting-avoidance bias – the way overfitting to the training data is avoided
Language Bias
• Different languages can be used for representing patterns and models
  • E.g. rules and decision trees
• A concept covers a subset of the training instances
  • That subset can be described by a rule, and the whole concept by a disjunction of rules
  • E.g. the classifier for the weather data can be represented as a disjunction of rules (see the sketch below)
• Languages differ in their ability to represent patterns and models
  • When a language with lower expressive power is used, the data mining system may be unable to represent the best model and hence may not achieve good performance
• Domain knowledge helps to cut down the search space
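A sketch of the rule-based representation as executable code: the classifier is a disjunction of if-then rules. These rules follow the commonly quoted solution to the weather data, but treat them as illustrative rather than as the lecture's exact rule set.

def play(outlook, humidity, windy):
    # Each if-statement is one rule; together they form a disjunction.
    if outlook == "overcast":
        return "yes"
    if outlook == "sunny" and humidity == "normal":
        return "yes"
    if outlook == "rainy" and not windy:
        return "yes"
    return "no"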
Search Bias
• An exhaustive search over the search space is computationally expensive
• Search is sped up by using heuristics
  • E.g. pure child nodes indicate good tree stumps in decision tree learning
• By definition, heuristics cannot guarantee optimum patterns or models
  • E.g. using information gain may mislead us into selecting a suboptimal attribute at the root
• More complex search strategies are possible
  • Those that pursue several alternatives in parallel
  • Those that allow backtracking
• High-level search biases
  • General-to-specific: start with a root node and grow the decision tree to fit the data
  • Specific-to-general: choose specific examples in each class and then generalize the class by including k-nearest-neighbour examples
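A sketch of the information-gain heuristic used to rank attributes when growing a tree general-to-specific (plain Python; the data at the end is made up):

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(X, y, attr):
    """Reduction in entropy from splitting the data on attribute attr."""
    n = len(y)
    subsets = {}
    for xi, yi in zip(X, y):
        subsets.setdefault(xi[attr], []).append(yi)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(y) - remainder

X = [["sunny", "high"], ["sunny", "normal"], ["overcast", "high"], ["rainy", "normal"]]
y = ["no", "yes", "yes", "yes"]
print(information_gain(X, y, 0), information_gain(X, y, 1))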
Overfitting-avoidance bias
• We want to search for the 'best' patterns and models
  • Other things being equal, simpler models are preferred
• Two strategies (sketched below)
  • Start with the simplest model and stop refining it when it starts to become too complex
  • Start with a complex model and prune it to make it simpler
• Each strategy biases the search in a different way
• Biases are unavoidable in practice
  • Each data mining scheme involves its own configuration of biases
  • These biases may serve some problems well and others poorly
• There is no universal best learning scheme!
  • We saw this in our practicals with Weka
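A sketch of the two strategies, assuming scikit-learn (version 0.22 or later for cost-complexity pruning); the hyperparameter values are illustrative:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Strategy 1 ("pre-pruning"): stop growing before the tree becomes complex.
pre_pruned = DecisionTreeClassifier(max_depth=3).fit(X, y)

# Strategy 2 ("post-pruning"): grow a full tree, then prune it back with
# cost-complexity pruning; larger ccp_alpha prunes more aggressively.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02).fit(X, y)

print(pre_pruned.get_depth(), post_pruned.get_depth())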
Combining Multiple Models
• Because there is no single ideal data mining scheme, it is often useful to combine multiple models
• The idea of democracy – decisions made based on collective wisdom
  • Each model acts like an expert, using its knowledge to make decisions
• Three general approaches
  • Bagging
  • Boosting
  • Stacking
• Bagging and boosting both follow the same basic approach
  • Take a vote on the class predictions of the individual models
  • Bagging uses a simple average of votes, while boosting uses a weighted average
  • Boosting gives more weight to the more knowledgeable experts
• Boosting is generally considered the most effective of the three
Bias-Variance Decomposition
• Assume
  • an infinite number of training data sets, all of the same size n
  • an infinite number of classifiers, one trained on each of these data sets
• For any learning scheme
  • Bias = the expected error of the classifier that remains even with infinitely many training sets; it measures how well the scheme matches the problem
  • Variance = the expected error due to the particular training set used
  • Total expected error = bias + variance
• Combining multiple classifiers decreases the expected error by reducing the variance component
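For squared error the decomposition can be written out explicitly. This is the standard regression form; the slide's "bias + variance" additionally omits the irreducible noise term, and for classifiers the decomposition is defined analogously, though several variants exist in the literature:

\mathbb{E}_{D,\varepsilon}\big[(y - \hat{f}_D(x))^2\big]
  = \underbrace{\big(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\big[\big(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}

where f is the true function, \hat{f}_D is the model learnt from training set D, and \sigma^2 is the noise variance. Averaging many models reduces the variance term, which is why combining classifiers helps.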
Bagging
• Bagging stands for bootstrap aggregating
• It combines equally weighted predictions from multiple models
• Bagging exploits instability in learning schemes
  • Instability – a small change in the training data results in a big change in the model
• Idealized version for classification
  • Collect several independent training sets
  • Build a classifier from each training set
    • E.g. learn a decision tree from each training set
  • The class of a test instance is the prediction that receives the most votes from all the classifiers
• In practice it is not feasible to obtain several independent training sets, so bagging resamples the one available training set (see the algorithm below)
Bagging Algorithm
• Model generation
  • Let n be the number of instances in the training data
  • For each of t iterations:
    • Sample n instances with replacement from the training data
    • Apply the learning algorithm to the sample
    • Store the resulting model
• Classification
  • For each of the t models:
    • Predict the class of the instance using the model
  • Return the class that has been predicted most often
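A sketch of this pseudocode, assuming scikit-learn decision trees as the base learning algorithm (Weka's implementation is weka.classifiers.meta.Bagging; the function names here are illustrative):

import random
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, t=10, seed=0):
    """Model generation: t models, each trained on a bootstrap sample."""
    rng = random.Random(seed)
    n, models = len(X), []
    for _ in range(t):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n with replacement
        model = DecisionTreeClassifier().fit([X[i] for i in idx],
                                             [y[i] for i in idx])
        models.append(model)
    return models

def bagging_predict(models, x):
    """Classification: return the class predicted most often."""
    votes = [m.predict([x])[0] for m in models]
    return Counter(votes).most_common(1)[0][0]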
Boosting
• Multiple data mining methods might complement each other
  • Each method performing well on a different subset of the data
• Boosting combines complementary models
  • Using weighted voting
• Boosting is iterative
  • Each new model is built to overcome the deficiencies of the earlier models
• There are several variants of boosting
  • AdaBoost.M1 – based on the idea of giving weights to instances
• Boosting involves two stages
  • Model generation
  • Classification
Boosting (AdaBoost.M1)
• Model generation
  • Assign equal weight to each training instance
  • For each of t iterations:
    • Apply the learning algorithm to the weighted dataset and store the resulting model
    • Compute the error e of the model on the weighted dataset and store the error
    • If e = 0 or e >= 0.5:
      • Terminate model generation
    • For each instance in the dataset:
      • If the instance is classified correctly by the model:
        • Multiply the weight of the instance by e / (1 – e)
    • Normalize the weights of all instances
• Classification
  • Assign a weight of zero to all classes
  • For each of the t (or fewer) models:
    • Add –log(e / (1 – e)) to the weight of the class predicted by the model
  • Return the class with the highest weight
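A sketch of AdaBoost.M1 following this pseudocode, assuming scikit-learn and NumPy (Weka's implementation is weka.classifiers.meta.AdaBoostM1; function names here are illustrative, and the e = 0 edge case of a perfect model is simplified by stopping without storing it):

import math
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1_fit(X, y, t=10):
    X, y = np.asarray(X), np.asarray(y)
    w = np.full(len(X), 1.0 / len(X))      # assign equal initial weights
    models = []
    for _ in range(t):
        model = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        wrong = model.predict(X) != y
        e = w[wrong].sum() / w.sum()       # weighted error on the dataset
        if e == 0 or e >= 0.5:             # termination condition
            break
        w[~wrong] *= e / (1 - e)           # down-weight correct instances
        w /= w.sum()                       # normalize the weights
        models.append((model, e))
    return models

def adaboost_m1_predict(models, x):
    scores = {}
    for model, e in models:
        c = model.predict([x])[0]
        # -log(e / (1 - e)) == log((1 - e) / e): weight of this model's vote
        scores[c] = scores.get(c, 0.0) + math.log((1 - e) / e)
    return max(scores, key=scores.get)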
Stacking
• Bagging and boosting combine models of the same type
  • E.g. a set of decision trees
• Stacking is applied to models of different types
• Because different models may not perform comparably well, simple voting may not work
  • Voting is problematic when, say, two out of three classifiers perform poorly: they can outvote the one good classifier
• Stacking uses a metalearner to combine the different base learners
  • Base learners: level-0 models
  • Meta learner: level-1 model
  • The predictions of the base learners are fed as inputs to the meta learner
• The base learners' predictions on their own training data cannot be used as input to the meta learner
  • Such predictions are optimistically biased; instead, use the base learners' cross-validation predictions
• Because most of the work is done by the base learners, the meta learner can use a simple learning scheme
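A sketch of stacking, assuming scikit-learn: two level-0 learners (a decision tree and k-nearest neighbours) and a simple level-1 learner (logistic regression), with cross-validated predictions standing in for the held-out base-learner outputs:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
base_learners = [DecisionTreeClassifier(), KNeighborsClassifier()]

# Level-0: cross-validation predictions avoid feeding the meta learner
# optimistically biased predictions made on the base learners' training data.
meta_features = np.column_stack(
    [cross_val_predict(m, X, y, cv=5) for m in base_learners])

# Level-1: a simple scheme combines the base learners' predictions.
meta_learner = LogisticRegression(max_iter=1000).fit(meta_features, y)

# Retrain the base learners on all the data for use at prediction time.
for m in base_learners:
    m.fit(X, y)

scikit-learn also ships a ready-made StackingClassifier (in sklearn.ensemble) that packages these steps.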
Combining models using Weka
• Weka offers methods to perform bagging, boosting and stacking over classifiers
• In the Explorer, under the Classify tab, expand the 'meta' section of the hierarchical menu
• AdaBoostM1 (one of the boosting methods) misclassifies only 7 out of the 150 instances in the Iris data
• You are encouraged to try these methods on your own