
Issues with Data Mining



  1. Issues with Data Mining

  2. Data Mining involves Generalization • Data mining learns generalizations of the instances in the training data • E.g. a decision tree learnt from weather data captures generalizations about the prediction of values for the Play attribute • This means generalizations predict (or describe) the behaviour of instances beyond the training data • In other words, data mining extracts knowledge from raw data • This knowledge drives the end-user's decision-making process

  3. Generalization as Search • The process of generalization can be viewed as searching the space of all possible patterns or models • for a pattern or model that fits the data • This view provides a standard framework for understanding all data mining techniques • E.g. decision tree learning involves searching through all possible decision trees • Lecture 4 shows two example decision trees that fit the weather data • One of them (Example 2) is a better generalization than the other

  4. Bias • The important choices made in a data mining system are • the representation language, • the search method and • the model pruning method • This means each data mining scheme involves • Language bias – the language chosen to represent the patterns or models • Search bias – the order in which the space is searched • Overfitting-avoidance bias – the way overfitting to the training data is avoided

  5. Language Bias • Different languages are used for representing patterns and models • E.g. rules and decision trees • A concept covers a subset of the training data • That subset can be described as a disjunction of rules • E.g. the classifier for the weather data can be represented as a disjunction of rules (see the sketch below) • Languages differ in their ability to represent patterns and models • This means that when a language with lower representational ability is used, the data mining system may not achieve good performance • Domain knowledge helps to cut down the search space
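
  For illustration, here is the weather-data classifier written as a disjunction of rules in Python (a minimal sketch; the rule set shown is one plausible set for the classic weather dataset, not necessarily the one from the lecture):

    # Each 'if' below is one rule; the rule set as a whole is a disjunction:
    # an instance is classified 'yes' if ANY rule fires.
    def play(outlook, humidity, windy):
        if outlook == "overcast":
            return "yes"
        if outlook == "sunny" and humidity == "normal":
            return "yes"
        if outlook == "rainy" and not windy:
            return "yes"
        return "no"

    print(play("sunny", "high", False))    # -> no
    print(play("overcast", "high", True))  # -> yes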

  6. Search Bias • An exhaustive search over the search space is computationally expensive • Search is sped up by using heuristics • E.g. pure child nodes indicate good tree stumps in decision tree learning • By definition, heuristics cannot guarantee optimum patterns or models • E.g. using information gain may mislead us into selecting a suboptimal attribute at the root (see the sketch below) • More complex search strategies are possible • Those that pursue several alternatives in parallel • Those that allow backtracking • A high-level search bias is the search direction • General-to-specific: start with a root node and grow the decision tree to fit the data • Specific-to-general: choose specific examples in each class and then generalize the class by including k-nearest-neighbour examples
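
  As a concrete illustration of the heuristic, here is a small Python sketch of information gain, evaluated on the standard weather-data counts (9 yes / 5 no overall; the per-branch counts for outlook are the usual textbook figures):

    import math
    from collections import Counter

    def entropy(labels):
        """Entropy of a list of class labels, in bits."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(labels, splits):
        """Gain from splitting `labels` into the sublists in `splits`."""
        n = len(labels)
        return entropy(labels) - sum(len(s) / n * entropy(s) for s in splits)

    before   = ["yes"] * 9 + ["no"] * 5   # all 14 instances
    sunny    = ["yes"] * 2 + ["no"] * 3   # outlook = sunny
    overcast = ["yes"] * 4                # outlook = overcast
    rainy    = ["yes"] * 3 + ["no"] * 2   # outlook = rainy
    print(information_gain(before, [sunny, overcast, rainy]))  # ~0.247 bits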

  7. Overfitting-avoidance bias • We want to search for the ‘best’ patterns and models • Other things being equal, simpler models are preferred • Two strategies (contrasted in the sketch below) • Start with the simplest model and stop building the model before it becomes too complex • Start with a complex model and prune it to make it simpler • Each strategy biases the search in a different way • Biases are unavoidable in practice • Each data mining scheme involves some configuration of biases • These biases may serve some problems well and others poorly • There is no universally best learning scheme! • We saw this in our practicals with Weka
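
  The two strategies can be contrasted with scikit-learn's decision trees (a sketch under the assumption that scikit-learn is available; the lecture itself uses Weka):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # Strategy 1 (pre-pruning): stop growing before the tree gets complex.
    pre = DecisionTreeClassifier(max_depth=2).fit(X, y)

    # Strategy 2 (post-pruning): grow a full tree, then prune it back,
    # here via scikit-learn's cost-complexity pruning parameter.
    post = DecisionTreeClassifier(ccp_alpha=0.02).fit(X, y)

    print(pre.get_n_leaves(), post.get_n_leaves())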

  8. Combining Multiple Models • Because there is no ideal data mining scheme, it is useful to combine multiple models • The idea of democracy – decisions made on the basis of collective wisdom • Each model acts like an expert, using its knowledge to make decisions • Three general approaches • Bagging • Boosting • Stacking • Bagging and boosting follow the same basic approach • Take a vote on the class predictions of the individual models • Bagging uses a simple average of votes while boosting uses a weighted average • Boosting gives more weight to the more knowledgeable experts • Boosting is generally considered the most effective of the three

  9. Bias-Variance Decomposition • Assume • infinitely many training data sets of the same size, n, drawn from the same distribution • a classifier trained on each of these data sets • For any learning scheme • Bias = the expected error of the classifier that remains no matter how much training data is used • Variance = the expected error due to the particular training set used • Total expected error = bias + variance • Combining multiple classifiers decreases the expected error by reducing the variance component (illustrated in the sketch below)
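
  A toy simulation can make the decomposition concrete. The sketch below uses squared-loss regression, where the decomposition (error = bias² + variance + noise) is easy to compute numerically; the slide's additive bias + variance form is the classification analogue:

    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda x: np.sin(x)   # true target function
    x0 = 1.0                  # a fixed test point

    preds = []
    for _ in range(2000):     # many training sets of the same size n = 20
        x = rng.uniform(0, np.pi, 20)
        y = f(x) + rng.normal(0, 0.3, 20)
        a, b = np.polyfit(x, y, 1)  # a deliberately biased learner: a straight line
        preds.append(a * x0 + b)

    preds = np.array(preds)
    print("bias^2   ~", (preds.mean() - f(x0)) ** 2)  # persists however many sets we draw
    print("variance ~", preds.var())                  # error due to the particular set drawn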

  10. Bagging • Bagging stands for bootstrap aggregating • It combines equally weighted predictions from multiple models • Bagging exploits instability in learning schemes • Instability – a small change in the training data results in a big change in the model • Idealized version for a classifier • Collect several independent training sets • Build a classifier from each training set • E.g. learn a decision tree from each training set • The class of a test instance is the prediction that received the most votes from all the classifiers • In practice it is not feasible to obtain several independent training sets, so bootstrap samples are used instead (next slide)

  11. Bagging Algorithm (a sketch in code follows) • Model generation • Let n be the number of instances in the training data • For each of t iterations: • Sample n instances with replacement from the training data • Apply the learning algorithm to the sample • Store the resulting model • Classification • For each of the t models: • Predict the class of the instance using the model • Return the class that has been predicted most often
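
  A direct Python rendering of the algorithm (a sketch assuming scikit-learn-style models with fit/predict; decision trees are used here as the base learner):

    import numpy as np
    from collections import Counter
    from sklearn.tree import DecisionTreeClassifier

    def bagging_fit(X, y, t=10, seed=0):
        """Model generation: t models, each on a bootstrap sample of size n."""
        rng = np.random.default_rng(seed)
        n = len(X)
        models = []
        for _ in range(t):
            idx = rng.integers(0, n, size=n)  # sample n instances with replacement
            models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return models

    def bagging_predict(models, x):
        """Classification: return the class predicted most often."""
        votes = [m.predict(x.reshape(1, -1))[0] for m in models]
        return Counter(votes).most_common(1)[0][0]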

  12. Boosting • Multiple data mining methods might complement each other • Each method performs well on a different subset of the data • Boosting combines complementary models • using weighted voting • Boosting is iterative • Each new model is built to overcome the deficiencies of the earlier models • There are several variants of boosting • AdaBoost.M1 – based on the idea of giving weights to instances • Boosting involves two stages • Model generation • Classification

  13. Boosting • Model generation • Assign equal weight to each training instance • For each of t iterations: • Apply the learning algorithm to the weighted dataset and store the resulting model • Compute the error e of the model on the weighted dataset and store the error • If e = 0 or e >= 0.5: • Terminate model generation • For each instance in the dataset: • If the instance is classified correctly by the model: • Multiply the weight of the instance by e/(1-e) • Normalize the weights of all instances • Classification • Assign a weight of zero to all classes • For each of the t (or fewer) models: • Add -log(e/(1-e)) to the weight of the class predicted by the model • Return the class with the highest weight
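
  The pseudocode translates almost line for line into Python (a sketch; it assumes a base learner that accepts per-instance weights, here decision stumps from scikit-learn):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_m1_fit(X, y, t=10):
        """Model generation, following the pseudocode above."""
        n = len(X)
        w = np.full(n, 1.0 / n)           # assign equal weight to each instance
        models, errors = [], []
        for _ in range(t):
            m = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
            wrong = m.predict(X) != y
            e = w[wrong].sum() / w.sum()  # error on the weighted dataset
            if e == 0 or e >= 0.5:
                break                     # terminate model generation
            models.append(m)
            errors.append(e)
            w[~wrong] *= e / (1 - e)      # down-weight correctly classified instances
            w /= w.sum()                  # normalize the weights
        return models, errors

    def adaboost_m1_predict(models, errors, x):
        """Classification: each model's vote is weighted by -log(e/(1-e))."""
        votes = {}
        for m, e in zip(models, errors):
            c = m.predict(x.reshape(1, -1))[0]
            votes[c] = votes.get(c, 0.0) - np.log(e / (1 - e))
        return max(votes, key=votes.get)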

  14. Stacking • Bagging and boosting combine models of the same type • E.g. a set of decision trees • Stacking is applied to models of different types • Because different models may not perform comparably well, simple voting may not work • E.g. voting is problematic when two out of three classifiers perform poorly • Stacking uses a metalearner to combine the different base learners • Base learners: level-0 models • Metalearner: level-1 model • The predictions of the base learners are fed as inputs to the metalearner • The base learners' predictions on the training data cannot be used directly, since they are optimistically biased • Instead, each base learner's cross-validation predictions are used as the meta-level training data • Because most of the work is done by the base learners, the metalearner can use a simple learning scheme
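
  A sketch of stacking (assuming scikit-learn; the base learners and metalearner below are arbitrary illustrative choices):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_predict
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    base = [DecisionTreeClassifier(), GaussianNB()]  # level-0: models of different types

    # Cross-validation predictions (NOT training-set predictions, which would
    # be optimistically biased) form the level-1 training data.
    meta_X = np.column_stack([cross_val_predict(m, X, y, cv=5) for m in base])
    meta = LogisticRegression(max_iter=1000).fit(meta_X, y)  # level-1: a simple scheme

    # At prediction time the base learners (refit on all data) feed the metalearner.
    for m in base:
        m.fit(X, y)
    x_new = X[:1]
    print(meta.predict(np.column_stack([m.predict(x_new) for m in base])))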

  15. Combining models using Weka • Weka offers methods to perform bagging, boosting and stacking over classifiers • In the Explorer, under the Classify tab, expand the ‘meta’ section of the hierarchical menu • AdaBoostM1 (one of the boosting methods) misclassifies only 7 of the 150 instances in the Iris data • You are encouraged to try these methods on your own
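
  For readers without the Weka GUI, here is a rough scikit-learn analogue of the AdaBoostM1-on-Iris experiment (not Weka itself, and the error count will generally differ from the figure quoted on the slide):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import cross_val_predict

    X, y = load_iris(return_X_y=True)
    pred = cross_val_predict(AdaBoostClassifier(), X, y, cv=10)
    print(f"misclassified: {(pred != y).sum()} of {len(y)}")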
