Final Review • This is not a comprehensive review but highlights certain key areas.
Top-Level Data Mining Tasks • At the highest level, data mining tasks can be divided into: • Prediction Tasks (supervised learning) • Use some variables to predict unknown or future values of other variables • Classification • Regression • Description Tasks (unsupervised learning) • Find human-interpretable patterns that describe the data • Clustering • Association Rule Mining
Classification: Definition • Given a collection of records (training set) • Each record contains a set of attributes; one of the attributes is the class, which is to be predicted. • Find a model for the class attribute as a function of the values of the other attributes. • Model maps a record to a class value • Goal: previously unseen records should be assigned a class as accurately as possible. • A test set is used to determine the accuracy of the model • Can you think of classification tasks?
Classification • Simple linear • Decision trees (entropy, GINI) • Naïve Bayesian • Nearest Neighbor • Neural Networks
Regression • Predict a value of a given continuous (numerical) variable based on the values of other variables • Greatly studied in statistics • Examples: • Predicting sales amounts of new product based on advertising expenditure. • Predicting wind velocities as a function of temperature, humidity, air pressure, etc. • Time series prediction of stock market indices
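As a rough illustration of the first example above, here is a minimal sketch of simple linear regression fitted by ordinary least squares. The function name `fit_line` and the advertising/sales numbers are made up for illustration; they do not come from the slides.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data: advertising expenditure vs. sales amount
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

intercept, slope = fit_line(ad_spend, sales)
predicted_sales = intercept + slope * 5.0  # prediction for an unseen spend of 5.0
print(intercept, slope, predicted_sales)
```

The toy data lie exactly on a line, so the fit recovers it; real data would leave residual error.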
Clustering • Given a set of data points, find clusters so that • Data points in the same cluster are similar • Data points in different clusters are dissimilar • You try it on the Simpsons: how can we cluster these 5 “data points”?
Association Rule Discovery • Given a set of records, each of which contains some number of items from a given collection • Produce dependency rules which will predict the occurrence of an item based on occurrences of other items. • Rules discovered: {Milk} --> {Coke}; {Diaper, Milk} --> {Beer}
Attribute Values • Attribute values are numbers or symbols assigned to an attribute • Distinction between attributes and attribute values • Same attribute can be mapped to different attribute values • Example: height can be measured in feet or meters • Different attributes can be mapped to the same set of values • Example: Attribute values for ID and age are integers • But properties of attribute values can be different • ID has no limit but age has a maximum and minimum value
Types of Attributes • There are different types of attributes • Nominal (Categorical) • Examples: ID numbers, eye color, zip codes • Ordinal • Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short} • Interval • Examples: calendar dates, temperatures in Celsius or Fahrenheit. • Ratio • Examples: temperature in Kelvin, length, time, counts
Decision Tree Representation • Each internal node tests an attribute • Each branch corresponds to an attribute value • Each leaf node assigns a classification • [Figure: example tree. The root tests outlook. sunny: test humidity (high: no, normal: yes). overcast: yes. rain: test wind (strong: no, weak: yes).]
How do we construct the decision tree? • Basic algorithm (a greedy algorithm) • Tree is constructed in a top-down recursive divide-and-conquer manner • At start, all the training examples are at the root • Attributes are categorical (if continuous-valued, they can be discretized in advance) • Examples are partitioned recursively based on selected attributes. • Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain) • Conditions for stopping partitioning • All samples for a given node belong to the same class • There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf • There are no samples left • Pre-pruning/post-pruning
How To Split Records • Random Split • The tree can grow huge • These trees are hard to understand. • Larger trees are typically less accurate than smaller trees. • Principled Criterion • Selection of an attribute to test at each node - choosing the most useful attribute for classifying examples. • How? • Information gain • measures how well a given attribute separates the training examples according to their target classification • This measure is used to select among the candidate attributes at each step while growing the tree
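A minimal sketch of how entropy, the Gini index, and information gain can be computed when choosing a split attribute. The four toy records and the attribute names (`outlook`, `wind`, `play`) are assumptions made up for this example, not data from the slides.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(records, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting."""
    groups = {}
    for r in records:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return entropy([r[target] for r in records]) - remainder

# Hypothetical training records
records = [
    {"outlook": "sunny", "wind": "weak", "play": "no"},
    {"outlook": "sunny", "wind": "strong", "play": "no"},
    {"outlook": "overcast", "wind": "weak", "play": "yes"},
    {"outlook": "rain", "wind": "strong", "play": "yes"},
]

gain_outlook = information_gain(records, "outlook", "play")
gain_wind = information_gain(records, "wind", "play")
print(gain_outlook, gain_wind)
```

On this toy set, `outlook` separates the classes perfectly while `wind` tells us nothing, so a greedy tree builder using information gain would test `outlook` first.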
Advantages/Disadvantages of Decision Trees • Advantages: • Easy to understand (Doctors love them!) • Easy to generate rules • Disadvantages: • May suffer from overfitting. • Classifies by rectangular partitioning (so does not handle correlated features very well). • Can be quite large – pruning is necessary. • Does not handle streaming data easily
Overfitting (another view) • Learning a tree that classifies the training data perfectly may not lead to the tree with the best generalization to unseen data. • There may be noise in the training data that the tree is erroneously fitting. • The algorithm may be making poor decisions towards the leaves of the tree that are based on very little data and may not reflect reliable trends. • [Figure: accuracy on training data and on test data, plotted against hypothesis complexity / size of the tree (number of nodes)]
Notes on Overfitting • Overfitting results in decision trees (models in general) that are more complex than necessary • Training error no longer provides a good estimate of how well the tree will perform on previously unseen records • Need new ways for estimating errors
Evaluation • Accuracy • Recall/Precision/F-measure
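A minimal sketch of these measures for a two-class problem, using the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), F-measure = harmonic mean of the two). The function name `evaluate` and the label lists are hypothetical.

```python
def evaluate(actual, predicted, positive):
    """Accuracy, precision, recall and F-measure for one positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    accuracy = sum(1 for a, p in pairs if a == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}

# Hypothetical test-set labels and model predictions
actual = ["yes", "yes", "yes", "no", "no", "no", "no", "no"]
predicted = ["yes", "yes", "no", "yes", "no", "no", "no", "no"]

scores = evaluate(actual, predicted, positive="yes")
print(scores)
```

Here 6 of 8 predictions are correct, with 2 true positives, 1 false positive, and 1 false negative.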
Bayes Classifiers • That was a visual intuition for a simple case of the Bayes classifier, also called: • Idiot Bayes • Naïve Bayes • Simple Bayes • We are about to see some of the mathematical formalisms, and more examples, but keep in mind the basic idea. • Find out the probability of the previously unseen instance belonging to each class, then simply pick the most probable class. • Go through all the examples on the slides and be ready to generate tables similar to the ones presented in class and the one you created for your HW assignment. • Smoothing
Bayesian Classifiers • Bayesian classifiers use Bayes theorem, which says $p(c_j \mid d) = \frac{p(d \mid c_j)\, p(c_j)}{p(d)}$ • $p(c_j \mid d)$ = probability of instance d being in class $c_j$. This is what we are trying to compute • $p(d \mid c_j)$ = probability of generating instance d given class $c_j$. We can imagine that being in class $c_j$ causes you to have feature d with some probability • $p(c_j)$ = probability of occurrence of class $c_j$. This is just how frequent the class $c_j$ is in our database • $p(d)$ = probability of instance d occurring. This can actually be ignored, since it is the same for all classes
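A minimal sketch of a naïve Bayes classifier built from count tables, assuming categorical attributes, the naïve independence assumption, and add-one (Laplace) smoothing. The function names `train` and `classify` and the four toy records are made up for illustration; as noted above, p(d) is dropped because it is the same for every class.

```python
from collections import Counter, defaultdict

def train(records, labels):
    """Build the count tables needed to estimate p(c) and p(value | c)."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attribute index) -> value counts
    values_seen = defaultdict(set)       # attribute index -> distinct values
    for rec, c in zip(records, labels):
        for i, v in enumerate(rec):
            value_counts[(c, i)][v] += 1
            values_seen[i].add(v)
    return class_counts, value_counts, values_seen

def classify(model, rec):
    """Return (most probable class, unnormalized score per class)."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # prior p(c)
        for i, v in enumerate(rec):
            # Add-one smoothing keeps unseen values from zeroing the product.
            score *= (value_counts[(c, i)][v] + 1) / (n_c + len(values_seen[i]))
        scores[c] = score
    return max(scores, key=scores.get), scores

# Hypothetical records: (outlook, humidity) -> play
records = [("sunny", "high"), ("sunny", "normal"),
           ("rain", "high"), ("overcast", "normal")]
labels = ["no", "yes", "no", "yes"]

model = train(records, labels)
label, scores = classify(model, ("overcast", "high"))
print(label, scores)
```

Without smoothing, the "no" class would score exactly zero here because it never co-occurs with `overcast` in the training data.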
Bayesian Classification • Statistical method for classification. • Supervised Learning Method. • Assumes an underlying probabilistic model, the Bayes theorem. • Can solve diagnostic and predictive problems. • Particularly suited when the dimensionality of the input is high • In spite of its over-simplified independence assumption, it often performs well in complex real-world situations
Advantages/Disadvantages of Naïve Bayes • Advantages: • Fast to train (single scan). Fast to classify • Not sensitive to irrelevant features • Handles real and discrete data • Handles streaming data well • Disadvantages: • Assumes independence of features
Nearest-Neighbor Classifiers • Requires three things • The set of stored records • Distance metric to compute distance between records • The value of k, the number of nearest neighbors to retrieve • To classify an unknown record: • Compute distance to other training records • Identify k nearest neighbors • Use class labels of nearest neighbors to determine the class label of unknown record (e.g., by taking majority vote)
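The procedure above can be sketched in a few lines, assuming numeric attributes, Euclidean distance, and a simple majority vote. The function name `knn_classify` and the toy points are hypothetical.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+

def knn_classify(training, query, k):
    """training: list of (point, label) pairs; returns the majority label
    among the k stored records closest to the query point."""
    neighbors = sorted(training, key=lambda pl: dist(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical stored records
training = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
            ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B")]

print(knn_classify(training, (2, 2), k=3))
print(knn_classify(training, (6, 5), k=3))
```

Note that every query scans all stored records, which is the classification-time cost listed among the weaknesses of this method.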
Up to now we have assumed that the nearest neighbor algorithm uses the Euclidean distance; however, this need not be the case… • Max (p = inf) • Manhattan (p = 1) • Weighted Euclidean • Mahalanobis
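A minimal sketch of the first three alternatives, assuming the usual Minkowski family (p = 1 gives Manhattan, p = 2 Euclidean, p = inf the max distance). The function names and the per-attribute weights are assumptions for illustration; Mahalanobis distance needs a covariance matrix and is not shown.

```python
from math import inf

def minkowski(a, b, p):
    """Minkowski distance between two equal-length numeric points."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if p == inf:
        return max(diffs)  # max (Chebyshev) distance
    return sum(d ** p for d in diffs) ** (1 / p)

def weighted_euclidean(a, b, weights):
    """Euclidean distance with a non-negative weight per attribute."""
    return sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)) ** 0.5

a, b = (0, 0), (3, 4)
print(minkowski(a, b, 1))    # Manhattan
print(minkowski(a, b, 2))    # Euclidean
print(minkowski(a, b, inf))  # Max
print(weighted_euclidean(a, b, (1.0, 0.0)))  # second attribute ignored
```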
Strengths and Weaknesses • Strengths: • Simple to implement and use • Comprehensible – easy to explain prediction • Robust to noisy data by averaging k-nearest neighbors • Distance function can be tailored using domain knowledge • Can learn complex decision boundaries • Much more expressive than linear classifiers & decision trees • More on this later • Weaknesses: • Need a lot of space to store all examples • Takes much more time to classify a new example than with a parsimonious model (need to compare distance to all other examples) • Distance function must be designed carefully with domain knowledge
Perceptrons • The perceptron is a type of artificial neural network which can be seen as the simplest kind of feedforward neural network: a linear classifier • Introduced in the late 1950s • Perceptron convergence theorem (Rosenblatt 1962): • Perceptron will learn to classify any linearly separable set of inputs. • Perceptron is a network: • single-layer • feed-forward: data only travels in one direction • [Figure: XOR function (no linear separation)]
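Perceptron: Artificial Neuron Model • Model the network as a graph with cells as nodes and synaptic connections as weighted edges from node i to node j, with weight $w_{ji}$ • The input value received by a neuron is calculated by summing the weighted input values from its input links: $net_j = \sum_i w_{ji}\, x_i$ • Threshold function: the output is 1 if $net_j$ reaches the threshold $T_j$ and 0 otherwise • Vector notation: $net_j = \mathbf{w}_j \cdot \mathbf{x}$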
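A minimal sketch of a single threshold unit trained with the standard perceptron learning rule (weights move by rate × error × input). The rule itself is not spelled out on the slides, so treat it as an assumption; the threshold is folded into a bias term (bias = -threshold), and the function names and the AND data set are made up for illustration.

```python
def predict(weights, bias, x):
    """Threshold unit: output 1 if the weighted sum plus bias is >= 0."""
    net = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if net >= 0 else 0

def train_perceptron(samples, epochs=20, rate=1.0):
    """samples: list of (input tuple, target 0/1). Returns (weights, bias)."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, target in samples:
            error = target - predict(weights, bias, x)
            if error:
                mistakes += 1
                weights = [w + rate * error * xi for w, xi in zip(weights, x)]
                bias += rate * error
        if mistakes == 0:  # every sample classified correctly: converged
            break
    return weights, bias

# Logical AND is linearly separable, so training converges.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_samples)
print(weights, bias)
```

Replacing the targets with XOR gives a set that is not linearly separable, so the loop would simply run out of epochs without reaching zero mistakes.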
Summary of Neural Networks • When are Neural Networks useful? • Instances represented by attribute-value pairs • Particularly when attributes are real valued • The target function is • Discrete-valued • Real-valued • Vector-valued • Training examples may contain errors • Fast evaluation times are necessary • When not? • Fast training times are necessary • Understandability of the function is required
Types of Clusterings • A clustering is a set of clusters • Important distinction between hierarchical and partitional sets of clusters • Partitional Clustering • A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset • Hierarchical clustering • A set of nested clusters organized as a hierarchical tree
K-means Clustering • Partitional clustering approach • Each cluster is associated with a centroid (center point) • Each point is assigned to the cluster with the closest centroid • Number of clusters, K, must be specified • The basic algorithm is very simple • K-means tutorial available from http://maya.cs.depaul.edu/~classes/ect584/WEKA/k-means.html
K-means Clustering • Ask user how many clusters they’d like (e.g., k = 3) • Randomly guess k cluster center locations • Each data point finds out which center it’s closest to • Each center finds the centroid of the points it owns… • …and jumps there • …Repeat until terminated!
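The steps above can be sketched as follows, assuming numeric points and Euclidean distance. To keep the example reproducible, the initial centers are passed in rather than guessed at random; the function name `kmeans` and the toy points are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+

def kmeans(points, centers, max_iters=100):
    """points: list of tuples; centers: list of initial center tuples.
    Returns (final centers, clusters as lists of points)."""
    for _ in range(max_iters):
        # Each data point finds the center it is closest to.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # Each center jumps to the centroid of the points it owns
        # (a center that owns no points stays where it is).
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
        if new_centers == centers:  # no center moved: terminate
            break
        centers = new_centers
    return centers, clusters

points = [(1, 1), (1, 2), (2, 1), (4, 4), (4, 5), (5, 4)]
centers, clusters = kmeans(points, centers=[(0, 0), (5, 5)])
print(centers)
print(clusters)
```

Different starting centers can lead to different final clusters, which is why the number of clusters and the initial guess both matter.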
[Figures: successive K-means iterations on a 2-D example with three centers k1, k2, and k3, beginning with Step 1; both axes run from 0 to 5]
Strengths of Hierarchical Clustering • Do not have to assume any particular number of clusters • Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level • They may correspond to meaningful taxonomies • Example in biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)
Hierarchical Clustering • Two main types of hierarchical clustering • Agglomerative: • Start with the points as individual clusters • At each step, merge the closest pair of clusters until only one cluster (or k clusters) left • Divisive: • Start with one, all-inclusive cluster • At each step, split a cluster until each cluster contains a point (or there are k clusters) • Agglomerative is most common
How to Define Inter-Cluster Similarity • MIN • MAX • Group Average • Distance Between Centroids • Other methods driven by an objective function • Ward’s Method uses squared error • [Figure: proximity matrix over points p1, p2, p3, p4, p5, …]
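A minimal sketch of agglomerative clustering with a pluggable inter-cluster measure, assuming Euclidean distance between points: passing `min` gives MIN (single link) and `max` gives MAX (complete link). The function name `agglomerative` and the toy points are made up for illustration, and the naive pairwise search is written for clarity rather than speed.

```python
from math import dist  # Euclidean distance, Python 3.8+

def agglomerative(points, k, linkage=min):
    """Merge the closest pair of clusters until only k clusters are left."""
    clusters = [[p] for p in points]  # start with each point as its own cluster
    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters.pop(j)  # j > i, so index i is still valid
        clusters[i].extend(merged)
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
three = agglomerative(points, k=3)
two = agglomerative(points, k=2, linkage=min)
print(three)
print(two)
```

Stopping at different values of k corresponds to cutting the hierarchy at different levels.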
DBSCAN • DBSCAN is a density-based algorithm. • Density = number of points within a specified radius (Eps) • A point is a core point if it has more than a specified number of points (MinPts) within Eps • These are points that are at the interior of a cluster • A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point • A noise point is any point that is not a core point or a border point.
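A minimal sketch of the core/border/noise labeling step only (not the full cluster-growing procedure). It assumes Euclidean distance, counts a point as part of its own Eps-neighborhood, and treats "at least MinPts" as the core condition; those conventions vary between descriptions, so they are assumptions here, as are the function name `label_points` and the toy points.

```python
from math import dist  # Euclidean distance, Python 3.8+

def label_points(points, eps, min_pts):
    """Label every point as 'core', 'border' or 'noise'."""
    # Eps-neighborhood of each point (the point itself is included).
    neighbors = {p: [q for q in points if dist(p, q) <= eps] for p in points}
    core = {p for p in points if len(neighbors[p]) >= min_pts}
    labels = {}
    for p in points:
        if p in core:
            labels[p] = "core"
        elif any(q in core for q in neighbors[p]):
            labels[p] = "border"  # not dense itself, but next to a core point
        else:
            labels[p] = "noise"
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (8, 8)]
labels = label_points(points, eps=1.0, min_pts=3)
print(labels)
```

In this toy set the four points of the unit square are core points, (2, 1) is a border point hanging off (1, 1), and (8, 8) is noise.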
What Is Association Mining? • Association rule mining: • Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories. • Applications: • Market Basket analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
Association Rule Mining • We are interested in rules that are • non-trivial (and possibly unexpected) • actionable • easily explainable
Support and Confidence • Find all the rules X => Y with minimum confidence and support • Support = probability that a transaction contains {X, Y} • i.e., ratio of transactions in which X and Y occur together to all transactions in the database • Confidence = conditional probability that a transaction having X also contains Y • i.e., ratio of transactions in which X and Y occur together to those in which X occurs • In general, the confidence of a rule LHS => RHS can be computed as the support of the whole itemset divided by the support of LHS: Confidence(LHS => RHS) = Support(LHS ∪ RHS) / Support(LHS) • [Figure: Venn diagram of customers who buy diapers, customers who buy beer, and customers who buy both]
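A minimal sketch of both measures, computed directly from the definitions above. The five market-basket transactions are a made-up illustration (not data given on the slides), and `support` and `confidence` are hypothetical helper names.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Support(LHS ∪ RHS) / Support(LHS)."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

# Hypothetical transaction database
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

rule_support = support(transactions, {"Diaper", "Milk", "Beer"})
rule_confidence = confidence(transactions, {"Diaper", "Milk"}, {"Beer"})
print(rule_support, rule_confidence)
```

For the rule {Diaper, Milk} => {Beer}, two of the five transactions contain all three items, and three contain {Diaper, Milk}, so the support is 2/5 and the confidence is 2/3.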
Definition: Frequent Itemset • Itemset • A collection of one or more items • Example: {Milk, Bread, Diaper} • k-itemset • An itemset that contains k items • Support count (σ) • Frequency of occurrence of an itemset • E.g. σ({Milk, Bread, Diaper}) = 2 • Support • Fraction of transactions that contain an itemset • E.g. s({Milk, Bread, Diaper}) = 2/5 • Frequent Itemset • An itemset whose support is greater than or equal to a minsup threshold
The Apriori algorithm • The best known algorithm • Two steps: • Find all itemsets that have minimum support (frequent itemsets, also called large itemsets). • Use frequent itemsets to generate rules. • E.g., a frequent itemset {Chicken, Clothes, Milk} [sup = 3/7] and one rule from the frequent itemset: Clothes => Milk, Chicken [sup = 3/7, conf = 3/3] (CS583, Bing Liu, UIC)
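A minimal sketch of the first step (finding all frequent itemsets) in the level-wise Apriori style: candidates of size k are built from frequent itemsets of size k-1 and pruned if any (k-1)-subset is infrequent. The function name `apriori`, the toy transactions, and the minsup value are assumptions for illustration; rule generation (step two) is not shown.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {frozenset itemset: support} for every frequent itemset."""
    n = len(transactions)

    def sup(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items if sup(frozenset([i])) >= minsup]
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = sup(itemset)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset must be frequent; then check support.
        level = [c for c in candidates
                 if all(frozenset(s) in frequent for s in combinations(c, k - 1))
                 and sup(c) >= minsup]
    return frequent

# Hypothetical transaction database
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

frequent = apriori(transactions, minsup=0.6)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With minsup = 0.6, four single items and four pairs are frequent on this toy data, and no 3-itemset survives.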
Associations: Pros and Cons • Pros • can quickly mine patterns describing business/customers/etc. without major effort in problem formulation • virtual items allow much flexibility • unparalleled tool for hypothesis generation • Cons • unfocused • not clear exactly how to apply mined “knowledge” • only hypothesis generation • can produce many, many rules! • may only be a few nuggets among them (or none)
Association Rules • Association rule types: • Actionable Rules – contain high-quality, actionable information • Trivial Rules – information already well-known by those familiar with the business • Inexplicable Rules – no explanation and do not suggest action • Trivial and Inexplicable Rules occur most often