770 likes | 789 Views
Learn about the basics of Bayesian classification, its advantages in probabilistic learning, and its application in classification problems. Understand how to compute posterior probabilities and apply naive assumptions to simplify the classification process.
E N D
Data Mining and Knowledge Acquizition — Chapter 5 III — BIS 541 2016/2017 Summer
Chapter 7. Classification and Prediction • Bayesian Classification • Model Based Reasoning • Collaborative Filtering • Classification accuracy
Bayesian Classification: Why? • Probabilistic learning: Calculate explicit probabilities for hypothesis, among the most practical approaches to certain types of learning problems • Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data. • Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities • Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
Bayesian Theorem: Basics • Let X be a data sample whose class label is unknown • Let H be a hypothesis that X belongs to class C • For classification problems, determine P(H/X): the probability that the hypothesis holds given the observed data sample X • P(H): prior probability of hypothesis H (i.e. the initial probability before we observe any data, reflects the background knowledge) • P(X): probability that sample data is observed • P(X|H) : probability of observing the sample X, given that the hypothesis holds
Bayesian Theorem • Given training data X, posteriori probability of a hypothesis H, P(H|X) follows the Bayes theorem • Informally, this can be written as posterior =likelihood x prior / evidence • MAP (maximum posteriori) hypothesis • Practical difficulty: require initial knowledge of many probabilities, significant computational cost
Naïve Bayes Classifier • A simplified assumption: attributes are conditionally independent: • The product of occurrence of say 2 elements x1 and x2, given the current class is C, is the product of the probabilities of each element taken separately, given the same class P([y1,y2],C) = P(y1,C) * P(y2,C) • No dependence relation between attributes • Greatly reduces the computation cost, only count the class distribution. • Once the probability P(X|Ci) is known, assign X to the class with maximum P(X|Ci)*P(Ci)
Example • H X is an apple • P(H) priori probability that X is an apple • X observed data:round andred • P(H/X) probability that X is an apple given that we observe that it is red and round • P(X/H) posteriori probability that a data is red and round given that it is an apple • P(X) priori probabilility that it is red and round
Applying Bayesian Theorem • P(H/X)= P(H,X)/P(X) from Bayesian theorem • Similarly: • P(X/H) = P(H,X)/P(H) • P(H,X) = P(X/H)P(H) • hence • P(H/X)= P(X/H)P(H)/P(X) • calculate P(H/X) from • P(X/H),P(H),P(X)
Bayesian classification • The classification problem may be formalized using a-posteriori probabilities: • P(Ci|X) = prob. that the sample tuple X=<x1,…,xk> is of class Ci. There are m classes Ci i =1 to m • E.g. P(class=N | outlook=sunny,windy=true,…) • Idea: assign to sampleXthe class labelCisuch thatP(Ci|X) is maximal • P(Ci|X)> P(Cj|X) 1<=j<=m ji
Estimating a-posteriori probabilities • Bayes theorem: P(Ci|X) = P(X|Ci)·P(Ci) / P(X) • P(X) is constant for all classes • P(Ci) = relative freq of class Ci samples • Ci such that P(Ci|X) is maximum = Ci such that P(X|Ci)·P(Ci) is maximum • Problem: computing P(X|Ci) is unfeasible!
Naïve Bayesian Classification • Naïve assumption: attribute independence P(x1,…,xk|Ci) = P(x1|Ci)·…·P(xk|Ci) • If i-th attribute is categorical:P(xi|Ci) is estimated as the relative freq of samples having value xi as i-th attribute in class Ci =sik/si . • If i-th attribute is continuous:P(xi|Ci) is estimated thru a Gaussian density function • Computationally easy in both cases
Training dataset Class: C1:buys_computer= ‘yes’ C2:buys_computer= ‘no’ Data sample X =(age<=30, Income=medium, Student=yes Credit_rating= Fair)
Bayesian classification: Example Given the new customer X=(age<=30 ,income =medium, student=yes,credit_rating=fair) What is the probability of buying computer Compute P(buy computer = yes/X) and P(buy computer = no/X) Decision: - list as probabilities or - chose the maximum conditional probability
Compute P(buy computer = yes/X) = P(X/yes)*P(yes)/P(X) P(buy computer = no/X) P(X/no)*P(no)/P(X) Drop P(X) Decision: maximum of • P(X/yes)*P(yes) • P(X/no)*P(no)
Naïve Bayesian Classifier: Example • Compute P(X/Ci)*P(Ci) for each class • P(X/C = yes)*P(yes) P(age=“<30” | buys_computer=“yes”)* P(income=“medium” |buys_computer=“yes”)* P(credit_rating=“fair” | buys_computer=“yes”)* P(student=“yes” | buys_computer=“yes)* P(C =yes)
P(X/C = no)*P(no) P(age=“<30” | buys_computer=“no”)* P(income=“medium” | buys_computer=“no”)* P(student=“yes” | buys_computer=“no”)* P(credit_rating=“fair” | buys_computer=“no”)* P(C=no)
Naïve Bayesian Classifier: Example P(age=“<30” | buys_computer=“yes”) = 2/9=0.222 P(income=“medium” | buys_computer=“yes”)= 4/9 =0.444 P(student=“yes” | buys_computer=“yes)= 6/9 =0.667 P(credit_rating=“fair” | buys_computer=“yes”)=6/9=0.667 P(age=“<30” | buys_computer=“no”) = 3/5 =0.6 P(income=“medium” | buys_computer=“no”) = 2/5 = 0.4 P(student=“yes” | buys_computer=“no”)= 1/5=0.2 P(credit_rating=“fair” | buys_computer=“no”)=2/5=0.4 P(buys_computer=“yes”)=9/14=0,643 P(buys_computer=“no”)=5/14=0,357
P(X|buys_computer=“yes”) = 0.222 x 0.444 x 0.667 x 0.0.667 =0.044 P(X|buys_computer=“yes”) * P(buys_computer=“yes”) =0.044*0.643=0.02 P(X|buys_computer=“no”) = 0.6 x 0.4 x 0.2 x 0.4 =0.019 P(X|buys_computer=“no”) * P(buys_computer=“no”) =0.019*0.357=0.0007 X belongs to class “buys_computer=yes”
Class probabilities • P(yes/X) = P(X/yes)*P(yes)/P(X) • P(no/X) = P(X/no)*P(no)/P(X) • What is P(X)? • P(X)= P(X/yes)*P(yes)+P(X/no)*P(no) • = 0.02 + 0.0007 • = 0.0207 • So • P(yes/X) = 0.02/0.0207 • P(no/X) = 0.0007/0.0207 • Hence • P(yes/X) + P(no/X) = 1
Naïve Bayesian Classifier: Comments • Advantages : • Easy to implement • Good results obtained in most of the cases • Disadvantages • Assumption: class conditional independence , therefore loss of accuracy • Practically, dependencies exist among variables • E.g., hospitals: patients: Profile: age, family history etc Symptoms: fever, cough etc., Disease: lung cancer, diabetes etc • Dependencies among these cannot be modeled by Naïve Bayesian Classifier • How to deal with these dependencies? • Bayesian Belief Networks
Y Z P Bayesian Networks • Bayesian belief network allows a subset of the variables conditionally independent • A graphical model of causal relationships • Represents dependency among the variables • Gives a specification of joint probability distribution • Nodes: random variables • Links: dependency • X,Y are the parents of Z, and Y is the parent of P • No dependency between Z and P • Has no loops or cycles X
Bayesian Belief Network: An Example Family History Smoker (FH, ~S) (~FH, S) (~FH, ~S) (FH, S) LC 0.7 0.8 0.5 0.1 LungCancer Emphysema ~LC 0.3 0.2 0.5 0.9 The conditional probability table for the variable LungCancer: Shows the conditional probability for each possible combination of its parents PositiveXRay Dyspnea Bayesian Belief Networks
Learning Bayesian Networks • Several cases • Given both the network structure and all variables observable: learn only the CPTs • Network structure known, some hidden variables: method of gradient descent, analogous to neural network learning • Network structure unknown, all variables observable: search through the model space to reconstruct graph topology • Unknown structure, all hidden variables: no good algorithms known for this purpose • D. Heckerman, Bayesian networks for data mining
Chapter 7. Classification and Prediction • Bayesian Classification • Model Based Reasoning • Collaborative Filtering • Classification accuracy
Other Classification Methods • k-nearest neighbor classifier • case-based reasoning • Genetic algorithm • Rough set approach • Fuzzy set approaches
Instance-Based Methods • Instance-based learning: • Store training examples and delay the processing (“lazy evaluation”) until a new instance must be classified • Typical approaches • k-nearest neighbor approach • Instances represented as points in a Euclidean space. • Locally weighted regression • Constructs local approximation • Case-based reasoning • Uses symbolic representations and knowledge-based inference
The k-Nearest Neighbor Algorithm • All instances correspond to points in the n-D space. • The nearest neighbor are defined in terms of Euclidean distance. • The target function could be discrete- or real- valued. • For discrete-valued, the k-NN returns the most common value among the k training examples nearest toxq. • Vonoroi diagram: the decision surface induced by 1-NN for a typical set of training examples. . _ _ _ . _ . + . + . _ + xq . _ +
V: v1,...vn • fp(xq) = argmaxvVki=1(v,f(xi)) • (a,b) =1 if a=b otherwise 0 • for real valued target functions • fp(xq) = ki=1f(xi)/k
Discussion on the k-NN Algorithm • The k-NN algorithm for continuous-valued target functions • Calculate the mean values of the k nearest neighbors • Distance-weighted nearest neighbor algorithm • Weight the contribution of each of the k neighbors according to their distance to the query point xq • giving greater weight to closer neighbors • Similarly, for real-valued target functions • Robust to noisy data by averaging k-nearest neighbors • Curse of dimensionality: distance between neighbors could be dominated by irrelevant attributes. • To overcome it, axes stretch or elimination of the least relevant attributes.
Robust to noisy data by averaging k-nearest neighbors • Curse of dimensionality: distance between neighbors could be dominated by irrelevant attributes. • To overcome it, axes stretch or elimination of the least relevant attributes. • stretch each variable with a different factor • experiment on the best stretching factor by cross validation • zi =kxi • irrelevant variables has small k values • k=0 completely eliminates the variable
Instance-Based Methods • Instance-based learning: • Store training examples and delay the processing (“lazy evaluation”) until a new instance must be classified • Typical approaches • k-nearest neighbor approach • Instances represented as points in a Euclidean space. • Locally weighted regression • Constructs local approximation • Case-based reasoning • Uses symbolic representations and knowledge-based inference
Nearest Neighbor Approaches Based on the concept of similarity Memory-Based Reasoning (MBR) – results are based on analogous situations in the past Collaborative Filtering – results use preferences in addition to analogous situations from the past
Memory-Based Reasoning (MBR) • Our ability to reason from experience depends on our ability to recognize appropriate examples from the past… • Traffic patterns/routes • Movies • Food • We identify similar example(s) and apply what we know/learned to current situation • These similar examples in MBR are referred to as neighbors
MBR Applications • Fraud detection • Customer response prediction • Medical treatments • Classifying responses – MBR can process free-text responses and assign codes
MBR Strengths • Ability to use data “as is” – utilizes both a distance function and a combination function between data records to help determine how “neighborly” they are • Ability to adapt – adding new data makes it possible for MBR to learn new things • Good results without lengthy training
MBR Example – Rents in Tuxedo, NY • Classify nearest neighbors based on descriptive variables – population & median home prices (not geography in this example) • Range midpoint in 2 neighbors is $1,000 & $1,250 so Tuxedo rent should be $1,125; 2nd method yields rent of $977 • Actual midpoint rent in Tuxedo turns out to be $1,250 (one method) and $907 in another.
MBR Challenges • Choosing appropriate historical data for use in training • Choosing the most efficient way to represent the training data • Choosing the distance function, combination function, and the number of neighbors
Distance Function • For numerical variables • Absolute value of distane |A-B| • Ex d(27,51)= |27-51|=24 • Square of differences (A-B)2 • Ex d(27,51)= (27-51)=242 • Normalized absolute value |A-B|/max differ • Ex d(27,51)= |27-51|/|27-52|=0,96 • Standardised absolute value • |A-B|/standard deviation • Categorical variables (similar to clusteing) • Ex gender • d(male,male)=0, d(female,female)=0 • d(male,female)=1, d(female,male)=1
Combining distance between variables • Manhatten • Ex dsum(A,B)=dgender(A,B)+ dsalaryr(A,B)+ dage(A,B) • Normalized summation • Ex dsum(A,B)/max dsum • Euclidean • deuc(A,B)= • Sqrt(dgender(A,B)2+ dsalaryr(A,B) 2+ dage(A,B) 2)
The Combination Function • For categorical target variables- classification • Voting:Majority rule • Weighted voting • Weights inversly proportional to the distance • For numerical target variables –numerical prediction • Take average • Weighted average • Weights inversly proportional to the distance
Collaborative Filtering • Lots of human examples of this: • Best teachers • Best courses • Best restaurants (ambiance, service, food, price) • Recommend a dentist, mechanic, PC repair, blank CDs/DVDs, wines, B&Bs, etc… • CF is a variant of MBR particularly well suited to personalized recommendations
Collaborative Filtering • Starts with a history of people’s personal preferences • Uses a distance function – people who like the same things are “close” • Uses “votes” which are weighted by distances, so close neighbor votes count more • Basically, judgments of a peer group are important
Collaborative Filtering • Knowing that lots of people liked something is not sufficient… • Who liked it is also important • Friend whose past recommendations were good (or bad) • High profile person seems to influence • Collaborative Filtering automates this word-of-mouth everyday activity
Preparing Recommendations for Collaborative Filtering • Building customer profile – ask new customer to rate selection of things • Comparing this new profile to other customers using some measure of similarity • Using some combination of the ratings from similar customers to predict what the new customer would select for items he/she has NOT yet rated
Collaborative Filtering Example • What rating would Nathaniel give to Planet of the Apes? • Simon, distance 2, rated it -1 • Amelia, distance 4, rated it -4 • Using weighted average inverse to distance, it is predicted that he would rate it a -2 • =(0.5*-1 + 0.25*-4) / (0.5 + 0.25) • Nathaniel can certainly enter his rating after seeing the movie which could be close or far from the prediction
Chapter 7. Classification and Prediction • Bayesian Classification • Model Based Reasoning • Collaborative Filtering • Classification accuracy
Holdout estimation • What to do if the amount of data is limited? • The holdout method reserves a certain amount for testing and uses the remainder for training • Usually: one third for testing, the rest for training • Problem: the samples might not be representative • Example: class might be missing in the test data • Advanced version uses stratification • Ensures that each class is represented with approximately equal proportions in both subsets
Repeated holdout method • Holdout estimate can be made more reliable by repeating the process with different subsamples • In each iteration, a certain proportion is randomly selected for training (possibly with stratificiation) • The error rates on the different iterations are averaged to yield an overall error rate • This is called the repeated holdout method • Still not optimum: the different test sets overlap • Can we prevent overlapping?