Data Mining and Knowledge Acquisition - Chapter 5 III -

Learn about the basics of Bayesian classification, its advantages in probabilistic learning, and its application in classification problems. Understand how to compute posterior probabilities and apply naive assumptions to simplify the classification process.

harryk

Presentation Transcript


  1. Data Mining and Knowledge Acquisition - Chapter 5 III - BIS 541 2016/2017 Summer

  2. Chapter 7. Classification and Prediction • Bayesian Classification • Model Based Reasoning • Collaborative Filtering • Classification accuracy

  3. Bayesian Classification: Why? • Probabilistic learning: Calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems • Incremental: Each training example can incrementally increase or decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data. • Probabilistic prediction: Predicts multiple hypotheses, weighted by their probabilities • Standard: Even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured

  4. Bayesian Theorem: Basics • Let X be a data sample whose class label is unknown • Let H be the hypothesis that X belongs to class C • For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X • P(H): prior probability of hypothesis H (i.e. the initial probability before we observe any data; it reflects background knowledge) • P(X): probability that the sample data is observed • P(X|H): probability of observing the sample X, given that the hypothesis holds

  5. Bayesian Theorem • Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem: P(H|X) = P(X|H)·P(H) / P(X) • Informally, this can be written as posterior = likelihood x prior / evidence • MAP (maximum a posteriori) hypothesis: h_MAP = argmax_h P(h|X) = argmax_h P(X|h)·P(h) • Practical difficulty: requires initial knowledge of many probabilities and has significant computational cost

  6. Naïve Bayes Classifier • A simplifying assumption: attributes are conditionally independent given the class • The probability of observing, say, two attribute values y1 and y2 together, given that the class is C, is the product of the probabilities of each value taken separately, given the same class: P([y1,y2]|C) = P(y1|C) * P(y2|C) • No dependence relation between attributes • Greatly reduces the computation cost: only the class distribution needs to be counted • Once the probability P(X|Ci) is known, assign X to the class with maximum P(X|Ci)*P(Ci)

  7. Example • H: X is an apple • P(H): prior probability that X is an apple • X: observed data, round and red • P(H|X): posterior probability that X is an apple given that we observe that it is red and round • P(X|H): likelihood, the probability that a sample is red and round given that it is an apple • P(X): prior probability that a sample is red and round

  8. Applying Bayesian Theorem • P(H|X) = P(H,X)/P(X) from the definition of conditional probability • Similarly: • P(X|H) = P(H,X)/P(H) • P(H,X) = P(X|H)P(H) • hence • P(H|X) = P(X|H)P(H)/P(X) • calculate P(H|X) from • P(X|H), P(H), P(X)

  9. Bayesian classification • The classification problem may be formalized using a-posteriori probabilities: • P(Ci|X) = prob. that the sample tuple X=<x1,…,xk> is of class Ci. There are m classes Ci, i = 1 to m • E.g. P(class=N | outlook=sunny,windy=true,…) • Idea: assign to sample X the class label Ci such that P(Ci|X) is maximal • P(Ci|X) > P(Cj|X) for 1<=j<=m, j≠i

  10. Estimating a-posteriori probabilities • Bayes theorem: P(Ci|X) = P(X|Ci)·P(Ci) / P(X) • P(X) is constant for all classes • P(Ci) = relative freq of class Ci samples • Ci such that P(Ci|X) is maximum = Ci such that P(X|Ci)·P(Ci) is maximum • Problem: computing P(X|Ci) is unfeasible!

  11. Naïve Bayesian Classification • Naïve assumption: attribute independence P(x1,…,xk|Ci) = P(x1|Ci)·…·P(xk|Ci) • If the i-th attribute is categorical: P(xi|Ci) is estimated as the relative frequency of samples having value xi as the i-th attribute in class Ci = sik/si • If the i-th attribute is continuous: P(xi|Ci) is estimated through a Gaussian density function • Computationally easy in both cases
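
The two estimators named on this slide can be sketched as follows. This is a minimal illustration, not the course's implementation: the function names and the sample ages are hypothetical, and the use of the sample variance (n - 1 denominator) is an assumption, since the slides do not say how the Gaussian parameters are fitted.

```python
import math

def categorical_likelihood(value, class_values):
    """Relative frequency of `value` among the attribute values seen in one class."""
    return class_values.count(value) / len(class_values)

def gaussian_likelihood(x, class_values):
    """Gaussian density at x, with mean and variance fitted to one class's values."""
    n = len(class_values)
    mean = sum(class_values) / n
    # Sample variance (n - 1); an assumption, not stated in the slides.
    var = sum((v - mean) ** 2 for v in class_values) / (n - 1)
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical attribute values observed within a single class.
student_values = ["yes", "no", "yes", "yes"]
ages = [25, 32, 41, 38, 29]

p_student = categorical_likelihood("yes", student_values)
p_age_typical = gaussian_likelihood(33, ages)
p_age_unusual = gaussian_likelihood(60, ages)
print(p_student, p_age_typical, p_age_unusual)
```

Note that the Gaussian value is a density rather than a probability, which is acceptable here because only the comparison between classes matters.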

  12. Training dataset Class: C1: buys_computer = ‘yes’ C2: buys_computer = ‘no’ Data sample X = (age<=30, income=medium, student=yes, credit_rating=fair)

  13. Bayesian classification: Example Given the new customer X = (age<=30, income=medium, student=yes, credit_rating=fair), what is the probability of buying a computer? Compute P(buy computer = yes|X) and P(buy computer = no|X) Decision: - list as probabilities or - choose the maximum conditional probability

  14. Compute P(buy computer = yes|X) = P(X|yes)*P(yes)/P(X) P(buy computer = no|X) = P(X|no)*P(no)/P(X) Drop P(X) Decision: maximum of • P(X|yes)*P(yes) • P(X|no)*P(no)

  15. Naïve Bayesian Classifier: Example • Compute P(X|Ci)*P(Ci) for each class • P(X|C=yes)*P(yes) = P(age=“<=30” | buys_computer=“yes”) * P(income=“medium” | buys_computer=“yes”) * P(student=“yes” | buys_computer=“yes”) * P(credit_rating=“fair” | buys_computer=“yes”) * P(C=yes)

  16. P(X|C=no)*P(no) = P(age=“<=30” | buys_computer=“no”) * P(income=“medium” | buys_computer=“no”) * P(student=“yes” | buys_computer=“no”) * P(credit_rating=“fair” | buys_computer=“no”) * P(C=no)

  17. Naïve Bayesian Classifier: Example P(age=“<=30” | buys_computer=“yes”) = 2/9 = 0.222 P(income=“medium” | buys_computer=“yes”) = 4/9 = 0.444 P(student=“yes” | buys_computer=“yes”) = 6/9 = 0.667 P(credit_rating=“fair” | buys_computer=“yes”) = 6/9 = 0.667 P(age=“<=30” | buys_computer=“no”) = 3/5 = 0.6 P(income=“medium” | buys_computer=“no”) = 2/5 = 0.4 P(student=“yes” | buys_computer=“no”) = 1/5 = 0.2 P(credit_rating=“fair” | buys_computer=“no”) = 2/5 = 0.4 P(buys_computer=“yes”) = 9/14 = 0.643 P(buys_computer=“no”) = 5/14 = 0.357

  18. P(X|buys_computer=“yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044 P(X|buys_computer=“yes”) * P(buys_computer=“yes”) = 0.044 * 0.643 = 0.028 P(X|buys_computer=“no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019 P(X|buys_computer=“no”) * P(buys_computer=“no”) = 0.019 * 0.357 = 0.007 X belongs to class “buys_computer=yes”

  19. Class probabilities • P(yes|X) = P(X|yes)*P(yes)/P(X) • P(no|X) = P(X|no)*P(no)/P(X) • What is P(X)? • P(X) = P(X|yes)*P(yes) + P(X|no)*P(no) • = 0.028 + 0.007 • = 0.035 • So • P(yes|X) = 0.028/0.035 = 0.80 • P(no|X) = 0.007/0.035 = 0.20 • Hence • P(yes|X) + P(no|X) = 1
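
The arithmetic of the buys_computer example can be checked with a short script. This is a sketch that uses only the fractions listed on slide 17; the variable and function names are illustrative. Exact fractions are used so that no rounding error accumulates before the final division.

```python
from fractions import Fraction as F

# Conditional probabilities from slide 17, in the order
# age<=30, income=medium, student=yes, credit_rating=fair.
likelihoods = {
    "yes": [F(2, 9), F(4, 9), F(6, 9), F(6, 9)],
    "no": [F(3, 5), F(2, 5), F(1, 5), F(2, 5)],
}
priors = {"yes": F(9, 14), "no": F(5, 14)}

def class_scores(likelihoods, priors):
    """Return P(X|Ci) * P(Ci) for every class under the naive assumption."""
    scores = {}
    for label, probs in likelihoods.items():
        score = priors[label]
        for p in probs:
            score *= p
        scores[label] = score
    return scores

scores = class_scores(likelihoods, priors)
evidence = sum(scores.values())  # P(X), the normalizing constant
posteriors = {label: s / evidence for label, s in scores.items()}
prediction = max(scores, key=scores.get)

for label in scores:
    print(label, round(float(scores[label]), 4), round(float(posteriors[label]), 3))
print("predicted class:", prediction)
```

Dividing by P(X) does not change which class wins; it only rescales the two scores so that they sum to 1.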

  20. Naïve Bayesian Classifier: Comments • Advantages : • Easy to implement • Good results obtained in most of the cases • Disadvantages • Assumption: class conditional independence , therefore loss of accuracy • Practically, dependencies exist among variables • E.g., hospitals: patients: Profile: age, family history etc Symptoms: fever, cough etc., Disease: lung cancer, diabetes etc • Dependencies among these cannot be modeled by Naïve Bayesian Classifier • How to deal with these dependencies? • Bayesian Belief Networks

  21. Bayesian Networks • A Bayesian belief network allows a subset of the variables to be conditionally independent • A graphical model of causal relationships • Represents dependency among the variables • Gives a specification of the joint probability distribution • Nodes: random variables • Links: dependency • X, Y are the parents of Z, and Y is the parent of P • No dependency between Z and P • Has no loops or cycles [Figure: directed graph with nodes X, Y, Z, P]

  22. Bayesian Belief Network: An Example [Figure: network with nodes FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea] The conditional probability table for the variable LungCancer shows the conditional probability for each possible combination of its parents:
             (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
      LC       0.8       0.5        0.7        0.1
      ~LC      0.2       0.5        0.3        0.9

  23. Learning Bayesian Networks • Several cases • Given both the network structure and all variables observable: learn only the CPTs • Network structure known, some hidden variables: method of gradient descent, analogous to neural network learning • Network structure unknown, all variables observable: search through the model space to reconstruct graph topology • Unknown structure, all hidden variables: no good algorithms known for this purpose • D. Heckerman, Bayesian networks for data mining

  24. Chapter 7. Classification and Prediction • Bayesian Classification • Model Based Reasoning • Collaborative Filtering • Classification accuracy

  25. Other Classification Methods • k-nearest neighbor classifier • case-based reasoning • Genetic algorithm • Rough set approach • Fuzzy set approaches

  26. Instance-Based Methods • Instance-based learning: • Store training examples and delay the processing (“lazy evaluation”) until a new instance must be classified • Typical approaches • k-nearest neighbor approach • Instances represented as points in a Euclidean space. • Locally weighted regression • Constructs local approximation • Case-based reasoning • Uses symbolic representations and knowledge-based inference

  27. The k-Nearest Neighbor Algorithm • All instances correspond to points in the n-D space. • The nearest neighbors are defined in terms of Euclidean distance. • The target function could be discrete- or real-valued. • For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to xq. • Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples. [Figure: query point xq surrounded by + and - training examples]

  28. $V = \{v_1, \ldots, v_n\}$ • $f_p(x_q) = \arg\max_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))$ • $\delta(a, b) = 1$ if $a = b$, otherwise 0 • For real-valued target functions • $f_p(x_q) = \frac{1}{k} \sum_{i=1}^{k} f(x_i)$
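
Both k-NN rules on this slide (majority vote for discrete targets, mean for real-valued targets) can be sketched in a few lines. The function names and the toy training sets are hypothetical; Euclidean distance is taken from slide 27, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

def k_nearest(train, xq, k):
    """Return the k (point, target) pairs closest to xq in Euclidean distance."""
    return sorted(train, key=lambda pair: math.dist(pair[0], xq))[:k]

def knn_classify(train, xq, k):
    """Discrete target: most common value among the k nearest neighbors."""
    votes = Counter(target for _, target in k_nearest(train, xq, k))
    return votes.most_common(1)[0][0]

def knn_regress(train, xq, k):
    """Real-valued target: mean of the k nearest neighbors' values."""
    neighbors = k_nearest(train, xq, k)
    return sum(target for _, target in neighbors) / len(neighbors)

# Hypothetical toy data.
labelled = [((1, 1), "+"), ((1, 2), "+"), ((2, 1), "+"),
            ((5, 5), "-"), ((6, 5), "-"), ((5, 6), "-")]
numeric = [((1,), 10.0), ((2,), 20.0), ((3,), 30.0), ((10,), 100.0)]

predicted_label = knn_classify(labelled, (2, 2), k=3)
predicted_value = knn_regress(numeric, (2,), k=3)
print(predicted_label, predicted_value)
```

Sorting the whole training set on every query is the simplest form of lazy evaluation; it is adequate for small data but would normally be replaced by a spatial index for large sets.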

  29. Discussion on the k-NN Algorithm • The k-NN algorithm for continuous-valued target functions • Calculate the mean value of the k nearest neighbors • Distance-weighted nearest neighbor algorithm • Weight the contribution of each of the k neighbors according to its distance to the query point xq • giving greater weight to closer neighbors • Similarly, for real-valued target functions • Robust to noisy data by averaging the k nearest neighbors • Curse of dimensionality: distance between neighbors could be dominated by irrelevant attributes. • To overcome it, stretch the axes or eliminate the least relevant attributes.

  30. Robust to noisy data by averaging the k nearest neighbors • Curse of dimensionality: distance between neighbors could be dominated by irrelevant attributes. • To overcome it, stretch the axes or eliminate the least relevant attributes. • Stretch each variable by a different factor • Find the best stretching factors experimentally by cross-validation • zi = k·xi • Irrelevant variables have small k values • k = 0 completely eliminates the variable
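
Axis stretching amounts to multiplying each coordinate difference by its own factor before computing the distance. The sketch below assumes Euclidean distance and a hypothetical function name; choosing the factors by cross-validation, as the slide suggests, is not shown.

```python
import math

def stretched_distance(a, b, factors):
    """Euclidean distance after scaling attribute i by factors[i] (z_i = k * x_i)."""
    return math.sqrt(sum((k * (x - y)) ** 2 for x, y, k in zip(a, b, factors)))

# With unit factors this is the ordinary Euclidean distance.
plain = stretched_distance((0, 0), (3, 4), factors=(1, 1))
# A factor of 0 removes the second attribute from the comparison entirely.
without_second = stretched_distance((0, 0), (3, 4), factors=(1, 0))
print(plain, without_second)
```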

  31. Instance-Based Methods • Instance-based learning: • Store training examples and delay the processing (“lazy evaluation”) until a new instance must be classified • Typical approaches • k-nearest neighbor approach • Instances represented as points in a Euclidean space. • Locally weighted regression • Constructs local approximation • Case-based reasoning • Uses symbolic representations and knowledge-based inference

  32. Nearest Neighbor Approaches Based on the concept of similarity Memory-Based Reasoning (MBR) – results are based on analogous situations in the past Collaborative Filtering – results use preferences in addition to analogous situations from the past

  33. Memory-Based Reasoning (MBR) • Our ability to reason from experience depends on our ability to recognize appropriate examples from the past… • Traffic patterns/routes • Movies • Food • We identify similar example(s) and apply what we know/learned to current situation • These similar examples in MBR are referred to as neighbors

  34. MBR Applications • Fraud detection • Customer response prediction • Medical treatments • Classifying responses – MBR can process free-text responses and assign codes

  35. MBR Strengths • Ability to use data “as is” – utilizes both a distance function and a combination function between data records to help determine how “neighborly” they are • Ability to adapt – adding new data makes it possible for MBR to learn new things • Good results without lengthy training

  36. MBR Example – Rents in Tuxedo, NY • Classify nearest neighbors based on descriptive variables – population & median home prices (not geography in this example) • Range midpoint in 2 neighbors is $1,000 & $1,250 so Tuxedo rent should be $1,125; 2nd method yields rent of $977 • Actual midpoint rent in Tuxedo turns out to be $1,250 (one method) and $907 in another.

  37. MBR Challenges • Choosing appropriate historical data for use in training • Choosing the most efficient way to represent the training data • Choosing the distance function, combination function, and the number of neighbors

  38. Example

  39. Distance Function • For numerical variables • Absolute value of the difference |A-B| • Ex d(27,51) = |27-51| = 24 • Square of the difference (A-B)^2 • Ex d(27,51) = (27-51)^2 = 24^2 = 576 • Normalized absolute value |A-B| / max difference • Ex d(27,51) = |27-51|/|27-52| = 24/25 = 0.96 • Standardized absolute value • |A-B| / standard deviation • Categorical variables (similar to clustering) • Ex gender • d(male,male)=0, d(female,female)=0 • d(male,female)=1, d(female,male)=1

  40. Combining distance between variables • Manhattan • Ex dsum(A,B) = dgender(A,B) + dsalary(A,B) + dage(A,B) • Normalized summation • Ex dsum(A,B) / max dsum • Euclidean • deuc(A,B) = sqrt(dgender(A,B)^2 + dsalary(A,B)^2 + dage(A,B)^2)
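
The per-variable distances and the two ways of combining them can be sketched as below. The function names and the two example records are hypothetical; the maximum differences used for normalization (25 for age, 50000 for salary) are assumptions made only for this illustration.

```python
import math

def abs_distance(a, b):
    return abs(a - b)

def squared_distance(a, b):
    return (a - b) ** 2

def normalized_distance(a, b, max_diff):
    """Absolute difference divided by the largest difference in the data."""
    return abs(a - b) / max_diff

def categorical_distance(a, b):
    """0 when the categories match, 1 otherwise."""
    return 0 if a == b else 1

def manhattan(field_distances):
    return sum(field_distances)

def euclidean(field_distances):
    return math.sqrt(sum(d ** 2 for d in field_distances))

# Hypothetical records: (gender, salary, age).
rec_a = ("female", 40000, 27)
rec_b = ("male", 65000, 51)
field_distances = [
    categorical_distance(rec_a[0], rec_b[0]),
    normalized_distance(rec_a[1], rec_b[1], max_diff=50000),  # assumed range
    normalized_distance(rec_a[2], rec_b[2], max_diff=25),     # assumed range
]
print(manhattan(field_distances), euclidean(field_distances))
```

Normalizing each field before combining keeps a large-valued variable such as salary from dominating the total.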

  41. The Combination Function • For categorical target variables: classification • Voting: majority rule • Weighted voting • Weights inversely proportional to the distance • For numerical target variables: numerical prediction • Take the average • Weighted average • Weights inversely proportional to the distance
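
Weighted voting can give a different answer from simple majority rule, as the following sketch shows. The function names and the neighbor list are hypothetical, and the weight 1/distance is one common reading of "inversely proportional"; it assumes no neighbor lies at distance 0.

```python
from collections import Counter, defaultdict

def majority_vote(neighbors):
    """neighbors: list of (distance, label); every neighbor counts once."""
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def weighted_vote(neighbors):
    """Each neighbor's vote is weighted by 1 / distance (distance must be > 0)."""
    totals = defaultdict(float)
    for distance, label in neighbors:
        totals[label] += 1 / distance
    return max(totals, key=totals.get)

# One very close "no" neighbor against two distant "yes" neighbors.
neighbors = [(1, "no"), (4, "yes"), (5, "yes")]
print(majority_vote(neighbors), weighted_vote(neighbors))
```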

  42. Collaborative Filtering • Lots of human examples of this: • Best teachers • Best courses • Best restaurants (ambiance, service, food, price) • Recommend a dentist, mechanic, PC repair, blank CDs/DVDs, wines, B&Bs, etc… • CF is a variant of MBR particularly well suited to personalized recommendations

  43. Collaborative Filtering • Starts with a history of people’s personal preferences • Uses a distance function – people who like the same things are “close” • Uses “votes” which are weighted by distances, so close neighbor votes count more • Basically, judgments of a peer group are important

  44. Collaborative Filtering • Knowing that lots of people liked something is not sufficient… • Who liked it is also important • Friend whose past recommendations were good (or bad) • High profile person seems to influence • Collaborative Filtering automates this word-of-mouth everyday activity

  45. Preparing Recommendations for Collaborative Filtering • Building customer profile – ask new customer to rate selection of things • Comparing this new profile to other customers using some measure of similarity • Using some combination of the ratings from similar customers to predict what the new customer would select for items he/she has NOT yet rated

  46. Collaborative Filtering Example • What rating would Nathaniel give to Planet of the Apes? • Simon, distance 2, rated it -1 • Amelia, distance 4, rated it -4 • Using weighted average inverse to distance, it is predicted that he would rate it a -2 • =(0.5*-1 + 0.25*-4) / (0.5 + 0.25) • Nathaniel can certainly enter his rating after seeing the movie which could be close or far from the prediction
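
The prediction for Nathaniel can be reproduced with an inverse-distance weighted average. This is a sketch with a hypothetical function name; it uses only the two neighbors and ratings given on the slide and assumes all distances are greater than 0.

```python
def predict_rating(neighbors):
    """neighbors: list of (distance, rating); weights are 1 / distance."""
    weights = [1 / distance for distance, _ in neighbors]
    weighted_sum = sum(w * rating for w, (_, rating) in zip(weights, neighbors))
    return weighted_sum / sum(weights)

# Simon: distance 2, rating -1. Amelia: distance 4, rating -4.
prediction = predict_rating([(2, -1), (4, -4)])
print(prediction)
```

Simon's vote counts twice as much as Amelia's because he is half as far away, which pulls the prediction toward his rating.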

  47. Chapter 7. Classification and Prediction • Bayesian Classification • Model Based Reasoning • Collaborative Filtering • Classification accuracy

  48. Holdout estimation • What to do if the amount of data is limited? • The holdout method reserves a certain amount for testing and uses the remainder for training • Usually: one third for testing, the rest for training • Problem: the samples might not be representative • Example: class might be missing in the test data • Advanced version uses stratification • Ensures that each class is represented with approximately equal proportions in both subsets
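
A stratified holdout split can be sketched with the standard library alone. The function name, the fixed seed, and the example label counts are assumptions for illustration; the one-third test fraction follows the slide.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, test_fraction=1 / 3, seed=0):
    """Split sample indices so each class keeps roughly the same proportion
    in the training and test subsets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Hypothetical label vector: 9 "yes" and 6 "no" samples.
labels = ["yes"] * 9 + ["no"] * 6
train_idx, test_idx = stratified_holdout(labels)
print(sorted(labels[i] for i in test_idx))
```

Splitting class by class guarantees that a class with enough samples cannot go missing from the test data, which is the problem the slide raises.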

  49. Repeated holdout method • The holdout estimate can be made more reliable by repeating the process with different subsamples • In each iteration, a certain proportion is randomly selected for training (possibly with stratification) • The error rates of the different iterations are averaged to yield an overall error rate • This is called the repeated holdout method • Still not optimal: the different test sets overlap • Can we prevent overlapping?
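
Repeated holdout can be sketched as a loop over random splits whose error rates are averaged. Everything named here is hypothetical: the `train_and_score` callback stands in for whatever classifier is being evaluated, and the majority-class baseline is included only to make the example runnable. The splits are unstratified for brevity.

```python
import random
from collections import Counter

def repeated_holdout(n_samples, train_and_score, repeats=10,
                     test_fraction=1 / 3, seed=0):
    """Average the test error over several random train/test splits."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    n_test = round(n_samples * test_fraction)
    errors = []
    for _ in range(repeats):
        rng.shuffle(indices)
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        errors.append(train_and_score(train_idx, test_idx))
    return sum(errors) / len(errors)

def majority_baseline(labels):
    """Build a scorer that predicts the training set's most common label."""
    def train_and_score(train_idx, test_idx):
        majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
        wrong = sum(labels[i] != majority for i in test_idx)
        return wrong / len(test_idx)
    return train_and_score

labels = ["yes"] * 9 + ["no"] * 6
error_rate = repeated_holdout(len(labels), majority_baseline(labels))
print(round(error_rate, 3))
```

Because every iteration draws a fresh random test set, the same sample can be tested several times while another is never tested, which is the overlap the slide points out.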