Clustering and Probability (Chap 7)
Review from Last Lecture Defined the K-means problem for formalizing the notion of clustering. Discussed the K-means algorithm. Noted that the K-means algorithm was “quite good” in discovering “concepts” from data (based on features). Noted the important distinction between “attributes” and “features”.
Example of K-means -1 Let initial centroids be C1 = (1,1) and C2 = (2,1)
Example of K-means-2 C1 = (1,1); C2 = ((2+3+4)/3, (1+4+5)/3) = (3, 3.33)
Example of K-means-3 C1 = ((1+2)/2, (1+1)/2) = (1.5, 1); C2 = ((3+4)/2, (4+5)/2) = (3.5, 4.5)
Example of K-means-4 C1 = ((1+2)/2, (1+1)/2) = (1.5, 1); C2 = ((3+4)/2, (4+5)/2) = (3.5, 4.5) (the centroids no longer change, so the algorithm has converged)
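The updates above can be traced mechanically. Here is a minimal sketch of the K-means (Lloyd) update step, assuming the four data points (1,1), (2,1), (3,4), (4,5) that the centroid arithmetic above implies; the helper name kmeans_step is just for illustration.

```python
import numpy as np

def kmeans_step(points, centroids):
    """One K-means (Lloyd) iteration: assign each point to its nearest
    centroid, then recompute each centroid as the mean of its points."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    new_centroids = np.array([points[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, new_centroids

# The four points implied by the centroid arithmetic above (an assumption).
points = np.array([[1, 1], [2, 1], [3, 4], [4, 5]], dtype=float)
centroids = np.array([[1, 1], [2, 1]], dtype=float)   # initial C1, C2

for step in range(3):
    labels, centroids = kmeans_step(points, centroids)
    print(step + 1, labels, centroids)
# Step 1: C2 -> (3, 3.33); step 2: C1 -> (1.5, 1), C2 -> (3.5, 4.5);
# step 3: nothing changes, so the algorithm has converged.
```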
Example: 2 Clusters Four points: A(-1,2), B(1,2), C(-1,-2), D(1,-2). K-means Problem: the optimal solution places the centroids at (0,2) and (0,-2), and the clusters are {A,B} and {C,D}. K-means Algorithm: suppose the initial centroids are (-1,0) and (1,0); then {A,C} and {B,D} end up as the two clusters.
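Reusing kmeans_step from the previous sketch, one update step shows that the poor initialization (-1,0), (1,0) is already a fixed point, while the optimal centroids (0,2), (0,-2) give the clustering with the much smaller sum of squared distances:

```python
# Reusing kmeans_step and numpy from the sketch above.
points = np.array([[-1, 2], [1, 2], [-1, -2], [1, -2]], dtype=float)  # A, B, C, D

labels, cents = kmeans_step(points, np.array([[-1.0, 0.0], [1.0, 0.0]]))
print(labels, cents)   # clusters {A, C} and {B, D}; centroids stay at (-1, 0), (1, 0)

labels, cents = kmeans_step(points, np.array([[0.0, 2.0], [0.0, -2.0]]))
print(labels, cents)   # clusters {A, B} and {C, D}: the lower-cost solution
```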
Several other issues regarding clustering How do you select the initial centroids? How do you select the right number of clusters? How do you deal with non-Euclidean distance/similarity measures? Other approaches (e.g., hierarchical, spectral). The curse of high dimensionality.
Question What should the “prediction” be for the flower?
Prediction and Probability • When we make predictions we should assign “probabilities” to the prediction. • Examples: • 20% chance it will rain tomorrow. • 50% chance that the tumor is malignant. • 60% chance that the stock market will fall by the end of the week. • 30% chance that the next president of the United States will be a Democrat. • 0.1% chance that the user will click on a banner ad. • How do we assign probabilities to complex events? Using smart data algorithms… and counting.
Probability Basics • Probability is a deep topic, but in most cases the rules are straightforward to apply. • Terminology • Experiment • Sample Space • Events • Probability • Rules of probability • Conditional Probability • Bayes Rule
Probability: Sample Space • Consider an experiment and let S be the space of possible outcomes. • Example: • Experiment is tossing a coin; S = {h,t} • Experiment is rolling a pair of dice: S = {(1,1),(1,2),…,(6,6)} • Experiment is a race consisting of three cars: 1, 2 and 3. The sample space is {(1,2,3),(1,3,2),(2,1,3),(2,3,1),(3,1,2),(3,2,1)}
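These sample spaces are small enough to enumerate directly; a quick sketch with Python's itertools:

```python
from itertools import product, permutations

coin = ['h', 't']                                # tossing a coin
dice = list(product(range(1, 7), repeat=2))      # rolling a pair of dice
race = list(permutations([1, 2, 3]))             # finishing orders of 3 cars

print(len(coin), len(dice), len(race))           # 2 36 6
```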
Probabilities Let the sample space be S = {1,2,…,m}. Consider numbers p1, …, pm, where pi is the probability that the outcome of the experiment is i; each pi ≥ 0 and p1 + … + pm = 1. Suppose we toss a fair coin. The sample space is S = {h,t}. Then ph = 0.5 and pt = 0.5.
Probability • Experiment: Will it rain or not in Sydney? S = {rain, no-rain} • P(rain) = 138/365 ≈ 0.38; P(no-rain) = 227/365 ≈ 0.62 • Assigning probabilities (or rather, deciding how to assign them) is a deep philosophical problem. • What is the probability that “the green object standing outside my house is a burglar dressed in green”?
Probability • An Event A is a set of possible outcomes of the experiment. Thus A is a subset of S. • Let A be the event of getting a seven when we roll a pair of dice. • A = {(1,6),(6,1),(2,5),(5,2),(4,3),(3,4)} • P(A) = 6/36 = 1/6 • In general P(A) is the sum of pi over the outcomes i in A; when all outcomes are equally likely this is |A| / |S|.
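Since all 36 dice outcomes are equally likely, P(A) can be computed by simple counting; a small sketch:

```python
from fractions import Fraction
from itertools import product

S = list(product(range(1, 7), repeat=2))           # all 36 outcomes of two dice
A = [(i, j) for (i, j) in S if i + j == 7]         # event: the dice sum to seven

print(len(A), Fraction(len(A), len(S)))            # 6 outcomes, P(A) = 1/6
```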
Probability • The sample space S and events are “sets”. • P(S) = 1; P(Φ) = 0 • Addition: P(A ∪ B) = P(A) + P(B) - P(A ∩ B). Often P(A ∩ B) is written simply as P(AB). • Complement: P(A^c) = 1 - P(A), where A^c is the event that A does not occur.
Example • Suppose the probability of rain today is 0.4, the probability of rain tomorrow is also 0.4, and the probability of rain on both days is 0.1. What is the probability it does not rain on either day? • S = {(R,N),(R,R),(N,N),(N,R)} • Let A be the event that it will rain today and B the event that it will rain tomorrow. Then • A = {(R,N),(R,R)}; B = {(N,R),(R,R)} • Rain at least today or tomorrow: P(A ∪ B) = P(A) + P(B) - P(AB) = 0.4 + 0.4 - 0.1 = 0.7 • Will not rain on either day: 1 - 0.7 = 0.3
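The same calculation written out as code, using the addition and complement rules with the probabilities given above:

```python
p_today, p_tomorrow, p_both = 0.4, 0.4, 0.1

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_either = p_today + p_tomorrow - p_both    # 0.7
# Complement rule: P(no rain on either day) = 1 - P(A or B)
p_neither = 1 - p_either                    # 0.3
print(round(p_either, 2), round(p_neither, 2))
```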
Conditional Probability • One of the most important concepts in all of Data Mining and Machine Learning • P(A|B) = P(AB)/P(B), assuming P(B) ≠ 0. • This is the conditional probability of A given that B has occurred. • Example (continuing the rain example, with A = rain today, B = rain tomorrow): the probability it will rain tomorrow given it has rained today is P(B|A) = P(AB)/P(A) = 0.1/0.4 = 0.25. • In general P(A|B) is not equal to P(B|A).
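The conditional-probability calculation for the rain example, spelled out (the variable names are just illustrative):

```python
p_AB = 0.1   # P(rain today and rain tomorrow), from the example above
p_A = 0.4    # P(rain today)

p_B_given_A = p_AB / p_A          # P(rain tomorrow | rain today) = 0.25
print(p_B_given_A)
```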
We need conditional probability to answer… What should the “prediction” be for the flower?
Bayes Rule • P(A|B) = P(AB)/P(B); P(B|A) = P(BA)/P(A) • Now P(AB) = P(BA) • Thus P(A|B)P(B) = P(B|A)P(A) • Thus P(A|B) = [P(B|A)P(A)]/P(B) • This is called Bayes Rule • It is the basis of almost all prediction • Some recent theories hypothesize that human memory and action are Bayes rule in action.
Bayes Rule In P(A|B) = P(B|A)P(A)/P(B), the term P(A) is the prior and P(A|B) is the posterior.
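A minimal sketch of Bayes rule as a function; the argument names prior, likelihood and evidence are illustrative, and the sanity check reuses the rain example from earlier:

```python
def bayes_posterior(prior, likelihood, evidence):
    """Bayes rule: posterior P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

# Sanity check with the rain example: P(A|B) = P(B|A) P(A) / P(B)
# with P(B|A) = 0.25, P(A) = 0.4, P(B) = 0.4, giving 0.25 again.
print(bayes_posterior(prior=0.4, likelihood=0.25, evidence=0.4))
```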
Bayes Rule: Example The ASX market goes up 60% of the days of a year; 40% of the time it stays the same or goes down. On days the ASX is up, there is a 50% chance that the Shanghai index is up. On other days there is a 30% chance that Shanghai goes up. Suppose the Shanghai market is up. What is the probability that the ASX was up? Define A1 as “ASX is up”; A2 as “ASX is not up”. Define S1 as “Shanghai is up”; S2 as “Shanghai is not up”. We want to calculate P(A1|S1). P(A1) = 0.6; P(A2) = 0.4; P(S1|A1) = 0.5; P(S1|A2) = 0.3 P(S2|A1) = 1 - P(S1|A1) = 0.5; P(S2|A2) = 1 - P(S1|A2) = 0.7
Bayes Rule: Example We want to calculate P(A1|S1). P(A1) = 0.6; P(A2) = 0.4; P(S1|A1) = 0.5; P(S1|A2) = 0.3 P(S2|A1) = 1 - P(S1|A1) = 0.5; P(S2|A2) = 1 - P(S1|A2) = 0.7 P(A1|S1) = P(S1|A1)P(A1)/P(S1) How do we calculate P(S1)?
Bayes Rule: Example P(S1) = P(S1,A1) + P(S1,A2) [Key Step] = P(S1|A1)P(A1) + P(S1|A2)P(A2) = 0.5 x 0.6 + 0.3 x 0.4 = 0.42 Finally, P(A1|S1) = P(S1|A1)P(A1)/P(S1) = (0.5 x 0.6)/0.42 = 0.71
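The full ASX/Shanghai calculation as code, with the law of total probability supplying the denominator (the “key step” above):

```python
p_A1, p_A2 = 0.6, 0.4          # prior: P(ASX up), P(ASX not up)
p_S1_A1, p_S1_A2 = 0.5, 0.3    # P(Shanghai up | ASX up), P(Shanghai up | ASX not up)

# Law of total probability for the denominator (the "key step" above)
p_S1 = p_S1_A1 * p_A1 + p_S1_A2 * p_A2      # 0.42

# Bayes rule
p_A1_S1 = p_S1_A1 * p_A1 / p_S1             # 0.30 / 0.42, about 0.714
print(round(p_S1, 2), round(p_A1_S1, 2))    # 0.42 0.71
```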
Example: Iris Flower F = Flower; SL = Sepal Length; SW = Sepal Width; PL = Petal Length; PW = Petal Width. Given the measurements (“Data”) of a new flower, compute P(F = f | Data) for each flower type f using Bayes rule and choose the maximum.
Example: Iris Flower • So how do we compute P(Data|F=A)? • This is a non-trivial question… [subject to much research] • One option: count how many times “Data” appears in the “database” when F=A. • In this case “Data” is a 4-dimensional data vector. Each component takes 3 values (small, medium, large), so the number of possible combinations is 3^4 = 81, which quickly becomes too many to count reliably in a small data set.
Example: Iris Flower • Conditional Independence • P(Data|F=A) = P(SL=Large, SW=Small, PL=Medium, PW=Small | F=A) • ≈ P(SL=Large|F=A) P(SW=Small|F=A) P(PL=Medium|F=A) P(PW=Small|F=A) • The above is an assumption made to simplify the computation. • Surprisingly, evidence suggests that it works reasonably well in practice. • This prediction method (which exploits conditional independence) is called the “Naïve Bayes Classifier.” A sketch is given below.
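A minimal sketch of a categorical Naïve Bayes classifier, assuming a tiny made-up table of records rather than the actual iris data from the lecture; the helper naive_bayes_predict and the feature values are illustrative only. (In practice one would also add Laplace smoothing so that unseen feature values do not force a probability of zero.)

```python
from collections import Counter, defaultdict

def naive_bayes_predict(records, query):
    """records: list of (feature_dict, label) pairs; query: feature_dict.
    Returns the label maximizing P(label) * prod_f P(feature f = value | label),
    i.e. the conditional-independence (Naive Bayes) approximation."""
    class_counts = Counter(label for _, label in records)
    # counts[label][feature][value] = how often that value occurs with that label
    counts = defaultdict(lambda: defaultdict(Counter))
    for features, label in records:
        for f, v in features.items():
            counts[label][f][v] += 1

    scores = {}
    for label, n in class_counts.items():
        score = n / len(records)                 # prior P(F = label)
        for f, v in query.items():
            score *= counts[label][f][v] / n     # P(f = v | F = label)
        scores[label] = score
    return max(scores, key=scores.get), scores

# Illustrative toy records, not the actual iris table from the lecture.
records = [
    ({'SL': 'large',  'SW': 'small',  'PL': 'medium', 'PW': 'small'},  'A'),
    ({'SL': 'large',  'SW': 'small',  'PL': 'large',  'PW': 'medium'}, 'A'),
    ({'SL': 'small',  'SW': 'medium', 'PL': 'small',  'PW': 'small'},  'B'),
    ({'SL': 'medium', 'SW': 'medium', 'PL': 'small',  'PW': 'small'},  'B'),
]
query = {'SL': 'large', 'SW': 'small', 'PL': 'medium', 'PW': 'small'}
print(naive_bayes_predict(records, query))   # predicts 'A'
```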