Probability theory retro 14/03/2017
Probability (atomic) events (A) and probability space (Ω). Axioms:
- 0 ≤ P(A) ≤ 1
- P(Ω) = 1
- If A1, A2, … are mutually exclusive events (Ai ∩ Aj = ∅ for i ≠ j), then P(∪k Ak) = ∑k P(Ak)
- P(∅) = 0
- P(¬A) = 1 − P(A)
- P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
- P(A) = P(A ∩ B) + P(A ∩ ¬B)
- If A ⊆ B, then P(A) ≤ P(B) and P(B − A) = P(B) − P(A)
Conditional probability Conditional probability is the probability of some event A, given the occurrence of some other event B. P(A|B) = P(A ∩ B)/P(B). Chain rule: P(A ∩ B) = P(A|B)·P(B). Example: A: headache, B: influenza, P(A) = 1/10, P(B) = 1/40, P(A|B) = ?
Independence of events A and B are independent iff P(A|B) = P(A). Corollaries: P(A ∩ B) = P(A)P(B) and P(B|A) = P(B)
Product rule For arbitrary events A1, A2, …, An: P(A1 ∩ A2 ∩ … ∩ An) = P(An | A1 … An−1) P(An−1 | A1 … An−2) … P(A2 | A1) P(A1). Law of total probability: if the events A1, A2, …, An form a complete probability space and P(Ai) > 0 for each i, then P(B) = ∑i=1..n P(B | Ai) P(Ai)
Bayes rule P(A|B) = P(A ∩ B)/P(B) = P(B|A)P(A)/P(B)
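A quick numerical sketch of the Bayes rule, reusing the headache/influenza example from the conditional probability slide; the value of P(A|B) below is a hypothetical assumption, not given on the slide.

```python
# Bayes rule: P(B|A) = P(A|B) * P(B) / P(A)
# A: headache, B: influenza (priors taken from the slide above).
p_a = 1 / 10          # P(A), probability of a headache
p_b = 1 / 40          # P(B), probability of influenza
p_a_given_b = 0.9     # P(A|B): assumed value, purely for illustration

p_b_given_a = p_a_given_b * p_b / p_a   # influenza given headache
print(f"P(B|A) = {p_b_given_a:.3f}")    # 0.225
```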
Random variable ξ: Ω → R. Random vectors…
Cumulative distribution function (CDF): F(x) = P(ξ ≤ x). F(x1) ≤ F(x2) if x1 < x2; lim x→−∞ F(x) = 0, lim x→∞ F(x) = 1; F(x) is non-decreasing and right-continuous
Discrete vs. continuous random variables Discrete: its set of values is finite or countably infinite. Continuous: we assume that a density f(x) exists on the interval (a, b)
Probability density functions (pdf) F(b) − F(a) = P(a < ξ ≤ b) = ∫_a^b f(x) dx; f(x) = F′(x) and F(x) = ∫_{−∞}^x f(t) dt
Empirical estimation of a density Histogram
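A minimal sketch of the histogram as an empirical density estimate; the sample data is synthetic (standard normal), chosen only for illustration.

```python
import numpy as np

# Synthetic sample, purely for illustration.
samples = np.random.default_rng(0).normal(loc=0.0, scale=1.0, size=10_000)

# density=True normalises bin counts so the bars integrate to 1,
# i.e. the histogram approximates the pdf f(x).
counts, bin_edges = np.histogram(samples, bins=50, density=True)

bin_centres = 0.5 * (bin_edges[:-1] + bin_edges[1:])
for x, fx in zip(bin_centres[:5], counts[:5]):
    print(f"f({x:+.2f}) ~= {fx:.3f}")
```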
Independence of random variables ξ and η are independent iff for any a ≤ b and c ≤ d: P(a ≤ ξ ≤ b, c ≤ η ≤ d) = P(a ≤ ξ ≤ b) · P(c ≤ η ≤ d)
Composition of random variables Discrete case: ζ = ξ + η; if ξ and η are independent, then rn = P(ζ = n) = ∑k=−∞..∞ P(ξ = n − k, η = k) = ∑k P(ξ = n − k) P(η = k)
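A small sketch of the discrete composition (convolution) formula above; the two probability mass functions are hypothetical example values.

```python
import numpy as np

# Distribution of zeta = xi + eta for independent discrete variables.
# Example pmfs over the values 0, 1, ... chosen only for illustration.
p_xi  = np.array([0.2, 0.5, 0.3])   # P(xi = 0), P(xi = 1), P(xi = 2)
p_eta = np.array([0.6, 0.4])        # P(eta = 0), P(eta = 1)

# r_n = sum_k P(xi = n - k) * P(eta = k)  -- a discrete convolution
r = np.convolve(p_xi, p_eta)
print(r, r.sum())   # pmf of zeta over 0..3, sums to 1
```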
Expected value If ξ can take the values x1, x2, … with probabilities p1, p2, …, then M(ξ) = ∑i xi pi. Continuous case: M(ξ) = ∫_{−∞}^∞ x f(x) dx
Properties of expected value M(cξ) = cM(ξ); M(ξ + η) = M(ξ) + M(η); if ξ and η are independent random variables, then M(ξη) = M(ξ)M(η)
Standard deviation D(ξ) = (M[(ξ − M(ξ))²])^(1/2); D²(ξ) = M(ξ²) − M²(ξ)
Properties of standard deviation D²(aξ + b) = a²D²(ξ); if ξ1, ξ2, …, ξn are independent random variables, then D²(ξ1 + ξ2 + … + ξn) = D²(ξ1) + D²(ξ2) + … + D²(ξn)
Correlation Covariance: c = M[(ξ − M(ξ))(η − M(η))]; c = 0 if ξ and η are independent. Correlation coefficient: r = c / (D(ξ)D(η)), the covariance normalised into [−1, 1]
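A short NumPy sketch of the covariance and correlation coefficient; the two samples are synthetic and generated only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
xi = rng.normal(size=1000)
eta = 0.8 * xi + 0.2 * rng.normal(size=1000)   # deliberately correlated with xi

c = np.mean((xi - xi.mean()) * (eta - eta.mean()))   # empirical covariance
r = c / (xi.std() * eta.std())                       # correlation, in [-1, 1]
print(f"cov = {c:.3f}, corr = {r:.3f}")
print(np.corrcoef(xi, eta)[0, 1])                    # NumPy's built-in, as a check
```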
Well-known distributions Normal/Gauss. Binomial: ξ ~ B(n, p), M(ξ) = np, D²(ξ) = np(1 − p)
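A quick simulation check of M(ξ) = np and D²(ξ) = np(1 − p) for the binomial distribution; n and p are arbitrary example values.

```python
import numpy as np

n, p = 20, 0.3   # example parameters
draws = np.random.default_rng(0).binomial(n, p, size=100_000)

print(draws.mean(), n * p)              # empirical vs. theoretical mean
print(draws.var(), n * p * (1 - p))     # empirical vs. theoretical variance
```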
Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher
Classification Supervised learning: based on training examples (E), learn a model which works well on previously unseen examples. Classification: a supervised learning task of categorising entities into a predefined set of classes
Posterior, likelihood, evidence P(ωj | x) = P(x | ωj) · P(ωj) / P(x), i.e. Posterior = (Likelihood · Prior) / Evidence, where in the case of two categories the evidence is P(x) = P(x | ω1)P(ω1) + P(x | ω2)P(ω2). Pattern Classification, Chapter 2 (Part 1)
Bayes Classifier • Decision given the posterior probabilities: x is an observation for which: if P(ω1 | x) > P(ω2 | x), the true state of nature = ω1; if P(ω1 | x) < P(ω2 | x), the true state of nature = ω2. This rule minimizes the probability of the error. Pattern Classification, Chapter 2 (Part 1)
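A minimal sketch of the two-category Bayes decision rule; the priors and class-conditional likelihood values at the observed x are hypothetical, chosen only to show the comparison of posteriors.

```python
def posterior(likelihood, prior, evidence):
    return likelihood * prior / evidence

# Hypothetical values at a single observation x.
p_x_given_w1, p_x_given_w2 = 0.3, 0.6             # P(x | w1), P(x | w2)
p_w1, p_w2 = 0.7, 0.3                              # priors P(w1), P(w2)
p_x = p_x_given_w1 * p_w1 + p_x_given_w2 * p_w2    # evidence P(x)

post1 = posterior(p_x_given_w1, p_w1, p_x)
post2 = posterior(p_x_given_w2, p_w2, p_x)
decision = 1 if post1 > post2 else 2               # pick the larger posterior
print(post1, post2, "decide omega_%d" % decision)
```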
Classifiers, Discriminant Functions and Decision Surfaces • The multi-category case • Set of discriminant functions gi(x), i = 1, …, c • The classifier assigns a feature vector x to class ωi if: gi(x) > gj(x) ∀ j ≠ i. Pattern Classification, Chapter 2 (Part 2)
For the minimum error rate, we take gi(x) = P(ωi | x) (max. discrimination corresponds to max. posterior!). Equivalently, gi(x) = P(x | ωi) P(ωi), or gi(x) = ln P(x | ωi) + ln P(ωi) (ln: natural logarithm!). Pattern Classification, Chapter 2 (Part 2)
Feature space divided into c decision regions: if gi(x) > gj(x) ∀ j ≠ i, then x is in Ri (Ri means: assign x to ωi). • The two-category case • A classifier is a “dichotomizer” that has two discriminant functions g1 and g2. Let g(x) ≡ g1(x) − g2(x). Decide ω1 if g(x) > 0; otherwise decide ω2. Pattern Classification, Chapter 2 (Part 2)
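A sketch of the multi-category rule "assign x to ωi if gi(x) > gj(x) for all j ≠ i"; the linear discriminants below are arbitrary examples, not derived from any particular densities.

```python
import numpy as np

def classify(x, weights, biases):
    scores = weights @ x + biases        # g_i(x) for every class i
    return int(np.argmax(scores))        # index of the winning class

# Example: 3 classes over 2 features, with made-up weights and biases.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.array([0.0, -0.5, 0.2])
print(classify(np.array([0.3, 0.9]), W, b))
```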
The computation of g(x) Pattern Classification, Chapter 2 (Part 2)
Discriminant functions of the Bayes Classifier with Normal Density Pattern Classification, Chapter 2 (Part 1)
The Normal Density • Univariate density • Density which is analytically tractable • Continuous density • A lot of processes are asymptotically Gaussian • Handwritten characters and speech sounds are ideal or prototype patterns corrupted by a random process (central limit theorem). p(x) = (1/(√(2π)σ)) · exp(−(x − μ)²/(2σ²)), where: μ = mean (or expected value) of x, σ² = expected squared deviation or variance. Pattern Classification, Chapter 2 (Part 2)
Multivariate density • The multivariate normal density in d dimensions is: p(x) = (1/((2π)^(d/2) |Σ|^(1/2))) · exp(−½ (x − μ)ᵗ Σ⁻¹ (x − μ)), where: x = (x1, x2, …, xd)ᵗ (t stands for the transpose vector form), μ = (μ1, μ2, …, μd)ᵗ is the mean vector, Σ is the d×d covariance matrix, and |Σ| and Σ⁻¹ are its determinant and inverse, respectively. Pattern Classification, Chapter 2 (Part 2)
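A small sketch that evaluates the multivariate normal density directly from the formula above; the example μ and Σ are arbitrary.

```python
import numpy as np

def mvn_pdf(x, mu, sigma):
    """p(x) = (2*pi)^(-d/2) |Sigma|^(-1/2) exp(-0.5 (x-mu)^t Sigma^-1 (x-mu))."""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / norm

mu = np.array([0.0, 1.0])                         # example mean vector
sigma = np.array([[1.0, 0.3], [0.3, 2.0]])        # example covariance matrix
print(mvn_pdf(np.array([0.5, 1.5]), mu, sigma))
```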
Discriminant Functions for the Normal Density • We saw that the minimum error-rate classification can be achieved by the discriminant function gi(x) = ln P(x | ωi) + ln P(ωi) • Case of multivariate normal: gi(x) = −½ (x − μi)ᵗ Σi⁻¹ (x − μi) − (d/2) ln 2π − ½ ln |Σi| + ln P(ωi). Pattern Classification, Chapter 2 (Part 3)
Case Σi = σ²·I (I stands for the identity matrix): the discriminant reduces to gi(x) = −‖x − μi‖² / (2σ²) + ln P(ωi); dropping the class-independent xᵗx term leaves a function linear in x. Pattern Classification, Chapter 2 (Part 3)
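A sketch of the Σi = σ²·I case, assuming a single known σ shared by all classes; the class means and priors are example values.

```python
import numpy as np

def g(x, mu_i, sigma, prior_i):
    """g_i(x) = -||x - mu_i||^2 / (2 sigma^2) + ln P(omega_i)."""
    return -np.sum((x - mu_i) ** 2) / (2 * sigma ** 2) + np.log(prior_i)

# Example: two classes with equal priors and unit sigma.
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
priors = [0.5, 0.5]
x = np.array([0.8, 1.3])
scores = [g(x, mu, 1.0, p) for mu, p in zip(mus, priors)]
print(int(np.argmax(scores)) + 1)   # 1-based index of the chosen class
```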
A classifier that uses linear discriminant functions is called “a linear machine” • The decision surfaces for a linear machine are pieces of hyperplanes defined by: gi(x) = gj(x) Pattern Classification, Chapter 2 (Part 3)
The hyperplane is always orthogonal to the line linking the means! Pattern Classification, Chapter 2 (Part 3)
The hyperplane separating Ri and Rj is always orthogonal to the line linking the means! Pattern Classification, Chapter 2 (Part 3)
Case Σi = Σ (the covariance matrices of all classes are identical but otherwise arbitrary!): gi(x) = −½ (x − μi)ᵗ Σ⁻¹ (x − μi) + ln P(ωi). • The hyperplane separating Ri and Rj is generally not orthogonal to the line between the means! Pattern Classification, Chapter 2 (Part 3)
Case Σi = arbitrary • The covariance matrices are different for each category. The decision surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, hyperhyperboloids. Pattern Classification, Chapter 2 (Part 3)
Exercise Select the optimal decision where: • Ω = {ω1, ω2}, P(x | ω1) ~ N(2, 0.5) (normal distribution), P(x | ω2) ~ N(1.5, 0.2), P(ω1) = 2/3, P(ω2) = 1/3. Pattern Classification, Chapter 2
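A sketch for checking the exercise numerically, assuming N(μ, σ²) means the second parameter is the variance (adjust if it denotes the standard deviation); since the evidence P(x) cancels, comparing P(x | ωi)P(ωi) is enough.

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def decide(x):
    g1 = normal_pdf(x, 2.0, 0.5) * (2 / 3)    # P(x | w1) P(w1)
    g2 = normal_pdf(x, 1.5, 0.2) * (1 / 3)    # P(x | w2) P(w2)
    return 1 if g1 > g2 else 2

# Evaluate the decision at a few sample points.
for x in (1.0, 1.5, 2.0, 2.5):
    print(x, "-> omega_%d" % decide(x))
```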