
Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs

This chapter covers probability distributions, basic graph concepts, and related terminology for pattern recognition and machine learning. Topics include the Schur complement, completing the square, the Robbins-Monro algorithm, Bayesian relationships with Gaussian distributions, and variations of the Gaussian distribution.


Presentation Transcript


  1. Pattern Recognition and Machine Learning, Chapter 2: Probability Distributions (Part 2) + Graphs. Affiliation: Kyoto University. Name: Kevin Chien, Dr. Oba Shigeyuki, Dr. Ishii Shin. Date: Nov 04, 2011

  2. Terminologies: for understanding distributions

  3. Terminologies • Schur complement: relates the blocks of a partitioned matrix to the corresponding blocks of its inverse. • Completing the square: converting a quadratic of the form ax^2 + bx + c into a(…)^2 + const, either to solve the quadratic or to match quadratic terms against a standard Gaussian and read off the unknown mean and variance. • Robbins-Monro algorithm: iterative root finding for an unobserved regression function M(x) expressed as a mean, i.e. E[N(x)] = M(x), where N(x) is a noisy observation (see the sketch after this slide).
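The slide only names the algorithm; below is a minimal Python sketch of a Robbins-Monro iteration, assuming a simple noisy linear regression function and step sizes a_n = a0/n (both are illustrative choices, not from the slides).

```python
import numpy as np

def robbins_monro(noisy_f, theta0, target=0.0, n_iter=1000, a0=1.0):
    """Find the root theta* with E[noisy_f(theta*)] = target using
    Robbins-Monro updates theta <- theta - a_n * (noisy_f(theta) - target),
    with step sizes a_n = a0 / n satisfying the usual conditions
    (sum a_n = inf, sum a_n^2 < inf)."""
    theta = theta0
    for n in range(1, n_iter + 1):
        a_n = a0 / n
        theta = theta - a_n * (noisy_f(theta) - target)
    return theta

# Example: the regression function M(theta) = theta - 2 observed with noise;
# the root of M is theta* = 2.
rng = np.random.default_rng(0)
noisy = lambda theta: (theta - 2.0) + rng.normal(scale=0.5)
print(robbins_monro(noisy, theta0=0.0))   # approaches 2.0
```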

  4. Terminologies (cont.) • [Stochastic appro., wiki., 2011] • Conditions on the Robbins-Monro step sizes a_N: a_N > 0, lim a_N = 0, sum of a_N diverges, and sum of a_N^2 is finite. • Trace Tr(W) is the sum of the diagonal elements. • Degrees of freedom: the dimension of a subspace; here it refers to a hyperparameter.

  5. Distributions: Gaussian distributions and their motivation

  6. Conditional Gaussian Distribution Assume the joint variable is partitioned as y = x_a, x = x_b. • Derive the conditional mean and variance by completing the square, noting the Schur complement of the partitioned covariance (see the formulas below). • Linear Gaussian model: observations are a weighted sum of underlying latent variables; the conditional mean is linear in x_b and the conditional variance is independent of x_b.
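The equations themselves appear only in the slide images; for reference, the standard partitioned-Gaussian results they correspond to (Bishop, Section 2.3.1) are:

```latex
% Partition x = (x_a, x_b), mean (\mu_a, \mu_b), covariance blocks \Sigma_{aa}, \Sigma_{ab}, \Sigma_{ba}, \Sigma_{bb}.
p(x_a \mid x_b) = \mathcal{N}\left(x_a \mid \mu_{a|b},\, \Sigma_{a|b}\right),
\quad
\mu_{a|b} = \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b),
\quad
\Sigma_{a|b} = \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}
```

Here the conditional covariance is exactly the Schur complement of the block Sigma_bb in the full covariance matrix.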

  7. Marginal Gaussian Distribution • The goal is again to identify the mean and variance by 'completing the square'. • Solve the marginalisation integral, noting the Schur complement, and compare components with the standard Gaussian form (the result is given below).
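The integral and its result are not reproduced in the transcript; for the same partitioned Gaussian, the standard marginal is simply:

```latex
p(x_a) = \int p(x_a, x_b)\, \mathrm{d}x_b = \mathcal{N}(x_a \mid \mu_a, \Sigma_{aa})
```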

  8. Bayesian relationship with Gaussian distr. (quick view) • Consider a multivariate Gaussian over the joint variable (x, y). • According to Bayes' equation, p(y|x) = p(x, y) / p(x). • The conditional Gaussian must therefore have a form whose exponent is the difference between the exponents of p(x, y) and p(x).

  9. Bayesian relationship with Gaussian distr. • Starting from a marginal p(x) and a linear-Gaussian conditional p(y|x), derive the mean and variance of the joint Gaussian distribution p(x, y), and then the mean and variance of p(x|y). • p(x) can be seen as the prior, p(y|x) as the likelihood, and p(x|y) as the posterior (the standard formulas are reproduced below).
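The slide's equations are not in the transcript; the linear-Gaussian results being referenced (Bishop, Section 2.3.3) are, in standard notation:

```latex
p(x) = \mathcal{N}(x \mid \mu, \Lambda^{-1}), \qquad
p(y \mid x) = \mathcal{N}(y \mid A x + b, L^{-1})

p(y) = \mathcal{N}\left(y \mid A\mu + b,\; L^{-1} + A \Lambda^{-1} A^{\mathsf{T}}\right)

p(x \mid y) = \mathcal{N}\left(x \mid \Sigma\{A^{\mathsf{T}} L (y - b) + \Lambda \mu\},\; \Sigma\right),
\qquad \Sigma = (\Lambda + A^{\mathsf{T}} L A)^{-1}
```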

  10. Bayesian relationship with Gaussian distr., sequential est. • Estimate the mean from (N-1)+1 observations: the maximum-likelihood mean after N points can be written as an update of the estimate after N-1 points (see below). • The Robbins-Monro algorithm has the same form, so the maximum-likelihood mean can be solved for sequentially by Robbins-Monro.
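The update equation is not reproduced in the transcript; the standard sequential form of the maximum-likelihood mean that the slide refers to is:

```latex
\mu_{\mathrm{ML}}^{(N)} = \mu_{\mathrm{ML}}^{(N-1)} + \frac{1}{N}\left(x_N - \mu_{\mathrm{ML}}^{(N-1)}\right)
```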

  11. Bayesian relationship with Univariate Gaussian distr. • The conjugate prior for the precision (inverse variance) of a univariate Gaussian is the gamma distribution. • When both mean and precision are unknown, the conjugate prior of the univariate Gaussian is the Gaussian-gamma distribution (see the forms below).
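For reference (these densities are not reproduced in the transcript), the standard forms are:

```latex
\mathrm{Gam}(\lambda \mid a, b) = \frac{1}{\Gamma(a)}\, b^{a}\, \lambda^{a-1} \exp(-b\lambda)

p(\mu, \lambda) = \mathcal{N}\!\left(\mu \mid \mu_0, (\beta\lambda)^{-1}\right)\,\mathrm{Gam}(\lambda \mid a, b)
```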

  12. Bayesian relationship with Multivariate Gaussian distr. • The conjugate prior for the precision (inverse covariance) matrix of a multivariate Gaussian is the Wishart distribution. • When both mean and precision are unknown, the conjugate prior of the multivariate Gaussian is the Gaussian-Wishart distribution.

  13. Distributions: variations of the Gaussian distribution

  14. Student’s t-distr. • Used in analysis of variance to test whether an effect is real and statistically significant, via the t-distribution with n-1 degrees of freedom. • If the X_i are normal random variables, the standardised sample mean follows a t-distribution. • The t-distribution has a lower peak and longer tails than the Gaussian (it tolerates more outliers and is therefore more robust). • It is obtained by summing up an infinite number of univariate Gaussians with the same mean but different precisions, i.e. by integrating over the precision (see below).
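The defining integral is not reproduced in the transcript; the standard scale-mixture form it refers to (Bishop, Eqs. 2.158-2.159) is:

```latex
\mathrm{St}(x \mid \mu, \lambda, \nu)
 = \int_{0}^{\infty} \mathcal{N}\!\left(x \mid \mu, (\eta\lambda)^{-1}\right)
   \mathrm{Gam}\!\left(\eta \mid \tfrac{\nu}{2}, \tfrac{\nu}{2}\right) \mathrm{d}\eta
 = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)}
   \left(\frac{\lambda}{\pi\nu}\right)^{1/2}
   \left[1 + \frac{\lambda(x-\mu)^2}{\nu}\right]^{-\frac{\nu+1}{2}}
```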

  15. Student’s t-distr. (cont.) • For a multivariate Gaussian, there is a corresponding multivariate t-distribution, obtained in the same way. • Its density depends on x only through the squared Mahalanobis distance (x - mu)^T Lambda (x - mu). • Its mean is E[x] = mu (for nu > 1) and its covariance is cov[x] = nu/(nu - 2) Lambda^{-1} (for nu > 2).

  16. Gaussian with periodic variables • To avoid the mean being dependent on the choice of origin, use polar coordinates and solve for the angle theta. • The von Mises distribution is a special case of the von Mises-Fisher distribution on the N-dimensional sphere; it is the stationary distribution of a drift process on the circle.

  17. Gaussian with periodic variables (cont.) • Start from a Gaussian in Cartesian coordinates, transform to polar coordinates, and condition on the unit circle; the result is the von Mises distribution, with mean direction theta_0 and precision (concentration) parameter m (see the density below).
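The density on the slide is not reproduced in the transcript; the standard von Mises form (Bishop, Eq. 2.179) is:

```latex
p(\theta \mid \theta_0, m) = \frac{1}{2\pi I_0(m)} \exp\{m \cos(\theta - \theta_0)\},
\qquad
I_0(m) = \frac{1}{2\pi}\int_{0}^{2\pi} \exp\{m\cos\theta\}\,\mathrm{d}\theta
```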

  18. Gaussian with periodic variables: mean and precision • Maximise the log likelihood to obtain the mean theta_0 and the precision (concentration) 'm'. • The estimate for 'm' is found by noting the ratio of modified Bessel functions A(m) = I_1(m)/I_0(m).

  19. Mixture of Gaussians • From Part 1 we already know that one limitation of the Gaussian is its unimodality. • Solution: a linear combination (superposition) of Gaussians, with mixing coefficients that sum to 1. • The posterior over components is known as the 'responsibilities'. • The log likelihood contains a sum over components inside the logarithm, so it has no closed-form maximum (a sketch of the mixture density and responsibilities follows).
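The formulas on the slide are not reproduced here; below is a minimal Python sketch (assuming SciPy is available; the function name and example data are illustrative) of the mixture density p(x) = sum_k pi_k N(x | mu_k, Sigma_k) and the responsibilities gamma_k(x) = pi_k N(x | mu_k, Sigma_k) / p(x).

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density_and_responsibilities(x, weights, means, covs):
    """Evaluate p(x) = sum_k pi_k N(x | mu_k, Sigma_k) and the
    responsibilities gamma_k(x) = pi_k N(x | mu_k, Sigma_k) / p(x)."""
    comps = np.array([w * multivariate_normal.pdf(x, mean=m, cov=c)
                      for w, m, c in zip(weights, means, covs)])
    density = comps.sum()
    responsibilities = comps / density
    return density, responsibilities

# Example: a two-component mixture in 2-D with equal weights.
weights = [0.5, 0.5]
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), np.eye(2)]
print(mixture_density_and_responsibilities(np.array([1.0, 1.0]), weights, means, covs))
```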

  20. Exponential family • Natural (canonical) form: the density is written as a product of h(x), a normalising coefficient g(eta), and an exponential of eta^T u(x). • 1) The Bernoulli distribution can be rewritten in this form. • 2) The multinomial distribution can be rewritten in this form. (Reference forms are given below.)
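The slide's derivations are only in the images; the standard exponential-family forms being referred to are:

```latex
p(x \mid \eta) = h(x)\, g(\eta)\, \exp\{\eta^{\mathsf{T}} u(x)\}

% Bernoulli: p(x \mid \mu) = \mu^{x}(1-\mu)^{1-x}
\eta = \ln\frac{\mu}{1-\mu}, \quad u(x) = x, \quad h(x) = 1, \quad g(\eta) = \sigma(-\eta)

% Multinomial (1-of-K): p(\mathbf{x} \mid \boldsymbol{\mu}) = \prod_k \mu_k^{x_k}
\eta_k = \ln \mu_k, \quad u(\mathbf{x}) = \mathbf{x}, \quad h(\mathbf{x}) = 1, \quad g(\boldsymbol{\eta}) = 1
```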

  21. Exponential family (cont.) • 3) The univariate Gaussian can also be written in natural form; solving for the natural parameters gives eta = (mu/sigma^2, -1/(2 sigma^2))^T with u(x) = (x, x^2)^T. • Maximum likelihood then relates the natural parameters to the sufficient statistics: -grad ln g(eta_ML) = (1/N) sum_n u(x_n).

  22. Parameters of Distributions, and interesting methodologies

  23. Uninformative priors • To avoid building an incorrect subjective assumption into the model ("subjective Bayesian" concerns), use an uninformative prior (e.g. a uniform distribution). • Improper prior: the prior need not integrate to 1 for the posterior to integrate to 1, as per Bayes' equation. • 1) A location parameter calls for a prior with translation invariance. • 2) A scale parameter calls for a prior with scale invariance.

  24. Nonparametric methods • Instead of assuming a form for the distribution, use nonparametric methods. • 1) Histogram with constant bin width: good for sequential data, but suffers from discontinuities at bin edges, and the number of bins grows exponentially with dimensionality. • 2) Kernel estimators: a sum of Parzen windows. If K of the N observations fall in a region R of volume V, the density estimate becomes p(x) ≈ K/(NV).

  25. Nonparametric method: Kernel estimators • 2) Kernel estimators: fix V, and let the data determine K. • The kernel function k(u) gives the contribution of points falling in the region R around x. • h > 0 is a fixed bandwidth parameter that controls smoothing. • This is the Parzen estimator; any suitable kernel k(u) can be chosen (e.g. a Gaussian), as in the sketch below.
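A minimal NumPy sketch of a Parzen estimator with a Gaussian kernel (the function name and the example data are illustrative, not from the slides):

```python
import numpy as np

def parzen_gaussian_kde(x_query, data, h):
    """Parzen estimator with a Gaussian kernel:
    p(x) = (1/N) * sum_n N(x | x_n, h^2 I), evaluated at each query point.
    `data` is an (N, D) array, `x_query` an (M, D) array, `h` the bandwidth."""
    data = np.atleast_2d(data)
    x_query = np.atleast_2d(x_query)
    N, D = data.shape
    # Squared distances between every query point and every data point.
    d2 = ((x_query[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1)
    norm = (2.0 * np.pi * h**2) ** (D / 2.0)
    return np.exp(-d2 / (2.0 * h**2)).sum(axis=1) / (N * norm)

# Example: estimate a 1-D density from samples of a standard normal.
rng = np.random.default_rng(1)
samples = rng.normal(size=(200, 1))
print(parzen_gaussian_kde(np.array([[0.0], [2.0]]), samples, h=0.3))
```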

  26. Nonparametric method: Nearest-neighbour • 3) Nearest neighbour: this time fix K and use the data to grow V until it contains K points; the class priors are p(C_k) = N_k / N. • As with the kernel estimator, the training set is stored as the knowledge base. • 'k' is the number of neighbours; a larger 'k' gives a smoother, less complex boundary with fewer regions. • For classifying N points, of which N_k belong to class C_k, Bayes' theorem is applied and the posterior p(C_k | x) = K_k / K is maximised.

  27. Nonparametric method: Nearest-neighbour (cont.) • 3) Nearest neighbour: assign a new point to the class C_k that receives the majority vote among its k nearest neighbours (a sketch follows). • For k = 1 and N -> ∞, the asymptotic error rate is at most twice the Bayes error rate [k-nearest neighbor algorithm, wiki., 2011].
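A minimal NumPy sketch of k-nearest-neighbour classification by majority vote (function name and example data are illustrative):

```python
import numpy as np
from collections import Counter

def knn_classify(x_query, X_train, y_train, k=3):
    """Classify a single query point by majority vote among its k nearest
    training points (Euclidean distance). `X_train` is (N, D), `y_train` (N,)."""
    d2 = ((X_train - x_query) ** 2).sum(axis=1)      # squared distances
    nearest = np.argsort(d2)[:k]                      # indices of k nearest
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Example: two well-separated classes in 2-D.
rng = np.random.default_rng(2)
X0 = rng.normal(loc=0.0, size=(20, 2))
X1 = rng.normal(loc=4.0, size=(20, 2))
X_train = np.vstack([X0, X1])
y_train = np.array([0] * 20 + [1] * 20)
print(knn_classify(np.array([3.5, 3.5]), X_train, y_train, k=5))   # likely class 1
```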

  28. Basic Graph Concepts, from David Barber’s book, Ch. 2

  29. Directed and undirected graphs • A graph G consists of vertices and edges; the edges may be directed or undirected. • Directed graph: if A->B but not B->A, then A is an ancestor (here, a parent) of B, and B is a child of A. • Directed Acyclic Graph (DAG): a directed graph with no cycles (no path revisits a vertex). • Connected undirected graph: there is a path between every pair of vertices. • Clique: a fully connected subset of vertices in an undirected graph.

  30. Representations of Graphs • Singly connected (tree): only one path from A to B. • Spanning tree of an undirected graph: a singly connected subgraph covering all vertices. • Numerical graph representations: • Edge list: e.g. a list of vertex pairs (i, j), one per edge. • Adjacency matrix A: for N vertices, an N x N matrix with A_ij = 1 if there is an edge from i to j; for an undirected graph the matrix is symmetric.

  31. Representations of Graphs (cont.) • Directed graph: if the vertices are labelled in ancestral order (parents before children), the adjacency matrix is strictly upper triangular, provided there is no edge from a vertex to itself. • An undirected graph with K maximal cliques has an N x K clique matrix, where each column C_k expresses which nodes form clique k. • Example: 2 cliques, vertices {1,2,3} and {2,3,4} (see the sketch below).
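A minimal NumPy sketch of these two representations; the DAG edge set is a hypothetical example, while the clique matrix follows the slide's {1,2,3}/{2,3,4} example.

```python
import numpy as np

# Adjacency matrix for a DAG whose vertices are labelled in ancestral order
# (parents before children): the matrix is strictly upper triangular.
# Edges (hypothetical example, 1-based labels): 1->2, 1->3, 2->4, 3->4.
N = 4
A = np.zeros((N, N), dtype=int)
for i, j in [(1, 2), (1, 3), (2, 4), (3, 4)]:
    A[i - 1, j - 1] = 1
assert np.all(np.tril(A) == 0)        # strictly upper triangular

# Clique matrix for the undirected graph in the slide's example:
# two cliques, vertices {1,2,3} and {2,3,4}. Column k marks clique k's members.
C = np.zeros((N, 2), dtype=int)
C[[0, 1, 2], 0] = 1                   # clique {1,2,3}
C[[1, 2, 3], 1] = 1                   # clique {2,3,4}
print(A)
print(C)
```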

  32. Incidence Matrix • Adjacency matrix A and incidence matrix Z_inc. • Maximal clique matrix Z. • Property: the off-diagonal entries of Z_inc Z_inc^T reproduce the adjacency matrix A (the diagonal entries give the vertex degrees). • Note: the columns of Z_inc denote edges and its rows denote vertices.

  33. Additional Information • Excerpts of graphs and equations from [Pattern Recognition and Machine Learning, Bishop C.M.], pages 84-127. • Excerpts of graphs and equations from [Bayesian Reasoning and Machine Learning, David Barber], pages 19-23. • Slides uploaded to the Google group; please cite when using.
