This chapter explores probability distributions, graphs, and terminologies related to pattern recognition and machine learning. It covers topics such as Schur complement, completing the square, Robbins-Monro algorithm, Bayesian relationships with Gaussian distributions, and variations of Gaussian distributions.
Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs Affiliation: Kyoto University Name: Kevin Chien, Dr. Oba Shigeyuki, Dr. Ishii Shin Date: Nov 04, 2011
For understanding distributions Terminologies
Terminologies • Schur complement: relates the blocks of a partitioned matrix to the blocks of its inverse. • Completing the square: rewriting a quadratic of the form ax^2+bx+c as a(…)^2+const, either to match its terms against the exponent of a Gaussian and read off unknown parameters, or to solve the quadratic. • Robbins-Monro algorithm: iterative root finding for a regression function M(x) that cannot be observed directly but is expressed as a mean, i.e. E[N(x)]=M(x)
Terminologies (cont.) • [Stochastic approximation, Wikipedia, 2011] • Condition on that • Trace Tr(W) is the sum of the diagonal elements of W • Degrees of freedom: dimension of a subspace. Here it refers to a hyperparameter.
Gaussian distributions and motives Distributions
Conditional Gaussian Distribution Assume y=Xa, x=Xb • Derivation of conditional mean and variance: • Noting Schur complement • Linear Gaussian model: observations are a weighted sum of underlying latent variables. The conditional mean is linear in the conditioning variable Xb; the conditional variance is independent of Xb.
Marginal Gaussian Distribution • The goal is again to identify the mean and variance by ‘completing the square’. • Solve the above integral, noting the Schur complement, and compare components
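The conditional mean and covariance can be sketched numerically. This is a minimal sketch, assuming the standard partitioned-Gaussian results mu_a|b = mu_a + S_ab S_bb^-1 (x_b - mu_b) and S_a|b = S_aa - S_ab S_bb^-1 S_ba; the function name `conditional_gaussian` and the example numbers are illustrative and not taken from the slides.

```python
import numpy as np

def conditional_gaussian(mu, sigma, idx_a, idx_b, x_b):
    """Mean and covariance of p(x_a | x_b) for a joint Gaussian N(mu, sigma)."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    mu_a, mu_b = mu[idx_a], mu[idx_b]
    s_aa = sigma[np.ix_(idx_a, idx_a)]
    s_ab = sigma[np.ix_(idx_a, idx_b)]
    s_bb = sigma[np.ix_(idx_b, idx_b)]
    # gain = S_ab S_bb^{-1}, computed with a linear solve (S_bb is symmetric)
    gain = np.linalg.solve(s_bb, s_ab.T).T
    # The mean is linear in x_b
    mu_cond = mu_a + gain @ (np.asarray(x_b, dtype=float) - mu_b)
    # The covariance is the Schur complement of S_bb and does not depend on x_b
    sigma_cond = s_aa - gain @ s_ab.T
    return mu_cond, sigma_cond

# Illustrative (made-up) joint Gaussian over three variables
mu = np.array([0.0, 1.0, -1.0])
sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
idx_a, idx_b = [0], [1, 2]

mu_cond, sigma_cond = conditional_gaussian(mu, sigma, idx_a, idx_b, x_b=[1.5, -0.5])
mu_cond_0, sigma_cond_0 = conditional_gaussian(mu, sigma, idx_a, idx_b, x_b=[0.0, 0.0])

# Schur complement check: the (a, a) block of the precision matrix
# should equal the inverse of the conditional covariance.
precision = np.linalg.inv(sigma)
precision_aa = precision[np.ix_(idx_a, idx_a)]
```

Changing `x_b` shifts `mu_cond` but leaves `sigma_cond` unchanged, which is the "variance is independent of Xb" statement on the slide.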
Bayesian relationship with Gaussian distr. (quick view) • Consider a multivariate Gaussian where • Thus • According to Bayes' theorem • The conditional Gaussian must have a form whose exponent is the difference between the exponents of p(x,y) and p(x) • I.e. becomes
Bayesian relationship with Gaussian distr. • Starting from • Mean and var. for joint Gaussian distr. P(x,y) • Mean and variance for P(x|y) Can be seen as prior Can be seen as likelihood Can be seen as posterior
Bayesian relationship with Gaussian distr., sequential est. • Estimate the mean from (N-1)+1 observations • The Robbins-Monro algorithm has the same form as the above and can find the mean by maximum likelihood. • Solve for by Robbins-Monro
Bayesian relationship with Univariate Gaussian distr. • Conjugate prior for the precision (inv. cov.) of a univariate Gaussian is the gamma distribution • Conjugate prior of a univariate Gaussian with unknown mean and precision is the Gaussian-gamma distribution
Bayesian relationship with Multivariate Gaussian distr. • Conjugate prior for the precision (inv. cov.) matrix of a multivariate Gaussian is the Wishart distr. • Conjugate prior of a multivariate Gaussian with unknown mean and precision is the Gaussian-Wishart distr.
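The sequential estimate can be sketched as follows. This is a minimal sketch, assuming the usual update mu_N = mu_(N-1) + (1/N)(x_N - mu_(N-1)), which splits the data into the first N-1 observations plus one new one; the function name `sequential_mean` and the synthetic data are illustrative.

```python
import numpy as np

def sequential_mean(samples):
    """Maximum-likelihood mean, updated one observation at a time."""
    mu = 0.0
    for n, x in enumerate(samples, start=1):
        # Move the old estimate toward the new point with step size 1/n
        mu += (x - mu) / n
    return mu

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=2.0, size=1000)  # synthetic data, assumed for illustration

mu_seq = sequential_mean(data)
mu_batch = data.mean()
```

The shrinking step size 1/n is what gives the update its Robbins-Monro flavour: each new observation corrects the estimate by a decreasing amount, and the result matches the batch mean up to floating-point error.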
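A small sketch of the conjugate update for the precision, assuming a known mean and a gamma prior Gam(lambda | a0, b0) in the shape/rate parameterization, for which the posterior parameters are a_N = a0 + N/2 and b_N = b0 + (1/2) * sum (x_n - mu)^2. The function name `gamma_posterior` and the prior values are illustrative assumptions.

```python
import numpy as np

def gamma_posterior(x, mu, a0, b0):
    """Posterior Gam(lambda | a_n, b_n) over the precision of N(x | mu, 1/lambda)."""
    x = np.asarray(x, dtype=float)
    a_n = a0 + x.size / 2.0
    b_n = b0 + 0.5 * np.sum((x - mu) ** 2)
    return a_n, b_n

rng = np.random.default_rng(1)
true_precision = 4.0  # assumed for the synthetic data
x = rng.normal(loc=0.0, scale=1.0 / np.sqrt(true_precision), size=5000)

a_n, b_n = gamma_posterior(x, mu=0.0, a0=1.0, b0=1.0)
posterior_mean = a_n / b_n  # mean of a gamma distribution with shape a_n and rate b_n
```

Because the posterior is again a gamma distribution, the update can be repeated as more data arrive, with the previous posterior acting as the new prior.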
Gaussian distributions variations Distributions
Student’s t-distr • Used in analysis of variance to test whether an effect is real and statistically significant, using the t-distr. with n-1 degrees of freedom. • If Xi are normal random variables then • The t-distr. has a lower peak and longer tails than the Gaussian distr. (it tolerates more outliers and is thus more robust). • Obtained by summing (integrating over) an infinite number of univariate Gaussians with the same mean but different precisions
Student’s t-distr (cont.) • For multivariate Gaussian , corresponding t-distr. • Mahalanobis dist. • Mean, variance
Gaussian with periodic variables • To avoid the mean being dependent on the choice of origin, use polar coordinates • Solve for theta • The von Mises distr. is a special case of the von Mises-Fisher distr. for the N-dimensional sphere: the stationary distribution of a drift process on the circle
Gaussian with periodic variables (cont.) • From a Gaussian in Cartesian coordinates to polar coordinates • Becomes • Von Mises distr. • Mean • Precision (concentration)
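The "infinite sum of Gaussians" construction can be checked by sampling. This is a minimal sketch, assuming the scale-mixture form in which a precision multiplier eta is drawn from Gam(nu/2, nu/2) and x is then drawn from a zero-mean Gaussian with precision eta; the choice nu = 10 and the sample size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
nu = 10.0        # degrees of freedom (illustrative choice)
n = 200_000

# NumPy's gamma takes shape and scale, so rate nu/2 becomes scale 2/nu
eta = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=n)

# Same mean (zero) for every draw, but a different precision eta each time
x = rng.normal(loc=0.0, scale=1.0 / np.sqrt(eta))

sample_var = x.var()
theory_var = nu / (nu - 2.0)            # variance of a standard t-distr. for nu > 2
tail_fraction = np.mean(np.abs(x) > 3)  # mass beyond 3, to compare with a Gaussian
```

For a standard Gaussian only about 0.27% of the mass lies beyond 3, so a noticeably larger `tail_fraction` illustrates the longer tails that make the t-distribution robust to outliers.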
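Why the ordinary mean fails for angles can be seen in a short sketch. It assumes the standard approach of mapping each angle to a point on the unit circle, averaging in Cartesian coordinates, and converting back with atan2; the function name `circular_mean` is illustrative.

```python
import numpy as np

def circular_mean(theta):
    """Mean direction of angles (radians), independent of the choice of origin."""
    theta = np.asarray(theta, dtype=float)
    # Average the unit vectors (cos theta, sin theta), then take the angle
    return np.arctan2(np.sin(theta).sum(), np.cos(theta).sum())

angles = np.deg2rad([350.0, 10.0])  # two directions just either side of 0 degrees

naive = angles.mean()          # arithmetic mean of the raw angle values
circ = circular_mean(angles)   # mean direction on the circle
```

Here the arithmetic mean lands at 180 degrees, pointing away from both observations, while the circular mean is 0 degrees up to floating-point error.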
Gaussian with periodic variables: mean and variance • Solving log likelihood • mean • precision ‘m’ • By noting
Mixture of Gaussians • From Part 1 we already know that one limitation of the Gaussian is that it is unimodal. • Solution: linear comb. (superposition) of Gaussians • Mixing coefficients sum to 1 • The posterior over components is known as the ‘responsibilities’ • Log likelihood:
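The mixture density, responsibilities and log likelihood can be sketched for the univariate case. This is a minimal sketch, assuming p(x) = sum_k pi_k N(x | mu_k, var_k) and responsibilities equal to each weighted component divided by p(x); the function names and parameter values are illustrative.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density N(x | mean, var)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def mixture_quantities(x, weights, means, variances):
    """Density, responsibilities and log likelihood of a 1-D Gaussian mixture."""
    x = np.asarray(x, dtype=float)[:, None]                 # shape (N, 1)
    weighted = weights * gaussian_pdf(x, means, variances)  # shape (N, K)
    density = weighted.sum(axis=1)                          # p(x_n)
    responsibilities = weighted / density[:, None]          # posterior over components
    log_likelihood = np.log(density).sum()
    return density, responsibilities, log_likelihood

# Illustrative two-component mixture; the mixing coefficients sum to 1
weights = np.array([0.3, 0.7])
means = np.array([-2.0, 1.5])
variances = np.array([0.5, 1.0])
x = np.array([-2.1, 0.0, 1.4, 3.0])

density, resp, ll = mixture_quantities(x, weights, means, variances)
```

Each row of `resp` sums to 1. The log likelihood contains a sum inside the logarithm, which is why maximum likelihood for mixtures has no closed-form solution.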
Exponential family • Natural form • Normalize by • 1) Bernoulli • Becomes • 2) Multinomial • Becomes
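The Bernoulli case can be verified numerically. This is a minimal sketch, assuming the natural form p(x | eta) = h(x) g(eta) exp(eta u(x)) with u(x) = x, h(x) = 1, eta = ln(mu / (1 - mu)) and g(eta) = sigmoid(-eta); the function names are illustrative.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-a))

def bernoulli_standard(x, mu):
    """Bernoulli in its usual parameterization."""
    return mu ** x * (1.0 - mu) ** (1 - x)

def bernoulli_natural(x, eta):
    """Bernoulli in exponential-family form: g(eta) * exp(eta * x)."""
    return sigmoid(-eta) * np.exp(eta * x)

mu = 0.3                        # illustrative parameter value
eta = np.log(mu / (1.0 - mu))   # natural parameter (log odds)

standard = [bernoulli_standard(x, mu) for x in (0, 1)]
natural = [bernoulli_natural(x, eta) for x in (0, 1)]
recovered_mu = sigmoid(eta)     # inverting the natural parameter gives mu back
```

Both parameterizations give the same probabilities for x = 0 and x = 1, and applying the sigmoid to eta recovers mu.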
Exponential family (cont.) • 3) Univariate Gaussian • Becomes • Solve for natural parameter • Becomes • From max. likelihood
And interesting methodologies Parameters of Distributions
Uninformative priors • “Subjective Bayesian”: avoid incorrect assumptions by using an uninformative prior (ex. uniform distr.). • Improper prior: the prior need not integrate to 1, provided the posterior obtained from Bayes' theorem does. • 1) Location parameter, for translation invariance • 2) Scale parameter, for scale invariance in
Nonparametric methods • Instead of assuming a form for the distribution, use nonparametric methods. • 1) Histogram of constant bin width • Good for sequential data • Problems: the estimate is discontinuous at bin edges, and the number of bins grows exponentially with dimensionality • 2) Kernel estimators: sum of Parzen windows • Of ‘N’ observations, ‘K’ fall in region R (volume V) • becomes
Nonparametric method: Kernel estimators • 2) Kernel estimators: fix V, determine K • Form of kernel function for points falling in R • h>0 is a fixed bandwidth parameter that controls smoothing • Parzen estimator. Can choose k(u) (ex. Gaussian)
Nonparametric method: Nearest-neighbor • 3) Nearest neighbor: this time fix K and use the data to grow V Prior: • Same as kernel estimator: the training set is stored as the knowledge base. • ‘k’ is the number of neighbors; a larger ‘k’ gives a smoother, less complex boundary with fewer regions. • For classification with N points in total, of which Nk are in class Ck, apply Bayes' theorem and maximize
Nonparametric method: Nearest-neighbor (cont.) • 3) Nearest neighbor: assign a new point to class Ck by majority vote of its k nearest neighbors • For k=1 and n->∞, the error is at most twice the Bayes error rate [k-nearest neighbor algorithm, Wikipedia, 2011]
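A Parzen estimator with a Gaussian kernel can be sketched in one dimension. This is a minimal sketch, assuming the estimate p(x) = (1/N) sum_n N(x | x_n, h^2); the function name `parzen_gaussian`, the bandwidth h = 0.3 and the synthetic data are illustrative.

```python
import numpy as np

def parzen_gaussian(x_query, data, h):
    """1-D kernel density estimate: one Gaussian of width h on each data point."""
    x_query = np.atleast_1d(np.asarray(x_query, dtype=float))[:, None]  # (Q, 1)
    data = np.asarray(data, dtype=float)[None, :]                       # (1, N)
    kernels = np.exp(-0.5 * ((x_query - data) / h) ** 2) / np.sqrt(2.0 * np.pi * h ** 2)
    return kernels.mean(axis=1)  # average over the N data points

rng = np.random.default_rng(3)
data = rng.normal(size=500)  # synthetic observations, assumed for illustration

grid = np.linspace(-8.0, 8.0, 2001)
density = parzen_gaussian(grid, data, h=0.3)

# Riemann-sum check that the estimate is (approximately) normalized
area = np.sum(density) * (grid[1] - grid[0])
```

A smaller h gives a spikier estimate that follows individual points, and a larger h gives a smoother one. The whole data set must be kept to evaluate the density.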
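The majority-vote rule can be sketched as follows. This is a minimal sketch, assuming Euclidean distance and a brute-force search over the stored training set; the function name `knn_classify` and the toy data are illustrative.

```python
import numpy as np
from collections import Counter

def knn_classify(x_new, X_train, y_train, k=3):
    """Assign x_new to the class most common among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    dists = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]                # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)   # count class labels among them
    return votes.most_common(1)[0][0]

# Toy training set with two well-separated classes (made up for illustration)
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_train = ["a", "a", "a", "b", "b", "b"]

label_near_a = knn_classify([0.5, 0.5], X_train, y_train, k=3)
label_near_b = knn_classify([5.5, 5.5], X_train, y_train, k=3)
```

There is no training step: the training set itself is the model, so every prediction costs a pass over all stored points unless a search structure is added.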
From David Barber’s book Ch.2 Basic Graph Concepts
Directed and undirected graphs • A graph G consists of vertices and edges, which are directed or undirected. • Directed graph: if A->B but not B->A then A is a parent (an ancestor) of B, and B is a child of A • Directed Acyclic Graph (DAG): directed graph with no cycles (no path along the edge directions revisits a vertex) • Connected undirected graph: there is a path between every pair of vertices • Clique: a fully connected subset of vertices in an undirected graph
Representations of Graphs • Singly connected (tree): only one path from A to B • Spanning tree of an undirected graph: singly connected subset of edges covering all vertices • Graph representation (numerical) • Edge list: ex. • Adjacency matrix A: for N vertices, an NxN matrix where Aij=1 if there is an edge from i to j. For an undirected graph this is symmetric.
Representations of Graphs (cont.) • Directed graph: if vertices are labeled in ancestral order (parents before children) then the adjacency matrix is strictly upper triangular • Provided there is no edge from a vertex to itself • An undirected graph with K maximal cliques has an N x K clique matrix, where each column Ck expresses which vertices form a clique. • 2 cliques: vertices {1,2,3} • and {2,3,4}
Incidence Matrix • Adjacency matrix A and incidence matrix Z_inc • Maximal clique matrix Z • Property: • Note: Z_inc columns denote edges, and rows denote vertices
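Building an adjacency matrix from an edge list can be sketched as follows. The edge-list example is missing from this transcript, so the edges here are an assumed illustration, and the helper `adjacency_from_edges` is a hypothetical name.

```python
import numpy as np

def adjacency_from_edges(n_vertices, edges, directed=False):
    """Adjacency matrix with A[i, j] = 1 for an edge i -> j (vertices numbered from 1)."""
    A = np.zeros((n_vertices, n_vertices), dtype=int)
    for i, j in edges:
        A[i - 1, j - 1] = 1
        if not directed:
            A[j - 1, i - 1] = 1  # an undirected edge is stored in both directions
    return A

# Assumed example edge list on 4 vertices
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]

A_undirected = adjacency_from_edges(4, edges)
A_directed = adjacency_from_edges(4, edges, directed=True)
```

The undirected matrix is symmetric. In the directed version every edge here runs from a lower to a higher label, so all entries sit strictly above the diagonal.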
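The property itself is missing from this transcript. The sketch below checks one standard relation that may or may not be the one shown on the slide: Z_inc Z_inc^T equals the adjacency matrix with the vertex degrees on the diagonal. It assumes the graph formed by the two cliques {1,2,3} and {2,3,4}; variable names are illustrative.

```python
import numpy as np

n = 4
# Edges of the graph whose maximal cliques are {1,2,3} and {2,3,4}
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]

A = np.zeros((n, n), dtype=int)               # adjacency matrix
Z_inc = np.zeros((n, len(edges)), dtype=int)  # rows: vertices, columns: edges
for col, (i, j) in enumerate(edges):
    A[i - 1, j - 1] = A[j - 1, i - 1] = 1
    Z_inc[i - 1, col] = Z_inc[j - 1, col] = 1

# Clique matrix: column k marks the vertices in clique k
Z = np.array([[1, 0],
              [1, 1],
              [1, 1],
              [0, 1]])

degrees = A.sum(axis=1)
gram_inc = Z_inc @ Z_inc.T   # off-diagonals: adjacency, diagonal: vertex degrees
gram_clique = Z @ Z.T        # entry (i, j): number of cliques containing both i and j
```

In `gram_clique`, the diagonal counts how many cliques each vertex belongs to, and vertices 2 and 3 share both cliques.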
Additional Information • Excerpt of graph and equations from [Pattern Recognition and Machine Learning, Bishop C.M.] pages 84-127. • Excerpt of graph and equations from [Bayesian Reasoning and Machine Learning, David Barber] pages 19-23. • Slide uploaded to Google group. Use with reference.