This chapter explores probability distributions, graphs, and terminologies related to pattern recognition and machine learning. It covers topics such as Schur complement, completing the square, Robbins-Monro algorithm, Bayesian relationships with Gaussian distributions, and variations of Gaussian distributions.
Pattern Recognition and Machine Learning, Chapter 2: Probability Distributions (Part 2) + Graphs
Affiliation: Kyoto University
Name: Kevin Chien, Dr. Oba Shigeyuki, Dr. Ishii Shin
Date: Nov 04, 2011
Terminologies (for understanding distributions)
Terminologies
• Schur complement: relates the blocks of a partitioned matrix to the blocks of its inverse.
• Completing the square: converting a quadratic of the form ax^2 + bx + c into a(...)^2 + const, so that its terms can be matched against the exponent of a standard Gaussian to identify unknowns, or so that the quadratic can be solved directly.
• Robbins-Monro algorithm: iterative root finding for an unobserved regression function M(x) that is only available as the mean of noisy observations, i.e. E[N(x)] = M(x).
Terminologies (cont.)
• Conditions on the Robbins-Monro step sizes a_n: they must shrink to zero, with sum a_n = infinity and sum a_n^2 < infinity [Stochastic approximation, Wikipedia, 2011]; a minimal code sketch follows below.
• Trace Tr(W): the sum of the diagonal elements of W.
• Degrees of freedom: the dimension of a subspace; here it refers to a hyperparameter.
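As a concrete illustration of the Robbins-Monro update, here is a minimal Python sketch. The function name `robbins_monro` and the toy noisy regression function are illustrative assumptions, not taken from the slides; step sizes a_n = 1/n satisfy the conditions above.

```python
import numpy as np

def robbins_monro(noisy_f, theta0, n_iter=10000, seed=0):
    """Stochastic root finding: seek theta* with M(theta*) = E[noisy_f(theta*)] = 0.

    Step sizes a_n = 1/n satisfy sum a_n = inf and sum a_n^2 < inf.
    """
    rng = np.random.default_rng(seed)
    theta = theta0
    for n in range(1, n_iter + 1):
        theta -= (1.0 / n) * noisy_f(theta, rng)   # move against the noisy observation
    return theta

# Toy regression function M(theta) = theta - 2 observed with Gaussian noise; root is 2.
noisy = lambda th, rng: (th - 2.0) + rng.normal(scale=1.0)
print(robbins_monro(noisy, theta0=0.0))   # approximately 2.0
```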
Distributions: Gaussian distributions and motives
Conditional Gaussian Distribution
• Partition the jointly Gaussian variable into two parts, x_a and x_b (in the slide's notation, y = x_a and x = x_b).
• The conditional mean and covariance are derived by completing the square in the joint exponent and noting the Schur complement; the resulting formulas are given below.
• Linear Gaussian model: observations are a weighted sum of underlying latent variables. The conditional mean is linear with respect to x_b, and the conditional covariance is independent of x_b.
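For reference, this is the standard partitioned-Gaussian result (mean blocks mu_a, mu_b and covariance blocks Sigma_aa, Sigma_ab, Sigma_bb), which the slide's figures presumably showed:

```latex
p(x_a \mid x_b) = \mathcal{N}\!\left(x_a \mid \mu_{a|b},\, \Sigma_{a|b}\right),\qquad
\mu_{a|b} = \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b),\qquad
\Sigma_{a|b} = \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}
```

Here Sigma_{a|b} is exactly the Schur complement of Sigma_bb; note that the mean is linear in x_b and the covariance does not depend on x_b, as stated above.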
Marginal Gaussian Distribution
• The goal is again to identify the mean and covariance by 'completing the square'.
• Solve the marginalization integral over x_b, noting the Schur complement, and compare components with a standard Gaussian; the result is given below.
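The corresponding marginal, in the same partitioned notation, is simply:

```latex
p(x_a) = \int p(x_a, x_b)\,\mathrm{d}x_b = \mathcal{N}(x_a \mid \mu_a,\, \Sigma_{aa})
```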
Bayesian relationship with Gaussian distr. (quick view)
• Consider a multivariate Gaussian defined over the joint variables (x, y).
• By Bayes' equation, p(y | x) = p(x, y) / p(x).
• The conditional Gaussian must therefore have an exponent equal to the difference between the exponents of p(x, y) and p(x).
Bayesian relationship with Gaussian distr.
• Starting from a Gaussian prior over x and a Gaussian likelihood for y that is linear in x,
• the mean and covariance of the joint Gaussian distribution p(x, y), and
• the mean and covariance of p(x | y), can be read off by completing the square.
• p(x) can be seen as the prior, p(y | x) as the likelihood, and p(x | y) as the posterior; the standard formulas are summarized below.
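For reference, these are the standard linear-Gaussian results (PRML, section on Bayes' theorem for Gaussian variables), writing the prior as p(x) = N(x | mu, Lambda^{-1}) and the likelihood as p(y | x) = N(y | Ax + b, L^{-1}):

```latex
p(y) = \mathcal{N}\!\left(y \mid A\mu + b,\; L^{-1} + A\Lambda^{-1}A^{\top}\right),\qquad
p(x \mid y) = \mathcal{N}\!\left(x \mid \Sigma\{A^{\top}L(y-b) + \Lambda\mu\},\; \Sigma\right),\qquad
\Sigma = (\Lambda + A^{\top}LA)^{-1}
```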
Bayesian relationship with Gaussian distr., sequential est.
• Estimate the mean from (N-1)+1 observations: the ML estimate from N points equals the estimate from the first N-1 points plus a correction involving only the N-th point.
• The Robbins-Monro algorithm has exactly this form, so the maximum-likelihood mean can be solved for sequentially by Robbins-Monro updates; a minimal sketch follows below.
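A minimal sketch of the sequential update mu_N = mu_{N-1} + (1/N)(x_N - mu_{N-1}); the function name and the toy data are illustrative assumptions.

```python
import numpy as np

def sequential_mean(stream):
    """Sequential ML estimate of a Gaussian mean:
    mu_N = mu_{N-1} + (1/N) * (x_N - mu_{N-1})."""
    mu = 0.0
    for n, x in enumerate(stream, start=1):
        mu += (x - mu) / n          # Robbins-Monro-style correction with step 1/N
    return mu

rng = np.random.default_rng(1)
data = rng.normal(loc=3.0, scale=1.0, size=5000)
print(sequential_mean(data), data.mean())   # the two estimates agree
```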
Bayesian relationship with Univariate Gaussian distr.
• The conjugate prior for the precision (inverse variance) of a univariate Gaussian is the gamma distribution (posterior update below).
• When both mean and precision are unknown, the conjugate prior is the Gaussian-gamma distribution.
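For reference, the standard conjugate update (with the Gaussian mean mu assumed known): starting from Gam(lambda | a_0, b_0), the posterior after N observations is again a gamma distribution with

```latex
a_N = a_0 + \frac{N}{2},\qquad
b_N = b_0 + \frac{1}{2}\sum_{n=1}^{N}(x_n - \mu)^2
```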
Bayesian relationship with Multivariate Gaussian distr.
• The conjugate prior for the precision (inverse covariance) matrix of a multivariate Gaussian is the Wishart distribution.
• When both mean and precision are unknown, the conjugate prior is the Gaussian-Wishart distribution.
Distributions: variations of the Gaussian distribution
Student's t-distr
• Used in analysis of variance to test whether an effect is real and statistically significant, via the t-distribution with n-1 degrees of freedom.
• If the X_i are normal random variables, the standardized sample mean (sample mean minus true mean, divided by the sample standard deviation over sqrt(n)) follows a t-distribution.
• The t-distribution has a lower peak and longer tails than the Gaussian (it allows more outliers and is therefore more robust).
• It is obtained by integrating over an infinite mixture of univariate Gaussians with the same mean but different precisions, as written out below.
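Explicitly, the infinite mixture referred to above is the standard construction:

```latex
\mathrm{St}(x \mid \mu, \lambda, \nu)
  = \int_{0}^{\infty} \mathcal{N}\!\left(x \mid \mu, (\eta\lambda)^{-1}\right)\,
    \mathrm{Gam}\!\left(\eta \mid \tfrac{\nu}{2}, \tfrac{\nu}{2}\right)\mathrm{d}\eta
```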
Student's t-distr (cont.)
• For a multivariate Gaussian there is a corresponding multivariate t-distribution.
• Its density depends on x only through the squared Mahalanobis distance Delta^2 = (x - mu)^T Lambda (x - mu).
• Mean and variance: E[x] = mu (for nu > 1) and cov[x] = nu/(nu - 2) Lambda^{-1} (for nu > 2).
Gaussian with periodic variables
• To avoid the mean being dependent on the choice of origin, use polar coordinates.
• Solve for the angle theta.
• The von Mises distribution, a special case of the von Mises-Fisher distribution on the N-dimensional sphere, is the stationary distribution of a drift process on the circle.
Gaussian with periodic variables (cont.)
• Transform the Gaussian from Cartesian coordinates to polar coordinates and condition on the unit circle.
• This becomes the von Mises distribution, with two parameters:
• the mean direction, and
• the precision (concentration) parameter.
Gaussian with periodic variables: mean and variance
• Maximize the log likelihood to obtain:
• the mean direction (from the sums of sines and cosines of the observed angles), and
• the precision (concentration) 'm',
• by noting the ratio of modified Bessel functions A(m) = I_1(m)/I_0(m); a numerical sketch follows below.
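A small numerical sketch of these ML estimates; the function name `fit_von_mises` is illustrative, and the root bracket for the concentration is an assumption that works for moderately concentrated data.

```python
import numpy as np
from scipy.special import iv          # modified Bessel functions I_nu
from scipy.optimize import brentq

def fit_von_mises(theta):
    """ML fit of a von Mises distribution to angles theta (in radians).

    Mean direction:  theta_bar = atan2(sum sin theta_n, sum cos theta_n)
    Concentration m: solve I_1(m)/I_0(m) = mean cos(theta_n - theta_bar)
    """
    theta_bar = np.arctan2(np.sin(theta).sum(), np.cos(theta).sum())
    r = np.mean(np.cos(theta - theta_bar))        # mean resultant length
    m_ml = brentq(lambda m: iv(1, m) / iv(0, m) - r, 1e-6, 1e3)  # bracket assumed adequate
    return theta_bar, m_ml

rng = np.random.default_rng(0)
samples = rng.vonmises(mu=1.0, kappa=5.0, size=2000)
print(fit_von_mises(samples))   # roughly (1.0, 5.0)
```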
Mixture of Gaussians
• In Part 1 we already saw one limitation of the Gaussian: it is unimodal.
• Solution: a linear combination (superposition) of Gaussians.
• The mixing coefficients sum to 1.
• The posterior probability of a component given a data point is known as its 'responsibility'.
• Log likelihood: ln p(X) = sum_n ln { sum_k pi_k N(x_n | mu_k, Sigma_k) }; a small numeric sketch follows below.
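A small numeric sketch of the mixture density and the responsibilities for a univariate two-component example; all parameter values and names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def gmm_density_and_responsibilities(x, pis, mus, sigmas):
    """Univariate Gaussian mixture p(x) = sum_k pi_k * N(x | mu_k, sigma_k^2).

    Returns the mixture density and the responsibilities
    gamma_k(x) = pi_k N_k(x) / sum_j pi_j N_j(x).
    """
    comps = np.array([pi * norm.pdf(x, mu, s)
                      for pi, mu, s in zip(pis, mus, sigmas)])   # shape (K, M)
    p = comps.sum(axis=0)
    return p, comps / p

p, gamma = gmm_density_and_responsibilities(
    np.array([-1.0, 0.5, 3.0]), pis=[0.3, 0.7], mus=[0.0, 3.0], sigmas=[1.0, 0.5])
print(p)        # mixture density at each query point
print(gamma)    # responsibilities; each column sums to 1
```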
Exponential family
• Natural form: p(x | eta) = h(x) g(eta) exp{eta^T u(x)}.
• Normalize by g(eta).
• 1) The Bernoulli distribution can be rewritten in this form (worked out below).
• 2) The multinomial distribution can likewise be rewritten in this form.
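Worked out for the Bernoulli case (standard manipulation):

```latex
p(x \mid \mu) = \mu^{x}(1-\mu)^{1-x}
  = (1-\mu)\exp\!\left\{ x \ln\frac{\mu}{1-\mu} \right\},\qquad
\eta = \ln\frac{\mu}{1-\mu},\quad u(x)=x,\quad h(x)=1,\quad g(\eta)=\sigma(-\eta)
```

where sigma is the logistic sigmoid, so that mu = sigma(eta).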
Exponential family (cont.)
• 3) Univariate Gaussian: rewriting N(x | mu, sigma^2) in natural form gives natural parameters eta = (mu/sigma^2, -1/(2 sigma^2)) and sufficient statistics u(x) = (x, x^2).
• Solving for the natural parameters and applying maximum likelihood equates the gradient of -ln g(eta) with the sample average of the sufficient statistics u(x).
Parameters of Distributions (and interesting methodologies)
Uninformative priors
• "Subjective Bayesian": avoid incorrect assumptions by using an uninformative prior (e.g. a uniform distribution).
• Improper prior: the prior need not normalize to 1 for the posterior to normalize to 1, as per Bayes' equation.
• 1) A location parameter calls for a prior with translation invariance.
• 2) A scale parameter calls for a prior with scale invariance.
Nonparametric methods
• Instead of assuming a form for the distribution, use nonparametric methods.
• 1) Histogram with constant bin width:
• good for sequential data;
• problems: discontinuities at bin edges, and the number of bins increases exponentially with dimensionality.
• 2) Kernel estimators: a sum of Parzen windows.
• If 'K' of the 'N' observations fall in a region R of volume V, the density estimate becomes p(x) ≈ K / (N V).
Nonparametric method: Kernel estimators
• 2) Kernel estimators: fix V and determine K from the data.
• The kernel function counts points falling in the region R around the query point.
• h > 0 is a fixed bandwidth parameter that controls smoothing.
• This is the Parzen estimator; any smooth kernel k(u) can be chosen (e.g. a Gaussian), as sketched below.
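A minimal sketch of a Parzen estimator with a Gaussian kernel, p(x) = (1/N) sum_n N(x | x_n, h^2 I); function name and data are illustrative.

```python
import numpy as np

def parzen_gaussian_density(x, data, h):
    """Kernel (Parzen-window) density estimate with a Gaussian kernel of bandwidth h:
    p(x) = (1/N) * sum_n N(x | x_n, h^2 * I)."""
    data = np.atleast_2d(data)                   # (N, D) training points
    x = np.atleast_2d(x)                         # (M, D) query points
    _, D = data.shape
    sq = ((x[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1)   # (M, N) squared distances
    kern = np.exp(-sq / (2 * h ** 2)) / (2 * np.pi * h ** 2) ** (D / 2)
    return kern.mean(axis=1)

rng = np.random.default_rng(0)
sample = rng.normal(size=(500, 1))
print(parzen_gaussian_density(np.array([[0.0], [2.0]]), sample, h=0.3))
```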
Nonparametric method: Nearest-neighbor
• 3) Nearest neighbor: this time use the data to grow V until it contains K points.
• As with the kernel estimator, the training set is stored as the knowledge base.
• 'k' is the number of neighbors; a larger 'k' gives a smoother, less complex boundary with fewer regions.
• For classifying N points, with N_k of them in class C_k, Bayes' theorem gives a posterior proportional to the fraction of the K neighbors belonging to C_k, which is maximized by taking the most represented class.
Nonparametric method: Nearest-neighbor (cont.)
• 3) Nearest neighbor: assign a new point to class C_k by majority vote of its k nearest neighbors (see the sketch below).
• For k = 1 and n -> ∞, the error rate is bounded above by twice the Bayes error rate [k-nearest neighbor algorithm, Wikipedia, 2011].
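A minimal sketch of the majority-vote rule with Euclidean distance; the function name and toy data are illustrative.

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    """Assign x to the class holding the majority among its k nearest
    training points (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest_labels = y_train[np.argsort(dists)[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_classify(np.array([0.2, 0.1]), X, y, k=3))   # -> 0
```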
Basic Graph Concepts (from David Barber's book, Ch. 2)
Directed and undirected graphs
• A graph G consists of vertices and edges; edges may be directed or undirected.
• Directed graph: if A -> B but not B -> A, then A is an ancestor (here, a parent) of B, and B is a child of A.
• Directed Acyclic Graph (DAG): a directed graph with no cycles (no path revisits a vertex).
• Connected undirected graph: there is a path between every pair of vertices.
• Clique: a fully connected subset of vertices of an undirected graph (every pair in the subset is joined by an edge).
Representations of Graphs
• Singly connected graph (tree): only one path from A to B for any pair of vertices.
• Spanning tree of an undirected graph: a singly connected subgraph covering all vertices.
• Numerical graph representations:
• Edge list: a list of vertex pairs, e.g. {(1,2), (2,3)}.
• Adjacency matrix A: for N vertices, an N x N matrix with A_ij = 1 if there is an edge from i to j. For an undirected graph this is symmetric; a small construction sketch follows below.
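A small sketch of building an adjacency matrix from an edge list (0-based vertex labels; the function name is an illustrative choice):

```python
import numpy as np

def adjacency_matrix(n_vertices, edges, directed=False):
    """Build an N x N adjacency matrix from an edge list.
    A[i, j] = 1 if there is an edge from vertex i to vertex j;
    for an undirected graph the matrix is symmetric."""
    A = np.zeros((n_vertices, n_vertices), dtype=int)
    for i, j in edges:
        A[i, j] = 1
        if not directed:
            A[j, i] = 1
    return A

# 4 vertices, edges as (from, to) pairs
print(adjacency_matrix(4, [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]))
```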
Representations of Graphs (cont.)
• Directed graph: if the vertices are labelled in ancestral order (parents before children), the adjacency matrix is strictly upper triangular,
• provided there are no edges from a vertex to itself.
• An undirected graph with K maximal cliques has an N x K clique matrix, where each column c_k indicates which vertices form clique k.
• Example: 2 cliques, vertices {1,2,3} and {2,3,4} (matrix written out below).
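Written out, the clique matrix for the slide's example (4 vertices, maximal cliques {1,2,3} and {2,3,4}; rows index vertices, columns index cliques) would be:

```latex
Z = \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 1 \\ 0 & 1 \end{pmatrix}
```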
Incidence Matrix
• Adjacency matrix A and edge incidence matrix Z_inc; Z_inc columns denote edges and rows denote vertices.
• Maximal-clique incidence (clique) matrix Z.
• Property: the product Z_inc Z_inc^T reproduces the adjacency structure of A off the diagonal, with the vertex degrees on the diagonal.
Additional Information
• Graphs and equations excerpted from [Bishop, C.M., Pattern Recognition and Machine Learning], pages 84-127.
• Graphs and equations excerpted from [Barber, D., Bayesian Reasoning and Machine Learning], pages 19-23.
• Slides uploaded to the Google group. Use with reference.