Bayesian regularization of learning Sergey Shumsky NeurOK Software LLC
Scientific methods (diagram) • Induction (F. Bacon): from Data to Models (Machine Learning) • Deduction (R. Descartes): from Models to Data (Math. modeling)
Outline • Learning as ill-posed problem • General problem: data generalization • General remedy: model regularization • Bayesian regularization. Theory • Hypothesis comparison • Model comparison • Free Energy & EM algorithm • Bayesian regularization. Practice • Hypothesis testing • Function approximation • Data clustering
Problem statement • Learning is an inverse, ill-posed problem: Data → Model • Learning paradoxes • Infinite predictions from finite data? • How to optimize future predictions? • How to separate the regular from the accidental in data? • Regularization of learning • Optimal model complexity
Well-posed problem • Solution is unique • Solution is stable • Hadamard (1900s) • Tikhonov (1960s)
Learning from examples • Problem: • Find hypothesis h generating observed data D in model H • Well defined if not sensitive to: • noise in data (Hadamard) • learning procedure (Tikhonov)
Learning is an ill-posed problem • Example: function approximation • Sensitive to noise in data • Sensitive to the learning procedure
Learning is an ill-posed problem • Solution is non-unique
Problem regularization • Main idea: restrict solutions – sacrifice precision for stability • How to choose the restriction?
Statistical learning practice • Data = Learning set + Validation set • Cross-validation • Systematic approach to ensembles: Bayes
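As a concrete illustration of the cross-validation bullet above, here is a minimal sketch assuming a simple ridge-regression learner on synthetic data; the learner, the data, and the grid of regularization strengths are illustrative choices, not from the slides.

```python
# Hypothetical K-fold cross-validation for picking a regularization strength.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + 0.3 * rng.normal(size=60)

def ridge_fit(X, y, alpha):
    """Learning set -> hypothesis (ridge weights) for a given regularization alpha."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

def cv_error(X, y, alpha, k=5):
    """Average validation error over k folds (data = learning set + validation set)."""
    idx = np.arange(len(y))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        w = ridge_fit(X[train], y[train], alpha)
        errs.append(np.mean((y[fold] - X[fold] @ w) ** 2))
    return float(np.mean(errs))

for alpha in (0.01, 0.1, 1.0, 10.0):
    print(f"alpha = {alpha:5}: CV error = {cv_error(X, y, alpha):.3f}")
```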
Statistical learning theory • Learning as inverse probability • Probability theory. H: h → D, Bernoulli (1713) • Learning theory. H: D → h, Bayes (~1750)
Bayesian learning • Bayes' rule: Posterior = Likelihood × Prior / Evidence • P(h | D, H) = P(D | h, H) P(h | H) / P(D | H)
Bayesian regularization • Most probable (MAP) hypothesis: maximize P(D | h, H) P(h | H) • −log Posterior = Learning error (−log Likelihood) + Regularization (−log Prior) + const • Example: function approximation
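Written out, the decomposition the bullets above refer to is the standard one (the Gaussian special case is an assumption added here for the function-approximation example):

$$
-\log P(h \mid D, H) \;=\; \underbrace{-\log P(D \mid h, H)}_{\text{learning error}} \;+\; \underbrace{-\log P(h \mid H)}_{\text{regularization}} \;+\; \log P(D \mid H).
$$

With Gaussian noise of variance $\sigma^2$ and a Gaussian prior of precision $\alpha$ on the weights $w$, the most probable hypothesis minimizes the familiar regularized least-squares objective

$$
E(w) \;=\; \frac{1}{2\sigma^{2}} \sum_{i} \bigl(y_i - h(x_i; w)\bigr)^{2} \;+\; \frac{\alpha}{2}\,\lVert w \rVert^{2}.
$$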
Minimum Description Length, Rissanen (1978) • Most probable hypothesis = shortest code • Code length for: data, −log P(D | h); hypothesis, −log P(h) • Example: optimal prefix code (code tree with codewords such as 0, 10, 110, 111)
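To connect the prefix-code example to description length, here is a small sketch using Shannon code lengths ceil(-log2 p); the source probabilities are made up for illustration and happen to reproduce codeword lengths matching 0, 10, 110, 111.

```python
# Shannon code lengths for a made-up source: the description length of a symbol
# tracks -log2 of its probability, and the lengths satisfy Kraft's inequality.
import math

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = {s: math.ceil(-math.log2(p)) for s, p in probs.items()}

kraft = sum(2.0 ** -l for l in lengths.values())            # <= 1 for a prefix code
avg_len = sum(p * lengths[s] for s, p in probs.items())     # expected code length
entropy = -sum(p * math.log2(p) for p in probs.values())    # lower bound

print(lengths)                   # {'a': 1, 'b': 2, 'c': 3, 'd': 3} -> e.g. 0, 10, 110, 111
print(kraft, avg_len, entropy)   # 1.0, 1.75 bits, 1.75 bits
```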
Data complexity, Kolmogorov (1965) • Complexity: K(D | H) = min_h L(h, D | H) • Code length: L(h, D) = coded data L(D | h) + decoding program L(h) • Decoding: program h → data D
Complex = unpredictable, Solomonoff (1978) • Prediction error ~ L(h, D) / L(D) • Random data is incompressible • Compression = predictability • Example: block coding (program h of length L(h, D) decodes data D)
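A quick numeric check of "random data is incompressible, compression = predictability", using a general-purpose compressor as a stand-in for the block coding mentioned above (the bias values and sequence length are illustrative):

```python
# Biased (predictable) bit sequences compress well; fair-coin sequences do not.
import zlib
import numpy as np

rng = np.random.default_rng(0)
for p in (0.5, 0.9, 0.99):
    bits = rng.random(100_000) < p            # Bernoulli(p) sequence
    raw = np.packbits(bits).tobytes()         # pack 8 bits per byte
    ratio = len(zlib.compress(raw, 9)) / len(raw)
    print(f"p = {p}: compressed/original = {ratio:.2f}")
```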
Universal prior, Solomonoff (1960), Bayes (~1750) • All 2^L programs of length L are equiprobable: P(h | H) ~ 2^(−L(h)) • Data complexity: −log P(D | H) ≈ K(D | H)
Statistical ensemble • Shorter description length • Proof: −log Σ_h P(D | h) P(h) ≤ −log max_h P(D | h) P(h) • Corollary: ensemble predictions are superior to the most probable prediction
Model comparison • Posterior over models: P(H | D) ∝ P(D | H) P(H) • Evidence: P(D | H) = Σ_h P(D | h, H) P(h | H)
Statistics: Bayes vs. Fisher • Fisher: max Likelihood • Bayes: max Evidence
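A toy numeric contrast between the two criteria (an illustrative example, not from the slides): compare a fair coin H0 against a coin with unknown bias and a uniform prior H1, given k heads out of N flips.

```python
# Maximum likelihood vs. evidence for H0 (fair coin) against H1 (unknown bias).
import math

def log_evidence_h1(k, N):
    # Evidence: integral of p^k (1 - p)^(N - k) dp over a uniform prior = Beta(k+1, N-k+1)
    return math.lgamma(k + 1) + math.lgamma(N - k + 1) - math.lgamma(N + 2)

def log_max_likelihood_h1(k, N):
    # Likelihood maximized at p = k / N
    p = k / N
    return k * math.log(p) + (N - k) * math.log(1 - p) if 0 < k < N else 0.0

for k, N in [(52, 100), (70, 100)]:
    log_h0 = N * math.log(0.5)
    print(f"k = {k}/{N}: logL(H0) = {log_h0:.1f}, "
          f"max logL(H1) = {log_max_likelihood_h1(k, N):.1f}, "
          f"log evidence(H1) = {log_evidence_h1(k, N):.1f}")
```

At 52/100 the maximized likelihood of H1 is (trivially) slightly higher than that of H0, yet the evidence still prefers the fair coin; at 70/100 both criteria favor the biased coin. That is the complexity penalty built into max Evidence.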
Historical outlook • 1920s to 1960s • Parametric statistics: Fisher (1912) • Asymptotic: N → ∞ • 1960s to 1980s • Non-parametric statistics: Chentsov (1962) • Regularization of ill-posed problems: Tikhonov (1963) • Non-asymptotic learning: Vapnik (1968) • Algorithmic complexity: Kolmogorov (1965) • Statistical physics of disordered systems: Gardner (1988)
Statistical physics • Probability of hypothesis - microstate • Optimal model - macrostate
Free energy • F = −log Z: log of a sum • F = E − TS: sum of logs • P = P{L}: Gibbs distribution over hypotheses
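A small numeric check of the two faces of the free energy named above, with T = 1 and a handful of arbitrary description lengths L(h, D) standing in for real hypotheses:

```python
# F = -log Z ("log of a sum") equals E - T*S ("sum of logs") at the Gibbs distribution P{L}.
import numpy as np

L = np.array([2.0, 2.5, 4.0, 7.0])      # description lengths of a few hypotheses
T = 1.0

Z = np.sum(np.exp(-L / T))              # partition function
F_logsum = -T * np.log(Z)

P = np.exp(-L / T) / Z                  # Gibbs distribution over hypotheses
E = np.sum(P * L)                       # mean description length ("energy")
S = -np.sum(P * np.log(P))              # entropy of P
F_sumlogs = E - T * S

print(F_logsum, F_sumlogs)              # equal up to rounding
```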
EM algorithm. Main idea • Introduce an independent distribution P over hypotheses • Iterations • E-step: minimize the free energy over P for a fixed model • M-step: minimize the free energy over the model for a fixed P
EM algorithm • E-step: estimate the Posterior for the given Model • M-step: update the Model for the given Posterior
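As a self-contained sketch of these two steps, here is EM for the classic "two biased coins" setting (an illustrative choice, with uniform mixing between the coins): the E-step estimates the posterior over which coin produced each trial, the M-step updates the coin biases.

```python
# EM for a mixture of two biased coins; the hidden variable is which coin was used.
import numpy as np

rng = np.random.default_rng(1)
true_p = np.array([0.3, 0.8])
which = rng.integers(0, 2, size=200)            # hidden coin choice per trial
heads = rng.binomial(20, true_p[which])         # 20 flips per trial, count of heads

p = np.array([0.4, 0.6])                        # initial guess of the biases
for _ in range(50):
    # E-step: posterior responsibility of each coin for each trial
    loglik = heads[:, None] * np.log(p) + (20 - heads[:, None]) * np.log(1 - p)
    r = np.exp(loglik - loglik.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate biases from responsibility-weighted head counts
    p = (r * heads[:, None]).sum(axis=0) / (r * 20).sum(axis=0)

print(p)    # approaches the true biases (up to label switching)
```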
Bayesian regularization: examples • Hypothesis testing (observations y, hypothesis h) • Function approximation (y ≈ h(x)) • Data clustering (density P(x | H))
Hypothesis testing • Problem • Noisy observations: y • Is the theoretical value h0 true? • Model H: Gaussian noise, Gaussian prior
Optimal model: Phase transition • Confidence • finite • infinite
Threshold effect (posterior P(h) sketches) • Student coefficient • Below the threshold: hypothesis h0 is true • Above the threshold: corrections to h0
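A hedged numeric illustration of the Gaussian-noise, Gaussian-prior test above (all parameter values are assumptions): compare the evidence of "the mean is exactly h0" against "h0 plus a correction drawn from a Gaussian prior". The correction model wins only once the data deviate from h0 by more than a threshold set by the noise scale.

```python
# Bayesian test of a point hypothesis h0 against "h0 plus a Gaussian correction".
import numpy as np

def log_evidence_point(y, h0, sigma):
    # Evidence of H0: y_i ~ N(h0, sigma^2) exactly
    return np.sum(-0.5 * ((y - h0) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi)))

def log_evidence_prior(y, h0, sigma, tau, n=2001):
    # Evidence of H1: y_i ~ N(h, sigma^2) with h ~ N(h0, tau^2); numerical integration over h
    u = np.linspace(-6.0, 6.0, n)               # standardized offset (h - h0) / tau
    h = h0 + tau * u
    loglik = np.sum(-0.5 * ((y[:, None] - h) / sigma) ** 2
                    - np.log(sigma * np.sqrt(2 * np.pi)), axis=0)
    logprior = -0.5 * u ** 2 - 0.5 * np.log(2 * np.pi) + np.log(u[1] - u[0])
    a = loglik + logprior
    return a.max() + np.log(np.sum(np.exp(a - a.max())))   # log-sum-exp quadrature

rng = np.random.default_rng(0)
h0, sigma, tau, N = 0.0, 1.0, 1.0, 25
for shift in (0.0, 0.2, 0.8):
    y = rng.normal(shift, sigma, size=N)
    gain = log_evidence_prior(y, h0, sigma, tau) - log_evidence_point(y, h0, sigma)
    print(f"true shift {shift}: log-evidence gain of the correction model = {gain:+.2f}")
```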
Function approximation • Problem • Noisy data: y(x) • Find an approximation h(x) • Model: noise + prior
Optimal model • Free energy minimization
Saddle-point approximation • Free energy evaluated at the best hypothesis
EM learning • E-step. Optimal hypothesis • M-step. Optimal regularization
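The sketch below assumes the standard Bayesian linear (ridge) regression setup with prior precision alpha and noise precision beta, used here as a stand-in for the slide's model: the E-step computes the Gaussian posterior over the weights, the M-step re-estimates the regularization from the posterior moments.

```python
# EM-style learning of the regularization for Bayesian linear regression.
import numpy as np

rng = np.random.default_rng(0)
N, d = 80, 10
X = rng.normal(size=(N, d))
w_true = np.concatenate([np.array([2.0, -1.0]), np.zeros(d - 2)])
y = X @ w_true + 0.5 * rng.normal(size=N)

alpha, beta = 1.0, 1.0                                   # prior and noise precisions
for _ in range(100):
    # E-step: optimal hypothesis = Gaussian posterior over weights (mean m, covariance S)
    S = np.linalg.inv(beta * X.T @ X + alpha * np.eye(d))
    m = beta * S @ X.T @ y
    # M-step: optimal regularization = update alpha, beta from posterior moments
    alpha = d / (m @ m + np.trace(S))
    beta = N / (np.sum((y - X @ m) ** 2) + np.trace(X.T @ X @ S))

print("alpha =", round(alpha, 2), " beta =", round(beta, 2))
print("posterior mean weights:", np.round(m, 2))
```

With a Laplace prior in place of the Gaussian one (next slide), the same MAP objective would instead drive some weights exactly to zero, i.e. prune them.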
Laplace prior • Pruned weights • Equisensitive weights
Clustering • Problem • Noisy data: x • Find prototypes (mixture density approximation) • How many clusters? • Model: noise
Optimal model • Free energy minimization • Iterations • E-step: • M-step:
EM algorithm • E-step: • M-step:
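A minimal sketch of this clustering EM, assuming an isotropic Gaussian mixture with a fixed noise variance; the data, the number of prototypes M, and all parameter values are illustrative.

```python
# EM for an isotropic Gaussian mixture: E-step = responsibilities, M-step = prototypes.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in ((0, 0), (3, 0), (0, 3))])

M, sigma2 = 3, 0.25                            # number of prototypes, fixed noise variance
mu = X[rng.choice(len(X), M, replace=False)]   # initial prototypes
pi = np.full(M, 1.0 / M)                       # mixing weights

for _ in range(50):
    # E-step: posterior responsibility of each prototype for each data point
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    logr = np.log(pi) - d2 / (2 * sigma2)
    r = np.exp(logr - logr.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # M-step: update prototypes and mixing weights
    Nk = r.sum(axis=0)
    mu = (r.T @ X) / Nk[:, None]
    pi = Nk / len(X)

print(np.round(mu, 2))                         # recovered cluster centers
```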
How many clusters? (plot: prototypes h(m) vs. 1/…) • Number of clusters M(·) • Optimal number of clusters