Bayesian regularization of learning Sergey Shumsky NeurOK Software LLC
Scientific methods (diagram) • Induction (F. Bacon): from Data to Models (Machine Learning) • Deduction (R. Descartes): from Models to Data (Math. modeling)
Outline • Learning as ill-posed problem • General problem: data generalization • General remedy: model regularization • Bayesian regularization. Theory • Hypothesis comparison • Model comparison • Free Energy & EM algorithm • Bayesian regularization. Practice • Hypothesis testing • Function approximation • Data clustering
Problem statement • Learning is an inverse, ill-posed problem: Data → Model • Learning paradoxes • Infinite predictions from finite data? • How to optimize future predictions? • How to separate the regular from the accidental in data? • Regularization of learning • Optimal model complexity
Well-posed problem • Solution is unique • Solution is stable • Hadamard (1900s) • Tikhonov (1960s)
Learning from examples • Problem: • Find hypothesis h generating observed data D in model H • Well defined if not sensitive to: • noise in data (Hadamard) • learning procedure (Tikhonov)
Learning is an ill-posed problem • Example: function approximation • Sensitive to noise in data • Sensitive to the learning procedure
Learning is an ill-posed problem • Solution is non-unique
Problem regularization • Main idea: restrict solutions – sacrifice precision for stability • How to choose the restriction?
Statistical learning practice • Data = Learning set + Validation set • Cross-validation • Systematic approach to ensembles: Bayes
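As a concrete illustration of the cross-validation bullet above, here is a minimal sketch assuming a simple ridge-regression learner on synthetic data; the learner, the data, and the grid of regularization strengths are illustrative choices, not from the slides.

```python
# Hypothetical K-fold cross-validation for picking a regularization strength.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + 0.3 * rng.normal(size=60)

def ridge_fit(X, y, alpha):
    """Learning set -> hypothesis (ridge weights) for a given regularization alpha."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

def cv_error(X, y, alpha, k=5):
    """Average validation error over k folds (data = learning set + validation set)."""
    idx = np.arange(len(y))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        w = ridge_fit(X[train], y[train], alpha)
        errs.append(np.mean((y[fold] - X[fold] @ w) ** 2))
    return float(np.mean(errs))

for alpha in (0.01, 0.1, 1.0, 10.0):
    print(f"alpha = {alpha:5}: CV error = {cv_error(X, y, alpha):.3f}")
```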
Statistical learning theory • Learning as inverse probability • Probability theory. H: h → D, Bernoulli (1713) • Learning theory. H: D → h, Bayes (~1750)
Bayesian learning • Bayes' rule: Posterior = Likelihood × Prior / Evidence • P(h | D, H) = P(D | h, H) P(h | H) / P(D | H)
Bayesian regularization • Most probable (MAP) hypothesis: maximize P(D | h, H) P(h | H) • −log Posterior = Learning error (−log Likelihood) + Regularization (−log Prior) + const • Example: function approximation
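Written out, the decomposition the bullets above refer to is the standard one (the Gaussian special case is an assumption added here for the function-approximation example):

$$
-\log P(h \mid D, H) \;=\; \underbrace{-\log P(D \mid h, H)}_{\text{learning error}} \;+\; \underbrace{-\log P(h \mid H)}_{\text{regularization}} \;+\; \log P(D \mid H).
$$

With Gaussian noise of variance $\sigma^2$ and a Gaussian prior of precision $\alpha$ on the weights $w$, the most probable hypothesis minimizes the familiar regularized least-squares objective

$$
E(w) \;=\; \frac{1}{2\sigma^{2}} \sum_{i} \bigl(y_i - h(x_i; w)\bigr)^{2} \;+\; \frac{\alpha}{2}\,\lVert w \rVert^{2}.
$$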
Minimum Description Length, Rissanen (1978) • Most probable hypothesis = shortest code • Code length for: data, −log P(D | h); hypothesis, −log P(h) • Example: optimal prefix code (code tree with codewords such as 0, 10, 110, 111)
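To connect the prefix-code example to description length, here is a small sketch using Shannon code lengths ceil(-log2 p); the source probabilities are made up for illustration and happen to reproduce codeword lengths matching 0, 10, 110, 111.

```python
# Shannon code lengths for a made-up source: the description length of a symbol
# tracks -log2 of its probability, and the lengths satisfy Kraft's inequality.
import math

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = {s: math.ceil(-math.log2(p)) for s, p in probs.items()}

kraft = sum(2.0 ** -l for l in lengths.values())            # <= 1 for a prefix code
avg_len = sum(p * lengths[s] for s, p in probs.items())     # expected code length
entropy = -sum(p * math.log2(p) for p in probs.values())    # lower bound

print(lengths)                   # {'a': 1, 'b': 2, 'c': 3, 'd': 3} -> e.g. 0, 10, 110, 111
print(kraft, avg_len, entropy)   # 1.0, 1.75 bits, 1.75 bits
```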
Data complexity, Kolmogorov (1965) • Complexity: K(D | H) = min_h L(h, D | H) • Code length: L(h, D) = coded data L(D | h) + decoding program L(h) • Decoding: program h → data D
Complex = unpredictable, Solomonoff (1978) • Prediction error ~ L(h, D) / L(D) • Random data is incompressible • Compression = predictability • Example: block coding (program h of length L(h, D) decodes data D)
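A quick numeric check of "random data is incompressible, compression = predictability", using a general-purpose compressor as a stand-in for the block coding mentioned above (the bias values and sequence length are illustrative):

```python
# Biased (predictable) bit sequences compress well; fair-coin sequences do not.
import zlib
import numpy as np

rng = np.random.default_rng(0)
for p in (0.5, 0.9, 0.99):
    bits = rng.random(100_000) < p            # Bernoulli(p) sequence
    raw = np.packbits(bits).tobytes()         # pack 8 bits per byte
    ratio = len(zlib.compress(raw, 9)) / len(raw)
    print(f"p = {p}: compressed/original = {ratio:.2f}")
```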
Universal prior, Solomonoff (1960), Bayes (~1750) • All 2^L programs of length L are equiprobable: P(h | H) ~ 2^(−L(h)) • Data complexity: −log P(D | H) ≈ K(D | H)
Statistical ensemble • Shorter description length • Proof: −log Σ_h P(D | h) P(h) ≤ −log max_h P(D | h) P(h) • Corollary: ensemble predictions are superior to the most probable prediction
Model comparison • Posterior over models: P(H | D) ∝ P(D | H) P(H) • Evidence: P(D | H) = Σ_h P(D | h, H) P(h | H)
Statistics: Bayes vs. Fisher • Fisher: max Likelihood • Bayes: max Evidence
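A toy numeric contrast between the two criteria (an illustrative example, not from the slides): compare a fair coin H0 against a coin with unknown bias and a uniform prior H1, given k heads out of N flips.

```python
# Maximum likelihood vs. evidence for H0 (fair coin) against H1 (unknown bias).
import math

def log_evidence_h1(k, N):
    # Evidence: integral of p^k (1 - p)^(N - k) dp over a uniform prior = Beta(k+1, N-k+1)
    return math.lgamma(k + 1) + math.lgamma(N - k + 1) - math.lgamma(N + 2)

def log_max_likelihood_h1(k, N):
    # Likelihood maximized at p = k / N
    p = k / N
    return k * math.log(p) + (N - k) * math.log(1 - p) if 0 < k < N else 0.0

for k, N in [(52, 100), (70, 100)]:
    log_h0 = N * math.log(0.5)
    print(f"k = {k}/{N}: logL(H0) = {log_h0:.1f}, "
          f"max logL(H1) = {log_max_likelihood_h1(k, N):.1f}, "
          f"log evidence(H1) = {log_evidence_h1(k, N):.1f}")
```

At 52/100 the maximized likelihood of H1 is (trivially) slightly higher than that of H0, yet the evidence still prefers the fair coin; at 70/100 both criteria favor the biased coin. That is the complexity penalty built into max Evidence.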
Historical outlook • 1920s to 1960s • Parametric statistics: Fisher (1912) • Asymptotic: N → ∞ • 1960s to 1980s • Non-parametric statistics: Chentsov (1962) • Regularization of ill-posed problems: Tikhonov (1963) • Non-asymptotic learning: Vapnik (1968) • Algorithmic complexity: Kolmogorov (1965) • Statistical physics of disordered systems: Gardner (1988)
Statistical physics • Probability of hypothesis - microstate • Optimal model - macrostate
Free energy • F = −log Z: log of a sum • F = E − TS: sum of logs • P = P{L}: Gibbs distribution over hypotheses
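A small numeric check of the two faces of the free energy named above, with T = 1 and a handful of arbitrary description lengths L(h, D) standing in for real hypotheses:

```python
# F = -log Z ("log of a sum") equals E - T*S ("sum of logs") at the Gibbs distribution P{L}.
import numpy as np

L = np.array([2.0, 2.5, 4.0, 7.0])      # description lengths of a few hypotheses
T = 1.0

Z = np.sum(np.exp(-L / T))              # partition function
F_logsum = -T * np.log(Z)

P = np.exp(-L / T) / Z                  # Gibbs distribution over hypotheses
E = np.sum(P * L)                       # mean description length ("energy")
S = -np.sum(P * np.log(P))              # entropy of P
F_sumlogs = E - T * S

print(F_logsum, F_sumlogs)              # equal up to rounding
```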
EM algorithm. Main idea • Introduce an independent distribution P over hypotheses • Iterations • E-step: minimize the free energy over P for a fixed model • M-step: minimize the free energy over the model for a fixed P
EM algorithm • E-step: estimate the Posterior for the given Model • M-step: update the Model for the given Posterior
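As a self-contained sketch of these two steps, here is EM for the classic "two biased coins" setting (an illustrative choice, with uniform mixing between the coins): the E-step estimates the posterior over which coin produced each trial, the M-step updates the coin biases.

```python
# EM for a mixture of two biased coins; the hidden variable is which coin was used.
import numpy as np

rng = np.random.default_rng(1)
true_p = np.array([0.3, 0.8])
which = rng.integers(0, 2, size=200)            # hidden coin choice per trial
heads = rng.binomial(20, true_p[which])         # 20 flips per trial, count of heads

p = np.array([0.4, 0.6])                        # initial guess of the biases
for _ in range(50):
    # E-step: posterior responsibility of each coin for each trial
    loglik = heads[:, None] * np.log(p) + (20 - heads[:, None]) * np.log(1 - p)
    r = np.exp(loglik - loglik.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate biases from responsibility-weighted head counts
    p = (r * heads[:, None]).sum(axis=0) / (r * 20).sum(axis=0)

print(p)    # approaches the true biases (up to label switching)
```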
Bayesian regularization: examples • Hypothesis testing (observations y, hypothesis h) • Function approximation (y ≈ h(x)) • Data clustering (density P(x | H))
Hypothesis testing • Problem • Noisy observations: y • Is the theoretical value h0 true? • Model H: Gaussian noise, Gaussian prior
Optimal model: Phase transition • Confidence • finite • infinite
Threshold effect (posterior P(h) sketches) • Student coefficient • Below the threshold: hypothesis h0 is true • Above the threshold: corrections to h0
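A hedged numeric illustration of the Gaussian-noise, Gaussian-prior test above (all parameter values are assumptions): compare the evidence of "the mean is exactly h0" against "h0 plus a correction drawn from a Gaussian prior". The correction model wins only once the data deviate from h0 by more than a threshold set by the noise scale.

```python
# Bayesian test of a point hypothesis h0 against "h0 plus a Gaussian correction".
import numpy as np

def log_evidence_point(y, h0, sigma):
    # Evidence of H0: y_i ~ N(h0, sigma^2) exactly
    return np.sum(-0.5 * ((y - h0) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi)))

def log_evidence_prior(y, h0, sigma, tau, n=2001):
    # Evidence of H1: y_i ~ N(h, sigma^2) with h ~ N(h0, tau^2); numerical integration over h
    u = np.linspace(-6.0, 6.0, n)               # standardized offset (h - h0) / tau
    h = h0 + tau * u
    loglik = np.sum(-0.5 * ((y[:, None] - h) / sigma) ** 2
                    - np.log(sigma * np.sqrt(2 * np.pi)), axis=0)
    logprior = -0.5 * u ** 2 - 0.5 * np.log(2 * np.pi) + np.log(u[1] - u[0])
    a = loglik + logprior
    return a.max() + np.log(np.sum(np.exp(a - a.max())))   # log-sum-exp quadrature

rng = np.random.default_rng(0)
h0, sigma, tau, N = 0.0, 1.0, 1.0, 25
for shift in (0.0, 0.2, 0.8):
    y = rng.normal(shift, sigma, size=N)
    gain = log_evidence_prior(y, h0, sigma, tau) - log_evidence_point(y, h0, sigma)
    print(f"true shift {shift}: log-evidence gain of the correction model = {gain:+.2f}")
```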
Function approximation • Problem • Noisy data: y(x) • Find an approximation h(x) • Model: noise + prior
Optimal model • Free energy minimization
Saddle-point approximation • Free energy evaluated at the best hypothesis
EM learning • E-step. Optimal hypothesis • M-step. Optimal regularization
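The sketch below assumes the standard Bayesian linear (ridge) regression setup with prior precision alpha and noise precision beta, used here as a stand-in for the slide's model: the E-step computes the Gaussian posterior over the weights, the M-step re-estimates the regularization from the posterior moments.

```python
# EM-style learning of the regularization for Bayesian linear regression.
import numpy as np

rng = np.random.default_rng(0)
N, d = 80, 10
X = rng.normal(size=(N, d))
w_true = np.concatenate([np.array([2.0, -1.0]), np.zeros(d - 2)])
y = X @ w_true + 0.5 * rng.normal(size=N)

alpha, beta = 1.0, 1.0                                   # prior and noise precisions
for _ in range(100):
    # E-step: optimal hypothesis = Gaussian posterior over weights (mean m, covariance S)
    S = np.linalg.inv(beta * X.T @ X + alpha * np.eye(d))
    m = beta * S @ X.T @ y
    # M-step: optimal regularization = update alpha, beta from posterior moments
    alpha = d / (m @ m + np.trace(S))
    beta = N / (np.sum((y - X @ m) ** 2) + np.trace(X.T @ X @ S))

print("alpha =", round(alpha, 2), " beta =", round(beta, 2))
print("posterior mean weights:", np.round(m, 2))
```

With a Laplace prior in place of the Gaussian one (next slide), the same MAP objective would instead drive some weights exactly to zero, i.e. prune them.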
Laplace prior • Pruned weights • Equisensitive weights
Clustering • Problem • Noisy data: x • Find prototypes (mixture density approximation) • How many clusters? • Model: noise
Optimal model • Free energy minimization • Iterations • E-step: • M-step:
EM algorithm • E-step: • M-step:
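A minimal sketch of this clustering EM, assuming an isotropic Gaussian mixture with a fixed noise variance; the data, the number of prototypes M, and all parameter values are illustrative.

```python
# EM for an isotropic Gaussian mixture: E-step = responsibilities, M-step = prototypes.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in ((0, 0), (3, 0), (0, 3))])

M, sigma2 = 3, 0.25                            # number of prototypes, fixed noise variance
mu = X[rng.choice(len(X), M, replace=False)]   # initial prototypes
pi = np.full(M, 1.0 / M)                       # mixing weights

for _ in range(50):
    # E-step: posterior responsibility of each prototype for each data point
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    logr = np.log(pi) - d2 / (2 * sigma2)
    r = np.exp(logr - logr.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # M-step: update prototypes and mixing weights
    Nk = r.sum(axis=0)
    mu = (r.T @ X) / Nk[:, None]
    pi = Nk / len(X)

print(np.round(mu, 2))                         # recovered cluster centers
```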
How many clusters? (plot: prototypes h(m) vs. 1/…) • Number of clusters M(·) • Optimal number of clusters