Bayesian Learning
Bayesian Reasoning • Basic assumption • The quantities of interest are governed by probability distributions • These probabilities + observed data ==> reasoning ==> optimal decision • Significance • Foundation for algorithms that manipulate probabilities directly • e.g. naïve Bayes classifier • Framework for analyzing algorithms that do not manipulate probabilities explicitly • e.g. cross entropy, inductive bias of decision trees, MDL principle
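Features & Limitations • Features of Bayesian learning • Each observed example incrementally increases or decreases the estimated probability of a hypothesis • Prior knowledge : P(h) , P(D|h) • Applicable to probabilistic prediction • Prediction by combining multiple hypotheses • Limitations • requires initial knowledge of many probabilities • significant computational cost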
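Bayes Theorem • Terms • P(h) : prior probability of h • P(D) : prior probability that D will be observed • P(D|h) : probability of observing D given that h holds (the likelihood; supplied as prior knowledge) • P(h|D) : posterior probability of h, given D • Theorem • P(h|D) = P(D|h) P(h) / P(D) • Machine learning : the process of finding the most probable hypothesis given the observed data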
Example • Medical diagnosis • P(cancer) = 0.008 , P(~cancer) = 0.992 • P(+|cancer) = 0.98 , P(-|cancer) = 0.02 • P(+|~cancer) = 0.03 , P(-|~cancer) = 0.97 • P(+|cancer)P(cancer) = 0.98 × 0.008 = 0.0078 • P(+|~cancer)P(~cancer) = 0.03 × 0.992 = 0.0298 • hMAP = ~cancer • Normalizing by P(+) = 0.0078 + 0.0298 : P(cancer|+) = 0.21 , P(~cancer|+) = 0.79
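The arithmetic of the diagnosis example can be checked with a few lines of Python. This is only a sketch: the probabilities come from the slide, while the variable names are illustrative.

```python
# Probabilities taken from the medical-diagnosis slide.
p_cancer = 0.008
p_pos_given_cancer = 0.98
p_pos_given_healthy = 0.03

# Unnormalized posteriors P(+|h) * P(h) for the two hypotheses.
joint_cancer = p_pos_given_cancer * p_cancer
joint_healthy = p_pos_given_healthy * (1 - p_cancer)

# P(+) follows from the law of total probability.
p_pos = joint_cancer + joint_healthy
posterior_cancer = joint_cancer / p_pos
posterior_healthy = joint_healthy / p_pos

# The MAP hypothesis is the one with the larger posterior.
h_map = "cancer" if posterior_cancer > posterior_healthy else "~cancer"
print(round(joint_cancer, 4), round(joint_healthy, 4))
print(round(posterior_cancer, 2), h_map)
```

Even after a positive test, the low prior keeps the posterior probability of cancer near 0.21, which is why hMAP is ~cancer.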
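MAP hypothesis • MAP (maximum a posteriori) hypothesis • hMAP = argmax_{h in H} P(h|D) = argmax_{h in H} P(D|h) P(h) / P(D) = argmax_{h in H} P(D|h) P(h) • P(D) can be dropped because it does not depend on h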
ML hypothesis • Maximum likelihood (ML) hypothesis • basic assumption : every hypothesis is equally probable a priori • hML = argmax_{h in H} P(D|h) • basic formula • P(A ∧ B) = P(A|B)P(B) = P(B|A)P(A)
Bayes Theorem and Concept Learning • Brute-force MAP learning • for each h in H, calculate P(h|D) • output hMAP, the hypothesis with the highest posterior • Assumptions • noise-free data D • target concept c is contained in hypothesis space H • every hypothesis is equally probable a priori • Result • every consistent hypothesis is a MAP hypothesis • P(h|D) = 1 / |VS_H,D| (if h is consistent with D) , P(h|D) = 0 (otherwise) , where VS_H,D is the set of hypotheses in H consistent with D
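Brute-force MAP learning can be sketched on a toy hypothesis space. The threshold concepts and the two training examples below are hypothetical and chosen only for illustration; the uniform prior and the noise-free 0/1 likelihood follow the assumptions on the slide.

```python
from fractions import Fraction

def brute_force_map(hypotheses, data):
    """Posterior P(h|D) for every h, assuming a uniform prior and noise-free data.

    Assumes at least one hypothesis is consistent with the data
    (the target concept is in H); otherwise P(D) would be zero.
    """
    prior = Fraction(1, len(hypotheses))
    # P(D|h) is 1 when h labels every example correctly, else 0.
    joint = {
        name: prior * int(all(h(x) == label for x, label in data))
        for name, h in hypotheses.items()
    }
    evidence = sum(joint.values())  # P(D)
    return {name: j / evidence for name, j in joint.items()}

# Hypothetical hypothesis space: h_t(x) is true when x >= t, for t = 0..5.
hypotheses = {t: (lambda x, t=t: x >= t) for t in range(6)}
data = [(1, False), (4, True)]

posterior = brute_force_map(hypotheses, data)
print(posterior)
```

The three consistent thresholds (2, 3 and 4) each receive posterior 1/3 and all others receive 0, so every consistent hypothesis is a MAP hypothesis.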
Consistent learner • Definition : an algorithm that outputs a hypothesis committing zero errors on the training examples • Result : • every consistent hypothesis is a MAP hypothesis • every consistent learner therefore outputs a MAP hypothesis • given a uniform prior probability distribution over H • given deterministic, noise-free training data
ML and LSE hypothesis • Least squared error hypothesis • NN, curve fitting, linear regression • continuous-valued target function • Task : find f : d_i = f(x_i) + e_i • Preliminaries : • probability densities, Normal distribution (noise e_i drawn from a zero-mean Normal distribution) • target values are independent given h • Result : hML = argmin_{h in H} Σ_i (d_i - h(x_i))^2 • Limitation : noise is assumed only in the target value, not in the attributes
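The link between maximum likelihood and least squares can be checked numerically. The sketch below fits a line by closed-form least squares and compares Gaussian log-likelihoods; the data points, the unit noise variance and the function names are assumptions made for illustration.

```python
import math

def fit_line(xs, ds):
    """Closed-form least-squares fit of d = w0 + w1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_d = sum(ds) / n
    sxy = sum((x - mean_x) * (d - mean_d) for x, d in zip(xs, ds))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w1 = sxy / sxx
    return mean_d - w1 * mean_x, w1

def sse(w, xs, ds):
    """Sum of squared errors of the line w = (w0, w1)."""
    return sum((d - (w[0] + w[1] * x)) ** 2 for x, d in zip(xs, ds))

def log_likelihood(w, xs, ds, sigma=1.0):
    """ln P(D|h) when each d_i = h(x_i) + e_i with e_i ~ Normal(0, sigma^2)."""
    n = len(xs)
    return -n * math.log(sigma * math.sqrt(2 * math.pi)) - sse(w, xs, ds) / (2 * sigma ** 2)

# Hypothetical training data.
xs = [0.0, 1.0, 2.0, 3.0]
ds = [0.1, 0.9, 2.2, 2.8]

w_lse = fit_line(xs, ds)
w_other = (w_lse[0] + 0.1, w_lse[1])  # any other line has a larger squared error
print(w_lse, log_likelihood(w_lse, xs, ds), log_likelihood(w_other, xs, ds))
```

Because the log-likelihood is a constant minus the squared error scaled by 1/(2 sigma^2), the line that minimizes the squared error also maximizes the likelihood.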
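ML hypothesis for predicting probabilities • Task : find g : g(x) = P(f(x) = 1) • Question : what criterion should we optimize in order to find an ML hypothesis for g? • Result : cross entropy • hML = argmax_{h in H} Σ_i [ d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i)) ] • the negative of this sum is called the cross entropy, by analogy with the entropy function -Σ_i p_i log p_i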
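(BP) Gradient search for the ML hypothesis in a NN • Let G(h,D) = Σ_i [ d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i)) ] (the negative of the cross entropy) • By gradient ascent on G, for a single layer of sigmoid units : w_jk ← w_jk + Δw_jk , Δw_jk = η Σ_i (d_i - h(x_i)) x_ijk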
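A minimal sketch of this gradient search, assuming a single sigmoid unit with a bias input and a small hypothetical training set (the data, learning rate and step count are not from the slides). Each step adds eta times the sum over examples of (d - h(x)) * x_k to weight w_k.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """Output h(x) of a single sigmoid unit with weights w."""
    return sigmoid(sum(wk * xk for wk, xk in zip(w, x)))

def log_likelihood(w, data):
    """G(h, D) = sum_i d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i))."""
    return sum(
        d * math.log(predict(w, x)) + (1 - d) * math.log(1 - predict(w, x))
        for x, d in data
    )

def gradient_ascent(data, eta=0.1, steps=200):
    """Batch gradient ascent on G for one sigmoid unit, starting from zero weights."""
    w = [0.0] * len(data[0][0])
    for _ in range(steps):
        grad = [sum((d - predict(w, x)) * x[k] for x, d in data) for k in range(len(w))]
        w = [wk + eta * g for wk, g in zip(w, grad)]
    return w

# Hypothetical data: x = (bias input 1, one feature), d in {0, 1}.
# The feature value 1 appears with both labels, so the data are not separable.
data = [((1, 0), 0), ((1, 1), 0), ((1, 2), 1), ((1, 1), 1), ((1, 3), 1), ((1, 0), 0)]

w = gradient_ascent(data)
print(w, log_likelihood(w, data))
```

The trained unit outputs an estimate of P(f(x) = 1) rather than a hard label, which is the quantity the ML criterion targets.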
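MDL principle • Purpose : interpret inductive bias and the MDL principle from a Bayesian perspective • Shannon and Weaver's optimal code length : -log2 p_i bits for a message of probability p_i • hMAP = argmin_{h in H} [ -log2 P(D|h) - log2 P(h) ] , i.e. the hypothesis minimizing the description length of h plus the description length of D given h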
Bayes optimal classifier • Motivation : the classification of a new instance is optimized by combining the predictions of all hypotheses • Task : find the most probable classification of the new instance given the training data • Answer : combine the predictions of all hypotheses, weighted by their posterior probabilities • Bayes optimal classification : argmax_{v_j in V} Σ_{h_i in H} P(v_j|h_i) P(h_i|D) • Limitation : significant computational cost ==> Gibbs algorithm
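A small sketch of Bayes optimal classification follows. The three hypotheses and their posteriors (0.4, 0.3, 0.3) are illustrative numbers, not taken from the slides; they are chosen so that the MAP hypothesis and the Bayes optimal classification disagree.

```python
def bayes_optimal(posteriors, predictions, labels):
    """Return the label v maximizing sum_h P(v|h) * P(h|D), plus all scores."""
    scores = {
        v: sum(posteriors[h] * predictions[h].get(v, 0.0) for h in posteriors)
        for v in labels
    }
    return max(scores, key=scores.get), scores

# Illustrative posteriors P(h|D) over a three-hypothesis space.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
# P(v|h) for a new instance x: h1 says "+", h2 and h3 say "-".
predictions = {
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

label, scores = bayes_optimal(posteriors, predictions, ["+", "-"])
h_map = max(posteriors, key=posteriors.get)
print(label, scores, h_map)
```

Here the MAP hypothesis h1 predicts "+", yet the combined vote gives "-" a total weight of 0.6, so the Bayes optimal classification is "-".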
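Gibbs algorithm • Algorithm • 1. Choose h from H at random, according to the posterior probability distribution over H • 2. Use h to predict the classification of x • Usefulness of the Gibbs algorithm • Haussler et al., 1994 • E[Error(Gibbs algorithm)] <= 2 * E[Error(Bayes optimal classifier)] , in expectation over target concepts drawn according to the prior assumed by the learner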
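The two steps of the Gibbs algorithm can be sketched as follows. The posterior values and the constant hypotheses are hypothetical, and `random.Random(0)` is used only to make the run repeatable.

```python
import random

def gibbs_classify(posteriors, hypotheses, x, rng):
    """Draw one hypothesis according to P(h|D) and return its prediction for x."""
    names = list(posteriors)
    weights = [posteriors[name] for name in names]
    chosen = rng.choices(names, weights=weights, k=1)[0]
    return hypotheses[chosen](x)

# Hypothetical posterior and hypotheses: h1 predicts "+", h2 and h3 predict "-".
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
hypotheses = {
    "h1": lambda x: "+",
    "h2": lambda x: "-",
    "h3": lambda x: "-",
}

rng = random.Random(0)
votes = [gibbs_classify(posteriors, hypotheses, None, rng) for _ in range(10000)]
print(votes.count("+") / len(votes))
```

Each call costs one sample and one prediction instead of a sum over all of H; over many calls, the fraction of "+" answers approaches the posterior mass of the hypotheses that predict "+".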
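Naïve Bayes classifier • Naïve Bayes classifier : vNB = argmax_{v_j in V} P(v_j) Π_i P(a_i|v_j) • Difference from other learning methods • no explicit search through H • probabilities are estimated by counting frequencies in the training examples • m-estimate of probability = (n_c + m p) / (n + m) • n : number of training examples with class v_j , n_c : number of those that also have attribute value a_i • m : equivalent sample size , p : prior estimate of the probability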
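The m-estimate is commonly written as (n_c + m p) / (n + m), where n is the number of training examples of the class and n_c the number of those with the attribute value. The helper below is a sketch of that formula; the counts in the usage lines are made up.

```python
from fractions import Fraction

def m_estimate(n_c, n, p, m):
    """m-estimate of probability: (n_c + m * p) / (n + m)."""
    return (n_c + m * p) / (n + m)

# Made-up counts: the attribute value never occurs among 5 examples of the class.
raw = Fraction(0, 5)
smoothed = m_estimate(n_c=0, n=5, p=Fraction(1, 3), m=3)
print(raw, smoothed)
```

With m = 0 the estimate reduces to the raw frequency n_c / n; a larger m pulls it toward the prior p, which avoids zero probabilities wiping out the whole product.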
Example • (outlook=sunny, temperature=cool, humidity=high, wind=strong) • P(wind=strong|PlayTennis=yes) = 3/9 = .33 • P(wind=strong|PlayTennis=no) = 3/5 = .60 • P(yes)P(sunny|yes)P(cool|yes)P(high|yes)P(strong|yes) = .0053 • P(no)P(sunny|no)P(cool|no)P(high|no)P(strong|no) = .0206 • vNB = no
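The two products can be reproduced in Python. Only the wind probabilities and the final products appear on the slide; the class priors and the remaining conditional probabilities below are assumed from the standard 14-example PlayTennis table and should be treated as such.

```python
from fractions import Fraction as F
from math import prod

# Assumed from the standard PlayTennis table (9 "yes" and 5 "no" examples).
prior = {"yes": F(9, 14), "no": F(5, 14)}
likelihood = {
    "yes": {"outlook=sunny": F(2, 9), "temperature=cool": F(3, 9),
            "humidity=high": F(3, 9), "wind=strong": F(3, 9)},
    "no": {"outlook=sunny": F(3, 5), "temperature=cool": F(1, 5),
           "humidity=high": F(4, 5), "wind=strong": F(3, 5)},
}

def naive_bayes(instance):
    """Return argmax_v P(v) * prod_i P(a_i|v) and the per-class scores."""
    scores = {v: prior[v] * prod(likelihood[v][a] for a in instance) for v in prior}
    return max(scores, key=scores.get), scores

instance = ["outlook=sunny", "temperature=cool", "humidity=high", "wind=strong"]
v_nb, scores = naive_bayes(instance)
print(v_nb, {v: round(float(s), 4) for v, s in scores.items()})
```

Normalizing the two scores gives a conditional probability of about 0.795 for PlayTennis=no.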
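Bayesian Belief Networks • Definition • describe the joint probability distribution for a set of variables • do not require all variables to be conditionally independent • express partial dependencies among the variables as probabilities • Representation : a directed acyclic graph with a conditional probability table for each node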
Inference • Task : infer the probability distribution of the target variables, given the observed values of other variables • Methods • exact inference : NP-hard • approximate inference • also NP-hard in theory • useful in practice • e.g. Monte Carlo methods
Learning • Settings • structure known + fully observable data • easy : estimate the conditional probability table entries as in the naïve Bayes classifier • structure known + partially observable data • gradient ascent procedure (Russell et al., 1995) • similar to finding the ML hypothesis : maximizes P(D|h) • structure unknown
Learning (2) • Structure unknown • Bayesian scoring metric (Cooper & Herskovits, 1992) • K2 algorithm • Cooper & Herskovits, 1992 • heuristic greedy search • for fully observed data • Constraint-based approach • Spirtes et al., 1993 • infer dependence and independence relationships from the data • construct the structure from these relationships
EM algorithm • EM : expectation-maximization (an estimation step followed by a maximization step) • Setting • learning in the presence of unobserved variables • the form of the probability distribution is known • Applications • training Bayesian belief networks • training radial basis function networks • basis for many unsupervised clustering algorithms • basis for the Baum-Welch forward-backward algorithm
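K-means algorithm • Setting : data generated at random from a mixture of k normal distributions • Task : find the mean value of each distribution • Instance : <x_i, z_i1, z_i2> (for k = 2), where z_ij = 1 if x_i was generated by the j-th distribution and 0 otherwise • if z is known : use the sample mean of the instances from each distribution, μ_j = Σ_i z_ij x_i / Σ_i z_ij • else : use the EM algorithm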
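K-means algorithm (continued) • Initialize h = <μ_1, μ_2> arbitrarily (k = 2, shared known variance σ^2) • Estimation : calculate E[z_ij] = exp(-(x_i - μ_j)^2 / 2σ^2) / Σ_n exp(-(x_i - μ_n)^2 / 2σ^2) • Maximization : calculate a new ML hypothesis, μ_j ← Σ_i E[z_ij] x_i / Σ_i E[z_ij] ==> converges to a local ML hypothesis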
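The loop of estimating E[z] and re-estimating the means can be sketched as follows, assuming two normal distributions with a shared, known standard deviation and equal mixing weights. The synthetic data, the seed and the initialization at the minimum and maximum data values are illustrative choices.

```python
import math
import random

def em_two_means(xs, mu, sigma=1.0, iters=50):
    """EM for the means of a Gaussian mixture with known, shared sigma."""
    mu = list(mu)
    for _ in range(iters):
        # Estimation step: E[z_ij], the expected membership of x_i in distribution j.
        resp = []
        for x in xs:
            w = [math.exp(-((x - m) ** 2) / (2 * sigma ** 2)) for m in mu]
            total = sum(w)
            resp.append([wj / total for wj in w])
        # Maximization step: each mean becomes the E[z_ij]-weighted sample mean.
        mu = [
            sum(r[j] * x for r, x in zip(resp, xs)) / sum(r[j] for r in resp)
            for j in range(len(mu))
        ]
    return mu

# Synthetic data: 200 points from Normal(0, 1) and 200 from Normal(5, 1).
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(5.0, 1.0) for _ in range(200)]

means = em_two_means(xs, mu=(min(xs), max(xs)))
print(sorted(means))
```

Different initial means can lead to different local maxima, so the starting point matters in practice.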
General statement of the EM algorithm • Terms • θ : parameters of the underlying probability distribution • X : observed data • Z : unobserved data • Y = X ∪ Z : full data • h : current hypothesis of θ • h' : revised hypothesis • Task : estimate θ from X
Guideline • Search for the h' that maximizes E[ln P(Y|h')] • Assuming θ = h, calculate the function Q(h'|h) = E[ln P(Y|h') | h, X]
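EM algorithm • Estimation step : Q(h'|h) ← E[ln P(Y|h') | h, X] • Maximization step : h ← argmax_{h'} Q(h'|h) • Repeating the two steps converges to a local maximum of the likelihood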