Weak Convergence of Random Free Energy in Information Theory
Sumio Watanabe
Tokyo Institute of Technology
Contents
Identification Problem ≡ Mathematical Physics with Random Hamiltonian
1. Background
2. Main Theorem
3. Outline of Proof
4. Applications and Future Study
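Background (1)
Example: Classical Spin System
[Figure: an unknown spin system with hidden and visible units s_i, s_j and couplings w_ij generates samples x; a learner with the same hidden/visible structure learns from them.]
$$p(x|w) = \frac{1}{Z(w)} \sum_{\text{Hidden}} \exp\Big( -\sum_{(i,j)} w_{ij}\, s_i s_j \Big)$$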
Background (2)
Identification Problem
Classical unknown information source q(x) → Observation $X_1, X_2, \ldots, X_n$ → Learning system p(x|w), φ(w) → Estimated distribution $p(x \mid X_1, X_2, \ldots, X_n)$
Relative entropy between the source and the estimate:
$$D(q\|p) \equiv \int q(x)\,[\log q(x) - \log p(x \mid X_1, \ldots, X_n)]\,dx = \ ?$$
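Background (3)
Random Free Energy and Relative Entropy
Definition. Random free energy = (negative) log-likelihood of the system:
$$F(X_1, X_2, \ldots, X_n) \equiv -\log \int p(X_1|w)\, p(X_2|w) \cdots p(X_n|w)\, \varphi(dw) + \sum_{i=1}^{n} \log q(X_i)$$
Relation between F and D(q||p), where the expectation is over the new sample $X_{n+1} \sim q$:
$$D\big( q(X_{n+1}) \,\|\, p(X_{n+1} \mid X_1, \ldots, X_n) \big) = E_{X_{n+1}}\big[ F(X_1, \ldots, X_{n+1}) \big] - F(X_1, \ldots, X_n)$$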
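The relation between F and D(q||p) can be checked numerically on a toy problem. The sketch below is not from the slides: it assumes a Bernoulli model with a discrete prior on a small grid (so the integral over w becomes a finite sum), and all names (`W_GRID`, `PHI`, `Q1`, `free_energy`, ...) are hypothetical. It compares the relative entropy of the Bayes predictive distribution with the increment of F averaged over the next sample.

```python
import math

# Hypothetical toy setup (an assumption, not from the slides):
# model p(x|w) = w^x (1-w)^(1-x), discrete prior phi on a grid, source q = Bernoulli(0.3).
W_GRID = [0.1, 0.3, 0.5, 0.7, 0.9]
PHI = [0.2] * 5          # uniform prior weights on the grid
Q1 = 0.3                 # q(x = 1)

def p(x, w):
    return w if x == 1 else 1.0 - w

def q(x):
    return Q1 if x == 1 else 1.0 - Q1

def marginal(xs):
    """Integral of prod_i p(x_i|w) phi(dw); here a finite sum over the grid."""
    return sum(ph * math.prod(p(x, w) for x in xs) for w, ph in zip(W_GRID, PHI))

def free_energy(xs):
    """F(X_1..X_n) = -log marginal(xs) + sum_i log q(X_i)."""
    return -math.log(marginal(xs)) + sum(math.log(q(x)) for x in xs)

def predictive(x_new, xs):
    """Bayes predictive p(x_new | xs) = marginal(xs + [x_new]) / marginal(xs)."""
    return marginal(xs + [x_new]) / marginal(xs)

def relative_entropy(xs):
    """D(q || p(.|xs)), summed over the two outcomes."""
    return sum(q(x) * (math.log(q(x)) - math.log(predictive(x, xs))) for x in (0, 1))

def expected_increment(xs):
    """Average over X_{n+1} ~ q of F(xs + [X_{n+1}]) - F(xs)."""
    return sum(q(x) * (free_energy(xs + [x]) - free_energy(xs)) for x in (0, 1))

sample = [1, 0, 0, 1, 0]
print(relative_entropy(sample), expected_increment(sample))
```

The two printed numbers should agree up to rounding, because F(X_1..X_{n+1}) - F(X_1..X_n) = log q(X_{n+1}) - log p(X_{n+1} | X_1..X_n) for every realization of X_{n+1}.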
Background (4)
Identifiability and Singularities
A learning system p(x|w) is called identifiable if
p(x|w1) = p(x|w2) (∀x) ⇒ w1 = w2.
A system that identifies hidden structure is non-identifiable.
Remark. Let W = {w} and define w1 ~ w2 ⇔ "p(x|w1) = p(x|w2) (∀x)". The quotient W/~ is not a manifold, because { w ; p(x|w) = p(x|w0) } is an analytic set with singularities.
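A small example may make non-identifiability concrete. The sketch below is an illustration only: it assumes a one-hidden-unit regression function f(x; a, b) = a·tanh(b·x), which is not taken from the slides, and checks that distinct parameters can define the same function. A model that depends on w only through f inherits the same equivalences.

```python
import math

# Hypothetical one-hidden-unit regression function (an assumption for illustration).
def f(x, a, b):
    return a * math.tanh(b * x)

def same_function(w1, w2, xs):
    """True if parameters w1 = (a, b) and w2 give the same outputs on all test inputs."""
    return all(math.isclose(f(x, *w1), f(x, *w2), abs_tol=1e-12) for x in xs)

xs = [i / 10.0 for i in range(-30, 31)]

# Every parameter with a = 0 or b = 0 realizes the zero function.
zero_set = [(0.0, 0.0), (0.0, 2.5), (-1.3, 0.0)]
print(all(same_function(zero_set[0], w, xs) for w in zero_set))

# Sign symmetry: (a, b) and (-a, -b) are also equivalent.
print(same_function((1.5, 0.7), (-1.5, -0.7), xs))

# Generic parameters are distinguishable.
print(same_function((1.0, 1.0), (1.0, 2.0), xs))
```

In this toy case the parameters equivalent to the zero function form the set {ab = 0}, two lines crossing at the origin, which is an analytic set with a singular point rather than a manifold.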
Main Theorem (1)
Mathematical Definitions
X: a random variable on $\mathbb{R}^N$ with p.d.f. q(x).
$L^2(q) = \{ f ; \int f(x)^2 q(x)\,dx < \infty \}$: a real Hilbert space.
W: a real d-dimensional manifold.
φ(w): a p.d.f. on W of class $C_0^\infty$.
φ(w) dw: a probability distribution on W.
Main Theorem (2)
Mathematical Definitions
H(·, w): an $L^2(q)$-valued real analytic function on W such that
$E_X[e^{-H(X,w)}] = 1$ (∀w) [ ⇒ $K(w) \equiv E_X[H(X,w)] \ge 0$ ] and
$W_0 \equiv \{ w \in \mathrm{supp}\,\varphi ;\ K(w) = 0 \} \neq \emptyset$.
e.g. H(x,w) = log q(x) - log p(x|w).
Given i.i.d. samples $X_1, X_2, \ldots, X_n$, the random free energy is
$$F = -\log \int \exp\Big( -\sum_{i=1}^{n} H(X_i, w) \Big)\, \varphi(w)\, dw$$
Main Theorem (3)
Gel'fand's Zeta Function
Difficulty: {w ; K(w) = 0} is an analytic set with singularities.
The zeta function
$$\zeta(z) = \int K(w)^z\, \varphi(w)\, dw$$
is holomorphic in Re(z) > 0.
Theorem (Atiyah, Sato, Bernstein, Björk, Kashiwara, 1970-1980)
(1) ζ(z) can be analytically continued to a meromorphic function on the entire complex plane.
(2) All poles are real, negative, rational numbers.
Poles: $0 > -\lambda_1 > -\lambda_2 > -\lambda_3 > \cdots$
Orders: $m_1, m_2, m_3, \ldots$
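For intuition, the zeta function can be evaluated directly in a case where it has a closed form. The sketch below is a hedged illustration, not part of the talk: it assumes d = 1, K(w) = w², and a uniform φ on [-1, 1] (which is not of class C0∞, a simplification made only to keep the integral elementary). In that case ζ(z) = 1/(2z + 1), whose only pole is at z = -1/2 with order 1.

```python
def zeta_numeric(z, steps=20000):
    """Midpoint rule for the integral over [-1, 1] of (w^2)^z * (1/2) dw, for real z > 0."""
    h = 2.0 / steps
    total = 0.0
    for i in range(steps):
        w = -1.0 + (i + 0.5) * h
        total += (w * w) ** z * 0.5 * h
    return total

def zeta_closed_form(z):
    """Closed form 1 / (2z + 1) for the assumed K(w) = w^2 and uniform prior."""
    return 1.0 / (2.0 * z + 1.0)

for z in (0.5, 1.0, 2.0):
    print(z, zeta_numeric(z), zeta_closed_form(z))
```

The numerical integral only makes sense for Re(z) > 0, while the closed form continues to the whole plane and exposes the pole, giving λ1 = 1/2 and m1 = 1 for this assumed example.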
Main Theorem (4)
Main Theorem
The following convergence in law holds:
$$F - \lambda_1 \log n + (m_1 - 1) \log\log n \ \to\ F^* \quad (n \to \infty)$$
where F* can be represented as a limit process of an empirical process on W0.
Corollary
If E[D(q||p)] has an asymptotic expansion, then
$$E[D(q\|p)] = \frac{\lambda_1}{n} + o\Big(\frac{1}{n}\Big)$$
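The λ1 log n scaling can be observed in a simple regular case. The sketch below is an assumption-laden illustration, not the setting of the theorem in full generality: it takes q = N(0,1), p(x|w) = N(w,1), and a uniform φ on [-1, 1], so that H(x,w) = -xw + w²/2, K(w) = w²/2, λ1 = 1/2 and m1 = 1. The function names are hypothetical.

```python
import math
import random

def free_energy(xs, steps=4000):
    """F = -log of the integral of exp(-sum_i H(X_i, w)) * phi(w) dw over [-1, 1].

    Assumed toy model: H(x, w) = -x*w + w^2/2 and phi(w) = 1/2 (uniform prior).
    """
    n = len(xs)
    s = sum(xs)
    h = 2.0 / steps
    # Exponent of the integrand at each midpoint: -sum_i H(X_i, w) = w*S - n*w^2/2.
    exps = []
    for i in range(steps):
        w = -1.0 + (i + 0.5) * h
        exps.append(w * s - 0.5 * n * w * w)
    top = max(exps)  # log-sum-exp shift for numerical stability
    integral = sum(math.exp(e - top) for e in exps) * 0.5 * h
    return -(top + math.log(integral))

def normalized_free_energy(xs):
    """F - lambda_1 log n + (m_1 - 1) log log n with lambda_1 = 1/2 and m_1 = 1."""
    return free_energy(xs) - 0.5 * math.log(len(xs))

random.seed(0)
for n in (100, 400, 1600):
    values = [normalized_free_energy([random.gauss(0.0, 1.0) for _ in range(n)])
              for _ in range(100)]
    print(n, sum(values) / len(values))
```

Under these assumptions a Laplace approximation gives F - (1/2) log n ≈ log 2 - (1/2) log 2π - S²/(2n) with S = ΣX_i, so the normalized values should stay of order one as n grows instead of drifting with log n.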
Proof Outline (1)
Hironaka Resolution Theorem
[Figure: the map g: U → W sends U0 onto W0 = {w ; K(w) = 0}; locally $K(g(u)) = a(u)\, u_1^{2s_1} u_2^{2s_2} \cdots u_d^{2s_d}$.]
Proof Outline (2)
Resolution Theorem: H. Hironaka (1964), M. F. Atiyah (1970)
Let K(w) ≥ 0 be a real analytic function defined in a neighborhood of 0 ∈ W ⊂ $\mathbb{R}^d$. Then there exist an open set W, a real analytic manifold U, and a proper analytic map g: U → W such that
(1) g: U - U0 → W - W0 is an isomorphism, where $U_0 = g^{-1}(W_0)$.
(2) For each P ∈ U, there are local coordinates $(u_1, u_2, \ldots, u_d)$ centered at P so that locally near P
$$K(g(u)) = a(u)\, u_1^{2s_1} u_2^{2s_2} \cdots u_d^{2s_d}$$
where a(u) > 0 is an analytic function and each $s_i \ge 0$ is an integer.
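A worked example may help fix the notation. The following is an illustration that is not taken from the slides: it assumes d = 2 and K(x, y) = x² + y², which is not a monomial times a positive function at the origin, and resolves it by a single blow-up of the origin described in two coordinate charts.

```latex
% Chart 1: (x, y) = g(u_1, u_2) = (u_1,\; u_1 u_2)
K(g(u)) = u_1^2 + u_1^2 u_2^2 = \underbrace{(1 + u_2^2)}_{a(u) > 0}\; u_1^{2},
\qquad (s_1, s_2) = (1, 0), \qquad |\det g'(u)| = |u_1| .

% Chart 2: (x, y) = g(v_1, v_2) = (v_1 v_2,\; v_2)
K(g(v)) = v_1^2 v_2^2 + v_2^2 = \underbrace{(1 + v_1^2)}_{a(v) > 0}\; v_2^{2},
\qquad (s_1, s_2) = (0, 1), \qquad |\det g'(v)| = |v_2| .
```

In each chart K(g(·)) has the normal-crossing form of statement (2), and g is one-to-one away from the exceptional set {u_1 = 0} (respectively {v_2 = 0}), which is mapped to W0 = {0}. The Jacobian factor is what later contributes the exponents of the transformed prior.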
Proof Outline (3)
Division of Partition Function
Because supp φ is compact and g is a proper map, we can assume W = ∪_a U_a (a finite union whose overlaps have measure zero), so that
$$\exp(-F) = \sum_a \int_{U_a} \exp\Big[ -\sum_{i=1}^{n} H(X_i, g_a(u_a)) \Big]\, \varphi_a(u_a)\, du_a$$
In each U_a,
$$K(g_a(u_a)) = a(u)\, u_1^{2s_1} u_2^{2s_2} \cdots u_d^{2s_d}, \qquad \varphi_a(u_a) = \sum b_a(u)\, u_1^{k_1} u_2^{k_2} \cdots u_d^{k_d}$$
(Both s_i and k_i depend on a.)
Hereafter the index a is omitted and K(u) ≡ K(g(u)) is used.
Proof Outline (4)
b-function
The zeta function: $\zeta(z) = \int K(w)^z\, \varphi(w)\, dw$
There exist a differential operator $P(w, \partial_w, z)$ and a polynomial b(z) such that
$$P(w, \partial_w, z)\, K(w)^{z+1} = b(z)\, K(w)^z$$
The analytic continuation is carried out using the b-function.
If K(w) is a polynomial, then there exists an algorithm to calculate b(z) (Oaku, 1997).
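The simplest instance shows how the b-function produces the continuation. This worked example is an illustration under assumptions not stated on the slide: d = 1, K(w) = w², and φ smooth with compact support so that boundary terms vanish when integrating by parts.

```latex
% Functional equation with P = \partial_w^2 :
\partial_w^2\, (w^2)^{z+1} = (2z+2)(2z+1)\, (w^2)^{z},
\qquad b(z) = (2z+2)(2z+1) .

% Continuation of the zeta function (integrate by parts twice):
\zeta(z) = \int (w^2)^{z}\, \varphi(w)\, dw
         = \frac{1}{b(z)} \int \partial_w^2 (w^2)^{z+1}\, \varphi(w)\, dw
         = \frac{1}{b(z)} \int (w^2)^{z+1}\, \varphi''(w)\, dw .
```

The last integral is holomorphic for Re(z) > -1, so ζ(z) extends meromorphically to that half-plane, and the only candidate pole there is the root z = -1/2 of b(z). Repeating the step extends the continuation further to the left.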
Proof Outline (5)
Ideals of Local Analytic Functions
Lemma 1. Let u → H(·, u) be a real analytic function in U. There exist an open set U_e ⊂ U and a finite set of analytic functions { g_j(u), h_j(·, u) ; j = 1, 2, …, J } in U_e such that
(1) T(u) ≥ I (∀u ∈ U_e), where $T_{jk}(u) \equiv \int h_j(x,u)\, h_k(x,u)\, q(x)\, dx$,
(2) $H(x,u) = \sum_{j=1}^{J} g_j(u)\, h_j(x,u)$.
Proof Outline (6)
Decomposition of Hamiltonian
$$r(x,u) \equiv \frac{H(x,u) - K(u)}{K(u)^{1/2}}$$
By Lemma 1 and $K(u) = \int \{ H(x,u) + e^{-H(x,u)} - 1 \}\, q(x)\, dx$, r(x,u) is well defined even if K(u) = 0.
$$\sigma_n(u) \equiv \frac{1}{\sqrt{n}} \sum_{i=1}^{n} r(X_i, u)$$
Random Hamiltonian:
$$\sum_{i=1}^{n} H(X_i, u) = nK(u) + (nK(u))^{1/2}\, \sigma_n(u)$$
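The decomposition can be verified numerically in a case where r(x,u) is explicit. The sketch below is an illustration under assumptions of its own: q = N(0,1), p(x|w) = N(w,1), hence H(x,w) = -xw + w²/2 and K(w) = w²/2, and K^{1/2} is read as the signed root w/√2 (in resolved coordinates K is a monomial square, so a signed root is available). With that reading r(x,w) = -√2·x, which stays finite at w = 0.

```python
import math
import random

SQRT2 = math.sqrt(2.0)

def H(x, w):
    """Assumed toy Hamiltonian log q(x) - log p(x|w) for q = N(0,1), p = N(w,1)."""
    return -x * w + 0.5 * w * w

def K(w):
    """K(w) = E_X[H(X, w)] = w^2 / 2."""
    return 0.5 * w * w

def sqrt_K(w):
    """Signed square root of K (an assumption of this sketch): w / sqrt(2)."""
    return w / SQRT2

def r(x, w):
    """(H - K) / sqrt_K simplified to -sqrt(2) * x, finite even at w = 0."""
    return -SQRT2 * x

def sigma_n(xs, w):
    """Empirical process (1 / sqrt(n)) * sum_i r(X_i, w)."""
    return sum(r(x, w) for x in xs) / math.sqrt(len(xs))

def hamiltonian_sum(xs, w):
    return sum(H(x, w) for x in xs)

def decomposition(xs, w):
    """n K(w) + (n K(w))^{1/2} sigma_n(w), using the signed root."""
    n = len(xs)
    return n * K(w) + math.sqrt(n) * sqrt_K(w) * sigma_n(xs, w)

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(50)]
for w in (-0.8, 0.0, 0.3):
    print(w, hamiltonian_sum(xs, w), decomposition(xs, w))
```

Both columns should match up to rounding for every w, including w = 0 where K vanishes.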
Proof Outline (7)
Donsker's Empirical Process
Empirical process:
$$\sigma_n(u) \equiv \frac{1}{\sqrt{n}} \sum_{i=1}^{n} r(X_i, u)$$
σ_n(·) → σ(·), a tight Gaussian process.
Central limit theorem in Banach space: for every bounded continuous functional f on $L^\infty(\mathrm{supp}\,\varphi)$,
$$E_{X_1, X_2, \ldots, X_n}[ f(\sigma_n) ] \to E_\sigma[ f(\sigma) ]$$
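Proof Outline (8)
Poles of Zeta Function
$$\zeta(z) = \sum \int K(u)^z\, \varphi(u)\, du$$
$$K(u) = a(u)\, u_1^{2s_1} u_2^{2s_2} \cdots u_d^{2s_d}, \qquad \varphi(u) = \sum b(u)\, u_1^{k_1} u_2^{k_2} \cdots u_d^{k_d}$$
$$\lambda = \min_j \frac{k_j + 1}{2 s_j}, \qquad m = \#\Big\{ j ;\ \lambda = \frac{k_j + 1}{2 s_j} \Big\}$$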
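The formulas for λ and m are easy to evaluate once the exponents are known. The sketch below is a minimal helper with hypothetical names; it treats a single monomial term per chart, skips coordinates with s_j = 0 (they produce no pole), and combines several charts by taking the smallest λ and the largest m among the charts attaining it, which is an assumption of this sketch rather than a statement from the slide.

```python
from fractions import Fraction

def pole_and_order(s, k):
    """Return (lambda, m) for one chart with K = a(u) * prod u_j^(2 s_j), phi ~ prod u_j^(k_j).

    lambda = min_j (k_j + 1) / (2 s_j) over coordinates with s_j > 0,
    m = number of coordinates attaining the minimum.
    """
    ratios = [Fraction(kj + 1, 2 * sj) for sj, kj in zip(s, k) if sj > 0]
    lam = min(ratios)
    return lam, ratios.count(lam)

def largest_pole(charts):
    """Combine charts: smallest lambda overall, largest m among charts attaining it."""
    results = [pole_and_order(s, k) for s, k in charts]
    lam = min(l for l, _ in results)
    m = max(mm for l, mm in results if l == lam)
    return lam, m

# Example exponents (assumed for illustration): K = u1^2 * u2^2 with a constant prior.
print(pole_and_order((1, 1), (0, 0)))
# Example: two charts with K = a(u) * u1^2 (resp. u2^2) and a prior factor u1 (resp. u2).
print(largest_pole([((1, 0), (1, 0)), ((0, 1), (0, 1))]))
```

Exact rational arithmetic is used so that ties between coordinates, which determine m, are detected reliably.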
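Proof Outline (9)
Zeta Function and State Density
Split the local coordinates as (u, v), where u = (u_j), j ∈ J, collects the coordinates that attain the minimum, and let
$$L(u) \equiv \prod_{j \in J} u_j^{2 s_j}$$
Partial zeta function: $\iint L(u)^z\, \varphi(u,v)\, du\, dv$ has the pole -λ with order m.
Inverse Mellin transform ⇒ state density:
$$\iint \delta(t - L(u))\, \varphi(u,v)\, du\, dv = t^{\lambda - 1} (-\log t)^{m-1} \int \varphi(0,v)\, dv \quad (t \to 0)$$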
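The shape of the state density can be checked in a case that is solvable by hand. The sketch below is an illustration with assumptions of its own: L(a, b) = a²b² on the unit square with φ ≡ 1 and no extra v coordinates, so λ = 1/2 and m = 2. Integrating out b reduces the delta-function integral to a one-dimensional integral over a from √t to 1 of 1/(2a√t), whose exact value is (1/4)·t^{-1/2}·(-log t).

```python
import math

def state_density_numeric(t, steps=100000):
    """Midpoint rule for the integral over a in [sqrt(t), 1] of 1 / (2 a sqrt(t)).

    This equals the double integral of delta(t - a^2 b^2) over [0,1]^2
    after integrating out b (assumed toy case, phi = 1).
    """
    lo = math.sqrt(t)
    h = (1.0 - lo) / steps
    return sum(h / (2.0 * (lo + (i + 0.5) * h) * lo) for i in range(steps))

def state_density_formula(t, lam=0.5, m=2, const=0.25):
    """const * t^(lam - 1) * (-log t)^(m - 1); const = 1/4 is specific to this toy case."""
    return const * t ** (lam - 1.0) * (-math.log(t)) ** (m - 1)

for t in (1e-2, 1e-3, 1e-4):
    print(t, state_density_numeric(t), state_density_formula(t))
```

The agreement illustrates the t^{λ-1}(-log t)^{m-1} behavior; in this example the match holds up to the constant factor 1/4, which the asymptotic statement does not track.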
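Proof Outline (10)
Partition Function and Empirical Process
Partition function ← State density ← Zeta function
$$Z = \iint \exp\big( -nK(u,v) - (nK(u,v))^{1/2}\, \sigma_n(u,v) \big)\, \varphi(u,v)\, du\, dv$$
$$\to \iint \Big(\frac{t}{n}\Big)^{\lambda - 1} \Big( -\log\frac{t}{n} \Big)^{m-1} \exp\big[ -tK(0,v) - (tK(0,v))^{1/2}\, \sigma(0,v) \big]\, \varphi(0,v)\, \frac{dt}{n}\, dv$$
(The sign of the σ term follows from exp(-Σ H(X_i, u)) with Σ H(X_i, u) = nK(u) + (nK(u))^{1/2} σ_n(u); since σ is a zero-mean Gaussian process, the sign does not affect the limit law.)
Characteristic function of F: for sufficiently small ε > 0,
$$E\Big[ \Big\{ \frac{n^{\lambda}}{(\log n)^{m-1}}\, Z \Big\}^{i\varepsilon} \Big] \to \text{const.} \quad (n \to \infty)$$
Q.E.D.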
Applications and Future Study (1)
Information Science & Mathematical Physics
Identification of an unknown information source = Statistical physics with a random Hamiltonian
Identification of hidden structure = The Hamiltonian has singularities
⇒ Singularities make the state density singular.
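Applications and Future Study (2)
Model Identification
F = F(p, φ, X1, X2, …, Xn) is computed from the samples for each pair p(x|w), φ(w); the true distribution is then identified.
[Figure: F as a function of the model p(x|w), φ(w), with the true model marked "True".]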
Applications and Future Study (3)
Poles and Orders of Zeta Function
1. If φ(w) > 0 at W0, then 0 < λ ≤ d/2.
2. 1 ≤ m ≤ d.
3. If φ(w) is Jeffreys' prior, then λ ≥ d/2.
4. If ζ(z) has a pole -λ', then λ ≤ λ'.
Applications and Future Study (4)
Concrete Learning Systems
1. Neural networks, true H0, model H.
$$p(y|x,w) = \frac{1}{(2\pi)^{1/2}} \exp\Big( -\frac{1}{2} \Big\| y - \sum_h a_h f(b_h \cdot x + c_h) \Big\|^2 \Big)$$
$$2\lambda \le H_0 (M + N + 1) + (H - H_0) \min(M + 1, N)$$
2. Gaussian mixtures, true H0, model H.
$$p(x|w) = \sum_h a_h \exp\Big( -\frac{\| x - b_h \|^2}{2} \Big)$$
$$2\lambda \le H_0 + (M - 1) H / 2 + (M - 3)/2$$
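The neural-network bound can be compared with the parameter count d that governs regular models. The sketch below is a hedged helper with hypothetical names; it assumes, since the slide does not define them, that M is the input dimension and N the output dimension, so that each hidden unit carries M + N + 1 parameters (a_h in R^N, b_h in R^M, c_h in R).

```python
def nn_bound_2lambda(h0, h, m, n):
    """Upper bound on 2*lambda from the slide: H0 (M+N+1) + (H-H0) min(M+1, N)."""
    return h0 * (m + n + 1) + (h - h0) * min(m + 1, n)

def nn_param_count(h, m, n):
    """Parameter count d = H (M+N+1), under the assumed reading of M and N."""
    return h * (m + n + 1)

# Example sizes chosen only for illustration.
for h0, h, m, n in [(1, 2, 2, 1), (2, 5, 3, 1), (3, 3, 4, 2)]:
    print((h0, h, m, n), nn_bound_2lambda(h0, h, m, n), nn_param_count(h, m, n))
```

When H = H0 the bound equals d, and for H > H0 each redundant hidden unit adds only min(M+1, N) rather than M + N + 1, so the bound falls below the value d that statement 1 on the previous slide allows at most.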
Applications and Future Study (5)
Future Study
1. Testing hypotheses ⇒ q(x) = p(x|w0) with w0 near a singularity.
2. Large systems: thermodynamic limit.
3. Replica method: f(z) = E[ exp(zF) ].
4. Generalization to non-commutative systems.