
Weak Convergence of Random Free Energy in Information Theory

Weak Convergence of Random Free Energy in Information Theory. Sumio Watanabe, Tokyo Institute of Technology. Contents: Identification Problem ≡ Mathematical Physics with Random Hamiltonian. 1. Background. 2. Main Theorem. 3. Outline of Proof. 4. Applications and Future Study.


Presentation Transcript


  1. Weak Convergence of Random Free Energy in Information Theory. Sumio Watanabe, Tokyo Institute of Technology.

  2. Contents.
     Identification Problem ≡ Mathematical Physics with Random Hamiltonian
     1. Background
     2. Main Theorem
     3. Outline of Proof
     4. Applications and Future Study

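  3. Background (1). Example: Classical Spin System.
     [Figure: two spin systems labeled Unknown and Learner, each with hidden and visible units $s_i$, $s_j$ and couplings $w_{ij}$; samples $x$ pass from the unknown system to the learner, which learns from them.]
     $p(x|w) = \frac{1}{Z(w)} \sum_{\text{hidden}} \exp\Bigl( -\sum_{(i,j)} w_{ij}\, s_i s_j \Bigr)$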

  4. Background (2). Identification Problem.
     Classical unknown information source: $q(x)$
     Observation: $X_1, X_2, \ldots, X_n$
     Learning system: $p(x|w)$, $\varphi(w)$
     Estimated distribution: $p(x \mid X_1, X_2, \ldots, X_n)$
     Relative entropy: $D(q\|p) \equiv \int q(x)\, [\log q(x) - \log p(x \mid X_1, \ldots, X_n)]\, dx = {?}$

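  5. Background (3). Random Free Energy and Relative Entropy.
     Definition. Random free energy = log-likelihood of the system:
     $F(X_1, X_2, \ldots, X_n) \equiv -\log \int p(X_1|w)\, p(X_2|w) \cdots p(X_n|w)\, \varphi(dw) + \sum_{i=1}^{n} \log q(X_i)$
     Relation between $F$ and $D(q\|p)$:
     $D\bigl( q(X_{n+1}) \,\|\, p(X_{n+1} \mid X_1, X_2, \ldots, X_n) \bigr) = F(X_1, X_2, \ldots, X_{n+1}) - F(X_1, X_2, \ldots, X_n)$,
     where the right-hand side is averaged over $X_{n+1} \sim q$. For a single draw the difference equals $\log q(X_{n+1}) - \log p(X_{n+1} \mid X_1, \ldots, X_n)$.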
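
The relation between $F$ and $D(q\|p)$ can be checked on a small case. The sketch below is an illustration only, not taken from the slides: it assumes a Bernoulli model $p(x|w) = w^x (1-w)^{1-x}$ with a uniform prior on $[0,1]$, and a true source $q$ with $q(1) = 0.3$. The function names and the sample list are hypothetical. Under these assumptions the evidence integral is a Beta function, so no numerical integration is needed.

```python
import math

def log_beta(a, b):
    # log of the Beta function B(a, b), computed via log-gamma
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def free_energy(xs, q1):
    """F = -log ∫ Π p(X_i|w) dw + Σ log q(X_i) for the assumed Bernoulli model."""
    n, k = len(xs), sum(xs)
    log_evidence = log_beta(k + 1, n - k + 1)  # ∫ w^k (1-w)^(n-k) dw
    log_q = k * math.log(q1) + (n - k) * math.log(1 - q1)
    return -log_evidence + log_q

def predictive_one(xs):
    # p(X_{n+1} = 1 | X_1..X_n) under the uniform prior (Laplace's rule)
    return (sum(xs) + 1) / (len(xs) + 2)

def kl_bernoulli(q1, p1):
    # D(q || p) for two Bernoulli distributions
    return q1 * math.log(q1 / p1) + (1 - q1) * math.log((1 - q1) / (1 - p1))

q1 = 0.3                 # assumed true probability of observing 1
xs = [1, 0, 0, 1, 0]     # hypothetical observed sample

f_n = free_energy(xs, q1)
# Average F(X_1..X_{n+1}) - F(X_1..X_n) over X_{n+1} drawn from q
expected_increment = sum(
    prob * (free_energy(xs + [x_next], q1) - f_n)
    for x_next, prob in ((1, q1), (0, 1 - q1))
)
kl = kl_bernoulli(q1, predictive_one(xs))
print(expected_increment, kl)
```

The two printed numbers agree up to floating-point error, which is the averaged form of the relation on this slide.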

  6. Background (4). Identifiability and Singularities.
     A learning system $p(x|w)$ is called identifiable if
     $p(x|w_1) = p(x|w_2)\ (\forall x) \Rightarrow w_1 = w_2$.
     A system which identifies the structure is non-identifiable.
     Remark. Let $W = \{w\}$ and $w_1 \sim w_2 \Leftrightarrow$ "$p(x|w_1) = p(x|w_2)\ (\forall x)$". Then $W/\!\sim$ is not a manifold, because $\{ w \,;\, p(x|w) = p(x|w_0) \}$ is an analytic set with singularities.

  7. Main Theorem (1). Mathematical Definitions.
     $X$ : a random variable on $\mathbb{R}^N$ with p.d.f. $q(x)$.
     $L^2(q) = \{ f \,;\, \int f(x)^2 q(x)\, dx < \infty \}$ : a real Hilbert space.
     $W$ : a real $d$-dimensional manifold.
     $\varphi(w)$ : a p.d.f. on $W$, a $C_0^\infty$-class function.
     $\varphi(w)\, dw$ : a probability distribution on $W$.

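  8. Main Theorem (2). Mathematical Definitions.
     $H(\cdot, w)$ : an $L^2(q)$-valued real analytic function on $W$ such that
     $E_X[e^{-H(X,w)}] = 1\ (\forall w)$ [ $\Rightarrow K(w) \equiv E_X[H(X,w)] \ge 0$ ] and
     $W_0 \equiv \{ w \in \operatorname{supp} \varphi \,;\, K(w) = 0 \} \ne \emptyset$.
     E.g. $H(x,w) = \log q(x) - \log p(x|w)$.
     Given i.i.d. $X_1, X_2, \ldots, X_n$, the random free energy is
     $F = -\log \int \exp\Bigl( -\sum_{i=1}^{n} H(X_i, w) \Bigr) \varphi(w)\, dw$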

  9. Main Theorem (3). Gel'fand's Zeta Function.
     Difficulty: $\{ w \,;\, K(w) = 0 \}$ is an analytic set with singularities.
     The zeta function $\zeta(z) = \int K(w)^z \varphi(w)\, dw$ is holomorphic in $\operatorname{Re}(z) > 0$.
     Theorem (Atiyah, Sato, Bernstein, Bjork, Kashiwara, 1970-1980).
     (1) $\zeta(z)$ can be analytically continued to a meromorphic function on the entire complex plane.
     (2) All poles are real, negative, and rational numbers.
     Poles: $0 > -\lambda_1 > -\lambda_2 > -\lambda_3 > \cdots$. Orders: $m_1, m_2, m_3, \ldots$
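
A worked example may help fix the meaning of the poles and their orders. It is an illustration under simplifying assumptions, not an example from the slides: $\varphi$ is taken to be uniform on a box, which ignores the $C_0^\infty$ requirement on $\varphi$.

```latex
% Assumed example 1: d = 1, K(w) = w^2, \varphi uniform on [-1, 1].
\zeta(z) = \int_{-1}^{1} (w^2)^z \,\tfrac{1}{2}\, dw
         = \int_{0}^{1} w^{2z}\, dw
         = \frac{1}{2z + 1}, \qquad \operatorname{Re}(z) > -\tfrac{1}{2}.
% The right-hand side continues \zeta to all of \mathbb{C} with a single
% simple pole at z = -1/2, so \lambda_1 = 1/2 and m_1 = 1.

% Assumed example 2: d = 2, K(w) = w_1^2 w_2^2, \varphi uniform on [0, 1]^2.
\zeta(z) = \int_0^1 w_1^{2z}\, dw_1 \int_0^1 w_2^{2z}\, dw_2
         = \frac{1}{(2z + 1)^2}.
% The pole at z = -1/2 now has order 2, so \lambda_1 = 1/2 and m_1 = 2.
```

In the second example the zero set $\{w_1 w_2 = 0\}$ has a crossing at the origin, and the pole order rises from 1 to 2 while $\lambda_1$ stays at $1/2$.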

  10. Main Theorem (4). Main Theorem.
      The following convergence in law holds:
      $F - \lambda_1 \log n + (m_1 - 1) \log\log n \to F^* \quad (n \to \infty)$,
      where $F^*$ can be represented by a limit process of an empirical process on $W_0$.
      Corollary. If $E[D(q\|p)]$ has an asymptotic expansion, then
      $E[D(q\|p)] = \frac{\lambda_1}{n} + o\bigl(\frac{1}{n}\bigr)$
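
The role of the $\lambda_1 \log n$ term can be seen in a stripped-down calculation. The sketch below is an assumption-laden illustration, not the theorem itself: it drops the random fluctuation entirely (it replaces $\sum_i H(X_i, w)$ by $nK(w)$), takes $K(w) = w^2$ with $\varphi$ uniform on $[0,1]$, and uses the closed form $\int_0^1 e^{-n w^2} dw = \tfrac{1}{2}\sqrt{\pi/n}\,\operatorname{erf}(\sqrt{n})$. For this choice $\lambda_1 = 1/2$ and $m_1 = 1$, so there is no $\log\log n$ term. The function name is hypothetical.

```python
import math

def deterministic_free_energy(n):
    """-log ∫_0^1 exp(-n w^2) dw, the free energy with the random part removed."""
    z_n = 0.5 * math.sqrt(math.pi / n) * math.erf(math.sqrt(n))
    return -math.log(z_n)

LAMBDA_1 = 0.5  # leading pole of the zeta function for K(w) = w^2 in one dimension
limit = math.log(2.0 / math.sqrt(math.pi))  # value the centered quantity approaches

for n in (10, 100, 1000, 10000):
    centered = deterministic_free_energy(n) - LAMBDA_1 * math.log(n)
    print(n, centered, limit)
```

After subtracting $\lambda_1 \log n$ the remaining quantity settles to a constant; in the theorem the corresponding limit $F^*$ is a random variable rather than a constant.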

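  11. Proof Outline (1). Hironaka Resolution Theorem.
      [Figure: the map $g$ from $U$ to $W$, with $U_0 \subset U$ and $W_0 \subset W$ marked, and the graph of $K(w)$ touching 0 on $W_0$. Locally, $K(g(u)) = a(u)\, u_1^{2s_1} u_2^{2s_2} \cdots u_d^{2s_d}$.]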

  12. Proof Outline (2). Resolution Theorem. H. Hironaka (1964), M. F. Atiyah (1970).
      Let $K(w) \ge 0$ be a real analytic function defined in a neighborhood of $0 \in W \subset \mathbb{R}^d$. Then there exist an open set $W$, a real analytic manifold $U$, and a proper analytic map $g: U \to W$ such that
      (1) $g: U - U_0 \to W - W_0$ is an isomorphism.
      (2) For each $P \in U$, there are local coordinates $(u_1, u_2, \ldots, u_d)$ centered at $P$ so that locally near $P$
          $K(g(u)) = a(u)\, u_1^{2s_1} u_2^{2s_2} \cdots u_d^{2s_d}$,
          where $a(u) > 0$ is an analytic function and each $s_i \ge 0$ is an integer.

  13. Proof Outline (3). Division of Partition Function.
      Because $\operatorname{supp} \varphi$ is compact and $g$ is a proper map, we can assume $W = \bigcup_a U_a$ (a finite union; the overlaps have measure zero).
      $\exp(-F) = \sum_a \int_{U_a} \exp\Bigl[ -\sum_{i=1}^{n} H(X_i, g_a(u_a)) \Bigr] \varphi_a(u_a)\, du_a$
      In each $U_a$,
      $K(g_a(u_a)) = a(u)\, u_1^{2s_1} u_2^{2s_2} \cdots u_d^{2s_d}$,
      $\varphi_a(u_a) = \sum b_a(u)\, u_1^{k_1} u_2^{k_2} \cdots u_d^{k_d}$.
      (Both $s_i$ and $k_i$ depend on $a$.)
      Hereafter, the index $a$ is omitted and $K(u) \equiv K(g(u))$ is used.

  14. Proof Outline (4). B-function.
      The zeta function: $\zeta(z) = \int K(w)^z \varphi(w)\, dw$.
      $\exists P(w, \partial_w, z)$, $\exists b(z)$ such that $P(w, \partial_w, z)\, K(w)^{z+1} = b(z)\, K(w)^z$.
      Analytic continuation is carried out using the b-function.
      If $K(w)$ is a polynomial, then there exists an algorithm to calculate $b(z)$ (Oaku, 1997).

  15. Proof Outline (5). Ideals of Local Analytic Functions.
      Lemma 1. Let $u \mapsto H(\cdot, u)$ be a real analytic function in $U$. There exist an open set $U_e \subset U$ and a finite set of analytic functions $\{ g_j(u), h_j(\cdot, u) \,;\, j = 1, 2, \ldots, J \}$ in $U_e$ such that
      (1) $T(u) \ge I\ (\forall u \in U_e)$, where $T_{jk}(u) \equiv \int h_j(x,u)\, h_k(x,u)\, q(x)\, dx$,
      (2) $H(x,u) = \sum_{j=1}^{J} g_j(u)\, h_j(x,u)$.

  16. Proof Outline (6). Decomposition of Hamiltonian.
      $r(x,u) \equiv \frac{H(x,u) - K(u)}{K(u)^{1/2}}$
      By Lemma 1 and $K(u) = \int \{ H(x,u) + e^{-H(x,u)} - 1 \}\, q(x)\, dx$, $r(x,u)$ is well defined even if $K(u) = 0$.
      $\sigma_n(u) \equiv \frac{1}{\sqrt{n}} \sum_{i=1}^{n} r(X_i, u)$
      Random Hamiltonian: $\sum_{i=1}^{n} H(X_i, u) = nK(u) + (nK(u))^{1/2} \sigma_n(u)$
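
The decomposition can be checked numerically on a regular toy model. This sketch is an illustration under stated assumptions, not the setting of the proof: it takes $q = N(0,1)$ and $p(x|u) = N(u,1)$, for which $H(x,u) = \log q(x) - \log p(x|u) = -xu + u^2/2$ and $K(u) = u^2/2$. The function names, the seed, and the sample size are hypothetical.

```python
import math
import random

def H(x, u):
    # log q(x) - log p(x|u) for q = N(0, 1), p(.|u) = N(u, 1)
    return -x * u + 0.5 * u * u

def K(u):
    # K(u) = E_X[H(X, u)] under q
    return 0.5 * u * u

def r(x, u):
    # normalized fluctuation (H - K) / K^{1/2}; requires u != 0 here
    return (H(x, u) - K(u)) / math.sqrt(K(u))

def sigma_n(xs, u):
    # empirical process: n^{-1/2} times the sum of r(X_i, u)
    return sum(r(x, u) for x in xs) / math.sqrt(len(xs))

rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(200)]  # i.i.d. draws from q
u = 0.3
n = len(xs)

lhs = sum(H(x, u) for x in xs)
rhs = n * K(u) + math.sqrt(n * K(u)) * sigma_n(xs, u)
print(lhs, rhs)

# r stays bounded as u -> 0+: in this model it equals -sqrt(2) * x for u > 0
print(r(1.0, 1e-6), -math.sqrt(2.0))
```

In this toy model $r(x,u) = -\sqrt{2}\,\operatorname{sign}(u)\, x$, so the ratio has a finite limit as $u \to 0$ from either side even though both numerator and denominator vanish.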

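  17. Proof Outline (7). Donsker's Empirical Process.
      Empirical process: $\sigma_n(u) \equiv \frac{1}{\sqrt{n}} \sum_{i=1}^{n} r(X_i, u)$
      $\sigma_n(\cdot) \to \sigma(\cdot)$, a tight Gaussian process.
      Central limit theorem in Banach space: for every bounded continuous functional $f$ on $L^\infty(\operatorname{supp} \varphi)$,
      $E_{X_1, X_2, \ldots, X_n}[ f(\sigma_n) ] \to E_\sigma[ f(\sigma) ]$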

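  18. Proof Outline (8). Poles of Zeta Function.
      $\zeta(z) = \sum \int K(u)^z \varphi(u)\, du$
      $K(u) = a(u)\, u_1^{2s_1} u_2^{2s_2} \cdots u_d^{2s_d}$
      $\varphi(u) = \sum b(u)\, u_1^{k_1} u_2^{k_2} \cdots u_d^{k_d}$
      $\lambda = \min_j \frac{k_j + 1}{2 s_j}$, $\quad m = \#\bigl\{ j \,;\, \lambda = \frac{k_j + 1}{2 s_j} \bigr\}$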
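
The rule for $\lambda$ and $m$ is mechanical once the exponents are known. The sketch below handles a single coordinate chart with one monomial for $\varphi$, under the assumptions that $a(u) > 0$ and $b(u) \ne 0$ near the origin. Coordinates with $s_j = 0$ are skipped, on the assumption that they contribute no $z$-dependent pole. The function name `leading_pole` is hypothetical.

```python
from fractions import Fraction

def leading_pole(s, k):
    """Return (lambda, m) for K(u) = a(u) * prod u_j^(2 s_j), phi(u) = b(u) * prod u_j^(k_j).

    s and k are sequences of non-negative integers of equal length.
    """
    if len(s) != len(k):
        raise ValueError("s and k must have the same length")
    # Each coordinate with s_j > 0 contributes a candidate pole at -(k_j + 1) / (2 s_j)
    ratios = [Fraction(kj + 1, 2 * sj) for sj, kj in zip(s, k) if sj > 0]
    if not ratios:
        raise ValueError("K does not vanish in this chart: no pole from these exponents")
    lam = min(ratios)
    m = ratios.count(lam)  # number of coordinates attaining the minimum
    return lam, m

print(leading_pole((1, 1), (0, 0)))        # K = u1^2 u2^2, phi constant
print(leading_pole((1, 3), (1, 0)))        # K = u1^2 u2^6, phi ~ u1
print(leading_pole((1, 2, 0), (0, 1, 3)))  # third coordinate has s_j = 0 and is ignored
```

Exact rational arithmetic is used so that ties between coordinates are detected reliably when counting $m$.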

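  19. Proof Outline (9). Zeta Function and State Density.
      The coordinates are split as $(u, v)$, where $u = (u_j)_{j \in J}$ and $J$ is the set of indices that attain the minimum.
      $L(u) \equiv \prod_{j \in J} u_j^{2 s_j}$
      Partial zeta function: $\iint L(u)^z \varphi(u,v)\, du\, dv$, with pole $-\lambda$ of order $m$.
      Inverse Mellin transform gives the state density:
      $\iint \delta(t - L(u))\, \varphi(u,v)\, du\, dv \simeq t^{\lambda - 1} (-\log t)^{m-1} \int \varphi(0,v)\, dv \quad (t \to 0)$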

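  20. Proof Outline (10). Partition Function and Empirical Process.
      Partition function ← State density ← Zeta function.
      With $Z = \exp(-F)$,
      $Z = \iint \exp\bigl( -nK(u,v) - (nK(u,v))^{1/2} \sigma_n(u,v) \bigr)\, \varphi(u,v)\, du\, dv$
      $\to \iint \frac{dt}{n} \Bigl(\frac{t}{n}\Bigr)^{\lambda - 1} \Bigl(-\log\frac{t}{n}\Bigr)^{m-1} \varphi(0,v)\, dv \times \exp\bigl[ -tK(0,v) - (tK(0,v))^{1/2} \sigma(0,v) \bigr]$
      Characteristic function of $F$: for sufficiently small $\varepsilon > 0$,
      $E\Bigl[ \Bigl\{ \frac{n^\lambda}{(\log n)^{m-1}}\, Z \Bigr\}^{i\varepsilon} \Bigr] \to \text{const.} \quad (n \to \infty)$. Q.E.D.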

  21. Applications and Future Study (1). Information Science & Mathematical Physics.
      Identification of an unknown information source = statistical physics with a random Hamiltonian.
      Identification of hidden structure = the Hamiltonian has singularities ⇒ the singularities make the state density singular.

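  22. Applications and Future Study (2). Model Identification.
      $F = F(p, \varphi, X_1, X_2, \ldots, X_n)$ is computed from the samples, and then the true distribution is identified.
      [Figure: $F$ plotted against the candidates $p(x|w)$, $\varphi(w)$, with a point labeled True.]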

  23. Applications and Future Study (3). Poles and Orders of Zeta Function.
      1. If $\varphi(w) > 0$ at $W_0$, then $0 < \lambda \le d/2$.
      2. $1 \le m \le d$.
      3. If $\varphi(w)$ is Jeffreys' prior, then $\lambda \ge d/2$.
      4. If $\zeta(z)$ has a pole $-\lambda'$, then $\lambda \le \lambda'$.

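  24. Applications and Future Study (4). Concrete Learning Systems.
      1. Neural networks, true $H_0$, model $H$:
         $p(y|x,w) = \frac{1}{(2\pi)^{1/2}} \exp\Bigl( -\frac{1}{2} \bigl\| y - \sum_h a_h f(b_h \cdot x + c_h) \bigr\|^2 \Bigr)$
         $2\lambda \le H_0 (M + N + 1) + (H - H_0) \min(M + 1, N)$
      2. Gaussian mixtures, true $H_0$, model $H$:
         $p(x|w) = \sum_h a_h \exp\Bigl( -\frac{1}{2} \| x - b_h \|^2 \Bigr)$
         $2\lambda \le H_0 + (M - 1) H / 2 + (M - 3)/2$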

  25. Applications and Future Study (5). Future Study.
      1. Testing hypothesis ⇒ $q(x) = p(x|w_0)$ ; $w_0$ near singularity.
      2. Large system: thermodynamical limit.
      3. Replica method: $f(z) = E[\exp(zF)]$.
      4. Generalization to non-commutative systems.
