Sum-Product Networks: A New Deep Architecture
Pedro Domingos, Dept. Computer Science & Eng., University of Washington
Joint work with Hoifung Poon
Graphical Models: Challenges
• Examples: Bayesian network, Markov network, restricted Boltzmann machine (RBM)
[Figure: example network over Sprinkler, Rain, Grass Wet]
• Advantage: Compactly represent probability distributions
• Problem: Inference is intractable
• Problem: Learning is difficult
Deep Learning
• Stack many layers
  E.g.: DBN [Hinton & Salakhutdinov, 2006], CDBN [Lee et al., 2009], DBM [Salakhutdinov & Hinton, 2010]
• Potentially much more powerful than shallow architectures [Bengio, 2009]
• But …
  • Inference is even harder
  • Learning requires extensive effort
Graphical Models
• Learning: Requires approximate inference
• Inference: Still approximate
Existing Tractable Models
• E.g., hierarchical mixture model, thin junction tree, etc.
• Problem: Too restricted
This Talk: Sum-Product Networks
• Compactly represent partition function using a deep network
• Exact inference in time linear in network size
• Can compactly represent many more distributions
• Learn optimal way to reuse computation, etc.
[Diagram labels: Sum-Product Networks, Graphical Models, Existing Tractable Models]
Outline
• Sum-product networks (SPNs)
• Learning SPNs
• Experimental results
• Conclusion
Why Is Inference Hard?
• Bottleneck: Summing out variables
• E.g.: Partition function $Z = \sum_{X} \prod_{k} \phi_k(X)$
  Sum of exponentially many products
Alternative Representation
$P(X) = 0.4\, I[X_1 = 1]\, I[X_2 = 1] + 0.2\, I[X_1 = 1]\, I[X_2 = 0] + 0.1\, I[X_1 = 0]\, I[X_2 = 1] + 0.3\, I[X_1 = 0]\, I[X_2 = 0]$
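Shorthand for Indicators
$X_i$ stands for $I[X_i = 1]$ and $\bar{X}_i$ for $I[X_i = 0]$:
$P(X) = 0.4\, X_1 X_2 + 0.2\, X_1 \bar{X}_2 + 0.1\, \bar{X}_1 X_2 + 0.3\, \bar{X}_1 \bar{X}_2$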
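Sum Out Variables
e: $X_1 = 1$
$P(e) = 0.4\, X_1 X_2 + 0.2\, X_1 \bar{X}_2 + 0.1\, \bar{X}_1 X_2 + 0.3\, \bar{X}_1 \bar{X}_2$
Set $X_1 = 1$, $\bar{X}_1 = 0$, $X_2 = 1$, $\bar{X}_2 = 1$
Easy: To sum out a variable, set both of its indicators to 1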
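The indicator trick on this slide can be checked numerically. The following is a minimal sketch, assuming only the four coefficients given on the slides; the function and argument names (`network_polynomial`, `nx1`, `nx2`) are illustrative and not from the talk.

```python
def network_polynomial(x1, nx1, x2, nx2):
    """Evaluate the slide's two-variable polynomial.

    x1 = I[X1=1], nx1 = I[X1=0], x2 = I[X2=1], nx2 = I[X2=0].
    """
    return (0.4 * x1 * x2 + 0.2 * x1 * nx2
            + 0.1 * nx1 * x2 + 0.3 * nx1 * nx2)

# Complete state X1=1, X2=0: exactly one indicator per variable is 1.
p_state = network_polynomial(1, 0, 0, 1)

# Evidence e: X1=1 with X2 summed out, so both X2 indicators are set to 1.
p_evidence = network_polynomial(1, 0, 1, 1)

# All indicators set to 1 sums over every state (the partition function).
z = network_polynomial(1, 1, 1, 1)

print(p_state, p_evidence, z)
```

Here `p_evidence` adds the two terms that contain $X_1$ (0.4 and 0.2), and `z` adds all four coefficients.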
Graphical Representation
[Figure: network with weights 0.4, 0.3, 0.2, 0.1 over the indicator leaves $X_1$, $\bar{X}_1$, $X_2$, $\bar{X}_2$]
But … Exponentially Large
• Example: Parity
  Uniform distribution over states with an even number of 1’s
• The flat representation has $2^{N-1}$ products and $N 2^{N-1}$ edges to the indicators
• Can we make this more compact?
[Figure: flat network over the indicators of $X_1, \dots, X_5$]
Use a Deep Network
• Example: Parity
  Uniform distribution over states with an even number of 1’s
• Size: $O(N)$
• Induce many hidden layers
• Reuse partial computation
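One way to see how a deep network reaches $O(N)$ size for parity is to carry two partial results per variable: "even so far" and "odd so far". The sketch below is one possible construction under that assumption, not necessarily the network drawn on the slide; `parity_spn` and `state_indicators` are illustrative names.

```python
from itertools import product


def parity_spn(indicators):
    """Evaluate a deep sum-product network for the even-parity distribution.

    indicators: list of (x_i, nx_i) pairs with x_i = I[X_i=1], nx_i = I[X_i=0].
    Each step adds two sum nodes and four product nodes, so size is O(N).
    """
    x, nx = indicators[0]
    even, odd = nx, x  # with one variable, even parity means X_1 = 0
    for x, nx in indicators[1:]:
        # Reuse the partial results of the previous layer in both new nodes.
        even, odd = (0.5 * (even * nx) + 0.5 * (odd * x),
                     0.5 * (odd * nx) + 0.5 * (even * x))
    return even


def state_indicators(state):
    """Indicators for a complete 0/1 state."""
    return [(bit, 1 - bit) for bit in state]


n = 5
probs = {s: parity_spn(state_indicators(s)) for s in product([0, 1], repeat=n)}
z = parity_spn([(1, 1)] * n)  # all indicators set to 1
```

Every even-parity state receives the same probability and every odd-parity state receives zero, while the loop touches each variable only once.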
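Sum-Product Networks (SPNs)
• Rooted DAG
• Nodes: Sum, product, input indicator
• Weights on edges from sum to children
[Figure: example SPN over $X_1$, $X_2$ with root weights 0.7, 0.3 and lower-level sum weights 0.4, 0.9, 0.7, 0.2, 0.6, 0.1, 0.3, 0.8 on the indicator leaves]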
Distribution Defined by SPN
$P(X) \propto S(X)$
[Figure: the example SPN over $X_1$, $X_2$]
Can We Sum Out Variables?
$P(e) = \sum_{X \in e} P(X) \propto \sum_{X \in e} S(X) \overset{?}{=} S(e)$
e: $X_1 = 1$, so the indicators are set to $X_1 = 1$, $\bar{X}_1 = 0$, $X_2 = 1$, $\bar{X}_2 = 1$
[Figure: the example SPN evaluated with these indicator values]
Valid SPN
• SPN is valid if $S(e) = \sum_{X \in e} S(X)$ for all e
• Valid $\Rightarrow$ Can compute marginals efficiently
• Partition function Z can be computed by setting all indicators to 1
Valid SPN: General Conditions
Theorem: SPN is valid if it is complete & consistent
• Complete: Under sum, children cover the same set of variables
• Consistent: Under product, no variable in one child and negation in another
[Figure: an incomplete SPN and an inconsistent SPN; in both examples $S(e) \neq \sum_{X \in e} S(X)$]
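The effect of the two conditions can be illustrated by comparing a single evaluation $S(e)$ with the brute-force sum over all complete states. The three tiny networks below are hypothetical examples chosen for this sketch, not the ones pictured on the slide.

```python
from itertools import product


def evidence_indicators(evidence, n_vars):
    """(x_i, nx_i) pairs; unobserved variables get both indicators set to 1."""
    return [(evidence[i], 1 - evidence[i]) if i in evidence else (1, 1)
            for i in range(n_vars)]


def brute_force(spn, evidence, n_vars):
    """Sum the network value over all complete states that agree with evidence."""
    total = 0.0
    for state in product([0, 1], repeat=n_vars):
        if all(state[i] == v for i, v in evidence.items()):
            total += spn([(bit, 1 - bit) for bit in state])
    return total


def valid_spn(ind):
    # Complete and consistent: each sum covers one variable, product mixes two.
    (x1, nx1), (x2, nx2) = ind
    return (0.6 * x1 + 0.4 * nx1) * (0.3 * x2 + 0.7 * nx2)


def incomplete_spn(ind):
    # Sum node whose children cover different variables (X1 versus X2).
    (x1, _), (x2, _) = ind
    return 0.5 * x1 + 0.5 * x2


def inconsistent_spn(ind):
    # Product node with X1 in one child and its negation in the other.
    (x1, nx1), _ = ind
    return x1 * nx1


report = {}
for name, spn in [("valid", valid_spn),
                  ("incomplete", incomplete_spn),
                  ("inconsistent", inconsistent_spn)]:
    direct = spn(evidence_indicators({}, 2))  # S(e) with nothing observed
    report[name] = (direct, brute_force(spn, {}, 2))
```

Only the complete and consistent network returns matching values in `report`; for the other two, setting indicators to 1 no longer equals the explicit sum.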
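Semantics of Sums and Products
• Product $\rightarrow$ Feature $\rightarrow$ Form feature hierarchy
• Sum $\rightarrow$ Mixture (with hidden var. summed out)
[Figure: sum node i with weight $w_{ij}$ on the edge to child j; equivalently, child j is multiplied by the indicator $I[Y_i = j]$ of a hidden variable $Y_i$, and $Y_i$ is summed out]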
Inference
• Probability: $P(X) = S(X) / Z$
• If weights sum to 1 at each sum node, then $Z = 1$ and $P(X) = S(X)$
• Example: X: $X_1 = 1$, $X_2 = 0$ (indicators $X_1 = 1$, $\bar{X}_1 = 0$, $X_2 = 0$, $\bar{X}_2 = 1$)
[Figure: sum nodes evaluate to 0.6, 0.9, 0.7, 0.8; product nodes to 0.42, 0.72; root to $0.7 \times 0.42 + 0.3 \times 0.72 = 0.51$]
Inference
• Marginal: $P(e) = S(e) / Z$
• Example: e: $X_1 = 1$ (indicators $X_1 = 1$, $\bar{X}_1 = 0$, $X_2 = 1$, $\bar{X}_2 = 1$)
[Figure: sum nodes evaluate to 0.6, 0.9, 1, 1; product nodes to 0.6, 0.9; root to $0.69 = 0.51 + 0.18$]
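The probability and marginal computations can be reproduced with a few lines of code. This is a sketch under one assumption: the pairing of the lower-level weights with indicators (0.6/0.4 and 0.9/0.1 for $X_1$, 0.3/0.7 and 0.2/0.8 for $X_2$) is inferred from the node values shown on the slides (0.6, 0.9, 0.7, 0.8), since the figure itself is not available. The names `sum_node` and `example_spn` are illustrative.

```python
def sum_node(weights, children):
    """Weighted sum of child values."""
    return sum(w * c for w, c in zip(weights, children))


def example_spn(x1, nx1, x2, nx2):
    """Bottom-up evaluation of the two-variable example network.

    x1 = I[X1=1], nx1 = I[X1=0], x2 = I[X2=1], nx2 = I[X2=0].
    """
    s1 = sum_node([0.6, 0.4], [x1, nx1])  # sum over X1, left branch
    s2 = sum_node([0.3, 0.7], [x2, nx2])  # sum over X2, left branch
    s3 = sum_node([0.9, 0.1], [x1, nx1])  # sum over X1, right branch
    s4 = sum_node([0.2, 0.8], [x2, nx2])  # sum over X2, right branch
    left, right = s1 * s2, s3 * s4        # product nodes
    return sum_node([0.7, 0.3], [left, right])  # root


p_state = example_spn(1, 0, 0, 1)     # P(X1=1, X2=0)
p_marginal = example_spn(1, 0, 1, 1)  # P(X1=1), X2 summed out
z = example_spn(1, 1, 1, 1)           # partition function
```

Each quantity costs one pass over the network, which is the sense in which inference is linear in network size.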
Inference
• MAP: Replace sums with maxs
• MAX: Pick child with highest value
• Example: e: $X_1 = 1$ (indicators $X_1 = 1$, $\bar{X}_1 = 0$, $X_2 = 1$, $\bar{X}_2 = 1$)
[Figure: max nodes evaluate to 0.6, 0.9, 0.7, 0.8; product nodes to 0.42, 0.72; the root MAX compares $0.7 \times 0.42 = 0.294$ with $0.3 \times 0.72 = 0.216$]
• MAP State: $X_1 = 1$, $X_2 = 0$
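The MAP procedure (an upward pass with max in place of sum, then a downward pass following the winning children) can be sketched as follows. As an assumption, the weight-to-indicator pairing is inferred from the node values on the slides rather than read from the figure; `max_node` and `map_state` are illustrative names.

```python
def max_node(weights, children):
    """Return (value, index) of the child with the highest weighted value."""
    scored = [w * c for w, c in zip(weights, children)]
    best = max(range(len(scored)), key=scored.__getitem__)
    return scored[best], best


def map_state(x1, nx1, x2, nx2):
    """MAP inference on the two-variable example network."""
    # Upward pass: every sum node is replaced by a max node.
    # Child index 0 is the indicator for value 1, index 1 for value 0.
    left = (max_node([0.6, 0.4], [x1, nx1]), max_node([0.3, 0.7], [x2, nx2]))
    right = (max_node([0.9, 0.1], [x1, nx1]), max_node([0.2, 0.8], [x2, nx2]))
    branches = [left, right]
    products = [a[0] * b[0] for a, b in branches]
    root_value, winner = max_node([0.7, 0.3], products)
    # Downward pass: follow the chosen child at each max node.
    x1_choice, x2_choice = branches[winner]
    return root_value, {"X1": 1 - x1_choice[1], "X2": 1 - x2_choice[1]}


value, state = map_state(1, 0, 1, 1)  # evidence e: X1 = 1
```

With the evidence $X_1 = 1$, the root picks the left branch (0.294 over 0.216) and the left $X_2$ node picks the indicator for $X_2 = 0$.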
Handling Continuous Variables
• Sum $\rightarrow$ Integral over input
• Simplest case: Indicator $\rightarrow$ Gaussian
  SPN compactly defines a very large mixture of Gaussians
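To illustrate why Gaussian leaves yield a very large mixture, the sketch below multiplies two univariate mixtures and compares the result with the explicitly expanded mixture. It uses finite sums over Gaussian leaves as a simplified stand-in for the integral on the slide, and all weights, means, and standard deviations are made-up values for illustration.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def mixture_node(x, weights, params):
    """Sum node over Gaussian leaves.

    Passing x=None integrates the variable out: each leaf integrates to 1,
    which mirrors setting both indicators to 1 in the discrete case.
    """
    if x is None:
        return sum(weights)
    return sum(w * gaussian_pdf(x, m, s) for w, (m, s) in zip(weights, params))


# Hypothetical parameters: (mean, std) per leaf.
W1, G1 = [0.6, 0.4], [(0.0, 1.0), (3.0, 0.5)]
W2, G2 = [0.3, 0.7], [(-1.0, 2.0), (2.0, 1.0)]


def spn_density(x1, x2):
    """Product node over two sum nodes: 2 + 2 leaves."""
    return mixture_node(x1, W1, G1) * mixture_node(x2, W2, G2)


def explicit_mixture(x1, x2):
    """The same density written out as a 2 * 2 = 4 component mixture."""
    total = 0.0
    for w1, (m1, s1) in zip(W1, G1):
        for w2, (m2, s2) in zip(W2, G2):
            total += w1 * w2 * gaussian_pdf(x1, m1, s1) * gaussian_pdf(x2, m2, s2)
    return total


joint = spn_density(0.5, 1.0)
marginal_x1 = spn_density(0.5, None)  # X2 integrated out
```

With n variables and k leaves per variable, the network keeps n·k leaves while the expanded mixture has k^n components.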
SPNs Everywhere
• Graphical models
  • Existing tractable models, e.g.: hierarchical mixture model, thin junction tree, etc. SPNs can compactly represent many more distributions
  • Inference methods, e.g.: junction-tree algorithm, message passing, … SPNs can represent, combine, and learn the optimal way
  • SPNs can be more compact by leveraging determinism, context-specific independence, etc.
• Models for efficient inference
  • E.g., arithmetic circuits, AND/OR graphs, case-factor diagrams
  • SPN: First approach for learning directly from data
• General, probabilistic convolutional network
  • Sum: Average-pooling
  • Max: Max-pooling
• Grammars in vision and language
  • E.g., object detection grammar, probabilistic context-free grammar
  • Sum: Non-terminal
  • Product: Production rule
Outline
• Sum-product networks (SPNs)
• Learning SPNs
• Experimental results
• Conclusion
General Approach
• Start with a dense SPN
• Find the structure by learning weights
  Zero weights signify absence of connections
• Can also learn with EM
  Each sum node is a mixture over children
The Challenge
• In principle, can use gradient descent
• But … gradient quickly dilutes
• Similar problem with EM
• Hard EM overcomes this problem
Our Learning Algorithm
• Online learning with hard EM
• Sum node maintains counts for each child
• For each example
  • Find MAP instantiation with current weights
  • Increment count for each chosen child
  • Renormalize to set new weights
• Repeat until convergence
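The steps above can be sketched on a toy network. This is a simplified illustration, not the authors' implementation: it assumes a fixed two-branch structure over two binary variables, initial counts chosen by hand to break symmetry, and counts that simply accumulate over passes. All names (`train_hard_em`, `best_child`, `probability`) are hypothetical.

```python
def normalize(counts):
    """Turn per-child counts into weights that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def best_child(weights, values):
    """Max node: (index, value) of the highest weighted child."""
    scored = [w * v for w, v in zip(weights, values)]
    idx = max(range(len(scored)), key=scored.__getitem__)
    return idx, scored[idx]


def train_hard_em(data, epochs=1):
    # Counts per sum node. leaves[b][v] holds counts for the children
    # [I[X_v=1], I[X_v=0]] under branch b. Initial values are an assumption.
    root = [1.0, 1.0]
    leaves = [[[2.0, 1.0], [2.0, 1.0]],
              [[1.0, 2.0], [1.0, 2.0]]]
    for _epoch in range(epochs):
        for example in data:
            indicators = [[float(v == 1), float(v == 0)] for v in example]
            # Find the MAP instantiation with the current weights.
            choices, branch_values = [], []
            for branch in leaves:
                picked = [best_child(normalize(c), ind)
                          for c, ind in zip(branch, indicators)]
                choices.append([idx for idx, _val in picked])
                value = 1.0
                for _idx, val in picked:
                    value *= val
                branch_values.append(value)
            winner, _val = best_child(normalize(root), branch_values)
            # Increment the count of each chosen child; weights are
            # renormalized from the counts on the next use.
            root[winner] += 1
            for var, idx in enumerate(choices[winner]):
                leaves[winner][var][idx] += 1
    return normalize(root), [[normalize(c) for c in branch] for branch in leaves]


def probability(model, example):
    """P(example) under the learned weights (sum nodes, not max nodes)."""
    root_w, leaf_w = model
    total = 0.0
    for w, branch in zip(root_w, leaf_w):
        term = w
        for weights, v in zip(branch, example):
            term *= weights[0] if v == 1 else weights[1]
        total += term
    return total


data = [(1, 1), (0, 0)] * 10
model = train_hard_em(data)
```

On this toy data one branch specializes in the all-ones examples and the other in the all-zeros examples, so the learned model assigns more mass to (1, 1) than to (1, 0).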
Outline
• Sum-product networks (SPNs)
• Learning SPNs
• Experimental results
• Conclusion
Task: Image Completion
• Very challenging
• Good for evaluating deep models
• Methodology:
  • Learn a model from training images
  • Complete unseen test images
  • Measure mean square errors
Datasets
• Main evaluation: Caltech-101 [Fei-Fei et al., 2004]
  • 101 categories, e.g., faces, cars, elephants
  • Each category: 30–800 images
• Also, Olivetti [Samaria & Harter, 1994] (400 faces)
• Each category: Last third for test
• Test images: Unseen objects