Sum-Product Networks: A New Deep Architecture
Pedro Domingos
Dept. Computer Science & Eng., University of Washington
Joint work with Hoifung Poon
Graphical Models: Challenges
[Figure: example graphical models — a restricted Boltzmann machine (RBM), a Bayesian network (Sprinkler, Rain, Grass Wet), a Markov network]
Advantage: Compactly represent probability distributions
Problem: Inference is intractable
Problem: Learning is difficult
Deep Learning
• Stack many layers. E.g.: DBN [Hinton & Salakhutdinov, 2006], CDBN [Lee et al., 2009], DBM [Salakhutdinov & Hinton, 2010]
• Potentially much more powerful than shallow architectures [Bengio, 2009]
• But …
• Inference is even harder
• Learning requires extensive effort
Graphical Models
Learning: Requires approximate inference
Inference: Still approximate
Existing Tractable Models (a subset of graphical models)
E.g., hierarchical mixture models, thin junction trees, etc.
Problem: Too restricted
This Talk: Sum-Product Networks
[Diagram: sum-product networks as a strict superset of existing tractable models, within the space of graphical models]
• Compactly represent the partition function using a deep network
• Exact inference in time linear in network size
• Can compactly represent many more distributions than existing tractable models
• Learn the optimal way to reuse computation, etc.
Outline
• Sum-product networks (SPNs)
• Learning SPNs
• Experimental results
• Conclusion
Why Is Inference Hard?
• Bottleneck: Summing out variables
• E.g., the partition function Z = Σ_X Π_k φ_k(X) is a sum of exponentially many products (see the sketch below)
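To make the bottleneck concrete, here is a minimal sketch (a toy model of my own, not from the talk) of computing a partition function by brute force; with N variables the outer sum ranges over 2^N states.

```python
import itertools

# Toy pairwise potential over two binary variables (hypothetical values).
def phi(x1, x2):
    return [[1.0, 2.0], [3.0, 4.0]][x1][x2]

# Partition function: a sum over all joint states of a product of potentials.
# With N variables this loop has 2**N iterations -- the inference bottleneck.
Z = sum(phi(x1, x2) for x1, x2 in itertools.product([0, 1], repeat=2))
print(Z)  # 10.0
```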
Alternative Representation
P(X) = 0.4 I[X1=1] I[X2=1] + 0.2 I[X1=1] I[X2=0] + 0.1 I[X1=0] I[X2=1] + 0.3 I[X1=0] I[X2=0]
Shorthand for Indicators
P(X) = 0.4 x1x2 + 0.2 x1x̄2 + 0.1 x̄1x2 + 0.3 x̄1x̄2
(where x1 ≡ I[X1=1] and x̄1 ≡ I[X1=0])
Sum Out Variables
e: X1 = 1
P(e) = 0.4 x1x2 + 0.2 x1x̄2 + 0.1 x̄1x2 + 0.3 x̄1x̄2
Set x1 = 1, x̄1 = 0, x2 = 1, x̄2 = 1
Easy: To sum out a variable, set both of its indicators to 1
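A minimal sketch of the same calculation in code (the polynomial is the one on the slides; the function name and argument order are mine):

```python
# Network polynomial of the toy distribution above; xb1, xb2 are the
# negated indicators x̄1, x̄2.
def S(x1, xb1, x2, xb2):
    return 0.4*x1*x2 + 0.2*x1*xb2 + 0.1*xb1*x2 + 0.3*xb1*xb2

print(S(1, 0, 0, 1))  # 0.2   = P(X1=1, X2=0): one indicator per variable is 1
print(S(1, 0, 1, 1))  # ≈ 0.6 = P(X1=1): both X2 indicators set to 1 (X2 summed out)
print(S(1, 1, 1, 1))  # ≈ 1.0 = partition function: all indicators set to 1
```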
Graphical Representation
[Figure: a single sum node with weights 0.4, 0.2, 0.1, 0.3 over four product nodes, each multiplying one indicator from {x1, x̄1} and one from {x2, x̄2}]
But … Exponentially Large
Example: Parity — uniform distribution over states with an even number of 1's
The shallow network needs 2^(N−1) products and N·2^(N−1) edges
[Figure: shallow sum-of-products network over the indicators of X1, …, X5]
Can we make this more compact?
Use a Deep Network
• Only O(N) size
• Induce many hidden layers
• Reuse partial computation
Example: Parity — uniform distribution over states with an even number of 1's (one possible construction is sketched below)
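The slides only show the parity network pictorially; below is one possible O(N) construction reconstructed from the description (the 0.5 weights and the even/odd layering are my assumptions): each layer keeps an "even parity so far" and an "odd parity so far" sum node, and both reuse the previous layer's two nodes.

```python
def parity_spn(indicators):
    """indicators: list of (x_i, xbar_i) pairs, one pair per variable."""
    x1, xb1 = indicators[0]
    even, odd = xb1, x1                    # base layer: parity of X1 alone
    for x, xb in indicators[1:]:
        even, odd = 0.5 * (xb * even + x * odd), 0.5 * (x * even + xb * odd)
    return even                            # root: even number of 1's

N = 5
# Even-parity state 1,1,0,0,0: probability 1 / 2**(N-1)
print(parity_spn([(1, 0), (1, 0), (0, 1), (0, 1), (0, 1)]))  # 0.0625
# Odd-parity state 1,0,0,0,0: probability 0
print(parity_spn([(1, 0), (0, 1), (0, 1), (0, 1), (0, 1)]))  # 0.0
# All indicators set to 1: partition function Z = 1
print(parity_spn([(1, 1)] * N))                              # 1.0
```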
Sum-Product Networks (SPNs)
• Rooted DAG
• Nodes: sums, products, and input indicators
• Weights on the edges from a sum node to its children
[Figure: example SPN over X1, X2 — a root sum with weights 0.7 and 0.3 over two products; each product multiplies a sum over x1/x̄1 (weights 0.6/0.4 and 0.9/0.1) and a sum over x2/x̄2 (weights 0.3/0.7 and 0.2/0.8)]
(A minimal code representation is sketched below.)
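Below is a minimal sketch of such a network as plain Python classes (my own representation, not the authors' code); the later sketches in this transcript reuse these classes. A variable that is absent from the assignment is treated as summed out, so both of its indicators evaluate to 1.

```python
class Indicator:
    """Leaf node, e.g. I[X1 = 1]."""
    def __init__(self, var, value):
        self.var, self.value = var, value
    def eval(self, assignment):
        v = assignment.get(self.var)          # None => variable summed out
        return 1.0 if v is None or v == self.value else 0.0

class Product:
    def __init__(self, children):
        self.children = children
    def eval(self, assignment):
        out = 1.0
        for child in self.children:
            out *= child.eval(assignment)
        return out

class Sum:
    def __init__(self, weighted_children):    # list of (weight, node) pairs
        self.weighted_children = weighted_children
    def eval(self, assignment):
        return sum(w * child.eval(assignment)
                   for w, child in self.weighted_children)
```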
Distribution Defined by SPN
P(X) ∝ S(X)
[Same example SPN over X1, X2 as above]
Can We Sum Out Variables?
We want P(e) = Σ_{X consistent with e} P(X) ∝ Σ_{X consistent with e} S(X). Is that sum equal to S(e)?
e: X1 = 1 → set x1 = 1, x̄1 = 0, x2 = 1, x̄2 = 1
[Same example SPN as above]
Valid SPN
• An SPN is valid if S(e) = Σ_{X consistent with e} S(X) for all evidence e
• Valid ⇒ marginals can be computed efficiently
• In particular, the partition function Z is computed by setting all indicators to 1
Valid SPN: General Conditions
Theorem: An SPN is valid if it is complete and consistent
• Complete: Under a sum node, the children cover the same set of variables
• Consistent: Under a product node, no variable appears in one child and its negation in another
[Figure: an incomplete and an inconsistent counterexample, for which S(e) ≠ Σ_{X consistent with e} S(X)]
(A scope-based check is sketched below.)
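A small sketch of how one might check these conditions bottom-up with the classes above (my code; it tests decomposability, i.e. disjoint child scopes, which is a simple sufficient condition for consistency):

```python
def scope(node):
    """Set of variables a node depends on."""
    if isinstance(node, Indicator):
        return {node.var}
    children = (node.children if isinstance(node, Product)
                else [child for _, child in node.weighted_children])
    return set().union(*(scope(child) for child in children))

def is_complete(sum_node):
    # Complete: all children of the sum node cover the same set of variables.
    scopes = [scope(child) for _, child in sum_node.weighted_children]
    return all(s == scopes[0] for s in scopes)

def is_decomposable(product_node):
    # Disjoint child scopes: no variable can then appear unnegated in one
    # child and negated in another, so the product is consistent.
    seen = set()
    for child in product_node.children:
        s = scope(child)
        if seen & s:
            return False
        seen |= s
    return True
```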
Semantics of Sums and Products
• Product ⇒ feature: products form a feature hierarchy
• Sum ⇒ mixture, with a hidden variable Y_i summed out: child j corresponds to the indicator I[Y_i = j], and the edge weights w_ij are the mixture weights
Inference
Probability: P(X) = S(X) / Z
X: X1 = 1, X2 = 0 → x1 = 1, x̄1 = 0, x2 = 0, x̄2 = 1
Bottom sums evaluate to 0.6, 0.9, 0.7, 0.8; the products to 0.42 and 0.72; the root to S(X) = 0.7·0.42 + 0.3·0.72 = 0.51
Inference
If the weights at each sum node sum to 1, then Z = 1 and P(X) = S(X)
(Here: S(X) = 0.51 is already the probability of X1 = 1, X2 = 0)
Inference
Marginal: P(e) = S(e) / Z
e: X1 = 1 → x1 = 1, x̄1 = 0, x2 = 1, x̄2 = 1
The sums over X2 evaluate to 1; the products to 0.6 and 0.9; the root to S(e) = 0.7·0.6 + 0.3·0.9 = 0.69 = 0.51 + 0.18
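Using the node classes sketched earlier, here is the example network reconstructed from the numbers on these slides (the exact wiring is my reading of the figure); it reproduces 0.51, 0.69, and Z = 1.

```python
x1, xb1 = Indicator("X1", 1), Indicator("X1", 0)
x2, xb2 = Indicator("X2", 1), Indicator("X2", 0)

A = Sum([(0.6, x1), (0.4, xb1)])   # left sum over X1
B = Sum([(0.3, x2), (0.7, xb2)])   # left sum over X2
C = Sum([(0.9, x1), (0.1, xb1)])   # right sum over X1
D = Sum([(0.2, x2), (0.8, xb2)])   # right sum over X2
root = Sum([(0.7, Product([A, B])), (0.3, Product([C, D]))])

print(root.eval({"X1": 1, "X2": 0}))  # ≈ 0.51: S(X) = 0.7*0.42 + 0.3*0.72
print(root.eval({"X1": 1}))           # ≈ 0.69: S(e), both X2 indicators act as 1
print(root.eval({}))                  # ≈ 1.0:  partition function Z
```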
Inference
MAP: Replace sums with maxes
e: X1 = 1
Root: max(0.7·0.42, 0.3·0.72) = max(0.294, 0.216) = 0.294, with every sum node below also replaced by a MAX
Inference
MAX: At each max node, pick the child with the highest (weighted) value and backtrack
MAP state: X1 = 1, X2 = 0
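A sketch of these two passes in code, continuing the example network above (the function names are mine): the upward pass treats every sum node as a max node, and the downward pass follows the winning child of each max node to read off the MAP state.

```python
def max_eval(node, assignment):
    """Upward pass with sums replaced by maxes."""
    if isinstance(node, Indicator):
        return node.eval(assignment)
    if isinstance(node, Product):
        out = 1.0
        for child in node.children:
            out *= max_eval(child, assignment)
        return out
    return max(w * max_eval(child, assignment)
               for w, child in node.weighted_children)

def backtrack(node, assignment, state):
    """Downward pass: follow the best child of every max (sum) node."""
    if isinstance(node, Indicator):
        state.setdefault(node.var, node.value)
    elif isinstance(node, Product):
        for child in node.children:
            backtrack(child, assignment, state)
    else:
        _, best = max(node.weighted_children,
                      key=lambda wc: wc[0] * max_eval(wc[1], assignment))
        backtrack(best, assignment, state)

evidence = {"X1": 1}
print(max_eval(root, evidence))   # ≈ 0.294 = max(0.7*0.42, 0.3*0.72)
state = dict(evidence)
backtrack(root, evidence, state)
print(state)                      # {'X1': 1, 'X2': 0} -- the MAP completion
```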
Handling Continuous Variables
• Sum ⇒ integral over the input
• Simplest case: replace each indicator with a Gaussian
• The SPN then compactly defines a very large mixture of Gaussians (see the sketch below)
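A sketch of the simplest continuous case (my code, reusing the Sum class above): replace an indicator leaf with a univariate Gaussian density node; summing the variable out corresponds to integrating the density over its whole range, so the leaf then returns 1.

```python
import math

class GaussianLeaf:
    def __init__(self, var, mean, std):
        self.var, self.mean, self.std = var, mean, std
    def eval(self, assignment):
        x = assignment.get(self.var)
        if x is None:                 # variable integrated out
            return 1.0
        z = (x - self.mean) / self.std
        return math.exp(-0.5 * z * z) / (self.std * math.sqrt(2.0 * math.pi))

# A sum over Gaussian leaves is a mixture of Gaussians; products of such sums
# over many variables define an exponentially large mixture of Gaussians with
# only linearly many parameters.
mix = Sum([(0.5, GaussianLeaf("X1", 0.0, 1.0)),
           (0.5, GaussianLeaf("X1", 3.0, 1.0))])
print(mix.eval({"X1": 0.0}))  # mixture density at X1 = 0
print(mix.eval({}))           # 1.0: X1 summed (integrated) out
```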
SPNs Everywhere
• Graphical models: existing tractable models (e.g., hierarchical mixture models, thin junction trees) are special cases, and SPNs can compactly represent many more distributions
• Inference methods for graphical models (e.g., the junction-tree algorithm, message passing): SPNs can represent, combine, and learn the optimal way to apply them
• SPNs can be more compact by leveraging determinism, context-specific independence, etc.
• Models for efficient inference (e.g., arithmetic circuits, AND/OR graphs, case-factor diagrams): SPNs are the first approach that learns such models directly from data
• General, probabilistic convolutional networks: sum ≈ average-pooling, max ≈ max-pooling
• Grammars in vision and language (e.g., object detection grammars, probabilistic context-free grammars): sum ≈ non-terminal, product ≈ production rule
Outline
• Sum-product networks (SPNs)
• Learning SPNs
• Experimental results
• Conclusion
General Approach
• Start with a dense SPN
• Find the structure by learning the weights: zero weights signify absent connections
• Can also learn with EM: each sum node is a mixture over its children
The Challenge
• In principle, can use gradient descent
• But … the gradient quickly dilutes as it propagates through many layers
• Similar problem with EM
• Hard EM overcomes this problem
Our Learning Algorithm
• Online learning with hard EM
• Each sum node maintains a count for each child
• For each example:
  • Find the MAP instantiation with the current weights
  • Increment the count of each chosen child
  • Renormalize the counts to set the new weights
• Repeat until convergence
(A rough code sketch follows below.)
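A rough sketch of this loop, built on the node classes and the max_eval pass from the earlier sketches (the data structures, the smoothing, and the fixed number of passes are my assumptions; the talk only gives the outline above):

```python
def hard_em(root, sum_nodes, data, passes=10, smoothing=0.1):
    """Online hard EM: per-child counts at each sum node, renormalized to weights."""
    counts = {s: [smoothing] * len(s.weighted_children) for s in sum_nodes}
    for _ in range(passes):
        for example in data:                     # example: dict variable -> value
            record_map_choices(root, example, counts)
            for s, c in counts.items():          # renormalize counts -> new weights
                total = sum(c)
                s.weighted_children = [(cnt / total, child)
                                       for cnt, (_, child) in zip(c, s.weighted_children)]

def record_map_choices(node, example, counts):
    """Descend along the MAP (max-product) tree, incrementing each chosen child."""
    if isinstance(node, Product):
        for child in node.children:
            record_map_choices(child, example, counts)
    elif isinstance(node, Sum):
        values = [w * max_eval(child, example)
                  for w, child in node.weighted_children]
        best = max(range(len(values)), key=values.__getitem__)
        counts[node][best] += 1
        record_map_choices(node.weighted_children[best][1], example, counts)
    # Indicator leaves: nothing to count.
```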
Outline
• Sum-product networks (SPNs)
• Learning SPNs
• Experimental results
• Conclusion
Task: Image Completion
• Very challenging
• Good for evaluating deep models
• Methodology:
  • Learn a model from training images
  • Complete unseen test images
  • Measure the mean squared error of the completions
Datasets
• Main evaluation: Caltech-101 [Fei-Fei et al., 2004]
  • 101 categories, e.g., faces, cars, elephants
  • Each category: 30 – 800 images
• Also, Olivetti [Samaria & Harter, 1994] (400 faces)
• Each category: last third held out for test; test images show unseen objects