Part III Learning structured representations Hierarchical Bayesian models


Universal Grammar: hierarchical phrase structure grammars (e.g., CFG, HPSG, TAG)
→ Grammar → Phrase structure → Utterance → Speech signal

Outline
• Learning structured representations
  • grammars
  • logical theories
• Learning at multiple levels of abstraction

A historical divide
Structured representations, innate knowledge (Chomsky, Pinker, Keil, ...)
vs
Unstructured representations, learning (McClelland, Rumelhart, ...)

• Structured representations with innate knowledge: Chomsky, Keil
• Structured representations with learning: structure learning
• Unstructured representations with learning: McClelland, Rumelhart

Representations
• Causal networks (e.g., asbestos → lung cancer → coughing, chest pain)
• Grammars
• Logical theories

Representations
• Phonological rules
• Semantic networks (nodes: Chemicals, Diseases, Biological functions, Bio-active substances; links: cause, affect, interact with, disrupt)

How to learn a representation R (or anything else)
• Search for the R that maximizes P(R | D) ∝ P(D | R) P(R)
• Prerequisites
  • Put a prior over a hypothesis space of Rs.
  • Decide how observable data are generated from an underlying R.
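The recipe above can be sketched in a few lines. This is a minimal illustration, not a method taken from the slides: the function name `map_estimate`, the three-element hypothesis space, and the coin-flip data are all hypothetical, chosen only to show the prior and the likelihood being combined and the best-scoring hypothesis being returned.

```python
import math

def map_estimate(hypotheses, log_prior, log_likelihood, data):
    """Return the hypothesis R with the highest log P(D | R) + log P(R)."""
    def log_posterior(r):
        return log_prior(r) + log_likelihood(data, r)
    return max(hypotheses, key=log_posterior)

# Hypothetical toy problem: R is the probability that a coin lands heads.
hypotheses = [0.1, 0.5, 0.9]

def log_prior(r):
    # Uniform prior over the (assumed) hypothesis space.
    return math.log(1.0 / len(hypotheses))

def log_likelihood(data, r):
    # Flips are assumed independent given R.
    return sum(math.log(r if flip == "H" else 1.0 - r) for flip in data)

best = map_estimate(hypotheses, log_prior, log_likelihood, "HHHHHHHT")
print(best)
```

Exhaustive search only works for a tiny hypothesis space; for grammars or logical theories the same score would normally be optimized by some form of heuristic search.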

Context free grammar
S → N VP
VP → V
VP → V N
N → “Alice”
N → “Bob”
V → “scratched”
V → “cheered”
Parse trees:
[S [N Alice] [VP [V cheered]]]
[S [N Alice] [VP [V scratched] [N Bob]]]

Probabilistic context free grammar
1.0  S → N VP
0.6  VP → V
0.4  VP → V N
0.5  N → “Alice”
0.5  N → “Bob”
0.5  V → “scratched”
0.5  V → “cheered”
Parse trees, with the probability of the rule applied at each node:
[S(1.0) [N(0.5) Alice] [VP(0.6) [V(0.5) cheered]]]: probability = 1.0 * 0.5 * 0.6 * 0.5 = 0.15
[S(1.0) [N(0.5) Alice] [VP(0.4) [V(0.5) scratched] [N(0.5) Bob]]]: probability = 1.0 * 0.5 * 0.4 * 0.5 * 0.5 = 0.05
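The probability of a parse tree is the product of the probabilities of every rule used in it, including the lexical rule for each word. A minimal sketch of that calculation follows; the tuple encoding of trees and the names `RULES` and `tree_probability` are assumptions made for illustration, while the rule probabilities are the ones on the slide.

```python
# Rule probabilities from the slide: (left-hand side, right-hand side) -> probability.
RULES = {
    ("S", ("N", "VP")): 1.0,
    ("VP", ("V",)): 0.6,
    ("VP", ("V", "N")): 0.4,
    ("N", ("Alice",)): 0.5,
    ("N", ("Bob",)): 0.5,
    ("V", ("scratched",)): 0.5,
    ("V", ("cheered",)): 0.5,
}

def tree_probability(tree):
    """tree is (label, child, ...); each child is a subtree or a word string."""
    label, *children = tree
    # The right-hand side is the sequence of child labels (or the word itself).
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULES[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            p *= tree_probability(child)
    return p

alice_cheered = ("S", ("N", "Alice"), ("VP", ("V", "cheered")))
alice_scratched_bob = ("S", ("N", "Alice"),
                       ("VP", ("V", "scratched"), ("N", "Bob")))

print(tree_probability(alice_cheered))        # 1.0 * 0.5 * 0.6 * 0.5
print(tree_probability(alice_scratched_bob))  # 1.0 * 0.5 * 0.4 * 0.5 * 0.5
```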

The learning problem
Grammar G:
1.0  S → N VP
0.6  VP → V
0.4  VP → V N
0.5  N → “Alice”
0.5  N → “Bob”
0.5  V → “scratched”
0.5  V → “cheered”
Data D:
Alice scratched. Bob scratched.
Alice scratched Alice. Alice scratched Bob. Bob scratched Alice. Bob scratched Bob.
Alice cheered. Bob cheered.
Alice cheered Alice. Alice cheered Bob. Bob cheered Alice. Bob cheered Bob.

Grammar learning
• Search for G that maximizes P(G | D) ∝ P(D | G) P(G)
• Prior: P(G)
• Likelihood: P(D | G) = ∏ P(d | G), taken over the sentences d in D
  • assume that sentences in the data are independently generated from the grammar.
(Horning 1969; Stolcke 1994)
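Under the independence assumption, the log-likelihood of a data set is the sum of the log-probabilities of its sentences, where each sentence probability sums over all of its parses. The sketch below covers only this likelihood term for the small grammar and data set shown earlier; the prior is left out because its form is not given here. The recursive `inside` function is a simple illustration that assumes rules with one or two right-hand-side symbols and no unary cycles, not a general parser.

```python
import math
from functools import lru_cache

# Grammar from the slides: nonterminal -> list of (right-hand side, probability).
RULES = {
    "S": [(("N", "VP"), 1.0)],
    "VP": [(("V",), 0.6), (("V", "N"), 0.4)],
    "N": [(("Alice",), 0.5), (("Bob",), 0.5)],
    "V": [(("scratched",), 0.5), (("cheered",), 0.5)],
}

@lru_cache(maxsize=None)
def inside(symbol, words):
    """Probability that `symbol` expands to exactly the tuple `words`."""
    if symbol not in RULES:  # terminal word
        return 1.0 if words == (symbol,) else 0.0
    total = 0.0
    for rhs, p in RULES[symbol]:
        if len(rhs) == 1:
            total += p * inside(rhs[0], words)
        else:  # binary rule: try every split point
            left, right = rhs
            for k in range(1, len(words)):
                total += p * inside(left, words[:k]) * inside(right, words[k:])
    return total

def log_likelihood(sentences):
    """Sum of log P(sentence | G); raises ValueError on an unparseable sentence."""
    return sum(math.log(inside("S", tuple(s.rstrip(".").split())))
               for s in sentences)

DATA = [
    "Alice scratched.", "Bob scratched.",
    "Alice scratched Alice.", "Alice scratched Bob.",
    "Bob scratched Alice.", "Bob scratched Bob.",
    "Alice cheered.", "Bob cheered.",
    "Alice cheered Alice.", "Alice cheered Bob.",
    "Bob cheered Alice.", "Bob cheered Bob.",
]
print(log_likelihood(DATA))
```

A grammar search would add a log-prior term to this score and compare candidate grammars on the total.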

Experiment • Data: 100 sentences ... (Stolcke, 1994)

Generating grammar: Model solution:

Predicate logic
• A compositional language
• For all x and y, if y is the sibling of x then x is the sibling of y:
  ∀x ∀y Sibling(y, x) → Sibling(x, y)
• For all x, y and z, if x is the ancestor of y and y is the ancestor of z, then x is the ancestor of z:
  ∀x ∀y ∀z Ancestor(x, y) ∧ Ancestor(y, z) → Ancestor(x, z)

Learning a kinship theory Theory T: Sibling(victoria, arthur), Sibling(arthur,victoria), Ancestor(chris,victoria), Ancestor(chris,colin), Parent(chris,victoria), Parent(victoria,colin), Uncle(arthur,colin), Brother(arthur,victoria) Data D: … (Hinton, Quinlan, …)

Learning logical theories
• Search for T that maximizes P(T | D) ∝ P(D | T) P(T)
• Prior: P(T)
• Likelihood: P(D | T)
  • assume that the data include all facts that are true according to T
(Conklin and Witten; Kemp et al 08; Katz et al 08)

Theory-learning in the lab R(c,b) R(k,c) R(f,c) R(c,l) R(f,k) R(k,l) R(l,b) R(f,l) R(l,h) R(f,b) R(k,b) R(f,h) R(b,h) R(c,h) R(k,h) (cf Krueger 1979)

Theory-learning in the lab
Transitive:
R(f,k). R(k,c). R(c,l). R(l,b). R(b,h).
R(X,Z) ← R(X,Y), R(Y,Z).
Pairs in the relation: f,k f,c f,l f,b f,h k,c k,l k,b k,h c,l c,b c,h l,b l,h b,h
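The transitive theory is compact because five base facts plus one rule regenerate all fifteen observed facts. A minimal sketch that checks this by applying the rule until nothing new is derived; the function name `closure` is an assumption for illustration, while the facts are those listed on the slides.

```python
def closure(base_facts):
    """Apply R(X,Z) <- R(X,Y), R(Y,Z) repeatedly until no new facts appear."""
    facts = set(base_facts)
    while True:
        derived = {(x, z)
                   for (x, y1) in facts
                   for (y2, z) in facts
                   if y1 == y2} - facts
        if not derived:
            return facts
        facts |= derived

# The five base facts of the transitive theory.
base = [("f", "k"), ("k", "c"), ("c", "l"), ("l", "b"), ("b", "h")]

# The fifteen facts shown to participants.
observed = {
    ("c", "b"), ("k", "c"), ("f", "c"), ("c", "l"), ("f", "k"),
    ("k", "l"), ("l", "b"), ("f", "l"), ("l", "h"), ("f", "b"),
    ("k", "b"), ("f", "h"), ("b", "h"), ("c", "h"), ("k", "h"),
}

print(closure(base) == observed)
```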

[Figure: learning time, complexity (Goodman), and theory length (Kemp et al 08) compared across relational structures, including transitive (trans.) and transitive with exceptions (trans. excep.)]

Conclusion: Part 1 • Bayesian models can combine structured representations with statistical inference.

Outline
• Learning structured representations
  • grammars
  • logical theories
• Learning at multiple levels of abstraction

Vision (Han and Zhu, 2006)

Motor Control (Wolpert et al., 2003)

Causal learning
• Schema: chemicals → diseases → symptoms
• Causal models: asbestos → lung cancer → coughing, chest pain; mercury → minamata disease → muscle wasting
• Contingency data:
  • Patient 1: asbestos exposure, coughing, chest pain
  • Patient 2: mercury exposure, muscle wasting
(Kelley; Cheng; Waldmann)

Universal Grammar: hierarchical phrase structure grammars (e.g., CFG, HPSG, TAG)
→ Grammar, with P(grammar | UG)
→ Phrase structure, with P(phrase structure | grammar)
→ Utterance, with P(utterance | phrase structure)
→ Speech signal, with P(speech | utterance)

Hierarchical Bayesian model
U (Universal Grammar) → G (Grammar), with P(G | U)
G → s1, ..., s6 (Phrase structure), with P(s | G)
si → ui, for u1, ..., u6 (Utterance), with P(u | s)
A hierarchical Bayesian model specifies a joint distribution over all variables in the hierarchy:
P({ui}, {si}, G | U) = P({ui} | {si}) P({si} | G) P(G | U)

Top-down inferences
Hierarchy: U (Universal Grammar) → G (Grammar) → s1, ..., s6 (Phrase structure) → u1, ..., u6 (Utterance)
Infer {si} given {ui}, G:
P({si} | {ui}, G) ∝ P({ui} | {si}) P({si} | G)

Bottom-up inferences
Hierarchy: U (Universal Grammar) → G (Grammar) → s1, ..., s6 (Phrase structure) → u1, ..., u6 (Utterance)
Infer G given {si} and U:
P(G | {si}, U) ∝ P({si} | G) P(G | U)

Simultaneous learning at multiple levels
Hierarchy: U (Universal Grammar) → G (Grammar) → s1, ..., s6 (Phrase structure) → u1, ..., u6 (Utterance)
Infer G and {si} given {ui} and U:
P(G, {si} | {ui}, U) ∝ P({ui} | {si}) P({si} | G) P(G | U)
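The simultaneous-inference expression follows from the joint distribution of the hierarchical model by conditioning on the utterances: the normalizing term does not depend on G or {si}, so it can be dropped. The second line below, which marginalizes over phrase structures to score a grammar directly from utterances, is not stated on the slides; it is added here as a standard consequence of the same joint.

```latex
\begin{align*}
P(G, \{s_i\} \mid \{u_i\}, U)
  &= \frac{P(\{u_i\}, \{s_i\}, G \mid U)}{P(\{u_i\} \mid U)}
   \;\propto\; P(\{u_i\} \mid \{s_i\})\, P(\{s_i\} \mid G)\, P(G \mid U), \\
P(G \mid \{u_i\}, U)
  &= \sum_{\{s_i\}} P(G, \{s_i\} \mid \{u_i\}, U).
\end{align*}
```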

Word learning
• Words in general: whole-object bias, shape bias
• Individual words: car, monkey, duck, gavagai
• Data

A hierarchical Bayesian model
Physical knowledge → FH, FT → q1, q2, ..., q200 (Coin 1, Coin 2, ..., Coin 200) → d1 d2 d3 d4 (flips of each coin), with qi ~ Beta(FH, FT)
• Qualitative physical knowledge (symmetry) can influence estimates of continuous parameters (FH, FT).
• Explains why 10 flips of 200 coins are better than 2000 flips of a single coin: more informative about FH, FT.
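One way to see why many coins are informative about FH and FT is to integrate out each coin's q and score the hyperparameters directly. The sketch below uses the standard beta-binomial marginal likelihood; the function names and the head counts are hypothetical, chosen only to illustrate the comparison.

```python
import math

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal(heads, n, fh, ft):
    """log P(heads out of n flips | FH, FT), with q ~ Beta(FH, FT) integrated out."""
    log_choose = (math.lgamma(n + 1) - math.lgamma(heads + 1)
                  - math.lgamma(n - heads + 1))
    return log_choose + log_beta(fh + heads, ft + n - heads) - log_beta(fh, ft)

def log_evidence(head_counts, n, fh, ft):
    """Evidence for (FH, FT) from several coins, each flipped n times."""
    return sum(log_marginal(h, n, fh, ft) for h in head_counts)

def posterior_mean(heads, n, fh, ft):
    """Posterior mean of one coin's q after observing its flips."""
    return (fh + heads) / (fh + ft + n)

# Hypothetical data: eight coins, ten flips each, all close to fair.
counts = [5, 4, 6, 5, 5, 6, 4, 5]
print(log_evidence(counts, 10, 50, 50) > log_evidence(counts, 10, 1, 1))

# Once FH and FT are large, ten heads in a row barely move the estimate for a new coin.
print(posterior_mean(10, 10, 1, 1), posterior_mean(10, 10, 50, 50))
```

Flips of a single coin, however many, pin down that coin's q but say little about how q varies across coins, which is what FH and FT encode.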

Word Learning
“This is a dax.” “Show me the dax.”
• 24-month-olds show a shape bias
• 20-month-olds do not
(Landau, Smith & Gleitman)

Is the shape bias learned?
• Smith et al (2002) trained 17-month-olds on labels for 4 artificial categories: “lug”, “wib”, “zup”, “div”
• After 8 weeks of training, 19-month-olds show the shape bias: “This is a dax.” “Show me the dax.”

Learning about feature variability ? (cf. Goodman)

A hierarchical model
• Meta-constraints M
• Bags in general: color varies across bags but not much within bags
• Bag proportions: mostly yellow, mostly blue?, mostly green, mostly red, mostly brown, …
• Data …

A hierarchical Bayesian model
• Meta-constraints M
• Bags in general: within-bag variability = 0.1, overall color proportions = [0.4, 0.4, 0.2]
• Bag proportions: [1,0,0] [0,1,0] [1,0,0] [0,1,0] [.1,.1,.8] …
• Data: [6,0,0] [0,6,0] [6,0,0] [0,6,0] [0,0,1] …

A hierarchical Bayesian model
• Meta-constraints M
• Bags in general: within-bag variability = 5, overall color proportions = [0.4, 0.4, 0.2]
• Bag proportions: [.5,.5,0] [.5,.5,0] [.5,.5,0] [.5,.5,0] [.4,.4,.2] …
• Data: [3,3,0] [3,3,0] [3,3,0] [3,3,0] [0,0,1] …
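The contrast between the two settings can be illustrated with a Dirichlet-multinomial sketch. This assumes, as a modeling choice not confirmed by the slides, that each bag's color proportions are drawn from a Dirichlet distribution with parameters alpha * beta, where `alpha` stands for the within-bag variability value and `beta` for the overall color proportions; both names are hypothetical. The bag proportions shown on the slides are illustrative, so the sketch is not expected to reproduce them exactly.

```python
def posterior_mean_proportions(counts, alpha, beta):
    """Posterior mean of a bag's color proportions under an assumed
    Dirichlet(alpha * beta) prior, after observing `counts` draws per color."""
    n = sum(counts)
    return [(alpha * b + c) / (alpha + n) for b, c in zip(beta, counts)]

beta = [0.4, 0.4, 0.2]   # overall color proportions
new_bag = [0, 0, 1]      # a single draw of the third color from a new bag

# Small alpha: bags are nearly uniform in color, so one draw is decisive.
low = posterior_mean_proportions(new_bag, 0.1, beta)

# Larger alpha: bags resemble the overall proportions, so one draw shifts the estimate less.
high = posterior_mean_proportions(new_bag, 5, beta)

print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

The same single observation supports a confident generalization in the first setting and a much weaker one in the second, which is the role the higher-level knowledge plays.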

Shape of the Beta prior

A hierarchical Bayesian model Meta-constraints M Bags in general … Bag proportions Data …

Learning about feature variability Meta-constraints M Categories in general Individual categories Data

“wib” “lug” “zup” “div”

“wib” “lug” “zup” “div” “dax”

Model predictions Choice probability “dax” “Show me the dax:”

Where do priors come from? Meta-constraints M Categories in general Individual categories Data

Knowledge representation

Children discover structural form
• Children may discover that
  • Social networks are often organized into cliques
  • The months form a cycle
  • “Heavier than” is transitive
  • Category labels can be organized into hierarchies
