Bayesian models of human learning and inference Josh Tenenbaum MIT Department of Brain and Cognitive Sciences Computer Science and AI Lab (CSAIL). (http://web.mit.edu/cocosci/Talks/nips06-tutorial.ppt). Thanks to Tom Griffiths, Charles Kemp, Vikash Mansinghka.
The probabilistic revolution in AI • Principled and effective solutions for inductive inference from ambiguous data: • Vision • Robotics • Machine learning • Expert systems / reasoning • Natural language processing • Standard view: no necessary connection to how the human brain solves these problems.
Probabilistic inference in human cognition? • “People aren’t Bayesian” • Kahneman and Tversky (1970s-present): “heuristics and biases” research program. 2002 Nobel Prize in Economics. • Slovic, Fischhoff, and Lichtenstein (1976): “It appears that people lack the correct programs for many important judgmental tasks.... it may be argued that we have not had the opportunity to evolve an intellect capable of dealing conceptually with uncertainty.” • Stephen Jay Gould (1992): “Our minds are not built (for whatever reason) to work by the rules of probability.”
“Base rate neglect” • The probability of breast cancer is 1% for a woman at 40 who participates in a routine screening. If a woman has breast cancer, the probability is 80% that she will have a positive mammography. If a woman does not have breast cancer, the probability is 9.6% that she will also have a positive mammography. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer? • A. greater than 90% • B. between 70% and 90% • C. between 50% and 70% • D. between 30% and 50% • E. between 10% and 30% • F. less than 10% • [Slide labels: “95 out of 100 doctors”; “Correct answer”]
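The answer to the mammography question follows from Bayes' rule applied to the three numbers given in the problem. A minimal Python sketch (the function name `posterior_positive` and its parameter names are chosen here for illustration and do not come from the slides):

```python
def posterior_positive(base_rate, hit_rate, false_positive_rate):
    """Return P(condition | positive test) by Bayes' rule."""
    true_positives = base_rate * hit_rate
    false_positives = (1.0 - base_rate) * false_positive_rate
    return true_positives / (true_positives + false_positives)


# Numbers taken directly from the problem statement.
p_cancer = posterior_positive(base_rate=0.01, hit_rate=0.80,
                              false_positive_rate=0.096)
print(f"P(cancer | positive mammography) = {p_cancer:.3f}")  # about 0.078
```

The posterior is just under 8%, which falls in option F. The low base rate (1%) dominates: most positive results come from the much larger group of women without cancer.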
Availability biases in probability judgment • How likely is it that a randomly chosen word • ends in “g”? • ends in “ing”? • When buying a car, how much do you weigh your friend’s experience relative to consumer satisfaction surveys?
Probabilistic inference in human cognition? • “People aren’t Bayesian” • Kahneman and Tversky (1970s-present): “heuristics and biases” research program. 2002 Nobel Prize in Economics. • Psychology is often drawn towards the mind’s errors and apparent irrationalities. • But the computationally interesting question remains: How does the mind work so well?
Bayesian models of cognition • Visual perception [Weiss, Simoncelli, Adelson, Richards, Freeman, Feldman, Kersten, Knill, Maloney, Olshausen, Jacobs, Pouget, ...] • Language acquisition and processing [Brent, de Marcken, Niyogi, Klein, Manning, Jurafsky, Keller, Levy, Hale, Johnson, Griffiths, Perfors, Tenenbaum, …] • Motor learning and motor control [Ghahramani, Jordan, Wolpert, Kording, Kawato, Doya, Todorov, Shadmehr, …] • Associative learning [Dayan, Daw, Kakade, Courville, Touretzky, Kruschke, …] • Memory [Anderson, Schooler, Shiffrin, Steyvers, Griffiths, McClelland, …] • Attention [Mozer, Huber, Torralba, Oliva, Geisler, Yu, Itti, Baldi, …] • Categorization and concept learning [Anderson, Nosofsky, Rehder, Navarro, Griffiths, Feldman, Tenenbaum, Rosseel, Goodman, Kemp, Mansinghka, …] • Reasoning [Chater, Oaksford, Sloman, McKenzie, Heit, Tenenbaum, Kemp, …] • Causal inference [Waldmann, Sloman, Steyvers, Griffiths, Tenenbaum, Yuille, …] • Decision making and theory of mind [Lee, Stankiewicz, Rao, Baker, Goodman, Tenenbaum, …]
Learning concepts from examples • Word learning “horse” “horse” “horse”
Learning concepts from examples • “tufa” “tufa” “tufa”
Everyday inductive leaps How can people learn so much about the world . . . • Kinds of objects and their properties • The meanings of words, phrases, and sentences • Cause-effect relations • The beliefs, goals and plans of other people • Social structures, conventions, and rules . . . from such limited evidence?
Contributions of Bayesian models • Principled quantitative models of human behavior, with broad coverage and a minimum of free parameters and ad hoc assumptions. • Explain how and why human learning and reasoning works, in terms of (approximations to) optimal statistical inference in natural environments. • A framework for studying people’s implicit knowledge about the structure of the world: how it is structured, used, and acquired. • A two-way bridge to state-of-the-art AI and machine learning.
Marr’s Three Levels of Analysis • Computation: “What is the goal of the computation, why is it appropriate, and what is the logic of the strategy by which it can be carried out?” • Algorithm: Cognitive psychology • Implementation: Neurobiology
What about those errors? • The human mind is not a universal Bayesian engine. • But the mind does appear adapted to solve important real-world inference problems in approximately Bayesian ways, e.g. • Predicting everyday events • Causal learning and reasoning • Learning concepts from examples • As with perceptual tasks, adults and even young children solve these problems mostly unconsciously, effortlessly, and successfully.
Technical themes • Inference in probabilistic models • Role of priors, explaining away. • Learning in graphical models • Parameter learning, structure learning. • Bayesian model averaging • Being Bayesian over network structures. • Bayesian Occam’s razor • Trade off model complexity against data fit.
Technical themes • Structured probabilistic models • Grammars, first-order logic, relational schemas. • Hierarchical Bayesian models • Acquire abstract knowledge, supports transfer. • Nonparametric Bayes • Flexible models that grow in complexity as new data warrant. • Tractable approximate inference • Markov chain Monte Carlo (MCMC), Sequential Monte Carlo (particle filtering).
Outline • Predicting everyday events • Causal learning and reasoning • Learning concepts from examples
Basics of Bayesian inference • Bayes’ rule: $P(h \mid d) = \frac{P(d \mid h)\, P(h)}{\sum_{h'} P(d \mid h')\, P(h')}$ • An example • Data: John is coughing • Some hypotheses: 1. John has a cold 2. John has lung cancer 3. John has a stomach flu • Likelihood P(d|h) favors 1 and 2 over 3 • Prior probability P(h) favors 1 and 3 over 2 • Posterior probability P(h|d) favors 1 over 2 and 3
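The coughing example can be made concrete with a small Python sketch. The slides give only the qualitative ordering of the likelihoods and priors; every number below is an illustrative assumption chosen to respect that ordering, and the helper name `bayes_posterior` is likewise hypothetical.

```python
def bayes_posterior(prior, likelihood):
    """Posterior P(h | d) over a finite hypothesis set, by Bayes' rule."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalized.values())  # P(d), summed over hypotheses
    return {h: p / evidence for h, p in unnormalized.items()}


# ASSUMED values: the prior favors cold and stomach flu over lung cancer,
# and the likelihood of coughing favors cold and lung cancer over stomach flu.
prior = {"cold": 0.60, "lung cancer": 0.05, "stomach flu": 0.35}
likelihood = {"cold": 0.80, "lung cancer": 0.80, "stomach flu": 0.10}

posterior = bayes_posterior(prior, likelihood)
for hypothesis, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"{hypothesis:12s} {p:.3f}")
```

Only the cold scores well on both the prior and the likelihood, so it dominates the posterior even though lung cancer explains the cough equally well.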
Bayesian inference in perception and sensorimotor integration (Weiss, Simoncelli & Adelson 2002) (Kording & Wolpert 2004)
Memory retrieval as Bayesian inference (Anderson & Schooler, 1991) • [Figure: three panels (power law of forgetting; spacing effects in forgetting; additive effects of practice & delay). Axis labels: mean # recalled, log memory strength, log delay (hours), log delay (seconds), retention interval (days).]
Memory retrieval as Bayesian inference (Anderson & Schooler, 1991) • For each item in memory, estimate the probability that it will be useful in the present context. • Use priors based on the statistics of natural information sources.
Memory retrieval as Bayesian inference (Anderson & Schooler, 1991) • [Figure: three panels (power law of forgetting; spacing effects in forgetting; additive effects of practice & delay). Axis labels: log need odds, log # days since last occurrence. Data: New York Times; cf. email sources, child-directed speech.]
Everyday prediction problems (Griffiths & Tenenbaum, 2006) • You read about a movie that has made $60 million to date. How much money will it make in total? • You see that something has been baking in the oven for 34 minutes. How long until it’s ready? • You meet someone who is 78 years old. How long will they live? • Your friend quotes to you from line 17 of his favorite poem. How long is the poem? • You see taxicab #107 pull up to the curb in front of the train station. How many cabs in this city?
Making predictions • You encounter a phenomenon that has existed for $t_{past}$ units of time. How long will it continue into the future? (i.e. what’s $t_{total}$?) • We could replace “time” with any other quantity that ranges from 0 to some unknown upper limit.
Bayesian inference • $P(t_{total} \mid t_{past}) \propto P(t_{past} \mid t_{total})\, P(t_{total})$ (posterior probability ∝ likelihood × prior)
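Bayesian inference • $P(t_{total} \mid t_{past}) \propto P(t_{past} \mid t_{total})\, P(t_{total}) \propto \frac{1}{t_{total}} \cdot \frac{1}{t_{total}}$ (posterior probability ∝ likelihood × prior) • Likelihood: assume a random sample ($0 < t_{past} < t_{total}$), giving $1/t_{total}$ • Prior: “uninformative” (e.g., Jeffreys, Jaynes), proportional to $1/t_{total}$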
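Bayesian inference • $P(t_{total} \mid t_{past}) \propto \frac{1}{t_{total}} \cdot \frac{1}{t_{total}}$ (posterior probability ∝ random sampling × “uninformative” prior) • [Figure: posterior $P(t_{total} \mid t_{past})$ plotted against $t_{total}$, with $t_{past}$ marked on the axis]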
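Bayesian inference • $P(t_{total} \mid t_{past}) \propto \frac{1}{t_{total}} \cdot \frac{1}{t_{total}}$ (posterior probability ∝ random sampling × “uninformative” prior) • [Figure: posterior $P(t_{total} \mid t_{past})$ plotted against $t_{total}$, with $t_{past}$ marked on the axis] • Best guess for $t_{total}$: $t$ such that $P(t_{total} > t \mid t_{past}) = 0.5$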
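Bayesian inference • $P(t_{total} \mid t_{past}) \propto \frac{1}{t_{total}} \cdot \frac{1}{t_{total}}$ (posterior probability ∝ random sampling × “uninformative” prior) • [Figure: posterior $P(t_{total} \mid t_{past})$ plotted against $t_{total}$, with $t_{past}$ marked on the axis] • Yields Gott’s Rule: $P(t_{total} > t \mid t_{past}) = 0.5$ when $t = 2\,t_{past}$, i.e., best guess for $t_{total} = 2\,t_{past}$.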
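Gott's Rule can be checked in a few lines. With a posterior proportional to 1/t_total² on t_total ≥ t_past, normalizing gives the density t_past / t_total², so P(t_total > t | t_past) = t_past / t, which equals 0.5 at t = 2 t_past. The Python sketch below states that closed form and confirms it by sampling; the function names and the example value of 34 minutes are illustrative choices, not part of the slides.

```python
import random
import statistics


def survival(t, t_past):
    """P(t_total > t | t_past) under the 1/t_total^2 posterior."""
    return 1.0 if t <= t_past else t_past / t


def sample_t_total(t_past, rng):
    """Draw from the posterior by inverting the CDF F(t) = 1 - t_past / t."""
    u = rng.random()              # uniform on [0, 1)
    return t_past / (1.0 - u)     # 1 - u is in (0, 1], so no division by zero


t_past = 34.0                     # e.g. minutes the cake has been in the oven
rng = random.Random(0)
samples = [sample_t_total(t_past, rng) for _ in range(100_000)]
sample_median = statistics.median(samples)

print("closed-form median:", 2 * t_past)
print("sampled median:    ", round(sample_median, 1))
```

The sampled median should land close to 2 × t_past; the distribution has a heavy right tail, which is why the median rather than the mean serves as the best guess.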
Evaluating Gott’s Rule • You read about a movie that has made $78 million to date. How much money will it make in total? • “$156 million” seems reasonable. • You meet someone who is 35 years old. How long will they live? • “70 years” seems reasonable. • Not so simple: • You meet someone who is 78 years old. How long will they live? • You meet someone who is 6 years old. How long will they live?
The effects of priors • Different kinds of priors $P(t_{total})$ are appropriate in different domains. • e.g., wealth, contacts • e.g., height, lifespan • [Gott: $P(t_{total}) \propto t_{total}^{-1}$]
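To see how much the choice of prior matters, the sketch below computes the posterior median of t_total on a grid, keeping the random-sampling likelihood 1/t_total but swapping in a bell-shaped prior for lifespans. The Gaussian form and its parameters (mean 75, standard deviation 16 years) are assumptions made for illustration, as are the function names and the grid settings.

```python
import math


def gaussian_prior(mean, sd):
    """Unnormalized Gaussian prior density over t_total (ASSUMED form)."""
    def density(t):
        return math.exp(-0.5 * ((t - mean) / sd) ** 2)
    return density


def posterior_median(t_past, prior, t_max, step=0.01):
    """Median of P(t_total | t_past), proportional to prior(t) * (1 / t)
    on the interval t_past <= t <= t_max, using a midpoint grid."""
    n = int((t_max - t_past) / step)
    ts = [t_past + (i + 0.5) * step for i in range(n)]
    weights = [prior(t) / t for t in ts]   # prior times likelihood 1/t
    half = 0.5 * sum(weights)
    running = 0.0
    for t, w in zip(ts, weights):
        running += w
        if running >= half:
            return t
    return ts[-1]


lifespan_prior = gaussian_prior(mean=75.0, sd=16.0)   # illustrative values
predictions = {age: posterior_median(age, lifespan_prior, t_max=150.0)
               for age in (6, 35, 78)}

for age, guess in predictions.items():
    print(f"age {age:2d}: predicted lifespan {guess:5.1f}  (Gott: {2 * age})")
```

Under this assumed prior the prediction for a 6-year-old stays near the prior's center rather than 12, and the prediction for a 78-year-old is only modestly above 78 rather than 156, which matches the intuition that doubling is wrong in this domain.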
Evaluating human predictions • Different domains with different priors: • A movie has made $60 million • Your friend quotes from line 17 of a poem • You meet a 78 year old man • A movie has been running for 55 minutes • A U.S. congressman has served for 11 years • A cake has been in the oven for 34 minutes • Use 5 values of $t_{past}$ for each. • People predict $t_{total}$.
You learn that in ancient Egypt, there was a great flood in the 11th year of a pharaoh’s reign. How long did he reign?
You learn that in ancient Egypt, there was a great flood in the 11th year of a pharaoh’s reign. How long did he reign? How long did the typical pharaoh reign in ancient Egypt?
Summary: prediction • Predictions about the extent or magnitude of everyday events follow Bayesian principles. • Contrast with Bayesian inference in perception, motor control, memory: no “universal priors” here. • Predictions depend rationally on priors that are appropriately calibrated for different domains. • Form of the prior (e.g., power-law or exponential) • Specific distribution given that form (parameters) • Non-parametric distribution when necessary. • In the absence of concrete experience, priors may be generated by qualitative background knowledge.
Outline • Predicting everyday events • Causal learning and reasoning • Learning concepts from examples
Bayesian networks • Nodes: variables • Links: direct dependencies • Each node has a conditional probability distribution • Data: observations of X1, ..., X4 • Four random variables: X1 coughing, X2 high body temperature, X3 flu, X4 lung cancer • [Diagram: X3 → X1, X4 → X1, X3 → X2, annotated with P(x3), P(x4), P(x1|x3, x4), P(x2|x3)]
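The four-variable network can be written out directly: the joint distribution is the product P(x3) P(x4) P(x1|x3, x4) P(x2|x3), and any query can be answered by summing the joint over the unobserved variables. The Python sketch below does this by brute-force enumeration. The slides give no numbers for the conditional probability tables, so every probability here is an assumed value, and the names `joint` and `query` are hypothetical.

```python
from itertools import product

# ASSUMED parameters; only the graph structure comes from the slides.
P_FLU = 0.10                                   # P(x3)
P_CANCER = 0.01                                # P(x4)
P_COUGH = {(True, True): 0.95, (True, False): 0.80,    # P(x1 | flu, cancer)
           (False, True): 0.70, (False, False): 0.05}
P_FEVER = {True: 0.90, False: 0.05}            # P(x2 | flu)

NAMES = ("cough", "fever", "flu", "cancer")


def bernoulli(p, value):
    """Probability that a binary variable with P(True) = p takes `value`."""
    return p if value else 1.0 - p


def joint(cough, fever, flu, cancer):
    """Joint probability as the product of each node's conditional."""
    return (bernoulli(P_FLU, flu) * bernoulli(P_CANCER, cancer)
            * bernoulli(P_COUGH[(flu, cancer)], cough)
            * bernoulli(P_FEVER[flu], fever))


def query(target, evidence):
    """P(target = True | evidence) by enumerating all 16 worlds."""
    numerator = denominator = 0.0
    for values in product([True, False], repeat=len(NAMES)):
        world = dict(zip(NAMES, values))
        if any(world[name] != value for name, value in evidence.items()):
            continue
        p = joint(**world)
        denominator += p
        if world[target]:
            numerator += p
    return numerator / denominator


p_cancer_cough = query("cancer", {"cough": True})
p_cancer_cough_flu = query("cancer", {"cough": True, "flu": True})
print(f"P(cancer | cough)      = {p_cancer_cough:.3f}")
print(f"P(cancer | cough, flu) = {p_cancer_cough_flu:.3f}")
```

With these assumed numbers, learning that the patient has flu lowers the probability of lung cancer given a cough. This is the "explaining away" pattern: two causes compete to account for the same effect.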
Causal Bayesian networks (Pearl; Glymour & Cooper) • Nodes: variables • Links: causal mechanisms • Each node has a conditional probability distribution • Data: observations of and interventions on X1, ..., X4 • Four random variables: X1 coughing, X2 high body temperature, X3 flu, X4 lung cancer • [Diagram: X3 → X1, X4 → X1, X3 → X2, annotated with P(x3), P(x4), P(x1|x3, x4), P(x2|x3)]
Inference in causal graphical models • Explaining away or “discounting” in social reasoning (Kelley; Morris & Larrick) • “Screening off” in intuitive causal reasoning (Waldmann, Rehder & Burnett, Blok & Sloman, Gopnik & Sobel): P(c|b) vs. P(c|b, a), P(c|b, not a) • Better in chains than common-cause structures; common-cause better if mechanisms clearly independent • Understanding and predicting the effects of interventions (Sloman & Lagnado; Gopnik & Schulz) • [Diagrams: causal structures over the nodes A, B, C]
Learning graphical models • Structure learning: what causes what? • Parameter learning: how do causes work? • [Diagrams: two copies of the network over X1, ..., X4, annotated with P(x3), P(x4), P(x1|x3, x4), P(x2|x3)]
Bayesian learning of causal structure • Data d • Causal hypotheses h [Diagrams: two candidate networks over X1, X2, X3, X4] • 1. What is the most likely network h given observed data d? • 2. How likely is there to be a link X4 → X2? (Bayesian model averaging)
Bayesian Occam’s Razor (MacKay, 2003; Ghahramani tutorials) • For any model M, $\sum_{d} p(D = d \mid M) = 1$ • Law of “conservation of belief”: A model that can predict many possible data sets must assign each of them low probability. • [Figure: p(D = d | M) plotted over all possible data sets d, for models M1 and M2]
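A toy case makes the conservation-of-belief argument concrete. The Python sketch below compares two hypothetical models of a coin-flip sequence, neither of which appears in the slides: a simple model (a fair coin) and a flexible model (unknown bias with a uniform prior). For a sequence with k heads in n flips the flexible model's marginal likelihood is k!(n−k)!/(n+1)!, a standard Beta-Bernoulli result.

```python
from itertools import product
from math import factorial


def evidence_fair(seq):
    """p(D = seq | M1): every sequence of length n has probability 0.5^n."""
    return 0.5 ** len(seq)


def evidence_unknown_bias(seq):
    """p(D = seq | M2): bias integrated out under a uniform prior."""
    n, k = len(seq), sum(seq)
    return factorial(k) * factorial(n - k) / factorial(n + 1)


# Both models spread exactly one unit of probability over all data sets.
all_sequences = list(product([0, 1], repeat=6))
total_fair = sum(evidence_fair(s) for s in all_sequences)
total_flexible = sum(evidence_unknown_bias(s) for s in all_sequences)
print("sum over data sets:", total_fair, round(total_flexible, 12))

balanced = (1, 0) * 5        # 5 heads in 10 flips
extreme = (1,) * 10          # 10 heads in 10 flips
for name, seq in (("balanced", balanced), ("extreme", extreme)):
    ratio = evidence_fair(seq) / evidence_unknown_bias(seq)
    print(f"{name}: p(d|M1) / p(d|M2) = {ratio:.3g}")
```

The flexible model can account for the all-heads sequence, but it pays for that flexibility on balanced data, where the simpler model assigns higher probability. No explicit complexity penalty is needed; the trade-off falls out of normalization.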
Learning causation from contingencies • e.g., “Does injecting this chemical cause mice to express a certain gene?” • Contingency table of counts: E present (e+): a when C present (c+), c when C absent (c-); E absent (e-): b when C present (c+), d when C absent (c-) • Subjects judge the extent to which C causes E (rate on a scale from 0 to 100)
Two models of causal judgment • Delta-P (Jenkins & Ward, 1965): $\Delta P = P(e^+ \mid c^+) - P(e^+ \mid c^-)$ • Power PC (Cheng, 1997): $\text{power} = \frac{\Delta P}{1 - P(e^+ \mid c^-)}$
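Both measures are simple functions of the four contingency counts. Delta-P is the difference P(e+|c+) − P(e+|c−), and causal power divides that difference by 1 − P(e+|c−). The Python sketch below assumes the usual cell labelling (a = C present and E present, b = C present and E absent, c = C absent and E present, d = C absent and E absent); the function names and the example counts are illustrative.

```python
def delta_p(a, b, c, d):
    """Delta-P = P(e+ | c+) - P(e+ | c-)."""
    return a / (a + b) - c / (c + d)


def causal_power(a, b, c, d):
    """Generative causal power = Delta-P / (1 - P(e+ | c-)).
    Undefined when the effect always occurs without the cause."""
    p_effect_without_cause = c / (c + d)
    if p_effect_without_cause == 1.0:
        raise ValueError("power is undefined when P(e+ | c-) = 1")
    return delta_p(a, b, c, d) / (1.0 - p_effect_without_cause)


# Two made-up data sets of 8 mice per condition.
low_base_rate = (6, 2, 2, 6)     # P(e+|c+) = 0.75, P(e+|c-) = 0.25
high_base_rate = (8, 0, 6, 2)    # P(e+|c+) = 1.00, P(e+|c-) = 0.75

for counts in (low_base_rate, high_base_rate):
    print(counts, "Delta-P =", round(delta_p(*counts), 3),
          "power =", round(causal_power(*counts), 3))
```

The second data set has the smaller Delta-P (0.25 versus 0.5) but the larger power (1.0 versus about 0.667): when the effect is already common without the cause, there is little room left for the cause to show its influence, and power corrects for that ceiling.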
Judging the probability that C → E (Buehner & Cheng, 1997; 2003) • Independent effects of both ΔP and causal power. • At ΔP = 0, judgments decrease with base rate (“frequency illusion”). • [Figure: people’s judgments compared with the predictions of ΔP and power, for ΔP = 0.00, 0.25, 0.50, 0.75, 1.00]
Learning causal strength (parameter learning) • Assume this causal structure: B → E with strength w0, C → E with strength w1 • ΔP and causal power are maximum likelihood estimates of the strength parameter w1, under different parameterizations for P(E|B,C): • Linear: ΔP • Noisy-OR: causal power
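The maximum-likelihood claim can be checked numerically. In the Python sketch below the background cause B is assumed always present, so P(e+|c−) = w0 under both parameterizations, while P(e+|c+) is w0 + w1 in the linear model and 1 − (1 − w0)(1 − w1) in the noisy-OR model. A coarse grid search stands in for a proper optimizer; the counts, the grid resolution, and the function names are illustrative assumptions.

```python
import math


def log_likelihood(w0, w1, counts, noisy_or):
    """Log-likelihood of contingency counts (a, b, c, d) given strengths."""
    a, b, c, d = counts
    p_absent = w0                                   # P(e+ | c-)
    if noisy_or:
        p_present = 1.0 - (1.0 - w0) * (1.0 - w1)   # P(e+ | c+), noisy-OR
    else:
        p_present = w0 + w1                         # P(e+ | c+), linear
    if not 0.0 < p_present < 1.0:
        return -math.inf
    return (a * math.log(p_present) + b * math.log(1.0 - p_present)
            + c * math.log(p_absent) + d * math.log(1.0 - p_absent))


def fit(counts, noisy_or):
    """Grid-search MLE over w0, w1 in {0.01, ..., 0.99} (generative only)."""
    grid = [i / 100 for i in range(1, 100)]
    return max(((w0, w1) for w0 in grid for w1 in grid),
               key=lambda w: log_likelihood(w[0], w[1], counts, noisy_or))


counts = (6, 2, 2, 6)     # made-up data: Delta-P = 0.5, power = 2/3
linear_w0, linear_w1 = fit(counts, noisy_or=False)
noisy_w0, noisy_w1 = fit(counts, noisy_or=True)
print("linear   w1 =", linear_w1)
print("noisy-OR w1 =", noisy_w1)
```

Up to the grid resolution, the linear fit recovers ΔP and the noisy-OR fit recovers causal power, while both estimate w0 as P(e+|c−). The grid restricts w1 to positive values, so this sketch covers generative causes only.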