Cognitive Computer Vision Kingsley Sage khs20@sussex.ac.uk and Hilary Buxton hilaryb@sussex.ac.uk Prepared under ECVision Specific Action 8-3 http://www.ecvision.org
Lecture 13 • Learning Bayesian Belief Networks • Taxonomy of methods • Learning BBNs for the fully observable data and known structure case
So why are BBNs relevant to Cognitive CV? • They provide a well-founded methodology for reasoning with uncertainty • These methods are the basis for our model of perception guided by expectation • We can develop well-founded methods of learning rather than being stuck with hand-coded models
Reminder: What is a BBN? [Figure: an example Directed Acyclic Graph over nodes B, A, O, C and N] • A compact representation of the joint probability distribution • Each variable is represented as a node • Conditional independence assumptions are encoded using a set of arcs • Different types of graph exist. The one shown is a Directed Acyclic Graph (DAG)
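To make the "compact representation" point concrete (this equation is my addition, not from the slide): a BBN over variables X_1,…,X_n encodes the joint as a product of per-node CPDs, each conditioned only on that node's parents Pa(X_i):

$$P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\big(X_i \mid \mathrm{Pa}(X_i)\big)$$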
Why is learning important in the context of BBNs? • Knowledge acquisition can be an expensive process • Experts may not be readily available (scarce knowledge) or simply not exist • But you might have a lot of data from (say) case studies • Learning allows us to construct BBN models from the data and in the process gain insight into the nature of the problem domain
The process of learning [Diagram: training data (which may be full or partial) and the model structure (if known) are fed into the learning process]
What do we mean by “partial” data? • Training data where there are missing values, e.g. for a discrete-valued BBN with 3 nodes (A, B and O), some records may leave one or more of the three values unobserved
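A minimal illustration of what such a partially observed data set might look like for the three-node example (my sketch, not from the slide; None marks a missing value):

```python
# Hypothetical records for a 3-node discrete BBN with nodes A, B and O.
# None marks a value that was not observed, which is what makes the data "partial".
training_data = [
    {"A": True,  "B": False, "O": True},   # fully observed record
    {"A": True,  "B": None,  "O": False},  # B missing
    {"A": None,  "B": True,  "O": True},   # A missing
    {"A": False, "B": False, "O": None},   # O missing
]
# Fully observable data would be the same list with no None entries.
```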
What do we mean by “known” and “unknown” structure? [Figure: known structure shows the 3-node graph A, O, B with its arcs given; unknown structure shows the same nodes with no arcs specified]
Taxonomy of learning methods • Methods are organised along two axes: observability of the data (full or partial) and whether the model structure is known or unknown, giving four cases • In this lecture we will look at the full observability and known model structure case in detail • In the next lecture we will take an overview of the other three cases
Full observability & known structure: getting the notation right • The model parameters (CPDs) are represented as θ (example later) • Training data set D • We want to find the parameters θ that maximise P(θ|D) • The likelihood function L(θ:D) is P(D|θ)
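Written out for M independent training records x[1],…,x[M] (my addition, following the standard definition used in the Koller and Friedman tutorial cited at the end of this lecture):

$$L(\theta : D) = P(D \mid \theta) = \prod_{m=1}^{M} P\big(x[m] \mid \theta\big)$$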
Full observability & known structure: getting the notation right [Figure: an example training data set D, shown as a table of observed values for the nodes A, B and O]
Factorising the likelihood expression [Figure: the example 3-node network A, B, O with the likelihood written in factorised form]
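A sketch of the factorisation being referred to (my reconstruction; it assumes the example arcs are A→O and B→O, which the surrounding slides suggest but do not state explicitly): the likelihood splits first across independent records and then across nodes,

$$L(\theta : D) = \prod_{m=1}^{M} P\big(a[m], b[m], o[m] \mid \theta\big)
= \prod_{m=1}^{M} P\big(a[m] \mid \theta_A\big)\, P\big(b[m] \mid \theta_B\big)\, P\big(o[m] \mid a[m], b[m], \theta_O\big)$$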
Decomposition in general • Because the likelihood decomposes into one term per node, all the parameters for each node can be estimated separately
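A minimal sketch of what "estimated separately" means in the fully observed, known-structure case (my code, not from the slides): for each node we only need counts of its values alongside its parents' values.

```python
from collections import Counter

def estimate_cpds(data, parents):
    """Maximum likelihood CPD estimation for a fully observed discrete BBN.

    data    : list of dicts mapping node name -> observed value (no missing values)
    parents : dict mapping node name -> tuple of parent node names (the known structure)
    Returns node -> {(parent_values, node_value): probability}.
    """
    cpds = {}
    for node, pa in parents.items():
        joint = Counter()     # counts of (parent values, node value)
        marginal = Counter()  # counts of parent values alone
        for record in data:
            pa_vals = tuple(record[p] for p in pa)
            joint[(pa_vals, record[node])] += 1
            marginal[pa_vals] += 1
        cpds[node] = {key: n / marginal[key[0]] for key, n in joint.items()}
    return cpds

# Example using the assumed A -> O <- B structure from these notes:
data = [{"A": True, "B": False, "O": True},
        {"A": True, "B": False, "O": False},
        {"A": False, "B": True, "O": True}]
print(estimate_cpds(data, {"A": (), "B": (), "O": ("A", "B")}))
```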
Example: estimating the parameter for a root node • Let’s say our training data D contains these values for A: {T,F,T,T,F,T,T,T} • We represent our single parameter θ as the probability that a=T • The likelihood for the sequence is L(θ:D) = θ·(1−θ)·θ·θ·(1−θ)·θ·θ·θ = θ⁶(1−θ)²
So what about the prior on θ? • We have an expression for P(a[1],…,a[M] | θ); all we need to do now is say something about P(θ) • If all values of θ were equally likely at the outset, then the θ that maximises P(θ|a[1],…,a[M]) is the MAXIMUM LIKELIHOOD ESTIMATE (MLE), which for our example is θ = 0.75, i.e. P(a=T) = 0.75
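The value 0.75 follows from maximising the likelihood above (my working, not on the slide): setting the derivative of the log-likelihood to zero,

$$\frac{d}{d\theta}\big[\,6 \ln \theta + 2 \ln (1-\theta)\,\big] = \frac{6}{\theta} - \frac{2}{1-\theta} = 0 \;\Rightarrow\; \theta = \frac{6}{8} = 0.75$$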
So what about the prior on θ? • If P(θ) is not uniform, we need to take it into account when computing our estimate for a model parameter • In that case the θ that maximises P(θ|x[1],…,x[M]) is a MAXIMUM A POSTERIORI (MAP) estimate • There are many different forms of prior; one of the more common ones in this application is the DIRICHLET prior …
The Dirichlet prior [Figure: plot of the prior density p(θ) for a Dirichlet(α_T, α_F) prior over the parameter θ]
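For a two-valued node the Dirichlet prior reduces to a Beta density. A sketch of the standard result (my addition; α_T and α_F are the prior hyperparameters, N_T and N_F the observed counts of T and F):

$$p(\theta) \propto \theta^{\alpha_T - 1}(1-\theta)^{\alpha_F - 1}, \qquad
\hat{\theta}_{\mathrm{MAP}} = \frac{N_T + \alpha_T - 1}{N_T + N_F + \alpha_T + \alpha_F - 2}$$

With α_T = α_F = 1 (a uniform prior) this reduces to the maximum likelihood estimate.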
Semantic priors • If the training data D is sorted into known classes, the priors can be estimated beforehand. These are called “semantic priors” • This involves an element of hand coding and loses the advantage of gaining some insight into the problem domain • It does give the advantage of mapping onto expert knowledge of the classes in the problem
Summary • Estimation relies on sufficient statistics • For the ML estimate for discrete-valued nodes, the sufficient statistics are the counts #(value) of how often each value occurs in the data • For the MAP estimate, we also have to account for the prior
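To make the summary concrete, a minimal sketch (my own code, assuming a Beta/Dirichlet prior expressed as pseudo-counts alpha_t, alpha_f) of the two estimates for the single-parameter root-node example:

```python
def ml_estimate(values):
    """Maximum likelihood estimate of P(a=T) from a list of booleans: count and normalise."""
    return sum(values) / len(values)

def map_estimate(values, alpha_t=2, alpha_f=2):
    """MAP estimate of P(a=T) under a Beta(alpha_t, alpha_f) prior.

    alpha_t and alpha_f are illustrative hyperparameters (pseudo-counts);
    a uniform prior (alpha_t = alpha_f = 1) recovers the ML estimate.
    """
    n_true = sum(values)
    return (n_true + alpha_t - 1) / (len(values) + alpha_t + alpha_f - 2)

# The running example from the slides: {T,F,T,T,F,T,T,T}
data = [True, False, True, True, False, True, True, True]
print(ml_estimate(data))   # 0.75
print(map_estimate(data))  # 0.7 (pulled towards 0.5 by the prior)
```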
Next time … • Overview of methods for learning BBNs: • Full data and unknown structure • Partial data and known structure • Partial data and unknown structure • Excellent tutorial by Koller and Friedman: www.cs.huji.ac.il/~nir/Nips01-Tutorial/ • Some of today’s slides were adapted from that tutorial