CMSC 671 Fall 2001

CMSC 671Fall 2001 Class #25-26 – Tuesday, November 27 / Thursday, November 29

Today’s class • Neural networks • Bayesian learning

Machine Learning: Neural and Bayesian Chapter 19 Some material adapted from lecture notes by Lise Getoor and Ron Parr

Neural function • Brain function (thought) occurs as the result of the firing of neurons • Neurons connect to each other through synapses, which propagate action potential (electrical impulses) by releasing neurotransmitters • Synapses can be excitatory (potential-increasing) or inhibitory (potential-decreasing), and have varying activation thresholds • Learning occurs as a result of the synapses’ plasticicity: They exhibit long-term changes in connection strength • There are about 1011 neurons and about 1014 synapses in the human brain

Biology of a neuron

Brain structure • Different areas of the brain have different functions • Some areas seem to have the same function in all humans (e.g., Broca’s region); the overall layout is generally consistent • Some areas are more plastic, and vary in their function; also, the lower-level structure and function vary greatly • We don’t know how different functions are “assigned” or acquired • Partly the result of the physical layout / connection to inputs (sensors) and outputs (effectors) • Partly the result of experience (learning) • We really don’t understand how this neural structure leads to what we perceive as “consciousness” or “thought” • Our neural networks are not nearly as complex or intricate as the actual brain structure

Comparison of computing power • Computers are way faster than neurons… • But there are a lot more neurons than we can reasonably model in modern digital computers, and they all fire in parallel • Neural networks are designed to be massively parallel • The brain is effectively a billion times faster

Neural networks • Neural networks are made up of nodes or units, connected by links • Each link has an associated weight and activation level • Each node has an input function (typically summing over weighted inputs), an activation function, and an output

Layered feed-forward network Output units Hidden units Input units

Neural unit

“Executing” neural networks • Input units are set by some exterior function (think of these as sensors), which causes their output links to be activated at the specified level • Working forward through the network, the input function of each unit is applied to compute the input value • Usually this is just the weighted sum of the activation on the links feeding into this node • The activation function transforms this input function into a final value • Typically this is a nonlinear function, often a sigmoid function corresponding to the “threshold” of that node

Learning neural networks • Backpropagation • Cascade correlation: adding hidden units Take it away, Chih-Yun! Next up: Sohel

B E A C Learning Bayesian networks • Given training set • Find B that best matches D • model selection • parameter estimation Inducer Data D

Parameter estimation • Assume known structure • Goal: estimate BN parameters q • entries in local probability models, P(X | Parents(X)) • A parameterization q is good if it is likely to generate the observed data: • Maximum Likelihood Estimation (MLE) Principle: Choose q* so as to maximize L i.i.d. samples

Parameter estimation in BNs • The likelihood decomposes according to the structure of the network → we get a separate estimation task for each parameter • The MLE (maximum likelihood estimate) solution: • for each value x of a node X • and each instantiation u of Parents(X) • Just need to collect the counts for every combination of parents and children observed in the data • MLE is equivalent to an assumption of a uniform prior over parameter values sufficient statistics

Sufficient statistics: Example Moon-phase • Why are the counts sufficient? Light-level Earthquake Burglary Alarm

Model selection Goal: Select the best network structure, given the data Input: • Training data • Scoring function Output: • A network that maximizes the score

Same key property: Decomposability Score(structure) = Si Score(family of Xi) Structure selection: Scoring • Bayesian: prior over parameters and structure • get balance between model complexity and fit to data as a byproduct • Score (G:D) = log P(G|D)  log [P(D|G) P(G)] • Marginal likelihood just comes from our parameter estimates • Prior on structure can be any measure we want; typically a function of the network complexity Marginal likelihood Prior

DeleteEA AddEC B E Δscore(C) Δscore(A) A B E C A C B E B E A A ReverseEA Δscore(A) C C Heuristic search

DeleteEA DeleteEA AddEC B E Δscore(C) Δscore(A) Δscore(A) A B E B E C A A C C B E A To recompute scores, only need to re-score families that changed in the last move ReverseEA Δscore(A) C Exploiting decomposability

Variations on a theme • Known structure, fully observable: only need to do parameter estimation • Unknown structure, fully observable: do heuristic search through structure space, then parameter estimation • Known structure, missing values: use expectation maximization (EM) to estimate parameters • Known structure, hidden variables: apply adaptive probabilistic network (APN) techniques • Unknown structure, hidden variables: too hard to solve!

Handling missing data • Suppose that in some cases, we observe earthquake, alarm, light-level, and moon-phase, but not burglary • Should we throw that data away?? • Idea: Guess the missing valuesbased on the other data Moon-phase Light-level Earthquake Burglary Alarm

EM (expectation maximization) • Guess probabilities for nodes with missing values (e.g., based on other observations) • Compute the probability distribution over the missing values, given our guess • Update the probabilities based on the guessed values • Repeat until convergence

EM example • Suppose we have observed Earthquake and Alarm but not Burglary for an observation on November 27 • We estimate the CPTs based on the rest of the data • We then estimate P(Burglary) for November 27 from those CPTs • Now we recompute the CPTs as if that estimated value had been observed • Repeat until convergence! Earthquake Burglary Alarm

CMSC 671 Fall 2001