Today’s Lecture • Administrative Details • Learning • Decision trees: cleanup & details • Belief nets • Sub-symbolic learning • Neural networks CS-424 Gregory Dudek
Administrivia • Assignment 3 now available. Due in 1 week. <http://www.cim.mcgill.ca/~dudek/424/game.pdf> • The game for the final project has been defined. Its description can be found via the course home page at: <http://www.cim.mcgill.ca/~dudek/424/game.pdf> • Examine the game now, try it, see what good strategies might be. • Note: the normal late policy will not apply to the project. • You **must** submit the electronic (executable) version on time, or it may not be evaluated (i.e. you get zero)! • It must run on Linux. Be certain to compile and test it on one of the Linux machines in the lab well before it’s due. • If you are developing on another platform, regularly test on Linux during development.
ID3 (more…) • Last class, we discussed using entropy to select a question for building a decision tree. This idea was first developed by Quinlan (1979) in the ID3 system, and later improved, resulting in C4.5. • Recap: • Entropy for classification into sets p+ and p- is I(p+,p-). • Consider the information gain per attribute A. • For each subtree Ai, we still need a few bits, given by the distribution of cases on that subtree: I(pi+,pi-). To fully classify all sub-cases, we need Remainder(A) = ∑i weight(i) I(pi+,pi-) • Thus, info gain is what’s left: Gain = I(p+,p-) - Remainder(A)
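The recap above can be sketched in a few lines of Python. This is a minimal sketch, not Quinlan's implementation: it assumes weight(i) is the fraction of cases that fall into subtree i, and the case counts are invented for illustration.

```python
import math

def entropy(pos, neg):
    """I(p+, p-): bits needed to classify a set with pos positive and neg negative cases."""
    total = pos + neg
    bits = 0.0
    for count in (pos, neg):
        if count:  # 0 * log2(0) is taken to be 0
            p = count / total
            bits -= p * math.log2(p)
    return bits

def remainder(subsets):
    """Remainder(A): entropy of each subtree, weighted by the fraction of cases it receives."""
    total = sum(p + n for p, n in subsets)
    return sum((p + n) / total * entropy(p, n) for p, n in subsets)

def gain(pos, neg, subsets):
    """Gain = I(p+, p-) - Remainder(A)."""
    return entropy(pos, neg) - remainder(subsets)

# Hypothetical data: 6 positive and 6 negative cases, split two ways.
perfect_split = [(6, 0), (0, 6)]   # each subtree is pure
useless_split = [(3, 3), (3, 3)]   # each subtree is as mixed as the parent
print(gain(6, 6, perfect_split))   # 1.0
print(gain(6, 6, useless_split))   # 0.0
```

An attribute that separates the classes perfectly recovers the full bit; one that leaves every subtree as mixed as the parent gains nothing.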
Final thoughts on entropy... • Provoked by the seminar yesterday. • Idea: • Entropy tells you about unpredictability, or randomness. • When selecting a question, the one with highest entropy will carry the most information with respect to what you knew, because the answer is hardest to predict. • When asking if we know about something, then high entropy is bad. • Consider the PDF of a robot’s position estimate… More entropy means more uncertainty.
Training & testing • When constructing a learning system, we want to generalize to new examples (already discussed ad nauseam). • How can we tell if it’s working? • Look at how well we do with our training data. • But… what if we just learned a “quirk” of the data? • Overfitting? Bad features? • (tank classification example; table lookup) • Look at how we do on a set of examples never used for any training: a “test set”. • But… what if we can’t afford the data? • Cross validation: for learner L on training set X, e(L:X) = ∑i in X [error(L(S) on case i)]², where S = X - {i}
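The cross-validation error above (leave one case out, train on the rest, test on the held-out case) might be sketched as follows. The two learners and the data set are hypothetical, chosen only to echo the "table lookup" example: a memorizer has zero error on its training data yet fails on every held-out case.

```python
def leave_one_out_error(learner, examples):
    """Sum of squared errors when each case is predicted by a model trained on all the others."""
    total = 0.0
    for i, (x, y) in enumerate(examples):
        held_in = examples[:i] + examples[i + 1:]   # S = X - {i}
        predict = learner(held_in)                  # L(S)
        total += (predict(x) - y) ** 2
    return total

def mean_learner(training):
    """Hypothetical learner: always predict the mean training label."""
    mean = sum(y for _, y in training) / len(training)
    return lambda x: mean

def table_lookup_learner(training):
    """Hypothetical learner: memorize the training set, answer 0 for unseen inputs."""
    table = dict(training)
    return lambda x: table.get(x, 0.0)

data = [(0, 1.0), (1, 1.0), (2, 1.0), (3, 1.0)]          # made-up (input, label) pairs
print(leave_one_out_error(mean_learner, data))          # 0.0
print(leave_one_out_error(table_lookup_learner, data))  # 4.0
```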
Simple functions? Is there a fixed circuit network topology that can be used to represent a family of functions? Yes! Neural-like networks (a.k.a. artificial neural networks) allow us this flexibility and more; we can represent arbitrary families of continuous functions using fixed topology networks.
Belief networks (ch. 15) - briefly • We will cover only R&N Sections 15.1 & 15.2 (briefly), and then segue to chapter 19. You should read 15.3. This will be cursory coverage only. • A belief net (a.k.a. Bayes net) is a formalism for describing probabilistic relationships. • A graph (in fact a DAG) G(V,E). Nodes are random variables. A directed edge indicates that one node has direct influence on another. • Each node has an associated conditional probability table quantifying the effects of its “parents”. • No directed cycles.
Why? • Objective: • Compute probabilities of the variables of interest (the query variables) • given observations of specific phenomena in the world (the evidence variables). • The net you get depends on how you construct it, not just on the problem and probabilities. • Seek compactness (fewer links, tighter clusters): such nets are called locally structured.
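To make the query/evidence idea concrete, here is a small sketch of inference by enumeration on a three-node net. The network (Rain -> WetGrass <- Sprinkler) and every probability in it are invented for illustration; they are not taken from the course overheads or from R&N.

```python
from itertools import product

# Hypothetical network: Rain -> WetGrass <- Sprinkler. All numbers are made up.
P_RAIN = 0.2
P_SPRINKLER = 0.1
# Conditional probability table for WetGrass given its parents (rain, sprinkler).
P_WET = {(True, True): 0.99, (True, False): 0.9,
         (False, True): 0.8, (False, False): 0.0}

def joint(rain, sprinkler, wet):
    """P(rain, sprinkler, wet): product of each node's table entry given its parents."""
    p = P_RAIN if rain else 1 - P_RAIN
    p *= P_SPRINKLER if sprinkler else 1 - P_SPRINKLER
    p_wet = P_WET[(rain, sprinkler)]
    p *= p_wet if wet else 1 - p_wet
    return p

def prob_rain_given_wet():
    """Query variable: Rain. Evidence: WetGrass = True. Sprinkler is summed out."""
    numer = sum(joint(True, s, True) for s in (True, False))
    denom = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
    return numer / denom

print(prob_rain_given_wet())  # noticeably larger than the prior P_RAIN
```

Enumeration touches every combination of the hidden variables, so it is only practical for tiny nets; it is shown here because it follows directly from the definition.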
See overheads...
Issues (not in text) • Where do the probabilities come from? • They can be learned or inferred from data. • Where does the causal structure come from (the topology)? • It’s (sometimes) very hard to learn. • Problem: lots of alternative topologies are possible. What’s really cause and what’s effect? • Did it really rain because I brought my umbrella? Can a computer infer this (or the opposite) just from weather data? • Both these topics are current research areas.
Neural Networks? • Artificial Neural Nets • a.k.a. Connectionist Nets (connectionist learning) • a.k.a. Sub-symbolic learning • a.k.a. Perceptron learning (a special case)
Networks that model the brain? • Note there is an interesting connection to Bayes nets: it isn’t considered in the book. • Something to reflect on. • Idea: model intelligence without “jumping ahead” to symbolic representations. • Related to the earliest work on cybernetics.
The idealized neuron • Artificial neural networks come in several “flavors”. • Most are based on a simplified model of a neuron. • A set of (many) inputs. • One output. • Output is a function of the sum of the inputs. • Typical functions: • Weighted sum • Threshold • Gaussian
Why neural nets? (not in text) • Motives: • We wish to create systems with abilities akin to those of the human mind. • The mind is usually assumed to be a direct consequence of the structure of the brain. • Let’s mimic the structure of the brain! • By using simple computing elements, we obtain a system that might scale up easily to parallel hardware. • Avoids (or solves?) the key unresolved problem of how to get from “signal domain” to symbolic representations. • Fault tolerance
Real and fake neurons • Signals in neurons are coded by “spike rate”. • In ANNs, inputs can be either: • 0 or 1 (binary) • [0,1] • [-1,1] • R (real) • Each input Ii has an associated real-valued weight wi. • Learning happens by changing weights at synapses.
Brains • The brain seems divided into functional areas • These are often seen as analogous to modules in a software system. • Why would it be like this? (2 possible answers) • Evolution: incremental improvement easier in a modular system. • Advantage of combining complementary solutions. • It isn’t!
Inductive bias? (not in text) • Where’s the inductive bias? • In the topology and architecture of the network. • In the learning rules. • In the input and output representation. • In the initial weights.
Simple neural models (not in text) • The oldest ANN model is the McCulloch-Pitts neuron [1943]. • Inputs are +1 or -1 with real-valued weights. • If the sum of weighted inputs is > 0, then the neuron “fires” and gives +1 as an output. • Showed you can compute logical functions. • Relation to learning proposed (later!) by Donald Hebb [1949]. • Perceptron model [Rosenblatt, 1958]. • Single-layer network with the same kind of neuron. • Fires when the input is above a threshold: ∑xiwi > t. • Added a learning rule to allow weight selection.
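As an illustration of "you can compute logical functions", here is a minimal sketch of a McCulloch-Pitts style unit with +1/-1 inputs. Two details are assumptions of the sketch rather than statements from the slide: a non-firing unit outputs -1, and a constant +1 input is appended so that its weight acts as an offset.

```python
def mp_neuron(weights, inputs):
    """Fire (+1) if the weighted sum of the inputs exceeds 0, otherwise output -1."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else -1

# The last input is a constant +1 "bias" line (an assumption of this sketch).
AND_WEIGHTS = (1.0, 1.0, -1.0)   # fires only when both inputs are +1
OR_WEIGHTS = (1.0, 1.0, 1.0)     # fires unless both inputs are -1

for a in (-1, 1):
    for b in (-1, 1):
        print(a, b, mp_neuron(AND_WEIGHTS, (a, b, 1)), mp_neuron(OR_WEIGHTS, (a, b, 1)))
```

The weights here are chosen by hand; nothing is learned.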
Perceptron nets
Perceptron learning • Perceptron learning: • Have a set of training examples (TS) encoded as input values (i.e. in the form of binary vectors). • Have a set of desired output values associated with these inputs. • This is supervised learning. • Problem: how to adjust the weights to make the actual outputs match the training examples. • NOTE: we do not allow the topology to change! [You should be thinking of a question here.] • Intuition: when a perceptron makes a mistake, its weights are wrong. • Modify them to make the output bigger or smaller, as desired.
Learning algorithm • Desired output Ti; actual output Oi • Weight update formula (weight from unit j to i): Wj,i = Wj,i + k * xj * (Ti - Oi), where k is the learning rate. • If the examples can be learned (encoded), then the perceptron learning rule will find the weights. • How? Gradient descent. Key thing to prove is the absence of local minima.
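The update rule can be tried on a small example. This is a sketch under stated assumptions: a single output unit with 0/1 output, a constant 1 appended to each input so the threshold is learned as an ordinary weight, zero initial weights, and a learning rate k = 0.1 picked arbitrarily.

```python
def output(weights, x):
    """Threshold unit: 1 if the weighted sum exceeds 0, else 0."""
    return 1 if sum(w * xj for w, xj in zip(weights, x)) > 0 else 0

def train_perceptron(examples, k=0.1, max_epochs=100):
    """Apply W_j = W_j + k * x_j * (T - O) until an epoch passes with no mistakes."""
    weights = [0.0] * len(examples[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in examples:
            error = target - output(weights, x)   # T - O
            if error != 0:
                mistakes += 1
                weights = [w + k * xj * error for w, xj in zip(weights, x)]
        if mistakes == 0:
            break
    return weights

# Logical AND; the third input is a constant 1 acting as a learnable threshold.
AND_EXAMPLES = [((0, 0, 1), 0), ((0, 1, 1), 0), ((1, 0, 1), 0), ((1, 1, 1), 1)]
w = train_perceptron(AND_EXAMPLES)
print([output(w, x) for x, _ in AND_EXAMPLES])  # [0, 0, 0, 1]
```

Note that the topology never changes during training; only the three weights move.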
Perceptrons: what can they learn? • Only linearly separable functions [Minsky & Papert 1969]. • With N inputs, the decision boundary is a hyperplane in N-dimensional space.
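A quick way to get a feel for linear separability is to search for a single threshold unit that reproduces a truth table. The sketch below brute-forces a coarse grid of weights and thresholds; the grid range is arbitrary. Finding nothing on a grid is not a proof, but for XOR the result agrees with the known fact that no such unit exists.

```python
from itertools import product

def separates(w1, w2, t, truth_table):
    """True if the unit 'fire when w1*x1 + w2*x2 > t' reproduces the whole truth table."""
    return all((w1 * x1 + w2 * x2 > t) == bool(y)
               for (x1, x2), y in truth_table.items())

def find_unit(truth_table):
    """Brute-force search over a coarse grid of weights and thresholds (hypothetical range)."""
    grid = [i / 2 for i in range(-8, 9)]  # -4.0, -3.5, ..., 4.0
    for w1, w2, t in product(grid, repeat=3):
        if separates(w1, w2, t, truth_table):
            return w1, w2, t
    return None

AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
print(find_unit(AND))  # some (w1, w2, t) that works
print(find_unit(XOR))  # None
```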
More general networks • Generalize in 3 ways: • Allow continuous output values [0,1] • Allow multiple layers. • This is key to learning a larger class of functions. • Allow a more complicated function than thresholded summation [why??] • Generalize the learning rule to accommodate this: let’s see how it works.
The threshold • The key variant: • Change threshold into a differentiable function • Sigmoid, known as a “soft non-linearity” (silly). M = ∑xiwi, O = 1 / (1 + e^(-kM))
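A minimal sketch of the sigmoid unit, using the same symbols (M for the weighted sum, k for the steepness, O for the output). The slope formula dO/dM = k * O * (1 - O) is the standard derivative of this function and is included because differentiability is the point of the change; the example inputs are arbitrary.

```python
import math

def sigmoid_unit(inputs, weights, k=1.0):
    """O = 1 / (1 + exp(-k * M)) with M = sum_i x_i * w_i."""
    m = sum(x * w for x, w in zip(inputs, weights))
    return 1.0 / (1.0 + math.exp(-k * m))

def sigmoid_slope(output, k=1.0):
    """dO/dM = k * O * (1 - O), expressed through the output itself."""
    return k * output * (1.0 - output)

print(sigmoid_unit([1.0, 1.0], [0.5, -0.5]))  # M = 0 gives 0.5
for k in (1.0, 5.0, 50.0):
    print(k, sigmoid_unit([1.0], [0.2], k))   # larger k approaches a hard threshold
```

As k grows, the output for any fixed positive M moves toward 1, so the hard threshold is recovered as a limiting case.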