
Last lecture summary Naïve Bayes Classifier


Presentation Transcript


  1. Last lecture summary: Naïve Bayes Classifier

  2. Bayes Rule. The prior and the likelihood must be learnt (i.e. estimated from the data). The slide annotates the terms of the rule: likelihood, prior, posterior, and normalization constant.
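The slide's equation itself is only an image in the source, so it does not appear in the transcript; a standard way to write Bayes' rule with the labels above (class Y, feature vector X) is:

    \[
    \underbrace{P(Y \mid X)}_{\text{posterior}}
      \;=\; \frac{\overbrace{P(X \mid Y)}^{\text{likelihood}}\;\;\overbrace{P(Y)}^{\text{prior}}}
                 {\underbrace{P(X)}_{\text{normalization constant}}}
    \]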

  3. Learning the prior • A hundred independently drawn training examples will usually suffice to obtain a reasonable estimate of P(Y). • Learning the likelihood • The Naïve Bayes assumption: assume that all features are independent given the class label Y.
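In symbols (notation assumed here, since the slide's formula is an image), the Naïve Bayes assumption factorizes the likelihood of features X_1, ..., X_n given the class Y as

    \[
    P(X_1, \dots, X_n \mid Y) \;=\; \prod_{i=1}^{n} P(X_i \mid Y)
    \]

so only the one-dimensional tables P(X_i | Y) have to be estimated.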

  4. Example – Play Tennis

  5. Example – Learning Phase. Counting over the 14 training examples gives the lookup tables, e.g. P(Outlook=Sunny|Play=Yes) = 2/9, and the class priors P(Play=Yes) = 9/14 and P(Play=No) = 5/14.
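A minimal sketch of this learning phase in Python (the row dicts below are placeholders standing in for the full Play Tennis table, not the lecture's code):

    from collections import Counter

    def estimate_tables(rows, feature):
        # Class priors P(Play) and one likelihood table P(feature | Play),
        # both estimated by simple counting over the training rows.
        class_counts = Counter(r["Play"] for r in rows)
        joint_counts = Counter((r[feature], r["Play"]) for r in rows)
        prior = {c: n / len(rows) for c, n in class_counts.items()}
        likelihood = {(v, c): n / class_counts[c] for (v, c), n in joint_counts.items()}
        return prior, likelihood

    rows = [
        {"Outlook": "Sunny", "Play": "No"},
        {"Outlook": "Overcast", "Play": "Yes"},
        # ... the remaining rows of the Play Tennis table
    ]
    prior, outlook_given_play = estimate_tables(rows, "Outlook")
    # On the full 14-row table this reproduces P(Play=Yes) = 9/14 and
    # P(Outlook=Sunny | Play=Yes) = 2/9.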

  6. Example – Prediction. x' = (Outl=Sunny, Temp=Cool, Hum=High, Wind=Strong). Look up the tables:
  P(Outl=Sunny|Play=Yes) = 2/9, P(Temp=Cool|Play=Yes) = 3/9, P(Hum=High|Play=Yes) = 3/9, P(Wind=Strong|Play=Yes) = 3/9, P(Play=Yes) = 9/14
  P(Outl=Sunny|Play=No) = 3/5, P(Temp=Cool|Play=No) = 1/5, P(Hum=High|Play=No) = 4/5, P(Wind=Strong|Play=No) = 3/5, P(Play=No) = 5/14
  P(Yes|x'): [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
  P(No|x'): [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206
  Given that P(Yes|x') < P(No|x'), we label x' as "No".
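The same prediction step, sketched in Python with the table values quoted above (variable names are illustrative):

    likelihood_yes = {"Sunny": 2/9, "Cool": 3/9, "High": 3/9, "Strong": 3/9}
    likelihood_no = {"Sunny": 3/5, "Cool": 1/5, "High": 4/5, "Strong": 3/5}
    prior = {"Yes": 9/14, "No": 5/14}

    x_new = ["Sunny", "Cool", "High", "Strong"]   # Outlook, Temperature, Humidity, Wind

    score_yes, score_no = prior["Yes"], prior["No"]
    for value in x_new:
        score_yes *= likelihood_yes[value]        # multiply in P(feature value | Yes)
        score_no *= likelihood_no[value]          # multiply in P(feature value | No)

    print(round(score_yes, 4), round(score_no, 4))  # 0.0053 vs 0.0206 -> predict "No"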

  7. Last lecture summary: Binary classifier performance

  8. TP, TN, FP, FN.
  Precision, Positive Predictive Value (PPV): TP / (TP + FP)
  Recall, Sensitivity, True Positive Rate (TPR), Hit rate: TP / P = TP / (TP + FN)
  False Positive Rate (FPR), Fall-out: FP / N = FP / (FP + TN)
  Specificity, True Negative Rate (TNR): TN / (TN + FP) = 1 - FPR
  Accuracy: (TP + TN) / (TP + TN + FP + FN)
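These definitions translate directly into a small helper function (a sketch, not part of the lecture material):

    def binary_metrics(tp, tn, fp, fn):
        # All metrics are computed straight from the confusion-matrix counts.
        precision = tp / (tp + fp)                  # positive predictive value (PPV)
        recall = tp / (tp + fn)                     # sensitivity, TPR, hit rate
        fpr = fp / (fp + tn)                        # fall-out
        specificity = tn / (tn + fp)                # TNR = 1 - FPR
        accuracy = (tp + tn) / (tp + tn + fp + fn)
        return {"precision": precision, "recall": recall, "FPR": fpr,
                "specificity": specificity, "accuracy": accuracy}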

  9. Neural networks (new stuff)

  10. Biological motivation • The human brain has been estimated to contain ~10^11 brain cells (neurons). • A neuron is an electrically excitable cell that processes and transmits information by electrochemical signaling. • Each neuron is connected with other neurons through connections called synapses. • A typical neuron possesses a cell body (often called the soma), dendrites (many, on the order of millimetres long), and an axon (one, 10 cm to 1 m long).

  11. A synapse permits a neuron to pass an electrical or chemical signal to another cell. • A synapse can be either excitatory or inhibitory. • Synapses are of different strengths (the stronger the synapse, the more important it is). • The effects of the synapses accumulate inside the neuron. • When the cumulative effect of the synapses reaches a certain threshold, the neuron is activated and a signal is sent down the axon, through which the neuron is connected to other neuron(s).

  12. Simplistic view of the function of a neuron • The neuron accumulates positive/negative stimuli from other neurons. • The accumulated stimulus is then processed further to produce an output, i.e. the neuron sends an output signal to the neurons connected to it.

  13. Neural networks for applied science and engineering, Samarasinghe

  14. Threshold neuron. Warren McCulloch (1899-1969) and Walter Pitts (1923-1969).

  15. The first mathematical model of a neuron – the McCulloch & Pitts binary (threshold) neuron • only binary inputs and output • the weights are pre-set, no learning • its parts: inputs, weights, activation (transfer) function, output.

  16. In this exercise, both weights will be fixed. • When is the target classified as 0 and when as 1? • Set the threshold. • If the weighted sum of the inputs reaches the threshold, the input is classified as 1. • If it stays below the threshold, it is classified as 0. • Which threshold would you use?

  17. Heaviside (threshold) activation function
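A minimal sketch of such a McCulloch-Pitts threshold neuron with a Heaviside activation (the weights and the threshold of 1.5 below are illustrative, not the values from the slides):

    def heaviside(u, threshold):
        # Heaviside (step) activation: fire (1) once the threshold is reached.
        return 1 if u >= threshold else 0

    def threshold_neuron(inputs, weights, threshold):
        u = sum(w * x for w, x in zip(weights, inputs))   # weighted sum of binary inputs
        return heaviside(u, threshold)

    # Two binary inputs, both weights fixed to 1, threshold 1.5 -> behaves like logical AND.
    print(threshold_neuron([1, 1], [1, 1], 1.5))  # 1
    print(threshold_neuron([1, 0], [1, 1], 1.5))  # 0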

  18. The threshold is incorporated as the weight of one additional input whose value is held fixed. • Such an input is called the bias.

  19. Because the location of the threshold function defines the two categories, its value of 1.3 determines the classification boundary, which the slide formulates as an equation in the inputs.

  20. Perceptron (1957). Frank Rosenblatt developed the learning algorithm and used his neuron (pattern recognizer = perceptron) for the classification of letters.

  21. A binary classifier: it maps its input x (a real-valued vector) to a binary value (0 or 1) • the output is 1 if the weighted sum of the inputs (including the bias) is positive, and 0 otherwise • the perceptron can adjust its weights (i.e. it can learn) – the perceptron learning algorithm; a sketch is given below.
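A compact sketch of the perceptron learning rule under the usual conventions (the threshold is folded in as a bias weight on a constant input of 1; the names and the OR example are assumptions, not the lecture's code):

    def predict(w, x):
        # x already includes the constant bias input 1 as its first component.
        return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

    def train_perceptron(samples, n_features, rate=0.1, epochs=20):
        w = [0.0] * (n_features + 1)              # +1 for the bias weight
        for _ in range(epochs):
            for x, target in samples:
                x = [1.0] + list(x)               # prepend the bias input
                error = target - predict(w, x)    # 0 if correct, +1 or -1 if wrong
                w = [wi + rate * error * xi for wi, xi in zip(w, x)]
        return w

    # A linearly separable toy problem (logical OR) that the rule learns exactly.
    samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
    w = train_perceptron(samples, n_features=2)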

  22. Multiple output perceptron • for multicategory (i.e. more than 2 classes) classification • one output neuron for each class • the network consists of an input layer and an output layer • terminology: single layer (one-layered) vs. double layer (two-layered).

  23. Learning • Set the weights (including the threshold). • Supervised learning: we know the target values. • We want the outputs to be as close as possible to the desired target values. • We define an error, the Sum of Squares Error (we already know this one); a standard form is written out below.
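The slide's formula is an image, so the notation below is an assumption: summing over training patterns n and output neurons k, with targets t and outputs y, the sum-of-squares error reads

    \[
    E \;=\; \tfrac{1}{2} \sum_{n} \sum_{k} \bigl(t_{nk} - y_{nk}\bigr)^{2}
    \]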

  24. “To be as close as possible to the targets” means that the error E should be minimal. • So we want to minimize E, which is a function of the weights w. • E is also called the objective function or sometimes the energy.

  25. Requirements for the minimum. The gradient, grad E, is a vector pointing in the direction of the greatest rate of increase of the function. • We want E to decrease, so we step in the direction of -grad E.

  26. Delta rule • gradient descent • How do we train a linear neuron using the delta rule? • The demonstration will be given for one neuron with a single input x1, no bias, and one output y.

  27. The neuron is presented with an input pattern. • It calculates the weighted input u = w1 x1 and its output as y = u (no threshold function is used). • The error is E = (t - y)^2 / 2. • If you plot E against w1, which curve do you get? (A parabola; its slope at the current weight is the error gradient.)

  28. To find the gradient, differentiate the error E with respect to w1: dE/dw1 = -(t - y) x1. • According to the delta rule, the weight change is proportional to the negative of the error gradient: Δw1 = -β dE/dw1 = β (t - y) x1. • New weight: w1_new = w1 + Δw1 = w1 + β (t - y) x1.
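A minimal numerical sketch of this update for the single-input linear neuron (the learning rate, input, and target values are made up for illustration):

    def delta_rule_step(w1, x1, t, beta=0.1):
        y = w1 * x1                    # linear neuron output, no threshold function
        error = t - y
        grad = -error * x1             # dE/dw1 for E = 0.5 * (t - y)**2
        w1_new = w1 - beta * grad      # equivalently: w1 + beta * error * x1
        return w1_new, 0.5 * error ** 2

    w1 = 0.0
    for _ in range(50):                # repeated presentations of a single pattern
        w1, E = delta_rule_step(w1, x1=2.0, t=1.0)
    # w1 converges towards t / x1 = 0.5 and the error E towards 0.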

  29. β is called the learning rate. It determines how far along the gradient we move in each update.

  30. The new weight after the i-th iteration: w1(i+1) = w1(i) + β (t - y) x1.

  31. This is an iterative algorithm; one pass through the training set is not enough. • One pass over the whole training data set is called an epoch. • Adjusting the weights after each input pattern presentation (iteration) is called example-by-example (online) learning. • For some problems this can cause the weights to oscillate – the adjustment required by one pattern may be cancelled by the next pattern. • The next method is more popular.

  32. Batch learning – wait until all input patterns (i.e. the whole epoch) have been processed and then adjust the weights in an average sense. • This gives a more stable solution. • Obtain the error gradient for each input pattern. • Average the gradients at the end of the epoch. • Use this average value to adjust the weights with the delta rule (a sketch follows below).
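A sketch of one batch epoch for the same single-input linear neuron as above (the training patterns are invented and consistent with t = 0.5 * x1):

    def batch_epoch(w1, patterns, beta=0.1):
        grads = []
        for x1, t in patterns:
            y = w1 * x1
            grads.append(-(t - y) * x1)        # per-pattern gradient dE/dw1
        avg_grad = sum(grads) / len(grads)     # average over the whole epoch
        return w1 - beta * avg_grad            # a single weight update per epoch

    patterns = [(1.0, 0.5), (2.0, 1.0), (3.0, 1.5)]
    w1 = 0.0
    for _ in range(100):                       # run for 100 epochs
        w1 = batch_epoch(w1, patterns)
    # w1 approaches 0.5, the weight that fits all three patterns exactly.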
