Final Exam Review

Final Exam Review • Final Exam: Thursday, May 10

Bayesian reasoning • If event E occurs, then the probability that event H will occur is p(H|E) • IF E (evidence) is true THEN H (hypothesis) is true with probability p

Bayesian reasoning Example: Cancer and Test • P(C) = 0.01, P(¬C) = 0.99 • P(+|C) = 0.9, P(-|C) = 0.1 • P(+|¬C) = 0.2, P(-|¬C) = 0.8 • P(C|+) = ?
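The question on this slide can be worked through with Bayes' rule, P(C|+) = P(+|C) P(C) / [P(+|C) P(C) + P(+|¬C) P(¬C)]. The sketch below plugs in the slide's numbers; the function name `posterior` and its parameter names are illustrative only.

```python
def posterior(prior, p_pos_given_h, p_pos_given_not_h):
    """Return P(H | +) for a binary hypothesis H using Bayes' rule."""
    # Total probability of a positive test, summed over H and not-H.
    evidence = p_pos_given_h * prior + p_pos_given_not_h * (1.0 - prior)
    return p_pos_given_h * prior / evidence


# Numbers from the slide: P(C) = 0.01, P(+|C) = 0.9, P(+|¬C) = 0.2
p_c_given_pos = posterior(prior=0.01, p_pos_given_h=0.9, p_pos_given_not_h=0.2)
print(round(p_c_given_pos, 4))  # 0.0435
```

Even with a positive test, the posterior stays below 5% because the prior P(C) is so small compared with the false-positive rate.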

Bayesian reasoning with multiple hypotheses and evidences • Expand the Bayesian rule to work with multiple hypotheses (H1...Hm) and evidences (E1...En) • Assuming conditional independence among evidences E1...En: • p(Hi|E1...En) = p(E1|Hi) × ... × p(En|Hi) × p(Hi) / Σk [p(E1|Hk) × ... × p(En|Hk) × p(Hk)]
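Under the conditional-independence assumption, each posterior is proportional to the prior times the product of the likelihoods of the observed evidences, normalized over all hypotheses. A minimal sketch follows; the priors and likelihoods are hypothetical numbers chosen for illustration, not the expert data from the slides.

```python
from math import prod


def posteriors(priors, likelihoods, observed):
    """Posterior p(Hi | observed evidences), assuming conditional independence.

    priors[i]         = p(Hi)
    likelihoods[i][j] = p(Ej | Hi)
    observed          = indices j of the evidences that were observed
    """
    joint = [
        p * prod(likelihoods[i][j] for j in observed)
        for i, p in enumerate(priors)
    ]
    total = sum(joint)  # normalizing constant (the denominator of Bayes' rule)
    return [x / total for x in joint]


# Hypothetical data: three hypotheses, two evidences.
priors = [0.5, 0.3, 0.2]
likelihoods = [[0.2, 0.9], [0.6, 0.1], [0.5, 0.5]]
post = posteriors(priors, likelihoods, observed=[0, 1])
print([round(p, 3) for p in post])
```

Calling `posteriors` again with a longer `observed` list mirrors the way the expert system updates its ranking as the user reports more evidence.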

Bayesian reasoning Example • Expert data:

user observes E3E1 E2

Bayesian reasoning Example • Expert system computes posterior probabilities • User observes E2

Propagation of CFs • For a single antecedent rule: cf(H, E) = cf(E) × cf(R) • cf(E) is the certainty factor of the evidence. • cf(R) is the certainty factor of the rule.

Single antecedent rule example • IF patient has toothache THEN problem is cavity {cf 0.3} • Patient has toothache {cf 0.9} • What is the cf(cavity, toothache)?
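A quick check of this example, assuming the usual single-antecedent propagation cf(H, E) = cf(E) × cf(R). The helper name `cf_single` is illustrative.

```python
def cf_single(cf_evidence, cf_rule):
    """Certainty factor of the hypothesis for a single-antecedent rule."""
    return cf_evidence * cf_rule


# cf(toothache) = 0.9, rule cf = 0.3
cf_cavity = cf_single(cf_evidence=0.9, cf_rule=0.3)
print(round(cf_cavity, 2))  # 0.27
```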

Propagation of CFs (multiple antecedents) • For conjunctive rules: • IF <evidence E1> AND <evidence E2> ... AND <evidence En> THEN <Hypothesis H> {cf} • For two evidences E1 and E2: • cf(E1 AND E2) = min(cf(E1), cf(E2))

Propagation of CFs (multiple antecedents) • For disjunctive rules: • IF <evidence E1> OR <evidence E2> ... OR <evidence En> THEN <Hypothesis H> {cf} • For two evidences E1 and E2: • cf(E1 OR E2) = max(cf(E1), cf(E2))

Exercise • IF (P1 AND P2) OR P3 THEN C1 (0.7) AND C2 (0.3) • Assume cf(P1) = 0.6, cf(P2) = 0.4, cf(P3) = 0.2 • What is cf(C1), cf(C2)?
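One way to work the exercise: combine the antecedents with min for AND and max for OR, then multiply the result by each consequent's rule cf. This sketch assumes the single-antecedent product rule applies to the combined antecedent; variable names are illustrative.

```python
cf_p1, cf_p2, cf_p3 = 0.6, 0.4, 0.2

# (P1 AND P2) -> min; (... OR P3) -> max
cf_antecedent = max(min(cf_p1, cf_p2), cf_p3)

# Propagate through the rule to each consequent.
cf_c1 = cf_antecedent * 0.7
cf_c2 = cf_antecedent * 0.3

print(cf_antecedent)                      # 0.4
print(round(cf_c1, 2), round(cf_c2, 2))   # 0.28 0.12
```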

Defining fuzzy sets with fit-vectors • A can be defined as: A = (μA(x1)/x1, μA(x2)/x2, ..., μA(xn)/xn) • So, for example: • Tall men = (0/180, 1/190) • Short men = (1/160, 0/170) • Average men = (0/165, 1/175, 0/185)
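A fit-vector lists only a few (membership/value) points. A common reading, assumed here, is that membership is linearly interpolated between the listed points and held constant outside them. The function `membership` below is a hypothetical helper sketching that reading.

```python
def membership(fit_vector, x):
    """Degree of membership of x, given (degree, value) pairs as in 'degree/value'.

    Assumes linear interpolation between listed points and constant
    membership outside the listed range; values must be distinct.
    """
    pts = sorted((value, degree) for degree, value in fit_vector)
    if x <= pts[0][0]:
        return float(pts[0][1])
    if x >= pts[-1][0]:
        return float(pts[-1][1])
    for (x0, d0), (x1, d1) in zip(pts, pts[1:]):
        if x0 <= x <= x1:
            return d0 + (d1 - d0) * (x - x0) / (x1 - x0)


tall = [(0, 180), (1, 190)]
short = [(1, 160), (0, 170)]
average = [(0, 165), (1, 175), (0, 185)]

print(membership(tall, 185))     # 0.5
print(membership(short, 165))    # 0.5
print(membership(average, 180))  # 0.5
```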

Qualifiers & Hedges • What about linguistic values with qualifiers? • e.g. very tall, extremely short, etc. • Hedges are qualifying terms that modify the shape of fuzzy sets • e.g. very, somewhat, quite, slightly, extremely, etc.

Representing Hedges

Crisp Set Operations

Fuzzy Set Operations • Complement • To what degree do elements not belong to this set? • μ¬A(x) = 1 – μA(x) • tall men = {0/180, 0.25/182, 0.5/185, 0.75/187, 1/190}; • Not tall men = {1/180, 0.75/182, 0.5/185, 0.25/187, 0/190};
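The complement can be checked mechanically by applying 1 − μA(x) to each element. In this sketch a fuzzy set is stored as a dict from height to membership, which is a representation chosen for illustration.

```python
# tall men, as height -> membership degree
tall = {180: 0.0, 182: 0.25, 185: 0.5, 187: 0.75, 190: 1.0}

# Complement: membership in "not tall" is 1 minus membership in "tall".
not_tall = {x: 1 - mu for x, mu in tall.items()}
print(not_tall)  # {180: 1.0, 182: 0.75, 185: 0.5, 187: 0.25, 190: 0.0}
```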

Fuzzy Set Operations • Containment • Which sets belong to other sets? • tall men = {0/180, 0.25/182, 0.5/185, 0.75/187, 1/190}; • very tall men = {0/180, 0.06/182, 0.25/185, 0.56/187, 1/190}; • Each element of the fuzzy subset has a membership no larger than its membership in the containing set
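The "very tall" values on this slide match the squares of the "tall" values (0.25² ≈ 0.06, 0.5² = 0.25, 0.75² ≈ 0.56), so the sketch below assumes the hedge "very" is modeled as squaring the membership. It then checks containment element by element.

```python
tall = {180: 0.0, 182: 0.25, 185: 0.5, 187: 0.75, 190: 1.0}

# Assumption: "very" concentrates the set by squaring each membership degree.
very_tall = {x: mu ** 2 for x, mu in tall.items()}

# Containment: "very tall" is a subset of "tall" if no element has a
# larger membership in "very tall" than in "tall".
is_subset = all(very_tall[x] <= tall[x] for x in tall)

print({x: round(mu, 2) for x, mu in very_tall.items()})
print(is_subset)  # True
```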

Fuzzy Set Operations • Intersection • To what degree is the element in both sets? • μA∩B(x) = min[ μA(x), μB(x) ]

μA∩B(x) = min[ μA(x), μB(x) ] • tall men = {0/165, 0/175, 0/180, 0.25/182, 0.5/185, 1/190}; • average men = {0/165, 1/175, 0.5/180, 0.25/182, 0/185, 0/190}; • tall men ∩ average men = {0/165, 0/175, 0/180, 0.25/182, 0/185, 0/190}; • or • tall men ∩ average men = {0/180, 0.25/182, 0/185};

Fuzzy Set Operations • Union • To what degree is the element in either or both sets? • μA∪B(x) = max[ μA(x), μB(x) ]

μA∪B(x) = max[ μA(x), μB(x) ] • tall men = {0/165, 0/175, 0/180, 0.25/182, 0.5/185, 1/190}; • average men = {0/165, 1/175, 0.5/180, 0.25/182, 0/185, 0/190}; • tall men ∪ average men = {0/165, 1/175, 0.5/180, 0.25/182, 0.5/185, 1/190};
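Intersection and union of the two example sets can be reproduced with element-wise min and max. As a sketch, the sets are stored as dicts from height to membership, which assumes both sets are listed over the same heights.

```python
tall = {165: 0, 175: 0, 180: 0, 182: 0.25, 185: 0.5, 190: 1}
average = {165: 0, 175: 1, 180: 0.5, 182: 0.25, 185: 0, 190: 0}

# Intersection: element-wise minimum of the two memberships.
intersection = {x: min(tall[x], average[x]) for x in tall}

# Union: element-wise maximum of the two memberships.
union = {x: max(tall[x], average[x]) for x in tall}

print(intersection)  # {165: 0, 175: 0, 180: 0, 182: 0.25, 185: 0, 190: 0}
print(union)         # {165: 0, 175: 1, 180: 0.5, 182: 0.25, 185: 0.5, 190: 1}
```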

Choosing the Best Attribute: Binary Classification • Want a formal measure that returns a maximum value when the attribute makes a perfect split and a minimum when it makes no distinction • Information theory (Shannon and Weaver, 1949) • Entropy: a measure of uncertainty of a random variable • A coin that always comes up heads --> 0 bits • A flip of a fair coin (heads or tails) --> 1 bit • The roll of a fair four-sided die --> 2 bits • Information gain: the expected reduction in entropy caused by partitioning the examples according to this attribute

Formula for Entropy • H(p1, ..., pn) = −Σ pi log2(pi) • Examples: • Suppose we have a collection of 10 examples, 5 positive, 5 negative: H(1/2, 1/2) = −(1/2) log2(1/2) − (1/2) log2(1/2) = 1 bit • Suppose we have a collection of 100 examples, 1 positive and 99 negative: H(1/100, 99/100) = −0.01 log2(0.01) − 0.99 log2(0.99) ≈ 0.08 bits
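The entropy examples can be verified numerically. The sketch below assumes the standard definition H = −Σ pi log2(pi), with terms for pi = 0 taken as 0; the function name `entropy` is illustrative.

```python
from math import log2


def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(-p * log2(p) for p in probs if p > 0)


print(entropy([1.0]))                      # 0.0   (coin that always lands heads)
print(entropy([0.5, 0.5]))                 # 1.0   (fair coin)
print(entropy([0.25] * 4))                 # 2.0   (fair four-sided die)
print(round(entropy([0.01, 0.99]), 2))     # 0.08  (1 positive, 99 negative)
```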

Information gain • Information gain (from attribute test) = difference between the original information requirement and the new requirement • Information Gain (IG) or reduction in entropy from the attribute test: IG(A) = entropy before the test − expected entropy remaining after testing A • Choose the attribute with the largest IG

Information gain • For the training set, p = n = 6, I(6/12, 6/12) = 1 bit • Consider the attributes Patrons and Type (and others too): • Patrons has the highest IG of all attributes and so is chosen by the DTL algorithm as the root
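A sketch of the information-gain computation follows. The per-value (positive, negative) counts are an assumption: they are taken from the well-known 12-example restaurant dataset in Russell and Norvig, which this slide appears to use but does not list. The function names are illustrative.

```python
from math import log2


def entropy(pos, neg):
    """Entropy in bits of a set with `pos` positive and `neg` negative examples."""
    total = pos + neg
    return sum(-c / total * log2(c / total) for c in (pos, neg) if c > 0)


def information_gain(splits):
    """IG of an attribute; `splits` holds one (pos, neg) pair per attribute value."""
    pos = sum(p for p, _ in splits)
    neg = sum(n for _, n in splits)
    total = pos + neg
    # Expected entropy remaining after the split, weighted by subset size.
    remainder = sum((p + n) / total * entropy(p, n) for p, n in splits)
    return entropy(pos, neg) - remainder


# Assumed counts (restaurant example): Patrons = None, Some, Full
patrons = [(0, 2), (4, 0), (2, 4)]
# Assumed counts: Type = French, Italian, Thai, Burger
restaurant_type = [(1, 1), (1, 1), (2, 2), (2, 2)]

ig_patrons = information_gain(patrons)
ig_type = information_gain(restaurant_type)
print(round(ig_patrons, 3), round(ig_type, 3))  # 0.541 0.0
```

Under these counts, Type leaves every subset at a 50/50 split and gains nothing, while Patrons produces two pure subsets, which is why it is picked as the root.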

Example contd. • Decision tree learned from the 12 examples: • Substantially simpler than “true”

Perceptrons • X = x1w1 + x2w2 • Y = Ystep

Perceptrons • How does a perceptron learn? • A perceptron has initial (often random) weights typically in the range [-0.5, 0.5] • Apply an established training dataset • Calculate the error as expected output minus actual output: error e = Yexpected – Yactual • Adjust the weights to reduce the error

Perceptrons • How do we adjust a perceptron’s weights to produce Yexpected? • If e is positive, we need to increase Yactual (and vice versa) • Use this formula: wi(p + 1) = wi(p) + Δwi(p), where Δwi(p) = α × xi(p) × e(p) and • α is the learning rate (between 0 and 1) • e is the calculated error

Perceptron Example – AND • Train a perceptron to recognize logical AND • Use threshold Θ = 0.2 and learning rate α = 0.1

Perceptron Example – AND • Use threshold Θ = 0.2 and learning rate α = 0.1 • Repeat until convergence • i.e. final weights do not change and no error
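The training loop for the AND example can be sketched as below, using Θ = 0.2 and α = 0.1 from the slide. Three things are assumptions: the initial weights w1 = 0.3 and w2 = −0.1 (the slide does not list them here), the step activation returning 1 when X − Θ ≥ 0, and the update Δwi = α × xi × e. Exact fractions are used instead of floats so that comparisons at the threshold are not disturbed by rounding.

```python
from fractions import Fraction as F


def step(x):
    """Step activation: 1 if the net input is non-negative, else 0 (assumed convention)."""
    return 1 if x >= 0 else 0


def train_and(w, theta, alpha, max_epochs=100):
    """Train a two-input perceptron on logical AND; return (weights, epochs run)."""
    data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
    w = list(w)
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for x, expected in data:
            actual = step(sum(xi * wi for xi, wi in zip(x, w)) - theta)
            e = expected - actual
            if e != 0:
                errors += 1
                # Delta rule: w_i <- w_i + alpha * x_i * e
                w = [wi + alpha * xi * e for xi, wi in zip(x, w)]
        if errors == 0:  # a full pass with no error: converged
            return w, epoch
    raise RuntimeError("did not converge within max_epochs")


# Assumed initial weights; threshold and learning rate are from the slide.
weights, epochs = train_and([F(3, 10), F(-1, 10)], theta=F(2, 10), alpha=F(1, 10))
print([float(wi) for wi in weights], epochs)  # [0.1, 0.1] 5
```

With these starting weights the fifth pass over the four training pairs is the first one with no error, so the weights stop changing there.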
