Classification II Lecturer: Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn
Overview • Hidden Markov Model • Decision Tree Model • (Images: a portrait of Andrey Markov; an evolution tree)
Independent Trials Toss a coin: H T T H T H H T … Roll a die: 1 5 3 6 2 4 3 2 …
Deterministic Finite Automaton • Theory of Computation • A finite state machine where for each pair of state and input symbol there is one and only one transition to a next state • A DFA is a 5-tuple (Q, Σ, δ, q0, F): • A finite set of states (Q) • A finite set of input symbols called the alphabet (Σ) • A transition function (δ : Q × Σ → Q) • A start state (q0 ∈ Q) • A set of accept states (F ⊆ Q)
DFA Example
• A DFA that accepts binary strings containing an even number of 0s
• M = (Q, Σ, δ, q0, F) where
• Q = {S1, S2},
• Σ = {0, 1},
• q0 = S1,
• F = {S1},
• δ is defined by the state transition table:

            0     1
      S1    S2    S1
      S2    S1    S2
DFA Example • String 1: 0 1 0 1 1 1 0 (three 0s, ends in S2: rejected) • String 2: 1 0 0 0 1 0 1 (four 0s, ends in S1: accepted)
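A minimal sketch of this DFA in Python, assuming the transition table reconstructed above (the function name `accepts` is just illustrative):

```python
# A small simulator for M = ({S1, S2}, {0, 1}, delta, S1, {S1}),
# which accepts binary strings containing an even number of 0s.

delta = {
    ('S1', '0'): 'S2', ('S1', '1'): 'S1',
    ('S2', '0'): 'S1', ('S2', '1'): 'S2',
}
start_state = 'S1'
accept_states = {'S1'}

def accepts(string):
    """Run the DFA over the string and report whether it ends in an accept state."""
    state = start_state
    for symbol in string:
        state = delta[(state, symbol)]
    return state in accept_states

print(accepts('0101110'))  # String 1: three 0s -> False (rejected)
print(accepts('1000101'))  # String 2: four 0s  -> True  (accepted)
```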
A Markov System • Has N states: {s1, s2, …, sN} • There are T discrete time steps: t = 1, 2, …, T • On the t-th step, the system is in exactly one of the states, s(t). • Between two time steps, the next state is chosen randomly. • The current state determines the probability distribution for the next state. • (Diagram: three states s1, s2, s3 with transition probabilities aij on the arrows, including self-loops a11, a22, a33.)
Markov Property • P(sj(t+1) | si(t)) = aij • aij is independent of time t. • s(t+1) is conditionally independent of {s(1), s(2), …, s(t−1)} given s(t): P(sj(t+1) | si(t)) = P(sj(t+1) | si(t), any earlier history) • The transition matrix A = [aij]; each of its rows sums to 1. • The probability of generating the sequence {s1, s3, s2, s2, s1, s3}: P(S6 | θ) = a13 a32 a22 a21 a13
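A minimal sketch of this calculation in Python; the 3×3 transition matrix below is invented purely for illustration (the slides do not give concrete aij values):

```python
import numpy as np

# Hypothetical transition matrix A = [a_ij]; each row sums to 1.
A = np.array([
    [0.2, 0.3, 0.5],   # from s1
    [0.4, 0.4, 0.2],   # from s2
    [0.1, 0.6, 0.3],   # from s3
])

def sequence_probability(states, A):
    """Probability of a state sequence, given that it starts in states[0]:
    the product of a_ij along the path."""
    prob = 1.0
    for i, j in zip(states[:-1], states[1:]):
        prob *= A[i, j]
    return prob

# The sequence {s1, s3, s2, s2, s1, s3} from the slide, using 0-based indices.
seq = [0, 2, 1, 1, 0, 2]
print(sequence_probability(seq, A))  # a13 * a32 * a22 * a21 * a13
```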
Weather Example
• Two states, Sunny (S) and Rainy (R), with P(S→S) = 0.9, P(S→R) = 0.1, P(R→S) = 0.5, P(R→R) = 0.5; day 0 is sunny.
• Q: What is the probability of being sunny at day 3?
• Enumerate every path from X(0) = S to X(3) = S:
    S S S S: 0.9 × 0.9 × 0.9 = 0.729
    S S R S: 0.9 × 0.1 × 0.5 = 0.045
    S R R S: 0.1 × 0.5 × 0.5 = 0.025
    S R S S: 0.1 × 0.5 × 0.9 = 0.045
    Total: 0.844
A Smart Method
• Propagate the state probabilities one time step at a time:
    P(sunny at t+1) = P(sunny at t) × 0.9 + P(rainy at t) × 0.5
    P(rainy at t+1) = P(sunny at t) × 0.1 + P(rainy at t) × 0.5

              X(0)    X(1)    X(2)    X(3)
    sunny      1       0.9     0.86    0.844
    rainy      0       0.1     0.14    0.156
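The same step-by-step propagation as a short Python sketch, using the 0.9/0.1/0.5/0.5 weather transition probabilities above:

```python
import numpy as np

# Rows/columns ordered as [sunny, rainy]; T[i, j] = P(next = j | current = i).
T = np.array([[0.9, 0.1],
              [0.5, 0.5]])

p = np.array([1.0, 0.0])   # day 0: sunny with probability 1
for day in range(3):
    p = p @ T              # propagate the distribution one step
    print(day + 1, p)      # prints [0.9, 0.1], [0.86, 0.14], [0.844, 0.156]
```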
Hidden Markov Model • In a regular Markov model, all states are directly visible to the observer. • In a Hidden Markov Model (HMM), the states are not directly visible, but outputs that depend on the states are visible. • Each state has a probability distribution over the possible outputs. • "Hidden" refers to the sequence of states. • (Diagram: a chain of hidden states … → x(t−1) → x(t) → x(t+1) → …, each emitting an observation y(t−1), y(t), y(t+1).)
HMM Demo • State sequences generated during the demo: 5 3 2 5 3 2 5 3 1 2 1 2 4 3 2 5 3 2 4 3 1 2 1 2 3 1 2 5 3 2
A Concrete Example of HMM

    states = {'Rainy', 'Sunny'}
    observations = {'walk', 'shop', 'clean'}
    start_probability = {'Rainy': 0.6, 'Sunny': 0.4}
    transition_probability = {
        'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
        'Sunny': {'Rainy': 0.4, 'Sunny': 0.6},
    }
    emission_probability = {
        'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
        'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1},
    }
Two Canonical Questions • Q1: Given the parameters of the model, compute the probability of a particular output sequence. • What is the probability of {“walk” “shop” “clean”}? • Q2: Given the parameters of the model and an output sequence, find the state sequence that is most likely to generate that output sequence. • What is the most likely sequence of rainy/sunny days? • Both questions require calculations over all possible state sequences. • Fortunately, they can be solved more efficiently. • Recall how to smartly calculate P(sj(t)) in a Markov system.
Q1: A Smart Way • Unroll the model into a trellis: one column of states s1, s2, …, sN for each time step t = 1, 2, …, T−1, T. • An arc from si at step t to sj at step t+1 carries the transition probability aij; arriving in sj while observing v(t+1) = vk also contributes the emission probability bjk (the figure marks a12, a22, a32, aN2 and b2k for v(3) = vk). • Summing the accumulated probabilities column by column gives the probability of the whole output sequence without enumerating every state sequence.
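A minimal forward-algorithm sketch in Python for Q1, reusing the Rainy/Sunny model from the concrete example above (function and variable names are illustrative):

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Total probability of the observation sequence, summing over all state sequences."""
    # alpha[s] = probability of the observations so far and being in state s now
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[r] * trans_p[r][s] for r in states) * emit_p[s][o]
                 for s in states}
    return sum(alpha.values())

states = ('Rainy', 'Sunny')
start_probability = {'Rainy': 0.6, 'Sunny': 0.4}
transition_probability = {'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
                          'Sunny': {'Rainy': 0.4, 'Sunny': 0.6}}
emission_probability = {'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
                        'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1}}

# Q1: probability of observing {"walk", "shop", "clean"}; prints about 0.0336
print(forward(('walk', 'shop', 'clean'), states,
              start_probability, transition_probability, emission_probability))
```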
Q1: Example • (Diagram: an HMM with hidden states s0, s1, s2, s3, where s0 is an absorbing state, and visible symbols v0, v1, v2, v3, v4; the figure shows the transition and emission probabilities.)
Q1: Example
• Forward (α) values, filled in column by column; each entry sums the previous column's values weighted by the corresponding aij × bjk products (the figure annotates a few, e.g. 0.2 × 0, 0.3 × 0.3, 0.1 × 0.1, 0.4 × 0.5):

              t=0     t=1     t=2      t=3      t=4
    s0         0       0       0        0       .0011
    s1         1      .09     .0052    .0024     0
    s2         0      .01     .0077    .0002     0
    s3         0      .2      .0057    .0007     0

• One visible symbol (from v0, v1, v2, v3) is observed at each step.
• The probability of the full output sequence is the value reaching the absorbing state s0 at the final step: P(V | θ) = .0011.
Q1: Example (continued) • The same trellis, annotated with the emission factors applied at a single step: each state's accumulated value is multiplied by its probability of emitting the observed symbol (×0, ×0.1, ×0.3, ×0.5 in the figure) before the transition sums are taken.
Q2: Example • The same trellis and numbers as in the Q1 example, but now each node keeps track of the best path into it and which predecessor achieved it. • Backtracking from the final state at t = 4 recovers the most likely state sequence.
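For Q2, a minimal Viterbi-style sketch in Python on the same Rainy/Sunny model; this is one standard way to find the most likely state sequence, not necessarily the exact formulation used in the lecture:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state sequence for the observations, by dynamic programming."""
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        new_best = {}
        for s in states:
            p, path = max(((best[r][0] * trans_p[r][s], best[r][1]) for r in states),
                          key=lambda t: t[0])
            new_best[s] = (p * emit_p[s][o], path + [s])
        best = new_best
    return max(best.values(), key=lambda t: t[0])

states = ('Rainy', 'Sunny')
start_p = {'Rainy': 0.6, 'Sunny': 0.4}
trans_p = {'Rainy': {'Rainy': 0.7, 'Sunny': 0.3}, 'Sunny': {'Rainy': 0.4, 'Sunny': 0.6}}
emit_p = {'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
          'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1}}

prob, path = viterbi(('walk', 'shop', 'clean'), states, start_p, trans_p, emit_p)
print(prob, path)   # 0.01344 ['Sunny', 'Rainy', 'Rainy']
```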
A Survey Dataset • Data collected from a promotion activity; there could be tens of thousands of such records. • Can we find any interesting patterns? For example, all rural households responded … • Goal: find out which factors most strongly affect a household's response to a promotion, for a better understanding of potential customers. • We need a classifier to examine the underlying relationships and make future predictions about the target concept. • Next time, send promotion brochures only to selected households: Targeted Marketing
A Tree Model (counts shown as Responded : Nothing; Red = Responded, Purple = Nothing in the original figure)
• District = Rural: 4 : 0
• District = Suburban (2 : 3) → split on House Type
    • Detached: 0 : 2
    • Semi-detached: 1 : 0
    • Terrace (1 : 1) → split on Income
        • High: 0 : 1
        • Low: 1 : 0
• District = Urban (3 : 2) → split on Previous Customer
    • Yes: 0 : 2
    • No: 3 : 0
Another Tree Model (counts shown as Responded : Nothing)
• District = Rural: 4 : 0
• District = Suburban (2 : 3) → split on Income
    • High: 0 : 3
    • Low: 2 : 0
• District = Urban (3 : 2) → split on Previous Customer
    • Yes: 0 : 2
    • No: 3 : 0
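To make the tree models concrete, here is a small illustrative Python sketch that encodes the second (shorter) tree as nested dictionaries and classifies a new household; the representation is an assumption for illustration, not the course's implementation:

```python
# The second tree as nested dicts: internal nodes name the attribute to test,
# branches map attribute values to subtrees or to a class label.
tree = {
    'District': {
        'Rural': 'Responded',
        'Suburban': {'Income': {'High': 'Nothing', 'Low': 'Responded'}},
        'Urban': {'Previous Customer': {'Yes': 'Nothing', 'No': 'Responded'}},
    }
}

def classify(tree, record):
    """Walk down the tree, following the branch chosen by each attribute test."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[record[attribute]]
    return tree

household = {'District': 'Urban', 'Previous Customer': 'No', 'Income': 'High'}
print(classify(tree, household))   # -> 'Responded'
```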
Some Notes … • Rules can be easily extracted from the built tree. • (District = Rural) → (Outcome = Responded) • (District = Urban) AND (Previous Customer = Yes) → (Outcome = Nothing) • One dataset, many possible trees • Occam's Razor • The term razor refers to the act of shaving away unnecessary assumptions to get to the simplest explanation. • “When you have two competing theories that make exactly the same predictions, the simpler one is the better.” • “The explanation of any phenomenon should make as few assumptions as possible, eliminating those making no difference in the observable predictions of the explanatory hypothesis or theory.” • Simpler trees are generally preferred.
ID3 • How do we build a short tree from a dataset? • Iterative Dichotomizer 3 • Ross Quinlan: http://www.rulequest.com/ • One of the most influential decision tree models • Top-down, greedy search through the space of possible decision trees • Since we want to construct short trees, it is better to put certain attributes higher up the tree: • Some attributes split the data more purely than others. • Their values correspond more consistently with the class labels. • We need a measure to compare candidate attributes.
Entropy • Entropy(S) = −Σi pi log2(pi), where pi is the proportion of instances in S that take the i-th target value. • Information gain of an attribute A: Gain(S, A) = Entropy(S) − Σ v∈Values(A) (|Sv| / |S|) Entropy(Sv), where Sv is the subset of S where the attribute A takes the value v.
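A minimal Python sketch of these two formulas; the four-record sample at the bottom is invented purely to exercise the functions:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i) over the class proportions in `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(records, attribute, target):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    labels = [r[target] for r in records]
    gain = entropy(labels)
    for value in {r[attribute] for r in records}:
        subset = [r[target] for r in records if r[attribute] == value]
        gain -= len(subset) / len(records) * entropy(subset)
    return gain

# Tiny illustrative sample (not the survey data from the slides):
records = [
    {'District': 'Rural', 'Outcome': 'Responded'},
    {'District': 'Rural', 'Outcome': 'Responded'},
    {'District': 'Urban', 'Outcome': 'Nothing'},
    {'District': 'Urban', 'Outcome': 'Responded'},
]
print(information_gain(records, 'District', 'Outcome'))
```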
ID3 Framework
• ID3(Examples, Target_attribute, Attributes)
• Create a Root node for the tree.
• If all Examples have the same target value T, return Root with label = T.
• If Attributes is empty, return Root with label = the most common value of Target_attribute in Examples.
• A ← the attribute from Attributes that best classifies Examples.
• The decision attribute for Root ← A.
• For each possible value vi of A:
    • Add a new tree branch below Root, corresponding to A = vi.
    • Let Examples(vi) be the subset of Examples that have value vi for A.
    • If Examples(vi) is empty:
        • Below this new branch add a leaf node with label = the most common value of Target_attribute in Examples.
    • Else below this new branch add the subtree ID3(Examples(vi), Target_attribute, Attributes − {A}).
• Return Root
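A compact recursive sketch of this framework in Python (categorical attributes only, reusing the `information_gain` helper above; a simplified illustration rather than a full ID3 implementation):

```python
from collections import Counter

def id3(examples, target, attributes):
    """Return a decision tree as nested dicts, following the ID3 framework above."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:                 # all examples share one target value
        return labels[0]
    if not attributes:                        # no attributes left to split on
        return Counter(labels).most_common(1)[0][0]

    # Choose the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    for value in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == value]
        tree[best][value] = id3(subset, target, [a for a in attributes if a != best])
    return tree
```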
Overfitting • It is possible to create a separate rule for each training sample. • Perfect training accuracy vs. overfitting • Causes: random noise, insufficient samples • We want to capture the general underlying functions or trends. • Definition • Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h' ∈ H such that h has a smaller error than h' over the training samples, but h' has a smaller error than h over the entire distribution of instances. • Solutions • Stop growing the tree earlier. • Allow the tree to overfit the data and then post-prune the tree.
Pruning • Split the available data into a Training Set, a Validation Set and a Test Set. • Use the validation set to decide whether a decision node should be replaced by a leaf node. • (Figure: a decision tree with decision nodes and leaf nodes, annotated with class counts such as 9 : 5, 1 : 3 and 8 : 2.)
Entropy Bias • The entropy measure guides the entire tree building process. • There is a natural bias that favours attributes with many values. • Consider the attribute "Birth Date": • It separates the training data into very small subsets. • Very high information gain • A very poor predictor of the target function over unseen instances • Such attributes need to be penalized!
Continuous Attributes • Sort the samples by the continuous attribute (e.g., Temperature). • Candidate split points (Threshold A and Threshold B in the figure) lie between adjacent samples; each defines a binary test such as Temperature > A that can be scored by information gain like any other attribute.
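A minimal sketch of how candidate thresholds can be scored in Python, reusing the `entropy` helper defined above; the temperature/label pairs are purely illustrative:

```python
def best_threshold(samples):
    """Sort by the continuous attribute and score each midpoint between adjacent values."""
    samples = sorted(samples)                      # list of (temperature, label) pairs
    labels = [label for _, label in samples]
    best = None
    for i in range(1, len(samples)):
        threshold = (samples[i - 1][0] + samples[i][0]) / 2
        left, right = labels[:i], labels[i:]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = entropy(labels) - weighted          # information gain of the binary test
        if best is None or gain > best[1]:
            best = (threshold, gain)
    return best

# Hypothetical sorted temperatures with class labels:
data = [(40, 'No'), (48, 'No'), (60, 'Yes'), (72, 'Yes'), (80, 'Yes'), (90, 'No')]
print(best_threshold(data))   # best split at 54.0, together with its information gain
```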
Reading Materials • Text Book (HMM) • Richard O. Duda et al., Pattern Classification, Chapter 3.10, John Wiley & Sons Inc. • Text Book (DT) • Tom Mitchell, Machine Learning, Chapter 3, McGraw-Hill. • Online Tutorial (HMM) • http://www.comp.leeds.ac.uk/roger/HiddenMarkovModels/html_dev/main.html • http://www.bioss.ac.uk/~dirk/talks/urfer02_hmm.pdf • Online Tutorial (DT) • http://www.decisiontrees.net/node/21 (with interactive demos) • http://www.autonlab.org/tutorials/dtree18.pdf • http://people.revoledu.com/kardi/tutorial/DecisionTree/index.html • http://www.public.asu.edu/~kirkwood/DAStuff/decisiontrees/index.html • Wikipedia & Google
Review • What is a Markov system? • What is a hidden Markov system? • What are the two canonical questions in HMM? • What is a Decision Tree model? • What is Occam’s Razor? • What is information entropy? • How to use information entropy in DT? • What is the main issue with information entropy? • Why and how to do pruning in DT? • How to handle continuous attributes in DT?
Next Week’s Class Talk • Volunteers are required for next week’s class talk. • Topic 1: C4.5 Algorithm • Topic 2: CART • Hints: • Both are advanced DT models. • How to select attributes? • How to handle continuous attributes? • How to handle missing values? • What else can they do? • Length: 20 minutes plus question time