
Classification II



  1. Classification II Lecturer: Dr. Bo Yuan E-mail: yuanb@sz.tsinghua.edu.cn

  2. Overview
• Hidden Markov Model
• Decision Tree Model
(Images: portrait of Andrey Markov; an evolution tree.)

  3. Independent Trials
• Toss a coin: H T T H T H H T …
• Roll a die: 1 5 3 6 2 4 3 2 …

  4. Sequence of Data

  5. Sequence of Data

  6. Deterministic Finite Automaton
• Theory of Computation
• A finite state machine where, for each pair of state and input symbol, there is one and only one transition to a next state.
• A DFA is a 5-tuple (Q, Σ, δ, q0, F):
• A finite set of states (Q)
• A finite set of input symbols called the alphabet (Σ)
• A transition function (δ : Q × Σ → Q)
• A start state (q0 ∈ Q)
• A set of accept states (F ⊆ Q)

  7. DFA Example
• A DFA that tracks whether the input contains an even number of 0s
• M = (Q, Σ, δ, q0, F) where
• Q = {S1, S2}
• Σ = {0, 1}
• q0 = S1
• F = {S1}
• δ is defined by the state transition table (shown as a figure)

  8. DFA Example
• String 1: 0 1 0 1 1 1 0
• String 2: 1 0 0 0 1 0 1
• A runnable sketch of this machine follows below.
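The transition table itself appears only as a figure, so this minimal Python sketch assumes the standard parity table consistent with Q, Σ, q0 and F above: the machine toggles between S1 and S2 on a '0' and stays put on a '1'.

    # Assumed transition table: toggle on '0', stay on '1'.
    delta = {
        ('S1', '0'): 'S2', ('S1', '1'): 'S1',
        ('S2', '0'): 'S1', ('S2', '1'): 'S2',
    }

    def accepts(string, start='S1', accept=('S1',)):
        """Run the DFA over the string; accept iff it halts in an accept state."""
        state = start
        for symbol in string:
            state = delta[(state, symbol)]  # exactly one transition per (state, symbol)
        return state in accept

    print(accepts('0101110'))  # String 1: three 0s, ends in S2 -> False
    print(accepts('1000101'))  # String 2: four 0s, ends in S1 -> True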

  9. A Markov System
• Has N states: {s1, s2, …, sN}
• There are T discrete time steps: t = 1, 2, …, T
• On the t-th step, the system is in exactly one of the states, denoted s(t).
• Between two time steps, the next state is chosen randomly.
• The current state determines the probability distribution for the next state.
(Figure: a three-state system s1, s2, s3 with transition probabilities αij between every pair of states, including self-loops α11, α22, α33.)

  10. Markov Property
• P(sj(t+1) | si(t)) = aij
• aij is independent of the time t.
• s(t+1) is conditionally independent of {s(1), s(2), …, s(t−1)} given s(t).
• P(sj(t+1) | si(t)) = P(sj(t+1) | si(t), any earlier history)
• The probability of generating {s1, s3, s2, s2, s1, s3}: P(S6 | θ) = a13 a32 a22 a21 a13 (see the sketch below)
(Figure: the transition matrix; each row sums to 1.)
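The product above is straightforward to compute. A small sketch, using a made-up transition matrix A, since the actual values appear only in the figure:

    import numpy as np

    # Hypothetical transition matrix: entry A[i-1, j-1] is a_ij; each row sums to 1.
    A = np.array([[0.2, 0.5, 0.3],
                  [0.4, 0.4, 0.2],
                  [0.1, 0.6, 0.3]])

    def sequence_probability(A, states):
        """Probability of a state sequence given its first state: the product of a_ij."""
        p = 1.0
        for i, j in zip(states, states[1:]):
            p *= A[i - 1, j - 1]       # states are 1-indexed (s1, s2, s3)
        return p

    # P({s1, s3, s2, s2, s1, s3} | theta) = a13 * a32 * a22 * a21 * a13
    print(sequence_probability(A, [1, 3, 2, 2, 1, 3]))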

  11. Weather Example
• Q: What is the probability of being sunny at day 3, given that day 0 (X(0)) is sunny?
• Transition probabilities, recovered from the products below: sunny→sunny 0.9, sunny→rainy 0.1, rainy→sunny 0.5, rainy→rainy 0.5.
• S→S→S→S: 0.9 × 0.9 × 0.9 = 0.729
• S→S→R→S: 0.9 × 0.1 × 0.5 = 0.045
• S→R→R→S: 0.1 × 0.5 × 0.5 = 0.025
• S→R→S→S: 0.1 × 0.5 × 0.9 = 0.045
• Total: 0.729 + 0.045 + 0.025 + 0.045 = 0.844 (checked by brute force below)
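The 0.844 can be reproduced by brute force, enumerating all 2³ weather paths over days 1–3 and keeping those that end sunny:

    from itertools import product

    # Transition probabilities as recovered from the products above.
    P = {('S', 'S'): 0.9, ('S', 'R'): 0.1,
         ('R', 'S'): 0.5, ('R', 'R'): 0.5}

    total = 0.0
    for path in product('SR', repeat=3):      # weather on days 1, 2, 3
        states = ('S',) + path                # day 0 is sunny
        p = 1.0
        for a, b in zip(states, states[1:]):
            p *= P[(a, b)]
        if states[-1] == 'S':                 # sunny at day 3
            total += p
    print(total)                              # 0.729 + 0.045 + 0.025 + 0.045 = 0.844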

  12. A Smart Method

  13. A Smart Method
• Instead of enumerating paths, propagate the state probabilities one step at a time:
• P(sunny at t+1) = P(sunny at t) × 0.9 + P(rainy at t) × 0.5
• P(rainy at t+1) = P(sunny at t) × 0.1 + P(rainy at t) × 0.5

    t =      0      1      2      3
    sunny:   1      0.9    0.86   0.844
    rainy:   0      0.1    0.14   0.156

• The same recursion appears as code below.
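The table drops out of a few matrix-vector products, visiting each time step once instead of enumerating every path; a minimal sketch:

    import numpy as np

    # Transition matrix with rows/columns ordered (sunny, rainy).
    A = np.array([[0.9, 0.1],
                  [0.5, 0.5]])

    p = np.array([1.0, 0.0])   # X(0): sunny with certainty
    for t in range(3):
        p = p @ A              # P(X(t+1) = j) = sum_i P(X(t) = i) * a_ij
        print(t + 1, p)        # [0.9 0.1], [0.86 0.14], [0.844 0.156]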

  14. Hidden Markov Model
• In a regular Markov model, all states are directly visible to the observer.
• In a Hidden Markov Model (HMM), the states are not directly visible, but outputs that depend on the states are visible.
• Each state has a probability distribution over the possible outputs.
• "Hidden" refers to the sequence of states, which must be inferred from the outputs.
(Figure: hidden states x(t−1), x(t), x(t+1) forming a chain, each emitting an observed output y(t−1), y(t), y(t+1).)

  15. Structure of HMM

  16. HMM Demo
• State sequences generated by the demo: 5 3 2 5 3 2 5 3 1 2 1 2 4 3 2 5 3 2 4 3 1 2 1 2 3 1 2 5 3 2

  17. A Concrete Example of HMM

    states = {'Rainy', 'Sunny'}
    observations = {'walk', 'shop', 'clean'}
    start_probability = {'Rainy': 0.6, 'Sunny': 0.4}
    transition_probability = {
        'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
        'Sunny': {'Rainy': 0.4, 'Sunny': 0.6},
    }
    emission_probability = {
        'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
        'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1},
    }

  18. A Concrete Example of HMM

  19. Two Canonical Questions
• Q1: Given the parameters of the model, compute the probability of a particular output sequence.
• What is the probability of observing {walk → shop → clean}?
• Q2: Given the parameters of the model and an output sequence, find the state sequence that is most likely to have generated that output sequence.
• What is the most likely sequence of rainy/sunny days?
• Both questions involve calculations over all possible state sequences.
• Fortunately, they can be solved much more efficiently, as the sketch below shows.
• Recall how we smartly calculated P(sj(t)) in a Markov system.
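For Q1, the efficient method is the forward algorithm, which sums over all state paths incrementally. A minimal sketch using the slide-17 parameters (the result, about 0.0336, follows from those numbers):

    states = ('Rainy', 'Sunny')
    start = {'Rainy': 0.6, 'Sunny': 0.4}
    trans = {'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
             'Sunny': {'Rainy': 0.4, 'Sunny': 0.6}}
    emit = {'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
            'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1}}

    def forward(observations):
        """P(observation sequence | model), summing over all state paths."""
        alpha = {s: start[s] * emit[s][observations[0]] for s in states}
        for obs in observations[1:]:
            alpha = {s: sum(alpha[r] * trans[r][s] for r in states) * emit[s][obs]
                     for s in states}
        return sum(alpha.values())

    print(forward(['walk', 'shop', 'clean']))   # about 0.0336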

  20. Q1: The Brute Force Method

  21. Q1: A Smart Way

  22. Q1: A Smart Way
(Figure: a trellis of states s1 … sN unrolled over time t = 1, 2, 3, …, T−1, T. The forward probability of reaching s2 at the next step gathers contributions ɑ12, ɑ22, ɑ32, …, ɑN2 from every state at the current step, multiplied by the emission probability b2k of the current observation v(3) = vk.)

  23. Q1: Example
(Figure: an HMM with four states s0, s1, s2, s3, where s0 is an absorbing state, and observation symbols v0, v1, v2, v3, v4, unrolled over several time steps.)

  24. Q1: Example
• Forward probabilities (rows: states, columns: time steps):

    t =      0      1      2       3       4
    s0:      0      0      0       0       .0011
    s1:      1      .09    .0052   .0024   0
    s2:      0      .01    .0077   .0002   0
    s3:      0      .2     .0057   .0007   0

• First step, starting from s1 with probability 1: s1 gets 0.3 × 0.3 = .09, s2 gets 0.1 × 0.1 = .01, s3 gets 0.4 × 0.5 = .2, and s0 gets 0.2 × 0 = 0 (transition probability times emission probability of the observed symbol).

  25. Q1: Example
(Same trellis as the previous slide, with the emission probability applied in each column made explicit, e.g. ×0.3, ×0.1, ×0.5 at the first step, and ×0 for states that cannot emit the observed symbol.)

  26. Q2: Example
• The same trellis, now used to find the most likely state sequence:

    t =      0      1      2       3       4
    s0:      0      0      0       0       .0011
    s1:      1      .09    .0052   .0024   0
    s2:      0      .01    .0077   .0002   0
    s3:      0      .2     .0057   .0007   0

• For Q2, each cell keeps only its single best predecessor instead of the sum, and the most likely path is read off by backtracking from the final cell, as in the sketch below.
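Replacing each sum in the forward recursion with a max, and remembering where the max came from, gives the Viterbi algorithm. A sketch, reusing the model dictionaries from the forward-algorithm sketch above rather than this slide's four-state model:

    def viterbi(observations):
        """Most likely state sequence for the observations, with its probability."""
        # delta[s] = (best probability of any path ending in s, that path)
        delta = {s: (start[s] * emit[s][observations[0]], [s]) for s in states}
        for obs in observations[1:]:
            delta = {s: max(((p * trans[r][s] * emit[s][obs], path + [s])
                             for r, (p, path) in delta.items()),
                            key=lambda pair: pair[0])
                     for s in states}
        return max(delta.values(), key=lambda pair: pair[0])

    prob, path = viterbi(['walk', 'shop', 'clean'])
    print(path, prob)   # ['Sunny', 'Rainy', 'Rainy'] with probability 0.01344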

  27. 10 Minutes …

  28. Decision Making

  29. A Survey Dataset
• Given the data collected from a promotion activity.
• There could be tens of thousands of such records.
• Can we find any interesting patterns?
• For example: all rural households responded …
• The goal is to find out which factors most strongly affect a household's response to a promotion.
• Better understanding of potential customers.
• We need a classifier to examine the underlying relationships and make future predictions about the target concepts.
• Send promotion brochures to selected households next time.
• Targeted Marketing

  30. A Survey Dataset

  31. A Tree Model
• Counts at each node are Responded : Nothing (red : purple in the original figure).

    District
    ├─ Rural: 4 : 0
    ├─ Suburban (2 : 3) → House Type
    │   ├─ Detached: 0 : 2
    │   ├─ Semi-detached: 1 : 0
    │   └─ Terrace (1 : 1) → Income
    │       ├─ High: 0 : 1
    │       └─ Low: 1 : 0
    └─ Urban (3 : 2) → Previous Customer
        ├─ Yes: 0 : 2
        └─ No: 3 : 0

  32. Another Tree Model
• Counts at each node are Responded : Nothing (red : purple in the original figure).

    District
    ├─ Rural: 4 : 0
    ├─ Suburban (2 : 3) → Income
    │   ├─ High: 0 : 3
    │   └─ Low: 2 : 0
    └─ Urban (3 : 2) → Previous Customer
        ├─ Yes: 0 : 2
        └─ No: 3 : 0

  33. Some Notes …
• Rules can be easily extracted from the built tree:
• (District = Rural) → (Outcome = Responded)
• (District = Urban) AND (Previous Customer = Yes) → (Outcome = Nothing)
• One dataset, many possible trees.
• Occam's Razor: the term razor refers to the act of shaving away unnecessary assumptions to get to the simplest explanation.
• "When you have two competing theories that make exactly the same predictions, the simpler one is the better."
• "The explanation of any phenomenon should make as few assumptions as possible, eliminating those making no difference in the observable predictions of the explanatory hypothesis or theory."
• Simpler trees are generally preferred.

  34. ID3
• How can we build a short tree from a dataset?
• Iterative Dichotomizer 3
• Ross Quinlan: http://www.rulequest.com/
• One of the most influential decision tree models.
• Top-down, greedy search through the space of possible decision trees.
• Since we want to construct short trees, it is better to put certain attributes higher up the tree:
• Some attributes split the data more purely than others.
• Their values correspond more consistently with the class labels.
• We need some sort of measure to compare candidate attributes.

  35. Entropy
• Entropy(S) = −Σi pi log2 pi, where pi is the proportion of instances in the dataset S that take the i-th target value.
• Gain(S, A) = Entropy(S) − Σv∈Values(A) (|Sv| / |S|) Entropy(Sv), where Sv is the subset of S where the attribute A takes the value v.
• Both measures are implemented in the sketch below.
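In Python, these two measures are a few lines each. A sketch in which labels are plain lists and examples are dicts mapping attribute name to value:

    from collections import Counter
    from math import log2

    def entropy(labels):
        """Entropy(S) = -sum_i p_i * log2(p_i) over the target values in labels."""
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def information_gain(examples, labels, attribute):
        """Gain(S, A): the entropy reduction from partitioning S on attribute A."""
        n = len(labels)
        remainder = 0.0
        for v in set(e[attribute] for e in examples):       # each value of A
            subset = [l for e, l in zip(examples, labels) if e[attribute] == v]
            remainder += len(subset) / n * entropy(subset)  # |Sv|/|S| * Entropy(Sv)
        return entropy(labels) - remainder

    print(entropy(['R'] * 9 + ['N'] * 5))   # a 9 : 5 split gives about 0.940 bits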

  36. Attribute Selection

  37. ID3 Framework
• ID3(Examples, Target_attribute, Attributes)
• Create a Root node for the tree.
• If all Examples have the same target value T, return Root with label = T.
• If Attributes is empty, return Root with label = the most common value of Target_attribute in Examples.
• A ← the attribute from Attributes that best classifies Examples.
• The decision attribute for Root ← A.
• For each possible value vi of A:
• Add a new tree branch below Root, corresponding to A = vi.
• Let Examples(vi) be the subset of Examples that have value vi for A.
• If Examples(vi) is empty, below this new branch add a leaf node with label = the most common value of Target_attribute in Examples.
• Else below this new branch add the subtree ID3(Examples(vi), Target_attribute, Attributes − {A}).
• Return Root.
• A compact code rendering follows below.
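A compact Python sketch of this framework, reusing `information_gain` from the entropy slide; it is a simplification, not Quinlan's full algorithm:

    from collections import Counter

    def id3(examples, labels, attributes):
        """Return a class label for a leaf, or (attribute, {value: subtree}) for a node."""
        if len(set(labels)) == 1:          # all examples share one target value
            return labels[0]
        if not attributes:                 # nothing left to split on: majority label
            return Counter(labels).most_common(1)[0][0]
        best = max(attributes,
                   key=lambda a: information_gain(examples, labels, a))
        branches = {}
        # Branching only on values seen in Examples, so the "Examples(vi) is empty"
        # case of the pseudocode cannot arise in this simplified sketch.
        for v in set(e[best] for e in examples):
            pairs = [(e, l) for e, l in zip(examples, labels) if e[best] == v]
            sub_examples, sub_labels = map(list, zip(*pairs))
            branches[v] = id3(sub_examples, sub_labels,
                              [a for a in attributes if a != best])
        return (best, branches)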

  38. Overfitting
• It is possible to create a separate rule for each training sample.
• Perfect training accuracy vs. overfitting
• Causes: random noise, insufficient samples.
• We want to capture the general underlying functions or trends.
• Definition: given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h' ∈ H such that h has smaller error than h' over the training samples, but h' has a smaller error than h over the entire distribution of instances.
• Solutions:
• Stop growing the tree earlier.
• Allow the tree to overfit the data and then post-prune it.

  39. Pruning
• Split the data into a training set, a validation set and a test set.
• Grow the tree on the training set; then consider turning decision nodes into leaf nodes, keeping each change only if it does not hurt accuracy on the validation set.
(Figure: a tree whose nodes carry class counts such as 9 : 5, 1 : 3 and 8 : 2, with decision nodes and leaf nodes marked.)

  40. Entropy Bias
• The entropy measure guides the entire tree building process.
• There is a natural bias that favours attributes with many values.
• Consider the attribute "Birth Date":
• It separates the training data into very small subsets.
• Very high information gain.
• Yet it is a very poor predictor of the target function over unseen instances.
• Such attributes need to be penalized, as the sketch below illustrates.
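One standard penalty, the one adopted by C4.5, divides the gain by the attribute's split information, i.e. the entropy of the partition the attribute itself induces. A sketch reusing the helpers from the entropy slide:

    def gain_ratio(examples, labels, attribute):
        """Information gain penalized by how finely the attribute splits the data."""
        values = [e[attribute] for e in examples]
        split_info = entropy(values)   # large for many-valued attributes like Birth Date
        if split_info == 0.0:          # attribute has a single value: no useful split
            return 0.0
        return information_gain(examples, labels, attribute) / split_info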

  41. Continuous Attributes
• Samples are sorted based on Temperature.
• Candidate thresholds (Threshold A and Threshold B in the figure) are placed where the class label changes along the sorted values; see the sketch below.
(Figure: samples sorted by Temperature with the two candidate thresholds marked.)
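Only boundaries where the sorted class label changes need to be evaluated. A sketch reusing `entropy` from the entropy slide, with made-up temperature/label data for illustration:

    def best_threshold(values, labels):
        """Best binary split 'value <= t', chosen by information gain."""
        pairs = sorted(zip(values, labels))
        n = len(labels)
        best_gain, best_t = 0.0, None
        for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]):
            if l1 == l2:
                continue                    # only label changes yield candidates
            t = (v1 + v2) / 2               # candidate threshold between the two samples
            left = [l for v, l in pairs if v <= t]
            right = [l for v, l in pairs if v > t]
            gain = (entropy(labels)
                    - len(left) / n * entropy(left)
                    - len(right) / n * entropy(right))
            if gain > best_gain:
                best_gain, best_t = gain, t
        return best_t, best_gain

    # Hypothetical data: six sorted temperatures with class labels.
    print(best_threshold([40, 48, 60, 72, 80, 90],
                         ['No', 'No', 'Yes', 'Yes', 'Yes', 'No']))  # (54.0, ~0.459)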

  42. Reading Materials
• Text Book (HMM): Richard O. Duda et al., Pattern Classification, Chapter 3.10, John Wiley & Sons Inc.
• Text Book (DT): Tom Mitchell, Machine Learning, Chapter 3, McGraw-Hill.
• Online Tutorial (HMM):
• http://www.comp.leeds.ac.uk/roger/HiddenMarkovModels/html_dev/main.html
• http://www.bioss.ac.uk/~dirk/talks/urfer02_hmm.pdf
• Online Tutorial (DT):
• http://www.decisiontrees.net/node/21 (with interactive demos)
• http://www.autonlab.org/tutorials/dtree18.pdf
• http://people.revoledu.com/kardi/tutorial/DecisionTree/index.html
• http://www.public.asu.edu/~kirkwood/DAStuff/decisiontrees/index.html
• Wikipedia & Google

  43. Review
• What is a Markov system?
• What is a hidden Markov model?
• What are the two canonical questions in HMM?
• What is a Decision Tree model?
• What is Occam's Razor?
• What is information entropy?
• How is information entropy used in DT?
• What is the main issue with information entropy?
• Why and how do we prune a DT?
• How are continuous attributes handled in DT?

  44. Next Week's Class Talk
• Volunteers are required for next week's class talk.
• Topic 1: C4.5 Algorithm
• Topic 2: CART
• Hints:
• Both are advanced DT models.
• How do they select attributes?
• How do they handle continuous attributes?
• How do they handle missing values?
• What else can they do?
• Length: 20 minutes plus question time.
