Learning From Observation, Part II. KAIST Computer Science, 20013221 박 명 제
Contents • Using Information Theory • Learning General Logical Descriptions
Using Information Theory 1. Introduction 2. Noise and Over-fitting 3. Issues related to the decision tree
Introduction • How much information is in the outcome of a coin flip? • "The less you know, the more valuable the information"
History of Information Theory • C. E. Shannon, 1948-1949 papers • "A Mathematical Theory of Communication" • Provides the probabilistic theory of encoding, decoding, and transmission in communication systems. • Provides a mathematical basis for measuring the information content of a message. • Now used in cryptography, learning theory, and other fields.
Amount of Information • Information content is measured in bits. • The outcome of an event with N equally likely cases carries log2 N bits of information. • More generally, if the outcome has probability P, learning it yields log2(1/P) = -log2 P bits.
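For reference, the convention assumed throughout this part (base-2 logarithms, so the unit is the bit):

```latex
I(\text{outcome with probability } P) = \log_2 \frac{1}{P} = -\log_2 P \ \text{bits}
```

A fair coin flip has P = 1/2 and so carries 1 bit; a fair die roll has P = 1/6 and carries log2 6 ≈ 2.58 bits.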
Information Content (Entropy) • The average information content of the possible events (the -log P terms) weighted by the probabilities of the events • Called the entropy H (see the definition below) • A measure of disorder, randomness, uncertainty, and complexity of choice • Maximized when all n probabilities are equal to 1/n
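The definition the slide refers to, for an event with possible outcomes v1, ..., vn and prior probabilities P(vi):

```latex
H\big(P(v_1), \ldots, P(v_n)\big) = \sum_{i=1}^{n} -P(v_i)\,\log_2 P(v_i)
```

A fair coin has H(1/2, 1/2) = 1 bit; a coin that comes up heads 99% of the time has only about 0.08 bits of entropy, and a certain outcome has 0.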
Information Gain (1/2) • Using the restaurant problem (page 534) • Information in the training set (p: positive examples, n: negative examples) • Remainder(A): the information still needed after testing attribute A (both formulas are given below)
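In the notation of the textbook's restaurant example, the information in a training set of p positive and n negative examples, and the information still needed after testing an attribute A that splits the set into v subsets with pi positive and ni negative examples each, are:

```latex
I\!\left(\frac{p}{p+n}, \frac{n}{p+n}\right)
  = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}

\mathrm{Remainder}(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\;
  I\!\left(\frac{p_i}{p_i+n_i}, \frac{n_i}{p_i+n_i}\right)
```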
Information Gain (2/2) • Definition • The difference between the original information requirement and the new requirement • We select the attribute with the maximum value of Gain(A) • Example: Patrons has the highest gain of any of the attributes and would be chosen by the decision tree learning algorithm as the root (a worked computation is sketched below).
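Gain(A) = I(p/(p+n), n/(p+n)) - Remainder(A). A minimal Python sketch of the Patrons-versus-Type comparison; the per-value counts are the ones usually quoted for the 12-example restaurant training set (6 positive, 6 negative) and are included here for illustration.

```python
from math import log2

def I(p, n):
    """Information content, in bits, of a set with p positive and n negative examples."""
    total = p + n
    bits = 0.0
    for count in (p, n):
        if count:                      # 0 * log2(0) is taken to be 0
            q = count / total
            bits -= q * log2(q)
    return bits

def gain(splits, p, n):
    """Information gain of an attribute whose values split (p, n) into `splits`."""
    remainder = sum((pi + ni) / (p + n) * I(pi, ni) for pi, ni in splits)
    return I(p, n) - remainder

# (positive, negative) counts per attribute value, as in the restaurant example.
patrons = [(0, 2), (4, 0), (2, 4)]          # None, Some, Full
type_   = [(1, 1), (1, 1), (2, 2), (2, 2)]  # French, Italian, Thai, Burger

print(gain(patrons, 6, 6))  # ~0.541 bits -> chosen as the root
print(gain(type_, 6, 6))    # 0.0 bits    -> useless as a first test
```

Running this gives roughly 0.541 bits for Patrons and 0 bits for Type, which is why Patrons ends up at the root.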
Noise and Over-fitting (1/3) • Noise • Two or more examples with the same description but different classifications. • Over-fitting • Example: rolling a die (p. 542) • The algorithm uses irrelevant attributes to make spurious distinctions • With a large hypothesis space, be careful not to use the resulting freedom to find meaningless "regularity" in the data
Noise and Over-fitting (2/3) • To prevent over-fitting • Decision tree pruning • Prevent splitting on irrelevant attributes • How do we find irrelevant attributes? • Attributes with very small information gain • Chi-square (χ²) pruning • Measure how far the actual numbers of positive and negative examples in each subset deviate from what a truly irrelevant attribute would produce • The probability that the attribute is really irrelevant can then be read off standard chi-squared tables (the deviation measure is given below).
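The deviation measure behind χ² pruning, in the same notation as above: under the null hypothesis that attribute A is irrelevant, the expected numbers of positive and negative examples in each of its v subsets, and the total deviation of the actual counts from them, are

```latex
\hat{p}_i = p \cdot \frac{p_i + n_i}{p + n}, \qquad
\hat{n}_i = n \cdot \frac{p_i + n_i}{p + n}, \qquad
D = \sum_{i=1}^{v} \left( \frac{(p_i - \hat{p}_i)^2}{\hat{p}_i}
                         + \frac{(n_i - \hat{n}_i)^2}{\hat{n}_i} \right)
```

Under the null hypothesis, D is distributed approximately as χ² with v - 1 degrees of freedom, which is what the standard tables are consulted for.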
Noise and Over-fitting (3/3) • To prevent over-fitting • Cross-validation • Estimate how well the current hypothesis will predict unseen data • Done by setting aside some fraction of the known data and using it to test the prediction performance of a hypothesis induced from the rest of the known data (a sketch follows).
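A minimal holdout sketch of the idea. The `train(examples)` and `accuracy(hypothesis, examples)` callables are hypothetical placeholders for whatever induction algorithm and scoring function are in use; they are not functions from the text.

```python
import random

def holdout_validate(examples, train, accuracy, test_fraction=0.3, seed=0):
    """Estimate how well a hypothesis trained on part of the data predicts the rest."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    split = int(len(shuffled) * test_fraction)
    test_set, training_set = shuffled[:split], shuffled[split:]
    hypothesis = train(training_set)       # induce h from the retained data only
    return accuracy(hypothesis, test_set)  # score it on the held-out data
```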
Broadening the Applicability • Missing data • In many domains, not all the attribute values will be known for every example. • Multi-valued attributes • When an attribute has a large number of possible values, the information gain measure gives an inappropriate indication of the attribute's usefulness. • Continuous-valued attributes • Discretize the attribute (one common recipe is sketched below) • Example: Price in the restaurant problem (page 534)
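For the continuous-valued case, one common recipe (an illustration, not the textbook's specific prescription) is to take candidate thresholds midway between adjacent observed values and score each resulting boolean split with the same information-gain measure as before:

```python
def candidate_thresholds(values):
    """Midpoints between adjacent distinct sorted values of a continuous attribute."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

# Each threshold t turns the attribute into the boolean test `value <= t`,
# which can then be evaluated with the information-gain measure.
```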
Learning General Logical Descriptions 1. Introduction 2. Current-best-hypothesis Search 3. Least-commitment Search
Introduction (1/3) • Steps to find hypotheses • Start out with a goal predicate (generically called Q) • Q will be a unary predicate • Find an equivalent logical expression that we can use to classify examples correctly.
Introduction (2/3) • Hypothesis = Candidate Definition + Goal • Hypothesis space • The set of all hypotheses; written H • The learning algorithm believes that one of the hypotheses is correct, that is, it believes the sentence H1 ∨ H2 ∨ ... ∨ Hn
Introduction (3/3) • Ways of being inconsistent with an example • An example is a false negative for the hypothesis • The hypothesis says it is negative, but in fact it is positive • An example is a false positive for the hypothesis • The hypothesis says it is positive, but in fact it is negative • The goal is to make the hypothesis consistent with the entire set of examples.
Current-best-hypothesis Search (1/6) • Main idea • Maintain a single hypothesis, and adjust it as new examples arrive in order to maintain consistency
Current-best-hypothesis Search (2/6) • When a new example e arrives • If e is consistent with hypothesis h • Do nothing • If e is a false negative for hypothesis h • Generalize h to include e • If e is a false positive for hypothesis h • Specialize h to exclude e • (A schematic version of this loop follows.)
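A sketch of that loop, under stated assumptions: examples carry a boolean `label`, and `classify`, `generalizations`, and `specializations` are hypothetical helpers that predict a label and enumerate candidate adjustments. The full algorithm also backtracks over earlier choices, which this sketch only signals.

```python
def current_best_learning(examples, h, classify, generalizations, specializations):
    """Adjust a single hypothesis h so it stays consistent with each new example."""
    seen = []
    for e in examples:
        predicted, actual = classify(h, e), e.label
        if predicted != actual:
            if actual and not predicted:      # false negative: h is too specific
                candidates = generalizations(h, e)
            else:                             # false positive: h is too general
                candidates = specializations(h, e)
            # keep only adjustments consistent with everything seen so far
            candidates = [c for c in candidates
                          if all(classify(c, x) == x.label for x in seen + [e])]
            if not candidates:
                raise ValueError("must backtrack to an earlier choice")
            h = candidates[0]                 # a nondeterministic choice in the original
        seen.append(e)
    return h
```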
Current-best-hypothesis Search (3/6) • Generalization and specialization • Describe the logical relationship between hypotheses • If C implies D, then D is a generalization of C • If D implies C, then D is a specialization of C • In hypothesis space, a generalization covers at least the examples covered by the original hypothesis, while a specialization covers a subset of them.
Current-best-hypothesis Search (4/6) • Examples from the restaurant problem (page 534) • The first example x1 is positive. • H1: ∀x WillWait(x) ⇔ Alternate(x) • The second example x2 is negative. H1 predicts it to be positive, so it is a false positive -> specialize H1 • H2: ∀x WillWait(x) ⇔ Alternate(x) ∧ Patrons(x, Some) • The third example x3 is positive. H2 predicts it to be negative, so it is a false negative -> generalize H2 • H3: ∀x WillWait(x) ⇔ Patrons(x, Some) • The fourth example x4 is positive. H3 predicts it to be negative, so it is a false negative -> generalize H3 • H4: ∀x WillWait(x) ⇔ Patrons(x, Some) ∨ (Patrons(x, Full) ∧ Fri/Sat(x))
Current-best-hypothesis Search (5/6) • Simple, but described nondeterministically • There may be several possible specializations or generalizations that can be applied • Does not necessarily lead to the simplest hypothesis • May lead to an unrecoverable situation, in which case the program must backtrack to a previous choice point.
Current-best-hypothesis Search (6/6) • With a large number of instances and a large hypothesis space, further difficulties arise • Checking all the previous instances over again for each modification is very expensive • Good search heuristics are hard to find, and naive backtracking can take forever because the hypothesis space is so large.
Least-commitment search (1/7) • Main idea • Keep around all and only those hypotheses that are consistent with all the data so far • Remove hypotheses that are inconsistent with each new example • Version space • The set of hypotheses remaining after elimination • Algorithm • Known as version space learning or the candidate elimination algorithm.
Least-commitment search (2/7) • Properties • Incremental: never has to go back and re-examine old examples • Least-commitment: makes no arbitrary choices • Problem • The hypothesis space is enormous, so the version space cannot be stored explicitly • Solution: use an interval (boundary-set) representation that specifies only the boundaries of the set
Least-commitment search (3/7) • Partial ordering on the hypothesis space • Given by the generalization/specialization relationship • Boundary sets • G-set: the most general boundary • No consistent hypotheses are more general • S-set: the most specific boundary • No consistent hypotheses are more specific • Everything in between is guaranteed to be consistent with the examples.
Least-commitment search (4/7) • Learning strategy • Needs the initial version space to represent all possible hypotheses • G-set: contains only True • S-set: contains only False • Two properties show that the representation is sufficient • Every consistent hypothesis is more specific than some member of the G-set, and more general than some member of the S-set • Every hypothesis more specific than some member of the G-set and more general than some member of the S-set is a consistent hypothesis
Least-commitment search (5/7) • Updating S and G for a new example (a code sketch follows) • False positive for s • s is too general, and there are no consistent specializations of s, so throw it out of the S-set • False negative for s • s is too specific, so replace it by all its immediate generalizations • False positive for g • g is too general, so replace it by all its immediate specializations • False negative for g • g is too specific, and there are no consistent generalizations of g, so throw it out of the G-set
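The four rules, sketched as code under the same assumptions as the earlier sketch (`classify(h, e)` and `e.label` are booleans); `immediate_generalizations`, `immediate_specializations`, and `more_general_than` are hypothetical helpers standing in for the partial-order operations.

```python
def update_version_space(S, G, e, classify,
                         immediate_generalizations, immediate_specializations,
                         more_general_than):
    """Apply one training example e to the boundary sets (S, G) of a version space."""
    new_S, new_G = [], []
    for s in S:
        if classify(s, e) == e.label:
            new_S.append(s)                    # s is still consistent
        elif e.label:                          # false negative: s is too specific
            new_S.extend(cand for cand in immediate_generalizations(s, e)
                         if any(more_general_than(g, cand) for g in G))
        # false positive: s is too general; no consistent specialization exists, so drop s
    for g in G:
        if classify(g, e) == e.label:
            new_G.append(g)                    # g is still consistent
        elif not e.label:                      # false positive: g is too general
            new_G.extend(cand for cand in immediate_specializations(g, e)
                         if any(more_general_than(cand, s) for s in new_S))
        # false negative: g is too specific; no consistent generalization exists, so drop g
    return new_S, new_G
```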
Least-commitment search (6/7) • Algorithm termination • Exactly one concept is left in the version space • Return it as the unique hypothesis • The version space collapses: either S or G becomes empty • There is no consistent hypothesis for the training set; learning has failed • We run out of examples with several hypotheses remaining in the version space • The remaining version space represents a disjunction of hypotheses
Least-commitment search (7/7) • Discussion • With noise or insufficient attributes for exact classification, the version space always collapses • No completely successful solution to this problem has been found • The disjunction problem • Allow limited forms of disjunction • Include a generalization hierarchy of more general predicates • Example: WaitEstimate(x, 30-60) ∨ WaitEstimate(x, >60) -> LongWait(x)