1 / 7

Statistical Learning World as a Probability Distribution

Statistical Learning World as a Probability Distribution . Consider a distribution D over space X  Y X - the instance space; Y - set of labels. (e.g. +/-1 ) Can think about the data generation process as governed by D(x), and the labeling process as governed by D(y|x), such that

farhani
Download Presentation

Statistical Learning World as a Probability Distribution

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Statistical LearningWorld as a Probability Distribution • Consider a distribution D over space XY • X - the instance space; Y - set of labels. (e.g. +/-1) • Can think about the data generation process as governed by D(x), and the labeling process as governed by D(y|x), such that D(x,y)=D(x) D(y|x) • This can be used to model both the case where labels are generated by a function y=f(x), as well as noisy cases and probabilistic generation of the label. • If the distribution D is known, there is no learning. We can simply predict y = argmaxy D(y|x) • If we are looking for a hypothesis, we can simply find the one that minimizes the probability of mislabeling: h = argminh E(x,y)~D [[h(x) y]] CS446-Fall ’06

  2. Statistical Learning • Inductive learning comes into play when the distribution is not known. • Then, there are two basic approaches to take. Generative or Bayesian (indirect) Learning model the world and infer the classifier Discriminative (direct) learning model the classifier *directly* • A third non-statistical approach learns a discriminant function f:XY without using probabilities CS446-Fall ’06

  3. 1: Direct Learning • Model the problem of text correction as a problem of learning from examples. • Goal: learn directly how to make predictions. PARADIGM • Look at many examples. • Discover some regularities in the data. • Use these to construct a prediction policy. • A policy (a function, a predictor) needs to be specific. [it/in] rule: if the occurs after the target in • Assumptions comes in the form of a hypothesis class. CS446-Fall ’06

  4. Direct Learning (2) • Consider a distribution D over space XY • X - the instance space; Y - set of labels. (e.g. +/-1) • Given a sample {(x,y)}1m,, and a loss function L(x,y) Find hH that minimizes i=1,mD(xi,yi)L(h(xi),yi) • L can be: L(h(x),y)=1, h(x)y, o/w L(h(x),y) = 0 (0-1 loss) L(h(x),y)=(h(x)-y)2 , (L2 ) L(h(x),y)= max{0,1-y h(x)} (hinge loss) L(h(x),y)= exp{- yh(x)} (exponential loss) • Find an algorithm that minimizes average loss; • Then, we know that things will be okay (as a function of expressiveness H). CS446-Fall ’06

  5. In practice: make assumptions on the distribution’s type • In practice: a decision policy depends on the assumptions 2: Generative Model • Model the problem of classification as generating data like the training data • Goal: learn a model of the phenomenon; use it to predict. PARADIGM • Learn a probability distribution over all labeled examples • Use it to estimate (e.g.) which sentence is more likely. Pr(I saw the girl it the park) <> Pr(I saw the girl in the park) CS446-Fall ’06

  6. Generative / Discriminative Examples • Try to predict tomorrows weather as accurately as possible • Measure some relevant parameters: • Current temperature, barometric pressure, wind speed • Make a prediction of the form: chances of rain tomorrow : 70% • Compile a rule for predicting when a student will be admitted to college • some students will be accepted regardless of who review their file • some are rejected categorically • For some, it depends who review their file • So, the best model for the chances of the borderline students is probabilistic. But, every student is either accepted or rejected. CS446-Fall ’06

  7. Generative / Discriminative Example • There are two coins: • a fair coin (P(Head) = 0.5) • a biased coin (P(Head) = 0.6) Coin 1 Coin 2 CS446-Fall ’06

More Related