Classification and Validation
Stefan Bentink, 1/21/2010
Problem
Objects/individuals are labeled as Class 1 or Class 2. We fit a model (e.g. logistic regression) to these labeled objects and then use it to predict the class of new, unlabeled objects (the "?" cases).
Evaluation
How many prediction errors will we make on future predictions?
Notation: X = training data, Y = binary classification label, β = regression coefficients. We fit a model of the form Y = βX.
Should we look at the residuals Y − βX for evaluation? No: residuals on the training data underestimate the error on new data, because the model was fitted to those very data.
Evaluation
To test the prediction accuracy on new data, we need to test the model on new data. Split the labeled objects (Class 1 and Class 2) into a training set and a test set: fit the model on the training set, then apply it to the test set and measure the prediction accuracy.
N-fold cross validation
Split the data into n parts (folds), keeping objects from both classes in each fold. Repeat n times: hold out one fold as the test set, train on the remaining folds, and record the test error. Every object is used for testing exactly once; average the errors over the n rounds.
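A minimal sketch of n-fold cross validation in R, anticipating the birthwt example used later in these slides; the data file, the response low, and the choice of 5 folds are assumptions here, not part of the original slide.

library(MASS)                                   ## for stepAIC
bw.new <- read.delim("birthwt_new.txt")
n <- nrow(bw.new)
n.folds <- 5
set.seed(123)
fold <- sample(rep(1:n.folds, length.out = n))  ## random fold assignment
errors <- numeric(n.folds)
for (i in 1:n.folds) {
  train <- bw.new[fold != i, ]
  test  <- bw.new[fold == i, ]
  ## fit and AIC-optimize the logistic model on the training folds
  model.i <- stepAIC(glm(low ~ ., family = binomial, data = train), trace = FALSE)
  prob.i  <- predict(model.i, newdata = test, type = "response")
  errors[i] <- mean(as.numeric(prob.i > 0.5) != test$low)  ## fold-i error
}
mean(errors)  ## cross-validated misclassification rate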
Classification in R
• Go to the R website: http://www.r-project.org
• Click on CRAN
• Select a mirror
• Click on Packages (left menu bar)
• Click on CRAN Task Views
• Select the task view on classification
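As an alternative to browsing the website, the ctv package can install all packages listed in a task view at once. The view name "MachineLearning" is an assumption here (the current view covering classification methods); check the CRAN Task Views page for the exact name.

install.packages("ctv")
library(ctv)
install.views("MachineLearning")  ## installs all packages in that task view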
Problem 3 – Tutorial 2 (Lecture 4)
• Read in the data file birthwt.txt. The file contains a data set on 189 births at a US hospital. The goal is to determine which set of covariates predicts low birth weight.
• A curated version of the data (birthwt_new.txt) was generated by the script generateBwtNew.r.
• Binary response: low
• Predictors: age, lwt, smoke, ht, ui, ftv, ptd, race
Implement model validation
• Function to fit the model
• Function to predict from the fitted model
• Randomly split the data into a training and a test set
Multiple logistic regression model (remember from lecture 4)
library(MASS)  ## contains the function stepAIC
bw.new <- read.delim("birthwt_new.txt")
model <- glm(low ~ ., family = binomial(link = "logit"), data = bw.new)
model.opt <- stepAIC(model)  ## model selection by AIC
log.odds <- predict(model, newdata = bw.new)
probabilities <- exp(log.odds) / (1 + exp(log.odds))
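The same probabilities can also be obtained directly from predict(); this one-line shortcut is a note added here, not part of the original slide.

## Shortcut: predict.glm can return probabilities directly
probabilities <- predict(model, newdata = bw.new, type = "response")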
Splitting data into training and test set
n <- nrow(bw.new)
k <- 2
train.test.size <- floor(n/k)
partition <- rep(1:k, each = train.test.size)
partition[n] <- k  ## n is not divisible by k, so assign the leftover observation to group k
## randomly choose training and test set
set.seed(123)
s <- sample(1:n)
training.set.1 <- s[partition == 1]
test.set.1 <- s[partition == 2]
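A simpler equivalent split, added as a sketch (not from the slides): sample half of the row indices directly for the training set and take the rest as the test set.

set.seed(123)
training.set.1 <- sample(n, floor(n/2))   ## random half of the rows
test.set.1 <- setdiff(1:n, training.set.1)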
Train and validate the model
bw.train <- bw.new[training.set.1, ]
model.train.1 <- my.classify.logit(low ~ ., data = bw.train)
bw.test <- bw.new[test.set.1, ]
true.predict.test.1 <- my.predict.logit(model.train.1, data = bw.test)
class.test.1 <- as.numeric(true.predict.test.1 > 0.5)
table(class.test.1, bw.test$low)
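The confusion table can be summarized as a single number; this short follow-up is not from the slides and assumes low is coded 0/1, as in the birthwt data.

## Misclassification rate on the test set
error.rate <- mean(class.test.1 != bw.test$low)
error.rate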
Function
## The general framework
my.function <- function(x, y) {
  ... do something with x and y ...
  ... assign the result to z ...
  return(z)
}
## Example
my.function <- function(x, y) {
  z <- x + y
  return(z)
}
Function to fit the model
## Function to fit a logistic regression model
## f: formula (model specification)
## data: a data.frame
my.classify.logit <- function(f, data) {
  require(MASS)
  model <- glm(f, family = binomial(link = "logit"), data = data)
  model.opt <- stepAIC(model)  ## optimize model by AIC
  return(model.opt)
}
Function to predict new samples given a model
## Function that predicts class probabilities
## model: a model fitted by my.classify.logit
## data: a data.frame with new data
my.predict.logit <- function(model, data) {
  log.odds <- predict(model, newdata = data)
  probabilities <- exp(log.odds) / (1 + exp(log.odds))
  return(probabilities)
}
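The manual transform exp(x)/(1+exp(x)) is the inverse logit, which R provides as plogis(); an equivalent alternative body, added here as a sketch rather than part of the original slides:

## Alternative implementation with the same result
my.predict.logit <- function(model, data) {
  plogis(predict(model, newdata = data))  ## inverse logit of the linear predictor
}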