
Lab exercises: regression, kNN and K-means


Presentation Transcript


  1. Lab exercises: regression, kNN and K-means Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4b, February 14, 2014

  2. Today • Linear regression • K Nearest Neighbors • K Means

  3. The Dataset(s) • http://escience.rpi.edu/data/DA • Some new ones: nyt/ and sales/ and the fb100/ (.mat files) • 3 scripts (fragments, i.e. they will not run as-is, I think) to help with the code for today: Lab4b_{1,2,3}.R

  4. Linear and least-squares
  > multivariate <- read.csv("EPI_data.csv")
  > attach(multivariate)
  > boxplot(ENVHEALTH, DALY, AIR_H, WATER_H)
  > lmENVH <- lm(ENVHEALTH ~ DALY + AIR_H + WATER_H)
  > lmENVH                  # what should you get?
  > summary(lmENVH)
  > cENVH <- coef(lmENVH)

  5. Predict
  > DALYNEW <- c(seq(5, 95, 5))
  > AIR_HNEW <- c(seq(5, 95, 5))
  > WATER_HNEW <- c(seq(5, 95, 5))
  > NEW <- data.frame(DALYNEW, AIR_HNEW, WATER_HNEW)
  > pENV <- predict(lmENVH, NEW, interval="prediction")
  > cENV <- predict(lmENVH, NEW, interval="confidence")
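Note that predict() only uses the new data if its column names match the predictors in the formula (DALY, AIR_H, WATER_H); with the names above it will warn that 'newdata' had a different number of rows and fall back to the attached original columns. A minimal sketch of one way to line the names up (an aside, not part of the original lab script):

  > NEW <- data.frame(DALY=DALYNEW, AIR_H=AIR_HNEW, WATER_H=WATER_HNEW)
  > pENV <- predict(lmENVH, NEW, interval="prediction")   # columns: fit, lwr, upr
  > cENV <- predict(lmENVH, NEW, interval="confidence")
  > head(pENV)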

  6. Repeat for AIR_E and then CLIMATE as the response variable (a sketch follows below).
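A minimal sketch of the repeat, assuming AIR_E and CLIMATE are columns of EPI_data.csv (the variable names are assumptions; the pattern mirrors slide 4):

  > lmAIR_E <- lm(AIR_E ~ DALY + AIR_H + WATER_H)      # same predictors, new response
  > summary(lmAIR_E)
  > lmCLIMATE <- lm(CLIMATE ~ DALY + AIR_H + WATER_H)
  > summary(lmCLIMATE)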

  7. Remember a few useful commands:
  head(<object>)
  tail(<object>)
  summary(<object>)
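For example, on the EPI data frame loaded on slide 4:

  > head(multivariate)                 # first six rows
  > tail(multivariate)                 # last six rows
  > summary(multivariate$ENVHEALTH)    # quartiles plus mean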

  8. K Nearest Neighbors (classification)
  > require(class)    # knn() lives in the class package
  > nyt1 <- read.csv("nyt1.csv")
  > nyt1 <- nyt1[which(nyt1$Impressions>0 & nyt1$Clicks>0 & nyt1$Age>0),]
  > nnyt1 <- dim(nyt1)[1]    # shrink it down!
  > sampling.rate <- 0.9
  > num.test.set.labels <- nnyt1*(1.-sampling.rate)
  > training <- sample(1:nnyt1, sampling.rate*nnyt1, replace=FALSE)
  > train <- subset(nyt1[training,], select=c(Age, Impressions))
  > testing <- setdiff(1:nnyt1, training)
  > test <- subset(nyt1[testing,], select=c(Age, Impressions))
  > cg <- nyt1$Gender[training]
  > true.labels <- nyt1$Gender[testing]
  > classif <- knn(train, test, cg, k=5)
  > classif
  > attributes(.Last.value)    # interpretation to come!
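One way to interpret the result (a sketch, comparing the predictions with the held-out labels):

  > table(true.labels, classif)       # confusion matrix
  > mean(classif != true.labels)      # test-set misclassification rate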

  9. Regression
  > library(xlsx)    # read.xlsx() comes from the xlsx package
  > bronx <- read.xlsx("sales/rollingsales_bronx.xls", pattern="BOROUGH", stringsAsFactors=FALSE, sheetIndex=1, startRow=5, header=TRUE)
  > plot(log(bronx$GROSS.SQUARE.FEET), log(bronx$SALE.PRICE))
  > m1 <- lm(log(bronx$SALE.PRICE)~log(bronx$GROSS.SQUARE.FEET), data=bronx)
  • What's wrong?

  10. Clean up…
  > bronx <- bronx[which(bronx$GROSS.SQUARE.FEET>0 & bronx$LAND.SQUARE.FEET>0 & bronx$SALE.PRICE>0),]
  > m1 <- lm(log(bronx$SALE.PRICE)~log(bronx$GROSS.SQUARE.FEET), data=bronx)
  > summary(m1)

  11. Output of summary(m1):
  Call:
  lm(formula = log(SALE.PRICE) ~ log(GROSS.SQUARE.FEET), data = bronx)

  Residuals:
       Min       1Q   Median       3Q      Max
  -14.4529   0.0377   0.4160   0.6572   3.8159

  Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
  (Intercept)               7.0271     0.3088   22.75   <2e-16 ***
  log(GROSS.SQUARE.FEET)    0.7013     0.0379   18.50   <2e-16 ***
  ---
  Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

  Residual standard error: 1.95 on 2435 degrees of freedom
  Multiple R-squared: 0.1233, Adjusted R-squared: 0.1229
  F-statistic: 342.4 on 1 and 2435 DF, p-value: < 2.2e-16

  12. Plot
  > plot(log(bronx$GROSS.SQUARE.FEET), log(bronx$SALE.PRICE))
  > abline(m1, col="red", lwd=2)
  # then
  > plot(resid(m1))

  13. Another model (2)? Add two more variables to the linear model: LAND.SQUARE.FEET and NEIGHBORHOOD. Then repeat, but suppress the intercept (model 2a).

  14. Models 3 and 4
  Model 3: log(SALE.PRICE), no intercept, vs. log(GROSS.SQUARE.FEET), log(LAND.SQUARE.FEET), NEIGHBORHOOD, BUILDING.CLASS.CATEGORY
  Model 4: log(SALE.PRICE), no intercept, vs. log(GROSS.SQUARE.FEET), log(LAND.SQUARE.FEET), NEIGHBORHOOD*BUILDING.CLASS.CATEGORY

  15. Solution model 2
  > m2 <- lm(log(bronx$SALE.PRICE)~log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD), data=bronx)
  > summary(m2)
  > plot(resid(m2))
  > m2a <- lm(log(bronx$SALE.PRICE)~0+log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD), data=bronx)
  > summary(m2a)
  > plot(resid(m2a))

  16. Solution models 3 and 4
  > m3 <- lm(log(bronx$SALE.PRICE)~0+log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD)+factor(bronx$BUILDING.CLASS.CATEGORY), data=bronx)
  > summary(m3)
  > plot(resid(m3))
  > m4 <- lm(log(bronx$SALE.PRICE)~0+log(bronx$GROSS.SQUARE.FEET)+log(bronx$LAND.SQUARE.FEET)+factor(bronx$NEIGHBORHOOD)*factor(bronx$BUILDING.CLASS.CATEGORY), data=bronx)
  > summary(m4)
  > plot(resid(m4))
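One rough way to put the fits side by side is adjusted R-squared (a sketch; note that R-squared is computed against a different baseline for the no-intercept models, so it is not strictly comparable with model 1):

  > summary(m1)$adj.r.squared
  > summary(m2)$adj.r.squared
  > summary(m3)$adj.r.squared
  > summary(m4)$adj.r.squared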

  17. And now… a complex example
  > install.packages("geoPlot")
  > install.packages("xlsx")
  > require(class)
  > require(gdata)
  > require(geoPlot)
  > require(xlsx)
  # if not already read in:
  > bronx <- read.xlsx("sales/rollingsales_bronx.xls", pattern="BOROUGH", stringsAsFactors=FALSE, sheetIndex=1, startRow=5, header=TRUE)

  18. View(bronx)
  # clean up with regular expressions (keep only digits)
  bronx$SALE.PRICE <- as.numeric(gsub("[^[:digit:]]", "", bronx$SALE.PRICE))
  # missing values?
  sum(is.na(bronx$SALE.PRICE))
  # zero sale prices
  sum(bronx$SALE.PRICE==0)
  # clean these numeric and date fields
  bronx$GROSS.SQUARE.FEET <- as.numeric(gsub("[^[:digit:]]", "", bronx$GROSS.SQUARE.FEET))
  bronx$LAND.SQUARE.FEET <- as.numeric(gsub("[^[:digit:]]", "", bronx$LAND.SQUARE.FEET))
  bronx$SALE.DATE <- as.Date(gsub("[^[:digit:]]", "", bronx$SALE.DATE))   # may need a format= argument, depending on how the dates come in
  bronx$YEAR.BUILT <- as.numeric(gsub("[^[:digit:]]", "", bronx$YEAR.BUILT))
  bronx$ZIP.CODE <- as.character(gsub("[^[:digit:]]", "", bronx$ZIP.CODE))
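A quick sanity check after the conversions (a sketch; anything that was not purely digits has become NA):

  summary(bronx$SALE.PRICE)
  sum(is.na(bronx$GROSS.SQUARE.FEET))
  sum(is.na(bronx$LAND.SQUARE.FEET))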

  19. More corrections
  # filter out low prices
  minprice <- 10000
  bronx <- bronx[which(bronx$SALE.PRICE >= minprice),]
  # how many left?
  nval <- dim(bronx)[1]
  # addresses contain apartment #'s even though there is another column for that - remove them - compresses addresses
  bronx$ADDRESSONLY <- gsub("[,][[:print:]]*", "", gsub("[ ]+", " ", trim(bronx$ADDRESS)))   # trim() is from gdata; collapse runs of spaces to one
  # new data frame for sorting the addresses, fixing etc.
  bronxadd <- unique(data.frame(bronx$ADDRESSONLY, bronx$ZIP.CODE, stringsAsFactors=FALSE))
  # fix the names
  names(bronxadd) <- c("ADDRESSONLY", "ZIP.CODE")
  bronxadd <- bronxadd[order(bronxadd$ADDRESSONLY),]

  20. Yep, more…
  # duplicates?
  duplicates <- duplicated(bronxadd$ADDRESSONLY)
  ## if(duplicates) dupadd <- bronxadd[bronxadd$duplicates,1]
  ## bronxadd <- bronxadd[(bronxadd$ADDRESSONLY != dupadd[1] & bronxadd$ADDRESSONLY != dupadd[2]),]
  # how many?
  nadd <- dim(bronxadd)[1]
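The two ## lines above are left commented out because they do not run as written; a minimal sketch of actually dropping repeated addresses (keeping the first occurrence of each) would be:

  bronxadd <- bronxadd[!duplicated(bronxadd$ADDRESSONLY),]
  nadd <- dim(bronxadd)[1]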

  21. Oh, we want nearest neighbors? How?
  # problem: we need a spatial distribution, since none of the columns have that
  # we will use Google Maps, so limit the number to under 500 (ask me why)
  nsample <- 450
  addsample <- bronxadd[sample.int(nadd, size=nsample),]
  # new data frame for the full address
  addrlist <- data.frame(1:nsample, addsample$ADDRESSONLY, rep("NEW YORK", times=nsample), rep("NY", times=nsample), addsample$ZIP.CODE, rep("US", times=nsample))
  # look them up
  querylist <- addrListLookup(addrlist)

  22. Lots missing – why?
  # how many returned a valid lat/long?
  querylist$matched <- (querylist$latitude != 0)
  unmatchedid <- which(!querylist$matched)
  # MANY missing - what's up?
  unmatched <- length(unmatchedid)
  # WEST -> W and EAST -> E - do it again
  addrlist2 <- data.frame(1:unmatched, gsub(" WEST ", " W ", gsub(" EAST ", " E ", addsample[unmatchedid,1])), rep("NEW YORK", times=unmatched), rep("NY", times=unmatched), addsample[unmatchedid,2], rep("US", times=unmatched))
  querylist[unmatchedid,1:4] <- addrListLookup(addrlist2)[,1:4]
  querylist$matched <- (querylist$latitude != 0)
  unmatchedid <- which(!querylist$matched)
  unmatched <- length(unmatchedid)

  23. Not enough
  # this fixed a LOT but we need more: STREET and AVENUE (could have done PLACE) and others
  addrlist3 <- data.frame(1:unmatched, gsub("WEST", "W", gsub("EAST", "E", gsub("STREET", "ST ", gsub("AVENUE", "AVE", addsample[unmatchedid,1])))), rep("NEW YORK", times=unmatched), rep("NY", times=unmatched), addsample[unmatchedid,2], rep("US", times=unmatched))
  querylist[unmatchedid,1:4] <- addrListLookup(addrlist3)[,1:4]
  querylist$matched <- (querylist$latitude != 0)
  unmatchedid <- which(!querylist$matched)
  unmatched <- length(unmatchedid)
  # 9 left now? good enough.

  24. Rebuild!
  addsample <- cbind(addsample, querylist$latitude, querylist$longitude)
  ## names(addsample[3:4]) <- c("latitude","longitude") - this was meant to correct the column names but did not work for me
  addsample <- addsample[addsample$'querylist$latitude' != 0,]   # note ' ' to work around the column name
  adduse <- merge(bronx, addsample)
  adduse <- adduse[!is.na(adduse$'querylist$latitude'),]
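The commented-out rename fails because names(addsample[3:4]) changes a temporary copy; assigning to names(addsample)[3:4] does work and would avoid the quoted 'querylist$latitude' names (a sketch; if you use it, change the later slides' column references to latitude/longitude to match):

  names(addsample)[3:4] <- c("latitude", "longitude")   # rename the cbind-ed lat/long columns in place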

  25. Most satisfying part!
  mapcoord <- adduse[,c(2,4,24,25)]
  table(mapcoord$NEIGHBORHOOD)
  mapcoord$NEIGHBORHOOD <- as.factor(mapcoord$NEIGHBORHOOD)
  geoPlot(mapcoord, zoom=12, color=mapcoord$NEIGHBORHOOD)

  26. Did you forget the KNN?
  # almost there
  mapcoord$class <- as.numeric(mapcoord$NEIGHBORHOOD)
  nclass <- dim(mapcoord)[1]
  split <- 0.8
  trainid <- sample.int(nclass, floor(split*nclass))
  testid <- (1:nclass)[-trainid]
  ## mappred <- mapcoord[testid,]
  ## mappred$class <- as.numeric(mappred$NEIGHBORHOOD)

  27. KNN!
  kmax <- 10
  knnpred <- matrix(NA, ncol=kmax, nrow=length(testid))
  knntesterr <- rep(NA, times=kmax)
  for (i in 1:kmax) {   # loop over k
    pred <- knn(mapcoord[trainid,3:4], mapcoord[testid,3:4], cl=mapcoord[trainid,2], k=i)
    knnpred[,i] <- as.character(pred)   # store the predicted labels (a bare factor would be stored as integer codes)
    knntesterr[i] <- sum(as.character(pred) != as.character(mapcoord[testid,2])) / length(testid)
  }
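A short follow-up sketch to choose k from the loop's output:

  plot(1:kmax, knntesterr, type="b", xlab="k", ylab="test error")
  which.min(knntesterr)   # k with the lowest test misclassification rate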

  28. Finally K-Means!
  > mapmeans <- data.frame(as.numeric(adduse$ZIP.CODE), as.numeric(mapcoord$NEIGHBORHOOD), adduse$TOTAL.UNITS, adduse$"LAND.SQUARE.FEET", adduse$GROSS.SQUARE.FEET, adduse$SALE.PRICE, adduse$'querylist$latitude', adduse$'querylist$longitude')   # kmeans needs all-numeric columns, hence as.numeric on ZIP
  > mapobj <- kmeans(mapmeans, 5, iter.max=10, nstart=5, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"))   # giving all four names means the first, Hartigan-Wong, is used
  > fitted(mapobj, method=c("centers", "classes"))
  > plot(mapmeans, col=mapobj$cluster)
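Because SALE.PRICE and the square-footage columns are orders of magnitude larger than latitude/longitude, the Euclidean distances in kmeans are dominated by price; a hedged variant (a sketch, not part of the original lab) standardizes each column first:

  > mapmeans.s <- scale(mapmeans)                 # zero mean, unit variance per column
  > mapobj.s <- kmeans(mapmeans.s, 5, iter.max=10, nstart=5)
  > table(mapobj.s$cluster, mapobj$cluster)       # compare with the unscaled clustering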

  29. Assignment 3 • Preliminary and Statistical Analysis. Due ~ Feb 28. 15% (written) • Distribution analysis and comparison, visual ‘analysis’, statistical model fitting and testing of some of the nyt1…31 datasets.

  30. Tentative assignments • Assignment 4: Pattern, trend, relations: model development and evaluation. Due ~ early March. 15% (10% written and 5% oral; individual); • Assignment 5: Term project proposal. Due ~ week 7. 5% (0% written and 5% oral; individual); • Assignment 6: Predictive and Prescriptive Analytics. Due ~ week 8. 15% (15% written and 5% oral; individual); • Term project. Due ~ week 13. 30% (25% written, 5% oral; individual).

  31. Admin info (keep/ print this slide) • Class: ITWS-4963/ITWS 6965 • Hours: 12:00pm-1:50pm Tuesday/ Friday • Location: SAGE 3101 • Instructor: Peter Fox • Instructor contact: pfox@cs.rpi.edu, 518.276.4862 (do not leave a msg) • Contact hours: Monday** 3:00-4:00pm (or by email appt) • Contact location: Winslow 2120 (sometimes Lally 207A announced by email) • TA: Lakshmi Chenicheri chenil@rpi.edu • Web site: http://tw.rpi.edu/web/courses/DataAnalytics/2014 • Schedule, lectures, syllabus, reading, assignments, etc.

  32. Table: Matlab/R/scipy-numpy • http://hyperpolyglot.org/numerical-analysis
