
Interpreting weighted kNN, forms of clustering, decision trees and Bayesian inference



Presentation Transcript


  1. Interpreting weighted kNN, forms of clustering, decision trees and Bayesian inference
     Peter Fox
     Data Analytics – ITWS-4963/ITWS-6965
     Week 7a, March 3, 2014, SAGE 3101

  2. Contents

  3. Weighted KNN
     require(kknn)
     data(iris)
     m <- dim(iris)[1]
     val <- sample(1:m, size = round(m/3), replace = FALSE, prob = rep(1/m, m))
     iris.learn <- iris[-val,]  # train
     iris.valid <- iris[val,]   # test
     iris.kknn <- kknn(Species ~ ., iris.learn, iris.valid, distance = 1, kernel = "triangular")
     # Possible kernel choices are "rectangular" (which is standard unweighted kNN), "triangular",
     # "epanechnikov" (or beta(2,2)), "biweight" (or beta(3,3)), "triweight" (or beta(4,4)),
     # "cos", "inv", "gaussian", "rank" and "optimal".
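
Not on the slide: kknn() keeps k = 7 nearest neighbors by default (which is why the W, D and C matrices on the next slides have seven columns). A minimal sketch of making that explicit, assuming the objects defined above:

      # Hedged sketch: set k explicitly instead of relying on the default (k = 7 in kknn).
      iris.kknn <- kknn(Species ~ ., iris.learn, iris.valid, k = 7,
                        distance = 1, kernel = "triangular")
      summary(iris.kknn)   # summary of the fit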

  4. names(iris.kknn)
     • fitted.values Vector of predictions.
     • CL Matrix of classes of the k nearest neighbors.
     • W Matrix of weights of the k nearest neighbors.
     • D Matrix of distances of the k nearest neighbors.
     • C Matrix of indices of the k nearest neighbors.
     • prob Matrix of predicted class probabilities.
     • response Type of response variable, one of continuous, nominal or ordinal.
     • distance Parameter of Minkowski distance.
     • call The matched call.
     • terms The 'terms' object used.

  5. Look at the output
     > head(iris.kknn$W)
               [,1]      [,2]      [,3]      [,4]       [,5]       [,6]        [,7]
     [1,] 0.4493696 0.2306555 0.1261857 0.1230131 0.07914805 0.07610159 0.014184110
     [2,] 0.7567298 0.7385966 0.5663245 0.3593925 0.35652546 0.24159191 0.004312408
     [3,] 0.5958406 0.2700476 0.2594478 0.2558161 0.09317996 0.09317996 0.042096849
     [4,] 0.6022069 0.5193145 0.4229427 0.1607861 0.10804205 0.09637177 0.055297983
     [5,] 0.7011985 0.6224216 0.5183945 0.2937705 0.16230921 0.13964231 0.053888244
     [6,] 0.5898731 0.5270226 0.3273701 0.1791715 0.15297478 0.08446215 0.010180454

  6. Look at the output
     > head(iris.kknn$D)
               [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]
     [1,] 0.7259100 1.0142464 1.1519716 1.1561541 1.2139825 1.2179988 1.2996261
     [2,] 0.2508639 0.2695631 0.4472127 0.6606040 0.6635606 0.7820818 1.0267680
     [3,] 0.6498131 1.1736274 1.1906700 1.1965092 1.4579977 1.4579977 1.5401298
     [4,] 0.2695631 0.3257349 0.3910409 0.5686904 0.6044323 0.6123406 0.6401741
     [5,] 0.7338183 0.9272845 1.1827617 1.7344095 2.0572618 2.1129288 2.3235298
     [6,] 0.5674645 0.6544263 0.9306719 1.1357241 1.1719707 1.2667669 1.3695454

  7. Look at the output
     > head(iris.kknn$C)
          [,1] [,2] [,3] [,4] [,5] [,6] [,7]
     [1,]   86   38   43   73   92   85   60
     [2,]   31   20   16   21   24   15    7
     [3,]   48   80   44   36   50   63   98
     [4,]    4   21   25    6   20   26    1
     [5,]   68   79   70   65   87   84   75
     [6,]   91   97  100   96   83   93   81
     > head(iris.kknn$prob)
          setosa versicolor virginica
     [1,]      0  0.3377079 0.6622921
     [2,]      1  0.0000000 0.0000000
     [3,]      0  0.8060743 0.1939257
     [4,]      1  0.0000000 0.0000000
     [5,]      0  0.0000000 1.0000000
     [6,]      0  0.0000000 1.0000000

  8. Look at the output
     > head(iris.kknn$fitted.values)
     [1] virginica  setosa     versicolor setosa     virginica  virginica
     Levels: setosa versicolor virginica

  9. Contingency tables
     fitiris <- fitted(iris.kknn)
     table(iris.valid$Species, fitiris)
                 fitiris
                  setosa versicolor virginica
       setosa        17          0         0
       versicolor     0         18         2
       virginica      0          1        12
     # rectangular kernel – no weighting
                 fitiris2
                  setosa versicolor virginica
       setosa        17          0         0
       versicolor     0         18         2
       virginica      0          2        11
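
The rectangular (unweighted) comparison above presumably comes from refitting with kernel = "rectangular". A minimal sketch, assuming the same train/test split as slide 3:

      # Hedged sketch: refit with the rectangular (unweighted) kernel, then tabulate.
      iris.kknn2 <- kknn(Species ~ ., iris.learn, iris.valid, distance = 1, kernel = "rectangular")
      fitiris2 <- fitted(iris.kknn2)
      table(iris.valid$Species, fitiris2)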

  10. The plot
      pcol <- as.character(as.numeric(iris.valid$Species))
      pairs(iris.valid[1:4], pch = pcol,
            col = c("green3", "red")[(iris.valid$Species != fitiris) + 1])  # fitiris from the contingency-table slide

  11. New dataset - ionosphere
      require(kknn)
      data(ionosphere)
      ionosphere.learn <- ionosphere[1:200,]
      ionosphere.valid <- ionosphere[-c(1:200),]
      fit.kknn <- kknn(class ~ ., ionosphere.learn, ionosphere.valid)
      table(ionosphere.valid$class, fit.kknn$fit)
            b   g
        b  19   8
        g   2 122

  12. Vary the parameters - ionosphere
      > (fit.train1 <- train.kknn(class ~ ., ionosphere.learn, kmax = 15,
          kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 1))
      Call:
      train.kknn(formula = class ~ ., data = ionosphere.learn, kmax = 15, distance = 1,
          kernel = c("triangular", "rectangular", "epanechnikov", "optimal"))
      Type of response variable: nominal
      Minimal misclassification: 0.12
      Best kernel: rectangular
      Best k: 2
      > table(predict(fit.train1, ionosphere.valid), ionosphere.valid$class)
            b   g
        b  25   4
        g   2 120
      # for comparison, the plain kknn fit from slide 11:
            b   g
        b  19   8
        g   2 122

  13. Alter distance - ionosphere
      > (fit.train2 <- train.kknn(class ~ ., ionosphere.learn, kmax = 15,
          kernel = c("triangular", "rectangular", "epanechnikov", "optimal"), distance = 2))
      Type of response variable: nominal
      Minimal misclassification: 0.12
      Best kernel: rectangular
      Best k: 2
      > table(predict(fit.train2, ionosphere.valid), ionosphere.valid$class)
            b   g
        b  20   5
        g   7 119
      #1 (distance = 1, from slide 12)
            b   g
        b  25   4
        g   2 120
      #0 (plain kknn, from slide 11)
            b   g
        b  19   8
        g   2 122

  14. (Weighted) kNN
      • Advantages
        • Robust to noisy training data (especially if we use the inverse square of the distance as the weight; see the sketch after this slide)
        • Effective if the training data is large
      • Disadvantages
        • Need to determine the value of the parameter k (number of nearest neighbors)
        • With distance-based learning it is not clear which distance metric and which attributes produce the best results. Should we use all attributes or only certain attributes?
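
A minimal sketch of the inverse-square-of-distance weighting mentioned under Advantages, written in plain R rather than via kknn; the function name invsq_knn, k = 7 and eps are illustrative choices, not from the slides:

      # Hedged sketch: classify one query point by an inverse-square-distance weighted vote.
      invsq_knn <- function(train, labels, query, k = 7, eps = 1e-8) {
        d <- sqrt(rowSums(sweep(train, 2, query)^2))   # Euclidean distances to the query
        nn <- order(d)[1:k]                            # indices of the k nearest neighbors
        w <- 1 / (d[nn]^2 + eps)                       # inverse-square weights (eps avoids 1/0)
        votes <- tapply(w, labels[nn], sum)            # sum the weights per class
        names(which.max(votes))                        # class with the largest weighted vote
      }
      # Example: predict the species of the first held-out iris flower
      # (assumes iris.learn / iris.valid from slide 3).
      invsq_knn(as.matrix(iris.learn[, 1:4]), iris.learn$Species,
                unlist(iris.valid[1, 1:4]))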

  15. Additional factors
      • Dimensionality – with too many dimensions the closest neighbors are too far away to be considered close
      • Overfitting – does closeness mean right classification? (e.g. noise or incorrect data, like a wrong street address -> wrong lat/lon) – beware of k=1!
      • Correlated features – double weighting
      • Relative importance – including/excluding features

  16. More factors
      • Sparseness – with sparse data the standard similarity/distance measures (e.g. Jaccard) lose meaning because neighbors share little or no overlap
      • Errors – unintentional and intentional
      • Computational complexity
      • Sensitivity to distance metrics – especially due to different scales (recall ages versus impressions versus clicks, and especially binary values: gender, logged in/not); see the scaling sketch after this slide
      • Does not account for changes over time
      • Model updating as new data comes in
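
One common way to tame the scale sensitivity listed above is to standardize attributes before computing distances. A minimal sketch on the iris split from slide 3 (note: kknn also has its own scale argument that standardizes internally, so this is only to make the idea explicit):

      # Hedged sketch: center/scale each attribute using training-set statistics only.
      ctr <- colMeans(iris.learn[, 1:4])
      sds <- apply(iris.learn[, 1:4], 2, sd)
      learn.z <- as.data.frame(scale(iris.learn[, 1:4], center = ctr, scale = sds))
      valid.z <- as.data.frame(scale(iris.valid[, 1:4], center = ctr, scale = sds))
      learn.z$Species <- iris.learn$Species
      valid.z$Species <- iris.valid$Species   # kept so the test frame matches the formula
      iris.kknn.z <- kknn(Species ~ ., learn.z, valid.z, distance = 2, kernel = "triangular")
      table(iris.valid$Species, fitted(iris.kknn.z))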

  17. Lots of clustering options
      • http://wiki.math.yorku.ca/index.php/R:_Cluster_analysis
      • Clustergram - this graph is useful in exploratory analysis for non-hierarchical clustering algorithms like k-means, and for hierarchical clustering algorithms when the number of observations is large enough to make dendrograms impractical.
      • (remember our attempt at a dendrogram for mapmeans?)

  18. Cluster plotting
      source("http://www.r-statistics.com/wp-content/uploads/2012/01/source_https.r.txt") # source code from github
      require(RCurl)
      require(colorspace)
      source_https("https://raw.github.com/talgalili/R-code-snippets/master/clustergram.r")
      data(iris)
      set.seed(250)
      par(cex.lab = 1.5, cex.main = 1.2)
      Data <- scale(iris[,-5]) # scaling

  19. > head(iris)
        Sepal.Length Sepal.Width Petal.Length Petal.Width Species
      1          5.1         3.5          1.4         0.2  setosa
      2          4.9         3.0          1.4         0.2  setosa
      3          4.7         3.2          1.3         0.2  setosa
      4          4.6         3.1          1.5         0.2  setosa
      5          5.0         3.6          1.4         0.2  setosa
      6          5.4         3.9          1.7         0.4  setosa
      > head(Data)
           Sepal.Length Sepal.Width Petal.Length Petal.Width
      [1,]   -0.8976739  1.01560199    -1.335752   -1.311052
      [2,]   -1.1392005 -0.13153881    -1.335752   -1.311052
      [3,]   -1.3807271  0.32731751    -1.392399   -1.311052
      [4,]   -1.5014904  0.09788935    -1.279104   -1.311052
      [5,]   -1.0184372  1.24503015    -1.335752   -1.311052
      [6,]   -0.5353840  1.93331463    -1.165809   -1.048667

  20. • Look at the location of the cluster points on the Y axis. See when they remain stable, when they start flying around, and what happens to them at higher numbers of clusters (do they re-group together?)
      • Observe the strands of the data points. Even if the cluster centers are not ordered, the lines for each item might (needs more research and thinking) tend to move together – hinting at the real number of clusters
      • Run the plot multiple times to observe the stability of the cluster formation (and location)
      http://www.r-statistics.com/2010/06/clustergram-visualization-and-diagnostics-for-cluster-analysis-r-code/

  21. clustergram(Data, k.range = 2:8, line.width = 0.004) # line.width - adjust according to Y-scale

  22. Any good?
      set.seed(500)
      Data2 <- scale(iris[,-5])
      par(cex.lab = 1.2, cex.main = .7)
      par(mfrow = c(3,2))
      for(i in 1:6) clustergram(Data2, k.range = 2:8, line.width = .004, add.center.points = T)
      # why does this produce different plots?
      # what defaults are used (kmeans)
      # PCA?? Remember your linear algebra

  23. How can you tell it is good?
      set.seed(250)
      Data <- rbind(
        cbind(rnorm(100,0, sd = 0.3), rnorm(100,0, sd = 0.3), rnorm(100,0, sd = 0.3)),
        cbind(rnorm(100,1, sd = 0.3), rnorm(100,1, sd = 0.3), rnorm(100,1, sd = 0.3)),
        cbind(rnorm(100,2, sd = 0.3), rnorm(100,2, sd = 0.3), rnorm(100,2, sd = 0.3)))
      clustergram(Data, k.range = 2:5, line.width = .004, add.center.points = T)

  24. More complex…
      set.seed(250)
      Data <- rbind(
        cbind(rnorm(100,1, sd = 0.3), rnorm(100,0, sd = 0.3), rnorm(100,0, sd = 0.3), rnorm(100,0, sd = 0.3)),
        cbind(rnorm(100,0, sd = 0.3), rnorm(100,1, sd = 0.3), rnorm(100,0, sd = 0.3), rnorm(100,0, sd = 0.3)),
        cbind(rnorm(100,0, sd = 0.3), rnorm(100,1, sd = 0.3), rnorm(100,1, sd = 0.3), rnorm(100,0, sd = 0.3)),
        cbind(rnorm(100,0, sd = 0.3), rnorm(100,0, sd = 0.3), rnorm(100,0, sd = 0.3), rnorm(100,1, sd = 0.3)))
      clustergram(Data, k.range = 2:8, line.width = .004, add.center.points = T)

  25. Exercise - swiss
      par(mfrow = c(2,3))
      swiss.x <- scale(as.matrix(swiss[, -1]))
      set.seed(1); for(i in 1:6) clustergram(swiss.x, k.range = 2:6, line.width = 0.01)

  26. clusplot
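
The clusplot figure itself is not reproduced in the transcript; a minimal sketch of producing one with the cluster package on the scaled swiss data from the exercise slide (k = 3 is an illustrative choice):

      # Hedged sketch: a 2-D cluster plot (first two principal components) of a k-means partition.
      require(cluster)
      swiss.x <- scale(as.matrix(swiss[, -1]))   # as on slide 25
      set.seed(1)
      km <- kmeans(swiss.x, centers = 3)         # k = 3 chosen only for illustration
      clusplot(swiss.x, km$cluster, color = TRUE, shade = TRUE, labels = 2, lines = 0,
               main = "clusplot of swiss (k-means, k = 3)")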

  27. Hierarchical clustering
      > dswiss <- dist(as.matrix(swiss))
      > hs <- hclust(dswiss)
      > plot(hs)
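
Not on the slide: the dendrogram can be cut into a chosen number of groups; a minimal sketch, reusing hs from above (k = 3 is an illustrative choice):

      # Hedged sketch: cut the hierarchical tree into 3 groups and inspect the group sizes.
      groups <- cutree(hs, k = 3)
      table(groups)
      rect.hclust(hs, k = 3, border = "red")   # outline the 3 clusters on the existing dendrogram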

  28. ctree
      require(party)
      swiss_ctree <- ctree(Fertility ~ Agriculture + Education + Catholic, data = swiss)
      plot(swiss_ctree)

  29. pairs(iris[1:4], main = "Anderson's Iris Data -- 3 species", pch = 21,
            bg = c("red", "green3", "blue")[unclass(iris$Species)])

  30. splom extra!
      require(lattice)
      super.sym <- trellis.par.get("superpose.symbol")
      splom(~iris[1:4], groups = Species, data = iris, panel = panel.superpose,
            key = list(title = "Three Varieties of Iris", columns = 3,
                       points = list(pch = super.sym$pch[1:3], col = super.sym$col[1:3]),
                       text = list(c("Setosa", "Versicolor", "Virginica"))))
      splom(~iris[1:3]|Species, data = iris, layout = c(2,2), pscales = 0,
            varnames = c("Sepal\nLength", "Sepal\nWidth", "Petal\nLength"),
            page = function(...) {
              ltext(x = seq(.6, .8, length.out = 4), y = seq(.9, .6, length.out = 4),
                    labels = c("Three", "Varieties", "of", "Iris"), cex = 2)
            })

  31. parallelplot(~iris[1:4] | Species, iris)

  32. parallelplot(~iris[1:4], iris, groups = Species, horizontal.axis = FALSE, scales = list(x = list(rot = 90)))

  33. hclust for iris
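
The figure on this slide is not reproduced in the transcript; a minimal sketch of building the iris dendrogram, mirroring the swiss example on slide 27:

      # Hedged sketch: hierarchical clustering of iris on the four numeric columns only.
      idist <- dist(as.matrix(iris[, -5]))   # drop the Species factor before computing distances
      ihc <- hclust(idist)
      plot(ihc, labels = FALSE, main = "hclust for iris")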

  34. plot(iris_ctree)

  35. Ctree
      > iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=iris)
      > print(iris_ctree)
      Conditional inference tree with 4 terminal nodes
      Response:  Species
      Inputs:  Sepal.Length, Sepal.Width, Petal.Length, Petal.Width
      Number of observations:  150
      1) Petal.Length <= 1.9; criterion = 1, statistic = 140.264
        2)*  weights = 50
      1) Petal.Length > 1.9
        3) Petal.Width <= 1.7; criterion = 1, statistic = 67.894
          4) Petal.Length <= 4.8; criterion = 0.999, statistic = 13.865
            5)*  weights = 46
          4) Petal.Length > 4.8
            6)*  weights = 8
        3) Petal.Width > 1.7
          7)*  weights = 46
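
Not on the slide: a quick check of how well this tree separates the species, by tabulating its predictions against the true labels on the training data:

      # Hedged sketch: confusion table and accuracy of the ctree on the data it was fit to.
      table(predict(iris_ctree), iris$Species)
      mean(predict(iris_ctree) == iris$Species)   # proportion classified correctly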

  36. > plot(iris_ctree, type="simple")

  37. New dataset to work with trees
      require(rpart)   # rpart provides rpart() and the kyphosis dataset
      fitK <- rpart(Kyphosis ~ Age + Number + Start, method="class", data=kyphosis)
      printcp(fitK) # display the results
      plotcp(fitK) # visualize cross-validation results
      summary(fitK) # detailed summary of splits
      # plot tree
      plot(fitK, uniform=TRUE, main="Classification Tree for Kyphosis")
      text(fitK, use.n=TRUE, all=TRUE, cex=.8)
      # create attractive postscript plot of tree
      post(fitK, file = "kyphosistree.ps", title = "Classification Tree for Kyphosis")
      # might need to convert to PDF (distill)

  38. > pfitK <- prune(fitK, cp = fitK$cptable[which.min(fitK$cptable[,"xerror"]),"CP"])
      > plot(pfitK, uniform=TRUE, main="Pruned Classification Tree for Kyphosis")
      > text(pfitK, use.n=TRUE, all=TRUE, cex=.8)
      > post(pfitK, file = "ptree.ps", title = "Pruned Classification Tree for Kyphosis")

  39. > fitK <- ctree(Kyphosis ~ Age + Number + Start, data=kyphosis)
      > plot(fitK, main="Conditional Inference Tree for Kyphosis")

  40. > plot(fitK, main="Conditional Inference Tree for Kyphosis",type="simple")

  41. randomForest
      > require(randomForest)
      > fitKF <- randomForest(Kyphosis ~ Age + Number + Start, data=kyphosis)
      > print(fitKF) # view results
      Call:
       randomForest(formula = Kyphosis ~ Age + Number + Start, data = kyphosis)
                     Type of random forest: classification
                           Number of trees: 500
      No. of variables tried at each split: 1
              OOB estimate of  error rate: 20.99%
      Confusion matrix:
              absent present class.error
      absent      59       5   0.0781250
      present     12       5   0.7058824
      > importance(fitKF) # importance of each predictor
             MeanDecreaseGini
      Age            8.654112
      Number         5.584019
      Start         10.168591
      Random forests improve predictive accuracy by generating a large number of bootstrapped trees (based on random samples of variables), classifying a case using each tree in this new "forest", and deciding a final predicted outcome by combining the results across all of the trees (an average in regression, a majority vote in classification).
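
Not on the slide: the importance values above can also be inspected graphically with functions from the randomForest package; a minimal sketch:

      # Hedged sketch: plot variable importance and the OOB error as more trees are added.
      varImpPlot(fitKF, main = "Variable importance for Kyphosis")
      plot(fitKF, main = "OOB error vs. number of trees")   # plot.randomForest shows error rates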

  42. More on another dataset.
      # Regression Tree Example
      library(rpart)
      # build the tree
      fitM <- rpart(Mileage ~ Price + Country + Reliability + Type, method="anova", data=cu.summary)
      printcp(fitM) # display the results
      ….
      Root node error: 1354.6/60 = 22.576
      n=60 (57 observations deleted due to missingness)
              CP nsplit rel error  xerror     xstd
      1 0.622885      0   1.00000 1.03165 0.176920
      2 0.132061      1   0.37711 0.51693 0.102454
      3 0.025441      2   0.24505 0.36063 0.079819
      4 0.011604      3   0.21961 0.34878 0.080273
      5 0.010000      4   0.20801 0.36392 0.075650
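
The cp value pruned with on slide 45 (0.01160389) appears to come from this cptable; a minimal sketch of selecting it programmatically rather than reading it off by eye:

      # Hedged sketch: pick the cp with the smallest cross-validated error (xerror) from the cptable.
      bestcp <- fitM$cptable[which.min(fitM$cptable[, "xerror"]), "CP"]
      bestcp   # ~0.0116 for the run above (the row with nsplit = 3)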

  43. Mileage…
      plotcp(fitM) # visualize cross-validation results
      summary(fitM) # detailed summary of splits
      <we will leave this for Friday to look at>

  44. par(mfrow=c(1,2)) rsq.rpart(fitM) # visualize cross-validation results

  45. # plot tree
      plot(fitM, uniform=TRUE, main="Regression Tree for Mileage")
      text(fitM, use.n=TRUE, all=TRUE, cex=.8)
      # prune the tree
      pfitM <- prune(fitM, cp=0.01160389) # from cptable
      # plot the pruned tree
      plot(pfitM, uniform=TRUE, main="Pruned Regression Tree for Mileage")
      text(pfitM, use.n=TRUE, all=TRUE, cex=.8)
      post(pfitM, file = "ptree2.ps", title = "Pruned Regression Tree for Mileage")

  46. # Conditional Inference Tree for Mileage
      fit2M <- ctree(Mileage ~ Price + Country + Reliability + Type, data=na.omit(cu.summary))
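
The slide stops at fitting fit2M; presumably the next step, as in the earlier ctree examples, is to plot it (a minimal sketch, not in the transcript):

      # Hedged sketch: visualize the conditional inference regression tree for Mileage.
      plot(fit2M, main = "Conditional Inference Tree for Mileage")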
