Lecture: Data Mining in R (732A44 Programming in R)
Logistic regression: two classes • Consider a logistic model with one predictor: X = price of the car, Y = equipment • Logistic model: P(Y=1|X) = exp(a + bX) / (1 + exp(a + bX)) • Use function glm(formula, family, data) • Formula: Response~Model • The model part consists of a+b (addition), a:b (interaction terms), a*b (addition and interaction), . (all predictors); see the sketch after the example below • Family: specify binomial
Logistic regression: two classes reg<-glm(X3...Equipment~Price.in.SEK., family=binomial, data=mydata);
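A minimal sketch of the formula variants listed above; the response y, predictors x1 and x2, and data frame d are hypothetical names used only for illustration:
fit1 <- glm(y ~ x1 + x2, family = binomial, data = d)  # additive terms
fit2 <- glm(y ~ x1:x2, family = binomial, data = d)    # interaction term only
fit3 <- glm(y ~ x1 * x2, family = binomial, data = d)  # main effects plus interaction
fit4 <- glm(y ~ ., family = binomial, data = d)        # all predictors in d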
Logistic regression: several predictors Data about contraceptive use • Several analysis plots can be obtained by plot(lrfit) • Response: a matrix of successes/failures
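A hedged sketch of the matrix (success/failure) response form; the data frame cuse and its column names (using, notUsing, age, education, wantsMore) are assumed for illustration, not taken from the lecture files:
lrfit <- glm(cbind(using, notUsing) ~ age + education + wantsMore,
             family = binomial, data = cuse)  # two-column response: successes, failures
summary(lrfit)
plot(lrfit)                                   # the analysis plots mentioned above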
Logistic regression: further comments • Nominal logistic regression (library mlogit, function mlogit) • Stepwise model selection: step() function • Prediction: predict() function
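A minimal sketch of step() and predict(), reusing the car-equipment model from the earlier example (assuming mydata holds the response and the candidate predictors):
full <- glm(X3...Equipment ~ ., family = binomial, data = mydata)
reduced <- step(full)                                   # AIC-based stepwise model selection
predict(reduced, newdata = mydata, type = "response")   # predicted probabilities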
Smoothing splines Minimize a penalized sum of squared residuals, RSS(f, λ) = Σᵢ (yᵢ − f(xᵢ))² + λ ∫ (f''(t))² dt, where λ is the smoothing parameter. λ = 0: any function interpolating the data; λ = +∞: the least-squares line fit
Smoothing splines • smooth.spline(x, y, df, spar, cv, …) • Df: degrees of freedom • Spar: penalty parameter • Cv: TRUE = leave-one-out CV, FALSE = GCV, NA = no cross-validation plot(m2$Kilometer, m2$Price, main="df=40"); res<-smooth.spline(m2$Kilometer, m2$Price, df=40); lines(res, col="blue");
Generalized additive models A function of the expected response is additive in the set of inputs, i.e., g(E[Y|X₁,…,Xₚ]) = α + f₁(X₁) + … + fₚ(Xₚ). Example: nonlinear logistic regression of a binary response, log(p/(1−p)) = α + f₁(X₁) + … + fₚ(Xₚ), where p = P(Y=1|X₁,…,Xₚ)
GAM Library: mgcv • gam(formula, family=gaussian, data, method="GCV.Cp", select=FALSE, sp) • Formula: usual terms and spline terms s(…) • Method: method for selection of smoothing parameters • Select: TRUE – variable selection is performed • Sp: smoothing parameters (maximal df) • Example: car properties • predict.gam() can be used for predictions bp<-gam(MPG~s(WT, sp=2)+s(SP, sp=1),data=m3) vis.gam(bp, theta=10, phi=30);
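A hedged sketch of prediction with the GAM fitted above; the new weight (WT) and speed (SP) values are made up for illustration:
newcars <- data.frame(WT = c(2.5, 3.0), SP = c(110, 120))  # hypothetical cars
predict(bp, newdata = newcars)                             # predicted MPG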
GAM Smoothing components plot(bp, pages=1)
Principal components analysis Idea: Introduce a new coordinate system (PC1, PC2, …) where • The first principal component (PC1) is the direction that maximizes the variance of the projected data • The second principal component (PC2) is the direction that maximizes the variance of the projected data after the variation along PC1 has been removed • … In the new coordinate system, the coefficients corresponding to the last principal components are very small, so these columns can be removed. [Figure: data points with the PC1 and PC2 directions]
Principal components analysis • princomp(x, ...) m4<-m3; m4$MODEL<-c(); res<-princomp(m4); loadings(res); plot(res); biplot(res); summary(res);
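A small sketch of dropping the "last" components mentioned above, keeping only the scores on the first two principal components:
scores <- res$scores[, 1:2]   # coordinates of the observations in the (PC1, PC2) system
head(scores)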
Decision trees [Figure: example decision tree splitting on X1 (<9 / >=9) and X2, with 0/1 labels in the leaves, together with the corresponding rectangular partition of the (X1, X2) plane]
Regression tree example
Training-validation-test • Training-validation (60/40) • If training-validation-test is required, use a similar strategy (see the sketch below) sub <- sample(nrow(m2), floor(nrow(m2) * 0.6)) training <- m2[sub, ] validation <- m2[-sub, ]
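A sketch of that strategy for a 60/20/20 training/validation/test split (the proportions are illustrative):
n <- nrow(m2)
idx <- sample(n)                                        # random permutation of the rows
training   <- m2[idx[1:floor(0.6 * n)], ]
validation <- m2[idx[(floor(0.6 * n) + 1):floor(0.8 * n)], ]
test       <- m2[idx[(floor(0.8 * n) + 1):n], ]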
Decision trees by CART Growing a full tree Library: tree • Create tree: tree(formula, data, subset, split = c("deviance", "gini"), …) • Subset: if a subset of cases needs to be used for training • Split: splitting criterion • More parameters via the control argument (see the sketch after the next example) • Prune tree with help of a validation set: prune.tree(tree, newdata, method = c("deviance", "misclass"), …) • Prune tree with cross-validation: cv.tree(object, FUN = prune.tree, K = 10, ...) • K is the number of folds in cross-validation
Classification trees: CART Example: olive oils in Italy sub <- sample(nrow(m5), floor(nrow(m5) * 0.6)) training <- m5[sub, ] validation <- m5[-sub, ] mytree<-tree(Area~.-Region-X,data=training); summary(mytree) plot(mytree,type="uniform"); text(mytree,cex=0.5);
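A hedged sketch of the control argument mentioned above; the minsize and mindev settings are illustrative, not recommendations:
ctrl <- tree.control(nobs = nrow(training), minsize = 10, mindev = 0.005)
mytree_ctrl <- tree(Area ~ . - Region - X, data = training, control = ctrl)
summary(mytree_ctrl)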
Classification trees: CART • Dependence of the misclassification rate on the size of the tree: treeseq1<-prune.tree(mytree, newdata=validation, method="misclass") plot(treeseq1); title("Validation"); treeseq2<-cv.tree(mytree, method="misclass") plot(treeseq2); title("CV");
Regression trees: CART mytree2<-tree(eicosenoic~linoleic+linolenic+palmitic+palmitoleic,data=training); mytree3<-prune.tree(mytree2, best=4) # 4 leaves in total print(mytree3) summary(mytree3) plot(mytree3) text(mytree3)
Decision trees: other techniques • Conditional inference trees Library: party • CART with another library: rpart training$X<-c(); training$Area<-c(); mytree4<-ctree(Region~.,data=training); print(mytree4) plot(mytree4, type="simple"); # gives nice plots
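A hedged sketch using the rpart library mentioned above; it grows a CART-style classification tree with its own default settings:
library(rpart)
mytree5 <- rpart(Region ~ ., data = training, method = "class")  # classification tree
print(mytree5)
plot(mytree5); text(mytree5)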
Neural network • Input nodes, input layer • [Hidden nodes, hidden layer(s)] • Output nodes, output layer • Weights • Activation functions • Combination functions [Figure: feed-forward network with inputs x1, …, xp, hidden units z1, …, zM and outputs f1, …, fK]
Neural networks • Feed-forward NNs Library: neuralnet • neuralnet(formula, data, hidden = 1, rep = 1, startweights = NULL, algorithm = "rprop+", err.fct = "sse", act.fct = "logistic", linear.output = TRUE, …) • Hidden: vector giving the number of hidden neurons in each layer • Rep: number of training repetitions of the network • Startweights: starting weights • Algorithm: "backprop", "rprop+", "sag", "slr" • Err.fct: any function, "sse", or "ce" (cross-entropy) • Act.fct: any function, "logistic", or "tanh" • Linear.output: TRUE if no activation at the output • confidence.interval(x, alpha = 0.05): confidence intervals for the weights • compute(x, covariate): prediction • plot(x, …): plot the given neural network
Neural networks • Example mynet<-neuralnet( Region~eicosenoic+linoleic+linolenic+palmitic, data=training, rep=5, hidden=c(2,2), act.fct="tanh") plot(mynet); mynet$result.matrix
Neural networks • Prediction with compute() • Finding the misclassification rate: table(true_values, predicted_values) – not only for neural networks • Another package, ready for a qualitative response (the classical nnet): mynet1<-nnet( Region~eicosenoic+linoleic, data=training, size=3); coef(mynet1) predict(mynet1, newdata=validation);
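A hedged sketch of compute() and the misclassification table for the neuralnet model fitted two slides earlier; it assumes Region was coded numerically when mynet was trained and rounds the network output to the nearest class code:
pred <- compute(mynet, validation[, c("eicosenoic", "linoleic", "linolenic", "palmitic")])
predicted <- round(pred$net.result)   # crude assignment to the nearest class code
table(validation$Region, predicted)   # rows: true classes, columns: predicted classes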
Clustering • Purpose is to identify (separated) groups of observations in the input space • K-means • Hierarchical • Density-based
K-means • The number of seeds (clusters) K should be given • Starting seed positions are needed • kmeans(x, centers, iter.max = 10, nstart = 1) • X: data frame • Centers: either the value of K or a set of initial cluster centers • Iter.max: maximum number of iterations res<-kmeans(data.frame(m5$linoleic, m5$eicosenoic), 2);
K-means • One way to visualize plot(m5$linoleic, m5$eicosenoic, col=res$cluster); points(res$centers[,1], res$centers[,2], col = 1:2, pch = 8, cex=2)
Hierarchical clustering • Agglomerative • Place each point into a single cluster • Merge the nearest clusters until you get 1 cluster • Meaning of "two objects are close"? • Measure of proximity (ex: quantitative vars, Euclidean distance) • Similarity measure s_rs (= 1 if same object, < 1 otherwise) • Ex: correlation • Dissimilarity measure δ_rs (= 0 if same object, > 0 otherwise) • Ex: Euclidean distance
Hierarchical clustering • hclust(d, method = "complete", members = NULL) • D: dissimilarity measure • Method: "ward", "single", "complete", "average", "mcquitty", "median" or "centroid" Returned: a tree showing the merging sequence • cutree(tree, k = NULL, h = NULL) • K: number of clusters to make • H: at which level to cut Returned: cluster indices
Hierarchical clustering • Example x<-data.frame(m5$linolenic, m5$eicosenoic); m5_dist<-dist(x); m5_dend<-hclust(m5_dist, method="complete") plot(m5_dend);
Hierarchical clustering • Example DO NOT forget to standardize! clust=cutree(m5_dend, k=2); plot(m5$linoleic, m5$eicosenoic, col=clust);
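A small sketch of the standardization step, rescaling each variable before the distances are computed:
x_std <- scale(x)                                        # center and scale each column
m5_dend_std <- hclust(dist(x_std), method = "complete")
clust_std <- cutree(m5_dend_std, k = 2)
plot(m5$linoleic, m5$eicosenoic, col = clust_std)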
Density-based clustering • Kernel-based density estimation. Library: pdfCluster • pdfCluster(x, h = h.norm(x), hmult = 0.75, …) • X: data to be partitioned • H: a vector of smoothing parameters • Hmult: shrinkage factor x<-data.frame(m5$linolenic, m5$eicosenoic); res<-pdfCluster(x); plot(res)
Reference http://cran.r-project.org/doc/contrib/YanchangZhao-refcard-data-mining.pdf