PCA and return to Big Data infrastructure… and assignment time. Peter Fox. Data Analytics – ITWS-4963/ITWS-6965, Week 14b, May 2, 2014.
Visual approaches for PCA/DR
• Scree plot - A plot, in descending order of magnitude, of the eigenvalues of a correlation matrix. In the context of factor analysis or principal components analysis, a scree plot helps the analyst visualize the relative importance of the factors: a sharp drop in the plot signals that subsequent factors are ignorable.
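The definition above can be sketched directly in base R. This is a minimal illustration, assuming the built-in USArrests data (the same data used in the prcomp examples); the variable names `ev` and `pc` are chosen here for illustration only.

```r
# Eigenvalues of the correlation matrix of the built-in USArrests data.
# For a symmetric matrix, eigen() returns the values in decreasing order.
ev <- eigen(cor(USArrests))$values

# Scree plot drawn by hand: eigenvalue against component number
plot(seq_along(ev), ev, type = "b",
     xlab = "Component", ylab = "Eigenvalue",
     main = "Scree plot (USArrests)")

# Built-in alternative: screeplot() on a scaled prcomp fit.
# The component variances (sdev^2) are the same eigenvalues.
pc <- prcomp(USArrests, scale. = TRUE)
screeplot(pc, type = "lines")
```

Working from the correlation matrix corresponds to `scale. = TRUE` in prcomp, which is why the two plots show the same values.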
require(graphics)

## the variances of the variables in the
## USArrests data vary by orders of magnitude, so scaling is appropriate
prcomp(USArrests)  # inappropriate
prcomp(USArrests, scale = TRUE)
prcomp(~ Murder + Assault + Rape, data = USArrests, scale = TRUE)
plot(prcomp(USArrests))
summary(prcomp(USArrests, scale = TRUE))
biplot(prcomp(USArrests, scale = TRUE))
prcomp

> prcomp(USArrests) # inappropriate
Standard deviations:
[1] 83.732400 14.212402  6.489426  2.482790

Rotation:
                PC1         PC2         PC3         PC4
Murder   0.04170432 -0.04482166  0.07989066 -0.99492173
Assault  0.99522128 -0.05876003 -0.06756974  0.03893830
UrbanPop 0.04633575  0.97685748 -0.20054629 -0.05816914
Rape     0.07515550  0.20071807  0.97408059  0.07232502

> prcomp(USArrests, scale = TRUE)
Standard deviations:
[1] 1.5748783 0.9948694 0.5971291 0.4164494

Rotation:
                PC1        PC2        PC3         PC4
Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
Rape     -0.5434321 -0.1673186  0.8177779  0.08902432
> prcomp(~ Murder + Assault + Rape, data = USArrests, scale = TRUE)
Standard deviations:
[1] 1.5357670 0.6767949 0.4282154

Rotation:
               PC1        PC2        PC3
Murder  -0.5826006  0.5339532 -0.6127565
Assault -0.6079818  0.2140236  0.7645600
Rape    -0.5393836 -0.8179779 -0.1999436

> summary(prcomp(USArrests, scale = TRUE))
Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.5749 0.9949 0.59713 0.41645
Proportion of Variance 0.6201 0.2474 0.08914 0.04336
Cumulative Proportion  0.6201 0.8675 0.95664 1.00000
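The "Proportion of Variance" and "Cumulative Proportion" rows in the summary output can be reproduced by hand from the standard deviations. A short sketch, assuming the same scaled fit on USArrests; the names `variances`, `prop` and `cum` are illustrative.

```r
pc <- prcomp(USArrests, scale. = TRUE)

# Each component's variance is its standard deviation squared
variances <- pc$sdev^2

# Share of the total variance carried by each component, and the running total
prop <- variances / sum(variances)
cum  <- cumsum(prop)

round(rbind("Proportion of Variance" = prop,
            "Cumulative Proportion"  = cum), 4)
```

With scaled variables the total variance equals the number of variables (4 here), so each proportion is simply the squared standard deviation divided by 4.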
Line plots (Lab 6)
[Figure: line plots for prcomp (top) and MetaPCA (bottom); panel labels: Eigen, Angle, RobustAngle, SparseAngle]
• Looking for convergence as iteration increases
• http://cran.r-project.org/web/packages/MetaPCA/MetaPCA.pdf
Lab 9

library(dr)
data(ais)

# default fitting method is "sir"
s0 <- dr(LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) + log(WCC) + log(RCC) +
           log(Hc) + log(Ferr), data = ais)

# Refit, using a different function for slicing to agree with arc.
summary(s1 <- update(s0, slice.function = dr.slices.arc))

# Refit again, using save, with 10 slices; the default is max(8,ncol+3)
summary(s2 <- update(s1, nslices = 10, method = "save"))

# Refit, using phdres. Tests are different for phd; output is similar
# for phdy, but tests are not justifiable.
summary(s3 <- update(s1, method = "phdres"))

# fit using ire:
summary(s4 <- update(s1, method = "ire"))

# fit using Sex as a grouping variable.
s5 <- update(s4, group = ~Sex)
> s0
dr(formula = LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) + log(WCC) +
    log(RCC) + log(Hc) + log(Ferr), data = ais)

Estimated Basis Vectors for Central Subspace:
                  Dir1          Dir2        Dir3         Dir4
log(SSF)   0.150963358 -0.0501785457  0.10898336 -0.002210206
log(Wt)   -0.916480522 -0.1942298625 -0.20123696 -0.089722026
log(Hg)   -0.131538894  0.6854750758  0.71997546 -0.663097774
log(Ht)   -0.093358860 -0.0433408964  0.46445398  0.290838658
log(WCC)   0.004467838  0.0001833808  0.04497590  0.071904557
log(RCC)  -0.188973540  0.3475652934  0.29496908  0.037056363
log(Hc)    0.274758965 -0.6058301419 -0.34196615  0.678877114
log(Ferr) -0.005631238  0.0130588502 -0.08702709  0.015547214

Eigenvalues:
[1] 0.95766163 0.24504161 0.10707594 0.09041305
> summary(s1 <- update(s0,slice.function=dr.slices.arc))

Call:
dr(formula = LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) + log(WCC) +
    log(RCC) + log(Hc) + log(Ferr), data = ais, slice.function = dr.slices.arc)

Method:
sir with 11 slices, n = 202.

Slice Sizes:
19 19 19 19 19 19 19 18 18 18 15

Estimated Basis Vectors for Central Subspace:
               Dir1       Dir2     Dir3      Dir4
log(SSF)   0.143177 -0.0476079 -0.02815  0.003785
log(Wt)   -0.879504 -0.1425841  0.23303 -0.094970
log(Hg)   -0.195963  0.6318503  0.24483 -0.509424
log(Ht)   -0.058923 -0.1100757 -0.87893  0.217803
log(WCC)  -0.007276 -0.0029772 -0.05309  0.043056
log(RCC)  -0.167736  0.3924936 -0.19711 -0.213689
log(Hc)    0.368652 -0.6418658 -0.26373  0.796849
log(Ferr) -0.002697  0.0002593  0.03492  0.039116

              Dir1   Dir2    Dir3    Dir4
Eigenvalues 0.9572 0.2275 0.09368 0.07319
R^2(OLS|dr) 0.9980 0.9981 0.99839 0.99864

Large-sample Marginal Dimension Tests:
              Stat df p.value
0D vs >= 1D 284.78 80 0.00000
1D vs >= 2D  91.43 63 0.01113
2D vs >= 3D  45.48 48 0.57690
3D vs >= 4D  26.55 35 0.84694
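One common way to read the marginal dimension tests is sequentially: work down the table and take the first dimension whose null hypothesis is not rejected. The sketch below applies that reading to the p-values printed in the sir output above. The 0.05 threshold and the names `p_values`, `alpha` and `d_hat` are assumptions made for illustration; they do not come from the slides or from the dr package.

```r
# p-values copied from the "Large-sample Marginal Dimension Tests" table
# of the sir fit with 11 slices
p_values <- c("0D vs >= 1D" = 0.00000,
              "1D vs >= 2D" = 0.01113,
              "2D vs >= 3D" = 0.57690,
              "3D vs >= 4D" = 0.84694)

alpha <- 0.05  # assumed significance level

# Row k tests "dimension is k - 1" against "dimension is at least k".
# The estimate is the dimension of the first row that is not rejected.
first_kept <- which(p_values > alpha)[1]
d_hat <- unname(first_kept) - 1
d_hat
```

Under these assumptions the 0D and 1D hypotheses are rejected and the 2D hypothesis is not, which suggests two directions are enough for this fit. If every row were rejected, `d_hat` would be `NA` and the tested range would need to be extended.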
> summary(s2<-update(s1,nslices=10,method="save"))

Call:
dr(formula = LBM ~ log(SSF) + log(Wt) + log(Hg) + log(Ht) + log(WCC) +
    log(RCC) + log(Hc) + log(Ferr), data = ais, slice.function = dr.slices.arc,
    nslices = 10, method = "save")

Method:
save with 10 slices, n = 202.

Slice Sizes:
21 21 20 20 20 25 24 22 20 9

Estimated Basis Vectors for Central Subspace:
               Dir1     Dir2     Dir3     Dir4
log(SSF)   0.127709 -0.00907  0.01018 -0.06144
log(Wt)   -0.905004 -0.07107 -0.15734  0.25774
log(Hg)   -0.056187  0.50674 -0.34064 -0.38087
log(Ht)    0.399868  0.36613  0.68439 -0.54216
log(WCC)   0.032608  0.02733  0.02277  0.03474
log(RCC)  -0.008463  0.15137 -0.24136 -0.47219
log(Hc)   -0.021630 -0.76164  0.57591  0.51526
log(Ferr)  0.002116 -0.01670  0.01631 -0.03360

              Dir1   Dir2   Dir3   Dir4
Eigenvalues 0.9389 0.6611 0.5129 0.4653
R^2(OLS|dr) 0.9936 0.9950 0.9985 0.9989

Large-sample Marginal Dimension Tests:
             Stat df(Nor) p.value(Nor) p.value(Gen)
0D vs >= 1D 378.3     324      0.02012       0.1071
1D vs >= 2D 279.6     252      0.11214       0.3116
2D vs >= 3D 179.9     189      0.67101       0.5160
3D vs >= 4D 134.3     135      0.50176       0.2786
Infrastructure tools
In RStudio
• Install the rmongodb package
• http://cran.r-project.org/web/packages/rmongodb/vignettes/rmongodb_cheat_sheet.pdf
• http://cran.r-project.org/web/packages/rmongodb/vignettes/rmongodb_introduction.html
• MongoDB - http://www.mongodb.org/
• http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis - get familiar with the choices
• General idea: these are “backend” stores that can do various “things”
Back-ends
• Files (e.g. csv), application files (e.g. Rdata, xls, mat, …) - essentially for reading/input
• Databases - for reading and writing
• Also - for advanced operations inside the database!!
• Operations range from simple summaries to array operations and analytics functions
• One overhead is opening, maintaining and closing connections: easy on your laptop, harder when the databases are remote (network, authentication, etc.)
• Another overhead is their internal storage formats (e.g. BSON for MongoDB)
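The file back-ends in the first bullet can be tried with base R alone. This is a minimal sketch, assuming temporary files and the built-in USArrests data; it covers only csv and Rdata files and does not touch a database, so the connection overhead described above does not appear here. The names `csv_path`, `rdata_path`, `arrests` and `env` are illustrative.

```r
# Temporary paths for a plain-text file and an R application file
csv_path   <- tempfile(fileext = ".csv")
rdata_path <- tempfile(fileext = ".RData")

# csv: text format that other tools can read; row names kept as the first column
write.csv(USArrests, csv_path, row.names = TRUE)
from_csv <- read.csv(csv_path, row.names = 1)

# RData: R's own binary format; objects are restored under their saved names
arrests <- USArrests
save(arrests, file = rdata_path)
env <- new.env()
load(rdata_path, envir = env)   # load into a separate environment
from_rdata <- env$arrests

# Compare both copies with the original
all.equal(USArrests, from_csv)
identical(USArrests, from_rdata)

# Clean up the temporary files
unlink(c(csv_path, rdata_path))
```

Loading into a separate environment is a design choice made here so that `load()` cannot silently overwrite objects already in the workspace.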
Functions versus languages
• Libraries for R mean that you code in R, call functions, and the result returns into R
• Whatever the function does (i.e. how it is implemented) is what you get (subject to setting parameters)
• Languages (like Pig) provide more direct access to efficiently using the underlying capabilities of the application engine/database
• The cost is learning this new language
Even further
• http://projects.apache.org/indexes/category.html#database
• Hadoop (MapReduce) - distributed execution (via disk when data is large)
• Pig (http://wiki.apache.org/pig/RunPig)
• HIVE (http://hive.apache.org/releases.html)
• Spark - in memory (RSpark still not easy to find/install) http://gigaom.com/2014/02/27/as-mapreduce-fades-apache-spark-is-now-a-top-level-project/
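The MapReduce pattern named in the list above can be illustrated with a toy word count in base R. This is only an in-memory, single-machine sketch of the map and reduce steps, not Hadoop and not distributed; the documents in `docs` and the helpers `count_words` and `merge_counts` are hypothetical names invented for the example.

```r
# Three tiny "documents" standing in for data split across workers
docs <- c("big data tools", "big data back ends", "data analytics")

# Map step: each document independently emits its own word counts
count_words <- function(doc) {
  tab <- table(strsplit(doc, " ", fixed = TRUE)[[1]])
  setNames(as.numeric(tab), names(tab))
}
mapped <- lapply(docs, count_words)

# Reduce step: merge two partial results by key (the word)
merge_counts <- function(a, b) {
  keys <- union(names(a), names(b))
  out <- setNames(numeric(length(keys)), keys)
  out[names(a)] <- out[names(a)] + a
  out[names(b)] <- out[names(b)] + b
  out
}
counts <- Reduce(merge_counts, mapped)
counts
```

Because the map step treats each document separately and the reduce step only combines pairs of partial results, the same logic could be spread across machines, which is the part Hadoop handles.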
Objectives
• Provide an application view of data analytics, i.e. predictive/prescriptive models, by focusing on the “front-end” (RStudio)
• Over a variety of data…
• Provide enough of a view of the back-ends to know how you will need to interface to them (both open-source and commercial)