Revolution Analytics Certification Talk
DublinR User Group, 12th May 2015
Revolution Analytics R Platform
• RevoR: performance-enhanced R interpreter, multi-core processing
• ConnectR: high-speed connectors to third-party systems (SAS, Teradata, Hadoop)
• DistributedR: distributed computing framework
• DevelopR: visual step-in debugger and IDE for R
• DeployR: web services (JSON + XML), SDKs for Java, JS, .NET
• ScaleR: data preparation, descriptive statistics, correlation and covariance matrices, predictive modelling
Certification trivia
- title: “Revolution R Enterprise Certified Specialist”
- $200 examination fee
- exam booked online at http://www.kryteriononline.com
- three possible test center locations in Dublin (New Horizons, SureSkills, The Exam Centre/Leopardstown)
- 90 minutes
- 60 questions
- 70% passing score
Help on RevoScaleR functions: http://www.rdocumentation.org/packages/RevoScaleR/
rxAddInheritance RxAvroData RxAvroData-class RxAzureBurst RxAzureBurst-class rxBTrees rxCancelJob rxChiSquaredTest rxCleanup rxCompareContexts rxCompressXdf RxComputeContext RxComputeContext-class rxCovCor rxCovRegression rxCrossTabs rxCube rxDataFrameToXdf RxDataSource RxDataSource-class rxDataStep rxDForest rxDForestUtils RxDistributedHpa-class rxDistributeJob rxDTree rxDTreeBestCp rxElemArg rxExec rxExecuteSQLDDL rxExpression rxFactors RxFileData-class RxFileSystem rxFindFileInPath RxForeachDoPar RxForeachDoPar-class rxFormula rxGetAvailableNodes rxGetEnableThreadPool rxGetInfoXdf rxGetJobInfo rxGetJobOutput rxGetJobResults rxGetJobs rxGetNodeInfo rxGetNodes rxGetVarInfoXdf rxGetVarNames rxGlm rxHadoopCommand RxHadoopMR RxHadoopMR-class rxHdfsConnect RxHdfsFileSystem rxHistogram RxHPCServer RxHPCServer-class rxImport rxImportToXdf RxInTeradata RxInTeradata-class rxKmeans rxLaunchClusterTaskManager rxLinePlot rxLinMod RxLocalParallel RxLocalParallel-class RxLocalSeq RxLocalSeq-class rxLocateFile rxLogit rxLorenz RxLsfCluster RxLsfCluster-class rxMakeRNodeNames rxMarginals rxMergeXdf rxMultiTest RxNativeFileSystem rxNew RxOdbcData RxOdbcData-class rxOpen-methods rxOptions rxPairwiseCrosstab rxPingNodes rxPredict rxPredict.rxDForest rxPredict.rxDTree rxQuantile rxReadXdf rxRemoteCall rxRemoteGetId rxRemoteHadoopMRCall rxResultsDF rxRiskRatio rxRng rxRoc RxSasData RxSasData-class rxSetComputeContext rxSetFileSystem rxSetInfo rxSetVarInfoXdf rxSortXdf rxSplitXdf RxSpssData RxSpssData-class rxStepControl rxSummary RxTeradata RxTeradata-class rxTeradataSql RxTextData RxTextData-class rxTextToXdf rxTransform rxTweedie rxWaitForJob RxXdfData RxXdfData-class rxXdfFileName rxXdfToDataFrame rxXdfToText
What you need to know for the exam
Workspace management
• search()
• ls()
• rm()
• save(), load()
• write.table(), read.table()
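The workspace functions listed above can be tried out in a few lines of base R. This is only a practice sketch: the scratch environment `ws`, the object names and the temporary files are made up for illustration and are not taken from the exam.

```r
# Use a scratch environment so the example does not touch the real workspace
ws <- new.env()
assign("a", 1:3, envir = ws)
assign("b", "text", envir = ws)

ls(ws)                 # lists the object names in ws
rm("b", envir = ws)    # removes b, leaving only a

# save() writes objects to a binary file, load() restores them by name
rdata <- tempfile(fileext = ".RData")
save(list = "a", file = rdata, envir = ws)
rm("a", envir = ws)
load(rdata, envir = ws)

# write.table()/read.table() round-trip a data frame through a text file
df <- data.frame(id = 1:3, grp = c("x", "y", "z"))
txt <- tempfile(fileext = ".txt")
write.table(df, txt, row.names = FALSE)
df2 <- read.table(txt, header = TRUE)

# search() shows the environments and packages R looks through for names
search_path <- search()
```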
What you need to know for the exam
Operations on data structures
• x = -2:7; x[-4:-5] (negative indexing)
• x = -2:7; sum(x[x < 1]) (boolean indexing)
• Replicating values, filling up data structures: array(3:5, 1:3)[1, , 2]
• Use of tapply/sapply/apply
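The expressions from this slide, worked through in base R. The matrix `m` and the `tapply` data are small made-up inputs added for illustration.

```r
x <- -2:7                       # -2 -1 0 1 2 3 4 5 6 7

# Negative indexing drops positions 4 and 5 (the values 1 and 2)
neg <- x[-4:-5]                 # -2 -1 0 3 4 5 6 7

# Boolean indexing keeps elements where the condition is TRUE
below_one <- sum(x[x < 1])      # -2 + -1 + 0 = -3

# array(data, dim): 3:5 is recycled to fill a 1 x 2 x 3 array
arr <- array(3:5, dim = 1:3)
slice <- arr[1, , 2]            # 5 3

# Replicating values
repeated <- rep(c("a", "b"), times = 2)

# apply works over matrix margins, sapply over elements, tapply over groups
m <- matrix(1:6, nrow = 2)
row_sums <- apply(m, 1, sum)                   # 9 12
squares <- sapply(1:3, function(i) i^2)        # 1 4 9
group_means <- tapply(c(10, 20, 30, 40),
                      c("f", "m", "f", "m"), mean)   # f = 20, m = 30
```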
What you need to know for the exam
RevoScaleR XDF file format (External Data Frame):
• binary format
• loads directly to memory
• data stored in chunks
• new rows and columns can be added to the file without re-writing the entire file
What you need to know for the exam
• Importing and exporting data
- rxImport(inData, outFile = "abc.xdf", ...): what does this call return?
- RxOdbcData(query, table, connectionString)
- rxTextToXdf()
What you need to know for the exam
• Summary statistics
- rxGetInfo()
- rxHistogram()
- rxSummary()
- expect three or four questions on rxSummary in combination with rxFormula
What you need to know for the exam
Using formulas for descriptive statistics
rxSummary(formula = ~ F(age):sex, data = censusWorkers)
You may need to know how to build formulas:
• ~ separates the response from the predictor variables
• + separates predictor variables
• : denotes an interaction between predictor variables
• F(x) treats numeric variable x as a categorical variable
• N(x) is the opposite of F(x)
• * adds the variables themselves and all their interactions to the model
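The formula operators can be inspected in base R without RevoScaleR. In this sketch, `factor()` stands in for `F()`, which is specific to RevoScaleR; the variable names are placeholders and no data is needed, because `terms()` only parses the formula.

```r
# a * b expands to the main effects plus their interaction
f_star <- y ~ a * b
star_terms <- attr(terms(f_star), "term.labels")    # "a" "b" "a:b"

# a:b on its own adds only the interaction term
f_colon <- y ~ a:b
colon_terms <- attr(terms(f_colon), "term.labels")  # "a:b"

# A one-sided formula (no response), as used for summaries on the slide;
# factor(age) is the base R analogue of treating age as categorical
f_summary <- ~ factor(age):sex
has_response <- attr(terms(f_summary), "response")  # 0 means no response
summary_vars <- all.vars(f_summary)                 # "age" "sex"
```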
What you need to know for the exam
Data transformations
rxDataStep(
  inData,
  returnTransformObjects,
  transformObjects = list(a, b, c),
  transformFunc = someCustomFunction,
  transformVars = c("x1", "x2")
)
- remember that you may be processing a large data set
- there are special requirements on how to write custom transform functions
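A rough sketch of what such a custom function can look like. It assumes the usual RevoScaleR convention that the transform function receives the variables of one chunk as a named list and returns a list of equal-length vectors. The function name `addRatio` and the data are hypothetical, and the chunking is simulated in base R so the example runs without RevoScaleR.

```r
# Hypothetical transform function: takes one chunk as a named list,
# adds a new variable, and returns the list
addRatio <- function(dataList) {
  dataList$ratio <- dataList$x1 / dataList$x2
  dataList
}

# Simulate chunk-wise processing: split the data into two chunks,
# apply the function to each chunk, then stack the results
full <- data.frame(x1 = c(2, 4, 6, 8), x2 = c(1, 2, 3, 4))
chunks <- split(full, rep(1:2, each = 2))
processed <- lapply(chunks, function(chunk) {
  as.data.frame(addRatio(as.list(chunk)))
})
result <- do.call(rbind, processed)

# With RevoScaleR the same function would be passed as (not run here):
# rxDataStep(inData, transformFunc = addRatio, transformVars = c("x1", "x2"))
```

Because the function only ever sees one chunk at a time, anything that needs the whole data set (for example `mean(x1)`) would be computed per chunk, which is one reason the function has to be written with care.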
What you need to know for the exam
Machine learning
• rxKmeans - k-means clustering
• rxDTree - decision trees
• rxLinMod - linear models
• rxLogit - logistic regression
• rxGlm - generalized linear models
What you need to know for the exam
Model fitting
What about fitted values and residuals? What would you look for when inspecting residuals (zero mean, heteroskedasticity, normal distribution, etc.)?
Predictive modelling
For each type of model, know its essential parameters (maxDepth or cp for decision trees, numClusters for k-means, family for GLM).
Example question: a model is fitted as rxGlm(y ~ x, family = binomial(link = "logit")). What can be assumed about the discrete/continuous nature of the variables and their relationship? Linear? log(y) ~ x? Is x categorical? Is y binary?
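The example question can be explored with base R's `glm()`, used here as a stand-in for `rxGlm` on the assumption that the two accept the same formula and family. The data is simulated and the coefficients are arbitrary.

```r
set.seed(42)
n <- 500
x <- rnorm(n)                              # continuous predictor

# Binary response whose log-odds are linear in x
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))

fit <- glm(y ~ x, family = binomial(link = "logit"))

# On the link scale the model is linear: eta = b0 + b1 * x
eta <- predict(fit, type = "link")
linear_part <- coef(fit)[1] + coef(fit)[2] * x

# Fitted values are probabilities, the inverse logit of eta
prob <- fitted(fit)

# Residual check for an ordinary linear model with an intercept:
# the residuals average to zero
z <- 1 + 2 * x + rnorm(n)
lm_fit <- lm(z ~ x)
mean_resid <- mean(residuals(lm_fit))
```

So with a binomial family and logit link, y is binary and it is the log-odds of y, not y itself or log(y), that are modelled as linear in x.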
What you need to know for the exam
Miscellaneous questions:
• Which functions can you use together with rxCrossTabs to test the independence of variables? (rxFisherTest, rxKendallCor, rxChiSquaredTest)
• What does the rxCor function return? (Pearson's correlation matrix)
• Which graphics subsystem does rxLinePlot use underneath? (lattice? ggplot2? googleVis? base graphics?)
• Two questions on Principal Component Analysis (splits variables into ... independent? dependent? asymptotic? normally distributed?)
• Two other questions on covariance/correlation (cov(x, y) = cor(x, y) * sd(x) * sd(y))
• Which operations are not supported for on-the-fly response variable transformations with rxSummary: F(y), N(y), rowSelection, transforms = list(...)
• Which functions to use for obtaining contingency tables (rxSummary? rxCube? rxCrossTabs?)
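Two of these points can be checked numerically in base R. The sketch uses simulated data and `prcomp()`, not the RevoScaleR functions.

```r
set.seed(1)
x <- rnorm(200)
y <- 0.6 * x + rnorm(200, sd = 0.5)

# Covariance is correlation rescaled by the two standard deviations
lhs <- cov(x, y)
rhs <- cor(x, y) * sd(x) * sd(y)
identity_holds <- isTRUE(all.equal(lhs, rhs))

# PCA rotates correlated variables into uncorrelated components
pca <- prcomp(cbind(x, y), scale. = TRUE)
input_cor <- cor(x, y)                 # clearly non-zero
component_cor <- cor(pca$x)[1, 2]      # zero up to rounding error
```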