590 likes | 601 Views
Explore the power of R - an open-source language for statistical computing and graphics, widely used by analysts and data scientists. Learn about Oracle's Advanced Analytics, R integration, and scalable predictive analytics solutions.
E N D
Session 1: Introduction to Oracle's R Technologies Mark Hornick Director, Advanced Analytics February 2016
Agenda 1 What is R Oracle’s Advanced Analytics Oracle R Distribution ROracle Package Oracle R Advanced Analytics for Hadoop Oracle R Enterprise Summary 2 3 4 5 6 7
What is R? • R is an Open Source scripting language and environment for statistical computing and graphicshttp://www.R-project.org/ • Started in 1994 as an Alternative to SAS, SPSS andother proprietary Statistical Environments • The R environment • R is an integrated suite of software facilities for data manipulation, calculation and graphical display • Millions of R users worldwide • Widely taught in Universities • Many Corporate Analysts and Data Scientists know and use R • Thousands of open sources packages to enhance productivity such as: • Bioinformatics with R • Spatial Statistics with R • Financial Market Analysis with R • Linear and Non Linear Modeling
CRAN Task View – Machine Learning & Statistical Learning • ahaz • arules • BayesTree • bigrf • bigRR • bmrm • Boruta • bst • C50 • caret • CORElearn • CoxBoost • Cubist • e1071 (core) • earth • elasticnet • ElemStatLearn • evtree • frbs • GAMBoost • gamboostLSS • gbm (core) • glmnet • glmpath • GMMBoost • grplasso • grpreg • hda • hdi • ipred • kernlab (core) • klaR • lars • lasso2 • LiblineaR • LogicForest • LogicReg • maptree • mboost (core) • mlr • ncvreg • nnet (core) • oblique.tree • pamr • party • partykit • penalized • penalizedLDA • penalizedSVM • quantregForest • randomForest (core) • randomForestSRC • rattle • rda • rdetools • REEMtree • relaxo • rgenoud • rgp • Rmalschains • rminer • ROCR • RoughSets • rpart (core) • RPMM • RSNNS • RWeka • RXshrink • sda • stabs • svmpath • tgp • tree • varSelRF • vcrpart
Why statisticians | data analysts | data scientists use R R is a statistics language similar to Base SAS or SPSS statistics http://cran.r-project.org/ R environment is .. Powerful Extensible Graphical Extensive statistics OOTB functionality with many ‘knobs’ but smart defaults Ease of installation and use Free
R leads the pack The top 10 tools by share of users were • R, 46.9% share ( 38.5% in 2014) • RapidMiner, 31.5% ( 44.2% in 2014) • SQL, 30.9% ( 25.3% in 2014) • Python, 30.3% ( 19.5% in 2014) • Excel, 22.9% ( 25.8% in 2014) • KNIME, 20.0% ( 15.0% in 2014) • Hadoop, 18.4% ( 12.7% in 2014) • Tableau, 12.4% ( 9.1% in 2014) • SAS, 11.3 (10.9% in 2014) • Spark, 11.3% ( 2.6% in 2014) http://www.kdnuggets.com/2016/05/poll-r-rapidminer-python-big-data-spark.html
Oracle Solution Addresses Analytics Pain Pointsfor example… “It takes too long to get my data or to get the ‘right’ data” “I can’t analyze or mine all of my data – it has to be sampled” “Putting R models and results into production is ad hoc and complex” “Recoding R models into SQL, C, or Java takes time and is error prone” “Our company is concerned about data security, backup and recovery” “We need to build 10s of thousands of models fast to meet business objectives”
Oracle’s Advanced Analytics Fastest Way to Deliver Scalable Enterprise-wide Predictive Analytics • Scalable in-Database + Hadoop data mining algorithms and R integration • Powerful predictive analytics and deployment platform • Drag and drop workflow, R and SQL APIs • Data analysts, data scientists & developers • Enables enterprise predictive analytics applications • Key Features
Oracle’s Advanced Analytics R Client Users Data / Business Analysts R programmers Business Analysts/Mgrs Domain End Users Multiple interfaces across platforms — SQL, R, GUI, Dashboards, Apps Oracle Database 12c Applications SQL Developer/ Oracle Data Miner OBIEE Platform Oracle Cloud Hadoop Oracle Database Enterprise Edition Oracle Advanced Analytics - Database Option SQL Data Mining & Analytic Functions + R Integration for Scalable, Distributed, Parallel in-Database ML Execution ORAAH Parallel, distributed algorithms
Oracle’s Advanced Analytics Polyglot Support: In-Database and Hadoop Data Management + Advanced Analytics Oracle Database + Advanced Analytics Option Oracle Big Data Appliance SQL Client SQL DM functions API Data Miner R R Client OAA/ORE R integration RStudio Big Data SQL
Manage and Analyze All Data—SQL & Oracle Big Data SQL OAA Oracle Big Data Appliance Oracle Database 12c SQL / R JSON • Structured and Unstructured Data Reservoir • JSON data • HDFS / Hive • NoSQL • Spatial and Graph data • Image and Video data • Social Media • Store business-critical data in Oracle • Customer data • Transactional data • Unstructured documents, comments • Spatial and Graph data • Image and Video data • Social Media • Data analyzed via SQL / R / GUI • R Clients • SQL Clients • Oracle Data Miner
Oracle’s Advanced Analytics Differentiators • Data remains in the Database and Hadoop • Brings algorithms to the data, not move data to the algorithms • Oracle’s advanced analytics parallel implementations run on Hadoop and Database data • Provides parallel implementations of commonly used algorithms with auto-data preparation • Enables data analysts and scientists to work directly on transactional data, star schemas, text • Leverage investment in Oracle IT • Leverages latest Oracle technology: Engineered Systems (Exadata, BDA), Big Data SQL, “smart scans” • Provides wide range of options: Cloud and on-premise – same Oracle code & products • Database provides industry leadership in data security, encryption, audit, backup, and recovery • Diverse data source gateways, social media data, geo-coding, ETL – on-premise / on-cloud
Oracle’s Advanced Analytics Differentiators • Fastest way to deliver enterprise-wide predictive analytics applications • Delivers big data + analytics platform for model build, deployment & applications • GUI to build analytical workflows with code generation and scheduling for deployment • Supports near real-time scoring for embedding in applications • Explains model logic and transparency via Prediction_Details • Tight integration with open source R • Leverage CRAN R packages using data- and task-parallel database infrastructure • Store and manage R scripts and R objects in-database and invoke R scripts from SQL • Use production quality infrastructure without custom plumbing or extra complexity
Oracle Data Miner GUI SQL Developer Extension—Free OTN Download • Easy to Use • Oracle Data Miner GUI for data analysts • “Work flow” paradigm • Powerful • Multiple algorithms & data transformations • Runs 100% in-DB • Build, evaluate and apply models • Automate and Deploy • Generate SQL scripts for deployment • Share analytical workflows
Dresner Advanced and Predictive Analytics Market Study2016 Edition
Sample Use cases • Detect fraud in customer transactions, insurance claims • Identify which patients are at risk of developing certain conditions • Target the right customer with the right offer • Discover hidden customer segments • Forecast customer demand for a product or service • Find most profitable selling opportunities • Anticipate and preventing customer churn • Identify customers likely to churn and why • Security and suspicious activity detection • Understand sentiments in customer conversations • Understand influencers in social networks • Predict credit risk
BI examines historical data, AA explains the past and predicts the future BI tools can help explore and prepare data for modelingwith Advanced Analytics Business Value of Advanced Analytics • Deliver better outcomes • Understand the reasons behind outcomes • Capitalize on an enterprise’s unique data • Create sustainable competitive advantage • Increase the rate of better decision making • Respond to market conditions faster • Attain a higher level of customer satisfaction • Know which of your big data you should pay attention to
Analytic Pain Points • It takes too long to get my data or to get the ‘right’ data • I can’t analyze or mine all of my data – it has to be sampled • Putting analytics/predictive models and results into production is ad hoc and complex • Recoding R or other models into SQL, C, or Java takes time and is error prone • Our company is concerned about data security, backup and recovery • We need to build 10s of thousands of models fast to meet business objectives See the blog series at https://blogs.oracle.com/R/entry/addressing_analytic_pain_points
Oracle’s R Technologies Supporting R, Oracle Database, and Big Data Appliance/Hadoop • Oracle R Distribution • ROracle • Oracle R EnterpriseComponent of the Oracle Advanced Analytics Option to Oracle Database • Oracle R Advanced Analytics for HadoopComponent of the Big Data Connectors Software Suite Software available to R Community for free
Oracle R Distribution Ability to dynamically load Intel Math Kernel Library AMD Core Math Library Solaris Sun Performance Library OracleSupport • An Oracle-Supported Redistribution of Open Source R • Enhanced linear algebra performance via dynamically loaded libraries • Improve scalability at client and database for embedded R execution • Enterprise support for customers of Oracle Advanced Analytics option, Big Data Appliance, and Oracle Linux • Free download • Oracle contributes bug fixes and enhancements to open source R
ROracle • R package enabling scalable and performant connectivity to Oracle Database • Open source, publicly available on CRAN • Oracle is maintainer • Oracle Database Interface (DBI) for R • Re-implemented and optimized driver based on OCI • Execute SQL statements from R interface • Enables transactional behavior for insert, update, and delete Oracle Database ROracle
ROracle - Requirements • Oracle Instant Client • allows running applications without installing the standard Oracle Database Client or having an ORACLE_HOME • OCI, OCCI, Pro*C, ODBC, and JDBC applications work without modification, while using significantly less disk space • SQL*Plus can also be used with Instant Client • Or, standard Oracle Database Client
ROracle Example – rolling back transactions drv <- dbDriver("Oracle") con <- dbConnect(drv, username = "scott", password = "tiger") dbReadTable(con, "EMP") rs <- dbSendQuery(con, "delete from emp where deptno = 10") dbReadTable(con, "EMP") if(dbGetInfo(rs, what = "rowsAffected") > 1){ warning("dubious deletion -- rolling back transaction") dbRollback(con) } dbReadTable(con, "EMP")
Oracle R Advanced Analytics for Hadoop Using Hadoop and HIVE Integration, plus R Engine and Open-Source R Packages HQL Hadoop Cluster with Oracle R Advanced Analytics for Hadoop (ORAAH) R Client • R interface to HQL Basic Statistics, Data Prep, Joins and View creation R Analytics Oracle R Advanced Analytics for Hadoop R • Parallel, distributed algorithms: • MLP Neural Nets*, GLM*, LM, PCA, k-Means, NMF, LMF • * Spark-Caching enabled • Use of Open-source R packages via custom R Mappers / Reducers Oracle Database with Advanced Analytics option SQL Client SQL Developer Other SQL Apps
Oracle R Advanced Analytics for Hadoop Advanced Analytics algorithms in a Hadoop Cluster: Map-Reduce and Spark based Classification Clustering Statistical Functions Generalized Linear Model Logistic Regression Hierarchical k-Means Correlation Covariance Cross Tabulation Summary statistics Regression Attribute Importance Feature Extraction Linear Regression Multi-Layer Neural Networks Principal Components Analysis Nonnegative Matrix Fact(NMF) Collaborative Filtering (LMF)
ORAAH: Machine Learning in Spark against HDFS data 4 1 5 2 3 Invoke ORAAH custom parallel distributed GLM Model using Spark Caching YARN: Apache Spark Job Oracle Distribution of R version 3.1.1 (--) -- "Sock it to Me" > Connects to Spark > spark.connect("yarn-client",memory="24g") > # Attaches the HDFS file for use within R> ont1bi <- hdfs.attach("/user/oracle/ontime_1bi") > # Formula definition: Cancelled flights (0 or 1) based on other attributes > form_oraah_glm2 <- CANCELLED ~ DISTANCE + ORIGIN + DEST + F(YEAR) + F(MONTH) ++ F(DAYOFMONTH) + F(DAYOFWEEK) > system.time(m_spark_glm <- orch.glm2(formula=form_oraah_glm2, ont1bi)) ORCH GLM: processed 6 factor variables, 25.806 sec ORCH GLM: created model matrix, 100128 partitions, 32.871 sec ORCH GLM: iter 1, deviance 1.38433414089348300E+09, elapsed time 9.582 sec ORCH GLM: iter 2, deviance 3.39315388583931150E+08, elapsed time 9.213 sec ORCH GLM: iter 3, deviance 2.06855738812683250E+08, elapsed time 9.218 sec ORCH GLM: iter 4, deviance 1.75868100359263200E+08, elapsed time 9.104 sec ORCH GLM: iter 5, deviance 1.70023181759611580E+08, elapsed time 9.132 sec ORCH GLM: iter 6, deviance 1.69476890425481350E+08, elapsed time 9.124 sec ORCH GLM: iter 7, deviance 1.69467586045954760E+08, elapsed time 9.077 sec ORCH GLM: iter 8, deviance 1.69467574351380850E+08, elapsed time 9.164 secuser system elapsed84.107 5.606 143.591 Spark-Based Machine Learning algorithms module • Oracle R Advanced Analytics • for Hadoop Client Packages Custom Spark Java Algorithm distributed in-Memory Computation Custom Spark Java Algorithm distributed in-Memory Computation /user/oracle/ontime_s
ORAAH’s Spark-based GLM against HDFS data Performance against ORAAH’s Map-Reduce GLM
ORAAH’s Spark-based GLM vs. Spark Mllib GLM Performance measured on same Hardware and same HDFS input Dataset 5x 13.9x 14.5x
Oracle R Advanced Analytics for Hadoop ORAAH Clientwith third-party RStudioOpen Source R or ORD Web browser ORAAH Clientwith third-partyRStudio ServerOpen Source R or ORD Oracle Database OAA (ORE)ORD sqoopor OLH Web browser BDA / HadoopORAAHORD Big Data Discovery Oracle Confidential – Internal/Restricted/Highly Restricted
Traditional R and Database Interaction Flat Files read extract / export • Access latency • Paradigm shift: R SQL R • Memory limitation – data size, call-by-value • Single threaded • Ad hoc production deployment • Issues for backup, recovery, security export load Database SQL RODBC / RJDBC / ROracle R scriptcron job
Oracle R EnterpriseOracle Advanced Analytics Option to Oracle Database • Use R and scale to Big Data • Eliminate memory constraint of client R engine • Minimize or eliminate data movement latency • Use Oracle Database as HPC environment • Manage R scripts and objects in Oracle Database • Use in-database parallel and distributed data mining algorithms • Easily integrate R results into applications and dashboards without recoding SQL Interfaces Client R Engine SQL*Plus,SQLDeveloper, … ORE packages Oracle Database Database ServerMachine In-dbstats User tables
OBIEE Dashboard Integration with R Graphics and Analytics Parameterized data selection and graph customization
Integrated Business Intelligence Integrate a range of in-DB SQL & R Predictive Analytics & Graphics In-database construction of predictive models that predict customer behavior OBIEE’s integrated spatial mapping shows where Customer “most likely” to be HIGH and VERY HIGHvalue customer in the future
Oracle R Enterprise 1.5 Predictive Analytics algorithms in-Database Classification Clustering Market Basket Analysis Decision TreeLogistic RegressionNaïve BayesRandomForestSupport Vector Machine Hierarchical k-Means Orthogonal Partitioning Clustering Apriori – Association Rules Regression Attribute Importance Feature Extraction Linear ModelGeneralized Linear ModelMulti-Layer Neural Networks Stepwise Linear Regression Support Vector Machine Minimum Description Length Nonnegative Matrix FactorizationPrincipal Component AnalysisSingular Value Decomposition Anomaly Detection Time Series 1 Class Support Vector Machine Single Exponential Smoothing Double Exponential Smoothing New in ORE 1.5
Invoke in-database aggregation function aggdata <- aggregate(ONTIME_S$DEST, by = list(ONTIME_S$DEST), FUN = length) class(aggdata) head(aggdata) Source data is an ore.frame ONTIME_S, which resides in Oracle Database The aggregate() function has been overloaded to accept ORE frames aggregate() transparently switches between code that works with standard R data.frames and ore.frames Returns an ore.frame R user on desktop select DEST, count(*) from ONTIME_S group by DEST Client R Engine Transparency Layer Oracle R Enterprise Oracle Database In-dbstats User tables
Transparency LayerOverloads graphics functions for in-database statistics ontime <- ONTIME_S delay <- ontime$ARRDELAY dayofweek <- ontime$DAYOFWEEK bd <- split(delay, dayofweek) boxplot(bd, notch = TRUE, col = "red", cex = 0.5, outline = FALSE, axes = FALSE, main = "Airline Flight Delay by Day of Week", ylab = "Delay (minutes)", xlab = "Day of Week") axis(1, at=1:7, labels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")) axis(2)
ore.groupApply – partitioned data flow Client R Engine modList <- ore.groupApply( X=ONTIME_S, INDEX=ONTIME_S$DEST, function(dat) { lm(ARRDELAY ~ DISTANCE + DEPDELAY, dat) }); summary(modList$BOS) ## return model for Boston 1 2 ORE Oracle Database 3 rq*Apply () interface User tables Also includes • ore.doEval • ore.tableApply • ore.rowApply • ore.indexApply extproc extproc DB R Engine 4 ORE 4 DB R Engine ORE
IoTUse Case: Sensor Data Analysis • Model each customer’s usage to understandbehavior and predict individual usageand overall aggregate demand • 200 thousand households, each with a utility “smart meter” • 1 reading / meter / hr • 200K x 8760 hrs / yr 1.752B readings • 3 years worth of data 5.256B readings • Each customer has 26280 readings • If each model takes 10 seconds to build, 555.6 hrs (23.2 days) …with 128 DOP 4.3 hrs
Scalable Sensor Data Analysis – Model BuildingSmart meter scenario Oracle Database Data R Script Repository R Datastore f(dat,args,…) f(dat,args,…) f(dat,args,…) f(dat,args,…) c1 c2 ci cn R Script buildmodel f(dat,args,…) { } Model c1 Model c2 Model ci Model cn