IBM PureData for Analytics: Clustering three ways with Open Source R
Using R with PureData for Analytics
• Small data outside database. Single model, serial model processing. Pull data down from database; run R on desktop or dedicated server.
• Small data inside database. Single model, serial model processing. Push R into database; process data directly against DB tables.
• Large data inside database. Single model, serial model processing. Call INZA functions from R; process data directly against DB tables.
• Many small data inside database. Many model, parallel model processing (e.g. Bulk Parallel Execution). Push R into database; process data directly against DB tables.
Analysis only looks at the last three scenarios.
Comparing performance for single model in-database
• nzKMeans would be expected to start outperforming in-database cclust somewhere between 5M and 6M observations.
• Note: Tests were run on a first-generation TwinFin.
• Note: Variations in the performance numbers are relative, because the system was in use during the testing.
Bulk-parallel execution of cclust (10K observations for each)
• In general, these results would be significantly better than running cclust serially in a dedicated environment, both because of R execution overhead and because of the additional time a dedicated environment needs for data movement and/or partitioning.
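The "one model per group" idea behind bulk-parallel execution can be tried locally before involving the database. The sketch below is only a stand-in under stated assumptions: it uses base R's kmeans() rather than cclust, synthetic data rather than BENCHMARK_DATA, and a serial lapply() rather than in-database parallel execution. The names toy, grp, n_groups and n_per_group are hypothetical and do not come from the slides.

```r
# Local stand-in for the bulk-parallel pattern: one k-means model per group.
# Assumptions: base kmeans() instead of cclust, synthetic data, serial execution.
set.seed(1234)

n_groups    <- 4    # hypothetical number of distinct index values
n_per_group <- 200  # hypothetical rows per group (the slides used 10K)

toy <- data.frame(
  grp = rep(seq_len(n_groups), each = n_per_group),
  x1  = rnorm(n_groups * n_per_group),
  x2  = rnorm(n_groups * n_per_group),
  x3  = rnorm(n_groups * n_per_group)
)

# Partition the feature columns by the grouping column.
parts <- split(toy[, c("x1", "x2", "x3")], toy$grp)

# Fit one 5-cluster model per partition; each fit is independent of the others,
# which is what makes the pattern suitable for parallel execution.
models <- lapply(parts, function(df) {
  kmeans(as.matrix(df), centers = 5, iter.max = 1000)
})

# The result is a list of models keyed by the group value.
print(names(models))
print(sapply(models, function(m) nrow(m$centers)))
```

Because every partition is fitted independently, the serial lapply() here could be replaced by any parallel map without changing the models' structure; the in-database version additionally avoids moving the data to the client.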
Clustering three ways with Open Source R and IBM PureData for Analytics

• Using wrapper for INZA KMEANS, single model (stores resulting model in-database)

    # Reference the database table without pulling it to the client
    data.nz <- nz.data.frame("BENCHMARK_DATA")
    system.time(
      nz.clust5 <- nzKMeans(data.nz, k=5, maxiter=1000, distance="euclidean", id="ID",
                            getLabels=FALSE, randseed=1234,
                            outtable="admin.DATA_2_clust5d", format="kmeans",
                            dropAfter=TRUE)
    )

• Running R in-database, single model (returns resulting model to client)

    system.time(
      data.cclust <- nzSingleModel(data.nz[,2:16], function(df) {
        require(cclust)
        cclust(as.matrix(df), 5, iter.max=1000, verbose=FALSE,
               dist="euclidean", method="kmeans")
      }, force=TRUE)
    )

• Running R in-database, bulk parallel model (stores resulting models in-database, returns list of models by INDEX)

    # ua_ct is col 6, the "index" or grouping column
    system.time(
      data.cclust <- nzBulkModel(data.nz[data.nz$ID<1000001,2:16], 6, function(df) {
        require(cclust)
        cclust(as.matrix(df), 5, iter.max=1000, verbose=FALSE,
               dist="euclidean", method="kmeans")
      }, output.name="CCLUSTBULKMODEL", clear.existing=TRUE)
    )