R Analytics in the Cloud

Introduction
• Radek Maciaszek
• DataMine Lab (www.dataminelab.com) - Data mining, business intelligence and data warehouse consultancy.
• MSc in Bioinformatics at Birkbeck, University of London.
• Project at UCL Institute of Healthy Ageing under supervision of Dr Eugene Schuster.

Primer in Bioinformatics
• Bioinformatics - applying computer science to biology (DNA, proteins, drug discovery, etc.)
• Ageing strategy – solve it in a simple organism and apply the findings to more complex organisms (e.g. humans).
• Goal: find the genes responsible for ageing in Caenorhabditis elegans.

Central dogma of molecular biology
Genes are encoded by DNA.
Microarray (100 x 100)
• Database of 50 curated experiments.
• 10k genes compared to each other.
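To give a rough sense of scale, the sketch below counts the comparisons implied by the slide. It assumes the slide's figure of 10,000 genes and that every gene is compared with every other; the variable names are illustrative only.

```r
# Assumed figure taken from the slide: roughly 10,000 genes.
n.genes <- 10000

# Every gene against every gene, as a full n x n grid (self-pairs included).
all.pairs <- n.genes^2

# Unordered pairs of distinct genes, if each pair is tested only once.
unique.pairs <- choose(n.genes, 2)

print(format(all.pairs, big.mark = ",", scientific = FALSE))
print(format(unique.pairs, big.mark = ",", scientific = FALSE))
```

That is 100 million correlation tests for the full grid, or just under 50 million if each pair is tested once, which is why the computation is a candidate for a cluster.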

Why R?
• Very popular in bioinformatics
• Functional scripting language
• Swiss-army knife for statisticians
• Designed by statisticians for statisticians
• Lots of ready-to-use packages (CRAN)

R limitations & Hadoop
• Data needs to fit in memory
• Single-threaded
• Hadoop integration:
  • Hadoop Streaming
  • Rhipe: http://ml.stat.purdue.edu/rhipe/
  • Segue: http://code.google.com/p/segue/

Segue
• Works with Amazon Elastic MapReduce.
• Creates a cluster for you.
• Designed for Big Computations (rather than Big Data).
• Implements a cloud version of the lapply() function.

Segue workflow (emrlapply)
[Diagram labels: List (local), R, S3, Elastic MapReduce, List (remote), Amazon AWS]

R very quick example
> m <- list(a = 1:10, b = exp(-3:3))
> lapply(m, mean)
$a
[1] 5.5
$b
[1] 4.535125
lapply(X, FUN) returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.
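As a small extension of the example above, the sketch below shows two variations that use only base R: vapply(), which returns a vector instead of a list, and an anonymous function passed as FUN. The variable names are illustrative and not from the original slides.

```r
m <- list(a = 1:10, b = exp(-3:3))

# lapply() always returns a list with one element per input element.
means.list <- lapply(m, mean)

# vapply() does the same work but returns a named numeric vector and
# checks that every result is a single number.
means.vec <- vapply(m, mean, numeric(1))

# An anonymous function works the same way as a named one.
ranges <- lapply(m, function(x) max(x) - min(x))

str(means.list)
print(means.vec)
print(ranges$a)
```

The point that carries over to Segue is that the function passed as FUN can be arbitrary, as long as it takes one list element and returns a result.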

Segue – large scale example
> AnalysePearsonCorelation <- function(probe) {
    A.vector <- experiments.matrix[probe, ]
    p.values <- c()
    for (probe.name in rownames(experiments.matrix)) {
      B.vector <- experiments.matrix[probe.name, ]
      p.values <- c(p.values, cor.test(A.vector, B.vector)$p.value)
    }
    return(p.values)
  }
> pearson.cor <- lapply(probes, AnalysePearsonCorelation)
Moving to the cloud in 3 lines of code!
[Figure label: RNA Probes]
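The slide's function depends on experiments.matrix and probes, which are not shown. The sketch below is one way to try the same computation locally. The simulated matrix, its dimensions, and the function name AnalysePearsonCorrelationLocal are assumptions made for illustration; they are not the author's data or code.

```r
set.seed(42)  # make the simulated data reproducible

# Hypothetical stand-in for the real data: 20 probes x 50 experiments.
experiments.matrix <- matrix(rnorm(20 * 50), nrow = 20,
                             dimnames = list(paste0("probe", 1:20), NULL))
probes <- rownames(experiments.matrix)

# Computes the same p-values as the slide's function, but with vapply()
# so the result vector is not grown with c() inside a loop.
AnalysePearsonCorrelationLocal <- function(probe) {
  A.vector <- experiments.matrix[probe, ]
  vapply(rownames(experiments.matrix), function(probe.name) {
    cor.test(A.vector, experiments.matrix[probe.name, ])$p.value
  }, numeric(1))
}

pearson.cor <- lapply(probes, AnalysePearsonCorrelationLocal)
names(pearson.cor) <- probes

# One vector of p-values per probe, each as long as the number of probes.
print(length(pearson.cor))
print(length(pearson.cor[[1]]))
```

Each call is independent of the others, which is what makes the outer lapply() easy to distribute.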

Segue – large scale example
> AnalysePearsonCorelation <- function(probe) {
    A.vector <- experiments.matrix[probe, ]
    p.values <- c()
    for (probe.name in rownames(experiments.matrix)) {
      B.vector <- experiments.matrix[probe.name, ]
      p.values <- c(p.values, cor.test(A.vector, B.vector)$p.value)
    }
    return(p.values)
  }
> # pearson.cor <- lapply(probes, AnalysePearsonCorelation)
> myCluster <- createCluster(numInstances=5,
    masterBidPrice="0.68", slaveBidPrice="0.68",
    masterInstanceType="c1.xlarge", slaveInstanceType="c1.xlarge",
    copy.image=TRUE)
> pearson.cor <- emrlapply(myCluster, probes, AnalysePearsonCorelation)
> stopCluster(myCluster)
[Figure label: RNA Probes]
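Running the Segue version requires an AWS account. As a local analogue of the same create / apply / stop pattern, the sketch below uses the parallel package that ships with R. This is not Segue and does not touch Elastic MapReduce; the simulated data, the worker count of 2, and the name PValuesForProbe are assumptions made for illustration.

```r
library(parallel)  # ships with base R

set.seed(1)
# Hypothetical small data set standing in for the real matrix.
experiments.matrix <- matrix(rnorm(10 * 30), nrow = 10,
                             dimnames = list(paste0("probe", 1:10), NULL))
probes <- rownames(experiments.matrix)

PValuesForProbe <- function(probe) {
  A.vector <- experiments.matrix[probe, ]
  vapply(rownames(experiments.matrix), function(probe.name) {
    cor.test(A.vector, experiments.matrix[probe.name, ])$p.value
  }, numeric(1))
}

# 1. Start workers (local R processes here, not EC2 instances).
localCluster <- makeCluster(2)
# 2. Ship the data the function depends on to every worker.
clusterExport(localCluster, "experiments.matrix", envir = environment())
# 3. Same call shape as lapply(), with the cluster as the first argument.
parallel.result <- parLapply(localCluster, probes, PValuesForProbe)
# 4. Always shut the workers down when finished.
stopCluster(localCluster)

# The serial version, for comparison with the parallel result.
serial.result <- lapply(probes, PValuesForProbe)
print(isTRUE(all.equal(parallel.result, serial.result)))
```

The structure mirrors the slide: only the cluster setup, the apply call, and the teardown differ from the serial code, while the analysis function is unchanged.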

Discovering genes
Topomaps of clustered genes
This work was based on a similar approach to: A Gene Expression Map for Caenorhabditis elegans, Stuart K. Kim, et al., Science 293, 2087 (2001)

Conclusions
• R is great for statistics.
• It’s easy to scale up R using Segue.
• We are all going to live very long.

Thanks!
• Questions?
• References:
  http://code.google.com/r/radek-segue/
  http://www.dataminelab.com
