
R Analytics in the Cloud

R Analytics in the Cloud. Introduction. Radek Maciaszek, DataMine Lab (www.dataminelab.com) - Data mining, business intelligence and data warehouse consultancy. MSc in Bioinformatics at Birkbeck, University of London.


Presentation Transcript


  1. R Analytics in the Cloud

  2. Introduction • Radek Maciaszek • DataMine Lab (www.dataminelab.com) - Data mining, business intelligence and data warehouse consultancy. • MSc in Bioinformatics at Birkbeck, University of London. • Project at UCL Institute of Healthy Ageing under supervision of Dr Eugene Schuster.

  3. Primer in Bioinformatics • Bioinformatics - applying computer science to biology (DNA, proteins, drug discovery, etc.) • Ageing strategy – solve it in a simple organism and apply the findings to more complex organisms (e.g. humans). • Goal: find genes responsible for ageing • Caenorhabditis elegans

  4. Central dogma of molecular biology Genes are encoded by the DNA. Microarray (100 x 100) • Database of 50 curated experiments. • 10k genes compared to each other
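To get a feel for the size of an all-against-all comparison, the number of gene pairs can be worked out directly. This is a rough sketch: it assumes exactly 10,000 genes (the slide only says "10k") and that every gene is compared with every other one. The variable names are illustrative.

```r
# Assumed round number taken from the slide ("10k genes").
n.genes <- 10000

# Unique unordered pairs, excluding a gene paired with itself.
unique.pairs <- choose(n.genes, 2)

# Ordered pairs including self-comparisons (a full n x n table).
all.pairs <- n.genes^2

# scientific = FALSE keeps the large counts in plain digits.
format(unique.pairs, big.mark = ",", scientific = FALSE)  # "49,995,000"
format(all.pairs, big.mark = ",", scientific = FALSE)     # "100,000,000"
```

Either way the job is tens of millions of independent statistical tests, which is the kind of workload that splits naturally across machines.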

  5. Why R? • Very popular in bioinformatics • Functional, scripting programming language • Swiss-army knife for statisticians • Designed by statisticians for statisticians • Lots of ready-to-use packages (CRAN)

  6. R limitations & Hadoop • Data needs to fit in memory • Single-threaded • Hadoop integration: • Hadoop Streaming • Rhipe: http://ml.stat.purdue.edu/rhipe/ • Segue: http://code.google.com/p/segue/
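Of the three integration routes, Hadoop Streaming is the most bare-bones: the mapper is any program that reads lines and writes tab-separated key/value lines. The sketch below shows what such a mapper could look like in R. The input layout (`probe_id<TAB>v1,v2,...`) and the function names `map.line` and `run.mapper` are assumptions made for illustration, not something taken from the talk.

```r
# Hypothetical mapper logic for Hadoop Streaming.
# Assumed input: one line per probe, "probe_id<TAB>v1,v2,v3,..."
# Output: "probe_id<TAB>mean", i.e. a key, a tab, then a value.
map.line <- function(line) {
  fields <- strsplit(line, "\t", fixed = TRUE)[[1]]
  values <- as.numeric(strsplit(fields[2], ",", fixed = TRUE)[[1]])
  paste(fields[1], mean(values), sep = "\t")
}

# Read an open connection line by line, so memory use stays flat
# regardless of input size.
run.mapper <- function(con) {
  while (length(line <- readLines(con, n = 1)) > 0) {
    cat(map.line(line), "\n", sep = "")
  }
}

# Local demonstration with an in-memory connection instead of stdin.
demo <- textConnection(c("probeA\t1,2,3", "probeB\t4,6"))
run.mapper(demo)
close(demo)
```

Under Hadoop Streaming the same `run.mapper` would be pointed at an opened `file("stdin")` connection. Reading one line at a time also sidesteps the "data needs to fit in memory" limitation for this kind of record-by-record work.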

  7. Segue • Works with Amazon Elastic MapReduce. • Creates a cluster for you. • Designed for Big Computations (rather than Big Data) • Implements a cloud version of the lapply() function.

  8. Segue workflow (emrlapply) [Diagram labels: List (local), R, S3, Elastic MapReduce, List (remote), Amazon AWS]

  9. R very quick example
      > m <- list(a = 1:10, b = exp(-3:3))
      > lapply(m, mean)
      $a
      [1] 5.5
      $b
      [1] 4.535125
      lapply(X, FUN) returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.

  10. Segue – large scale example
      > # For one probe, return the p-values of the Pearson correlation test
      > # between that probe and every probe (row) in experiments.matrix.
      > AnalysePearsonCorelation <- function(probe) {
          A.vector <- experiments.matrix[probe,]
          p.values <- c()
          for (probe.name in rownames(experiments.matrix)) {
            B.vector <- experiments.matrix[probe.name,]
            p.values <- c(p.values, cor.test(A.vector, B.vector)$p.value)
          }
          return(p.values)
        }
      > pearson.cor <- lapply(probes, AnalysePearsonCorelation)
      Moving to the cloud in 3 lines of code!
      [Figure label: RNA Probes]
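The slide's function depends on `experiments.matrix` and `probes`, which are not shown. A small self-contained version for trying it locally might look like the sketch below. The synthetic 5 x 50 matrix and the probe names are made up for illustration, and the inner loop is swapped for `vapply`, which is a variation on the slide's code rather than the author's version.

```r
set.seed(42)  # reproducible synthetic data

# Synthetic stand-in for the real data: 5 probes (rows) x 50 experiments (columns).
experiments.matrix <- matrix(rnorm(5 * 50), nrow = 5,
                             dimnames = list(paste0("probe", 1:5), NULL))
probes <- rownames(experiments.matrix)

# Same computation as the slide, but vapply pre-sizes the result
# instead of growing p.values with c() on every iteration.
AnalysePearsonCorelation <- function(probe) {
  A.vector <- experiments.matrix[probe, ]
  vapply(rownames(experiments.matrix), function(probe.name) {
    B.vector <- experiments.matrix[probe.name, ]
    cor.test(A.vector, B.vector)$p.value
  }, numeric(1))
}

pearson.cor <- lapply(probes, AnalysePearsonCorelation)
names(pearson.cor) <- probes

# Bind the per-probe vectors into a probes x probes matrix of p-values.
p.matrix <- do.call(rbind, pearson.cor)
dim(p.matrix)
```

Each call to `AnalysePearsonCorelation` is independent of the others, which is what makes the outer `lapply` a good candidate for distribution.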

  11. Segue – large scale example
      > AnalysePearsonCorelation <- function(probe) {
          A.vector <- experiments.matrix[probe,]
          p.values <- c()
          for (probe.name in rownames(experiments.matrix)) {
            B.vector <- experiments.matrix[probe.name,]
            p.values <- c(p.values, cor.test(A.vector, B.vector)$p.value)
          }
          return(p.values)
        }
      > # pearson.cor <- lapply(probes, AnalysePearsonCorelation)
      > myCluster <- createCluster(numInstances=5,
          masterBidPrice="0.68", slaveBidPrice="0.68",
          masterInstanceType="c1.xlarge", slaveInstanceType="c1.xlarge",
          copy.image=TRUE)
      > pearson.cor <- emrlapply(myCluster, probes, AnalysePearsonCorelation)
      > stopCluster(myCluster)
      [Figure label: RNA Probes]
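One practical concern with the three-line version is that `stopCluster()` is never reached if the job fails, which would leave a paid cluster running. A possible guard is sketched below using base R's `on.exit`. `WithCluster` is a hypothetical helper, not part of Segue, and the `fake.*` functions are local stand-ins so the sketch runs without AWS.

```r
# Hypothetical helper (not part of Segue): create a cluster, run `work`
# against it, and always call `stop.fn`, even if `work` throws an error.
WithCluster <- function(create.fn, work, stop.fn) {
  cluster <- create.fn()
  on.exit(stop.fn(cluster), add = TRUE)
  work(cluster)
}

# Local stand-ins that only record which calls happened.
calls <- character(0)
fake.create <- function() { calls <<- c(calls, "create"); "fake-cluster" }
fake.stop   <- function(cluster) { calls <<- c(calls, "stop") }

# Simulate a failing job; the error is caught here only to show the message.
result <- tryCatch(
  WithCluster(fake.create, function(cluster) stop("job failed"), fake.stop),
  error = function(e) conditionMessage(e)
)

calls   # "create" "stop": the cluster was stopped despite the error
result  # "job failed"

# With Segue loaded, the helper would be called roughly as:
# pearson.cor <- WithCluster(
#   function() createCluster(numInstances = 5),
#   function(cl) emrlapply(cl, probes, AnalysePearsonCorelation),
#   stopCluster)
```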

  12. Discovering genes • Topomaps of clustered genes • This work was based on a similar approach to: A Gene Expression Map for Caenorhabditis elegans, Stuart K. Kim, et al., Science 293, 2087 (2001)

  13. Conclusions • R is great for statistics. • It’s easy to scale up R using Segue. • We are all going to live very long.

  14. Thanks! • Questions? • References: http://code.google.com/r/radek-segue/ and http://www.dataminelab.com
