R Analytics in the Cloud. Introduction. Radek Maciaszek, DataMine Lab (www.dataminelab.com) - Data mining, business intelligence and data warehouse consultancy. MSc in Bioinformatics at Birkbeck, University of London.
Introduction • Radek Maciaszek • DataMine Lab (www.dataminelab.com) - Data mining, business intelligence and data warehouse consultancy. • MSc in Bioinformatics at Birkbeck, University of London. • Project at UCL Institute of Healthy Ageing under supervision of Dr Eugene Schuster.
Primer in Bioinformatics • Bioinformatics - applying computer science to biology (DNA, proteins, drug discovery, etc.) • Ageing strategy – solve it in a simple organism and apply the findings to more complex organisms (e.g. humans). • Goal: find genes responsible for ageing • Organism studied: Caenorhabditis elegans
Central dogma of molecular biology • Genes are encoded by the DNA. • Microarray (100 x 100) • Database of 50 curated experiments. • 10k genes compared to each other
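The size of "genes compared to each other" can be worked out directly. The following is a back-of-the-envelope sketch, assuming the 100 x 100 microarray corresponds to the roughly 10k genes mentioned on the slide and that every gene is tested against every other one (variable names are illustrative only):

```r
# Assumption: a 100 x 100 microarray gives roughly 10,000 probes/genes.
n.genes <- 100 * 100

# All ordered pairs, including each gene against itself
ordered.pairs <- n.genes^2

# Distinct unordered pairs of different genes
unordered.pairs <- choose(n.genes, 2)

cat("Ordered pairs:  ", format(ordered.pairs, big.mark = ",", scientific = FALSE), "\n")
cat("Unordered pairs:", format(unordered.pairs, big.mark = ",", scientific = FALSE), "\n")
```

Either way the workload is tens of millions of correlation tests, which is a compute-heavy problem on a modest amount of input data.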
Why R? • Very popular in bioinformatics • Functional scripting language • Swiss-army knife for statisticians • Designed by statisticians for statisticians • Lots of ready-to-use packages (CRAN)
R limitations & Hadoop • Data needs to fit in memory • Single-threaded • Hadoop integration: • Hadoop Streaming • Rhipe: http://ml.stat.purdue.edu/rhipe/ • Segue: http://code.google.com/p/segue/
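Of the integration options listed, Hadoop Streaming is the most basic: the mapper is any program that reads lines from standard input and writes tab-separated key/value lines to standard output. Below is a minimal sketch of such a mapper in R. The input format (`probe_id<TAB>v1,v2,...`) and the function names `map.line` and `run.mapper` are assumptions made for illustration, not part of the talk.

```r
# Hypothetical input format (an assumption): "probe_id<TAB>v1,v2,v3,..."
# Emits "probe_id<TAB>mean" for one input line.
map.line <- function(line) {
  parts <- strsplit(line, "\t", fixed = TRUE)[[1]]
  values <- as.numeric(strsplit(parts[2], ",", fixed = TRUE)[[1]])
  paste(parts[1], mean(values), sep = "\t")
}

# Reads all lines from `input` and writes one key/value line per record.
# Under Hadoop Streaming the defaults (stdin/stdout) would be used.
run.mapper <- function(input = file("stdin"), output = stdout()) {
  lines <- readLines(input)
  for (line in lines) {
    if (nzchar(line)) writeLines(map.line(line), output)
  }
}

# Local dry run without Hadoop, using an in-memory connection
demo.input <- textConnection(c("probe_1\t1,2,3", "probe_2\t10,20"))
run.mapper(demo.input)
close(demo.input)
```

In a real job the script would end with a plain `run.mapper()` call and be passed to Hadoop Streaming as the mapper executable.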
Segue • Works with Amazon Elastic MapReduce. • Creates a cluster for you. • Designed for Big Computations (rather than Big Data) • Implements a cloud version of the lapply() function.
Segue workflow (emrlapply) [Diagram labels: List (local), R, S3, Elastic MapReduce, List (remote), Amazon AWS]
R very quick example
> m <- list(a = 1:10, b = exp(-3:3))
> lapply(m, mean)
$a
[1] 5.5

$b
[1] 4.535125

lapply(X, FUN) returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.
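The "same length as X" property holds for any FUN, not just built-ins like mean. A small sketch using the same list `m` with a user-defined function (the name `summarise.element` is illustrative):

```r
m <- list(a = 1:10, b = exp(-3:3))

# An illustrative user-defined FUN returning several statistics per element
summarise.element <- function(x) c(mean = mean(x), sd = sd(x), n = length(x))

result <- lapply(m, summarise.element)

# The result has the same length and names as the input list
length(result) == length(m)
identical(names(result), names(m))

# sapply() runs the same computation but simplifies the result to a matrix
sapply(m, summarise.element)
```

Because each element is processed independently of the others, the work can be split across machines without changing the function itself.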
Segue – large scale example
> AnalysePearsonCorelation <- function(probe) {
    A.vector <- experiments.matrix[probe,]
    p.values <- c()
    for (probe.name in rownames(experiments.matrix)) {
      B.vector <- experiments.matrix[probe.name,]
      p.values <- c(p.values, cor.test(A.vector, B.vector)$p.value)
    }
    return (p.values)
  }
> pearson.cor <- lapply(probes, AnalysePearsonCorelation)
Moving to the cloud in 3 lines of code!
[Figure: RNA Probes]
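The slide does not show how experiments.matrix and probes are built. To try the function locally, here is a self-contained sketch on synthetic data, assuming rows are probes and columns are experiments. The toy dimensions and random values are made up, and the loop is rewritten with vapply() to avoid growing a vector, which computes the same p-values.

```r
set.seed(42)  # reproducible toy data

# Assumption: rows are probes, columns are experiments (synthetic values)
n.probes <- 5
n.experiments <- 50
experiments.matrix <- matrix(
  rnorm(n.probes * n.experiments),
  nrow = n.probes,
  dimnames = list(paste0("probe", seq_len(n.probes)), NULL)
)
probes <- rownames(experiments.matrix)

# Same logic as the slide: p-value of the Pearson correlation test
# between one probe and every probe in the matrix
AnalysePearsonCorelation <- function(probe) {
  A.vector <- experiments.matrix[probe, ]
  vapply(rownames(experiments.matrix), function(probe.name) {
    B.vector <- experiments.matrix[probe.name, ]
    cor.test(A.vector, B.vector)$p.value
  }, numeric(1))
}

pearson.cor <- lapply(probes, AnalysePearsonCorelation)

# Collect the per-probe vectors into a probes x probes matrix of p-values
p.matrix <- do.call(rbind, pearson.cor)
rownames(p.matrix) <- probes
round(p.matrix, 3)
```

Each call depends only on its own probe plus the shared matrix, which is what makes the lapply() call a good candidate for distribution.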
Segue – large scale example
> AnalysePearsonCorelation <- function(probe) {
    A.vector <- experiments.matrix[probe,]
    p.values <- c()
    for (probe.name in rownames(experiments.matrix)) {
      B.vector <- experiments.matrix[probe.name,]
      p.values <- c(p.values, cor.test(A.vector, B.vector)$p.value)
    }
    return (p.values)
  }
> # pearson.cor <- lapply(probes, AnalysePearsonCorelation)
> myCluster <- createCluster(numInstances=5, masterBidPrice="0.68", slaveBidPrice="0.68", masterInstanceType="c1.xlarge", slaveInstanceType="c1.xlarge", copy.image=TRUE)
> pearson.cor <- emrlapply(myCluster, probes, AnalysePearsonCorelation)
> stopCluster(myCluster)
[Figure: RNA Probes]
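The same swap, replacing lapply() with a parallel drop-in, can be tried without an AWS account. The sketch below uses mclapply() from base R's parallel package as a local stand-in. It is not Segue and does not touch Elastic MapReduce; the toy task `slow.square` is invented for illustration.

```r
library(parallel)

# Toy CPU-bound task (illustrative, not from the slides)
slow.square <- function(x) {
  Sys.sleep(0.1)  # pretend this is an expensive computation
  x^2
}
inputs <- as.list(1:8)

# Serial baseline
serial <- lapply(inputs, slow.square)

# mclapply() relies on forking, which is unavailable on Windows,
# so fall back to a single core there
n.cores <- if (.Platform$OS.type == "windows") 1L else 2L
parallel.result <- mclapply(inputs, slow.square, mc.cores = n.cores)

# Both approaches return the same list
identical(serial, parallel.result)
```

As with emrlapply(), only the call site changes and the worker function stays as it was.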
Discovering genes • Topomaps of clustered genes • This work was based on a similar approach to: A Gene Expression Map for Caenorhabditis elegans, Stuart K. Kim, et al., Science 293, 2087 (2001)
Conclusions • R is great for statistics. • It’s easy to scale up R using Segue. • We are all going to live very long.
Thanks! • Questions? • References: http://code.google.com/r/radek-segue/ • http://www.dataminelab.com