Elastic-R: A Secure Collaborative Virtual Environment in the Cloud for Computational Bioscience Research
Karim Chine
Cloud Era Ltd
karim.chine@cloudera.co.uk
Agenda:
• Scientific data analysis in the cloud, Elastic-R and the data deluge.
• Collaborative research and development, Elastic-R as a Google Docs-like environment for data analysis.
• Rapid scientific applications development and delivery, Elastic-R as an IaaS-based applications factory.
• The cloud as a reproducible research platform for computational biosciences.
• Applications convergence: Excel, Word, Eclipse, etc. as front-ends for the cloud.
• User-friendly high throughput computing.
Jim Gray with his colleagues Gianfranco Putzolu and Irving Traiger in the late '70s / early '80s, when they did groundbreaking work on concurrency control for databases (image courtesy of Heather Gray)
Cyberinfrastructure
Technological solution to the problem of efficiently connecting data, computers, and people with the goal of enabling derivation of novel scientific theories and knowledge (Wikipedia)
What’s wrong with the GRID?
“the abstractions that Grids expose – to the end-user, to the deployers and to application developers – are inappropriate and they need to be higher level” (Jha, Merzky, & Fox, 2009)
Suppose [a person] had a basket full of apples and, being worried that some of the apples were rotten, wanted to take out the rotten ones to prevent the rot spreading. How would he proceed? Would he not begin by tipping the whole lot out of the basket? And would not the next step be to cast his eye over each apple in turn, and pick up and put back in the basket only those he saw to be sound, leaving the others? In just the same way, those who have never philosophized correctly have various opinions in their minds which they have begun to store up since childhood, and which they therefore have reason to believe may in many cases be false. They then attempt to separate the false beliefs from the others, so as to prevent their contaminating the rest and making the whole lot uncertain. Now the best way they can accomplish this is to reject all their beliefs together in one go, as if they were all uncertain and false. They can then go over each belief in turn and re-adopt only those which they recognize to be true and indubitable. (Replies 7, AT 7:481)
We've all heard about how on-demand computing and storage will transform scientific practice. But by focusing on resources alone, we're missing the real benefit of the large-scale outsourcing and consequent economies of scale that cloud is about. The biggest IT challenge facing science today is not volume but complexity. Sure, terabytes demand new storage and computing solutions. But they're cheap. It is establishing and operating the processes required to collect, manage, analyze, share, archive, etc., that data that is taking all of our time and killing creativity. And that's where outsourcing can be transformative. An entrepreneur can run a small business from a coffee shop, outsourcing essentially every business function to a software-as-a-service provider--accounting, payroll, customer relationship management, the works. Why can't a young researcher run a research lab from a coffee shop? For that to happen, we need to make it easy for providers to develop "apps" that encapsulate useful capabilities and for researchers to discover, customize, and apply these "apps" in their work. The effect, I will argue, will be a dramatic acceleration of discovery. Ian Foster, Argonne National Laboratory
The Definition of Cloud Computing
Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model promotes availability and is composed of five essential characteristics, three service models, and four deployment models.
Essential Characteristics:
• On-demand self-service. A consumer can unilaterally provision computing capabilities, such as server time and network storage, as needed automatically without requiring human interaction with each service’s provider.
• Broad network access. Capabilities are available over the network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g., mobile phones, laptops, and PDAs).
• Resource pooling. The provider’s computing resources are pooled to serve multiple consumers using a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to consumer demand. There is a sense of location independence in that the customer generally has no control or knowledge over the exact location of the provided resources but may be able to specify location at a higher level of abstraction (e.g., country, state, or datacenter). Examples of resources include storage, processing, memory, network bandwidth, and virtual machines.
• Rapid elasticity. Capabilities can be rapidly and elastically provisioned, in some cases automatically, to quickly scale out and rapidly released to quickly scale in. To the consumer, the capabilities available for provisioning often appear to be unlimited and can be purchased in any quantity at any time.
• Measured service. Cloud systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g., storage, processing, bandwidth, and active user accounts). Resource usage can be monitored, controlled, and reported, providing transparency for both the provider and consumer of the utilized service.
Service Models:
• Cloud Software as a Service (SaaS). The capability provided to the consumer is to use the provider’s applications running on a cloud infrastructure. The applications are accessible from various client devices through a thin client interface such as a web browser (e.g., web-based email). The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings.
• Cloud Platform as a Service (PaaS). The capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created or acquired applications created using programming languages and tools supported by the provider. The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, or storage, but has control over the deployed applications and possibly application hosting environment configurations.
• Cloud Infrastructure as a Service (IaaS). The capability provided to the consumer is to provision processing, storage, networks, and other fundamental computing resources where the consumer is able to deploy and run arbitrary software, which can include operating systems and applications. The consumer does not manage or control the underlying cloud infrastructure but has control over operating systems, storage, deployed applications, and possibly limited control of select networking components (e.g., host firewalls).
Deployment Models:
• Private cloud. The cloud infrastructure is operated solely for an organization. It may be managed by the organization or a third party and may exist on premise or off premise.
• Community cloud. The cloud infrastructure is shared by several organizations and supports a specific community that has shared concerns (e.g., mission, security requirements, policy, and compliance considerations). It may be managed by the organizations or a third party and may exist on premise or off premise.
• Public cloud. The cloud infrastructure is made available to the general public or a large industry group and is owned by an organization selling cloud services.
• Hybrid cloud. The cloud infrastructure is a composition of two or more clouds (private, community, or public) that remain unique entities but are bound together by standardized or proprietary technology that enables data and application portability (e.g., cloud bursting for load-balancing between clouds).
Note: Cloud software takes full advantage of the cloud paradigm by being service oriented with a focus on statelessness, low coupling, modularity, and semantic interoperability.
It is all about AUTOMATION!
Cloud = Scriptable Infrastructure
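To make "scriptable infrastructure" concrete, here is a minimal Python sketch. It assumes a purely hypothetical in-memory provider (`FakeCloud`, with `provision` and `release`); none of these names come from Elastic-R or from any real cloud API. The point is only that machines are acquired and released by a script rather than by hand.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field
from itertools import count


@dataclass
class FakeCloud:
    """In-memory stand-in for an IaaS provider (hypothetical, not a real API)."""
    running: set = field(default_factory=set)
    _ids: count = field(default_factory=count)

    def provision(self, n):
        # On-demand self-service: hand out n fresh machine identifiers.
        ids = [f"vm-{next(self._ids)}" for _ in range(n)]
        self.running.update(ids)
        return ids

    def release(self, ids):
        # Rapid elasticity: give the machines back as soon as they are idle.
        self.running.difference_update(ids)


@contextmanager
def cluster(cloud, size):
    """Provision `size` machines and always release them, even on error."""
    ids = cloud.provision(size)
    try:
        yield ids
    finally:
        cloud.release(ids)


def run_analysis(cloud, size):
    with cluster(cloud, size) as machines:
        # Placeholder for real work: record which machine would take each task.
        return {f"task-{i}": vm for i, vm in enumerate(machines)}
```

Calling `run_analysis(FakeCloud(), 3)` returns a task-to-machine plan and leaves no machine running afterwards, which is the property a pay-per-use setting cares about.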
Elastic-R: Scientific Computing-as-a-Service
R
• Open-source (GPL) software environment for statistical computing and graphics.
• Lingua franca of data analysis.
• Repositories of contributed R packages related to a variety of problem domains in life sciences, social sciences, finance, econometrics, chemometrics, etc. are growing at an exponential rate.
• R is Super Glue.
Sites shown on the slide: www.scipy.org, www.python.org, www.sagemath.org, www.scilab.org, www.wolfram.com, www.mathworks.com, office.microsoft.com, www.spss.com, www.sas.com, http://root.cern.ch
[Figure: evolution of the number of CRAN packages]
Elastic-R: Plug-and-play scientific computing
• Computational Components: R packages (CRAN, Bioconductor, wrapped C, C++, Fortran code), Scilab modules, Matlab toolkits, etc. Open source or commercial.
• Computational User Interfaces: workbench within the browser; built-in views / plugins / spreadsheets; collaborative views. Open source or commercial.
• Computational Resources: hardware- and OS-agnostic computing engine (R, Scilab, ...); clusters, grids, private or public clouds. Free (academic grids) or pay-per-use (EC2, Azure).
• Computational Data Storage: local, NFS, FTP, Amazon S3, Amazon EBS. Free or commercial.
• Computational Scripts: R / Python / Groovy. On client side: interactivity, ... On server side: data transfer, ...
• Computational Application Programming Interfaces: Java / SOAP / REST, stateless and stateful.
• Generated Computational Web Services: stateful or stateless, automatic mapping of R data objects and functions.
Elastic-R: a platform for scientific computing on Infrastructure-as-a-Service style clouds
Scientific data analysis in the cloud, Elastic-R and the data deluge.
[Diagram: "Your big data is here" (cloud side) and "You are here" (client side), connected by links labelled RESTful WS over SSL, SOAP over SSL, Heartbeat, SSH, and HTTPS]
Collaborative research and development, Elastic-R as a Google Docs-like environment for data analysis.
Scientists can share their machine instances, computing engines, data, spreadsheets, GUIs, etc. and collaborate in real time
Dedicated portals for decentralized and private collaboration
[Diagram: Amazon Virtual Private Cloud with Subnet 1, Subnet 2, and Subnet 3]
Rapid scientific applications development and delivery, Elastic-R as an IaaS-based applications factory.
Elastic-R: interoperability layer for public and private clouds
The cloud as a reproducible research platform for computational biosciences.
Author from the Sanger Institute communicating details about the virtual appliances used to produce a paper’s computational results to a journal reviewer
Elastic-R Amazon Machine Images (Elastic-R.org):
• Elastic-R AMI 1: R 2.10 + BioC 2.5
• Elastic-R AMI 2: R 2.9 + BioC 2.3
• Elastic-R AMI 3: R 2.8 + BioC 2.0
Amazon Elastic Block Stores:
• Elastic-R EBS 1: Data Set XXX
• Elastic-R EBS 2: Data Set YYY
• Elastic-R EBS 3: Data Set ZZZ
• Elastic-R EBS 4: Data Set VVV
Highlighted in the diagram: Elastic-R AMI 2 (R 2.9 + BioC 2.3) with Elastic-R EBS 4 (Data Set VVV)
JANSSEN PHARMACEUTICA communicating to the FDA details about the virtual appliances used to produce the computational results in a new drug application
Elastic-R Amazon Machine Images (Elastic-R.org):
• Elastic-R AMI 1: R 2.10 + BioC 2.5
• Elastic-R AMI 2: R 2.9 + BioC 2.3
• Elastic-R AMI 3: R 2.8 + BioC 2.0
Amazon Elastic Block Stores:
• Elastic-R EBS 1: Data Set XXX
• Elastic-R EBS 2: Data Set YYY
• Elastic-R EBS 3: Data Set ZZZ
• Elastic-R EBS 4: Data Set VVV
Highlighted in the diagram: Elastic-R AMI 2 (R 2.9 + BioC 2.3) with Elastic-R EBS 4 (Data Set VVV)
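The two slides above describe a computational result by the pairing of a machine image and a data volume. One way to picture what gets communicated to a reviewer or regulator is a small, serializable record. The Python sketch below is only an illustration under stated assumptions: the `Appliance` class, its field names, and the fingerprint are hypothetical and are not part of Elastic-R; the labels are the ones printed on the slides.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Appliance:
    """Hypothetical record of the environment behind a computational result."""
    image: str         # label of the machine image, as printed on the slide
    r_version: str     # R version installed on that image
    bioc_version: str  # Bioconductor version installed on that image
    volume: str        # label of the block store holding the data
    dataset: str       # name of the data set on that volume

    def to_json(self):
        # Sorted keys make the text form deterministic.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text):
        return cls(**json.loads(text))

    def fingerprint(self):
        # Short, comparable identifier derived from the deterministic JSON.
        return hashlib.sha256(self.to_json().encode("utf-8")).hexdigest()


paper_env = Appliance(
    image="Elastic-R AMI 2",
    r_version="2.9",
    bioc_version="2.3",
    volume="Elastic-R EBS 4",
    dataset="Data Set VVV",
)
```

The author could send `paper_env.to_json()` alongside the manuscript; the recipient rebuilds the record with `Appliance.from_json` and compares fingerprints to confirm both sides refer to the same image and data set.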
Applications convergence: Excel, Word, Eclipse, etc. as front-ends for the cloud
Scientists can control in parallel any number of stateful R/Python engines from within an R/Python session, on the cloud or on a local machine
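As a rough illustration of controlling several stateful engines in parallel from one session, the Python sketch below uses local threads and a toy `Engine` class. Both `Engine` and `broadcast` are hypothetical names invented for this example; they are not the Elastic-R API, and a real engine would be a remote R or Python process rather than a dictionary.

```python
from concurrent.futures import ThreadPoolExecutor


class Engine:
    """Toy stateful engine: values assigned in one call persist to the next."""

    def __init__(self, name):
        self.name = name
        self._env = {}  # private state of this engine

    def assign(self, var, value):
        self._env[var] = value

    def evaluate(self, expression):
        # Evaluate a Python expression against this engine's own state.
        # eval() is used for illustration only; never run it on untrusted input.
        return eval(expression, {"__builtins__": {}}, self._env)


def broadcast(engines, expression):
    """Send the same expression to every engine concurrently, collect results."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = {e.name: pool.submit(e.evaluate, expression) for e in engines}
        return {name: future.result() for name, future in futures.items()}


engines = [Engine(f"engine-{i}") for i in range(3)]
for i, engine in enumerate(engines):
    engine.assign("x", i + 1)  # each engine holds a different value of x

results = broadcast(engines, "x * 10")
```

Because each engine keeps its own `x`, the same expression yields a different result per engine, which is what distinguishes a stateful engine from a stateless service call.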
Acknowledgments
ACS: Madi Nassiri
Amazon: Simone Brunozzi, Deepak Singh
AT&T Research Labs: Simon Urbanek
Auckland Centre for eResearch: Nick Jones
Banca d'Italia: Giuseppe Bruno
Bio-IT World: Kevin Davies
BNP Paribas: Ousseynou Nakoulima
Cambridge Healthtech Institute: Cindy Crowninshield, Deborah Shear
City University of New York: Mario Morales, Makram Talih
Columbia University: Omar Besbes
Dassault Systèmes: Omri Ben Ayoun, Patrick Johnson
Dataspora: Michael E. Driscoll
EDF: Alejandro Ribes
EBI: Alvis Brazma, Wolfgang Huber, Kimmo Kallio, Misha Kapushesky, Michael Kleen, Alberto Labarga, Philippe Rocca-Serra, Ugis Sarkans, Kirsten Williams, Eamonn Maguire
EPFL: Darlene Goldstein
ESPRIT: Farouk Kammoun, Tahar Benlakhdar
e-Taalim: Nadhir Douma
ETH Zürich: Yohan Chalabi, Diethelm Würtz, Martin Mächler
European Commission: Konstantinos Glinos, Enric Mitjana, Monika Kacik, Ioannis Sagias
FHCRC: Martin Morgan, Nianhua Li, Seth Falcon
Google: Olivier Bosquet
FVG LLC: Lisa Wood
Harvard University: Tim Clark, Sudeshna Das, Douglas Burke, Paolo Ciccarese
IBM: Jean-Louis Bernaudin, Pascal Sempe, Loic Simon, Lea A Deleris, Alex Fleischer, Alain Chabrier
Imperial College London: Asif Akram, Vasa Curcin, John Darlington, Brian Fuchs
Indiana University: Michael Grobe
INRIA: David Monteau, Christian Saguez, Claude Gomez, Sylvestre Ledru
JISC: John Wood, David Flanders
Johnson & Johnson - Janssen Pharmaceutica: Patrick Marichal
KXEN: Eric Marcade
Lancaster University: Robert Crouchley, Daniel Grose
Leibniz Universität Hannover: Kornelius Rohmeier
LIAMA: Baogang Hue, Kang Cai
Limagrain: Zivan Karaman
Mekentosj: Alexander Griekspoor, Matt Wood
Microsoft: Eric Le Marois, Tony Hey
Mubadala: Ghazi Ben Amor
Nature Publishing Group: Ian Mulvany, Steve Scott
NCeSS: Peter Halfpenny, Rob Procter, Marzieh Asgari-Targhi, Alex Voss, YuWei Lin, Mercedes Argüello Casteleiro, Wei Jie, Meik Poschen, Katy Middlebrough, Pascal Ekin, June Finch, Farzana Latif, Elisa Pieri, Frank O'Donnell
New York Java User Group: Frank D Greco
OeRC: Dimitrina Spencer, Matteo Turilli, David Wallom, Steven Young
OMII-UK: Neil Chue Hong, Steve Brewer
OpenAnalytics: Tobias Verbeke
Oracle: Dominique van Deth, Andrew Bond
OSS Watch: Ross Gardler
Platform Computing: Christopher Smith
Royal Society: James Wilsdon
San Diego Supercomputer Center: Nancy R. Wilkins-Diehr
Sanger Institute: Lars Jorgensen, Phil Butcher
Shell: Wayne W. Jones, Nigel Smith
Société Générale: Anis Maktouf
Stanford University: John Chambers, Balasubramanian Narasimhan, Gunter Walther
SYSTEM@TIC: Karim Azoum
Technische Universität Dortmund: Uwe Ligges, Bernd Bischl
Technoforge: Pierre-Antoine Durgeat
Tekiano: Samy Ben Naceur
Télécom-ParisTech: Isabelle Demeure, Georges Hebrail, Nesrine Gabsi
The Generations Network: Jim Porzak
Total: Yannick Perigois
Tunisian Ministry of Communication Technologies: Naceur Ammar, Lamia Chaffai-Sghaier, Mohamed Saïd Ouerghi, Syrine Tlili
Tunisian Ecole Polytechnique: Riadh Robbana
UC Berkeley: Noureddine El Karoui, Terry Speed
UC Davis: Rudy Beran, Debashis Paul, Duncan Temple Lang
UCL: Daniel Jeffares
UCLA: Ivo Dinov, Jeroen Ooms
UC San Diego: Anthony Gamst
UCSF: Tena Sakai
Université Catholique de Louvain: Christian Ritter
University of Cambridge: Ian Roberts, Robert MacInnis, Peter Murray-Rust, Jim Downing, Michael Simmons, Mark Hayes
University of Manchester: Carole Goble, Len Gill, Simon Peters, Richard D Pearson, Iain Buchan, John Ainsworth
University of Plymouth: Paul Hewson
University of Split: Ivica Puljak
UTK: Ajay Ohri
World Bank Group-IFC: Oualid Ammar
Yahoo: Laurent Mirguet, Rob Weltman