
Examples of use cases Mining large-scale data: dealing with the data deluge

Biocep, towards a federative, collaborative, user-centric and grid-enabled computational open platform. Karim Chine.



Presentation Transcript


  1. Biocep, towards a federative, collaborative, user-centric and grid-enabled computational open platform. Karim Chine.

R is becoming the lingua franca of data analysis and statistical computing. It has a very powerful graphics system as well as cross-platform capabilities for packaging any computational code. Hundreds of available R packages implement the most up-to-date computational methods and reflect the state of the art of research in various fields. R packages are foreseen as an enabler of reproducible research. There is no obstacle to a large-scale deployment of R on public grids, since it is GPL-licensed software.

However, R is not multithreaded, does not operate as a server and has only a low-level, non-object-oriented API. GUI development for R remains non-standardized. R's potential as a computational back-end engine for applications has yet to be fully exploited. While its user base is growing at a high rate, this growth rate would be significantly higher in the presence of a rich, user-friendly workbench.

Biocep is a general, unified, open source solution for integrating and virtualizing access to R engines/servers. It aims to become a federative, user-friendly computational e-platform for research, finance and education. The Biocep virtual workbench enables pluggability for all the elements of a computational environment: the computational resource, whether it is a local machine, a cluster, a grid or a cloud server, via a simple URL; the computational components, via the import of R packages; and the computational GUIs, via the import of plugins from repositories or the design of new views with a drag-and-drop GUI editor.

Several dockable built-in views allow users to work interactively and collaboratively with R engines running anywhere. The views include a console, highly interactive remote graphic devices, a workspace explorer, PDF and SVG viewers, R data inspectors, linked plots and collaborative spreadsheets fully integrated with R functions and data.
Biocep is also a toolbox that can be used to generate stateless and stateful web services that automatically map R functions. It enables Python/Groovy scripting with remote R engines on both the client and the server side. Using the Biocep frameworks, pools of R engines can be deployed on heterogeneous nodes, managed, and used for parallel and distributed computing and for generating dynamic content on the fly for highly scalable web applications.

A Biocep-based R virtualization infrastructure has been successfully deployed on the National Grid Service. The result leaves no doubt about how useful this service would become for researchers. If the new platform were widely adopted, it would greatly enhance the usability of existing HPC infrastructures and would increase their usage. It may also work as an enabler of a new computing business model that would combine the utility computing model (resources) and the pay-per-use software model (components/GUIs).

Project Home: www.biocep.net

[Figure panels: R Virtualization; Collaborative R]

Examples of use cases

Mining large-scale data: dealing with the data deluge
The volume of data generated in research is growing, and the data can no longer be moved in order to be analyzed. Biocep takes the computation to the data. An R engine is a "robotic hand" that can operate anywhere: "near" the big files or within a database. It is extensible via components (R packages and dynamically loaded Java code) and remote-controllable from anywhere via the Biocep interfaces or the Virtual R Workbench (extensible with plugins, collaborative).

R session alive forever
With Biocep, an R user can create an R engine anywhere (clusters, EC2 servers, etc.), connect to it, use it and disconnect from it. Once reconnected (from anywhere), the R user retrieves the full environment, including the graphic devices.
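The create / connect / disconnect / reconnect cycle described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: it does not use or reproduce the Biocep API, it runs entirely in one Python process, and every name in it (`ToyEngine`, `ToyRegistry`, `ToySession`, `"engine-a"`) is hypothetical. The point it shows is that the workspace belongs to the engine rather than to the client connection, so a later connection sees the same state.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class ToyEngine:
    """Stand-in for a long-lived engine; its state outlives any client."""
    workspace: Dict[str, Any] = field(default_factory=dict)
    clients: int = 0


class ToySession:
    """Client handle (hypothetical); disconnecting leaves engine state intact."""

    def __init__(self, engine: ToyEngine) -> None:
        self._engine = engine

    def assign(self, var: str, value: Any) -> None:
        # The value is stored on the engine side, not in the session.
        self._engine.workspace[var] = value

    def get(self, var: str) -> Any:
        return self._engine.workspace[var]

    def disconnect(self) -> None:
        # Only the client count changes; the workspace is untouched.
        self._engine.clients -= 1


class ToyRegistry:
    """Maps a name to a running engine (hypothetical, in-process only)."""

    def __init__(self) -> None:
        self._engines: Dict[str, ToyEngine] = {}

    def create(self, name: str) -> None:
        self._engines[name] = ToyEngine()

    def connect(self, name: str) -> ToySession:
        engine = self._engines[name]
        engine.clients += 1
        return ToySession(engine)


registry = ToyRegistry()
registry.create("engine-a")

# First connection: put something in the workspace, then leave.
first = registry.connect("engine-a")
first.assign("x", [1, 2, 3])
first.disconnect()

# Second connection, possibly from elsewhere: the workspace is still there.
second = registry.connect("engine-a")
restored = second.get("x")
print(restored)
```

In a real deployment the registry lookup would be a network call and the workspace would live in a separate R process; the toy version only captures the ownership of state.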
Workflows with computational Web Services
The statefulness of the Web Services generated by Biocep solves the overhead problem caused by the transfer of intermediate results. Between workflow nodes, only proxies referencing the data are propagated; the data itself is kept in the memory of the R engine allocated for the computing session.

[Figure panels: Automatic Web Services generation for R functions; Scripting with one or many remote R engines; Workflows with generated stateful Web Services; R engine pools deployment]

Forthcoming Roadmap
Documentation: User manuals, deployment documents, Javadoc, etc.
Workbench: Finalization of the plugin architecture.
Parallel computing: Finish implementing snow-like APIs with Biocep. Biocep-based MapReduce. Fine-grained control of remote engines from the R console.
Security: Finish implementing the security architecture.
Cloud Computing: Provide VMware virtual machines and public AMIs (EC2) with R and Biocep pre-installed. Integrate basic AMI management into Biocep.
Workflows: Implement the automatic generation of KNIME nodes for R functions.
Evangelization: Seminars and tutorials planned for several institutions and companies. Tutorial at the 4th IEEE International Conference on e-Science (Indianapolis).
Search for funding: Please get in touch with the author if you are interested in being a sponsor or a partner for Biocep.
Author's contact details: karim.chine@m4x.org - +44(0)7769961344

Acknowledgements
AT&T Research Labs: Simon Urbanek
BAH: Ghazi Ben Amor
EBI: Alvis Brazma, Wolfgang Huber, Kimmo Kallio, Misha Kapushesky, Michael Kleen, Alberto Labarga, Philippe Rocca-Serra, Ugis Sarkans, Kirsten Williams
EPFL: Darlene Goldstein
ETH Zürich: Yohan Chalabi, Diethelm Würtz, Martin Mächler
FHCRC: Seth Falcon, Martin Morgan
Imperial College London: Asif Akram, Jeremy Cohen, Vasa Curcin, John Darlington, Brian Fuchs
Lancaster University: Robert Crouchley
OeRC: Dimitrina Spencer, Matteo Turilli, David Wallom, Steven Young
Oracle: Andrew Bond
Platform Computing: Christopher Smith
Technische Universität Dortmund: Uwe Ligges
University of Manchester: Richard D Pearson
University of Split: Ivica Puljak
NGS
