
Data analysis with Jupyter Notebook for Open Science

Data analysis with Jupyter Notebook for Open Science. Prof Hans Fangohr, European X-ray Free Electron Laser. Photon and Neutron Open Science Cloud. Amsterdam, 7 May 2019. Hans.Fangohr@xfel.eu, https://fangohr.github.io, @ProfCompMod


Presentation Transcript


  1. Data analysis with Jupyter Notebook for Open Science
     Prof Hans Fangohr, European X-ray Free Electron Laser
     Photon and Neutron Open Science Cloud
     Amsterdam, 7 May 2019
     Hans.Fangohr@xfel.eu, https://fangohr.github.io, @ProfCompMod

  2. Photon and Neutron Open Science Cloud – team effort
     Facilities:
     • ESRF: European Synchrotron Radiation Facility
     • ILL: Institut Laue-Langevin
     • EuXFEL: European X-ray Free Electron Laser Facility
     • ESS: The European Spallation Source
     • CERIC-ERIC: The Central European Research Infrastructure Consortium
     • ELI: Extreme Light Infrastructure
     Team:
     • Sandor Brockhauser (EuXFEL)
     • Aidan Campbell (ESRF)
     • Hans Fangohr (EuXFEL)
     • Andy Götz (ESRF)
     • Jamie Hall (ILL)
     • Jerome Kieffer (ESRF)
     • Thomas Kluyver (EuXFEL)
     • Eric Pellegrini (ILL)
     • Jean-François Perrin (ILL)
     • Carlos Reis (CERIC-ERIC)
     • Thomas Rod (ESS)
     • Robert Rosca (EuXFEL)
     • Jesper Selknaes (ESS)
     • Krzysztof Wrona (EuXFEL)

  3. Outline
     • Open Science, Data Analysis, Jupyter
     • PaNOSC requirements and vision
     • Challenges
     • Summary

  4. Data Analysis for Open Science
     • (FAIR) data is central for Open Science
     • Data provides potential for new understanding
     • Data analysis extracts the meaning from the data
     • Publications based on data
       • Data sources should be known
       • Central findings (figures, tables, numbers) should be reproducible

  5. Jupyter Notebook
     • Cells contain
       • (formatted) text
       • Code
       • Output from code execution
       • (multimedia)
     • Can execute cells interactively
     • Can save, close, load and re-execute
     • Can export to other file formats
     • Remote execution via https possible
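
The cell structure listed on this slide can be made concrete with a small sketch. A saved notebook is a JSON document whose `cells` list holds markdown cells (formatted text) and code cells (source plus stored output). The example below builds such a document with only the standard library; the cell contents and the helper names `markdown_cell` and `code_cell` are made up for illustration, and in practice the `nbformat` package would normally be used instead of hand-written dictionaries.

```python
import json


def markdown_cell(text):
    """Return a notebook cell holding formatted text (hypothetical helper)."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(source, execution_count=None, outputs=None):
    """Return a notebook cell holding code and its stored outputs (hypothetical helper)."""
    return {
        "cell_type": "code",
        "execution_count": execution_count,
        "metadata": {},
        "outputs": outputs or [],
        "source": source,
    }


# A minimal notebook in the version 4 JSON layout: one text cell and one
# code cell whose printed output is saved alongside the code.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        markdown_cell("# Figure 1\nCompute the value shown in the figure."),
        code_cell(
            "print(6 * 7)",
            execution_count=1,
            outputs=[{"output_type": "stream", "name": "stdout", "text": "42\n"}],
        ),
    ],
}

# Saving and loading is plain JSON serialisation.
text = json.dumps(notebook, indent=1)
reloaded = json.loads(text)
cell_types = [cell["cell_type"] for cell in reloaded["cells"]]
print(cell_types)
```

Because code, output and annotation live in one file, the saved document can be reloaded and re-executed later, which is what the following slides rely on.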

  6. Jupyter Notebook for Open Science
     • Combination of code, output and annotation in one document
     • If used appropriately, makes publications reproducible
     • For example: one notebook per figure in publication (examples: [1], [2])
     • Notebooks from reproducible publications make the work re-usable
     • Currently, lots of time is used by researchers to repeat the work of others, before they can advance science.
     [1] Appl. Phys. Lett. 109, 122401 (2016), https://doi.org/10.1063/1.4962726, https://github.com/fangohr/paper-supplement-2016-dmi-nanocylinder-hysteresis
     [2] Nanotechnology 27, 455502 (2016), https://doi.org/10.1088/0957-4484/27/45/455502, https://github.com/maxalbert/paper-supplement-nanoparticle-sensing

  7. Photon and Neutron Open Science Cloud (PaNOSC)
     • Research facilities using Photons and Neutrons for imaging
     • Underpins a wide range of fundamental and applied research, from basic physics to drug design
     • European XFEL: use short-wavelength photons to image small things
     • Infrastructure
       • 3.4 km tunnel
       • Accelerate electrons, then create photons from electrons
     • Data volume
       • Typical detector: 1 million pixels, each using 2 bytes
       • up to 27,000 X-ray pulses per second
       • 2 byte * 1,000,000 * 27,000 / s = 54 GB/s
       • 194 TB/h (theoretical peak)
     https://www.esrf.eu/home/news/general/content-news/general/clear-view-of-robo-neuronal-receptor-opens-door-for-new-cancer-drugs.html
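
The data-volume estimate on this slide can be checked with a few lines of arithmetic. The sketch below uses only the numbers given on the slide and assumes decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes); the variable names are chosen here for illustration.

```python
# Peak data rate for one detector, using the figures from the slide.
BYTES_PER_PIXEL = 2
PIXELS_PER_IMAGE = 1_000_000
PULSES_PER_SECOND = 27_000

bytes_per_second = BYTES_PER_PIXEL * PIXELS_PER_IMAGE * PULSES_PER_SECOND
gb_per_second = bytes_per_second / 1e9         # decimal gigabytes per second
tb_per_hour = bytes_per_second * 3600 / 1e12   # decimal terabytes per hour

print(f"{gb_per_second:.0f} GB/s")  # 54 GB/s
print(f"{tb_per_hour:.1f} TB/h")    # 194.4 TB/h
```

This is a theoretical peak, as the slide notes: it assumes one full image is recorded for every pulse, continuously.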

  8. Current state of data analysis: example European XFEL
     • Data source:
       • 2d image detectors (up to 27,000 images/s)
       • Scalar values (typically 10 values per second)
       • others
     • Data analysis
       • Calibration, data filtering, reduction, azimuthal integration, real space reconstruction, …
       • Many steps involving different programs
       • connect via ssh -X to HPC cluster
       • some programs use GUIs (X display at EuXFEL)
       • some programs use command line
       • Increasingly Jupyter Notebooks
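
To give a feel for one of the analysis steps named on this slide, here is a minimal sketch of azimuthal integration: pixel values of a 2d detector image are averaged over rings of equal distance from a centre, turning an image into a one-dimensional radial profile. This is a toy pure-Python version written for illustration, not the code used at any of the facilities; the function name, the one-pixel ring width and the test image are all assumptions, and production tools additionally handle detector geometry, masks and calibration.

```python
import math


def azimuthal_average(image, center_row, center_col):
    """Average pixel values in one-pixel-wide rings around the centre.

    Returns a dict mapping the integer ring radius to the mean pixel value.
    """
    sums = {}
    counts = {}
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            # Bin each pixel by its distance from the centre, rounded down.
            radius = int(math.hypot(r - center_row, c - center_col))
            sums[radius] = sums.get(radius, 0.0) + value
            counts[radius] = counts.get(radius, 0) + 1
    return {radius: sums[radius] / counts[radius] for radius in sorted(sums)}


# Synthetic 5x5 image whose intensity depends only on the ring index,
# so the radial profile should recover 0, 10, 20.
image = [[int(math.hypot(r - 2, c - 2)) * 10 for c in range(5)] for r in range(5)]
profile = azimuthal_average(image, center_row=2, center_col=2)
print(profile)  # {0: 0.0, 1: 10.0, 2: 20.0}
```

A step like this is command-line or Python driven, so it moves into a notebook cell without change, unlike the GUI-based programs mentioned above.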

  9. https://karabo-data.readthedocs.io/en/latest/index.html

  10. https://in.xfel.eu/readthedocs/docs/data-analysis-user-documentation/en/latest/software.html#karabo-data-interactive
      http://ftp.esrf.fr/pub/scisoft/PyNX/example_notebooks/Wavefront-propagation-operators.html

  11. Solution architecture for PaNOSC vision:
      • Finding data:
        • Web interface with database of experiment metadata
      • Exploring and analysing data remotely (= in the cloud):
        • JupyterHub serving relevant notebooks
        • Move data analysis code into the notebook
        • remote desktop in browser, connected to Desktop of virtual machine

  12. PaNOSC Use case 1: reproducibility and re-usability of published results
      • For a given publication based on facility data, users can
        • Find the data (through the EOSC web portal or URL/DOI in paper)
        • Access the data through web portal
        • Inspect the data analysis (notebook) that led to key figures / statements in the publication
        • Re-execute the data analysis through (reproducibility)
        • Modify and extend the notebook (reusability)
      • Users may include scientists, interested public, journal editors and reviewers, representatives from research councils, ...

  13. PaNOSC Use case 2: enable new data analysis on existing data sets
      • Users can
        • Search and find data sets from experiments through web portal
        • Access the data through web portal
        • Choose from appropriate selection of data analysis tools (= Jupyter Notebook templates)
        • Execute the notebook
        • Modify and extend the notebook

  14. Challenges 1
      • Different facilities
        • Currently 6 facilities involved
        • Generally use different ways to store metadata
        • Common way of classifying data sets (and experiment types)?
      • Data scale
        • For some data sets, the data cannot be moved to the compute resource

  15. Challenges 2
      • Analysis in Notebook
        • Computational environment – software (containers?)
          • Need to provide the right computational environment for each analysis type
          • How can we maintain computational environments in the future (Binder-like?)
          • Extending up to the lifetime of publications and data sets
        • Which analysis is appropriate for a data set (classification of data sets)
        • Making analysis capabilities available in the Jupyter Notebook
          • Command-line driven and Python-based computation straightforward
          • GUI-based tools more difficult / impossible
        • What to do with resulting analysis?
        • Some analysis notebooks require significant HPC resources (execute jobs from notebook?)
        • Computational environment – hardware, GPUs?

  16. Challenges 3
      • Concurrent development with EOSC hub
      • Complicated access rights for data sets
        • Data policies with embargo period
      • Publication use case:
        • Requires collaboration with scientists
        • Preparation of data analysis notebooks
      • Social / cultural challenge
        • Helped by changing metrics and expectations from funding bodies and journals
        • Research facilities can lead by example

  17. Summary
      • Introduction PaNOSC project (Photon and Neutron Open Science Cloud), http://panosc-eu.github.io
      • Focus on data analysis
      • Use cases
        • Make publications reproducible
        • EOSC MyBinder instance as first step?
        • Allow convenient exploration of existing data sets
      • Contributions / brainstorming / collaboration welcome
      This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 823852.
      Contact presenter: Hans.Fangohr@xfel.eu, @ProfCompMod
      Contact PaNOSC project: Andy Götz, Andy.Gotz@esrf.fr
