Data analysis with Jupyter Notebook for Open Science
Prof Hans Fangohr, European X-ray Free Electron Laser
Photon and Neutron Open Science Cloud
Amsterdam, 7 May 2019
Hans.Fangohr@xfel.eu, https://fangohr.github.io, @ProfCompMod
Photon and Neutron Open Science Cloud: team effort
• ESRF: European Synchrotron Radiation Facility
• ILL: Institut Laue-Langevin
• EuXFEL: European X-ray Free Electron Laser Facility
• ESS: The European Spallation Source
• CERIC-ERIC: The Central European Research Infrastructure Consortium
• ELI: Extreme Light Infrastructure
• Sandor Brockhauser (EuXFEL)
• Aidan Campbell (ESRF)
• Hans Fangohr (EuXFEL)
• Andy Götz (ESRF)
• Jamie Hall (ILL)
• Jerome Kieffer (ESRF)
• Thomas Kluyver (EuXFEL)
• Eric Pellegrini (ILL)
• Jean-François Perrin (ILL)
• Carlos Reis (CERIC-ERIC)
• Thomas Rod (ESS)
• Robert Rosca (EuXFEL)
• Jesper Selknaes (ESS)
• Krzysztof Wrona (EuXFEL)
Outline
• Open Science, Data Analysis, Jupyter
• PaNOSC requirements and vision
• Challenges
• Summary
Data Analysis for Open Science
• (FAIR) data is central for Open Science
  • Data provides potential for new understanding
  • Data analysis extracts the meaning from the data
• Publications based on data
  • Data sources should be known
  • Central findings (figures, tables, numbers) should be reproducible
Jupyter Notebook
• Cells contain
  • (formatted) text
  • Code
  • Output from code execution
  • (multimedia)
• Can execute cells interactively
• Can save, close, load and re-execute
• Can export to other file formats
• Remote execution via HTTPS possible
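The cell structure listed above can be made concrete with a minimal sketch. It assumes the version 4 notebook file format, in which a notebook is a JSON document holding a list of cells. The helper name `make_notebook` and the cell contents are illustrative only.

```python
import json


def make_notebook(cells):
    """Build a minimal nbformat-4 notebook as a plain dictionary.

    `cells` is a list of (cell_type, source) pairs, where cell_type is
    "markdown" or "code".
    """
    nb_cells = []
    for cell_type, source in cells:
        cell = {"cell_type": cell_type, "metadata": {}, "source": source}
        if cell_type == "code":
            # Code cells additionally carry their outputs and an execution counter.
            cell["outputs"] = []
            cell["execution_count"] = None
        nb_cells.append(cell)
    return {"cells": nb_cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


notebook = make_notebook([
    ("markdown", "# Detector data rate\nA short, formatted explanation."),
    ("code", "rate = 2 * 1_000_000 * 27_000\nprint(rate)"),
])

# A notebook file (.ipynb) is this structure serialised as JSON.
notebook_json = json.dumps(notebook, indent=1)
```

Because text, code and stored output live in one JSON file, a notebook can be saved, reloaded and re-executed later, which is the property the following slides rely on.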
Jupyter Notebook for Open Science
• Combination of code, output and annotation in one document
• If used appropriately, makes publications reproducible
  • For example: one notebook per figure in publication (examples: [1], [2])
• Notebooks from reproducible publications make the work re-usable
  • Currently, researchers spend a lot of time repeating the work of others before they can advance science.
[1] Appl. Phys. Lett. 109, 122401 (2016), https://doi.org/10.1063/1.4962726, https://github.com/fangohr/paper-supplement-2016-dmi-nanocylinder-hysteresis
[2] Nanotechnology 27, 455502 (2016), https://doi.org/10.1088/0957-4484/27/45/455502, https://github.com/maxalbert/paper-supplement-nanoparticle-sensing
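As a hedged sketch of what "one notebook per figure" could look like in practice, the snippet below builds the `jupyter nbconvert` command that re-executes a notebook in place. It assumes that `jupyter` and `nbconvert` are installed. The file name `figure-1.ipynb` and both helper functions are hypothetical and not taken from the cited repositories.

```python
import subprocess
from pathlib import Path


def build_execute_command(notebook_path):
    """Return the nbconvert command that re-runs one notebook in place."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--inplace",
        str(notebook_path),
    ]


def reexecute_all(directory):
    """Re-execute every notebook in `directory`; raises if one of them fails."""
    for path in sorted(Path(directory).glob("*.ipynb")):
        subprocess.run(build_execute_command(path), check=True)


# Hypothetical layout: one notebook per figure of the paper.
print(build_execute_command("figure-1.ipynb"))
```

Calling `reexecute_all` on the supplement directory of a paper would regenerate each figure from scratch and fail loudly if a notebook no longer runs.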
Photon and Neutron Open Science Cloud (PaNOSC)
• Research facilities using photons and neutrons for imaging
  • Underpins a wide range of fundamental and applied research, from basic physics to drug design
• European XFEL: uses short-wavelength photons to image small things
  • Infrastructure
    • 3.4 km tunnel
    • Accelerate electrons, then create photons from electrons
  • Data volume
    • Typical detector: 1 million pixels, each using 2 bytes
    • Up to 27,000 X-ray pulses per second
    • 2 bytes × 1,000,000 × 27,000 / s = 54 GB/s
    • 194 TB/h (theoretical peak)
https://www.esrf.eu/home/news/general/content-news/general/clear-view-of-robo-neuronal-receptor-opens-door-for-new-cancer-drugs.html
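The data-volume estimate can be checked with a few lines of arithmetic. The sketch uses only the figures quoted on the slide and assumes decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes).

```python
BYTES_PER_PIXEL = 2
PIXELS_PER_IMAGE = 1_000_000
PULSES_PER_SECOND = 27_000

bytes_per_second = BYTES_PER_PIXEL * PIXELS_PER_IMAGE * PULSES_PER_SECOND
gb_per_second = bytes_per_second / 1e9          # decimal gigabytes per second
tb_per_hour = bytes_per_second * 3600 / 1e12    # decimal terabytes per hour

print(f"{gb_per_second:.0f} GB/s, {tb_per_hour:.1f} TB/h")
# prints: 54 GB/s, 194.4 TB/h
```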
Current state of data analysis: example European XFEL
• Data source:
  • 2D image detectors (up to 27,000 images/s)
  • Scalar values (typically 10 values per second)
  • others
• Data analysis
  • Calibration, data filtering, reduction, azimuthal integration, real space reconstruction, …
  • Many steps involving different programs
    • connect via ssh -X to HPC cluster
    • some programs use GUIs (X display at EuXFEL)
    • some programs use command line
  • Increasingly Jupyter Notebooks
https://in.xfel.eu/readthedocs/docs/data-analysis-user-documentation/en/latest/software.html#karabo-data-interactive
http://ftp.esrf.fr/pub/scisoft/PyNX/example_notebooks/Wavefront-propagation-operators.html
Solution architecture for PaNOSC vision:
• Finding data:
  • Web interface with database of experiment metadata
• Exploring and analysing data remotely (= in the cloud):
  • JupyterHub serving relevant notebooks
  • Move data analysis code into the notebook
  • Remote desktop in browser, connected to desktop of virtual machine
PaNOSC use case 1: reproducibility and re-usability of published results
• For a given publication based on facility data, users can
  • Find the data (through the EOSC web portal or URL/DOI in paper)
  • Access the data through web portal
  • Inspect the data analysis (notebook) that led to key figures / statements in the publication
  • Re-execute the data analysis (reproducibility)
  • Modify and extend the notebook (reusability)
• Users may include scientists, interested public, journal editors and reviewers, representatives from research councils, ...
PaNOSC use case 2: enable new data analysis on existing datasets
• Users can
  • Search and find datasets from experiments through web portal
  • Access the data through web portal
  • Choose from appropriate selection of data analysis tools (= Jupyter Notebook templates)
  • Execute the notebook
  • Modify and extend the notebook
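The search-then-choose-template workflow above could be sketched as follows. Everything in this example is hypothetical: the `DatasetRecord` fields, the placeholder DOIs, the technique names and the template file names are invented for illustration and do not describe a PaNOSC schema or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical metadata record; not an actual PaNOSC schema."""
    doi: str
    facility: str
    technique: str


# Hypothetical mapping from experiment technique to notebook templates.
NOTEBOOK_TEMPLATES = {
    "powder-diffraction": ["azimuthal-integration.ipynb"],
    "imaging": ["real-space-reconstruction.ipynb"],
}


def search(catalogue, facility=None, technique=None):
    """Return the records matching every criterion that was given."""
    return [
        record for record in catalogue
        if (facility is None or record.facility == facility)
        and (technique is None or record.technique == technique)
    ]


def templates_for(record):
    """Notebook templates suited to a dataset; empty list if none are known."""
    return NOTEBOOK_TEMPLATES.get(record.technique, [])


# Made-up catalogue entries with placeholder DOIs.
catalogue = [
    DatasetRecord("10.0000/example-1", "EuXFEL", "imaging"),
    DatasetRecord("10.0000/example-2", "ESRF", "powder-diffraction"),
    DatasetRecord("10.0000/example-3", "ILL", "powder-diffraction"),
]

for record in search(catalogue, technique="powder-diffraction"):
    print(record.doi, templates_for(record))
```

The mapping from technique to template only works if facilities agree on how experiments are classified, which is one of the challenges raised later in the talk.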
Challenges 1
• Different facilities
  • Currently 6 facilities involved
  • Generally use different ways to store metadata
  • Common way of classifying datasets (and experiment types)?
• Data scale
  • For some datasets, the data cannot be moved to the compute resource
Challenges 2
• Analysis in Notebook
  • Computational environment: software (containers?)
    • Need to provide the right computational environment for each analysis type
    • How can we maintain computational environments in the future (Binder-like?)
    • Extending up to the lifetime of publications and datasets
  • Which analysis is appropriate for a dataset: classification of datasets?
  • Making analysis capabilities available in the Jupyter Notebook
    • Command-line-driven and Python-based computation straightforward
    • GUI-based tools more difficult / impossible
  • What to do with resulting analysis?
  • Some analysis notebooks require significant HPC resources (execute jobs from notebook?)
  • Computational environment: hardware, GPUs?
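One small part of the software-environment question is recording which versions an analysis actually ran with. The sketch below does that with the Python standard library only. It is an illustration rather than a substitute for containers or a Binder-style specification, and the package names passed in are only examples.

```python
import json
import platform
import sys
from importlib import metadata


def describe_environment(packages):
    """Record interpreter, platform and the installed version of each package."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": sys.platform,
        "packages": versions,
    }


# Example package names; replace with whatever the analysis imports.
environment = describe_environment(["numpy", "h5py"])
print(json.dumps(environment, indent=2))
```

Storing such a record next to each notebook gives a starting point for rebuilding the environment years later, although pinned versions alone do not guarantee that the packages will still be installable.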
Challenges 3
• Concurrent development with EOSC hub
• Complicated access rights for datasets
  • Data policies with embargo period
• Publication use case:
  • Requires collaboration with scientists
  • Preparation of data analysis notebooks
• Social / cultural challenge
  • Helped by changing metrics and expectations from funding bodies and journals
  • Research facilities can lead by example
Summary
• Introduction to the PaNOSC project (Photon and Neutron Open Science Cloud), http://panosc-eu.github.io
• Focus on data analysis
• Use cases
  • Make publications reproducible
    • EOSC MyBinder instance as first step?
  • Allow convenient exploration of existing datasets
• Contributions / brainstorming / collaboration welcome
This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 823852.
Contact presenter: Hans.Fangohr@xfel.eu, @ProfCompMod
Contact PaNOSC project: Andy Götz, Andy.Gotz@esrf.fr