Reproducible Data Analysis Workflows

Explore reproducibility in scientific computing with best practices, workflow management, and hands-on examples. Learn about tools like Snakemake, OpenBIS, containers, and Conda.

  1. Reproducible Data Analysis Workflows OPENBIS UGM 2019 Michal Okoniewski, Andrei Plamadă ETH Zürich – Scientific IT Services Michal Okoniewski & Andrei Plamadă

  2. Outline • Reproducibility and Scientific Computing • Best Practices • Workflow Management Systems • Introduction • Snakemake and Hands-On • Snakemake with Genomics Example • Reproducible Environment • Introduction • OpenBIS Integration • Containers and Conda Hands-On

  3. Getting to know each other • Which OS do you use: Windows 7, Windows 10, Linux, macOS, other? • How often do you program: weekly, monthly, yearly? • Do you use Python / R? • What is your background: formal (Math+CS), physical, social, life sciences; engineering, medicine? • Have you had difficulties reproducing your own work? • Have you heard of / used git? • Have you heard of / used workflow management systems? • Have you heard of / used containers? • Have you heard of / used conda? • Have you heard of / used MPI?

  4. Outline • Reproducibility and Scientific Computing • Best Practices • Workflow Management Systems • Introduction • Snakemake and Hands-On • Snakemake with Genomics Example • Reproducible Environment • Introduction • OpenBIS Integration • Containers and Conda Hands-On

  5. What is Reproducibility in Scientific Computing

  6. What is Reproducibility in Scientific Computing

  7. What is Reproducibility in Scientific Computing [Figure; label shown: Docker Hub]

  8. Reproducibility PI Manifesto "Reproducibility PI Manifesto", L. A. Barba. (13 December 2012). 10.6084/m9.figshare.104539 • I will teach my graduate students about reproducibility: • lab notebook, • version control, • workflow, • publication-quality plots at group meeting. • All our research code (and writing) is under version control. • We will always carry out verification and validation (V&V reports are posted to figshare). • For main results in a paper, we will share data, plotting script & figure under CC-BY. • We will upload the preprint to arXiv at the time of submission of a paper. • We will release code at the time of submission of a paper. • We will add a "Reproducibility" declaration at the end of each paper. • I will keep an up-to-date web presence.

  9. Best Practices for Reproducibility in Scientific Computing Kitzes, J., Turek, D., & Deniz, F. (Eds.). (2018). The Practice of Reproducible Research: Case Studies and Lessons from the Data-Intensive Sciences. Oakland, CA: University of California Press. Lessons Learned – Kathryn Huff https://www.practicereproducibleresearch.org/core-chapters/5-lessons.html • Very common: • Version control your code • Open your data • Automate everywhere possible • Document your process • Test everything • Use free and open tools • Less common: • Avoid excessive dependencies • When dependencies can’t be avoided, package their installation • Host code on collaborative platforms (e.g. GitHub) • Get a Digital Object Identifier for your data and code • Avoid spreadsheets, plain text data is preferred • Explicitly set pseudorandom number generator seeds • Workflow and provenance frameworks may be too clunky for most scientists
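
The recommendation to set pseudorandom number generator seeds explicitly can be illustrated with Python's standard library. This is only a minimal sketch: the function name `simulate` and the seed values are made up for illustration and do not come from the slides.

```python
import random


def simulate(seed, n=5):
    """Draw n pseudorandom numbers from an explicitly seeded generator."""
    rng = random.Random(seed)  # private generator, independent of global state
    return [rng.random() for _ in range(n)]


first = simulate(seed=2019)
second = simulate(seed=2019)
other = simulate(seed=2020)

print(first == second)  # same seed, so the sequences match
print(first == other)   # different seed, so the sequences differ
```

Using a private `random.Random` instance rather than the module-level functions keeps the result independent of any other code that might draw from the global generator.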

  10. Best Practices for Scientific Computing Wilson G, Aruliah DA, Brown CT, Chue Hong NP, Davis M, Guy RT, et al. (2014) Best Practices for Scientific Computing. PLoS Biol 12(1): e1001745. https://doi.org/10.1371/journal.pbio.1001745 • Write Programs for People, not Computers • Readability and Style • Let the Computer Do the Work • Scripts -> Automated workflows • Unique version for code, data, dependencies • Make Incremental Changes • Version control (git) • Don’t Repeat Yourself (or others) • Re-use the code • Plan for Mistakes • Testing and Continuous Integration • Optimize Software Only after It Works Correctly • Testing (point 5) + Profiling • Document Design and Purpose, Not Mechanism • Documentation • Collaborate • Issue tracking and Code Review (e.g. GitHub, GitLab)

  11. So many things to learn! Where to start?

  12. Outline • Reproducibility and Scientific Computing • Best Practices • Workflow Management Systems • Introduction • Snakemake and Hands-On • Snakemake with Genomics Example • Reproducible Environment • Introduction • OpenBIS Integration • Containers and Conda Hands-On

  13. A Zoo of Data Workflow Systems • An incomplete list of 254 Computational Data Analysis Workflow Systems • https://github.com/common-workflow-language/common-workflow-language/wiki/Existing-Workflow-systems • A curated list of 90 Awesome Pipeline frameworks & libraries + 27 Workflow platforms • https://github.com/pditommaso/awesome-pipeline

  14. Orchestration strategies: workflow managers [Diagram: Snakemake, a Python workflow manager, orchestrating tool A and tool B; files shown: raw.txt, intermediate.txt, result.txt]

  15. Snakemake • Workflow management system • Designed by Johannes Köster • Now PI at UniEssen • Python 3 based • Make-like philosophy (rules with inputs and outputs) • conda installation • conda support • http://snakemake.readthedocs.io/

  16. Installation • Install miniconda • Download and run the installer (e.g. Miniconda3-latest-Linux-x86_64.sh) • Install snakemake with conda • conda install -c bioconda -c conda-forge snakemake • or, for a minimal installation: conda install -c bioconda -c conda-forge snakemake-minimal • Test • snakemake --version

  17. Parsing the workflow • rule all defines the final product • Snakemake parses the Snakefile and searches for the files needed to produce this final product • Then, recursively, it searches for what needs to be done to produce the “substrates” • After successful parsing (in syntax and content): • Workflow is started from the “substrates” of the lowest level • Proceeds as a DAG (directed acyclic graph) towards the final product
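
The backward search described on this slide can be sketched in plain Python. This is a simplified illustration under stated assumptions, not Snakemake's actual implementation: the `RULES` table, the file names, and the `plan` function are hypothetical, and the sketch ignores wildcards, timestamps, and cycle detection.

```python
# Hypothetical rules: each output file maps to the input files it needs.
RULES = {
    "counts.txt": ["a.sorted.bam", "b.sorted.bam"],
    "a.sorted.bam": ["a.bam"],
    "b.sorted.bam": ["b.bam"],
}


def plan(target, rules, existing, order=None):
    """Return the outputs to build, inputs before the outputs that need them."""
    if order is None:
        order = []
    if target in order:
        return order  # already scheduled
    if target not in rules:
        if target not in existing:
            raise FileNotFoundError(f"no rule produces {target}")
        return order  # a raw input file, nothing to run
    for dep in rules[target]:
        plan(dep, rules, existing, order)  # resolve the "substrates" first
    order.append(target)
    return order


jobs = plan("counts.txt", RULES, existing={"a.bam", "b.bam"})
print(jobs)  # lowest-level outputs come first, the final product last
```

Starting from the final target and recursing into its inputs yields an execution order in which every file is produced before anything that depends on it, which is the DAG traversal the slide refers to.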

  18. Snakefile – rule all

  19. Snakefile – wildcards: generating contents and use

  20. Snakefile – rules

  21. Snakefile – rules with Python

  22. Running snakemake on the cluster • LSF: snakemake -p -j 999 --cluster-config cluster.json --cluster "bsub -W {cluster.time} -n {cluster.n}" • SLURM: snakemake -j 999 --cluster-config cluster.json --cluster "sbatch -A {cluster.account} -p {cluster.partition} -n {cluster.n} -t {cluster.time}" • Kubernetes: snakemake --kubernetes --use-conda --default-remote-provider $REMOTE --default-remote-prefix $PREFIX

  23. Cluster settings: cluster.json
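
The content of cluster.json is not preserved in this transcript. As a rough illustration of how the `{cluster.time}` and `{cluster.n}` placeholders in the submit command relate to such a file, the sketch below merges a `__default__` entry with per-rule settings and fills the LSF template. The rule name `sort_bam`, all values, and the helper `submit_command` are assumptions for illustration; this mimics the behaviour and is not Snakemake's own code.

```python
import json
from types import SimpleNamespace

# Hypothetical cluster.json content; rule names and values are placeholders.
CLUSTER_JSON = """
{
    "__default__": {"time": "04:00", "n": 1},
    "sort_bam":    {"time": "24:00", "n": 4}
}
"""

SUBMIT_TEMPLATE = "bsub -W {cluster.time} -n {cluster.n}"


def submit_command(rule_name, config, template=SUBMIT_TEMPLATE):
    """Merge defaults with per-rule settings and fill the template."""
    settings = dict(config.get("__default__", {}))
    settings.update(config.get(rule_name, {}))  # per-rule values win
    return template.format(cluster=SimpleNamespace(**settings))


config = json.loads(CLUSTER_JSON)
print(submit_command("sort_bam", config))   # uses the rule-specific settings
print(submit_command("index_bam", config))  # falls back to __default__
```

Rules without their own entry fall back to the defaults, so only jobs with unusual time or core requirements need to be listed.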

  24. Running snakemake on the cluster

  25. Demo on the computing cluster • Genomic example • 6 BAM (genome alignment) files as input • Operations: sorting, indexing, counting of reads in genes, count table production • cluster.json specific to LSF on the Euler cluster

  26. Visualizing what snakemake actually did • Directed acyclic graph of jobs • Can be seen with: snakemake --dag > graph.dag, then dot -Tpdf graph.dag > aaa.pdf • Visualizes dependencies of rules
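
The file that `dot` reads is plain text in the Graphviz DOT language. The sketch below shows what such text looks like for a small job graph. The job names loosely follow the genomic demo but are hypothetical, and `to_dot` is an illustrative helper, not part of Snakemake.

```python
def to_dot(edges):
    """Render (upstream, downstream) job pairs as Graphviz DOT text."""
    lines = ["digraph jobs {"]
    for upstream, downstream in edges:
        lines.append(f'    "{upstream}" -> "{downstream}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical job names loosely following the genomic demo.
EDGES = [
    ("sort", "index"),
    ("index", "count"),
    ("count", "count_table"),
]

dot_text = to_dot(EDGES)
print(dot_text)
```

Saving this output to a file and passing it to `dot -Tpdf` would render the same kind of picture as the command on the slide.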

  27. Examples of rules graph

  28. Other examples of rules graph

  29. Snakemake happily finished

  30. Advantages and difficulties of snakemake • Reproducibility • Control over workflow • Re-running • Encapsulation of typical tasks • “One-click” starting of a large process • You need to “speak Python” • The learning curve is steep at the beginning

  31. Other reproducibility mechanisms that can be used by snakemake • Common Workflow Language • Remote files • Integrated package management with Conda • Running jobs in containers • Wrappers

  32. Combining openBIS and snakemake (under development) [Diagram labels: HPC Cluster, remote function, dropbox]

  33. Practical advice • Test your workflow with a “dry run”: snakemake -np • Real run test with a small number of input files, e.g. 3 • On the cluster • run snakemake in a screen session on a login node • run snakemake on personal scratch or other permanent storage • use local rules whenever possible • I/O rules: define as single-core jobs in cluster.json • check time, memory, cores settings for each job • Consider deleting intermediate files after use • Sometimes deleting .snakemake may be needed for a re-run

  34. Hands-on exercise on a single machine • https://github.com/michalogit/snakemaketax

  35. Outline • Reproducibility and Scientific Computing • Best Practices • Workflow Management Systems • Introduction • Snakemake and Hands-On • Snakemake with Genomics Example • Reproducible Environment • Introduction • OpenBIS Integration • Containers and Conda Hands-On

  36. Reproducible Environment • Main idea: bundle your application and all dependencies • Virtual Machine (VM): VirtualBox, VMware • Container - lightweight VM: Docker, Singularity • Isolated environment: • Python: Virtual Environment, Conda • R: Conda • As a side effect: No more version conflicts (Dependency hell)

  37. Environment [Diagram comparing three setups: Bare Metal, VM Based, and Container Based (shared host OS kernel)]

  38. VMs vs Container

  39. Docker workflow [Diagram: Code, Data, and Environment; Image, Container, and Container Registry; arrows labelled push, pull, run]

  40. Nice, but Docker requires root access. What about HPC systems?

  41. Singularity as the container solution for HPC • Containers improve portability and can address the reproducibility issue in research (EnhanceR Survey - Science IT Consultants) • Singularity: • Developed initially at LBL - Berkeley Lab - for the HPC use case (multi-tenancy, single file) • Open source with the standard BSD 3-Clause license https://github.com/sylabs/singularity • Under active development with 12 contributors with more than 100 commits • Also available with commercial support: Singularity Pro • Used worldwide and recommended by vendors, e.g. NVIDIA, Azure Batch • Big worldwide community (Google Groups, Slack) • Swiss community - EnhanceR • 2 major versions: Singularity 2 and Singularity 3

  42. Singularity as the container solution for HPC • Containers improve portability and can address the reproducibility issue in research (EnhanceR Survey - Science IT Consultants) • Main idea

  43. User Experience for Containers – Docker + Singularity v2.6 • Multi-node: MPICH ABI Compatibility initiative

  44. Why bother with containers? I only use Python / R

  45. Isolated Environment for R and Python - Conda • Conda: https://docs.conda.io/en/latest/ • Open source • Runs on Windows, macOS, Linux • Package management system https://anaconda.org/search • Supported programming languages: Python, R, … • Repository: https://anaconda.org/ • Environment management system

  46. Conda workflow [Diagram: Code, Data, and Environment; Package Repository; arrow labelled export]

  47. What can go wrong? • Containers: • The image is updated - same tag different content: e.g. centos:latest • The image is deleted by the owner • The old container does not work with the new Docker/Singularity (not very likely) • The new container does not work with old Docker/Singularity • Conda • The package metadata (dependency list) is updated (not very likely) • The package is deleted by the owner • Python: you mix pip and conda and do a conda update • Conda packages are not platform independent
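
One possible safeguard against the "same tag, different content" problem is to record a checksum of the image file alongside the results and compare it before re-running. The sketch below uses only the standard library; the temporary file stands in for a real image file, and `sha256_of` is an illustrative helper, not something taken from the slides.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Stand-in for an image file; a real one would be a container image on disk.
with tempfile.NamedTemporaryFile(delete=False, suffix=".img") as tmp:
    tmp.write(b"pretend this is a container image")
    image_path = tmp.name

recorded = sha256_of(image_path)        # store this next to the results
unchanged = sha256_of(image_path) == recorded

with open(image_path, "ab") as handle:  # simulate an upstream update
    handle.write(b" v2")
changed = sha256_of(image_path) != recorded

os.remove(image_path)
print(unchanged, changed)
```

Reading in chunks keeps memory use constant, which matters because image files can be several gigabytes in size.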

  48. Things to consider • Stay up to date: • you might need to update your code and dependencies (latest releases), • container technologies are rapidly developing, e.g. podman, Sarus • Floating point numbers (IEEE 754): • approximation of real numbers (double: about 15 decimal digits), a trade-off between range and precision • the arithmetic differs from real-number arithmetic (e.g. addition is not associative) • transcendental functions (e.g. ) are not standardized, only recommended (see Section 9.2) • Round the floating point numbers to the desired precision • HPC systems • MPI_REDUCE does not guarantee the order of operations, it is only advised (see page 175, line 9) • Randomized algorithms: • Pseudorandom numbers • Explicitly set the seed (when doing statistics, use a different seed for each sample)
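
The floating point points on this slide can be made concrete with a short standard-library sketch. It shows that the same four numbers summed in a different order give a different double-precision result, which is why a reduction with unspecified operation order may not be bit-for-bit reproducible, and that rounding to a chosen precision can make results comparable. The helper `naive_sum` and the chosen values are illustrative assumptions.

```python
import math


def naive_sum(numbers):
    """Add numbers strictly left to right in double precision."""
    total = 0.0
    for x in numbers:
        total += x
    return total


a = naive_sum([1e16, 1.0, -1e16, 1.0])      # the first 1.0 is lost to rounding
b = naive_sum([1e16, -1e16, 1.0, 1.0])      # same numbers, different order
exact = math.fsum([1e16, 1.0, -1e16, 1.0])  # correctly rounded sum

print(a, b, exact)

# Rounding to the desired precision before comparing results.
raw_equal = (0.1 + 0.2) == 0.3
rounded_equal = round(0.1 + 0.2, 10) == round(0.3, 10)
print(raw_equal, rounded_equal)
```

An explicit loop is used instead of the built-in `sum` because newer Python versions apply compensated summation to floats, which would hide the effect.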

  49. OpenBIS Integration • OpenBIS can be your single source of truth for: • Data • Code releases • Containers squashed in a single file • OpenBIS – Snakemake Integration: • Download: natively via SFTP • Upload: python script using pyBIS

  50. Hands-on exercise on a single machine • https://siscourses.ethz.ch/openbis_ugm_2019/Containers_and_Conda_Hands_On.html
