Reproducible Data Analysis Workflows

OPENBIS UGM 2019
Michal Okoniewski, Andrei Plamadă
ETH Zürich – Scientific IT Services

Outline
• Reproducibility and Scientific Computing
  • Best Practices
• Workflow Management Systems
  • Introduction
  • Snakemake and Hands-On
  • Snakemake with Genomics Example
• Reproducible Environment
  • Introduction
  • OpenBIS Integration
  • Containers and Conda Hands-On

Getting to know each other
• Which OS do you use: Windows 7, Windows 10, Linux, macOS, other?
• How often do you program: weekly, monthly, yearly?
• Do you use Python / R?
• What is your background: formal (Math + CS), physical, social, or life sciences; engineering; medicine?
• Have you had difficulty reproducing your own work?
• Have you heard of / used git?
• Have you heard of / used workflow management systems?
• Have you heard of / used containers?
• Have you heard of / used conda?
• Have you heard of / used MPI?

What is Reproducibility in Scientific Computing

What is Reproducibility in Scientific Computing
[Image label: Docker Hub]

Reproducibility PI Manifesto
"Reproducibility PI Manifesto", L. A. Barba (13 December 2012). 10.6084/m9.figshare.104539
• I will teach my graduate students about reproducibility:
  • lab notebook,
  • version control,
  • workflow,
  • publication-quality plots at group meeting.
• All our research code (and writing) is under version control.
• We will always carry out verification and validation (V&V reports are posted to figshare).
• For main results in a paper, we will share data, plotting script & figure under CC-BY.
• We will upload the preprint to arXiv at the time of submission of a paper.
• We will release code at the time of submission of a paper.
• We will add a "Reproducibility" declaration at the end of each paper.
• I will keep an up-to-date web presence.

Best Practices for Reproducibility in Scientific Computing
Kitzes, J., Turek, D., & Deniz, F. (Eds.). (2018). The Practice of Reproducible Research: Case Studies and Lessons from the Data-Intensive Sciences. Oakland, CA: University of California Press.
Lessons Learned – Kathryn Huff https://www.practicereproducibleresearch.org/core-chapters/5-lessons.html
• Very common:
  • Version control your code
  • Open your data
  • Automate everywhere possible
  • Document your process
  • Test everything
  • Use free and open tools
• Less common:
  • Avoid excessive dependencies
  • When dependencies can't be avoided, package their installation
  • Host code on collaborative platforms (e.g. GitHub)
  • Get a Digital Object Identifier for your data and code
  • Avoid spreadsheets; plain-text data is preferred
  • Explicitly set pseudorandom number generator seeds
  • Workflow and provenance frameworks may be too clunky for most scientists

Best Practices for Scientific Computing
Wilson G, Aruliah DA, Brown CT, Chue Hong NP, Davis M, Guy RT, et al. (2014) Best Practices for Scientific Computing. PLoS Biol 12(1): e1001745. https://doi.org/10.1371/journal.pbio.1001745
1. Write Programs for People, not Computers
  • Readability and style
2. Let the Computer Do the Work
  • Scripts -> automated workflows
  • Unique version for code, data, dependencies
3. Make Incremental Changes
  • Version control (git)
4. Don't Repeat Yourself (or Others)
  • Re-use the code
5. Plan for Mistakes
  • Testing and continuous integration
6. Optimize Software Only after It Works Correctly
  • Point 5 (testing) + profiling
7. Document Design and Purpose, Not Mechanism
  • Documentation
8. Collaborate
  • Issue tracking and code review (e.g. GitHub, GitLab)

So many things to learn! Where to start?

A Zoo of Data Workflow Systems
• An incomplete list of 254 Computational Data Analysis Workflow Systems
  • https://github.com/common-workflow-language/common-workflow-language/wiki/Existing-Workflow-systems
• A curated list of 90 Awesome Pipeline frameworks & libraries + 27 Workflow platforms
  • https://github.com/pditommaso/awesome-pipeline

Orchestration strategies: workflow managers
[Diagram: Snakemake, a Python workflow manager, running tool A and tool B over the files raw.txt, intermediate.txt and result.txt]

Snakemake
• Workflow management system
• Designed by Johannes Köster
  • Now PI at UniEssen
• Python 3-based
• cmake philosophy
• conda installation
• conda support
• http://snakemake.readthedocs.io/

Installation
• Install miniconda
  • Download and run the installer (e.g. Miniconda3-latest-Linux-x86_64.sh)
• Install snakemake with conda (full package or the minimal variant)
  • conda install -c bioconda -c conda-forge snakemake
  • conda install -c bioconda -c conda-forge snakemake-minimal
• Test
  • snakemake --version

Parsing the workflow
• "rule all" defines the final product
• Snakemake searches for the files needed to build this final product
• Then, recursively, it searches for what needs to be done to produce the "substrates"
• After successful parsing (in syntax and content):
  • The workflow is started from the "substrates" of the lowest level
  • It proceeds as a DAG (directed acyclic graph) towards the final product
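The backward search described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Snakemake's actual implementation: the RULES table, the file names and the plan function are hypothetical, and the sketch ignores wildcards and file timestamps.

```python
from typing import Dict, List, Set

# Hypothetical rule table: each output file maps to the inputs it needs.
RULES: Dict[str, List[str]] = {
    "result.txt": ["intermediate.txt"],
    "intermediate.txt": ["raw.txt"],
}


def plan(target: str, rules: Dict[str, List[str]], existing: Set[str]) -> List[str]:
    """Return the outputs to build, lowest-level "substrates" first."""
    order: List[str] = []
    visiting: Set[str] = set()

    def visit(path: str) -> None:
        if path in order:
            return  # already scheduled
        if path not in rules:
            # No rule produces this file, so it must already exist on disk.
            if path not in existing:
                raise FileNotFoundError(f"no rule produces {path!r}")
            return
        if path in visiting:
            raise ValueError(f"cyclic dependency at {path!r}")
        visiting.add(path)
        for dependency in rules[path]:
            visit(dependency)  # recurse towards the raw inputs
        visiting.discard(path)
        order.append(path)  # scheduled only after all of its inputs

    visit(target)
    return order


jobs = plan("result.txt", RULES, existing={"raw.txt"})
print(jobs)  # ['intermediate.txt', 'result.txt']
```

The returned list is a topological order of the DAG: every file appears after the files it depends on, which is the order in which the jobs can run.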

Snakefile – rule all

Snakefile – wildcards: generating contents and use

Snakefile – rules

Snakefile – rules with python

Running snakemake on the cluster
• LSF:
  snakemake -p -j 999 --cluster-config cluster.json --cluster "bsub -W {cluster.time} -n {cluster.n}"
• SLURM:
  snakemake -j 999 --cluster-config cluster.json --cluster "sbatch -A {cluster.account} -p {cluster.partition} -n {cluster.n} -t {cluster.time}"
• Kubernetes:
  snakemake --kubernetes --use-conda --default-remote-provider $REMOTE --default-remote-prefix $PREFIX

Cluster settings: cluster.json
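As a rough illustration of how per-rule settings could be filled into a --cluster template such as "bsub -W {cluster.time} -n {cluster.n}", here is a minimal Python sketch. It assumes a cluster.json-like dictionary with a "__default__" entry plus per-rule overrides; the rule name sort_bam, the values, and the submit_command helper are hypothetical, and this is not Snakemake's own code.

```python
from types import SimpleNamespace

# Hypothetical content of a cluster.json file, already loaded as a dict.
cluster_config = {
    "__default__": {"time": "4:00", "n": 1},
    "sort_bam": {"time": "24:00", "n": 4},  # hypothetical rule name
}


def submit_command(template: str, rule_name: str, config: dict) -> str:
    """Merge the defaults with rule-specific settings and fill the template."""
    settings = dict(config.get("__default__", {}))
    settings.update(config.get(rule_name, {}))
    # str.format resolves "{cluster.time}" as attribute access on `cluster`.
    return template.format(cluster=SimpleNamespace(**settings))


template = "bsub -W {cluster.time} -n {cluster.n}"
print(submit_command(template, "sort_bam", cluster_config))   # bsub -W 24:00 -n 4
print(submit_command(template, "index_bam", cluster_config))  # bsub -W 4:00 -n 1
```

A rule without its own entry falls back to the defaults, which keeps the configuration file short when most jobs have similar requirements.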

Running snakemake on the cluster

Demo on the computing cluster
• Genomic example
• 6 BAM (genome alignment) files as input
• Operations: sorting, indexing, counting reads in genes, producing the count table
• cluster.json specific to LSF on the Euler cluster

Visualizing what snakemake actually did
• Directed acyclic graph (DAG) of jobs
• Can be generated with:
  snakemake --dag > graph.dag
  dot -Tpdf graph.dag > aaa.pdf
• Visualizes dependencies of rules
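The file written by snakemake --dag is plain text in Graphviz DOT format, which dot then renders. A minimal Python sketch of what such a description looks like is given below; the rule names are hypothetical (loosely modelled on the demo operations) and to_dot is an illustrative helper, not part of Snakemake.

```python
from typing import Iterable, Tuple


def to_dot(edges: Iterable[Tuple[str, str]], name: str = "workflow") -> str:
    """Render (upstream, downstream) pairs as a Graphviz DOT digraph."""
    lines = [f"digraph {name} {{"]
    for upstream, downstream in edges:
        lines.append(f'    "{upstream}" -> "{downstream}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical dependencies between rules.
EDGES = [("sort", "index"), ("index", "count"), ("count", "table")]
dot_text = to_dot(EDGES)
print(dot_text)
```

Saving this text to a file and passing it to dot -Tpdf would produce a small graph in the same way as the commands above.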

Examples of rules graph

Other examples of rules graph

Snakemake happily finished

Advantages and difficulties of snakemake
• Reproducibility
• Control over workflow
• Re-running
• Encapsulation of typical tasks
• "One-click" starting of a large process
• You need to "speak python"
• Learning curve steep at the beginning

Other reproducibility mechanisms that can be used by snakemake
• Common Workflow Language
• Remote files
• Integrated package management with Conda
• Running jobs in containers
• Wrappers

Combining openBIS and snakemake (under development)
[Diagram labels: HPC Cluster, remote function, dropbox]

Practical advice
• Test your workflow with a "dry run": snakemake -np
• Real run test with a small number of input files, e.g. 3
• On the cluster:
  • run snakemake in a screen session on a login node
  • run snakemake on personal scratch or other permanent storage
  • use local rules whenever possible
  • I/O rules: define as single-core jobs in cluster.json
  • check time, memory and core settings for each job
• Consider deleting intermediate files after use
• Sometimes deleting .snakemake may be needed for a re-run

Hands-on exercise on a single machine
• https://github.com/michalogit/snakemaketax

Reproducible Environment
• Main idea: bundle your application and all dependencies
  • Virtual Machine (VM): VirtualBox, VMware
  • Container (lightweight VM): Docker, Singularity
  • Isolated environment:
    • Python: Virtual Environment, Conda
    • R: Conda
• As a side effect: no more version conflicts (dependency hell)

Environment
[Diagram labels: Container Based, Shared Host OS kernel, Bare Metal, VM Based]

VMs vs Container

Docker workflow
[Diagram labels: Code, Data, Environment, Image, Container Registry, Container; arrows: push, pull, run]

Nice, but Docker requires root access. What about HPC systems?

Singularity as the container solution for HPC
• Containers improve portability and can address the reproducibility issue in research (EnhanceR Survey - Science IT Consultants)
• Singularity:
  • Developed initially at LBL (Berkeley Lab) for the HPC use case (multi-tenancy, single file)
  • Open source with standard BSD 3-clause license https://github.com/sylabs/singularity
  • Under active development, with 12 contributors with more than 100 commits
  • Available also with commercial support: Singularity Pro
  • Used worldwide and recommended by vendors, e.g. NVIDIA, Azure Batch
  • Big worldwide community (Google Groups, Slack)
  • Swiss community: EnhanceR
  • 2 major versions: Singularity 2 and Singularity 3

Singularity as the container solution for HPC
• Containers improve portability and can address the reproducibility issue in research (EnhanceR Survey - Science IT Consultants)
• Main idea

User Experience for Containers – Docker + Singularity v2.6
• Multi-node: MPICH ABI Compatibility initiative

Why bother with containers? I only use Python / R

Isolated Environment for R and Python - Conda
• Conda: https://docs.conda.io/en/latest/
  • Open source
  • Runs on Windows, macOS, Linux
• Package management system: https://anaconda.org/search
  • Supported programming languages: Python, R, …
  • Repository: https://anaconda.org/
• Environment management system

Conda workflow
[Diagram labels: Code, Data, Environment, Package Repository; arrow: export]

What can go wrong?
• Containers:
  • The image is updated: same tag, different content, e.g. centos:latest
  • The image is deleted by the owner
  • The old container does not work with the new Docker/Singularity (not very likely)
  • The new container does not work with the old Docker/Singularity
• Conda:
  • The package metadata (dependency list) is updated (not very likely)
  • The package is deleted by the owner
  • Python: you mix pip and conda and do a conda update
  • Conda packages are not platform independent
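One possible safeguard against the "same tag, different content" problem is to record a checksum of the image when the analysis is first run and to compare it before every later run. The sketch below assumes the image is stored as a single file; the file name and the sha256_of and verify helpers are hypothetical, and the temporary file only stands in for a real image.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected: str) -> bool:
    """True if the file still has the checksum recorded earlier."""
    return sha256_of(path) == expected.lower()


with tempfile.TemporaryDirectory() as workdir:
    image = Path(workdir) / "analysis_image.sif"  # hypothetical file name
    image.write_bytes(b"stand-in for image content")
    recorded = sha256_of(image)           # store this next to the results
    unchanged = verify(image, recorded)   # later run: content is the same
    image.write_bytes(b"silently updated content")
    changed = not verify(image, recorded)  # later run: content differs

print(unchanged, changed)  # True True
```

Storing the recorded digest together with the code and data makes it possible to notice a silently changed image before results are compared.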

Things to consider
• Stay up to date:
  • you might need to update your code and dependencies (latest releases),
  • container technologies are developing rapidly, e.g. podman, Sarus
• Floating point numbers - IEEE 754:
  • approximation of real numbers (double: about 15 significant decimal digits), a trade-off between range and precision
  • the arithmetic differs from real-number arithmetic (e.g. addition is not associative)
  • transcendental functions are not standardized, only recommended: see Section 9.2
  • Round the floating point numbers to the desired precision
• HPC systems:
  • MPI_REDUCE does not guarantee the order of operations, it is only advised: see page 175, line 9
• Randomized algorithms:
  • Pseudo-random numbers
  • Explicitly set the seed (when doing statistics, use a different seed for each sample)
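The floating-point and seeding points above can be demonstrated with the Python standard library alone. This is a small sketch: the values, BASE_SEED and the helper names are chosen only for illustration. It uses an explicit loop instead of the built-in sum, because recent Python versions compensate for rounding inside sum.

```python
import math
import random


def naive_sum(values):
    """Add strictly left to right, as a simple reduction loop would."""
    total = 0.0
    for value in values:
        total += value
    return total


# Same four numbers, two different orders of operation.
a = naive_sum([1e16, 1.0, -1e16, 1.0])
b = naive_sum([1e16, -1e16, 1.0, 1.0])
print(a, b)  # 1.0 2.0

# math.fsum tracks the rounding error and is independent of the order.
exact = math.fsum([1e16, 1.0, -1e16, 1.0])
print(exact)  # 2.0

# Rounding to the desired precision before comparing results.
print(0.1 + 0.2 == 0.3)                        # False
print(round(0.1 + 0.2, 10) == round(0.3, 10))  # True

BASE_SEED = 2019  # hypothetical value; record it with the results


def draw(sample_index, n=3):
    """Draw n numbers using a seed derived from the sample index."""
    rng = random.Random(BASE_SEED + sample_index)
    return [rng.random() for _ in range(n)]


print(draw(0) == draw(0))  # True: same seed, same numbers
print(draw(0) == draw(1))  # False: each sample has its own seed
```

A parallel reduction that combines partial sums in a different order on each run can show the same kind of discrepancy as a and b above.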

OpenBIS Integration
• OpenBIS can be your single source of truth for:
  • Data
  • Code releases
  • Containers squashed in a single file
• OpenBIS – Snakemake integration:
  • Download: natively via SFTP
  • Upload: python script using pyBIS

Hands-on exercise on a single machine
• https://siscourses.ethz.ch/openbis_ugm_2019/Containers_and_Conda_Hands_On.html
