Data analysis infrastructure

Miguel Vazquez, Victor de la Torre
National Cancer Research Centre, Spanish National Bioinformatics Institute

Use case examples
• Data analysis
• Determine the mutational impact on e.g. protein structure or catalytic activity (basically, going beyond conservation or hydrophobicity scores)
• Retrieve networks and pathways of the identified genes and proteins
• Intersect patients based on affected networks and pathways, not based on variants
• Compare/analyse different types of 'omics data on a single individual (e.g. exome data with proteomic and transcriptomic data)
• Analyse aggregate/whole-cohort data on a set of patients with a particular diagnosis or particular phenotype

Similar software

Use cases are divided into 2 main areas:
• Answer research questions
• Data management

Platform architecture principles
• Structured filesystem as the centralized version of the truth
• Automatic workflow management based on the RBBT framework
• Pre-computed results are stored daily in a fast-access database to enable real-time queries
• User-friendly frontend for accessing the data and exploring the results

Rbbt: Filesystem approach
Rbbt brokers access to resources in the file system. Rbbt is designed to produce completely reproducible and reusable code. This implies downloading resources and building databases transparently and on demand. For efficiency, these resources are placed in predictable locations in the system and shared across all functionalities. Locking mechanisms and exception handling ensure stability and robustness. An intelligent updating system allows for efficiency.
[Diagram: several processes reach the FILE SYSTEM (resources, caches, job results, etc.) through RBBT, which finds files, produces them as needed, and brokers access.]
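The "find it, or produce it on demand, under a lock" idea can be sketched in plain Ruby. This is only an illustration of the pattern, not the Rbbt API: the helper name `claim`, the `.lock` and `.tmp` suffixes, and the example path are all assumptions made for this sketch.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper (not the Rbbt API): return the path to a resource,
# building it from the given block first if it does not exist yet.
def claim(path)
  return path if File.exist?(path)
  FileUtils.mkdir_p(File.dirname(path))
  File.open("#{path}.lock", File::RDWR | File::CREAT) do |lock|
    lock.flock(File::LOCK_EX)        # wait for any concurrent producer
    unless File.exist?(path)         # re-check once we hold the lock
      tmp = "#{path}.tmp.#{Process.pid}"
      begin
        File.write(tmp, yield)       # produce into a temporary file
        File.rename(tmp, path)       # publish in a single rename
      ensure
        FileUtils.rm_f(tmp)          # remove leftovers if the build failed
      end
    end
  end
  path
end

builds = 0
Dir.mktmpdir do |root|
  # Hypothetical resource location, chosen only for the example.
  target = File.join(root, "databases", "identifiers.tsv")
  2.times { claim(target) { builds += 1; "gene\tprotein\n" } }
  puts "builds: #{builds}"
end
```

The second call finds the file already in place and does not run the block again. A failed build leaves no file at the final path, so the next caller simply tries again.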

Rbbt: Workflows
Our workflows are collections of tasks, which can have dependencies between them. Each task produces a result file and may use the results from previous steps. The results depend only on the global set of parameters. Because the parameters are encoded in the filename, we can perform intelligent updates and we have full provenance. In the example workflow, if the quality threshold is changed, all downstream tasks produce their result files under alternative names but reuse everything upstream.
[Diagram: example workflow. Labels: BED, VCF, Genomic Mutations, Quality threshold (default 200), Protein Mutations, Affected Prot. Features, Prioritized variants, Suggested therapies.]
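A minimal sketch of how encoding parameters in the filename yields selective recomputation. It does not reproduce Rbbt's actual naming scheme: the task table, the `parsed_vcf` task, the `jobs/` directory, and the MD5 suffix are assumptions chosen for illustration.

```ruby
require "digest"

# Hypothetical task table (not the Rbbt API): each task lists its
# dependencies and the parameters it reads directly.
TASKS = {
  parsed_vcf:        { deps: [],                   params: [] },
  genomic_mutations: { deps: [:parsed_vcf],        params: [:quality] },
  protein_mutations: { deps: [:genomic_mutations], params: [] },
}

# Parameters that can influence a task: its own plus those of its dependencies.
def relevant_params(task)
  spec = TASKS.fetch(task)
  (spec[:params] + spec[:deps].flat_map { |dep| relevant_params(dep) }).uniq.sort
end

# Result file name: the job name plus a digest of the relevant parameter values.
def result_path(task, jobname, params)
  used = relevant_params(task).map { |key| [key, params.fetch(key)] }
  suffix = used.empty? ? "" : "_" + Digest::MD5.hexdigest(used.inspect)[0, 8]
  File.join("jobs", task.to_s, "#{jobname}#{suffix}")
end

default = { quality: 200 }
strict  = { quality: 500 }
TASKS.each_key do |task|
  same = result_path(task, "patient1", default) == result_path(task, "patient1", strict)
  puts "#{task}: #{same ? 'reused' : 'recomputed'}"
end
```

Under these assumptions, changing `quality` leaves the `parsed_vcf` path untouched, so an existing file is reused. `genomic_mutations` and `protein_mutations` get new names and are therefore computed again. The old results stay on disk under their original names, which is what gives provenance.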

Rbbt: Writing workflows
Workflows are defined in real code with a very simple syntax. The example at the right is all it takes to build a workflow that wraps the ANNOVAR package: it converts mutations from the Rbbt format (chr:position:mutant_allele) to the ANNOVAR format. Because workflows are real code, they are more comfortable for a programmer to write and, since code is more expressive than XML files, they provide more flexibility. Note that the software must already be in place. Software without access restrictions would be downloaded and installed automatically by the workflow.
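The slide's own workflow code is not part of this transcript. As a rough illustration of the format conversion it describes, the sketch below parses the chr:position:mutant_allele form and emits a tab-separated line in the chromosome, start, end, reference, observed layout commonly used for ANNOVAR input. The names `Mutation`, `parse_mutation`, and `annovar_line` are hypothetical. It assumes single-nucleotide variants only, and it assumes the caller supplies the reference allele, since the Rbbt format does not carry it.

```ruby
# Hypothetical converter, not the code from the slide.
Mutation = Struct.new(:chromosome, :position, :allele)

# Parse "chr:position:mutant_allele" into a Mutation.
def parse_mutation(str)
  chr, pos, allele = str.split(":")
  if allele.nil? || allele.empty?
    raise ArgumentError, "expected chr:position:mutant_allele, got #{str.inspect}"
  end
  Mutation.new(chr, Integer(pos), allele)
end

# Build one input line for a single-nucleotide variant (start == end).
# The reference allele is an assumption: the caller has to provide it.
def annovar_line(mutation, reference)
  [mutation.chromosome, mutation.position, mutation.position,
   reference, mutation.allele].join("\t")
end

# Made-up coordinates, for illustration only.
mutation = parse_mutation("1:1000:T")
puts annovar_line(mutation, "A")
```

A real wrapper would also need to look up reference alleles and handle insertions and deletions, which this sketch leaves out.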

Rbbt: Using workflows
All Rbbt workflows can be enacted in different ways:
• Command line: $ rbbt workflow task Annovar analysis -g all_COSMIC_mutations
• Programmatically: Annovar.job(:analysis, "Jobname", :genomic_mutations => all_COSMIC_mutations).run
• REST/HTML interface: $ rbbt workflow server Annovar -p <port>
• Remotely: $ rbbt workflow remote add http://<server>:<port>/Annovar
If a workflow is configured as remote, any other method of enactment will transparently relay the task to the remote server. Workflow jobs can run in three ways: by executing the code directly; through the workflow interface, which persists the results; or persisted and asynchronously, with job monitoring features.

Rbbt: Available workflows (sample)
• Sequence: Mutation consequence prediction
• Translation: Translate between gene and protein identifier formats
• Enrichment, MutationEnrichment: Hypergeometric, rank-based, and mutation-density-based enrichment
• Structure: Uses PDBs to find neighboring residues with functional annotations, PPI interface contacts, etc.
• GEO: Automatically download and analyze GEO datasets
• KinMut: Predicts protein mutation severity for protein kinases
