Data analysis infrastructure

Miguel Vazquez, Victor de la Torre. National Cancer Research Centre, Spanish National Bioinformatics Institute.

Presentation Transcript


  1. Data analysis infrastructure. Miguel Vazquez, Victor de la Torre. National Cancer Research Centre, Spanish National Bioinformatics Institute

  2. Use case examples: Data analysis
     • Determine the mutational impact on, e.g., protein structure or catalytic activity (basically, going beyond conservation or hydrophobicity scores)
     • Retrieve networks and pathways of the identified genes and proteins
     • Intersect patients based on affected networks and pathways, not based on variants
     • Compare and analyse different types of 'omics data on a single individual (e.g. exome data with proteomic and transcriptomic data)
     • Analyse aggregate, whole-cohort data on a set of patients with a particular diagnosis or phenotype

  3. Similar software

  4. Use cases are divided into two main areas: answering research questions and data management

  5. Platform architecture principles
     • A structured filesystem as the centralized version of the truth
     • Automatic workflow management based on the Rbbt framework
     • Pre-computed results are stored daily in a fast-access database to enable real-time queries
     • A user-friendly frontend for accessing the data and exploring the results

  6. Rbbt: Filesystem approach
     Rbbt brokers access to resources in the file system. Rbbt is designed to produce completely reproducible and reusable code. This implies downloading resources and building databases transparently and on demand. For efficiency, these resources are placed in predictable locations in the system and shared across all functionalities. Locking mechanisms and exception handling ensure stability and robustness.
     [Diagram: several processes reach the file system through Rbbt. Rbbt finds files, produces them as needed, and brokers access; an intelligent updating system allows for efficiency. The file system holds resources, caches, job results, etc.]
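The "find it, or produce it on demand, under a lock" idea from this slide can be sketched in plain Ruby. This is only an illustration under stated assumptions: the `claim` helper and the `share/organism/identifiers.tsv` path are hypothetical names chosen for the example, not the Rbbt API.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical helper (not the Rbbt API): returns the path of a resource,
# producing it first with the given block if it does not exist yet.
def claim(path)
  return path if File.exist?(path)            # fast path: already produced

  FileUtils.mkdir_p(File.dirname(path))
  File.open("#{path}.lock", File::RDWR | File::CREAT) do |lock|
    lock.flock(File::LOCK_EX)                 # only one producer at a time
    unless File.exist?(path)                  # re-check once the lock is held
      tmp = "#{path}.tmp.#{Process.pid}"
      File.write(tmp, yield)                  # build the content on demand
      File.rename(tmp, path)                  # publish in one step
    end
  end
  path
end

# Usage: the block runs only the first time the resource is requested.
root = Dir.mktmpdir
builds = 0
identifiers = File.join(root, "share", "organism", "identifiers.tsv")
2.times { claim(identifiers) { builds += 1; "gene\tprotein\n" } }
```

Writing to a temporary file and renaming it means a reader never sees a half-written resource, which is one simple way to get the robustness the slide mentions.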

  7. Rbbt: Workflows
     Our workflows are collections of tasks, which may have dependencies between them. Each task produces a result file and may use the results from previous steps. The results depend only on the global set of parameters. Because the parameters are encoded in the filename, we can perform intelligent updates and we have full provenance. In the example workflow, if the quality threshold is changed, all downstream tasks produce their result files under alternative names but reuse everything upstream.
     [Diagram labels: BED; VCF; Genomic Mutations; Quality threshold (default 200); Protein Mutations; Affected Prot. Features; Prioritized variants; Suggested therapies]
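A minimal sketch of how encoding parameters in the filename yields this behaviour, in plain Ruby. The task graph, the point at which the quality threshold is read, the `sample.vcf` default, and the digest-based naming are all assumptions made for illustration; the actual Rbbt naming scheme may differ.

```ruby
require 'digest'

# Hypothetical task graph loosely modelled on the slide. Each task lists its
# dependencies and the global parameters it reads directly. Which task reads
# the quality threshold is an assumption.
TASKS = {
  genomic_mutations:    { deps: [],                    params: [:vcf] },
  protein_mutations:    { deps: [:genomic_mutations],  params: [:quality] },
  affected_features:    { deps: [:protein_mutations],  params: [] },
  prioritized_variants: { deps: [:affected_features],  params: [] }
}.freeze

DEFAULTS = { vcf: "sample.vcf", quality: 200 }.freeze

# Every parameter a task depends on, directly or through its dependencies.
def relevant_params(task)
  spec = TASKS.fetch(task)
  (spec[:params] + spec[:deps].flat_map { |dep| relevant_params(dep) }).uniq.sort
end

# Result file name: the task name plus a digest of the relevant parameter values.
def result_name(task, params = {})
  values = DEFAULTS.merge(params).slice(*relevant_params(task))
  digest = Digest::MD5.hexdigest(values.sort.inspect)[0, 8]
  "#{task}_#{digest}"
end

# Changing the threshold renames downstream results but not upstream ones.
upstream_same   = result_name(:genomic_mutations) == result_name(:genomic_mutations, quality: 100)
downstream_same = result_name(:prioritized_variants) == result_name(:prioritized_variants, quality: 100)
```

Because a name depends only on the parameters that can influence the result, an existing file with the right name can be reused without recomputation, and the name itself records which settings produced it.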

  8. Rbbt: Writing workflows
     Workflows are defined in real code with a very simple syntax. The example at the right (not included in this transcript) is all it takes to build a workflow that wraps the ANNOVAR package: it converts mutations from the Rbbt format (chr:position:mutant_allele) to the ANNOVAR format. Because workflows are real code, they are more comfortable for a programmer to write and, since code is more expressive than XML files, they provide more flexibility. Note that the software must already be in place. Software without access restrictions would be downloaded and installed automatically by the workflow.
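The format conversion mentioned on this slide might look roughly like the following plain-Ruby sketch. It is not the workflow from the slide: `to_annovar_line` is a hypothetical helper, it handles single-nucleotide substitutions only, and it assumes ANNOVAR's text input of chromosome, start, end, reference allele, and observed allele. The Rbbt string carries no reference allele, so the caller supplies it here.

```ruby
# Hypothetical helper: converts "chr:position:mutant_allele" into a
# tab-separated line of chr, start, end, ref, alt for a single-nucleotide
# substitution. The reference allele is an explicit argument because the
# Rbbt string does not contain it.
def to_annovar_line(mutation, reference_allele)
  chr, position, alt = mutation.split(":")
  unless chr && position && alt
    raise ArgumentError, "expected chr:position:allele, got #{mutation.inspect}"
  end

  pos = Integer(position)                     # rejects non-numeric positions
  [chr, pos, pos, reference_allele, alt].join("\t")
end

# Toy coordinates, for illustration only.
puts to_annovar_line("1:1000:T", "A")
```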

  9. Rbbt: Using workflows
     All Rbbt workflows can be enacted in different ways:
     • Command line: $ rbbt workflow task Annovar analysis -g all_COSMIC_mutations
     • Programmatically: Annovar.job(:analysis, "Jobname", :genomic_mutations => all_COSMIC_mutations).run
     • REST/HTML interface: $ rbbt workflow server Annovar -p <port>
     • Remotely: $ rbbt workflow remote add http://<server>:<port>/Annovar
     If a workflow is configured as remote, any other method of enactment will transparently relay the task to the remote server. Workflow jobs can run in three ways: by executing the code directly; through the workflow interface, which persists the results; or persisted and asynchronously, with job monitoring features.

  10. Rbbt: Available workflows (sample)
     • Sequence: mutation consequence prediction
     • Translation: translates between gene and protein identifier formats
     • Enrichment, MutationEnrichment: hypergeometric, rank-based, and mutation-density-based enrichment
     • Structure: uses PDBs to find neighboring residues with functional annotations, PPI interface contacts, etc.
     • GEO: automatically downloads and analyzes GEO datasets
     • KinMut: predicts protein mutation severity for protein kinases
