Robust Software Tools for Variant Identification and Functional Assessment (Boston College & University of Michigan). Gabor Marth, Goncalo Abecasis, PIs. Informatics challenges for genomic analysis. Tool building. Widening accessibility. Facilitating analysis. Intentions of the RFA.
Robust Software Tools for Variant Identification and Functional Assessment (Boston College & University of Michigan) • Gabor Marth, Goncalo Abecasis, PIs
Informatics challenges for genomic analysis • Tool building • Widening accessibility • Facilitating analysis
Our approach • Complete toolbox including variant interpretation • Full pipelines for start-to-finish analysis • Easily accessible and well documented methods • Cloud deployment (in addition to single machine/local compute cluster) • Open development model
Progress in first 6 months
• Starting with two sets of tools and pipelines, geared toward high quality local analysis, battle-tested in the 1000GP data and medical sequencing projects
• The two groups follow a “divide and conquer” strategy to put critical pieces in place for making our algorithms available for the wider genomics community
• Boston College
  • A universal tool/pipeline launcher application
  • Infrastructure for dissemination
  • Cloud access via Galaxy
• University of Michigan
  • Integration of variant annotation/impact assessment
  • Pipeline/workflow control infrastructure
  • Adaptation for Amazon Cloud Services
Tools constantly evolving (as they must to remain relevant)
• Our community toolbox to be updated with new tools as they become available
• Include latest versions
  ref: TATAGAGAGAGAGAGAGAGCGAGAGAGAGAGAGAGAGGGAGAGACGGAGTT
  alt: TATAGAGAGAGAGAGAGCGAGAGAGAGAGAGAGAGAGGGAGAGACGGAGTT
• New algorithms for complex variant detection (FreeBayes)
  ref: TATAGAGAGAGAGAGAGAGC--GAGAGAGAGAGAGAGAGGGAGAGACGGAGTT
  alt: TATAGAGAGAGAGAGAG--CGAGAGAGAGAGAGAGAGAGGGAGAGACGGAGTT
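A gapped alignment like the ref/alt pair above can be read column by column to list the insertion and deletion runs that make up a complex event. The following is a minimal Python sketch, assuming `-` marks a gap column. The helper name `gap_events` is hypothetical and is not part of FreeBayes, and the short sequences are a toy stand-in for the slide's longer haplotypes.

```python
def gap_events(ref_aln, alt_aln):
    """List (kind, ref_offset, bases) for each gap run in a pairwise alignment.

    ref_offset is the number of reference bases that precede the event.
    """
    if len(ref_aln) != len(alt_aln):
        raise ValueError("aligned sequences must have equal length")
    events = []
    ref_pos = 0  # reference bases consumed so far
    i, n = 0, len(ref_aln)
    while i < n:
        if ref_aln[i] == "-":
            # Gap in the reference: the alternate carries extra bases.
            j = i
            while j < n and ref_aln[j] == "-":
                j += 1
            events.append(("insertion", ref_pos, alt_aln[i:j]))
            i = j
        elif alt_aln[i] == "-":
            # Gap in the alternate: reference bases are missing from it.
            j = i
            while j < n and alt_aln[j] == "-":
                j += 1
            events.append(("deletion", ref_pos, ref_aln[i:j]))
            ref_pos += j - i
            i = j
        else:
            ref_pos += 1
            i += 1
    return events


# Toy alignment (assumed for illustration): a dinucleotide repeat that is
# shortened before the C and lengthened after it.
TOY_REF = "GAGAGAGC--GAGA"
TOY_ALT = "GAGAG--CGAGAGA"
print(gap_events(TOY_REF, TOY_ALT))
```

On the toy pair this reports one deletion and one insertion flanking the shared C, which is the pattern a caller that only considers single isolated indels would have to report as two separate variants.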
Include tools when ready for prime time • The BC mobile element insertion caller performs best in its class
EPACTS variant interpretation tools (Efficient and Parallelizable Association Container Toolbox) • Genetic analysis tool based on VCF • Fast and parallelizable access to large VCF files • Built-in widely used single variant and burden tests • R/C++ interface for extending to newer tests • Binary & quantitative phenotypes with covariates • Useful visualization tools of association results • Automated visualization
The UM pipeline [diagram; stages as labeled on the slide] • BAM (one per sample) -> samtools -> Genotype Likelihood -> glfMultiples -> Unfiltered VCF -> vcfCooker -> Hard-filtered VCF -> SVM -> Filtered VCF -> Beagle/Thunder (optional LD-aware step) -> Filtered/Phased VCF -> EPACTS
UMAKE workflow system
• Makefile-based approach
  • The Make utility is very good at representing dependencies
  • Picks up where it left off after a failure
• Flexible deployment
  • Local machine
  • Local cluster (Mosix)
  • Amazon Web Services Elastic Compute Cloud (EC2)
• Default options
• User configurable
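The "pick up where left off" behavior follows from how Make-style tools treat existing targets as finished work. The sketch below imitates that idea in Python under simplifying assumptions: a target counts as done if its file exists (real Make also compares timestamps), and all file names, `build`, `write_stub`, and `FailOnce` are hypothetical illustrations, not UMAKE code.

```python
import os
import tempfile


def build(target, rules, workdir, log):
    """Build `target` after its dependencies, skipping files that already exist."""
    path = os.path.join(workdir, target)
    if os.path.exists(path):
        return  # finished in an earlier run, nothing to redo
    deps, action = rules[target]
    for dep in deps:
        build(dep, rules, workdir, log)
    tmp = path + ".tmp"
    action(tmp)            # write to a temporary name first
    os.replace(tmp, path)  # publish only after the step succeeded
    log.append(target)


def write_stub(tmp_path):
    """Stand-in for a real pipeline step: writes a placeholder file."""
    with open(tmp_path, "w") as handle:
        handle.write("stub\n")


class FailOnce:
    """Step that fails on its first call, to simulate an interrupted run."""

    def __init__(self):
        self.failed = False

    def __call__(self, tmp_path):
        if not self.failed:
            self.failed = True
            raise RuntimeError("simulated failure")
        write_stub(tmp_path)


def demo():
    rules = {
        "sample.glf": (["sample.bam"], write_stub),
        "calls.vcf": (["sample.glf"], FailOnce()),
    }
    log = []
    with tempfile.TemporaryDirectory() as workdir:
        write_stub(os.path.join(workdir, "sample.bam"))  # pre-existing input
        try:
            build("calls.vcf", rules, workdir, log)
        except RuntimeError:
            pass  # first run dies during the calling step
        after_first_run = list(log)
        build("calls.vcf", rules, workdir, log)  # rerun resumes
    return after_first_run, log


first, final = demo()
print(first, final)
```

The second call does not rebuild `sample.glf`; only the step that failed is executed again. Writing to a temporary name and renaming keeps a half-written file from being mistaken for a finished target.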
Application of UMAKE to large-scale projects • Computational cost is ~1 week per 1000 samples on a 5-node mini-cluster
The Boston College tool hub http://gkno.me (genome)
Simplified installation & use
• Unified launcher application (gkno)
  • single tools (e.g. Mosaik)
  • tool “macros” (e.g. map)
  • pipelines (e.g. exome variant calling)
• Download and installation
  • All tools pulled in a single step from GitHub
  • All tools installed
  • All tools tested
Easily configurable pipeline system • Part of our new unified launcher system (gkno) • Pipeline types (e.g. mapping, variant calling) and instances (exome, whole-genome) • User-configurable: tools can be swapped in and out, parameters configured via config files
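The split between pipeline types and instances, with tools swapped via config files, could look roughly like the following Python sketch. The JSON layout, the task names, the `regions` parameter, and the file name `exome_targets.bed` are all assumptions made for illustration; this is not gkno's actual configuration format.

```python
import copy
import json

# Hypothetical pipeline *type*: an ordered list of tasks with default tools.
VARIANT_CALLING = {
    "name": "variant-calling",
    "stages": [
        {"task": "align", "tool": "mosaik", "params": {}},
        {"task": "call", "tool": "freebayes", "params": {}},
    ],
}

# Hypothetical *instance* config, as it might be read from a file on disk.
EXOME_CONFIG = json.loads("""
{
  "align": {"tool": "bwa"},
  "call": {"params": {"regions": "exome_targets.bed"}}
}
""")


def make_instance(pipeline_type, overrides):
    """Return a copy of the pipeline type with per-task overrides applied."""
    known = {stage["task"] for stage in pipeline_type["stages"]}
    unknown = set(overrides) - known
    if unknown:
        raise ValueError(f"unknown tasks in config: {sorted(unknown)}")
    instance = copy.deepcopy(pipeline_type)
    for stage in instance["stages"]:
        change = overrides.get(stage["task"], {})
        if "tool" in change:
            stage["tool"] = change["tool"]  # swap the tool for this task
        stage["params"].update(change.get("params", {}))
    return instance


exome = make_instance(VARIANT_CALLING, EXOME_CONFIG)
for stage in exome["stages"]:
    print(stage["task"], stage["tool"], stage["params"])
```

Keeping the type immutable and deriving instances from it means an exome and a whole-genome run can share one pipeline definition and differ only in a small override file.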
Support • Documentation • Tutorials / Blog • User forum • Bug reports
Software deployment • All software is ready to run locally on a single machine • UMAKE adds cluster support • Cloud deployment • Simple Michigan pipelines ported to Amazon • Porting of all project software is under way
Integration • Our workflows leverage 3rd-party tools for specific functionality • All our tools are open source and available on GitHub (many clones, community-contributed code) • Ensemble approach (multiple tools for critical tasks)
Ensemble approach • Multiple tools usually benefit analysis
Ensemble approach • Our pipelines will use multiple aligners (BWA, Mosaik) and variant callers (FreeBayes, glfMultiples), developed by BC/UM
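One simple way to combine call sets from several variant callers is to keep the variants that enough callers agree on. The sketch below uses plain majority-style voting as a stand-in; it is not the SVM or MLP aggregation the project plans, and `consensus_calls`, the variant tuples, and the `min_support` parameter are hypothetical.

```python
from collections import Counter


def consensus_calls(callsets, min_support=2):
    """Keep variants reported by at least `min_support` callers.

    callsets maps a caller name to an iterable of (chrom, pos, ref, alt).
    """
    support = Counter()
    for calls in callsets.values():
        support.update(set(calls))  # each caller votes once per variant
    return sorted(v for v, n in support.items() if n >= min_support)


# Toy call sets (made-up coordinates, for illustration only).
toy_callsets = {
    "freebayes": [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")],
    "glfMultiples": [("chr1", 100, "A", "G"), ("chr1", 300, "G", "A")],
}
print(consensus_calls(toy_callsets))
```

Lowering `min_support` to 1 gives the union of the call sets, which trades specificity for sensitivity; a trained classifier would replace this fixed threshold with a learned one.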
In progress
• Expanding pipelines to integrate all tools
  • Michigan tools -> gkno
  • BC tools -> Michigan cloud ready pipelines
• Large data set analysis on the cloud
• Integrate variant interpretation tools
• Integrate SV tools as they become more robust
• Integrate consensus analysis (SVM and MLP approaches to callset aggregation)
• Minimal, functional pipeline -> Galaxy
Team
• Boston College
  • Alistair Ward
  • Derek Barnett
  • Chase Miller
  • Wan-Ping Lee
  • Erik Garrison
  • Gabor Marth
• University of Michigan
  • Mary-Kate Trost
  • Tom Blackwell
  • Hyun-Min Kang
  • Youna Hu
  • Adrian Tan
  • Xiaowei Zhan
  • Dajiang Liu
  • Goncalo Abecasis