Viral Metagenomics Analysis Pipeline: Investigating Human Virome Variations

Bioinformatics analysis pipeline for viral metagenomics
Davit Bzhalava, PhD
Dept. of Laboratory Medicine, Karolinska Institutet, Sweden

Human Microbiota
We are born 100% human and we die 90% microbial. The term human microbiome, or microbiota, refers to the collection of microorganisms that reside in the human body. The viral fraction of the human microbiome is referred to as the human virome. Viruses constitute only a small part of the human microbiota, but their proportion and composition seem to change in diseased individuals.

Tumor Viruses
• An estimated 2 million (16%) of new cancer cases worldwide in 2008 were attributable to infections.
• 1.3 million (65%) of these cancers were attributable to viral infections.
• There are epidemiological indications that additional cancer-associated viruses may exist:
  • Increased incidence of some cancer types among immunosuppressed individuals
  • Space and time clustering of childhood leukemias

Purpose of viral metagenomics
• Who is there?
• What are they doing?
• How are they doing it?

Needle in a haystack
• Viruses usually constitute <0.1% of a whole metagenomic dataset.
• Small changes in the data analysis pipeline can drastically alter the results.

Library Preparation -> Sequencing -> Data Analysis
Bioinformatics Pipeline
• Filter out human, bacterial, phage and vector sequences
• Normalize k-mer frequencies
• Genome assembly
• Assembly validation & estimation of the number of reads
• Taxonomic classification
• Final characterization of virus-related sequences
• Case-control comparison of virus-related & “unknown” sequences / odds ratio (OR) estimation
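The last pipeline step mentions OR estimation. As a minimal sketch of what that could look like, assuming a simple 2x2 table of sequence-positive and sequence-negative cases and controls, the odds ratio and a Wald 95% confidence interval can be computed as below. The function name, the 0.5 continuity correction and the example counts are illustrative assumptions, not taken from the slides.

```python
import math

def odds_ratio(case_pos, case_neg, ctrl_pos, ctrl_neg, z=1.96):
    """Odds ratio with a Wald confidence interval for a 2x2 table.

    Hypothetical helper: adds 0.5 to every cell (Haldane-Anscombe
    correction) when any cell is zero, to avoid division by zero.
    """
    cells = [case_pos, case_neg, ctrl_pos, ctrl_neg]
    if 0 in cells:
        cells = [c + 0.5 for c in cells]
    a, b, c, d = cells
    estimate = (a * d) / (b * c)
    # Standard error of log(OR)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - z * se)
    upper = math.exp(math.log(estimate) + z * se)
    return estimate, lower, upper

# Made-up counts: a sequence found in 20 of 100 cases and 10 of 100 controls.
estimate, lower, upper = odds_ratio(20, 80, 10, 90)
print(f"OR = {estimate:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

A real analysis would likely also need to account for multiple testing across many sequences, which this sketch does not attempt.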

de novo assembly
• NGS technologies produce billions of short reads from random locations in the genome by oversampling it.
• Assembly algorithms, in a process called de novo assembly, reconstruct the original genomes present in the sample by merging short genomic fragments into longer contiguous sequences (“contigs”).
• There are two main types of de novo assembly programs:
  • Overlap/Layout/Consensus (OLC) assemblers
  • de Bruijn graph assemblers

OLC assembly
• Overlap: find potentially overlapping reads
• Layout: merge reads into contigs and contigs into supercontigs
• Consensus: derive the DNA sequence and correct read errors
..ACGATTACAATAGGTT..
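The overlap and layout steps can be sketched with a toy greedy merger. This is an illustration only: the three reads are hypothetical fragments cut from the sequence shown on the slide, the function names and the `min_len` parameter are assumptions, and the consensus (error-correction) step is left out because the toy reads are error-free.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no remaining overlaps: leave separate contigs
        merged = reads[i] + reads[j][n:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)]
        reads.append(merged)
    return reads

# Hypothetical reads sampled from ACGATTACAATAGGTT
reads = ["ACGATTACA", "TTACAATAG", "AATAGGTT"]
print(greedy_assemble(reads))  # ['ACGATTACAATAGGTT']
```

The all-against-all comparison is quadratic in the number of reads, which hints at why the overlap step becomes expensive on large datasets.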

de Bruijn graph assembly
• de Bruijn graph assemblers model the relationship between exact substrings of length k extracted from the input reads.
• In a de Bruijn graph the reads themselves are not modelled directly; they are implicitly represented as paths through the graph.
• Most de Bruijn graph assemblers use the read information to refine the graph structure and to remove graph patterns that are not consistent with the reads.
• The de Bruijn graph approach is based on exact matches, so error correction (used both before and during assembly) is crucial for achieving high-quality assemblies.

Challenges in assembly
• Suppose we have 2 sequences:
  • the_quick_brown_fox_jumps
  • jumps_over_the_lazy_dog
• They are decomposed into k-mers; here k = 5.
• Put both sentences into the same graph and follow the links in the graph:
  • the_q -> he_qu -> e_qui -> _quic -> quick -> uick_ -> ick_b -> ck_br
• to spell out the 'assembled' sentence:
  • the_quick_brown_fox_jumps_over_the_lazy_dog
• If k = 6: there is no 6-mer that the two sentence fragments have in common.
• If k = 4: the graph becomes complicated, because the word the_ appears twice.
Example taken from: http://ivory.idyll.org/blog/the-k-parameter.html
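The sentence example can be reproduced with a few lines of Python. This is a simplified sketch, not a real assembler: nodes are k-mers, links connect k-mers that are adjacent in a read, and the walk simply stops at the first branch. The function names are made up for illustration.

```python
def kmers(text, k):
    """All substrings of length k, in order."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]

def build_graph(reads, k):
    """Map each k-mer to the set of k-mers that follow it in some read."""
    graph = {}
    for read in reads:
        parts = kmers(read, k)
        for kmer in parts:
            graph.setdefault(kmer, set())
        for left, right in zip(parts, parts[1:]):
            graph[left].add(right)
    return graph

def spell_contigs(graph):
    """Walk from every k-mer without a predecessor until a branch or dead end."""
    has_predecessor = {n for successors in graph.values() for n in successors}
    contigs = []
    for node in sorted(graph):
        if node in has_predecessor:
            continue
        contig, seen = node, {node}
        while len(graph[node]) == 1:
            (node,) = graph[node]
            if node in seen:
                break  # guard against cycles
            seen.add(node)
            contig += node[-1]
        contigs.append(contig)
    return contigs

READS = ["the_quick_brown_fox_jumps", "jumps_over_the_lazy_dog"]

# k = 5: the shared k-mer "jumps" joins both sentences into one path.
print(spell_contigs(build_graph(READS, 5)))
# ['the_quick_brown_fox_jumps_over_the_lazy_dog']

# k = 6: no shared 6-mer, so the sentences stay separate.
print(set(kmers(READS[0], 6)) & set(kmers(READS[1], 6)))  # set()

# k = 4: "the_" occurs in both sentences and now has two successors.
print(sorted(build_graph(READS, 4)["the_"]))  # ['he_l', 'he_q']
```

At k = 4 the node the_ is followed by both he_q and he_l, so the graph alone cannot tell which continuation is correct. That ambiguity is what "complicated" means in this example.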

Challenges in assembly
• One solution is to try as many assemblers, with as many parameter settings, as possible.
• Resources, including time, are limited.
• Assemblies are RAM-hungry:
  • NextSeq, 300 million reads ≈ 250 GB RAM
• k-mer based assemblers scale poorly.

K-mer normalization
Number of reads before normalization: 1’642’160’122 paired reads

Number of reads after normalization: 282’961’022 paired reads (17% of initial reads)
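The slides do not say which normalization method or tool produced these numbers. One widely used approach is digital normalization, which discards a read when the median abundance of its k-mers, counted over the reads kept so far, has already reached a cutoff. The sketch below assumes that approach; the function name, the tiny `k` and `cutoff` values and the toy reads are illustrative only.

```python
from collections import Counter
from statistics import median

def normalize_by_median(reads, k=4, cutoff=2):
    """Keep a read only if its median k-mer count is still below cutoff.

    Assumption: a digital-normalization style filter, not necessarily
    the method used in the presentation.
    """
    counts = Counter()
    kept = []
    for read in reads:
        parts = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not parts:
            continue  # read shorter than k
        if median(counts[p] for p in parts) < cutoff:
            kept.append(read)
            counts.update(parts)
    return kept

# Ten copies of an abundant read plus one rare read (toy data).
reads = ["ACGTTGCAAG"] * 10 + ["TTGGCCAATT"]
print(normalize_by_median(reads))
# ['ACGTTGCAAG', 'ACGTTGCAAG', 'TTGGCCAATT']
```

The abundant read is capped at two copies while the rare read survives, which is the intended effect: redundant high-coverage data shrinks and low-abundance sequences, such as viral ones, are retained.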

Human genome coverage before normalization

Human genome coverage after normalization

Number of reads after human genome (HG) clean-up: 6’745’443 paired reads (2.4% of normalized data and 0.4% of initial reads)

Taxonomic classification
• NCBI BLAST is one of the best-known similarity-based taxonomic classification tools.
• NCBI BLAST compares sequences to known genomes.

Challenges in taxonomic classification
• Genome sequencing has led to massive data generation, which requires a significant increase in the execution speed of these algorithms.
• New and ever-expanding databases must be searched.
Source: http://www.ncbi.nlm.nih.gov/genbank/statistics (accessed on Nov 08, 2015)

Challenges in taxonomic classification
• NCBI BLAST-based search tools:
  • are extremely time consuming
  • may take days or even weeks to complete when large metagenomic datasets need to be compared against nucleotide or protein databases
• Paracel BLAST, a commercial software product:
  • achieved the same results on the same file and the same machine 10 times faster
• Scalable open source NCBI BLAST solutions are needed.

Thank you!
