Bioinformatics analysis pipeline for viral metagenomics

Bioinformatics analysis pipeline for viral metagenomics. Davit Bzhalava, PhD, Dept. of Laboratory Medicine, Karolinska Institutet, Sweden. Human Microbiota. We are born 100% human and we die 90% microbial.

Presentation Transcript


  1. Bioinformatics analysis pipeline for viral metagenomics. Davit Bzhalava, PhD, Dept. of Laboratory Medicine, Karolinska Institutet, Sweden

  2. Human Microbiota. We are born 100% human and we die 90% microbial. The term human microbiome, or microbiota, defines the collection of microorganisms that reside in the human body. The viral fraction of the human microbiome is referred to as the human virome. Viruses constitute only a small part of the human microbiota, but their proportion and composition seem to change in diseased individuals.

  3. Tumor Viruses
     • An estimated 2 million (16%) of new cancer cases worldwide in 2008 were attributable to infections.
     • 1.3 million (65%) of these cancers were attributable to viral infections.
     • There are epidemiological indications that additional cancer-associated viruses may exist:
       • increased incidence of some cancer types among immunosuppressed individuals;
       • space and time clustering of childhood leukemias.

  4. Purpose of viral metagenomics: Who is there? What are they doing? How are they doing it?

  5. Needle in a haystack. Viruses usually constitute <0.1% of a whole metagenomic dataset. Small changes in the data analysis pipeline can drastically alter results.

  6. Workflow: Library Preparation, Sequencing, Data Analysis. Bioinformatics Pipeline:
     • Filter out human, bacterial, phage and vector sequences
     • Normalize k-mer frequencies
     • Genome assembly
     • Assembly validation & number of reads estimation
     • Taxonomic classification
     • Final characterization of virus related sequences
     • Case-control comparison of virus related & “unknown” sequences / OR estimation

  8. de novo assembly
     • NGS technologies produce billions of short reads from random locations in the genome by oversampling it.
     • Assembly algorithms, in the process called de novo assembly, reconstruct the original genomes present in the sample by merging short genomic fragments into longer contiguous sequences (“contigs”).
     • There are two main types of de novo assembly programs:
       • Overlap/Layout/Consensus (OLC) assemblers
       • de Bruijn graph assemblers

  9. OLC assembly
     • Overlap: find potentially overlapping reads
     • Layout: merge reads into contigs and contigs into supercontigs
     • Consensus: derive the DNA sequence and correct read errors
     Example: ..ACGATTACAATAGGTT..
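The overlap and layout steps can be illustrated with a toy greedy merger. This is only a sketch under stated assumptions: the three reads are hypothetical fragments cut from the slide's example sequence, the function names are made up for illustration, and the consensus step (error correction across stacked reads) is not modelled because the toy reads are error-free.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Layout step: repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        # Overlap step: compare every ordered pair of sequences.
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap_length(a, b, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Hypothetical error-free reads taken from the slide's example sequence.
reads = ["ACGATTAC", "TTACAATA", "AATAGGTT"]
contigs = greedy_assemble(reads)
print(contigs)
```

With these three reads the merger returns a single contig, `ACGATTACAATAGGTT`. The all-against-all comparison is quadratic in the number of reads, so this form is only meant to show the idea.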

  10. de Bruijn graph assembly
     • de Bruijn graph assemblers model the relationship between exact substrings of length k extracted from the input reads.
     • In a de Bruijn graph, the reads themselves are not directly modelled; they are implicitly represented as paths through the graph.
     • Most de Bruijn graph assemblers use the read information to refine the graph structure and to remove graph patterns that are not consistent with the reads.
     • The de Bruijn graph approach is based on exact matches, so error correction (used both before and during assembly) is crucial for achieving high-quality assemblies.
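As a rough illustration of the points above, the sketch below builds a small de Bruijn graph in which nodes are (k-1)-mers and each k-mer contributes one edge, then checks that a read corresponds to a path through the graph. The reads and function names are hypothetical, and real assemblers add error correction and graph simplification that are omitted here.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Return a dict mapping each (k-1)-mer to the set of (k-1)-mers that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # one edge per exact k-mer
    return graph


def is_path(graph, read, k):
    """True if every consecutive k-mer of `read` is an edge in the graph."""
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if kmer[1:] not in graph.get(kmer[:-1], set()):
            return False
    return True


# Hypothetical error-free reads; the graph stores k-mers, not the reads themselves.
reads = ["ACGATTAC", "TTACAATA", "AATAGGTT"]
graph = build_de_bruijn(reads, k=4)
edge_count = sum(len(targets) for targets in graph.values())
print(edge_count, is_path(graph, "TTACAATA", 4))
```

Because edges are exact k-mers, a single sequencing error in a read would create k-mers that match nothing else in the graph, which is why the slide stresses error correction.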

  11. Challenges in assembly
     • Suppose we have 2 sequences:
       the_quick_brown_fox_jumps
       jumps_over_the_lazy_dog
     • They are decomposed into k-mers, here with k = 5.
     • Put both sentences into the same graph and follow the links in the graph:
       the_q -> he_qu -> e_qui -> _quic -> quick -> uick_ -> ick_b -> ck_br
       to spell out the 'assembled' sentence:
       the_quick_brown_fox_jumps_over_the_lazy_dog
     • If k = 6: there is no 6-mer in common between the sentence fragments.
     • If k = 4: the graph becomes complicated, because the word the_ appears twice.
     Example taken from: http://ivory.idyll.org/blog/the-k-parameter.html
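The effect of k in this example can be reproduced with a few lines of code. The sketch below uses the k-mers themselves as graph nodes, as in the slide, and follows links only while the next step is unambiguous; the helper names are hypothetical.

```python
def kmers(text, k):
    """All substrings of length k, in order."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def build_graph(sentences, k):
    """Link every k-mer to the k-mer that follows it in any sentence."""
    graph = {}
    for sentence in sentences:
        parts = kmers(sentence, k)
        for current, following in zip(parts, parts[1:]):
            graph.setdefault(current, set()).add(following)
    return graph


def spell(graph, start):
    """Follow links from `start` while there is exactly one way forward."""
    text, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in seen:
            break  # avoid looping forever on a cycle
        seen.add(node)
        text += node[-1]
    return text


sentences = ["the_quick_brown_fox_jumps", "jumps_over_the_lazy_dog"]
result_k5 = spell(build_graph(sentences, 5), "the_q")
result_k6 = spell(build_graph(sentences, 6), "the_qu")
result_k4 = spell(build_graph(sentences, 4), "the_")
print(result_k5, result_k6, result_k4, sep="\n")
```

With k = 5 the walk spells the whole sentence; with k = 6 it stops at the end of the first fragment because no 6-mer is shared; with k = 4 it stops immediately at `the_`, which has two possible successors (`he_q` and `he_l`).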

  12. Challenges in assembly
     • The solution is to try as many assemblers, with as many parameter settings, as possible.
     • Resources, including time, are limited.
     • Assemblies are RAM thirsty:
       • NextSeq, 300 million reads: ≈250 GB RAM
       • k-mer based assemblers scale poorly

  14. K-mer normalization. Number of reads before normalization: 1’642’160’122 paired reads

  15. Number of reads after normalization: 282’961’022 paired reads (17% of initial reads)
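The slides do not say which tool or algorithm performed the normalization. One common approach is digital normalization, which drops a read when the median abundance of its k-mers, counted over the reads kept so far, has already reached a cutoff. The sketch below is a minimal in-memory version of that idea, assuming toy values for `k` and `cutoff`; it is not necessarily the method used in this pipeline.

```python
from collections import Counter
from statistics import median


def normalize_reads(reads, k=5, cutoff=2):
    """Keep a read only while the median count of its k-mers is below `cutoff`."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not read_kmers:
            continue  # read shorter than k
        if median(counts[kmer] for kmer in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads contribute to the counts
    return kept


# Hypothetical data: one heavily oversampled read and one rare read.
reads = ["ACGATTACAATAGGTT"] * 10 + ["GGCCTTAAGGCCAATT"]
kept = normalize_reads(reads)
print(len(reads), "->", len(kept))
```

In this toy run the ten copies of the abundant read are reduced to two while the rare read is kept, which mirrors the aim of flattening coverage without losing low-abundance sequences.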

  16. Human genome coverage before normalization

  17. Human genome coverage after normalization

  18. Number of reads after human genome (HG) clean-up: 6’745’443 paired reads (about 2.4% of the normalized data and 0.4% of the initial reads)
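The proportions follow directly from the three read counts reported on the slides. The short check below recomputes them; the variable names are illustrative only.

```python
# Paired-read counts as reported on the slides.
initial_reads = 1_642_160_122      # before k-mer normalization
normalized_reads = 282_961_022     # after k-mer normalization
cleaned_reads = 6_745_443          # after human genome clean-up

kept_by_normalization = normalized_reads / initial_reads
cleaned_vs_normalized = cleaned_reads / normalized_reads
cleaned_vs_initial = cleaned_reads / initial_reads

print(f"after normalization: {kept_by_normalization:.1%} of initial reads")
print(f"after HG clean-up:   {cleaned_vs_normalized:.1%} of normalized reads")
print(f"after HG clean-up:   {cleaned_vs_initial:.2%} of initial reads")
```

Expressed as percentages, these ratios are roughly 17.2%, 2.4% and 0.41%; as plain fractions the last two are about 0.024 and 0.004.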

  20. Taxonomic classification. NCBI BLAST is one of the best-known similarity-based tools for taxonomic classification; it compares sequences to known genomes.

  21. Challenges in taxonomic classification. Genome sequencing has led to massive data generation, requiring a significant increase in the execution speed of these algorithms. New and ever-expanding databases need to be searched. (Source: http://www.ncbi.nlm.nih.gov/genbank/statistics, accessed on Nov 08, 2015)

  22. Challenges in taxonomic classification
     • NCBI BLAST-based search tools
       • are extremely time consuming
       • may take days or even weeks to complete when large metagenomic datasets need to be compared against nucleotide or protein databases
     • Paracel BLAST, a commercial software package,
       • achieved the same results on the same file and the same machine 10 times faster
     • Scalable open source NCBI BLAST solutions are needed
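One simple way to scale a BLAST search, offered here only as an illustration and not as the approach the author used or proposes, is to split the query FASTA into chunks so that each chunk can be searched by a separate BLAST process. The sketch below handles only the splitting; the function names and the round-robin strategy are assumptions.

```python
def fasta_records(text):
    """Yield each FASTA record (header line plus sequence lines) as one string."""
    record = []
    for line in text.splitlines(keepends=True):
        if line.startswith(">") and record:
            yield "".join(record)
            record = []
        record.append(line)
    if record:
        yield "".join(record)


def split_fasta(text, n_chunks):
    """Distribute records round-robin over at most `n_chunks` chunks."""
    chunks = [[] for _ in range(n_chunks)]
    for index, record in enumerate(fasta_records(text)):
        chunks[index % n_chunks].append(record)
    return ["".join(chunk) for chunk in chunks if chunk]


# Hypothetical contigs standing in for a large metagenomic query file.
fasta = "".join(f">contig{i}\nACGT{'A' * i}\n" for i in range(5))
chunks = split_fasta(fasta, n_chunks=2)
print([chunk.count(">") for chunk in chunks])
```

Each chunk could then be written to its own file and searched independently, with the per-chunk results concatenated afterwards. Round-robin assignment keeps chunk sizes similar when the record lengths are comparable.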

  23. Thank you!
