Microbial Genome Assembly
Pamela Ferretti
Laboratory of Computational Metagenomics, Centre for Integrative Biology, University of Trento, Italy
Outline-summary
1. QUICK INTRODUCTION
2. GENOME ASSEMBLY
3. ASSEMBLY STRATEGIES
4. CASE STUDY
Next Generation Sequencing
[Figure: a genome sequence and the short reads obtained from it]
Genome: ACGTAGGCTAGCGTTAGCGA ........ CTGCAT C
Reads: TCTTATTGTGACC, TAGGCTAGCTTAG, GCAATGCAGTAAC, TCCAGCTAGGTTC
Genome Assembly
[Figure label: OVERLAPPING SEQUENCE ALIGNMENT]
GENOME SEQUENCING → PRELIMINARY ANALYSIS → ASSEMBLY → ADVANCED BIOINFORMATIC ANALYSIS
On the feasibility of sequence assembly
• Sequencing the human genome with shotgun sequencing + assembly is the only feasible strategy
  Weber, James L., and Eugene W. Myers. "Human whole-genome shotgun sequencing." Genome Research 7.5 (1997): 401-409.
• Computational assembly of shotgun sequencing data is simply unfeasible, and a bad idea anyway
  Green, Philip. "Against a whole-genome shotgun." Genome Research 7.5 (1997): 410-417.
They were both right! (...well, Weber and Myers were a bit more right from the practical viewpoint...)
Genome assembly strategies
• Greedy approach → SSAKE
• De Bruijn graph (DBG) → Velvet, SOAPdenovo
• Overlap Layout Consensus (OLC) → MIRA
• Mixed approaches → MaSuRCA
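Genome assembly strategies
• DE BRUIJN GRAPH APPROACH (DBG)
• Nodes = overlapping subsequences of the reads, all of the same length ((k-1)-mers)
• Edges = k-mers (the distinct length-k subsequences found within the reads); each k-mer joins its (k-1)-mer prefix to its (k-1)-mer suffix
• Assembly = finding an EULERIAN PATH (a path that uses every edge exactly once)
• Velvet, SOAPdenovo2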
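The de Bruijn graph idea can be sketched in a few lines of Python. This is a toy illustration under strong assumptions (error-free reads, a single strand, one unambiguous Eulerian path), not how Velvet or SOAPdenovo2 are implemented; the function names `de_bruijn`, `eulerian_path` and `spell` are made up for this example.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Edges are the distinct k-mers; nodes are their (k-1)-mer prefix and suffix."""
    kmers = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_path(graph):
    """Hierholzer's algorithm: walk every edge exactly once."""
    in_deg = defaultdict(int)
    for succs in graph.values():
        for s in succs:
            in_deg[s] += 1
    nodes = set(graph) | set(in_deg)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (n for n in sorted(nodes) if len(graph.get(n, [])) - in_deg[n] == 1),
        min(nodes),
    )
    remaining = {n: list(s) for n, s in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

def spell(path):
    """Rebuild the sequence: first node plus the last base of each later node."""
    return path[0] + "".join(n[-1] for n in path[1:])

reads = ["ACGTAGG", "TAGGCTA", "GCTAGC"]
print(spell(eulerian_path(de_bruijn(reads, k=4))))
```

Note that the repeated 3-mer TAG becomes a single node visited twice, which is how the graph represents repeats without comparing every read against every other read.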
Genome assembly strategies
• OVERLAP LAYOUT CONSENSUS (OLC)
• Nodes = reads
• Edges = overlaps between reads
• Three steps: OVERLAP → LAYOUT → CONSENSUS
• Assembly = finding a HAMILTONIAN PATH (a path that visits every node exactly once)
• MIRA
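The three OLC steps can be sketched as follows. This is a minimal illustration assuming a handful of error-free reads: the layout step brute-forces a Hamiltonian path (only feasible for tiny inputs), and the "consensus" simply merges overlapping reads instead of aligning them and voting per column. It does not reflect MIRA's implementation, and all function names are hypothetical.

```python
from itertools import permutations

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def overlap_graph(reads, min_len=3):
    """OVERLAP: nodes are reads, edges are suffix-prefix overlaps."""
    graph = {}
    for a, b in permutations(reads, 2):
        n = overlap(a, b, min_len)
        if n > 0:
            graph[(a, b)] = n
    return graph

def layout(reads, graph):
    """LAYOUT: brute-force search for an order that visits every read once."""
    for order in permutations(reads):
        if all((a, b) in graph for a, b in zip(order, order[1:])):
            return list(order)
    return None

def consensus(order, graph):
    """CONSENSUS (simplified): merge consecutive reads along their overlaps."""
    seq = order[0]
    for a, b in zip(order, order[1:]):
        seq += b[graph[(a, b)]:]
    return seq

reads = ["TAGGCT", "GCTAGC", "ACGTAG"]
graph = overlap_graph(reads)
order = layout(reads, graph)
print(order, consensus(order, graph))
```

The brute-force layout makes the cost of the Hamiltonian formulation visible: the number of candidate orderings grows factorially with the number of reads, which is why real OLC assemblers rely on heuristics.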
Genome Assemblers
Evaluation metrics:
• Average coverage
• Number of contigs
• Number of contigs > 1 Kb
• N50 contig size
• Fraction of reads assembled
• Total consensus (in nt)
• Number of scaffolds
• N50 scaffold size
Assembler chosen per sequencing platform:
• Ion Torrent PGM → MIRA 3.9
• Illumina → MaSuRCA
MIRA 3.9 also produced good-quality results, but it has a longer execution time and becomes unstable with large numbers of short reads.
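Several of the metrics above are simple to compute from the contig sequences. A minimal sketch, assuming contigs are available as plain Python strings (the helper names `n50` and `assembly_stats` are illustrative, not part of any assembler):

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

def assembly_stats(contigs, min_len=1000):
    """Summarise an assembly given as a list of contig sequences."""
    lengths = [len(c) for c in contigs]
    return {
        "num_contigs": len(lengths),
        "num_contigs_gt_1kb": sum(length > min_len for length in lengths),
        "total_consensus_nt": sum(lengths),
        "n50": n50(lengths),
    }

# Contig lengths 80, 70, 50, 40, 30, 20 sum to 290; 80 + 70 = 150 passes the
# halfway mark of 145, so N50 is 70.
print(n50([80, 70, 50, 40, 30, 20]))
```

The same `n50` function applies to scaffold lengths to obtain the N50 scaffold size.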
Mycobacteria Assembly: Case Study
• Responsible for many animal and human diseases
• M. tuberculosis and M. leprae (TM)
• M. fortuitum (NTM) outbreak (nail salon, 2002)
• M. chelonae (NTM) outbreak (face lifts, 2004)
• Illumina HiSeq sequencing (NGS Facility – CIBIO/UNITN)
• Twenty mycobacterial strains
• From 20 different Mycobacteria species
• Assembly → MaSuRCA
Novel clinical tests for mycobacteria detection
Raw data quality assessment and pre-processing
• fastq-mcf tool, used to remove:
• poor-quality ends of reads
• Ns, duplicates and sequencing adapters
• reads that are too short
• Reduction of up to 73%
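The filtering steps listed above can be illustrated with a simplified stand-in. This sketch is not fastq-mcf and does not reproduce its algorithm or options: it assumes reads are given as (sequence, Phred score list) pairs, clips adapters only on an exact match, and uses made-up thresholds (`min_q`, `min_len`).

```python
def trim_ends(seq, quals, min_q=20):
    """Trim bases with quality below min_q from both ends of a read."""
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]

def clip_adapter(seq, quals, adapter):
    """Cut the read at the first exact occurrence of the adapter, if present."""
    pos = seq.find(adapter)
    if pos == -1:
        return seq, quals
    return seq[:pos], quals[:pos]

def preprocess(reads, adapter, min_q=20, min_len=5):
    """Clip adapters, trim ends, then drop short, N-containing and duplicate reads."""
    seen, kept = set(), []
    for seq, quals in reads:
        seq, quals = clip_adapter(seq, quals, adapter)
        seq, quals = trim_ends(seq, quals, min_q)
        if len(seq) < min_len or "N" in seq or seq in seen:
            continue
        seen.add(seq)
        kept.append((seq, quals))
    return kept

raw = [
    ("ACGTACGTAGATCG", [30] * 14),                    # adapter at the 3' end
    ("ACGTACGT", [30] * 8),                           # duplicate after clipping
    ("ACGNNCGTA", [30] * 9),                          # contains Ns
    ("TTGCA", [5, 5, 30, 30, 5]),                     # too short after trimming
    ("GGCATTAC", [10, 30, 30, 30, 30, 30, 30, 8]),    # low-quality ends
]
clean = preprocess(raw, adapter="AGATCG")
print([seq for seq, _ in clean], f"reduction: {1 - len(clean) / len(raw):.0%}")
```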
Assembly parameters setting
• K-mers: strings of a particular length k, shorter than the entire reads
• Best empirical k-mer length: 91 bases
• High coverage
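To make the definition concrete, here is a small sketch of how k-mers are extracted from reads and counted. The helper names are illustrative, and the toy value k = 4 stands in for the much larger k used in the case study.

```python
from collections import Counter

def kmers(read, k):
    """All substrings of length k, in order; empty if the read is shorter than k."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def kmer_counts(reads, k):
    """How often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts

print(kmers("ACGTAG", 4))
print(kmer_counts(["ACGTA", "CGTAC"], 4).most_common(1))
```

A read of length n yields n - k + 1 k-mers, so a larger k gives fewer k-mers per read; this is one reason why a long k such as 91 goes together with high coverage.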
MaSuRCA results of Mycobacteria
• Genome size too large
• Abnormal GC content
GC content-based quality analysis
Examples of environmental contamination:
• Staphylococcus epidermidis
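A GC-based check of this kind can be sketched as below. Mycobacterial genomes are GC-rich (roughly 65% for M. tuberculosis), whereas Staphylococcus epidermidis is AT-rich (roughly a third GC), so contigs from such a contaminant stand out. The `expected_gc` and `tolerance` defaults are illustrative assumptions, not the thresholds used in the case study.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt

def flag_contigs(contigs, expected_gc=0.65, tolerance=0.10):
    """Return contigs whose GC content is far from the expected value."""
    flagged = {}
    for name, seq in contigs.items():
        gc = gc_content(seq)
        if abs(gc - expected_gc) > tolerance:
            flagged[name] = round(gc, 3)
    return flagged

contigs = {
    "contig_1": "GC" * 6 + "G" + "AT" * 3 + "A",  # 13 of 20 bases are G or C
    "contig_2": "ATTATAGCAT",                      # 2 of 10 bases are G or C
}
print(flag_contigs(contigs))
```

Flagged contigs would then be candidates for a closer look (for example, a similarity search against known genomes) rather than being discarded automatically.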
Thanks
http://gcat.davidson.edu/phast/#methods