ALLPATHS-LG. a new standard for assembling a billion-piece genome puzzle. CS 681. presented by Ömer Köksal. High-quality draft assemblies of mammalian genomes from massively parallel sequence data ALLPATHS-LG by Sante Gnerre et al. (20 Authors) Jan 25th, 2011.
ALLPATHS-LG a new standard for assembling a billion-piece genome puzzle
CS 681 presented by Ömer Köksal High-quality draft assemblies of mammalian genomes from massively parallel sequence data ALLPATHS-LG by Sante Gnerre et al. (20 Authors) Jan 25th, 2011
Agenda Introduction Results Model for Input Data Sequencing Data ALLPATHS-LG Assembly Method Uncertainty in Assemblies Human and Mouse Assemblies Human Genome Mouse Genome Segmental Duplications Understanding Gaps Discussion
Introduction • High-quality assembly of a genome sequence is critical • Assembly is particularly challenging for large, repeat-rich genomes such as those of mammals • Using traditional capillary-based sequencing (reads >700 bases), such assemblies were produced for multiple mammals at a cost of tens of millions of dollars each • New massively parallel technologies were expected to lower the cost dramatically, but had not yet done so for large genomes because their data are • short (reads ~100 bases in length) • less accurate • difficult to assemble
Introduction (cont’d) ALLPATHS-LG • de novo assembly of large (and small) genomes • it should be possible to generate high-quality draft assemblies of large genomes • at ~1000-fold lower cost than a decade ago • Previous versions: • ALLPATHS 1.0 (2008) • ALLPATHS 2.0 (2009)
Results • Model for Input Data • Sequencing Data • ALLPATHS-LG Assembly Method • Uncertainty in Assemblies • Human and Mouse Assemblies • Human Genome • Mouse Genome • Segmental Duplications • Understanding Gaps
Results - Model for Input Data • De novo genome assembly depends on • computational methods • nature and quantity of sequence data used • The fairly standard model used for capillary-based sequencing was modified • The model sets a target of 100-fold sequence coverage to compensate for shorter reads & nonuniform coverage • Despite the higher coverage, the proposed model is dramatically cheaper since the per-base cost is ~10000-fold lower than for capillary sequencing • Illumina sequencing was used (Table 1)
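The 100-fold coverage target can be turned into a rough read count with the usual relation coverage = (number of reads × read length) / genome size. The sketch below is only an illustration of that arithmetic: the 3 Gb genome size and the uniform 100-base read length are assumptions made here, not values from Table 1, and the function names are hypothetical.

```python
import math


def expected_coverage(num_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average fold coverage: total sequenced bases divided by genome size."""
    return num_reads * read_length_bp / genome_size_bp


def reads_needed(genome_size_bp: int, coverage: float, read_length_bp: int) -> int:
    """Smallest number of reads whose total length reaches the target coverage."""
    return math.ceil(coverage * genome_size_bp / read_length_bp)


# Assumed values for illustration only (not taken from the slides' Table 1).
GENOME_SIZE_BP = 3_000_000_000  # roughly a mammalian-sized genome
READ_LENGTH_BP = 100            # short reads of ~100 bases
TARGET_COVERAGE = 100           # the 100-fold target from the slide

if __name__ == "__main__":
    n = reads_needed(GENOME_SIZE_BP, TARGET_COVERAGE, READ_LENGTH_BP)
    print(f"reads needed: {n:,}")
    print(f"coverage check: {expected_coverage(n, READ_LENGTH_BP, GENOME_SIZE_BP):.1f}x")
```

Under these assumptions the target works out to about three billion reads, which is why the per-base cost matters more than the coverage depth.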
Results - Model for Input Data (cont’d) Table 1 – Provisional sequencing model for de novo assembly
Results – Sequencing Data • Sequence data were generated according to the model above for: • Human Genome • Mouse Genome • Human Genome: • GM12878 (Coriell Institute) of 1000 Genomes Pilot Project • NCBI Short Read: Human_NA_12878_Genome_on_illumina • Mouse Genome: • C57BL/6J female DNA • NCBI Short Read: Mouse_B6_Genome_on_illumina
Results - ALLPATHS-LG Assembly Method • previous versions were improved extensively • can also assemble small genomes • freely available at: http://www.broadinstitute.org/science/programs/genome-biology/crd
Results - ALLPATHS-LG Assembly Method (cont’d) • Some key innovations in ALLPATHS-LG • Handling repetitive sequences • more resilient to repeats • Error Correction • for every 24-mer, the algorithm examines the stack of all reads containing that 24-mer • the incidence of incorrect error correction was reduced • Use of jumping data • it can work even with such data: it trims bases beyond junction points and treats each read pair as belonging to one of two possible distributions • Efficient memory usage • can assemble human genomes on commercial servers (48 processors & 512 GB RAM) in a few weeks • (3 weeks for mouse & 3.5 weeks for human)
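To make the "stack of reads per k-mer" idea concrete, here is a toy sketch of stack-based error correction. It is not the ALLPATHS-LG implementation: the majority-vote rule, the `min_depth` and `min_fraction` thresholds, and all function names are assumptions chosen for illustration, and the sketch ignores quality scores and reverse complements.

```python
from collections import Counter, defaultdict


def build_stacks(reads, k):
    """Map each k-mer to the (read index, offset) pairs where it occurs."""
    stacks = defaultdict(list)
    for i, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            stacks[read[pos:pos + k]].append((i, pos))
    return stacks


def correct_reads(reads, k=24, min_depth=3, min_fraction=0.8):
    """Toy correction: within each k-mer stack, align reads on the shared
    k-mer and replace a base when a clear majority of the column disagrees."""
    corrected = [list(r) for r in reads]
    for members in build_stacks(reads, k).values():
        if len(members) < min_depth:
            continue  # too few reads to vote reliably
        # Count bases per column, where a column is a position relative
        # to the start of the shared k-mer.
        columns = defaultdict(Counter)
        for i, pos in members:
            for j, base in enumerate(reads[i]):
                columns[j - pos][base] += 1
        for i, pos in members:
            for j, base in enumerate(reads[i]):
                column = columns[j - pos]
                best, count = column.most_common(1)[0]
                depth = sum(column.values())
                if depth >= min_depth and base != best and count / depth >= min_fraction:
                    corrected[i][j] = best
    return ["".join(r) for r in corrected]


if __name__ == "__main__":
    # Five error-free copies and one read with a wrong last base (k=4 for brevity).
    reads = ["ACGTTGCAAGCT"] * 5 + ["ACGTTGCAAGCA"]
    print(correct_reads(reads, k=4))
```

The votes are always taken from the original reads, so a correction made in one stack does not influence the consensus of another stack in this sketch.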
Results – Uncertainty in Assemblies • The goal of assembly is to reconstruct the genome as accurately as possible • However, in some locations the data may be compatible with more than one solution • Rather than making an arbitrary choice (& introducing errors), the algorithm retains the alternatives • The ALLPATHS-LG algorithm generates an assembly graph whose edges are sequences and whose branches represent alternate choices • ATC{A,T}GGTTTTTTT{T,TT}ACAC • Variant Call Format (.VCF file)
Results – Uncertainty in Assemblies (cont’d) NOTE: • The current version of ALLPATHS-LG only captures single-base and simple sequence indel uncertainties • Better ways to capture alternatives are needed (many alternatives are still lost in the current version, giving rise to errors) • It would be desirable to assign probabilities to each alternative
Results – Human & Mouse Assemblies • Resulting genome assemblies provide good coverage of the human and mouse genomes • ALLPATHS-LG assemblies were compared with previously published assemblies • Capillary-based sequencing • SOAP (massively parallel sequencing)
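The brace notation on this slide can be read as a compact encoding of several candidate sequences. The following sketch expands such a string into every sequence it represents. The parser is an illustrative assumption about how the notation might be read (comma-separated alternatives inside braces, no nesting), not code from ALLPATHS-LG, and `expand_alternatives` is a hypothetical name.

```python
import re
from itertools import product

# A token is either a brace group "{A,T}" or a run of plain bases.
TOKEN = re.compile(r"\{([^{}]*)\}|([^{}]+)")


def expand_alternatives(ambiguous: str) -> list:
    """Return every concrete sequence encoded by the brace notation.

    Assumes well-formed, non-nested braces; stray braces are ignored.
    """
    parts = []
    for match in TOKEN.finditer(ambiguous):
        if match.group(1) is not None:
            # Brace group: one choice per comma-separated alternative.
            parts.append([alt.strip() for alt in match.group(1).split(",")])
        else:
            # Unambiguous stretch: a single choice.
            parts.append([match.group(2)])
    return ["".join(choice) for choice in product(*parts)]


if __name__ == "__main__":
    for seq in expand_alternatives("ATC{A,T}GGTTTTTTT{T,TT}ACAC"):
        print(seq)
```

For the slide's example, the two ambiguous sites with two alternatives each yield four candidate sequences. The count grows multiplicatively with the number of sites, which is one reason assigning probabilities to alternatives would be useful.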
Results – Human Genome • N50 contig length of 24 kb • N50 scaffold length of 11.5 Mb • Contiguity is > 4-fold greater than with the SOAP algorithm • Connectivity is > 25-fold greater than with the SOAP algorithm • Assembled sequence contains 91.1% of the reference genome (SOAP: 74.3%) • Assembled sequence contains 95.1% of the reference genome (SOAP: 81.2%) • Results are similar to capillary-based assemblies
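N50 is the length L such that contigs (or scaffolds) of length at least L together contain at least half of all assembled bases. A minimal sketch of the standard calculation follows. The example lengths are made up for illustration and are not data from the assemblies discussed here.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Walk the lengths from longest to shortest and return the length at
    which the running total first reaches half of the overall total.
    """
    if not lengths:
        raise ValueError("n50() requires at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Hypothetical contig lengths in kb, for illustration only.
    example = [2, 3, 4, 5, 6, 7, 8, 9, 10]
    print(n50(example))
```

Because N50 is weighted by length, a few long scaffolds can raise it sharply, so it is usually read together with coverage and accuracy figures rather than alone.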
Results – Human Genome (cont’d) • Local assembly error: 3.5 % • Capillary: 4.1 % • SOAP: 6.2 % • Long range accuracy: 99.1% • Capillary: 99.7 % • SOAP: 99.5 %
Results – Mouse Genome • Results are broadly similar for the mouse genome • N50 contig length of 16 kb • N50 scaffold length of 7.2 Mb • Connectivity is > 20-fold greater than with the SOAP algorithm • Approaches capillary results (contig: 25 kb, scaffold: 16.9 Mb) • Assembled sequence contains 88.7% of the reference genome (Capillary: 94.2%) • Assembled sequence contains 96.7% of the reference genome (Capillary: 97.3%) • Results are considerably better than SOAP
Results – Mouse Genome (cont’d) • Local assembly error: 3.0 % • Capillary: 2.7 % • SOAP: 14.2 % • Long range accuracy: 99.0 % • Capillary: 99.1 % • SOAP: 98.8 %
Results – Segmental Duplications • Segmental duplications pose a challenge • ALLPATHS-LG assemblies (human and mouse) cover only ~40% of segmental duplications • Capillary: 60% • SOAP: 12% NOTE: Clearly additional work is needed here
Results – Understanding Gaps • Roughly three quarters of the gaps are captured • The remaining gaps are not spanned • The majority of the sequence within the gaps consists of repetitive elements, 61.9% of gaps: • For mouse: LINE elements are the major contributors to gaps • For human: LINE & SINE elements
Discussion • High-quality vertebrate genomes provided an essential foundation for comparative analysis of the human genome • Each cost tens of millions of dollars to generate with capillary-based sequencing • In this work, ALLPATHS-LG was presented, lowering the cost ~1000-fold.
Discussion (cont’d) • ALLPATHS-LG • Good long-range connectivity, • Good accuracy, • Good coverage • relative to capillary-based sequencing and • better than SOAP • ALLPATHS-LG • Quality of the assemblies is considerably better: • scaffolds are > 25 times longer
Discussion (cont’d) • ALLPATHS-LG is anticipated to yield even better results in improved versions. • ALLPATHS-LG introduced a preliminary syntax for expressing alternatives: TTTT{T, TT} • Computational hardware requirements: • SOAP is faster (takes 3 days) but its accuracy is low • ALLPATHS-LG is slower but produces high-quality assemblies • ALLPATHS-LG is anticipated to be sped up with algorithmic improvements (keeping in mind the trade-off between speed and accuracy)
Thank you. Questions ?