ALLPATHS-LG

ALLPATHS-LG: a new standard for assembling a billion-piece genome puzzle

CS 681 presented by Ömer Köksal High-quality draft assemblies of mammalian genomes from massively parallel sequence data ALLPATHS-LG by Sante Gnerre et al. (20 authors) Jan 25th, 2011

Agenda Introduction Results Model for Input Data Sequencing Data ALLPATHS-LG Assembly Method Uncertainty in Assemblies Human and Mouse Assemblies Human Genome Mouse Genome Segmental Duplications Understanding Gaps Discussion

Introduction • A high-quality assembly of a genome sequence is critical • Assembly is particularly challenging for large, repeat-rich genomes such as those of mammals • With traditional capillary-based sequencing (reads of >700 bases), such assemblies were produced for multiple mammals at a cost of tens of millions of dollars each • New massively parallel technologies were expected to lower the cost dramatically, but so far they have not, because of • short reads (~100 bases in length) • lower accuracy • data that are difficult to assemble

Introduction (cont’d) ALLPATHS-LG • de novo assembly of large (and small) genomes • it should be possible to generate high-quality draft assemblies of large genomes • at ~1000-fold lower cost than a decade ago • Previous versions: • ALLPATHS 1.0 (2008) • ALLPATHS 2.0 (2009)

Results • Model for Input Data • Sequencing Data • ALLPATHS-LG Assembly Method • Uncertainty in Assemblies • Human and Mouse Assemblies • Human Genome • Mouse Genome • Segmental Duplications • Understanding Gaps

Results - Model for Input Data • De novo genome assembly depends on • the computational methods • the nature and quantity of the sequence data used • The fairly standard model for capillary-based sequencing was modified • The model sets a target of 100-fold sequence coverage to compensate for shorter reads and nonuniform coverage • Despite the higher coverage, the proposed model is dramatically cheaper, since the per-base cost is ~10,000-fold lower than for capillary sequencing • Illumina sequencing was used (Table 1)

Results - Model for Input Data (cont’d) Table 1 – Provisional sequencing model for de novo assembly
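The 100-fold coverage target can be turned into a rough read count with the usual relation coverage = (number of reads × read length) / genome size. The sketch below is only a back-of-the-envelope illustration: the 3 Gb genome size is an assumed round figure, and the function names are hypothetical, not part of ALLPATHS-LG.

```python
def coverage(num_reads, read_length_bp, genome_size_bp):
    """Average fold coverage = total sequenced bases / genome size."""
    return num_reads * read_length_bp / genome_size_bp


def reads_needed(genome_size_bp, read_length_bp, target_coverage):
    """Smallest read count whose total bases reach the target coverage."""
    total_bases = target_coverage * genome_size_bp
    # Integer ceiling division avoids floating-point rounding on large numbers.
    return (total_bases + read_length_bp - 1) // read_length_bp


GENOME_SIZE = 3_000_000_000  # assumption: round figure for a mammalian genome
READ_LENGTH = 100            # ~100-base reads, as in the slides
TARGET = 100                 # 100-fold coverage target from the model

n_reads = reads_needed(GENOME_SIZE, READ_LENGTH, TARGET)
print(n_reads, coverage(n_reads, READ_LENGTH, GENOME_SIZE))
```

Under these assumptions the target corresponds to about three billion reads, which is the sense in which the title calls the task a billion-piece puzzle.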

Results – Sequencing Data • Using the model above, sequence data were generated for: • Human Genome • Mouse Genome • Human Genome: • GM12878 (Coriell Institute) of 1000 Genomes Pilot Project • NCBI Short Read: Human_NA_12878_Genome_on_illumina • Mouse Genome: • C57BL/6J female DNA • NCBI Short Read: Mouse_B6_Genome_on_illumina

Results - ALLPATHS-LG Assembly Method • Previous versions were improved extensively • Can also assemble small genomes • Freely available at: http://www.broadinstitute.org/science/programs/genome-biology/crd

Results - ALLPATHS-LG Assembly Method (cont’d) • Some key innovations in ALLPATHS-LG • Handling repetitive sequences • more resilient to repeats • Error correction • for every 24-mer, the algorithm examines the stack of all reads containing that 24-mer • the incidence of incorrect error correction was reduced • Use of jumping data • the algorithm can work even with such data: it trims bases beyond junction points and treats each read pair as belonging to one of two possible distributions • Efficient memory usage • can assemble human genomes on commercial servers (48 processors and 512 GB RAM) in a few weeks • 3 weeks for mouse and 3.5 weeks for human
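The idea of examining a stack of reads that share a k-mer can be illustrated with a toy example. This is a simplified sketch and not the ALLPATHS-LG implementation: it uses k = 4 instead of 24, a plain majority vote per aligned column, and hypothetical function names and thresholds (`min_depth`) chosen only for illustration.

```python
from collections import Counter, defaultdict


def build_stacks(reads, k):
    """Map each k-mer to the (read index, offset) pairs where it occurs."""
    stacks = defaultdict(list)
    for i, read in enumerate(reads):
        for off in range(len(read) - k + 1):
            stacks[read[off:off + k]].append((i, off))
    return stacks


def column_votes(reads, stack):
    """Align the reads of one stack on the shared k-mer; count bases per column."""
    columns = defaultdict(Counter)
    for i, off in stack:
        for pos, base in enumerate(reads[i]):
            columns[pos - off][base] += 1
    return columns


def suspect_bases(reads, k, min_depth=3):
    """Return {(read index, position)} for bases that are outvoted in a stack."""
    suspects = set()
    for stack in build_stacks(reads, k).values():
        if len(stack) < min_depth:
            continue  # too few reads to vote
        columns = column_votes(reads, stack)
        for i, off in stack:
            for pos, base in enumerate(reads[i]):
                votes = columns[pos - off]
                top_base, top_count = votes.most_common(1)[0]
                # Flag a base seen once where another base has strong support.
                if base != top_base and votes[base] == 1 and top_count >= min_depth:
                    suspects.add((i, pos))
    return suspects


# Toy data: the last read differs from the others at position 7 (G -> A).
reads = ["ACGTACGGTC", "ACGTACGGTC", "ACGTACGGTC", "ACGTACGATC"]
print(suspect_bases(reads, k=4))  # {(3, 7)}
```

A real corrector must also deal with repeats, base quality scores, and true polymorphism, which this sketch ignores.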

Results – Uncertainty in Assemblies • The goal of assembly is to reconstruct the genome as accurately as possible • However, in some locations the data may be compatible with more than one solution • Rather than making an arbitrary choice (and introducing errors), the algorithm retains the alternatives • The ALLPATHS-LG algorithm generates an assembly graph whose edges are sequences and whose branches represent alternate choices • ATC{A,T}GGTTTTTTT{T,TT}ACAC • Variant Call Format (.VCF file)
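To make the brace notation concrete, the following sketch enumerates every sequence that a string such as ATC{A,T}GGTTTTTTT{T,TT}ACAC stands for. The parser is a hypothetical helper written for this illustration; it assumes non-nested braces with comma-separated alternatives and is not part of ALLPATHS-LG.

```python
import itertools
import re


def expand_alternatives(pattern):
    """Yield every sequence encoded by a pattern such as 'ATC{A,T}GG'."""
    # Split into fixed stretches and brace groups, keeping the brace groups.
    parts = re.split(r"(\{[^{}]*\})", pattern)
    choices = []
    for part in parts:
        if part.startswith("{") and part.endswith("}"):
            choices.append([alt.strip() for alt in part[1:-1].split(",")])
        else:
            choices.append([part])
    # One sequence per combination of alternatives.
    for combo in itertools.product(*choices):
        yield "".join(combo)


for seq in expand_alternatives("ATC{A,T}GGTTTTTTT{T,TT}ACAC"):
    print(seq)
```

The example pattern has two independent ambiguities (a single-base choice and a homopolymer length), so it expands to four candidate sequences.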

Results – Uncertainty in Assemblies (cont’d) NOTE: • The current version of ALLPATHS-LG only captures single-base and simple-sequence indel uncertainties • Better ways to capture alternatives are needed (many alternatives are still lost in the current version, giving rise to errors) • It would be desirable to assign probabilities to each alternative

Results – Human & Mouse Assemblies • The resulting genome assemblies provide good coverage of the human and mouse genomes • ALLPATHS-LG assemblies were compared with previously published assemblies • Capillary-based sequencing • SOAP (massively parallel sequencing)

Results – Human & Mouse Assemblies (cont’d)

Results – Human Genome • N50 contig length of 24 kb • N50 scaffold length of 11.5 Mb • Contiguity is >4-fold greater than with the SOAP algorithm • Connectivity is >25-fold greater than with the SOAP algorithm • Assembled sequence contains 91.1% of the reference genome (SOAP: 74.3%) • Assembled sequence contains 95.1% of the reference genome (SOAP: 81.2%) • Results are similar to capillary-based assemblies

Results – Human Genome (cont’d) • Local assembly error: 3.5% • Capillary: 4.1% • SOAP: 6.2% • Long-range accuracy: 99.1% • Capillary: 99.7% • SOAP: 99.5%
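The contig and scaffold figures quoted for the assemblies are N50 values: the length L such that pieces of length L or longer contain at least half of the total assembled bases. A minimal sketch of that calculation follows; the input lengths are made-up toy values, not data from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("n50() needs at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest piece down until half of all bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


toy_contigs = [80, 70, 50, 40, 30, 20]  # toy lengths in kb, for illustration only
print(n50(toy_contigs))  # 70
```

Here the total is 290, and the two longest pieces (80 + 70 = 150) already cover more than half, so the N50 is 70.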

Results – Mouse Genome • Results are broadly similar for the mouse genome • N50 contig length of 16 kb • N50 scaffold length of 7.2 Mb • Connectivity is >20-fold greater than with the SOAP algorithm • Results approach those of capillary sequencing (contig: 25 kb, scaffold: 16.9 Mb) • Assembled sequence contains 88.7% of the reference genome (Capillary: 94.2%) • Assembled sequence contains 96.7% of the reference genome (Capillary: 97.3%) • Results are considerably better than SOAP

Results – Mouse Genome (cont’d) • Local assembly error: 3.0% • Capillary: 2.7% • SOAP: 14.2% • Long-range accuracy: 99.0% • Capillary: 99.1% • SOAP: 98.8%

Results – Segmental Duplications • Segmental duplications pose a challenge • ALLPATHS-LG assemblies (human and mouse) cover only ~40% of segmental duplications • Capillary: 60% • SOAP: 12% NOTE: Clearly, additional work is needed here

Results – Understanding Gaps • Roughly three quarters of the gaps are captured • The remaining gaps are not spanned • The majority of the sequence within the gaps consists of repetitive elements, 61.9% of gaps: • For mouse: LINE elements are the major contributors to gaps • For human: LINE and SINE elements

Discussion • High-quality vertebrate genomes have provided an essential foundation for comparative analysis of the human genome • They cost tens of millions of dollars each to generate with capillary-based sequencing • In this work, ALLPATHS-LG was presented, lowering the cost ~1000-fold

Discussion (cont’d) • ALLPATHS-LG • Good long-range connectivity • Good accuracy • Good coverage • relative to capillary-based sequencing, and • better than SOAP • ALLPATHS-LG • The quality of the assemblies is considerably better: • scaffolds are >25 times longer

Discussion (cont’d) • ALLPATHS-LG is anticipated to yield even better results in improved versions • ALLPATHS-LG introduced a preliminary syntax for expressing alternatives: TTTT{T, TT} • Computational hardware requirements: • SOAP is faster (takes 3 days) but its accuracy is low • ALLPATHS-LG is slower but produces high-quality assemblies • ALLPATHS-LG is expected to be sped up through algorithmic improvements (keeping in mind the trade-off between speed and accuracy)

Thank you. Questions?
