ALLPATHS-LG

ALLPATHS-LG: a new standard for assembling a billion-piece genome puzzle. CS 681, presented by Ömer Köksal. "High-quality draft assemblies of mammalian genomes from massively parallel sequence data", ALLPATHS-LG by Sante Gnerre et al. (20 authors), Jan 25th, 2011.


Presentation Transcript


  1. ALLPATHS-LG a new standard for assembling a billion-piece genome puzzle

  2. CS 681 • Presented by Ömer Köksal • High-quality draft assemblies of mammalian genomes from massively parallel sequence data • ALLPATHS-LG by Sante Gnerre et al. (20 authors) • Jan 25th, 2011

  3. Agenda • Introduction • Results • Model for Input Data • Sequencing Data • ALLPATHS-LG Assembly Method • Uncertainty in Assemblies • Human and Mouse Assemblies • Human Genome • Mouse Genome • Segmental Duplications • Understanding Gaps • Discussion

  4. Introduction • High-quality assembly of a genome sequence is critical • Assembly is particularly challenging for large, repeat-rich genomes such as those of mammals • Using traditional capillary-based sequencing (reads of >700 bases), such assemblies were produced for multiple mammals at a cost of tens of millions of dollars each • New massively parallel technologies were expected to lower the cost dramatically, but so far could not, because their reads are: • short (~100 bases in length) • less accurate • difficult to assemble

  5. Introduction (cont’d) ALLPATHS-LG • De novo assembly of large (and small) genomes • It should be possible to generate high-quality draft assemblies of large genomes • ~1000-fold lower cost than a decade ago • Previous versions: • ALLPATHS 1.0 (2008) • ALLPATHS 2.0 (2009)

  6. Results • Model for Input Data • Sequencing Data • ALLPATHS-LG Assembly Method • Uncertainty in Assemblies • Human and Mouse Assemblies • Human Genome • Mouse Genome • Segmental Duplications • Understanding Gaps

  7. Results - Model for Input Data • De novo genome assembly depends on: • computational methods • the nature and quantity of sequence data used • A fairly standard model for capillary-based sequencing was modified • The model sets a target of 100-fold sequence coverage to compensate for shorter reads and nonuniform coverage • Despite the higher coverage, the proposed model is dramatically cheaper, since the per-base cost is ~10,000-fold lower than with capillary sequencing • Illumina sequencing was used (Table 1)
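The 100-fold coverage target can be turned into a rough read count with the usual relation coverage = (number of reads × read length) / genome size. The sketch below is only a back-of-the-envelope illustration: the ~3 Gb genome size is an assumed round figure for a mammalian genome, the ~100-base read length comes from the introduction slide, and `reads_needed` is a hypothetical helper name, not part of ALLPATHS-LG.

```python
def reads_needed(genome_size: int, read_length: int, coverage: int) -> int:
    """Smallest read count N with N * read_length / genome_size >= coverage."""
    total_bases = coverage * genome_size
    # Integer ceiling division avoids floating-point rounding on large numbers.
    return -(-total_bases // read_length)


if __name__ == "__main__":
    # Assumed values: ~3 Gb genome, ~100-base reads, 100-fold coverage target.
    print(reads_needed(3_000_000_000, 100, 100))
```

Under these assumptions the count comes out on the order of billions of reads, which matches the "billion-piece puzzle" framing of the title.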

  8. Results - Model for Input Data (cont’d) Table 1 – Provisional sequencing model for de novo assembly

  9. Results – Sequencing Data • Using the model above, sequence data were generated for: • Human Genome • Mouse Genome • Human Genome: • GM12878 (Coriell Institute), from the 1000 Genomes Pilot Project • NCBI Short Read Archive: Human_NA_12878_Genome_on_illumina • Mouse Genome: • C57BL/6J female DNA • NCBI Short Read Archive: Mouse_B6_Genome_on_illumina

  10. Results - ALLPATHS-LG Assembly Method • Previous versions were improved extensively • Can also assemble small genomes • Freely available at: http://www.broadinstitute.org/science/programs/genome-biology/crd

  11. Results - ALLPATHS-LG Assembly Method (cont’d) • Some key innovations in ALLPATHS-LG: • Handling of repetitive sequences • more resilient to repeats • Error correction • for every 24-mer, the algorithm examines the stack of all reads containing that 24-mer • the incidence of incorrect error correction was reduced • Use of jumping data • it can work even with such data: it trims bases beyond junction points and treats each read pair as belonging to one of two possible distributions • Efficient memory usage • can assemble human genomes on commercial servers (48 processors & 512 GB RAM) in a few weeks • (3 weeks for mouse & 3.5 weeks for human)
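The "stack of reads per 24-mer" idea from the error-correction bullet can be illustrated with a toy majority vote. This is a simplified sketch and not the ALLPATHS-LG implementation: the function names, the `min_support` threshold, and the rule "correct a base only if it is seen once in its column" are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

K = 24  # k-mer size named on the slide


def build_stacks(reads, k=K):
    """Map each k-mer to the (read index, offset) of every occurrence."""
    stacks = defaultdict(list)
    for i, read in enumerate(reads):
        for off in range(len(read) - k + 1):
            stacks[read[off:off + k]].append((i, off))
    return stacks


def column_votes(reads, stack):
    """Align the reads of one stack on the shared k-mer; count bases per column."""
    columns = defaultdict(Counter)
    for i, off in stack:
        for pos, base in enumerate(reads[i]):
            columns[pos - off][base] += 1  # column 0 = first base of the k-mer
    return columns


def suggest_corrections(reads, k=K, min_support=3):
    """Return {(read index, position): consensus base} for outvoted bases."""
    fixes = {}
    for stack in build_stacks(reads, k).values():
        if len(stack) < min_support:
            continue  # too few reads to vote (illustrative threshold)
        columns = column_votes(reads, stack)
        for i, off in stack:
            for pos, base in enumerate(reads[i]):
                counts = columns[pos - off]
                best, best_n = counts.most_common(1)[0]
                # Hypothetical rule: fix a base seen once against a supported consensus.
                if base != best and best_n >= min_support and counts[base] == 1:
                    fixes[(i, pos)] = best
    return fixes


if __name__ == "__main__":
    toy_reads = ["ACGTTGCA"] * 3 + ["ACGTTGGA"]  # last read has one mismatch
    print(suggest_corrections(toy_reads, k=4))
```

The toy run uses `k=4` only so that the example reads stay short; the slide's value is 24.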

  12. Results – Uncertainty in Assemblies • The goal of assembly is to reconstruct the genome as accurately as possible • However, in some locations the data may be compatible with more than one solution • Rather than making an arbitrary choice (and introducing errors), the algorithm retains the alternatives • The ALLPATHS-LG algorithm generates an assembly graph whose edges are sequences and whose branches represent alternate choices • Example: ATC{A,T}GGTTTTTTT{T,TT}ACAC • Variant Call Format (.VCF file)
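To make the brace notation concrete, the following sketch expands a string such as `ATC{A,T}GGTTTTTTT{T,TT}ACAC` into every sequence it stands for. It assumes the simple, non-nested form shown on the slide; `expand_alternatives` is a hypothetical helper, not an ALLPATHS-LG tool.

```python
import itertools
import re


def expand_alternatives(pattern):
    """List every concrete sequence encoded by a {a,b} alternatives string."""
    # re.split with a capture group alternates: fixed text, choices, fixed text, ...
    parts = re.split(r"\{([^{}]*)\}", pattern)
    choices = [
        [part] if i % 2 == 0 else [alt.strip() for alt in part.split(",")]
        for i, part in enumerate(parts)
    ]
    return ["".join(combo) for combo in itertools.product(*choices)]


if __name__ == "__main__":
    for seq in expand_alternatives("ATC{A,T}GGTTTTTTT{T,TT}ACAC"):
        print(seq)
```

With two binary choice points, the slide's example expands to four candidate sequences: a single-base A/T alternative combined with a homopolymer run that is one base longer or shorter.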

  13. Results – Uncertainty in Assemblies (cont’d) NOTE: • The current version of ALLPATHS-LG captures only single-base and simple-sequence indel uncertainties • Better ways to capture alternatives are needed (many are still lost in the current version, giving rise to errors) • It would be desirable to assign probabilities to each alternative

  14. Results – Human & Mouse Assemblies • The resulting genome assemblies provide good coverage of the human and mouse genomes • ALLPATHS-LG assemblies were compared with previously published assemblies from: • Capillary-based sequencing • SOAP (massively parallel sequencing)

  15. Results – Human & Mouse Assemblies (cont’d)

  16. Results – Human Genome • N50 contig length of 24 kb • N50 scaffold length of 11.5 Mb • Contiguity is > 4-fold longer than with the SOAP algorithm • Connectivity is > 25-fold longer than with the SOAP algorithm • Assembled sequence contains 91.1% of the reference genome (SOAP: 74.3%) • Assembled sequence contains 95.1% of the reference genome (SOAP: 81.2%) • Results are similar to capillary-based assemblies
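For readers unfamiliar with the N50 statistic quoted above: it is the length L such that contigs (or scaffolds) of length L or longer together contain at least half of the assembled bases. A minimal sketch of the standard definition follows; the function name `n50` and the example lengths are illustrative only.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest piece down until half of all bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6]))  # total is 20; 6 + 5 = 11 covers half
```

Because N50 is weighted by length, a few long scaffolds raise it far more than many short ones, which is why it is used to summarize contiguity and connectivity.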

  17. Results – Human Genome (cont’d) • Local assembly error: 3.5 % • Capillary: 4.1 % • SOAP: 6.2 % • Long range accuracy: 99.1% • Capillary: 99.7 % • SOAP: 99.5 %

  18. Results – Mouse Genome • Results are broadly similar for the mouse genome • N50 contig length of 16 kb • N50 scaffold length of 7.2 Mb • Connectivity is > 20-fold larger than with the SOAP algorithm • Approaches capillary results (contig: 25 kb, scaffold: 16.9 Mb) • Assembled sequence contains 88.7% of the reference genome (Capillary: 94.2%) • Assembled sequence contains 96.7% of the reference genome (Capillary: 97.3%) • Results are considerably better than SOAP

  19. Results – Mouse Genome (cont’d) • Local assembly error: 3.0 % • Capillary: 2.7 % • SOAP: 14.2 % • Long range accuracy: 99.0 % • Capillary: 99.1 % • SOAP: 98.8 %

  20. Results – Segmental Duplications • Segmental duplications pose a challenge • ALLPATHS-LG assemblies (human and mouse) cover only ~40% of segmental duplications • Capillary: 60% • SOAP: 12% • NOTE: Clearly, additional work is needed here

  21. Results – Understanding Gaps • Roughly three quarters of the gaps are captured • The remaining gaps are not spanned • The majority of the sequence within the gaps consists of repetitive elements (61.9% of gaps): • For mouse: LINE elements are the major contributors to gaps • For human: LINE & SINE elements

  22. Discussion • High-quality vertebrate genomes have provided an essential foundation for comparative analysis of the human genome • Each cost tens of millions of dollars to generate with capillary-based sequencing • In this work, ALLPATHS-LG was presented, lowering the cost ~1000-fold

  23. Discussion (cont’d) • ALLPATHS-LG achieves: • good long-range connectivity • good accuracy • good coverage • with respect to capillary-based sequencing, and better than SOAP • ALLPATHS-LG • Quality of the assemblies is considerably better: • scaffolds are > 25 times longer

  24. Discussion (cont’d) • ALLPATHS-LG is anticipated to yield even better results in improved versions • ALLPATHS-LG introduced a preliminary syntax for expressing alternatives: TTTT{T, TT} • Computational hardware requirements: • SOAP is faster (takes 3 days) but its accuracy is low • ALLPATHS-LG is slower but produces high-quality assemblies • ALLPATHS-LG is expected to be sped up through algorithmic improvements (keeping in mind the trade-off between speed and accuracy)

  25. Thank you. Questions?
