1 / 23

Problems of Genome Assembly

Problems of Genome Assembly. James Yorke and Aleksey Zimin University of Maryland, College Park. WGS sequencing. Multiple copies of DNA. Fragments of 200 - 200,000 bases. No information is retained on which part of the DNA the fragments came from. WGS sequencing: fragments.

akaren
Download Presentation

Problems of Genome Assembly

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Problems of Genome Assembly James Yorke and Aleksey Zimin University of Maryland, College Park

  2. WGS sequencing Multiple copies of DNA Fragments of 200 - 200,000 bases No information is retained on which part of the DNA the fragments came from.

  3. WGS sequencing: fragments Sequencing machine reads 100-1000 bases on the ends of the fragments, producing pairs of reads. The fragment sizes are known up to ± 10-20%. CAAGCTGAT... …GTTTGGAAC Unknown sequence Pair of reads

  4. The mathematical problem • We start with millions of pairs of reads, 100 - 1000 bases each • Multiple copies of DNA provide multiple coverage by reads • The problem of genome assembly is to recover the original sequence of bases of the genome (as much as possible…).

  5. Assembling a jigsaw puzzle 1 • The task of the assembly becomes the task of assembling a giant jigsaw puzzle • We look for reads whose sequences suggest that they came from the same place in the genome:AGTGATTAGATGATAGTAGA||||||||||||GATGATAGTAGAGGATAGATTTA

  6. Assembling a jigsaw puzzle 2 • Then we put “overlapping” reads together AGTGATTAGATGATAGTAGA AGATGATAGTAGAGATAGATAGACC ATAGATAGACCACTCATCATAC AGTGATTAGATGATAGTAGAGATAGATAGACCACTCATCATAC reads This yields a “contig”

  7. Assembling a jigsaw puzzle 3 • We use read pairing information to order and orient contigs to produce scaffolds– the final product of assembly Pairs of reads belonging to the same fragment of DNA contig contig

  8. Difficulties in assembly • Sequencing errors: two reads that came from the same place in the genome often have mismatching sequences • AGTGATTAGATCATAGTAGAG|| ||||||||| • ATGATAGTAGAGGATAGAT • Repetitive DNA (~ 5-20% of human DNA is repetitive): • TTAGGGTTAGGGTTAGGGTTAGGGTTAGGG

  9. Repeat regions may cause omissions A R B R C A R C

  10. Erroneous duplications • Two recent published assemblies of the cow genome: UMD2 and BosTau4 • Segmental duplications were a central theme in BosTau4 genome paper • UMD2 assembly had many fewer duplications We examined the duplications, > 99.5% identity, >5000bp, one copy in the UMD2 assembly and two copies in the BosTau4 UMD2 BosTau4 Each base in the genome is covered by 6 reads, on average. A way to judge which assembly is correct is to compute the average read coverage for these regions.

  11. Examining read coverage reveals errors The thick solid vertical line is placed at the coverage at which it is as likely to have two copies as it is to have one.

  12. Next Gen vs. Sanger Sequencing • Sanger sequencing for a mammalian (~ 3Gbp) genome  Expensive: $50M for a mammalian genome  Large amount of DNA required  We get 700-1000 bp reads all with mate pairs • Illumina and 454 Sequencing for the same genome  Inexpensive: as low as $25K (Illumina), or $1M (454) for a mammalian genome  Small amount of DNA required (e.g. one insect)  Only 100 or 400 bp reads, some with mate pairs • Assembly is a much harder problem now

  13. Difficulties in denovo Assembly of Illumina and 454 data • Reads are short – high coverage needed, imposing demanding requirements on the software and computer hardware • Error patterns in the reads: • substitution errors in Illumina reads • homopolymer errors (unable to tell AAAA from AAA) in 454 reads • Biased coverage by Illumina reads depending on the CG content • Unreliable mate pairs: • Assembly techniques have much larger impact now could actually be

  14. NGS Assemblers • New assemblers developed for different kinds of NGS data: • Newbler for 454 data • SOAPdenovo, Velvet, ABYSS, ALLPATHS, and others for Illumina data • We use open source Celera Assembler currently supported by J. Craig Venter Institute bioinformatics team • CA is capable of assembling mixed data sets

  15. Assembly quality varies significantly with the software used • Example 1: Argentine ant assembly comparison. • Both assemblies used the same 75bp Illumina reads, unmated and in 3kb and 8kb mate pairs

  16. Assembly quality varies significantly with the software used • Example 2. Pogonomyrmex barbatus, the Red Harvester Ant assembly comparison (454 data). • Both assemblies used the same 454 data in 3kb mate pairs, 8kb mate pairs and shotgun reads

  17. Post-assembly steps • Assemblers output scaffolds – ordered and oriented collections of contigs. • Scaffolds typically are much smaller than chromosomes and may contain large-scale errors. • Some mate pair linking information remains unused by assemblers. • Marker maps, i.e. collections of short sequences whose positions on the chromosomes are known, can be used to position the contigs on the chromosomes.

  18. UMD Chromosome builder • Uses contigs, mate pairs and markers, discarding unreliable scaffold information • Mapping steps: • Use mate pairs to orient contigs • Use markers and mate pairs to assign oriented contigs to the chromosomes • Compute position of each contig on the chromosome as the best least-square fit to the available mate pair and marker data

  19. Computing contig orientations • An orientation problem: A C B

  20. Computing contig orientations • An orientation problem: A C B

  21. Computing contig orientations • An orientation problem: • Matrix form: • Compute y, the eigenvector corresponding to the largest eigenvalue of M. The signs of the eigenvector components provide recipe to flipping the contigs to achieve consistent orientations A C B A B C 0 1 -1 A M = 1 0 -1 B -1 -1 0 C

  22. Computing contig orientations • The eigenvector of M corresponding to the largest eigenvalue, or Frobenius – Perron eigenvalue l=2: y=(0.5774, 0.5774, -0.5774). • sign(y) = (1, 1, -1), that is the solution is to flip contig C • Final matrix of orientations = diag(sign(y))*M* diag(sign(y)): • Flipping C is the correct solution! 1 0 0 0 1 -1 1 0 0 0 1 1 0 1 0 1 0 -1 0 1 0 = 1 0 1 0 0 -1 -1 -1 0 0 0 -1 1 1 0

  23. Conclusions • Genome assembly is a difficult problem that has gotten harder because of Next Gen Sequencing data • Assembly techniques have large impact on the quality of the assembly • Output of the assembler is not the final assembly; extensive post-processing is required to produce chromosome sequences

More Related