Problems of Genome Assembly

Problems of Genome Assembly James Yorke and Aleksey Zimin University of Maryland, College Park

WGS sequencing Multiple copies of DNA Fragments of 200 - 200,000 bases No information is retained on which part of the DNA the fragments came from.

WGS sequencing: fragments Sequencing machine reads 100-1000 bases on the ends of the fragments, producing pairs of reads. The fragment sizes are known up to ± 10-20%. CAAGCTGAT... …GTTTGGAAC Unknown sequence Pair of reads

The mathematical problem • We start with millions of pairs of reads, 100 - 1000 bases each • Multiple copies of DNA provide multiple coverage by reads • The problem of genome assembly is to recover the original sequence of bases of the genome (as much as possible…).

Assembling a jigsaw puzzle 1 • The task of the assembly becomes the task of assembling a giant jigsaw puzzle • We look for reads whose sequences suggest that they came from the same place in the genome:AGTGATTAGATGATAGTAGA||||||||||||GATGATAGTAGAGGATAGATTTA

Assembling a jigsaw puzzle 2 • Then we put “overlapping” reads together AGTGATTAGATGATAGTAGA AGATGATAGTAGAGATAGATAGACC ATAGATAGACCACTCATCATAC AGTGATTAGATGATAGTAGAGATAGATAGACCACTCATCATAC reads This yields a “contig”

Assembling a jigsaw puzzle 3 • We use read pairing information to order and orient contigs to produce scaffolds– the final product of assembly Pairs of reads belonging to the same fragment of DNA contig contig

Difficulties in assembly • Sequencing errors: two reads that came from the same place in the genome often have mismatching sequences • AGTGATTAGATCATAGTAGAG|| ||||||||| • ATGATAGTAGAGGATAGAT • Repetitive DNA (~ 5-20% of human DNA is repetitive): • TTAGGGTTAGGGTTAGGGTTAGGGTTAGGG

Repeat regions may cause omissions A R B R C A R C

Erroneous duplications • Two recent published assemblies of the cow genome: UMD2 and BosTau4 • Segmental duplications were a central theme in BosTau4 genome paper • UMD2 assembly had many fewer duplications We examined the duplications, > 99.5% identity, >5000bp, one copy in the UMD2 assembly and two copies in the BosTau4 UMD2 BosTau4 Each base in the genome is covered by 6 reads, on average. A way to judge which assembly is correct is to compute the average read coverage for these regions.

Examining read coverage reveals errors The thick solid vertical line is placed at the coverage at which it is as likely to have two copies as it is to have one.

Next Gen vs. Sanger Sequencing • Sanger sequencing for a mammalian (~ 3Gbp) genome  Expensive: $50M for a mammalian genome  Large amount of DNA required  We get 700-1000 bp reads all with mate pairs • Illumina and 454 Sequencing for the same genome  Inexpensive: as low as $25K (Illumina), or $1M (454) for a mammalian genome  Small amount of DNA required (e.g. one insect)  Only 100 or 400 bp reads, some with mate pairs • Assembly is a much harder problem now

Difficulties in denovo Assembly of Illumina and 454 data • Reads are short – high coverage needed, imposing demanding requirements on the software and computer hardware • Error patterns in the reads: • substitution errors in Illumina reads • homopolymer errors (unable to tell AAAA from AAA) in 454 reads • Biased coverage by Illumina reads depending on the CG content • Unreliable mate pairs: • Assembly techniques have much larger impact now could actually be

NGS Assemblers • New assemblers developed for different kinds of NGS data: • Newbler for 454 data • SOAPdenovo, Velvet, ABYSS, ALLPATHS, and others for Illumina data • We use open source Celera Assembler currently supported by J. Craig Venter Institute bioinformatics team • CA is capable of assembling mixed data sets

Assembly quality varies significantly with the software used • Example 1: Argentine ant assembly comparison. • Both assemblies used the same 75bp Illumina reads, unmated and in 3kb and 8kb mate pairs

Assembly quality varies significantly with the software used • Example 2. Pogonomyrmex barbatus, the Red Harvester Ant assembly comparison (454 data). • Both assemblies used the same 454 data in 3kb mate pairs, 8kb mate pairs and shotgun reads

Post-assembly steps • Assemblers output scaffolds – ordered and oriented collections of contigs. • Scaffolds typically are much smaller than chromosomes and may contain large-scale errors. • Some mate pair linking information remains unused by assemblers. • Marker maps, i.e. collections of short sequences whose positions on the chromosomes are known, can be used to position the contigs on the chromosomes.

UMD Chromosome builder • Uses contigs, mate pairs and markers, discarding unreliable scaffold information • Mapping steps: • Use mate pairs to orient contigs • Use markers and mate pairs to assign oriented contigs to the chromosomes • Compute position of each contig on the chromosome as the best least-square fit to the available mate pair and marker data

Computing contig orientations • An orientation problem: A C B

Computing contig orientations • An orientation problem: • Matrix form: • Compute y, the eigenvector corresponding to the largest eigenvalue of M. The signs of the eigenvector components provide recipe to flipping the contigs to achieve consistent orientations A C B A B C 0 1 -1 A M = 1 0 -1 B -1 -1 0 C

Computing contig orientations • The eigenvector of M corresponding to the largest eigenvalue, or Frobenius – Perron eigenvalue l=2: y=(0.5774, 0.5774, -0.5774). • sign(y) = (1, 1, -1), that is the solution is to flip contig C • Final matrix of orientations = diag(sign(y))*M* diag(sign(y)): • Flipping C is the correct solution! 1 0 0 0 1 -1 1 0 0 0 1 1 0 1 0 1 0 -1 0 1 0 = 1 0 1 0 0 -1 -1 -1 0 0 0 -1 1 1 0

Conclusions • Genome assembly is a difficult problem that has gotten harder because of Next Gen Sequencing data • Assembly techniques have large impact on the quality of the assembly • Output of the assembler is not the final assembly; extensive post-processing is required to produce chromosome sequences

Problems of Genome Assembly