240 likes | 263 Views
How to Build a Horse: Final Report. Megan Smedinghoff. Project Goals. Goal 1. Produce an assembly of the horse genome using the Celera Assembler. Goal 2. Reconcile my assembly with the Arachne assembly done by Broad Institute in February 2007. Goal 3.
E N D
How to Build a Horse:Final Report Megan Smedinghoff
Project Goals Goal 1 Produce an assembly of the horse genome using the Celera Assembler Goal 2 Reconcile my assembly with the Arachne assembly done by Broad Institute in February 2007 Goal 3 Produce a parallelized version of the UMD Overlapper to help facilitate easier assembly of large organisms in the future
DNA • Cells in all organisms have a DNA molecule in their nucleus • DNA is an abbreviation forDeoxyriboNucleic Acid • DNA is translated into proteins, which determine the structure, function, and behavior of the cell
What does DNA look like? • Linear, double-stranded, helical molecule. • Consists of four types of nucleotides (bases): • adenine (A) • thymine (T) • guanine (G) • cytosine (C)
What is a genome? • A genome is the linear sequence of bases of the DNA molecule: …..GATGACATGTAT….. • Bacteria: ~2-5 million bases • Insects (fly): ~200 million bases • Mammals: ~3 billion bases
The Horse Genome • In February 2007, Broad Institute released a draft assembly of the horse genome • The sequencing cost $15 million • The assembly contains approximately 2.7 billion bases and was done using 6.8-fold coverage The horse that was sequenced
Why Sequence the Horse? • Allows scientists to study diseases that primarily affect horses such as Glanders • Over 80 known genetic conditions exist in both horses and humans; comparative genomics can lead to better treatments for both species • Work done on the horse will lead to improvements in the assembly pipeline and allow easier assembly of large mammals in the future
DNA target sample SHEAR SIZE SELECT e.g., 10Kbp ± 8% std.dev. End Reads (Mates) 750bp LIGATE & CLONE Primer SEQUENCE Vector Review of Genome Sequencing Slide courtesy of Art Delcher
Review of Assembly Process Calculating Overlaps • We compare each read to every other read and record pairs whose ends overlap by a certain amount (usually ~40 nucleotides) AGTGATTAGATGATAGTAGA | | | | | | | | | | | | GATGATAGTAGAGGATAGATTTA • We use the University of Maryland overlapper to calculate the overlaps • UMD Overlapper is advantageous because it corrects for sequencing errors and does additional trimming to help produce the most accurate list of overlaps possible
Unitig Contig Scaffold Review of Assembly Process Unitigs We use overlap information to combine reads into unitigs Contigs We use mate pair information to combine unitigs into contigs Scaffolds We further use mate pair information to determine how far apart contigs should be (the result is called a scaffold)
Goal 1: Creating an Assembly Running the Assembler • It took nine days for the Celera Assembler to produce an assembly • Celera generated over a terabyte of data • The final assembly used 24.9 million reads (7.9x coverage) Results
Contig A Contig B Contig C Contig D Celera Scaffold Contig A’ Contig B’ Contig C’ Contig D’ Arachne Scaffold Mapping of Celera Scaffold to Arachne Scaffold Goal 2: Produce a Reconciled Assembly Step 1: Match the Celera assembly to the Arachne assembly Step 2: Compare scaffolds from the two assemblies Step 3: Run reconciliation software
Contig A Contig B Contig C Contig D Celera Scaffold Contig A’ Contig B’ Contig C’ Contig D’ Arachne Scaffold Illustration of an orientation problem between the two assemblies Orientation Problems Definition: An orientation problem occurs when a set of Celera contigs matches a set of Arachne contigs, but one of the contigs is placed in the opposite orientation
Contig A Contig B Contig C Contig D Celera Scaffold Contig A’ Contig C’ Contig B’ Contig D’ Arachne Scaffold Illustration of an ordering problem between the two assemblies Ordering Problems Definition: An ordering problem occurs when a set of Celera contigs map to a set of Arachne contigs, but the matches occur in a different order in each scaffold
Contig A Contig B Contig C Contig D Gap Celera Scaffold Contig A’ Contig D’ Contig B’ Contig C’ Arachne Scaffold Gap Illustration of an gap size problem between the two assemblies Gap Size Problems Definition: A gap problem occurs when the size of the Arachne gap is more than four standard deviations away from the estimated Celera gap size
Results of the Comparison Note: In most Celera scaffolds, there was at most one contig that matched a contig in the Arachne assembly. In these scaffolds, there was no way to look for discrepancies between the two assemblies.
Matching sequences Assembly A Assembly B Matching sequences Assembly A Assembly B Compression error in Assembly A or expansion error in Assembly B Reconciliation Gaps Gap in Assembly B Compressions
ROM MOR Overlaps Additional Post-Processing Goal 3: Parallelizing the Overlapper Get associated hex value for each 20-mer in read Record “minimizer” (20-mer with minimum hex value) for sliding 20-base window Create Read, Offset, Minimizer file Filter by certain minimizers and then sort to get candidate pairs for overlap Run “checker” on candidate pairs and produce edit-transcripts Output Overlap List
Parallelization of Overlapper • The parallelization occurs during the “checker” process when the reads with common minimizers are compared • Since the reads are already grouped by minimizers, there is no overhead for doing the comparisons in parallel • I used the Sun Grid Engine routines installed on the genome cluster to submit the jobs to available processors
Results of Parallelization Bacteria Fly Horse
Summary Goal 1 I created an assembly of the horse using the Celera assembler. The assembly size was 2.5 billion bases. Goal 2 I created a reconciled assembly that fixed 551 out of 687 compressions in the Arachne assembly and increased the assembly size by 4.3 million bases Goal 3 I created a parallelized version of the overlapper that significantly reduces the running time on larger genomes
Acknowledgements Thanks to Aleksey Zimin, Guillaume Marçais, Mike Roberts, James Yorke, and Poorani Subramanian for instruction and advice on this project Thanks also to Matthew Pragel for moral support
ROM MOR Overlaps 10-8 More Technical Overlapper Slide Get associated hex value for each 20-mer in read Record “minimizer” (20-mer with minimum hex value) for sliding 20-base window Create Read, Offset, Minimizer file Filter by certain minimizers and then sort to get Candidate pairs for overlap Run “checker” on candidate pairs And produce edit-transcripts Output Overlap List Do Multi-compare and split overlaps when they have a certain number of differences Do Poisson test and eliminate any overlaps that have less than a 10-8 probability of occurring