Explore the process of sequencing, assembling, and validating the horse genome to study diseases, infer evolution, and understand genetic conditions analogous to humans.
How to Build a Horse Megan Smedinghoff
Background • In February 2007, the Broad Institute released a draft genome of the horse (Equus caballus) • The project cost $15 million and was funded by the National Human Genome Research Institute and the National Institutes of Health • 300,000 bacterial artificial chromosomes (BACs) were provided by the University of Veterinary Medicine in Hanover, Germany and the Helmholtz Centre for Infection Research in Braunschweig, Germany
Horse Genome Statistics • The horse genome contains approximately 2.7 billion base pairs • The assembly was done using 6.8-fold coverage • The sequenced horse was a thoroughbred mare named Twilight from Cornell University [Photo caption: Twilight posing for a picture at Cornell]
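As a rough worked example of what 6.8-fold coverage implies, the sketch below multiplies the genome size by the coverage and divides by a read length. The 750 bp read length is an assumption (a typical end-read length, not a figure reported for the horse project), and the function names are illustrative only.

```python
GENOME_SIZE_BP = 2_700_000_000  # approximate horse genome size from the slide
COVERAGE = 6.8                  # fold coverage from the slide
READ_LENGTH_BP = 750            # ASSUMPTION: typical read length, not a project figure


def bases_needed(genome_size_bp, coverage):
    """Total sequenced bases implied by a given fold coverage."""
    return genome_size_bp * coverage


def reads_needed(genome_size_bp, coverage, read_length_bp):
    """Approximate number of reads, rounded to the nearest whole read."""
    return round(bases_needed(genome_size_bp, coverage) / read_length_bp)


if __name__ == "__main__":
    print(f"{bases_needed(GENOME_SIZE_BP, COVERAGE):.3e} bases")
    print(f"{reads_needed(GENOME_SIZE_BP, COVERAGE, READ_LENGTH_BP):,} reads")
```

Under these assumptions the project would need on the order of 18 billion sequenced bases, or roughly 24 million reads.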
Why Sequence the Horse? • Allows scientists to study diseases that primarily affect horses such as Glanders • SNP information can be used to connect DNA to physical characteristics and explain differences between breeds • Lots of general information about mammals can be gained by looking at the horse since very few large mammals have been sequenced
How the Horse Genome Affects Us • There are over 80 known genetic conditions in the horse that are analogous to human disorders • Horses have some conditions traditionally found in humans such as allergies and arthritis • Having the complete horse genome helps infer the order of evolution • Horse Racing?
Project Proposal • Reassemble the horse genome using the Celera Assembler • Use existing UMD software to compare my assembly with the Broad assembly and produce a reconciled horse genome • Deposit the improved assembly in GenBank Advisor: Jim Yorke
Introduction to Genome Sequencing • [Diagram: DNA target sample → shear → size select (e.g., 10 Kbp ± 8% std. dev.) → ligate and clone into vector → sequence end reads (mates, 750 bp) from the primer] Slide courtesy of Art Delcher
How Genomes are Assembled • Trim the Reads • Calculate Overlaps • Build Unitigs • Build Contigs • Build Scaffolds • Closure
Assembly: Calculating Overlaps • Compare every possible combination of reads to find every overlap of a certain length (~40bp) • Must compare forward and reverse orientation of each pair of reads • Comparisons must take into account the possibility of sequencing errors and use alignment algorithms such as Smith-Waterman [Figure: Read A and Read B shown overlapping in four combinations of 5’/3’ orientation]
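A minimal sketch of the overlap step for one pair of reads follows. It is a deliberate simplification: it counts mismatches in an ungapped suffix-to-prefix comparison instead of running Smith-Waterman, so it tolerates substitutions but not insertions or deletions. The function names and parameters are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def suffix_prefix_overlap(a, b, min_len, max_error_rate=0.0):
    """Longest suffix of `a` matching a prefix of `b` within the mismatch rate.

    Returns 0 when no overlap of at least `min_len` bases qualifies.
    Ungapped comparison only (a stand-in for a real alignment algorithm).
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-length:], b[:length]))
        if mismatches <= max_error_rate * length:
            return length
    return 0


def best_overlap(a, b, min_len, max_error_rate=0.0):
    """Try read `b` in both orientations; return (overlap length, orientation)."""
    forward = suffix_prefix_overlap(a, b, min_len, max_error_rate)
    reverse = suffix_prefix_overlap(a, reverse_complement(b), min_len, max_error_rate)
    if forward >= reverse:
        return forward, "forward"
    return reverse, "reverse"
```

Calling `best_overlap` on every pair of reads is quadratic in the number of reads, which is why real assemblers first use a k-mer index to pick candidate pairs.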
Assembly: Creating Unitigs • A unitig is a set of reads that have been linked together based on overlaps • A unitig has no ambiguities [Figure: overlapping reads tiled beneath the unitig they form]
Assembly: Creating Unitigs (cont.) • Best Buddy Algorithm for Unitig Assembly: If read A's longest overlap is with read B and read B's longest overlap is with read A, then reads A and B are best buddies [Figure: two panels of reads A through D, one captioned "Read A and Read B are best buddies" and the other "Read A and Read B are NOT best buddies"]
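The best buddy rule can be sketched as a mutual-best-match search over precomputed overlap lengths. This is a simplified illustration, assuming one overlap length per pair of reads and ignoring which end of each read is involved; the input format and function name are hypothetical.

```python
def best_buddies(overlaps):
    """Return the sorted list of read pairs that are each other's best overlap.

    `overlaps` maps a (read_a, read_b) tuple to an overlap length.
    Ties are broken by whichever overlap is seen first.
    """
    best = {}  # read -> (partner with the longest overlap, overlap length)
    for (a, b), length in overlaps.items():
        for read, other in ((a, b), (b, a)):
            if read not in best or length > best[read][1]:
                best[read] = (other, length)

    pairs = set()
    for read, (other, _) in best.items():
        # Mutual check: the partner's best overlap must point back at this read.
        if best[other][0] == read:
            pairs.add(tuple(sorted((read, other))))
    return sorted(pairs)


if __name__ == "__main__":
    example = {("A", "B"): 100, ("B", "C"): 80, ("C", "D"): 60}
    print(best_buddies(example))
```

In the example, C's longest overlap is with B, but B's longest overlap is with A, so only A and B are linked.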
Assembly: Creating Contigs • A contig is a set of overlapping unitigs • Contigs are assembled by using mate pair information • Since we know the distance between mates and the orientation of the mates, we can infer the placement of the unitigs [Figure: Unitig A and Unitig B linked by Read 1 and Read 2; caption: Read 1 and Read 2 are mates]
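To illustrate how a known mate distance constrains placement, the sketch below estimates the gap between two unitigs from mate pairs that span them. It assumes unitig A lies to the left of unitig B, both in forward orientation, with read 1 pointing forward in A and read 2 pointing backward in B; all names and the coordinate convention are hypothetical.

```python
from statistics import mean


def gap_from_mate_pair(insert_size, len_a, start_in_a, end_in_b):
    """Gap between unitig A and unitig B implied by one mate pair.

    insert_size: expected distance from the start of read 1 to the far end of read 2
    len_a:       length of unitig A
    start_in_a:  offset in A where read 1 begins
    end_in_b:    offset in B where read 2 ends (measured from the start of B)
    A negative result suggests the two unitigs overlap.
    """
    covered_in_a = len_a - start_in_a
    return insert_size - covered_in_a - end_in_b


def estimate_gap(insert_size, len_a, links):
    """Average the gap over several (start_in_a, end_in_b) mate-pair links."""
    return mean(gap_from_mate_pair(insert_size, len_a, s, e) for s, e in links)


if __name__ == "__main__":
    # 10 Kbp inserts; unitig A is 6,000 bp long.
    print(estimate_gap(10_000, 6_000, [(2_000, 1_500), (2_500, 1_900)]))
```

Because insert sizes vary (for example, a standard deviation of around 8%), each link gives only an approximate gap, so averaging over several links tightens the estimate.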
Assembly: Building Scaffolds • Scaffolds are built from contigs • The orientation and approximate distances between contigs are inferred from mate pair information • When possible, the gaps between contigs are filled in with leftover sequence [Figure: a scaffold made of Contig A and Contig B, with the reads that link them]
Arachne Assembler • 24-mer indexing • Any two reads that share at least one 24-mer are paired • Each pair is scored • Contigs are created by merging paired pairs • Repeat regions are avoided during contig assembly but used during scaffold assembly • Subreads are placed after scaffold assembly [Photo caption: Serafim Batzoglou, Arachne author]
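The k-mer indexing idea can be sketched as follows: build a table from each k-mer to the reads containing it, then pair up reads that land in the same table entry. This is an illustrative simplification, not Arachne's actual code; it ignores reverse complements and the filtering of very frequent k-mers, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def kmer_index(reads, k):
    """Map each k-mer to the set of read IDs that contain it."""
    index = defaultdict(set)
    for read_id, seq in reads.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(read_id)
    return index


def candidate_pairs(reads, k=24):
    """Return the set of read-ID pairs that share at least one k-mer."""
    pairs = set()
    for read_ids in kmer_index(reads, k).values():
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs


if __name__ == "__main__":
    toy_reads = {"r1": "ACGTACGG", "r2": "TACGGTTT", "r3": "CCCCCCCC"}
    print(candidate_pairs(toy_reads, k=4))  # small k only for the toy example
```

Only the candidate pairs need to be scored or aligned, which avoids comparing every read against every other read.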
Celera Assembler • Find overlaps of at least 40bp with less than 6% error • Overlaps are found using 22-mers • After overlaps are calculated, Celera does error correction using a voting algorithm • Contigs are assembled using the best buddy algorithm • Scaffolds are assembled from mate pair information • Scaffold gaps are filled when possible [Photo caption: Gene Myers, former vice president of Celera Genomics]
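The slide does not describe the voting algorithm in detail, so the following is only a guess at the general idea: each overlapping read "votes" for the base it places at a given position, and a read's base is replaced when enough overlapping reads agree on a different one. The thresholds, input format, and function name are all assumptions, not Celera's actual procedure.

```python
from collections import Counter


def vote_correct(read, aligned_columns, min_votes=3):
    """Correct a read by majority vote of the reads that overlap it.

    aligned_columns[i] lists the bases that overlapping reads place at
    position i of `read` (an empty list means no overlapping read covers it).
    A base is replaced only when the winning base differs from it, has at
    least `min_votes` votes, and holds a strict majority of the column.
    """
    corrected = []
    for base, column in zip(read, aligned_columns):
        if column:
            winner, count = Counter(column).most_common(1)[0]
            if winner != base and count >= min_votes and count > len(column) / 2:
                base = winner
        corrected.append(base)
    return "".join(corrected)


if __name__ == "__main__":
    columns = [["A", "A"], ["C", "C", "C"], ["T", "T", "T"], []]
    print(vote_correct("ACGT", columns))
```

Requiring a minimum number of votes keeps a single overlapping read, which may itself contain an error, from overwriting a correct base.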
Project Expectations • Fall 2007: Produce Celera Assembly • Spring 2008: Produce Reconciled Assembly • General Goals: Tackle the unexpected problems that accompany genome assembly; document my work; validate my work wherever possible
Validation • Genome assemblies are not perfect • I plan to validate my assembly by comparing it to the current draft • I expect about 1.5% difference between the Celera Assembly and the Broad Assembly • I will use MUMmer to measure similarity between genomes
MUMmer • MUMmer is a piece of software created by CBCB that is used to compare genomes • MUMmer locates strings of at least 18bp that are present in each genome • Plotting the results makes it easy to see insertions, deletions, inversions, etc. Graphs courtesy of Adam Phillippy
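To give a feel for how shared strings become a plot, the sketch below finds every k-mer that occurs in both sequences and reports its position in each. It uses a plain dictionary lookup on fixed-length k-mers, which is a stand-in for illustration and not the method the real tool uses; the function name is hypothetical, and only the forward strand is searched.

```python
from collections import defaultdict


def shared_kmer_hits(reference, query, k=18):
    """Return (reference_pos, query_pos) for every k-mer found in both sequences.

    Each hit is one point on a dot plot: an unbroken diagonal run means the
    two sequences agree there, while a shift between runs points to an
    insertion or deletion.
    """
    positions = defaultdict(list)
    for i in range(len(reference) - k + 1):
        positions[reference[i:i + k]].append(i)

    hits = []
    for j in range(len(query) - k + 1):
        for i in positions.get(query[j:j + k], ()):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    # Small k only for the toy example; the slide quotes a minimum of 18 bp.
    print(shared_kmer_hits("ACGTTGCA", "GGACGTTG", k=4))
```

In the toy example the hits fall on a single diagonal offset by two positions, consistent with two extra bases at the start of the query.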
Implementation Details • I plan to use the Genome cluster at University of Maryland to produce my assembly • Much of my project will utilize existing software • I intend to use Perl to write any additional scripts that may be needed
Time Permitting • The University of Maryland has recently produced a lot of software for the genome assembly pipeline, much of which has not been tested on large genomes • I hope to use programs like the UMD overlapper and Figaro to see how these programs affect my assembly Mihai Pop James White
Acknowledgements • James Yorke, Aleksey Zimin, and the Genome Group for advising me on the nature of this project • Steven Salzberg, Art Delcher, and Adam Phillippy for giving lectures and producing slides on genome assembly topics • Gene Myers's paper on Drosophila • Serafim Batzoglou's paper on Arachne • Wikipedia