250 likes | 424 Views
Vicky Choi Department of Genetics Case Western Reserve University. The Sequence and Assembly of the Highly Duplicated Regions of the Human Genome. Repeats of the Human Genome. High-copy repeats. e.g. ALU, L1. Segmental duplications (low-copy repeats). Large block (>200Kb)
E N D
Vicky Choi Department of Genetics Case Western Reserve University The Sequence and Assembly of the Highly Duplicated Regions of the Human Genome
Repeats of the Human Genome • High-copy repeats e.g. ALU, L1 • Segmental duplications (low-copy repeats) • Large block (>200Kb) • Highly Similar (>97%) • Without characteristic sequences
Identification of Duplications Sequence Assembly vs. Repeats vs. Sequence Assembly • HGP Goal: To provide an accurate reference sequence of the euchromatic portions of the genome
Public: Clone-Based Sequencing Celera: Whole-Genome Shotgun Sequencing Human Genome Human Genome 500bp read BAC library (100-200Kb) BAC 100-200Kb fragment Draft, Finished Duplications Detection Method: Public Clones + Celera Random Reads Jeff Bailey & Celera Genomics
5-10 copies 10-20 copies ~40 copies Over-representation of Celera reads in the duplicated regions of clones
500 450 400 350 300 Coverage Duplicated 250 X Chromosome 200 Autosome 150 100 50 0 0 2 4 6 8 10 12 14 Copy # of Duplication Depth of Coverage vs. Copy Number of Duplication
47 At 81 100 5kb WGS Read Cov 200 400 Unique Duplication Duplicon Transition (Unique to Repeat)
Segmental Duplication Database(SDD) • Scanned 32,610 clones • Statistics: 8,595 regions (2,972 clones) with 130.4 Mb—duplications 94-100%. • Confirmation: • Detected 42/43 FISH positive clones • Detected 21/23 Sequence duplications
True Overlap Repeat-induced Overlap True Overlap vs. Repeat-induced Overlap
A overlaps with B B B B A A A A A overlaps with C C C Yes. Consistent Overlap Inconsistent Overlap No. C Consistent Overlaps Does B overlap with C?
Necessary… But Not Sufficient Long Repeats Under-represented
Basic Idea of the Assembly Algorithm 1.Classify Overlaps: U-Overlaps & D-Overlaps 2.“Conservatively” assemble fragments into subcontigs 2.1 Assemble consistent U-Overlapping fragments 2.2 Assemble the overlapping BACs (u-driven)
Clone Graph Algorithm (cont.) 3. Interval Realization of the Clone Graph 4. Orient and order subcontigs
Contig 1 Contig 2 6. Undetermined D-overlaps Algorithm (cont.) 5. Eliminate the “interior” D-overlaps Need Additional Information
SDD: 8,595 regions 13,186 fragments (3,949 BACs) contain duplications (>95% similar) (total length 247 Mbp) 403,937 overlaps = 379,507 (U-overlap) + 24,430 (D-overlap) 8,233 (u-driven) + 2,880 + 10,715 + 2,602(interior) Build 28 (Dec 2001 freeze) (Greg Schuler)
165 interior BAC pairs collapse in Build 28, 89 BACs are not in TPF
Incorporation of Duplications Data Assembled BAC Length Build 28 + SDD Build 28 250K - 300K 418 461 300K - 500K 480 1328 500K - 800K 13 798 800K - 1M 0 248 1M - 2M 0 496 2M - 3M 0 129 3M - 10M 0 259 10M - 20M 0 67 911 3786 BACs contain duplications 140 699
AC011767 AC004583 AC055876 AC025138 AC027267 AC018348 AC026495 AC016033 Example : 15q11-13 ~4Mbp (Devin Locke) (HERC2) Build 28 :
Conclusion • SDD allows to assemble duplicated regions more conservatively, improving the assembly quality • Problematic regions: 165 “collapsed” regions, 89 BACs are not in TPF, “warped” regions • Incorporate additional data: e.g Phrap scores, BAC ends information Paralogous Sequence Variations(PSV)
Jeff Bailey Greg Schuler (NCBI) Zhiping Gu (Celera Genomics) Peter Li (Celera Genomics) Martin Farach-Colton (Rutgers University) Evan Eichler Program in Mathematics and Molecular Biology (PMMB) Burroughs Wellcome (BWF) fellowship