Repetitive and Duplicitous Structure of Genomes Jeff Bailey S5-432
Human Genome Structure
Heterochromatic sequence (tandem satellite repeats): centromeric alpha-satellite, telomere TTAGGG, acrocentric rRNA and beta-satellite
Euchromatic sequence: ~3.1 gigabases
• Genes (35%): ~25,000
• Exons (1%) (transcription is more ubiquitous: ENCODE)
Repetitive sequences
• 3% simple sequence repeats (poly-A runs, dinucleotide and trinucleotide repeats)
• 45% interspersed repetitive elements

Repetitive element               Size         Copies      Fraction
LINE elements (retrotransposon)  up to 8 kb   850,000     21%
Alu elements (retrotransposon)   300 bp       1,500,000   13%
LTR-retrovirus-like              6-11 kb      450,000     8%
DNA transposons                  1-3 kb       300,000     3%

(International Human Genome Sequencing Consortium. Nature 2001)
Vast majority of sequence is non-coding and repetitive.
Centromeric Sequence
• Human: 171 bp alpha-satellite in arrays of 2-5 Mb
  • Higher-order structure (only in great apes): 4-20, 4-30 k-mer (A-B-C-D-A-B-C-D-A-B-C-D)
  • A-B-C-D to A-B-C-D: 2-5% divergence; among monomers A-D: 20-40% divergence
  • Further flanked by other satellites (beta-satellite)
• Mouse: 234 bp major satellite (6 Mb) and a 120 bp minor satellite (600 kb) at the centromeric constriction
• Arabidopsis: 178 bp satellite in 3 Mb array
• Drosophila: 5 bp simple arrays of AATAT and AAGAG
• C. elegans: holocentric (entire chromosome acts as centromere)
• Yeast: CEN3, 1-2 kb of 83 bp repeat
Simple sequence repeats (SSRs)
Example: ATGATGATGATG
• SSR: perfect or slightly imperfect tandem repeats of a particular k-mer
• About 3% of the human genome (~0.5% by dinucleotide)
• Derived from slippage during DNA replication
• Microsatellites: n = 1-13 bases
• Minisatellites: n = 14-500 bases
(Table columns: repeat unit; number of SSRs per Mb)
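The SSR definition above can be illustrated with a small scan for perfect tandem repeats. This is only a sketch: the function name `find_ssrs`, the 13-base unit limit (taken from the microsatellite range on the slide), and the minimum of three copies are assumptions for illustration, and slightly imperfect repeats are not handled.

```python
import re

def find_ssrs(seq, max_unit=13, min_copies=3):
    """Return (start, unit, copies) for perfect tandem repeats in seq.

    Hypothetical helper: thresholds are illustrative, not from the lecture.
    """
    # The lazy group captures the shortest repeat unit that works;
    # the back-reference \1 requires that unit to recur immediately.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        copies = len(m.group(0)) // len(unit)
        hits.append((m.start(), unit, copies))
    return hits

# The slide's example repeat, embedded in flanking sequence.
ssrs = find_ssrs("CCATGATGATGATGCC")
```

Here `ssrs` holds one hit: the ATG unit starting at position 2 with four copies. Real SSR annotation tools also tolerate mismatches and indels, which a plain regular expression does not.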
Interspersed Repeats DNA transposons “extinct” in primate lineage (~40 mya). Quiescent in mammalian lineages.
Variation in Relative Content Annu Rev Genet. 2007; 41: 331–368. Sc: Saccharomyces cerevisiae; Sp: Schizosaccharomyces pombe; Hs: Homo sapiens; Mm: Mus musculus; Os: Oryza sativa; Ce: Caenorhabditis elegans; Dm: Drosophila melanogaster; Ag: Anopheles gambiae, malaria mosquito; Aa: Aedes aegypti, yellow fever mosquito; Eh: Entamoeba histolytica; Ei: Entamoeba invadens; Tv: Trichomonas vaginalis.
DNA Transposons Copy / paste
Human Retrotransposons
Serial evolution of master elements
• L1: 80-100 active L1s (6 hot L1-Ta)
• Alu: 143 active elements; Alu Yb (punctuated): 2000 copies, only a handful in other primates
• SVA (~25 mya): pol II, 3000 copies
• New integration: L1 and Alu ~1 in 20 meioses; SVA 1 in 90
(Figure labels: Pol II, Pol III, Pol III)
Mouse vs. Human MGSC Nature, Volume 420, Issue 6915, pp. 520-562 (2002).
Biological Impact of Retrotransposons Cordaux and Batzer, Nature Reviews Genetics 10, 691-703 (October 2009)
Biological Importance (cont.)
• Boundary / insulator elements
• Alternative splicing / novel exons / novel genes
• Role in suppression of pol II transcription in cellular stress
What accounts for long-term maintenance?
Types of Duplications
• Whole Genome Duplication: ancient (4N, 2N)
• Segmental Duplications: tandem or interspersed; interchromosomal or intrachromosomal
Susumu Ohno
2n → 4n → rearrangement → 2n
• Whole Genome Duplication
• Vertebrate Paradigm: ancient whole genome duplications and recent tandem duplications
• (review: Panopoulou (2005) TIG 10:560)
• KEY CONCEPT: New genes usually derived from copies
• Paralogy: two genes/proteins in the same species which share sequence similarity due to duplication.
• Orthology: two genes/proteins in different species which share sequence similarity and are descended from a common ancestor.
• Xenology: introduction of a new sequence into the genome by horizontal transfer between two species.
Segmental Duplications
Segmental duplication (SD) → time (1-50 mya) → time (100s mya)
Key raw material for the evolution of novel genes
(Figure legend: repetitive element; exon)
Segmental Duplications (SD)
• chr22: 99.1% identical over 180 kb (velocardiofacial (VCF)/DiGeorge syndrome, 1 in 3000 births)
• 5.4% of the genome (>90% identity and >1 kb)
• Properties: clustered; complex regions; dynamic regions
Bailey and Eichler (2006) Nat Rev Genet
SDs Underlie Recurrent Germline Deletions and Duplications
Non-allelic Homologous Recombination (Lupski, 1999)
(Figure: duplicated copies D and D' flank interval I between centromere (Cen) and telomere (Tel); recombination between the non-allelic copies produces gametes in which I is duplicated or deleted.)
• Change in dosage-sensitive genes → phenotype or disease
• Dynamic regions: predisposed to further rearrangements
Detection of Segmental Duplications: Whole genome assembly comparison (Figure 1)
1. Identify high-copy repeats
2. Splice out
3. BLAST comparisons, allowing for large gaps
4. Reinsert repeats
5. Heuristic end trimming
6. Global alignments
7. Analyze alignments (>1 kb; >90% identity)
Human draft: regions of SD poorly assembled (collapsed) and many unique regions with unmerged overlaps (allelic)
(Bailey et al. Genome Res 2001)
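The final filtering step of the pipeline above (>1 kb; >90% identity) can be sketched as a simple predicate over alignment records. This is an illustrative sketch only: the `Alignment` record and its field names are hypothetical, and the upstream repeat masking, BLAST, and global alignment steps are assumed to have already produced the aligned length and match count.

```python
from typing import NamedTuple

class Alignment(NamedTuple):
    """Hypothetical summary of one pairwise global alignment."""
    query: str     # name of the first sequence
    subject: str   # name of the second sequence
    length: int    # aligned length in bp
    matches: int   # number of identical aligned positions

def percent_identity(aln):
    """Identity as a percentage of aligned positions."""
    return 100.0 * aln.matches / aln.length

def is_segmental_duplication(aln, min_len=1000, min_identity=90.0):
    """Apply the slide's thresholds: longer than 1 kb and above 90% identity."""
    return aln.length > min_len and percent_identity(aln) > min_identity

candidates = [
    Alignment("chrA:1", "chrB:7", 5000, 4900),  # 98% identity over 5 kb
    Alignment("chrA:2", "chrB:8", 800, 800),    # too short
    Alignment("chrA:3", "chrB:9", 2000, 1700),  # 85% identity, too divergent
]
kept = [a for a in candidates if is_segmental_duplication(a)]
```

Only the first candidate passes both thresholds. The thresholds are parameters so the stricter cutoffs used elsewhere in the slides (for example >2 kb) can be substituted.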
Genome Wide Detection Problem: Allelic/True Overlap vs. Duplication
Shotgun Sequence: assembly-independent detection of high-identity SD
• Whole genome shotgun sequence: random sample; Celera (27.1 M reads)
• Align reads to public sequence: >96% identity
• Examine all public sequence
• (Figure labels: absent SD (collapsed or missing); false positive SD; 99.8%)
• Combined with whole-genome assembly comparison: 5.4% of the human genome composed of SDs >1 kb and >90% identity
Bailey et al. Science 2002
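The idea behind read-depth detection is that randomly sampled shotgun reads pile up on regions present in more than one copy. A minimal sketch follows, assuming reads have already been aligned and reduced to start coordinates; the function names and the threshold value are hypothetical, and only the 5 kb window size is taken from the accompanying figures.

```python
from collections import Counter

def window_counts(read_starts, window=5000):
    """Count aligned reads per fixed-size window, keyed by window index."""
    return Counter(pos // window for pos in read_starts)

def flag_excess_depth(counts, threshold):
    """Return sorted window indices whose read count exceeds the threshold."""
    return sorted(w for w, n in counts.items() if n > threshold)

# Toy coordinates: three reads fall in window 0, one each in windows 1 and 2.
starts = [100, 200, 4999, 5000, 12000]
counts = window_counts(starts)
duplicated_windows = flag_excess_depth(counts, threshold=2)
```

In practice the threshold would be derived from the depth distribution over known unique sequence rather than fixed by hand.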
(Figure: Xq28 donor region; tracks: repeats, public sequence, Celera reads; y-axis: number of reads per 5 kb.)
Depth of Coverage vs. Copy Number
(Figure: coverage (number of reads per 5 kb window) vs. diploid copy number of duplication; R² = 0.96.)
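Because depth of coverage tracks copy number almost linearly (R² = 0.96), read depth can in principle be converted into a copy-number estimate. The sketch below assumes a least-squares line through the origin; the function names and the calibration points are made up for illustration and are not the data behind the figure.

```python
def fit_reads_per_copy(calibration):
    """Least-squares slope through the origin for (copy_number, reads) pairs."""
    sxy = sum(copies * reads for copies, reads in calibration)
    sxx = sum(copies * copies for copies, _ in calibration)
    return sxy / sxx

def estimate_copy_number(reads_in_window, reads_per_copy):
    """Convert a window's read count into an estimated diploid copy number."""
    if reads_per_copy <= 0:
        raise ValueError("reads_per_copy must be positive")
    return reads_in_window / reads_per_copy

# Hypothetical calibration loci with known diploid copy number.
calibration = [(2, 40), (4, 80), (10, 200)]
slope = fit_reads_per_copy(calibration)          # reads per copy per window
copies = estimate_copy_number(120, slope)
```

With these toy points the slope is 20 reads per copy, so a window with 120 reads is estimated at 6 copies. A fit with an intercept would be needed if the real calibration line did not pass through the origin.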
Global Alignments filtered with SDD
(Figure: duplicated bases (% of total chromosome) for chromosomes 1-22, X, and Y, initial vs. filtered; labelled values range from 2.1% to 68.6%.)
SD “Hotspot” Map of Human Genome • 130 candidate regions (298 Mb) • 23 associated with genetic disease Bailey et al. Science 2002 Interrogation of these regions has led to detection of 16 additional pathogenic rearrangements, including new microdeletions on 1q21.1, 15q13, 15q24 and 17q12. (Sharp et al. Nat Genet 2006; Mefford et al. Am J Hum Genet 2007; Mefford et al. N Engl J Med 2008)
Genetic Distance
(Figure: total aligned bases (kbp) vs. genetic distance (K, 0.01-0.10), intrachromosomal vs. interchromosomal; panels: finished sequence and Sept 2000 NT data set (>2 kb; >90%; no X-Y).)
Species SDs Marques-Bonet et al. TIG 2009
Duplicated Genes
Gene enrichments:
• Immunological
• Environmental response
• Reproduction: sperm-egg interactions
Johnson et al. 2001 Nature
Organizing the MESS Jiang et al. 2007 Nat Genet 39:1361-8
437 Hubs Jiang et al. 2007 Nat Genet 39:1361-8
Mechanism: Junction Content
(Figure labels: control (+/- 1 kb); junction (50 bp))
• Duplications >95% and <99.5%
• Only finished sequence
• Enrichment for Alu elements
Alu Proximity to Junctions
(Figure: average Alu content (bp) per 10 bp window, 5% to 25%, vs. center of window (-500 to +500 bp from junction); unique vs. duplicated sequence.)
Alu Simulation
(Figure: histogram of number of replicates vs. proportion Alu (%); 23.8% marked.)
Computer simulations to determine significance.
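One common way to run such a simulation is a Monte Carlo test: repeatedly sample random windows, record the Alu proportion in each replicate, and ask how often it reaches the observed junction value. The sketch below is an assumption about the general approach, not the procedure used in the lecture; the function names, the per-base 0/1 annotation, and the toy data are hypothetical.

```python
import random

def alu_fraction(annotation, starts, window):
    """Fraction of bases annotated as Alu across the given windows."""
    covered = sum(sum(annotation[s:s + window]) for s in starts)
    return covered / (len(starts) * window)

def simulate_p_value(annotation, observed, n_windows, window,
                     replicates=1000, seed=0):
    """Empirical p-value: share of random replicates at or above `observed`."""
    rng = random.Random(seed)
    last_start = len(annotation) - window
    hits = 0
    for _ in range(replicates):
        starts = [rng.randint(0, last_start) for _ in range(n_windows)]
        if alu_fraction(annotation, starts, window) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (replicates + 1)

# Toy annotation: 1 marks an Alu base, 0 a non-Alu base (10% Alu overall).
annotation = [1] * 100 + [0] * 900
p = simulate_p_value(annotation, observed=0.5, n_windows=20, window=50)
```

A small `p` indicates that random windows rarely contain as much Alu sequence as the junctions do. A realistic control would also match the sampled windows to the junctions for properties such as GC content and chromosome.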
Subfamily Enrichment
(Figure: number of elements (0-100,000) for AluY, AluS, and AluJ vs. time (mya); lineage labels: mammal, prosimian, New World, Old World, orangutan, gorilla, chimp, human.)

Identity  AluY  AluS  AluJ
≥90%      1.8   1.9   1.1
≥95%      2.2   1.8   1.1
Whole Genome Duplication Yeast Kellis and Lander (Nature 428:617-24 2004)
Explore Resources
Remainder of class: exercises for analysis of repetitive elements and segmental duplications