Exploring Human Population Genomics: Insights on Evolutionary Forces and Chromosomal Duplications

Human Population Genomics Man,Woman, Birth,  Death,  Infinity,Plus  Altruism,  Cheap Talks,  Bad Behavior,¥ Money, God and  Diversity on Steroids

Jack Schwartz (1930 – 2009)

Lord Jeffrey (misattributed; badly paraphrased) “Damn the Human Genomes. Small populations; Genes too distant; Pestered with duplications; Feeble contrivance; Could make a better one myself!”

Small Populations • Non-equilibrium Models • Population Bottlenecks • Not Well-mixed • Migration/Colonization Patterns • Catastrophic Infections • Heterozygous Advantages

Wright-Fisher Process [Figure: N individuals per generation; ancestral and derived alleles; mutation; derived allele extinction!] Mickey (Coalescent talk)
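As a rough illustration of the Wright-Fisher process, the sketch below follows the number of derived-allele copies in a haploid population of N individuals, redrawing every individual's allele from the previous generation at each step. The function name `wright_fisher` and its parameters are hypothetical, and mutation and selection are deliberately left out, so the derived allele can only drift to extinction or fixation.

```python
import random


def wright_fisher(n_individuals, initial_derived, generations, seed=None):
    """Simulate neutral drift of a derived allele (illustrative sketch only).

    Assumptions not taken from the slides: haploid individuals, constant
    population size, no mutation, no selection.
    Returns the derived-allele count for each simulated generation.
    """
    rng = random.Random(seed)
    count = initial_derived
    trajectory = [count]
    for _ in range(generations):
        # Stop once the derived allele is extinct or fixed.
        if count == 0 or count == n_individuals:
            break
        p = count / n_individuals
        # Each offspring picks a parent uniformly at random (binomial sampling).
        count = sum(1 for _ in range(n_individuals) if rng.random() < p)
        trajectory.append(count)
    return trajectory


if __name__ == "__main__":
    # A single new mutation in a population of 100 is usually lost quickly.
    print(wright_fisher(100, 1, 500, seed=42))
```

Starting from a single copy, most runs end at zero within a few generations, which is the "derived allele extinction" outcome named on the slide.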

Moran Process [Figure: death events along a time axis] • Overlapping generations • Distribution of time to replication
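The Moran process can be sketched in the same spirit: at each step one individual is chosen to replicate and one to die, so generations overlap and the derived-allele count changes by at most one. This is a discrete-step simplification; the name `moran_process` and its parameters are hypothetical, and the continuous-time waiting distribution mentioned on the slide is not modeled.

```python
import random


def moran_process(n_individuals, initial_derived, max_steps, seed=None):
    """Discrete-step neutral Moran process (illustrative sketch only).

    Assumptions not taken from the slides: haploid individuals, the parent
    and the dying individual are drawn independently and uniformly, and
    there is no mutation or selection.
    """
    rng = random.Random(seed)
    count = initial_derived
    trajectory = [count]
    for _ in range(max_steps):
        if count == 0 or count == n_individuals:
            break
        frac = count / n_individuals
        parent_is_derived = rng.random() < frac
        dead_is_derived = rng.random() < frac
        # One birth and one death per step: the count moves by -1, 0, or +1.
        count += int(parent_is_derived) - int(dead_is_derived)
        trajectory.append(count)
    return trajectory


if __name__ == "__main__":
    print(len(moran_process(50, 25, 100_000, seed=7)))
```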

Forces in Population Genetics • How to understand forces that produce and maintain inherited genetic variation • Forces • Mutation • Recombination • Natural Selection • Population Structure/Migration • Random birth/death (drift)

Genes Too Distant • 20,000 genes (estimate in the 1980s: 120,000) • Occurring about every 150 Kb • Many more functional ncRNAs: snoRNA, siRNA, piRNA, etc. • Uncharacterized

Y • “From a gene’s point of view, reshuffling is a great restorative…” • “The Y, in its solitary state disapproves of such laxity. Apart from small parts near each tip which line up with a shared section of the X, it stands aloof from the great DNA swap. Its genes, such as they are, remain in purdah as the generations succeed. As a result, each Y is a genetic republic, insulated from the outside world. Like most closed societies it becomes both selfish and wasteful. Every lineage evolves an identity of its own which, quite often, collapses under the weight of its own inborn weaknesses.” • “Celibacy has ruined man’s chromosome.” • Steve Jones, Y: The Descent of Men, 2002.

DAZ locus on Y Chromosome

Optical Mapping • Capture and immobilize whole genomes as massive collections of single DNA molecules • Cells are gently lysed to extract genomic DNA • DNA is captured in parallel arrays of long single DNA molecules using a microfluidic device • Genomic DNA, captured as single DNA molecules, is produced by random breakage of intact chromosomes

Overlapping single molecule maps are aligned to produce a map assembly covering an entire chromosome.

Sizing Error (Bernoulli labeling, absorption cross-section, PSF) • Partial Digestion • False Optical Sites • Orientation • Spurious molecules, Optical chimerism, Calibration • Image of a restriction-enzyme-digested YAC clone: YAC clone 6H3, derived from human chromosome 11, digested with the restriction endonucleases Eag I and Mlu I, stained with a fluorochrome and imaged by fluorescence microscopy.
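To make these error sources concrete, the following sketch turns a list of true restriction sites into one noisy single-molecule map. It models partial digestion, false optical sites, unknown orientation, and sizing error. Everything here is an assumption made for illustration: the function name `simulate_optical_map`, the parameter names, the uniform placement of false cuts, and the Gaussian sizing noise whose standard deviation grows with the square root of fragment length. Spurious molecules, chimerism, and calibration are not modeled.

```python
import random


def simulate_optical_map(site_positions, molecule_length, digest_rate=0.8,
                         n_false_cuts=1, sizing_sd=0.0, seed=None):
    """Return noisy fragment lengths for one molecule (illustrative sketch)."""
    rng = random.Random(seed)
    # Partial digestion: each true site is cut with probability digest_rate.
    observed = [s for s in site_positions if rng.random() < digest_rate]
    # False optical sites: extra cuts placed uniformly (an assumption).
    observed += [rng.uniform(0, molecule_length) for _ in range(n_false_cuts)]
    # Orientation: the molecule may be read in either direction.
    if rng.random() < 0.5:
        observed = [molecule_length - s for s in observed]
    cuts = [0.0] + sorted(observed) + [float(molecule_length)]
    fragments = [b - a for a, b in zip(cuts, cuts[1:])]
    # Sizing error: Gaussian noise with sd proportional to sqrt(length).
    return [max(0.0, rng.gauss(f, sizing_sd * f ** 0.5)) for f in fragments]


if __name__ == "__main__":
    true_sites = [12.0, 40.0, 55.0, 90.0]  # hypothetical positions in kb
    for i in range(3):
        print(simulate_optical_map(true_sites, 120.0, sizing_sd=0.3, seed=i))
```

Aligning many such molecules back into one consensus map is the hard part, since each molecule may miss true sites, carry false ones, and arrive flipped.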

Various combinations of error sources lead to NP-hard problems.

Pestered with duplications • Complex Genome Structures • Segmental Duplications • Many types of Polymorphisms (SNPs, CNVs, SVs, etc.) • Models of Genome Dynamics • GOD (Genome Organizing Devices) • Models of Coalescence

Segmental Duplications • Segmental duplications have been found to be associated with genomic disorders. • Deletions: Williams-Beuren syndrome • Duplications: Charcot-Marie-Tooth disease type 1A • Inversions: Haemophilia A • Translocations: Derivative 22 [der(22)] syndrome. • Segmental duplications may be related to cancer development by causing copy number fluctuations • Duplication of myc in lung cancer, and ERBB2 in breast cancer.

Recent Segmental Duplications: Human • 3.5% to 5% of the human genome is found to contain segmental duplications, with length > 5 or 1 kb and identity > 90%. • August 2001 assembly [Bailey, et al. 2002]. • April 2003 assembly [Cheung, et al. 2003]. • These duplications are estimated to have emerged about 40 Mya under a neutral assumption. • The duplications are mostly interspersed (non-tandem), and happen both inter- and intra-chromosomally. • From [Bailey, et al. 2002]

Recent Segmental Duplications: Mouse • 1.2% of the mouse genome is found to contain segmental duplications, with length > 5 kb and identity > 90%. • February 2003 mouse assembly [Cheung, et al. 2003]. • These duplications are estimated to have emerged about 25 Mya under a neutral assumption. • The duplications happen both inter- and intra-chromosomally. • From [Cheung, et al. 2003]

Duplication Flanking Sequences • What are the molecular mechanisms that caused the recent segmental duplications in the human and mouse genomes? • Thermodynamic instability in the DNA sequences; • Recombination between homologous repeat elements; • Other unknown mechanisms.

Thermodynamics [Figure: control vs. data; axis from -512 bp to +512 bp around the 5’-breakpoint and 3’-breakpoint of the duplicated region]

Frequencies of the repeats [Figure: control set vs. data set, by repeat family and divergence] • SINE (divergence): Alu-Jb (14%), Alu-Sc~Sx (8%), Alu-Y (5%), Alu-Ya~Yb (>1%), MIR (30%), FLAM/FRAM (20%), Alu-Jo (14%) • LINE (divergence): L2 (30%), L1M4 (22%), L1M3 (21%), L1M2 (19%), L1M1 (18%), L1P5 (12%), L1P4 (11%), L1P3 (7%), L1P2 (4%), L1P1 (2%), L1Hs (<1%)

The Model [Figure: flanking states f--, f+-, and f++ linked by insertion and by deletion or mutation; duplications arise by recombination between repeats, or by recombination between other repeats or other mechanisms; mutations then accumulate in the duplicated sequences]

The Mathematical Model [Figure: Markov chain over the flanking states f--, f+-, and f++ across the divergence bins 0 ≤ d < ε, ε ≤ d < 2ε, ..., (k-1)ε ≤ d < kε, with time after duplication on the horizontal axis. Self-transition probabilities: 1-α-2β (f--), 1-α-β/2-γ (f+-), 1-α-2γ (f++). Transitions between states: 2β and γ between f-- and f+-; β/2 and 2γ between f+- and f++; α between successive divergence bins. Initial proportions h0--, h1--, h0+-, h0++, and h1++, grouped under H0 (h0) and H1 (h1).] • h1: proportion of duplications by repeat recombination • h1++: proportion of duplications by recombination of the specific repeat • h1--: proportion of duplications by recombination of other repeats • h0: proportion of duplications by other repeat-unrelated mechanism • h0++: proportion of h0 with common specific repeat in the flanking regions • h0+-: proportion of h0 with no common specific repeat in the flanking regions • h0--: proportion of h0 with no specific repeat in the flanking regions • α: mutation rate in duplicated sequences • β: insertion rate of the specific repeat • γ: mutation rate in the specific repeat • d: divergence level of duplications • ε: divergence interval of duplications
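One way to read the diagram is as a discrete-time Markov chain over pairs (divergence bin, flanking state). The sketch below advances such a chain by one step using the rates α, β, and γ defined above. The arrow directions are an interpretation, not something the slide states outright: 2β for f-- to f+-, γ for f+- to f--, β/2 for f+- to f++, 2γ for f++ to f+-, and α for moving to the next divergence bin. They were chosen because they make every state's outgoing probabilities sum to one alongside the listed self-transitions. The function name `step_model` and the absorbing last bin are hypothetical choices.

```python
def step_model(dist, alpha, beta, gamma):
    """Advance the flanking-state distribution by one time step (sketch).

    dist[k] = [p(f--), p(f+-), p(f++)] for divergence bin k.
    Assumed transitions (an interpretation of the slide's diagram):
      any state -> same state in the next bin : alpha
      f-- -> f+- : 2*beta      f+- -> f-- : gamma
      f+- -> f++ : beta/2      f++ -> f+- : 2*gamma
    The last bin absorbs further divergence (a modeling convenience).
    """
    n_bins = len(dist)
    new = [[0.0, 0.0, 0.0] for _ in range(n_bins)]
    for k, (mm, pm, pp) in enumerate(dist):
        nxt = min(k + 1, n_bins - 1)
        # Mutation in the duplicated sequence moves mass to the next bin.
        new[nxt][0] += alpha * mm
        new[nxt][1] += alpha * pm
        new[nxt][2] += alpha * pp
        # Repeat insertion and repeat mutation change the flanking state.
        new[k][0] += (1 - alpha - 2 * beta) * mm + gamma * pm
        new[k][1] += (2 * beta * mm
                      + (1 - alpha - beta / 2 - gamma) * pm
                      + 2 * gamma * pp)
        new[k][2] += (beta / 2) * pm + (1 - alpha - 2 * gamma) * pp
    return new


if __name__ == "__main__":
    # Hypothetical starting proportions and rates, for illustration only.
    dist = [[0.5, 0.2, 0.3]] + [[0.0, 0.0, 0.0] for _ in range(4)]
    for _ in range(100):
        dist = step_model(dist, alpha=0.01, beta=0.002, gamma=0.005)
    print([[round(p, 4) for p in row] for row in dist])
```

Fitting the initial proportions so that the evolved distribution matches observed flanking-repeat frequencies per divergence bin would then weigh the recombination hypothesis against the alternatives.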

Model Fitting [Figure: f--, f+-, and f++ against diversity, for Alu and for L1] • The model parameters (αAlu, βAlu, γAlu, αL1, βL1, γL1) are estimated from the mutation and insertion rates reported in the literature. • The relative strengths of the alternative hypotheses can be estimated by fitting the model to the real data. • h1++Alu ≈ 0.3; h1++L1 ≈ 0.35.

Mer Frequencies [Figure: Chr1 tracks: Ns, ATs, Reps, MER57A, L1P, CDs, ΔG, Dup, Copy#, Mer Freq]

Copy Number Variation Data • HapMap data • China: 46 people • Japan: 45 people • Utah, European origin: 90 people • Yoruba: 89 people • Made available to us by Drs. Evan Eichler and Andy Sharp

CNVs in Unique regions OR

CNVs in Unique regions

CNVs in SD regions AND

CNVs in SD regions • Unique and SD regions show completely different behavior of CNVs!

Distance-dependent recombination The chance of recombination depends on the distance between Allele A and its copy

Simulation (probabilistic model)

Observations & Conclusions • A mutation rate of 0.0001 and a recombination rate of 0.001 in SD regions give the best fit to the observed real-life data. • Single mutations cannot explain the observed data, which can instead be explained by convergence via recombination. • Evolution-by-Duplication (EBD) appears to play a crucial role in evolution and molds the genetic circuitry in a rather constrained way, before it is subject to selection pressure.
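A toy version of such a probabilistic simulation is sketched below. It uses the mutation rate (0.0001) and recombination rate (0.001) quoted above as defaults, and lets the chance of recombination fall off with the distance between an allele and its copy. The slides do not give the functional form of that dependence, so the exponential decay, the `length_scale_bp` parameter, and the function names are assumptions made purely for illustration.

```python
import math
import random


def recombination_probability(distance_bp, base_rate=0.001,
                              length_scale_bp=1_000_000):
    """Assumed exponential decay of recombination chance with distance."""
    return base_rate * math.exp(-distance_bp / length_scale_bp)


def simulate_copy_pair(distance_bp, generations, mutation_rate=0.0001,
                       seed=None):
    """Track whether an allele and its duplicate stay identical (toy model).

    A mutation makes the two copies differ; a recombination event between
    them makes them converge again. Both rules are simplifying assumptions.
    """
    rng = random.Random(seed)
    p_rec = recombination_probability(distance_bp)
    identical = True
    history = []
    for _ in range(generations):
        if rng.random() < mutation_rate:
            identical = False
        if rng.random() < p_rec:
            identical = True
        history.append(identical)
    return history


if __name__ == "__main__":
    for distance in (10_000, 1_000_000, 10_000_000):
        history = simulate_copy_pair(distance, 200_000, seed=1)
        print(distance, sum(history) / len(history))
```

Under these assumptions, nearby copies spend more time identical than distant ones, which is the qualitative behavior a distance-dependent model is meant to capture.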

Feeble Contrivance • GWAS (Genome-Wide Association Studies) • Common Variants vs. Rare Variants • Haplotype Phasing/Linkage Analysis • Poor Experiment Design • Reference Sequences • Genotypic vs. Haplotypic References • Weak Technologies

Common vs. Rare Disease Variants • From Ionita-Laza (2009) • There are two disease models: • CDCV: common disease, common variants • CDRV: common disease, rare variants • Current genome-wide association studies consider only common variants (frequency at least 5%). • Feasible with available resources • The common loci identified so far have small effects (ORs 1.1 to 1.5) and explain only a small percentage of the estimated heritability. • Rare susceptibility variants are expected to play an important role: • population genetics theory (Pritchard, 2001) • empirical evidence (BMI, blood pressure, autism, Mendelian diseases etc.)

Effect Size Distribution

Capture-Recapture Model • Suppose we have sequence data on Nind individuals in a genomic region. • An individual shows variation at a position if the corresponding allele is different from the ancestral one. • A position is variable, or is a variant, if at least one individual in the dataset shows variation at that position. • Let x_s be the number of individuals with variation at position s; for a variant, x_s > 0. • Question: what is N, the total, unknown number of variants in the region?

One can estimate the following: • Δ(t) = number of new variants expected to be found in a future dataset of size t · Nind. • t is a multiplier of the initial dataset size, Nind. • Δf(t) = number of new variants with frequency at least f . . .
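As a concrete, simplified illustration of estimating Δ(t), the sketch below uses the classical Good-Toulmin nonparametric estimator from the species-sampling literature. This is a stand-in chosen for brevity and is not necessarily the estimator used by Ionita-Laza et al. It takes the per-position counts (the number of individuals showing variation at each position) and is generally considered reliable only for t ≤ 1. The function name `good_toulmin_delta` is hypothetical.

```python
from collections import Counter


def good_toulmin_delta(variant_counts, t):
    """Estimate the number of new variants in a future sample of size t * Nind.

    variant_counts: for each observed position, the number of individuals
    showing variation there. Zero entries are ignored.
    Uses Delta(t) = sum over k of (-1)**(k + 1) * t**k * n_k, where n_k is
    the number of variants seen in exactly k individuals (Good-Toulmin).
    """
    freq_of_freq = Counter(x for x in variant_counts if x > 0)
    return sum((-1) ** (k + 1) * (t ** k) * n_k
               for k, n_k in freq_of_freq.items())


if __name__ == "__main__":
    # Hypothetical counts: many singletons suggest many variants still unseen.
    counts = [1] * 40 + [2] * 15 + [3] * 8 + [4] * 5 + [7] * 2
    for t in (0.25, 0.5, 1.0):
        print(t, good_toulmin_delta(counts, t))
```

The estimate is driven mostly by singletons and doubletons, which matches the intuition that rare variants dominate what a larger sample would newly uncover.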

ENCODE dataset • Ten 500Kb genomic regions were sequenced in several unrelated DNA samples: • 8 Yoruba (YRI) • 16 CEPH European (CEPH) • 7 Han Chinese (CHB) • 8 Japanese (JPT) • To make results comparable across the four populations (YRI, CEPH, CHB and JPT), they considered only 7 of the sequenced individuals for each dataset.

ENCODE - Δf(t) • From Ionita-Laza et al. 2009

How to Make a Better Human? • Debugging a human better • Sequencing a genome • Sequencing a population

Single Molecule Approach to Sequencing-by-Hybridization: S*M*A*S*H

S*M*A*S*H • Sequence a human-size genome of about 6 Gb, including both haplotypes. • Integrate: • Optical Mapping (Ordered Restriction Maps) • Hybridization (of short nucleobase probes [PNA or LNA oligomers] with dsDNA on a surface) • Positional Sequencing by Hybridization (efficient polynomial-time algorithms to solve “localized versions” of the PSBH problems)

Fig 1 • Genomic DNA is carefully extracted.

Fig 2 • LNA probes of length 6 to 8 nucleotides are hybridized to dsDNA (double-stranded genomic DNA). • The modified DNA is stretched on a 1” x 1” chip.

Fig 3 • DNA adheres to the surface along the channels and stretches out. • Molecules range from 0.3 to 3 million base pairs in length. • Bright emitters are attached to the probes and imaged (Fig 3).

Fig 4 • A restriction enzyme breaks the DNA at specific sites. • The cut fragments of DNA relax like entropic springs, leaving small visible gaps.

Fig 5 • The DNA is then stained with a fluorogen (Fig 5) and reimaged. • The two images are combined in a composite image, suggesting the locations of a specific short word (e.g., probes) within the context of a pattern of restriction sites.

Fig 6 • The integrated intensity measures the length of the DNA fragments. • The bright emitters on the probes provide a profile of the probe locations. • The restriction sites are represented by tall rectangles and the probe sites by small circles.
