Population genetics

Population genetics. Nicky Mulder. Acknowledgments: some slides are based on a course from Noah Zaitlen.

Contents • Background • Linkage disequilibrium • SNP tagging • Population studies • Accessing data

What is population genetics? Population genetics is the study of genetic variation both within and between human populations. • 5-7% of worldwide human genetic variation is due to genetic differences between human populations. • The remaining 93-95% of human genetic variation is due to genetic variation within human populations (Rosenberg et al. 2002 Science).

Why study population genetics? • Learn about human migration patterns and history • Improve power to identify and localize disease genes • Use differences in linkage disequilibrium for fine-mapping • Avoid false positives due to population stratification • Admixture mapping for diseases with varying prevalence • Signals of natural selection at genes related to disease

Linkage Disequilibrium Definition: Linkage Disequilibrium (LD) refers to correlations between genotypes of nearby markers. Linkage Disequilibrium can be used for association studies.

Linkage Disequilibrium: Example [Figure: genotypes of several individuals at positions along the genome (3 billion letters). SNP 1 and SNP 2: in LD.]

Linkage Disequilibrium: Example [Figure: genotypes of several individuals at positions along the genome (3 billion letters). SNP 1 and SNP 2: in LD. SNP 3: not in LD.]

Linkage Disequilibrium: Example [Figure: genotypes of several individuals coded as 0/1 along the genome (3 billion letters). SNP 1 and SNP 2: r2 = 1, in LD. SNP 3: r2 = 0, not in LD.] r2 is the squared correlation.
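
As a minimal sketch of the r2 measure named on this slide, the snippet below computes the squared Pearson correlation between two 0/1-coded SNPs. The function name `r_squared` and the allele vectors are hypothetical illustrations, not the data shown in the figure.

```python
def r_squared(snp_a, snp_b):
    """Squared Pearson correlation between two equal-length 0/1 allele vectors."""
    n = len(snp_a)
    if n == 0 or n != len(snp_b):
        raise ValueError("vectors must be non-empty and of equal length")
    mean_a = sum(snp_a) / n
    mean_b = sum(snp_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(snp_a, snp_b)) / n
    var_a = sum((a - mean_a) ** 2 for a in snp_a) / n
    var_b = sum((b - mean_b) ** 2 for b in snp_b) / n
    if var_a == 0 or var_b == 0:
        raise ValueError("r2 is undefined when a SNP shows no variation")
    return cov * cov / (var_a * var_b)


# Hypothetical chromosomes (one entry per chromosome); not the slide's data.
snp1 = [0, 1, 0, 1, 0, 1, 0, 1]
snp2 = [0, 1, 0, 1, 0, 1, 0, 1]  # same pattern as snp1: perfect LD
snp3 = [0, 0, 1, 1, 0, 0, 1, 1]  # statistically independent of snp1

print(r_squared(snp1, snp2))  # 1.0
print(r_squared(snp1, snp3))  # 0.0
```

Values between 0 and 1 indicate partial LD; real data would use many more chromosomes than this toy example.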

LD: Haplotype Blocks [Figure: 0/1-coded genotypes of several individuals along the genome (3 billion letters), with SNP 1, SNP 2 and SNP 3 marked.] These 3 SNPs form a “haplotype block”.

Haplotype blocks • Variants (alleles) that are located close to one another tend to be inherited together (they are in LD) • This pattern is disrupted by recombination, which shuffles chromosomes • Recombination is finite, and the time since the origin of humans has been insufficient to break down all linkage => haplotype blocks • African chromosomes: 50% of the genome lies in haplotype blocks >22kb. • Europeans and Asians: 50% of the genome lies in haplotype blocks >44kb. • Haplotype blocks are longer in Europeans/Asians because of the out-of-Africa population bottleneck: these populations descend from a small number of ancestors who left Africa 40-60 kya (thousand years ago). Gabriel et al. 2002 Science (also see Reich 2001 Nature, Daly 2001 Nat Genet)

Population bottlenecks Cavalli-Sforza & Feldman 2003 Nat Genet; also see Ramachandran et al. 2005 PNAS, Green et al. 2010 Science (Neanderthal genome), Reich et al. 2010 Nature (Denisova)

Linkage Disequilibrium and tag SNPs Direct association: genotype SNP1 in Cases and Controls. [Figure: 0/1-coded genotypes of Cases and Controls along the genome (3 billion letters); SNP 1 is the causal SNP.]

Linkage Disequilibrium and tag SNPs Indirect association: genotype SNP2 in Cases and Controls. If SNP1 affects disease risk, then SNP2 will also be associated! [Figure: 0/1-coded genotypes of Cases and Controls along the genome (3 billion letters); SNP 1 and SNP 2 have r2 = 1 and are in LD.]
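
To make the indirect-association idea concrete, here is a small sketch that runs an allelic chi-square test (2x2 table of allele counts in cases versus controls) on a causal SNP, on a tag SNP in perfect LD with it, and on a marker in weaker LD. The choice of test, the helper names and the toy allele vectors are all assumptions made for illustration; the slides do not specify a test.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic (1 degree of freedom) for a 2x2 count table.

    Assumes every row and column total is non-zero.
    """
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [table[0][j] + table[1][j] for j in range(2)]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    return stat


def allele_association(alleles, is_case):
    """Chi-square statistic comparing allele counts in cases and controls."""
    case = [a for a, c in zip(alleles, is_case) if c]
    ctrl = [a for a, c in zip(alleles, is_case) if not c]
    table = [
        [sum(case), len(case) - sum(case)],
        [sum(ctrl), len(ctrl) - sum(ctrl)],
    ]
    return chi_square_2x2(table)


# Hypothetical toy data: 10 case and 10 control chromosomes (1 = risk allele).
is_case = [True] * 10 + [False] * 10
snp1 = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # causal
snp2 = list(snp1)  # tag SNP in perfect LD with snp1 (r2 = 1)
snp_weak = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # weaker LD

stat_causal = allele_association(snp1, is_case)
stat_tag = allele_association(snp2, is_case)
stat_weak = allele_association(snp_weak, is_case)
print(stat_causal, stat_tag, stat_weak)
```

In this toy data the tag SNP reproduces the causal SNP's statistic exactly, while the marker in weaker LD gives a smaller statistic.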

LD: Haplotype Blocks [Figure: haplotypes of alternating Case and Control individuals, with the risk haplotype marked.] Question: Which SNP to genotype? Answer: Choose 1 SNP per haplotype block, and take advantage of indirect association! Use known resources.

HapMap for “SNP tagging” • How to select SNPs to genotype in an association study: • Choose genomic region(s) of interest. • Look up HapMap SNPs in the genomic region(s) • Choose a subset of HapMap SNPs which “tag” haplotype blocks in the genomic region(s) • Note: because LD patterns vary by population, it is important to choose tag SNPs using a HapMap population similar to the population in the association study
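
The tag-selection step above can be sketched as a greedy procedure: repeatedly pick the SNP that captures the most not-yet-tagged SNPs at or above an r2 threshold. This is only one possible approach and is not claimed to be the algorithm used by HapMap tools; the threshold of 0.8, the function names and the SNP labels are assumptions.

```python
def r2(a, b):
    """Squared correlation of two 0/1 allele vectors (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / n
    var_b = sum((y - mean_b) ** 2 for y in b) / n
    return 0.0 if var_a == 0 or var_b == 0 else cov * cov / (var_a * var_b)


def greedy_tag_snps(snps, threshold=0.8):
    """Pick tag SNPs so that every SNP has r2 >= threshold with some tag.

    snps: dict mapping a SNP name to its 0/1 allele vector in a reference panel.
    """
    names = list(snps)
    # For each SNP, the set of SNPs it would capture (always including itself).
    captures = {
        a: {b for b in names if a == b or r2(snps[a], snps[b]) >= threshold}
        for a in names
    }
    untagged = set(names)
    tags = []
    while untagged:
        # Choose the SNP covering the most SNPs that still lack a tag.
        best = max(names, key=lambda s: len(captures[s] & untagged))
        tags.append(best)
        untagged -= captures[best]
    return tags


# Hypothetical reference panel: two blocks of perfectly correlated SNPs.
panel = {
    "snpA": [0, 1, 0, 1, 0, 1, 0, 1],
    "snpB": [0, 1, 0, 1, 0, 1, 0, 1],
    "snpC": [0, 1, 0, 1, 0, 1, 0, 1],
    "snpD": [0, 0, 1, 1, 0, 0, 1, 1],
    "snpE": [0, 0, 1, 1, 0, 0, 1, 1],
}
tags = greedy_tag_snps(panel)
print(tags)  # ['snpA', 'snpD']
```

Because LD patterns differ between populations, the reference panel passed in should come from a population similar to the study population, as the slide notes.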

SNP Imputation Howie et al. 2009 PLoS Genet; also see Marchini et al. 2007 Nat Genet

SNP Imputation cont. r2 = 0.8 Causal SNP Howie et al. 2009 PLoS Genet; also see Marchini et al. 2007 Nat Genet

SNP Imputation cont. Causal SNP Howie et al. 2009 PLoS Genet; also see Marchini et al. 2007 Nat Genet

Why do Imputation? • Increase power to detect disease association at untyped causal SNP (imputed causal SNP may have stronger association than tag SNP) • Enable meta-analysis of studies on Affymetrix + Illumina chips • Improve genotype data quality • Imputation algorithms available

Population studies • Population structure: refers to genetic differences between populations due to geographic ancestry. Use genome-wide data to classify genome-wide ancestry • Population admixture: Mixed ancestry from multiple continental populations. e.g. African Americans, Latino Americans Classify local ancestry at each location in the genome • Population stratification: Refers specifically to a genotype-phenotype association study. Differences in genetic ancestry between cases and controls

Previous GWAS studies Rosenberg et al. 2010, Nat Rev Genet

The International HapMap Project • 270 samples from 4 populations • Found 3.1 million SNPs

HapMap3

Measuring distances: FST • The FST between two populations is the value such that the allele frequency difference between the two populations has mean 0 and variance 2 × FST × p(1 – p), where p is the allele frequency in the ancestral population. OR • The FST between two populations is equal to the proportion of genotypic variance in a set of N individuals from each population that is attributable to population differences. • High FST implies a high degree of differentiation
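
The first definition suggests a simple moment-based calculation: since the squared allele frequency difference has expectation 2 × FST × p(1 – p), summing squared differences over markers and dividing by the summed 2p(1 – p) gives a rough FST value. The sketch below makes two loud assumptions: the ancestral frequency p is approximated by the average of the two population frequencies, and sampling noise in the frequencies is ignored. The function name and the frequencies are hypothetical.

```python
def naive_fst(freqs_pop1, freqs_pop2):
    """Rough FST from per-marker allele frequencies in two populations.

    Assumption: the ancestral frequency p is approximated by the mean of the
    two observed frequencies, and sampling error is ignored.
    """
    numerator = 0.0
    denominator = 0.0
    for p1, p2 in zip(freqs_pop1, freqs_pop2):
        p = (p1 + p2) / 2
        numerator += (p1 - p2) ** 2
        denominator += 2 * p * (1 - p)
    if denominator == 0:
        raise ValueError("no polymorphic markers")
    return numerator / denominator


# Hypothetical allele frequencies at three markers.
pop1 = [0.10, 0.50, 0.80]
pop2 = [0.12, 0.45, 0.85]
fst = naive_fst(pop1, pop2)
print(round(fst, 4))
```

Published estimators correct for finite sample sizes, so this sketch is best read as an illustration of the definition, not as a substitute for them.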

Some example FSTs • Yoruba (Nigeria) vs. Luhya (Kenya): FST = 0.008 • Japanese vs. Chinese: FST = 0.007 • Southeast Eur. vs. Northwest Eur.: FST = 0.005

More FST for HapMap3

Studying population structure • Can study population structure by: • Principal Component Analysis (PCA) • Clustering

Principal Components Analysis 10 points in 1,000,000-dimensional space.

Axes of variation (PCs, eigenvectors) Axis 1 is the axis explaining the maximum amount of variation. Axis 1

Axes of variation (PCs, eigenvectors) Axis 2 Axis 1

Axes of variation (PCs, eigenvectors) Axis 3 Axis 2 Axis 1

Axes of variation (PCs, eigenvectors) Axis 3 Axis 2 … Axis 9 Axis 10 Axis 1
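
A minimal sketch of how axis 1 can be found when there are few individuals and many markers: centre each marker, build the small individual-by-individual Gram matrix, and run power iteration to get its leading eigenvector, which gives each individual's position on axis 1 up to scale and sign. The 0/1/2 genotype coding, the function name and the toy matrix are assumptions; production analyses would use a dedicated PCA tool.

```python
from math import sqrt


def top_pc_scores(genotypes, iterations=200):
    """Coordinates of each individual on axis 1, as a unit-length vector.

    genotypes: list of rows, one per individual, each a list of 0/1/2 counts.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    # Centre each marker (column) on its mean.
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centred = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    # n x n Gram matrix between individuals; cheap when n is much smaller than m.
    gram = [[sum(a * b for a, b in zip(ri, rj)) for rj in centred] for ri in centred]
    # Power iteration converges to the leading eigenvector of the Gram matrix.
    vec = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        new = [sum(g * v for g, v in zip(row, vec)) for row in gram]
        norm = sqrt(sum(x * x for x in new))
        if norm == 0:
            raise ValueError("no variation in the data")
        vec = [x / norm for x in new]
    return vec


# Hypothetical genotypes: individuals 0-2 and 3-5 come from two different groups.
toy = [
    [0, 0, 1, 2, 2, 2],
    [0, 1, 0, 2, 2, 1],
    [0, 0, 0, 2, 1, 2],
    [2, 2, 2, 0, 0, 0],
    [2, 1, 2, 0, 1, 0],
    [2, 2, 1, 0, 0, 1],
]
scores = top_pc_scores(toy)
print([round(s, 2) for s in scores])
```

In this toy data the first three individuals fall on one side of axis 1 and the last three on the other, which is the kind of separation the following slides show for real populations.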

Distinguishing populations using PCA 100 markers

Distinguishing populations using PCA 3 million markers

PCA in Europe Novembre et al. 2008 Nature

Population structure using clustering • Model-based clustering programs such as STRUCTURE (Pritchard et al. 2000 Genetics) Rosenberg et al. 2002 Science

More examples [Figure labels: Africa, Europe, Western Eurasia, East Asia, Oceania, America]

Clustering versus PCA • Model-based clustering: • Output for each individual: ancestry in N population clusters • Fractional ancestry (20% pop1, 80% pop2) may be allowed • Number N of population clusters must be decided in advance • Results may be sensitive to number of population clusters • Principal components analysis (PCA): • Output for each individual: ancestry as principal components • PCs do not necessarily correspond to specific populations

Ancestry Informative Markers • Standard approach to inferring genetic ancestry: • Genotype each individual on a GWAS chip (500,000-1,000,000 random genetic markers). • Apply model-based clustering or PCA. OR • AIM approach to inferring genetic ancestry: • Genotype each individual on a small set of 50-300 AIMs: markers that are highly informative for genetic ancestry. • Apply model-based clustering or PCA.
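
One simple way to think about "highly informative" markers is to rank them by the absolute allele frequency difference between two reference populations and keep the top few. This is an illustrative proxy only; the slides do not say how AIMs are chosen, and the function name, marker names and frequencies below are hypothetical.

```python
def top_aims(freqs_pop1, freqs_pop2, k):
    """Return the k marker names with the largest allele frequency difference.

    freqs_pop1, freqs_pop2: dicts mapping marker name to allele frequency.
    """
    deltas = {
        name: abs(freqs_pop1[name] - freqs_pop2[name]) for name in freqs_pop1
    }
    return sorted(deltas, key=deltas.get, reverse=True)[:k]


# Hypothetical allele frequencies in two reference populations.
pop1 = {"m1": 0.10, "m2": 0.50, "m3": 0.90, "m4": 0.30, "m5": 0.45}
pop2 = {"m1": 0.85, "m2": 0.52, "m3": 0.20, "m4": 0.35, "m5": 0.80}

aims = top_aims(pop1, pop2, k=2)
print(aims)  # ['m1', 'm3']
```

The selected markers would then be genotyped and passed to model-based clustering or PCA, as in the AIM approach above.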

Working with the data • Public data available in e.g. • HapMap • 1000 Genomes • dbSNP • Ensembl, etc • Can be retrieved and used with user-owned data

The 1000 Genomes Project Aims • Characterize allele frequencies and linkage disequilibrium patterns of 95% of variants with allele frequency >1%. • Pioneer and evaluate methods for generating and analyzing data from next-generation sequencing platforms. • Sequence the entire genomes of 2,500 individuals: 500 each from Europe, East Asia, West Africa, South Asia and the Americas • Use next-generation sequencing technologies: e.g. Illumina/Solexa, 454, SOLiD (read lengths 25-400bp) • How much coverage is needed? 1000 Genomes Project Consortium 2010 Nature

1000 Genomes Pilot Projects 1. Trio pilot project: Sequence 1 CEU trio (mother, father, child) and 1 YRI trio (mother, father, child) at high coverage: >40x. 2. Low-coverage pilot project: Sequence 60 unrelated CEU and 60 unrelated YRI at low coverage: about 4x. 3. Exon sequencing pilot project: Exon capture sequencing of 8,140 exons from 906 genes in 697 individuals from 7 populations.
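
To get a feel for the difference between the roughly 4x and >40x pilot designs, the sketch below computes the probability that a site is covered by at least k reads, assuming read depth at a site follows a Poisson distribution with mean equal to the average coverage. The Poisson model is a common simplification and an assumption here, as are the function name and the depth cut-offs.

```python
from math import exp, factorial


def prob_depth_at_least(mean_coverage, k):
    """P(depth >= k) at one site, assuming depth ~ Poisson(mean_coverage)."""
    below = sum(
        exp(-mean_coverage) * mean_coverage ** i / factorial(i) for i in range(k)
    )
    return 1.0 - below


for coverage in (4, 40):
    covered = prob_depth_at_least(coverage, 1)   # at least one read
    deep = prob_depth_at_least(coverage, 10)     # at least ten reads
    print(coverage, round(covered, 4), round(deep, 4))
```

Under this model most sites are seen at least once at 4x but rarely ten times, which is one reason low-coverage designs rely on combining information across many individuals.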

Implications of genetic diversity Gene expression: Microarrays, RNASeq –thousands of data points Potential disease phenotype Environmental data: socio-economic impact Protein abundance: Mass spectrometry –thousands of possibilities Pathways and interactions: binary and directed interactions

Applications of genetic variation • Disease association studies depend on population group & genetic diversity • Genome wide association studies (GWAS) • >1000 cases + >1000 controls • Identify SNPs with significantly different frequencies between the groups • Correlate this with the disease phenotype • Pharmacogenetics

Pharmacogenetics • Most drugs only work for 40% of people • Most likely related to population specificity • Some drugs cause adverse drug reactions related to SNPs in metabolizing enzymes (60%) • IRESSA cancer drug only works in 10% of patients, but has a higher success rate in Japan • BiDil for heart failure: only approved for African Americans • Warfarin anticoagulant: variations in response caused by SNPs in CYP2C9 or VKORC1

Summary • Population genetics is a large field! • Used for: • Identifying population structure for history and medical studies • Looking for ancestry, e.g. in admixed populations • Using LD, haplotypes and population structure in disease association studies • Pharmacogenetics
