Population genetics. Nicky Mulder. Acknowledgments: some slides based on a course by Noah Zaitlen. Contents: Background, Linkage disequilibrium, SNP tagging, Population studies, Accessing data. What is population genetics?
Population genetics Nicky Mulder Acknowledgments -some slides based on course from Noah Zaitlen
Contents • Background • Linkage disequilibrium • SNP tagging • Population studies • Accessing data
What is population genetics? Population genetics is the study of genetic variation both within and between human populations. • 5-7% of worldwide human genetic variation is due to genetic differences between human populations. • The remaining 93-95% of human genetic variation is due to genetic variation within human populations (Rosenberg et al. 2002 Science).
Why study population genetics? • Learn about human migration patterns and history • Improve power to identify and localize disease genes • Use differences in linkage disequilibrium for fine-mapping • Avoid false positives due to population stratification • Admixture mapping for diseases with varying prevalence • Signals of natural selection at genes related to disease
Linkage Disequilibrium Definition: Linkage Disequilibrium (LD) refers to correlations between genotypes of nearby markers. Linkage Disequilibrium can be used for association studies
Linkage Disequilibrium: Example [Figure: genotypes of individuals along a stretch of the 3-billion-letter genome. SNP 1 and SNP 2: YES, in LD.]
Linkage Disequilibrium: Example [Figure: genotypes of individuals along a stretch of the 3-billion-letter genome. SNP 1 and SNP 2: YES, in LD. SNP 3: NOT in LD.]
Linkage Disequilibrium: Example [Figure: the same individuals with alleles coded 0/1 along the 3-billion-letter genome. SNP 1 and SNP 2: r² = 1, in LD. SNP 3: r² = 0, NOT in LD.] r² is the squared correlation.
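The r² statistic on this slide can be illustrated with a short Python sketch. It computes the squared Pearson correlation between two SNPs coded 0/1. The function name `r_squared` and the toy allele vectors are illustrative assumptions, not data from the slides.

```python
def r_squared(snp_a, snp_b):
    """Squared Pearson correlation between two equal-length 0/1 allele vectors."""
    n = len(snp_a)
    if n == 0 or n != len(snp_b):
        raise ValueError("vectors must be non-empty and of equal length")
    mean_a = sum(snp_a) / n
    mean_b = sum(snp_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(snp_a, snp_b)) / n
    var_a = sum((a - mean_a) ** 2 for a in snp_a) / n
    var_b = sum((b - mean_b) ** 2 for b in snp_b) / n
    if var_a == 0 or var_b == 0:
        raise ValueError("r^2 is undefined when a SNP shows no variation")
    return cov * cov / (var_a * var_b)


# Toy data (hypothetical): one entry per chromosome, 1 = alternate allele.
snp1 = [1, 1, 1, 1, 0, 0, 0, 0]
snp2 = [1, 1, 1, 1, 0, 0, 0, 0]  # always matches snp1, so in perfect LD
snp3 = [1, 0, 1, 0, 1, 0, 1, 0]  # varies independently of snp1

print(r_squared(snp1, snp2))  # 1.0
print(r_squared(snp1, snp3))  # 0.0
```

Real studies usually estimate r² from phased haplotypes or genotype counts across many individuals; this sketch only shows the arithmetic.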
LD: Haplotype Blocks [Figure: individuals with alleles coded 0/1 along the 3-billion-letter genome, with SNP 1, SNP 2 and SNP 3 marked.] These 3 SNPs form a “haplotype block”.
Haplotype blocks • Variants (alleles) that are located close to one another tend to be inherited together (they are in LD) • This pattern is disrupted by recombination, which shuffles chromosomes • The recombination rate is finite and the time since the origin of humans has been insufficient to break down all linkage => haplotype blocks • African chromosomes: 50% of the genome lies in haplotype blocks >22kb. • Europeans and Asians: 50% of the genome lies in haplotype blocks >44kb. • Haplotype blocks are longer in Europeans/Asians because of the out-of-Africa population bottleneck: these populations descend from a small number of ancestors who left Africa 40-60 kya. Gabriel et al. 2002 Science (also see Reich 2001 Nature, Daly 2001 Nat Genet)
Population bottlenecks Cavalli-Sforza & Feldman 2003 Nat Genet; also see Ramachandran et al. 2005 PNAS, Green et al. 2010 Science (Neanderthal genome), Reich et al. 2010 Nature (Denisova)
Linkage Disequilibrium and tag SNPs Direct association: genotype SNP1 in Cases and Controls. [Figure: 0/1-coded genotypes of case and control individuals along the 3-billion-letter genome. SNP 1: causal SNP.]
Linkage Disequilibrium and tag SNPs Indirect association: genotype SNP2 in Cases and Controls. If SNP1 affects disease risk, then SNP2 will also be associated! [Figure: 0/1-coded genotypes of case and control individuals along the 3-billion-letter genome. SNP 1 and SNP 2: r² = 1, in LD.]
LD: Haplotype Blocks [Figure: haplotypes of case and control individuals, with the risk haplotype labelled.] Question: Which SNP to genotype? Answer: Choose 1 SNP per haplotype block, and take advantage of indirect association! Use known resources.
HapMap for “SNP tagging” How to select SNPs to genotype in an association study: • Choose genomic region(s) of interest. • Look up HapMap SNPs in the genomic region(s). • Choose a subset of HapMap SNPs which “tag” haplotype blocks in the genomic region(s). Note: because LD patterns vary by population, it is important to choose tag SNPs using a HapMap population similar to the population in the association study.
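The tag-selection step above can be sketched as a greedy procedure: repeatedly pick the SNP that is in strong LD (r² at or above a threshold) with the most SNPs not yet tagged. This is one common heuristic, shown under assumptions; it is not necessarily the method used by HapMap-based tools, and the names `greedy_tag_snps`, `threshold` and the toy reference panel are hypothetical.

```python
from itertools import combinations


def r_squared(a, b):
    """Squared correlation of two 0/1 allele vectors (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n
    va = sum((x - ma) ** 2 for x in a) / n
    vb = sum((y - mb) ** 2 for y in b) / n
    return 0.0 if va == 0 or vb == 0 else cov * cov / (va * vb)


def greedy_tag_snps(haplotypes, threshold=0.8):
    """Pick tag SNPs so every SNP has r^2 >= threshold with some tag.

    haplotypes: dict mapping SNP name -> list of 0/1 alleles from a
    reference panel (assumed to resemble the study population).
    """
    names = list(haplotypes)
    # Each SNP covers itself plus every SNP it is in strong LD with.
    covers = {name: {name} for name in names}
    for a, b in combinations(names, 2):
        if r_squared(haplotypes[a], haplotypes[b]) >= threshold:
            covers[a].add(b)
            covers[b].add(a)
    untagged = set(names)
    tags = []
    while untagged:
        # Choose the SNP that tags the most still-untagged SNPs.
        best = max(names, key=lambda s: len(covers[s] & untagged))
        tags.append(best)
        untagged -= covers[best]
    return tags


# Hypothetical reference panel: snpA, snpB and snpC form one block.
panel = {
    "snpA": [1, 1, 1, 1, 0, 0, 0, 0],
    "snpB": [1, 1, 1, 1, 0, 0, 0, 0],
    "snpC": [0, 0, 0, 0, 1, 1, 1, 1],
    "snpD": [1, 0, 1, 0, 1, 0, 1, 0],
}
print(greedy_tag_snps(panel))  # ['snpA', 'snpD']
```

Here one SNP stands in for the whole three-SNP block, and snpD must be genotyped separately because no other SNP predicts it.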
SNP Imputation Howie et al. 2009 PLoS Genet; also see Marchini et al. 2007 Nat Genet
SNP Imputation cont. r2 = 0.8 Causal SNP Howie et al. 2009 PLoS Genet; also see Marchini et al. 2007 Nat Genet
SNP Imputation cont. Causal SNP Howie et al. 2009 PLoS Genet; also see Marchini et al. 2007 Nat Genet
Why do Imputation? • Increase power to detect disease association at untyped causal SNP (imputed causal SNP may have stronger association than tag SNP) • Enable meta-analysis of studies on Affymetrix + Illumina chips • Improve genotype data quality • Imputation algorithms available
Population studies • Population structure: refers to genetic differences between populations due to geographic ancestry. Use genome-wide data to classify genome-wide ancestry • Population admixture: Mixed ancestry from multiple continental populations. e.g. African Americans, Latino Americans Classify local ancestry at each location in the genome • Population stratification: Refers specifically to a genotype-phenotype association study. Differences in genetic ancestry between cases and controls
Previous GWAS studies Rosenberg et al. 2010, Nat Rev Genet
The International HapMap Project • 270 samples from 4 populations • Found 3.1 million SNPs
Measuring distances: FST • The FST between two populations is the value such that the allele frequency difference between the two populations has mean 0 and variance 2 × FST × p(1 − p), where p is the allele frequency in the ancestral population. OR • The FST between two populations is equal to the proportion of genotypic variance in a set of N individuals from each population that is attributable to population differences. • High FST implies a high degree of differentiation
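The first definition suggests a simple moment estimate, sketched below in Python. If both populations drifted independently from the same ancestral frequency p, the squared frequency difference has expectation 2 × FST × p(1 − p), and p1(1 − p2) + p2(1 − p1) has expectation 2p(1 − p), so the ratio of their sums over SNPs estimates FST. This resembles what is often called the Hudson estimator but omits its finite-sample correction; the function name and the example frequencies are assumptions for illustration.

```python
def fst_moment_estimate(freqs_pop1, freqs_pop2):
    """Ratio-of-sums FST estimate from per-SNP allele frequencies.

    Assumes frequencies are known exactly (no sampling-noise correction)
    and that the two populations drifted independently from a shared
    ancestral frequency.
    """
    if len(freqs_pop1) != len(freqs_pop2):
        raise ValueError("both populations need the same SNPs")
    numerator = 0.0
    denominator = 0.0
    for p1, p2 in zip(freqs_pop1, freqs_pop2):
        numerator += (p1 - p2) ** 2                 # E = 2 * FST * p(1 - p)
        denominator += p1 * (1 - p2) + p2 * (1 - p1)  # E = 2 * p(1 - p)
    if denominator == 0:
        raise ValueError("no informative SNPs")
    return numerator / denominator


# Hypothetical frequencies at three SNPs in two populations.
pop1 = [0.60, 0.30, 0.85]
pop2 = [0.40, 0.35, 0.80]
print(fst_moment_estimate(pop1, pop2))
```

Identical frequencies give 0 and fixed differences (1.0 versus 0.0) give 1, matching the reading that a high FST implies strong differentiation.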
Some example FSTs • Yoruba (Nigeria) vs. Luhya (Kenya): FST = 0.008 • Japanese vs. Chinese: FST = 0.007 • Southeast Eur. vs. Northwest Eur.: FST = 0.005
Studying population structure • Can study population structure by: • Principal Component Analysis (PCA) • Clustering
Principal Components Analysis 10 points in 1,000,000-dimensional space.
Axes of variation (PCs, eigenvectors) Axis 1 is the axis explaining the maximum amount of variation. Axis 1
Axes of variation (PCs, eigenvectors) Axis 2 Axis 1
Axes of variation (PCs, eigenvectors) Axis 3 Axis 2 Axis 1
Axes of variation (PCs, eigenvectors) Axis 3 Axis 2 … Axis 9 Axis 10 Axis 1
Distinguishing populations using PCA 100 markers
Distinguishing populations using PCA 3 million markers
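A minimal Python sketch of the idea on the PCA slides: treat each individual as a point in marker space, center each marker, and find the axis of maximum variation (Axis 1). It uses power iteration on the individual-by-individual matrix so that only the standard library is needed; dedicated PCA software would normally be used on real data. The function name and the tiny genotype matrix are hypothetical.

```python
import math


def top_principal_component(genotypes, iterations=200):
    """Return Axis 1 (PC1) coordinates, one per individual.

    genotypes: list of rows, one per individual, each a list of
    0/1/2 allele counts at the same markers.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    # Center each marker so the axes describe variation around the mean.
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    # n x n matrix of inner products between individuals.
    gram = [[sum(a * b for a, b in zip(ri, rj)) for rj in centered]
            for ri in centered]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    vec = [float(i + 1) for i in range(n)]  # deterministic, non-constant start
    for _ in range(iterations):
        new = [sum(g * v for g, v in zip(row, vec)) for row in gram]
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0:
            raise ValueError("no variation in the genotype matrix")
        vec = [x / norm for x in new]
    return vec


# Hypothetical data: 3 individuals from each of two populations, 5 markers.
genos = [
    [2, 2, 0, 0, 1], [2, 1, 0, 0, 1], [2, 2, 0, 1, 1],  # population A
    [0, 0, 2, 2, 1], [0, 0, 2, 1, 1], [0, 1, 2, 2, 1],  # population B
]
pc1 = top_principal_component(genos)
for label, value in zip("AAABBB", pc1):
    print(label, round(value, 3))
```

In this toy example the two populations fall on opposite sides of zero on Axis 1. The overall sign of a principal component is arbitrary, so only the separation is meaningful.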
PCA in Europe Novembre et al. 2008 Nature
Population structure using clustering • Model-based clustering programs such as STRUCTURE (Pritchard et al. 2000 Genetics) Rosenberg et al. 2002 Science
More examples [Figure labels: Oceania, America, Africa, Europe, Western Eurasia, East Asia]
Clustering versus PCA • Model-based clustering: • Output for each individual: ancestry in N population clusters • Fractional ancestry (20% pop1, 80% pop2) may be allowed • Number N of population clusters must be decided in advance • Results may be sensitive to number of population clusters • Principal components analysis (PCA): • Output for each individual: ancestry as principal components • PCs do not necessarily correspond to specific populations
Ancestry Informative Markers • Standard approach to inferring genetic ancestry: • Genotype each individual on a GWAS chip (500,000-1,000,000 random genetic markers). • Apply model-based clustering or PCA. OR • AIM approach to inferring genetic ancestry: • Genotype each individual on a small set of 50-300 AIMs: markers that are highly informative for genetic ancestry. • Apply model-based clustering or PCA.
Working with the data • Public data available in e.g. • HapMap • 1000 Genomes • dbSNP • Ensembl, etc • Can be retrieved and used with user-owned data
The 1000 Genomes Project Aims • Characterize allele frequencies and linkage disequilibrium patterns of 95% of variants with allele frequency >1%. • Pioneer and evaluate methods for generating and analyzing data from next-generation sequencing platforms. • Sequence the entire genomes of 2,500 individuals: 500 each from Europe, East Asia, West Africa, South Asia and the Americas • Use next-generation sequencing technologies: e.g. Illumina/Solexa, 454, SOLiD (read lengths 25-400bp) • How much coverage is needed? 1000 Genomes Project Consortium 2010 Nature
1000 Genomes Pilot Projects 1. Trio pilot project: Sequence 1 CEU trio (mother, father, child) and 1 YRI trio (mother, father, child) at high coverage: >40x. 2. Low-coverage pilot project: Sequence 60 unrelated CEU and 60 unrelated YRI at low coverage: about 4x. 3. Exon sequencing pilot project: Exon capture sequencing of 8,140 exons from 906 genes in 697 individuals from 7 populations.
Implications of genetic diversity Gene expression: Microarrays, RNASeq –thousands of data points Potential disease phenotype Environmental data: socio-economic impact Protein abundance: Mass spectrometry –thousands of possibilities Pathways and interactions: binary and directed interactions
Applications of genetic variation • Disease association studies depend on population group & genetic diversity • Genome wide association studies (GWAS) • >1000 cases + >1000 controls • Identify SNPs with significantly different frequencies between the groups • Correlate this with the disease phenotype • Pharmacogenetics
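The step "identify SNPs with significantly different frequencies between the groups" can be illustrated with a basic allelic chi-square test on a 2×2 table of allele counts. This is a sketch of one simple test at a single SNP; it ignores population stratification and multiple testing, both of which a real GWAS must handle. The function name and the counts are hypothetical.

```python
import math


def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Chi-square test (1 degree of freedom) on allele counts at one SNP.

    Returns (chi2, p_value).
    """
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    row_sums = [sum(row) for row in table]
    col_sums = [case_alt + control_alt, case_ref + control_ref]
    total = sum(row_sums)
    if 0 in row_sums or 0 in col_sums:
        raise ValueError("every row and column needs a non-zero count")
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical SNP: 1000 cases and 1000 controls, so 2000 alleles per group.
chi2, p = allelic_chi_square(case_alt=600, case_ref=1400,
                             control_alt=500, control_ref=1500)
print(round(chi2, 2), p)
```

Because a GWAS runs this kind of test at hundreds of thousands of SNPs, a far stricter significance threshold than 0.05 is needed before a SNP is reported as associated.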
Pharmacogenetics • Most drugs only work for 40% of people • Most likely related to population specificity • Some drugs cause adverse drug reactions related to SNPs in metabolizing enzymes (60%) • The cancer drug IRESSA only works in 10% of patients, but has a higher success rate in Japan • BiDil for heart failure: only approved for African Americans • Warfarin anticoagulant: variations caused by SNPs in CYP2C9 or VKORC1
Summary • Population genetics is a large field! • Used for: • Identifying population structure for history and medical studies • Looking for ancestry, e.g. in admixed populations • Using LD, haplotypes and population structure in disease association studies • Pharmacogenetics