
Population genetics

Population genetics. Nicky Mulder. Acknowledgments: some slides based on a course from Noah Zaitlen. Contents: Background, Linkage disequilibrium, SNP tagging, Population studies, Accessing data. What is population genetics?


Presentation Transcript


  1. Population genetics Nicky Mulder Acknowledgments: some slides based on a course from Noah Zaitlen

  2. Contents • Background • Linkage disequilibrium • SNP tagging • Population studies • Accessing data

  3. What is population genetics? Population genetics is the study of genetic variation both within and between human populations. • 5-7% of worldwide human genetic variation is due to genetic differences between human populations. • The remaining 93-95% of human genetic variation is due to genetic variation within human populations (Rosenberg et al. 2002 Science).

  4. Why study population genetics? • Learn about human migration patterns and history • Improve power to identify and localize disease genes • Use differences in linkage disequilibrium for fine-mapping • Avoid false positives due to population stratification • Admixture mapping for diseases with varying prevalence • Signals of natural selection at genes related to disease

  5. Linkage Disequilibrium Definition: Linkage Disequilibrium (LD) refers to correlations between genotypes of nearby markers. Linkage Disequilibrium can be used for association studies

  6. Linkage Disequilibrium: Example [Figure: genotypes of several individuals along a stretch of the 3-billion-letter genome, with SNP 1 and SNP 2 marked.] SNP 1 and SNP 2: YES, in LD.

  7. Linkage Disequilibrium: Example [Figure: the same genotypes with a third site, SNP 3, marked.] SNP 1 and SNP 2: YES, in LD. SNP 3: NOT in LD.

  8. Linkage Disequilibrium: Example [Figure: genotypes of the individuals recoded as 0/1 alleles along the 3-billion-letter genome.] SNP 1 and SNP 2: r2 = 1, in LD. SNP 3: r2 = 0, NOT in LD. r2 is the squared correlation.
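
The r2 measure on this slide can be illustrated with a minimal sketch. It assumes each SNP is stored as a list of 0/1 alleles, one entry per chromosome, and computes the squared Pearson correlation between two such lists. The function name `r_squared` and the toy vectors are made up for illustration; they are not the genotypes from the slide.

```python
def r_squared(snp_a, snp_b):
    """Squared Pearson correlation between two equal-length 0/1 allele vectors."""
    n = len(snp_a)
    if n == 0 or n != len(snp_b):
        raise ValueError("vectors must be non-empty and of equal length")
    mean_a = sum(snp_a) / n
    mean_b = sum(snp_b) / n
    # Sums of centered products; the 1/n factors cancel in the ratio below.
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(snp_a, snp_b))
    var_a = sum((a - mean_a) ** 2 for a in snp_a)
    var_b = sum((b - mean_b) ** 2 for b in snp_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("r2 is undefined when a SNP does not vary")
    return cov * cov / (var_a * var_b)

# Hypothetical toy data: 8 chromosomes, one 0/1 allele each.
snp1 = [0, 1, 0, 1, 0, 1, 0, 1]
snp2 = [0, 1, 0, 1, 0, 1, 0, 1]  # same pattern as snp1
snp3 = [0, 0, 1, 1, 0, 0, 1, 1]  # pattern statistically unrelated to snp1

r2_12 = r_squared(snp1, snp2)  # perfectly correlated pair
r2_13 = r_squared(snp1, snp3)  # uncorrelated pair
```

In this toy data the first pair behaves like SNP 1 and SNP 2 on the slide (perfect LD) and the second pair like SNP 1 and SNP 3 (no LD).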

  9. LD: Haplotype Blocks [Figure: 0/1-coded genotypes of the individuals along the 3-billion-letter genome, with SNP 1, SNP 2 and SNP 3 marked.] These 3 SNPs form a “haplotype block”.

  10. Haplotype blocks • Variants (alleles) that are located close to one another are inherited together (in LD) • This pattern is disrupted by recombination (shuffles chromosomes) • Recombination is finite and time since origin of humans insufficient to break down all linkage => haplotype blocks • African chromosomes: 50% of the genome lies in haplotype blocks >22kb. • Europeans and Asians: 50% of the genome lies in haplotype blocks >44kb. • Longer haplotype blocks in Europeans/Asians due to out-of-Africa population bottleneck: descended from small number of ancestors who left Africa 60-40 kya. Gabriel et al. 2002 Science (also see Reich 2001 Nature, Daly 2001 Nat Genet)

  11. Population bottlenecks Cavalli-Sforza & Feldman 2003 Nat Genet; also see Ramachandran et al. 2005 PNAS, Green et al. 2010 Science (Neanderthal genome), Reich et al. 2010 Nature (Denisova)

  12. Linkage Disequilibrium and tag SNPs Direct association: genotype SNP1 in Cases and Controls. [Figure: 0/1-coded genotypes of individuals, grouped into Cases and Controls, along the 3-billion-letter genome; SNP 1 is the causal SNP.]

  13. Linkage Disequilibrium and tag SNPs Indirect association: genotype SNP2 in Cases and Controls. If SNP1 affects disease risk, then SNP2 will also be associated! [Figure: 0/1-coded genotypes of individuals, grouped into Cases and Controls, along the 3-billion-letter genome; SNP 1 and SNP 2 have r2 = 1, in LD.]

  14. LD: Haplotype Blocks [Figure: haplotypes labelled Case or Control, with the risk haplotype marked.] Question: Which SNP to genotype? Answer: Choose 1 SNP per haplotype block, and take advantage of indirect association! Use known resources

  15. HapMap for “SNP tagging” • How to select SNPs to genotype in an association study: • Choose genomic region(s) of interest. • Look up HapMap SNPs in the genomic region(s) • Choose a subset of HapMap SNPs which “tag” haplotype blocks in the genomic region(s) • Note: because LD patterns vary by population, it is important to choose tag SNPs using a HapMap population similar to the population in the association study
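
One way to picture the "choose a subset of SNPs which tag haplotype blocks" step is a greedy selection: repeatedly pick the SNP that captures the most not-yet-tagged SNPs at some r2 threshold. The sketch below is only an illustration under stated assumptions. The 0.8 threshold, the function names, the SNP names and the toy haplotypes are all hypothetical, and the greedy strategy is one simple choice rather than the method of any particular tagging tool.

```python
def r_squared(a, b):
    """Squared correlation between two 0/1 allele vectors (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return 0.0 if va == 0 or vb == 0 else cov * cov / (va * vb)

def greedy_tag_snps(haplotypes, threshold=0.8):
    """haplotypes: dict mapping SNP name -> list of 0/1 alleles.
    Returns tag SNP names such that every SNP has r2 >= threshold with some tag."""
    names = list(haplotypes)
    # covers[s]: the SNPs that s would capture if chosen as a tag (always includes s).
    covers = {
        s: {t for t in names
            if s == t or r_squared(haplotypes[s], haplotypes[t]) >= threshold}
        for s in names
    }
    untagged = set(names)
    tags = []
    while untagged:
        # Pick the SNP capturing the most still-untagged SNPs (first one wins ties).
        best = max(names, key=lambda s: len(covers[s] & untagged))
        tags.append(best)
        untagged -= covers[best]
    return tags

# Hypothetical region: snp_a, snp_b, snp_c form one block; snp_d stands alone.
toy_haplotypes = {
    "snp_a": [0, 1, 0, 1, 0, 1, 0, 1],
    "snp_b": [0, 1, 0, 1, 0, 1, 0, 1],
    "snp_c": [1, 0, 1, 0, 1, 0, 1, 0],  # mirror image of snp_a, still r2 = 1
    "snp_d": [0, 0, 1, 1, 0, 0, 1, 1],
}
tags = greedy_tag_snps(toy_haplotypes)
```

Because LD patterns differ between populations, the same procedure run on haplotypes from a different reference population could return a different tag set, which is the point of the note on the slide.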

  16. SNP Imputation Howie et al. 2009 PLoS Genet; also see Marchini et al. 2007 Nat Genet

  17. SNP Imputation cont. r2 = 0.8 Causal SNP Howie et al. 2009 PLoS Genet; also see Marchini et al. 2007 Nat Genet

  18. SNP Imputation cont. Causal SNP Howie et al. 2009 PLoS Genet; also see Marchini et al. 2007 Nat Genet

  19. Why do Imputation? • Increase power to detect disease association at untyped causal SNP (imputed causal SNP may have stronger association than tag SNP) • Enable meta-analysis of studies on Affymetrix + Illumina chips • Improve genotype data quality • Imputation algorithms available

  20. Population studies • Population structure: refers to genetic differences between populations due to geographic ancestry. Use genome-wide data to classify genome-wide ancestry • Population admixture: Mixed ancestry from multiple continental populations. e.g. African Americans, Latino Americans Classify local ancestry at each location in the genome • Population stratification: Refers specifically to a genotype-phenotype association study. Differences in genetic ancestry between cases and controls

  21. Previous GWAS studies Rosenberg et al. 2010, Nat Rev Genet

  22. The International HapMap Project • 270 samples from 4 populations • Found 3.1 million SNPs

  23. HapMap3

  24. Measuring distances: FST • The FST between two populations is the value such that the allele frequency difference between the two populations has mean 0 and variance 2·FST·p(1 – p), where p is the allele frequency in the ancestral population. OR • The FST between two populations is equal to the proportion of genotypic variance in a set of N individuals from each population that is attributable to population differences. • High FST implies a high degree of differentiation
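
The first definition suggests a simple moment-based estimate, sketched here under stated assumptions. If p1 and p2 are the allele frequencies in the two populations, each drifting independently around the ancestral frequency p, then (p1 - p2)^2 has expectation 2·FST·p(1 - p), and p1(1 - p2) + p2(1 - p1) has expectation 2p(1 - p). Summing both over SNPs and taking the ratio gives an FST estimate. This sketch ignores finite-sample corrections (it corresponds to the large-sample form of what is often called Hudson's estimator), and the function name and frequencies are made up for illustration.

```python
def fst_from_frequencies(freqs_pop1, freqs_pop2):
    """Ratio-of-sums FST estimate from per-SNP allele frequencies in two populations.
    Assumes frequencies are known exactly (no sample-size correction)."""
    if len(freqs_pop1) != len(freqs_pop2) or not freqs_pop1:
        raise ValueError("need two non-empty frequency lists of equal length")
    numerator = 0.0
    denominator = 0.0
    for p1, p2 in zip(freqs_pop1, freqs_pop2):
        numerator += (p1 - p2) ** 2                    # estimates 2*FST*p*(1-p)
        denominator += p1 * (1 - p2) + p2 * (1 - p1)   # estimates 2*p*(1-p)
    if denominator == 0:
        raise ValueError("no polymorphic SNPs")
    return numerator / denominator

# Hypothetical frequencies for two SNPs in two populations.
fst_toy = fst_from_frequencies([0.4, 0.6], [0.6, 0.4])
```

Identical frequencies give 0, and two populations fixed for opposite alleles give 1, matching the reading that a high FST implies a high degree of differentiation.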

  25. Some example FSTs Yoruba (Nigeria) Luhya (Kenya) FST = 0.008 Japanese Chinese FST = 0.007 Southeast Eur. Northwest Eur. FST = 0.005

  26. More FST for HapMap3

  27. Studying population structure • Can study population structure by: • Principal Component Analysis (PCA) • Clustering

  28. Principal Components Analysis 10 points in 1,000,000-dimensional space.

  29. Axes of variation (PCs, eigenvectors) Axis 1 is the axis explaining the maximum amount of variation. Axis 1

  30. Axes of variation (PCs, eigenvectors) Axis 2 Axis 1

  31. Axes of variation (PCs, eigenvectors) Axis 3 Axis 2 Axis 1

  32. Axes of variation (PCs, eigenvectors) Axis 3 Axis 2 … Axis 9 Axis 10 Axis 1
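
The idea of "axis 1 explains the maximum amount of variation" can be sketched on a tiny genotype matrix. The example below assumes rows are individuals and columns are markers coded as 0/1/2 allele counts. It centers each marker, builds the individual-by-individual Gram matrix, and finds its leading eigenvector by power iteration, used here only to stay within the standard library; real analyses would use a full eigendecomposition or SVD. The function name and the six toy individuals are hypothetical.

```python
from math import sqrt

def first_principal_component(genotypes, iterations=200):
    """Return one PC1 coordinate per individual (a unit-length vector,
    proportional to the PC1 scores; its overall sign is arbitrary)."""
    n = len(genotypes)
    m = len(genotypes[0])
    # Center each marker (column) on its mean across individuals.
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    # n x n Gram matrix: similarity between every pair of individuals.
    gram = [[sum(a * b for a, b in zip(ri, rk)) for rk in centered]
            for ri in centered]
    # Power iteration: repeated multiplication converges to the top eigenvector.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        w = [sum(g * x for g, x in zip(row, v)) for row in gram]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0:
            raise ValueError("genotype matrix has no variation")
        v = [x / norm for x in w]
    return v

# Hypothetical data: individuals 0-2 from one group, 3-5 from another.
toy_genotypes = [
    [0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [2, 2, 1, 2, 2],
    [2, 1, 2, 2, 2],
    [2, 2, 2, 2, 1],
]
pc1 = first_principal_component(toy_genotypes)
```

In this toy case the two groups land on opposite sides of zero along axis 1, which is the behaviour the PCA slides rely on to distinguish populations.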

  33. Distinguishing populations using PCA 100 markers

  34. Distinguishing populations using PCA 3 million markers

  35. PCA in Europe Novembre et al. 2008 Nature

  36. Population structure using clustering • Model-based clustering programs such as STRUCTURE (Pritchard et al. 2000 Genetics) Rosenberg et al. 2002 Science

  37. More examples Oceania, America, Africa, Europe, Western Eurasia, East Asia

  38. Clustering versus PCA • Model-based clustering: • Output for each individual: ancestry in N population clusters • Fractional ancestry (20% pop1, 80% pop2) may be allowed • Number N of population clusters must be decided in advance • Results may be sensitive to number of population clusters • Principal components analysis (PCA): • Output for each individual: ancestry as principal components • PCs do not necessarily correspond to specific populations

  39. Ancestry Informative Markers • Standard approach to inferring genetic ancestry: • Genotype each individual on a GWAS chip (500,000-1,000,000 random genetic markers). • Apply model-based clustering or PCA. OR • AIM approach to inferring genetic ancestry: • Genotype each individual on a small set of 50-300 AIMs: markers that are highly informative for genetic ancestry. • Apply model-based clustering or PCA.

  40. Working with the data • Public data available in e.g. • HapMap • 1000 Genomes • dbSNP • Ensembl, etc • Can be retrieved and used with user-owned data

  41. The 1000 Genomes Project Aims • Characterize allele frequencies and linkage disequilibrium patterns of 95% of variants with allele frequency >1%. • Pioneer and evaluate methods for generating and analyzing data from next-generation sequencing platforms. • Sequence the entire genomes of 2,500 individuals: 500 each from Europe, East Asia, West Africa, South Asia and the Americas • Uses next-generation sequencing technologies: e.g. Illumina/Solexa, 454, SOLiD (read lengths 25-400bp) • How much coverage is needed? 1000 Genomes Project Consortium 2010 Nature

  42. 1000 Genomes Pilot Projects 1. Trio pilot project: Sequence 1 CEU trio (mother, father, child) and 1 YRI trio (mother, father, child) at high coverage: >40x. 2. Low-coverage pilot project: Sequence 60 unrelated CEU and 60 unrelated YRI at low coverage: about 4x. 3. Exon sequencing pilot project: Exon capture sequencing of 8,140 exons from 906 genes in 697 individuals from 7 populations.

  43. Implications of genetic diversity Gene expression: Microarrays, RNASeq –thousands of data points Potential disease phenotype Environmental data: socio-economic impact Protein abundance: Mass spectrometry –thousands of possibilities Pathways and interactions: binary and directed interactions

  44. Applications of genetic variation • Disease association studies depend on population group & genetic diversity • Genome wide association studies (GWAS) • >1000 cases + >1000 controls • Identify SNPs with significantly different frequencies between the groups • Correlate this with the disease phenotype • Pharmacogenetics
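
The "identify SNPs with significantly different frequencies between the groups" step can be illustrated for a single SNP with an allelic chi-square test on a 2x2 table of allele counts. This is a minimal sketch, not a full GWAS pipeline: it tests one SNP, ignores population stratification, and uses hypothetical counts. For 1 degree of freedom the chi-square tail probability equals erfc(sqrt(chi2 / 2)), which keeps the example within the standard library.

```python
from math import erfc, sqrt

def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square test (1 df) on allele counts in cases vs controls.
    Returns (chi2 statistic, p-value)."""
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs a non-zero total")
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical SNP: 1000 cases and 1000 controls, so 2000 alleles per group.
chi2, p_value = allelic_chi_square(case_alt=600, case_ref=1400,
                                   control_alt=500, control_ref=1500)
```

In a genome-wide scan the same test would be repeated at every SNP, so the p-value threshold has to account for the very large number of tests.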

  45. Pharmacogenetics • Most drugs only work for 40% of people • Most likely related to population specificity • Some drugs cause adverse drug reactions related to SNPs in metabolizing enzymes (60%) • IRESSA cancer drug only works in 10% of patients, but has a higher success rate in Japan • BiDil for heart failure: only approved for African Americans • Warfarin anticoagulant: variations caused by SNPs in CYP2C9 or VKORC1

  46. Summary • Population genetics is a large field! • Used for: • Identifying population structure for history and medical studies • Looking for ancestry, e.g. in admixed populations • Use LD, haplotypes and population structure in disease association studies • pharmacogenetics
