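Graduate course, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences: Human Population Genetics • Human Population Genetics: Basic Principles and Analytical Methods • CAS-MPG Partner Institute for Computational Biology • Shuhua Xu (徐书华), Li Jin (金力)
Lecture 8: Analysis of Population Genetic Structure (II)
Lecture 8 • Population differentiation and genetic diversity • STRUCTURE analysis • File format • Parameter settings • Interpretation of results • Software demonstration • STRUCTURE 2.2.3
Analysis of population genetic structure • Gene tree based • AMOVA (hierarchical F statistics) • Factor analysis • Principal component analysis • STRUCTURE analysis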
Previous genome-wide data in HGDP panel • Science 2002: 52 populations, 1,056 individuals; 377 autosomal STRs • PLoS Genet 2005: 52 populations, 1,048 individuals; 783 STRs, 210 indels • Nature Genetics 2006: 52 populations, 927 individuals; 3,024 SNPs in 36 genomic regions
NIH & University of Michigan Stanford University
Genotype, haplotype and copy-number variation in worldwide human populations • Study design: • Genome-wide patterns of variation; • Fine-scale population structure. • Data structure: • 29 HGDP populations, 485 individuals. • 4 HapMap populations, 112 individuals. • 525,910 SNPs, 396 CNVs (Illumina HumanHap550K). • New findings: • Increasing linkage disequilibrium is observed with increasing geographic distance from Africa (a serial founder effect). • The global distribution of CNVs largely accords with population structure analyses for SNP data sets of similar size. • Conclusions: • Support the utility of CNVs in human population-genetic research.
Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation • Study design: • Human genetic diversity; • Fine-scale population structure. • Data structure: • 51 populations; 938 individuals. • 650,000 SNPs (Illumina HumanHap650K). • New findings: • The relationship between haplotype heterozygosity and geography was consistent with the hypothesis of a serial founder effect with a single origin in sub-Saharan Africa. • Observed a pattern of ancestral allele frequency distributions that reflects variation in population dynamics among geographic regions. • Conclusions: • This data set allows the most comprehensive characterization to date of human genetic variation. Individual ancestry and population substructure are detectable with very high resolution.
Maximum likelihood tree of 51 populations (150,000 SNPs) • Figure labels: Oceania, America, East Asia, South/Central Asia, Europe, Middle East, North Africa
MDS plots of individuals • Panels: SNP, haplotype, CNV
MDS • Chromosome 21, 220 SNPs, Nei’s DA distance
STRs cannot, SNPs can: Europe vs. Middle East
All other Han Chinese • Sky blue: CN-GA, CN-PH • Olive green: TW-HA, TW-HB • Brown: SG-CH
Inference on population structure using multi-locus genotype data • STRUCTURE V2.2.3 • Pritchard, Stephens, and Donnelly (2000) • Falush, Stephens, and Pritchard (2003)
Main objective • Assign individuals to populations on the basis of their genotypes, while simultaneously estimating population allele frequencies
Other objectives • Begin with a set of predefined populations and classify individuals of unknown origin • Identify the extent of admixture of individuals • Infer the origin of particular loci in the sampled individuals
STRUCTURE is a model-based method of clustering (we must make assumptions about a lot of parameters and distributions)
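As a rough illustration of what model-based clustering means in practice, the sketch below runs a toy Gibbs sampler for the simplest case: no admixture, a flat Dirichlet prior on allele frequencies, and a uniform prior on cluster membership. It alternates between sampling allele frequencies given the assignments and sampling assignments given the frequencies. This is not the STRUCTURE implementation; the function names, the prior settings, and the toy data (two groups with fixed allele differences) are assumptions made for the example.

```python
import math
import random

def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def gibbs_no_admixture(genotypes, n_alleles, K, sweeps=200, seed=0):
    """genotypes[i][l] is a tuple of allele indices for individual i at locus l.
    Returns the cluster label of every individual after the last sweep."""
    rng = random.Random(seed)
    N, L = len(genotypes), len(genotypes[0])
    z = [rng.randrange(K) for _ in range(N)]  # random starting assignment
    for _ in range(sweeps):
        # Step 1: sample allele frequencies given assignments,
        # from Dirichlet(1 + allele counts) for each cluster and locus.
        freqs = []
        for k in range(K):
            per_locus = []
            for l in range(L):
                counts = [1.0] * n_alleles[l]
                for i in range(N):
                    if z[i] == k:
                        for a in genotypes[i][l]:
                            counts[a] += 1.0
                per_locus.append(sample_dirichlet(counts, rng))
            freqs.append(per_locus)
        # Step 2: sample each assignment given the frequencies.
        for i in range(N):
            logp = [sum(math.log(max(freqs[k][l][a], 1e-300))
                        for l in range(L) for a in genotypes[i][l])
                    for k in range(K)]
            top = max(logp)
            weights = [math.exp(v - top) for v in logp]
            z[i] = rng.choices(range(K), weights=weights)[0]
    return z

# Hypothetical toy data: 10 individuals homozygous for allele 0 at 10 loci,
# followed by 10 individuals homozygous for allele 1.
toy = [[(0, 0)] * 10 for _ in range(10)] + [[(1, 1)] * 10 for _ in range(10)]
labels = gibbs_no_admixture(toy, n_alleles=[2] * 10, K=2)
```

Cluster labels are arbitrary (label switching), so only the grouping of individuals is meaningful, not which group is called 0 or 1.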
Four basic models • Model without admixture: each individual is assumed to originate in one (and only one) of K populations • Model with admixture: each individual is assumed to have inherited some proportion of its ancestry from each of K populations
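The difference between the two models can be sketched in terms of the probability of observing a single allele copy. Without admixture, the copy is drawn from the allele frequencies of the individual's one population; with admixture, it is drawn from a mixture weighted by the individual's ancestry proportions. The function names and the frequency values below are hypothetical and chosen only for illustration.

```python
def allele_prob_no_admixture(freqs, k, locus, allele):
    """Probability of an allele copy when the individual belongs to population k."""
    return freqs[k][locus][allele]

def allele_prob_admixture(freqs, q, locus, allele):
    """Probability of an allele copy when the individual has ancestry proportions q.
    q[k] is the fraction of the genome inherited from population k."""
    return sum(q_k * freqs[k][locus][allele] for k, q_k in enumerate(q))

# freqs[k][locus][allele]: two populations, one biallelic locus (made-up values).
freqs = [[[0.9, 0.1]], [[0.2, 0.8]]]

pure = allele_prob_no_admixture(freqs, k=0, locus=0, allele=0)
mixed = allele_prob_admixture(freqs, q=[0.5, 0.5], locus=0, allele=0)
unmixed = allele_prob_admixture(freqs, q=[1.0, 0.0], locus=0, allele=0)
```

Setting all of an individual's ancestry to one population (`q=[1.0, 0.0]`) recovers the no-admixture probability, which is one way to see the first model as a special case of the second.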
Four basic models • Linkage model: “chunks” of chromosomes are derived as intact units from one or another of the K populations, and all allele copies on the same “chunk” derive from the same population. The model accounts for the resulting correlations in ancestry between linked loci
Four basic models • F model: the populations are assumed to have all diverged from a common ancestral population at the same time, but may have experienced different amounts of drift since the divergence event
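One way to picture the F model is to draw each population's allele frequencies at a locus from a Dirichlet distribution centred on the ancestral frequencies, with concentration (1 - F_k)/F_k, which is the parameterization used by Falush, Stephens, and Pritchard (2003). A small F_k keeps the population close to the ancestor; a large F_k allows strong drift. The sketch below assumes that parameterization; the function names and numeric values are hypothetical.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def drifted_frequencies(ancestral, F, rng):
    """Sample one population's allele frequencies at one locus.
    ancestral: ancestral allele frequencies (all > 0, summing to 1).
    F: drift parameter of this population, 0 < F < 1."""
    scale = (1.0 - F) / F
    return sample_dirichlet([p * scale for p in ancestral], rng)

rng = random.Random(1)
ancestral = [0.5, 0.5]
little_drift = drifted_frequencies(ancestral, F=0.001, rng=rng)  # stays near 0.5
strong_drift = drifted_frequencies(ancestral, F=0.5, rng=rng)    # can wander far
```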
Assumptions • “Our main modeling assumptions are Hardy-Weinberg equilibrium within populations and complete linkage equilibrium between loci within populations” • “Loosely speaking, the idea here is that the model accounts for the presence of HWD or LD by introducing population structure and attempts to find population groupings that (as far as possible) are not in disequilibrium”
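A small numeric example may help show why unrecognized population structure produces Hardy-Weinberg disequilibrium. If two populations are each in Hardy-Weinberg equilibrium but differ in allele frequency, the pooled sample has fewer heterozygotes than expected from the pooled allele frequency (the Wahlund effect). The allele frequencies and the equal-sample-size assumption below are invented for illustration.

```python
def heterozygosity(p):
    """Expected heterozygote frequency at a biallelic locus under HWE."""
    return 2.0 * p * (1.0 - p)

# Frequency of allele A in two equally sized populations (made-up values).
pop_freqs = [0.9, 0.1]

# Each population is in HWE, so the pooled sample's actual heterozygote
# frequency is the average of the within-population values.
observed_het = sum(heterozygosity(p) for p in pop_freqs) / len(pop_freqs)

# An analyst who ignores structure would expect HWE at the pooled frequency.
pooled_p = sum(pop_freqs) / len(pop_freqs)
expected_het = heterozygosity(pooled_p)

# A positive deficit is the apparent disequilibrium caused by structure.
deficit = expected_het - observed_het
```

With these numbers the pooled sample shows 18% heterozygotes where 50% would be expected; splitting the sample back into its two populations removes the deficit, which is the kind of grouping the model looks for.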
Data • Consider a sample of N individuals, each genotyped at L loci • Assume that the individuals represent a mixture of K unobserved populations (K unknown) • If diploid, we have an N×2L data matrix X • If n-ploid, X is N×nL; J_l denotes the number of alleles at the l-th locus
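The N×2L layout for diploid data can be sketched as follows: each individual becomes one row, and the two allele copies at locus l occupy adjacent columns. The use of -9 as the missing-data code and the helper name `to_data_matrix` are assumptions for this example, as are the toy genotypes.

```python
MISSING = -9  # assumed missing-data code; set it to match your own input files

def to_data_matrix(genotypes):
    """Flatten diploid genotypes into an N x 2L matrix (list of rows).
    genotypes[i][l] is a pair of allele codes, or None if the genotype is missing.
    The two copies at locus l land in columns 2*l and 2*l + 1."""
    matrix = []
    for individual in genotypes:
        row = []
        for pair in individual:
            if pair is None:
                row.extend([MISSING, MISSING])
            else:
                row.extend(pair)
        matrix.append(row)
    return matrix

# Hypothetical sample: N = 3 individuals, L = 2 loci.
genotypes = [
    [(1, 2), (3, 3)],
    [(2, 2), None],     # second locus not genotyped
    [(1, 1), (3, 4)],
]
X = to_data_matrix(genotypes)
```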
Parameter setting • Main parameters (mainparams.txt) • Extra parameters (extraparams.txt)
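As a sketch of what the two parameter files contain, the snippet below renders a handful of settings as `#define KEY VALUE` lines. The key names are given as recalled from the STRUCTURE documentation and should be checked against the template files shipped with your version; the values, the file names `toy_genotypes.txt` and `toy_results`, and the helper `render_params` are hypothetical.

```python
def render_params(settings):
    """Render a dict of settings as '#define KEY VALUE' lines."""
    lines = [f"#define {key} {value}" for key, value in settings.items()]
    return "\n".join(lines) + "\n"

# Settings that describe the data file and the length of the run.
main_params = {
    "MAXPOPS": 3,          # number of clusters K assumed for this run
    "BURNIN": 10000,       # burn-in iterations discarded before sampling
    "NUMREPS": 20000,      # MCMC iterations kept after burn-in
    "INFILE": "toy_genotypes.txt",
    "OUTFILE": "toy_results",
    "NUMINDS": 20,         # N, number of individuals
    "NUMLOCI": 10,         # L, number of loci
    "PLOIDY": 2,
    "MISSING": -9,         # code used for missing genotypes
    "ONEROWPERIND": 1,     # one row of 2L allele columns per individual
}

# Settings that choose the ancestry and allele-frequency models.
extra_params = {
    "NOADMIX": 0,          # 0 = admixture model, 1 = no-admixture model
    "LINKAGE": 0,          # 1 = linkage model
    "FREQSCORR": 1,        # 1 = correlated allele frequencies (F model)
}

mainparams_text = render_params(main_params)
extraparams_text = render_params(extra_params)
```

The two strings can then be saved as mainparams.txt and extraparams.txt; repeating the run with several values of `MAXPOPS` is the usual way to compare different K.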
Commonly used software • STRUCTURE • http://pritch.bsd.uchicago.edu/software/structure2_2.html • EIGENSOFT • http://genepath.med.harvard.edu/~reich/Software.htm • SPSS
Exercise • Run a STRUCTURE analysis using HapMap data; • http://www.hapmap.org