Graduate course of the Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences: Human Population Genetics. Human Population Genetics: Basic Principles and Analytical Methods. CAS-MPG Partner Institute for Computational Biology. Xu Shuhua (徐书华), Jin Li (金力).
Lecture 6: Analysis of Human Population Genetic Structure (I)
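Lecture 4 • Basic concepts in population genetics (4) • Population genetic structure • Statistics describing population genetic structure • Hierarchical F statistics • Software demonstration • Computing hierarchical F statistics for human populations with Arlequin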
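What is genetic structure? • Find structure in differences! • Genetic structure is the pattern by which genetic polymorphism is distributed differently across time and space. • Time: different eras; different generations. • Space: different geographic distributions; different populations in the same region; different genomic regions.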
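Why study human genetic structure? • Human origins, migration, evolutionary history, and future prospects • Relationships among modern populations (ethnic groups) • Genetic basis and gene mapping of complex diseases • Cancer • Obesity • Asthma • Psychiatric disorders • Type II diabetes • Cardiovascular diseases • Public health care • Personalized medication and personalized therapy • Forensic science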
An example • Population structures and association studies
Population structure causes trouble in association studies • Population stratification in epidemiology. • Analyzing mixed samples drawn from groups with different allele frequencies is a primary concern in human genetics, because it can produce false evidence of allelic association.
Odds ratio
                        Disease
Exposure    yes      no       total
yes         a        b        a + b
no          c        d        c + d
total       a + c    b + d    a + b + c + d
Odds for case: a/c • Odds for control: b/d • Odds ratio: $OR = \frac{a/c}{b/d} = \frac{ad}{bc}$
Explanation of OR • OR>1: exposure factors increase the risk of disease; positive association • OR<1: exposure factors decrease the risk of disease; negative association • OR=1: no association
Example Odds for case 50:50 = 1 Odds for control 20:80 = 0.25 Odds ratio = 50:50/20:80 = 1/0.25 = 4
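The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch; the function name `odds_ratio` and its argument order (a, b, c, d as in the 2×2 table on the odds-ratio slide) are illustrative choices, not part of the original slides.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 exposure/disease table.

    a: exposed cases,   b: exposed controls,
    c: unexposed cases, d: unexposed controls.
    """
    odds_case = a / c      # odds of exposure among cases
    odds_control = b / d   # odds of exposure among controls
    return odds_case / odds_control


# Numbers from the slide: cases 50:50, controls 20:80
example_or = odds_ratio(a=50, b=20, c=50, d=80)
print(example_or)
```

An odds ratio above 1, as here, indicates a positive association between exposure and disease.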
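Subpopulation 1
            case     control   total
exp(+)      50       50        100
exp(-)      450      450       900
total       500      500       1,000

Subpopulation 2
            case     control   total
exp(+)      1        9         10
exp(-)      99       891       990
total       100      900       1,000

Total population
            case     control   total
exp(+)      51       59        110
exp(-)      549      1,341     1,890
total       600      1,400     2,000

Exposure frequency: cases 51/600 = 8.5%; controls 59/1,400 = 4.2%.
Heterogeneity/Stratification: OR = 1 within each subpopulation, yet the pooled sample shows an apparent association. The ratio of exposure frequencies is 8.5%/4.2% ≈ 2.02; the pooled odds ratio, (51 × 1,341)/(59 × 549), is ≈ 2.11.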
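The stratification effect can be reproduced directly from the counts on this slide. The sketch below uses a hypothetical helper `odds_ratio` (cross-product form) and tuples ordered as (exposed cases, exposed controls, unexposed cases, unexposed controls). Computed this way, the pooled odds ratio is about 2.11, while the ratio of the two exposure frequencies (8.5% / 4.2%) is about 2.02.

```python
def odds_ratio(a, b, c, d):
    """Cross-product odds ratio: (a*d) / (b*c)."""
    return (a * d) / (b * c)


# (exposed cases, exposed controls, unexposed cases, unexposed controls)
sub1 = (50, 50, 450, 450)
sub2 = (1, 9, 99, 891)

# Pool the two subpopulations cell by cell
pooled = tuple(x + y for x, y in zip(sub1, sub2))

or_sub1 = odds_ratio(*sub1)      # no association within subpopulation 1
or_sub2 = odds_ratio(*sub2)      # no association within subpopulation 2
or_pooled = odds_ratio(*pooled)  # spurious association after pooling

# Exposure frequency among pooled cases and pooled controls
exp_cases = pooled[0] / (pooled[0] + pooled[2])
exp_controls = pooled[1] / (pooled[1] + pooled[3])

print(or_sub1, or_sub2, round(or_pooled, 2))
print(round(exp_cases, 3), round(exp_controls, 3))
```

Neither subpopulation shows any association on its own; the signal in the pooled table arises only because the two groups differ in both disease prevalence and exposure frequency.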
Human migration • Anatomically modern humans evolve in Africa > 160,000 ybp. • Some leave Africa sometime around 75,000 - 55,000 ybp. • Replace Neanderthals in Europe and archaic humans around the world. • Arrive in Western hemisphere between 34,000 and 18,000 ybp. • Multiple migrations in different pre-historic periods, followed by different migrations in historical periods.
Note on Definitions: Biological Race • morphology (phenotype) • Geographical location • Population based (frequency of genes) Socially Constructed Race: Arbitrarily utilizes aspects of morphology, geography, culture, language, religion, etc. in the service of a social dominance hierarchy.
Statistics describing genetic structure • Hierarchical F statistics
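Fixation index • Fixation index (F): • If a locus has two alleles, A1 and A2, with frequencies $x$ and $1 - x$, any deviation from Hardy-Weinberg proportions can be measured by a parameter F, called the fixation index. The genotype frequencies of A1A1, A1A2, and A2A2 are then given by:
$X_{11} = (1 - F)x^2 + Fx$
$X_{12} = 2(1 - F)x(1 - x)$
$X_{22} = (1 - F)(1 - x)^2 + F(1 - x)$
• From the second equation:
$F = \frac{h - h_0}{h}$
where $h = 2x(1 - x)$ is the expected frequency of heterozygotes under random mating and $h_0 = X_{12}$ is the observed frequency of heterozygotes in the population. • The fixation index F can be positive or negative, depending on the situation. • When $h_0$ is smaller than $h$, F is positive; when $h_0$ is larger than $h$, F is negative. Under inbreeding the observed heterozygote frequency decreases, so F is positive. • The equations above can also be written as:
$X_{11} = x^2 + Fx(1 - x)$
$X_{12} = 2x(1 - x) - 2Fx(1 - x)$
$X_{22} = (1 - x)^2 + Fx(1 - x)$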
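As a rough illustration of the relation $F = (h - h_0)/h$, the sketch below estimates the fixation index from genotype counts at a two-allele locus. The function name `fixation_index` and the example counts are hypothetical, chosen only to show a heterozygote deficit, Hardy-Weinberg proportions, and a heterozygote excess.

```python
def fixation_index(n_AA, n_Aa, n_aa):
    """F = (h - h0) / h for a two-allele locus.

    h  : heterozygote frequency expected under random mating, 2x(1 - x)
    h0 : observed heterozygote frequency
    Assumes the locus is polymorphic (h > 0).
    """
    n = n_AA + n_Aa + n_aa
    x = (2 * n_AA + n_Aa) / (2 * n)  # frequency of allele A
    h = 2 * x * (1 - x)
    h0 = n_Aa / n
    return (h - h0) / h


f_deficit = fixation_index(40, 20, 40)  # fewer heterozygotes than expected
f_hw = fixation_index(25, 50, 25)       # exact Hardy-Weinberg proportions
f_excess = fixation_index(10, 80, 10)   # more heterozygotes than expected

print(f_deficit, f_hw, f_excess)
```

The sign of the result follows the rule on the slide: positive when observed heterozygosity falls short of expectation, negative when it exceeds it.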
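Subpopulations • The discussion so far considered a single simple population, whether or not it is inbred. • In practice, however, most natural populations can be subdivided into many different breeding units, or subpopulations, even though these are not completely isolated from one another. In this situation it becomes important to study genetic variation within and between populations.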
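Genotype frequencies in a subdivided population • Suppose a population is divided into s subpopulations, each in Hardy-Weinberg equilibrium. Let $x_k$ be the frequency of allele A1 in the k-th subpopulation; the frequencies of genotypes A1A1, A1A2, and A2A2 in that subpopulation are then $x_k^2$, $2x_k(1 - x_k)$, and $(1 - x_k)^2$. • Let $w_k$ denote the relative size of the k-th subpopulation, with $\sum_k w_k = 1$. The frequencies of A1A1, A1A2, and A2A2 in the whole population are then:
$X_{11} = \sum_k w_k x_k^2 = \bar{x}^2 + \sigma^2$
$X_{12} = 2\sum_k w_k x_k(1 - x_k) = 2\bar{x}(1 - \bar{x}) - 2\sigma^2$
$X_{22} = \sum_k w_k (1 - x_k)^2 = (1 - \bar{x})^2 + \sigma^2$
where $\bar{x} = \sum_k w_k x_k$ and $\sigma^2 = \sum_k w_k (x_k - \bar{x})^2$ are the mean and variance of the allele frequency among subpopulations.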
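Fixation index in a subdivided population • Compare the whole-population frequencies $X_{11} = \bar{x}^2 + \sigma^2$, $X_{12} = 2\bar{x}(1 - \bar{x}) - 2\sigma^2$, $X_{22} = (1 - \bar{x})^2 + \sigma^2$ with the single-population form $X_{11} = x^2 + Fx(1 - x)$, $X_{12} = 2x(1 - x) - 2Fx(1 - x)$, $X_{22} = (1 - x)^2 + Fx(1 - x)$. • Therefore we know that $F = \frac{\sigma^2}{\bar{x}(1 - \bar{x})}$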
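Wahlund's law • This result shows that if a population is divided into several mating units, the frequency of homozygotes is higher than the Hardy-Weinberg proportion. This property was first discovered by Wahlund (1928) and is called Wahlund's law, also known as the Wahlund effect. • When the allele frequency is the same in all subpopulations, F is 0; when every subpopulation is fixed for one allele or the other, F is 1.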
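The Wahlund effect can be checked numerically. The sketch below assumes each subpopulation is in Hardy-Weinberg equilibrium and uses the standard relation $F = \sigma^2 / [\bar{x}(1 - \bar{x})]$, where $\bar{x}$ and $\sigma^2$ are the weighted mean and variance of the allele frequency among subpopulations. The function name `wahlund` and the example frequencies are hypothetical.

```python
def wahlund(freqs, weights):
    """Pooled genotype frequencies and F for subpopulations in HW equilibrium.

    freqs   : frequency of allele A1 in each subpopulation
    weights : relative subpopulation sizes (must sum to 1)
    Returns (X11, X12, X22, F).
    """
    x_bar = sum(w * x for w, x in zip(weights, freqs))
    var = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, freqs))
    X11 = sum(w * x ** 2 for w, x in zip(weights, freqs))
    X12 = sum(w * 2 * x * (1 - x) for w, x in zip(weights, freqs))
    X22 = sum(w * (1 - x) ** 2 for w, x in zip(weights, freqs))
    F = var / (x_bar * (1 - x_bar))
    return X11, X12, X22, F


# Two equally sized subpopulations with different allele frequencies
X11, X12, X22, F = wahlund([0.2, 0.8], [0.5, 0.5])
print(X11, X12, X22, F)

# Identical frequencies give F = 0; complete fixation gives F = 1
F_same = wahlund([0.3, 0.3], [0.5, 0.5])[3]
F_fixed = wahlund([1.0, 0.0], [0.5, 0.5])[3]
```

In the first example the pooled heterozygote frequency (0.32) is below the Hardy-Weinberg expectation for the pooled allele frequency (0.5), which is the homozygote excess the slide describes.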
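Implications of the Wahlund effect • A positive F (heterozygote deficit) points to the existence of population structure! • Conversely, when F is negative, heterozygotes are more frequent than expected under Hardy-Weinberg equilibrium, which suggests heterozygote advantage, i.e. that some form of natural selection is acting. • Heterozygote advantage and balancing selection (discussed in detail in the later "Natural selection" chapter)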
Wright’s Fixation Index (FST) Sewall Wright 1889-1988
F-statistics • Different F-statistics for different scales • Individual (I) • Subpopulation (S) • Total population (T) • Those are the traditional scales but in theory there can be no limit to the # of levels of analysis . • Originally defined for 2 alleles • Extended to >2 alleles as G-statistics
F-statistics Derived from inbreeding coefficient • FIS • inbreeding in individuals relative to subpopulation (Weir and Cockerham's f) • FST • inbreeding among subpopulations relative to total population (Weir and Cockerham's θ) • FIT • inbreeding among individuals relative to total population (Weir and Cockerham's F)
Remember that the inbreeding coefficient, F, is related to loss of heterozygosity: F = 1 – (Ho/He) • F-statistics can be expressed in the same way: FIS = 1 – (HI/HS); FST = 1 – (HS/HT); FIT = 1 – (HI/HT) • HI = Ho averaged across subpopulations • HS = He averaged across subpopulations • HT = He for the total population
Deficit of heterozygotes • FST = 1 – (HS/HT) • One subpopulation is entirely AA: P(A) = p = 1, P(a) = q = 0. The other is entirely aa: p = 0, q = 1. • HS = He within subpopulation: $H_S = 1 - \sum p_i^2 = 1 - (1^2 + 0^2) = 0$ in each subpopulation, so mean HS = 0 • HT = He for total population: for the total population p = 0.5 and q = 0.5, so $H_T = 1 - \sum p_i^2 = 1 - (0.5^2 + 0.5^2) = 0.5$ • FST = 1 – (HS/HT) = 1 – (0/0.5) = 1
Deficit of homozygotes • FST = 1 – (HS/HT) • Both subpopulations consist entirely of Aa heterozygotes: P(A) = p = 0.5 and P(a) = q = 0.5 in each. • $H_S = 1 - \sum p_i^2 = 1 - (0.5^2 + 0.5^2) = 0.5$ in each subpopulation, so mean HS = 0.5 • HT = He for total population: for the total population p = 0.5 and q = 0.5, so $H_T = 1 - \sum p_i^2 = 1 - (0.5^2 + 0.5^2) = 0.5$ • FST = 1 – (HS/HT) = 1 – (0.5/0.5) = 0
FST = 1 – (HS/HT) • Genotypes across the two subpopulations: AA AA Aa Aa Aa Aa aa aa • In each subpopulation P(A) = p = 0.5 and P(a) = q = 0.5 • $H_S = 1 - \sum p_i^2 = 1 - (0.5^2 + 0.5^2) = 0.5$ in each subpopulation, so mean HS = 0.5 • HT = He for total population: for the total population p = 0.5 and q = 0.5, so $H_T = 1 - \sum p_i^2 = 1 - (0.5^2 + 0.5^2) = 0.5$ • FST = 1 – (HS/HT) = 1 – (0.5/0.5) = 0 • FST uses expected heterozygosity, not observed heterozygosity!
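The worked examples on these slides can be reproduced with a short script. This is a minimal sketch for a biallelic locus that assumes equally sized subpopulations, so the total allele frequency is the plain average of the subpopulation frequencies; the function names are illustrative.

```python
def expected_het(p):
    """Expected heterozygosity He = 1 - sum(p_i^2) for two alleles."""
    q = 1 - p
    return 1 - (p ** 2 + q ** 2)


def fst_from_freqs(subpop_p):
    """FST = 1 - HS/HT from allele frequencies of equally sized subpopulations."""
    h_s = sum(expected_het(p) for p in subpop_p) / len(subpop_p)  # mean He within
    p_total = sum(subpop_p) / len(subpop_p)                       # pooled frequency
    h_t = expected_het(p_total)                                   # He of the total
    return 1 - h_s / h_t


fst_fixed = fst_from_freqs([1.0, 0.0])  # one subpopulation all AA, the other all aa
fst_equal = fst_from_freqs([0.5, 0.5])  # same allele frequency in both
print(fst_fixed, fst_equal)
```

Only allele frequencies enter the calculation, which is why the all-heterozygote case and the mixed-genotype case give the same FST of 0.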
F statistics • FIS tells us if there is inbreeding within subpopulations by comparing HI and HS: $F_{IS} = \frac{\bar{H}_S - \bar{H}_I}{\bar{H}_S}$ • Bars mean that the values are the averages over all the subpopulations that we are considering. • So FIS measures whether there is, on average, a deficit of heterozygotes within subpopulations.
F statistics • FST is the statistic that tells us how differentiated the subpopulations are. Formally, FST tells us if there is a deficit of heterozygotes in the metapopulation, due to differentiation among subpopulations: $F_{ST} = \frac{H_T - \bar{H}_S}{H_T}$ • Bars mean that the values are the averages over all the subpopulations that we are considering.
F statistics • FIT tells us how much population structure has affected the average heterozygosity of individuals within the population: $F_{IT} = \frac{H_T - \bar{H}_I}{H_T}$ • Also (1 – FIS)(1 – FST) = (1 – FIT).
F-statistics Measure departure from Hardy-Weinberg equilibrium • FIS = departure from HW in local subpopulations • FST = genetic divergence among subpopulations • FIT = total departure from HW including that within and among subpopulations
Partitioning of structure • FIS: individuals within subpopulations (inbreeding) • FST: subpopulations within the total population (Wahlund effect or fragmentation) • FIT: individuals within the total population • 1 – FIT = (1 – FST)(1 – FIS) • FIT = FIS + FST – (FIS)(FST)
The three F statistics are related to each other • FST = (FIT – FIS) / (1 – FIS) • FST is never negative (it is 0 when there is no differentiation) • FIS is frequently positive, and is negative if there is systematic avoidance of inbreeding • FIT is positive unless there are no clear subdivisions and there is avoidance of inbreeding
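The relations among the three statistics can be verified numerically from the heterozygosity definitions FIS = 1 – (HI/HS), FST = 1 – (HS/HT), FIT = 1 – (HI/HT). The heterozygosity values below are hypothetical, picked only so that HI < HS < HT.

```python
def f_statistics(h_i, h_s, h_t):
    """Return (FIS, FST, FIT) from mean observed heterozygosity (h_i),
    mean expected heterozygosity within subpopulations (h_s), and
    expected heterozygosity of the total population (h_t)."""
    f_is = 1 - h_i / h_s
    f_st = 1 - h_s / h_t
    f_it = 1 - h_i / h_t
    return f_is, f_st, f_it


# Hypothetical values for illustration only
f_is, f_st, f_it = f_statistics(h_i=0.30, h_s=0.40, h_t=0.50)
print(f_is, f_st, f_it)

# Both identities from the slides hold for any positive h_s and h_t
lhs = (1 - f_is) * (1 - f_st)
rhs = 1 - f_it
fst_from_others = (f_it - f_is) / (1 - f_is)
```

The identities hold algebraically because the HS terms cancel in the product (HI/HS)(HS/HT) = HI/HT.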
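Extensions • FST can also be expressed through the variance of allele frequencies across subpopulations: $F_{ST} = \frac{\mathrm{Var}(q)}{\bar{q}(1 - \bar{q})}$ • When the total population is in HW (all subpopulations share the same allele frequency), Var(q) = 0, therefore FST = 0 • As Var(q) increases, divergence of subpopulations increases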
Intuitive meaning of FST • The proportion of total genetic variation that is distributed among subpopulations, rather than within subpopulations.
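Unbiased estimates of FST • Unbiased estimates of FST were calculated as described by Weir and Hill (2002). • Suppose we have r subpopulations, indexed by i = 1, …, r. We denote the sample size of subpopulation i as $n_i$, its sample allele frequency as $p_i$, and the average frequency over samples as $\bar{p} = \frac{\sum_i n_i p_i}{\sum_i n_i}$
The observed mean square for loci within populations is denoted by MSG: $MSG = \frac{1}{\sum_{i=1}^{r}(n_i - 1)} \sum_{i=1}^{r} n_i p_i (1 - p_i)$
The observed mean square between populations is denoted by MSP: $MSP = \frac{1}{r - 1} \sum_{i=1}^{r} n_i (p_i - \bar{p})^2$
Then FST can be estimated as follows: $\hat{F}_{ST} = \frac{MSP - MSG}{MSP + (n_c - 1)\,MSG}$ where $n_c$ is the average sample size across samples that also incorporates and corrects for the variance in sample size over subpopulations: $n_c = \frac{1}{r - 1}\left(\sum_{i=1}^{r} n_i - \frac{\sum_{i=1}^{r} n_i^2}{\sum_{i=1}^{r} n_i}\right)$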
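A minimal sketch of this estimator for a single biallelic locus follows. It assumes the commonly used mean-square forms MSG = Σ nᵢpᵢ(1 − pᵢ) / Σ(nᵢ − 1), MSP = Σ nᵢ(pᵢ − p̄)² / (r − 1), and n_c = (Σ nᵢ − Σ nᵢ² / Σ nᵢ) / (r − 1), with nᵢ and pᵢ the sample size and sample allele frequency of subpopulation i. The function name `fst_estimate` and the example inputs are hypothetical.

```python
def fst_estimate(freqs, sizes):
    """Mean-square estimator of FST for one biallelic locus.

    freqs : sample allele frequency p_i in each subpopulation
    sizes : sample size n_i of each subpopulation
    """
    r = len(freqs)
    n_total = sum(sizes)
    # Sample-size-weighted average allele frequency
    p_bar = sum(n * p for n, p in zip(sizes, freqs)) / n_total
    # Mean square within populations
    msg = sum(n * p * (1 - p) for n, p in zip(sizes, freqs)) / sum(n - 1 for n in sizes)
    # Mean square between populations
    msp = sum(n * (p - p_bar) ** 2 for n, p in zip(sizes, freqs)) / (r - 1)
    # Average sample size corrected for variance in sample size
    n_c = (n_total - sum(n * n for n in sizes) / n_total) / (r - 1)
    return (msp - msg) / (msp + (n_c - 1) * msg)


fst_diverged = fst_estimate([0.2, 0.8], [50, 50])
fst_identical = fst_estimate([0.5, 0.5], [50, 50])
print(round(fst_diverged, 4), round(fst_identical, 4))
```

Note that the estimate for identical sample frequencies is slightly negative; that is a known property of bias-corrected estimators and is usually read as no detectable differentiation.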
Problems with FST • Assumes Infinite Alleles Model (IAM) or K-alleles model with very low mutation rates (not appropriate for microsat data) • All alleles differ equally from each other (magnitude of difference between alleles ignored) • Does not work well with high heterozygosity • Assumes alleles arrive in population via migration rather than mutation
Special version for microsatellites • RST (Slatkin 1995) • Analogue of FST • Assumes Stepwise Mutation Model (the mutation model most appropriate for microsatellites) • Allows for high mutation rates • Allows differences in magnitude between alleles to be accounted for • $R_{ST} = \frac{S - S_W}{S}$, where S = average sum of squared differences in allele sizes in the total population, and SW = average sum of squared differences within populations
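To make the idea concrete, here is a simplified sketch of RST = (S − SW)/S based on squared differences in allele size (repeat number). It assumes equal sample sizes, takes S as the mean squared difference over all pairs of allele copies in the pooled sample, and SW as the average of the same quantity computed within each population. It illustrates the principle only and is not a reimplementation of Slatkin's estimator or of any software package; the function names are hypothetical.

```python
from itertools import combinations


def mean_sq_diff(allele_sizes):
    """Mean squared difference in allele size over all pairs of allele copies."""
    pairs = list(combinations(allele_sizes, 2))
    return sum((a - b) ** 2 for a, b in pairs) / len(pairs)


def rst(populations):
    """Simplified RST = (S - SW) / S for equally sized samples.

    populations : list of lists of allele sizes (repeat counts)
    """
    s_w = sum(mean_sq_diff(pop) for pop in populations) / len(populations)
    pooled = [a for pop in populations for a in pop]
    s_total = mean_sq_diff(pooled)
    return (s_total - s_w) / s_total


# Populations fixed for alleles of different sizes: complete differentiation
rst_fixed = rst([[10, 10, 10, 10], [14, 14, 14, 14]])

# Populations with the same allele-size distribution: no differentiation
rst_same = rst([[10, 10, 12, 12], [10, 10, 12, 12]])
print(rst_fixed, round(rst_same, 3))
```

Unlike FST, this measure grows with the size gap between alleles, so alleles 10 and 14 count as more different than alleles 10 and 11.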