Association Analysis of Rare Genetic Variants

Qunyuan Zhang, Division of Statistical Genomics
Course M21-621: Computational Statistical Genetics

Rare Variants
• Low allele frequency: usually less than 1%
• Low power: for most analyses, due to less variation of observations
• High false positive rate: for some model-based analyses, due to sparse distribution of data, unstable/biased parameter estimation, and inflated test statistics (p-values that are too small)

An Example of Low Power (Jonathan C. Cohen, et al. Science 305, 869 (2004))

An Example of High False Positive Rate (Q-Q plots from GWAS data, unpublished)
Panels: N=~2500, MAF>0.03; N=~2500, MAF<0.03; N=50000, MAF<0.03, bootstrapped; N=~2500, MAF<0.03, permuted

Three Levels of Rare Variant Data
• Level 1: Individual-level
• Level 2: Summarized over subjects
• Level 3: Summarized over both subjects and variants

Level 1: Individual-level

Level 2: Summarized over subjects (by group) (Jonathan C. Cohen, et al. Science 305, 869 (2004))

Level 3: Summarized over subjects (by group) and variants (usually by gene)

Methods For Level 3 Data

Single-variant Test vs. Total Freq. Test (TFT) (Jonathan C. Cohen, et al. Science 305, 869 (2004))

What we have learned …
• A single-variant test of rare variants has very low power for detecting association, due to extremely low frequency (usually < 0.01)
• Testing the collective effect of a set of rare variants may increase the power (sum test, collective test, group test, collapsing test, burden test, …)

Methods For Level 2 Data
• Allowing different sample sizes for different variants
• Different variants can be weighted differently

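CAST: A cohort allelic sums test (Morgenthaler and Thilly, Mutation Research 615 (2007) 28–56)
Under H0: $S_{\text{cases}}/(2N_{\text{cases}}) - S_{\text{controls}}/(2N_{\text{controls}}) = 0$, where S is the variant number and N the sample size.
$T = S_{\text{cases}} - S_{\text{controls}}\,N_{\text{cases}}/N_{\text{controls}} = S_{\text{cases}} - S^{*}_{\text{controls}}$
(S can be calculated variant by variant and weighted differently; the final $T = \sum_i w_i S_i$.)
$Z = T/\sqrt{\mathrm{Var}(T)} \sim N(0,1)$
$\mathrm{Var}(T) = \mathrm{Var}(S_{\text{cases}} - S^{*}_{\text{controls}}) = \mathrm{Var}(S_{\text{cases}}) + \mathrm{Var}(S^{*}_{\text{controls}}) = \mathrm{Var}(S_{\text{cases}}) + \mathrm{Var}(S_{\text{controls}})\,[N_{\text{cases}}/N_{\text{controls}}]^{2}$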
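The CAST statistic can be sketched in a few lines of Python. The slide does not say how Var(S) is obtained, so this sketch assumes the variant counts are approximately Poisson (Var(S) ≈ S). That approximation, the function names, and the two-sided p-value are illustrative assumptions, not part of the slide.

```python
import math

def cast_z(s_cases, s_controls, n_cases, n_controls):
    """Z score for CAST from variant counts S and sample sizes N.

    Assumption (not stated on the slide): counts are roughly Poisson,
    so Var(S) is approximated by S itself.
    """
    ratio = n_cases / n_controls
    # T = S(cases) - S(controls) * N(cases) / N(controls)
    t = s_cases - s_controls * ratio
    # Var(T) = Var(S(cases)) + Var(S(controls)) * [N(cases)/N(controls)]^2
    var_t = s_cases + s_controls * ratio ** 2
    return t / math.sqrt(var_t)

def two_sided_p(z):
    """Two-sided p-value under Z ~ N(0, 1)."""
    return math.erfc(abs(z) / math.sqrt(2))

z = cast_z(s_cases=20, s_controls=5, n_cases=100, n_controls=100)
print(z, two_sided_p(z))
```

With equal counts and equal sample sizes the statistic is zero, as H0 requires.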

C-alpha (PLoS Genetics, 2011, Volume 7, Issue 3, e1001322): effect direction problem

C-alpha

QQ Plots of Existing Methods (under the null)
Panels: EFT, TFT, CAST, C-alpha
• EFT and C-alpha: inflated with false positives
• TFT and CAST: no inflation, but assuming a single effect direction
Objective: more general, more powerful methods …

More Generalized Methods For Level 2 Data

Structure of Level 2 data: one table per variant (variant 1, variant 2, variant 3, …, variant i, …, variant k)
Strategy: instead of testing total freq./number, we test the randomness of all tables.

Exact Probability Test (EPT) (ASHG Meeting 1212, Zhang)
1. Calculating the probability of each table based on the hypergeometric distribution
2. Calculating the log-transformed joint probability (L) for all k tables
3. Enumerating all possible tables and L scores
4. Calculating p-value P = Prob.( )
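The four EPT steps can be sketched for a small number of variants. The p-value expression on the slide did not survive extraction, so the rule used here is an assumption borrowed from Fisher's exact test: sum the probabilities of all table sets whose L is no larger than the observed L. The function names and the carrier-count layout (one count per variant, fixed margins) are also illustrative.

```python
import math
from itertools import product

def hypergeom_pmf(a, m, n_cases, n_total):
    """P(a of the m carriers fall among the cases), margins fixed."""
    return (math.comb(n_cases, a) * math.comb(n_total - n_cases, m - a)
            / math.comb(n_total, m))

def ept_pvalue(case_counts, carrier_totals, n_cases, n_controls):
    """Brute-force EPT over k variants (cost grows exponentially in k).

    Assumed p-value rule: total probability of all table sets whose joint
    log probability L is <= the observed L.
    """
    n_total = n_cases + n_controls
    # Step 1: probability of every possible table, variant by variant
    supports = []
    for m in carrier_totals:
        lo, hi = max(0, m - n_controls), min(m, n_cases)
        supports.append([hypergeom_pmf(a, m, n_cases, n_total)
                         for a in range(lo, hi + 1)])
    # Step 2: joint log probability L of the observed tables
    l_obs = sum(math.log(hypergeom_pmf(a, m, n_cases, n_total))
                for a, m in zip(case_counts, carrier_totals))
    # Steps 3 and 4: enumerate all table sets and accumulate the p-value
    p_value = 0.0
    for probs in product(*supports):
        l_score = sum(math.log(p) for p in probs)
        if l_score <= l_obs + 1e-12:  # tolerance for floating-point ties
            p_value += math.exp(l_score)
    return p_value

print(ept_pvalue(case_counts=[3, 0], carrier_totals=[3, 2],
                 n_cases=10, n_controls=10))
```

Because each table is scored by its probability rather than by a signed difference, variants enriched in cases and variants enriched in controls both push L down.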

Likelihood Ratio Test (LRT): binomial distribution (ASHG Meeting 1212, Zhang)
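The slide names only the binomial distribution, so the following is one plausible form of the test rather than the author's exact method. It assumes that, for each variant, the null model uses one pooled allele frequency and the alternative uses separate frequencies for cases and controls, with the per-variant statistics summed over the k variants. All names are hypothetical.

```python
import math

def _binom_loglik(x, n, p):
    """Binomial log-likelihood without the binomial coefficient,
    which cancels in the likelihood ratio."""
    ll = 0.0
    if x > 0:
        ll += x * math.log(p)
    if n - x > 0:
        ll += (n - x) * math.log(1.0 - p)
    return ll

def lrt_statistic(case_counts, control_counts, n_case_alleles, n_control_alleles):
    """Sum of per-variant binomial LRT statistics; returns (statistic, df).

    Assumed design: H0 uses one pooled frequency per variant,
    H1 uses separate case and control frequencies.
    """
    stat = 0.0
    for x1, x0 in zip(case_counts, control_counts):
        p1 = x1 / n_case_alleles
        p0 = x0 / n_control_alleles
        pooled = (x1 + x0) / (n_case_alleles + n_control_alleles)
        ll_alt = (_binom_loglik(x1, n_case_alleles, p1)
                  + _binom_loglik(x0, n_control_alleles, p0))
        ll_null = (_binom_loglik(x1, n_case_alleles, pooled)
                   + _binom_loglik(x0, n_control_alleles, pooled))
        stat += 2.0 * (ll_alt - ll_null)
    return stat, len(case_counts)

print(lrt_statistic([10, 1], [2, 6], 2000, 2000))
```

Under these assumptions the statistic would be compared with a chi-square distribution with k degrees of freedom, an approximation that can be poor when the counts are very small.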

Q-Q Plots of EPT and LRT (under the null)
Panels: EPT, N=500; LRT, N=500; LRT, N=3000; EPT, N=3000

Power Comparison (significance level = 0.00001; power plotted against sample size)
Variant proportion: positive causal 80%, neutral 20%, negative causal 0%

Power Comparison (significance level = 0.00001; power plotted against sample size)
Variant proportion: positive causal 60%, neutral 20%, negative causal 20%

Power Comparison (significance level = 0.00001; power plotted against sample size)
Variant proportion: positive causal 40%, neutral 20%, negative causal 40%

Methods For Level 1 Data
• Including covariates
• Extended to quantitative traits
• Better control for population structure
• More sophisticated models

Collapsing (C) test (Li and Leal, The American Journal of Human Genetics 2008 (83): 311–321)
Step 1: variant collapsing
Step 2: logit(y) = a + b*X + e (logistic regression)

Variant Collapsing
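As a minimal sketch of the collapsing step, assume genotypes are coded as minor-allele counts (0, 1, 2) in a subjects-by-variants list of lists. Each subject is reduced to a single indicator X that equals 1 if any rare variant in the region is carried. The function names and the 2x2 summary are illustrative and are not taken from Li and Leal's software.

```python
def collapse(genotypes):
    """Step 1: X = 1 if the subject carries at least one rare allele, else 0."""
    return [1 if any(g > 0 for g in row) else 0 for row in genotypes]

def carrier_table(collapsed, is_case):
    """2x2 counts of carrier status by case/control status."""
    table = {("case", 1): 0, ("case", 0): 0, ("control", 1): 0, ("control", 0): 0}
    for x, case in zip(collapsed, is_case):
        table[("case" if case else "control", x)] += 1
    return table

genotypes = [
    [0, 0, 1],  # carries variant 3
    [1, 0, 1],  # carries variants 1 and 3, still collapsed to 1
    [0, 0, 0],
    [0, 2, 0],  # homozygous for variant 2
]
is_case = [True, True, False, False]
x = collapse(genotypes)
print(x)
print(carrier_table(x, is_case))
```

The indicator X would then enter Step 2 as the predictor in the logistic regression.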

WSS

WSS

WSS

Weighted Sum Test
• Collapsing test (Li & Leal, 2008): wi = 1, and s = 1 if s > 1
• Weighted-sum test (Madsen & Browning, 2009): wi calculated based on allele frequency in the control group
• aSum, adaptive sum test (Han & Pan, 2010): wi = -1 if b < 0 and p < 0.1, otherwise wi = 1
• KBAC (Liu and Leal, 2010): wi = left-tail p-value
• RBT (Ionita-Laza et al., 2011): wi = log-scaled probability
• PWST, p-value weighted sum test (Zhang et al., 2011): wi = rescaled left-tail p-value, incorporating both significance and direction
• EREC (Lin et al., 2011): wi = estimated effect size
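To make the weighted sum concrete, here is a sketch of control-frequency weights in the style of Madsen & Browning. The slide gives no formula; the form used here, q_i = (m_i + 1) / (2 n_controls + 2) and w_i = 1 / sqrt(n q_i (1 - q_i)), is the commonly cited version of their weighting and should be checked against the paper before use. Function names and the data layout are assumptions.

```python
import math

def control_frequency_weights(genotypes, is_case):
    """Weights from allele frequencies in controls (assumed Madsen-Browning form).

    genotypes: subjects x variants minor-allele counts (0, 1, 2).
    """
    n = len(genotypes)
    n_controls = sum(1 for case in is_case if not case)
    weights = []
    for i in range(len(genotypes[0])):
        # minor-allele count of variant i among controls
        m_u = sum(row[i] for row, case in zip(genotypes, is_case) if not case)
        # the +1 / +2 adjustment keeps q away from zero for unseen variants
        q = (m_u + 1) / (2 * n_controls + 2)
        weights.append(1.0 / math.sqrt(n * q * (1.0 - q)))
    return weights

def weighted_sum(genotypes, weights):
    """Per-subject weighted sum: sum_i w_i * g_i."""
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]

genotypes = [[0, 1], [0, 0], [1, 0], [0, 0]]
is_case = [True, True, False, False]
w = control_frequency_weights(genotypes, is_case)
print(w)
print(weighted_sum(genotypes, w))
```

Setting every weight to 1 and capping the sum at 1 recovers the collapsing test from the first bullet.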

When there are only causal (+) variants …
Collapsing (Li & Leal, 2008) works well; power is increased

When there are causal (+) and non-causal (.) variants …
Collapsing still works; power is reduced

When there are causal (+), non-causal (.) and causal (-) variants …
Power of the collapsing test drops significantly

P-value Weighted Sum Test (PWST)
Rescaled left-tail p-value [-1,1] is used as weight
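One way to picture a left-tail p-value rescaled to [-1, 1] is the linear map w = 2p - 1, where p is the left-tail probability of a per-variant Z score. The slide does not give the rescaling formula, so both the linear map and the use of a normal Z score are assumptions made for illustration.

```python
import math

def left_tail_p(z):
    """Left-tail probability P(Z <= z) for a standard normal Z."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

def pwst_weight(z):
    """Assumed rescaling: map the left-tail p-value from [0, 1] to [-1, 1]."""
    return 2.0 * left_tail_p(z) - 1.0

for z in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(z, round(pwst_weight(z), 4))
```

Under this map a strongly protective variant gets a weight near -1, a strongly deleterious one a weight near +1, and a variant with no signal a weight near 0, so the sign carries direction and the magnitude carries significance.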

P-value Weighted Sum Test (PWST)
Power of the collapsing test is retained even when there are bidirectional effects

PWST: Q-Q Plots Under the Null
• Direct test: inflation of type I error
• Corrected by permutation test (permutation of phenotype)
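The phenotype-permutation correction can be sketched generically: recompute the data-driven statistic on shuffled phenotypes and report how often it reaches the observed value. The function names, the one-sided comparison, and the example statistic are hypothetical; any statistic whose weights are learned from the data would have to be recomputed inside each permutation in the same way.

```python
import random

def permutation_pvalue(statistic, burden, phenotype, n_perm=1000, seed=0):
    """Empirical one-sided p-value from permuting the phenotype labels."""
    rng = random.Random(seed)
    observed = statistic(burden, phenotype)
    permuted = list(phenotype)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(permuted)
        if statistic(burden, permuted) >= observed:
            hits += 1
    # the +1 terms count the observed data as one permutation
    return (hits + 1) / (n_perm + 1)

def burden_difference(burden, phenotype):
    """Example statistic: mean burden in cases minus mean burden in controls."""
    cases = [b for b, y in zip(burden, phenotype) if y == 1]
    controls = [b for b, y in zip(burden, phenotype) if y == 0]
    return sum(cases) / len(cases) - sum(controls) / len(controls)

burden = [1, 1, 1, 0, 0, 0, 1, 0]
phenotype = [1, 1, 1, 0, 0, 0, 0, 1]
print(permutation_pvalue(burden_difference, burden, phenotype))
```

Shuffling phenotypes treats subjects as exchangeable, which holds for unrelated samples but not for family data.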

Generalized Linear Mixed Model (GLMM) & Weighted Sum Test (WST)

GLMM & WST
Model implied by the definitions below: $Y = \alpha + \beta \sum_{i=1}^{m} w_i g_i + X\tau + Z\gamma + \varepsilon$
• Y: quantitative trait or logit(binary trait)
• α: intercept
• β: regression coefficient of the weighted sum
• m: number of RVs to be collapsed
• wi: weight of variant i
• gi: genotype (recoded) of variant i
• Σwigi: weighted sum (WS)
• X: covariate(s), such as population structure variable(s)
• τ: fixed effect(s) of X
• Z: design matrix corresponding to γ
• γ: random polygenic effects for individual subjects, γ ~ N(0, G), G = 2σ²K, where K is the kinship matrix and σ² the additive polygenic variance
• ε: residual

Weight
• Based on allele frequency: binary (0, 1) or continuous, fixed or variable threshold
• Based on function annotation/prediction: SIFT, PolyPhen, etc.
• Based on sequencing quality (coverage, mapping quality, genotyping quality, etc.)
• Data-driven: using both genotype and phenotype data, learning weights from data or adaptive selection, permutation test
• Any combination …

Application 1: Family Data
Adjusting relatedness in family data for non-data-driven tests of rare variants.
Unadjusted vs. adjusted model; in the adjusted model, γ ~ N(0, 2σ²K)

Q-Q Plots of -log10(P) under the Null (From Zhang et al., 2011, BMC Proc.)
• Li & Leal's collapsing test, ignoring family structure: inflation of type I error
• Li & Leal's collapsing test, modeling family structure via GLMM: inflation is corrected

Application 2: Permuting Family Data
MMPT: Mixed Model-based Permutation Test
Adjusting relatedness in family data for data-driven permutation tests of rare variants.
Figure labels: permuted; non-permuted, subject IDs fixed. γ ~ N(0, 2σ²K)

Q-Q Plots under the Null (From Zhang et al., 2011, IGES Meeting)
Permutation test, ignoring family structure: inflation of type I error
Panels: WSS, aSum, PWST, SPWST

Q-Q Plots under the Null (From Zhang et al., 2011, IGES Meeting)
Mixed model-based permutation test (MMPT), modeling family structure: inflation corrected
Panels: WSS, aSum, PWST, SPWST

Burden Test vs. Non-burden Test
• Burden test: T-test, Likelihood Ratio Test, F-test, score test, …
• Non-burden test: SKAT (sequence kernel association test)

SKAT: sequence kernel association test
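The contrast between a burden statistic and a SKAT-style statistic can be illustrated with per-variant score contributions. This sketch assumes the widely used linear weighted kernel form, Q = Σ_j w_j² (Σ_i (y_i - μ_i) g_ij)², with μ_i the fitted mean under the null. The slide's own formula is not recoverable, the function names are illustrative, and the p-value calculation (a mixture of chi-square distributions) is omitted.

```python
def variant_scores(genotypes, y, mu):
    """Per-variant score contributions s_j = sum_i (y_i - mu_i) * g_ij."""
    n_variants = len(genotypes[0])
    return [sum((yi - mi) * row[j] for row, yi, mi in zip(genotypes, y, mu))
            for j in range(n_variants)]

def burden_statistic(scores, weights):
    """Burden-type: square of the weighted sum, so opposite signs cancel."""
    return sum(w * s for w, s in zip(weights, scores)) ** 2

def skat_q(scores, weights):
    """SKAT-type (assumed linear weighted kernel): weighted sum of squares."""
    return sum((w * s) ** 2 for w, s in zip(weights, scores))

genotypes = [[1, 0], [1, 0], [0, 1], [0, 1]]
y = [1, 1, 0, 0]     # variant 1 appears in cases, variant 2 in controls
mu = [0.5] * 4       # null mean for a balanced binary trait
weights = [1.0, 1.0]
s = variant_scores(genotypes, y, mu)
print(s, burden_statistic(s, weights), skat_q(s, weights))
```

In this toy data the two variants act in opposite directions, so the burden statistic is zero while the SKAT-type statistic is not.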

Extension of SKAT to Family Data (Han Chen et al., 2012, Genetic Epidemiology)
Equation annotations: kinship matrix; polygenic heritability of the trait; residual

Other problems
• Missing genotypes & imputation
• Genotyping errors & QC (family consistency, sequence review)
• Population stratification
• Inherited variants and de novo mutations
• Family data & linkage information
• Variant validation and association validation
• Public databases
• And more …
