A GLMM-based Collapsing Method for Rare CNV Analysis

1 Jung-Ying Tzeng, Bioinformatics Research Center & Department of Statistics, NC State University. Joint work with Jin Szatkiewicz and Patrick Sullivan @ UNC-CH. ENAR, March 18, 2014

2 Copy Number Variants (CNVs)
• CNVs: changes in the number of DNA copies compared with the reference
• Although SNPs outnumber CNVs, their relative contributions to genomic variation (as measured in nucleotides) are similar (Malhotra and Sebat 2012)
[Figure: a reference sequence shown alongside a duplication and a deletion; CNV size ranges from 1 bp to Mb (Source: Ferreira and Purcell 2009)]

3 Copy Number Variants (CNVs) • CNVs can affect disease risk Ex. CNVs play an important role in the etiology of multiple psychiatric disorders, e.g., developmental delay, autism, schizophrenia

Malhotra and Sebat 2012

6 Collapsing Analysis for rare CNVs
• Collapsing analysis serves as a key approach to evaluate the collective effect of rare CNVs (Sullivan et al. 2012; Collins and Sullivan 2013; Malhotra & Sebat 2012)
• CNVs are typically collapsed across the genome or within genes
  • Ex. (across the genome) a greater genome-wide burden of rare CNVs in SCZ cases than in controls (Walsh et al. 2008 Science; International Schizophrenia Consortium 2008 Nature; Kirov et al. 2009 Hum. Mol. Genet.; Buizer-Voskamp et al. 2011 Biol. Psychiatry)
  • Ex. (within genes) the burden of rare CNVs in NRXN1 was significantly greater in SCZ cases than in controls (Szatkiewicz et al. submitted)

7 Developments in SNP Collapsing Analysis
• Depending on how genotype information is modeled, SNP collapsing methods can be roughly classified into
  • Fixed effects approaches
  • Random effects approaches

8 SNP Collapsing Analysis
• 1. Fixed effects approaches
  • Focus on testing the mean level of genetic effects
  • Optimal if the effects of different loci are additive and have similar size and the same direction

9 SNP Collapsing Analysis
• 2. Random effects approaches
  • Focus on testing the variance level of genetic effects
  • Use $S_{ik}$, the genetic similarity between subjects $i$ and $k$

10 SNP Collapsing Approaches
• 2. Random effects approaches
  • Methods differ by the choices of the locus weights and of the per-locus similarity between subjects $i$ and $k$
  • E.g., Global test (Goeman et al. 2004): no weights
  • C-alpha method (Neale et al. 2011): weight = I{MAF < cut}
  • Kernel Machine Regression (Wu et al. 2010, 2011): similarity = IBS at locus $j$ between subjects $i$ and $k$, and weight = (1-MAF)^24
  • Similarity Regression (Tzeng et al. 2009, 2011, 2014): similarity = IBS at locus $j$ between subjects $i$ and $k$, with a locus-specific weight
  • Optimal if genetic effects are interactive / non-linear among loci or vary across loci
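
The weighted IBS similarity listed above for kernel machine regression can be sketched in a few lines. This is only an illustration under stated assumptions: genotypes are coded as minor-allele counts 0/1/2, IBS at a locus is taken as 2 minus the absolute genotype difference, and the division by twice the total weight (so that identical genotypes score 1) is a normalization chosen here, not something stated on the slide. The function names are hypothetical.

```python
def ibs(g1, g2):
    """Alleles shared identical-by-state at one locus (genotypes coded 0/1/2)."""
    return 2 - abs(g1 - g2)


def weighted_ibs_similarity(geno_i, geno_k, mafs, power=24):
    """Weighted IBS similarity between two subjects across loci.

    Weights follow the (1 - MAF)^24 form quoted on the slide; the
    normalization by 2 * sum(weights) is an assumption of this sketch.
    """
    weights = [(1.0 - maf) ** power for maf in mafs]
    shared = sum(w * ibs(a, b) for w, a, b in zip(weights, geno_i, geno_k))
    return shared / (2.0 * sum(weights))


# Rare variants (small MAF) get weights close to 1, common ones close to 0,
# so sharing at rare loci dominates the similarity.
example = weighted_ibs_similarity([0, 1, 0], [0, 1, 2], [0.01, 0.02, 0.30])
```

With these weights, the mismatch at the common third locus barely lowers the similarity, which is the intended behavior when rare variants are the focus.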

11 Challenges in CNV Collapsing Analysis: Cautions about applying SNP collapsing methods
• Copy number (dosage) is not binary
  • Deletion (0, 1), normal copy (2), and duplication (3, 4+)
  • SNP collapsing methods assume a binary event (i.e., mutant allele vs. not) and only keep track of the number of "events"
• CNV polymorphisms are multi-faceted
  • CNVs can vary in dosage, length, and details of gene intersections
  • Each of these "features" affects a CNV's impact on disease risk
  • SNP collapsing methods target only one feature (i.e., mutation burden)

12 Challenges in CNV Collapsing Analysis: Cautions about applying SNP collapsing methods
• Etiological heterogeneity is often observed in CNVs
  • Different dosages may have different effects
  • Ex. 22q11.2 deletion is a risk factor for SCZ (Bassett et al. 2005; Murphy et al. 1999) whereas 22q11.2 duplication is a protective factor (Rees et al. 2014)
  • Ex. In gene VIPR2, triplication confers higher risk than duplication for SCZ (Vacic et al. 2011)
• Collapsing with random effects methods has greater potential than fixed effects methods for CNV analyses (for between-locus heterogeneity)
• Caution is still needed for within-locus heterogeneity

13 Current CNV Collapsing Methods (All are fixed effects methods)
1. PLINK Burden Tests (International Schizophrenia Consortium 2008; Kirov et al. 2009)
• Dichotomize CNV genotypes based on the event of interest, e.g.,
  • CNV (dosage ≠ 2) vs. no CNV (dosage = 2)
  • Del (<2) vs. No Del
  • Dup (>2) vs. No Dup
  • Genes intersected (GI) by CNVs vs. no GI
• Compare the event rates between cases and controls
• Drawbacks:
  • Need to dichotomize data based on the event of interest
  • Do not address the issue of etiological heterogeneity
  • Only evaluate the marginal effect of a CNV feature, which is subject to spurious association (Raychaudhuri et al. 2010)
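
The dichotomize-then-compare idea behind the burden tests can be illustrated with a small sketch. It is not PLINK's implementation: the input layout (one list of per-locus dosages per subject), the function names, and the use of a plain difference in mean event counts are all assumptions made for illustration, and no significance test is attempted.

```python
# Event definitions from the slide: any CNV, deletion only, duplication only.
EVENTS = {
    "cnv": lambda dosage: dosage != 2,
    "del": lambda dosage: dosage < 2,
    "dup": lambda dosage: dosage > 2,
}


def burden(dosages, event):
    """Number of loci at which a subject carries the event of interest."""
    return sum(1 for d in dosages if EVENTS[event](d))


def mean_burden_by_status(dosage_rows, is_case, event):
    """Return (mean burden in cases, mean burden in controls)."""
    case_counts = [burden(r, event) for r, y in zip(dosage_rows, is_case) if y]
    ctrl_counts = [burden(r, event) for r, y in zip(dosage_rows, is_case) if not y]
    return (sum(case_counts) / len(case_counts),
            sum(ctrl_counts) / len(ctrl_counts))


# Toy data: two cases, two controls, three loci each.
rows = [[2, 1, 2], [3, 2, 2], [2, 2, 2], [2, 2, 0]]
status = [1, 1, 0, 0]
rates_del = mean_burden_by_status(rows, status, "del")
rates_dup = mean_burden_by_status(rows, status, "dup")
```

In the toy data the deletion burden is identical in cases and controls while the duplication burden differs, which shows how the conclusion depends on the chosen dichotomization.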

14 Under no GI effect (Raychaudhuri et al. 2010)

15 Current CNV Collapsing Methods
2. PLINK Enrichment Tests (Raychaudhuri et al. 2010)
• Model terms: total # of CNVs, mean CNV size (kb), and # of genes intersected (GI) by a CNV
• Pros: assess the conditional effect of CNV features and avoid spurious association

16 Under no GI effect (Raychaudhuri et al. 2010)

17 Current CNV Collapsing Methods
2. PLINK Enrichment Tests (Raychaudhuri et al. 2010)
• Model terms: total # of CNVs, mean CNV size (kb), and # of genes intersected (GI) by a CNV
• Pros: assess the conditional effect of CNV features and avoid spurious association
• Cons:
  • Need to dichotomize data based on the event of interest
  • Do not address the issue of etiological heterogeneity

18 Proposed CNV Collapsing Method

19 Plan
• Use random effects model approaches
  • To account for between-locus and within-locus etiological heterogeneity
• Model multiple features of CNVs
  • To assess the conditional effect of a CNV feature
• Accommodate the multinomial nature of dosage
  • To avoid dichotomizing data

20 1. Input Data Format
(0) Start with a PLINK format CNV file
(1) Define CNV regions (CNVRs):
• Clusters of CNV segments with ≥1 bp overlap
• Retain region-specific effects when collapsing
[Figure: overlapping CNV segments grouped into CNVR1 and CNVR2]
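
The CNVR definition above (clusters of segments with at least 1 bp overlap) amounts to merging overlapping intervals per chromosome. A minimal sketch follows, assuming segments are given as hypothetical (chromosome, start, end) tuples with inclusive base-pair coordinates; reading those tuples out of a PLINK CNV file is left out.

```python
def define_cnvrs(segments):
    """Merge CNV segments that overlap by at least 1 bp into CNV regions.

    segments: iterable of (chrom, start, end), coordinates inclusive.
    Returns a list of (chrom, start, end) regions sorted by position.
    """
    regions = []
    for chrom, start, end in sorted(segments):
        last = regions[-1] if regions else None
        # With inclusive coordinates, start <= previous end means >= 1 bp shared.
        if last is not None and last[0] == chrom and start <= last[2]:
            regions[-1] = (chrom, last[1], max(last[2], end))
        else:
            regions.append((chrom, start, end))
    return regions


cnvrs = define_cnvrs([
    ("1", 100, 200),
    ("1", 200, 300),   # shares base 200 with the first segment
    ("1", 301, 400),   # adjacent but not overlapping, so a new region
    ("2", 150, 250),
])
```

Because segments are sorted first, a chain of pairwise-overlapping segments collapses into a single region even if the first and last segments do not overlap each other directly.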

21 1. Input Data Format
(2) Create a design matrix for each CNV feature: dosage, length, and gene intersection
• Dosage (DS): one column per CNVR, with entries in {0, 1, 2, 3, 4}
• Length (Len): one column per CNVR

22 1. Input Data Format
(2) Create a design matrix for each CNV feature: dosage, length, and gene intersection
• Gene intersection (GI): one column per gene
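
As an illustration of step (2), the sketch below builds a subject-by-CNVR dosage matrix from a list of CNV calls. The call layout (subject, chromosome, start, end, copy number) and the function name are assumptions made for this example; subjects without a call in a region are given the normal copy number 2. Length and gene-intersection matrices could be filled in the same way, with columns indexed by CNVR or by gene.

```python
def dosage_matrix(subjects, cnvrs, calls):
    """Build a subjects x CNVRs matrix of copy numbers (default 2).

    subjects: list of subject IDs (row order).
    cnvrs:    list of (chrom, start, end) regions (column order).
    calls:    list of (subject, chrom, start, end, copy_number).
    """
    row_of = {subject: row for row, subject in enumerate(subjects)}
    matrix = [[2] * len(cnvrs) for _ in subjects]
    for subject, chrom, start, end, copies in calls:
        for col, (r_chrom, r_start, r_end) in enumerate(cnvrs):
            # A call belongs to a region if they share at least one base.
            if chrom == r_chrom and start <= r_end and end >= r_start:
                matrix[row_of[subject]][col] = copies
    return matrix


ds = dosage_matrix(
    subjects=["s1", "s2"],
    cnvrs=[("1", 100, 300), ("2", 150, 250)],
    calls=[("s1", "1", 120, 180, 1), ("s2", "2", 150, 250, 3)],
)
```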

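23 2. Model
• For subject $i$, let $Y_i$ be the continuous or binary trait, $X_i$ be a covariate vector including the intercept, and $Z_i^{(f)}$ be the design vector of CNV feature $f$, $f \in \{\mathrm{DS}, \mathrm{Len}, \mathrm{GI}\}$
• Model: $g(\mu_i) = X_i^\top \gamma + \sum_f h_f(Z_i^{(f)})$
• Assume $Y_i$ follows an exponential family with density $p(Y_i) = \exp\{[Y_i\theta_i - b(\theta_i)]/a(\phi) + c(Y_i, \phi)\}$, where $\mu_i = E(Y_i) = b'(\theta_i)$ and $h_f(\cdot)$ models the effect of CNV feature $f$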

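24 2. Model
• Examples of $h_f(\cdot)$
  • Ex. Linear regression: $h_f(Z_i^{(f)}) = Z_i^{(f)\top}\beta_f$ with fixed $\beta_f$
  • Ex. Random effect: $h_f(Z_i^{(f)}) = Z_i^{(f)\top}\beta_f$ and $\beta_f$ random with mean 0 and a variance component $\tau_f$
  • Ex. In Raychaudhuri et al. (2010), the features are
    • total # of CNVs of subject $i$
    • # of GI by CNV for subject $i$
• Propose to model the covariates and background CNV features using fixed effects and model the CNV feature of interest using random effects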

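25 Example: Assessing Dosage Effect
• GLMM: $g(\mu_i) = X_i^\top\gamma + \beta_{\mathrm{Len}}\,\mathrm{Len}_i + h_i$, where $h = (h_1, \dots, h_n)^\top \sim N(0, \tau S)$ and $S = \{S_{ik}\}$ is an $n \times n$ matrix with $S_{ik}$ the dosage similarity between subjects $i$ and $k$
• $\mathrm{Len}_i$: mean CNV length in kb
• Dosage effect can be evaluated by testing $H_0: \tau = 0$
• Test statistic: $Q = (Y - \hat\mu_0)^\top S\,(Y - \hat\mu_0)$, with $\hat\mu_0$ the fitted mean under $H_0$, follows a weighted $\chi^2$ distribution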
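
To make the variance-component test above more concrete, here is a deliberately simplified sketch of a quadratic-form statistic of the type (residuals)' S (residuals). It assumes an intercept-only null model, so the residuals are just the trait minus its mean, and it swaps the weighted chi-square reference distribution mentioned on the slide for a permutation p-value, which is only appropriate when there are no covariates to adjust for. Function names and the toy similarity matrix are hypothetical.

```python
import random


def score_statistic(y, S):
    """Quadratic form r' S r with r = y - mean(y) (intercept-only null)."""
    n = len(y)
    mean = sum(y) / n
    r = [value - mean for value in y]
    return sum(r[i] * S[i][k] * r[k] for i in range(n) for k in range(n))


def permutation_pvalue(y, S, n_perm=999, seed=0):
    """Permutation p-value for the statistic (a stand-in, not the slide's
    weighted chi-square approximation)."""
    rng = random.Random(seed)
    observed = score_statistic(y, S)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if score_statistic(shuffled, S) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy example: subjects 0-1 are similar to each other, as are subjects 2-3,
# and the trait follows the same grouping.
S_toy = [[1, 1, 0, 0],
         [1, 1, 0, 0],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
y_toy = [1, 1, 0, 0]
q_toy = score_statistic(y_toy, S_toy)
p_toy = permutation_pvalue(y_toy, S_toy)
```

The statistic is large when subjects with similar CNV profiles also have similar trait residuals, which is what a nonzero variance component would produce.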

26 Remark 1: Connection with Other Random Effects Methods
• The GLMM has a direct connection with kernel machine regression (Kwee et al. 2008; Wu et al. 2010) and gene-trait similarity regression (Tzeng et al. 2009; 2011)
• Under the kernel machine framework, the GLMM is equivalent to setting $h(Z_i) = \sum_k \alpha_k S_{ik}$, with $\alpha_k$ being the unknown parameters (the dual representation)
• Under the similarity regression framework, the variance component $\tau$ is the regression coefficient of the genetic similarity that is quantified by the similarity metric $S_{ik}$

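27 Remark 2: Quantifying Similarity b/w Subjects $i$ and $k$
• Use the $d$-th order polynomial function $S_{ik} = (\sum_j w_j Z_{ij} Z_{kj})^d$, where $w_j$ is the pre-specified weight for locus $j$ based on, e.g., MAF
• Cannot directly use dosage in the kernel function (both deletion and duplication dosages deviate from the "normal reference" of 2 copies)
• Solution: factorize dosage into separate indicator variables, so that deletions and duplications (e.g., dosage 3) are coded in different columns
• Then $S_{ik}$ is computed from the factorized coding, which retains dosage-specific effects when collapsing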
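
The point of factorizing dosage can be seen in a small sketch. It assumes a particular factorized coding, one deletion indicator (dosage < 2) and one duplication indicator (dosage > 2) per locus, which is an illustrative choice and may differ from the coding used in the talk; the function names are hypothetical.

```python
def factorize_dosage(dosages):
    """Code each locus as two indicators: (deletion, duplication)."""
    coded = []
    for d in dosages:
        coded.append(1 if d < 2 else 0)  # deletion indicator
        coded.append(1 if d > 2 else 0)  # duplication indicator
    return coded


def dosage_similarity(dosages_i, dosages_k, locus_weights, order=1):
    """Polynomial similarity of the given order on factorized dosage."""
    z_i = factorize_dosage(dosages_i)
    z_k = factorize_dosage(dosages_k)
    # Each locus yields two coded columns, so its weight is used twice.
    weights = [w for w in locus_weights for _ in range(2)]
    inner = sum(w * a * b for w, a, b in zip(weights, z_i, z_k))
    return inner ** order


deletion_carrier = [1, 2]
duplication_carrier = [3, 2]

# Raw dosage products treat the normal copy number 2 as "shared signal" and
# make a deletion carrier and a duplication carrier look alike.
raw_product = sum(a * b for a, b in zip(deletion_carrier, duplication_carrier))

# Factorized coding gives zero similarity between the two carriers and
# positive similarity between two carriers of the same deletion.
sim_del_dup = dosage_similarity(deletion_carrier, duplication_carrier, [1.0, 1.0])
sim_del_del = dosage_similarity(deletion_carrier, deletion_carrier, [1.0, 1.0])
```

Under this coding, only subjects who share the same type of event at a locus contribute to the similarity, which is one way to retain dosage-specific effects when collapsing.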

28 Simulation Studies

29 Simulation Scheme
• Obtain CNV data from the TwinGene Project (Heijmans 2005; Silventoinen et al. 2006)
  • Cross-sectional sampling design
  • 2000 unrelated samples (rarest CNV = )
  • 1757 CNVRs
  • 688 genes (69 genes intersected by CNVs)
• Sample with replacement to form an individual's CNV profile
• Determine disease status based on the CNV features of interest
• Simulate 2000 individuals (1000 cases and 1000 controls)
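
The resampling scheme above could be sketched roughly as follows. The logistic disease model, the intercept value, the `risk_score` callback, and sampling until both groups are full are all assumptions made for this illustration; the slides do not spell out these details.

```python
import math
import random


def simulate_case_control(profiles, risk_score, n_cases, n_controls,
                          intercept=-2.0, seed=0):
    """Resample CNV profiles with replacement and assign disease status.

    profiles:   observed CNV profiles (e.g., lists of dosages per CNVR).
    risk_score: hypothetical function mapping a profile to its log-odds
                contribution from the causal CNV features.
    Sampling continues until the requested numbers of cases and controls
    have both been collected.
    """
    rng = random.Random(seed)
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        profile = rng.choice(profiles)  # sampling with replacement
        log_odds = intercept + risk_score(profile)
        p_disease = 1.0 / (1.0 + math.exp(-log_odds))
        if rng.random() < p_disease:
            if len(cases) < n_cases:
                cases.append(profile)
        elif len(controls) < n_controls:
            controls.append(profile)
    return cases, controls


# Toy run: each deletion adds 0.5 to the log-odds of disease.
toy_profiles = [[2, 2], [1, 2], [3, 2]]
toy_cases, toy_controls = simulate_case_control(
    toy_profiles,
    risk_score=lambda profile: 0.5 * sum(1 for d in profile if d < 2),
    n_cases=5,
    n_controls=5,
)
```

For the setting on the slide, the same function would be called with `n_cases=1000` and `n_controls=1000` and a risk score built from the causal CNVRs or genes of the chosen scheme.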

30 Simulation Scheme
Scheme A. Different dosage effects of Dup and Del
  A1. Between-locus heterogeneity
    • Randomly select 300 Dup-only CNVRs and 300 Del-only CNVRs as causal loci
  A2. Within-locus heterogeneity
    • Select the 38 CNVRs with both Dup and Del as causal
Scheme B. Different gene-intersection effects of Dup and Del (i.e., heterogeneous effects of genes intersected by Dup and by Del)
  B1. Across-gene heterogeneity
    • Randomly select 26 genes with Dup intersection only and 26 genes with Del intersection only as causal
  B2. Within-gene heterogeneity
    • Select the 8 genes with both Dup and Del intersection as causal

31 Type I Error for (A) Dosage Analysis
• Compare the proposed GLMM methods with
  • plink.all = PLINK CNV rates
  • plink.dup = PLINK Duplication rates
  • plink.del = PLINK Deletion rates
• Type I error rates:


33 Type I Error for Gene Intersection (GI) Analysis • Compare the proposed GLMM methods with PLINK Enrichment test (Raychaudhuri et al. 2010) • Type I error rates:

34 Power Analysis for (A) Dosage Effects

35 A1. (Dosage effect) Between-Locus Heterogeneity
[Figure: power (vs. plink 2 sided). Panels: all Dup causal are harmful and all Del causal are protective; 50% of Dup causal (Del causal) are harmful and 50% are protective; all Dup causal are protective and all Del causal are harmful; no heterogeneity]

36 A2. (Dosage effect) Within-Locus Heterogeneity
[Figure: power (vs. plink 2 sided)]

37 Power Analysis for (B) GI Effects

38 B1. (GI effect) Between-Gene Heterogeneity
[Figure: power (vs. plink.enrichment). Panels: all Dup causal are harmful and all Del causal are protective; 50% of Dup causal (Del causal) are harmful and 50% are protective; all Dup causal are protective and all Del causal are harmful; no heterogeneity]

39 B2. (GI effect) Within-Gene Heterogeneity
[Figure: power (vs. plink.enrichment)]

40 Summary
For CNV collapsing analysis:
• Developments in SNP collapsing can be applied to CNV collapsing with modifications that account for the nature of CNVs, e.g., defining a "locus" using CNVR or gene, calculating similarity based on factorized dosage / GI details, and adjusting for background CNV features
• Random effects modeling has more potential to address etiological heterogeneity
  • For DS, the random effects model is robust across different scenarios
  • For GI, GLMM is more powerful than plink.enrichment
  • Note that GLMM has the same model as plink.enrichment except that the GI effect is modeled using a random effect with factorized coding
• Current work: a fixed-effect imputation method to speed up the EM computation (for estimating the variance components) when using random effects on all CNV features

Testing with

Thank you


47 Multi-faceted Nature of CNVs Kirov et al 2009

48 A2. (Dosage effect) Within-Locus Heterogeneity
[Figure: power (vs. plink 2 sided)]

49 A1. (Dosage effect) Between-Locus Heterogeneity
[Figure: power (vs. plink 1 sided). Panels: all Dup causal are harmful and all Del causal are protective; 50% of Dup causal (Del causal) are harmful and 50% are protective; all Dup causal are protective and all Del causal are harmful; no heterogeneity]
