This study utilizes mixed model analysis and the GASED approach to identify cis-regulatory haplotypes in Arabidopsis thaliana. The objective is to differentiate between cis- and trans-regulatory changes and identify superior alleles.
Mixed model analysis to discover cis-regulatory haplotypes in A. thaliana
Fanghong Zhang*, Stijn Vansteelandt*, Olivier Thas*, Marnik Vuylsteke#
* Ghent University; # VIB (Flanders Institute for Biotechnology)
Overview • Genetic background • Objectives • Data • Methodology • Results • Conclusions
Genetic background
• Regulation of gene expression is affected either in:
  - Cis: affecting the expression of only one of the two alleles in a heterozygous individual;
  - Trans: affecting the expression of both alleles in a heterozygous individual.
Genetic background
• Why search for cis-regulatory variants?
  - "Low-hanging fruit": the window is a small genomic region.
  - Fast screening for markers in LD with the expression trait.
• How to search for cis-regulatory variants?
  - Using the GASED (Genome-wide Allelic Specific Expression Difference) approach (Kiekens et al., 2006).
  - Based on a diallel design, which is very popular in plant breeding for estimating GCA (general combining ability) and SCA (specific combining ability).
Genetic background
• What is the GASED approach?
• The expression of a gene in an F1 hybrid, for the kth offspring of the cross i × j, can be written as a sum of cis and trans contributions (c: cis-element, t: trans-element). [Equation shown on the slide; annotated terms: from parent i, from parent j, from both (cross-terms), genotypic variation.]
• Special cases shown on the slide: the homozygous case; the case with no trans-effect; the case with a cis-effect.
• A cis-regulatory divergence completely explains the difference between two parental lines.
Objectives of this study
• Use mixed model analysis to discover cis-regulated Arabidopsis genes.
• Based on the GASED approach, partition the F1 hybrid genotypic variation for mRNA abundance into additive and non-additive variance components, in order to differentiate between cis- and trans-regulatory changes and to assign allele-specific expression differences to cis-regulatory variation.
• Find the associated haplotypes (sets of SNPs) for the selected cis-regulated genes.
• Survey cis-regulatory variation systematically to identify “superior alleles”.
Flow chart
Data contains all expressed genes (25527 genes)
• Step I: Choose genes with significant genotypic variation.
• Step II: Choose genes from Step I with no trans-regulatory variation.
• Step III: Choose genes from Step II displaying significant allelic imbalance due to cis-regulatory variation.
• Step IV: Choose genes from Step III showing significant association with the haplotype blocks found.
Data • Data acquisition: • Scan the arrays • Quantitate each spot • Subtract noise from background • Normalize • Export table Data for us to analyze
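Methodology - Step I: Mixed-model equations
Gene X: expression values
• Full model: $y_{klnm} = \mu + \mathrm{dye}_k + \mathrm{replicate}_l + \mathrm{genotype}_n + \mathrm{array}_m + \mathrm{error}_{klnm}$
• Reduced model: $y_{klnm} = \mu + \mathrm{dye}_k + \mathrm{replicate}_l + \mathrm{array}_m + \mathrm{error}_{klnm}$
• FIXED effects: $\mu$, dye, replicate; RANDOM effects: genotype (and array, which is also given a normal distribution); residual: error
• error ~ $N(0, \Sigma_e)$, $\Sigma_e = I_{220}\,\sigma^2_e$; array ~ $N(0, \Sigma_a)$, $\Sigma_a = I_{110}\,\sigma^2_a$
• genotype ~ $N(0, \Sigma_{\mathrm{genotype}})$, $\Sigma_{\mathrm{genotype}} = G = K\sigma^2_g$
• K = 55 x 55 marker-based relatedness matrix, calculated as $1 - d_R$, where $d_R$ is Rogers' distance (Rogers, 1972; Reif et al., 2005)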
Methodology - Step I: Mixed-model equations
K = 55 x 55 marker-based relatedness matrix, where:
• $p_{ij}$ and $q_{ij}$ are the allele frequencies of the jth allele at the ith locus
• $n_i$ is the number of alleles at the ith locus (i.e. $n_i = 2$)
• m is the number of loci (i.e. m = 210,205)
Rogers (1972); Reif et al. (2005); Melchinger et al. (1991)
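The formula itself is not preserved on this slide. As a hedged sketch, the standard form of Rogers' distance, $d_R = \frac{1}{m}\sum_{i=1}^{m}\sqrt{\tfrac{1}{2}\sum_{j=1}^{n_i}(p_{ij}-q_{ij})^2}$, uses exactly the symbols defined above; that the slide used this form is an assumption. The function names and toy marker data below are illustrative only.

```python
import math

def rogers_distance(p, q):
    """Rogers' distance between two lines (standard form, assumed here).

    p and q are lists of loci; each locus is a list of allele
    frequencies (n_i entries, summing to 1).
    """
    if len(p) != len(q):
        raise ValueError("both lines need the same number of loci")
    total = 0.0
    for p_i, q_i in zip(p, q):
        total += math.sqrt(0.5 * sum((a - b) ** 2 for a, b in zip(p_i, q_i)))
    return total / len(p)

def relatedness_matrix(lines):
    """K[a][b] = 1 - d_R(line a, line b), as described on the slide."""
    n = len(lines)
    return [[1.0 - rogers_distance(lines[a], lines[b]) for b in range(n)]
            for a in range(n)]

# Toy data (hypothetical): three homozygous lines, four biallelic loci.
line_a = [[1, 0], [1, 0], [0, 1], [0, 1]]
line_b = [[1, 0], [0, 1], [0, 1], [1, 0]]
line_c = [[1, 0], [1, 0], [0, 1], [0, 1]]
K = relatedness_matrix([line_a, line_b, line_c])
```

For an F1 hybrid of two homozygous parents, a heterozygous locus would enter with frequencies [0.5, 0.5] in this representation.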
Methodology - Step I: Multiple testing correction
Gene X: likelihood ratio test (REML), LRT ~ $0.5\chi^2_0 + 0.5\chi^2_1$ → p-value → adjusted q-value (FDR); 25527 genes
• FDR: false discovery rate. How many of the called positives are false? 5% FDR means 5% of calls are false positives.
• Storey et al. (2002): the q-value represents the FDR; it requires an estimate of the proportion of features that are truly null.
• We use an adjusted q-value to represent the FDR.
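A minimal sketch of how a p-value could be read off the 50:50 mixture of $\chi^2_0$ and $\chi^2_1$. It assumes the convention p = 0.5 · P($\chi^2_1$ > LRT), which gives p = 0.5 when the LRT statistic is zero. The function names and the log-likelihood values are hypothetical; the slides do not show the actual implementation.

```python
import math

def chi2_1_sf(x):
    """P(chi-square with 1 df > x), via the standard normal tail."""
    return math.erfc(math.sqrt(x / 2.0))

def lrt_mixture_pvalue(loglik_full, loglik_reduced):
    """p-value under the 0.5*chi2(0) + 0.5*chi2(1) null mixture.

    The chi2(0) component is a point mass at zero, so it contributes
    nothing to P(statistic > lrt); only half of the chi2(1) tail remains.
    """
    lrt = max(0.0, 2.0 * (loglik_full - loglik_reduced))
    return 0.5 * chi2_1_sf(lrt)

# Hypothetical REML log-likelihoods for one gene.
p_boundary = lrt_mixture_pvalue(-100.0, -100.0)                    # LRT = 0
p_example = lrt_mixture_pvalue(-100.0, -100.0 - 3.841459 / 2.0)    # LRT = 3.841459
```

With an LRT statistic equal to the usual 5% critical value of $\chi^2_1$, the mixture p-value is half the naive one.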
Methodology - Step I: Multiple testing correction
• Storey et al. estimate $\pi_0 = m_0/m$ under the assumption that the true null p-values are uniformly distributed on (0, 1).
• We estimate $\pi_{0,\mathrm{adj}} = m_0/m$ under the assumption that 50% of the true null p-values are uniformly distributed on (0, 0.5) and the other 50% are exactly 0.5.
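The slide does not give the adjusted estimator itself, so the following is only an illustration of the stated null distribution, not the authors' method. Under that null, P(p > λ) = 0.5 + (0.5 − λ) = 1 − λ for any λ < 0.5, so a Storey-type tail estimator with λ restricted below 0.5 still targets $\pi_0$. The function name, the choice λ = 0.25 and the toy p-values are assumptions.

```python
def pi0_tail_estimate(pvalues, lam=0.25):
    """Storey-type estimate of pi0 = m0/m using the tail above lam.

    Restricted to lam < 0.5, where the half-uniform / point-mass null
    described on the slide still satisfies P(p > lam) = 1 - lam.
    """
    if not 0.0 < lam < 0.5:
        raise ValueError("lam must lie in (0, 0.5) for this null")
    m = len(pvalues)
    return sum(1 for p in pvalues if p > lam) / (m * (1.0 - lam))

# Toy data (hypothetical): 100 true nulls and 100 strong alternatives.
null_uniform = [(k + 0.5) / 100 for k in range(50)]  # spread over (0, 0.5)
null_point_mass = [0.5] * 50                         # LRT = 0 cases
alternatives = [0.0001] * 100
pi0_hat = pi0_tail_estimate(null_uniform + null_point_mass + alternatives)
```

In this toy set half of the features are truly null, and the tail estimate recovers that proportion.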
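Methodology - Step II: Mixed-model equations
Gene X: expression values
• Full model: $y_{klijm} = \mu + \mathrm{dye}_k + \mathrm{replicate}_l + \mathrm{gca}_i + \mathrm{gca}_j + \mathrm{sca}_{ij} + \mathrm{array}_m + \mathrm{error}_{klijm}$
• Reduced model: $y_{klijm} = \mu + \mathrm{dye}_k + \mathrm{replicate}_l + \mathrm{gca}_i + \mathrm{gca}_j + \mathrm{array}_m + \mathrm{error}_{klijm}$
• Terms of the full model are annotated as FIXED effects, RANDOM effect and residual.
• L is the Cholesky decomposition.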
Methodology - Step II: Multiple testing correction
Gene X: likelihood ratio test (REML), LRT ~ $0.5\chi^2_0 + 0.5\chi^2_1$ → p-value → qa-value (FNR); 20976 genes
• FNR: false non-discovery rate (Genovese et al., 2002). How many of the called negatives are false? 5% FNR means 5% of calls are false negatives.
• Since we are interested in selecting genes that are negative for the $\mathrm{sca}_{ij}$ effect, we control the FNR instead of the FDR.
• We use the qa-value to represent the FNR.
Methodology - Step II: Multiple testing correction
False non-discovery rate (FNR): $\pi_0$ is the estimate of the proportion of features that are truly null.
Methodology - Step III: Mixed-model equations
Gene X, model: $y_{klijm} = \mu + \mathrm{dye}_k + \mathrm{replicate}_l + \mathrm{gca}_i + \mathrm{gca}_j + \mathrm{array}_m + \mathrm{error}_{klijm}$
• Test 45 pairs: g1 = g2? g1 = g3? g1 = g4? ... g1 = g10? g2 = g3? g2 = g4? g2 = g5? ... g2 = g10? ... g9 = g10?
• Two-sample dependent t-test
• Non-standard p-value: the true null p-values are not uniformly distributed on (0, 1)
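The 45 tests are the unordered pairs among the ten effects g1, ..., g10, i.e. C(10, 2) = 45. A short sketch that enumerates them in the order listed on the slide (the label strings are illustrative):

```python
from itertools import combinations

# Labels g1 ... g10 for the ten effects compared on the slide.
labels = [f"g{i}" for i in range(1, 11)]

# All unordered pairs, in lexicographic order: (g1, g2), (g1, g3), ...
pairs = list(combinations(labels, 2))
n_tests = len(pairs)
```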
Methodology - Step III: Multiple testing correction
Gene X: two-sample t-test on the BLUPs → simulate the H0 distribution from the real data → simulation-based p-value → q-value (FDR); 1380 genes
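The slides do not say how the H0 distribution is simulated, so the sketch below covers only the last step under an assumption: given test statistics simulated under H0, a two-sided empirical p-value is the share of simulated statistics at least as extreme as the observed one, with the common add-one correction. The function name and the numbers are hypothetical.

```python
def simulation_based_pvalue(observed_stat, null_stats):
    """Two-sided empirical p-value from statistics simulated under H0.

    Uses (1 + #{|null| >= |observed|}) / (1 + B) so that the p-value is
    never exactly zero with a finite number B of simulations.
    """
    threshold = abs(observed_stat)
    exceed = sum(1 for s in null_stats if abs(s) >= threshold)
    return (1 + exceed) / (1 + len(null_stats))

# Hypothetical simulated null statistics and one observed t statistic.
null_stats = [0.1, -0.5, 1.2, -2.0, 3.1, 0.7, -0.2, 1.9, 0.4]
p_sim = simulation_based_pvalue(1.9, null_stats)
```

In practice the number of simulated statistics would be far larger than in this toy list, since the smallest attainable p-value is 1 / (1 + B).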
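Methodology - Step IV: Mixed-model equations
Gene X (cis-regulated); diagram: gene on the chromosome with tag SNPs SNP1, SNP2, SNP3, ..., SNPi
• Full model: $y_{klim} = \mu + \mathrm{dye}_k + \mathrm{replicate}_l + (\text{tag-SNP effects}) + \mathrm{genotype}_i + \mathrm{array}_m + \mathrm{error}_{klim}$
• Reduced model: $y_{klim} = \mu + \mathrm{dye}_k + \mathrm{replicate}_l + \mathrm{genotype}_i + \mathrm{array}_m + \mathrm{error}_{klim}$
• Terms of the full model are annotated as FIXED effects, RANDOM effect and residual.
• genotype ~ $N(0, \Sigma_{\mathrm{genotype}})$, $\Sigma_{\mathrm{genotype}} = G = K\sigma^2_g$
• K = 55 x 55 marker-based relatedness matrix
• array ~ $N(0, \Sigma_a)$, $\Sigma_a = I_{110}\,\sigma^2_a$; error ~ $N(0, \Sigma_e)$, $\Sigma_e = I_{220}\,\sigma^2_e$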
Methodology - Step IV: Multiple testing correction
Gene X (cis-regulated): likelihood ratio test (ML), LRT ~ $\chi^2_{2n}$, where n is the number of SNPs → p-value → q-value (FDR); 836 genes
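Because the degrees of freedom 2n are always even, the upper tail of $\chi^2_{2n}$ has a closed form: P($\chi^2_{2n}$ > x) = $e^{-x/2}\sum_{k=0}^{n-1}(x/2)^k/k!$. The sketch below uses that identity to turn an LRT statistic into a p-value; the function name and the example values are illustrative, not taken from the slides.

```python
import math

def lrt_pvalue_even_df(lrt, n_snps):
    """P(chi-square with 2*n_snps df > lrt), closed form for even df."""
    if n_snps < 1:
        raise ValueError("need at least one SNP")
    half = max(0.0, lrt) / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for k in range(1, n_snps):
        term *= half / k   # builds (x/2)^k / k! incrementally
        total += term
    return math.exp(-half) * total

# One SNP gives 2 df, where the tail is simply exp(-x/2).
p_one_snp = lrt_pvalue_even_df(2.0 * math.log(20.0), 1)
# Two SNPs give 4 df: exp(-x/2) * (1 + x/2).
p_two_snps = lrt_pvalue_even_df(4.0, 2)
```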
Results
Data contains all expressed genes (25527 genes)
• Step I: adjusted q-value < 0.0005: 20979 genes
• Step II: adjusted qa-value < 0.01: 1328 genes
• Step III: q-value < 0.01: 972 genes
• Step IV: q-value < 0.01: 859 genes
Results
• Among all 25527 genes, 20979 genes have significant genotypic variation (q-value < 0.0005). (Step I)
• Among these 20979 genes, 1328 genes have no trans-regulatory effect (qa-value < 0.01). (Step II)
• Among these 1328 genes, 972 genes show significantly different allelic expression (q-value < 0.01); these 972 genes are identified as cis-regulated. (Step III)
• We confirm the discovery of these 972 cis-regulated genes in Step IV:
  - an allelic expression difference caused by a cis-regulatory variant implies a nearby polymorphism (SNP), in LD, that controls expression;
  - we indeed found that 96.5% of the selected cis-regulated genes have associated polymorphisms (haplotype blocks) nearby.
Conclusions
• The mixed-model approach used here for association mapping, with the kinship matrix included, is more appropriate than other recent methods for identifying cis-regulated genes (p-values are more reliable).
• The statistical method in each step is controlled more accurately when specifying statistical significance (with reference to FDR and FNR).
• Using simulation-based p-values when testing the difference between random effects increases the power to detect association.
• A comprehensive analysis of gene expression variation in plant populations has been described.
• This mixed-model analysis strategy provides a detailed characterization of both the genetic and the positional effects in the genome.
• This detailed statistical analysis provides a robust and useful framework for the future analysis of gene expression variation in large samples.
• Advanced statistical methods look promising for making interesting discoveries in genetics.
Many thanks for your attention!