
Mixed model analysis to discover cis-regulatory haplotypes in A. thaliana

This study utilizes mixed model analysis and the GASED approach to identify cis-regulatory haplotypes in Arabidopsis thaliana. The objective is to differentiate between cis- and trans-regulatory changes and identify superior alleles.

mcornejo

Presentation Transcript


  1. Mixed model analysis to discover cis-regulatory haplotypes in A. thaliana
     Fanghong Zhang*, Stijn Vansteelandt*, Olivier Thas*, Marnik Vuylsteke#
     * Ghent University; # VIB (Flanders Institute for Biotechnology)

  2. Overview • Genetic background • Objectives • Data • Methodology • Results • Conclusions

  3. Genetic background
     • Regulation of gene expression is affected either in:
       - Cis: affecting the expression of only one of the two alleles in a heterozygous individual;
       - Trans: affecting the expression of both alleles in a heterozygous individual.

  4. Genetic background
     • Why search for cis-regulatory variants?
       - “Low-hanging fruit”: the search window is a small genomic region.
       - Fast screening for markers in LD (linkage disequilibrium) with the expression trait.
     • How to search for cis-regulatory variants?
       - Using the GASED (Genome-wide Allelic Specific Expression Difference) approach (Kiekens et al., 2006).
       - GASED is based on a diallel design, which is widely used in plant breeding to estimate GCA (general combining ability) and SCA (specific combining ability).

  5. Genetic background
     • What is the GASED approach?
     • The expression of a gene in an F1 hybrid, for the kth offspring of the cross i × j, is written as a sum of cis-element (c) and trans-element (t) terms. [Equation not preserved in the transcript; its annotations follow.]
       - The terms are labelled as coming from parent i, from parent j, and from both parents (cross-terms); together they make up the genotypic variation.
       - Special cases shown: the homozygous case, the case with no trans-effect, and the case with a cis-effect.
     • A cis-regulatory divergence completely explains the difference between two parental lines.

  6. Objectives of this study
     • Use mixed model analysis to discover cis-regulated Arabidopsis genes.
     • Based on the GASED approach, partition the between-F1-hybrid genotypic variation in mRNA abundance into additive and non-additive variance components, in order to differentiate between cis- and trans-regulatory changes and to assign allele-specific expression differences to cis-regulatory variation.
     • Find the associated haplotypes (sets of SNPs) for the selected cis-regulated genes.
     • Survey cis-regulatory variation systematically to identify “superior alleles”.

  7. Flow chart
     Data: all expressed genes (25527 genes)
     • Step I: choose genes with significant genotypic variation.
     • Step II: from the Step I genes, choose those with no trans-regulatory variation.
     • Step III: from the Step II genes, choose those displaying significant allelic imbalance due to cis-regulatory variation.
     • Step IV: from the Step III genes, choose those showing significant association with the haplotype blocks found.

  8. Data
     • Data acquisition:
       - Scan the arrays
       - Quantitate each spot
       - Subtract background noise
       - Normalize
       - Export table
     • The exported table is the data we analyze.

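  9. Methodology - Step I: Mixed-model equations
     For each gene X (expression values y):
     Full model:
     $$y_{klnm} = \mu + \mathrm{dye}_k + \mathrm{replicate}_l + \mathrm{genotype}_n + \mathrm{array}_m + \mathrm{error}_{klnm}$$
     Reduced model:
     $$y_{klnm} = \mu + \mathrm{dye}_k + \mathrm{replicate}_l + \mathrm{array}_m + \mathrm{error}_{klnm}$$
     • Fixed effects: $\mu$, dye, replicate. Random effects: genotype, array. Residual: error.
     • $\mathrm{error} \sim N(0, \Sigma_e)$, $\Sigma_e = I_{220}\,\sigma^2_e$; $\mathrm{array} \sim N(0, \Sigma_a)$, $\Sigma_a = I_{110}\,\sigma^2_a$
     • $\mathrm{genotype} \sim N(0, \Sigma_{\mathrm{genotype}})$, $\Sigma_{\mathrm{genotype}} = G = K\sigma^2_g$
     • $K$ = 55 × 55 marker-based relatedness matrix, calculated as $1 - d_R$, where $d_R$ is Rogers' distance (Rogers, 1972; Reif et al., 2005)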

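  10. Methodology - Step I: Mixed-model equations
      $K$ = 55 × 55 marker-based relatedness matrix, $K = 1 - d_R$, with Rogers' distance in its standard form:
      $$d_R = \frac{1}{m}\sum_{i=1}^{m}\sqrt{\frac{1}{2}\sum_{j=1}^{n_i}\left(p_{ij}-q_{ij}\right)^2}$$
      • $p_{ij}$ and $q_{ij}$ are the allele frequencies of the jth allele at the ith locus in the two lines being compared
      • $n_i$ is the number of alleles at the ith locus (i.e. $n_i = 2$)
      • $m$ is the number of loci (i.e. $m = 210{,}205$)
      Rogers (1972); Reif et al. (2005); Melchinger et al. (1991)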
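
The relatedness matrix K = 1 - d_R can be illustrated with a small sketch. It assumes the standard form of Rogers' distance, d_R = (1/m) Σ_i sqrt(½ Σ_j (p_ij - q_ij)²), and uses three made-up inbred lines at four biallelic loci; the function names and the toy data are hypothetical and are not taken from the presentation.

```python
from math import sqrt

def rogers_distance(p, q):
    """Rogers' distance between two lines.

    p[i][j] and q[i][j] hold the frequency of allele j at locus i.
    """
    m = len(p)  # number of loci
    total = 0.0
    for p_i, q_i in zip(p, q):
        total += sqrt(0.5 * sum((a - b) ** 2 for a, b in zip(p_i, q_i)))
    return total / m

def relatedness_matrix(lines):
    """K[a][b] = 1 - d_R(line a, line b)."""
    n = len(lines)
    return [[1.0 - rogers_distance(lines[a], lines[b]) for b in range(n)]
            for a in range(n)]

# Toy data (hypothetical): 3 inbred lines, 4 biallelic loci,
# so every allele frequency is 0 or 1.
line_a = [[1, 0], [1, 0], [0, 1], [1, 0]]
line_b = [[1, 0], [0, 1], [0, 1], [1, 0]]
line_c = [[0, 1], [0, 1], [1, 0], [0, 1]]

K = relatedness_matrix([line_a, line_b, line_c])
for row in K:
    print(row)
```

Lines a and b differ at one locus out of four, so their distance is 0.25 and their relatedness 0.75. In the study the same calculation would run over 55 genotypes and 210,205 loci.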

  11. Methodology - Step I: Multiple testing correction
      For each gene X (25527 genes): likelihood ratio test (REML) → p-value → adjusted q-value (FDR)
      • Null distribution: $\mathrm{LRT} \sim 0.5\,\chi^2_0 + 0.5\,\chi^2_1$
      • FDR (false discovery rate): how many of the called positives are false? A 5% FDR means that about 5% of the called positives are expected to be false positives.
      • Storey et al. (2002) use the q-value to represent the FDR; it requires an estimate of the proportion of features that are truly null, $\pi_0$.
      • We use an adjusted q-value to represent the FDR.
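
As an illustration of the null distribution on this slide, a 50:50 mixture of χ²(0) and χ²(1), the p-value of an observed LRT statistic can be computed in a few lines. This is a minimal sketch; the function names are hypothetical, and the χ²(1) tail probability uses the identity P(χ²₁ ≥ x) = erfc(sqrt(x/2)).

```python
from math import erfc, sqrt

def chi2_1_sf(x):
    """Upper tail probability P(X >= x) for a chi-square with 1 df."""
    return erfc(sqrt(x / 2.0))

def mixture_lrt_pvalue(lrt):
    """p-value under the null mixture 0.5*chi2(0) + 0.5*chi2(1).

    The chi2(0) component is a point mass at zero, so it contributes
    nothing to the upper tail of a positive statistic. At lrt = 0 the
    formula returns 0.5.
    """
    lrt = max(lrt, 0.0)  # guard against tiny negative values from numerics
    return 0.5 * chi2_1_sf(lrt)

# 2.7055 is the 90th percentile of chi2(1), so the mixture p-value is 0.05.
p_example = mixture_lrt_pvalue(2.705543454095404)
p_at_zero = mixture_lrt_pvalue(0.0)
print(p_example, p_at_zero)
```

Under this convention the largest possible p-value is 0.5, which is why the true null p-values are not uniform on (0, 1).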

  12. Methodology - Step I: Multiple testing correction
      • Storey et al. estimate $\pi_0 = m_0/m$ under the assumption that the true null p-values are uniformly distributed on (0, 1).
      • We estimate $\pi_{0,\mathrm{adj}} = m_0/m$ under the assumption that the true null p-values are a 50:50 mixture: 50% uniformly distributed on (0, 0.5) and 50% exactly equal to 0.5.
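
The exact adjusted estimator is not given in the transcript, so the following is only a sketch of a Storey-type π0 estimate and the resulting q-values. It rests on one observation that is an assumption of this sketch, not a statement from the slides: if half of the null p-values are uniform on (0, 0.5) and half equal 0.5, then for any threshold λ < 0.5 the null probability of p > λ is (0.5 - λ) + 0.5 = 1 - λ, the same tail mass as under a uniform null. The function names, the choice λ = 0.25 and the toy p-values are hypothetical.

```python
def estimate_pi0(pvalues, lam=0.25):
    """Storey-type estimate of the proportion of true nulls.

    Counts p-values above lam and rescales by the null tail mass 1 - lam.
    lam must stay below 0.5 for the mixture null described above.
    """
    m = len(pvalues)
    tail = sum(1 for p in pvalues if p > lam)
    return min(1.0, tail / (m * (1.0 - lam)))

def q_values(pvalues, pi0):
    """q-values: pi0 * m * p / rank, made monotone from the largest p down."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvalues[i] / rank)
        q[i] = running_min
    return q

# Toy p-values (hypothetical): four small ones and four null-like ones.
pvals = [0.001, 0.002, 0.003, 0.004, 0.5, 0.5, 0.3, 0.45]
pi0 = estimate_pi0(pvals)
qvals = q_values(pvals, pi0)
print(pi0, qvals)
```

In practice λ is usually tuned or smoothed over a grid rather than fixed; a single value keeps the sketch short.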

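  13. Methodology - Step II: Mixed-model equations
      For each gene X (expression values y):
      Full model:
      $$y_{klijm} = \mu + \mathrm{dye}_k + \mathrm{replicate}_l + \mathrm{gca}_i + \mathrm{gca}_j + \mathrm{sca}_{ij} + \mathrm{array}_m + \mathrm{error}_{klijm}$$
      Reduced model:
      $$y_{klijm} = \mu + \mathrm{dye}_k + \mathrm{replicate}_l + \mathrm{gca}_i + \mathrm{gca}_j + \mathrm{array}_m + \mathrm{error}_{klijm}$$
      • The slide labels the terms as fixed effects, random effects and residual.
      • L is the Cholesky decomposition [the expression that uses it is not preserved in the transcript].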
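
To make the gca and sca terms concrete, here is a minimal sketch of the classical least-squares decomposition of cross means in a balanced half diallel with F1 crosses only (no parents, no reciprocals). It is a fixed-effects illustration of what the two terms measure, not the REML mixed-model fit used in the study; the function name and the toy values are hypothetical.

```python
from itertools import combinations

def gca_sca(cross_means, n):
    """Decompose cross means x_ij (i < j) into gca_i + gca_j + sca_ij.

    cross_means: dict {(i, j): mean expression of cross i x j}
    n: number of parents, labelled 0..n-1 (n must be at least 3)
    """
    total = sum(cross_means.values())
    row = [0.0] * n  # sum of all crosses involving each parent
    for (i, j), x in cross_means.items():
        row[i] += x
        row[j] += x
    gca = [(n * row[i] - 2.0 * total) / (n * (n - 2)) for i in range(n)]
    sca = {
        (i, j): x - (row[i] + row[j]) / (n - 2)
                + 2.0 * total / ((n - 1) * (n - 2))
        for (i, j), x in cross_means.items()
    }
    return gca, sca

# Toy data (hypothetical): 4 parents with purely additive effects,
# so every sca should come out as zero.
g_true = [1.0, -0.5, 0.25, -0.75]
mu = 10.0
crosses = {(i, j): mu + g_true[i] + g_true[j]
           for i, j in combinations(range(4), 2)}

gca, sca = gca_sca(crosses, 4)
print(gca)
print(sca)
```

A gene whose crosses are fully described by the gca terms, as in this toy case, is the kind of gene Step II retains: no sca component is needed to explain its expression.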

  14. Methodology - Step II: Multiple testing correction
      For each gene X (20976 genes): likelihood ratio test (REML) → p-value → qa-value (FNR)
      • Null distribution: $\mathrm{LRT} \sim 0.5\,\chi^2_0 + 0.5\,\chi^2_1$
      • FNR: false non-discovery rate (Genovese et al., 2002). How many of the called negatives are false? A 5% FNR means that about 5% of the called negatives are expected to be false negatives.
      • Since we are interested in selecting genes that test negative for the $\mathrm{sca}_{ij}$ effect, we control the FNR instead of the FDR.
      • We use the qa-value to represent the FNR.

  15. Methodology - Step II: Multiple testing correction
      False non-discovery rate (FNR) [formula not preserved in the transcript], where $\pi_0$ is the estimate of the proportion of features that are truly null.

  16. Methodology - Step III: Mixed-model equations
      Model, for each gene X:
      $$y_{klijm} = \mu + \mathrm{dye}_k + \mathrm{replicate}_l + \mathrm{gca}_i + \mathrm{gca}_j + \mathrm{array}_m + \mathrm{error}_{klijm}$$
      • Test 45 pairs (all pairs among g1, ..., g10): g1 = g2? g1 = g3? g1 = g4? ... g1 = g10? g2 = g3? g2 = g4? g2 = g5? ... g2 = g10? ... g9 = g10?
      • Each pair is tested with a two-sample dependent t-test.
      • The resulting p-values are non-standard: the true null p-values are not uniformly distributed on (0, 1).

  17. Methodology - Step III: Multiple testing correction
      For each gene X (1380 genes): two-sample t-test on BLUPs → simulation-based p-value (H0 distribution simulated from the real data) → q-value (FDR)
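
The slide does not describe how the H0 distribution is simulated, so the following is only a generic sketch of turning simulated null statistics into a p-value. The two-sided comparison, the add-one correction, the function name and the toy numbers are all assumptions of this sketch.

```python
def empirical_pvalue(observed, null_stats):
    """Two-sided simulation-based p-value.

    Counts simulated null statistics at least as extreme as the observed
    one; the +1 in numerator and denominator keeps the p-value above zero.
    """
    exceed = sum(1 for s in null_stats if abs(s) >= abs(observed))
    return (exceed + 1) / (len(null_stats) + 1)

# Toy null statistics (hypothetical); in practice these would come from
# re-running the test on data simulated under H0.
null_stats = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 0.5, -0.5]
p_sim = empirical_pvalue(2.5, null_stats)
print(p_sim)
```

Because the reference distribution is simulated rather than assumed, this kind of p-value remains usable when the test statistic does not follow its textbook t distribution, as is the case for tests on BLUPs.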

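  18. Methodology - Step IV: Mixed-model equations
      For each cis-regulated gene X:
      Full model:
      $$y_{klim} = \mu + \mathrm{dye}_k + \mathrm{replicate}_l + [\text{tag-SNP term}] + \mathrm{genotype}_i + \mathrm{array}_m + \mathrm{error}_{klim}$$
      Reduced model:
      $$y_{klim} = \mu + \mathrm{dye}_k + \mathrm{replicate}_l + \mathrm{genotype}_i + \mathrm{array}_m + \mathrm{error}_{klim}$$
      • The full model has one more fixed term than the reduced model; the term itself is lost in the transcript. The slide's diagram shows the tag SNPs (SNP1, SNP2, SNP3, ..., SNPi) located along the chromosome near the gene.
      • $\mathrm{genotype} \sim N(0, \Sigma_{\mathrm{genotype}})$, $\Sigma_{\mathrm{genotype}} = G = K\sigma^2_g$
      • $K$ = 55 × 55 marker-based relatedness matrix
      • $\mathrm{array} \sim N(0, \Sigma_a)$, $\Sigma_a = I_{110}\,\sigma^2_a$; $\mathrm{error} \sim N(0, \Sigma_e)$, $\Sigma_e = I_{220}\,\sigma^2_e$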

  19. Methodology - Step IV: Multiple testing correction
      For each cis-regulated gene X (836 genes): likelihood ratio test (ML) → p-value → q-value (FDR)
      • Null distribution: $\mathrm{LRT} \sim \chi^2_{2n}$, where n is the number of SNPs
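
Since the reference distribution here is a chi-square with 2n degrees of freedom, which is always an even number, the p-value has a closed form: P(χ²₂ₙ ≥ x) = exp(-x/2) Σ_{k=0}^{n-1} (x/2)^k / k!. The sketch below applies it to a pair of log-likelihoods; the function names and the log-likelihood values are hypothetical.

```python
from math import exp, log

def chi2_even_df_sf(x, n):
    """Upper tail probability of a chi-square with 2n degrees of freedom."""
    if x <= 0.0:
        return 1.0
    half = x / 2.0
    term = 1.0   # (x/2)^k / k! for k = 0
    total = 1.0
    for k in range(1, n):
        term *= half / k
        total += term
    return exp(-half) * total

def lrt_pvalue(loglik_full, loglik_reduced, n_snps):
    """p-value of the likelihood ratio test with 2 * n_snps df."""
    lrt = 2.0 * (loglik_full - loglik_reduced)
    return chi2_even_df_sf(lrt, n_snps)

# Hypothetical log-likelihoods chosen so that LRT = 2*ln(20) with one SNP,
# which gives a p-value of exactly 1/20.
p_one_snp = lrt_pvalue(-100.0, -100.0 - log(20.0), 1)
# 9.4877 is the 95th percentile of chi2(4), i.e. two SNPs.
p_two_snps = chi2_even_df_sf(9.487729036781154, 2)
print(p_one_snp, p_two_snps)
```

The likelihoods must come from ML rather than REML fits here, as the slide indicates, because the two models differ in their fixed effects.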

  20. Results
      Data: all expressed genes (25527 genes)
      • Step I: adjusted q-value < 0.0005 → 20979 genes
      • Step II: adjusted qa-value < 0.01 → 1328 genes
      • Step III: q-value < 0.01 → 972 genes
      • Step IV: q-value < 0.01 → 859 genes

  21. Results
      • Among all 25527 genes, 20979 have significant genotypic variation (q-value < 0.0005). (Step I)
      • Among these 20979 genes, 1328 show no trans-regulatory effect (qa-value < 0.01). (Step II)
      • Among these 1328 genes, 972 show significantly different allelic expression (q-value < 0.01); these 972 genes are identified as cis-regulated. (Step III)
      • We confirm these 972 cis-regulated genes in Step IV:
        - an allelic expression difference caused by a cis-regulatory variant implies a nearby polymorphism (SNP), in LD, that controls expression;
        - indeed, 96.5% of the selected cis-regulated genes have associated polymorphisms (haplotype blocks) nearby.

  22. Conclusions
      • The mixed-model approach used here for association mapping, with the kinship matrix included, is more appropriate than other recent methods for identifying cis-regulated genes (the p-values are more reliable).
      • The statistical significance at each step is controlled with an error rate suited to that step (FDR or FNR).
      • Using simulation-based p-values when testing differences between random effects increases the power to detect association.
      • A comprehensive analysis of gene expression variation in plant populations has been described.
      • This mixed-model analysis strategy provides a detailed characterization of both the genetic and the positional effects in the genome.
      • This detailed statistical analysis provides a robust and useful framework for future analyses of gene expression variation in large samples.
      • Advanced statistical methods look promising for making interesting discoveries in genetics.

  23. Many thanks for your attention!
