
A Shrinkage Regression Approach to Tackle the HLA Region

Charlotte Vignal. Variable Selection Workshop, Vienna, July 26th 2008.



Presentation Transcript


  1. A Shrinkage Regression Approach to Tackle the HLA Region • Charlotte Vignal • Variable Selection Workshop, Vienna, July 26th 2008

  2. Outline • Overview of the HLA system and the challenge of analysing data from the HLA region • Multivariate association test using a Bayesian-inspired shrinkage regression approach • Application to the rheumatoid arthritis case-control study • Conclusion

  3. The Human Leukocyte Antigen System • A genomic region found in almost all vertebrates, the major histocompatibility complex (MHC); gene composition and arrangement vary between species (see figure) • In humans, the MHC is the HLA system • A set of genes encoding proteins essential to immune response • Major role in histocompatibility and protection against pathogens • [Figure: MHC gene maps for mouse, rat, chimpanzee and human; Kelley et al. Immunogenetics (2005)]

  4. The Challenge • Susceptibility to many complex disorders maps to the HLA region • High degree of correlation within the region hampers the identification of causal variants • Widely used approaches test the effect of one genetic variable at a time • Require methods that allow the detection of (possibly multiple) causal variants among highly correlated data

  5. Multi-SNP Methods Can Be More Powerful than Single-SNP Analyses • Multivariate logistic regression • Problematic when nVars >> nObs • Stepwise procedures can be unstable in the presence of many highly correlated terms • Shrinkage method using Bayesian logistic regression • A variable selection approach • Based on the Least Absolute Shrinkage and Selection Operator (LASSO) approach (Tibshirani 1996) • Fast implementation using the Bayesian Binary Regression (BBR) software, originally written for text-categorisation analysis (Genkin et al. 2004, http://www.stat.rutgers.edu/~madigan/BBR)

  6. Bayesian Logistic Regression for Variable Selection • Each coefficient βj has a Laplace prior distribution with mode 0 and prior variance ν = 2/λ², where λ is the penalty factor • Mode 0 encodes a prior belief of no effect • The prior variance determines the strength of this belief and hence the sparseness of the fitted model • The maximum a posteriori (posterior mode) estimates of βj are often zero or else shrunk towards zero • Terms with non-zero estimates are included in the final model, and treated as significant • The value of the estimate gives a (shrunk) measure of effect size
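
The link between the Laplace prior and the LASSO penalty on slide 6 can be written out explicitly. This is the standard textbook derivation rather than a formula taken from the slides; the symbols $y_i$, $x_i$, $n$, $p$ and the intercept $\beta_0$ are notation assumed here for illustration.

```latex
% Laplace prior with mode 0 and variance \nu = 2/\lambda^2
p(\beta_j \mid \lambda) = \frac{\lambda}{2}\,\exp\bigl(-\lambda\,\lvert\beta_j\rvert\bigr),
\qquad
\operatorname{Var}(\beta_j) = \frac{2}{\lambda^{2}}

% Logistic log-likelihood for case-control status y_i \in \{0,1\}
\ell(\beta_0,\beta) = \sum_{i=1}^{n}\Bigl[\,y_i\,\eta_i - \log\bigl(1 + e^{\eta_i}\bigr)\Bigr],
\qquad
\eta_i = \beta_0 + x_i^{\top}\beta

% Posterior mode (MAP) = L1-penalised maximum likelihood
\hat{\beta}
= \arg\max_{\beta_0,\,\beta}\;\Bigl[\,\ell(\beta_0,\beta) + \sum_{j=1}^{p}\log p(\beta_j \mid \lambda)\Bigr]
= \arg\max_{\beta_0,\,\beta}\;\Bigl[\,\ell(\beta_0,\beta) - \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert\Bigr]
```

The constant $p\log(\lambda/2)$ does not depend on $\beta$ and drops out of the maximisation, which is why a larger λ (a smaller prior variance) gives a sparser fitted model.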

  7. The Density of the Laplace Distribution • [Figure: Laplace density p(x) plotted against x] • Note: effect size estimates are biased towards zero; over-shrinking true effects can lead to non-causal correlated variables being retained
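
As a small numerical illustration of the density on slide 7, the sketch below evaluates a Laplace density with mode 0 and checks by brute-force integration that its variance is 2/λ². The function names and the integration settings are assumptions made for this example, not part of the original analysis.

```python
import math


def laplace_pdf(x, lam):
    """Laplace (double-exponential) density with mode 0 and rate lam."""
    return 0.5 * lam * math.exp(-lam * abs(x))


def prior_variance(lam):
    """Closed-form variance of the density above: 2 / lam**2."""
    return 2.0 / lam ** 2


def numeric_variance(lam, half_width=1.0, steps=200_000):
    """Midpoint-rule estimate of the variance over [-half_width, half_width].

    half_width should be many multiples of 1/lam so that the truncated
    tails are negligible (an assumption of this sketch).
    """
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps):
        x = -half_width + (k + 0.5) * h
        total += x * x * laplace_pdf(x, lam) * h
    return total
```

With the penalty used later in the talk (λ = 62) the prior is tightly concentrated around zero, which is what drives most coefficients to exactly zero at the posterior mode.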

  8. Application: The Rheumatoid Arthritis Dataset • RA is an autoimmune disease and a complex disorder • Estimated genetic contribution of ~30-50% • The HLA region is strongly implicated in RA susceptibility • Genetic associations reported with a biomarker called the shared epitope (SE), defined by a class of alleles at HLA-DRB1 • The mechanism by which RA is determined is still unknown • Is the SE association the only HLA effect predisposing to RA? • The subjects: 842 RA cases and 957 controls (but only the 774 cases and 945 controls with no missing data were analysed) • The independent variables: • 2,302 genetic markers, each a continuous variable coded as 0, 1 or 2 based on the number of allele copies • The shared epitope, a continuous variable coded as 0, 1 or 2 based on the number of shared epitope positive (SE+) alleles
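
The additive 0/1/2 coding and the complete-case restriction described on slide 8 could look roughly like this. The genotype representation (a pair of allele labels, with `None` marking a missing call) and both function names are assumptions for illustration only.

```python
def additive_code(genotype, allele):
    """Count copies of `allele` in a two-allele genotype: 0, 1 or 2.

    Returns None when the genotype call is missing.
    """
    if genotype is None:
        return None
    return sum(1 for a in genotype if a == allele)


def complete_cases(records):
    """Keep only subjects (rows of coded values) with no missing entry."""
    return [row for row in records if all(v is not None for v in row)]
```

For example, a heterozygous genotype `("A", "G")` is coded 1 for allele `"A"`, and any subject with a missing value in any column is dropped before model fitting.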

  9. The Effect of Shared Epitope on RA • The presence of SE is strongly associated with RA • Increasing risk for RA associated with the number of SE+ allele copies • The objective: to investigate the presence of additional causal variants in the HLA region, possibly correlated with SE

  10. Specification of the Penalty λ • Cases and controls permuted 100 times for each λ within each SE group (i.e. SE effect retained) • SE (additive term) included in each model • λ selected if false positives per model < 1 • λ = 62 was selected for further analyses
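
The permutation scheme on slide 10 shuffles case-control labels only within each SE group, so the SE effect is preserved while all other associations are broken. A minimal sketch follows; `count_selected` is a hypothetical callback standing in for "fit the model at a given λ and count the variables selected besides SE", and is not a function from BBR.

```python
import random


def permute_within_strata(labels, strata, rng):
    """Shuffle labels separately inside each stratum (here: each SE group)."""
    permuted = list(labels)
    groups = {}
    for idx, s in enumerate(strata):
        groups.setdefault(s, []).append(idx)
    for indices in groups.values():
        shuffled = [labels[i] for i in indices]
        rng.shuffle(shuffled)
        for i, lab in zip(indices, shuffled):
            permuted[i] = lab
    return permuted


def mean_false_positives(count_selected, labels, strata, n_perm=100, seed=0):
    """Average number of variables selected under stratified permutation.

    count_selected(permuted_labels) is a hypothetical user-supplied function
    that refits the model and returns how many non-SE variables it selected.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(n_perm):
        total += count_selected(permute_within_strata(labels, strata, rng))
    return total / n_perm
```

Running `mean_false_positives` over a grid of candidate λ values and keeping a value whose average meets the target is one way to reproduce the selection rule described above.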

  11. The Effect of Shrinking a True Effect • [Figure: R² between each genetic variable and SE across the HLA region] • In blue are the genetic variables selected by BLR in addition to SE • Three of the selected variables are correlated with SE • Shrinking a known effect may cause correlated SNPs to be selected
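
The R² plotted on slide 11 is assumed here to be the squared Pearson correlation between a marker's 0/1/2 coding and the SE coding; the slides do not state the exact definition. Under that assumption it can be computed as:

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equally long numeric vectors.

    Returns 0.0 when either vector has no variation.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)
```

Computing this for every marker against SE gives the profile across the region against which the selected variables can be compared.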

  12. The Effect of Shrinkage on True Effects • To investigate the effect of shrinkage, SE was included twice (SE & SEfake) in the model: • When SE and SEfake are shrunk, both variables are retained • Shrinking a known effect may cause correlated SNPs to be selected • When SE is not shrunk, only SE is retained • Correlated SNPs could be eliminated • The shrinkage factor was not applied to SE in subsequent analyses (λ = 0)
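
Leaving SE unpenalised while shrinking every other coefficient amounts to giving each variable its own penalty, with the SE penalty set to 0. The sketch below does this with a plain proximal-gradient (ISTA) loop. It is an illustration only: BBR uses its own optimiser, and the function names, step size and iteration count here are assumptions.

```python
import math


def soft_threshold(z, t):
    """Proximal operator of t * |z|: shrink z towards zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_l1_logistic(X, y, penalties, step=0.01, n_iter=2000):
    """Minimise  -loglik(beta0, beta) + sum_j penalties[j] * |beta_j|.

    X: list of rows, y: list of 0/1 labels, penalties: one value per column.
    A penalty of 0 leaves that coefficient unshrunk; the intercept is never
    penalised. `step` must be small enough for the data at hand.
    """
    p = len(X[0])
    beta0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Residuals: fitted probability minus observed label.
        resid = [
            sigmoid(beta0 + sum(b * v for b, v in zip(beta, row))) - yi
            for row, yi in zip(X, y)
        ]
        grad0 = sum(resid)
        grad = [sum(r * row[j] for r, row in zip(resid, X)) for j in range(p)]
        beta0 -= step * grad0
        # Gradient step followed by per-coefficient soft-thresholding.
        beta = [
            soft_threshold(b - step * g, step * lam)
            for b, g, lam in zip(beta, grad, penalties)
        ]
    return beta0, beta
```

Calling `fit_l1_logistic(X, y, penalties=[0.0] + [62.0] * (p - 1))` with SE in the first column would mirror the configuration described on the slide, with SE entering at full strength and the remaining variables shrunk.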

  13. BLR and Correlated Data • Can the BLR approach distinguish positive effects from spurious associations in the presence of correlation? • 4 variables correlated with SE were used to evaluate error rates and power • Records of each variable were re-distributed among cases and controls to achieve different odds ratios (OR) while maintaining correlation with SE • Error rate and power assessed by permuting cases and controls • Error rate: frequency of the variables selected beyond SE & the simulated correlated variables over 100 permutations • Power: frequency of selection of the simulated variables over 100 permutations
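
Given the set of variables selected in each permutation replicate, the power and error-rate summaries defined on slide 13 reduce to simple counting. The data layout (one set of variable names per replicate) and the function names are assumptions for illustration.

```python
def selection_frequency(selected_sets, variables):
    """Fraction of replicates in which each listed variable was selected."""
    n = len(selected_sets)
    return {v: sum(1 for s in selected_sets if v in s) / n for v in variables}


def false_positives_per_analysis(selected_sets, expected):
    """Mean number of selected variables outside `expected`.

    `expected` holds SE and the simulated (truly associated) variables.
    """
    expected = set(expected)
    extra = sum(len(set(s) - expected) for s in selected_sets)
    return extra / len(selected_sets)
```

Power for a simulated variable is then its entry in `selection_frequency`, and the error rate is the value returned by `false_positives_per_analysis`.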

  14. Power • Selection of simulated variables correlated with SE • Variables moderately correlated with SE selected if OR > 2 • Variables highly correlated with SE selected if OR > 5

  15. Error Rate • Selection of simulated variables correlated with SE • Under the null, expect 1 false positive per analysis (λ = 62) • The analysis generates 1 to 2 false positives

  16. ATT vs. BLR Results Comparison • Data were analysed by Armitage Trend Test (ATT) and BLR • With λ = 62, BLR identified 10 SNPs • Single-point analysis using ATT identified 109 associated SNPs at α = 4.34e-04 = 1/2302 • Variables selected by BLR are not correlated with SE
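
For reference, the single-point comparison on slide 16 can be sketched as a Cochran-Armitage trend test with additive weights 0, 1, 2, compared against the per-test threshold 1/2302. This uses the common form of the statistic with a factor of N; the slides do not say which variant or software was used, so treat the details as assumptions.

```python
import math


def armitage_trend_test(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend test for a 2 x 3 genotype table.

    cases / controls: counts per genotype class (0, 1, 2 allele copies).
    Returns (chi-square statistic with 1 df, p-value).
    """
    totals = [a + b for a, b in zip(cases, controls)]
    n_all = sum(totals)
    n_cases = sum(cases)
    swr = sum(w * r for w, r in zip(weights, cases))
    swn = sum(w * t for w, t in zip(weights, totals))
    sw2n = sum(w * w * t for w, t in zip(weights, totals))
    denom = n_cases * (n_all - n_cases) * (n_all * sw2n - swn ** 2)
    if denom == 0:
        return 0.0, 1.0  # no variation in genotype or in status
    chi2 = n_all * (n_all * swr - n_cases * swn) ** 2 / denom
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Per-test threshold quoted on the slide: one expected false positive
# across 2,302 markers.
ALPHA = 1.0 / 2302
```

A marker is declared associated when its p-value falls below `ALPHA`; because each marker is tested on its own, markers that merely correlate with a causal variant also pass, which is consistent with the much longer list reported for ATT.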

  17. Additional Analysis: The NEG Distribution • Data re-analysed using the normal-exponential-gamma (NEG) prior with parameters set to expect 1 false positive per model (Hoggart et al. PLoS (2008)) • Note: NEG has heavier tails to allow sparser solutions

  18. Additional Analysis: The NEG Distribution • NEG identified 4 variables, of which three (snp271, snp384, snp545) were also retained by the double-exponential (DE, i.e. Laplace) prior • Variables identified with the NEG prior are less correlated among themselves and with SE than those selected using DE • Three of the selected variables are in genes/regions reported to contribute to RA susceptibility: BAT1 and HLA-DQA1/DQB1

  19. Conclusions • BLR appears to perform better than single-point association analysis (ATT) when data are correlated • Computationally efficient • Identifies fewer positive results (10 vs. 109) • Correlation might be more effectively handled • Simulation analyses confirm reasonable power and error rate • Three variables identified by both DE and NEG priors lie in genes previously implicated in RA • Results suggest the presence of independent RA-associated effects in the HLA region

  20. Acknowledgements • David Balding, Imperial College, UK • Clive Hoggart, Imperial College, UK • Aruna Bansal, GSK, UK • The Genetics Division at GSK
